CharXiv Reasoning
Can the model correctly interpret complex scientific charts from research papers and reason about them?
What does it measure?
CharXiv Reasoning tests whether multimodal AI models understand scientific charts from arXiv papers. The benchmark contains 2,323 real charts with two kinds of questions: descriptive questions about basic chart elements, and reasoning questions that require combining information from multiple visual elements. CharXiv Reasoning, abbreviated CharXiv-R, covers only the harder reasoning questions.
How to read the score
The score is the percentage of reasoning questions answered correctly, so higher is better. Human performance is 80.5 percent, which shows that the questions are challenging for people too.
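As a rough illustration of how such a score could be computed and compared with the human baseline, here is a minimal sketch. The function name `reasoning_accuracy` and the graded answers are hypothetical and made up for this example; this is not the benchmark's official evaluation code, which also has to decide whether each free-form answer counts as correct.

```python
# Human accuracy on the reasoning questions, in percent (figure quoted above).
HUMAN_BASELINE = 80.5


def reasoning_accuracy(is_correct: list[bool]) -> float:
    """Return the percentage of questions that were answered correctly."""
    if not is_correct:
        raise ValueError("no graded answers to score")
    # True counts as 1 and False as 0, so sum() gives the number correct.
    return 100.0 * sum(is_correct) / len(is_correct)


# Hypothetical grading results for four reasoning questions.
graded = [True, False, True, True]

score = reasoning_accuracy(graded)
gap = score - HUMAN_BASELINE  # negative means below human performance

print(f"{score:.1f}% ({gap:+.1f} points vs. human baseline)")
```

A negative gap means the model is still below human performance on these questions.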
Example task
Look at this chart from a climate paper. Which decade shows the largest divergence between the two datasets, and what explains that difference based on the legend?
What to watch out for
CharXiv contains only English-language scientific charts from arXiv. Charts from corporate presentations, dashboards, or other sources are not tested. The benchmark is static and is not updated with new papers.