Multimodal

CharXiv Reasoning

Can the model correctly interpret complex scientific charts from research papers and reason about them?

What does it measure?

CharXiv Reasoning (CharXiv-R) tests whether multimodal AI models can understand scientific charts taken from arXiv papers. The full CharXiv benchmark contains 2,323 real charts with two kinds of questions: descriptive questions about basic chart elements, and reasoning questions that require combining information from multiple visual elements. CharXiv-R covers only the harder reasoning questions.

How to read the score

The score is the percentage of reasoning questions answered correctly. Human performance is 80.5 percent, so the questions are challenging for people too.
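
As a rough illustration of how such a score is read, the sketch below computes a percentage from a list of already-graded answers and compares it with the human figure. The function name, the variable names, and the sample data are hypothetical, and the sketch assumes each answer has already been marked correct or incorrect; it does not reproduce the benchmark's actual grading procedure.

```python
HUMAN_BASELINE = 80.5  # human performance in percent, as quoted in the text


def reasoning_accuracy(graded):
    """Return the percentage of graded answers that are correct.

    `graded` is a list of booleans, one per reasoning question
    (hypothetical input format, assumed for this sketch).
    """
    if not graded:
        raise ValueError("no graded answers")
    return 100.0 * sum(graded) / len(graded)


# Made-up example: 3 of 4 answers correct.
graded = [True, False, True, True]
score = reasoning_accuracy(graded)
gap = score - HUMAN_BASELINE  # negative means below the human figure
print(f"score: {score:.1f}%, gap to human baseline: {gap:+.1f} points")
```

A model scoring below 80.5 percent is still answering fewer reasoning questions correctly than the human reference.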

Example task

Look at this chart from a climate paper. Which decade shows the largest divergence between the two datasets, and what explains that difference based on the legend?

What to watch out for

CharXiv contains only English-language scientific charts from arXiv. Charts from corporate presentations, dashboards, or other sources are not tested. The benchmark is static and is not updated with new papers.

