Benchmarks

What does a score on SWE-bench mean? Or on MMLU-Pro? For each benchmark, this page covers what it measures, how to read the score, an example task, who currently leads, and what the pitfalls are.

11 benchmarks · Updated monthly · Scores and sources

Overview

Benchmark | What it tests | Category | Current leader | Score | As of
CharXiv Reasoning | Can the model correctly interpret complex scientific charts from research papers and reason about them? | Multimodal | Gemini 3.5 Flash | 84.2% | May 2026
MCP Atlas | Can the model find, combine and drive the right tools through the Model Context Protocol to solve a realistic task? | Agentic | Gemini 3.5 Flash | 83.6% | May 2026
Terminal-Bench 2.1 | Can the model complete complex tasks in a terminal environment on its own, from compiling code to setting up servers? | Agentic | Gemini 3.5 Flash | 76.2% | May 2026
SWE-bench Verified | Can the model fix real bugs from open-source GitHub projects? | Code | Claude Mythos Preview | ~93.9% | April 2026
HumanEval | Can the model write a Python function from a short description? | Code | Claude Sonnet 4.5 | ~97.6% | April 2026
LiveCodeBench | A code benchmark that adds new tasks every month to avoid training leaks. | Code | Gemini 3 Pro Preview | ~91.7% | April 2026
Aider Polyglot | Can the model write and patch code across six different programming languages? | Code | GPT-5 | ~88% | April 2026
MMLU-Pro | Multiple-choice knowledge test across 14 domains, with 10 answer options per question. | Knowledge | Gemini 3 Pro Preview | ~89.8% | April 2026
GPQA Diamond | 198 PhD-level biology, physics and chemistry questions you cannot Google. | Knowledge | Gemini 3.1 Pro Preview | ~94.1% | April 2026
AIME (2024/2025) | A US high-school invitational math competition, now a frontier test for AI. | Math | GPT-5 (AIME 2024) | ~95.7% | April 2026
Humanity's Last Exam | Nearly 3,000 expert questions across almost every field of knowledge, meant as a final challenge for AI. | Reasoning | Claude Opus 4.8 | 57.9% | May 2026

By category

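Multimodal: CharXiv Reasoning

Agentic: MCP Atlas, Terminal-Bench 2.1

Code: SWE-bench Verified, HumanEval, LiveCodeBench, Aider Polyglot

Knowledge: MMLU-Pro, GPQA Diamond

Math: AIME (2024/2025)

Reasoning: Humanity's Last Exam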
