Benchmarks

What does a score on SWE-bench mean? Or on MMLU-Pro? For each benchmark, this page covers what it measures, how to read the score, an example task, who currently leads, and what the pitfalls are.

11 benchmarks · Updated monthly · Scores and sources

Overview

Benchmark	What it tests	Category	Current leader	Score	As of
CharXiv Reasoning	Can the model correctly interpret complex scientific charts from research papers and reason about them?	Multimodal	Gemini 3.5 Flash	84.2%	May 2026
MCP Atlas	Can the model find, combine and drive the right tools through the Model Context Protocol to solve a realistic task?	Agentic	Gemini 3.5 Flash	83.6%	May 2026
Terminal-Bench 2.1	Can the model complete complex tasks in a terminal environment on its own, from compiling code to setting up servers?	Agentic	Gemini 3.5 Flash	76.2%	May 2026
SWE-bench Verified	Can the model fix real bugs from open-source GitHub projects?	Code	Claude Mythos Preview	~93.9%	April 2026
HumanEval	Can the model write a Python function from a short description?	Code	Claude Sonnet 4.5	~97.6%	April 2026
LiveCodeBench	A code benchmark that adds new tasks every month to avoid training leaks.	Code	Gemini 3 Pro Preview	~91.7%	April 2026
Aider Polyglot	Can the model write and patch code across six different programming languages?	Code	GPT-5	~88%	April 2026
MMLU-Pro	Multiple-choice knowledge test across 14 domains, with 10 answer options per question.	Knowledge	Gemini 3 Pro Preview	~89.8%	April 2026
GPQA Diamond	198 PhD-level biology, physics and chemistry questions you cannot Google.	Knowledge	Gemini 3.1 Pro Preview	~94.1%	April 2026
AIME (2024/2025)	A US high-school math olympiad, now a frontier test for AI.	Math	GPT-5 (AIME 2024)	~95.7%	April 2026
Humanity's Last Exam	Nearly 3,000 expert questions across almost every field of knowledge, meant as a final challenge for AI.	Reasoning	Claude Opus 4.8	57.9%	May 2026

By category

Multimodal

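CharXiv Reasoning: Can the model correctly interpret complex scientific charts from research papers and reason about them?

Agentic

MCP Atlas: Can the model find, combine and drive the right tools through the Model Context Protocol to solve a realistic task?
Terminal-Bench 2.1: Can the model complete complex tasks in a terminal environment on its own, from compiling code to setting up servers?

Code

SWE-bench Verified: Can the model fix real bugs from open-source GitHub projects?
HumanEval: Can the model write a Python function from a short description?
LiveCodeBench: A code benchmark that adds new tasks every month to avoid training leaks.
Aider Polyglot: Can the model write and patch code across six different programming languages?

Knowledge

MMLU-Pro: Multiple-choice knowledge test across 14 domains, with 10 answer options per question.
GPQA Diamond: 198 PhD-level biology, physics and chemistry questions you cannot Google.

Math

AIME (2024/2025): A US high-school math olympiad, now a frontier test for AI.

Reasoning

Humanity's Last Exam: Nearly 3,000 expert questions across almost every field of knowledge, meant as a final challenge for AI.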
