Overview
| Benchmark | What it tests | Category | Current leader | Score | As of |
|---|---|---|---|---|---|
| CharXiv Reasoning | Can the model correctly interpret complex scientific charts from research papers and reason about them? | Multimodal | Gemini 3.5 Flash | 84.2% | May 2026 |
| MCP Atlas | Can the model find, combine and drive the right tools through the Model Context Protocol to solve a realistic task? | Agentic | Gemini 3.5 Flash | 83.6% | May 2026 |
| Terminal-Bench 2.1 | Can the model complete complex tasks in a terminal environment on its own, from compiling code to setting up servers? | Agentic | Gemini 3.5 Flash | 76.2% | May 2026 |
| SWE-bench Verified | Can the model fix real bugs from open-source GitHub projects? | Code | Claude Mythos Preview | ~93.9% | April 2026 |
| HumanEval | Can the model write a Python function from a short description? | Code | Claude Sonnet 4.5 | ~97.6% | April 2026 |
| LiveCodeBench | A coding benchmark that adds new tasks every month to avoid training-data leakage. | Code | Gemini 3 Pro Preview | ~91.7% | April 2026 |
| Aider Polyglot | Can the model write and patch code across six different programming languages? | Code | GPT-5 | ~88% | April 2026 |
| MMLU-Pro | Multiple-choice knowledge test across 14 domains, with 10 answer options per question. | Knowledge | Gemini 3 Pro Preview | ~89.8% | April 2026 |
| GPQA Diamond | 198 PhD-level biology, physics and chemistry questions designed to be Google-proof. | Knowledge | Gemini 3.1 Pro Preview | ~94.1% | April 2026 |
| AIME (2024/2025) | A US high-school invitational math competition, now a frontier test for AI. | Math | GPT-5 (AIME 2024) | ~95.7% | April 2026 |
| Humanity's Last Exam | Nearly 3,000 expert questions across almost every field of knowledge, meant as a final challenge for AI. | Reasoning | Claude Opus 4.8 | 57.9% | May 2026 |
By category
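Multimodal
- CharXiv Reasoning: Can the model correctly interpret complex scientific charts from research papers and reason about them?
Agentic
- MCP Atlas: Can the model find, combine and drive the right tools through the Model Context Protocol to solve a realistic task?
- Terminal-Bench 2.1: Can the model complete complex tasks in a terminal environment on its own, from compiling code to setting up servers?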
Code
- SWE-bench Verified: Can the model fix real bugs from open-source GitHub projects?
- HumanEval: Can the model write a Python function from a short description?
- LiveCodeBench: A code benchmark that adds new tasks every month to avoid training leaks.
- Aider Polyglot: Can the model write and patch code across six different programming languages?