Knowledge

MMLU-Pro

Multiple-choice knowledge test across 14 domains, with 10 answer options per question.

Top models

# | Model | Provider | Score | Date
1 | Gemini 3 Pro Preview (high) [reasoning] | Google | 89.8% | 2026-04
2 | Gemini 3 Pro Preview (low) | Google | 89.5% | 2026-04
3 | Claude Opus 4.5 (Reasoning) [reasoning] | Anthropic | 89.5% | 2026-03
4 | Qwen3.6 Plus | Alibaba | 88.5% | 2026-04
5 | MiniMax M2.1 | MiniMax | 88% | 2026-04
6 | Qwen3.5-397B-A17B | Alibaba | 87.8% | 2026-04
7 | Kimi K2.5 | Moonshot AI | 87.1% | 2026-04
8 | ERNIE 5.0 | Baidu | 87% | 2026-04

What does it measure?

MMLU-Pro is a 2024 revision of the well-known MMLU test. It contains roughly 12,000 multiple-choice questions across 14 domains, including biology, law, mathematics, philosophy, health and engineering. Each question has 10 answer options (instead of 4 in MMLU), and the set was manually filtered to remove trivial or leaked questions.

The goal: a knowledge test that cannot be cracked by guessing or shallow pattern matching and instead requires real reasoning.

How to read the score

The score is the percentage of questions answered correctly.

  • Random guessing: 10% (ten answer options).
  • Human expert baseline: ~78% with chain-of-thought.
  • Current top: ~90%. The top three are within 0.3 percentage points of each other, so the benchmark is practically saturated at the top.
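As a rough illustration of how such a score is computed, the sketch below calculates plain accuracy and, as an optional extra, a chance-corrected value that rescales the 10% guessing floor to zero. The function names and the chance correction are assumptions made for this example; they are not part of the official MMLU-Pro scoring, which reports plain accuracy.

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted letter matches the answer key."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_corrected(acc, num_options=10):
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 1.

    Hypothetical helper for illustration only; not an official MMLU-Pro metric.
    """
    chance = 1 / num_options
    return (acc - chance) / (1 - chance)


# Toy data: three of four answers match the key.
preds = ["A", "G", "C", "D"]
key = ["A", "G", "C", "J"]
score = accuracy(preds, key)
print(f"accuracy: {score:.1%}")
print(f"chance-corrected: {chance_corrected(score):.3f}")
```

Under this rescaling, a reported 89.8% corresponds to about 0.887, which shows how little the guessing floor matters near the top of the table.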

Example task

Example (mathematics):

"Find the characteristic of the ring 2ℤ."
A. 0 · B. 30 · C. 3 · D. 10 · E. 12 · F. 50 · G. 2 · H. 100 · I. 20 · J. 5

Correct answer: A (0). There is no positive integer n for which n·x = 0 for all x in 2ℤ.
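The reasoning behind this answer can be written out in one step. The block below is a short worked sketch using the standard definition of the characteristic for rings that need not contain a multiplicative identity; the symbol R is introduced here only for the example.

```latex
\[
\operatorname{char}(R) =
\begin{cases}
\min\{\, n \in \mathbb{Z}_{>0} : n \cdot x = 0 \ \text{for all } x \in R \,\} & \text{if such an } n \text{ exists},\\
0 & \text{otherwise.}
\end{cases}
\]
\[
R = 2\mathbb{Z},\ x = 2:\qquad n \cdot 2 = 2n \ge 2 > 0 \quad \text{for every } n \ge 1
\;\Longrightarrow\; \operatorname{char}(2\mathbb{Z}) = 0 .
\]
```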

What to watch out for

  • Saturated. Top models sit above the human expert baseline, so the benchmark is less useful for separating frontier models. Use GPQA Diamond or HLE for genuinely hard questions.
  • Self-reported vs. independent. The figures in model-release posts are almost always produced by the model's maker. Independent reruns (for example by Artificial Analysis) can differ by 2 to 5 points.
  • Skewed domain split. Mathematics and law are over-represented, so an overall average can mask weakness in a smaller domain.
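To make the domain-split caveat concrete, the sketch below compares the overall (micro) average with the unweighted mean of per-domain accuracies (macro average). The data, the domain sizes, and the function name are invented for illustration and do not reflect real MMLU-Pro results.

```python
from collections import defaultdict


def micro_and_macro(results):
    """results: iterable of (domain, is_correct) pairs.

    Returns (micro average, macro average, per-domain accuracy dict).
    """
    per_domain = defaultdict(lambda: [0, 0])  # domain -> [correct, total]
    for domain, is_correct in results:
        per_domain[domain][0] += int(is_correct)
        per_domain[domain][1] += 1
    if not per_domain:
        raise ValueError("results must not be empty")

    domain_acc = {d: c / t for d, (c, t) in per_domain.items()}
    total = sum(t for _, t in per_domain.values())
    micro = sum(c for c, _ in per_domain.values()) / total  # every question weighs the same
    macro = sum(domain_acc.values()) / len(domain_acc)      # every domain weighs the same
    return micro, macro, domain_acc


# Hypothetical run: a large domain answered well, a small one answered poorly.
results = (
    [("math", True)] * 90 + [("math", False)] * 10
    + [("history", True)] * 10 + [("history", False)] * 10
)
micro, macro, by_domain = micro_and_macro(results)
print(f"micro: {micro:.1%}, macro: {macro:.1%}")
print(by_domain)
```

In this toy case the overall score is dominated by the larger domain, while the per-domain breakdown exposes the weak one. Reporting both numbers, or the full per-domain table, is one way to avoid being misled by a single average.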
