Knowledge

MMLU-Pro

Multiple-choice knowledge test across 14 domains, with 10 answer options per question.

Top models

#	Model	Provider	Score	Date
1	Gemini 3 Pro Preview (high)	Google	89.8%	2026-04
2	Gemini 3 Pro Preview (low)	Google	89.5%	2026-04
3	Claude Opus 4.5 (Reasoning)	Anthropic	89.5%	2026-03
4	Qwen3.6 Plus	Alibaba	88.5%	2026-04
5	MiniMax M2.1	MiniMax	88%	2026-04
6	Qwen3.5-397B-A17B	Alibaba	87.8%	2026-04
7	Kimi K2.5	Moonshot AI	87.1%	2026-04
8	ERNIE 5.0	Baidu	87%	2026-04

What does it measure?

MMLU-Pro is a 2024 revision of the well-known MMLU test. It contains roughly 12,000 multiple-choice questions across 14 domains, including biology, law, mathematics, philosophy, medicine and engineering. Each question has 10 answer options (instead of the 4 in MMLU), and the set was manually filtered to remove trivial or leaked questions.

The goal: a knowledge test that cannot be cracked by guessing or shallow pattern matching and instead requires real reasoning.

How to read the score

The score is the percentage of questions answered correctly.

Random guessing: 10% (ten answer options).
Human expert baseline: ~78% with chain-of-thought.
Current top: ~90%. The top three models are within one percentage point of each other, so the benchmark is practically saturated.
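
To make these reference points concrete, here is a minimal Python sketch of how such a score could be computed and how much sampling noise to expect. The function names and the toy data are illustrative assumptions, not part of any official MMLU-Pro harness, and the standard error uses a simple binomial model that treats questions as independent.

```python
import math

def accuracy(predictions, answers):
    """Fraction of questions where the predicted letter matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def chance_baseline(num_options):
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_options

def standard_error(score, num_questions):
    """Binomial standard error of an accuracy estimate (assumes independent questions)."""
    return math.sqrt(score * (1.0 - score) / num_questions)

# Toy data, invented for illustration only.
preds = ["A", "C", "J", "B"]
keys = ["A", "C", "G", "B"]

print(f"accuracy: {accuracy(preds, keys):.0%}")
print(f"chance with 10 options: {chance_baseline(10):.0%}")

# Roughly 12,000 questions at a score of about 90%, as quoted above.
se = standard_error(0.90, 12_000)
print(f"standard error: {se * 100:.2f} points")
```

Under these assumptions the standard error comes out at roughly 0.27 percentage points, which is about the size of the gap between 89.8% and 89.5%. A difference that small is hard to distinguish from sampling noise.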

Example task

Example (mathematics):

"Find the characteristic of the ring 2ℤ."
A. 0 · B. 30 · C. 3 · D. 10 · E. 12 · F. 50 · G. 2 · H. 100 · I. 20 · J. 5

Correct answer: A (0). There is no positive integer n for which n·x = 0 for all x in 2ℤ.
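
The following Python sketch shows one way a ten-option question like this could be rendered as a prompt and how a model's final choice could be pulled out of a free-text response. The helper names, the "the answer is (X)" convention and the sample response are assumptions made for illustration. Real evaluation harnesses may format prompts and parse answers differently.

```python
import re
import string

def format_question(question, options):
    """Render a question followed by lettered options (A, B, C, ...)."""
    letters = string.ascii_uppercase[:len(options)]
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    return "\n".join(lines)

# Assumed convention: the model ends its reasoning with "the answer is (X)".
ANSWER_PATTERN = re.compile(r"answer is \(?([A-J])\b")

def extract_choice(response):
    """Return the last option letter the response commits to, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None

question = "Find the characteristic of the ring 2ℤ."
options = ["0", "30", "3", "10", "12", "50", "2", "100", "20", "5"]

# Hypothetical model output, written by hand for this example.
response = "No positive n gives n·x = 0 for every x in 2ℤ, so the answer is (A)."

print(format_question(question, options))
print("extracted:", extract_choice(response))
print("correct:", extract_choice(response) == "A")
```

Taking the last match rather than the first is a deliberate choice in this sketch: a chain-of-thought response may mention several candidate letters before settling on one.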

What to watch out for

Saturated. Top models sit above the human expert baseline, so the benchmark is less useful for separating frontier models. Use GPQA Diamond or HLE for genuinely hard questions.
Self-reported vs. independent. The figures in model-release posts almost always come from evaluations run by the model's maker. Independent reruns (for example, by Artificial Analysis) can differ by 2 to 5 points.
Skewed domain split. Mathematics and law are over-represented. An average score can mask weakness in a domain.
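
The domain-split caveat can be illustrated with a short Python sketch that compares the overall score with per-domain scores. The domain names and counts below are invented toy numbers, not real MMLU-Pro results, and the function name is hypothetical.

```python
from collections import defaultdict

def per_domain_accuracy(results):
    """Compute accuracy per domain from (domain, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for domain, is_correct in results:
        totals[domain] += 1
        hits[domain] += int(is_correct)
    return {domain: hits[domain] / totals[domain] for domain in totals}

# Toy data: a large, strong domain and a small, weak one.
results = (
    [("mathematics", True)] * 95 + [("mathematics", False)] * 5
    + [("history", True)] * 12 + [("history", False)] * 8
)

by_domain = per_domain_accuracy(results)
overall = sum(ok for _, ok in results) / len(results)  # every question weighted equally
macro = sum(by_domain.values()) / len(by_domain)       # every domain weighted equally

for domain, score in by_domain.items():
    print(f"{domain}: {score:.1%}")
print(f"overall: {overall:.1%}")
print(f"macro average: {macro:.1%}")
```

In this toy case the overall score is about 89% even though the smaller domain sits at 60%. Reporting the per-domain breakdown, or a macro average alongside the headline number, makes such a gap visible.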
