Reasoning

Humanity's Last Exam

Nearly 3,000 expert questions across almost every field of knowledge, meant as a final challenge for AI.

Top models

#	Model	Provider	Score	Date
1	Claude Mythos 5 reasoning	Anthropic	64.5%	2026-06-09
2	Claude Fable 5	Anthropic	59%	2026-06-09
3	Claude Opus 4.8	Anthropic	57.9%	2026-05
4	Gemini 3.1 Pro Preview reasoning	Google	44.7%	2026-04
5	GPT-5.4 (xhigh) reasoning	OpenAI	41.6%	2026-03
6	GPT-5.3 Codex (xhigh) reasoning	OpenAI	39.9%	2026-02

What does it measure?

Humanity's Last Exam (HLE) is a collaboration between Scale AI and the Center for AI Safety, launched in early 2025. It contains roughly 3,000 closed questions (multiple choice or exact match) in mathematics (41%), physics, biology, the humanities, computer science, engineering and chemistry.

Questions were written by professors and domain experts. They were chosen specifically because frontier models scored below 10% on them at launch. The name is marketing, but the difficulty is real.

How to read the score

The score is the percentage correct across all domains.

Random guessing: not meaningful on exact-match questions; ~25% on the multiple-choice part.
Domain experts: ~90% within their own field; no single human reaches 90% cross-domain.
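Current top: 64.5% (June 2026, per the table above). At launch (Jan 2025) the best models scored below 10%; by April 2026 the top score was ~45%, more than a fourfold rise in 15 months.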
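
As a rough illustration of the reading guide above, the sketch below computes the headline score as plain accuracy pooled over all questions, with a per-domain breakdown, and estimates what pure guessing would earn. The record layout, the sample data, the `mc_share` value and the four-option assumption behind the ~25% figure are hypothetical and chosen only for illustration; they are not taken from the HLE release.

```python
from collections import defaultdict


def overall_score(results):
    """Percentage of questions answered correctly, pooled across all domains."""
    if not results:
        raise ValueError("no results to score")
    correct = sum(1 for r in results if r["correct"])
    return 100.0 * correct / len(results)


def per_domain_scores(results):
    """Percentage correct within each domain (not part of the headline score)."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [correct, total]
    for r in results:
        totals[r["domain"]][0] += int(r["correct"])
        totals[r["domain"]][1] += 1
    return {d: 100.0 * c / n for d, (c, n) in totals.items()}


def guessing_baseline(mc_share, options_per_question=4):
    """Expected score from random guessing.

    Assumes exact-match questions are never guessed correctly and every
    multiple-choice question has the same number of options (an assumption).
    """
    return 100.0 * mc_share / options_per_question


# Hypothetical graded results, one record per question.
results = [
    {"domain": "math", "correct": True},
    {"domain": "math", "correct": False},
    {"domain": "biology", "correct": False},
    {"domain": "physics", "correct": True},
]

print(overall_score(results))
print(per_domain_scores(results))
print(guessing_baseline(mc_share=0.25))  # mc_share is a made-up value
```

Because the headline number pools all questions, domains with more questions (mathematics, at 41%) weigh more heavily than the per-domain breakdown alone would suggest.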

Example task

Example (biology, verbatim from the paper):

"Hummingbirds within Apodiformes have a unique bilaterally paired oval bone, a sesamoid embedded in the caudolateral part of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone?"

Answer: a specific integer. Requires deep anatomical knowledge of birds.

What to watch out for

Ground-truth errors. FutureHouse reported that ~30% of the chemistry and biology answers are probably wrong or disputable. A model that answers such a question correctly is still marked wrong, so the achievable score is capped below 100%.
"Last"? Not really. Top scores went from below 10% (Jan 2025) to ~45% (Apr 2026), and the table above lists higher scores since. The benchmark's makers expected a much longer shelf life.
Self-reported preview scores. New models are scored by their makers, not by Scale AI. Wait for independent verification via the SEAL leaderboard.
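
The ground-truth point above can be made concrete with a little arithmetic. The sketch below assumes that a flawed answer key always marks a correct response as wrong and that flawed keys occur only in the affected subset. The share of chemistry and biology questions is a hypothetical input, since that share is not stated here; only the ~30% error rate comes from the FutureHouse report mentioned above.

```python
def score_ceiling(subset_share, key_error_rate):
    """Upper bound (in %) on the score of a model that always answers correctly.

    subset_share:   fraction of all questions in the affected subset
                    (hypothetical input, not given in the text)
    key_error_rate: fraction of that subset whose answer key is wrong
    """
    if not (0.0 <= subset_share <= 1.0 and 0.0 <= key_error_rate <= 1.0):
        raise ValueError("both arguments must be fractions between 0 and 1")
    return 100.0 * (1.0 - subset_share * key_error_rate)


# Hypothetical: 20% of questions are chemistry or biology, 30% of those keys are wrong.
ceiling = score_ceiling(subset_share=0.20, key_error_rate=0.30)
print(round(ceiling, 1))
```

Under these made-up inputs the ceiling lands in the mid-90s; the real figure depends on the actual subset share and on whether other domains contain flawed keys as well.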
