Reasoning

Humanity's Last Exam

Nearly 3,000 expert questions across almost every field of knowledge, meant as a final challenge for AI.

Top models

#  Model                   Tag        Provider   Score  Date
1  Claude Mythos 5         reasoning  Anthropic  64.5%  2026-06-09
2  Claude Fable 5          -          Anthropic  59%    2026-06-09
3  Claude Opus 4.8         -          Anthropic  57.9%  2026-05
4  Gemini 3.1 Pro Preview  reasoning  Google     44.7%  2026-04
5  GPT-5.4 (xhigh)         reasoning  OpenAI     41.6%  2026-03
6  GPT-5.3 Codex (xhigh)   reasoning  OpenAI     39.9%  2026-02

What does it measure?

Humanity's Last Exam (HLE) is a collaboration between Scale AI and the Center for AI Safety, launched in early 2025. It contains roughly 3,000 closed-ended questions (multiple choice or exact match) spanning mathematics (41% of questions), physics, biology, the humanities, computer science, engineering and chemistry.

Questions were written by professors and domain experts. They were kept only if the frontier models of the time could not answer them, which is why those models scored below 10% at launch. The name is marketing, but the difficulty is real.

How to read the score

The score is the percentage correct across all domains.

  • Random guessing: not meaningful on exact-match questions; ~25% on the multiple-choice portion.
  • Domain experts: ~90% within their own field; no single human reaches 90% cross-domain.
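  • Current top: 64.5% (June 2026, see the table above). At launch (Jan 2025) frontier models scored below 10%, and by April 2026 the top score was ~45%, more than a fourfold rise in 15 months.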
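
To make "percentage correct across all domains" concrete, here is a minimal Python sketch of how such a score could be computed. It is an illustration only, not the official HLE grader: the record keys (`domain`, `expected`, `predicted`), the `normalize` helper and the sample data are all hypothetical, and exact string matching after light normalization is an assumption about how closed answers might be compared.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.strip().lower().split())


def score(records):
    """Return (overall_pct, per_domain_pct) for a list of answered questions.

    Each record is a dict with the hypothetical keys
    'domain', 'expected' and 'predicted'.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["domain"]] += 1
        if normalize(r["predicted"]) == normalize(r["expected"]):
            correct[r["domain"]] += 1
    n = sum(totals.values())
    # Headline score: all questions pooled, regardless of domain.
    overall = 100.0 * sum(correct.values()) / n if n else 0.0
    per_domain = {d: 100.0 * correct[d] / totals[d] for d in totals}
    return overall, per_domain


# Made-up sample data, for illustration only.
records = [
    {"domain": "math", "expected": "42", "predicted": " 42 "},
    {"domain": "math", "expected": "B", "predicted": "c"},
    {"domain": "biology", "expected": "2", "predicted": "2"},
    {"domain": "physics", "expected": "A", "predicted": "a"},
]
overall, per_domain = score(records)
print(f"overall: {overall:.1f}%")
for domain, pct in sorted(per_domain.items()):
    print(f"{domain}: {pct:.1f}%")
```

Because the headline number pools all questions, domains with more questions weigh more heavily. With mathematics at 41% of the benchmark, a model's math performance moves the overall score more than any other single field.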

Example task

Example (biology, verbatim from the paper):

"Hummingbirds within Apodiformes have a unique bilaterally paired oval bone, a sesamoid embedded in the caudolateral part of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone?"

Answer: a specific integer. Requires deep anatomical knowledge of birds.

What to watch out for

  • Ground-truth errors. FutureHouse reported that ~30% of the chemistry and biology answers are probably wrong or disputable. A model that gives the correct answer to such a question is still marked wrong, so the maximum achievable score sits below 100%.
  • "Last"? Not really. Top scores went from below 10% (Jan 2025) to ~45% (Apr 2026) and 64.5% (Jun 2026). The benchmark's makers expected a much longer shelf life.
  • Self-reported preview scores. New models are scored by their makers, not by Scale. Wait for independent verification via the SEAL leaderboard.
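
The effect of ground-truth errors on the maximum score can be sketched with simple arithmetic. The Python snippet below is a rough illustration under stated assumptions: a model that always gives the true answer is marked wrong on every question with a faulty key, and errors occur only in the affected domains. The function name `score_ceiling` and the `affected_share` value are hypothetical; the text above does not say what fraction of all questions are chemistry or biology.

```python
def score_ceiling(affected_share: float, key_error_rate: float) -> float:
    """Upper bound (in percent) on the score of a model that always gives the true answer.

    affected_share: fraction of all questions that belong to the affected domains.
    key_error_rate: fraction of those questions whose reference answer is wrong.
    Assumes a truly correct answer is always marked wrong when the key is wrong.
    """
    if not (0.0 <= affected_share <= 1.0 and 0.0 <= key_error_rate <= 1.0):
        raise ValueError("both inputs must be fractions between 0 and 1")
    return 100.0 * (1.0 - affected_share * key_error_rate)


# 0.30 is the error rate quoted above; 0.20 is a made-up placeholder share,
# not a figure from the benchmark.
ceiling = score_ceiling(affected_share=0.20, key_error_rate=0.30)
print(f"ceiling: {ceiling:.1f}%")
```

The real ceiling depends on the true domain shares and on how many disputed answers are actually wrong, so this is an order-of-magnitude guide rather than a measured limit.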
