Humanity's Last Exam
Nearly 3,000 expert questions across almost every field of knowledge, meant as a final challenge for AI.
Top models
| # | Model | Provider | Score | Date |
|---|---|---|---|---|
| 1 | Claude Mythos 5 reasoning | Anthropic | 64.5% | 2026-06-09 |
| 2 | Claude Fable 5 | Anthropic | 59% | 2026-06-09 |
| 3 | Claude Opus 4.8 | Anthropic | 57.9% | 2026-05 |
| 4 | Gemini 3.1 Pro Preview reasoning | Google | 44.7% | 2026-04 |
| 5 | GPT-5.4 (xhigh) reasoning | OpenAI | 41.6% | 2026-03 |
| 6 | GPT-5.3 Codex (xhigh) reasoning | OpenAI | 39.9% | 2026-02 |
What does it measure?
Humanity's Last Exam (HLE) is a collaboration between Scale AI and the Center for AI Safety, launched in early 2025. It contains roughly 3,000 closed-ended questions (multiple choice or exact match) spanning mathematics (41%), physics, biology, the humanities, computer science, engineering and chemistry.
Questions were written by professors and domain experts. They were chosen specifically because frontier models scored below 10% on them at launch. The name is marketing, but the difficulty is real.
How to read the score
The score is the percentage correct across all domains.
- Random guessing: not meaningful on exact-match questions; ~25% on the multiple-choice part.
- Domain experts: ~90% within their own field; no single human reaches 90% cross-domain.
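- Current top: 64.5% as of June 2026 (see the table above). At launch (Jan 2025) the best models scored below 10%; by April 2026 the top score was ~45%, more than a fourfold increase in 15 months.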
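Because the score pools all questions, large domains such as mathematics (41% of the questions) weigh more heavily in the total than small ones. The following is a minimal sketch of that calculation, assuming the headline number is a plain pooled percentage. The function name `hle_style_score` and the toy data are hypothetical and are not taken from the benchmark's own tooling.

```python
from collections import defaultdict


def hle_style_score(results):
    """Return (overall_pct, per_domain_pct) for a list of (domain, is_correct) pairs.

    Assumption: the headline score pools every question (micro-average),
    so domains with more questions contribute more to the total.
    """
    if not results:
        raise ValueError("results must not be empty")

    counts = defaultdict(lambda: [0, 0])  # domain -> [correct, total]
    for domain, is_correct in results:
        counts[domain][0] += int(is_correct)
        counts[domain][1] += 1

    correct = sum(c for c, _ in counts.values())
    total = sum(n for _, n in counts.values())
    overall_pct = 100.0 * correct / total
    per_domain_pct = {d: 100.0 * c / n for d, (c, n) in counts.items()}
    return overall_pct, per_domain_pct


# Hypothetical toy data, not real HLE results.
toy_results = [
    ("math", True), ("math", False), ("math", False),
    ("biology", True),
]
overall, per_domain = hle_style_score(toy_results)
print(f"overall: {overall:.1f}%")
for domain, pct in sorted(per_domain.items()):
    print(f"{domain}: {pct:.1f}%")
```

In this toy case the pooled score is 50%, while the unweighted mean of the two per-domain scores would be about 67%. A single headline number can therefore hide large differences between fields.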
Example task
Example (biology, verbatim from the paper):
"Hummingbirds within Apodiformes have a unique bilaterally paired oval bone, a sesamoid embedded in the caudolateral part of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone?"
Answer: a specific integer. Requires deep anatomical knowledge of birds.
What to watch out for
- Ground-truth errors. FutureHouse reported that ~30% of the chemistry and biology answers are probably wrong or disputable. If so, a model that answers those questions correctly is still marked wrong, which caps the achievable score below 100%.
- "Last"? Not really. Top scores rose from below 10% (Jan 2025) to ~45% (Apr 2026), and the table above lists later results above 60%. The benchmark's makers expected a much longer shelf life.
- Self-reported preview scores. New models are initially scored by their makers, not by Scale AI. Wait for independent verification on the SEAL leaderboard.
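The ceiling implied by the ground-truth errors in the first point can be made concrete with a little arithmetic. This is a rough sketch under two stated assumptions: every question with a wrong answer key is marked incorrect for a model that gives the true answer, and questions outside the affected subset have correct keys. The function `score_ceiling` is hypothetical. The `affected_share` value is a placeholder, since the text above does not say what share of the benchmark is chemistry and biology.

```python
def score_ceiling(affected_share, error_rate):
    """Upper bound (in %) on the score of a model that always gives the true answer.

    affected_share: fraction of all questions in the affected subset (placeholder input).
    error_rate: fraction of that subset whose answer key is wrong.
    Assumption: every question with a wrong key is marked incorrect, and
    questions outside the subset have correct keys.
    """
    for name, value in (("affected_share", affected_share), ("error_rate", error_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return 100.0 * (1.0 - affected_share * error_rate)


# error_rate follows the ~30% figure reported above;
# affected_share=0.15 is an illustrative placeholder, not a sourced number.
ceiling = score_ceiling(affected_share=0.15, error_rate=0.30)
print(f"ceiling: {ceiling:.1f}%")
```

With these placeholder inputs the ceiling is roughly 95.5%. The real bound depends on the actual share of affected questions and on how many disputed answers are truly wrong.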