Knowledge

GPQA Diamond

198 PhD-level biology, physics and chemistry questions you cannot Google.

Top models

#  Model                              Provider   Score  Date
1  Claude Mythos Preview (reasoning)  Anthropic  94.6%  2026-04
2  Gemini 3.1 Pro (reasoning)         Google     94.3%  2026-04
3  Claude Opus 4.7 (reasoning)        Anthropic  94.2%  2026-04
4  GPT-5.2 Pro (reasoning)            OpenAI     93.2%  2025-12-11
5  GPT-5.4 (reasoning)                OpenAI     92.8%  2026-03
6  GPT-5.2 (reasoning)                OpenAI     92.4%  2025-12-11
7  Gemini 3 Pro (reasoning)           Google     91.9%  2026-04
8  Claude Opus 4.6 (reasoning)        Anthropic  91.3%  2026-03

What does it measure?

GPQA stands for Graduate-Level Google-Proof Q&A. The "Diamond" subset is the 198 questions that domain experts (people with or pursuing a PhD in the field) answer correctly, but that most skilled non-experts get wrong even with unrestricted web access. Hence "Google-proof".

Domains covered: biology, physics, chemistry. Multiple choice with 4 options. The design goal is that the answer cannot simply be looked up or recalled: a high score is meant to reflect domain reasoning rather than retrieval of a memorized fact.

How to read the score

The score is the percentage correct.

  • Random guessing: 25% (4 options).
  • PhD expert in own domain: ~65% (~74% after discounting clear mistakes the experts identified in retrospect).
  • Non-expert with Google: ~34%.
  • Current top: ~94%. Top models sit well above the expert baseline.
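As a minimal sketch of how such a score is computed, the snippet below grades a list of multiple-choice answers against an answer key and compares the result to the 25% chance level. The function name and the four-question toy data are hypothetical, chosen only for illustration; they are not taken from the benchmark or any official grader.

```python
N_OPTIONS = 4                      # each question has 4 answer options
CHANCE_PERCENT = 100.0 / N_OPTIONS  # expected score from random guessing


def accuracy_percent(predictions, answer_key):
    """Return the percentage of questions answered correctly."""
    if not answer_key or len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return 100.0 * correct / len(answer_key)


# Hypothetical toy data: 4 questions, 3 answered correctly.
answer_key = ["C", "A", "D", "B"]
predictions = ["C", "A", "B", "B"]

score = accuracy_percent(predictions, answer_key)
print(f"score: {score:.1f}% (chance level: {CHANCE_PERCENT:.0f}%)")
```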

Example task

Example (physics):

"Two quantum states with energies E₁ and E₂ have lifetimes of 10⁻⁹ s and 10⁻⁸ s respectively. We want to clearly distinguish the two energy levels. Which of the following energy differences is sufficient to resolve them?"
A. 10⁻⁸ eV · B. 10⁻⁹ eV · C. 10⁻⁴ eV · D. 10⁻¹¹ eV

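Correct answer: C, via the energy-time uncertainty relation. A state with lifetime τ has an energy width of roughly ħ/τ, so the shorter-lived state (10⁻⁹ s) is smeared over about 6.6 × 10⁻⁷ eV. Only 10⁻⁴ eV is larger than that width.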
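The reasoning behind this answer can be checked numerically. The sketch below assumes the order-of-magnitude form ΔE ≈ ħ/τ (some textbooks use ħ/2τ, which does not change the outcome here) and treats a gap as resolvable when it exceeds the larger of the two widths. The variable and function names are illustrative.

```python
HBAR_EV_S = 6.582119569e-16  # reduced Planck constant in eV*s


def energy_width_ev(lifetime_s):
    """Approximate energy width of a state from its lifetime (dE ~ hbar / tau)."""
    return HBAR_EV_S / lifetime_s


lifetimes_s = [1e-9, 1e-8]
widths_ev = [energy_width_ev(t) for t in lifetimes_s]

# Assumption: the levels are distinguishable if their separation
# exceeds the broader of the two widths.
required_ev = max(widths_ev)

options_ev = {"A": 1e-8, "B": 1e-9, "C": 1e-4, "D": 1e-11}
resolvable = [label for label, gap in options_ev.items() if gap > required_ev]

print(f"broadest width: {required_ev:.2e} eV")
print("options that resolve the levels:", resolvable)
```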

What to watch out for

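  • Saturation in sight. Top models sit far above the PhD baseline (~65%). At ~94%, only about a dozen of the 198 questions separate the leaders from a perfect score, so the test becomes less discriminating with each release cycle.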
  • Only 198 questions. One question is ~0.5 percentage points, and at a score near 94% the binomial standard error is roughly 1.7 points. Differences of 1-2 points between models are within that noise.
  • Self-consistency inflation. Reported top scores may use "multiple samples + majority vote", which is not comparable to a single-shot run. Always check the methodology.
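The last two points can be made concrete with a rough sketch. It uses the binomial standard error for a 198-question test and, as a simplifying assumption, treats two models' scores as independent samples (in practice both are run on the same questions, so a paired analysis would be more precise). It also shows how a majority vote can differ from a single sample. The helper names and the three-sample answer list are hypothetical.

```python
import math
from collections import Counter

N_QUESTIONS = 198  # size of the Diamond subset


def standard_error_points(score_percent, n=N_QUESTIONS):
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def gap_in_standard_errors(score_a, score_b, n=N_QUESTIONS):
    """Score gap divided by the standard error of the difference.

    Assumption: the two scores are treated as independent samples.
    """
    se_diff = math.hypot(standard_error_points(score_a, n),
                         standard_error_points(score_b, n))
    return abs(score_a - score_b) / se_diff


def majority_vote(samples):
    """Most common answer among several samples for one question."""
    return Counter(samples).most_common(1)[0][0]


# Two scores taken from the leaderboard above.
se_top = standard_error_points(94.6)
gap = gap_in_standard_errors(94.6, 93.2)
print(f"standard error at 94.6%: {se_top:.1f} points")
print(f"94.6% vs 93.2%: {gap:.2f} standard errors apart")

# Hypothetical samples for one question whose correct answer is "C".
samples = ["B", "C", "C"]
single_shot = samples[0]       # first sample only
voted = majority_vote(samples)  # aggregated over all samples
print(f"single-shot: {single_shot}, majority vote: {voted}")
```

Under these assumptions, the 1.4-point gap between the two scores is well under one standard error of the difference, and the voted answer is correct where the first sample alone is not.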
