Knowledge

GPQA Diamond

198 PhD-level biology, physics and chemistry questions you cannot Google.

Top models

#	Model	Provider	Score	Date
1	Claude Mythos Preview reasoning	Anthropic	94.6%	2026-04
2	Gemini 3.1 Pro reasoning	Google	94.3%	2026-04
3	Claude Opus 4.7 reasoning	Anthropic	94.2%	2026-04
4	GPT-5.2 Pro reasoning	OpenAI	93.2%	2025-12-11
5	GPT-5.4 reasoning	OpenAI	92.8%	2026-03
6	GPT-5.2 reasoning	OpenAI	92.4%	2025-12-11
7	Gemini 3 Pro reasoning	Google	91.9%	2026-04
8	Claude Opus 4.6 reasoning	Anthropic	91.3%	2026-03

What does it measure?

GPQA stands for Graduate-Level Google-Proof Q&A. The "Diamond" subset consists of 198 questions that domain experts (PhDs in the field) mostly answer correctly, but that skilled non-experts mostly get wrong even with Google access. Hence "Google-proof".

Domains covered: biology, physics, and chemistry. Each question is multiple choice with 4 options. The test is built so that answers cannot simply be looked up, which makes it hard for an LLM to score well by recalling memorized text alone.

How to read the score

The score is the percentage correct.

Random guessing: 25% (4 options).
PhD expert in own domain: ~65% (~74% when discounting clear mistakes the experts identified in retrospect).
Non-expert with Google: ~34%.
Current top: ~94%. Top models sit well above the expert baseline.
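
As a minimal sketch of how such a score is computed, assume a hypothetical answer key and a list of model predictions given as option letters. The names `accuracy_percent`, `key`, and `predictions` are illustrative only and do not come from any official GPQA evaluation harness.

```python
def accuracy_percent(predictions, key):
    """Return the percentage of questions answered correctly."""
    if not key or len(predictions) != len(key):
        raise ValueError("predictions and key must be non-empty and equal in length")
    correct = sum(p == k for p, k in zip(predictions, key))
    return 100.0 * correct / len(key)


# Toy data (hypothetical, not real GPQA items): 4 questions, 3 correct.
key = ["C", "A", "D", "B"]
predictions = ["C", "A", "B", "B"]

print(f"{accuracy_percent(predictions, key):.1f}%")  # 75.0%
```

On the real subset the same calculation runs over 198 items, and the result is read against the 25% guessing floor and the human baselines listed above.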

Example task

Example (physics):

"Two quantum states with energies E₁ and E₂ have lifetimes of 10⁻⁹ s and 10⁻⁸ s respectively. We want to clearly distinguish the two energy levels. Which of the following energy differences is sufficient to resolve them?"
A. 10⁻⁸ eV · B. 10⁻⁹ eV · C. 10⁻⁴ eV · D. 10⁻¹¹ eV

Correct answer: C — via the energy-time uncertainty relation.
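
The reasoning behind answer C can be sketched as follows. This uses the order-of-magnitude form of the energy-time uncertainty relation, ΔE ~ ħ/τ, with ħ ≈ 6.58 × 10⁻¹⁶ eV·s. The symbols Γ₁ and Γ₂ for the level widths are introduced here for illustration and are not part of the original question.

```latex
\begin{align*}
\Gamma_1 &\sim \frac{\hbar}{\tau_1}
  = \frac{6.58\times10^{-16}\,\text{eV\,s}}{10^{-9}\,\text{s}}
  \approx 6.6\times10^{-7}\,\text{eV}, \\
\Gamma_2 &\sim \frac{\hbar}{\tau_2}
  = \frac{6.58\times10^{-16}\,\text{eV\,s}}{10^{-8}\,\text{s}}
  \approx 6.6\times10^{-8}\,\text{eV}, \\
|E_1 - E_2| &\gg \max(\Gamma_1, \Gamma_2) \approx 10^{-6}\,\text{eV}.
\end{align*}
```

Of the four options, only 10⁻⁴ eV is larger than the broader level width of roughly 10⁻⁶ eV, so only C allows the two levels to be clearly resolved.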

What to watch out for

Saturation in sight. Top models sit far above the PhD baseline (~65%), so the test becomes less discriminating with each release cycle.
Only 198 questions. One question is worth ~0.5 percentage points, so statistical noise is high. Differences of 1-2 points between models do not mean much.
Self-consistency inflation. Top scores often use "multiple samples + majority vote", which is not comparable to a single-shot result. Always check the methodology.
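
Two of these caveats can be made concrete with a short sketch. It assumes each question is an independent trial and uses the simple binomial (normal-approximation) standard error, which is only a rough guide for scores this close to 100%. The function names are hypothetical, and `majority_vote` shows the general idea of self-consistency voting, not any particular lab's protocol.

```python
import math
from collections import Counter

N_QUESTIONS = 198  # size of the Diamond subset


def per_question_pp(n=N_QUESTIONS):
    """Percentage points that a single question contributes to the score."""
    return 100.0 / n


def standard_error_pp(score_percent, n=N_QUESTIONS):
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def majority_vote(samples):
    """Return the most frequent answer among several sampled answers.

    Ties go to the answer that appeared first (Counter keeps insertion order).
    """
    return Counter(samples).most_common(1)[0][0]


se = standard_error_pp(94.0)
print(f"one question          = {per_question_pp():.2f} pp")  # 0.51 pp
print(f"standard error at 94% = {se:.2f} pp")                 # about 1.69 pp
print(f"approx. 95% interval  = +/- {1.96 * se:.2f} pp")      # about 3.31 pp

# Five hypothetical samples for one question: a single shot might return
# "B" or "A", while the vote over all five returns "C".
print(majority_vote(["C", "B", "C", "A", "C"]))  # C
```

Under these assumptions, a 1-2 point gap between two models falls inside the noise band of a single run, and a voted score can sit above what any single sample would achieve.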
