GPQA Diamond
198 PhD-level biology, physics and chemistry questions you cannot Google.
Top models
| # | Model | Provider | Score | Date |
|---|---|---|---|---|
| 1 | Claude Mythos Preview reasoning | Anthropic | 94.6% | 2026-04 |
| 2 | Gemini 3.1 Pro reasoning | Google | 94.3% | 2026-04 |
| 3 | Claude Opus 4.7 reasoning | Anthropic | 94.2% | 2026-04 |
| 4 | GPT-5.2 Pro reasoning | OpenAI | 93.2% | 2025-12-11 |
| 5 | GPT-5.4 reasoning | OpenAI | 92.8% | 2026-03 |
| 6 | GPT-5.2 reasoning | OpenAI | 92.4% | 2025-12-11 |
| 7 | Gemini 3 Pro reasoning | Google | 91.9% | 2026-04 |
| 8 | Claude Opus 4.6 reasoning | Anthropic | 91.3% | 2026-03 |
What does it measure?
GPQA stands for Graduate-Level Google-Proof Q&A. The "Diamond" subset is 198 questions that domain experts (PhDs in the field) mostly answer correctly, but that skilled non-experts with Google access cannot solve. Hence "Google-proof".
Domains covered: biology, physics, chemistry. Format: multiple choice with 4 options. The test exists to make a point: these are questions an LLM should not be able to answer by memorizing training data alone.
How to read the score
The score is the percentage correct.
- Random guessing: 25% (4 options).
- PhD expert in own domain: ~65% (~74% when obvious mistakes are discounted).
- Non-expert with Google: ~34%.
- Current top: ~94%. Top models sit well above the expert baseline.
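To make these reference points concrete, a percentage can be converted back into a question count and compared with the baselines. The following is a minimal sketch, assuming the score is a plain fraction of the 198 questions; the names `correct_count`, `margin_over` and `BASELINES` are illustrative and not part of any official evaluation harness.

```python
N_QUESTIONS = 198  # size of the Diamond subset, as stated above

# Reference points quoted in the list above (percent correct).
BASELINES = {
    "random guessing": 25.0,
    "non-expert with Google": 34.0,
    "PhD expert": 65.0,
}


def correct_count(score_pct: float, n: int = N_QUESTIONS) -> int:
    """Approximate number of questions answered correctly for a given score."""
    return round(score_pct / 100 * n)


def margin_over(score_pct: float, baseline: str) -> float:
    """Percentage-point gap between a score and a named baseline."""
    return score_pct - BASELINES[baseline]


if __name__ == "__main__":
    print(correct_count(94.0))              # questions correct at ~94%
    print(margin_over(94.0, "PhD expert"))  # gap to the expert baseline
```

Under these assumptions, a ~94% score corresponds to about 186 of 198 questions, 29 points above the expert baseline.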
Example task
Example (physics):
"Two quantum states with energies E₁ and E₂ have lifetimes of 10⁻⁹ s and 10⁻⁸ s respectively. We want to clearly distinguish the two energy levels. Which of the following energy differences is sufficient to resolve them?"
A. 10⁻⁸ eV · B. 10⁻⁹ eV · C. 10⁻⁴ eV · D. 10⁻¹¹ eV
Correct answer: C, via the energy-time uncertainty relation.
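The reasoning behind the answer can be checked numerically. This is a rough sketch, assuming the order-of-magnitude estimate ΔE ≈ ħ/τ for the width of each level (some textbooks use ħ/2τ, which does not change the outcome here); the variable and function names are illustrative.

```python
HBAR_EV_S = 6.582e-16  # reduced Planck constant in eV*s (approximate value)


def natural_width_ev(lifetime_s: float) -> float:
    """Energy uncertainty of a state, estimated as hbar / lifetime."""
    return HBAR_EV_S / lifetime_s


lifetimes = [1e-9, 1e-8]  # seconds, from the question
widths = [natural_width_ev(t) for t in lifetimes]

# The shorter-lived (broader) level sets the resolution limit.
required = max(widths)

options = {"A": 1e-8, "B": 1e-9, "C": 1e-4, "D": 1e-11}  # eV
resolvable = [label for label, delta_e in options.items() if delta_e > required]

print(widths)      # level widths in eV
print(resolvable)  # options larger than the broader width
```

The broader level has a width of roughly 7 × 10⁻⁷ eV, so only the 10⁻⁴ eV option exceeds it.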
What to watch out for
- Saturation in sight. Top models sit far above the PhD baseline (~65%). The test becomes less discriminating with each release cycle.
- Only 198 questions. One question is ~0.5 percentage points; statistical noise is high. Differences of 1-2 points between models do not mean much.
- Self-consistency inflation. Top scores often use "multiple samples + majority vote", which is not comparable to a score from a single sample per question. Always check the methodology.
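Two of the caveats above can be illustrated in a few lines. This is a sketch under simplifying assumptions: questions are treated as independent trials (a binomial model), and majority voting is shown in its simplest form; the function names are hypothetical and do not describe any specific lab's evaluation setup.

```python
import math
from collections import Counter


def standard_error_pct(score_pct: float, n: int = 198) -> float:
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n)


def majority_vote(samples: list[str]) -> str:
    """Most common answer among several samples for one question.

    Ties go to the answer that appeared first in the list.
    """
    return Counter(samples).most_common(1)[0][0]


if __name__ == "__main__":
    # Weight of a single question in percentage points.
    print(100 / 198)

    # Sampling noise at a ~94% score on 198 questions.
    print(standard_error_pct(94.0))

    # Five samples for one question collapse into a single voted answer.
    print(majority_vote(["C", "A", "C", "B", "C"]))
```

Under the binomial assumption, the standard error at ~94% is about 1.7 percentage points, which is why gaps of 1-2 points sit within the noise. The voting helper shows why a voted score and a single-sample score measure different things: occasional wrong samples are discarded before grading.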