Math

AIME (2024/2025)

An invitational US high-school math competition, now used as a frontier benchmark for AI.

Top models

#  Model                   Type       Provider  Score  Date
1  Grok-3 Mini             reasoning  xAI       95.8%  2026-04
2  o4-mini                 reasoning  OpenAI    93.4%  2026-04
3  Grok-3                  reasoning  xAI       93.3%  2026-04
4  LongCat-Flash-Thinking  reasoning  Meituan   93.3%  2026-04
5  Gemini 2.5 Pro          reasoning  Google    92%    2026-04
6  o3                      reasoning  OpenAI    91.6%  2026-04
7  DeepSeek-R1-0528        reasoning  DeepSeek  91.4%  2026-04
8  GLM-4.5                 reasoning  Zhipu AI  91%    2026-04

What does it measure?

The American Invitational Mathematics Examination (AIME) is a 15-question math contest for US high-school students who qualify through the AMC 10/12. Every answer is an integer from 000 to 999. In 2025/2026, AI leaderboards use AIME 2024 (both papers combined, 30 problems) and AIME 2025 as hard math benchmarks.

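The problems call for creative problem solving in algebra, geometry, number theory, and combinatorics. No written proof is required, only a final integer answer. Contest experience helps; memorization alone is not enough.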

How to read the score

The score is the percentage of problems answered correctly.

  • Random guessing: 1/1000 per question = effectively zero.
  • Median AIME participant: ~33% (5 of 15).
  • Top-tier AIME participants (USAMO candidates): ~80-100%.
  • Current top models: ~95% on AIME 2024; on AIME 2025 some models reach 100% (with self-consistency).
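
As a rough illustration of the arithmetic behind these reference points, the sketch below turns a count of correct answers into a percentage. The function name `score_percent` is hypothetical and is not taken from any leaderboard's tooling.

```python
def score_percent(correct: int, total: int) -> float:
    """Return the percentage of problems answered correctly."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    return 100.0 * correct / total


# Median participant from the list above: 5 of 15 problems.
median_score = score_percent(5, 15)

# Expected score of uniform random guessing: 1000 possible answers (000-999).
random_guess_score = 100.0 * 1 / 1000
```

Here `median_score` comes out at roughly 33.3 and `random_guess_score` at 0.1, which is why random guessing is treated as effectively zero.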

Example task

AIME 2024 I, Problem 1:

"Every morning Aya walks 9 km and then visits a coffee shop. At a constant speed of s km/h, the outing takes 4 hours, including t minutes at the coffee shop. At s+2 km/h, it takes 2 hours and 24 minutes, including the same t minutes at the coffee shop. Find the total number of minutes the outing takes, including the t minutes at the coffee shop, when she walks at s+½ km/h."

Answer: 204.
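
One way to check this answer is sketched below using exact fractions. The derivation in the comments is a reconstruction and not an official solution, and the variable names are chosen only for illustration.

```python
from fractions import Fraction

# Walking time plus coffee time, in hours:
#   9/s     + t/60 = 4       (speed s)
#   9/(s+2) + t/60 = 12/5    (speed s+2; 2 h 24 min = 2.4 h)
# Subtracting: 9/s - 9/(s+2) = 8/5, i.e. 18 / (s(s+2)) = 8/5,
# so 4s^2 + 8s - 45 = 0, whose positive root is s = 5/2.
s = Fraction(5, 2)
assert 4 * s**2 + 8 * s - 45 == 0

# Recover t (in minutes) from the first equation.
t = (4 - Fraction(9) / s) * 60

# Confirm the second condition holds with the same t.
assert Fraction(9) / (s + 2) + t / 60 == Fraction(12, 5)

# Total minutes at speed s + 1/2 = 3 km/h: 180 walking plus t at the shop.
total_minutes = Fraction(9) / (s + Fraction(1, 2)) * 60 + t
print(total_minutes)  # 204
```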

What to watch out for

  • Contamination. AIME 2024 problems were all over the internet soon after publication. Models have almost certainly seen them. AIME 2025 and OTIS Mock versions are more reliable.
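  • 100% scores are usually not single-shot. Top scores often rely on several attempts per problem, reported as pass@k (a problem counts as solved if any of k attempts is correct) or majority-of-n (the most common answer across n attempts is the one graded).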
  • Few questions. Only 30 (or 15) problems. One problem is worth about 3.3 (or 6.7) percentage points, so small score gaps are noise.
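
The difference between single-shot, majority-of-n, and pass@k scoring can be sketched as follows. The sample answers are invented for illustration, the helper names are hypothetical, and `pass_at_k_hit` is the simple "any of k attempts" check, not the unbiased pass@k estimator some evaluations report.

```python
from collections import Counter


def majority_vote(answers: list[int]) -> int:
    """Return the most common answer among the sampled attempts."""
    return Counter(answers).most_common(1)[0][0]


def pass_at_k_hit(answers: list[int], truth: int) -> bool:
    """True if any of the sampled attempts matches the reference answer."""
    return truth in answers


# Invented samples for one problem whose reference answer is 204.
samples = [204, 180, 204, 210, 204]

single_shot_correct = samples[1] == 204      # one unlucky attempt is wrong
voted_answer = majority_vote(samples)        # 204 wins the vote
any_correct = pass_at_k_hit(samples, 204)    # at least one attempt is right

# Weight of a single problem, in percentage points.
points_per_problem_30 = 100 / 30
points_per_problem_15 = 100 / 15
```

The same model can therefore look wrong on a single attempt yet right under voting or pass@k, so scores are only comparable when the sampling setup is the same.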

