Math

AIME (2024/2025)

A US high-school invitational math contest, now used as a frontier benchmark for AI.

Top models

#	Model	Provider	Score	Date
1	Grok-3 Mini (reasoning)	xAI	95.8%	2026-04
2	o4-mini (reasoning)	OpenAI	93.4%	2026-04
3	Grok-3 (reasoning)	xAI	93.3%	2026-04
4	LongCat-Flash-Thinking (reasoning)	Meituan	93.3%	2026-04
5	Gemini 2.5 Pro (reasoning)	Google	92%	2026-04
6	o3 (reasoning)	OpenAI	91.6%	2026-04
7	DeepSeek-R1-0528 (reasoning)	DeepSeek	91.4%	2026-04
8	GLM-4.5 (reasoning)	Zhipu AI	91%	2026-04

What does it measure?

The American Invitational Mathematics Examination (AIME) was originally designed as a 15-question math contest for US high-school students who qualify through the AMC 10/12. Every answer is an integer from 000 to 999. In 2025/2026, AI leaderboards use AIME 2024 (both papers together = 30 problems) and AIME 2025 as hard math benchmarks.

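The problems call for creative, multi-step reasoning in areas such as algebra, where contest experience helps and memorization alone does not. No written proofs are graded, only the final integer answer.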

How to read the score

The score is the percentage of questions answered correctly.

Random guessing: 1/1000 per question = effectively zero.
Median AIME participant: ~33% (5 of 15).
Top-tier AIME participants (USAMO candidates): ~80-100%.
Current top models: ~95% on AIME 2024. On AIME 2025, some models reach 100% (with self-consistency, i.e. voting over several sampled answers).
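As a rough illustration of the scoring described above, the following Python sketch computes the percentage score, the random-guess baseline, and the value of a single question. The function names and the answer key are made up for illustration (the key is synthetic, not a real AIME key), and actual leaderboards may parse and grade model output differently.

```python
from typing import Sequence


def aime_score(predicted: Sequence[int], correct: Sequence[int]) -> float:
    """Percentage of questions where the predicted integer equals the key."""
    if len(predicted) != len(correct):
        raise ValueError("need exactly one prediction per question")
    hits = sum(p == c for p, c in zip(predicted, correct))
    return 100.0 * hits / len(correct)


def points_per_question(num_questions: int) -> float:
    """Percentage points gained or lost by a single question."""
    return 100.0 / num_questions


# Synthetic 15-question answer key (hypothetical values, not a real paper).
key = list(range(100, 115))
model_answers = list(key)
model_answers[3] = 999  # the model gets exactly one question wrong

score = aime_score(model_answers, key)
random_baseline = 100.0 / 1000  # 1000 possible answers, 000 to 999

print(f"{score:.1f}%")                     # 93.3%
print(f"{random_baseline:.1f}%")           # 0.1%
print(f"{points_per_question(15):.2f}")    # 6.67
print(f"{points_per_question(30):.2f}")    # 3.33
```

A model that misses one question out of 15 already drops to 93.3%, which is worth keeping in mind when comparing scores that differ by a point or two.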

Example task

AIME 2024, problem 1 (paraphrased):

"Every morning Aya walks 9 km and then visits a coffee shop. When she walks at a constant speed of s km/h, the outing takes 4 hours, including t minutes at the coffee shop. At s+2 km/h it takes 2 hours and 24 minutes, including the same t minutes. Find the total number of minutes the outing takes, including the t minutes at the coffee shop, when she walks at s+½ km/h."

Answer: 204.
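The answer can be checked with a few lines of arithmetic. The Python sketch below follows one possible solution path (subtract the two time equations to eliminate t, solve the resulting quadratic for s, then recover t). The function name `aya_walk` and its parameters are illustrative only.

```python
import math


def aya_walk(distance_km: float = 9.0,
             slow_total_h: float = 4.0,
             fast_total_h: float = 2.4,   # 2 hours 24 minutes
             speed_gain: float = 2.0,
             query_gain: float = 0.5):
    """Return (s in km/h, t in minutes, total minutes at speed s + query_gain)."""
    # Subtracting the two time equations cancels the coffee-shop time t:
    #   d/s - d/(s + g) = T_slow - T_fast  =>  s*(s + g) = d*g / (T_slow - T_fast)
    c = distance_km * speed_gain / (slow_total_h - fast_total_h)
    # Positive root of s^2 + g*s - c = 0.
    s = (-speed_gain + math.sqrt(speed_gain ** 2 + 4 * c)) / 2
    # The coffee-shop time is what remains of the slow outing after walking.
    t_min = (slow_total_h - distance_km / s) * 60
    # Walking time at the queried speed plus the same coffee-shop time.
    total_min = distance_km / (s + query_gain) * 60 + t_min
    return s, t_min, total_min


s, t_min, total_min = aya_walk()
print(round(s, 3), round(t_min), round(total_min))  # 2.5 24 204
```

So s = 2.5 km/h and t = 24 minutes; at 3 km/h the walk takes 180 minutes, and 180 + 24 = 204.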

What to watch out for

Contamination. AIME 2024 problems circulated widely online soon after publication, so models have almost certainly seen them. AIME 2025 and OTIS Mock versions are more reliable.
100% scores are usually not single-shot. Top scores often rely on pass@k (a problem counts as solved if any of k attempts is correct) or majority-of-n (the model answers n times and only the most frequent answer is graded).
Few questions. Only 30 (or 15) problems, so one question is worth about 3.3 (or 6.7) percentage points. Small score gaps are noise.
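To see why the sampling strategy matters, the sketch below grades the same set of sampled answers in three ways: single-shot, majority vote, and a simplified "any of k" check. All names and sample values are hypothetical. Published pass@k figures are usually a statistical estimate over many samples, so `any_of_k_correct` is only a simplified stand-in for that metric.

```python
from collections import Counter
from typing import Sequence


def single_shot_correct(samples: Sequence[int], answer: int) -> bool:
    """Grade only the first sampled answer."""
    return samples[0] == answer


def majority_vote_correct(samples: Sequence[int], answer: int) -> bool:
    """Grade the most frequent sampled answer (ties go to the first one seen)."""
    most_common_answer, _count = Counter(samples).most_common(1)[0]
    return most_common_answer == answer


def any_of_k_correct(samples: Sequence[int], answer: int, k: int) -> bool:
    """Simplified pass@k: solved if any of the first k samples is correct."""
    return answer in samples[:k]


answer = 204
# Hypothetical sampled answers from one model on one problem.
mostly_right = [73, 204, 204, 73, 204]
rarely_right = [73, 73, 204, 73, 12]

for samples in (mostly_right, rarely_right):
    print(single_shot_correct(samples, answer),
          majority_vote_correct(samples, answer),
          any_of_k_correct(samples, answer, k=5))
# False True True
# False False True
```

The same samples can count as a miss or a hit depending on the grading rule, so scores are only comparable when the sampling setup is reported alongside them.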
