Code

Aider Polyglot

Can the model write and patch code across six different programming languages?

Top models

#	Model	Provider	Score	Date
1	GPT-5 (high) reasoning	OpenAI	88%	2025-08-23
2	GPT-5 (medium) reasoning	OpenAI	86.7%	2025-08-25
3	o3-pro (high) reasoning	OpenAI	84.9%	2025-06-28
4	Gemini 2.5 Pro (32k think) reasoning	Google	83.1%	2025-06-06
5	GPT-5 (low) reasoning	OpenAI	81.3%	2025-08-25
6	o3 (high) reasoning	OpenAI	81.3%	2025-06-25
7	Grok-4 (high) reasoning	xAI	79.6%	2025-07-11
8	Gemini 2.5 Pro (default think) reasoning	Google	79.1%	2025-06-06

What does it measure?

Aider Polyglot tests a model on 225 hard Exercism tasks spread across six languages: C++, Go, Java, JavaScript, Python and Rust. The 225 tasks were selected because three or fewer earlier models could solve them, which makes the set a deliberately demanding bar.

The benchmark has two parts: writing and patching. The model first writes a solution. If that solution fails the tests, the model gets the test errors back and may make one second attempt via a diff. So the benchmark also tests self-correction on code.
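The write-then-patch flow can be sketched roughly as follows. This is an illustration only, not Aider's actual harness: `run_task`, `solve`, `fix` and `run_tests` are hypothetical names standing in for the model calls and the test runner.

```python
from typing import Callable, Tuple


def run_task(
    solve: Callable[[str], str],
    fix: Callable[[str, str, str], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    task: str,
) -> Tuple[bool, int]:
    """Return (passed, attempts_used) for one task, with at most two attempts.

    All three callables are hypothetical stand-ins:
      solve(task)              -> first solution
      fix(task, code, errors)  -> patched solution after seeing test errors
      run_tests(code)          -> (passed, error_output)
    """
    # Attempt 1: write a solution from the task description alone.
    code = solve(task)
    passed, errors = run_tests(code)
    if passed:
        return True, 1

    # Attempt 2: the model sees the test errors and patches its own code.
    code = fix(task, code, errors)
    passed, _ = run_tests(code)
    return passed, 2
```

A task counts as solved if `passed` is true after either attempt. There is no third try.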

How to read the score

The score is the percentage of tasks whose tests pass, either on the first attempt (pass@1) or after one correction.

Random guessing: not meaningful.
Human baseline: no formal measurement, but these 225 are specifically "hard for models".
Current top: ~88%. Average across 22 evaluated models: ~58%.
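As a small sketch of how such a score could be computed from per-task results: the function name `pass_rates` and the result encoding (1, 2 or `None`) are assumptions for illustration, and the sample numbers are made up, not taken from any real run.

```python
from typing import Optional, Sequence, Tuple


def pass_rates(first_pass_attempt: Sequence[Optional[int]]) -> Tuple[float, float]:
    """Return (first_try_pct, final_pct) for a list of per-task outcomes.

    Each entry is the attempt on which the task first passed (1 or 2),
    or None if it never passed. This encoding is a hypothetical choice.
    """
    total = len(first_pass_attempt)
    first_try = sum(1 for a in first_pass_attempt if a == 1)
    solved = sum(1 for a in first_pass_attempt if a is not None)
    return 100.0 * first_try / total, 100.0 * solved / total


# Made-up outcome for 225 tasks: 150 pass at once, 48 after one correction, 27 fail.
outcomes = [1] * 150 + [2] * 48 + [None] * 27
first_try_pct, final_pct = pass_rates(outcomes)
print(f"first try: {first_try_pct:.1f}%, after correction: {final_pct:.1f}%")
```

The second number is the one described above as the score. The first shows how much of it depends on the correction round.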

Example task

Rust: build a reactive system with compute cells and input cells (like spreadsheet formulas). Compute cells recompute automatically when their input cells change. Observers can subscribe to compute cells. Skeleton files and a cargo test suite are provided.

The model gets the task, skeleton code and language-specific tests. It has to write a diff that makes the tests pass.
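To give a feel for what the task asks, here is a deliberately naive sketch of the idea in Python rather than Rust. The class and method names are hypothetical and do not match the exercise's skeleton files, and a real solution would likely need to handle more cases, for example compute cells that share an ancestor.

```python
class InputCell:
    """A cell whose value is set directly."""

    def __init__(self, value):
        self.value = value
        self.dependents = []  # compute cells that read this cell

    def set(self, value):
        self.value = value
        for cell in self.dependents:
            cell.recompute()


class ComputeCell:
    """A cell whose value is derived from other cells, like a spreadsheet formula."""

    def __init__(self, inputs, fn):
        self.inputs = inputs
        self.fn = fn
        self.dependents = []
        self.callbacks = []  # observers subscribed to this cell
        for cell in inputs:
            cell.dependents.append(self)
        self.value = fn([c.value for c in inputs])

    def subscribe(self, callback):
        self.callbacks.append(callback)

    def recompute(self):
        new_value = self.fn([c.value for c in self.inputs])
        if new_value == self.value:
            return  # nothing changed, so observers are not notified
        self.value = new_value
        for callback in self.callbacks:
            callback(new_value)
        for cell in self.dependents:
            cell.recompute()


# Usage: one input, one formula, one observer.
source = InputCell(1)
plus_one = ComputeCell([source], lambda values: values[0] + 1)
seen = []
plus_one.subscribe(seen.append)
source.set(3)  # plus_one becomes 4 and the observer records it
```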

What to watch out for

Prompt format matters. The edit format (whole file, search/replace diff, unified diff) shifts the score by 5-10 points. Models can only be compared fairly when they ran with the same format.
Only 225 tasks. Scores vary noticeably between runs, and a single task is worth about 0.4 percentage points, so small score gaps are noise.
The official leaderboard lags. Third-party tracking (Epoch AI, Artificial Analysis) can differ because Aider updates its own figures slowly.
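A back-of-the-envelope check on the "small gaps are noise" point, assuming each task is an independent pass/fail draw (a simple binomial model, which is an assumption and not something the benchmark itself states). The function name `standard_error_points` is hypothetical.

```python
import math


def standard_error_points(score_pct: float, n_tasks: int = 225) -> float:
    """Binomial standard error of a pass rate, in percentage points.

    Assumes independent tasks with the same pass probability, which is
    only a rough model of a real benchmark run.
    """
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_tasks)


points_per_task = 100.0 / 225  # weight of a single task in the score
se_top = standard_error_points(88.0)  # near the current top score
se_avg = standard_error_points(58.0)  # near the reported average

print(f"one task = {points_per_task:.2f} points")
print(f"standard error at 88%: {se_top:.1f} points")
print(f"standard error at 58%: {se_avg:.1f} points")
```

Under this model the uncertainty is roughly 2 points near the top of the table and roughly 3 points near the average, so a gap such as 88% versus 86.7% falls well inside it.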
