Code

Aider Polyglot

Can the model write and patch code across six different programming languages?

Top models

#  Model                           Type       Provider  Score  Date
1  GPT-5 (high)                    reasoning  OpenAI    88%    2025-08-23
2  GPT-5 (medium)                  reasoning  OpenAI    86.7%  2025-08-25
3  o3-pro (high)                   reasoning  OpenAI    84.9%  2025-06-28
4  Gemini 2.5 Pro (32k think)      reasoning  Google    83.1%  2025-06-06
5  GPT-5 (low)                     reasoning  OpenAI    81.3%  2025-08-25
6  o3 (high)                       reasoning  OpenAI    81.3%  2025-06-25
7  Grok-4 (high)                   reasoning  xAI       79.6%  2025-07-11
8  Gemini 2.5 Pro (default think)  reasoning  Google    79.1%  2025-06-06

What does it measure?

Aider Polyglot tests a model on 225 hard Exercism tasks spread across six languages: C++, Go, Java, JavaScript, Python and Rust. The 225 tasks were chosen because three or fewer of the earlier models tested could solve them, so the set is a deliberately demanding bar.

The benchmark has two parts: writing and patching. The model first writes a solution. If that solution fails, the model gets the test errors back and may make one second attempt, submitted as a diff. The benchmark therefore also tests whether a model can correct its own code.
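The write-then-patch flow can be sketched roughly as follows. This is a minimal illustration, not Aider's actual harness: `run_task` and the three callables passed into it are hypothetical names, and the stub model and test runner exist only to make the sketch runnable.

```python
from typing import Callable, Optional


def run_task(
    task: str,
    write_solution: Callable[[str], str],
    patch_solution: Callable[[str, str, str], str],
    run_tests: Callable[[str], Optional[str]],
) -> int:
    """Return 1 or 2 for the attempt that passed, or 0 if both attempts failed.

    All names here are hypothetical; run_tests returns None when every test
    passes and an error message otherwise.
    """
    solution = write_solution(task)
    errors = run_tests(solution)
    if errors is None:
        return 1  # solved on the first attempt

    # Second and final attempt: the model sees the test errors and patches.
    solution = patch_solution(task, solution, errors)
    return 2 if run_tests(solution) is None else 0


# Stub model and test runner, for illustration only.
def stub_write(task: str) -> str:
    return "buggy code"


def stub_patch(task: str, previous: str, errors: str) -> str:
    return "fixed code"


def stub_tests(solution: str) -> Optional[str]:
    return None if solution == "fixed code" else "AssertionError: expected 3, got 2"


outcome = run_task("react", stub_write, stub_patch, stub_tests)
print(outcome)
```

With these stubs the first attempt fails and the patched attempt passes, so the task counts as solved on the second try.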

How to read the score

The score is the percentage of tasks whose tests pass, either on the first attempt (pass@1) or after one correction.
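As a small worked example of that percentage, the sketch below computes both the first-attempt rate and the rate after one correction. The outcome encoding (0 = failed, 1 = passed first try, 2 = passed after correction) and the split of 150 / 48 / 27 tasks are assumptions invented for illustration, not published figures.

```python
def pass_rates(outcomes):
    """Return (first-attempt %, within-two-attempts %) for a list of outcomes.

    Hypothetical encoding: 0 = failed both attempts, 1 = passed on the
    first attempt, 2 = passed after one correction.
    """
    total = len(outcomes)
    first = sum(1 for o in outcomes if o == 1)
    within_two = sum(1 for o in outcomes if o in (1, 2))
    return 100 * first / total, 100 * within_two / total


# Made-up split over 225 tasks: 150 first-try passes, 48 fixed, 27 failed.
outcomes = [1] * 150 + [2] * 48 + [0] * 27
first_rate, final_rate = pass_rates(outcomes)
print(round(first_rate, 1), round(final_rate, 1))
```

In this invented split, 198 of the 225 tasks pass within two attempts, which is the kind of count that sits behind a headline figure of 88%.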

  • Random guessing: not meaningful.
  • Human baseline: no formal measurement, but these 225 are specifically "hard for models".
  • Current top: ~88%. Average across 22 evaluated models: ~58%.

Example task

Rust: build a reactive system with compute cells and input cells (like spreadsheet formulas). Compute cells recompute automatically when their input cells change. Observers can subscribe to compute cells. Skeleton files and a cargo test suite are provided.

The model gets the task, skeleton code and language-specific tests. It has to write a diff that makes the tests pass.
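To give a feel for what the task asks, here is a minimal sketch of the reactive idea. It is written in Python for illustration only, whereas the benchmark task itself is in Rust, and it is not the Exercism skeleton or a reference solution. The class and method names are hypothetical, and the naive propagation shown here could notify an observer more than once when a compute cell depends on the same input through several paths.

```python
class InputCell:
    """A cell whose value is set directly (hypothetical name)."""

    def __init__(self, value):
        self.value = value
        self.dependents = []  # compute cells that read this cell

    def set_value(self, value):
        if value == self.value:
            return  # unchanged, so nothing to propagate
        self.value = value
        for cell in self.dependents:
            cell.recompute()


class ComputeCell:
    """A cell whose value is derived from other cells (hypothetical name)."""

    def __init__(self, inputs, fn):
        self.inputs = inputs
        self.fn = fn
        self.dependents = []
        self.observers = []
        for cell in inputs:
            cell.dependents.append(self)
        self.value = fn([cell.value for cell in inputs])

    def subscribe(self, callback):
        self.observers.append(callback)

    def recompute(self):
        new_value = self.fn([cell.value for cell in self.inputs])
        if new_value == self.value:
            return  # observers are only told about real changes
        self.value = new_value
        for callback in self.observers:
            callback(new_value)
        for cell in self.dependents:
            cell.recompute()


a = InputCell(1)
plus_one = ComputeCell([a], lambda values: values[0] + 1)
seen = []
plus_one.subscribe(seen.append)

a.set_value(2)  # plus_one becomes 3 and the observer is notified
a.set_value(2)  # no change, so no second notification
print(plus_one.value, seen)
```

The second `set_value(2)` call is there to show that observers fire only when a compute cell's value actually changes.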

What to watch out for

  • Prompt format matters. The edit format (whole file, search/replace diff, unified diff) shifts the score by 5-10 points. Models can only be compared fairly when they use the same format.
  • Only 225 tasks. Each task is worth about 0.44 percentage points (100 / 225) and variance between runs is wide, so small score gaps are likely noise.
  • The official leaderboard lags. Third-party tracking (Epoch AI, Artificial Analysis) can differ because Aider updates its own figures slowly.
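The arithmetic behind the "only 225 tasks" caveat can be sketched as below. The standard-error estimate treats each task as an independent pass/fail trial, which is a simplifying assumption made here and not something the benchmark states. The function names are hypothetical.

```python
import math

N_TASKS = 225


def points_per_task(n=N_TASKS):
    """Percentage points that a single task contributes to the score."""
    return 100 / n


def tasks_solved(score_percent, n=N_TASKS):
    """Approximate number of solved tasks behind a reported percentage."""
    return round(score_percent * n / 100)


def standard_error_points(score_percent, n=N_TASKS):
    """Binomial standard error of the score, in percentage points.

    Assumes independent tasks with equal pass probability (an assumption).
    """
    p = score_percent / 100
    return 100 * math.sqrt(p * (1 - p) / n)


print(round(points_per_task(), 2))
print(tasks_solved(88) - tasks_solved(86.7))
print(round(standard_error_points(88), 2))
```

Under that assumption, the gap between the top two table entries (88% and 86.7%) corresponds to three tasks, and the estimated standard error at 88% is a little over two percentage points, so a gap of that size is within the expected noise.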
