Aider Polyglot
Can the model write and patch code across six different programming languages?
Top models
| # | Model | Provider | Score | Date |
|---|---|---|---|---|
| 1 | GPT-5 (high) reasoning | OpenAI | 88% | 2025-08-23 |
| 2 | GPT-5 (medium) reasoning | OpenAI | 86.7% | 2025-08-25 |
| 3 | o3-pro (high) reasoning | OpenAI | 84.9% | 2025-06-28 |
| 4 | Gemini 2.5 Pro (32k think) reasoning | Google | 83.1% | 2025-06-06 |
| 5 | GPT-5 (low) reasoning | OpenAI | 81.3% | 2025-08-25 |
| 6 | o3 (high) reasoning | OpenAI | 81.3% | 2025-06-25 |
| 7 | Grok-4 (high) reasoning | xAI | 79.6% | 2025-07-11 |
| 8 | Gemini 2.5 Pro (default think) reasoning | Google | 79.1% | 2025-06-06 |
What does it measure?
Aider Polyglot tests a model on 225 hard Exercism tasks spread across six languages: C++, Go, Java, JavaScript, Python and Rust. The 225 tasks were selected because three or fewer earlier models could solve them, which makes the set a deliberately demanding bar.
The benchmark has two parts: writing and patching. The model first writes a solution. If that solution fails the tests, the model gets the test errors back and may make a second attempt via a diff. The benchmark therefore also measures self-correction on code.
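The two-attempt flow can be sketched as a small loop. This is a minimal illustration, not Aider's actual harness: `run_task`, the `Model` and `TestRunner` signatures, and the toy stand-ins are all hypothetical names introduced here.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures, assumed for this sketch only:
# a model maps (task, feedback) to a candidate solution,
# a test runner maps a solution to (passed, error_output).
Model = Callable[[str, Optional[str]], str]
TestRunner = Callable[[str], Tuple[bool, str]]


def run_task(task: str, model: Model, run_tests: TestRunner,
             max_attempts: int = 2) -> int:
    """Return the 1-based attempt on which the tests passed, or 0 if never."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        solution = model(task, feedback)
        passed, errors = run_tests(solution)
        if passed:
            return attempt
        feedback = errors  # test output is handed back for the next attempt
    return 0


# Toy stand-ins: the "model" only gets it right once it has seen feedback.
def toy_model(task: str, feedback: Optional[str]) -> str:
    return "fixed" if feedback else "buggy"


def toy_tests(solution: str) -> Tuple[bool, str]:
    ok = solution == "fixed"
    return ok, "" if ok else "assertion failed"


result = run_task("demo", toy_model, toy_tests)
print(result)  # 2: passed on the second attempt
```

Returning the attempt number rather than a boolean keeps first-attempt and second-attempt passes distinguishable when results are aggregated.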
How to read the score
The score is the percentage of tasks whose tests pass, either on the first attempt (pass@1) or after one correction.
- Random guessing: not meaningful.
- Human baseline: no formal measurement, but these 225 are specifically "hard for models".
- Current top: ~88%. Average across 22 evaluated models: ~58%.
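The score definition can be written out as a short calculation. In this sketch each task outcome is encoded as the attempt on which it passed (0 for never); the encoding, the function name `pass_rate`, and the 150/48/27 split are made up for illustration and are not real leaderboard data.

```python
def pass_rate(outcomes, max_attempt=2):
    """Percentage of tasks solved within `max_attempt` attempts.

    `outcomes` holds, per task, the attempt that passed (1 or 2) or 0 for a fail.
    """
    solved = sum(1 for attempt in outcomes if 1 <= attempt <= max_attempt)
    return 100.0 * solved / len(outcomes)


# Illustrative split over 225 tasks (invented numbers):
# 150 pass first time, 48 pass after one correction, 27 never pass.
outcomes = [1] * 150 + [2] * 48 + [0] * 27

print(pass_rate(outcomes))                  # 88.0
print(round(pass_rate(outcomes, 1), 1))     # 66.7 (first attempt only)
```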
Example task
Rust: build a reactive system with compute cells and input cells (like spreadsheet formulas). Compute cells recompute automatically when their input cells change. Observers can subscribe to compute cells. Skeleton files and a `cargo test` suite are provided.
The model gets the task, skeleton code and language-specific tests. It has to write a diff that makes the tests pass.
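To give a feel for what the task asks, here is a deliberately simplified sketch of the reactive idea. It is written in Python rather than Rust, it is not the benchmark's reference solution, and the class and method names (`InputCell`, `ComputeCell`, `set_value`, `subscribe`) are assumptions made for this illustration. It also ignores edge cases the real tests are likely to probe, such as firing each observer at most once per update.

```python
class InputCell:
    """A cell whose value is set directly."""

    def __init__(self, value):
        self.value = value
        self.dependents = []  # compute cells that read this cell

    def set_value(self, value):
        if value != self.value:
            self.value = value
            for cell in self.dependents:
                cell.recompute()


class ComputeCell:
    """A cell whose value is derived from other cells."""

    def __init__(self, inputs, fn):
        self.inputs = inputs
        self.fn = fn
        self.dependents = []
        self.observers = []
        for cell in inputs:
            cell.dependents.append(self)
        self.value = fn([c.value for c in inputs])

    def subscribe(self, callback):
        self.observers.append(callback)

    def recompute(self):
        new_value = self.fn([c.value for c in self.inputs])
        if new_value != self.value:  # notify only on an actual change
            self.value = new_value
            for callback in self.observers:
                callback(new_value)
            for cell in self.dependents:
                cell.recompute()


a = InputCell(1)
b = InputCell(2)
total = ComputeCell([a, b], sum)

seen = []
total.subscribe(seen.append)
a.set_value(10)
print(total.value, seen)  # 12 [12]
```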
What to watch out for
- Prompt format matters. The edit format (whole file, search/replace diff, unified diff) shifts the score by 5-10 points. Compare models only when they were run with the same format.
- Only 225 tasks. One task is worth about 0.44 percentage points and variance between runs is wide, so small score gaps are noise.
- The official leaderboard lags. Third-party tracking (Epoch AI, Artificial Analysis) can differ because Aider updates its own figures slowly.
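The point about small gaps being noise can be made concrete with a rough calculation. This sketch assumes each task is an independent pass/fail trial, a simplifying binomial model that is not part of the benchmark itself; the function names are illustrative.

```python
import math


def points_per_task(n_tasks):
    """Percentage points that a single task contributes to the score."""
    return 100.0 / n_tasks


def standard_error_points(pass_fraction, n_tasks):
    """Binomial standard error of the score, in percentage points.

    Assumes independent tasks, which is only an approximation.
    """
    return 100.0 * math.sqrt(pass_fraction * (1.0 - pass_fraction) / n_tasks)


print(round(points_per_task(225), 2))              # 0.44
print(round(standard_error_points(0.88, 225), 2))  # 2.17
```

Under this assumption, a score near 88% on 225 tasks carries a standard error of roughly two percentage points, so a gap of one or two points between neighbouring models should not be read as a real difference.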