Leaderboard
| System / Submission | Notes | Score | Organization | Reported | Source |
|---|---|---|---|---|---|
| gpt-5 (high) | Official Aider polyglot leaderboard; diff edit format, 91.6% well-formed edits, ~$29.08 per run. | 88.0% | OpenAI | | Source |
| gpt-5 (medium) | Official Aider polyglot leaderboard; diff edit format, 88.4% well-formed edits, ~$17.69 per run. | 86.7% | OpenAI | | Source |
| o3-pro (high) | Official Aider polyglot leaderboard; diff edit format, 97.8% well-formed edits, ~$146.32 per run. | 84.9% | OpenAI | | Source |
| gemini-2.5-pro-preview-06-05 (32k think) | Official Aider polyglot leaderboard; diff-fenced edit format, 99.6% well-formed edits, ~$49.88 per run. | 83.1% | Google | | Source |
| o3 (high) | Official Aider polyglot leaderboard; diff edit format, 94.7% well-formed edits, ~$21.23 per run. | 81.3% | OpenAI | | Source |
| gpt-5 (low) | Official Aider polyglot leaderboard; diff edit format, 86.7% well-formed edits, ~$10.37 per run. | 81.3% | OpenAI | | Source |
| grok-4 (high) | Official Aider polyglot leaderboard; diff edit format, 97.3% well-formed edits, ~$59.62 per run. | 79.6% | xAI | | Source |
| gemini-2.5-pro-preview-06-05 (default think) | Official Aider polyglot leaderboard; diff-fenced edit format, 100% well-formed edits, ~$45.60 per run. | 79.1% | Google | | Source |
| o3 (high) + gpt-4.1 | Official Aider polyglot leaderboard; architect mode (planner + editor), 100% well-formed edits, ~$17.55 per run. | 78.2% | OpenAI | | Source |
| Gemini 2.5 Pro Preview 05-06 | Official Aider polyglot leaderboard; diff-fenced edit format, 97.3% well-formed edits, ~$37.41 per run. | 76.9% | Google | | Source |
| o3 | Official Aider polyglot leaderboard; diff edit format, 93.8% well-formed edits, ~$13.75 per run. | 76.9% | OpenAI | | Source |
| | | 74.2% | DeepSeek | | Source |
| Gemini 2.5 Pro Preview 03-25 | Official Aider polyglot leaderboard; diff-fenced edit format, 92.4% well-formed edits. | 72.9% | Google | | Source |
| o4-mini (high) | Official Aider polyglot leaderboard; diff edit format, 90.7% well-formed edits, ~$19.64 per run. | 72.0% | OpenAI | | Source |
| claude-opus-4-20250514 (32k thinking) | Official Aider polyglot leaderboard; diff edit format, 97.3% well-formed edits, ~$65.75 per run. | 72.0% | Anthropic | | Source |
About this benchmark
Aider Polyglot is the coding benchmark behind the Aider pair-programming tool. It tests an LLM on 225 of Exercism's hardest exercises (the ones few models could solve on the earlier single-language benchmark) across C++, Go, Java, JavaScript, Python, and Rust.
Unlike isolated code-generation tests, Polyglot scores the model inside Aider's real edit loop: the model must emit changes in a structured edit format (diff, diff-fenced, whole, or architect) and gets a second attempt with the failing unit-test output if the first attempt fails.
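The two-attempt loop described above can be sketched roughly as follows. This is a minimal illustration, not Aider's actual harness: `solve`, `run_tests`, and `attempts_needed` are hypothetical names invented here, and the toy "model" simply fixes its answer once it has seen test feedback.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures, invented for this sketch (not Aider's API):
#   solve(feedback): returns candidate code; feedback is None on the first
#                    try and the failing test output on the second.
#   run_tests(code): returns (passed, test_output).
Solve = Callable[[Optional[str]], str]
RunTests = Callable[[str], Tuple[bool, str]]


def attempts_needed(solve: Solve, run_tests: RunTests,
                    max_attempts: int = 2) -> Optional[int]:
    """Return the 1-based attempt that passed, or None if none did."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = solve(feedback)
        passed, output = run_tests(code)
        if passed:
            return attempt
        feedback = output  # the next try sees the failing test output
    return None


# Toy stand-ins: the "model" only gets it right after seeing feedback.
def toy_solve(feedback: Optional[str]) -> str:
    return "return a + b" if feedback else "return a - b"


def toy_run_tests(code: str) -> Tuple[bool, str]:
    ok = code == "return a + b"
    return ok, ("" if ok else "AssertionError: add(1, 2) != 3")


print(attempts_needed(toy_solve, toy_run_tests))
```

An exercise counts toward the headline score whether it passes on the first or the second attempt, which is why the helper returns the attempt number rather than a plain boolean.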
Because it measures instruction-following and reliable file editing rather than raw synthesis, the leaderboard is a strong practical signal for choosing a model for an autonomous or pair-programming coding assistant, and it reports cost per run alongside accuracy.
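One way to read cost alongside accuracy is to divide cost per run by score. The sketch below does this for three rows copied from the leaderboard above; the "USD per percentage point" ratio and the name `dollars_per_point` are illustrative assumptions, not metrics the leaderboard itself reports.

```python
# (label, score in percent, approximate cost per run in USD),
# copied from the leaderboard rows on this page.
rows = [
    ("gpt-5 (high)", 88.0, 29.08),
    ("gpt-5 (medium)", 86.7, 17.69),
    ("o3-pro (high)", 84.9, 146.32),
]


def dollars_per_point(score_pct: float, cost_usd: float) -> float:
    """Cost of one run divided by the score (USD per percentage point)."""
    return cost_usd / score_pct


# Cheapest per point first.
ranked = sorted(rows, key=lambda r: dollars_per_point(r[1], r[2]))
for label, score, cost in ranked:
    ratio = dollars_per_point(score, cost)
    print(f"{label:16s} {score:5.1f}%  ${cost:7.2f}  {ratio:.3f} USD/point")
```

A ratio like this only ranks configurations against each other; it says nothing about whether a lower-scoring, cheaper run solves the exercises that matter for a given project.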
Scores are taken from the official Aider polyglot leaderboard; frontier models released after Aider's last run may be missing until Aider re-runs the benchmark with them.
Aider also has an older single-language code-editing leaderboard; this page tracks only the modern Polyglot benchmark, so do not mix scores from the two.
Architect-mode and planner+editor rows are system results, not single-model numbers; rows with a thinking-token or reasoning-effort label are configuration-specific.
Example tasks
Three public tasks quoted from benchmark sources:
- "To try and encourage more sales of different books from a popular 5 book series, a bookshop has decided to offer discounts on multiple book purchases." Citation: Aider polyglot benchmark, book-store exercise
- "Given students' names along with the grade that they are in, create a roster for the school." Citation: Aider polyglot benchmark, grade-school exercise
- "Pick the best hand(s) from a list of poker hands." Citation: Aider polyglot benchmark, poker exercise
Methodology
- Primary metric is percent correct (pass_rate_2): the share of the 225 exercises where all hidden unit tests pass within the model's two attempts.
- A secondary metric, percent using correct edit format, reports how often the model emitted edits Aider could apply without retry; low edit-format compliance drags down effective accuracy.
- Each model runs with Aider's standard prompting and a per-model edit format; some rows fix a thinking-token budget or reasoning effort, which the leaderboard records in the model label.
- Architect-mode rows pair a planner model with a separate editor model, so they are system results rather than single-model numbers; read the model label before comparing.
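The two metrics above could be computed from per-exercise records along these lines. This is a sketch under stated assumptions: the `ExerciseResult` layout and both helper functions are hypothetical, and the four records are made-up toy data (not benchmark results) chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExerciseResult:
    """Hypothetical per-exercise record, invented for this sketch."""
    name: str
    passed_on_attempt: Optional[int]  # 1, 2, or None if it never passed
    well_formed: bool                 # edits applied cleanly in the edit format


def pass_rate(results: List[ExerciseResult], max_attempt: int) -> float:
    """Percent of exercises that passed on or before `max_attempt`."""
    solved = sum(
        1 for r in results
        if r.passed_on_attempt is not None and r.passed_on_attempt <= max_attempt
    )
    return 100.0 * solved / len(results)


def well_formed_rate(results: List[ExerciseResult]) -> float:
    """Percent of exercises whose edits followed the edit format."""
    return 100.0 * sum(r.well_formed for r in results) / len(results)


# Toy data only; the real benchmark has 225 exercises.
results = [
    ExerciseResult("book-store", 1, True),
    ExerciseResult("grade-school", 2, True),
    ExerciseResult("poker", None, False),
    ExerciseResult("hypothetical-exercise", 1, True),
]

print(pass_rate(results, 1))       # first attempt only
print(pass_rate(results, 2))       # analogue of pass_rate_2
print(well_formed_rate(results))   # analogue of percent well-formed edits
```

Keeping the attempt number per exercise, rather than a single pass flag, lets the same records yield both a first-attempt rate and the two-attempt headline figure.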