About this benchmark

Aider Polyglot is the coding benchmark behind the Aider pair-programming tool. It tests an LLM on 225 of Exercism's hardest exercises (the ones few models could solve on the earlier single-language benchmark) across C++, Go, Java, JavaScript, Python, and Rust.

Unlike isolated code-generation tests, Polyglot scores the model inside Aider's real edit loop: the model must emit changes in a structured edit format (diff, diff-fenced, whole, or architect) and gets a second attempt with the failing unit-test output if the first attempt fails.
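The two-attempt loop described above can be sketched as follows. This is a minimal illustration, not Aider's actual harness: `solve_with_retry`, `ask_model`, and `run_tests` are hypothetical names, and the real tool also parses and applies the edit format before running tests.

```python
from typing import Callable, Optional, Tuple


def solve_with_retry(
    ask_model: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    task: str,
    max_attempts: int = 2,
) -> Tuple[bool, int]:
    """Return (passed, attempts_used) for one exercise.

    ask_model(task, feedback) is a hypothetical callable that returns the
    model's proposed edit; feedback is None on the first attempt.
    run_tests(edit) is a hypothetical callable that returns
    (all_tests_passed, test_output).
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        edit = ask_model(task, feedback)
        passed, output = run_tests(edit)
        if passed:
            return True, attempt
        # On failure, the next attempt sees the failing test output.
        feedback = output
    return False, max_attempts
```

With the default of two attempts, an exercise counts as solved if either the first try or the retry passes every test.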

Because it measures instruction-following and reliable file editing rather than raw synthesis, the leaderboard is a strong practical signal for choosing a model for an autonomous or pair-programming coding assistant, and it reports cost per run alongside accuracy.

Numbers come from the official Aider Polyglot leaderboard; frontier models released after Aider's last run may be missing until Aider re-runs the benchmark with them.

Aider also has an older single-language code-editing leaderboard; this page tracks only the modern Polyglot benchmark, so do not mix scores from the two.

Architect-mode and planner+editor rows are system results, not single-model numbers; rows with a thinking-token or reasoning-effort label are configuration-specific.

Example tasks

Three public tasks quoted from benchmark sources:

Methodology

  • Primary metric is percent correct (pass_rate_2): the share of the 225 exercises where all hidden unit tests pass within the model's two attempts, that is, on the first try or on the retry.
  • A secondary metric, percent using correct edit format, reports how often the model emitted edits Aider could apply without retry; low edit-format compliance drags down effective accuracy.
  • Each model runs with Aider's standard prompting and a per-model edit format; some rows fix a thinking-token budget or reasoning effort, which the leaderboard records in the model label.
  • Architect-mode rows pair a planner model with a separate editor model, so they are system results rather than single-model numbers; read the model label before comparing.
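As a rough illustration of how the two headline metrics relate to per-exercise outcomes, here is a minimal sketch. The `ExerciseResult` record, its field names, and the `percent_correct_edit_format` key are assumptions made for this example; only `pass_rate_2` is named in the methodology above, and Aider's own result files may be structured differently.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ExerciseResult:
    # Hypothetical per-exercise record, not Aider's actual schema.
    passed_attempt_1: bool   # all tests passed on the first try
    passed_attempt_2: bool   # all tests passed on the retry
    well_formed_edits: bool  # every edit was applicable as emitted


def summarize(results: Iterable[ExerciseResult]) -> Dict[str, float]:
    """Compute percentage metrics over a set of exercise results."""
    rows = list(results)
    if not rows:
        raise ValueError("no results to summarize")
    n = len(rows)
    solved_first = sum(r.passed_attempt_1 for r in rows)
    # An exercise counts toward pass_rate_2 if either attempt succeeded.
    solved_within_two = sum(r.passed_attempt_1 or r.passed_attempt_2 for r in rows)
    well_formed = sum(r.well_formed_edits for r in rows)
    return {
        "pass_rate_1": 100.0 * solved_first / n,
        "pass_rate_2": 100.0 * solved_within_two / n,
        "percent_correct_edit_format": 100.0 * well_formed / n,
    }
```

On the full benchmark the denominator would be the 225 exercises, so each exercise moves the score by roughly 0.44 percentage points.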

Frequently asked questions

Which system is currently best on Aider?
gpt-5 (high) currently leads with a tracked score of 88.0%. This page is model-focused, so rankings mostly reflect model capability under the reported harness. Based on our latest tracked results, last updated Jun 8, 2026.
What should I read into an Aider score?
Aider scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.