Aider Leaderboard 2026: Latest Coding LLM Scores

Leaderboard

Model scope

System / Submission	Score	Organization	Reported	Source
gpt-5 (high)	88.0%	OpenAI	Aug 2025	Source
gpt-5 (medium)	86.7%	OpenAI	Aug 2025	Source
o3-pro (high)	84.9%	OpenAI	Jun 2025	Source
gemini-2.5-pro-preview-06-05 (32k think)	83.1%	Google	Jun 2025	Source
o3 (high)	81.3%	OpenAI	Jun 2025	Source
gpt-5 (low)	81.3%	OpenAI	Aug 2025	Source
grok-4 (high)	79.6%	xAI	Jul 2025	Source
gemini-2.5-pro-preview-06-05 (default think)	79.1%	Google	Jun 2025	Source
o3 (high) + gpt-4.1	78.2%	OpenAI	Jun 2025	Source
Gemini 2.5 Pro Preview 05-06	76.9%	Google	May 2025	Source
o3	76.9%	OpenAI	Jun 2025	Source
DeepSeek-V3.2-Exp (Reasoner)	74.2%	DeepSeek	Oct 2025	Source
Gemini 2.5 Pro Preview 03-25	72.9%	Google	Apr 2025	Source
o4-mini (high)	72.0%	OpenAI	Apr 2025	Source
claude-opus-4-20250514 (32k thinking)	72.0%	Anthropic	May 2025	Source

About this benchmark

Aider Polyglot is the coding benchmark behind the Aider pair-programming tool. It tests an LLM on 225 of Exercism's hardest exercises (chosen because few of the models Aider tested could solve them) across C++, Go, Java, JavaScript, Python, and Rust.

Unlike isolated code-generation tests, Polyglot scores the model inside Aider's real edit loop: the model must emit changes in a structured edit format (diff, diff-fenced, whole, or architect) and gets a second attempt with the failing unit-test output if the first attempt fails.
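The two-attempt loop described above can be sketched as follows. This is a minimal illustration, not Aider's actual harness: `solve_with_retry`, `propose_edit`, `run_tests`, and the toy stand-ins are all hypothetical names introduced here, and the real benchmark applies structured edits to files rather than passing strings around.

```python
from typing import Callable, Optional, Tuple


def solve_with_retry(
    propose_edit: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    task: str,
    max_attempts: int = 2,
) -> int:
    """Return the 1-based attempt on which the tests passed, or 0 if none did."""
    feedback = None  # no test output is available before the first attempt
    for attempt in range(1, max_attempts + 1):
        edit = propose_edit(task, feedback)
        passed, output = run_tests(edit)
        if passed:
            return attempt
        feedback = output  # failing test output is fed into the next attempt
    return 0


# Toy stand-ins (hypothetical): the "model" only succeeds once it sees feedback.
def toy_model(task: str, feedback: Optional[str]) -> str:
    return "fixed" if feedback else "buggy"


def toy_tests(edit: str) -> Tuple[bool, str]:
    if edit == "fixed":
        return True, ""
    return False, "AssertionError: expected 3, got 2"


print(solve_with_retry(toy_model, toy_tests, "book-store"))  # 2
```

An exercise counts toward the headline score whether this sketch returns 1 or 2; only a return value of 0 is a miss.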

Because it measures instruction-following and reliable file editing rather than raw synthesis, the leaderboard is a strong practical signal for choosing a model for an autonomous or pair-programming coding assistant, and it reports cost per run alongside accuracy.

Numbers come from the official Aider polyglot leaderboard; frontier models released after Aider's last run may be missing until Aider re-runs the benchmark with them.

Aider also has an older single-language code-editing leaderboard; this page tracks only the modern Polyglot benchmark, so do not mix scores from the two.

Architect-mode and planner+editor rows are system results, not single-model numbers; rows with a thinking-token or reasoning-effort label are configuration-specific.
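One rough way to separate these row types before comparing scores is to inspect the label text. The sketch below is a heuristic that assumes the labeling pattern seen in the table on this page (a " + " joining planner and editor, a trailing parenthesized setting for configuration-specific rows); `classify_row` and its category names are hypothetical, not part of Aider.

```python
import re


def classify_row(label: str) -> str:
    """Heuristically classify a leaderboard label (an assumption, not an Aider rule)."""
    if " + " in label:
        # e.g. "o3 (high) + gpt-4.1": a planner paired with a separate editor
        return "system"
    if re.search(r"\([^)]*\)\s*$", label):
        # e.g. "gpt-5 (high)": a thinking budget or reasoning effort is pinned
        return "configuration-specific"
    return "unlabeled"


for label in ["o3 (high) + gpt-4.1", "gpt-5 (high)", "o3"]:
    print(label, "->", classify_row(label))
```

Labels that follow a different convention would need their own rule, so treat the output as a prompt to read the label rather than a verdict.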

Example tasks

Three public tasks quoted from benchmark sources:

"To try and encourage more sales of different books from a popular 5 book series, a bookshop has decided to offer discounts on multiple book purchases." Citation: Aider polyglot benchmark, book-store exercise
"Given students' names along with the grade that they are in, create a roster for the school." Citation: Aider polyglot benchmark, grade-school exercise
"Pick the best hand(s) from a list of poker hands." Citation: Aider polyglot benchmark, poker exercise

Methodology

The primary metric is percent correct (pass_rate_2): the share of the 225 exercises where all hidden unit tests pass within the model's two attempts, so first-attempt passes count as well.
A secondary metric, percent using correct edit format, reports how often the model emitted edits Aider could apply without retry; low edit-format compliance drags down effective accuracy.
Each model runs with Aider's standard prompting and a per-model edit format; some rows fix a thinking-token budget or reasoning effort, which the leaderboard records in the model label.
Architect-mode rows pair a planner model with a separate editor model, so they are system results rather than single-model numbers; read the model label before comparing.
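The two metrics above reduce to simple ratios over per-exercise outcomes. The sketch below assumes a hypothetical record shape (`passed_on` is the attempt that passed, or 0 for a miss; `well_formed` marks whether the edits followed the required format); only the name pass_rate_2 comes from this page, and pass_rate_1 is named by analogy.

```python
def summarize(results):
    """Compute percentage metrics from per-exercise records (hypothetical shape)."""
    total = len(results)
    solved_first = sum(1 for r in results if r["passed_on"] == 1)
    solved_within_two = sum(1 for r in results if r["passed_on"] in (1, 2))
    well_formed = sum(1 for r in results if r["well_formed"])
    return {
        "pass_rate_1": round(100 * solved_first / total, 1),
        "pass_rate_2": round(100 * solved_within_two / total, 1),
        "percent_well_formed": round(100 * well_formed / total, 1),
    }


# Made-up outcomes for five exercises, for illustration only.
sample = [
    {"passed_on": 1, "well_formed": True},
    {"passed_on": 2, "well_formed": True},
    {"passed_on": 0, "well_formed": False},
    {"passed_on": 1, "well_formed": True},
    {"passed_on": 2, "well_formed": True},
]
print(summarize(sample))
# {'pass_rate_1': 40.0, 'pass_rate_2': 80.0, 'percent_well_formed': 80.0}
```

On the full benchmark the denominator is 225, so each solved exercise moves the score by about 0.44 percentage points.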

Related benchmarks

Compare this benchmark with related pages from the hub:

swe-bench-verified, tau-bench, agentbench

Frequently asked questions

Which system is currently best on Aider?

gpt-5 (high) currently leads with a tracked score of 88.0%, based on our latest tracked results (last updated Jun 8, 2026). This page is model-focused, so rankings mostly reflect model capability under the reported harness.

What should I read into an Aider score?

Aider scores are most useful for ranking models within this benchmark, not for comparison against scores from other benchmarks. Read each row's label and Source link to understand the setup, and consult the methodology section before making procurement or architecture decisions.