About this benchmark

SWE-bench Verified is the 500-instance human-reviewed split of SWE-bench, built from real GitHub issues in popular Python repositories. Agents receive an issue and repository state, then generate a patch.

It became the standard public signal for autonomous coding agents because scoring uses actual test execution rather than preference judgments or synthetic unit tests.

The benchmark is now mature and heavily exposed in public training data. Recent audits argue that top frontier scores should be interpreted with contamination and test-design caveats, especially when comparing very high-scoring systems.

In short, the benchmark is strong at measuring public issue-resolution workflows, but weaker as a frontier-only signal once scores approach saturation or contamination dominates.

Example tasks

Three public tasks quoted from benchmark sources:

Methodology

  • Metric is % Resolved: the share of instances where the generated patch passes the benchmark tests after being applied in the evaluation harness.
  • SWE-bench uses containerized execution to improve reproducibility, though environment details, tool permissions, time limits, and scaffold design still matter.
  • Verified was curated by expert review from the larger SWE-bench set, but later audits found that some flawed or underspecified tests remain, which matters most when comparing systems at high performance levels.
  • We retain Verified because it is widely reported, and we link to source notes so readers can distinguish official leaderboard entries from launch-post claims.
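
As a rough illustration of the % Resolved metric described above, the sketch below computes the share of resolved instances from per-instance outcomes. The `InstanceResult` type, the `percent_resolved` function, and the instance IDs are hypothetical names chosen for this example; they are not part of the SWE-bench harness. It also assumes that instances with no submitted patch count as unresolved, so the denominator is the full split rather than only the attempted instances.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class InstanceResult:
    """Outcome for one benchmark instance (hypothetical structure)."""
    instance_id: str  # illustrative ID, not a real benchmark instance
    resolved: bool    # True if the patched repository passed the benchmark tests


def percent_resolved(results: Iterable[InstanceResult],
                     total_instances: Optional[int] = None) -> float:
    """Return the percentage of instances that were resolved.

    Assumption: instances missing from `results` (for example, no patch was
    produced) count as unresolved, so `total_instances` is the denominator
    when it is given.
    """
    results = list(results)
    denominator = total_instances if total_instances is not None else len(results)
    if denominator <= 0:
        raise ValueError("total number of instances must be positive")
    if len(results) > denominator:
        raise ValueError("more results than instances in the split")
    resolved_count = sum(1 for r in results if r.resolved)
    return 100.0 * resolved_count / denominator


# Toy example: 4 instances in the split, 3 attempted, 2 resolved.
example_results = [
    InstanceResult("repo-a-001", True),
    InstanceResult("repo-a-002", False),
    InstanceResult("repo-b-001", True),
]
example_score = percent_resolved(example_results, total_instances=4)
```

On a 500-instance split, each instance moves this figure by 0.2 percentage points, so the choice of denominator (full split versus attempted instances only) is worth checking in a row's notes before comparing scores.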

Frequently asked questions

Which system is currently best on SWE-bench Verified?
Claude Mythos currently leads with a tracked score of 93.9%, based on our latest tracked results (last updated May 28, 2026). This page is model-focused, so rankings mostly reflect model capability under the reported harness.
What should I read into a SWE-bench Verified score?
SWE-bench Verified scores are most useful for ranking systems within this benchmark. Read the Notes column to understand each entry's setup, and consult the Methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.