Leaderboard
| System / Submission | Score | Organization | Reported | Source |
|---|---|---|---|---|
| Claude Mythos: Utilizes Mythos reasoning loops to reach near-human resolution on verified tasks. | 93.9% | Anthropic | Source | |
| Claude Opus 4.8 (new): Anthropic's May 2026 frontier release; standard configuration with thinking blocks included. Self-reported in the Opus 4.8 system card. | 88.6% | Anthropic | Source | |
| Claude Opus 4.7: Anthropic's April 2026 frontier release; optimized for long-context codebase understanding. | 87.6% | Anthropic | Source | |
| Claude Opus 4.5: Self-reported on the official leaderboard; high-throughput frontier model. | 80.9% | Anthropic | Source | |
| Claude Opus 4.6: Self-reported by Anthropic; near-parity with Opus 4.5. | 80.8% | Anthropic | Source | |
| DeepSeek-V4-Pro-Max (new): Large-scale MoE model with specialized coding reinforcement learning. | 80.6% | DeepSeek | Source | |
| Gemini 3.1 Pro: Self-reported by Google DeepMind at Gemini 3.1 Pro launch, February 2026. | 80.6% | Google DeepMind | Source | |
| Kimi K2.6 (new): Advanced reasoning model with integrated terminal and editor tools. | 80.2% | Moonshot AI | Source | |
| MiniMax M2.5: Leading open-weight model on the official leaderboard. | 80.2% | MiniMax | Source | |
| GPT-5.2: Self-reported by OpenAI on the official leaderboard. | 80.0% | OpenAI | Source | |
| Claude Sonnet 4.6: Self-reported; high efficiency with frontier-class coding performance. | 79.6% | Anthropic | Source | |
| DeepSeek-V4-Flash-Max: SWE-bench Verified resolve rate; reported on huggingface.co. | 79.0% | DeepSeek | Source | |
| Qwen3.6 Plus: SWE-bench Verified resolve rate; reported on qwen.ai. | 78.8% | Alibaba Cloud / Qwen Team | Source | |
| Gemini 3 Flash: SWE-bench Verified resolve rate; reported on blog.google. | 78.0% | Google DeepMind | Source | |
| MiMo-V2-Pro: SWE-bench Verified resolve rate; reported on mimo.xiaomi.com. | 78.0% | Xiaomi | Source | |
| GLM-5: SWE-bench Verified resolve rate; reported on docs.z.ai. | 77.8% | Zhipu AI | Source | |
About this benchmark
SWE-bench Verified is the 500-instance human-reviewed split of SWE-bench, built from real GitHub issues in popular Python repositories. Agents receive an issue and repository state, then generate a patch.
It became the standard public signal for autonomous coding agents because scoring uses actual test execution rather than preference judgments or synthetic unit tests.
The benchmark is now mature and heavily exposed in public training data. Recent audits argue that top frontier scores should be interpreted with contamination and test-design caveats, especially when comparing very high-scoring systems.
The benchmark is strong at measuring public issue-resolution workflows, but weaker as a frontier-only signal once scores approach saturation or contamination dominates.
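As a rough illustration of why small gaps near the top deserve caution, the sketch below converts a few reported percentages into implied instance counts on the 500-instance split. The helper name `implied_resolved` is hypothetical, and the conversion assumes each score is a single-run resolve rate over all 500 instances, which the source pages may not guarantee.

```python
N_INSTANCES = 500  # size of the Verified split, as stated above


def implied_resolved(score_pct: float, n_instances: int = N_INSTANCES) -> float:
    """Convert a reported % Resolved into an implied number of resolved instances."""
    # Rounded to one decimal to hide floating-point noise, not to force a whole count.
    return round(score_pct * n_instances / 100, 1)


# Scores copied from the leaderboard table above.
scores = {"Gemini 3.1 Pro": 80.6, "Kimi K2.6": 80.2, "GPT-5.2": 80.0}

for name, pct in scores.items():
    print(f"{name}: {pct}% ~ {implied_resolved(pct)} of {N_INSTANCES} instances")

# Gap between two adjacent leaderboard scores, expressed in instances.
gap = implied_resolved(80.6) - implied_resolved(80.2)
print(f"80.6% vs 80.2% differs by {gap} instances")
```

Under that assumption, the 0.4-point gap between 80.6% and 80.2% corresponds to two instances, so a small number of flawed or underspecified tests could reorder adjacent entries. A score such as 80.9% maps to 404.5, which is not a whole number of instances; this may indicate averaging over several runs or a different denominator, so check the linked source before comparing.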
Example tasks
Three public tasks quoted from benchmark sources:
- "Subclassed SkyCoord gives misleading attribute access message" Citation: SWE-bench Verified dataset
- "Please support header rows in RestructuredText output" Citation: SWE-bench Verified dataset
- "IndexError: tuple index out of range in identify_format (io.registry)" Citation: SWE-bench Verified dataset
Methodology
- Metric is % Resolved: the share of instances where the generated patch passes the benchmark tests after being applied in the evaluation harness.
- SWE-bench uses containerized execution to improve reproducibility, though environment details, tool permissions, time limits, and scaffold design still matter.
- Verified was curated by expert review from the larger SWE-bench set, but later audits found that some flawed or underspecified tests remain, which matters most when comparing systems at high performance levels.
- We retain Verified because it is widely reported, while linking to source notes so readers can distinguish official leaderboard entries from launch-post claims.
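The % Resolved metric described above can be sketched as follows. This is a minimal illustration, not the official harness: the `InstanceResult` structure, the function names, and the instance IDs are hypothetical. It assumes each instance's tests fall into the two groups the SWE-bench dataset labels `FAIL_TO_PASS` (tests the patch should fix) and `PASS_TO_PASS` (tests that must keep passing), and that instances with no submitted result count as unresolved.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceResult:
    """Per-instance test outcomes after the patch is applied (hypothetical structure)."""
    instance_id: str
    fail_to_pass: dict[str, bool] = field(default_factory=dict)  # test name -> passed
    pass_to_pass: dict[str, bool] = field(default_factory=dict)  # test name -> passed


def is_resolved(result: InstanceResult) -> bool:
    # Resolved only if every target test now passes and no previously passing test broke.
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def percent_resolved(results: list[InstanceResult], total_instances: int) -> float:
    # Assumption: missing instances count as unresolved, so the denominator
    # is the full benchmark size rather than the number of submitted results.
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / total_instances


# Toy data with made-up instance IDs and test names.
results = [
    InstanceResult("demo__repo-1", {"test_fix": True}, {"test_old": True}),
    InstanceResult("demo__repo-2", {"test_fix": True}, {"test_old": False}),
    InstanceResult("demo__repo-3", {"test_fix": False}, {"test_old": True}),
]
print(percent_resolved(results, total_instances=4))  # 25.0
```

Only the first toy instance counts as resolved: the second breaks a previously passing test and the third fails its target test. Using the full benchmark size as the denominator keeps partial submissions from inflating the score, though individual submissions may handle skipped instances differently.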