SWE-bench Verified leaderboard
Last updated: 2026-03-22
About this benchmark
SWE-bench Verified evaluates software engineering performance on real GitHub issues with stricter quality controls than the broader SWE-bench set.
This benchmark helps teams estimate bug-fixing and code-edit reliability in realistic repository contexts.
Compared with browsing benchmarks, this page is more model-centric, though harness details, agent wrappers, and evaluation policy can still influence observed scores.
Methodology
- Results typically report issue-resolution success rates on the verified subset (the resolve-rate arithmetic is sketched after this list).
- Leaderboard sources can differ by harness, timeout policy, or tool permissions.
- Treat rows as model-focused directional signals unless source methodology is fully matched.
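For reference, the headline score is a simple resolve rate: resolved instances divided by total verified instances. Below is a minimal Python sketch of that arithmetic; the `results` structure and its field names are hypothetical, for illustration only, and do not reflect any harness's actual output format.

```python
# Hypothetical per-instance results; field names are illustrative.
results = [
    {"instance_id": "repo__issue-1", "resolved": True},
    {"instance_id": "repo__issue-2", "resolved": False},
    {"instance_id": "repo__issue-3", "resolved": True},
]

# Headline score = resolved instances / total verified instances.
resolved = sum(1 for r in results if r["resolved"])
score = resolved / len(results)
print(f"Resolve rate: {score:.1%}")  # -> Resolve rate: 66.7%
```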
Links
SWE-bench Verified
Model scope

| Rank | System / Submission | Score | Organization | Notes | Source |
|---|---|---|---|---|---|
| 1 | Claude Mythos | 93.9% | Anthropic | Utilizes Mythos reasoning loops to reach near-human resolution on verified tasks. | Source |
| 2 | Claude Opus 4.7 | 87.6% | Anthropic | Anthropic's April 2026 frontier release; optimized for long-context codebase understanding. | Source |
| 3 | Claude Opus 4.5 | 80.9% | Anthropic | Self-reported on the official leaderboard; high-throughput frontier model. | Source |
| 4 | Claude Opus 4.6 | 80.8% | Anthropic | Self-reported by Anthropic; near-parity with Opus 4.5. | Source |
| 5 | DeepSeek-V4-Pro-Max | 80.6% | DeepSeek | Large-scale MoE model with specialized coding reinforcement learning. | Source |
| 5 | Gemini 3.1 Pro | 80.6% | Google DeepMind | Self-reported by Google DeepMind at Gemini 3.1 Pro launch, February 2026. | Source |
| 7 | Kimi K2.6 | 80.2% | Moonshot AI | Advanced reasoning model with integrated terminal and editor tools. | Source |
| 7 | MiniMax M2.5 | 80.2% | MiniMax | Leading open-weight model on the official leaderboard. | Source |
| 9 | GPT-5.2 | 80.0% | OpenAI | Self-reported by OpenAI on the official leaderboard. | Source |
| 10 | Claude Sonnet 4.6 | 79.6% | Anthropic | Self-reported; high efficiency with frontier-class coding performance. | Source |
| 11 | DeepSeek-V4-Flash-Max | 79.0% | DeepSeek | SWE-bench Verified resolve rate; reported on huggingface.co. | Source |
| 12 | Qwen3.6 Plus | 78.8% | Alibaba Cloud / Qwen Team | SWE-bench Verified resolve rate; reported on qwen.ai. | Source |
| 13 | Gemini 3 Flash | 78.0% | Google DeepMind | SWE-bench Verified resolve rate; reported on blog.google. | Source |
| 13 | MiMo-V2-Pro | 78.0% | Xiaomi | SWE-bench Verified resolve rate; reported on mimo.xiaomi.com. | Source |
| 15 | GLM-5 | 77.8% | Zhipu AI | SWE-bench Verified resolve rate; reported on docs.z.ai. | Source |
Frequently asked questions
Which system is currently best on SWE-bench Verified?
Claude Mythos currently leads with a tracked score of 93.9%. This page is model-focused, so rankings mostly reflect model capability under the reported harness. Figures reflect our latest tracked results, last updated Mar 22, 2026.
What should I read into a SWE-bench Verified score?
SWE-bench Verified scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.
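One practical way to apply this is to tag rows by evidence level before ranking them. Below is a minimal Python sketch; the row structure and the `evidence` labels are hypothetical, for illustration only, and do not mirror this page's actual data format.

```python
# Hypothetical leaderboard rows; "evidence" values are illustrative labels.
rows = [
    {"system": "Model A", "score": 80.9, "evidence": "self-reported"},
    {"system": "Model B", "score": 80.2, "evidence": "independent"},
]

# Keep only independently benchmarked rows for stricter comparisons,
# then rank them by score, highest first.
independent = [r for r in rows if r["evidence"] == "independent"]
for r in sorted(independent, key=lambda r: r["score"], reverse=True):
    print(f'{r["system"]}: {r["score"]}%')
```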