Canonical benchmark page

SWE-bench Verified leaderboard

Benchmark page for SWE-bench Verified with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-03-22

About this benchmark

SWE-bench Verified evaluates software engineering performance on real GitHub issues with stricter quality controls than the broader SWE-bench set.

This benchmark helps teams estimate bug-fixing and code-edit reliability in realistic repository contexts.

Compared with browsing benchmarks, this page leans model-centric, though harness details, agent wrappers, and evaluation policy can still influence observed scores.
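
For orientation, the underlying task instances can be inspected directly. Below is a minimal sketch, assuming the Hugging Face datasets library and the public princeton-nlp/SWE-bench_Verified dataset id; field names follow the published schema but should be checked against the dataset card before relying on them.

```python
# Minimal sketch: inspect one SWE-bench Verified task instance.
# Assumes the public dataset id below; verify field names on the dataset card.
from datasets import load_dataset

verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

example = verified[0]
print(example["instance_id"])              # repository + issue identifier
print(example["repo"])                     # source GitHub repository
print(example["problem_statement"][:300])  # truncated issue text
print(example["FAIL_TO_PASS"])             # tests a fix must turn green
```

Each instance pairs a real issue with the tests used to judge whether a generated patch resolves it.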

Methodology

  • Results typically report issue-resolution success rates (resolve rates) on the verified subset; a minimal calculation sketch follows this list.
  • Leaderboard sources can differ by harness, timeout policy, or tool permissions.
  • Treat rows as model-focused directional signals unless source methodology is fully matched.
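
To make the headline metric concrete, here is a minimal sketch of how a resolve rate is computed from per-instance outcomes. The results mapping and instance ids are hypothetical placeholders, not official harness output.

```python
# Hypothetical per-instance outcomes: did the model's patch make the
# FAIL_TO_PASS tests pass without breaking the PASS_TO_PASS tests?
results = {
    "astropy__astropy-12907": True,   # illustrative instance ids
    "django__django-11099": False,
    "sympy__sympy-20590": True,
}

resolved = sum(results.values())
resolve_rate = 100.0 * resolved / len(results)
print(f"Resolved {resolved}/{len(results)} instances ({resolve_rate:.1f}%)")
```

Reported scores are typically this percentage over the full verified set, which is why differences in timeout policy or tool permissions under a given harness shift the number directly.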

Links

SWE-bench Verified

Model scope
Rank | System / Submission | Score | Organization | Notes | Source
1 | Claude Mythos | 93.9% | Anthropic | Utilizes Mythos reasoning loops to reach near-human resolution on verified tasks. | Source
2 | Claude Opus 4.7 | 87.6% | Anthropic | Anthropic's April 2026 frontier release; optimized for long-context codebase understanding. | Source
3 | Claude Opus 4.5 | 80.9% | Anthropic | Self-reported on the official leaderboard; high-throughput frontier model. | Source
4 | Claude Opus 4.6 | 80.8% | Anthropic | Self-reported by Anthropic; near-parity with Opus 4.5. | Source
5 | DeepSeek-V4-Pro-Max | 80.6% | DeepSeek | Large-scale MoE model with specialized coding reinforcement learning. | Source
5 | Gemini 3.1 Pro | 80.6% | Google DeepMind | Self-reported by Google DeepMind at the Gemini 3.1 Pro launch, February 2026. | Source
7 | Kimi K2.6 | 80.2% | Moonshot AI | Advanced reasoning model with integrated terminal and editor tools. | Source
7 | MiniMax M2.5 | 80.2% | MiniMax | Leading open-weight model on the official leaderboard. | Source
9 | GPT-5.2 | 80.0% | OpenAI | Self-reported by OpenAI on the official leaderboard. | Source
10 | Claude Sonnet 4.6 | 79.6% | Anthropic | Self-reported; high efficiency with frontier-class coding performance. | Source
11 | DeepSeek-V4-Flash-Max | 79.0% | DeepSeek | SWE-bench Verified resolve rate; reported on huggingface.co. | Source
12 | Qwen3.6 Plus | 78.8% | Alibaba Cloud / Qwen Team | SWE-bench Verified resolve rate; reported on qwen.ai. | Source
13 | Gemini 3 Flash | 78.0% | Google DeepMind | SWE-bench Verified resolve rate; reported on blog.google. | Source
13 | MiMo-V2-Pro | 78.0% | Xiaomi | SWE-bench Verified resolve rate; reported on mimo.xiaomi.com. | Source
15 | GLM-5 | 77.8% | Zhipu AI | SWE-bench Verified resolve rate; reported on docs.z.ai. | Source

Related benchmarks

Compare this benchmark with related pages from the benchmark hub.


Frequently asked questions

Which system is currently best on SWE-bench Verified?
Claude Mythos currently leads with a tracked score of 93.9%. This page is model-focused, so rankings mostly reflect model capability under the reported harness. Based on our latest tracked results, last updated Mar 22, 2026.
What should I read into a SWE-bench Verified score?
SWE-bench Verified scores are most useful for within-benchmark ranking. Read the Notes column for setup context, and consult the Methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and others are team-reported. Use each source link and the Notes field to check the evidence level before drawing strong conclusions.