SWE-bench Verified leaderboard

Benchmark page for SWE-bench Verified with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-03-22

About this benchmark

SWE-bench Verified evaluates software engineering performance on real GitHub issues with stricter quality controls than the broader SWE-bench set.

This benchmark helps teams estimate bug-fixing and code-edit reliability in realistic repository contexts.

Compared with browsing benchmarks, this page is more model-centric, though harness details, agent wrappers, and evaluation policy can still influence observed scores.

Methodology

Results typically report issue-resolution success rates on the verified subset.
Leaderboard sources can differ by harness, timeout policy, or tool permissions.
Treat rows as directional signals of model capability unless the methodologies behind the compared sources match exactly.
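
As a rough illustration of the points above, the sketch below computes a resolve rate from per-issue outcomes and checks whether two rows share the same setup before they are compared directly. The `Methodology` fields, the function names, and the sample values are assumptions made for this example; they are not taken from any particular harness or leaderboard source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Methodology:
    """Hypothetical record of the setup behind one leaderboard row."""
    harness: str
    timeout_seconds: int
    tools_allowed: frozenset[str]


def resolve_rate(outcomes: dict[str, bool]) -> float:
    """Return the percentage of task instances marked as resolved."""
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    return 100.0 * sum(outcomes.values()) / len(outcomes)


def directly_comparable(a: Methodology, b: Methodology) -> bool:
    """Two rows are treated as comparable only if every setup field matches."""
    return a == b


# Illustrative data only: four made-up task IDs, three of them resolved.
outcomes = {"task-1": True, "task-2": True, "task-3": True, "task-4": False}
print(resolve_rate(outcomes))

run_a = Methodology("harness-a", 1800, frozenset({"bash", "editor"}))
run_b = Methodology("harness-a", 3600, frozenset({"bash", "editor"}))
print(directly_comparable(run_a, run_b))
```

When the setups differ, as in the timeout mismatch here, a gap of a few tenths of a point between two rows is better read as noise than as a ranking.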

SWE-bench Verified

Rank	System / Submission	Score	Organization	Notes	Source
1	Claude Mythos (new)	93.9%	Anthropic	Utilizes Mythos reasoning loops to reach near-human resolution on verified tasks.	Source
2	Claude Opus 4.7 (new)	87.6%	Anthropic	Anthropic's April 2026 frontier release; optimized for long-context codebase understanding.	Source
3	Claude Opus 4.5	80.9%	Anthropic	Self-reported on the official leaderboard; high-throughput frontier model.	Source
4	Claude Opus 4.6	80.8%	Anthropic	Self-reported by Anthropic; near-parity with Opus 4.5.	Source
5	DeepSeek-V4-Pro-Max (new)	80.6%	DeepSeek	Large-scale MoE model with specialized coding reinforcement learning.	Source
5	Gemini 3.1 Pro (new)	80.6%	Google DeepMind	Self-reported by Google DeepMind at Gemini 3.1 Pro launch, February 2026.	Source
7	Kimi K2.6 (new)	80.2%	Moonshot AI	Advanced reasoning model with integrated terminal and editor tools.	Source
7	MiniMax M2.5	80.2%	MiniMax	Leading open-weight model on the official leaderboard.	Source
9	GPT-5.2	80.0%	OpenAI	Self-reported by OpenAI on the official leaderboard.	Source
10	Claude Sonnet 4.6	79.6%	Anthropic	Self-reported; high efficiency with frontier-class coding performance.	Source
11	DeepSeek-V4-Flash-Max	79.0%	DeepSeek	SWE-bench Verified resolve rate; reported on huggingface.co.	Source
12	Qwen3.6 Plus	78.8%	Alibaba Cloud / Qwen Team	SWE-bench Verified resolve rate; reported on qwen.ai.	Source
13	Gemini 3 Flash	78.0%	Google DeepMind	SWE-bench Verified resolve rate; reported on blog.google.	Source
13	MiMo-V2-Pro	78.0%	Xiaomi	SWE-bench Verified resolve rate; reported on mimo.xiaomi.com.	Source
15	GLM-5	77.8%	Zhipu AI	SWE-bench Verified resolve rate; reported on docs.z.ai.	Source
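
The Rank column appears to follow standard competition ranking: tied scores share a rank, and the following rank is skipped (5, 5, 7). A minimal sketch of that rule is below, assuming the scores are already sorted from highest to lowest; the function name is hypothetical and the page's actual ranking logic is not documented here.

```python
def competition_ranks(sorted_scores: list[float]) -> list[int]:
    """Assign ranks to scores sorted in descending order.

    Tied scores share the rank of the first tied entry, and the
    positions they occupy are skipped for the next distinct score.
    """
    ranks: list[int] = []
    for position, score in enumerate(sorted_scores, start=1):
        if ranks and score == sorted_scores[position - 2]:
            ranks.append(ranks[-1])  # same score as the previous row
        else:
            ranks.append(position)   # new score: rank equals position
    return ranks


# Scores copied from the table above, in table order.
scores = [93.9, 87.6, 80.9, 80.8, 80.6, 80.6, 80.2, 80.2,
          80.0, 79.6, 79.0, 78.8, 78.0, 78.0, 77.8]
print(competition_ranks(scores))
```

Applied to the table's scores, this reproduces the ranks shown, including the shared ranks at 5, 7, and 13.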

Related benchmarks

Compare this benchmark with related pages from the hub:

WebVoyager, BrowseComp

Frequently asked questions

Which system is currently best on SWE-bench Verified?

Claude Mythos currently leads with a tracked score of 93.9%. This page is model-focused, so rankings mostly reflect model capability under the reported harness. This answer is based on our latest tracked results, last updated 2026-03-22.

What should I read into a SWE-bench Verified score?

SWE-bench Verified scores are most useful for ranking systems within this benchmark. Read the Notes column to understand each row's setup, and consult the Methodology section before making procurement or architecture decisions.

Are these results independently verified?

Not always. Some rows are independently benchmarked and some are team-reported. Use each row's source link and Notes field to check the evidence level before drawing strong conclusions.