Coding agent benchmark - Public

SWE-bench Lite

300-task curated subset of SWE-bench focusing on self-contained issues. Designed for faster, cheaper evaluation while remaining representative of the full benchmark.


Benchmark by Princeton

Benchmark type: Public benchmark
Benchmark domain: Coding agent
Task count: 300
Evaluation method: Test suite

Top model score: ~55%
Human score: N/A


About this benchmark

SWE-bench Lite is a lightweight subset of the SWE-bench benchmark, created by Princeton NLP to make evaluating language models on real-world software engineering tasks more accessible. The full SWE-bench dataset contains 2,294 GitHub issues across 12 popular Python repositories; Lite selects 300 of them that are less resource-intensive to evaluate while still testing meaningful code generation capabilities. Each task presents a model with a codebase and an issue description and requires a patch that resolves the described problem. Tasks in the full benchmark often involve changes across multiple functions, classes, and files, whereas Lite favors more self-contained fixes.

Evaluation uses the same Docker-based containerized harness as the full SWE-bench: each generated patch is applied to the repository, and the repository's test suite is run to verify correctness. When the full SWE-bench was introduced in the ICLR 2024 paper, Claude 2 resolved only 1.96% of tasks on the full set. SWE-bench Lite serves as a cost-effective proxy for the full benchmark, enabling faster iteration during agent development.
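The scoring logic behind a test-suite evaluation like the one described above can be sketched in a few lines. This is a minimal illustration, not the harness's own code: the TaskResult class and the is_resolved and resolved_rate functions are hypothetical names introduced here, and the sketch assumes per-test pass/fail outcomes have already been collected. It mirrors the two test groups SWE-bench records per task: tests the fix should make pass (FAIL_TO_PASS) and tests that must keep passing (PASS_TO_PASS).

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; not the harness's own report schema."""
    instance_id: str
    fail_to_pass: dict[str, bool]  # test name -> passed after the patch?
    pass_to_pass: dict[str, bool]  # test name -> still passing after the patch?


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every test in both groups passes."""
    # Require at least one target test so an empty record cannot count as solved.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolved_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved, the headline score reported for a run."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Made-up outcomes for three tasks, purely for illustration.
results = [
    TaskResult("task-a", {"test_fix": True}, {"test_old": True}),   # resolved
    TaskResult("task-b", {"test_fix": False}, {"test_old": True}),  # fix did not work
    TaskResult("task-c", {"test_fix": True}, {"test_old": False}),  # introduced a regression
]

print(f"Resolved: {resolved_rate(results):.1f}%")
```

Under this all-or-nothing rule, a patch that fixes the issue but breaks an existing test does not count, which is one reason resolved rates tend to be lower than a raw count of passing tests would suggest.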

SWE-bench Lite remains widely used as a quick evaluation target for coding agent research and development. It is available under the MIT license on GitHub and HuggingFace, with the same tooling and evaluation infrastructure as the full SWE-bench.

Where this benchmark fits

Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the coding view. You can also jump straight to this benchmark in the master registry list.