Steel.dev®

AI Agent Benchmark Registry

Explore an AI agent eval registry and benchmark leaderboard covering web navigation, coding, desktop control, tool use, deep research, and general reasoning. Compare evaluation suites, tests, frameworks, tasks, evaluators, top scores, and benchmark scope in one place.

How to read this registry

Compare results only when task scope and evaluation method are reasonably comparable. Reproducible suites like WebArena are easier to rerun, while live-web evals like WebVoyager better capture production drift. Start with the category routes for web navigation, coding, and tool use before comparing leaderboard numbers across very different evaluation suites. If you want a single place to browse reported scores across many benchmarks, jump to the Benchmark Index.

REGISTRY

Coding agent benchmarks

Coding agent benchmarks should look less like puzzle sets and more like real software work. The strongest ones test repository understanding, file edits, tool use, debugging, and whether changes satisfy executable checks without breaking surrounding code. When you compare coding-agent evals, task realism and grading quality matter more than flashy single-shot scores. For most teams, SWE-bench Verified is the obvious starting point, while Terminal-Bench is useful if command-line workflows are part of the product. If your agents spend as much time calling tools as editing code, compare them with the tool-use benchmarks too.
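
To make "satisfy executable checks" concrete, the sketch below grades a coding-agent submission the way SWE-bench-style harnesses typically do: apply the agent's patch to a repo checkout, then run the issue's tests. The function names, paths, and pytest invocation are illustrative assumptions, not the actual harness of any benchmark listed here.

# Minimal sketch of an executable-check grader for coding agents.
# Assumptions: the task repo is a local git checkout, the agent's output is a
# unified diff, and the task ships a list of pytest test IDs to run.
import subprocess
from pathlib import Path

def apply_patch(repo_dir: Path, patch_file: Path) -> bool:
    # A patch that does not apply cleanly counts as a failed submission.
    result = subprocess.run(
        ["git", "apply", str(patch_file)],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0

def run_tests(repo_dir: Path, test_ids: list[str]) -> bool:
    # Executable grading: the fail-to-pass tests and the regression tests must all pass.
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_ids],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0

def grade_submission(repo_dir: Path, patch_file: Path, test_ids: list[str]) -> bool:
    # "Resolved" means the patch applies and every executable check passes.
    return apply_patch(repo_dir, patch_file) and run_tests(repo_dir, test_ids)

Real harnesses add containerized per-task environments, timeouts, and score aggregation, but the pass/fail decision reduces to roughly this shape.
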
SWE-bench Verified
Coding agent benchmark - Public
Benchmark by Princeton

500 human-verified GitHub issues from real-world Python repos. Expert-curated to remove ambiguous or unreliable tasks. The most trusted coding agent benchmark: success means resolving a real bug in a real repo.

Top Model Score: ~72% (OpenAI o3)
Human Score: ~94%
SWE-bench Lite
Coding agent benchmark - Public
Benchmark by Princeton

300-task curated subset of SWE-bench focusing on self-contained issues. Designed for faster, cheaper evaluation while remaining representative of the full benchmark.

Top Model Score: ~55% (OpenAI o3)
Human Score: N/A
Terminal-Bench
Coding agent benchmark - Public
Benchmark by Harbor

Purely terminal-based coding and system tasks with no GUI. Tests command-line proficiency across bash, Python, and system administration. Harder and more realistic than sandboxed coding benchmarks.

Top Model Score: ~45% (Claude 3.7 Sonnet)
Human Score: N/A
MLE-bench
Coding agent benchmark - Public
Benchmark by OpenAI

75 Kaggle competitions used to evaluate ML engineering agents. Agents must write, run, and iterate on ML pipelines to achieve competitive leaderboard scores.

Top Model Score: ~17% medal rate (AIDE, Claude 3.5)
Human Score: N/A
SciCode
Coding agent benchmark - Public
Benchmark by SciCode

338 scientific coding subproblems from 80 research problems across math, physics, chemistry, biology, and materials science. Tests research-grade code generation against expert-written tests.

Top Model Score: ~26% (Claude 3.5 Sonnet)
Human Score: ~81%
CVE-Bench
Coding agent benchmark - Self-hosted
Benchmark by UIUC

Evaluates agents on real-world cybersecurity vulnerability exploitation. Agents are scored on successfully exploiting CVEs from public vulnerability databases in sandboxed environments.

Top Model Score: ~47% (o3, high)
Human Score: N/A
HumanEval+
Coding agent benchmark - Public
Benchmark by EvalPlus

Enhanced version of OpenAI's HumanEval with 80x more test cases per problem to reduce false positives. Tests Python code generation against significantly stricter test coverage.

Top Model Score: ~99% (o3 / Claude 3.7)
Human Score: N/A
Aider Code Editing Benchmark
Coding agent benchmark - Public
Benchmark by Aider

LLM code editing benchmark using real open-source repos. Measures the ability to apply targeted code changes from natural language instructions without breaking existing tests.

Top Model Score: ~79% (o3)
Human Score: N/A
InterCode
Coding agent benchmark - Self-hosted
Benchmark by Princeton

Interactive coding benchmark using bash and SQL environments. Agents iteratively execute code and receive environment feedback, testing multi-turn code generation and debugging.

Top Model Score: ~60% (GPT-4 + ReAct)
Human Score: N/A
Coding agent benchmark - Public

Repository-level code completion benchmark. Tests retrieval and generation across entire codebases; agents must understand cross-file context to complete functions correctly.

Top Model Score: ~55% (GPT-4)
Human Score: N/A

MISSING A BENCHMARK? OPEN A PR ON GITHUB TO ADD IT TO THE REGISTRY.

What is an AI agent benchmark?

An AI agent benchmark, eval, or evaluation suite is a structured way to test how well an agent completes tasks in an environment, not just how well a model writes a plausible answer. Instead of grading one response, these tests look at sequences of actions across websites, codebases, tools, desktops, or research workflows. In practice, they measure whether the system can make progress, stay grounded, and reach the correct end state.

That is the main difference between an agent benchmark and a standard LLM eval. A classic LLM test asks whether the model produced the right answer to a prompt. An agent evaluation asks whether the system can plan, recover from mistakes, use the right tools, and complete a workflow under realistic constraints. Strong benchmark leaderboards often track not only accuracy, but also task success, reliability, latency, and cost.
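
As a rough illustration of those extra leaderboard columns, the sketch below aggregates hypothetical run records into success rate, reliability across repeated attempts, latency, and cost. The record fields and numbers are made up for the example, not drawn from any benchmark in this registry.

# Sketch: summarizing agent runs into leaderboard-style metrics.
# All records and field names here are hypothetical.
from collections import defaultdict
from statistics import mean

runs = [
    {"task": "fix_issue_101", "success": True,  "latency_s": 312.0, "cost_usd": 1.10},
    {"task": "fix_issue_101", "success": True,  "latency_s": 290.0, "cost_usd": 0.95},
    {"task": "fix_issue_102", "success": False, "latency_s": 455.0, "cost_usd": 1.60},
    {"task": "fix_issue_102", "success": True,  "latency_s": 401.0, "cost_usd": 1.35},
]

# Task success: fraction of runs that reached the correct end state.
success_rate = mean(r["success"] for r in runs)

# Reliability: a task only counts if every attempt at it succeeded.
by_task = defaultdict(list)
for r in runs:
    by_task[r["task"]].append(r["success"])
reliability = mean(all(attempts) for attempts in by_task.values())

print(f"success rate: {success_rate:.0%}")
print(f"reliability (all attempts succeed): {reliability:.0%}")
print(f"mean latency: {mean(r['latency_s'] for r in runs):.0f}s")
print(f"mean cost: ${mean(r['cost_usd'] for r in runs):.2f}")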

Common methods include exact-match grading, executable test suites, environment-state checks, human review, and LLM-as-judge scoring for open-ended work. Each has tradeoffs in rigor, scalability, and realism. Self-hosted suites are easier to rerun and compare over time, while public-web or live-software evaluations better reflect drift and production messiness. The best way to evaluate AI agents is usually to combine both.
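
For a sense of how those grading methods differ in practice, here is a minimal sketch of three evaluator styles side by side. The task shapes and checker names are assumptions for illustration, not any specific framework's API.

# Sketch of three common evaluator styles for agent benchmarks.
# Real suites wrap these in per-task configs, sandboxes, and retries.

def exact_match(predicted: str, expected: str) -> bool:
    # Exact-match grading: cheap and reproducible, but brittle for open-ended answers.
    return predicted.strip().lower() == expected.strip().lower()

def executable_check(test_exit_code: int) -> bool:
    # Executable test suites: success is whatever makes the test command exit 0.
    return test_exit_code == 0

def environment_state_check(final_state: dict, expected_state: dict) -> bool:
    # Environment-state checks: ignore the action sequence and verify the end
    # state, e.g. the right row exists in a database or a file was created.
    return all(final_state.get(k) == v for k, v in expected_state.items())

# Example: a web task graded on end state rather than on the exact click path.
assert environment_state_check(
    {"cart_items": 1, "order_submitted": True},
    {"order_submitted": True},
)

Human review and LLM-as-judge scoring do not reduce to a few deterministic lines like these, which is part of why their scores are harder to compare across runs.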

FAQ
What are AI agent benchmarks?
AI agent benchmarks are evaluations that measure whether an agent can complete multi-step tasks in an environment such as a browser, terminal, desktop, or tool stack. Unlike single-prompt model tests, they focus on action quality, task completion, recovery from mistakes, and end-to-end execution.
What is an agent eval registry?
An agent eval registry is a curated index of AI agent benchmarks, evaluations, leaderboards, test suites, and frameworks. Instead of covering just one benchmark family, it helps you compare multiple evaluation options across web navigation, coding, desktop control, and tool use in one place.
How do you evaluate AI agents?
You evaluate AI agents by testing them on multi-step tasks in realistic environments and measuring whether they reach the correct end state. Strong agent evaluations usually track task success, evaluator design, reliability, cost, latency, and recovery from mistakes. The right eval framework depends on whether you care about browser use, coding, tool use, desktop control, or general reasoning.
What is the best benchmark for coding agents?
There is no single best benchmark for every use case, but SWE-bench Verified is widely treated as the most trusted benchmark for coding agents because it uses real repository issues and executable test suites. Terminal-Bench is also useful when you want to evaluate autonomous agents in command-line and systems workflows.
How do browser agent benchmarks differ from coding benchmarks?
Browser agent benchmarks evaluate interaction with websites, page state, navigation, and visual or DOM-grounded actions. Coding benchmarks evaluate repository understanding, file edits, tool use, debugging, and test execution. Compare WebVoyager with SWE-bench Verified to see how different the environments and failure modes are.
What does self-hosted mean in an agent benchmark?
Self-hosted means the benchmark environment can be run in a controlled local or containerized setup instead of depending entirely on live public services. That usually improves reproducibility and evaluation stability, but may be less representative of the messy real web or production software. Benchmarks like WebArena and OSWorld are good examples.
Why do benchmark scores differ across evaluators?
Benchmark scores differ because evaluators measure success in different ways. Some use exact match, some use executable test suites, some verify environment state, and others rely on human review or LLM judges. A score from one benchmark or evaluator is not directly comparable to a score produced by a different evaluation method. See the How to read this registry note above before comparing results.