AI Agent Benchmark Registry
Explore an AI agent eval registry and benchmark leaderboard covering web navigation, coding, desktop control, tool use, deep research, and general reasoning. Compare evaluation suites, tests, frameworks, tasks, evaluators, top scores, and benchmark scope in one place.
How to read this registry
Compare scores only when two benchmarks share similar task scope and evaluation methods. Reproducible suites like WebArena are easier to rerun, while live-web evals like WebVoyager better capture production drift. Start with the category routes for web navigation, coding, and tool use before comparing leaderboard numbers across very different evaluation suites. If you want a single place to browse reported scores across many benchmarks, jump to the Benchmark Index.
Coding agent benchmarks
SWE-bench Verified: 500 human-verified GitHub issues from real-world Python repos, expert-curated to remove ambiguous or unreliable tasks. Widely regarded as the most trusted coding agent benchmark, since resolving a task means fixing a real bug in a real repo.
- Evaluation Method: Test suite
- Top Model Score: ~72%
- Human Score: ~94%
- Task Count: 500
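Test-suite grading of this kind reduces to comparing per-test outcomes before and after the agent's patch. A minimal sketch of the verdict logic (illustrative field names, not the actual SWE-bench harness):

```python
def grade(before: dict[str, bool], after: dict[str, bool]) -> bool:
    """SWE-bench-style verdict: every test that failed before the patch
    (fail-to-pass) must now pass, and every test that passed before
    (pass-to-pass) must keep passing, i.e. no regressions."""
    fail_to_pass = [t for t, ok in before.items() if not ok]
    pass_to_pass = [t for t, ok in before.items() if ok]
    return all(after.get(t, False) for t in fail_to_pass + pass_to_pass)

before = {"test_bugfix": False, "test_existing": True}
grade(before, {"test_bugfix": True, "test_existing": True})   # True: resolved
grade(before, {"test_bugfix": True, "test_existing": False})  # False: regression
```

The real harness applies the patch and reruns the repo's tests inside a container; only the pass/fail comparison above decides "resolved."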
SWE-bench Lite: a 300-task curated subset of SWE-bench focusing on self-contained issues. Designed for faster, cheaper evaluation while remaining representative of the full benchmark.
- Evaluation Method: Test suite
- Top Model Score: ~55%
- Human Score: N/A
- Task Count: 300
Purely terminal-based coding and system tasks with no GUI. Tests command-line proficiency across bash, Python, and system administration. Harder and more realistic than sandbox coding benchmarks.
- Evaluation Method: Execution-based
- Top Model Score: ~45%
- Human Score: N/A
- Task Count: ~200
MLE-bench: 75 Kaggle competitions used to evaluate ML engineering agents. Agents must write, run, and iterate on ML pipelines to achieve competitive leaderboard scores.
- Evaluation Method: Kaggle leaderboard
- Top Model Score: ~17% medals
- Human Score: N/A
- Task Count: 75 competitions
SciCode: 338 scientific coding subproblems drawn from 80 research problems across math, physics, chemistry, biology, and materials science. Tests research-grade code generation against expert-written tests.
- Evaluation Method: Test suite
- Top Model Score: ~26%
- Human Score: ~81%
- Task Count: 338
Evaluates agents on real-world cybersecurity vulnerability exploitation. Agents are scored on successfully exploiting CVEs from public vulnerability databases in sandboxed environments.
- Evaluation Method: Exploit success
- Top Model Score: ~47%
- Human Score: N/A
- Task Count: ~50 CVEs
HumanEval+: an enhanced version of OpenAI's HumanEval with 80x more test cases per problem to reduce false positives. Tests Python code generation against significantly stricter test coverage.
- Evaluation Method: Test suite
- Top Model Score: ~99%
- Human Score: N/A
- Task Count: 164
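The effect of stricter test coverage is easy to demonstrate: a plausible-but-buggy solution can pass a weak suite yet fail once more cases are added (a toy example, not an actual HumanEval problem):

```python
def median(xs):
    # Plausible-but-buggy solution: ignores the even-length case,
    # where the median is the mean of the two middle elements.
    return sorted(xs)[len(xs) // 2]

weak_tests = [([1, 3, 2], 2), ([5], 5)]                      # odd lengths only
strict_tests = weak_tests + [([1, 2, 3, 4], 2.5), ([2, 1], 1.5)]

def passes(fn, tests):
    """True if fn returns the expected value on every test case."""
    return all(fn(xs) == want for xs, want in tests)

passes(median, weak_tests)    # True: the weak suite misses the bug
passes(median, strict_tests)  # False: the added cases expose it
```

Scaling this idea up (many more generated inputs per problem) is what turns an apparent pass on the original suite into a detected failure.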
LLM code editing benchmark using real open-source repos. Measures ability to apply targeted code changes from natural language instructions without breaking existing tests.
- Evaluation Method: Test suite
- Top Model Score: ~79%
- Human Score: N/A
- Task Count: 133
InterCode: an interactive coding benchmark using bash and SQL environments. Agents iteratively execute code and receive environment feedback, testing multi-turn code generation and debugging.
- Evaluation Method: Execution-based
- Top Model Score: ~60%
- Human Score: N/A
- Task Count: ~700
Repository-level code completion benchmark. Tests retrieval and generation across entire codebases — agents must understand cross-file context to complete functions correctly.
- Evaluation Method: Exact match / CodeBLEU
- Top Model Score: ~55%
- Human Score: N/A
- Task Count: ~900
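Of the two metrics, exact match is the simpler: a completion scores only if it reproduces the reference, typically after whitespace normalization. A minimal sketch (real harnesses also detokenize and truncate at line boundaries):

```python
def normalize(s: str) -> str:
    # Collapse runs of whitespace so pure formatting differences don't count.
    return " ".join(s.split())

def exact_match_score(preds: list[str], refs: list[str]) -> float:
    """Fraction of completions identical to the reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)

exact_match_score(["return a + b", "return a+b"],
                  ["return  a + b", "return a + b"])  # 0.5: second pred differs
```

CodeBLEU softens this all-or-nothing criterion by granting partial credit for n-gram, syntax-tree, and data-flow overlap with the reference.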
Missing a benchmark? Open a PR on GitHub to add it to the registry.
What is an AI agent benchmark?
An AI agent benchmark, eval, or evaluation suite is a structured way to test how well an agent completes tasks in an environment, not just how well a model writes a plausible answer. Instead of grading one response, these tests look at sequences of actions across websites, codebases, tools, desktops, or research workflows. In practice, they measure whether the system can make progress, stay grounded, and reach the correct end state.
That is the main difference between an agent benchmark and a standard LLM eval. A classic LLM test asks whether the model produced the right answer to a prompt. An agent evaluation asks whether the system can plan, recover from mistakes, use the right tools, and complete a workflow under realistic constraints. Strong benchmark leaderboards often track not only accuracy, but also task success, reliability, latency, and cost.
Common methods include exact-match grading, executable test suites, environment-state checks, human review, and LLM-as-judge scoring for open-ended work. Each has tradeoffs in rigor, scalability, and realism. Self-hosted suites are easier to rerun and compare over time, while public-web or live-software evaluations better reflect drift and production messiness. The best way to evaluate AI agents is usually to combine both.
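As an illustration of the environment-state style (with a hypothetical task format): the grader ignores the agent's transcript entirely and checks only the final state of the environment, here files on disk.

```python
import pathlib
import tempfile

def env_state_check(workdir: str, expected: dict[str, str]) -> bool:
    """Pass/fail by final environment state alone: each expected file must
    exist with exactly the expected contents, no matter how the agent
    produced it."""
    root = pathlib.Path(workdir)
    return all(
        (root / name).is_file() and (root / name).read_text() == body
        for name, body in expected.items()
    )

# Simulate an agent that was asked to produce a config file.
with tempfile.TemporaryDirectory() as wd:
    pathlib.Path(wd, "app.cfg").write_text("debug=false\n")
    success = env_state_check(wd, {"app.cfg": "debug=false\n"})  # True
```

The same pattern generalizes to database rows, browser DOM state, or desktop UI state; what changes is only which slice of the environment the grader inspects.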