AI Agent Benchmark Registry
Explore an AI agent eval registry and benchmark leaderboard covering web navigation, coding, desktop control, tool use, deep research, and general reasoning. Compare evaluation suites, tests, frameworks, tasks, evaluators, top scores, and benchmark scope in one place.
How to read this registry
Compare results only when task scope and evaluation method are reasonably comparable. Reproducible suites like WebArena are easier to rerun, while live-web evals like WebVoyager better capture production drift. Start with the category routes for web navigation, coding, and tool use before comparing leaderboard numbers across very different evaluation suites. If you want a single place to browse reported scores across many benchmarks, jump to the Benchmark Index.
General reasoning benchmarks
466 tasks across three difficulty levels requiring tool use, multimodal reasoning, and web browsing. With 587 submissions on Hugging Face, it is the most submitted-to AI agent benchmark to date.
- Evaluation Method: Exact match
- Top Model Score: ~75%
- Human Score: 92%
- Task Count: 466
8 distinct environments spanning web browsing, OS, database, and game interaction. Tests agents across diverse real-world-like scenarios in a single unified framework.
- Evaluation Method: Task-specific
- Top Model Score: ~4.27 score
- Human Score: N/A
- Task Count: 1,091
3,000 expert-level questions across 100+ academic disciplines, crowd-sourced from domain experts. Designed to be at or beyond the frontier of human knowledge — the hardest factual benchmark yet.
- Evaluation Method: Exact match
- Top Model Score: ~26%
- Human Score: N/A
- Task Count: 3,000
The second generation of François Chollet's Abstraction and Reasoning Corpus. Novel visual pattern tasks designed to resist memorization — requires genuine program synthesis from examples.
- Evaluation Method: Exact match
- Top Model Score: ~4%
- Human Score: ~60%
- Task Count: ~500
448 expert-level multiple-choice questions in biology, physics, and chemistry — written and validated by domain PhDs. Only experts in the relevant field consistently score above random.
- Evaluation Method: Multiple choice
- Top Model Score: ~87%
- Human Score: ~69% (experts)
- Task Count: 448
Monthly-refreshed benchmark with questions sourced from recent news, papers, and competition math. Designed to prevent data contamination — the benchmark evolves so models can't memorize answers.
- Evaluation Method: Verifiable
- Top Model Score: ~80%
- Human Score: N/A
- Task Count: ~900 (rotating)
4,326 short factual questions with a single unambiguous correct answer. Measures factual accuracy and hallucination rate — designed to have no trick questions, only clear facts.
- Evaluation Method: Exact match
- Top Model Score: ~97%
- Human Score: ~94%
- Task Count: 4,326
Analytical benchmark across 9 diverse agent scenarios. Provides fine-grained progress rates beyond binary success/fail — measures how far along a task an agent gets even when it fails.
- Evaluation Method: Progress rate
- Top Model Score: ~58% progress
- Human Score: N/A
- Task Count: ~1,000
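The progress-rate idea above — crediting partial completion instead of a binary pass/fail — can be sketched as follows. The subgoal counts and function names here are illustrative assumptions, not any benchmark's actual scorer.

```python
# Hypothetical sketch: progress-rate scoring credits how far an agent got,
# while binary scoring only credits full completion.

def progress_rate(completed_subgoals: int, total_subgoals: int) -> float:
    """Fraction of a task's subgoals the agent completed (0.0 to 1.0)."""
    if total_subgoals <= 0:
        raise ValueError("task must define at least one subgoal")
    return completed_subgoals / total_subgoals

def binary_success(completed_subgoals: int, total_subgoals: int) -> bool:
    """Conventional pass/fail: success only if every subgoal is met."""
    return completed_subgoals == total_subgoals

# An agent that finishes 3 of 5 subgoals earns 0.6 progress but fails outright.
```

Averaged over many tasks, progress rate separates agents that nearly finish from agents that stall immediately, even when both have the same success rate.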
Long-horizon agent benchmark requiring sustained reasoning and planning over 50+ steps. Tests whether agents can maintain coherent goals across very long task horizons without losing context.
- Evaluation Method: Functional
- Top Model Score: ~35%
- Human Score: N/A
- Task Count: ~200
A nine-app ecosystem with 750 tasks spanning contacts, music, email, maps, and calendar. Tests agents on realistic app-based workflows that require coordination across multiple simulated apps.
- Evaluation Method: Functional
- Top Model Score: ~49%
- Human Score: N/A
- Task Count: 750
Missing a benchmark? Open a PR on GitHub to add it to the registry.
What is an AI agent benchmark?
An AI agent benchmark, eval, or evaluation suite is a structured way to test how well an agent completes tasks in an environment, not just how well a model writes a plausible answer. Instead of grading one response, these tests look at sequences of actions across websites, codebases, tools, desktops, or research workflows. In practice, they measure whether the system can make progress, stay grounded, and reach the correct end state.
That is the main difference between an agent benchmark and a standard LLM eval. A classic LLM test asks whether the model produced the right answer to a prompt. An agent evaluation asks whether the system can plan, recover from mistakes, use the right tools, and complete a workflow under realistic constraints. Strong benchmark leaderboards often track not only accuracy, but also task success, reliability, latency, and cost.
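A minimal sketch of how a leaderboard might aggregate those extra dimensions. The record fields (`success`, `latency_s`, `cost_usd`) and the reliability rule are assumptions for illustration, not any registry's actual schema.

```python
from statistics import mean

# Illustrative per-run records: two attempts per task (field names assumed).
runs = [
    {"task": "t1", "success": True,  "latency_s": 41.0, "cost_usd": 0.12},
    {"task": "t1", "success": True,  "latency_s": 38.0, "cost_usd": 0.11},
    {"task": "t2", "success": True,  "latency_s": 90.0, "cost_usd": 0.30},
    {"task": "t2", "success": False, "latency_s": 95.0, "cost_usd": 0.33},
]

success_rate = mean(r["success"] for r in runs)   # overall pass rate across runs
avg_latency = mean(r["latency_s"] for r in runs)  # mean wall-clock time per run
total_cost = sum(r["cost_usd"] for r in runs)     # spend for the whole eval

# Reliability: a task counts only if the agent solves it on *every* attempt,
# which penalizes flaky agents that a single-run success rate would reward.
tasks = {r["task"] for r in runs}
reliable = sum(all(r["success"] for r in runs if r["task"] == t) for t in tasks)
reliability = reliable / len(tasks)
```

Here the agent's raw success rate is 0.75, but its reliability is only 0.5, since it solved the second task on just one of two attempts.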
Common methods include exact-match grading, executable test suites, environment-state checks, human review, and LLM-as-judge scoring for open-ended work. Each has tradeoffs in rigor, scalability, and realism. Self-hosted suites are easier to rerun and compare over time, while public-web or live-software evaluations better reflect drift and production messiness. The best way to evaluate AI agents is usually to combine both.
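Exact-match grading, the cheapest of the methods above, usually applies light normalization before comparing. This sketch shows the idea; the normalization rules here are assumptions, not any specific benchmark's official scorer.

```python
# Minimal sketch of exact-match grading with light normalization -- the kind
# of check self-hosted suites can rerun cheaply and deterministically.

def normalize(answer: str) -> str:
    """Lowercase, trim whitespace, and drop a trailing period before comparing."""
    return answer.strip().rstrip(".").lower()

def exact_match(prediction: str, gold: str) -> bool:
    """True when prediction and gold answer agree after normalization."""
    return normalize(prediction) == normalize(gold)

exact_match("  Paris. ", "paris")      # formatting differences are forgiven
exact_match("Paris, France", "paris")  # content differences are not
```

The tradeoff is exactly the one named above: this is rigorous and scalable but only works for tasks with a single short canonical answer; open-ended work needs executable tests, environment-state checks, or a judge.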