AI Agent Benchmark Registry
Explore an AI agent eval registry and benchmark leaderboard covering web navigation, coding, desktop control, tool use, deep research, and general reasoning. Compare evaluation suites, tests, frameworks, tasks, evaluators, top scores, and benchmark scope in one place.
How to read this registry
Compare scores only when task scope and evaluation method are reasonably similar. Reproducible suites like WebArena are easier to rerun, while live-web evals like WebVoyager better capture production drift. Start with the category routes for web navigation, coding, and tool use before comparing leaderboard numbers across very different evaluation suites. If you want a single place to browse reported scores across many benchmarks, jump to the Benchmark Index.
643 tasks across 15 live public websites. Evaluated by a GPT-4V judge. The most widely adopted web agent benchmark and the de facto standard for comparing commercial and research agents.
- Evaluation Method
- GPT-4V
- Top Model Score
- 97.1%
- Human Score
- ~90%
- Task Count
- 643
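Judge-scored suites like the one above typically show the model the task instruction plus the agent's final answer and screenshots, then ask for a success/failure verdict. A minimal sketch of the prompt assembly and verdict parsing, assuming a simple "Verdict: SUCCESS/FAILURE" reply format (the wording and helper names are illustrative, not the benchmark's actual implementation):

```python
import re

def build_judge_prompt(task: str, agent_answer: str) -> str:
    """Assemble a judge prompt; the screenshots would be attached as
    image inputs alongside this text."""
    return (
        "You are evaluating a web agent.\n"
        f"Task: {task}\n"
        f"Agent's final answer: {agent_answer}\n"
        "Based on the attached final screenshots, reply with a line of "
        "the form 'Verdict: SUCCESS' or 'Verdict: FAILURE'."
    )

def parse_verdict(judge_reply: str) -> bool:
    """Return True iff the judge declared success."""
    m = re.search(r"Verdict:\s*(SUCCESS|FAILURE)", judge_reply, re.IGNORECASE)
    return bool(m) and m.group(1).upper() == "SUCCESS"
```

The parsing step matters in practice: judge replies are free-form text, so a strict, anchored verdict format keeps scoring deterministic even when the reasoning preamble varies.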
812 tasks across self-hosted Docker environments: e-commerce, CMS, GitLab, forum, and map. Evaluation is fully programmatic, with no LLM judge, making it the gold standard for reproducible, verifiable web agent evaluation.
- Evaluation Method
- Programmatic
- Top Model Score
- 71.6%
- Human Score
- ~78%
- Task Count
- 812
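Programmatic evaluation means each task ships with a machine-checkable success condition, such as the final URL or strings that must appear on the page. A toy illustration of the idea (the field names are hypothetical, not the benchmark's actual config schema):

```python
from dataclasses import dataclass

@dataclass
class FinalState:
    url: str
    page_text: str

def check_success(state: FinalState, expected_url_prefix: str,
                  must_include: list[str]) -> bool:
    """Pass only if the agent ended on the right page AND the page
    contains every required substring -- no LLM judge involved."""
    if not state.url.startswith(expected_url_prefix):
        return False
    return all(s in state.page_text for s in must_include)
```

Because the check is pure code over environment state, reruns are cheap and two labs scoring the same trajectory will always agree, which is exactly what makes self-hosted suites reproducible.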
910 tasks requiring visual reasoning across classifieds, shopping, and Reddit environments. Sister benchmark to WebArena — tests agents that rely on screenshots rather than HTML/DOM.
- Evaluation Method
- Programmatic
- Top Model Score
- ~38%
- Human Score
- ~88%
- Task Count
- 910
300 verified tasks across 136 live websites. Independently verified by Princeton HAL with cost tracking alongside accuracy — unique Pareto frontier view of performance vs. cost.
- Evaluation Method
- HAL Verified
- Top Model Score
- 42.33%
- Human Score
- N/A
- Task Count
- 300
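A Pareto frontier view of performance vs. cost keeps only the agents that no other agent beats on both axes at once. A short sketch of that filter, assuming each agent is a dict with illustrative `cost` and `accuracy` keys:

```python
def pareto_frontier(agents: list[dict]) -> list[dict]:
    """Keep agents that are not dominated: 'dominated' means some other
    agent is at least as cheap AND at least as accurate, and strictly
    better on at least one of the two."""
    frontier = []
    for a in agents:
        dominated = any(
            b["cost"] <= a["cost"] and b["accuracy"] >= a["accuracy"]
            and (b["cost"] < a["cost"] or b["accuracy"] > a["accuracy"])
            for b in agents
        )
        if not dominated:
            frontier.append(a)
    return frontier
```

On a leaderboard this surfaces cheap mid-accuracy agents alongside expensive top scorers, rather than ranking by accuracy alone.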
Unified gym environment for web tasks, aggregating WebArena, WorkArena, and other benchmarks under a single interface. Enables standardized agent development and cross-benchmark comparison.
- Evaluation Method
- Programmatic
- Top Model Score
- ~55%
- Human Score
- N/A
- Task Count
- 1,000+
214 realistic, time-consuming tasks sourced from 525+ pages across 258 websites. Designed to test agents that must retrieve, synthesize, and reason, not just navigate.
- Evaluation Method
- HAL Verified
- Top Model Score
- 25.2%
- Human Score
- ~70%
- Task Count
- 214
Benchmark of tedious, multi-step web chores requiring persistent state tracking and real-world interaction. Designed to test agents on tasks humans find repetitive and boring.
- Evaluation Method
- Programmatic
- Top Model Score
- 54.8%
- Human Score
- N/A
- Task Count
- ~500
ServiceNow-based enterprise workflow benchmark. Tests agents on realistic IT, HR, and operations tasks inside a real enterprise SaaS environment via BrowserGym.
- Evaluation Method
- Programmatic
- Top Model Score
- ~42%
- Human Score
- ~78%
- Task Count
- 33 task types
1.18M real Amazon products across a simulated e-commerce environment. Agents must find and purchase specific products matching user instructions. Reward based on product attribute matching.
- Evaluation Method
- Attribute matching
- Top Model Score
- ~75% reward
- Human Score
- 82.1%
- Task Count
- 12,087
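Attribute-matching reward scores the purchased product against the goal specification instead of issuing a binary pass/fail. A simplified sketch of the core idea; the benchmark's actual reward also weighs option choices and product type, and treats price as one matching term rather than the hard budget cutoff assumed here:

```python
def attribute_reward(goal_attrs: set[str], bought_attrs: set[str],
                     goal_price: float, paid_price: float) -> float:
    """Fraction of goal attributes the purchased item satisfies,
    zeroed out (a simplifying assumption) if the price budget is blown."""
    if paid_price > goal_price:
        return 0.0
    if not goal_attrs:
        return 1.0
    return len(goal_attrs & bought_attrs) / len(goal_attrs)
```

Partial credit is the point: an agent that buys a red XL shirt in the wrong fabric still scores above one that buys nothing, which gives a smoother training and comparison signal than all-or-nothing success.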
2,454 tasks across 452 live websites from the global top-1,000 by traffic. Direct spiritual successor to WebVoyager with much broader website coverage. Released May 2025 by Halluminate + Skyvern.
- Evaluation Method
- GPT-4V
- Top Model Score
- N/A
- Human Score
- N/A
- Task Count
- 2,454
Missing a benchmark? Open a PR on GitHub to add it to the registry.
What is an AI agent benchmark?
An AI agent benchmark, eval, or evaluation suite is a structured way to test how well an agent completes tasks in an environment, not just how well a model writes a plausible answer. Instead of grading one response, these tests look at sequences of actions across websites, codebases, tools, desktops, or research workflows. In practice, they measure whether the system can make progress, stay grounded, and reach the correct end state.
That is the main difference between an agent benchmark and a standard LLM eval. A classic LLM test asks whether the model produced the right answer to a prompt. An agent evaluation asks whether the system can plan, recover from mistakes, use the right tools, and complete a workflow under realistic constraints. Strong benchmark leaderboards often track not only accuracy, but also task success, reliability, latency, and cost.
Common methods include exact-match grading, executable test suites, environment-state checks, human review, and LLM-as-judge scoring for open-ended work. Each has tradeoffs in rigor, scalability, and realism. Self-hosted suites are easier to rerun and compare over time, while public-web or live-software evaluations better reflect drift and production messiness. The best way to evaluate AI agents is usually to combine both.
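One common way to combine the two is to run the cheap, reproducible programmatic check wherever a task defines one, and fall back to an LLM judge only for open-ended tasks it cannot decide. A sketch of that dispatch (the function shapes are assumptions, not any particular harness's API):

```python
from typing import Callable, Optional

def evaluate(task_id: str,
             programmatic_check: Optional[Callable[[], bool]],
             llm_judge: Callable[[str], bool]) -> tuple[bool, str]:
    """Prefer the deterministic programmatic verdict; use the LLM judge
    only when no machine-checkable condition exists for the task."""
    if programmatic_check is not None:
        return programmatic_check(), "programmatic"
    return llm_judge(task_id), "llm-judge"
```

Reporting which method produced each verdict (the second tuple element) keeps mixed-method leaderboard numbers auditable.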