AI Agent Benchmark Registry
Explore an AI agent eval registry and benchmark leaderboard covering web navigation, coding, desktop control, tool use, deep research, and general reasoning. Compare evaluation suites, tests, frameworks, tasks, evaluators, top scores, and benchmark scope in one place.
How to read this registry
Compare results only when task scope and evaluation method are reasonably comparable. Reproducible suites like WebArena are easier to rerun, while live-web evals like WebVoyager better capture production drift. Start with the category routes for web navigation, coding, and tool use before comparing leaderboard numbers across very different evaluation suites. If you want a single place to browse reported scores across many benchmarks, jump to the Benchmark Index.
643 tasks across 15 live public websites. Evaluated by a GPT-4V judge. The most widely adopted web agent benchmark and the de facto standard for comparing commercial and research agents.
- Evaluation Method: GPT-4V
- Top Model Score: 97.1%
- Human Score: ~90%
- Task Count: 643
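The GPT-4V-judge pattern behind this kind of evaluation can be approximated with a small harness: the judge sees the task, the agent's final answer, and the final screenshots, then returns a verdict. A minimal sketch, assuming a hypothetical `vision_judge` callable that wraps whatever multimodal model you use; the prompt wording is illustrative, not the benchmark's official rubric.

```python
from typing import Callable, Sequence

def judge_web_task(
    task: str,
    final_answer: str,
    screenshots: Sequence[str],  # e.g. base64 strings or file paths
    vision_judge: Callable[[str, Sequence[str]], str],  # hypothetical model wrapper
) -> bool:
    """Ask a multimodal judge whether the agent completed the task.

    Returns True if the judge's verdict contains 'SUCCESS'.
    """
    prompt = (
        "You are evaluating a web-browsing agent.\n"
        f"Task: {task}\n"
        f"Agent's final answer: {final_answer}\n"
        "Based on the attached screenshots of the final pages, reply with "
        "'SUCCESS' if the task was completed correctly, otherwise 'FAILURE'."
    )
    verdict = vision_judge(prompt, screenshots)
    return "SUCCESS" in verdict.upper()
```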
812 tasks across self-hosted Docker environments: e-commerce, CMS, GitLab, forum, and map. Programmatic evaluation — no LLM judge. Gold standard for reproducible, verifiable web agent evaluation.
- Evaluation Method: Programmatic
- Top Model Score: 71.6%
- Human Score: ~78%
- Task Count: 812
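Programmatic grading means the harness inspects the environment itself rather than asking a model. A minimal sketch of that idea, assuming a declarative end-state spec that the grader checks against the final page; the real evaluators in suites like this are richer (URL checks, string matches, HTML program checks), so treat this as the shape, not the implementation.

```python
from dataclasses import dataclass, field

@dataclass
class EndStateSpec:
    """Declarative description of the expected final state of a task."""
    expected_url_substring: str | None = None
    must_include_text: list[str] = field(default_factory=list)

def evaluate_end_state(final_url: str, final_html: str, spec: EndStateSpec) -> bool:
    """Return True only if every declared check passes; no LLM involved."""
    if spec.expected_url_substring and spec.expected_url_substring not in final_url:
        return False
    return all(text in final_html for text in spec.must_include_text)

# Hypothetical task "add a blue T-shirt to the cart" might be checked like this:
spec = EndStateSpec(
    expected_url_substring="/checkout/cart",
    must_include_text=["Blue T-Shirt", "Qty: 1"],
)
```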
910 tasks requiring visual reasoning across classifieds, shopping, and Reddit environments. Sister benchmark to WebArena — tests agents that rely on screenshots rather than HTML/DOM.
- Evaluation Method: Programmatic
- Top Model Score: ~38%
- Human Score: ~88%
- Task Count: 910
300 verified tasks across 136 live websites. Independently verified by Princeton HAL with cost tracking alongside accuracy — unique Pareto frontier view of performance vs. cost.
- Evaluation Method: HAL Verified
- Top Model Score: 42.33%
- Human Score: N/A
- Task Count: 300
Unified gym environment for web tasks, aggregating WebArena, WorkArena, and other benchmarks under a single interface. Enables standardized agent development and cross-benchmark comparison.
- Evaluation Method: Programmatic
- Top Model Score: ~55%
- Human Score: N/A
- Task Count: 1,000+
214 realistic, time-consuming tasks sourced from 525+ pages across 258 websites. Designed to test agents that must retrieve, synthesize, and reason — not just navigate. Best score is 25.2%.
- Evaluation Method: HAL Verified
- Top Model Score: 25.2%
- Human Score: ~70%
- Task Count: 214
Benchmark of tedious, multi-step web chores requiring persistent state tracking and real-world interaction. Designed to test agents on tasks humans find repetitive and boring.
- Evaluation Method: Programmatic
- Top Model Score: 54.8%
- Human Score: N/A
- Task Count: ~500
ServiceNow-based enterprise workflow benchmark. Tests agents on realistic IT, HR, and operations tasks inside a real enterprise SaaS environment via BrowserGym.
- Evaluation Method: Programmatic
- Top Model Score: ~42%
- Human Score: ~78%
- Task Count: 33 task types
1.18M real Amazon products across a simulated e-commerce environment. Agents must find and purchase specific products matching user instructions. Reward based on product attribute matching.
- Evaluation Method: Attribute matching
- Top Model Score: ~75% reward
- Human Score: 82.1%
- Task Count: 12,087
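Attribute-matching rewards give partial credit: the purchased product is compared against the attributes the instruction required. A rough sketch of that scoring idea, under the simplifying assumption that goals and products are plain attribute sets; the benchmark's actual reward also weighs options, price, and product type.

```python
def attribute_match_reward(goal_attrs: set[str], bought_attrs: set[str]) -> float:
    """Fraction of required attributes present on the purchased product (0.0-1.0)."""
    if not goal_attrs:
        return 1.0
    return len(goal_attrs & bought_attrs) / len(goal_attrs)

# e.g. goal requires {"cotton", "long sleeve", "machine wash"}; the bought item has two of three
reward = attribute_match_reward(
    {"cotton", "long sleeve", "machine wash"},
    {"cotton", "machine wash", "blue"},
)  # -> 0.666...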
2,454 tasks across 452 live websites from the global top-1,000 by traffic. Direct spiritual successor to WebVoyager with much broader website coverage. Released May 2025 by Halluminate + Skyvern.
- Evaluation Method: GPT-4V
- Top Model Score: N/A
- Human Score: N/A
- Task Count: 2,454
1,266 hard research questions designed to be easy to verify but extremely hard to find. Tests persistent multi-step web browsing and information synthesis. Most models still score low.
- Evaluation Method: Exact match
- Top Model Score: 60.2%
- Human Score: 29.2%
- Task Count: 1,266
Multimodal search benchmark testing agents on complex queries requiring both visual and textual web search. Evaluates image-grounded research across live search engines.
- Evaluation Method: LLM judge
- Top Model Score: ~58%
- Human Score: N/A
- Task Count: ~300
369 cross-application desktop tasks across Ubuntu, Windows, and macOS. Covers Chrome, LibreOffice, VS Code, and more. Execution-based evaluation. Agents still trail the 72% human baseline.
- Evaluation Method: Execution-based
- Top Model Score: 66.2%
- Human Score: 72.4%
- Task Count: 369
Cross-platform desktop benchmark covering macOS, Windows, and Ubuntu with 2,000+ tasks. Focuses on real-world app interactions and long-horizon task completion.
- Evaluation Method: Execution-based
- Top Model Score: ~40%
- Human Score: N/A
- Task Count: 2,000+
macOS-specific benchmark with 369 tasks spanning system preferences, Finder, Safari, and productivity apps. Complements OSWorld with platform-specific depth.
- Evaluation Method: Execution-based
- Top Model Score: ~35%
- Human Score: ~72%
- Task Count: 369
154 tasks across real Windows 11 applications running in Azure VMs. Tests document editing, file management, system settings, and browser tasks. Full reproducibility via cloud snapshots.
- Evaluation Method: Programmatic
- Top Model Score: 19.5%
- Human Score: 74.5%
- Task Count: 154
116 tasks across 20 real Android apps in a live emulated environment. Functional evaluation without cached states. Tests agents on real apps including Gmail, Chrome, and Settings.
- Evaluation Method: Functional
- Top Model Score: ~30%
- Human Score: ~88%
- Task Count: 116
A gym environment for mobile UI interaction built on the Android emulator. Provides step-level rewards for fine-grained evaluation of touch-based agent interaction.
- Evaluation Method: Step reward
- Top Model Score: N/A
- Human Score: N/A
- Task Count: ~70 tasks
500 human-verified GitHub issues from real-world Python repos. Expert-curated to remove ambiguous or unreliable tasks. The most trusted coding agent benchmark — resolving a real bug in a real repo.
- Evaluation Method: Test suite
- Top Model Score: ~72%
- Human Score: ~94%
- Task Count: 500
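Test-suite grading for repo-level fixes usually boils down to: apply the agent's patch, then require the previously failing tests to pass while the previously passing tests keep passing. A hedged sketch using git and pytest via subprocess; the directory layout and test IDs are placeholders, and the official harness runs inside per-task containers.

```python
import subprocess

def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    """Run the given pytest node IDs inside the repo; True if all pass."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_ids],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0

def grade_patch(repo_dir: str, patch_path: str,
                fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Apply the agent's patch, then check both test groups."""
    applied = subprocess.run(
        ["git", "apply", patch_path], cwd=repo_dir, capture_output=True
    )
    if applied.returncode != 0:
        return False  # an unapplicable patch counts as a failed task
    return run_tests(repo_dir, fail_to_pass) and run_tests(repo_dir, pass_to_pass)
```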
300-task curated subset of SWE-bench focusing on self-contained issues. Designed for faster, cheaper evaluation while remaining representative of the full benchmark.
- Evaluation Method: Test suite
- Top Model Score: ~55%
- Human Score: N/A
- Task Count: 300
Purely terminal-based coding and system tasks with no GUI. Tests command-line proficiency across bash, Python, and system administration. Harder and more realistic than sandbox coding benchmarks.
- Evaluation Method: Execution-based
- Top Model Score: ~45%
- Human Score: N/A
- Task Count: ~200
75 Kaggle competitions used to evaluate ML engineering agents. Agents must write, run, and iterate on ML pipelines to achieve competitive leaderboard scores.
- Evaluation Method: Kaggle leaderboard
- Top Model Score: ~17% medals
- Human Score: N/A
- Task Count: 75 competitions
338 scientific coding subproblems from 80 research problems across math, physics, chemistry, biology, and materials science. Tests research-grade code generation against expert-written tests.
- Evaluation Method: Test suite
- Top Model Score: ~26%
- Human Score: ~81%
- Task Count: 338
Evaluates agents on real-world cybersecurity vulnerability exploitation. Agents are scored on successfully exploiting CVEs from public vulnerability databases in sandboxed environments.
- Evaluation Method: Exploit success
- Top Model Score: ~47%
- Human Score: N/A
- Task Count: ~50 CVEs
Enhanced version of OpenAI's HumanEval with 80x more test cases per problem to reduce false positives. Tests Python code generation against significantly stricter test coverage.
- Evaluation Method: Test suite
- Top Model Score: ~99%
- Human Score: N/A
- Task Count: 164
LLM code editing benchmark using real open-source repos. Measures ability to apply targeted code changes from natural language instructions without breaking existing tests.
- Evaluation Method: Test suite
- Top Model Score: ~79%
- Human Score: N/A
- Task Count: 133
Interactive coding benchmark using bash and SQL environments. Agents iteratively execute code and receive environment feedback, testing multi-turn code generation and debugging.
- Evaluation Method: Execution-based
- Top Model Score: ~60%
- Human Score: N/A
- Task Count: ~700
Repository-level code completion benchmark. Tests retrieval and generation across entire codebases — agents must understand cross-file context to complete functions correctly.
- Evaluation Method: Exact match / CodeBLEU
- Top Model Score: ~55%
- Human Score: N/A
- Task Count: ~900
16,000+ real-world APIs from RapidAPI across 49 categories. Tests agents on planning and chaining API calls to complete complex instructions. Includes a neural retriever for API selection.
- Evaluation Method: Pass rate / win rate
- Top Model Score: ~60% pass rate
- Human Score: N/A
- Task Count: 2,746
Agent-computer interaction benchmark focused on realistic customer service scenarios. Agents must complete multi-turn tasks using tools (database lookups, reservations) while following strict policies.
- Evaluation Method: Functional
- Top Model Score: ~60%
- Human Score: N/A
- Task Count: ~200
73 API tools across 3 difficulty levels testing tool retrieval, plan selection, and API call correctness. One of the earliest systematic tool-use benchmarks for LLMs.
- Evaluation Method: Exact match
- Top Model Score: ~75%
- Human Score: N/A
- Task Count: 314
1,645 API tasks across HuggingFace, TorchHub, and TensorHub. Evaluates whether agents generate accurate API calls, including correct arguments and library usage, without hallucination.
- Evaluation Method: AST matching
- Top Model Score: ~80%
- Human Score: N/A
- Task Count: 1,645
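AST matching checks that the generated API call is structurally the right call with the right arguments, rather than comparing raw strings, which would penalize harmless formatting differences. A small sketch using Python's ast module; real graders typically also allow argument-order and default-value equivalences, which this omits.

```python
import ast

def same_api_call(generated: str, reference: str) -> bool:
    """Compare two single-call snippets by structure, ignoring formatting."""
    try:
        gen = ast.parse(generated, mode="eval").body
        ref = ast.parse(reference, mode="eval").body
    except SyntaxError:
        return False
    if not isinstance(gen, ast.Call) or not isinstance(ref, ast.Call):
        return False
    return ast.dump(gen) == ast.dump(ref)

# Whitespace and spacing differences do not matter; wrong arguments do.
same_api_call(
    "pipeline(task='summarization', model='t5-small')",
    "pipeline( task = 'summarization', model = 't5-small' )",
)  # -> True
```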
Stateful tool-use benchmark with interdependencies between tool calls. Agents must manage tool state across multi-step tasks — calling one tool affects what another returns.
- Evaluation Method: State verification
- Top Model Score: ~52%
- Human Score: N/A
- Task Count: ~200
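State verification means the grader inspects the shared state object after the agent's tool calls, not the transcript of what the agent said. A toy sketch with an in-memory calendar as the tool state; the state fields and task are made up for illustration, but the pattern, where one call only succeeds because an earlier call ran, is the point.

```python
class CalendarState:
    """Toy shared state that tool calls mutate."""
    def __init__(self) -> None:
        self.events: dict[str, str] = {}  # title -> time

    def create_event(self, title: str, time: str) -> None:
        self.events[title] = time

    def move_event(self, title: str, new_time: str) -> None:
        if title not in self.events:
            raise KeyError(title)  # depends on an earlier create_event call
        self.events[title] = new_time

def verify_final_state(state: CalendarState, expected: dict[str, str]) -> bool:
    """Success = the end state matches, regardless of how the agent got there."""
    return state.events == expected

state = CalendarState()
state.create_event("Standup", "09:00")
state.move_event("Standup", "09:30")  # valid only because create_event ran first
assert verify_final_state(state, {"Standup": "09:30"})
```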
Evaluation suite for agents using Model Context Protocol servers. Tests correctness of MCP tool invocation, schema understanding, and multi-server orchestration.
- Evaluation Method: Functional
- Top Model Score: N/A
- Human Score: N/A
- Task Count: N/A
466 tasks across 3 difficulty levels requiring tool use, multimodal reasoning, and web browsing. With 587 submissions on HuggingFace, it is the most submitted-to AI agent benchmark.
- Evaluation Method: Exact match
- Top Model Score: ~75%
- Human Score: 92%
- Task Count: 466
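Exact-match grading on free-form answers is usually done after light normalization (case, whitespace, punctuation, thousands separators), so that "1,234" and "1234" count as the same answer. A hedged sketch of that normalization; any given benchmark's official scorer may normalize differently, and this naive version would also strip decimal points.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, drop punctuation and thousands separators."""
    text = answer.strip().lower().replace(",", "")  # 1,234 -> 1234
    text = re.sub(r"[^\w\s]", "", text)             # drop punctuation (naive for decimals)
    return re.sub(r"\s+", " ", text)

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

exact_match(" 1,234 ", "1234")   # -> True
exact_match("Paris.", "paris")   # -> True
exact_match("Paris", "Lyon")     # -> False
```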
8 distinct environments spanning web browsing, OS, database, and game interaction. Tests agents across diverse real-world-like scenarios in a single unified framework.
- Evaluation Method: Task-specific
- Top Model Score: ~4.27 score
- Human Score: N/A
- Task Count: 1,091
3,000 expert-level questions across 100+ academic disciplines, crowd-sourced from domain experts. Designed to be at or beyond the frontier of human knowledge — the hardest factual benchmark yet.
- Evaluation Method: Exact match
- Top Model Score: ~26%
- Human Score: N/A
- Task Count: 3,000
The second generation of François Chollet's Abstraction and Reasoning Corpus. Novel visual pattern tasks designed to resist memorization — requires genuine program synthesis from examples.
- Evaluation Method: Exact match
- Top Model Score: ~4%
- Human Score: ~60%
- Task Count: ~500
448 expert-level multiple-choice questions in biology, physics, and chemistry, written and validated by domain PhDs. Only experts in the relevant field consistently score well above chance.
- Evaluation Method: Multiple choice
- Top Model Score: ~87%
- Human Score: ~69% (experts)
- Task Count: 448
Monthly-refreshed benchmark with questions sourced from recent news, papers, and competition math. Designed to prevent data contamination — the benchmark evolves so models can't memorize answers.
- Evaluation Method: Verifiable
- Top Model Score: ~80%
- Human Score: N/A
- Task Count: ~900 (rotating)
4,326 short factual questions with a single unambiguous correct answer. Measures factual accuracy and hallucination rate — designed to have no trick questions, only clear facts.
- Evaluation Method: Exact match
- Top Model Score: ~97%
- Human Score: ~94%
- Task Count: 4,326
Analytical benchmark across 9 diverse agent scenarios. Provides fine-grained progress rates beyond binary success/fail — measures how far along a task an agent gets even when it fails.
- Evaluation Method: Progress rate
- Top Model Score: ~58% progress
- Human Score: N/A
- Task Count: ~1,000
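Progress rate replaces binary pass/fail with the fraction of a task's subgoals the agent reached, so partially correct runs are distinguishable. A minimal sketch assuming each task ships a list of checkable milestones; the milestone predicates and trajectory fields here are placeholders.

```python
from typing import Callable, Sequence

def progress_rate(trajectory: dict, milestones: Sequence[Callable[[dict], bool]]) -> float:
    """Fraction of milestone checks satisfied by the agent's trajectory (0.0-1.0)."""
    if not milestones:
        return 0.0
    hits = sum(1 for check in milestones if check(trajectory))
    return hits / len(milestones)

# Toy task "book a flight" scored against three milestones instead of pass/fail.
milestones = [
    lambda t: t.get("searched_flights", False),
    lambda t: t.get("selected_flight", False),
    lambda t: t.get("completed_payment", False),
]
progress_rate({"searched_flights": True, "selected_flight": True}, milestones)  # -> 0.666...
```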
Long-horizon agent benchmark requiring sustained reasoning and planning over 50+ steps. Tests whether agents can maintain coherent goals across very long task horizons without losing context.
- Evaluation Method: Functional
- Top Model Score: ~35%
- Human Score: N/A
- Task Count: ~200
Nine-app ecosystem with 750 tasks spanning contacts, music, email, maps, and calendar. Tests agents on realistic app-based workflows requiring coordination across multiple simulated apps.
- Evaluation Method: Functional
- Top Model Score: ~49%
- Human Score: N/A
- Task Count: 750
Social intelligence benchmark placing agents in realistic social scenarios. Evaluates believability, social goal completion, relationship management, and secret keeping across 11 social dimensions.
- Evaluation Method: LLM judge (GPT-4)
- Top Model Score: ~7.6/10
- Human Score: ~8.3/10
- Task Count: ~600 episodes
Safety red-teaming benchmark with 440 harmful agent tasks across 11 categories. Tests whether agent frameworks allow harmful behaviors — jailbreaking, weapon synthesis, fraud, and more.
- Evaluation Method: Human / LLM judge
- Top Model Score: N/A
- Human Score: N/A
- Task Count: 440
300 clinical tasks across 10 medical categories using real EHR data. Tests agents on diagnosis reasoning, treatment planning, and medical record navigation in realistic hospital environments.
- Evaluation Method: Expert validation
- Top Model Score: ~77%
- Human Score: N/A
- Task Count: 300
Evaluates both cooperative and competitive multi-agent systems. Tasks include collaborative problem-solving and adversarial games — measures emergent coordination and strategic behavior.
- Evaluation Method: Task-specific
- Top Model Score: N/A
- Human Score: N/A
- Task Count: ~300
Cybersecurity benchmark testing agents on capture-the-flag style challenges. Covers reverse engineering, web exploitation, and cryptography. Designed to stress-test autonomous offensive security agents.
- Evaluation Method: Flag capture
- Top Model Score: ~35%
- Human Score: N/A
- Task Count: ~150
Transaction and inventory reasoning benchmark. Agents manage a virtual vending machine over many turns — testing whether models understand real-world economics, stock levels, and pricing logic.
- Evaluation Method: State verification
- Top Model Score: ~62%
- Human Score: N/A
- Task Count: ~200
Role-playing and character consistency benchmark. Evaluates agents on maintaining persona fidelity, character knowledge accuracy, and in-character behavior across long conversations.
- Evaluation Method: LLM + human judge
- Top Model Score: ~75%
- Human Score: N/A
- Task Count: ~1,000 dialogues
Tests model resistance to confidently stated falsehoods in prompts. Evaluates whether agents can identify and reject plausible-sounding but incorrect premises before acting on them.
- Evaluation Method: Rejection rate
- Top Model Score: N/A
- Human Score: N/A
- Task Count: N/A
Missing a benchmark? Open a PR on GitHub to add it to the registry.
What is an AI agent benchmark?
An AI agent benchmark, eval, or evaluation suite is a structured way to test how well an agent completes tasks in an environment, not just how well a model writes a plausible answer. Instead of grading one response, these tests look at sequences of actions across websites, codebases, tools, desktops, or research workflows. In practice, they measure whether the system can make progress, stay grounded, and reach the correct end state.
That is the main difference between an agent benchmark and a standard LLM eval. A classic LLM test asks whether the model produced the right answer to a prompt. An agent evaluation asks whether the system can plan, recover from mistakes, use the right tools, and complete a workflow under realistic constraints. Strong benchmark leaderboards often track not only accuracy, but also task success, reliability, latency, and cost.
Common methods include exact-match grading, executable test suites, environment-state checks, human review, and LLM-as-judge scoring for open-ended work. Each has tradeoffs in rigor, scalability, and realism. Self-hosted suites are easier to rerun and compare over time, while public-web or live-software evaluations better reflect drift and production messiness. The best way to evaluate AI agents is usually to combine both.
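Putting those pieces together, a benchmark run is usually a loop: execute the agent on each task, score the outcome with whichever method the suite uses, and aggregate success alongside cost and latency. A schematic sketch with hypothetical `run_agent` and `score` callables, just to show the bookkeeping most leaderboards report; it is not any particular suite's harness.

```python
import time
from statistics import mean
from typing import Callable, Iterable

def run_benchmark(
    tasks: Iterable[dict],
    run_agent: Callable[[dict], dict],    # hypothetical: returns {"output": ..., "cost_usd": ...}
    score: Callable[[dict, dict], bool],  # hypothetical: (task, result) -> success
) -> dict:
    """Run every task once and aggregate success rate, mean cost, and mean latency."""
    successes, costs, latencies = [], [], []
    for task in tasks:
        start = time.time()
        result = run_agent(task)
        latencies.append(time.time() - start)
        costs.append(result.get("cost_usd", 0.0))
        successes.append(score(task, result))
    n = len(successes)
    return {
        "success_rate": sum(successes) / n if n else 0.0,
        "mean_cost_usd": mean(costs) if costs else 0.0,
        "mean_latency_s": mean(latencies) if latencies else 0.0,
    }
```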