AI Agent Benchmark Registry
Explore an AI agent eval registry and benchmark leaderboard covering web navigation, coding, desktop control, tool use, deep research, and general reasoning. Compare evaluation suites, tests, frameworks, tasks, evaluators, top scores, and benchmark scope in one place.
How to read this registry
Compare results only when task scope and evaluation method are reasonably comparable. Reproducible suites like WebArena are easier to rerun, while live-web evals like WebVoyager better capture production drift. Start with the category routes for web navigation, coding, and tool use before comparing leaderboard numbers across very different evaluation suites. If you want a single place to browse reported scores across many benchmarks, jump to the Benchmark Index.
Desktop and mobile control benchmarks
369 cross-application desktop tasks across Ubuntu, Windows, and macOS. Covers Chrome, LibreOffice, VS Code, and more. Execution-based evaluation. Agents still score well below the human baseline of 72.4%.
- Evaluation Method: Execution-based
- Top Model Score: 66.2%
- Human Score: 72.4%
- Task Count: 369
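Execution-based evaluation means the grader inspects the resulting environment state rather than the agent's transcript: if the end state matches the goal, the task passes regardless of how the agent got there. A minimal sketch with a hypothetical file-rename task (the task, filenames, and checker are illustrative, not taken from any specific benchmark):

```python
import os
import tempfile

def execution_based_check(workdir: str) -> bool:
    """Hypothetical checker for a task like 'rename notes.txt to notes.md'.
    Pass/fail is decided by the final environment state, not by parsing
    the agent's actions or chat output."""
    return (
        not os.path.exists(os.path.join(workdir, "notes.txt"))
        and os.path.exists(os.path.join(workdir, "notes.md"))
    )

# Simulate an agent run that completed the rename.
workdir = tempfile.mkdtemp()
open(os.path.join(workdir, "notes.md"), "w").close()
print(execution_based_check(workdir))  # True: end state matches the goal
```

Real suites generalize the same idea to app state, registry keys, document contents, or accessibility trees instead of a single file check.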
Cross-platform desktop benchmark covering macOS, Windows, and Ubuntu with 2,000+ tasks. Focuses on real-world app interactions and long-horizon task completion.
- Evaluation Method: Execution-based
- Top Model Score: ~40%
- Human Score: N/A
- Task Count: 2,000+
macOS-specific benchmark with 369 tasks spanning system preferences, Finder, Safari, and productivity apps. Complements OSWorld with platform-specific depth.
- Evaluation Method: Execution-based
- Top Model Score: ~35%
- Human Score: ~72%
- Task Count: 369
154 tasks across real Windows 11 applications running in Azure VMs. Tests document editing, file management, system settings, and browser tasks. Full reproducibility via cloud snapshots.
- Evaluation Method: Programmatic
- Top Model Score: 19.5%
- Human Score: 74.5%
- Task Count: 154
116 tasks across 20 real Android apps in a live emulated environment. Functional evaluation without cached states. Tests agents on real apps including Gmail, Chrome, and Settings.
- Evaluation Method: Functional
- Top Model Score: ~30%
- Human Score: ~88%
- Task Count: 116
A gym environment for mobile UI interaction built on the Android emulator. Provides step-level rewards for fine-grained evaluation of touch-based agent interaction.
- Evaluation Method: Step reward
- Top Model Score: N/A
- Human Score: N/A
- Task Count: ~70
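Step-level rewards differ from end-state grading in that the environment scores each action as it happens, giving a dense training and evaluation signal. A toy sketch of the idea (this is not the real environment's API; the task and reward scheme are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class ToyUIEnv:
    """Toy step-reward UI environment: the agent must tap buttons
    0, 1, 2 in order, and each correct tap earns a partial reward
    instead of a single pass/fail at the end of the episode."""
    target: tuple = (0, 1, 2)
    progress: int = 0

    def reset(self):
        self.progress = 0
        return {"screen": f"expecting tap {self.target[0]}"}

    def step(self, action: int):
        if self.progress < len(self.target) and action == self.target[self.progress]:
            self.progress += 1
            reward = 1.0 / len(self.target)  # dense, step-level signal
        else:
            reward = 0.0
        done = self.progress == len(self.target)
        obs = {"screen": "done" if done else f"expecting tap {self.target[self.progress]}"}
        return obs, reward, done

env = ToyUIEnv()
env.reset()
total = 0.0
for a in (0, 1, 2):
    _, r, done = env.step(a)
    total += r
print(round(total, 2), done)  # 1.0 True
```

The step-level signal is what distinguishes gym-style environments from the pass/fail benchmarks above: partial credit accrues even when the full task is not completed.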
Missing a benchmark? Open a PR on GitHub to add it to the registry.
What is an AI agent benchmark?
An AI agent benchmark, eval, or evaluation suite is a structured way to test how well an agent completes tasks in an environment, not just how well a model writes a plausible answer. Instead of grading one response, these tests look at sequences of actions across websites, codebases, tools, desktops, or research workflows. In practice, they measure whether the system can make progress, stay grounded, and reach the correct end state.
That is the main difference between an agent benchmark and a standard LLM eval. A classic LLM test asks whether the model produced the right answer to a prompt. An agent evaluation asks whether the system can plan, recover from mistakes, use the right tools, and complete a workflow under realistic constraints. Strong benchmark leaderboards often track not only accuracy, but also task success, reliability, latency, and cost.
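Those extra axes are straightforward to compute from run logs. A minimal sketch, assuming a hypothetical log format with one record per task attempt (field names and values are invented for illustration):

```python
from statistics import mean

# Hypothetical run log: one record per (task, trial).
runs = [
    {"task": "book-flight", "success": True,  "latency_s": 41.0, "cost_usd": 0.12},
    {"task": "book-flight", "success": False, "latency_s": 63.0, "cost_usd": 0.18},
    {"task": "file-taxes",  "success": True,  "latency_s": 52.0, "cost_usd": 0.09},
    {"task": "file-taxes",  "success": True,  "latency_s": 49.0, "cost_usd": 0.10},
]

# Task success: fraction of attempts that succeeded.
success_rate = mean(r["success"] for r in runs)

# Reliability: fraction of tasks solved on *every* trial.
tasks = {r["task"] for r in runs}
reliable = mean(
    all(r["success"] for r in runs if r["task"] == t) for t in tasks
)

print(f"success={success_rate:.2f} reliable={reliable:.2f} "
      f"latency={mean(r['latency_s'] for r in runs):.1f}s "
      f"cost=${mean(r['cost_usd'] for r in runs):.3f}")
```

The gap between `success_rate` and `reliable` is often the most telling number: an agent that succeeds 75% of the time but only solves half the tasks consistently behaves very differently in production.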
Common methods include exact-match grading, executable test suites, environment-state checks, human review, and LLM-as-judge scoring for open-ended work. Each has tradeoffs in rigor, scalability, and realism. Self-hosted suites are easier to rerun and compare over time, while public-web or live-software evaluations better reflect drift and production messiness. The best way to evaluate AI agents is usually to combine both.
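Two of those grading methods can be sketched in a few lines: exact-match comparison and an executable test suite. The tasks, function names, and graders below are invented for illustration:

```python
import subprocess
import sys
import tempfile

def exact_match(answer: str, gold: str) -> bool:
    """Cheapest grader: normalized string comparison against a reference."""
    return answer.strip().lower() == gold.strip().lower()

def executable_tests(agent_code: str, test_code: str) -> bool:
    """Run the agent's code against a test suite in a subprocess;
    the grade is whether the tests pass, not how the code looks."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(agent_code + "\n" + test_code)
        path = f.name
    return subprocess.run([sys.executable, path]).returncode == 0

print(exact_match("Paris ", "paris"))                        # True
print(executable_tests("def add(a, b):\n    return a + b",
                       "assert add(2, 3) == 5"))             # True
```

Environment-state checks, human review, and LLM-as-judge scoring follow the same pattern but trade away some of this determinism for broader task coverage.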