AI Agent Benchmark Registry
Explore an AI agent eval registry and benchmark leaderboard covering web navigation, coding, desktop control, tool use, deep research, and general reasoning. Compare evaluation suites, tests, frameworks, tasks, evaluators, top scores, and benchmark scope in one place.
How to read this registry
Compare results only when task scope and evaluation method are reasonably comparable. Reproducible suites like WebArena are easier to rerun, while live-web evals like WebVoyager better capture production drift. Start with the category routes for web navigation, coding, and tool use before comparing leaderboard numbers across very different evaluation suites. If you want a single place to browse reported scores across many benchmarks, jump to the Benchmark Index.
Specialized agent benchmarks
Social intelligence benchmark placing agents in realistic social scenarios. Evaluates believability, social goal completion, relationship management, and secret keeping across 11 social dimensions.
- Evaluation Method: LLM judge (GPT-4)
- Top Model Score: ~7.6/10
- Human Score: ~8.3/10
- Task Count: ~600 episodes
Safety red-teaming benchmark with 440 harmful agent tasks across 11 categories. Tests whether agent frameworks allow harmful behaviors, including jailbreaking, weapon synthesis, and fraud.
- Evaluation Method: Human / LLM judge
- Top Model Score: N/A
- Human Score: N/A
- Task Count: 440
300 clinical tasks across 10 medical categories using real EHR data. Tests agents on diagnosis reasoning, treatment planning, and medical record navigation in realistic hospital environments.
- Evaluation Method: Expert validation
- Top Model Score: ~77%
- Human Score: N/A
- Task Count: 300
Evaluates both cooperative and competitive multi-agent systems. Tasks range from collaborative problem-solving to adversarial games, measuring emergent coordination and strategic behavior.
- Evaluation Method: Task-specific
- Top Model Score: N/A
- Human Score: N/A
- Task Count: ~300
Cybersecurity benchmark testing agents on capture-the-flag style challenges. Covers reverse engineering, web exploitation, and cryptography. Designed to stress-test autonomous offensive security agents.
- Evaluation Method: Flag capture
- Top Model Score: ~35%
- Human Score: N/A
- Task Count: ~150
Transaction and inventory reasoning benchmark. Agents manage a virtual vending machine over many turns, testing whether models understand real-world economics, stock levels, and pricing logic.
- Evaluation Method: State verification
- Top Model Score: ~62%
- Human Score: N/A
- Task Count: ~200
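State-verification grading of this kind can be sketched as a comparison between the agent's final environment state and an expected reference state. The field names below (`stock`, `balance`) are illustrative assumptions, not the benchmark's actual schema:

```python
def verify_state(final_state: dict, expected: dict) -> bool:
    """Pass only if every expected field matches the agent's final environment state.

    The schema is hypothetical: e.g. vending-machine stock levels and cash
    balance after the agent's full turn sequence.
    """
    return all(final_state.get(key) == value for key, value in expected.items())

# Hypothetical end-of-episode check after the agent restocks and sets prices.
final_state = {"stock": {"cola": 10, "chips": 4}, "balance": 23.50}
expected = {"stock": {"cola": 10, "chips": 4}, "balance": 23.50}
assert verify_state(final_state, expected)
```

Because the check runs on environment state rather than on the agent's text output, it is indifferent to how the agent got there, which is exactly what makes state verification robust for long multi-turn tasks.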
Role-playing and character consistency benchmark. Evaluates agents on maintaining persona fidelity, character knowledge accuracy, and in-character behavior across long conversations.
- Evaluation Method: LLM + human judge
- Top Model Score: ~75%
- Human Score: N/A
- Task Count: ~1,000 dialogues
Tests model resistance to confidently stated falsehoods in prompts. Evaluates whether agents can identify and reject plausible-sounding but incorrect premises before acting on them.
- Evaluation Method: Rejection rate
- Top Model Score: N/A
- Human Score: N/A
- Task Count: N/A
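A rejection-rate metric like the one above reduces to counting trials where the agent flagged the false premise instead of acting on it. This is a minimal sketch assuming each trial record carries a boolean `rejected` field; the schema is hypothetical:

```python
def rejection_rate(trials: list) -> float:
    """Fraction of trials where the agent rejected a confidently stated falsehood.

    Each trial dict is assumed to have a boolean `rejected` field set by the
    grader; an empty trial list scores 0.0 by convention.
    """
    if not trials:
        return 0.0
    return sum(t["rejected"] for t in trials) / len(trials)

trials = [
    {"rejected": True},   # agent challenged the false premise
    {"rejected": False},  # agent acted on it
    {"rejected": True},
    {"rejected": True},
]
print(f"rejection rate: {rejection_rate(trials):.0%}")  # rejection rate: 75%
```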
Missing a benchmark? Open a PR on GitHub to add it to the registry.
What is an AI agent benchmark?
An AI agent benchmark, eval, or evaluation suite is a structured way to test how well an agent completes tasks in an environment, not just how well a model writes a plausible answer. Instead of grading one response, these tests look at sequences of actions across websites, codebases, tools, desktops, or research workflows. In practice, they measure whether the system can make progress, stay grounded, and reach the correct end state.
That is the main difference between an agent benchmark and a standard LLM eval. A classic LLM test asks whether the model produced the right answer to a prompt. An agent evaluation asks whether the system can plan, recover from mistakes, use the right tools, and complete a workflow under realistic constraints. Strong benchmark leaderboards often track not only accuracy, but also task success, reliability, latency, and cost.
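Leaderboards that track success, reliability, latency, and cost typically aggregate per-run records. A minimal sketch, where the field names (`success`, `latency_s`, `cost_usd`) are assumptions about what a harness might log:

```python
from statistics import mean

def summarize(runs: list) -> dict:
    """Aggregate per-run records into leaderboard-style metrics.

    Each run dict is assumed to carry `success` (bool), `latency_s` (float),
    and `cost_usd` (float); real harnesses log much more per episode.
    """
    return {
        "task_success": mean(r["success"] for r in runs),
        "mean_latency_s": mean(r["latency_s"] for r in runs),
        "mean_cost_usd": mean(r["cost_usd"] for r in runs),
    }

runs = [
    {"success": True, "latency_s": 12.0, "cost_usd": 0.04},
    {"success": False, "latency_s": 30.0, "cost_usd": 0.10},
]
print(summarize(runs))
```

Reporting cost and latency alongside task success matters because two agents with identical success rates can differ by an order of magnitude in what each solved task costs to run.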
Common methods include exact-match grading, executable test suites, environment-state checks, human review, and LLM-as-judge scoring for open-ended work. Each has tradeoffs in rigor, scalability, and realism. Self-hosted suites are easier to rerun and compare over time, while public-web or live-software evaluations better reflect drift and production messiness. The best way to evaluate AI agents is usually to combine both.
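The grading methods above are often combined behind one dispatch point in an eval harness. This sketch is illustrative only: the method names are assumptions, and the LLM-as-judge branch is left as a stub rather than inventing a model API:

```python
import subprocess

def grade(method: str, output: str, reference: str) -> bool:
    """Dispatch to a grading strategy named by a benchmark's config (hypothetical names)."""
    if method == "exact_match":
        # Cheap and rigorous, but only for tasks with one canonical answer.
        return output.strip() == reference.strip()
    if method == "executable":
        # E.g. run a pytest suite against an agent's code patch; `reference`
        # here stands in for the test path.
        result = subprocess.run(["python", "-m", "pytest", reference],
                                capture_output=True)
        return result.returncode == 0
    if method == "llm_judge":
        # Stub: a real judge would send `output` plus a rubric to a model API.
        raise NotImplementedError("plug in an LLM-as-judge call here")
    raise ValueError(f"unknown grading method: {method}")

assert grade("exact_match", " 42 ", "42")
```

The tradeoffs the paragraph describes show up directly in the branches: exact match is deterministic but narrow, executable tests scale to code but need an environment, and the judge branch is flexible but requires its own calibration.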