AI Agent Benchmark Registry
Explore an AI agent eval registry and benchmark leaderboard covering web navigation, coding, desktop control, tool use, deep research, and general reasoning. Compare evaluation suites, tests, frameworks, tasks, evaluators, top scores, and benchmark scope in one place.
How to read this registry
Compare results only when task scope and evaluation method are reasonably comparable. Reproducible suites like WebArena are easier to rerun, while live-web evals like WebVoyager better capture production drift. Start with the category routes for web navigation, coding, and tool use before comparing leaderboard numbers across very different evaluation suites. If you want a single place to browse reported scores across many benchmarks, jump to the Benchmark Index.
Specialized agent benchmarks
Social intelligence benchmark placing agents in realistic social scenarios. Evaluates believability, social goal completion, relationship management, and secret keeping across 11 social dimensions.
- Evaluation Method: LLM judge (GPT-4)
- Top Model Score: ~7.6/10
- Human Score: ~8.3/10
- Task Count: ~600 episodes
Safety red-teaming benchmark with 440 harmful agent tasks across 11 categories. Tests whether agent frameworks allow harmful behaviors, including jailbreaking, weapon synthesis, and fraud.
- Evaluation Method: Human / LLM judge
- Top Model Score: N/A
- Human Score: N/A
- Task Count: 440
300 clinical tasks across 10 medical categories using real EHR data. Tests agents on diagnosis reasoning, treatment planning, and medical record navigation in realistic hospital environments.
- Evaluation Method: Expert validation
- Top Model Score: ~77%
- Human Score: N/A
- Task Count: 300
Evaluates both cooperative and competitive multi-agent systems. Tasks range from collaborative problem-solving to adversarial games, measuring emergent coordination and strategic behavior.
- Evaluation Method: Task-specific
- Top Model Score: N/A
- Human Score: N/A
- Task Count: ~300
Cybersecurity benchmark testing agents on capture-the-flag style challenges. Covers reverse engineering, web exploitation, and cryptography. Designed to stress-test autonomous offensive security agents.
- Evaluation Method: Flag capture
- Top Model Score: ~35%
- Human Score: N/A
- Task Count: ~150
Transaction and inventory reasoning benchmark. Agents manage a virtual vending machine over many turns, testing whether models understand real-world economics, stock levels, and pricing logic.
- Evaluation Method: State verification
- Top Model Score: ~62%
- Human Score: N/A
- Task Count: ~200
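State-verification grading of this kind can be sketched as a comparison between the agent's final environment state and an expected reference state. The field names below (`stock`, `balance`) are illustrative assumptions, not the benchmark's actual schema:

```python
def verify_state(final_state: dict, expected: dict) -> bool:
    """Pass only if every expected field matches the agent's final environment state.

    The schema is hypothetical: e.g. vending-machine stock levels and cash
    balance after the agent's full turn sequence.
    """
    return all(final_state.get(key) == value for key, value in expected.items())

# Hypothetical end-of-episode check after the agent restocks and sets prices.
final_state = {"stock": {"cola": 10, "chips": 4}, "balance": 23.50}
expected = {"stock": {"cola": 10, "chips": 4}, "balance": 23.50}
assert verify_state(final_state, expected)
```

Because the check runs on environment state rather than on the agent's text output, it is indifferent to how the agent got there, which is exactly what makes state verification robust for long multi-turn tasks.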
Role-playing and character consistency benchmark. Evaluates agents on maintaining persona fidelity, character knowledge accuracy, and in-character behavior across long conversations.
- Evaluation Method: LLM + human judge
- Top Model Score: ~75%
- Human Score: N/A
- Task Count: ~1,000 dialogues
Tests model resistance to confidently stated falsehoods in prompts. Evaluates whether agents can identify and reject plausible-sounding but incorrect premises before acting on them.
- Evaluation Method: Rejection rate
- Top Model Score: N/A
- Human Score: N/A
- Task Count: N/A
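A rejection-rate metric like the one above reduces to counting trials where the agent flagged the false premise instead of acting on it. This is a minimal sketch assuming each trial record carries a boolean `rejected` field; the schema is hypothetical:

```python
def rejection_rate(trials: list) -> float:
    """Fraction of trials where the agent rejected a confidently stated falsehood.

    Each trial dict is assumed to have a boolean `rejected` field set by the
    grader; an empty trial list scores 0.0 by convention.
    """
    if not trials:
        return 0.0
    return sum(t["rejected"] for t in trials) / len(trials)

trials = [
    {"rejected": True},   # agent challenged the false premise
    {"rejected": False},  # agent acted on it
    {"rejected": True},
    {"rejected": True},
]
print(f"rejection rate: {rejection_rate(trials):.0%}")  # rejection rate: 75%
```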
Missing a benchmark? Open a PR on GitHub to add it to the registry.
What is an AI agent benchmark?
An AI agent benchmark, eval, or evaluation suite is a structured way to test how well an agent completes tasks in an environment, not just how well a model writes a plausible answer. Instead of grading one response, these tests look at sequences of actions across websites, codebases, tools, desktops, or research workflows. In practice, they measure whether the system can make progress, stay grounded, and reach the correct end state.
That is the main difference between an agent benchmark and a standard LLM eval. A classic LLM test asks whether the model produced the right answer to a prompt. An agent evaluation asks whether the system can plan, recover from mistakes, use the right tools, and complete a workflow under realistic constraints. Strong benchmark leaderboards often track not only accuracy, but also task success, reliability, latency, and cost.
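Leaderboards that track success, reliability, latency, and cost typically aggregate per-run records. A minimal sketch, where the field names (`success`, `latency_s`, `cost_usd`) are assumptions about what a harness might log:

```python
from statistics import mean

def summarize(runs: list) -> dict:
    """Aggregate per-run records into leaderboard-style metrics.

    Each run dict is assumed to carry `success` (bool), `latency_s` (float),
    and `cost_usd` (float); real harnesses log much more per episode.
    """
    return {
        "task_success": mean(r["success"] for r in runs),
        "mean_latency_s": mean(r["latency_s"] for r in runs),
        "mean_cost_usd": mean(r["cost_usd"] for r in runs),
    }

runs = [
    {"success": True, "latency_s": 12.0, "cost_usd": 0.04},
    {"success": False, "latency_s": 30.0, "cost_usd": 0.10},
]
print(summarize(runs))
```

Reporting cost and latency alongside task success matters because two agents with identical success rates can differ by an order of magnitude in what each solved task costs to run.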
Common methods include exact-match grading, executable test suites, environment-state checks, human review, and LLM-as-judge scoring for open-ended work. Each has tradeoffs in rigor, scalability, and realism. Self-hosted suites are easier to rerun and compare over time, while public-web or live-software evaluations better reflect drift and production messiness. The best way to evaluate AI agents is usually to combine both.
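The grading methods above are often combined behind one dispatch point in an eval harness. This sketch is illustrative only: the method names are assumptions, and the LLM-as-judge branch is left as a stub rather than inventing a model API:

```python
import subprocess

def grade(method: str, output: str, reference: str) -> bool:
    """Dispatch to a grading strategy named by a benchmark's config (hypothetical names)."""
    if method == "exact_match":
        # Cheap and rigorous, but only for tasks with one canonical answer.
        return output.strip() == reference.strip()
    if method == "executable":
        # E.g. run a pytest suite against an agent's code patch; `reference`
        # here stands in for the test path.
        result = subprocess.run(["python", "-m", "pytest", reference],
                                capture_output=True)
        return result.returncode == 0
    if method == "llm_judge":
        # Stub: a real judge would send `output` plus a rubric to a model API.
        raise NotImplementedError("plug in an LLM-as-judge call here")
    raise ValueError(f"unknown grading method: {method}")

assert grade("exact_match", " 42 ", "42")
```

The tradeoffs the paragraph describes show up directly in the branches: exact match is deterministic but narrow, executable tests scale to code but need an environment, and the judge branch is flexible but requires its own calibration.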