
AI Agent Benchmark Registry


Explore an AI agent eval registry and benchmark leaderboard covering web navigation, coding, desktop control, tool use, deep research, and general reasoning. Compare evaluation suites, tests, frameworks, tasks, evaluators, top scores, and benchmark scope in one place.

How to read this registry

Compare results only when task scope and evaluation method are reasonably comparable. Reproducible suites like WebArena are easier to rerun, while live-web evals like WebVoyager better capture production drift. Start with the category routes for web navigation, coding, and tool use before comparing leaderboard numbers across very different evaluation suites. If you want a single place to browse reported scores across many benchmarks, jump to the Benchmark Index.
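To make that gating concrete, here is a minimal sketch in Python of how a registry entry could be modeled so scores are only compared when scope and evaluation method match. The schema, field names, and the comparable() helper are illustrative assumptions, not a published registry format; GAIA's score comes from the card below, and the SWE-bench score is a placeholder.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkEntry:
    # Illustrative fields only, not a real registry schema.
    name: str
    category: str    # e.g. "general-reasoning", "coding", "web-navigation"
    hosting: str     # "public" or "self-hosted"
    evaluator: str   # e.g. "exact-match", "test-suite", "env-state", "llm-judge"
    top_score: float # reported top model score in [0, 1]

def comparable(a: BenchmarkEntry, b: BenchmarkEntry) -> bool:
    """Only compare scores when task scope and evaluation method match."""
    return a.category == b.category and a.evaluator == b.evaluator

gaia = BenchmarkEntry("GAIA", "general-reasoning", "public", "exact-match", 0.75)
swe = BenchmarkEntry("SWE-bench Verified", "coding", "self-hosted", "test-suite", 0.0)  # placeholder score

assert not comparable(gaia, swe)  # different scope and evaluator: numbers are not comparable
```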

REGISTRY

General reasoning benchmarks

General reasoning benchmarks are useful because they show how well a model handles math, logic, planning, reading comprehension, and hard problem solving before environment noise gets involved. They are not the whole story for agents, but they do provide a baseline for how much raw reasoning power a system brings into browser, coding, or desktop tasks. The most informative evals keep answers verifiable and resist contamination from training data exposure. If you want a broad sense of agent capability, compare GAIA with AgentBench, then read those results next to the more execution-heavy web navigation and coding benchmarks.
GAIA
General reasoning benchmark - Public

466 tasks across 3 difficulty levels requiring tool use, multimodal reasoning, and web browsing. With 587 submissions on Hugging Face, it is the most submitted-to AI agent benchmark to date.

Top Model Score
~75%
Manus / h2oGPTe
Human Score
92%
AgentBench
General reasoning benchmark - Self-hosted
Benchmark By THUDM

8 distinct environments spanning web browsing, OS, database, and game interaction. Tests agents across diverse real-world-like scenarios in a single unified framework.

Top Model Score
~4.27 (overall score)
GPT-4
Human Score
N/A
Humanity's Last Exam
General reasoning benchmark - Public

3,000 expert-level questions across 100+ academic disciplines, crowd-sourced from domain experts. Designed to be at or beyond the frontier of human knowledge — the hardest factual benchmark yet.

Top Model Score
~26%
o3 (high)
Human Score
N/A
ARC-AGI-2
General reasoning benchmark - Public
Benchmark By ARC Prize

The second generation of François Chollet's Abstraction and Reasoning Corpus. Novel visual pattern tasks designed to resist memorization — requires genuine program synthesis from examples.

Top Model Score
~4%
o3 (high)
Human Score
~60%
GPQA
General reasoning benchmark - Public

448 expert-level multiple-choice questions in biology, physics, and chemistry — written and validated by domain PhDs. Only experts in the relevant field consistently score above random.

Top Model Score
~87%
o3
Human Score
~69% (experts)
LiveBench
General reasoning benchmark - Public
Benchmark By LiveBench

Monthly-refreshed benchmark with questions sourced from recent news, papers, and competition math. Designed to prevent data contamination — the benchmark evolves so models can't memorize answers.

Top Model Score
~80%
o3 / Claude 3.7
Human Score
N/A
SimpleQA
General reasoning benchmark - Public
Benchmark By OpenAI

4,326 short factual questions with a single unambiguous correct answer. Measures factual accuracy and hallucination rate — designed to have no trick questions, only clear facts.

Top Model Score
~97%
o3
Human Score
~94%
AgentBoard
General reasoning benchmark - Self-hosted
Benchmark By HKUST

Analytical benchmark across 9 diverse agent scenarios. Provides fine-grained progress rates beyond binary success/fail — measures how far along a task an agent gets even when it fails.

Top Model Score
~58% progress
GPT-4
Human Score
N/A
General reasoning benchmark - Public

Long-horizon agent benchmark requiring sustained reasoning and planning over 50+ steps. Tests whether agents can maintain coherent goals across very long task horizons without losing context.

Top Model Score
~35%
Claude 3.5 Sonnet
Human Score
N/A
AppWorld
General reasoning benchmark - Self-hosted
Benchmark By Stony Brook

An ecosystem of 9 simulated apps with 750 tasks spanning contacts, music, email, maps, and calendar. Tests agents on realistic app-based workflows requiring coordination across multiple simulated apps.

Top Model Score
~49%
GPT-4o
Human Score
N/A

MISSING A BENCHMARK? OPEN A PR ON GITHUB TO ADD IT TO THE REGISTRY.

What is an AI agent benchmark?

An AI agent benchmark, eval, or evaluation suite is a structured way to test how well an agent completes tasks in an environment, not just how well a model writes a plausible answer. Instead of grading one response, these tests look at sequences of actions across websites, codebases, tools, desktops, or research workflows. In practice, they measure whether the system can make progress, stay grounded, and reach the correct end state.

That is the main difference between an agent benchmark and a standard LLM eval. A classic LLM test asks whether the model produced the right answer to a prompt. An agent evaluation asks whether the system can plan, recover from mistakes, use the right tools, and complete a workflow under realistic constraints. Strong benchmark leaderboards often track not only accuracy, but also task success, reliability, latency, and cost.
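As a hedged illustration of what tracking those extra dimensions can look like, the sketch below aggregates per-attempt records into a leaderboard-style row. The record fields and all numbers are invented for the example.

```python
from statistics import mean

# Invented example records: one per attempt at the same task.
runs = [
    {"success": True,  "latency_s": 41.2, "cost_usd": 0.18},
    {"success": True,  "latency_s": 38.9, "cost_usd": 0.17},
    {"success": False, "latency_s": 95.0, "cost_usd": 0.41},
]

row = {
    "task_success_rate": mean(r["success"] for r in runs),  # 2/3 here
    "reliable": all(r["success"] for r in runs),            # succeeds on every attempt?
    "mean_latency_s": round(mean(r["latency_s"] for r in runs), 1),
    "mean_cost_usd": round(mean(r["cost_usd"] for r in runs), 2),
}
print(row)
```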

Common methods include exact-match grading, executable test suites, environment-state checks, human review, and LLM-as-judge scoring for open-ended work. Each has tradeoffs in rigor, scalability, and realism. Self-hosted suites are easier to rerun and compare over time, while public-web or live-software evaluations better reflect drift and production messiness. The best way to evaluate AI agents is usually to combine both.
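The sketch below contrasts two of those methods on the same hypothetical task (cancel an order): exact-match grading of the final answer versus a check of the environment's end state. The task, strings, and state layout are made up for illustration.

```python
def exact_match(answer: str, expected: str) -> bool:
    # Grade the words: strict, cheap, easy to scale.
    return answer.strip().lower() == expected.strip().lower()

def env_state_ok(env: dict) -> bool:
    # Grade the world: did the agent leave the required end state?
    return env["orders"].get("A-1001") == "cancelled"

env = {"orders": {"A-1001": "cancelled"}}  # hypothetical post-run state
print(exact_match("Done, I cancelled it.", "order A-1001 cancelled"))  # False: right outcome, wrong string
print(env_state_ok(env))                                               # True: end state verified
```

The divergence between the two print calls is exactly why a score from one evaluator is not directly comparable to a score from another.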

FAQ
What are AI agent benchmarks?
AI agent benchmarks are evaluations that measure whether an agent can complete multi-step tasks in an environment such as a browser, terminal, desktop, or tool stack. Unlike single-prompt model tests, they focus on action quality, task completion, recovery from mistakes, and end-to-end execution.
What is an agent eval registry?
An agent eval registry is a curated index of AI agent benchmarks, evaluations, leaderboards, test suites, and frameworks. Instead of covering just one benchmark family, it helps you compare multiple evaluation options across web navigation, coding, desktop control, and tool use in one place.
How do you evaluate AI agents?
You evaluate AI agents by testing them on multi-step tasks in realistic environments and measuring whether they reach the correct end state. Strong agent evaluations usually track task success, evaluator design, reliability, cost, latency, and recovery from mistakes. The right eval framework depends on whether you care about browser use, coding, tool use, desktop control, or general reasoning.
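A minimal evaluation loop, assuming a hypothetical agent/environment interface (reset, step, act, and goal_reached are placeholders, not a real framework API), looks roughly like this:

```python
def run_episode(agent, env, max_steps: int = 50) -> dict:
    """Step the agent until it finishes or the budget runs out,
    then grade the final environment state, not the final message."""
    obs = env.reset()
    steps = 0
    for steps in range(1, max_steps + 1):
        action = agent.act(obs)
        obs, done = env.step(action)
        if done:
            break
    return {"success": env.goal_reached(), "steps": steps}
```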
What is the best benchmark for coding agents?
There is no single best benchmark for every use case, but SWE-bench Verified is widely treated as the most trusted benchmark for coding agents because it uses real repository issues and executable test suites. Terminal-Bench is also useful when you want to evaluate autonomous agents in command-line and systems workflows.
How do browser agent benchmarks differ from coding benchmarks?
Browser agent benchmarks evaluate interaction with websites, page state, navigation, and visual or DOM-grounded actions. Coding benchmarks evaluate repository understanding, file edits, tool use, debugging, and test execution. Compare WebVoyager with SWE-bench Verified to see how different the environments and failure modes are.
What does self-hosted mean in an agent benchmark?
Self-hosted means the benchmark environment can be run in a controlled local or containerized setup instead of depending entirely on live public services. That usually improves reproducibility and evaluation stability, but may be less representative of the messy real web or production software. Benchmarks like WebArena and OSWorld are good examples.
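For instance, a self-hosted environment is typically started locally before the eval harness connects to it. The sketch below uses Docker with a made-up image name and port, so treat both as placeholders for the benchmark's actual published environment.

```python
import subprocess

# Placeholder image and port: substitute the benchmark's real
# published environment image before running.
subprocess.run(
    ["docker", "run", "-d", "--rm",
     "--name", "agent-eval-env",
     "-p", "7770:7770",
     "example/benchmark-env:latest"],
    check=True,
)
```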
Why do benchmark scores differ across evaluators?
Benchmark scores differ because evaluators measure success in different ways. Some use exact match, some use executable test suites, some verify environment state, and others rely on human review or LLM judges. A score on one benchmark or evaluator is not directly comparable to a score produced by a different evaluation method. Use the How to read this registry note before comparing results.