Steel.dev®
AI Agent Benchmark Results Index
Browse 121 reported results across 16 benchmarks — WebVoyager, WebArena, OSWorld, SWE-bench, GAIA, BrowseComp, and more. Filter by category or benchmark, or search by agent or organization name.
Need benchmark definitions, evaluator details, and links to papers before comparing scores? Start with the Benchmark Registry.
BENCHMARK INDEX
121 RESULTS
SRC: SELF = SELF-REPORTED · 3RD = INDEPENDENTLY VERIFIED · OSS = OPEN SOURCE
FAQ
What is an AI agent benchmark index?
A benchmark index collects results from multiple evaluations in one place so you can compare agents across different task types without visiting each benchmark's leaderboard separately. This index tracks results across web navigation, desktop control, coding, research, tool use, general reasoning, and specialized categories.
Why can't I compare scores across different benchmarks?
Each benchmark measures something different under different conditions. A 70% on WebArena (programmatic evaluation, self-hosted Docker) and a 70% on WebVoyager (GPT-4V judge, live websites) are not equivalent — tasks, environments, graders, and difficulty levels all differ. Use scores within a single benchmark for head-to-head comparison.
What is the difference between self-reported and third-party verified scores?
A self-reported (SELF) score was published by the organization that built the agent. These may be accurate but harder to verify — organizations sometimes use custom evaluation settings, filtered subsets, or different prompt configurations. A third-party verified (3RD) score was independently evaluated by the benchmark authors, an academic lab, or a neutral platform like Princeton HAL. Prefer 3RD scores when comparing agents head-to-head.
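If you consume the index data programmatically, a filter along these lines keeps comparisons honest. This is a minimal TypeScript sketch assuming a hypothetical BenchmarkResult shape; the field names are illustrative, not this site's actual schema:

type SourceTag = "self" | "3rd";

interface BenchmarkResult {
  agent: string;
  organization: string;
  benchmark: string;
  score: number; // percent, e.g. 70 for 70%
  source: SourceTag; // "self" = self-reported, "3rd" = independently verified
}

// Keep only independently verified scores for one benchmark, best first,
// so the comparison stays head-to-head within a single evaluation.
function verifiedLeaderboard(
  results: BenchmarkResult[],
  benchmark: string,
): BenchmarkResult[] {
  return results
    .filter((r) => r.benchmark === benchmark && r.source === "3rd")
    .sort((a, b) => b.score - a.score);
}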
Which benchmark should I use to evaluate a browser agent?
Start with WebVoyager — it's the most widely adopted, uses live websites, and has the most agents benchmarked for easy comparison. If you need reproducibility and programmatic grading with no LLM judge, use WebArena instead. For cost-aware evaluation with independent verification, use Online-Mind2Web via Princeton HAL.
Which benchmark should I use to evaluate a coding agent?
SWE-bench Verified is the most trusted signal — 500 human-verified GitHub issues from real Python repos, programmatically graded. For a faster, cheaper proxy use SWE-bench Lite. For command-line and sysadmin work, use Terminal-Bench 2.0.
Which benchmark should I use to evaluate a desktop automation agent?
OSWorld is the most comprehensive — 369 cross-application tasks across Ubuntu, Windows, and macOS with execution-based evaluation. For platform-specific depth, Windows Agent Arena covers Windows 11 via Azure VMs and AndroidWorld covers 20 real Android apps.
Why are scores on some benchmarks so low?
Some benchmarks are intentionally hard. BrowseComp was designed so most agents fail — the best scores hover around 60%. ARC-AGI-2 sits at ~4% for top models because it tests genuine visual reasoning that resists memorization. Low scores signal a benchmark still has meaningful room for improvement, making it more useful for tracking progress.
How do I add a result to this index?
Open a pull request on GitHub adding an entry to src/lib/index-data.ts. Include the agent name, organization, benchmark, score, a source link, and whether the result is self-reported or independently verified.
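As a rough illustration, an entry might look like the following TypeScript sketch. The field names are assumptions based on the list above, not the file's actual schema:

// Hypothetical entry; the real shape in src/lib/index-data.ts may differ.
const exampleEntry = {
  agent: "ExampleAgent",                     // agent name
  organization: "Example Labs",              // organization that built it
  benchmark: "WebVoyager",                   // benchmark the score was measured on
  score: 72.4,                               // reported score (%)
  sourceUrl: "https://example.com/results",  // link to the published result
  source: "self",                            // "self" = self-reported, "3rd" = independently verified
};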