Steel.dev®

AI Agent Benchmark Results Index


Browse 121 reported results across 16 benchmarks — WebVoyager, WebArena, OSWorld, SWE-bench, GAIA, BrowseComp, and more. Filter by category, benchmark, or search by agent or organization name.

Need benchmark definitions, evaluator details, and links to papers before comparing scores? Start with the Benchmark Registry.

BENCHMARK INDEX
121 RESULTS
AGENT · ORGANIZATION · BENCHMARK · SRC · OSS
Surfer 2 [NEW] · H Company · WebVoyager · SELF · NO
Magnitude [NEW] · Magnitude · WebVoyager · SELF · YES
AIME Browser-Use [NEW] · Aime · WebVoyager · SELF · NO
Surfer-H + Holo1 [NEW] · H Company · WebVoyager · 3RD · NO
Browserable [NEW] · Browserable · WebVoyager · SELF · YES
Browser Use · Browser Use · WebVoyager · SELF · YES
Operator · OpenAI · WebVoyager · SELF · NO
Skyvern 2.0 · Skyvern · WebVoyager · SELF · YES
Project Mariner · Google · WebVoyager · SELF · NO
Agent-E · Emergence AI · WebVoyager · SELF · NO
Proxy Lite · Convergence AI · WebVoyager · SELF · NO
WebSight · Academic · WebVoyager · 3RD · YES
Runner H 0.1 · H Company · WebVoyager · SELF · NO
WebVoyager · Academic · WebVoyager · 3RD · YES
WILBUR · Academic · WebVoyager · 3RD · NO
DeepSeek v3.2 [NEW] · DeepSeek · WebArena · 3RD · YES
OpAgent · CodeFuse AI · WebArena · 3RD · YES
ColorBrowserAgent · ColorBrowser · WebArena · 3RD · YES
Claude Code+GBOX · GBOX AI · WebArena · 3RD · NO
DeepSky Agent · DeepSky · WebArena · SELF · NO
Narada AI · Narada · WebArena · SELF · NO
IBM CUGA · IBM · WebArena · 3RD · NO
OpenAI Operator · OpenAI · WebArena · SELF · NO
Jace.AI · Jace AI · WebArena · SELF · NO
ORCHESTRA · UNC x Ventus · WebArena · 3RD · NO
WebOperator+GPT-4o · WebOperator · WebArena · 3RD · YES
ScribeAgent+GPT-4o · Scribe · WebArena · 3RD · NO
AgentSymbiotic · Academic · WebArena · 3RD · YES
Learn-by-Interact · Academic · WebArena · 3RD · YES
AgentOccam-Judge · Academic · WebArena · 3RD · YES
WebPilot · WebPilot · WebArena · 3RD · NO
GUI-API Hybrid · Academic · WebArena · 3RD · YES
AWM · Academic · WebArena · 3RD · YES
Magentic-One · Microsoft · WebArena · 3RD · YES
GPT-4 baseline · OpenAI · WebArena · 3RD · NO
SeeAct + GPT-5 [NEW] · Academic · Online-Mind2Web · 3RD · YES
Browser-Use · Browser Use · Online-Mind2Web · 3RD · YES
SeeAct + o3 · Academic · Online-Mind2Web · 3RD · YES
Kimi K2 Thinking [NEW] · Moonshot AI · BrowseComp · SELF · YES
Deep Research · OpenAI · BrowseComp · SELF · NO
WebSailor-72B · Academic · BrowseComp · 3RD · YES
GPT-4o + browsing · OpenAI · BrowseComp · SELF · NO
openJiuwen-deepagent [NEW] · Alibaba Cloud · GAIA · 3RD · YES
Lemon Agent [NEW] · openJiuwen · GAIA · 3RD · YES
JoinAI_V2.2 [NEW] · Lenovo CTO Org · GAIA · 3RD · NO
Nemotron-ToolOrchestra [NEW] · NVIDIA · GAIA · 3RD · YES
SU Zero (Shuqian Pro) [NEW] · Shuqian Tech · GAIA · 3RD · NO
HALO V1217-1 [NEW] · Microsoft AI Asia · GAIA · 3RD · NO
MiroThinker (Shawn) [NEW] · MiroMindAI · GAIA · 3RD · YES
h2oGPTe Agent [NEW] · H2O.ai · GAIA · SELF · NO
Manus · Manus AI · GAIA · 3RD · NO
Deep Research · OpenAI · GAIA · SELF · NO
MS Research (o1) · Microsoft · GAIA · 3RD · NO
HF Agents · Hugging Face · GAIA · 3RD · YES
GPT-4 + tools · OpenAI · GAIA · 3RD · NO
GPT-5.4 [NEW] · OpenAI · OSWorld · SELF · NO
Claude Opus 4.6 [NEW] · Anthropic · OSWorld · SELF · NO
UiPath Screen Agent [NEW] · UiPath · OSWorld · SELF · NO
Simular Agent S2 · Simular · OSWorld · 3RD · NO
Agent S3 [NEW] · Simular AI · OSWorld · 3RD · YES
AskUI VisionAgent · AskUI · OSWorld · SELF · NO
CoACT-1 · USC / Salesforce · OSWorld · 3RD · YES
Agent S2.5 w/ o3 · Simular AI · OSWorld · 3RD · YES
GTA1 w/ o3 · Salesforce · OSWorld · 3RD · NO
OpenAI CUA (o3) · OpenAI · OSWorld · 3RD · NO
UI-TARS-1.5 · ByteDance · OSWorld · 3RD · YES
Agent S2 w/ Gemini · Simular AI · OSWorld · 3RD · YES
OpenAI CUA (4o) · OpenAI · OSWorld · 3RD · NO
Claude 3.7 (CU) · Anthropic · OSWorld · 3RD · NO
Qwen2.5-VL-72B · Alibaba · OSWorld · 3RD · YES
UI-TARS-7B · ByteDance · OSWorld · 3RD · YES
GPT-4o (SoM) · OpenAI · OSWorld · 3RD · NO
Agent S3 [NEW] · Simular AI · AndroidWorld · 3RD · YES
AskUI AndroidVA · AskUI · AndroidWorld · SELF · NO
M3A (Gemini 1.5) · Google DeepMind · AndroidWorld · 3RD · NO
Sonar Foundation [NEW] · Sonar · — · 3RD · NO
Claude Opus 4.5 [NEW] · Anthropic · — · 3RD · NO
Gemini 3 Pro [NEW] · Google · — · 3RD · NO
Claude Opus 4.6 [NEW] · Anthropic · — · 3RD · NO
GPT-5.2 Codex [NEW] · OpenAI · — · 3RD · NO
OpenAI o3 · OpenAI · — · SELF · NO
Claude 3.7 Sonnet · Anthropic · — · SELF · NO
Devin 2.0 · Cognition · — · SELF · NO
Gru · Mutable AI · — · SELF · NO
Devstral (2512) · Mistral · — · SELF · YES
Qwen3-Coder-480B · Alibaba · — · SELF · YES
Gemini 2.5 Pro · Google · — · SELF · NO
GPT-4.1 · OpenAI · — · SELF · NO
Kimi K2 Thinking · Moonshot AI · — · SELF · YES
mini-SWE-agent · Academic · — · 3RD · YES
Claude Opus 4.1 [NEW] · Anthropic · SWE-bench Pro · 3RD · NO
OpenAI GPT-5 [NEW] · OpenAI · SWE-bench Pro · 3RD · NO
Claude Sonnet 4.5 [NEW] · Anthropic · SWE-bench Pro · 3RD · NO
GPT-5.3-Codex (CLI) [NEW] · OpenAI · SWE-bench Pro · SELF · NO
Auggie [NEW] · Augment Code · SWE-bench Pro · SELF · NO
Cursor [NEW] · Anysphere · SWE-bench Pro · SELF · NO
Claude Code [NEW] · Anthropic · SWE-bench Pro · SELF · NO
AIDE (Claude 3.5) · Anthropic · MLE-bench · 3RD · NO
o3 · OpenAI · GPQA Diamond · SELF · NO
Gemini 2.5 Pro · Google · GPQA Diamond · SELF · NO
Claude 3.7 Sonnet · Anthropic · GPQA Diamond · SELF · NO
Llama 4 Maverick · Meta · GPQA Diamond · SELF · YES
DeepSeek-R1 · DeepSeek · GPQA Diamond · SELF · YES
GPT-4o · OpenAI · GPQA Diamond · SELF · NO
o3 (high) · OpenAI · ARC-AGI-2 (~4%) · 3RD · NO
Gemini 2.5 Pro · Google · ARC-AGI-2 (~2%) · 3RD · NO
ToolLLaMA-v2 · Academic · ToolBench · 3RD · YES
GPT-4 · OpenAI · ToolBench · 3RD · NO
Claude 3 Opus · Anthropic · ToolBench · 3RD · NO
GPT-4 · OpenAI · AgentBench · 3RD · NO
Claude 2 · Anthropic · AgentBench · 3RD · NO
GPT-3.5-turbo · OpenAI · AgentBench · 3RD · NO
Claude 3.5 Sonnet [NEW] · Anthropic · Tau-bench · 3RD · NO
GPT-4o · OpenAI · Tau-bench · 3RD · NO
GPT-4-turbo · OpenAI · Tau-bench · 3RD · NO
Llama 3.1 70B · Meta · Tau-bench · 3RD · YES
o3-mini (high) [NEW] · OpenAI · HumanEval+ · SELF · NO
Claude 3.7 Sonnet · Anthropic · HumanEval+ · SELF · NO
GPT-4o · OpenAI · HumanEval+ · SELF · NO
DeepSeek-Coder-V2 · DeepSeek · HumanEval+ · SELF · YES
Llama 3.1 405B · Meta · HumanEval+ · SELF · YES

SRC: SELF = SELF-REPORTED · 3RD = INDEPENDENTLY VERIFIED · OSS = OPEN SOURCE
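The filtering and search described in the intro (by benchmark, or by agent or organization name) can be sketched as a simple predicate over entries. This is a hypothetical sketch, not the site's actual implementation; the `Entry` fields mirror the table's columns.

```typescript
type Source = "SELF" | "3RD";

interface Entry {
  agent: string;
  org: string;
  benchmark: string;
  source: Source;
  openSource: boolean;
}

// Filter by exact benchmark name and/or a case-insensitive substring
// search over agent and organization names, mirroring the index's controls.
function filterEntries(
  entries: Entry[],
  opts: { benchmark?: string; query?: string },
): Entry[] {
  const q = opts.query?.toLowerCase();
  return entries.filter((e) => {
    if (opts.benchmark !== undefined && e.benchmark !== opts.benchmark) {
      return false;
    }
    if (q !== undefined && !(e.agent + " " + e.org).toLowerCase().includes(q)) {
      return false;
    }
    return true;
  });
}
```

With no options set, the predicate accepts every entry, so the unfiltered table is just `filterEntries(entries, {})`.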

FAQ
What is an AI agent benchmark index?
A benchmark index collects results from multiple evaluations in one place so you can compare agents across different task types without visiting each benchmark's leaderboard separately. This index tracks results across web navigation, desktop control, coding, research, tool use, general reasoning, and specialized categories.
Why can't I compare scores across different benchmarks?
Each benchmark measures something different under different conditions. A 70% on WebArena (programmatic evaluation, self-hosted Docker) and a 70% on WebVoyager (GPT-4V judge, live websites) are not equivalent — tasks, environments, graders, and difficulty levels all differ. Use scores within a single benchmark for head-to-head comparison.
What is the difference between self-reported and third-party verified scores?
A self-reported (SELF) score was published by the organization that built the agent. These may be accurate but harder to verify — organizations sometimes use custom evaluation settings, filtered subsets, or different prompt configurations. A third-party verified (3RD) score was independently evaluated by the benchmark authors, an academic lab, or a neutral platform like Princeton HAL. Prefer 3RD scores when comparing agents head-to-head.
Which benchmark should I use to evaluate a browser agent?
Start with WebVoyager — it's the most widely adopted, uses live websites, and has the most agents benchmarked for easy comparison. If you need reproducibility and programmatic grading with no LLM judge, use WebArena instead. For cost-aware evaluation with independent verification, use Online-Mind2Web via Princeton HAL.
Which benchmark should I use to evaluate a coding agent?
SWE-bench Verified is the most trusted signal — 500 human-verified GitHub issues from real Python repos, programmatically graded. For a faster, cheaper proxy use SWE-bench Lite. For command-line and sysadmin work, use Terminal-Bench 2.0.
Which benchmark should I use to evaluate a desktop automation agent?
OSWorld is the most comprehensive — 369 cross-application tasks across Ubuntu, Windows, and macOS with execution-based evaluation. For platform-specific depth, Windows Agent Arena covers Windows 11 via Azure VMs and AndroidWorld covers 20 real Android apps.
Why are scores on some benchmarks so low?
Some benchmarks are intentionally hard. BrowseComp was designed so most agents fail — the best scores hover around 60%. ARC-AGI-2 sits at ~4% for top models because it tests genuine visual reasoning that resists memorization. Low scores signal a benchmark still has meaningful room for improvement, making it more useful for tracking progress.
How do I add a result to this index?
Open a pull request on GitHub adding an entry to src/lib/index-data.ts. Include the agent name, organization, benchmark, score, a source link, and whether the result is self-reported or independently verified.
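A submitted entry might look like the sketch below. The interface and field names are illustrative assumptions about the shape of `src/lib/index-data.ts`, not the repo's actual schema; check the existing entries in that file and match their format.

```typescript
// Hypothetical entry shape for src/lib/index-data.ts — the field names
// here are assumptions for illustration, not the repo's actual schema.
interface IndexResult {
  agent: string;                 // agent name as it should appear in the table
  org: string;                   // organization reporting the result
  benchmark: string;             // e.g. "WebVoyager", "OSWorld"
  score: number;                 // reported success rate, as a percentage
  sourceUrl: string;             // link to the paper, blog post, or leaderboard
  verification: "SELF" | "3RD";  // self-reported vs independently verified
  openSource: boolean;
}

const newResult: IndexResult = {
  agent: "Example Agent",
  org: "Example Org",
  benchmark: "WebVoyager",
  score: 82.5,
  sourceUrl: "https://example.com/announcement",
  verification: "3RD",
  openSource: true,
};
```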