steel.dev benchmark hub

Browser Agent Leaderboards

Track the best-performing AI agents and models across browser automation, computer use, research, and coding benchmarks. Each leaderboard page includes methodology notes, scope labels, and source-linked results.

Featured: WebVoyager

View full leaderboard

WebVoyager measures end-to-end browser task completion across live websites. It is widely used for comparing production-style browser agent systems.

Current top system: Jina (98.9%)

WebVoyager leaderboard

Agent scope
Rank | System / Submission | Score | Organization | Notes
1 | Jina | 98.9% | Om Labs | Om Labs custom tracker; Jina multi-model system on self-hosted WebVoyager harness.
2 | Alumnium | 98.6% | Alumnium | Accessibility-tree parsing with integrated visual reasoning.
3 | Surfer 2 | 97.1% | H Company | System-level orchestration with submitter-defined setup details.
4 | Magnitude | 93.9% | Magnitude | Open-source architecture built on a modular agentic stack.
5 | AIME Browser-Use | 92.34% | Aime | Custom orchestration layer with specialized browser tooling.
6 | Surfer-H + Holo1 | 92.2% | H Company | Multi-modal action kernels integrated via H Company research.
7 | Browserable | 90.4% | Browserable | Fine-tuned browser control models within a commercial framework.
8 | Browser Use | 89.1% | Browser Use | Multi-step orchestration framework for open-source automation.
9 | GLM-5V-Turbo | 88.5% | Z.ai | Multimodal vision model optimized for GUI automation and coding tasks.
10 | Agent Kura | 87.0% | Kura | 602/643 tasks (41 removed for invalid/auth issues); reported on trykura.com.
10 | Operator | 87% | OpenAI | Native browser integration using proprietary vision-control models.
12 | Skyvern 2.0 | 85.85% | Skyvern | DOM-level reasoning coupled with real-time error correction.
13 | Project Mariner | 83.5% | Google | Gemini-powered reasoning with precise visual grounding.
14 | Agent-E | 73.1% | Emergence AI | Hierarchical planning modules within a multi-agent framework.
14 | Notte | 73.1% | Notte | Standardized operator stack for open-source performance evaluation.
16 | WebSight | 68% | Academic Research | Navigation system prioritizing visual-only perceptual inputs.
17 | Runner H 0.1 | 67% | H Company | Foundational agent architecture for general web interaction.
18 | WebVoyager | 59.1% | Academic Research | Baseline implementation using standard multimodal LLM control.
19 | Anthropic Computer Use 3.5 | 56.0% | Anthropic | Sampled 50/602 tasks for direct comparison; reported on trykura.com.
20 | WILBUR | 53% | Academic Research | Research implementation using black-box optimization techniques.
21 | GPT-4 (All Tools) | 30.8% | OpenAI | ChatGPT all-tools baseline from the original WebVoyager paper; reported on arxiv.org.

Explore benchmarks

Browser agents

WebVoyager

WebVoyager is a benchmark for browser agents operating on live websites. It focuses on practical tasks such as navigation, search, form completion, and multi-step workflows across a broad website mix.

Top: Jina (98.9%)

Caveat: Rows may use different evaluation settings, so comparisons are not always apples-to-apples.

Open WebVoyager leaderboard

Research/search

BrowseComp

BrowseComp targets difficult browse-and-synthesize research questions that are easy to verify but hard to answer without strong search and reasoning strategy.

Top: GPT-5.5 Pro (90.1%)

Caveat: Mixed-scope benchmark; model-only and tool-augmented rows are not directly comparable.

Open BrowseComp leaderboard

Browser agents

WebArena

WebArena evaluates browser agents in controlled, self-hosted web environments that represent realistic application patterns such as e-commerce, forums, and developer workflows.

Top: DeepSeek V3.2 (74.3%)

Caveat: Even with controlled environments, ranking rows can differ by setup and submission policy.

Open WebArena leaderboard

Coding

SWE-bench Verified

SWE-bench Verified evaluates software engineering performance on real GitHub issues with stricter quality controls than the broader SWE-bench set.

Top: Claude Mythos (93.9%)

Caveat: Model-focused benchmark, but harness and evaluation policy still affect outcomes.

Open SWE-bench Verified leaderboard

Computer use

OSWorld

OSWorld evaluates computer-use agents across 369 real desktop tasks spanning Ubuntu, Windows, and macOS — covering web apps, desktop software, file I/O, and multi-application workflows.

Top: Mythos Preview (79.6%)

Caveat: Self-reported and independently verified rows coexist — check the source before comparing directly.

Open OSWorld leaderboard

Model evals / reasoning

GAIA

GAIA (General AI Assistants) evaluates agents on over 450 real-world questions with unambiguous, verifiable answers — requiring multi-step reasoning, tool use, web search, and file handling across three difficulty levels.

Top: OPS-Agentic-Search (92.36%)

Caveat: Top entries are multi-model ensembles — scores cannot be attributed to any single model.

Open GAIA leaderboard

Browser agents

ClawBench

ClawBench evaluates AI agents on 153 everyday tasks that real people need to complete regularly — booking appointments, completing purchases, submitting job applications, and filling in forms — across 144 live production websites in 15 categories.

Top: Claude Sonnet 4.6 (33.3%)

Caveat: Very new benchmark (April 2026) — published results cover only 7 frontier models. Expect the leaderboard to expand rapidly.

Open ClawBench leaderboard

Browser agents

Online-Mind2Web

Online-Mind2Web is a live browser agent benchmark of 300 diverse, realistic tasks across 136 popular websites — spanning shopping, finance, travel, government, and more. Unlike static offline benchmarks, agents interact with real, dynamic pages as they exist at evaluation time.

Top: Browser Use Cloud (bu-max) (97.0%)

Caveat: Judge methodology varies significantly across submissions — human eval, WebJudge, and custom agentic judges produce different scores for the same agent. Always check the Notes column before comparing rows; a sketch of how an LLM judge works follows below.

Open Online-Mind2Web leaderboard
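To make that caveat concrete, here is a simplified sketch of an automatic LLM judge in the spirit of WebJudge. The prompt wording and the ask_llm helper are illustrative assumptions, not the benchmark's actual implementation; small prompt changes, or swapping in a human reviewer, can flip borderline trajectories, which is how the same agent ends up with different scores under different judges.

    # Simplified LLM-as-judge for a browser-agent trajectory.
    def ask_llm(prompt: str) -> str:
        # Hypothetical: call whichever judge model the evaluation uses.
        raise NotImplementedError

    def judge_trajectory(task: str, actions: list[str], final_state: str) -> bool:
        # Grade a finished run by showing the judge the task, the actions
        # taken, and a description of the final page state.
        prompt = (
            f"Task: {task}\n"
            f"Actions taken: {actions}\n"
            f"Final page state: {final_state}\n"
            "Did the agent complete the task? Answer SUCCESS or FAILURE."
        )
        return ask_llm(prompt).strip().upper().startswith("SUCCESS")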

Model evals / reasoning

τ-bench

τ-bench (TAU-bench) evaluates AI agents in realistic enterprise tool-use scenarios across retail and airline domains — testing multi-turn conversation, policy adherence, database interactions, and rule-following consistency over many trials.

Top: Step-3.5-Flash (88.2%)

Caveat: Score comparisons across organizations require caution — prompt setup, tool schema, and trial count differ between submissions.

Open τ-bench leaderboard

Model evals / reasoning

AgentBench

AgentBench evaluates LLMs as agents across 8 distinct interactive environments — including OS interaction, database querying, knowledge graph traversal, digital card games, lateral thinking puzzles, household tasks, web browsing, and web shopping.

Top: AgentRL w/ Qwen2.5-32B-Instruct (70.4%)

Caveat: Community-submitted leaderboard — rows are self-reported and not independently verified. Check source links before drawing strong conclusions.

Open AgentBench leaderboard

Methodology reminder: some benchmarks measure model capability, while others measure full systems (model + tools + policy). Mixed pages include both and should be read as directional unless setups are fully aligned.

Frequently asked questions

How should I choose a benchmark for my use case?
Start from deployment context: browser workflow automation usually maps to WebVoyager or WebArena, desktop automation maps to OSWorld, deep research maps to BrowseComp, and code-fixing reliability maps to SWE-bench Verified.
Are scores comparable across different benchmarks?
No. Benchmark objectives, datasets, evaluators, and pass criteria differ. Use each benchmark page for within-benchmark comparison, then validate directly on your own workload.
Do leaderboard scores belong to models or systems?
Both exist, depending on page scope. Model pages emphasize base-model capability, while agent pages represent full systems (model + tooling + policy). Mixed pages include both and require extra caution.
Who maintains this leaderboard?
Steel maintains it as an open reference for the browser-agent ecosystem. Steel is browser infrastructure for AI agents — cloud browser sessions with anti-bot handling, proxy rotation, and session replay — used by teams building agents against the benchmarks tracked here. Contributions and corrections are welcome on GitHub.
How do AI browser agents work?
Browser agents combine LLMs with browser automation to complete web tasks. A vision model sees the webpage via screenshots or the DOM. A reasoning model decides actions like clicking, typing, or scrolling. An execution layer drives the browser via the Chrome DevTools Protocol or Playwright. A memory component tracks state across steps. Most agents run on cloud infrastructure like Steel for reliability and anti-bot handling. A minimal sketch of that loop follows below.
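Here is a minimal observe-decide-act loop, using Playwright's sync API for execution. choose_action is a hypothetical placeholder for the model call, not a real library function; production agents add retries, richer memory, and error recovery.

    # Minimal browser-agent loop: observe -> decide -> act.
    from playwright.sync_api import sync_playwright

    def choose_action(screenshot: bytes, goal: str, history: list) -> dict:
        # Hypothetical: send screenshot + goal + history to a vision-capable
        # model and parse its reply into {"type": "click"|"type"|"done", ...}.
        raise NotImplementedError

    def run_agent(goal: str, start_url: str, max_steps: int = 20) -> None:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(start_url)
            history: list = []                              # memory across steps
            for _ in range(max_steps):
                shot = page.screenshot()                    # vision input
                action = choose_action(shot, goal, history) # reasoning step
                if action["type"] == "done":                # agent judges task complete
                    break
                if action["type"] == "click":               # execution layer
                    page.mouse.click(action["x"], action["y"])
                elif action["type"] == "type":
                    page.keyboard.type(action["text"])
                history.append(action)
            browser.close()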
How do I build my own AI browser agent?
Three layers are needed. Browser infrastructure: Steel provides managed sessions, proxies, anti-bot handling, and replay. AI layer: a vision-capable model like GPT-4o, Claude, or Gemini with prompting for action selection. Orchestration: frameworks like Browser Use or Skyvern handle clicking, typing, and state tracking. See the production agents guide. Once your agent has a publicly verifiable benchmark score, open a pull request on GitHub.
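For the infrastructure layer, agents typically attach to a managed cloud browser over the Chrome DevTools Protocol rather than launching Chromium locally. A minimal sketch, assuming your provider (Steel or similar) issues a CDP websocket URL; the exact session-creation call varies by provider and is not shown here.

    # Attach Playwright to a remote cloud browser session over CDP.
    import os
    from playwright.sync_api import sync_playwright

    # Placeholder: obtain this URL from your provider's session API.
    WS_URL = os.environ["BROWSER_SESSION_WS_URL"]

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(WS_URL)
        # Reuse the session's existing context if the provider created one.
        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.new_page()
        page.goto("https://example.com")
        print(page.title())
        browser.close()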
How often is the leaderboard updated?
The leaderboard is updated as new benchmark results are published, with new results typically appearing weekly. If you know of a missing agent or score, pull requests and issues are welcome on GitHub.
How do I add my agent to the leaderboard?
Open a pull request on GitHub with your entry. You need a publicly verifiable benchmark score, a link to the source (paper or blog post), and a homepage or GitHub repo for your agent.