Browser Agent Leaderboards

Track the best-performing AI agents and models on benchmarks for browser agents, computer use, research, and coding. Each leaderboard page includes methodology notes, scope labels, and source-linked benchmark results.

WebVoyager

WebVoyager evaluates end-to-end browser agents on 643 tasks across 15 popular real-world websites. Tasks cover search, navigation, form filling, map and travel lookup, shopping, and information retrieval on live pages rather than static snapshots.

Model Score
Alumnium 98.5%
Surfer 2 97.1%
Magnitude 93.9%

Current leader: Alumnium

BrowseComp

BrowseComp is OpenAI's benchmark for difficult agentic web research: 1,266 short-answer questions where the answer is easy to verify once found but hard to locate without persistent browsing.

Model Score
GPT-5.5 Pro 90.1%
GPT-5.4 Pro 89.3%
MiroThinker-H1 88.2%

Current leader: GPT-5.5 Pro

WebArena

WebArena evaluates browser agents in reproducible, self-hosted websites instead of the open live web. Its 812 tasks cover e-commerce, forum discussion, collaborative software development, content management, maps, and reference lookup.

Model Score
WebTactix (DeepSeek v3.2) 74.3%
OpAgent 71.6%
ColorBrowserAgent 71.2%

Current leader: WebTactix (DeepSeek v3.2)

SWE-bench Verified

SWE-bench Verified is the 500-instance human-reviewed split of SWE-bench, built from real GitHub issues in popular Python repositories. Agents receive an issue and repository state, then generate a patch.

Model Score
Claude Mythos 93.9%
Claude Opus 4.7 87.6%
Claude Opus 4.5 80.9%

Current leader: Claude Mythos

OSWorld

OSWorld evaluates multimodal computer-use agents in real desktop environments across 369 tasks involving web apps, desktop software, files, and workflows spanning multiple applications.

Model Score
Mythos Preview 79.6%
OSAgent 76.26%
GPT-5.4 75.0%

Current leader: Mythos Preview

GAIA

GAIA evaluates general AI assistants on 466 real-world questions requiring reasoning, web browsing, multimodal understanding, file handling, and tool use.

Model Score
OPS-Agentic-Search 92.36%
openJiuwen-deepagent 92.36%
openJiuwen-deepagent (GPT5/Gemini) 91.69%

Current leader: OPS-Agentic-Search (tied with openJiuwen-deepagent at 92.36%)

ClawBench

ClawBench evaluates browser agents on 153 everyday online tasks across 144 live platforms in 15 categories, including purchases, appointments, job applications, and detailed forms.

Model Score
Claude Sonnet 4.6 33.3%
GLM-5 24.2%
Gemini 3 Flash 19.0%

Current leader: Claude Sonnet 4.6

Online-Mind2Web

Online-Mind2Web turns the static Mind2Web idea into a live benchmark of 300 tasks across 136 websites, covering shopping, finance, travel, government, and other consumer workflows.

Model Score
Browser Use Cloud (bu-max) 97.0%
GPT-5.4 Native Computer Use 93.0%
ABP + Claude Opus 4.6 90.53%

Current leader: Browser Use Cloud (bu-max)

τ-bench

τ-bench evaluates conversational agents in realistic customer-service tasks where the agent must talk to a simulated user, call domain APIs, and follow a policy manual.

Model Score
Step-3.5-Flash 88.2%
GLM-4.7 87.4%
MiMo-V2-Flash 80.3%

Current leader: Step-3.5-Flash

AgentBench

AgentBench evaluates LLMs as agents across 8 interactive environments, including operating-system tasks, database querying, knowledge graphs, games, lateral-thinking puzzles, ALFWorld, WebShop, and Mind2Web-style browsing.

Model Score
AgentRL w/ Qwen2.5-32B-Instruct 70.4%
AgentRL w/ Qwen2.5-14B-Instruct 67.7%
AgentRL w/ GLM-4-9B-0414 65.0%

Current leader: AgentRL w/ Qwen2.5-32B-Instruct

Frequently asked questions

How should I choose a benchmark for my use case?
Start from deployment context: browser workflow automation usually maps to WebVoyager or WebArena, desktop automation maps to OSWorld, deep research maps to BrowseComp, and code-fixing reliability maps to SWE-bench Verified.
Are scores comparable across different benchmarks?
No. Benchmark objectives, datasets, evaluators, and pass criteria differ. Use each benchmark page for within-benchmark comparison, then validate directly on your own workload.
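One reason to validate on your own workload is that small gaps on a finite task set can sit inside sampling noise. The sketch below estimates that noise for two WebArena scores quoted on this page. It assumes each score is a plain pass rate over independent tasks and that it was measured on the full 812-task set; neither assumption is confirmed by the leaderboard sources, and the function names are illustrative.

```python
import math


def pass_rate_standard_error(score_pct: float, num_tasks: int) -> float:
    """Binomial standard error of a pass rate, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / num_tasks)


def gap_in_standard_errors(score_a: float, score_b: float, num_tasks: int) -> float:
    """Gap between two scores divided by the standard error of their difference.

    Treats the two runs as independent samples, which is an approximation
    because both agents are evaluated on the same tasks.
    """
    se_a = pass_rate_standard_error(score_a, num_tasks)
    se_b = pass_rate_standard_error(score_b, num_tasks)
    return abs(score_a - score_b) / math.sqrt(se_a ** 2 + se_b ** 2)


if __name__ == "__main__":
    # WebArena figures quoted on this page: 812 tasks, 71.6% vs 71.2%.
    se = pass_rate_standard_error(71.6, 812)
    gap = gap_in_standard_errors(71.6, 71.2, 812)
    print(f"standard error: {se:.2f} points, gap: {gap:.2f} standard errors")
```

Under these assumptions the standard error is roughly 1.6 points, so a 0.4-point gap is well under one standard error and should not be read as a reliable ranking on its own.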
Do leaderboard scores belong to models or systems?
Both exist, depending on page scope. Model pages emphasize base-model capability, while agent pages represent full systems (model + tooling + policy). Mixed pages include both and require extra caution.
Who maintains this leaderboard?
Steel maintains it as an open reference for the browser-agent ecosystem. Steel is browser infrastructure for AI agents (cloud browser sessions with anti-bot handling, proxy rotation, and session replay) and is used by teams building agents against the benchmarks tracked here. Contributions and corrections are welcome on GitHub.
How do AI browser agents work?
Browser agents combine LLMs with browser automation to complete web tasks. A perception step reads the page through screenshots, the DOM, or both. A reasoning model decides the next action, such as clicking, typing, or scrolling. An execution layer drives the browser through the Chrome DevTools Protocol or Playwright. A memory component tracks state across steps. Many agents run on cloud infrastructure like Steel for reliability and anti-bot handling.
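The observe, decide, act loop described above can be sketched in a few lines. This is a toy illustration, not any particular agent's implementation: `FakeBrowser` stands in for a real execution layer, `rule_based_policy` stands in for the reasoning model, and every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str          # "type", "click", or "done"
    target: str = ""
    text: str = ""


@dataclass
class FakeBrowser:
    """Stand-in for the execution layer (hypothetical). A real agent would
    drive a browser through Chrome DevTools Protocol or Playwright."""
    query: str = ""
    submitted: bool = False

    def observe(self) -> str:
        # Perception: return a small text snapshot of the page state.
        return f"search_box={self.query!r} submitted={self.submitted}"

    def execute(self, action: Action) -> None:
        if action.kind == "type":
            self.query = action.text
        elif action.kind == "click" and action.target == "search_button":
            self.submitted = True


def rule_based_policy(goal: str, observation: str, history: List[Action]) -> Action:
    """Placeholder for the reasoning model: choose the next action."""
    if "submitted=True" in observation:
        return Action("done")
    if f"search_box={goal!r}" in observation:
        return Action("click", target="search_button")
    return Action("type", target="search_box", text=goal)


def run_agent(
    goal: str,
    browser: FakeBrowser,
    policy: Callable[[str, str, List[Action]], Action],
    max_steps: int = 10,
) -> List[Action]:
    history: List[Action] = []  # memory: actions taken so far
    for _ in range(max_steps):
        observation = browser.observe()
        action = policy(goal, observation, history)
        history.append(action)
        if action.kind == "done":
            break
        browser.execute(action)
    return history


if __name__ == "__main__":
    steps = run_agent("cloud browsers", FakeBrowser(), rule_based_policy)
    print([step.kind for step in steps])
```

The `max_steps` cap is a design choice worth keeping in a real agent as well, since a policy that never emits a terminal action would otherwise loop indefinitely.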
How do I build my own AI browser agent?
Three layers are needed. Browser infrastructure: Steel provides managed sessions, proxies, anti-bot handling, and replay. AI layer: a vision-capable model like GPT-4o, Claude, or Gemini with prompting for action selection. Orchestration: frameworks like Browser Use or Skyvern handle clicking, typing, and state tracking. See the production agents guide. Once your agent has a publicly verifiable benchmark score, open a pull request on GitHub.
How often is the leaderboard updated?
The leaderboard is updated as new benchmark results are published, which typically means weekly. If you know of a missing agent or score, pull requests and issues are welcome on GitHub.
How do I add my agent to the leaderboard?
Open a pull request on GitHub with your entry. You need a publicly verifiable benchmark score, a link to the source (paper or blog post), and a homepage or GitHub repo for your agent.
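Before opening a pull request, it can help to check an entry against the three requirements listed above. The sketch below is only a pre-submission sanity check: the field names are hypothetical and do not reflect the repository's actual schema, and it cannot tell whether a score is publicly verifiable, only whether the links are well-formed.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass
class LeaderboardEntry:
    # Hypothetical field names; check the repository for the real format.
    agent_name: str
    benchmark: str
    score_pct: float
    source_url: str     # paper or blog post reporting the score
    homepage_url: str   # homepage or GitHub repo for the agent


def _is_http_url(value: str) -> bool:
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def validate_entry(entry: LeaderboardEntry) -> List[str]:
    """Return a list of problems; an empty list means the basics are present."""
    problems: List[str] = []
    if not entry.agent_name.strip():
        problems.append("agent name is empty")
    if not 0.0 <= entry.score_pct <= 100.0:
        problems.append("score must be a percentage between 0 and 100")
    if not _is_http_url(entry.source_url):
        problems.append("source link must be an http(s) URL")
    if not _is_http_url(entry.homepage_url):
        problems.append("homepage or repo link must be an http(s) URL")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only.
    entry = LeaderboardEntry(
        agent_name="ExampleAgent",
        benchmark="WebVoyager",
        score_pct=50.0,
        source_url="https://example.com/blog/results",
        homepage_url="https://example.com",
    )
    print(validate_entry(entry))
```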