steel.dev benchmark hub

Browser Agent Leaderboards

Track the best-performing AI agents and models across browser automation, computer use, research, and coding benchmarks. Each leaderboard page includes methodology notes, scope labels, and source-linked results.

Featured: WebVoyager

View full leaderboard

WebVoyager measures end-to-end browser task completion across live websites. It is widely used for comparing production-style browser agent systems.

Current top system: Jina (98.9%)

WebVoyager leaderboard

Agent scope
Rank | System / Submission | Score | Organization | Notes
1 | Jina | 98.9% | Om Labs | Om Labs custom tracker; Jina multi-model system on self-hosted WebVoyager harness.
2 | Alumnium | 98.6% | Alumnium | Accessibility-tree parsing with integrated visual reasoning.
3 | Surfer 2 | 97.1% | H Company | System-level orchestration with submitter-defined setup details.
4 | Magnitude | 93.9% | Magnitude | Open-source architecture built on a modular agentic stack.
5 | AIME Browser-Use | 92.34% | Aime | Custom orchestration layer with specialized browser tooling.
6 | Surfer-H + Holo1 | 92.2% | H Company | Multi-modal action kernels integrated via H Company research.
7 | Browserable | 90.4% | Browserable | Fine-tuned browser control models within a commercial framework.
8 | Browser Use | 89.1% | Browser Use | Multi-step orchestration framework for open-source automation.
9 | GLM-5V-Turbo | 88.5% | Z.ai | Multimodal vision model optimized for GUI automation and coding tasks.
10 | Agent Kura | 87.0% | Kura | 602/643 tasks (41 removed for invalid/auth issues); reported on trykura.com.
10 | Operator | 87% | OpenAI | Native browser integration using proprietary vision-control models.
12 | Skyvern 2.0 | 85.85% | Skyvern | DOM-level reasoning coupled with real-time error correction.
13 | Project Mariner | 83.5% | Google | Gemini-powered reasoning with precise visual grounding.
14 | Agent-E | 73.1% | Emergence AI | Hierarchical planning modules within a multi-agent framework.
14 | Notte | 73.1% | Notte | Standardized operator stack for open-source performance evaluation.
16 | WebSight | 68% | Academic Research | Navigation system prioritizing visual-only perceptual inputs.
17 | Runner H 0.1 | 67% | H Company | Foundational agent architecture for general web interaction.
18 | WebVoyager | 59.1% | Academic Research | Baseline implementation using standard multimodal LLM control.
19 | Anthropic Computer Use 3.5 | 56.0% | Anthropic | Sampled 50/602 tasks for direct comparison; reported on trykura.com.
20 | WILBUR | 53% | Academic Research | Research implementation using black-box optimization techniques.
21 | GPT-4 (All Tools) | 30.8% | OpenAI | ChatGPT all-tools baseline from the original WebVoyager paper; reported on arxiv.org.

Explore benchmarks

Browser agents

WebVoyager

WebVoyager is a benchmark for browser agents operating on live websites. It focuses on practical tasks such as navigation, search, form completion, and multi-step workflows across a broad website mix.

Top: Jina (98.9%)

Caveat: Rows may use different evaluation settings, so comparisons are not always apples-to-apples.

Open WebVoyager leaderboard

Research/search

BrowseComp

BrowseComp targets difficult browse-and-synthesize research questions that are easy to verify but hard to answer without strong search and reasoning strategy.

Top: GPT-5.5 Pro (90.1%)

Caveat: Mixed-scope benchmark; model-only and tool-augmented rows are not directly comparable.

Open BrowseComp leaderboard

Browser agents

WebArena

WebArena evaluates browser agents in controlled, self-hosted web environments that represent realistic application patterns such as e-commerce, forums, and developer workflows.

Top: DeepSeek V3.2 (74.3%)

Caveat: Even with controlled environments, ranking rows can differ by setup and submission policy.

Open WebArena leaderboard

Coding

SWE-bench Verified

SWE-bench Verified evaluates software engineering performance on real GitHub issues with stricter quality controls than the broader SWE-bench set.

Top: Claude Mythos (93.9%)

Caveat: Model-focused benchmark, but harness and evaluation policy still affect outcomes.

Open SWE-bench Verified leaderboard

Computer use

OSWorld

OSWorld evaluates computer-use agents across 369 real desktop tasks spanning Ubuntu, Windows, and macOS — covering web apps, desktop software, file I/O, and multi-application workflows.

Top: Mythos Preview (79.6%)

Caveat: Self-reported and independently verified rows coexist — check the source before comparing directly.

Open OSWorld leaderboard

Model evals / reasoning

GAIA

GAIA (General AI Assistants) evaluates agents on over 450 real-world questions with unambiguous, verifiable answers — requiring multi-step reasoning, tool use, web search, and file handling across three difficulty levels.

Top: OPS-Agentic-Search (92.36%)

Caveat: Top entries are multi-model ensembles — scores cannot be attributed to any single model.

Open GAIA leaderboard

Browser agents

ClawBench

ClawBench evaluates AI agents on 153 everyday tasks that real people need to complete regularly — booking appointments, completing purchases, submitting job applications, and filling in forms — across 144 live production websites in 15 categories.

Top: Claude Sonnet 4.6 (33.3%)

Caveat: Very new benchmark (April 2026) — published results cover only 7 frontier models. Expect the leaderboard to expand rapidly.

Open ClawBench leaderboard

Browser agents

Online-Mind2Web

Online-Mind2Web is a live browser agent benchmark of 300 diverse, realistic tasks across 136 popular websites — spanning shopping, finance, travel, government, and more. Unlike static offline benchmarks, agents interact with real, dynamic pages as they exist at evaluation time.

Top: Browser Use Cloud (bu-max) (97.0%)

Caveat: Judge methodology varies significantly across submissions — human eval, WebJudge, and custom agentic judges produce different scores for the same agent. Always check the Notes column before comparing rows; a sketch of how an LLM judge works follows below.

Open Online-Mind2Web leaderboard
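To make that caveat concrete, here is a simplified sketch of an automatic LLM judge in the spirit of WebJudge. The prompt wording and the ask_llm helper are illustrative assumptions, not the benchmark's actual implementation; small prompt changes, or swapping in a human reviewer, can flip borderline trajectories, which is how the same agent ends up with different scores under different judges.

    # Simplified LLM-as-judge for a browser-agent trajectory.
    def ask_llm(prompt: str) -> str:
        # Hypothetical: call whichever judge model the evaluation uses.
        raise NotImplementedError

    def judge_trajectory(task: str, actions: list[str], final_state: str) -> bool:
        # Grade a finished run by showing the judge the task, the actions
        # taken, and a description of the final page state.
        prompt = (
            f"Task: {task}\n"
            f"Actions taken: {actions}\n"
            f"Final page state: {final_state}\n"
            "Did the agent complete the task? Answer SUCCESS or FAILURE."
        )
        return ask_llm(prompt).strip().upper().startswith("SUCCESS")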

Model evals / reasoning

τ-bench

τ-bench (TAU-bench) evaluates AI agents in realistic enterprise tool-use scenarios across retail and airline domains — testing multi-turn conversation, policy adherence, database interactions, and rule-following consistency over many trials.

Top: Step-3.5-Flash (88.2%)

Caveat: Score comparisons across organizations require caution — prompt setup, tool schema, and trial count differ between submissions.

Open τ-bench leaderboard

Model evals / reasoning

AgentBench

AgentBench evaluates LLMs as agents across 8 distinct interactive environments — including OS interaction, database querying, knowledge graph traversal, digital card games, lateral thinking puzzles, household tasks, web browsing, and web shopping.

Top: AgentRL w/ Qwen2.5-32B-Instruct (70.4%)

Caveat: Community-submitted leaderboard — rows are self-reported and not independently verified. Check source links before drawing strong conclusions.

Open AgentBench leaderboard

Methodology reminder: some benchmarks measure model capability, while others measure full systems (model + tools + policy). Mixed pages include both and should be read as directional unless setups are fully aligned.

Frequently asked questions

How should I choose a benchmark for my use case?
Start from deployment context: browser workflow automation usually maps to WebVoyager or WebArena, desktop automation maps to OSWorld, deep research maps to BrowseComp, and code-fixing reliability maps to SWE-bench Verified.
Are scores comparable across different benchmarks?
No. Benchmark objectives, datasets, evaluators, and pass criteria differ. Use each benchmark page for within-benchmark comparison, then validate directly on your own workload.
Do leaderboard scores belong to models or systems?
Both exist, depending on page scope. Model pages emphasize base-model capability, while agent pages represent full systems (model + tooling + policy). Mixed pages include both and require extra caution.
Who maintains this leaderboard?
Steel maintains it as an open reference for the browser-agent ecosystem. Steel is browser infrastructure for AI agents — cloud browser sessions with anti-bot handling, proxy rotation, and session replay — used by teams building agents against the benchmarks tracked here. Contributions and corrections are welcome on GitHub.
How do AI browser agents work?
Browser agents combine LLMs with browser automation to complete web tasks. A vision model sees the webpage via screenshots or the DOM. A reasoning model decides actions like clicking, typing, or scrolling. An execution layer drives the browser via the Chrome DevTools Protocol or Playwright. A memory component tracks state across steps. Most agents run on cloud infrastructure like Steel for reliability and anti-bot handling. A minimal sketch of that loop follows below.
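Here is a minimal observe-decide-act loop, using Playwright's sync API for execution. choose_action is a hypothetical placeholder for the model call, not a real library function; production agents add retries, richer memory, and error recovery.

    # Minimal browser-agent loop: observe -> decide -> act.
    from playwright.sync_api import sync_playwright

    def choose_action(screenshot: bytes, goal: str, history: list) -> dict:
        # Hypothetical: send screenshot + goal + history to a vision-capable
        # model and parse its reply into {"type": "click"|"type"|"done", ...}.
        raise NotImplementedError

    def run_agent(goal: str, start_url: str, max_steps: int = 20) -> None:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(start_url)
            history: list = []                              # memory across steps
            for _ in range(max_steps):
                shot = page.screenshot()                    # vision input
                action = choose_action(shot, goal, history) # reasoning step
                if action["type"] == "done":                # agent judges task complete
                    break
                if action["type"] == "click":               # execution layer
                    page.mouse.click(action["x"], action["y"])
                elif action["type"] == "type":
                    page.keyboard.type(action["text"])
                history.append(action)
            browser.close()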
How do I build my own AI browser agent?
Three layers are needed. Browser infrastructure: Steel provides managed sessions, proxies, anti-bot handling, and replay. AI layer: a vision-capable model like GPT-4o, Claude, or Gemini with prompting for action selection. Orchestration: frameworks like Browser Use or Skyvern handle clicking, typing, and state tracking. See the production agents guide. Once your agent has a publicly verifiable benchmark score, open a pull request on GitHub.
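For the infrastructure layer, agents typically attach to a managed cloud browser over the Chrome DevTools Protocol rather than launching Chromium locally. A minimal sketch, assuming your provider (Steel or similar) issues a CDP websocket URL; the exact session-creation call varies by provider and is not shown here.

    # Attach Playwright to a remote cloud browser session over CDP.
    import os
    from playwright.sync_api import sync_playwright

    # Placeholder: obtain this URL from your provider's session API.
    WS_URL = os.environ["BROWSER_SESSION_WS_URL"]

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(WS_URL)
        # Reuse the session's existing context if the provider created one.
        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.new_page()
        page.goto("https://example.com")
        print(page.title())
        browser.close()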
How often is the leaderboard updated?
The leaderboard is updated as new benchmark results are published, with new results typically appearing weekly. If you know of a missing agent or score, pull requests and issues are welcome on GitHub.
How do I add my agent to the leaderboard?
Open a pull request on GitHub with your entry. You need a publicly verifiable benchmark score, a link to the source (paper or blog post), and a homepage or GitHub repo for your agent.