Leaderboard
| System / Submission | Score | Organization | Reported | Source |
|---|---|---|---|---|
| Browser Use Cloud (bu-max): Self-reported using a custom agentic judge built on Claude Agent SDK; OpenAI's score uses a different screenshot-based judge, so the two are not directly comparable. | 97.0% | Browser-Use | | Source |
| GPT-5.4 Native Computer Use: OpenAI-reported native computer-use score from GPT-5.4 launch announcement; per Browser Use leaderboard, raw data is not public. | 93.0% | OpenAI | | Source |
| ABP + Claude Opus 4.6: Agent Browser Protocol with Claude Opus 4.6; all 300 task results published publicly. Previous SOTA was 78.7%. | 90.53% | theredsix | | Source |
| TinyFish: All 300 task runs published publicly; outperformed Gemini by 21 points and OpenAI Operator by 29 points at time of submission. | 90.0% | TinyFish AI | | Source |
| UI-TARS-2: Native GUI agent trained with multi-turn RL; score from technical report, evaluated under standard Online-Mind2Web conditions. | 88.2% | ByteDance / VLM-Research | | Source |
| OpenAGI Lux: Foundation computer-use model trained via Agentic Active Pre-training on OSGym; self-reported Online-Mind2Web score at launch. | 83.6% | OpenAGI Foundation | | Source |
| Navigator: Human-evaluation score; also achieves 64.7% on auto-evaluation (WebJudge). 3.3x faster per-step than Claude 4.5. | 78.7% | Yutori | | Source |
| ChatGPT Atlas Agent Mode: OpenAI-reported Atlas browser Agent Mode score cited in GPT-5.4 announcement; underlying run data not public. | 71.0% | OpenAI | | Source |
| Gemini 2.5 Computer Use: Score reported by Yutori under identical evaluation settings; 57.3% on auto-evaluation (WebJudge). | 69.0% | Google DeepMind | | Source |
| Stagehand (Gemini 2.5 CU): Stagehand SDK wrapping Gemini 2.5 Computer Use; score from Browserbase's public Computer Use evaluations leaderboard. | 65.0% | Browserbase | | Source |
| OpenAI Operator: Score from Online-Mind2Web paper; OpenAI did not publish judge, harness, or task-level results for independent verification. | 61.3% | OpenAI | | Source |
| Claude 4.0: Human-evaluation score reported by Yutori; 47.7% on auto-evaluation (WebJudge). | 61.0% | Anthropic | | Source |
| ACT-1-20250814: Online-Mind2Web SR (Easy 81.9 / Med 54.5 / Hard 35.1); reported on osunlp HF leaderboard. | 57.3% | Enhans | | Source |
| Claude Computer Use 3.7 (w/o thinking): Online-Mind2Web SR (Easy 90.4 / Med 49.0 / Hard 32.4); reported on osunlp HF leaderboard. | 56.3% | Anthropic | | Source |
| Claude 4.5: Human-evaluation score reported by Yutori under identical evaluation settings; 59.3% on auto-evaluation (WebJudge). | 55.0% | Anthropic | | Source |
| Stagehand (Sonnet 4.5): Stagehand SDK with Claude Sonnet 4.5; score from Browserbase's public Computer Use evaluations leaderboard. | 55.0% | Browserbase | | Source |
| ACT-1-20250703: Online-Mind2Web SR (Easy 65.1 / Med 46.2 / Hard 23.0); reported on osunlp HF leaderboard. | 45.7% | Enhans | | Source |
| SeeAct (gpt-4o): Online-Mind2Web SR (Easy 60.2 / Med 25.2 / Hard 8.1); reported on osu-nlp-group.github.io. | 30.7% | OSU NLP | | Source |
| Browser Use (gpt-4o): Online-Mind2Web SR (Easy 55.4 / Med 26.6 / Hard 8.1); reported on osunlp HF leaderboard. | 30.0% | Browser Use | | Source |
| HAL Leaderboard baseline (best open): Reference baseline from the HAL leaderboard tracker; illustrates the gap between frontier commercial systems and open models. | ~30% | Princeton / OSU | | Source |
| Claude Computer Use 3.5: Online-Mind2Web SR (Easy 56.6 / Med 20.3 / Hard 14.9); reported on osunlp HF leaderboard. | 29.0% | Anthropic | | Source |
| Agent-E (gpt-4o): Online-Mind2Web SR (Easy 49.4 / Med 26.6 / Hard 6.8); reported on osunlp HF leaderboard. | 28.0% | Emergence AI | | Source |
About this benchmark
Online-Mind2Web turns the static Mind2Web idea into a live benchmark of 300 tasks across 136 websites, covering shopping, finance, travel, government, and other consumer workflows.
The paper was framed around the gap between offline benchmark progress and real online performance; agents that look strong on static snapshots can fail when pages, timing, and interaction flows change.
It is one of the most useful web-agent benchmarks for current product work, but judge methodology varies across submissions. Human evaluation, WebJudge, and custom agentic judges can produce different scores for the same agent, so a reported score should always be read together with the judge that produced it.
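Four leaderboard rows report both a human-evaluation score and a WebJudge auto-evaluation score for the same system, which makes the judge effect concrete. The sketch below uses only those paired figures as quoted in the rows; the names `PAIRED_SCORES` and `judge_gaps` are illustrative and not part of any benchmark tooling.

```python
# (system, human-evaluation %, WebJudge auto-evaluation %),
# copied from the leaderboard rows that quote both numbers.
PAIRED_SCORES = [
    ("Navigator", 78.7, 64.7),
    ("Gemini 2.5 Computer Use", 69.0, 57.3),
    ("Claude 4.0", 61.0, 47.7),
    ("Claude 4.5", 55.0, 59.3),
]


def judge_gaps(rows):
    """Return human-minus-WebJudge gaps in percentage points per system."""
    return {name: round(human - auto, 1) for name, human, auto in rows}


gaps = judge_gaps(PAIRED_SCORES)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} points (human minus WebJudge)")
```

On these rows the gap ranges from about 14 points in favor of human evaluation to about 4 points in favor of WebJudge, so the ordering of two systems can change depending on which judge is used.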
Example tasks
Three public tasks quoted from benchmark sources:
- "Open the page with an overview of the submission of releases on Discogs." Citation: Online-Mind2Web example result
- "Open the reviews of a recipe with beef sirloin" Citation: Browser Use Online-Mind2Web benchmark post
- "Find full-time legal jobs in San Diego County, min $4,000+/month" Citation: Browser Use Online-Mind2Web benchmark post
Methodology
- Primary score is task success rate across easy, medium, and hard tasks, where difficulty is stratified by reference human step count.
- The paper introduced WebJudge, an LLM-as-judge method with roughly 85% agreement with human judgment, but newer submissions sometimes use custom judges.
- Human evaluation is the clearest comparison point; automated judge scores should be compared only when judge, screenshots or traces, and task-level results are published.
- Rows are included when the source provides a benchmark score and enough information to identify the evaluator or setup.
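The relationship between the per-difficulty success rates and the headline score can be sketched as a size-weighted average. The split sizes below (83 easy, 143 medium, 74 hard) are an assumption: they are not stated on this page, although they sum to the 300 tasks and reproduce the per-difficulty figures quoted in the leaderboard rows. The function name is hypothetical.

```python
# Assumed split sizes (not stated on this page); they sum to 300 tasks.
SPLIT_SIZES = {"easy": 83, "medium": 143, "hard": 74}


def overall_success_rate(split_rates, split_sizes=SPLIT_SIZES):
    """Combine per-difficulty success rates (in %) into one overall rate (in %).

    Each split contributes in proportion to its number of tasks.
    """
    total_tasks = sum(split_sizes.values())
    solved = sum(split_rates[name] / 100 * size for name, size in split_sizes.items())
    return 100 * solved / total_tasks


# Per-difficulty figures quoted for ACT-1-20250814 in the leaderboard.
act1 = {"easy": 81.9, "medium": 54.5, "hard": 35.1}
print(round(overall_success_rate(act1), 1))  # 57.3
```

Under this assumption the medium split carries almost half of the weight, so a system with a strong easy-task score can still rank low overall.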