Online-Mind2Web leaderboard
Benchmark page for Online-Mind2Web with standardized structure: about, leaderboard table, and FAQ.
Last updated: 2026-04-16
About this benchmark
Online-Mind2Web is a live browser agent benchmark of 300 diverse, realistic tasks across 136 popular websites — spanning shopping, finance, travel, government, and more. Unlike static offline benchmarks, agents interact with real, dynamic pages as they exist at evaluation time.
Introduced in a COLM 2025 paper from OSU, the benchmark was designed specifically to expose over-optimism in previously reported web-agent results. The paper's central finding was that agents scoring highly on static benchmarks performed dramatically worse on live websites — hence the title 'An Illusion of Progress?'
It has since become the most widely cited live browser-agent benchmark, with commercial agents (Browser Use, TinyFish, Yutori Navigator, UI-TARS-2) using it as the primary competitive signal for browser-agent capability.
Judge methodology varies significantly across submissions — human eval, WebJudge, and custom agentic judges produce different scores for the same agent. Always check the Notes column before comparing rows.
Methodology
- Tasks span three difficulty levels — easy (83), medium (143), and hard (74) — stratified by reference human step count. Performance drops sharply between levels: roughly 30 percentage points of success rate from easy to medium, and a further ~15 from medium to hard (the sketch after the leaderboard table shows how per-level rates combine into the overall score).
- Two evaluation methods coexist: human evaluation (gold standard, slower) and WebJudge (LLM-as-a-Judge, ~85% agreement with human judgment). Many teams report both — the Notes column specifies which applies to each row.
- Because teams use different judges (WebJudge, screenshot-based, or custom agentic judges), scores are not always directly comparable across organizations. Browser Use's 97% uses a custom agentic judge built on the Claude Agent SDK, which is not the same as WebJudge. A toy illustration of per-judge scores versus per-task agreement follows this list.
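To make the comparability caveat concrete, here is a minimal sketch of how per-judge success rates and judge agreement are computed from per-task verdicts. The verdict lists and helper names below are illustrative assumptions, not data from any submission.

```python
# Toy per-task verdicts (True = task judged successful).
# Illustrative only; not real Online-Mind2Web results.
human_verdicts    = [True, True, False, True, False, False, True, True]
webjudge_verdicts = [True, False, False, True, False, True, True, True]

def success_rate(verdicts: list[bool]) -> float:
    """Fraction of tasks judged successful."""
    return sum(verdicts) / len(verdicts)

def agreement(a: list[bool], b: list[bool]) -> float:
    """Fraction of tasks on which two judges return the same verdict."""
    return sum(x == y for x, y in zip(a, b, strict=True)) / len(a)

print(f"human SR:    {success_rate(human_verdicts):.1%}")     # 62.5%
print(f"WebJudge SR: {success_rate(webjudge_verdicts):.1%}")  # 62.5%
print(f"agreement:   {agreement(human_verdicts, webjudge_verdicts):.1%}")  # 75.0%
```

Note that in this toy example the two judges report identical aggregate scores while disagreeing on a quarter of individual tasks: aggregate similarity does not imply per-task agreement, which is why WebJudge's ~85% agreement with human judgment is measured per task rather than by comparing totals.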
Links
Online-Mind2Web
| Rank | System / Submission | Score | Organization | Notes | Source |
|---|---|---|---|---|---|
| 1 | Browser Use Cloud (bu-max) | 97.0% | Browser Use | Self-reported using a custom agentic judge built on Claude Agent SDK; OpenAI's score uses a different screenshot-based judge — not directly comparable. | Source |
| 2 | ABP + Claude Opus 4.6 | 90.53% | theredsix | Agent Browser Protocol with Claude Opus 4.6; all 300 task results published publicly. Previous SOTA was 78.7%. | Source |
| 3 | TinyFish | 90.0% | TinyFish AI | All 300 task runs published publicly; outperformed Gemini by 21 points and OpenAI Operator by 29 points at time of submission. | Source |
| 4 | UI-TARS-2 | 88.2% | ByteDance / VLM-Research | Native GUI agent trained with multi-turn RL; score from technical report, evaluated under standard Online-Mind2Web conditions. | Source |
| 5 | Navigator | 78.7% | Yutori | Human-evaluation score; also achieves 64.7% on auto-evaluation (WebJudge). 3.3x faster per step than Claude 4.5. | Source |
| 6 | Gemini 2.5 Computer Use | 69.0% | Google DeepMind | Score reported by Yutori under identical evaluation settings; 57.3% on auto-evaluation (WebJudge). | Source |
| 7 | OpenAI Operator | 61.3% | OpenAI | Score from Online-Mind2Web paper; OpenAI did not publish judge, harness, or task-level results for independent verification. | Source |
| 8 | Claude 4.0 | 61.0% | Anthropic | Human-evaluation score reported by Yutori; 47.7% on auto-evaluation (WebJudge). | Source |
| 9 | ACT-1-20250814 | 57.3% | Enhans | Online-Mind2Web SR (Easy 81.9 / Med 54.5 / Hard 35.1); reported on osunlp HF leaderboard. | Source |
| 10 | Claude Computer Use 3.7 (w/o thinking) | 56.3% | Anthropic | Online-Mind2Web SR (Easy 90.4 / Med 49.0 / Hard 32.4); reported on osunlp HF leaderboard. | Source |
| 11 | Claude 4.5 | 55.0% | Anthropic | Human-evaluation score reported by Yutori under identical evaluation settings; 59.3% on auto-evaluation (WebJudge). | Source |
| 12 | ACT-1-20250703 | 45.7% | Enhans | Online-Mind2Web SR (Easy 65.1 / Med 46.2 / Hard 23.0); reported on osunlp HF leaderboard. | Source |
| 13 | SeeAct (gpt-4o) | 30.7% | OSU NLP | Online-Mind2Web SR (Easy 60.2 / Med 25.2 / Hard 8.1); reported on osu-nlp-group.github.io. | Source |
| 14 | Browser Use (gpt-4o) | 30.0% | Browser Use | Online-Mind2Web SR (Easy 55.4 / Med 26.6 / Hard 8.1); reported on osunlp HF leaderboard. | Source |
| 14 | HAL Leaderboard baseline (best open) | ~30% | Princeton / OSU | Reference baseline from the HAL leaderboard tracker; illustrates the gap between frontier commercial systems and open models. | Source |
| 16 | Claude Computer Use 3.5 | 29.0% | Anthropic | Online-Mind2Web SR (Easy 56.6 / Med 20.3 / Hard 14.9); reported on osunlp HF leaderboard. | Source |
| 17 | Agent-E (gpt-4o) | 28.0% | Emergence AI | Online-Mind2Web SR (Easy 49.4 / Med 26.6 / Hard 6.8); reported on osunlp HF leaderboard. | Source |
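As a worked check on the table above, the per-level breakdowns in the Notes column combine into the headline score as a task-count-weighted average over the easy/medium/hard split of 83/143/74. The function below is our own sketch, not an official scoring script; the inputs are the ACT-1-20250814 row from the table.

```python
# Difficulty split from the methodology section (300 tasks total).
COUNTS = {"easy": 83, "medium": 143, "hard": 74}

def overall_sr(per_level: dict[str, float]) -> float:
    """Task-count-weighted average of per-difficulty success rates (in %)."""
    total = sum(COUNTS.values())  # 300
    return sum(COUNTS[level] * sr for level, sr in per_level.items()) / total

# ACT-1-20250814: Easy 81.9 / Med 54.5 / Hard 35.1 -> 57.3 overall.
print(round(overall_sr({"easy": 81.9, "medium": 54.5, "hard": 35.1}), 1))  # 57.3
```

The same arithmetic reproduces the other rows that publish per-level numbers to within 0.1 points of rounding (e.g., SeeAct 30.7%, Agent-E 28.0%).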
Related benchmarks
Compare this benchmark with related pages from the hub: