Online-Mind2Web leaderboard

Benchmark page for Online-Mind2Web with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-04-16

About this benchmark

Online-Mind2Web is a live browser agent benchmark of 300 diverse, realistic tasks across 136 popular websites — spanning shopping, finance, travel, government, and more. Unlike static offline benchmarks, agents interact with real, dynamic pages as they exist at evaluation time.

Published at COLM 2025 by OSU, it was introduced specifically to expose over-optimism in previously reported web agent results. The paper's central finding was that agents scoring highly on static benchmarks performed dramatically worse on live websites — hence the title 'An Illusion of Progress?'

It has since become the most widely cited live browser benchmark, with commercial agents (Browser Use, TinyFish, Yutori Navigator, UI-TARS-2) using it as the primary competitive signal for browser agent capability.

Judge methodology varies significantly across submissions — human eval, WebJudge, and custom agentic judges produce different scores for the same agent. Always check the Notes column before comparing rows.

Methodology

Tasks span three difficulty levels: easy (83), medium (143), and hard (74), stratified by reference human step count. Success rates drop sharply between levels: roughly 30 percentage points from easy to medium, and a further 15 or so from medium to hard.
Two evaluation methods coexist: human evaluation (gold standard, slower) and WebJudge (LLM-as-a-Judge, ~85% agreement with human judgment). Many teams report both — the Notes column specifies which applies to each row.
Because teams use different judges (WebJudge, screenshot-based, or custom agentic judges), scores are not always directly comparable across organizations. Browser Use's 97% uses a custom agentic judge built on the Claude Agent SDK, which is not the same as WebJudge.
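
As a rough illustration of how the per-difficulty success rates quoted in the Notes column relate to a headline score, the sketch below weights each level by its task count (83, 143, 74). The function name `overall_success_rate` is hypothetical, and treating the headline score as a task-weighted average is an assumption on our part, although it does reproduce the ACT-1-20250814 row.

```python
# Task counts per difficulty level, as listed in the Methodology section.
TASK_COUNTS = {"easy": 83, "medium": 143, "hard": 74}


def overall_success_rate(per_level: dict[str, float]) -> float:
    """Task-weighted average of per-level success rates (in percent).

    Assumption: the headline score is the mean over all 300 tasks,
    so each level contributes in proportion to its task count.
    """
    total_tasks = sum(TASK_COUNTS.values())
    weighted = sum(per_level[level] * n for level, n in TASK_COUNTS.items())
    return weighted / total_tasks


# Per-level rates from the ACT-1-20250814 row (Easy 81.9 / Med 54.5 / Hard 35.1).
act1 = {"easy": 81.9, "medium": 54.5, "hard": 35.1}
print(round(overall_success_rate(act1), 1))  # prints 57.3, matching the table
```

Because medium tasks make up almost half of the set (143 of 300), a system's medium-level rate moves the headline score more than either of the other two levels.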

Online-Mind2Web

Agent scope

Rank	System / Submission	Score	Organization	Notes	Source
1	Browser Use Cloud (bu-max)	97.0%	Browser Use	Self-reported using a custom agentic judge built on the Claude Agent SDK; OpenAI's score uses a different screenshot-based judge, so the two are not directly comparable.	Source
2	ABP + Claude Opus 4.6	90.53%	theredsix	Agent Browser Protocol with Claude Opus 4.6; all 300 task results published publicly. Previous SOTA was 78.7%.	Source
3	TinyFish	90.0%	TinyFish AI	All 300 task runs published publicly; outperformed Gemini by 21 points and OpenAI Operator by 29 points at time of submission.	Source
4	UI-TARS-2	88.2%	ByteDance / VLM-Research	Native GUI agent trained with multi-turn RL; score from technical report, evaluated under standard Online-Mind2Web conditions.	Source
5	Navigator	78.7%	Yutori	Human-evaluation score; also achieves 64.7% on auto-evaluation (WebJudge). 3.3x faster per-step than Claude 4.5.	Source
6	Gemini 2.5 Computer Use	69.0%	Google DeepMind	Score reported by Yutori under identical evaluation settings; 57.3% on auto-evaluation (WebJudge).	Source
7	OpenAI Operator	61.3%	OpenAI	Score from Online-Mind2Web paper; OpenAI did not publish judge, harness, or task-level results for independent verification.	Source
8	Claude 4.0	61.0%	Anthropic	Human-evaluation score reported by Yutori; 47.7% on auto-evaluation (WebJudge).	Source
9	ACT-1-20250814	57.3%	Enhans	Online-Mind2Web SR (Easy 81.9 / Med 54.5 / Hard 35.1); reported on osunlp HF leaderboard.	Source
10	Claude Computer Use 3.7 (w/o thinking)	56.3%	Anthropic	Online-Mind2Web SR (Easy 90.4 / Med 49.0 / Hard 32.4); reported on osunlp HF leaderboard.	Source
11	Claude 4.5	55.0%	Anthropic	Human-evaluation score reported by Yutori under identical evaluation settings; 59.3% on auto-evaluation (WebJudge).	Source
12	ACT-1-20250703	45.7%	Enhans	Online-Mind2Web SR (Easy 65.1 / Med 46.2 / Hard 23.0); reported on osunlp HF leaderboard.	Source
13	SeeAct (gpt-4o)	30.7%	OSU NLP	Online-Mind2Web SR (Easy 60.2 / Med 25.2 / Hard 8.1); reported on osu-nlp-group.github.io.	Source
14	Browser Use (gpt-4o)	30.0%	Browser Use	Online-Mind2Web SR (Easy 55.4 / Med 26.6 / Hard 8.1); reported on osunlp HF leaderboard.	Source
14	HAL Leaderboard baseline (best open)	~30%	Princeton / OSU	Reference baseline from the HAL leaderboard tracker; illustrates the gap between frontier commercial systems and open models.	Source
16	Claude Computer Use 3.5	29.0%	Anthropic	Online-Mind2Web SR (Easy 56.6 / Med 20.3 / Hard 14.9); reported on osunlp HF leaderboard.	Source
17	Agent-E (gpt-4o)	28.0%	Emergence AI	Online-Mind2Web SR (Easy 49.4 / Med 26.6 / Hard 6.8); reported on osunlp HF leaderboard.	Source
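
Four rows in the table report both a human-evaluation score and a WebJudge score, which makes the judge effect easy to quantify. The sketch below is a minimal illustration using only those four pairs; the names `PAIRED_SCORES` and `judge_gaps` are hypothetical and not part of any published tooling.

```python
# (human-evaluation %, WebJudge auto-evaluation %) pairs copied from the Notes column.
PAIRED_SCORES = {
    "Navigator": (78.7, 64.7),
    "Gemini 2.5 Computer Use": (69.0, 57.3),
    "Claude 4.0": (61.0, 47.7),
    "Claude 4.5": (55.0, 59.3),
}


def judge_gaps(pairs):
    """Return human minus WebJudge score, in percentage points, per system."""
    return {name: round(human - auto, 1) for name, (human, auto) in pairs.items()}


gaps = judge_gaps(PAIRED_SCORES)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} points (human minus WebJudge)")
```

In these four rows the gap ranges from -4.3 to +14.0 points, which is enough to reorder systems: ranked by WebJudge alone, Claude 4.5 (59.3%) would sit above Gemini 2.5 Computer Use (57.3%).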

Related benchmarks

Compare this benchmark with related pages from the hub:

WebVoyager, WebArena, BrowseComp

Frequently asked questions

Which system is currently best on Online-Mind2Web?

Browser Use Cloud (bu-max) currently leads with a tracked score of 97.0%. That score is self-reported using a custom agentic judge, so it is not directly comparable to rows scored by WebJudge or human evaluation. Rankings reflect submitted system setups (model plus tools and policy), not just a base model. Based on our latest tracked results, last updated 2026-04-16.

What should I read into an Online-Mind2Web score?

Online-Mind2Web scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.

Are these independently verified?

Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.