Online-Mind2Web leaderboard

Benchmark page for Online-Mind2Web with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-04-16

About this benchmark

Online-Mind2Web is a live browser agent benchmark of 300 diverse, realistic tasks across 136 popular websites — spanning shopping, finance, travel, government, and more. Unlike static offline benchmarks, agents interact with real, dynamic pages as they exist at evaluation time.

Published at COLM 2025 by OSU, it was introduced specifically to expose over-optimism in previously reported web agent results. The paper's central finding was that agents scoring highly on static benchmarks performed dramatically worse on live websites — hence the title 'An Illusion of Progress?'

It has since become the most widely cited live browser benchmark, with commercial agents (Browser Use, TinyFish, Yutori Navigator, UI-TARS-2) using it as the primary competitive signal for browser agent capability.

Judge methodology varies significantly across submissions — human eval, WebJudge, and custom agentic judges produce different scores for the same agent. Always check the Notes column before comparing rows.

Methodology

  • Tasks span three difficulty levels — easy (83), medium (143), and hard (74) — stratified by reference human step count. Performance drops sharply between levels: success rates fall by roughly 30 percentage points from easy to medium and a further ~15 points from medium to hard.
  • Two evaluation methods coexist: human evaluation (the gold standard, but slower) and WebJudge (LLM-as-a-judge, ~85% agreement with human judgment); a minimal judge sketch follows this list. Many teams report both — the Notes column specifies which applies to each row.
  • Because teams use different judges (WebJudge, screenshot-based, or custom agentic judges), scores are not always directly comparable across organizations. Browser Use's 97% uses a custom agentic judge built on the Claude Agent SDK, which is not the same as WebJudge.
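
For a concrete picture of the auto-evaluation style described in the list above, here is a minimal LLM-as-a-judge sketch. It is illustrative only: the actual WebJudge prompt and pipeline are published with the benchmark, while the Trajectory shape, llm_complete callable, and prompt text below are placeholder assumptions.

```python
# Minimal LLM-as-a-judge sketch (NOT the actual WebJudge implementation).
# The judge sees the task plus the agent's action trace and returns a
# success/failure verdict; screenshot handling is elided for brevity.
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task: str           # natural-language task description
    actions: list[str]  # serialized action trace (click, type, navigate, ...)
    screenshots: list[bytes] = field(default_factory=list)  # unused in this sketch


JUDGE_PROMPT = """You are evaluating a web agent.
Task: {task}
Action trace:
{actions}
Based on the trace and the final page state, answer SUCCESS or FAILURE,
then give a one-sentence justification."""


def judge(trajectory: Trajectory, llm_complete) -> bool:
    """Return True if the judge model deems the task completed.

    `llm_complete` is an assumed callable (prompt -> completion text);
    swap in any chat-completion client.
    """
    prompt = JUDGE_PROMPT.format(
        task=trajectory.task,
        actions="\n".join(trajectory.actions),
    )
    verdict = llm_complete(prompt)
    return verdict.strip().upper().startswith("SUCCESS")
```

The sketch makes the failure mode above concrete: two judges with different prompts, context, or verdict parsing can disagree on the same trajectory, which is why cross-organization comparisons need the Notes column.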

Leaderboard

Rank | System / Submission | Score | Organization | Notes | Source
1 | Browser Use Cloud (bu-max) | 97.0% | Browser-Use | Self-reported using a custom agentic judge built on Claude Agent SDK; OpenAI's score uses a different screenshot-based judge — not directly comparable. | Source
2 | ABP + Claude Opus 4.6 | 90.53% | theredsix | Agent Browser Protocol with Claude Opus 4.6; all 300 task results published publicly. Previous SOTA was 78.7%. | Source
3 | TinyFish | 90.0% | TinyFish AI | All 300 task runs published publicly; outperformed Gemini by 21 points and OpenAI Operator by 29 points at time of submission. | Source
4 | UI-TARS-2 | 88.2% | ByteDance / VLM-Research | Native GUI agent trained with multi-turn RL; score from technical report, evaluated under standard Online-Mind2Web conditions. | Source
5 | Navigator | 78.7% | Yutori | Human-evaluation score; also achieves 64.7% on auto-evaluation (WebJudge). 3.3x faster per step than Claude 4.5. | Source
6 | Gemini 2.5 Computer Use | 69.0% | Google DeepMind | Score reported by Yutori under identical evaluation settings; 57.3% on auto-evaluation (WebJudge). | Source
7 | OpenAI Operator | 61.3% | OpenAI | Score from the Online-Mind2Web paper; OpenAI did not publish judge, harness, or task-level results for independent verification. | Source
8 | Claude 4.0 | 61.0% | Anthropic | Human-evaluation score reported by Yutori; 47.7% on auto-evaluation (WebJudge). | Source
9 | ACT-1-20250814 | 57.3% | Enhans | Online-Mind2Web SR (Easy 81.9 / Med 54.5 / Hard 35.1); reported on the osunlp HF leaderboard. | Source
10 | Claude Computer Use 3.7 (w/o thinking) | 56.3% | Anthropic | Online-Mind2Web SR (Easy 90.4 / Med 49.0 / Hard 32.4); reported on the osunlp HF leaderboard. | Source
11 | Claude 4.5 | 55.0% | Anthropic | Human-evaluation score reported by Yutori under identical evaluation settings; 59.3% on auto-evaluation (WebJudge). | Source
12 | ACT-1-20250703 | 45.7% | Enhans | Online-Mind2Web SR (Easy 65.1 / Med 46.2 / Hard 23.0); reported on the osunlp HF leaderboard. | Source
13 | SeeAct (gpt-4o) | 30.7% | OSU NLP | Online-Mind2Web SR (Easy 60.2 / Med 25.2 / Hard 8.1); reported on osu-nlp-group.github.io. | Source
14 | Browser Use (gpt-4o) | 30.0% | Browser Use | Online-Mind2Web SR (Easy 55.4 / Med 26.6 / Hard 8.1); reported on the osunlp HF leaderboard. | Source
14 | HAL Leaderboard baseline (best open) | ~30% | Princeton / OSU | Reference baseline from the HAL leaderboard tracker; illustrates the gap between frontier commercial systems and open models. | Source
16 | Claude Computer Use 3.5 | 29.0% | Anthropic | Online-Mind2Web SR (Easy 56.6 / Med 20.3 / Hard 14.9); reported on the osunlp HF leaderboard. | Source
17 | Agent-E (gpt-4o) | 28.0% | Emergence AI | Online-Mind2Web SR (Easy 49.4 / Med 26.6 / Hard 6.8); reported on the osunlp HF leaderboard. | Source
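
For rows whose Notes report per-difficulty success rates, the overall score appears to be the task-count-weighted average over the 83 easy, 143 medium, and 74 hard tasks. A quick sanity check in Python against the ACT-1-20250814 row:

```python
# Overall SR as the task-count-weighted average of per-difficulty SRs.
COUNTS = {"easy": 83, "medium": 143, "hard": 74}  # 300 tasks total


def overall_sr(easy: float, medium: float, hard: float) -> float:
    weighted = (easy * COUNTS["easy"]
                + medium * COUNTS["medium"]
                + hard * COUNTS["hard"])
    return weighted / sum(COUNTS.values())


# ACT-1-20250814: Easy 81.9 / Med 54.5 / Hard 35.1
print(round(overall_sr(81.9, 54.5, 35.1), 1))  # -> 57.3, matching the table
```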


Frequently asked questions

Which system is currently best on Online-Mind2Web?
Browser Use Cloud (bu-max) currently leads with a tracked score of 97.0%. This ranking reflects submitted system setups (model plus tools and policy), not just a base model, and is based on our latest tracked results, last updated Apr 16, 2026.
What should I read into an Online-Mind2Web score?
Online-Mind2Web scores are most useful for within-benchmark ranking. Read the Notes column to understand each row's setup, and consult the methodology section above before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and others are team-reported. Use each row's source link and notes field to verify the evidence level before drawing strong conclusions.