Web navigation benchmark - Public

Online-Mind2Web

300 verified tasks across 136 live websites. Independently verified by Princeton HAL, which tracks cost alongside accuracy and so provides a Pareto-frontier view of performance versus cost.

BENCHMARK
Benchmark type: Public benchmark
Benchmark domain: Web navigation
Task count: 300
Evaluation method: HAL Verified
Top model score: 42.33% (SeeAct + GPT-5, Academic)
Human score: N/A
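
To make the cost-versus-accuracy view concrete, here is a minimal sketch of how a Pareto frontier can be computed from (cost, accuracy) pairs. The agent names, costs, and accuracies are made-up illustrative values, and `Run` and `pareto_frontier` are hypothetical helpers; none of this is taken from HAL's code or leaderboard data.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str        # hypothetical agent label
    cost_usd: float  # total cost of the run (illustrative value)
    accuracy: float  # task success rate in [0, 1] (illustrative value)


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep only runs that no cheaper-or-equal run matches or beats on accuracy."""
    frontier: list[Run] = []
    best_accuracy = float("-inf")
    # Walk runs from cheapest to most expensive; on equal cost, most accurate first.
    for run in sorted(runs, key=lambda r: (r.cost_usd, -r.accuracy)):
        # A run joins the frontier only if it improves on every cheaper run.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


runs = [
    Run("agent_a", 10.0, 0.20),
    Run("agent_b", 25.0, 0.35),
    Run("agent_c", 30.0, 0.30),  # costs more than agent_b but is less accurate
    Run("agent_d", 60.0, 0.42),
]
frontier_names = [r.name for r in pareto_frontier(runs)]
print(frontier_names)  # ['agent_a', 'agent_b', 'agent_d']
```

A run on the frontier is one where no alternative is both cheaper and at least as accurate, so a lower-scoring agent can still be a reasonable choice if it is much cheaper.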

About this benchmark

Online-Mind2Web is an online evaluation benchmark introduced by the OSU NLP Group in April 2025, building on the original Mind2Web dataset. It consists of 300 diverse and realistic tasks spanning 136 real websites, designed to evaluate web agents under conditions that approximate how real users interact with the open web. The benchmark addresses shortcomings in existing evaluations by moving from offline, static HTML snapshots to live online environments where agents must navigate actual websites end-to-end.

Evaluation uses an automatic LLM-as-a-Judge method that reaches approximately 85% agreement with human judgment, substantially higher than prior automatic evaluation approaches. The authors' assessment shows agent competency to be markedly lower than previously reported results, suggesting that the field has been over-optimistic. The benchmark enables the first large-scale comparative analysis of current web agents, highlighting both their strengths and their critical limitations.
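
As a rough illustration of what an agreement figure like this measures, the sketch below computes the fraction of tasks on which an automatic judge and a human annotator give the same success/failure verdict. The labels are invented for the example and `agreement_rate` is a hypothetical helper; the benchmark's actual judging pipeline and agreement protocol may differ.

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of tasks where the judge verdict equals the human verdict."""
    if len(judge) != len(human):
        raise ValueError("judge and human label lists must have the same length")
    if not judge:
        raise ValueError("at least one labeled task is required")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Illustrative verdicts for eight tasks (True = task judged successful).
judge_labels = [True, False, True, True, False, False, True, False]
human_labels = [True, False, False, True, False, False, True, True]

rate = agreement_rate(judge_labels, human_labels)
print(f"{rate:.2%}")  # 75.00%
```

Raw agreement of this kind does not correct for chance agreement, so it is best read alongside the balance of success and failure labels in the evaluated set.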

Online-Mind2Web is particularly significant because it demonstrates that many gains reported on static benchmarks do not transfer to realistic online settings. The original Mind2Web dataset covers 2,000+ tasks from 137 websites across 31 domains with crowdsourced action sequences, and Online-Mind2Web refines this into a focused evaluation set of 300 tasks on 136 websites. The data is licensed under Creative Commons Attribution 4.0.

Where this benchmark fits

Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the web navigation view. You can also jump straight to this benchmark in the master registry list.