About this benchmark

Online-Mind2Web turns the static Mind2Web idea into a live benchmark of 300 tasks across 136 websites, covering shopping, finance, travel, government, and other consumer workflows.

The paper was framed around the gap between offline benchmark progress and real online performance; agents that look strong on static snapshots can fail when pages, timing, and interaction flows change.

It is one of the most useful web-agent benchmarks for current product work, but judge methodology varies across submissions. Human evaluation, WebJudge, and custom agentic judges can produce different scores for the same agent, so a reported score should always be read together with the judge that produced it.

Example tasks

Three public tasks quoted from benchmark sources:

Methodology

  • Primary score is task success rate across easy, medium, and hard tasks, where difficulty is stratified by reference human step count.
  • The paper introduced WebJudge, an LLM-as-judge method with roughly 85% agreement with human judgment, but newer submissions sometimes use custom judges.
  • Human evaluation is the clearest comparison point; automated judge scores should be compared only when the judge, the screenshots or traces, and the task-level results are all published.
  • Rows are included when the source provides a benchmark score and enough information to identify the evaluator or setup.
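
The two quantities described above, success rate per difficulty bucket and judge-versus-human agreement, can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's evaluation code: the function names and the step-count cut-offs (`EASY_MAX_STEPS`, `MEDIUM_MAX_STEPS`) are assumptions made for the example, and the official thresholds should be taken from the benchmark's own documentation.

```python
from collections import defaultdict

# Assumed cut-offs for illustration only; check the benchmark's definition.
EASY_MAX_STEPS = 5
MEDIUM_MAX_STEPS = 10


def difficulty(reference_steps):
    """Map a reference human step count to a difficulty bucket."""
    if reference_steps <= EASY_MAX_STEPS:
        return "easy"
    if reference_steps <= MEDIUM_MAX_STEPS:
        return "medium"
    return "hard"


def success_rates(results):
    """Compute success rate per difficulty bucket plus an 'overall' entry.

    results: iterable of (reference_steps, succeeded) pairs, one per task.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for steps, succeeded in results:
        # Each task counts once in its bucket and once in the overall tally.
        for key in (difficulty(steps), "overall"):
            totals[key] += 1
            wins[key] += int(succeeded)
    return {key: wins[key] / totals[key] for key in totals}


def judge_agreement(judge_labels, human_labels):
    """Fraction of tasks where an automated judge matches the human verdict."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled task")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

Calling `success_rates` on per-task outcomes gives one rate per bucket, which shows whether a headline score is carried mostly by easy tasks. `judge_agreement` is plain accuracy against human labels, the kind of figure an agreement claim such as "roughly 85%" refers to; it says nothing about whether the judge errs toward false passes or false failures.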

Frequently asked questions

Which system is currently best on Online-Mind2Web?
Browser Use Cloud (bu-max) currently leads with a tracked score of 97.0%, based on our latest tracked results (last updated Apr 16, 2026). The ranking reflects submitted system setups (model plus tools and policy), not just a base model.
What should I read into an Online-Mind2Web score?
Online-Mind2Web scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.
Are these scores independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.
Can I compare every row directly?
Use caution. Rows can vary by evaluator, harness, attempt budget, tool access, task filtering, or verification level. Source links and notes are part of the score, not decoration.
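
One way to act on this caution is to rank rows only against others that share the same evaluator. The sketch below assumes each row is a dictionary with hypothetical `system`, `score`, and `evaluator` fields; these names are illustrative and do not describe the hub's actual data format.

```python
from collections import defaultdict


def group_by_evaluator(rows):
    """Group leaderboard rows by judge type, each group sorted best score first.

    rows: iterable of dicts with (hypothetical) 'system', 'score', 'evaluator' keys.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["evaluator"]].append(row)
    return {
        evaluator: sorted(members, key=lambda r: r["score"], reverse=True)
        for evaluator, members in groups.items()
    }
```

Grouping by evaluator covers only one of the listed sources of variation. Harness, attempt budget, tool access, and task filtering would need the same treatment before two rows count as directly comparable.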