About this benchmark

WebArena evaluates browser agents in reproducible, self-hosted websites instead of the open live web. Its 812 tasks cover e-commerce, forum discussion, collaborative software development, content management, maps, and reference lookup.

The benchmark is strongest when you care about repeatable web-agent experiments: every task starts from a controlled state and is judged by functional success criteria, rather than running against a changing production website.

Because many rows come from a public community tracker, a WebArena score should be read alongside the source, submitted scaffold, observation mode, and whether the result was independently reproduced.

Controlled environments improve reproducibility, but tracker rows still vary by scaffold and submission policy.

Filtered task-set or modified-grader reports are not ranked as full WebArena results unless the row notes that setup explicitly.
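One way to read the rule above is as a small labeling function over tracker rows. The sketch below is only an illustration under stated assumptions: the `TrackerRow` fields, the label strings, and the system names are hypothetical and do not come from the WebArena tracker or this hub.

```python
from dataclasses import dataclass


@dataclass
class TrackerRow:
    """Hypothetical row shape; field names are assumptions for illustration."""
    system: str
    score: float            # success rate in percent
    full_task_set: bool     # False if the run used a filtered task subset
    standard_grader: bool   # False if the grader was modified
    notes: str = ""         # free-text setup notes


def ranking_label(row: TrackerRow) -> str:
    """Label a row as a full result, a noted non-standard result, or excluded."""
    if row.full_task_set and row.standard_grader:
        return "full"
    # Non-standard setups are kept only when the row documents the setup.
    if row.notes.strip():
        return "full (setup noted)"
    return "excluded"


rows = [
    TrackerRow("agent-a", 41.0, True, True),
    TrackerRow("agent-b", 55.0, False, True),
    TrackerRow("agent-c", 52.0, False, True, notes="filtered task subset"),
]
for row in rows:
    print(row.system, ranking_label(row))
```

Keeping the label separate from the score means a non-standard row stays visible in the table without being sorted against full runs by accident.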

Methodology

  • The primary metric is end-to-end task success rate on the WebArena task set; in the original paper, the best GPT-4-based baseline reached 14.41%, versus 78.24% for human performance.
  • Evaluation checks functional correctness through task-specific validators and answer checks in the hosted environment.
  • Scores can change with prompt scaffolding, observation mode, browser action space, and retry or step budget.
  • We prefer rows tied to WebArena's public leaderboard, papers, or repositories that include enough setup detail to reproduce the run.
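The success-rate metric described above reduces to a simple ratio of passed tasks to attempted tasks. The following is a minimal sketch, assuming each task produces a single pass/fail verdict from its validator; the `TaskResult` type is hypothetical, and the count of 117 passes is an illustrative number, not a figure reported by the source.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record: one validator verdict per task."""
    task_id: int
    passed: bool


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks whose validator passed."""
    if not results:
        raise ValueError("no task results to score")
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


# Illustrative data only: 117 passes across 812 tasks.
results = [TaskResult(task_id=i, passed=(i < 117)) for i in range(812)]
print(f"{success_rate(results):.2f}%")  # prints "14.41%"
```

Because the denominator is the number of tasks attempted, a run on a filtered subset yields a percentage that is not directly comparable with a full 812-task run.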

Frequently asked questions

Which system is currently best on WebArena?
WebTactix (DeepSeek v3.2) currently leads with a tracked score of 74.3%. This ranking reflects submitted system setups (model plus tools and policy), not just a base model. It is based on our latest tracked results, last updated May 27, 2026.
What should I read into a WebArena score?
WebArena scores are most useful for within-benchmark ranking. Read the Notes column to understand each row's setup, and consult the Methodology section before making procurement or architecture decisions.