About this benchmark

WebVoyager evaluates end-to-end browser agents on 643 tasks across 15 popular real-world websites. Tasks cover search, navigation, form filling, map and travel lookup, shopping, and information retrieval on live pages rather than static snapshots.

It is useful as a browser-agent adoption signal because many commercial and open-source agents report it, but it is unusually sensitive to task drift, removed tasks, evaluator choice, and whether the run used the full task suite.

Read each row as a full-system result: model, prompt, browser execution layer, retry policy, DOM or accessibility extraction, and visual grounding can all contribute to the final score.

WebVoyager is high-visibility but not fully standardized across modern submissions; small score gaps can reflect setup choices as much as capability.

Example tasks

Three public tasks quoted from benchmark sources:

Methodology

  • Primary metric is task success rate: completed tasks divided by evaluated tasks. The original paper used GPT-4V as an automatic evaluator and reported 85.3% agreement with human judgment.
  • We prioritize public sources that identify the system, score, task subset or evaluator when available, and a paper, repository, model card, or launch post that can be checked later.
  • Direct comparisons are strongest when systems run the same task set, same evaluator, same attempt policy, and same handling of stale or auth-gated tasks.
  • Rows that use filtered task subsets, manual correction, or custom judges are kept when source-linked, but notes should be read before treating adjacent ranks as meaningful differences.
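The success-rate metric and the effect of task filtering can be illustrated with a minimal sketch. The `TaskResult` record, the `stale` flag, and the five sample results are hypothetical and exist only to show how the denominator changes; they are not data from any tracked submission.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str         # hypothetical identifier, not a real WebVoyager task ID
    success: bool        # verdict from whichever evaluator the run used
    stale: bool = False  # assumption: task can no longer be completed on the live site


def success_rate(results):
    """Completed tasks divided by evaluated tasks, as a fraction in [0, 1]."""
    if not results:
        raise ValueError("no evaluated tasks")
    return sum(r.success for r in results) / len(results)


# Made-up results for illustration only.
results = [
    TaskResult("t1", True),
    TaskResult("t2", True),
    TaskResult("t3", False),
    TaskResult("t4", False, stale=True),
    TaskResult("t5", True),
]

full = success_rate(results)                                  # 3 of 5 tasks
filtered = success_rate([r for r in results if not r.stale])  # 3 of 4 tasks

print(f"full suite: {full:.1%}")        # full suite: 60.0%
print(f"stale removed: {filtered:.1%}")  # stale removed: 75.0%
```

The same three successes produce two different scores depending on whether the stale task stays in the denominator, which is why the task subset matters when comparing rows.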

Frequently asked questions

Which system is currently best on WebVoyager?
Alumnium currently leads with a tracked score of 98.5%, based on our latest tracked results (last updated Mar 22, 2026). This ranking reflects submitted system setups (model plus tools and policy), not just a base model.
What should I read into a WebVoyager score?
WebVoyager scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.
Can I compare every row directly?
Use caution. Rows can vary by evaluator, harness, attempt budget, tool access, task filtering, or verification level. Source links and notes are part of the score, not decoration.
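One way to apply this caution is to list the setup fields on which two rows differ before comparing their scores. The sketch below is only an illustration: the `RowSetup` fields and both example rows are hypothetical, and a real notes column may record different or additional details.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RowSetup:
    # Hypothetical fields; a real leaderboard row may track more or fewer.
    evaluator: str
    task_count: int
    max_attempts: int
    stale_task_policy: str


def setup_differences(a, b):
    """Return the names of the setup fields on which two rows disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Made-up rows for illustration only.
row_a = RowSetup("GPT-4V judge", 643, 1, "kept")
row_b = RowSetup("custom judge", 590, 3, "removed")

diffs = setup_differences(row_a, row_b)
if diffs:
    print("Not directly comparable; setups differ on:", ", ".join(diffs))
else:
    print("Setups match on all tracked fields.")
```

An empty list only means the tracked fields agree; it does not establish that the two runs are otherwise equivalent.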
How is WebVoyager different from WebArena?
WebVoyager runs on live public websites and therefore captures drift, anti-bot behavior, and production UI variance. WebArena is self-hosted and more reproducible, making it better for controlled experiments and ablations.
Why do WebVoyager scores vary between sources?
Modern submissions may remove stale tasks, use different judges, allow different retry budgets, or manually audit evaluator mistakes. Those choices can move scores without representing a clean capability difference.
Is WebVoyager enough to pick a production browser agent?
No. It is a useful directional signal for navigation and retrieval, but production selection should also test latency, cost, authentication flows, CAPTCHA or bot defenses, reliability on your own target sites, and recovery from partial failures.