WebVoyager
643 tasks across 15 live public websites, evaluated by a GPT-4V judge. One of the most widely adopted web agent benchmarks, commonly used as a point of comparison for both commercial and research agents.
- Benchmark type:
- Public benchmark
- Benchmark domain:
- Web navigation
- Task count:
- 643
- Evaluation method:
- GPT-4V judge
- Top model score:
- 97.1%
- Human score:
- ~90%
About this benchmark
WebVoyager is a multimodal web agent benchmark introduced by He et al. (Tencent) in January 2024. It comprises 643 tasks across 15 popular real-world websites, generated through a semi-automated pipeline and filtered for quality. The benchmark tests end-to-end web navigation using a Selenium-based online browsing environment, where agents interact with live websites rather than static snapshots or simulators. The agent receives both screenshots (with bounding-box overlays on interactive elements) and accessibility tree text as observations, supporting multimodal or text-only configurations.
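The observation format described above pairs a labeled screenshot with a text rendering of the interactive elements. A minimal sketch of that idea, in pure Python, might look like the following; the element schema, function name, and output format are illustrative assumptions, not the benchmark's actual code.

```python
# Sketch of a WebVoyager-style observation: numeric labels for bounding-box
# overlays on the screenshot, plus an accessibility-tree-like text dump for
# text-only configurations. The element schema here is an assumption.

def build_observation(elements):
    """elements: list of dicts with 'tag', 'text', and 'rect' (x, y, w, h).

    Returns (overlays, a11y_text): numbered boxes to draw on the screenshot,
    and a text rendering the agent can consume instead of (or alongside) it.
    """
    overlays = []
    a11y_lines = []
    for idx, el in enumerate(elements, start=1):
        overlays.append({"label": idx, "rect": el["rect"]})
        a11y_lines.append(f'[{idx}] <{el["tag"]}> "{el["text"]}"')
    return overlays, "\n".join(a11y_lines)

# Example: two interactive elements found on a page.
elements = [
    {"tag": "input", "text": "Search", "rect": (10, 20, 200, 30)},
    {"tag": "button", "text": "Go", "rect": (220, 20, 60, 30)},
]
overlays, a11y_text = build_observation(elements)
```

The agent then refers to elements by their numeric labels when issuing actions (click, type, scroll), which keeps the action space small regardless of page complexity.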
Evaluation uses a novel GPT-4V-based automatic protocol that examines the agent's final screenshots and responses to judge task success, achieving 85.3% agreement with human judgment. The original WebVoyager agent (GPT-4V-powered) achieves a 59.1% task success rate, significantly outperforming GPT-4 All Tools and text-only baselines. Tasks include time-sensitive queries on sites like Booking and Google Flights, plus 90 supplementary tasks extracted from the GAIA benchmark validation set.
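The judging step amounts to sending the task, the agent's final answer, and the final screenshots to a vision-capable model and asking for a success verdict. A hedged sketch of assembling such a request is below; the prompt wording, function name, and model identifier are assumptions for illustration, not the benchmark's exact protocol.

```python
import base64

def build_judge_request(task, agent_answer, screenshots):
    """Assemble a GPT-4V-style chat request asking a judge model to mark a
    task SUCCESS or NOT SUCCESS from the agent's answer and final screenshots.

    screenshots: list of raw PNG bytes. The prompt text is a paraphrase of
    the idea, not the benchmark's actual judge prompt.
    """
    content = [{
        "type": "text",
        "text": (
            f"Task: {task}\n"
            f"Agent answer: {agent_answer}\n"
            "Based on the screenshots and the answer, "
            "reply SUCCESS or NOT SUCCESS."
        ),
    }]
    for png_bytes in screenshots:
        b64 = base64.b64encode(png_bytes).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return {
        "model": "gpt-4-vision-preview",  # placeholder model name
        "messages": [{"role": "user", "content": content}],
    }

# Usage with a fake one-screenshot payload (not a real PNG):
request = build_judge_request(
    "Find the cheapest flight to Tokyo", "USD 420", [b"\x89PNG"]
)
```

Judging from final screenshots rather than action traces is what lets the protocol work on live, changing websites, at the cost of occasional disagreement with human raters (hence the reported 85.3% agreement).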
WebVoyager is notable for being one of the first benchmarks to evaluate agents on live, unmodified websites rather than controlled environments, making it a key reference for real-world web agent evaluation. The code and data are released under an Apache 2.0 license.
Where this benchmark fits
Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the web navigation view. You can also jump straight to this benchmark in the master registry list.