Canonical benchmark page
WebVoyager leaderboard
Benchmark page for WebVoyager with standardized structure: about, leaderboard table, and FAQ.
Last updated: 2026-03-22
About this benchmark
WebVoyager is a benchmark for browser agents operating on live websites. It focuses on practical tasks such as navigation, search, form completion, and multi-step workflows across a broad website mix.
Builders use WebVoyager to compare end-to-end browsing systems under a shared task suite. It is one of the most commonly cited public references for web agent capability in production-like browsing flows.
Scores here generally reflect the full system setup, not only the base model. Prompting strategy, tool policies, retries, and the browser execution stack can all materially change outcomes.
Rows may use different evaluation settings, so they are not always strictly comparable.
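One concrete way evaluation settings can differ is the number of tasks scored: the leaderboard notes mention one score computed on 602 tasks and another on a 50-task sample. The sketch below is illustrative only and is not part of any submitter's methodology. It uses a Wilson score interval to show how much wider the uncertainty is on a small sample. The count 28/50 assumes the 56.0% figure was computed over exactly the 50 sampled tasks, and 337/602 is a hypothetical count chosen to give roughly the same rate on the full task set.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Assumption: 56.0% on a 50-task sample corresponds to 28 passes.
small_lo, small_hi = wilson_interval(28, 50)
# Hypothetical: about the same pass rate measured on all 602 tasks.
large_lo, large_hi = wilson_interval(337, 602)

print(f"50 tasks:  {small_lo:.1%} to {small_hi:.1%}")
print(f"602 tasks: {large_lo:.1%} to {large_hi:.1%}")
```

The interval on 50 tasks spans a range several times wider than the one on 602 tasks, which is one reason a sampled row and a full-suite row should be compared with caution.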
Methodology
- Evaluation typically reports pass rates over benchmark tasks and may differ by run policy (for example pass@1 vs multi-attempt settings).
- Reported rows can mix independent studies and team-published updates; always check source methodology before direct comparisons.
- The last tracked update is the "Last updated" timestamp shown on this page; rows can be revised as better-sourced reports are added.
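To make the run-policy point concrete, here is a minimal sketch of how the same attempt logs can yield different pass rates under a single-attempt policy and a multi-attempt policy. The task ids, the log format, and the `pass_rate` helper are hypothetical and are not taken from the WebVoyager harness; "multi-attempt" is assumed here to mean "solved on any of the first k attempts".

```python
from typing import Dict, List


def pass_rate(results: Dict[str, List[bool]], max_attempts: int = 1) -> float:
    """Fraction of tasks solved within the first `max_attempts` attempts."""
    if not results:
        raise ValueError("no task results given")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    solved = sum(any(attempts[:max_attempts]) for attempts in results.values())
    return solved / len(results)


# Hypothetical attempt logs: task id -> outcome of each attempt, in order.
runs = {
    "task-a": [True, True, True],
    "task-b": [False, True, False],
    "task-c": [False, False, True],
    "task-d": [False, False, False],
}

single = pass_rate(runs, max_attempts=1)  # only the first attempt counts
multi = pass_rate(runs, max_attempts=3)   # any success in three attempts counts
print(f"pass@1: {single:.1%}, best-of-3: {multi:.1%}")
```

On these made-up logs the single-attempt rate is 25.0% and the three-attempt rate is 75.0%, so two rows reporting the same system under different policies could look far apart.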
Links
WebVoyager
Agent scope

| Rank | System / Submission | Score | Organization | Notes | Source |
|---|---|---|---|---|---|
| 1 | Jina (new) | 98.9% | Om Labs | Om Labs custom tracker; Jina multi-model system on self-hosted WebVoyager harness. | Source |
| 2 | Alumnium (new) | 98.6% | Alumnium | Accessibility-based tree parsing with integrated visual reasoning. | Source |
| 3 | Surfer 2 (new) | 97.1% | H Company | System-level orchestration with submitter-defined setup details. | Source |
| 4 | Magnitude | 93.9% | Magnitude | Open-source architecture utilizing a modular agentic stack. | Source |
| 5 | AIME Browser-Use (new) | 92.34% | Aime | Custom orchestration layer with specialized browser tooling. | Source |
| 6 | Surfer-H + Holo1 (new) | 92.2% | H Company | Multi-modal action kernels integrated via H-Company research. | Source |
| 7 | Browserable (new) | 90.4% | Browserable | Fine-tuned browser control models within a commercial framework. | Source |
| 8 | Browser Use | 89.1% | Browser Use | Multi-step orchestration framework for open-source automation. | Source |
| 9 | GLM-5V-Turbo (new) | 88.5% | Z.ai | Multimodal vision model optimized for GUI automation and coding tasks. | Source |
| 10 | Agent Kura | 87.0% | Kura | 602/643 tasks (41 removed for invalid/auth issues); reported on trykura.com. | Source |
| 10 | Operator | 87% | OpenAI | Native browser integration using proprietary vision-control models. | Source |
| 12 | Skyvern 2.0 | 85.85% | Skyvern | DOM-level reasoning coupled with real-time error-correction. | Source |
| 13 | Project Mariner | 83.5% | | Gemini-powered reasoning with precise visual grounding. | Source |
| 14 | Agent-E | 73.1% | Emergence AI | Hierarchical planning modules within a multi-agent framework. | Source |
| 14 | Notte (new) | 73.1% | Notte | Standardized operator stack for open-source performance evaluation. | Source |
| 16 | WebSight | 68% | Academic Research | Navigation system prioritizing visual-only perceptual inputs. | Source |
| 17 | Runner H 0.1 | 67% | H Company | Foundational agent architecture for general web interaction. | Source |
| 18 | WebVoyager | 59.1% | Academic Research | Baseline implementation using standard multimodal LLM control. | Source |
| 19 | Anthropic Computer Use 3.5 | 56.0% | Anthropic | Sampled 50/602 tasks for direct comparison; reported on trykura.com. | Source |
| 20 | WILBUR | 53% | Academic Research | Research implementation using black-box optimization techniques. | Source |
| 21 | GPT-4 (All Tools) | 30.8% | OpenAI | ChatGPT integrated tool baseline from original WebVoyager paper; reported on arxiv.org. | Source |
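The Rank column appears to follow standard competition ranking, where tied scores share a rank and the next rank is skipped (for example 10, 10, 12). The sketch below reproduces that scheme on four rows from the table; the `competition_rank` helper is hypothetical and only illustrates the apparent convention, not how this page is actually generated.

```python
from typing import List, Tuple


def competition_rank(rows: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Assign 1-based ranks by descending score; ties share the lower rank number."""
    # sorted() is stable, so tied rows keep their input order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and score == ranked[-1][2]:
            rank = ranked[-1][0]  # same score as the previous row: reuse its rank
        else:
            rank = position       # otherwise the rank equals the sorted position
        ranked.append((rank, name, score))
    return ranked


# Scores in percent, copied from four adjacent leaderboard rows.
subset = [
    ("Agent Kura", 87.0),
    ("Operator", 87.0),
    ("Skyvern 2.0", 85.85),
    ("GLM-5V-Turbo", 88.5),
]

for rank, name, score in competition_rank(subset):
    print(rank, name, score)
```

Within this four-row subset the ranks come out as 1, 2, 2, 4, mirroring the 9, 10, 10, 12 pattern in the full table.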
Related benchmarks
Compare this benchmark with related pages from the hub: