Leaderboard
| System / Submission | Score | Organization | Reported | Source |
|---|---|---|---|---|
| | 98.5% | Alumnium | | Source |
| Surfer 2: System-level orchestration with submitter-defined setup details. | 97.1% | H Company | | Source |
| | 93.9% | Magnitude | | Source |
| | 92.2% | H Company | | Source |
| | 90.4% | Browserable | | Source |
| | 89.1% | Browser Use | | Source |
| GLM-5V-Turbo (new): Multimodal vision model optimized for GUI automation and coding tasks. | 88.5% | Z.ai | | Source |
| Agent Kura: 602/643 tasks (41 removed for invalid/auth issues); reported on trykura.com. | 87.0% | Kura | | Source |
| Operator: Native browser integration using proprietary vision-control models. | 87% | OpenAI | | Source |
| | 86.2% | Notte | | Source |
| | 85.85% | Skyvern | | Source |
| Project Mariner: Gemini-powered reasoning with precise visual grounding. | 83.5% | | | Source |
| | 73.2% | Emergence AI | | Source |
| WebSight: Navigation system prioritizing visual-only perceptual inputs. | 68% | Academic Research | | Source |
| Runner H 0.1: Foundational agent architecture for general web interaction. | 67% | H Company | | Source |
| | 59.1% | Academic Research | | Source |
| Anthropic Computer Use 3.5: Sampled 50/602 tasks for direct comparison; reported on trykura.com. | 56.0% | Anthropic | | Source |
| WILBUR: Research implementation using black-box optimization techniques. | 53% | Bardeen / UC Berkeley | | Source |
| GPT-4 (All Tools): ChatGPT integrated tool baseline from original WebVoyager paper; reported on arxiv.org. | 30.8% | OpenAI | | Source |
About this benchmark
WebVoyager evaluates end-to-end browser agents on 643 tasks across 15 popular real-world websites. Tasks cover search, navigation, form filling, map and travel lookup, shopping, and information retrieval on live pages rather than static snapshots.
It is useful as a browser-agent adoption signal because many commercial and open-source agents report it, but it is unusually sensitive to task drift, removed tasks, evaluator choice, and whether the run used the full task suite.
Read each row as a full-system result: model, prompt, browser execution layer, retry policy, DOM or accessibility extraction, and visual grounding can all contribute to the final score.
WebVoyager is high-visibility but not fully standardized across modern submissions; small score gaps can reflect setup choices as much as capability.
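To make the sensitivity to removed tasks concrete, the sketch below rescores a filtered-subset result against the full 643-task suite. The figures come from the Agent Kura row (87.0% on 602 tasks, 41 removed). The function name `full_suite_bounds` and the rounding of successes to a whole count are illustrative assumptions, not something any submitter reports.

```python
def full_suite_bounds(reported_rate: float, evaluated: int, total: int) -> tuple[float, float]:
    """Bound the full-suite success rate implied by a score on a filtered subset.

    Lower bound: every removed task is counted as a failure.
    Upper bound: every removed task is counted as a success.
    """
    if not 0 < evaluated <= total:
        raise ValueError("evaluated must be between 1 and total")
    # Assumption: the reported rate was rounded from a whole number of successes.
    successes = round(reported_rate * evaluated)
    removed = total - evaluated
    return successes / total, (successes + removed) / total


low, high = full_suite_bounds(0.870, evaluated=602, total=643)
print(f"{low:.1%} to {high:.1%}")  # 81.5% to 87.9%
```

Under these assumptions the same run could land anywhere from roughly 81.5% to 87.9% on the full suite, a range wide enough to span several adjacent leaderboard rows.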
Example tasks
Three public tasks quoted from benchmark sources:
- "Provide a recipe for vegetarian lasagna with more than 100 reviews and a rating of at least 4.5 stars suitable for 6 people." Citation: WebVoyager dataset
- "Search an Xbox Wireless controller with green color and rated above 4 stars." Citation: WebVoyager dataset
- "Find a Blue iPhone 12 Pro 128gb and add to cart." Citation: WebVoyager dataset
Methodology
- Primary metric is task success rate: successfully completed tasks divided by evaluated tasks. The original paper used GPT-4V as an automatic evaluator and reported 85.3% agreement with human judgment.
- We prioritize public sources that identify the system, score, task subset or evaluator when available, and a paper, repository, model card, or launch post that can be checked later.
- Direct comparisons are strongest when systems run the same task set, same evaluator, same attempt policy, and same handling of stale or auth-gated tasks.
- Rows that use filtered task subsets, manual correction, or custom judges are kept when source-linked, but read each row's notes before treating adjacent ranks as meaningful differences.
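The comparability conditions in the list above can be sketched as a small check. The `Run` record, its field names, and the two sample runs are hypothetical and exist only for illustration; they do not describe any real submission. Task-set size stands in for task-set identity here, which is a simplifying assumption, since two runs can evaluate the same number of tasks and still differ in which tasks they include.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical record of one reported result; field names are illustrative."""
    system: str
    successes: int
    evaluated: int           # number of tasks actually scored
    evaluator: str           # e.g. "GPT-4V", "human", "custom"
    attempts_per_task: int
    stale_task_policy: str   # e.g. "removed", "counted_as_failure"

    @property
    def success_rate(self) -> float:
        # Task success rate: successes divided by evaluated tasks.
        return self.successes / self.evaluated


def comparability_gaps(a: Run, b: Run) -> list[str]:
    """Return the setup dimensions on which two runs differ (empty means comparable)."""
    checks = {
        "task set size": (a.evaluated, b.evaluated),
        "evaluator": (a.evaluator, b.evaluator),
        "attempt policy": (a.attempts_per_task, b.attempts_per_task),
        "stale/auth-gated handling": (a.stale_task_policy, b.stale_task_policy),
    }
    return [name for name, (x, y) in checks.items() if x != y]


# Placeholder values, not real leaderboard entries.
filtered = Run("agent-a", 524, 602, "custom", 1, "removed")
full = Run("agent-b", 540, 643, "GPT-4V", 1, "counted_as_failure")
gaps = comparability_gaps(filtered, full)
print(gaps)
```

A non-empty result is a signal to treat the rank difference between the two rows with caution rather than a reason to discard either result.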
Links
Related benchmarks
Compare this benchmark with related pages from the hub: