
WebVoyager leaderboard

Benchmark page for WebVoyager with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-03-22

About this benchmark

WebVoyager is a benchmark for browser agents operating on live websites. It focuses on practical tasks such as navigation, search, form completion, and multi-step workflows across a broad website mix.

Builders use WebVoyager to compare end-to-end browsing systems under a shared task suite. It is one of the most commonly cited public references for web agent capability in production-like browsing flows.

Scores here generally reflect full system setup, not only a base model. Prompting strategy, tool policies, retries, and browser execution stack can all materially change outcomes.

Rows may use different evaluation settings, so comparisons are not always strictly apples-to-apples.

Methodology

  • Evaluation typically reports pass rates over benchmark tasks, and results may differ by run policy (for example, pass@1 vs multi-attempt settings); see the sketch after this list.
  • Reported rows can mix independent studies and team-published updates; always check the source methodology before making direct comparisons.
  • The last tracked update for this page is the timestamp shown above; rows can be revised as better-sourced reports are added.
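
To make the run-policy point concrete, here is a minimal Python sketch, assuming per-task attempt logs as ordered lists of booleans. The function names and data are illustrative assumptions, not part of any official WebVoyager harness.

```python
# Minimal sketch of how run policy changes a reported pass rate.
# Assumes each task has an ordered list of attempt outcomes
# (True = success); all names and data here are hypothetical.

def pass_at_1(attempts_per_task: list[list[bool]]) -> float:
    """Fraction of tasks whose first attempt succeeded."""
    return sum(a[0] for a in attempts_per_task) / len(attempts_per_task)

def pass_within_k(attempts_per_task: list[list[bool]], k: int) -> float:
    """Fraction of tasks solved within the first k attempts (multi-attempt)."""
    return sum(any(a[:k]) for a in attempts_per_task) / len(attempts_per_task)

# Hypothetical logs for 4 tasks: the same system scores 25% pass@1
# but 75% under a 3-attempt policy.
logs = [[True], [False, True], [False, False, True], [False, False, False]]
print(pass_at_1(logs))         # 0.25
print(pass_within_k(logs, 3))  # 0.75
```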

Leaderboard (agent scope)

| Rank | System / Submission | Score | Organization | Notes |
| --- | --- | --- | --- | --- |
| 1 | Jina | 98.9% | Om Labs | Om Labs custom tracker; Jina multi-model system on self-hosted WebVoyager harness. |
| 2 | Alumnium | 98.6% | Alumnium | Accessibility-based tree parsing with integrated visual reasoning. |
| 3 | Surfer 2 | 97.1% | H Company | System-level orchestration with submitter-defined setup details. |
| 4 | Magnitude | 93.9% | Magnitude | Open-source architecture utilizing a modular agentic stack. |
| 5 | AIME Browser-Use | 92.34% | Aime | Custom orchestration layer with specialized browser tooling. |
| 6 | Surfer-H + Holo1 | 92.2% | H Company | Multi-modal action kernels integrated via H Company research. |
| 7 | Browserable | 90.4% | Browserable | Fine-tuned browser control models within a commercial framework. |
| 8 | Browser Use | 89.1% | Browser Use | Multi-step orchestration framework for open-source automation. |
| 9 | GLM-5V-Turbo | 88.5% | Z.ai | Multimodal vision model optimized for GUI automation and coding tasks. |
| 10 | Agent Kura | 87.0% | Kura | 602/643 tasks (41 removed for invalid/auth issues); reported on trykura.com. |
| 10 | Operator | 87.0% | OpenAI | Native browser integration using proprietary vision-control models. |
| 12 | Skyvern 2.0 | 85.85% | Skyvern | DOM-level reasoning coupled with real-time error correction. |
| 13 | Project Mariner | 83.5% | Google | Gemini-powered reasoning with precise visual grounding. |
| 14 | Agent-E | 73.1% | Emergence AI | Hierarchical planning modules within a multi-agent framework. |
| 14 | Notte | 73.1% | Notte | Standardized operator stack for open-source performance evaluation. |
| 16 | WebSight | 68% | Academic Research | Navigation system prioritizing visual-only perceptual inputs. |
| 17 | Runner H 0.1 | 67% | H Company | Foundational agent architecture for general web interaction. |
| 18 | WebVoyager | 59.1% | Academic Research | Baseline implementation using standard multimodal LLM control. |
| 19 | Anthropic Computer Use 3.5 | 56.0% | Anthropic | Sampled 50/602 tasks for direct comparison; reported on trykura.com. |
| 20 | WILBUR | 53% | Academic Research | Research implementation using black-box optimization techniques. |
| 21 | GPT-4 (All Tools) | 30.8% | OpenAI | ChatGPT integrated-tools baseline from the original WebVoyager paper; reported on arxiv.org. |

Frequently asked questions

Which system is currently best on WebVoyager?
Jina currently leads with a tracked score of 98.9%. The ranking reflects submitted system setups (model plus tools and run policy), not just a base model, and is based on our latest tracked results, last updated Mar 22, 2026.
What should I read into a WebVoyager score?
WebVoyager scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and review the methodology section above before making procurement or architecture decisions.
Can I compare model-only and agent-with-tools rows directly?
Not directly. Mixed pages can combine model-centric and full-system submissions. Treat those comparisons as directional unless evaluation setup and tool policy are explicitly aligned.
What is the WebVoyager benchmark for AI browser agents?
WebVoyager is the standard benchmark for evaluating browser agents, introduced in the 2024 paper WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. It consists of 643 tasks across 15 websites including Google, Amazon, GitHub, Reddit, and Wikipedia. Tasks cover form filling, navigation, search, and shopping. GPT-4V evaluates each task by analyzing the final page state. Scores represent the percentage of tasks completed successfully.
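
To make that evaluation flow concrete, the sketch below mirrors the loop the answer describes: run the agent on each task, then ask a multimodal judge to grade the final page state. Both helpers are hypothetical stubs for illustration, not the official harness.

```python
# Hedged sketch of the evaluation loop described above; the two helper
# stubs are hypothetical placeholders, not the official WebVoyager harness.

def run_agent(website: str, instruction: str) -> bytes:
    """Placeholder: drive a browser agent on the task, return the final screenshot."""
    raise NotImplementedError("plug in your browser agent here")

def judge_success(instruction: str, final_screenshot: bytes) -> bool:
    """Placeholder: ask a multimodal judge (GPT-4V in the original paper)
    whether the final page state satisfies the task goal."""
    raise NotImplementedError("plug in your judge model here")

def evaluate(tasks: list[dict]) -> float:
    """Percentage of tasks completed, judged from each task's final state."""
    completed = sum(
        judge_success(t["instruction"], run_agent(t["website"], t["instruction"]))
        for t in tasks
    )
    return 100.0 * completed / len(tasks)
```
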
Can WebVoyager scores be compared across different agents?
Not always. Three factors affect comparability: dataset size (the full 643 tasks vs filtered subsets), evaluator (GPT-4V vs custom methods), and verification (third-party vs self-reported). Filtered subsets typically produce higher scores. Check each row's notes and source for methodology. The most reliable comparisons use the full dataset, GPT-4V evaluation, and third-party verification.
How is the WebVoyager score calculated?
Score = (tasks completed / total tasks) × 100. GPT-4V evaluates each task by analyzing the final page state to determine whether the goal was achieved: correct page reached, information displayed, forms filled accurately, and flows completed.
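For example, a hypothetical run completing 566 of the full 643 tasks scores 566 / 643 × 100 ≈ 88.0%, while the same 566 completions on a filtered 602-task subset would score about 94.0%, which is one reason filtered subsets read higher.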
What websites can AI browser agents navigate?
In principle, agents can navigate any public website. WebVoyager evaluates on 15 specific sites including Amazon, eBay, Google, Google Maps, Wikipedia, Reddit, Twitter/X, GitHub, ArXiv, and Booking.com. Real-world challenges include CAPTCHAs, bot detection, dynamic content, auth flows, and rate limiting. Production agents use infrastructure like Steel for anti-bot measures and proxy rotation.
Is a higher WebVoyager score always better for production use?
Not necessarily. WebVoyager measures task completion on a fixed website set under controlled conditions. Production depends on factors not captured by the benchmark — latency, cost per task, CAPTCHA handling, anti-bot resilience, and generalization to new websites. An agent optimized for benchmark scores may overfit. Use the leaderboard as a directional signal and test on your actual target websites.
Why is WebVoyager used instead of other benchmarks?
WebVoyager is the most widely adopted public benchmark for browser agents, enabling cross-agent comparison. Other benchmarks exist — Mind2Web (2000+ tasks), OSWorld (desktop interaction), WorkArena (enterprise apps) — but have seen less adoption. WebVoyager's real-world task design, consistent GPT-4V evaluation, and widespread usage make it the current standard.