Question 1

Which system is currently best on WebVoyager?

Accepted Answer

Jina (Om Labs) currently leads with a tracked score of 98.9%. The ranking reflects submitted system setups (model plus tools and policy), not base models alone. Based on our latest tracked results, last updated Mar 22, 2026.

Question 2

What should I read into a WebVoyager score?

Accepted Answer

WebVoyager scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.

Question 3

Can I compare model-only and agent-with-tools rows directly?

Accepted Answer

Not directly. Mixed pages can combine model-centric and full-system submissions. Treat those comparisons as directional unless evaluation setup and tool policy are explicitly aligned.

Question 4

What is the WebVoyager benchmark for AI browser agents?

Accepted Answer

WebVoyager is the standard benchmark for evaluating browser agents, introduced in the 2024 paper "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models". It consists of 643 tasks across 15 websites, including Google Search, Amazon, GitHub, ArXiv, and Booking.com. Tasks cover form filling, navigation, search, and shopping. GPT-4V evaluates each task by analyzing the final page state. Scores represent the percentage of tasks completed successfully.

Question 5

Can WebVoyager scores be compared across different agents?

Accepted Answer

Not always. Three factors affect comparability: dataset size (full 643 tasks vs filtered subsets), evaluator (GPT-4V vs custom methods), and verification (third-party vs self-reported). Filtered subsets typically produce higher scores because the removed tasks are often the ones that fail. See each leaderboard row's notes for methodology details. The most reliable comparisons use the full dataset, GPT-4V evaluation, and third-party verification.
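As a rough illustration of the subset effect, the sketch below rescales a subset score to the full 643-task set under the pessimistic assumption that every removed task would have failed. The function name `full_set_lower_bound` is hypothetical, and the result is only a lower bound under that assumption, not a reported score. The example numbers (87.0% on 602 tasks) are taken from the leaderboard notes.

```python
FULL_TASK_COUNT = 643  # full WebVoyager task count


def full_set_lower_bound(subset_score_pct: float, subset_size: int,
                         full_size: int = FULL_TASK_COUNT) -> float:
    """Rescale a subset score to the full set.

    Assumption (pessimistic): every task excluded from the subset
    would have counted as a failure on the full set.
    """
    if not 0 < subset_size <= full_size:
        raise ValueError("subset_size must be between 1 and full_size")
    if not 0.0 <= subset_score_pct <= 100.0:
        raise ValueError("subset_score_pct must be between 0 and 100")
    return subset_score_pct * subset_size / full_size


# 87.0% reported on a 602-task subset of the 643 tasks
print(round(full_set_lower_bound(87.0, 602), 2))  # 81.45
```

The actual full-set score could be higher than this bound if some of the removed tasks were solvable, which is why subset and full-set rows are best treated as directional comparisons.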

Question 6

How is the WebVoyager score calculated?

Accepted Answer

Score = (tasks completed / total tasks) × 100. GPT-4V evaluates each task by analyzing the final page state to determine whether the goal was achieved: the correct page was reached, the requested information is displayed, forms are filled accurately, and flows are completed.
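The formula can be sketched in a few lines of Python. The function name `webvoyager_score` and the example counts are hypothetical and chosen only to show the arithmetic; they are not results for any listed system.

```python
def webvoyager_score(tasks_completed: int, total_tasks: int) -> float:
    """Return the percentage of tasks completed successfully."""
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    if not 0 <= tasks_completed <= total_tasks:
        raise ValueError("tasks_completed must be between 0 and total_tasks")
    return tasks_completed / total_tasks * 100


# Hypothetical run: 500 of the 643 tasks judged successful
print(round(webvoyager_score(500, 643), 2))  # 77.76
```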

Question 7

What websites can AI browser agents navigate?

Accepted Answer

In principle, agents can navigate any website. WebVoyager evaluates on 15 specific sites, including Amazon, Apple, ArXiv, GitHub, Google Search, Google Maps, Google Flights, Booking.com, Coursera, and ESPN. Real-world challenges include CAPTCHAs, bot detection, dynamic content, authentication flows, and rate limiting. Production agents typically rely on infrastructure such as Steel for anti-bot measures and proxy rotation.

Question 8

Is a higher WebVoyager score always better for production use?

Accepted Answer

Not necessarily. WebVoyager measures task completion on a fixed set of websites under controlled conditions. Production performance depends on factors the benchmark does not capture: latency, cost per task, CAPTCHA handling, anti-bot resilience, and generalization to new websites. An agent optimized for benchmark scores may overfit to the benchmark's sites and tasks. Use the leaderboard as a directional signal and test on your actual target websites.

Question 9

Why is WebVoyager used instead of other benchmarks?

Accepted Answer

WebVoyager is the most widely adopted public benchmark for browser agents, which enables cross-agent comparison. Other benchmarks exist, such as Mind2Web (2,000+ tasks), OSWorld (desktop interaction), and WorkArena (enterprise apps), but they have seen less adoption for browser agents. WebVoyager's real-world task design, consistent GPT-4V evaluation, and widespread usage make it the current standard.

Rank	System / Submission	Score	Organization	Notes
1	Jina (new)	98.9%	Om Labs	Om Labs custom tracker; Jina multi-model system on self-hosted WebVoyager harness.
2	Alumnium (new)	98.6%	Alumnium	Accessibility-based tree parsing with integrated visual reasoning.
3	Surfer 2 (new)	97.1%	H Company	System-level orchestration with submitter-defined setup details.
4	Magnitude	93.9%	Magnitude	Open-source architecture utilizing a modular agentic stack.
5	AIME Browser-Use (new)	92.34%	Aime	Custom orchestration layer with specialized browser tooling.
6	Surfer-H + Holo1 (new)	92.2%	H Company	Multi-modal action kernels integrated via H-Company research.
7	Browserable (new)	90.4%	Browserable	Fine-tuned browser control models within a commercial framework.
8	Browser Use	89.1%	Browser Use	Multi-step orchestration framework for open-source automation.
9	GLM-5V-Turbo (new)	88.5%	Z.ai	Multimodal vision model optimized for GUI automation and coding tasks.
10	Agent Kura	87.0%	Kura	602/643 tasks (41 removed for invalid/auth issues); reported on trykura.com.
10	Operator	87.0%	OpenAI	Native browser integration using proprietary vision-control models.
12	Skyvern 2.0	85.85%	Skyvern	DOM-level reasoning coupled with real-time error-correction.
13	Project Mariner	83.5%	Google	Gemini-powered reasoning with precise visual grounding.
14	Agent-E	73.1%	Emergence AI	Hierarchical planning modules within a multi-agent framework.
14	Notte (new)	73.1%	Notte	Standardized operator stack for open-source performance evaluation.
16	WebSight	68%	Academic Research	Navigation system prioritizing visual-only perceptual inputs.
17	Runner H 0.1	67%	H Company	Foundational agent architecture for general web interaction.
18	WebVoyager	59.1%	Academic Research	Baseline implementation using standard multimodal LLM control.
19	Anthropic Computer Use 3.5	56.0%	Anthropic	Sampled 50/602 tasks for direct comparison; reported on trykura.com.
20	WILBUR	53%	Academic Research	Research implementation using black-box optimization techniques.
21	GPT-4 (All Tools)	30.8%	OpenAI	ChatGPT integrated tool baseline from original WebVoyager paper; reported on arxiv.org.
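
The Rank column appears to follow standard competition ranking: tied scores share a rank, and the next rank skips ahead (10, 10, 12 and 14, 14, 16). A minimal sketch of that scheme follows, assuming ties are decided on the score alone. The function name `competition_ranks` is hypothetical and this is not the tracker's actual code.

```python
def competition_ranks(scores: list[float]) -> list[tuple[int, float]]:
    """Assign 1-based competition ranks to scores, highest first.

    Tied scores share a rank; the following rank skips by the tie size.
    """
    ordered = sorted(scores, reverse=True)
    ranked: list[tuple[int, float]] = []
    for position, score in enumerate(ordered, start=1):
        if ranked and score == ranked[-1][1]:
            # Same score as the previous row: reuse its rank
            ranked.append((ranked[-1][0], score))
        else:
            ranked.append((position, score))
    return ranked


# Four consecutive scores from the table (88.5, 87.0, 87.0, 85.85)
print(competition_ranks([88.5, 87.0, 87.0, 85.85]))
# [(1, 88.5), (2, 87.0), (2, 87.0), (4, 85.85)]
```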
