Steel.dev®

Browser Agent Leaderboard

See how various AI browser agents stack up based on their accuracy in completing web-based tasks on the WebVoyager benchmark.
LEADERBOARD

RANK   AGENT              WEB VOYAGER          ORGANIZATION
1      Surfer 2           97.1% (SOTA)         H Company
2      Magnitude          93.9%
3      AIME Browser-Use   92.34%               Aime
4      Surfer-H           92.2% (10 attempts)  H Company
5      Browserable        90.4%
6      Browser Use        89.1%
7      Operator           87%                  OpenAI
8      Skyvern 2.0        85.85%
9      Project Mariner    83.5%                Google
10     Notte
11                                             Emergence AI
12                                             Academic Research
13                                             H Company
14                                             Academic Research
15                                             Academic Research

Steel.dev: Steel is an open-source browser API purpose-built for AI agents.

⚠ Methodology note: scores are not always directly comparable. Organizations use different variants of the WebVoyager dataset (the full 643-task suite vs. filtered subsets), different evaluators (GPT-4V judge vs. custom), and some results are self-reported rather than independently verified. Click any score to view its original source. If you spot an error or want to submit a result, open a PR on GitHub.

FAQ
What is the WebVoyager benchmark for AI browser agents? [+]
WebVoyager is the standard benchmark for evaluating browser agents, introduced in the 2024 paper WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. It consists of 643 tasks across 15 websites including Google, Amazon, GitHub, Reddit, and Wikipedia. Tasks cover form filling, navigation, search, and shopping. GPT-4V evaluates each task by analyzing the final page state. Scores represent the percentage of tasks completed successfully. As of February 2026, the highest score is 97.1% held by Surfer 2 from H Company.
Can WebVoyager scores be compared across different agents? [+]
Not always. Three factors affect comparability: dataset size (full 643 tasks vs filtered subsets), evaluator (GPT-4V vs custom methods), and verification (third-party vs self-reported). Filtered subsets typically produce higher scores. Click any leaderboard row to see methodology. The most reliable comparisons use full dataset, GPT-4V evaluation, and third-party verification.
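The three comparability checks above can be expressed as a simple filter. A minimal sketch, assuming a hypothetical `BenchmarkResult` record (the type, fields, and example entries are illustrative, not the leaderboard's actual data model):

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    agent: str
    score: float        # WebVoyager task-completion percentage
    full_dataset: bool  # full 643-task suite, not a filtered subset
    gpt4v_judge: bool   # standard GPT-4V evaluator, not a custom method
    third_party: bool   # independently verified, not self-reported

def directly_comparable(results: list[BenchmarkResult]) -> list[BenchmarkResult]:
    """Keep only results measured under the strictest common methodology,
    sorted best-first."""
    return sorted(
        (r for r in results if r.full_dataset and r.gpt4v_judge and r.third_party),
        key=lambda r: r.score,
        reverse=True,
    )

# Illustrative entries only -- not actual leaderboard data.
results = [
    BenchmarkResult("Agent A", 95.0, full_dataset=False, gpt4v_judge=True, third_party=False),
    BenchmarkResult("Agent B", 91.0, full_dataset=True, gpt4v_judge=True, third_party=True),
    BenchmarkResult("Agent C", 89.0, full_dataset=True, gpt4v_judge=True, third_party=True),
]
print([r.agent for r in directly_comparable(results)])  # ['Agent B', 'Agent C']
```

Agent A's higher headline number drops out because it was measured on a filtered subset and self-reported, which is exactly why subset scores should not be ranked against full-suite runs.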
What is Steel.dev? [+]
Steel is browser infrastructure for AI agents - cloud browser sessions controlled through code. Each session works like a fresh incognito window running in the cloud. Steel provides anti-bot capabilities including CAPTCHA solving, proxy rotation, and fingerprint management, plus observability features like live viewing and session replay. Steel offers a REST API, Python SDK, and Node SDK for web scraping, form automation, and research agents. Learn more at steel.dev, the beginner's guide, or the docs.
What is the best AI browser agent in 2026? [+]
By WebVoyager score: Surfer 2 at 97.1% (H Company), Magnitude at 93.9%, AIME Browser-Use at 92.34%, Browserable at 90.4%, Browser Use at 89.1%, OpenAI Operator at 87%, Skyvern 2.0 at 85.85%, and Google Project Mariner at 83.5%. Surfer-H (92.2%) is a multi-attempt benchmark (10 attempts), so it is not directly comparable to single-run percentages. Surfer 2 leads accuracy. Browser Use and Skyvern are strong open-source options. Rankings update as new results are submitted.
What is the difference between OpenAI Operator, Browser Use, and other AI browser agents? [+]
The main agents differ in accuracy and positioning. OpenAI Operator (87%, GPT-4o) is a consumer product in ChatGPT. Browser Use (89.1%) is open-source and supports multiple models. Surfer 2 (97.1%) leads with a proprietary enterprise model. Skyvern 2.0 (85.85%) is open-source with strong visual reasoning. Google Mariner (83.5%, Gemini) integrates with Chrome. For custom agents, Steel provides the browser infrastructure layer.
How do AI browser agents work? [+]
Browser agents combine LLMs with browser automation to complete web tasks. A vision model sees the webpage via screenshots or DOM. A reasoning model decides actions like clicking, typing, or scrolling. An execution layer drives the browser via Chrome DevTools Protocol or Playwright. A memory component tracks state across steps. Most agents run on cloud infrastructure like Steel for reliability and anti-bot handling.
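The perceive-decide-act cycle described above can be sketched as a loop. Everything here is a simplified illustration under stated assumptions - the `decide` stub stands in for the reasoning model, the observation strings stand in for screenshots/DOM, and the action names are not any specific framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str       # "click", "type", "scroll", or "done"
    target: str = ""  # element selector or description
    text: str = ""    # text to type, if any

@dataclass
class AgentState:
    goal: str
    history: list = field(default_factory=list)  # memory across steps

def decide(state: AgentState, observation: str) -> Action:
    """Stand-in for the reasoning model. A real agent would send the
    screenshot or DOM plus history to a vision-capable LLM."""
    if "search box" in observation and not state.history:
        return Action("type", target="input[name=q]", text=state.goal)
    return Action("done")

def run(goal: str, observations: list[str]) -> list[Action]:
    state = AgentState(goal=goal)
    actions = []
    for obs in observations:        # an execution layer would fetch each
        action = decide(state, obs)  # observation from the live browser
        actions.append(action)
        state.history.append((obs, action))  # memory component
        if action.kind == "done":
            break
    return actions

steps = run("best hiking boots", ["page with a search box", "results page"])
print([a.kind for a in steps])  # ['type', 'done']
```

In a production agent the execution layer would translate each `Action` into Chrome DevTools Protocol or Playwright calls against a real browser session.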
What websites can AI browser agents navigate? [+]
Agents can, in principle, navigate any public website. WebVoyager evaluates on 15 sites, including Amazon, eBay, Google, Google Maps, Wikipedia, Reddit, Twitter/X, GitHub, ArXiv, and Booking.com. Real-world challenges include CAPTCHAs, bot detection, dynamic content, auth flows, and rate limiting. Production agents use infrastructure like Steel for anti-bot measures and proxy rotation.
How is the WebVoyager score calculated? [+]
Score = (tasks completed / total tasks) x 100. For example, an agent completing 624 of the full 643 tasks scores roughly 97%. GPT-4V evaluates each task by analyzing the final page state to determine whether the goal was achieved - correct page reached, information displayed, forms filled accurately, and flows completed.
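The scoring formula as a one-line function:

```python
def webvoyager_score(completed: int, total: int = 643) -> float:
    """Percentage of benchmark tasks completed successfully,
    rounded to one decimal place."""
    return round(completed / total * 100, 1)

print(webvoyager_score(624))  # 97.0
print(webvoyager_score(643))  # 100.0
```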
What does SOTA mean? [+]
SOTA stands for State of the Art - the highest-performing result on a benchmark. On this leaderboard, the SOTA badge is awarded to the agent with the highest WebVoyager score and transfers automatically when a new high score is submitted. As of February 2026, the SOTA holder is Surfer 2 by H Company at 97.1%.
Are OpenAI Operator and Google Project Mariner on this leaderboard? [+]
Yes. OpenAI Operator scores 87% (ranked 6th) and Google Project Mariner scores 83.5% (ranked 8th). Both are consumer products integrated into their ecosystems - Operator via ChatGPT, Mariner via Chrome. They score lower than specialized agents like Surfer 2 (97.1%) because they prioritize broad capability over benchmark optimization.
How do I build my own AI browser agent? [+]
Three layers are needed. Browser infrastructure: Steel provides managed sessions, proxies, anti-bot handling, and replay. AI layer: a vision-capable model like GPT-4o, Claude, or Gemini with prompting for action selection. Orchestration: frameworks like Browser Use or Skyvern handle clicking, typing, and state tracking. See the production agents guide. Once your agent has a verifiable WebVoyager score, open a pull request on GitHub.
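The three layers compose naturally behind two interfaces. A minimal sketch - the `Browser` and `Model` protocols, method names, and in-memory stubs are illustrative assumptions, not Steel's or any framework's actual API:

```python
from typing import Protocol

class Browser(Protocol):
    """Infrastructure layer: a browser session (e.g. a managed cloud session)."""
    def observe(self) -> str: ...
    def execute(self, command: str) -> None: ...

class Model(Protocol):
    """AI layer: a vision-capable LLM that picks the next command."""
    def next_command(self, goal: str, observation: str) -> str: ...

def orchestrate(goal: str, browser: Browser, model: Model, max_steps: int = 20) -> int:
    """Orchestration layer: drive the browser until the model says 'done'.
    Returns the number of steps taken."""
    for step in range(1, max_steps + 1):
        command = model.next_command(goal, browser.observe())
        if command == "done":
            return step
        browser.execute(command)
    return max_steps

# In-memory stubs standing in for a real session and a real LLM.
class FakeBrowser:
    def __init__(self): self.log = []
    def observe(self): return f"page after {len(self.log)} commands"
    def execute(self, command): self.log.append(command)

class FakeModel:
    def next_command(self, goal, observation):
        return "click #submit" if "0 commands" in observation else "done"

steps = orchestrate("submit the form", FakeBrowser(), FakeModel())
print(steps)  # 2
```

Swapping the stubs for a real session driver and an LLM client changes only the two implementations; the orchestration loop stays the same, which is why frameworks like Browser Use can support multiple models.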
Is a higher WebVoyager score always better for production use? [+]
Not necessarily. WebVoyager measures task completion on a fixed website set under controlled conditions. Production depends on factors not captured by the benchmark - latency, cost per task, CAPTCHA handling, anti-bot resilience, and generalization to new websites. An agent optimized for benchmark scores may overfit. Use the leaderboard as a directional signal and test on your actual target websites.
Why is WebVoyager used instead of other benchmarks? [+]
WebVoyager is the most widely adopted public benchmark for browser agents, enabling cross-agent comparison. Other benchmarks exist - Mind2Web (2000+ tasks), OSWorld (desktop interaction), WorkArena (enterprise apps) - but have seen less adoption. WebVoyager's real-world task design, consistent GPT-4V evaluation, and widespread usage make it the current standard.
What is Browser Use's WebVoyager benchmark score? [+]
Browser Use scores 89.1% on WebVoyager, ranking 5th overall. It's an open-source framework supporting GPT-4, Claude, and other LLMs via API. Many teams pair Browser Use with Steel for production infrastructure and anti-bot handling.
What is OpenAI Operator's WebVoyager benchmark score? [+]
OpenAI Operator scores 87% on WebVoyager, ranking 6th overall. Built into ChatGPT Pro, it uses GPT-4o for vision and reasoning. Operator requires no setup and handles web tasks like booking reservations and filling forms.
What is Skyvern's WebVoyager benchmark score? [+]
Skyvern 2.0 scores 85.85% on WebVoyager, ranking 7th overall. It's an open-source agent emphasizing visual understanding for complex layouts. Skyvern works with any LLM backend and integrates with Steel for production infrastructure.
What is Google Project Mariner's WebVoyager benchmark score? [+]
Google Project Mariner scores 83.5% on WebVoyager, ranking 8th overall. Built on Gemini and integrated with Chrome, it handles web navigation and form filling. Currently in limited availability.
What is Surfer 2's WebVoyager benchmark score? [+]
Surfer 2 by H Company holds the current SOTA at 97.1% on WebVoyager for pass@1 (as of February 2026), while the same benchmark also reports 100% at pass@10. The gap to the next agent at 93.9% makes it the clear leader for accuracy-focused use cases.
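Under the simplifying assumption that attempts are independent with a fixed per-task success rate p, the expected pass@k is 1 - (1 - p)^k. This idealized model (not how H Company computed their figure) shows why a 97.1% pass@1 can plausibly become ~100% at pass@10:

```python
def expected_pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds,
    assuming each attempt succeeds with probability p."""
    return 1 - (1 - p) ** k

print(round(expected_pass_at_k(0.971, 1), 3))  # 0.971
print(expected_pass_at_k(0.971, 10) > 0.9999)  # True
```

Real tasks are not independent across attempts (some are simply too hard for a given agent), so observed pass@10 is usually below this idealized bound - which makes the reported 100% pass@10 a strong result.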
How often is the leaderboard updated? [+]
The leaderboard updates as new benchmark results are published, typically weekly. If you know of a missing agent or score, pull requests and issues are welcome on GitHub.
How do I add my agent to the leaderboard? [+]
Open a pull request on GitHub with your entry. You need a publicly verifiable WebVoyager score, a link to the source (paper or blog post), and a homepage or GitHub repo for your agent.