Browser Agent Leaderboards
Track the best-performing AI agents and models across benchmarks for browser agents, computer use, research, and coding. Each leaderboard page includes methodology notes, scope labels, and source-linked benchmark results.
WebVoyager
WebVoyager evaluates end-to-end browser agents on 643 tasks across 15 popular real-world websites. Tasks cover search, navigation, form filling, map and travel lookup, shopping, and information retrieval on live pages rather than static snapshots.
| Model | Score |
|---|---|
| Alumnium | 98.5% |
| Surfer 2 | 97.1% |
| Magnitude | 93.9% |
Current leader: Alumnium
BrowseComp
BrowseComp is OpenAI's benchmark for difficult agentic web research: 1,266 short-answer questions where the answer is easy to verify once found but hard to locate without persistent browsing.
| Model | Score |
|---|---|
| GPT-5.5 Pro | 90.1% |
| GPT-5.4 Pro | 89.3% |
| MiroThinker-H1 | 88.2% |
Current leader: GPT-5.5 Pro
WebArena
WebArena evaluates browser agents in reproducible, self-hosted websites instead of the open live web. Its 812 tasks cover e-commerce, forum discussion, collaborative software development, content management, maps, and reference lookup.
| Model | Score |
|---|---|
| WebTactix (DeepSeek v3.2) | 74.3% |
| OpAgent | 71.6% |
| ColorBrowserAgent | 71.2% |
Current leader: WebTactix (DeepSeek v3.2)
SWE-bench Verified
SWE-bench Verified is the 500-instance human-reviewed split of SWE-bench, built from real GitHub issues in popular Python repositories. Agents receive an issue and repository state, then generate a patch.
| Model | Score |
|---|---|
| Claude Mythos | 93.9% |
| Claude Opus 4.7 | 87.6% |
| Claude Opus 4.5 | 80.9% |
Current leader: Claude Mythos
OSWorld
OSWorld evaluates multimodal computer-use agents in real desktop environments across 369 tasks involving web apps, desktop software, files, and workflows spanning multiple applications.
| Model | Score |
|---|---|
| Mythos Preview | 79.6% |
| OSAgent | 76.26% |
| GPT-5.4 | 75.0% |
Current leader: Mythos Preview
GAIA
GAIA evaluates general AI assistants on 466 real-world questions requiring reasoning, web browsing, multimodal understanding, file handling, and tool use.
| Model | Score |
|---|---|
| OPS-Agentic-Search | 92.36% |
| openJiuwen-deepagent | 92.36% |
| openJiuwen-deepagent (GPT5/Gemini) | 91.69% |
Current leader: OPS-Agentic-Search (tied with openJiuwen-deepagent at 92.36%)
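The top two GAIA entries in the table above share the same score, so a "current leader" lookup arguably needs to be tie-aware. A minimal sketch of one way to do that, using only the three rows listed here; the `leaders` helper and the `rows` structure are hypothetical names for illustration, not part of any leaderboard tooling:

```python
# Rows copied from the GAIA table above: (entry name, score in percent).
rows = [
    ("OPS-Agentic-Search", 92.36),
    ("openJiuwen-deepagent", 92.36),
    ("openJiuwen-deepagent (GPT5/Gemini)", 91.69),
]

def leaders(rows):
    """Return every entry that shares the top score, in table order."""
    top = max(score for _, score in rows)
    return [name for name, score in rows if score == top]

gaia_leaders = leaders(rows)
print(gaia_leaders)
```

Returning a list rather than a single name keeps tied entries visible instead of silently favoring whichever row happens to be listed first.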
ClawBench
ClawBench evaluates browser agents on 153 everyday online tasks across 144 live platforms in 15 categories, including purchases, appointments, job applications, and detailed forms.
| Model | Score |
|---|---|
| Claude Sonnet 4.6 | 33.3% |
| GLM-5 | 24.2% |
| Gemini 3 Flash | 19.0% |
Current leader: Claude Sonnet 4.6
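To put these percentages in concrete terms, the sketch below converts each score into an approximate number of completed tasks. It assumes the score is a plain success rate over all 153 tasks, which this page does not state explicitly; the function name `approx_tasks_solved` is hypothetical.

```python
TOTAL_TASKS = 153  # task count from the ClawBench description above

# Scores copied from the ClawBench table above, in percent.
scores = {
    "Claude Sonnet 4.6": 33.3,
    "GLM-5": 24.2,
    "Gemini 3 Flash": 19.0,
}

def approx_tasks_solved(score_percent: float, total_tasks: int) -> int:
    """Estimate solved tasks, ASSUMING score = solved / total (not confirmed by the source)."""
    return round(score_percent / 100 * total_tasks)

solved = {model: approx_tasks_solved(s, TOTAL_TASKS) for model, s in scores.items()}
for model, n in solved.items():
    print(f"{model}: ~{n} of {TOTAL_TASKS} tasks")
```

Under that assumption, the leading entry completes roughly 51 of 153 tasks, and the gap to second place is about 14 tasks.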
Online-Mind2Web
Online-Mind2Web turns the static Mind2Web idea into a live benchmark of 300 tasks across 136 websites, covering shopping, finance, travel, government, and other consumer workflows.
| Model | Score |
|---|---|
| Browser Use Cloud (bu-max) | 97.0% |
| GPT-5.4 Native Computer Use | 93.0% |
| ABP + Claude Opus 4.6 | 90.53% |
Current leader: Browser Use Cloud (bu-max)
τ-bench
τ-bench evaluates conversational agents in realistic customer-service tasks where the agent must talk to a simulated user, call domain APIs, and follow a policy manual.
| Model | Score |
|---|---|
| Step-3.5-Flash | 88.2% |
| GLM-4.7 | 87.4% |
| MiMo-V2-Flash | 80.3% |
Current leader: Step-3.5-Flash
AgentBench
AgentBench evaluates LLMs as agents across 8 interactive environments: operating-system tasks, database querying, knowledge graphs, games, lateral-thinking puzzles, ALFWorld, WebShop, and Mind2Web-style browsing.
| Model | Score |
|---|---|
| AgentRL w/ Qwen2.5-32B-Instruct | 70.4% |
| AgentRL w/ Qwen2.5-14B-Instruct | 67.7% |
| AgentRL w/ GLM-4-9B-0414 | 65.0% |
Current leader: AgentRL w/ Qwen2.5-32B-Instruct