AI Agent Benchmark Results — All Leaderboards

Browse 281 sourced results across 12 AI agent benchmarks — WebVoyager, WebArena, OSWorld, SWE-bench Verified, GAIA, BrowseComp, and more. Every row links to the original paper, model card, repository, or launch post. Filter by category, benchmark, or search by agent or organization name.

Need methodology, evaluator details, and example tasks before comparing scores? Open the individual benchmark hub page for any benchmark below.

Scores on this page are not directly comparable across benchmarks.

This index puts every result in one place for browsing — it does not normalize methodology. Each benchmark uses its own task set, evaluator, scoring metric, and scope (model vs. agent). A 90% on one benchmark does not mean the same thing as a 90% on another. Compare scores within a single benchmark using the filters below, and read each benchmark's methodology notes on its dedicated page before drawing conclusions.
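As a rough illustration of comparing within a single benchmark, the sketch below groups a handful of rows copied from the table by benchmark and ranks them by score. The function name `rank_within_benchmarks` and the tuple layout are hypothetical. The tie rule (tied scores share a rank and the next rank is skipped) is inferred from the ranks shown in the table, such as #1, #1, #3 on HealthAdminBench, and is not stated by the page itself.

```python
from collections import defaultdict

# A few (benchmark, system, score) rows copied from the table on this page.
SAMPLE_ROWS = [
    ("OSWorld", "Mythos Preview", 79.6),
    ("HealthAdminBench", "Claude Sonnet 4.6 (browser-use)", 45.2),
    ("HealthAdminBench", "Claude Mythos 5 (browser-use)", 51.9),
    ("OSWorld", "Claude Opus 4.8", 83.4),
    ("HealthAdminBench", "Claude Opus 4.8 (browser-use)", 51.9),
    ("HealthAdminBench", "Claude Mythos Preview (browser-use)", 47.4),
]


def rank_within_benchmarks(rows):
    """Return {benchmark: [(rank, system, score), ...]}, best score first.

    Assumption: tied scores share a rank and the following rank is skipped
    (1, 1, 3, ...), which matches the ranks visible in the table.
    """
    by_benchmark = defaultdict(list)
    for benchmark, system, score in rows:
        by_benchmark[benchmark].append((system, score))

    ranked = {}
    for benchmark, entries in by_benchmark.items():
        # Higher score is better; the stable sort keeps input order for ties.
        entries.sort(key=lambda entry: entry[1], reverse=True)
        result = []
        for position, (system, score) in enumerate(entries, start=1):
            if result and result[-1][2] == score:
                rank = result[-1][0]  # same score as the previous row: share its rank
            else:
                rank = position
            result.append((rank, system, score))
        ranked[benchmark] = result
    return ranked


if __name__ == "__main__":
    for benchmark, entries in rank_within_benchmarks(SAMPLE_ROWS).items():
        for rank, system, score in entries:
            print(f"{benchmark} #{rank} {system} {score}%")
```

Scores are never compared across benchmarks here: each benchmark gets its own ranking, and the ranks cover only the sample rows, not the full leaderboard.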

Results

Sorted by reported date, newest first
Benchmark Rank
System / Submission
Score Organization Reported Source
HealthAdminBench #1
Claude Mythos 5 (browser-use)
51.9% Anthropic 2026-06 Source
HealthAdminBench #1
Claude Opus 4.8 (browser-use)
51.9% Anthropic 2026-06 Source
HealthAdminBench #3
Claude Mythos Preview (browser-use)
47.4% Anthropic 2026-06 Source
HealthAdminBench #4
Claude Sonnet 4.6 (browser-use)
45.2% Anthropic 2026-06 Source
OSWorld #1
Claude Opus 4.8
83.4% Anthropic 2026-05-28 Source
SWE-bench Verified #2
Claude Opus 4.8
88.6% Anthropic 2026-05-28 Source
BrowseComp #8
Claude Opus 4.8
84.3% Anthropic 2026-05-28 Source
BrowseComp #29
LongSeeker
61.5% Academic Research 2026-05-06 Source
BrowseComp #49
OpenSeeker-v2
46.0% PolarSeeker 2026-05-05 Source
SWE-bench Verified #6
DeepSeek-V4-Pro-Max
80.6% DeepSeek 2026-04-24 Source
BrowseComp #10
DeepSeek-V4-Pro-Max
83.4% DeepSeek 2026-04-24 Source
SWE-bench Verified #12
DeepSeek-V4-Flash-Max
79.0% DeepSeek 2026-04-24 Source
BrowseComp #22
DeepSeek-V4-Flash-Max
73.2% DeepSeek 2026-04-24 Source
BrowseComp #38
Parallel Basic + GPT-5.4 harness
53.0% Parallel 2026-04-21 Source
BrowseComp #43
Parallel Advanced + GPT-5.4 harness
51.0% Parallel 2026-04-21 Source
BrowseComp #48
MiroThinker v1.0-72B
47.1% MiroMind 2026-04-21 Source
BrowseComp #57
Tavily + GPT-5.4 harness
42.0% Tavily 2026-04-21 Source
BrowseComp #59
Exa + GPT-5.4 harness
40.0% Exa 2026-04-21 Source
BrowseComp #5
Kimi K2.6
86.3% Moonshot AI 2026-04-20 Source
SWE-bench Verified #8
Kimi K2.6
80.2% Moonshot AI 2026-04-20 Source
SWE-bench Verified #3
Claude Opus 4.7
87.6% Anthropic 2026-04-16 Source
BrowseComp #12
Claude Opus 4.7
79.3% Anthropic 2026-04-16 Source
SWE-bench Verified #1
Claude Mythos
93.9% Anthropic 2026-04-09 Source
ClawBench #1
Claude Sonnet 4.6
33.3% Anthropic 2026-04-09 Source
ClawBench #2
GLM-5
24.2% Z.ai 2026-04-09 Source
ClawBench #3
Gemini 3 Flash
19.0% Google 2026-04-09 Source
ClawBench #4
Claude Haiku 4.5
18.3% Anthropic 2026-04-09 Source
ClawBench #5
GPT-5.4
6.5% OpenAI 2026-04-09 Source
ClawBench #6
Gemini 3.1 Flash Lite
3.3% Google 2026-04-09 Source
ClawBench #7
Kimi K2.5
0.7% Moonshot AI 2026-04-09 Source
OSWorld #2
Mythos Preview
79.6% Anthropic 2026-04-07 Source
BrowseComp #4
Claude Mythos Preview
86.9% Anthropic 2026-04-07 Source
BrowseComp #9
Claude Opus 4.6
83.7% Anthropic 2026-04-07 Source
SWE-bench Verified #13
Qwen3.6 Plus
78.8% Alibaba Cloud / Qwen Team 2026-04-02 Source
BrowseComp #1
GPT-5.5 Pro
90.1% OpenAI 2026-04 Source
BrowseComp #2
GPT-5.4 Pro
89.3% OpenAI 2026-04 Source
HealthAdminBench #5
Claude Opus 4.6 CUA
36.3% Anthropic 2026-04 Source
HealthAdminBench #6
GPT-5.4 CUA
26.7% OpenAI 2026-04 Source
WebVoyager #7
GLM-5V-Turbo
88.5% Z.ai 2026-04 Source
BrowseComp #7
GPT-5.5
84.4% OpenAI 2026-04 Source
HealthAdminBench #7
Kimi K2.5
15.6% Moonshot AI 2026-04 Source
HealthAdminBench #8
Claude Opus 4.6
14.8% Anthropic 2026-04 Source
HealthAdminBench #9
Qwen 3.5
13.3% Alibaba 2026-04 Source
OSWorld #10
GLM-5V-Turbo
62.3% Zhipu AI 2026-04-01 Source
HealthAdminBench #10
Gemini 3.1 Pro
11.9% Google 2026-04 Source
HealthAdminBench #11
GPT-5.4
5.9% OpenAI 2026-04 Source
BrowseComp #12
GLM-5.1
79.3% Zhipu AI 2026-04 Source
WebArena #16
WebUncertainty + GPT-4-Turbo
46.9% Academic Research 2026-04 Source
WebArena #18
A3-Qwen3.5-9B
42.1% McGill NLP 2026-04 Source
BrowseComp #19
MiniMax M2.5
76.3% MiniMax 2026-04 Source
Online-Mind2Web #1
Browser Use Cloud (bu-max)
97.0% Browser-Use 2026-03-25 Source
Online-Mind2Web #10
Stagehand (Gemini 2.5 CU)
65.0% Browserbase 2026-03-25 Source
Online-Mind2Web #15
Stagehand (Sonnet 4.5)
55.0% Browserbase 2026-03-25 Source
SWE-bench Verified #14
MiMo-V2-Pro
78.0% Xiaomi 2026-03-18 Source
BrowseComp #3
MiroThinker-H1
88.2% MiroMind 2026-03-16 Source
BrowseComp #66
OpenSeeker
29.5% PolarSeeker 2026-03-16 Source
GAIA #1
OPS-Agentic-Search
92.36% Alibaba Cloud 2026-03-11 Source
GAIA #1
openJiuwen-deepagent
92.36% Suzhou AI Lab / Shuqian Tech 2026-03-11 Source
BrowseComp #63
Nemotron 3 Super (120B A12B)
31.3% NVIDIA 2026-03-11 Source
BrowseComp #46
Sarvam-105B
49.5% Sarvam AI 2026-03-06 Source
BrowseComp #61
Sarvam-30B
35.5% Sarvam AI 2026-03-06 Source
Online-Mind2Web #2
GPT-5.4 Native Computer Use
93.0% OpenAI 2026-03-05 Source
OSWorld #4
GPT-5.4
75.0% OpenAI 2026-03-05 Source
Online-Mind2Web #8
ChatGPT Atlas Agent Mode
71.0% OpenAI 2026-03-05 Source
BrowseComp #11
GPT-5.4
82.7% OpenAI 2026-03-05 Source
BrowseComp #17
GPT-5.3-Codex
77.3% OpenAI 2026-03-05 Source
Online-Mind2Web #3
ABP + Claude Opus 4.6
90.53% theredsix 2026-03-03 Source
WebVoyager #1
Alumnium
98.5% Alumnium 2026-03 Source
BrowseComp #47
SMTL
48.6% Academic Research 2026-02-27 Source
BrowseComp #6
Gemini 3.1 Pro
85.9% Google 2026-02-19 Source
SWE-bench Verified #6
Gemini 3.1 Pro
80.6% Google DeepMind 2026-02-19 Source
BrowseComp #33
Gemini 3 Pro
59.2% Google 2026-02-19 Source
OSWorld #6
Claude Sonnet 4.6
72.5% Anthropic 2026-02-17 Source
SWE-bench Verified #11
Claude Sonnet 4.6
79.6% Anthropic 2026-02-17 Source
BrowseComp #21
Claude Sonnet 4.6
74.7% Anthropic 2026-02-17 Source
τ-bench #8
Qwen3.5-397B-A17B
68.4% Alibaba 2026-02-16 Source
BrowseComp #17
Seed 2.0 Pro
77.3% ByteDance 2026-02-15 Source
Online-Mind2Web #4
TinyFish
90.0% TinyFish AI 2026-02-12 Source
SWE-bench Verified #8
MiniMax M2.5
80.2% MiniMax 2026-02-12 Source
τ-bench #11
GLM-5
63.2% Zhipu AI 2026-02-11 Source
SWE-bench Verified #16
GLM-5
77.8% Zhipu AI 2026-02-11 Source
GAIA #3
openJiuwen-deepagent (GPT5/Gemini)
91.69% openJiuwen 2026-02-09 Source
GAIA #4
Lemon Agent
91.36% Lenovo CTO Org 2026-02-06 Source
SWE-bench Verified #5
Claude Opus 4.6
80.8% Anthropic 2026-02-05 Source
OSWorld #5
Claude Opus 4.6
72.7% Anthropic 2026-02-05 Source
WebArena #1
WebTactix (DeepSeek v3.2)
74.3% WebTactix 2026-02 Source
WebArena #8
Kimi K2.5
58.9% Moonshot AI 2026-02 Source
WebArena #11
Plan-MCTS + GPT-5-mini
55.3% Academic Research 2026-02 Source
BrowseComp #14
Qwen3.5-397B-A17B
78.6% Alibaba Cloud / Qwen Team 2026-02 Source
BrowseComp #15
Kimi K2.5
78.4% Moonshot AI 2026-02 Source
BrowseComp #20
GLM-5
75.9% Zhipu AI 2026-02 Source
BrowseComp #24
Step-3.5-Flash
69.0% StepFun 2026-02 Source
BrowseComp #27
Qwen3.5-122B-A10B
63.8% Alibaba Cloud / Qwen Team 2026-02 Source
BrowseComp #30
Qwen3.5-27B
61.0% Alibaba Cloud / Qwen Team 2026-02 Source
BrowseComp #30
Qwen3.5-35B-A3B
61.0% Alibaba Cloud / Qwen Team 2026-02 Source
BrowseComp #56
GLM-4.7-Flash
42.8% Zhipu AI 2026-02 Source
τ-bench #1
Step-3.5-Flash
88.2% StepFun 2026-01-29 Source
OSWorld #9
Kimi K2.5
63.3% Moonshot AI 2026-01-27 Source
GAIA #9
ShawnAgent v3.1
89.37% Independent 2026-01-16 Source
GAIA #5
JoinAI V2.2
90.7% JoinAI-CMCC 2026-01-14 Source
GAIA #11
ShawnAgent v3.0
89.04% Independent 2026-01-14 Source
OSWorld #12
UiPath Screen Agent
53.6% UiPath 2026-01-14 Source
GAIA #7
JoinAI V2.1
90.03% JoinAI-CMCC 2026-01-13 Source
BrowseComp #49
WebAnchor-30B
46.0% Academic Research 2026-01-07 Source
τ-bench #3
MiMo-V2-Flash
80.3% Xiaomi 2026-01-06 Source
GAIA #6
Nemotron-ToolOrchestra
90.37% NVIDIA 2026-01-06 Source
WebArena #2
OpAgent
71.6% CodeFuse AI 2026-01 Source
BrowseComp #23
LongCat-Flash-Thinking-2601
73.1% Meituan 2026-01 Source
BrowseComp #34
MiMo-V2-Flash
58.3% Xiaomi 2026-01 Source
GAIA #11
JoinAI V2
89.04% JoinAI-CMCC 2025-12-28 Source
GAIA #7
SU Zero (Shuqian Pro)
90.03% Shuqian Tech 2025-12-23 Source
τ-bench #2
GLM-4.7
87.4% Z.ai 2025-12-22 Source
τ-bench #4
GLM-4.7-Flash
79.5% Z.ai 2025-12-22 Source
GAIA #9
HALO V1217-1
89.37% Microsoft AI Asia 2025-12-17 Source
τ-bench #9
Gemini 3 Flash
67.8% Google DeepMind 2025-12-17 Source
GAIA #11
HALO V1217
89.04% Microsoft AI Asia 2025-12-17 Source
SWE-bench Verified #14
Gemini 3 Flash
78.0% Google DeepMind 2025-12-17 Source
τ-bench #7
GPT-5.2
69.9% OpenAI 2025-12-11 Source
SWE-bench Verified #10
GPT-5.2
80.0% OpenAI 2025-12-11 Source
GAIA #11
Su Zero + SQ Pro
89.04% Suzhou AI Lab / Shuqian Tech 2025-12-11 Source
BrowseComp #16
GPT-5.2 Pro
77.9% OpenAI 2025-12-11 Source
BrowseComp #26
GPT-5.2
65.8% OpenAI 2025-12-11 Source
GAIA #16
Su Zero + Shuqian Lite
87.38% Suzhou AI Lab / Shuqian Tech 2025-12-07 Source
GAIA #15
Lemon Agent v1.0.8
88.37% Lenovo CTO Org 2025-12-04 Source
WebArena #3
ColorBrowserAgent
71.2% MadeAgents 2025-12 Source
Online-Mind2Web #6
OpenAGI Lux
83.6% OpenAGI Foundation 2025-12-01 Source
WebArena #12
WebOperator + GPT-4o
54.6% KAIST KAG NLP 2025-12 Source
BrowseComp #25
GLM-4.7
67.5% Zhipu AI 2025-12 Source
BrowseComp #28
MiniMax M2.1
62.0% MiniMax 2025-12 Source
BrowseComp #41
DeepSeek-V3.2
51.4% DeepSeek 2025-12-01 Source
BrowseComp #41
DeepSeek-V3.2 (Thinking)
51.4% DeepSeek 2025-12-01 Source
SWE-bench Verified #4
Claude Opus 4.5
80.9% Anthropic 2025-11-24 Source
τ-bench #6
Claude Opus 4.5
70.2% Anthropic 2025-11-24 Source
OSWorld #8
Claude Opus 4.5
66.3% Anthropic 2025-11-24 Source
Online-Mind2Web #7
Navigator
78.7% Yutori 2025-11-19 Source
Online-Mind2Web #12
Claude 4.0
61.0% Anthropic 2025-11-19 Source
Online-Mind2Web #15
Claude 4.5
55.0% Anthropic 2025-11-19 Source
τ-bench #10
Gemini 3 Pro
65.8% Google DeepMind 2025-11-18 Source
BrowseComp #32
Kimi K2-Thinking-0905
60.2% Moonshot AI 2025-11 Source
τ-bench #5
MiniMax M2
77.2% MiniMax 2025-10-27 Source
WebArena #4
Claude Code + GBOX MCP
68.0% GBOX AI 2025-10-25 Source
OSWorld #3
OSAgent
76.26% TheAGI Company 2025-10-23 Source
OSWorld #13
Claude Haiku 4.5
50.7% Anthropic 2025-10-15 Source
GAIA #17
Co-Sight v2.1.0
87.04% ZTE-AICloud 2025-10-13 Source
GAIA #18
JoinAI V1.1
86.71% JoinAI 2025-10-09 Source
BrowseComp #62
DeepMiner-32B
33.5% Academic Research 2025-10-09 Source
Online-Mind2Web #9
Gemini 2.5 Computer Use
69.0% Google DeepMind 2025-10-07 Source
AgentBench #1
AgentRL w/ Qwen2.5-32B-Instruct
70.4% Tsinghua University 2025-10-05 Source
AgentBench #2
AgentRL w/ Qwen2.5-14B-Instruct
67.7% Tsinghua University 2025-10-05 Source
AgentBench #3
AgentRL w/ GLM-4-9B-0414
65.0% Tsinghua University 2025-10-05 Source
AgentBench #4
AgentRL w/ Qwen2.5-7B-Instruct
62.0% Tsinghua University 2025-10-05 Source
AgentBench #5
AgentRL w/ Qwen2.5-3B-Instruct
60.0% Tsinghua University 2025-10-05 Source
Aider #12
DeepSeek-V3.2-Exp (Reasoner)
74.2% DeepSeek 2025-10-03 Source
WebVoyager #2
Surfer 2
97.1% H Company 2025-10 Source
WebArena #6
Narada AI
64.2% Narada AI 2025-10 Source
BrowseComp #51
GLM-4.6
45.1% Zhipu AI 2025-10 Source
BrowseComp #54
MiniMax M2
44.0% MiniMax 2025-10 Source
BrowseComp #58
DeepSeek-V3.2-Exp
40.1% DeepSeek 2025-09-30 Source
AgentBench #6
Claude Sonnet 4.5
58.9% Anthropic 2025-09-29 Source
AgentBench #7
Claude Sonnet 4.5 Thinking
58.3% Anthropic 2025-09-29 Source
OSWorld #11
Claude Sonnet 4.5
61.4% Anthropic 2025-09-29 Source
τ-bench #12
Claude Sonnet 4.5
62.9% Anthropic 2025-09-29 Source
BrowseComp #70
InfoAgent
15.3% Academic Research 2025-09-29 Source
OSWorld #7
Qwen3 VL 235B
66.7% Alibaba 2025-09-23 Source
BrowseComp #53
Grok 4 Fast
44.9% xAI 2025-09-22 Source
BrowseComp #55
Tongyi DeepResearch
43.4% Alibaba Cloud / Tongyi Lab 2025-09-16 Source
BrowseComp #60
AgentFounder-30B
39.9% Alibaba Cloud / Tongyi Lab 2025-09-16 Source
BrowseComp #70
DeepDive-32B
15.3% THUDM / Tsinghua University 2025-09-12 Source
BrowseComp #69
WebExplorer-8B (RL)
15.7% HKUST NLP Group 2025-09-08 Source
BrowseComp #74
WebSailor-32B
10.5% Alibaba Cloud / Tongyi Lab 2025-09-08 Source
BrowseComp #78
WebSailor-7B
6.7% Alibaba Cloud / Tongyi Lab 2025-09-08 Source
Online-Mind2Web #5
UI-TARS-2
88.2% ByteDance / VLM-Research 2025-09-02 Source
WebArena #5
DeepSky Agent
66.9% DeepSky 2025-09 Source
Aider #2
gpt-5 (medium)
86.7% OpenAI 2025-08-25 Source
Aider #5
gpt-5 (low)
81.3% OpenAI 2025-08-25 Source
Aider #1
gpt-5 (high)
88.0% OpenAI 2025-08-23 Source
Online-Mind2Web #13
ACT-1-20250814
57.3% Enhans 2025-08-14 Source
BrowseComp #64
BrowseMaster
30.0% Academic Research 2025-08-12 Source
WebVoyager #14
WebSight
68% Academic Research 2025-08 Source
BrowseComp #35
Parallel Ultra8x
58.0% Parallel 2025-08 Source
BrowseComp #36
Parallel Ultra4x
56.0% Parallel 2025-08 Source
BrowseComp #37
GPT-5
54.9% OpenAI 2025-08 Source
BrowseComp #43
Parallel Ultra2x
51.0% Parallel 2025-08 Source
BrowseComp #52
Parallel Ultra
45.0% Parallel 2025-08 Source
BrowseComp #64
DeepSeek-V3.1
30.0% DeepSeek 2025-08 Source
BrowseComp #72
Exa Research Pro
14.0% Exa 2025-08 Source
BrowseComp #77
Claude Opus 4.1 (Parallel Task API benchmark)
7.0% Anthropic 2025-08 Source
BrowseComp #79
Perplexity Sonar Deep Research
6.0% Perplexity 2025-08 Source
Aider #7
grok-4 (high)
79.6% xAI 2025-07-11 Source
Online-Mind2Web #17
ACT-1-20250703
45.7% Enhans 2025-07-03 Source
WebVoyager #3
Magnitude
93.9% Magnitude 2025-07 Source
BrowseComp #67
GLM-4.5
26.4% Zhipu AI 2025-07 Source
BrowseComp #68
GLM-4.5-Air
21.3% Zhipu AI 2025-07 Source
BrowseComp #73
WebSailor-72B
12.0% Alibaba Cloud / Tongyi Lab 2025-07 Source
Aider #3
o3-pro (high)
84.9% OpenAI 2025-06-28 Source
Aider #9
o3 (high) + gpt-4.1
78.2% OpenAI 2025-06-27 Source
Aider #5
o3 (high)
81.3% OpenAI 2025-06-25 Source
Aider #10
o3
76.9% OpenAI 2025-06-25 Source
Aider #4
gemini-2.5-pro-preview-06-05 (32k think)
83.1% Google 2025-06-06 Source
Aider #8
gemini-2.5-pro-preview-06-05 (default think)
79.1% Google 2025-06-06 Source
WebVoyager #4
Surfer-H + Holo1
92.2% H Company 2025-06 Source
WebArena #19
GUI-API Hybrid Agent
38.9% Academic Research 2025-06 Source
WebArena #23
TTI
26.1% Academic Research 2025-06 Source
BrowseComp #76
DeepSeek-R1-0528
8.9% DeepSeek 2025-05-28 Source
Aider #14
claude-opus-4-20250514 (32k thinking)
72.0% Anthropic 2025-05-25 Source
AgentBench #8
Claude Sonnet 4 Thinking
58.2% Anthropic 2025-05-22 Source
AgentBench #9
Claude Sonnet 4
57.4% Anthropic 2025-05-22 Source
Aider #10
Gemini 2.5 Pro Preview 05-06
76.9% Google 2025-05-07 Source
Aider #14
o4-mini (high)
72.0% OpenAI 2025-04-16 Source
BrowseComp #39
o4-mini
51.5% OpenAI 2025-04-16 Source
BrowseComp #45
o3
49.7% OpenAI 2025-04-16 Source
Aider #13
Gemini 2.5 Pro Preview 03-25
72.9% Google 2025-04-12 Source
BrowseComp #39
OpenAI Deep Research
51.5% OpenAI 2025-04-10 Source
BrowseComp #75
OpenAI o1
9.9% OpenAI 2025-04-10 Source
BrowseComp #80
GPT-4o + browsing
1.9% OpenAI 2025-04-10 Source
BrowseComp #81
GPT-4.5
0.9% OpenAI 2025-04-10 Source
BrowseComp #82
GPT-4o
0.6% OpenAI 2025-04-10 Source
Online-Mind2Web #11
OpenAI Operator
61.3% OpenAI 2025-04-02 Source
WebVoyager #5
Browserable
90.4% Browserable 2025-04 Source
WebVoyager #10
Notte
86.2% Notte 2025-04 Source
Online-Mind2Web #19
HAL Leaderboard baseline (best open)
~30% Princeton / OSU 2025-04 Source
OSWorld #17
Qwen2.5 VL 32B Instruct
5.9% Alibaba Cloud / Qwen Team 2025-03-24 Source
OSWorld #14
Agent S2 + Claude 3.7
34.5% Simular AI 2025-03-12 Source
GAIA #19
Manus
86.5% Monica AI 2025-03-06 Source
WebVoyager #15
Runner H 0.1
67% H Company 2025-03 Source
AgentBench #10
Claude Sonnet 3.7
53.2% Anthropic 2025-02-24 Source
Online-Mind2Web #14
Claude Computer Use 3.7 (w/o thinking)
56.3% Anthropic 2025-02-24 Source
WebArena #7
IBM CUGA
61.7% IBM 2025-02-17 Source
GAIA #20
Deep Research (o3, cons@64)
72.57% OpenAI 2025-02-02 Source
GAIA #21
Deep Research (o3)
67.36% OpenAI 2025-02-02 Source
OSWorld #16
Qwen2.5 VL 72B Instruct
8.8% Alibaba Cloud / Qwen Team 2025-01-26 Source
OSWorld #15
OpenAI Operator (CUA)
32.6% OpenAI 2025-01-23 Source
WebVoyager #8
Operator
87% OpenAI 2025-01 Source
WebArena #9
OpenAI Operator
58.1% OpenAI 2025-01 Source
WebVoyager #11
Skyvern 2.0
85.85% Skyvern 2025-01 Source
WebArena #14
AgentSymbiotic
52.1% Academic Research 2025-01 Source
WebArena #15
Learn-by-Interact
48.0% Academic Research 2025-01 Source
WebArena #25
AgentTrek-1.0-32B
22.4% xLang AI 2025-01 Source
WebArena #29
NNetNav
16.3% Stanford NLP 2025-01 Source
WebArena #13
ScribeAgent + GPT-4o
53.0% Academic Research 2024-12-24 Source
WebVoyager #6
Browser Use
89.1% Browser Use 2024-12 Source
WebVoyager #12
Project Mariner
83.5% Google 2024-12 Source
Online-Mind2Web #19
Browser Use (gpt-4o)
30.0% Browser Use 2024-11-06 Source
WebVoyager #8
Agent Kura
87.0% Kura 2024-11 Source
WebVoyager #17
Anthropic Computer Use 3.5
56.0% Anthropic 2024-11 Source
Online-Mind2Web #21
Claude Computer Use 3.5
29.0% Anthropic 2024-10-22 Source
WebArena #17
AgentOccam-Judge
45.7% Amazon Science 2024-10 Source
WebArena #38
Synatra-CodeLLama7b
6.28% Academic Research 2024-10 Source
WebArena #21
Agent Workflow Memory
35.5% Academic Research 2024-09 Source
WebArena #10
Jace.AI (AWA-1.5)
57.1% Jace AI 2024-08 Source
WebArena #20
WebPilot
37.2% Academic Research 2024-08 Source
Online-Mind2Web #22
Agent-E (gpt-4o)
28.0% Emergence AI 2024-07-16 Source
WebVoyager #13
Agent-E
73.2% Emergence AI 2024-07 Source
WebArena #27
GPT-4o + Tree Search
19.2% Academic Research 2024-06 Source
WebArena #31
gpt-4o-2024-05-13
13.05% OpenAI 2024-05 Source
WebArena #33
Patel et al. + GPT-4
9.36% Academic Research 2024-05 Source
WebArena #37
Llama3-chat-70b
7.02% Meta 2024-04-02 Source
WebVoyager #18
WILBUR
53% Bardeen / UC Berkeley 2024-04 Source
WebArena #22
SteP
33.5% ASAPP Research 2024-04 Source
WebArena #24
BrowserGym + GPT-4
23.5% ServiceNow Research 2024-04 Source
WebArena #26
GPT-4 + Auto Eval
20.2% Academic Research 2024-04 Source
WebArena #28
AutoWebGLM
18.2% THUDM 2024-04 Source
WebArena #43
Llama3-chat-8b
3.32% Meta 2024-04 Source
WebArena #40
Agent-FLAN
4.68% InternLM 2024-03 Source
WebArena #44
CodeAct Agent
2.3% Academic Research 2024-02 Source
Online-Mind2Web #18
SeeAct (gpt-4o)
30.7% OSU NLP 2024-01-16 Source
WebVoyager #16
WebVoyager
59.1% Academic Research 2024-01 Source
WebVoyager #19
GPT-4 (All Tools)
30.8% OpenAI 2024-01 Source
WebArena #46
Mixtral-8x7B
1.39% Mistral AI 2024-01 Source
WebArena #36
Gemini Pro
7.12% Google 2023-12 Source
WebArena #39
Lemur-chat-70b
5.3% OpenLemur 2023-10 Source
WebArena #42
AgentLM-70b
3.81% THUDM 2023-10 Source
WebArena #45
AgentLM-13b
1.6% THUDM 2023-10 Source
WebArena #47
AgentLM-7b
0.74% THUDM 2023-10 Source
WebArena #48
FireAct
0.25% Academic Research 2023-10 Source
WebArena #35
Qwen-1.5-chat-72b
7.14% Qwen 2023-09 Source
WebArena #41
CodeLlama-instruct-34b
4.06% Meta 2023-08 Source
WebArena #49
CodeLlama-instruct-7b
0.0% Meta 2023-08 Source
WebArena #30
gpt-4-0613 (no not-achievable hint)
14.9% OpenAI 2023-06 Source
WebArena #32
gpt-4-0613 (with not-achievable hint)
11.7% OpenAI 2023-06 Source
WebArena #34
gpt-3.5-turbo-16k-0613
8.87% OpenAI 2023-03 Source

Scores come from public papers, model cards, repositories, and launch posts. Comparisons are most useful within a single benchmark. Across benchmarks, the evaluators, task sets, judges, attempt budgets, and tool access can all differ, so treat cross-benchmark rankings as directional. Use the individual benchmark pages for methodology and interpretation notes.

Frequently asked questions

Are scores on this page comparable across benchmarks?
No. Each benchmark uses its own task set, evaluator, scoring metric, and scope (model vs. agent). A 90% on one benchmark and a 90% on another do not measure the same thing. Compare scores within a single benchmark, and read methodology notes on the individual benchmark pages before drawing conclusions.
Why is the default order by reported date rather than score?
Scores across different benchmarks aren't on the same scale, so a global score sort would be misleading. Sorting by reported date surfaces the freshest results first. When you filter to a single benchmark, the order switches to within-benchmark rank, which is the meaningful comparison.
Are these scores independently verified?
Not always. Some rows are independently benchmarked and some are self-reported by the agent or model team. Every row links to its source (paper, model card, repository, or launch post). Use those links and any per-row notes to assess evidence level before drawing strong conclusions.
How is this index sourced and updated?
Rows are taken from public papers, model cards, repositories, and launch posts. New results appear as they are published — typically weekly. If you spot a missing agent or a stale score, open a pull request or issue on GitHub.
How do I add my agent or model to this list?
Open a pull request on GitHub. You need a publicly verifiable benchmark score, a link to the source (paper, post, or repo), and a homepage or repository for your agent.
Who maintains this leaderboard?
Steel maintains it as an open reference for the browser-agent ecosystem. Steel is browser infrastructure for AI agents — cloud browser sessions with anti-bot handling, proxy rotation, and session replay. Contributions and corrections are welcome on GitHub.
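The ordering rule described in the FAQ (reported date by default, within-benchmark rank once a single benchmark is selected) can be sketched roughly as follows. This is not the site's implementation: the `Row` type, `date_key`, and `order_rows` are hypothetical names, and treating a month-only date such as 2026-06 as the first day of that month is an assumption that appears consistent with the order of rows in the table but is not documented on the page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Row:
    benchmark: str
    rank: int
    system: str
    reported: str  # "YYYY-MM" or "YYYY-MM-DD", as shown in the table


def date_key(reported: str) -> str:
    # Assumption: a month-only date sorts as the first day of that month.
    # ISO-formatted date strings compare correctly as plain strings.
    return reported if len(reported) == 10 else reported + "-01"


def order_rows(rows: List[Row], benchmark: Optional[str] = None) -> List[Row]:
    if benchmark is None:
        # Default view: newest reported date first, across all benchmarks.
        return sorted(rows, key=lambda row: date_key(row.reported), reverse=True)
    # Filtered view: only the chosen benchmark, best rank first.
    selected = [row for row in rows if row.benchmark == benchmark]
    return sorted(selected, key=lambda row: row.rank)


# Rows copied from the table above.
ROWS = [
    Row("OSWorld", 4, "GPT-5.4", "2026-03-05"),
    Row("HealthAdminBench", 1, "Claude Mythos 5 (browser-use)", "2026-06"),
    Row("OSWorld", 2, "Mythos Preview", "2026-04-07"),
    Row("OSWorld", 1, "Claude Opus 4.8", "2026-05-28"),
]

if __name__ == "__main__":
    print([row.system for row in order_rows(ROWS)])
    print([row.system for row in order_rows(ROWS, benchmark="OSWorld")])
```

Keeping the two orderings in separate branches reflects the FAQ's point that rank is only meaningful inside one benchmark, so the unfiltered view never sorts on it.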