WebArena Leaderboard 2026: Latest Browser Agent Scores

Leaderboard

System / Submission	Score	Organization	Reported	Source
WebTactix (DeepSeek v3.2)	74.3%	WebTactix	Feb 2026	Source
OpAgent	71.6%	CodeFuse AI	Jan 2026	Source
ColorBrowserAgent	71.2%	MadeAgents	Dec 2025	Source
Claude Code + GBOX MCP	68.0%	GBOX AI	Oct 2025	Source
DeepSky Agent	66.9%	DeepSky	Sep 2025	Source
Narada AI	64.2%	Narada AI	Oct 2025	Source
IBM CUGA	61.7%	IBM	Feb 2025	Source
Kimi K2.5	58.9%	Moonshot AI	Feb 2026	Source
OpenAI Operator	58.1%	OpenAI	Jan 2025	Source
Jace.AI (AWA-1.5)	57.1%	Jace AI	Aug 2024	Source
Plan-MCTS + GPT-5-mini	55.3%	Academic Research	Feb 2026	Source
WebOperator + GPT-4o	54.6%	KAIST KAG NLP	Dec 2025	Source
ScribeAgent + GPT-4o	53.0%	Academic Research	Dec 2024	Source
AgentSymbiotic	52.1%	Academic Research	Jan 2025	Source
Learn-by-Interact	48.0%	Academic Research	Jan 2025	Source
WebUncertainty + GPT-4-Turbo	46.9%	Academic Research	Apr 2026	Source
AgentOccam-Judge	45.7%	Amazon Science	Oct 2024	Source
A3-Qwen3.5-9B	42.1%	McGill NLP	Apr 2026	Source
GUI-API Hybrid Agent	38.9%	Academic Research	Jun 2025	Source
WebPilot	37.2%	Academic Research	Aug 2024	Source
Agent Workflow Memory	35.5%	Academic Research	Sep 2024	Source
SteP	33.5%	ASAPP Research	Apr 2024	Source
TTI	26.1%	Academic Research	Jun 2025	Source
BrowserGym + GPT-4	23.5%	ServiceNow Research	Apr 2024	Source
AgentTrek-1.0-32B	22.4%	xLang AI	Jan 2025	Source
GPT-4 + Auto Eval	20.2%	Academic Research	Apr 2024	Source
GPT-4o + Tree Search	19.2%	Academic Research	Jun 2024	Source
AutoWebGLM	18.2%	THUDM	Apr 2024	Source
NNetNav	16.3%	Stanford NLP	Jan 2025	Source
gpt-4-0613 (no not-achievable hint)	14.9%	OpenAI	Jun 2023	Source
gpt-4o-2024-05-13	13.05%	OpenAI	May 2024	Source
gpt-4-0613 (with not-achievable hint)	11.7%	OpenAI	Jun 2023	Source
Patel et al. + GPT-4	9.36%	Academic Research	May 2024	Source
gpt-3.5-turbo-16k-0613	8.87%	OpenAI	Mar 2023	Source
Qwen-1.5-chat-72b	7.14%	Qwen	Sep 2023	Source
Gemini Pro	7.12%	Google	Dec 2023	Source
Llama3-chat-70b	7.02%	Meta	Apr 2024	Source
Synatra-CodeLLama7b	6.28%	Academic Research	Oct 2024	Source
Lemur-chat-70b	5.3%	OpenLemur	Oct 2023	Source
Agent-FLAN	4.68%	InternLM	Mar 2024	Source
CodeLlama-instruct-34b	4.06%	Meta	Aug 2023	Source
AgentLM-70b	3.81%	THUDM	Oct 2023	Source
Llama3-chat-8b	3.32%	Meta	Apr 2024	Source
CodeAct Agent	2.3%	Academic Research	Feb 2024	Source
AgentLM-13b	1.6%	THUDM	Oct 2023	Source
Mixtral-8x7B	1.39%	Mistral AI	Jan 2024	Source
AgentLM-7b	0.74%	THUDM	Oct 2023	Source
FireAct	0.25%	Academic Research	Oct 2023	Source
CodeLlama-instruct-7b	0.0%	Meta	Aug 2023	Source

About this benchmark

WebArena evaluates browser agents in reproducible, self-hosted websites instead of the open live web. Its 812 tasks cover e-commerce, forum discussion, collaborative software development, content management, maps, and reference lookup.

The benchmark is strongest when you care about repeatable web-agent experiments: every task has a controlled start state and functional success criteria rather than a changing production website.

Because many rows come from a public community tracker, a WebArena score should be read alongside the source, submitted scaffold, observation mode, and whether the result was independently reproduced.

Controlled environments improve reproducibility, but tracker rows still vary by scaffold and submission policy.

Filtered task-set or modified-grader reports are not ranked as full WebArena results unless the row notes that setup explicitly.
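
The ranking rule above can be read as a filter over tracker rows. The sketch below is only an illustration under stated assumptions: the `Row` fields (`full_task_set`, `standard_grader`, `setup_noted`) and the agent names are hypothetical and do not correspond to fields of any real tracker.

```python
from dataclasses import dataclass


@dataclass
class Row:
    system: str
    score: float           # task success rate, in percent
    full_task_set: bool    # hypothetical flag: run covered the full task set
    standard_grader: bool  # hypothetical flag: unmodified validators were used
    setup_noted: bool      # hypothetical flag: row states its nonstandard setup


def rankable(row: Row) -> bool:
    """Rank a row if it is a full, standard run or if it notes its setup explicitly."""
    return (row.full_task_set and row.standard_grader) or row.setup_noted


def rank(rows: list[Row]) -> list[Row]:
    """Return rankable rows sorted by score, highest first."""
    return sorted((r for r in rows if rankable(r)), key=lambda r: r.score, reverse=True)


# Hypothetical rows, not real submissions.
rows = [
    Row("agent-a", 50.0, full_task_set=True, standard_grader=True, setup_noted=False),
    Row("agent-b", 60.0, full_task_set=False, standard_grader=True, setup_noted=False),
    Row("agent-c", 55.0, full_task_set=True, standard_grader=False, setup_noted=True),
]
ranked = [r.system for r in rank(rows)]
print(ranked)  # agent-b is dropped: filtered task set with no setup note
```

Keeping the filter separate from the sort makes it easy to audit which rows were excluded and why.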

Example tasks

Three public tasks quoted from benchmark sources:

"What is the top-1 best-selling product in 2022" Citation: WebArena test config
"Tell me the full address of all international airports that are within a driving distance of 50 km to Carnegie Mellon University" Citation: WebArena test config
"Tell me the the number of reviews that our store received by far that mention term "disappointed"" Citation: WebArena test config

Methodology

Primary metric is end-to-end task success rate on the WebArena task set; the original GPT-4-based baseline was 14.41% versus 78.24% human performance.
Evaluation checks functional correctness through task-specific validators and answer checks in the hosted environment.
Scores can change with prompt scaffolding, observation mode, browser action space, and retry or step budget.
We prefer rows tied to WebArena's public leaderboard, papers, or repositories that include enough setup detail to reproduce the run.
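
As a rough illustration of the primary metric, the success rate is the share of tasks whose validator passes, with each task counted once. In the sketch below, `TaskResult` and `success_rate` are hypothetical names, and the count of 117 passing tasks is chosen only because 117/812 rounds to 14.41%; it is not taken from any published per-task log.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: int
    passed: bool  # True if the task-specific validator accepted the outcome


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks that passed."""
    if not results:
        raise ValueError("no task results to score")
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


# Hypothetical run over 812 tasks in which the first 117 pass.
results = [TaskResult(task_id=i, passed=(i < 117)) for i in range(812)]
rate = success_rate(results)
print(f"{rate:.2f}%")
```

Because the metric is a plain per-task average, a change in step budget or retries that flips even a handful of tasks moves the score by a visible fraction of a point.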

Related benchmarks

Compare this benchmark with related pages from the hub:

WebVoyager, Online-Mind2Web, OSWorld

Frequently asked questions

Which system is currently best on WebArena?

WebTactix (DeepSeek v3.2) currently leads with a tracked score of 74.3%. The ranking reflects submitted system setups (model plus tools and policy), not base models alone. Tracked results were last updated May 27, 2026.

What should I read into a WebArena score?

WebArena scores are most useful for ranking systems within this benchmark. Follow each row's Source link to understand its setup context, and read the Methodology section before making procurement or architecture decisions.