
WebArena leaderboard

Benchmark page for WebArena with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-03-22

About this benchmark

WebArena evaluates browser agents in controlled, self-hosted web environments that represent realistic application patterns such as e-commerce, forums, and developer workflows.

It is commonly used when teams want more reproducible benchmarking conditions than fully live-web tasks.

Scores still represent configured end-to-end systems, including model choice, planning approach, and browser interaction stack.

Even with controlled environments, ranking rows can differ by setup and submission policy.

Methodology

The benchmark uses programmatic evaluation with task-level success criteria.
Rows often come from a shared public tracker and can reflect different submission dates and system revisions.
Use notes and source links to verify attempt policy and setup assumptions.
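As a rough illustration of how task-level success criteria roll up into a single leaderboard score, the sketch below assumes each task yields one pass/fail outcome and that the score is the percentage of tasks passed. The `TaskResult` type, the `success_rate` helper, and the task IDs are hypothetical names for illustration, not part of the WebArena codebase.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str   # hypothetical identifier, for illustration only
    passed: bool   # outcome of the task's programmatic success check


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks whose success check passed."""
    if not results:
        raise ValueError("no task results to score")
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


# Made-up outcomes for four tasks; three of them pass.
results = [
    TaskResult("shopping-001", True),
    TaskResult("forum-002", False),
    TaskResult("gitlab-003", True),
    TaskResult("shopping-004", True),
]

print(f"{success_rate(results):.1f}%")  # 75.0%
```

Because every task counts equally under this assumption, two systems with the same score may still succeed on different subsets of tasks.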

WebArena

Agent scope

Rank	System / Submission	Score	Organization	Notes	Source
1	DeepSeek V3.2 (new)	74.3%	DeepSeek	Community tracker; third-party verified open-source submission.	Source
2	OpAgent (new)	71.6%	CodeFuse AI	Community tracker; third-party verified open-source submission.	Source
3	ColorBrowserAgent (new)	71.2%	ColorBrowser	Community tracker; third-party verified open-source submission.	Source
4	Claude Code + GBOX	68.0%	GBOX AI	Community tracker; Claude Code as backbone with GBOX scaffolding.	Source
5	DeepSky Agent	66.9%	DeepSky	Self-reported on community tracker.	Source
6	Narada AI	64.2%	Narada	Self-reported on community tracker.	Source
7	IBM CUGA	61.7%	IBM	Third-party verified; IBM's agentic system evaluated via research paper.	Source
8	Kimi K2.5 (new)	58.9%	Moonshot AI	Self-reported in Kimi K2.5 technical paper; GUI-based browsing without external tools.	Source
9	OpenAI Operator	58.1%	OpenAI	Self-reported by OpenAI at Operator launch.	Source
10	Jace.AI (AWA-1.5)	57.1%	Jace AI	Self-reported in product launch blog post.	Source
11	WebOperator + GPT-4o	54.6%	WebOperator	WebArena success rate; reported on the WebArena Google Sheet leaderboard.	Source

Related benchmarks

Compare this benchmark with related pages from the hub:

WebVoyager, BrowseComp, OSWorld


Frequently asked questions

Which system is currently best on WebArena?

DeepSeek V3.2 currently leads with a tracked score of 74.3%. The ranking reflects submitted system setups (model plus tools and policy), not just a base model. This is based on our latest tracked results, last updated Mar 22, 2026.

What should I read into a WebArena score?

WebArena scores are most useful for ranking systems within this benchmark. Read the Notes column to understand setup context, and consult the Methodology section before making procurement or architecture decisions.

Are these scores independently verified?

Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.
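One way to check evidence level in bulk is to group rows by the wording of their Notes field. The sketch below is a minimal example that assumes the phrases "third-party verified" and "self-reported" are used consistently in Notes; the `evidence_level` helper is a hypothetical name, and the sample rows are copied from the leaderboard above.

```python
def evidence_level(notes: str) -> str:
    """Classify a row by keywords in its Notes text (assumed phrasing)."""
    text = notes.lower()
    if "third-party verified" in text:
        return "third-party verified"
    if "self-reported" in text:
        return "self-reported"
    return "unspecified"


# (system, score in percent, notes) taken from the leaderboard table.
rows = [
    ("DeepSeek V3.2", 74.3,
     "Community tracker; third-party verified open-source submission."),
    ("DeepSky Agent", 66.9,
     "Self-reported on community tracker."),
    ("IBM CUGA", 61.7,
     "Third-party verified; IBM's agentic system evaluated via research paper."),
    ("WebOperator + GPT-4o", 54.6,
     "WebArena success rate; reported on the WebArena Google Sheet leaderboard."),
]

# Group system names by evidence level, preserving leaderboard order.
by_level: dict[str, list[str]] = {}
for system, score, notes in rows:
    by_level.setdefault(evidence_level(notes), []).append(system)

for level, systems in by_level.items():
    print(f"{level}: {', '.join(systems)}")
```

Keyword matching is only a first pass: rows that land in "unspecified" still need their source link checked by hand.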