
WebArena leaderboard

Benchmark page for WebArena with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-03-22

About this benchmark

WebArena evaluates browser agents in controlled, self-hosted web environments that represent realistic application patterns such as e-commerce, forums, and developer workflows.

It is commonly used when teams want more reproducible benchmarking conditions than fully live-web tasks.

Scores still represent configured end-to-end systems, including model choice, planning approach, and browser interaction stack.

Even with controlled environments, rows in the ranking can reflect different setups and submission policies, so direct comparisons between rows need care.

Methodology

  • The benchmark uses programmatic evaluation with task-level success criteria.
  • Rows often come from a shared public tracker and can reflect different submission dates and system revisions.
  • Use notes and source links to verify attempt policy and setup assumptions.
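To make the first bullet concrete, here is a minimal sketch of programmatic, task-level success evaluation in the style WebArena uses. All names here (`Task`, `evaluate`, the state keys) are hypothetical illustrations, not the benchmark's actual API: the real harness inspects live environment state and page content rather than plain dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    task_id: str
    # A success criterion is a predicate over the final environment state.
    check: Callable[[Dict[str, str]], bool]

def evaluate(tasks: List[Task], final_states: Dict[str, Dict[str, str]]) -> float:
    """Return the fraction of tasks whose success criterion passes."""
    passed = sum(1 for t in tasks if t.check(final_states.get(t.task_id, {})))
    return passed / len(tasks)

# Hypothetical tasks: one checks an order status, one checks a forum post title.
tasks = [
    Task("order-42", lambda s: s.get("order_status") == "shipped"),
    Task("post-7", lambda s: "WebArena" in s.get("post_title", "")),
]
states = {
    "order-42": {"order_status": "shipped"},  # criterion satisfied
    "post-7": {"post_title": "Intro thread"},  # criterion not satisfied
}
score = evaluate(tasks, states)  # 1 of 2 tasks pass -> 0.5
```

Because success is decided by per-task predicates rather than human judgment, the same agent transcript always yields the same score, which is what makes the benchmark more reproducible than fully live-web evaluation.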

| Rank | System / Submission | Score | Organization | Notes |
|------|---------------------|-------|--------------|-------|
| 1 | DeepSeek V3.2 | 74.3% | DeepSeek | Community tracker; third-party verified open-source submission. |
| 2 | OpAgent | 71.6% | CodeFuse AI | Community tracker; third-party verified open-source submission. |
| 3 | ColorBrowserAgent | 71.2% | ColorBrowser | Community tracker; third-party verified open-source submission. |
| 4 | Claude Code + GBOX | 68.0% | GBOX AI | Community tracker; Claude Code as backbone with GBOX scaffolding. |
| 5 | DeepSky Agent | 66.9% | DeepSky | Self-reported on community tracker. |
| 6 | Narada AI | 64.2% | Narada | Self-reported on community tracker. |
| 7 | IBM CUGA | 61.7% | IBM | Third-party verified; IBM's agentic system evaluated in a research paper. |
| 8 | Kimi K2.5 | 58.9% | Moonshot AI | Self-reported in the Kimi K2.5 technical paper; GUI-based browsing without external tools. |
| 9 | OpenAI Operator | 58.1% | OpenAI | Self-reported by OpenAI at the Operator launch. |
| 10 | Jace.AI (AWA-1.5) | 57.1% | Jace AI | Self-reported in a product launch blog post. |
| 11 | WebOperator + GPT-4o | 54.6% | WebOperator | WebArena success rate; reported on the WebArena Google Sheet leaderboard. |


Frequently asked questions

Which system is currently best on WebArena?
DeepSeek V3.2 currently leads with a tracked score of 74.3%. The ranking reflects submitted system setups (model plus tools and attempt policy), not base models alone. Based on our latest tracked results, last updated Mar 22, 2026.
What should I read into a WebArena score?
WebArena scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.