WebArena leaderboard
Last updated: 2026-03-22
About this benchmark
WebArena evaluates browser agents in controlled, self-hosted web environments that represent realistic application patterns such as e-commerce, forums, and developer workflows.
It is commonly used when teams want more reproducible benchmarking conditions than fully live-web tasks.
Scores still represent configured end-to-end systems, including model choice, planning approach, and browser interaction stack.
Even with controlled environments, leaderboard rows can differ in system setup and submission policy, so compare rows with care.
Methodology
- The benchmark uses programmatic evaluation with task-level success criteria.
- Rows often come from a shared public tracker and can reflect different submission dates and system revisions.
- Use notes and source links to verify attempt policy and setup assumptions.
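The task-level scoring described above can be sketched in a few lines. This is a hypothetical illustration of how programmatic success criteria aggregate into a single benchmark score, not the official WebArena harness; the `TaskResult` type and task IDs are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    success: bool  # outcome of the task's programmatic success check

def success_rate(results: list[TaskResult]) -> float:
    """Aggregate per-task pass/fail outcomes into one benchmark score."""
    if not results:
        return 0.0
    return sum(r.success for r in results) / len(results)

# Hypothetical run over three tasks: two passes, one failure.
runs = [
    TaskResult("shopping-001", True),
    TaskResult("forum-014", False),
    TaskResult("gitlab-007", True),
]
print(f"{success_rate(runs):.1%}")  # → 66.7%
```

Because the score is a plain success fraction, differences in attempt policy (retries, max steps) directly change the numerator, which is why the notes and source links matter when comparing rows.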
| Rank | System / Submission | Score | Organization | Notes | Source |
|---|---|---|---|---|---|
| 1 | DeepSeek V3.2 | 74.3% | DeepSeek | Community tracker; third-party verified open-source submission. | Source |
| 2 | OpAgent | 71.6% | CodeFuse AI | Community tracker; third-party verified open-source submission. | Source |
| 3 | ColorBrowserAgent | 71.2% | ColorBrowser | Community tracker; third-party verified open-source submission. | Source |
| 4 | Claude Code + GBOX | 68.0% | GBOX AI | Community tracker; Claude Code as backbone with GBOX scaffolding. | Source |
| 5 | DeepSky Agent | 66.9% | DeepSky | Self-reported on community tracker. | Source |
| 6 | Narada AI | 64.2% | Narada | Self-reported on community tracker. | Source |
| 7 | IBM CUGA | 61.7% | IBM | Third-party verified; IBM's agentic system evaluated via research paper. | Source |
| 8 | Kimi K2.5 | 58.9% | Moonshot AI | Self-reported in Kimi K2.5 technical paper; GUI-based browsing without external tools. | Source |
| 9 | OpenAI Operator | 58.1% | OpenAI | Self-reported by OpenAI at Operator launch. | Source |
| 10 | Jace.AI (AWA-1.5) | 57.1% | Jace AI | Self-reported in product launch blog post. | Source |
| 11 | WebOperator + GPT-4o | 54.6% | WebOperator | WebArena success rate; reported on the WebArena Google Sheet leaderboard. | Source |
Frequently asked questions
Which system is currently best on WebArena?
DeepSeek V3.2 currently leads with a tracked score of 74.3%. This ranking reflects the submitted system setup (model plus tools and attempt policy), not the base model alone. Based on our latest tracked results, last updated Mar 22, 2026.
What should I read into a WebArena score?
WebArena scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and review the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.