Leaderboard
| System / Submission | Score | Organization | Reported | Source |
|---|---|---|---|---|
| | 74.3% | WebTactix | | Source |
| | 71.6% | CodeFuse AI | | Source |
| | 71.2% | MadeAgents | | Source |
| | 68.0% | GBOX AI | | Source |
| DeepSky Agent: Public tracker row; no separate public source or repository link was found in the tracker. | 66.9% | DeepSky | | Source |
| Narada AI: Narada blog and public tracker report the WebArena result; no public code repository was found. | 64.2% | Narada AI | | Source |
| | 61.7% | IBM | | Source |
| | 58.9% | Moonshot AI | | Source |
| OpenAI Operator: OpenAI Computer-Using Agent result reported with the Operator system card and public WebArena tracker. | 58.1% | OpenAI | | Source |
| Jace.AI (AWA-1.5): Jace reports AWA 1.5 at 57.14% on WebArena; implementation is proprietary. | 57.1% | Jace AI | | Source |
| Plan-MCTS + GPT-5-mini: Plan-MCTS paper reports 55.3% on the full 812-task WebArena test set with GPT-5-mini. | 55.3% | Academic Research | | Source |
| | 54.6% | KAIST KAG NLP | | Source |
| ScribeAgent + GPT-4o: ScribeAgent reports 53.0% with GPT-4o; the tracker marks data as proprietary. | 53.0% | Academic Research | | Source |
| | 52.1% | Academic Research | | Source |
| Learn-by-Interact: Paper reports 48.0% on WebArena; no public implementation repository was found. | 48.0% | Academic Research | | Source |
| | 46.9% | Academic Research | | Source |
| | 45.7% | Amazon Science | | Source |
| | 42.1% | McGill NLP | | Source |
| GUI-API Hybrid Agent: Latest arXiv version reports 38.9%; the public tracker still lists 35.8 for an earlier or alternate setup. | 38.9% | Academic Research | | Source |
| WebPilot: Paper reports 37.2%; the public tracker notes no open-source code or trajectories were released. | 37.2% | Academic Research | | Source |
| | 35.5% | Academic Research | | Source |
| | 33.5% | ASAPP Research | | Source |
| | 26.1% | Academic Research | | Source |
| | 23.5% | ServiceNow Research | | Source |
| | 22.4% | xLang AI | | Source |
| GPT-4 + Auto Eval: Automatic evaluator study reports a GPT-4 WebArena result; no separate implementation repository was found. | 20.2% | Academic Research | | Source |
| GPT-4o + Tree Search: Tree-search agent result from the Search Agents project page. | 19.2% | Academic Research | | Source |
| | 18.2% | THUDM | | Source |
| | 16.3% | Stanford NLP | | Source |
| gpt-4-0613 (no not-achievable hint): Original WebArena baseline without providing the not-achievable task hint. | 14.9% | OpenAI | | Source |
| gpt-4o-2024-05-13: Public tracker row from the WebArena team with the not-achievable task hint provided. | 13.05% | OpenAI | | Source |
| gpt-4-0613 (with not-achievable hint): Original WebArena GPT-4 baseline when the not-achievable task hint is provided. | 11.7% | OpenAI | | Source |
| Patel et al. + GPT-4: Patel et al. report a GPT-4 WebArena evaluation row in the public tracker. | 9.36% | Academic Research | | Source |
| gpt-3.5-turbo-16k-0613: Original WebArena GPT-3.5 baseline. | 8.87% | OpenAI | | Source |
| | 7.14% | Qwen | | Source |
| Gemini Pro: WebArena tracker baseline for Gemini Pro; implementation is proprietary. | 7.12% | | | Source |
| | 7.02% | Meta | | Source |
| Synatra-CodeLLama7b: Synatra CodeLlama-7B WebArena row from the paper and public tracker. | 6.28% | Academic Research | | Source |
| | 5.3% | OpenLemur | | Source |
| | 4.68% | InternLM | | Source |
| | 4.06% | Meta | | Source |
| | 3.81% | THUDM | | Source |
| | 3.32% | Meta | | Source |
| | 2.3% | Academic Research | | Source |
| | 1.6% | THUDM | | Source |
| | 1.39% | Mistral AI | | Source |
| | 0.74% | THUDM | | Source |
| FireAct: FireAct WebArena row from the public tracker; no public WebArena implementation artifact was found. | 0.25% | Academic Research | | Source |
| | 0.0% | Meta | | Source |
About this benchmark
WebArena evaluates browser agents in reproducible, self-hosted websites instead of the open live web. Its 812 tasks cover e-commerce, forum discussion, collaborative software development, content management, maps, and reference lookup.
The benchmark is strongest when you care about repeatable web-agent experiments: every task has a controlled start state and functional success criteria rather than a changing production website.
Because many rows come from a public community tracker, a WebArena score should be read alongside the source, submitted scaffold, observation mode, and whether the result was independently reproduced.
Controlled environments improve reproducibility, but tracker rows still vary by scaffold and submission policy.
Filtered task-set or modified-grader reports are not ranked as full WebArena results unless the row notes that setup explicitly.
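One way to picture the reading advice above is as a record that keeps a score together with its context, plus a filter that ranks only full-task-set, standard-grader rows. This is a minimal sketch under stated assumptions: `LeaderboardRow`, its field names, and `rankable` are hypothetical and do not come from the tracker's data format, and the usage rows are invented placeholders rather than real submissions.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical row schema; the field names are illustrative assumptions,
# not the public tracker's actual columns.
@dataclass(frozen=True)
class LeaderboardRow:
    system: str
    score_percent: float
    source: str
    scaffold: Optional[str] = None           # prompt/agent scaffold, if reported
    observation_mode: Optional[str] = None   # observation mode, if reported
    independently_reproduced: bool = False
    full_task_set: bool = True               # False for filtered task subsets
    standard_grader: bool = True             # False for modified graders


def rankable(rows: List[LeaderboardRow]) -> List[LeaderboardRow]:
    """Keep only full-task-set, standard-grader rows, best score first."""
    eligible = [r for r in rows if r.full_task_set and r.standard_grader]
    return sorted(eligible, key=lambda r: r.score_percent, reverse=True)


if __name__ == "__main__":
    # Placeholder rows for illustration only; not real leaderboard entries.
    rows = [
        LeaderboardRow("Agent A", 40.0, "paper"),
        LeaderboardRow("Agent B", 55.0, "blog", full_task_set=False),
        LeaderboardRow("Agent C", 47.5, "tracker", independently_reproduced=True),
    ]
    for row in rankable(rows):
        print(row.system, row.score_percent, row.source)
```

In this sketch the filtered-subset row is excluded from the ranking rather than deleted from the data, so it can still be displayed with an explicit note about its setup.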
Example tasks
Three public tasks quoted from benchmark sources:
- "What is the top-1 best-selling product in 2022" (Citation: WebArena test config)
- "Tell me the full address of all international airports that are within a driving distance of 50 km to Carnegie Mellon University" (Citation: WebArena test config)
- "Tell me the the number of reviews that our store received by far that mention term "disappointed"" (Citation: WebArena test config)
Methodology
- Primary metric is end-to-end task success rate on the WebArena task set; the original GPT-4-based baseline was 14.41% versus 78.24% human performance.
- Evaluation checks functional correctness through task-specific validators and answer checks in the hosted environment.
- Scores can change with prompt scaffolding, observation mode, browser action space, and retry or step budget.
- We prefer rows tied to WebArena's public leaderboard, papers, or repositories that include enough setup detail to reproduce the run.
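The success-rate metric described in the list above can be sketched as follows. This is an illustration only, not WebArena's evaluation harness: `TaskResult`, `exact_match_validator`, `success_rate`, and `tasks_solved` are hypothetical names, and the exact-match check is a simplified stand-in for the benchmark's task-specific validators.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


# Hypothetical record for one evaluated task; not part of WebArena's code.
@dataclass(frozen=True)
class TaskResult:
    task_id: int
    agent_answer: str
    reference_answer: str


def exact_match_validator(result: TaskResult) -> bool:
    """Simplified answer check: case- and whitespace-insensitive equality."""
    normalize = lambda s: s.strip().lower()
    return normalize(result.agent_answer) == normalize(result.reference_answer)


def success_rate(
    results: Iterable[TaskResult],
    validator: Callable[[TaskResult], bool] = exact_match_validator,
) -> float:
    """Percentage of tasks whose outcome passes the validator."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    passed = sum(1 for r in results if validator(r))
    return 100.0 * passed / len(results)


def tasks_solved(score_percent: float, total_tasks: int = 812) -> int:
    """Approximate number of solved tasks implied by a reported percentage."""
    return round(score_percent / 100.0 * total_tasks)


if __name__ == "__main__":
    demo = [TaskResult(0, " Paris ", "paris"), TaskResult(1, "x", "y")]
    print(success_rate(demo))   # one of two demo tasks passes
    print(tasks_solved(57.14))  # tasks implied by a 57.14% score on 812 tasks
```

Under these assumptions, a reported 57.14% on the full 812-task set corresponds to 464 solved tasks, which is a quick sanity check that a score was computed over the full set rather than a filtered subset.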