Leaderboard
| System / Submission | Score | Organization | Reported | Source |
|---|---|---|---|---|
| Claude Sonnet 4.6: Native multi-modal reasoning with high success in Finance and Academic tasks. | 33.3% | Anthropic | | Source |
| GLM-5 (New): Strongest text-only baseline; high performance in Developer-centric workflows. | 24.2% | Z.ai | | Source |
| Gemini 3 Flash: Efficiency-optimized vision model with consistent performance in Travel categories. | 19.0% | | | Source |
| Claude Haiku 4.5: Balanced agentic loop demonstrating strong reasoning in Academic task groups. | 18.3% | Anthropic | | Source |
| GPT-5.4: Large-scale reasoning model baseline; highlights difficulty of live-web transactions. | 6.5% | OpenAI | | Source |
| Gemini 3.1 Flash Lite (New): Lightweight inference model tested on real-world multi-step website interactions. | 3.3% | | | Source |
| Kimi K2.5: Early-stage agentic baseline demonstrating challenges in state-changing operations. | 0.7% | Moonshot AI | | Source |
About this benchmark
ClawBench evaluates browser agents on 153 everyday online tasks across 144 live platforms in 15 categories, including purchases, appointments, job applications, and detailed forms.
Its emphasis is on state-changing, write-heavy workflows. A lightweight interception layer blocks final submissions so agents can be evaluated safely on production sites without causing real-world side effects.
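To make the idea of blocking only the final submission concrete, here is a minimal sketch of the decision such a layer has to make for each outgoing request. It is not ClawBench's implementation: `SubmissionRule`, `should_block`, and the `shop.example.com` host and paths are hypothetical names chosen for illustration, and a real layer would sit inside the browser or a proxy rather than in a standalone function.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class SubmissionRule:
    """Hypothetical rule naming the request that counts as a task's final submission."""
    host: str          # placeholder host, e.g. "shop.example.com"
    path_prefix: str   # placeholder path, e.g. "/checkout/confirm"
    methods: tuple = ("POST", "PUT", "PATCH", "DELETE")  # state-changing HTTP methods


def should_block(method: str, url: str, rules: list) -> bool:
    """Return True when a request matches any final-submission rule."""
    parsed = urlparse(url)
    for rule in rules:
        if (
            method.upper() in rule.methods
            and parsed.hostname == rule.host
            and parsed.path.startswith(rule.path_prefix)
        ):
            return True
    return False


# Assumed example rule: only the order-confirmation request is stopped.
rules = [SubmissionRule(host="shop.example.com", path_prefix="/checkout/confirm")]

# Browsing and intermediate form steps pass through; only the final write is blocked.
print(should_block("GET", "https://shop.example.com/cart", rules))
print(should_block("POST", "https://shop.example.com/cart/add", rules))
print(should_block("POST", "https://shop.example.com/checkout/confirm?step=3", rules))
```

Under these assumptions the agent can still fill carts and forms on the live site, and the blocked request itself becomes evidence of what the agent tried to submit.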
The first reported results leave a large gap to full task completion: the best of seven frontier models completed 33.3% of tasks, which makes ClawBench useful for measuring robustness beyond navigation-only or read-only web tasks.
Note: ClawBench is a new benchmark with limited independent submissions; current rows mainly reflect the initial paper's model suite.
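As a rough reading aid, the leaderboard percentages can be converted into approximate task counts. The sketch below assumes each score is simply the fraction of the 153 tasks that were completed, which this page does not state explicitly; `approx_tasks_completed` is an illustrative helper, not part of the benchmark.

```python
TOTAL_TASKS = 153  # task count stated in the benchmark description

# Scores copied from the leaderboard table (percent).
reported_scores = {
    "Claude Sonnet 4.6": 33.3,
    "GLM-5": 24.2,
    "Gemini 3 Flash": 19.0,
    "Claude Haiku 4.5": 18.3,
    "GPT-5.4": 6.5,
    "Gemini 3.1 Flash Lite": 3.3,
    "Kimi K2.5": 0.7,
}


def approx_tasks_completed(score_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks implied by a percentage score (assumed plain fraction)."""
    return round(score_percent / 100 * total)


for system, score in reported_scores.items():
    print(f"{system}: ~{approx_tasks_completed(score)} of {TOTAL_TASKS} tasks")
```

On this assumption the top score corresponds to about 51 completed tasks and the lowest to about 1, so at this scale a difference of a single task moves a score by roughly 0.65 percentage points.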
Example tasks
Three public tasks quoted from benchmark sources:
- "On Uber Eats, order delivery: one Pad Thai, deliver to home address, note 'no peanuts'" (Citation: ClawBench task JSON)
- "Search Zillow for a one-bedroom apartment in Toronto downtown under $3500/month, select one and submit a rental application" (Citation: ClawBench task JSON)
- "Search 'Senior Software Engineer' (Toronto) on Indeed, apply to the top-ranked listing" (Citation: ClawBench task JSON)
Methodology
- Evaluation uses human ground truth and an agentic evaluator over session replay, screenshots, HTTP traffic, reasoning traces, and browser actions.
- Tasks often require using user-provided documents, filling many fields correctly, and recovering from dynamic live-site behavior.
- Because ClawBench is new, most rows currently come from the paper's initial model suite rather than independent follow-up submissions.
- Compare ClawBench with WebVoyager and Online-Mind2Web when separating read/navigation ability from transactional form-completion ability.
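The evidence types and the field-correctness requirement listed above can be pictured with a small sketch. Everything here is a hypothetical simplification: `TaskEvidence`, `mismatched_fields`, and the example payload are invented for illustration, and the benchmark's actual evaluator is agentic and weighs more than a field-by-field comparison.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskEvidence:
    """Hypothetical container for the evidence types named in the methodology."""
    task_id: str
    session_replay_path: Optional[str] = None
    screenshot_paths: list = field(default_factory=list)
    http_requests: list = field(default_factory=list)
    reasoning_trace: list = field(default_factory=list)
    browser_actions: list = field(default_factory=list)


def mismatched_fields(submitted: dict, ground_truth: dict) -> dict:
    """Map each ground-truth field whose submitted value differs to (expected, actual)."""
    return {
        key: (expected, submitted.get(key))
        for key, expected in ground_truth.items()
        if submitted.get(key) != expected
    }


# Invented example loosely modelled on the food-order task quoted earlier.
evidence = TaskEvidence(
    task_id="example-food-order",
    browser_actions=["click:add-to-cart", "type:delivery-note", "click:place-order"],
    http_requests=[
        {"method": "POST", "body": {"item": "Pad Thai", "quantity": 1, "note": ""}}
    ],
)

ground_truth = {"item": "Pad Thai", "quantity": 1, "note": "no peanuts"}
final_payload = evidence.http_requests[-1]["body"]
print(mismatched_fields(final_payload, ground_truth))
```

In this invented case the order itself is right but the delivery note is missing, which is the kind of partial failure that navigation-only checks would not surface.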