ClawBench leaderboard
Last updated: 2026-04-16
About this benchmark
ClawBench evaluates AI agents on 153 everyday tasks that real people need to complete regularly — booking appointments, completing purchases, submitting job applications, and filling in forms — across 144 live production websites in 15 categories.
Unlike most browser benchmarks that use offline sandboxes with static pages, ClawBench runs entirely on live websites, preserving the full complexity and dynamic nature of real-world web interaction. This makes it a particularly demanding and realistic signal for production browser agent capability.
Of the 7 frontier models evaluated at publication time, the best (Claude Sonnet 4.6) completed only 33.3% of tasks, making ClawBench one of the most challenging publicly available browser-agent benchmarks today.
Very new benchmark (April 2026) — published results cover only 7 frontier models. Expect the leaderboard to expand rapidly.
Methodology
- Tasks require agents to obtain information from user-provided documents, navigate multi-step workflows, and complete write-heavy operations such as filling in detailed forms — capabilities that most existing benchmarks do not test.
- Evaluation captures five layers of behavioral data: session replay, screenshots, HTTP traffic, agent reasoning traces, and browser actions.
- Human ground truth is collected for every task. An agentic evaluator scores each run against it, producing step-level traceable diagnostics rather than a single pass/fail, which makes failure analysis more actionable.
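To make the step-level diagnostics concrete, here is a minimal sketch of what one evaluated browser action might look like as a record, along with a per-task scoring helper. The schema, field names, and scoring rule are hypothetical illustrations, not ClawBench's actual data format or metric.

```python
from dataclasses import dataclass

@dataclass
class StepDiagnostic:
    """One evaluated step of an agent's browser session (hypothetical schema)."""
    step: int         # 1-based index within the session
    action: str       # browser action taken, e.g. "click" or "type"
    screenshot: str   # path to the screenshot captured at this step
    reasoning: str    # the agent's reasoning trace for this step
    passed: bool      # whether the step matched the human ground truth

def task_score(steps: list[StepDiagnostic]) -> float:
    """Fraction of steps matching ground truth (illustrative only)."""
    if not steps:
        return 0.0
    return sum(s.passed for s in steps) / len(steps)
```

A record like this lets failure analysis point at the exact step where a run diverged from the human trajectory, rather than reporting a bare pass/fail.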
Links
ClawBench
| Rank | System / Submission | Score | Organization | Notes |
|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | 33.3% | Anthropic | Native multi-modal reasoning with high success in Finance and Academic tasks. |
| 2 | GLM-5 | 24.2% | Z.ai | Strongest text-only baseline; high performance in developer-centric workflows. |
| 3 | Gemini 3 Flash | 19.0% | Google | Efficiency-optimized vision model with consistent performance in Travel categories. |
| 4 | Claude Haiku 4.5 | 18.3% | Anthropic | Balanced agentic loop demonstrating strong reasoning in Academic task groups. |
| 5 | GPT-5.4 | 6.5% | OpenAI | Large-scale reasoning model baseline; highlights difficulty of live-web transactions. |
| 6 | Gemini 3.1 Flash Lite | 3.3% | Google | Lightweight inference model tested on real-world multi-step website interactions. |
| 7 | Kimi K2.5 | 0.7% | Moonshot AI | Early-stage agentic baseline demonstrating challenges in state-changing operations. |
Related benchmarks
Compare this benchmark with related pages from the hub: