
ClawBench leaderboard

Benchmark page for ClawBench with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-04-16

About this benchmark

ClawBench evaluates AI agents on 153 everyday tasks that real people need to complete regularly — booking appointments, completing purchases, submitting job applications, and filling in forms — across 144 live production websites in 15 categories.

Unlike most browser benchmarks that use offline sandboxes with static pages, ClawBench runs entirely on live websites, preserving the full complexity and dynamic nature of real-world web interaction. This makes it a particularly demanding and realistic signal for production browser agent capability.

The best of the 7 frontier models evaluated at publication time, Claude Sonnet 4.6, completed only 33.3% of tasks, making ClawBench one of the most challenging publicly available browser agent benchmarks today.

This is a very new benchmark (April 2026), and published results cover only 7 frontier models; expect the leaderboard to expand rapidly.

Methodology

  • Tasks require agents to obtain information from user-provided documents, navigate multi-step workflows, and complete write-heavy operations such as filling in detailed forms, capabilities that most existing benchmarks do not test.
  • Evaluation captures five layers of behavioral data: session replay, screenshots, HTTP traffic, agent reasoning traces, and browser actions. An agentic evaluator scores results with step-level, traceable diagnostics.
  • Human ground truth is collected for every task, and the agentic evaluator provides step-level diagnostics rather than a single pass/fail verdict, making failure analysis more actionable (see the schema sketch after this list).
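
To make the evaluation record concrete, here is a minimal Python sketch of what a single task run might look like under this methodology. ClawBench has not published a machine-readable schema, so every class and field name below (TaskCapture, StepDiagnostic, TaskResult) is a hypothetical illustration of the five capture layers and step-level diagnostics described above, not the benchmark's actual format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema only: ClawBench does not publish a data format.
# This sketch illustrates the five behavioral layers and the step-level
# diagnostics described in the methodology section.

@dataclass
class StepDiagnostic:
    step_index: int   # position in the agent's action sequence
    action: str       # e.g. "click", "type", "navigate"
    passed: bool      # did this step advance the task correctly?
    rationale: str    # evaluator's explanation, traceable to captured evidence

@dataclass
class TaskCapture:
    task_id: str
    # The five layers of behavioral data captured per run:
    session_replay_path: str      # layer 1: full session replay
    screenshot_paths: List[str]   # layer 2: per-step screenshots
    http_traffic_log: str         # layer 3: recorded HTTP traffic
    reasoning_trace: str          # layer 4: agent reasoning trace
    browser_actions: List[str]    # layer 5: executed browser actions

@dataclass
class TaskResult:
    capture: TaskCapture
    human_ground_truth: str   # collected for every task
    diagnostics: List[StepDiagnostic] = field(default_factory=list)

    @property
    def completed(self) -> bool:
        # Task-level success is derived from per-step verdicts rather than
        # stored as a single pass/fail flag.
        return bool(self.diagnostics) and all(d.passed for d in self.diagnostics)
```

The design point worth noting is that task-level success is derived from per-step verdicts; keeping the verdicts (and their rationales) around is what makes the step-level failure analysis described above possible.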

Leaderboard
Rank | System / Submission | Score | Organization | Notes
1 | Claude Sonnet 4.6 | 33.3% | Anthropic | Native multi-modal reasoning with high success in Finance and Academic tasks.
2 | GLM-5 | 24.2% | Z.ai | Strongest text-only baseline; high performance in Developer-centric workflows.
3 | Gemini 3 Flash | 19.0% | Google | Efficiency-optimized vision model with consistent performance in Travel categories.
4 | Claude Haiku 4.5 | 18.3% | Anthropic | Balanced agentic loop demonstrating strong reasoning in Academic task groups.
5 | GPT-5.4 | 6.5% | OpenAI | Large-scale reasoning model baseline; highlights difficulty of live-web transactions.
6 | Gemini 3.1 Flash Lite | 3.3% | Google | Lightweight inference model tested on real-world multi-step website interactions.
7 | Kimi K2.5 | 0.7% | Moonshot AI | Early-stage agentic baseline demonstrating challenges in state-changing operations.

Frequently asked questions

Which system is currently best on ClawBench?
Claude Sonnet 4.6 currently leads with a tracked score of 33.3%. This ranking reflects submitted system setups (model plus tools and policy), not just a base model, and is based on our latest tracked results, last updated Apr 16, 2026.
What should I read into a ClawBench score?
ClawBench scores are most useful for within-benchmark ranking. Read the Notes column to understand each setup's context, and review the methodology section before making procurement or architecture decisions.