ClawBench leaderboard
Last updated: 2026-04-16
About this benchmark
ClawBench evaluates AI agents on 153 everyday tasks that real people need to complete regularly — booking appointments, completing purchases, submitting job applications, and filling in forms — across 144 live production websites in 15 categories.
Unlike most browser benchmarks that use offline sandboxes with static pages, ClawBench runs entirely on live websites, preserving the full complexity and dynamic nature of real-world web interaction. This makes it a particularly demanding and realistic signal for production browser agent capability.
Of the 7 frontier models evaluated at publication time, the best (Claude Sonnet 4.6) completed only 33.3% of tasks, making ClawBench one of the most challenging publicly available browser-agent benchmarks today.
Very new benchmark (April 2026) — published results cover only 7 frontier models. Expect the leaderboard to expand rapidly.
Methodology
- Tasks require agents to obtain information from user-provided documents, navigate multi-step workflows, and complete write-heavy operations such as filling in detailed forms — capabilities that most existing benchmarks do not test.
- Evaluation captures five layers of behavioral data: session replay, screenshots, HTTP traffic, agent reasoning traces, and browser actions.
- Human ground truth is collected for every task. An agentic evaluator scores each run against it, producing step-level traceable diagnostics rather than a single pass/fail, which makes failure analysis more actionable.
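To make the step-level diagnostics concrete, here is a minimal sketch of what one evaluated browser action might look like as a record, along with a per-task scoring helper. The schema, field names, and scoring rule are hypothetical illustrations, not ClawBench's actual data format or metric.

```python
from dataclasses import dataclass

@dataclass
class StepDiagnostic:
    """One evaluated step of an agent's browser session (hypothetical schema)."""
    step: int         # 1-based index within the session
    action: str       # browser action taken, e.g. "click" or "type"
    screenshot: str   # path to the screenshot captured at this step
    reasoning: str    # the agent's reasoning trace for this step
    passed: bool      # whether the step matched the human ground truth

def task_score(steps: list[StepDiagnostic]) -> float:
    """Fraction of steps matching ground truth (illustrative only)."""
    if not steps:
        return 0.0
    return sum(s.passed for s in steps) / len(steps)
```

A record like this lets failure analysis point at the exact step where a run diverged from the human trajectory, rather than reporting a bare pass/fail.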
Links
ClawBench
| Rank | System / Submission | Score | Organization | Notes |
|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | 33.3% | Anthropic | Native multi-modal reasoning with high success in Finance and Academic tasks. |
| 2 | GLM-5 | 24.2% | Z.ai | Strongest text-only baseline; high performance in developer-centric workflows. |
| 3 | Gemini 3 Flash | 19.0% | Google | Efficiency-optimized vision model with consistent performance in Travel categories. |
| 4 | Claude Haiku 4.5 | 18.3% | Anthropic | Balanced agentic loop demonstrating strong reasoning in Academic task groups. |
| 5 | GPT-5.4 | 6.5% | OpenAI | Large-scale reasoning model baseline; highlights difficulty of live-web transactions. |
| 6 | Gemini 3.1 Flash Lite | 3.3% | Google | Lightweight inference model tested on real-world multi-step website interactions. |
| 7 | Kimi K2.5 | 0.7% | Moonshot AI | Early-stage agentic baseline demonstrating challenges in state-changing operations. |
Related benchmarks
Compare this benchmark with related pages from the hub: