ClawBench Leaderboard 2026: Latest Browser Agent Scores

Leaderboard

Agent scope

System / Submission	Score	Organization	Reported	Source
Claude Sonnet 4.6	33.3%	Anthropic	Apr 2026	Source
GLM-5 (new)	24.2%	Z.ai	Apr 2026	Source
Gemini 3 Flash	19.0%	Google	Apr 2026	Source
Claude Haiku 4.5	18.3%	Anthropic	Apr 2026	Source
GPT-5.4	6.5%	OpenAI	Apr 2026	Source
Gemini 3.1 Flash Lite (new)	3.3%	Google	Apr 2026	Source
Kimi K2.5	0.7%	Moonshot AI	Apr 2026	Source

About this benchmark

ClawBench evaluates browser agents on 153 everyday online tasks across 144 live platforms in 15 categories, including purchases, appointments, job applications, and detailed forms.

Its emphasis is on state-changing, write-heavy workflows. A lightweight interception layer blocks final submissions so agents can be evaluated safely on production sites without causing real-world side effects.
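The idea of blocking only the final submission can be illustrated with a minimal sketch. This is not ClawBench's actual interception layer, whose implementation is not described here. The `Request` type, the `should_block` rule, and the URL markers are all hypothetical and stand in for whatever matching logic the benchmark really uses.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Assumption: only these HTTP methods can change server-side state.
STATE_CHANGING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass(frozen=True)
class Request:
    """Hypothetical, simplified view of an outgoing browser request."""
    method: str
    url: str


def should_block(request: Request, final_submit_markers: Tuple[str, ...]) -> bool:
    """Return True when a request looks like a final, state-changing submission."""
    if request.method.upper() not in STATE_CHANGING_METHODS:
        return False  # reads and navigation always pass through
    return any(marker in request.url for marker in final_submit_markers)


def run_through_interceptor(
    requests: Iterable[Request], final_submit_markers: Tuple[str, ...]
) -> Tuple[List[Request], List[Request]]:
    """Split requests into (allowed, blocked); blocked ones never reach the site."""
    allowed: List[Request] = []
    blocked: List[Request] = []
    for req in requests:
        (blocked if should_block(req, final_submit_markers) else allowed).append(req)
    return allowed, blocked


# Illustrative traffic for a made-up checkout flow on example.com.
traffic = [
    Request("GET", "https://example.com/menu"),
    Request("POST", "https://example.com/cart/add"),
    Request("POST", "https://example.com/checkout/place-order"),
]
allowed, blocked = run_through_interceptor(traffic, ("/checkout/place-order",))
```

In this sketch the agent can still browse and build a cart, while the order-placing request is held back and can be inspected later as evidence of what the agent tried to submit.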

The first reported results show a large gap: the best of the seven frontier models evaluated completed only 33.3% of tasks. This makes ClawBench useful for measuring robustness beyond navigation-only or read-only web tasks.
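As a rough sanity check, the reported percentages can be converted into approximate task counts. The sketch below assumes each score is the plain fraction of the 153 tasks completed, with every task weighted equally; the source does not state this, so the counts are estimates rather than reported figures. Listing badges are omitted from the model names.

```python
TOTAL_TASKS = 153  # task count stated in the benchmark description

# Scores copied from the leaderboard table, in percent.
REPORTED_SCORES = {
    "Claude Sonnet 4.6": 33.3,
    "GLM-5": 24.2,
    "Gemini 3 Flash": 19.0,
    "Claude Haiku 4.5": 18.3,
    "GPT-5.4": 6.5,
    "Gemini 3.1 Flash Lite": 3.3,
    "Kimi K2.5": 0.7,
}


def implied_task_count(score_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks matching a percentage score.

    Assumption: score = completed tasks / total tasks, equally weighted.
    """
    return round(score_percent / 100 * total)


for name, score in REPORTED_SCORES.items():
    solved = implied_task_count(score)
    # Recompute the percentage from the integer count to see if it round-trips.
    print(f"{name}: ~{solved}/{TOTAL_TASKS} tasks ({100 * solved / TOTAL_TASKS:.1f}%)")
```

Under that assumption, 33.3% corresponds to about 51 of 153 tasks and 0.7% to a single task, and each implied count rounds back to the percentage shown in the table.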

Note: ClawBench is a new benchmark with few independent submissions; current rows mainly reflect the initial paper's model suite.

Example tasks

Three public tasks quoted from benchmark sources:

"On Uber Eats, order delivery: one Pad Thai, deliver to home address, note 'no peanuts'" (Citation: ClawBench task JSON)
"Search Zillow for a one-bedroom apartment in Toronto downtown under $3500/month, select one and submit a rental application" (Citation: ClawBench task JSON)
"Search 'Senior Software Engineer' (Toronto) on Indeed, apply to the top-ranked listing" (Citation: ClawBench task JSON)

Methodology

Evaluation combines human ground truth with an agentic evaluator that reviews session replays, screenshots, HTTP traffic, reasoning traces, and browser actions.
Tasks often require using user-provided documents, filling many fields correctly, and recovering from dynamic live-site behavior.
Because ClawBench is new, most rows currently come from the paper's initial model suite rather than independent follow-up submissions.
To separate read/navigation ability from transactional form-completion ability, compare ClawBench results with WebVoyager and Online-Mind2Web.
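The evidence streams listed above could be bundled per session roughly as follows. This is a hypothetical data structure for illustration only: the class name, field names, and field types are assumptions and do not reflect ClawBench's actual storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The five evidence streams named in the methodology, as hypothetical field names.
EVIDENCE_STREAMS = (
    "session_replay",
    "screenshots",
    "http_traffic",
    "reasoning_trace",
    "browser_actions",
)


@dataclass
class SessionEvidence:
    """Hypothetical per-task bundle handed to an evaluator."""
    task_id: str
    session_replay: Optional[str] = None                  # path to a replay file
    screenshots: List[str] = field(default_factory=list)  # image paths
    http_traffic: List[dict] = field(default_factory=list)
    reasoning_trace: List[str] = field(default_factory=list)
    browser_actions: List[dict] = field(default_factory=list)

    def missing_streams(self) -> List[str]:
        """Names of evidence streams that are empty or absent."""
        return [name for name in EVIDENCE_STREAMS if not getattr(self, name)]


evidence = SessionEvidence(task_id="example-task", screenshots=["step1.png"])
print(evidence.missing_streams())
```

A check like `missing_streams` is one way to flag sessions that an evaluator would have to judge from incomplete evidence.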

Related benchmarks

Compare this benchmark with related pages from the hub:

WebVoyager, Online-Mind2Web, WebArena


Frequently asked questions

Which system is currently best on ClawBench?

Claude Sonnet 4.6 currently leads with a tracked score of 33.3%. The ranking reflects submitted system setups (model plus tools and policy), not just base models. These figures are based on our latest tracked results, last updated Apr 16, 2026.

What should I read into a ClawBench score?

ClawBench scores are most useful for ranking systems against each other within the benchmark. Follow each row's Source link to understand the setup context, and read the methodology section before making procurement or architecture decisions.