Leaderboard

System / Submission | Score | Organization | Reported Source
Claude Sonnet 4.6 | 33.3% | Anthropic | Source
GLM-5 (New) | 24.2% | Z.ai | Source
Gemini 3 Flash | 19.0% | Google | Source
Claude Haiku 4.5 | 18.3% | Anthropic | Source
GPT-5.4 | 6.5% | OpenAI | Source
Gemini 3.1 Flash Lite (New) | 3.3% | Google | Source
Kimi K2.5 | 0.7% | Moonshot AI | Source

About this benchmark

ClawBench evaluates browser agents on 153 everyday online tasks across 144 live platforms in 15 categories, including purchases, appointments, job applications, and detailed forms.

Its emphasis is on state-changing, write-heavy workflows. A lightweight interception layer blocks final submissions so agents can be evaluated safely on production sites without causing real-world side effects.
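
To make the interception idea concrete, here is a minimal sketch of a filter that lets ordinary traffic through but records and blocks requests that look like a final submission. It is not ClawBench's actual implementation: the `Request` type, the `intercept` function, the method set, and the URL markers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Assumption: HTTP methods that can change server-side state.
STATE_CHANGING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

# Assumption: hypothetical path fragments that mark a final submission.
FINAL_SUBMIT_MARKERS = ("/checkout/confirm", "/order/place", "/apply/submit")


@dataclass(frozen=True)
class Request:
    """Simplified stand-in for an outgoing browser request."""
    method: str
    url: str
    body: str = ""


def is_final_submission(req: Request) -> bool:
    """Return True if the request looks like a final, state-changing submit."""
    if req.method.upper() not in STATE_CHANGING_METHODS:
        return False
    path = urlparse(req.url).path
    return any(marker in path for marker in FINAL_SUBMIT_MARKERS)


def intercept(req: Request, captured: list) -> str:
    """Record and block final submissions; forward everything else."""
    if is_final_submission(req):
        captured.append(req)  # keep the payload so it can be evaluated later
        return "blocked"
    return "forwarded"
```

In this sketch an agent can still browse, add items to a cart, and fill forms, because only the last state-changing request is stopped. The captured payload remains available for scoring.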

The first reported results show a large gap between current agents and reliable task completion: the best of seven frontier models completed only 33.3% of tasks. This makes ClawBench useful for measuring robustness beyond navigation-only or read-only web tasks.
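
As a rough sanity check on what these percentages mean in absolute terms, the sketch below converts each leaderboard score into an implied number of completed tasks. It assumes that a score is simply the fraction of the 153 tasks completed, with every task weighted equally; the page does not state this explicitly, and `implied_tasks` is a hypothetical helper name.

```python
TOTAL_TASKS = 153  # task count from the benchmark description

# Scores as listed in the leaderboard (percent of tasks completed).
SCORES = {
    "Claude Sonnet 4.6": 33.3,
    "GLM-5": 24.2,
    "Gemini 3 Flash": 19.0,
    "Claude Haiku 4.5": 18.3,
    "GPT-5.4": 6.5,
    "Gemini 3.1 Flash Lite": 3.3,
    "Kimi K2.5": 0.7,
}


def implied_tasks(score_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks matching the score (equal weighting assumed)."""
    return round(score_percent / 100 * total)


for name, pct in SCORES.items():
    print(f"{name}: {pct}% is about {implied_tasks(pct)} of {TOTAL_TASKS} tasks")
```

Under that assumption, the top score of 33.3% corresponds to about 51 of 153 tasks, and the lowest score of 0.7% to a single task.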

This is a new benchmark with limited independent submissions; current rows mainly reflect the initial paper's model suite.

Example tasks

Three public tasks quoted from benchmark sources:

Methodology

  • Evaluation uses human ground truth and an agentic evaluator over session replay, screenshots, HTTP traffic, reasoning traces, and browser actions.
  • Tasks often require using user-provided documents, filling many fields correctly, and recovering from dynamic live-site behavior.
  • Because ClawBench is new, most rows currently come from the paper's initial model suite rather than independent follow-up submissions.
  • Compare ClawBench with WebVoyager and Online-Mind2Web when separating read/navigation ability from transactional form-completion ability.
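
One small piece of such an evaluation, checking that many form fields were filled correctly, can be sketched as a field-by-field comparison between a human ground-truth record and the values an agent submitted. This is only an illustration under stated assumptions: the real evaluator is agentic and also draws on session replay, screenshots, and reasoning traces, and the names `compare_fields` and `task_passed`, along with the normalization rule, are hypothetical.

```python
def _normalize(value: str) -> str:
    """Collapse whitespace and ignore case (an assumed, lenient matching rule)."""
    return " ".join(value.split()).casefold()


def compare_fields(expected: dict, submitted: dict) -> dict:
    """Report which expected fields are missing or filled with a wrong value."""
    missing = sorted(k for k in expected if k not in submitted)
    wrong = sorted(
        k for k in expected
        if k in submitted and _normalize(expected[k]) != _normalize(submitted[k])
    )
    return {"missing": missing, "wrong": wrong}


def task_passed(expected: dict, submitted: dict) -> bool:
    """A task passes only if every expected field is present and correct."""
    report = compare_fields(expected, submitted)
    return not report["missing"] and not report["wrong"]


# Hypothetical ground truth and agent submission for a job application form.
ground_truth = {"name": "Ada Lovelace", "email": "ada@example.com", "role": "Engineer"}
agent_form = {"name": "ada  lovelace", "email": "ada@example.org"}

print(compare_fields(ground_truth, agent_form))
print(task_passed(ground_truth, agent_form))
```

A strict all-fields rule like this one helps explain why write-heavy tasks are harder than navigation: a single wrong or missing field fails the whole task.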

Frequently asked questions

Which system is currently best on ClawBench?
Claude Sonnet 4.6 currently leads with a tracked score of 33.3%. This ranking reflects submitted system setups (model plus tools and policy), not just a base model. It is based on our latest tracked results, last updated Apr 16, 2026.
What should I read into a ClawBench score?
ClawBench scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.