Browser agents
WebVoyager
WebVoyager is a benchmark for browser agents operating on live websites. It focuses on practical tasks such as navigation, search, form completion, and multi-step workflows across a broad mix of real-world sites.
Top: Jina (98.9%)
Caveat: Rows may use different evaluation settings, so comparisons are not always apples-to-apples.
Open WebVoyager leaderboard
Research/search
BrowseComp
BrowseComp targets difficult browse-and-synthesize research questions that are easy to verify but hard to answer without a strong search and reasoning strategy.
Top: GPT-5.5 Pro (90.1%)
Caveat: Mixed-scope benchmark: model-only and tool-augmented rows are not directly comparable.
Open BrowseComp leaderboard
Browser agents
WebArena
WebArena evaluates browser agents in controlled, self-hosted web environments that represent realistic application patterns such as e-commerce, forums, and developer workflows.
Top: DeepSeek V3.2 (74.3%)
Caveat: Even with controlled environments, rows can differ in harness setup and submission policy.
Open WebArena leaderboard
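Because WebArena's environments are self-hosted and deterministic, tasks can be scored programmatically against the final page state rather than by an LLM judge. Below is a minimal sketch of that functional-evaluation idea in Python; the task-spec fields and helper names are illustrative assumptions, not the repo's exact schema.

    # Sketch of WebArena-style functional evaluation (simplified; the field
    # names in `spec` are illustrative, not the repo's exact schema).
    from dataclasses import dataclass

    @dataclass
    class Outcome:
        answer: str      # agent's final text answer
        final_url: str   # URL of the page the agent ended on

    def evaluate(outcome: Outcome, spec: dict) -> bool:
        """Return True only if the outcome satisfies every check in the spec."""
        checks = []
        if "exact_match" in spec:
            checks.append(outcome.answer.strip() == spec["exact_match"])
        if "must_include" in spec:
            checks.append(all(s.lower() in outcome.answer.lower()
                              for s in spec["must_include"]))
        if "url_match" in spec:
            checks.append(outcome.final_url.startswith(spec["url_match"]))
        return bool(checks) and all(checks)

    # Example: a task whose answer must name the cheapest product and price.
    spec = {"must_include": ["USB-C cable", "$7.99"]}
    result = evaluate(Outcome("The cheapest is the USB-C cable at $7.99.",
                              "http://shop.example/item/42"), spec)
    print(result)  # True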
Coding
SWE-bench Verified
SWE-bench Verified evaluates software engineering performance on real GitHub issues with stricter quality controls than the broader SWE-bench set.
Top: Claude Mythos (93.9%)
Caveat: Model-focused benchmark, but harness and evaluation policy still affect outcomes.
Open SWE-bench Verified leaderboard
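SWE-bench scoring is execution-based: the candidate patch is applied to the repository and the project's own tests decide the outcome. A minimal sketch of the resolution rule follows; FAIL_TO_PASS and PASS_TO_PASS are real dataset fields, while run_tests stands in for the benchmark's containerized test harness.

    # An instance counts as resolved only if every FAIL_TO_PASS test now
    # passes (the issue is fixed) and every PASS_TO_PASS test still passes
    # (nothing regressed). `run_tests` is a hypothetical stand-in that
    # applies the patch and returns {test_name: "PASSED" | "FAILED"}.
    def is_resolved(instance: dict, run_tests) -> bool:
        results = run_tests(instance["repo"], instance["model_patch"])
        fixed = all(results.get(t) == "PASSED" for t in instance["FAIL_TO_PASS"])
        no_regression = all(results.get(t) == "PASSED" for t in instance["PASS_TO_PASS"])
        return fixed and no_regression

    def resolve_rate(instances: list[dict], run_tests) -> float:
        """The headline score: fraction of instances resolved."""
        return sum(is_resolved(i, run_tests) for i in instances) / len(instances)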
Computer use
OSWorld
OSWorld evaluates computer-use agents across 369 real desktop tasks spanning Ubuntu, Windows, and macOS — covering web apps, desktop software, file I/O, and multi-application workflows.
Top: Mythos Preview (79.6%)
Caveat: Self-reported and independently verified rows coexist — check the source before comparing directly.
Open OSWorld leaderboard
Model evals / reasoning
GAIA
GAIA (General AI Assistants) evaluates agents on over 450 real-world questions with unambiguous, verifiable answers — requiring multi-step reasoning, tool use, web search, and file handling across three difficulty levels.
Top: OPS-Agentic-Search (92.36%)
Caveat: Top entries are multi-model ensembles — scores cannot be attributed to any single model.
Open GAIA leaderboard
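Because every GAIA answer is a short, unambiguous string or number, scoring reduces to a normalized exact match. A minimal sketch of that idea is below; it is simplified, and the official scorer's normalization (including element-wise handling of list answers) is more involved.

    # Sketch of GAIA-style quasi-exact-match scoring (simplified).
    import re

    def normalize(value: str) -> str:
        s = value.strip().lower()
        s = re.sub(r"[^\w\s.-]", "", s)   # drop most punctuation
        return re.sub(r"\s+", " ", s)

    def quasi_exact_match(prediction: str, gold: str) -> bool:
        try:  # numeric answers compare as numbers: "1,000" matches "1000.0"
            return float(prediction.replace(",", "")) == float(gold.replace(",", ""))
        except ValueError:
            return normalize(prediction) == normalize(gold)

    print(quasi_exact_match("1,000", "1000"))                     # True
    print(quasi_exact_match("The Eiffel Tower", "eiffel tower"))  # False: article not stripped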
Browser agents
ClawBench
ClawBench evaluates AI agents on 153 everyday tasks that real people need to complete regularly — booking appointments, completing purchases, submitting job applications, and filling in forms — across 144 live production websites in 15 categories.
Top: Claude Sonnet 4.6 (33.3%)
Caveat: Very new benchmark (April 2026) — published results cover only 7 frontier models. Expect the leaderboard to expand rapidly.
Open ClawBench leaderboard
Browser agents
Online-Mind2Web
Online-Mind2Web is a live browser agent benchmark of 300 diverse, realistic tasks across 136 popular websites — spanning shopping, finance, travel, government, and more. Unlike static offline benchmarks, agents interact with real, dynamic pages as they exist at evaluation time.
Top: Browser Use Cloud (bu-max) (97.0%)
Caveat: Judge methodology varies significantly across submissions — human eval, WebJudge, and custom agentic judges produce different scores for the same agent. Always check the Notes column before comparing rows.
Open Online-Mind2Web leaderboard
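WebJudge, the leaderboard's automatic judge, is an LLM-as-judge pipeline: it first filters a trajectory down to the screenshots that matter, then asks a multimodal model to rule on success from those frames. A rough sketch of that two-stage shape is below; llm is a hypothetical callable (prompt, images) -> str, and the real prompts are considerably more detailed.

    # Two-stage WebJudge-style evaluation: select key frames, then judge.
    def web_judge(task: str, screenshots: list[bytes], llm) -> bool:
        # Stage 1: keep only screenshots the judge deems important evidence.
        key_frames = [
            shot for shot in screenshots
            if llm(f"Is this screenshot important evidence for judging the "
                   f"task '{task}'? Answer YES or NO.", [shot]).strip().upper() == "YES"
        ]
        # Stage 2: a single verdict based only on the key frames.
        verdict = llm(
            f"Task: {task}\nBased only on these key screenshots, did the "
            f"agent complete the task? Answer SUCCESS or FAILURE.", key_frames)
        return verdict.strip().upper() == "SUCCESS"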
Model evals / reasoning
τ-bench
τ-bench (TAU-bench) evaluates AI agents in realistic enterprise tool-use scenarios across retail and airline domains — testing multi-turn conversation, policy adherence, database interactions, and rule-following consistency over many trials (its pass^k metric, sketched after this entry).
Top: Step-3.5-Flash (88.2%)
Caveat: Score comparisons across organizations require caution — prompt setup, tool schema, and trial count differ between submissions.
Open τ-bench leaderboard
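pass^k asks not "can the agent solve the task once?" but "does it solve the task on all k independent runs?", which is what rule-following consistency means in practice. A minimal sketch of the standard unbiased estimator, C(c, k) / C(n, k) per task from n trials with c successes, averaged over tasks:

    # pass^k: probability the agent succeeds on ALL k i.i.d. trials of a task.
    from math import comb

    def pass_hat_k(trials: list[tuple[int, int]], k: int) -> float:
        """trials: one (n_trials, n_successes) pair per task, n_trials >= k."""
        per_task = [comb(c, k) / comb(n, k) for n, c in trials]
        return sum(per_task) / len(per_task)

    # With 8 trials per task, pass^1 looks fine but pass^4 exposes inconsistency.
    trials = [(8, 6), (8, 8), (8, 3)]
    print(round(pass_hat_k(trials, 1), 3))  # 0.708
    print(round(pass_hat_k(trials, 4), 3))  # 0.405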
Model evals / reasoning
AgentBench
AgentBench evaluates LLMs as agents across 8 distinct interactive environments — including OS interaction, database querying, knowledge graph traversal, digital card games, lateral thinking puzzles, house-holding tasks, web browsing, and web shopping.
Top: AgentRL w/ Qwen2.5-32B-Instruct (70.4%)
Caveat: Community-submitted leaderboard — rows are self-reported and not independently verified. Check source links before drawing strong conclusions.
Open AgentBench leaderboard
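Aggregating across eight environments with very different score scales is its own problem. The AgentBench paper normalizes each environment's score by the average score of all evaluated models in that environment before averaging, so no single environment dominates. A sketch of that scheme, assuming the paper's normalization (a given leaderboard row may report a different aggregate):

    # Normalized cross-environment average: divide each environment score by
    # the across-model mean for that environment, then average the ratios.
    def overall_score(scores: dict[str, dict[str, float]], model: str) -> float:
        """scores[env][model] -> raw score; returns the normalized average."""
        ratios = []
        for env, by_model in scores.items():
            env_avg = sum(by_model.values()) / len(by_model)
            ratios.append(by_model[model] / env_avg)
        return sum(ratios) / len(ratios)

    scores = {
        "os": {"a": 40.0, "b": 20.0},   # model "a" is above average here
        "db": {"a": 10.0, "b": 30.0},   # and below average here
    }
    print(overall_score(scores, "a"))   # (40/30 + 10/20) / 2 ≈ 0.917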