τ-bench leaderboard
Benchmark page for τ-bench with standardized structure: about, leaderboard table, and FAQ.
Last updated: 2026-04-16
About this benchmark
τ-bench (TAU-bench) evaluates AI agents in realistic enterprise tool-use scenarios across retail and airline domains — testing multi-turn conversation, policy adherence, database interactions, and rule-following consistency over many trials.
It is among the agentic benchmarks most widely adopted by AI labs, cited in model cards from Anthropic, OpenAI, and Google. Unlike static QA benchmarks, τ-bench measures whether an agent behaves reliably and correctly across multiple independent runs, using the pass^k metric.
Because evaluation setup (prompt, tool schema, trial count) varies by submitter, self-reported scores across organizations are not always directly comparable — the Notes column captures key setup differences.
Methodology
- The pass^k metric measures the probability an agent succeeds on all k independent trials of the same task — penalizing inconsistency even when average accuracy is high.
- Evaluation uses a simulated user (another LLM) and checks final database state against the annotated goal state — no LLM judge for pass/fail decisions.
- The official leaderboard is hosted at taubench.com and maintained by Sierra Research. Some rows are self-reported in model cards; verify source before comparing across organizations.
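The pass^k idea above can be made concrete with the standard combinatorial estimator (analogous to the well-known pass@k estimator, but requiring success on all k sampled trials rather than at least one): for a task solved in c of n trials, pass^k is estimated as C(c, k) / C(n, k), and the benchmark score is the mean over tasks. A minimal sketch, assuming this estimator (function names here are illustrative, not from the τ-bench codebase):

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task from n i.i.d. trials with c successes.

    C(c, k) / C(n, k) is the probability that a random k-subset of the
    n trials contains only successes; its expectation is p^k, so the
    estimator is unbiased. (Hypothetical helper, for illustration.)
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)

# A task solved in 6 of 8 trials has 75% average accuracy, but its
# estimated pass^2 is only C(6,2)/C(8,2) = 15/28, about 0.536:
# inconsistency is penalized even when mean accuracy is high.
per_task = [pass_hat_k(8, c, 2) for c in (8, 6, 4)]
benchmark_score = sum(per_task) / len(per_task)  # mean over tasks
```

Note that pass^1 (as reported for several rows below) reduces to plain average accuracy, c / n.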
| Rank | System / Submission | Score | Organization | Notes | Source |
|---|---|---|---|---|---|
| 1 | Step-3.5-Flash | 88.2% | StepFun | Direct technical report; Step-3.5 architecture optimized for high-consistency tool use. | Source |
| 2 | GLM-4.7 | 87.4% | Z.ai | Official Z.ai Developer docs; introduces enhanced agentic policy compliance for enterprise retail/airline workflows. | Source |
| 3 | MiMo-V2-Flash | 80.3% | Xiaomi | Technical report; MoE model with 309B total/15B active params and hybrid attention for long-horizon agent tasks. | Source |
| 4 | GLM-4.7-Flash | 79.5% | Z.ai | Verified via Cerebras & Z.ai; SOTA performance for a lightweight flash-tier model in tool-use consistency. | Source |
| 5 | MiniMax M2 | 77.2% | MiniMax | Official repository; M2 specifically models 'thinking' content to handle complex multi-turn toolchains. | Source |
| 6 | Claude Opus 4.5 | 70.2% | Anthropic | Sierra Research evaluation; measured using the 'High' user simulator and GPT-5.2 judge protocol. | Source |
| 7 | GPT-5.2 | 69.9% | OpenAI | Sierra Research evaluation; results based on the standardized Sierra simulation harness. | Source |
| 8 | Qwen3.5-397B-A17B | 68.4% | Alibaba | Official Qwen blog; native multimodal agent capabilities with sparse MoE architecture (17B active params). | Source |
| 9 | Gemini 3 Flash | 67.8% | Google DeepMind | DeepMind technical report; evaluation includes performance across retail, airline, and telecom domains. | Source |
| 10 | Gemini 3 Pro | 65.8% | Google DeepMind | DeepMind technical report; frontier reasoning performance for enterprise customer service simulations. | Source |
| 11 | GLM-5 | 63.2% | Zhipu AI | τ-bench overall pass^1 (gpt-5.2 user sim, reasoning enabled); reported on taubench.com. | Source |
| 12 | Claude Sonnet 4.5 | 62.9% | Anthropic | τ-bench overall pass^1 (gpt-5.2 user sim, reasoning enabled); reported on taubench.com. | Source |
Related benchmarks
Compare this benchmark with related pages from the hub: