τ-bench leaderboard

Benchmark page for τ-bench with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-04-16

About this benchmark

τ-bench (TAU-bench) evaluates AI agents in realistic enterprise tool-use scenarios across retail and airline domains — testing multi-turn conversation, policy adherence, database interactions, and rule-following consistency over many trials.

It is one of the agentic benchmarks most widely adopted by AI labs, and it is cited in model cards from Anthropic, OpenAI, and Google. Unlike static QA benchmarks, τ-bench measures whether an agent behaves reliably and correctly across multiple independent runs, using the pass^k metric.

Because the evaluation setup (prompt, tool schema, trial count) varies by submitter, self-reported scores from different organizations are not always directly comparable; the Notes column captures the key setup differences.

Methodology

  • The pass^k metric measures the probability that an agent succeeds on all k independent trials of the same task, penalizing inconsistency even when average accuracy is high (a minimal sketch of the estimator follows this list).
  • Evaluation uses a simulated user (another LLM) and checks final database state against the annotated goal state — no LLM judge for pass/fail decisions.
  • The official leaderboard is hosted at taubench.com and maintained by Sierra Research. Some rows are self-reported in model cards; verify source before comparing across organizations.
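
To make the metric concrete, here is a minimal Python sketch of how per-task trial outcomes could be turned into a benchmark-level pass^k estimate, paired with a simple success check that compares the final database state to an annotated goal state. The function names, the dict-based state representation, and the use of the combinatorial estimator C(c, k) / C(n, k) are illustrative assumptions, not the official τ-bench harness code.

```python
from math import comb

def task_passed(final_db_state: dict, goal_db_state: dict) -> bool:
    """Hypothetical success check: a trial passes when the final database
    state matches the annotated goal state exactly (no LLM judge).
    The dict representation is an illustrative assumption, not the
    benchmark's actual storage format."""
    return final_db_state == goal_db_state

def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate pass^k for one task from n trials with c successes:
    the probability that k trials drawn without replacement all pass,
    i.e. C(c, k) / C(n, k)."""
    if num_successes < k:
        return 0.0
    return comb(num_successes, k) / comb(num_trials, k)

def benchmark_pass_hat_k(per_task_trials: list[list[bool]], k: int) -> float:
    """Average the per-task estimates; per_task_trials[i] holds the
    pass/fail outcome of every trial of task i."""
    estimates = [pass_hat_k(len(t), sum(t), k) for t in per_task_trials]
    return sum(estimates) / len(estimates)

# A task that passes 3 of 4 trials scores 0.75 at pass^1 but only 0.5 at
# pass^2, so inconsistency is penalized even when average accuracy is high.
print(benchmark_pass_hat_k([[True, True, True, False]], k=1))  # 0.75
print(benchmark_pass_hat_k([[True, True, True, False]], k=2))  # 0.5
```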

Links

τ-bench official leaderboard (taubench.com)

| Rank | System / Submission | Score | Organization | Notes |
|------|---------------------|-------|--------------|-------|
| 1 | Step-3.5-Flash | 88.2% | StepFun | Direct technical report; Step-3.5 architecture optimized for high-consistency tool use. |
| 2 | GLM-4.7 | 87.4% | Z.ai | Official Z.ai developer docs; introduces enhanced agentic policy compliance for enterprise retail/airline workflows. |
| 3 | MiMo-V2-Flash | 80.3% | Xiaomi | Technical report; MoE model with 309B total/15B active params and hybrid attention for long-horizon agent tasks. |
| 4 | GLM-4.7-Flash | 79.5% | Z.ai | Verified via Cerebras & Z.ai; SOTA performance for a lightweight flash-tier model in tool-use consistency. |
| 5 | MiniMax M2 | 77.2% | MiniMax | Official repository; M2 specifically models 'thinking' content to handle complex multi-turn toolchains. |
| 6 | Claude Opus 4.5 | 70.2% | Anthropic | Sierra Research evaluation; measured using the 'High' user simulator and GPT-5.2 judge protocol. |
| 7 | GPT-5.2 | 69.9% | OpenAI | Sierra Research evaluation; results based on the standardized Sierra simulation harness. |
| 8 | Qwen3.5-397B-A17B | 68.4% | Alibaba | Official Qwen blog; native multimodal agent capabilities with sparse MoE architecture (17B active params). |
| 9 | Gemini 3 Flash | 67.8% | Google DeepMind | DeepMind technical report; evaluation includes performance across retail, airline, and telecom domains. |
| 10 | Gemini 3 Pro | 65.8% | Google DeepMind | DeepMind technical report; frontier reasoning performance for enterprise customer service simulations. |
| 11 | GLM-5 | 63.2% | Zhipu AI | τ-bench overall pass^1 (gpt-5.2 user sim, reasoning enabled); reported on taubench.com. |
| 12 | Claude Sonnet 4.5 | 62.9% | Anthropic | τ-bench overall pass^1 (gpt-5.2 user sim, reasoning enabled); reported on taubench.com. |


Frequently asked questions

Which system is currently best on τ-bench?
Step-3.5-Flash currently leads with a tracked score of 88.2%. This page is model-focused, so rankings mostly reflect model capability under the reported evaluation harness. Based on our latest tracked results, last updated Apr 16, 2026.
What should I read into a τ-bench score?
τ-bench scores are most useful for within-benchmark ranking. Read the Notes column to understand each submission's setup context, and consult the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and others are team-reported. Check the source cited in each row's Notes to verify the evidence level before drawing strong conclusions.