τ-bench leaderboard
Benchmark page for τ-bench with standardized structure: about, leaderboard table, and FAQ.
Last updated: 2026-04-16
About this benchmark
τ-bench (TAU-bench) evaluates AI agents in realistic enterprise tool-use scenarios across retail and airline domains — testing multi-turn conversation, policy adherence, database interactions, and rule-following consistency over many trials.
It is among the agentic benchmarks most widely adopted by AI labs, cited in model cards from Anthropic, OpenAI, and Google. Unlike static QA benchmarks, τ-bench measures whether an agent behaves reliably and correctly across multiple independent runs, using the pass^k metric.
Because evaluation setup (prompt, tool schema, trial count) varies by submitter, self-reported scores across organizations are not always directly comparable — the Notes column captures key setup differences.
Methodology
- The pass^k metric measures the probability an agent succeeds on all k independent trials of the same task — penalizing inconsistency even when average accuracy is high.
- Evaluation uses a simulated user (another LLM) and checks final database state against the annotated goal state — no LLM judge for pass/fail decisions.
- The official leaderboard is hosted at taubench.com and maintained by Sierra Research. Some rows are self-reported in model cards; verify source before comparing across organizations.
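The pass^k idea above can be made concrete with the standard combinatorial estimator (analogous to the well-known pass@k estimator, but requiring success on all k sampled trials rather than at least one): for a task solved in c of n trials, pass^k is estimated as C(c, k) / C(n, k), and the benchmark score is the mean over tasks. A minimal sketch, assuming this estimator (function names here are illustrative, not from the τ-bench codebase):

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task from n i.i.d. trials with c successes.

    C(c, k) / C(n, k) is the probability that a random k-subset of the
    n trials contains only successes; its expectation is p^k, so the
    estimator is unbiased. (Hypothetical helper, for illustration.)
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)

# A task solved in 6 of 8 trials has 75% average accuracy, but its
# estimated pass^2 is only C(6,2)/C(8,2) = 15/28, about 0.536:
# inconsistency is penalized even when mean accuracy is high.
per_task = [pass_hat_k(8, c, 2) for c in (8, 6, 4)]
benchmark_score = sum(per_task) / len(per_task)  # mean over tasks
```

Note that pass^1 (as reported for several rows below) reduces to plain average accuracy, c / n.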
| Rank | System / Submission | Score | Organization | Notes | Source |
|---|---|---|---|---|---|
| 1 | Step-3.5-Flash | 88.2% | StepFun | Direct technical report; Step-3.5 architecture optimized for high-consistency tool use. | Source |
| 2 | GLM-4.7 | 87.4% | Z.ai | Official Z.ai Developer docs; introduces enhanced agentic policy compliance for enterprise retail/airline workflows. | Source |
| 3 | MiMo-V2-Flash | 80.3% | Xiaomi | Technical report; MoE model with 309B total/15B active params and hybrid attention for long-horizon agent tasks. | Source |
| 4 | GLM-4.7-Flash | 79.5% | Z.ai | Verified via Cerebras & Z.ai; SOTA performance for a lightweight flash-tier model in tool-use consistency. | Source |
| 5 | MiniMax M2 | 77.2% | MiniMax | Official repository; M2 specifically models 'thinking' content to handle complex multi-turn toolchains. | Source |
| 6 | Claude Opus 4.5 | 70.2% | Anthropic | Sierra Research evaluation; measured using the 'High' user simulator and GPT-5.2 judge protocol. | Source |
| 7 | GPT-5.2 | 69.9% | OpenAI | Sierra Research evaluation; results based on the standardized Sierra simulation harness. | Source |
| 8 | Qwen3.5-397B-A17B | 68.4% | Alibaba | Official Qwen blog; native multimodal agent capabilities with sparse MoE architecture (17B active params). | Source |
| 9 | Gemini 3 Flash | 67.8% | Google DeepMind | DeepMind technical report; evaluation includes performance across retail, airline, and telecom domains. | Source |
| 10 | Gemini 3 Pro | 65.8% | Google DeepMind | DeepMind technical report; frontier reasoning performance for enterprise customer service simulations. | Source |
| 11 | GLM-5 | 63.2% | Zhipu AI | τ-bench overall pass^1 (gpt-5.2 user sim, reasoning enabled); reported on taubench.com. | Source |
| 12 | Claude Sonnet 4.5 | 62.9% | Anthropic | τ-bench overall pass^1 (gpt-5.2 user sim, reasoning enabled); reported on taubench.com. | Source |
Related benchmarks
Compare this benchmark with related pages from the hub: