Leaderboard
| System / Submission | Notes | Score | Organization | Reported | Source |
|---|---|---|---|---|---|
| Step-3.5-Flash | Direct technical report; Step-3.5 architecture optimized for high-consistency tool use. | 88.2% | StepFun | | Source |
| GLM-4.7 | Official Z.ai Developer docs; introduces enhanced agentic policy compliance for enterprise retail/airline workflows. | 87.4% | Z.ai | | Source |
| MiMo-V2-Flash | Technical Report; MoE model with 309B total/15B active params and hybrid attention for long-horizon agent tasks. | 80.3% | Xiaomi | | Source |
| GLM-4.7-Flash | Verified via Cerebras & Z.ai; SOTA performance for a lightweight flash-tier model in tool-use consistency. | 79.5% | Z.ai | | Source |
| MiniMax M2 | Official repository; M2 specifically models 'thinking' content to handle complex multi-turn toolchains. | 77.2% | MiniMax | | Source |
| Claude Opus 4.5 | Sierra Research Evaluation; measured using the 'High' user simulator and GPT-5.2 judge protocol. | 70.2% | Anthropic | | Source |
| GPT-5.2 | Sierra Research Evaluation; results based on the standardized Sierra simulation harness. | 69.9% | OpenAI | | Source |
| Qwen3.5-397B-A17B | Official Qwen Blog; native multimodal agent capabilities with sparse MoE architecture (17B active params). | 68.4% | Alibaba | | Source |
| Gemini 3 Flash | DeepMind Technical Report; evaluation includes performance across retail, airline, and telecom domains. | 67.8% | Google DeepMind | | Source |
| Gemini 3 Pro | DeepMind Technical Report; frontier reasoning performance for enterprise customer service simulations. | 65.8% | Google DeepMind | | Source |
| GLM-5 | τ-bench overall pass^1 (gpt-5.2 user sim, reasoning enabled); reported on taubench.com. | 63.2% | Zhipu AI | | Source |
| Claude Sonnet 4.5 | τ-bench overall pass^1 (gpt-5.2 user sim, reasoning enabled); reported on taubench.com. | 62.9% | Anthropic | | Source |
About this benchmark
τ-bench evaluates conversational agents in realistic customer-service tasks where the agent must talk to a simulated user, call domain APIs, and follow a policy manual.
The original domains are retail and airline, making it especially relevant for enterprise agents that must update backend state correctly while staying consistent across long multi-turn conversations.
It is a reliability benchmark as much as a capability benchmark: agents can solve a task once but fail repeated trials because of nondeterminism or brittle policy adherence.
Compare rows carefully: prompt setup, tool schema, user simulator, and trial count can all change pass^k.
Example tasks
Three public tasks quoted from benchmark sources:
- "Your name is Raj Lee and your email, you have multiple email addressed, raj89@example.com, rajlee@example.com, lee42@example.com, raj.lee6137@example.com. You don't remember which email you used for placing the order. You are cautious, confident, pessimistic, sad. You want to cancel the order #W9933266 which you've just placed because you don't need the items." Citation: τ-bench retail dev tasks
- "Your name is Fatima Anderson and your zip code is 32100. You are relaxing, logical, shy, polite. For the #W2974929 that you've just placed, you realize that you've picked the wrong deck material, change it to 'bamboo' deck material." Citation: τ-bench retail dev tasks
- "Your name is Aarav Sanchez and your email is aarav.sanchez5467@example.com. You are patient, shy. Return the Portable Charger of your order. But before confirming, decide to return the Bookshelf and the Cycling Helmet as well. You wanna get website credit for the return." Citation: τ-bench retail dev tasks
Methodology
- Evaluation compares final database state to the annotated goal state, avoiding an LLM judge for pass/fail task completion.
- The key metric is pass^k: the probability that an agent succeeds in all k independent trials of the same task, which penalizes systems that are correct only intermittently.
- Reported rows may use different user simulators, model settings, tool schemas, and trial counts; source notes matter for direct comparison.
- We prefer official taubench.com rows and technical reports that specify simulator and pass metric.
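The scoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's own harness: the function names are hypothetical, plain dictionary equality stands in for the database-state comparison, and pass^k is estimated per task with the combinatorial form C(c, k) / C(n, k) for c successes in n trials, then averaged over tasks.

```python
from math import comb
from statistics import mean


def trial_passed(final_db: dict, goal_db: dict) -> bool:
    """Hypothetical check: a trial passes only if the final state equals the goal state."""
    return final_db == goal_db


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials drawn without replacement all succeeded."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and the number of trials")
    # math.comb returns 0 when k > successes, so a task with too few passes scores 0.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    return mean(pass_hat_k(c, trials, k) for c in successes_per_task)


# Illustrative numbers only: three tasks, four trials each, with 4, 3, and 0 passes.
illustrative_successes = [4, 3, 0]
for k in (1, 2, 4):
    print(k, round(benchmark_pass_hat_k(illustrative_successes, trials=4, k=k), 3))
```

With these made-up counts the estimate falls from about 0.583 at k=1 to 0.5 at k=2 and about 0.333 at k=4. The task that passes three times out of four contributes nothing once all four trials must succeed, which is why rows reported at different k or trial counts are not directly comparable.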