Leaderboard

System / Submission   Score   Organization      Reported   Source
Step-3.5-Flash        88.2%   StepFun                      Source
GLM-4.7               87.4%   Z.ai                         Source
MiMo-V2-Flash         80.3%   Xiaomi                       Source
GLM-4.7-Flash         79.5%   Z.ai                         Source
MiniMax M2            77.2%   MiniMax                      Source
Claude Opus 4.5       70.2%   Anthropic                    Source
GPT-5.2               69.9%   OpenAI                       Source
Qwen3.5-397B-A17B     68.4%   Alibaba                      Source
Gemini 3 Flash        67.8%   Google DeepMind              Source
Gemini 3 Pro          65.8%   Google DeepMind              Source
GLM-5                 63.2%   Zhipu AI                     Source
Claude Sonnet 4.5     62.9%   Anthropic                    Source

About this benchmark

τ-bench evaluates conversational agents in realistic customer-service tasks where the agent must talk to a simulated user, call domain APIs, and follow a policy manual.

The original domains are retail and airline, making it especially relevant for enterprise agents that must update backend state correctly while staying consistent across long multi-turn conversations.

It is a reliability benchmark as much as a capability benchmark: agents can solve a task once but fail repeated trials because of nondeterminism or brittle policy adherence.

Compare rows carefully: prompt setup, tool schema, user simulator, and trial count can all change pass^k.
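
To make the reliability point concrete, here is a minimal sketch of how pass^k can be estimated from repeated trials. It assumes each task was run n times and uses the combinatorial estimator C(c, k) / C(n, k) averaged over tasks, where c is the number of successful trials for a task. The function name and the sample data are illustrative assumptions, not part of any official harness.

```python
from math import comb


def estimate_pass_hat_k(successes_per_task, num_trials, k):
    """Estimate pass^k as the mean over tasks of C(c, k) / C(n, k).

    successes_per_task: successful-trial counts, one entry per task (hypothetical data).
    num_trials: n, the number of trials run for every task.
    k: how many independent trials must ALL succeed.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    if any(c < 0 or c > num_trials for c in successes_per_task):
        raise ValueError("success counts must lie in [0, num_trials]")

    # Number of ways to choose k trials out of n; same for every task.
    ways_total = comb(num_trials, k)
    # comb(c, k) is 0 when c < k, so tasks with too few successes contribute nothing.
    return sum(comb(c, k) for c in successes_per_task) / (ways_total * len(successes_per_task))


# Four hypothetical tasks, each run 4 times.
successes = [4, 3, 0, 4]
pass_1 = estimate_pass_hat_k(successes, num_trials=4, k=1)  # 0.6875
pass_2 = estimate_pass_hat_k(successes, num_trials=4, k=2)  # 0.625
pass_4 = estimate_pass_hat_k(successes, num_trials=4, k=4)  # 0.5
```

In this made-up data, the task that succeeds 3 times out of 4 counts fully toward pass^1 but not at all toward pass^4, which is why the estimate drops as k grows.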

Example tasks

Three public tasks quoted from benchmark sources:

  • "Your name is Raj Lee and your email, you have multiple email addressed, raj89@example.com, rajlee@example.com, lee42@example.com, raj.lee6137@example.com. You don't remember which email you used for placing the order. You are cautious, confident, pessimistic, sad. You want to cancel the order #W9933266 which you've just placed because you don't need the items." Citation: τ-bench retail dev tasks
  • "Your name is Fatima Anderson and your zip code is 32100. You are relaxing, logical, shy, polite. For the #W2974929 that you've just placed, you realize that you've picked the wrong deck material, change it to 'bamboo' deck material." Citation: τ-bench retail dev tasks
  • "Your name is Aarav Sanchez and your email is aarav.sanchez5467@example.com. You are patient, shy. Return the Portable Charger of your order. But before confirming, decide to return the Bookshelf and the Cycling Helmet as well. You wanna get website credit for the return." Citation: τ-bench retail dev tasks

Methodology

  • Evaluation compares final database state to the annotated goal state, avoiding an LLM judge for pass/fail task completion.
  • The key metric is pass^k: the probability that an agent succeeds in all k independent trials of the same task, which penalizes systems that are correct only intermittently.
  • Reported rows may use different user simulators, model settings, tool schemas, and trial counts; source notes matter for direct comparison.
  • We prefer official taubench.com rows and technical reports that specify simulator and pass metric.
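
The state-comparison check in the first bullet can be sketched as follows. This is a simplified illustration that assumes the database is a JSON-serializable dictionary. The schema, field names, and helper names are hypothetical, and a real harness may apply additional checks beyond this one.

```python
import hashlib
import json


def state_digest(db):
    """Hash a canonical JSON form of the database so key order does not matter."""
    canonical = json.dumps(db, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def task_passed(final_db, goal_db):
    """Pass only if the final state matches the annotated goal state exactly."""
    return state_digest(final_db) == state_digest(goal_db)


# Hypothetical goal state for an order-cancellation task.
goal_db = {"orders": {"#W9933266": {"status": "cancelled", "items": ["item-1"]}}}

# Same content with keys in a different order: should pass.
final_ok = {"orders": {"#W9933266": {"items": ["item-1"], "status": "cancelled"}}}

# The agent talked to the user but never updated the backend: should fail.
final_bad = {"orders": {"#W9933266": {"status": "pending", "items": ["item-1"]}}}

ok_result = task_passed(final_ok, goal_db)    # True
bad_result = task_passed(final_bad, goal_db)  # False
```

Because the comparison is deterministic, a pass or fail here does not depend on a judge model's interpretation of the conversation.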

Frequently asked questions

Which system is currently best on τ-bench?
Step-3.5-Flash currently leads with a tracked score of 88.2%. This page is model-focused, so rankings mostly reflect model capability under the reported harness. This answer is based on our latest tracked results, last updated Apr 16, 2026.
What should I read into a τ-bench score?
τ-bench scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.
Can I compare every row directly?
Use caution. Rows can vary by evaluator, harness, attempt budget, tool access, task filtering, or verification level. Source links and notes are part of the score, not decoration.