tau-bench Leaderboard 2026: Latest Tool Use Agent Scores

Leaderboard

System / Submission	Score	Organization	Reported	Source
Step-3.5-Flash	88.2%	StepFun	Jan 2026	Source
GLM-4.7	87.4%	Z.ai	Dec 2025	Source
MiMo-V2-Flash	80.3%	Xiaomi	Jan 2026	Source
GLM-4.7-Flash	79.5%	Z.ai	Dec 2025	Source
MiniMax M2	77.2%	MiniMax	Oct 2025	Source
Claude Opus 4.5	70.2%	Anthropic	Nov 2025	Source
GPT-5.2	69.9%	OpenAI	Dec 2025	Source
Qwen3.5-397B-A17B	68.4%	Alibaba	Feb 2026	Source
Gemini 3 Flash	67.8%	Google DeepMind	Dec 2025	Source
Gemini 3 Pro	65.8%	Google DeepMind	Nov 2025	Source
GLM-5	63.2%	Z.ai	Feb 2026	Source
Claude Sonnet 4.5	62.9%	Anthropic	Sep 2025	Source

About this benchmark

τ-bench evaluates conversational agents in realistic customer-service tasks where the agent must talk to a simulated user, call domain APIs, and follow a policy manual.

The original domains are retail and airline, making it especially relevant for enterprise agents that must update backend state correctly while staying consistent across long multi-turn conversations.

It is a reliability benchmark as much as a capability benchmark: an agent can solve a task once yet fail on repeated trials of the same task because of nondeterminism or brittle policy adherence.
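
To make the reliability point concrete, here is a minimal sketch of how quickly a per-trial success rate erodes when every one of k trials must pass. It assumes trials are independent and share one success probability p, which is a simplification; the function name all_k_succeed and the 80% figure are illustrative and do not come from any row on this page.

```python
def all_k_succeed(p: float, k: int) -> float:
    """Chance that k independent trials all succeed, each with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return p ** k


# Illustrative only: an agent that passes a task 80% of the time per trial.
for k in (1, 2, 4, 8):
    print(f"k={k}: {all_k_succeed(0.8, k):.3f}")
```

Under these assumptions, an agent that looks strong on a single attempt (80%) clears eight consecutive attempts only about 17% of the time.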

Compare rows carefully: prompt setup, tool schema, user simulator, and trial count can all change pass^k.

Example tasks

Three public tasks quoted from benchmark sources:

"Your name is Raj Lee and your email, you have multiple email addressed, raj89@example.com, rajlee@example.com, lee42@example.com, raj.lee6137@example.com. You don't remember which email you used for placing the order. You are cautious, confident, pessimistic, sad. You want to cancel the order #W9933266 which you've just placed because you don't need the items." Citation: τ-bench retail dev tasks
"Your name is Fatima Anderson and your zip code is 32100. You are relaxing, logical, shy, polite. For the #W2974929 that you've just placed, you realize that you've picked the wrong deck material, change it to 'bamboo' deck material." Citation: τ-bench retail dev tasks
"Your name is Aarav Sanchez and your email is aarav.sanchez5467@example.com. You are patient, shy. Return the Portable Charger of your order. But before confirming, decide to return the Bookshelf and the Cycling Helmet as well. You wanna get website credit for the return." Citation: τ-bench retail dev tasks

Methodology

Evaluation compares final database state to the annotated goal state, avoiding an LLM judge for pass/fail task completion.
The key metric is pass^k: the probability that an agent succeeds on all k independent trials of the same task, averaged over tasks, which penalizes systems that are correct only intermittently.
Reported rows may use different user simulators, model settings, tool schemas, and trial counts; source notes matter for direct comparison.
We prefer official taubench.com rows and technical reports that specify the user simulator and the pass metric used.
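
The two mechanics above, state comparison and pass^k, can be sketched as follows. This is a hedged illustration, not the harness behind any row on this page: it assumes database states can be compared as plain dictionaries, and it estimates pass^k per task as C(c, k) / C(n, k), where c of n recorded trials succeeded, then averages over tasks. All function names, task IDs, and the order ID are hypothetical.

```python
from math import comb


def trial_succeeded(final_state: dict, goal_state: dict) -> bool:
    """A trial passes only if the final database equals the annotated goal."""
    return final_state == goal_state


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Chance that k trials drawn without replacement from `trials` all passed."""
    if not 1 <= k <= trials:
        raise ValueError("k must satisfy 1 <= k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    # comb(successes, k) is 0 when successes < k, so the estimate is 0 there.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(outcomes: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks."""
    if not outcomes:
        raise ValueError("outcomes must contain at least one task")
    per_task = [pass_hat_k(sum(runs), len(runs), k) for runs in outcomes.values()]
    return sum(per_task) / len(per_task)


# Made-up states: the goal says the order is cancelled, the agent left it pending.
goal = {"order_0001": {"status": "cancelled"}}
final = {"order_0001": {"status": "pending"}}
print("trial passed:", trial_succeeded(final, goal))

# Made-up outcomes: four trials each for two hypothetical tasks.
outcomes = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, False],
}
for k in (1, 2, 4):
    print(f"pass^{k} = {benchmark_pass_hat_k(outcomes, k):.3f}")
```

In this toy data, task_b passes half of its trials, so it contributes 0.5 at k=1 but nothing at k=4. That is the sense in which the metric penalizes intermittent success, and it is also why rows reported with different trial counts or values of k are not directly comparable.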

Related benchmarks

Compare this benchmark with related pages from the hub:

SWE-bench Verified, GAIA

Frequently asked questions

Which system is currently best on τ-bench?

Step-3.5-Flash currently leads with a tracked score of 88.2%. This page is model-focused, so rankings mostly reflect model capability under each row's reported harness. Tracked results were last updated Apr 16, 2026.

What should I read into a τ-bench score?

τ-bench scores are most useful for ranking systems within this benchmark. Follow each row's Source link to understand the setup behind the score, and read the Methodology section before making procurement or architecture decisions.

Are these independently verified?

Not always. Some rows are independently benchmarked and some are team-reported. Use each row's source link and any accompanying notes to verify the evidence level before drawing strong conclusions.

Can I compare every row directly?

Use caution. Rows can vary by evaluator, harness, attempt budget, tool access, task filtering, or verification level. Source links and notes are part of the score, not decoration.