Tool use benchmark - Self-hosted

ToolSandbox

Stateful tool-use benchmark with interdependencies between tool calls. Agents must manage tool state across multi-step tasks: calling one tool changes what another tool returns.

BENCHMARK
Benchmark by Apple
Benchmark type:
Self-hosted benchmark
Benchmark domain:
Tool use
Task count:
~200
Evaluation method:
State verification
Top model score
~52%
GPT-4o
OpenAI
Human score
N/A

About this benchmark

ToolSandbox is a stateful, conversational, interactive evaluation benchmark for LLM tool-use capabilities, published in August 2024. Unlike prior benchmarks that evaluate stateless RESTful API calls or single-turn prompts, ToolSandbox introduces stateful tool execution, in which tools maintain persistent state across calls and carry implicit state dependencies between one another. It also includes a built-in user simulator for on-policy conversational evaluation and a dynamic evaluation strategy that checks both intermediate and final milestones over arbitrary trajectories.
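To make the idea of implicit state dependencies concrete, here is a minimal illustrative sketch (the names and tools are hypothetical, not ToolSandbox's actual API): one tool mutates a shared world state, and a second tool silently depends on that side effect.

```python
# Hypothetical sketch of stateful tools with an implicit dependency.
# These names are illustrative only; ToolSandbox's real tools differ.

class WorldState:
    """Shared, persistent state that all tools read and mutate."""
    def __init__(self):
        self.cellular_on = False
        self.sent_messages = []

def set_cellular(state: WorldState, on: bool) -> str:
    # Tool 1: toggles connectivity in the shared world state.
    state.cellular_on = on
    return f"cellular {'enabled' if on else 'disabled'}"

def send_message(state: WorldState, recipient: str, body: str) -> str:
    # Tool 2: implicitly depends on Tool 1's side effect. A stateless
    # benchmark would never surface this failure mode.
    if not state.cellular_on:
        return "error: no cellular service"
    state.sent_messages.append((recipient, body))
    return "message sent"

state = WorldState()
print(send_message(state, "Alice", "hi"))  # fails: dependency not yet met
print(set_cellular(state, True))
print(send_message(state, "Alice", "hi"))  # now succeeds
```

An agent that calls `send_message` before `set_cellular` gets an error it must recover from, which is exactly the kind of stateful reasoning the benchmark probes.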

Evaluation uses milestone-based assessment, tracking whether the agent achieves required intermediate states and final goals throughout a multi-turn conversation. The benchmark defines several challenging task categories including State Dependency (tools whose outputs depend on prior tool calls), Canonicalization (handling input variations), and Insufficient Information (recognizing when a task cannot be completed). Results show a significant performance gap between open-source and proprietary models, with even the most capable SOTA LLMs struggling on complex task categories.
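The milestone idea can be sketched as predicates over state snapshots: the agent's trajectory passes if every required milestone is satisfied, in order, at some point during the conversation. This is a simplified illustration under assumed data shapes, not the real ToolSandbox evaluator.

```python
# Hypothetical milestone checker (illustrative, not ToolSandbox's
# actual evaluation code). A trajectory is a list of world-state
# snapshots; milestones are ordered predicates over a snapshot.

def check_milestones(trajectory, milestones):
    """Return True if the snapshots satisfy every milestone in order."""
    idx = 0
    for snapshot in trajectory:
        # A single snapshot may satisfy several consecutive milestones.
        while idx < len(milestones) and milestones[idx](snapshot):
            idx += 1
    return idx == len(milestones)

milestones = [
    lambda s: s.get("cellular_on"),          # required intermediate state
    lambda s: "Alice" in s.get("sent", []),  # final goal
]

trajectory = [
    {"cellular_on": False, "sent": []},
    {"cellular_on": True, "sent": []},
    {"cellular_on": True, "sent": ["Alice"]},
]
print(check_milestones(trajectory, milestones))  # True
```

Because milestones are checked against states rather than against a fixed sequence of tool calls, any trajectory that reaches the required states counts as success, which is what allows the benchmark to score arbitrary agent behavior.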

ToolSandbox provides important insights that static, single-turn benchmarks miss, particularly around stateful reasoning and conversational tool use. Its on-policy evaluation approach, where the user simulator reacts dynamically to agent behavior rather than following a fixed script, makes it more realistic than off-policy dialog trajectory evaluation. The evaluation framework is publicly released on GitHub.

Where this benchmark fits

Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the tool use view. You can also jump straight to this benchmark in the master registry list.