API-Bank
A benchmark of 73 API tools, evaluated at three difficulty levels that test API retrieval, planning, and API call correctness. One of the earliest systematic tool-use benchmarks for LLMs.
- Benchmark type: Public benchmark
- Benchmark domain: Tool use
- Task count: 314
- Evaluation method: Exact match
- Top model score: ~75%
- Human score: N/A
About this benchmark
API-Bank is a comprehensive benchmark for evaluating tool-augmented LLMs, introduced by Minghao Li et al. at Alibaba DAMO ConvAI and published in April 2023. It provides a runnable evaluation system consisting of 73 API tools across diverse domains, with 314 tool-use dialogues containing 753 API calls for testing. The benchmark assesses three core capabilities: API call planning (deciding which APIs to use), API retrieval (finding the right API from a large pool), and API invocation (generating correct calls with proper arguments). A larger training set of 1,888 tool-use dialogues drawn from 2,138 APIs spanning 1,000 distinct domains is also provided.
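To make the retrieval and invocation capabilities concrete, here is a minimal sketch in Python. It is an illustration only: the toy API pool, the `ApiSpec` type, the word-overlap retrieval heuristic, and the call record format are all assumptions made for this example and are not taken from the API-Bank code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiSpec:
    """Hypothetical description of one tool in the pool."""
    name: str
    description: str
    required_args: tuple


# Hypothetical three-entry pool standing in for the benchmark's 73 tools.
API_POOL = [
    ApiSpec("get_weather", "look up the weather forecast for a city", ("city",)),
    ApiSpec("add_reminder", "create a reminder at a given time", ("text", "time")),
    ApiSpec("query_stock", "look up the stock price for a ticker", ("ticker",)),
]


def retrieve(query: str, pool=API_POOL) -> ApiSpec:
    """Retrieval: pick the API whose description shares the most words with the query."""
    words = set(query.lower().split())
    return max(pool, key=lambda api: len(words & set(api.description.split())))


def invoke(api: ApiSpec, **kwargs) -> dict:
    """Invocation: build a call record, rejecting calls with missing required arguments."""
    missing = [arg for arg in api.required_args if arg not in kwargs]
    if missing:
        raise ValueError(f"missing arguments for {api.name}: {missing}")
    return {"api": api.name, "args": kwargs}


api = retrieve("what is the weather forecast in Paris")
call = invoke(api, city="Paris")
print(call)  # {'api': 'get_weather', 'args': {'city': 'Paris'}}
```

A real system would replace the word-overlap heuristic with the model's own retrieval behavior; the point here is only that picking the right API and filling in its arguments are separate steps that can fail independently.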
Evaluation scores the correctness of model outputs at each of the three capability levels. Experimental results show that GPT-3.5 improves over GPT-3 in tool utilization, while GPT-4 excels particularly in planning. The authors also train Lynx, a tool-augmented LLM initialized from Alpaca, which surpasses Alpaca by more than 26 percentage points and approaches the effectiveness of GPT-3.5. Error analysis reveals that accurate argument generation and multi-step planning remain key challenges.
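As a rough illustration of exact-match scoring over API calls, the sketch below compares predicted calls with reference calls after normalizing argument order. The call format, the `normalize` helper, and the sample data are hypothetical; the official API-Bank scorer may normalize or compare calls differently.

```python
def normalize(call: dict) -> tuple:
    """Canonical form: API name plus arguments sorted by key (assumed format)."""
    return (call["api"], tuple(sorted(call["args"].items())))


def exact_match_accuracy(predicted: list, reference: list) -> float:
    """Fraction of positions where the predicted call equals the reference call."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predicted, reference))
    return hits / len(reference)


# Hypothetical sample data: two correct calls and one with a wrong argument value.
reference = [
    {"api": "get_weather", "args": {"city": "Paris"}},
    {"api": "add_reminder", "args": {"text": "call mom", "time": "18:00"}},
    {"api": "query_stock", "args": {"ticker": "BABA"}},
]
predicted = [
    {"api": "get_weather", "args": {"city": "Paris"}},
    {"api": "add_reminder", "args": {"time": "18:00", "text": "call mom"}},
    {"api": "query_stock", "args": {"ticker": "AAPL"}},
]

accuracy = exact_match_accuracy(predicted, reference)
print(f"{accuracy:.3f}")  # 0.667
```

The second predicted call lists its arguments in a different order and still counts as a match, while the third fails on a single argument value. That all-or-nothing behavior is consistent with the observation above that argument generation is a key source of errors.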
API-Bank is significant as one of the earliest comprehensive benchmarks specifically targeting LLM tool-use capabilities, providing both evaluation and training infrastructure. It is released under the MIT license through the DAMO-ConvAI repository.