Canonical benchmark page
AgentBench leaderboard
Benchmark page for AgentBench with standardized structure: about, leaderboard table, and FAQ.
Last updated: 2026-04-16
About this benchmark
AgentBench evaluates LLMs as agents across eight distinct interactive environments: operating system (OS) interaction, database querying, knowledge graph traversal, a digital card game, lateral thinking puzzles, house-holding tasks, web browsing, and web shopping.
Published at ICLR 2024 and developed by Tsinghua University, AgentBench was one of the first benchmarks to systematically expose the performance gap between top commercial LLMs and open-source competitors on real agentic tasks requiring multi-turn reasoning and decision-making.
The FC (Function Calling) variant of the leaderboard focuses specifically on structured tool use and function-calling ability — the most relevant dimension for teams building tool-augmented pipelines.
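For illustration, the contrast between free-form action generation and the structured invocations the FC variant scores might look like the sketch below. The tool name and argument schema are invented for this example and are not taken from the AgentBench harness.

```python
# Hypothetical illustration only; tool names and schemas are assumptions,
# not part of the AgentBench FC harness.

# Free-form action: the agent emits natural language that the evaluator
# must parse before it can be executed.
free_form_action = "Look up how many rows the 'orders' table contains."

# Structured function call: the agent emits the tool name and typed
# arguments directly, so the harness can validate and execute the call
# without parsing heuristics.
structured_call = {
    "name": "execute_sql",  # hypothetical tool name
    "arguments": {
        "query": "SELECT COUNT(*) FROM orders;",
    },
}
```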
Community-submitted leaderboard — rows are self-reported and not independently verified. Check source links before drawing strong conclusions.
Methodology
- Each environment has its own task set and automated evaluator. Scores reflect an overall average across environments unless a specific environment subset is specified.
- The FC leaderboard tracks function-calling performance specifically: models are scored on the accuracy of structured tool invocations rather than free-form action generation (a scoring sketch follows this list).
- Results are community-submitted via the public Google Sheets tracker; rows are not independently verified by the authors unless explicitly noted.
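As a rough illustration of the aggregation described above, the sketch below averages per-environment scores into a single overall number and credits a predicted tool call only on an exact match against a reference. The environment names, equal weighting, and exact-match rule are assumptions for illustration, not the official AgentBench evaluator.

```python
# Minimal scoring sketch, assuming equal weighting across environments and
# exact-match credit for tool calls; neither assumption is taken from the
# official AgentBench evaluator.

def overall_score(per_env_scores: dict[str, float]) -> float:
    """Average per-environment scores into a single leaderboard number."""
    return sum(per_env_scores.values()) / len(per_env_scores)

def tool_call_correct(predicted: dict, reference: dict) -> bool:
    """Credit a structured invocation only if the tool name and arguments
    both match the reference exactly (one possible FC scoring rule)."""
    return (
        predicted.get("name") == reference.get("name")
        and predicted.get("arguments") == reference.get("arguments")
    )

# Hypothetical per-environment results (fraction of tasks solved).
scores = {"os": 0.62, "db": 0.71, "kg": 0.55, "alfworld": 0.68, "webshop": 0.59}
print(f"overall: {overall_score(scores):.1%}")  # -> overall: 63.0%
```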
Links
AgentBench
Model scope
| Rank | System / Submission | Score | Organization | Notes | Source |
|---|---|---|---|---|---|
| 1 | AgentRL w/ Qwen2.5-32B-Instruct | 70.4% | Tsinghua University | RL-trained on AgentBench FC environments; outperforms GPT-5 and Claude Sonnet 4 per paper. | Source |
| 2 | AgentRL w/ Qwen2.5-14B-Instruct | 67.7% | Tsinghua University | RL-trained 14B model; evaluated on ALFWorld, DB, KG, OS, and Webshop environments. | Source |
| 3 | AgentRL w/ GLM-4-9B-0414 | 65.0% | Tsinghua University | RL-trained on the GLM-4-9B backbone; demonstrates cross-architecture generalization of AgentRL. | Source |
| 4 | AgentRL w/ Qwen2.5-7B-Instruct | 62.0% | Tsinghua University | RL-trained 7B model; competitive with much larger commercial models on AgentBench FC. | Source |
| 5 | AgentRL w/ Qwen2.5-3B-Instruct | 60.0% | Tsinghua University | Smallest AgentRL model; shows the RL training benefit extends to the 3B parameter scale. | Source |
| 6 | Claude Sonnet 4.5 | 58.9% | Anthropic | Community leaderboard submission; evaluated on the AgentBench FC function-calling task suite. | Source |
| 7 | Claude Sonnet 4.5 Thinking | 58.3% | Anthropic | Extended thinking variant; marginal drop vs. base Sonnet 4.5 on FC tasks. | Source |
| 8 | Claude Sonnet 4 Thinking | 58.2% | Anthropic | Community leaderboard submission; evaluated on the AgentBench FC function-calling task suite. | Source |
| 9 | Claude Sonnet 4 | 57.4% | Anthropic | Community leaderboard submission; evaluated on the AgentBench FC function-calling task suite. | Source |
| 10 | Claude Sonnet 3.7 | 53.2% | Anthropic | Community leaderboard submission; earlier Anthropic generation included for progress reference. | Source |
Related benchmarks
Compare this benchmark with related pages from the hub: