Leaderboard
| System / Submission | Score | Organization | Reported | Source |
|---|---|---|---|---|
| AgentRL w/ Qwen2.5-32B-Instruct: RL-trained on AgentBench FC environments; outperforms GPT-5 and Claude Sonnet 4 per paper. | 70.4% | Tsinghua University | | Source |
| AgentRL w/ Qwen2.5-14B-Instruct: RL-trained 14B model; evaluated on ALFWorld, DB, KG, OS, and Webshop environments. | 67.7% | Tsinghua University | | Source |
| AgentRL w/ GLM-4-9B-0414: RL-trained on GLM-4-9B backbone; demonstrates cross-architecture generalization of AgentRL. | 65.0% | Tsinghua University | | Source |
| AgentRL w/ Qwen2.5-7B-Instruct: RL-trained 7B model; competitive with much larger commercial models on AgentBench FC. | 62.0% | Tsinghua University | | Source |
| AgentRL w/ Qwen2.5-3B-Instruct: Smallest AgentRL model; shows RL training benefit extends to 3B parameter scale. | 60.0% | Tsinghua University | | Source |
| Claude Sonnet 4.5: Community leaderboard submission; evaluated on AgentBench FC function-calling task suite. | 58.9% | Anthropic | | Source |
| Claude Sonnet 4.5 Thinking: Extended thinking variant; marginal drop vs base Sonnet 4.5 on FC tasks. | 58.3% | Anthropic | | Source |
| Claude Sonnet 4 Thinking: Community leaderboard submission; evaluated on AgentBench FC function-calling task suite. | 58.2% | Anthropic | | Source |
| Claude Sonnet 4: Community leaderboard submission; evaluated on AgentBench FC function-calling task suite. | 57.4% | Anthropic | | Source |
| Claude Sonnet 3.7: Community leaderboard submission; earlier Anthropic generation included for progress reference. | 53.2% | Anthropic | | Source |
About this benchmark
AgentBench evaluates LLMs as agents across 8 interactive environments: operating-system tasks, database querying, knowledge graphs, a digital card game, lateral-thinking puzzles, ALFWorld (household tasks), WebShop (web shopping), and Mind2Web-style browsing.
This page tracks the Function Calling (FC) variant wherever a row cites it, because structured tool invocation is closest to modern agent deployment.
It is useful as a broad agentic skill check, but aggregate scores hide large differences between environment types; a system can be strong on database or tool calling and weak on web or OS tasks.
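To make the point about aggregates concrete, here is a minimal sketch. The system names and per-environment rates are invented for illustration and are not benchmark results, and the unweighted mean is an assumption that may differ from how the leaderboard actually aggregates.

```python
from statistics import mean

# Hypothetical per-environment success rates for two imaginary systems.
# These numbers are made up for illustration; they are NOT benchmark results.
scores = {
    "system_a": {"os": 0.30, "db": 0.90, "kg": 0.60},
    "system_b": {"os": 0.60, "db": 0.60, "kg": 0.60},
}


def summarize(per_env):
    """Return the unweighted mean and the max-min spread across environments."""
    values = list(per_env.values())
    return {"mean": mean(values), "spread": max(values) - min(values)}


summary = {name: summarize(envs) for name, envs in scores.items()}
for name, stats in summary.items():
    print(f"{name}: mean={stats['mean']:.2f} spread={stats['spread']:.2f}")
```

Both imaginary systems land on the same mean, yet one of them is much weaker on OS tasks, which is why a per-environment breakdown is worth reading alongside the headline score.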
Community-submitted leaderboard; rows are not always independently verified or directly comparable across harness revisions.
Example tasks
Three public tasks quoted from benchmark sources:
- "How many hidden files are in /home? (not including subdirectories)" Citation: AgentBench OS task data
- "I would like to implement the following function: entering the "calc" command will enable the calculation of an expression. The expression can include addition, subtraction, multiplication, division, and parentheses. If the absolute error between the calculated answer and the expected answer is less than 1e-5, it will be considered correct. For example, I can calculate the result by entering "calc 2 * (9 / 3)", and the output will be 6." Citation: AgentBench OS task data
- "Stock logs are shown in /usr/stock.log. The last two columns are stock index and count respectively. Tell me how many times Bob sold a stock." Citation: AgentBench OS task data
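As a rough illustration of what the second task asks for, the sketch below evaluates an arithmetic expression and applies the quoted 1e-5 tolerance. It covers only the evaluation logic in Python, not the shell command itself; the function names `calc` and `is_correct` are hypothetical, and support for a leading minus sign is an assumption beyond the quoted task text.

```python
import ast
import operator

# Binary operators the quoted task allows: +, -, *, /.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calc(expression: str) -> float:
    """Evaluate an expression made of numbers, + - * / and parentheses."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        # Assumption: a leading sign such as "-3" is accepted as well.
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, (ast.USub, ast.UAdd)):
            value = _eval(node.operand)
            return -value if isinstance(node.op, ast.USub) else value
        # Anything else (names, calls, attributes) is rejected.
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


def is_correct(answer: float, expected: float, tol: float = 1e-5) -> bool:
    """Apply the absolute-error check described in the task."""
    return abs(answer - expected) < tol
```

Walking the parsed syntax tree instead of calling `eval` keeps the evaluator limited to the operators the task names.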
Methodology
- Scores aggregate task completion across benchmark environments; FC rows emphasize structured function calls over free-form action text.
- Original AgentBench was published at ICLR 2024; later leaderboard rows may use revised harnesses, containerized environments, or FC subsets.
- Community leaderboard rows are not always independently verified, so we keep source links and notes close to the score.
- Use AgentBench with narrower benchmarks such as GAIA, τ-bench, and SWE-bench when diagnosing which capability is driving an aggregate result.
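The difference between structured function calls and free-form action text can be sketched as follows. The tool name `run_shell`, the JSON shape, and the `Act: bash` text layout are all assumptions made for illustration; they are not taken from the AgentBench harness.

```python
import json
import re

# Hypothetical outputs from an agent; neither format is copied from the harness.
fc_output = '{"name": "run_shell", "arguments": {"command": "ls -a /home"}}'
text_output = "Think: I should list the files.\nAct: bash\nls -a /home"


def parse_function_call(raw: str) -> dict:
    """Parse a structured call; a malformed payload fails loudly."""
    call = json.loads(raw)
    if (
        not isinstance(call, dict)
        or not isinstance(call.get("name"), str)
        or not isinstance(call.get("arguments"), dict)
    ):
        raise ValueError("malformed function call")
    return call


def parse_free_form(raw: str):
    """Recover a command from free text; returns None if the pattern is absent."""
    match = re.search(r"Act:\s*bash\s*\n(.+)", raw, flags=re.DOTALL)
    return match.group(1).strip() if match else None


print(parse_function_call(fc_output)["arguments"]["command"])
print(parse_free_form(text_output))
```

The structured path either yields a well-formed call or raises, while the free-form path depends on a pattern that has to anticipate the model's phrasing, which is one plausible reason FC rows are not directly comparable with free-text rows.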