
AgentBench leaderboard

Benchmark page for AgentBench with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-04-16

About this benchmark

AgentBench evaluates LLMs as agents across 8 distinct interactive environments: OS interaction, database querying, knowledge graph traversal, digital card games, lateral thinking puzzles, house-holding tasks, web browsing, and web shopping.

Published at ICLR 2024 and developed by Tsinghua University, AgentBench was one of the first benchmarks to systematically expose the performance gap between top commercial LLMs and open-source competitors on real agentic tasks requiring multi-turn reasoning and decision-making.

The FC (Function Calling) variant of the leaderboard focuses specifically on structured tool use and function-calling ability — the most relevant dimension for teams building tool-augmented pipelines.
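
To make "structured tool use" concrete, here is a minimal, hypothetical sketch of the kind of check an FC-style evaluator might run. The tool schema, field names, and exact-match comparison are illustrative assumptions for this page, not AgentBench's actual harness or gold data.

```python
import json

# Hypothetical expected call for one FC task (illustrative only; not
# AgentBench's real schema or gold data).
expected = {"name": "query_database", "arguments": {"table": "orders", "limit": 10}}

def score_fc_response(raw_model_output: str, expected_call: dict) -> bool:
    """Return True if the model emitted a well-formed call matching the gold call.

    A structured-tool-use check: the output must parse as JSON and match the
    expected function name and arguments exactly, rather than being graded
    as free-form action text.
    """
    try:
        call = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return False  # malformed output counts as a miss
    return (
        call.get("name") == expected_call["name"]
        and call.get("arguments") == expected_call["arguments"]
    )

# A correct structured invocation scores 1; prose like "I would query the
# orders table" scores 0 under this kind of check.
print(score_fc_response(
    '{"name": "query_database", "arguments": {"table": "orders", "limit": 10}}',
    expected,
))  # True
```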

Community-submitted leaderboard — rows are self-reported and not independently verified. Check source links before drawing strong conclusions.

Methodology

  • Each environment has its own task set and automated evaluator. Scores reflect an overall average across environments unless a specific environment subset is specified (see the aggregation sketch after this list).
  • The FC leaderboard tracks function-calling performance specifically — models are evaluated on structured tool invocation accuracy rather than free-form action generation.
  • Results are community-submitted via the public Google Sheets tracker; rows are not independently verified by the authors unless explicitly noted.
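
As a rough illustration of the aggregation described above, the sketch below averages per-environment scores into a single overall number. The environment names, the score values, and the equal weighting are assumptions for illustration, not the official scoring code.

```python
# Hypothetical per-environment accuracies (values are illustrative).
env_scores = {
    "os": 0.62,
    "db": 0.71,
    "kg": 0.55,
    "alfworld": 0.66,
    "webshop": 0.59,
}

# Overall score: unweighted mean across environments, reported as a percentage.
overall = 100 * sum(env_scores.values()) / len(env_scores)
print(f"Overall: {overall:.1f}%")  # Overall: 62.6%
```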

Links

AgentBench

Rank | System / Submission | Score | Organization | Notes | Source
1 | AgentRL w/ Qwen2.5-32B-Instruct | 70.4% | Tsinghua University | RL-trained on AgentBench FC environments; outperforms GPT-5 and Claude Sonnet 4 per paper. | Source
2 | AgentRL w/ Qwen2.5-14B-Instruct | 67.7% | Tsinghua University | RL-trained 14B model; evaluated on ALFWorld, DB, KG, OS, and Webshop environments. | Source
3 | AgentRL w/ GLM-4-9B-0414 | 65.0% | Tsinghua University | RL-trained on GLM-4-9B backbone; demonstrates cross-architecture generalization of AgentRL. | Source
4 | AgentRL w/ Qwen2.5-7B-Instruct | 62.0% | Tsinghua University | RL-trained 7B model; competitive with much larger commercial models on AgentBench FC. | Source
5 | AgentRL w/ Qwen2.5-3B-Instruct | 60.0% | Tsinghua University | Smallest AgentRL model; shows the RL training benefit extends to 3B parameter scale. | Source
6 | Claude Sonnet 4.5 | 58.9% | Anthropic | Community leaderboard submission; evaluated on AgentBench FC function-calling task suite. | Source
7 | Claude Sonnet 4.5 Thinking | 58.3% | Anthropic | Extended thinking variant; marginal drop vs base Sonnet 4.5 on FC tasks. | Source
8 | Claude Sonnet 4 Thinking | 58.2% | Anthropic | Community leaderboard submission; evaluated on AgentBench FC function-calling task suite. | Source
9 | Claude Sonnet 4 | 57.4% | Anthropic | Community leaderboard submission; evaluated on AgentBench FC function-calling task suite. | Source
10 | Claude Sonnet 3.7 | 53.2% | Anthropic | Community leaderboard submission; earlier Anthropic generation included for progress reference. | Source


Frequently asked questions

Which system is currently best on AgentBench?
AgentRL w/ Qwen2.5-32B-Instruct currently leads with a tracked score of 70.4%. This page is model-focused, so rankings mostly reflect model capability under the reported harness. Based on our latest tracked results, last updated Apr 16, 2026.
What should I read into an AgentBench score?
AgentBench scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and consult the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.