
AgentBench leaderboard

Benchmark page for AgentBench with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-04-16

About this benchmark

AgentBench evaluates LLMs as agents across 8 distinct interactive environments: OS interaction, database querying, knowledge graph traversal, digital card games, lateral thinking puzzles, house-holding tasks, web browsing, and web shopping.

Published at ICLR 2024 and developed by Tsinghua University, AgentBench was one of the first benchmarks to systematically expose the performance gap between top commercial LLMs and open-source competitors on real agentic tasks requiring multi-turn reasoning and decision-making.

The FC (Function Calling) variant of the leaderboard focuses specifically on structured tool use and function-calling ability — the most relevant dimension for teams building tool-augmented pipelines.
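
To make "structured tool use" concrete, here is a minimal, hypothetical sketch of the kind of check an FC-style evaluator might run. The tool schema, field names, and exact-match comparison are illustrative assumptions for this page, not AgentBench's actual harness or gold data.

```python
import json

# Hypothetical expected call for one FC task (illustrative only; not
# AgentBench's real schema or gold data).
expected = {"name": "query_database", "arguments": {"table": "orders", "limit": 10}}

def score_fc_response(raw_model_output: str, expected_call: dict) -> bool:
    """Return True if the model emitted a well-formed call matching the gold call.

    A structured-tool-use check: the output must parse as JSON and match the
    expected function name and arguments exactly, rather than being graded
    as free-form action text.
    """
    try:
        call = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return False  # malformed output counts as a miss
    return (
        call.get("name") == expected_call["name"]
        and call.get("arguments") == expected_call["arguments"]
    )

# A correct structured invocation scores 1; prose like "I would query the
# orders table" scores 0 under this kind of check.
print(score_fc_response(
    '{"name": "query_database", "arguments": {"table": "orders", "limit": 10}}',
    expected,
))  # True
```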

Community-submitted leaderboard — rows are self-reported and not independently verified. Check source links before drawing strong conclusions.

Methodology

  • Each environment has its own task set and automated evaluator. Scores reflect an overall average across environments unless a specific environment subset is specified (see the aggregation sketch after this list).
  • The FC leaderboard tracks function-calling performance specifically — models are evaluated on structured tool invocation accuracy rather than free-form action generation.
  • Results are community-submitted via the public Google Sheets tracker; rows are not independently verified by the authors unless explicitly noted.
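
As a rough illustration of the aggregation described above, the sketch below averages per-environment scores into a single overall number. The environment names, the score values, and the equal weighting are assumptions for illustration, not the official scoring code.

```python
# Hypothetical per-environment accuracies (values are illustrative).
env_scores = {
    "os": 0.62,
    "db": 0.71,
    "kg": 0.55,
    "alfworld": 0.66,
    "webshop": 0.59,
}

# Overall score: unweighted mean across environments, reported as a percentage.
overall = 100 * sum(env_scores.values()) / len(env_scores)
print(f"Overall: {overall:.1f}%")  # Overall: 62.6%
```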

Links

AgentBench

Rank | System / Submission | Score | Organization | Notes | Source
1 | AgentRL w/ Qwen2.5-32B-Instruct | 70.4% | Tsinghua University | RL-trained on AgentBench FC environments; outperforms GPT-5 and Claude Sonnet 4 per paper. | Source
2 | AgentRL w/ Qwen2.5-14B-Instruct | 67.7% | Tsinghua University | RL-trained 14B model; evaluated on ALFWorld, DB, KG, OS, and Webshop environments. | Source
3 | AgentRL w/ GLM-4-9B-0414 | 65.0% | Tsinghua University | RL-trained on GLM-4-9B backbone; demonstrates cross-architecture generalization of AgentRL. | Source
4 | AgentRL w/ Qwen2.5-7B-Instruct | 62.0% | Tsinghua University | RL-trained 7B model; competitive with much larger commercial models on AgentBench FC. | Source
5 | AgentRL w/ Qwen2.5-3B-Instruct | 60.0% | Tsinghua University | Smallest AgentRL model; shows the RL training benefit extends to 3B parameter scale. | Source
6 | Claude Sonnet 4.5 | 58.9% | Anthropic | Community leaderboard submission; evaluated on AgentBench FC function-calling task suite. | Source
7 | Claude Sonnet 4.5 Thinking | 58.3% | Anthropic | Extended thinking variant; marginal drop vs base Sonnet 4.5 on FC tasks. | Source
8 | Claude Sonnet 4 Thinking | 58.2% | Anthropic | Community leaderboard submission; evaluated on AgentBench FC function-calling task suite. | Source
9 | Claude Sonnet 4 | 57.4% | Anthropic | Community leaderboard submission; evaluated on AgentBench FC function-calling task suite. | Source
10 | Claude Sonnet 3.7 | 53.2% | Anthropic | Community leaderboard submission; earlier Anthropic generation included for progress reference. | Source


Frequently asked questions

Which system is currently best on AgentBench?
AgentRL w/ Qwen2.5-32B-Instruct currently leads with a tracked score of 70.4%. This page is model-focused, so rankings mostly reflect model capability under the reported harness. Based on our latest tracked results, last updated Apr 16, 2026.
What should I read into an AgentBench score?
AgentBench scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and consult the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.