AgentBench Leaderboard 2026: Latest LLM Agent Scores

Leaderboard

Model scope

System / Submission	Score	Organization	Reported	Source
AgentRL w/ Qwen2.5-32B-Instruct	70.4%	Tsinghua University	Oct 2025	Source
AgentRL w/ Qwen2.5-14B-Instruct	67.7%	Tsinghua University	Oct 2025	Source
AgentRL w/ GLM-4-9B-0414	65.0%	Tsinghua University	Oct 2025	Source
AgentRL w/ Qwen2.5-7B-Instruct	62.0%	Tsinghua University	Oct 2025	Source
AgentRL w/ Qwen2.5-3B-Instruct	60.0%	Tsinghua University	Oct 2025	Source
Claude Sonnet 4.5	58.9%	Anthropic	Sep 2025	Source
Claude Sonnet 4.5 Thinking	58.3%	Anthropic	Sep 2025	Source
Claude Sonnet 4 Thinking	58.2%	Anthropic	May 2025	Source
Claude Sonnet 4	57.4%	Anthropic	May 2025	Source
Claude Sonnet 3.7	53.2%	Anthropic	Feb 2025	Source

About this benchmark

AgentBench evaluates LLMs as agents across 8 interactive environments: operating-system tasks, database querying, knowledge-graph reasoning, a digital card game, lateral-thinking puzzles, house-holding (ALFWorld), web shopping (WebShop), and web browsing (Mind2Web).

This page tracks the Function Calling (FC) variant wherever a row cites it, because structured tool invocation is closest to modern agent deployment.

It is useful as a broad agentic skill check, but aggregate scores hide large differences between environment types; a system can be strong on database or tool calling and weak on web or OS tasks.
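To make the averaging effect concrete, here is a minimal sketch with invented per-environment numbers. None of the values come from a leaderboard row, and the unweighted mean is an assumption for illustration; the actual AgentBench aggregate may be weighted differently.

```python
from statistics import mean

# Hypothetical per-environment success rates in percent.
# These numbers are invented for illustration, not taken from the leaderboard.
system_a = {"os": 40.0, "database": 80.0, "web_shopping": 60.0}
system_b = {"os": 60.0, "database": 60.0, "web_shopping": 60.0}


def aggregate(scores: dict[str, float]) -> float:
    """Unweighted mean across environments (an assumption, not the official formula)."""
    return mean(list(scores.values()))


def spread(scores: dict[str, float]) -> float:
    """Gap between the best and worst environment for one system."""
    return max(scores.values()) - min(scores.values())


for name, scores in [("A", system_a), ("B", system_b)]:
    print(f"System {name}: aggregate={aggregate(scores):.1f}, spread={spread(scores):.1f}")
```

Both hypothetical systems land on the same aggregate, yet one has a 40-point gap between its best and worst environment. Per-environment breakdowns, where a source provides them, are the place to look before choosing a system for a specific workload.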

This is a community-submitted leaderboard; rows are not always independently verified or directly comparable across harness revisions.

Example tasks

Three public tasks quoted from benchmark sources:

"How many hidden files are in /home? (not including subdirectories)" Citation: AgentBench OS task data
"I would like to implement the following function: entering the "calc" command will enable the calculation of an expression. The expression can include addition, subtraction, multiplication, division, and parentheses. If the absolute error between the calculated answer and the expected answer is less than 1e-5, it will be considered correct. For example, I can calculate the result by entering "calc 2 * (9 / 3)", and the output will be 6." Citation: AgentBench OS task data
"Stock logs are shown in /usr/stock.log. The last two columns are stock index and count respectively. Tell me how many times Bob sold a stock." Citation: AgentBench OS task data
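The second task states its acceptance rule explicitly: an answer counts as correct when its absolute error against the expected value is below 1e-5. A minimal sketch of that rule follows. The function name `within_tolerance` is hypothetical, and this is not the benchmark's actual checker.

```python
def within_tolerance(actual: float, expected: float, tol: float = 1e-5) -> bool:
    """Return True when |actual - expected| < tol, the rule quoted in the task."""
    return abs(actual - expected) < tol


# The task's own example: "calc 2 * (9 / 3)" should output 6.
print(within_tolerance(2 * (9 / 3), 6))  # True

# An answer that is off by roughly 2e-5 falls outside the tolerance.
print(within_tolerance(6.00002, 6))  # False
```

A tolerance-based check like this accepts floating-point results that differ from the expected value only by rounding, which matters for expressions involving division.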

Methodology

Scores aggregate task completion across benchmark environments; FC rows emphasize structured function calls over free-form action text.
Original AgentBench was published at ICLR 2024; later leaderboard rows may use revised harnesses, containerized environments, or FC subsets.
Community leaderboard rows are not always independently verified, so we keep source links and notes close to the score.
Use AgentBench alongside narrower benchmarks such as GAIA, τ-bench, and SWE-bench when diagnosing which capability is driving an aggregate result.
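The distinction between structured function calls and free-form action text can be sketched as follows. The tool name `run_shell`, the JSON shape, and the `Action:` text convention are all hypothetical, chosen only to illustrate the contrast; they are not the AgentBench harness format.

```python
import json
import re
from typing import Optional

# Hypothetical structured output: the model emits a machine-readable call.
fc_output = '{"name": "run_shell", "arguments": {"command": "ls -A /home"}}'

# Hypothetical free-form output: the action is embedded in prose.
text_output = "I should list the directory.\nAction: run_shell(ls -A /home)"


def parse_function_call(raw: str) -> tuple[str, dict]:
    """Parse a structured call; malformed JSON raises instead of being guessed at."""
    call = json.loads(raw)
    return call["name"], call["arguments"]


def parse_free_form(raw: str) -> Optional[tuple[str, dict]]:
    """Extract an action from prose with a regex; returns None when no action is found."""
    match = re.search(r"Action:\s*(\w+)\((.*)\)", raw)
    if match is None:
        return None
    return match.group(1), {"command": match.group(2)}


print(parse_function_call(fc_output))
print(parse_free_form(text_output))
```

Both parsers recover the same call here, but the free-form path depends on the model following a text convention exactly, while the structured path fails loudly on malformed output. That difference is one reason scores from FC rows and free-form rows may not be directly comparable.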

Related benchmarks

Compare this benchmark with related pages from the hub:

τ-bench, GAIA, SWE-bench Verified


Frequently asked questions

Which system is currently best on AgentBench?

AgentRL w/ Qwen2.5-32B-Instruct currently leads with a tracked score of 70.4%. This page is model-focused, so rankings mostly reflect model capability under the reported harness. This answer is based on our latest tracked results, last updated Apr 16, 2026.

What should I read into an AgentBench score?

AgentBench scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.

Are these independently verified?

Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.

Can I compare every row directly?

Use caution. Rows can vary by evaluator, harness, attempt budget, tool access, task filtering, or verification level. Source links and notes are part of the score, not decoration.