Leaderboard

System / Submission              Score  Organization         Reported  Source
AgentRL w/ Qwen2.5-32B-Instruct  70.4%  Tsinghua University            Source
AgentRL w/ Qwen2.5-14B-Instruct  67.7%  Tsinghua University            Source
AgentRL w/ GLM-4-9B-0414         65.0%  Tsinghua University            Source
AgentRL w/ Qwen2.5-7B-Instruct   62.0%  Tsinghua University            Source
AgentRL w/ Qwen2.5-3B-Instruct   60.0%  Tsinghua University            Source
Claude Sonnet 4.5                58.9%  Anthropic                      Source
Claude Sonnet 4.5 Thinking       58.3%  Anthropic                      Source
Claude Sonnet 4 Thinking         58.2%  Anthropic                      Source
Claude Sonnet 4                  57.4%  Anthropic                      Source
Claude Sonnet 3.7                53.2%  Anthropic                      Source

About this benchmark

AgentBench evaluates LLMs as agents across 8 interactive environments: operating-system tasks, database querying, knowledge graphs, a digital card game, lateral-thinking puzzles, ALFWorld (household tasks), WebShop (web shopping), and Mind2Web-style web browsing.

When a row cites it, this page tracks the Function Calling (FC) variant, because structured tool invocation is closest to how agents are deployed today.

It is useful as a broad agentic skill check, but aggregate scores hide large differences between environment types; a system can be strong on database or tool calling and weak on web or OS tasks.
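To make the point about aggregates concrete, here is a minimal sketch in Python. The per-environment numbers are invented for illustration and are not real results, and the unweighted mean is an assumed aggregation rule rather than the benchmark's official scoring.

```python
from statistics import mean

# Hypothetical per-environment success rates (fractions). NOT real results.
system_a = {"os": 0.30, "db": 0.90, "kg": 0.80, "webshop": 0.40}
system_b = {"os": 0.60, "db": 0.60, "kg": 0.60, "webshop": 0.60}


def aggregate(scores):
    """Unweighted mean over environments (an assumed aggregation rule)."""
    return mean(scores.values())


def weakest(scores):
    """Return the (environment, score) pair with the lowest score."""
    return min(scores.items(), key=lambda kv: kv[1])


for name, scores in [("A", system_a), ("B", system_b)]:
    env, low = weakest(scores)
    print(f"system {name}: aggregate={aggregate(scores):.2f}, weakest={env} ({low:.2f})")
```

Both hypothetical systems land on the same aggregate, yet system A is far weaker on OS tasks, which is the kind of gap a single headline number can hide.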

Community-submitted leaderboard; rows are not always independently verified or directly comparable across harness revisions.

Example tasks

Three public tasks quoted from benchmark sources:

  • "How many hidden files are in /home? (not including subdirectories)" Citation: AgentBench OS task data
  • "I would like to implement the following function: entering the "calc" command will enable the calculation of an expression. The expression can include addition, subtraction, multiplication, division, and parentheses. If the absolute error between the calculated answer and the expected answer is less than 1e-5, it will be considered correct. For example, I can calculate the result by entering "calc 2 * (9 / 3)", and the output will be 6." Citation: AgentBench OS task data
  • "Stock logs are shown in /usr/stock.log. The last two columns are stock index and count respectively. Tell me how many times Bob sold a stock." Citation: AgentBench OS task data
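As a rough illustration of what solving and checking such tasks involves, the sketch below answers the first task in Python and applies the 1e-5 absolute-error rule quoted in the second. The function names are hypothetical, and counting only regular files (not hidden directories) is an assumption about how the task is judged; the benchmark's own checker may differ.

```python
import tempfile
from pathlib import Path


def count_hidden_files(directory):
    """Count regular files directly inside `directory` whose names start with a dot.

    Assumption: hidden directories do not count, and subdirectories are not searched.
    """
    return sum(
        1
        for entry in Path(directory).iterdir()
        if entry.name.startswith(".") and entry.is_file()
    )


def within_tolerance(actual, expected, tol=1e-5):
    """Apply the absolute-error rule quoted in the "calc" task."""
    return abs(actual - expected) < tol


# Demo on a throwaway directory instead of a real /home.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".bashrc").write_text("")              # hidden file: counted
    (root / "notes.txt").write_text("")            # visible file: ignored
    (root / ".cache").mkdir()                      # hidden directory: ignored
    (root / ".cache" / ".nested").write_text("")   # in a subdirectory: ignored
    print(count_hidden_files(root))

print(within_tolerance(2 * (9 / 3), 6))
```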

Methodology

  • Scores aggregate task completion across benchmark environments; FC rows emphasize structured function calls over free-form action text.
  • The original AgentBench paper was published at ICLR 2024; later leaderboard rows may use revised harnesses, containerized environments, or FC subsets.
  • Community leaderboard rows are not always independently verified, so we keep source links and notes close to the score.
  • Use AgentBench with narrower benchmarks such as GAIA, τ-bench, and SWE-bench when diagnosing which capability is driving an aggregate result.
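The difference between structured function calls and free-form action text can be sketched as follows. This is a minimal Python illustration under stated assumptions: the tool names, the JSON shape, and the "Action:/Input:" text format are all hypothetical and are not taken from the AgentBench harness.

```python
import json
import re

# Hypothetical tool registry: tool name -> required argument names.
TOOLS = {"bash": {"command"}, "run_sql": {"query"}}


def parse_function_call(payload):
    """Parse a structured call and validate it against the registry.

    Expects a JSON string like {"name": ..., "arguments": {...}} (assumed shape).
    """
    call = json.loads(payload)
    name, args = call["name"], call["arguments"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    missing = TOOLS[name] - set(args)
    if missing:
        raise ValueError(f"missing arguments for {name}: {sorted(missing)}")
    return name, args


def parse_free_form(text):
    """Extract an action from free-form text, or return None if none is found.

    Assumes the model writes 'Action: <tool>' followed by 'Input: <text>'.
    """
    match = re.search(r"Action:\s*(\w+)\s*\nInput:\s*(.+)", text)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


print(parse_function_call('{"name": "bash", "arguments": {"command": "ls -a /home"}}'))
print(parse_free_form("I will list the files.\nAction: bash\nInput: ls -a /home"))
```

The structured path fails loudly on an unknown tool or a missing argument, while the free-form path depends on the model matching a text pattern, which is one reason scores from the two styles may not be directly comparable.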

Frequently asked questions

Which system is currently best on AgentBench?
AgentRL w/ Qwen2.5-32B-Instruct currently leads with a tracked score of 70.4%. This page is model-focused, so rankings mostly reflect model capability under the reported harness. Tracked results were last updated Apr 16, 2026.
What should I read into an AgentBench score?
AgentBench scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.
Can I compare every row directly?
Use caution. Rows can vary by evaluator, harness, attempt budget, tool access, task filtering, or verification level. Source links and notes are part of the score, not decoration.
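One way to act on this is to compare rows only when their setup matches. The Python sketch below assumes each row carries setup metadata; the field names, system names, and values are hypothetical placeholders and do not describe any row on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # All fields and values used here are hypothetical placeholders.
    system: str
    score: float
    harness: str
    attempt_budget: int
    task_subset: str


def setup_key(row):
    """Collect the setup fields that must match for a direct comparison."""
    return (row.harness, row.attempt_budget, row.task_subset)


def directly_comparable(a, b):
    """Treat two rows as comparable only when their setups are identical."""
    return setup_key(a) == setup_key(b)


rows = [
    Row("system-x", 61.0, "fc-v2", 1, "full"),
    Row("system-y", 58.0, "fc-v2", 1, "full"),
    Row("system-z", 66.0, "original", 3, "subset"),
]

print(directly_comparable(rows[0], rows[1]))
print(directly_comparable(rows[0], rows[2]))
```

In this made-up data, system-z has the highest score but ran under a different harness, attempt budget, and task subset, so ranking it against the other two would not be a like-for-like comparison.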