AgentBoard

Analytical benchmark across 9 diverse agent tasks spanning embodied AI, games, the web, and tool use. Provides fine-grained progress rates beyond binary success/fail, measuring how far through a task an agent gets even when it fails.

BENCHMARK
Benchmark by:
HKUST
Benchmark type:
Self-hosted benchmark
Benchmark domain:
General reasoning
Task count:
~1,000
Evaluation method:
Progress rate
Top model score:
~58% progress (GPT-4, OpenAI)
Human score:
N/A

About this benchmark

AgentBoard is a comprehensive evaluation framework for multi-turn LLM agents introduced by Chang Ma et al. in January 2024, accepted at LLMAgents @ ICLR 2024. It encompasses 9 diverse tasks across 4 categories: Embodied AI (AlfWorld, ScienceWorld, BabyAI), Game (Jericho, PDDL), Web (WebShop, WebArena), and Tool (Tool-Query, Tool-Operation). All tasks feature partially-observable environments requiring multi-round interaction, with evaluation spanning up to 15,000 inference rounds per model.
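The multi-round interaction pattern these environments share can be sketched as a simple loop. The following is a minimal, hypothetical sketch, assuming a gym-style environment with reset() and step() and a text-completion callable; run_episode, format_prompt, env.goal, and the return shape of step() are illustrative assumptions, not AgentBoard's actual API.

# Minimal sketch of a multi-round, partially observed evaluation loop.
# `env`, `llm`, and all names here are hypothetical stand-ins, not
# AgentBoard's real interface.

def run_episode(env, llm, max_turns: int = 30) -> bool:
    """Run one episode with act-only prompting; return True on success."""
    observation = env.reset()   # a partial observation, not the full state
    history = []                # (observation, action) pairs seen so far
    for _ in range(max_turns):
        # Act-only prompting: the model sees the goal, the interaction
        # history, and the latest observation, and must emit the next
        # action directly, with no intermediate reasoning step.
        prompt = format_prompt(env.goal, history, observation)
        action = llm(prompt).strip()
        history.append((observation, action))
        observation, done, success = env.step(action)
        if done:
            return success
    return False                # ran out of turns without finishing

def format_prompt(goal: str, history: list, observation: str) -> str:
    turns = "\n".join(f"Obs: {o}\nAct: {a}" for o, a in history)
    return f"Goal: {goal}\n{turns}\nObs: {observation}\nAct:"

Because the environment is only partially observable, the agent must accumulate what it learns across turns; a single-shot prompt over the initial observation cannot solve these tasks.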

AgentBoard introduces a fine-grained progress rate metric that captures incremental advances beyond binary success/failure. The evaluation toolkit provides multi-faceted analysis, including grounding accuracy, performance breakdown on hard versus easy examples, long-range interaction quality, sub-skill proficiency, and trajectory visualization via Weights & Biases integration. The framework supports 12 models out of the box, including GPT-4, Claude 2, and open-source models such as DeepSeek-67B and Llama2-70B, all evaluated using a simple reflex agent with act-only prompting.
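To make the metric concrete, here is a minimal sketch of a subgoal-based progress rate, written in the spirit of the paper's definition rather than copied from its implementation; the State type, the subgoal predicates, and all names below are illustrative assumptions.

# Hedged sketch of a subgoal-based progress rate, in the spirit of
# AgentBoard's metric but not its actual implementation. Assumes each
# task defines a list of subgoal predicates over environment states.

from typing import Callable, Sequence

State = dict  # hypothetical: whatever the environment exposes as state

def progress_rate(trajectory: Sequence[State],
                  subgoals: Sequence[Callable[[State], bool]]) -> float:
    """Best fraction of subgoals satisfied at any point in the trajectory.

    Unlike binary success (only 1.0 counts), this credits an agent that
    completes, say, 3 of 5 subgoals before failing with a score of 0.6.
    """
    best = 0
    for state in trajectory:
        matched = sum(1 for check in subgoals if check(state))
        best = max(best, matched)  # progress is the running maximum
    return best / len(subgoals) if subgoals else 0.0

# Toy example: "pick up the mug, then heat it" has two subgoals.
subgoals = [lambda s: s.get("holding") == "mug",
            lambda s: s.get("mug_hot", False)]
trajectory = [{"holding": None}, {"holding": "mug"}]  # never heated the mug
print(progress_rate(trajectory, subgoals))  # 0.5 rather than a bare failure

A binary success metric would score this trajectory 0; the progress rate preserves the partial credit that AgentBoard's fine-grained analyses build on.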

AgentBoard stands out for its emphasis on analytical rather than purely aggregate evaluation, providing researchers with interpretable insights into agent behavior and failure modes. The partially-observable environment design tests genuine world-modeling ability rather than pattern matching. The code is released under Apache 2.0 and the dataset under GPL 2.0.

Where this benchmark fits

Use this page when you need benchmark-specific context. For side-by-side comparison, go back to the full registry or open the general reasoning view. You can also jump straight to this benchmark in the master registry list.