GAIA
466 tasks across three difficulty levels that require tool use, multimodal reasoning, and web browsing. With 587 submissions on HuggingFace, it is the most submitted-to AI agent benchmark in existence.
- Benchmark type: Public benchmark
- Benchmark domain: General reasoning
- Task count: 466
- Evaluation method: Exact match
- Top model score: ~75%
- Human score: 92%
About this benchmark
GAIA (General AI Assistants), introduced in November 2023, is a benchmark designed to evaluate general-purpose AI assistants on real-world questions requiring reasoning, multi-modality handling, web browsing, and tool-use proficiency. It contains 466 questions with definitive answers, each conceptually simple for humans but challenging for AI systems. Rather than following the trend of testing expert-level professional knowledge, the benchmark deliberately targets tasks where average humans excel. Of the 466 questions, 300 have their answers withheld to power a public leaderboard.
Evaluation is straightforward: each question has a single correct answer, enabling exact-match scoring. The benchmark reveals a stark performance gap: human respondents achieve 92% accuracy while GPT-4 with plugins scores only 15%. This disparity is notable because it contrasts with recent trends where LLMs outperform humans on specialized tasks in domains like law and chemistry. Questions span three difficulty levels, with Level 1 being the simplest and Level 3 requiring the most complex multi-step reasoning and tool use.
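Because every question resolves to a single reference answer, scoring reduces to a normalized string comparison. The sketch below illustrates that idea in Python; the specific normalization rules (lowercasing, trimming, stripping thousands separators) are illustrative assumptions, not the official GAIA scorer.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop thousands separators in numbers."""
    text = answer.strip().lower()
    # "1,234" -> "1234" so formatting differences do not fail a correct answer
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    return re.sub(r"\s+", " ", text)


def exact_match(prediction: str, reference: str) -> bool:
    """A prediction counts only if it matches the reference after normalization."""
    return normalize(prediction) == normalize(reference)


def accuracy(predictions: list[str], references: list[str]) -> float:
    correct = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return correct / len(references)


# Example: two of three answers match after normalization -> ~0.667
print(accuracy(["Paris", "3,200", "blue"], ["paris", "3200", "red"]))
```

Exact match keeps the leaderboard objective: there is no partial credit and no judge model, so a score difference reflects answer correctness rather than grading variance.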
GAIA is significant because it reframes AGI evaluation around robustness on everyday tasks rather than superhuman performance on narrow domains. Its philosophy posits that matching average human performance on such practical questions is a meaningful milestone for artificial general intelligence. The benchmark and leaderboard are publicly available on HuggingFace.
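For orientation, a minimal sketch of pulling the public portion of the benchmark with the Hugging Face `datasets` library is shown below. The dataset id (`gaia-benchmark/GAIA`) and config name (`2023_all`) reflect the public HuggingFace listing but should be verified; the dataset is gated, so accepting its terms on the dataset page and authenticating (e.g. `huggingface-cli login`) are assumed.

```python
from datasets import load_dataset

# Load the public GAIA release; the validation split carries reference answers,
# while test answers are withheld to power the leaderboard.
gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")

print(gaia)                               # splits and their sizes
print(gaia["validation"].column_names)    # available fields per task
```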
Where this benchmark fits
Use this page when you need benchmark-specific context. For side-by-side comparison, go back to the full registry or open the general reasoning view. You can also jump straight to this benchmark in the master registry list.