GAIA
466 tasks across three difficulty levels that require tool use, multimodal reasoning, and web browsing. With 587 submissions on HuggingFace, it is the most submitted-to AI agent benchmark in existence.
- Benchmark type: Public benchmark
- Benchmark domain: General reasoning
- Task count: 466
- Evaluation method: Exact match
- Top model score: ~75%
- Human score: 92%
About this benchmark
GAIA (General AI Assistants), introduced in November 2023, is a benchmark designed to evaluate general-purpose AI assistants on real-world questions requiring reasoning, multi-modality handling, web browsing, and tool-use proficiency. It contains 466 questions with definitive answers, each conceptually simple for humans but challenging for AI systems. Rather than following the trend of testing expert-level professional knowledge, the benchmark deliberately targets tasks where average humans excel. Of the 466 questions, 300 have their answers withheld to power a public leaderboard.
Evaluation is straightforward: each question has a single correct answer, enabling exact-match scoring. The benchmark reveals a stark performance gap: human respondents achieve 92% accuracy while GPT-4 with plugins scores only 15%. This disparity is notable because it contrasts with recent trends where LLMs outperform humans on specialized tasks in domains like law and chemistry. Questions span three difficulty levels, with Level 1 being the simplest and Level 3 requiring the most complex multi-step reasoning and tool use.
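Because every question resolves to a single reference answer, scoring reduces to a normalized string comparison. The sketch below illustrates that idea in Python; the specific normalization rules (lowercasing, trimming, stripping thousands separators) are illustrative assumptions, not the official GAIA scorer.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop thousands separators in numbers."""
    text = answer.strip().lower()
    # "1,234" -> "1234" so formatting differences do not fail a correct answer
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    return re.sub(r"\s+", " ", text)


def exact_match(prediction: str, reference: str) -> bool:
    """A prediction counts only if it matches the reference after normalization."""
    return normalize(prediction) == normalize(reference)


def accuracy(predictions: list[str], references: list[str]) -> float:
    correct = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return correct / len(references)


# Example: two of three answers match after normalization -> ~0.667
print(accuracy(["Paris", "3,200", "blue"], ["paris", "3200", "red"]))
```

Exact match keeps the leaderboard objective: there is no partial credit and no judge model, so a score difference reflects answer correctness rather than grading variance.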
GAIA is significant because it reframes AGI evaluation around robustness on everyday tasks rather than superhuman performance on narrow domains. Its philosophy posits that matching average human performance on such practical questions is a meaningful milestone for artificial general intelligence. The benchmark and leaderboard are publicly available on HuggingFace.
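For orientation, a minimal sketch of pulling the public portion of the benchmark with the Hugging Face `datasets` library is shown below. The dataset id (`gaia-benchmark/GAIA`) and config name (`2023_all`) reflect the public HuggingFace listing but should be verified; the dataset is gated, so accepting its terms on the dataset page and authenticating (e.g. `huggingface-cli login`) are assumed.

```python
from datasets import load_dataset

# Load the public GAIA release; the validation split carries reference answers,
# while test answers are withheld to power the leaderboard.
gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")

print(gaia)                               # splits and their sizes
print(gaia["validation"].column_names)    # available fields per task
```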
Where this benchmark fits
Use this page when you need benchmark-specific context. For side-by-side comparison, go back to the full registry or open the general reasoning view. You can also jump straight to this benchmark in the master registry list.