About this benchmark

GAIA evaluates general AI assistants on 466 real-world questions requiring reasoning, web browsing, multimodal understanding, file handling, and tool use.

Questions are designed to be conceptually simple for humans with unambiguous final answers; 300 answers are withheld to power the official leaderboard.

Top GAIA entries are usually orchestrated agents or multi-model ensembles rather than raw model calls. Rankings therefore reward tool selection, search depth, verification, and answer formatting, and a score usually cannot be attributed to a single base model.

Example tasks

Three public tasks quoted from benchmark sources:

  • "What was the actual enrollment count of the clinical trial on H. pylori in acne vulgaris patients from Jan-May 2018 as listed on the NIH website?" Citation: GAIA paper, Figure 1
  • "If this whole pint is made up of ice cream, how many percent above or below the US federal standards for butterfat content is it when using the standards as reported by Wikipedia in 2020? Answer as + or - a number rounded to one decimal place." Citation: GAIA paper, Figure 1
  • "In NASA’s Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute? Exclude any astronauts who did not spend any time in space. Give the last name of the astronaut, separated from the number of minutes by a semicolon. Use commas as thousands separators in the number of minutes." Citation: GAIA paper, Figure 1

Methodology

  • Scoring is final-answer accuracy, judged by quasi-exact match against the ground truth, with no partial credit and no open-ended rubric.
  • The official Hugging Face leaderboard is the canonical source for test-set submissions; launch posts may report related or approximate results.
  • Scores average across difficulty levels, so inspect source breakdowns when comparing systems optimized for easy versus multi-step tasks.
  • We prioritize official leaderboard rows and source pages that identify the agent composition or underlying model stack.
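The scoring and averaging points above can be illustrated with a minimal Python sketch. It is not the official GAIA scorer. The normalization rules (ignoring "$", "%", and thousands separators in numbers, splitting lists on commas or semicolons, and dropping case, whitespace, and punctuation in strings) are assumptions about what "quasi-exact match" means, and the function names and sample rows are hypothetical.

```python
import re
import string
from collections import defaultdict


def _to_number(text):
    """Parse a numeric answer, ignoring '$', '%', and thousands separators."""
    cleaned = text.strip().replace("$", "").replace("%", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def _normalize(text):
    """Lowercase and drop whitespace and punctuation for string comparison."""
    drop = str.maketrans("", "", string.whitespace + string.punctuation)
    return text.lower().translate(drop)


def quasi_exact_match(prediction, truth):
    """Return True if prediction matches truth after light normalization."""
    truth_num = _to_number(truth)
    if truth_num is not None:              # numeric ground truth
        return _to_number(prediction) == truth_num
    if re.search(r"[,;]", truth):          # list-valued ground truth
        pred_items = re.split(r"[,;]", prediction)
        true_items = re.split(r"[,;]", truth)
        return len(pred_items) == len(true_items) and all(
            quasi_exact_match(p, t) for p, t in zip(pred_items, true_items)
        )
    return _normalize(prediction) == _normalize(truth)


def accuracy_by_level(rows):
    """rows: iterable of (level, prediction, truth) -> (overall, per_level)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for level, prediction, truth in rows:
        totals[level] += 1
        hits[level] += quasi_exact_match(prediction, truth)  # no partial credit
    per_level = {lvl: hits[lvl] / totals[lvl] for lvl in sorted(totals)}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_level


# Hypothetical rows (not real GAIA answers): (difficulty level, prediction, truth)
rows = [
    (1, "42", "42.0"),
    (1, "$1,000", "1000"),
    (2, "Paris", "paris "),
    (3, "a; b", "a, c"),
]
overall, per_level = accuracy_by_level(rows)
print(overall, per_level)  # 0.75 {1: 1.0, 2: 1.0, 3: 0.0}
```

In this toy data the overall score of 0.75 hides a 0.0 on the hardest level, which is why the per-level breakdown is worth checking before comparing systems.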

Frequently asked questions

Which system is currently best on GAIA?
As of our latest tracked results (last updated Apr 16, 2026), OPS-Agentic-Search is the leading system/agent setup, with a tracked score of 92.36%. The ranking reflects submitted system setups (model plus tools and policy), not a base model alone.
What should I read into a GAIA score?
GAIA scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and others are team-reported. Check each row's source link and notes field to confirm the evidence level before drawing strong conclusions.