Leaderboard
| System / Submission | Notes | Score | Organization | Reported | Source |
|---|---|---|---|---|---|
| OPS-Agentic-Search | Official GAIA leaderboard submission; multi-model ensemble using Qwen, Claude 4.6, GPT-5, DeepSeek 3.2, Gemini 3 Pro, and Kimi K2.5. | 92.36% | Alibaba Cloud | Source | |
| openJiuwen-deepagent | Official GAIA leaderboard submission; GPT-5 agent with o3 hints/summary plus Gemini 2.5 Pro and Claude tool roles. | 92.36% | Suzhou AI Lab / Shuqian Tech | Source | |
| openJiuwen-deepagent (GPT5/Gemini) | Official GAIA leaderboard submission; GPT-5 and Gemini 3 Pro backbone. | 91.69% | openJiuwen | Source | |
| Lemon Agent | Official GAIA leaderboard submission; open-source Lemon agent using GPT-5, Gemini 3 Pro, and o3. | 91.36% | Lenovo CTO Org | Source | |
| JoinAI V2.2 | Official GAIA leaderboard; multi-model ensemble with GPT-5, Gemini 3 Pro, DeepSeek 3.1, Qwen 3. | 90.7% | JoinAI-CMCC | Source | |
| Nemotron-ToolOrchestra | Official GAIA leaderboard; Nemotron Tool-Orchestrator 8B routes tasks to specialized tools and frontier models. | 90.37% | NVIDIA | Source | |
| JoinAI V2.1 | GAIA test set average score; reported on gaia-benchmark HF leaderboard. | 90.03% | JoinAI-CMCC | Source | |
| SU Zero (Shuqian Pro) | Official GAIA leaderboard submission. | 90.03% | Shuqian Tech | Source | |
| HALO V1217-1 | Official GAIA leaderboard submission from Microsoft AI Asia. | 89.37% | Microsoft AI Asia | Source | |
| ShawnAgent v3.1 | GAIA test set average score; reported on gaia-benchmark HF leaderboard. | 89.37% | Independent | Source | |
| HALO V1217 | Official GAIA leaderboard submission from Microsoft AI Asia. | 89.04% | Microsoft AI Asia | Source | |
| Su Zero + SQ Pro | Official GAIA leaderboard submission using GPT, Gemini, and Claude. | 89.04% | Suzhou AI Lab / Shuqian Tech | Source | |
| JoinAI V2 | Official GAIA leaderboard submission using GPT, Gemini, DeepSeek, and Qwen. | 89.04% | JoinAI-CMCC | Source | |
| ShawnAgent v3.0 | Official GAIA leaderboard submission using GPT-5.2, Claude Sonnet 4.5, and Gemini 3 Pro. | 89.04% | Independent | Source | |
| Lemon Agent v1.0.8 | Earlier official GAIA leaderboard Lemon Agent submission using GPT-5, o3, and Gemini 3 Pro. | 88.37% | Lenovo CTO Org | Source | |
| Su Zero + Shuqian Lite | Official GAIA leaderboard submission using Gemini 3, Claude Sonnet 4.5, and GPT-5.1. | 87.38% | Suzhou AI Lab / Shuqian Tech | Source | |
| Co-Sight v2.1.0 | Official GAIA leaderboard submission using ZTE Nebula LLM, Claude Sonnet 4, and Gemini 2.5 Pro. | 87.04% | ZTE-AICloud | Source | |
| JoinAI V1.1 | Official GAIA leaderboard submission using JoinLLM, GPT-4.1, DeepSeek V3.1, and Gemini 2.5 Pro. | 86.71% | JoinAI | Source | |
| Manus | Self-reported; multi-agent system with parallel tool use across browser, code, and file tools. | 86.5% | Monica AI | Source | |
| Deep Research (o3, cons@64) | OpenAI Deep Research consistency-over-64-samples result reported in the launch post. | 72.57% | OpenAI | Source | |
| Deep Research (o3) | OpenAI Deep Research pass@1 result on GAIA; launch post also reports 72.57% with consistency over 64 samples. | 67.36% | OpenAI | Source | |
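
The two Deep Research rows differ only in how sampled answers are aggregated. The sketch below illustrates that difference under stated assumptions: pass@1 is taken to score a single sampled answer per question, and cons@k is taken to score the most frequent answer among k samples. The function names, the lowercase/strip normalization, and the sample data are all hypothetical and do not come from the OpenAI launch post.

```python
from collections import Counter


def _norm(answer: str) -> str:
    # Assumed normalization for this sketch only: trim and lowercase.
    return answer.strip().lower()


def pass_at_1(samples_per_question, truths):
    """Fraction of questions whose first sampled answer matches the truth."""
    hits = sum(
        _norm(samples[0]) == _norm(truth)
        for samples, truth in zip(samples_per_question, truths)
    )
    return hits / len(truths)


def cons_at_k(samples_per_question, truths, k):
    """Fraction of questions whose majority answer over the first k samples matches."""
    hits = 0
    for samples, truth in zip(samples_per_question, truths):
        votes = Counter(_norm(s) for s in samples[:k])
        # most_common(1) breaks ties in favor of the answer seen first.
        majority, _count = votes.most_common(1)[0]
        hits += majority == _norm(truth)
    return hits / len(truths)


# Hypothetical samples: four answers per question, three questions.
samples = [
    ["13", "12", "12", "12"],            # first sample wrong, majority right
    ["paris", "Paris", "rome", "paris"],  # both right
    ["8", "8", "7", "8"],                # both wrong
]
truths = ["12", "paris", "7"]

print("pass@1:", pass_at_1(samples, truths))
print("cons@4:", cons_at_k(samples, truths, k=4))
```

Majority voting can recover questions where a single sample is unlucky, which is consistent with the consistency-based row scoring higher than the pass@1 row, so the two numbers should not be compared as if they measured the same thing.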
About this benchmark
GAIA evaluates general AI assistants on 466 real-world questions requiring reasoning, web browsing, multimodal understanding, file handling, and tool use.
Questions are designed to be conceptually simple for humans with unambiguous final answers; 300 answers are withheld to power the official leaderboard.
Top GAIA systems are usually orchestrated agents or multi-model ensembles rather than raw model calls, so rankings reward tool selection, search depth, verification, and answer formatting, and a score usually cannot be attributed to a single base model.
Example tasks
Three public tasks quoted from benchmark sources:
- "What was the actual enrollment count of the clinical trial on H. pylori in acne vulgaris patients from Jan-May 2018 as listed on the NIH website?" Citation: GAIA paper, Figure 1
- "If this whole pint is made up of ice cream, how many percent above or below the US federal standards for butterfat content is it when using the standards as reported by Wikipedia in 2020? Answer as + or - a number rounded to one decimal place." Citation: GAIA paper, Figure 1
- "In NASA’s Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute? Exclude any astronauts who did not spend any time in space. Give the last name of the astronaut, separated from the number of minutes by a semicolon. Use commas as thousands separators in the number of minutes." Citation: GAIA paper, Figure 1
Methodology
- Scoring is final-answer accuracy, judged by quasi-exact match against the ground truth, with no partial credit and no open-ended rubric.
- The official Hugging Face leaderboard is the canonical source for test-set submissions; launch posts may report related or approximate results.
- Reported scores are averaged across GAIA's three difficulty levels, so inspect the per-level breakdowns in each source when comparing systems optimized for easy versus multi-step tasks.
- We prioritize official leaderboard rows and source pages that identify the agent composition or underlying model stack.
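
The scoring and averaging points above can be sketched in a few lines. This is a minimal illustration, not the official GAIA scorer: the normalization rules (dropping `$`, `%`, and `,` from numbers, splitting list answers on `;` or `,`, ignoring case, whitespace, and punctuation in strings) are assumptions, and the function names and sample records are hypothetical.

```python
import re
import string
from collections import defaultdict


def _to_number(value: str):
    """Parse a number after removing '$', '%', and ','; return None if not numeric."""
    cleaned = re.sub(r"[$%,]", "", value).strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def _normalize_text(value: str) -> str:
    """Lowercase and drop all whitespace and ASCII punctuation."""
    drop = str.maketrans("", "", string.whitespace + string.punctuation)
    return value.lower().translate(drop)


def _match_item(pred: str, truth: str) -> bool:
    """Compare one item numerically if the truth is a number, else as text."""
    truth_num = _to_number(truth)
    if truth_num is not None:
        return _to_number(pred) == truth_num
    return _normalize_text(pred) == _normalize_text(truth)


def quasi_exact_match(prediction: str, truth: str) -> bool:
    """All-or-nothing match: no partial credit for partly correct lists."""
    if _to_number(truth) is None and re.search(r"[;,]", truth):
        pred_items = re.split(r"[;,]", prediction)
        truth_items = re.split(r"[;,]", truth)
        return len(pred_items) == len(truth_items) and all(
            _match_item(p, t) for p, t in zip(pred_items, truth_items)
        )
    return _match_item(prediction, truth)


def accuracy_by_level(records):
    """records: iterable of (level, prediction, truth) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for level, pred, truth in records:
        totals[level] += 1
        hits[level] += quasi_exact_match(pred, truth)
    per_level = {lvl: hits[lvl] / totals[lvl] for lvl in sorted(totals)}
    overall = sum(hits.values()) / sum(totals.values())
    return per_level, overall


# Hypothetical records; none of these are real GAIA questions or answers.
records = [
    (1, "90", "90"),
    (1, "Paris.", "paris"),
    (2, "$1,200", "1200"),
    (2, "a, b", "a; c"),
    (3, "smith;1234", "Smith; 1234"),
]
per_level, overall = accuracy_by_level(records)
print(per_level, overall)
```

Note that the overall figure here is pooled over all questions, so it is dominated by whichever level has the most questions. Two systems with the same overall score can therefore have quite different per-level profiles.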
Frequently asked questions
Which system is currently best on GAIA?
OPS-Agentic-Search currently leads with a tracked score of 92.36%, tied with openJiuwen-deepagent at the same score. The ranking reflects submitted system setups (model plus tools and policy), not just a base model. It is based on our latest tracked results, last updated Apr 16, 2026.
What should I read into a GAIA score?
GAIA scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.