GAIA leaderboard
Benchmark page for GAIA with standardized structure: about, methodology, and leaderboard table.
Last updated: 2026-04-16
About this benchmark
GAIA (General AI Assistants) evaluates agents on over 450 real-world questions with unambiguous, verifiable answers — requiring multi-step reasoning, tool use, web search, and file handling across three difficulty levels.
It is the most competitive public general-agent benchmark: top systems now exceed 90%, and the official leaderboard is hosted on Hugging Face.
Because top systems are large multi-model ensembles rather than single models, scores reflect system design and orchestration quality as much as any individual model's capability; no entry's score can be attributed to a single model.
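For orientation, the public validation split can be inspected directly. Below is a minimal sketch, assuming the gated gaia-benchmark/GAIA dataset on Hugging Face, its 2023_all configuration, and the field names shown on the dataset card; check the card before relying on them:

```python
# Minimal sketch: inspect GAIA validation tasks with the Hugging Face
# `datasets` library. Assumes you have accepted the dataset's gated-access
# terms and authenticated (e.g. `huggingface-cli login`); the config name
# and field names follow the public dataset card and may change.
from datasets import load_dataset

ds = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")

for task in ds.select(range(3)):
    print(task["task_id"], "| Level", task["Level"])
    print("Q:", task["Question"][:120])
    print("A:", task["Final answer"])  # hidden on the private test split
    if task["file_name"]:              # some tasks attach a file to handle
        print("attachment:", task["file_name"])
```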
Methodology
- Scoring uses quasi-exact match against ground-truth answers: no partial credit and no LLM judge (a simplified scorer sketch follows this list).
- Submissions are evaluated on the private test set, and the official leaderboard on Hugging Face is the canonical source (see the submission-file sketch below).
- Level 1 tasks require minimal tooling; Level 3 tasks demand complex multi-step agent behavior. Average scores blend all three levels.
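As an illustration of how quasi-exact match behaves, here is a simplified scorer modeled on the scoring logic described in the GAIA paper: numeric answers compare as numbers, list answers compare element-wise, and everything else compares as normalized strings. This is a sketch under those assumptions, not the official implementation:

```python
# Simplified quasi-exact-match scorer; a sketch, not the official GAIA code.
# Assumptions: numeric ground truths compare as floats, comma/semicolon-
# separated ground truths compare element-wise, other answers compare as
# lowercased, whitespace-normalized strings.
import re

def _normalize(text: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())

def _to_number(text: str):
    """Parse a number, ignoring commas, '%', and '$'; None if not numeric."""
    cleaned = text.strip().replace(",", "").rstrip("%").lstrip("$")
    try:
        return float(cleaned)
    except ValueError:
        return None

def quasi_exact_match(prediction: str, ground_truth: str) -> bool:
    gt_num = _to_number(ground_truth)
    if gt_num is not None:                           # numeric answer
        pred_num = _to_number(prediction)
        return pred_num is not None and pred_num == gt_num
    if "," in ground_truth or ";" in ground_truth:   # list answer
        sep = ";" if ";" in ground_truth else ","
        gt_items = [x.strip() for x in ground_truth.split(sep)]
        pred_items = [x.strip() for x in prediction.split(sep)]
        return len(gt_items) == len(pred_items) and all(
            quasi_exact_match(p, g) for p, g in zip(pred_items, gt_items)
        )
    return _normalize(prediction) == _normalize(ground_truth)

assert quasi_exact_match(" Paris ", "paris")
assert quasi_exact_match("1,024", "1024")
assert not quasi_exact_match("about 1024", "1024")  # no partial credit
```

A system's average score on the leaderboard is then the fraction of test questions judged correct under this rule, pooled across all three levels.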
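Submissions to the Hugging Face leaderboard take the form of a file with one answer per task. The JSONL layout and field names below ("task_id", "model_answer") are assumptions based on the leaderboard's published submission instructions and should be verified against the current space before submitting:

```python
# Hedged sketch of writing a leaderboard submission file. Field names are
# assumptions; confirm them in the GAIA leaderboard space on Hugging Face.
import json

answers = {
    "example-task-id-1": "Paris",   # hypothetical task IDs for illustration
    "example-task-id-2": "1024",
}

with open("submission.jsonl", "w", encoding="utf-8") as f:
    for task_id, answer in answers.items():
        # The space may also expect additional fields (e.g. a reasoning
        # trace); check the current submission instructions.
        f.write(json.dumps({"task_id": task_id, "model_answer": answer}) + "\n")
```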
Leaderboard
| Rank | System / Submission | Score | Organization | Notes |
|---|---|---|---|---|
| 1 | OPS-Agentic-Search | 92.36% | Alibaba Cloud | Official GAIA leaderboard submission; multi-model ensemble using Qwen, Claude 4.6, GPT-5, and DeepSeek 3.2. |
| 2 | openJiuwen-deepagent | 91.69% | openJiuwen | Official GAIA leaderboard; GPT-5 and Gemini 3 Pro backbone. |
| 3 | Lemon Agent | 91.36% | Lenovo CTO Org | Official GAIA leaderboard; open-source agent using GPT-5, Gemini 3 Pro, and o3. |
| 4 | JoinAI V2.2 | 90.70% | JoinAI-CMCC | Official GAIA leaderboard; multi-model ensemble with GPT-5, Gemini 3 Pro, DeepSeek 3.1, and Qwen 3. |
| 5 | Nemotron-ToolOrchestra | 90.37% | NVIDIA | Official GAIA leaderboard; open-source orchestrator model routing tasks to specialized tools. |
| 6 | JoinAI V2.1 | 90.03% | JoinAI-CMCC | GAIA test-set average score; reported on the gaia-benchmark Hugging Face leaderboard. |
| 6 | SU Zero (Shuqian Pro) | 90.03% | Shuqian Tech | Official GAIA leaderboard submission. |
| 8 | HALO V1217-1 | 89.37% | Microsoft AI Asia | Official GAIA leaderboard submission from Microsoft AI Asia. |
| 8 | ShawnAgent v3.1 | 89.37% | Independent | GAIA test-set average score; reported on the gaia-benchmark Hugging Face leaderboard. |
| 10 | Deep Research (o3) | ~87% | OpenAI | Self-reported at the Deep Research launch; new SOTA on GAIA at time of release. |
| 11 | Manus | 86.5% | Monica AI | Self-reported; multi-agent system with parallel tool use across browser, code, and file tools. |
| 12 | GPT-4o (Deep Research) | 72.0% | OpenAI | Baseline row from the Deep Research launch post; earlier-generation model for reference. |
Related benchmarks
Compare this benchmark with related pages from the hub.