GAIA leaderboard

Benchmark page for GAIA with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-04-16

About this benchmark

GAIA (General AI Assistants) evaluates agents on 466 real-world questions with unambiguous, verifiable answers — requiring multi-step reasoning, tool use, web search, and file handling across three difficulty levels.

It is the most competitive public general-agent benchmark: top systems now exceed 90%, and the official leaderboard is hosted on Hugging Face.

Because top entries are large multi-model ensembles rather than single models, scores reflect system design and orchestration quality as much as any individual model's capability; no score on this page should be attributed to a single model.

Methodology

  • Scoring uses quasi-exact match against ground-truth answers — no partial credit, no LLM judge. A sketch of this matching logic follows the list.
  • Submissions are evaluated on the private test set; the official leaderboard on Hugging Face is the canonical source. A dataset-loading sketch also follows below.
  • Level 1 tasks require minimal tooling; Level 3 tasks demand complex multi-step agent behavior. Average scores blend all three levels.
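
To make the first bullet concrete, here is a minimal Python sketch of quasi-exact matching in the spirit of GAIA's scoring: numbers compare numerically, comma-separated lists compare element-wise, and strings compare after light normalization. The helper names and exact normalization rules below are illustrative assumptions; the authoritative logic is the official leaderboard's scoring code.

```python
import re
import string


def normalize_str(s: str) -> str:
    """Lowercase, trim, and strip punctuation."""
    s = s.strip().lower()
    return s.translate(str.maketrans("", "", string.punctuation))


def normalize_number(s: str):
    """Parse a numeric answer, ignoring commas, '$', '%', and whitespace."""
    cleaned = re.sub(r"[,$%\s]", "", s)
    try:
        return float(cleaned)
    except ValueError:
        return None


def quasi_exact_match(prediction: str, truth: str) -> bool:
    """No partial credit: the normalized prediction must equal the truth.

    Numbers compare numerically, comma-separated lists compare
    element-wise, everything else compares as normalized strings.
    """
    truth_num = normalize_number(truth)
    if truth_num is not None:
        pred_num = normalize_number(prediction)
        return pred_num is not None and pred_num == truth_num

    if "," in truth:
        pred_items = prediction.split(",")
        truth_items = truth.split(",")
        return len(pred_items) == len(truth_items) and all(
            quasi_exact_match(p, t) for p, t in zip(pred_items, truth_items)
        )

    return normalize_str(prediction) == normalize_str(truth)


# No-partial-credit behavior in action:
assert quasi_exact_match("1,234", "1234")
assert quasi_exact_match("Paris.", "paris")
assert not quasi_exact_match("about 1234", "1234")
```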
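
For the second bullet, a minimal sketch of pulling the benchmark locally with the Hugging Face datasets library. It assumes access to the gated gaia-benchmark/GAIA dataset has been granted, and that the config and column names shown here ("2023_all", "Level", "Question", "Final answer") still match the public dataset card.

```python
from datasets import load_dataset

# GAIA is gated on Hugging Face: request access on the dataset page and
# authenticate first (e.g. `huggingface-cli login`). The dataset ships a
# loading script, hence trust_remote_code.
gaia = load_dataset(
    "gaia-benchmark/GAIA", "2023_all", split="validation", trust_remote_code=True
)

# The validation split carries ground-truth answers; the test split used
# for the official leaderboard keeps its answers private.
for row in gaia.select(range(3)):
    print(row["Level"], row["Question"][:80], "->", row["Final answer"])
```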

Links

Official GAIA leaderboard on Hugging Face

Rank | System / Submission | Score | Organization | Notes | Source
1 | OPS-Agentic-Search | 92.36% | Alibaba Cloud | Official GAIA leaderboard submission; multi-model ensemble using Qwen, Claude 4.6, GPT-5, DeepSeek 3.2. | Source
2 | openJiuwen-deepagent | 91.69% | openJiuwen | Official GAIA leaderboard; GPT-5 and Gemini 3 Pro backbone. | Source
3 | Lemon Agent | 91.36% | Lenovo CTO Org | Official GAIA leaderboard; open-source agent using GPT-5, Gemini 3 Pro, and o3. | Source
4 | JoinAI V2.2 | 90.7% | JoinAI-CMCC | Official GAIA leaderboard; multi-model ensemble with GPT-5, Gemini 3 Pro, DeepSeek 3.1, Qwen 3. | Source
5 | Nemotron-ToolOrchestra | 90.37% | NVIDIA | Official GAIA leaderboard; open-source orchestrator model routing tasks to specialized tools. | Source
6 | JoinAI V2.1 | 90.03% | JoinAI-CMCC | GAIA test set average score; reported on gaia-benchmark HF leaderboard. | Source
6 | SU Zero (Shuqian Pro) | 90.03% | Shuqian Tech | Official GAIA leaderboard submission. | Source
8 | HALO V1217-1 | 89.37% | Microsoft AI Asia | Official GAIA leaderboard submission. | Source
8 | ShawnAgent v3.1 | 89.37% | Independent | GAIA test set average score; reported on gaia-benchmark HF leaderboard. | Source
10 | Deep Research (o3) | ~87% | OpenAI | Self-reported at Deep Research launch; new SOTA on GAIA at time of release. | Source
11 | Manus | 86.5% | Monica AI | Self-reported; multi-agent system with parallel tool use across browser, code, and file tools. | Source
12 | GPT-4o (Deep Research) | 72.0% | OpenAI | Baseline row from the Deep Research launch post; earlier-generation model, for reference. | Source


Frequently asked questions

Which system is currently best on GAIA?
OPS-Agentic-Search currently leads with a tracked score of 92.36%. The ranking reflects submitted system setups (model plus tools and orchestration policy), not base models alone. Based on our latest tracked results, last updated Apr 16, 2026.
What should I read into a GAIA score?
GAIA scores are most useful for within-benchmark ranking. Read the Notes column for setup context, and review the methodology section above before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and others are team-reported. Check each row's source link and notes to gauge the evidence level before drawing strong conclusions.