GAIA leaderboard

Benchmark page for GAIA with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-04-16

About this benchmark

GAIA (General AI Assistants) evaluates agents on over 450 real-world questions with unambiguous, verifiable answers — requiring multi-step reasoning, tool use, web search, and file handling across three difficulty levels.

It is the most competitive public general-agent benchmark, with top systems now exceeding 90% and the official leaderboard hosted on Hugging Face.

Because top systems are large multi-model ensembles rather than a single model, scores reflect system design and orchestration quality as much as any individual model capability.

Methodology

Scoring uses quasi-exact match against ground truth answers — no partial credit, no LLM judge.
Submissions are evaluated on the private test set; the official leaderboard on Hugging Face is the canonical source.
Level 1 tasks require minimal tooling; Level 3 tasks demand complex multi-step agent behavior. Average scores blend all three levels.
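As a rough illustration of what quasi-exact match with no partial credit can look like, here is a minimal Python sketch. It is not the official GAIA scorer: the function names and the specific normalization rules (ignoring $, % and thousands separators in numeric answers, comparing comma- or semicolon-separated lists element by element, and ignoring case, punctuation, and whitespace in text answers) are assumptions chosen for illustration.

```python
import re
import string
from typing import Iterable, Optional, Tuple


def _parse_number(text: str, strip_decoration: bool) -> Optional[float]:
    """Return text as a float, or None if it is not numeric."""
    if strip_decoration:
        # Assumption: $, % and thousands separators are ignored in predictions.
        text = re.sub(r"[$%,]", "", text)
    try:
        return float(text.strip())
    except ValueError:
        return None


def _normalize_text(text: str) -> str:
    """Lowercase, then drop punctuation and all whitespace."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", "", no_punct)


def quasi_exact_match(prediction: str, truth: str) -> bool:
    """Binary score: True only if prediction equals truth after normalization."""
    # Numeric ground truth: compare as numbers.
    truth_number = _parse_number(truth, strip_decoration=False)
    if truth_number is not None:
        return _parse_number(prediction, strip_decoration=True) == truth_number
    # List ground truth: same length, and every element must match in order.
    if "," in truth or ";" in truth:
        truth_items = re.split(r"[,;]", truth)
        pred_items = re.split(r"[,;]", prediction)
        return len(truth_items) == len(pred_items) and all(
            quasi_exact_match(p, t) for p, t in zip(pred_items, truth_items)
        )
    # Free-text ground truth: compare normalized strings.
    return _normalize_text(prediction) == _normalize_text(truth)


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prediction, truth) pairs that match; no partial credit."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(quasi_exact_match(p, t) for p, t in pairs) / len(pairs)
```

Under these assumptions, quasi_exact_match("$1,234", "1234") and quasi_exact_match("Paris.", "paris") both count as correct, while quasi_exact_match("Paris, France", "Paris") does not, because the extra text changes the normalized string.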

Leaderboard (agent scope)

Rank	System / Submission	Score	Organization	Notes	Source
1	OPS-Agentic-Search	92.36%	Alibaba Cloud	Official GAIA leaderboard submission; multi-model ensemble using Qwen, Claude 4.6, GPT-5, DeepSeek 3.2.	Source
2	openJiuwen-deepagent	91.69%	openJiuwen	Official GAIA leaderboard; GPT-5 and Gemini 3 Pro backbone.	Source
3	Lemon Agent	91.36%	Lenovo CTO Org	Official GAIA leaderboard; open-source agent using GPT-5, Gemini 3 Pro, and o3.	Source
4	JoinAI V2.2	90.7%	JoinAI-CMCC	Official GAIA leaderboard; multi-model ensemble with GPT-5, Gemini 3 Pro, DeepSeek 3.1, Qwen 3.	Source
5	Nemotron-ToolOrchestra	90.37%	NVIDIA	Official GAIA leaderboard; open-source orchestrator model routing tasks to specialized tools.	Source
6	JoinAI V2.1	90.03%	JoinAI-CMCC	GAIA test set average score; reported on gaia-benchmark HF leaderboard.	Source
6	SU Zero (Shuqian Pro)	90.03%	Shuqian Tech	Official GAIA leaderboard submission.	Source
8	HALO V1217-1	89.37%	Microsoft AI Asia	Official GAIA leaderboard submission from Microsoft AI Asia.	Source
8	ShawnAgent v3.1	89.37%	Independent	GAIA test set average score; reported on gaia-benchmark HF leaderboard.	Source
10	Deep Research (o3)	~87%	OpenAI	Self-reported at Deep Research launch; reached new SOTA on GAIA at time of release.	Source
11	Manus	86.5%	Monica AI	Self-reported; multi-agent system with parallel tool use across browser, code, and file tools.	Source
12	GPT-4o (Deep Research)	72.0%	OpenAI	Baseline row from Deep Research launch post; earlier generation model for reference.	Source
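
The Rank column repeats 6 and 8 and skips 7 and 9, which is consistent with standard competition ranking: tied scores share a rank, and the next rank skips ahead by the number of tied rows. The sketch below assumes that convention; the function name is hypothetical and the page does not state how its ranks are computed.

```python
from typing import Dict, List, Sequence


def competition_rank(scores: Sequence[float]) -> List[int]:
    """Return 1-based ranks where ties share a rank and later ranks skip."""
    first_position: Dict[float, int] = {}
    # Walk scores from best to worst and remember where each value first appears.
    for position, score in enumerate(sorted(scores, reverse=True), start=1):
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Scores of the first nine table rows, in percent.
top_scores = [92.36, 91.69, 91.36, 90.7, 90.37, 90.03, 90.03, 89.37, 89.37]
ranks = competition_rank(top_scores)  # [1, 2, 3, 4, 5, 6, 6, 8, 8]
```

Approximate entries such as ~87% would need an explicit numeric value before they could be ranked this way.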

Related benchmarks

Compare this benchmark with related pages from the hub:

BrowseComp, WebArena

Frequently asked questions

Which system is currently best on GAIA?

OPS-Agentic-Search currently leads with a tracked score of 92.36%. The ranking reflects submitted system setups (model plus tools and policy), not a base model alone. It is based on our latest tracked results, last updated 2026-04-16.

What should I read into a GAIA score?

GAIA scores are most useful for ranking systems within this benchmark. Read the Notes column to understand each setup, and consult the Methodology section before making procurement or architecture decisions.

Are these independently verified?

Not always. Some rows are independently benchmarked and some are team-reported. Check each row's Source link and Notes field to confirm the evidence level before drawing strong conclusions.