DRACO Leaderboard 2026: Latest Deep Research Agent Scores

Leaderboard

Mixed scope

System / Submission	Score	Organization	Reported
Claude Mythos 5	86.4%	Anthropic	Jun 2026
Claude Mythos Preview	83.6%	Anthropic	Jun 2026
Claude Opus 4.8	80.4%	Anthropic	Jun 2026
Claude Opus 4.7	77.7%	Anthropic	Apr 2026
MiniMax M3	73.23%	MiniMax	May 2026
Perplexity Deep Research (Opus 4.6)	70.5%	Perplexity	Feb 2026
Perplexity Deep Research (Opus 4.5)	67.2%	Perplexity	Feb 2026
Claude Fable 5	65.3%	Anthropic	Jun 2026
Claude Opus 4.6	59.8%	Anthropic	Feb 2026
Gemini Deep Research	59.0%	Google	Feb 2026
OpenAI Deep Research (o3)	52.1%	OpenAI	Feb 2026
Claude Opus 4.5	46.7%	Anthropic	Feb 2026
OpenAI Deep Research (o4-mini)	41.9%	OpenAI	Feb 2026

About this benchmark

DRACO (Deep Research Accuracy, Completeness, and Objectivity) is Perplexity's benchmark for deep research systems: 100 open-ended tasks across 10 domains, sourced from real Perplexity Deep Research traffic and graded against expert rubrics on accuracy, completeness, presentation, and citation.

Unlike short-answer benchmarks such as BrowseComp, DRACO grades full research reports, so it rewards synthesis, citation quality, and presentation, not just finding the answer. It is the closest public measure of how well a system writes a real research report.

Read each row as a whole system: the agent or harness, the base model, and the grading judge all shape the number. Here the judge matters most, so the methodology is part of the ranking, not a footnote.

Three judges, one table. Rows are graded by Claude Opus 4.6 (Anthropic and MiniMax rows), Gemini 3.1 Pro Preview (OpenRouter's Claude Fable 5 row), or Gemini-3-Pro (the paper's rows). Judge methodology varies, and judge choice shifts absolute scores by 10–25 points, so read each row's note before comparing.

Vendor self-reported under a common judge. The top rows are Anthropic models graded by an Anthropic judge, and DRACO itself was authored by Perplexity, so each setup may favor its originator. Expect score levels to move as more evaluators adopt the Opus 4.6 judge.

Mixed scope: rows mix full deep-research agents (Perplexity, MiniMax) with standard models plus tools (Claude Opus 4.6 and 4.5 in the paper). Compare within a setup class before reading a rank gap as capability.
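The "compare within the same judge" advice above can be sketched as a grouping step. This is a minimal illustration, not part of the benchmark tooling: the function name `rank_within_judge` is hypothetical, and the judge label on each row is this sketch's reading of the notes above (Anthropic and MiniMax rows under the Claude Opus 4.6 judge, the Claude Fable 5 row under Gemini 3.1 Pro Preview, and the paper's Claude Opus 4.6 and 4.5 rows under Gemini-3-Pro). Only a subset of rows is included.

```python
from collections import defaultdict

# (system, score in %, judge). Scores come from the table above; the judge
# labels are an assumption based on the notes, not a column in the table.
ROWS = [
    ("Claude Mythos 5", 86.4, "Claude Opus 4.6"),
    ("Claude Opus 4.8", 80.4, "Claude Opus 4.6"),
    ("MiniMax M3", 73.23, "Claude Opus 4.6"),
    ("Claude Fable 5", 65.3, "Gemini 3.1 Pro Preview"),
    ("Claude Opus 4.6", 59.8, "Gemini-3-Pro"),
    ("Claude Opus 4.5", 46.7, "Gemini-3-Pro"),
]


def rank_within_judge(rows):
    """Group rows by judge, then sort each group by score, highest first."""
    groups = defaultdict(list)
    for system, score, judge in rows:
        groups[judge].append((system, score))
    return {
        judge: sorted(entries, key=lambda entry: entry[1], reverse=True)
        for judge, entries in groups.items()
    }


if __name__ == "__main__":
    for judge, entries in rank_within_judge(ROWS).items():
        print(f"Judge: {judge}")
        for system, score in entries:
            print(f"  {system}: {score}%")
```

Each group is its own ranking, so a gap between two systems is only read inside one group. A further split by setup class (full agent versus model plus tools) would follow the same pattern.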

Example tasks

Three public tasks quoted from benchmark sources:

"In 2008, Longwood Gardens opened 'Nature's Castles: The Treehouse Reimagined' featuring three treehouse structures. Can you find the name of the architectural firm or designer who created these treehouses, and locate a contemporaneous source (2008 or earlier) that describes the design concept and construction process?" Citation: DRACO paper, augmented task example
"Define an independent director under the NASDAQ listing standards. List the eligibility criteria (who qualifies) and disqualification criteria (who cannot serve). Which types of companies are required to have independent directors on their board?" Citation: DRACO paper, augmented task example
"Document the global expansion and local resistance to industrial agriculture mega-farms, comparing case studies from Ukraine's massive grain operations, Brazilian cerrado soy plantations, Saudi Arabia's desert farming investments in Arizona and California, and Chinese pork production facilities." Citation: DRACO paper, augmented task example

Methodology

The headline metric is the normalized score (0–100%): each rubric criterion gets a binary MET/UNMET verdict from an LLM judge, aggregated by weight into a per-task score and averaged across 100 tasks. An unweighted pass rate is a secondary metric.
The judge is the dominant variable. The paper used Gemini-3-Pro (now unavailable); Anthropic finds that swapping judges shifts absolute scores by 10–25 points while preserving rank order. Three judges appear here: Claude Opus 4.6 (Anthropic, MiniMax), Gemini 3.1 Pro Preview (OpenRouter's Fable 5 row), and Gemini-3-Pro (the paper). Compare only within the same judge.
Rows are vendor self-evaluations under a shared judge. Anthropic grades its own models at max effort with a ~1M-token budget, compaction, and five grading runs of the final report; MiniMax grades M3 through its internal harness. Treat each as self-reported.
Each judge is a separate ladder, so don't compare rank gaps across them. The same model shows it: Opus 4.8 scores 80.4% under Anthropic's Opus 4.6 judge and 58.8% under OpenRouter's Gemini 3.1 Pro Preview judge, a 21.6-point gap.
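The normalized score described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's grading code: it assumes every rubric weight is positive and that a task's score is the weight of MET criteria divided by the total weight. The names `Criterion`, `normalized_task_score`, `pass_rate`, and `benchmark_score` are hypothetical, and the unweighted pass rate is assumed to be the share of criteria marked MET.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    weight: float  # rubric weight; assumed positive in this sketch
    met: bool      # judge verdict: True for MET, False for UNMET


def normalized_task_score(criteria):
    """Weighted share of MET criteria for one task, as a percentage."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(c.weight for c in criteria if c.met)
    return 100.0 * earned / total


def pass_rate(criteria):
    """Unweighted share of MET criteria for one task, as a percentage."""
    if not criteria:
        raise ValueError("rubric must contain at least one criterion")
    return 100.0 * sum(1 for c in criteria if c.met) / len(criteria)


def benchmark_score(tasks):
    """Mean of per-task normalized scores across all tasks."""
    if not tasks:
        raise ValueError("at least one task is required")
    return sum(normalized_task_score(t) for t in tasks) / len(tasks)


# Two toy tasks with made-up rubrics, for illustration only.
task_a = [Criterion(weight=3, met=True), Criterion(weight=1, met=False)]
task_b = [Criterion(weight=2, met=True), Criterion(weight=2, met=True)]

print(normalized_task_score(task_a))      # 75.0
print(pass_rate(task_a))                  # 50.0
print(benchmark_score([task_a, task_b]))  # 87.5
```

In the toy example, task_a meets the heavier of its two criteria, so its weighted score (75.0) is higher than its unweighted pass rate (50.0). That gap is why the two metrics are reported separately.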

Related benchmarks

Compare this benchmark with related pages from the hub:

BrowseComp, GAIA, Online-Mind2Web


Frequently asked questions

Which system is currently best on DRACO?

Claude Mythos 5 currently leads with a tracked score of 86.4%. The ranking reflects submitted system setups (model plus tools and policy), not just a base model. It is based on our latest tracked results, last updated Jun 23, 2026.

What should I read into a DRACO score?

DRACO scores are most useful for ranking systems within the benchmark. Read each row's note to understand its setup, and read the methodology section before making procurement or architecture decisions.

Can I compare model-only and agent-with-tools rows directly?

Not directly. Mixed pages can combine model-centric and full-system submissions. Treat those comparisons as directional unless evaluation setup and tool policy are explicitly aligned.