BrowseComp leaderboard

Benchmark page for BrowseComp with standardized structure: about, leaderboard table, and FAQ.

Last updated: 2026-03-22

About this benchmark

BrowseComp targets difficult browse-and-synthesize research questions that are easy to verify but hard to answer without a strong search and reasoning strategy.

The benchmark is valuable for teams building deep research systems where retrieval strategy, persistence, and answer synthesis quality matter.

Results may mix model-only and system-augmented submissions, so attribute each score to the full setup (model plus tooling and attempt policy) rather than to the base model alone.

This is a mixed-scope benchmark: model-only and tool-augmented rows are not directly comparable.

Methodology

  • Scoring emphasizes verifiable final answers over intermediate browsing traces (a minimal grading sketch follows this list).
  • Rows can combine model-only evaluations and agent-with-tools submissions depending on source.
  • For procurement decisions, review each source for retrieval tooling and attempt policy.
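
To make the scoring model concrete, here is a minimal sketch of final-answer grading. It assumes each task ships a single reference answer and that only the final answer is ever compared; the normalization and exact-match rule are illustrative assumptions (BrowseComp's official grading is more tolerant than plain string equality), not the official grader.

    # Minimal sketch of verifiable-final-answer scoring.
    # Assumption: each task has one reference answer; intermediate browsing
    # traces are never inspected. The normalization and exact-match rule
    # below are illustrative, not BrowseComp's official grader.

    def normalize(text: str) -> str:
        # Lowercase and collapse whitespace so trivial formatting
        # differences do not count as wrong answers.
        return " ".join(text.lower().split())

    def grade(prediction: str, reference: str) -> bool:
        # Only the final answer is compared; how it was found is ignored.
        return normalize(prediction) == normalize(reference)

    def score(results: list[tuple[str, str]]) -> float:
        # results: (prediction, reference) pairs, one attempt per task.
        correct = sum(grade(p, r) for p, r in results)
        return correct / len(results)

Under this framing, a leaderboard score is simply the fraction of tasks whose single final answer matched the reference.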

Leaderboard (mixed scope)
Rank | System / Submission | Score | Organization | Notes
1 | GPT-5.5 Pro | 90.1% | OpenAI | GPT-5.5 Pro - BrowseComp; reported on openai.com.
2 | Claude Mythos Preview | 86.9% | Anthropic | Scores higher than Opus 4.6 while using 4.9× fewer tokens; reported on anthropic.com.
3 | Kimi K2.6 | 86.3% | Moonshot AI | Agent Swarm; reported on kimi.com.
4 | Gemini 3.1 Pro | 85.9% | Google | Search + Python + Browse; reported in Google DeepMind Gemini 3.1 Pro evaluation PDF.
5 | GPT-5.5 | 84.4% | OpenAI | Agentic web browsing benchmark; reasoning effort xhigh; reported on openai.com.
6 | Claude Opus 4.6 | 84.0% | Anthropic | Web search, web fetch, programmatic tool calling, context compaction triggered at 50k tokens up to 10M total tokens, max reasoning effort...; reported on anthropic.com.
7 | DeepSeek-V4-Pro-Max | 83.4% | DeepSeek | Pass@1; reported on huggingface.co.
8 | GPT-5.4 | 82.7% | OpenAI | Agentic web browsing benchmark; reasoning effort xhigh; reported on openai.com.
9 | Claude Opus 4.7 | 79.3% | Anthropic | Agentic search evaluation; reported on anthropic.com.
9 | GLM-5.1 | 79.3% | Zhipu AI | With context management; reported on z.ai.
11 | GPT-5.2 Pro | 77.9% | OpenAI | GPT-5.2 Pro - BrowseComp; reported on openai.com.
12 | GPT-5.3-Codex | 77.3% | OpenAI | Reported alongside the GPT-5.4 announcement on openai.com.
12 | Seed 2.0 Pro | 77.3% | ByteDance | Self-reported result; reported on seed.bytedance.com.
14 | MiniMax M2.5 | 76.3% | MiniMax | With context management; reported on minimax.io.
15 | GLM-5 | 75.9% | Zhipu AI | Self-reported result; reported on docs.z.ai.
16 | Kimi K2.5 | 74.9% | Moonshot AI | Agents; reported on fireworks.ai.
17 | Claude Sonnet 4.6 | 74.7% | Anthropic | Agentic search (BrowseComp); reported on anthropic.com.
18 | DeepSeek-V4-Flash-Max | 73.2% | DeepSeek | Pass@1; reported on huggingface.co.
19 | Qwen3.5-397B-A17B | 69.0% | Alibaba Cloud / Qwen Team | Self-reported result; reported on qwenlm.github.io.
19 | Step-3.5-Flash | 69.0% | StepFun | With context manager; reported on stepfun.com.
21 | GPT-5.2 | 65.8% | OpenAI | GPT-5.2 Thinking - BrowseComp; reported on openai.com.
22 | Qwen3.5-122B-A10B | 63.8% | Alibaba Cloud / Qwen Team | Self-reported result; reported on qwen.ai.
23 | MiniMax M2.1 | 62.0% | MiniMax | Context management; reported on minimax.io.
24 | Qwen3.5-27B | 61.0% | Alibaba Cloud / Qwen Team | Self-reported result; reported on qwen.ai.
24 | Qwen3.5-35B-A3B | 61.0% | Alibaba Cloud / Qwen Team | Self-reported result; reported on qwen.ai.
26 | Kimi K2-Thinking-0905 | 60.2% | Moonshot AI | With tools; reported on moonshotai.github.io.
27 | MiMo-V2-Flash | 58.3% | Xiaomi | With context management; reported on mimo.xiaomi.com.
28 | LongCat-Flash-Thinking-2601 | 56.6% | Meituan | Pass@1; reported on huggingface.co.
29 | GPT-5 | 54.9% | OpenAI | GPT-5 with thinking mode enabled; agentic search & browsing benchmark; reported on openai.com.
30 | GLM-4.7 | 52.0% | Zhipu AI | Self-reported result; reported on z.ai.
31 | o4-mini | 51.5% | OpenAI | Accuracy (with Python + browsing); reported on openai.com.
32 | DeepSeek-V3.2 | 51.4% | DeepSeek | Search agent; reported on huggingface.co.
32 | DeepSeek-V3.2 (Thinking) | 51.4% | DeepSeek | Pass@1; reported on huggingface.co.
34 | o3 | 49.7% | OpenAI | Accuracy (with Python + browsing); reported on openai.com.
35 | Sarvam-105B | 49.5% | Sarvam AI | Self-reported result; reported on sarvam.ai.
36 | GLM-4.6 | 45.1% | Zhipu AI | Standard; reported on z.ai.
37 | Grok 4 Fast | 44.9% | xAI | Accuracy; reported on x.ai.
38 | MiniMax M2 | 44.0% | MiniMax | Reported in MiniMax M2.5 announcement; reported on minimax.io.
39 | GLM-4.7-Flash | 42.8% | Zhipu AI | Self-reported result; reported on z.ai.
40 | DeepSeek-V3.2-Exp | 40.1% | DeepSeek | Agentic tool use; reported on huggingface.co.
41 | Sarvam-30B | 35.5% | Sarvam AI | Self-reported result; reported on sarvam.ai.
42 | Nemotron 3 Super (120B A12B) | 31.3% | NVIDIA | With search; reported on build.nvidia.com.
43 | DeepSeek-V3.1 | 30.0% | DeepSeek | Thinking mode with search agent; reported on huggingface.co.
44 | GLM-4.5 | 26.4% | Zhipu AI | Standard; reported on z.ai.
45 | GLM-4.5-Air | 21.3% | Zhipu AI | Standard; reported on z.ai.
46 | WebSailor-72B | 12.0% | Academic | Third-party verified open-source 72B model specialized for web navigation; reported on arxiv.org.
47 | DeepSeek-R1-0528 | 8.9% | DeepSeek | Search agent with pre-defined workflow; reported on huggingface.co.
48 | GPT-4o + browsing | 1.9% | OpenAI | Reference baseline from OpenAI's BrowseComp paper; illustrates benchmark difficulty.
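
Several rows above report pass@1. Assuming those sources follow the standard convention from the sampling-based evaluation literature (an assumption worth checking against each source), pass@1 is the expected fraction of tasks answered correctly in a single attempt. With n attempts sampled per task, c of them correct, the unbiased estimator is:

    \text{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right],
    \qquad
    \text{pass@}1 \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, \frac{c}{n} \right]

With a single attempt per task (n = 1), pass@1 reduces to plain accuracy over the task set.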

Frequently asked questions

Which system is currently best on BrowseComp?
GPT-5.5 Pro currently leads with a tracked score of 90.1%. This ranking reflects the submitted system setup (model plus tools and attempt policy), not just a base model. Based on our latest tracked results, last updated Mar 22, 2026.
What should I read into a BrowseComp score?
BrowseComp scores are most useful for within-benchmark ranking. Read the Notes column to understand each setup's context, and consult the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and others are team-reported. Check each row's source and notes to establish the evidence level before drawing strong conclusions.
Can I compare model-only and agent-with-tools rows directly?
Not directly. Mixed-scope pages combine model-centric and full-system submissions. Treat such comparisons as directional unless the evaluation setup and tool policy are explicitly aligned.