BrowseComp leaderboard
Benchmark page for BrowseComp with standardized structure: about, leaderboard table, and FAQ.
Last updated: 2026-03-22
About this benchmark
BrowseComp targets difficult browse-and-synthesize research questions whose answers are easy to verify but hard to find without a strong search and reasoning strategy.
The benchmark is valuable for teams building deep research systems where retrieval strategy, persistence, and answer synthesis quality matter.
Results may include model-centric and system-augmented submissions, so score ownership should be interpreted carefully.
Mixed-scope benchmark: model-only and tool-augmented rows are not directly comparable by default.
Methodology
- Scoring emphasizes verifiable final answers over intermediate browsing traces.
- Rows can combine model-only evaluations and agent-with-tools submissions depending on source.
- For procurement decisions, review each source for retrieval tooling and attempt policy.
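The scope-separation policy above can be sketched as follows. This is an illustrative sketch only: the `Row` fields and the scope labels are hypothetical, not part of the leaderboard's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Row:
    system: str
    score: float  # reported score, in percent
    scope: str    # e.g. "model-only" or "tool-augmented" (hypothetical labels)

def rank_within_scope(rows):
    """Group rows by evaluation scope and rank each group separately,
    so model-only and tool-augmented submissions are never compared
    head-to-head."""
    groups = {}
    for r in rows:
        groups.setdefault(r.scope, []).append(r)
    return {scope: sorted(group, key=lambda r: r.score, reverse=True)
            for scope, group in groups.items()}

rows = [
    Row("agent-A", 84.0, "tool-augmented"),
    Row("model-B", 61.0, "model-only"),
    Row("agent-C", 90.1, "tool-augmented"),
]
ranked = rank_within_scope(rows)
print([r.system for r in ranked["tool-augmented"]])  # ['agent-C', 'agent-A']
```

Ranking only within a scope group keeps the "mixed scope" caveat explicit: a cross-scope comparison has to be an intentional, clearly labeled step rather than the default sort order.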
Mixed scope
| Rank | System / Submission | Score | Organization | Notes | Source | Share |
|---|---|---|---|---|---|---|
| 1 | GPT-5.5 Pro | 90.1% | OpenAI | GPT-5.5 Pro - BrowseComp; reported on openai.com. | Source | |
| 2 | Claude Mythos Preview | 86.9% | Anthropic | Scores higher than Opus 4.6 while using 4.9× fewer tokens; reported on anthropic.com. | Source | |
| 3 | Kimi K2.6 | 86.3% | Moonshot AI | Agent Swarm; reported on kimi.com. | Source | |
| 4 | Gemini 3.1 Pro | 85.9% | Google DeepMind | Search + Python + Browse; reported in Google DeepMind Gemini 3.1 Pro evaluation PDF. | Source | |
| 5 | GPT-5.5 | 84.4% | OpenAI | BrowseComp agentic web browsing benchmark. Reasoning effort xhigh; reported on openai.com. | Source | |
| 6 | Claude Opus 4.6 | 84.0% | Anthropic | Web search, web fetch, programmatic tool calling, context compaction triggered at 50k tokens up to 10M total tokens, max reasoning effort...; reported on anthropic.com. | Source | |
| 7 | DeepSeek-V4-Pro-Max | 83.4% | DeepSeek | Pass@1; reported on huggingface.co. | Source | |
| 8 | GPT-5.4 | 82.7% | OpenAI | BrowseComp agentic web browsing benchmark. Reasoning effort xhigh; reported on openai.com. | Source | |
| 9 | Claude Opus 4.7 | 79.3% | Anthropic | Agentic search evaluation; reported on anthropic.com. | Source | |
| 9 | GLM-5.1 | 79.3% | Zhipu AI | With context management; reported on z.ai. | Source | |
| 11 | GPT-5.2 Pro | 77.9% | OpenAI | GPT-5.2 Pro - BrowseComp; reported on openai.com. | Source | |
| 12 | GPT-5.3-Codex | 77.3% | OpenAI | Reported alongside GPT-5.4 announcement; reported on openai.com. | Source | |
| 12 | Seed 2.0 Pro | 77.3% | ByteDance | Self-reported result; reported on seed.bytedance.com. | Source | |
| 14 | MiniMax M2.5 | 76.3% | MiniMax | with context management; reported on minimax.io. | Source | |
| 15 | GLM-5 | 75.9% | Zhipu AI | Self-reported result; reported on docs.z.ai. | Source | |
| 16 | Kimi K2.5 | 74.9% | Moonshot AI | Agents; reported on fireworks.ai. | Source | |
| 17 | Claude Sonnet 4.6 | 74.7% | Anthropic | Agentic search (BrowseComp); reported on anthropic.com. | Source | |
| 18 | DeepSeek-V4-Flash-Max | 73.2% | DeepSeek | Pass@1; reported on huggingface.co. | Source | |
| 19 | Qwen3.5-397B-A17B | 69.0% | Alibaba Cloud / Qwen Team | Self-reported result; reported on qwenlm.github.io. | Source | |
| 19 | Step-3.5-Flash | 69.0% | StepFun | With Context Manager; reported on stepfun.com. | Source | |
| 21 | GPT-5.2 | 65.8% | OpenAI | GPT-5.2 Thinking - BrowseComp; reported on openai.com. | Source | |
| 22 | Qwen3.5-122B-A10B | 63.8% | Alibaba Cloud / Qwen Team | Self-reported result; reported on qwen.ai. | Source | |
| 23 | MiniMax M2.1 | 62.0% | MiniMax | context management; reported on minimax.io. | Source | |
| 24 | Qwen3.5-27B | 61.0% | Alibaba Cloud / Qwen Team | Self-reported result; reported on qwen.ai. | Source | |
| 24 | Qwen3.5-35B-A3B | 61.0% | Alibaba Cloud / Qwen Team | Self-reported result; reported on qwen.ai. | Source | |
| 26 | Kimi K2-Thinking-0905 | 60.2% | Moonshot AI | w/ tools; reported on moonshotai.github.io. | Source | |
| 27 | MiMo-V2-Flash | 58.3% | Xiaomi | With Context Management; reported on mimo.xiaomi.com. | Source | |
| 28 | LongCat-Flash-Thinking-2601 | 56.6% | Meituan | Pass@1; reported on huggingface.co. | Source | |
| 29 | GPT-5 | 54.9% | OpenAI | GPT-5 with thinking mode enabled - Agentic search & browsing benchmark; reported on openai.com. | Source | |
| 30 | GLM-4.7 | 52.0% | Zhipu AI | Self-reported result; reported on z.ai. | Source | |
| 31 | o4-mini | 51.5% | OpenAI | accuracy (with python + browsing); reported on openai.com. | Source | |
| 32 | DeepSeek-V3.2 | 51.4% | DeepSeek | Search Agent; reported on huggingface.co. | Source | |
| 32 | DeepSeek-V3.2 (Thinking) | 51.4% | DeepSeek | Pass@1; reported on huggingface.co. | Source | |
| 34 | o3 | 49.7% | OpenAI | accuracy (with python + browsing); reported on openai.com. | Source | |
| 35 | Sarvam-105B | 49.5% | Sarvam AI | Self-reported result; reported on sarvam.ai. | Source | |
| 36 | GLM-4.6 | 45.1% | Zhipu AI | standard; reported on z.ai. | Source | |
| 37 | Grok 4 Fast | 44.9% | xAI | accuracy; reported on x.ai. | Source | |
| 38 | MiniMax M2 | 44.0% | MiniMax | Reported in MiniMax M2.5 announcement; reported on minimax.io. | Source | |
| 39 | GLM-4.7-Flash | 42.8% | Zhipu AI | Self-reported result; reported on z.ai. | Source | |
| 40 | DeepSeek-V3.2-Exp | 40.1% | DeepSeek | Agentic Tool Use; reported on huggingface.co. | Source | |
| 41 | Sarvam-30B | 35.5% | Sarvam AI | Self-reported result; reported on sarvam.ai. | Source | |
| 42 | Nemotron 3 Super (120B A12B) | 31.3% | NVIDIA | With Search; reported on build.nvidia.com. | Source | |
| 43 | DeepSeek-V3.1 | 30.0% | DeepSeek | Thinking mode with search agent; reported on huggingface.co. | Source | |
| 44 | GLM-4.5 | 26.4% | Zhipu AI | standard; reported on z.ai. | Source | |
| 45 | GLM-4.5-Air | 21.3% | Zhipu AI | standard; reported on z.ai. | Source | |
| 46 | WebSailor-72B | 12.0% | Academic | Third-party verified open-source 72B model specialized for web navigation; reported on arxiv.org. | Source | |
| 47 | DeepSeek-R1-0528 | 8.9% | DeepSeek | Search agent with pre-defined workflow; reported on huggingface.co. | Source | |
| 48 | GPT-4o + browsing | 1.9% | OpenAI | Reference baseline from OpenAI's BrowseComp paper; illustrates benchmark difficulty. | Source |
Frequently asked questions
Which system is currently best on BrowseComp?
GPT-5.5 Pro is the system/agent setup currently leading with a tracked score of 90.1%. This ranking reflects submitted system setups (model plus tools and policy), not just a base model. Based on our latest tracked results, last updated Mar 22, 2026.
What should I read into a BrowseComp score?
BrowseComp scores are most useful for within-benchmark ranking. Read the Notes column to understand each row's setup, and review the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.
Can I compare model-only and agent-with-tools rows directly?
Not directly. Mixed pages can combine model-centric and full-system submissions. Treat those comparisons as directional unless evaluation setup and tool policy are explicitly aligned.
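Several table notes report Pass@1. As background (not something this page itself specifies), a common unbiased estimator for pass@k, given n sampled attempts of which c were judged correct, is 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = total sampled attempts,
    c = attempts judged correct, k = attempt budget."""
    if n - c < k:
        # Fewer incorrect attempts than the budget: success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 attempts and 4 correct, pass@1 reduces to c/n = 0.4:
print(pass_at_k(10, 4, 1))  # 0.4
```

Note that pass@1 computed this way is just the fraction of correct attempts; self-reported "Pass@1" figures may or may not follow this convention, so check each linked source.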