AI Agent Benchmark Results — All Leaderboards
Browse 281 sourced results across 12 AI agent benchmarks — WebVoyager, WebArena, OSWorld, SWE-bench Verified, GAIA, BrowseComp, and more. Every row links to the original paper, model card, repository, or launch post. Filter by category, benchmark, or search by agent or organization name.
Need methodology, evaluator details, and example tasks before comparing scores? Open the individual benchmark hub page for any benchmark below.
Scores on this page are not directly comparable across benchmarks.
This index puts every result in one place for browsing — it does not normalize methodology. Each benchmark uses its own task set, evaluator, scoring metric, and scope (model vs. agent). A 90% on one benchmark does not mean the same thing as a 90% on another. Compare scores within a single benchmark using the filters below, and read each benchmark's methodology notes on its dedicated page before drawing conclusions.
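As a minimal sketch of what "compare within a single benchmark" can look like in practice, the Python below groups a few rows copied from this page by benchmark and ranks them only inside each group. The function name and tuple layout are illustrative assumptions, not part of this site. The tie rule (tied scores share a rank and the next rank is skipped) is also an assumption; it is consistent with the two #1 entries followed by #3 for HealthAdminBench, but the page does not state its rule.

```python
from collections import defaultdict

# Rows copied from the table on this page: (benchmark, system, score in percent).
ROWS = [
    ("HealthAdminBench", "Claude Mythos 5 (browser-use)", 51.9),
    ("HealthAdminBench", "Claude Opus 4.8 (browser-use)", 51.9),
    ("HealthAdminBench", "Claude Mythos Preview (browser-use)", 47.4),
    ("HealthAdminBench", "Claude Sonnet 4.6 (browser-use)", 45.2),
    ("ClawBench", "Claude Sonnet 4.6", 33.3),
    ("ClawBench", "GLM-5", 24.2),
    ("ClawBench", "Gemini 3 Flash", 19.0),
]


def rank_within_benchmark(rows):
    """Return {benchmark: [(rank, system, score), ...]}, best score first.

    Scores are only ever compared inside one benchmark's group.
    Assumed tie rule: tied scores share a rank, the next rank is skipped.
    """
    groups = defaultdict(list)
    for benchmark, system, score in rows:
        groups[benchmark].append((system, score))

    ranked = {}
    for benchmark, entries in groups.items():
        # Stable sort, so tied entries keep their input order.
        entries.sort(key=lambda entry: entry[1], reverse=True)
        out = []
        for position, (system, score) in enumerate(entries, start=1):
            if out and out[-1][2] == score:
                rank = out[-1][0]  # tie: reuse the previous rank
            else:
                rank = position
            out.append((rank, system, score))
        ranked[benchmark] = out
    return ranked


if __name__ == "__main__":
    for benchmark, entries in rank_within_benchmark(ROWS).items():
        print(benchmark)
        for rank, system, score in entries:
            print(f"  #{rank} {system}: {score}%")
```

Because the grouping happens before any sorting, a 51.9% on HealthAdminBench is never placed above or below a 33.3% on ClawBench; each benchmark gets its own independent ranking.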
Results
Sorted by reported date

| Benchmark | Rank | System / Submission | Score | Organization | Reported | Source |
|---|---|---|---|---|---|---|
| HealthAdminBench | #1 | Claude Mythos 5 (browser-use) | 51.9% | Anthropic | 2026-06 | Source |
| HealthAdminBench | #1 | Claude Opus 4.8 (browser-use) | 51.9% | Anthropic | 2026-06 | Source |
| HealthAdminBench | #3 | Claude Mythos Preview (browser-use) | 47.4% | Anthropic | 2026-06 | Source |
| HealthAdminBench | #4 | Claude Sonnet 4.6 (browser-use) | 45.2% | Anthropic | 2026-06 | Source |
| OSWorld | #1 | Claude Opus 4.8 | 83.4% | Anthropic | 2026-05-28 | Source |
| SWE-bench Verified | #2 | Claude Opus 4.8 | 88.6% | Anthropic | 2026-05-28 | Source |
| BrowseComp | #8 | Claude Opus 4.8 | 84.3% | Anthropic | 2026-05-28 | Source |
| BrowseComp | #29 | LongSeeker | 61.5% | Academic Research | 2026-05-06 | Source |
| BrowseComp | #49 | OpenSeeker-v2 | 46.0% | PolarSeeker | 2026-05-05 | Source |
| SWE-bench Verified | #6 | DeepSeek-V4-Pro-Max | 80.6% | DeepSeek | 2026-04-24 | Source |
| BrowseComp | #10 | DeepSeek-V4-Pro-Max | 83.4% | DeepSeek | 2026-04-24 | Source |
| SWE-bench Verified | #12 | DeepSeek-V4-Flash-Max | 79.0% | DeepSeek | 2026-04-24 | Source |
| BrowseComp | #22 | DeepSeek-V4-Flash-Max | 73.2% | DeepSeek | 2026-04-24 | Source |
| BrowseComp | #38 | Parallel Basic + GPT-5.4 harness | 53.0% | Parallel | 2026-04-21 | Source |
| BrowseComp | #43 | Parallel Advanced + GPT-5.4 harness | 51.0% | Parallel | 2026-04-21 | Source |
| BrowseComp | #48 | MiroThinker v1.0-72B | 47.1% | MiroMind | 2026-04-21 | Source |
| BrowseComp | #57 | Tavily + GPT-5.4 harness | 42.0% | Tavily | 2026-04-21 | Source |
| BrowseComp | #59 | Exa + GPT-5.4 harness | 40.0% | Exa | 2026-04-21 | Source |
| BrowseComp | #5 | Kimi K2.6 | 86.3% | Moonshot AI | 2026-04-20 | Source |
| SWE-bench Verified | #8 | Kimi K2.6 | 80.2% | Moonshot AI | 2026-04-20 | Source |
| SWE-bench Verified | #3 | Claude Opus 4.7 | 87.6% | Anthropic | 2026-04-16 | Source |
| BrowseComp | #12 | Claude Opus 4.7 | 79.3% | Anthropic | 2026-04-16 | Source |
| SWE-bench Verified | #1 | Claude Mythos | 93.9% | Anthropic | 2026-04-09 | Source |
| ClawBench | #1 | Claude Sonnet 4.6 | 33.3% | Anthropic | 2026-04-09 | Source |
| ClawBench | #2 | GLM-5 | 24.2% | Z.ai | 2026-04-09 | Source |
| ClawBench | #3 | Gemini 3 Flash | 19.0% | | 2026-04-09 | Source |
| ClawBench | #4 | Claude Haiku 4.5 | 18.3% | Anthropic | 2026-04-09 | Source |
| ClawBench | #5 | GPT-5.4 | 6.5% | OpenAI | 2026-04-09 | Source |
| ClawBench | #6 | Gemini 3.1 Flash Lite | 3.3% | | 2026-04-09 | Source |
| ClawBench | #7 | Kimi K2.5 | 0.7% | Moonshot AI | 2026-04-09 | Source |
| OSWorld | #2 | Mythos Preview | 79.6% | Anthropic | 2026-04-07 | Source |
| BrowseComp | #4 | Claude Mythos Preview | 86.9% | Anthropic | 2026-04-07 | Source |
| BrowseComp | #9 | Claude Opus 4.6 | 83.7% | Anthropic | 2026-04-07 | Source |
| SWE-bench Verified | #13 | Qwen3.6 Plus | 78.8% | Alibaba Cloud / Qwen Team | 2026-04-02 | Source |
| BrowseComp | #1 | GPT-5.5 Pro | 90.1% | OpenAI | 2026-04 | Source |
| BrowseComp | #2 | GPT-5.4 Pro | 89.3% | OpenAI | 2026-04 | Source |
| HealthAdminBench | #5 | Claude Opus 4.6 CUA | 36.3% | Anthropic | 2026-04 | Source |
| HealthAdminBench | #6 | GPT-5.4 CUA | 26.7% | OpenAI | 2026-04 | Source |
| WebVoyager | #7 | GLM-5V-Turbo | 88.5% | Z.ai | 2026-04 | Source |
| BrowseComp | #7 | GPT-5.5 | 84.4% | OpenAI | 2026-04 | Source |
| HealthAdminBench | #7 | Kimi K2.5 | 15.6% | Moonshot AI | 2026-04 | Source |
| HealthAdminBench | #8 | Claude Opus 4.6 | 14.8% | Anthropic | 2026-04 | Source |
| HealthAdminBench | #9 | Qwen 3.5 | 13.3% | Alibaba | 2026-04 | Source |
| OSWorld | #10 | GLM-5V-Turbo | 62.3% | Zhipu AI | 2026-04-01 | Source |
| HealthAdminBench | #10 | Gemini 3.1 Pro | 11.9% | | 2026-04 | Source |
| HealthAdminBench | #11 | GPT-5.4 | 5.9% | OpenAI | 2026-04 | Source |
| BrowseComp | #12 | GLM-5.1 | 79.3% | Zhipu AI | 2026-04 | Source |
| WebArena | #16 | WebUncertainty + GPT-4-Turbo | 46.9% | Academic Research | 2026-04 | Source |
| WebArena | #18 | A3-Qwen3.5-9B | 42.1% | McGill NLP | 2026-04 | Source |
| BrowseComp | #19 | MiniMax M2.5 | 76.3% | MiniMax | 2026-04 | Source |
| Online-Mind2Web | #1 | Browser Use Cloud (bu-max) | 97.0% | Browser-Use | 2026-03-25 | Source |
| Online-Mind2Web | #10 | Stagehand (Gemini 2.5 CU) | 65.0% | Browserbase | 2026-03-25 | Source |
| Online-Mind2Web | #15 | Stagehand (Sonnet 4.5) | 55.0% | Browserbase | 2026-03-25 | Source |
| SWE-bench Verified | #14 | MiMo-V2-Pro | 78.0% | Xiaomi | 2026-03-18 | Source |
| BrowseComp | #3 | MiroThinker-H1 | 88.2% | MiroMind | 2026-03-16 | Source |
| BrowseComp | #66 | OpenSeeker | 29.5% | PolarSeeker | 2026-03-16 | Source |
| GAIA | #1 | OPS-Agentic-Search | 92.36% | Alibaba Cloud | 2026-03-11 | Source |
| GAIA | #1 | openJiuwen-deepagent | 92.36% | Suzhou AI Lab / Shuqian Tech | 2026-03-11 | Source |
| BrowseComp | #63 | Nemotron 3 Super (120B A12B) | 31.3% | NVIDIA | 2026-03-11 | Source |
| BrowseComp | #46 | Sarvam-105B | 49.5% | Sarvam AI | 2026-03-06 | Source |
| BrowseComp | #61 | Sarvam-30B | 35.5% | Sarvam AI | 2026-03-06 | Source |
| Online-Mind2Web | #2 | GPT-5.4 Native Computer Use | 93.0% | OpenAI | 2026-03-05 | Source |
| OSWorld | #4 | GPT-5.4 | 75.0% | OpenAI | 2026-03-05 | Source |
| Online-Mind2Web | #8 | ChatGPT Atlas Agent Mode | 71.0% | OpenAI | 2026-03-05 | Source |
| BrowseComp | #11 | GPT-5.4 | 82.7% | OpenAI | 2026-03-05 | Source |
| BrowseComp | #17 | GPT-5.3-Codex | 77.3% | OpenAI | 2026-03-05 | Source |
| Online-Mind2Web | #3 | ABP + Claude Opus 4.6 | 90.53% | theredsix | 2026-03-03 | Source |
| WebVoyager | #1 | Alumnium | 98.5% | Alumnium | 2026-03 | Source |
| BrowseComp | #47 | SMTL | 48.6% | Academic Research | 2026-02-27 | Source |
| BrowseComp | #6 | Gemini 3.1 Pro | 85.9% | | 2026-02-19 | Source |
| SWE-bench Verified | #6 | Gemini 3.1 Pro | 80.6% | Google DeepMind | 2026-02-19 | Source |
| BrowseComp | #33 | Gemini 3 Pro | 59.2% | | 2026-02-19 | Source |
| OSWorld | #6 | Claude Sonnet 4.6 | 72.5% | Anthropic | 2026-02-17 | Source |
| SWE-bench Verified | #11 | Claude Sonnet 4.6 | 79.6% | Anthropic | 2026-02-17 | Source |
| BrowseComp | #21 | Claude Sonnet 4.6 | 74.7% | Anthropic | 2026-02-17 | Source |
| τ-bench | #8 | Qwen3.5-397B-A17B | 68.4% | Alibaba | 2026-02-16 | Source |
| BrowseComp | #17 | Seed 2.0 Pro | 77.3% | ByteDance | 2026-02-15 | Source |
| Online-Mind2Web | #4 | TinyFish | 90.0% | TinyFish AI | 2026-02-12 | Source |
| SWE-bench Verified | #8 | MiniMax M2.5 | 80.2% | MiniMax | 2026-02-12 | Source |
| τ-bench | #11 | GLM-5 | 63.2% | Zhipu AI | 2026-02-11 | Source |
| SWE-bench Verified | #16 | GLM-5 | 77.8% | Zhipu AI | 2026-02-11 | Source |
| GAIA | #3 | openJiuwen-deepagent (GPT5/Gemini) | 91.69% | openJiuwen | 2026-02-09 | Source |
| GAIA | #4 | Lemon Agent | 91.36% | Lenovo CTO Org | 2026-02-06 | Source |
| SWE-bench Verified | #5 | Claude Opus 4.6 | 80.8% | Anthropic | 2026-02-05 | Source |
| OSWorld | #5 | Claude Opus 4.6 | 72.7% | Anthropic | 2026-02-05 | Source |
| WebArena | #1 | WebTactix (DeepSeek v3.2) | 74.3% | WebTactix | 2026-02 | Source |
| WebArena | #8 | Kimi K2.5 | 58.9% | Moonshot AI | 2026-02 | Source |
| WebArena | #11 | Plan-MCTS + GPT-5-mini | 55.3% | Academic Research | 2026-02 | Source |
| BrowseComp | #14 | Qwen3.5-397B-A17B | 78.6% | Alibaba Cloud / Qwen Team | 2026-02 | Source |
| BrowseComp | #15 | Kimi K2.5 | 78.4% | Moonshot AI | 2026-02 | Source |
| BrowseComp | #20 | GLM-5 | 75.9% | Zhipu AI | 2026-02 | Source |
| BrowseComp | #24 | Step-3.5-Flash | 69.0% | StepFun | 2026-02 | Source |
| BrowseComp | #27 | Qwen3.5-122B-A10B | 63.8% | Alibaba Cloud / Qwen Team | 2026-02 | Source |
| BrowseComp | #30 | Qwen3.5-27B | 61.0% | Alibaba Cloud / Qwen Team | 2026-02 | Source |
| BrowseComp | #30 | Qwen3.5-35B-A3B | 61.0% | Alibaba Cloud / Qwen Team | 2026-02 | Source |
| BrowseComp | #56 | GLM-4.7-Flash | 42.8% | Zhipu AI | 2026-02 | Source |
| τ-bench | #1 | Step-3.5-Flash | 88.2% | StepFun | 2026-01-29 | Source |
| OSWorld | #9 | Kimi K2.5 | 63.3% | Moonshot AI | 2026-01-27 | Source |
| GAIA | #9 | ShawnAgent v3.1 | 89.37% | Independent | 2026-01-16 | Source |
| GAIA | #5 | JoinAI V2.2 | 90.7% | JoinAI-CMCC | 2026-01-14 | Source |
| GAIA | #11 | ShawnAgent v3.0 | 89.04% | Independent | 2026-01-14 | Source |
| OSWorld | #12 | UiPath Screen Agent | 53.6% | UiPath | 2026-01-14 | Source |
| GAIA | #7 | JoinAI V2.1 | 90.03% | JoinAI-CMCC | 2026-01-13 | Source |
| BrowseComp | #49 | WebAnchor-30B | 46.0% | Academic Research | 2026-01-07 | Source |
| τ-bench | #3 | MiMo-V2-Flash | 80.3% | Xiaomi | 2026-01-06 | Source |
| GAIA | #6 | Nemotron-ToolOrchestra | 90.37% | NVIDIA | 2026-01-06 | Source |
| WebArena | #2 | OpAgent | 71.6% | CodeFuse AI | 2026-01 | Source |
| BrowseComp | #23 | LongCat-Flash-Thinking-2601 | 73.1% | Meituan | 2026-01 | Source |
| BrowseComp | #34 | MiMo-V2-Flash | 58.3% | Xiaomi | 2026-01 | Source |
| GAIA | #11 | JoinAI V2 | 89.04% | JoinAI-CMCC | 2025-12-28 | Source |
| GAIA | #7 | SU Zero (Shuqian Pro) | 90.03% | Shuqian Tech | 2025-12-23 | Source |
| τ-bench | #2 | GLM-4.7 | 87.4% | Z.ai | 2025-12-22 | Source |
| τ-bench | #4 | GLM-4.7-Flash | 79.5% | Z.ai | 2025-12-22 | Source |
| GAIA | #9 | HALO V1217-1 | 89.37% | Microsoft AI Asia | 2025-12-17 | Source |
| τ-bench | #9 | Gemini 3 Flash | 67.8% | Google DeepMind | 2025-12-17 | Source |
| GAIA | #11 | HALO V1217 | 89.04% | Microsoft AI Asia | 2025-12-17 | Source |
| SWE-bench Verified | #14 | Gemini 3 Flash | 78.0% | Google DeepMind | 2025-12-17 | Source |
| τ-bench | #7 | GPT-5.2 | 69.9% | OpenAI | 2025-12-11 | Source |
| SWE-bench Verified | #10 | GPT-5.2 | 80.0% | OpenAI | 2025-12-11 | Source |
| GAIA | #11 | Su Zero + SQ Pro | 89.04% | Suzhou AI Lab / Shuqian Tech | 2025-12-11 | Source |
| BrowseComp | #16 | GPT-5.2 Pro | 77.9% | OpenAI | 2025-12-11 | Source |
| BrowseComp | #26 | GPT-5.2 | 65.8% | OpenAI | 2025-12-11 | Source |
| GAIA | #16 | Su Zero + Shuqian Lite | 87.38% | Suzhou AI Lab / Shuqian Tech | 2025-12-07 | Source |
| GAIA | #15 | Lemon Agent v1.0.8 | 88.37% | Lenovo CTO Org | 2025-12-04 | Source |
| WebArena | #3 | ColorBrowserAgent | 71.2% | MadeAgents | 2025-12 | Source |
| Online-Mind2Web | #6 | OpenAGI Lux | 83.6% | OpenAGI Foundation | 2025-12-01 | Source |
| WebArena | #12 | WebOperator + GPT-4o | 54.6% | KAIST KAG NLP | 2025-12 | Source |
| BrowseComp | #25 | GLM-4.7 | 67.5% | Zhipu AI | 2025-12 | Source |
| BrowseComp | #28 | MiniMax M2.1 | 62.0% | MiniMax | 2025-12 | Source |
| BrowseComp | #41 | DeepSeek-V3.2 | 51.4% | DeepSeek | 2025-12-01 | Source |
| BrowseComp | #41 | DeepSeek-V3.2 (Thinking) | 51.4% | DeepSeek | 2025-12-01 | Source |
| SWE-bench Verified | #4 | Claude Opus 4.5 | 80.9% | Anthropic | 2025-11-24 | Source |
| τ-bench | #6 | Claude Opus 4.5 | 70.2% | Anthropic | 2025-11-24 | Source |
| OSWorld | #8 | Claude Opus 4.5 | 66.3% | Anthropic | 2025-11-24 | Source |
| Online-Mind2Web | #7 | Navigator | 78.7% | Yutori | 2025-11-19 | Source |
| Online-Mind2Web | #12 | Claude 4.0 | 61.0% | Anthropic | 2025-11-19 | Source |
| Online-Mind2Web | #15 | Claude 4.5 | 55.0% | Anthropic | 2025-11-19 | Source |
| τ-bench | #10 | Gemini 3 Pro | 65.8% | Google DeepMind | 2025-11-18 | Source |
| BrowseComp | #32 | Kimi K2-Thinking-0905 | 60.2% | Moonshot AI | 2025-11 | Source |
| τ-bench | #5 | MiniMax M2 | 77.2% | MiniMax | 2025-10-27 | Source |
| WebArena | #4 | Claude Code + GBOX MCP | 68.0% | GBOX AI | 2025-10-25 | Source |
| OSWorld | #3 | OSAgent | 76.26% | TheAGI Company | 2025-10-23 | Source |
| OSWorld | #13 | Claude Haiku 4.5 | 50.7% | Anthropic | 2025-10-15 | Source |
| GAIA | #17 | Co-Sight v2.1.0 | 87.04% | ZTE-AICloud | 2025-10-13 | Source |
| GAIA | #18 | JoinAI V1.1 | 86.71% | JoinAI | 2025-10-09 | Source |
| BrowseComp | #62 | DeepMiner-32B | 33.5% | Academic Research | 2025-10-09 | Source |
| Online-Mind2Web | #9 | Gemini 2.5 Computer Use | 69.0% | Google DeepMind | 2025-10-07 | Source |
| AgentBench | #1 | AgentRL w/ Qwen2.5-32B-Instruct | 70.4% | Tsinghua University | 2025-10-05 | Source |
| AgentBench | #2 | AgentRL w/ Qwen2.5-14B-Instruct | 67.7% | Tsinghua University | 2025-10-05 | Source |
| AgentBench | #3 | AgentRL w/ GLM-4-9B-0414 | 65.0% | Tsinghua University | 2025-10-05 | Source |
| AgentBench | #4 | AgentRL w/ Qwen2.5-7B-Instruct | 62.0% | Tsinghua University | 2025-10-05 | Source |
| AgentBench | #5 | AgentRL w/ Qwen2.5-3B-Instruct | 60.0% | Tsinghua University | 2025-10-05 | Source |
| Aider | #12 | DeepSeek-V3.2-Exp (Reasoner) | 74.2% | DeepSeek | 2025-10-03 | Source |
| WebVoyager | #2 | Surfer 2 | 97.1% | H Company | 2025-10 | Source |
| WebArena | #6 | Narada AI | 64.2% | Narada AI | 2025-10 | Source |
| BrowseComp | #51 | GLM-4.6 | 45.1% | Zhipu AI | 2025-10 | Source |
| BrowseComp | #54 | MiniMax M2 | 44.0% | MiniMax | 2025-10 | Source |
| BrowseComp | #58 | DeepSeek-V3.2-Exp | 40.1% | DeepSeek | 2025-09-30 | Source |
| AgentBench | #6 | Claude Sonnet 4.5 | 58.9% | Anthropic | 2025-09-29 | Source |
| AgentBench | #7 | Claude Sonnet 4.5 Thinking | 58.3% | Anthropic | 2025-09-29 | Source |
| OSWorld | #11 | Claude Sonnet 4.5 | 61.4% | Anthropic | 2025-09-29 | Source |
| τ-bench | #12 | Claude Sonnet 4.5 | 62.9% | Anthropic | 2025-09-29 | Source |
| BrowseComp | #70 | InfoAgent | 15.3% | Academic Research | 2025-09-29 | Source |
| OSWorld | #7 | Qwen3 VL 235B | 66.7% | Alibaba | 2025-09-23 | Source |
| BrowseComp | #53 | Grok 4 Fast | 44.9% | xAI | 2025-09-22 | Source |
| BrowseComp | #55 | Tongyi DeepResearch | 43.4% | Alibaba Cloud / Tongyi Lab | 2025-09-16 | Source |
| BrowseComp | #60 | AgentFounder-30B | 39.9% | Alibaba Cloud / Tongyi Lab | 2025-09-16 | Source |
| BrowseComp | #70 | DeepDive-32B | 15.3% | THUDM / Tsinghua University | 2025-09-12 | Source |
| BrowseComp | #69 | WebExplorer-8B (RL) | 15.7% | HKUST NLP Group | 2025-09-08 | Source |
| BrowseComp | #74 | WebSailor-32B | 10.5% | Alibaba Cloud / Tongyi Lab | 2025-09-08 | Source |
| BrowseComp | #78 | WebSailor-7B | 6.7% | Alibaba Cloud / Tongyi Lab | 2025-09-08 | Source |
| Online-Mind2Web | #5 | UI-TARS-2 | 88.2% | ByteDance / VLM-Research | 2025-09-02 | Source |
| WebArena | #5 | DeepSky Agent | 66.9% | DeepSky | 2025-09 | Source |
| Aider | #2 | gpt-5 (medium) | 86.7% | OpenAI | 2025-08-25 | Source |
| Aider | #5 | gpt-5 (low) | 81.3% | OpenAI | 2025-08-25 | Source |
| Aider | #1 | gpt-5 (high) | 88.0% | OpenAI | 2025-08-23 | Source |
| Online-Mind2Web | #13 | ACT-1-20250814 | 57.3% | Enhans | 2025-08-14 | Source |
| BrowseComp | #64 | BrowseMaster | 30.0% | Academic Research | 2025-08-12 | Source |
| WebVoyager | #14 | WebSight | 68% | Academic Research | 2025-08 | Source |
| BrowseComp | #35 | Parallel Ultra8x | 58.0% | Parallel | 2025-08 | Source |
| BrowseComp | #36 | Parallel Ultra4x | 56.0% | Parallel | 2025-08 | Source |
| BrowseComp | #37 | GPT-5 | 54.9% | OpenAI | 2025-08 | Source |
| BrowseComp | #43 | Parallel Ultra2x | 51.0% | Parallel | 2025-08 | Source |
| BrowseComp | #52 | Parallel Ultra | 45.0% | Parallel | 2025-08 | Source |
| BrowseComp | #64 | DeepSeek-V3.1 | 30.0% | DeepSeek | 2025-08 | Source |
| BrowseComp | #72 | Exa Research Pro | 14.0% | Exa | 2025-08 | Source |
| BrowseComp | #77 | Claude Opus 4.1 (Parallel Task API benchmark) | 7.0% | Anthropic | 2025-08 | Source |
| BrowseComp | #79 | Perplexity Sonar Deep Research | 6.0% | Perplexity | 2025-08 | Source |
| Aider | #7 | grok-4 (high) | 79.6% | xAI | 2025-07-11 | Source |
| Online-Mind2Web | #17 | ACT-1-20250703 | 45.7% | Enhans | 2025-07-03 | Source |
| WebVoyager | #3 | Magnitude | 93.9% | Magnitude | 2025-07 | Source |
| BrowseComp | #67 | GLM-4.5 | 26.4% | Zhipu AI | 2025-07 | Source |
| BrowseComp | #68 | GLM-4.5-Air | 21.3% | Zhipu AI | 2025-07 | Source |
| BrowseComp | #73 | WebSailor-72B | 12.0% | Alibaba Cloud / Tongyi Lab | 2025-07 | Source |
| Aider | #3 | o3-pro (high) | 84.9% | OpenAI | 2025-06-28 | Source |
| Aider | #9 | o3 (high) + gpt-4.1 | 78.2% | OpenAI | 2025-06-27 | Source |
| Aider | #5 | o3 (high) | 81.3% | OpenAI | 2025-06-25 | Source |
| Aider | #10 | o3 | 76.9% | OpenAI | 2025-06-25 | Source |
| Aider | #4 | gemini-2.5-pro-preview-06-05 (32k think) | 83.1% | | 2025-06-06 | Source |
| Aider | #8 | gemini-2.5-pro-preview-06-05 (default think) | 79.1% | | 2025-06-06 | Source |
| WebVoyager | #4 | Surfer-H + Holo1 | 92.2% | H Company | 2025-06 | Source |
| WebArena | #19 | GUI-API Hybrid Agent | 38.9% | Academic Research | 2025-06 | Source |
| WebArena | #23 | TTI | 26.1% | Academic Research | 2025-06 | Source |
| BrowseComp | #76 | DeepSeek-R1-0528 | 8.9% | DeepSeek | 2025-05-28 | Source |
| Aider | #14 | claude-opus-4-20250514 (32k thinking) | 72.0% | Anthropic | 2025-05-25 | Source |
| AgentBench | #8 | Claude Sonnet 4 Thinking | 58.2% | Anthropic | 2025-05-22 | Source |
| AgentBench | #9 | Claude Sonnet 4 | 57.4% | Anthropic | 2025-05-22 | Source |
| Aider | #10 | Gemini 2.5 Pro Preview 05-06 | 76.9% | | 2025-05-07 | Source |
| Aider | #14 | o4-mini (high) | 72.0% | OpenAI | 2025-04-16 | Source |
| BrowseComp | #39 | o4-mini | 51.5% | OpenAI | 2025-04-16 | Source |
| BrowseComp | #45 | o3 | 49.7% | OpenAI | 2025-04-16 | Source |
| Aider | #13 | Gemini 2.5 Pro Preview 03-25 | 72.9% | | 2025-04-12 | Source |
| BrowseComp | #39 | OpenAI Deep Research | 51.5% | OpenAI | 2025-04-10 | Source |
| BrowseComp | #75 | OpenAI o1 | 9.9% | OpenAI | 2025-04-10 | Source |
| BrowseComp | #80 | GPT-4o + browsing | 1.9% | OpenAI | 2025-04-10 | Source |
| BrowseComp | #81 | GPT-4.5 | 0.9% | OpenAI | 2025-04-10 | Source |
| BrowseComp | #82 | GPT-4o | 0.6% | OpenAI | 2025-04-10 | Source |
| Online-Mind2Web | #11 | OpenAI Operator | 61.3% | OpenAI | 2025-04-02 | Source |
| WebVoyager | #5 | Browserable | 90.4% | Browserable | 2025-04 | Source |
| WebVoyager | #10 | Notte | 86.2% | Notte | 2025-04 | Source |
| Online-Mind2Web | #19 | HAL Leaderboard baseline (best open) | ~30% | Princeton / OSU | 2025-04 | Source |
| OSWorld | #17 | Qwen2.5 VL 32B Instruct | 5.9% | Alibaba Cloud / Qwen Team | 2025-03-24 | Source |
| OSWorld | #14 | Agent S2 + Claude 3.7 | 34.5% | Simular AI | 2025-03-12 | Source |
| GAIA | #19 | Manus | 86.5% | Monica AI | 2025-03-06 | Source |
| WebVoyager | #15 | Runner H 0.1 | 67% | H Company | 2025-03 | Source |
| AgentBench | #10 | Claude Sonnet 3.7 | 53.2% | Anthropic | 2025-02-24 | Source |
| Online-Mind2Web | #14 | Claude Computer Use 3.7 (w/o thinking) | 56.3% | Anthropic | 2025-02-24 | Source |
| WebArena | #7 | IBM CUGA | 61.7% | IBM | 2025-02-17 | Source |
| GAIA | #20 | Deep Research (o3, cons@64) | 72.57% | OpenAI | 2025-02-02 | Source |
| GAIA | #21 | Deep Research (o3) | 67.36% | OpenAI | 2025-02-02 | Source |
| OSWorld | #16 | Qwen2.5 VL 72B Instruct | 8.8% | Alibaba Cloud / Qwen Team | 2025-01-26 | Source |
| OSWorld | #15 | OpenAI Operator (CUA) | 32.6% | OpenAI | 2025-01-23 | Source |
| WebVoyager | #8 | Operator | 87% | OpenAI | 2025-01 | Source |
| WebArena | #9 | OpenAI Operator | 58.1% | OpenAI | 2025-01 | Source |
| WebVoyager | #11 | Skyvern 2.0 | 85.85% | Skyvern | 2025-01 | Source |
| WebArena | #14 | AgentSymbiotic | 52.1% | Academic Research | 2025-01 | Source |
| WebArena | #15 | Learn-by-Interact | 48.0% | Academic Research | 2025-01 | Source |
| WebArena | #25 | AgentTrek-1.0-32B | 22.4% | xLang AI | 2025-01 | Source |
| WebArena | #29 | NNetNav | 16.3% | Stanford NLP | 2025-01 | Source |
| WebArena | #13 | ScribeAgent + GPT-4o | 53.0% | Academic Research | 2024-12-24 | Source |
| WebVoyager | #6 | Browser Use | 89.1% | Browser Use | 2024-12 | Source |
| WebVoyager | #12 | Project Mariner | 83.5% | | 2024-12 | Source |
| Online-Mind2Web | #19 | Browser Use (gpt-4o) | 30.0% | Browser Use | 2024-11-06 | Source |
| WebVoyager | #8 | Agent Kura | 87.0% | Kura | 2024-11 | Source |
| WebVoyager | #17 | Anthropic Computer Use 3.5 | 56.0% | Anthropic | 2024-11 | Source |
| Online-Mind2Web | #21 | Claude Computer Use 3.5 | 29.0% | Anthropic | 2024-10-22 | Source |
| WebArena | #17 | AgentOccam-Judge | 45.7% | Amazon Science | 2024-10 | Source |
| WebArena | #38 | Synatra-CodeLLama7b | 6.28% | Academic Research | 2024-10 | Source |
| WebArena | #21 | Agent Workflow Memory | 35.5% | Academic Research | 2024-09 | Source |
| WebArena | #10 | Jace.AI (AWA-1.5) | 57.1% | Jace AI | 2024-08 | Source |
| WebArena | #20 | WebPilot | 37.2% | Academic Research | 2024-08 | Source |
| Online-Mind2Web | #22 | Agent-E (gpt-4o) | 28.0% | Emergence AI | 2024-07-16 | Source |
| WebVoyager | #13 | Agent-E | 73.2% | Emergence AI | 2024-07 | Source |
| WebArena | #27 | GPT-4o + Tree Search | 19.2% | Academic Research | 2024-06 | Source |
| WebArena | #31 | gpt-4o-2024-05-13 | 13.05% | OpenAI | 2024-05 | Source |
| WebArena | #33 | Patel et al. + GPT-4 | 9.36% | Academic Research | 2024-05 | Source |
| WebArena | #37 | Llama3-chat-70b | 7.02% | Meta | 2024-04-02 | Source |
| WebVoyager | #18 | WILBUR | 53% | Bardeen / UC Berkeley | 2024-04 | Source |
| WebArena | #22 | SteP | 33.5% | ASAPP Research | 2024-04 | Source |
| WebArena | #24 | BrowserGym + GPT-4 | 23.5% | ServiceNow Research | 2024-04 | Source |
| WebArena | #26 | GPT-4 + Auto Eval | 20.2% | Academic Research | 2024-04 | Source |
| WebArena | #28 | AutoWebGLM | 18.2% | THUDM | 2024-04 | Source |
| WebArena | #43 | Llama3-chat-8b | 3.32% | Meta | 2024-04 | Source |
| WebArena | #40 | Agent-FLAN | 4.68% | InternLM | 2024-03 | Source |
| WebArena | #44 | CodeAct Agent | 2.3% | Academic Research | 2024-02 | Source |
| Online-Mind2Web | #18 | SeeAct (gpt-4o) | 30.7% | OSU NLP | 2024-01-16 | Source |
| WebVoyager | #16 | WebVoyager | 59.1% | Academic Research | 2024-01 | Source |
| WebVoyager | #19 | GPT-4 (All Tools) | 30.8% | OpenAI | 2024-01 | Source |
| WebArena | #46 | Mixtral-8x7B | 1.39% | Mistral AI | 2024-01 | Source |
| WebArena | #36 | Gemini Pro | 7.12% | | 2023-12 | Source |
| WebArena | #39 | Lemur-chat-70b | 5.3% | OpenLemur | 2023-10 | Source |
| WebArena | #42 | AgentLM-70b | 3.81% | THUDM | 2023-10 | Source |
| WebArena | #45 | AgentLM-13b | 1.6% | THUDM | 2023-10 | Source |
| WebArena | #47 | AgentLM-7b | 0.74% | THUDM | 2023-10 | Source |
| WebArena | #48 | FireAct | 0.25% | Academic Research | 2023-10 | Source |
| WebArena | #35 | Qwen-1.5-chat-72b | 7.14% | Qwen | 2023-09 | Source |
| WebArena | #41 | CodeLlama-instruct-34b | 4.06% | Meta | 2023-08 | Source |
| WebArena | #49 | CodeLlama-instruct-7b | 0.0% | Meta | 2023-08 | Source |
| WebArena | #30 | gpt-4-0613 (no not-achievable hint) | 14.9% | OpenAI | 2023-06 | Source |
| WebArena | #32 | gpt-4-0613 (with not-achievable hint) | 11.7% | OpenAI | 2023-06 | Source |
| WebArena | #34 | gpt-3.5-turbo-16k-0613 | 8.87% | OpenAI | 2023-03 | Source |
Scores come from public papers, model cards, repositories, and launch posts. Comparisons are most useful within a single benchmark. Across benchmarks, the evaluators, task sets, judges, attempt budgets, and tool access can all differ, so treat cross-benchmark rankings as directional at best. Use the individual benchmark pages for methodology and interpretation notes.
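The table above is sorted by reported date, and its Reported column mixes month-only values such as `2026-04` with full dates such as `2026-04-02`. The sketch below shows one way to reproduce a newest-first ordering over such values. It assumes a month-only value sorts as the first day of that month, which appears consistent with the row order on this page but is not documented here; `reported_sort_key` is a hypothetical helper name.

```python
from datetime import date


def reported_sort_key(reported: str) -> date:
    """Map 'YYYY-MM' or 'YYYY-MM-DD' to a date usable as a sort key.

    Assumption: a month-only value is treated as day 1 of that month.
    """
    parts = [int(piece) for piece in reported.split("-")]
    if len(parts) == 2:
        year, month = parts
        return date(year, month, 1)
    year, month, day = parts
    return date(year, month, day)


# Reported values taken from rows in the table above.
reported = ["2026-04-02", "2026-06", "2026-05-28", "2026-04"]
newest_first = sorted(reported, key=reported_sort_key, reverse=True)
print(newest_first)  # ['2026-06', '2026-05-28', '2026-04-02', '2026-04']
```

Sorting on the raw strings instead would place `2026-04` ahead of `2026-04-02` in ascending order only by accident of string comparison, so converting to real dates keeps the rule explicit.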