# Steel Benchmark Hub

> Benchmark-specific leaderboards for agent and model evaluation.
> Source: https://leaderboard.steel.dev | Maintained by Steel (https://steel.dev)

Canonical benchmark pages:

| Benchmark | Category | Top tracked entry (score) | Updated | Description | URL |
|-----------|----------|------------------|---------|-------------|-----|
| WebVoyager | Browser agents | Alumnium (98.5%) | 2026-03-22 | WebVoyager benchmark leaderboard for AI browser agents on 643 live-web tasks across 15 popular websites, with source-linked scores and methodology notes. | https://leaderboard.steel.dev/leaderboards/webvoyager/ |
| BrowseComp | Research/search | GPT-5.5 Pro (90.1%) | 2026-03-22 | BrowseComp leaderboard for agentic web research systems solving OpenAI's hard-to-find short-answer browsing benchmark, with sourced scores and setup notes. | https://leaderboard.steel.dev/leaderboards/browsecomp/ |
| WebArena | Browser agents | WebTactix (DeepSeek v3.2) (74.3%) | 2026-05-27 | WebArena leaderboard for autonomous browser agents evaluated on reproducible, self-hosted web tasks across shopping, forum, GitLab, CMS, map, and wiki environments. | https://leaderboard.steel.dev/leaderboards/webarena/ |
| SWE-bench Verified | Coding | Claude Mythos (93.9%) | 2026-03-22 | SWE-bench Verified leaderboard for coding agents resolving 500 human-filtered real GitHub issues with Docker-based test execution. | https://leaderboard.steel.dev/leaderboards/swe-bench-verified/ |
| OSWorld | Computer use | Mythos Preview (79.6%) | 2026-04-16 | OSWorld leaderboard for multimodal computer-use agents completing 369 real desktop tasks with execution-based verification. | https://leaderboard.steel.dev/leaderboards/osworld/ |
| GAIA | Model evals / reasoning | OPS-Agentic-Search (92.36%) | 2026-04-16 | GAIA leaderboard for general AI assistants answering 466 real-world questions with reasoning, web browsing, tools, and exact final answers. | https://leaderboard.steel.dev/leaderboards/gaia/ |
| ClawBench | Browser agents | Claude Sonnet 4.6 (33.3%) | 2026-04-16 | ClawBench leaderboard for browser agents completing 153 everyday state-changing tasks on 144 live production websites. | https://leaderboard.steel.dev/leaderboards/clawbench/ |
| Online-Mind2Web | Browser agents | Browser Use Cloud (bu-max) (97.0%) | 2026-04-16 | Online-Mind2Web leaderboard for live web agents on 300 realistic tasks across 136 websites, including human and WebJudge evaluation notes. | https://leaderboard.steel.dev/leaderboards/online-mind2web/ |
| τ-bench | Model evals / reasoning | Step-3.5-Flash (88.2%) | 2026-04-16 | τ-bench leaderboard for conversational AI agents collaborating with users across complex enterprise domains, emphasizing policy adherence and pass^k reliability. | https://leaderboard.steel.dev/leaderboards/tau-bench/ |
| AgentBench | Model evals / reasoning | AgentRL w/ Qwen2.5-32B-Instruct (70.4%) | 2026-04-16 | AgentBench leaderboard for LLM agents across 8 interactive environments, with a focus on function-calling and tool-use results. | https://leaderboard.steel.dev/leaderboards/agentbench/ |
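
For readers consuming this file programmatically, the following is a minimal sketch of splitting one row of the table above into fields. It assumes cells never contain a literal `|` and that the third column always ends in a parenthesized percentage, which holds for the rows listed here but is not a documented guarantee. The `parse_row` helper and its field names are hypothetical and are not part of any Steel API.

```python
import re

# One row copied verbatim from the table above.
SAMPLE = (
    "| ClawBench | Browser agents | Claude Sonnet 4.6 (33.3%) | 2026-04-16 | "
    "ClawBench leaderboard for browser agents completing 153 everyday "
    "state-changing tasks on 144 live production websites. | "
    "https://leaderboard.steel.dev/leaderboards/clawbench/ |"
)

# Assumed shape of the third column: "<entry name> (<score>%)".
# The greedy first group keeps parentheses that belong to the entry name.
TOP_ENTRY = re.compile(r"(.+) \(([\d.]+)%\)")


def parse_row(line: str) -> dict:
    """Split a pipe-delimited table row into named fields (hypothetical helper)."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != 6:
        raise ValueError(f"expected 6 cells, got {len(cells)}")
    benchmark, category, top, updated, description, url = cells
    match = TOP_ENTRY.fullmatch(top)
    if match is None:
        raise ValueError(f"unexpected top-entry format: {top!r}")
    return {
        "benchmark": benchmark,
        "category": category,
        "top_entry": match.group(1),
        "score": float(match.group(2)),  # percentage as printed in the table
        "updated": updated,
        "description": description,
        "url": url,
    }


row = parse_row(SAMPLE)
print(row["benchmark"], row["top_entry"], row["score"])
```

Entry names that themselves contain parentheses, such as `WebTactix (DeepSeek v3.2)`, are kept intact because only the final parenthesized group is treated as the score.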

## Featured

- WebVoyager remains the flagship browser-agent benchmark module on the homepage.
- Each benchmark page uses the same section order: Leaderboard -> About -> Example tasks -> Methodology -> Links -> Related benchmarks -> FAQ.

- [Full context file](https://leaderboard.steel.dev/llms-full.txt)