OSWorld leaderboard
Last updated: 2026-04-16
About this benchmark
OSWorld evaluates computer-use agents on 369 real desktop tasks spanning Ubuntu, Windows, and macOS, covering web apps, desktop software, file I/O, and multi-application workflows.
It is the most widely adopted computer-use benchmark for comparing end-to-end system performance in realistic GUI environments with execution-based evaluation.
Rankings reflect full agent stacks. Model choice, GUI grounding approach, planning strategy, and error-recovery design all contribute meaningfully to observed scores.
Self-reported and independently verified rows coexist in the table below; check each row's source before comparing scores directly.
Methodology
- Evaluation is execution-based: each task has a deterministic verifier that checks whether the final computer state matches the expected outcome.
- OSWorld-Verified is a stricter variant where the research team independently runs agent code; self-reported rows on the main leaderboard are unaudited.
- The human baseline is ~72.4%, making it a useful calibration point when reading scores.
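To make "execution-based" concrete, here is a minimal sketch of how a deterministic verifier works, using a hypothetical file-writing task. OSWorld's real verifiers are task-specific scripts that inspect the VM's final state (files, application configs, system settings); the task format and function names below are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of execution-based evaluation: the agent's
# trajectory is ignored, and only the final computer state is checked.
from pathlib import Path
import tempfile

def verify_file_task(expected_path: Path, expected_text: str) -> float:
    """Deterministic check: score 1.0 iff the final state matches."""
    if expected_path.exists() and expected_path.read_text().strip() == expected_text:
        return 1.0
    return 0.0

# Simulate an agent run that completed the task by writing the target file.
workdir = Path(tempfile.mkdtemp())
target = workdir / "report.txt"
target.write_text("Q3 revenue: 1.2M\n")

score = verify_file_task(target, "Q3 revenue: 1.2M")
print(score)  # 1.0, since the final state matches the expected outcome
```

Because the check runs against the resulting state rather than the agent's actions, any action sequence that produces the correct outcome scores full credit, which is why scaffold and planning choices matter as much as the underlying model.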
Links
OSWorld
| Rank | System / Submission | Score | Organization | Notes | Source |
|---|---|---|---|---|---|
| 1 | Mythos Preview | 79.6% | Anthropic | Reported on Anthropic's Glasswing page. | Source |
| 2 | OSAgent | 76.26% | TheAGI Company | Self-reported October 2025; trained with RL on OSWorld VMs and internal browser environments. | Source |
| 3 | GPT-5.4 | 75.0% | OpenAI | Self-reported at GPT-5.4 launch on OSWorld-Verified; awaiting independent verification. | Source |
| 4 | Claude Opus 4.6 | 72.7% | Anthropic | Reported on Anthropic's Glasswing page. | Source |
| 5 | Claude Sonnet 4.6 | 72.5% | Anthropic | Independently assessed; within 0.2 points of Opus 4.6 at significantly lower cost. | Source |
| 6 | Qwen3 VL 235B | 66.7% | Alibaba | Strongest open-source model on OSWorld; self-reported. | Source |
| 7 | Claude Opus 4.5 | 66.3% | Anthropic | OSWorld-Verified self-reported result; reported on anthropic.com. | Source |
| 8 | Kimi K2.5 | 63.3% | Moonshot AI | Self-reported in technical paper; GUI-only actions without external tools on OSWorld-Verified. | Source |
| 9 | GLM-5V-Turbo | 62.3% | Zhipu AI | Self-reported VLM result; reported on docs.z.ai. | Source |
| 10 | Claude Sonnet 4.5 | 61.4% | Anthropic | OSWorld-Verified, official framework, 100 max steps, 4-run avg; reported on anthropic.com. | Source |
| 11 | UiPath Screen Agent | 53.6% | UiPath | OSWorld-Verified independently verified result; enterprise automation scaffold on Claude Opus 4.5. | Source |
| 12 | Claude Haiku 4.5 | 50.7% | Anthropic | Self-reported result; reported on anthropic.com. | Source |
| 13 | Agent S2 + Claude 3.7 | 34.5% | Simular AI | Open-source modular agent; evaluated on 50-step OSWorld tasks. | Source |
| 14 | OpenAI Operator (CUA) | 32.6% | OpenAI | Self-reported on 50-step OSWorld tasks at Operator launch. | Source |
| 15 | Qwen2.5 VL 72B Instruct | 8.8% | Alibaba Cloud / Qwen Team | Self-reported result; reported on huggingface.co. | Source |
| 16 | Qwen2.5 VL 32B Instruct | 5.9% | Alibaba Cloud / Qwen Team | Self-reported result; reported on huggingface.co. | Source |
Related benchmarks
Compare this benchmark with related pages from the hub: