Leaderboard
| System / Submission | Notes | Score | Organization | Source |
|---|---|---|---|---|
| Claude Opus 4.8 | OSWorld-Verified pass@1 (361 tasks, 100 steps, max effort, multi-run avg) using Anthropic's revised harness (zoom-tool fix, 128K tokens/turn); not directly comparable to older entries. Self-reported in the Opus 4.8 system card. | 83.4% | Anthropic | Source |
| Mythos Preview | Reported on Anthropic's Glasswing page. | 79.6% | Anthropic | Source |
| OSAgent | Self-reported October 2025; trained with RL on OSWorld VMs and internal browser environments. | 76.26% | TheAGI Company | Source |
| GPT-5.4 | Self-reported at GPT-5.4 launch on OSWorld-Verified; awaiting independent verification. | 75.0% | OpenAI | Source |
| Claude Opus 4.6 | Reported on Anthropic's Glasswing page. | 72.7% | Anthropic | Source |
| Claude Sonnet 4.6 | Independently assessed; within 0.2 points of Opus 4.6 at significantly lower cost. | 72.5% | Anthropic | Source |
| Qwen3 VL 235B | Strongest open-source model on OSWorld; self-reported. | 66.7% | Alibaba | Source |
| Claude Opus 4.5 | OSWorld-Verified self-reported result; reported on anthropic.com. | 66.3% | Anthropic | Source |
| Kimi K2.5 | Self-reported in technical paper; GUI-only actions without external tools on OSWorld-Verified. | 63.3% | Moonshot AI | Source |
| GLM-5V-Turbo | Self-reported VLM result; reported on docs.z.ai. | 62.3% | Zhipu AI | Source |
| Claude Sonnet 4.5 | OSWorld-Verified, official framework, 100 max steps, 4-run avg; reported on anthropic.com. | 61.4% | Anthropic | Source |
| UiPath Screen Agent | OSWorld-Verified independently verified result; enterprise automation scaffold on Claude Opus 4.5. | 53.6% | UiPath | Source |
| Claude Haiku 4.5 | Self-reported result; reported on anthropic.com. | 50.7% | Anthropic | Source |
| Agent S2 + Claude 3.7 | Open-source modular agent; evaluated on 50-step OSWorld tasks. | 34.5% | Simular AI | Source |
| OpenAI Operator (CUA) | Self-reported on 50-step OSWorld tasks at Operator launch. | 32.6% | OpenAI | Source |
| Qwen2.5 VL 72B Instruct | Self-reported result; reported on huggingface.co. | 8.8% | Alibaba Cloud / Qwen Team | Source |
| Qwen2.5 VL 32B Instruct | Self-reported result; reported on huggingface.co. | 5.9% | Alibaba Cloud / Qwen Team | Source |
About this benchmark
OSWorld evaluates multimodal computer-use agents in real desktop environments across 369 tasks involving web apps, desktop software, files, and workflows spanning multiple applications.
It is valuable for teams building GUI agents because tasks require visual grounding, keyboard and mouse execution, OS knowledge, and error recovery, not only text planning.
Recent reports often distinguish the original OSWorld, OSWorld-Verified, and submitter-run variants; read the source details before comparing claims of human-level performance.
Self-reported and independently verified rows coexist; setup differences can matter as much as the model.
Example tasks
Three public tasks quoted from benchmark sources:
- "Can you enable the 'Do Not Track' feature in Chrome to enhance my online privacy?" Citation: OSWorld example JSON
- "Can you make my computer bring back the last tab I shut down?" Citation: OSWorld example JSON
- "Computer, please navigate to the area in my browser settings where my passwords are stored. I want to check my login information for Etsy without revealing it just yet." Citation: OSWorld example JSON
Methodology
- Original OSWorld uses execution-based validators that check final computer state after the agent acts in configured VM environments.
- Reported metric is success rate; the original paper reported a 72.36% human baseline and 12.24% for the best early model.
- OSWorld-Verified adds independent or standardized re-runs for some systems; self-reported rows can use different max steps, OS images, and tool permissions.
- We track public results with source URLs and note whether the source claims verified or independent execution.
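As a rough illustration of the success-rate metric described above, the sketch below computes a per-run success rate and a multi-run average. The task IDs, the pass/fail values, and the function names are hypothetical and are not taken from the OSWorld codebase; in a real evaluation each boolean would come from the benchmark's execution-based validators.

```python
from statistics import mean


def success_rate(results: dict[str, bool]) -> float:
    """Percentage of tasks whose final-state check passed in a single run."""
    if not results:
        raise ValueError("no task results")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(results.values()) / len(results)


def multi_run_average(runs: list[dict[str, bool]]) -> float:
    """Mean of the per-run success rates (a 'multi-run avg' style figure)."""
    return mean(success_rate(run) for run in runs)


# Hypothetical outcomes for two runs over the same four tasks.
runs = [
    {"task_a": True, "task_b": False, "task_c": True, "task_d": True},
    {"task_a": True, "task_b": True, "task_c": False, "task_d": False},
]

per_run = [success_rate(run) for run in runs]
avg = multi_run_average(runs)
print(per_run, avg)  # [75.0, 50.0] 62.5
```

Averaging over runs smooths out run-to-run variance, which is one reason a multi-run average and a single-run figure may not be directly comparable.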
Frequently asked questions
Which system is currently best on OSWorld?
Claude Opus 4.8 currently leads with a tracked score of 83.4%, although that result uses Anthropic's revised harness and is not directly comparable to older entries. The ranking reflects submitted system setups (model plus tools and policy), not base models alone. Based on our latest tracked results, last updated May 28, 2026.
What should I read into an OSWorld score?
OSWorld scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.
Can I compare every row directly?
Use caution. Rows can vary by evaluator, harness, attempt budget, tool access, task filtering, or verification level. Source links and notes are part of the score, not decoration.
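One way to act on this caution is to rank rows only within groups that share a setup attribute. The sketch below does this for step budget, using a handful of rows from the leaderboard; the `Row` type and the `max_steps` field are a simplified, hypothetical encoding of the notes, with `None` where the notes do not state a step budget.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    system: str
    score: float
    max_steps: Optional[int]  # None when the notes do not state a step budget


# Scores and step budgets copied from the leaderboard notes.
ROWS = [
    Row("Claude Opus 4.8", 83.4, 100),
    Row("GPT-5.4", 75.0, None),
    Row("Claude Sonnet 4.5", 61.4, 100),
    Row("Agent S2 + Claude 3.7", 34.5, 50),
    Row("OpenAI Operator (CUA)", 32.6, 50),
]


def rank_within_step_budget(rows: list[Row]) -> dict[Optional[int], list[Row]]:
    """Group rows by step budget, then sort each group by score, best first."""
    groups: dict[Optional[int], list[Row]] = defaultdict(list)
    for row in rows:
        groups[row.max_steps].append(row)
    return {
        budget: sorted(group, key=lambda r: r.score, reverse=True)
        for budget, group in groups.items()
    }


ranked = rank_within_step_budget(ROWS)
for budget, group in ranked.items():
    label = "unstated" if budget is None else f"{budget} steps"
    print(label, [(r.system, r.score) for r in group])
```

Step budget is only one axis. Rows in the same group can still differ in harness, tool access, or verification level, so a shared budget is a necessary condition for comparison here, not a sufficient one.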