OSWorld leaderboard

This page tracks OSWorld results in a standard layout: an overview of the benchmark, the leaderboard table, and a short FAQ.

Last updated: 2026-04-16

About this benchmark

OSWorld evaluates computer-use agents across 369 real desktop tasks spanning Ubuntu, Windows, and macOS — covering web apps, desktop software, file I/O, and multi-application workflows.

It is the most widely adopted computer-use benchmark for comparing end-to-end system performance in realistic GUI environments with execution-based evaluation.

Rankings reflect full agent stacks. Model choice, GUI grounding approach, planning strategy, and error-recovery design all contribute meaningfully to observed scores.
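
To make that concrete, here is a minimal, hypothetical sketch of the loop a computer-use agent runs. All names below (FakeDesktop, FakeModel, run_episode) are stand-ins invented for illustration, not any vendor's API; the point is where model choice, grounding, planning, and error recovery each plug into the same loop.

    from dataclasses import dataclass

    @dataclass
    class Action:
        kind: str        # e.g. "click", "type", "key"
        x: int = 0
        y: int = 0
        text: str = ""

    class FakeDesktop:
        """Stand-in environment so the sketch runs end to end."""
        def __init__(self) -> None:
            self.steps = 0
        def screenshot(self) -> bytes:
            return b"<pixels>"
        def execute(self, action: Action) -> bool:
            self.steps += 1
            return True
        def task_done(self) -> bool:
            return self.steps >= 3

    class FakeModel:
        """Stand-in for the planning and grounding components."""
        def plan(self, screenshot: bytes, history: list) -> str:
            return "click the Save button"              # planning strategy
        def ground(self, plan: str, screenshot: bytes) -> Action:
            return Action(kind="click", x=640, y=400)   # GUI grounding: intent -> pixels

    def run_episode(env: FakeDesktop, model: FakeModel, max_steps: int = 100) -> bool:
        history: list = []
        for _ in range(max_steps):
            shot = env.screenshot()              # observe the current screen
            plan = model.plan(shot, history)     # decide the next step
            action = model.ground(plan, shot)    # translate intent into coordinates
            if env.execute(action):
                history.append(("ok", action))
            else:
                history.append(("error", plan))  # error recovery: feed failures back
            if env.task_done():                  # execution-based success check
                return True
        return False

    print(run_episode(FakeDesktop(), FakeModel()))  # -> True after a few fake steps

Two stacks built on the same base model can land far apart on the leaderboard depending on how each of these stages is implemented.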

Self-reported and independently verified rows coexist — check the source before comparing directly.

Methodology

  • Evaluation is execution-based: each task has a deterministic verifier that checks whether the final computer state matches the expected outcome (a minimal sketch follows this list).
  • OSWorld-Verified is a stricter variant where the research team independently runs agent code; self-reported rows on the main leaderboard are unaudited.
  • The human baseline is ~72.4%, making it a useful calibration point when reading scores.
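
As a minimal sketch of what "execution-based" means here, the hypothetical verifier below inspects final machine state directly instead of judging the agent's transcript. The file names, expected text, and the verify_final_state function are invented for illustration; real OSWorld tasks ship their own per-task evaluation configurations.

    from pathlib import Path

    # Hypothetical check for a task such as "rename report.txt to
    # report_final.txt and replace its contents with the given summary".
    # Illustrative only; not an actual OSWorld task definition.
    EXPECTED_PATH = Path.home() / "Documents" / "report_final.txt"
    EXPECTED_TEXT = "Q3 summary: revenue up 12%."

    def verify_final_state() -> float:
        """Return 1.0 if the desktop ended in the expected state, else 0.0."""
        old_path = EXPECTED_PATH.parent / "report.txt"
        if not EXPECTED_PATH.exists():   # the renamed file must exist
            return 0.0
        if old_path.exists():            # the original must be gone
            return 0.0
        # Contents must match exactly: deterministic, no model-based judging.
        if EXPECTED_PATH.read_text(encoding="utf-8").strip() != EXPECTED_TEXT:
            return 0.0
        return 1.0

    score = verify_final_state()  # the task's binary reward

Because the verifier only looks at the resulting state, any action sequence that reaches that state scores identically, and reruns are reproducible.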

Links

OSWorld

Rank | System / Submission | Score | Organization | Notes
1 | Mythos Preview | 79.6% | Anthropic | Reported on Anthropic's Glasswing page.
2 | OSAgent | 76.26% | TheAGI Company | Self-reported October 2025; trained with RL on OSWorld VMs and internal browser environments.
3 | GPT-5.4 | 75.0% | OpenAI | Self-reported at GPT-5.4 launch on OSWorld-Verified; awaiting independent verification.
4 | Claude Opus 4.6 | 72.7% | Anthropic | Reported on Anthropic's Glasswing page.
5 | Claude Sonnet 4.6 | 72.5% | Anthropic | Independently assessed; within 0.2 points of Opus 4.6 at significantly lower cost.
6 | Qwen3 VL 235B | 66.7% | Alibaba | Strongest open-source model on OSWorld; self-reported.
7 | Claude Opus 4.5 | 66.3% | Anthropic | OSWorld-Verified self-reported result; reported on anthropic.com.
8 | Kimi K2.5 | 63.3% | Moonshot AI | Self-reported in technical paper; GUI-only actions without external tools on OSWorld-Verified.
9 | GLM-5V-Turbo | 62.3% | Zhipu AI | Self-reported VLM result; reported on docs.z.ai.
10 | Claude Sonnet 4.5 | 61.4% | Anthropic | OSWorld-Verified, official framework, 100 max steps, 4-run average; reported on anthropic.com.
11 | UiPath Screen Agent | 53.6% | UiPath | OSWorld-Verified, independently verified; enterprise automation scaffold on Claude Opus 4.5.
12 | Claude Haiku 4.5 | 50.7% | Anthropic | Self-reported result; reported on anthropic.com.
13 | Agent S2 + Claude 3.7 | 34.5% | Simular AI | Open-source modular agent; evaluated on 50-step OSWorld tasks.
14 | OpenAI Operator (CUA) | 32.6% | OpenAI | Self-reported on 50-step OSWorld tasks at Operator launch.
15 | Qwen2.5 VL 72B Instruct | 8.8% | Alibaba Cloud / Qwen Team | Self-reported result; reported on huggingface.co.
16 | Qwen2.5 VL 32B Instruct | 5.9% | Alibaba Cloud / Qwen Team | Self-reported result; reported on huggingface.co.

Frequently asked questions

Which system is currently best on OSWorld?
Mythos Preview currently leads with a tracked score of 79.6%. The ranking reflects the submitted system setup (model plus tools and policy), not just the base model. Based on our latest tracked results, last updated Apr 16, 2026.
What should I read into an OSWorld score?
OSWorld scores are most useful for within-benchmark ranking. Read the Notes column for setup context, and consult the methodology section above before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and others are team-reported. Check each row's notes and cited source to confirm the evidence level before drawing strong conclusions.