OSWorld leaderboard
Last updated: 2026-04-16
About this benchmark
OSWorld evaluates computer-use agents on 369 real desktop tasks spanning Ubuntu, Windows, and macOS, covering web apps, desktop software, file I/O, and multi-application workflows.
It is the most widely adopted computer-use benchmark for comparing end-to-end system performance in realistic GUI environments with execution-based evaluation.
Rankings reflect full agent stacks. Model choice, GUI grounding approach, planning strategy, and error-recovery design all contribute meaningfully to observed scores.
Self-reported and independently verified rows coexist in the table below; check each row's source before comparing scores directly.
Methodology
- Evaluation is execution-based: each task has a deterministic verifier that checks whether the final computer state matches the expected outcome.
- OSWorld-Verified is a stricter variant where the research team independently runs agent code; self-reported rows on the main leaderboard are unaudited.
- The human baseline is ~72.4%, making it a useful calibration point when reading scores.
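To make "execution-based" concrete, here is a minimal sketch of how a deterministic verifier works, using a hypothetical file-writing task. OSWorld's real verifiers are task-specific scripts that inspect the VM's final state (files, application configs, system settings); the task format and function names below are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of execution-based evaluation: the agent's
# trajectory is ignored, and only the final computer state is checked.
from pathlib import Path
import tempfile

def verify_file_task(expected_path: Path, expected_text: str) -> float:
    """Deterministic check: score 1.0 iff the final state matches."""
    if expected_path.exists() and expected_path.read_text().strip() == expected_text:
        return 1.0
    return 0.0

# Simulate an agent run that completed the task by writing the target file.
workdir = Path(tempfile.mkdtemp())
target = workdir / "report.txt"
target.write_text("Q3 revenue: 1.2M\n")

score = verify_file_task(target, "Q3 revenue: 1.2M")
print(score)  # 1.0, since the final state matches the expected outcome
```

Because the check runs against the resulting state rather than the agent's actions, any action sequence that produces the correct outcome scores full credit, which is why scaffold and planning choices matter as much as the underlying model.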
Links
OSWorld
| Rank | System / Submission | Score | Organization | Notes | Source |
|---|---|---|---|---|---|
| 1 | Mythos Preview | 79.6% | Anthropic | Reported on Anthropic's Glasswing page. | Source |
| 2 | OSAgent | 76.26% | TheAGI Company | Self-reported October 2025; trained with RL on OSWorld VMs and internal browser environments. | Source |
| 3 | GPT-5.4 | 75.0% | OpenAI | Self-reported at GPT-5.4 launch on OSWorld-Verified; awaiting independent verification. | Source |
| 4 | Claude Opus 4.6 | 72.7% | Anthropic | Reported on Anthropic's Glasswing page. | Source |
| 5 | Claude Sonnet 4.6 | 72.5% | Anthropic | Independently assessed; within 0.2 points of Opus 4.6 at significantly lower cost. | Source |
| 6 | Qwen3 VL 235B | 66.7% | Alibaba | Strongest open-source model on OSWorld; self-reported. | Source |
| 7 | Claude Opus 4.5 | 66.3% | Anthropic | OSWorld-Verified self-reported result; reported on anthropic.com. | Source |
| 8 | Kimi K2.5 | 63.3% | Moonshot AI | Self-reported in technical paper; GUI-only actions without external tools on OSWorld-Verified. | Source |
| 9 | GLM-5V-Turbo | 62.3% | Zhipu AI | Self-reported VLM result; reported on docs.z.ai. | Source |
| 10 | Claude Sonnet 4.5 | 61.4% | Anthropic | OSWorld-Verified, official framework, 100 max steps, 4-run avg; reported on anthropic.com. | Source |
| 11 | UiPath Screen Agent | 53.6% | UiPath | OSWorld-Verified independently verified result; enterprise automation scaffold on Claude Opus 4.5. | Source |
| 12 | Claude Haiku 4.5 | 50.7% | Anthropic | Self-reported result; reported on anthropic.com. | Source |
| 13 | Agent S2 + Claude 3.7 | 34.5% | Simular AI | Open-source modular agent; evaluated on 50-step OSWorld tasks. | Source |
| 14 | OpenAI Operator (CUA) | 32.6% | OpenAI | Self-reported on 50-step OSWorld tasks at Operator launch. | Source |
| 15 | Qwen2.5 VL 72B Instruct | 8.8% | Alibaba Cloud / Qwen Team | Self-reported result; reported on huggingface.co. | Source |
| 16 | Qwen2.5 VL 32B Instruct | 5.9% | Alibaba Cloud / Qwen Team | Self-reported result; reported on huggingface.co. | Source |
Related benchmarks
Compare this benchmark with related pages from the hub: