OSWorld Leaderboard 2026: Latest Computer Use Agent Scores

Leaderboard

System / Submission	Score	Organization	Reported	Source
Claude Opus 4.8	83.4%	Anthropic	May 2026	Source
Mythos Preview	79.6%	Anthropic	Apr 2026	Source
OSAgent	76.26%	TheAGI Company	Oct 2025	Source
GPT-5.4	75.0%	OpenAI	Mar 2026	Source
Claude Opus 4.6	72.7%	Anthropic	Feb 2026	Source
Claude Sonnet 4.6	72.5%	Anthropic	Feb 2026	Source
Qwen3 VL 235B	66.7%	Alibaba	Sep 2025	Source
Claude Opus 4.5	66.3%	Anthropic	Nov 2025	Source
Kimi K2.5	63.3%	Moonshot AI	Jan 2026	Source
GLM-5V-Turbo	62.3%	Zhipu AI	Apr 2026	Source
Claude Sonnet 4.5	61.4%	Anthropic	Sep 2025	Source
UiPath Screen Agent	53.6%	UiPath	Jan 2026	Source
Claude Haiku 4.5	50.7%	Anthropic	Oct 2025	Source
Agent S2 + Claude 3.7	34.5%	Simular AI	Mar 2025	Source
OpenAI Operator (CUA)	32.6%	OpenAI	Jan 2025	Source
Qwen2.5 VL 72B Instruct	8.8%	Alibaba Cloud / Qwen Team	Jan 2025	Source
Qwen2.5 VL 32B Instruct	5.9%	Alibaba Cloud / Qwen Team	Mar 2025	Source
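
For readers who want to work with these rows programmatically, the following is a minimal sketch of parsing the tab-separated table and picking the top tracked row per organization. The helper names `parse_rows` and `best_per_org` and the derived `score_pct` field are illustrative assumptions, not part of any published tooling, and only three rows from the table are embedded to keep the example short.

```python
import csv
import io

# Three rows copied from the table above, tab-separated as on the page.
RAW = (
    "System / Submission\tScore\tOrganization\tReported\n"
    "OSAgent\t76.26%\tTheAGI Company\tOct 2025\n"
    "Claude Opus 4.6\t72.7%\tAnthropic\tFeb 2026\n"
    "Qwen3 VL 235B\t66.7%\tAlibaba\tSep 2025\n"
)


def parse_rows(raw):
    """Parse tab-separated leaderboard text into a list of dicts."""
    reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
    rows = []
    for rec in reader:
        # Hypothetical derived field: the score as a float, without "%".
        rec["score_pct"] = float(rec["Score"].rstrip("%"))
        rows.append(rec)
    return rows


def best_per_org(rows):
    """Return the highest-scoring row for each organization."""
    best = {}
    for row in rows:
        org = row["Organization"]
        if org not in best or row["score_pct"] > best[org]["score_pct"]:
            best[org] = row
    return best


rows = parse_rows(RAW)
for org, row in best_per_org(rows).items():
    print(org, row["System / Submission"], row["score_pct"])
```

Ranking by the parsed score alone ignores the setup differences discussed further down the page, so treat the output as a convenience view rather than a like-for-like comparison.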

About this benchmark

OSWorld evaluates multimodal computer-use agents in real desktop environments across 369 tasks involving web apps, desktop software, files, and workflows spanning multiple applications.

It is valuable for teams building GUI agents because tasks require visual grounding, keyboard and mouse execution, OS knowledge, and error recovery, not only text planning.

Modern reports often distinguish original OSWorld, OSWorld-Verified, and submitter-run variants; read source details before comparing human-level claims.

Self-reported and independently verified rows coexist; setup differences can matter as much as the model.

Example tasks

Three public tasks quoted from benchmark sources:

"Can you enable the 'Do Not Track' feature in Chrome to enhance my online privacy?" Citation: OSWorld example JSON
"Can you make my computer bring back the last tab I shut down?" Citation: OSWorld example JSON
"Computer, please navigate to the area in my browser settings where my passwords are stored. I want to check my login information for Etsy without revealing it just yet." Citation: OSWorld example JSON

Methodology

Original OSWorld uses execution-based validators that check final computer state after the agent acts in configured VM environments.
The reported metric is success rate, the share of tasks whose validator passes; the original paper reported a 72.36% human baseline and 12.24% for the best early model.
OSWorld-Verified adds independent or standardized re-runs for some systems; self-reported rows can use different max steps, OS images, and tool permissions.
We track public results with source URLs and note whether the source claims verified or independent execution.
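
The idea of execution-based validation and a success-rate metric can be sketched roughly as follows. This is an illustration under stated assumptions, not the OSWorld harness: the `Task` class, the dictionary standing in for final VM state, and the state keys such as `chrome.enable_do_not_track` and `open_tabs` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for the final machine state an evaluator inspects.
State = Dict[str, object]


@dataclass
class Task:
    task_id: str
    instruction: str
    check: Callable[[State], bool]  # validator run on the final state


def success_rate(tasks: List[Task], final_states: Dict[str, State]) -> float:
    """Percentage of tasks whose validator passes on the final state."""
    if not tasks:
        raise ValueError("no tasks to score")
    passed = 0
    for task in tasks:
        state = final_states.get(task.task_id)
        # A missing state (for example, a crashed run) counts as a failure.
        if state is not None and task.check(state):
            passed += 1
    return 100.0 * passed / len(tasks)


# Two toy tasks loosely modeled on the example prompts; the keys are made up.
TASKS = [
    Task("dnt", "Enable 'Do Not Track' in Chrome",
         lambda s: s.get("chrome.enable_do_not_track") is True),
    Task("reopen", "Bring back the last closed tab",
         lambda s: "https://example.com" in s.get("open_tabs", [])),
]

# Final states after a pretend agent run: the first task succeeded,
# the second did not restore the tab.
FINAL = {
    "dnt": {"chrome.enable_do_not_track": True},
    "reopen": {"open_tabs": []},
}

rate = success_rate(TASKS, FINAL)
print(f"success rate: {rate:.1f}%")
```

The validator only looks at the end state, so an agent gets no credit for a plausible-looking action sequence that leaves the machine unchanged.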

Related benchmarks

Compare this benchmark with related pages from the hub:

WebArena, WebVoyager, Online-Mind2Web

Frequently asked questions

Which system is currently best on OSWorld?

Claude Opus 4.8 currently leads with a tracked score of 83.4%. The ranking reflects submitted system setups (model plus tools and policy), not just a base model. This answer is based on our latest tracked results, last updated May 28, 2026.

What should I read into an OSWorld score?

OSWorld scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.

Are these independently verified?

Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.

Can I compare every row directly?

Use caution. Rows can vary by evaluator, harness, attempt budget, tool access, task filtering, or verification level. Source links and notes are part of the score, not decoration.
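
One way to act on this caution is to rank rows only within groups that share the same setup. The sketch below assumes each row has been annotated by hand with a benchmark variant, a verification flag, and an attempt budget; those fields, the `Row` type, and the placeholder systems and scores are hypothetical and do not describe any entry in the table.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Row(NamedTuple):
    system: str
    score_pct: float
    variant: str    # hypothetical annotation: "original" or "verified"
    verified: bool  # hypothetical annotation: independently re-run?
    max_steps: int  # hypothetical annotation: attempt budget


GroupKey = Tuple[str, bool, int]


def comparable_groups(rows: List[Row]) -> Dict[GroupKey, List[Row]]:
    """Group rows by setup, then rank by score inside each group."""
    groups: Dict[GroupKey, List[Row]] = defaultdict(list)
    for row in rows:
        groups[(row.variant, row.verified, row.max_steps)].append(row)
    return {
        key: sorted(members, key=lambda r: r.score_pct, reverse=True)
        for key, members in groups.items()
    }


# Placeholder rows with made-up values, for illustration only.
ROWS = [
    Row("system-b", 65.0, "verified", True, 100),
    Row("system-a", 70.0, "verified", True, 100),
    Row("system-c", 80.0, "original", False, 50),
]

groups = comparable_groups(ROWS)
for key, members in groups.items():
    print(key, [m.system for m in members])
```

Here system-c has the highest raw score but sits in its own group, so the sketch never ranks it against the two verified rows.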