About this benchmark

OSWorld evaluates multimodal computer-use agents in real desktop environments across 369 tasks involving web apps, desktop software, files, and workflows spanning multiple applications.

It is valuable for teams building GUI agents because tasks require visual grounding, keyboard and mouse execution, OS knowledge, and error recovery, not only text planning.

Recent reports often distinguish the original OSWorld, OSWorld-Verified, and submitter-run variants; check which variant a source used before comparing claims of human-level performance.

Self-reported and independently verified rows coexist; setup differences can matter as much as the model.

Example tasks

Three public tasks quoted from benchmark sources:

Methodology

  • Original OSWorld uses execution-based validators that check final computer state after the agent acts in configured VM environments.
  • The reported metric is success rate; the original paper reported a 72.36% human baseline and 12.24% for the best early model.
  • OSWorld-Verified adds independent or standardized re-runs for some systems; self-reported rows can use different max steps, OS images, and tool permissions.
  • We track public results with source URLs and note whether the source claims verified or independent execution.
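The scoring idea above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: it assumes the final computer state can be represented as a simple mapping from file name to content, and the names `TaskResult`, `validate_final_state`, and `success_rate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    task_id: str  # hypothetical identifier, not a real OSWorld task ID
    passed: bool


def validate_final_state(final_state: Dict[str, str], expected: Dict[str, str]) -> bool:
    """Pass only if every expected entry is present with the expected value.

    Assumption: state is modeled as {file name: content}. Real validators
    inspect the VM itself (files, app settings, and so on).
    """
    return all(final_state.get(key) == value for key, value in expected.items())


def success_rate(results: List[TaskResult]) -> float:
    """Return the percentage of tasks whose validator passed."""
    if not results:
        raise ValueError("success rate is undefined for an empty result set")
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


# Illustrative data only: four made-up runs checked against one expected state.
expected = {"report.txt": "done"}
runs = {
    "task-1": {"report.txt": "done", "notes.txt": "draft"},
    "task-2": {"report.txt": "pending"},
    "task-3": {},
    "task-4": {"report.txt": "done"},
}
results = [
    TaskResult(task_id, validate_final_state(state, expected))
    for task_id, state in runs.items()
]
rate = success_rate(results)
print(f"{rate:.2f}%")
```

Because the validator looks only at the end state, an agent gets no credit for plausible-looking intermediate steps that leave the environment in the wrong condition.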

Related benchmarks

Compare this benchmark with related pages from the hub:


Frequently asked questions

Which system is currently best on OSWorld?
Claude Opus 4.8 currently leads, with a tracked score of 83.4%. This ranking reflects submitted system setups (model plus tools and policy), not just a base model. It is based on our latest tracked results, last updated May 28, 2026.
What should I read into an OSWorld score?
OSWorld scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.
Are these independently verified?
Not always. Some rows are independently benchmarked and some are team-reported. Use each source link and notes field to verify evidence level before drawing strong conclusions.
Can I compare every row directly?
Use caution. Rows can vary by evaluator, harness, attempt budget, tool access, task filtering, or verification level. Source links and notes are part of the score, not decoration.
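One way to act on this caution is to rank rows only against others with a matching setup. The sketch below is an assumption about how such a comparison could be organized, not a description of how this page computes its rankings; the `Row` fields, the grouping key, and all sample values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Row:
    system: str      # placeholder name, not a real submission
    score: float     # success rate in percent
    verified: bool   # True if the source claims independent execution
    max_steps: int   # step budget reported by the source


def group_comparable(rows: List[Row]) -> Dict[Tuple[bool, int], List[Row]]:
    """Group rows by (verified, max_steps) and sort each group by score."""
    groups: Dict[Tuple[bool, int], List[Row]] = defaultdict(list)
    for row in rows:
        groups[(row.verified, row.max_steps)].append(row)
    for members in groups.values():
        members.sort(key=lambda r: r.score, reverse=True)
    return dict(groups)


# Made-up rows for illustration only.
rows = [
    Row("system-b", 58.5, True, 100),
    Row("system-a", 61.0, True, 100),
    Row("system-c", 70.2, False, 50),
]
groups = group_comparable(rows)
for key, members in groups.items():
    print(key, [m.system for m in members])
```

Here system-c has the highest raw score but sits in its own group, so it is not ranked against the two verified rows that used a different step budget.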