OSWorld
369 cross-application desktop tasks across Ubuntu, Windows, and macOS. Covers Chrome, LibreOffice, VS Code, and more. Execution-based evaluation. Top agents still trail the human baseline of 72.4%.
- Benchmark type: Self-hosted benchmark
- Benchmark domain: Desktop control
- Task count: 369
- Evaluation method: Execution-based
- Top model score: 66.2%
- Human score: 72.4%
About this benchmark
OSWorld is the first scalable, real computer environment for multimodal agents, introduced by Xie et al. (XLANG Lab) in April 2024. It supports task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS. The benchmark contains 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows. Each task includes a detailed initial state configuration and a custom execution-based evaluation script for reproducible assessment. An updated OSWorld-Verified version was released in July 2025 with community-driven fixes and AWS parallelization support.
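Each task, then, bundles an instruction, an initial-state setup, and an evaluator. A minimal sketch of that shape as a Python dict follows; the field names here are illustrative, not OSWorld's exact JSON schema:

```python
# Hypothetical task definition in the spirit of OSWorld's format.
# Field names ("initial_state", "check", etc.) are illustrative only.
task = {
    "id": "libreoffice-export-pdf-001",
    "instruction": "Export the open LibreOffice document as report.pdf",
    "initial_state": [
        # Setup steps executed before the agent starts.
        {"action": "open_file", "path": "~/Documents/report.odt"},
    ],
    "evaluator": {
        # Programmatic end-state check run after the episode ends.
        "check": "file_exists",
        "expected_path": "~/Documents/report.pdf",
    },
}
```

The key design point carried over from the real benchmark: the evaluator inspects the resulting machine state, so any action sequence that produces the correct end state passes.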
Evaluation is execution-based, checking whether the agent reached the correct end state via programmatic validators. Humans accomplish 72.36% of tasks; at the benchmark's release the best model achieved only a 12.24% success rate, struggling primarily with GUI grounding and operational knowledge, though leaderboard scores have since risen well above that. The environment supports screenshot, accessibility tree, and other observation modes, with agents interacting through pyautogui actions.
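An execution-based check of this kind reduces to a small validator run after the episode ends. The helper below is a hypothetical sketch in plain Python, not OSWorld's actual evaluation code:

```python
import os
import tempfile

def validate_end_state(path: str, expected_text: str) -> bool:
    """Execution-based check: pass only if the target file exists and
    contains the expected content. The end state, not the action
    sequence, is what gets scored."""
    if not os.path.exists(path):
        return False
    with open(path, encoding="utf-8") as f:
        return expected_text in f.read()

# Simulate an agent run that produced the right end state.
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "result.txt")
    with open(target, "w", encoding="utf-8") as f:
        f.write("quarterly totals: 42")
    print(validate_end_state(target, "quarterly totals"))  # True
```

Because the validator only inspects the final state, it is indifferent to whether the agent used the GUI, a terminal, or keyboard shortcuts, which is what makes the evaluation reproducible across very different agent policies.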
OSWorld has become a foundational benchmark for desktop agent research, serving as the basis for Windows Agent Arena and inspiring similar platforms. It supports VMware, VirtualBox, Docker, and AWS deployment. The codebase is open source under Apache 2.0, with an active leaderboard and verified evaluation track.
Where this benchmark fits
Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the desktop control view. You can also jump straight to this benchmark in the master registry list.