AppWorld
A 9-app ecosystem with 750 tasks spanning contacts, music, email, maps, and calendar. It tests agents on realistic workflows that require coordination across multiple simulated apps.
- Benchmark type: Self-hosted benchmark
- Benchmark domain: General reasoning
- Task count: 750
- Evaluation method: Functional
- Top model score: ~49%
- Human score: N/A
About this benchmark
AppWorld is a benchmark for interactive coding agents introduced by Stony Brook NLP; it won the ACL 2024 Best Resource Paper award. It comprises the AppWorld Engine, a high-quality execution environment built with 60,000 lines of code that simulates 9 day-to-day apps (notes, messaging, shopping, etc.), is operable via 457 APIs, and is populated with realistic digital activities for approximately 100 fictitious users. The AppWorld Benchmark contains 750 natural, diverse, and challenging tasks that require agents to generate rich, interactive code with complex control flow across multiple apps.
Evaluation uses robust programmatic state-based unit tests (40,000 lines of test code) that allow different valid solution paths while checking for collateral damage from unexpected state changes. Tasks are split into "normal" and "challenge" difficulty levels. GPT-4o, the strongest model at release, solves approximately 49% of normal tasks and 30% of challenge tasks, with other models solving at least 16 percentage points fewer.
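The idea of state-based checking can be illustrated with a minimal sketch. This is not AppWorld's actual test harness: the function names (`diff_state`, `evaluate`), the dictionary-shaped "database", and the sample records are all hypothetical, chosen only to show how a check on the final state can accept different solution paths while still flagging collateral damage.

```python
from copy import deepcopy


def diff_state(before, after):
    """Map (table, key) -> (old, new) for every record that differs."""
    changes = {}
    for table in before.keys() | after.keys():
        old_rows, new_rows = before.get(table, {}), after.get(table, {})
        for key in old_rows.keys() | new_rows.keys():
            old, new = old_rows.get(key), new_rows.get(key)
            if old != new:
                changes[(table, key)] = (old, new)
    return changes


def evaluate(before, after, expected):
    """Pass only if every expected final value holds and nothing else changed.

    `expected` maps (table, key) -> required final value (hypothetical format).
    """
    changes = diff_state(before, after)
    # Required outcomes that the final state does not satisfy.
    missing = sorted(k for k, v in expected.items()
                     if after.get(k[0], {}).get(k[1]) != v)
    # Records that changed although the task did not call for it.
    collateral = sorted(k for k in changes if k not in expected)
    return {"passed": not missing and not collateral,
            "missing": missing, "collateral": collateral}


# Toy starting state and task: record a payment of 20 to alice.
before = {"payments": {}, "notes": {"n1": "groceries"}}
expected = {("payments", "p1"): {"to": "alice", "amount": 20}}

# Path A: write the payment record in one step.
after_a = deepcopy(before)
after_a["payments"]["p1"] = {"to": "alice", "amount": 20}

# Path B: create a partial record, then complete it; same final state.
after_b = deepcopy(before)
after_b["payments"]["p1"] = {"to": "alice"}
after_b["payments"]["p1"]["amount"] = 20

# Path C: correct payment, but a note was deleted along the way.
after_c = deepcopy(after_a)
del after_c["notes"]["n1"]

result_a = evaluate(before, after_a, expected)
result_b = evaluate(before, after_b, expected)
result_c = evaluate(before, after_c, expected)
print(result_a["passed"], result_b["passed"], result_c["passed"])
```

In this sketch, paths A and B pass because only the final state is inspected, while path C fails because the deleted note shows up as an unexpected change.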
AppWorld is significant because it moves beyond simple sequential API-call benchmarks to test genuine multi-app orchestration with iterative environment interaction. Its controllable simulated world enables reproducible evaluation without real-world side effects, while the state-based testing framework ensures evaluation integrity. The project is publicly available at appworld.dev.
Where this benchmark fits
Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the general reasoning view. You can also jump straight to this benchmark in the master registry list.