Desktop control benchmark - Self-hosted

Mobile-Env

A gym-style environment for mobile UI interaction built on the Android emulator. It provides step-level rewards for fine-grained evaluation of touch-based agent interaction.

BENCHMARK
Benchmark by X-LANCE
Benchmark type: Self-hosted benchmark
Benchmark domain: Desktop control
Task count: ~70 tasks
Evaluation method: Step reward
Top model score: N/A (AppAgent, Tencent)
Human score: N/A

About this benchmark

Mobile-Env is a comprehensive toolkit for building GUI interaction benchmarks on Android, developed by Zhang et al. (SJTU X-LANCE Lab) and published in May 2023. Built on top of DeepMind's AndroidEnv, it provides an isolated and controllable platform where agents interact via screenshots and view hierarchies, taking touch or text-typing actions. The platform supports intermediate instructions and rewards at crucial steps, reflecting real-world usage patterns more naturally than simple pass/fail evaluation.
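To make the idea of intermediate instructions and step-level rewards concrete, here is a minimal sketch of an agent loop against a toy environment. Every name in it (`Step`, `ToyEnv`, `run_episode`, the action dictionaries) is a hypothetical stand-in chosen for illustration and is not the Mobile-Env or AndroidEnv API; the observation is a placeholder rather than a real screenshot or view hierarchy.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stand-ins for illustration only; not the Mobile-Env API.


@dataclass
class Step:
    observation: dict            # placeholder for screenshot + view hierarchy
    reward: float                # reward earned by the last action
    instruction: Optional[str]   # new intermediate instruction, if any
    done: bool


@dataclass
class ToyEnv:
    """Toy task: each milestone is reached by one specific action."""
    first_instruction: str
    milestones: list             # (expected_action, reward, next_instruction)
    _index: int = field(default=0, init=False)

    def _observe(self) -> dict:
        return {"screenshot": None, "view_hierarchy": "<root/>"}

    def reset(self) -> Step:
        self._index = 0
        return Step(self._observe(), 0.0, self.first_instruction, False)

    def step(self, action: dict) -> Step:
        expected, reward, instruction = self.milestones[self._index]
        if action == expected:
            self._index += 1                    # milestone reached
        else:
            reward, instruction = 0.0, None     # nothing happens this step
        done = self._index == len(self.milestones)
        return Step(self._observe(), reward, instruction, done)


def run_episode(env: ToyEnv, actions: list):
    """Feed actions to the environment, collecting rewards and instructions."""
    step = env.reset()
    total, instructions = 0.0, [step.instruction]
    for action in actions:
        step = env.step(action)
        total += step.reward
        if step.instruction is not None:
            instructions.append(step.instruction)
        if step.done:
            break
    return total, instructions
```

Because reward arrives per milestone, an agent that completes only the first step of a two-step task still receives partial credit, which a single pass/fail check at the end of the episode would not record.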

Evaluation uses event-driven reward signals parsed from multiple sources: screen text, screen icons, view hierarchy, and system logs. The platform ships with a WikiHow task set available on HuggingFace that captures dynamic online content for fully controllable and reproducible evaluation, plus an open-world task set across various real-world apps. Even advanced models like GPT-4V and LLaMA-3 struggle with tasks that are relatively simple for humans, highlighting significant gaps in current GUI agent capabilities.
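The event-driven reward idea can be sketched as a set of one-shot events, each watching one signal source for a pattern. This is a simplified illustration under stated assumptions, not the platform's actual matching logic: the names `RewardEvent` and `score_step` are hypothetical, every source is assumed to have already been reduced to a plain string (for icons, for example, a detected label), and patterns are ordinary regular expressions.

```python
import re
from dataclasses import dataclass

# The four signal sources named in the text; the string keys are assumptions.
SOURCES = ("screen_text", "screen_icon", "view_hierarchy", "system_log")


@dataclass
class RewardEvent:
    source: str          # which signal source to watch
    pattern: str         # regular expression that triggers the event
    reward: float        # reward granted when the event fires
    fired: bool = False  # each event fires at most once per episode

    def __post_init__(self):
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")


def score_step(events, signals):
    """Return the reward for one step given the current signals.

    `signals` maps a source name to the text observed from it at this step.
    """
    total = 0.0
    for event in events:
        if event.fired:
            continue
        text = signals.get(event.source, "")
        if re.search(event.pattern, text):
            event.fired = True
            total += event.reward
    return total
```

Marking an event as fired keeps an agent from collecting the same reward repeatedly by lingering on one screen.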

Mobile-Env is significant as a foundational platform for Android GUI agent research, emphasizing extensibility and benchmark quality. New tasks can be added through task definition files without code changes, and template tools support auto-generating multi-step task definitions. Docker images are available for simplified deployment. The platform is open source and supports both visual-based and text-based agents.

Where this benchmark fits

Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the desktop control view. You can also jump straight to this benchmark in the master registry list.