Web navigation benchmark - Self-hosted

WorkArena

ServiceNow-based enterprise workflow benchmark. Tests agents on realistic IT, HR, and operations tasks inside a real enterprise SaaS environment via BrowserGym.

Benchmark by ServiceNow

Benchmark type: Self-hosted benchmark
Benchmark domain: Web navigation
Task count: 33 task types
Evaluation method: Programmatic

Top model score: ~42%
Human score: ~78%

About this benchmark

WorkArena is a browser-based benchmark for evaluating web agents on enterprise knowledge-work tasks, introduced by Drouin et al. (ServiceNow Research) at ICML 2024. Its base level, WorkArena-L1, includes 19,912 unique task instances drawn from 33 atomic tasks covering core ServiceNow platform components: knowledge bases, forms, service catalogs, lists, menus, and dashboards. WorkArena++ (NeurIPS 2024) extends this with 682 compositional tasks that combine atomic operations into realistic multi-step workflows, testing planning, reasoning, and memorization.

Evaluation is binary: each task either succeeds or fails, as decided by a programmatic validator built into the task. Oracle (cheat) functions are also available for verification. The benchmark reveals significant performance gaps between open and closed-source LLMs, and even GPT-4-vision-based agents fall well short of full automation on atomic tasks. WorkArena++ compositional tasks prove substantially harder, because agents must chain multiple operations across different UI components.
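The validator-plus-oracle pattern described above can be sketched in a few lines of Python. This is an illustrative toy only, not WorkArena's actual code: `ToyFormTask`, its fields, and `success_rate` are hypothetical names introduced here, and the real benchmark validates state inside a live ServiceNow instance rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class ToyFormTask:
    """Hypothetical stand-in for a form-filling task (not WorkArena code)."""

    expected: dict[str, str]
    filled: dict[str, str] = field(default_factory=dict)

    def cheat(self) -> None:
        # Oracle: write the known-correct values directly into the form state.
        self.filled = dict(self.expected)

    def validate(self) -> bool:
        # Binary check on the final state: success only if every field matches.
        return self.filled == self.expected


def success_rate(outcomes: list[bool]) -> tuple[float, float]:
    """Return (mean success, standard error) for a list of binary outcomes."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("no outcomes to aggregate")
    p = sum(outcomes) / n
    return p, sqrt(p * (1 - p) / n)


task = ToyFormTask(expected={"priority": "2", "state": "Open"})
assert not task.validate()  # an untouched task must fail validation
task.cheat()
assert task.validate()      # the oracle solution must pass the validator

rate, stderr = success_rate([True, False, True, True])
print(f"success rate: {rate:.2f} (standard error {stderr:.2f})")
```

Running the oracle through the validator is a cheap sanity check that a task is solvable and that the validator accepts a correct end state. Aggregating the per-episode booleans then gives the kind of success percentage reported in the scores above.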

WorkArena is distinctive for targeting enterprise software automation rather than general web browsing, using the widely deployed ServiceNow platform as its foundation. It is tightly integrated with BrowserGym for standardized evaluation and with AgentLab for parallel experiment execution, and results are reported on a unified Hugging Face leaderboard. Access requires ServiceNow instances, which are provided through a gated Hugging Face repository. The benchmark is installable via pip as browsergym-workarena.
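As a small convenience, the following sketch checks whether the pip distribution named above is present in the current environment. It uses only the standard library; `installed_version` is a hypothetical helper introduced for illustration, and the check says nothing about whether a ServiceNow instance has been provisioned.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version("browsergym-workarena")
print(found if found else "not installed; try: pip install browsergym-workarena")
```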
