Leaderboard
| System / Submission | Notes | Score | Organization | Reported | Source |
|---|---|---|---|---|---|
| Claude Mythos 5 (browser-use) | Anthropic-run internal port with a browser-use scaffold, per-portal skill files, and a single trial; Anthropic flags it as not directly comparable to the published leaderboard. | 51.9% | Anthropic | Source | |
| Claude Opus 4.8 (browser-use) | Same Anthropic internal-port setup as the Mythos 5 row, and also the grader model for the run's LLM-judged subtasks; ties its successor at 51.9%. | 51.9% | Anthropic | Source | |
| Claude Mythos Preview (browser-use) | Anthropic internal-port run with browser-use scaffold and per-portal skill files, single trial; not directly comparable to the paper harness rows. | 47.4% | Anthropic | Source | |
| Claude Sonnet 4.6 (browser-use) | Anthropic internal-port run with browser-use scaffold and per-portal skill files, single trial; not directly comparable to the paper harness rows. | 45.2% | Anthropic | Source | |
| Claude Opus 4.6 CUA | Native Anthropic computer-use system; best paper result under the headline screenshot-only, task-description-plus-portal-guidance configuration. | 36.3% | Anthropic | Source | |
| GPT-5.4 CUA | Native OpenAI computer-use loop; highest subtask success in the paper but weaker end-to-end completion than Claude Opus 4.6 CUA. | 26.7% | OpenAI | Source | |
| Kimi K2.5 | Paper's standardized harness with screenshot-only observations and portal guidance prompting; strongest harness-based model. | 15.6% | Moonshot AI | Source | |
| Claude Opus 4.6 | Standardized harness, screenshot-only; the paper notes it reaches 51.9% under accessibility-tree observations, showing how much observation mode matters. | 14.8% | Anthropic | Source | |
| Qwen 3.5 | Paper's standardized harness with screenshot-only observations and portal guidance prompting. | 13.3% | Alibaba | Source | |
| Gemini 3.1 Pro | Paper's standardized harness with screenshot-only observations and portal guidance prompting. | 11.9% | Google | Source | |
| GPT-5.4 | Standardized harness, screenshot-only; far below its native CUA result, underlining the impact of system-level orchestration. | 5.9% | OpenAI | Source | |
About this benchmark
HealthAdminBench evaluates computer-use agents on healthcare revenue-cycle work: 135 expert-designed tasks covering prior authorization, appeals and denials management, and durable medical equipment (DME) order processing, executed in four simulated GUI environments (an EHR, two payer portals, and a fax system). It was built by Stanford's Shah Lab with Stanford Healthcare and Kinetic Systems.
Each task decomposes into fine-grained verifiable subtasks — 1,698 evaluation points in total — scored by deterministic portal-state checks plus LLM-judged rubric items for free-text documentation and clinical reasoning. Full-task success requires passing every subtask, making it a strict end-to-end reliability measure.
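The all-or-nothing rule can be illustrated with a minimal sketch. It assumes each subtask outcome has already been reduced to a pass/fail flag, whether it came from a deterministic check or an LLM-judged rubric item; the `SubtaskResult` type, the `score` function, and the task IDs are hypothetical names for illustration, not part of the benchmark's code.

```python
from dataclasses import dataclass


@dataclass
class SubtaskResult:
    task_id: str   # hypothetical task identifier
    passed: bool   # pass/fail outcome of one subtask check


def score(results: list[SubtaskResult]) -> tuple[float, float]:
    """Return (subtask_success_rate, full_task_success_rate)."""
    by_task: dict[str, list[bool]] = {}
    for r in results:
        by_task.setdefault(r.task_id, []).append(r.passed)
    subtask_rate = sum(r.passed for r in results) / len(results)
    # A task counts as a success only if every one of its subtasks passed.
    task_rate = sum(all(flags) for flags in by_task.values()) / len(by_task)
    return subtask_rate, task_rate


# Toy data: task-a fails one of three subtasks, task-b passes both of its subtasks.
results = [
    SubtaskResult("task-a", True),
    SubtaskResult("task-a", True),
    SubtaskResult("task-a", False),
    SubtaskResult("task-b", True),
    SubtaskResult("task-b", True),
]
subtask_rate, task_rate = score(results)
print(f"subtask success: {subtask_rate:.0%}, full-task success: {task_rate:.0%}")
```

In this toy run four of five subtasks pass, but only one of the two tasks counts as complete, because a single failed subtask fails its whole task.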
The benchmark's headline finding is the gap between subtask and full-task performance: agents routinely complete 70–95% of subtasks while finishing far fewer whole workflows, which mirrors the reliability bar real back-office automation has to clear.
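A rough back-of-the-envelope calculation shows why the gap is so wide. The sketch below makes two simplifying assumptions that the benchmark does not claim: every task has the average number of subtasks (1,698 / 135, about 12.6), and each subtask passes independently with the same probability. Real subtasks are correlated and vary in difficulty, so this is an illustration of compounding, not a model of any agent's actual score.

```python
TOTAL_SUBTASKS = 1698
TOTAL_TASKS = 135
AVG_SUBTASKS = TOTAL_SUBTASKS / TOTAL_TASKS  # about 12.6 subtasks per task


def expected_full_task_rate(p: float, n: float = AVG_SUBTASKS) -> float:
    """Chance that all n subtasks pass, assuming independent passes with probability p."""
    return p ** n


for p in (0.70, 0.80, 0.90, 0.95):
    rate = expected_full_task_rate(p)
    print(f"subtask pass rate {p:.0%} -> expected full-task rate {rate:.1%}")
```

Under these assumptions, even a 90% per-subtask pass rate leaves only roughly a quarter of tasks fully complete, which is the same order of magnitude as the gap the paper reports.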
New benchmark with no independent submissions yet: current rows are the paper authors' baselines plus Anthropic's self-reported system card run.
Anthropic system card rows come from an internal port with a browser-use scaffold, per-portal skill files, and a single trial, with Claude Opus 4.8 grading the LLM-judged subtasks; Anthropic itself flags them as not directly comparable to the published leaderboard.
Setup differences (observation mode, prompting, orchestration) move scores more than model choice does on this page, so read the Notes column before treating rank gaps as capability gaps.
Example tasks
Three public tasks quoted from benchmark sources:
- "Open referral REF-2025-002 for Smith, Emily (67F with Santa Clara Family Health Plan - Medicare Advantage). Determine whether the payer requires prior authorization for this eye follow-up visit. Document your determination, then clear the referral from the worklist." Citation: HealthAdminBench task explorer
- "Open denial DEN-001 for Martinez, Carlos. Review all available information about this denial and determine the appropriate triage disposition. Document your reasoning in a triage note." Citation: HealthAdminBench task explorer
- "Download all 3 required documents, fax to DME supplier, and document in the Notes tab." Citation: HealthAdminBench task explorer
Methodology
- Headline metric is full-task success rate (pass@1 over 135 tasks): a task counts only if all of its subtasks pass. Subtask success rate is reported alongside it and is much higher for every agent.
- The paper's default configuration is screenshot-only observations with task-description-plus-portal-guidance prompting; configuration moves scores dramatically (Claude Opus 4.6 jumps from 14.8% to 51.9% with accessibility-tree observations, and task-specific prompts push harness agents above 90%).
- Rows mix three harness families: the paper's standardized harness, native computer-use systems (Anthropic and OpenAI CUA loops), and Anthropic's internal browser-use port from its system card; compare within a family before comparing across.
- Anthropic's system card rows used an internal port of the benchmark with a browser-use agent, adaptive thinking, a 500k-token per-task budget, per-portal skill files, and a single trial per model, with Claude Opus 4.8 grading the LLM-judged subtasks; Anthropic states these results are not directly comparable to the published leaderboard.
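One way to apply the compare-within-a-family advice is to group the leaderboard rows before ranking them. The sketch below copies the scores from the leaderboard above; the family labels (`paper-harness`, `native-cua`, `anthropic-port`) and the `rank_within_family` helper are this example's own naming, chosen to match the three families described in the Methodology notes.

```python
from collections import defaultdict

# (system, family, full-task success %) taken from the leaderboard table.
# Family labels are hypothetical tags introduced for this sketch.
ROWS = [
    ("Claude Mythos 5 (browser-use)", "anthropic-port", 51.9),
    ("Claude Opus 4.8 (browser-use)", "anthropic-port", 51.9),
    ("Claude Mythos Preview (browser-use)", "anthropic-port", 47.4),
    ("Claude Sonnet 4.6 (browser-use)", "anthropic-port", 45.2),
    ("Claude Opus 4.6 CUA", "native-cua", 36.3),
    ("GPT-5.4 CUA", "native-cua", 26.7),
    ("Kimi K2.5", "paper-harness", 15.6),
    ("Claude Opus 4.6", "paper-harness", 14.8),
    ("Qwen 3.5", "paper-harness", 13.3),
    ("Gemini 3.1 Pro", "paper-harness", 11.9),
    ("GPT-5.4", "paper-harness", 5.9),
]


def rank_within_family(rows):
    """Group rows by family and sort each group by score, highest first."""
    groups = defaultdict(list)
    for system, family, score in rows:
        groups[family].append((system, score))
    return {
        family: sorted(members, key=lambda m: m[1], reverse=True)
        for family, members in groups.items()
    }


ranked = rank_within_family(ROWS)
for family, members in ranked.items():
    print(family)
    for system, score in members:
        print(f"  {score:5.1f}%  {system}")
```

Grouping first keeps the comparison to rows that share an observation mode, prompting style, and orchestration, so a rank gap inside a group is more likely to reflect the model than the setup.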