Leaderboard

System / Submission                       Score   Organization  Reported  Source
Claude Mythos 5 (browser-use) New         51.9%   Anthropic               Source
Claude Opus 4.8 (browser-use) New         51.9%   Anthropic               Source
Claude Mythos Preview (browser-use) New   47.4%   Anthropic               Source
Claude Sonnet 4.6 (browser-use) New       45.2%   Anthropic               Source
Claude Opus 4.6 CUA New                   36.3%   Anthropic               Source
GPT-5.4 CUA New                           26.7%   OpenAI                  Source
Kimi K2.5 New                             15.6%   Moonshot AI             Source
Claude Opus 4.6 New                       14.8%   Anthropic               Source
Qwen 3.5 New                              13.3%   Alibaba                 Source
Gemini 3.1 Pro New                        11.9%   Google                  Source
GPT-5.4 New                               5.9%    OpenAI                  Source

About this benchmark

HealthAdminBench evaluates computer-use agents on healthcare revenue-cycle work: 135 expert-designed tasks covering prior authorization, appeals and denials management, and durable medical equipment (DME) order processing, executed in four simulated GUI environments (an EHR, two payer portals, and a fax system). It was built by Stanford's Shah Lab with Stanford Healthcare and Kinetic Systems.

Each task decomposes into fine-grained verifiable subtasks — 1,698 evaluation points in total — scored by deterministic portal-state checks plus LLM-judged rubric items for free-text documentation and clinical reasoning. Full-task success requires passing every subtask, making it a strict end-to-end reliability measure.
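
The all-or-nothing rule above can be sketched in a few lines. This is a minimal illustration, assuming each task's result is available as a list of per-subtask pass/fail booleans; the function names and the toy data are hypothetical and not part of the benchmark's tooling, and pooling subtasks across tasks is an assumption about how the subtask rate is aggregated.

```python
from typing import Dict, List


def subtask_success_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of all subtasks passed, pooled across tasks (assumed aggregation)."""
    total = sum(len(checks) for checks in results.values())
    passed = sum(sum(checks) for checks in results.values())
    return passed / total


def full_task_success_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks in which every subtask passed."""
    solved = sum(all(checks) for checks in results.values())
    return solved / len(results)


# Hypothetical toy run: three tasks, with one missed subtask in two of them.
toy_results = {
    "task_a": [True, True, True, True],
    "task_b": [True, True, False, True],
    "task_c": [True, False, True, True],
}

print(f"subtask success:   {subtask_success_rate(toy_results):.1%}")
print(f"full-task success: {full_task_success_rate(toy_results):.1%}")
```

In this toy run the agent passes 10 of 12 subtasks but completes only 1 of 3 tasks, which is the kind of gap the strict metric is designed to expose.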

The benchmark's headline finding is the gap between subtask and full-task performance: agents routinely complete 70–95% of subtasks while finishing far fewer whole workflows, which mirrors the reliability bar real back-office automation has to clear.
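
One way to see why high subtask rates still yield low full-task rates is a back-of-the-envelope compounding model. The sketch below assumes every subtask passes independently at the same rate and that each task has the benchmark-wide average number of subtasks (1,698 / 135). Both are simplifying assumptions made for illustration, not claims from the paper, and `expected_full_task_rate` is a hypothetical helper.

```python
TOTAL_SUBTASKS = 1698  # evaluation points reported for the benchmark
TOTAL_TASKS = 135
MEAN_SUBTASKS_PER_TASK = TOTAL_SUBTASKS / TOTAL_TASKS  # about 12.6


def expected_full_task_rate(
    subtask_pass_rate: float,
    subtasks_per_task: float = MEAN_SUBTASKS_PER_TASK,
) -> float:
    """Full-task success if every subtask passes independently at the same rate.

    Illustration only: real subtask failures are correlated and tasks
    differ in length, so actual results will deviate from this estimate.
    """
    return subtask_pass_rate ** subtasks_per_task


for p in (0.70, 0.80, 0.90, 0.95):
    rate = expected_full_task_rate(p)
    print(f"subtask pass rate {p:.0%} -> full-task rate {rate:.1%}")
```

Under these assumptions, even a 95% subtask pass rate completes only about half of the workflows, and a 70% rate completes almost none.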

New benchmark with no independent submissions yet: current rows are the paper authors' baselines plus Anthropic's self-reported system card run.

Anthropic system card rows come from an internal port with a browser-use scaffold, per-portal skill files, and a single trial, and were self-graded by Claude Opus 4.8; Anthropic itself flags them as not directly comparable to the published leaderboard.

Setup differences (observation mode, prompting, orchestration) move scores more than model choice does on this page, so read the Notes column before treating rank gaps as capability gaps.

Example tasks

Three public tasks quoted from benchmark sources:

  • "Open referral REF-2025-002 for Smith, Emily (67F with Santa Clara Family Health Plan - Medicare Advantage). Determine whether the payer requires prior authorization for this eye follow-up visit. Document your determination, then clear the referral from the worklist." Citation: HealthAdminBench task explorer
  • "Open denial DEN-001 for Martinez, Carlos. Review all available information about this denial and determine the appropriate triage disposition. Document your reasoning in a triage note." Citation: HealthAdminBench task explorer
  • "Download all 3 required documents, fax to DME supplier, and document in the Notes tab." Citation: HealthAdminBench task explorer

Methodology

  • Headline metric is full-task success rate (pass@1 over 135 tasks): a task counts only if all of its subtasks pass. Subtask success rate is reported alongside it and is much higher for every agent.
  • The paper's default configuration is screenshot-only observations with task-description-plus-portal-guidance prompting; configuration moves scores dramatically (Claude Opus 4.6 jumps from 14.8% to 51.9% with accessibility-tree observations, and task-specific prompts push harness agents above 90%).
  • Rows mix three harness families: the paper's standardized harness, native computer-use systems (Anthropic and OpenAI CUA loops), and Anthropic's internal browser-use port from its system card; compare within a family before comparing across.
  • Anthropic's system card rows used an internal port of the benchmark with a browser-use agent, adaptive thinking, a 500k-token per-task budget, per-portal skill files, and a single trial per model, with Claude Opus 4.8 grading the LLM-judged subtasks; Anthropic states these results are not directly comparable to the published leaderboard.
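
To follow the advice about comparing within a harness family first, the rows can be grouped before ranking. The sketch below is a rough illustration: the scores are copied from the leaderboard on this page, but the `harness_family` heuristic (inferring the family from the row label) and the family names are assumptions, not metadata published with the benchmark.

```python
from collections import defaultdict

# (system label, score in percent) copied from the leaderboard on this page.
ROWS = [
    ("Claude Mythos 5 (browser-use)", 51.9),
    ("Claude Opus 4.8 (browser-use)", 51.9),
    ("Claude Mythos Preview (browser-use)", 47.4),
    ("Claude Sonnet 4.6 (browser-use)", 45.2),
    ("Claude Opus 4.6 CUA", 36.3),
    ("GPT-5.4 CUA", 26.7),
    ("Kimi K2.5", 15.6),
    ("Claude Opus 4.6", 14.8),
    ("Qwen 3.5", 13.3),
    ("Gemini 3.1 Pro", 11.9),
    ("GPT-5.4", 5.9),
]


def harness_family(label: str) -> str:
    """Guess the harness family from the row label (heuristic assumption)."""
    if "(browser-use)" in label:
        return "anthropic-browser-use-port"
    if label.endswith("CUA"):
        return "native-computer-use"
    return "paper-harness"


def rank_within_families(rows):
    """Group rows by inferred family and sort each group by score, highest first."""
    families = defaultdict(list)
    for label, score in rows:
        families[harness_family(label)].append((label, score))
    return {
        family: sorted(members, key=lambda row: row[1], reverse=True)
        for family, members in families.items()
    }


for family, members in rank_within_families(ROWS).items():
    print(family)
    for label, score in members:
        print(f"  {score:5.1f}%  {label}")
```

Ranking inside each group keeps setup differences from being read as capability differences; cross-family gaps still need the caveats listed above.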

Frequently asked questions

Which system is currently best on HealthAdminBench?
Claude Mythos 5 (browser-use) and Claude Opus 4.8 (browser-use) currently share the lead with a tracked score of 51.9%. This ranking reflects submitted system setups (model plus tools and policy), not just a base model. It is based on our latest tracked results, last updated Jun 12, 2026.
What should I read into a HealthAdminBench score?
HealthAdminBench scores are most useful for within-benchmark ranking. Read the Notes column to understand setup context, and use the methodology section before making procurement or architecture decisions.
Can I compare every row directly?
Use caution. Rows can vary by evaluator, harness, attempt budget, tool access, task filtering, or verification level. Source links and notes are part of the score, not decoration.