Specialized agent benchmark - Public

MedAgentBench

300 clinical tasks across 10 medical categories using real EHR data. Tests agents on diagnosis reasoning, treatment planning, and medical record navigation in realistic hospital environments.

BENCHMARK
Benchmark by Stanford
Benchmark type: Public benchmark
Benchmark domain: Specialized agent
Task count: 300
Evaluation method: Expert validation
Top model score: ~77% (o1-preview, OpenAI)
Human score: N/A

About this benchmark

MedAgentBench is a realistic virtual EHR (Electronic Health Record) environment for benchmarking medical LLM agents, developed by Stanford ML Group and published in NEJM AI in 2025. It comprises 300 patient-specific, clinically-derived tasks across 10 categories written by human physicians, paired with realistic profiles of 100 patients containing over 700,000 data elements. The environment is built on a FHIR-compliant interactive platform using the open-source HAPI FHIR JPA server, enabling agents to interact with standard medical APIs used in modern EHR systems.
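
To make the FHIR interaction concrete, the following is a minimal sketch of how an agent might form a standard FHIR search request and read values out of the returned Bundle. It is not taken from the MedAgentBench codebase: the base URL, the patient identifier, the observation code, and the helper names `build_search_url` and `extract_quantities` are all assumptions chosen for illustration. Only the URL shape (`[base]/[ResourceType]?name=value`) and the Bundle layout (`entry[].resource`) follow the FHIR specification.

```python
from urllib.parse import urlencode

# Assumed base URL of a locally running FHIR server; adjust to your deployment.
FHIR_BASE = "http://localhost:8080/fhir"


def build_search_url(resource_type, params, base=FHIR_BASE):
    """Return a FHIR search URL of the form [base]/[type]?name=value&..."""
    return f"{base.rstrip('/')}/{resource_type}?{urlencode(params)}"


def extract_quantities(bundle):
    """Collect (value, unit) pairs from Observation resources in a search Bundle."""
    pairs = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        quantity = resource.get("valueQuantity")
        if resource.get("resourceType") == "Observation" and quantity:
            pairs.append((quantity.get("value"), quantity.get("unit")))
    return pairs


# Hypothetical query: most recent observation with a given code for one patient.
url = build_search_url(
    "Observation",
    {"patient": "example-patient-1", "code": "2345-7", "_sort": "-date", "_count": 1},
)

# Hand-written stand-in for a server response, so the sketch runs offline.
sample_bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {
            "resource": {
                "resourceType": "Observation",
                "valueQuantity": {"value": 5.4, "unit": "mmol/L"},
            }
        }
    ],
}

print(url)
print(extract_quantities(sample_bundle))
```

In a live setup, the agent would send an HTTP GET to the printed URL and pass the decoded JSON response to `extract_quantities` in place of `sample_bundle`.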

Task categories include patient communication, information retrieval, data recording, test ordering, documentation, referral ordering, medication ordering, and patient data aggregation. Evaluation is based on task success rate measured against reference solutions. In the original paper, the best-performing model, Claude 3.5 Sonnet v2, achieves a success rate of 69.67%. Performance varies significantly across task categories, indicating substantial room for improvement.
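
As a rough illustration of the metric, the sketch below computes an overall and a per-category success rate from a list of task outcomes. The function name `success_rates`, the `(category, passed)` input format, and the sample outcomes are assumptions for illustration, not the benchmark's actual grading code. For reference, a 69.67% success rate on 300 tasks corresponds to 209 successful tasks (209 / 300 ≈ 0.6967).

```python
from collections import defaultdict


def success_rates(results):
    """Compute overall and per-category success rates.

    `results` is an iterable of (category, passed) pairs, where `passed`
    is True if the agent's output matched the reference solution.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    per_category = {c: passed[c] / totals[c] for c in totals}
    overall = sum(passed.values()) / sum(totals.values()) if totals else 0.0
    return overall, per_category


# Hypothetical outcomes for four tasks in two categories.
sample_results = [
    ("information retrieval", True),
    ("information retrieval", True),
    ("medication ordering", True),
    ("medication ordering", False),
]

overall, per_category = success_rates(sample_results)
print(f"overall: {overall:.2%}")
for category, rate in per_category.items():
    print(f"{category}: {rate:.2%}")
```

Reporting the per-category breakdown alongside the overall figure matters here because an aggregate score can hide categories where an agent fails most tasks.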

MedAgentBench is notable as the first benchmark requiring autonomous agent interactions with a realistic medical records environment rather than static question-answering. Because the environment is FHIR-compliant, agents are evaluated against the same API standard used by modern EHR systems, so results are indicative of how well an agent could integrate with real-world EHR deployments. The benchmark, codebase, and Docker-based environment are publicly available on GitHub for reproducible evaluation.

Where this benchmark fits

Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the specialized view. You can also jump straight to this benchmark in the master registry list.