
AgentHarm

Safety red-teaming benchmark with 440 harmful agent tasks across 11 categories, including fraud, cybercrime, and harassment. Tests whether agent frameworks refuse harmful requests, how easily they are jailbroken, and whether jailbroken agents can still complete malicious multi-step tasks.

BENCHMARK
Benchmark by AIEvals
Benchmark type: Public benchmark
Benchmark domain: Specialized agent
Task count: 440
Evaluation method: Human / LLM judge
Top model score: N/A (safety eval)
Benchmark source: UK AISI / Meta
Human score: N/A

About this benchmark

AgentHarm is a safety benchmark for measuring the harmfulness of LLM agents, introduced in October 2024 (arXiv:2410.09024). It contains 110 explicitly malicious agent tasks (440 with augmentations) spanning 11 harm categories including fraud, cybercrime, and harassment. Unlike chatbot-focused safety evaluations, AgentHarm specifically targets agents that use external tools and execute multi-step tasks, testing whether models refuse harmful agentic requests and whether jailbroken agents maintain coherent capabilities to complete malicious workflows.
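To make the task structure concrete, the sketch below pulls the task set from Hugging Face with the `datasets` library. The dataset ID, config name, split, and record fields are assumptions based on the public release via the UK AI Safety Institute; check the published dataset card for the exact layout.

```python
# Minimal sketch: load AgentHarm tasks from Hugging Face.
# Dataset ID, config, split, and field names are assumptions, not the
# authoritative layout -- consult the dataset card before relying on them.
from datasets import load_dataset

# Assumed dataset ID published by the UK AI Safety Institute.
harmful = load_dataset("ai-safety-institute/AgentHarm", "harmful", split="test_public")

# Inspect one task record (field names are illustrative).
task = harmful[0]
print(task.get("category"), "-", task.get("name"))
print(task.get("prompt"))
```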

Evaluation measures two axes: refusal rate on harmful requests and task completion quality when safety measures are bypassed. The benchmark reveals that leading LLMs are surprisingly compliant with malicious agent requests even without jailbreaking, that simple universal jailbreak templates effectively compromise agents, and that jailbroken agents retain full multi-step capabilities. The dataset is publicly available on Hugging Face via the UK AI Safety Institute.
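As a rough illustration of the two axes, the sketch below aggregates per-task results into a refusal rate and an average completion score over non-refused tasks. The record fields (`refused`, `score`) are hypothetical stand-ins for whatever the grading harness actually emits.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class TaskResult:
    # Hypothetical per-task record: whether the agent refused the harmful
    # request, and a graded completion score in [0, 1] if it did not.
    refused: bool
    score: float

def summarize(results: Iterable[TaskResult]) -> dict:
    results = list(results)
    n = len(results)
    refusals = sum(r.refused for r in results)
    attempted = [r for r in results if not r.refused]
    return {
        "refusal_rate": refusals / n if n else 0.0,
        # Average completion among non-refused attempts: high values mean the
        # agent retained multi-step capability on malicious workflows.
        "completion_given_no_refusal": (
            sum(r.score for r in attempted) / len(attempted) if attempted else 0.0
        ),
    }

# Toy usage with made-up results.
print(summarize([TaskResult(True, 0.0), TaskResult(False, 0.8), TaskResult(False, 0.4)]))
```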

AgentHarm fills a critical gap in AI safety research by shifting focus from single-turn chatbot attacks to realistic multi-step agent misuse scenarios. An A2A-compatible evaluation harness (AgentBeats green agent) enables agent-agnostic testing of any A2A-compliant system without benchmark-side modifications. The benchmark and tooling are openly available for reproducible safety evaluation.
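For intuition on how an agent-agnostic harness can drive any A2A-compliant system, here is a minimal, hypothetical sketch of dispatching one task prompt to an agent endpoint over HTTP JSON-RPC. The endpoint URL, method name, and payload shape are illustrative assumptions, not the A2A specification or the AgentBeats implementation.

```python
# Hypothetical sketch of sending one benchmark task to an A2A-compliant agent.
# Endpoint URL, JSON-RPC method name, and payload shape are illustrative
# assumptions; consult the A2A spec and the AgentBeats harness for the real
# message format.
import uuid
import requests

AGENT_URL = "http://localhost:8080/"  # assumed endpoint of the agent under test

def send_task(prompt: str) -> dict:
    payload = {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": "message/send",  # assumed method name
        "params": {
            "message": {
                "role": "user",
                "parts": [{"kind": "text", "text": prompt}],
            }
        },
    }
    resp = requests.post(AGENT_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()

# The returned task output would then be graded for refusal vs. completion.
```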

Where this benchmark fits

Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the specialized view. You can also jump straight to this benchmark in the master registry list.