Coding agent benchmark - Self-hosted

CVE-Bench

Evaluates agents on exploiting real-world cybersecurity vulnerabilities. Agents are scored on successfully exploiting CVEs drawn from public vulnerability databases against target applications running in sandboxed environments.

BENCHMARK
Benchmark by UIUC
Benchmark type:
Self-hosted benchmark
Benchmark domain:
Coding agent
Task count:
40 CVEs
Evaluation method:
Exploit success
Top model score
~47%
o3 (high)
OpenAI
Human score
N/A

About this benchmark

CVE-Bench, introduced in March 2025 and accepted as a spotlight paper at ICML 2025, is a cybersecurity benchmark for evaluating AI agents' ability to exploit real-world web application vulnerabilities. Developed with contributions from the US AI Safety Institute, it contains 40 critical-severity Common Vulnerabilities and Exposures (CVEs) sourced from the National Vulnerability Database. Each task presents an agent with a target web application and requires executing attacks that trigger outcomes such as denial of service, file access, remote code execution, database modification, unauthorized admin login, privilege escalation, or outbound requests.
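To make the outcome-based scoring concrete, here is a minimal sketch of how two of the listed success conditions (denial of service, outbound request) might be checked programmatically. The function names, the request-log format, and the check logic are all hypothetical illustrations, not CVE-Bench's actual grader code.

```python
import urllib.request
import urllib.error

def check_denial_of_service(target_url: str, timeout: float = 5.0) -> bool:
    """Hypothetical check: the target app is unreachable or returning server errors."""
    try:
        with urllib.request.urlopen(target_url, timeout=timeout) as resp:
            return resp.status >= 500  # up but failing also counts as DoS here
    except (urllib.error.URLError, TimeoutError):
        return True  # unreachable: treat as a successful denial of service

def check_outbound_request(request_log: list[str], attacker_host: str) -> bool:
    """Hypothetical check: the app made a request to an attacker-controlled host."""
    return any(attacker_host in entry for entry in request_log)
```

In a real harness each CVE would pair the exploit target with one such checker, run after the agent's attack attempt.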

Evaluation runs in a sandboxed Docker environment using the Inspect framework, with tasks available in zero-day and one-day variants. The zero-day setting provides no CVE details, while the one-day setting includes vulnerability information. Reference exploits are kept private to prevent contamination. The state-of-the-art agent framework resolves up to 13% of vulnerabilities according to the original paper. Version 2.1.0, released January 2026, replaced arbitrary file upload with remote code execution as an evaluation criterion.
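The zero-day/one-day distinction amounts to withholding or including the CVE details in the agent's task description. A minimal sketch of that split, with entirely hypothetical task and prompt structures (not CVE-Bench's actual Inspect task code):

```python
from dataclasses import dataclass

@dataclass
class CveTask:
    cve_id: str       # e.g. an identifier from the National Vulnerability Database
    target_url: str   # sandboxed web application under attack
    description: str  # public vulnerability details

def build_prompt(task: CveTask, variant: str) -> str:
    """Build the agent prompt for the 'zero_day' or 'one_day' variant."""
    base = f"Target web application: {task.target_url}. Demonstrate an exploit."
    if variant == "zero_day":
        return base  # no vulnerability information provided
    if variant == "one_day":
        return f"{base}\nKnown vulnerability ({task.cve_id}): {task.description}"
    raise ValueError(f"unknown variant: {variant}")
```

The one-day setting tests exploit construction from a known weakness; the zero-day setting additionally tests vulnerability discovery.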

CVE-Bench is significant for AI safety research, measuring offensive cybersecurity capabilities of autonomous agents against real applications rather than abstracted CTF challenges. It won second place at Berkeley RDI's AgentX Competition and second prize at SafeBench. The code is open source on GitHub.

Where this benchmark fits

Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the coding view. You can also jump straight to this benchmark in the master registry list.