HumanEval+
An enhanced version of OpenAI's HumanEval with roughly 80x more test cases per problem, designed to reduce false positives: solutions that pass the original sparse tests but are actually incorrect. Tests Python code generation against significantly stricter test coverage.
Benchmark by EvalPlus
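The expanded test coverage matters because a subtly wrong solution can clear a handful of base tests. The self-contained sketch below (loosely modeled on a HumanEval-style task, not an actual benchmark problem) shows how a buggy completion passes a weak base suite but fails once stricter cases are added:

```python
def below_zero_buggy(operations):
    """Report whether a running balance ever drops below zero.
    Bug: checks the sign of each operation, not of the running balance."""
    balance = 0
    for op in operations:
        balance += op
        if op < 0:  # should be: balance < 0
            return True
    return False

# A weak "base" suite: every withdrawal here also drives the balance
# negative, so the bug stays invisible.
base_tests = [([1, 2, -4], True), ([1, 2, 3], False)]

# A stricter "plus"-style suite adds a withdrawal that keeps the balance
# positive, exposing the false positive.
plus_tests = base_tests + [([5, -1, 2], False), ([], False)]

for name, suite in [("base", base_tests), ("plus", plus_tests)]:
    ok = all(below_zero_buggy(ops) == expected for ops, expected in suite)
    print(f"{name} suite: {'pass' if ok else 'FAIL'}")
# base suite: pass
# plus suite: FAIL
```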
- Benchmark type: Public benchmark
- Benchmark domain: Coding agent
- Task count: 164
- Evaluation method: Test suite (see the workflow sketch below)
- Top model score: ~99% (o3 / Claude 3.7, various developers)
- Human score: N/A
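For hands-on scoring, EvalPlus publishes a Python harness (`pip install evalplus`). The sketch below follows the package's README-documented workflow; `generate_solution` is a hypothetical stand-in for whatever model call produces a completion:

```python
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    """Hypothetical: query your model and return a Python completion."""
    raise NotImplementedError

# Build one sample per task (164 problems in the dataset).
samples = [
    {"task_id": task_id, "solution": generate_solution(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Then score against both the base and the expanded test suites
# (command per the project's README):
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```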
Where this benchmark fits
Use this page when you need benchmark-specific context. For side-by-side comparisons, go back to the full registry or open the coding view. You can also jump straight to this benchmark in the master registry list.