
HumanEval+

An enhanced version of OpenAI's HumanEval with roughly 80x more test cases per problem, built to reduce false positives: solutions that pass HumanEval's original sparse tests despite being incorrect. It tests Python code generation against this significantly stricter test coverage.
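For concreteness, this is roughly how a model is run against the benchmark with the evalplus package's data helpers. A minimal sketch: generate_one_completion is a hypothetical stand-in for whatever model call you use and is not part of the package.

```python
# Sketch: produce one solution per HumanEval+ task, then write them
# out in the JSONL format the EvalPlus scorer expects.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Hypothetical stand-in: call your model here and return Python code.
    raise NotImplementedError

samples = [
    dict(task_id=task_id, solution=generate_one_completion(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
```

The resulting samples.jsonl is then scored against the extended test suites, for example via the package's evalplus.evaluate command.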

BENCHMARK
Benchmark by: EvalPlus
Benchmark type: Public benchmark
Benchmark domain: Coding agent
Task count: 164
Evaluation method: Test suite (see the pass@k sketch below)
Top model score: ~99% (o3 / Claude 3.7)
Human score: N/A
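"Test suite" here means a solution counts only if it passes every extended test for its problem, with results aggregated as pass@k. As a sketch, this is the standard unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021), which HumanEval+ scoring inherits; assume n samples were generated per task and c of them passed.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated per task,
    c = samples passing the full test suite, k = evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples, 150 pass the suite -> pass@1 = 0.75
print(pass_at_k(200, 150, 1))
```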

Where this benchmark fits

Use this page when you need benchmark-specific context. For side-by-side comparison, go back to the full registry or open the coding view. You can also jump straight to this benchmark in the master registry list.