HumanEval+
An enhanced version of OpenAI's HumanEval with roughly 80x more test cases per problem, designed to reduce false positives: solutions that pass the original sparse tests but are actually incorrect. Tests Python code generation against significantly stricter test coverage.
Benchmark by EvalPlus
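The expanded test coverage matters because a subtly wrong solution can clear a handful of base tests. The self-contained sketch below (loosely modeled on a HumanEval-style task, not an actual benchmark problem) shows how a buggy completion passes a weak base suite but fails once stricter cases are added:

```python
def below_zero_buggy(operations):
    """Report whether a running balance ever drops below zero.
    Bug: checks the sign of each operation, not of the running balance."""
    balance = 0
    for op in operations:
        balance += op
        if op < 0:  # should be: balance < 0
            return True
    return False

# A weak "base" suite: every withdrawal here also drives the balance
# negative, so the bug stays invisible.
base_tests = [([1, 2, -4], True), ([1, 2, 3], False)]

# A stricter "plus"-style suite adds a withdrawal that keeps the balance
# positive, exposing the false positive.
plus_tests = base_tests + [([5, -1, 2], False), ([], False)]

for name, suite in [("base", base_tests), ("plus", plus_tests)]:
    ok = all(below_zero_buggy(ops) == expected for ops, expected in suite)
    print(f"{name} suite: {'pass' if ok else 'FAIL'}")
# base suite: pass
# plus suite: FAIL
```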
- Benchmark type: Public benchmark
- Benchmark domain: Coding agent
- Task count: 164
- Evaluation method: Test suite (see the workflow sketch below)
- Top model score: ~99% (o3 / Claude 3.7, various developers)
- Human score: N/A
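For hands-on scoring, EvalPlus publishes a Python harness (`pip install evalplus`). The sketch below follows the package's README-documented workflow; `generate_solution` is a hypothetical stand-in for whatever model call produces a completion:

```python
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    """Hypothetical: query your model and return a Python completion."""
    raise NotImplementedError

# Build one sample per task (164 problems in the dataset).
samples = [
    {"task_id": task_id, "solution": generate_solution(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Then score against both the base and the expanded test suites
# (command per the project's README):
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```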
Where this benchmark fits
Use this page when you need benchmark-specific context. For side-by-side comparisons, go back to the full registry or open the coding view. You can also jump straight to this benchmark in the master registry list.