General reasoning benchmark · Public
Humanity's Last Exam
3,000 expert-level questions spanning 100+ academic disciplines, contributed by domain experts. Designed to sit at or beyond the frontier of human knowledge, it is among the hardest factual benchmarks published to date.
- Benchmark type: Public benchmark
- Benchmark domain: General reasoning
- Task count: 3,000
- Evaluation method: Exact match (see the sketch after this list)
- Top model score: ~26% (o3 (high), OpenAI)
- Human score: N/A
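Exact match means a response earns credit only if it reproduces the reference answer verbatim after normalization, with no partial credit. The Python sketch below illustrates the idea; the normalization rules and function names here are illustrative assumptions, not HLE's actual grading harness.

```python
# Minimal sketch of exact-match scoring. The normalization choices
# (case-folding, punctuation stripping, whitespace collapsing) are
# assumptions for illustration, not HLE's official grader.
import re


def normalize(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    answer = answer.lower().strip()
    answer = re.sub(r"[^\w\s]", "", answer)  # drop punctuation
    answer = re.sub(r"\s+", " ", answer)     # collapse whitespace runs
    return answer


def exact_match(prediction: str, gold: str) -> bool:
    """Score 1 only if prediction matches the gold answer exactly
    after normalization; partial credit is never awarded."""
    return normalize(prediction) == normalize(gold)


def benchmark_score(predictions: list[str], golds: list[str]) -> float:
    """Mean exact-match accuracy over all items (e.g. ~0.26 for ~26%)."""
    assert len(predictions) == len(golds)
    matches = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return matches / len(golds)


if __name__ == "__main__":
    preds = ["Paris ", "mitochondria", "42"]
    golds = ["paris", "the mitochondria", "42"]
    print(benchmark_score(preds, golds))  # 2/3 ≈ 0.667
```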
Where this benchmark fits
Use this page when you need benchmark-specific context. For side-by-side comparisons, go back to the full registry or open the general reasoning view. You can also jump straight to this benchmark in the master registry list.