General reasoning benchmark - Public

Humanity's Last Exam

3,000 expert-level questions across 100+ academic disciplines, crowdsourced from domain experts. Designed to sit at or beyond the frontier of human knowledge, making it one of the hardest factual benchmarks to date.

BENCHMARK
Benchmark type:
Public benchmark
Benchmark domain:
General reasoning
Task count:
3,000
Evaluation method:
Exact match
Top model score
~26%
o3 (high)
OpenAI
Human score
N/A
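Since the evaluation method above is exact match, a minimal scoring sketch may help clarify what that means in practice. The normalization step here (lowercasing and stripping whitespace) is an assumption for illustration; the benchmark's actual grader may normalize answers differently.

```python
from typing import List

def normalize(ans: str) -> str:
    # Assumed normalization: lowercase and strip surrounding whitespace.
    return ans.strip().lower()

def exact_match_score(predictions: List[str], references: List[str]) -> float:
    # Fraction of predictions that exactly match their reference answer
    # after normalization.
    assert len(predictions) == len(references)
    matches = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return matches / len(predictions)

preds = ["Paris", " 42 ", "blue"]
refs = ["paris", "42", "green"]
print(exact_match_score(preds, refs))  # → 0.6666666666666666
```

Under this metric, a model gets no partial credit: an answer either matches the reference exactly (after normalization) or scores zero, which is part of why top scores on this benchmark remain low.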

Where this benchmark fits

Use this page when you need benchmark-specific context. For side-by-side comparisons, go back to the full registry or open the general reasoning view. You can also jump straight to this benchmark in the master registry list.