GPQA Diamond
198 expert-level multiple-choice questions in biology, physics, and chemistry, written and validated by domain PhDs and curated as the hardest subset of the full 448-question GPQA set. Only experts in the relevant field consistently score above random.
- Benchmark type: Public benchmark
- Benchmark domain: General reasoning
- Task count: 198
- Evaluation method: Multiple choice
- Top model score: ~87%
- Human score: ~69% (experts)
About this benchmark
GPQA (Graduate-Level Google-Proof Q&A) is a benchmark of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry, introduced by David Rein et al. in November 2023. GPQA Diamond is the hardest curated subset of 198 questions, selected for maximum difficulty and quality. Questions are designed to be "Google-proof," meaning skilled non-experts cannot answer them even with 30+ minutes of unrestricted web access. The dataset is available on Hugging Face and via a password-protected download.
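For quick experimentation, the Diamond split can be pulled straight from Hugging Face. The sketch below assumes the `Idavidrein/gpqa` dataset id with its `gpqa_diamond` configuration, and that each record stores the correct answer and three distractors in separate columns; verify these names against the dataset card, and note the dataset is gated, so you may need to authenticate first.

```python
# Minimal sketch, assuming the Idavidrein/gpqa dataset id and the
# gpqa_diamond config on Hugging Face (check the dataset card; the
# dataset is gated, so `huggingface-cli login` may be required).
import random

from datasets import load_dataset

ds = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")

def build_item(record, rng):
    """Shuffle the correct answer in among the three distractors so the
    gold option does not always land in the same position."""
    choices = [
        record["Correct Answer"],
        record["Incorrect Answer 1"],
        record["Incorrect Answer 2"],
        record["Incorrect Answer 3"],
    ]
    rng.shuffle(choices)
    gold = "ABCD"[choices.index(record["Correct Answer"])]
    return {"question": record["Question"], "choices": choices, "gold": gold}

rng = random.Random(0)
items = [build_item(r, rng) for r in ds]
print(len(items))  # expect 198 items for the Diamond split
```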
On the full GPQA set, domain experts with PhDs in the relevant fields achieve 65% accuracy (74% after discounting mistakes the experts themselves identified in retrospect), while highly skilled non-expert validators reach only 34% despite extensive web research. At the original paper's release, the strongest GPT-4 baseline achieved 39% accuracy. Current frontier reasoning models have pushed significantly higher, with o3-high reaching 83.4% and o4-mini-high achieving 81.3% on GPQA Diamond according to OpenAI's simple-evals results.
GPQA Diamond has become a standard evaluation in major model releases, featured prominently in OpenAI's simple-evals suite alongside MMLU and MATH. Its focus on graduate-level scientific reasoning that resists internet lookup makes it particularly valuable for measuring genuine domain understanding. The dataset is released under an MIT license.
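As a rough illustration of how a multiple-choice benchmark like this is typically scored, the sketch below formats a question with lettered options, asks a model for a final answer letter, and computes accuracy. The prompt template, the `Answer: X` extraction regex, and the `ask_model` stub are illustrative assumptions, not the exact simple-evals implementation.

```python
# Illustrative multiple-choice scoring loop. The prompt wording, the
# "Answer: X" regex, and ask_model() are assumptions made for this
# sketch, not the exact simple-evals code.
import re

ANSWER_RE = re.compile(r"Answer\s*:\s*([ABCD])", re.IGNORECASE)

def format_prompt(item):
    lines = [item["question"], ""]
    for letter, choice in zip("ABCD", item["choices"]):
        lines.append(f"{letter}) {choice}")
    lines.append("")
    lines.append('Give your reasoning, then a final line "Answer: X".')
    return "\n".join(lines)

def ask_model(prompt):
    # Hypothetical stub: replace with a real model API call.
    raise NotImplementedError

def score(items):
    correct = 0
    for item in items:
        reply = ask_model(format_prompt(item))
        match = ANSWER_RE.search(reply)
        if match and match.group(1).upper() == item["gold"]:
            correct += 1
    return correct / len(items)  # accuracy over the 198 Diamond questions
```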
Where this benchmark fits
Use this page when you need benchmark-specific context. For side-by-side comparison, go back to the full registry or open the general reasoning view. You can also jump straight to this benchmark in the master registry list.