General reasoning benchmark - Public

SimpleQA

4,326 short factual questions with a single unambiguous correct answer. Measures factual accuracy and hallucination rate — designed to have no trick questions, only clear facts.

Benchmark by OpenAI

Benchmark type: Public benchmark
Benchmark domain: General reasoning
Task count: 4,326
Evaluation method: Model-graded (each answer classified as correct, incorrect, or not attempted)

Top model score: ~97%
Human score: ~94%

View SimpleQA benchmark paper | SimpleQA GitHub repository

About this benchmark

SimpleQA is a factuality benchmark for large language models developed by OpenAI, released in October 2024. It measures short-form factual accuracy by presenting straightforward questions that have single, unambiguous correct answers. The benchmark is part of OpenAI's simple-evals evaluation suite and is designed to test whether models can provide accurate factual responses rather than hallucinate plausible-sounding but incorrect information.

Each response is graded as correct, incorrect, or not attempted, which allows measurement of both accuracy and calibration (whether a model appropriately abstains when it is uncertain). Results from OpenAI's simple-evals vary widely across models: GPT-4.5-preview leads at 62.5%, o3 scores 49.4%, and GPT-4o variants range from 38.8% to 40.1%. The smaller reasoning-focused models score notably lower, with o4-mini at 20.2% and o3-mini at 13.4%, suggesting that chain-of-thought reasoning does not reliably improve factual recall. GPT-4o-mini, a smaller non-reasoning model, scores 9.5%.
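As a rough illustration of how the three grades described above turn into summary numbers, the sketch below computes overall accuracy, accuracy on attempted questions only, the abstention rate, and a harmonic-mean F-score of the first two. It assumes the grades are already available as plain strings; the label spellings and the function name `summarize_grades` are illustrative choices, not taken from simple-evals.

```python
from collections import Counter

# Hypothetical label strings for the three grading outcomes.
GRADES = ("correct", "incorrect", "not_attempted")


def summarize_grades(grades):
    """Summarize a list of per-question grades into aggregate metrics."""
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")
    total = len(grades)
    if total == 0:
        raise ValueError("no grades to summarize")

    attempted = counts["correct"] + counts["incorrect"]
    # Share of all questions answered correctly.
    overall_correct = counts["correct"] / total
    # Share of attempted questions answered correctly (abstentions excluded).
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0
    # Harmonic mean of the two accuracy figures.
    denom = overall_correct + correct_given_attempted
    f_score = 2 * overall_correct * correct_given_attempted / denom if denom else 0.0

    return {
        "overall_correct": overall_correct,
        "correct_given_attempted": correct_given_attempted,
        "not_attempted_rate": counts["not_attempted"] / total,
        "f_score": f_score,
    }


if __name__ == "__main__":
    example = ["correct"] * 4 + ["incorrect"] * 4 + ["not_attempted"] * 2
    for name, value in summarize_grades(example).items():
        print(f"{name}: {value:.3f}")
```

Separating accuracy on attempted questions from overall accuracy is what makes abstention visible: a model that declines to answer when unsure scores higher on the former without that gain showing up in the latter.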

SimpleQA is significant because it isolates factual accuracy from other capabilities such as reasoning or instruction following. The wide spread of scores across model families suggests that factuality remains a distinct and unsolved challenge. Its inclusion in OpenAI's standard evaluation suite makes it a common reporting metric for new model releases. The benchmark is available under an MIT license.

Where this benchmark fits

Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the general reasoning view. You can also jump straight to this benchmark in the master registry list.