Coding agent benchmark - Public

SciCode

338 scientific coding subproblems from 80 research problems across math, physics, chemistry, biology, and materials science. Tests research-grade code generation against expert-written tests.

BENCHMARK

Benchmark by SciCode

Benchmark type:: Public benchmark
Benchmark domain:: Coding agent
Task count:: 338
Evaluation method:: Test suite

Top model score: ~26%
Human score: ~81%

View SciCode benchmark paper SciCode GitHub repository

About this benchmark

SciCode is a scientist-curated coding benchmark for evaluating language models on realistic scientific research problems, introduced in July 2024 and accepted at NeurIPS 2024 Datasets and Benchmarks Track. It covers 16 subdomains across 5 disciplines: Physics, Mathematics, Material Science, Biology, and Chemistry. The benchmark contains 80 challenging main problems decomposed into 338 subproblems, focusing on numerical methods, system simulation, and scientific calculation. Each problem includes optional scientific background descriptions and gold-standard solutions with test cases annotated by domain scientists.

Evaluation measures both main problem resolve rate and subproblem solve rate. Problems are tested by executing generated code against scientist-annotated numerical test results. In the most realistic setting (no background hints), OpenAI o3-mini-low achieves the highest main problem resolve rate at 10.8% with a 33.3% subproblem rate. Claude 3.5 Sonnet and DeepSeek-R1 each solve 4.6% of main problems. Many models, including Llama-3.1-70B and Mixtral-8x22B, solve 0% of main problems despite partial subproblem success.

SciCode stands out for its extreme difficulty and real-world scientific grounding, going well beyond exam-style coding questions. It is integrated with the inspect_ai framework and OpenCompass for standardized evaluation. The dataset is available on HuggingFace and GitHub.

Where this benchmark fits

Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the coding view . You can also jump straight to this benchmark in the master registry list .