Terminal-Bench 2.0
Purely terminal-based coding and system tasks with no GUI. Tests command-line proficiency across bash, Python, and system administration. Harder and more realistic than sandbox coding benchmarks.
- Benchmark type:
- Public benchmark
- Benchmark domain:
- Coding agent
- Task count:
- ~200
- Evaluation method:
- Execution-based
- Top model score:
- ~45%
- Human score:
- N/A
About this benchmark
Terminal-Bench 2.0 is a benchmark developed by the Laude Institute that measures how well AI agents and language models can perform valuable work in containerized terminal environments. Released in 2025 as the successor to the original Terminal-Bench, version 2.0 features harder tasks designed to test frontier model capabilities. Tasks span diverse domains, including protein assembly for synthesis, debugging asynchronous code, and resolving security vulnerabilities. Each task undergoes several hours of human and LM-assisted validation to ensure it is solvable, realistic, and well-specified.
Evaluation runs through Harbor, the Laude Institute's framework for agentic evals and RL rollouts, which executes tasks in Docker containers locally or in the cloud. Agents interact with containerized environments via terminal commands, and oracle solutions are provided for validation. The benchmark supports multiple agent backends including Claude Code and custom agents via the BaseAgent interface.
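To make the agent-backend contract concrete, here is a minimal, hypothetical sketch of a custom agent built against a BaseAgent-style interface. The names below (the stand-in BaseAgent class, perform_task, and session.send_keys) are illustrative assumptions rather than the exact Harbor API; the real interface is documented in the project's repository.

```python
# Hypothetical sketch of a custom agent plugged into a BaseAgent-style
# interface. Class and method names here are illustrative assumptions,
# not the exact Harbor API.


class BaseAgent:
    """Stand-in for the framework's agent base class (assumed interface)."""

    def perform_task(self, instruction: str, session) -> None:
        raise NotImplementedError


class LookAroundAgent(BaseAgent):
    """Toy agent: inspects the task environment before attempting anything."""

    def perform_task(self, instruction: str, session) -> None:
        # The session is assumed to proxy a shell inside the task's Docker
        # container; each call types a command into that terminal.
        session.send_keys("ls -la\n")                    # survey the workspace
        session.send_keys("cat README* 2>/dev/null\n")   # read any task hints
        # A real agent would loop: feed terminal output and the instruction
        # to a model, extract the next command from the model's reply, and
        # repeat until the task's verification checks can pass.
```

The point of the sketch is the contract rather than the names: the harness hands the agent a task instruction and a handle to the containerized terminal, the agent's only lever is the commands it issues, and success is judged by execution-based checks in the container rather than by inspecting the agent's reasoning.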
Terminal-Bench is used by virtually all frontier labs as a measure of practical agent capability in realistic computing environments. Its emphasis on task quality and containerized reproducibility distinguishes it from simpler code generation benchmarks. The Terminal-Bench 2.0 tasks and the Harbor evaluation framework are open source on GitHub.
Where this benchmark fits
Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the coding view. You can also jump straight to this benchmark in the master registry list.