Terminal-Bench 2.0
Purely terminal-based coding and system tasks with no GUI. Tests command-line proficiency across bash, Python, and system administration. Harder and more realistic than sandbox coding benchmarks.
- Benchmark type:
- Public benchmark
- Benchmark domain:
- Coding agent
- Task count:
- ~200
- Evaluation method:
- Execution-based
- Top model score:
- ~45%
- Human score:
- N/A
About this benchmark
Terminal-Bench 2.0 is a benchmark developed by the Laude Institute that measures how well AI agents and language models can perform valuable work in containerized terminal environments. Released in 2025 as the successor to the original Terminal-Bench, version 2.0 features harder tasks designed to test frontier model capabilities. Tasks span diverse domains, including protein assembly for synthesis, debugging asynchronous code, and resolving security vulnerabilities. Each task undergoes several hours of human and LM-assisted validation to ensure it is solvable, realistic, and well-specified.
Evaluation runs through Harbor, the Laude Institute's framework for agentic evals and RL rollouts, which executes tasks in Docker containers locally or in the cloud. Agents interact with containerized environments via terminal commands, and oracle solutions are provided for validation. The benchmark supports multiple agent backends including Claude Code and custom agents via the BaseAgent interface.
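To make the agent-backend contract concrete, here is a minimal, hypothetical sketch of a custom agent built against a BaseAgent-style interface. The names below (the stand-in BaseAgent class, perform_task, and session.send_keys) are illustrative assumptions rather than the exact Harbor API; the real interface is documented in the project's repository.

```python
# Hypothetical sketch of a custom agent plugged into a BaseAgent-style
# interface. Class and method names here are illustrative assumptions,
# not the exact Harbor API.


class BaseAgent:
    """Stand-in for the framework's agent base class (assumed interface)."""

    def perform_task(self, instruction: str, session) -> None:
        raise NotImplementedError


class LookAroundAgent(BaseAgent):
    """Toy agent: inspects the task environment before attempting anything."""

    def perform_task(self, instruction: str, session) -> None:
        # The session is assumed to proxy a shell inside the task's Docker
        # container; each call types a command into that terminal.
        session.send_keys("ls -la\n")                    # survey the workspace
        session.send_keys("cat README* 2>/dev/null\n")   # read any task hints
        # A real agent would loop: feed terminal output and the instruction
        # to a model, extract the next command from the model's reply, and
        # repeat until the task's verification checks can pass.
```

The point of the sketch is the contract rather than the names: the harness hands the agent a task instruction and a handle to the containerized terminal, the agent's only lever is the commands it issues, and success is judged by execution-based checks in the container rather than by inspecting the agent's reasoning.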
Terminal-Bench is used by virtually all frontier labs as a measure of practical agent capability in realistic computing environments. Its emphasis on task quality and containerized reproducibility distinguishes it from simpler code generation benchmarks. The Terminal-Bench 2.0 tasks and the Harbor evaluation framework are open source on GitHub.
Where this benchmark fits
Use this page when you need the benchmark-specific context. For side-by-side comparison, go back to the full registry or open the coding view. You can also jump straight to this benchmark in the master registry list.