Terminal-Bench 2.1
Can the model complete complex tasks in a terminal environment on its own, from compiling code to setting up servers?
Top models
| # | Model | Provider | Score | Date |
|---|---|---|---|---|
| 1 | Claude Fable 5 | Anthropic | 88% | 2026-06-09 |
| 2 | Claude Mythos 5 | Anthropic | 88% | 2026-06-09 |
| 3 | Claude Opus 4.8 | Anthropic | 74.6% | 2026-05 |
What does it measure?
Terminal-Bench measures how well AI agents carry out full, realistic tasks inside a Linux container. Tasks include protein synthesis, debugging async code, fixing security holes, training ML models and configuring servers. The model gets an instruction, a Docker container and a time limit. The tests only look at the end result, not the intermediate steps.
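To illustrate what "only the end result" means, here is a minimal sketch of an outcome-based check. It is not the benchmark's actual test harness: the function name `end_state_passes` and the expected file `result.json` are hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path


def end_state_passes(workdir: Path) -> bool:
    """Inspect only the final artifact, not the commands that produced it."""
    out = workdir / "result.json"  # hypothetical expected output file
    if not out.is_file():
        return False
    try:
        data = json.loads(out.read_text())
    except json.JSONDecodeError:
        return False
    # Assumed pass criterion: the file holds a non-empty JSON list.
    return isinstance(data, list) and len(data) > 0
```

A check of this kind passes or fails on the final state alone, regardless of which commands the agent ran to get there.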
How to read the score
The score is the percentage of tasks whose end result passes the tests. The benchmark contains 89 tasks from software engineering, machine learning, security and data science. Each task is scored in binary: completed or not, with no partial credit.
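The arithmetic behind the score can be sketched as follows. The function name `benchmark_score` and the 60-of-89 run are made up for illustration and do not correspond to any model in the table.

```python
def benchmark_score(results: list[bool]) -> float:
    """Return the percentage of tasks that passed (binary pass/fail per task)."""
    if not results:
        raise ValueError("no task results")
    return 100.0 * sum(results) / len(results)


# Hypothetical run: 60 of the 89 tasks pass.
results = [True] * 60 + [False] * 29
print(f"{benchmark_score(results):.1f}%")  # prints "67.4%"
```

With 89 tasks, a single task moves the score by about 1.1 percentage points, so small gaps between models can come down to one or two tasks.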
Example task
Set up a container environment with a PostgreSQL database, import a CSV dataset, write a query that ranks the top 10 customers by revenue, and export the result as JSON.
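The query and export steps of such a task might look like the sketch below. It makes several assumptions: SQLite's in-memory database stands in for PostgreSQL so the example runs without a server, and the table name `orders`, the columns `customer` and `revenue`, and the sample rows are all hypothetical.

```python
import csv
import io
import json
import sqlite3

# Hypothetical CSV contents; the real dataset and its columns are not specified.
CSV_DATA = """customer,revenue
alice,120.50
bob,80.00
alice,30.00
carol,200.00
"""


def top_customers(csv_text: str, limit: int = 10) -> str:
    """Import CSV rows, rank customers by total revenue, return JSON."""
    conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
    conn.execute("CREATE TABLE orders (customer TEXT, revenue REAL)")
    rows = [
        (r["customer"], float(r["revenue"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    cur = conn.execute(
        "SELECT customer, SUM(revenue) AS total FROM orders "
        "GROUP BY customer ORDER BY total DESC LIMIT ?",
        (limit,),
    )
    result = [{"customer": c, "revenue": t} for c, t in cur]
    conn.close()
    return json.dumps(result, indent=2)


print(top_customers(CSV_DATA))
```

In the benchmark itself the agent would also have to install and start the database inside the container, which this sketch leaves out.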
What to watch out for
Terminal-Bench contains only Linux tasks. Windows and macOS workflows are not tested. The tasks require full autonomy: the model may not ask for human help. Version 2.1 fixed 26 tasks from version 2.0 that contained bugs or allowed reward hacking.