Agentic

Terminal-Bench 2.1

Can the model complete complex tasks in a terminal environment on its own, from compiling code to setting up servers?

Top models

Rank  Model            Provider   Score  Date
1     Claude Fable 5   Anthropic  88%    2026-06-09
2     Claude Mythos 5  Anthropic  88%    2026-06-09
3     Claude Opus 4.8  Anthropic  74.6%  May 2026

What does it measure?

Terminal-Bench measures how well AI agents carry out full, realistic tasks inside a Linux container. Tasks include protein synthesis, debugging async code, fixing security holes, training ML models and configuring servers. The model gets an instruction, a Docker container and a time limit. The tests only look at the end result, not the intermediate steps.

How to read the score

The score is the percentage of tasks with a correct end result. The benchmark contains 89 tasks spanning software engineering, machine learning, security and data science. Each task is scored as binary: passed or failed, with no partial credit.
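To make the arithmetic concrete, here is a minimal sketch of how a single-run score could be computed from binary task outcomes. The function name, the task identifiers and the pass count are hypothetical illustrations, not part of the benchmark's own tooling.

```python
def benchmark_score(results: dict[str, bool]) -> float:
    """Return the percentage of tasks that passed, given per-task pass/fail outcomes."""
    if not results:
        raise ValueError("no task results")
    passed = sum(results.values())  # True counts as 1, False as 0
    return 100.0 * passed / len(results)


# Hypothetical run: 89 tasks, of which the first 66 passed.
results = {f"task-{i:02d}": i < 66 for i in range(89)}
score = benchmark_score(results)
print(f"{score:.1f}%")  # prints "74.2%"
```

With 89 tasks, each task moves a single-run score by roughly 1.1 percentage points (100 / 89), so small gaps between models can correspond to only one or two tasks.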

Example task

Set up a container environment with a PostgreSQL database, import a CSV dataset, write a query that ranks the top 10 customers by revenue, and export the result as JSON.
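The data-handling part of this task can be sketched roughly as follows. This is an illustration only: it assumes SQLite (via Python's built-in sqlite3 module) as a stand-in for PostgreSQL so the example runs without a server, and the CSV contents, table name and column names are hypothetical.

```python
import csv
import io
import json
import sqlite3

# Hypothetical CSV content; the real task would supply its own dataset.
CSV_DATA = """customer,amount
alice,120.50
bob,80.00
alice,30.00
carol,200.00
bob,45.25
"""


def top_customers(csv_text: str, limit: int = 10) -> str:
    """Import the CSV, rank customers by total revenue, return the result as JSON."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for PostgreSQL
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    rows = [
        (r["customer"], float(r["amount"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT customer, SUM(amount) AS revenue FROM orders "
        "GROUP BY customer ORDER BY revenue DESC LIMIT ?",
        (limit,),
    )
    result = [{"customer": c, "revenue": round(rev, 2)} for c, rev in cursor]
    conn.close()
    return json.dumps(result, indent=2)


top10_json = top_customers(CSV_DATA)
print(top10_json)
```

In the benchmark itself the agent would also have to install and start the database inside the container, and only the final exported JSON would be checked.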

What to watch out for

Terminal-Bench contains only Linux tasks; Windows and macOS workflows are not tested. The tasks require full autonomy: the model cannot ask for human help. Version 2.1 corrected 26 tasks from version 2.0 to fix bugs and close reward-hacking loopholes.

