HumanEval
Can the model write a Python function from a short description?
Top models
| # | Model | Provider | Score | Date |
|---|---|---|---|---|
| 1 | MiniCPM-SALA | OpenBMB | 95.1% | 2026-04 |
| 2 | Kimi K2 0905 | Moonshot AI | 94.5% | 2025-09 |
| 3 | Claude 3.5 Sonnet | Anthropic | 93.7% | 2024-10 |
| 4 | GPT-5 | OpenAI | 93.4% | 2025-08 |
| 5 | Kimi K2 Instruct | Moonshot AI | 93.3% | 2025-07 |
| 6 | Qwen2.5-Coder 32B Instruct | Alibaba | 92.7% | 2024-11 |
| 7 | o1-mini reasoning | OpenAI | 92.4% | 2024-09 |
| 8 | Sarvam-30B | Sarvam AI | 92.1% | 2025-06 |
What does it measure?
HumanEval is OpenAI's classic code test from 2021. The model gets 164 hand-written Python tasks, each consisting of a function name, a list of arguments and a docstring describing what the function should do. The model writes the function body, and the solution is checked automatically with unit tests.
The metric is called pass@1: does it work on the first try, with no extra attempts?
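The evaluation loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the official harness: the `Task` class, the `passes` and `pass_at_1` functions, and the two toy tasks are hypothetical names and data introduced here. A real harness would run model output in a sandbox with a timeout instead of calling `exec` directly.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One benchmark item (hypothetical structure, for illustration only)."""
    prompt: str  # function signature plus docstring
    test: str    # unit tests, written as assert statements


def passes(task: Task, completion: str) -> bool:
    """Return True if prompt + completion passes the task's unit tests."""
    program = task.prompt + completion + "\n" + task.test
    namespace: dict = {}
    try:
        # WARNING: exec on model output is unsafe outside a sandbox,
        # and this sketch has no timeout for infinite loops.
        exec(program, namespace)
    except Exception:
        return False
    return True


def pass_at_1(tasks: list[Task], completions: list[str]) -> float:
    """Percentage of tasks whose single first attempt passes."""
    solved = sum(passes(t, c) for t, c in zip(tasks, completions))
    return 100.0 * solved / len(tasks)


# Toy data, not real HumanEval tasks.
toy_tasks = [
    Task(
        prompt='def add(a, b):\n    "Return the sum of a and b."\n',
        test="assert add(2, 3) == 5\n",
    ),
    Task(
        prompt='def double(x):\n    "Return x multiplied by two."\n',
        test="assert double(4) == 8\n",
    ),
]
toy_completions = [
    "    return a + b\n",  # correct body
    "    return x + 1\n",  # wrong body: fails the assert
]

score = pass_at_1(toy_tasks, toy_completions)
print(f"pass@1 = {score:.1f}%")
```

With one of the two toy tasks solved, the sketch reports a pass@1 of 50%. Each task gets exactly one completion, which is what separates pass@1 from metrics that allow several attempts.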
How to read the score
The score is the percentage of tasks where the first attempt passes.
- Random guessing: not meaningful for free code generation.
- Human baseline: no official experiment exists; informal estimates put experienced Python developers at around 80-90%.
- Current top: about 95% in the table above, with the top five models within two percentage points of each other. Practically saturated.
Below 80% = not competitive. Above 95% = impossible to tell models apart on this test.
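As a rough illustration, the two thresholds above can be written as a small helper. This is only a sketch: the function name `read_score`, the label "competitive" for the middle band, and the example models and scores are assumptions made here, not part of the benchmark.

```python
def read_score(score_pct: float) -> str:
    """Map a HumanEval pass@1 percentage to the rough bands described above."""
    if not 0.0 <= score_pct <= 100.0:
        raise ValueError("score must be a percentage between 0 and 100")
    if score_pct < 80.0:
        return "not competitive"
    if score_pct > 95.0:
        return "saturated: cannot be told apart from other top models"
    # The middle band has no name in the text; "competitive" is an assumed label.
    return "competitive"


# Hypothetical models and scores, for illustration only.
for name, score in [("Model A", 72.0), ("Model B", 92.4), ("Model C", 96.3)]:
    print(f"{name}: {score:.1f}% -> {read_score(score)}")
```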
Example task
Task 0 from HumanEval:
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    "Check whether any two numbers in the given list are closer to each other than the given threshold."
Test:
has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) should return True (2.8 and 3.0 are 0.2 apart).
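For reference, one plausible function body a model could produce for this task is sketched below. It is not the benchmark's canonical solution, just one approach that satisfies the quoted test. The sorting strategy and the second assertion are choices made here for illustration.

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check whether any two numbers in the given list are closer to each other than the given threshold."""
    # After sorting, the closest pair is always adjacent, so one pass suffices.
    ordered = sorted(numbers)
    return any(b - a < threshold for a, b in zip(ordered, ordered[1:]))


# The test quoted above: 2.8 and 3.0 are 0.2 apart, which is below 0.3.
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True
# A hypothetical extra check: no pair here is closer than 0.5.
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
```

Sorting first keeps the check at O(n log n) instead of comparing every pair, but a plain double loop would pass the same tests.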
What to watch out for
- Saturated. In 2026 HumanEval no longer separates top models. It is still a useful "baseline filter": if a model scores below 80%, do not let it near your code at all.
- Contamination. These tasks have been public online for years. Models have almost certainly seen them during training. OpenAI itself reported ~25% overlap with GPT-4's training corpus.
- No agent workflow. HumanEval tests pure generation of short functions, with no file manipulation and no multiple rounds. Not representative of real software work.