Code

HumanEval

Can the model write a Python function from a short description?

Top models

#  Model                       Provider     Score  Date
1  MiniCPM-SALA                OpenBMB      95.1%  2026-04
2  Kimi K2 0905                Moonshot AI  94.5%  2025-09
3  Claude 3.5 Sonnet           Anthropic    93.7%  2024-10
4  GPT-5                       OpenAI       93.4%  2025-08
5  Kimi K2 Instruct            Moonshot AI  93.3%  2025-07
6  Qwen2.5-Coder 32B Instruct  Alibaba      92.7%  2024-11
7  o1-mini (reasoning)         OpenAI       92.4%  2024-09
8  Sarvam-30B                  Sarvam AI    92.1%  2025-06

What does it measure?

HumanEval is OpenAI's classic code test from 2021. The model gets 164 hand-written Python tasks: a function name, a list of arguments and a docstring describing what the function should do. The model writes the function body. The solution is tested automatically with unit tests.
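To make the procedure concrete, here is a minimal sketch of how one task could be scored. It assumes each task supplies a prompt (signature plus docstring), a test string that defines `check(candidate)`, and the name of the function under test. The helper `run_task` and the toy `add` task are made up for illustration and are not part of the benchmark. A real harness would also run the code in a sandboxed subprocess with a timeout, which this sketch leaves out.

```python
def run_task(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes the task's unit tests.

    Hypothetical helper for illustration. It executes model-written code
    in-process, so it must never be used on untrusted output.
    """
    # Stitch the pieces into one program: the function, its tests, and the call.
    program = f"{prompt}{completion}\n{test}\ncheck({entry_point})\n"
    namespace: dict = {}
    try:
        exec(program, namespace)
        return True
    except Exception:
        # Failed assertions, runtime errors and syntax errors all count as a fail.
        return False


# A toy task in the same shape (not taken from HumanEval).
toy_prompt = 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n'
toy_test = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)

good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"

good_result = run_task(toy_prompt, good_completion, toy_test, "add")
bad_result = run_task(toy_prompt, bad_completion, toy_test, "add")
```

The model only ever contributes the `completion` string. Everything else is fixed by the task, which is why the result is a plain pass or fail.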

The metric is called pass@1: does it work on the first try, with no extra attempts?
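pass@1 is the k = 1 case of the more general pass@k metric. The sketch below uses the unbiased estimator from the original HumanEval paper, 1 - C(n - c, k) / C(n, k), where n samples are drawn per task and c of them pass. The function name and the example numbers are illustrative choices only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes.

    n: samples generated for the task
    c: samples that passed the unit tests
    k: attempts the metric allows
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the plain pass rate c / n.
single_attempt = pass_at_k(n=10, c=3, k=1)

# Allowing more attempts can only raise the estimate for the same samples.
five_attempts = pass_at_k(n=10, c=3, k=5)
```

The reported benchmark score is the mean of this per-task value over all 164 tasks.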

How to read the score

The score is the percentage of tasks where the first attempt passes.

  • Random guessing: not meaningful for free code generation.
  • Human baseline: no official experiment; informal estimates put experienced Python developers at around 80-90%.
  • Current top: about 95%, with the top five models within two percentage points of each other. Practically saturated.

Below 80% = not competitive. Above 95% = impossible to tell models apart on this test.
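A quick bit of arithmetic shows why small gaps at the top carry little information. With 164 tasks and one attempt each, every task is worth about 0.61 percentage points. The sketch assumes a score taken from a single attempt per task; the function name is illustrative.

```python
TOTAL_TASKS = 164  # size of the HumanEval task set


def score_percent(passed: int, total: int = TOTAL_TASKS) -> float:
    """Percentage of tasks solved on the first attempt."""
    return 100.0 * passed / total


# One extra solved task moves the score by 100 / 164 points.
points_per_task = score_percent(1)

# Two hypothetical models that differ by exactly one task.
model_a = score_percent(156)
model_b = score_percent(155)
gap = model_a - model_b
```

Under that assumption, a gap of roughly 0.6 points between two models amounts to a single task out of 164.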

Example task

Task 0 from HumanEval:

from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check whether any two numbers in the given list are closer to each other than the given threshold."""

Test: has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) should return True (2.8 and 3.0 are 0.2 apart).
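For reference, one body that would pass this test is sketched below. It is one possible solution, not necessarily the canonical one shipped with the dataset. It sorts the list first, because the closest pair of numbers always ends up adjacent after sorting.

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check whether any two numbers in the list are closer than threshold."""
    ordered = sorted(numbers)
    # After sorting, only neighbouring values need to be compared.
    return any(b - a < threshold for a, b in zip(ordered, ordered[1:]))


close_found = has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
none_found = has_close_elements([1.0, 2.0, 3.0], 0.5)
```

Comparing every pair directly would also work. Sorting just keeps the number of comparisons down for long lists.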

What to watch out for

  • Saturated. In 2026 HumanEval no longer separates top models. It is still a useful "baseline filter": if a model scores below 80%, do not let it near your code at all.
  • Contamination. These tasks have been public online for years. Models have almost certainly seen them during training. OpenAI itself reported ~25% overlap with GPT-4's training corpus.
  • No agent workflow. HumanEval tests pure generation of short functions, with no file manipulation and no multiple rounds. Not representative of real software work.
