Top models

#	Model	Provider	Score	Date
1	MiniCPM-SALA	OpenBMB	95.1%	2026-04
2	Kimi K2 0905	Moonshot AI	94.5%	2025-09
3	Claude 3.5 Sonnet	Anthropic	93.7%	2024-10
4	GPT-5	OpenAI	93.4%	2025-08
5	Kimi K2 Instruct	Moonshot AI	93.3%	2025-07
6	Qwen2.5-Coder 32B Instruct	Alibaba	92.7%	2024-11
7	o1-mini reasoning	OpenAI	92.4%	2024-09
8	Sarvam-30B	Sarvam AI	92.1%	2025-06

What does it measure?

HumanEval is OpenAI's classic code test from 2021. The model gets 164 hand-written Python tasks: a function name, a list of arguments and a docstring describing what the function should do. The model writes the function body. The solution is tested automatically with unit tests.
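The check described above can be sketched in a few lines. This is a minimal illustration, not the official harness: the toy task, the `check` function and the `passes` helper are hypothetical names made up for this example, and a real evaluation would run generated code in a sandbox rather than calling `exec` directly.

```python
from typing import Callable, Dict

# Toy task in the HumanEval style: signature plus docstring (illustrative, not from the dataset).
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# What a model might generate: only the function body.
GOOD_COMPLETION = "    return a + b\n"
BAD_COMPLETION = "    return a - b\n"


def check(candidate: Callable[[int, int], int]) -> None:
    """Unit tests for the toy task; raises AssertionError on failure."""
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0


def passes(prompt: str, completion: str, entry_point: str,
           test: Callable[[Callable], None]) -> bool:
    """Return True if prompt + completion defines a function that passes the tests."""
    namespace: Dict[str, object] = {}
    try:
        # Unsafe for untrusted code; shown here only to illustrate the flow.
        exec(prompt + completion, namespace)
        test(namespace[entry_point])
    except Exception:
        return False
    return True


print(passes(PROMPT, GOOD_COMPLETION, "add", check))  # True
print(passes(PROMPT, BAD_COMPLETION, "add", check))   # False
```

Any exception counts as a failure here, whether it is a syntax error in the generated body or a failed assertion in the tests.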

The metric is called pass@1: does it work on the first try, with no extra attempts?

How to read the score

The score is the percentage of tasks where the first attempt passes.
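As a rough sketch of how that percentage is computed, assuming one attempt per task and a simple mapping from task ID to pass/fail (the `pass_at_1` helper and the task IDs are hypothetical, chosen for illustration):

```python
from typing import Dict


def pass_at_1(first_attempt_passed: Dict[str, bool]) -> float:
    """Percentage of tasks whose single first attempt passed all unit tests."""
    if not first_attempt_passed:
        raise ValueError("no results to score")
    solved = sum(first_attempt_passed.values())
    return 100.0 * solved / len(first_attempt_passed)


# Made-up results: 154 of the 164 tasks solved on the first attempt.
results = {f"task_{i}": i < 154 for i in range(164)}
print(f"{pass_at_1(results):.1f}%")  # 93.9%
```

With 164 tasks, a single task is worth about 0.61 percentage points, so gaps smaller than that correspond to less than one task.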

Random guessing: not meaningful for free code generation.
Human baseline: no official experiment; informal estimates put experienced Python developers at around 80-90%.
Current top: about 95% in the table above, with the top five models within two percentage points of each other. Practically saturated.

Below 80% = not competitive. Above 95% = impossible to tell models apart on this test.

Example task

Task 0 from HumanEval:

from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check whether any two numbers in the given list are closer to each other than the given threshold."""

Test: has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) should return True (2.8 and 3.0 are 0.2 apart).
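For illustration, here is one possible body that would pass this test. It is a sketch of a valid solution, not the benchmark's reference solution: it sorts the list and compares neighbours, which is enough because the closest pair in a list is always adjacent once sorted.

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check whether any two numbers in the list are closer than the threshold."""
    ordered = sorted(numbers)
    # After sorting, the smallest gap is always between two neighbours.
    return any(b - a < threshold for a, b in zip(ordered, ordered[1:]))


print(has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3))  # True
print(has_close_elements([1.0, 2.0, 3.0], 0.5))                 # False
```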

What to watch out for

Saturated. In 2026 HumanEval no longer separates top models. It is still a useful "baseline filter": if a model scores below 80%, do not let it near your code at all.
Contamination. These tasks have been public online for years. Models have almost certainly seen them during training. OpenAI itself reported ~25% overlap with GPT-4's training corpus.
No agent workflow. HumanEval tests pure generation of short functions, with no file manipulation and no multi-turn iteration. It is not representative of real software work.
