LiveCodeBench

A coding benchmark that adds new tasks every month so that test problems are unlikely to have leaked into training data.

Top models

#  Model                               Tag        Provider  Score  Date
1  Gemini 3 Pro Preview (high)         reasoning  Google    91.7%  2026-04
2  Gemini 3 Flash Preview (Reasoning)  reasoning  Google    90.8%  2026-04
3  DeepSeek V3.2 Speciale              -          DeepSeek  89.6%  2026-04
4  DeepSeek-V3.2 (Thinking)            reasoning  DeepSeek  83.3%  2026-04
5  MiniMax M2                          -          MiniMax   83%    2026-04
6  LongCat-Flash-Thinking-2601         reasoning  Meituan   82.8%  2026-04
7  Nemotron 3 Super (120B A12B)        -          NVIDIA    81.2%  2026-04
8  Grok 4 Fast                         -          xAI       80%    2026-04

What does it measure?

LiveCodeBench collects new programming-contest tasks from LeetCode, AtCoder and Codeforces as they are published, and uses them to test whether a model can write code for problems it cannot have seen during training. Beyond plain code generation, it also tests self-correction (repairing a failed solution), code execution and test-output prediction.

The dataset is live: every few months a new version (v4, v5, v6) ships, each tied to a date window, so a model can be scored only on tasks published after a certain date. As long as that date lies after the model's training cutoff, the benchmark is contamination-resistant.
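The date-based selection described above can be sketched in a few lines. This is a minimal illustration, not LiveCodeBench's actual tooling: the Task record, its field names, the sample dates and the training cutoff are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    # Hypothetical record; the real dataset schema may differ.
    task_id: str
    source: str      # e.g. "LeetCode", "AtCoder", "Codeforces"
    published: date  # day the contest problem went live


def uncontaminated(tasks, training_cutoff):
    """Keep only tasks published strictly after the model's training cutoff."""
    return [t for t in tasks if t.published > training_cutoff]


# Illustrative data only; ids and dates are made up.
tasks = [
    Task("a", "LeetCode", date(2024, 3, 10)),
    Task("b", "AtCoder", date(2024, 9, 1)),
    Task("c", "Codeforces", date(2025, 1, 15)),
]

fresh = uncontaminated(tasks, training_cutoff=date(2024, 6, 30))
print([t.task_id for t in fresh])  # ['b', 'c']
```

Because each model has its own cutoff, the set of usable tasks can differ per model, which is one reason scores are only comparable when the date window is stated.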

How to read the score

The score is pass@1 on the latest tranche: the share of tasks solved by a single generated solution that passes all tests. Scores depend heavily on the chosen time window, so the same model can receive different scores on v5 and v6.

  • Random guessing: not meaningful (free code generation).
  • Human baseline: top competitive programmers (red Codeforces rating) typically reach 80-95%.
  • Current top: around 90% on the latest tranche.
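To make the pass@1 number concrete, here is a minimal sketch using the widely used unbiased pass@k estimator, averaged over tasks. How many samples a given leaderboard draws per task is an assumption here, and the per-task results below are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results, k=1):
    """Mean pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up results: 10 samples per task, with 10, 5 and 0 of them correct.
results = [(10, 10), (10, 5), (10, 0)]
print(round(benchmark_score(results), 3))  # 0.5
```

For k=1 the estimator reduces to c/n per task, so the benchmark score is simply the average fraction of correct samples.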

Example task

Example task (LeetCode-style, from a live tranche):

"You are given two positive integers xCorner and yCorner and a 2D array circles, where each circle is given as [x, y, r]. There is a rectangle with its bottom-left corner at (0,0) and top-right corner at (xCorner, yCorner). Determine whether a path exists from bottom-left to top-right that stays entirely inside the rectangle and does not touch or cross any circle."

Example: xCorner=3, yCorner=4, circles=[[2,1,1]] → true.
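A rough sketch of how such a task might be graded: a candidate function counts as correct only if it returns the expected answer on every test case. The harness, the function names and the second test case are assumptions made for illustration, not LiveCodeBench's real grader or test suite. The placeholder candidate is deliberately wrong.

```python
def passes_all(candidate, test_cases):
    """Return True only if candidate gives the expected answer on every test."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failure
    return True


# The first case is the example above. In the second (hypothetical) case the
# circle (1, 1, 2) covers the start corner (0, 0), since 1^2 + 1^2 <= 2^2,
# so no valid path can exist.
test_cases = [
    ((3, 4, [[2, 1, 1]]), True),
    ((3, 3, [[1, 1, 2]]), False),
]


def always_true(x_corner, y_corner, circles):
    # Deliberately wrong placeholder, not a real solution to the task.
    return True


print(passes_all(always_true, test_cases[:1]))  # True
print(passes_all(always_true, test_cases))      # False
```

The placeholder passes the single visible example but fails once a second case is added, which is why grading relies on a broader set of tests than the examples shown in the task statement.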

What to watch out for

  • The time window matters. Comparisons that do not state which tranche (v4/v5/v6) was used are misleading. A model can, for example, score 92% on v5 but only 82% on v6.
  • Contest style. Tasks are short, puzzle-like algorithmic questions, not representative of production software where you touch files, dependencies and legacy code.
  • Contamination window. Once published, tasks eventually find their way into training data, so LiveCodeBench has to be refreshed continuously.
