LiveCodeBench
A coding benchmark that adds new tasks every month to limit training-data contamination.
Top models
| # | Model | Provider | Score | Date |
|---|---|---|---|---|
| 1 | Gemini 3 Pro Preview (high) reasoning | Google | 91.7% | 2026-04 |
| 2 | Gemini 3 Flash Preview (Reasoning) reasoning | Google | 90.8% | 2026-04 |
| 3 | DeepSeek V3.2 Speciale | DeepSeek | 89.6% | 2026-04 |
| 4 | DeepSeek-V3.2 (Thinking) reasoning | DeepSeek | 83.3% | 2026-04 |
| 5 | MiniMax M2 | MiniMax | 83% | 2026-04 |
| 6 | LongCat-Flash-Thinking-2601 reasoning | Meituan | 82.8% | 2026-04 |
| 7 | Nemotron 3 Super (120B A12B) | NVIDIA | 81.2% | 2026-04 |
| 8 | Grok 4 Fast | xAI | 80% | 2026-04 |
What does it measure?
LiveCodeBench collects new programming-contest tasks from LeetCode, AtCoder and Codeforces as soon as they go live, and uses them to test whether a model can write code for problems it cannot have seen during training. Beyond plain code generation, it also tests self-repair (fixing its own failing code), code execution and test-output prediction.
The dataset is live: every few months a new version (v4, v5, v6) ships with only tasks published after a certain date. A model whose training data ends before that date cannot have seen those tasks, which is what makes the benchmark contamination-resistant.
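The date-based selection described above can be sketched in a few lines. This is a minimal illustration, not LiveCodeBench's own tooling: the `Task` record, its field names and the `tasks_after` helper are hypothetical, and the sample entries are made up for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are assumptions for this sketch."""
    task_id: str
    source: str        # e.g. "LeetCode", "AtCoder", "Codeforces"
    published: date    # date the task went live on the contest site


def tasks_after(tasks: list[Task], cutoff: date) -> list[Task]:
    """Keep only tasks published strictly after the cutoff date."""
    return [t for t in tasks if t.published > cutoff]


# Made-up sample data, only to show the filtering step.
sample = [
    Task("a1", "LeetCode", date(2024, 3, 10)),
    Task("b2", "AtCoder", date(2024, 9, 1)),
    Task("c3", "Codeforces", date(2025, 1, 15)),
]

# Evaluate a model whose training data ends on 2024-06-30.
fresh = tasks_after(sample, date(2024, 6, 30))
print([t.task_id for t in fresh])
```

Choosing the cutoff per model (its training-data end date) rather than per benchmark version is one possible design; a fixed version such as v6 simply bakes a single cutoff into the task list.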
How to read the score
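The score is pass@1 on the latest tranche, that is, the share of tasks for which a single generated solution passes all tests. Scores depend heavily on the chosen time window; the same model can get two different scores depending on whether v5 or v6 is used.
- Random guessing: not applicable, because answers are free-form code rather than a choice among options.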
- Human baseline: top competitive programmers (red Codeforces rating) typically reach 80-95%.
- Current top: around 90% on the latest tranche.
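To make the pass@1 figure concrete, here is a minimal sketch of how such a score can be computed. It uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per task and c the number that pass all tests. Whether a given leaderboard samples once or several times per task is an assumption here, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of n generated samples passed all tests."""
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over tasks; each entry is (n samples, c correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# One sample per task: pass@1 is simply the fraction of tasks solved.
single_shot = [(1, 1), (1, 0), (1, 1), (1, 1)]
print(benchmark_score(single_shot))
```

With one sample per task the estimator reduces to the plain solve rate; with several samples per task and k = 1 it reduces to the average fraction of passing samples, c / n.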
Example task
Example task (LeetCode-style, from a live tranche):
"You are given two positive integers `xCorner` and `yCorner` and a 2D array `circles`, where each circle is given as `[x, y, r]`. There is a rectangle with its bottom-left corner at (0, 0) and top-right corner at (`xCorner`, `yCorner`). Determine whether a path exists from bottom-left to top-right that stays entirely inside the rectangle and does not touch or cross any circle."
Example: `xCorner = 3`, `yCorner = 4`, `circles = [[2, 1, 1]]` → `true`.
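To give a feel for what a passing answer involves, here is one possible approach, offered as an illustrative sketch and not as the benchmark's reference solution. The idea: the path is blocked if a circle covers either corner, or if a chain of overlapping circles links the top or left edge to the right or bottom edge. Two circles are only linked when a point of their overlap lies strictly inside the rectangle. The function and helper names are hypothetical, and integer inputs are assumed so that all comparisons stay exact.

```python
def can_reach_corner(x_corner: int, y_corner: int, circles: list[list[int]]) -> bool:
    def covers(cx, cy, r, px, py):
        # True if point (px, py) lies inside or on the circle.
        return (cx - px) ** 2 + (cy - py) ** 2 <= r * r

    def touches_top_or_left(cx, cy, r):
        return ((cx <= x_corner and abs(cy - y_corner) <= r)
                or (cy <= y_corner and cx <= r)
                or (cy > y_corner and covers(cx, cy, r, 0, y_corner)))

    def touches_right_or_bottom(cx, cy, r):
        return ((cy <= y_corner and abs(cx - x_corner) <= r)
                or (cx <= x_corner and cy <= r)
                or (cx > x_corner and covers(cx, cy, r, x_corner, 0)))

    # A circle covering the start or the goal blocks every path.
    for cx, cy, r in circles:
        if covers(cx, cy, r, 0, 0) or covers(cx, cy, r, x_corner, y_corner):
            return False

    visited = [False] * len(circles)
    for start in range(len(circles)):
        if visited[start] or not touches_top_or_left(*circles[start]):
            continue
        visited[start] = True
        stack = [start]  # iterative DFS over linked circles
        while stack:
            x1, y1, r1 = circles[stack.pop()]
            if touches_right_or_bottom(x1, y1, r1):
                return False  # a wall of circles separates the two corners
            for j, (x2, y2, r2) in enumerate(circles):
                if visited[j]:
                    continue
                overlap = (x1 - x2) ** 2 + (y1 - y2) ** 2 <= (r1 + r2) ** 2
                # Point on the centre line shared by both circles, scaled by
                # (r1 + r2) to avoid division; it must be strictly inside.
                inside = (x1 * r2 + x2 * r1 < (r1 + r2) * x_corner
                          and y1 * r2 + y2 * r1 < (r1 + r2) * y_corner)
                if overlap and inside:
                    visited[j] = True
                    stack.append(j)
    return True


print(can_reach_corner(3, 4, [[2, 1, 1]]))
```

A benchmark harness would run a function like this against hidden test cases and count the task as solved only if every case passes.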
What to watch out for
- The time window matters. Comparisons that do not state which tranche (v4/v5/v6) was used are misleading. The same model can, for example, score 92% on v5 and 82% on v6.
- Contest style. Tasks are short, puzzle-like algorithmic questions. They are not representative of production software work, which involves multiple files, dependencies and legacy code.
- Contamination window. Over time, tasks still creep into training data. LiveCodeBench therefore has to be refreshed continuously.