Code

SWE-bench Verified

Can the model fix real bugs from open-source GitHub projects?

Top models

#   Model                              Provider   Score  Date
1   Claude Mythos Preview (reasoning)  Anthropic  93.9%  2026-04
2   Claude Opus 4.7 (reasoning)        Anthropic  87.6%  2026-04
3   Claude Opus 4.5 (reasoning)        Anthropic  80.9%  2026-03
4   Claude Opus 4.6 (reasoning)        Anthropic  80.8%  2026-03
5   Gemini 3.1 Pro (reasoning)         Google     80.6%  2026-04
6   Claude Fable 5                     Anthropic  80.3%  2026-06-09
7   Claude Mythos 5                    Anthropic  80.3%  2026-06-09
8   MiniMax M2.5                       MiniMax    80.2%  2026-03
9   GPT-5.2 (reasoning)                OpenAI     80%    2025-12-11
10  Claude Sonnet 4.6 (reasoning)      Anthropic  79.6%  2026-02

What does it measure?

SWE-bench Verified tests whether a model can independently resolve real GitHub issues by editing the codebase. It uses 500 human-validated problems from popular Python projects (Django, Flask, scikit-learn, SymPy and others), drawn from the original SWE-bench dataset.

The model gets the full repo, the issue text and a starting point in the git history. It has to change files so that a predefined set of tests passes. No multiple choice, no quiz, just: does the fix work or not.
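As a rough illustration of that setup, here is a minimal Python sketch of one task. The class name, the field names, the placeholder commit and the test ID are hypothetical and only mirror the description above. The sketch also assumes the grading tests are held back from the model, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """One benchmark problem (hypothetical field names, for illustration)."""
    repo: str                        # e.g. "django/django"
    base_commit: str                 # starting point in the git history
    problem_statement: str           # the issue text
    tests_to_pass: tuple[str, ...]   # predefined tests used for grading


def model_inputs(task: TaskInstance) -> dict[str, str]:
    """Return only what the model works from: repo, commit and issue text.

    Assumption: the grading tests are not part of the model's input.
    """
    return {
        "repo": task.repo,
        "base_commit": task.base_commit,
        "issue": task.problem_statement,
    }


# Placeholder values; the commit hash and test ID are not real.
task = TaskInstance(
    repo="django/django",
    base_commit="<commit-hash>",
    problem_statement="Incorrect queryset result when using Subquery with exclude()",
    tests_to_pass=("tests.queries.test_example",),
)
inputs = model_inputs(task)
```

Keeping the tests in a separate field makes the split visible: the model edits files from the issue text alone, and the tests only decide afterwards whether the fix counts.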

How to read the score

Scores are the percentage of issues solved correctly. Three reference points help in reading them:

  • Random guessing: not meaningful (it is not a quiz).
  • Human baseline: the original pull request that fixed the bug counts as 100% by definition; it is the ground truth.
  • Current top: most leading models cluster around 80%, and the very top exceeds 90%.

Above 70% is strong. Below 50% means a model often fails on real software tasks.
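The arithmetic behind a score can be sketched in a few lines of Python. The function names and the example run of 400 solved problems are made up for illustration. The two thresholds come from the text above, which gives no label for the 50-70% range.

```python
def resolved_rate(outcomes: list[bool]) -> float:
    """Percentage of issues whose fix passed the predefined tests."""
    if not outcomes:
        raise ValueError("need at least one graded issue")
    return 100.0 * sum(outcomes) / len(outcomes)


def reading(score: float) -> str:
    """Map a score to the rough bands given in the text."""
    if score > 70.0:
        return "strong"
    if score < 50.0:
        return "often fails on real software tasks"
    return "in between"  # the text gives no label for 50-70%


# Hypothetical run: 400 of the 500 problems resolved.
outcomes = [True] * 400 + [False] * 100
score = resolved_rate(outcomes)
print(f"{score:.1f}% -> {reading(score)}")  # prints "80.0% -> strong"
```

With 500 problems, each solved issue moves the score by 0.2 percentage points.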

Example task

Example issue (translated from the dataset):

Repo: django/django. Issue: "Incorrect queryset result when using Subquery with exclude()". The ORM layer does not filter out all expected rows when a subquery is combined with exclude(). Fix the bug so the supplied tests pass.

The model gets the repo at a specific commit hash before the fix. It has to identify the right files, find the cause and write a patch. Success means the linked tests, which fail before the fix, pass afterwards.
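The success rule can be written as a small check. This is a sketch under stated assumptions, not the benchmark's actual harness: the function name, the dictionaries of test results and the test IDs are hypothetical.

```python
def is_resolved(
    before: dict[str, bool],
    after: dict[str, bool],
    linked_tests: list[str],
) -> bool:
    """True if every linked test failed before the patch and passes after it.

    `before` and `after` map a test ID to whether it passed. A test that
    is missing from a result set is treated as not passing.
    """
    if not linked_tests:
        raise ValueError("a task needs at least one linked test")
    failed_first = all(not before.get(t, False) for t in linked_tests)
    pass_after = all(after.get(t, False) for t in linked_tests)
    return failed_first and pass_after


# Hypothetical test IDs for the Django example above.
linked = ["queries.test_exclude_subquery", "queries.test_exclude_nested"]
before_patch = {t: False for t in linked}
good_patch = {t: True for t in linked}
partial_patch = {linked[0]: True, linked[1]: False}

print(is_resolved(before_patch, good_patch, linked))     # True
print(is_resolved(before_patch, partial_patch, linked))  # False
```

A partial fix earns nothing: one linked test still failing means the issue counts as unresolved. A fuller harness would typically also check that tests passing before the patch still pass afterwards; that is left out here.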

What to watch out for

  • Agentic vs. plain model. Many top scores use an agent system with tool use, multiple rounds and retries. That is not an apples-to-apples comparison with single-shot model scores.
  • Saturation. The gap between top models is shrinking; the community is already moving to SWE-bench Pro.
  • Self-reported scores. Preview models are often scored by their own makers; wait for independent replication (e.g. Epoch AI or Artificial Analysis).
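One way to see why small gaps near the top deserve caution is to estimate the sampling noise of a score on 500 problems. The sketch below uses the standard binomial standard error and treats each problem as an independent pass/fail trial, which is a simplification. The function name is hypothetical.

```python
import math


def standard_error_pp(score_percent: float, n_problems: int = 500) -> float:
    """Binomial standard error of a score, in percentage points.

    Simplifying assumption: every problem is an independent trial with
    the same success probability.
    """
    p = score_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_problems)


se = standard_error_pp(80.0)
print(f"standard error at 80%: about {se:.2f} percentage points")
```

At a score of 80% this comes to roughly 1.8 percentage points, so differences of a few tenths of a point between models are smaller than the noise of a single run under this assumption.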

