Top models

#	Model	Provider	Score	Date
1	Claude Mythos Preview reasoning	Anthropic	93.9%	2026-04
2	Claude Opus 4.7 reasoning	Anthropic	87.6%	2026-04
3	Claude Opus 4.5 reasoning	Anthropic	80.9%	2026-03
4	Claude Opus 4.6 reasoning	Anthropic	80.8%	2026-03
5	Gemini 3.1 Pro reasoning	Google	80.6%	2026-04
6	Claude Fable 5	Anthropic	80.3%	2026-06-09
7	Claude Mythos 5	Anthropic	80.3%	2026-06-09
8	MiniMax M2.5	MiniMax	80.2%	2026-03
9	GPT-5.2 reasoning	OpenAI	80%	2025-12-11
10	Claude Sonnet 4.6 reasoning	Anthropic	79.6%	2026-02

What does it measure?

SWE-bench Verified tests whether a model can independently resolve real GitHub issues by editing the codebase. It uses 500 human-validated problems from popular Python projects (Django, Flask, scikit-learn, SymPy and others), drawn from the original SWE-bench dataset.

The model gets the full repo, the issue text and a starting point in the git history. It has to change files so that a predefined set of tests passes. No multiple choice, no quiz, just: does the fix work or not.

How to read the score

Scores are the percentage of issues solved correctly. Three reference points help put a number in context:

Random guessing: not meaningful (it is not a quiz).
Human baseline: 100% by definition, because the original pull request that fixed each bug is the ground truth.
Current top: the leading entry in the table above scores 93.9%, the runner-up 87.6%, and the rest of the top ten cluster around 80%.

Above 70% is strong. Below 50% means a model often fails on real software tasks.
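
As a minimal sketch of the arithmetic behind a score, the snippet below computes the percentage of resolved issues from a list of per-issue outcomes. The function name `solve_rate` and the 404-of-500 run are hypothetical and chosen only for illustration; they are not taken from any published result.

```python
def solve_rate(resolved: list[bool]) -> float:
    """Return the percentage of issues resolved, rounded to one decimal place."""
    if not resolved:
        raise ValueError("need at least one result")
    return round(100 * sum(resolved) / len(resolved), 1)


# Hypothetical run: 404 of the 500 issues were resolved.
results = [True] * 404 + [False] * 96
print(solve_rate(results))  # 80.8
```

With 500 problems, each issue is worth 0.2 percentage points, so a gap of a few tenths between two models corresponds to only one or two issues.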

Example task

Example issue (translated from the dataset):

Repo: django/django. Issue: "Incorrect queryset result when using Subquery with exclude()". The ORM layer does not filter out all expected rows when a subquery is combined with exclude(). Fix the bug so the supplied tests pass.

The model gets the repo at a specific commit hash before the fix. It has to identify the right files, find the cause and write a patch. Success means the linked tests, which fail before the patch, pass afterwards.
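
The success criterion described above can be sketched as a small check over test outcomes before and after the patch. This is an illustration under stated assumptions, not the benchmark's actual harness: the function `is_resolved`, the dictionary layout and the test names are all hypothetical.

```python
def is_resolved(before: dict[str, bool], after: dict[str, bool]) -> bool:
    """Return True if every test that failed before the patch passes after it.

    Both dictionaries map a test name to True (passed) or False (failed).
    """
    failing_first = [name for name, passed in before.items() if not passed]
    if not failing_first:
        # No test failed at the starting commit, so there is nothing to fix.
        return False
    # A test missing from the post-patch run counts as a failure.
    return all(after.get(name, False) for name in failing_first)


# Hypothetical test names for the Django example.
before = {"test_exclude_subquery": False, "test_basic_filter": True}
after_good = {"test_exclude_subquery": True, "test_basic_filter": True}
after_bad = {"test_exclude_subquery": False, "test_basic_filter": True}

print(is_resolved(before, after_good))  # True
print(is_resolved(before, after_bad))   # False
```

The outcome is binary per issue: a patch that fixes only some of the initially failing tests counts as unresolved.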

What to watch out for

Agentic vs. plain model. Many top scores use an agent system with tool use, multiple rounds and retries. That is not an apples-to-apples comparison with single-shot model scores.
Saturation. The gap between top models is shrinking: ranks 3 to 10 in the table above sit within 1.3 points of each other (79.6% to 80.9%). The community is already moving to SWE-bench Pro.
Self-reported scores. Preview models are often scored by their own makers; wait for independent replication (e.g. Epoch AI or Artificial Analysis).
