SWE-bench Verified
Can the model fix real bugs from open-source GitHub projects?
Top models
| # | Model | Provider | Score | Date |
|---|---|---|---|---|
| 1 | Claude Mythos Preview reasoning | Anthropic | 93.9% | 2026-04 |
| 2 | Claude Opus 4.7 reasoning | Anthropic | 87.6% | 2026-04 |
| 3 | Claude Opus 4.5 reasoning | Anthropic | 80.9% | 2026-03 |
| 4 | Claude Opus 4.6 reasoning | Anthropic | 80.8% | 2026-03 |
| 5 | Gemini 3.1 Pro reasoning | Google | 80.6% | 2026-04 |
| 6 | Claude Fable 5 | Anthropic | 80.3% | 2026-06-09 |
| 7 | Claude Mythos 5 | Anthropic | 80.3% | 2026-06-09 |
| 8 | MiniMax M2.5 | MiniMax | 80.2% | 2026-03 |
| 9 | GPT-5.2 reasoning | OpenAI | 80% | 2025-12-11 |
| 10 | Claude Sonnet 4.6 reasoning | Anthropic | 79.6% | 2026-02 |
What does it measure?
SWE-bench Verified tests whether a model can independently resolve real GitHub issues by editing the codebase. It uses 500 human-validated problems from popular Python projects (Django, Flask, scikit-learn, SymPy and others), drawn from the original SWE-bench dataset.
The model gets the full repo, the issue text and a starting point in the git history. It has to change files so that a predefined set of tests passes. No multiple choice, no quiz, just: does the fix work or not.
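The pass/fail rule can be sketched as below. The split into tests that must flip from failing to passing and tests that must keep passing follows the FAIL_TO_PASS and PASS_TO_PASS lists in the SWE-bench dataset. The function name, the dictionary format and the test names are assumptions for illustration, not the official evaluation harness.

```python
from typing import Mapping, Sequence


def is_resolved(
    test_status: Mapping[str, bool],
    fail_to_pass: Sequence[str],
    pass_to_pass: Sequence[str],
) -> bool:
    """Return True only if every required test passes after the model's patch.

    test_status maps a test name to True (passed) or False (failed).
    A required test that is missing from test_status counts as failed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_status.get(name, False) for name in required)


# Hypothetical test names, for illustration only.
status_after_patch = {
    "test_exclude_subquery": True,  # failed before the fix, passes now
    "test_filter_basic": True,      # passed before, still passes
    "test_annotate_basic": False,   # passed before, broken by the patch
}

# The target test passes and nothing required is broken: resolved.
ok = is_resolved(
    status_after_patch,
    fail_to_pass=["test_exclude_subquery"],
    pass_to_pass=["test_filter_basic"],
)

# The patch fixes the bug but breaks a previously passing test: not resolved.
regressed = is_resolved(
    status_after_patch,
    fail_to_pass=["test_exclude_subquery"],
    pass_to_pass=["test_filter_basic", "test_annotate_basic"],
)

print(ok, regressed)
```

There is no partial credit in this sketch: a patch that fixes the reported bug but breaks another required test counts as a failure.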
How to read the score
The score is the percentage of issues resolved. Three reference points help interpret it:
- Random guessing: not meaningful (it is not a quiz).
- Human baseline: the original pull request that fixed the bug counts as 100% by definition, since it is the ground truth.
- Current top: the ten best models listed above score roughly 80% or more, and the leader is above 90%.
Above 70% is strong. Below 50% means a model often fails on real software tasks.
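As a worked example of the arithmetic: with 500 problems, each resolved issue is worth 0.2 percentage points. The sketch below assumes a single run over the full set; the function name is hypothetical.

```python
def resolved_rate(resolved: int, total: int = 500) -> float:
    """Percentage of issues resolved, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return round(100 * resolved / total, 1)


print(resolved_rate(404))  # 80.8
print(resolved_rate(350))  # 70.0
print(resolved_rate(1))    # 0.2
```

Under that assumption, a 0.1-point difference between two entries is smaller than one issue, so very small gaps on the leaderboard should not be over-interpreted.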
Example task
Example issue (translated from the dataset):
Repo: django/django. "Incorrect queryset result when using Subquery with exclude()". The ORM layer does not filter out all expected rows when a subquery is combined with exclude. Fix the bug so the supplied tests pass.
The model gets the repo at a specific commit hash before the fix. It has to identify the right files, find the cause and write a patch. Success means that the linked tests, which failed before the patch, pass afterwards.
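One way to picture such a task is as a small record, split into what the model sees and what the harness holds back. This is a sketch under assumptions: the field names loosely mirror SWE-bench dataset columns, the commit and test name are placeholders rather than real values, and `model_inputs` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    repo: str                           # GitHub repository, e.g. "django/django"
    base_commit: str                    # commit hash before the fix
    problem_statement: str              # issue text shown to the model
    fail_to_pass: tuple[str, ...] = ()  # linked tests, held back by the harness


def model_inputs(task: TaskInstance) -> dict[str, str]:
    """Return only the fields the model is given; the tests stay hidden."""
    return {
        "repo": task.repo,
        "base_commit": task.base_commit,
        "problem_statement": task.problem_statement,
    }


# Placeholder values: the real commit hash and test names are not given here.
task = TaskInstance(
    repo="django/django",
    base_commit="<commit before the fix>",
    problem_statement="Incorrect queryset result when using Subquery with exclude()",
    fail_to_pass=("<linked test>",),
)

visible = model_inputs(task)
print(sorted(visible))
```

Keeping the linked tests out of the model's input means the patch has to come from understanding the issue and the code, not from reading the expected assertions.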
What to watch out for
- Agentic vs. plain model. Many top scores use an agent system with tool use, multiple rounds and retries. That is not an apples-to-apples comparison with single-shot model scores.
- Saturation. The gap between top models is shrinking; the community is already moving to SWE-bench Pro.
- Self-reported scores. Preview models are often scored by their own makers; wait for independent replication (e.g. Epoch AI or Artificial Analysis).