What is SWE-Bench Pro? The Benchmark That Tests Real Coding Agents

Scale AI's benchmark for coding agents: 1,865 long-horizon, multi-file tasks drawn from 41 real repositories, with top agents scoring 50-59%.

February 23, 2026 · 3 min read

SWE-Bench Pro is Scale AI's benchmark for evaluating coding agents on long-horizon, multi-file software engineering tasks from real repositories. Unlike SWE-Bench Verified's 500 Python-only tasks with median 4-line fixes, Pro demands an average of 107 lines across 4.1 files -- closer to what professional engineers actually do. Top agents solve around 50-59% of tasks, down from 70%+ on Verified.

1,865 tasks across 41 repositories · 107 avg lines changed per task · 4.1 avg files modified per task · ~59% top agent score (public set)

How SWE-Bench Pro Works

SWE-Bench Pro contains 1,865 tasks across 41 actively maintained repositories spanning Python, Go, TypeScript, and JavaScript. The tasks come from real commit histories: consecutive commits where one resolves a bug or adds a feature, paired with tests that demonstrate the fix.

Three Subsets

Public Set

731 tasks from 11 GPL-licensed repositories, openly available on HuggingFace. This is the primary evaluation target.

Commercial Set

276 tasks from 18 proprietary startup codebases, sourced through Scale AI's partnerships with those companies. Not publicly accessible.

Held-Out Set

858 tasks from 12 repositories, reserved for future overfitting detection. Scale AI can release these to check if improvements generalize.

Three-Stage Human Augmentation

Each task goes through a rigorous annotation process:

  1. Problem statement creation -- original commit messages and issue discussions are synthesized into clear, structured descriptions
  2. Requirements definition -- annotators create specification lists grounded in unit tests and gold patches, detailing expected behavior without prescribing implementation
  3. Interface specification -- class and function signatures are documented to prevent false negatives from naming mismatches
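A task record that carries all three annotation stages might look like the sketch below. The field names are illustrative, modeled on the SWE-Bench family's fail2pass/pass2pass convention, and are not the official schema:

```python
# Illustrative sketch of a SWE-Bench Pro task record after the three
# annotation stages. Field names are hypothetical, not the official schema.
task = {
    "instance_id": "example_repo-1234",
    # Stage 1: synthesized problem statement
    "problem_statement": "Retries ignore the configured backoff policy ...",
    # Stage 2: behavioral requirements grounded in tests and the gold patch
    "requirements": [
        "Retry delays must follow exponential backoff",
        "Single-attempt behavior must be unchanged",
    ],
    # Stage 3: interface spec to prevent naming-mismatch false negatives
    "interface": ["def compute_backoff(attempt: int) -> float"],
    "fail_to_pass": ["tests/test_retry.py::test_backoff"],
    "pass_to_pass": ["tests/test_retry.py::test_single_attempt"],
}

def is_well_formed(t: dict) -> bool:
    """Check a record carries all three annotation stages plus both test sets."""
    required = {"problem_statement", "requirements", "interface",
                "fail_to_pass", "pass_to_pass"}
    return required <= t.keys() and all(t[k] for k in required)
```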

Evaluation methodology

Evaluation uses containerized, language-specific environments. Each task must pass "fail2pass" tests (verifying the issue is resolved) and "pass2pass" tests (ensuring existing functionality isn't broken). Gold patches are validated across 3 test runs before inclusion.
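The grading rule reduces to a simple predicate. A minimal sketch, assuming test results arrive as name-to-passed mappings (this is not the actual harness code):

```python
def grade(pre_patch, post_patch, fail_to_pass, pass_to_pass):
    """A task is resolved only if every fail2pass test now passes and
    every pass2pass test still passes after the candidate patch.

    pre_patch / post_patch: dicts mapping test name -> bool (passed).
    """
    # fail2pass: must fail before the patch and pass after it
    fixed = all(not pre_patch.get(t, False) and post_patch.get(t, False)
                for t in fail_to_pass)
    # pass2pass: must keep passing, i.e. the patch introduces no regressions
    preserved = all(post_patch.get(t, False) for t in pass_to_pass)
    return fixed and preserved
```

A patch that fixes the target test but breaks an existing one scores zero; partial credit does not exist in this scheme.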

SWE-Bench Pro vs SWE-Bench Verified

SWE-Bench Verified was the previous gold standard -- a human-validated subset of 500 tasks from the original SWE-Bench. It served the community well, but it has limitations that Pro was designed to address.

| Dimension | SWE-Bench Verified | SWE-Bench Pro |
| --- | --- | --- |
| Tasks | 500 | 1,865 |
| Repositories | 12 (all Python) | 41 (Python, Go, TS, JS) |
| Avg lines changed | 11 (median: 4) | 107.4 |
| Avg files changed | ~1 | 4.1 |
| Top score (Feb 2026) | ~72% | ~59% |
| Contamination resistance | Low -- all public repos | High -- GPL + proprietary code |
| Task clarity | Ambiguous issues removed | Ambiguous issues clarified with human context |

The difference in task complexity is stark. 161 of SWE-Bench Verified's 500 tasks require only 1-2 lines of change. Every SWE-Bench Pro task requires at least 10 lines. Over 100 tasks require more than 100 lines. These are tasks that would take a professional engineer hours to days -- not minutes.

Contamination confirmed

OpenAI's own audit found that every frontier model tested -- GPT-5.2, Claude Opus 4.5, Gemini 3 Flash -- could reproduce verbatim gold patches or problem statement specifics for certain SWE-Bench Verified tasks. They also found that 59.4% of the hardest unsolved problems had flawed test cases. OpenAI has stopped reporting Verified scores and recommends SWE-Bench Pro instead.

SWE-Bench Pro Leaderboard

The leaderboard has two tiers: the standard SEAL leaderboard (Scale AI's unified scaffolding with a 250-turn limit) and results from specialized agent systems that bring their own scaffolding.

Agent Systems (Custom Scaffolding)

These are the highest-performing configurations on the public set:

| Agent | Base Model | Score |
| --- | --- | --- |
| Codex 5.3 (CLI) + WarpGrep v2 | Codex 5.3 | 59.1% |
| MiniMax 2.5 + WarpGrep v2 | MiniMax 2.5 | 57.6% |
| Opus 4.6 + WarpGrep v2 | Opus 4.6 | 57.5% |
| Auggie | Opus 4.5 | 51.8% |
| Cursor | Opus 4.5 | 50.2% |
| Claude Code | Opus 4.5 | 49.8% |

SEAL Leaderboard (Unified Scaffolding)

Scale AI's standardized evaluation with identical tooling for all models:

| Rank | Model | Score |
| --- | --- | --- |
| 1 | Claude Opus 4.5 | 45.9% |
| 2 | Claude Sonnet 4.5 | 43.6% |
| 3 | Gemini 3 Pro | 43.3% |
| 4 | Claude Sonnet 4 | 42.7% |
| 5 | GPT-5 (High) | 41.8% |
| 6 | GPT-5.2 Codex | 41.0% |
| 7 | Claude Haiku 4.5 | 39.5% |
| 8 | Qwen3 Coder 480B | 38.7% |
| 9 | MiniMax 2.1 | 36.8% |

The gap between the SEAL leaderboard and agent systems is instructive. Claude Opus 4.5 scores 45.9% with generic scaffolding but 57.5% when paired with WarpGrep v2 as a search subagent. The model is the same -- the difference is how it retrieves context.

Why Scores Are So Much Lower Than SWE-Bench Verified

The drop from 72% (Verified) to ~50-59% (Pro) isn't just harder tasks. It's a fundamentally different challenge.

Multi-File Modifications

SWE-Bench Verified is largely a single-file benchmark. Most fixes touch one file with a few lines changed. SWE-Bench Pro tasks require coordinating changes across an average of 4.1 files. The agent needs to understand how a change in one file affects behavior in three others.

Longer Time Horizons

These aren't 5-minute fixes. They're tasks that would take a professional engineer hours. The agent must maintain coherent plans across many steps, managing context and state throughout.

Codebase Complexity

Pro repositories are production systems -- business applications, B2B services, developer tools. They have complex build systems, cross-cutting concerns, and domain-specific conventions that an agent must navigate.

Contamination Resistance

Models can't rely on having seen the code before. The GPL licensing and proprietary repos mean agents must genuinely reason about unfamiliar codebases.

Failure mode analysis

Scale AI's analysis of agent trajectories reveals where models break down: semantic understanding failures (35.9% of Opus 4.1 failures), context overflow (35.6% of Sonnet 4 failures), and tool-use inefficiency (42% of smaller model failures). The context overflow finding aligns with research showing coding agents spend 60%+ of their time searching for context.

The Search Subagent Effect

One pattern in the leaderboard deserves attention: WarpGrep v2 lifts every model it's paired with by 2-3 points.

| Model | Without WarpGrep | With WarpGrep v2 | Delta |
| --- | --- | --- | --- |
| Codex 5.3 (CLI) | 56.0% | 59.1% | +3.1 |
| MiniMax 2.5 | 55.4% | 57.6% | +2.2 |
| Opus 4.6 | 55.4% | 57.5% | +2.1 |
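The deltas follow directly from the paired scores; a quick recomputation (scores copied from the leaderboard figures, in percentage points):

```python
# Public-set scores without and with the WarpGrep v2 search subagent,
# as reported on the SWE-Bench Pro leaderboard (percent).
scores = {
    "Codex 5.3 (CLI)": (56.0, 59.1),
    "MiniMax 2.5": (55.4, 57.6),
    "Opus 4.6": (55.4, 57.5),
}

# Delta in percentage points, rounded to one decimal place
deltas = {model: round(with_wg - without_wg, 1)
          for model, (without_wg, with_wg) in scores.items()}
```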

This isn't surprising given the failure modes above. If context overflow causes 35% of failures and semantic understanding failures cause another 36%, then a system that delivers cleaner, more precise context should help on both fronts.

WarpGrep v2 is an RL-trained search subagent that runs in its own context window, issues up to 8 parallel tool calls per turn, and returns only the relevant file spans. The main coding model never sees the files WarpGrep rejected. Its context stays clean -- which is exactly what SWE-Bench Pro's multi-file, long-horizon tasks demand.
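The shape of that pattern is easy to sketch. The following is a minimal illustration of a parallel search subagent that filters context before the main model sees it; the `search_fn` stub and all names are hypothetical, since WarpGrep's RL-trained implementation is not public:

```python
from concurrent.futures import ThreadPoolExecutor

def search_subagent(query, files, search_fn, max_parallel=8):
    """Run up to `max_parallel` searches concurrently and return only
    the relevant spans. Files that `search_fn` rejects (returns None)
    never reach the main coding model's context window."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves input order, so results line up with files
        results = list(pool.map(lambda f: search_fn(query, f), files))
    return [span for span in results if span is not None]
```

The key design choice is the hard boundary: the subagent's own context absorbs the cost of reading and rejecting files, while the main model receives only the surviving spans.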

The cost story reinforces this. On SWE-Bench Pro tasks with Opus 4.6, adding WarpGrep v2 makes the system 15.6% cheaper and 28% faster -- counterintuitive until you realize the expensive model spends less time doing its own search and generates fewer wasted tokens.

What SWE-Bench Pro Gets Right

SWE-Bench Pro isn't perfect, but it addresses the most pressing criticisms of earlier benchmarks:

Structural Contamination Resistance

Rather than hoping models haven't seen the data, Pro uses GPL licensing and proprietary access controls that make contamination unlikely by construction.

Real-World Task Complexity

107 lines across 4 files is within the range of a normal pull request. The benchmark tests whether agents can do the work engineers actually do.

Multi-Language Coverage

Python-only benchmarks miss failure modes in statically typed languages and different tooling ecosystems. Pro covers Python, Go, TypeScript, and JavaScript.

Reproducible Evaluation

Containerized environments with full dependency resolution ensure consistent results. Three-stage test verification catches flaky tests before inclusion.

The held-out set is a smart design choice. As labs inevitably optimize for the public set, Scale AI can release held-out tasks to check whether improvements generalize.

Frequently Asked Questions

What is SWE-Bench Pro?

SWE-Bench Pro is a software engineering benchmark by Scale AI that evaluates AI coding agents on 1,865 long-horizon tasks from 41 real repositories. Tasks require an average of 107 lines of changes across 4 files, making it significantly harder than SWE-Bench Verified.

How does SWE-Bench Pro differ from SWE-Bench Verified?

SWE-Bench Verified has 500 Python-only tasks with small fixes (median 4 lines). SWE-Bench Pro has 1,865 multi-language tasks requiring substantial, multi-file modifications. Pro also uses GPL licensing and proprietary codebases to resist data contamination. OpenAI has stopped reporting Verified scores due to contamination concerns.

What is a good SWE-Bench Pro score?

As of February 2026, top agent systems score 55-59% on the public set. On Scale AI's standardized SEAL leaderboard (unified scaffolding), top models reach 43-46%. Scores above 50% require strong context retrieval, not just strong code generation.

Is SWE-Bench Verified still useful?

SWE-Bench Verified remains useful for quick iteration -- it's faster to run and still differentiates between weaker models. But OpenAI's audit found that all frontier models are contaminated on it, and 59.4% of hard tasks have flawed tests. OpenAI has stopped reporting Verified scores. Pro is a better measure of production readiness.

What is the benchmark for GPT-5 on SWE-Bench Pro?

GPT-5 (High) scores 41.8% on the SEAL leaderboard with standardized scaffolding. GPT-5.2 Codex scores 41.0%. With custom agent scaffolding that includes search subagents, scores can climb several points higher.

WarpGrep v2: #1 on SWE-Bench Pro

WarpGrep v2 is the RL-trained search subagent that lifted every model it was paired with by 2-3 points on SWE-Bench Pro. It runs in its own context window, issues up to 8 parallel tool calls per turn, and makes your coding agent 15.6% cheaper and 28% faster.