One-Line Summary: SWE-bench is the dominant agent benchmark for software engineering tasks, and harness leaderboards (top scores published by ruflo, Aider, Devin, Cursor, OpenHands) are how the harness-layer competition is now measured — a 2026 frontier harness scoring 80%+ on SWE-bench Verified is roughly a year-over-year doubling of capability.
Prerequisites: Agent benchmarks, harness vs. framework vs. SDK
What Is SWE-bench?
SWE-bench (Jimenez et al., 2023) is a benchmark of real GitHub issues with corresponding pull requests as ground truth. The agent reads a repository, reads the issue, and produces a patch. The patch is evaluated by running the project's test suite: the agent succeeds if the tests that were failing before the fix now pass and the tests that were already passing still pass.
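In code, the grading rule is simple. A minimal sketch, assuming per-test results have already been collected after applying the patch; the fail_to_pass / pass_to_pass lists mirror the benchmark's FAIL_TO_PASS and PASS_TO_PASS metadata:

```python
# Minimal sketch of SWE-bench's resolution rule for a single instance.
# Assumes the candidate patch has been applied and all relevant tests
# have been run, with results collected as {test_id: passed}.

def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Resolved = previously failing tests now pass AND previously
    passing tests still pass."""
    fixed = all(results.get(t, False) for t in fail_to_pass)
    not_broken = all(results.get(t, False) for t in pass_to_pass)
    return fixed and not_broken

# A patch that fixes the issue but breaks an unrelated test does not count.
results = {"test_bugfix": True, "test_existing_behavior": False}
print(is_resolved(results, ["test_bugfix"], ["test_existing_behavior"]))  # False
```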
The benchmark has variants. SWE-bench full is the original 2,294-issue corpus. SWE-bench Lite is a curated subset (~300 issues) that's easier to evaluate. SWE-bench Verified is human-curated for solvability and correctness (500 issues) — the version most often cited in 2026.
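To inspect the variants yourself, the instance sets are commonly distributed through Hugging Face. A quick sketch, assuming the princeton-nlp dataset names and the `datasets` library (a given harness may vendor its own copies instead):

```python
# Sketch: load the three SWE-bench variants and compare their sizes.
# Dataset names are an assumption based on the commonly published
# Hugging Face repos; verify against the version you intend to run.
from datasets import load_dataset

variants = {
    "full": "princeton-nlp/SWE-bench",
    "lite": "princeton-nlp/SWE-bench_Lite",
    "verified": "princeton-nlp/SWE-bench_Verified",
}

for name, repo in variants.items():
    ds = load_dataset(repo, split="test")
    print(f"{name}: {len(ds)} issues")  # e.g. verified -> 500
```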
SWE-bench is hard. State-of-the-art systems in 2025–2026 score 70–85% on SWE-bench Verified, up from sub-5% in early 2023. Scoring well requires the harness's full stack: file navigation, code editing, test running, multi-file coordination.
Why Harness Leaderboards Matter
Three reasons:
- Whole-system evaluation: SWE-bench scores measure the full harness, not just the model. A harness with good repo navigation, careful planning, and robust test running outperforms one with the same model but worse plumbing.
- Same-model comparability: Two harnesses running the same model (say, Claude Sonnet 4.6) with significantly different SWE-bench scores reveal that the harness is doing different work. The difference is plumbing.
- Vendor pressure: Public leaderboards force harness vendors to invest in measurable quality. The benchmarks aren't perfect, but they're a real forcing function.
May 2026 leaderboard standings (approximate, public claims):
| Harness / System | SWE-bench Verified | Notes |
|---|---|---|
| Devin (Cognition) | ~85% | Their internal eval; conditions vary |
| ruflo + Claude | ~84.8% | Their public claim |
| OpenHands | ~76% | Open-source baseline |
| Aider + Claude | ~74% | Lightweight harness, strong score |
| Claude Code (Anthropic) | ~72% | First-party; expected to climb |
| Cursor | not officially benchmarked | Different optimization target |
| Codex CLI | ~68% | OpenAI's public number |
How to Read the Leaderboard
A few cautions:
- Reproducibility: Some scores are hard to reproduce because of differences in harness configuration, model snapshots, and evaluation protocols.
- Verified-vs-Lite-vs-Full: A score is meaningless without knowing which variant it was measured on.
- Score gaming: Harnesses can over-fit to SWE-bench specifically. A 5-point lead on the benchmark may not translate to your codebase.
- Cost is rarely reported: A 2-point lead at 5× the cost is a poor trade.
- Latency is rarely reported: A 3-point lead at 10× the latency is a poor trade. A toy version of both trade-offs is sketched below.
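To make the cost and latency caveats concrete, here is a toy comparison that normalizes the claimed score by what each resolved issue costs in dollars and minutes. All numbers and harness names are made up for illustration:

```python
# Toy comparison: a small score lead can be a poor trade once cost and
# latency enter. All figures below are illustrative, not measured.

harnesses = {
    "harness_a": {"score": 0.76, "usd_per_issue": 1.50, "minutes_per_issue": 4},
    "harness_b": {"score": 0.78, "usd_per_issue": 7.50, "minutes_per_issue": 40},
}

for name, h in harnesses.items():
    print(f"{name}: {h['score']:.0%} resolved, "
          f"{h['score'] / h['usd_per_issue']:.3f} resolved per dollar, "
          f"{h['score'] / h['minutes_per_issue']:.3f} resolved per minute")

# harness_b's 2-point lead comes at 5x the cost and 10x the latency,
# so harness_a wins on both normalized measures.
```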
Why It Matters Beyond the Score
The leaderboard is a forcing function for harness investment in: better repo navigation, smarter planning, robust test execution, careful state management, multi-file coordination. Even harnesses that don't compete directly benefit because the techniques developed for benchmark scoring leak into the broader ecosystem within months.
Key Technical Details
- The benchmark constrains harness behavior: SWE-bench gives the harness a frozen repo + an issue. No internet, no human, no additional documentation (the per-instance inputs are sketched after this list).
- Test running is the evaluation: A patch counts as a fix only if the tests that were failing before it now pass and the previously passing tests still pass, so the harness's ability to run and interpret tests matters.
- Edit precision: Patches that touch unrelated files risk breaking previously passing tests, which is scored as a failure.
- Time budgets: Leaderboard runs typically cap per-issue time or model calls, and those budgets differ across submissions.
- Common failure modes: Wrong file, wrong function, hallucinated fix, partial fix, breaking other tests.
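These constraints show up directly in what each task hands the harness. A sketch of the per-instance fields, with names that follow the published dataset schema (exact fields can vary by variant and version, so treat this as illustrative):

```python
# Sketch of the per-instance inputs and grading metadata in SWE-bench.
# Field names follow the published dataset schema; check the exact schema
# of the variant/version you run against.
from dataclasses import dataclass, field

@dataclass
class SWEBenchInstance:
    instance_id: str        # e.g. "django__django-12345" (illustrative id)
    repo: str               # GitHub repository the issue belongs to
    base_commit: str        # frozen commit the harness starts from
    problem_statement: str  # the issue text; no extra docs, no internet
    FAIL_TO_PASS: list[str] = field(default_factory=list)  # tests the patch must make pass
    PASS_TO_PASS: list[str] = field(default_factory=list)  # tests the patch must not break

# The agent works only from the repo snapshot and problem_statement;
# FAIL_TO_PASS / PASS_TO_PASS are used by the evaluator, not shown to the agent.
```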
Connections to Other Concepts
- harness-cost-models.md — Cost dimension of the trade.
- the-75-percent-savings-claim.md — Quality side.
- claude-code-vs-codex-vs-cursor.md — Where benchmark scores feed into harness choice.
- ../../ai-agent-evaluation/02-benchmark-ecosystem/swe-bench.md — Foundational coverage.
Further Reading
- Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (2023).
- SWE-bench leaderboard at swebench.com.