One-Line Summary: SWE-bench is the dominant agent benchmark for software engineering tasks, and harness leaderboards (top scores published by ruflo, Aider, Devin, Cursor, OpenHands) are how the harness-layer competition is now measured — a 2026 frontier harness scoring 80%+ on SWE-bench Verified is roughly a year-over-year doubling of capability.
Prerequisites: Agent benchmarks, harness vs. framework vs. SDK
What Is SWE-bench?
SWE-bench (Jimenez et al., 2023) is a benchmark of real GitHub issues with corresponding pull requests as ground truth. The agent reads a repository, reads the issue, and produces a patch. The patch is evaluated by running the project's test suite: the agent succeeds if the tests that were failing before the fix now pass and the tests that were already passing still pass.
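In code, the grading rule is simple. A minimal sketch, assuming per-test results have already been collected after applying the patch; the fail_to_pass / pass_to_pass lists mirror the benchmark's FAIL_TO_PASS and PASS_TO_PASS metadata:

```python
# Minimal sketch of SWE-bench's resolution rule for a single instance.
# Assumes the candidate patch has been applied and all relevant tests
# have been run, with results collected as {test_id: passed}.

def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Resolved = previously failing tests now pass AND previously
    passing tests still pass."""
    fixed = all(results.get(t, False) for t in fail_to_pass)
    not_broken = all(results.get(t, False) for t in pass_to_pass)
    return fixed and not_broken

# A patch that fixes the issue but breaks an unrelated test does not count.
results = {"test_bugfix": True, "test_existing_behavior": False}
print(is_resolved(results, ["test_bugfix"], ["test_existing_behavior"]))  # False
```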
The benchmark has variants. SWE-bench full is the original 2,294-issue corpus. SWE-bench Lite is a curated subset (~300 issues) that's easier to evaluate. SWE-bench Verified is human-curated for solvability and correctness (500 issues) — the version most often cited in 2026.
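To inspect the variants yourself, the instance sets are commonly distributed through Hugging Face. A quick sketch, assuming the princeton-nlp dataset names and the `datasets` library (a given harness may vendor its own copies instead):

```python
# Sketch: load the three SWE-bench variants and compare their sizes.
# Dataset names are an assumption based on the commonly published
# Hugging Face repos; verify against the version you intend to run.
from datasets import load_dataset

variants = {
    "full": "princeton-nlp/SWE-bench",
    "lite": "princeton-nlp/SWE-bench_Lite",
    "verified": "princeton-nlp/SWE-bench_Verified",
}

for name, repo in variants.items():
    ds = load_dataset(repo, split="test")
    print(f"{name}: {len(ds)} issues")  # e.g. verified -> 500
```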
SWE-bench is hard. State-of-the-art systems in 2025–2026 score 70–85% on SWE-bench Verified, up from sub-5% in early 2023. Scoring well requires the harness's full stack: file navigation, code editing, test running, multi-file coordination.
Why Harness Leaderboards Matter
Three reasons:
- Whole-system evaluation: SWE-bench scores measure the full harness, not just the model. A harness with good repo navigation, careful planning, and robust test running outperforms one with the same model but worse plumbing.
- Same-model comparability: Two harnesses running the same model (say, Claude Sonnet 4.6) with significantly different SWE-bench scores reveal that the harness is doing different work. The difference is plumbing.
- Vendor pressure: Public leaderboards force harness vendors to invest in measurable quality. The benchmarks aren't perfect, but they're a real forcing function.
May 2026 leaderboard standings (approximate, public claims):
| Harness / System | SWE-bench Verified | Notes |
|---|---|---|
| Devin (Cognition) | ~85% | Their internal eval; conditions vary |
| ruflo + Claude | ~84.8% | Their public claim |
| OpenHands | ~76% | Open-source baseline |
| Aider + Claude | ~74% | Lightweight harness, strong score |
| Claude Code (Anthropic) | ~72% | First-party; expected to climb |
| Cursor | not officially benchmarked | Different optimization target |
| Codex CLI | ~68% | OpenAI's public number |
How to Read the Leaderboard
A few cautions:
- Reproducibility: Some scores are hard to reproduce because of differences in harness configuration, model snapshots, and evaluation protocols.
- Verified-vs-Lite-vs-Full: A score is meaningless without knowing which variant it was measured on.
- Score gaming: Harnesses can over-fit to SWE-bench specifically. A 5-point lead on the benchmark may not translate to your codebase.
- Cost is rarely reported: A 2-point lead at 5× the cost is a poor trade.
- Latency is rarely reported: A 3-point lead at 10× the latency is a poor trade. A toy version of both trade-offs is sketched below.
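To make the cost and latency caveats concrete, here is a toy comparison that normalizes the claimed score by what each resolved issue costs in dollars and minutes. All numbers and harness names are made up for illustration:

```python
# Toy comparison: a small score lead can be a poor trade once cost and
# latency enter. All figures below are illustrative, not measured.

harnesses = {
    "harness_a": {"score": 0.76, "usd_per_issue": 1.50, "minutes_per_issue": 4},
    "harness_b": {"score": 0.78, "usd_per_issue": 7.50, "minutes_per_issue": 40},
}

for name, h in harnesses.items():
    print(f"{name}: {h['score']:.0%} resolved, "
          f"{h['score'] / h['usd_per_issue']:.3f} resolved per dollar, "
          f"{h['score'] / h['minutes_per_issue']:.3f} resolved per minute")

# harness_b's 2-point lead comes at 5x the cost and 10x the latency,
# so harness_a wins on both normalized measures.
```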
Why It Matters Beyond the Score
The leaderboard is a forcing function for harness investment in: better repo navigation, smarter planning, robust test execution, careful state management, multi-file coordination. Even harnesses that don't compete directly benefit because the techniques developed for benchmark scoring leak into the broader ecosystem within months.
Key Technical Details
- The benchmark constrains harness behavior: SWE-bench gives the harness a frozen repo + an issue. No internet, no human, no additional documentation (the per-instance inputs are sketched after this list).
- Test running is the evaluation: A patch counts as a fix only if the tests that were failing before it now pass and the previously passing tests still pass, so the harness's ability to run and interpret tests matters.
- Edit precision: Patches that touch unrelated files risk breaking previously passing tests, which is scored as a failure.
- Time budgets: Leaderboard runs typically cap per-issue time or model calls, and those budgets differ across submissions.
- Common failure modes: Wrong file, wrong function, hallucinated fix, partial fix, breaking other tests.
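These constraints show up directly in what each task hands the harness. A sketch of the per-instance fields, with names that follow the published dataset schema (exact fields can vary by variant and version, so treat this as illustrative):

```python
# Sketch of the per-instance inputs and grading metadata in SWE-bench.
# Field names follow the published dataset schema; check the exact schema
# of the variant/version you run against.
from dataclasses import dataclass, field

@dataclass
class SWEBenchInstance:
    instance_id: str        # e.g. "django__django-12345" (illustrative id)
    repo: str               # GitHub repository the issue belongs to
    base_commit: str        # frozen commit the harness starts from
    problem_statement: str  # the issue text; no extra docs, no internet
    FAIL_TO_PASS: list[str] = field(default_factory=list)  # tests the patch must make pass
    PASS_TO_PASS: list[str] = field(default_factory=list)  # tests the patch must not break

# The agent works only from the repo snapshot and problem_statement;
# FAIL_TO_PASS / PASS_TO_PASS are used by the evaluator, not shown to the agent.
```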
Connections to Other Concepts
- harness-cost-models.md — Cost dimension of the trade.
- the-75-percent-savings-claim.md — Quality side.
- claude-code-vs-codex-vs-cursor.md — Where benchmark scores feed into harness choice.
- ../../ai-agent-evaluation/02-benchmark-ecosystem/swe-bench.md — Foundational coverage.
Further Reading
- Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (2023).
- SWE-bench leaderboard at swebench.com.