One-Line Summary: SWE-bench is the dominant agent benchmark for software engineering tasks, and harness leaderboards (top scores published by ruflo, Aider, Devin, Cursor, OpenHands) are how harness-layer competition is now measured; a 2026 frontier harness scoring 80%+ on SWE-bench Verified represents roughly a year-over-year doubling of capability.

Prerequisites: Agent benchmarks, harness vs. framework vs. SDK

What Is SWE-bench?

SWE-bench (Jimenez et al., 2023) is a benchmark of real GitHub issues paired with the pull requests that resolved them as ground truth. The agent reads a repository and its issue, then produces a patch. The patch is evaluated by running the project's test suite: the agent succeeds if the tests targeting the fix now pass and the previously passing tests still pass.
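
A minimal sketch of that grading flow, assuming a checked-out repository, a pytest-based suite, and test lists mirroring the benchmark's FAIL_TO_PASS/PASS_TO_PASS convention (the helper names here are illustrative, not the official harness API):

```python
import subprocess

def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    """Run the named tests with pytest; True iff they all pass."""
    result = subprocess.run(
        ["python", "-m", "pytest", *test_ids],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0

def grade_patch(repo_dir: str, patch: str,
                fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Apply the agent's patch, then check both test groups.

    fail_to_pass: tests that fail before the fix and must pass after.
    pass_to_pass: tests that already pass and must not regress.
    """
    applied = subprocess.run(
        ["git", "apply"], cwd=repo_dir, input=patch,
        capture_output=True, text=True,
    )
    if applied.returncode != 0:
        return False  # malformed patch: automatic failure
    return (run_tests(repo_dir, fail_to_pass)
            and run_tests(repo_dir, pass_to_pass))
```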

The benchmark has variants. SWE-bench Full is the original corpus of 2,294 issues. SWE-bench Lite is a curated subset of 300 issues that is cheaper and faster to evaluate. SWE-bench Verified is a 500-issue subset human-curated for solvability and correctness, and it is the version most often cited in 2026.
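
The variants are published as datasets, so inspecting an instance takes a few lines. A sketch assuming the Hugging Face datasets library and the princeton-nlp dataset names in use at the time of writing:

```python
from datasets import load_dataset

# Each instance carries the repository, base commit, issue text, and the
# gold patch plus the tests used for grading.
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(verified))                           # 500 issues
print(verified[0]["repo"])                     # e.g. "astropy/astropy"
print(verified[0]["problem_statement"][:200])  # the GitHub issue text
```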

SWE-bench is hard. State-of-the-art systems in 2025–2026 score 70–85% on SWE-bench Verified, up from single-digit resolution rates when the benchmark launched in 2023. Scoring well exercises the harness's full stack: file navigation, code editing, test running, and multi-file coordination.

Why Harness Leaderboards Matter

Three reasons:

  1. Whole-system evaluation: SWE-bench scores measure the full harness, not just the model. A harness with good repo navigation, careful planning, and robust test running outperforms one with the same model but worse plumbing.
  2. Model-agnostic comparability: When two harnesses run the same model (say, Claude Sonnet 4.6) yet post significantly different SWE-bench scores, the gap can only come from the harness. The difference is plumbing.
  3. Vendor pressure: Public leaderboards force harness vendors to invest in measurable quality. The benchmarks aren't perfect, but they're a real forcing function.

May 2026 leaderboard standings (approximate, public claims):

Harness / System           SWE-bench Verified           Notes
Devin (Cognition)          ~85%                         Their internal eval; conditions vary
ruflo + Claude             ~84.8%                       Their public claim
OpenHands                  ~76%                         Open-source baseline
Aider + Claude             ~74%                         Lightweight harness, strong score
Claude Code (Anthropic)    ~72%                         First-party; expected to climb
Cursor                     not officially benchmarked   Different optimization target
Codex CLI                  ~68%                         OpenAI's public number

How to Read the Leaderboard

A few cautions:

  • Reproducibility: Some scores are hard to reproduce because of undisclosed harness configuration, model snapshots, and evaluation protocols.
  • Verified-vs-Lite-vs-Full: A score is meaningless without knowing which variant.
  • Score gaming: Harnesses can overfit to SWE-bench specifically. A 5-point lead on the benchmark may not translate to your codebase.
  • Cost is rarely reported: A 2-point lead at 5× the cost is a poor trade.
  • Latency is rarely reported: Likewise, a 3-point lead at 10× the latency is a poor trade. The sketch after this list makes these trade-offs concrete.
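
A back-of-the-envelope way to make those trades visible: put score, cost, and latency side by side before picking a harness. The numbers below are hypothetical, not vendor figures:

```python
# Hypothetical entries: (name, Verified score %, $ per issue, minutes per issue).
candidates = [
    ("Harness A", 84.8, 5.00, 20.0),
    ("Harness B", 82.5, 1.00, 2.0),
]

def issues_per_dollar(score_pct: float, cost_per_issue: float) -> float:
    """Expected resolved issues per dollar spent."""
    return (score_pct / 100.0) / cost_per_issue

for name, score, cost, minutes in candidates:
    print(f"{name}: {score:.1f}% resolved, "
          f"{issues_per_dollar(score, cost):.2f} issues/$, "
          f"{minutes:.0f} min/issue")
# Harness A's 2.3-point lead costs 5x the money and 10x the time per
# issue; the leaderboard alone cannot surface that trade.
```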

Why It Matters Beyond the Score

The leaderboard is a forcing function for harness investment in: better repo navigation, smarter planning, robust test execution, careful state management, multi-file coordination. Even harnesses that don't compete directly benefit because the techniques developed for benchmark scoring leak into the broader ecosystem within months.

Key Technical Details

  • The benchmark constrains harness behavior: SWE-bench gives the harness a frozen repo + an issue. No internet, no human, no additional documentation.
  • Test running is the evaluation: A patch succeeds only if the tests targeting the fix now pass and the previously passing tests still pass, so the harness's ability to run and interpret tests matters.
  • Edit precision: Patches that touch unrelated files (collateral damage) risk breaking previously passing tests and lowering the score; see the sketch after this list.
  • Time budgets: Evaluations typically cap per-issue time or model calls.
  • Common failure modes: Wrong file, wrong function, hallucinated fix, partial fix, breaking other tests.
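
One way to catch collateral damage before a patch is graded is to compare its file footprint against where the fix should live. A minimal sketch: the '+++ b/<path>' parsing follows the standard unified-diff convention, everything else is illustrative:

```python
def touched_files(patch: str) -> set[str]:
    """Files modified by a unified diff, read from '+++ b/<path>' headers."""
    return {line[len("+++ b/"):] for line in patch.splitlines()
            if line.startswith("+++ b/")}

def collateral(patch: str, expected: set[str]) -> set[str]:
    """Files the patch touches that the fix should not need."""
    return touched_files(patch) - expected

# Example: a fix expected to stay inside one module.
patch = """\
--- a/pkg/core.py
+++ b/pkg/core.py
@@ -1 +1 @@
-x = 1
+x = 2
--- a/pkg/unrelated.py
+++ b/pkg/unrelated.py
@@ -1 +1 @@
-y = 1
+y = 2
"""
print(collateral(patch, {"pkg/core.py"}))  # {'pkg/unrelated.py'}
```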

Connections to Other Concepts

  • harness-cost-models.md — Cost dimension of the trade.
  • the-75-percent-savings-claim.md — Quality side.
  • claude-code-vs-codex-vs-cursor.md — Where benchmark scores feed into harness choice.
  • ../../ai-agent-evaluation/02-benchmark-ecosystem/swe-bench.md — Foundational coverage.

Further Reading

  • Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (2023).
  • SWE-bench leaderboard at swebench.com.