One-Line Summary: Ruflo's headline claim of "75% API cost savings vs. Claude Code direct" is plausible but conditional on workload: the savings come from prompt-caching discipline, multi-provider routing, parallel tool calls, and cheaper-model fallback. This concept audits the claim and shows where it does and doesn't hold.

Prerequisites: Harness cost models, prompt and context caching, model routing in harnesses

The Claim

Ruflo (May 2026 messaging) advertises ~75% cost reduction vs. running the same workload directly through Claude Code. This is a real claim made in their marketing and tutorial materials. The question for course purposes: is it true, and under what conditions?

What Drives the Savings

Stack-ranking the contributors to ruflo's savings (approximate, based on documented behaviors):

  1. Prompt caching discipline (~40 percentage points). Ruflo's prompt structure is engineered for high cache hit rates: long stable prefixes, careful breakpoints, explicit cache markers. Hit rates of 92%+ translate to savings of ~70% on the cached tokens.
  2. Model routing across providers (~20 points). Cheap models for routing/classification turns, mid-tier for routine work, frontier for hard reasoning; the lever is tempered by routing accuracy of ~89%.
  3. Parallel tool calls (~10 points). One model response can fire 4–6 tools simultaneously, amortizing the model call across multiple actions.
  4. Cheaper-model fallback (~5 points). When routing chooses a smaller model and it succeeds, no large-model call is needed.

The four contributions sum to the ~75% headline, which assumes they all apply on a workload that fits the routing assumptions. On workloads where most turns sit at the frontier of capability (hard reasoning every turn), savings drop substantially, because routing to cheaper models would degrade quality.
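The arithmetic behind the largest lever can be sanity-checked in a few lines. The hit rate (~92%) and per-token discount (~70%) are the figures above; the cacheable share of spend is a hypothetical number chosen so the lever lands near the documented ~40 points.

```python
def cache_lever(cached_share, hit_rate, read_discount):
    """Fraction of total spend removed by caching: the share of spend
    that is cacheable, times how often it actually hits, times the
    per-token discount on a hit."""
    return cached_share * hit_rate * read_discount

# ~92% hit rate and ~70% savings on cached tokens, with a hypothetical
# ~62% of spend sitting in the cacheable prefix:
lever = cache_lever(0.62, 0.92, 0.70)
print(f"caching lever ≈ {lever:.0%}")                         # ≈ 40%
print(f"all four levers ≈ {lever + 0.20 + 0.10 + 0.05:.0%}")  # ≈ 75%
```

If any input drops — a smaller cacheable share, a lower hit rate — the lever shrinks multiplicatively, which is why the "Where It Doesn't" cases below erode the headline so quickly.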

Where the Claim Holds

  • Repository-scale coding tasks with mixes of easy and hard subtasks.
  • Long sessions where cache amortization is large.
  • Workflows with many tool calls per turn (parallel execution helps).
  • Workloads with stable prefixes (cache hit rates stay high).

Where It Doesn't

  • Short ad-hoc sessions — cache write cost outweighs read benefit.
  • Workflows always at the frontier of model capability — routing to cheaper models loses quality.
  • Highly variable prompt structure — cache hit rates collapse.
  • Single-turn or few-turn tasks — not enough turns to amortize.
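The short-session failure mode is just pricing arithmetic. Under Anthropic's published prompt-caching rates at the time of writing (cache writes billed at 1.25× the base input price, cache reads at 0.1×), a cached prefix token must be reused at least once to pay back its write premium:

```python
def prefix_cost(reads, base=1.0, write_mult=1.25, read_mult=0.10):
    """Cost of one prefix token written to cache once and then read
    `reads` times, vs. resending it fresh on every turn."""
    cached = write_mult * base + reads * read_mult * base
    fresh = (1 + reads) * base
    return cached, fresh

print(prefix_cost(0))   # (1.25, 1.0) -> one-turn session: caching loses
c, f = prefix_cost(9)
print(f"10-turn session: cached {c:.2f} vs fresh {f:.2f}")  # 2.15 vs 10.00
```

One reuse already breaks even; the losing cases are single-turn sessions and prefixes that change before they are ever read back.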

Why It Matters

Two reasons. First, the claim is the kind of marketing number that drives stack decisions. Understanding what drives it lets you predict whether your workload benefits.

Second — more interesting — every team's harness can probably capture most of these savings without adopting ruflo. The four levers are: caching discipline (prompt structure), routing (configure per-turn model selection), parallel tools (allow multiple calls per response), and cheaper-model fallback (instrument retry-with-larger-model). Each is implementable in Claude Code, the SDK, or any framework.
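As a concrete (and entirely hypothetical) sketch of two of those levers — per-turn routing and retry-with-larger-model fallback — where the model tiers, prices, and the `attempt` callable stand in for whatever your own harness uses to classify turns and invoke a model:

```python
# Hypothetical $/Mtok input prices for three model tiers.
PRICES = {"small": 0.25, "mid": 3.00, "frontier": 15.00}

def route(difficulty: str) -> str:
    """Per-turn model selection: the routing lever."""
    return {"easy": "small", "routine": "mid", "hard": "frontier"}[difficulty]

def run_turn(difficulty, attempt):
    """Try the routed model first; escalate to frontier only on failure
    (the cheaper-model-fallback lever)."""
    model = route(difficulty)
    ok, cost = attempt(model)
    if not ok and model != "frontier":
        ok, retry_cost = attempt("frontier")
        cost += retry_cost
    return ok, cost
```

When routing accuracy is high, most turns never touch the frontier price; a misroute pays small + frontier instead of frontier alone, which is why accuracy discounts the routing lever rather than eliminating it.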

Practical Audit Approach

If you want to verify a 75% claim against your own workload:

  1. Baseline: Run a representative workload through your current setup. Record total cost and per-turn breakdown.
  2. Re-run: Run the same workload through the alternative harness; record the same metrics.
  3. Analyze: Where did the savings come from? Cache hits? Cheaper models? Tool batching?
  4. Generalize: Are the savings architectural (would apply to any of your workloads) or workload-specific?

A team that does this exercise once builds intuition that pays off across many decisions.

Connections to Other Concepts

  • harness-cost-models.md — Parent concept.
  • prompt-and-context-caching.md — Single largest contributor.
  • model-routing-in-harnesses.md — Second largest.
  • swe-bench-and-harness-leaderboards.md — Quality side of the trade-off.

Further Reading

  • ruvnet, ruflo economics documentation and CHANGELOG.
  • Anthropic, "Prompt Caching" pricing.