One-Line Summary: A harness's cost is dominated not by per-token model price but by how often it calls the model, how aggressively it caches the prefix, when it falls back to a cheaper model, and how many sub-agents it parallelizes — these are harness-level decisions, not model-level ones.

Prerequisites: What is an AI harness, agent loop, context window management, model routing

Status: Module 08 anchor concept — first draft.

What Is a Harness Cost Model?

The naive cost model is "tokens × price." It is wrong by an order of magnitude. The actual cost of a harness running a real workload is determined by:

  • how many turns the loop runs (a function of how decisively the agent terminates);
  • how much of the prefix gets cached (5–10× cheaper than a cold call on Claude);
  • whether sub-agents run in parallel (multiplying spend) or sequentially;
  • whether the cheaper model is acceptable for routine subtasks (Haiku at ~5% of Opus's price);
  • whether tool results bypass the model entirely when they can be summarized deterministically.
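The gap between the naive model and the harness-level model can be made concrete with a back-of-envelope calculation. All prices, cache rates, and the workload shape below are illustrative assumptions, not published figures:

```python
def turn_cost_usd(tokens: int, price_per_mtok: float,
                  cache_hit_rate: float = 0.0,
                  cache_discount: float = 1.0) -> float:
    """Cost of one model call: cached tokens are billed at the discounted rate."""
    cached = tokens * cache_hit_rate * price_per_mtok * cache_discount
    cold = tokens * (1.0 - cache_hit_rate) * price_per_mtok
    return (cached + cold) / 1_000_000

# Illustrative workload: 40 turns, 50k prompt tokens per turn.
TURNS, TOKENS = 40, 50_000
EXPENSIVE, CHEAP = 15.0, 0.75     # $/Mtok; CHEAP is ~5% of EXPENSIVE
HIT, DISCOUNT = 0.90, 0.10        # 90% of the prompt cached, at 10% of full price
CHEAP_FRACTION = 0.5              # half the turns routed to the cheap model

naive = TURNS * turn_cost_usd(TOKENS, EXPENSIVE)  # "tokens × price"
per_turn = (CHEAP_FRACTION * turn_cost_usd(TOKENS, CHEAP, HIT, DISCOUNT)
            + (1 - CHEAP_FRACTION) * turn_cost_usd(TOKENS, EXPENSIVE, HIT, DISCOUNT))
engineered = TURNS * per_turn
savings = 1 - engineered / naive  # ~90% under these assumed numbers
```

Under these particular assumptions the engineered harness spends about a tenth of the naive estimate; the point is not the exact figure but that caching, routing, and turn count each multiply together.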

Ruflo claims a 75% cost reduction over raw Claude Code on the same workloads. Whether that exact number holds depends on the workload, but the underlying point — that harness-level routing dwarfs model selection as a cost lever — is now well-established in 2026.

How It Works

A cost-aware harness instruments every model call with: a token estimate, a model choice (cheap → expensive escalation), a cache key for the prefix, and a budget check. When a turn would exceed budget, the harness can summarize the conversation, hand off to a smaller model, or refuse. Caching, in particular, is a discipline: the harness has to lay out the prompt so the cacheable prefix is genuinely stable across turns.

Why It Matters

Production agent costs are dominated by a few hot loops — the one workflow you run a thousand times a day. Any team running agents in production discovers within a year that harness-level cost engineering is where the leverage is.

How Harnesses & Frameworks Implement This

  • Claude Code: model selection per session, prompt caching, manual /compact
  • Claude Agent SDK: all of the above, programmatically
  • ruflo: routing across providers, prefix caching, model fallback, claim of ~75% reduction
  • LangGraph: DIY — instrument your own nodes
  • AutoGen: DIY
  • CrewAI: per-agent model selection
  • OpenAI Agents SDK: per-agent model + handoff routing
  • Codex CLI: approval/auto modes affect cost indirectly
  • Cursor: subscription-based; indirect cost visibility

Connections to Other Concepts

  • model-routing-in-harnesses.md — How a harness chooses which model gets which turn.
  • prompt-and-context-caching.md — The single biggest cost lever.
  • the-75-percent-savings-claim.md — An audit of ruflo's specific claim.
  • swe-bench-and-harness-leaderboards.md — The performance-per-dollar comparisons.
  • choosing-your-harness-stack.md — The capstone decision framework.

Further Reading

  • Anthropic, "Prompt Caching" documentation — The mechanics that make caching a real cost lever.
  • ruvnet, ruflo economics whitepaper / claims — The case study.