One-Line Summary: A harness's cost is dominated not by per-token model price but by how often it calls the model, how aggressively it caches the prefix, when it falls back to a cheaper model, and how many sub-agents it parallelizes — these are harness-level decisions, not model-level ones.
Prerequisites: What is an AI harness, agent loop, context window management, model routing
Status: Module 08 anchor concept — first draft.
What Is a Harness Cost Model?
The naive cost model is "tokens × price." It is wrong by an order of magnitude. The actual cost of a harness running a real workload is determined by:
- how many turns the loop runs (a function of how decisively the agent terminates);
- how much of the prefix gets cached (5–10× cheaper than a cold call on Claude);
- whether sub-agents run in parallel (multiplying spend) or sequentially;
- whether the cheaper model is acceptable for routine subtasks (Haiku at ~5% of Opus's price);
- whether tool results bypass the model entirely when they can be summarized deterministically.
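A back-of-envelope model makes the caching lever concrete. The sketch below is illustrative only: the per-million-token prices and the 10% cached-token rate are assumptions for the example, not any provider's actual rates.

```python
def turn_cost(input_tokens: int, output_tokens: int, cache_hit_ratio: float,
              in_price: float, out_price: float,
              cached_discount: float = 0.1) -> float:
    """Dollar cost of one model call. Prices are per million tokens.
    Cached prefix tokens are billed at `cached_discount` of the input price
    (an assumed discount for illustration)."""
    cached = input_tokens * cache_hit_ratio
    cold = input_tokens - cached
    return (cold * in_price
            + cached * in_price * cached_discount
            + output_tokens * out_price) / 1_000_000

def loop_cost(turns: int, prefix_per_turn: int, output_per_turn: int,
              cache_hit_ratio: float, in_price: float, out_price: float) -> float:
    """Cost of an agent loop whose prompt prefix grows by `prefix_per_turn`
    tokens each turn -- the shape of a typical tool-use conversation."""
    total = 0.0
    for t in range(1, turns + 1):
        total += turn_cost(t * prefix_per_turn, output_per_turn,
                           cache_hit_ratio, in_price, out_price)
    return total

# Same 50-turn workload, with and without prefix caching (prices assumed).
hot = loop_cost(50, 2000, 500, cache_hit_ratio=0.0, in_price=15.0, out_price=75.0)
warm = loop_cost(50, 2000, 500, cache_hit_ratio=0.9, in_price=15.0, out_price=75.0)
```

Note that the gap between `hot` and `warm` widens with turn count, because the re-billed prefix grows quadratically over the loop while output stays roughly linear: this is why caching dominates on long-running loops specifically.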
Ruflo claims a 75% cost reduction over raw Claude Code on the same workloads. Whether that exact number holds depends on the workload, but the underlying point — that harness-level routing dwarfs model selection as a cost lever — is now well-established in 2026.
How It Works
A cost-aware harness instruments every model call with: a token estimate, a model choice (cheap → expensive escalation), a cache key for the prefix, and a budget check. When a turn would exceed budget, the harness can summarize the conversation, hand off to a smaller model, or refuse. Caching, in particular, is a discipline: the harness has to lay out the prompt so the cacheable prefix is genuinely stable across turns.
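The per-call discipline above can be sketched as a small planning function. Everything here is hypothetical: the model names, prices, and the `plan_call` interface are assumptions for illustration, not a real harness API.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    remaining_usd: float

# Cheap -> expensive escalation order; (name, in_price, out_price) per M tokens.
# Prices are made-up placeholders.
MODELS = [
    ("small", 1.0, 5.0),
    ("large", 15.0, 75.0),
]

def estimate_cost(model: tuple, est_in: int, est_out: int) -> float:
    _, in_price, out_price = model
    return (est_in * in_price + est_out * out_price) / 1_000_000

def plan_call(budget: Budget, est_in: int, est_out: int,
              needs_large: bool) -> tuple:
    """Pick the cheapest acceptable model for this turn, or decide to
    summarize the conversation / refuse when the turn would exceed budget."""
    candidates = MODELS[1:] if needs_large else MODELS
    for model in candidates:
        cost = estimate_cost(model, est_in, est_out)
        if cost <= budget.remaining_usd:
            budget.remaining_usd -= cost
            return ("call", model[0])
    # Over budget: try shrinking the prompt (summarization) before refusing.
    # The 4x compression ratio is an assumed heuristic.
    if estimate_cost(candidates[0], est_in // 4, est_out) <= budget.remaining_usd:
        return ("summarize_then_call", candidates[0][0])
    return ("refuse", None)
```

A real harness would also attach a cache key to the stable prefix before the call; the key point the sketch captures is that the budget check and model choice happen per turn, before any tokens are spent.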
Why It Matters
Production agent costs are dominated by a few hot loops — the one workflow you run a thousand times a day. Any team running agents in production discovers within a year that harness-level cost engineering is where the leverage is.
How Harnesses & Frameworks Implement This
| Harness / Framework | Cost-control levers |
|---|---|
| Claude Code | Model selection per session, prompt caching, manual /compact |
| Claude Agent SDK | All of the above, programmatic |
| ruflo | Routing across providers, prefix caching, model fallback, claim of ~75% reduction |
| LangGraph | DIY — instrument your own nodes |
| AutoGen | DIY |
| CrewAI | Per-agent model selection |
| OpenAI Agents SDK | Per-agent model + handoff routing |
| Codex CLI | Approval/auto modes affect cost indirectly |
| Cursor | Subscription-based; indirect cost visibility |
Connections to Other Concepts
- model-routing-in-harnesses.md — How a harness chooses which model gets which turn.
- prompt-and-context-caching.md — The single biggest cost lever.
- the-75-percent-savings-claim.md — An audit of ruflo's specific claim.
- swe-bench-and-harness-leaderboards.md — The performance-per-dollar comparisons.
- choosing-your-harness-stack.md — The capstone decision framework.
Further Reading
- Anthropic, "Prompt Caching" documentation — The mechanics that make caching a real cost lever.
- ruvnet, ruflo economics whitepaper / claims — The case study.