One-Line Summary: Prompt caching reuses computation for repeated prefixes — system prompts, long instructions, recently-seen documents — at 5–10× cost savings on cache-hit tokens; it is the single largest cost lever in any agent system, and harness-layer prompt structure determines whether you actually capture it.

Prerequisites: Tokens and tokenization, context window management

What Is Prompt Caching?

When you call an LLM with a long prefix that's identical to a recent call's prefix, providers can skip recomputing the prefix's KV cache and reuse the cached version. The cached tokens are billed at a fraction of the standard rate (Anthropic: cache reads are typically ~10× cheaper than standard input tokens, while cache writes carry a ~25% surcharge; OpenAI: caching is automatic and cached input tokens are billed at a reduced rate, with no separate write charge).

For agent workloads, prefixes are often very long (the system prompt + tools + conversation history) and very stable across turns. The same agent session reuses 80–95% of its prefix from turn to turn. Cache hit rates above 90% translate directly to bills cut by ~3–5×.
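
As a rough back-of-the-envelope, here is the arithmetic for a single turn. The prices and token counts are illustrative placeholders (loosely in the style of Anthropic's published Sonnet rates), not quotes from a price sheet:

```python
# Illustrative $/MTok input prices (placeholders; check your provider's price sheet).
BASE_INPUT = 3.00    # standard input tokens
CACHE_WRITE = 3.75   # cache write, ~25% surcharge over base
CACHE_READ = 0.30    # cache read, ~10x cheaper than base

prefix = 40_000      # system prompt + tools + conversation history (stable)
suffix = 2_000       # latest user message / tool result (changes every turn)

uncached_turn = (prefix + suffix) / 1e6 * BASE_INPUT                  # no caching at all
write_turn = prefix / 1e6 * CACHE_WRITE + suffix / 1e6 * BASE_INPUT   # first turn of a session
read_turn = prefix / 1e6 * CACHE_READ + suffix / 1e6 * BASE_INPUT     # every later turn

print(f"uncached: ${uncached_turn:.3f}  write: ${write_turn:.3f}  read: ${read_turn:.3f}")
# uncached: $0.126   write: $0.156   read: $0.018  -> cached turns are roughly 7x cheaper
```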

Caching is a model-provider feature, but capturing it is a harness responsibility. Prompt structure determines whether two consecutive calls hit the same cache entry; small changes (a timestamp, a randomized order) can invalidate the cache and force full recomputation.

How to Capture Caching at the Harness Layer

Three disciplines:

  1. Stable prefix order: Place the longest, most-stable content first (system prompt, tool definitions, project memory). Variable content (current user message, latest tool result) goes last.
  2. No timestamp pollution: Avoid embedding the current time in the prompt. Rely on the model's built-in time-awareness or fetch the time via a tool call.
  3. Cache breakpoints: Some providers expose explicit cache markers. Place them on the boundary between stable and variable content (see the sketch below).
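
A minimal sketch of disciplines 1 and 3 against the Anthropic Messages API (cache_control is Anthropic's documented cache marker; the model name, tool definition, and prompt text here are illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Discipline 1: stable content first -- system prompt and tool definitions in a
# fixed order, with no embedded timestamps.
system_prompt = "You are a coding agent for this repository. <long, stable instructions>"

tools = [
    {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
]

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative; use whichever model your harness targets
    max_tokens=1024,
    tools=tools,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            # Discipline 3: explicit cache breakpoint on the stable/variable boundary.
            # Everything up to and including this block (tools + system) is cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Variable content (the latest user message, newest tool results) goes last.
    messages=[{"role": "user", "content": "Fix the failing test in tests/test_io.py"}],
)
```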

Ruflo's cost-savings claims (a 75% reduction versus raw Claude Code) rest substantially on disciplined caching. Claude Code captures most of the benefit by default; manually tuned configurations can do better.

Why It Matters

Prompt caching is the rare engineering investment that simultaneously reduces cost, latency, and load on the provider. There is essentially no downside beyond the discipline required to maintain stable prefixes. Teams that haven't audited their cache hit rates almost always find a problem when they do.

A practical heuristic: a session with a high cache hit rate feels different. Turn-to-turn latency is lower, and cost per turn grows sub-linearly with context length. If your sessions don't have these properties, you're leaving money on the table.

Key Technical Details

  • Cache TTL: Anthropic's prompt cache is short-lived (5 min by default, refreshed on each cache read; a 1-hour TTL is available). After the TTL expires, the cache is gone and the next call pays the write cost again.
  • Cache hit rate vs. cache savings: Hit rate is the fraction of prefix tokens served from cache. Savings is roughly hit rate × cache discount, less the write surcharge. For Anthropic, a 90% hit rate works out to roughly a 75–80% cost reduction on prefix tokens.
  • Token-level caching: Caches are token-prefix-based, not semantically based. The first token that differs invalidates everything downstream.
  • Cache observability: Providers return cache hit/miss metadata on every response. Surface it in your harness's cost-tracking UI (see the sketch after this list).
  • Multi-turn agent loops are ideal: Long sessions amortize the cache write cost across many cache reads. Short bursty workloads don't benefit as much.
  • Provider differences: Anthropic, OpenAI, Google all have prompt caching with different mechanics. Cross-provider routing forfeits cache benefits.
  • Stale cache guards: A prefix held stable for caching's sake can embed file contents that have since changed on disk, which produces stale answers. Refresh the embedded content (and accept the cache miss) when the source changes.
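
A sketch of what that observability can look like with Anthropic's response format (the usage field names are Anthropic's; the prices are illustrative, and other providers expose equivalent fields under different names):

```python
def cache_stats(usage, base=3.00, write=3.75, read=0.30):
    """Summarize one turn's cache behavior from an Anthropic-style usage object.

    input_tokens                 -- uncached input tokens billed at the base rate
    cache_creation_input_tokens  -- tokens written to the cache this turn
    cache_read_input_tokens      -- tokens served from the cache this turn
    Prices are illustrative $/MTok; substitute your model's actual rates.
    """
    uncached = usage.input_tokens
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read_hits = getattr(usage, "cache_read_input_tokens", 0) or 0
    total = uncached + written + read_hits

    hit_rate = read_hits / total if total else 0.0
    actual_cost = (uncached * base + written * write + read_hits * read) / 1e6
    no_cache_cost = total * base / 1e6
    return {
        "hit_rate": hit_rate,
        "input_cost_usd": actual_cost,
        "saved_usd": no_cache_cost - actual_cost,
    }

# e.g. stats = cache_stats(response.usage); log it per turn and chart hit_rate per session
```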

How Harnesses & Frameworks Implement This

| Harness / Framework | Caching support |
| --- | --- |
| Claude Code | Native — automatic for long-running sessions |
| Claude Agent SDK | Programmatic — cache_control on prompt blocks |
| ruflo | Aggressive — disciplined prefix layout for high hit rates |
| LangGraph | DIY — caching is the user's responsibility |
| AutoGen | Limited |
| CrewAI | Limited |
| OpenAI Agents SDK | Native — automatic for OpenAI models |
| Codex CLI | Native |
| Cursor | Captures via subscription pricing model; user doesn't see it directly |

Connections to Other Concepts

  • harness-cost-models.md — The largest cost lever.
  • model-routing-in-harnesses.md — Routing across providers loses cache.
  • the-75-percent-savings-claim.md — Caching is most of why the claim is plausible.
  • ../../llm-concepts/07-inference-and-deployment/prefix-caching.md — Foundational coverage.

Further Reading

  • Anthropic, "Prompt Caching" documentation.
  • OpenAI, "Prompt Caching" guide.