One-Line Summary: Model routing is the harness-layer decision of which model handles which turn — a small fast model for routing/classification, a large smart model for hard reasoning, a code-tuned model for coding subtasks; routing is the second-largest cost lever after caching, and a major source of harness differentiation.

Prerequisites: Harness cost models, harness primitives

What Is Model Routing?

A naive harness uses one model for everything. A routed harness picks the right model per turn based on what's happening. The routing dimensions:

  • Capability tier: Easy turns (formatting, classification) → cheap small model. Hard turns (multi-step planning, hard debugging) → large frontier model.
  • Specialization: Code-heavy turns → code-tuned model. Vision-heavy turns → multimodal model.
  • Context length: Short turns → standard context model. Long-context turns → extended context model.
  • Latency: User-blocking turns → fast model. Background turns → cheap-but-slow model.
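These dimensions can be sketched as a simple rule table. The model identifiers, feature names, and thresholds below are illustrative placeholders, not recommendations:

```python
# Hypothetical routing rules: map turn features to a model choice.
def route(turn: dict) -> str:
    if turn.get("needs_vision"):
        return "multimodal-model"            # specialization: vision
    if turn.get("context_tokens", 0) > 100_000:
        return "extended-context-model"      # context length
    if turn.get("kind") == "code":
        return "code-tuned-model"            # specialization: code
    if turn.get("complexity") == "high":
        return "frontier-model"              # capability tier: hard reasoning
    if turn.get("user_blocking"):
        return "fast-small-model"            # latency-sensitive
    return "cheap-small-model"               # default: easy turns

print(route({"kind": "code"}))               # code-tuned-model
print(route({"complexity": "high"}))         # frontier-model
```

Real routers layer these checks with confidence scores rather than hard if/else rules, but the priority ordering (specialization and context constraints before capability tier) is the core idea.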

ruflo's claimed 75% cost reduction without quality loss comes mainly from routing: sending easy turns to Haiku-class models while reserving Opus-class models for hard turns.

How It Works

A routing pipeline:

  1. Per-turn classifier: A small, cheap model (or rules) classifies the current turn — what type, how complex.
  2. Model selection: The router maps classification → model.
  3. Fallback policy: If the chosen model fails or is unavailable, fall back to a known-good alternative.
  4. Cost accounting: Per-turn cost is tracked; budgets are enforced.
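The four stages above can be sketched in a few lines. The classifier heuristic, model names, prices, and token estimate are all illustrative placeholders:

```python
# Stage 1: classifier stand-in (a small model or rules in practice).
def classify(turn: str) -> tuple[str, float]:
    """Return (label, confidence) for the current turn."""
    hard = any(w in turn.lower() for w in ("debug", "plan", "refactor"))
    return ("hard", 0.9) if hard else ("easy", 0.85)

FALLBACKS = {"frontier-model": "cheap-small-model"}
PRICE_PER_1K = {"frontier-model": 0.015, "cheap-small-model": 0.0005}

def route_turn(turn: str, budget: dict,
               available=frozenset({"frontier-model", "cheap-small-model"})) -> str:
    label, conf = classify(turn)
    # Stage 2: map classification to a model; if uncertain, err bigger.
    model = "frontier-model" if label == "hard" or conf < 0.7 else "cheap-small-model"
    # Stage 3: fall back to a known-good alternative if unavailable.
    if model not in available:
        model = FALLBACKS.get(model, "cheap-small-model")
    # Stage 4: track per-turn cost (crude word-count token estimate).
    tokens = len(turn.split()) * 10
    budget["spent"] += PRICE_PER_1K[model] * tokens / 1000
    return model
```

A production router would also enforce the budget (e.g. refuse to escalate once `budget["spent"]` crosses a cap), but the shape is the same: classify, select, fall back, account.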

The classifier can itself be a source of regressions — a routing decision that sends a hard turn to a small model produces bad outputs. Good routers err on the side of "if uncertain, use the bigger model."

Why It Matters

Routing is why production agent costs do not have to be 5–10× hobbyist costs. A team running thousands of agent invocations daily, all on the largest model, pays orders of magnitude more than necessary. A routed deployment matches model size to task — most tasks land at the cheap end of the distribution, and the expensive model is reserved for the few turns that need it.
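The economics follow from back-of-envelope arithmetic. The prices (per 1M input tokens) and traffic mix below are hypothetical, not real provider pricing:

```python
# Blended cost of a routed deployment vs. all-frontier.
PRICE = {"small": 0.25, "mid": 3.00, "frontier": 15.00}   # $/1M tokens (hypothetical)
MIX = {"small": 0.80, "mid": 0.15, "frontier": 0.05}       # share of turns (hypothetical)

all_frontier = PRICE["frontier"]
routed = sum(PRICE[m] * share for m, share in MIX.items())
print(f"all-frontier: ${all_frontier:.2f} per 1M tokens")
print(f"routed blend: ${routed:.2f} per 1M tokens "
      f"({1 - routed / all_frontier:.0%} cheaper)")
```

The savings are dominated by the mix skew: as long as most turns genuinely are easy, the blended rate sits close to the cheap model's price.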

Routing is also a quality lever, not just a cost lever. A code-tuned model on a coding turn outperforms a generalist of equivalent size. Routing toward specialization is a quality investment.

Key Technical Details

  • Classifier latency adds up: A 50ms classification on every turn is real overhead. Cache classifications per session prefix.
  • Hysteresis prevents thrashing: Once routed to a model, stay there for a few turns unless signals strongly change.
  • Multi-provider routing complicates failover: If you route across Claude, GPT, Gemini, you need to handle differing tool-call formats, prompt engineering quirks, and capabilities.
  • Quality regressions are easy to miss: Routing changes can degrade output in ways users don't immediately notice. Track quality metrics post-routing-change.
  • Per-tenant overrides: Some users prefer "always use the largest model"; offer a setting.
  • Router-as-agent: A learned router can outperform rule-based routing but adds complexity. Start with rules.
  • Locality matters: A router that switches between providers mid-session loses prompt-cache benefits (caches are per-provider).
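The hysteresis point above can be sketched as a small stateful wrapper. The hold window, model names, and escalate-immediately policy are illustrative choices:

```python
# Routing hysteresis: escalation is immediate, but de-escalation waits
# out a hold window so the session doesn't thrash between models.
class HysteresisRouter:
    def __init__(self, hold_turns: int = 3):
        self.hold_turns = hold_turns
        self.current = "cheap-small-model"
        self.turns_since_switch = 0

    def route(self, wants: str) -> str:
        """`wants` is the per-turn classifier's preferred model."""
        self.turns_since_switch += 1
        if wants == self.current:
            return self.current
        escalating = wants == "frontier-model"
        if escalating or self.turns_since_switch >= self.hold_turns:
            self.current = wants
            self.turns_since_switch = 0
        return self.current
```

Escalating immediately but de-escalating slowly is the safe asymmetry: the cost of a few extra frontier-model turns is small compared to a hard turn landing on a small model.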

How Harnesses & Frameworks Implement This

  • Claude Code: Per-session model selection; no per-turn routing
  • Claude Agent SDK: Programmatic — DIY router
  • ruflo: First-class — multi-provider router across Claude, GPT, Gemini, Cohere, Ollama
  • LangGraph: DIY — different nodes can use different models
  • AutoGen: Per-agent model; limited turn-level routing
  • CrewAI: Per-agent model
  • OpenAI Agents SDK: Per-agent model
  • Codex CLI: Per-session
  • Cursor: Per-session, plus free tier-bundled fast models for autocomplete

Connections to Other Concepts

  • harness-cost-models.md — Routing is the second-largest lever.
  • prompt-and-context-caching.md — The largest lever; interacts with routing.
  • the-75-percent-savings-claim.md — Routing is most of the reason.
  • claude-code-vs-codex-vs-cursor.md — Routing differences across harnesses.
  • ../../llm-concepts/07-inference-and-deployment/model-routing.md — Foundational coverage.

Further Reading

  • ruvnet, ruflo multi-provider routing documentation.
  • Vercel AI SDK / OpenRouter — Routing-as-a-service products.