One-Line Summary: Model routing is the harness-layer decision of which model handles which turn — a small fast model for routing/classification, a large smart model for hard reasoning, a code-tuned model for coding subtasks; routing is the second-largest cost lever after caching, and a major source of harness differentiation.

Prerequisites: Harness cost models, harness primitives

What Is Model Routing?

A naive harness uses one model for everything. A routed harness picks the right model per turn based on what's happening. The routing dimensions:

  • Capability tier: Easy turns (formatting, classification) → cheap small model. Hard turns (multi-step planning, hard debugging) → large frontier model.
  • Specialization: Code-heavy turns → code-tuned model. Vision-heavy turns → multimodal model.
  • Context length: Short turns → standard context model. Long-context turns → extended context model.
  • Latency: User-blocking turns → fast model. Background turns → cheap-but-slow model.
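These dimensions can be sketched as a simple rule table. The model identifiers, feature names, and thresholds below are illustrative placeholders, not recommendations:

```python
# Hypothetical routing rules: map turn features to a model choice.
def route(turn: dict) -> str:
    if turn.get("needs_vision"):
        return "multimodal-model"            # specialization: vision
    if turn.get("context_tokens", 0) > 100_000:
        return "extended-context-model"      # context length
    if turn.get("kind") == "code":
        return "code-tuned-model"            # specialization: code
    if turn.get("complexity") == "high":
        return "frontier-model"              # capability tier: hard reasoning
    if turn.get("user_blocking"):
        return "fast-small-model"            # latency-sensitive
    return "cheap-small-model"               # default: easy turns

print(route({"kind": "code"}))               # code-tuned-model
print(route({"complexity": "high"}))         # frontier-model
```

Real routers layer these checks with confidence scores rather than hard if/else rules, but the priority ordering (specialization and context constraints before capability tier) is the core idea.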

ruflo's claimed 75% cost reduction without quality loss comes mainly from routing: sending easy turns to Haiku-class models while reserving Opus-class models for hard turns.

How It Works

A routing pipeline:

  1. Per-turn classifier: A small, cheap model (or rules) classifies the current turn — what type, how complex.
  2. Model selection: The router maps classification → model.
  3. Fallback policy: If the chosen model fails or is unavailable, fall back to a known-good alternative.
  4. Cost accounting: Per-turn cost is tracked; budgets are enforced.
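The four stages above can be sketched in a few lines. The classifier heuristic, model names, prices, and token estimate are all illustrative placeholders:

```python
# Stage 1: classifier stand-in (a small model or rules in practice).
def classify(turn: str) -> tuple[str, float]:
    """Return (label, confidence) for the current turn."""
    hard = any(w in turn.lower() for w in ("debug", "plan", "refactor"))
    return ("hard", 0.9) if hard else ("easy", 0.85)

FALLBACKS = {"frontier-model": "cheap-small-model"}
PRICE_PER_1K = {"frontier-model": 0.015, "cheap-small-model": 0.0005}

def route_turn(turn: str, budget: dict,
               available=frozenset({"frontier-model", "cheap-small-model"})) -> str:
    label, conf = classify(turn)
    # Stage 2: map classification to a model; if uncertain, err bigger.
    model = "frontier-model" if label == "hard" or conf < 0.7 else "cheap-small-model"
    # Stage 3: fall back to a known-good alternative if unavailable.
    if model not in available:
        model = FALLBACKS.get(model, "cheap-small-model")
    # Stage 4: track per-turn cost (crude word-count token estimate).
    tokens = len(turn.split()) * 10
    budget["spent"] += PRICE_PER_1K[model] * tokens / 1000
    return model
```

A production router would also enforce the budget (e.g. refuse to escalate once `budget["spent"]` crosses a cap), but the shape is the same: classify, select, fall back, account.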

The classifier can itself be a source of regressions — a routing decision that sends a hard turn to a small model produces bad outputs. Good routers err on the side of "if uncertain, use the bigger model."

Why It Matters

Routing is why production agent costs do not have to be 5–10× hobbyist costs. A team running thousands of agent invocations daily, all on the largest model, pays orders of magnitude more than necessary. A routed deployment matches model size to task — most tasks land at the cheap end of the distribution, and the expensive model is reserved for the few turns that need it.
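The economics follow from back-of-envelope arithmetic. The prices (per 1M input tokens) and traffic mix below are hypothetical, not real provider pricing:

```python
# Blended cost of a routed deployment vs. all-frontier.
PRICE = {"small": 0.25, "mid": 3.00, "frontier": 15.00}   # $/1M tokens (hypothetical)
MIX = {"small": 0.80, "mid": 0.15, "frontier": 0.05}       # share of turns (hypothetical)

all_frontier = PRICE["frontier"]
routed = sum(PRICE[m] * share for m, share in MIX.items())
print(f"all-frontier: ${all_frontier:.2f} per 1M tokens")
print(f"routed blend: ${routed:.2f} per 1M tokens "
      f"({1 - routed / all_frontier:.0%} cheaper)")
```

The savings are dominated by the mix skew: as long as most turns genuinely are easy, the blended rate sits close to the cheap model's price.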

Routing is also a quality lever, not just a cost lever. A code-tuned model on a coding turn outperforms a generalist of equivalent size. Routing toward specialization is a quality investment.

Key Technical Details

  • Classifier latency adds up: A 50ms classification on every turn is real overhead. Cache classifications per session prefix.
  • Hysteresis prevents thrashing: Once routed to a model, stay there for a few turns unless signals strongly change.
  • Multi-provider routing complicates failover: If you route across Claude, GPT, Gemini, you need to handle differing tool-call formats, prompt engineering quirks, and capabilities.
  • Quality regressions are easy to miss: Routing changes can degrade output in ways users don't immediately notice. Track quality metrics post-routing-change.
  • Per-tenant overrides: Some users prefer "always use the largest model"; offer a setting.
  • Router-as-agent: A learned router can outperform rule-based routing but adds complexity. Start with rules.
  • Locality matters: A router that switches between providers mid-session loses prompt-cache benefits (caches are per-provider).
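The hysteresis point above can be sketched as a small stateful wrapper. The hold window, model names, and escalate-immediately policy are illustrative choices:

```python
# Routing hysteresis: escalation is immediate, but de-escalation waits
# out a hold window so the session doesn't thrash between models.
class HysteresisRouter:
    def __init__(self, hold_turns: int = 3):
        self.hold_turns = hold_turns
        self.current = "cheap-small-model"
        self.turns_since_switch = 0

    def route(self, wants: str) -> str:
        """`wants` is the per-turn classifier's preferred model."""
        self.turns_since_switch += 1
        if wants == self.current:
            return self.current
        escalating = wants == "frontier-model"
        if escalating or self.turns_since_switch >= self.hold_turns:
            self.current = wants
            self.turns_since_switch = 0
        return self.current
```

Escalating immediately but de-escalating slowly is the safe asymmetry: the cost of a few extra frontier-model turns is small compared to a hard turn landing on a small model.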

How Harnesses & Frameworks Implement This

  • Claude Code: Per-session model selection; no per-turn routing
  • Claude Agent SDK: Programmatic — DIY router
  • ruflo: First-class — multi-provider router across Claude, GPT, Gemini, Cohere, Ollama
  • LangGraph: DIY — different nodes can use different models
  • AutoGen: Per-agent model; limited turn-level routing
  • CrewAI: Per-agent model
  • OpenAI Agents SDK: Per-agent model
  • Codex CLI: Per-session
  • Cursor: Per-session, plus free tier-bundled fast models for autocomplete

Connections to Other Concepts

  • harness-cost-models.md — Routing is the second-largest lever.
  • prompt-and-context-caching.md — The largest lever; interacts with routing.
  • the-75-percent-savings-claim.md — Routing is most of the reason.
  • claude-code-vs-codex-vs-cursor.md — Routing differences across harnesses.
  • ../../llm-concepts/07-inference-and-deployment/model-routing.md — Foundational coverage.

Further Reading

  • ruvnet, ruflo multi-provider routing documentation.
  • Vercel AI SDK / OpenRouter — Routing-as-a-service products.