Automatic Model Routing Doesn't Exist Yet

I’ve been looking at how the major AI coding assistants pick which model to use, and they all do the same thing: error-based failover. Claude Code falls back to Sonnet when Opus throws. Cursor swaps to Haiku when latency creeps. The big open-source ones (OpenHands, OpenClaw) let you pin a model per session, but they don’t route by what the task needs.

Nobody routes by task complexity. Nobody is asking the question “what’s being requested here, and which tier should answer it?”

This is a real gap, and the more I work in these tools, the more obvious it gets.

The Pattern Everyone Uses

Default to the operator’s preferred model. Fail over when the model errors. Optionally let the operator pick per-session.

That’s it. Three modes, none of which look at the task.

The argument for the current state is reasonable: model routing is a hard problem. Mis-routing is worse than over-routing because mis-routing produces bad outputs at premium prices. Better to default to the model the user trusts than to silently downgrade and ship something worse.

What’s Wrong With It

Most tasks don’t need Opus. They need Haiku. The model that’s good at “summarize this email” is not the model that’s good at “design the database schema for our agent state.” You don’t want to pay Opus-rate to summarize emails and you don’t want Haiku-rate decisions about your schema.

When everything defaults to a single model, one of two things happens:

You default high. Your Haiku-class tasks cost 20x what they should. Over time, you cap how much you’ll automate because the per-task cost adds up.
You default low. Your Opus-class tasks get half-thought-through answers and you have to redo them, or worse, ship them.

Operators end up doing model routing manually. “This is a long task, let me make sure I’m on Opus today.” That’s manual error-based routing. We can do better.

What Routing Could Look Like

A classifier between the prompt and the dispatcher. Examines the task and picks a tier:

Atomic, well-defined, low-context -> Haiku. Email summary, file rename, simple regex.
Multi-step, scoped, medium context -> Sonnet. Single-file refactor, write a function, debug a known error.
Open-ended, high-context, design-y -> Opus. Architecture decisions, multi-file refactors, anything requiring real judgment.

The classifier doesn’t have to be smart. A rule-based heuristic that looks at prompt length, presence of tool calls, presence of design language, and whether multiple files are referenced would route 80% correctly. The remaining 20% gets a confidence score and routes to a higher tier when uncertain (fail safe high, not low).

The GSD project does this with rolling-history learning (track outcomes per category, adjust thresholds). Nobody else does. Anthropic’s Pi SDK could do this. Claude Code Code could ship it. So could Cursor.

Why This Matters

The token cost of running these tools is the rate-limiting reagent on how much you’ll automate. If Haiku-tier tasks ran at Haiku price, you’d queue 10x more work overnight. The daemon would be doing actual labor while you sleep instead of just the work-you-explicitly-approved variety.

The cost difference between Haiku and Opus is roughly 20x. Across a workspace doing 500 small tasks a week, that’s the difference between “this is a tool I use a lot” and “this is a budget I have to manage.” It’s the difference between automating freely and automating with anxiety.

The first AI assistant that ships smart routing wins the operator-loyalty game. The user who feels free to queue work without budget guilt is the user who builds their entire operations around the tool.

What I’d Build

If I were building it: a small classifier service that sits in front of the dispatcher. Takes the prompt, the tool definitions, and the project context. Returns a tier (and a confidence score). Operators see the routing decision and can override per-task or per-tier.

Two stretch goals:

Cost budgets. Per-day or per-project. When the budget is 80% spent, downgrade soft tasks to Haiku. When it’s at 100%, pause and notify.
Outcome learning. Track which tier produced the accepted output. After 1,000 tasks, the routing improves.

Neither is hard. The hard part is shipping it without people noticing the model switch and losing trust. That’s solved by exposing the routing decision in the UI - the operator should always know which tier is answering, and which tier the task was originally routed to.

The Bet

The first major coding assistant to route intelligently will become structurally cheaper for high-volume operators. That’s a moat. The current state - everyone defaulting high and burning tokens - won’t last another year.

If nobody’s shipped it by then, I might build it myself.