What Is an LLM Router? Automatic Model Routing for Cost and Quality

Most prompts in a coding session are easy. Without routing, every request hits the most expensive model. An LLM router classifies prompt difficulty in ~430ms and routes to the right model tier: cheap for easy tasks, expensive for hard ones. 40-70% cost savings with under 2% quality loss on hard tasks.

March 31, 2026 · 2 min read

You are paying $15 per million tokens for Claude Opus 4 to add a TODO comment. To rename a variable. To fix a typo in a docstring. These are tasks that Haiku handles at $0.25/M tokens with identical output quality. An LLM router fixes this by classifying prompt difficulty in ~430ms and sending each request to the cheapest model that can handle it. 40-70% cost reduction, under 2% quality loss on hard tasks.

40-70%: API cost savings with routing
~430ms: classification latency
<2%: quality loss on hard tasks
$0.001: per classification request

The Problem: Paying Top Dollar for Simple Tasks

Most prompts in a coding session are simple. Adding imports. Writing boilerplate tests. Renaming variables. Generating docstrings. Fixing lint errors. These tasks require minimal reasoning and produce identical output whether you use a $15/M token model or a $0.25/M token model.

Without routing, every request goes to the same model. If your agent or application is configured to use Claude Opus 4 or GPT-5-high, that is what it uses for everything. The TODO comment costs the same per-token as a complex architectural refactor.

This waste compounds fast. A typical coding agent session involves 50-200 LLM calls. If 60% of those are easy tasks routed to a $15/M model instead of a $0.25/M model, you are spending 60x more than necessary on the majority of your requests. Over thousands of developer sessions per month, the wasted spend becomes a material line item.

The prompt difficulty distribution

Analysis of millions of coding prompts shows a consistent distribution: roughly 60% easy, 25% medium, 15% hard. The ratio shifts depending on the task (greenfield coding skews harder, maintenance and refactoring skew easier), but the pattern holds. Most work is simple.

What an LLM Router Does

An LLM router is a classification layer between your application and the model providers. Before each LLM call, the router examines the prompt and assigns a difficulty level. That difficulty level maps to a model tier. The application then sends the request to the recommended model instead of a fixed default.

The classification happens on the prompt text, system prompt, and conversation history. The router does not proxy the LLM request itself. It returns a model recommendation, and your application makes the actual API call to the provider. This keeps the router lightweight: it only needs to see the input, not handle streaming, function calls, or response parsing.

Four classification categories cover the space:

Easy

Boilerplate, simple edits, documentation, formatting, imports, renaming. These need token prediction, not reasoning. Route to the smallest, cheapest model.

Medium

Multi-file changes, moderate logic, standard patterns with some nuance. Requires understanding of code structure but not deep reasoning. Route to a mid-tier model.

Hard

Architectural decisions, complex debugging, large refactors across many files, novel algorithms. These require the full reasoning capability of a frontier model.

Needs Info

Ambiguous prompts where classification is uncertain. Instead of guessing, return to the user for clarification. This avoids wasting tokens on a misrouted request.

The "needs_info" category is important. A prompt like "fix it" with no context is unclassifiable. Routing it to a cheap model wastes tokens on a bad response. Routing it to an expensive model wastes money on a bad response. The correct action is to ask the user what they mean.
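A minimal sketch of acting on that category, assuming the { model, difficulty } result shape shown in the code examples in this article (model comes back null when the difficulty is needs_info):

```typescript
// Assumed result shape, based on the selectModel examples in this article.
interface RouterResult {
  model: string | null
  difficulty: 'easy' | 'medium' | 'hard' | 'needs_info'
}

// Decide the next step: call the recommended model, or ask the user to clarify.
function nextAction(
  result: RouterResult
): { type: 'call'; model: string } | { type: 'clarify' } {
  if (result.difficulty === 'needs_info' || result.model === null) {
    // Do not guess: a misrouted "fix it" wastes tokens on any tier.
    return { type: 'clarify' }
  }
  return { type: 'call', model: result.model }
}
```

The clarify branch costs $0 in model tokens, which is why the table later in this article prices needs_info at zero.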

How Routing Works Internally

The router is a trained classifier, not a set of heuristics. Rule-based approaches (keyword matching, prompt length, regex patterns) are fast but inaccurate. A short prompt can be extremely hard ("refactor this to use the visitor pattern") and a long prompt can be trivially easy (a large block of boilerplate with "add error handling to each function").

Morph's router is trained on millions of coding prompts with labeled difficulty. The training data comes from real coding sessions across multiple languages, frameworks, and task types. Each prompt is paired with the model tier that produced the best cost-quality tradeoff for that specific task.

The classifier runs in ~430ms average latency. For comparison, most LLM API calls take 1-10 seconds for time-to-first-token, so the classification overhead is a fraction of the total request time. In practice, the classification runs in parallel with request preparation (building the message array, fetching context from files, assembling the system prompt), making the effective overhead near zero.

Router classification flow

// 1. User submits a prompt
const userQuery = "Add a TODO comment above the fetchUsers function"

// 2. Router classifies difficulty (~430ms, runs in parallel with step 3)
const { model, difficulty } = await morph.routers.anthropic.selectModel({
  input: userQuery,
  mode: 'balanced'
})
// → { model: "claude-haiku-4", difficulty: "easy" }

// 3. Meanwhile, prepare the request (context fetching, system prompt, etc.)
const messages = await buildMessages(userQuery, context)

// 4. Send to the recommended model
const response = await anthropic.messages.create({
  model,  // "claude-haiku-4" instead of "claude-opus-4"
  messages,
})
// Cost: $0.25/M tokens instead of $15/M tokens. Same output quality.

The parallel execution is key. If you run classification sequentially before the LLM call, you add 430ms to every request. If you run it in parallel with the work you were already doing (fetching file contents, building context, preparing the system prompt), the classification completes before the request is ready to send. Zero added latency in the common case.

Model Tiers and Pricing

The router maps each difficulty level to a model tier. The specific models within each tier vary by provider (Anthropic, OpenAI, Google), but the cost-quality tradeoff follows the same structure.

| Difficulty | Anthropic | OpenAI | Google | Cost per M tokens |
|---|---|---|---|---|
| Easy | Haiku | GPT-5-mini | Gemini Flash | $0.25-1 |
| Medium | Sonnet | GPT-5-low | Gemini Pro | $3-5 |
| Hard | Opus | GPT-5-high | Gemini Ultra | $15 |
| Needs Info | Return to user | Return to user | Return to user | $0 |

The cost difference between tiers is 15-60x. Haiku at $0.25/M tokens versus Opus at $15/M tokens is a 60x difference. Even Sonnet at $3/M versus Opus at $15/M is 5x. These multipliers are what make routing profitable even with imperfect classification. A router that correctly classifies 80% of easy prompts still saves substantially, because each correct classification saves 15-60x on that request.
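A back-of-envelope check of that claim, using the Haiku and Opus prices from the table above: even if the router catches only 80% of easy prompts and the misses fall through to Opus, the blended price for easy traffic stays far below the Opus price.

```typescript
// Prices from the tier table in this article, in $ per million tokens.
const HAIKU = 0.25
const OPUS = 15

// Blended price for easy traffic: correctly classified prompts pay Haiku,
// misclassified ones fall through to Opus.
function blendedEasyPrice(routerAccuracy: number): number {
  return routerAccuracy * HAIKU + (1 - routerAccuracy) * OPUS
}

const price = blendedEasyPrice(0.8)   // 0.8*0.25 + 0.2*15 = $3.20/M
const saving = 1 - price / OPUS       // ~0.79, i.e. still ~79% cheaper than all-Opus
```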

Provider-specific routing

Morph's router supports provider-specific model selection. morph.routers.anthropic.selectModel() returns Anthropic model names. morph.routers.openai.selectModel() returns OpenAI model names. The classification logic is the same. Only the output mapping differs.
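The output mapping described above can be pictured as a lookup table per provider. This is an illustrative sketch, not Morph's internals; the model names follow the tier table in this article.

```typescript
type Difficulty = 'easy' | 'medium' | 'hard'
type Provider = 'anthropic' | 'openai'

// One shared classification, one name table per provider (illustrative names
// taken from this article's tier table).
const MODEL_MAP: Record<Provider, Record<Difficulty, string>> = {
  anthropic: { easy: 'claude-haiku-4', medium: 'claude-sonnet-4', hard: 'claude-opus-4' },
  openai:    { easy: 'gpt-5-mini',     medium: 'gpt-5-low',       hard: 'gpt-5-high' },
}

function toModelName(provider: Provider, difficulty: Difficulty): string {
  return MODEL_MAP[provider][difficulty]
}
```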

Cost Savings Math

Assume a coding agent session with 100 LLM calls, each consuming an average of 4,000 tokens (input + output combined). Without routing, all 100 calls go to Opus at $15/M tokens.

| Scenario | Model | Calls | Tokens | Cost |
|---|---|---|---|---|
| No routing (all Opus) | Opus ($15/M) | 100 | 400K | $6.00 |
| Easy (60%) | Haiku ($0.25/M) | 60 | 240K | $0.06 |
| Medium (25%) | Sonnet ($3/M) | 25 | 100K | $0.30 |
| Hard (15%) | Opus ($15/M) | 15 | 60K | $0.90 |
| Routing total | Mixed | 100 | 400K | $1.26 |

$6.00 without routing versus $1.26 with routing. That is 79% savings. Add the router cost ($0.001 per classification x 100 calls = $0.10) and the total with routing is $1.36, still 77% cheaper.
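The same arithmetic as a small function, with the prices and the 60/25/15 split taken from the tables above:

```typescript
// A tier is a share of traffic at a given price ($ per million tokens).
type Tier = { share: number; pricePerM: number }

// Total cost of a session: sum of (tokens in tier) x (tier price).
function sessionCost(totalTokens: number, tiers: Tier[]): number {
  return tiers.reduce((sum, t) => sum + (totalTokens * t.share * t.pricePerM) / 1e6, 0)
}

const TOKENS = 400_000  // 100 calls x 4,000 tokens each

const noRouting = sessionCost(TOKENS, [{ share: 1, pricePerM: 15 }])  // $6.00
const withRouting = sessionCost(TOKENS, [
  { share: 0.6,  pricePerM: 0.25 },  // easy -> Haiku
  { share: 0.25, pricePerM: 3 },     // medium -> Sonnet
  { share: 0.15, pricePerM: 15 },    // hard -> Opus
])                                    // $1.26
```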

The savings scale linearly with usage. A team running 10,000 sessions per month saves $47,400/month in model costs in this scenario. The router cost for those 1,000,000 classifications is $1,000. The net saving is $46,400/month.

Conservative estimates (50% easy, 20% medium, 30% hard) still yield 40-50% savings. The savings only disappear if your prompt distribution is overwhelmingly hard tasks, which is rare in practice. Even research-heavy coding sessions have significant boilerplate components.

60x: cost difference between Haiku and Opus
77%: savings in a typical coding session
$0.10: router cost per 100 calls
<2%: quality loss on hard tasks

Classification Approaches

Not all routing implementations are equal. The accuracy and speed of the classifier determines whether routing saves money or degrades quality.

| Method | Latency | Accuracy | Tradeoff |
|---|---|---|---|
| Keyword/regex matching | <1ms | Low (50-65%) | Fast but misclassifies semantic difficulty. 'refactor' is not always hard. |
| Prompt length heuristics | <1ms | Low (45-60%) | Long prompts can be easy (boilerplate). Short prompts can be hard. |
| Small classifier model | 50-200ms | Medium (70-80%) | Better accuracy but adds latency. General-purpose, not domain-tuned. |
| Trained domain classifier | 200-500ms | High (85-95%) | Best accuracy. Requires domain-specific training data. Morph's router uses this approach. |
| LLM-as-judge | 1-5s | High (85-90%) | Uses a cheap LLM to classify. Accurate but slow. The classification latency may exceed the savings. |

The keyword approach fails on semantic difficulty. "Add error handling" might be easy (wrapping a function in try-catch) or hard (designing a retry strategy with exponential backoff, circuit breakers, and graceful degradation). No keyword can distinguish these without understanding the context.

The LLM-as-judge approach is accurate but slow. Using GPT-5-mini to classify before routing to GPT-5-high adds 1-5 seconds per request. If the average request takes 3 seconds, you have doubled the latency for a subset of requests. The cost of the classification call itself also reduces net savings.

Trained domain classifiers hit the practical sweet spot. Morph's classifier processes the prompt in ~430ms, which hides behind parallel request preparation. It is trained on millions of real coding prompts, so it understands that "add a TODO" is easy while "refactor the authentication flow to support SSO" is hard, regardless of prompt length or keywords.

Routing Modes

Different use cases require different cost-quality tradeoffs. A developer writing production code wants quality. A CI pipeline running automated test generation wants cost efficiency. Two routing modes cover this spectrum.

Balanced (default)

Optimizes for cost and quality together. Routes clearly easy prompts to cheap models but keeps borderline cases on the expensive model. Under 2% quality degradation on hard tasks. Best for interactive coding where quality matters.

Aggressive

Maximum cost savings. Routes borderline prompts to cheaper models as well. 50-70% cost savings but with slightly higher quality variance on medium-difficulty tasks. Best for batch processing, CI pipelines, and automated tasks where cost dominates.

Selecting a routing mode

import { morph } from 'morph'

// Balanced: quality-first, still saves 40-60%
const balanced = await morph.routers.anthropic.selectModel({
  input: userQuery,
  mode: 'balanced'
})

// Aggressive: cost-first, saves 50-70%
const aggressive = await morph.routers.anthropic.selectModel({
  input: userQuery,
  mode: 'aggressive'
})

In balanced mode, a prompt that is 55% likely to be easy and 45% likely to be medium stays on the medium-tier model. The potential savings from downgrading are not worth the quality risk. In aggressive mode, the same prompt routes to the easy-tier model because the expected savings outweigh the expected quality loss.
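One way to picture that behavior: the mode sets the confidence threshold a classifier's probability estimate must clear before downgrading. This is an illustrative sketch, not Morph's actual internals, and the 0.7 threshold is an assumption for illustration.

```typescript
type Mode = 'balanced' | 'aggressive'

// Given the classifier's probability that a prompt is easy, pick a tier.
// Balanced demands high confidence before downgrading; aggressive downgrades
// whenever easy is the more likely label. Thresholds are illustrative.
function pickTier(pEasy: number, mode: Mode): 'easy' | 'medium' {
  const threshold = mode === 'balanced' ? 0.7 : 0.5
  return pEasy >= threshold ? 'easy' : 'medium'
}

pickTier(0.55, 'balanced')    // 'medium': the borderline prompt stays on the safer tier
pickTier(0.55, 'aggressive')  // 'easy': expected savings outweigh the quality risk
```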

Implementation

Adding routing to an existing application takes minimal code changes. The router is a single API call that returns a model name. You replace your hardcoded model string with the router's recommendation.

Full implementation with Morph SDK

import { morph } from 'morph'
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic()

async function chat(userQuery: string, conversationHistory: Message[]) {
  // Classify the prompt and select the best model
  const { model } = await morph.routers.anthropic.selectModel({
    input: userQuery,
    mode: 'balanced'
  })

  // Use the selected model for the actual request
  const response = await anthropic.messages.create({
    model,  // e.g., "claude-haiku-4", "claude-sonnet-4", or "claude-opus-4"
    max_tokens: 4096,
    messages: [
      ...conversationHistory,
      { role: 'user', content: userQuery }
    ]
  })

  return response
}

// Example outputs:
// "Add a TODO comment" → model: "claude-haiku-4"
// "Refactor this module to use dependency injection" → model: "claude-sonnet-4"
// "Debug this race condition in the distributed lock" → model: "claude-opus-4"
// "fix it" → model: null, difficulty: "needs_info"

For applications that already use a model configuration variable, the change is a one-line replacement. Instead of const model = "claude-opus-4", you use const { model } = await morph.routers.anthropic.selectModel({ input }). Everything downstream (streaming, tool calls, response parsing) works identically because the Anthropic SDK handles model differences internally.

Parallel classification pattern

To avoid adding latency, run the classification in parallel with your request preparation:

Zero-latency routing with parallel classification

async function chatWithParallelRouting(userQuery: string) {
  // Run classification and context preparation in parallel
  const [routerResult, context] = await Promise.all([
    morph.routers.anthropic.selectModel({
      input: userQuery,
      mode: 'balanced'
    }),
    fetchRelevantContext(userQuery),  // File reads, search, etc.
  ])

  // Both complete. Build and send the request.
  const response = await anthropic.messages.create({
    model: routerResult.model,
    messages: buildMessages(userQuery, context),
    max_tokens: 4096,
  })

  return response
}

When Not to Route

Routing is not always the right choice. Some scenarios benefit from a fixed model.

All prompts are hard

If your application exclusively handles complex reasoning tasks (theorem proving, novel algorithm design), routing will classify everything as hard and add overhead without savings.

Latency is critical

For real-time applications where every millisecond matters (autocomplete, inline suggestions), the 430ms classification overhead may not be acceptable even with parallel execution.

Low volume

If you make fewer than 1,000 LLM calls per month, the absolute savings from routing may not justify the integration complexity. The percentage savings are the same, but $50/month saved may not matter.

For most applications, routing is worth it. The threshold is roughly: if more than 30% of your prompts are easy or medium difficulty, and you make more than a few thousand LLM calls per month, routing pays for itself within the first day.
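That rule of thumb, written as a predicate; the cutoffs are the article's rough numbers ("a few thousand" is taken as 3,000 here), not exact thresholds.

```typescript
// Rough break-even check: routing tends to pay off when a meaningful share
// of prompts is easy or medium and volume is high enough to matter.
// Cutoffs are assumptions matching the rule of thumb above.
function routingLikelyWorthIt(easyOrMediumShare: number, callsPerMonth: number): boolean {
  return easyOrMediumShare > 0.3 && callsPerMonth > 3_000
}
```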

Frequently Asked Questions

What is an LLM router?

An LLM router classifies the difficulty of each prompt and routes it to the appropriate model tier. Easy prompts go to cheap models like Haiku at $0.25/M tokens. Hard prompts go to expensive models like Opus at $15/M tokens. The classification runs in ~430ms and costs $0.001 per request.

How much does LLM routing save on API costs?

40-70% depending on your prompt distribution. In a typical coding session, roughly 60% of prompts are easy and 25% are medium. Routing these to cheaper models while reserving the expensive model for the 15% of hard prompts saves significantly with under 2% quality loss on hard tasks.

Does routing add latency to LLM requests?

Morph's router classifies prompts in ~430ms average. Run the classification in parallel with request preparation (context fetching, system prompt assembly) and the effective added latency is near zero. The classification finishes before the request is ready to send.

What classification categories does an LLM router use?

Four tiers: easy (boilerplate, simple edits, documentation), medium (multi-file changes, moderate logic), hard (architectural decisions, complex debugging, large refactors), and needs_info (ambiguous prompts that should be returned to the user for clarification).

How accurate is prompt difficulty classification?

Morph's classifier is trained on millions of coding prompts. Balanced mode delivers under 2% quality degradation on hard tasks. Aggressive mode trades slightly more quality variance for higher savings. Accuracy depends on having domain-specific training data, which is why general-purpose classifiers underperform.

Can I use an LLM router with any model provider?

Yes. The router returns a model recommendation (e.g., "claude-haiku-4" or "claude-opus-4"). You pass that model name to whichever provider SDK you use. The router does not proxy the actual LLM request. It only classifies and recommends.

What is the difference between balanced and aggressive routing modes?

Balanced mode (default) keeps borderline cases on the expensive model to protect quality. Aggressive mode routes borderline prompts to cheaper models for maximum savings. Use balanced for interactive coding where quality matters. Use aggressive for batch processing, CI, and automated tasks.


Stop Paying Frontier Prices for Simple Tasks

Morph's LLM router classifies prompt difficulty in ~430ms and routes to the right model tier. $0.001 per request. Trained on millions of coding prompts. Four difficulty tiers: easy, medium, hard, needs_info. Balanced and aggressive modes. 40-70% API cost savings.