LLM Cost Optimization: 5 Levers That Cut API Spend 70-85%

A practical guide to reducing LLM API costs without sacrificing quality. Covers the five main levers: model routing (40-70% savings), context compaction (50-70% token reduction), prompt optimization, caching (90% on cache hits), and batching (50% discount). Includes code examples, cost math, and a combined-savings breakdown.

March 31, 2026 · 2 min read

LLM API costs scale linearly with usage. A coding agent making 200 API calls per session on Claude Opus 4.6 ($5/M input, $25/M output) racks up $7+ per session before anyone checks the bill. Most teams discover their monthly spend is 3-5x what they budgeted once they move past prototyping. Five optimization levers, applied together, cut that spend by 70-85% without changing what the agent produces.

70-85%
Total cost reduction with all five levers combined
40-70%
Savings from model routing alone
50-70%
Token reduction from context compaction
90%
Cache hit discount (Anthropic)

The Cost Problem

LLM API pricing follows a simple formula: (input tokens x input price) + (output tokens x output price). But the way agents use LLMs makes costs compound in ways that are not obvious from the pricing page.
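The formula in one line, as a sketch (prices here are the Opus 4.6 numbers quoted below):

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of a single API call, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# One 20K-input, 500-output call on Opus 4.6 ($5/M in, $25/M out):
# call_cost(20_000, 500, 5.00, 25.00) -> 0.1125
```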

A coding agent does not make one API call. It makes dozens per task. Each call re-sends the full conversation history. A 20-turn agent session with growing context means the early messages are paid for 20 times. A 2,000-token system prompt sent with every call across a 200-call session costs 400,000 input tokens just for the repeated prompt.
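The compounding is easy to simulate. This sketch (illustrative helper, not a Morph API) tallies input tokens when every call re-sends the system prompt plus the full history:

```python
def total_input_tokens(turns: int, tokens_per_turn: int, system_prompt: int = 0) -> int:
    """Total input tokens across a session where each call re-sends
    the system prompt plus the entire conversation so far."""
    total, history = 0, 0
    for _ in range(turns):
        history += tokens_per_turn        # this turn's new content
        total += system_prompt + history  # the whole prefix is billed again
    return total

print(total_input_tokens(200, 0, system_prompt=2_000))  # 400000: the repeated prompt alone
print(total_input_tokens(20, 1_000))                    # 210000: quadratic history growth
```

Note the second result: 20 turns of 1,000 tokens each is 20K tokens of content, but 210K tokens billed, because early messages are paid for again on every later turn.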

Consider the numbers. Claude Opus 4.6 costs $5/M input tokens and $25/M output tokens. Sonnet 4.6 costs $3/M input and $15/M output. Haiku 4.5 costs $1/M input and $5/M output. A single agent session making 200 calls with an average of 20K tokens per call (growing context) on Opus 4.6 generates roughly 4M input tokens. At $5/M, that is $20 in input alone. Add output tokens (200 calls at ~500 tokens each is another $2.50) and the session costs roughly $22.50.

A 20-developer team, each member running 50 sessions per day at the ~$0.34 average session cost, spends $10,200/month, and far more if sessions run long or use frontier models. Most teams hit this number and start asking what went wrong. Nothing went wrong. LLM costs compound with conversation length, and agents have long conversations.

The compounding problem

Every unnecessary token in an agent conversation is paid for on every subsequent turn. 100 wasted tokens in turn 1 of a 30-turn session costs 3,000 tokens total. At $5/M (Opus 4.6 input pricing), that is $0.015 of pure waste per session. Across 1,000 sessions per day, it adds up to $450/month from 100 tokens of waste.

Five Levers for LLM Cost Reduction

Each lever works independently. Several stack for compound savings. The right combination depends on your workload, but most teams should start with model routing (highest ROI, lowest effort) and add compaction for context-heavy workflows.

1. Model Routing

Classify prompt difficulty, route to the right model tier. 40-70% savings. $0.001/classification, ~430ms.

2. Context Compaction

Remove redundant tokens from conversation history. 50-70% token reduction. 33,000 tok/s, zero hallucination.

3. Prompt Optimization

Trim system prompts. Use structured output. Reduce few-shot examples. Costs nothing but attention.

4. Caching

Cache repeated prefixes. 90% savings on cache hits (Anthropic). 50% on OpenAI. 60-80% hit rates for agents.

5. Batching

50% flat discount for non-urgent work. Trade latency for cost. Ideal for evals, pipelines, background jobs.

Combined

All five levers stacked: 70-85% total cost reduction. A $22.50 session drops to $2-3.50.

Lever 1: Model Routing (40-70% Savings)

Most requests to a coding agent do not need the most expensive model. Adding a comment to a function, formatting a JSON response, renaming a variable, generating a boilerplate test. These tasks run equally well on Haiku 4.5 at $1/M tokens as on Opus 4.6 at $5/M. The problem is that without a router, every request goes to whichever model the agent is configured to use, usually the most capable (and most expensive) one.

A model router sits between the application and the LLM API. It reads the prompt, classifies the difficulty, and picks the cheapest model that can handle it. Easy tasks (formatting, simple edits, boilerplate) go to Haiku 4.5 at $1/M. Medium tasks (standard code generation, refactoring) go to Sonnet 4.6 at $3/M. Hard tasks (system design, complex multi-file debugging, architectural decisions) go to Opus 4.6 at $5/M.

The economics work because 60-80% of coding agent requests are routine. If 70% of requests route to a model that costs 5x less on input, the weighted average cost per request drops 40-70%.
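A sketch of that weighted average, using the 70/20/10 split and the input prices above:

```python
def weighted_input_price(mix: dict[str, float], prices: dict[str, float]) -> float:
    """Blended $/M-token input price for a given routing mix."""
    return sum(share * prices[tier] for tier, share in mix.items())

mix    = {"easy": 0.70, "medium": 0.20, "hard": 0.10}
prices = {"easy": 1.00, "medium": 3.00, "hard": 5.00}  # Haiku / Sonnet / Opus, $/M input
print(round(weighted_input_price(mix, prices), 2))  # 1.8 -> a 64% cut vs $5/M
```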

$0.001
Per classification request (Morph Router)
~430ms
Router classification latency
60-80%
Requests classified as easy/medium
40-70%
Typical cost savings

Morph Router is trained on millions of coding prompts. It classifies each request into four categories: easy, medium, hard, and needs_info. The classification costs $0.001 per request and takes approximately 430ms. The router returns the difficulty tier, and your application maps that to a model.

Model routing with Morph Router (TypeScript)

import Morph from "morphllm";

// Minimal message shape; use the SDK's exported type if it provides one
type Message = { role: "user" | "assistant" | "system"; content: string };

const morph = new Morph({ apiKey: process.env.MORPH_API_KEY });

// Map difficulty tiers to models and their per-M-token costs
const MODEL_TIERS = {
  easy:       { model: "claude-haiku-4-5",  inputCost: 1.00 },  // $1/M
  medium:     { model: "claude-sonnet-4-6", inputCost: 3.00 },  // $3/M
  hard:       { model: "claude-opus-4-6",   inputCost: 5.00 },  // $5/M
  needs_info: { model: "claude-sonnet-4-6", inputCost: 3.00 },
} as const;

async function routedCompletion(messages: Message[]) {
  // Step 1: Classify difficulty ($0.001, ~430ms)
  const classification = await morph.router.classify({
    messages,
  });

  // Step 2: Route to the right model
  const tier = MODEL_TIERS[classification.difficulty];
  const response = await morph.chat.completions.create({
    model: tier.model,
    messages,
  });

  return response;
}

// Without routing: 200 calls x Opus 4.6 ($5/M) = $20+ input/session
// With routing (70% easy, 20% medium, 10% hard):
//   140 calls x $1/M + 40 calls x $3/M + 20 calls x $5/M
//   Weighted average: $1.80/M vs $5/M = 64% savings

The router does not change what the agent produces. Easy tasks routed to Haiku produce the same output as Opus for those tasks. The quality difference only matters on genuinely hard prompts, which still go to the frontier model. The result: same quality on the hard tasks, same quality on the easy tasks, 40-70% lower total cost.

Full documentation: Morph Router docs.

Start with routing

Of the five levers, model routing has the highest ROI with the lowest implementation effort. One API call per request, no changes to your prompts or conversation structure, no data pipeline changes. If you only implement one optimization, make it routing.

Lever 2: Context Compaction (50-70% Savings on Token Count)

Agent conversations accumulate context. Each tool call adds its output to the history. Each response adds the model's reasoning. By turn 50, the conversation might contain 150K-200K tokens. By turn 100, it might hit 300K+. Every token in that history is re-sent on every subsequent call. A 200K-token conversation costs 10x what a 20K-token conversation costs per turn.

Traditional summarization rewrites the context in fewer words. The problem: summaries lose details. File paths become "a configuration file." Error codes become "an error occurred." Specific function signatures become "a function in the auth module." The agent then asks for the same information again, spending tokens to re-acquire what the summary threw away.

Morph Compact takes a different approach: verbatim deletion. It identifies and removes low-signal tokens (redundant formatting, repeated boilerplate, verbose metadata, noise) while keeping every surviving sentence character-for-character identical to the original. No paraphrasing. No rewriting. The file paths, error codes, function signatures, and specific numbers all survive intact.

33,000
Tokens per second (Morph Compact)
50-70%
Typical token reduction
<3s
Compaction latency
0%
Hallucination rate (verbatim deletion)

The cost impact is direct. If a 200K-token conversation is compacted to 80K tokens (60% reduction), the input cost for the next API call drops by 60%. And the savings compound: the compacted conversation is what gets re-sent on every subsequent turn. Over a 200-call session, compacting early saves tokens on every remaining call.
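The compounding savings can be estimated directly (illustrative arithmetic, not a Morph API):

```python
def compaction_input_savings(tokens_before: int, tokens_after: int,
                             remaining_turns: int, price_per_m: float) -> float:
    """Input dollars saved by compacting once, over every remaining turn
    that would otherwise re-send the uncompacted history."""
    return (tokens_before - tokens_after) * remaining_turns * price_per_m / 1_000_000

# 200K -> 80K with 100 turns still to go, on Sonnet 4.6 input ($3/M):
print(compaction_input_savings(200_000, 80_000, 100, 3.00))  # 36.0
```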

Run compaction before every LLM call, not just when you hit the context window limit. Most teams wait until auto-compact fires at the capacity cliff. By then, the agent has already paid full price for 100+ turns of bloated context. Compacting proactively from the start keeps the context lean throughout the session.

Context compaction with Morph Compact (TypeScript)

import Morph from "morphllm";

// Minimal message shape; use the SDK's exported type if it provides one
type Message = { role: "user" | "assistant" | "system"; content: string };

const morph = new Morph({ apiKey: process.env.MORPH_API_KEY });

async function compactAndSend(messages: Message[]) {
  // Compact the conversation history
  const compacted = await morph.compact({
    model: "morph-compact-v1",
    messages,
    // Optional: tell Compact what to prioritize
    system: "Preserve file paths, error codes, function signatures, and specific numbers.",
  });

  // Send the compacted conversation to the model
  // (assumes Compact returns the reduced message array; check the docs for the exact shape)
  const response = await morph.chat.completions.create({
    model: "claude-sonnet-4-6",
    messages: compacted.messages,
  });

  return response;
}

// Before compaction: 200K tokens/turn x $3/M (Sonnet) = $0.60/turn
// After compaction:  80K tokens/turn x $3/M (Sonnet)  = $0.24/turn
// Over 200 turns: $120 vs $48 = 60% savings on input costs

Full documentation: Morph Compact docs.

Compaction is not summarization

Summarization rewrites content in the model's own words. This introduces hallucination risk: invented details, changed numbers, lost specifics. Compaction performs verbatim deletion. Every sentence that survives is copied character-for-character from the original. The hallucination rate is 0%, not "low." For cost optimization, this means the cheaper downstream model receives exact quotes, exact code, and exact references. See Compaction vs Summarization for the full comparison.

Lever 3: Prompt Optimization for Cost

Shorter prompts mean fewer input tokens mean lower cost. This sounds obvious, but the savings are larger than most teams realize because of the compounding effect in agent conversations.

System prompts are the biggest target. A system prompt is sent with every API call. A 2,000-token system prompt across 200 calls in one session = 400,000 input tokens just for the repeated instructions. Trimming that to 800 tokens saves 240,000 tokens per session. At $3/M (Sonnet 4.6), that is $0.72 per session. At $5/M (Opus 4.6), that is $1.20 per session. For a team running 1,000 sessions/day, the annual difference from system prompt length alone is $260K-$438K.

Trim system prompts to essentials

Most system prompts contain instructions the model already follows by default, examples that could be one instead of five, formatting rules that could be one sentence instead of a paragraph, and personality instructions that do not affect output quality. Audit your system prompt. Remove anything the model does correctly without being told.

Use few-shot examples sparingly

Each few-shot example adds its full token count to every request. Three examples at 500 tokens each = 1,500 tokens per call. If two examples produce the same quality as three, cut the third. If zero-shot works for 80% of your requests, conditionally include examples only when the request type needs them.

Request structured output

Free-form text responses use more output tokens than structured JSON. A model describing a code change in prose might use 200 tokens. The same information as a JSON object with file, line, old, new fields might use 80 tokens. Output tokens cost 3-5x more than input tokens, so reducing output length has an outsized impact on cost.

System prompt: bloated vs lean

// BLOATED: 2,100 tokens (sent with every API call)
const SYSTEM_PROMPT_BLOATED = `
You are an expert software engineer with deep knowledge
of TypeScript, React, Node.js, Python, and many other
programming languages and frameworks. You follow best
practices and write clean, maintainable, well-documented
code. When writing code, always include comprehensive
error handling, input validation, and appropriate logging.
Consider edge cases, performance implications, and
security concerns. Use meaningful variable names...
[... 1,800 more tokens of instructions ...]
`;

// LEAN: 400 tokens (same output quality for 95% of tasks)
const SYSTEM_PROMPT_LEAN = `
Senior engineer. TypeScript/React/Python.
Write minimal, correct code. Handle errors.
Respond with code blocks only, no explanation unless asked.
Output format: { file, changes: [{ line, old, new }] }
`;

// Savings per 200-call session:
// Bloated: 2100 * 200 = 420K input tokens
// Lean:    400 * 200  = 80K input tokens
// Delta: 340K tokens/session
// At $5/M (Opus 4.6): $1.70 saved per session
// At $3/M (Sonnet 4.6): $1.02 saved per session

Audit system prompt tokens

Measure your system prompt length in tokens. Multiply by calls per session. That is the recurring cost. A 2,000-token prompt across 200 calls = 400K input tokens per session.

Conditional few-shot

Include examples only when the request type needs them. Simple requests (formatting, renaming) rarely benefit from examples. Complex requests (code generation patterns) do.

Lever 4: Caching (60-90% Savings on Repeated Content)

Prompt caching stores the processed representation of repeated content so the model does not recompute it on every call. The savings are immediate and require zero changes to your prompt quality.

Anthropic prompt caching

Anthropic caches static content (system prompts, documentation, few-shot examples) and charges 90% less on cache reads. Cached tokens on Sonnet 4.6 cost $0.30/M instead of $3.00/M. On Opus 4.6, cached tokens cost $0.50/M instead of $5/M. The first call pays a 1.25x write premium. Every subsequent call with the same prefix reads from the cache at the 90% discounted rate.

A 2,000-token system prompt sent 200 times per session: the first call pays the cache write premium (1.25x base). The remaining 199 calls cost 2,000 tokens at the cache read rate (0.1x base). On Opus 4.6: without caching, 400,000 tokens at $5/M = $2.00. With caching: 2,000 at $6.25/M (write) + 398,000 at $0.50/M (read) = $0.013 + $0.199 = $0.21. That is a 90% reduction on the system prompt portion of your bill.
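That arithmetic generalizes. A sketch using Anthropic's published multipliers (1.25x write, 0.1x read):

```python
def cached_prefix_cost(prefix_tokens: int, calls: int, base_price_per_m: float,
                       write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Input cost of a cached prefix: one cache write, then cache reads."""
    write = prefix_tokens * base_price_per_m * write_mult / 1_000_000
    reads = prefix_tokens * (calls - 1) * base_price_per_m * read_mult / 1_000_000
    return write + reads

# 2,000-token prompt, 200 calls, Opus 4.6 input ($5/M):
# cached_prefix_cost(2_000, 200, 5.00) -> ~0.21, vs $2.00 uncached
```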

Prompt caching with Anthropic (Python)

from anthropic import Anthropic

client = Anthropic()

# Mark static content for caching with cache_control
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,      # e.g. 2,000 tokens
            "cache_control": {"type": "ephemeral"}
        },
        {
            "type": "text",
            "text": REFERENCE_DOCS,     # e.g. 10,000 tokens
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)

# First call: 12K tokens at cache write price ($3.75/M on Sonnet 4.6) = $0.045
# Next 199 calls: 12K tokens at cache read price ($0.30/M) = $0.0036 each
# Without caching: 200 x 12K x $3/M = $7.20
# With caching:    $0.045 + 199 x $0.0036 = $0.76
# Savings: ~89% on the cached prefix

OpenAI cached responses

OpenAI automatically caches prompt prefixes and offers a 50% discount on cache hits. The caching is implicit: you do not need to mark content for caching. If consecutive requests share the same prefix (minimum 1,024 tokens), OpenAI detects it and applies the discount. The savings are lower than Anthropic's 90%, but the implementation requires zero code changes.

Application-level caching

For repeated identical queries (same input, same expected output), cache the response at the application level. A Redis or in-memory cache that stores responses keyed by a hash of the prompt eliminates the API call entirely. This works for deterministic tasks: classification, extraction, formatting. It does not work for creative or context-dependent tasks where the same input should produce different outputs.
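A minimal in-memory version of that pattern (swap the dict for Redis in production; `call_api` is a placeholder for your real client call):

```python
import hashlib
import json

_cache: dict[str, str] = {}  # swap for Redis in production

def cached_completion(model: str, messages: list[dict], call_api) -> str:
    """Return the cached response for an identical (model, messages) pair,
    hitting the API only on a miss. Deterministic tasks only."""
    key = hashlib.sha256(
        json.dumps([model, messages], sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_api(model, messages)
    return _cache[key]
```

The key hashes the full request, so any change to the prompt or model misses the cache; for context-dependent or creative tasks, skip this layer entirely.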

Cache hit rates of 60-80% are common for agents with stable system prompts and repeated tool-call patterns. A code search agent that frequently searches the same files, a QA bot answering the same questions, a formatting tool processing similar documents. All benefit from caching.

Caching stacks with other levers

Compress first, then cache the compressed version. The cache stores fewer tokens and serves faster. Route first, then cache per-model. Easy requests cached against Haiku cost less to store and retrieve than easy requests cached against Opus. Each lever makes the others more effective.

Lever 5: Batching (50% Flat Discount)

Both Anthropic and OpenAI offer batch APIs that process requests asynchronously at a 50% discount. The tradeoff: results arrive within 24 hours instead of in real-time. For any workload where the user is not waiting for a response, this is a free 50% cost reduction.

When batching makes sense

Evaluation pipelines. Test suite generation. Data labeling. Content backfills for a CMS. Nightly report generation. Bulk code migration. Any task where latency tolerance exceeds 1 hour. These workloads often represent 20-40% of total LLM spend and are trivially eligible for the batch discount.

Anthropic Batch API

Submit a batch of messages requests. Anthropic processes them within 24 hours. Each request in the batch gets a 50% discount on both input and output tokens. The batch also benefits from prompt caching: if requests share prefixes, the cached prefix gets both the 50% batch discount and the 90% cache discount. Combined savings on the cached prefix: 95%.

OpenAI Batch API

Similar structure. Submit a JSONL file of requests. OpenAI processes them with a 24-hour SLA. 50% discount on all tokens. Supports all GPT models. Like Anthropic, batch requests that hit the cache get compound discounts.

Batch API usage (Anthropic Python SDK)

from anthropic import Anthropic

client = Anthropic()

# Submit a batch of requests at 50% discount
# (evaluation_prompts: a list[str] of non-urgent prompts, defined elsewhere)
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"eval-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}]
            }
        }
        for i, prompt in enumerate(evaluation_prompts)
    ]
)

# Poll status, then stream results (delivered within 24 hours)
status = client.messages.batches.retrieve(batch.id)
if status.processing_status == "ended":
    for entry in client.messages.batches.results(batch.id):
        print(entry.custom_id, entry.result)

# Pricing (Sonnet 4.6):
# Real-time: $3.00/M input, $15.00/M output
# Batch:     $1.50/M input, $7.50/M output (50% off)
# Batch + cache read: $0.15/M input (50% batch + 90% cache = 95% combined)

50% flat discount

Both Anthropic and OpenAI offer 50% off all tokens for batch requests. No quality difference. Same models, same outputs.

Stacks with caching

Batch requests with cached prefixes get compound discounts. Anthropic: 50% batch + 90% cache = 95% savings on repeated content.

24-hour SLA

Results delivered within 24 hours. Most batches complete in 1-4 hours. Plan your pipeline around the SLA, not the typical time.

Combining All Five Levers: The Math

Each lever works independently, but the real savings come from stacking them. The following scenario uses concrete numbers from a coding agent workflow.

Baseline: no optimization

200 calls per session. All requests routed to Opus 4.6 ($5/M input, $25/M output). Average 20K input tokens per call (growing conversation). Average 500 output tokens per call. System prompt: 2,000 tokens. No compaction, no caching, no batching.

Input cost: 200 calls x 20K avg tokens x $5/M = $20. Output cost: 200 calls x 500 tokens x $25/M = $2.50. Total per session: $22.50.

After model routing: 40-70% savings

70% of calls route to Haiku 4.5 ($1/M input), 20% to Sonnet 4.6 ($3/M), 10% to Opus 4.6 ($5/M). Weighted input price: $1.80/M (vs $5/M baseline). Input cost drops from $20 to $7.20. Output follows the same split (weighted $9/M vs $25/M), dropping from $2.50 to $0.90. New total: $7-9 per session.

After context compaction: additional 50-70% on token volume

Compacting the conversation before each call reduces the average 20K tokens per call to 8K-10K (a 50-60% reduction on the growing context portion). Input cost drops proportionally: from the post-routing $7.20 to $2.88-3.60. New total: $3.50-5 per session.

After caching: additional 90% on cached prefix

The system prompt and any stable reference docs (2,000-10,000 tokens) are cached. Cache reads cost 0.1x base input price (90% off). For a 2,000-token cached prefix across 200 calls using the routed model mix, caching saves an additional $0.50-1.50 depending on cache hit rate and model distribution.

After prompt optimization: 10-20% additional savings

Trimming the system prompt from 2,000 to 800 tokens reduces per-call overhead. Requesting structured output reduces output token count by 30-50%. Together, these save another 10-20% on top of the other levers.

Optimization                                  Per-Session Cost    Cumulative Savings
Baseline (all Opus 4.6, no optimization)      $22.50              0%
After model routing                           $7-9                60-69%
After routing + compaction                    $3.50-5             78-84%
After routing + compaction + caching          $2.50-4             82-89%
After all five levers                         $2-3.50             84-91%

The exact savings depend on the specific model mix, conversation length, cache hit rate, and how much content is batchable. Conservative estimates (shorter sessions, lower cache hit rates, less aggressive routing) land at 70-85% total reduction. Optimistic estimates (long sessions, high cache hit rates, aggressive routing with the Haiku/Sonnet split maximized) reach 90%+.
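The lever-by-lever walkthrough reduces to a few lines of arithmetic. This sketch uses the scenario's assumptions: a 70/20/10 routing split (blended 0.7×$1 + 0.2×$3 + 0.1×$5 = $1.80/M input; the same split over output prices gives $9/M) and a 60% compaction ratio:

```python
def session_cost(calls: int, avg_input: int, avg_output: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Per-session dollars for a uniform stream of calls."""
    return calls * (avg_input * input_price_per_m
                    + avg_output * output_price_per_m) / 1_000_000

baseline  = session_cost(200, 20_000, 500, 5.00, 25.00)  # all Opus 4.6
routed    = session_cost(200, 20_000, 500, 1.80, 9.00)   # 70/20/10 blended prices
compacted = session_cost(200, 8_000, 500, 1.80, 9.00)    # 60% context reduction

print(round(baseline, 2), round(routed, 2), round(compacted, 2))  # 22.5 8.1 3.78
```

Caching and prompt optimization then shave the remaining prefix and output overhead, which is where the table's final $2-3.50 row comes from.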

Where to start

Start with model routing. It requires one API call per request, no changes to your prompts, and delivers 40-70% savings immediately. Add context compaction second if you run agent workloads or process long documents. Add caching third (it may already be on if you use Anthropic). Prompt optimization and batching fill in the remaining gaps. Router docs | Compact docs | Get started.

Frequently Asked Questions

How much can model routing reduce LLM API costs?

Model routing typically saves 40-70%. A router classifies each prompt by difficulty and sends easy tasks to cheap models like Haiku 4.5 at $1/M while reserving Opus 4.6 at $5/M for genuinely hard tasks. Since 60-80% of coding agent requests are routine, the weighted average cost drops dramatically. Morph Router adds $0.001 per classification with ~430ms latency.

What is context compaction and how does it reduce LLM costs?

Context compaction removes redundant tokens from conversation history before sending it to the LLM. Unlike summarization, compaction works by verbatim deletion: it removes noise while preserving every surviving sentence character-for-character. Morph Compact achieves 50-70% token reduction at 33,000 tok/s with zero hallucination. For a 200K-token conversation, compaction to 80K tokens saves 60% on input costs for that turn and every subsequent turn.

How does prompt caching save money on LLM API calls?

Prompt caching stores the processed representation of repeated content so the model does not recompute it. Anthropic charges 90% less on cache reads: $0.30/M instead of $3.00/M on Sonnet 4.6, $0.50/M instead of $5/M on Opus 4.6. The first call pays a 1.25x write premium, then subsequent calls hit the cache at 0.1x base input price. Cache hit rates of 60-80% are common for agents with stable system prompts.

What is the difference between compaction and summarization?

Summarization rewrites content, which introduces hallucination risk: invented details, changed numbers, lost specifics. Compaction removes tokens via verbatim deletion. Every surviving sentence is character-identical to the original. Morph Compact achieves 50-70% reduction with zero hallucination. See Compaction vs Summarization for the technical comparison.

Should I use batch APIs to reduce LLM costs?

Yes, for any workload that tolerates latency. Both Anthropic and OpenAI offer 50% discounts on batch requests with a 24-hour SLA. Batch APIs stack with caching: batch + cached prefix = 95% savings on repeated content. Good candidates: evaluations, data pipelines, content generation, test suites, background processing.

What is the best first step to reduce LLM API costs?

Start with model routing. It delivers the highest ROI with the lowest implementation effort: one API call per request, no changes to prompts or conversation structure. Morph Router classifies prompt difficulty at $0.001/request in ~430ms. Add Compact second for agent and long-context workloads where token volume is the primary cost driver.

How much does a typical AI coding agent session cost?

A typical Claude Code session costs ~$0.34 (45K input, 13K output, 38K cache read tokens). Heavy users running 50+ sessions/day face $5,600+/month. A 20-developer team at that rate pays $10,200/month. Long sessions with frontier models cost more: a 200-call session on Opus 4.6 ($5/M input, $25/M output) with growing context can cost $20+ without optimization. With all five levers applied, the same session costs $2-3.50.

Related Resources

Cut Your LLM API Costs by 70-85%

Start with Morph Router: classify prompt difficulty at $0.001/request, route to the right model tier, save 40-70% immediately. Add Morph Compact for context-heavy workloads: 50-70% token reduction, 33,000 tok/s, zero hallucination. Both require one API call to integrate.