The average LLM API call wastes 40-60% of input tokens on context the model doesn't need. Stale conversation history, boilerplate system prompts, full-file includes when only three functions matter. You pay for every wasted token twice: once in your API bill, again in latency while the model attends over padding.

The Cost Problem
LLM API prices dropped roughly 80% from 2025 to 2026. GPT-4-level performance costs $0.40 per million tokens now, down from $30/M in March 2023. But inference volume is growing faster than prices are falling. Agentic workflows that make 50-200 LLM calls per task turn a cheap per-token price into an expensive per-task cost.
The problem compounds in three ways:
Context bloat
Agents accumulate context over multi-turn sessions. By turn 30, input tokens per call can be 5-10x what they were at turn 1. Most of those tokens are stale.
Redundant computation
Without caching, the model recomputes attention over the same system prompt and conversation prefix on every call. For a 10K-token prefix, that is billions of wasted FLOPs per request.
Underutilized hardware
Default serving configurations leave GPUs idle between requests. Without continuous batching, a single H100 at $3/hr may process only 50 tok/s instead of 16,000+.
Optimization is not about squeezing a few percentage points. It is about removing the 3-10x overhead that default configurations impose. The techniques below address each source of waste at the layer where it originates.
Model-Level Optimizations
Model-level techniques reduce the computational cost per parameter. They modify the model itself, before it ever sees a request.
Quantization
Quantization reduces weight precision from FP16 to INT8, INT4, or lower. The tradeoff: lower precision means a smaller memory footprint and faster matrix multiplications, at the cost of a small accuracy degradation.
SmoothQuant migrates quantization difficulty from activations to weights, achieving 2x memory reduction with negligible accuracy loss. GPTQ and AWQ use calibration data to find optimal per-layer quantization parameters. Google's TurboQuant (March 2026) compresses the KV cache itself to 3 bits per value with zero measured accuracy loss, cutting KV cache memory by 6x.
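To make the mechanics concrete, here is a minimal sketch of symmetric per-tensor INT8 round-to-nearest quantization in NumPy. This is the naive baseline, not SmoothQuant or GPTQ, which layer calibration on top of this idea; the 4x figure in the comment is relative to FP32 (2x relative to FP16).

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)   # stand-in weight matrix
q, scale = quantize_int8(w)

error = np.abs(dequantize(q, scale) - w).mean()
print(f"mean abs error {error:.4f}; {w.nbytes} -> {q.nbytes} bytes (4x smaller)")
```

The per-tensor scale is what calibration-based methods improve on: GPTQ and AWQ pick finer-grained scales per layer or channel so outlier weights don't inflate the quantization step for everything else.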
Pruning
Pruning removes redundant parameters from the model. Structured pruning removes entire attention heads or MLP columns; unstructured pruning zeros out individual weights. A pruned 6B-parameter model runs 30% faster than its dense counterpart and scores 72.5 on MMLU, beating the unpruned 4B model at 70.0.
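A minimal illustration of the unstructured case, using plain magnitude pruning (the simplest scoring rule; production methods refine the scores with calibration data and retraining):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Unstructured magnitude pruning: zero out the smallest-magnitude weights."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512))          # stand-in weight matrix
pruned = magnitude_prune(w, sparsity=0.5)
print(f"{(pruned == 0).mean():.0%} of weights zeroed")
```

Note that unstructured sparsity only pays off with kernels that exploit it; structured pruning (removing whole heads or columns) shrinks the dense matrices themselves, which is why it translates more directly into wall-clock speedup.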
Knowledge Distillation
Distillation trains a smaller "student" model to match a larger "teacher" model's output distribution. The student runs at a fraction of the cost. The optimal compression pipeline is P-KD-Q: prune first, distill second, quantize last. Each step compounds.
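The core training signal is a KL divergence between temperature-softened teacher and student distributions. A NumPy sketch of that loss, with the standard T² rescaling (the example logits are made up):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    scaled by T^2 so gradient magnitude stays comparable across T."""
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)
    return (T * T) * (p * (np.log(p) - np.log(q))).sum(axis=-1).mean()

teacher_logits = np.array([[4.0, 1.0, 0.5]])
close_student  = np.array([[4.1, 0.9, 0.6]])   # roughly matches the teacher
wrong_student  = np.array([[0.5, 4.0, 1.0]])   # prefers the wrong token
print(distill_loss(close_student, teacher_logits),
      distill_loss(wrong_student, teacher_logits))
```

The temperature is the point: soft targets expose the teacher's full ranking over tokens, which carries far more signal per example than a one-hot label.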
When to use each
Quantization gives the best cost/effort ratio for API providers and self-hosted deployments. Pruning and distillation require training compute but produce permanently cheaper models. If you consume LLMs via API, these are handled by your provider. If you self-host, start with quantization (zero training cost), then evaluate pruning and distillation for your specific workload.
System-Level Optimizations
System-level techniques maximize hardware utilization without changing the model. They operate in the serving layer between your model and the network.
Continuous Batching
Static batching waits for all requests in a batch to finish before accepting new ones. Short requests sit idle while long ones generate. Continuous batching dynamically inserts new requests as old ones complete, keeping the GPU saturated.
The throughput difference is significant: 3-10x higher on the same hardware. Anyscale measured a 23x improvement in aggregate throughput with continuous batching enabled on production workloads.
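A toy simulation shows where the gap comes from. The request lengths and batch size below are invented, but the two scheduling policies mirror the real ones:

```python
def static_batch_steps(lengths, batch_size):
    """Static batching: the batch occupies the GPU until its LONGEST request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a finished request's slot is refilled immediately."""
    pending, active, steps = list(lengths), [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop())
        steps += 1                                 # one decode step for all active requests
        active = [n - 1 for n in active if n > 1]  # drop requests that just finished
    return steps

lengths = [10, 200, 15, 180, 12, 190, 20, 170]     # mixed short and long requests
print(static_batch_steps(lengths, 4), continuous_batch_steps(lengths, 4))
```

With this mix, static batching burns most of its steps decoding for slots whose short requests finished long ago; continuous batching keeps all four slots busy until the work runs out.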
PagedAttention and KV Cache Management
The KV cache stores computed attention keys and values so the model doesn't recompute them on each token. The problem: pre-allocating KV cache memory for the maximum sequence length wastes up to 90% of GPU memory, because most requests don't use the full context window.
PagedAttention (vLLM) splits the KV cache into small, reusable pages allocated on demand. This cuts memory waste by up to 90% and enables up to 24x higher serving throughput because more requests fit in memory simultaneously.
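The allocation arithmetic can be sketched directly. The page size below matches vLLM's default block size of 16 tokens; the context window and sequence lengths are illustrative:

```python
PAGE_TOKENS = 16      # tokens per KV-cache page (vLLM's default block size)
MAX_CONTEXT = 4096

def preallocated_pages(seq_lens):
    """Naive serving reserves the full context window for every request."""
    return len(seq_lens) * (MAX_CONTEXT // PAGE_TOKENS)

def paged_pages(seq_lens):
    """Paged allocation: pages on demand, at most one partial page per sequence."""
    return sum(-(-n // PAGE_TOKENS) for n in seq_lens)   # ceiling division

seq_lens = [350, 120, 2000, 75, 640]   # actual lengths of in-flight requests
naive, paged = preallocated_pages(seq_lens), paged_pages(seq_lens)
print(f"pages reserved: naive={naive}, paged={paged} "
      f"({1 - paged / naive:.0%} waste avoided)")
```

Every page the paged allocator doesn't reserve is room for another concurrent request, which is where the throughput multiplier comes from.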
ChunkKV treats semantic chunks rather than isolated tokens as compression units, preserving linguistic structure under aggressive compression. RocketKV uses a two-stage pipeline: coarse-grained KV eviction first, then fine-grained compression on the survivors.
Speculative Decoding
Autoregressive decoding generates one token at a time, leaving the GPU underutilized during each forward pass. Speculative decoding adds a small, fast draft model that proposes multiple tokens ahead. The target model verifies them in a single parallel pass. Accepted tokens are mathematically identical to what the target model would have generated alone.
2-3x typical speedup
Production benchmarks with off-the-shelf EAGLE3 draft models on general queries. The speedup is essentially free: output quality is identical.
Up to 5x optimized
Domain-specific or hardware-optimized implementations reach 5-5.5x speedup over standard autoregressive decoding.
Draft latency matters most
Recent benchmarks show little correlation between draft model accuracy and throughput. The draft model's latency is the stronger determinant of end-to-end speed.
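The propose/verify loop can be sketched with toy deterministic "models". This is the greedy variant; the sampled version adds a rejection-sampling step but preserves the same guarantee that output matches the target model alone:

```python
def speculative_step(target, draft, prefix, k=4):
    """One greedy speculative-decoding round: the draft proposes k tokens, the
    target checks them (in practice, in one parallel forward pass), and the
    longest agreed prefix is accepted plus one token from the target itself."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in proposed:
        if target(ctx) != tok:        # first disagreement: stop accepting
            break
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))      # target's own token: correction or bonus
    return accepted

# Toy deterministic "models": next token = sum of context mod 7.
target = lambda ctx: sum(ctx) % 7
draft  = lambda ctx: sum(ctx) % 7 if sum(ctx) < 30 else (sum(ctx) + 1) % 7
out = speculative_step(target, draft, prefix=[1, 2, 3])
print(out)   # up to k+1 tokens per verification pass instead of one
```

When the draft agrees, one target pass emits k+1 tokens; when it diverges early, you fall back to roughly one token per pass, which is why draft quality and draft latency trade off.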
FlashAttention
FlashAttention reorganizes the attention computation to minimize memory I/O, tiling the work and fusing softmax with the matrix multiplications. FlashAttention-3 provides the fastest custom attention kernels available and is integrated into both vLLM and SGLang.
Inference Engines Compared
Four engines dominate production LLM serving in 2026. Each takes a different optimization approach.
| Engine | Version | Throughput (H100) | Key Feature | Best For |
|---|---|---|---|---|
| SGLang | v0.4.3 | 16,200 tok/s | RadixAttention prefix caching | Prefix-heavy workloads (RAG, chat) |
| LMDeploy | Latest | 16,200 tok/s | Persistent batch scheduling | High-throughput serving |
| vLLM | v0.7.3 | 12,500 tok/s | PagedAttention, Blackwell support | Flexibility, frequent model swaps |
| TensorRT-LLM | Latest | Highest at high concurrency | Compiled CUDA kernels | Single-model, long-term production |
The 29% throughput gap between SGLang/LMDeploy and vLLM widens under prefix-heavy workloads, where SGLang's RadixAttention provides an additional advantage. TensorRT-LLM requires a compilation step but delivers the highest throughput at scale once compiled.
For most teams, the recommendation: vLLM if you swap models frequently and want the easiest path to production. SGLang if your workload has shared prefixes (chatbots, RAG, multi-turn). TensorRT-LLM if you're running one model in long-term production and throughput is the priority.
Application-Level Optimizations
Application-level techniques reduce the tokens you send before they reach the model. They are the highest-ROI optimizations for teams consuming LLMs via API, because they compound with whatever model-level and system-level work your provider has already done.
Prompt Caching
Prompt caching reuses previously computed KV tensors from attention layers. When consecutive requests share a common prefix (system prompt, conversation history), the cached portion skips the prefill phase entirely.
Anthropic, OpenAI, and Google all offer prompt caching. For contexts over 10K tokens, cached portions see 80-90% latency reduction. With Anthropic's implementation, cached input tokens don't count toward rate limits, effectively multiplying throughput by 5x at 80% cache hit rate.
Semantic Caching
Semantic caching goes further: it stores complete request-response pairs and returns cached responses for semantically similar queries. On cache hits, the LLM inference call is eliminated entirely. AWS benchmarks show 3-10x cost savings for workloads with repetitive query patterns.
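A minimal semantic cache using a toy bag-of-words embedding. Production systems use a sentence-embedding model and a vector index, but the hit/miss logic is the same:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts. Real systems use an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.entries = []          # (embedding, response) pairs
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response    # cache hit: no LLM call at all
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("what is the capital of france", "Paris")
print(cache.get("what is the capital of france ?"))   # near-duplicate: hit
print(cache.get("how do i quantize a model"))         # unrelated: miss (None)
```

The threshold is the operational knob: too loose and unrelated queries get stale answers, too tight and paraphrases miss. It needs tuning against your actual traffic.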
Context Compression
Most input tokens in agentic workflows are low-signal: old conversation turns, boilerplate headers, file contents the model already processed. Context compression removes them before inference.
Techniques like LLMLingua achieve up to 20x compression by ranking and preserving key tokens. But compression methods that rewrite content introduce a fidelity problem. Summarization-based approaches score 3.4-3.7/5 on accuracy in production evaluations because they paraphrase away file paths, error codes, and specific decisions.
Verbatim compaction takes a different approach: it deletes low-information tokens while keeping every surviving sentence character-for-character. No generated content, no reformatting. JetBrains found that summarization causes 13-15% longer agent trajectories compared to verbatim compaction, because agents re-derive information that was paraphrased away.
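The principle can be shown with a toy scorer. The heuristic below is invented purely for illustration (it is not Morph's scoring); what matters is that surviving lines are emitted byte-for-byte, never rewritten:

```python
def verbatim_compact(lines, keep_ratio=0.5):
    """Illustrative verbatim compaction: score lines, DELETE the low scorers,
    and keep every survivor character-for-character."""
    def score(line):
        s = len(line.split())
        s += 5 * sum(ch.isdigit() for ch in line)   # numbers are high-signal
        s += 10 * line.count("/")                   # file paths are high-signal
        return s

    n_keep = max(1, int(len(lines) * keep_ratio))
    keep = set(sorted(range(len(lines)), key=lambda i: score(lines[i]))[-n_keep:])
    return [lines[i] for i in range(len(lines)) if i in keep]   # original order

context = [
    "Okay, let me think about this.",
    "Error in src/parser/lexer.py line 142: unexpected token",
    "Sure, happy to help!",
    "Fix: change timeout from 30 to 300 in config/settings.yaml",
]
for kept_line in verbatim_compact(context):
    print(kept_line)
```

The conversational filler is dropped; the error line and the fix survive with their file paths and numbers intact, which is exactly what summarization tends to paraphrase away.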
Morph Compact
Morph Compact runs verbatim context compaction at 33,000 tok/s on a custom inference engine. It shrinks context 50-70% while keeping every surviving sentence word-for-word. Fast enough to run inline before every LLM call, not just at the 95% capacity cliff.
Model Routing
Not every request needs your most expensive model. Routing classification and extraction tasks to Haiku ($0.25/M input) instead of Sonnet ($3/M input) yields a 12x cost reduction with minimal quality difference for those task types. Production routing typically delivers 2-5x aggregate cost savings.
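A sketch of the routing arithmetic, with a lookup table standing in for the trained classifier and the per-million-token input prices from above used as illustrative constants:

```python
# Per-million-token input prices from the text, used as illustrative constants.
PRICES = {"small": 0.25, "large": 3.00}

def route(task_type):
    """Toy router: a lookup table standing in for a trained classifier."""
    cheap_tasks = {"classification", "extraction", "formatting"}
    return "small" if task_type in cheap_tasks else "large"

workload = [("classification", 2000), ("extraction", 5000), ("reasoning", 8000),
            ("classification", 3000), ("codegen", 6000)]   # (task, input tokens)

routed = sum(PRICES[route(task)] * toks / 1e6 for task, toks in workload)
naive  = sum(PRICES["large"] * toks / 1e6 for _, toks in workload)
print(f"routed ${routed:.4f} vs all-large ${naive:.4f} "
      f"({naive / routed:.1f}x cheaper)")
```

The aggregate savings depend entirely on the workload mix: the more of your traffic is mechanical classification and extraction, the closer you get to the 12x per-request ceiling.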
Stacking the Optimization Layers
Each layer targets a different bottleneck. They compound without overlap.
| Layer | What It Reduces | Typical Savings | Effort |
|---|---|---|---|
| Quantization (Model) | Memory per parameter | 2-4x memory, ~50% cost | Low (tooling exists) |
| Continuous Batching (System) | GPU idle time | 3-10x throughput | Low (engine config) |
| PagedAttention (System) | KV cache memory waste | Up to 24x throughput | Low (use vLLM/SGLang) |
| Speculative Decoding (System) | Decode latency | 2-5x speed | Medium (draft model selection) |
| Context Compaction (App) | Input tokens sent | 50-70% token reduction | Low (API call) |
| Prompt Caching (App) | Redundant prefill | 80-90% latency on cached | Low (API flag) |
| Model Routing (App) | Cost per request | 2-5x aggregate savings | Medium (classifier needed) |
A concrete example: a coding agent running on a quantized Llama 70B model (2x cheaper), served via vLLM with continuous batching (5x throughput), using Morph Compact to compress context before each call (60% fewer input tokens). The combined effect: roughly 80% lower cost per task compared to naive FP16 serving with full context.
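The compounding is multiplicative. Restating that example's assumed factors (batching improves throughput rather than tokens, so it is folded into the per-token serving cost here):

```python
# Assumed factors from the coding-agent example:
quantization_cost_factor = 0.5   # INT8 serving ~2x cheaper per token
compaction_token_factor  = 0.4   # 60% fewer input tokens after compaction

combined = quantization_cost_factor * compaction_token_factor
print(f"remaining cost per task: {combined:.0%} of the naive baseline")
```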
For teams using hosted APIs (OpenAI, Anthropic, Google), the model and system layers are handled by the provider. Application-layer optimizations, specifically context compression, prompt caching, and model routing, are the levers you control. They are also the highest-ROI, because they reduce the tokens entering a system that your provider has already optimized.
Measuring Optimization Impact
The wrong metric hides waste. Track these separately:
Tokens per task
Total tokens consumed to complete a unit of work (not per request). This is the metric that maps to cost. A coding agent that makes 50 requests averaging 8K tokens consumes 400K tokens per task.
Time to first token (TTFT)
Latency from request to first response byte. Dominated by prefill time. Context compression and prompt caching directly reduce TTFT.
Tokens per second (TPS)
Decode throughput. Affected by model size, quantization, batch size, and speculative decoding. Measure under realistic concurrency, not single-request benchmarks.
Cost per task
The bottom line: tokens per task multiplied by price per token. This is what you optimize. A 60% reduction in tokens per task is a 60% cost reduction, regardless of per-token pricing.
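A small helper makes the arithmetic explicit. The prices below are placeholders, not any provider's actual rates; note that output tokens dilute the savings when only input is compressed:

```python
def cost_per_task(requests, avg_in_tokens, avg_out_tokens,
                  in_price_per_m, out_price_per_m):
    """tokens per task x price per token, split by input/output pricing."""
    return (requests * avg_in_tokens * in_price_per_m
            + requests * avg_out_tokens * out_price_per_m) / 1e6

# 50 requests averaging 8K input tokens, as in the tokens-per-task example.
before = cost_per_task(50, 8000, 500, in_price_per_m=3.0, out_price_per_m=15.0)
after  = cost_per_task(50, 3200, 500, in_price_per_m=3.0, out_price_per_m=15.0)
print(f"${before:.2f} -> ${after:.2f} per task (60% fewer input tokens)")
```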
Common mistake
Optimizing tokens per second while ignoring tokens per task. A faster engine processing bloated context still costs more than a slower engine processing compressed context. Measure from the application outward, not from the GPU inward.
Frequently Asked Questions
What is LLM inference optimization?
LLM inference optimization is the set of techniques that reduce the cost, latency, and memory consumption of running large language model predictions. It spans three layers: model-level (quantization, pruning, distillation), system-level (continuous batching, PagedAttention, speculative decoding), and application-level (context compression, prompt caching). Stacking optimizations across all three layers can reduce inference cost by 80% or more.
How much does quantization reduce LLM inference cost?
Quantizing from FP16 to INT8 or INT4 reduces memory by 2-4x and cuts inference cost by roughly 50% while maintaining 95-99% of original accuracy. Google's TurboQuant (2026) compresses the KV cache to 3 bits with zero measured accuracy loss, achieving 6x memory reduction. SmoothQuant achieves 2x memory reduction and 1.56x speedup.
What is speculative decoding and how fast is it?
Speculative decoding uses a small, fast draft model to propose multiple tokens, then the larger target model verifies them in a single parallel pass. The output is mathematically identical to normal autoregressive decoding. Production benchmarks show 2-3x speedup with off-the-shelf draft models, and optimized implementations reach 5x.
Which LLM inference engine is fastest in 2026?
SGLang v0.4.3 and LMDeploy both hit approximately 16,200 tokens per second on H100. vLLM v0.7.3 follows at 12,500 tok/s. TensorRT-LLM delivers the highest throughput at high concurrency once compiled. The right choice depends on workload: vLLM for flexibility, SGLang for prefix-heavy workloads, TensorRT-LLM for maximum throughput at scale.
How does context compression differ from summarization?
Summarization rewrites your context in fewer words, paraphrasing away file paths, error codes, and specific decisions. Production evaluations score it 3.4-3.7/5 on accuracy. Context compaction (like Morph Compact) deletes filler while keeping every surviving sentence word-for-word. JetBrains found summarization causes 13-15% longer agent trajectories compared to verbatim compaction.
What is the ROI of prompt caching for LLM inference?
Prompt caching reuses previously computed KV tensors from attention layers. For contexts over 10K tokens, cached portions see 80-90% latency reduction. With Anthropic's prompt caching, cached input tokens don't count toward rate limits, effectively multiplying throughput by 5x at 80% cache hit rate.
Can you combine multiple inference optimization techniques?
Yes, and you should. Model-level (quantization), system-level (batching, PagedAttention), and application-level (context compression) optimizations are independent and compound. A quantized model served on vLLM with context compaction can cost 80% less than the unoptimized baseline. Each layer targets a different bottleneck, so there is minimal overlap or diminishing returns.
Cut Inference Cost at the Application Layer
Morph Compact compresses context 50-70% at 33,000 tok/s. Verbatim compaction, not summarization. Stacks with whatever model and system optimizations you already have.