LLM Inference Optimization: A Practical Guide to Cutting Cost and Latency (2026)

Concrete techniques for optimizing LLM inference across model, system, and application layers. Quantization, KV cache compression, continuous batching, speculative decoding, and context compaction with real benchmarks.

March 27, 2026 · 3 min read

The average LLM API call wastes 40-60% of input tokens on context the model doesn't need. Stale conversation history, boilerplate system prompts, full-file includes when only three functions matter. You pay for every wasted token twice: once in your API bill, again in latency while the model attends over padding.

80% cost reduction from stacked optimizations
2-4x memory savings from quantization
3-10x throughput from continuous batching
33K tok/s Morph Compact compaction speed
[Figure: LLM inference optimization layers: model quantization, system batching, and context compression, with cost reduction at each layer]

The Cost Problem

LLM API prices dropped roughly 80% from 2025 to 2026. GPT-4-level performance costs $0.40 per million tokens now, down from $30/M in March 2023. But inference volume is growing faster than prices are falling. Agentic workflows that make 50-200 LLM calls per task turn a cheap per-token price into an expensive per-task cost.

The problem compounds in three ways:

Context bloat

Agents accumulate context over multi-turn sessions. By turn 30, input tokens per call can be 5-10x what they were at turn 1. Most of those tokens are stale.

Redundant computation

Without caching, the model recomputes attention over the same system prompt and conversation prefix on every call. For a 10K-token prefix, that is billions of wasted FLOPs per request.

Underutilized hardware

Default serving configurations leave GPUs idle between requests. Without continuous batching, a single H100 at $3/hr may process only 50 tok/s instead of 16,000+.

Optimization is not about squeezing a few percentage points. It is about removing the 3-10x overhead that default configurations impose. The techniques below address each source of waste at the layer where it originates.

Model-Level Optimizations

Model-level techniques reduce the computational cost per parameter. They modify the model itself, before it ever sees a request.

Quantization

Quantization reduces weight precision from FP16 to INT8, INT4, or lower. The tradeoff: lower precision means a smaller memory footprint and faster matrix multiplications, at the cost of a small accuracy degradation.

2-4x memory reduction (INT8/INT4)
~50% cost reduction per inference
95-99% accuracy retained
1.56x speedup (SmoothQuant)

SmoothQuant migrates quantization difficulty from activations to weights, achieving 2x memory reduction with negligible accuracy loss. GPTQ and AWQ use calibration data to find optimal per-layer quantization parameters. Google's TurboQuant (March 2026) compresses the KV cache itself to 3 bits per value with zero measured accuracy loss, cutting KV cache memory by 6x.
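
To make the mechanism concrete, here is a minimal numpy sketch of symmetric per-channel INT8 weight quantization. This is the basic idea only, not GPTQ, AWQ, or SmoothQuant themselves, which add calibration and difficulty migration on top of it.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-channel INT8 quantization: one scale per output row."""
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

np.random.seed(0)
w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
# INT8 storage is 2x smaller than FP16; rounding error is bounded by scale/2
err = np.abs(dequantize(q, scale) - w).max()
```

The per-channel scale is what keeps accuracy high: each output row gets its own dynamic range instead of sharing one across the whole matrix.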

Pruning

Pruning removes redundant parameters from the model. Structured pruning removes entire attention heads or MLP columns; unstructured pruning zeros out individual weights. A pruned 6B-parameter model runs 30% faster than its dense counterpart and scores 72.5 on MMLU, beating the unpruned 4B model at 70.0.

Knowledge Distillation

Distillation trains a smaller "student" model to match a larger "teacher" model's output distribution. The student runs at a fraction of the cost. The optimal compression pipeline is P-KD-Q: prune first, distill second, quantize last. Each step compounds.
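
The core training signal in distillation is a soft-label loss between the two output distributions. A minimal illustrative sketch (temperature-scaled KL divergence, numpy; real pipelines typically mix this with the hard-label loss):

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits: np.ndarray, teacher_logits: np.ndarray,
                 T: float = 2.0) -> float:
    """KL(teacher || student) over temperature-softened distributions.
    The T*T factor keeps gradient magnitudes comparable across temperatures."""
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)   # student predictions
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)
```

The temperature softens the teacher's distribution so the student also learns the teacher's relative preferences among wrong answers, not just the argmax.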

When to use each

Quantization gives the best cost/effort ratio for API providers and self-hosted deployments. Pruning and distillation require training compute but produce permanently cheaper models. If you consume LLMs via API, these are handled by your provider. If you self-host, start with quantization (zero training cost), then evaluate pruning and distillation for your specific workload.

System-Level Optimizations

System-level techniques maximize hardware utilization without changing the model. They operate in the serving layer between your model and the network.

Continuous Batching

Static batching waits for all requests in a batch to finish before accepting new ones. Short requests sit idle while long ones generate. Continuous batching dynamically inserts new requests as old ones complete, keeping the GPU saturated.

The throughput difference is significant: 3-10x higher on the same hardware. Anyscale measured a 23x improvement in aggregate throughput with continuous batching enabled on production workloads.
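
The scheduling difference is easy to see in a toy simulation, where each request is just a count of tokens left to generate (this is a sketch of the scheduling policy, not a real serving engine):

```python
from collections import deque

def continuous_batch(requests, max_batch=4):
    """Toy scheduler: admit new requests the moment a slot frees up,
    instead of waiting for the whole batch to drain (static batching).
    Each request is (id, tokens_to_generate)."""
    queue = deque(requests)
    active = {}      # id -> tokens remaining
    steps = 0
    order = []       # completion order
    while queue or active:
        # fill free slots immediately -- the key difference from static batching
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # one decode step for every active request
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                order.append(rid)
        steps += 1
    return steps, order
```

With one long request and four short ones at batch size 2, this finishes in 5 decode steps; static batching would hold a slot hostage behind the long request and take 7.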

PagedAttention and KV Cache Management

The KV cache stores computed attention keys and values so the model doesn't recompute them on each token. The problem: pre-allocating KV cache memory for the maximum sequence length wastes up to 90% of GPU memory, because most requests don't use the full context window.

PagedAttention (vLLM) splits the KV cache into small, reusable pages allocated on demand. This cuts memory waste by up to 90% and enables up to 24x higher serving throughput because more requests fit in memory simultaneously.
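
The allocation scheme can be sketched as a toy block allocator (this illustrates the idea, not vLLM's actual implementation):

```python
class PagedKVCache:
    """Toy page allocator: KV memory is carved into fixed-size blocks,
    handed out on demand as a sequence grows, instead of pre-allocating
    the full max_seq_len per request."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}    # seq_id -> list of physical block ids
        self.lengths = {}   # seq_id -> token count

    def append_token(self, seq_id: str):
        """Allocate a new block only when the current one is full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            if not self.free:
                raise MemoryError("KV cache exhausted; evict or preempt a sequence")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because a sequence only holds the blocks it has actually filled, short requests leave room for many more concurrent sequences in the same GPU memory.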

ChunkKV treats semantic chunks rather than isolated tokens as compression units, preserving linguistic structure under aggressive compression. RocketKV uses a two-stage pipeline: coarse-grained KV eviction first, then fine-grained compression on the survivors.

Speculative Decoding

Autoregressive decoding generates one token at a time, leaving the GPU underutilized during each forward pass. Speculative decoding adds a small, fast draft model that proposes multiple tokens ahead. The target model verifies them in a single parallel pass. Accepted tokens are mathematically identical to what the target model would have generated alone.

2-3x typical speedup

Production benchmarks with off-the-shelf EAGLE3 draft models on general queries. The speedup is essentially free: output quality is identical.

Up to 5x optimized

Domain-specific or hardware-optimized implementations reach 5-5.5x speedup over standard autoregressive decoding.

Draft latency matters most

Recent benchmarks show little correlation between draft model accuracy and throughput. The draft model's latency is the stronger determinant of end-to-end speed.
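
The propose-verify loop, for the greedy-decoding case, can be sketched as follows. The real algorithm verifies all draft tokens in one parallel forward pass and handles sampling via rejection; this toy version checks them sequentially to show the acceptance logic:

```python
def speculative_step(draft_next, target_next, context, k=4):
    """One speculate-verify round with greedy decoding: the draft proposes k
    tokens; the target keeps the longest matching prefix plus its own
    correction token. With greedy decoding the result is exactly what the
    target alone would have generated."""
    # draft phase: propose k tokens cheaply
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    # verify phase (conceptually one parallel pass of the target model)
    accepted, ctx = [], list(context)
    for t in proposed:
        if target_next(ctx) == t:     # target agrees with the draft
            accepted.append(t)
            ctx.append(t)
        else:
            break
    accepted.append(target_next(ctx))  # target's own token at first mismatch
    return accepted
```

A perfect draft yields k+1 tokens per target pass; a useless draft still yields 1, so correctness never depends on draft quality, only speed does.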

FlashAttention

FlashAttention reorganizes the attention computation to minimize memory I/O by tiling the computation and fusing softmax with matrix multiplication. FlashAttention-3 provides the fastest custom attention kernels available, and is integrated into both vLLM and SGLang.

Inference Engines Compared

Four engines dominate production LLM serving in 2026. Each takes a different optimization approach.

| Engine | Version | Throughput (H100) | Key Feature | Best For |
|---|---|---|---|---|
| SGLang | v0.4.3 | 16,200 tok/s | RadixAttention prefix caching | Prefix-heavy workloads (RAG, chat) |
| LMDeploy | Latest | 16,200 tok/s | Persistent batch scheduling | High-throughput serving |
| vLLM | v0.7.3 | 12,500 tok/s | PagedAttention, Blackwell support | Flexibility, frequent model swaps |
| TensorRT-LLM | Latest | Highest at high concurrency | Compiled CUDA kernels | Single-model, long-term production |

The 29% throughput gap between SGLang/LMDeploy and vLLM widens under prefix-heavy workloads, where SGLang's RadixAttention provides an additional advantage. TensorRT-LLM requires a compilation step but delivers the highest throughput at scale once compiled.

For most teams, the recommendation: vLLM if you swap models frequently and want the easiest path to production. SGLang if your workload has shared prefixes (chatbots, RAG, multi-turn). TensorRT-LLM if you're running one model in long-term production and throughput is the priority.

Application-Level Optimizations

Application-level techniques reduce the tokens you send before they reach the model. They are the highest-ROI optimizations for teams consuming LLMs via API, because they compound with whatever model-level and system-level work your provider has already done.

Prompt Caching

Prompt caching reuses previously computed KV tensors from attention layers. When consecutive requests share a common prefix (system prompt, conversation history), the cached portion skips the prefill phase entirely.

Anthropic, OpenAI, and Google all offer prompt caching. For contexts over 10K tokens, cached portions see 80-90% latency reduction. With Anthropic's implementation, cached input tokens don't count toward rate limits, effectively multiplying throughput by 5x at 80% cache hit rate.
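
Provider APIs differ in how you mark a prefix as cacheable, but the underlying mechanism looks like this toy sketch: key the computed prefill state by the longest previously seen prompt prefix, so repeated prefixes skip "prefill" entirely (all names here are illustrative, not any provider's API):

```python
import hashlib

class PrefixCache:
    """Toy prompt cache: store a (simulated) KV state per prompt prefix and
    reuse the longest matching prefix on the next request."""
    def __init__(self):
        self.store = {}  # prefix hash -> simulated KV state

    def prefill(self, tokens):
        """Return (cached_count, state): how many leading tokens were served
        from cache, plus the state after processing all tokens."""
        cached, state = 0, None
        # probe progressively shorter prefixes, longest first
        for cut in range(len(tokens), 0, -1):
            key = hashlib.sha256(str(tokens[:cut]).encode()).hexdigest()
            if key in self.store:
                cached, state = cut, self.store[key]
                break
        # "compute" the remaining tokens and cache every new prefix
        for i in range(cached, len(tokens)):
            state = (state, tokens[i])  # stand-in for real KV tensors
            key = hashlib.sha256(str(tokens[:i + 1]).encode()).hexdigest()
            self.store[key] = state
        return cached, state
```

This is why cache hit rates depend on prompt structure: put stable content (system prompt, tools, long documents) first and volatile content last, so the shared prefix stays as long as possible.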

Semantic Caching

Semantic caching goes further: it stores complete request-response pairs and returns cached responses for semantically similar queries. On cache hits, the LLM inference call is eliminated entirely. AWS benchmarks show 3-10x cost savings for workloads with repetitive query patterns.
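
A minimal sketch of the lookup logic, using a deliberately crude bag-of-words embedding (production systems use a real embedding model and a vector index; the 0.8 threshold is illustrative):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.entries = []  # (embedding, response)
        self.threshold = threshold

    def get(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]  # cache hit: skip the LLM call entirely
        return None         # miss: call the model, then put() the result

    def put(self, query: str, response: str):
        self.entries.append((embed(query), response))
```

The threshold is the key tuning knob: too low and users get stale answers to genuinely different questions, too high and the hit rate collapses.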

Context Compression

Most input tokens in agentic workflows are low-signal: old conversation turns, boilerplate headers, file contents the model already processed. Context compression removes them before inference.

Techniques like LLMLingua achieve up to 20x compression by ranking and preserving key tokens. But compression methods that rewrite content introduce a fidelity problem. Summarization-based approaches score 3.4-3.7/5 on accuracy in production evaluations because they paraphrase away file paths, error codes, and specific decisions.

Verbatim compaction takes a different approach: it deletes low-information tokens while keeping every surviving sentence character-for-character. No generated content, no reformatting. JetBrains found that summarization causes 13-15% longer agent trajectories compared to verbatim compaction, because agents re-derive information that was paraphrased away.
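
To illustrate the verbatim constraint (this is a toy heuristic, not Morph's or JetBrains' method; the signal regex and recency window are invented for the example):

```python
import re

# crude "high-signal" detector: file paths, error mentions, acronyms, numbers
KEEP = re.compile(r"(/[\w./-]+|Error|error code|\b[A-Z]{2,}\b|\d)")

def compact(context: str, keep_last: int = 2) -> str:
    """Toy verbatim compaction: drop sentences that look low-signal, keep
    every surviving sentence character-for-character. Nothing is generated
    or paraphrased, so paths and error codes survive exactly."""
    sentences = re.split(r"(?<=[.!?])\s+", context)
    kept = [
        s for i, s in enumerate(sentences)
        if KEEP.search(s) or i >= len(sentences) - keep_last  # recency window
    ]
    return " ".join(kept)
```

Every sentence in the output is a byte-for-byte copy from the input, which is the property that keeps agents from re-deriving paraphrased-away facts.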

Morph Compact

Morph Compact runs verbatim context compaction at 33,000 tok/s on a custom inference engine. It shrinks context 50-70% while keeping every surviving sentence word-for-word. Fast enough to run inline before every LLM call, not just at the 95% capacity cliff.

Model Routing

Not every request needs your most expensive model. Routing classification and extraction tasks to Haiku ($0.25/M input) instead of Sonnet ($3/M input) yields a 12x cost reduction with minimal quality difference for those task types. Production routing typically delivers 2-5x aggregate cost savings.
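
A router can start as simple as a keyword gate (a sketch only; the model names are illustrative, and production routers usually replace the keyword check with a trained classifier):

```python
CHEAP, EXPENSIVE = "claude-haiku", "claude-sonnet"  # illustrative model names

# task types that cheap models handle with minimal quality loss
SIMPLE_TASKS = ("classify", "extract", "label", "translate")

def route(prompt: str) -> str:
    """Toy router: send classification/extraction-style requests to the
    cheap model, everything else to the expensive one."""
    p = prompt.lower()
    if any(t in p for t in SIMPLE_TASKS) and len(p) < 2000:
        return CHEAP
    return EXPENSIVE
```

Even this crude version captures most of the savings if your traffic is dominated by a few cheap task types; the classifier earns its keep on the ambiguous middle.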

Stacking the Optimization Layers

Each layer targets a different bottleneck. They compound without overlap.

| Layer | What It Reduces | Typical Savings | Effort |
|---|---|---|---|
| Quantization (Model) | Memory per parameter | 2-4x memory, ~50% cost | Low (tooling exists) |
| Continuous Batching (System) | GPU idle time | 3-10x throughput | Low (engine config) |
| PagedAttention (System) | KV cache memory waste | Up to 24x throughput | Low (use vLLM/SGLang) |
| Speculative Decoding (System) | Decode latency | 2-5x speed | Medium (draft model selection) |
| Context Compaction (App) | Input tokens sent | 50-70% token reduction | Low (API call) |
| Prompt Caching (App) | Redundant prefill | 80-90% latency on cached | Low (API flag) |
| Model Routing (App) | Cost per request | 2-5x aggregate savings | Medium (classifier needed) |

A concrete example: a coding agent running on a quantized Llama 70B model (2x cheaper), served via vLLM with continuous batching (5x throughput), using Morph Compact to compress context before each call (60% fewer input tokens). The combined effect: roughly 80% lower cost per task compared to naive FP16 serving with full context.
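
The arithmetic behind that 80% figure is multiplicative, since each layer scales a different factor of cost per task:

```python
# Cost-per-task factors from the example above:
quant_cost_factor = 0.5   # quantization: ~2x cheaper per token
compaction_factor = 0.4   # compaction: 60% fewer input tokens
combined = quant_cost_factor * compaction_factor
# 0.5 * 0.4 = 0.20 of baseline, i.e. roughly 80% lower cost per task
# (the 5x batching throughput compounds separately, on hardware utilization)
```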

For teams using hosted APIs (OpenAI, Anthropic, Google), the model and system layers are handled by the provider. Application-layer optimizations, specifically context compression, prompt caching, and model routing, are the levers you control. They are also the highest-ROI, because they reduce the tokens entering a system that your provider has already optimized.

Measuring Optimization Impact

The wrong metric hides waste. Track these separately:

Tokens per task

Total tokens consumed to complete a unit of work (not per request). This is the metric that maps to cost. A coding agent that takes 50 requests averaging 8K tokens costs 400K tokens per task.

Time to first token (TTFT)

Latency from request to first response byte. Dominated by prefill time. Context compression and prompt caching directly reduce TTFT.

Tokens per second (TPS)

Decode throughput. Affected by model size, quantization, batch size, and speculative decoding. Measure under realistic concurrency, not single-request benchmarks.

Cost per task

The bottom line: tokens per task multiplied by price per token. This is what you optimize. A 60% reduction in tokens per task is a 60% cost reduction, regardless of per-token pricing.
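
Putting the metric into code, using the coding-agent example from above (the $3/M input price is illustrative):

```python
def cost_per_task(requests: int, avg_tokens: int, price_per_m: float) -> float:
    """Cost per unit of work: tokens per task x price per token."""
    return requests * avg_tokens * price_per_m / 1_000_000

# 50 requests x 8K tokens = 400K tokens per task
baseline = cost_per_task(50, 8_000, 3.0)    # $1.20 per task
# 60% fewer input tokens via compaction
compacted = cost_per_task(50, 3_200, 3.0)   # $0.48 per task
```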

Common mistake

Optimizing tokens per second while ignoring tokens per task. A faster engine processing bloated context still costs more than a slower engine processing compressed context. Measure from the application outward, not from the GPU inward.

Frequently Asked Questions

What is LLM inference optimization?

LLM inference optimization is the set of techniques that reduce the cost, latency, and memory consumption of running large language model predictions. It spans three layers: model-level (quantization, pruning, distillation), system-level (continuous batching, PagedAttention, speculative decoding), and application-level (context compression, prompt caching). Stacking optimizations across all three layers can reduce inference cost by 80% or more.

How much does quantization reduce LLM inference cost?

Quantizing from FP16 to INT8 or INT4 reduces memory by 2-4x and cuts inference cost by roughly 50% while maintaining 95-99% of original accuracy. Google's TurboQuant (2026) compresses the KV cache to 3 bits with zero measured accuracy loss, achieving 6x memory reduction. SmoothQuant achieves 2x memory reduction and 1.56x speedup.

What is speculative decoding and how fast is it?

Speculative decoding uses a small, fast draft model to propose multiple tokens, then the larger target model verifies them in a single parallel pass. The output is mathematically identical to normal autoregressive decoding. Production benchmarks show 2-3x speedup with off-the-shelf draft models, and optimized implementations reach 5x.

Which LLM inference engine is fastest in 2026?

SGLang v0.4.3 and LMDeploy both hit approximately 16,200 tokens per second on H100. vLLM v0.7.3 follows at 12,500 tok/s. TensorRT-LLM leads at every concurrency level once compiled. The right choice depends on workload: vLLM for flexibility, SGLang for prefix-heavy workloads, TensorRT-LLM for maximum throughput at scale.

How does context compression differ from summarization?

Summarization rewrites your context in fewer words, paraphrasing away file paths, error codes, and specific decisions. Production evaluations score it 3.4-3.7/5 on accuracy. Context compaction (like Morph Compact) deletes filler while keeping every surviving sentence word-for-word. JetBrains found summarization causes 13-15% longer agent trajectories compared to verbatim compaction.

What is the ROI of prompt caching for LLM inference?

Prompt caching reuses previously computed KV tensors from attention layers. For contexts over 10K tokens, cached portions see 80-90% latency reduction. With Anthropic's prompt caching, cached input tokens don't count toward rate limits, effectively multiplying throughput by 5x at 80% cache hit rate.

Can you combine multiple inference optimization techniques?

Yes, and you should. Model-level (quantization), system-level (batching, PagedAttention), and application-level (context compression) optimizations are independent and compound. A quantized model served on vLLM with context compaction can cost 80% less than the unoptimized baseline. Each layer targets a different bottleneck, so there is minimal overlap or diminishing returns.

Cut Inference Cost at the Application Layer

Morph Compact compresses context 50-70% at 33,000 tok/s. Verbatim compaction, not summarization. Stacks with whatever model and system optimizations you already have.