Quick Pricing Overview
Most developers pick an LLM API based on vibes. They read a blog post, see a benchmark, and hardcode an OpenAI key. Six months later they are paying 10x what they should for a task that a $0.10/MTok model handles fine.
The market in 2026 has 11+ production-grade providers. Prices range from $0.08 to $25 per million output tokens. Context windows span 128K to 1M tokens. Throughput ranges from 80 tok/s to 840 tok/s. A chatbot doing 100M output tokens per month pays $1,000 on GPT-5.4 or $42 on DeepSeek V3.2. Same quality tier for most conversational tasks.
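That $1,000-vs-$42 gap is pure multiplication: output volume in millions of tokens times the per-MTok price. A minimal sketch, using the prices quoted in this article:

```python
# Monthly output-token bill: volume (in millions of tokens, MTok)
# times the provider's price per MTok.
def monthly_cost(output_mtok: float, price_per_mtok: float) -> float:
    return output_mtok * price_per_mtok

# 100M output tokens per month:
print(monthly_cost(100, 10.00))           # GPT-5.4 at $10/MTok   -> 1000.0
print(round(monthly_cost(100, 0.42), 2))  # DeepSeek at $0.42/MTok -> 42.0
```

Input tokens add a second term, but for chat-style workloads output usually dominates the bill.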
The 80/20 Rule
80% of applications work fine with a model in the $0.40-$2.50/MTok output range. Only coding agents, complex reasoning, and multi-step planning benefit from $10+/MTok models. Start cheap, benchmark your specific task, upgrade only when you measure a quality gap.
Provider-by-Provider Breakdown
OpenAI
The default choice for most developers. OpenAI offers the widest model range: GPT-4.1-nano at $0.10/$0.40, GPT-4.1 at $2/$8, and reasoning models o3/o4-mini from $1.10-$8 per MTok. The GPT-4.1 family (April 2025) replaced GPT-4o as the recommended line. Key advantage: 1M token context on GPT-4.1, the most mature function calling and structured output implementation, and the largest ecosystem of tools and libraries.
| Model | Input/MTok | Output/MTok | Context | Max Output |
|---|---|---|---|---|
| GPT-4.1 | $2.00 | $8.00 | 1M | 32K |
| GPT-4.1-mini | $0.40 | $1.60 | 1M | 32K |
| GPT-4.1-nano | $0.10 | $0.40 | 1M | 32K |
| o3 (reasoning) | $2.00 | $8.00 | 200K | 100K |
| o4-mini (reasoning) | $1.10 | $4.40 | 200K | 100K |
| GPT-5.4 | $2.50 | $10.00 | 128K | 16K |
| GPT-5.4-mini | $0.15 | $0.60 | 128K | 16K |
GPT-4.1 is cheaper than GPT-5.4 ($8 vs $10 output) with 8x the context window (1M vs 128K). No reason to use GPT-5.4 for new projects unless you depend on its fine-tuning ecosystem.
Anthropic (Claude)
Claude models lead coding benchmarks. Claude Sonnet 4.5 holds the top SWE-Bench score at 82%. Claude Opus 4.6 is the most capable model for complex reasoning and agent workflows, with 128K max output tokens and extended thinking. Anthropic also has native MCP (Model Context Protocol) support, giving Claude direct access to external tools and data sources.
| Model | Input/MTok | Output/MTok | Context | Max Output |
|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 200K (1M beta) | 128K |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K (1M beta) | 64K |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | 64K |
Claude pricing is 2-3x higher than OpenAI for comparable tiers. The premium buys better coding performance, longer outputs (128K vs 32K on GPT-4.1), and extended thinking. For non-coding tasks, the quality gap narrows and the price difference matters more.
Google (Gemini)
Google undercuts both OpenAI and Anthropic at every price tier. Gemini 2.5 Flash at $0.30/$2.50 per MTok handles 1M token contexts. Gemini 2.5 Flash-Lite at $0.10/$0.40 matches GPT-4.1-nano pricing with 1M context. Google also offers a free tier for development, which no other frontier provider does.
| Model | Input/MTok | Output/MTok | Context | Notes |
|---|---|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M+ | Latest flagship |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M | Frontier reasoning |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M | Best value per token |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M | Budget tier, free tier available |
Gemini 2.5 Flash is arguably the best value: 1M context, $2.50/MTok output, free tier available, competitive quality on most benchmarks. The tradeoff is less mature function calling and structured output support compared to OpenAI.
DeepSeek
The price disruptor. DeepSeek V3.2 matches GPT-5.4-class quality at $0.28/$0.42 per MTok. That is 24x cheaper on output. Cache hits drop input cost to $0.028/MTok. The catch: DeepSeek has experienced reliability issues during peak usage, and data routes through servers in China. For applications with data sovereignty requirements, DeepSeek may not work.
| Model | Input/MTok | Output/MTok | Context | Notes |
|---|---|---|---|---|
| DeepSeek V3.2 (chat) | $0.28 | $0.42 | 128K | Cache hit: $0.028 input |
| DeepSeek V3.2 (reasoner) | $0.28 | $0.42 | 128K | Thinking mode, 64K max output |
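The cache-hit discount compounds quickly for apps with long, stable system prompts. A sketch of the blended input price, using the $0.28 miss / $0.028 hit prices above; the 80% hit rate is an illustrative assumption, not a DeepSeek figure:

```python
# Effective DeepSeek input price ($/MTok) as a function of cache hit rate.
# An 80% hit rate is plausible for chat apps that resend a large,
# unchanged system prompt on every turn -- measure your own.
def blended_input_price(hit_rate: float, miss: float = 0.28, hit: float = 0.028) -> float:
    return hit_rate * hit + (1 - hit_rate) * miss

print(round(blended_input_price(0.80), 4))  # -> 0.0784 ($/MTok)
print(round(blended_input_price(0.00), 4))  # -> 0.28 (no caching)
```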
Groq
Custom LPU (Language Processing Unit) hardware optimized for inference speed. Llama 3.1 8B runs at 840 tok/s. Llama 4 Scout at 594 tok/s. Groq hosts open-weight models only (Llama, Qwen, Mistral). You pay for speed, not the cheapest per-token pricing.
| Model | Input/MTok | Output/MTok | Speed | Context |
|---|---|---|---|---|
| Llama 3.1 8B | $0.05 | $0.08 | 840 tok/s | 128K |
| Qwen3 32B | $0.29 | $0.59 | 662 tok/s | 131K |
| Llama 4 Scout | $0.11 | $0.34 | 594 tok/s | 128K |
| Llama 4 Maverick | $0.20 | $0.60 | 562 tok/s | 128K |
| Llama 3.3 70B | $0.59 | $0.79 | 394 tok/s | 128K |
Together AI
The widest selection of open-weight models in one place: Llama, DeepSeek, Qwen, Mistral, GLM, Kimi, and smaller community models. Batch API at 50% discount. Good option for teams that want to evaluate multiple open-source models without managing infrastructure.
| Model | Input/MTok | Output/MTok | Notes |
|---|---|---|---|
| Llama 4 Maverick | $0.27 | $0.85 | MoE, 128K context |
| Llama 3.3 70B | $0.88 | $0.88 | Dense model |
| DeepSeek V3.1 | $0.60 | $1.70 | Hosted alternative |
| DeepSeek R1 | $3.00 | $7.00 | Reasoning model |
| Qwen 2.5 7B | $0.30 | $0.30 | Small, fast |
Fireworks AI
Tiered pricing by model parameter count for open-weight models, plus specific pricing for popular models. Cached inputs at 50% off. On-demand GPU deployments available: A100 at $2.90/hr, H100 at $4.00/hr, H200 at $6.00/hr for teams needing dedicated capacity.
| Model / Tier | Input/MTok | Output/MTok | Notes |
|---|---|---|---|
| DeepSeek V3 | $0.56 | $1.68 | Hosted DeepSeek |
| Kimi K2.5 | $0.60 | $3.00 | Cached input: $0.10 |
| GLM-5 | $1.00 | $3.20 | Cached input: $0.20 |
| < 4B params (tier) | $0.10 | $0.10 | Any small model |
| 16B+ params (tier) | $0.90 | $0.90 | Any large model |
Cohere
Focused on enterprise RAG and search rather than general chat. Command A handles document processing with 256K context. Embed v4.0 and Rerank v4.0 are among the best retrieval models available. Less relevant for general-purpose LLM API usage, but the strongest option for search-and-retrieval pipelines.
Amazon Bedrock and Azure OpenAI
Cloud-hosted wrappers around first-party models. Bedrock serves Claude, Llama, Mistral, and Amazon Nova through AWS billing. Azure OpenAI provides OpenAI models with enterprise features: VNet integration, managed identity, content filtering. Pricing runs 1-2x the direct API cost. The value is compliance (SOC 2, HIPAA, ISO 27001), VPC integration, and consolidated cloud billing.
Mistral
Both proprietary and open-weight models. Mistral Large 3 is their flagship. Codestral targets code generation with 256K context. Devstral 2 is built for software engineering agents. Popular in Europe for EU data residency options. Open-weight Mistral models (7B, 8x7B) are available on every inference provider.
Full Pricing Table: Output Tokens Ranked by Cost
Output tokens dominate most API bills because they cost 2-5x more than input and most applications generate substantial output. This table ranks every major model by output cost.
| Model | Provider | Output/MTok | Input/MTok | Context |
|---|---|---|---|---|
| Llama 3.1 8B | Groq | $0.08 | $0.05 | 128K |
| Llama 4 Scout | Groq | $0.34 | $0.11 | 128K |
| GPT-4.1-nano | OpenAI | $0.40 | $0.10 | 1M |
| Gemini 2.5 Flash-Lite | Google | $0.40 | $0.10 | 1M |
| DeepSeek V3.2 | DeepSeek | $0.42 | $0.28 | 128K |
| Qwen3 32B | Groq | $0.59 | $0.29 | 131K |
| GPT-5.4-mini | OpenAI | $0.60 | $0.15 | 128K |
| Llama 4 Maverick | Groq | $0.60 | $0.20 | 128K |
| Llama 3.3 70B | Groq | $0.79 | $0.59 | 128K |
| Llama 4 Maverick | Together | $0.85 | $0.27 | 128K |
| GPT-4.1-mini | OpenAI | $1.60 | $0.40 | 1M |
| Gemini 2.5 Flash | Google | $2.50 | $0.30 | 1M |
| o4-mini | OpenAI | $4.40 | $1.10 | 200K |
| Claude Haiku 4.5 | Anthropic | $5.00 | $1.00 | 200K |
| GPT-4.1 | OpenAI | $8.00 | $2.00 | 1M |
| o3 | OpenAI | $8.00 | $2.00 | 200K |
| GPT-5.4 | OpenAI | $10.00 | $2.50 | 128K |
| Gemini 2.5 Pro | Google | $10.00 | $1.25 | 1M |
| Gemini 3.1 Pro | Google | $12.00 | $2.00 | 1M+ |
| Claude Sonnet 4.6 | Anthropic | $15.00 | $3.00 | 200K |
| Claude Opus 4.6 | Anthropic | $25.00 | $5.00 | 200K |
Monthly Cost at 50M Output Tokens
- DeepSeek V3.2: $21/month
- GPT-4.1-nano: $20/month
- GPT-4.1-mini: $80/month
- Gemini 2.5 Flash: $125/month
- GPT-4.1: $400/month
- Claude Sonnet 4.6: $750/month
- Claude Opus 4.6: $1,250/month
Cost Calculator: Real Workloads
Per-token prices mean nothing until you map them to actual usage patterns.
Coding Agent: 1,000 Files/Day
A coding agent reads ~50K input tokens per file (context, file contents, instructions) and generates ~5K output tokens (edits, explanations). Daily: 50M input, 5M output.
| Model | Daily Input | Daily Output | Monthly Total |
|---|---|---|---|
| Claude Opus 4.6 | $250 | $125 | $11,250 |
| Claude Sonnet 4.6 | $150 | $75 | $6,750 |
| GPT-4.1 | $100 | $40 | $4,200 |
| Gemini 2.5 Pro | $62.50 | $50 | $3,375 |
| GPT-4.1-mini | $20 | $8 | $840 |
| DeepSeek V3.2 | $14 | $2.10 | $483 |
| GPT-4.1-nano | $5 | $2 | $210 |
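Every row in these workload tables comes from the same formula: daily token volumes times per-MTok prices, times roughly 30 days. A sketch you can point at your own traffic:

```python
# Reproduce any row of the workload tables: daily input/output volumes
# in MTok, per-MTok prices, ~30 billing days per month.
def monthly_bill(in_mtok_day: float, out_mtok_day: float,
                 in_price: float, out_price: float, days: int = 30) -> float:
    return days * (in_mtok_day * in_price + out_mtok_day * out_price)

# Coding agent (50M in, 5M out per day) on Claude Sonnet 4.6:
print(monthly_bill(50, 5, 3.00, 15.00))  # -> 6750.0
# Same workload on GPT-4.1:
print(monthly_bill(50, 5, 2.00, 8.00))   # -> 4200.0
```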
Customer Support Chatbot: 10K Conversations/Day
Average conversation: 2K input tokens (system prompt + history), 500 output tokens. Daily: 20M input, 5M output.
| Model | Daily Cost | Monthly Cost |
|---|---|---|
| Claude Haiku 4.5 | $45 | $1,350 |
| GPT-4.1-mini | $16 | $480 |
| Gemini 2.5 Flash | $18.50 | $555 |
| GPT-4.1-nano | $4 | $120 |
| Gemini 2.5 Flash-Lite | $4 | $120 |
| DeepSeek V3.2 | $7.70 | $231 |
| Llama 3.1 8B @ Groq | $1.40 | $42 |
RAG Batch Processing: 1M Documents
Processing 1M documents averaging 2K tokens each for summarization. One-time batch: 2B input tokens, 200M output tokens.
| Model | Input Cost | Output Cost | Total |
|---|---|---|---|
| GPT-4.1 (batch 50% off) | $2,000 | $800 | $2,800 |
| Gemini 2.5 Flash (batch 50% off) | $300 | $250 | $550 |
| DeepSeek V3.2 | $560 | $84 | $644 |
| GPT-4.1-nano (batch 50% off) | $100 | $40 | $140 |
The 50x Cost Gap
The same coding agent workload costs $11,250/month on Claude Opus 4.6 and $210/month on GPT-4.1-nano. Both produce usable code, but Opus handles complex multi-file refactors. The production answer: route simple edits to a cheap model and complex tasks to Opus. Most teams land at $800-1,500/month with routing.
Feature Matrix
Price and speed are not the only variables. Function calling maturity, structured output guarantees, vision capabilities, and batch APIs all determine which provider fits your use case.
| Feature | OpenAI | Anthropic | Google | DeepSeek |
|---|---|---|---|---|
| Function Calling | Best (parallel, schema) | Good (parallel) | Good + grounding | Supported |
| Structured Outputs | JSON schema enforcement | JSON mode | JSON mode | JSON mode |
| Vision (Image Input) | GPT-5.4, GPT-4.1 | All current models | All Gemini models | No |
| Streaming | Yes | Yes | Yes | Yes |
| Batch API | 50% discount | 50% discount | 50% discount | No |
| Prompt Caching | 50% off auto | 90% off manual | Tiered pricing | 90% off auto |
| Max Context | 1M (GPT-4.1) | 1M (beta) | 1M+ | 128K |
| Max Output | 100K (o3) | 128K (Opus 4.6) | 65K | 64K |
| Reasoning Mode | o3, o4-mini | Extended thinking | Thinking budgets | Reasoner mode |
| Fine-tuning | GPT-5.4, GPT-4.1-mini | Not available | Gemini Flash | Not available |
| MCP Support | No | Native (Claude) | No | No |
| Self-hosting | No | No | No | Yes (open weights) |
Best Function Calling
OpenAI. Parallel function calls with JSON schema enforcement at the decoding level. GPT-4.1 and o3 handle complex multi-tool chains. Guaranteed valid JSON output reduces parsing errors and retries.
Best Long Context
Google Gemini. 1M tokens at $2.50/MTok on Flash. OpenAI offers 1M on GPT-4.1 at $8. Anthropic offers 1M via beta at $15-25. For cost-effective long-context processing, Gemini wins by 3-10x.
Best for Agents
Anthropic Claude. Native MCP support, extended thinking, 128K max output. Opus 4.6 and Sonnet 4.5 lead coding benchmarks. The only provider with a standardized tool protocol (MCP) built in.
Tool Calling and MCP
OpenAI and Anthropic have the most tested tool calling implementations. Both support parallel tool calls (multiple tools in a single turn), which is critical for agents. Anthropic has a structural advantage: native MCP support means Claude connects to external tools through a standardized interface. This matters for coding agents that read files, run commands, and search codebases.
Structured Outputs
OpenAI enforces JSON schemas at the decoding level, guaranteeing valid output. Other providers offer JSON mode where the model attempts JSON but does not guarantee schema compliance. For applications parsing LLM output programmatically, OpenAI's approach reduces retry rates.
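If you are on a JSON-mode provider without decoding-level enforcement, you end up writing the guardrail yourself. A minimal sketch of the check-and-retry loop; `call_llm` and `REQUIRED_KEYS` are placeholders for your client and your real schema check:

```python
import json

# Client-side guardrail for providers that offer JSON mode without
# schema enforcement: parse, validate, retry on failure.
# `call_llm` is a placeholder for whatever client library you use;
# the key check stands in for a full schema validator.
REQUIRED_KEYS = {"name", "price_usd"}

def parse_with_retry(call_llm, prompt: str, max_retries: int = 2) -> dict:
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry
        if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
            return data  # conforms: done
    raise ValueError("no conforming JSON after retries")
```

Each retry is a full billed request, which is exactly the cost that schema enforcement at the decoding level avoids.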
Speed and Throughput
Two metrics matter: time to first token (TTFT), which determines how fast streaming begins, and tokens per second (tok/s), which determines how fast responses complete. For interactive apps, sub-500ms TTFT feels instant. For batch processing, total throughput drives job time.
| Provider / Model | Throughput | TTFT | Best For |
|---|---|---|---|
| Groq (Llama 3.1 8B) | 840 tok/s | ~50ms | Fastest available, simple tasks |
| Groq (Qwen3 32B) | 662 tok/s | ~80ms | Quality + speed balance |
| Groq (Llama 4 Scout) | 594 tok/s | ~100ms | High throughput, MoE |
| Groq (Llama 4 Maverick) | 562 tok/s | ~100ms | Larger MoE model |
| Groq (Llama 3.3 70B) | 394 tok/s | ~150ms | Quality at speed |
| Google (Gemini 2.5 Flash) | ~200 tok/s | ~150ms | Multimodal + speed |
| OpenAI (GPT-4.1) | ~100 tok/s | ~200ms | Flagship general purpose |
| Anthropic (Claude Sonnet 4.6) | ~80 tok/s | ~300ms | Coding and reasoning |
| Anthropic (Claude Opus 4.6) | ~60 tok/s | ~500ms | Most complex tasks |
Groq dominates throughput using custom LPU hardware instead of GPUs. The tradeoff: Groq only hosts open-weight models. No GPT-4.1 or Claude on Groq infrastructure.
Reasoning models (o3, o4-mini, DeepSeek R1) are inherently slower because they generate internal thinking tokens before visible output. A 500-token response from o3 might consume 2,000+ thinking tokens internally, adding seconds of latency.
Interactive Apps: Optimize TTFT
For chatbots, code assistants, and streaming UIs, time to first token matters most. Target sub-500ms. Use Flash/Nano/Haiku tiers. Groq delivers sub-100ms on smaller models.
Batch Processing: Optimize Throughput
For document processing and offline analysis, total tok/s drives job time. Use batch APIs (50% discount from OpenAI, Google, Anthropic) or run parallel requests through inference providers.
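Both metrics are easy to measure yourself against any streaming endpoint. A sketch that works over any chunk iterator; the whitespace split is a crude stand-in for real token counts:

```python
import time

# Measure TTFT and rough throughput over any streaming chunk iterator.
# `stream` is whatever your provider's streaming API yields; splitting
# on whitespace approximates token counts well enough for comparison.
def measure(stream):
    start = time.perf_counter()
    ttft, tokens = None, 0
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        tokens += len(chunk.split())
    elapsed = max(time.perf_counter() - start, 1e-9)
    return ttft, tokens / elapsed  # (seconds, tok/s)
```

Run the same prompt against two providers and you have a like-for-like TTFT and tok/s comparison for your workload, not a vendor benchmark.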
Best API by Use Case
| Use Case | Recommended API | Why | Output Cost |
|---|---|---|---|
| Chatbot / support | GPT-4.1-mini or Gemini Flash | Good quality, low cost, fast | $1.60-2.50/MTok |
| RAG / document Q&A | Gemini 2.5 Pro or Cohere | Long context, retrieval tools | $2.50-10.00/MTok |
| Code generation | Claude Sonnet 4.6 or GPT-4.1 | Top benchmarks (SWE-Bench 82%) | $8-15/MTok |
| Code editing / diffs | Morph Fast Apply | 10,500 tok/s, 98% accuracy | $0.80-1.20/MTok |
| Codebase search | Morph WarpGrep | 0.73 F1, semantic, MCP server | Token-based |
| Agentic workflows | Claude Opus 4.6 or o3 | Extended thinking, 128K output | $8-25/MTok |
| High-volume / budget | DeepSeek V3.2 or GPT-4.1-nano | Frontier quality under $0.50/MTok | $0.40-0.42/MTok |
| Low latency | Groq (Llama 4 Scout) | 594 tok/s on custom hardware | $0.34/MTok |
| Enterprise compliance | Azure OpenAI or Bedrock | SOC 2, HIPAA, VPC, SLA | 1-2x direct cost |
| Multimodal | Gemini 2.5 Pro or GPT-5.4 | Image, video, audio input | $10/MTok |
Specialized APIs: When General-Purpose Falls Short
General-purpose LLM APIs handle most tasks. But specific high-volume operations benefit from purpose-built models. Code editing is the clearest example: a task narrow enough for specialization, high-volume enough to justify it.
Coding agents spend the majority of their compute on two operations: searching codebases for context and applying edits to files. Cognition (the team behind Devin) measured that their agent spends 60% of its time on search alone. Both operations are bottlenecked by general-purpose LLM throughput of 80-100 tok/s.
Morph Fast Apply
Code editing API at 10,500 tok/s with 98% accuracy. A 7B model with custom CUDA kernels and speculative decoding. Agents output edit snippets; Fast Apply merges them into complete files in 1-3 seconds. $0.80-1.20/MTok.
Morph WarpGrep
RL-trained semantic codebase search MCP server. 8 parallel tool calls per turn, 4 turns average, searches complete in under 6 seconds. 0.73 F1 vs 0.29 for baseline grep. Works with any MCP-compatible agent.
| Metric | Morph Fast Apply | Claude Sonnet 4.6 | GPT-4.1 |
|---|---|---|---|
| Throughput | 10,500 tok/s | ~80 tok/s | ~100 tok/s |
| Edit Accuracy | 98% | 95% | 92% |
| Cost per MTok | $0.80-1.20 | $15.00 | $8.00 |
| Task Scope | File edit application | General code + reasoning | General code + reasoning |
The tradeoff is scope. Fast Apply does one thing: merge code edits into files. It does not generate code from scratch, answer questions, or reason about architecture. You use it alongside a general-purpose LLM. The general model reasons and generates edit instructions. Fast Apply executes them at 100x the speed.
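The division of labor is simple to sketch. `plan_edit` and `apply_edit` below are placeholders standing in for a frontier-model call and a fast-apply-style merge endpoint, not real Morph or Anthropic client calls:

```python
# Two-stage edit pipeline sketch: the frontier model plans the edit
# (slow, smart), the specialized endpoint merges it (fast, narrow).
# `plan_edit` and `apply_edit` are illustrative placeholders.
def edit_file(path: str, instruction: str, plan_edit, apply_edit) -> str:
    with open(path) as f:
        original = f.read()
    snippet = plan_edit(instruction, original)  # edit instructions
    merged = apply_edit(original, snippet)      # full merged file
    with open(path, "w") as f:
        f.write(merged)
    return merged
```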
How to Choose an LLM API
Decision Framework
- Step 1: Define your quality floor. Run your actual prompts through GPT-4.1-nano ($0.40/MTok), GPT-4.1-mini ($1.60/MTok), and GPT-4.1 ($8/MTok). If the cheapest passes your quality bar, stop there.
- Step 2: Check for specialized APIs. Code editing, embeddings, reranking, and speech have purpose-built APIs that outperform general models at lower cost. Do not use a $15/MTok model for a task a $1/MTok specialized model handles better.
- Step 3: Evaluate provider constraints. Need self-hosting? DeepSeek or Llama. Compliance? Azure or Bedrock. Speed? Groq. Widest model selection? Together AI or Fireworks.
- Step 4: Test with real traffic. Benchmarks measure different tasks than your application. A/B test on actual user queries. Measure quality, latency, and cost simultaneously.
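Step 1 is a ten-line harness, not a project. A sketch; `call` and `passes` are placeholders for your client and your task-specific quality check:

```python
# Run the same prompts through each candidate tier and compare pass
# rates against your own grader. `call(model, prompt)` and
# `passes(output)` are placeholders you supply.
def pass_rate(call, passes, prompts, model: str) -> float:
    ok = sum(passes(call(model, p)) for p in prompts)
    return ok / len(prompts)

# for model in ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1"]:
#     print(model, pass_rate(call, passes, my_prompts, model))
```

If the nano tier clears your bar on your prompts, the benchmark leaderboards are irrelevant.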
The most common mistake is starting with the most expensive model and never testing cheaper alternatives. GPT-4.1-nano at $0.40/MTok handles summarization, classification, extraction, and simple chat as well as models costing 20x more. Reserve Claude Opus 4.6 and o3 for tasks where you have measured a quality difference on your specific workload.
The second most common mistake is using a general-purpose API for a specialized task. Dedicated embedding models beat a general LLM for vector search. Fast Apply outperforms any general LLM for code editing by 100x on throughput. WarpGrep outperforms grep-based search by 2.5x on F1. Match the tool to the task.
The Two-Model Pattern
The most effective production architecture is two models behind a router. Route 80-95% of traffic to a cheap model (GPT-4.1-nano, Gemini Flash-Lite, DeepSeek V3.2) and escalate complex tasks to frontier (Claude Opus 4.6, o3, Gemini 2.5 Pro). This cuts costs 60-80% compared to running everything through frontier, with minimal quality loss on the tasks that matter.
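A minimal router sketch. The model names come from this article; the complexity heuristic is an illustrative assumption — production routers typically use a trained classifier or past-failure signals rather than string matching:

```python
# Hypothetical two-model router: cheap by default, frontier on demand.
# The length/keyword heuristic below is a deliberately naive stand-in
# for a real complexity classifier.
CHEAP, FRONTIER = "gpt-4.1-nano", "claude-opus-4.6"

def pick_model(prompt: str) -> str:
    needs_frontier = len(prompt) > 8_000 or "refactor" in prompt.lower()
    return FRONTIER if needs_frontier else CHEAP

print(pick_model("Classify this support ticket"))  # -> gpt-4.1-nano
print(pick_model("Refactor the auth module"))      # -> claude-opus-4.6
```

Hosted routers (LiteLLM, OpenRouter, Portkey) give you the same pattern plus failover and cost tracking without maintaining the heuristic yourself.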
Frequently Asked Questions
What is the cheapest LLM API in 2026?
DeepSeek V3.2 at $0.28/$0.42 per MTok (input/output) is the cheapest frontier-class option. GPT-4.1-nano ($0.10/$0.40) and Gemini 2.5 Flash-Lite ($0.10/$0.40) are similarly cheap with 1M context windows. Groq's Llama 3.1 8B at $0.05/$0.08 is the absolute cheapest if you do not need frontier intelligence.
Which LLM API is best for coding?
For code generation and reasoning, Claude Sonnet 4.5 leads SWE-Bench at 82%, with Claude Opus 4.6 close behind. For code editing and diff application, Morph Fast Apply runs at 10,500 tok/s with 98% accuracy. For budget code generation, DeepSeek V3.2 and GPT-4.1 both perform well.
Which LLM API has the fastest inference?
Groq, by a wide margin. Custom LPU hardware delivers 840 tok/s on Llama 3.1 8B and 594 tok/s on Llama 4 Scout. For frontier models, Gemini Flash and GPT-4.1-nano are the fastest. For code editing specifically, Morph Fast Apply reaches 10,500 tok/s.
What is the largest context window available?
Google Gemini models support 1M+ tokens natively. OpenAI GPT-4.1 supports 1M tokens. Anthropic Claude Opus 4.6 and Sonnet 4.6 support 1M via beta header. Cohere Command A supports 256K tokens. DeepSeek V3.2 supports 128K.
Should I use one provider or multiple?
Multiple providers with routing is the production standard. Route most requests to cheap models, escalate complex tasks to frontier. LiteLLM, Portkey, and OpenRouter handle routing, failover, and cost tracking. This eliminates single-provider downtime risk and cuts costs 60-80%.
Do all LLM APIs support function calling?
All major providers support it. OpenAI has the most mature implementation with parallel calls and JSON schema enforcement. Anthropic supports parallel calls and native MCP. Google adds grounding with Search. Inference providers (Groq, Together, Fireworks) support it on select models.
When should I self-host instead of using an API?
Self-host when you need data sovereignty, consistent latency at 50+ concurrent users, or monthly API spend exceeds $5K-10K. vLLM is the standard framework (24x throughput vs naive inference). Self-hosting is only cheaper at 70%+ GPU utilization. Below that, API providers win on cost.
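The utilization claim falls out of simple arithmetic. A sketch using the $4.00/hr H100 price from the Fireworks section; the 1,000 tok/s aggregate throughput is an illustrative assumption for a batched 70B-class deployment:

```python
# Rough self-hosting cost per MTok generated. GPU dollars per hour
# divided by MTok actually produced per hour. Throughput and
# utilization figures are illustrative -- plug in your own.
def selfhost_price_per_mtok(gpu_hourly: float, tok_per_s: float,
                            utilization: float) -> float:
    mtok_per_hour = tok_per_s * 3600 / 1e6  # MTok per fully-busy hour
    return gpu_hourly / (mtok_per_hour * utilization)

print(round(selfhost_price_per_mtok(4.00, 1000, 0.70), 2))  # -> 1.59 $/MTok
print(round(selfhost_price_per_mtok(4.00, 1000, 0.15), 2))  # -> 7.41 $/MTok
```

At 70% utilization the H100 beats mid-tier API pricing; at 15% it costs more per token than GPT-4.1-mini. Idle GPU hours are the hidden line item.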
Related Resources
Code Editing at 10,500 tok/s
General-purpose LLM APIs process code edits at 80-100 tok/s. Morph Fast Apply uses a specialized 7B model with custom CUDA kernels to apply diffs at 10,500 tok/s with 98% accuracy. OpenAI-compatible API.