LLM API Comparison 2026: Pricing, Speed, and Features Across Every Major Provider

Side-by-side comparison of 11 LLM API providers. Pricing per million tokens, context windows, throughput benchmarks, and a decision framework for OpenAI, Anthropic, Google Gemini, DeepSeek, Groq, Together AI, Fireworks, Mistral, Cohere, Bedrock, and Azure.

March 12, 2026 · 2 min read

Quick Pricing Overview

Most developers pick an LLM API based on vibes. They read a blog post, see a benchmark, and hardcode an OpenAI key. Six months later they are paying 10x what they should for a task that a $0.10/MTok model handles fine.

The market in 2026 has 11+ production-grade providers. Prices range from $0.08 to $25 per million output tokens. Context windows span 128K to 1M tokens. Throughput ranges from 80 tok/s to 840 tok/s. A chatbot doing 100M output tokens per month pays $1,000 on GPT-5.4 or $42 on DeepSeek V3.2. Same quality tier for most conversational tasks.

  • 300x: price range across all models
  • 1M tokens: max context available
  • 840 tok/s: peak throughput (Groq)
  • 11+: production-grade providers

The 80/20 Rule

80% of applications work fine with a model in the $0.40-$2.50/MTok output range. Only coding agents, complex reasoning, and multi-step planning benefit from $10+/MTok models. Start cheap, benchmark your specific task, upgrade only when you measure a quality gap.

Provider-by-Provider Breakdown

OpenAI

The default choice for most developers. OpenAI offers the widest model range: GPT-4.1-nano at $0.10/$0.40, GPT-4.1 at $2/$8, and the reasoning models o3 and o4-mini from $1.10 to $8 per MTok. The GPT-4.1 family (April 2025) replaced GPT-4o as the recommended line. Key advantages: 1M token context on GPT-4.1, the most mature function calling and structured output implementation, and the largest ecosystem of tools and libraries.

| Model | Input/MTok | Output/MTok | Context | Max Output |
|---|---|---|---|---|
| GPT-4.1 | $2.00 | $8.00 | 1M | 32K |
| GPT-4.1-mini | $0.40 | $1.60 | 1M | 32K |
| GPT-4.1-nano | $0.10 | $0.40 | 1M | 32K |
| o3 (reasoning) | $2.00 | $8.00 | 200K | 100K |
| o4-mini (reasoning) | $1.10 | $4.40 | 200K | 100K |
| GPT-5.4 | $2.50 | $10.00 | 128K | 16K |
| GPT-5.4-mini | $0.15 | $0.60 | 128K | 16K |

GPT-4.1 is cheaper than GPT-5.4 ($8 vs $10 output) with 8x the context window (1M vs 128K). No reason to use GPT-5.4 for new projects unless you depend on its fine-tuning ecosystem.

Anthropic (Claude)

Claude models lead coding benchmarks. Claude Sonnet 4.5 holds the top SWE-Bench score at 82%. Claude Opus 4.6 is the most capable model for complex reasoning and agent workflows, with 128K max output tokens and extended thinking. Anthropic also has native MCP (Model Context Protocol) support, giving Claude direct access to external tools and data sources.

| Model | Input/MTok | Output/MTok | Context | Max Output |
|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 200K (1M beta) | 128K |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K (1M beta) | 64K |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | 64K |

Claude pricing is 2-3x higher than OpenAI for comparable tiers. The premium buys better coding performance, longer outputs (128K vs 32K on GPT-4.1), and extended thinking. For non-coding tasks, the quality gap narrows and the price difference matters more.

Google (Gemini)

Google undercuts both OpenAI and Anthropic at every price tier. Gemini 2.5 Flash at $0.30/$2.50 per MTok handles 1M token contexts. Gemini 2.5 Flash-Lite at $0.10/$0.40 matches GPT-4.1-nano pricing with 1M context. Google also offers a free tier for development, which no other frontier provider does.

| Model | Input/MTok | Output/MTok | Context | Notes |
|---|---|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M+ | Latest flagship |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M | Frontier reasoning |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M | Best value per token |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M | Budget tier, free tier available |

Gemini 2.5 Flash is arguably the best value: 1M context, $2.50/MTok output, free tier available, competitive quality on most benchmarks. The tradeoff is less mature function calling and structured output support compared to OpenAI.

DeepSeek

The price disruptor. DeepSeek V3.2 matches GPT-5.4-class quality at $0.28/$0.42 per MTok. That is 24x cheaper on output. Cache hits drop input cost to $0.028/MTok. The catch: DeepSeek has experienced reliability issues during peak usage, and data routes through servers in China. For applications with data sovereignty requirements, DeepSeek may not work.

| Model | Input/MTok | Output/MTok | Context | Notes |
|---|---|---|---|---|
| DeepSeek V3.2 (chat) | $0.28 | $0.42 | 128K | Cache hit: $0.028 input |
| DeepSeek V3.2 (reasoner) | $0.28 | $0.42 | 128K | Thinking mode, 64K max output |
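Cache pricing turns hit rate into an effective input price. A quick sketch using the table's DeepSeek numbers ($0.28/MTok on a miss, $0.028 on a hit); the 80% hit rate is an illustrative assumption, not a measured figure.

```python
# Effective input price per MTok given a prompt-cache hit rate.
# miss/hit prices from the DeepSeek table; hit_rate is an assumption.
def effective_input_price(miss: float, hit: float, hit_rate: float) -> float:
    return hit_rate * hit + (1 - hit_rate) * miss

# An agent with a stable system prompt often sees high hit rates:
print(round(effective_input_price(0.28, 0.028, 0.8), 4))  # 0.0784
```

At an 80% hit rate the effective input price drops from $0.28 to roughly $0.078/MTok, which is why caching matters more for agents with long, repeated system prompts than for one-shot calls.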

Groq

Custom LPU (Language Processing Unit) hardware optimized for inference speed. Llama 3.1 8B runs at 840 tok/s. Llama 4 Scout at 594 tok/s. Groq hosts open-weight models only (Llama, Qwen, Mistral). You pay for speed, not the cheapest per-token pricing.

| Model | Input/MTok | Output/MTok | Speed | Context |
|---|---|---|---|---|
| Llama 3.1 8B | $0.05 | $0.08 | 840 tok/s | 128K |
| Qwen3 32B | $0.29 | $0.59 | 662 tok/s | 131K |
| Llama 4 Scout | $0.11 | $0.34 | 594 tok/s | 128K |
| Llama 4 Maverick | $0.20 | $0.60 | 562 tok/s | 128K |
| Llama 3.3 70B | $0.59 | $0.79 | 394 tok/s | 128K |

Together AI

The widest selection of open-weight models in one place: Llama, DeepSeek, Qwen, Mistral, GLM, Kimi, and smaller community models. Batch API at 50% discount. Good option for teams that want to evaluate multiple open-source models without managing infrastructure.

| Model | Input/MTok | Output/MTok | Notes |
|---|---|---|---|
| Llama 4 Maverick | $0.27 | $0.85 | MoE, 128K context |
| Llama 3.3 70B | $0.88 | $0.88 | Dense model |
| DeepSeek V3.1 | $0.60 | $1.70 | Hosted alternative |
| DeepSeek R1 | $3.00 | $7.00 | Reasoning model |
| Qwen 2.5 7B | $0.30 | $0.30 | Small, fast |

Fireworks AI

Tiered pricing by model parameter count for open-weight models, plus specific pricing for popular models. Cached inputs at 50% off. On-demand GPU deployments available: A100 at $2.90/hr, H100 at $4.00/hr, H200 at $6.00/hr for teams needing dedicated capacity.

| Model / Tier | Input/MTok | Output/MTok | Notes |
|---|---|---|---|
| DeepSeek V3 | $0.56 | $1.68 | Hosted DeepSeek |
| Kimi K2.5 | $0.60 | $3.00 | Cached input: $0.10 |
| GLM-5 | $1.00 | $3.20 | Cached input: $0.20 |
| < 4B params (tier) | $0.10 | $0.10 | Any small model |
| 16B+ params (tier) | $0.90 | $0.90 | Any large model |

Cohere

Focused on enterprise RAG and search rather than general chat. Command A handles document processing with 256K context. Embed v4.0 and Rerank v4.0 are among the best retrieval models available. Less relevant for general-purpose LLM API usage, but the strongest option for search-and-retrieval pipelines.

Amazon Bedrock and Azure OpenAI

Cloud-hosted wrappers around first-party models. Bedrock serves Claude, Llama, Mistral, and Amazon Nova through AWS billing. Azure OpenAI provides OpenAI models with enterprise features: VNet integration, managed identity, content filtering. Pricing runs 1-2x the direct API cost. The value is compliance (SOC 2, HIPAA, ISO 27001), VPC integration, and consolidated cloud billing.

Mistral

Both proprietary and open-weight models. Mistral Large 3 is their flagship. Codestral targets code generation with 256K context. Devstral 2 is built for software engineering agents. Popular in Europe for EU data residency options. Open-weight Mistral models (7B, 8x7B) are available on every inference provider.

Full Pricing Table: Output Tokens Ranked by Cost

Output tokens dominate most API bills because they cost 2-5x more than input and most applications generate substantial output. This table ranks every major model by output cost.

| Model | Provider | Output/MTok | Input/MTok | Context |
|---|---|---|---|---|
| Llama 3.1 8B | Groq | $0.08 | $0.05 | 128K |
| Llama 4 Scout | Groq | $0.34 | $0.11 | 128K |
| GPT-4.1-nano | OpenAI | $0.40 | $0.10 | 1M |
| Gemini 2.5 Flash-Lite | Google | $0.40 | $0.10 | 1M |
| DeepSeek V3.2 | DeepSeek | $0.42 | $0.28 | 128K |
| Qwen3 32B | Groq | $0.59 | $0.29 | 131K |
| GPT-5.4-mini | OpenAI | $0.60 | $0.15 | 128K |
| Llama 4 Maverick | Groq | $0.60 | $0.20 | 128K |
| Llama 3.3 70B | Groq | $0.79 | $0.59 | 128K |
| Llama 4 Maverick | Together | $0.85 | $0.27 | 128K |
| GPT-4.1-mini | OpenAI | $1.60 | $0.40 | 1M |
| Gemini 2.5 Flash | Google | $2.50 | $0.30 | 1M |
| o4-mini | OpenAI | $4.40 | $1.10 | 200K |
| Claude Haiku 4.5 | Anthropic | $5.00 | $1.00 | 200K |
| GPT-4.1 | OpenAI | $8.00 | $2.00 | 1M |
| o3 | OpenAI | $8.00 | $2.00 | 200K |
| GPT-5.4 | OpenAI | $10.00 | $2.50 | 128K |
| Gemini 2.5 Pro | Google | $10.00 | $1.25 | 1M |
| Gemini 3.1 Pro | Google | $12.00 | $2.00 | 1M+ |
| Claude Sonnet 4.6 | Anthropic | $15.00 | $3.00 | 200K |
| Claude Opus 4.6 | Anthropic | $25.00 | $5.00 | 200K |

Monthly Cost at 50M Output Tokens

  • GPT-4.1-nano: $20/month
  • DeepSeek V3.2: $21/month
  • GPT-4.1-mini: $80/month
  • Gemini 2.5 Flash: $125/month
  • GPT-4.1: $400/month
  • Claude Sonnet 4.6: $750/month
  • Claude Opus 4.6: $1,250/month

Cost Calculator: Real Workloads

Per-token prices mean nothing until you map them to actual usage patterns.

Coding Agent: 1,000 Files/Day

A coding agent reads ~50K input tokens per file (context, file contents, instructions) and generates ~5K output tokens (edits, explanations). Daily: 50M input, 5M output.

| Model | Daily Input | Daily Output | Monthly Total |
|---|---|---|---|
| Claude Opus 4.6 | $250 | $125 | $11,250 |
| Claude Sonnet 4.6 | $150 | $75 | $6,750 |
| GPT-4.1 | $100 | $40 | $4,200 |
| Gemini 2.5 Pro | $62.50 | $50 | $3,375 |
| GPT-4.1-mini | $20 | $8 | $840 |
| DeepSeek V3.2 | $14 | $2.10 | $483 |
| GPT-4.1-nano | $5 | $2 | $210 |
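The table above is straightforward to reproduce (and extend to your own models) with a few lines. Prices are the article's published rates; verify against each provider's current pricing page before relying on them.

```python
# Monthly cost estimator: maps per-MTok prices onto a daily token workload.
# Prices (input, output) per MTok, taken from the tables in this article.
PRICES = {
    "gpt-4.1":           (2.00, 8.00),
    "gpt-4.1-mini":      (0.40, 1.60),
    "gpt-4.1-nano":      (0.10, 0.40),
    "claude-opus-4.6":   (5.00, 25.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "deepseek-v3.2":     (0.28, 0.42),
}

def monthly_cost(model: str, input_mtok_per_day: float,
                 output_mtok_per_day: float, days: int = 30) -> float:
    """Dollars per month for a daily workload measured in millions of tokens."""
    in_price, out_price = PRICES[model]
    daily = input_mtok_per_day * in_price + output_mtok_per_day * out_price
    return daily * days

# Coding-agent workload from the table: 50M input + 5M output per day.
print(round(monthly_cost("claude-opus-4.6", 50, 5)))  # 11250
print(round(monthly_cost("gpt-4.1-nano", 50, 5)))     # 210
```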

Customer Support Chatbot: 10K Conversations/Day

Average conversation: 2K input tokens (system prompt + history), 500 output tokens. Daily: 20M input, 5M output.

| Model | Daily Cost | Monthly Cost |
|---|---|---|
| Claude Haiku 4.5 | $45.00 | $1,350 |
| Gemini 2.5 Flash | $18.50 | $555 |
| GPT-4.1-mini | $16.00 | $480 |
| DeepSeek V3.2 | $7.70 | $231 |
| GPT-4.1-nano | $4.00 | $120 |
| Gemini 2.5 Flash-Lite | $4.00 | $120 |
| Llama 3.1 8B @ Groq | $1.40 | $42 |

RAG Batch Processing: 1M Documents

Processing 1M documents averaging 2K tokens each for summarization. One-time batch: 2B input tokens, 200M output tokens.

| Model | Input Cost | Output Cost | Total |
|---|---|---|---|
| GPT-4.1 (batch 50% off) | $2,000 | $800 | $2,800 |
| Gemini 2.5 Flash (batch 50% off) | $300 | $250 | $550 |
| DeepSeek V3.2 | $560 | $84 | $644 |
| GPT-4.1-nano (batch 50% off) | $100 | $40 | $140 |

The 50x Cost Gap

The same coding agent workload costs $11,250/month on Claude Opus 4.6 and $210/month on GPT-4.1-nano. Both produce usable code, but Opus handles complex multi-file refactors. The production answer: route simple edits to a cheap model and complex tasks to Opus. Most teams land at $800-1,500/month with routing.

Feature Matrix

Price and speed are not the only variables. Function calling maturity, structured output guarantees, vision capabilities, and batch APIs all determine which provider fits your use case.

| Feature | OpenAI | Anthropic | Google | DeepSeek |
|---|---|---|---|---|
| Function Calling | Best (parallel, schema) | Good (parallel) | Good + grounding | Supported |
| Structured Outputs | JSON schema enforcement | JSON mode | JSON mode | JSON mode |
| Vision (Image Input) | GPT-5.4, GPT-4.1 | All current models | All Gemini models | No |
| Streaming | Yes | Yes | Yes | Yes |
| Batch API | 50% discount | 50% discount | 50% discount | No |
| Prompt Caching | 50% off auto | 90% off manual | Tiered pricing | 90% off auto |
| Max Context | 1M (GPT-4.1) | 1M (beta) | 1M+ | 128K |
| Max Output | 100K (o3) | 128K (Opus 4.6) | 65K | 64K |
| Reasoning Mode | o3, o4-mini | Extended thinking | Thinking budgets | Reasoner mode |
| Fine-tuning | GPT-5.4, GPT-4.1-mini | Not available | Gemini Flash | Not available |
| MCP Support | No | Native (Claude) | No | No |
| Self-hosting | No | No | No | Yes (open weights) |

Best Function Calling

OpenAI. Parallel function calls with JSON schema enforcement at the decoding level. GPT-4.1 and o3 handle complex multi-tool chains. Guaranteed valid JSON output reduces parsing errors and retries.

Best Long Context

Google Gemini. 1M tokens at $2.50/MTok on Flash. OpenAI offers 1M on GPT-4.1 at $8. Anthropic offers 1M via beta at $15-25. For cost-effective long-context processing, Gemini wins by 3-6x.

Best for Agents

Anthropic Claude. Native MCP support, extended thinking, 128K max output. Opus 4.6 and Sonnet 4.5 lead coding benchmarks. The only provider with a standardized tool protocol (MCP) built in.

Tool Calling and MCP

OpenAI and Anthropic have the most tested tool calling implementations. Both support parallel tool calls (multiple tools in a single turn), which is critical for agents. Anthropic has a structural advantage: native MCP support means Claude connects to external tools through a standardized interface. This matters for coding agents that read files, run commands, and search codebases.
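Parallel tool calls arrive as a batch within a single model turn; your runtime executes each one and returns all results before the model continues. A minimal dispatcher sketch: the dict shapes mirror the OpenAI chat-completions tool_calls format, and `get_weather`/`search_docs` are hypothetical example tools.

```python
import json

def get_weather(city: str) -> str:
    return f"22C in {city}"          # stand-in for a real weather API

def search_docs(query: str) -> str:
    return f"3 hits for '{query}'"   # stand-in for a real search backend

TOOLS = {"get_weather": get_weather, "search_docs": search_docs}

def run_tool_calls(tool_calls: list) -> list:
    """Execute every call the model requested; build tool-role messages."""
    results = []
    for call in tool_calls:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": fn(**args),
        })
    return results

# One parallel turn: the model asked for two tools at once.
turn = [
    {"id": "a1", "function": {"name": "get_weather",
                              "arguments": '{"city": "Oslo"}'}},
    {"id": "a2", "function": {"name": "search_docs",
                              "arguments": '{"query": "pricing"}'}},
]
print([m["content"] for m in run_tool_calls(turn)])
```

The real SDK returns typed objects rather than dicts, but the flow is the same: collect every requested call, run them (concurrently if they are I/O bound), and append one tool message per call before the next completion request.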

Structured Outputs

OpenAI enforces JSON schemas at the decoding level, guaranteeing valid output. Other providers offer JSON mode where the model attempts JSON but does not guarantee schema compliance. For applications parsing LLM output programmatically, OpenAI's approach reduces retry rates.
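The practical difference shows up at parse time. A sketch of the request payload and the downstream parse step, assuming OpenAI's `json_schema` response format; the ticket schema itself is a made-up example.

```python
import json

# A JSON schema the decoder is constrained to. With strict enforcement the
# reply always parses and always matches; with plain JSON mode it may not.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["bug", "billing", "other"]},
        "urgent": {"type": "boolean"},
    },
    "required": ["category", "urgent"],
    "additionalProperties": False,
}

# Request payload fragment, following OpenAI's structured-outputs format.
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "ticket", "strict": True, "schema": TICKET_SCHEMA},
}

def parse_ticket(raw: str) -> dict:
    """The parse step your pipeline runs on every response."""
    ticket = json.loads(raw)
    assert ticket["category"] in {"bug", "billing", "other"}
    return ticket

print(parse_ticket('{"category": "bug", "urgent": true}'))
```

With JSON mode you wrap `parse_ticket` in retry logic; with schema enforcement the retry path is dead code, which is the reduction in retry rates the paragraph above describes.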

Speed and Throughput

Two metrics matter: time to first token (TTFT), which determines how fast streaming begins, and tokens per second (tok/s), which determines how fast responses complete. For interactive apps, sub-500ms TTFT feels instant. For batch processing, total throughput drives job time.
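Both metrics are easy to measure from any streaming response. A provider-agnostic sketch; `fake_stream` stands in for the chunk iterator your SDK returns.

```python
import time

def fake_stream(n_tokens: int = 50, delay: float = 0.001):
    """Simulated token stream; replace with a real streaming response."""
    for _ in range(n_tokens):
        time.sleep(delay)
        yield "tok"

def measure(stream):
    """Return (ttft_seconds, tokens_per_second) for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        count += 1
        if ttft is None:                      # first token arrived
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    return ttft, count / total

ttft, tps = measure(fake_stream())
print(f"TTFT {ttft * 1000:.1f} ms, {tps:.0f} tok/s")
```

Run the same harness against each candidate provider from your production region; published TTFT figures rarely match what you see behind your own network path.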

| Provider / Model | Throughput | TTFT | Best For |
|---|---|---|---|
| Groq (Llama 3.1 8B) | 840 tok/s | ~50ms | Fastest available, simple tasks |
| Groq (Qwen3 32B) | 662 tok/s | ~80ms | Quality + speed balance |
| Groq (Llama 4 Scout) | 594 tok/s | ~100ms | High throughput, MoE |
| Groq (Llama 4 Maverick) | 562 tok/s | ~100ms | Larger MoE model |
| Groq (Llama 3.3 70B) | 394 tok/s | ~150ms | Quality at speed |
| Google (Gemini 2.5 Flash) | ~200 tok/s | ~150ms | Multimodal + speed |
| OpenAI (GPT-4.1) | ~100 tok/s | ~200ms | Flagship general purpose |
| Anthropic (Claude Sonnet 4.6) | ~80 tok/s | ~300ms | Coding and reasoning |
| Anthropic (Claude Opus 4.6) | ~60 tok/s | ~500ms | Most complex tasks |

Groq dominates throughput using custom LPU hardware instead of GPUs. The tradeoff: Groq only hosts open-weight models. No GPT-4.1 or Claude on Groq infrastructure.

Reasoning models (o3, o4-mini, DeepSeek R1) are inherently slower because they generate internal thinking tokens before visible output. A 500-token response from o3 might consume 2,000+ thinking tokens internally, adding seconds of latency.

Interactive Apps: Optimize TTFT

For chatbots, code assistants, and streaming UIs, time to first token matters most. Target sub-500ms. Use Flash/Nano/Haiku tiers. Groq delivers sub-100ms on smaller models.

Batch Processing: Optimize Throughput

For document processing and offline analysis, total tok/s drives job time. Use batch APIs (50% discount from OpenAI, Google, Anthropic) or run parallel requests through inference providers.

Best API by Use Case

| Use Case | Recommended API | Why | Output Cost |
|---|---|---|---|
| Chatbot / support | GPT-4.1-mini or Gemini Flash | Good quality, low cost, fast | $1.60-2.50/MTok |
| RAG / document Q&A | Gemini 2.5 Pro or Cohere | Long context, retrieval tools | $2.50-10.00/MTok |
| Code generation | Claude Sonnet 4.6 or GPT-4.1 | Top benchmarks (SWE-Bench 82%) | $8-15/MTok |
| Code editing / diffs | Morph Fast Apply | 10,500 tok/s, 98% accuracy | $0.80-1.20/MTok |
| Codebase search | Morph WarpGrep | 0.73 F1, semantic, MCP server | Token-based |
| Agentic workflows | Claude Opus 4.6 or o3 | Extended thinking, 128K output | $8-25/MTok |
| High-volume / budget | DeepSeek V3.2 or GPT-4.1-nano | Frontier quality under $0.50/MTok | $0.40-0.42/MTok |
| Low latency | Groq (Llama 4 Scout) | 594 tok/s on custom hardware | $0.34/MTok |
| Enterprise compliance | Azure OpenAI or Bedrock | SOC 2, HIPAA, VPC, SLA | 1-2x direct cost |
| Multimodal | Gemini 2.5 Pro or GPT-5.4 | Image, video, audio input | $10/MTok |

Specialized APIs: When General-Purpose Falls Short

General-purpose LLM APIs handle most tasks. But specific high-volume operations benefit from purpose-built models. Code editing is the clearest example: a task narrow enough for specialization, high-volume enough to justify it.

Coding agents spend the majority of their compute on two operations: searching codebases for context and applying edits to files. Cognition (the team behind Devin) measured that their agent spends 60% of its time on search alone. Both operations are bottlenecked by general-purpose LLM throughput of 80-100 tok/s.

Morph Fast Apply

Code editing API at 10,500 tok/s with 98% accuracy. A 7B model with custom CUDA kernels and speculative decoding. Agents output edit snippets; Fast Apply merges them into complete files in 1-3 seconds. $0.80-1.20/MTok.

Morph WarpGrep

RL-trained semantic codebase search MCP server. 8 parallel tool calls per turn, 4 turns average, searches complete in under 6 seconds. 0.73 F1 vs 0.29 for baseline grep. Works with any MCP-compatible agent.

| Metric | Morph Fast Apply | Claude Sonnet 4.6 | GPT-4.1 |
|---|---|---|---|
| Throughput | 10,500 tok/s | ~80 tok/s | ~100 tok/s |
| Edit Accuracy | 98% | 95% | 92% |
| Cost per MTok | $0.80-1.20 | $15.00 | $8.00 |
| Task Scope | File edit application | General code + reasoning | General code + reasoning |

The tradeoff is scope. Fast Apply does one thing: merge code edits into files. It does not generate code from scratch, answer questions, or reason about architecture. You use it alongside a general-purpose LLM. The general model reasons and generates edit instructions. Fast Apply executes them at 100x the speed.

How to Choose an LLM API

Decision Framework

  • Step 1: Define your quality floor. Run your actual prompts through GPT-4.1-nano ($0.40/MTok), GPT-4.1-mini ($1.60/MTok), and GPT-4.1 ($8/MTok). If the cheapest passes your quality bar, stop there.
  • Step 2: Check for specialized APIs. Code editing, embeddings, reranking, and speech have purpose-built APIs that outperform general models at lower cost. Do not use a $15/MTok model for a task a $1/MTok specialized model handles better.
  • Step 3: Evaluate provider constraints. Need self-hosting? DeepSeek or Llama. Compliance? Azure or Bedrock. Speed? Groq. Widest model selection? Together AI or Fireworks.
  • Step 4: Test with real traffic. Benchmarks measure different tasks than your application. A/B test on actual user queries. Measure quality, latency, and cost simultaneously.

The most common mistake is starting with the most expensive model and never testing cheaper alternatives. GPT-4.1-nano at $0.40/MTok handles summarization, classification, extraction, and simple chat as well as models costing 20x more. Reserve Claude Opus 4.6 and o3 for tasks where you have measured a quality difference on your specific workload.

The second most common mistake is using a general-purpose API for a specialized task. Embedding models outperform using an LLM for vector search. Fast Apply outperforms any general LLM for code editing by 100x on throughput. WarpGrep outperforms grep-based search by 2.5x on F1. Match the tool to the task.

The Two-Model Pattern

The most effective production architecture is two models behind a router. Route 80-95% of traffic to a cheap model (GPT-4.1-nano, Gemini Flash-Lite, DeepSeek V3.2) and escalate complex tasks to frontier (Claude Opus 4.6, o3, Gemini 2.5 Pro). This cuts costs 60-80% compared to running everything through frontier, with minimal quality loss on the tasks that matter.
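A minimal router sketch. The model names come from this article's tables; the heuristics and thresholds are illustrative and should be tuned on your own logged traffic.

```python
# Two-model router: default to the cheap tier, escalate requests that look
# long or code-heavy. Thresholds and keyword triggers are assumptions.
CHEAP, FRONTIER = "gpt-4.1-nano", "claude-opus-4.6"

def pick_model(prompt: str, history_tokens: int = 0) -> str:
    looks_complex = (
        history_tokens > 20_000                  # long multi-turn context
        or "```" in prompt                       # code blocks in the request
        or any(w in prompt.lower() for w in ("refactor", "debug", "plan"))
    )
    return FRONTIER if looks_complex else CHEAP

print(pick_model("Summarize this email"))        # gpt-4.1-nano
print(pick_model("Refactor the auth module"))    # claude-opus-4.6
```

Keyword heuristics are a starting point; production routers typically replace them with a small classifier or with escalation on a failed first attempt.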

Frequently Asked Questions

What is the cheapest LLM API in 2026?

DeepSeek V3.2 at $0.28/$0.42 per MTok (input/output) is the cheapest frontier-class option. GPT-4.1-nano ($0.10/$0.40) and Gemini 2.5 Flash-Lite ($0.10/$0.40) are similarly cheap with 1M context windows. Groq's Llama 3.1 8B at $0.05/$0.08 is the absolute cheapest if you do not need frontier intelligence.

Which LLM API is best for coding?

For code generation and reasoning, Claude Sonnet 4.5 leads SWE-Bench at 82%, with Claude Opus 4.6 close behind. For code editing and diff application, Morph Fast Apply runs at 10,500 tok/s with 98% accuracy. For budget code generation, DeepSeek V3.2 and GPT-4.1 both perform well.

Which LLM API has the fastest inference?

Groq, by a wide margin. Custom LPU hardware delivers 840 tok/s on Llama 3.1 8B and 594 tok/s on Llama 4 Scout. For frontier models, Gemini Flash and GPT-4.1-nano are the fastest. For code editing specifically, Morph Fast Apply reaches 10,500 tok/s.

What is the largest context window available?

Google Gemini models support 1M+ tokens natively. OpenAI GPT-4.1 supports 1M tokens. Anthropic Claude Opus 4.6 and Sonnet 4.6 support 1M via beta header. Cohere Command A supports 256K tokens. DeepSeek V3.2 supports 128K.

Should I use one provider or multiple?

Multiple providers with routing is the production standard. Route most requests to cheap models, escalate complex tasks to frontier. LiteLLM, Portkey, and OpenRouter handle routing, failover, and cost tracking. This eliminates single-provider downtime risk and cuts costs 60-80%.

Do all LLM APIs support function calling?

All major providers support it. OpenAI has the most mature implementation with parallel calls and JSON schema enforcement. Anthropic supports parallel calls and native MCP. Google adds grounding with Search. Inference providers (Groq, Together, Fireworks) support it on select models.

When should I self-host instead of using an API?

Self-host when you need data sovereignty, consistent latency at 50+ concurrent users, or monthly API spend exceeds $5K-10K. vLLM is the standard framework (24x throughput vs naive inference). Self-hosting is only cheaper at 70%+ GPU utilization. Below that, API providers win on cost.
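The break-even claim is simple arithmetic. A sketch using an assumed $4/hr H100 rate and an assumed 1,000 tok/s aggregate throughput under vLLM batching; substitute your own measured numbers.

```python
def selfhost_cost_per_mtok(gpu_per_hour: float, tok_per_s: float,
                           utilization: float) -> float:
    """Dollars per million generated tokens at a given average utilization."""
    mtok_per_hour = tok_per_s * 3600 * utilization / 1e6
    return gpu_per_hour / mtok_per_hour

# At 70% utilization the GPU beats mid-tier API pricing; at 20% it does not.
print(round(selfhost_cost_per_mtok(4.00, 1000, 0.70), 2))  # 1.59
print(round(selfhost_cost_per_mtok(4.00, 1000, 0.20), 2))  # 5.56
```

The crossover is entirely utilization-driven: the same hardware that undercuts a $2.50/MTok API at steady load costs more than most frontier APIs when it sits mostly idle.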

Related Resources

Code Editing at 10,500 tok/s

General-purpose LLM APIs process code edits at 80-100 tok/s. Morph Fast Apply uses a specialized 7B model with custom CUDA kernels to apply diffs at 10,500 tok/s with 98% accuracy. OpenAI-compatible API.