Quick Pricing Overview
Most developers pick an LLM API based on vibes. They read a blog post, see a benchmark, and hardcode an OpenAI key. Six months later they are paying 10x what they should for a task that a $0.10/MTok model handles fine.
The market in 2026 has 11+ production-grade providers. Prices range from $0.08 to $25 per million output tokens. Context windows span 128K to 1M tokens. Throughput ranges from 80 tok/s to 840 tok/s. A chatbot doing 100M output tokens per month pays $1,000 on GPT-5.4 or $42 on DeepSeek V3.2. Same quality tier for most conversational tasks.
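That $1,000-vs-$42 gap is pure multiplication: output volume in millions of tokens times the per-MTok price. A minimal sketch, using the prices quoted in this article:

```python
# Monthly output-token bill: volume (in millions of tokens, MTok)
# times the provider's price per MTok.
def monthly_cost(output_mtok: float, price_per_mtok: float) -> float:
    return output_mtok * price_per_mtok

# 100M output tokens per month:
print(monthly_cost(100, 10.00))           # GPT-5.4 at $10/MTok   -> 1000.0
print(round(monthly_cost(100, 0.42), 2))  # DeepSeek at $0.42/MTok -> 42.0
```

Input tokens add a second term, but for chat-style workloads output usually dominates the bill.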
The 80/20 Rule
80% of applications work fine with a model in the $0.40-$2.50/MTok output range. Only coding agents, complex reasoning, and multi-step planning benefit from $10+/MTok models. Start cheap, benchmark your specific task, upgrade only when you measure a quality gap.
Provider-by-Provider Breakdown
OpenAI
The default choice for most developers. OpenAI offers the widest model range: GPT-4.1-nano at $0.10/$0.40, GPT-4.1 at $2/$8, and reasoning models o3/o4-mini from $1.10-$8 per MTok. The GPT-4.1 family (April 2025) replaced GPT-4o as the recommended line. Key advantage: 1M token context on GPT-4.1, the most mature function calling and structured output implementation, and the largest ecosystem of tools and libraries.
| Model | Input/MTok | Output/MTok | Context | Max Output |
|---|---|---|---|---|
| GPT-4.1 | $2.00 | $8.00 | 1M | 32K |
| GPT-4.1-mini | $0.40 | $1.60 | 1M | 32K |
| GPT-4.1-nano | $0.10 | $0.40 | 1M | 32K |
| o3 (reasoning) | $2.00 | $8.00 | 200K | 100K |
| o4-mini (reasoning) | $1.10 | $4.40 | 200K | 100K |
| GPT-5.4 | $2.50 | $10.00 | 128K | 16K |
| GPT-5.4-mini | $0.15 | $0.60 | 128K | 16K |
GPT-4.1 is cheaper than GPT-5.4 ($8 vs $10 output) with 8x the context window (1M vs 128K). No reason to use GPT-5.4 for new projects unless you depend on its fine-tuning ecosystem.
Anthropic (Claude)
Claude models lead coding benchmarks. Claude Sonnet 4.5 holds the top SWE-Bench score at 82%. Claude Opus 4.6 is the most capable model for complex reasoning and agent workflows, with 128K max output tokens and extended thinking. Anthropic also has native MCP (Model Context Protocol) support, giving Claude direct access to external tools and data sources.
| Model | Input/MTok | Output/MTok | Context | Max Output |
|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 200K (1M beta) | 128K |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K (1M beta) | 64K |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | 64K |
Claude pricing is 2-3x higher than OpenAI for comparable tiers. The premium buys better coding performance, longer outputs (128K vs 32K on GPT-4.1), and extended thinking. For non-coding tasks, the quality gap narrows and the price difference matters more.
Google (Gemini)
Google undercuts both OpenAI and Anthropic at every price tier. Gemini 2.5 Flash at $0.30/$2.50 per MTok handles 1M token contexts. Gemini 2.5 Flash-Lite at $0.10/$0.40 matches GPT-4.1-nano pricing with 1M context. Google also offers a free tier for development, which no other frontier provider does.
| Model | Input/MTok | Output/MTok | Context | Notes |
|---|---|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M+ | Latest flagship |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M | Frontier reasoning |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M | Best value per token |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M | Budget tier, free tier available |
Gemini 2.5 Flash is arguably the best value: 1M context, $2.50/MTok output, free tier available, competitive quality on most benchmarks. The tradeoff is less mature function calling and structured output support compared to OpenAI.
DeepSeek
The price disruptor. DeepSeek V3.2 matches GPT-5.4-class quality at $0.28/$0.42 per MTok. That is 24x cheaper on output. Cache hits drop input cost to $0.028/MTok. The catch: DeepSeek has experienced reliability issues during peak usage, and data routes through servers in China. For applications with data sovereignty requirements, DeepSeek may not work.
| Model | Input/MTok | Output/MTok | Context | Notes |
|---|---|---|---|---|
| DeepSeek V3.2 (chat) | $0.28 | $0.42 | 128K | Cache hit: $0.028 input |
| DeepSeek V3.2 (reasoner) | $0.28 | $0.42 | 128K | Thinking mode, 64K max output |
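The cache-hit discount compounds quickly for apps with long, stable system prompts. A sketch of the blended input price, using the $0.28 miss / $0.028 hit prices above; the 80% hit rate is an illustrative assumption, not a DeepSeek figure:

```python
# Effective DeepSeek input price ($/MTok) as a function of cache hit rate.
# An 80% hit rate is plausible for chat apps that resend a large,
# unchanged system prompt on every turn -- measure your own.
def blended_input_price(hit_rate: float, miss: float = 0.28, hit: float = 0.028) -> float:
    return hit_rate * hit + (1 - hit_rate) * miss

print(round(blended_input_price(0.80), 4))  # -> 0.0784 ($/MTok)
print(round(blended_input_price(0.00), 4))  # -> 0.28 (no caching)
```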
Groq
Custom LPU (Language Processing Unit) hardware optimized for inference speed. Llama 3.1 8B runs at 840 tok/s. Llama 4 Scout at 594 tok/s. Groq hosts open-weight models only (Llama, Qwen, Mistral). You pay for speed, not the cheapest per-token pricing.
| Model | Input/MTok | Output/MTok | Speed | Context |
|---|---|---|---|---|
| Llama 3.1 8B | $0.05 | $0.08 | 840 tok/s | 128K |
| Qwen3 32B | $0.29 | $0.59 | 662 tok/s | 131K |
| Llama 4 Scout | $0.11 | $0.34 | 594 tok/s | 128K |
| Llama 4 Maverick | $0.20 | $0.60 | 562 tok/s | 128K |
| Llama 3.3 70B | $0.59 | $0.79 | 394 tok/s | 128K |
Together AI
The widest selection of open-weight models in one place: Llama, DeepSeek, Qwen, Mistral, GLM, Kimi, and smaller community models. Batch API at 50% discount. Good option for teams that want to evaluate multiple open-source models without managing infrastructure.
| Model | Input/MTok | Output/MTok | Notes |
|---|---|---|---|
| Llama 4 Maverick | $0.27 | $0.85 | MoE, 128K context |
| Llama 3.3 70B | $0.88 | $0.88 | Dense model |
| DeepSeek V3.1 | $0.60 | $1.70 | Hosted alternative |
| DeepSeek R1 | $3.00 | $7.00 | Reasoning model |
| Qwen 2.5 7B | $0.30 | $0.30 | Small, fast |
Fireworks AI
Tiered pricing by model parameter count for open-weight models, plus specific pricing for popular models. Cached inputs at 50% off. On-demand GPU deployments available: A100 at $2.90/hr, H100 at $4.00/hr, H200 at $6.00/hr for teams needing dedicated capacity.
| Model / Tier | Input/MTok | Output/MTok | Notes |
|---|---|---|---|
| DeepSeek V3 | $0.56 | $1.68 | Hosted DeepSeek |
| Kimi K2.5 | $0.60 | $3.00 | Cached input: $0.10 |
| GLM-5 | $1.00 | $3.20 | Cached input: $0.20 |
| < 4B params (tier) | $0.10 | $0.10 | Any small model |
| 16B+ params (tier) | $0.90 | $0.90 | Any large model |
Cohere
Focused on enterprise RAG and search rather than general chat. Command A handles document processing with 256K context. Embed v4.0 and Rerank v4.0 are among the best retrieval models available. Less relevant for general-purpose LLM API usage, but the strongest option for search-and-retrieval pipelines.
Amazon Bedrock and Azure OpenAI
Cloud-hosted wrappers around first-party models. Bedrock serves Claude, Llama, Mistral, and Amazon Nova through AWS billing. Azure OpenAI provides OpenAI models with enterprise features: VNet integration, managed identity, content filtering. Pricing runs 1-2x the direct API cost. The value is compliance (SOC 2, HIPAA, ISO 27001), VPC integration, and consolidated cloud billing.
Mistral
Both proprietary and open-weight models. Mistral Large 3 is their flagship. Codestral targets code generation with 256K context. Devstral 2 is built for software engineering agents. Popular in Europe for EU data residency options. Open-weight Mistral models (7B, 8x7B) are available on every inference provider.
Full Pricing Table: Output Tokens Ranked by Cost
Output tokens dominate most API bills because they cost 2-5x more than input and most applications generate substantial output. This table ranks every major model by output cost.
| Model | Provider | Output/MTok | Input/MTok | Context |
|---|---|---|---|---|
| Llama 3.1 8B | Groq | $0.08 | $0.05 | 128K |
| Llama 4 Scout | Groq | $0.34 | $0.11 | 128K |
| GPT-4.1-nano | OpenAI | $0.40 | $0.10 | 1M |
| Gemini 2.5 Flash-Lite | Google | $0.40 | $0.10 | 1M |
| DeepSeek V3.2 | DeepSeek | $0.42 | $0.28 | 128K |
| Qwen3 32B | Groq | $0.59 | $0.29 | 131K |
| GPT-5.4-mini | OpenAI | $0.60 | $0.15 | 128K |
| Llama 4 Maverick | Groq | $0.60 | $0.20 | 128K |
| Llama 3.3 70B | Groq | $0.79 | $0.59 | 128K |
| Llama 4 Maverick | Together | $0.85 | $0.27 | 128K |
| GPT-4.1-mini | OpenAI | $1.60 | $0.40 | 1M |
| Gemini 2.5 Flash | Google | $2.50 | $0.30 | 1M |
| o4-mini | OpenAI | $4.40 | $1.10 | 200K |
| Claude Haiku 4.5 | Anthropic | $5.00 | $1.00 | 200K |
| GPT-4.1 | OpenAI | $8.00 | $2.00 | 1M |
| o3 | OpenAI | $8.00 | $2.00 | 200K |
| GPT-5.4 | OpenAI | $10.00 | $2.50 | 128K |
| Gemini 2.5 Pro | Google | $10.00 | $1.25 | 1M |
| Gemini 3.1 Pro | Google | $12.00 | $2.00 | 1M+ |
| Claude Sonnet 4.6 | Anthropic | $15.00 | $3.00 | 200K |
| Claude Opus 4.6 | Anthropic | $25.00 | $5.00 | 200K |
Monthly Cost at 50M Output Tokens
- DeepSeek V3.2: $21/month
- GPT-4.1-nano: $20/month
- GPT-4.1-mini: $80/month
- Gemini 2.5 Flash: $125/month
- GPT-4.1: $400/month
- Claude Sonnet 4.6: $750/month
- Claude Opus 4.6: $1,250/month
Cost Calculator: Real Workloads
Per-token prices mean nothing until you map them to actual usage patterns.
Coding Agent: 1,000 Files/Day
A coding agent reads ~50K input tokens per file (context, file contents, instructions) and generates ~5K output tokens (edits, explanations). Daily: 50M input, 5M output.
| Model | Daily Input | Daily Output | Monthly Total |
|---|---|---|---|
| Claude Opus 4.6 | $250 | $125 | $11,250 |
| Claude Sonnet 4.6 | $150 | $75 | $6,750 |
| GPT-4.1 | $100 | $40 | $4,200 |
| Gemini 2.5 Pro | $62.50 | $50 | $3,375 |
| GPT-4.1-mini | $20 | $8 | $840 |
| DeepSeek V3.2 | $14 | $2.10 | $483 |
| GPT-4.1-nano | $5 | $2 | $210 |
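Every row in these workload tables comes from the same formula: daily token volumes times per-MTok prices, times roughly 30 days. A sketch you can point at your own traffic:

```python
# Reproduce any row of the workload tables: daily input/output volumes
# in MTok, per-MTok prices, ~30 billing days per month.
def monthly_bill(in_mtok_day: float, out_mtok_day: float,
                 in_price: float, out_price: float, days: int = 30) -> float:
    return days * (in_mtok_day * in_price + out_mtok_day * out_price)

# Coding agent (50M in, 5M out per day) on Claude Sonnet 4.6:
print(monthly_bill(50, 5, 3.00, 15.00))  # -> 6750.0
# Same workload on GPT-4.1:
print(monthly_bill(50, 5, 2.00, 8.00))   # -> 4200.0
```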
Customer Support Chatbot: 10K Conversations/Day
Average conversation: 2K input tokens (system prompt + history), 500 output tokens. Daily: 20M input, 5M output.
| Model | Daily Cost | Monthly Cost |
|---|---|---|
| Claude Haiku 4.5 | $45 | $1,350 |
| GPT-4.1-mini | $16 | $480 |
| Gemini 2.5 Flash | $18.50 | $555 |
| GPT-4.1-nano | $4 | $120 |
| Gemini 2.5 Flash-Lite | $4 | $120 |
| DeepSeek V3.2 | $7.70 | $231 |
| Llama 3.1 8B @ Groq | $1.40 | $42 |
RAG Batch Processing: 1M Documents
Processing 1M documents averaging 2K tokens each for summarization. One-time batch: 2B input tokens, 200M output tokens.
| Model | Input Cost | Output Cost | Total |
|---|---|---|---|
| GPT-4.1 (batch 50% off) | $2,000 | $800 | $2,800 |
| Gemini 2.5 Flash (batch 50% off) | $300 | $250 | $550 |
| DeepSeek V3.2 | $560 | $84 | $644 |
| GPT-4.1-nano (batch 50% off) | $100 | $40 | $140 |
The 50x Cost Gap
The same coding agent workload costs $11,250/month on Claude Opus 4.6 and $210/month on GPT-4.1-nano. Both produce usable code, but Opus handles complex multi-file refactors. The production answer: route simple edits to a cheap model and complex tasks to Opus. Most teams land at $800-1,500/month with routing.
Feature Matrix
Price and speed are not the only variables. Function calling maturity, structured output guarantees, vision capabilities, and batch APIs all determine which provider fits your use case.
| Feature | OpenAI | Anthropic | Google | DeepSeek |
|---|---|---|---|---|
| Function Calling | Best (parallel, schema) | Good (parallel) | Good + grounding | Supported |
| Structured Outputs | JSON schema enforcement | JSON mode | JSON mode | JSON mode |
| Vision (Image Input) | GPT-5.4, GPT-4.1 | All current models | All Gemini models | No |
| Streaming | Yes | Yes | Yes | Yes |
| Batch API | 50% discount | 50% discount | 50% discount | No |
| Prompt Caching | 50% off auto | 90% off manual | Tiered pricing | 90% off auto |
| Max Context | 1M (GPT-4.1) | 1M (beta) | 1M+ | 128K |
| Max Output | 100K (o3) | 128K (Opus 4.6) | 65K | 64K |
| Reasoning Mode | o3, o4-mini | Extended thinking | Thinking budgets | Reasoner mode |
| Fine-tuning | GPT-5.4, GPT-4.1-mini | Not available | Gemini Flash | Not available |
| MCP Support | No | Native (Claude) | No | No |
| Self-hosting | No | No | No | Yes (open weights) |
Best Function Calling
OpenAI. Parallel function calls with JSON schema enforcement at the decoding level. GPT-4.1 and o3 handle complex multi-tool chains. Guaranteed valid JSON output reduces parsing errors and retries.
Best Long Context
Google Gemini. 1M tokens at $2.50/MTok on Flash. OpenAI offers 1M on GPT-4.1 at $8. Anthropic offers 1M via beta at $15-25. For cost-effective long-context processing, Gemini wins by 3-10x.
Best for Agents
Anthropic Claude. Native MCP support, extended thinking, 128K max output. Opus 4.6 and Sonnet 4.5 lead coding benchmarks. The only provider with a standardized tool protocol (MCP) built in.
Tool Calling and MCP
OpenAI and Anthropic have the most tested tool calling implementations. Both support parallel tool calls (multiple tools in a single turn), which is critical for agents. Anthropic has a structural advantage: native MCP support means Claude connects to external tools through a standardized interface. This matters for coding agents that read files, run commands, and search codebases.
Structured Outputs
OpenAI enforces JSON schemas at the decoding level, guaranteeing valid output. Other providers offer JSON mode where the model attempts JSON but does not guarantee schema compliance. For applications parsing LLM output programmatically, OpenAI's approach reduces retry rates.
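If you are on a JSON-mode provider without decoding-level enforcement, you end up writing the guardrail yourself. A minimal sketch of the check-and-retry loop; `call_llm` and `REQUIRED_KEYS` are placeholders for your client and your real schema check:

```python
import json

# Client-side guardrail for providers that offer JSON mode without
# schema enforcement: parse, validate, retry on failure.
# `call_llm` is a placeholder for whatever client library you use;
# the key check stands in for a full schema validator.
REQUIRED_KEYS = {"name", "price_usd"}

def parse_with_retry(call_llm, prompt: str, max_retries: int = 2) -> dict:
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry
        if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
            return data  # conforms: done
    raise ValueError("no conforming JSON after retries")
```

Each retry is a full billed request, which is exactly the cost that schema enforcement at the decoding level avoids.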
Speed and Throughput
Two metrics matter: time to first token (TTFT), which determines how fast streaming begins, and tokens per second (tok/s), which determines how fast responses complete. For interactive apps, sub-500ms TTFT feels instant. For batch processing, total throughput drives job time.
| Provider / Model | Throughput | TTFT | Best For |
|---|---|---|---|
| Groq (Llama 3.1 8B) | 840 tok/s | ~50ms | Fastest available, simple tasks |
| Groq (Qwen3 32B) | 662 tok/s | ~80ms | Quality + speed balance |
| Groq (Llama 4 Scout) | 594 tok/s | ~100ms | High throughput, MoE |
| Groq (Llama 4 Maverick) | 562 tok/s | ~100ms | Larger MoE model |
| Groq (Llama 3.3 70B) | 394 tok/s | ~150ms | Quality at speed |
| Google (Gemini 2.5 Flash) | ~200 tok/s | ~150ms | Multimodal + speed |
| OpenAI (GPT-4.1) | ~100 tok/s | ~200ms | Flagship general purpose |
| Anthropic (Claude Sonnet 4.6) | ~80 tok/s | ~300ms | Coding and reasoning |
| Anthropic (Claude Opus 4.6) | ~60 tok/s | ~500ms | Most complex tasks |
Groq dominates throughput using custom LPU hardware instead of GPUs. The tradeoff: Groq only hosts open-weight models. No GPT-4.1 or Claude on Groq infrastructure.
Reasoning models (o3, o4-mini, DeepSeek R1) are inherently slower because they generate internal thinking tokens before visible output. A 500-token response from o3 might consume 2,000+ thinking tokens internally, adding seconds of latency.
Interactive Apps: Optimize TTFT
For chatbots, code assistants, and streaming UIs, time to first token matters most. Target sub-500ms. Use Flash/Nano/Haiku tiers. Groq delivers sub-100ms on smaller models.
Batch Processing: Optimize Throughput
For document processing and offline analysis, total tok/s drives job time. Use batch APIs (50% discount from OpenAI, Google, Anthropic) or run parallel requests through inference providers.
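Both metrics are easy to measure yourself against any streaming endpoint. A sketch that works over any chunk iterator; the whitespace split is a crude stand-in for real token counts:

```python
import time

# Measure TTFT and rough throughput over any streaming chunk iterator.
# `stream` is whatever your provider's streaming API yields; splitting
# on whitespace approximates token counts well enough for comparison.
def measure(stream):
    start = time.perf_counter()
    ttft, tokens = None, 0
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        tokens += len(chunk.split())
    elapsed = max(time.perf_counter() - start, 1e-9)
    return ttft, tokens / elapsed  # (seconds, tok/s)
```

Run the same prompt against two providers and you have a like-for-like TTFT and tok/s comparison for your workload, not a vendor benchmark.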
Best API by Use Case
| Use Case | Recommended API | Why | Output Cost |
|---|---|---|---|
| Chatbot / support | GPT-4.1-mini or Gemini Flash | Good quality, low cost, fast | $1.60-2.50/MTok |
| RAG / document Q&A | Gemini 2.5 Pro or Cohere | Long context, retrieval tools | $2.50-10.00/MTok |
| Code generation | Claude Sonnet 4.6 or GPT-4.1 | Top benchmarks (SWE-Bench 82%) | $8-15/MTok |
| Code editing / diffs | Morph Fast Apply | 10,500 tok/s, 98% accuracy | $0.80-1.20/MTok |
| Codebase search | Morph WarpGrep | 0.73 F1, semantic, MCP server | Token-based |
| Agentic workflows | Claude Opus 4.6 or o3 | Extended thinking, 128K output | $8-25/MTok |
| High-volume / budget | DeepSeek V3.2 or GPT-4.1-nano | Frontier quality under $0.50/MTok | $0.40-0.42/MTok |
| Low latency | Groq (Llama 4 Scout) | 594 tok/s on custom hardware | $0.34/MTok |
| Enterprise compliance | Azure OpenAI or Bedrock | SOC 2, HIPAA, VPC, SLA | 1-2x direct cost |
| Multimodal | Gemini 2.5 Pro or GPT-5.4 | Image, video, audio input | $10/MTok |
Specialized APIs: When General-Purpose Falls Short
General-purpose LLM APIs handle most tasks. But specific high-volume operations benefit from purpose-built models. Code editing is the clearest example: a task narrow enough for specialization, high-volume enough to justify it.
Coding agents spend the majority of their compute on two operations: searching codebases for context and applying edits to files. Cognition (the team behind Devin) measured that their agent spends 60% of its time on search alone. Both operations are bottlenecked by general-purpose LLM throughput of 80-100 tok/s.
Morph Fast Apply
Code editing API at 10,500 tok/s with 98% accuracy. A 7B model with custom CUDA kernels and speculative decoding. Agents output edit snippets; Fast Apply merges them into complete files in 1-3 seconds. $0.80-1.20/MTok.
Morph WarpGrep
RL-trained semantic codebase search MCP server. 8 parallel tool calls per turn, 4 turns average, searches complete in under 6 seconds. 0.73 F1 vs 0.29 for baseline grep. Works with any MCP-compatible agent.
| Metric | Morph Fast Apply | Claude Sonnet 4.6 | GPT-4.1 |
|---|---|---|---|
| Throughput | 10,500 tok/s | ~80 tok/s | ~100 tok/s |
| Edit Accuracy | 98% | 95% | 92% |
| Cost per MTok | $0.80-1.20 | $15.00 | $8.00 |
| Task Scope | File edit application | General code + reasoning | General code + reasoning |
The tradeoff is scope. Fast Apply does one thing: merge code edits into files. It does not generate code from scratch, answer questions, or reason about architecture. You use it alongside a general-purpose LLM. The general model reasons and generates edit instructions. Fast Apply executes them at 100x the speed.
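The division of labor is simple to sketch. `plan_edit` and `apply_edit` below are placeholders standing in for a frontier-model call and a fast-apply-style merge endpoint, not real Morph or Anthropic client calls:

```python
# Two-stage edit pipeline sketch: the frontier model plans the edit
# (slow, smart), the specialized endpoint merges it (fast, narrow).
# `plan_edit` and `apply_edit` are illustrative placeholders.
def edit_file(path: str, instruction: str, plan_edit, apply_edit) -> str:
    with open(path) as f:
        original = f.read()
    snippet = plan_edit(instruction, original)  # edit instructions
    merged = apply_edit(original, snippet)      # full merged file
    with open(path, "w") as f:
        f.write(merged)
    return merged
```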
How to Choose an LLM API
Decision Framework
- Step 1: Define your quality floor. Run your actual prompts through GPT-4.1-nano ($0.40/MTok), GPT-4.1-mini ($1.60/MTok), and GPT-4.1 ($8/MTok). If the cheapest passes your quality bar, stop there.
- Step 2: Check for specialized APIs. Code editing, embeddings, reranking, and speech have purpose-built APIs that outperform general models at lower cost. Do not use a $15/MTok model for a task a $1/MTok specialized model handles better.
- Step 3: Evaluate provider constraints. Need self-hosting? DeepSeek or Llama. Compliance? Azure or Bedrock. Speed? Groq. Widest model selection? Together AI or Fireworks.
- Step 4: Test with real traffic. Benchmarks measure different tasks than your application. A/B test on actual user queries. Measure quality, latency, and cost simultaneously.
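Step 1 is a ten-line harness, not a project. A sketch; `call` and `passes` are placeholders for your client and your task-specific quality check:

```python
# Run the same prompts through each candidate tier and compare pass
# rates against your own grader. `call(model, prompt)` and
# `passes(output)` are placeholders you supply.
def pass_rate(call, passes, prompts, model: str) -> float:
    ok = sum(passes(call(model, p)) for p in prompts)
    return ok / len(prompts)

# for model in ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1"]:
#     print(model, pass_rate(call, passes, my_prompts, model))
```

If the nano tier clears your bar on your prompts, the benchmark leaderboards are irrelevant.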
The most common mistake is starting with the most expensive model and never testing cheaper alternatives. GPT-4.1-nano at $0.40/MTok handles summarization, classification, extraction, and simple chat as well as models costing 20x more. Reserve Claude Opus 4.6 and o3 for tasks where you have measured a quality difference on your specific workload.
The second most common mistake is using a general-purpose API for a specialized task. Dedicated embedding models beat a general LLM for vector search. Fast Apply outperforms any general LLM for code editing by 100x on throughput. WarpGrep outperforms grep-based search by 2.5x on F1. Match the tool to the task.
The Two-Model Pattern
The most effective production architecture is two models behind a router. Route 80-95% of traffic to a cheap model (GPT-4.1-nano, Gemini Flash-Lite, DeepSeek V3.2) and escalate complex tasks to frontier (Claude Opus 4.6, o3, Gemini 2.5 Pro). This cuts costs 60-80% compared to running everything through frontier, with minimal quality loss on the tasks that matter.
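A minimal router sketch. The model names come from this article; the complexity heuristic is an illustrative assumption — production routers typically use a trained classifier or past-failure signals rather than string matching:

```python
# Hypothetical two-model router: cheap by default, frontier on demand.
# The length/keyword heuristic below is a deliberately naive stand-in
# for a real complexity classifier.
CHEAP, FRONTIER = "gpt-4.1-nano", "claude-opus-4.6"

def pick_model(prompt: str) -> str:
    needs_frontier = len(prompt) > 8_000 or "refactor" in prompt.lower()
    return FRONTIER if needs_frontier else CHEAP

print(pick_model("Classify this support ticket"))  # -> gpt-4.1-nano
print(pick_model("Refactor the auth module"))      # -> claude-opus-4.6
```

Hosted routers (LiteLLM, OpenRouter, Portkey) give you the same pattern plus failover and cost tracking without maintaining the heuristic yourself.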
Frequently Asked Questions
What is the cheapest LLM API in 2026?
DeepSeek V3.2 at $0.28/$0.42 per MTok (input/output) is the cheapest frontier-class option. GPT-4.1-nano ($0.10/$0.40) and Gemini 2.5 Flash-Lite ($0.10/$0.40) are similarly cheap with 1M context windows. Groq's Llama 3.1 8B at $0.05/$0.08 is the absolute cheapest if you do not need frontier intelligence.
Which LLM API is best for coding?
For code generation and reasoning, Claude Sonnet 4.5 leads SWE-Bench at 82%, with Claude Opus 4.6 close behind. For code editing and diff application, Morph Fast Apply runs at 10,500 tok/s with 98% accuracy. For budget code generation, DeepSeek V3.2 and GPT-4.1 both perform well.
Which LLM API has the fastest inference?
Groq, by a wide margin. Custom LPU hardware delivers 840 tok/s on Llama 3.1 8B and 594 tok/s on Llama 4 Scout. For frontier models, Gemini Flash and GPT-4.1-nano are the fastest. For code editing specifically, Morph Fast Apply reaches 10,500 tok/s.
What is the largest context window available?
Google Gemini models support 1M+ tokens natively. OpenAI GPT-4.1 supports 1M tokens. Anthropic Claude Opus 4.6 and Sonnet 4.6 support 1M via beta header. Cohere Command A supports 256K tokens. DeepSeek V3.2 supports 128K.
Should I use one provider or multiple?
Multiple providers with routing is the production standard. Route most requests to cheap models, escalate complex tasks to frontier. LiteLLM, Portkey, and OpenRouter handle routing, failover, and cost tracking. This eliminates single-provider downtime risk and cuts costs 60-80%.
Do all LLM APIs support function calling?
All major providers support it. OpenAI has the most mature implementation with parallel calls and JSON schema enforcement. Anthropic supports parallel calls and native MCP. Google adds grounding with Search. Inference providers (Groq, Together, Fireworks) support it on select models.
When should I self-host instead of using an API?
Self-host when you need data sovereignty, consistent latency at 50+ concurrent users, or monthly API spend exceeds $5K-10K. vLLM is the standard framework (24x throughput vs naive inference). Self-hosting is only cheaper at 70%+ GPU utilization. Below that, API providers win on cost.
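The utilization claim falls out of simple arithmetic. A sketch using the $4.00/hr H100 price from the Fireworks section; the 1,000 tok/s aggregate throughput is an illustrative assumption for a batched 70B-class deployment:

```python
# Rough self-hosting cost per MTok generated. GPU dollars per hour
# divided by MTok actually produced per hour. Throughput and
# utilization figures are illustrative -- plug in your own.
def selfhost_price_per_mtok(gpu_hourly: float, tok_per_s: float,
                            utilization: float) -> float:
    mtok_per_hour = tok_per_s * 3600 / 1e6  # MTok per fully-busy hour
    return gpu_hourly / (mtok_per_hour * utilization)

print(round(selfhost_price_per_mtok(4.00, 1000, 0.70), 2))  # -> 1.59 $/MTok
print(round(selfhost_price_per_mtok(4.00, 1000, 0.15), 2))  # -> 7.41 $/MTok
```

At 70% utilization the H100 beats mid-tier API pricing; at 15% it costs more per token than GPT-4.1-mini. Idle GPU hours are the hidden line item.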
Related Resources
Code Editing at 10,500 tok/s
General-purpose LLM APIs process code edits at 80-100 tok/s. Morph Fast Apply uses a specialized 7B model with custom CUDA kernels to apply diffs at 10,500 tok/s with 98% accuracy. OpenAI-compatible API.