How to read this table
Prices are USD per 1,000,000 tokens. Most modern APIs bill at sub-cent granularity: a typical chat-completion call costs $0.0005–$0.05 depending on length and model.
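To make per-million pricing concrete, here is a minimal sketch of the arithmetic. The prices are placeholders, not quotes from the table.

```ts
// Hypothetical per-million-token prices (USD); check the table for real numbers.
const PRICE_PER_MTOK = { input: 3.0, cachedInput: 0.3, output: 15.0 };

// Cost of a single call: tokens / 1e6 * price-per-million for each bucket.
function callCost(inputTokens: number, cachedTokens: number, outputTokens: number): number {
  return (
    ((inputTokens - cachedTokens) / 1e6) * PRICE_PER_MTOK.input +
    (cachedTokens / 1e6) * PRICE_PER_MTOK.cachedInput +
    (outputTokens / 1e6) * PRICE_PER_MTOK.output
  );
}

// A 2,000-token prompt (half of it cached) with a 500-token reply:
console.log(callCost(2000, 1000, 500).toFixed(4)); // "0.0108" -> about a penny
```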
Input
What you pay per token of context sent to the model: system prompt, user message, conversation history, embedded documents, tool definitions. Anything that enters the prompt window counts as input.
Cached input
When the same prefix (system prompt, document, etc.) is reused across requests, providers offer big discounts on the cached portion. Anthropic and Google charge ~10% of input price for cached reads; OpenAI applies a 50% discount automatically. Caching is the biggest cost lever in any production app — see the deep guide.
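To show what a cache breakpoint looks like in practice, here is a minimal sketch using Anthropic's @anthropic-ai/sdk. The model id and prompt strings are placeholders, and a real prefix must meet the provider's minimum cacheable length to be stored.

```ts
import Anthropic from "@anthropic-ai/sdk";

// Placeholder: the large, stable prefix you reuse across requests.
const LONG_STATIC_SYSTEM_PROMPT = "You are a support agent. Policy document: ...";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY

const response = await client.messages.create({
  model: "claude-sonnet-4-5", // placeholder model id
  max_tokens: 512,
  system: [
    {
      type: "text",
      text: LONG_STATIC_SYSTEM_PROMPT,
      // Cache breakpoint: everything up to and including this block is cached.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "First question about the policy." }],
});

// usage shows how many prompt tokens were written to or read from the cache:
// { input_tokens, cache_creation_input_tokens, cache_read_input_tokens, output_tokens }
console.log(response.usage);
```

On the first request the prefix is written to the cache (billed at a premium on some providers); subsequent requests that reuse it byte-for-byte pay the discounted cached-read rate.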
Output
Per token of model response. Typically 3–5× the input price. Reasoning models (o3, DeepSeek R1) bill hidden reasoning tokens in this bucket: the visible response is only a fraction of the output tokens you pay for.
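You can see the hidden tokens directly: OpenAI's chat-completions usage exposes a reasoning-token breakdown for its reasoning models. A minimal sketch, with the model id and prompt as placeholders:

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY

const completion = await openai.chat.completions.create({
  model: "o3-mini", // placeholder reasoning model
  messages: [{ role: "user", content: "Plan a 3-step migration from REST to gRPC." }],
});

const usage = completion.usage!;
const reasoning = usage.completion_tokens_details?.reasoning_tokens ?? 0;

// You are billed for ALL completion tokens, but only see the non-reasoning ones.
console.log(`visible ≈ ${usage.completion_tokens - reasoning}, hidden reasoning = ${reasoning}`);
```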
Context window
Maximum tokens the model can handle in a single request. Bigger is not always better: most models suffer "lost in the middle" beyond ~50K tokens, recalling information from the middle of a long context less reliably than from the start or end. Plan for the smallest window that fits your task; you'll get better quality and lower latency.
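One way to stay inside a small budget is to trim history before every call. A minimal sketch, assuming a rough 4-characters-per-token heuristic; the names and the heuristic itself are illustrative, not exact.

```ts
type Turn = { role: "user" | "assistant"; content: string };

// Rough heuristic: ~4 characters per token for English text.
const approxTokens = (s: string): number => Math.ceil(s.length / 4);

// Keep the most recent turns that fit the budget; drop the oldest first,
// since recent turns are the ones the model recalls best.
function trimHistory(turns: Turn[], budgetTokens: number): Turn[] {
  const kept: Turn[] = [];
  let used = 0;
  for (const turn of [...turns].reverse()) {
    const t = approxTokens(turn.content);
    if (used + t > budgetTokens) break;
    kept.unshift(turn);
    used += t;
  }
  return kept;
}
```

When budgets are tight, swap the heuristic for an exact count (see the tokenizer notes in the provider list below).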
Pricing tiers explained
- Flagship — peak capability. Use for hard reasoning, novel problems, anything where quality dominates cost. Examples: Claude Opus 4.7, GPT-5, Gemini 2.5 Pro, Llama 405B.
- Balanced — the workhorse tier. 80% of flagship capability at 20% of the price for many tasks. Examples: Claude Sonnet 4.6, GPT-4.1, Gemini 2.5 Flash.
- Fast — cheap, low-latency, simpler tasks. Examples: Claude Haiku 4.5, GPT-4o-mini, Gemini 2.5 Flash-Lite. Often 10–50× cheaper than flagship for ~70% of the quality on routine tasks.
- Reasoning — models that spend extra inference-time compute on extended thinking. They consume hidden reasoning tokens before producing output, so you pay for "thinking" time. Use for hard math, complex code, multi-step planning. Effective output costs are higher than the per-token prices in this table suggest.
The 4 cost-cutting moves that actually work
- Pick the right tier. A surprising amount of production traffic doesn't need a flagship model. Run an A/B on fast vs balanced before scaling.
- Cache the static prefix. System prompts, document context, and tool schemas don't change between turns — cache them. Caching is the single biggest lever in most apps.
- Constrain output. Use structured output (JSON mode, Zod schema, tool-call format) and ask for the shortest useful response; output tokens are typically 3–5× input price. See the sketch after this list.
- Avoid reasoning models when you don't need them. The hidden token cost is real. For tasks that don't require step-by-step working, a non-reasoning model is 5–15× cheaper.
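As a sketch of moves 1 and 3 combined: a fast-tier model, JSON mode, and a hard cap on output tokens. The model id, schema, and prompts are placeholders, not recommendations.

```ts
import OpenAI from "openai";

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini", // fast tier; swap based on your own A/B results
  response_format: { type: "json_object" }, // JSON mode: no prose padding
  max_tokens: 100, // hard cap on the expensive token type
  messages: [
    {
      role: "system",
      content: "Reply with JSON only: {\"sentiment\": \"pos\" | \"neg\" | \"neutral\"}. No other keys.",
    },
    { role: "user", content: "The checkout flow is confusing but support was great." },
  ],
});

console.log(completion.choices[0].message.content); // e.g. {"sentiment":"neutral"}
```

Note that JSON mode requires the word "JSON" to appear somewhere in the prompt, which the system message above satisfies.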
Provider pricing pages (verify before you commit)
- Anthropic — Claude 4.x uses a new BPE tokenizer that consumes ~35% more tokens than older Claudes for the same text. We approximate with cl100k_base, so the API will bill roughly 35% more than the costs shown here for Claude 4.x. Verify with client.messages.countTokens() for exact numbers (see the sketch after this list).
- OpenAI — Exact tokenization via tiktoken (o200k_base for GPT-4o/4.1/5/o-series; cl100k_base for GPT-3.5/4); the sketch after this list shows a local count.
- Google — Gemini uses SentencePiece; we approximate with cl100k_base.
- Meta (via Together / Replicate) — Llama's tokenizer is a tiktoken cl100k_base derivative, so the approximation is close.
- DeepSeek
- xAI
- Mistral
- Cohere
- Groq — Hosts open-weight models on LPU chips; tokenizer matches the underlying model.
- Cerebras — Wafer-scale chip; 5–20× faster than GPU hosts on the same model.
- Alibaba (Qwen)
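For the exact counts mentioned above, here is a sketch assuming the npm tiktoken package and @anthropic-ai/sdk; the model id is a placeholder.

```ts
import { get_encoding } from "tiktoken";
import Anthropic from "@anthropic-ai/sdk";

const text = "How many tokens is this sentence?";

// OpenAI: exact local count with the o200k_base encoding (GPT-4o/4.1/5/o-series).
const enc = get_encoding("o200k_base");
console.log("OpenAI tokens:", enc.encode(text).length);
enc.free(); // the WASM encoder must be freed explicitly

// Anthropic: no public local tokenizer for Claude 4.x; count via the API instead.
const client = new Anthropic();
const { input_tokens } = await client.messages.countTokens({
  model: "claude-sonnet-4-5", // placeholder model id
  messages: [{ role: "user", content: text }],
});
console.log("Anthropic tokens:", input_tokens);
```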