How to read this table
Prices are USD per 1,000,000 tokens. Most modern APIs bill at sub-cent granularity: a typical chat-completion call costs $0.0005–$0.05 depending on length and model.
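To make per-million pricing concrete, here is a minimal sketch of the arithmetic. The prices are placeholders, not quotes from the table.

```ts
// Hypothetical per-million-token prices (USD); check the table for real numbers.
const PRICE_PER_MTOK = { input: 3.0, cachedInput: 0.3, output: 15.0 };

// Cost of a single call: tokens / 1e6 * price-per-million for each bucket.
function callCost(inputTokens: number, cachedTokens: number, outputTokens: number): number {
  return (
    ((inputTokens - cachedTokens) / 1e6) * PRICE_PER_MTOK.input +
    (cachedTokens / 1e6) * PRICE_PER_MTOK.cachedInput +
    (outputTokens / 1e6) * PRICE_PER_MTOK.output
  );
}

// A 2,000-token prompt (half of it cached) with a 500-token reply:
console.log(callCost(2000, 1000, 500).toFixed(4)); // "0.0108" -> about a penny
```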
Input
What you pay per token of context sent to the model: system prompt, user message, conversation history, embedded documents, tool definitions. Anything that enters the prompt window counts as input.
Cached input
When the same prefix (system prompt, document, etc.) is reused across requests, providers offer big discounts on the cached portion. Anthropic and Google charge ~10% of input price for cached reads; OpenAI applies a 50% discount automatically. Caching is the biggest cost lever in any production app — see the deep guide.
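To show what a cache breakpoint looks like in practice, here is a minimal sketch using Anthropic's @anthropic-ai/sdk. The model id and prompt strings are placeholders, and a real prefix must meet the provider's minimum cacheable length to be stored.

```ts
import Anthropic from "@anthropic-ai/sdk";

// Placeholder: the large, stable prefix you reuse across requests.
const LONG_STATIC_SYSTEM_PROMPT = "You are a support agent. Policy document: ...";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY

const response = await client.messages.create({
  model: "claude-sonnet-4-5", // placeholder model id
  max_tokens: 512,
  system: [
    {
      type: "text",
      text: LONG_STATIC_SYSTEM_PROMPT,
      // Cache breakpoint: everything up to and including this block is cached.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "First question about the policy." }],
});

// usage shows how many prompt tokens were written to or read from the cache:
// { input_tokens, cache_creation_input_tokens, cache_read_input_tokens, output_tokens }
console.log(response.usage);
```

On the first request the prefix is written to the cache (billed at a premium on some providers); subsequent requests that reuse it byte-for-byte pay the discounted cached-read rate.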
Output
Per token of model response. Typically 3–5× the input price. Reasoning models (o3, DeepSeek R1) bill hidden reasoning tokens in this bucket: the visible response is only a fraction of the output tokens you pay for.
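You can see the hidden tokens directly: OpenAI's chat-completions usage exposes a reasoning-token breakdown for its reasoning models. A minimal sketch, with the model id and prompt as placeholders:

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY

const completion = await openai.chat.completions.create({
  model: "o3-mini", // placeholder reasoning model
  messages: [{ role: "user", content: "Plan a 3-step migration from REST to gRPC." }],
});

const usage = completion.usage!;
const reasoning = usage.completion_tokens_details?.reasoning_tokens ?? 0;

// You are billed for ALL completion tokens, but only see the non-reasoning ones.
console.log(`visible ≈ ${usage.completion_tokens - reasoning}, hidden reasoning = ${reasoning}`);
```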
Context window
Maximum tokens the model can handle in a single request. Bigger is not always better: most models suffer "lost in the middle" beyond ~50K tokens, recalling information from the middle of a long context less reliably than from the start or end. Plan for the smallest window that fits your task; you'll get better quality and lower latency.
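One way to stay inside a small budget is to trim history before every call. A minimal sketch, assuming a rough 4-characters-per-token heuristic; the names and the heuristic itself are illustrative, not exact.

```ts
type Turn = { role: "user" | "assistant"; content: string };

// Rough heuristic: ~4 characters per token for English text.
const approxTokens = (s: string): number => Math.ceil(s.length / 4);

// Keep the most recent turns that fit the budget; drop the oldest first,
// since recent turns are the ones the model recalls best.
function trimHistory(turns: Turn[], budgetTokens: number): Turn[] {
  const kept: Turn[] = [];
  let used = 0;
  for (const turn of [...turns].reverse()) {
    const t = approxTokens(turn.content);
    if (used + t > budgetTokens) break;
    kept.unshift(turn);
    used += t;
  }
  return kept;
}
```

When budgets are tight, swap the heuristic for an exact count (see the tokenizer notes in the provider list below).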
Pricing tiers explained
- Flagship — peak capability. Use for hard reasoning, novel problems, anything where quality dominates cost. Examples: Claude Opus 4.7, GPT-5, Gemini 2.5 Pro, Llama 405B.
- Balanced — the workhorse tier. 80% of flagship capability at 20% of the price for many tasks. Examples: Claude Sonnet 4.6, GPT-4.1, Gemini 2.5 Flash.
- Fast — cheap, low-latency, simpler tasks. Examples: Claude Haiku 4.5, GPT-4o-mini, Gemini 2.5 Flash-Lite. Often 10–50× cheaper than flagship for ~70% of the quality on routine tasks.
- Reasoning — models that spend extra inference-time compute on extended thinking. They consume hidden reasoning tokens before producing output, so you pay for "thinking" time. Use for hard math, complex code, multi-step planning. Effective output costs are higher than the per-token prices in this table suggest.
The 4 cost-cutting moves that actually work
- Pick the right tier. A surprising amount of production traffic doesn't need a flagship model. Run an A/B on fast vs balanced before scaling.
- Cache the static prefix. System prompts, document context, and tool schemas don't change between turns — cache them. Caching is the single biggest lever in most apps.
- Constrain output. Use structured output (JSON mode, Zod schema, tool-call format) and ask for the shortest useful response; output tokens are typically 3–5× input price. See the sketch after this list.
- Avoid reasoning models when you don't need them. The hidden token cost is real. For tasks that don't require step-by-step working, a non-reasoning model is 5–15× cheaper.
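As a sketch of moves 1 and 3 combined: a fast-tier model, JSON mode, and a hard cap on output tokens. The model id, schema, and prompts are placeholders, not recommendations.

```ts
import OpenAI from "openai";

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini", // fast tier; swap based on your own A/B results
  response_format: { type: "json_object" }, // JSON mode: no prose padding
  max_tokens: 100, // hard cap on the expensive token type
  messages: [
    {
      role: "system",
      content: "Reply with JSON only: {\"sentiment\": \"pos\" | \"neg\" | \"neutral\"}. No other keys.",
    },
    { role: "user", content: "The checkout flow is confusing but support was great." },
  ],
});

console.log(completion.choices[0].message.content); // e.g. {"sentiment":"neutral"}
```

Note that JSON mode requires the word "JSON" to appear somewhere in the prompt, which the system message above satisfies.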
Provider pricing pages (verify before you commit)
- Anthropic — Claude 4.x uses a new BPE tokenizer that consumes ~35% more tokens than older Claudes for the same text. We approximate with cl100k_base, so the API will bill roughly 35% more than the costs shown here for Claude 4.x. Verify with client.messages.countTokens() for exact numbers (see the sketch after this list).
- OpenAI — Exact tokenization via tiktoken (o200k_base for GPT-4o/4.1/5/o-series; cl100k_base for GPT-3.5/4); the sketch after this list shows a local count.
- Google — Gemini uses SentencePiece; we approximate with cl100k_base.
- Meta (via Together / Replicate) — Llama's tokenizer is a tiktoken cl100k_base derivative, so the approximation is close.
- DeepSeek
- xAI
- Mistral
- Cohere
- Groq — Hosts open-weight models on LPU chips; tokenizer matches the underlying model.
- Cerebras — Wafer-scale chip; 5–20× faster than GPU hosts on the same model.
- Alibaba (Qwen)
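For the exact counts mentioned above, here is a sketch assuming the npm tiktoken package and @anthropic-ai/sdk; the model id is a placeholder.

```ts
import { get_encoding } from "tiktoken";
import Anthropic from "@anthropic-ai/sdk";

const text = "How many tokens is this sentence?";

// OpenAI: exact local count with the o200k_base encoding (GPT-4o/4.1/5/o-series).
const enc = get_encoding("o200k_base");
console.log("OpenAI tokens:", enc.encode(text).length);
enc.free(); // the WASM encoder must be freed explicitly

// Anthropic: no public local tokenizer for Claude 4.x; count via the API instead.
const client = new Anthropic();
const { input_tokens } = await client.messages.countTokens({
  model: "claude-sonnet-4-5", // placeholder model id
  messages: [{ role: "user", content: text }],
});
console.log("Anthropic tokens:", input_tokens);
```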