The ranking
| # | Model | Input $/MTok | Output $/MTok | Blended |
|---|---|---|---|---|
| 1 | Ministral 3B | $0.0400 | $0.0400 | $0.0400 |
| 2 | Llama 3.1 8B Instant (Groq) | $0.0500 | $0.0800 | $0.0650 |
| 3 | GPT-4.1 Nano | $0.0500 | $0.200 | $0.125 |
| 4 | Mistral Small 3 | $0.100 | $0.300 | $0.200 |
| 5 | DeepSeek V4 Flash | $0.140 | $0.280 | $0.210 |
| 6 | GPT-5 Nano | $0.0500 | $0.400 | $0.225 |
| 7 | Llama 4 Scout (Groq) | $0.110 | $0.340 | $0.225 |
| 8 | Gemini 2.5 Flash-Lite | $0.100 | $0.400 | $0.250 |
| 9 | Grok 4 Fast | $0.200 | $0.500 | $0.350 |
| 10 | GPT-4o Mini | $0.150 | $0.600 | $0.375 |
| 11 | Llama 4 Maverick (Together) | $0.270 | $0.850 | $0.560 |
| 12 | GPT-5 Mini | $0.125 | $1.000 | $0.563 |
| 13 | Codestral | $0.300 | $0.900 | $0.600 |
| 14 | Llama 3.3 70B (Groq) | $0.590 | $0.790 | $0.690 |
Where the listed price hides extra cost
Tokenizer efficiency
Different tokenizers split the same text into different numbers
of tokens. Anthropic's tokenizer averages ~30–40% more tokens
per English character than GPT's cl100k_base /
o200k_base. A model that looks 10% cheaper on
$/MTok can be 25% more expensive on real text.
Workaround: estimate cost on tokens-per-character, not tokens. Our calculator handles this automatically per provider.
Output verbosity
Models trained to be helpful tend to be wordy. The same answer can be 200 tokens from one model and 350 from another. Output is the expensive side of the bill, so a 75% jump in output tokens overwhelms small per-token savings.
Function-calling overhead
Tools and structured-output schemas count as input tokens, paid every call. A 4 KB JSON schema injected into every request adds ~1000 tokens × your call rate. Models with native structured output (gpt-5, claude-sonnet-4-6, gemini-2-5-pro) handle this without re-paying schema tokens on each call when configured with response format constraints.
Reasoning tokens
Reasoning models (o3, o4-mini, gpt-5 with thinking) emit invisible "thinking" tokens that are billed at output rate. A single user question can produce 5,000 reasoning tokens before the visible answer. Listed price applies; effective price is higher.
Rate limits and reliability
A model is only cheap if you can actually use it at your volume. DeepSeek, Mistral, and Cerebras have meaningfully lower rate limits than OpenAI or Anthropic at equivalent paid tiers. Building retry logic across providers is its own cost.
Discounts you can stack
| Discount | Typical savings | When it applies |
|---|---|---|
| Batch API | 50% | Jobs that can wait up to 24h |
| Prompt caching | 50–90% | Repeated prefixes (system prompts, RAG context) |
| Volume / committed-use | 10–30% | Enterprise contracts |
| Together / Replicate hosting | varies | Open-weight models — sometimes cheaper than first-party APIs |
See the prompt caching guide for how to structure prompts to maximize cache hits, and hidden costs of LLM apps for the costs that don't appear on any pricing page.
How to actually pick
- Pick the cheapest 3 models that meet your minimum quality bar for the task.
- Run a quality eval — even a 50-prompt human-rated set — against each.
- Calculate end-to-end cost on your real workload (system prompt + history + outputs), not list price.
- Pick the lowest-cost model that crosses your quality bar. Re-evaluate quarterly; pricing and quality both move.
Related
- Cheapest LLM for chatbots — workload-specific math
- Claude vs GPT pricing
- LLM pricing changes — May 2026
- Prompt caching — how to actually save money
FAQ
Why blend input + output equally?
Because real workloads vary. Pure RAG generates ~5x more output than input. A summarizer is the opposite. Blending equally is a fair starting point that ranks models on raw $/MTok rather than a workload-specific bias. The /pricing table on this site lets you weight differently.
Why not just sort by input price?
Output price is usually 3-5x input. A model that's cheap on input but expensive on output (the GPT-5 family does this) ranks worse for chat than a model with balanced pricing.
Are these list prices? What about batch / cached?
List prices, no batch or caching applied. Batch APIs typically halve cost for jobs that can wait. Prompt caching cuts input cost 50–90% for repeated prefixes. Both are stackable on most providers; see the /prompt-caching guide.
Why is DeepSeek so cheap?
DeepSeek's MoE architecture lets it serve a 670B-parameter model with the per-token cost of a much smaller dense model. Plus China-based inference economics. The catch: data residency and rate-limit policies differ from US-hosted providers.
Is the cheapest model always the right choice?
No. Latency, tokens-per-second, intelligence, JSON-mode quality, function-calling reliability, and tokenizer efficiency all matter. A 'cheap' model that produces 30% more tokens for the same answer ends up more expensive than the listed price suggests.