The cheapest LLMs in 2026

Ranked by blended input + output cost per million tokens. List prices, no batch or caching applied — those discounts apply on top. Updated 2026-05-05.

Try the cost calculator with your own workload →

The ranking

| # | Model | Input $/MTok | Output $/MTok | Blended $/MTok |
|---|-------|--------------|---------------|----------------|
| 1 | Ministral 3B | $0.0400 | $0.0400 | $0.0400 |
| 2 | Llama 3.1 8B Instant (Groq) | $0.0500 | $0.0800 | $0.0650 |
| 3 | GPT-4.1 Nano | $0.0500 | $0.200 | $0.125 |
| 4 | Mistral Small 3 | $0.100 | $0.300 | $0.200 |
| 5 | DeepSeek V4 Flash | $0.140 | $0.280 | $0.210 |
| 6 | GPT-5 Nano | $0.0500 | $0.400 | $0.225 |
| 7 | Llama 4 Scout (Groq) | $0.110 | $0.340 | $0.225 |
| 8 | Gemini 2.5 Flash-Lite | $0.100 | $0.400 | $0.250 |
| 9 | Grok 4 Fast | $0.200 | $0.500 | $0.350 |
| 10 | GPT-4o Mini | $0.150 | $0.600 | $0.375 |
| 11 | Llama 4 Maverick (Together) | $0.270 | $0.850 | $0.560 |
| 12 | GPT-5 Mini | $0.125 | $1.000 | $0.563 |
| 13 | Codestral | $0.300 | $0.900 | $0.600 |
| 14 | Llama 3.3 70B (Groq) | $0.590 | $0.790 | $0.690 |
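
The blended column is a plain average of the input and output list prices. As a minimal sketch (the function name `blended_price` is ours, not part of any provider SDK), this reproduces the ordering for a few rows copied from the ranking:

```python
def blended_price(input_per_mtok: float, output_per_mtok: float) -> float:
    """Equal-weight blend of input and output list price, in $ per million tokens."""
    return (input_per_mtok + output_per_mtok) / 2


# A few rows copied from the ranking: (model, input $/MTok, output $/MTok).
rows = [
    ("GPT-5 Mini", 0.125, 1.000),
    ("Ministral 3B", 0.04, 0.04),
    ("GPT-4o Mini", 0.15, 0.60),
]

# Sort cheapest first by blended price.
ranked = sorted(rows, key=lambda r: blended_price(r[1], r[2]))
for model, inp, out in ranked:
    print(f"{model}: ${blended_price(inp, out):.4f}")
```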

Full pricing table →

Where the listed price hides extra cost

Tokenizer efficiency

Different tokenizers split the same text into different numbers of tokens. Anthropic's tokenizer averages ~30–40% more tokens per English character than GPT's cl100k_base / o200k_base. A model that looks 10% cheaper on $/MTok can therefore be about 25% more expensive on real text (0.90 × 1.40 ≈ 1.26).

Workaround: compare cost per character of your own text (price per token × tokens per character), not price per token alone. Our calculator handles this automatically per provider.
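
To see the effect in numbers, here is a rough sketch that converts a per-token price into a per-character price. The prices and the 0.25 tokens-per-character baseline are made-up illustration values, not measurements of any specific tokenizer; measure the ratio on a sample of your own text.

```python
def price_per_million_chars(price_per_mtok: float, tokens_per_char: float) -> float:
    """Convert a per-token list price into a price per million characters."""
    return price_per_mtok * tokens_per_char


# Hypothetical model A: $1.00/MTok, 0.25 tokens per character.
a = price_per_million_chars(1.00, 0.25)

# Hypothetical model B: 10% cheaper per token, but its tokenizer
# emits 40% more tokens for the same text.
b = price_per_million_chars(0.90, 0.25 * 1.40)

print(f"A: ${a:.3f} per million chars")
print(f"B: ${b:.3f} per million chars")
print(f"B costs {b / a - 1:.0%} more than A on the same text")
```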

Output verbosity

Models trained to be helpful tend to be wordy. The same answer can be 200 tokens from one model and 350 from another. Output is the expensive side of the bill, so a 75% jump in output tokens overwhelms small per-token savings.
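
A quick sketch of that trade-off, using the 200 vs 350 token figures from the paragraph above. The two output prices are hypothetical; the point is that a 25% lower per-token price does not survive a 75% longer answer.

```python
def output_cost(tokens: int, output_per_mtok: float) -> float:
    """Dollar cost of one answer's output tokens."""
    return tokens * output_per_mtok / 1_000_000


# Hypothetical prices: the terse model charges $0.40/MTok for output,
# the wordy one $0.30/MTok (25% cheaper per token).
terse = output_cost(200, 0.40)
wordy = output_cost(350, 0.30)

print(f"terse: ${terse:.6f} per answer")
print(f"wordy: ${wordy:.6f} per answer")
print(f"wordy model costs {wordy / terse - 1:.0%} more per answer")
```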

Function-calling overhead

Tool definitions and structured-output schemas count as input tokens and are paid on every call. A 4 KB JSON schema injected into every request adds roughly 1,000 input tokens per call (at about 4 characters per token), multiplied by your call volume. Models with native structured output (gpt-5, claude-sonnet-4-6, gemini-2-5-pro) may avoid re-paying schema tokens on each call when configured with response format constraints; check the token usage your provider reports to confirm.
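
A back-of-the-envelope sketch of that overhead. The 4-characters-per-token ratio and the call volume are assumptions for illustration; the $0.15/MTok input price is GPT-4o Mini's from the ranking.

```python
def schema_overhead_per_day(
    schema_bytes: int,
    calls_per_day: int,
    input_per_mtok: float,
    chars_per_token: float = 4.0,  # rough assumption for English/JSON text
) -> float:
    """Daily dollar cost of resending a schema as input tokens on every call."""
    schema_tokens = schema_bytes / chars_per_token
    return schema_tokens * calls_per_day * input_per_mtok / 1_000_000


# A 4,000-byte schema (~1,000 tokens) sent on 100,000 calls per day
# at $0.15 per million input tokens.
daily = schema_overhead_per_day(4000, 100_000, 0.15)
print(f"${daily:.2f} per day just for the schema")
```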

Reasoning tokens

Reasoning models (o3, o4-mini, gpt-5 with thinking) emit invisible "thinking" tokens that are billed at the output rate. A single user question can produce 5,000 reasoning tokens before the visible answer. The listed output price applies to every one of those tokens, so the effective price per visible answer is higher than the list price suggests.
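
One way to express the gap is as an effective price per visible output token. This is a sketch under assumed numbers: the 500-token visible answer and the $0.40/MTok output price are illustrative, and only the 5,000 reasoning tokens come from the paragraph above.

```python
def effective_output_price(
    visible_tokens: int, reasoning_tokens: int, output_per_mtok: float
) -> float:
    """$ per million visible output tokens once hidden reasoning tokens are billed too."""
    billed_tokens = visible_tokens + reasoning_tokens
    return output_per_mtok * billed_tokens / visible_tokens


# 500 visible tokens preceded by 5,000 reasoning tokens at $0.40/MTok output.
effective = effective_output_price(500, 5000, 0.40)
print(f"listed: $0.40/MTok, effective: ${effective:.2f}/MTok of visible output")
```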

Rate limits and reliability

A model is only cheap if you can actually use it at your volume. DeepSeek, Mistral, and Cerebras have meaningfully lower rate limits than OpenAI or Anthropic at equivalent paid tiers. Building retry logic across providers is its own cost.
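
For a sense of what that retry logic involves, here is a minimal fallback sketch. Everything in it is hypothetical: `RateLimited` stands in for whatever error your client raises on HTTP 429, and each provider is just a callable that takes a prompt and returns text.

```python
import time
from typing import Callable, Sequence


class RateLimited(Exception):
    """Hypothetical error a provider wrapper raises when it is rate-limited."""


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    retries_per_provider: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, backing off exponentially on rate limits."""
    for call in providers:
        for attempt in range(retries_per_provider):
            try:
                return call(prompt)
            except RateLimited:
                sleep(base_delay * 2**attempt)
    raise RuntimeError("all providers are rate-limited")


# Stub providers for demonstration only.
def cheap_but_limited(prompt: str) -> str:
    raise RateLimited()


def pricier_backup(prompt: str) -> str:
    return f"answer to: {prompt}"


print(call_with_fallback([cheap_but_limited, pricier_backup], "hello", sleep=lambda _s: None))
```

Note that every fallback to the pricier provider raises your real blended cost above the cheap model's list price.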

Discounts you can stack

| Discount | Typical savings | When it applies |
|----------|-----------------|-----------------|
| Batch API | 50% | Jobs that can wait up to 24h |
| Prompt caching | 50–90% | Repeated prefixes (system prompts, RAG context) |
| Volume / committed-use | 10–30% | Enterprise contracts |
| Together / Replicate hosting | varies | Open-weight models; sometimes cheaper than first-party APIs |
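
A sketch of how stacking might play out on the input side. It assumes the batch discount multiplies on top of the caching discount, which is a simplification; providers differ on whether and how the two combine, so treat the result as an estimate.

```python
def discounted_input_price(
    list_price: float,
    cached_fraction: float = 0.0,  # share of input tokens served from cache
    cache_discount: float = 0.0,   # e.g. 0.9 for 90% off cached tokens
    batch_discount: float = 0.0,   # e.g. 0.5 for a 50% batch discount
) -> float:
    """Effective input $/MTok after caching and batch discounts (assumed multiplicative)."""
    cached_part = cached_fraction * (1 - cache_discount)
    uncached_part = 1 - cached_fraction
    return list_price * (cached_part + uncached_part) * (1 - batch_discount)


# $0.15/MTok list price, 80% of input tokens cached at 90% off, run through batch at 50% off.
price = discounted_input_price(0.15, cached_fraction=0.8, cache_discount=0.9, batch_discount=0.5)
print(f"effective input price: ${price:.4f}/MTok")
```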

See the prompt caching guide for how to structure prompts to maximize cache hits, and hidden costs of LLM apps for the costs that don't appear on any pricing page.

How to actually pick

  1. Shortlist the 3 cheapest models that look likely to meet your minimum quality bar for the task.
  2. Run a quality eval — even a 50-prompt human-rated set — against each.
  3. Calculate end-to-end cost on your real workload (system prompt + history + outputs), not list price.
  4. Pick the lowest-cost model that crosses your quality bar. Re-evaluate quarterly; pricing and quality both move.
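
The steps above can be sketched as a small selection routine. The prices come from the ranking, but the quality scores, token counts, and the 0.80 bar are invented for illustration; substitute the results of your own eval and your own traffic profile.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    name: str
    input_per_mtok: float
    output_per_mtok: float
    quality: float          # score from your eval, 0..1 (invented here)
    avg_output_tokens: int  # measured on your workload (invented here)


def cost_per_request(c: Candidate, system_tokens: int, history_tokens: int) -> float:
    """End-to-end dollar cost of one request: system prompt + history in, answer out."""
    input_tokens = system_tokens + history_tokens
    total = input_tokens * c.input_per_mtok + c.avg_output_tokens * c.output_per_mtok
    return total / 1_000_000


def pick(
    candidates: Sequence[Candidate], quality_bar: float,
    system_tokens: int, history_tokens: int,
) -> Optional[Candidate]:
    """Lowest-cost candidate whose eval score crosses the quality bar, if any."""
    passing = [c for c in candidates if c.quality >= quality_bar]
    if not passing:
        return None
    return min(passing, key=lambda c: cost_per_request(c, system_tokens, history_tokens))


shortlist = [
    Candidate("Ministral 3B", 0.04, 0.04, quality=0.62, avg_output_tokens=280),
    Candidate("GPT-4.1 Nano", 0.05, 0.20, quality=0.81, avg_output_tokens=300),
    Candidate("Mistral Small 3", 0.10, 0.30, quality=0.85, avg_output_tokens=250),
]

best = pick(shortlist, quality_bar=0.80, system_tokens=1500, history_tokens=2500)
print(best.name if best else "no model meets the bar")
```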

FAQ

Why blend input + output equally?

Because real workloads vary. A RAG pipeline typically sends far more input (retrieved context) than it generates as output. Long-form generation from a short prompt is the opposite. Blending equally is a fair starting point that ranks models on raw $/MTok without favoring one workload shape. The /pricing table on this site lets you weight differently.
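
To see how much the weighting matters, take GPT-5 Nano and Llama 4 Scout (Groq), which tie at $0.225 on the equal blend. A minimal sketch (the `output_share` parameter is our own name for the fraction of billed tokens that are output):

```python
def weighted_blend(input_per_mtok: float, output_per_mtok: float, output_share: float) -> float:
    """Blend $/MTok by the share of tokens that are output (0.5 = equal blend)."""
    return input_per_mtok * (1 - output_share) + output_per_mtok * output_share


# List prices from the ranking: (input, output) in $/MTok.
nano = (0.05, 0.40)   # GPT-5 Nano
scout = (0.11, 0.34)  # Llama 4 Scout (Groq)

for label, share in [("input-heavy", 0.1), ("equal", 0.5), ("output-heavy", 0.9)]:
    n = weighted_blend(*nano, share)
    s = weighted_blend(*scout, share)
    print(f"{label:>12}: GPT-5 Nano ${n:.3f}  Llama 4 Scout ${s:.3f}")
```

With these prices, the input-heavy workload favors GPT-5 Nano and the output-heavy one favors Llama 4 Scout, even though the equal blend calls it a tie.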

Why not just sort by input price?

Output price is usually 3–5x input price. A model that's cheap on input but expensive on output (the GPT-5 family charges 8x more for output than for input in the ranking above) ranks worse for chat than a model with balanced pricing.

Are these list prices? What about batch / cached?

List prices, no batch or caching applied. Batch APIs typically halve cost for jobs that can wait. Prompt caching cuts input cost 50–90% for repeated prefixes. Both are stackable on most providers; see the /prompt-caching guide.

Why is DeepSeek so cheap?

DeepSeek's MoE (mixture-of-experts) architecture activates only a fraction of its parameters for each token, which lets it serve a 670B-parameter model at the per-token cost of a much smaller dense model. China-based inference economics also help. The catch: data residency and rate-limit policies differ from US-hosted providers.

Is the cheapest model always the right choice?

No. Latency, tokens-per-second, intelligence, JSON-mode quality, function-calling reliability, and tokenizer efficiency all matter. A 'cheap' model that produces 30% more tokens for the same answer ends up more expensive than the listed price suggests.

Open the calculator →