What does 1M context cost?

Loading a million input tokens (about 750,000 English words) costs roughly $1.25 to $5.00 per uncached pass, depending on the model and whether a long-context premium rate applies. Below: per-call math, when it beats RAG, and how prompt caching changes the picture.

1M-token loads, by model

Model              Window  Input $/MTok  1M load (uncached)  1M load (cached read)
Claude Sonnet 4.6  1000K   $3.000        $3.000              $0.300
Claude Opus 4.7    1000K   $5.000        $5.000              $0.500
Gemini 2.5 Pro     2000K   $1.250        $1.250              $0.125
GPT-5              400K    $1.250        $1.250              $0.125

"1M load" is the cost of one input pass at full context. Output tokens are billed separately at each model's output rate.

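The per-load figures in the table follow directly from the per-token rate. A minimal Python sketch of that arithmetic, using only the rates listed in the table; the names `RATES_USD_PER_MTOK` and `input_cost` are illustrative and not part of any provider's SDK:

```python
# USD per million input tokens, copied from the table above.
RATES_USD_PER_MTOK = {
    "Claude Sonnet 4.6": {"uncached": 3.00, "cached_read": 0.30},
    "Claude Opus 4.7": {"uncached": 5.00, "cached_read": 0.50},
    "Gemini 2.5 Pro": {"uncached": 1.25, "cached_read": 0.125},
    "GPT-5": {"uncached": 1.25, "cached_read": 0.125},
}


def input_cost(tokens: int, usd_per_mtok: float) -> float:
    """Cost in USD of one input pass over `tokens` tokens."""
    return tokens / 1_000_000 * usd_per_mtok


# Print the uncached and cached-read cost of a 1M-token load per model.
for model, rate in RATES_USD_PER_MTOK.items():
    uncached = input_cost(1_000_000, rate["uncached"])
    cached = input_cost(1_000_000, rate["cached_read"])
    print(f"{model}: ${uncached:.3f} uncached, ${cached:.3f} cached read")
```

Because the cost is linear in tokens, the same function prices smaller requests: `input_cost(100_000, 3.00)` is one tenth of the 1M figure.
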
What 1M tokens actually fits

1M context vs RAG: when each wins

1M context wins when

RAG wins when

Prompt caching: the changing factor

On Claude and Gemini, re-reading a previously processed 1M-token context costs 80–90% less than the first pass. Using the Sonnet rates from the table above, and ignoring any cache-write surcharge or cache expiry, the math becomes:

First call:  $3.00 (Sonnet, 1M uncached)
Second call: $0.30 (Sonnet, 1M cached read)
...
After 10 calls: $0.57 average per call
After 100 calls: $0.33 average per call

At ~10+ calls against the same context, cached 1M context becomes cheaper than RAG-and-rebuild for many corpora. See our prompt-caching guide for how to structure prompts so the 1M prefix actually hits the cache.
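The amortization can be sketched in a few lines of Python. This is a simplified model, not a billing formula: it assumes the first call is billed at the uncached rate from the table ($3.00 for Sonnet 4.6), that every later call is a cache hit at the cached-read rate ($0.30), and that there is no cache-write surcharge or cache expiry. The function name is illustrative.

```python
def average_cost_per_call(calls: int, first_call_usd: float, cached_read_usd: float) -> float:
    """Average input cost per call when only the first call misses the cache.

    Assumption: no cache-write surcharge and no cache expiry between calls.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    total = first_call_usd + (calls - 1) * cached_read_usd
    return total / calls


# Sonnet 4.6 rates from the table: $3.00 uncached, $0.30 cached read per 1M load.
for n in (1, 10, 100):
    print(f"{n} calls: ${average_cost_per_call(n, 3.00, 0.30):.2f} average per call")
```

As the call count grows, the average approaches the cached-read rate but never drops below it.
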

Output is the other half of the bill

Output tokens aren't subject to long-context surcharges, and the output cap is much smaller than the input window (typically 4K–32K tokens). On a 1M-token call the input therefore dominates the bill; output only becomes a large share of the cost when the input is small.

Concrete: one 1M-token Claude Sonnet 4.6 call with a 4K-token response runs ~$3.00 input (at the table rate, before any long-context surcharge) + ~$0.06 output = ~$3.06 total, so output is ~2% of the bill. With the same 4K response, output is ~17% of the bill at 100K input and reaches half the bill at around 20K input.
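To see how the output share moves with input size, here is a small Python sketch. It assumes the table's $3.00/MTok input rate for Sonnet 4.6 and an output rate of $15/MTok, which is inferred from the $0.06 figure for a 4K-token response rather than stated in the table; the function name is illustrative.

```python
def output_share(input_tokens: int, output_tokens: int,
                 input_usd_per_mtok: float, output_usd_per_mtok: float) -> float:
    """Fraction of a single call's bill that comes from output tokens."""
    input_usd = input_tokens / 1_000_000 * input_usd_per_mtok
    output_usd = output_tokens / 1_000_000 * output_usd_per_mtok
    return output_usd / (input_usd + output_usd)


# Same 4K-token response at three input sizes.
# Assumed rates: $3.00/MTok input (table), $15.00/MTok output (inferred).
for input_tokens in (1_000_000, 100_000, 20_000):
    share = output_share(input_tokens, 4_000, 3.00, 15.00)
    print(f"{input_tokens:>9,} input tokens: output is {share:.1%} of the bill")
```

Under these assumed rates, input and output cost the same when the input is five times the output length, because the output rate is five times the input rate.
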

Practical patterns

  1. Code base Q&A. Load the repo once into context, cache, query many times. Cached reads make this competitive with embedding-based code search.
  2. Long-document summarization. One pass, no cache. Long context wins on coverage; cost is bounded.
  3. Multi-document compare. Two contracts in one window beats two separate passes plus a third reconciliation step.
  4. Avoid: sending 1M tokens for a workload that 32K tokens of relevant context would serve. You pay for every input token you send, so the extra context is cost without benefit.

FAQ

What does '1M context' actually mean?

The maximum number of tokens the model can read in one request — system prompt + user input + history + retrieved documents combined. Output is capped separately. 1 million tokens is roughly 750,000 English words, or about 6 medium-length books.

Do I always pay for the full window?

No — you pay only for the tokens you send. A 100K-token request on a 1M-window model costs the same as on a 200K-window model with the same per-token rate. The 1M window is the ceiling, not a fixed price.

Why is Anthropic's 1M tier priced higher than 200K?

Anthropic introduced an explicit 'long context' tier with a 35% surcharge on inputs above 200K tokens. The serving cost goes up disproportionately past that threshold (KV cache memory, attention compute), so they pass it through transparently. OpenAI and Google bundle long context at one rate.
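A hedged sketch of how a threshold surcharge like this could be computed. The 200K threshold and 35% figure come from the answer above; whether the surcharge applies only to the tokens beyond the threshold or to the whole request is not stated here, so the hypothetical `whole_request` flag covers both readings. Check the provider's pricing page before relying on either.

```python
def long_context_input_cost(tokens: int, base_usd_per_mtok: float,
                            threshold: int = 200_000, surcharge: float = 0.35,
                            whole_request: bool = False) -> float:
    """Input cost in USD with a long-context surcharge past `threshold` tokens.

    whole_request=False: only tokens beyond the threshold carry the surcharge.
    whole_request=True:  crossing the threshold reprices every input token.
    Both readings are assumptions; the text above does not say which applies.
    """
    if tokens <= threshold:
        return tokens / 1_000_000 * base_usd_per_mtok
    premium_rate = base_usd_per_mtok * (1 + surcharge)
    if whole_request:
        return tokens / 1_000_000 * premium_rate
    below = threshold / 1_000_000 * base_usd_per_mtok
    above = (tokens - threshold) / 1_000_000 * premium_rate
    return below + above


# 1M tokens at a $3.00/MTok base rate under each reading.
print(f"marginal:      ${long_context_input_cost(1_000_000, 3.00):.2f}")
print(f"whole request: ${long_context_input_cost(1_000_000, 3.00, whole_request=True):.2f}")
```
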

Is 1M cheaper than RAG?

It depends on read repetition. If you'll query the same document set many times, RAG (split into chunks, embed, retrieve top-K) is cheaper because each query reads ~5K tokens of retrieved context, not 1M. If you query once or queries depend on full-document understanding, 1M context wins.
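A rough Python comparison of input spend over many queries, assuming a $3.00/MTok input rate, ~5K retrieved tokens per RAG query, and a full 1M-token load otherwise. The `one_time_usd` parameter is a hypothetical placeholder for up-front costs such as the first uncached load or building an index; embedding costs and answer quality are not modeled.

```python
def total_input_cost(queries: int, tokens_per_query: int,
                     usd_per_mtok: float, one_time_usd: float = 0.0) -> float:
    """Total input spend in USD across `queries` calls plus any one-time cost."""
    return one_time_usd + queries * tokens_per_query / 1_000_000 * usd_per_mtok


queries = 100

# RAG: ~5K retrieved tokens per query (index-building cost left at zero here).
rag = total_input_cost(queries, 5_000, 3.00)

# Full context, no caching: every query re-reads 1M tokens.
full_uncached = total_input_cost(queries, 1_000_000, 3.00)

# Full context, cached: one uncached load, then 99 reads at $0.30 per 1M tokens.
full_cached = total_input_cost(queries - 1, 1_000_000, 0.30, one_time_usd=3.00)

print(f"RAG:               ${rag:.2f}")
print(f"1M, uncached:      ${full_uncached:.2f}")
print(f"1M, cached reads:  ${full_cached:.2f}")
```

On read cost alone RAG stays cheapest under these assumptions; the comparison shifts once index rebuilds or full-document understanding enter the picture.
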

Does prompt caching change the math?

Massively. A cached 1M-token context costs 80–90% less to read again. Loading a code base once, then querying it dozens of times, drops the effective per-call cost to pennies after the first call. See the prompt-caching guide.