What does 1M context cost?

Q: What does '1M context' actually mean?

The maximum number of tokens the model can read in one request — system prompt + user input + history + retrieved documents combined. Output is capped separately. 1 million tokens is roughly 750,000 English words, or about 6 medium-length books.

Q: Do I always pay for the full window?

No — you pay only for the tokens you send. A 100K-token request on a 1M-window model costs the same as on a 200K-window model with the same per-token rate. The 1M window is the ceiling, not a fixed price.

Q: Why is Anthropic's 1M tier priced higher than 200K?

Anthropic introduced an explicit 'long context' tier with a 35% surcharge on inputs above 200K tokens. The serving cost goes up disproportionately past that threshold (KV cache memory, attention compute), so they pass it through transparently. OpenAI and Google bundle long context at one rate.
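
A minimal sketch of how such a surcharge could be computed, assuming the 35% applies only to the tokens beyond the 200K threshold. The text above does not say whether the premium is marginal or covers the whole request, so the `whole_request` flag covers the other reading. The function name and its parameters are illustrative, not part of any provider's API; the $3.00/MTok rate is the Sonnet figure from the pricing table.

```python
def input_cost(tokens: int, rate_per_mtok: float,
               threshold: int = 200_000, surcharge: float = 0.35,
               whole_request: bool = False) -> float:
    """Input cost in dollars under a long-context surcharge (assumed billing model)."""
    if tokens <= threshold:
        # Below the threshold every token is billed at the base rate.
        return tokens * rate_per_mtok / 1_000_000
    if whole_request:
        # Alternative reading: the premium applies to every token in the request.
        return tokens * rate_per_mtok * (1 + surcharge) / 1_000_000
    # Assumed default: only the tokens past the threshold carry the premium.
    base = threshold * rate_per_mtok
    premium = (tokens - threshold) * rate_per_mtok * (1 + surcharge)
    return (base + premium) / 1_000_000


if __name__ == "__main__":
    print(round(input_cost(100_000, 3.00), 2))                        # 0.3
    print(round(input_cost(1_000_000, 3.00), 2))                      # 3.84
    print(round(input_cost(1_000_000, 3.00, whole_request=True), 2))  # 4.05
```

Under either reading, a request that stays at or below 200K tokens is unaffected by the surcharge.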

Q: Is 1M cheaper than RAG?

It depends on read repetition. If you'll query the same document set many times, RAG (split into chunks, embed, retrieve top-K) is cheaper because each query reads ~5K tokens of retrieved context, not 1M. If you query once or queries depend on full-document understanding, 1M context wins.

Q: Does prompt caching change the math?

Massively. A 1M-token context cached costs 80–90% less to read again. Loading a code base once, then querying it dozens of times, drops effective per-call cost into pennies after the first call. See the /prompt-caching guide.

Loading a million input tokens (about 750,000 English words) costs roughly $1.25 to $5.00 at the base rates in the table below, depending on the model, and more if you're paying long-context premium rates. Below: per-call math, when it beats RAG, and how prompt caching changes the picture.


1M-token loads, by model

Model	Window	Input $/MTok	1M load (uncached)	1M load (cached read)
Claude Sonnet 4.6	1000K	$3.000	$3.000	$0.300
Claude Opus 4.7	1000K	$5.000	$5.000	$0.500
Gemini 2.5 Pro	2000K	$1.250	$1.250	$0.125
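GPT-5	400K	$1.250	$1.250	$0.125

"1M load" is the cost of one input pass over 1M tokens at the base input rate, so it excludes any long-context surcharge. GPT-5's 400K window cannot hold 1M tokens in a single request, so its figure is a per-million rate rather than a one-call cost. Output tokens are billed separately at each model's output rate.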

What 1M tokens actually fits

~750,000 English words
~6 average-length books
~50,000 lines of TypeScript (~50–80% of a mid-size codebase)
~1,500 standard 8K Slack messages
~250 pages of PDFs, depending on layout

1M context vs RAG: when each wins

1M context wins when

The query needs cross-document reasoning: comparing two contracts, answering "did anything in this PR break a previous test", or summarizing a meeting alongside related Slack threads.
The corpus changes between every query, so there's no benefit to indexing.
You're querying once or twice — the cost of building and maintaining a vector index isn't amortized.
You need ground-truth citations with guaranteed coverage. RAG can miss content that doesn't match a top-K retrieval.

RAG wins when

You'll query the same corpus dozens or hundreds of times. Embedding once + retrieving 5K tokens per query is orders of magnitude cheaper than reloading 1M every time.
Latency matters. RAG retrieves in ~100ms; a 1M-context request can take 30+ seconds before first token.
The relevant subset of the corpus is small per query (<5% of total).
You need to filter by metadata (date, author, permissions) before retrieval — RAG handles this naturally.
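
The repetition argument above can be sketched as a break-even calculation. This is a rough model under stated assumptions: input cost only, no prompt caching, and a one-off `index_cost` for building the vector index, which is a hypothetical figure you would supply yourself. The 5K and 1M token counts come from the text, and the $3.00/MTok default is the Sonnet rate from the table.

```python
import math


def query_cost(context_tokens: int, rate_per_mtok: float) -> float:
    """Input cost in dollars for one query that reads `context_tokens`."""
    return context_tokens * rate_per_mtok / 1_000_000


def break_even_queries(index_cost: float,
                       long_tokens: int = 1_000_000,
                       rag_tokens: int = 5_000,
                       rate_per_mtok: float = 3.00) -> int:
    """Smallest query count at which RAG's total cost is no higher than
    reloading the full context on every query (caching ignored)."""
    saving_per_query = (query_cost(long_tokens, rate_per_mtok)
                        - query_cost(rag_tokens, rate_per_mtok))
    if saving_per_query <= 0:
        raise ValueError("RAG must read fewer tokens per query than the full load")
    return math.ceil(index_cost / saving_per_query)


if __name__ == "__main__":
    # Hypothetical $5 index build: RAG is cheaper from the second query on.
    print(break_even_queries(index_cost=5.0))  # 2
```

With cached reads in the picture the per-query saving shrinks, so the break-even point moves later; pass the cached cost of the long-context read into the same comparison to see by how much.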

Prompt caching: the changing factor

On Claude and Gemini, a previously processed 1M-token context costs 80–90% less to re-read. Using the Sonnet rates from the table above, the math becomes:

First call:  $3.00 (Sonnet, 1M write, billed here at the uncached rate)
Second call: $0.30 (Sonnet, 1M cached read)
...
After 10 calls: $0.57 average per call
After 100 calls: $0.33 average per call

At ~10+ calls against the same context, cached 1M context becomes cheaper than RAG-and-rebuild for many corpora. See our prompt-caching guide for how to structure prompts so the 1M prefix actually hits the cache.
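
The amortization can be checked with a few lines of arithmetic. This sketch assumes the first call is billed at the uncached rate and every later call at the cached-read rate, using the Sonnet and Gemini rows of the pricing table. Some providers bill cache writes at a premium, which would raise `first_call_cost`; the function name is illustrative.

```python
def average_cost_per_call(calls: int, first_call_cost: float,
                          cached_read_cost: float) -> float:
    """Mean input cost per call when only the first call pays the full load."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    total = first_call_cost + (calls - 1) * cached_read_cost
    return total / calls


if __name__ == "__main__":
    # Claude Sonnet 4.6 row: $3.00 uncached, $0.30 cached read.
    print(round(average_cost_per_call(10, 3.00, 0.30), 2))    # 0.57
    print(round(average_cost_per_call(100, 3.00, 0.30), 2))   # 0.33
    # Gemini 2.5 Pro row: $1.25 uncached, $0.125 cached read.
    print(round(average_cost_per_call(10, 1.25, 0.125), 2))   # 0.24
```

The average falls toward the cached-read price as the call count grows but never drops below it.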

Output is the other half of the bill

Output tokens aren't subject to long-context surcharges (the output cap is much smaller, typically 4K–32K). But output is billed at a higher per-token rate than input, so a long response adds a noticeable amount to the bill, and on small-context calls it can be the majority of the cost.

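Concrete: one 1M-token Claude Sonnet 4.6 call with a 4K-token response runs ~$3.00 input (at the table's base rate, before any long-context surcharge) + ~$0.06 output = ~$3.06 total, so output is ~2% of the bill. With the same 4K response, output rises to ~17% of the bill at 100K input, and it matches the input cost at about 20K input tokens.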
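
A small sketch of the output-share arithmetic, using the table's $3.00/MTok Sonnet input rate and a $15.00/MTok output rate. The output rate is an inference from the $0.06 quoted for a 4K-token response, not a figure from the table, and the long-context surcharge is left out. Function names are illustrative.

```python
def output_share(input_tokens: int, output_tokens: int,
                 in_rate_per_mtok: float, out_rate_per_mtok: float) -> float:
    """Fraction of a single call's bill that comes from output tokens."""
    in_cost = input_tokens * in_rate_per_mtok / 1_000_000
    out_cost = output_tokens * out_rate_per_mtok / 1_000_000
    return out_cost / (in_cost + out_cost)


def break_even_input_tokens(output_tokens: int,
                            in_rate_per_mtok: float,
                            out_rate_per_mtok: float) -> float:
    """Input size at which input cost equals output cost."""
    return output_tokens * out_rate_per_mtok / in_rate_per_mtok


if __name__ == "__main__":
    print(round(100 * output_share(1_000_000, 4_000, 3.00, 15.00), 1))  # 2.0
    print(round(100 * output_share(100_000, 4_000, 3.00, 15.00), 1))    # 16.7
    print(break_even_input_tokens(4_000, 3.00, 15.00))                  # 20000.0
```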

Practical patterns

Code base Q&A. Load the repo once into context, cache, query many times. Cached reads make this competitive with embedding-based code search.
Long-document summarization. One pass, no cache. Long context wins on coverage; cost is bounded.
Multi-document compare. Two contracts in one window beats two separate passes plus a third reconciliation step.
Avoid: filling a 1M window for a workload that a 32K-token request would serve. You pay for every token you send, so the extra context adds cost and latency without improving the answer.
