1M-token loads, by model
| Model | Window | Input $/MTok | 1M load (uncached) | 1M load (cached read) |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 1000K | $3.000 | $3.000 | $0.300 |
| Claude Opus 4.7 | 1000K | $5.000 | $5.000 | $0.500 |
| Gemini 2.5 Pro | 2000K | $1.250 | $1.250 | $0.125 |
| GPT-5 | 400K | $1.250 | $1.250 | $0.125 |
"1M load" is the cost of one input pass at full context. Output tokens are billed separately at each model's output rate.
What 1M tokens actually fits
- ~750,000 English words
- ~6 average-length books
- ~50,000 lines of TypeScript (~50–80% of a mid-size codebase)
- ~1,500 standard 8K Slack messages
- ~250 pages of PDFs, depending on layout
1M context vs RAG: when each wins
1M context wins when
- The query needs cross-document reasoning — comparing two contracts, answering "did anything in this PR break a previous test", summarizing a meeting alongside related slack threads.
- The corpus changes between every query, so there's no benefit to indexing.
- You're querying once or twice — the cost of building and maintaining a vector index isn't amortized.
- You need ground-truth citations with guaranteed coverage. RAG can miss content that doesn't match a top-K retrieval.
RAG wins when
- You'll query the same corpus dozens or hundreds of times. Embedding once + retrieving 5K tokens per query is orders of magnitude cheaper than reloading 1M every time.
- Latency matters. RAG retrieves in ~100ms; a 1M-context request can take 30+ seconds before first token.
- The relevant subset of the corpus is small per query (<5% of total).
- You need to filter by metadata (date, author, permissions) before retrieval — RAG handles this naturally.
Prompt caching: the changing factor
On Claude and Gemini, a previously-processed 1M-token context costs 80–90% less to re-read. The math then becomes:
First call: $1.25 (Sonnet, 1M write)
Second call: $0.13 (Sonnet, 1M cached read)
...
After 10 calls: $2.40 average per call
After 100 calls: $0.25 average per call At ~10+ calls against the same context, cached 1M context becomes cheaper than RAG-and-rebuild for many corpora. See our prompt-caching guide for how to structure prompts so the 1M prefix actually hits the cache.
Output is the other half of the bill
Output tokens aren't subject to long-context surcharges (output cap is much smaller — typically 4K–32K). But if you ask a 1M- context model for a long response, output cost is often the majority of the bill on a single call.
Concrete: one 1M-token Claude Sonnet 4.6 call with a 4K-token response runs ~$1.35 input + ~$0.06 output = $1.41 total. Output cost is ~5% on big-context loads, ~30% on small-context loads. The crossover is around 100K input.
Practical patterns
- Code base Q&A. Load the repo once into context, cache, query many times. Cached reads make this competitive with embedding-based code search.
- Long-document summarization. One pass, no cache. Long context wins on coverage; cost is bounded.
- Multi-document compare. Two contracts in one window beats two separate passes plus a third reconciliation step.
- Avoid: using 1M context for a workload where you'd otherwise use a 32K-context request. You're paying for headroom you don't need.
Related
- How prompt caching actually saves money
- System prompt cost — the silent multiplier
- Hidden costs of LLM apps
- Cheapest LLMs ranked
FAQ
What does '1M context' actually mean?
The maximum number of tokens the model can read in one request — system prompt + user input + history + retrieved documents combined. Output is capped separately. 1 million tokens is roughly 750,000 English words, or about 6 medium-length books.
Do I always pay for the full window?
No — you pay only for the tokens you send. A 100K-token request on a 1M-window model costs the same as on a 200K-window model with the same per-token rate. The 1M window is the ceiling, not a fixed price.
Why is Anthropic's 1M tier priced higher than 200K?
Anthropic introduced an explicit 'long context' tier with a 35% surcharge on inputs above 200K tokens. The serving cost goes up disproportionately past that threshold (KV cache memory, attention compute), so they pass it through transparently. OpenAI and Google bundle long context at one rate.
Is 1M cheaper than RAG?
It depends on read repetition. If you'll query the same document set many times, RAG (split into chunks, embed, retrieve top-K) is cheaper because each query reads ~5K tokens of retrieved context, not 1M. If you query once or queries depend on full-document understanding, 1M context wins.
Does prompt caching change the math?
Massively. A 1M-token context cached costs 80–90% less to read again. Loading a code base once, then querying it dozens of times, drops effective per-call cost into pennies after the first call. See the /prompt-caching guide.