What prompt caching is, in one paragraph
When you call an LLM, the request includes a prefix you control: system prompt, tool definitions, retrieved documents, conversation history. If two consecutive requests share that prefix, the model's attention computation over the prefix is identical. Caching means the provider stores that computation server-side for a few minutes to a few hours; on cache hit, you pay a fraction of normal input price for the cached portion.
Anthropic — explicit caching with 5-minute and 1-hour TTLs
Anthropic's caching is opt-in via cache_control markers in your request. You mark up to 4 cache breakpoints in any combination of system, tools, and message content. Each breakpoint creates a cache prefix; subsequent requests with the same prefix before the breakpoint get the cached portion read at 10% of input price.
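A minimal sketch of a single breakpoint with the Anthropic Python SDK; the model id and prompt text are placeholders, and the 1-hour ttl variant shown in the comment is my reading of the current API (verify against the docs):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "You are a support agent for Acme Corp. ..."  # placeholder; real prompts should exceed the ~1K-token minimum cacheable length

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Breakpoint: everything up to and including this block is cached.
            # Default TTL is 5 minutes; {"type": "ephemeral", "ttl": "1h"} asks for the 1-hour tier.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)

# The usage object shows how the prefix was billed on this particular call.
print(response.usage.cache_creation_input_tokens)  # > 0 on the call that writes the cache
print(response.usage.cache_read_input_tokens)      # > 0 on subsequent cache hits
```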
| Operation | Cost (vs base input price) |
|---|---|
| Cache write (5-min TTL) | 1.25× base |
| Cache write (1-hour TTL) | 2.00× base |
| Cache read | 0.10× base (90% off) |
| Un-cached input | 1.00× base |
The break-even math
A 5-min cached prefix pays for itself from the second call, i.e. the first cache hit:
cost without cache = N × 1.0
cost with cache = 1.25 (write) + (N - 1) × 0.10
= 1.25 + 0.1N - 0.1
break-even at N where 1.0N = 1.15 + 0.1N
⇒ 0.9N = 1.15
⇒ N ≈ 1.28
But you only get a write on the first call, so:
N = 1: 1.25 vs 1.0 (cache loses)
N = 2: 1.35 vs 2.0 (cache wins by 0.65)
N = 3: 1.45 vs 3.0 (cache wins by 1.55)
For the 1-hour TTL the write surcharge is 2×, so break-even sits just above N = 2: the cache loses slightly on the second call (2.10 vs 2.00) and wins from the third call onward. Use 5-min for short bursts, 1-hour for steady traffic across the day.
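The same arithmetic as a quick sanity check in Python (pure pricing-table math, no API calls):

```python
def total_cost(n_calls: int, write_mult: float, read_mult: float = 0.10) -> float:
    """Cost of n_calls in units of one uncached prefix read: one write, then cache reads."""
    return write_mult + (n_calls - 1) * read_mult

for ttl, write_mult in [("5-min", 1.25), ("1-hour", 2.00)]:
    for n in range(1, 5):
        cached, uncached = total_cost(n, write_mult), float(n)
        verdict = "cache wins" if cached < uncached else "cache loses"
        print(f"{ttl} TTL, N={n}: {cached:.2f} vs {uncached:.2f} -> {verdict}")
```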
What to cache
The biggest hits, in order:
- System prompt + tool definitions. These are identical across every request in a session. Marking the cache breakpoint right after the tools section means every subsequent turn reads them at 10%.
- Document context for RAG. If you're stuffing retrieved documents into a long context, cache them. A 50K-token document re-read five times bills at roughly 90K token-equivalents instead of 300K, saving about 200K tokens' worth of input billing (a combined sketch follows this list).
- Few-shot examples. Static blocks of demonstrative examples cache cheaply and pay back across sessions.
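A combined sketch covering all three, again with the Anthropic SDK; the tool, prompt strings, and retrieved text are hypothetical placeholders:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are a support agent for Acme Corp. ..."   # static instructions
FEW_SHOT_EXAMPLES = "Example 1: ...\nExample 2: ..."           # static demonstrations
RETRIEVED_DOCUMENTS = "<retrieved passages for this session>"  # large, reused per session
user_question = "What does clause 7 of the contract mean?"

tools = [{
    "name": "search_orders",  # hypothetical tool
    "description": "Look up an order by id.",
    "input_schema": {"type": "object",
                     "properties": {"order_id": {"type": "string"}},
                     "required": ["order_id"]},
    # Breakpoint 1: caches every tool definition up to and including this one.
    "cache_control": {"type": "ephemeral"},
}]

system = [
    {"type": "text", "text": SYSTEM_PROMPT},
    {"type": "text", "text": FEW_SHOT_EXAMPLES,
     # Breakpoint 2: caches tools + system prompt + examples.
     "cache_control": {"type": "ephemeral"}},
]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": RETRIEVED_DOCUMENTS,
         # Breakpoint 3: caches the document context as well.
         "cache_control": {"type": "ephemeral"}},
        # The actual question comes after the last breakpoint, so it never busts the cache.
        {"type": "text", "text": user_question},
    ],
}]

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=1024,
    tools=tools,
    system=system,
    messages=messages,
)
```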
What kills the cache
- Any change to the prefix. A single character edit anywhere before the breakpoint invalidates the entire cache. Lock your system prompt and tool definitions; revise them in batches, not piecemeal.
- Reordering. If you move tools before system prompt or vice-versa, that's a new prefix.
- Time gaps. The 5-minute cache evicts after about 5 minutes without a hit (each hit refreshes the TTL). If your traffic is bursty with quiet periods >5 min, switch to the 1-hour TTL.
- Crossing model versions. Cache is per-model. Switching from Claude Sonnet 4.6 to Haiku 4.5 mid-session re-pays the cache write.
OpenAI — automatic, 50% off, no opt-in
OpenAI's caching has been automatic since late 2024. Send a prompt ≥1024 tokens; if the same prefix appeared in another request within ~5–10 minutes, the cached portion bills at 50% of input price. There's no API parameter, no cache_control, no cache IDs — it just happens.
Implications
- You don't pay a write surcharge. First call is full price; subsequent calls within the window get the discount. This means caching has no break-even penalty — even a single cache hit saves money.
- The discount is half what Anthropic offers (50% vs 90%). For very high-traffic apps with stable prefixes, Anthropic + careful breakpoints can be cheaper despite the surcharge.
- You can't tune it. If your prefix changes every request (e.g. timestamps in the system prompt, request IDs in tool descriptions), you get no discount. Stripping that variability is the only lever.
- Verify cache hits via usage.prompt_tokens_details.cached_tokens in the response (sketch below). If this is consistently 0, your prefix is varying somewhere.
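A minimal sketch of that check with the OpenAI Python SDK; the model id and prompt are placeholders, and remember the prefix must reach the 1024-token minimum before any of it is cacheable:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STABLE_PREFIX = "You are a billing assistant for Acme Corp. ..."  # keep byte-identical across requests

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model id
    messages=[
        {"role": "system", "content": STABLE_PREFIX},
        {"role": "user", "content": "Why was I charged twice this month?"},
    ],
)

details = response.usage.prompt_tokens_details
print("cached prompt tokens:", details.cached_tokens if details else 0)
```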
Google Gemini — explicit, hour-based, storage billed separately
Gemini's caching is the most explicit of the three. You POST to the /cachedContents endpoint with your context, get back a cache resource ID, then pass that ID in subsequent generateContent calls. The cache lives for a TTL you set (one hour by default, configurable per cache).
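A sketch of that flow with the google-genai Python SDK; the method and field names are my reading of the current SDK, and the model id, file, and TTL are placeholders to verify against the docs:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

with open("annual_report.txt") as f:  # hypothetical large document
    big_document = f.read()

# Create the cache resource; models reject caches below a minimum token count.
cache = client.caches.create(
    model="gemini-2.5-pro",  # placeholder model id
    config=types.CreateCachedContentConfig(
        system_instruction="Answer strictly from the attached report.",
        contents=[big_document],
        ttl="3600s",  # one hour of storage, billed per token-hour
    ),
)

# Later calls reference the cache by name instead of resending the document.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize the risk factors section.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```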
Pricing structure
- Cached input read: 25% of normal input price (75% off).
- Cache storage: $1.00 per 1M tokens per hour.
- Cache write: counted as normal input on the first call.
When Gemini caching pays off
Storage cost makes the math different from Anthropic. A 100K-token cache costs $0.10/hour just to keep alive — only worth it if you'll re-read enough times in that hour to overcome the storage:
For Gemini 2.5 Pro at $1.25/M input:
100K tokens un-cached: $0.125 per read
100K tokens cached: $0.031 per read
Storage: $0.10 per hour
Savings per read: $0.094
Storage breaks even at $0.10 / $0.094 ≈ 1.06 reads per hour.
So at 2+ reads/hour of the same large context, Gemini caching wins. Gemini caching is designed for large contexts reused over hours — full books, video transcripts, codebases. Anthropic's design is better for shorter prefixes reused rapidly — system prompts, tool definitions.
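The same break-even as a small helper, using the figures quoted above:

```python
def breakeven_reads_per_hour(tokens: int, input_price_per_m: float,
                             cached_fraction: float = 0.25,
                             storage_per_m_hour: float = 1.00) -> float:
    """Reads per hour needed before cached reads + storage beat plain uncached reads."""
    uncached_read = tokens / 1e6 * input_price_per_m
    saving_per_read = uncached_read * (1 - cached_fraction)
    storage_per_hour = tokens / 1e6 * storage_per_m_hour
    return storage_per_hour / saving_per_read

# 100K tokens at $1.25/M input, storage at $1/M/hour
print(breakeven_reads_per_hour(100_000, 1.25))  # ~1.07 (≈1.06 with the rounded figures above)
```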
Decision flow: which caching, when
| Pattern | Best fit | Why |
|---|---|---|
| Chat app, stable system prompt, dozens of turns / minute | Anthropic 5-min | 90% off reads, low surcharge amortizes fast |
| Daily-driver agent, system + tools, steady traffic | Anthropic 1-hour | Survives quiet periods, still 90% off reads |
| Bursty traffic, varying prompts, OpenAI ecosystem | OpenAI automatic | No tuning, no surcharge, but only 50% off |
| Document Q&A on a 200-page PDF used over hours | Gemini context caching | Designed for large contexts, hour-scale TTL |
| One-off requests, unique prompts | No caching | Cache write surcharge with no read = pure loss |
Common mistakes
Caching the user message
It varies every request. Caching it is paying the write surcharge with zero hit rate. Cache the static prefix only — system, tools, documents.
Forgetting timestamps invalidate
A system prompt with "Today is Friday May 1, 2026" busts the cache every day at midnight. Move the timestamp to a separate user message, or put it after the cache breakpoint.
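A before/after sketch in Anthropic's content-block format (hypothetical prompt text):

```python
from datetime import date

today = date.today().isoformat()

# Cache-hostile: the date lives inside the cached prefix, so the prefix changes every day.
system_bad = [
    {"type": "text",
     "text": f"You are a scheduling assistant. Today is {today}.",
     "cache_control": {"type": "ephemeral"}},
]

# Cache-friendly: static instructions are cached; the date travels in the uncached user turn.
system_good = [
    {"type": "text",
     "text": "You are a scheduling assistant.",
     "cache_control": {"type": "ephemeral"}},
]
messages = [
    {"role": "user", "content": f"Today is {today}. Find me a free slot next Tuesday."},
]
```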
Tool definitions with dynamic content
Some teams template tool descriptions with user-specific info. That defeats caching across users. Keep tool definitions static; pass per-user context through user messages instead.
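A sketch of the same idea; the tool and helper below are hypothetical:

```python
# Cache-hostile: the description differs per user, so no two users ever share a prefix.
def make_tool_for(user_name: str) -> dict:
    return {
        "name": "get_invoices",
        "description": f"Fetch invoices for {user_name}'s account.",
        "input_schema": {"type": "object", "properties": {}},
    }

# Cache-friendly: one static definition for everyone, cached once; per-user
# context rides in the messages, after the breakpoint.
GET_INVOICES_TOOL = {
    "name": "get_invoices",
    "description": "Fetch invoices for the authenticated user's account.",
    "input_schema": {"type": "object", "properties": {}},
    "cache_control": {"type": "ephemeral"},
}

def build_messages(user_name: str, question: str) -> list[dict]:
    return [{"role": "user", "content": f"User: {user_name}\n\n{question}"}]
```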
Mixing models in a session
Cache is per-model. A multi-model agent that switches between Sonnet and Haiku mid-session writes a new cache each time it switches.
What to verify before you celebrate the savings
- Check the response usage object for cache_read_input_tokens (Anthropic) or cached_tokens (OpenAI). If it's zero on calls you expected to hit cache, something is wrong with the prefix.
- Track cache hit rate over a week (a tracking sketch follows this list). Below 50% means your prefix isn't stable; above 90% means caching is doing its job.
- Run the math on actual traffic. The break-even is asymmetric across providers; what works for one project may lose money on another.
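A tracking sketch for the Anthropic side; it assumes the usage fields named above and that input_tokens covers only the uncached portion, which is my reading of the current accounting:

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    """Aggregate Anthropic usage objects into an observed cache hit rate."""
    calls: int = 0
    hits: int = 0
    cached_tokens: int = 0
    prompt_tokens: int = 0

    def record(self, usage) -> None:
        read = getattr(usage, "cache_read_input_tokens", 0) or 0
        written = getattr(usage, "cache_creation_input_tokens", 0) or 0
        self.calls += 1
        self.hits += 1 if read > 0 else 0
        self.cached_tokens += read
        # input_tokens excludes the cached and cache-written portions, so add them back.
        self.prompt_tokens += usage.input_tokens + read + written

    @property
    def hit_rate(self) -> float:
        return self.hits / self.calls if self.calls else 0.0

# stats = CacheStats()
# stats.record(response.usage)  # call after every messages.create()
# print(f"hit rate: {stats.hit_rate:.0%}")
```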
FAQ
Which provider has the best caching?
It depends on traffic shape. Anthropic gives 90% off reads but charges a 25% surcharge to write — it pays off as soon as the same prefix is re-read once within the TTL, i.e. from the second request. OpenAI is automatic with no opt-in and a flat 50% discount. Gemini gives 75% off reads but you must explicitly create cache objects and pay storage by the hour — best for very large contexts (≥32K tokens) reused over hours.
Does Anthropic's prompt caching work with the system prompt and tool definitions?
Yes. You can mark up to 4 cache breakpoints in any combination of system prompt, tool definitions, document content, or user messages. Each breakpoint creates a cache prefix; subsequent requests with the same prefix get cached reads. Tool definitions and system prompts are the highest-value caching targets in agentic apps.
Why might caching cost more than not caching?
Three traps: (1) you write a cache for a prompt that's only used once — you pay the 25% surcharge for nothing; (2) the cache TTL expires before the next read, so you pay the write again; (3) prompts below the minimum cacheable length (under 1024 tokens for most Claude models) simply aren't cached, so you see no savings at all. The break-even: a 5-min Anthropic cache is profitable from the first cache hit, i.e. the second request that reuses the prefix.
Does anything invalidate the cache mid-prefix?
Yes. Any change to the prefix invalidates the cache from that point forward. Order of system prompt, tool definitions, and messages matters — flip them and you lose the cache. Tool definition changes (a single character in a description) invalidate the entire cached tools section. Stable ordering and stable definitions are the entire game.
Is OpenAI's automatic caching always on?
Yes — for prompts ≥1024 tokens, OpenAI automatically detects identical prefixes within ~5-10 minutes and discounts the cached portion by 50%. There's no API parameter and no explicit cache management. The flip side: you can't tune it; if your traffic doesn't repeat prefixes within the window, you don't see the discount.
What about Gemini's context caching?
Manual and explicit: you POST to /cachedContents with your large context, get a cache ID, and reference it in subsequent generateContent calls. The cached portion reads at 25% of the normal input price; you also pay storage at $1/M tokens/hour. Designed for very large contexts (long videos, documents) reused over hours, not for shorter system prompts reused over minutes.