
Prompt caching deep dive

Prompt caching is the single biggest cost lever in production LLM apps and the most under-used. Done right, it cuts your bill 60–80% with no quality regression. Done wrong, it quietly increases costs and you wonder why caching "didn't work." Here's how each provider's caching actually works, the math to run before you turn it on, and the traps that catch most teams.

See cached vs un-cached cost in the calculator →

What prompt caching is, in one paragraph

When you call an LLM, the request includes a prefix you control: system prompt, tool definitions, retrieved documents, conversation history. If two consecutive requests share that prefix, the model's attention computation over the prefix is identical. Caching means the provider stores that computation server-side for a few minutes to a few hours; on cache hit, you pay a fraction of normal input price for the cached portion.

Anthropic — explicit caching with 5-minute and 1-hour TTLs

Anthropic's caching is opt-in via cache_control markers in your request. You mark up to 4 cache breakpoints in any combination of system, tools, and message content. Each breakpoint creates a cache prefix; subsequent requests with the same prefix before the breakpoint get the cached portion read at 10% of input price.
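As a concrete sketch, here is how cache breakpoints might be placed in a Messages API request body. The model name is illustrative and the exact field shapes should be checked against Anthropic's current API reference:

```python
# Builds an Anthropic-style Messages request with cache breakpoints on the
# static prefix. Everything up to each cache_control marker becomes cacheable.
def build_request(system_prompt: str, tools: list, user_msg: str) -> dict:
    return {
        "model": "claude-sonnet-4",  # illustrative model name
        "max_tokens": 1024,
        # Breakpoint 1: cache the tool definitions (5-minute TTL).
        "tools": tools[:-1] + [
            {**tools[-1], "cache_control": {"type": "ephemeral"}}
        ],
        # Breakpoint 2: cache the system prompt too.
        "system": [{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        # The user message varies per request and stays outside the cache.
        "messages": [{"role": "user", "content": user_msg}],
    }
```

If the next request sends byte-identical tools and system text, the whole marked prefix is read at 0.10× input price.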

Operation                  Cost (vs base input price)
Cache write (5-min TTL)    1.25× base
Cache write (1-hour TTL)   2.00× base
Cache read                 0.10× base (90% off)
Un-cached input            1.00× base

The break-even math

A 5-min cached prefix pays for itself by the second request:

cost without cache  = N × 1.0
cost with cache     = 1.25 (write) + (N - 1) × 0.10
                    = 1.25 + 0.1N - 0.1

break-even at N where 1.0N = 1.15 + 0.1N
            ⇒ 0.9N = 1.15
            ⇒ N ≈ 1.28

But you only get a write on the first call, so:
N = 1: 1.25 vs 1.0 (cache loses)
N = 2: 1.35 vs 2.0 (cache wins by 0.65)
N = 3: 1.45 vs 3.0 (cache wins by 1.55)

For the 1-hour TTL the surcharge is 2×, so break-even moves to N ≈ 2.1 and the cache wins from the third request. Use 5-min for short bursts, 1-hour for steady traffic across the day.
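The break-even arithmetic can be checked directly. A back-of-envelope sketch, with multipliers taken from the pricing table above:

```python
def cached_cost(n: int, write_mult: float, read_mult: float = 0.10) -> float:
    """Relative cost of n identical-prefix requests: one cache write, then reads."""
    return write_mult + (n - 1) * read_mult

def break_even(write_mult: float) -> int:
    """Smallest request count at which caching beats paying full price every time."""
    n = 1
    while cached_cost(n, write_mult) >= n * 1.0:
        n += 1
    return n

# break_even(1.25) == 2  (5-min TTL)
# break_even(2.00) == 3  (1-hour TTL)
```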

What to cache

The biggest hits, in order:

  1. System prompt + tool definitions. These are identical across every request in a session. Marking the cache breakpoint right after the tools section means every subsequent turn reads them at 10%.
  2. Document context for RAG. If you're stuffing retrieved documents into a long context, cache them. A 50K-token document re-read five times bills about 87.5K tokens' worth (one 1.25× write plus five 0.1× reads) instead of 300K, saving roughly 200K tokens of billing.
  3. Few-shot examples. Static blocks of demonstrative examples cache cheaply and pay back across sessions.

What kills the cache

Any change to the prefix invalidates the cache from that point forward. The usual killers:

  1. Unstable ordering. System prompt, tools, and messages must arrive in the same order on every request.
  2. Edited tool definitions. A single changed character in a description invalidates the entire cached tools section.
  3. Dynamic content in the prefix. Timestamps, user IDs, or per-request metadata before the breakpoint guarantee a miss.

OpenAI — automatic, 50% off, no opt-in

OpenAI's caching has been automatic since late 2024. Send a prompt ≥1024 tokens; if the same prefix appeared in another request within ~5–10 minutes, the cached portion bills at 50% of input price. There's no API parameter, no cache_control, no cache IDs — it just happens.

Implications

With no write surcharge, OpenAI's caching can never cost you more than not caching; the trade-off is that you can't tune it either. Keep static content (system prompt, tool definitions) at the front of the prompt so prefixes actually repeat, and don't expect a discount on prompts under 1024 tokens or on traffic that doesn't repeat a prefix within the ~5–10 minute window.
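The billing effect is easy to model. A sketch, with the 50% discount from above and an illustrative per-million price:

```python
def openai_input_cost(total_tokens: int, cached_tokens: int,
                      price_per_million: float, discount: float = 0.50) -> float:
    """Effective input cost in dollars when cached_tokens out of
    total_tokens hit OpenAI's automatic prefix cache."""
    uncached = total_tokens - cached_tokens
    return (uncached + cached_tokens * (1 - discount)) * price_per_million / 1e6
```

At $2.50/M input, a 10K-token prompt with an 8K-token cached prefix bills as 6K full-price tokens.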

Google Gemini — explicit, hour-based, storage billed separately

Gemini's caching is the most explicit of the three. You POST to the /cachedContents endpoint with your context, get back a cache resource ID, then pass that ID in subsequent generateContent calls. The cache lives for a TTL you set at creation (default one hour) and can be updated or deleted before it expires.
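A minimal sketch of the two request bodies involved. Field names follow the REST flow described above; the model name and TTL are illustrative:

```python
# POST /v1beta/cachedContents — create the cache, get back a resource name.
def create_cache_body(model: str, context_text: str, ttl_seconds: int) -> dict:
    return {
        "model": f"models/{model}",
        "contents": [{"role": "user", "parts": [{"text": context_text}]}],
        "ttl": f"{ttl_seconds}s",
    }

# POST /v1beta/models/{model}:generateContent — reference the cache by name
# (the model itself goes in the URL path, not the body).
def generate_body(cache_name: str, question: str) -> dict:
    return {
        "cachedContent": cache_name,  # e.g. "cachedContents/abc123"
        "contents": [{"role": "user", "parts": [{"text": question}]}],
    }
```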

Pricing structure

  Cached token read    0.25× base input price (75% off)
  Cache storage        $1.00 per million tokens per hour, for the life of the cache
  Un-cached input      1.00× base

When Gemini caching pays off

Storage cost makes the math different from Anthropic. A 100K-token cache costs $0.10/hour just to keep alive — only worth it if you'll re-read enough times in that hour to overcome the storage:

For Gemini 2.5 Pro at $1.25/M input:
  100K tokens un-cached: $0.125 per read
  100K tokens cached:    $0.031 per read
  Storage:               $0.10 per hour
  Savings per read:      $0.094

  Storage breaks even at $0.10 / $0.094 ≈ 1.06 reads per hour.

So at 2+ reads/hour of the same large context, Gemini caching wins.
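That storage trade-off generalizes. A quick sketch, with defaults matching the numbers above (it ignores the one-time cost of creating the cache, which raises the bar slightly):

```python
def gemini_net_savings_per_hour(reads_per_hour: float, cache_tokens: int,
                                input_price_per_m: float = 1.25,
                                read_discount: float = 0.75,
                                storage_per_m_hour: float = 1.00) -> float:
    """Dollars saved per hour by caching (negative means caching loses)."""
    millions = cache_tokens / 1e6
    saving_per_read = millions * input_price_per_m * read_discount
    storage = millions * storage_per_m_hour
    return reads_per_hour * saving_per_read - storage
```

For the 100K-token example: two reads per hour nets about $0.09/hour saved; one read per hour loses a fraction of a cent to storage.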

Gemini caching is designed for large contexts reused over hours — full books, video transcripts, codebases. Anthropic's design is better for shorter prefixes reused rapidly — system prompts, tool definitions.

Decision flow: which caching, when

  Chat app, stable system prompt, dozens of turns per minute → Anthropic 5-min (90% off reads; low surcharge amortizes fast)
  Daily-driver agent, system + tools, steady traffic → Anthropic 1-hour (survives quiet periods, still 90% off reads)
  Bursty traffic, varying prompts, OpenAI ecosystem → OpenAI automatic (no tuning, no surcharge, but only 50% off)
  Document Q&A on a 200-page PDF used over hours → Gemini context caching (designed for large contexts, hour-scale TTL)
  One-off requests, unique prompts → no caching (a write surcharge with no read is a pure loss)

Common mistakes

Caching the user message

The user message varies with every request; caching it means paying the write surcharge with a near-zero hit rate. Cache the static prefix only: system, tools, documents.

Forgetting timestamps invalidate

A system prompt with "Today is Friday May 1, 2026" busts the cache every day at midnight. Move the timestamp to a separate user message, or put it after the cache breakpoint.
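A sketch of the fix: keep the system block byte-identical across days and let the date ride in the user message. Field layout follows Anthropic's request shape; names are illustrative:

```python
import datetime

STATIC_SYSTEM = "You are a helpful assistant."  # never changes: safe to cache

def request_body(today: datetime.date, user_text: str) -> dict:
    return {
        "system": [{
            "type": "text",
            "text": STATIC_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # stable prefix, daily cache hits
        }],
        # The volatile date lives after the breakpoint, in the user message.
        "messages": [{"role": "user",
                      "content": f"(Today is {today.isoformat()}.)\n{user_text}"}],
    }

# The cached prefix is identical on consecutive days:
a = request_body(datetime.date(2026, 5, 1), "hi")
b = request_body(datetime.date(2026, 5, 2), "hi")
assert a["system"] == b["system"]       # cache survives midnight
assert a["messages"] != b["messages"]   # only the un-cached part changed
```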

Tool definitions with dynamic content

Some teams template tool descriptions with user-specific info. That defeats caching across users. Keep tool definitions static; pass per-user context through user messages instead.

Mixing models in a session

Cache is per-model. A multi-model agent that switches between Sonnet and Haiku mid-session writes a new cache each time it switches.

What to verify before you celebrate the savings

  1. Check the response usage object for cache_read_input_tokens (Anthropic) or cached_tokens (OpenAI). If it's zero on calls you expected to hit cache, something is wrong with the prefix.
  2. Track cache hit rate over a week. Below 50% means your prefix isn't stable; above 90% means caching is doing its job.
  3. Run the math on actual traffic. The break-even is asymmetric across providers; what works for one project may lose money on another.
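Tracking the hit rate takes a few lines. The field names below follow Anthropic's response usage shape; for OpenAI, read `usage.prompt_tokens_details.cached_tokens` instead:

```python
def cache_hit_rate(usages: list) -> float:
    """Fraction of all input tokens served from cache across a batch of responses."""
    cached = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = sum(
        u.get("input_tokens", 0)                    # un-cached input
        + u.get("cache_creation_input_tokens", 0)   # cache writes
        + u.get("cache_read_input_tokens", 0)       # cache reads
        for u in usages
    )
    return cached / total if total else 0.0
```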

FAQ

Which provider has the best caching?

It depends on traffic shape. Anthropic gives 90% off reads but charges a 25% surcharge to write; it pays off from the second request on the same prefix within the 5-minute window. OpenAI is automatic with no opt-in and a flat 50% discount. Gemini gives 75% off reads, but you must explicitly create cache objects and pay storage by the hour; it's best for very large contexts (≥32K tokens) reused over hours.

Does Anthropic's prompt caching work with the system prompt and tool definitions?

Yes. You can mark up to 4 cache breakpoints in any combination of system prompt, tool definitions, document content, or user messages. Each breakpoint creates a cache prefix; subsequent requests with the same prefix get cached reads. Tool definitions and system prompts are the highest-value caching targets in agentic apps.

Why might caching cost more than not caching?

Three traps: (1) you write a cache for a prompt that's only used once, paying the 25% surcharge for nothing; (2) the cache TTL expires before the next read, so you pay the write surcharge again; (3) the prompt is below the minimum cacheable size (1024 tokens for most Anthropic models), so nothing is cached at all. The break-even: a 5-min Anthropic cache is profitable from the second request on the same prefix.

Does anything invalidate the cache mid-prefix?

Yes. Any change to the prefix invalidates the cache from that point forward. Order of system prompt, tool definitions, and messages matters — flip them and you lose the cache. Tool definition changes (a single character in a description) invalidate the entire cached tools section. Stable ordering and stable definitions are the entire game.

Is OpenAI's automatic caching always on?

Yes — for prompts ≥1024 tokens, OpenAI automatically detects identical prefixes within ~5-10 minutes and discounts the cached portion by 50%. There's no API parameter and no explicit cache management. The flip side: you can't tune it; if your traffic doesn't repeat prefixes within the window, you don't see the discount.

What about Gemini's context caching?

Manual and explicit: you POST to /cachedContents with your large context, get a cache ID, and reference it in subsequent generateContent calls. 25% of normal input price for cached portion; you also pay storage at $1/M tokens/hour. Designed for very large contexts (long videos, documents) reused over hours, not for shorter system prompts reused over minutes.

Run your own caching numbers in the calculator →