Token Pricing / Guides

8 hidden costs of LLM apps

The pricing page says one number. The invoice says another. Here's where the gap comes from — eight specific patterns that inflate real bills 1.5–10× over back-of-envelope estimates, with the math on each.

Run your real numbers in the calculator →

Why this matters

A typical "we'll spend $X/month on the API" estimate misses somewhere between 50% and 500% of the true cost depending on app shape. Most teams discover this when month one's invoice arrives and there's a frantic Slack thread.

Each of the eight items below has a real number you can compute ahead of time. Run through them before you size your budget.

1. Hidden reasoning tokens (worst offender, 4–10×)

Reasoning models — o3, o4-mini, GPT-5.5, DeepSeek R1, Grok 4, Claude Opus 4.7 with adaptive thinking — produce internal "thinking tokens" before their visible output. You don't see them in the chat UI; you do see them on your bill.

Typical multipliers (visible output → actual billed output):

Model                           Reasoning multiplier
o3                              5.4×
o4-mini                         4.1×
GPT-5.5                         6.1×
DeepSeek R1 (V4 Pro)            3.2×
Claude Opus 4.7 (adaptive)      2.5×
Grok 4                          ~5×

Multipliers vary by task — math/reasoning prompts run higher, simple prose prompts lower. The above are averages from public telemetry on common eval suites.

Concrete example. An o3 "summarize this article" call producing a 200-token visible response actually bills around 1,080 output tokens. At $8/M output that's $0.0086 per call instead of $0.0016 (5.4× the back-of-envelope estimate).

How to see it

// OpenAI response
{
  "usage": {
    "completion_tokens": 1080,            // ← billed
    "completion_tokens_details": {
      "reasoning_tokens": 880              // ← invisible
    }
  }
}
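
If you're logging that block, the multiplier is one division away. A minimal sketch using the OpenAI Python SDK; the model name and the $8/M output price are placeholders, so plug in your own:

# Sketch: pull the reasoning-token count out of the usage block and compute
# what the call actually cost. Assumes the official `openai` Python SDK;
# the model name and the $8/M output price are placeholders.
from openai import OpenAI

client = OpenAI()
OUTPUT_PRICE_PER_M = 8.00  # $ per 1M output tokens; use your model's real price

resp = client.chat.completions.create(
    model="o3",  # any reasoning model
    messages=[{"role": "user", "content": "Summarize this article: ..."}],
)

usage = resp.usage
details = usage.completion_tokens_details
reasoning = (details.reasoning_tokens or 0) if details else 0
billed_out = usage.completion_tokens          # includes the reasoning tokens
visible = billed_out - reasoning

print(f"visible output tokens : {visible}")
print(f"billed output tokens  : {billed_out}")
print(f"reasoning multiplier  : {billed_out / max(visible, 1):.1f}x")
print(f"output cost this call : ${billed_out * OUTPUT_PRICE_PER_M / 1e6:.4f}")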

2. Retries on rate-limit / timeout / 5xx (~3–10% of all calls)

Production traffic isn't smooth. Public reliability reports for the major LLM APIs typically show 1–3% transient error rate during normal operations and 5–15% during incidents.

If your retry logic re-sends the full prompt (most do, by default), every retry is a billable redo. Plan on 5–10% cost overhead from retries on a typical app — and it spikes during incidents.

Fix: idempotency keys (some providers accept them), exponential backoff with cap, and a "hard cap" on retries per request so a provider outage doesn't 100× your bill in 5 minutes.
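
A minimal sketch of that policy, assuming the official OpenAI Python SDK; the retry cap, backoff base, and the set of retryable errors are illustrative choices, not provider recommendations:

# Sketch: bounded retries with exponential backoff so an incident can't
# multiply your bill. Assumes the official `openai` SDK.
import random
import time

from openai import OpenAI, APIStatusError, APITimeoutError, RateLimitError

client = OpenAI(max_retries=0)  # turn off the SDK's own retries so we control spend

MAX_RETRIES = 3   # hard cap: at most 4 billable attempts per request
BASE_DELAY = 1.0  # seconds

def complete_with_cap(messages, model="gpt-4.1"):
    for attempt in range(MAX_RETRIES + 1):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except (RateLimitError, APITimeoutError) as err:
            last_err = err                      # 429s and timeouts are worth retrying
        except APIStatusError as err:
            if err.status_code < 500:
                raise                           # other 4xx won't fix themselves
            last_err = err
        if attempt == MAX_RETRIES:
            raise last_err                      # give up; don't let retries run away
        time.sleep(random.uniform(0, BASE_DELAY * 2 ** attempt))  # full-jitter backoff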

3. Tokenizer overhead (Claude 4.x: ~35%)

Anthropic's Claude 4.x family ships a new BPE tokenizer that consumes roughly 35% more tokens than older Claudes or GPT's o200k_base for the same English text. Per-token pricing is what you compare on the pricing page; per-task billing is what your invoice reflects.

Concrete: a 1,000-token (cl100k) prompt becomes roughly 1,350 tokens on the Claude tokenizer, so it costs Claude Opus 4.7 about 1,350 × $5 / 1M = $0.00675 in input, versus the headline 1,000 × $5 / 1M = $0.005 a back-of-envelope estimate would produce. Across 100K calls, that's a $175 difference.
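
Rather than trusting the ~35% average, you can measure the multiplier on your own prompts. A sketch assuming tiktoken and the Anthropic SDK's token-counting endpoint; the file name and model id are placeholders:

# Sketch: measure the tokenizer multiplier on a representative prompt instead of
# trusting a headline average. Assumes `tiktoken` and the `anthropic` SDK.
import tiktoken
from anthropic import Anthropic

text = open("sample_prompt.txt").read()   # a representative prompt from your app

gpt_tokens = len(tiktoken.get_encoding("o200k_base").encode(text))

# count_tokens counts the full message, including a few tokens of chat framing.
claude_tokens = Anthropic().messages.count_tokens(
    model="claude-opus-4",                # placeholder model id
    messages=[{"role": "user", "content": text}],
).input_tokens

print(f"GPT tokens    : {gpt_tokens}")
print(f"Claude tokens : {claude_tokens}")
print(f"multiplier    : {claude_tokens / gpt_tokens:.2f}x")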

See the Claude vs GPT comparison for worked examples across all tiers.

4. Tool / function definitions (300–1,500 tokens per call)

Every tool definition you pass is part of the input prompt and billed as input tokens on every API call. A typical agentic setup with 5 tools adds 600–1,200 tokens per turn. A sophisticated MCP-server-style setup with 30+ tools can add 4,000–8,000 tokens per turn.

Real numbers from a few common patterns:

Setup                                   Tool overhead per call
Single function (e.g., get_weather)     ~150 tokens
Standard agent (5 tools)                ~600–1,200 tokens
Code agent (10–15 tools)                ~2,000–3,500 tokens
MCP-style (30+ tools)                   ~4,000–8,000 tokens

At 1M calls/month with a 1,000-token tool overhead on Claude Sonnet 4.6, that's 1,000 tokens × 1M calls × $3/1M = $3,000/month in tool definitions alone. Cache them (system prompt + tools cache together as one prefix on Anthropic) to drop this to ~$300/month — 90% off cached reads.
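
A minimal sketch of that caching pattern with the Anthropic SDK; the tool schema, system text, and model id are placeholders, and note the provider only caches prefixes above a minimum length (on the order of 1K tokens), which a real tool set easily clears:

# Sketch: cache the tool definitions + system prompt as one prefix on Anthropic.
# Assumes the `anthropic` SDK with prompt caching.
from anthropic import Anthropic

client = Anthropic()

tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    # ...the rest of your tool set...
]

resp = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder model id
    max_tokens=1024,
    tools=tools,
    system=[
        {
            "type": "text",
            "text": "You are a support agent for ...",
            # cache_control on the last prefix block caches tools + system together
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
)

u = resp.usage
print("cache_creation_input_tokens:", u.cache_creation_input_tokens)  # paid the premium once
print("cache_read_input_tokens:   ", u.cache_read_input_tokens)       # 90% off afterwards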

5. Cache write surcharge (+25% the first call)

Anthropic's prompt caching charges a 25% premium on the cache write (5-minute TTL) or a 100% premium (1-hour TTL). If you cache a prefix and only use it once, you've paid extra for nothing.

Break-even math: with a 25% write surcharge and a 90% read discount, a single read of the same prefix already saves more (0.9× the base input price) than the surcharge costs (0.25×); the 1-hour tier needs two reads. The real trap is prefixes that are written and never read again. For traffic that bursts briefly then idles, you may be on the wrong side of that line — measure cache_read_input_tokens vs cache_creation_input_tokens in your logs (the break-even arithmetic is sketched below).
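
The break-even arithmetic, using the surcharge and discount figures from this section, expressed relative to the uncached input price of the same prefix:

# Sketch: cache break-even under the numbers in this section (25% write surcharge
# for the 5-minute TTL, 100% for the 1-hour TTL, 90% read discount).
def cached_cost(reads: int, write_premium: float) -> float:
    """One cache write plus `reads` cached reads, in units of the uncached prefix price."""
    return (1 + write_premium) + reads * 0.10

def uncached_cost(reads: int) -> float:
    return 1 + reads   # every call re-sends the prefix at full price

for ttl, premium in [("5-minute", 0.25), ("1-hour", 1.00)]:
    n = 1
    while cached_cost(n, premium) >= uncached_cost(n):
        n += 1
    print(f"{ttl} cache breaks even at {n} read(s) of the same prefix")
# Prints 1 read for the 5-minute tier and 2 for the 1-hour tier; zero reads
# means you paid the surcharge for nothing.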

OpenAI doesn't have this trap — caching is automatic with no write surcharge. Tradeoff: smaller discount (50% off vs 90%). See caching deep dive.

6. Long-context degradation (~12% higher retry rate above 50K)

"Lost in the middle" — most models retrieve much worse from positions in the middle of long contexts than from the start or end. Public benchmarks show retrieval accuracy drops sharply for prompts beyond ~50K tokens, even on models with 1M-token windows.

In production, this manifests as users repeating themselves or re-running queries when the model misses information. Empirically, apps using >50K context show ~12% higher retry rate than apps keeping context under 50K. Each retry is a full bill.

Plus, long-context cost grows linearly with every token you add, while its usefulness grows sub-linearly. If you can use RAG instead of stuffing a 500K context, you usually should.
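
To make the gap concrete, here's the input-cost arithmetic with illustrative numbers; swap in your model's price and your real context sizes:

# Sketch: per-call input cost of stuffing a huge context vs retrieving a slice.
# The price and token counts are illustrative.
INPUT_PRICE_PER_M = 2.00   # $ per 1M input tokens

full_context = 500_000     # tokens: "stuff everything into the prompt"
rag_context = 5_000        # tokens: system prompt + top-k retrieved chunks

print(f"stuffed context: ${full_context * INPUT_PRICE_PER_M / 1e6:.2f} per call")
print(f"RAG context:     ${rag_context * INPUT_PRICE_PER_M / 1e6:.3f} per call")
# Roughly $1.00 vs $0.01 per call, before counting the extra retries that
# long-context misses tend to trigger.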

7. Structured output overhead (15–30% extra output tokens)

Asking for JSON output (or a strict schema, or a function call) costs more than free-form output for two reasons:

  1. The schema or function definition is sent as part of the prompt, so it bills as input tokens on every call.
  2. The model emits structural overhead (braces, quotes, field names, escaping) that wouldn't exist in free-form prose; a 100-token answer often comes back as 130–150 tokens in JSON mode.

Net: budget 15–30% more output tokens for structured outputs vs. free-form. Most production apps use structured outputs heavily, so this stacks with everything else.
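
You can measure the structural overhead on your own payloads. A sketch using tiktoken, with an illustrative answer and a one-field schema; the percentage shrinks as answers get longer and grows as schemas add fields:

# Sketch: count the structural overhead JSON adds around the same answer.
# Assumes `tiktoken`; the answer text and schema are illustrative.
import json
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

answer = ("The order shipped on March 3 via standard ground and should arrive "
          "within 5 business days. A tracking link was emailed to the customer.")
as_json = json.dumps({"answer": answer})   # real schemas usually add more fields

plain = len(enc.encode(answer))
wrapped = len(enc.encode(as_json))
print(f"free-form: {plain} tokens, JSON: {wrapped} tokens, "
      f"overhead: {wrapped / plain - 1:.0%}")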

8. Streaming overhead (+1–3% on retries)

Streamed responses are billed at the same per-token rate as non-streamed, but streaming adds two cost vectors:

  1. Mid-stream timeouts and disconnects: if your client gives up partway and retries, you pay for the abandoned partial completion and for the full retry (some SDKs do this silently).
  2. Abandoned output: tokens the model generates after your consumer has stopped reading are still billed unless you cancel the stream.

Net: ~1–3% extra cost in production. Small, but noticeable at scale. Add timeouts to your stream consumers and use idempotency keys where supported.
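
A minimal sketch of a deadline-bounded stream consumer, assuming the official OpenAI Python SDK; the deadline and the fallback behaviour are illustrative choices:

# Sketch: consume a stream under a wall-clock deadline instead of letting the
# client library time out and silently re-send the full prompt.
import time
from openai import OpenAI

client = OpenAI()
DEADLINE_S = 30.0

def stream_with_deadline(messages, model="gpt-4.1"):
    started = time.monotonic()
    pieces = []
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        if not chunk.choices:
            continue
        pieces.append(chunk.choices[0].delta.content or "")
        if time.monotonic() - started > DEADLINE_S:
            # Stop consuming; also cancel/close the stream per your SDK version so
            # the server stops generating tokens you'll still be billed for.
            break
    # Decide deliberately what to do with a truncated answer (surface it, or
    # retry once with a tighter prompt); an automatic full retry doubles the bill.
    return "".join(pieces)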

The full stack: what your real bill looks like

Worked example. Imagine a customer-support chatbot using GPT-4.1 (sticker: $2/$8). Your back-of-envelope: 1K input + 500 output per turn, 100K turns/month → $2 × 100K × 1K/1M + $8 × 100K × 500/1M = $200 + $400 = $600/month.

Real bill, after applying the eight hidden costs:

Cost driver                                      Multiplier    Running total
Sticker estimate                                 1.00×         $600
Tool definitions (5 tools, +800 input tokens)    +27%          $760
Structured output (+20% out)                     +13%          $859
Retries (8% redo rate)                           +8%           $928
Streaming overhead                               +2%           $946
Long-context retries (10% of calls hit it)       +1.2%         $957

$957 actual vs $600 estimate — 60% over. And this example is the easy case (non-reasoning model, no tokenizer overhead, OpenAI not Claude). For a reasoning model on Anthropic with a heavy agent loop, it's not unusual to see 5–10× the sticker estimate.
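
If you want this as a reusable estimate rather than a table, here's the same stack in a few lines of Python; the overhead percentages are the rounded figures from the table, so the total lands within a few dollars of the $957 shown there:

# Sketch: the worked example above as a reusable estimate.
turns = 100_000
input_tokens, output_tokens = 1_000, 500
price_in, price_out = 2.00, 8.00           # $ per 1M tokens (GPT-4.1 sticker)

sticker = turns * (input_tokens * price_in + output_tokens * price_out) / 1e6
print(f"sticker estimate: ${sticker:,.0f}")

# Overheads applied multiplicatively to the running total, as in the table.
overheads = {
    "tool definitions (+800 input tokens/turn)": 0.27,
    "structured output (+20% out)": 0.13,
    "retries (8% redo rate)": 0.08,
    "streaming overhead": 0.02,
    "long-context retries": 0.012,
}

total = sticker
for name, pct in overheads.items():
    total *= 1 + pct
    print(f"after {name:<42}: ${total:,.0f}")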

How to find these costs in your bills

  1. Log the full usage object on every API response from day one. Don't just log prompt_tokens / completion_tokens — the hidden costs live in completion_tokens_details / cache_read_input_tokens / cache_creation_input_tokens (a logging sketch follows this list).
  2. Track retry rate. If it's above 5%, your retry policy needs an audit.
  3. Watch for cache_read vs cache_creation balance. On Anthropic, cache_creation should be a small fraction of cache_read in steady state. If they're equal, your prefix isn't stable.
  4. Compute the "tokenizer multiplier" for any Claude 4.x usage by sending a known-length string and reading back the token count.
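
A minimal logging sketch covering both providers' usage fields; the field names are the response fields discussed above, while the logger setup and record shape are illustrative:

# Sketch: log the full usage object from day one so the hidden fields are
# queryable later.
import json
import logging

log = logging.getLogger("llm.usage")

def log_openai_usage(resp):
    u = resp.usage
    details = u.completion_tokens_details
    log.info(json.dumps({
        "provider": "openai",
        "model": resp.model,
        "prompt_tokens": u.prompt_tokens,
        "completion_tokens": u.completion_tokens,
        "reasoning_tokens": (details.reasoning_tokens or 0) if details else 0,
    }))

def log_anthropic_usage(resp):
    u = resp.usage
    log.info(json.dumps({
        "provider": "anthropic",
        "model": resp.model,
        "input_tokens": u.input_tokens,
        "output_tokens": u.output_tokens,
        "cache_read_input_tokens": getattr(u, "cache_read_input_tokens", 0) or 0,
        "cache_creation_input_tokens": getattr(u, "cache_creation_input_tokens", 0) or 0,
    }))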

Recommendations to cut these

  1. Cache stable prefixes (system prompt + tool definitions) and confirm cache_read_input_tokens dominates cache_creation_input_tokens in steady state.
  2. Cap retries per request, back off exponentially, and send idempotency keys where the provider supports them.
  3. Budget reasoning models at their billed output (several times the visible output), not the visible output.
  4. Keep tool sets and output schemas lean; only attach the tools a turn actually needs.
  5. Prefer RAG over stuffing >50K-token contexts.
  6. Log the full usage object from day one so every item above is measurable.

FAQ

Why is my OpenAI bill higher than I calculated?

The most common causes, in order: (1) hidden reasoning tokens on o3/o4 (4-10× the visible output), (2) retries on rate-limited or failed responses, (3) you're using GPT-4 not GPT-4o (different prices), (4) function/tool definitions count as input on every turn, (5) streaming responses sometimes get billed twice on retries with some libraries.

How can I see hidden tokens in my bill?

OpenAI's response includes usage.completion_tokens_details.reasoning_tokens — that's the hidden reasoning count. Anthropic's response usage.cache_read_input_tokens and usage.cache_creation_input_tokens show caching activity. Track these in your logs from day one; they're invisible if you only watch top-level prompt_tokens / completion_tokens.

Do tool definitions cost tokens?

Yes — every tool definition is part of the input prompt and billed as input tokens on every API call. A typical 5-tool agent setup adds 600-1,200 tokens of overhead per call. Cache them (system prompt + tool definitions cache together on Anthropic) or you'll pay this on every turn.

Why does structured output (JSON mode) cost more?

Two reasons. (1) The schema definition is sent as part of the prompt, adding tokens. (2) The model emits structural overhead (braces, quotes, field names) that wouldn't exist in free-form output. A 100-token answer in JSON mode often outputs as 130-150 tokens.

Are streaming responses billed differently from non-streaming?

Same per-token rates, but streaming adds latency-driven retry risk. If your client times out mid-stream and retries, you pay for both partial completions. Some SDKs handle this transparently — check yours. Use idempotency-key headers where the provider supports them.

Run your real numbers in the calculator →