The rates
| Model | Input $/MTok | Output $/MTok | Cached read $/MTok | Context (tokens) |
|---|---|---|---|---|
| DeepSeek V4 Flash | $0.140 | $0.280 | $0.0280 | 1000K |
| DeepSeek V4 Pro (reasoning) | $1.740 | $3.480 | $0.145 | 1000K |
Cached reads (where the same prompt prefix has already been processed) are billed at 80% or more below the list input rate: 80% off for V4 Flash and about 92% off for V4 Pro at the rates above.
Off-peak discount
DeepSeek operates a daily off-peak window (UTC 16:30 – 00:30, which is 00:30 – 08:30 China Standard Time) with a ~50% discount on most rates. Batch jobs and async workloads are the obvious fit; live chat usually isn't.
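A scheduler that wants to hold batch work until the discount applies needs to test whether a timestamp falls inside the window. The sketch below is one way to do that, assuming the 16:30 to 00:30 UTC boundaries quoted above and treating the end as exclusive (an assumption; the source does not say). The function name `is_off_peak` is hypothetical and not part of any DeepSeek SDK.

```python
from datetime import datetime, time, timedelta, timezone

# Window boundaries as quoted in the text (UTC). The window wraps past midnight.
OFF_PEAK_START = time(16, 30)
OFF_PEAK_END = time(0, 30)


def is_off_peak(moment: datetime) -> bool:
    """Return True if a timezone-aware `moment` is inside the off-peak window."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    t = moment.astimezone(timezone.utc).time()
    # The window crosses midnight, so it is the union of two ranges:
    # [16:30, 24:00) and [00:00, 00:30).
    return t >= OFF_PEAK_START or t < OFF_PEAK_END


# 01:00 China Standard Time is 17:00 UTC on the previous day.
cst = timezone(timedelta(hours=8))
print(is_off_peak(datetime(2025, 1, 1, 1, 0, tzinfo=cst)))
```

Converting to UTC before comparing keeps the check independent of the caller's local timezone.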
Stacked with prompt caching, off-peak inference can drop effective input cost below $0.05 per million tokens for V4 Flash — roughly an order of magnitude cheaper than equivalent Western providers.
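The stacking arithmetic can be checked with a short calculation. This is a rough sketch, assuming the off-peak discount is exactly 50% and applies to cached reads as well as fresh input (the source only says "most rates"). `effective_input_rate` and `cache_hit_ratio` are names introduced here for illustration.

```python
# V4 Flash rates from the table above, in USD per million tokens.
FLASH_INPUT = 0.140
FLASH_CACHED_READ = 0.028
# "~50% discount"; assumed here to apply to cached reads too.
OFF_PEAK_MULTIPLIER = 0.5


def effective_input_rate(cache_hit_ratio: float, off_peak: bool) -> float:
    """Average input $/MTok when `cache_hit_ratio` of prompt tokens hit the cache."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    rate = (cache_hit_ratio * FLASH_CACHED_READ
            + (1.0 - cache_hit_ratio) * FLASH_INPUT)
    return rate * OFF_PEAK_MULTIPLIER if off_peak else rate


for ratio in (0.0, 0.5, 1.0):
    print(f"hit ratio {ratio:.0%}: ${effective_input_rate(ratio, off_peak=True):.4f}/MTok")
```

Under these assumptions the off-peak rate runs from $0.07 with no cache hits down to $0.014 with a fully cached prompt, and it falls below $0.05 once roughly 36% or more of input tokens are served from cache.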
Head-to-head: V4 Flash vs cheap competitors
| Model | Input $/MTok | Output $/MTok | Blended $/MTok (50/50 input/output) | Hosted in |
|---|---|---|---|---|
| DeepSeek V4 Flash | $0.140 | $0.280 | $0.210 | China (or Together / DeepInfra mirrors) |
| Claude Haiku 4.5 | $1.000 | $5.000 | $3.000 | US |
| GPT-5 Mini | $0.125 | $1.000 | $0.563 | US |
| Gemini 2.5 Flash-Lite | $0.100 | $0.400 | $0.250 | Google global |
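The blended column is consistent with a simple average of the input and output rates. Real workloads are rarely split 50/50, so the sketch below recomputes the blend for an arbitrary input share. The `blended` helper and the `input_share` parameter are illustrative names, and the prices are copied from the table rather than fetched from any provider.

```python
# (input $/MTok, output $/MTok), copied from the table above.
PRICES = {
    "DeepSeek V4 Flash": (0.140, 0.280),
    "Claude Haiku 4.5": (1.000, 5.000),
    "GPT-5 Mini": (0.125, 1.000),
    "Gemini 2.5 Flash-Lite": (0.100, 0.400),
}


def blended(input_rate: float, output_rate: float, input_share: float = 0.5) -> float:
    """Weighted $/MTok, where `input_share` is the fraction of tokens that are input."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_rate + (1.0 - input_share) * output_rate


# Compare the table's 50/50 blend with an input-heavy 80/20 mix.
for model, (inp, out) in PRICES.items():
    print(f"{model}: 50/50 = {blended(inp, out):.4f}, "
          f"80/20 = {blended(inp, out, input_share=0.8):.4f}")
```

Because output rates differ more than input rates across these models, the ordering between rows can shift with the mix, so it is worth plugging in the ratio your own workload produces.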
Where DeepSeek wins
- Bulk text classification, summarization, and translation — high token volumes where 5–10x cost savings compound.
- Async batch jobs: nightly enrichment, data cleaning, document processing.
- Mixed Chinese / English workloads — the tokenizer's Chinese efficiency is a real second-order saving.
- Cost-sensitive prototyping before committing to a provider.
Where it's the wrong call
- Strict data residency (HIPAA, GDPR + EU-only, regulated finance). Use the open-weight V4 models on a US/EU-hosted provider, or pick a different vendor.
- Function calling / structured output for production agents. The first-party API supports it; quality is a tier behind Claude / GPT for complex tool chains.
- Latency-sensitive UX. China-hosted endpoints add round-trip latency for users in the Americas / Europe.
- Long-form reasoning over very long context. DeepSeek's 128K window matches GPT-5 but trails Claude's 1M and Gemini's 1M+ for code-base or research-paper workloads.
Open-weight alternative paths
The DeepSeek V4 weights are open, so the same architecture is available through:
- Together AI — US-hosted, slightly higher per-token cost, full OpenAI-compatible API.
- DeepInfra — similar.
- Fireworks AI — pricing competitive with Together for V4.
- Self-hosted on H100s — only economical above ~2B tokens/day.
These trade some of the cost advantage for residency and latency guarantees that the first-party API doesn't provide.
FAQ
Why is DeepSeek cheaper than US-hosted models?
Three reasons: a Mixture-of-Experts architecture that activates only ~37B of its ~670B parameters per token (so per-token compute matches a much smaller model), aggressive Chinese inference economics, and a focus on raw cost-per-token rather than the polished SDK / tooling overhead of US providers.
Are there off-peak or batch discounts?
Yes. DeepSeek runs an off-peak window (UTC 16:30–00:30) with a ~50% discount on most rates. Batch jobs (24-hour SLA) are also discounted. Both stack with prompt caching.
How does the tokenizer compare?
DeepSeek uses its own BPE tokenizer optimized for Chinese and English. For English text it averages roughly the same token count as GPT's o200k_base; for Chinese it is significantly more efficient. For an estimate on your own text, plug it into the /tokens-per-word logic.
What are the data and residency considerations?
DeepSeek's API is hosted in China by default, which has data-residency, export-control, and compliance implications for many companies. Some customers instead use Together AI, Fireworks AI, or DeepInfra, which host the open-weight V4 models: same architecture, US/EU residency, slightly different price.
Is there a free tier?
DeepSeek runs a free tier with generous limits for testing, typically 10K tokens/min and a low daily cap. The paid tier requires payment in CNY or via supported intermediaries.
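A client that wants to stay under a per-minute token limit can throttle itself before sending requests. The sketch below is a generic sliding-window limiter, assuming the 10K tokens/min figure quoted above. It is not DeepSeek's own enforcement logic, and `TokenRateLimiter` is a hypothetical class written for this example.

```python
import time
from collections import deque


class TokenRateLimiter:
    """Sliding-window limiter: at most `limit` tokens per `window` seconds."""

    def __init__(self, limit: int = 10_000, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock          # injectable so the limiter can be tested
        self.events = deque()       # (timestamp, tokens) for recent requests
        self.used = 0               # tokens counted inside the current window

    def _expire(self, now: float) -> None:
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def wait_time(self, tokens: int) -> float:
        """Seconds to wait before `tokens` can be sent (0.0 if it fits now)."""
        if tokens > self.limit:
            raise ValueError("request exceeds the per-window limit")
        now = self.clock()
        self._expire(now)
        needed = self.used + tokens - self.limit
        if needed <= 0:
            return 0.0
        # Find the oldest requests whose expiry frees enough budget.
        freed = 0
        for timestamp, count in self.events:
            freed += count
            if freed >= needed:
                return timestamp + self.window - now
        return 0.0  # not reached, since tokens <= limit

    def record(self, tokens: int) -> None:
        """Register that a request consuming `tokens` was just sent."""
        now = self.clock()
        self._expire(now)
        self.events.append((now, tokens))
        self.used += tokens
```

A caller would sleep for `wait_time(n)` seconds, send the request, then call `record(n)` with the token count it consumed. The daily cap mentioned above would need a separate counter.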