
DeepSeek Cache Hit Strategy: How to Save 75% on API Costs

Chinese developers routinely cut DeepSeek bills by 75% using cache hit optimization. Here's the technique, with code examples and prompt architecture patterns.

DeepSeek’s cache miss input price is $0.27/M tokens. Cache hit input price is $0.07/M — that’s a 74% discount. For the reasoner model, it’s even steeper: $0.55 down to $0.14, a 75% cut.
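The discount figures follow directly from the two price pairs. A quick check, using only the prices quoted above (the function name is just for illustration):

```python
def cache_discount(miss_price: float, hit_price: float) -> float:
    """Return the fractional saving of a cache hit relative to a cache miss."""
    return 1 - hit_price / miss_price

# Prices in USD per 1M input tokens, as quoted in this article.
chat_discount = cache_discount(0.27, 0.07)
reasoner_discount = cache_discount(0.55, 0.14)

print(f"chat: {chat_discount:.1%}, reasoner: {reasoner_discount:.1%}")
```

Both come out a little above 74%, which is where the rounded 74-75% figures come from.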

Chinese developer communities have been optimizing for this aggressively since mid-2025. Most English-language guides mention caching in passing. Few explain how to architect your prompts to maximize hit rates.

This guide does.

How DeepSeek Caching Works

DeepSeek automatically caches the prefix of your input. If your next request starts with the same tokens, those tokens are charged at cache hit rates.

Key rules:

  • Caching is prefix-based, not semantic. The first N tokens must be identical, byte-for-byte.
  • Minimum cache unit is 64 tokens. Prefixes shorter than that won’t be cached.
  • Cache persists for several minutes of inactivity (exact TTL isn’t documented, but Chinese devs report 5-10 minutes).
  • Cache is per-model, per-account. Different API keys under the same account share cache.
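To get a feel for the prefix rule, here is a minimal sketch that estimates how many leading tokens of a request could hit the cache. It assumes the shared prefix is rounded down to whole 64-token units, which is one reading of the minimum-unit rule above, not documented behavior. The token IDs are made-up integers, not real tokenizer output.

```python
from itertools import takewhile

CACHE_UNIT = 64  # tokens; the minimum cache unit stated above

def cacheable_prefix_tokens(previous: list[int], current: list[int]) -> int:
    """Estimate how many leading tokens of `current` could be served from cache.

    Counts the tokens shared with `previous` from the start, then rounds down
    to a whole number of cache units (an assumption, see the text above).
    """
    shared = sum(
        1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(previous, current))
    )
    return (shared // CACHE_UNIT) * CACHE_UNIT

# Two requests that share a 200-token prefix and then diverge.
prefix = list(range(200))
request_a = prefix + [9001, 9002]
request_b = prefix + [7001, 7002, 7003]

print(cacheable_prefix_tokens(request_a, request_b))  # 192
```

A single changed token at the very start drops the estimate to zero, which is why the order of your prompt matters so much.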

The Architecture Pattern

The goal: make the first 80-90% of every request identical.

┌─────────────────────────────────────┐
│  System Prompt (fixed, long)        │  ← Cached (75% cheaper)
│  + Schema definitions               │
│  + Few-shot examples                │
│  + Context/knowledge base           │
├─────────────────────────────────────┤
│  User message (variable, short)     │  ← Not cached (full price)
└─────────────────────────────────────┘

The trick is simple: put everything that doesn’t change at the front.
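In code, that ordering can be enforced with a small helper. This is a sketch; `build_messages` and `STATIC_PREFIX` are hypothetical names, not part of any SDK.

```python
def build_messages(static_prefix: str, user_input: str) -> list[dict]:
    """Put unchanging content first and the per-request content last."""
    return [
        {"role": "system", "content": static_prefix},
        {"role": "user", "content": user_input},
    ]

# Placeholder for the fixed part: system prompt, schema, examples, knowledge base.
STATIC_PREFIX = "You are a support assistant.\n(schema, examples, knowledge base go here)"

first = build_messages(STATIC_PREFIX, "How do I reset my password?")
second = build_messages(STATIC_PREFIX, "What is your refund policy?")

# The leading message is identical across requests, so it is eligible for cache hits.
assert first[0] == second[0]
# Only the trailing message differs.
assert first[1] != second[1]
```

Routing every request through one helper like this makes it harder for variable content to sneak into the front of the prompt.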

Example: Before vs After

Before (No Cache Optimization)

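import os
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible API; the environment variable name is an assumption
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")

# `code` holds the source to review; the prompt varies almost from the first token, so every request misses
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "user", "content": f"Analyze this code:\n{code}\n\nCheck for bugs."}
    ]
)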

Cache hit rate: ~0%. Every request is different from the start.

After (Cache Optimized)

# Fixed prefix — system prompt + schema + examples
SYSTEM_PROMPT = """You are a code reviewer. Follow these rules:
1. Check for security vulnerabilities (SQL injection, XSS, command injection)
2. Check for performance issues (N+1 queries, unnecessary allocations)
3. Check for logic errors
4. Output format: JSON with fields: severity, line, issue, suggestion

Example output:
[
  {"severity": "high", "line": 42, "issue": "SQL injection via string concatenation", "suggestion": "Use parameterized queries"},
  {"severity": "low", "line": 15, "issue": "Unnecessary list copy", "suggestion": "Use iterator instead"}
]

Rules for severity:
- high: security vulnerabilities, data loss risks
- medium: performance issues, error handling gaps
- low: style, minor inefficiencies
"""

# Variable part — only the actual code changes
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Review this code:\n```\n{code}\n```"}
    ]
)

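Cache hit rate: up to ~90% when the fixed prefix makes up most of each request (the system prompt is ~300 tokens and is cached after the first call). Longer code submissions lower the rate, since only the prefix is cached.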

Cost difference at 10M tokens/month:

  • Before: $2.70 (all cache miss)
  • After: ~$0.90 (90% cache hit)
  • Savings: 67%
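The comparison is a blended-price calculation. A minimal sketch, using the deepseek-chat prices quoted earlier and counting input tokens only (`input_cost` is a hypothetical helper):

```python
MISS_PRICE = 0.27  # USD per 1M input tokens on a cache miss
HIT_PRICE = 0.07   # USD per 1M input tokens on a cache hit

def input_cost(million_tokens: float, hit_rate: float) -> float:
    """Blended input cost in USD for a cache hit rate between 0.0 and 1.0."""
    hit = million_tokens * hit_rate * HIT_PRICE
    miss = million_tokens * (1 - hit_rate) * MISS_PRICE
    return hit + miss

before = input_cost(10, 0.0)   # every token billed at the miss price
after = input_cost(10, 0.9)    # 90% of tokens billed at the hit price
savings = 1 - after / before

print(f"before=${before:.2f} after=${after:.2f} savings={savings:.0%}")
```

Swap in your own monthly volume and measured hit rate to see what the optimization is worth for your workload.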

Advanced Patterns

Pattern 1: RAG with Fixed Context Prefix

If you’re building RAG and your retrieved context changes per query, put the static parts first:

messages = [
    {"role": "system", "content": LONG_SYSTEM_PROMPT},      # Cached
    {"role": "user", "content": SCHEMA_AND_RULES},           # Cached
    {"role": "assistant", "content": "Understood."},         # Cached
    {"role": "user", "content": f"Context:\n{retrieved_docs}\n\nQuestion: {query}"}  # Not cached
]

The first three messages are identical every time — cached. Only the last message varies.

Pattern 2: Multi-Turn Conversation

For chatbots, the conversation history grows with each turn. The prefix (system prompt + earlier turns) is naturally cached:

Turn 1: [system] + [user_1]                    → All cache miss
Turn 2: [system] + [user_1] + [assistant_1] + [user_2]  → First 3 messages cached
Turn 3: [system] + [user_1] + [assistant_1] + [user_2] + [assistant_2] + [user_3]  → First 5 cached

Each turn caches more. By turn 5, 80%+ of your input is cached.
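Here is a rough simulation of that growth. It assumes, as in the diagram above, that everything before the newest user message is served from cache, that the cache stays warm between turns, and it ignores 64-token unit rounding. The token sizes are hypothetical.

```python
def cached_share_by_turn(system_tokens: int, turns: list[tuple[int, int]]) -> list[float]:
    """Return the cached share of input tokens for each turn.

    `turns` holds (user_tokens, assistant_tokens) for every turn in order.
    """
    shares = []
    history = system_tokens  # tokens that precede the newest user message
    for turn_index, (user_tokens, assistant_tokens) in enumerate(turns):
        total_input = history + user_tokens
        # Turn 1 has nothing to reuse; later turns reuse the whole history.
        cached = 0 if turn_index == 0 else history
        shares.append(cached / total_input)
        # The reply becomes part of the history sent on the next turn.
        history = total_input + assistant_tokens
    return shares

# Hypothetical sizes: 300-token system prompt, 50-token questions, 150-token answers.
shares = cached_share_by_turn(300, [(50, 150)] * 5)
print([f"{share:.0%}" for share in shares])
```

With short questions and longer answers the cached share climbs quickly. Long user messages slow the climb, because the newest message is always billed at the miss price.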

Pattern 3: Batch Processing with Shared Prefix

Processing 100 documents with the same instructions:

INSTRUCTIONS = """...(500 tokens of detailed instructions)..."""

for doc in documents:
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": INSTRUCTIONS},  # Cached after doc #1
            {"role": "user", "content": doc}               # Variable
        ]
    )

Documents #2-100 get the 500-token prefix cached. If each document is 2000 tokens, that’s 20% cached — saving 15% on input costs across the batch.
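The batch figure can be checked with a short calculation. This sketch assumes requests run one after another, that the first request misses entirely, and that every later request hits on the full shared prefix (`batch_savings` is a hypothetical helper):

```python
MISS_PRICE = 0.27  # USD per 1M input tokens on a cache miss
HIT_PRICE = 0.07   # USD per 1M input tokens on a cache hit

def batch_savings(num_docs: int, prefix_tokens: int, doc_tokens: int) -> float:
    """Fraction of input cost saved across a batch compared with no caching."""
    total_tokens = num_docs * (prefix_tokens + doc_tokens)
    hit_tokens = (num_docs - 1) * prefix_tokens  # the first request cannot hit
    baseline = total_tokens * MISS_PRICE
    actual = hit_tokens * HIT_PRICE + (total_tokens - hit_tokens) * MISS_PRICE
    return 1 - actual / baseline

saving = batch_savings(num_docs=100, prefix_tokens=500, doc_tokens=2000)
print(f"{saving:.1%}")
```

The result lands just under 15%. Firing all requests in parallel may lower it, since the prefix has to be cached by an earlier request before later ones can hit.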

Pattern 4: Inflate Your System Prompt (Intentionally)

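This is counterintuitive: making your system prompt longer lowers your effective cost per input token, because the added tokens are billed at the cache hit rate.

If your system prompt is 100 tokens and your user message is 1000 tokens, only 9% is cacheable. But if you expand the system prompt to 1000 tokens (add examples, rules, schema), 50% is cacheable. The per-request bill still rises a little, since 900 extra cached tokens cost about as much as 233 uncached ones, so expand the prompt only with content that improves the output.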

System Prompt   User Message   Cacheable   Effective Cost (10M tok)
100 tokens      1000 tokens    9%          $2.52
500 tokens      1000 tokens    33%         $2.03
1000 tokens     1000 tokens    50%         $1.70
2000 tokens     1000 tokens    67%         $1.37

Chinese developers call this “以长换省” (trade length for savings).

What About Other Providers?

Provider    Caching                                                 Discount
DeepSeek    Automatic prefix caching                                74-75% off input
Kimi K2.5   Automatic, 75% discount                                 75% off input
Qwen        “Implicit caching” 20% off, explicit caching 90% off    Varies by tier
OpenAI      Prompt caching (50% off)                                50% off input
Anthropic   Prompt caching (90% off, but $1.25/M write cost)        Net ~70% off for reuse

DeepSeek and Kimi have the most developer-friendly caching: automatic, no extra API calls, no write costs. You just structure your prompts right and the savings happen.

Monitoring Cache Hits

DeepSeek returns cache hit info in the API response:

{
  "usage": {
    "prompt_tokens": 1500,
    "prompt_cache_hit_tokens": 1200,
    "prompt_cache_miss_tokens": 300,
    "completion_tokens": 500
  }
}

Track prompt_cache_hit_tokens / prompt_tokens to measure your hit rate. Aim for 70%+.
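A minimal sketch of that calculation, operating on a plain dict shaped like the usage block above. `cache_hit_rate` is a hypothetical helper. If your SDK returns a usage object instead of a dict, you would need to convert it first (for example with `model_dump()` in the OpenAI Python SDK, which is an assumption about your setup).

```python
def cache_hit_rate(usage: dict) -> float:
    """Return prompt_cache_hit_tokens / prompt_tokens, or 0.0 when there is no input."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    if prompt_tokens == 0:
        return 0.0
    return usage.get("prompt_cache_hit_tokens", 0) / prompt_tokens

# Same numbers as the sample response above.
sample_usage = {
    "prompt_tokens": 1500,
    "prompt_cache_hit_tokens": 1200,
    "prompt_cache_miss_tokens": 300,
    "completion_tokens": 500,
}

rate = cache_hit_rate(sample_usage)
print(f"hit rate: {rate:.0%}")  # hit rate: 80%
```

Logging this per request, rather than only in aggregate, makes it easier to spot the endpoint or prompt template that is breaking the prefix.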

Quick Checklist

  1. Put your system prompt, schema, and examples at the front of every request
  2. Keep the prefix byte-identical — don’t add timestamps or random IDs
  3. Make requests within 5-10 minutes of each other to stay in cache
  4. Monitor prompt_cache_hit_tokens in responses
  5. Consider expanding your system prompt if it’s very short
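For item 2, one way to catch accidental prefix changes is to fingerprint the fixed messages and compare across requests. This is a sketch of one possible guard, not something the API provides; `prefix_fingerprint` and the sample prompts are hypothetical.

```python
import hashlib
from datetime import datetime, timezone

def prefix_fingerprint(messages: list[dict], prefix_len: int) -> str:
    """Hash the first `prefix_len` messages so drift in the fixed prefix is easy to detect."""
    digest = hashlib.sha256()
    for message in messages[:prefix_len]:
        digest.update(message["role"].encode("utf-8"))
        digest.update(b"\x00")  # separator so role and content cannot run together
        digest.update(message["content"].encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()

RULES = "You are a code reviewer. Output JSON."

stable_a = [{"role": "system", "content": RULES}, {"role": "user", "content": "first"}]
stable_b = [{"role": "system", "content": RULES}, {"role": "user", "content": "second"}]

# Anti-pattern: a timestamp at the front of the system prompt changes the prefix on every call.
stamped = [
    {"role": "system", "content": f"{datetime.now(timezone.utc).isoformat()} {RULES}"},
    {"role": "user", "content": "first"},
]

assert prefix_fingerprint(stable_a, 1) == prefix_fingerprint(stable_b, 1)
assert prefix_fingerprint(stable_a, 1) != prefix_fingerprint(stamped, 1)
```

Logging the fingerprint next to the hit rate gives you a quick way to tell whether a drop in cache hits came from a prompt change or from the cache expiring.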

Use our cost calculator to model the savings for your specific workload.

Full DeepSeek API setup: Getting Started Guide.
