Crazyrouter Team
April 8, 2026

AI API Prompt Caching Guide 2026: Save 90% on Token Costs#

If you're repeatedly sending the same large system prompt, document, or context in every API request, you're wasting money. Prompt caching is one of the most impactful cost optimizations available in 2026 — and most developers aren't using it.

This guide shows you how to implement prompt caching across Claude, OpenAI, and Gemini APIs with real code examples.

What Is Prompt Caching?#

Normally, every API request processes every input token from scratch. With prompt caching, the provider stores a pre-computed representation of your repeated input, so you only pay a fraction of the cost when reusing it.

The numbers:

  • Claude: Cache read costs 10× less than regular input tokens
  • OpenAI: Cache read costs 50% less than regular input tokens
  • Gemini: Context caching saves 75%+ on repeated tokens
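To make those discounts concrete, here is a back-of-the-envelope comparison for one 2,000-token prompt re-read from cache. The per-million price is an illustrative assumption, not a current rate card:

```python
def input_cost(tokens: int, price_per_1m: float, multiplier: float = 1.0) -> float:
    """Cost in dollars for `tokens` input tokens at a given per-million price."""
    return tokens / 1_000_000 * price_per_1m * multiplier

PROMPT_TOKENS = 2_000
PRICE_PER_1M = 3.0  # assumed input price, $/1M tokens

full = input_cost(PROMPT_TOKENS, PRICE_PER_1M)               # uncached read
claude_read = input_cost(PROMPT_TOKENS, PRICE_PER_1M, 0.10)  # 10% of normal
openai_read = input_cost(PROMPT_TOKENS, PRICE_PER_1M, 0.50)  # 50% of normal
gemini_read = input_cost(PROMPT_TOKENS, PRICE_PER_1M, 0.25)  # ~25% of normal

print(f"{full:.4f} {claude_read:.4f} {openai_read:.4f} {gemini_read:.4f}")
# 0.0060 0.0006 0.0030 0.0015
```

The same 2,000 tokens cost $0.0060 uncached but only $0.0006 as a Claude cache read; multiply by thousands of calls and the gap becomes real money.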

When Prompt Caching Pays Off#

Prompt caching is valuable when you have:

  1. Long system prompts (500+ tokens) sent with every request
  2. Document analysis — analyzing the same PDF/code/document in multiple calls
  3. Few-shot examples — many examples included in every prompt
  4. RAG applications — same retrieved context sent repeatedly
  5. Multi-turn agents — same tool definitions sent with every call

Not worth it for:

  • Short, unique prompts (caching overhead not worth it)
  • Single-call workflows
  • Highly dynamic prompts that change every request
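A quick way to sanity-check the "not worth it" cases is the break-even point. With Claude's pricing shape (1.25x for the cache write, 0.10x per cached read), caching wins as soon as n * 1.0 exceeds 1.25 + (n - 1) * 0.10, which happens from the second call onward. A minimal sketch, assuming those multipliers:

```python
import math

def breakeven_calls(write_mult: float = 1.25, read_mult: float = 0.10) -> int:
    """Smallest number of calls at which caching beats no caching.

    Without cache: n * 1.0.  With cache: write_mult + (n - 1) * read_mult.
    Caching wins when n > (write_mult - read_mult) / (1 - read_mult).
    """
    threshold = (write_mult - read_mult) / (1.0 - read_mult)
    return math.floor(threshold) + 1

print(breakeven_calls())           # Claude-style multipliers -> 2
print(breakeven_calls(1.0, 0.50))  # OpenAI-style (no write premium) -> 2
```

In other words, if a given prefix is reused even twice before the cache expires, caching already pays for itself; the losing cases are the one-shot and always-changing prompts listed above.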

Provider Comparison: Caching Support#

| Provider | Feature | Cache Cost | Cache Duration | Min Size |
| --- | --- | --- | --- | --- |
| Claude | cache_control | 10% of normal | 5 min (ephemeral) | 1,024 tokens |
| OpenAI | Automatic | 50% of normal | ~1 hour | 1,024 tokens |
| Gemini | Explicit cache | ~25% of normal | User-defined | 32,768 tokens |

Claude Prompt Caching#

Claude's caching uses the cache_control parameter with {"type": "ephemeral"} to mark content for caching.

Python — Cache a System Prompt#

python
from openai import OpenAI

# Via Crazyrouter (OpenAI-compatible format)
client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

# Load your large document once
with open("technical_spec.txt", "r") as f:
    large_document = f.read()  # Assume 5,000 tokens

def analyze_document_with_cache(question: str) -> str:
    """Each call reuses the cached document — pay 10% of document cost."""
    
    response = client.chat.completions.create(
        model="claude-opus-4-6",
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a technical analyst. Answer questions about the document below.",
                    },
                    {
                        "type": "text",
                        "text": large_document,
                        "cache_control": {"type": "ephemeral"}  # Cache this!
                    }
                ]
            },
            {
                "role": "user",
                "content": question
            }
        ]
    )
    
    return response.choices[0].message.content

# First call: full cost (caches the document)
answer1 = analyze_document_with_cache("What are the API endpoints?")

# Subsequent calls: 90% cheaper for the document tokens
answer2 = analyze_document_with_cache("What authentication methods are supported?")
answer3 = analyze_document_with_cache("List the rate limits")
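To confirm the cache is actually being hit, inspect the usage metadata on each response. Anthropic's native API reports cache_creation_input_tokens and cache_read_input_tokens; whether a gateway passes those exact fields through depends on the gateway, so treat the field names below as an assumption. A small helper (the name is ours) that works on a plain usage dict:

```python
def cache_stats(usage: dict) -> dict:
    """Summarize cache behaviour from a usage payload (field names assumed)."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    return {
        "cache_hit": read > 0,
        "cached_tokens": read,
        "cache_write_tokens": written,
        "uncached_tokens": uncached,
    }

# First call: the document is written to cache
print(cache_stats({"input_tokens": 40, "cache_creation_input_tokens": 5000}))
# Later calls: the document is read back at ~10% of the price
print(cache_stats({"input_tokens": 40, "cache_read_input_tokens": 5000}))
```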

Cost Savings Calculation#

python
def calculate_cache_savings(
    document_tokens: int,
    num_calls: int,
    input_price_per_1m: float = 15.0,  # Claude Opus
    cache_read_price_per_1m: float = 1.5,  # 10% of normal
    cache_write_price_per_1m: float = 18.75  # 1.25x on first write
) -> dict:
    """Calculate how much prompt caching saves."""
    
    # Without caching: all calls pay full price
    cost_without_cache = (document_tokens / 1_000_000) * input_price_per_1m * num_calls
    
    # With caching: 1 write + (n-1) cache reads
    cost_write = (document_tokens / 1_000_000) * cache_write_price_per_1m
    cost_reads = (document_tokens / 1_000_000) * cache_read_price_per_1m * (num_calls - 1)
    cost_with_cache = cost_write + cost_reads
    
    savings = cost_without_cache - cost_with_cache
    savings_pct = (savings / cost_without_cache) * 100
    
    return {
        "without_cache": f"${cost_without_cache:.4f}",
        "with_cache": f"${cost_with_cache:.4f}",
        "savings": f"${savings:.4f}",
        "savings_percent": f"{savings_pct:.1f}%"
    }

# Example: 5,000-token document analyzed 100 times
result = calculate_cache_savings(5000, 100)
print(result)
# Caching cuts the bill from $7.50 to roughly $0.84
# (one 1.25x cache write plus 99 cache reads at 10%), about an 89% saving.

Claude Multi-turn Conversation with Caching#

python
# Cache tool definitions in long agentic workflows
tools_definition = """
[Large tool definitions here — function schemas, descriptions, etc.]
"""

def agent_call(user_message: str, conversation_history: list) -> str:
    """Agent that caches tool definitions across many calls."""
    
    response = client.chat.completions.create(
        model="claude-sonnet-4-5",
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a helpful agent with the following tools:"
                    },
                    {
                        "type": "text",
                        "text": tools_definition,
                        "cache_control": {"type": "ephemeral"}  # Cache tools
                    }
                ]
            },
            *conversation_history,
            {
                "role": "user",
                "content": user_message
            }
        ]
    )
    
    return response.choices[0].message.content
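Caching keys on an exact prefix match, so even a one-character edit to the tool definitions forces a fresh cache write at the 1.25x rate. A cheap guard during development is to fingerprint the content you marked for caching and log whenever it changes (the helper is our own sketch, not part of any SDK):

```python
import hashlib

def prefix_fingerprint(*cached_blocks: str) -> str:
    """Stable fingerprint of the content marked for caching."""
    digest = hashlib.sha256()
    for block in cached_blocks:
        digest.update(block.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("a","b") != ("ab",)
    return digest.hexdigest()[:12]

fp = prefix_fingerprint(
    "You are a helpful agent with the following tools:",
    "[tool schemas]",
)
# Compare against the previous call's fingerprint; a change means a cache re-write.
```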

OpenAI Prompt Caching#

OpenAI's caching is automatic — you don't need to add any parameters. The API automatically caches the prefix of your prompts (system message + beginning of conversation). You'll see cached_tokens in the usage metadata.

python
from openai import OpenAI

client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

LARGE_SYSTEM_PROMPT = """
You are an expert financial analyst with deep knowledge of:
[... 2,000 tokens of detailed instructions, examples, and context ...]
"""

def analyze_with_cache(query: str) -> tuple[str, dict]:
    response = client.chat.completions.create(
        model="gpt-5-2",
        messages=[
            {"role": "system", "content": LARGE_SYSTEM_PROMPT},
            {"role": "user", "content": query}
        ]
    )
    
    # Check how many tokens were cached
    usage = response.usage
    details = getattr(usage, 'prompt_tokens_details', None)
    cached_tokens = getattr(details, 'cached_tokens', 0) if details else 0
    
    return response.choices[0].message.content, {
        "total_tokens": usage.total_tokens,
        "cached_tokens": cached_tokens,
        "prompt_tokens": usage.prompt_tokens
    }

# After the first call, subsequent calls will have cached_tokens > 0
result, usage = analyze_with_cache("Analyze the Q3 earnings report")
print(f"Cached tokens: {usage['cached_tokens']}")  # Should show cached tokens on 2nd+ call

OpenAI Caching Best Practices#

python
from datetime import datetime  # needed for the dynamic date below

# OpenAI caches the PREFIX — put stable content first
messages = [
    {
        "role": "system", 
        "content": LARGE_STABLE_SYSTEM_PROMPT  # This gets cached
    },
    # Don't mix stable and dynamic content in system prompt
    # Dynamic content should go in user messages
    {
        "role": "user",
        "content": f"Today's date: {datetime.now()}\n\n{user_query}"  # Dynamic goes here
    }
]
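Since only the prefix can be cached, it is worth enforcing that ordering in one place. A small builder (our own helper, not an OpenAI API) that keeps the stable system prompt first and pushes anything time-varying into the user turn:

```python
from datetime import datetime, timezone

def build_messages(stable_system: str, user_query: str) -> list[dict]:
    """Stable content first (cacheable prefix), dynamic content last."""
    dynamic_prefix = f"Today's date: {datetime.now(timezone.utc).date()}"
    return [
        {"role": "system", "content": stable_system},
        {"role": "user", "content": f"{dynamic_prefix}\n\n{user_query}"},
    ]

messages = build_messages("You are a financial analyst...", "Summarize Q3.")
assert messages[0]["content"] == "You are a financial analyst..."  # prefix stays byte-identical
```

Because the system message never changes between calls, every request shares the same cacheable prefix regardless of the date or query.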

Gemini Context Caching#

Gemini requires explicit cache creation via a separate API call, then referencing the cache ID in requests. This is more complex but offers the longest cache duration.

python
import google.generativeai as genai
from google.generativeai import caching
import datetime

genai.configure(api_key="your-google-key")

def create_document_cache(document_text: str, ttl_minutes: int = 60):
    """Create a Gemini cache for a large document."""
    
    cache = caching.CachedContent.create(
        model="models/gemini-2.5-flash",
        system_instruction="You are a document analysis expert.",
        contents=[
            {
                "parts": [{"text": document_text}],
                "role": "user"
            }
        ],
        ttl=datetime.timedelta(minutes=ttl_minutes)
    )
    
    return cache

def query_with_cache(cache_id: str, question: str) -> str:
    """Query using a pre-created cache, looked up by its resource name."""
    
    cached = caching.CachedContent.get(cache_id)
    model = genai.GenerativeModel.from_cached_content(cached_content=cached)
    response = model.generate_content(question)
    
    return response.text

# Create the cache once
with open("annual_report.txt") as f:
    large_document = f.read()
cache = create_document_cache(large_document)

# Query multiple times using the same cache
questions = [
    "What was the revenue growth?",
    "What are the key risks mentioned?",
    "Summarize the executive message"
]

for q in questions:
    answer = query_with_cache(cache.name, q)
    print(f"Q: {q}\nA: {answer}\n")

# Clean up when done
# cache.delete()
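Gemini bills cache storage by token-hours, so oversizing the TTL wastes money while undersizing it causes mid-session cache misses. A tiny sizing helper (our own convention, not part of the SDK) that covers the expected session length plus a safety buffer:

```python
import datetime

def cache_ttl(expected_queries: int,
              minutes_per_query: float,
              buffer: float = 1.5) -> datetime.timedelta:
    """TTL that covers the expected session length plus a safety buffer."""
    minutes = expected_queries * minutes_per_query * buffer
    return datetime.timedelta(minutes=minutes)

print(cache_ttl(expected_queries=20, minutes_per_query=2))  # 1:00:00
```

The resulting timedelta can be passed straight to the `ttl` parameter of `create_document_cache` above.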

Production Patterns#

Pattern 1: Cache Manager Class#

python
import time
from dataclasses import dataclass

@dataclass
class CacheEntry:
    content: str
    created_at: float
    ttl: int  # seconds

class PromptCacheManager:
    """Manage multiple cached prompts with TTL."""
    
    def __init__(self):
        self._caches: dict[str, CacheEntry] = {}
    
    def get_system_messages(self, cache_key: str, content: str) -> list:
        """Return system messages with cache_control for Claude, tracking age."""
        
        # Record (or refresh) the entry so should_refresh() can track its age.
        # Claude's ephemeral cache lives about 5 minutes (300 seconds).
        if self.should_refresh(cache_key):
            self._caches[cache_key] = CacheEntry(content, time.time(), 300)
        
        return [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": content,
                        "cache_control": {"type": "ephemeral"}
                    }
                ]
            }
        ]
    
    def should_refresh(self, cache_key: str) -> bool:
        """Check if cache should be refreshed (approaching TTL)."""
        if cache_key not in self._caches:
            return True
        
        entry = self._caches[cache_key]
        age = time.time() - entry.created_at
        
        # Refresh at 80% of TTL to avoid cache misses
        return age > (entry.ttl * 0.8)

# Usage
cache_manager = PromptCacheManager()

with open("large_knowledge_base.txt") as f:
    kb_content = f.read()

def smart_query(question: str) -> str:
    system_messages = cache_manager.get_system_messages("knowledge_base", kb_content)
    
    response = client.chat.completions.create(
        model="claude-sonnet-4-5",
        messages=[
            *system_messages,
            {"role": "user", "content": question}
        ]
    )
    
    return response.choices[0].message.content

Pattern 2: RAG with Prompt Caching#

python
def rag_with_cache(query: str, retrieved_docs: list[str]) -> str:
    """RAG call: cache the static base prompt, keep retrieved docs dynamic."""
    
    # Static: System instructions + base knowledge (cache this)
    static_context = """
You are a technical support agent with access to our documentation.
When answering:
1. Cite the relevant section
2. Provide step-by-step guidance
3. If unclear, ask for clarification
[... 2,000 tokens of base instructions ...]
"""
    
    # Dynamic: Retrieved documents + query (don't cache, changes per query)
    dynamic_context = "\n\n".join([f"Document:\n{doc}" for doc in retrieved_docs])
    
    response = client.chat.completions.create(
        model="claude-sonnet-4-5",
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": static_context,
                        "cache_control": {"type": "ephemeral"}  # Cache static
                    },
                    {
                        "type": "text",
                        "text": f"Relevant documentation:\n{dynamic_context}"
                        # No cache_control — dynamic content
                    }
                ]
            },
            {"role": "user", "content": query}
        ]
    )
    
    return response.choices[0].message.content

Cost Savings Summary#

Let's calculate real savings for a typical RAG application:

| Scenario | Without Cache | With Cache | Savings |
| --- | --- | --- | --- |
| 1K calls, 2K-token system prompt (Sonnet) | $6.00 | $0.72 | 88% |
| 1K calls, 10K-token document (Opus) | $150 | $17.25 | 88.5% |
| 10K calls, 500-token few-shots (Sonnet) | $15 | $3.15 | 79% |
| 100K calls, 1K-token system (Haiku) | $80 | $9.35 | 88% |

Frequently Asked Questions#

Q: How do I know if my cache hit? A: Claude returns cache_read_input_tokens in usage metrics. OpenAI returns cached_tokens in prompt_tokens_details. Check these values to confirm cache hits.

Q: What's the minimum content size for caching? A: Both Claude and OpenAI require at least 1,024 tokens for a cache to be created. Below that, there's no caching benefit.

Q: Does caching work with streaming? A: Yes, prompt caching works with streaming responses. The input is still processed from cache; only the output is streamed.

Q: Can I cache across different models? A: No — caches are model-specific. A cache created with Claude Opus doesn't work with Claude Sonnet.

Q: How does prompt caching work via Crazyrouter? A: Crazyrouter passes through the cache_control parameters to the underlying provider. Caching benefits are fully preserved when routing through Crazyrouter.

Summary#

Prompt caching is one of the highest-ROI optimizations for AI API costs:

  • Claude: Add cache_control: {type: "ephemeral"} to cacheable content → 90% cost reduction on cached tokens
  • OpenAI: Automatic caching of prompt prefixes → 50% reduction, no code changes needed
  • Gemini: Explicit cache creation for 75%+ savings on repeated large documents

For any application that sends the same context in multiple requests — RAG systems, document Q&A, agents with fixed tool definitions — prompt caching should be implemented immediately.

Access all cached-enabled models through Crazyrouter with full caching support preserved.

Start saving on AI API costs with Crazyrouter
