
"AI API Prompt Caching Guide 2026: Save 90% on Token Costs"
# AI API Prompt Caching Guide 2026: Save 90% on Token Costs
If you're repeatedly sending the same large system prompt, document, or context in every API request, you're wasting money. Prompt caching is one of the most impactful cost optimizations available in 2026 — and most developers aren't using it.
This guide shows you how to implement prompt caching across Claude, OpenAI, and Gemini APIs with real code examples.
## What Is Prompt Caching?
Normally, every API request processes every input token from scratch. With prompt caching, the provider stores a pre-computed representation of your repeated input, so you only pay a fraction of the cost when reusing it.
The numbers:
- Claude: Cache read costs 10× less than regular input tokens
- OpenAI: Cache read costs 50% less than regular input tokens
- Gemini: Context caching saves 75%+ on repeated tokens
## When Prompt Caching Pays Off
Prompt caching is valuable when you have:
- Long system prompts (500+ tokens) sent with every request
- Document analysis — analyzing the same PDF/code/document in multiple calls
- Few-shot examples — many examples included in every prompt
- RAG applications — same retrieved context sent repeatedly
- Multi-turn agents — same tool definitions sent with every call
Not worth it for:
- Short, unique prompts (caching overhead not worth it)
- Single-call workflows
- Highly dynamic prompts that change every request
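Caching also has a write premium, so the break-even point matters: with Claude's multipliers (1.25× to write, 0.10× to read), a cached prefix pays for itself from the second call onward. A quick sketch of that arithmetic, using the multiplier values from this guide's Claude pricing (other providers differ):

```python
def break_even_calls(write_mult: float = 1.25, read_mult: float = 0.10) -> int:
    """Smallest number of identical calls where caching beats paying full price.

    Costs are in units of one full-price pass over the prefix: cached cost
    after n calls is write_mult + read_mult * (n - 1); uncached cost is n.
    Assumes read_mult < 1 (otherwise caching never pays off).
    """
    n = 1
    while write_mult + read_mult * (n - 1) >= n:
        n += 1
    return n

print(break_even_calls())          # Claude defaults: pays off from the 2nd call
print(break_even_calls(1.0, 0.5))  # OpenAI-style 50% reads: also the 2nd call
```

In practice this means almost any prompt reused even twice within the cache window is worth caching, provided it clears the provider's minimum size.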
## Provider Comparison: Caching Support
| Provider | Feature | Cache Cost | Cache Duration | Min Size |
|---|---|---|---|---|
| Claude | cache_control | 10% of normal | 5 min (ephemeral) | 1,024 tokens |
| OpenAI | Automatic | 50% of normal | ~1 hour | 1,024 tokens |
| Gemini | Explicit cache | ~25% of normal | User-defined | 32,768 tokens |
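Because content below the provider's minimum simply isn't cached, it can be worth guarding cache markers behind a rough size check. A sketch using the minimums from the table above (the 4-characters-per-token heuristic is a crude approximation; use a real tokenizer for precision):

```python
MIN_CACHE_TOKENS = {"claude": 1024, "openai": 1024, "gemini": 32768}

def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English text."""
    return len(text) // 4

def worth_caching(text: str, provider: str) -> bool:
    """Only mark content for caching if it clears the provider's minimum."""
    return estimate_tokens(text) >= MIN_CACHE_TOKENS[provider]

# A 5,000-character system prompt (~1,250 tokens) clears Claude's minimum
# but falls far short of Gemini's 32,768-token floor.
```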
## Claude Prompt Caching
Claude's caching uses the cache_control parameter with {"type": "ephemeral"} to mark content for caching.
### Python — Cache a System Prompt

```python
from openai import OpenAI

# Via Crazyrouter (OpenAI-compatible format)
client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

# Load your large document once
with open("technical_spec.txt", "r") as f:
    large_document = f.read()  # Assume 5,000 tokens

def analyze_document_with_cache(question: str) -> str:
    """Each call reuses the cached document — pay 10% of document cost."""
    response = client.chat.completions.create(
        model="claude-opus-4-6",
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a technical analyst. Answer questions about the document below.",
                    },
                    {
                        "type": "text",
                        "text": large_document,
                        "cache_control": {"type": "ephemeral"}  # Cache this!
                    }
                ]
            },
            {
                "role": "user",
                "content": question
            }
        ]
    )
    return response.choices[0].message.content

# First call: full cost (caches the document)
answer1 = analyze_document_with_cache("What are the API endpoints?")

# Subsequent calls: 90% cheaper for the document tokens
answer2 = analyze_document_with_cache("What authentication methods are supported?")
answer3 = analyze_document_with_cache("List the rate limits")
```
### Cost Savings Calculation

```python
def calculate_cache_savings(
    document_tokens: int,
    num_calls: int,
    input_price_per_1m: float = 15.0,        # Claude Opus
    cache_read_price_per_1m: float = 1.5,    # 10% of normal
    cache_write_price_per_1m: float = 18.75  # 1.25x on first write
) -> dict:
    """Calculate how much prompt caching saves."""
    # Without caching: all calls pay full price
    cost_without_cache = (document_tokens / 1_000_000) * input_price_per_1m * num_calls
    # With caching: 1 write + (n-1) cache reads
    cost_write = (document_tokens / 1_000_000) * cache_write_price_per_1m
    cost_reads = (document_tokens / 1_000_000) * cache_read_price_per_1m * (num_calls - 1)
    cost_with_cache = cost_write + cost_reads
    savings = cost_without_cache - cost_with_cache
    savings_pct = (savings / cost_without_cache) * 100
    return {
        "without_cache": f"${cost_without_cache:.4f}",
        "with_cache": f"${cost_with_cache:.4f}",
        "savings": f"${savings:.4f}",
        "savings_percent": f"{savings_pct:.1f}%"
    }

# Example: 5,000-token document analyzed 100 times
result = calculate_cache_savings(5000, 100)
print(result)
# Roughly: without_cache ≈ $7.50, with_cache ≈ $0.84, savings ≈ 89%
```
### Claude Multi-turn Conversation with Caching

```python
# Cache tool definitions in long agentic workflows
tools_definition = """
[Large tool definitions here — function schemas, descriptions, etc.]
"""

def agent_call(user_message: str, conversation_history: list) -> str:
    """Agent that caches tool definitions across many calls."""
    response = client.chat.completions.create(
        model="claude-sonnet-4-5",
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a helpful agent with the following tools:"
                    },
                    {
                        "type": "text",
                        "text": tools_definition,
                        "cache_control": {"type": "ephemeral"}  # Cache tools
                    }
                ]
            },
            *conversation_history,
            {
                "role": "user",
                "content": user_message
            }
        ]
    )
    return response.choices[0].message.content
```
## OpenAI Prompt Caching
OpenAI's caching is automatic — you don't need to add any parameters. The API automatically caches the prefix of your prompts (system message + beginning of conversation). You'll see cached_tokens in the usage metadata.
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

LARGE_SYSTEM_PROMPT = """
You are an expert financial analyst with deep knowledge of:
[... 2,000 tokens of detailed instructions, examples, and context ...]
"""

def analyze_with_cache(query: str) -> tuple[str, dict]:
    response = client.chat.completions.create(
        model="gpt-5-2",
        messages=[
            {"role": "system", "content": LARGE_SYSTEM_PROMPT},
            {"role": "user", "content": query}
        ]
    )
    # Check how many tokens were cached
    usage = response.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached_tokens = getattr(details, "cached_tokens", 0) if details else 0
    return response.choices[0].message.content, {
        "total_tokens": usage.total_tokens,
        "cached_tokens": cached_tokens,
        "prompt_tokens": usage.prompt_tokens
    }

# After the first call, subsequent calls will have cached_tokens > 0
result, usage = analyze_with_cache("Analyze the Q3 earnings report")
print(f"Cached tokens: {usage['cached_tokens']}")  # Nonzero on 2nd+ call
```
### OpenAI Caching Best Practices

```python
from datetime import datetime

# OpenAI caches the PREFIX — put stable content first.
# Assumes LARGE_STABLE_SYSTEM_PROMPT and user_query are defined elsewhere.
messages = [
    {
        "role": "system",
        "content": LARGE_STABLE_SYSTEM_PROMPT  # This gets cached
    },
    # Don't mix stable and dynamic content in the system prompt;
    # dynamic content should go in user messages
    {
        "role": "user",
        "content": f"Today's date: {datetime.now()}\n\n{user_query}"  # Dynamic goes here
    }
]
```
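To verify the prefix is actually being reused in production, it helps to log what share of each request's input came from cache. A small helper over the usage numbers (plain arithmetic; no API calls assumed):

```python
def cached_fraction(prompt_tokens: int, cached_tokens: int) -> float:
    """Share of input tokens served from cache (0.0 on a cold call)."""
    if prompt_tokens <= 0:
        return 0.0
    return cached_tokens / prompt_tokens

# Warm calls with a 2,000-token stable prefix and a short query should
# report a fraction close to 1.0; a persistently low value usually means
# dynamic content crept into the prompt prefix.
print(cached_fraction(2100, 2048))
```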
## Gemini Context Caching
Gemini requires explicit cache creation via a separate API call, then referencing the cache ID in requests. This is more complex but offers the longest cache duration.
```python
import datetime

import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="your-google-key")

def create_document_cache(document_text: str, ttl_minutes: int = 60) -> caching.CachedContent:
    """Create a Gemini cache for a large document."""
    cache = caching.CachedContent.create(
        model="models/gemini-2.5-flash",
        system_instruction="You are a document analysis expert.",
        contents=[
            {
                "parts": [{"text": document_text}],
                "role": "user"
            }
        ],
        ttl=datetime.timedelta(minutes=ttl_minutes)
    )
    return cache

def query_with_cache(cache: caching.CachedContent, question: str) -> str:
    """Query using a pre-created cache."""
    model = genai.GenerativeModel.from_cached_content(cached_content=cache)
    response = model.generate_content(question)
    return response.text

# Create cache once
with open("annual_report.txt") as f:
    large_document = f.read()
cache = create_document_cache(large_document)

# Query multiple times using the same cache
questions = [
    "What was the revenue growth?",
    "What are the key risks mentioned?",
    "Summarize the executive message"
]
for q in questions:
    answer = query_with_cache(cache, q)
    print(f"Q: {q}\nA: {answer}\n")

# Clean up when done
# cache.delete()
```
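Unlike Claude and OpenAI, Gemini also bills for cache storage per token-hour on top of the discounted reads, so a long-lived cache only pays off above some request rate. A sketch of that trade-off (the prices below are illustrative placeholders, not real rates; check current Gemini pricing):

```python
def gemini_cache_saves_money(
    tokens: int,
    calls_per_hour: float,
    input_price_per_1m: float = 0.30,        # placeholder: regular input price
    cached_price_per_1m: float = 0.075,      # placeholder: cached input price
    storage_price_per_1m_hour: float = 1.0,  # placeholder: storage per 1M tokens/hour
) -> bool:
    """True if cached reads plus hourly storage beat paying full price each call."""
    millions = tokens / 1_000_000
    regular = millions * input_price_per_1m * calls_per_hour
    cached = millions * cached_price_per_1m * calls_per_hour + millions * storage_price_per_1m_hour
    return cached < regular

# With these placeholder prices, a 100K-token document queried 20x/hour
# is cheaper cached; at 1 query/hour the storage fee dominates.
```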
## Production Patterns

### Pattern 1: Cache Manager Class
```python
import time
from dataclasses import dataclass

@dataclass
class CacheEntry:
    content: str
    created_at: float
    ttl: int  # seconds

class PromptCacheManager:
    """Manage multiple cached prompts with TTL."""

    def __init__(self):
        self._caches: dict[str, CacheEntry] = {}

    def get_system_messages(self, cache_key: str, content: str) -> list:
        """Return system messages with cache_control for Claude."""
        # Claude's ephemeral cache TTL is 5 minutes (300 seconds);
        # record when this entry was last (re)written
        if self.should_refresh(cache_key):
            self._caches[cache_key] = CacheEntry(
                content=content, created_at=time.time(), ttl=300
            )
        return [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": content,
                        "cache_control": {"type": "ephemeral"}
                    }
                ]
            }
        ]

    def should_refresh(self, cache_key: str) -> bool:
        """Check if cache should be refreshed (approaching TTL)."""
        if cache_key not in self._caches:
            return True
        entry = self._caches[cache_key]
        age = time.time() - entry.created_at
        # Refresh at 80% of TTL to avoid cache misses
        return age > (entry.ttl * 0.8)

# Usage
cache_manager = PromptCacheManager()

with open("large_knowledge_base.txt") as f:
    kb_content = f.read()

def smart_query(question: str) -> str:
    system_messages = cache_manager.get_system_messages("knowledge_base", kb_content)
    response = client.chat.completions.create(
        model="claude-sonnet-4-5",
        messages=[
            *system_messages,
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
```
### Pattern 2: RAG with Prompt Caching
```python
def rag_with_cache(query: str, retrieved_docs: list[str]) -> str:
    """RAG that caches the base prompt; retrieved context stays dynamic per query."""
    # Static: system instructions + base knowledge (cache this)
    static_context = """
You are a technical support agent with access to our documentation.
When answering:
1. Cite the relevant section
2. Provide step-by-step guidance
3. If unclear, ask for clarification
[... 2,000 tokens of base instructions ...]
"""
    # Dynamic: retrieved documents + query (don't cache, changes per query)
    dynamic_context = "\n\n".join([f"Document:\n{doc}" for doc in retrieved_docs])
    response = client.chat.completions.create(
        model="claude-sonnet-4-5",
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": static_context,
                        "cache_control": {"type": "ephemeral"}  # Cache static
                    },
                    {
                        "type": "text",
                        "text": f"Relevant documentation:\n{dynamic_context}"
                        # No cache_control — dynamic content
                    }
                ]
            },
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content
```
## Cost Savings Summary
Let's calculate real savings for a typical RAG application:
| Scenario | Without Cache | With Cache | Savings |
|---|---|---|---|
| 1K calls, 2K-token system prompt (Sonnet) | $6.00 | $0.72 | 88% |
| 1K calls, 10K-token document (Opus) | $150 | $17.25 | 88.5% |
| 10K calls, 500-token few-shots (Sonnet) | $15 | $3.15 | 79% |
| 100K calls, 1K-token system (Haiku) | $80 | $9.35 | 88% |
## Frequently Asked Questions
Q: How do I know whether my cache was hit? A: Claude returns cache_read_input_tokens in usage metrics. OpenAI returns cached_tokens in prompt_tokens_details. Check these values to confirm cache hits.
Q: What's the minimum content size for caching? A: Both Claude and OpenAI require at least 1,024 tokens for a cache to be created. Below that, there's no caching benefit.
Q: Does caching work with streaming? A: Yes, prompt caching works with streaming responses. The input is still processed from cache; only the output is streamed.
Q: Can I cache across different models? A: No — caches are model-specific. A cache created with Claude Opus doesn't work with Claude Sonnet.
Q: How does prompt caching work via Crazyrouter? A: Crazyrouter passes through the cache_control parameters to the underlying provider. Caching benefits are fully preserved when routing through Crazyrouter.
## Summary
Prompt caching is one of the highest-ROI optimizations for AI API costs:
- Claude: Add cache_control: {"type": "ephemeral"} to cacheable content → 90% cost reduction on cached tokens
- OpenAI: Automatic caching of prompt prefixes → 50% reduction, no code changes needed
- Gemini: Explicit cache creation for 75%+ savings on repeated large documents
For any application that sends the same context in multiple requests — RAG systems, document Q&A, agents with fixed tool definitions — prompt caching should be implemented immediately.
Access all cached-enabled models through Crazyrouter with full caching support preserved.

