
Claude Sonnet 4.6 Pricing Explained — Caching, Tiers, and How to Save 45% with Crazyrouter

Claude Sonnet 4.6 is Anthropic's latest mid-range model — released in February 2026. It sits between the budget-friendly Haiku line and the premium Opus tier, making it the default choice for most production workloads: coding, chat, document analysis, and tool use.
But Anthropic's pricing isn't just "input + output." There's a layered caching system with two TTL tiers, a Batch API discount, and a data residency surcharge that can all stack on top of each other. This guide breaks down every component so you know exactly what you're paying — and how to pay less.
Last updated: April 27, 2026
Disclaimer: Prices in this article are accurate as of the publication date. Anthropic may adjust pricing at any time. Always verify on the official Anthropic pricing page before making production decisions.
Base Token Pricing#
The foundation of Claude Sonnet 4.6 pricing is straightforward:
| Token Type | Price per 1M Tokens |
|---|---|
| Input (base) | $3.00 |
| Output | $15.00 |
That's a 5:1 output-to-input ratio. For every dollar you spend on input tokens, you'd spend five dollars on the same number of output tokens. This ratio matters — if your workload is output-heavy (code generation, long-form writing), output costs will dominate your bill.
Quick reference: what does this actually cost?#
| Workload | Tokens | Cost |
|---|---|---|
| 1 short chat turn (500 in / 200 out) | 700 total | $0.0045 |
| 1 code review (2K in / 1K out) | 3K total | $0.021 |
| 1 document summary (10K in / 2K out) | 12K total | $0.06 |
| 1 hour of chatbot traffic (500K in / 200K out) | 700K total | $4.50 |
| 1 day of heavy API usage (5M in / 2M out) | 7M total | $45.00 |
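If you want to price your own workload, the rows above reduce to one line of arithmetic. Here is a minimal sketch, assuming the base rates listed in this article; `base_cost` is an illustrative helper name, not part of any SDK.

```python
# Base rates from the table above, in USD per 1M tokens.
INPUT_PER_M = 3.00
OUTPUT_PER_M = 15.00


def base_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the uncached, non-batch cost in USD for one workload."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000


if __name__ == "__main__":
    for label, tokens_in, tokens_out in [
        ("short chat turn", 500, 200),
        ("code review", 2_000, 1_000),
        ("document summary", 10_000, 2_000),
    ]:
        print(f"{label}: ${base_cost(tokens_in, tokens_out):.4f}")
```

Swap in your own token counts to estimate a bill before caching or batch discounts are applied.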
Prompt Caching: The Biggest Cost Lever#
Prompt caching is where Anthropic's pricing gets interesting — and where the real savings live.

How it works#
When you send a request with cache_control enabled, Anthropic stores the computed state of your prompt prefix. On subsequent requests that start with the same bytes (same system prompt, same few-shot examples, same preamble), those tokens are served from cache instead of being reprocessed.
There are two cache duration tiers:
| Cache Operation | Price per 1M Tokens | Multiplier vs Base Input | Duration |
|---|---|---|---|
| 5-minute cache write | $3.75 | 1.25x | 5 minutes |
| 1-hour cache write | $6.00 | 2.0x | 1 hour |
| Cache hit (read) | $0.30 | 0.1x | — |
The math: when does caching pay off?#
5-minute cache (1.25x write):
- Write cost: $3.75/M (you pay 25% more than base input on the first request)
- Read cost: $0.30/M (you pay 90% less on every subsequent request)
- Break-even: 1 cache read. After just one cache hit, you've saved money.
- Example (1M-token prefix): $3.75 write + $0.30 read → Total for 2 requests: $4.05
- Without cache: 2 × $3.00 = $6.00
- Savings: $1.95 (32.5%)
1-hour cache (2.0x write):
- Write cost: $6.00/M (you pay double on the first request)
- Read cost: $0.30/M (same 90% discount on reads)
- Break-even: 2 cache reads. After two hits, you're ahead.
- Example (1M-token prefix): $6.00 write + 2 × $0.30 reads ($0.60) → Total for 3 requests: $6.60
- Without cache: 3 × $3.00 = $9.00
- Savings: $2.40 (26.7%)
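The break-even points for both tiers can be checked with a short script. This is a sketch under two assumptions: the rates are the ones quoted in this article, and every reuse arrives before the cache entry expires. The function names are illustrative.

```python
BASE_INPUT = 3.00   # USD per 1M input tokens
CACHE_READ = 0.30   # USD per 1M tokens read from cache
CACHE_WRITE = {"5m": 3.75, "1h": 6.00}  # USD per 1M tokens written


def cached_total(tier: str, reads: int, prefix_millions: float = 1.0) -> float:
    """One cache write followed by `reads` cache hits on the same prefix."""
    return prefix_millions * (CACHE_WRITE[tier] + reads * CACHE_READ)


def uncached_total(reads: int, prefix_millions: float = 1.0) -> float:
    """The same (1 + reads) requests, all billed at the base input rate."""
    return prefix_millions * (1 + reads) * BASE_INPUT


def reads_to_come_out_ahead(tier: str) -> int:
    """Smallest number of cache hits at which caching is strictly cheaper."""
    reads = 0
    while cached_total(tier, reads) >= uncached_total(reads):
        reads += 1
    return reads


if __name__ == "__main__":
    for tier in CACHE_WRITE:
        print(tier, reads_to_come_out_ahead(tier))
```

With a single read, the 1-hour tier costs $6.30 against $6.00 uncached, which is why it needs a second hit before it comes out ahead.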
When to use which cache tier#
| Scenario | Recommended Cache | Why |
|---|---|---|
| Real-time chatbot (many requests/minute) | 5-minute | High request frequency, cache stays warm |
| Batch processing (bursts every few minutes) | 5-minute | Requests cluster within 5-min windows |
| Long-running agent sessions | 1-hour | Requests spread over 10-60 minutes |
| Scheduled jobs (hourly reports) | 1-hour | Predictable hourly pattern |
| One-off requests | No cache | No reuse opportunity |
How to enable caching#
Automatic caching (recommended for most cases):
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6-20260213",
    max_tokens=1024,
    cache_control={"type": "auto"},  # Automatic cache management
    system="You are a senior code reviewer. Review the following code for bugs, security issues, and performance problems.",
    messages=[
        {"role": "user", "content": "Review this Python function:\n\ndef process_data(items):\n results = []\n for item in items:\n if item.get('status') == 'active':\n results.append(transform(item))\n return results"}
    ]
)

print(response.usage)
# Look for cache_creation_input_tokens and cache_read_input_tokens
Explicit cache breakpoints (fine-grained control):
response = client.messages.create(
    model="claude-sonnet-4-6-20260213",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a senior code reviewer...",
            "cache_control": {"type": "ephemeral"}  # Cache this block
        }
    ],
    messages=[
        {"role": "user", "content": "Review this code..."}
    ]
)
Reading your cache usage in the response#
Every response includes cache metrics in the usage object:
{
  "usage": {
    "input_tokens": 50,
    "output_tokens": 320,
    "cache_creation_input_tokens": 1200,
    "cache_read_input_tokens": 0
  }
}
- cache_creation_input_tokens: tokens written to cache (charged at 1.25x or 2x)
- cache_read_input_tokens: tokens read from cache (charged at 0.1x)
- input_tokens: tokens processed normally (charged at base rate)
Your total cost for a request (input plus output) is:
cost = (input_tokens × $3.00/M)
     + (cache_creation_input_tokens × $3.75/M or $6.00/M)
     + (cache_read_input_tokens × $0.30/M)
     + (output_tokens × $15.00/M)
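The formula above is easy to turn into a helper you can run against logged responses. The sketch below assumes the rates quoted in this article and takes a plain dict shaped like the JSON usage object shown earlier; `request_cost` and its `cache_tier` argument are illustrative names, not part of the Anthropic SDK.

```python
RATES = {  # USD per 1M tokens, as listed in this article
    "input": 3.00,
    "output": 15.00,
    "cache_write_5m": 3.75,
    "cache_write_1h": 6.00,
    "cache_read": 0.30,
}


def request_cost(usage: dict, cache_tier: str = "5m") -> float:
    """Price one response's usage dict; cache_tier is "5m" or "1h"."""
    write_rate = RATES[f"cache_write_{cache_tier}"]
    total = (
        usage.get("input_tokens", 0) * RATES["input"]
        + usage.get("cache_creation_input_tokens", 0) * write_rate
        + usage.get("cache_read_input_tokens", 0) * RATES["cache_read"]
        + usage.get("output_tokens", 0) * RATES["output"]
    )
    return total / 1_000_000


example_usage = {
    "input_tokens": 50,
    "output_tokens": 320,
    "cache_creation_input_tokens": 1200,
    "cache_read_input_tokens": 0,
}

if __name__ == "__main__":
    print(f"${request_cost(example_usage):.5f}")
```

For the example usage object, the 5-minute tier works out to $0.00945: most of it is output ($0.0048) and the cache write ($0.0045).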
Batch API Discount#
Anthropic offers a Batch API for asynchronous processing. You submit requests in bulk, and results are returned within 24 hours. The tradeoff: no real-time responses, but you get a 50% discount on all token types.
| Token Type | Standard | Batch API |
|---|---|---|
| Input | $3.00/M | $1.50/M |
| Output | $15.00/M | $7.50/M |
| 5-min cache write | $3.75/M | $1.875/M |
| 1-hour cache write | $6.00/M | $3.00/M |
| Cache hit | $0.30/M | $0.15/M |
The Batch discount stacks with caching. If you're running a nightly batch job with a consistent system prompt, you get both the 50% batch discount AND the 0.1x cache read discount on repeated prefixes. That's $0.15/M for cached input tokens in batch mode — 95% cheaper than standard base input.
When to use Batch API#
- Bulk content generation (product descriptions, summaries)
- Large-scale data extraction or classification
- Evaluation runs across hundreds of test prompts
- Any workload where latency doesn't matter
Data Residency Surcharge#
For Claude Sonnet 4.5 and newer models (including Sonnet 4.6), Anthropic applies a 1.1x price multiplier when you request US-only inference via the inference_geo parameter.
| Token Type | Global (default) | US-only (1.1x) |
|---|---|---|
| Input | $3.00/M | $3.30/M |
| Output | $15.00/M | $16.50/M |
| Cache write (5min) | $3.75/M | $4.125/M |
| Cache hit | $0.30/M | $0.33/M |
This surcharge stacks with everything else. If you use US-only + Batch + caching, all multipliers apply.
Most users don't need this — global routing is the default and has no surcharge. Only enable inference_geo if you have strict data residency requirements.
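To see how the modifiers combine, here is a small sketch that multiplies them together. It assumes that "stacking" means straight multiplication of the cache, batch, and US-only factors, which matches the table values in this article; `effective_rate` is an illustrative helper, not an official calculator.

```python
BASE = {"input": 3.00, "output": 15.00}  # USD per 1M tokens
CACHE_MULT = {"none": 1.0, "write_5m": 1.25, "write_1h": 2.0, "read": 0.1}
BATCH_MULT = 0.5    # Batch API discount
US_ONLY_MULT = 1.1  # data residency surcharge


def effective_rate(token_type: str, cache: str = "none",
                   batch: bool = False, us_only: bool = False) -> float:
    """USD per 1M tokens after multiplying all applicable modifiers."""
    if token_type == "output" and cache != "none":
        raise ValueError("cache multipliers apply to input tokens only")
    rate = BASE[token_type] * CACHE_MULT[cache]
    if batch:
        rate *= BATCH_MULT
    if us_only:
        rate *= US_ONLY_MULT
    return rate


if __name__ == "__main__":
    print(f"{effective_rate('input', cache='read', batch=True):.3f}")
    print(f"{effective_rate('input', cache='read', batch=True, us_only=True):.3f}")
```

Under that assumption, a cached read in batch mode costs $0.15/M, and adding the US-only surcharge raises it to $0.165/M.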
Crazyrouter Pricing: 45% Off#

Through Crazyrouter, Claude Sonnet 4.6 is available at 55% of official pricing — a 45% discount on base token rates.
| Token Type | Anthropic Official | Crazyrouter (55%) |
|---|---|---|
| Input | $3.00/M | $1.65/M |
| Output | $15.00/M | $8.25/M |
Crazyrouter supports both OpenAI-compatible and native Anthropic API formats, so you can use whichever SDK you prefer.
Code example: using Claude Sonnet 4.6 via Crazyrouter#
OpenAI-compatible format:
from openai import OpenAI

client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to merge two sorted lists."}
    ]
)

print(response.choices[0].message.content)
Anthropic-native format:
import anthropic

client = anthropic.Anthropic(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com"
)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Write a Python function to merge two sorted lists."}
    ]
)

print(response.content[0].text)
curl:
curl https://crazyrouter.com/v1/chat/completions \
  -H "Authorization: Bearer your-crazyrouter-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "Write a Python function to merge two sorted lists."}
    ]
  }'
Real-World Cost Comparison#
Let's compare costs for three common workloads: Anthropic direct vs Crazyrouter.
Scenario 1: Chatbot — 1M input + 500K output tokens per day#
| | Anthropic Direct | Crazyrouter |
|---|---|---|
| Input cost | $3.00 | $1.65 |
| Output cost | $7.50 | $4.125 |
| Daily total | $10.50 | $5.775 |
| Monthly (30 days) | $315.00 | $173.25 |
| Monthly savings | — | $141.75 (45%) |
Scenario 2: Code generation — 500K input + 2M output tokens per day#
| | Anthropic Direct | Crazyrouter |
|---|---|---|
| Input cost | $1.50 | $0.825 |
| Output cost | $30.00 | $16.50 |
| Daily total | $31.50 | $17.325 |
| Monthly (30 days) | $945.00 | $519.75 |
| Monthly savings | — | $425.25 (45%) |
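The two scenario tables above follow the same arithmetic, so you can plug in your own daily volumes. A minimal sketch, assuming the official rates and the 55% Crazyrouter factor quoted in this article; `monthly_costs` is an illustrative name.

```python
OFFICIAL = {"input": 3.00, "output": 15.00}  # USD per 1M tokens
CRAZYROUTER_FACTOR = 0.55                    # 55% of official pricing


def monthly_costs(input_m_per_day: float, output_m_per_day: float, days: int = 30):
    """Return (direct, crazyrouter, savings) in USD for `days` of traffic.

    Volumes are given in millions of tokens per day.
    """
    daily = (input_m_per_day * OFFICIAL["input"]
             + output_m_per_day * OFFICIAL["output"])
    direct = daily * days
    routed = direct * CRAZYROUTER_FACTOR
    return direct, routed, direct - routed


if __name__ == "__main__":
    for name, volumes in {"chatbot": (1.0, 0.5), "code generation": (0.5, 2.0)}.items():
        direct, routed, saved = monthly_costs(*volumes)
        print(f"{name}: direct ${direct:.2f}, Crazyrouter ${routed:.2f}, saved ${saved:.2f}")
```

Because the discount is a flat factor on both token types, the savings percentage stays at 45% regardless of the input/output mix; only the absolute dollar amount changes.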
Scenario 3: Chatbot with 60% cache hit rate — 1M input + 500K output per day#
With caching on Anthropic direct:
- 400K cache write tokens (5min): 400K × $3.75/M = $1.50
- 600K cache hit tokens: 600K × $0.30/M = $0.18
- 500K output tokens: 500K × $15.00/M = $7.50
- Daily total: $9.18
With Crazyrouter (no native cache, but 45% off base):
- 1M input tokens: 1M × $1.65/M = $1.65
- 500K output tokens: 500K × $8.25/M = $4.125
- Daily total: $5.775
Even with Anthropic's caching at 60% hit rate, Crazyrouter's flat 45% discount still comes out cheaper for this workload. The gap narrows at very high cache hit rates (80%+), where Anthropic's cache reads at $0.30/M become extremely cheap.
Break-even analysis: when is Anthropic direct + caching cheaper?#
At what cache hit rate does going direct to Anthropic beat Crazyrouter?
For a pure input workload (ignoring output for simplicity):
- Crazyrouter cost per 1M input: $1.65
- Anthropic with cache: (1 - hit_rate) × $3.75 + hit_rate × $0.30
Solving for the hit rate x: $1.65 = (1 - x) × $3.75 + x × $0.30
- $1.65 = $3.75 - $3.45x
- $3.45x = $2.10
- x = 60.9%
At cache hit rates above ~61%, going direct to Anthropic with 5-minute caching is cheaper for input tokens. But remember: output tokens have no caching discount, and Crazyrouter's 45% off applies to output too. For output-heavy workloads, Crazyrouter wins at any cache hit rate.
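The break-even hit rate can also be computed directly, which is handy if the rates change. This sketch assumes the simplified model used above: every input token is either a 5-minute cache write or a cache read, and output is ignored. Both function names are illustrative.

```python
def direct_input_rate(hit_rate: float, write: float = 3.75, read: float = 0.30) -> float:
    """Blended USD per 1M input tokens: misses pay the write rate, hits the read rate."""
    return (1 - hit_rate) * write + hit_rate * read


def break_even_hit_rate(target: float = 1.65, write: float = 3.75, read: float = 0.30) -> float:
    """Hit rate at which the blended direct rate equals `target` (USD per 1M tokens)."""
    return (write - target) / (write - read)


if __name__ == "__main__":
    x = break_even_hit_rate()
    print(f"break-even hit rate: {x:.1%}")
    print(f"blended rate at 60% hits: ${direct_input_rate(0.60):.2f}/M")
```

At a 60% hit rate the blended direct rate is $1.68/M, just above Crazyrouter's $1.65/M, which is consistent with the Scenario 3 numbers. Passing `write=6.00` gives the equivalent threshold for the 1-hour tier.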
Pricing Summary Table#
| Component | Anthropic Official | Crazyrouter |
|---|---|---|
| Base input | $3.00/M | $1.65/M |
| Base output | $15.00/M | $8.25/M |
| 5-min cache write | $3.75/M (1.25x) | — |
| 1-hour cache write | $6.00/M (2.0x) | — |
| Cache hit | $0.30/M (0.1x) | — |
| Batch input | $1.50/M (50% off) | — |
| Batch output | $7.50/M (50% off) | — |
| US-only surcharge | 1.1x all prices | — |
| Supported formats | Anthropic API | OpenAI + Anthropic |
Key Takeaways#
- Base pricing is $3/M input, $15/M output. Output is 5x more expensive, so optimize for shorter outputs when possible.
- Prompt caching saves up to 90% on input tokens. The 5-minute cache pays for itself after just 1 reuse. The 1-hour cache needs 2 reuses.
- Batch API cuts everything by 50%. Stack it with caching for up to 95% savings on cached input tokens.
- Crazyrouter offers a flat 45% discount on base token rates, with no caching complexity to manage. For output-heavy workloads, this is often the better deal.
- The optimal strategy depends on your workload. High cache hit rates + input-heavy = go direct. Output-heavy or unpredictable traffic = Crazyrouter wins.
Get Started#
Sign up at crazyrouter.com to access Claude Sonnet 4.6 at 45% off — along with 300+ other models from OpenAI, Google, xAI, and more, all through one API key.





