
AI API Rate Limits Compared: Every Major Provider in 2026#
Rate limits are the silent killer of AI applications. You build a great product, users love it, traffic spikes — and suddenly you're getting 429 errors everywhere. Understanding rate limits across providers is critical for building reliable AI applications.
This guide compares rate limits for every major AI API provider in March 2026, with practical strategies for handling them in production.
What Are API Rate Limits?#
Rate limits control how many API requests you can make within a time window. They're measured in:
- RPM (Requests Per Minute): How many API calls per minute
- TPM (Tokens Per Minute): How many tokens processed per minute
- RPD (Requests Per Day): Daily request cap
- Images/min: For image generation APIs
Exceeding these limits returns a 429 Too Many Requests error.
Rate Limits by Provider#
OpenAI (GPT-5.2, GPT-5-mini, DALL-E)#
OpenAI uses a tier system based on spending history:
| Tier | Qualification | RPM | TPM (GPT-5.2) | TPM (GPT-5-mini) |
|---|---|---|---|---|
| Free | New account | 3 | 40,000 | 200,000 |
| Tier 1 | $5 paid | 500 | 30,000 | 200,000 |
| Tier 2 | $50 paid + 7 days | 5,000 | 450,000 | 2,000,000 |
| Tier 3 | $100 paid + 7 days | 5,000 | 800,000 | 4,000,000 |
| Tier 4 | $250 paid + 14 days | 10,000 | 2,000,000 | 10,000,000 |
| Tier 5 | $1,000 paid + 30 days | 10,000 | 5,000,000 | 30,000,000 |
Image Generation (DALL-E 3):
- Tier 1: 7 images/min
- Tier 5: 50 images/min
Batch API: separate queue limits at a 50% cost discount, with results returned within 24 hours.
Key notes:
- Tier upgrades are automatic based on spending
- Organization-level limits (shared across all keys)
- Separate limits per model family
Anthropic (Claude Opus 4.6, Sonnet 4.5, Haiku 4.5)#
Anthropic also uses spending-based tiers:
| Tier | Qualification | RPM | TPM (Opus 4.6) | TPM (Sonnet 4.5) | TPM (Haiku 4.5) |
|---|---|---|---|---|---|
| Tier 1 | $5 credit | 50 | 40,000 | 40,000 | 50,000 |
| Tier 2 | $40 spent | 1,000 | 80,000 | 80,000 | 100,000 |
| Tier 3 | $200 spent | 2,000 | 160,000 | 160,000 | 200,000 |
| Tier 4 | $400 spent | 4,000 | 400,000 | 400,000 | 800,000 |
Key notes:
- Much lower RPM than OpenAI (50 vs 500 at Tier 1)
- Opus 4.6 has the tightest limits
- Message Batches API available for asynchronous workloads at a discount
- Prompt caching doesn't count against TPM limits
Google (Gemini 2.5 Pro, Flash, Gemini 3 Pro Preview)#
Google uses a simpler model:
| Model | Free Tier RPM | Free Tier RPD | Paid RPM | Paid TPM |
|---|---|---|---|---|
| Gemini 2.5 Pro | 2 | 50 | 1,000 | 4,000,000 |
| Gemini 2.5 Flash | 15 | 1,500 | 2,000 | 4,000,000 |
| Gemini 2.5 Flash Lite | 30 | 1,500 | 4,000 | 4,000,000 |
| Gemini 3 Pro Preview | 2 | 50 | 500 | 2,000,000 |
Key notes:
- Generous free tier (especially Flash)
- Paid tier has high TPM (4M)
- Lower RPM than OpenAI
- 1M context window doesn't affect rate limits
DeepSeek (V3.2, R2)#
| Model | RPM | TPM | Concurrent |
|---|---|---|---|
| DeepSeek V3.2 | 60 | 1,000,000 | 10 |
| DeepSeek R2 | 30 | 500,000 | 5 |
Key notes:
- Very low RPM (60) but high TPM
- Concurrent request limits
- No tier system — same limits for all
- Frequent capacity issues during peak hours
xAI (Grok 4.1)#
| Tier | RPM | TPM |
|---|---|---|
| Free | 10 | 20,000 |
| Basic | 60 | 100,000 |
| Standard | 600 | 1,000,000 |
| Enterprise | Custom | Custom |
Key notes:
- Relatively new API, limits may change
- Enterprise tier requires direct contact
- Separate limits for Grok Vision
Mistral (Large 2, Codestral)#
| Model | RPM | TPM |
|---|---|---|
| Mistral Large 2 | 300 | 2,000,000 |
| Codestral | 300 | 2,000,000 |
| Mistral Small | 300 | 2,000,000 |
Key notes:
- Uniform limits across models
- No tier system
- Generous TPM
Meta (Llama 4 via providers)#
Llama 4 is open-weight, so rate limits depend on the hosting provider:
| Provider | RPM | TPM |
|---|---|---|
| Together AI | 600 | 10,000,000 |
| Fireworks AI | 600 | 10,000,000 |
| Groq | 30 | 6,000 |
| Crazyrouter | 1,000 | 5,000,000 |
Side-by-Side Comparison#
RPM Comparison (Paid Tier)#
| Provider | Entry RPM | Max RPM | Time to Max |
|---|---|---|---|
| OpenAI | 500 | 10,000 | 30 days + $1K |
| Anthropic | 50 | 4,000 | $400 spent |
| Google | 1,000 | 4,000 | Immediate |
| DeepSeek | 60 | 60 | N/A |
| xAI | 60 | 600 | Tier upgrade |
| Mistral | 300 | 300 | N/A |
| Crazyrouter | 1,000 | 5,000 | Immediate |
Winner: OpenAI at Tier 5 (10K RPM), but Crazyrouter offers 1K RPM immediately.
TPM Comparison (Flagship Models)#
| Provider | Model | Max TPM |
|---|---|---|
| OpenAI | GPT-5.2 | 5,000,000 |
| Google | Gemini 2.5 Pro | 4,000,000 |
| Mistral | Large 2 | 2,000,000 |
| DeepSeek | V3.2 | 1,000,000 |
| xAI | Grok 4.1 | 1,000,000 |
| Anthropic | Claude Opus 4.6 | 400,000 |
Winner: OpenAI (5M TPM), but Anthropic is notably restrictive (400K).
How Rate Limits Affect Real Applications#
Scenario 1: Customer Support Chatbot#
Requirements: 100 concurrent users, avg 500 tokens/request
| Provider | Can Handle? | Bottleneck |
|---|---|---|
| OpenAI (Tier 3) | ✅ Yes | None |
| Anthropic (Tier 2) | ⚠️ Barely | RPM (1,000) |
| Google (Paid) | ✅ Yes | None |
| DeepSeek | ❌ No | RPM (60) |
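Whether a provider can handle this load is simple arithmetic: multiply users by per-user request rate for RPM, then by tokens per request for TPM. A back-of-envelope helper (the ~2 requests per user per minute figure is an assumption for illustration):

```python
def fits_limits(concurrent_users, tokens_per_request,
                requests_per_user_per_min, rpm_limit, tpm_limit):
    """Back-of-envelope check: does steady-state chat load fit under RPM/TPM?"""
    rpm_needed = concurrent_users * requests_per_user_per_min
    tpm_needed = rpm_needed * tokens_per_request
    return rpm_needed <= rpm_limit and tpm_needed <= tpm_limit

# 100 users, 500 tokens/request, ~2 requests per user per minute
# needs 200 RPM and 100,000 TPM at steady state.
```

Peak traffic is rarely steady, so leave headroom above whatever this returns.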
Scenario 2: Batch Content Generation#
Requirements: 10,000 articles/day, avg 2,000 tokens each
| Provider | Time to Complete | Bottleneck |
|---|---|---|
| OpenAI (Tier 5) | ~17 hours | TPM |
| OpenAI Batch API | ~24 hours | Batch queue |
| Anthropic (Tier 4) | ~83 hours | TPM |
| Google (Paid) | ~8 hours | TPM |
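The time-to-complete estimates depend on which limit binds first: a batch is throttled by RPM if requests are small and by TPM if they are large. A rough sketch with illustrative numbers (real throughput also depends on latency and concurrency):

```python
def batch_hours(n_items, tokens_each, rpm_limit, tpm_limit):
    """Estimate wall-clock hours for a batch, bounded by whichever limit binds."""
    minutes_by_rpm = n_items / rpm_limit
    minutes_by_tpm = (n_items * tokens_each) / tpm_limit
    return max(minutes_by_rpm, minutes_by_tpm) / 60
```

For 2,000-token articles, TPM usually binds; for tiny classification calls, RPM does.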
Scenario 3: Real-time AI Application#
Requirements: 1,000 RPM sustained, low latency
| Provider | Can Handle? | Notes |
|---|---|---|
| OpenAI (Tier 4+) | ✅ Yes | 10K RPM |
| Anthropic | ❌ No | Max 4K RPM |
| Google (Paid) | ✅ Yes | 4K RPM |
| Crazyrouter | ✅ Yes | 5K RPM + failover |
Rate Limit Handling Strategies#
Strategy 1: Exponential Backoff#
The most basic approach — retry with increasing delays:
```python
import openai
import time
import random

client = openai.OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://api.crazyrouter.com/v1"
)

def call_with_backoff(messages, model="gpt-5-mini", max_retries=5):
    """Call the API with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1000
            )
            return response
        except openai.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Honor the Retry-After header when the provider sends one
            retry_after = e.response.headers.get("retry-after")
            if retry_after:
                wait_time = float(retry_after)
            else:
                # Exponential backoff with jitter
                wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {wait_time:.1f}s (attempt {attempt + 1})")
            time.sleep(wait_time)
```
Strategy 2: Token Bucket Rate Limiter#
Pre-emptively limit your own request rate:
```python
import asyncio
import time
from collections import deque

class TokenBucket:
    """Token bucket rate limiter (implemented as a 60-second sliding window)."""

    def __init__(self, rpm=500, tpm=100000):
        self.rpm = rpm
        self.tpm = tpm
        self.request_times = deque()
        self.token_usage = deque()  # (timestamp, tokens)

    async def acquire(self, estimated_tokens=500):
        """Wait until a request fits within both the RPM and TPM budgets."""
        while True:
            now = time.time()
            # Clean out entries older than the 60-second window
            while self.request_times and now - self.request_times[0] > 60:
                self.request_times.popleft()
            while self.token_usage and now - self.token_usage[0][0] > 60:
                self.token_usage.popleft()
            # Check RPM
            if len(self.request_times) >= self.rpm:
                wait = 60 - (now - self.request_times[0])
                await asyncio.sleep(max(0.1, wait))
                continue
            # Check TPM (guard against an empty window)
            current_tokens = sum(t for _, t in self.token_usage)
            if self.token_usage and current_tokens + estimated_tokens > self.tpm:
                wait = 60 - (now - self.token_usage[0][0])
                await asyncio.sleep(max(0.1, wait))
                continue
            # Record usage and allow the request
            self.request_times.append(now)
            self.token_usage.append((now, estimated_tokens))
            return

# Usage
limiter = TokenBucket(rpm=500, tpm=100000)

async def rate_limited_call(messages, model="gpt-5-mini"):
    await limiter.acquire(estimated_tokens=500)
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    return response
```
Strategy 3: Multi-Provider Failover#
Route requests across providers when one hits limits:
```python
import openai

# Crazyrouter handles this automatically, but here's a manual approach:
providers = [
    {
        "name": "primary",
        "client": openai.OpenAI(
            api_key="your-crazyrouter-key",
            base_url="https://api.crazyrouter.com/v1"
        ),
        "model": "gpt-5-mini"
    },
    {
        "name": "fallback_1",
        "client": openai.OpenAI(
            api_key="your-openai-key",
            base_url="https://api.openai.com/v1"
        ),
        "model": "gpt-5-mini"
    },
    {
        "name": "fallback_2",
        "client": openai.OpenAI(
            api_key="your-anthropic-key",
            base_url="https://api.anthropic.com/v1"
        ),
        "model": "claude-sonnet-4-5"
    }
]

def call_with_failover(messages, max_tokens=1000):
    """Try each provider in order until one accepts the request."""
    for provider in providers:
        try:
            response = provider["client"].chat.completions.create(
                model=provider["model"],
                messages=messages,
                max_tokens=max_tokens
            )
            return response
        except openai.RateLimitError:
            print(f"Rate limited on {provider['name']}, trying next...")
            continue
    raise Exception("All providers rate limited")
```
Strategy 4: Request Queuing#
Queue requests and process them within rate limits:
```python
import asyncio
from asyncio import Queue

class RequestQueue:
    """Queue-based rate limiter: a single worker paces outgoing requests."""

    def __init__(self, rpm=500):
        self.queue = Queue()
        self.rpm = rpm
        self.interval = 60.0 / rpm  # seconds between requests

    async def worker(self):
        """Process requests from the queue, one per interval."""
        while True:
            func, args, kwargs, future = await self.queue.get()
            try:
                result = await asyncio.to_thread(func, *args, **kwargs)
                future.set_result(result)
            except Exception as e:
                future.set_exception(e)
            await asyncio.sleep(self.interval)

    async def submit(self, func, *args, **kwargs):
        """Submit a request to the queue and await its result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((func, args, kwargs, future))
        return await future

# Usage (inside a running event loop)
queue = RequestQueue(rpm=450)  # Leave 10% headroom
asyncio.create_task(queue.worker())

# Submit requests
result = await queue.submit(
    client.chat.completions.create,
    model="gpt-5-mini",
    messages=[{"role": "user", "content": "Hello"}]
)
```
Strategy 5: Use Crazyrouter (Easiest)#
Crazyrouter handles rate limiting automatically:
```python
import openai

# Single client, automatic rate limit handling
client = openai.OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://api.crazyrouter.com/v1"
)

# Crazyrouter automatically:
# - Routes to available providers
# - Handles 429 errors with retry
# - Load balances across multiple keys
# - Provides higher effective rate limits
response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": "Hello"}]
)
```
Benefits:
- 1,000+ RPM from day one (no tier grinding)
- Automatic failover across providers
- Built-in retry logic
- 30% cost savings
Node.js Examples#
Exponential Backoff (Node.js)#
```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'your-crazyrouter-key',
  baseURL: 'https://api.crazyrouter.com/v1'
});

async function callWithBackoff(messages, model = 'gpt-5-mini', maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await client.chat.completions.create({
        model,
        messages,
        max_tokens: 1000
      });
      return response;
    } catch (error) {
      if (error.status === 429 && attempt < maxRetries - 1) {
        // Exponential backoff with jitter
        const waitTime = Math.pow(2, attempt) + Math.random();
        console.log(`Rate limited. Retrying in ${waitTime.toFixed(1)}s`);
        await new Promise(r => setTimeout(r, waitTime * 1000));
      } else {
        throw error;
      }
    }
  }
}
```
cURL with Retry#
```bash
#!/bin/bash
# Rate-limit-aware API call with retry
MAX_RETRIES=5
RETRY_DELAY=2

for i in $(seq 1 $MAX_RETRIES); do
  RESPONSE=$(curl -s -w "\n%{http_code}" \
    -X POST https://api.crazyrouter.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer your-crazyrouter-key" \
    -d '{
      "model": "gpt-5-mini",
      "messages": [{"role": "user", "content": "Hello"}]
    }')

  HTTP_CODE=$(echo "$RESPONSE" | tail -1)
  BODY=$(echo "$RESPONSE" | sed '$d')  # strip the status-code line (portable)

  if [ "$HTTP_CODE" = "200" ]; then
    echo "$BODY"
    exit 0
  elif [ "$HTTP_CODE" = "429" ]; then
    echo "Rate limited. Retry $i/$MAX_RETRIES in ${RETRY_DELAY}s..."
    sleep $RETRY_DELAY
    RETRY_DELAY=$((RETRY_DELAY * 2))
  else
    echo "Error: HTTP $HTTP_CODE"
    echo "$BODY"
    exit 1
  fi
done

echo "Max retries exceeded"
exit 1
```
Best Practices#
- Always implement retry logic — 429 errors are expected, not exceptional
- Use exponential backoff with jitter — prevents thundering herd
- Monitor your usage — track RPM/TPM to predict limits
- Pre-emptively rate limit — don't wait for 429s
- Use multiple providers — Crazyrouter makes this automatic
- Cache responses — reduce redundant API calls
- Use smaller models for simple tasks — higher limits, lower cost
- Batch when possible — OpenAI Batch API has separate limits
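Caching (point 6) can be as simple as keying on the exact request payload. This sketch uses an in-process dict and assumes the request kwargs are JSON-serializable; swap in Redis or similar for anything multi-process:

```python
import hashlib
import json

_cache: dict = {}

def cached_completion(client, model, messages, **kwargs):
    """Return a cached response for identical requests, saving rate limit budget."""
    # Hash the full request payload so any change busts the cache
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages, **kwargs},
                   sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = client.chat.completions.create(
            model=model, messages=messages, **kwargs
        )
    return _cache[key]
```

Identical prompts are common in FAQ-style workloads, so even a naive cache like this can cut a meaningful fraction of calls.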
Frequently Asked Questions#
Which provider has the highest rate limits?#
OpenAI at Tier 5 offers 10,000 RPM and 5M TPM for GPT-5.2. However, reaching Tier 5 requires $1,000+ in spending over 30+ days. Crazyrouter offers 1,000+ RPM immediately.
Why is Anthropic's rate limit so low?#
Anthropic prioritizes quality and safety over throughput. Their Tier 1 starts at just 50 RPM. For high-volume applications, consider using Crazyrouter which provides higher effective limits through load balancing.
Do rate limits apply per API key or per organization?#
- OpenAI: Per organization (shared across all keys)
- Anthropic: Per organization
- Google: Per project
- DeepSeek: Per API key
- Crazyrouter: Per API key (higher limits)
How do I check my current rate limit usage?#
Most providers include rate limit headers in responses:
```text
x-ratelimit-limit-requests: 500
x-ratelimit-remaining-requests: 499
x-ratelimit-reset-requests: 60s
x-ratelimit-limit-tokens: 100000
x-ratelimit-remaining-tokens: 99500
```
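Reading these is just a dictionary lookup once you have the raw response (in the OpenAI Python SDK, `client.chat.completions.with_raw_response.create(...)` is one way to get at the headers; exact header names vary slightly by provider):

```python
def parse_rate_limit_headers(headers):
    """Extract remaining request/token budget from x-ratelimit-* headers."""
    return {
        "requests_remaining": int(headers.get("x-ratelimit-remaining-requests", 0)),
        "tokens_remaining": int(headers.get("x-ratelimit-remaining-tokens", 0)),
    }
```

Logging these values on every response is the cheapest way to see a limit coming before you hit it.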
Can I request higher rate limits?#
- OpenAI: Automatic tier upgrades based on spending
- Anthropic: Contact sales for custom limits
- Google: Request quota increase in Cloud Console
- Crazyrouter: Contact support for enterprise limits
What happens when I hit the rate limit?#
You receive a 429 Too Many Requests response with a Retry-After header indicating when to retry. Your application should handle this gracefully with retry logic.
Conclusion#
Rate limits vary dramatically across providers. OpenAI offers the highest limits but requires significant spending to unlock them. Anthropic is the most restrictive, especially for Claude Opus. Google provides generous free tiers but moderate paid limits.
For production applications, the best strategy is to put Crazyrouter in front of your providers:
- Get 1,000+ RPM immediately (no tier grinding)
- Fail over automatically when one provider is limited
- Rely on built-in retry and load balancing
- Save 30% across all providers
Don't let rate limits break your application. Start building with reliable API access at crazyrouter.com.


