AI API Rate Limits Compared: Every Major Provider in 2026

Crazyrouter Team
March 12, 2026

Rate limits are the silent killer of AI applications. You build a great product, users love it, traffic spikes — and suddenly you're getting 429 errors everywhere. Understanding rate limits across providers is critical for building reliable AI applications.

This guide compares rate limits for every major AI API provider in March 2026, with practical strategies for handling them in production.

What Are API Rate Limits?#

Rate limits control how many API requests you can make within a time window. They're measured in:

  • RPM (Requests Per Minute): How many API calls per minute
  • TPM (Tokens Per Minute): How many tokens processed per minute
  • RPD (Requests Per Day): Daily request cap
  • Images/min: For image generation APIs

Exceeding these limits returns a 429 Too Many Requests error.
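
Whichever limit you hit first is the binding one, and it is often the token cap rather than the request cap. A quick back-of-the-envelope helper makes this concrete (a standalone sketch, not tied to any provider's SDK):

```python
def effective_rpm(rpm_limit: int, tpm_limit: int, avg_tokens_per_request: int) -> int:
    """Sustainable requests/minute once both the RPM and TPM caps are applied."""
    tpm_bound = tpm_limit // avg_tokens_per_request  # requests/min the token cap allows
    return min(rpm_limit, tpm_bound)

# On a 500 RPM / 30,000 TPM tier, ~500-token requests are capped at
# 60 requests/minute by tokens, far below the nominal 500 RPM:
print(effective_rpm(500, 30_000, 500))  # 60
```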

Rate Limits by Provider#

OpenAI (GPT-5.2, GPT-5-mini, DALL-E)#

OpenAI uses a tier system based on spending history:

| Tier | Qualification | RPM | TPM (GPT-5.2) | TPM (GPT-5-mini) |
|------|---------------|-----|---------------|------------------|
| Free | New account | 3 | 40,000 | 200,000 |
| Tier 1 | $5 paid | 500 | 30,000 | 200,000 |
| Tier 2 | $50 paid + 7 days | 5,000 | 450,000 | 2,000,000 |
| Tier 3 | $100 paid + 7 days | 5,000 | 800,000 | 4,000,000 |
| Tier 4 | $250 paid + 14 days | 10,000 | 2,000,000 | 10,000,000 |
| Tier 5 | $1,000 paid + 30 days | 10,000 | 5,000,000 | 30,000,000 |

Image Generation (DALL-E 3):

  • Tier 1: 7 images/min
  • Tier 5: 50 images/min

Batch API: 50% cost discount, separate rate limits from synchronous traffic, results returned within 24 hours.

Key notes:

  • Tier upgrades are automatic based on spending
  • Organization-level limits (shared across all keys)
  • Separate limits per model family

Anthropic (Claude Opus 4.6, Sonnet 4.5, Haiku 4.5)#

Anthropic also uses spending-based tiers:

| Tier | Qualification | RPM | TPM (Opus 4.6) | TPM (Sonnet 4.5) | TPM (Haiku 4.5) |
|------|---------------|-----|----------------|------------------|-----------------|
| Tier 1 | $5 credit | 50 | 40,000 | 40,000 | 50,000 |
| Tier 2 | $40 spent | 1,000 | 80,000 | 80,000 | 100,000 |
| Tier 3 | $200 spent | 2,000 | 160,000 | 160,000 | 200,000 |
| Tier 4 | $400 spent | 4,000 | 400,000 | 400,000 | 800,000 |

Key notes:

  • Much lower RPM than OpenAI (50 vs 500 at Tier 1)
  • Opus 4.6 has the tightest limits
  • No batch API yet
  • Prompt caching doesn't count against TPM limits

Google (Gemini 2.5 Pro, Flash, Gemini 3 Pro Preview)#

Google uses a simpler model:

| Model | Free Tier RPM | Free Tier RPD | Paid RPM | Paid TPM |
|-------|---------------|---------------|----------|----------|
| Gemini 2.5 Pro | 2 | 50 | 1,000 | 4,000,000 |
| Gemini 2.5 Flash | 15 | 1,500 | 2,000 | 4,000,000 |
| Gemini 2.5 Flash Lite | 30 | 1,500 | 4,000 | 4,000,000 |
| Gemini 3 Pro Preview | 2 | 50 | 500 | 2,000,000 |

Key notes:

  • Generous free tier (especially Flash)
  • Paid tier has high TPM (4M)
  • Lower RPM than OpenAI
  • 1M context window doesn't affect rate limits

DeepSeek (V3.2, R2)#

| Model | RPM | TPM | Concurrent |
|-------|-----|-----|------------|
| DeepSeek V3.2 | 60 | 1,000,000 | 10 |
| DeepSeek R2 | 30 | 500,000 | 5 |

Key notes:

  • Very low RPM (60) but high TPM
  • Concurrent request limits
  • No tier system — same limits for all
  • Frequent capacity issues during peak hours
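
A concurrency cap is separate from RPM, and the natural way to respect it is a semaphore. A minimal asyncio sketch (the cap of 10 mirrors V3.2's limit; `call_api` is a stand-in for your real request function):

```python
import asyncio

MAX_CONCURRENT = 10  # DeepSeek V3.2's concurrent-request cap

async def call_api(prompt: str) -> str:
    # Stand-in for a real API call
    await asyncio.sleep(0.01)
    return f"response:{prompt}"

async def run_all(prompts: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def limited(prompt: str) -> str:
        async with semaphore:  # at most MAX_CONCURRENT calls in flight
            return await call_api(prompt)

    return await asyncio.gather(*(limited(p) for p in prompts))

results = asyncio.run(run_all([f"q{i}" for i in range(50)]))
print(len(results))  # 50
```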

xAI (Grok 4.1)#

| Tier | RPM | TPM |
|------|-----|-----|
| Free | 10 | 20,000 |
| Basic | 60 | 100,000 |
| Standard | 600 | 1,000,000 |
| Enterprise | Custom | Custom |

Key notes:

  • Relatively new API, limits may change
  • Enterprise tier requires direct contact
  • Separate limits for Grok Vision

Mistral (Large 2, Codestral)#

| Model | RPM | TPM |
|-------|-----|-----|
| Mistral Large 2 | 300 | 2,000,000 |
| Codestral | 300 | 2,000,000 |
| Mistral Small | 300 | 2,000,000 |

Key notes:

  • Uniform limits across models
  • No tier system
  • Generous TPM

Meta (Llama 4 via providers)#

Llama 4 is open-weight, so rate limits depend on the hosting provider:

| Provider | RPM | TPM |
|----------|-----|-----|
| Together AI | 600 | 10,000,000 |
| Fireworks AI | 600 | 10,000,000 |
| Groq | 30 | 6,000 |
| Crazyrouter | 1,000 | 5,000,000 |

Side-by-Side Comparison#

RPM Comparison (Paid Tier)#

| Provider | Entry RPM | Max RPM | Time to Max |
|----------|-----------|---------|-------------|
| OpenAI | 500 | 10,000 | 30 days + $1K |
| Anthropic | 50 | 4,000 | $400 spent |
| Google | 1,000 | 4,000 | Immediate |
| DeepSeek | 60 | 60 | N/A |
| xAI | 60 | 600 | Tier upgrade |
| Mistral | 300 | 300 | N/A |
| Crazyrouter | 1,000 | 5,000 | Immediate |

Winner: OpenAI at Tier 5 (10K RPM), but Crazyrouter offers 1K RPM immediately.

TPM Comparison (Flagship Models)#

| Provider | Model | Max TPM |
|----------|-------|---------|
| OpenAI | GPT-5.2 | 5,000,000 |
| Google | Gemini 2.5 Pro | 4,000,000 |
| Mistral | Large 2 | 2,000,000 |
| DeepSeek | V3.2 | 1,000,000 |
| xAI | Grok 4.1 | 1,000,000 |
| Anthropic | Claude Opus 4.6 | 400,000 |

Winner: OpenAI (5M TPM), but Anthropic is notably restrictive (400K).

How Rate Limits Affect Real Applications#

Scenario 1: Customer Support Chatbot#

Requirements: 100 concurrent users, avg 500 tokens/request

| Provider | Can Handle? | Bottleneck |
|----------|-------------|------------|
| OpenAI (Tier 3) | ✅ Yes | None |
| Anthropic (Tier 2) | ⚠️ Barely | RPM (1,000) |
| Google (Paid) | ✅ Yes | None |
| DeepSeek | ❌ No | RPM (60) |

Scenario 2: Batch Content Generation#

Requirements: 10,000 articles/day, avg 2,000 tokens each

| Provider | Time to Complete | Bottleneck |
|----------|------------------|------------|
| OpenAI (Tier 5) | ~17 hours | TPM |
| OpenAI Batch API | ~24 hours | Batch queue |
| Anthropic (Tier 4) | ~83 hours | TPM |
| Google (Paid) | ~8 hours | TPM |
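
A pure TPM calculation only gives a lower bound on these times; real runs are much slower because RPM caps, per-request generation latency, and retries dominate. It is still a useful first sanity check (a rough sketch, not the model behind the estimates above):

```python
def tpm_floor_minutes(num_items: int, tokens_per_item: int, tpm: int) -> float:
    """Naive lower bound: total tokens divided by the TPM cap, ignoring all other limits."""
    return (num_items * tokens_per_item) / tpm

# 10,000 articles x 2,000 tokens against a 400,000 TPM cap:
print(tpm_floor_minutes(10_000, 2_000, 400_000))  # 50.0 (minutes) -- real runs take far longer
```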

Scenario 3: Real-time AI Application#

Requirements: 1,000 RPM sustained, low latency

| Provider | Can Handle? | Notes |
|----------|-------------|-------|
| OpenAI (Tier 4+) | ✅ Yes | 10K RPM |
| Anthropic | ❌ No | Max 4K RPM |
| Google | ✅ Yes | 4K RPM |
| Crazyrouter | ✅ Yes | 5K RPM + failover |

Rate Limit Handling Strategies#

Strategy 1: Exponential Backoff#

The most basic approach — retry with increasing delays:

python
import openai
import time
import random

client = openai.OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://api.crazyrouter.com/v1"
)

def call_with_backoff(messages, model="gpt-5-mini", max_retries=5):
    """Call API with exponential backoff on rate limits"""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1000
            )
            return response
        except openai.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            
            # Honor the Retry-After header when the server provides one
            # (the SDK exposes the raw httpx response on the exception;
            # there is no retry_after attribute on the error itself)
            headers = getattr(getattr(e, "response", None), "headers", None) or {}
            retry_after = headers.get("retry-after")
            if retry_after:
                wait_time = float(retry_after)
            else:
                # Exponential backoff with jitter
                wait_time = (2 ** attempt) + random.uniform(0, 1)
            
            print(f"Rate limited. Retrying in {wait_time:.1f}s (attempt {attempt + 1})")
            time.sleep(wait_time)
    
    return None

Strategy 2: Token Bucket Rate Limiter#

Pre-emptively limit your own request rate:

python
import asyncio
import time
from collections import deque

class TokenBucket:
    """Token bucket rate limiter"""
    
    def __init__(self, rpm=500, tpm=100000):
        self.rpm = rpm
        self.tpm = tpm
        self.request_times = deque()
        self.token_usage = deque()  # (timestamp, tokens)
    
    async def acquire(self, estimated_tokens=500):
        """Wait until we can make a request"""
        while True:
            now = time.time()
            
            # Clean old entries (older than 60s)
            while self.request_times and now - self.request_times[0] > 60:
                self.request_times.popleft()
            while self.token_usage and now - self.token_usage[0][0] > 60:
                self.token_usage.popleft()
            
            # Check RPM
            if len(self.request_times) >= self.rpm:
                wait = 60 - (now - self.request_times[0])
                await asyncio.sleep(max(0.1, wait))
                continue
            
            # Check TPM
            current_tokens = sum(t for _, t in self.token_usage)
            if current_tokens + estimated_tokens > self.tpm:
                wait = 60 - (now - self.token_usage[0][0])
                await asyncio.sleep(max(0.1, wait))
                continue
            
            # Record usage
            self.request_times.append(now)
            self.token_usage.append((now, estimated_tokens))
            return

# Usage
limiter = TokenBucket(rpm=500, tpm=100000)

async def rate_limited_call(messages, model="gpt-5-mini"):
    await limiter.acquire(estimated_tokens=500)
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    return response

Strategy 3: Multi-Provider Failover#

Route requests across providers when one hits limits:

python
import openai

# Crazyrouter handles this automatically, but here's manual approach:
providers = [
    {
        "name": "primary",
        "client": openai.OpenAI(
            api_key="your-crazyrouter-key",
            base_url="https://api.crazyrouter.com/v1"
        ),
        "model": "gpt-5-mini"
    },
    {
        "name": "fallback_1",
        "client": openai.OpenAI(
            api_key="your-openai-key",
            base_url="https://api.openai.com/v1"
        ),
        "model": "gpt-5-mini"
    },
    {
        "name": "fallback_2",
        "client": openai.OpenAI(
            api_key="your-anthropic-key",
            base_url="https://api.anthropic.com/v1"
        ),
        "model": "claude-sonnet-4-5"
    }
]

def call_with_failover(messages, max_tokens=1000):
    """Try each provider in order"""
    for provider in providers:
        try:
            response = provider["client"].chat.completions.create(
                model=provider["model"],
                messages=messages,
                max_tokens=max_tokens
            )
            return response
        except openai.RateLimitError:
            print(f"Rate limited on {provider['name']}, trying next...")
            continue
    
    raise Exception("All providers rate limited")

Strategy 4: Request Queuing#

Queue requests and process them within rate limits:

python
import asyncio
from asyncio import Queue

class RequestQueue:
    """Queue-based rate limiter"""
    
    def __init__(self, rpm=500):
        self.queue = Queue()
        self.rpm = rpm
        self.interval = 60.0 / rpm  # seconds between requests
    
    async def worker(self):
        """Process requests from queue"""
        while True:
            func, args, kwargs, future = await self.queue.get()
            try:
                result = await asyncio.to_thread(func, *args, **kwargs)
                future.set_result(result)
            except Exception as e:
                future.set_exception(e)
            
            await asyncio.sleep(self.interval)
    
    async def submit(self, func, *args, **kwargs):
        """Submit request to queue"""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((func, args, kwargs, future))
        return await future

# Usage (must run inside an event loop)
async def main():
    queue = RequestQueue(rpm=450)  # leave ~10% headroom below a 500 RPM limit
    asyncio.create_task(queue.worker())  # start the background worker

    # Submit requests
    result = await queue.submit(
        client.chat.completions.create,
        model="gpt-5-mini",
        messages=[{"role": "user", "content": "Hello"}]
    )
    return result

Strategy 5: Use Crazyrouter (Easiest)#

Crazyrouter handles rate limiting automatically:

python
import openai

# Single client, automatic rate limit handling
client = openai.OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://api.crazyrouter.com/v1"
)

# Crazyrouter automatically:
# - Routes to available providers
# - Handles 429 errors with retry
# - Load balances across multiple keys
# - Provides higher effective rate limits

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": "Hello"}]
)

Benefits:

  • 1,000+ RPM from day one (no tier grinding)
  • Automatic failover across providers
  • Built-in retry logic
  • 30% cost savings

Node.js Examples#

Exponential Backoff (Node.js)#

javascript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'your-crazyrouter-key',
  baseURL: 'https://api.crazyrouter.com/v1'
});

async function callWithBackoff(messages, model = 'gpt-5-mini', maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await client.chat.completions.create({
        model,
        messages,
        max_tokens: 1000
      });
      return response;
    } catch (error) {
      if (error.status === 429 && attempt < maxRetries - 1) {
        const waitTime = Math.pow(2, attempt) + Math.random();
        console.log(`Rate limited. Retrying in ${waitTime.toFixed(1)}s`);
        await new Promise(r => setTimeout(r, waitTime * 1000));
      } else {
        throw error;
      }
    }
  }
}

cURL with Retry#

bash
#!/bin/bash
# Rate-limit-aware API call with retry

MAX_RETRIES=5
RETRY_DELAY=2

for i in $(seq 1 $MAX_RETRIES); do
  RESPONSE=$(curl -s -w "\n%{http_code}" \
    -X POST https://api.crazyrouter.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer your-crazyrouter-key" \
    -d '{
      "model": "gpt-5-mini",
      "messages": [{"role": "user", "content": "Hello"}]
    }')
  
  HTTP_CODE=$(echo "$RESPONSE" | tail -n 1)
  BODY=$(echo "$RESPONSE" | sed '$d')  # drop the status line (portable, unlike head -n -1)
  
  if [ "$HTTP_CODE" = "200" ]; then
    echo "$BODY"
    exit 0
  elif [ "$HTTP_CODE" = "429" ]; then
    echo "Rate limited. Retry $i/$MAX_RETRIES in ${RETRY_DELAY}s..."
    sleep $RETRY_DELAY
    RETRY_DELAY=$((RETRY_DELAY * 2))
  else
    echo "Error: HTTP $HTTP_CODE"
    echo "$BODY"
    exit 1
  fi
done

echo "Max retries exceeded"
exit 1

Best Practices#

  1. Always implement retry logic — 429 errors are expected, not exceptional
  2. Use exponential backoff with jitter — prevents thundering herd
  3. Monitor your usage — track RPM/TPM to predict limits
  4. Pre-emptively rate limit — don't wait for 429s
  5. Use multiple providers — Crazyrouter makes this automatic
  6. Cache responses — reduce redundant API calls
  7. Use smaller models for simple tasks — higher limits, lower cost
  8. Batch when possible — OpenAI Batch API has separate limits
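
Point 6 is worth a concrete example: production traffic repeats itself, and a cache keyed on the request spends no rate-limit budget on duplicates. A minimal in-memory sketch (the `client` object and model name follow the earlier examples):

```python
import hashlib
import json

_cache: dict[str, object] = {}

def cached_completion(client, messages, model="gpt-5-mini"):
    """Return a cached response for identical (model, messages) pairs."""
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:  # only the first occurrence hits the API
        _cache[key] = client.chat.completions.create(model=model, messages=messages)
    return _cache[key]
```

For production use, swap the dict for Redis or another shared store with a TTL so cached answers can expire.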

Frequently Asked Questions#

Which provider has the highest rate limits?#

OpenAI at Tier 5 offers 10,000 RPM and 5M TPM for GPT-5.2. However, reaching Tier 5 requires $1,000+ in spending over 30+ days. Crazyrouter offers 1,000+ RPM immediately.

Why is Anthropic's rate limit so low?#

Anthropic prioritizes quality and safety over throughput. Their Tier 1 starts at just 50 RPM. For high-volume applications, consider using Crazyrouter which provides higher effective limits through load balancing.

Do rate limits apply per API key or per organization?#

  • OpenAI: Per organization (shared across all keys)
  • Anthropic: Per organization
  • Google: Per project
  • DeepSeek: Per API key
  • Crazyrouter: Per API key (higher limits)

How do I check my current rate limit usage?#

Most providers include rate limit headers in responses:

code
x-ratelimit-limit-requests: 500
x-ratelimit-remaining-requests: 499
x-ratelimit-reset-requests: 60s
x-ratelimit-limit-tokens: 100000
x-ratelimit-remaining-tokens: 99500
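
In the official `openai` Python SDK (v1+), these headers are reachable through `with_raw_response`. A short sketch (the helper function and its parsing are ours, not part of the SDK):

```python
def remaining_budget(headers: dict) -> dict:
    """Pull rate-limit state out of response headers (names as shown above)."""
    return {
        "requests_left": int(headers.get("x-ratelimit-remaining-requests", 0)),
        "tokens_left": int(headers.get("x-ratelimit-remaining-tokens", 0)),
    }

# With the openai SDK, raw headers come back via with_raw_response:
# raw = client.chat.completions.with_raw_response.create(
#     model="gpt-5-mini",
#     messages=[{"role": "user", "content": "Hello"}],
# )
# print(remaining_budget(dict(raw.headers)))
```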

Can I request higher rate limits?#

  • OpenAI: Automatic tier upgrades based on spending
  • Anthropic: Contact sales for custom limits
  • Google: Request quota increase in Cloud Console
  • Crazyrouter: Contact support for enterprise limits

What happens when I hit the rate limit?#

You receive a 429 Too Many Requests response with a Retry-After header indicating when to retry. Your application should handle this gracefully with retry logic.

Conclusion#

Rate limits vary dramatically across providers. OpenAI offers the highest limits but requires significant spending to unlock them. Anthropic is the most restrictive, especially for Claude Opus. Google provides generous free tiers but moderate paid limits.

For production applications, the best strategy is to put Crazyrouter in front of your providers:

  • 1,000+ RPM from day one (no tier grinding)
  • Automatic failover when one provider is limited
  • Built-in retry and load balancing
  • 30% cost savings across providers

Don't let rate limits break your application. Start building with reliable API access at crazyrouter.com.
