AI API Rate Limits Compared: Every Major Provider in 2026

Crazyrouter Team
March 12, 2026

Rate limits are the silent killer of AI applications. You build a great product, users love it, traffic spikes — and suddenly you're getting 429 errors everywhere. Understanding rate limits across providers is critical for building reliable AI applications.

This guide compares rate limits for every major AI API provider in March 2026, with practical strategies for handling them in production.

What Are API Rate Limits?#

Rate limits control how many API requests you can make within a time window. They're measured in:

  • RPM (Requests Per Minute): How many API calls per minute
  • TPM (Tokens Per Minute): How many tokens processed per minute
  • RPD (Requests Per Day): Daily request cap
  • Images/min: For image generation APIs

Exceeding these limits returns a 429 Too Many Requests error.
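
Whichever limit you hit first is the binding one, and it is often the token cap rather than the request cap. A quick back-of-the-envelope helper makes this concrete (a standalone sketch, not tied to any provider's SDK):

```python
def effective_rpm(rpm_limit: int, tpm_limit: int, avg_tokens_per_request: int) -> int:
    """Sustainable requests/minute once both the RPM and TPM caps are applied."""
    tpm_bound = tpm_limit // avg_tokens_per_request  # requests/min the token cap allows
    return min(rpm_limit, tpm_bound)

# On a 500 RPM / 30,000 TPM tier, ~500-token requests are capped at
# 60 requests/minute by tokens, far below the nominal 500 RPM:
print(effective_rpm(500, 30_000, 500))  # 60
```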

Rate Limits by Provider#

OpenAI (GPT-5.2, GPT-5-mini, DALL-E)#

OpenAI uses a tier system based on spending history:

| Tier | Qualification | RPM | TPM (GPT-5.2) | TPM (GPT-5-mini) |
|------|---------------|-----|---------------|------------------|
| Free | New account | 3 | 40,000 | 200,000 |
| Tier 1 | $5 paid | 500 | 30,000 | 200,000 |
| Tier 2 | $50 paid + 7 days | 5,000 | 450,000 | 2,000,000 |
| Tier 3 | $100 paid + 7 days | 5,000 | 800,000 | 4,000,000 |
| Tier 4 | $250 paid + 14 days | 10,000 | 2,000,000 | 10,000,000 |
| Tier 5 | $1,000 paid + 30 days | 10,000 | 5,000,000 | 30,000,000 |

Image Generation (DALL-E 3):

  • Tier 1: 7 images/min
  • Tier 5: 50 images/min

Batch API: 50% cost discount, separate rate limits from synchronous traffic, results returned within 24 hours.

Key notes:

  • Tier upgrades are automatic based on spending
  • Organization-level limits (shared across all keys)
  • Separate limits per model family

Anthropic (Claude Opus 4.6, Sonnet 4.5, Haiku 4.5)#

Anthropic also uses spending-based tiers:

| Tier | Qualification | RPM | TPM (Opus 4.6) | TPM (Sonnet 4.5) | TPM (Haiku 4.5) |
|------|---------------|-----|----------------|------------------|-----------------|
| Tier 1 | $5 credit | 50 | 40,000 | 40,000 | 50,000 |
| Tier 2 | $40 spent | 1,000 | 80,000 | 80,000 | 100,000 |
| Tier 3 | $200 spent | 2,000 | 160,000 | 160,000 | 200,000 |
| Tier 4 | $400 spent | 4,000 | 400,000 | 400,000 | 800,000 |

Key notes:

  • Much lower RPM than OpenAI (50 vs 500 at Tier 1)
  • Opus 4.6 has the tightest limits
  • No batch API yet
  • Prompt caching doesn't count against TPM limits

Google (Gemini 2.5 Pro, Flash, Gemini 3 Pro Preview)#

Google uses a simpler model:

| Model | Free Tier RPM | Free Tier RPD | Paid RPM | Paid TPM |
|-------|---------------|---------------|----------|----------|
| Gemini 2.5 Pro | 2 | 50 | 1,000 | 4,000,000 |
| Gemini 2.5 Flash | 15 | 1,500 | 2,000 | 4,000,000 |
| Gemini 2.5 Flash Lite | 30 | 1,500 | 4,000 | 4,000,000 |
| Gemini 3 Pro Preview | 2 | 50 | 500 | 2,000,000 |

Key notes:

  • Generous free tier (especially Flash)
  • Paid tier has high TPM (4M)
  • Lower RPM than OpenAI
  • 1M context window doesn't affect rate limits

DeepSeek (V3.2, R2)#

| Model | RPM | TPM | Concurrent |
|-------|-----|-----|------------|
| DeepSeek V3.2 | 60 | 1,000,000 | 10 |
| DeepSeek R2 | 30 | 500,000 | 5 |

Key notes:

  • Very low RPM (60) but high TPM
  • Concurrent request limits
  • No tier system — same limits for all
  • Frequent capacity issues during peak hours
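
A concurrency cap is separate from RPM, and the natural way to respect it is a semaphore. A minimal asyncio sketch (the cap of 10 mirrors V3.2's limit; `call_api` is a stand-in for your real request function):

```python
import asyncio

MAX_CONCURRENT = 10  # DeepSeek V3.2's concurrent-request cap

async def call_api(prompt: str) -> str:
    # Stand-in for a real API call
    await asyncio.sleep(0.01)
    return f"response:{prompt}"

async def run_all(prompts: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def limited(prompt: str) -> str:
        async with semaphore:  # at most MAX_CONCURRENT calls in flight
            return await call_api(prompt)

    return await asyncio.gather(*(limited(p) for p in prompts))

results = asyncio.run(run_all([f"q{i}" for i in range(50)]))
print(len(results))  # 50
```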

xAI (Grok 4.1)#

| Tier | RPM | TPM |
|------|-----|-----|
| Free | 10 | 20,000 |
| Basic | 60 | 100,000 |
| Standard | 600 | 1,000,000 |
| Enterprise | Custom | Custom |

Key notes:

  • Relatively new API, limits may change
  • Enterprise tier requires direct contact
  • Separate limits for Grok Vision

Mistral (Large 2, Codestral)#

| Model | RPM | TPM |
|-------|-----|-----|
| Mistral Large 2 | 300 | 2,000,000 |
| Codestral | 300 | 2,000,000 |
| Mistral Small | 300 | 2,000,000 |

Key notes:

  • Uniform limits across models
  • No tier system
  • Generous TPM

Meta (Llama 4 via providers)#

Llama 4 is open-weight, so rate limits depend on the hosting provider:

| Provider | RPM | TPM |
|----------|-----|-----|
| Together AI | 600 | 10,000,000 |
| Fireworks AI | 600 | 10,000,000 |
| Groq | 30 | 6,000 |
| Crazyrouter | 1,000 | 5,000,000 |

Side-by-Side Comparison#

RPM Comparison (Paid Tier)#

| Provider | Entry RPM | Max RPM | Time to Max |
|----------|-----------|---------|-------------|
| OpenAI | 500 | 10,000 | 30 days + $1K |
| Anthropic | 50 | 4,000 | $400 spent |
| Google | 1,000 | 4,000 | Immediate |
| DeepSeek | 60 | 60 | N/A |
| xAI | 60 | 600 | Tier upgrade |
| Mistral | 300 | 300 | N/A |
| Crazyrouter | 1,000 | 5,000 | Immediate |

Winner: OpenAI at Tier 5 (10K RPM), but Crazyrouter offers 1K RPM immediately.

TPM Comparison (Flagship Models)#

| Provider | Model | Max TPM |
|----------|-------|---------|
| OpenAI | GPT-5.2 | 5,000,000 |
| Google | Gemini 2.5 Pro | 4,000,000 |
| Mistral | Large 2 | 2,000,000 |
| DeepSeek | V3.2 | 1,000,000 |
| xAI | Grok 4.1 | 1,000,000 |
| Anthropic | Claude Opus 4.6 | 400,000 |

Winner: OpenAI (5M TPM), but Anthropic is notably restrictive (400K).

How Rate Limits Affect Real Applications#

Scenario 1: Customer Support Chatbot#

Requirements: 100 concurrent users, avg 500 tokens/request

| Provider | Can Handle? | Bottleneck |
|----------|-------------|------------|
| OpenAI (Tier 3) | ✅ Yes | None |
| Anthropic (Tier 2) | ⚠️ Barely | RPM (1,000) |
| Google (Paid) | ✅ Yes | None |
| DeepSeek | ❌ No | RPM (60) |

Scenario 2: Batch Content Generation#

Requirements: 10,000 articles/day, avg 2,000 tokens each

| Provider | Time to Complete | Bottleneck |
|----------|------------------|------------|
| OpenAI (Tier 5) | ~17 hours | TPM |
| OpenAI Batch API | ~24 hours | Batch queue |
| Anthropic (Tier 4) | ~83 hours | TPM |
| Google (Paid) | ~8 hours | TPM |
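
A pure TPM calculation only gives a lower bound on these times; real runs are much slower because RPM caps, per-request generation latency, and retries dominate. It is still a useful first sanity check (a rough sketch, not the model behind the estimates above):

```python
def tpm_floor_minutes(num_items: int, tokens_per_item: int, tpm: int) -> float:
    """Naive lower bound: total tokens divided by the TPM cap, ignoring all other limits."""
    return (num_items * tokens_per_item) / tpm

# 10,000 articles x 2,000 tokens against a 400,000 TPM cap:
print(tpm_floor_minutes(10_000, 2_000, 400_000))  # 50.0 (minutes) -- real runs take far longer
```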

Scenario 3: Real-time AI Application#

Requirements: 1,000 RPM sustained, low latency

| Provider | Can Handle? | Notes |
|----------|-------------|-------|
| OpenAI (Tier 4+) | ✅ Yes | 10K RPM |
| Anthropic | ❌ No | Max 4K RPM |
| Google | ✅ Yes | 4K RPM |
| Crazyrouter | ✅ Yes | 5K RPM + failover |

Rate Limit Handling Strategies#

Strategy 1: Exponential Backoff#

The most basic approach — retry with increasing delays:

python
import openai
import time
import random

client = openai.OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://api.crazyrouter.com/v1"
)

def call_with_backoff(messages, model="gpt-5-mini", max_retries=5):
    """Call API with exponential backoff on rate limits"""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1000
            )
            return response
        except openai.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            
            # Honor the Retry-After header when the server provides one
            # (the SDK exposes the raw httpx response on the exception;
            # there is no retry_after attribute on the error itself)
            headers = getattr(getattr(e, "response", None), "headers", None) or {}
            retry_after = headers.get("retry-after")
            if retry_after:
                wait_time = float(retry_after)
            else:
                # Exponential backoff with jitter
                wait_time = (2 ** attempt) + random.uniform(0, 1)
            
            print(f"Rate limited. Retrying in {wait_time:.1f}s (attempt {attempt + 1})")
            time.sleep(wait_time)
    
    return None

Strategy 2: Token Bucket Rate Limiter#

Pre-emptively limit your own request rate:

python
import asyncio
import time
from collections import deque

class TokenBucket:
    """Token bucket rate limiter"""
    
    def __init__(self, rpm=500, tpm=100000):
        self.rpm = rpm
        self.tpm = tpm
        self.request_times = deque()
        self.token_usage = deque()  # (timestamp, tokens)
    
    async def acquire(self, estimated_tokens=500):
        """Wait until we can make a request"""
        while True:
            now = time.time()
            
            # Clean old entries (older than 60s)
            while self.request_times and now - self.request_times[0] > 60:
                self.request_times.popleft()
            while self.token_usage and now - self.token_usage[0][0] > 60:
                self.token_usage.popleft()
            
            # Check RPM
            if len(self.request_times) >= self.rpm:
                wait = 60 - (now - self.request_times[0])
                await asyncio.sleep(max(0.1, wait))
                continue
            
            # Check TPM
            current_tokens = sum(t for _, t in self.token_usage)
            if current_tokens + estimated_tokens > self.tpm:
                wait = 60 - (now - self.token_usage[0][0])
                await asyncio.sleep(max(0.1, wait))
                continue
            
            # Record usage
            self.request_times.append(now)
            self.token_usage.append((now, estimated_tokens))
            return

# Usage
limiter = TokenBucket(rpm=500, tpm=100000)

async def rate_limited_call(messages, model="gpt-5-mini"):
    await limiter.acquire(estimated_tokens=500)
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    return response

Strategy 3: Multi-Provider Failover#

Route requests across providers when one hits limits:

python
import openai

# Crazyrouter handles this automatically, but here's manual approach:
providers = [
    {
        "name": "primary",
        "client": openai.OpenAI(
            api_key="your-crazyrouter-key",
            base_url="https://api.crazyrouter.com/v1"
        ),
        "model": "gpt-5-mini"
    },
    {
        "name": "fallback_1",
        "client": openai.OpenAI(
            api_key="your-openai-key",
            base_url="https://api.openai.com/v1"
        ),
        "model": "gpt-5-mini"
    },
    {
        "name": "fallback_2",
        "client": openai.OpenAI(
            api_key="your-anthropic-key",
            base_url="https://api.anthropic.com/v1"
        ),
        "model": "claude-sonnet-4-5"
    }
]

def call_with_failover(messages, max_tokens=1000):
    """Try each provider in order"""
    for provider in providers:
        try:
            response = provider["client"].chat.completions.create(
                model=provider["model"],
                messages=messages,
                max_tokens=max_tokens
            )
            return response
        except openai.RateLimitError:
            print(f"Rate limited on {provider['name']}, trying next...")
            continue
    
    raise Exception("All providers rate limited")

Strategy 4: Request Queuing#

Queue requests and process them within rate limits:

python
import asyncio
from asyncio import Queue

class RequestQueue:
    """Queue-based rate limiter"""
    
    def __init__(self, rpm=500):
        self.queue = Queue()
        self.rpm = rpm
        self.interval = 60.0 / rpm  # seconds between requests
    
    async def worker(self):
        """Process requests from queue"""
        while True:
            func, args, kwargs, future = await self.queue.get()
            try:
                result = await asyncio.to_thread(func, *args, **kwargs)
                future.set_result(result)
            except Exception as e:
                future.set_exception(e)
            
            await asyncio.sleep(self.interval)
    
    async def submit(self, func, *args, **kwargs):
        """Submit request to queue"""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((func, args, kwargs, future))
        return await future

# Usage (must run inside an event loop)
async def main():
    queue = RequestQueue(rpm=450)  # leave ~10% headroom below a 500 RPM limit
    asyncio.create_task(queue.worker())  # start the background worker

    # Submit requests
    result = await queue.submit(
        client.chat.completions.create,
        model="gpt-5-mini",
        messages=[{"role": "user", "content": "Hello"}]
    )
    return result

Strategy 5: Use Crazyrouter (Easiest)#

Crazyrouter handles rate limiting automatically:

python
import openai

# Single client, automatic rate limit handling
client = openai.OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://api.crazyrouter.com/v1"
)

# Crazyrouter automatically:
# - Routes to available providers
# - Handles 429 errors with retry
# - Load balances across multiple keys
# - Provides higher effective rate limits

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": "Hello"}]
)

Benefits:

  • 1,000+ RPM from day one (no tier grinding)
  • Automatic failover across providers
  • Built-in retry logic
  • 30% cost savings

Node.js Examples#

Exponential Backoff (Node.js)#

javascript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'your-crazyrouter-key',
  baseURL: 'https://api.crazyrouter.com/v1'
});

async function callWithBackoff(messages, model = 'gpt-5-mini', maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await client.chat.completions.create({
        model,
        messages,
        max_tokens: 1000
      });
      return response;
    } catch (error) {
      if (error.status === 429 && attempt < maxRetries - 1) {
        const waitTime = Math.pow(2, attempt) + Math.random();
        console.log(`Rate limited. Retrying in ${waitTime.toFixed(1)}s`);
        await new Promise(r => setTimeout(r, waitTime * 1000));
      } else {
        throw error;
      }
    }
  }
}

cURL with Retry#

bash
#!/bin/bash
# Rate-limit-aware API call with retry

MAX_RETRIES=5
RETRY_DELAY=2

for i in $(seq 1 $MAX_RETRIES); do
  RESPONSE=$(curl -s -w "\n%{http_code}" \
    -X POST https://api.crazyrouter.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer your-crazyrouter-key" \
    -d '{
      "model": "gpt-5-mini",
      "messages": [{"role": "user", "content": "Hello"}]
    }')
  
  HTTP_CODE=$(echo "$RESPONSE" | tail -n 1)
  BODY=$(echo "$RESPONSE" | sed '$d')  # drop the status line (portable, unlike head -n -1)
  
  if [ "$HTTP_CODE" = "200" ]; then
    echo "$BODY"
    exit 0
  elif [ "$HTTP_CODE" = "429" ]; then
    echo "Rate limited. Retry $i/$MAX_RETRIES in ${RETRY_DELAY}s..."
    sleep $RETRY_DELAY
    RETRY_DELAY=$((RETRY_DELAY * 2))
  else
    echo "Error: HTTP $HTTP_CODE"
    echo "$BODY"
    exit 1
  fi
done

echo "Max retries exceeded"
exit 1

Best Practices#

  1. Always implement retry logic — 429 errors are expected, not exceptional
  2. Use exponential backoff with jitter — prevents thundering herd
  3. Monitor your usage — track RPM/TPM to predict limits
  4. Pre-emptively rate limit — don't wait for 429s
  5. Use multiple providers — Crazyrouter makes this automatic
  6. Cache responses — reduce redundant API calls
  7. Use smaller models for simple tasks — higher limits, lower cost
  8. Batch when possible — OpenAI Batch API has separate limits
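
Point 6 is worth a concrete example: production traffic repeats itself, and a cache keyed on the request spends no rate-limit budget on duplicates. A minimal in-memory sketch (the `client` object and model name follow the earlier examples):

```python
import hashlib
import json

_cache: dict[str, object] = {}

def cached_completion(client, messages, model="gpt-5-mini"):
    """Return a cached response for identical (model, messages) pairs."""
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:  # only the first occurrence hits the API
        _cache[key] = client.chat.completions.create(model=model, messages=messages)
    return _cache[key]
```

For production use, swap the dict for Redis or another shared store with a TTL so cached answers can expire.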

Frequently Asked Questions#

Which provider has the highest rate limits?#

OpenAI at Tier 5 offers 10,000 RPM and 5M TPM for GPT-5.2. However, reaching Tier 5 requires $1,000+ in spending over 30+ days. Crazyrouter offers 1,000+ RPM immediately.

Why is Anthropic's rate limit so low?#

Anthropic prioritizes quality and safety over throughput. Their Tier 1 starts at just 50 RPM. For high-volume applications, consider using Crazyrouter which provides higher effective limits through load balancing.

Do rate limits apply per API key or per organization?#

  • OpenAI: Per organization (shared across all keys)
  • Anthropic: Per organization
  • Google: Per project
  • DeepSeek: Per API key
  • Crazyrouter: Per API key (higher limits)

How do I check my current rate limit usage?#

Most providers include rate limit headers in responses:

code
x-ratelimit-limit-requests: 500
x-ratelimit-remaining-requests: 499
x-ratelimit-reset-requests: 60s
x-ratelimit-limit-tokens: 100000
x-ratelimit-remaining-tokens: 99500
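
In the official `openai` Python SDK (v1+), these headers are reachable through `with_raw_response`. A short sketch (the helper function and its parsing are ours, not part of the SDK):

```python
def remaining_budget(headers: dict) -> dict:
    """Pull rate-limit state out of response headers (names as shown above)."""
    return {
        "requests_left": int(headers.get("x-ratelimit-remaining-requests", 0)),
        "tokens_left": int(headers.get("x-ratelimit-remaining-tokens", 0)),
    }

# With the openai SDK, raw headers come back via with_raw_response:
# raw = client.chat.completions.with_raw_response.create(
#     model="gpt-5-mini",
#     messages=[{"role": "user", "content": "Hello"}],
# )
# print(remaining_budget(dict(raw.headers)))
```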

Can I request higher rate limits?#

  • OpenAI: Automatic tier upgrades based on spending
  • Anthropic: Contact sales for custom limits
  • Google: Request quota increase in Cloud Console
  • Crazyrouter: Contact support for enterprise limits

What happens when I hit the rate limit?#

You receive a 429 Too Many Requests response with a Retry-After header indicating when to retry. Your application should handle this gracefully with retry logic.

Conclusion#

Rate limits vary dramatically across providers. OpenAI offers the highest limits but requires significant spending to unlock them. Anthropic is the most restrictive, especially for Claude Opus. Google provides generous free tiers but moderate paid limits.

For production applications, the best strategy is to put Crazyrouter in front of your providers:

  • 1,000+ RPM from day one (no tier grinding)
  • Automatic failover when one provider is limited
  • Built-in retry and load balancing
  • 30% cost savings across providers

Don't let rate limits break your application. Start building with reliable API access at crazyrouter.com.
