
AI Inference Speed Benchmark 2026: Tokens Per Second Compared#
When you're building production AI applications, inference speed matters as much as model quality. A 200ms first-token latency is fine for a chatbot, but for real-time voice agents or live coding assistants, latency can make or break the user experience.
This guide benchmarks real-world inference speed across major AI models in 2026: tokens per second (TPS), time to first token (TTFT), and what those numbers mean for your architecture.
Why Inference Speed Matters#
Different applications have different latency requirements:
| Application Type | Max Tolerable TTFT | Notes |
|---|---|---|
| Batch processing | >10s acceptable | Cost > speed |
| Chatbot (web) | 1-3s | Streaming hides latency |
| Code completion | <500ms | IDE plugins are sensitive |
| Voice agent | <300ms | Human conversation rhythm |
| Real-time translation | <200ms | Must feel instant |
| Edge/IoT | <100ms | Network RTT is the ceiling |
Key Metrics Explained#
- TTFT (Time to First Token): How long from request sent to first token received. Critical for perceived responsiveness.
- TPS (Tokens Per Second): Generation throughput after the first token. Determines how fast long responses complete.
- E2E Latency: Total time for a complete response. For short replies, TTFT dominates; for long replies, TPS dominates.
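The relationship between these metrics can be sketched in a few lines. The TTFT and TPS figures below are illustrative, taken from the Claude Sonnet row in the tables that follow:

```python
def estimate_e2e_latency_ms(ttft_ms: float, tps: float, output_tokens: int) -> float:
    """Rough end-to-end estimate: TTFT plus time to generate the output tokens."""
    return ttft_ms + (output_tokens / tps) * 1000

# Short reply: TTFT dominates
estimate_e2e_latency_ms(580, 98, 20)     # ~784ms, mostly TTFT
# Long reply: TPS dominates
estimate_e2e_latency_ms(580, 98, 1000)   # ~10,784ms, mostly generation
```

This is why a "slow TTFT, fast TPS" model can still feel quick for long outputs, and vice versa.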
AI Inference Speed Benchmark 2026#
Methodology: Benchmarks are based on community measurements, provider documentation, and Crazyrouter internal routing data (April 2026). Results vary significantly by region, time of day, and request size. All speeds measured via API with streaming enabled.
Large Models (70B+ Parameters)#
| Model | Provider | TTFT (p50) | TPS (p50) | Context |
|---|---|---|---|---|
| GPT-5.2 | OpenAI | 850ms | 62 tok/s | 128K |
| Claude Opus 4.6 | Anthropic | 1,100ms | 48 tok/s | 200K |
| Gemini 3 Pro Preview | Google | 720ms | 71 tok/s | 1M |
| Grok 4.1 Fast | xAI | 450ms | 95 tok/s | 128K |
| DeepSeek V3.2 | DeepSeek | 380ms | 110 tok/s | 64K |
| Kimi K2 Thinking | Moonshot | 1,200ms | 35 tok/s | 128K |
| MiniMax M2 | MiniMax | 520ms | 88 tok/s | 1M |
Mid-Size / Efficient Models#
| Model | Provider | TTFT (p50) | TPS (p50) | Context |
|---|---|---|---|---|
| GPT-5 Mini | OpenAI | 320ms | 145 tok/s | 128K |
| Claude Sonnet 4.5 | Anthropic | 580ms | 98 tok/s | 200K |
| Claude Haiku 4.5 | Anthropic | 180ms | 180 tok/s | 200K |
| Gemini 2.5 Flash | Google | 250ms | 190 tok/s | 1M |
| Gemini 2.5 Flash Lite | Google | 140ms | 240 tok/s | 1M |
| DeepSeek V3.2 (turbo) | DeepSeek | 280ms | 135 tok/s | 64K |
| Qwen 2.5 VL 72B | Alibaba | 410ms | 102 tok/s | 128K |
Speed Champions (Specialized Fast Models)#
| Model | TTFT (p50) | TPS | Use Case |
|---|---|---|---|
| Gemini 2.5 Flash Lite | 140ms | 240 | Batch, real-time |
| Claude Haiku 4.5 | 180ms | 180 | Edge, high-volume |
| GPT-5 Mini | 320ms | 145 | Code completion |
| Grok 4.1 Fast | 450ms | 95 | Reasoning tasks |
Speed vs Quality Trade-offs#
Here's how to think about model selection when speed matters:
High Speed, Lower Quality:
Gemini 2.5 Flash Lite → Claude Haiku 4.5 → GPT-5 Mini
Best for: autocomplete, simple Q&A, data extraction
Balanced Speed + Quality:
Gemini 2.5 Flash → Claude Sonnet 4.5 → DeepSeek V3.2
Best for: chatbots, content generation, analysis
High Quality, Slower:
Claude Opus 4.6 → GPT-5.2 → Kimi K2 Thinking
Best for: complex reasoning, code review, research
Real-World Latency by Region#
Model API servers are not evenly distributed geographically, so latency varies by region:
| Provider | Best Region | Additional Latency (Other Regions) |
|---|---|---|
| OpenAI | US East | +50-200ms (EU), +100-300ms (Asia) |
| Anthropic | US West | +80-250ms (EU), +150-400ms (Asia) |
| Google (Gemini) | Global | Low variance via CDN |
| xAI | US | +100-350ms (non-US) |
| DeepSeek | China | +150-400ms (US/EU), lower in Asia |
| Crazyrouter | Multi-region | Auto-routes to lowest latency |
Pro tip: Crazyrouter automatically routes your API calls to the lowest-latency provider endpoint for your region, which can reduce TTFT by 100-300ms for non-US users.
How to Optimize AI API Latency in Production#
1. Use Streaming Always#
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

# Always stream — users see output immediately
stream = client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Explain streaming APIs"}],
    stream=True  # Critical for perceived latency
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
2. Right-size Your Model Selection#
```python
def select_model(task_type: str, quality_requirement: str) -> str:
    """Route to the right model based on task needs."""
    if task_type == "autocomplete" or quality_requirement == "low":
        return "gemini-2.5-flash-lite"  # Fastest
    elif task_type == "chat" and quality_requirement == "medium":
        return "claude-haiku-4-5"  # Fast + capable
    elif task_type == "reasoning" or quality_requirement == "high":
        return "claude-opus-4-6"  # Best quality
    else:
        return "claude-sonnet-4-5"  # Good default
```
3. Implement Smart Fallbacks#
```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

async def fast_completion_with_fallback(prompt: str, timeout: float = 2.0):
    """Try fast model first, fall back to reliable model on timeout."""
    try:
        # Try the fastest model with a strict timeout
        response = await asyncio.wait_for(
            client.chat.completions.create(
                model="gemini-2.5-flash-lite",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=512
            ),
            timeout=timeout
        )
        return response.choices[0].message.content
    except asyncio.TimeoutError:
        # Fall back to the more reliable model
        response = await client.chat.completions.create(
            model="claude-sonnet-4-5",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512
        )
        return response.choices[0].message.content
```
4. Use Prompt Caching for Repeated Context#
If you're sending the same system prompt or large document in every request, prompt caching can dramatically reduce both latency and cost:
```python
# Claude supports caching via cache_control
response = client.chat.completions.create(
    model="claude-opus-4-6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an expert code reviewer. " + large_codebase_context,
                    "cache_control": {"type": "ephemeral"}  # Cache this block
                }
            ]
        },
        {"role": "user", "content": "Review this new function..."}
    ]
)
```
5. Parallel Requests for Independent Tasks#
```python
import asyncio

# Reuses the AsyncOpenAI client from the fallback example above
async def parallel_analysis(code_files: list[str]):
    """Analyze multiple files in parallel instead of sequentially."""
    tasks = [
        client.chat.completions.create(
            model="claude-haiku-4-5",
            messages=[
                {"role": "user", "content": f"Review this code:\n{file}"}
            ]
        )
        for file in code_files
    ]
    # All requests fire simultaneously
    responses = await asyncio.gather(*tasks)
    return [r.choices[0].message.content for r in responses]
```
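Firing every request at once can trip provider rate limits on large batches. One common pattern, sketched here without tying it to any specific SDK, is to bound concurrency with a semaphore:

```python
import asyncio

async def bounded_gather(coros, limit: int = 5):
    """Run awaitables with at most `limit` in flight, preserving input order."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

Drop this in place of the bare `asyncio.gather(*tasks)` call when `code_files` can be arbitrarily long.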
Speed Recommendations by Use Case#
| Use Case | Recommended Model | Reason |
|---|---|---|
| Real-time voice assistant | Gemini 2.5 Flash Lite or Claude Haiku 4.5 | TTFT <200ms |
| Code autocomplete (IDE) | GPT-5 Mini or Gemini 2.5 Flash | <350ms TTFT |
| Customer support chatbot | Claude Sonnet 4.5 | Balance of speed + quality |
| Document analysis (batch) | Claude Opus 4.6 or GPT-5.2 | Quality > speed |
| Long-form content generation | DeepSeek V3.2 | Fast TPS for long outputs |
| Multilingual app | Gemini 3 Pro | Strong multilingual + fast |
| Image + text understanding | Qwen 2.5 VL 72B | Good vision + reasonable speed |
Node.js Latency Measurement Tool#
```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.CRAZYROUTER_API_KEY,
  baseURL: 'https://crazyrouter.com/v1',
});

async function measureLatency(model, prompt) {
  const start = Date.now();
  let firstTokenTime = null;
  let tokenCount = 0;

  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    if (chunk.choices[0]?.delta?.content) {
      if (!firstTokenTime) firstTokenTime = Date.now() - start;
      tokenCount++; // counts stream chunks, a close proxy for tokens
    }
  }

  const totalTime = Date.now() - start;
  const generationTime = totalTime - firstTokenTime;
  const tps = Math.round(tokenCount / (generationTime / 1000));

  console.log(`Model: ${model}`);
  console.log(`TTFT: ${firstTokenTime}ms`);
  console.log(`TPS: ${tps} tokens/sec`);
  console.log(`Total: ${totalTime}ms`);
}

// Benchmark multiple models
const models = ['claude-opus-4-6', 'claude-sonnet-4-5', 'gemini-2.5-flash'];
const prompt = 'Explain the difference between async and parallel programming in 200 words.';

for (const model of models) {
  await measureLatency(model, prompt);
  await new Promise(r => setTimeout(r, 1000)); // Rate limit buffer
}
```
Frequently Asked Questions#
Q: Which AI model is fastest in 2026? A: For raw tokens per second, Gemini 2.5 Flash Lite (~240 TPS) and Claude Haiku 4.5 (~180 TPS) lead the efficient model category. For frontier models, Grok 4.1 Fast (~95 TPS) and DeepSeek V3.2 (~110 TPS) are notably quick.
Q: Does inference speed matter for my chatbot? A: Yes, but streaming mitigates it. With streaming, TTFT matters most — under 1s feels responsive. TPS matters more for long responses.
Q: Can I run benchmarks against Crazyrouter? A: Yes — the Node.js example above works with your Crazyrouter API key. Swap in any model name from the 300+ models available.
Q: Why is DeepSeek so fast? A: DeepSeek V3.2 uses a Mixture of Experts (MoE) architecture that activates only a subset of parameters per token, enabling high throughput without full model compute.
Q: How does prompt length affect speed? A: Longer prompts increase the "prefill" phase before generation starts, which raises TTFT. Very long contexts (>50K tokens) can add 500ms-2s to TTFT for most models.
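As a rough back-of-envelope check for this prefill effect, you can estimate whether a long context will blow a TTFT budget. The 50,000 tok/s prefill rate below is an illustrative assumption, not a measured figure; real rates vary widely by provider and hardware:

```python
def fits_ttft_budget(prompt_tokens: int, base_ttft_ms: float,
                     budget_ms: float, prefill_tps: float = 50_000) -> bool:
    """Estimate prefill-inflated TTFT and compare it against a budget.
    prefill_tps is an illustrative assumption; real rates vary by provider."""
    estimated_ms = base_ttft_ms + (prompt_tokens / prefill_tps) * 1000
    return estimated_ms <= budget_ms

# A 50K-token context adds ~1s under this assumption, in line with the
# 500ms-2s range mentioned above
fits_ttft_budget(50_000, 580, 3_000)  # True: ~1,580ms estimated
fits_ttft_budget(50_000, 580, 1_000)  # False: over budget
```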
Summary#
In 2026, the fastest frontier models for production use are:
- Lowest latency: Gemini 2.5 Flash Lite, Claude Haiku 4.5
- Best speed/quality balance: Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2
- Highest quality (slower): Claude Opus 4.6, GPT-5.2
For most production applications, streaming + model right-sizing eliminates perceived latency issues. For latency-critical use cases (voice, real-time coding), choose models with sub-300ms TTFT.
Crazyrouter provides access to all these models through a single OpenAI-compatible API, with automatic regional routing to minimize latency for your users worldwide.


