
AI Inference Speed Benchmark 2026: Tokens Per Second Compared#
When you're building production AI applications, inference speed matters as much as model quality. A 200ms first-token latency is fine for a chatbot, but for real-time voice agents or live coding assistants, latency can make or break the user experience.
This guide benchmarks real-world inference speed across major AI models in 2026: tokens per second (TPS), time to first token (TTFT), and what those numbers mean for your architecture.
Why Inference Speed Matters#
Different applications have different latency requirements:
| Application Type | Max Tolerable TTFT | Notes |
|---|---|---|
| Batch processing | >10s acceptable | Cost > speed |
| Chatbot (web) | 1-3s | Streaming hides latency |
| Code completion | <500ms | IDE plugins are sensitive |
| Voice agent | <300ms | Human conversation rhythm |
| Real-time translation | <200ms | Must feel instant |
| Edge/IoT | <100ms | Network RTT is the ceiling |
Key Metrics Explained#
- TTFT (Time to First Token): How long from request sent to first token received. Critical for perceived responsiveness.
- TPS (Tokens Per Second): Generation throughput after the first token. Determines how fast long responses complete.
- E2E Latency: Total time for a complete response. For short replies, TTFT dominates; for long replies, TPS dominates.
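The relationship between these metrics can be sketched in a few lines. The TTFT and TPS figures below are illustrative, taken from the Claude Sonnet row in the tables that follow:

```python
def estimate_e2e_latency_ms(ttft_ms: float, tps: float, output_tokens: int) -> float:
    """Rough end-to-end estimate: TTFT plus time to generate the output tokens."""
    return ttft_ms + (output_tokens / tps) * 1000

# Short reply: TTFT dominates
estimate_e2e_latency_ms(580, 98, 20)     # ~784ms, mostly TTFT
# Long reply: TPS dominates
estimate_e2e_latency_ms(580, 98, 1000)   # ~10,784ms, mostly generation
```

This is why a "slow TTFT, fast TPS" model can still feel quick for long outputs, and vice versa.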
AI Inference Speed Benchmark 2026#
Methodology: Benchmarks are based on community measurements, provider documentation, and Crazyrouter internal routing data (April 2026). Results vary significantly by region, time of day, and request size. All speeds measured via API with streaming enabled.
Large Models (70B+ Parameters)#
| Model | Provider | TTFT (p50) | TPS (p50) | Context |
|---|---|---|---|---|
| GPT-5.2 | OpenAI | 850ms | 62 tok/s | 128K |
| Claude Opus 4.6 | Anthropic | 1,100ms | 48 tok/s | 200K |
| Gemini 3 Pro Preview | Google | 720ms | 71 tok/s | 1M |
| Grok 4.1 Fast | xAI | 450ms | 95 tok/s | 128K |
| DeepSeek V3.2 | DeepSeek | 380ms | 110 tok/s | 64K |
| Kimi K2 Thinking | Moonshot | 1,200ms | 35 tok/s | 128K |
| MiniMax M2 | MiniMax | 520ms | 88 tok/s | 1M |
Mid-Size / Efficient Models#
| Model | Provider | TTFT (p50) | TPS (p50) | Context |
|---|---|---|---|---|
| GPT-5 Mini | OpenAI | 320ms | 145 tok/s | 128K |
| Claude Sonnet 4.5 | Anthropic | 580ms | 98 tok/s | 200K |
| Claude Haiku 4.5 | Anthropic | 180ms | 180 tok/s | 200K |
| Gemini 2.5 Flash | Google | 250ms | 190 tok/s | 1M |
| Gemini 2.5 Flash Lite | Google | 140ms | 240 tok/s | 1M |
| DeepSeek V3.2 (turbo) | DeepSeek | 280ms | 135 tok/s | 64K |
| Qwen 2.5 VL 72B | Alibaba | 410ms | 102 tok/s | 128K |
Speed Champions (Specialized Fast Models)#
| Model | TTFT (p50) | TPS | Use Case |
|---|---|---|---|
| Gemini 2.5 Flash Lite | 140ms | 240 | Batch, real-time |
| Claude Haiku 4.5 | 180ms | 180 | Edge, high-volume |
| GPT-5 Mini | 320ms | 145 | Code completion |
| Grok 4.1 Fast | 450ms | 95 | Reasoning tasks |
Speed vs Quality Trade-offs#
Here's how to think about model selection when speed matters:
High Speed, Lower Quality:
Gemini 2.5 Flash Lite → Claude Haiku 4.5 → GPT-5 Mini
Best for: autocomplete, simple Q&A, data extraction
Balanced Speed + Quality:
Gemini 2.5 Flash → Claude Sonnet 4.5 → DeepSeek V3.2
Best for: chatbots, content generation, analysis
High Quality, Slower:
Claude Opus 4.6 → GPT-5.2 → Kimi K2 Thinking
Best for: complex reasoning, code review, research
Real-World Latency by Region#
Model API servers are not evenly distributed geographically, so latency varies by region:
| Provider | Best Region | Additional Latency (Other Regions) |
|---|---|---|
| OpenAI | US East | +50-200ms (EU), +100-300ms (Asia) |
| Anthropic | US West | +80-250ms (EU), +150-400ms (Asia) |
| Google (Gemini) | Global | Low variance via CDN |
| xAI | US | +100-350ms (non-US) |
| DeepSeek | China | +150-400ms (US/EU), lower in Asia |
| Crazyrouter | Multi-region | Auto-routes to lowest latency |
Pro tip: Crazyrouter automatically routes your API calls to the lowest-latency provider endpoint for your region, which can reduce TTFT by 100-300ms for non-US users.
How to Optimize AI API Latency in Production#
1. Use Streaming Always#
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

# Always stream — users see output immediately
stream = client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Explain streaming APIs"}],
    stream=True  # Critical for perceived latency
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
2. Right-size Your Model Selection#
```python
def select_model(task_type: str, quality_requirement: str) -> str:
    """Route to the right model based on task needs."""
    if task_type == "autocomplete" or quality_requirement == "low":
        return "gemini-2.5-flash-lite"  # Fastest
    elif task_type == "chat" and quality_requirement == "medium":
        return "claude-haiku-4-5"  # Fast + capable
    elif task_type == "reasoning" or quality_requirement == "high":
        return "claude-opus-4-6"  # Best quality
    else:
        return "claude-sonnet-4-5"  # Good default
```
3. Implement Smart Fallbacks#
```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

async def fast_completion_with_fallback(prompt: str, timeout: float = 2.0):
    """Try fast model first, fall back to reliable model on timeout."""
    try:
        # Try the fastest model with a strict timeout
        response = await asyncio.wait_for(
            client.chat.completions.create(
                model="gemini-2.5-flash-lite",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=512
            ),
            timeout=timeout
        )
        return response.choices[0].message.content
    except asyncio.TimeoutError:
        # Fall back to the more reliable model
        response = await client.chat.completions.create(
            model="claude-sonnet-4-5",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512
        )
        return response.choices[0].message.content
```
4. Use Prompt Caching for Repeated Context#
If you're sending the same system prompt or large document in every request, prompt caching can dramatically reduce both latency and cost:
```python
# Claude supports caching via cache_control
response = client.chat.completions.create(
    model="claude-opus-4-6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an expert code reviewer. " + large_codebase_context,
                    "cache_control": {"type": "ephemeral"}  # Cache this block
                }
            ]
        },
        {"role": "user", "content": "Review this new function..."}
    ]
)
```
5. Parallel Requests for Independent Tasks#
```python
import asyncio

# Reuses the AsyncOpenAI client from the fallback example above
async def parallel_analysis(code_files: list[str]):
    """Analyze multiple files in parallel instead of sequentially."""
    tasks = [
        client.chat.completions.create(
            model="claude-haiku-4-5",
            messages=[
                {"role": "user", "content": f"Review this code:\n{file}"}
            ]
        )
        for file in code_files
    ]
    # All requests fire simultaneously
    responses = await asyncio.gather(*tasks)
    return [r.choices[0].message.content for r in responses]
```
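Firing every request at once can trip provider rate limits on large batches. One common pattern, sketched here without tying it to any specific SDK, is to bound concurrency with a semaphore:

```python
import asyncio

async def bounded_gather(coros, limit: int = 5):
    """Run awaitables with at most `limit` in flight, preserving input order."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

Drop this in place of the bare `asyncio.gather(*tasks)` call when `code_files` can be arbitrarily long.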
Speed Recommendations by Use Case#
| Use Case | Recommended Model | Reason |
|---|---|---|
| Real-time voice assistant | Gemini 2.5 Flash Lite or Claude Haiku 4.5 | TTFT <200ms |
| Code autocomplete (IDE) | GPT-5 Mini or Gemini 2.5 Flash | <350ms TTFT |
| Customer support chatbot | Claude Sonnet 4.5 | Balance of speed + quality |
| Document analysis (batch) | Claude Opus 4.6 or GPT-5.2 | Quality > speed |
| Long-form content generation | DeepSeek V3.2 | Fast TPS for long outputs |
| Multilingual app | Gemini 3 Pro | Strong multilingual + fast |
| Image + text understanding | Qwen 2.5 VL 72B | Good vision + reasonable speed |
Node.js Latency Measurement Tool#
```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.CRAZYROUTER_API_KEY,
  baseURL: 'https://crazyrouter.com/v1',
});

async function measureLatency(model, prompt) {
  const start = Date.now();
  let firstTokenTime = null;
  let tokenCount = 0;

  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    if (chunk.choices[0]?.delta?.content) {
      if (!firstTokenTime) firstTokenTime = Date.now() - start;
      tokenCount++; // counts stream chunks, a close proxy for tokens
    }
  }

  const totalTime = Date.now() - start;
  const generationTime = totalTime - firstTokenTime;
  const tps = Math.round(tokenCount / (generationTime / 1000));

  console.log(`Model: ${model}`);
  console.log(`TTFT: ${firstTokenTime}ms`);
  console.log(`TPS: ${tps} tokens/sec`);
  console.log(`Total: ${totalTime}ms`);
}

// Benchmark multiple models
const models = ['claude-opus-4-6', 'claude-sonnet-4-5', 'gemini-2.5-flash'];
const prompt = 'Explain the difference between async and parallel programming in 200 words.';

for (const model of models) {
  await measureLatency(model, prompt);
  await new Promise(r => setTimeout(r, 1000)); // Rate limit buffer
}
```
Frequently Asked Questions#
Q: Which AI model is fastest in 2026? A: For raw tokens per second, Gemini 2.5 Flash Lite (~240 TPS) and Claude Haiku 4.5 (~180 TPS) lead the efficient model category. For frontier models, Grok 4.1 Fast (~95 TPS) and DeepSeek V3.2 (~110 TPS) are notably quick.
Q: Does inference speed matter for my chatbot? A: Yes, but streaming mitigates it. With streaming, TTFT matters most — under 1s feels responsive. TPS matters more for long responses.
Q: Can I run benchmarks against Crazyrouter? A: Yes — the Node.js example above works with your Crazyrouter API key. Swap in any model name from the 300+ models available.
Q: Why is DeepSeek so fast? A: DeepSeek V3.2 uses a Mixture of Experts (MoE) architecture that activates only a subset of parameters per token, enabling high throughput without full model compute.
Q: How does prompt length affect speed? A: Longer prompts increase the "prefill" phase before generation starts, which raises TTFT. Very long contexts (>50K tokens) can add 500ms-2s to TTFT for most models.
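As a rough back-of-envelope check for this prefill effect, you can estimate whether a long context will blow a TTFT budget. The 50,000 tok/s prefill rate below is an illustrative assumption, not a measured figure; real rates vary widely by provider and hardware:

```python
def fits_ttft_budget(prompt_tokens: int, base_ttft_ms: float,
                     budget_ms: float, prefill_tps: float = 50_000) -> bool:
    """Estimate prefill-inflated TTFT and compare it against a budget.
    prefill_tps is an illustrative assumption; real rates vary by provider."""
    estimated_ms = base_ttft_ms + (prompt_tokens / prefill_tps) * 1000
    return estimated_ms <= budget_ms

# A 50K-token context adds ~1s under this assumption, in line with the
# 500ms-2s range mentioned above
fits_ttft_budget(50_000, 580, 3_000)  # True: ~1,580ms estimated
fits_ttft_budget(50_000, 580, 1_000)  # False: over budget
```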
Summary#
In 2026, the fastest frontier models for production use are:
- Lowest latency: Gemini 2.5 Flash Lite, Claude Haiku 4.5
- Best speed/quality balance: Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2
- Highest quality (slower): Claude Opus 4.6, GPT-5.2
For most production applications, streaming + model right-sizing eliminates perceived latency issues. For latency-critical use cases (voice, real-time coding), choose models with sub-300ms TTFT.
Crazyrouter provides access to all these models through a single OpenAI-compatible API, with automatic regional routing to minimize latency for your users worldwide.


