
"Groq API Complete Guide: The Fastest AI Inference Platform in 2026"
What Is Groq?#
Groq is an AI inference company that built custom hardware — the Language Processing Unit (LPU) — specifically designed to run large language models at extreme speed. Unlike traditional GPU-based inference, Groq's LPU architecture delivers tokens at speeds that make other providers look like they're running on dial-up.
Important distinction: Groq (the inference company) is not the same as Grok (xAI's chatbot built by Elon Musk). Different companies, different products, confusingly similar names.
Groq doesn't train models. Instead, it takes open-source models like Llama 4, Qwen 3, and Mixtral, and runs them on its LPU hardware to deliver the fastest inference speeds available anywhere.
LPU vs GPU: Why Groq Is So Fast#
Traditional AI inference runs on GPUs (Graphics Processing Units) — hardware originally designed for rendering graphics, repurposed for matrix math. GPUs are great at parallel processing but have bottlenecks when it comes to sequential token generation.
Groq's LPU (Language Processing Unit) was designed from scratch for language model inference:
| Aspect | GPU (NVIDIA A100/H100) | Groq LPU |
|---|---|---|
| Design Purpose | General parallel compute | LLM inference |
| Memory Architecture | HBM (shared) | SRAM (on-chip, deterministic) |
| Token Generation | ~50-100 tokens/sec | 500-1,000+ tokens/sec |
| Latency | Variable | Predictable, low |
| Batching Required | Yes (for efficiency) | Minimal |
| Power Efficiency | Moderate | High |
The key advantage: Groq's LPU eliminates the memory bandwidth bottleneck that limits GPU inference speed. With on-chip SRAM instead of external HBM, data doesn't need to travel back and forth between memory and compute units. The result is deterministic, ultra-low-latency inference.
In real-world benchmarks, Groq consistently delivers 500-1,000+ tokens per second for models like Llama 4 Scout and Qwen 3 32B. That's 5-10x faster than typical GPU-based inference.
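To make that concrete, here's a rough back-of-envelope timing comparison using the throughput figures from the table above (illustrative numbers, not a benchmark):

```python
# Time to generate a 1,000-token response at the throughputs quoted above
tokens = 1_000
groq_tps = 500   # conservative LPU throughput from the table
gpu_tps = 80     # typical GPU-based provider throughput

print(f"Groq: {tokens / groq_tps:.1f}s vs GPU: {tokens / gpu_tps:.1f}s")
# Groq: 2.0s vs GPU: 12.5s
```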
Supported Models#
Groq hosts a growing selection of open-source models. Here's what's available as of 2026:
Large Language Models#
| Model | Parameters | Context | Speed (TPS) | Input $/1M | Output $/1M |
|---|---|---|---|---|---|
| GPT OSS 20B | 20B | 128K | 1,000 | $0.075 | $0.30 |
| GPT OSS 120B | 120B | 128K | 500 | $0.15 | $0.60 |
| Kimi K2-0905 | 1T MoE | 256K | 200 | $1.00 | $3.00 |
| Llama 4 Scout (17Bx16E) | 17Bx16E MoE | 128K | 594 | $0.11 | $0.34 |
| Llama 4 Maverick (17Bx128E) | 17Bx128E MoE | 128K | 562 | $0.20 | $0.60 |
| Qwen 3 32B | 32B | 131K | 662 | $0.29 | $0.59 |
| Llama 3.3 70B | 70B | 128K | 394 | $0.59 | $0.79 |
| Llama 3.1 8B | 8B | 128K | 840 | $0.05 | $0.08 |
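Since pricing is per million tokens, estimating request cost is simple arithmetic. A minimal sketch using the Llama 3.3 70B rates from the table (the token counts are made up for illustration):

```python
# Estimate the cost of one request at Llama 3.3 70B's table rates
INPUT_PRICE = 0.59    # $ per 1M input tokens
OUTPUT_PRICE = 0.79   # $ per 1M output tokens

input_tokens, output_tokens = 2_000, 500  # hypothetical request
cost = (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000
print(f"${cost:.6f} per request")  # $0.001575
```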
Speech Models#
| Model | Type | Speed | Price |
|---|---|---|---|
| Whisper Large v3 | ASR | 217x realtime | $0.111/hr |
| Whisper Large v3 Turbo | ASR | 228x realtime | $0.04/hr |
| Orpheus English | TTS | 100 chars/s | $22/M chars |
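The speech models are served through the same OpenAI-compatible API. A minimal transcription sketch, assuming the model ID `whisper-large-v3-turbo` (verify the current ID in the console) and a local file named `meeting.mp3`:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-groq-api-key",
    base_url="https://api.groq.com/openai/v1"
)

# Transcribe a local audio file with Whisper Large v3 Turbo
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=audio_file
    )

print(transcript.text)
```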
Getting Started with the Groq API#
Step 1: Get Your API Key#
- Go to console.groq.com
- Sign up with Google or GitHub
- Navigate to API Keys and create a new key
- Groq offers a generous free tier — no credit card required
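To avoid hard-coding the key, read it from an environment variable. `GROQ_API_KEY` is just a conventional name here; any variable works:

```python
import os
from openai import OpenAI

# Keep secrets out of source code: export GROQ_API_KEY=... in your shell
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1"
)
```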
Step 2: Make Your First API Call#
The Groq API is OpenAI-compatible, so you can use the standard OpenAI SDK with a different base URL.
Python#
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-groq-api-key",
    base_url="https://api.groq.com/openai/v1"
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how transformers work in 3 sentences."}
    ],
    temperature=0.7,
    max_tokens=256
)

print(response.choices[0].message.content)
```
Node.js#
```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'your-groq-api-key',
  baseURL: 'https://api.groq.com/openai/v1'
});

const response = await client.chat.completions.create({
  model: 'llama-3.3-70b-versatile',
  messages: [
    { role: 'system', content: 'You are a helpful coding assistant.' },
    { role: 'user', content: 'Write a binary search function in Python' }
  ],
  temperature: 0.5
});

console.log(response.choices[0].message.content);
```
cURL#
```bash
curl https://api.groq.com/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-groq-api-key" \
  -d '{
    "model": "llama-3.3-70b-versatile",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.3
  }'
```
Pricing Comparison: Groq vs Other Providers#
How does Groq stack up against the major AI API providers? Here's a comparison for similar-capability models:
| Provider | Model | Input $/1M | Output $/1M | Speed |
|---|---|---|---|---|
| Groq | Llama 3.3 70B | $0.59 | $0.79 | ~394 TPS |
| OpenAI | GPT-4o | $2.50 | $10.00 | ~80 TPS |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | ~120 TPS |
| Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | ~80 TPS |
| Anthropic | Claude Haiku 3.5 | $0.80 | $4.00 | ~100 TPS |
| Google | Gemini 2.0 Flash | $0.10 | $0.40 | ~150 TPS |
| Crazyrouter | All of the above | Same or lower | Same or lower | Provider speed |
Groq's pricing is competitive for open-source models, and the speed advantage is massive. But what if you need access to both Groq's speed AND proprietary models like Claude or GPT-4o?
Access Groq Models via Crazyrouter#
Crazyrouter is a unified AI API gateway that gives you access to 200+ models from every major provider — including Groq — through a single API key.
Why Use Crazyrouter for Groq?#
- One API key for Groq, OpenAI, Anthropic, Google, and more
- Automatic failover — if Groq is rate-limited, requests route to another provider
- Unified billing — no need to manage credits across multiple platforms
- OpenAI-compatible — same SDK, same code, just change the base URL
- No rate limit headaches — Crazyrouter pools capacity across multiple accounts
Code Example: Groq Models via Crazyrouter#
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

# Use Groq's Llama model
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Compare REST and GraphQL APIs"}
    ]
)
print(response.choices[0].message.content)

# Switch to Claude with the same client — no code changes
response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "user", "content": "Review this Python code for bugs"}
    ]
)
print(response.choices[0].message.content)
```
The beauty of Crazyrouter: switch between Groq, OpenAI, Anthropic, and Google models by changing one string. Your code stays the same.
Groq API Best Practices#
- Use streaming for long responses — Groq is fast, but streaming still improves perceived latency for users (see the sketch after this list)
- Pick the right model — Llama 3.1 8B for simple tasks (840 TPS), Llama 3.3 70B for complex reasoning
- Leverage prompt caching — Groq supports cached input tokens at a 50% discount for supported models
- Set reasonable max_tokens — don't request 4,096 tokens when you only need 200
- Handle rate limits gracefully — even Groq has limits; implement exponential backoff (also shown in the sketch below)
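Here's a minimal sketch combining the first and last tips: streaming a response while retrying with exponential backoff on rate-limit errors. The retry parameters are arbitrary choices, not Groq recommendations:

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    api_key="your-groq-api-key",
    base_url="https://api.groq.com/openai/v1"
)

def chat_with_backoff(messages, retries=5):
    """Stream a completion, backing off exponentially on rate limits."""
    for attempt in range(retries):
        try:
            stream = client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=messages,
                stream=True,      # print tokens as they arrive
                max_tokens=512
            )
            for chunk in stream:
                if chunk.choices and chunk.choices[0].delta.content:
                    print(chunk.choices[0].delta.content, end="", flush=True)
            print()
            return
        except RateLimitError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, 8s, 16s
    raise RuntimeError("Still rate-limited after all retries")

chat_with_backoff([{"role": "user", "content": "Explain LPUs in two sentences."}])
```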
FAQ#
Is the Groq API free?#
Groq offers a free tier with rate limits for developers to experiment. For production use, you'll need a paid plan. Pricing starts as low as $0.05 per million input tokens for Llama 3.1 8B.
How fast is Groq compared to OpenAI?#
Groq delivers 400-1,000+ tokens per second depending on the model, compared to 50-120 tokens per second for OpenAI's GPT models. That's roughly 5-10x faster for token generation.
What's the difference between Groq and Grok?#
Groq is an AI inference company that makes LPU hardware for running open-source models at high speed. Grok is xAI's chatbot (Elon Musk's company). Different companies, different products — the similar names cause frequent confusion.
Can I use the OpenAI SDK with Groq?#
Yes. Groq's API is OpenAI-compatible. Just change the base_url to https://api.groq.com/openai/v1 and use your Groq API key. The same applies when accessing Groq models through Crazyrouter.
What models does Groq support?#
Groq supports Llama 4 Scout, Llama 4 Maverick, Llama 3.3 70B, Llama 3.1 8B, Qwen 3 32B, GPT OSS 20B/120B, Kimi K2, Whisper (speech-to-text), and Orpheus (text-to-speech). The model list is regularly updated.
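Rather than relying on a static list, you can query the live catalog; Groq exposes the standard OpenAI-compatible `/models` endpoint:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-groq-api-key",
    base_url="https://api.groq.com/openai/v1"
)

# Print the ID of every model currently served
for model in client.models.list():
    print(model.id)
```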
Is Groq good for production use?#
Yes, with caveats. Groq's speed is unmatched, but availability can vary during high-demand periods. For production reliability, consider using Crazyrouter to add automatic failover to other providers when Groq is at capacity.
How does Groq's LPU compare to NVIDIA GPUs?#
Groq's LPU is purpose-built for LLM inference, using on-chip SRAM instead of external HBM memory. This eliminates the memory bandwidth bottleneck, delivering deterministic low-latency performance at 5-10x the speed of GPU-based inference for supported models.
Summary#
Groq is the speed king of AI inference. If your application needs the fastest possible token generation — real-time chatbots, coding assistants, interactive agents — Groq's LPU-powered API is hard to beat.
The trade-off: Groq only runs open-source models. For proprietary models like Claude or GPT-4o, you need other providers. That's where Crazyrouter comes in — one API key gives you Groq's speed for open-source models AND access to every major proprietary model, all through the same OpenAI-compatible endpoint.
👉 Start building at crazyrouter.com — one key, every model, maximum speed.


