
OpenAI Realtime API Complete Guide: Build Voice AI Apps in 2026#
The OpenAI Realtime API represents a paradigm shift in how developers build voice-enabled AI applications. Instead of the traditional request-response pattern, the Realtime API uses WebSocket connections for continuous, low-latency audio streaming — enabling natural voice conversations with GPT models.
In this comprehensive guide, we'll cover everything you need to know about the OpenAI Realtime API, from basic concepts to production-ready code examples.
What is the OpenAI Realtime API?#
The OpenAI Realtime API is a WebSocket-based interface that enables real-time, bidirectional audio communication with GPT models. Unlike the standard Chat Completions API where you send text and receive text, the Realtime API handles:
- Audio input: Stream microphone audio directly to the model
- Audio output: Receive synthesized speech responses in real-time
- Interruption handling: Users can interrupt the AI mid-sentence, just like a real conversation
- Function calling: Trigger tools and actions during voice conversations
- Multi-modal input: Combine text and audio in the same session
This makes it ideal for building voice assistants, customer service bots, language tutoring apps, and any application requiring natural conversational AI.
Key Features#
| Feature | Description |
|---|---|
| WebSocket Protocol | Persistent connection for low-latency streaming |
| Voice Activity Detection (VAD) | Automatic detection of when users start/stop speaking |
| Multiple Voices | Choose from several natural-sounding voices (alloy, echo, fable, onyx, nova, shimmer) |
| Function Calling | Execute tools during real-time conversations |
| Session Management | Configure conversation parameters per session |
| Audio Formats | Support for PCM16, G.711 µ-law, and G.711 A-law |
How to Use the OpenAI Realtime API#
Step 1: Create a Session#
First, create a session to get an ephemeral token:
```bash
curl -X POST "https://api.openai.com/v1/realtime/sessions" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-realtime-preview",
    "voice": "alloy",
    "instructions": "You are a helpful assistant.",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16"
  }'
```
Using Crazyrouter, you can access the Realtime API through a single endpoint with significant cost savings:
```bash
curl -X POST "https://crazyrouter.com/v1/realtime/sessions" \
  -H "Authorization: Bearer YOUR_CRAZYROUTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-realtime-preview",
    "voice": "alloy"
  }'
```
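If you prefer to create the session from Python, a minimal sketch looks like the following. It only builds the request payload and pulls the short-lived client token out of the response; the `client_secret.value` field and the stubbed `sample` response are assumptions based on OpenAI's documented session response shape, and a real call would use `requests` or `httpx` against the endpoint above.

```python
def build_session_payload(model: str, voice: str, instructions: str) -> dict:
    """Build the JSON body for POST /v1/realtime/sessions."""
    return {
        "model": model,
        "voice": voice,
        "instructions": instructions,
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
    }

def extract_ephemeral_token(session_response: dict) -> str:
    """Pull the short-lived client token out of the session response
    (assumes a client_secret.value field as in OpenAI's docs)."""
    return session_response["client_secret"]["value"]

# Stubbed response for illustration (a real call would hit the API):
sample = {"id": "sess_123", "client_secret": {"value": "ek_demo"}}
payload = build_session_payload("gpt-4o-realtime-preview", "alloy",
                                "You are a helpful assistant.")
print(payload["model"])                 # → gpt-4o-realtime-preview
print(extract_ephemeral_token(sample))  # → ek_demo (stub value)
```

The ephemeral token is what you would hand to a browser client, so your long-lived API key never leaves the server.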
Step 2: Connect via WebSocket (Python)#
```python
import asyncio
import websockets
import json
import base64

async def realtime_conversation():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": "Bearer YOUR_API_KEY",
        "OpenAI-Beta": "realtime=v1"
    }

    # Note: websockets >= 14 renamed extra_headers to additional_headers
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a helpful voice assistant.",
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                }
            }
        }))

        # Send audio data (base64-encoded PCM16)
        audio_data = get_microphone_audio()  # Your audio capture function
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(audio_data).decode()
        }))

        # Commit the audio buffer (with server VAD enabled, the server also
        # commits automatically when it detects the end of speech)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.commit"
        }))

        # Listen for responses
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                audio_chunk = base64.b64decode(event["delta"])
                play_audio(audio_chunk)  # Your audio playback function
            elif event["type"] == "response.done":
                print("Response complete")
                break

asyncio.run(realtime_conversation())
```
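The `get_microphone_audio` helper above is left to your audio stack, but the encoding it must produce is fixed: `input_audio_buffer.append` expects base64 of raw 16-bit little-endian PCM (24 kHz, mono, per the `pcm16` format configured in the session). A self-contained sketch of that conversion, with the inverse for decoding `response.audio.delta` payloads:

```python
import base64
import struct

def floats_to_pcm16_b64(samples):
    """Encode float samples in [-1.0, 1.0] as base64 little-endian int16 PCM."""
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    raw = struct.pack(f"<{len(ints)}h", *ints)
    return base64.b64encode(raw).decode("ascii")

def pcm16_b64_to_floats(b64_audio):
    """Decode a base64 PCM16 payload (e.g. a response.audio.delta) to floats."""
    raw = base64.b64decode(b64_audio)
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [i / 32767 for i in ints]

encoded = floats_to_pcm16_b64([0.0, 0.5, -0.5])
decoded = pcm16_b64_to_floats(encoded)
print([round(x, 3) for x in decoded])  # → [0.0, 0.5, -0.5]
```

If your capture library already yields int16 bytes (most do), you can skip the float conversion and base64-encode the raw buffer directly, as the WebSocket example does.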
Step 3: Node.js Implementation#
```javascript
import WebSocket from 'ws';

const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
  {
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'OpenAI-Beta': 'realtime=v1'
    }
  }
);

ws.on('open', () => {
  // Configure session
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      modalities: ['text', 'audio'],
      instructions: 'You are a helpful voice assistant.',
      voice: 'alloy',
      input_audio_format: 'pcm16',
      output_audio_format: 'pcm16',
      tools: [{
        type: 'function',
        name: 'get_weather',
        description: 'Get current weather for a location',
        parameters: {
          type: 'object',
          properties: {
            location: { type: 'string', description: 'City name' }
          },
          required: ['location']
        }
      }]
    }
  }));
});

ws.on('message', (data) => {
  const event = JSON.parse(data);
  switch (event.type) {
    case 'response.audio.delta': {
      const audioBuffer = Buffer.from(event.delta, 'base64');
      playAudio(audioBuffer);
      break;
    }
    case 'response.function_call_arguments.done': {
      // Handle function calls during conversation
      const result = handleFunctionCall(event.name, JSON.parse(event.arguments));
      ws.send(JSON.stringify({
        type: 'conversation.item.create',
        item: {
          type: 'function_call_output',
          call_id: event.call_id,
          output: JSON.stringify(result)
        }
      }));
      // Ask the model to continue using the function result
      ws.send(JSON.stringify({ type: 'response.create' }));
      break;
    }
    case 'response.done':
      console.log('Response complete');
      break;
  }
});
```
Realtime API vs Chat Completions API#
| Feature | Realtime API | Chat Completions API |
|---|---|---|
| Protocol | WebSocket (persistent) | HTTP (request/response) |
| Audio Support | Native audio I/O | Text only (need separate TTS/STT) |
| Latency | ~300ms response time | ~1-3s for streaming |
| Interruptions | Built-in support | Not applicable |
| Function Calling | ✅ During conversation | ✅ Per request |
| Cost | Higher per minute | Lower per token |
| Best For | Voice apps, real-time | Chatbots, text processing |
| Streaming | Continuous | Chunked SSE |
Pricing Breakdown#
Understanding the cost structure is critical for production deployments:
| Component | OpenAI Official | Crazyrouter | Savings |
|---|---|---|---|
| Audio Input | $0.06/min | $0.04/min | 33% |
| Audio Output | $0.24/min | $0.17/min | 29% |
| Text Input (cache miss) | $5.00/1M tokens | $3.50/1M tokens | 30% |
| Text Output | $20.00/1M tokens | $14.00/1M tokens | 30% |
Example cost for a 5-minute voice conversation:
- OpenAI Direct: ~$1.50
- Via Crazyrouter: ~$1.05 (saving ~$0.45 per conversation)
At scale (10,000 conversations/month), that's $4,500 in monthly savings by routing through Crazyrouter's unified API gateway.
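The arithmetic above can be checked with a few lines of Python. The sketch assumes the ~$1.50 figure corresponds to roughly 5 minutes of audio in and 5 minutes of audio out (the usage mix that makes the table's per-minute rates produce it):

```python
def conversation_cost(input_min, output_min, in_rate, out_rate):
    """Total audio cost for one conversation at the given per-minute rates."""
    return input_min * in_rate + output_min * out_rate

openai_cost = conversation_cost(5, 5, 0.06, 0.24)  # official per-minute rates
router_cost = conversation_cost(5, 5, 0.04, 0.17)  # gateway per-minute rates
savings = openai_cost - router_cost

print(f"${openai_cost:.2f} vs ${router_cost:.2f}, saving ${savings:.2f} each")
print(f"Monthly at 10,000 conversations: ${savings * 10_000:,.0f}")
```

Text-token charges come on top of the audio rates, so real conversations with long instructions or heavy function calling will cost somewhat more than this audio-only estimate.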
Use Cases#
- Voice Assistants: Build Siri/Alexa-like experiences with GPT intelligence
- Customer Support: Real-time voice bots that handle complex queries
- Language Tutoring: Conversational practice with natural pronunciation feedback
- Accessibility Tools: Voice-driven interfaces for visually impaired users
- Telehealth: AI-assisted medical intake and triage conversations
- Gaming: NPC characters that respond naturally in voice
FAQ#
How much does the OpenAI Realtime API cost?#
The OpenAI Realtime API charges based on audio duration and text tokens. Audio input costs $0.06/minute and audio output costs $0.24/minute at official prices. Using an API gateway like Crazyrouter can reduce these costs by 25-33%.
What audio formats does the Realtime API support?#
The Realtime API supports three audio formats: PCM16 (16-bit PCM at 24kHz), G.711 µ-law, and G.711 A-law. PCM16 offers the best quality, while G.711 formats are more bandwidth-efficient for telephony applications.
Can I use function calling with the Realtime API?#
Yes! The Realtime API fully supports function calling during voice conversations. You can define tools when configuring the session, and the model will invoke them contextually during the conversation, allowing actions like booking appointments, checking databases, or controlling smart devices.
What's the latency of the Realtime API?#
Typical response latency is 300-500ms from the end of user speech to the beginning of AI audio output. This is significantly faster than chaining separate STT → LLM → TTS services, which typically adds 2-4 seconds of latency.
How do I handle interruptions in the Realtime API?#
The Realtime API has built-in Voice Activity Detection (VAD) that automatically detects when a user starts speaking. When an interruption is detected, the model stops its current output and processes the new input. You can configure VAD sensitivity via the turn_detection session parameter.
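The client-side half of interruption handling can be reduced to a small state machine. The sketch below assumes the server event `input_audio_buffer.speech_started` (user began talking) and the client event `response.cancel` from OpenAI's Realtime event model, alongside the `response.audio.delta` / `response.done` events used in the examples above; the `send:` / `local:` action strings are just illustrative labels:

```python
def interruption_actions(event_type, assistant_speaking):
    """Decide what the client should do for an incoming server event.

    Returns (actions, assistant_speaking): a list of client events to send
    or local steps to take, plus the updated speaking flag.
    """
    if event_type == "input_audio_buffer.speech_started" and assistant_speaking:
        # User barged in: cancel the in-flight response and flush any
        # already-buffered assistant audio so playback stops immediately
        return (["send:response.cancel", "local:flush_playback"], False)
    if event_type == "response.audio.delta":
        return ([], True)   # assistant audio is streaming
    if event_type == "response.done":
        return ([], False)  # assistant finished its turn
    return ([], assistant_speaking)

actions, speaking = interruption_actions("response.audio.delta", False)
print(actions, speaking)  # → [] True
actions, speaking = interruption_actions("input_audio_buffer.speech_started", True)
print(actions, speaking)
```

The key practical detail is the local flush: cancelling the response stops the server, but audio already queued in your playback buffer will keep playing unless you clear it yourself.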
Is the Realtime API available through API gateways?#
Yes. Services like Crazyrouter provide access to the OpenAI Realtime API through their unified gateway, offering cost savings, automatic failover, and simplified key management across 300+ AI models.
Summary#
The OpenAI Realtime API opens up powerful possibilities for voice-enabled AI applications. Its WebSocket-based architecture delivers the low latency needed for natural conversations, while features like built-in VAD, function calling, and multi-modal support make it production-ready.
For developers looking to build with the Realtime API at scale, using an API gateway like Crazyrouter provides significant cost savings (25-33%), unified billing across all AI providers, and the reliability of automatic failover — all accessible with a single API key.
Ready to build voice AI? Get your Crazyrouter API key and start streaming in minutes.


