
OpenAI Realtime API Complete Guide: Build Voice AI Apps in 2026#
The OpenAI Realtime API represents a paradigm shift in how developers build voice-enabled AI applications. Instead of the traditional request-response pattern, the Realtime API uses WebSocket connections for continuous, low-latency audio streaming — enabling natural voice conversations with GPT models.
In this comprehensive guide, we'll cover everything you need to know about the OpenAI Realtime API, from basic concepts to production-ready code examples.
What is the OpenAI Realtime API?#
The OpenAI Realtime API is a WebSocket-based interface that enables real-time, bidirectional audio communication with GPT models. Unlike the standard Chat Completions API where you send text and receive text, the Realtime API handles:
- Audio input: Stream microphone audio directly to the model
- Audio output: Receive synthesized speech responses in real-time
- Interruption handling: Users can interrupt the AI mid-sentence, just like a real conversation
- Function calling: Trigger tools and actions during voice conversations
- Multi-modal input: Combine text and audio in the same session
This makes it ideal for building voice assistants, customer service bots, language tutoring apps, and any application requiring natural conversational AI.
Key Features#
| Feature | Description |
|---|---|
| WebSocket Protocol | Persistent connection for low-latency streaming |
| Voice Activity Detection (VAD) | Automatic detection of when users start/stop speaking |
| Multiple Voices | Choose from several natural-sounding voices (alloy, echo, fable, onyx, nova, shimmer) |
| Function Calling | Execute tools during real-time conversations |
| Session Management | Configure conversation parameters per session |
| Audio Formats | Support for PCM16, G.711 µ-law, and G.711 A-law |
How to Use the OpenAI Realtime API#
Step 1: Create a Session#
First, create a session to get an ephemeral token:
```bash
curl -X POST "https://api.openai.com/v1/realtime/sessions" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-realtime-preview",
    "voice": "alloy",
    "instructions": "You are a helpful assistant.",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16"
  }'
```
Using Crazyrouter, you can access the Realtime API through a single endpoint with significant cost savings:
```bash
curl -X POST "https://crazyrouter.com/v1/realtime/sessions" \
  -H "Authorization: Bearer YOUR_CRAZYROUTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-realtime-preview",
    "voice": "alloy"
  }'
```
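If you prefer to create the session from Python, a minimal sketch looks like the following. It only builds the request payload and pulls the short-lived client token out of the response; the `client_secret.value` field and the stubbed `sample` response are assumptions based on OpenAI's documented session response shape, and a real call would use `requests` or `httpx` against the endpoint above.

```python
def build_session_payload(model: str, voice: str, instructions: str) -> dict:
    """Build the JSON body for POST /v1/realtime/sessions."""
    return {
        "model": model,
        "voice": voice,
        "instructions": instructions,
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
    }

def extract_ephemeral_token(session_response: dict) -> str:
    """Pull the short-lived client token out of the session response
    (assumes a client_secret.value field as in OpenAI's docs)."""
    return session_response["client_secret"]["value"]

# Stubbed response for illustration (a real call would hit the API):
sample = {"id": "sess_123", "client_secret": {"value": "ek_demo"}}
payload = build_session_payload("gpt-4o-realtime-preview", "alloy",
                                "You are a helpful assistant.")
print(payload["model"])                 # → gpt-4o-realtime-preview
print(extract_ephemeral_token(sample))  # → ek_demo (stub value)
```

The ephemeral token is what you would hand to a browser client, so your long-lived API key never leaves the server.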
Step 2: Connect via WebSocket (Python)#
```python
import asyncio
import websockets
import json
import base64

async def realtime_conversation():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": "Bearer YOUR_API_KEY",
        "OpenAI-Beta": "realtime=v1"
    }

    # Note: websockets >= 14 renamed extra_headers to additional_headers
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a helpful voice assistant.",
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                }
            }
        }))

        # Send audio data (base64-encoded PCM16)
        audio_data = get_microphone_audio()  # Your audio capture function
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(audio_data).decode()
        }))

        # Commit the audio buffer (with server VAD enabled, the server also
        # commits automatically when it detects the end of speech)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.commit"
        }))

        # Listen for responses
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                audio_chunk = base64.b64decode(event["delta"])
                play_audio(audio_chunk)  # Your audio playback function
            elif event["type"] == "response.done":
                print("Response complete")
                break

asyncio.run(realtime_conversation())
```
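The `get_microphone_audio` helper above is left to your audio stack, but the encoding it must produce is fixed: `input_audio_buffer.append` expects base64 of raw 16-bit little-endian PCM (24 kHz, mono, per the `pcm16` format configured in the session). A self-contained sketch of that conversion, with the inverse for decoding `response.audio.delta` payloads:

```python
import base64
import struct

def floats_to_pcm16_b64(samples):
    """Encode float samples in [-1.0, 1.0] as base64 little-endian int16 PCM."""
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    raw = struct.pack(f"<{len(ints)}h", *ints)
    return base64.b64encode(raw).decode("ascii")

def pcm16_b64_to_floats(b64_audio):
    """Decode a base64 PCM16 payload (e.g. a response.audio.delta) to floats."""
    raw = base64.b64decode(b64_audio)
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [i / 32767 for i in ints]

encoded = floats_to_pcm16_b64([0.0, 0.5, -0.5])
decoded = pcm16_b64_to_floats(encoded)
print([round(x, 3) for x in decoded])  # → [0.0, 0.5, -0.5]
```

If your capture library already yields int16 bytes (most do), you can skip the float conversion and base64-encode the raw buffer directly, as the WebSocket example does.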
Step 3: Node.js Implementation#
```javascript
import WebSocket from 'ws';

const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
  {
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'OpenAI-Beta': 'realtime=v1'
    }
  }
);

ws.on('open', () => {
  // Configure session
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      modalities: ['text', 'audio'],
      instructions: 'You are a helpful voice assistant.',
      voice: 'alloy',
      input_audio_format: 'pcm16',
      output_audio_format: 'pcm16',
      tools: [{
        type: 'function',
        name: 'get_weather',
        description: 'Get current weather for a location',
        parameters: {
          type: 'object',
          properties: {
            location: { type: 'string', description: 'City name' }
          },
          required: ['location']
        }
      }]
    }
  }));
});

ws.on('message', (data) => {
  const event = JSON.parse(data);
  switch (event.type) {
    case 'response.audio.delta': {
      const audioBuffer = Buffer.from(event.delta, 'base64');
      playAudio(audioBuffer);
      break;
    }
    case 'response.function_call_arguments.done': {
      // Handle function calls during conversation
      const result = handleFunctionCall(event.name, JSON.parse(event.arguments));
      ws.send(JSON.stringify({
        type: 'conversation.item.create',
        item: {
          type: 'function_call_output',
          call_id: event.call_id,
          output: JSON.stringify(result)
        }
      }));
      // Ask the model to continue using the function result
      ws.send(JSON.stringify({ type: 'response.create' }));
      break;
    }
    case 'response.done':
      console.log('Response complete');
      break;
  }
});
```
Realtime API vs Chat Completions API#
| Feature | Realtime API | Chat Completions API |
|---|---|---|
| Protocol | WebSocket (persistent) | HTTP (request/response) |
| Audio Support | Native audio I/O | Text only (need separate TTS/STT) |
| Latency | ~300ms response time | ~1-3s for streaming |
| Interruptions | Built-in support | Not applicable |
| Function Calling | ✅ During conversation | ✅ Per request |
| Cost | Higher per minute | Lower per token |
| Best For | Voice apps, real-time | Chatbots, text processing |
| Streaming | Continuous | Chunked SSE |
Pricing Breakdown#
Understanding the cost structure is critical for production deployments:
| Component | OpenAI Official | Crazyrouter | Savings |
|---|---|---|---|
| Audio Input | $0.06/min | $0.04/min | 33% |
| Audio Output | $0.24/min | $0.17/min | 29% |
| Text Input (cache miss) | $5.00/1M tokens | $3.50/1M tokens | 30% |
| Text Output | $20.00/1M tokens | $14.00/1M tokens | 30% |
Example cost for a 5-minute voice conversation:
- OpenAI Direct: ~$1.50
- Via Crazyrouter: ~$1.05 (saving ~$0.45 per conversation)
At scale (10,000 conversations/month), that's $4,500 in monthly savings by routing through Crazyrouter's unified API gateway.
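The arithmetic above can be checked with a few lines of Python. The sketch assumes the ~$1.50 figure corresponds to roughly 5 minutes of audio in and 5 minutes of audio out (the usage mix that makes the table's per-minute rates produce it):

```python
def conversation_cost(input_min, output_min, in_rate, out_rate):
    """Total audio cost for one conversation at the given per-minute rates."""
    return input_min * in_rate + output_min * out_rate

openai_cost = conversation_cost(5, 5, 0.06, 0.24)  # official per-minute rates
router_cost = conversation_cost(5, 5, 0.04, 0.17)  # gateway per-minute rates
savings = openai_cost - router_cost

print(f"${openai_cost:.2f} vs ${router_cost:.2f}, saving ${savings:.2f} each")
print(f"Monthly at 10,000 conversations: ${savings * 10_000:,.0f}")
```

Text-token charges come on top of the audio rates, so real conversations with long instructions or heavy function calling will cost somewhat more than this audio-only estimate.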
Use Cases#
- Voice Assistants: Build Siri/Alexa-like experiences with GPT intelligence
- Customer Support: Real-time voice bots that handle complex queries
- Language Tutoring: Conversational practice with natural pronunciation feedback
- Accessibility Tools: Voice-driven interfaces for visually impaired users
- Telehealth: AI-assisted medical intake and triage conversations
- Gaming: NPC characters that respond naturally in voice
FAQ#
How much does the OpenAI Realtime API cost?#
The OpenAI Realtime API charges based on audio duration and text tokens. Audio input costs $0.06/minute and audio output costs $0.24/minute at official prices. Using an API gateway like Crazyrouter can reduce these costs by 25-33%.
What audio formats does the Realtime API support?#
The Realtime API supports three audio formats: PCM16 (16-bit PCM at 24kHz), G.711 µ-law, and G.711 A-law. PCM16 offers the best quality, while G.711 formats are more bandwidth-efficient for telephony applications.
Can I use function calling with the Realtime API?#
Yes! The Realtime API fully supports function calling during voice conversations. You can define tools when configuring the session, and the model will invoke them contextually during the conversation, allowing actions like booking appointments, checking databases, or controlling smart devices.
What's the latency of the Realtime API?#
Typical response latency is 300-500ms from the end of user speech to the beginning of AI audio output. This is significantly faster than chaining separate STT → LLM → TTS services, which typically adds 2-4 seconds of latency.
How do I handle interruptions in the Realtime API?#
The Realtime API has built-in Voice Activity Detection (VAD) that automatically detects when a user starts speaking. When an interruption is detected, the model stops its current output and processes the new input. You can configure VAD sensitivity via the turn_detection session parameter.
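The client-side half of interruption handling can be reduced to a small state machine. The sketch below assumes the server event `input_audio_buffer.speech_started` (user began talking) and the client event `response.cancel` from OpenAI's Realtime event model, alongside the `response.audio.delta` / `response.done` events used in the examples above; the `send:` / `local:` action strings are just illustrative labels:

```python
def interruption_actions(event_type, assistant_speaking):
    """Decide what the client should do for an incoming server event.

    Returns (actions, assistant_speaking): a list of client events to send
    or local steps to take, plus the updated speaking flag.
    """
    if event_type == "input_audio_buffer.speech_started" and assistant_speaking:
        # User barged in: cancel the in-flight response and flush any
        # already-buffered assistant audio so playback stops immediately
        return (["send:response.cancel", "local:flush_playback"], False)
    if event_type == "response.audio.delta":
        return ([], True)   # assistant audio is streaming
    if event_type == "response.done":
        return ([], False)  # assistant finished its turn
    return ([], assistant_speaking)

actions, speaking = interruption_actions("response.audio.delta", False)
print(actions, speaking)  # → [] True
actions, speaking = interruption_actions("input_audio_buffer.speech_started", True)
print(actions, speaking)
```

The key practical detail is the local flush: cancelling the response stops the server, but audio already queued in your playback buffer will keep playing unless you clear it yourself.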
Is the Realtime API available through API gateways?#
Yes. Services like Crazyrouter provide access to the OpenAI Realtime API through their unified gateway, offering cost savings, automatic failover, and simplified key management across 300+ AI models.
Summary#
The OpenAI Realtime API opens up powerful possibilities for voice-enabled AI applications. Its WebSocket-based architecture delivers the low latency needed for natural conversations, while features like built-in VAD, function calling, and multi-modal support make it production-ready.
For developers looking to build with the Realtime API at scale, using an API gateway like Crazyrouter provides significant cost savings (25-33%), unified billing across all AI providers, and the reliability of automatic failover — all accessible with a single API key.
Ready to build voice AI? Get your Crazyrouter API key and start streaming in minutes.


