AI Voice Agent Guide 2026: Build Speech-to-Speech AI with Real-Time APIs
Crazyrouter Team
March 2, 2026

AI Voice Agent Guide 2026: Build Speech-to-Speech AI with Real-Time APIs#

AI voice agents are rapidly becoming the interface of choice for customer service, healthcare, sales, and personal assistants. Unlike text chatbots, voice agents create natural, human-like conversations that feel intuitive and accessible.

This guide covers everything you need to build a production-ready AI voice agent in 2026.

What is an AI Voice Agent?#

An AI voice agent is a system that can:

  1. Listen — Convert speech to text (STT) in real-time
  2. Think — Process the input with a language model (LLM)
  3. Speak — Convert the response back to speech (TTS)
  4. React — Handle interruptions, pauses, and turn-taking naturally
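The four steps above form a loop that runs once per conversational turn. A minimal sketch, with hypothetical `stt`, `llm`, and `tts` callables standing in for real providers:

```python
def run_turn(audio_chunk, stt, llm, tts):
    """One conversational turn: listen -> think -> speak.

    stt/llm/tts are placeholder callables standing in for
    whichever providers you wire in (Deepgram, Claude, ElevenLabs...).
    """
    text = stt(audio_chunk)   # 1. Listen: speech -> text
    if not text.strip():
        return b""            # nothing heard; stay silent
    reply = llm(text)         # 2. Think: generate a response
    return tts(reply)         # 3. Speak: text -> audio bytes
```

Turn-taking and interruption handling (step 4) live outside this loop, in whatever event stream drives it.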

Architecture Patterns#

Pattern A: Modular Pipeline (Traditional)

code
Microphone → STT → LLM → TTS → Speaker
              ↓      ↓      ↓
           Deepgram  GPT  ElevenLabs

Pattern B: End-to-End (Modern)

code
Microphone → OpenAI Realtime API → Speaker
              (STT + LLM + TTS combined)

Pattern C: Hybrid

code
Microphone → Deepgram STT → Claude → ElevenLabs TTS → Speaker
              (Best STT)    (Best reasoning)  (Best voices)

Voice AI Provider Comparison#

| Provider | STT | LLM | TTS | E2E | Latency | Voice Quality |
|---|---|---|---|---|---|---|
| OpenAI Realtime | ✓ | ✓ | ✓ | ✓ | ~300ms | ⭐⭐⭐⭐ |
| ElevenLabs | | | ✓ | | ~200ms | ⭐⭐⭐⭐⭐ |
| Deepgram | ✓ | | | | ~100ms | ⭐⭐⭐⭐ |
| PlayHT | | | ✓ | | ~150ms | ⭐⭐⭐⭐⭐ |
| AssemblyAI | ✓ | | | | ~150ms | N/A |
| Google STT/TTS | ✓ | | ✓ | | ~200ms | ⭐⭐⭐⭐ |

Building a Voice Agent: Step by Step#

Option 1: OpenAI Realtime (Simplest)#

The fastest way to build a voice agent — everything in one API:

python
import asyncio
import websockets
import json
import base64
import pyaudio

CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 24000

async def voice_agent():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    # Use Crazyrouter for cost savings
    # url = "wss://crazyrouter.com/v1/realtime?model=gpt-4o-realtime-preview"
    
    headers = {
        "Authorization": "Bearer YOUR_API_KEY",
        "OpenAI-Beta": "realtime=v1"
    }
    
    audio = pyaudio.PyAudio()
    
    # Input stream (microphone)
    input_stream = audio.open(
        format=FORMAT, channels=CHANNELS,
        rate=RATE, input=True, frames_per_buffer=CHUNK
    )
    
    # Output stream (speaker)
    output_stream = audio.open(
        format=FORMAT, channels=CHANNELS,
        rate=RATE, output=True, frames_per_buffer=CHUNK
    )
    
    # Note: websockets >= 14 renamed `extra_headers` to `additional_headers`
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure the agent
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": """You are a helpful customer service agent for a tech company. 
                Be concise, friendly, and professional. 
                If you don't know something, say so honestly.""",
                "voice": "alloy",  # Realtime API voices: alloy, echo, shimmer, etc.
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "silence_duration_ms": 800
                }
            }
        }))
        
        # Send audio from microphone
        async def send_audio():
            while True:
                data = input_stream.read(CHUNK, exception_on_overflow=False)
                encoded = base64.b64encode(data).decode()
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": encoded
                }))
                await asyncio.sleep(0.01)
        
        # Receive and play audio
        async def receive_audio():
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "response.audio.delta":
                    audio_bytes = base64.b64decode(event["delta"])
                    output_stream.write(audio_bytes)
                elif event["type"] == "response.audio_transcript.delta":
                    print(f"Agent: {event['delta']}", end="", flush=True)
                elif event["type"] == "input_audio_buffer.speech_started":
                    print("\n[User speaking...]")
        
        await asyncio.gather(send_audio(), receive_audio())

asyncio.run(voice_agent())

Option 2: Modular Pipeline (Best Quality)#

Mix the best providers for each component:

python
from openai import OpenAI
import deepgram
import elevenlabs
import asyncio

# Use Crazyrouter for the LLM component
llm_client = OpenAI(
    api_key="YOUR_CRAZYROUTER_KEY",
    base_url="https://crazyrouter.com/v1"
)

# Deepgram for STT (fastest, most accurate)
dg_client = deepgram.DeepgramClient("YOUR_DEEPGRAM_KEY")

# ElevenLabs for TTS (most natural voices)
el_client = elevenlabs.ElevenLabs(api_key="YOUR_ELEVENLABS_KEY")

class VoiceAgent:
    def __init__(self):
        self.conversation_history = []
        self.system_prompt = """You are a friendly AI assistant. 
        Keep responses under 2 sentences for natural conversation flow.
        Be warm, helpful, and concise."""
    
    def listen(self, audio_bytes: bytes) -> str:
        """Convert speech to text using Deepgram.

        Simplified sketch using batch transcription; a production agent
        should use Deepgram's live websocket interface for streaming.
        """
        options = deepgram.PrerecordedOptions(
            model="nova-2", language="en", smart_format=True
        )
        response = dg_client.listen.rest.v("1").transcribe_file(
            {"buffer": audio_bytes}, options
        )
        return response.results.channels[0].alternatives[0].transcript
    
    def think(self, user_text: str) -> str:
        """Generate response using LLM via Crazyrouter."""
        self.conversation_history.append({"role": "user", "content": user_text})
        
        response = llm_client.chat.completions.create(
            model="claude-sonnet-4-20250514",  # Best for conversational AI
            messages=[
                {"role": "system", "content": self.system_prompt},
                *self.conversation_history[-10:]  # Last 10 turns for context
            ],
            max_tokens=150  # Keep responses short for voice
        )
        
        assistant_text = response.choices[0].message.content
        self.conversation_history.append({"role": "assistant", "content": assistant_text})
        return assistant_text
    
    def speak(self, text: str) -> bytes:
        """Convert text to speech using ElevenLabs.

        Note: newer elevenlabs SDKs expose this as
        `el_client.text_to_speech.convert(...)` instead of `generate`.
        """
        audio = el_client.generate(
            text=text,
            voice="Rachel",
            model="eleven_turbo_v2_5",
            stream=True
        )
        return b"".join(audio)

agent = VoiceAgent()
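The `think` method above keeps only the last 10 turns of history to control cost and latency. That context-window trimming can be sketched on its own, with a plain list in place of a real LLM call (`build_messages` is a hypothetical helper, not part of any SDK):

```python
def build_messages(system_prompt, history, max_turns=10):
    """Assemble the message list sent to the LLM: the system prompt
    plus only the most recent turns, so the context stays small and cheap."""
    return [{"role": "system", "content": system_prompt}, *history[-max_turns:]]

history = [{"role": "user", "content": f"msg {i}"} for i in range(15)]
msgs = build_messages("Be brief.", history)
# msgs = 1 system message + the 10 most recent turns (msg 5 .. msg 14)
```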

Option 3: Phone/Telephony Integration#

python
# Using Twilio + Voice Agent for phone calls
from flask import Flask, Response, request
from twilio.twiml.voice_response import VoiceResponse, Gather

app = Flask(__name__)

@app.route("/incoming-call", methods=["POST"])
def incoming_call():
    response = VoiceResponse()
    gather = Gather(
        input="speech",
        action="/process-speech",
        language="en-US",
        speech_timeout="auto"
    )
    gather.say("Hello! I'm your AI assistant. How can I help you today?")
    response.append(gather)
    return Response(str(response), mimetype="text/xml")

@app.route("/process-speech", methods=["POST"])
def process_speech():
    user_speech = request.form.get("SpeechResult", "")
    
    # Use Crazyrouter LLM to generate response (llm_client from Option 2 above)
    llm_response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a phone customer service agent. Be brief and helpful."},
            {"role": "user", "content": user_speech}
        ],
        max_tokens=100
    )
    
    agent_text = llm_response.choices[0].message.content
    
    response = VoiceResponse()
    gather = Gather(
        input="speech",
        action="/process-speech",
        speech_timeout="auto"
    )
    gather.say(agent_text)
    response.append(gather)
    return Response(str(response), mimetype="text/xml")

if __name__ == "__main__":
    app.run(port=5000)

Pricing Comparison#

End-to-End Voice (OpenAI Realtime)#

| Component | OpenAI Direct | Crazyrouter | Savings |
|---|---|---|---|
| Audio Input | $0.06/min | $0.042/min | 30% |
| Audio Output | $0.24/min | $0.168/min | 30% |
| 5-min conversation | $1.50 | $1.05 | $0.45 |

Modular Pipeline (per 5-min conversation)#

| Component | Provider | Cost |
|---|---|---|
| STT | Deepgram Nova-2 | $0.04 |
| LLM | Claude Sonnet via Crazyrouter | $0.03 |
| TTS | ElevenLabs Turbo | $0.15 |
| **Total** | | **$0.22** |

The modular pipeline is ~5x cheaper than end-to-end, at the cost of slightly higher latency (~500ms vs ~300ms).

Cost at Scale#

| Volume | OpenAI Realtime (Crazyrouter) | Modular Pipeline | Savings |
|---|---|---|---|
| 1K conversations/month | $1,050 | $220 | 79% |
| 10K conversations/month | $10,500 | $2,200 | 79% |
| 100K conversations/month | $105,000 | $22,000 | 79% |
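The scale numbers above are straight multiplication of the per-conversation costs. A quick sketch you can adapt to your own volumes (constants taken from the tables above):

```python
def monthly_cost(conversations, cost_per_conversation):
    """Estimated monthly spend for a given conversation volume."""
    return conversations * cost_per_conversation

REALTIME_PER_CONV = 1.05  # 5-min conversation via Crazyrouter (table above)
MODULAR_PER_CONV = 0.22   # Deepgram + Claude + ElevenLabs pipeline

realtime = monthly_cost(10_000, REALTIME_PER_CONV)   # ~ $10,500
modular = monthly_cost(10_000, MODULAR_PER_CONV)     # ~ $2,200
savings_pct = round((1 - modular / realtime) * 100)  # ~79% cheaper
```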

Best Practices for Voice Agents#

1. Keep Responses Short#

Voice conversations need concise responses. Aim for 1-3 sentences per turn.

python
system_prompt = """Respond in 1-2 sentences maximum. 
Be conversational and natural. Avoid lists or technical jargon unless asked."""

2. Handle Interruptions Gracefully#

Users will interrupt — your agent should handle this naturally.
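With the OpenAI Realtime API, a barge-in handler reacts to the `input_audio_buffer.speech_started` event by sending a `response.cancel` event and flushing any audio you haven't played yet. A minimal sketch (the `playback_queue` and `handle_event` names are this guide's own, not part of the API):

```python
import json
import queue

playback_queue = queue.Queue()  # audio chunks received but not yet played

def handle_event(event, send):
    """Barge-in handling: when the user starts speaking, cancel the
    in-flight response server-side and drop unplayed audio locally."""
    if event["type"] == "input_audio_buffer.speech_started":
        send(json.dumps({"type": "response.cancel"}))  # stop generation
        while not playback_queue.empty():
            playback_queue.get_nowait()                # flush local buffer
```

Wire `send` to your websocket's send method and drain `playback_queue` from your audio-output loop.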

3. Add Thinking Indicators#

Play a subtle sound or say "Let me check..." during LLM processing to avoid awkward silence.

4. Implement Error Recovery#

python
if not transcription or len(transcription.strip()) < 2:
    return "I didn't quite catch that. Could you repeat?"

5. Monitor Conversation Quality#

Log all conversations for quality review and fine-tuning.

FAQ#

What's the best approach for building a voice agent in 2026?#

For rapid prototyping, use OpenAI Realtime API — it's the simplest (one API handles everything). For production at scale, the modular pipeline (Deepgram STT + LLM via Crazyrouter + ElevenLabs TTS) gives better cost efficiency and voice quality.

How do I reduce latency in voice agents?#

Key strategies: (1) Use streaming for all components (STT, LLM, TTS), (2) Start TTS as soon as the first sentence is ready, (3) Use fast models (GPT-4o-mini, Claude Haiku) for the LLM layer, (4) Deploy geographically close to users.
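Strategy (2) — starting TTS at the first complete sentence — can be sketched as a small buffer over the LLM token stream. This naive splitter (a guide-invented helper, and it will mis-split on abbreviations like "Dr.") shows the idea:

```python
import re

def sentence_chunks(token_stream):
    """Yield complete sentences as LLM tokens arrive, so TTS can start
    speaking the first sentence while later ones are still generating."""
    buf = ""
    for token in token_stream:
        buf += token
        # flush every completed sentence (ends with . ! or ? plus a space)
        while (m := re.search(r"^(.+?[.!?])\s+", buf)):
            yield m.group(1)
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever remains at end of stream
```

Feed each yielded sentence straight to your TTS provider instead of waiting for the full response.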

Can I clone a custom voice for my voice agent?#

Yes! ElevenLabs and PlayHT both offer voice cloning APIs. You can create a branded voice from a few minutes of sample audio and use it in your voice agent for a consistent brand experience.

How much does it cost to run a voice agent?#

Using the modular pipeline through Crazyrouter, a 5-minute conversation costs approximately $0.22. At 10,000 conversations/month, that's about $2,200/month — significantly cheaper than human agents at $15-25/hour.

What languages do AI voice agents support?#

Most providers support 30+ languages for STT and TTS. The LLM layer (via Crazyrouter) supports 100+ languages. For best quality, English, Spanish, French, German, Japanese, and Chinese have the most mature voice models.

Summary#

Building AI voice agents in 2026 is more accessible than ever. Whether you choose the simplicity of OpenAI Realtime or the flexibility of a modular pipeline, the key is matching your architecture to your requirements for latency, cost, and voice quality.

For the LLM layer — the brain of your voice agent — Crazyrouter provides access to GPT-4o, Claude, Gemini, and 300+ models through one API key, with 25-30% cost savings that compound at scale.

Start building your voice agent: Get your Crazyrouter API key
