
AI Voice Agent Guide 2026: Build Speech-to-Speech AI with Real-Time APIs

"Complete guide to building AI voice agents with speech-to-speech APIs. Compare OpenAI Realtime, ElevenLabs, Deepgram, and PlayHT for building conversational voice AI."

Crazyrouter Team
March 2, 2026

AI voice agents are rapidly becoming the interface of choice for customer service, healthcare, sales, and personal assistants. Unlike text chatbots, voice agents create natural, human-like conversations that feel intuitive and accessible.

This guide covers architecture patterns, a provider comparison, three implementation options with code, pricing, and best practices for building a production-ready AI voice agent in 2026.

What is an AI Voice Agent?#

An AI voice agent is a system that can:

  1. Listen — Convert speech to text (STT) in real-time
  2. Think — Process the input with a language model (LLM)
  3. Speak — Convert the response back to speech (TTS)
  4. React — Handle interruptions, pauses, and turn-taking naturally
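
The first three stages above can be sketched as a single turn function. This is a minimal illustration, not any provider's API: `Turn`, `run_turn`, and the stub components are hypothetical names, and real STT, LLM, and TTS clients would be passed in place of the stubs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    user_text: str
    agent_text: str
    agent_audio: bytes


def run_turn(
    audio_in: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Turn:
    """Run one listen -> think -> speak cycle with pluggable components."""
    user_text = stt(audio_in)      # Listen
    agent_text = llm(user_text)    # Think
    agent_audio = tts(agent_text)  # Speak
    return Turn(user_text, agent_text, agent_audio)


# Stub components so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


turn = run_turn(b"hello", fake_stt, fake_llm, fake_tts)
print(turn.agent_text)  # You said: hello
```

The fourth stage (React) is not modeled here; it needs state that spans turns, such as whether the agent is currently speaking.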

Architecture Patterns#

Pattern A: Modular Pipeline (Traditional)

code
Microphone → STT → LLM → TTS → Speaker
              ↓      ↓      ↓
           Deepgram  GPT  ElevenLabs

Pattern B: End-to-End (Modern)

code
Microphone → OpenAI Realtime API → Speaker
              (STT + LLM + TTS combined)

Pattern C: Hybrid

code
Microphone → Deepgram STT → Claude → ElevenLabs TTS → Speaker
              (Best STT)    (Best reasoning)  (Best voices)

Voice AI Provider Comparison#

| Provider | Latency | Voice Quality |
|---|---|---|
| OpenAI Realtime | ~300ms | ⭐⭐⭐⭐ |
| ElevenLabs | ~200ms | ⭐⭐⭐⭐⭐ |
| Deepgram | ~100ms | ⭐⭐⭐⭐ |
| PlayHT | ~150ms | ⭐⭐⭐⭐⭐ |
| AssemblyAI | ~150ms | N/A |
| Google STT/TTS | ~200ms | ⭐⭐⭐⭐ |

Building a Voice Agent: Step by Step#

Option 1: OpenAI Realtime (Simplest)#

The fastest way to build a voice agent — everything in one API:

python
import asyncio
import websockets
import json
import base64
import pyaudio

CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 24000

async def voice_agent():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    # Use Crazyrouter for cost savings
    # url = "wss://crazyrouter.com/v1/realtime?model=gpt-4o-realtime-preview"
    
    headers = {
        "Authorization": "Bearer YOUR_API_KEY",
        "OpenAI-Beta": "realtime=v1"
    }
    
    audio = pyaudio.PyAudio()
    
    # Input stream (microphone)
    input_stream = audio.open(
        format=FORMAT, channels=CHANNELS,
        rate=RATE, input=True, frames_per_buffer=CHUNK
    )
    
    # Output stream (speaker)
    output_stream = audio.open(
        format=FORMAT, channels=CHANNELS,
        rate=RATE, output=True, frames_per_buffer=CHUNK
    )
    
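    # websockets 14+ takes additional_headers; on older releases the same
    # parameter is called extra_headers.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Configure the agent
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": """You are a helpful customer service agent for a tech company. 
                Be concise, friendly, and professional. 
                If you don't know something, say so honestly.""",
                # "nova" is a TTS-only voice; the Realtime API accepts voices such as "alloy"
                "voice": "alloy",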
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "silence_duration_ms": 800
                }
            }
        }))
        
        # Send audio from microphone
        async def send_audio():
            while True:
                # PyAudio reads block, so run them in a worker thread to keep
                # the event loop free to receive and play audio.
                data = await asyncio.to_thread(
                    input_stream.read, CHUNK, exception_on_overflow=False
                )
                encoded = base64.b64encode(data).decode()
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": encoded
                }))
        
        # Receive and play audio
        async def receive_audio():
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "response.audio.delta":
                    audio_bytes = base64.b64decode(event["delta"])
                    # Speaker writes also block, so keep them off the event loop.
                    await asyncio.to_thread(output_stream.write, audio_bytes)
                elif event["type"] == "response.audio_transcript.delta":
                    # Print only the new text; a per-delta prefix would repeat mid-sentence.
                    print(event["delta"], end="", flush=True)
                elif event["type"] == "input_audio_buffer.speech_started":
                    print("\n[User speaking...]")
        
        await asyncio.gather(send_audio(), receive_audio())

asyncio.run(voice_agent())
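
To make the audio framing in the example above concrete, the sketch below works out how much audio one microphone read contains and builds the corresponding `input_audio_buffer.append` message. The helper names `chunk_duration_ms` and `append_event` are hypothetical; the constants mirror the example's 16-bit mono PCM at 24 kHz.

```python
import base64
import json

CHUNK = 1024          # frames per read, as in the example above
RATE = 24000          # samples per second
BYTES_PER_SAMPLE = 2  # 16-bit mono PCM


def chunk_duration_ms(frames: int = CHUNK, rate: int = RATE) -> float:
    """Milliseconds of audio contained in one microphone read."""
    return frames / rate * 1000


def append_event(pcm: bytes) -> str:
    """Wrap raw PCM bytes in an input_audio_buffer.append JSON message."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


silence = bytes(CHUNK * BYTES_PER_SAMPLE)  # one chunk of digital silence
message = json.loads(append_event(silence))
print(round(chunk_duration_ms(), 1))            # 42.7
print(len(base64.b64decode(message["audio"])))  # 2048
```

Each read therefore carries roughly 43 ms of audio, so the microphone itself paces the send loop.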

Option 2: Modular Pipeline (Best Quality)#

Mix the best providers for each component:

python
from openai import OpenAI
import deepgram
import elevenlabs
import asyncio

# Use Crazyrouter for the LLM component
llm_client = OpenAI(
    api_key="YOUR_CRAZYROUTER_KEY",
    base_url="https://crazyrouter.com/v1"
)

# Deepgram for STT (fastest, most accurate)
dg_client = deepgram.DeepgramClient("YOUR_DEEPGRAM_KEY")

# ElevenLabs for TTS (most natural voices)
el_client = elevenlabs.ElevenLabs(api_key="YOUR_ELEVENLABS_KEY")

class VoiceAgent:
    def __init__(self):
        self.conversation_history = []
        self.system_prompt = """You are a friendly AI assistant. 
        Keep responses under 2 sentences for natural conversation flow.
        Be warm, helpful, and concise."""
    
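    async def listen(self, audio_bytes: bytes) -> str:
        """Convert one recorded utterance to text using Deepgram."""
        # Assumes deepgram-sdk 3.x, where pre-recorded transcription is
        # listen.prerecorded.v("1").transcribe_file(); other SDK versions differ.
        options = deepgram.PrerecordedOptions(
            model="nova-2", language="en", smart_format=True
        )
        # The SDK call blocks, so run it in a worker thread.
        response = await asyncio.to_thread(
            dg_client.listen.prerecorded.v("1").transcribe_file,
            {"buffer": audio_bytes},
            options,
        )
        return response.results.channels[0].alternatives[0].transcript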
    
    def think(self, user_text: str) -> str:
        """Generate response using LLM via Crazyrouter."""
        self.conversation_history.append({"role": "user", "content": user_text})
        
        response = llm_client.chat.completions.create(
            model="claude-sonnet-4-20250514",  # Best for conversational AI
            messages=[
                {"role": "system", "content": self.system_prompt},
                *self.conversation_history[-10:]  # Last 10 messages (about 5 exchanges) for context
            ],
            max_tokens=150  # Keep responses short for voice
        )
        
        assistant_text = response.choices[0].message.content
        self.conversation_history.append({"role": "assistant", "content": assistant_text})
        return assistant_text
    
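    def speak(self, text: str) -> bytes:
        """Convert text to speech using ElevenLabs."""
        # generate() is the helper from the 1.x Python SDK; newer SDK releases
        # may require text_to_speech.convert() instead.
        audio = el_client.generate(
            text=text,
            voice="Rachel",
            model="eleven_turbo_v2_5",
            stream=True
        )
        # With stream=True the SDK yields byte chunks; join them into one buffer.
        return b"".join(audio)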

agent = VoiceAgent()
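
The class above defines the three stages but does not show them wired together. A minimal sketch of one turn follows, assuming only the method names `listen`, `think`, and `speak` from the class; `handle_turn` and `FakeAgent` are hypothetical and exist so the flow can be exercised without network calls.

```python
import asyncio


async def handle_turn(agent, audio) -> bytes:
    """Run listen -> think -> speak once and return the synthesized audio."""
    user_text = await agent.listen(audio)
    if not user_text.strip():
        # Nothing intelligible was heard; skip the LLM call entirely.
        return b""
    # think() and speak() are blocking calls, so keep them off the event loop.
    reply = await asyncio.to_thread(agent.think, user_text)
    return await asyncio.to_thread(agent.speak, reply)


class FakeAgent:
    """Stand-in with the same method names as VoiceAgent, for a dry run."""

    async def listen(self, audio: bytes) -> str:
        return audio.decode("utf-8")

    def think(self, user_text: str) -> str:
        return user_text.upper()

    def speak(self, text: str) -> bytes:
        return text.encode("utf-8")


audio_out = asyncio.run(handle_turn(FakeAgent(), b"hi there"))
print(audio_out)  # b'HI THERE'
```

Passing a real `VoiceAgent()` instead of `FakeAgent()` would exercise the actual providers.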

Option 3: Phone/Telephony Integration#

python
# Using Twilio + Voice Agent for phone calls
from flask import Flask, Response, request
from openai import OpenAI
from twilio.twiml.voice_response import VoiceResponse, Gather

app = Flask(__name__)

# Same Crazyrouter-backed LLM client as in Option 2
llm_client = OpenAI(
    api_key="YOUR_CRAZYROUTER_KEY",
    base_url="https://crazyrouter.com/v1"
)

@app.route("/incoming-call", methods=["POST"])
def incoming_call():
    # Greet the caller and collect their first spoken request
    response = VoiceResponse()
    gather = Gather(
        input="speech",
        action="/process-speech",
        language="en-US",
        speech_timeout="auto"
    )
    gather.say("Hello! I'm your AI assistant. How can I help you today?")
    response.append(gather)
    return Response(str(response), mimetype="text/xml")

@app.route("/process-speech", methods=["POST"])
def process_speech():
    # Twilio posts the recognized text in the SpeechResult form field
    user_speech = request.form.get("SpeechResult", "")
    
    # Use Crazyrouter LLM to generate response
    llm_response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a phone customer service agent. Be brief and helpful."},
            {"role": "user", "content": user_speech}
        ],
        max_tokens=100
    )
    
    agent_text = llm_response.choices[0].message.content
    
    response = VoiceResponse()
    gather = Gather(
        input="speech",
        action="/process-speech",
        speech_timeout="auto"
    )
    gather.say(agent_text)
    response.append(gather)
    return Response(str(response), mimetype="text/xml")

Pricing Comparison#

End-to-End Voice (OpenAI Realtime)#

| Component | OpenAI Direct | Crazyrouter | Savings |
|---|---|---|---|
| Audio Input | $0.06/min | $0.042/min | 30% |
| Audio Output | $0.24/min | $0.168/min | 30% |
| 5-min conversation | $1.50 | $1.05 | $0.45 |
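
The 5-minute figures follow from the per-minute rates if both audio input and audio output are billed for all five minutes. That billing assumption is inferred from the table's arithmetic rather than stated in it, and `realtime_cost` is a hypothetical helper:

```python
def realtime_cost(minutes: float, input_rate: float, output_rate: float) -> float:
    """Cost in USD when both audio directions are billed for every minute."""
    return round(minutes * (input_rate + output_rate), 2)


direct = realtime_cost(5, 0.06, 0.24)    # OpenAI direct rates
routed = realtime_cost(5, 0.042, 0.168)  # Crazyrouter rates
print(direct, routed, round(direct - routed, 2))  # 1.5 1.05 0.45
```

In a real call the two directions rarely both run for the full duration, so actual costs depend on how talk time is split.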

Modular Pipeline (per 5-min conversation)#

| Component | Provider | Cost |
|---|---|---|
| STT | Deepgram Nova-2 | $0.04 |
| LLM | Claude Sonnet via Crazyrouter | $0.03 |
| TTS | ElevenLabs Turbo | $0.15 |
| Total | | $0.22 |

At these rates the modular pipeline is roughly 5x cheaper than the end-to-end option via Crazyrouter ($0.22 vs. $1.05 per 5-minute conversation), at the cost of higher latency (~500ms vs. ~300ms).

Cost at Scale#

| Volume | OpenAI Realtime (Crazyrouter) | Modular Pipeline | Savings |
|---|---|---|---|
| 1K conversations/month | $1,050 | $220 | 79% |
| 10K conversations/month | $10,500 | $2,200 | 79% |
| 100K conversations/month | $105,000 | $22,000 | 79% |
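
The rows above scale linearly from the two per-conversation totals. A short sketch reproduces them, assuming every conversation lasts five minutes; `monthly_costs` is a hypothetical helper.

```python
REALTIME_PER_CONV = 1.05  # USD, OpenAI Realtime via Crazyrouter, 5-min conversation
MODULAR_PER_CONV = 0.22   # USD, modular pipeline total, 5-min conversation


def monthly_costs(conversations: int) -> tuple[float, float, float]:
    """Return (realtime_usd, modular_usd, savings_percent) for a monthly volume."""
    realtime = conversations * REALTIME_PER_CONV
    modular = conversations * MODULAR_PER_CONV
    savings = (realtime - modular) / realtime * 100
    return realtime, modular, savings


for volume in (1_000, 10_000, 100_000):
    realtime, modular, savings = monthly_costs(volume)
    print(f"{volume:>7,} conv/month: ${realtime:,.0f} vs ${modular:,.0f} ({savings:.0f}% saved)")
```

The savings percentage stays at 79% for every volume because both costs grow in proportion to the number of conversations.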

Best Practices for Voice Agents#

1. Keep Responses Short#

Voice conversations need concise responses. Aim for 1-3 sentences per turn.

python
system_prompt = """Respond in 1-2 sentences maximum. 
Be conversational and natural. Avoid lists or technical jargon unless asked."""

2. Handle Interruptions Gracefully#

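Users will interrupt. When speech is detected while the agent is talking, stop playback immediately, discard any audio that has not been played yet, and cancel the in-flight response so the agent can listen.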
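
One way to handle interruptions with the OpenAI Realtime events used earlier is sketched below. `PlaybackBuffer` is a hypothetical class: it queues decoded audio, and when `input_audio_buffer.speech_started` arrives while the agent is speaking it drops the unplayed audio and returns a `response.cancel` message to send. Depending on session settings the server may already stop generating on its own, so treat the cancel as an assumption; flushing local playback is needed either way.

```python
import base64
import json
from collections import deque


class PlaybackBuffer:
    """Holds agent audio chunks that were received but not yet played."""

    def __init__(self) -> None:
        self.chunks = deque()
        self.agent_speaking = False

    def on_event(self, event: dict) -> list:
        """Update state for one server event; return messages to send back."""
        kind = event.get("type")
        if kind == "response.audio.delta":
            self.agent_speaking = True
            self.chunks.append(base64.b64decode(event["delta"]))
        elif kind == "response.done":
            self.agent_speaking = False
        elif kind == "input_audio_buffer.speech_started" and self.agent_speaking:
            # Barge-in: drop unplayed audio and ask the server to stop generating.
            self.chunks.clear()
            self.agent_speaking = False
            return [json.dumps({"type": "response.cancel"})]
        return []


buf = PlaybackBuffer()
buf.on_event({
    "type": "response.audio.delta",
    "delta": base64.b64encode(b"\x00\x01").decode("ascii"),
})
outgoing = buf.on_event({"type": "input_audio_buffer.speech_started"})
print(len(buf.chunks), outgoing)  # 0 ['{"type": "response.cancel"}']
```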

3. Add Thinking Indicators#

Play a subtle sound or say "Let me check..." during LLM processing to avoid awkward silence.
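
A minimal sketch of this idea: start the LLM call, and only if it has not finished within a threshold, speak a filler phrase before waiting for the real reply. The function name `reply_with_filler`, the 0.7-second default, and the `say` callback are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def reply_with_filler(
    generate: Callable[[], Awaitable[str]],
    say: Callable[[str], None],
    filler_after_s: float = 0.7,
    filler: str = "Let me check...",
) -> str:
    """Speak a filler phrase if the reply takes longer than filler_after_s."""
    task = asyncio.ensure_future(generate())
    try:
        # shield() keeps the LLM call running when the timeout fires.
        reply = await asyncio.wait_for(asyncio.shield(task), timeout=filler_after_s)
    except asyncio.TimeoutError:
        say(filler)
        reply = await task
    say(reply)
    return reply


async def slow_llm() -> str:
    await asyncio.sleep(0.2)  # simulated slow model
    return "Your order shipped yesterday."


spoken: list = []
asyncio.run(reply_with_filler(slow_llm, spoken.append, filler_after_s=0.05))
print(spoken)  # ['Let me check...', 'Your order shipped yesterday.']
```

When the model answers before the threshold, the filler is skipped and only the reply is spoken.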

4. Implement Error Recovery#

python
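def reprompt_if_unclear(transcription):
    # Return a re-prompt for empty or near-empty transcripts, else None.
    if not transcription or len(transcription.strip()) < 2:
        return "I didn't quite catch that. Could you repeat?"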

5. Monitor Conversation Quality#

Log all conversations for quality review and fine-tuning.
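
A simple way to do this is one JSON line per turn, which is easy to append to and to filter later. The sketch below is an assumption about what is worth recording; the field names and `log_turn` are hypothetical.

```python
import io
import json
import time
from typing import TextIO


def log_turn(sink: TextIO, conversation_id: str, role: str, text: str, latency_ms: float) -> None:
    """Append one conversation turn to the sink as a JSON line."""
    record = {
        "ts": time.time(),
        "conversation_id": conversation_id,
        "role": role,
        "text": text,
        "latency_ms": latency_ms,
    }
    sink.write(json.dumps(record) + "\n")


sink = io.StringIO()  # stands in for an open log file
log_turn(sink, "conv-001", "user", "Where is my order?", latency_ms=0.0)
log_turn(sink, "conv-001", "assistant", "It shipped yesterday.", latency_ms=480.0)
records = [json.loads(line) for line in sink.getvalue().splitlines()]
print(len(records), records[1]["role"])  # 2 assistant
```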

FAQ#

What's the best approach for building a voice agent in 2026?#

For rapid prototyping, use OpenAI Realtime API — it's the simplest (one API handles everything). For production at scale, the modular pipeline (Deepgram STT + LLM via Crazyrouter + ElevenLabs TTS) gives better cost efficiency and voice quality.

How do I reduce latency in voice agents?#

Key strategies: (1) Use streaming for all components (STT, LLM, TTS), (2) Start TTS as soon as the first sentence is ready, (3) Use fast models (GPT-4o-mini, Claude Haiku) for the LLM layer, (4) Deploy geographically close to users.
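
Strategy (2) can be sketched as a generator that watches the streamed LLM tokens and emits each sentence as soon as it is complete, so TTS can start before the full reply exists. `sentences_from_stream` is a hypothetical helper, and the punctuation-based split is a simplification that will also break on abbreviations such as "Dr.".

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


stream = ["Sure", ", I can", " help. ", "Your order", " ships", " Monday."]
print(list(sentences_from_stream(stream)))
# ['Sure, I can help.', 'Your order ships Monday.']
```

Each yielded sentence can be handed to the TTS call immediately while the LLM keeps generating the rest.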

Can I clone a custom voice for my voice agent?#

Yes! ElevenLabs and PlayHT both offer voice cloning APIs. You can create a branded voice from a few minutes of sample audio and use it in your voice agent for a consistent brand experience.

How much does it cost to run a voice agent?#

Using the modular pipeline through Crazyrouter, a 5-minute conversation costs approximately $0.22. At 10,000 conversations/month, that's about $2,200/month, significantly cheaper than human agents at $15-25/hour.

What languages do AI voice agents support?#

Most providers support 30+ languages for STT and TTS. The LLM layer (via Crazyrouter) supports 100+ languages. For best quality, English, Spanish, French, German, Japanese, and Chinese have the most mature voice models.

Summary#

Building AI voice agents in 2026 is more accessible than ever. Whether you choose the simplicity of OpenAI Realtime or the flexibility of a modular pipeline, the key is matching your architecture to your requirements for latency, cost, and voice quality.

For the LLM layer — the brain of your voice agent — Crazyrouter provides access to GPT-4o, Claude, Gemini, and 300+ models through one API key, with 25-30% cost savings that compound at scale.
