
AI Voice Agent Guide 2026: Build Speech-to-Speech AI with Real-Time APIs#
AI voice agents are rapidly becoming the interface of choice for customer service, healthcare, sales, and personal assistants. Unlike text chatbots, voice agents create natural, human-like conversations that feel intuitive and accessible.
This guide covers everything you need to build a production-ready AI voice agent in 2026.
What is an AI Voice Agent?#
An AI voice agent is a system that can:
- Listen — Convert speech to text (STT) in real-time
- Think — Process the input with a language model (LLM)
- Speak — Convert the response back to speech (TTS)
- React — Handle interruptions, pauses, and turn-taking naturally
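These four stages form one conversational turn. Stripped of any particular provider, the loop can be sketched with placeholder components (the three lambdas below are stand-ins for real STT, LLM, and TTS calls, not actual APIs):

```python
from typing import Callable

def run_turn(
    audio_in: bytes,
    stt: Callable[[bytes], str],   # Listen: speech -> text
    llm: Callable[[str], str],     # Think: text -> reply text
    tts: Callable[[str], bytes],   # Speak: reply text -> audio
) -> bytes:
    """One conversational turn: listen, think, speak."""
    user_text = stt(audio_in)
    reply_text = llm(user_text)
    return tts(reply_text)

# Placeholder components, for illustration only:
reply_audio = run_turn(
    b"...",
    stt=lambda audio: "what are your hours?",
    llm=lambda text: f"You asked: {text}",
    tts=lambda text: text.encode(),
)
```

The fourth stage, reacting, is not a pipeline step but a control layer around this loop: it decides when a turn starts, when it is cut short, and when to speak at all.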
Architecture Patterns#
Pattern A: Modular Pipeline (Traditional)

```
Microphone → STT → LLM → TTS → Speaker
              │      │     │
          Deepgram  GPT  ElevenLabs
```

Pattern B: End-to-End (Modern)

```
Microphone → OpenAI Realtime API → Speaker
             (STT + LLM + TTS combined)
```

Pattern C: Hybrid

```
Microphone → Deepgram STT → Claude → ElevenLabs TTS → Speaker
             (best STT)    (best reasoning)  (best voices)
```
Voice AI Provider Comparison#
| Provider | STT | LLM | TTS | E2E | Latency | Voice Quality |
|---|---|---|---|---|---|---|
| OpenAI Realtime | ✅ | ✅ | ✅ | ✅ | ~300ms | ⭐⭐⭐⭐ |
| ElevenLabs | ❌ | ❌ | ✅ | ❌ | ~200ms | ⭐⭐⭐⭐⭐ |
| Deepgram | ✅ | ❌ | ✅ | ❌ | ~100ms | ⭐⭐⭐⭐ |
| PlayHT | ❌ | ❌ | ✅ | ❌ | ~150ms | ⭐⭐⭐⭐⭐ |
| AssemblyAI | ✅ | ❌ | ❌ | ❌ | ~150ms | N/A |
| Google STT/TTS | ✅ | ❌ | ✅ | ❌ | ~200ms | ⭐⭐⭐⭐ |
Building a Voice Agent: Step by Step#
Option 1: OpenAI Realtime (Simplest)#
The fastest way to build a voice agent — everything in one API:
```python
import asyncio
import base64
import json

import pyaudio
import websockets

CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 24000  # the Realtime API expects 24 kHz PCM16 audio

async def voice_agent():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    # Use Crazyrouter for cost savings:
    # url = "wss://crazyrouter.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": "Bearer YOUR_API_KEY",
        "OpenAI-Beta": "realtime=v1",
    }

    audio = pyaudio.PyAudio()
    # Input stream (microphone)
    input_stream = audio.open(
        format=FORMAT, channels=CHANNELS,
        rate=RATE, input=True, frames_per_buffer=CHUNK,
    )
    # Output stream (speaker)
    output_stream = audio.open(
        format=FORMAT, channels=CHANNELS,
        rate=RATE, output=True, frames_per_buffer=CHUNK,
    )

    # Note: websockets >= 14 renamed extra_headers to additional_headers.
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure the agent
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": (
                    "You are a helpful customer service agent for a tech company. "
                    "Be concise, friendly, and professional. "
                    "If you don't know something, say so honestly."
                ),
                "voice": "alloy",  # a voice supported by the Realtime API
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "silence_duration_ms": 800,
                },
            },
        }))

        # Send audio from the microphone
        async def send_audio():
            while True:
                data = input_stream.read(CHUNK, exception_on_overflow=False)
                encoded = base64.b64encode(data).decode()
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": encoded,
                }))
                await asyncio.sleep(0.01)

        # Receive and play audio
        async def receive_audio():
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "response.audio.delta":
                    output_stream.write(base64.b64decode(event["delta"]))
                elif event["type"] == "response.audio_transcript.delta":
                    print(f"Agent: {event['delta']}", end="", flush=True)
                elif event["type"] == "input_audio_buffer.speech_started":
                    print("\n[User speaking...]")

        await asyncio.gather(send_audio(), receive_audio())

asyncio.run(voice_agent())
```
Option 2: Modular Pipeline (Best Quality)#
Mix the best providers for each component:
```python
from openai import OpenAI
from deepgram import DeepgramClient, PrerecordedOptions
from elevenlabs import ElevenLabs

# Use Crazyrouter for the LLM component
llm_client = OpenAI(
    api_key="YOUR_CRAZYROUTER_KEY",
    base_url="https://crazyrouter.com/v1",
)

# Deepgram for STT (fast, accurate)
dg_client = DeepgramClient("YOUR_DEEPGRAM_KEY")

# ElevenLabs for TTS (natural voices)
el_client = ElevenLabs(api_key="YOUR_ELEVENLABS_KEY")

class VoiceAgent:
    def __init__(self):
        self.conversation_history = []
        self.system_prompt = (
            "You are a friendly AI assistant. "
            "Keep responses under 2 sentences for natural conversation flow. "
            "Be warm, helpful, and concise."
        )

    async def listen(self, audio_bytes: bytes) -> str:
        """Convert speech to text using Deepgram.

        This buffered call keeps the example simple; for word-by-word
        latency, use the SDK's live/websocket interface instead.
        """
        response = dg_client.listen.prerecorded.v("1").transcribe_file(
            {"buffer": audio_bytes},
            PrerecordedOptions(model="nova-2", language="en", smart_format=True),
        )
        return response.results.channels[0].alternatives[0].transcript

    def think(self, user_text: str) -> str:
        """Generate a response using an LLM via Crazyrouter."""
        self.conversation_history.append({"role": "user", "content": user_text})
        response = llm_client.chat.completions.create(
            model="claude-sonnet-4-20250514",  # strong conversational model
            messages=[
                {"role": "system", "content": self.system_prompt},
                *self.conversation_history[-10:],  # last 10 turns for context
            ],
            max_tokens=150,  # keep responses short for voice
        )
        assistant_text = response.choices[0].message.content
        self.conversation_history.append(
            {"role": "assistant", "content": assistant_text}
        )
        return assistant_text

    def speak(self, text: str) -> bytes:
        """Convert text to speech using ElevenLabs."""
        audio = el_client.generate(
            text=text,
            voice="Rachel",
            model="eleven_turbo_v2_5",
            stream=True,
        )
        return b"".join(audio)  # generate() streams audio chunks

agent = VoiceAgent()
```
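A turn loop that drives these three methods might look like the sketch below. Audio capture and playback are injected as callables so the loop stays testable; in production, `get_audio` and `play_audio` (hypothetical hooks, not part of any SDK) would wrap PyAudio streams or a telephony connection:

```python
import asyncio

async def conversation_loop(agent, get_audio, play_audio):
    """Drive one agent: capture audio, transcribe, respond, play back."""
    while True:
        chunk = await get_audio()
        if chunk is None:          # end of input
            break
        text = await agent.listen(chunk)
        if len(text.strip()) < 2:  # ignore silence and noise
            continue
        play_audio(agent.speak(agent.think(text)))
```

Dependency injection here is deliberate: you can unit-test the loop with fake components before paying for a single API call.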
Option 3: Phone/Telephony Integration#
```python
# Using Twilio + a voice agent to handle phone calls
from flask import Flask, Response, request
from twilio.twiml.voice_response import VoiceResponse, Gather

app = Flask(__name__)

@app.route("/incoming-call", methods=["POST"])
def incoming_call():
    response = VoiceResponse()
    gather = Gather(
        input="speech",
        action="/process-speech",
        language="en-US",
        speech_timeout="auto",
    )
    gather.say("Hello! I'm your AI assistant. How can I help you today?")
    response.append(gather)
    return Response(str(response), mimetype="text/xml")

@app.route("/process-speech", methods=["POST"])
def process_speech():
    user_speech = request.form.get("SpeechResult", "")

    # Use a Crazyrouter LLM (llm_client as configured in Option 2)
    llm_response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a phone customer service agent. Be brief and helpful."},
            {"role": "user", "content": user_speech},
        ],
        max_tokens=100,
    )
    agent_text = llm_response.choices[0].message.content

    # Reply, then listen for the caller's next turn
    response = VoiceResponse()
    gather = Gather(
        input="speech",
        action="/process-speech",
        speech_timeout="auto",
    )
    gather.say(agent_text)
    response.append(gather)
    return Response(str(response), mimetype="text/xml")
```
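To take live calls, Twilio's webhooks must reach this Flask app over public HTTPS. During development, one common setup (ngrok is just one tunnel option; the filename `app.py` is an assumption) looks like:

```shell
pip install flask twilio openai
python app.py        # assumes the code above is saved as app.py
ngrok http 5000      # point the Twilio number's voice webhook
                     # at https://<your-tunnel>/incoming-call
```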
Pricing Comparison#
End-to-End Voice (OpenAI Realtime)#
| Component | OpenAI Direct | Crazyrouter | Savings |
|---|---|---|---|
| Audio Input | $0.06/min | $0.042/min | 30% |
| Audio Output | $0.24/min | $0.168/min | 30% |
| 5-min conversation | $1.50 | $1.05 | $0.45 |
Modular Pipeline (per 5-min conversation)#
| Component | Provider | Cost |
|---|---|---|
| STT | Deepgram Nova-2 | $0.04 |
| LLM | Claude Sonnet via Crazyrouter | $0.03 |
| TTS | ElevenLabs Turbo | $0.15 |
| **Total** | | **$0.22** |
The modular pipeline is ~5x cheaper than end-to-end, at the cost of slightly higher latency (~500ms vs ~300ms).
Cost at Scale#
| Volume | OpenAI Realtime (Crazyrouter) | Modular Pipeline | Savings |
|---|---|---|---|
| 1K conversations/month | $1,050 | $220 | 79% |
| 10K conversations/month | $10,500 | $2,200 | 79% |
| 100K conversations/month | $105,000 | $22,000 | 79% |
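The scaling table above is straight multiplication, so a small helper makes it easy to plug in your own volumes. The per-conversation rates are taken from the tables above:

```python
def monthly_cost(conversations: int, per_conversation_usd: float) -> float:
    """Project monthly spend from a per-conversation cost."""
    return conversations * per_conversation_usd

# Cost per 5-minute conversation, from the tables above:
REALTIME = 1.05   # OpenAI Realtime via Crazyrouter
MODULAR = 0.22    # Deepgram + Claude + ElevenLabs

for volume in (1_000, 10_000, 100_000):
    e2e = monthly_cost(volume, REALTIME)
    mod = monthly_cost(volume, MODULAR)
    print(f"{volume:>7,} convs/mo: ${e2e:>9,.0f} vs ${mod:>8,.0f} "
          f"({1 - mod / e2e:.0%} savings)")
```

Swap in your own average conversation length and rates; the savings percentage is constant because both costs scale linearly with volume.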
Best Practices for Voice Agents#
1. Keep Responses Short#
Voice conversations need concise responses. Aim for 1-3 sentences per turn.
```python
system_prompt = """Respond in 1-2 sentences maximum.
Be conversational and natural. Avoid lists or technical jargon unless asked."""
```
2. Handle Interruptions Gracefully#
Users will interrupt — your agent should handle this naturally.
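With the Realtime API, server-side VAD emits an `input_audio_buffer.speech_started` event when the user talks over the agent. One reasonable reaction, sketched below assuming the WebSocket setup from Option 1, is to cancel the in-flight response server-side and flush local playback (`playback_queue` is a hypothetical queue of not-yet-played audio chunks, not part of the API):

```python
import asyncio
import json

async def handle_interruption(ws, playback_queue: asyncio.Queue):
    """Stop the agent mid-sentence when the user starts speaking.

    response.cancel stops generation on the server; clearing the local
    queue drops audio that was buffered but not yet played.
    """
    await ws.send(json.dumps({"type": "response.cancel"}))
    while not playback_queue.empty():
        playback_queue.get_nowait()  # drop unplayed audio
```

Without the local flush, the agent keeps "talking" from buffered audio for seconds after the server has already stopped generating.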
3. Add Thinking Indicators#
Play a subtle sound or say "Let me check..." during LLM processing to avoid awkward silence.
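One way to do this is to race the LLM call against a short timer, so the filler only plays when generation actually runs long. A sketch, with `say` standing in for whatever TTS playback hook you use:

```python
import asyncio

async def think_with_filler(llm_call, say, threshold_s: float = 0.7):
    """Speak a short filler phrase if the LLM takes noticeably long.

    llm_call: awaitable that produces the reply text.
    say: callable that plays a phrase (hypothetical TTS hook).
    """
    task = asyncio.ensure_future(llm_call)
    done, _ = await asyncio.wait({task}, timeout=threshold_s)
    if not done:
        say("Let me check...")  # cover the silence
    return await task
```

Tune the threshold to sit just above your typical LLM latency, so fast responses stay filler-free.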
4. Implement Error Recovery#
```python
if not transcription or len(transcription.strip()) < 2:
    return "I didn't quite catch that. Could you repeat?"
```
5. Monitor Conversation Quality#
Log all conversations for quality review and fine-tuning.
FAQ#
What's the best approach for building a voice agent in 2026?#
For rapid prototyping, use OpenAI Realtime API — it's the simplest (one API handles everything). For production at scale, the modular pipeline (Deepgram STT + LLM via Crazyrouter + ElevenLabs TTS) gives better cost efficiency and voice quality.
How do I reduce latency in voice agents?#
Key strategies: (1) Use streaming for all components (STT, LLM, TTS), (2) Start TTS as soon as the first sentence is ready, (3) Use fast models (GPT-4o-mini, Claude Haiku) for the LLM layer, (4) Deploy geographically close to users.
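Point (2), starting TTS on the first complete sentence, can be implemented with a small buffer over the LLM token stream. A sketch using simple punctuation splitting (real sentence boundaries are messier; abbreviations like "Dr." will split early):

```python
import re

def sentences_from_stream(token_stream):
    """Yield complete sentences as soon as they arrive from an LLM stream,
    so TTS can start before the full response has been generated."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Split on sentence-ending punctuation followed by whitespace.
        while (m := re.search(r"[.!?]\s", buffer)):
            yield buffer[: m.end()].strip()
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream
```

Feed each yielded sentence straight to your TTS call; the first sentence typically arrives several hundred milliseconds before the full response finishes.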
Can I clone a custom voice for my voice agent?#
Yes! ElevenLabs and PlayHT both offer voice cloning APIs. You can create a branded voice from a few minutes of sample audio and use it in your voice agent for a consistent brand experience.
How much does it cost to run a voice agent?#
Using the modular pipeline through Crazyrouter, a 5-minute conversation costs approximately $0.22, or about $2,200/month at 10K conversations. That is significantly cheaper than human agents at $15-25/hour.
What languages do AI voice agents support?#
Most providers support 30+ languages for STT and TTS. The LLM layer (via Crazyrouter) supports 100+ languages. For best quality, English, Spanish, French, German, Japanese, and Chinese have the most mature voice models.
Summary#
Building AI voice agents in 2026 is more accessible than ever. Whether you choose the simplicity of OpenAI Realtime or the flexibility of a modular pipeline, the key is matching your architecture to your requirements for latency, cost, and voice quality.
For the LLM layer — the brain of your voice agent — Crazyrouter provides access to GPT-4o, Claude, Gemini, and 300+ models through one API key, with 25-30% cost savings that compound at scale.
Start building your voice agent → Get your Crazyrouter API key


