
# Multi-Model Orchestration Patterns: Route AI Requests Like a Pro
No single AI model is best at everything. GPT-4.1 excels at code generation, Claude handles long documents better, Gemini processes multimodal inputs natively, and DeepSeek offers strong performance at a fraction of the cost. The smartest AI applications don't pick one model — they orchestrate many.
This guide covers the patterns and architectures for routing requests to the right model at the right time, optimizing for cost, quality, and reliability.
## Why Multi-Model?
Here's the reality of AI model performance in 2026:
| Task | Best Model | Runner-Up | Cost Difference |
|---|---|---|---|
| Code generation | GPT-4.1 / Claude Opus | Gemini 2.5 Pro | 3-5x |
| Long document analysis | Claude (200K ctx) | Gemini (1M ctx) | 2x |
| Creative writing | Claude Opus | GPT-4.1 | 2x |
| Simple Q&A | GPT-4.1 mini | DeepSeek V3 | 10-20x vs flagship |
| Image understanding | Gemini 2.5 Pro | GPT-4.1 | 1.5x |
| Math/reasoning | o4-mini | Claude Opus | 3x |
| Cost-sensitive tasks | DeepSeek V3 | GPT-4.1 nano | 5-10x savings |
Locking into one provider means overpaying for simple tasks and underperforming on specialized ones.
## Pattern 1: Complexity-Based Routing
Route requests to different models based on task complexity. Simple questions go to cheap models; complex tasks go to powerful ones.
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-crazyrouter-api-key",
    base_url="https://crazyrouter.com/v1"
)

# Complexity classifier (can be rule-based or ML-based)
def classify_complexity(message: str) -> str:
    """Classify request complexity as low, medium, or high."""
    # Simple heuristics — replace with a classifier in production
    word_count = len(message.split())
    if word_count < 20 and "?" in message:
        return "low"
    elif any(kw in message.lower() for kw in ["analyze", "compare", "design", "architect", "refactor"]):
        return "high"
    elif word_count > 200:
        return "high"
    else:
        return "medium"

MODEL_MAP = {
    "low": "gpt-4.1-nano",     # $0.10/M input — simple Q&A
    "medium": "gpt-4.1-mini",  # $0.40/M input — standard tasks
    "high": "gpt-4.1",         # $2.00/M input — complex reasoning
}

def route_request(messages):
    user_message = messages[-1]["content"]
    complexity = classify_complexity(user_message)
    model = MODEL_MAP[complexity]
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    return {
        "response": response,
        "model_used": model,
        "complexity": complexity,
        "cost_tier": complexity
    }
```
### Advanced: ML-Based Router
For production systems, train a small classifier to route requests:
````python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Train on historical data: (request_features) -> best_model
# Features: word_count, has_code, has_question, topic_embedding, etc.
class ModelRouter:
    def __init__(self):
        self.classifier = RandomForestClassifier()
        self.models = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1", "claude-sonnet-4-5"]

    def extract_features(self, message):
        return [
            len(message.split()),          # word count
            int("```" in message),         # has code
            int("?" in message),           # is question
            len(message),                  # char count
            message.count("\n"),           # line count
            int(any(kw in message.lower()  # has complex keywords
                    for kw in ["analyze", "design", "compare", "explain"]))
        ]

    def route(self, message):
        features = np.array([self.extract_features(message)])
        model_idx = self.classifier.predict(features)[0]
        return self.models[model_idx]
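The classifier has to be fitted before `route` can be called. A minimal training sketch, using the same feature set in standalone form; the `history` examples and their labels (indices into `MODELS`) are invented purely for illustration:

````python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

MODELS = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1", "claude-sonnet-4-5"]

def extract_features(message: str) -> list:
    # Same feature set as ModelRouter.extract_features
    return [
        len(message.split()),
        int("```" in message),
        int("?" in message),
        len(message),
        message.count("\n"),
        int(any(kw in message.lower()
                for kw in ["analyze", "design", "compare", "explain"])),
    ]

# Invented training data: (message, index of the model that handled it best)
history = [
    ("What time is it?", 0),
    ("Summarize this paragraph for me.", 1),
    ("Analyze the trade-offs between these two database designs.", 2),
    ("Compare and explain these two long architecture reviews in detail.", 3),
]

X = np.array([extract_features(msg) for msg, _ in history])
y = np.array([label for _, label in history])

classifier = RandomForestClassifier(n_estimators=50, random_state=0)
classifier.fit(X, y)

def route(message: str) -> str:
    idx = classifier.predict(np.array([extract_features(message)]))[0]
    return MODELS[idx]
````

In practice you would collect labels from real traffic (e.g. which model produced an accepted answer at the lowest cost) rather than hand-write them.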
## Pattern 2: Task-Specific Routing
Different models for different task types:
````python
TASK_ROUTES = {
    "code": {
        "model": "gpt-4.1",
        "system": "You are an expert programmer. Write clean, efficient code.",
        "temperature": 0.2
    },
    "creative": {
        "model": "claude-sonnet-4-5",
        "system": "You are a creative writer with a vivid imagination.",
        "temperature": 0.8
    },
    "analysis": {
        "model": "claude-sonnet-4-5",
        "system": "You are a precise analyst. Be thorough and data-driven.",
        "temperature": 0.3
    },
    "translation": {
        "model": "gpt-4.1-mini",
        "system": "You are a professional translator.",
        "temperature": 0.1
    },
    "math": {
        "model": "o4-mini",
        "system": "Solve step by step.",
        "temperature": 0.0
    },
    "chat": {
        "model": "gpt-4.1-nano",
        "system": "You are a helpful assistant.",
        "temperature": 0.7
    }
}

def detect_task_type(message: str) -> str:
    """Detect the task type from the user message."""
    message_lower = message.lower()
    if any(kw in message_lower for kw in ["write code", "function", "implement", "debug", "```"]):
        return "code"
    elif any(kw in message_lower for kw in ["write a story", "creative", "poem", "imagine"]):
        return "creative"
    elif any(kw in message_lower for kw in ["analyze", "compare", "evaluate", "review"]):
        return "analysis"
    elif any(kw in message_lower for kw in ["translate", "翻译", "traduire"]):
        return "translation"
    elif any(kw in message_lower for kw in ["calculate", "solve", "equation", "math"]):
        return "math"
    else:
        return "chat"

def route_by_task(user_message):
    task_type = detect_task_type(user_message)
    config = TASK_ROUTES[task_type]
    response = client.chat.completions.create(
        model=config["model"],
        messages=[
            {"role": "system", "content": config["system"]},
            {"role": "user", "content": user_message}
        ],
        temperature=config["temperature"]
    )
    return response, task_type
````
## Pattern 3: Cost-Optimized Cascade
Start with the cheapest model. If the response quality is insufficient, escalate to a more expensive one:
````python
COST_CASCADE = [
    {"model": "gpt-4.1-nano", "cost_per_1k": 0.0001},
    {"model": "gpt-4.1-mini", "cost_per_1k": 0.0004},
    {"model": "gpt-4.1", "cost_per_1k": 0.002},
]

def quality_check(response_text: str, task_type: str) -> bool:
    """Check if the response meets quality thresholds."""
    # Basic quality heuristics
    if len(response_text.strip()) < 20:
        return False
    if "I don't know" in response_text or "I'm not sure" in response_text:
        return False
    if task_type == "code" and "```" not in response_text:
        return False  # Code task should contain code blocks
    return True

def cascade_request(messages, task_type="general"):
    content = None
    for tier in COST_CASCADE:
        response = client.chat.completions.create(
            model=tier["model"],
            messages=messages
        )
        content = response.choices[0].message.content
        if quality_check(content, task_type):
            return {
                "content": content,
                "model": tier["model"],
                "escalated": tier != COST_CASCADE[0]
            }
        print(f"{tier['model']} response insufficient, escalating...")
    # Return the last response even if every quality check failed
    return {
        "content": content,
        "model": COST_CASCADE[-1]["model"],
        "escalated": True
    }
````
## Pattern 4: A/B Testing Models
Compare model performance in production:
```python
import random
import hashlib

class ModelABTest:
    def __init__(self, variants):
        """
        variants: [
            {"model": "gpt-4.1", "weight": 0.5},
            {"model": "claude-sonnet-4-5", "weight": 0.5}
        ]
        """
        self.variants = variants
        self.results = {v["model"]: {"count": 0, "latency_sum": 0, "errors": 0}
                        for v in variants}

    def select_variant(self, user_id: str = None):
        """Select a model variant. Consistent per user if user_id provided."""
        if user_id:
            # Deterministic assignment based on user ID
            hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
            threshold = 0
            for variant in self.variants:
                threshold += variant["weight"]
                if (hash_val % 100) / 100 < threshold:
                    return variant["model"]
            return self.variants[-1]["model"]
        # Random assignment
        r = random.random()
        threshold = 0
        for variant in self.variants:
            threshold += variant["weight"]
            if r < threshold:
                return variant["model"]
        return self.variants[-1]["model"]

    def record(self, model, latency_ms, success=True):
        self.results[model]["count"] += 1
        self.results[model]["latency_sum"] += latency_ms
        if not success:
            self.results[model]["errors"] += 1

    def report(self):
        for model, stats in self.results.items():
            avg_latency = stats["latency_sum"] / max(stats["count"], 1)
            error_rate = stats["errors"] / max(stats["count"], 1)
            print(f"{model}: {stats['count']} calls, "
                  f"avg {avg_latency:.0f}ms, "
                  f"error rate {error_rate:.1%}")

# Usage
ab_test = ModelABTest([
    {"model": "gpt-4.1", "weight": 0.5},
    {"model": "claude-sonnet-4-5", "weight": 0.5}
])
model = ab_test.select_variant(user_id="user_123")
```
## Pattern 5: Consensus / Ensemble
For high-stakes decisions, query multiple models and aggregate:
```python
import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI(
    api_key="your-crazyrouter-api-key",
    base_url="https://crazyrouter.com/v1"
)

async def ensemble_request(messages, models=None):
    """Query multiple models and return consensus."""
    models = models or ["gpt-4.1", "claude-sonnet-4-5", "gemini-2.5-flash"]

    async def query_model(model):
        try:
            response = await async_client.chat.completions.create(
                model=model,
                messages=messages
            )
            return {"model": model, "content": response.choices[0].message.content}
        except Exception as e:
            return {"model": model, "error": str(e)}

    # Query all models in parallel
    results = await asyncio.gather(*[query_model(m) for m in models])

    # Filter successful responses
    successful = [r for r in results if "content" in r]
    if not successful:
        raise Exception("All models failed")

    # For classification tasks: majority vote
    # For generation tasks: return all and let the application choose
    return {
        "responses": successful,
        "count": len(successful),
        "models_used": [r["model"] for r in successful]
    }
```
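The comments above leave aggregation to the caller. For label-style outputs, a majority vote over normalized answers is a minimal aggregation sketch (the tie-breaking rule, first-seen answer wins, is a choice, not a standard):

```python
from collections import Counter

def majority_vote(responses: list) -> str:
    """Pick the answer most models agree on.

    responses: list of {"model": ..., "content": ...} dicts, as returned
    in ensemble_request's "responses" field. Ties go to the answer that
    appeared first in the list.
    """
    answers = [r["content"].strip().lower() for r in responses]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Example with mock responses (no API calls)
votes = [
    {"model": "gpt-4.1", "content": "Positive"},
    {"model": "claude-sonnet-4-5", "content": "positive"},
    {"model": "gemini-2.5-flash", "content": "Negative"},
]
print(majority_vote(votes))  # → positive
```

Exact string matching only works for constrained outputs (labels, yes/no); for free-form generations you would need a judge model or embedding similarity instead.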
## Architecture: Putting It All Together
Here's a production-ready orchestration layer:
```python
class AIOrchestrator:
    def __init__(self, api_key, base_url="https://crazyrouter.com/v1"):
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.router = ModelRouter()          # ML-based router from Pattern 1
        self.circuit_breaker = CircuitBreaker()
        self.ab_test = None  # Optional

    def complete(self, messages, strategy="auto", **kwargs):
        """
        Main entry point for AI completions.

        Strategies:
        - "auto": Complexity-based routing
        - "cheap": Always use cheapest model
        - "best": Always use best model
        - "cascade": Start cheap, escalate if needed
        - "specific": Use kwargs["model"]
        """
        if strategy == "specific":
            model = kwargs["model"]
        elif strategy == "cheap":
            model = "gpt-4.1-nano"
        elif strategy == "best":
            model = "gpt-4.1"
        elif strategy == "cascade":
            # Wraps the cascade logic from Pattern 3
            return self._cascade(messages, kwargs.get("task_type", "general"))
        else:  # auto
            model = self.router.route(messages[-1]["content"])
        return self._call_with_fallback(messages, model)

    def _call_with_fallback(self, messages, primary_model):
        fallback_models = self._get_fallbacks(primary_model)
        for model in [primary_model] + fallback_models:
            if not self.circuit_breaker.can_execute(model):
                continue
            try:
                response = self.client.chat.completions.create(
                    model=model, messages=messages
                )
                self.circuit_breaker.record_success(model)
                return response
            except Exception:
                # Record the failure and move on to the next model in line
                self.circuit_breaker.record_failure(model)
        raise Exception("All models unavailable")

    def _get_fallbacks(self, model):
        FALLBACKS = {
            "gpt-4.1": ["claude-sonnet-4-5", "gemini-2.5-flash"],
            "claude-sonnet-4-5": ["gpt-4.1", "gemini-2.5-flash"],
            "gemini-2.5-flash": ["gpt-4.1-mini", "deepseek-v3"],
            "gpt-4.1-mini": ["deepseek-v3", "gpt-4.1-nano"],
            "gpt-4.1-nano": ["deepseek-v3"],
        }
        return FALLBACKS.get(model, ["gpt-4.1-mini"])
```
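The orchestrator assumes a `CircuitBreaker` class that this guide doesn't define. A minimal per-model sketch; the threshold and cooldown values are illustrative defaults, not recommendations:

```python
import time

class CircuitBreaker:
    """Per-model circuit breaker: after `threshold` consecutive failures,
    block calls to that model for `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = {}   # model -> consecutive failure count
        self.opened_at = {}  # model -> time the breaker tripped

    def can_execute(self, model: str) -> bool:
        opened = self.opened_at.get(model)
        if opened is None:
            return True
        if time.monotonic() - opened >= self.cooldown:
            # Cooldown elapsed: allow a trial call (half-open state)
            return True
        return False

    def record_success(self, model: str):
        self.failures[model] = 0
        self.opened_at.pop(model, None)

    def record_failure(self, model: str):
        self.failures[model] = self.failures.get(model, 0) + 1
        if self.failures[model] >= self.threshold:
            self.opened_at[model] = time.monotonic()
```

Keeping breaker state per model is what lets `_call_with_fallback` skip a flapping provider while the rest of the fallback chain stays available.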
## Cost Impact
Here's what multi-model orchestration saves in practice:
| Approach | Monthly Cost (1M requests) | Quality |
|---|---|---|
| Always GPT-4.1 | ~$6,000 | ⭐⭐⭐⭐⭐ |
| Always GPT-4.1 mini | ~$1,200 | ⭐⭐⭐⭐ |
| Complexity routing | ~$2,400 | ⭐⭐⭐⭐⭐ |
| Cost cascade | ~$1,800 | ⭐⭐⭐⭐ |
Complexity routing typically saves 50-70% compared to always using the flagship model, with minimal quality impact.
## FAQ

### How do I decide which model to use for each task?
Start with benchmarks (MMLU, HumanEval, etc.) for your specific use case, then A/B test in production. The "best" model changes frequently — what matters is having the infrastructure to switch quickly.
### Does Crazyrouter handle model routing automatically?
Crazyrouter provides a unified API for 300+ models, making it trivial to switch between providers. You implement the routing logic in your application, and Crazyrouter handles the provider-specific API translation.
### What's the latency overhead of multi-model routing?
The routing decision itself adds <1ms. The main latency factor is the model itself. Cascade patterns add latency when escalation happens, so optimize your classifier to minimize unnecessary escalations.
### Should I cache AI responses?
Yes, for deterministic queries (same input → same output). Use semantic caching (embedding-based similarity) for fuzzy matching. This can reduce costs by 20-40% for applications with repetitive queries.
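A semantic cache can be sketched with a pluggable embedding function and cosine similarity. The toy bag-of-words "embedding" below stands in for a real embedding model, and the 0.9 threshold is an assumption you would tune against your own traffic:

```python
import math
from collections import Counter

def toy_embedding(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words counts
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class SemanticCache:
    def __init__(self, embed=toy_embedding, threshold=0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, query: str):
        """Return a cached response if a similar-enough query was seen."""
        q = self.embed(query)
        for emb, response in self.entries:
            if cosine_similarity(q, emb) >= self.threshold:
                return response
        return None

    def put(self, query: str, response: str):
        self.entries.append((self.embed(query), response))
```

In production you would swap `toy_embedding` for a real embedding model and replace the linear scan with a vector index.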
## Summary
Multi-model orchestration is the difference between a demo and a production AI application. Route by complexity, fall back across providers, and optimize for cost — your users get better results and you spend less.
Crazyrouter makes this practical by providing one API key for 300+ models. No need to manage multiple provider accounts, API keys, or SDK versions. Start building your orchestration layer at crazyrouter.com.


