
# OpenClaw Custom Model Integration: Deploy Your Own AI Models
While OpenClaw works seamlessly with commercial AI models through Crazyrouter, many organizations need to integrate proprietary or fine-tuned models. This comprehensive guide covers everything from model deployment to production optimization, enabling you to extend OpenClaw with custom AI capabilities.
## Understanding OpenClaw Model Architecture
OpenClaw's model integration system provides a flexible framework for connecting custom AI models while maintaining compatibility with the standard OpenAI API format.
### Model Provider Interface
OpenClaw uses a provider-based architecture where each model source implements a standardized interface. This design allows seamless switching between commercial APIs, self-hosted models, and hybrid configurations.
Provider Components: Every model provider implements four core methods: initialize() for setup, complete() for text generation, embed() for embeddings, and validate() for health checks. This consistent interface ensures OpenClaw can treat all models uniformly.
Request Transformation: OpenClaw automatically transforms requests between different model formats. When you send an OpenAI-formatted request to a custom model, OpenClaw handles the conversion to your model's expected format.
Response Normalization: Similarly, responses from custom models are normalized to OpenAI format, ensuring consistent behavior across all model types. This abstraction layer simplifies client code and enables easy model switching.
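As a concrete illustration of that normalization layer, here is a minimal Python sketch that maps a hypothetical custom model's raw output to the OpenAI chat-completion shape. The field names on the custom side (`text`, `input_tokens`, `output_tokens`) are invented for the example:

```python
import time
import uuid

def normalize_response(raw: dict, model_name: str) -> dict:
    """Map a custom model's output (assumed here to expose 'text',
    'input_tokens', and 'output_tokens') to OpenAI chat.completion format."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model_name,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": raw["text"]},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": raw["input_tokens"],
            "completion_tokens": raw["output_tokens"],
            "total_tokens": raw["input_tokens"] + raw["output_tokens"],
        },
    }

raw = {"text": "Hello!", "input_tokens": 12, "output_tokens": 3}
resp = normalize_response(raw, "my-model-7b")
```

Because every provider emits this shape, client code never needs to know which backend actually served the request.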
### Supported Model Types
OpenClaw supports integration with various model architectures:
Transformer Models: Standard transformer-based models like GPT, BERT, and T5 variants work out of the box. OpenClaw provides optimized inference pipelines for these architectures.
Specialized Models: Vision models, audio models, and multimodal models can be integrated with custom preprocessing pipelines. OpenClaw handles format conversion and batching automatically.
Fine-tuned Models: Models fine-tuned on domain-specific data integrate seamlessly. OpenClaw preserves custom tokenizers and model configurations during deployment.
## Setting Up Your Model Deployment Environment
Before integrating custom models, establish a robust deployment infrastructure that can handle production workloads.
### Infrastructure Requirements
Compute Resources: GPU-accelerated instances are essential for acceptable inference latency. For production deployments, consider:
- NVIDIA A100 or H100 GPUs for large models (7B+ parameters)
- NVIDIA T4 or A10 GPUs for medium models (1-7B parameters)
- CPU-only deployment for small models (<1B parameters) or low-traffic scenarios
Memory Requirements: Model memory usage depends on precision and architecture:
- FP32: ~4 bytes per parameter
- FP16: ~2 bytes per parameter
- INT8: ~1 byte per parameter
A 7B parameter model in FP16 requires approximately 14GB GPU memory, plus overhead for activations and KV cache.
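This arithmetic is easy to sanity-check in code. A rough estimator for weight memory only (activation and KV-cache overhead, as noted above, comes on top):

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def model_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight memory in GB for a given parameter count."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# A 7B-parameter model in FP16
print(model_memory_gb(7e9, "fp16"))  # 14.0
```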
Storage: Fast SSD storage is crucial for model loading and caching. Allocate at least 3x model size for checkpoints, quantized versions, and temporary files.
### Container-Based Deployment
Use Docker for consistent model deployment across environments:

```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Install Python and dependencies
RUN apt-get update && apt-get install -y python3.11 python3-pip
RUN pip3 install torch transformers accelerate

# Copy model files
COPY models/ /app/models/
COPY server.py /app/

# Set environment variables
ENV MODEL_PATH=/app/models/my-model
ENV PORT=8000

WORKDIR /app
CMD ["python3", "server.py"]
```

Build and run:

```bash
docker build -t my-custom-model:latest .
docker run --gpus all -p 8000:8000 my-custom-model:latest
```
### Kubernetes Deployment
For production scale, deploy with Kubernetes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-model-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: custom-model
  template:
    metadata:
      labels:
        app: custom-model
    spec:
      containers:
        - name: model-server
          image: my-custom-model:latest
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "16Gi"
          ports:
            - containerPort: 8000
          env:
            - name: MODEL_PATH
              value: "/models/my-model"
            - name: BATCH_SIZE
              value: "8"
---
apiVersion: v1
kind: Service
metadata:
  name: custom-model-service
spec:
  selector:
    app: custom-model
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: LoadBalancer
```
## Building a Model Server
Create a production-ready model server that OpenClaw can communicate with.
### FastAPI Model Server
Implement a FastAPI server with OpenAI-compatible endpoints:

```python
import time
import uuid
from typing import List

import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

# Load model at startup
model_path = "/app/models/my-model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

def generate_id() -> str:
    return uuid.uuid4().hex

class Message(BaseModel):
    role: str
    content: str

class CompletionRequest(BaseModel):
    model: str
    messages: List[Message]
    temperature: float = 0.7
    max_tokens: int = 1000
    stream: bool = False

class CompletionResponse(BaseModel):
    id: str
    object: str = "chat.completion"
    created: int
    model: str
    choices: List[dict]
    usage: dict

def format_messages(messages: List[Message]) -> str:
    """Convert OpenAI messages to the model-specific format."""
    formatted = []
    for msg in messages:
        if msg.role == "system":
            formatted.append(f"System: {msg.content}")
        elif msg.role == "user":
            formatted.append(f"User: {msg.content}")
        elif msg.role == "assistant":
            formatted.append(f"Assistant: {msg.content}")
    return "\n".join(formatted) + "\nAssistant:"

@app.post("/v1/chat/completions")
async def create_completion(request: CompletionRequest):
    try:
        # Format messages for the model
        prompt = format_messages(request.messages)
        # Tokenize
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        # Generate
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                do_sample=True
            )
        # Decode only the newly generated tokens, not the echoed prompt
        prompt_len = inputs.input_ids.shape[1]
        response_text = tokenizer.decode(
            outputs[0][prompt_len:], skip_special_tokens=True
        )
        # Format as an OpenAI-style response
        return CompletionResponse(
            id=f"chatcmpl-{generate_id()}",
            created=int(time.time()),
            model=request.model,
            choices=[{
                "index": 0,
                "message": {"role": "assistant", "content": response_text},
                "finish_reason": "stop"
            }],
            usage={
                "prompt_tokens": prompt_len,
                "completion_tokens": len(outputs[0]) - prompt_len,
                "total_tokens": len(outputs[0])
            }
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    return {"status": "healthy", "model": model_path}
```
### Streaming Support
Implement streaming for a better user experience. `model.generate()` blocks until generation finishes, so run it in a background thread and read tokens from a `TextIteratorStreamer` as they are produced:

```python
import json
import time
from threading import Thread

from fastapi.responses import StreamingResponse
from transformers import TextIteratorStreamer

@app.post("/v1/chat/completions")
async def create_completion(request: CompletionRequest):
    if request.stream:
        return StreamingResponse(
            stream_completion(request),
            media_type="text/event-stream"
        )
    else:
        # Non-streaming path: the generation logic from the endpoint above,
        # factored into generate_completion()
        return await generate_completion(request)

async def stream_completion(request: CompletionRequest):
    prompt = format_messages(request.messages)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(
        **inputs,
        max_new_tokens=request.max_tokens,
        temperature=request.temperature,
        do_sample=True,
        streamer=streamer
    )
    # Generate in a background thread; the streamer yields decoded tokens
    Thread(target=model.generate, kwargs=generation_kwargs).start()
    completion_id = f"chatcmpl-{generate_id()}"
    created = int(time.time())
    for token in streamer:
        chunk = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": request.model,
            "choices": [{
                "index": 0,
                "delta": {"content": token},
                "finish_reason": None
            }]
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    # Send the final chunk
    yield "data: [DONE]\n\n"
```
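On the client side, the stream arrives as server-sent events. A small parser that reassembles the streamed deltas into the full completion (purely illustrative; a real client would read lines incrementally from the HTTP response rather than from a list):

```python
import json

def collect_stream(sse_lines):
    """Reassemble content deltas from 'data: ...' SSE lines."""
    content = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            content.append(delta["content"])
    return "".join(content)

lines = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_stream(lines))  # Hello
```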
## Configuring OpenClaw for Custom Models
Integrate your model server with OpenClaw through configuration and provider registration.
### Provider Configuration
Create a custom provider configuration in `config/providers/custom.yaml`:

```yaml
providers:
  - name: my-custom-model
    type: openai-compatible
    base_url: http://custom-model-service/v1
    api_key: ${CUSTOM_MODEL_API_KEY}
    models:
      - name: my-model-7b
        display_name: "My Custom 7B Model"
        context_length: 4096
        max_tokens: 2048
        supports_streaming: true
        supports_functions: false
        pricing:
          prompt: 0.0001  # per 1K tokens
          completion: 0.0002
    # Health check configuration
    health_check:
      enabled: true
      endpoint: /health
      interval: 60s
      timeout: 5s
    # Rate limiting
    rate_limit:
      requests_per_minute: 60
      tokens_per_minute: 100000
    # Retry configuration
    retry:
      max_attempts: 3
      backoff: exponential
      initial_delay: 1s
```
### Registering Custom Providers
Register your provider in OpenClaw's initialization:

```typescript
import { OpenClaw } from '@openclaw/core';
import { CustomModelProvider } from './providers/custom';

const openclaw = new OpenClaw({
  providers: [
    new CustomModelProvider({
      name: 'my-custom-model',
      baseURL: 'http://custom-model-service/v1',
      apiKey: process.env.CUSTOM_MODEL_API_KEY,
      models: ['my-model-7b']
    })
  ]
});

await openclaw.initialize();
```
### Model Selection Logic
Implement intelligent model selection based on task requirements:

```typescript
export class ModelSelector {
  selectModel(task: Task): string {
    const { complexity, budget, latency_requirement } = task;

    // Use the custom model for specialized tasks
    if (task.domain === 'medical' || task.domain === 'legal') {
      return 'my-custom-model/my-model-7b';
    }

    // Use Crazyrouter for general tasks
    if (complexity === 'high' && budget > 0.01) {
      return 'gpt-4';
    } else if (latency_requirement === 'low') {
      return 'claude-3-haiku-20240307';
    }

    // Default to the custom model for cost optimization
    return 'my-custom-model/my-model-7b';
  }
}
```
## Optimizing Custom Model Performance
Maximize throughput and minimize latency for production workloads.
### Quantization Strategies
Reduce model size and improve inference speed with quantization:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto"
)

# 4-bit quantization for an even smaller footprint
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
```
### Batch Processing
Improve throughput by batching requests:

```python
import asyncio
from collections import deque

class BatchProcessor:
    def __init__(self, model, max_batch_size=8, max_wait_ms=50):
        # The model is expected to expose an async generate_batch() method
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = deque()
        self.processing = False

    async def process_request(self, request):
        future = asyncio.Future()
        self.queue.append((request, future))
        if not self.processing:
            asyncio.create_task(self.process_batch())
        return await future

    async def process_batch(self):
        self.processing = True
        # Wait briefly so concurrent requests can join the batch
        await asyncio.sleep(self.max_wait_ms / 1000)
        batch = []
        futures = []
        while self.queue and len(batch) < self.max_batch_size:
            request, future = self.queue.popleft()
            batch.append(request)
            futures.append(future)
        if batch:
            results = await self.model.generate_batch(batch)
            for future, result in zip(futures, results):
                future.set_result(result)
        self.processing = False
        if self.queue:
            asyncio.create_task(self.process_batch())
```
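To see the batching window in action, the same pattern can be exercised end-to-end with a stub model. The class below is a condensed, self-contained variant of the processor above (names are illustrative):

```python
import asyncio
from collections import deque

class MiniBatcher:
    """Condensed micro-batcher: each submit schedules a drain; the first
    drain to wake collects everything queued during the wait window."""
    def __init__(self, batch_fn, max_batch_size=8, max_wait_ms=10):
        self.batch_fn = batch_fn
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = deque()

    async def submit(self, item):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((item, future))
        asyncio.create_task(self._drain())
        return await future

    async def _drain(self):
        # Wait briefly so concurrent submissions can join the batch
        await asyncio.sleep(self.max_wait_ms / 1000)
        pending = []
        while self.queue and len(pending) < self.max_batch_size:
            pending.append(self.queue.popleft())
        if pending:
            items, futures = zip(*pending)
            results = await self.batch_fn(list(items))
            for fut, res in zip(futures, results):
                fut.set_result(res)

async def main():
    batch_sizes = []

    async def echo_batch(items):
        batch_sizes.append(len(items))  # record how many requests ran together
        return [s.upper() for s in items]

    batcher = MiniBatcher(echo_batch)
    results = await asyncio.gather(*(batcher.submit(s) for s in ["a", "b", "c"]))
    return results, batch_sizes

results, batch_sizes = asyncio.run(main())
print(results, batch_sizes)  # ['A', 'B', 'C'] [3]
```

All three requests arrive within the 10 ms window, so the stub model is called once with a batch of three instead of three times with a batch of one.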
### KV Cache Optimization
Optimize memory usage with efficient KV cache management. A practical pattern is to cache the computed key/value states for repeated prompt prefixes (such as a shared system prompt) so they are not recomputed on every request:

```python
class OptimizedInference:
    def __init__(self, model, max_cache_entries=1000):
        self.model = model
        # Map a prompt-prefix key to its computed KV cache
        self.prefix_cache = {}
        self.max_cache_entries = max_cache_entries

    def generate_with_cache(self, inputs, cache_key=None):
        # Reuse the KV cache for a previously seen prefix
        past_key_values = self.prefix_cache.get(cache_key)
        outputs = self.model.generate(
            **inputs,
            past_key_values=past_key_values,
            use_cache=True,
            return_dict_in_generate=True
        )
        # Store the cache for future requests with the same prefix
        if cache_key is not None and len(self.prefix_cache) < self.max_cache_entries:
            self.prefix_cache[cache_key] = outputs.past_key_values
        return outputs
```
## Hybrid Model Strategies with Crazyrouter
Combine custom models with commercial APIs for optimal cost and performance.
### Intelligent Routing
Route requests based on complexity and requirements:

```typescript
export class HybridRouter {
  constructor(
    private customModel: ModelProvider,
    private crazyrouter: CrazyrouterClient
  ) {}

  async route(request: CompletionRequest): Promise<CompletionResponse> {
    const complexity = await this.assessComplexity(request);

    // Simple queries → custom model (cost-effective)
    if (complexity < 0.3) {
      return this.customModel.complete(request);
    }

    // Medium complexity → try the custom model first, fall back to Crazyrouter
    if (complexity < 0.7) {
      try {
        return await this.customModel.complete(request);
      } catch (error) {
        return this.crazyrouter.complete({
          ...request,
          model: 'gpt-3.5-turbo'
        });
      }
    }

    // High complexity → use the best commercial model
    return this.crazyrouter.complete({
      ...request,
      model: 'gpt-4'
    });
  }

  private async assessComplexity(request: CompletionRequest): Promise<number> {
    // Analyze the request to estimate complexity
    const messageLength = request.messages.reduce(
      (sum, msg) => sum + msg.content.length,
      0
    );
    const hasCode = request.messages.some(msg =>
      msg.content.includes('```')
    );
    const hasMultiStep = request.messages.some(msg =>
      /step \d+|first.*then|1\.|2\./i.test(msg.content)
    );

    let score = 0;
    score += Math.min(messageLength / 1000, 0.3);
    score += hasCode ? 0.3 : 0;
    score += hasMultiStep ? 0.4 : 0;
    return Math.min(score, 1.0);
  }
}
```
### Cost Optimization
Track and optimize costs across providers:

```typescript
export class CostOptimizer {
  async selectProvider(
    request: CompletionRequest,
    budget: number
  ): Promise<string> {
    const estimates = await Promise.all([
      this.estimateCost('custom', request),
      this.estimateCost('crazyrouter-gpt-3.5', request),
      this.estimateCost('crazyrouter-claude', request)
    ]);

    // Filter by budget
    const affordable = estimates.filter(e => e.cost <= budget);
    if (affordable.length === 0) {
      throw new Error('No providers within budget');
    }

    // Select the best quality within budget
    return affordable.sort((a, b) => b.quality - a.quality)[0].provider;
  }

  private async estimateCost(
    provider: string,
    request: CompletionRequest
  ): Promise<{ provider: string; cost: number; quality: number }> {
    const tokenCount = this.estimateTokens(request);
    const pricing: Record<string, { prompt: number; completion: number; quality: number }> = {
      'custom': { prompt: 0.0001, completion: 0.0002, quality: 0.7 },
      'crazyrouter-gpt-3.5': { prompt: 0.0005, completion: 0.0015, quality: 0.85 },
      'crazyrouter-claude': { prompt: 0.0008, completion: 0.0024, quality: 0.9 }
    };

    const config = pricing[provider];
    const cost = (tokenCount.prompt * config.prompt +
      tokenCount.completion * config.completion) / 1000;
    return { provider, cost, quality: config.quality };
  }

  private estimateTokens(request: CompletionRequest) {
    // Rough heuristic: ~4 characters per token
    const promptTokens = request.messages.reduce(
      (sum, msg) => sum + Math.ceil(msg.content.length / 4),
      0
    );
    const completionTokens = request.max_tokens || 1000;
    return { prompt: promptTokens, completion: completionTokens };
  }
}
```
## Monitoring and Observability
Implement comprehensive monitoring for custom model deployments.
### Metrics Collection
Track key performance indicators:

```python
import time

from prometheus_client import Counter, Histogram, Gauge

# Define metrics
request_count = Counter(
    'model_requests_total',
    'Total number of model requests',
    ['model', 'status']
)
request_duration = Histogram(
    'model_request_duration_seconds',
    'Request duration in seconds',
    ['model']
)
active_requests = Gauge(
    'model_active_requests',
    'Number of active requests',
    ['model']
)
token_count = Counter(
    'model_tokens_total',
    'Total tokens processed',
    ['model', 'type']
)

@app.post("/v1/chat/completions")
async def create_completion(request: CompletionRequest):
    model_name = request.model
    active_requests.labels(model=model_name).inc()
    start_time = time.time()
    try:
        response = await generate_completion(request)
        # Record metrics
        request_count.labels(model=model_name, status='success').inc()
        token_count.labels(model=model_name, type='prompt').inc(
            response.usage['prompt_tokens']
        )
        token_count.labels(model=model_name, type='completion').inc(
            response.usage['completion_tokens']
        )
        return response
    except Exception:
        request_count.labels(model=model_name, status='error').inc()
        raise
    finally:
        duration = time.time() - start_time
        request_duration.labels(model=model_name).observe(duration)
        active_requests.labels(model=model_name).dec()
```
### Logging and Tracing
Implement structured logging:

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

class StructuredLogger:
    @staticmethod
    def log_request(request_id: str, request: CompletionRequest):
        logger.info(json.dumps({
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'event': 'request_received',
            'request_id': request_id,
            'model': request.model,
            'message_count': len(request.messages),
            'max_tokens': request.max_tokens,
            'temperature': request.temperature
        }))

    @staticmethod
    def log_response(request_id: str, response: CompletionResponse, duration: float):
        logger.info(json.dumps({
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'event': 'response_sent',
            'request_id': request_id,
            'duration_ms': duration * 1000,
            'prompt_tokens': response.usage['prompt_tokens'],
            'completion_tokens': response.usage['completion_tokens'],
            'total_tokens': response.usage['total_tokens']
        }))
```
## Security Considerations
Protect your custom model deployments from unauthorized access and abuse.
### Authentication and Authorization
Implement API key authentication:

```python
import hashlib

from fastapi import Depends, HTTPException, Security
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

# Store only hashes of the accepted keys, never the keys themselves
API_KEYS = {
    hashlib.sha256(b"key1").hexdigest(): {"name": "openclaw", "rate_limit": 1000},
    hashlib.sha256(b"key2").hexdigest(): {"name": "internal", "rate_limit": 10000}
}

async def verify_api_key(
    credentials: HTTPAuthorizationCredentials = Security(security)
) -> dict:
    key_hash = hashlib.sha256(credentials.credentials.encode()).hexdigest()
    if key_hash not in API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return API_KEYS[key_hash]

@app.post("/v1/chat/completions")
async def create_completion(
    request: CompletionRequest,
    api_key_info: dict = Depends(verify_api_key)
):
    # Process the request with rate limiting based on api_key_info
    ...
```
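The `rate_limit` field returned by `verify_api_key` can drive a simple in-memory sliding-window limiter. A minimal sketch (per-process only; a production deployment would typically back this with Redis or a similar shared store, and the 60-second window here is an assumption):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per key in any 60-second window."""
    def __init__(self):
        self.hits = defaultdict(deque)

    def allow(self, key: str, limit: int, now=None) -> bool:
        now = time.time() if now is None else now
        window = self.hits[key]
        # Drop timestamps that have left the 60s window
        while window and window[0] <= now - 60:
            window.popleft()
        if len(window) >= limit:
            return False
        window.append(now)
        return True

limiter = SlidingWindowLimiter()
print([limiter.allow("openclaw", 2, now=t) for t in (0.0, 1.0, 2.0, 61.0)])
# [True, True, False, True]
```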
### Input Validation and Sanitization
Prevent injection attacks:

```python
import re
from typing import List

from pydantic import BaseModel, validator

class CompletionRequest(BaseModel):
    model: str
    messages: List[Message]
    temperature: float = 0.7
    max_tokens: int = 1000

    @validator('messages')
    def validate_messages(cls, messages):
        if len(messages) > 100:
            raise ValueError("Too many messages")
        for msg in messages:
            if len(msg.content) > 50000:
                raise ValueError("Message too long")
            # Remove potentially harmful content
            msg.content = sanitize_content(msg.content)
        return messages

    @validator('max_tokens')
    def validate_max_tokens(cls, v):
        if v > 4096:
            raise ValueError("max_tokens too large")
        return v

def sanitize_content(content: str) -> str:
    # Remove control characters
    content = re.sub(r'[\x00-\x1f\x7f-\x9f]', '', content)
    # Limit consecutive whitespace
    content = re.sub(r'\s{10,}', ' ' * 10, content)
    return content
```
## Troubleshooting Common Issues
Resolve typical problems encountered with custom model integration.
### Out of Memory Errors
Problem: GPU runs out of memory during inference.
Solutions:
- Reduce batch size
- Enable gradient checkpointing (relevant when fine-tuning; it trades compute for memory)
- Use quantization (8-bit or 4-bit)
- Implement model sharding across multiple GPUs

```python
# Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Shard the model across available devices
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)  # config loaded beforehand

model = load_checkpoint_and_dispatch(
    model,
    checkpoint_path,  # path to the downloaded checkpoint
    device_map="auto",
    no_split_module_classes=["LlamaDecoderLayer"]
)
```
### High Latency
Problem: Inference takes too long.
Solutions:
- Enable the KV cache
- Use Flash Attention
- Implement request batching
- Optimize tokenizer usage

```python
# Use Flash Attention 2 (requires the flash-attn package and a supported GPU)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"
)
```
### Model Quality Issues
Problem: Custom model produces poor results
Solutions:
- Verify prompt formatting matches training format
- Adjust temperature and sampling parameters
- Check for tokenizer mismatches
- Validate model checkpoint integrity
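Prompt-format mismatches are the most common quality killer, and they are cheap to guard against with a regression check: build the prompt with the same helper the server uses (`format_messages` from earlier, reproduced here in plain-dict form) and compare it against the exact template the model was trained on (the expected string below is illustrative):

```python
def format_messages(messages):
    # Same formatting as the model server's helper, using plain dicts
    roles = {"system": "System", "user": "User", "assistant": "Assistant"}
    lines = [f"{roles[m['role']]}: {m['content']}" for m in messages]
    return "\n".join(lines) + "\nAssistant:"

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
]
expected = "System: You are helpful.\nUser: Hi\nAssistant:"
assert format_messages(messages) == expected
print("prompt template matches training format")
```

Run this as part of CI whenever either the serving code or the fine-tuning template changes.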
## Conclusion
Integrating custom models with OpenClaw unlocks powerful capabilities for specialized domains and cost optimization. By following the patterns in this guide, you can deploy production-ready model servers, implement intelligent routing strategies, and maintain high performance at scale.
Key takeaways:
- Deploy models with Docker/Kubernetes for scalability
- Implement OpenAI-compatible APIs for seamless integration
- Use quantization and batching for performance optimization
- Combine custom models with Crazyrouter for hybrid strategies
- Monitor metrics and implement comprehensive logging
- Secure deployments with authentication and input validation
Start integrating your custom models today and extend OpenClaw's capabilities for your specific use cases!
Ready to master advanced OpenClaw techniques? Continue with OpenClaw Advanced Techniques for expert-level optimization and deployment strategies.
Need AI model access for your custom deployments? Visit Crazyrouter for unified API access to 300+ models.


