OpenClaw Custom Model Integration: Deploy Your Own AI Models

Crazyrouter Team
March 7, 2026

OpenClaw Custom Model Integration: Deploy Your Own AI Models#

While OpenClaw works seamlessly with commercial AI models through Crazyrouter, many organizations need to integrate proprietary or fine-tuned models. This comprehensive guide covers everything from model deployment to production optimization, enabling you to extend OpenClaw with custom AI capabilities.

Understanding OpenClaw Model Architecture#

OpenClaw's model integration system provides a flexible framework for connecting custom AI models while maintaining compatibility with the standard OpenAI API format.

Model Provider Interface#

OpenClaw uses a provider-based architecture where each model source implements a standardized interface. This design allows seamless switching between commercial APIs, self-hosted models, and hybrid configurations.

Provider Components: Every model provider implements four core methods: initialize() for setup, complete() for text generation, embed() for embeddings, and validate() for health checks. This consistent interface ensures OpenClaw can treat all models uniformly.

Request Transformation: OpenClaw automatically transforms requests between different model formats. When you send an OpenAI-formatted request to a custom model, OpenClaw handles the conversion to your model's expected format.

Response Normalization: Similarly, responses from custom models are normalized to OpenAI format, ensuring consistent behavior across all model types. This abstraction layer simplifies client code and enables easy model switching.
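
To make that contract concrete, here is a minimal sketch of what such a provider interface and response normalizer might look like. The method names mirror the four listed above, but the actual OpenClaw SDK surface may differ — treat this as illustrative:

```python
from abc import ABC, abstractmethod
from typing import List

class ModelProvider(ABC):
    """Illustrative provider contract: every model source implements the
    same four methods so OpenClaw can treat all models uniformly."""

    @abstractmethod
    def initialize(self) -> None: ...          # connect, load config, warm up

    @abstractmethod
    def complete(self, request: dict) -> dict: ...   # text generation

    @abstractmethod
    def embed(self, texts: List[str]) -> List[List[float]]: ...

    @abstractmethod
    def validate(self) -> bool: ...            # health check

def normalize_response(raw: dict, model: str) -> dict:
    """Normalize a custom model's raw output (here assumed to carry a
    'text' field) into the OpenAI chat-completion shape."""
    return {
        "object": "chat.completion",
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": raw["text"]},
            "finish_reason": raw.get("stop_reason", "stop"),
        }],
    }
```

Because every provider funnels through the same normalizer, client code never needs to know which backend produced a response.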

Supported Model Types#

OpenClaw supports integration with various model architectures:

Transformer Models: Standard transformer-based models like GPT, BERT, and T5 variants work out of the box. OpenClaw provides optimized inference pipelines for these architectures.

Specialized Models: Vision models, audio models, and multimodal models can be integrated with custom preprocessing pipelines. OpenClaw handles format conversion and batching automatically.

Fine-tuned Models: Models fine-tuned on domain-specific data integrate seamlessly. OpenClaw preserves custom tokenizers and model configurations during deployment.

Setting Up Your Model Deployment Environment#

Before integrating custom models, establish a robust deployment infrastructure that can handle production workloads.

Infrastructure Requirements#

Compute Resources: GPU-accelerated instances are essential for acceptable inference latency. For production deployments, consider:

  • NVIDIA A100 or H100 GPUs for large models (7B+ parameters)
  • NVIDIA T4 or A10 GPUs for medium models (1-7B parameters)
  • CPU-only deployment for small models (<1B parameters) or low-traffic scenarios

Memory Requirements: Model memory usage depends on precision and architecture:

  • FP32: ~4 bytes per parameter
  • FP16: ~2 bytes per parameter
  • INT8: ~1 byte per parameter

A 7B parameter model in FP16 requires approximately 14GB GPU memory, plus overhead for activations and KV cache.
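
That arithmetic is easy to wrap in a quick estimator (weights only — activations, the KV cache, and framework overhead add more on top):

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def estimate_weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Estimate GPU memory needed for model weights alone, in GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# A 7B model in FP16 needs roughly 14 GB just for the weights
print(estimate_weight_memory_gb(7e9, "fp16"))  # → 14.0
```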

Storage: Fast SSD storage is crucial for model loading and caching. Allocate at least 3x model size for checkpoints, quantized versions, and temporary files.

Container-Based Deployment#

Use Docker for consistent model deployment across environments:

dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Install Python and dependencies
RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip3 install torch transformers accelerate

# Copy model files
COPY models/ /app/models/
COPY server.py /app/

# Set environment variables
ENV MODEL_PATH=/app/models/my-model
ENV PORT=8000

WORKDIR /app
CMD ["python3", "server.py"]

Build and run:

bash
docker build -t my-custom-model:latest .
docker run --gpus all -p 8000:8000 my-custom-model:latest

Kubernetes Deployment#

For production scale, deploy with Kubernetes:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-model-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: custom-model
  template:
    metadata:
      labels:
        app: custom-model
    spec:
      containers:
      - name: model-server
        image: my-custom-model:latest
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "16Gi"
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_PATH
          value: "/models/my-model"
        - name: BATCH_SIZE
          value: "8"
---
apiVersion: v1
kind: Service
metadata:
  name: custom-model-service
spec:
  selector:
    app: custom-model
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer

Building a Model Server#

Create a production-ready model server that OpenClaw can communicate with.

FastAPI Model Server#

Implement a FastAPI server with OpenAI-compatible endpoints:

python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
import time
import uuid
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

def generate_id() -> str:
    """Short unique identifier for completion responses."""
    return uuid.uuid4().hex[:24]

# Load model at startup
model_path = "/app/models/my-model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

class Message(BaseModel):
    role: str
    content: str

class CompletionRequest(BaseModel):
    model: str
    messages: List[Message]
    temperature: float = 0.7
    max_tokens: int = 1000
    stream: bool = False

class CompletionResponse(BaseModel):
    id: str
    object: str = "chat.completion"
    created: int
    model: str
    choices: List[dict]
    usage: dict

@app.post("/v1/chat/completions")
async def create_completion(request: CompletionRequest):
    try:
        # Format messages for model
        prompt = format_messages(request.messages)

        # Tokenize
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        # Generate
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                do_sample=True
            )

        # Decode only the newly generated tokens (exclude the prompt)
        response_text = tokenizer.decode(
            outputs[0][inputs.input_ids.shape[1]:],
            skip_special_tokens=True
        )

        # Format as OpenAI response
        return CompletionResponse(
            id=f"chatcmpl-{generate_id()}",
            created=int(time.time()),
            model=request.model,
            choices=[{
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": response_text
                },
                "finish_reason": "stop"
            }],
            usage={
                "prompt_tokens": len(inputs.input_ids[0]),
                "completion_tokens": len(outputs[0]) - len(inputs.input_ids[0]),
                "total_tokens": len(outputs[0])
            }
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

def format_messages(messages: List[Message]) -> str:
    """Convert OpenAI messages to model-specific format"""
    formatted = []
    for msg in messages:
        if msg.role == "system":
            formatted.append(f"System: {msg.content}")
        elif msg.role == "user":
            formatted.append(f"User: {msg.content}")
        elif msg.role == "assistant":
            formatted.append(f"Assistant: {msg.content}")
    return "\n".join(formatted) + "\nAssistant:"

@app.get("/health")
async def health_check():
    return {"status": "healthy", "model": model_path}

Streaming Support#

Implement streaming for better user experience:

python
from fastapi.responses import StreamingResponse
from transformers import TextIteratorStreamer
from threading import Thread
import json

@app.post("/v1/chat/completions")
async def create_completion(request: CompletionRequest):
    if request.stream:
        return StreamingResponse(
            stream_completion(request),
            media_type="text/event-stream"
        )
    return await generate_completion(request)

async def stream_completion(request: CompletionRequest):
    prompt = format_messages(request.messages)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # TextIteratorStreamer yields decoded text incrementally while
    # generation runs in a background thread
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(
        **inputs,
        max_new_tokens=request.max_tokens,
        temperature=request.temperature,
        do_sample=True,
        streamer=streamer
    )
    Thread(target=model.generate, kwargs=generation_kwargs).start()

    completion_id = f"chatcmpl-{generate_id()}"
    for token in streamer:
        chunk = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": request.model,
            "choices": [{
                "index": 0,
                "delta": {"content": token},
                "finish_reason": None
            }]
        }
        yield f"data: {json.dumps(chunk)}\n\n"

    # Signal end of stream
    yield "data: [DONE]\n\n"

Configuring OpenClaw for Custom Models#

Integrate your model server with OpenClaw through configuration and provider registration.

Provider Configuration#

Create a custom provider configuration in config/providers/custom.yaml:

yaml
providers:
  - name: my-custom-model
    type: openai-compatible
    base_url: http://custom-model-service/v1
    api_key: ${CUSTOM_MODEL_API_KEY}
    models:
      - name: my-model-7b
        display_name: "My Custom 7B Model"
        context_length: 4096
        max_tokens: 2048
        supports_streaming: true
        supports_functions: false
        pricing:
          prompt: 0.0001  # per 1K tokens
          completion: 0.0002

    # Health check configuration
    health_check:
      enabled: true
      endpoint: /health
      interval: 60s
      timeout: 5s

    # Rate limiting
    rate_limit:
      requests_per_minute: 60
      tokens_per_minute: 100000

    # Retry configuration
    retry:
      max_attempts: 3
      backoff: exponential
      initial_delay: 1s
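
The retry block above implies exponentially growing waits between attempts. As a quick illustration of the schedule (the parameter names mirror the config fields, not an actual OpenClaw API):

```python
def backoff_delays(max_attempts: int, initial_delay: float = 1.0,
                   factor: float = 2.0) -> list:
    """Delay (in seconds) before each successive retry under exponential backoff."""
    return [initial_delay * factor ** i for i in range(max_attempts)]

print(backoff_delays(3))  # → [1.0, 2.0, 4.0]
```

In practice you would also add jitter so that many clients retrying at once do not synchronize their requests.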

Registering Custom Providers#

Register your provider in OpenClaw's initialization:

typescript
import { OpenClaw } from '@openclaw/core';
import { CustomModelProvider } from './providers/custom';

const openclaw = new OpenClaw({
  providers: [
    new CustomModelProvider({
      name: 'my-custom-model',
      baseURL: 'http://custom-model-service/v1',
      apiKey: process.env.CUSTOM_MODEL_API_KEY,
      models: ['my-model-7b']
    })
  ]
});

await openclaw.initialize();

Model Selection Logic#

Implement intelligent model selection based on task requirements:

typescript
export class ModelSelector {
  selectModel(task: Task): string {
    const { complexity, budget, latency_requirement } = task;

    // Use custom model for specialized tasks
    if (task.domain === 'medical' || task.domain === 'legal') {
      return 'my-custom-model/my-model-7b';
    }

    // Use Crazyrouter for general tasks
    if (complexity === 'high' && budget > 0.01) {
      return 'gpt-4';
    } else if (latency_requirement === 'low') {
      return 'claude-3-haiku-20240307';
    }

    // Default to custom model for cost optimization
    return 'my-custom-model/my-model-7b';
  }
}

Optimizing Custom Model Performance#

Maximize throughput and minimize latency for production workloads.

Quantization Strategies#

Reduce model size and improve inference speed with quantization:

python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto"
)

# 4-bit quantization for even smaller footprint
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

Batch Processing#

Improve throughput by batching requests:

python
from collections import deque
import asyncio

class BatchProcessor:
    def __init__(self, model, max_batch_size=8, max_wait_ms=50):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = deque()
        self.processing = False

    async def process_request(self, request):
        future = asyncio.Future()
        self.queue.append((request, future))

        if not self.processing:
            asyncio.create_task(self.process_batch())

        return await future

    async def process_batch(self):
        self.processing = True
        await asyncio.sleep(self.max_wait_ms / 1000)

        batch = []
        futures = []

        while self.queue and len(batch) < self.max_batch_size:
            request, future = self.queue.popleft()
            batch.append(request)
            futures.append(future)

        if batch:
            results = await self.model.generate_batch(batch)
            for future, result in zip(futures, results):
                future.set_result(result)

        self.processing = False

        if self.queue:
            asyncio.create_task(self.process_batch())

KV Cache Optimization#

Optimize memory usage with efficient KV cache management:

python
class OptimizedInference:
    def __init__(self, model, max_cache_entries=1000):
        self.model = model
        # Map a prompt-prefix key to its previously computed KV cache
        self.prefix_cache = {}
        self.max_cache_entries = max_cache_entries

    def generate_with_cache(self, inputs, prefix_key=None):
        # Reuse the KV cache when the same prefix was seen before
        past_key_values = self.prefix_cache.get(prefix_key)

        outputs = self.model.generate(
            **inputs,
            past_key_values=past_key_values,
            use_cache=True,
            return_dict_in_generate=True
        )

        # Store the cache for future requests sharing this prefix
        if prefix_key is not None and len(self.prefix_cache) < self.max_cache_entries:
            self.prefix_cache[prefix_key] = outputs.past_key_values

        return outputs

Hybrid Model Strategies with Crazyrouter#

Combine custom models with commercial APIs for optimal cost and performance.

Intelligent Routing#

Route requests based on complexity and requirements:

typescript
export class HybridRouter {
  constructor(
    private customModel: ModelProvider,
    private crazyrouter: CrazyrouterClient
  ) {}

  async route(request: CompletionRequest): Promise<CompletionResponse> {
    const complexity = await this.assessComplexity(request);

    // Simple queries → custom model (cost-effective)
    if (complexity < 0.3) {
      return this.customModel.complete(request);
    }

    // Medium complexity → try custom first, fallback to Crazyrouter
    if (complexity < 0.7) {
      try {
        return await this.customModel.complete(request);
      } catch (error) {
        return this.crazyrouter.complete({
          ...request,
          model: 'gpt-3.5-turbo'
        });
      }
    }

    // High complexity → use best commercial model
    return this.crazyrouter.complete({
      ...request,
      model: 'gpt-4'
    });
  }

  private async assessComplexity(request: CompletionRequest): Promise<number> {
    // Analyze request to estimate complexity
    const messageLength = request.messages.reduce(
      (sum, msg) => sum + msg.content.length,
      0
    );

    const hasCode = request.messages.some(msg =>
      msg.content.includes('```')
    );

    const hasMultiStep = request.messages.some(msg =>
      /step \d+|first.*then|1\.|2\./i.test(msg.content)
    );

    let score = 0;
    score += Math.min(messageLength / 1000, 0.3);
    score += hasCode ? 0.3 : 0;
    score += hasMultiStep ? 0.4 : 0;

    return Math.min(score, 1.0);
  }
}

Cost Optimization#

Track and optimize costs across providers:

typescript
export class CostOptimizer {
  private costs: Map<string, number> = new Map();

  async selectProvider(
    request: CompletionRequest,
    budget: number
  ): Promise<string> {
    const estimates = await Promise.all([
      this.estimateCost('custom', request),
      this.estimateCost('crazyrouter-gpt-3.5', request),
      this.estimateCost('crazyrouter-claude', request)
    ]);

    // Filter by budget
    const affordable = estimates.filter(e => e.cost <= budget);

    if (affordable.length === 0) {
      throw new Error('No providers within budget');
    }

    // Select best quality within budget
    return affordable.sort((a, b) => b.quality - a.quality)[0].provider;
  }

  private async estimateCost(
    provider: string,
    request: CompletionRequest
  ): Promise<{ provider: string; cost: number; quality: number }> {
    const tokenCount = this.estimateTokens(request);

    const pricing: Record<string, { prompt: number; completion: number; quality: number }> = {
      'custom': { prompt: 0.0001, completion: 0.0002, quality: 0.7 },
      'crazyrouter-gpt-3.5': { prompt: 0.0005, completion: 0.0015, quality: 0.85 },
      'crazyrouter-claude': { prompt: 0.0008, completion: 0.0024, quality: 0.9 }
    };

    const config = pricing[provider];
    const cost = (tokenCount.prompt * config.prompt +
                  tokenCount.completion * config.completion) / 1000;

    return { provider, cost, quality: config.quality };
  }

  private estimateTokens(request: CompletionRequest) {
    const promptTokens = request.messages.reduce(
      (sum, msg) => sum + Math.ceil(msg.content.length / 4),
      0
    );
    const completionTokens = request.max_tokens || 1000;

    return { prompt: promptTokens, completion: completionTokens };
  }
}

Monitoring and Observability#

Implement comprehensive monitoring for custom model deployments.

Metrics Collection#

Track key performance indicators:

python
from prometheus_client import Counter, Histogram, Gauge
import time

# Define metrics
request_count = Counter(
    'model_requests_total',
    'Total number of model requests',
    ['model', 'status']
)

request_duration = Histogram(
    'model_request_duration_seconds',
    'Request duration in seconds',
    ['model']
)

active_requests = Gauge(
    'model_active_requests',
    'Number of active requests',
    ['model']
)

token_count = Counter(
    'model_tokens_total',
    'Total tokens processed',
    ['model', 'type']
)

@app.post("/v1/chat/completions")
async def create_completion(request: CompletionRequest):
    model_name = request.model
    active_requests.labels(model=model_name).inc()
    start_time = time.time()

    try:
        response = await generate_completion(request)

        # Record metrics
        request_count.labels(model=model_name, status='success').inc()
        token_count.labels(model=model_name, type='prompt').inc(
            response.usage['prompt_tokens']
        )
        token_count.labels(model=model_name, type='completion').inc(
            response.usage['completion_tokens']
        )

        return response
    except Exception as e:
        request_count.labels(model=model_name, status='error').inc()
        raise
    finally:
        duration = time.time() - start_time
        request_duration.labels(model=model_name).observe(duration)
        active_requests.labels(model=model_name).dec()

Logging and Tracing#

Implement structured logging:

python
import logging
import json
from datetime import datetime

logger = logging.getLogger(__name__)

class StructuredLogger:
    @staticmethod
    def log_request(request_id: str, request: CompletionRequest):
        logger.info(json.dumps({
            'timestamp': datetime.utcnow().isoformat(),
            'event': 'request_received',
            'request_id': request_id,
            'model': request.model,
            'message_count': len(request.messages),
            'max_tokens': request.max_tokens,
            'temperature': request.temperature
        }))

    @staticmethod
    def log_response(request_id: str, response: CompletionResponse, duration: float):
        logger.info(json.dumps({
            'timestamp': datetime.utcnow().isoformat(),
            'event': 'response_sent',
            'request_id': request_id,
            'duration_ms': duration * 1000,
            'prompt_tokens': response.usage['prompt_tokens'],
            'completion_tokens': response.usage['completion_tokens'],
            'total_tokens': response.usage['total_tokens']
        }))

Security Considerations#

Protect your custom model deployments from unauthorized access and abuse.

Authentication and Authorization#

Implement API key authentication:

python
from fastapi import Security, HTTPException, Depends
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import hashlib

security = HTTPBearer()

API_KEYS = {
    hashlib.sha256(b"key1").hexdigest(): {"name": "openclaw", "rate_limit": 1000},
    hashlib.sha256(b"key2").hexdigest(): {"name": "internal", "rate_limit": 10000}
}

async def verify_api_key(
    credentials: HTTPAuthorizationCredentials = Security(security)
) -> dict:
    key_hash = hashlib.sha256(credentials.credentials.encode()).hexdigest()

    if key_hash not in API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")

    return API_KEYS[key_hash]

@app.post("/v1/chat/completions")
async def create_completion(
    request: CompletionRequest,
    api_key_info: dict = Depends(verify_api_key)
):
    # Process request with rate limiting based on api_key_info
    pass
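
The `pass` placeholder above elides the rate-limiting step. One simple in-process approach is a sliding-window counter per key — a sketch only; production services usually back this with Redis or an API gateway:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per key in any 60-second window."""

    def __init__(self):
        # key -> timestamps of requests within the current window
        self.requests = defaultdict(deque)

    def allow(self, key: str, limit: int, now=None) -> bool:
        now = time.time() if now is None else now
        window = self.requests[key]
        # Evict timestamps older than the 60-second window
        while window and window[0] <= now - 60:
            window.popleft()
        if len(window) >= limit:
            return False
        window.append(now)
        return True
```

Inside the endpoint you would call `limiter.allow(api_key_info["name"], api_key_info["rate_limit"])` and raise HTTP 429 when it denies the request.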

Input Validation and Sanitization#

Prevent injection attacks:

python
from pydantic import BaseModel, validator
from typing import List
import re

class CompletionRequest(BaseModel):
    model: str
    messages: List[Message]
    temperature: float = 0.7
    max_tokens: int = 1000

    @validator('messages')
    def validate_messages(cls, messages):
        if len(messages) > 100:
            raise ValueError("Too many messages")

        for msg in messages:
            if len(msg.content) > 50000:
                raise ValueError("Message too long")

            # Remove potentially harmful content
            msg.content = sanitize_content(msg.content)

        return messages

    @validator('max_tokens')
    def validate_max_tokens(cls, v):
        if v > 4096:
            raise ValueError("max_tokens too large")
        return v

def sanitize_content(content: str) -> str:
    # Remove control characters
    content = re.sub(r'[\x00-\x1f\x7f-\x9f]', '', content)

    # Limit consecutive whitespace
    content = re.sub(r'\s{10,}', ' ' * 10, content)

    return content

Troubleshooting Common Issues#

Resolve typical problems encountered with custom model integration.

Out of Memory Errors#

Problem: GPU runs out of memory during inference

Solutions:

  • Reduce batch size
  • Enable gradient checkpointing (trades compute for activation memory; mainly helps during fine-tuning)
  • Use quantization (8-bit or 4-bit)
  • Implement model sharding across multiple GPUs

python
# Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Use model sharding
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint_path,
    device_map="auto",
    no_split_module_classes=["LlamaDecoderLayer"]
)

High Latency#

Problem: Inference takes too long

Solutions:

  • Enable KV cache
  • Use Flash Attention
  • Implement request batching
  • Optimize tokenizer

python
# Use Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"
)

Model Quality Issues#

Problem: Custom model produces poor results

Solutions:

  • Verify prompt formatting matches training format
  • Adjust temperature and sampling parameters
  • Check for tokenizer mismatches
  • Validate model checkpoint integrity
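
On the sampling-parameter point, it helps to see what temperature actually does: dividing the logits by a temperature below 1 sharpens the distribution toward the top token, while values above 1 flatten it. A plain-Python illustration, independent of any model:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.7)
flat = softmax_with_temperature(logits, temperature=1.5)
# The top token's probability rises as temperature falls
print(round(sharp[0], 3), round(flat[0], 3))
```

If a fine-tuned model rambles, try lowering temperature first; if it repeats itself, raise it or add repetition penalties before assuming the checkpoint is broken.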

Conclusion#

Integrating custom models with OpenClaw unlocks powerful capabilities for specialized domains and cost optimization. By following the patterns in this guide, you can deploy production-ready model servers, implement intelligent routing strategies, and maintain high performance at scale.

Key takeaways:

  • Deploy models with Docker/Kubernetes for scalability
  • Implement OpenAI-compatible APIs for seamless integration
  • Use quantization and batching for performance optimization
  • Combine custom models with Crazyrouter for hybrid strategies
  • Monitor metrics and implement comprehensive logging
  • Secure deployments with authentication and input validation

Start integrating your custom models today and extend OpenClaw's capabilities for your specific use cases!


Ready to master advanced OpenClaw techniques? Continue with OpenClaw Advanced Techniques for expert-level optimization and deployment strategies.

Need AI model access for your custom deployments? Visit Crazyrouter for unified API access to 300+ models.
