
"Model Distillation Explained: How Small AI Models Learn from Giants"
Model Distillation Explained: How Small AI Models Learn from Giants#
In early 2025, Anthropic made a bombshell accusation: DeepSeek had used 16 million conversations with Claude to train its own models. The technique? Knowledge distillation — one of the most powerful (and controversial) methods in modern AI.
DeepSeek denied the specifics but openly published its R1-Distill series — a family of smaller models trained by distilling the reasoning capabilities of their massive 671-billion-parameter R1 model. The result? Models 10-400x smaller that retain 85-95% of the original's reasoning ability.
Fast-forward to March 2026, and distillation has become the default strategy for the entire industry. GPT-5-mini, GPT-5-nano, Gemini 2.5 Flash, Claude 4 Haiku, Llama 4 Scout/Maverick, Qwen 3, and dozens more — all products of distillation. It's the technique that makes cheap, fast AI possible.
Let's break down exactly how it works.
What Is Model Distillation?#
Model distillation (formally knowledge distillation) is a training technique where a large, powerful teacher model transfers its knowledge to a smaller, faster student model.
Think of it like a master chef teaching an apprentice. The apprentice doesn't need to reinvent every recipe from scratch — they learn by watching the master work, absorbing not just what to cook but how to think about cooking.
Why not just train a small model directly?#
Three reasons:
1. Small models can't learn as well from raw data. A 7B-parameter model trained on the same data as a 670B model will always perform worse — it simply doesn't have the capacity to extract the same patterns.
2. Teacher models produce better training signals. Instead of learning from "the answer is Paris," the student learns from the teacher's full probability distribution: "Paris (92%), Lyon (5%), Marseille (2%)..." — this softer signal carries much more information.
3. It's dramatically cheaper. Training a 670B model from scratch costs tens of millions of dollars. Distilling a 7B student from it costs thousands.
Distillation vs Fine-tuning vs Pre-training#
| Aspect | Pre-training | Fine-tuning | Distillation |
|---|---|---|---|
| Starting point | Random weights | Pre-trained model | Pre-trained student model |
| Training data | Trillions of tokens (raw text) | Task-specific examples (10K-1M) | Teacher model outputs |
| Goal | Learn language | Learn a specific task | Compress teacher's knowledge |
| Cost | $100M+ | $10K | $100K |
| Result | Foundation model | Specialized model | Smaller, faster model |
Core Distillation Techniques#
1. Soft Label Distillation (Hinton, 2015)#
The foundational technique, proposed by Geoffrey Hinton et al. Instead of training the student on hard labels ("this is a cat"), you train it on the teacher's soft probability distribution.
How it works:
The teacher model outputs probabilities for all possible tokens using a temperature-scaled softmax:
Standard output (T=1): Paris: 0.95, Lyon: 0.03, Berlin: 0.01, ...
Softened output (T=5): Paris: 0.45, Lyon: 0.18, Berlin: 0.12, ...
Higher temperature "softens" the distribution, revealing the teacher's uncertainty and the relationships between options. The student learns not just the right answer, but how confident the teacher is and what alternatives it considered.
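To see the effect concretely, here is a minimal sketch of temperature-scaled softmax in plain Python. The logits are invented for illustration:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Divide logits by T before softmax; higher T flattens the distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for Paris, Lyon, Berlin
logits = [6.0, 2.5, 1.5]

sharp = softmax_with_temperature(logits, T=1)  # nearly all mass on Paris
soft = softmax_with_temperature(logits, T=5)   # runner-up cities become visible
```

At T=1 the distribution is close to one-hot; at T=5 Lyon and Berlin receive noticeable probability, which is exactly the extra signal the student trains on.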
Loss function:
L = α × CrossEntropy(student_output, hard_labels)
  + β × KL_Divergence(student_soft_output, teacher_soft_output)
The student optimizes for both correctness (hard labels) and mimicking the teacher's thinking (soft labels). In Hinton's original formulation the soft term is additionally scaled by T², so its gradients stay comparable in magnitude as the temperature changes.
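A runnable sketch of that loss in plain Python, with made-up probability distributions. The T² factor on the soft term follows Hinton et al.'s formulation, which keeps its gradient magnitude comparable across temperatures:

```python
import math

def cross_entropy(student_probs, target_index):
    """Hard-label term: negative log-probability of the correct answer."""
    return -math.log(student_probs[target_index])

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_probs, student_soft, teacher_soft,
                      target_index, alpha=0.5, beta=0.5, T=5.0):
    hard = cross_entropy(student_probs, target_index)
    # T**2 rescales the soft term so both losses stay on a comparable scale
    soft = (T ** 2) * kl_divergence(teacher_soft, student_soft)
    return alpha * hard + beta * soft

# Invented distributions over [Paris, Lyon, Berlin]; Paris (index 0) is correct
student = [0.80, 0.15, 0.05]     # student output at T=1
student_T = [0.50, 0.30, 0.20]   # student output softened at T=5
teacher_T = [0.45, 0.35, 0.20]   # teacher output softened at T=5
loss = distillation_loss(student, student_T, teacher_T, target_index=0)
```

When the student's softened distribution exactly matches the teacher's, the KL term vanishes and only the hard-label term remains.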
2. Hard Label Distillation (Response-based)#
The simplest and most widely used method in the LLM era. You simply:
- Send prompts to the teacher model
- Collect its text outputs
- Fine-tune the student model on those outputs
This is how Alpaca, Vicuna, and many open-source models were created — by fine-tuning LLaMA on ChatGPT/GPT-4 outputs.
Advantages: No access to teacher's internals needed — works with any API. Disadvantages: Loses the nuance of soft probability distributions.
3. Chain-of-Thought (CoT) Distillation#
The breakthrough technique behind DeepSeek-R1-Distill and similar reasoning models. Instead of just collecting answers, you collect the teacher's step-by-step reasoning process.
Example:
Prompt: "What is 17 × 23?"
Teacher output (with CoT):
"Let me break this down:
17 × 23 = 17 × 20 + 17 × 3
17 × 20 = 340
17 × 3 = 51
340 + 51 = 391
The answer is 391."
The student learns not just that 17 × 23 = 391, but how to reason through the problem. This is why DeepSeek-R1-Distill models show genuine reasoning ability — they've internalized the reasoning patterns, not just the answers.
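A CoT training record therefore keeps the full reasoning trace, not just the final answer. Here is a sketch of what one such example might look like as a chat-format JSONL record (the field layout is an assumption, matching the common OpenAI-style fine-tuning format):

```python
import json

# One CoT distillation example: the assistant turn contains the teacher's
# entire reasoning chain, not only the final answer "391".
record = {
    "messages": [
        {"role": "user", "content": "What is 17 × 23?"},
        {"role": "assistant", "content": (
            "Let me break this down:\n"
            "17 × 23 = 17 × 20 + 17 × 3\n"
            "17 × 20 = 340\n"
            "17 × 3 = 51\n"
            "340 + 51 = 391\n"
            "The answer is 391."
        )},
    ]
}

line = json.dumps(record, ensure_ascii=False)  # one JSONL line of training data
```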
4. Feature-based Distillation#
The student learns to mimic the teacher's intermediate representations (hidden layer activations), not just the final output. This transfers deeper structural knowledge but requires access to the teacher's internals.
Used primarily in research settings — most commercial distillation uses response-based or CoT methods.
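The feature-matching objective is typically a mean squared error between corresponding hidden activations. A minimal sketch, assuming the two layers have already been projected to the same width (real implementations learn that projection):

```python
def feature_loss(student_hidden, teacher_hidden):
    """MSE between intermediate representations of matched layers."""
    n = len(student_hidden)
    return sum((s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)) / n

# Toy activations: the closer the student's features, the lower the loss
mismatch = feature_loss([0.0, 2.0], [1.0, 2.0])
perfect = feature_loss([1.0, 2.0], [1.0, 2.0])
```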
The Distillation Hall of Fame#
Here's every major model that was built through distillation:
Open-Source Distilled Models#
| Model | Teacher | Student Base | Parameters | Key Achievement |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B | DeepSeek-R1 (671B MoE) | Qwen-2.5-32B | 32B | Beats GPT-4o on math reasoning |
| DeepSeek-R1-Distill-Llama-70B | DeepSeek-R1 (671B MoE) | Llama-3.1-70B | 70B | 95% of R1's reasoning at 1/10 size |
| DeepSeek-R1-Distill-Qwen-14B | DeepSeek-R1 (671B MoE) | Qwen-2.5-14B | 14B | Runs on a single GPU |
| DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1 (671B MoE) | Qwen-2.5-7B | 7B | Runs on consumer hardware |
| DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1 (671B MoE) | Qwen-2.5-1.5B | 1.5B | Phone-deployable reasoning model |
| Llama 4 Scout | Llama 4 Behemoth (2T+) | Custom MoE | 109B (17B active) | 10M context, distilled from Behemoth |
| Llama 4 Maverick | Llama 4 Behemoth (2T+) | Custom MoE | 400B (17B active) | Frontier multimodal, distilled from Behemoth |
| Qwen 3 series | Qwen 3-235B (MoE) | Various | 0.6B-32B | Full family distilled from 235B MoE teacher |
| Qwen 3.5 Small series | Qwen 3.5 large models | Various | 0.8B-9B | On-device optimized, Mar 2026 |
| Gemma 3 (Google) | Gemini Pro | Custom | 1B-27B | Open-weight distilled from Gemini |
| Phi-4-mini (Microsoft) | GPT-4 + synthetic data | Custom | 3.8B | STEM performance beats 10x larger models |
| Orca 2 (Microsoft) | GPT-4 | LLaMA-2-13B | 13B | Learned to choose reasoning strategy |
| Vicuna-13B (2023) | ChatGPT | LLaMA-13B | 13B | First successful open distillation |
| Alpaca-7B (2023) | text-davinci-003 | LLaMA-7B | 7B | 52K instruction distillation for $600 |
Commercial Distilled Models (as of March 2026)#
| Model | Likely Teacher | Parameters | Cost Reduction | Performance Retained |
|---|---|---|---|---|
| GPT-5-mini | GPT-5 | Undisclosed | ~50x cheaper | ~82% of GPT-5 |
| GPT-5-nano | GPT-5 | Undisclosed | ~100x cheaper | ~70% of GPT-5 |
| GPT-4o-mini | GPT-4o | ~8B (est.) | 60x cheaper | ~82% of GPT-4o |
| Gemini 2.5 Flash | Gemini 2.5 Pro | Undisclosed | ~8x cheaper | ~90% of Pro |
| Gemini 2.0 Flash | Gemini 2.0 Pro | Undisclosed | 10x cheaper | ~88% of Pro |
| Claude 4 Haiku | Claude 4 Sonnet | Undisclosed | ~8x cheaper | ~83% of Sonnet |
| Claude 3.5 Haiku | Claude 3.5 Sonnet | Undisclosed | 10x cheaper | ~80% of Sonnet |
| Mistral Small | Mistral Large | 22B | 6x cheaper | ~78% of Large |
| Qwen-Turbo | Qwen-Max | Undisclosed | 10x cheaper | ~80% of Max |
Key insight: Almost every "mini," "nano," "flash," "haiku," or "turbo" model in your API is a distilled version of a larger model. When you call GPT-5-mini, you're using distillation. Even Meta's Llama 4 Scout and Maverick were distilled from the unreleased Behemoth model.
The Controversy: When Distillation Becomes Theft#
Distillation sits in a legal and ethical gray zone. Here's the tension:
The Anthropic vs DeepSeek Case#
In early 2025, Anthropic published evidence suggesting DeepSeek had:
- Made 16 million API calls to Claude
- Used Claude's outputs (including reasoning chains) as training data
- Trained competing models that exhibited Claude-like behavior patterns
DeepSeek acknowledged using outputs from multiple frontier models as part of their training pipeline — a practice that technically violates most API terms of service.
Who Allows Distillation?#
| Provider | Policy on Output Distillation | Enforcement |
|---|---|---|
| OpenAI | ❌ Explicitly prohibited (except for fine-tuning their own models) | Active monitoring |
| Anthropic | ❌ Prohibited for training competing models | Legal action threats |
| Google | ❌ Prohibited for most Gemini outputs | Terms of Service |
| DeepSeek | ✅ Explicitly allows distillation of R1 outputs | Open license |
| Meta (Llama) | ⚠️ Allowed with restrictions (no training >700M param models with Llama 2) | Community license |
| Mistral | ✅ Apache 2.0 for most models | Fully open |
| Qwen | ✅ Allows distillation | Open license |
The Academic Perspective#
Researchers largely view distillation as a legitimate and essential technique:
- It democratizes access to AI capabilities
- It enables deployment on edge devices
- It's a natural part of the "knowledge cascade" in ML
The Commercial Perspective#
Companies that invested billions in training frontier models see unauthorized distillation as:
- Intellectual property theft
- Unfair competitive advantage
- A threat to their business model
The uncomfortable truth: The entire open-source AI ecosystem — from Alpaca to DeepSeek-R1-Distill — was largely built by distilling proprietary models. Whether this is innovation or theft depends on who you ask.
How Developers Can Use Distillation#
You don't need to be DeepSeek to use distillation. Here's a practical workflow:
Step 1: Generate Training Data with a Large Model#
Use a powerful model to generate high-quality responses for your specific use case:
```python
from openai import OpenAI
import json

client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

# Define your task-specific prompts
prompts = [
    "Classify this customer email as: billing, technical, general. Email: ...",
    "Extract product name and price from this text: ...",
    # ... hundreds or thousands of examples
]

training_data = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-5",  # Use a powerful teacher model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    # Store each example in the chat format that fine-tuning expects
    training_data.append({
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response.choices[0].message.content}
        ]
    })

# Save training dataset
with open("distillation_data.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")
```
Step 2: Fine-tune a Smaller Model#
Use the generated data to fine-tune a cheaper model:
```python
# Fine-tune GPT-4o-mini on your distillation data
file = client.files.create(
    file=open("distillation_data.jsonl", "rb"),
    purpose="fine-tune"
)

job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini"
)
```
Step 3: Deploy and Save#
Your fine-tuned mini model now performs nearly as well as the teacher model on your specific task — at a fraction of the cost.
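Once the fine-tuning job succeeds, you call the student by the model id the job returns. A hypothetical sketch (the `ft:...` id and the `build_request` helper are illustrative, not real identifiers):

```python
def build_request(model_id, email_text):
    """Assemble a chat-completion payload for the distilled classifier."""
    return {
        "model": model_id,
        "messages": [{
            "role": "user",
            "content": f"Classify this customer email as: billing, technical, general. Email: {email_text}"
        }],
        "temperature": 0.0,  # deterministic labels for classification
    }

# Hypothetical fine-tuned model id returned when the job finishes
payload = build_request("ft:gpt-4o-mini:my-org::abc123",
                        "My invoice shows the wrong amount.")
# result = client.chat.completions.create(**payload)
```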
When to Distill vs When to Just Call the Big Model#
| Scenario | Recommendation | Why |
|---|---|---|
| < 1K requests/day | Call the big model | Distillation setup cost not worth it |
| 1K-100K requests/day | Consider distillation | Savings start to add up |
| > 100K requests/day | Definitely distill | 10-60x cost savings compound fast |
| Latency-critical (< 200ms) | Distill | Small models are 3-10x faster |
| Privacy-sensitive | Distill + self-host | Data never leaves your server |
| Task changes frequently | Call the big model | Re-distilling is expensive |
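The break-even arithmetic behind the table is simple. A back-of-envelope sketch with invented per-request prices (the 20x gap assumed here sits inside the 10-60x range cited above):

```python
def monthly_cost(requests_per_day, cost_per_request, days=30):
    """Total spend for one month at a steady request rate."""
    return requests_per_day * days * cost_per_request

REQUESTS_PER_DAY = 100_000
BIG_MODEL = 0.01    # hypothetical $/request for the frontier model
DISTILLED = 0.0005  # hypothetical $/request for the 20x-cheaper student

big = monthly_cost(REQUESTS_PER_DAY, BIG_MODEL)    # ~$30,000/month
small = monthly_cost(REQUESTS_PER_DAY, DISTILLED)  # ~$1,500/month
savings = big - small                              # ~$28,500/month
```

At 1K requests/day the same gap is under $300/month, which rarely justifies the one-time cost of generating distillation data and running a fine-tuning job.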
With Crazyrouter, you can use the same API key to generate training data from any teacher model (Claude, GPT, Gemini, DeepSeek) and then deploy your distilled student model — all through one unified endpoint.
The Future of Distillation (Already Happening in 2026)#
Distillation is evolving fast — and many of these "future" trends are already shipping:
- MoE + Distillation: Llama 4 and Qwen 3 combine Mixture-of-Experts architecture with distillation — the teacher is a massive MoE model, students are smaller MoE or dense models. This is now the dominant paradigm.
- Self-distillation: Models distilling themselves (DeepSeek-R1 used RL + self-distillation)
- Progressive distillation: Multi-stage chains (Behemoth → Maverick → Scout; R1 671B → 70B → 32B → 7B → 1.5B)
- On-device distillation: Qwen 3.5 Small series (0.8B-9B) are specifically distilled for phone and edge deployment
- Task-specific distillation: Instead of general compression, distilling only the capabilities you need
- Synthetic data distillation: Using teacher models to generate entire training datasets (Microsoft's Phi series, Google's Gemma)
The trend is clear: frontier models are becoming training data generators. The biggest model you'll never call directly — but every cheap model you use was born from it.
Key Takeaways#
- Distillation transfers knowledge from large (teacher) to small (student) models
- Every "mini/flash/haiku" model you use is likely distilled from a larger one
- DeepSeek-R1-Distill proved that distilled models can match frontier reasoning
- It's controversial — most API providers prohibit using outputs for training competitors
- Developers can use it by generating training data from big models and fine-tuning small ones
- Cost savings are massive — 10-60x cheaper while retaining 80-95% performance
Further Reading#
- What Are Tokens in AI? Complete Guide
- AI API Pricing Guide 2026
- How to Cut AI API Costs by 50%
- DeepSeek vs GPT-4o vs Claude: Complete Comparison
Knowledge distillation is how the AI industry makes powerful models accessible to everyone. Updated March 2026 with the latest Llama 4, Qwen 3, GPT-5, and Gemini 2.5 distillation data. For the latest model comparisons and pricing, visit the Crazyrouter blog.


