
"Model Distillation Explained: How Small AI Models Learn from Giants"
Model Distillation Explained: How Small AI Models Learn from Giants#
In early 2025, Anthropic made a bombshell accusation: DeepSeek had used 16 million conversations with Claude to train its own models. The technique? Knowledge distillation — one of the most powerful (and controversial) methods in modern AI.
DeepSeek denied the specifics but openly published its R1-Distill series — a family of smaller models trained by distilling the reasoning capabilities of their massive 671-billion-parameter R1 model. The result? Models 10-400x smaller that retain 85-95% of the original's reasoning ability.
Fast-forward to March 2026, and distillation has become the default strategy for the entire industry. GPT-5-mini, GPT-5-nano, Gemini 2.5 Flash, Claude 4 Haiku, Llama 4 Scout/Maverick, Qwen 3, and dozens more — all products of distillation. It's the technique that makes cheap, fast AI possible.
Let's break down exactly how it works.
What Is Model Distillation?#
Model distillation (formally knowledge distillation) is a training technique where a large, powerful teacher model transfers its knowledge to a smaller, faster student model.
Think of it like a master chef teaching an apprentice. The apprentice doesn't need to reinvent every recipe from scratch — they learn by watching the master work, absorbing not just what to cook but how to think about cooking.
Why not just train a small model directly?#
Three reasons:
1. Small models can't learn as well from raw data. A 7B-parameter model trained on the same data as a 670B model will always perform worse — it simply doesn't have the capacity to extract the same patterns.
2. Teacher models produce better training signals. Instead of learning from "the answer is Paris," the student learns from the teacher's full probability distribution: "Paris (92%), Lyon (5%), Marseille (2%)..." — this softer signal carries much more information.
3. It's dramatically cheaper. Training a 670B model from scratch costs tens of millions of dollars. Distilling a 7B student from it costs thousands.
Distillation vs Fine-tuning vs Pre-training#
| Aspect | Pre-training | Fine-tuning | Distillation |
|---|---|---|---|
| Starting point | Random weights | Pre-trained model | Pre-trained student model |
| Training data | Trillions of tokens (raw text) | Task-specific examples (10K-1M) | Teacher model outputs |
| Goal | Learn language | Learn a specific task | Compress teacher's knowledge |
| Cost | $100M+ | $10K | $100K |
| Result | Foundation model | Specialized model | Smaller, faster model |
Core Distillation Techniques#
1. Soft Label Distillation (Hinton, 2015)#
The foundational technique, proposed by Geoffrey Hinton et al. Instead of training the student on hard labels ("this is a cat"), you train it on the teacher's soft probability distribution.
How it works:
The teacher model outputs probabilities for all possible tokens using a temperature-scaled softmax:
Standard output (T=1): Paris: 0.95, Lyon: 0.03, Berlin: 0.01, ...
Softened output (T=5): Paris: 0.45, Lyon: 0.18, Berlin: 0.12, ...
Higher temperature "softens" the distribution, revealing the teacher's uncertainty and the relationships between options. The student learns not just the right answer, but how confident the teacher is and what alternatives it considered.
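To see the effect concretely, here is a minimal sketch of temperature-scaled softmax in plain Python. The logits are invented for illustration:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Divide logits by T before softmax; higher T flattens the distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for Paris, Lyon, Berlin
logits = [6.0, 2.5, 1.5]

sharp = softmax_with_temperature(logits, T=1)  # nearly all mass on Paris
soft = softmax_with_temperature(logits, T=5)   # runner-up cities become visible
```

At T=1 the distribution is close to one-hot; at T=5 Lyon and Berlin receive noticeable probability, which is exactly the extra signal the student trains on.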
Loss function:
L = α × CrossEntropy(student_output, hard_labels)
  + β × KL_Divergence(student_soft_output, teacher_soft_output)
The student optimizes for both correctness (hard labels) and mimicking the teacher's thinking (soft labels). In Hinton's original formulation the soft term is additionally scaled by T², so its gradients stay comparable in magnitude as the temperature changes.
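A runnable sketch of that loss in plain Python, with made-up probability distributions. The T² factor on the soft term follows Hinton et al.'s formulation, which keeps its gradient magnitude comparable across temperatures:

```python
import math

def cross_entropy(student_probs, target_index):
    """Hard-label term: negative log-probability of the correct answer."""
    return -math.log(student_probs[target_index])

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_probs, student_soft, teacher_soft,
                      target_index, alpha=0.5, beta=0.5, T=5.0):
    hard = cross_entropy(student_probs, target_index)
    # T**2 rescales the soft term so both losses stay on a comparable scale
    soft = (T ** 2) * kl_divergence(teacher_soft, student_soft)
    return alpha * hard + beta * soft

# Invented distributions over [Paris, Lyon, Berlin]; Paris (index 0) is correct
student = [0.80, 0.15, 0.05]     # student output at T=1
student_T = [0.50, 0.30, 0.20]   # student output softened at T=5
teacher_T = [0.45, 0.35, 0.20]   # teacher output softened at T=5
loss = distillation_loss(student, student_T, teacher_T, target_index=0)
```

When the student's softened distribution exactly matches the teacher's, the KL term vanishes and only the hard-label term remains.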
2. Hard Label Distillation (Response-based)#
The simplest and most widely used method in the LLM era. You simply:
- Send prompts to the teacher model
- Collect its text outputs
- Fine-tune the student model on those outputs
This is how Alpaca, Vicuna, and many open-source models were created — by fine-tuning LLaMA on ChatGPT/GPT-4 outputs.
Advantages: No access to teacher's internals needed — works with any API. Disadvantages: Loses the nuance of soft probability distributions.
3. Chain-of-Thought (CoT) Distillation#
The breakthrough technique behind DeepSeek-R1-Distill and similar reasoning models. Instead of just collecting answers, you collect the teacher's step-by-step reasoning process.
Example:
Prompt: "What is 17 × 23?"
Teacher output (with CoT):
"Let me break this down:
17 × 23 = 17 × 20 + 17 × 3
17 × 20 = 340
17 × 3 = 51
340 + 51 = 391
The answer is 391."
The student learns not just that 17 × 23 = 391, but how to reason through the problem. This is why DeepSeek-R1-Distill models show genuine reasoning ability — they've internalized the reasoning patterns, not just the answers.
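A CoT training record therefore keeps the full reasoning trace, not just the final answer. Here is a sketch of what one such example might look like as a chat-format JSONL record (the field layout is an assumption, matching the common OpenAI-style fine-tuning format):

```python
import json

# One CoT distillation example: the assistant turn contains the teacher's
# entire reasoning chain, not only the final answer "391".
record = {
    "messages": [
        {"role": "user", "content": "What is 17 × 23?"},
        {"role": "assistant", "content": (
            "Let me break this down:\n"
            "17 × 23 = 17 × 20 + 17 × 3\n"
            "17 × 20 = 340\n"
            "17 × 3 = 51\n"
            "340 + 51 = 391\n"
            "The answer is 391."
        )},
    ]
}

line = json.dumps(record, ensure_ascii=False)  # one JSONL line of training data
```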
4. Feature-based Distillation#
The student learns to mimic the teacher's intermediate representations (hidden layer activations), not just the final output. This transfers deeper structural knowledge but requires access to the teacher's internals.
Used primarily in research settings — most commercial distillation uses response-based or CoT methods.
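The feature-matching objective is typically a mean squared error between corresponding hidden activations. A minimal sketch, assuming the two layers have already been projected to the same width (real implementations learn that projection):

```python
def feature_loss(student_hidden, teacher_hidden):
    """MSE between intermediate representations of matched layers."""
    n = len(student_hidden)
    return sum((s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)) / n

# Toy activations: the closer the student's features, the lower the loss
mismatch = feature_loss([0.0, 2.0], [1.0, 2.0])
perfect = feature_loss([1.0, 2.0], [1.0, 2.0])
```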
The Distillation Hall of Fame#
Here's every major model that was built through distillation:
Open-Source Distilled Models#
| Model | Teacher | Student Base | Parameters | Key Achievement |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B | DeepSeek-R1 (671B MoE) | Qwen-2.5-32B | 32B | Beats GPT-4o on math reasoning |
| DeepSeek-R1-Distill-Llama-70B | DeepSeek-R1 (671B MoE) | Llama-3.1-70B | 70B | 95% of R1's reasoning at 1/10 size |
| DeepSeek-R1-Distill-Qwen-14B | DeepSeek-R1 (671B MoE) | Qwen-2.5-14B | 14B | Runs on a single GPU |
| DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1 (671B MoE) | Qwen-2.5-7B | 7B | Runs on consumer hardware |
| DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1 (671B MoE) | Qwen-2.5-1.5B | 1.5B | Phone-deployable reasoning model |
| Llama 4 Scout | Llama 4 Behemoth (2T+) | Custom MoE | 109B (17B active) | 10M context, distilled from Behemoth |
| Llama 4 Maverick | Llama 4 Behemoth (2T+) | Custom MoE | 400B (17B active) | Frontier multimodal, distilled from Behemoth |
| Qwen 3 series | Qwen 3-235B (MoE) | Various | 0.6B-32B | Full family distilled from 235B MoE teacher |
| Qwen 3.5 Small series | Qwen 3.5 large models | Various | 0.8B-9B | On-device optimized, Mar 2026 |
| Gemma 3 (Google) | Gemini Pro | Custom | 1B-27B | Open-weight distilled from Gemini |
| Phi-4-mini (Microsoft) | GPT-4 + synthetic data | Custom | 3.8B | STEM performance beats 10x larger models |
| Orca 2 (Microsoft) | GPT-4 | LLaMA-2-13B | 13B | Learned to choose reasoning strategy |
| Vicuna-13B (2023) | ChatGPT | LLaMA-13B | 13B | First successful open distillation |
| Alpaca-7B (2023) | text-davinci-003 | LLaMA-7B | 7B | 52K instruction distillation for $600 |
Commercial Distilled Models (as of March 2026)#
| Model | Likely Teacher | Parameters | Cost Reduction | Performance Retained |
|---|---|---|---|---|
| GPT-5-mini | GPT-5 | Undisclosed | ~50x cheaper | ~82% of GPT-5 |
| GPT-5-nano | GPT-5 | Undisclosed | ~100x cheaper | ~70% of GPT-5 |
| GPT-4o-mini | GPT-4o | ~8B (est.) | 60x cheaper | ~82% of GPT-4o |
| Gemini 2.5 Flash | Gemini 2.5 Pro | Undisclosed | ~8x cheaper | ~90% of Pro |
| Gemini 2.0 Flash | Gemini 2.0 Pro | Undisclosed | 10x cheaper | ~88% of Pro |
| Claude 4 Haiku | Claude 4 Sonnet | Undisclosed | ~8x cheaper | ~83% of Sonnet |
| Claude 3.5 Haiku | Claude 3.5 Sonnet | Undisclosed | 10x cheaper | ~80% of Sonnet |
| Mistral Small | Mistral Large | 22B | 6x cheaper | ~78% of Large |
| Qwen-Turbo | Qwen-Max | Undisclosed | 10x cheaper | ~80% of Max |
Key insight: Almost every "mini," "nano," "flash," "haiku," or "turbo" model in your API is a distilled version of a larger model. When you call GPT-5-mini, you're using distillation. Even Meta's Llama 4 Scout and Maverick were distilled from the unreleased Behemoth model.
The Controversy: When Distillation Becomes Theft#
Distillation sits in a legal and ethical gray zone. Here's the tension:
The Anthropic vs DeepSeek Case#
In early 2025, Anthropic published evidence suggesting DeepSeek had:
- Made 16 million API calls to Claude
- Used Claude's outputs (including reasoning chains) as training data
- Trained competing models that exhibited Claude-like behavior patterns
DeepSeek acknowledged using outputs from multiple frontier models as part of their training pipeline — a practice that technically violates most API terms of service.
Who Allows Distillation?#
| Provider | Policy on Output Distillation | Enforcement |
|---|---|---|
| OpenAI | ❌ Explicitly prohibited (except for fine-tuning their own models) | Active monitoring |
| Anthropic | ❌ Prohibited for training competing models | Legal action threats |
| Google | ❌ Prohibited for most Gemini outputs | Terms of Service |
| DeepSeek | ✅ Explicitly allows distillation of R1 outputs | Open license |
| Meta (Llama) | ⚠️ Allowed with restrictions (no training >700M param models with Llama 2) | Community license |
| Mistral | ✅ Apache 2.0 for most models | Fully open |
| Qwen | ✅ Allows distillation | Open license |
The Academic Perspective#
Researchers largely view distillation as a legitimate and essential technique:
- It democratizes access to AI capabilities
- It enables deployment on edge devices
- It's a natural part of the "knowledge cascade" in ML
The Commercial Perspective#
Companies that invested billions in training frontier models see unauthorized distillation as:
- Intellectual property theft
- Unfair competitive advantage
- A threat to their business model
The uncomfortable truth: The entire open-source AI ecosystem — from Alpaca to DeepSeek-R1-Distill — was largely built by distilling proprietary models. Whether this is innovation or theft depends on who you ask.
How Developers Can Use Distillation#
You don't need to be DeepSeek to use distillation. Here's a practical workflow:
Step 1: Generate Training Data with a Large Model#
Use a powerful model to generate high-quality responses for your specific use case:
```python
from openai import OpenAI
import json

client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

# Define your task-specific prompts
prompts = [
    "Classify this customer email as: billing, technical, general. Email: ...",
    "Extract product name and price from this text: ...",
    # ... hundreds or thousands of examples
]

training_data = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-5",  # Use a powerful teacher model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    # Store each example in the chat format that fine-tuning expects
    training_data.append({
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response.choices[0].message.content}
        ]
    })

# Save training dataset
with open("distillation_data.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")
```
Step 2: Fine-tune a Smaller Model#
Use the generated data to fine-tune a cheaper model:
```python
# Fine-tune GPT-4o-mini on your distillation data
file = client.files.create(
    file=open("distillation_data.jsonl", "rb"),
    purpose="fine-tune"
)

job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini"
)
```
Step 3: Deploy and Save#
Your fine-tuned mini model now performs nearly as well as the teacher model on your specific task — at a fraction of the cost.
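Once the fine-tuning job succeeds, you call the student by the model id the job returns. A hypothetical sketch (the `ft:...` id and the `build_request` helper are illustrative, not real identifiers):

```python
def build_request(model_id, email_text):
    """Assemble a chat-completion payload for the distilled classifier."""
    return {
        "model": model_id,
        "messages": [{
            "role": "user",
            "content": f"Classify this customer email as: billing, technical, general. Email: {email_text}"
        }],
        "temperature": 0.0,  # deterministic labels for classification
    }

# Hypothetical fine-tuned model id returned when the job finishes
payload = build_request("ft:gpt-4o-mini:my-org::abc123",
                        "My invoice shows the wrong amount.")
# result = client.chat.completions.create(**payload)
```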
When to Distill vs When to Just Call the Big Model#
| Scenario | Recommendation | Why |
|---|---|---|
| < 1K requests/day | Call the big model | Distillation setup cost not worth it |
| 1K-100K requests/day | Consider distillation | Savings start to add up |
| > 100K requests/day | Definitely distill | 10-60x cost savings compound fast |
| Latency-critical (< 200ms) | Distill | Small models are 3-10x faster |
| Privacy-sensitive | Distill + self-host | Data never leaves your server |
| Task changes frequently | Call the big model | Re-distilling is expensive |
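The break-even arithmetic behind the table is simple. A back-of-envelope sketch with invented per-request prices (the 20x gap assumed here sits inside the 10-60x range cited above):

```python
def monthly_cost(requests_per_day, cost_per_request, days=30):
    """Total spend for one month at a steady request rate."""
    return requests_per_day * days * cost_per_request

REQUESTS_PER_DAY = 100_000
BIG_MODEL = 0.01    # hypothetical $/request for the frontier model
DISTILLED = 0.0005  # hypothetical $/request for the 20x-cheaper student

big = monthly_cost(REQUESTS_PER_DAY, BIG_MODEL)    # ~$30,000/month
small = monthly_cost(REQUESTS_PER_DAY, DISTILLED)  # ~$1,500/month
savings = big - small                              # ~$28,500/month
```

At 1K requests/day the same gap is under $300/month, which rarely justifies the one-time cost of generating distillation data and running a fine-tuning job.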
With Crazyrouter, you can use the same API key to generate training data from any teacher model (Claude, GPT, Gemini, DeepSeek) and then deploy your distilled student model — all through one unified endpoint.
The Future of Distillation (Already Happening in 2026)#
Distillation is evolving fast — and many of these "future" trends are already shipping:
- MoE + Distillation: Llama 4 and Qwen 3 combine Mixture-of-Experts architecture with distillation — the teacher is a massive MoE model, students are smaller MoE or dense models. This is now the dominant paradigm.
- Self-distillation: Models distilling themselves (DeepSeek-R1 used RL + self-distillation)
- Progressive distillation: Multi-stage chains (Behemoth → Maverick → Scout; R1 671B → 70B → 32B → 7B → 1.5B)
- On-device distillation: Qwen 3.5 Small series (0.8B-9B) are specifically distilled for phone and edge deployment
- Task-specific distillation: Instead of general compression, distilling only the capabilities you need
- Synthetic data distillation: Using teacher models to generate entire training datasets (Microsoft's Phi series, Google's Gemma)
The trend is clear: frontier models are becoming training data generators. The biggest model you'll never call directly — but every cheap model you use was born from it.
Key Takeaways#
- Distillation transfers knowledge from large (teacher) to small (student) models
- Every "mini/flash/haiku" model you use is likely distilled from a larger one
- DeepSeek-R1-Distill proved that distilled models can match frontier reasoning
- It's controversial — most API providers prohibit using outputs for training competitors
- Developers can use it by generating training data from big models and fine-tuning small ones
- Cost savings are massive — 10-60x cheaper while retaining 80-95% performance
Further Reading#
- What Are Tokens in AI? Complete Guide
- AI API Pricing Guide 2026
- How to Cut AI API Costs by 50%
- DeepSeek vs GPT-4o vs Claude: Complete Comparison
Knowledge distillation is how the AI industry makes powerful models accessible to everyone. Updated March 2026 with the latest Llama 4, Qwen 3, GPT-5, and Gemini 2.5 distillation data. For the latest model comparisons and pricing, visit the Crazyrouter blog.


