Crazyrouter Team
March 30, 2026
Model Distillation Explained: How Small AI Models Learn from Giants#

In early 2025, Anthropic made a bombshell accusation: DeepSeek had used 16 million conversations with Claude to train its own models. The technique? Knowledge distillation — one of the most powerful (and controversial) methods in modern AI.

DeepSeek denied the specifics but openly published its R1-Distill series — a family of smaller models trained by distilling the reasoning capabilities of their massive 671-billion-parameter R1 model. The result? Models 10-400x smaller that retain 85-95% of the original's reasoning ability.

Fast-forward to March 2026, and distillation has become the default strategy for the entire industry. GPT-5-mini, GPT-5-nano, Gemini 2.5 Flash, Claude 4 Haiku, Llama 4 Scout/Maverick, Qwen 3, and dozens more — all products of distillation. It's the technique that makes cheap, fast AI possible.

Let's break down exactly how it works.


What Is Model Distillation?#

Model distillation (formally knowledge distillation) is a training technique where a large, powerful teacher model transfers its knowledge to a smaller, faster student model.

Think of it like a master chef teaching an apprentice. The apprentice doesn't need to reinvent every recipe from scratch — they learn by watching the master work, absorbing not just what to cook but how to think about cooking.

Why not just train a small model directly?#

Three reasons:

  1. Small models can't learn as well from raw data. A 7B-parameter model trained on the same data as a 670B model will always perform worse — it simply doesn't have the capacity to extract the same patterns.

  2. Teacher models produce better training signals. Instead of learning from "the answer is Paris," the student learns from the teacher's full probability distribution: "Paris (92%), Lyon (5%), Marseille (2%)..." — this softer signal carries much more information.

  3. It's dramatically cheaper. Training a 670B model from scratch costs tens of millions of dollars. Distilling a 7B student from it costs thousands.

Distillation vs Fine-tuning vs Pre-training#

| Aspect | Pre-training | Fine-tuning | Distillation |
|---|---|---|---|
| Starting point | Random weights | Pre-trained model | Pre-trained student model |
| Training data | Trillions of tokens (raw text) | Task-specific examples (10K-1M) | Teacher model outputs |
| Goal | Learn language | Learn a specific task | Compress teacher's knowledge |
| Cost | $10M-$100M+ | $100-$10K | $1K-$100K |
| Result | Foundation model | Specialized model | Smaller, faster model |

Core Distillation Techniques#

1. Soft Label Distillation (Hinton, 2015)#

The foundational technique, proposed by Geoffrey Hinton et al. Instead of training the student on hard labels ("this is a cat"), you train it on the teacher's soft probability distribution.

How it works:

The teacher model outputs probabilities for all possible tokens using a temperature-scaled softmax:

```
Standard output (T=1):  Paris: 0.95, Lyon: 0.03, Berlin: 0.01, ...
Softened output (T=5):  Paris: 0.45, Lyon: 0.18, Berlin: 0.12, ...
```

Higher temperature "softens" the distribution, revealing the teacher's uncertainty and the relationships between options. The student learns not just the right answer, but how confident the teacher is and what alternatives it considered.
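The temperature effect can be sketched in a few lines of NumPy. The logits here are made-up illustrative values, not real model outputs:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    z = np.array(logits, dtype=float) / T
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for three next-token candidates:
# Paris, Lyon, Berlin
logits = [9.0, 5.5, 4.8]

p1 = softmax_with_temperature(logits, T=1.0)
p5 = softmax_with_temperature(logits, T=5.0)
print(p1.round(3))  # sharply peaked on the top token
print(p5.round(3))  # softened: the alternatives get visible mass
```

At T=1 nearly all probability lands on the top token; at T=5 the relative preferences survive but the alternatives carry enough mass for the student to learn from.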

Loss function:

```
L = α × CrossEntropy(student_output, hard_labels)
  + β × KL_Divergence(student_soft_output, teacher_soft_output)
```

The student optimizes for both correctness (hard labels) and mimicking the teacher's thinking (soft labels).
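This combined loss can be sketched in NumPy over a toy 3-token vocabulary; the logits are made-up illustrative values, and real implementations would use framework primitives such as PyTorch's `KLDivLoss`:

```python
import numpy as np

def log_softmax(logits, T=1.0):
    """Numerically stable log-softmax at temperature T."""
    z = np.array(logits, dtype=float) / T
    z -= z.max()
    return z - np.log(np.exp(z).sum())

def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, beta=0.5, T=5.0):
    """alpha * hard-label cross-entropy + beta * soft-label KL divergence."""
    # Hard-label term: -log p_student(correct token), at T=1
    ce = -log_softmax(student_logits, T=1.0)[hard_label]
    # Soft-label term: KL(teacher_T || student_T) at high temperature
    t_log = log_softmax(teacher_logits, T=T)
    s_log = log_softmax(student_logits, T=T)
    kl = np.sum(np.exp(t_log) * (t_log - s_log))
    return alpha * ce + beta * kl

# Hypothetical logits; index 0 is the gold label
teacher = [9.0, 5.5, 4.8]
student = [6.0, 4.0, 3.5]
print(distillation_loss(student, teacher, hard_label=0))
```

A student whose logits exactly match the teacher's drives the KL term to zero, which is the sense in which the student "mimics the teacher's thinking."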

2. Hard Label Distillation (Response-based)#

The simplest and most widely used method in the LLM era. You simply:

  1. Send prompts to the teacher model
  2. Collect its text outputs
  3. Fine-tune the student model on those outputs

This is how Alpaca, Vicuna, and many open-source models were created — by fine-tuning LLaMA on ChatGPT/GPT-4 outputs.

Advantages: No access to teacher's internals needed — works with any API. Disadvantages: Loses the nuance of soft probability distributions.

3. Chain-of-Thought (CoT) Distillation#

The breakthrough technique behind DeepSeek-R1-Distill and similar reasoning models. Instead of just collecting answers, you collect the teacher's step-by-step reasoning process.

Example:

```
Prompt: "What is 17 × 23?"

Teacher output (with CoT):
"Let me break this down:
17 × 23 = 17 × 20 + 17 × 3
17 × 20 = 340
17 × 3 = 51
340 + 51 = 391
The answer is 391."
```

The student learns not just that 17 × 23 = 391, but how to reason through the problem. This is why DeepSeek-R1-Distill models show genuine reasoning ability — they've internalized the reasoning patterns, not just the answers.
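Turning such a teacher response into a fine-tuning record might look like the sketch below; the chat-style `messages` schema is an assumption that matches common fine-tuning formats, and the point is that the full reasoning chain, not just "391", becomes the training target:

```python
import json

prompt = "What is 17 × 23?"
teacher_cot = (
    "Let me break this down:\n"
    "17 × 23 = 17 × 20 + 17 × 3\n"
    "17 × 20 = 340\n"
    "17 × 3 = 51\n"
    "340 + 51 = 391\n"
    "The answer is 391."
)

# The student is fine-tuned on the entire chain of thought
record = {
    "messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": teacher_cot},
    ]
}
line = json.dumps(record)
print(line)  # one JSONL line of the distillation dataset
```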

4. Feature-based Distillation#

The student learns to mimic the teacher's intermediate representations (hidden layer activations), not just the final output. This transfers deeper structural knowledge but requires access to the teacher's internals.

Used primarily in research settings — most commercial distillation uses response-based or CoT methods.
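A rough sketch of the idea, with random arrays standing in for hidden states and a fixed matrix standing in for what would normally be a learned projection (all shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_student, d_teacher = 8, 256, 1024

# Hypothetical hidden states: the teacher's layers are wider than the student's
teacher_hidden = rng.normal(size=(seq_len, d_teacher))
student_hidden = rng.normal(size=(seq_len, d_student))

# A projection (normally learned jointly with the student) maps
# student features into the teacher's representation space
projection = rng.normal(size=(d_student, d_teacher)) * 0.01

def feature_loss(student_h, teacher_h, W):
    """MSE between projected student activations and teacher activations."""
    return float(np.mean((student_h @ W - teacher_h) ** 2))

fl = feature_loss(student_hidden, teacher_hidden, projection)
print(fl)
```

Minimizing this term alongside the output loss pushes the student's internal representations, not just its answers, toward the teacher's.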


The Distillation Hall of Fame#

Here's every major model that was built through distillation:

Open-Source Distilled Models#

| Model | Teacher | Student Base | Parameters | Key Achievement |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B | DeepSeek-R1 (671B MoE) | Qwen-2.5-32B | 32B | Beats GPT-4o on math reasoning |
| DeepSeek-R1-Distill-Llama-70B | DeepSeek-R1 (671B MoE) | Llama-3.1-70B | 70B | 95% of R1's reasoning at 1/10 size |
| DeepSeek-R1-Distill-Qwen-14B | DeepSeek-R1 (671B MoE) | Qwen-2.5-14B | 14B | Runs on a single GPU |
| DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1 (671B MoE) | Qwen-2.5-7B | 7B | Runs on consumer hardware |
| DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1 (671B MoE) | Qwen-2.5-1.5B | 1.5B | Phone-deployable reasoning model |
| Llama 4 Scout | Llama 4 Behemoth (2T+) | Custom MoE | 109B (17B active) | 10M context, distilled from Behemoth |
| Llama 4 Maverick | Llama 4 Behemoth (2T+) | Custom MoE | 400B (17B active) | Frontier multimodal, distilled from Behemoth |
| Qwen 3 series | Qwen 3-235B (MoE) | Various | 0.6B-32B | Full family distilled from 235B MoE teacher |
| Qwen 3.5 Small series | Qwen 3.5 large models | Various | 0.8B-9B | On-device optimized, Mar 2026 |
| Gemma 3 (Google) | Gemini Pro | Custom | 1B-27B | Open-weight distilled from Gemini |
| Phi-4-mini (Microsoft) | GPT-4 + synthetic data | Custom | 3.8B | STEM performance beats 10x larger models |
| Orca 2 (Microsoft) | GPT-4 | LLaMA-2-13B | 13B | Learned to choose reasoning strategy |
| Vicuna-13B (2023) | ChatGPT | LLaMA-13B | 13B | First successful open distillation |
| Alpaca-7B (2023) | text-davinci-003 | LLaMA-7B | 7B | 52K instruction distillation for $600 |

Commercial Distilled Models (as of March 2026)#

| Model | Likely Teacher | Parameters | Cost Reduction | Performance Retained |
|---|---|---|---|---|
| GPT-5-mini | GPT-5 | Undisclosed | ~50x cheaper | ~82% of GPT-5 |
| GPT-5-nano | GPT-5 | Undisclosed | ~100x cheaper | ~70% of GPT-5 |
| GPT-4o-mini | GPT-4o | ~8B (est.) | 60x cheaper | ~82% of GPT-4o |
| Gemini 2.5 Flash | Gemini 2.5 Pro | Undisclosed | ~8x cheaper | ~90% of Pro |
| Gemini 2.0 Flash | Gemini 2.0 Pro | Undisclosed | 10x cheaper | ~88% of Pro |
| Claude 4 Haiku | Claude 4 Sonnet | Undisclosed | ~8x cheaper | ~83% of Sonnet |
| Claude 3.5 Haiku | Claude 3.5 Sonnet | Undisclosed | 10x cheaper | ~80% of Sonnet |
| Mistral Small | Mistral Large | 22B | 6x cheaper | ~78% of Large |
| Qwen-Turbo | Qwen-Max | Undisclosed | 10x cheaper | ~80% of Max |

Key insight: Almost every "mini," "nano," "flash," "haiku," or "turbo" model in your API is a distilled version of a larger model. When you call GPT-5-mini, you're using distillation. Even Meta's Llama 4 Scout and Maverick were distilled from the unreleased Behemoth model.


The Controversy: When Distillation Becomes Theft#

Distillation sits in a legal and ethical gray zone. Here's the tension:

The Anthropic vs DeepSeek Case#

In early 2025, Anthropic published evidence suggesting DeepSeek had:

  • Made 16 million API calls to Claude
  • Used Claude's outputs (including reasoning chains) as training data
  • Trained competing models that exhibited Claude-like behavior patterns

DeepSeek acknowledged using outputs from multiple frontier models as part of their training pipeline — a practice that technically violates most API terms of service.

Who Allows Distillation?#

| Provider | Policy on Output Distillation | Enforcement |
|---|---|---|
| OpenAI | ❌ Explicitly prohibited (except for fine-tuning their own models) | Active monitoring |
| Anthropic | ❌ Prohibited for training competing models | Legal action threats |
| Google | ❌ Prohibited for most Gemini outputs | Terms of Service |
| DeepSeek | ✅ Explicitly allows distillation of R1 outputs | Open license |
| Meta (Llama) | ⚠️ Allowed with restrictions (no training >700M param models with Llama 2) | Community license |
| Mistral | ✅ Apache 2.0 for most models | Fully open |
| Qwen | ✅ Allows distillation | Open license |

The Academic Perspective#

Researchers largely view distillation as a legitimate and essential technique:

  • It democratizes access to AI capabilities
  • It enables deployment on edge devices
  • It's a natural part of the "knowledge cascade" in ML

The Commercial Perspective#

Companies that invested billions in training frontier models see unauthorized distillation as:

  • Intellectual property theft
  • Unfair competitive advantage
  • A threat to their business model

The uncomfortable truth: The entire open-source AI ecosystem — from Alpaca to DeepSeek-R1-Distill — was largely built by distilling proprietary models. Whether this is innovation or theft depends on who you ask.


How Developers Can Use Distillation#

You don't need to be DeepSeek to use distillation. Here's a practical workflow:

Step 1: Generate Training Data with a Large Model#

Use a powerful model to generate high-quality responses for your specific use case:

```python
from openai import OpenAI
import json

client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

# Define your task-specific prompts
prompts = [
    "Classify this customer email as: billing, technical, general. Email: ...",
    "Extract product name and price from this text: ...",
    # ... hundreds or thousands of examples
]

training_data = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-5",  # Use a powerful teacher model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    training_data.append({
        "prompt": prompt,
        "response": response.choices[0].message.content
    })

# Save training dataset
with open("distillation_data.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")
```

Step 2: Fine-tune a Smaller Model#

Use the generated data to fine-tune a cheaper model:

```python
# Fine-tune GPT-4o-mini on your distillation data
file = client.files.create(
    file=open("distillation_data.jsonl", "rb"),
    purpose="fine-tune"
)

job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini"
)
```

Step 3: Deploy and Save#

Your fine-tuned mini model now performs nearly as well as the teacher model on your specific task — at a fraction of the cost.

When to Distill vs When to Just Call the Big Model#

| Scenario | Recommendation | Why |
|---|---|---|
| < 1K requests/day | Call the big model | Distillation setup cost not worth it |
| 1K-100K requests/day | Consider distillation | Savings start to add up |
| > 100K requests/day | Definitely distill | 10-60x cost savings compound fast |
| Latency-critical (< 200ms) | Distill | Small models are 3-10x faster |
| Privacy-sensitive | Distill + self-host | Data never leaves your server |
| Task changes frequently | Call the big model | Re-distilling is expensive |
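The volume thresholds come down to simple break-even arithmetic. A back-of-the-envelope sketch, where all prices are hypothetical placeholders rather than real provider rates:

```python
# Hypothetical prices, per 1K requests, for illustration only
TEACHER_COST_PER_1K_REQ = 10.00   # e.g. a frontier teacher model
STUDENT_COST_PER_1K_REQ = 0.25    # e.g. a distilled mini model
DISTILLATION_SETUP_COST = 500.00  # data generation + fine-tuning job

def breakeven_requests(teacher_cost, student_cost, setup_cost):
    """Number of requests after which distillation beats calling the teacher."""
    savings_per_request = (teacher_cost - student_cost) / 1000
    return setup_cost / savings_per_request

n = breakeven_requests(TEACHER_COST_PER_1K_REQ,
                       STUDENT_COST_PER_1K_REQ,
                       DISTILLATION_SETUP_COST)
print(round(n))  # requests needed to recoup the one-time setup cost
```

Under these placeholder numbers the setup cost pays for itself after roughly 50K requests, which is why the table above puts the crossover in the 1K-100K requests/day band.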

With Crazyrouter, you can use the same API key to generate training data from any teacher model (Claude, GPT, Gemini, DeepSeek) and then deploy your distilled student model — all through one unified endpoint.


The Future of Distillation (Already Happening in 2026)#

Distillation is evolving fast — and many of these "future" trends are already shipping:

  • MoE + Distillation: Llama 4 and Qwen 3 combine Mixture-of-Experts architecture with distillation — the teacher is a massive MoE model, students are smaller MoE or dense models. This is now the dominant paradigm.
  • Self-distillation: Models distilling themselves (DeepSeek-R1 used RL + self-distillation)
  • Progressive distillation: Multi-stage chains (Behemoth → Maverick → Scout; R1 671B → 70B → 32B → 7B → 1.5B)
  • On-device distillation: Qwen 3.5 Small series (0.8B-9B) are specifically distilled for phone and edge deployment
  • Task-specific distillation: Instead of general compression, distilling only the capabilities you need
  • Synthetic data distillation: Using teacher models to generate entire training datasets (Microsoft's Phi series, Google's Gemma)

The trend is clear: frontier models are becoming training data generators. The biggest model you'll never call directly — but every cheap model you use was born from it.


Key Takeaways#

  1. Distillation transfers knowledge from large (teacher) to small (student) models
  2. Every "mini/flash/haiku" model you use is likely distilled from a larger one
  3. DeepSeek-R1-Distill proved that distilled models can match frontier reasoning
  4. It's controversial — most API providers prohibit using outputs for training competitors
  5. Developers can use it by generating training data from big models and fine-tuning small ones
  6. Cost savings are massive — 10-60x cheaper while retaining 80-95% performance


Knowledge distillation is how the AI industry makes powerful models accessible to everyone. Updated March 2026 with the latest Llama 4, Qwen 3, GPT-5, and Gemini 2.5 distillation data. For the latest model comparisons and pricing, visit the Crazyrouter blog.
