Distilling a small model from a strong teacher is a well-understood path to cheap inference. The recipe is straightforward; the quality bar is set by data preparation. Teams that skip filtering get mediocre results; teams that filter aggressively can get small models that beat their teacher on the specific task.

Generate 5x what you need

Plan for 10K-100K training examples, which means generating 50K-500K candidate completions from the teacher. The filter ratio depends on teacher quality; as a default, budget for 5x as many generations as you will train on.
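
The budget arithmetic can be sketched as below. This is a minimal illustration, not a prescribed tool: the function name `candidates_needed` and the `keep_percent` parameter are assumptions made for this example, and the 20% keep rate is just the 5x rule restated.

```python
def candidates_needed(target_examples: int, keep_percent: int) -> int:
    """Teacher generations required so that `target_examples` survive filtering.

    `keep_percent` is the share of candidates expected to pass the filter
    (hypothetical parameter; 20 corresponds to the 5x rule of thumb).
    """
    if not 0 < keep_percent <= 100:
        raise ValueError("keep_percent must be in 1..100")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-target_examples * 100 // keep_percent)


for target in (10_000, 100_000):
    print(target, "->", candidates_needed(target, keep_percent=20))
```

If a pilot run shows the filter keeps less than 20%, rerun the calculation with the measured rate before committing to the full generation job.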

Capture chain-of-thought

Don't just capture final answers. Prompt the teacher to think step-by-step and capture the reasoning. Train the student on (prompt, reasoning, answer) triples. Small models trained with explicit CoT can match much larger zero-shot models on benchmarks.
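
One way to turn teacher output into (prompt, reasoning, answer) triples is sketched below. The instruction wording, the `Answer:` marker, and the JSON field names are all assumptions chosen for this example; any convention works as long as the teacher follows it consistently.

```python
import json
from typing import Optional, Tuple

# Hypothetical instruction appended to each question sent to the teacher.
COT_INSTRUCTION = (
    "Think step by step, then give the final answer on a line "
    "starting with 'Answer:'."
)
ANSWER_MARKER = "Answer:"


def build_teacher_prompt(question: str) -> str:
    return f"{question}\n\n{COT_INSTRUCTION}"


def split_reasoning_and_answer(completion: str) -> Optional[Tuple[str, str]]:
    """Split a teacher completion at the last answer marker.

    Returns None when the marker, the reasoning, or the answer is missing,
    so malformed completions can be dropped early.
    """
    idx = completion.rfind(ANSWER_MARKER)
    if idx == -1:
        return None
    reasoning = completion[:idx].strip()
    answer = completion[idx + len(ANSWER_MARKER):].strip()
    if not reasoning or not answer:
        return None
    return reasoning, answer


def to_training_record(question: str, completion: str) -> Optional[str]:
    """Serialize one triple as a JSON line, or None if parsing failed."""
    parsed = split_reasoning_and_answer(completion)
    if parsed is None:
        return None
    reasoning, answer = parsed
    return json.dumps(
        {"prompt": question, "reasoning": reasoning, "answer": answer}
    )


print(to_training_record("What is 17 + 25?", "17 + 25 = 42.\nAnswer: 42"))
```

The record stores the original question rather than the teacher prompt, on the assumption that the student should learn to reason without being given the step-by-step instruction.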

Filter ruthlessly

Score every candidate, using a reward model or a strong judge model where no automatic check exists. For code: compile, run, and assert. For math: check the final answer. For extraction: validate the format. For summarization: use an LLM as judge. Drop the bottom 50-80%. The remaining data is the gold.
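
A filtering pass might look like the sketch below, which applies a hard correctness check first and then keeps the top-scoring fraction. The dictionary fields (`answer`, `expected`, `judge_score`) and the function names are hypothetical; in practice the score would come from a reward model or judge call, which is stubbed out here as a precomputed field.

```python
from typing import Callable, Dict, List

Candidate = Dict[str, object]


def check_math(candidate: Candidate) -> bool:
    """Hard check for math tasks: the final answer must match the reference."""
    return str(candidate["answer"]).strip() == str(candidate["expected"]).strip()


def filter_candidates(
    candidates: List[Candidate],
    passes_check: Callable[[Candidate], bool],
    score: Callable[[Candidate], float],
    keep_fraction: float = 0.25,
) -> List[Candidate]:
    """Drop candidates that fail the hard check, then keep the top fraction.

    keep_fraction=0.25 drops the bottom 75% of the verified candidates.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    verified = [c for c in candidates if passes_check(c)]
    ranked = sorted(verified, key=score, reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]


candidates = [
    {"answer": "42", "expected": "42", "judge_score": 0.9},
    {"answer": "42", "expected": "42", "judge_score": 0.4},
    {"answer": "41", "expected": "42", "judge_score": 0.8},  # wrong answer
    {"answer": "42", "expected": "42", "judge_score": 0.7},
    {"answer": "42", "expected": "42", "judge_score": 0.2},
]
kept = filter_candidates(
    candidates, check_math, lambda c: float(c["judge_score"]), keep_fraction=0.5
)
print([c["judge_score"] for c in kept])
```

Note that this sketch applies the keep fraction to the verified candidates only, so a confidently wrong completion never survives on judge score alone. Applying the fraction to the full pool instead is an equally reasonable choice.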

Diversity through varied prompts

Same prompt 1000 times = mode collapse. Vary topic, style, length, difficulty, format. Cluster generated examples by embedding; ensure coverage. Don't rely on temperature alone for diversity.
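
The sketch below illustrates both halves of this advice under some assumptions: prompt variation is done by crossing a few hand-picked axes, and the coverage check takes cluster ids that are assumed to come from whatever embedding and clustering step you already run (not shown). The axis values, function names, and the `min_share` threshold are illustrative only.

```python
import itertools
import random
from collections import Counter
from typing import Dict, List, Sequence

# Hypothetical axes; replace with the dimensions that matter for your task.
AXES: Dict[str, List[str]] = {
    "topic": ["billing", "shipping", "returns"],
    "style": ["formal", "casual"],
    "difficulty": ["easy", "hard"],
}


def build_prompt_specs(
    axes: Dict[str, List[str]], per_cell: int, seed: int = 0
) -> List[Dict[str, str]]:
    """Emit `per_cell` prompt specs for every combination of axis values."""
    keys = list(axes)
    specs = [
        dict(zip(keys, combo))
        for combo in itertools.product(*(axes[k] for k in keys))
        for _ in range(per_cell)
    ]
    random.Random(seed).shuffle(specs)
    return specs


def underfilled_clusters(
    cluster_ids: Sequence[int], num_clusters: int, min_share: float = 0.5
) -> List[int]:
    """Clusters holding less than `min_share` of a uniform share of examples."""
    counts = Counter(cluster_ids)
    floor = min_share * len(cluster_ids) / num_clusters
    return [c for c in range(num_clusters) if counts[c] < floor]


specs = build_prompt_specs(AXES, per_cell=2)
print(len(specs), specs[0])
print(underfilled_clusters([0, 0, 0, 0, 1, 1, 1, 2], num_clusters=3))
```

Any cluster the check reports is a candidate for extra targeted prompts in the next generation round.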

Two-pass distillation

Train the student on filtered teacher data (pass 1). Have the student generate completions on new prompts, have the teacher critique them, and train on the critiques (pass 2). Pass 2 often gives a 5-10% quality boost.
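
The data flow for pass 2 can be sketched as follows. The two callables stand in for real student and teacher inference calls, and the record layout is an assumption: the text does not specify how critiques are formatted into training targets, so treat the fields here as placeholders.

```python
from typing import Callable, Dict, List


def build_pass2_dataset(
    new_prompts: List[str],
    student_generate: Callable[[str], str],
    teacher_critique: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Collect (prompt, student attempt, teacher critique) records for pass 2."""
    records = []
    for prompt in new_prompts:
        attempt = student_generate(prompt)
        critique = teacher_critique(prompt, attempt)
        records.append(
            {"prompt": prompt, "student_attempt": attempt, "critique": critique}
        )
    return records


# Stub models for illustration only; swap in real inference calls.
def fake_student(prompt: str) -> str:
    return f"draft answer to: {prompt}"


def fake_teacher(prompt: str, attempt: str) -> str:
    return f"critique of '{attempt}'"


pass2 = build_pass2_dataset(["Explain overfitting."], fake_student, fake_teacher)
print(pass2[0]["critique"])
```

The prompts used here should be new ones, not those from pass 1, so the critiques target the student's actual failure modes rather than examples it has already memorized.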

Generate 5x, capture CoT, filter ruthlessly, diversity through prompts, optional two-pass. The data is the gate; everything else is recipe.