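Speculative decoding can speed up LLM inference by roughly 2-3x without changing output quality. A small 'draft' model proposes the next K tokens, and the big model verifies all of them in parallel with one forward pass. The longest accepted prefix is committed; at the first rejected token, the big model's own token is used instead and drafting restarts from there. With a well-matched draft model, the math works out very favorably.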
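The draft-then-verify loop described above can be sketched with toy models. This is a minimal greedy variant under stated assumptions, not how any particular engine implements it: a "model" here is a hypothetical function from a token prefix to its greedy next token, and `speculative_step`, `generate`, `greedy_decode`, `toy_target` and `toy_draft` are illustrative names.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to commit."""
    # 1. The draft model proposes k tokens, one at a time.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each position. A real engine scores all k positions
    #    in one batched forward pass; this loop only emulates that.
    committed: List[int] = []
    for i in range(k):
        expected = target(prefix + proposed[:i])
        if expected != proposed[i]:
            # First mismatch: commit the target's token and end the round.
            committed.append(expected)
            return committed
        committed.append(proposed[i])

    # 3. Every proposal was accepted, so the same pass yields one bonus token.
    committed.append(target(prefix + proposed))
    return committed


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    """Repeat speculative rounds until max_new_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + max_new_tokens]


def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out


# Toy models (assumptions for illustration): the target counts upward mod 10;
# the draft agrees except that it guesses wrong after a 4.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10
```

Because every committed token is either confirmed or supplied by the target, `generate(toy_target, toy_draft, [0], 12, k=3)` returns the same sequence as `greedy_decode(toy_target, [0], 12)`. That equality is the "no quality loss" property in its greedy form.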
Why it works
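LLM decoding is memory-bandwidth-bound: each forward pass has to stream every weight from GPU memory (about 140 GB for a 70B model in 16-bit precision) to produce a single token, so the compute units sit mostly idle. Verifying K speculative tokens reuses that same weight read, so one pass can yield up to K+1 tokens for roughly the cost of one. Even with 30-50% of drafted tokens rejected you usually still come out ahead, approaching 2x at the better end of that range when the draft model is cheap.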
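A rough way to put numbers on this is the standard idealized model: assume each drafted token is accepted independently with probability alpha, that verifying K tokens costs the same as one ordinary big-model pass, and that one draft pass costs a fixed fraction of a big-model pass. All three are simplifying assumptions, and the function names below are illustrative.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens committed per verification pass.

    Geometric sum 1 + alpha + ... + alpha**k: the round always yields at
    least one token (the correction or the bonus token).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Idealized speedup over plain decoding.

    draft_cost_ratio is the assumed cost of one draft pass relative to one
    big-model pass; a round costs k draft passes plus one verification pass.
    """
    round_cost = k * draft_cost_ratio + 1.0
    return expected_tokens_per_round(alpha, k) / round_cost
```

Under these assumptions, a 70% acceptance rate with K = 4 and a draft that costs one tenth of a big-model pass gives about 2.8 tokens per round and a speedup just under 2x. Dropping acceptance to 50% with the same settings gives about 1.4x, which is why the acceptance rate matters so much.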
Picking the draft model
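Use the same family, 10-100x smaller: for example, Llama 70B with a Llama 7B draft. The two models should share a tokenizer, since verification compares them token by token. Quality matters: the closer the draft is to the big model's distribution, the more tokens get accepted. Training the draft model on the big model's outputs (distillation) helps.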
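To see why distribution match drives acceptance, consider the usual sampling-based verification rule, in which a token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), where p is the big model's distribution. Summed over the vocabulary, the chance of accepting a drafted token is the overlap between the two distributions. The sketch below assumes that rule and uses made-up three-token distributions.

```python
from typing import Sequence


def acceptance_probability(p: Sequence[float], q: Sequence[float]) -> float:
    """Probability that a token sampled from draft distribution q is accepted.

    Assumes the accept-with-probability-min(1, p/q) rule, which works out to
    sum_x min(p(x), q(x)).
    """
    if len(p) != len(q):
        raise ValueError("p and q must cover the same vocabulary")
    return sum(min(pi, qi) for pi, qi in zip(p, q))


# Made-up next-token distributions over a three-token vocabulary.
target_probs = [0.6, 0.3, 0.1]
close_draft = [0.5, 0.4, 0.1]   # similar to the target
far_draft = [0.1, 0.1, 0.8]     # very different from the target
```

With these numbers the close draft is accepted about 90% of the time and the far draft about 30% of the time. Distillation pushes q toward p, which raises the overlap.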
Acceptance rate target
A 60-80% acceptance rate is healthy. Below 40%, the draft model is too divergent; consider distillation. Above 90%, the draft model is probably overkill, and you could use a smaller, faster one.
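These thresholds are easy to turn into a monitoring check. The sketch below assumes you can count proposed and accepted draft tokens from your serving stack; the function names are hypothetical, and the label for the ranges the text does not cover (40-60% and 80-90%) is an assumption.

```python
def acceptance_rate(accepted: int, proposed: int) -> float:
    """Fraction of drafted tokens that the big model accepted."""
    if proposed <= 0:
        raise ValueError("proposed must be positive")
    return accepted / proposed


def diagnose(rate: float) -> str:
    """Map an acceptance rate to the guidance given in the text."""
    if rate < 0.40:
        return "too divergent: consider distilling the draft model"
    if rate > 0.90:
        return "overkill: try a smaller, faster draft model"
    if 0.60 <= rate <= 0.80:
        return "healthy"
    # Not characterised in the text; labelled here as an assumption.
    return "acceptable: keep monitoring"
```

For example, `diagnose(acceptance_rate(700, 1000))` falls in the healthy band, while `diagnose(acceptance_rate(350, 1000))` points to distillation.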
Medusa, Eagle: speculative without a second model
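These methods build the drafter into the big model instead of hosting a second one. Medusa trains additional 'heads' on the big model to predict multiple future tokens; Eagle trains a lightweight draft layer that reuses the big model's hidden states. Expect a somewhat smaller speedup (~1.5-2x, varying by method and workload) in exchange for a simpler deployment.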
Deployment
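vLLM, TensorRT-LLM, and SGLang all support speculative decoding. With a separate draft model you need GPU memory for both models. Test on your own traffic patterns: speculative decoding helps long generations more than short ones (more opportunities to accept), and the gains shrink at large batch sizes, where decoding is less memory-bound.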
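For a first estimate of the memory needed to host both models, weight size is just parameter count times bytes per parameter. The sketch below covers weights only and deliberately ignores KV cache, activations, and framework overhead, so treat the result as a lower bound; the helper name is illustrative.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    bytes_per_param defaults to 2.0, i.e. 16-bit weights.
    """
    return params_billion * bytes_per_param


# The 70B + 7B pairing in 16-bit precision.
big_gb = weight_memory_gb(70)
draft_gb = weight_memory_gb(7)
total_gb = big_gb + draft_gb
```

Here the draft model adds 14 GB on top of the big model's 140 GB, about a 10% increase in weight memory before any cache is allocated.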