Speculative decoding can speed up LLM inference by roughly 2-3x without changing the output distribution. A small 'draft' model proposes the next K tokens; the big model verifies all of them in parallel in a single forward pass. The longest accepted prefix is committed, and generation resumes from the first rejected position, where the big model supplies its own token. The math works out favorably.
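
The propose-verify-commit step can be sketched in a few lines. This is a toy illustration, not any library's implementation: distributions are plain Python lists over a tiny vocabulary, the names `sample` and `speculative_step` are hypothetical, and the accept/reject rule is assumed to be the standard rejection-sampling scheme (accept with probability min(1, p/q), resample from the residual on rejection), which the text does not spell out.

```python
import random

def sample(weights, rng):
    """Draw an index in proportion to the given non-negative weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def speculative_step(draft_tokens, draft_dists, target_dists, rng):
    """Verify K drafted tokens against the target model's distributions.

    draft_dists[i]  : draft distribution q_i that draft_tokens[i] was sampled from
    target_dists[i] : target distribution p_i at the same position; there are
                      K + 1 entries, the last one is used for a bonus token
    Returns the tokens committed in this step (between 1 and K + 1 of them).
    """
    committed = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_dists[i], draft_dists[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            committed.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        committed.append(sample(residual, rng))
        return committed
    # Every draft token was accepted: take one extra token from the target.
    committed.append(sample(target_dists[len(draft_tokens)], rng))
    return committed
```

In a real system the K + 1 target distributions come from one batched forward pass of the big model over the drafted positions; here they are simply passed in.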

Why it works

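LLM inference is memory-bandwidth-bound: every decoding step has to stream the full set of weights from GPU memory (about 140 GB for a 70B model in 16-bit precision) to produce a single token, so the compute units sit mostly idle. Verifying K speculative tokens reuses that same weight read, so one pass scores K positions for close to the cost of one. Even with a 30-50% rejection rate, you can net roughly a 2x speedup.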
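
A back-of-the-envelope estimate makes the speedup claim concrete. The sketch below makes two simplifying assumptions that are not from the text: each drafted token is accepted independently with the same probability `alpha`, and one draft pass costs a fixed fraction `draft_cost_ratio` of a target pass. The function names are illustrative.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens committed per target pass when drafting k tokens.

    Geometric sum 1 + alpha + ... + alpha**k: the step always yields at
    least one token, plus one more for each consecutive accepted draft.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per step divided by the step's cost in units of target passes.

    One step costs one target pass plus k draft passes.
    """
    step_cost = 1.0 + k * draft_cost_ratio
    return expected_tokens_per_step(alpha, k) / step_cost
```

With alpha = 0.6 (a 40% rejection rate), K = 4, and a draft costing 5% of a target pass, this gives about 2.3 tokens per target pass and an estimated speedup of about 1.9x.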

Picking the draft model

Use a model from the same family that is 10-100x smaller, so the two share a tokenizer: for example, a Llama 7B draft for Llama 70B. Quality matters: the closer the draft is to the big model's distribution, the more tokens get accepted. Distilling the draft model on the big model's outputs helps.
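
"Closer to the big model's distribution" can be quantified. Assuming the standard rejection-sampling acceptance rule (accept a drafted token x with probability min(1, p(x)/q(x))), the chance that a single drafted token is accepted equals the overlap between the two next-token distributions. The helper name below is hypothetical.

```python
def acceptance_probability(target_dist, draft_dist):
    """Probability that one token sampled from the draft is accepted.

    Sum over x of q(x) * min(1, p(x) / q(x)), which simplifies to
    sum over x of min(p(x), q(x)), the overlap of the two distributions.
    """
    return sum(min(p, q) for p, q in zip(target_dist, draft_dist))

# Toy 3-token vocabulary: the draft is slightly off on the first two tokens.
overlap = acceptance_probability([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])
```

Here the overlap is about 0.9. Identical distributions give 1.0 and disjoint ones give 0.0, which is why distillation that pulls the draft toward the big model's outputs raises the acceptance rate.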

Acceptance rate target

An acceptance rate of 60-80% is healthy. Below 40%, the draft model is too divergent; consider distillation. Above 90%, the draft model is overkill, and you could use a smaller, faster one.
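
These thresholds are easy to turn into a monitoring check. The sketch below assumes you can count proposed and accepted draft tokens from your serving logs; the function name is hypothetical, and the "borderline" label for the 40-60% and 80-90% bands is an assumption, since the guidance above does not cover them.

```python
def diagnose_acceptance(accepted: int, proposed: int) -> str:
    """Map an observed acceptance rate onto the rule-of-thumb bands."""
    if proposed <= 0:
        raise ValueError("proposed must be positive")
    rate = accepted / proposed
    if rate < 0.40:
        return "draft too divergent: consider distillation"
    if rate > 0.90:
        return "draft likely overkill: try a smaller, faster draft"
    if 0.60 <= rate <= 0.80:
        return "healthy"
    # Bands not covered by the rule of thumb (assumed label).
    return "borderline: keep monitoring"
```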

Medusa, Eagle: speculative without a second model

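These methods train additional 'heads' on the big model to predict multiple future tokens, which avoids hosting two models. The speedup is typically smaller (~1.5-2x, though reported numbers vary by method and workload), but the setup is operationally simpler.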
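
The multi-head idea can be illustrated with a deliberately simplified sketch. It assumes each extra head is a single linear layer applied to the big model's last hidden state, with head i guessing the token i + 1 positions ahead; real implementations are more elaborate, and the names `matvec` and `propose_with_heads` are hypothetical.

```python
def matvec(weights, hidden):
    """Multiply a (vocab x dim) weight matrix by a hidden-state vector."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def propose_with_heads(hidden, heads):
    """Greedy proposals from several heads sharing one hidden state.

    heads[i] is a weight matrix whose argmax logit is the guess for the
    token i + 1 positions ahead; no second model is run.
    """
    proposals = []
    for weights in heads:
        logits = matvec(weights, hidden)
        proposals.append(max(range(len(logits)), key=logits.__getitem__))
    return proposals

# Toy example: 2-dimensional hidden state, 3-token vocabulary, 2 heads.
toy_hidden = [1.0, 0.0]
toy_heads = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]],  # head 0 favors token 1
    [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # head 1 favors token 0
]
toy_proposals = propose_with_heads(toy_hidden, toy_heads)
```

The proposals are then verified by the big model's next forward pass in the same way as tokens from a separate draft model.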

Deployment

vLLM, TensorRT-LLM, and SGLang support speculative decoding. You need GPU memory for both models. Test on your own traffic patterns: speculative decoding helps long generations more than short ones, because there are more opportunities to accept tokens.
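
A rough weight-memory budget shows what hosting both models means. This sketch counts weights only and assumes 16-bit parameters by default; KV cache, activations, and framework overhead come on top and are not modeled. The function name is illustrative.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

target_gb = weight_memory_gb(70e9)  # 70B target in 16-bit precision
draft_gb = weight_memory_gb(7e9)    # 7B draft in 16-bit precision
total_gb = target_gb + draft_gb
```

For a 70B target with a 7B draft this comes to about 140 GB plus 14 GB, so the draft adds roughly 10% to the weight footprint.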

Speculative decoding is about as close to a free 2-3x speedup as inference gets. Pick a small same-family draft, distill it on the big model's outputs, and deploy via a serving stack such as vLLM.