Speculative decoding typically gives a 1.5-3× speedup on LLM inference with no loss in output quality. A small, fast draft model proposes K tokens; the big model verifies all of them in ONE forward pass. With the right acceptance rule, the output distribution provably matches what the big model would produce on its own, whether it decodes greedily or by sampling.

The setup

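Draft model d (cheap, fast). Target model M (slow, accurate). For each generation cycle: d proposes a sequence of K tokens, one at a time. M scores all K positions in one parallel forward pass. Tokens are then checked left to right: the longest prefix that passes verification is kept, the first rejected token is replaced by one drawn using M's probabilities, and the next cycle starts from there. If all K tokens pass, M's forward pass also supplies one extra token.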

Acceptance rule for matching distribution

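# For each proposed token x_i, sampled by the draft with prob d(x_i):
# Compute r = M(x_i) / d(x_i)
# Accept if rand() < min(1, r)
# If rejected, sample a replacement from (M - d)+ = max(0, M - d),
#   renormalized to sum to 1, and discard the remaining proposals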

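With this rule, every emitted token is distributed exactly as if it had been sampled from M alone. Tokens the draft proposes more often than M would are rejected in proportion to the excess, and resampling from the residual makes up the shortfall. Accepted tokens cost only a share of a single target forward pass.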
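The acceptance rule can be sketched in a few lines of Python. This is a minimal illustration, not any engine's implementation: the function name `speculative_step` and the plain-list representation of the distributions are assumptions made for the example, and it presumes the draft and target probabilities for each position have already been computed.

```python
import random

def speculative_step(draft_dists, target_dists, proposed, rng=random):
    """Verify K proposed tokens and return the tokens emitted this cycle.

    draft_dists:  K lists, draft_dists[i][x] = d(x) at position i
    target_dists: K+1 lists, target_dists[i][x] = M(x) at position i
    proposed:     K token ids that were sampled from the draft
    """
    emitted = []
    for i, x in enumerate(proposed):
        d, m = draft_dists[i], target_dists[i]
        # Accept x with probability min(1, M(x)/d(x)).
        # d[x] > 0 because x was sampled from d.
        if rng.random() < min(1.0, m[x] / d[x]):
            emitted.append(x)
            continue
        # Rejected: resample from the residual max(0, M - d).
        # random.choices normalizes the weights for us.
        residual = [max(0.0, mi - di) for mi, di in zip(m, d)]
        emitted.append(rng.choices(range(len(residual)), weights=residual)[0])
        return emitted  # everything after the rejection is discarded
    # All K accepted: take one bonus token from the target's next distribution.
    last = target_dists[len(proposed)]
    emitted.append(rng.choices(range(len(last)), weights=last)[0])
    return emitted

# The target gives token 0 zero probability, so it is always rejected
# and the residual puts all its mass on token 1.
print(speculative_step([[1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]], [0]))  # [1]
```

Each call emits between 1 and K+1 tokens, which is why a cycle always makes progress even when the first proposal is rejected.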

Speed math

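# Without spec: T target forward passes for T tokens
# With spec: ~T/(k_eff + 1) target forward passes, where k_eff = accepted draft tokens per cycle
#   (each cycle also yields one token from the target: the replacement or the bonus token)
#
# Plus draft cost: K draft forward passes per cycle, each cheap (small model)
# Net speedup: ~(k_eff + 1) / (1 + K * cost_draft/cost_target)

For a draft whose forward pass costs about 1/10 of the target's, with K=4 proposals and a 75% per-token acceptance rate: k_eff + 1 ≈ 3.05 tokens per cycle against a cost of 1.4 target passes, so roughly a 2.2× speedup. Real workloads see 1.5-3× depending on how closely the prompts resemble the draft's training data.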
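To make the arithmetic concrete, here is a small sketch of the estimate. It assumes each draft token is accepted independently with the same probability `alpha`, which is a simplification, and that cost is measured in units of one target forward pass; the function names are chosen for this example.

```python
def expected_tokens_per_cycle(alpha, k):
    """Expected tokens emitted per cycle (accepted draft tokens plus one
    token from the target), assuming each draft token is accepted
    independently with probability alpha."""
    if alpha == 1.0:
        return k + 1.0
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimated_speedup(alpha, k, cost_ratio):
    """cost_ratio = cost of one draft pass / cost of one target pass.
    Each cycle pays for k draft passes and one target pass."""
    return expected_tokens_per_cycle(alpha, k) / (1.0 + k * cost_ratio)

print(round(expected_tokens_per_cycle(0.75, 4), 2))  # 3.05
print(round(estimated_speedup(0.75, 4, 0.1), 2))     # 2.18
```

The estimate ignores batching effects and any overhead from running two models, so treat it as an upper-level guide for choosing K rather than a prediction.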

Picking a draft model

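Best: same model family, much smaller, e.g. Llama 70B target + Llama 7B draft, or Phi-3 medium target + Phi-3 mini draft. For the acceptance rule above to apply token by token, the draft needs to share the target's tokenizer and vocabulary. Agreement with the target matters more than raw size: the closer the draft's distribution is to the target's, the higher the acceptance rate.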
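One way to quantify "closer to target": under the acceptance rule, the chance that a token sampled from d is accepted works out to the sum over tokens of min(M(x), d(x)), i.e. the overlap between the two distributions. The sketch below computes it for toy three-token distributions; the numbers are invented for illustration.

```python
def acceptance_rate(target_dist, draft_dist):
    """Probability that a token sampled from draft_dist passes verification:
    sum_x d(x) * min(1, M(x)/d(x)) = sum_x min(M(x), d(x))."""
    return sum(min(m, d) for m, d in zip(target_dist, draft_dist))

target = [0.7, 0.2, 0.1]
close_draft = [0.6, 0.3, 0.1]   # similar shape to the target
poor_draft = [0.1, 0.2, 0.7]    # puts most mass on the wrong token

print(round(acceptance_rate(target, close_draft), 2))  # 0.9
print(round(acceptance_rate(target, poor_draft), 2))   # 0.4
```

Averaging this overlap across positions in a sample of representative prompts gives a rough way to compare candidate draft models before deploying one.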

Implementations

vLLM, TGI, TensorRT-LLM, and llama.cpp all support speculative decoding in 2026. Models with multi-token prediction heads (DeepSeek V3's built-in MTP module, or Medusa-style heads added to an existing model) remove the need for a separate draft model. In production inference engines it is close to plug-and-play.

Spec decoding: small draft proposes, big model verifies in parallel. Exact-distribution preserving. 1.5-3× faster.