Sampling finds A good sequence. Beam search aims for THE highest-probability sequence. Used in machine translation, code completion, and structured outputs. Slower than greedy but more thorough. The math: maintain top-K partial sequences instead of one.
The procedure
def beam_search(model, prompt, K=5, max_len=100):
beams = [(prompt, 0.0)] # (sequence, log_prob)
for _ in range(max_len):
candidates = []
for seq, lp in beams:
logits = model(seq)[-1]
log_probs = log_softmax(logits)
for token, tp in topK(log_probs, K):
candidates.append((seq+[token], lp+tp))
beams = topK(candidates, K, key=lambda x: x[1])
if all(seq[-1] == EOS for seq, _ in beams): break
return beams[0][0]At each step: extend each beam by K candidates. Keep top K total. Final answer: the highest-scoring complete sequence. K is the beam width — typical 4-10.
Log probabilities not probabilities
Probabilities multiply; log probabilities add. Adding is numerically stable; multiplying tiny floats vanishes. Always work in log-probability space for any search over sequences.
Length normalization
# Without normalization: short sequences win unfairly
# (more steps = more negative log probs added)
#
# Standard: score = log_prob / length^α
# α typically 0.6-0.7 for translationWithout length penalty, beam search prefers short outputs. Common fix: normalize by length raised to a power. Less needed for chat where EOS naturally ends things; critical for translation, summarization.
Speed cost
Beam K: K× slower than greedy (must batch K hypotheses through model). Memory: K× KV caches. Skips sampling diversity. Right for: tasks where 'the best' is well-defined. Wrong for: chat (loses creativity), free-form generation.
Modern picture
Beam search dominated 2014-2020 for translation, summarization. Now: sampling + temperature is preferred for most LLM use cases. Beam search remains for: machine translation production (NMT), structured code generation (some implementations), constrained generation.