▶ Interactive Lab

Multi-Head Attention

See how heads specialize on different patterns.

Each head sees the same input but learns different attention patterns.

What you're seeing

Multi-head attention runs N attention computations in parallel, each on its own learned query, key, and value projections of the same input. Because the projections differ, each head can specialize: one may track syntax, another semantic similarity, another positional patterns.
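To make the per-head patterns concrete, here is a minimal NumPy sketch that computes one attention-weight matrix per head. It is an illustration, not the lab's actual code: the function name `head_attention_patterns`, the random weights, and the sizes are all assumptions, and no causal mask is applied.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_attention_patterns(x, w_q, w_k, n_heads):
    """Return attention weights of shape (n_heads, seq_len, seq_len).

    Hypothetical helper for illustration only.
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # Project the input, then split the feature axis into heads:
    # (seq_len, d_model) -> (n_heads, seq_len, d_head)
    q = (x @ w_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product scores, computed independently for each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    return softmax(scores, axis=-1)

# Random weights stand in for learned ones (an assumption for the demo).
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_model, d_model))

patterns = head_attention_patterns(x, w_q, w_k, n_heads)
print(patterns.shape)  # (4, 6, 6)
```

Every head receives the same `x`, yet each slice `patterns[h]` is a different attention map because each head uses a different slice of the projection weights.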

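The head outputs are concatenated and passed through a final linear projection back to the model dimension. Grouped-query attention (GQA), used in Llama 3 and the larger Llama 2 models, lets several query heads share one key/value head, so query heads outnumber K/V heads. This shrinks the KV cache while preserving most of the multi-head benefits.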
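The concatenate-and-project step and the K/V sharing in GQA can be sketched together in NumPy. This is a simplified illustration under stated assumptions: the function name `grouped_query_attention`, the random weights, and the head counts are hypothetical, there is no causal mask or KV cache, and real implementations differ in layout and detail.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(x, w_q, w_k, w_v, w_o, n_q_heads, n_kv_heads):
    """Hypothetical GQA sketch; n_q_heads must be a multiple of n_kv_heads."""
    seq_len, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads  # query heads per shared K/V head
    q = (x @ w_q).reshape(seq_len, n_q_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, n_kv_heads, d_head).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, n_kv_heads, d_head).transpose(1, 0, 2)
    # Share each K/V head across `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ v  # (n_q_heads, seq_len, d_head)
    # Concatenate the heads along the feature axis, then project back.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, n_q_heads * d_head)
    return concat @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, n_q_heads, n_kv_heads = 6, 16, 4, 2
d_head = d_model // n_q_heads
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, n_q_heads * d_head))
w_k = rng.normal(size=(d_model, n_kv_heads * d_head))  # fewer K/V columns
w_v = rng.normal(size=(d_model, n_kv_heads * d_head))
w_o = rng.normal(size=(n_q_heads * d_head, d_model))

out = grouped_query_attention(x, w_q, w_k, w_v, w_o, n_q_heads, n_kv_heads)
print(out.shape)  # (6, 16)
```

With 4 query heads and 2 K/V heads, the keys and values that would be cached per token are half the size of the standard multi-head case. Setting `n_kv_heads` equal to `n_q_heads` recovers ordinary multi-head attention.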

★ KEY TAKEAWAY
Different heads learn different patterns: position, identity, syntax, semantics. Multi-head ≠ single big head.
▶ WHAT TO TRY
  • Increase Heads to 8 — see distinct patterns per head.
  • Click Resample to see new initializations.