Attention Mechanism — Transformer Foundation

Scaled dot-product

Q, K, V from linear projections. QK^T scores. Scale by √d to keep softmax gradient healthy. Softmax → weights.

Advertisement

Multiple attention 'heads' in parallel, each with own projections. Different heads learn different relations.

Advertisement

Q, K, V all from same input. Each position attends to all others (context window).

O(N² · d) time + memory. N² is the transformer bottleneck. FlashAttention, sparse attention, RWKV explore alternatives.