Scaled dot-product
Q, K, V from linear projections. QK^T scores. Scale by √d to keep softmax gradient healthy. Softmax → weights.
Advertisement
Multi-head
Multiple attention 'heads' in parallel, each with own projections. Different heads learn different relations.
Advertisement
Self-attention
Q, K, V all from same input. Each position attends to all others (context window).
Complexity
O(N² · d) time + memory. N² is the transformer bottleneck. FlashAttention, sparse attention, RWKV explore alternatives.