Self-attention scores are dot products between query and key vectors. Understanding the geometry of dot products — angle, magnitude, scaling — clarifies why attention works and why we divide by sqrt(d_k).

Advertisement

Dot product as angle

⟨a, b⟩ = |a| · |b| · cos(θ)

If both vectors have unit norm, dot product is exactly the cosine of the angle between them. Range: [-1, 1]. 1 = same direction. 0 = perpendicular. -1 = opposite. Bigger dot product = more aligned vectors.

Why magnitude matters

With unnormalized vectors, the magnitude can dominate. A query vector of norm 10 dotted with a key vector of norm 10 gives a 100× larger score than two unit vectors at the same angle. This is why we sometimes normalize (cosine similarity) and why attention scales (next section).

Advertisement

The sqrt(d_k) scaling

Attention(Q, K, V) = softmax(Q·Kᵀ / sqrt(d_k)) · V

Each entry of Q·Kᵀ is a sum of d_k products. If Q, K have entries with variance 1, the sum has variance d_k → standard deviation sqrt(d_k). Without scaling, large d_k → large logits → softmax saturates → gradients vanish. Dividing by sqrt(d_k) keeps the logit distribution well-behaved.

Geometric intuition for attention

Q vector for token i 'asks': which keys are like me? Compute dot product with every key. Soft-pick by softmax. Sum the corresponding values weighted by attention probabilities. The whole operation is a soft, differentiable lookup based on geometric similarity in d_k-dimensional space.

Why high-dim space is unintuitive

In high dimensions, random vectors are almost always nearly orthogonal (cos θ ≈ 0). That's why d_model is large enough (768+) for many distinct concepts to fit without interference. Small SLMs (d_model=512) have less room; this constrains what they can represent.

Dot product = magnitude × cos(angle). Scale by sqrt(d_k) to keep softmax sane. High-d space has room for many concepts.