▶ Interactive Lab

Attention Score Matrix

For a sequence of N tokens, Q · Kᵀ produces an N × N matrix of similarities.

Each cell [i,j] is the attention score (similarity) from query i to key j.

What you're seeing

Q and K are projections of the token embeddings. Scores = Q·Kᵀ / sqrt(d_k), where d_k is the dimension of each key vector. A softmax over each row turns the scores into attention weights.
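
As a rough illustration of the formula above, the following sketch computes the scaled scores and the row-wise softmax with plain Python lists. The function name attention_weights and the toy Q and K values are assumptions made for this example; they are not taken from the lab itself.

```python
import math

def attention_weights(Q, K):
    """Row-wise softmax of Q·Kᵀ / sqrt(d_k), with Q and K given as lists of rows."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # One row of scores: dot product of this query with every key, then scaled.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Subtract the row maximum before exponentiating for numerical stability.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical toy input: N = 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = attention_weights(Q, K)
for row in W:
    print([round(w, 3) for w in row])
```

Each printed row is a probability distribution over the three keys, so its entries sum to 1.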

Causal mask: scores above the diagonal are set to -∞, so after the softmax each token attends only to itself and earlier tokens (as in autoregressive LMs).
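
A minimal sketch of the causal mask, assuming the scores are already available as an N × N list of finite numbers. The function name causal_attention_weights and the example score values are hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Mask entries above the diagonal of an N × N score matrix, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Keys j > i lie above the diagonal and are replaced by -inf.
        row = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(row)  # finite, because the diagonal entry is never masked
        exps = [math.exp(s - m) for s in row]  # math.exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical 3 × 3 score matrix.
scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -0.4, 0.0]]
W = causal_attention_weights(scores)
for row in W:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row has all of its weight on the diagonal. Every masked position receives a weight of exactly zero.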

★ KEY TAKEAWAY
Attention scores = Q·Kᵀ, optionally scaled by 1/sqrt(d_k). Each cell [i,j] is how strongly query i matches key j. Softmax over each row turns the scores into a distribution over keys.
▶ WHAT TO TRY
  • Toggle Causal mask to see how -∞ above the diagonal forces autoregressive behavior.
  • Toggle Scale by sqrt(d_k) to see how it keeps the softmax from saturating.
  • Click Resample Q, K for new patterns.
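
The effect of the scaling toggle can be approximated offline. The sketch below assumes Q and K are filled with independent standard-normal values, which may differ from how the lab samples them; in that setting a dot product over d_k components has variance of about d_k, so unscaled scores grow with d_k and the softmax concentrates on one key. The helper mean_peak_weight, the sizes N = 8 and d_k = 64, and the seed are all choices made for this example.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax of a list of finite numbers."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def mean_peak_weight(Q, K, scale):
    """Average, over queries, of the largest attention weight in each row."""
    d_k = len(K[0])
    divisor = math.sqrt(d_k) if scale else 1.0
    peaks = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / divisor for k in K]
        peaks.append(max(softmax(scores)))
    return sum(peaks) / len(peaks)

random.seed(0)  # arbitrary seed, only for repeatability
N, d_k = 8, 64  # hypothetical sizes
Q = [[random.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(N)]
K = [[random.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(N)]

unscaled = mean_peak_weight(Q, K, scale=False)
scaled = mean_peak_weight(Q, K, scale=True)
print(f"mean peak weight without scaling: {unscaled:.3f}")
print(f"mean peak weight with scaling:    {scaled:.3f}")
```

A mean peak weight near 1 means most rows put almost all of their weight on a single key, which is the saturation the toggle is meant to show. Dividing by sqrt(d_k) lowers the peak and spreads the weight across more keys.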