▶ Interactive Lab

Attention Mask Visualizer

See causal, bidirectional, and prefix-LM attention masks side by side.

Each row = query position; each column = key position. Green = attends; dark = masked.

What you're seeing

The attention mask determines which key positions each query position is allowed to attend to. Masked positions receive zero attention weight.
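As a rough illustration of what "masked" means in practice, the sketch below applies a boolean mask inside a row-wise softmax. The function name `masked_softmax` and the all-zero scores are assumptions chosen for the example, not part of this lab's code; real implementations usually add a large negative value to masked scores before the softmax, which has the same effect.

```python
import math

def masked_softmax(scores, mask):
    """Row-wise softmax over `scores`, skipping positions where `mask` is False.

    scores: list of rows of floats (rows = queries, columns = keys).
    mask:   same shape, True where the query may attend to the key.
    """
    weights = []
    for score_row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(score_row, mask_row) if m]
        if not allowed:
            # Fully masked row: nothing to attend to, so return all zeros.
            weights.append([0.0] * len(score_row))
            continue
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical example: uniform scores with a 3x3 causal mask.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
mask = [
    [True, False, False],
    [True, True, False],
    [True, True, True],
]
weights = masked_softmax(scores, mask)
for row in weights:
    print([round(w, 3) for w in row])
```

With uniform scores, each query spreads its attention evenly over the keys it is allowed to see, and masked keys get exactly zero weight.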

Causal (GPT, Llama): each token attends only to itself and earlier tokens. Lower triangular, diagonal included.

Bidirectional (BERT): each token attends to all positions. Full matrix.

Prefix-LM (a variant explored in the T5 paper; the released T5 models are encoder-decoder): bidirectional over the prompt, causal over the generated tokens.

Sliding window (Phi, Mistral): causal, but each token attends only to the most recent W positions. Diagonal band.
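The four patterns can be sketched as small boolean matrices, with rows as query positions and columns as key positions. This is a minimal illustration, not the visualizer's own code: the function names, the `prefix_len` and `window` parameters, and the `#`/`.` text rendering are all assumptions made for the example. In particular, the sketch assumes the sliding window counts the current token as one of its W positions; some implementations use a slightly different convention.

```python
def causal_mask(n):
    # Query q may attend to key k when k is at or before q.
    return [[k <= q for k in range(n)] for q in range(n)]

def bidirectional_mask(n):
    # Every query may attend to every key.
    return [[True] * n for _ in range(n)]

def prefix_lm_mask(n, prefix_len):
    # Keys inside the prefix are visible to every query;
    # keys after the prefix follow the causal rule.
    return [[k < prefix_len or k <= q for k in range(n)] for q in range(n)]

def sliding_window_mask(n, window):
    # Causal, restricted to the last `window` positions
    # (assumption: the current token counts as one of them).
    return [[q - window < k <= q for k in range(n)] for q in range(n)]

def render(mask):
    # '#' = attends, '.' = masked
    return "\n".join("".join("#" if m else "." for m in row) for row in mask)

if __name__ == "__main__":
    n = 5
    examples = [
        ("causal", causal_mask(n)),
        ("bidirectional", bidirectional_mask(n)),
        ("prefix-LM (prefix_len=2)", prefix_lm_mask(n, 2)),
        ("sliding window (window=2)", sliding_window_mask(n, 2)),
    ]
    for name, mask in examples:
        print(name)
        print(render(mask))
        print()
```

Printing the masks for a short sequence makes the shapes easy to compare: a lower triangle, a full square, a triangle with a filled block in the prefix columns, and a narrow band along the diagonal.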

★ KEY TAKEAWAY
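The mask defines the attention pattern: causal for autoregressive LMs such as GPT and Llama, bidirectional for BERT, prefix-LM for models that read the prompt bidirectionally and then generate (a variant explored in the T5 paper), and sliding window for Mistral.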
▶ WHAT TO TRY
  • Switch between the four mask types.
  • Notice that causal is the lower triangle, sliding window is a diagonal band, and bidirectional is the full matrix.