Transformer — Attention Is All You Need

Encoder block

Multi-head self-attention, then a position-wise FFN. Each sub-layer is wrapped in a residual connection + LayerNorm. Stack N times (N = 6 in the original).
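A minimal NumPy sketch of one encoder block and the N-times stack, using the original post-norm ordering. The parameter names (`wq`, `wk`, `wv`, `wo`, `w1`, `b1`, `w2`, `b2`), the toy sizes, and the random initialization are assumptions for illustration. LayerNorm's learned scale/shift, dropout, and padding masks are left out.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and shift omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, p, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)      # (heads, seq, seq)
    out = softmax(scores) @ v                                 # (heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)    # concatenate heads
    return out @ p["wo"]

def ffn(x, p):
    # Position-wise feed-forward network: linear, ReLU, linear.
    return np.maximum(0.0, x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]

def encoder_block(x, p, n_heads):
    # Post-norm: add the residual first, then apply LayerNorm.
    x = layer_norm(x + multi_head_self_attention(x, p, n_heads))
    x = layer_norm(x + ffn(x, p))
    return x

def init_params(rng, d_model, d_ff, scale=0.1):
    # Hypothetical random initialization, only to make the sketch runnable.
    return {
        "wq": rng.normal(0.0, scale, (d_model, d_model)),
        "wk": rng.normal(0.0, scale, (d_model, d_model)),
        "wv": rng.normal(0.0, scale, (d_model, d_model)),
        "wo": rng.normal(0.0, scale, (d_model, d_model)),
        "w1": rng.normal(0.0, scale, (d_model, d_ff)),
        "b1": np.zeros(d_ff),
        "w2": rng.normal(0.0, scale, (d_ff, d_model)),
        "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
d_model, d_ff, n_heads, n_layers, seq_len = 8, 32, 2, 3, 5
layers = [init_params(rng, d_model, d_ff) for _ in range(n_layers)]

x = rng.normal(size=(seq_len, d_model))  # stand-in for embedded input tokens
for p in layers:                         # "stack N times"
    x = encoder_block(x, p, n_heads)
print(x.shape)  # (5, 8)
```

Each block maps a (seq, d_model) array to the same shape, which is what lets the blocks be stacked.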

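Decoder block: masked self-attention, then cross-attention over the encoder output, then FFN, each with residual + LayerNorm. Stack N times. The causal mask hides later positions, so every next-token prediction in a sequence can be trained in one parallel pass.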
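The causal mask is the piece that is easiest to get wrong, so here is a small NumPy sketch of it alone (not a full decoder block). The function names and the random scores are assumptions for illustration. Position i may attend to positions j <= i only, so perturbing a later position cannot change earlier rows.

```python
import numpy as np

def causal_mask(seq_len):
    # True where query position i is allowed to see key position j (j <= i).
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores, mask):
    # Disallowed positions get -inf, so they receive exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len = 4
scores = rng.normal(size=(seq_len, seq_len))  # stand-in for q @ k.T / sqrt(d)
weights = masked_attention_weights(scores, causal_mask(seq_len))
print(np.round(weights, 2))  # lower-triangular; each row sums to 1

# Changing the scores for the last key leaves rows 0..2 untouched,
# because those queries never see that key.
perturbed = scores.copy()
perturbed[:, -1] += 5.0
weights_perturbed = masked_attention_weights(perturbed, causal_mask(seq_len))
print(np.allclose(weights[:-1], weights_perturbed[:-1]))  # True

# Next-token training pairs: inputs are tokens[:-1], targets are tokens[1:].
tokens = ["<s>", "the", "cat", "sat"]
pairs = list(zip(tokens[:-1], tokens[1:]))
print(pairs)  # [('<s>', 'the'), ('the', 'cat'), ('cat', 'sat')]
```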

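Self-attention alone is position-agnostic: permuting the input tokens just permutes the outputs. Add sinusoidal (original) or learned position embeddings to the input. RoPE (rotary), which rotates queries and keys by position inside attention, is the common modern default.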
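A sketch of the sinusoidal variant in NumPy: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos of the same angle. The function name is an assumption, and the sketch assumes an even `d_model`. Learned embeddings and RoPE are not shown.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Assumes d_model is even, so sin and cos columns pair up.
    positions = np.arange(seq_len)[:, None]             # (seq, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=6, d_model=8)
print(pe.shape)  # (6, 8)
print(pe[0])     # position 0: sin(0) = 0 on even dims, cos(0) = 1 on odd dims
```

The table is added to the token embeddings before the first block, e.g. `x = token_embeddings + sinusoidal_positions(seq_len, d_model)`, where `token_embeddings` is a hypothetical (seq_len, d_model) array.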

Original: LayerNorm after the residual add (post-norm). Pre-norm (LayerNorm on the sub-layer input, before the block) trains more stably, especially in deep stacks, and is the modern default.
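The two orderings differ by one line. A minimal NumPy sketch, where `sublayer` stands for either attention or the FFN and the function names are assumptions. The zero sub-layer is a toy case chosen to show that pre-norm leaves the residual path untouched, while post-norm renormalizes it at every block.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization (learned scale and shift omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_step(x, sublayer):
    # Original ordering: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

def pre_norm_step(x, sublayer):
    # Pre-norm ordering: x + Sublayer(LayerNorm(x)).
    return x + sublayer(layer_norm(x))

def zero_sublayer(h):
    # Toy sub-layer that contributes nothing.
    return np.zeros_like(h)

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 0.0, -10.0, 0.0]])

pre = pre_norm_step(x, zero_sublayer)
post = post_norm_step(x, zero_sublayer)
print(np.array_equal(pre, x))  # True: the input passes through unchanged
print(np.allclose(post, x))    # False: the residual stream itself was normalized
```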