Encoder block

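Multi-head self-attention followed by a position-wise feed-forward network (FFN). Each of the two sub-layers is wrapped in a residual connection and LayerNorm. Stack N times.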
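
A minimal sketch of one encoder block in plain Python, assuming the original post-norm ordering, a ReLU FFN, no biases, and a LayerNorm without learned gain or bias. The helper names (`matmul`, `encoder_block`, `rand_matrix`) and the toy sizes are illustrative choices, not taken from any library.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(row, eps=1e-5):
    """Normalize one token vector to zero mean and (almost) unit variance."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def add(a, b):
    """Element-wise sum of two matrices (the residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def self_attention(x, wq, wk, wv, wo, n_heads):
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_head = len(q[0]) // n_heads
    heads = []
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)  # this head's feature slice
        qh = [r[cols] for r in q]
        kh = [r[cols] for r in k]
        vh = [r[cols] for r in v]
        k_t = [list(c) for c in zip(*kh)]  # transpose of kh
        scores = matmul(qh, k_t)
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        heads.append(matmul(weights, vh))
    # Concatenate the heads token by token, then mix them with wo.
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    return matmul(concat, wo)

def feed_forward(x, w1, w2):
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]  # ReLU
    return matmul(hidden, w2)

def encoder_block(x, p, n_heads):
    attn = self_attention(x, p["wq"], p["wk"], p["wv"], p["wo"], n_heads)
    x = [layer_norm(r) for r in add(x, attn)]   # residual, then LayerNorm
    ffn = feed_forward(x, p["w1"], p["w2"])
    return [layer_norm(r) for r in add(x, ffn)]  # residual, then LayerNorm

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_ff, n_heads, seq_len, n_layers = 8, 16, 2, 4, 3
x = rand_matrix(seq_len, d_model, rng)
for _ in range(n_layers):  # "stack N times": each layer has its own weights
    params = {
        "wq": rand_matrix(d_model, d_model, rng),
        "wk": rand_matrix(d_model, d_model, rng),
        "wv": rand_matrix(d_model, d_model, rng),
        "wo": rand_matrix(d_model, d_model, rng),
        "w1": rand_matrix(d_model, d_ff, rng),
        "w2": rand_matrix(d_ff, d_model, rng),
    }
    x = encoder_block(x, params, n_heads)
```

The block maps a `seq_len x d_model` input to an output of the same shape, which is what makes stacking possible.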

Decoder block

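Masked self-attention + cross-attention to the encoder output + FFN. Stack N times. The causal mask stops each position from attending to later positions, so every next-token prediction in a sequence can be trained in one parallel pass.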
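
A small sketch of the causal mask, assuming the common approach of setting disallowed scores to negative infinity before the softmax. The function names are illustrative, and the all-zero scores are a toy input chosen so the effect of the mask is easy to read.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over scores, with masked-out entries forced to weight 0."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        m = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy input: equal scores everywhere, so each row spreads its weight
# evenly over the positions it is allowed to see.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
# weights[0] is [1.0, 0.0, 0.0, 0.0]; weights[3] is [0.25, 0.25, 0.25, 0.25]
```

Because row i only ever sees positions 0..i, the output at position i can be trained to predict token i+1 without leaking the answer.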

Positional encoding

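Self-attention on its own has no notion of token order: permuting the inputs just permutes the outputs. Add sinusoidal or learned position embeddings to the token embeddings. RoPE (rotary position embedding), which instead rotates queries and keys by a position-dependent angle, is a common modern default.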
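
A sketch of both schemes in plain Python, under a few assumptions. The sinusoidal table uses the standard sin/cos formulation with base 10000. The `rope` helper rotates adjacent pairs (0,1), (2,3), ...; some implementations pair the two halves of the vector instead. The function names and the example vectors are illustrative.

```python
import math

def sinusoidal_encoding(seq_len, d_model, base=10000.0):
    """Table of shape seq_len x d_model: sin on even dims, cos on odd dims."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):  # i is the even dimension index
            angle = pos / (base ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of vec by an angle proportional to pos.

    Assumes len(vec) is even.
    """
    d = len(vec)
    out = list(vec)
    for i in range(0, d - 1, 2):
        angle = pos / (base ** (i / d))
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

pe = sinusoidal_encoding(seq_len=4, d_model=6)

# With RoPE the query-key score depends only on the position offset:
# positions (3, 1) and (7, 5) both have offset 2 and give the same score.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 7), rope(k, 5))
```

The sinusoidal table is added to the token embeddings once at the input, while the rotary transform is applied to queries and keys inside each attention layer.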

Pre-norm vs post-norm

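Original (post-norm): LayerNorm is applied after the residual addition. Pre-norm applies LayerNorm to the sub-layer input instead, inside the residual branch, which trains more stably, especially in deep stacks. Pre-norm is the modern default.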
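
The two orderings differ only in where LayerNorm sits relative to the residual addition. A minimal sketch, assuming a LayerNorm without learned parameters and using `toy_sublayer` as a hypothetical stand-in for attention or the FFN:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm(x, sublayer):
    """Original ordering: LayerNorm(x + sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    """Pre-norm ordering: x + sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the FFN.
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
y_post = post_norm(x, toy_sublayer)  # renormalized: mean is back to 0
y_pre = pre_norm(x, toy_sublayer)    # x passes through unchanged, plus an update
```

In the pre-norm form the input `x` reaches the output through an untouched identity path, which is the usual explanation for its more stable training. Since that path is never normalized, pre-norm stacks commonly add one final LayerNorm after the last block.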