Encoder block
Multi-head self-attention followed by a position-wise feed-forward network (FFN), each wrapped in a residual connection + LayerNorm. Stack N times.
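A minimal sketch of the block structure, in plain Python so the data flow is visible. To stay short it makes several assumptions that a real encoder does not: a single attention head with identity Q/K/V projections, a toy parameter-free FFN, and LayerNorm without learned scale and shift. The function names are illustrative only.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    # Assumption: one head, identity Q/K/V projections (no learned weights).
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        out.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return out

def layer_norm(v, eps=1e-5):
    # Normalizes one token vector; learned gain and bias are omitted.
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def ffn(v):
    # Toy position-wise FFN stand-in: ReLU, then a fixed scale.
    return [0.5 * max(0.0, a) for a in v]

def add_norm(x, sub_out):
    # Residual connection followed by LayerNorm, applied per token.
    return [layer_norm([a + b for a, b in zip(xr, sr)])
            for xr, sr in zip(x, sub_out)]

def encoder_block(x):
    x = add_norm(x, self_attention(x))
    x = add_norm(x, [ffn(r) for r in x])
    return x

def encoder(x, n_layers=2):
    # "Stack N times". Real layers each have their own parameters.
    for _ in range(n_layers):
        x = encoder_block(x)
    return x

tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 1.0, -2.0, 0.5]]
encoded = encoder(tokens, n_layers=2)
```

The output keeps the input shape (one vector per token), which is what allows identical blocks to be stacked.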
Decoder block
Masked self-attention + cross-attention over the encoder output + FFN. Stack N times. The causal mask stops each position from attending to later positions, which enables next-token training.
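A small sketch of how a causal mask might be applied to attention scores before the softmax. The helper names `causal_mask` and `masked_softmax` are hypothetical, and the scores are plain nested lists rather than tensors.

```python
import math

def causal_mask(n):
    # mask[i][j] is True where query position i may attend to key position j.
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    out = []
    for row, allowed in zip(scores, mask):
        # Disallowed positions get -inf, so they receive zero weight.
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        m = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform scores make the effect of the mask easy to see.
scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
```

With uniform scores, position 0 attends only to itself, position 1 splits its weight over positions 0 and 1, and no position puts weight on a later one.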
Positional encoding
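Attention is position-agnostic: on its own it has no notion of token order. Add sinusoidal or learned position embeddings to the input embeddings. RoPE (rotary), which rotates queries and keys instead of adding to the input, is the modern default.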
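A minimal sketch of the sinusoidal variant, using the usual sin/cos formulation with base 10000. The function name `sinusoidal_positions` and its parameters are illustrative, not taken from any library.

```python
import math

def sinusoidal_positions(seq_len, d_model, base=10000.0):
    # Even dimensions hold sin, odd dimensions hold cos of the same angle.
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (base ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positions(embeddings):
    # Position information is added element-wise to the token embeddings.
    pe = sinusoidal_positions(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(er, pr)] for er, pr in zip(embeddings, pe)]

pe = sinusoidal_positions(seq_len=4, d_model=4)
```

Identical token embeddings at different positions end up with different vectors after `add_positions`, which is what lets attention distinguish order.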
Pre-norm vs post-norm
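Original Transformer (post-norm): LayerNorm after the residual addition. Pre-norm: LayerNorm before each sub-layer, inside the residual branch. Pre-norm trains more stably and is the modern default.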
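The difference is only in where the normalization sits relative to the residual addition. A minimal sketch, assuming a LayerNorm without learned parameters and a stand-in `sublayer` function in place of attention or the FFN:

```python
import math

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def post_norm(x, sublayer):
    # Original ordering: LayerNorm(x + Sublayer(x))
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    # Pre-norm ordering: x + Sublayer(LayerNorm(x))
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def double(v):
    # Hypothetical stand-in for attention or the FFN.
    return [2.0 * a for a in v]

x = [1.0, 2.0, 3.0, 4.0]
post = post_norm(x, double)
pre = pre_norm(x, double)
```

In the pre-norm version the input `x` passes through the residual path unnormalized, so the skip connection from block input to block output is an identity path. In the post-norm version every block output is renormalized.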