A modern transformer block is a small, fixed set of components: norm, attention, residual, norm, MLP, residual. Knowing what each does and why it's in that order is the foundation for reading any new architecture paper.

Pre-norm vs post-norm

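Pre-norm: the normalization layer (LayerNorm, or RMSNorm in most recent models) sits before each sub-block, so the residual path carries the input through unchanged. Training is more stable and scales more easily to many layers. Post-norm: the normalization sits after the residual add; it can reach slightly better final quality but is harder to train at depth. Most modern LLMs use pre-norm.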
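The difference in ordering is easy to see in code. The following is a minimal NumPy sketch, not taken from any particular model: `layer_norm` omits the learned scale and shift, and `sublayer` is a hypothetical stand-in (a single matrix multiply) for attention or the MLP.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # Learned scale/shift parameters are omitted to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_step(x, sublayer):
    # Pre-norm: normalize only the sub-block's input.
    # The residual path (the bare x) is never normalized.
    return x + sublayer(layer_norm(x))

def post_norm_step(x, sublayer):
    # Post-norm: add the residual first, then normalize the sum.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # 4 tokens, hidden dim 8 (illustrative sizes)
W = rng.normal(size=(8, 8)) * 0.1      # hypothetical stand-in weights
sublayer = lambda h: h @ W

pre_out = pre_norm_step(x, sublayer)
post_out = post_norm_step(x, sublayer)
```

In the pre-norm version, `pre_out - x` is exactly the sub-block's contribution, so the input survives every layer as an identity term. In the post-norm version the sum itself is rescaled at every layer, which is one common explanation for why deep post-norm stacks are harder to train.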

Attention sub-block

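RMSNorm → linear projections (Q, K, V) → attention → output projection → residual add. With multi-head attention, the heads run in parallel inside the attention step, and their outputs are concatenated before the output projection.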
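As a rough illustration of that pipeline, here is a NumPy sketch of a single attention sub-block. Several things are assumptions made for the example rather than anything stated above: the causal mask (typical for decoder-only LLMs), the omission of RMSNorm's learned gain, the absence of biases and positional encoding, and all weight names and sizes.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Divide each token vector by its root-mean-square (learned gain omitted).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_sub_block(x, Wq, Wk, Wv, Wo, n_heads):
    seq, d = x.shape
    head_dim = d // n_heads
    h = rms_norm(x)                                        # 1. norm

    def split_heads(t):
        # (seq, d) -> (n_heads, seq, head_dim) so all heads are computed at once
        return t.reshape(seq, n_heads, head_dim).transpose(1, 0, 2)

    q, k, v = (split_heads(h @ W) for W in (Wq, Wk, Wv))   # 2. Q, K, V projections
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # 3. scaled dot-product scores
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)             #    causal mask (assumed)
    context = softmax(scores) @ v                          #    weighted sum of values
    merged = context.transpose(1, 0, 2).reshape(seq, d)    #    concatenate heads
    return x + merged @ Wo                                 # 4. output projection + residual

rng = np.random.default_rng(0)
d, seq, n_heads = 8, 5, 2                                  # illustrative sizes
x = rng.normal(size=(seq, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
out = attention_sub_block(x, Wq, Wk, Wv, Wo, n_heads)      # shape (seq, d)
```

The heads are never looped over: reshaping to `(n_heads, seq, head_dim)` lets one batched matrix multiply handle all of them, which is what "parallel" means here.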

MLP sub-block

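RMSNorm → up-projection (often 4x hidden dim) → activation (SwiGLU is the current default) → down-projection → residual add. SwiGLU is a gated activation: it adds a second, gating projection alongside the up-projection, so models that use it usually shrink the expansion to about 8/3x to keep the parameter count the same. Either way, roughly two-thirds of each block's parameters live here, assuming standard multi-head attention and ignoring embeddings.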
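A small NumPy sketch can make both the data flow and the parameter arithmetic concrete. It assumes a SwiGLU MLP with no biases and no learned RMSNorm gain; the names `W_gate`, `W_up`, `W_down` and the helper `block_param_counts` are hypothetical, chosen only for this example.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Divide each token vector by its root-mean-square (learned gain omitted).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def silu(z):
    # SiLU / swish: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_mlp_sub_block(x, W_gate, W_up, W_down):
    h = rms_norm(x)                          # norm
    gated = silu(h @ W_gate) * (h @ W_up)    # gate and up-projection, multiplied elementwise
    return x + gated @ W_down                # down-projection + residual

def block_param_counts(d, mlp_hidden, gated):
    # Weight matrices only; biases, norm gains and embeddings are ignored.
    attn = 4 * d * d                                  # Q, K, V and output projections
    mlp = (3 if gated else 2) * d * mlp_hidden        # gated MLPs have a third matrix
    return attn, mlp

rng = np.random.default_rng(0)
d, hidden, seq = 6, 16, 4                    # tiny illustrative sizes
x = rng.normal(size=(seq, d))
W_gate = rng.normal(size=(d, hidden)) * 0.1
W_up = rng.normal(size=(d, hidden)) * 0.1
W_down = rng.normal(size=(hidden, d)) * 0.1
y = swiglu_mlp_sub_block(x, W_gate, W_up, W_down)     # shape (seq, d)

attn, mlp_4x = block_param_counts(768, 4 * 768, gated=False)
_, mlp_swiglu = block_param_counts(768, 8 * 768 // 3, gated=True)
mlp_share = mlp_4x / (attn + mlp_4x)         # fraction of block weights in the MLP
```

With a hidden size of 768, the plain 4x MLP and the 8/3x SwiGLU MLP hold the same number of weights, and in both cases the MLP is twice the size of the attention projections, which is where the two-thirds figure comes from.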

Pre-norm, attention sub-block, MLP sub-block, residuals everywhere. Most architecture papers change one of these pieces, so knowing the baseline makes the change easy to spot.