LayerNorm vs RMSNorm — Belgavi.AI Lab

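LayerNorm: subtract the mean, divide by the standard deviation. RMSNorm: divide by the root mean square (RMS) only. Roughly 10% faster, though the exact gain depends on hardware and implementation.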

What you're seeing

Layer Normalization stabilizes training by normalizing the activations within a layer (across the feature dimension) to zero mean and unit variance.
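
As a rough illustration of the normalization described above, here is a minimal sketch in plain Python. The function name layer_norm and the eps value are assumptions made for this example, and the learnable gain and bias that real implementations usually apply afterward are left out.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to (approximately) zero mean, unit variance.

    Hypothetical helper for illustration; eps is an assumed stabilizer
    that guards against division by zero.
    """
    n = len(x)
    mean = sum(x) / n
    # Population variance over the feature dimension.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # Center first, then scale.
    return [(v - mean) * inv_std for v in x]

y = layer_norm([1.0, 2.0, 3.0, 4.0])
```

Two statistics are computed here (mean and variance), which is the cost RMSNorm reduces.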

RMSNorm drops the mean centering and normalizes by the RMS alone. It is faster (one fewer statistic to compute) and, in practice, about equally effective. It is standard in Llama, Mistral, and many other recent open LLMs.
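
A minimal sketch of the RMS-only normalization, again in plain Python. The name rms_norm and the eps value are assumptions for this example, and the learnable gain used in practice is omitted.

```python
import math

def rms_norm(x, eps=1e-5):
    """Scale one feature vector so its root mean square is approximately 1.

    Hypothetical helper for illustration; no mean is subtracted.
    """
    n = len(x)
    # Single statistic: the mean of the squared activations.
    mean_sq = sum(v * v for v in x) / n
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Scale only; the values are not centered.
    return [v * inv_rms for v in x]

y = rms_norm([1.0, 2.0, 3.0, 4.0])
```

Only one statistic is needed, and because nothing is subtracted, the signs and relative sizes of the inputs are preserved.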

★ KEY TAKEAWAY

LayerNorm centers AND scales. RMSNorm only scales. RMSNorm is roughly 10% faster, with comparable quality on transformers.

▶ WHAT TO TRY

Click Resample to see how each affects different inputs.
Note that the mean of the RMSNorm output is generally non-zero: the input's mean is not removed, only rescaled by 1/RMS.
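
To check the effect on the mean without the interactive demo, a small self-contained sketch can stand in for resampling. The helper names, the seed, and the input distribution (Gaussian with mean 2) are all assumptions chosen for illustration.

```python
import math
import random

def mean(x):
    return sum(x) / len(x)

def layer_norm(x, eps=1e-5):
    # Center, then scale by the standard deviation (hypothetical helper).
    m = mean(x)
    var = sum((v - m) ** 2 for v in x) / len(x)
    return [(v - m) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    # Scale by the root mean square only (hypothetical helper).
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

random.seed(0)  # change the seed to "resample"
x = [random.gauss(2.0, 1.0) for _ in range(8)]

ln_out = layer_norm(x)
rms_out = rms_norm(x)
rms_in = math.sqrt(sum(v * v for v in x) / len(x))

print("LayerNorm output mean:", mean(ln_out))
print("RMSNorm output mean:  ", mean(rms_out))
print("input mean / RMS:     ", mean(x) / rms_in)
```

The last two printed values should agree closely, while the LayerNorm output mean should be near zero.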