LayerNorm: subtract the mean, divide by the standard deviation. RMSNorm: divide by the RMS only. Roughly 10% faster, though the exact gain depends on model and hardware.
What you're seeing
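Layer Normalization stabilizes training by normalizing the activations within a layer, across the feature dimension, to zero mean and unit variance. A learned per-feature scale and shift is usually applied afterwards.
RMSNorm drops the mean centering and normalizes by the root mean square (RMS) alone. It is faster because it computes one fewer statistic, and in practice it works about as well. It is the standard choice in Llama, Mistral, and many other recent open LLMs.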
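The two definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind this demo: the function names, the `eps` value, and the sample input are assumptions, and the learned scale and shift parameters are left out so only the normalization step is visible.

```python
import math

def layer_norm(x, eps=1e-5):
    """Center to zero mean, then scale to unit variance (no learned gain/bias)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # population variance over features
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]

def rms_norm(x, eps=1e-5):
    """Scale by the root mean square only; the mean is never subtracted."""
    n = len(x)
    mean_sq = sum(v * v for v in x) / n
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms for v in x]

# Hypothetical feature vector for illustration.
x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)   # mean of ln is ~0, variance is ~1
rn = rms_norm(x)     # mean of squares of rn is ~1, mean stays positive
```

LayerNorm needs two statistics (mean and variance) while RMSNorm needs one (mean of squares), which is where the speed difference comes from.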
★ KEY TAKEAWAY
LayerNorm centers AND scales. RMSNorm only scales. RMSNorm is ~10% faster, with comparable quality on transformers.
▶ WHAT TO TRY
- Click Resample to see how each affects different inputs.
- Note the mean of the RMSNorm output is generally non-zero: the input's mean is not removed, only rescaled by 1/RMS.
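The experiment above can also be reproduced offline. The sketch below is an assumption-laden stand-in for the Resample button: it draws a hypothetical input with a non-zero mean, checks that the RMSNorm output mean equals the input mean divided by the input RMS, and shows that the two normalizations coincide once the input is already centered. The helper names, seed, and distribution parameters are illustrative only.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    n = len(x)
    mean_sq = sum(v * v for v in x) / n
    return [v / math.sqrt(mean_sq + eps) for v in x]

def mean(x):
    return sum(x) / len(x)

random.seed(0)  # fixed seed so a "resample" is repeatable
shifted = [random.gauss(3.0, 1.0) for _ in range(8)]  # input with non-zero mean

mean_in = mean(shifted)
rms_in = math.sqrt(mean([v * v for v in shifted]))
mean_out = mean(rms_norm(shifted))  # approximately mean_in / rms_in, not zero

# After centering, RMS equals the standard deviation, so both norms agree.
centered = [v - mean_in for v in shifted]
max_gap = max(abs(a - b) for a, b in zip(layer_norm(centered), rms_norm(centered)))

print(f"input mean {mean_in:.3f}, RMSNorm output mean {mean_out:.3f}")
print(f"LayerNorm output mean {mean(layer_norm(shifted)):.3e}")
print(f"max difference on centered input {max_gap:.3e}")
```

Changing the mean passed to `random.gauss` toward 0 shrinks the gap between the two outputs, which matches what the visualization shows for inputs that are already roughly centered.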