Transformer Training Loss Curves

Loss curves tell you whether a transformer training run is healthy long before the final eval. Knowing the common shapes, and what each one means, helps you catch problems hours into a run instead of days.

The healthy curve

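A steep initial drop in the first 100 to 1,000 steps as the model learns the basics (next-token distribution, common patterns), followed by a smooth, steadily decreasing curve. Per-step loss is noisy, so judge the trend on a smoothed curve rather than expecting every step to be lower than the last. Small kinks can appear where the LR schedule changes phase, such as the end of warmup. Steady, predictable, boring.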
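
The shape is easier to read once the per-step noise is smoothed out. A minimal sketch in plain Python, assuming the per-step losses are already in a list; `ema_smooth` and its `beta` default are illustrative choices, not taken from any particular training framework:

```python
def ema_smooth(losses, beta=0.9):
    """Exponential moving average of a loss series, with bias correction.

    beta is an assumed smoothing factor: higher is smoother but lags more.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    smoothed = []
    avg = 0.0
    for step, loss in enumerate(losses, start=1):
        avg = beta * avg + (1.0 - beta) * loss
        # Bias correction keeps the first few values from being pulled toward zero.
        smoothed.append(avg / (1.0 - beta ** step))
    return smoothed
```

Plot the smoothed series alongside the raw one; the raw curve shows spikes, the smoothed one shows the trend.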

Loss spikes

A sudden upward spike, sometimes followed by recovery, sometimes by NaN. Common causes: a learning rate that is too high, a bad batch (corrupt data), or numerical overflow. Loss spikes often coincide with gradient norm spikes. Skipping the bad batch and lowering the LR usually recovers the run; spikes that persist after that suggest an architectural issue.
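
One way to flag spikes automatically is to compare each new loss with the recent history. The sketch below is an assumption-laden example, not a standard API: `SpikeDetector`, the window size, and the 4-sigma threshold are all hypothetical and would need tuning per run. The same class could be fed gradient norms instead of losses.

```python
import math
from collections import deque
from statistics import mean, stdev


class SpikeDetector:
    """Flags values far above the trailing mean of recent non-spike values."""

    def __init__(self, window=50, threshold=4.0, min_history=10):
        self.history = deque(maxlen=window)
        self.threshold = threshold              # assumed: std devs above the mean
        self.min_history = max(min_history, 2)  # stdev needs at least 2 points

    def is_spike(self, loss):
        # NaN or inf always counts as a spike.
        if not math.isfinite(loss):
            return True
        if len(self.history) >= self.min_history:
            mu = mean(self.history)
            sigma = max(stdev(self.history), 1e-8)  # guard against zero spread
            if loss > mu + self.threshold * sigma:
                # Keep the spike out of the history so it does not inflate
                # the baseline used for later steps.
                return True
        self.history.append(loss)
        return False
```

In a training loop, a `True` result is the point to log the batch identifier so the batch can be inspected or skipped on restart.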

Plateaus

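Loss flattens, then resumes dropping. Often a sign the model has worked through one capability tier and is finding the next. Don't kill a run prematurely because of a plateau; check the gradient norms first. Gradient norms are almost never exactly zero, so compare them with their earlier values: if they are still in their usual range and have not collapsed toward zero, the model is still being updated and the plateau may well break.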
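
Before acting on a plateau, it helps to confirm the loss has actually flattened rather than just slowed. A rough sketch: fit a least-squares line to the most recent losses (ideally smoothed) and compare its slope with a tolerance. The helper names, `window`, and `tol` are assumptions for illustration; a sensible tolerance depends on the loss scale and logging interval.

```python
def loss_slope(values):
    """Least-squares slope of values against their index (units per step)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    x_mean = (n - 1) / 2.0
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def is_plateau(losses, window=200, tol=1e-4):
    """True if the last `window` losses have a slope within +/- tol."""
    if len(losses) < window:
        return False  # not enough evidence yet
    return abs(loss_slope(losses[-window:])) <= tol
```

`loss_slope` can also be applied to a logged gradient norm series: a strongly negative slope there is a hint that the norms are collapsing rather than holding steady.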

Divergence — the bad sign

Loss increases steadily instead of spiking and recovering. This is almost always a configuration bug (wrong LR schedule, wrong gradient clipping, wrong norm). Kill the run, fix the config, and restart. Don't 'wait it out': divergence rarely self-corrects.
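
To tell sustained divergence from a one-off spike, one option is to track the smoothed loss against its best value so far and raise an alarm only when it stays well above that for many steps in a row. This is a sketch under stated assumptions: the function name, `beta`, `margin`, and `patience` are hypothetical, and the relative margin assumes a positive loss such as cross-entropy.

```python
def first_divergence_step(losses, beta=0.98, margin=0.1, patience=100):
    """Return the index where divergence is declared, or None.

    Divergence here means: the bias-corrected EMA of the loss has been more
    than `margin` (relative) above its best value for `patience` steps in a row.
    """
    avg = 0.0
    best = float("inf")
    steps_above = 0
    for step, loss in enumerate(losses, start=1):
        avg = beta * avg + (1.0 - beta) * loss
        smoothed = avg / (1.0 - beta ** step)
        best = min(best, smoothed)
        if smoothed > best * (1.0 + margin):
            steps_above += 1
            if steps_above >= patience:
                return step - 1  # zero-based index into losses
        else:
            steps_above = 0  # recovered, so this was a spike, not divergence
    return None
```

A short spike resets the counter once the smoothed loss comes back down, so only a sustained rise triggers the alarm.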

Training and eval loss diverge

Training loss keeps dropping while eval loss starts rising: overfitting. In LLM pretraining this is rare because the dataset is so large. In fine-tuning it is common: train for fewer epochs, add regularization, or stop early. Use the checkpoint with the best eval loss, not the final one.
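
Picking the best-eval checkpoint and stopping early can be sketched as a small bookkeeping function. This assumes eval loss is logged as `(step, eval_loss)` pairs at each checkpoint; the function name, `patience`, and `min_delta` are illustrative, not from a specific library.

```python
def best_checkpoint(eval_log, patience=3, min_delta=0.0):
    """Return (best_step, stop_step) from a list of (step, eval_loss) pairs.

    stop_step is the step at which `patience` consecutive evals failed to
    improve on the best loss by more than min_delta, or None if that never
    happened.
    """
    best_loss = float("inf")
    best_step = None
    evals_without_improvement = 0
    for step, loss in eval_log:
        if loss < best_loss - min_delta:
            best_loss = loss
            best_step = step
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return best_step, step
    return best_step, None


log = [(100, 2.0), (200, 1.8), (300, 1.7), (400, 1.75), (500, 1.8), (600, 1.9)]
print(best_checkpoint(log))  # (300, 600)
```

Here eval loss bottoms out at step 300, so that checkpoint is the one to keep even though training ran to step 600.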

Healthy = smooth decline. Spikes = bad batches or LR. Plateaus are fine. Divergence = bug; restart with fix.