Loss curves during transformer training tell you whether training is healthy long before the final eval. Knowing the shapes — and what each one means — helps you catch problems hours instead of days into a run.
The healthy curve
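Steep initial drop in the first 100-1000 steps as the model learns basics (next-token distribution, common patterns). After that, a steadily decreasing trend: per-step loss is noisy, but the smoothed curve should keep falling, with progressively smaller gains. Small dips can appear at LR schedule transitions, such as the end of warmup or a step decay. Steady, predictable, boring.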
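Raw per-step loss is noisy enough that the shape is easier to judge on a smoothed curve. One common way to get that is an exponential moving average with bias correction. The sketch below is a minimal illustration, not a prescribed method; the function name ema_smooth and the default beta are assumptions chosen for the example.

```python
def ema_smooth(losses, beta=0.98):
    """Exponential moving average of a loss series, with bias correction.

    beta close to 1.0 smooths more heavily. The bias correction keeps the
    first few values from being pulled toward the zero initial state.
    """
    smoothed = []
    avg = 0.0
    for step, loss in enumerate(losses, start=1):
        avg = beta * avg + (1.0 - beta) * loss
        smoothed.append(avg / (1.0 - beta ** step))
    return smoothed


if __name__ == "__main__":
    # Hypothetical noisy but decreasing losses, for illustration only.
    raw = [4.0, 3.1, 3.4, 2.8, 2.9, 2.5, 2.7, 2.3]
    for step, (r, s) in enumerate(zip(raw, ema_smooth(raw, beta=0.8))):
        print(f"step {step}: raw={r:.2f} smoothed={s:.2f}")
```

A healthy run shows a smoothed curve that keeps falling even when individual steps bounce upward.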
Loss spikes
Sudden upward spike, sometimes followed by recovery, sometimes by NaN. Common causes: a learning rate that is too high, a bad batch (corrupt data), or numerical overflow. Spikes often coincide with gradient norm spikes, so log both. Skipping the bad batch and lowering the LR usually restores the run; spikes that keep recurring after that point to an architectural issue.
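Spikes can be flagged automatically by comparing each loss against a rolling baseline. The following is a rough sketch under assumed settings: the function name find_spikes, the window size, the threshold of 4 standard deviations, and the min_sigma floor are all illustrative choices, not values taken from any particular training framework.

```python
import math
from collections import deque
from statistics import mean, stdev


def find_spikes(losses, window=50, threshold=4.0, min_sigma=1e-3):
    """Return step indices whose loss is NaN/inf or far above the rolling baseline.

    A step is flagged when loss > rolling_mean + threshold * rolling_std.
    Flagged steps are kept out of the baseline so one spike does not
    inflate the statistics used to judge the following steps.
    """
    history = deque(maxlen=window)
    spikes = []
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            spikes.append(step)
            continue
        if len(history) >= 2:
            mu = mean(history)
            sigma = max(stdev(history), min_sigma)
            if loss > mu + threshold * sigma:
                spikes.append(step)
                continue
        history.append(loss)
    return spikes


if __name__ == "__main__":
    # Hypothetical loss trace with one spike and one NaN.
    trace = [2.0, 1.9, 2.1, 2.0, 1.95, 2.05, 6.0, 2.0, float("nan")]
    print(find_spikes(trace, window=5))
```

The same check can be run on logged gradient norms to see whether the two spike together.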
Plateaus
Loss flattens, then resumes dropping. Usually a sign the model has worked through one capability tier and is finding the next. Don't kill the run on a plateau prematurely. Check gradient norms first: if they are still nonzero and finite, the weights are still being updated and the run can break out of the plateau.
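The plateau check can be turned into a small helper that looks at both signals together. This is a sketch under stated assumptions: the name plateau_report, the window length, and the thresholds min_rel_change and grad_floor are hypothetical and would need tuning per run. It expects already-logged (ideally smoothed) losses and global gradient norms.

```python
import math


def plateau_report(losses, grad_norms, window=200,
                   min_rel_change=1e-3, grad_floor=1e-6):
    """Classify the most recent `window` steps of a run.

    Returns one of:
      "not enough data"             - fewer than window+1 losses or window norms
      "not a plateau"               - loss changed by more than min_rel_change
      "plateau, gradients alive"    - flat loss, norms finite and above grad_floor
      "plateau, gradients vanished" - flat loss, norms at or near zero or non-finite
    """
    if len(losses) <= window or len(grad_norms) < window:
        return "not enough data"
    old, new = losses[-window - 1], losses[-1]
    rel_change = (old - new) / max(abs(old), 1e-12)
    if abs(rel_change) >= min_rel_change:
        return "not a plateau"
    recent = grad_norms[-window:]
    alive = all(math.isfinite(g) and g > grad_floor for g in recent)
    return "plateau, gradients alive" if alive else "plateau, gradients vanished"


if __name__ == "__main__":
    flat = [3.0] * 11
    print(plateau_report(flat, [0.5] * 10, window=10))
    print(plateau_report(flat, [0.0] * 10, window=10))
```

Only the "gradients vanished" case is a reason to stop and investigate; the "gradients alive" case argues for patience.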
Divergence — the bad sign
Loss increases steadily over many steps, rather than spiking once and recovering. Almost always a configuration bug (wrong LR schedule, wrong gradient clipping, wrong normalization). Kill the run, fix the config, restart. Don't 'wait it out': divergence rarely self-corrects.
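One way to tell sustained divergence from a single spike is to fit a straight line to the recent losses and check whether the fitted rise is large relative to the starting level. The sketch below assumes this least-squares heuristic; the names trend_slope and is_diverging, the window, and the 5% threshold are illustrative assumptions.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (needs >= 2 points)."""
    n = len(values)
    x_mean = (n - 1) / 2.0
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def is_diverging(losses, window=500, min_rel_rise=0.05):
    """True if the fitted trend over the last `window` steps rises by more
    than min_rel_rise (as a fraction of the loss at the window start)."""
    if window < 2 or len(losses) < window:
        return False
    recent = losses[-window:]
    fitted_rise = trend_slope(recent) * (window - 1)
    return fitted_rise / max(abs(recent[0]), 1e-12) > min_rel_rise


if __name__ == "__main__":
    # Hypothetical steadily rising losses.
    rising = [2.0 + 0.01 * i for i in range(100)]
    print(is_diverging(rising, window=100))
```

Fitting over a window makes the check insensitive to a single outlier step, which is what separates this case from a spike.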
Training vs eval loss diverge
Training loss keeps dropping; eval loss starts rising. This is overfitting. In LLM pretraining it is rare, because the dataset is so large. In fine-tuning it is common: train for fewer epochs, add regularization, or stop early. Use the checkpoint with the best eval loss, not the final one.
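Picking the best-eval checkpoint and stopping early can be expressed as a simple patience rule over the eval losses recorded at each evaluation. This is a minimal sketch; the function name best_checkpoint and the patience and min_delta parameters are assumptions for illustration, not the API of any specific trainer.

```python
def best_checkpoint(eval_losses, patience=3, min_delta=0.0):
    """Scan eval losses in order and apply a patience-based early stop.

    Returns (best_index, stop_index):
      best_index - index of the lowest eval loss seen before stopping
      stop_index - index at which patience ran out, or None if it never did
    """
    best_index = None
    best_loss = float("inf")
    evals_without_improvement = 0
    for i, loss in enumerate(eval_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_index = i
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return best_index, i
    return best_index, None


if __name__ == "__main__":
    # Hypothetical eval losses from a fine-tuning run.
    history = [2.0, 1.5, 1.2, 1.3, 1.4, 1.6, 1.1]
    best, stop = best_checkpoint(history, patience=3)
    print(f"restore checkpoint {best}, stopped at eval {stop}")
```

Note that the rule stops before seeing the later, lower value in the example trace; a larger patience trades extra compute for a lower chance of stopping too soon.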