LLMs are trained to predict the next token. Cross-entropy is the standard loss. Knowing why it's chosen and how its gradient simplifies when combined with softmax makes the training loop transparent.
Information-theoretic motivation
H(p, q) = -sum over i of p[i] · log(q[i])
Cross-entropy H(p, q) measures how 'surprised' you are, on average, if you believed q but reality is p. For a fixed p, the minimum over q is achieved at q = p, where H(p, p) is just the entropy of p. For next-token prediction, p is the one-hot label (1.0 on the correct token, 0.0 elsewhere) and q is the model's predicted distribution.
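As a small illustration of the definition, the sketch below evaluates H(p, q) for a matching and a mismatched q. The function name cross_entropy and the example distributions are made up for illustration, and natural logarithms are assumed, so the result is in nats.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p[i] * log(q[i]), in nats.

    Terms with p[i] == 0 contribute nothing and are skipped,
    which also avoids evaluating log(0) for them.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions over three outcomes (not from any real model).
p = [0.7, 0.2, 0.1]
q_match = [0.7, 0.2, 0.1]   # belief equal to reality
q_off = [0.4, 0.4, 0.2]     # belief that differs from reality

h_match = cross_entropy(p, q_match)  # equals the entropy of p
h_off = cross_entropy(p, q_off)      # larger, because q_off != p
```

Any q other than p gives a larger value than h_match, which is the sense in which q = p is the minimum.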
Simplification with one-hot labels
H(one_hot(c), q) = -log(q[c])
Every term in the sum is zero except the one where p[i] = 1, so cross-entropy with a one-hot label is just the negative log probability of the correct token c. Minimizing the loss is equivalent to maximizing the log-likelihood of the right answer: this is standard maximum-likelihood training.
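One consequence worth checking by hand: a model that predicts a uniform distribution over V tokens has loss log(V) regardless of which token is correct. The sketch below assumes a hypothetical vocabulary size of 50,000 and an arbitrary target index; nll is an illustrative helper name.

```python
import math

def nll(q, c):
    """Cross-entropy against a one-hot label: -log of the probability at index c."""
    return -math.log(q[c])

V = 50_000                 # hypothetical vocabulary size
uniform = [1.0 / V] * V    # a model that has learned nothing
loss_uniform = nll(uniform, c=123)  # c=123 is arbitrary; every index gives log(V)
```

This gives a useful sanity check at the start of training: an untrained model's loss should sit near log(V).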
Combined with softmax
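loss = -log(softmax(z)[c])
     = -z[c] + log(sum over j of exp(z[j]))
Here z is the vector of logits and c is the index of the correct token. The second term, the log of the softmax denominator, is a 'log-sum-exp' (LSE): well studied, and numerically stable once the maximum logit is subtracted before exponentiating. PyTorch's F.cross_entropy(logits, target) fuses these operations for stability and speed; avoid computing softmax and then log separately, since a probability that underflows to 0 turns into log(0) = -inf.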
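To make the stability point concrete, here is a minimal pure-Python sketch of the max-subtraction trick for log-sum-exp. It is not how PyTorch implements its fused kernel; the function names and the extreme logits are chosen only for illustration.

```python
import math

def log_sum_exp(z):
    """log(sum_j exp(z[j])), computed by factoring out the largest logit.

    After subtracting m = max(z), every exponent is <= 0, so exp cannot overflow.
    """
    m = max(z)
    return m + math.log(sum(math.exp(zj - m) for zj in z))

def cross_entropy_from_logits(z, c):
    """loss = -z[c] + log_sum_exp(z), without forming the softmax explicitly."""
    return -z[c] + log_sum_exp(z)

# Logits this large break the naive formula: math.exp(1000.0) raises OverflowError.
z = [1000.0, 999.0, 995.0]
loss = cross_entropy_from_logits(z, c=0)

# Shifting every logit by the same constant leaves the loss unchanged.
loss_shifted = cross_entropy_from_logits([0.0, -1.0, -5.0], c=0)
```

The shift invariance shown by loss_shifted is exactly what the trick exploits: subtracting the maximum changes nothing mathematically but keeps every intermediate value representable.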
The miracle gradient
∂loss/∂z[i] = softmax(z)[i] - δ[i, c]
Here δ[i, c] is 1 when i = c and 0 otherwise, so the gradient w.r.t. the logits is just predicted_probs minus the one-hot label. It is simple, cheap to compute, and bounded: every component lies in [-1, 1]. This is why softmax + cross-entropy is the default for classification: the gradient is the prediction error.
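The formula can be checked numerically. The sketch below compares the analytic gradient, softmax(z) minus the one-hot label, against a central finite-difference estimate; the logits, target index, and step size are arbitrary choices made for illustration.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(zj - m) for zj in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss_fn(z, c):
    return -math.log(softmax(z)[c])

def analytic_grad(z, c):
    """softmax(z)[i] - delta[i, c] for every i."""
    probs = softmax(z)
    return [p - (1.0 if i == c else 0.0) for i, p in enumerate(probs)]

def numeric_grad(z, c, eps=1e-6):
    """Central finite differences: perturb one logit at a time."""
    grad = []
    for i in range(len(z)):
        up = list(z)
        up[i] += eps
        down = list(z)
        down[i] -= eps
        grad.append((loss_fn(up, c) - loss_fn(down, c)) / (2 * eps))
    return grad

z = [2.0, -1.0, 0.5, 0.0]   # arbitrary logits
c = 2                       # arbitrary target index
g_analytic = analytic_grad(z, c)
g_numeric = numeric_grad(z, c)
max_diff = max(abs(a - n) for a, n in zip(g_analytic, g_numeric))
```

The two gradients should agree to within finite-difference error. The analytic components also sum to zero, since the probabilities sum to 1 and the one-hot label sums to 1.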
Per-token loss in sequence training
For sequence-level training, compute the loss at every position and average over the positions that are not padding. For a B×N batch with a V-token vocabulary, the logits have shape B×N×V and the loss is a single scalar: the mean over the B·N positions, or over fewer when some are padding. The gradient then backpropagates through the whole transformer to update all weights.
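The averaging step can be sketched in plain Python with nested lists standing in for a B×N×V logits tensor. The PAD sentinel of -100 is an assumption made here for illustration (it mirrors the default ignore_index of PyTorch's F.cross_entropy), and the helper names are hypothetical.

```python
import math

PAD = -100  # assumed sentinel marking padded target positions

def token_loss(z, c):
    """-z[c] + log-sum-exp(z) for one position, using the max-subtraction trick."""
    m = max(z)
    return -z[c] + m + math.log(sum(math.exp(zj - m) for zj in z))

def sequence_loss(logits, targets, pad_id=PAD):
    """Mean per-token loss over all non-padded positions in the batch.

    logits:  B x N x V nested lists
    targets: B x N nested lists of token indices, with pad_id at padded positions
    """
    total, count = 0.0, 0
    for seq_logits, seq_targets in zip(logits, targets):
        for z, c in zip(seq_logits, seq_targets):
            if c == pad_id:
                continue  # padded positions contribute neither loss nor count
            total += token_loss(z, c)
            count += 1
    if count == 0:
        raise ValueError("batch contains only padding")
    return total / count

# B=2, N=3, V=4. Real positions have uniform logits; padded ones hold arbitrary values.
logits = [
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [5.0, -3.0, 2.0, 1.0]],
    [[0.0, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, -9.0], [1.0, 2.0, 3.0, 4.0]],
]
targets = [[1, 3, PAD], [0, PAD, PAD]]
loss = sequence_loss(logits, targets)  # mean over the 3 real positions
```

Dividing by the number of real tokens rather than by B·N keeps the loss comparable across batches with different amounts of padding.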