Cross-Entropy Loss for Next-Token Prediction

LLMs are trained to predict the next token. Cross-entropy is the standard loss. Knowing why it's chosen and how its gradient simplifies when combined with softmax makes the training loop transparent.

Information-theoretic motivation

H(p, q) = -sum over i of p[i] · log(q[i])

Cross-entropy H(p, q) measures how 'surprised' you are on average if you believe q but outcomes are actually drawn from p. For a fixed p, it is minimized when q = p, where it equals the entropy of p. For next-token prediction, p is the one-hot label (1.0 on the correct token, 0.0 elsewhere) and q is the model's predicted distribution.
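As a small illustration of the definition, the sketch below evaluates H(p, q) for one fixed p and three candidate q distributions. The distributions and the helper name cross_entropy are made up for this example; terms with p[i] = 0 are skipped so that 0 · log(0) never has to be evaluated.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p[i] * log(q[i]), in nats. Terms with p[i] == 0 are skipped."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

# Illustrative distributions over three outcomes (not from any real model).
p = [0.7, 0.2, 0.1]        # "reality"
q_close = [0.6, 0.3, 0.1]  # a belief close to p
q_far = [0.1, 0.1, 0.8]    # a belief far from p

h_pp = cross_entropy(p, p)           # the entropy of p, the smallest achievable value
h_close = cross_entropy(p, q_close)  # slightly larger
h_far = cross_entropy(p, q_far)      # much larger

print(h_pp, h_close, h_far)
```

The three values increase in that order: the further q drifts from p, the larger the cross-entropy.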

Simplification with one-hot labels

H(one_hot(c), q) = -log(q[c])

All terms in the sum are zero except the one where p[i] = 1, so cross-entropy with a one-hot label is just the negative log probability of the correct token c. Minimizing the loss is equivalent to maximizing the log-likelihood of the right answer, which is standard maximum-likelihood training.
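The collapse of the sum can be checked numerically. This is a minimal sketch with an invented three-token distribution q and a hypothetical one_hot helper; it compares the full sum against the single-term shortcut.

```python
import math

def one_hot(c, size):
    """Return a list with 1.0 at index c and 0.0 elsewhere."""
    return [1.0 if i == c else 0.0 for i in range(size)]

q = [0.1, 0.6, 0.3]  # illustrative predicted distribution
c = 1                # index of the correct token

# Full definition: every term is multiplied by 0.0 except the one at index c.
full_sum = -sum(p_i * math.log(q_i) for p_i, q_i in zip(one_hot(c, len(q)), q))

# Shortcut: negative log probability of the correct token.
shortcut = -math.log(q[c])

print(full_sum, shortcut)
```

Both expressions give the same number, so an implementation only needs to look up q[c] rather than build the one-hot vector.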

Combined with softmax

loss = -log(softmax(z)[c])
     = -z[c] + log(sum over j of exp(z[j]))

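The log of the softmax denominator is a 'log-sum-exp' (LSE), which is well studied and can be computed stably by subtracting the maximum logit before exponentiating. PyTorch's F.cross_entropy(logits, target) takes raw logits and fuses these operations for stability and speed. Avoid computing softmax and then taking its log separately: a very small probability can underflow to 0, and log(0) is -inf.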
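To make the stability point concrete, here is a plain-Python sketch of the loss computed directly from logits with the max-subtraction trick. It is not how PyTorch implements F.cross_entropy internally; the function names and the extreme logit values are chosen only for illustration.

```python
import math

def log_sum_exp(z):
    """log(sum_j exp(z[j])), computed after shifting by max(z) so exp never overflows."""
    m = max(z)
    return m + math.log(sum(math.exp(z_j - m) for z_j in z))

def cross_entropy_from_logits(z, c):
    """loss = -z[c] + log_sum_exp(z), which equals -log(softmax(z)[c])."""
    return -z[c] + log_sum_exp(z)

# Extreme logits: a direct math.exp(1000.0) would raise OverflowError.
z_extreme = [1000.0, 0.0, -1000.0]
loss_right = cross_entropy_from_logits(z_extreme, 0)  # confident and correct
loss_wrong = cross_entropy_from_logits(z_extreme, 2)  # confident and wrong

# Ordinary logits: the stable form agrees with the textbook softmax-then-log.
z_small = [2.0, 1.0, 0.1]
naive = -math.log(math.exp(z_small[0]) / sum(math.exp(z_j) for z_j in z_small))
stable = cross_entropy_from_logits(z_small, 0)

print(loss_right, loss_wrong, naive, stable)
```

With the shift, the largest exponent is always exp(0) = 1, so the sum stays finite even when the logits themselves are huge.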

The miracle gradient

∂loss/∂z[i] = softmax(z)[i] - δ[i, c]

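The gradient w.r.t. the logits is just the predicted probabilities minus the one-hot label: differentiating -z[c] contributes -δ[i, c], and differentiating the log-sum-exp term gives softmax(z)[i]. It is simple, cheap, and well-behaved, since every component lies between -1 and 1. This is why softmax + cross-entropy is the default for classification: the gradient is the prediction error.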
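The formula can be sanity-checked against a finite-difference approximation. The sketch below uses arbitrary example logits and hypothetical helper names; it compares softmax(z) minus the one-hot label with a central-difference estimate of the loss gradient.

```python
import math

def softmax(z):
    m = max(z)  # shift for numerical stability
    exps = [math.exp(z_j - m) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, c):
    return -math.log(softmax(z)[c])

def analytic_grad(z, c):
    """softmax(z)[i] - delta[i, c] for every logit index i."""
    return [p_i - (1.0 if i == c else 0.0) for i, p_i in enumerate(softmax(z))]

def numeric_grad(z, c, eps=1e-6):
    """Central-difference estimate of d loss / d z[i]."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((loss(up, c) - loss(down, c)) / (2 * eps))
    return grad

z = [2.0, -1.0, 0.5]  # arbitrary example logits
c = 0                 # index of the correct class

g_analytic = analytic_grad(z, c)
g_numeric = numeric_grad(z, c)
print(g_analytic)
print(g_numeric)
```

The two gradients agree to within the finite-difference error. The analytic components also sum to zero, because the probabilities sum to 1 and the one-hot label sums to 1.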

Per-token loss in sequence training

For sequence-level training, compute the loss at every position and average over the sequence, ignoring padding. For a B×N batch with a V-token vocabulary, the logits have shape B×N×V and the loss is a single scalar: the mean over the non-padded positions among the B·N total. The gradient backpropagates through the whole transformer to update all weights.
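The masked average can be sketched in plain Python with nested lists standing in for tensors. The padding marker -100 is an assumption for this example (it matches the default ignore_index of PyTorch's F.cross_entropy), and the helper names and toy batch are hypothetical.

```python
import math

PAD = -100  # assumed padding marker for target positions that should not count

def token_loss(z, c):
    """-z[c] + log-sum-exp(z) for one position with logits z and target c."""
    m = max(z)
    return -z[c] + m + math.log(sum(math.exp(z_j - m) for z_j in z))

def sequence_loss(logits, targets, pad=PAD):
    """Mean per-token loss over all non-padded positions of a B x N batch.

    Assumes at least one target in the batch is not padding.
    """
    total, count = 0.0, 0
    for seq_logits, seq_targets in zip(logits, targets):
        for z, c in zip(seq_logits, seq_targets):
            if c == pad:
                continue  # padded position: contributes nothing
            total += token_loss(z, c)
            count += 1
    return total / count

# Toy batch: B=2 sequences, N=3 positions, V=4 tokens, all logits zero (uniform prediction).
B, N, V = 2, 3, 4
logits = [[[0.0] * V for _ in range(N)] for _ in range(B)]
targets = [[1, 2, 3], [0, PAD, PAD]]  # the second sequence has two padded positions

batch_loss = sequence_loss(logits, targets)
print(batch_loss, math.log(V))
```

With uniform predictions every real token costs log(V), so the batch loss equals log(4) no matter how many positions are padded. Dividing by B·N instead of the count of real tokens would understate the loss here.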

In short: cross-entropy = -log(probability of the correct token). Combined with softmax, the gradient is just (probs - one_hot). Clean and stable.