Softmax Numerical Stability + Temperature

Controls: Temperature = 1.00, Logit scale = 10

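With large logits, a naive exp() overflows: in float32, exp(x) exceeds the largest finite value once x is above about 88.7 (about 709.8 in float64), so the result becomes inf and the softmax ratio becomes NaN. Subtracting max(z) from every logit before exp() is the standard fix.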

What you're seeing

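For z ∈ ℝᴷ, softmax(z)[i] = exp(z[i]) / Σ_j exp(z[j]). Identity: softmax(z) = softmax(z - max(z)), because subtracting a constant c from every logit multiplies the numerator and the denominator by the same factor exp(-c), which cancels. The safe form keeps every exp input ≤ 0, so nothing overflows, and the largest term is exp(0) = 1, so the denominator is at least 1.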
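The identity above can be checked with a minimal sketch. It uses only Python's standard library; the names softmax_naive and softmax_stable are illustrative and not taken from the demo. Note that Python's math.exp raises OverflowError instead of returning inf, so the naive version fails loudly here, whereas array libraries would typically produce inf and then NaN.

```python
import math

def softmax_naive(z):
    """Direct translation of the definition; overflows for large logits."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    """Subtract max(z) first so every exp input is <= 0."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]  # largest term is exp(0) = 1
    total = sum(exps)                    # therefore total >= 1
    return [e / total for e in exps]

logits = [1000.0, 999.0, 998.0]

try:
    softmax_naive(logits)
    naive_overflowed = False
except OverflowError:
    # math.exp cannot represent exp(1000) in float64
    naive_overflowed = True

# The stable form handles the same logits without trouble.
probs = softmax_stable(logits)
```

Because only differences between logits matter, softmax_stable([1000, 999, 998]) gives the same probabilities as softmax_naive([2, 1, 0]).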

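Temperature T > 0 divides the logits before softmax: softmax(z / T). Low T → peaked (near-deterministic). High T → flat (diverse), approaching uniform as T grows. T=0 cannot be computed directly because it divides by zero; in the limit T → 0 the distribution collapses onto argmax (greedy), with ties sharing the mass equally.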
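A small sketch of temperature scaling, assuming plain Python lists for the logits. The function name softmax_with_temperature and the choice to special-case T = 0 as an explicit argmax are illustrative assumptions, not the demo's own implementation.

```python
import math

def softmax_with_temperature(z, T):
    """Softmax of z / T, using max-subtraction for stability."""
    if T < 0:
        raise ValueError("temperature must be non-negative")
    if T == 0:
        # Limit T -> 0: all mass on the largest logit(s), ties split evenly.
        m = max(z)
        winners = [1.0 if v == m else 0.0 for v in z]
        n = sum(winners)
        return [w / n for w in winners]
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.3)   # peaked
base = softmax_with_temperature(logits, 1.0)    # unscaled softmax
flat = softmax_with_temperature(logits, 2.0)    # flatter
greedy = softmax_with_temperature(logits, 0.0)  # argmax limit
```

Dividing by a small T magnifies the logits, which is one more reason to subtract the maximum after scaling, not before.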

★ KEY TAKEAWAY

Softmax + temperature reshapes the distribution: low T → peaked, high T → flat. The subtraction trick prevents exp() overflow.

▶ WHAT TO TRY

Slide Temperature to T=0.3 (sharp), T=1 (model's learned distribution), T=2 (diverse).
Increase Logit scale to see how max-subtraction keeps the math stable.
Watch the entropy readout: high entropy means the probability mass is spread out (an uncertain model), and it rises as T increases.
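The entropy readout can be approximated with a sketch like the one below. It assumes Shannon entropy measured in nats; the demo may use a different base (for example bits), and the helper names are illustrative.

```python
import math

def softmax(z, T=1.0):
    """Stable softmax of z / T for T > 0."""
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

logits = [2.0, 1.0, 0.0]

# Entropy at the three temperatures suggested in the list above.
entropies = {T: entropy(softmax(logits, T)) for T in (0.3, 1.0, 2.0)}

# Upper bound: a uniform distribution over K outcomes has entropy log(K).
max_entropy = math.log(len(logits))
```

For these logits the entropy grows with T and stays below log(K), the value reached by a perfectly flat distribution.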