With large logits, a naive exp() overflows the floating-point range. Subtracting max(z) from every logit before exponentiating is the standard fix.
What you're seeing
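For z ∈ ℝᴷ, softmax(z)[i] = exp(z[i]) / Σⱼ exp(z[j]). Identity: softmax(z) = softmax(z - max(z)), because subtracting the same constant c from every logit multiplies numerator and denominator by exp(-c), which cancels. The safe form keeps every exp input ≤ 0, so exp cannot overflow, and the largest term is exp(0) = 1, so the denominator is at least 1.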
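A minimal sketch of the two forms, using only Python's standard library. The function names softmax_naive and softmax_stable and the sample logits are illustrative assumptions, not part of the demo. In pure Python, math.exp raises OverflowError for large inputs; array libraries typically return inf or nan instead.

```python
import math

def softmax_naive(z):
    # Direct translation of the definition; exp(z[i]) can overflow.
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    # Shift by max(z) so every exp input is <= 0 and the largest term is exp(0) = 1.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]            # illustrative logits
large = [1000.0, 1001.0, 1002.0]   # same gaps, shifted by a large constant

try:
    softmax_naive(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print("naive overflowed on large logits:", naive_overflowed)
print("stable, small logits:", softmax_stable(small))
print("stable, large logits:", softmax_stable(large))
```

Both stable calls print the same probabilities, since the two inputs differ only by a constant shift.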
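Temperature T > 0 divides the logits before softmax: softmax(z / T). Low T → peaked (near-deterministic). High T → flat (diverse). As T → 0 the distribution approaches argmax (greedy); T = 0 itself would divide by zero, so it has to be handled as a special case.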
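The temperature scaling can be sketched as follows. The name softmax_with_temperature, the sample logits, and the choice to treat T = 0 as a one-hot on the first maximal logit are assumptions made for illustration; the demo may handle that case differently.

```python
import math

def softmax_with_temperature(logits, temperature):
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumed convention: greedy one-hot on the first maximal logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # max-subtraction keeps exp inputs <= 0
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # illustrative values
for t in (0, 0.3, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

The top probability shrinks as T grows, which is the peaked-to-flat transition described above.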
★ KEY TAKEAWAY
Softmax + temperature reshapes the distribution: low T → peaked, high T → flat. The subtraction trick prevents exp() overflow.
▶ WHAT TO TRY
- Slide Temperature to T=0.3 (sharp), T=1 (model's learned distribution), T=2 (diverse).
- Increase Logit scale to see how max-subtraction keeps the math stable.
- Watch the entropy readout: high entropy means a flatter distribution, i.e. a more uncertain model.
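To reproduce the entropy readout outside the demo, here is a rough sketch that computes Shannon entropy (in nats) at the three suggested temperatures. The helper names, the sample logits, and the use of natural log are assumptions; the demo's readout may use different units.

```python
import math

def softmax(logits, temperature=1.0):
    # Stable softmax of logits / temperature (temperature > 0 assumed).
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [2.0, 1.0, 0.1]  # illustrative values
entropies = {t: entropy(softmax(logits, t)) for t in (0.3, 1.0, 2.0)}
max_entropy = math.log(len(logits))  # entropy of the uniform distribution

for t, h in entropies.items():
    print(f"T={t}: entropy = {h:.3f} nats (max possible {max_entropy:.3f})")
```

Entropy rises with T and is bounded by log K, the value reached by a perfectly flat distribution over K outcomes.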