Advertisement
Minimize f(x,y) = x² + 2y². Steps follow -gradient. Too high LR → divergence.
What you're seeing
Gradient descent: x ← x - lr·∇f(x). Small lr: slow convergence. Large lr: oscillation or divergence. Just-right: monotonic descent to minimum.
Modern optimizers (Adam, AdamW): add momentum + per-parameter adaptive scaling. Standard for LLM training and most deep learning.
★ KEY TAKEAWAY
Lower LR: slow convergence. Higher LR: oscillation or divergence. Right LR: monotonic descent.
▶ WHAT TO TRY
- Set LR very high (1.0+) — diverges.
- Set LR very low — convergence takes forever.
- Find the sweet spot (~0.1-0.3 for this loss surface).