▶ Interactive Lab

Gradient Descent

Tune learning rate; watch convergence (or divergence).

Advertisement
Minimize f(x,y) = x² + 2y². Steps follow -gradient. Too high LR → divergence.

What you're seeing

Gradient descent: x ← x - lr·∇f(x). Small lr: slow convergence. Large lr: oscillation or divergence. Just-right: monotonic descent to minimum.

Modern optimizers (Adam, AdamW): add momentum + per-parameter adaptive scaling. Standard for LLM training and most deep learning.

★ KEY TAKEAWAY
Lower LR: slow convergence. Higher LR: oscillation or divergence. Right LR: monotonic descent.
▶ WHAT TO TRY
  • Set LR very high (1.0+) — diverges.
  • Set LR very low — convergence takes forever.
  • Find the sweet spot (~0.1-0.3 for this loss surface).