▶ Interactive Lab

SGD vs Adam — Step Trajectories

Three optimizers (SGD, Momentum, Adam) descending the same loss surface.

Loss surface: f(x,y) = x² + 2y², a bowl with twice the curvature along y as along x. SGD: noisy. Momentum: smoother. Adam: adaptive, fast.

What you're seeing

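SGD: follows the raw noisy gradient, so the path jitters. Momentum: averages recent gradients, which smooths the path. Adam: scales the learning rate per parameter, so it is fast along the flat direction (x) and careful along the steep one (y).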
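
The three update rules can be sketched in a few lines of plain Python. This is a minimal illustration using the standard textbook forms of each rule, not the lab's actual code: the starting point (2, 2), the learning rate, the noise level, and the function names `grad`, `loss`, and `run` are all assumptions chosen for the example.

```python
import math
import random

def loss(x, y):
    """The lab's surface: f(x, y) = x^2 + 2y^2."""
    return x * x + 2 * y * y

def grad(x, y, noise, rng):
    """Exact gradient (2x, 4y) plus Gaussian noise to mimic minibatch sampling."""
    return (2 * x + rng.gauss(0, noise), 4 * y + rng.gauss(0, noise))

def run(optimizer, steps=200, lr=0.05, noise=0.5, seed=0):
    """Run one optimizer from an assumed start of (2, 2); return (point, loss)."""
    rng = random.Random(seed)
    p = [2.0, 2.0]   # parameters (x, y); starting point is an assumption
    m = [0.0, 0.0]   # momentum buffer / Adam first moment
    v = [0.0, 0.0]   # Adam second moment (running mean of squared gradients)
    beta, beta1, beta2, eps = 0.9, 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        g = grad(p[0], p[1], noise, rng)
        for i in range(2):
            if optimizer == "sgd":
                # Step straight down the noisy gradient.
                p[i] -= lr * g[i]
            elif optimizer == "momentum":
                # Accumulate past gradients so that noise partly cancels.
                m[i] = beta * m[i] + g[i]
                p[i] -= lr * m[i]
            elif optimizer == "adam":
                # Per-parameter step: divide by the RMS of recent gradients.
                m[i] = beta1 * m[i] + (1 - beta1) * g[i]
                v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
                m_hat = m[i] / (1 - beta1 ** t)   # bias correction
                v_hat = v[i] / (1 - beta2 ** t)
                p[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
            else:
                raise ValueError(f"unknown optimizer: {optimizer}")
    return p, loss(p[0], p[1])

if __name__ == "__main__":
    for name in ("sgd", "momentum", "adam"):
        point, final_loss = run(name)
        print(f"{name:9s} final point = {point}, loss = {final_loss:.5f}")
```

With `noise=0`, Adam's x and y trajectories coincide up to `eps`, because dividing by the gradient RMS cancels the factor-of-two difference in curvature. That is the per-parameter adaptation the lab is showing.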

★ KEY TAKEAWAY
SGD is noisy. Momentum smooths. Adam adapts the step size per parameter, which is why it is the usual default for training transformers.
▶ WHAT TO TRY
  • Switch between SGD / Momentum / Adam.
  • Click Auto and watch how Adam quickly reaches the minimum even with noisy gradients.
  • This is a large part of why most modern LLMs are trained with AdamW (Adam + decoupled weight decay).
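
To make "decoupled weight decay" concrete, here is a hedged single-parameter sketch contrasting AdamW with Adam plus a plain L2 penalty. The function names `adamw_step` and `adam_l2_step` and the default hyperparameters are illustrative assumptions, not taken from the lab or from any particular library.

```python
import math

def _adam_direction(g, m, v, t, beta1, beta2, eps):
    """Shared Adam machinery: update the moments, return the normalized step."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """AdamW: shrink the weight directly, outside the adaptive scaling."""
    step, m, v = _adam_direction(g, m, v, t, beta1, beta2, eps)
    p = p - lr * weight_decay * p   # decoupled decay
    p = p - lr * step               # ordinary Adam step on the loss gradient
    return p, m, v

def adam_l2_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=0.01):
    """Adam + L2: the penalty is folded into the gradient and then rescaled."""
    g = g + weight_decay * p
    step, m, v = _adam_direction(g, m, v, t, beta1, beta2, eps)
    p = p - lr * step
    return p, m, v

if __name__ == "__main__":
    # Zero loss gradient, weight 1.0: only the regularizer acts on the weight.
    print("AdamW   :", adamw_step(1.0, 0.0, 0.0, 0.0, t=1)[0])
    print("Adam+L2 :", adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1)[0])
```

In the L2 variant the penalty passes through Adam's normalization, so on the first step the weight moves by roughly `lr` regardless of how small `weight_decay` is. In the AdamW variant the shrinkage is exactly `lr * weight_decay * p`.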