SGD vs Adam — Step Trajectories

Optimizer

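The objective is f(x,y) = x² + 2y², a bowl with its minimum at (0,0) that curves twice as sharply along y as along x.

SGD: noisy descent. Momentum: averages recent gradients, which smooths the path. Adam: per-parameter learning rate, so steps are relatively larger along flat directions and more careful along steep ones.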
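The three update rules can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the demo: the start point (2, 2), the learning rate, the noise level, the beta values, and the names `grad` and `run` are all assumptions chosen for the example.

```python
import math
import random


def f(x, y):
    """Objective from the demo: f(x, y) = x^2 + 2y^2."""
    return x * x + 2.0 * y * y


def grad(x, y, noise, rng):
    """Exact gradient (2x, 4y) plus Gaussian noise to mimic minibatch noise."""
    return (2.0 * x + rng.gauss(0.0, noise), 4.0 * y + rng.gauss(0.0, noise))


def run(optimizer, steps=200, lr=0.1, noise=0.5, seed=0):
    """Return the list of (x, y) points visited by the chosen optimizer.

    All hyperparameters here are illustrative assumptions.
    """
    rng = random.Random(seed)
    p = [2.0, 2.0]          # assumed start point
    m = [0.0, 0.0]          # momentum buffer / Adam first moment
    v = [0.0, 0.0]          # Adam second moment
    beta, beta1, beta2, eps = 0.9, 0.9, 0.999, 1e-8
    path = [tuple(p)]
    for t in range(1, steps + 1):
        g = grad(p[0], p[1], noise, rng)
        for i in range(2):
            if optimizer == "sgd":
                p[i] -= lr * g[i]
            elif optimizer == "momentum":
                # Accumulate a decaying sum of past gradients.
                m[i] = beta * m[i] + g[i]
                p[i] -= lr * m[i]
            elif optimizer == "adam":
                # Running averages of the gradient and its square.
                m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
                v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i]
                m_hat = m[i] / (1.0 - beta1 ** t)   # bias correction
                v_hat = v[i] / (1.0 - beta2 ** t)
                # Each coordinate gets its own effective step size.
                p[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
            else:
                raise ValueError(f"unknown optimizer: {optimizer}")
        path.append(tuple(p))
    return path
```

Calling `run("sgd")`, `run("momentum")`, and `run("adam")` with the same seed gives three trajectories that can be plotted side by side; setting `noise=0.0` shows the underlying paths without gradient noise.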

★ KEY TAKEAWAY

SGD is noisy. Momentum smooths. Adam adapts the step size per parameter, which makes it the usual default for training transformers.

▶ WHAT TO TRY

Switch between SGD / Momentum / Adam.
Click Auto and watch how Adam quickly reaches the minimum even with noisy gradients.
This is a large part of why most modern LLMs are trained with AdamW (Adam + decoupled weight decay).
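To make "decoupled weight decay" concrete, here is a hedged sketch of a single AdamW update on a list of parameters. The function name `adamw_step`, the default hyperparameters, and the order of the two sub-steps are assumptions for illustration. The point it shows is that the decay shrinks the weights directly instead of being added to the gradient, so it is not rescaled by Adam's per-parameter denominator.

```python
import math


def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update (illustrative sketch, hypothetical helper).

    p, g, m, v are equal-length lists: parameters, gradients, and the
    first and second moment estimates. t is the 1-based step count.
    Returns new (p, m, v) lists without modifying the inputs.
    """
    new_p, new_m, new_v = [], [], []
    for pi, gi, mi, vi in zip(p, g, m, v):
        # Standard Adam moment estimates, computed from the raw gradient.
        mi = beta1 * mi + (1.0 - beta1) * gi
        vi = beta2 * vi + (1.0 - beta2) * gi * gi
        m_hat = mi / (1.0 - beta1 ** t)
        v_hat = vi / (1.0 - beta2 ** t)
        # Decoupled weight decay: shrink the weight directly.
        pi = pi - lr * weight_decay * pi
        # Adaptive gradient step.
        pi = pi - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_p.append(pi)
        new_m.append(mi)
        new_v.append(vi)
    return new_p, new_m, new_v
```

With plain Adam plus L2 regularization, the term `weight_decay * pi` would instead be added to `gi` before the moment estimates, and would then be divided by `sqrt(v_hat)` along with the rest of the gradient.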