f(x, y) = x² + 2y², minimum at (0, 0). SGD: noisy. Momentum: smoother. Adam: adaptive, fast.
What you're seeing
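SGD: noisy descent along the raw gradient. Momentum: averages recent gradients, which smooths the path. Adam: per-parameter learning rate, so it takes relatively larger steps along flat directions (here x) and smaller ones along steep directions (here y).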
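The three update rules can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code behind the demo: the function names (`f`, `grad`, `run`), the start point (3, 2), the learning rate, the noise level, and the Adam constants (0.9, 0.999, 1e-8) are all hypothetical choices made for this example.

```python
import numpy as np

def f(p):
    # Loss surface from the demo: f(x, y) = x^2 + 2*y^2
    return p[0] ** 2 + 2.0 * p[1] ** 2

def grad(p):
    # Exact gradient of f: (2x, 4y)
    return np.array([2.0 * p[0], 4.0 * p[1]])

def run(optimizer, steps=200, lr=0.05, noise=0.5, seed=0):
    """Run one optimizer from an assumed start point and return the final point."""
    rng = np.random.default_rng(seed)
    p = np.array([3.0, 2.0])           # assumed start point
    m = np.zeros(2)                    # momentum buffer / Adam first moment
    v = np.zeros(2)                    # Adam second moment
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        # Noisy gradient: exact gradient plus Gaussian noise
        g = grad(p) + noise * rng.standard_normal(2)
        if optimizer == "sgd":
            p = p - lr * g
        elif optimizer == "momentum":
            m = beta1 * m + g          # accumulate recent gradients
            p = p - lr * m
        elif optimizer == "adam":
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)   # bias correction
            v_hat = v / (1 - beta2 ** t)
            # Each coordinate is scaled by its own gradient magnitude
            p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
    return p

if __name__ == "__main__":
    for name in ("sgd", "momentum", "adam"):
        print(name, run(name))
```

Setting `noise=0.0` removes the stochastic term, which makes it easier to compare the paths each rule takes on its own.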
★ KEY TAKEAWAY
SGD is noisy. Momentum smooths. Adam adapts the step size per parameter, which makes it the usual default for training transformers.
▶ WHAT TO TRY
- Switch between SGD / Momentum / Adam.
- Click Auto and watch how Adam quickly reaches the minimum even with noisy gradients.
- This is a large part of why most modern LLMs are trained with AdamW (Adam with decoupled weight decay).
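
What "decoupled" means can be shown with a single update step. The sketch below is an illustration under assumptions rather than any library's implementation: `adam_step` and its `decoupled` flag are hypothetical names, and the constants are example values. With `decoupled=True` the weights are shrunk directly by `lr * wd`. With `decoupled=False` the decay is added to the gradient, so it gets rescaled by Adam's adaptive denominator.

```python
import numpy as np

def adam_step(p, g, m, v, t, lr=0.1, wd=0.01, decoupled=True,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with weight decay; returns (p, m, v)."""
    if decoupled:
        # AdamW style: shrink weights directly, outside the adaptive update
        p = p * (1.0 - lr * wd)
    else:
        # L2 style: decay enters the gradient and is rescaled by Adam
        g = g + wd * p
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)       # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v

if __name__ == "__main__":
    p0 = np.array([1.0, -2.0])
    zero = np.zeros(2)
    # With a zero loss gradient, only the weight decay acts on the weights
    print(adam_step(p0, zero, zero, zero, t=1, decoupled=True)[0])
    print(adam_step(p0, zero, zero, zero, t=1, decoupled=False)[0])
```

With a zero loss gradient, the decoupled step shrinks each weight by the same factor, while the coupled step moves every weight by roughly `lr` regardless of its size. This is the difference the name refers to.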