SGD
w ← w - η · ∇L. Simplest. Requires careful LR schedule. Noisy but escapes local minima.
Advertisement
Momentum
v ← β·v + ∇L. w ← w - η·v. Accelerates in consistent directions. Dampens oscillation. β = 0.9 typical.
Advertisement
Adam
Adaptive per-parameter LR via 1st + 2nd moment estimates. Default choice for most nets. Sometimes worse generalization than SGD+momentum.
Learning rate schedule
Warmup → decay. Cosine schedule. Reduce-on-plateau. Fine-tuning uses low LR + shorter schedules.