SGD vs Adam vs AdamW Optimizer Math

Once you have a gradient, you need a rule to update parameters. SGD is the simplest. Adam adds adaptive learning rates per parameter. AdamW fixes a subtle bug in how weight decay is combined with Adam. Most modern LLMs are trained with AdamW or a close variant.


SGD — the baseline

# For each parameter θ:
θ ← θ - η · g
# η is learning rate, g is gradient

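Each step subtracts a scaled gradient. Simple, well-understood. Works for small models. Convergence is slow for LLMs because the gradient scale varies wildly across parameters (embedding rows vs attention biases), and a single global η cannot suit all of them: a rate small enough for the largest gradients barely moves the parameters with the smallest ones.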
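To make the scale problem concrete, here is a minimal sketch of the SGD rule on a toy two-parameter loss. The loss, the function names (`sgd_step`, `toy_grad`), and the learning rate are assumptions chosen for illustration, not taken from any real model.

```python
def sgd_step(theta, grad, lr):
    """One plain SGD update: theta <- theta - lr * grad, element-wise."""
    return [t - lr * g for t, g in zip(theta, grad)]


def toy_grad(theta):
    """Gradient of the toy loss L = 0.5 * (100 * a**2 + 0.01 * b**2)."""
    a, b = theta
    return [100.0 * a, 0.01 * b]


theta = [1.0, 1.0]
for _ in range(100):
    # lr = 0.01 is about as large as the steep direction (a) tolerates well.
    theta = sgd_step(theta, toy_grad(theta), lr=0.01)

print(theta)  # a has reached the minimum; b has barely moved from 1.0
```

The learning rate is capped by the steep direction, so the shallow direction needs many thousands of steps to catch up.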

Momentum and Nesterov

v ← β · v + g
θ ← θ - η · v
# β ≈ 0.9 typically

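Momentum accumulates an exponentially decaying sum of past gradients. With a steady gradient, v grows toward g / (1 - β), so β = 0.9 gives roughly a 10× larger effective step. It smooths noise and damps the side-to-side oscillation that plain SGD shows in narrow valleys. Nesterov is a small refinement that evaluates the gradient at a lookahead point. Both were common in the pre-LLM era and are less used now.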
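The momentum update above can be sketched in a few lines of Python. The function name `momentum_step` and the constant-gradient demo are assumptions for illustration. The demo only shows how the velocity builds up when the gradient keeps pointing the same way.

```python
def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    """One SGD-with-momentum update over lists of floats."""
    # v <- beta * v + g
    v = [beta * vi + gi for vi, gi in zip(v, grad)]
    # theta <- theta - lr * v
    theta = [ti - lr * vi for ti, vi in zip(theta, v)]
    return theta, v


theta, v = [0.0], [0.0]
for _ in range(200):
    # A constant gradient of 1.0 stands in for a long, steady slope.
    theta, v = momentum_step(theta, v, [1.0])

print(v)  # velocity approaches 1 / (1 - beta), about 10 for beta = 0.9
```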


Adam — adaptive learning per parameter

# m: first moment (mean of gradient)
# v: second moment (mean of squared gradient)
m ← β1 · m + (1-β1) · g
v ← β2 · v + (1-β2) · g²

# Bias correction (m, v are initialized to 0):
m_hat = m / (1 - β1^t)
v_hat = v / (1 - β2^t)

θ ← θ - η · m_hat / (sqrt(v_hat) + ε)

# Defaults: β1=0.9, β2=0.999, ε=1e-8

Each parameter gets its own effective learning rate: small for parameters whose gradients are consistently large, large for those whose gradients are small. Works across many scales without manual tuning. Standard for transformers since 2017.
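A minimal pure-Python sketch of the Adam update above shows the per-parameter scaling. The function name `adam_step` and the two example gradients are assumptions for illustration. Real training code would use a library optimizer operating on tensors.

```python
import math


def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for th, mi, vi, g in zip(theta, m, v, grad):
        mi = beta1 * mi + (1 - beta1) * g        # first moment
        vi = beta2 * vi + (1 - beta2) * g * g    # second moment
        m_hat = mi / (1 - beta1 ** t)            # bias correction
        v_hat = vi / (1 - beta2 ** t)
        new_theta.append(th - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v


# Two parameters whose gradients differ by a factor of one million.
grad = [1000.0, 0.001]
theta, m, v = adam_step([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], grad, t=1)
print(theta)  # both parameters move by roughly lr, despite the scale gap
```

On the first step the update reduces to about η · sign(g), which is why both parameters move by nearly the same amount.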

AdamW — fixed weight decay

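# Wrong (Adam + L2 regularization): decay is folded into the gradient,
# so it flows through m and v and is rescaled by 1 / sqrt(v_hat)
g ← g + λ · θ
θ ← θ - η · m_hat / (sqrt(v_hat) + ε)

# Right (AdamW, decoupled): m and v see only the loss gradient g,
# and the decay is applied directly to θ
θ ← (1 - η · λ) · θ - η · m_hat / (sqrt(v_hat) + ε)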

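L2 regularization in vanilla Adam interacts oddly with adaptive scaling: the penalty term λ · θ is added to the gradient, so it is divided by sqrt(v_hat) like everything else, and parameters with large gradients end up decayed less than parameters with small ones. Loshchilov & Hutter (2018) showed that decoupling weight decay from the adaptive update fixes this. AdamW often gives 1-2% better validation results. Llama, Phi, and GPT models are all reported to use AdamW.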
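The difference between the two schemes is easiest to see when the loss gradient is zero, so that only the decay acts. The following sketch assumes a single scalar parameter. The helper names (`adam_direction`, `adam_l2_step`, `adamw_step`) are hypothetical, not a library API.

```python
import math


def adam_direction(m, v, g, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update the moments with g and return (m, v, normalized step)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m, v, m_hat / (math.sqrt(v_hat) + eps)


def adam_l2_step(theta, m, v, g, t, lr, wd):
    """Adam + L2: the decay term is added to the gradient before the moments."""
    m, v, step = adam_direction(m, v, g + wd * theta, t)
    return theta - lr * step, m, v


def adamw_step(theta, m, v, g, t, lr, wd):
    """AdamW: the moments see only g; decay shrinks theta directly."""
    m, v, step = adam_direction(m, v, g, t)
    return (1 - lr * wd) * theta - lr * step, m, v


# Zero loss gradient, so any change in theta comes from weight decay alone.
l2_theta, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1, lr=1e-3, wd=0.01)
aw_theta, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=1e-3, wd=0.01)
print(l2_theta)  # shrinks by about lr: the normalization cancels wd
print(aw_theta)  # shrinks by lr * wd * theta, as intended
```

In the L2 variant the decay strength λ is normalized away on this step, which is the interaction the decoupled form avoids. In PyTorch, `torch.optim.Adam` with `weight_decay` applies the L2 form, while `torch.optim.AdamW` applies the decoupled form.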

Memory cost

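Per parameter: θ, its gradient, m, v, and sometimes master copies (FP32 alongside FP16). At FP32 each of these is 4 bytes, so AdamW training needs about 16 bytes per parameter. ε is a single scalar and costs nothing.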

# For 1B param model:
Params alone: 4 GB (FP32)
AdamW state:  8 GB (m, v)
Gradients:    4 GB
Total:        16 GB minimum

Weights, gradients, and AdamW's two moments per parameter add up to 4× the model size at FP32, and that is before activations. CPU training requires the model to be small enough that all of this fits in RAM. As a rough rule of thumb, ~125M-350M parameters is the practical CPU training ceiling on a 32-64 GB machine.
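The arithmetic in the table above can be wrapped in a small helper. This is a rough sketch: the function name `training_memory_gb` is hypothetical, GB is taken as 10^9 bytes, and activations, batch data, and framework overhead are deliberately left out.

```python
def training_memory_gb(n_params, bytes_per_value=4, optimizer_buffers=2):
    """Weights + gradients + optimizer buffers, in GB (1 GB = 1e9 bytes).

    optimizer_buffers: 2 for Adam/AdamW (m and v), 1 for SGD with momentum,
    0 for plain SGD. Activations are NOT included.
    """
    values_per_param = 1 + 1 + optimizer_buffers  # theta, gradient, buffers
    return n_params * values_per_param * bytes_per_value / 1e9


print(training_memory_gb(1_000_000_000))                       # AdamW, 1B params
print(training_memory_gb(1_000_000_000, optimizer_buffers=0))  # plain SGD
print(training_memory_gb(350_000_000))                         # AdamW, 350M params
```

For a 1B-parameter model this reproduces the 16 GB figure for AdamW and gives 8 GB for plain SGD.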

SGD: subtract gradient. Adam: adapt per parameter. AdamW: decoupled weight decay. AdamW state takes twice the memory of the weights themselves, bringing training to about 4× model size and limiting how large a model fits on a CPU.