Once you have a gradient, you need a rule for turning it into a parameter update. SGD is the simplest. Adam adds an adaptive learning rate per parameter. AdamW fixes a subtle weight-decay problem in Adam. Most modern LLMs are trained with AdamW.
SGD — the baseline
# For each parameter θ:
θ ← θ - η · g
# η is the learning rate, g is the gradient
Each step subtracts a scaled gradient. SGD is simple and well understood, and it works for small models. It converges slowly for LLMs because the gradient scale varies wildly across parameters (embedding rows vs. attention biases), so a single η is too large for some parameters and too small for others.
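As a minimal sketch of the rule above, the following pure-Python snippet runs SGD on a toy two-parameter objective. The objective f(x, y) = 50x² + 0.5y² and the names `sgd_step` and `grad_f` are assumptions chosen for illustration, not part of any library. The two coordinates have gradients that differ in scale by 100×, which is a small-scale version of the problem described above.

```python
def sgd_step(params, grads, lr):
    """Return new parameters after one plain SGD update: theta - lr * g."""
    return [p - lr * g for p, g in zip(params, grads)]


def grad_f(params):
    """Gradient of the toy objective f(x, y) = 50*x**2 + 0.5*y**2 (illustrative assumption)."""
    x, y = params
    return [100.0 * x, 1.0 * y]


params = [1.0, 1.0]
for _ in range(100):
    params = sgd_step(params, grad_f(params), lr=0.01)

# The steep coordinate x has converged; the shallow coordinate y is still far from 0.
print(params)
```

With `lr=0.01` the steep coordinate converges almost immediately, while the shallow one shrinks by only 1% per step. Raising the learning rate above 0.02 would speed up `y` but make `x` diverge, so no single value suits both.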
Momentum and Nesterov
v ← β · v + g
θ ← θ - η · v
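# β ≈ 0.9 typically
Momentum accumulates an exponentially decaying sum of past gradients. This smooths noise and speeds progress along narrow valleys, where plain SGD tends to oscillate. Nesterov momentum is a small refinement that evaluates the gradient at a lookahead point. Both were common in the pre-LLM era and are less used now.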
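One way to see what the velocity term does is to feed it a constant gradient. The sketch below assumes a single scalar parameter and a fixed gradient of 1.0; `momentum_step` is a hypothetical helper name. Under that assumption v follows a geometric series and approaches 1 / (1 - β), so the effective step size is roughly η / (1 - β).

```python
def momentum_step(theta, v, g, lr=0.01, beta=0.9):
    """One momentum update for a single scalar parameter."""
    v = beta * v + g          # decaying sum of past gradients
    theta = theta - lr * v    # step along the accumulated direction
    return theta, v


theta, v = 0.0, 0.0
for _ in range(200):
    # Constant gradient of 1.0 is a toy assumption to expose the steady state.
    theta, v = momentum_step(theta, v, g=1.0)

velocity_limit = 1.0 / (1.0 - 0.9)  # geometric-series limit of v for g = 1
print(v, velocity_limit)
```

With β = 0.9 the velocity settles near 10, so each step is about ten times larger than the plain SGD step at the same η. This is why η usually needs to be lowered when momentum is switched on.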
Adam — adaptive learning per parameter
# m: first moment (mean of gradient)
# v: second moment (mean of squared gradient)
m ← β1 · m + (1-β1) · g
v ← β2 · v + (1-β2) · g²
# Bias correction (m, v are initialized to 0):
m_hat = m / (1 - β1^t)
v_hat = v / (1 - β2^t)
θ ← θ - η · m_hat / (sqrt(v_hat) + ε)
# Defaults: β1=0.9, β2=0.999, ε=1e-8
Each parameter gets its own effective learning rate: smaller for parameters with consistently large gradients, larger for parameters with small ones. This works across many gradient scales without per-parameter manual tuning, and it has been the standard for transformers since 2017.
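The scale invariance can be checked directly. The sketch below implements the update above for one scalar parameter (`adam_step` is an illustrative name, not a library function) and takes the first step with two gradients that differ by six orders of magnitude.

```python
import math


def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g        # first moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # second moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Same starting point, gradients of very different scale.
big, _, _ = adam_step(theta=0.0, g=1000.0, m=0.0, v=0.0, t=1)
small, _, _ = adam_step(theta=0.0, g=0.001, m=0.0, v=0.0, t=1)
print(big, small)
```

Both parameters move by approximately `lr`, because on the first step m_hat / sqrt(v_hat) reduces to g / |g|. Plain SGD would have moved them by amounts a million times apart.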
AdamW — fixed weight decay
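# Wrong (Adam + L2 regularization): the decay term is folded into the gradient,
# so it flows through m and v and gets rescaled by 1 / sqrt(v_hat)
g ← g + λ · θ
θ ← θ - η · m_hat / (sqrt(v_hat) + ε)
# Right (AdamW, decoupled): decay is applied directly to θ, outside the adaptive update
θ ← (1 - η · λ) · θ - η · m_hat / (sqrt(v_hat) + ε)
In vanilla Adam, an L2 penalty is added to the gradient, so it is divided by sqrt(v_hat) like everything else: parameters with large gradients are decayed less than parameters with small ones. Loshchilov & Hutter (2018) showed that decoupling weight decay from the adaptive update fixes this. AdamW often gives 1-2% better validation performance. Llama, Phi, and GPT all use AdamW.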
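The difference shows up even on a single step. The sketch below is a scalar toy, with `adam_like_step` and its `decoupled` flag as hypothetical names. It implements Adam + L2 in the usual sense (the penalty λ · θ is added to the gradient before the moment estimates) alongside the decoupled AdamW form, then measures how much turning on weight decay changes the result in each case.

```python
import math


def adam_like_step(theta, g, lam, decoupled, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """First step (t=1, zero-initialized moments) of Adam+L2 or AdamW on one scalar."""
    if not decoupled:
        g = g + lam * theta                 # L2: decay enters the gradient
    m_hat = ((1 - beta1) * g) / (1 - beta1)          # bias-corrected first moment at t=1
    v_hat = ((1 - beta2) * g * g) / (1 - beta2)      # bias-corrected second moment at t=1
    update = lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        theta = theta - lr * lam * theta    # AdamW: decay applied directly to theta
    return theta - update


theta0, grad, lam = 1.0, 100.0, 0.1

# Effect of weight decay = result with decay minus result without decay.
l2_effect = adam_like_step(theta0, grad, lam, False) - adam_like_step(theta0, grad, 0.0, False)
adamw_effect = adam_like_step(theta0, grad, lam, True) - adam_like_step(theta0, grad, 0.0, True)
print(l2_effect, adamw_effect)
```

In this toy setting the L2 penalty has essentially no effect, because it is normalized away together with the gradient, while AdamW shrinks θ by η · λ · θ regardless of the gradient scale.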
Memory cost
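Per parameter: θ, gradient, m, v, and sometimes an FP32 master copy of θ alongside FP16 weights (mixed precision) = 4 bytes for the FP32 weight alone, up to 16 bytes with gradient and both moments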
# For 1B param model:
Params alone: 4 GB (FP32)
AdamW state: 8 GB (m, v)
Gradients: 4 GB
Total: 16 GB minimum
Weights, gradients, and AdamW's two moments per parameter are why training memory is at least 4× the model size at FP32, before counting activations. CPU training requires the model to be small enough that all of this fits in RAM. Roughly 125M-350M parameters is the practical CPU training ceiling on a 32-64 GB machine.
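The arithmetic above generalizes to any model size. The helper below is a rough estimator under stated assumptions: every value is stored at the same precision (FP32 by default), 1 GB is taken as 1e9 bytes, and activations, temporary buffers, and framework overhead are ignored. The name `training_memory_gb` is hypothetical.

```python
def training_memory_gb(n_params, bytes_per_value=4):
    """Estimate weight, gradient, and AdamW state memory in GB (1 GB = 1e9 bytes)."""
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    adam_state = 2 * n_params * bytes_per_value   # m and v, one value each per parameter
    total = weights + grads + adam_state
    sizes = {"weights": weights, "grads": grads, "adam_state": adam_state, "total": total}
    return {name: size / 1e9 for name, size in sizes.items()}


one_b = training_memory_gb(1_000_000_000)   # the 1B example above
small = training_memory_gb(350_000_000)     # upper end of the CPU-trainable range
print(one_b)
print(small)
```

For 1B parameters this reproduces the 4 + 4 + 8 = 16 GB breakdown. A 350M model needs about 5.6 GB for the same state, which leaves room for activations and data on a 32-64 GB machine.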