Modern LLM training typically uses a learning-rate (LR) schedule: warm up linearly, then decay (usually cosine). The combination works well in practice: start small to avoid divergence, ramp up to learn quickly, then decay gently so the model can fine-tune. The math is straightforward.


Why not constant LR?

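Too high at the start: gradients are noisy and the model is at random init, so large steps amplify bad directions. Too high at the end: updates keep overshooting, so the model can't make small refinements. And a constant LR low enough to be safe throughout learns slowly and can get stuck in a suboptimal basin. The schedule resolves this: start small (warmup), reach a peak, then decay.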

Linear warmup

# For step t in [0, T_warmup]:
lr(t) = lr_peak * (t / T_warmup)

# Typical: T_warmup = 2000-10000 steps

Linear ramp from 0 (or a tiny value) to the peak over a few thousand steps. This lets the model 'settle' before the full LR is applied. Skipping warmup often leads to loss spikes or divergence early in training.
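A minimal Python sketch of the warmup formula above. The function name linear_warmup_lr and its parameter names are illustrative choices, not taken from any library; holding the LR at the peak after warmup is an assumption made so the function is defined for every step.

```python
def linear_warmup_lr(step: int, lr_peak: float, warmup_steps: int) -> float:
    """LR during linear warmup: lr_peak * (step / warmup_steps).

    Assumption: after warmup the LR is simply held at lr_peak; a real
    schedule would hand over to a decay phase at that point.
    """
    if warmup_steps <= 0:
        # No warmup requested: use the peak LR from the first step.
        return lr_peak
    # Clamp so steps past the warmup window do not exceed the peak.
    return lr_peak * min(step, warmup_steps) / warmup_steps


# Print a few points of a 2000-step warmup to a peak of 3e-4.
for step in (0, 500, 1000, 2000):
    print(step, linear_warmup_lr(step, lr_peak=3e-4, warmup_steps=2000))
```

Starting at exactly 0 means the very first update does nothing; some implementations count steps from 1 to avoid that.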


Cosine decay

# For step t in [T_warmup, T_total]:
u = (t - T_warmup) / (T_total - T_warmup)
lr(t) = lr_min + 0.5 * (lr_peak - lr_min) * (1 + cos(π * u))

# Typical: lr_min = 0.1 * lr_peak

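Smooth decay from the peak to a small minimum. The cosine shape changes slowly near the peak, fastest around the midpoint, and slowly again as it approaches the minimum, so the LR eases into its final value instead of dropping abruptly. It is the most popular LLM training schedule (used in Llama, Phi, Mistral, Qwen).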
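The two phases can be combined into one function. This is a sketch under stated assumptions rather than any framework's implementation: the name warmup_cosine_lr and its parameters are hypothetical and map to T_warmup, T_total, lr_peak and lr_min from the formulas, and steps past total_steps are assumed to stay at lr_min.

```python
import math


def warmup_cosine_lr(step: int, lr_peak: float, lr_min: float,
                     warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to lr_peak, then cosine decay to lr_min."""
    if step < warmup_steps:
        # Warmup phase: linear ramp from 0 to lr_peak.
        return lr_peak * step / warmup_steps
    # Decay phase: u runs from 0 at the end of warmup to 1 at total_steps.
    decay_steps = max(1, total_steps - warmup_steps)
    u = min(1.0, (step - warmup_steps) / decay_steps)  # assumption: clamp past the end
    return lr_min + 0.5 * (lr_peak - lr_min) * (1.0 + math.cos(math.pi * u))


# Example values chosen for illustration: 2% warmup, lr_min = 0.1 * lr_peak.
peak, total, warmup = 3e-4, 100_000, 2_000
for step in (0, 1_000, 2_000, 51_000, 100_000):
    print(step, warmup_cosine_lr(step, peak, 0.1 * peak, warmup, total))
```

In a training loop the returned value would be written into the optimizer's learning rate before each step.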

Practical settings

Peak LR: ~3e-4 for AdamW on most LLMs, ~1e-3 for small language models (SLMs). Warmup: 1-2% of total training steps. Total steps: dictated by dataset size and batch size, T_total = tokens / (batch * seq_len), where batch is the number of sequences per step. For 1T tokens at a batch of 1M tokens per step: 1M steps total.
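The step-count arithmetic can be checked with a few lines of Python. The helper name total_training_steps is hypothetical, and the batch of 512 sequences of 2048 tokens is an assumed example of a batch of roughly 1M tokens, not a setting from any particular model.

```python
def total_training_steps(total_tokens: int, batch_sequences: int, seq_len: int) -> int:
    """T_total = tokens / (batch * seq_len), rounded down to whole steps."""
    tokens_per_step = batch_sequences * seq_len
    return total_tokens // tokens_per_step


# Assumed example: 1T tokens, 512 sequences of 2048 tokens per step
# (512 * 2048 = 1,048,576 tokens per step, i.e. about 1M).
steps = total_training_steps(10**12, batch_sequences=512, seq_len=2048)
print(steps)  # 953674, close to the round figure of 1M steps
```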

CPU training specifics

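CPU training usually means smaller models and smaller batches. Smaller models train faster per step (less compute) but may need more steps to converge, so adjust T_total accordingly. A smaller batch also means more steps per epoch for the same data. Plan warmup as a fraction of T_total (e.g. 2%), not a fixed step count, so it scales with the run.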
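To illustrate why a fraction is easier to carry between setups than a fixed count, the sketch below derives the warmup length from the total step count for two batch sizes. The helper name warmup_steps_from_fraction, the 1B-token budget and both batch sizes are assumptions picked only for the example.

```python
def warmup_steps_from_fraction(total_steps: int, warmup_fraction: float = 0.02) -> int:
    """Warmup length as a fraction of the run, never less than one step."""
    return max(1, round(total_steps * warmup_fraction))


# Same hypothetical token budget, two batch sizes measured in tokens per step.
token_budget = 1_000_000_000
for tokens_per_step in (1_000_000, 50_000):
    steps = token_budget // tokens_per_step
    print(tokens_per_step, steps, warmup_steps_from_fraction(steps))
```

With the smaller batch the run has 20 times as many steps, and the warmup grows by the same factor, so it still covers the same share of the training tokens.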

In short: warm up linearly, then cosine decay to ~10% of peak. Peak LR ≈ 3e-4 for AdamW. Skipping warmup risks loss spikes or divergence.