Standard LLM schedule: linear warmup → cosine decay → small min LR.
What you're seeing
Warmup typically spans 1-2% of total steps, ramping the learning rate linearly up to its peak. The cosine phase then decays it from peak to about 10% of peak. Both choices are empirically motivated.
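The curve described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the plot: the function name `lr_at_step` and the parameters `warmup_frac` and `min_ratio` are hypothetical names chosen to mirror the Warmup % and Min/Peak sliders, and the defaults (1% warmup, 10% floor) are simply the typical values quoted above.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.01, min_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by cosine decay.

    warmup_frac: fraction of total_steps spent warming up (assumed name).
    min_ratio:   final LR as a fraction of peak_lr (assumed name).
    """
    warmup_steps = round(warmup_frac * total_steps)

    # Linear warmup: ramp from 0 up to peak_lr over warmup_steps.
    # With warmup_steps == 0 this branch is skipped and the curve starts at peak.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps

    # Cosine decay: progress runs from 0 (end of warmup) to 1 (end of training).
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0

    min_lr = min_ratio * peak_lr
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    total, peak = 1000, 3e-4
    for s in (0, 10, 20, 500, 1000):
        print(s, lr_at_step(s, total, peak, warmup_frac=0.02))
```

Setting `warmup_frac=0.0` reproduces the no-warmup case, where the very first step already uses the peak learning rate. Setting `min_ratio=0.0` lets the cosine phase decay all the way to zero.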
★ KEY TAKEAWAY
Warm up linearly to the peak learning rate, then cosine-decay to ~10% of peak. This is the standard LLM training schedule.
▶ WHAT TO TRY
- Slide Warmup %: with no warmup the curve starts at peak, which can cause divergence early in training.
- Slide Min/Peak: most schedules decay to about 10% of peak rather than to zero, so late training keeps refining.