Total RAM = weights + gradients + optimizer + activations.
What you're seeing
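Per parameter: FP32 with AdamW = 16 bytes (4 bytes each for the weight, its gradient, and the Adam moments m and v). BF16 mixed ≈ 10 bytes. Activations scale with d·L·seq (model width × number of layers × sequence length).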
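The per-parameter figures above can be turned into a quick estimate of the static part of training memory (weights, gradients, and optimizer state). This is a minimal sketch: the names `BYTES_PER_PARAM` and `static_memory_gb` are hypothetical, the byte counts are simply the ones quoted above, and GB is assumed to mean 10^9 bytes. Activations are not included.

```python
# Per-parameter byte counts as quoted in the text (not measured values).
BYTES_PER_PARAM = {
    "fp32": 16,        # weight + gradient + Adam m + Adam v, 4 bytes each
    "bf16_mixed": 10,  # approximate figure quoted in the text
}


def static_memory_gb(n_params: float, mode: str = "fp32") -> float:
    """Weights + gradients + optimizer state in GB (assuming 1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[mode] / 1e9


if __name__ == "__main__":
    for n in (350e6, 1e9):
        print(f"{n / 1e6:.0f}M params, FP32: {static_memory_gb(n):.1f} GB")
    # prints:
    # 350M params, FP32: 5.6 GB
    # 1000M params, FP32: 16.0 GB
```

Whatever RAM is left after this static part is the budget for activations, which is why the totals quoted for a given machine size are well above these numbers.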
★ KEY TAKEAWAY
CPU training memory ≈ FP32 weight size × ~4 (weights, gradients, and the two AdamW moment buffers) + activations. A 350M-parameter model fits in 16 GB; a 1B-parameter model needs 64 GB.
▶ WHAT TO TRY
- Slide Params from 50M to 3B to see the memory breakdown.
- Toggle BF16 mixed: halves activation memory.
- Toggle Grad checkpoint: saves ~70% of activation memory for ~33% extra compute (roughly one additional forward pass).
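The effect of the two toggles can be sketched as simple multipliers on a baseline activation footprint. The function names and the `baseline_gb` input are hypothetical, the factors (0.5, ~70% saved, ~33% extra compute) are the ones quoted in the list above, and treating the two toggles as multiplicative is an assumption of this sketch, not something the widget confirms.

```python
def activation_memory_gb(
    baseline_gb: float,
    bf16_mixed: bool = False,
    grad_checkpoint: bool = False,
) -> float:
    """Scale a baseline FP32 activation footprint by the quoted savings."""
    mem = baseline_gb
    if bf16_mixed:
        mem *= 0.5          # BF16 activations take half the bytes of FP32
    if grad_checkpoint:
        mem *= 1.0 - 0.70   # ~70% of activation memory saved (quoted figure)
    return mem


def compute_multiplier(grad_checkpoint: bool = False) -> float:
    """Relative training compute; checkpointing adds ~33% (quoted figure)."""
    return 1.33 if grad_checkpoint else 1.0


if __name__ == "__main__":
    base = 10.0  # hypothetical baseline activation memory in GB
    both = activation_memory_gb(base, bf16_mixed=True, grad_checkpoint=True)
    print(f"activations: {both:.1f} GB, compute x{compute_multiplier(True):.2f}")
```

With a hypothetical 10 GB baseline, enabling both toggles leaves about 1.5 GB of activations under these assumptions, at roughly a third more compute per step.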