CPU training only works if everything fits in RAM. A 350M model in FP32 with AdamW plus activations needs roughly 8-12 GB, and a 1B model needs 24 GB or more. Knowing the math tells you what hardware you need.

Per-parameter cost

FP32 weights:        4 bytes
FP32 gradients:      4 bytes
AdamW first moment:  4 bytes
AdamW second moment: 4 bytes
=== total per param: 16 bytes

This static state is four times the size of the FP32 weights alone, before counting activations (next section) and framework overhead. For 350M parameters, that is 350M × 16 bytes = 5.6 GB at FP32.
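The per-parameter arithmetic above can be checked in a few lines of Python. This is a minimal sketch: the names `BYTES_PER_PARAM_FP32_ADAMW` and `static_state_gb` are illustrative, and GB is taken to mean 10^9 bytes.

```python
# Bytes of static training state per parameter (FP32 + AdamW), as listed above.
BYTES_PER_PARAM_FP32_ADAMW = {
    "weights": 4,
    "gradients": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def static_state_gb(n_params: float) -> float:
    """Static training state in GB (1 GB = 1e9 bytes).

    Excludes activations and framework overhead.
    """
    bytes_per_param = sum(BYTES_PER_PARAM_FP32_ADAMW.values())  # 16
    return n_params * bytes_per_param / 1e9


print(static_state_gb(350e6))  # 5.6  (350M parameters)
print(static_state_gb(1e9))    # 16.0 (1B parameters)
```

Swapping the dictionary values is enough to try other optimizers or precisions under the same assumptions.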

Activation memory

# Stored during the forward pass, per token (rough rule of thumb):
#   ~20 * d_model * L bytes
# For d_model=1024, L=16, seq=1024:
#   20 * 1024 * 16 * 1024 bytes = ~336 MB per batch element

Activation memory scales with batch_size × seq_len × d_model × L. With batch=8 and seq=1024 on a 350M model, that is 8 × ~336 MB, or roughly 3 GB of activations. They are written during the forward pass and kept until the backward pass consumes them.
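The scaling rule can be written as a small estimator. This sketch reads the 20 * d_model * L figure as bytes per token, which is the assumption that reproduces the ~336 MB number; the function name and the `bytes_per_token_factor` parameter are illustrative, not part of any library.

```python
def activation_bytes(batch_size: int, seq_len: int, d_model: int, n_layers: int,
                     bytes_per_token_factor: int = 20) -> int:
    """Rough activation memory in bytes: factor * d_model * n_layers per token."""
    per_token = bytes_per_token_factor * d_model * n_layers
    return batch_size * seq_len * per_token


one_element = activation_bytes(batch_size=1, seq_len=1024, d_model=1024, n_layers=16)
batch_of_8 = activation_bytes(batch_size=8, seq_len=1024, d_model=1024, n_layers=16)

print(f"{one_element / 1e6:.0f} MB per batch element")  # 336 MB per batch element
print(f"{batch_of_8 / 1e9:.2f} GB for batch=8")         # 2.68 GB for batch=8
```

Because the estimate is linear in every argument, halving either the batch size or the sequence length halves the activation footprint.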

Gradient checkpointing

# Recompute activations during backward instead of storing all:
#   memory: O(sqrt(L) * batch * seq * d)
#   compute: one extra forward pass, ~33% more total compute

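This trades compute for memory. Only about sqrt(L) layers' worth of activations are held at a time (e.g., 4 layers' worth instead of 16), which saves ~70% of activation memory. The cost is ~33% extra compute, because each segment's forward pass runs a second time during the backward pass. In PyTorch, this is torch.utils.checkpoint. It is essential for CPU training of bigger models.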
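The trade-off can be put into numbers with a short arithmetic sketch. It does not call torch.utils.checkpoint; it only models the counting argument. It assumes a backward pass costs about twice a forward pass, and both function names are illustrative. The idealized count gives a 75% saving, slightly above the ~70% quoted, since a real run also keeps the checkpointed tensors at segment boundaries.

```python
import math


def checkpointed_activation_bytes(full_activation_bytes: float, n_layers: int) -> float:
    """Activation memory if only about sqrt(n_layers) layers' worth is held at once."""
    per_layer = full_activation_bytes / n_layers
    layers_held = math.ceil(math.sqrt(n_layers))
    return per_layer * layers_held


def relative_compute_with_checkpointing(backward_to_forward_ratio: float = 2.0) -> float:
    """Total compute relative to no checkpointing, assuming one extra forward pass."""
    baseline = 1.0 + backward_to_forward_ratio  # forward + backward
    return (baseline + 1.0) / baseline          # plus one recomputed forward


full = 8 * 336e6  # batch=8 at ~336 MB per element, from the activation estimate
kept = checkpointed_activation_bytes(full, n_layers=16)

print(f"saved: {1 - kept / full:.0%}")                           # saved: 75%
print(f"compute: {relative_compute_with_checkpointing():.2f}x")  # compute: 1.33x
```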

BF16 cuts activation memory in half

# With BF16 mixed precision:
weights:    4 bytes (FP32 master) + 2 bytes (BF16 copy)
gradients:  2 bytes (BF16)
optimizer:  8 bytes (FP32 m, v)
activations: 2 bytes (BF16)

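Mixed precision halves activation memory and can roughly double throughput on CPUs with AMX or AVX-512-BF16. Master weights stay FP32 to preserve optimizer precision, so static state still comes to about 16 bytes per parameter (6 + 2 + 8) and the savings come from activations: ~30-40% overall when activations dominate the footprint.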
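How much mixed precision saves overall depends on how large activations are relative to static state. The sketch below sums the two per-parameter listings from this article and compares totals. The 2.7 GB activation figure comes from the batch=8 example, while the 12 GB case is a hypothetical larger-batch workload chosen only for illustration; `overall_savings` is an illustrative name.

```python
# Per-parameter static state in bytes, copied from the two listings in this article.
FP32_ADAMW = {"weights": 4, "gradients": 4, "adam_m": 4, "adam_v": 4}
BF16_MIXED = {"fp32_master_weights": 4, "bf16_weight_copy": 2,
              "bf16_gradients": 2, "adam_m": 4, "adam_v": 4}


def overall_savings(n_params: float, fp32_activation_gb: float) -> float:
    """Fraction of total memory saved by BF16 mixed precision.

    Assumes BF16 halves activation memory and changes nothing else.
    """
    fp32_total = n_params * sum(FP32_ADAMW.values()) / 1e9 + fp32_activation_gb
    bf16_total = n_params * sum(BF16_MIXED.values()) / 1e9 + fp32_activation_gb / 2
    return 1 - bf16_total / fp32_total


print(sum(FP32_ADAMW.values()), sum(BF16_MIXED.values()))  # 16 16
print(f"{overall_savings(350e6, 2.7):.0%}")   # 16% (batch=8 example)
print(f"{overall_savings(350e6, 12.0):.0%}")  # 34% (hypothetical larger batch)
```

Under these assumptions the saving approaches 50% only as activations come to dominate the total.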

Practical CPU training table

Model size | RAM (FP32 + AdamW + acts) | RAM (BF16 mix)
---------- | ------------------------- | --------------
50M        | 2 GB                      | 1.5 GB
125M       | 5 GB                      | 3 GB
350M       | 12 GB                     | 7 GB
1.3B       | 40 GB                     | 24 GB
3B         | 90 GB                     | 55 GB

Practical CPU training ceiling: 1-3B parameters on 64-128 GB workstations. Above this, you need a GPU. Inference (no gradients or optimizer state) tolerates much larger models.
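The table can be turned into a quick lookup for a given machine. This is a sketch under one added assumption that is not from the table: only about 80% of installed RAM is treated as usable, leaving headroom for the OS and the framework. The names `RAM_TABLE`, `largest_model_that_fits`, and `usable_fraction` are illustrative.

```python
from typing import Optional

# (label, RAM in GB for FP32 + AdamW + activations, RAM in GB for BF16 mixed),
# copied from the table above.
RAM_TABLE = [
    ("50M", 2.0, 1.5),
    ("125M", 5.0, 3.0),
    ("350M", 12.0, 7.0),
    ("1.3B", 40.0, 24.0),
    ("3B", 90.0, 55.0),
]


def largest_model_that_fits(ram_gb: float, bf16_mixed: bool = False,
                            usable_fraction: float = 0.8) -> Optional[str]:
    """Return the largest tabulated model whose estimate fits in usable RAM."""
    budget = ram_gb * usable_fraction  # assumption: keep ~20% free
    best = None
    for label, fp32_gb, bf16_gb in RAM_TABLE:  # rows ordered smallest to largest
        needed = bf16_gb if bf16_mixed else fp32_gb
        if needed <= budget:
            best = label
    return best


print(largest_model_that_fits(16))   # 350M
print(largest_model_that_fits(64))   # 1.3B
print(largest_model_that_fits(128))  # 3B
```

Setting `usable_fraction=1.0` removes the headroom assumption and compares against the raw table values.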

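Rule of thumb: budget 16 bytes per parameter for weights, gradients, and AdamW state (4× the FP32 weight size), then add activations, which BF16 roughly halves and gradient checkpointing cuts further. A 350M model is comfortable on 16 GB; a 1B model on 64 GB.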