▶ Interactive Lab

Gradient Accumulation

Gradients from K micro-batches are accumulated to emulate one large batch that is K times the micro-batch size.

K micro-batches accumulate gradients, followed by a single optimizer step at the end.

What you're seeing

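Each micro-batch's loss is divided by K, so the accumulated gradient equals the average gradient over all K micro-batches (exactly so when the micro-batches are the same size). Activation memory stays at the level of a single micro-batch, because only one micro-batch is processed at a time.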
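The scaling described above can be checked with a minimal sketch in plain Python. It is not the lab's own code: the toy model (a single weight `w` with a mean-squared-error loss), the data, and the helper names `mse_grad`, `accumulated_grad`, and `train_step` are all assumptions made for illustration. Because differentiation is linear, dividing each micro-batch's loss by K is the same as dividing its gradient by K, which is what the sketch does.

```python
# Toy model (assumed for illustration): y_hat = w * x with a mean-squared-error loss.

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, k):
    """Sum the gradients of k equal-size micro-batches, each scaled by 1/k."""
    assert len(xs) % k == 0, "this sketch assumes equal-size micro-batches"
    size = len(xs) // k
    grad = 0.0  # accumulation buffer, kept across micro-batches
    for i in range(k):
        lo, hi = i * size, (i + 1) * size
        # Only one micro-batch is processed at a time.
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / k
    return grad

def train_step(w, xs, ys, k, lr):
    """One optimizer step (plain SGD) after all k micro-batches are accumulated."""
    return w - lr * accumulated_grad(w, xs, ys, k)

# Hypothetical data: the true weight is 3.0.
xs = [float(i) for i in range(1, 9)]
ys = [3.0 * x for x in xs]
w = 0.5

full = mse_grad(w, xs, ys)               # gradient of the full batch of 8
acc = accumulated_grad(w, xs, ys, k=4)   # 4 micro-batches of 2
print(abs(full - acc) < 1e-9)
```

The accumulated gradient matches the full-batch gradient, while each call to `mse_grad` inside the loop only ever sees one micro-batch.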

★ KEY TAKEAWAY
K micro-batches → one optimizer step. Memory of one micro-batch, gradient of an effective K× batch.
▶ WHAT TO TRY
  • Set K to 32 and click Run — watch 32 micro-batches build up before one optimizer step.
  • Keep in mind that each micro-batch's loss is divided by K, so the accumulated gradient equals the average over the K micro-batches.
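The averaging claim can also be written out as a short derivation. The notation is an assumption introduced here, not taken from the lab: $\theta$ denotes the parameters, $m$ the number of examples per micro-batch, $\ell$ the per-example loss, and $x_{ij}$ the $j$-th example of micro-batch $i$.

```latex
% Loss of micro-batch i (mean over its m examples)
L_i(\theta) = \frac{1}{m} \sum_{j=1}^{m} \ell(\theta; x_{ij})

% Accumulated gradient when each micro-batch loss is divided by K
g = \sum_{i=1}^{K} \nabla_\theta \frac{L_i(\theta)}{K}
  = \frac{1}{K} \sum_{i=1}^{K} \nabla_\theta L_i(\theta)
  = \frac{1}{Km} \sum_{i=1}^{K} \sum_{j=1}^{m} \nabla_\theta \ell(\theta; x_{ij})
```

The last expression is the gradient of the mean loss over all $Km$ examples, i.e. the gradient of one batch that is K times larger. The equality relies on every micro-batch having the same size $m$; with unequal sizes, a plain division by K weights small micro-batches too heavily.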