Backpropagation — Chain Rule for NN Gradients

Forward pass

Compute outputs layer by layer. Store activations for backward.

Advertisement

Compute gradient of loss w.r.t. output. Propagate backward via chain rule. Accumulate weight gradients.

Advertisement

Modern frameworks (PyTorch, JAX) build computation graph automatically. Reverse-mode AD = backprop.

Store all activations for backward. Gradient checkpointing trades compute for memory (rematerialize activations).