Forward pass

Compute outputs layer by layer. Store activations for backward.

Advertisement

Backward pass

Compute gradient of loss w.r.t. output. Propagate backward via chain rule. Accumulate weight gradients.

Advertisement

Automatic differentiation

Modern frameworks (PyTorch, JAX) build computation graph automatically. Reverse-mode AD = backprop.

Memory

Store all activations for backward. Gradient checkpointing trades compute for memory (rematerialize activations).