Forward pass
Compute outputs layer by layer. Store activations for backward.
Advertisement
Backward pass
Compute gradient of loss w.r.t. output. Propagate backward via chain rule. Accumulate weight gradients.
Advertisement
Automatic differentiation
Modern frameworks (PyTorch, JAX) build computation graph automatically. Reverse-mode AD = backprop.
Memory
Store all activations for backward. Gradient checkpointing trades compute for memory (rematerialize activations).