Vanilla RNN
h_t = tanh(W·h_{t-1} + U·x_t). Backpropagation through time multiplies the gradient by W and the tanh derivative at every step, so it vanishes or explodes over long sequences.
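A minimal sketch of the recurrence above, in plain Python. The function names `rnn_step` and `scalar_grad_through_time` are illustrative, not from any library. The second function assumes a one-unit RNN so that dh_T/dh_0 reduces to a product of scalars, which makes the vanishing case easy to see.

```python
import math

def rnn_step(W, U, h_prev, x):
    """One vanilla RNN step: h_t = tanh(W·h_{t-1} + U·x_t), no bias term."""
    pre = [
        sum(w * h for w, h in zip(W_row, h_prev))
        + sum(u * xi for u, xi in zip(U_row, x))
        for W_row, U_row in zip(W, U)
    ]
    return [math.tanh(p) for p in pre]

def scalar_grad_through_time(w, u, xs, h0=0.0):
    """For a 1-unit RNN, return dh_T/dh_0 = product over t of w * (1 - h_t^2)."""
    h, grad = h0, 1.0
    for x in xs:
        h = math.tanh(w * h + u * x)
        # tanh'(a) = 1 - tanh(a)^2, so each step scales the gradient by w * (1 - h^2).
        grad *= w * (1.0 - h * h)
    return grad
```

Since 1 - h^2 never exceeds 1, each factor has magnitude at most |w|. With |w| < 1 the product shrinks at least geometrically in the sequence length.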
LSTM
Cell state + input/forget/output gates. Additive cell-state updates → gradient survives over more steps. Dominant in NLP roughly 2015-2018, before Transformers took over.
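A hedged sketch of one LSTM step in the common formulation, in plain Python. The `params` layout (a dict mapping each of "i", "f", "o", "g" to a `(W, U, b)` tuple) and the helper names are assumptions made for illustration. Real libraries pack these weights differently.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, U, b, h, x):
    """Return W·h + U·x + b as a list."""
    return [
        sum(w * hj for w, hj in zip(W_row, h))
        + sum(u * xj for u, xj in zip(U_row, x))
        + bi
        for W_row, U_row, bi in zip(W, U, b)
    ]

def lstm_step(params, h_prev, c_prev, x):
    """One LSTM step; returns the new hidden state h and cell state c."""
    i = [sigmoid(a) for a in affine(*params["i"], h_prev, x)]    # input gate
    f = [sigmoid(a) for a in affine(*params["f"], h_prev, x)]    # forget gate
    o = [sigmoid(a) for a in affine(*params["o"], h_prev, x)]    # output gate
    g = [math.tanh(a) for a in affine(*params["g"], h_prev, x)]  # candidate values
    # Additive update: the old cell state is scaled by f and added to, not squashed.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

When the forget gate sits near 1 and the input gate contributes little, `c` passes through almost unchanged, which is the path that lets the gradient survive.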
GRU
Simplified: reset + update gates, no separate cell state. Fewer parameters than LSTM, often similar performance.
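A sketch of one GRU step plus a rough parameter count, in plain Python. Assumptions: the `params` dict layout and function names are illustrative; the update convention here is h_t = (1 - z)·h_{t-1} + z·h̃_t (some references and libraries swap the roles of z and 1 - z); and the count assumes one bias vector per weight block.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, U, b, h, x):
    """Return W·h + U·x + b as a list."""
    return [
        sum(w * hj for w, hj in zip(W_row, h))
        + sum(u * xj for u, xj in zip(U_row, x))
        + bi
        for W_row, U_row, bi in zip(W, U, b)
    ]

def gru_step(params, h_prev, x):
    """One GRU step; params maps "r", "z", "h" to (W, U, b) tuples."""
    r = [sigmoid(a) for a in affine(*params["r"], h_prev, x)]  # reset gate
    z = [sigmoid(a) for a in affine(*params["z"], h_prev, x)]  # update gate
    gated_h = [rk * hk for rk, hk in zip(r, h_prev)]
    h_tilde = [math.tanh(a) for a in affine(*params["h"], gated_h, x)]
    # Interpolate between the old state and the candidate state.
    return [(1.0 - zk) * hk + zk * nk for zk, hk, nk in zip(z, h_prev, h_tilde)]

def gated_rnn_param_count(hidden, inputs, blocks):
    """Parameters for `blocks` weight blocks: 3 for a GRU, 4 for an LSTM."""
    return blocks * (hidden * hidden + hidden * inputs + hidden)
```

Under these assumptions a GRU has three weight blocks to the LSTM's four, so about 75% of the parameters at the same hidden and input sizes.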
Bidirectional
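Run one RNN forward and one backward over the sequence, then combine (usually concatenate) their hidden states per time step. Requires the full sequence, so it does not suit streaming or left-to-right generation. Used in encoders (e.g. the BiLSTM in ELMo, a BERT predecessor).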
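A sketch of the bidirectional wrapper, assuming the per-step states are combined by concatenation. `run_rnn` and `bidirectional` are illustrative names, and `step` stands for any function of the form `step(h_prev, x) -> h`, such as a vanilla RNN, LSTM, or GRU step with its weights already bound.

```python
def run_rnn(step, h0, xs):
    """Apply step over xs in order and return the hidden state at every position."""
    states, h = [], h0
    for x in xs:
        h = step(h, x)
        states.append(h)
    return states

def bidirectional(step_fwd, step_bwd, h0_fwd, h0_bwd, xs):
    """Concatenate forward and backward hidden states at each time step."""
    fwd = run_rnn(step_fwd, h0_fwd, xs)
    # The backward RNN reads the sequence reversed; flip its outputs back
    # so that index t in both lists refers to the same input position.
    bwd = run_rnn(step_bwd, h0_bwd, xs[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]
```

The backward pass cannot start until the last input is known, which is why the whole sequence has to be available up front.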