Interactive Explainer
An LSTM cell has three gates: a forget gate decides what to erase from the cell state, an input gate decides what to write, and an output gate decides what to read. Drag each slider, feed an input sequence, and watch the cell state flow — or freeze in place.
Vanilla RNNs struggle with long-range dependencies: gradients backpropagated through time pick up one factor of W⊤·diag(tanh′) per timestep, and the product either vanishes or explodes. Hochreiter & Schmidhuber's 1997 insight: give the recurrence its own cell state, updated additively through gates (sigmoid values in [0, 1]) rather than through a repeated matrix multiply.
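To see the instability concretely, here is a minimal NumPy sketch (ours, not part of the explainer) that multiplies 100 backward-pass Jacobian factors W⊤·diag(tanh′). The hidden size, weight scales, and stand-in activations are arbitrary choices, and the exact norms depend on the seed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                   # hidden size (arbitrary)
for scale in (0.5, 3.0):                 # small vs. large recurrent weights
    W = rng.normal(0, scale / np.sqrt(n), size=(n, n))
    prod = np.eye(n)
    for _ in range(100):                 # backprop through 100 timesteps
        h = rng.normal(size=n)           # stand-in pre-activations
        prod = W.T @ np.diag(1 - np.tanh(h) ** 2) @ prod
    # Typically collapses toward 0 for scale=0.5 and blows up for scale=3.0.
    print(f"scale={scale}: gradient-product norm = {np.linalg.norm(prod):.3e}")
```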
The cell state becomes a conveyor belt. At each timestep, the forget gate scales what is already on the belt, the input gate controls how much of a new candidate gets added, and the output gate decides how much of the result to expose as the hidden state.
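In code, one timestep is a handful of element-wise operations. The sketch below follows the standard LSTM formulation in NumPy; the parameter names (W_f, b_f, ...) and toy sizes are our choices, not fixed by the explainer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One timestep: gates read [h_prev, x]; the cell state is the belt."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate: what to erase
    i = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate: what to write
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])  # candidate values to write
    c = f * c_prev + i * c_tilde                # multiply, then add: the belt
    o = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate: what to expose
    h = o * np.tanh(c)                          # hidden state shown downstream
    return h, c

# Toy usage: hidden size 4, input size 3, small random parameters.
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
p = {f"W_{g}": rng.normal(0, 0.1, (n_h, n_h + n_x)) for g in "fico"}
p |= {f"b_{g}": np.zeros(n_h) for g in "fico"}
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):             # feed a 5-step input sequence
    h, c = lstm_step(x, h, c, p)
print(h, c)
```

Note that the only operations connecting c across timesteps are a pointwise multiply and an add; no weight matrix ever acts on c itself, which is exactly what keeps its gradient path open.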
GRU (Cho et al., 2014) merges the forget and input gates into a single "update" gate, and merges the cell state c and hidden state h into one vector. Fewer parameters, often comparable accuracy. Picking between them in 2026 is mostly personal preference; both are dwarfed by Transformers on most tasks. But for tiny devices and streaming, they are still the workhorses.
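For contrast, here is one GRU step under the same conventions (again, parameter names are ours). The update gate u does double duty, and there is no separate cell state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU timestep: the update gate u replaces forget + input."""
    z = np.concatenate([h_prev, x])
    u = sigmoid(p["W_u"] @ z + p["b_u"])        # update gate: rewrite vs. keep
    r = sigmoid(p["W_r"] @ z + p["b_r"])        # reset gate: how much history the candidate sees
    zr = np.concatenate([r * h_prev, x])
    h_tilde = np.tanh(p["W_h"] @ zr + p["b_h"]) # candidate state
    return (1 - u) * h_prev + u * h_tilde       # one state: c and h merged
```

One caveat: references differ on whether u or 1 − u multiplies the old state (PyTorch's nn.GRU, for instance, uses the opposite convention from this sketch), so check the definition before comparing weights across implementations.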
Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.