Interactive Explainer
An LSTM cell has three gates: a forget gate decides what to erase from the cell state, an input gate decides what to write, and an output gate decides what to read. Drag each slider, feed an input sequence, and watch the cell state flow — or freeze in place.
Vanilla RNNs struggle with long-range dependencies: gradients backpropagated through time pick up one factor of W⊤·diag(tanh′) per timestep, and the product either vanishes or explodes. Hochreiter & Schmidhuber's 1997 insight: give the recurrence its own cell state, updated additively through gates (sigmoid values in [0, 1]) rather than through a repeated matrix multiply.
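To see the instability concretely, here is a minimal NumPy sketch (ours, not part of the explainer) that multiplies 100 backward-pass Jacobian factors W⊤·diag(tanh′). The hidden size, weight scales, and stand-in activations are arbitrary choices, and the exact norms depend on the seed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                   # hidden size (arbitrary)
for scale in (0.5, 3.0):                 # small vs. large recurrent weights
    W = rng.normal(0, scale / np.sqrt(n), size=(n, n))
    prod = np.eye(n)
    for _ in range(100):                 # backprop through 100 timesteps
        h = rng.normal(size=n)           # stand-in pre-activations
        prod = W.T @ np.diag(1 - np.tanh(h) ** 2) @ prod
    # Typically collapses toward 0 for scale=0.5 and blows up for scale=3.0.
    print(f"scale={scale}: gradient-product norm = {np.linalg.norm(prod):.3e}")
```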
The cell state becomes a conveyor belt. At each timestep, the forget gate scales what is already on the belt, the input gate controls how much of a new candidate gets added, and the output gate decides how much of the result to expose as the hidden state.
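In code, one timestep is a handful of element-wise operations. The sketch below follows the standard LSTM formulation in NumPy; the parameter names (W_f, b_f, ...) and toy sizes are our choices, not fixed by the explainer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One timestep: gates read [h_prev, x]; the cell state is the belt."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate: what to erase
    i = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate: what to write
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])  # candidate values to write
    c = f * c_prev + i * c_tilde                # multiply, then add: the belt
    o = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate: what to expose
    h = o * np.tanh(c)                          # hidden state shown downstream
    return h, c

# Toy usage: hidden size 4, input size 3, small random parameters.
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
p = {f"W_{g}": rng.normal(0, 0.1, (n_h, n_h + n_x)) for g in "fico"}
p |= {f"b_{g}": np.zeros(n_h) for g in "fico"}
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):             # feed a 5-step input sequence
    h, c = lstm_step(x, h, c, p)
print(h, c)
```

Note that the only operations connecting c across timesteps are a pointwise multiply and an add; no weight matrix ever acts on c itself, which is exactly what keeps its gradient path open.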
GRU (Cho et al., 2014) merges the forget and input gates into a single "update" gate, and merges the cell state c and hidden state h into one vector. Fewer parameters, often comparable accuracy. Picking between them in 2026 is mostly personal preference; both are dwarfed by Transformers on most tasks. But for tiny devices and streaming, they are still the workhorses.
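For contrast, here is one GRU step under the same conventions (again, parameter names are ours). The update gate u does double duty, and there is no separate cell state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU timestep: the update gate u replaces forget + input."""
    z = np.concatenate([h_prev, x])
    u = sigmoid(p["W_u"] @ z + p["b_u"])        # update gate: rewrite vs. keep
    r = sigmoid(p["W_r"] @ z + p["b_r"])        # reset gate: how much history the candidate sees
    zr = np.concatenate([r * h_prev, x])
    h_tilde = np.tanh(p["W_h"] @ zr + p["b_h"]) # candidate state
    return (1 - u) * h_prev + u * h_tilde       # one state: c and h merged
```

One caveat: references differ on whether u or 1 − u multiplies the old state (PyTorch's nn.GRU, for instance, uses the opposite convention from this sketch), so check the definition before comparing weights across implementations.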
Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.