Interactive Explainer
A single 3×3 conv sees 9 pixels. Stack three, and a deep feature sees a 7×7 region. Add stride-2 layers, and the receptive field soon spans the entire image. Drag the depth slider to watch it grow.
Convolution is local by design. That's the whole point — it encodes a prior that nearby pixels matter together. But a single 3×3 kernel can only see 9 pixels. So how does a CNN ever recognize a whole dog?
The answer is that receptive field grows with depth. Each layer's neuron sees a patch of the previous layer's output, which was itself built from a patch of the layer before that. After a few layers, a single activation depends on a large region of the input.
The arithmetic is simple: each stride-1 conv of kernel size K adds K−1 to the receptive field. Strides compound the growth: after a stride-2 layer, every later layer's contribution counts double. Dilated convs insert gaps between kernel taps, expanding the receptive field without any extra parameters.
RF_ℓ = RF_{ℓ−1} + (K_ℓ − 1) · ∏_{i=1}^{ℓ−1} S_i,  with RF_0 = 1
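The recurrence above can be sketched as a small calculator. This is a minimal illustration, not a library function; the `(kernel, stride, dilation)` tuple format is an assumption for this sketch, and dilation is folded in via the effective kernel size K + (K−1)(d−1):

```python
def receptive_field(layers):
    """Receptive field after each layer, input-first.

    layers: list of (kernel_size, stride, dilation) tuples.
    Implements RF_l = RF_{l-1} + (K_eff - 1) * prod(strides of layers 1..l-1),
    with RF_0 = 1 and K_eff = K + (K - 1) * (dilation - 1).
    """
    rf, jump = 1, 1          # jump = product of strides seen so far
    rfs = []
    for k, s, d in layers:
        k_eff = k + (k - 1) * (d - 1)   # dilation inflates the effective kernel
        rf += (k_eff - 1) * jump
        jump *= s                        # later layers' contributions scale up
        rfs.append(rf)
    return rfs

# Three stacked 3x3 stride-1 convs: RF grows 3 -> 5 -> 7
print(receptive_field([(3, 1, 1)] * 3))           # [3, 5, 7]
# A stride-2 first layer doubles the second layer's contribution
print(receptive_field([(3, 2, 1), (3, 1, 1)]))    # [3, 7]
# A dilation-2 3x3 conv acts like a sparse 5x5
print(receptive_field([(3, 1, 2)]))               # [5]
```

Note how the stride-2 variant reaches RF 7 in two layers instead of three, which is exactly the "multiplied growth rate" described above.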
Three stacked 3×3 convolutions and a single 7×7 conv have the same receptive field. But stacked 3×3 wins on two fronts:
- Fewer parameters: with C channels in and out, three 3×3 convs cost 3·(3²C²) = 27C² weights versus 49C² for one 7×7 conv, roughly 45% fewer.
- More nonlinearity: each conv in the stack is followed by an activation, so the stack applies three nonlinearities where the single 7×7 applies one, making the learned function more expressive.
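The parameter comparison is a two-line calculation. A quick sketch (bias terms omitted, C = 64 chosen arbitrarily for illustration):

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k conv layer, ignoring bias."""
    return k * k * c_in * c_out

C = 64
stacked = 3 * conv_params(3, C, C)   # three 3x3 convs: 27 * C^2
single = conv_params(7, C, C)        # one 7x7 conv:    49 * C^2
print(stacked, single)               # 110592 200704
print(round(stacked / single, 2))    # 0.55 -- roughly 45% fewer weights
```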
This is the VGG insight (Simonyan & Zisserman, 2014). It became the default design choice in most CNN architectures from 2014 onward.
Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.