Interactive Explainer
A single 3×3 conv sees 9 pixels. Stack three, and a deep feature sees a 7×7 region. Add stride-2 layers, and the receptive field soon spans the entire image. Drag the depth slider to watch it grow.
Convolution is local by design. That's the whole point — it encodes a prior that nearby pixels matter together. But a single 3×3 kernel can only see 9 pixels. So how does a CNN ever recognize a whole dog?
The answer is that receptive field grows with depth. Each layer's neuron sees a patch of the previous layer's output, which was itself built from a patch of the layer before that. After a few layers, a single activation depends on a large region of the input.
The arithmetic is simple: each stride-1 conv of kernel size K adds K−1 to the receptive field. Strides compound the growth: after a stride-2 layer, every later layer's contribution counts double. Dilated convs insert gaps between kernel taps, expanding the receptive field without any extra parameters.
RF_ℓ = RF_{ℓ−1} + (K_ℓ − 1) · ∏_{i=1}^{ℓ−1} S_i,  with RF_0 = 1
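The recurrence above can be sketched as a small calculator. This is a minimal illustration, not a library function; the `(kernel, stride, dilation)` tuple format is an assumption for this sketch, and dilation is folded in via the effective kernel size K + (K−1)(d−1):

```python
def receptive_field(layers):
    """Receptive field after each layer, input-first.

    layers: list of (kernel_size, stride, dilation) tuples.
    Implements RF_l = RF_{l-1} + (K_eff - 1) * prod(strides of layers 1..l-1),
    with RF_0 = 1 and K_eff = K + (K - 1) * (dilation - 1).
    """
    rf, jump = 1, 1          # jump = product of strides seen so far
    rfs = []
    for k, s, d in layers:
        k_eff = k + (k - 1) * (d - 1)   # dilation inflates the effective kernel
        rf += (k_eff - 1) * jump
        jump *= s                        # later layers' contributions scale up
        rfs.append(rf)
    return rfs

# Three stacked 3x3 stride-1 convs: RF grows 3 -> 5 -> 7
print(receptive_field([(3, 1, 1)] * 3))           # [3, 5, 7]
# A stride-2 first layer doubles the second layer's contribution
print(receptive_field([(3, 2, 1), (3, 1, 1)]))    # [3, 7]
# A dilation-2 3x3 conv acts like a sparse 5x5
print(receptive_field([(3, 1, 2)]))               # [5]
```

Note how the stride-2 variant reaches RF 7 in two layers instead of three, which is exactly the "multiplied growth rate" described above.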
Three stacked 3×3 convolutions and a single 7×7 conv have the same receptive field. But stacked 3×3 wins on two fronts:
- Fewer parameters: with C channels in and out, three 3×3 convs cost 3·(3²C²) = 27C² weights versus 49C² for one 7×7 conv, roughly 45% fewer.
- More nonlinearity: each conv in the stack is followed by an activation, so the stack applies three nonlinearities where the single 7×7 applies one, making the learned function more expressive.
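The parameter comparison is a two-line calculation. A quick sketch (bias terms omitted, C = 64 chosen arbitrarily for illustration):

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k conv layer, ignoring bias."""
    return k * k * c_in * c_out

C = 64
stacked = 3 * conv_params(3, C, C)   # three 3x3 convs: 27 * C^2
single = conv_params(7, C, C)        # one 7x7 conv:    49 * C^2
print(stacked, single)               # 110592 200704
print(round(stacked / single, 2))    # 0.55 -- roughly 45% fewer weights
```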
This is the VGG insight (Simonyan & Zisserman, 2014). It became the default design choice in most CNN architectures from 2014 onward.
Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.