1D toy: input length n, kernel size k, padding p, stride s → output length ⌊(n + 2p − k) / s⌋ + 1.
This is exactly the formula students see in textbooks — but now we know where each term comes from.
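As a quick sanity check (a minimal sketch with arbitrarily chosen sizes), PyTorch's `Conv1d` reproduces the formula:

```python
import torch
import torch.nn as nn

# Check out = floor((n + 2p - k) / s) + 1 on a toy 1-D input.
n, k, p, s = 10, 3, 1, 2
y = nn.Conv1d(1, 1, kernel_size=k, padding=p, stride=s)(torch.randn(1, 1, n))
print(y.shape[-1], (n + 2 * p - k) // s + 1)   # 5 5
```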
VGG's first layer: Conv2d(3, 64, kernel_size=3, stride=1, padding=1) on a (3, 224, 224) input → (64, 224, 224).
Output is same size as input — this is "same" padding (padding = (k − 1) / 2 when stride = 1).
ImageNet stem: input (3, 224, 224), apply Conv2d(3, 64, kernel_size=7, stride=2, padding=3):
Output: (64, 112, 112).
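A one-line shape check of the stem example:

```python
import torch
import torch.nn as nn

stem = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
print(stem(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 64, 112, 112])
# (224 + 2*3 - 7) // 2 + 1 = 112
```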
Why convolution is cheap. The stem conv above has 64 · 3 · 7 · 7 + 64 ≈ 9.5k parameters. An MLP mapping the same (3, 224, 224) input to the same (64, 112, 112) output would need 150,528 × 802,816 ≈ 1.2 × 10¹¹ weights.
Parameters are shared across positions, and each output depends on a small region. That's what makes the conv orders of magnitude cheaper than an equivalent MLP — not a marginal win.
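Back-of-the-envelope check of those two counts (the MLP count here is the hypothetical dense layer mapping the stem's input to its output):

```python
conv_params = 64 * 3 * 7 * 7 + 64                  # shared 7x7 kernels + biases
mlp_params = (3 * 224 * 224) * (64 * 112 * 112)    # one weight per (input, output) pair, biases ignored
print(conv_params, mlp_params)                     # 9472 and ~1.2e11
```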
For every conv layer, think in this order: kernel size → stride → padding → dilation.
99% of production CNNs only vary the first two — padding just follows from the kernel size. Segmentation (L9) uses dilation; advanced audio / video sometimes does too. For images, reach for kernel_size=3, padding=1, stride=1 or 2 — nearly every other choice is a specific research idea.
Input · 4 × 4 matrix · rows [1 2 8 3], [4 6 5 1], [9 7 2 3], [5 3 1 0].
2×2 max-pool, stride 2 · slide a 2×2 window non-overlapping.
| window | values | max |
|---|---|---|
| top-left | 1, 2, 4, 6 | 6 |
| top-right | 8, 3, 5, 1 | 8 |
| bottom-left | 9, 7, 5, 3 | 9 |
| bottom-right | 2, 3, 1, 0 | 3 |
Output · 2 × 2 matrix · rows [6 8], [9 3].
Halved spatial size · kept the strongest activation per region · gained translation invariance.
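The same worked example in PyTorch (the 4×4 input from above):

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 2., 8., 3.],
                  [4., 6., 5., 1.],
                  [9., 7., 2., 3.],
                  [5., 3., 1., 0.]]).reshape(1, 1, 4, 4)   # (batch, channels, H, W)
print(nn.MaxPool2d(kernel_size=2, stride=2)(x).squeeze())
# tensor([[6., 8.],
#         [9., 3.]])
```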
Two kinds of cat detector: one that tells you where the cat is (equivariant), and one that only tells you that a cat is present (invariant).
Convolution is translation equivariant. "Equivariant" = changes the same way.
Pooling adds local translation invariance. "Invariant" = doesn't change.
Move the strongest feature within the window → same output. Stack many such layers and the local invariance compounds into global invariance: the final layer cares that a cat is present, not where.
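A minimal sketch of both properties with a random conv (the exact values don't matter, only that both prints are True):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.zeros(1, 1, 8, 8)
x[..., 2, 2] = 1.0                           # a lone "feature"
x_shift = torch.roll(x, shifts=2, dims=-1)   # same feature, 2 px to the right

y, y_shift = conv(x), conv(x_shift)
# Equivariance: the response map shifts exactly with the input.
print(torch.allclose(torch.roll(y, shifts=2, dims=-1), y_shift))          # True
# Invariance: pooling over all positions gives the same answer either way.
print(torch.allclose(y.amax(dim=(-2, -1)), y_shift.amax(dim=(-2, -1))))   # True
```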
In 2026, stride-2 convs often replace max-pooling entirely (less info loss).
Imagine a cat in the corner of an image vs centered. Should the network's answer depend on where the cat is?
Pooling creates a small "I don't care exactly where" window. Stacks of pool+conv compound this: by layer 5, the network cares about the cat's presence, not its exact pixel position.
This is the invariance gradient a classification CNN builds — conv1 nearly equivariant (tells you where), final global pool fully invariant (tells you what).
How deep features "see" far-away pixels
Interactive: drag depth/kernel/stride and watch RF grow live — receptive-field-grower.
Analogy · crossing a complex city. One expert who knows a 7-block radius can plan the route in a single hop; three locals who each know 3 blocks can cover the same distance as a chain. The 3-layer stack has the same coverage as the 7×7 expert — but it's more efficient and does more "thinking" between hops.
Three stacked 3×3 convs vs. one 7×7 conv (stride = 1, same number of channels C in and out):
Receptive field (RF) — trace how far back one output pixel "sees": one 3×3 layer sees 3, a second stacked on top sees 5, a third sees 7.
Parameters (formula: C_out · C_in · k² per layer): one 7×7 layer costs 49C²; three 3×3 layers cost 3 · 9C² = 27C².
|  | 1× (7×7) | 3× (3×3) stacked |
|---|---|---|
| RF | 7 | 7 |
| Params | 49C² | 27C² |
| ReLUs | 1 | 3 |
VGG (2014) built its whole architecture around this trade.
Mid-network layer with C = 256 channels: one 7×7 costs 49 · 256² ≈ 3.2M parameters; the 3×3 stack costs 27 · 256² ≈ 1.8M.
Saving: ~1.4 million parameters per block, plus 2 extra non-linearities. Repeated across many blocks → enormous total saving. This is the core design principle of VGG.
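Counting the parameters directly (a sketch assuming C = 256, as in the example above):

```python
import torch.nn as nn

C = 256
big = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
stack = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False) for _ in range(3)])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(big), count(stack), count(big) - count(stack))   # 3211264 1769472 1441792
```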
Analogy · dominoes. Push the last domino — how many dominoes were involved? Each conv layer is like one push. A 3×3 kernel "pushes" a 3×3 region; the next layer's 3×3 pushes a region already pushed by the first. The effect propagates.
For a stack with stride 1, the RF formula is: RF_l = RF_{l−1} + (k − 1), so each 3×3 layer adds 2.
5-layer CNN, all 3×3, stride 1:
| Layer | Calculation | RF |
|---|---|---|
| Input | — | 1 |
| Conv1 | 1 + 2 | 3 |
| Conv2 | 3 + 2 | 5 |
| Conv3 | 5 + 2 | 7 |
| Conv4 | 7 + 2 | 9 |
| Conv5 | 9 + 2 | 11 |
Stride. A layer with stride s > 1 scales every later layer's contribution by s: in general the RF grows by (k − 1) × (product of all earlier strides) at each layer.
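The recursion is easy to code (a sketch; strides accumulate into a "jump" factor that scales every later layer's contribution):

```python
def receptive_field(kernels, strides):
    """Theoretical RF of a conv stack, layer kernels/strides given in order."""
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump   # this layer's contribution, scaled by earlier strides
        jump *= s
    return rf

print(receptive_field([3] * 5, [1] * 5))       # 11 -- matches the table above
print(receptive_field([7, 3, 3], [2, 1, 1]))   # 15 -- a stride-2 stem doubles later growth
```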
The RF size tells us the scale of features a layer can detect.
| RF | What fits in this window |
|---|---|
| 3×3 | a single corner; a fragment of an edge |
| 5×5 | a longer edge or a junction |
| 11×11 | on a 28×28 MNIST digit, the loop of a "6" or the cross of a "4" |
| 49×49 | on a 224×224 ImageNet photo, an entire eye, nose, or small object |
| ~500 | the whole image (ResNet-50 final block) |
Each layer learns to use this growing context — edges → textures → parts → objects. Deeper nets = larger RF = richer hierarchies.
Theoretical RF grows linearly with depth.
Effective RF (Luo et al. 2016) is more Gaussian-shaped — pixels near the centre contribute much more than edge pixels.
Two consequences: the theoretical RF overstates the context a layer actually uses, and you need more depth (or downsampling/dilation) than the linear formula suggests before an output truly sees the whole image.
LeNet → AlexNet → VGG → Inception → ResNet → MobileNet → EfficientNet
| Year | Model | Key contribution |
|---|---|---|
| 1998 | LeNet-5 | first successful CNN (digits, MNIST) |
| 2012 | AlexNet | GPU training, ReLU, dropout, big data (ImageNet) |
| 2014 | VGG | "depth with small kernels" — 3×3 only, stacked |
| 2014 | GoogLeNet | 1×1 bottlenecks, parallel branches (Inception) |
| 2015 | ResNet | skip connections → 152 layers trainable |
| 2017 | MobileNet | depthwise separable convs (edge devices) |
| 2019 | EfficientNet | compound scaling (depth × width × resolution) |
A 1×1 convolution sounds useless · it only sees one pixel! But the power is in the depth dimension.
At each pixel · you have 256 channel values (your "ingredients"). The 1×1 conv learns the best recipes to mix them down into 64 new "flavors" (output channels).
It's an extremely cheap way to · reduce channels (bottleneck) · expand them after a 3×3 · or remix a feature map without spatial mixing.
Used everywhere in modern CNNs and Transformers (the "output projection" of attention is a 1×1 conv applied to the channel axis).
A 1×1 conv = a small linear (FC) layer applied at every pixel independently.
At pixel (i, j): out_c(i, j) = Σ_k W[c, k] · x_k(i, j) + b_c — a dot product over the channel vector, no spatial mixing.
Worked numeric (illustrative values) · 3 channels (RGB) → 2 channels. Pixel value x = (0.5, 0.2, 0.8); weight rows W₁ = (1, 0, −1) and W₂ = (0.5, 0.5, 0.5).
Output at this pixel: (1·0.5 + 0·0.2 − 1·0.8, 0.5·0.5 + 0.5·0.2 + 0.5·0.8) = (−0.3, 0.75).
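Verifying those (illustrative) numbers with a 1×1 convolution on a single pixel:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0.5, 0.2, 0.8]).reshape(1, 3, 1, 1)    # one pixel, 3 channels
W = torch.tensor([[1.0, 0.0, -1.0],
                  [0.5, 0.5, 0.5]]).reshape(2, 3, 1, 1)   # two output "recipes"
print(F.conv2d(x, W).flatten())                           # tensor([-0.3000,  0.7500])
```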
Used everywhere in modern networks · reduce channels before a 3×3 (GoogLeNet bottleneck) · expand after (ResNet bottleneck block).
Input tensor (256, 14, 14) — 256 channels at 14×14 spatial resolution.
Apply Conv2d(256, 64, kernel_size=1) → (64, 14, 14).
What did we just do? Took 256 channels at each spatial position, mixed them linearly to 64 channels. Spatial structure preserved. Channel structure compressed.
A 3×3 conv immediately after now runs 4× cheaper because the depth is 4× smaller. That's the bottleneck trick: sandwich 3×3 convs between 1×1 compressions.
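Shape and cost check of the bottleneck trick (a sketch with the sizes from the example; the follow-up 3×3 keeps 256 output channels, so only its input depth changes):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 14, 14)
reduce = nn.Conv2d(256, 64, kernel_size=1)               # 1x1 compression
print(reduce(x).shape)                                    # torch.Size([1, 64, 14, 14])

count = lambda m: sum(p.numel() for p in m.parameters())
plain   = nn.Conv2d(256, 256, kernel_size=3, padding=1)  # 3x3 on the full depth
cheaper = nn.Conv2d(64, 256, kernel_size=3, padding=1)   # 3x3 after the 1x1 compression
print(count(plain), count(cheaper))                       # 590080 147712 -> ~4x cheaper
```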
Between 2012 and 2014, the field converged on a recipe: stacks of 3×3 conv + ReLU, periodic downsampling (pooling or stride 2), channel count doubled after each downsample, and a classifier head at the end.
By 2015, people tried to go deeper than 25 layers and networks stopped learning. Adding layers made training loss worse — not a generalization issue. This pointed at the optimization problem that ResNet (next lecture) solved with skip connections.
Why convolution beats MLP on images
Locality. Each output sees only a small window. Forces the network to first extract local features (edges, textures) before combining them.
Why correct for images · nearby pixels are semantically related (same edge, same object).
Weight sharing (translation equivariance). Shift input → shift output. The same feature detector runs at every position.
Why correct for images · a cat is a cat whether it's in the corner or centre.
Hierarchy. Stacking convs builds larger receptive fields. Deep networks compose features at each scale.
Why correct for images · visual world is hierarchical (edge → texture → part → object).
Analogy · searching for keys. If you know they're usually by the front door, you don't search the whole house. The right assumptions shrink the search space enormously.
Scenario. One hidden layer. Input: 224×224 RGB image → 224 · 224 · 3 = 150,528 input values.
MLP (fully connected). Hidden layer = 4096 neurons. Every input connects to every neuron: 150,528 × 4,096 ≈ 616M weights.
CNN. 3×3 kernel, 3 input channels, 64 output channels, the same kernel slid over every position: 3 · 3 · 3 · 64 = 1,728 weights.
Difference: 616,000,000 vs. 1,728 — a factor of ~350,000.
This is what locality (small kernel) and weight sharing (sliding the same kernel) buy. The CNN is forced to learn reusable patterns; the MLP must learn every connection from scratch.
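The arithmetic, spelled out (biases ignored):

```python
mlp_params = (224 * 224 * 3) * 4096   # every input pixel connects to every hidden unit
cnn_params = 3 * 3 * 3 * 64           # one shared 3x3 kernel per output channel
print(mlp_params, cnn_params, mlp_params // cnn_params)   # 616562688 1728 356807
```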
Vision Transformers (L18) give up most of this inductive bias — they need pretraining on far more data to compensate.
The bias is a free data multiplier. A CNN at 50k images behaves like a ViT at 500k. If your dataset is small, use a CNN (or start from a pretrained CNN).
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv -> BatchNorm -> ReLU: the repeating unit of a classic CNN."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # "same" padding keeps H x W
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)
Every classic CNN is a stack of these (plus pooling or stride-2 for downsampling). VGG is literally this block repeated.
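For example, a VGG-style stage is just this block repeated, then a downsample (a sketch reusing ConvBlock from above):

```python
import torch
import torch.nn as nn

stage = nn.Sequential(
    ConvBlock(3, 64),
    ConvBlock(64, 64),
    nn.MaxPool2d(kernel_size=2, stride=2),   # halve the spatial size
)
print(stage(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 64, 112, 112])
```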
| Layer | Typical features (for a trained CNN on natural images) |
|---|---|
| Conv1 | oriented edges, colour blobs (~like Gabor filters) |
| Conv2 | junctions, simple textures |
| Conv3 | repeated patterns (fur, grid, stripes) |
| Conv4 | object parts (eyes, wheels) |
| Conv5 | whole objects / object arrangements |
Zeiler & Fergus 2014 · Visualizing and Understanding Convolutional Networks — canonical reference for layer visualization.
L7 covered the "classic CNN" era — the 1998–2014 progression that ended with VGG.
L8 (next lecture) picks up where we paused: Inception modules, ResNet's skip connections, MobileNet's depthwise separable convs, EfficientNet scaling, and transfer learning — the one practical skill you'll use most often.
Q. What's a typical kernel size in 2026?
A. 3×3 everywhere, except the stem layer, which often uses 7×7 (more RF for the first layer). Very occasionally 5×5 for specific blocks.
Q. When do I need big-kernel convs (ConvNeXt 7×7)?
A. When replicating Transformer-style long-range mixing in CNNs. ConvNeXt showed 7×7 depthwise conv can match attention on some tasks.
Q. How do I choose number of channels?
A. Double the channels every time you halve the spatial resolution (32→64→128→256→512). Keeps compute per layer roughly constant (4× more weights per kernel, but 4× fewer positions to apply it). VGG, ResNet, EfficientNet all follow this.
Q. Padding · 'same' vs 'valid'?
A. padding='same' keeps output size = input size (PyTorch accepts the string form only for stride 1). Default for most blocks. 'valid' (no padding) shrinks the map · used when downsampling is the point.