CNN Deep Dive & Classic Architectures

Lecture 7 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. State the output-size formula and compute it.
  2. Explain receptive field and its growth with depth.
  3. Apply the VGG insight · 3 stacked 3×3 ≈ one 7×7.
  4. Describe 1×1 conv as channel mixing / bottleneck.
  5. Name the three inductive biases of conv and when they fail.
  6. Place classic architectures · LeNet → AlexNet → VGG → GoogLeNet.

Recap · where we are

  • Training stack: PyTorch recipe, debug ladder, error analysis (L3).
  • Optimizer: AdamW + warmup + cosine (L4–L5).
  • Regularization: weight decay, aug, Mixup, dropout, BN/LN/RMSNorm (L6).

Today maps to UDL Ch 10 · Convolutional networks (early sections).

ES 654 covered LeNet and CNN basics. We skim those and spend time on receptive fields, the classic architecture progression, and inductive biases.

Four questions

1. What does convolution actually compute? (brisk — covered in the prerequisite ES 654)
  2. How does receptive field grow with depth?
  3. Why did architectures converge on stacked 3×3 convs?
  4. What inductive biases does convolution bake in — and why do they help?

PART 1

Convolution mechanics

A brisk recap with new diagrams

Convolution · the feature-detector view

A convolution kernel is a learned feature detector.

  • Kernel A · responds to vertical edges
  • Kernel B · responds to horizontal edges
  • Kernel C · responds to red-green color transitions

Slide each detector across the entire image · the output map "lights up" wherever its specific feature appears.

In a CNN we don't hand-design these kernels · the network learns the most useful detectors during training. Early layers learn edges and textures; deeper layers compose those into parts and objects.

Convolution — sliding window, shared weights

▶ Interactive: drag the kernel across an image, see the feature map fill in live — convolution-visualizer.

The lawnmower analogy

Your input is a rectangular lawn. The kernel is your lawnmower.

  • Input width — width of the lawn.
  • Kernel size — width of the mower blades.
  • Padding — paved sidewalk added so the mower can start with its centre on the lawn's edge.
  • Stride — step size after each pass: stride 1 is careful, full coverage; stride 2 is fast (half the passes).
  • Output size — how many passes until the lawn is mown.

Building the output-size formula step-by-step

1D toy: an input of W pixels, a kernel of size K, sliding left to right.

  1. No padding, stride 1. The kernel fits at W − K + 1 starting positions → output = W − K + 1.
  2. Add padding P (P zeros on each side). Effective lawn width: W + 2P → output = W + 2P − K + 1.
  3. Add stride S. Available distance for the kernel to roam: W + 2P − K. Number of S-sized steps: floor((W + 2P − K)/S). Plus the starting position:

     output = floor((W + 2P − K)/S) + 1

This is exactly the formula students see in textbooks — but now we know where each term comes from.
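
A minimal sketch of the formula as a helper function (the function name and the two example calls are ours, for illustration):

def conv_output_size(w, k, p=0, s=1):
    """Output width: floor((W + 2P - K) / S) + 1, as derived above."""
    return (w + 2 * p - k) // s + 1

print(conv_output_size(224, 3, p=1, s=1))   # 224: 'same' padding preserves size
print(conv_output_size(224, 7, p=3, s=2))   # 112: a stride-2 stem halves it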

Output-size · worked numeric

VGG's first layer: Conv2d(3, 64, kernel_size=3, stride=1, padding=1) on a 224×224 input → floor((224 + 2·1 − 3)/1) + 1 = 224.

Output is same size as input — this is "same" padding (P = 1 for a 3×3 kernel, with stride 1). Used in nearly every modern CNN block.

ImageNet stem: input (3, 224, 224), apply Conv2d(3, 64, kernel_size=7, stride=2, padding=3):

  floor((224 + 2·3 − 7)/2) + 1 = floor(223/2) + 1 = 112

Output: (64, 112, 112).

Why convolution is cheap. An MLP mapping the same 150,528 inputs to the same 112×112×64 = 802,816 outputs needs ≈ 1.2 × 10¹¹ multiply-adds. The conv needs 802,816 × (7 × 7 × 3) ≈ 1.2 × 10⁸ — ~1000× cheaper.

Parameters are shared across positions, and each output depends on a small region. That's what makes the conv orders of magnitude cheaper than an equivalent MLP — not a marginal win.
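
A quick PyTorch check of both shapes, assuming the layer specs above (a sketch, not part of any particular codebase):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)                                   # one RGB image
vgg_first = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
stem_7x7  = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
print(vgg_first(x).shape)   # torch.Size([1, 64, 224, 224]): 'same' padding
print(stem_7x7(x).shape)    # torch.Size([1, 64, 112, 112]): stride 2 halves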

The four hyperparameters, in one picture

For every conv layer, think in this order:

  1. kernel size — the field of view (3 for modern nets, 7 only at the stem).
  2. stride — how fast the filter steps (1 to preserve, 2 to halve).
  3. padding — how much zero-boundary to add (so output size matches or shrinks predictably).
  4. dilation — gap between kernel taps (used in segmentation to widen RF without more params).

99% of production CNNs only tune (1) kernel size and (2) stride. Segmentation (L9) uses (4) dilation; advanced audio / video models sometimes use (4) too. For images, reach for kernel_size=3, padding=1, stride=1 or 2 — nearly every other choice is a specific research idea.
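
A small check of the dilation claim (the channel count of 64 is arbitrary): a dilated 3×3 covers a 5-pixel span yet has exactly the same number of parameters as a plain 3×3.

import torch.nn as nn

plain   = nn.Conv2d(64, 64, kernel_size=3, padding=1)               # field of view 3
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)   # field of view 5, same params
same = sum(p.numel() for p in plain.parameters()) == sum(p.numel() for p in dilated.parameters())
print(same)   # True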

Max-pool · worked numeric example

Input · a 4 × 4 matrix (window contents listed in the table below).

2×2 max-pool, stride 2 · slide a 2×2 window non-overlapping.

window values max
top-left 1, 2, 4, 6 6
top-right 8, 3, 5, 1 8
bottom-left 9, 7, 5, 3 9
bottom-right 2, 3, 1, 0 3

Output · a 2 × 2 matrix: [[6, 8], [9, 3]].

Halved spatial size · kept the strongest activation per region · gained translation invariance.
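
The same example in PyTorch, assuming one 4×4 arrangement consistent with the window contents in the table (the exact layout inside each window does not change the max):

import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2., 8., 3.],
                  [4., 6., 5., 1.],
                  [9., 7., 2., 3.],
                  [5., 3., 1., 0.]]).reshape(1, 1, 4, 4)
y = F.max_pool2d(x, kernel_size=2, stride=2)
print(y.reshape(2, 2))   # tensor([[6., 8.], [9., 3.]])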

The "where's the cat?" detector

Two kinds of cat detector:

  • Equivariant robot · beeps and points a laser at the cat. Move the cat → laser moves with it. Output changes in step with the input. This is convolution.
  • Invariant robot · just yells "CAT!" if a cat is anywhere. Move the cat → still yells "CAT!". Output doesn't change. This is the goal of a classifier — pooling helps us get there.

Equivariance vs invariance · concretely

Convolution is translation equivariant. "Equivariant" = changes the same way.

  • Shift input by 10 px right → feature map shifts by 10 px right. The same edge detector lights up at every position.

Pooling adds local translation invariance. "Invariant" = doesn't change.

Move the strongest feature within the window → same output. Stack many such layers and the local invariance compounds into global invariance: the final layer cares that a cat is present, not where.

In 2026, stride-2 convs often replace max-pooling entirely (less info loss).
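
A minimal sketch of translation equivariance in code. Circular padding is used so the equality is exact; with ordinary zero padding it holds everywhere except within a kernel's width of the border.

import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular', bias=False)
x = torch.randn(1, 1, 16, 16)

shift = lambda t: torch.roll(t, shifts=5, dims=3)        # shift 5 px to the right
print(torch.allclose(conv(shift(x)), shift(conv(x))))    # True: conv commutes with the shift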

Why pooling works · the invariance argument

Imagine a cat in the corner of an image vs centered. Should the network's answer depend on where the cat is?

  • For classification · no (it's still a cat).
  • For segmentation · yes (we need the pixels back).

Pooling creates a small "I don't care exactly where" window. Stacks of pool+conv compound this: by layer 5, the network cares about the cat's presence, not its exact pixel position.

This is the invariance gradient a classification CNN builds — conv1 nearly equivariant (tells you where), final global pool fully invariant (tells you what).

PART 2

Receptive field

How deep features "see" far-away pixels

RF grows with depth

▶ Interactive: drag depth/kernel/stride and watch RF grow live — receptive-field-grower.

Receptive field grows with depth · picture

One expert vs. three locals

You need to cross a complex city.

  • 7×7 approach · ask one expert for the entire 7-turn route. Complex instruction; many parameters; one shot to process.
  • Stacked 3×3 approach · ask a local for the next 2 turns, walk there, ask another for the next 2, walk there, ask a third. Each instruction is simpler (fewer params per layer); you re-evaluate at each stop (a non-linearity!); you cover the same area.

The 3-layer stack has the same coverage as the 7×7 expert — but more efficient and more "thinking" between hops.

Let's prove it · RF and parameter math

Three stacked 3×3 convs vs. one 7×7 conv (stride 1, C channels in and out).

Receptive field (RF) — trace how far back one output pixel "sees":

  • After conv 1 (3×3): RF = 3.
  • After conv 2: RF grows by K − 1 = 2 → RF = 5.
  • After conv 3: RF = 7. ✓ Same as a single 7×7.

Parameters (formula: K × K × C_in × C_out per layer):

  • One 7×7 conv: 49 C².
  • Three 3×3 convs: 3 × 9 C² = 27 C². 45% cheaper.
  • Plus we get 3 ReLUs instead of 1 → richer function class.

          1× (7×7)   3× (3×3) stacked
RF        7          7
Params    49 C²      27 C²
ReLUs     1          3

VGG (2014) built its whole architecture around this trade.

Worked numeric · counting parameters

Mid-network layer with C = 256 channels.

  • One 7×7 layer: 49 × 256 × 256 = 3,211,264 params.
  • Three stacked 3×3 layers: 3 × 9 × 256 × 256 = 1,769,472 params.

Saving: ~1.5 million parameters per block, plus 2 extra non-linearities. Repeated across many blocks → enormous total saving. This is the core design principle of VGG.
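
Counting the same parameters with PyTorch (bias terms disabled so the totals match the formula exactly; C = 256 as above):

import torch.nn as nn

C = 256
one_7x7   = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
three_3x3 = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False) for _ in range(3)])

p7 = sum(p.numel() for p in one_7x7.parameters())     # 49 * 256 * 256 = 3,211,264
p3 = sum(p.numel() for p in three_3x3.parameters())   # 27 * 256 * 256 = 1,769,472
print(p7, p3, p7 - p3)                                # saving: 1,441,792 params per block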

Receptive field · the chain-reaction view

Analogy · dominoes. Push the last domino — how many dominoes were involved? Each conv layer is like one push. A 3×3 kernel "pushes" a 3×3 region; the next layer's 3×3 pushes a region already pushed by the first. The effect propagates.

For a stack of stride-1 convs, the RF recurrence is:

  RF_l = RF_{l−1} + (K_l − 1), starting from RF_0 = 1.

5-layer CNN, all 3×3, stride 1:

Layer   Calculation   RF
Input   (start)       1
Conv1   1 + 2         3
Conv2   3 + 2         5
Conv3   5 + 2         7
Conv4   7 + 2         9
Conv5   9 + 2         11

Stride. With stride, layer l adds (K_l − 1) multiplied by the product of all strides before it: RF_l = RF_{l−1} + (K_l − 1) · S_1 S_2 ⋯ S_{l−1}. Stride is a multiplier — that's why ResNet-50 reaches RF ≈ 500.
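
The recurrence in code (a small helper of our own; the second call's layer list is an arbitrary example, not a named network):

def receptive_field(kernels, strides):
    """RF_l = RF_{l-1} + (K_l - 1) * (product of strides before layer l)."""
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([3] * 5, [1] * 5))              # 11, matching the table above
print(receptive_field([7, 3, 3, 3], [2, 1, 2, 1]))    # 23: strides multiply later contributions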

What can an 11×11 patch "see"?

The RF size tells us the scale of features a layer can detect.

RF What fits in this window
3×3 a single corner; a fragment of an edge
5×5 a longer edge or a junction
11×11 on a 28×28 MNIST digit, the loop of a "6" or the cross of a "4"
49×49 on a 224×224 ImageNet photo, an entire eye, nose, or small object
~500 the whole image (ResNet-50 final block)

Each layer learns to use this growing context — edges → textures → parts → objects. Deeper nets = larger RF = richer hierarchies.

Effective receptive field

Theoretical RF grows linearly with depth.
Effective RF (Luo et al. 2016) is more Gaussian-shaped — pixels near the centre contribute much more than edge pixels.

Two consequences:

  1. Modern architectures often need dilated convs or attention to get truly global RF.
  2. CNNs and Vision Transformers differ here — ViT has uniform global RF from layer 1 (coming in L18).

PART 3

Classic architecture evolution

LeNet → AlexNet → VGG → Inception → ResNet → MobileNet → EfficientNet

The progression

What each era got right

Year Model Key contribution
1998 LeNet-5 first successful CNN (digits, MNIST)
2012 AlexNet GPU training, ReLU, dropout, big data (ImageNet)
2014 VGG "depth with small kernels" — 3×3 only, stacked
2014 GoogLeNet 1×1 bottlenecks, parallel branches (Inception)
2015 ResNet skip connections → 152 layers trainable
2017 MobileNet depthwise separable convs (edge devices)
2019 EfficientNet compound scaling (depth × width × resolution)

1×1 conv · the recipe-mixer

A 1×1 convolution sounds useless · it only sees one pixel! But the power is in the depth dimension.

At each pixel · you have 256 channel values (your "ingredients"). The 1×1 conv learns the best recipes to mix them down into 64 new "flavors" (output channels).

It's an extremely cheap way to · reduce channels (bottleneck) · expand them after a 3×3 · or remix a feature map without spatial mixing.

Used everywhere in modern CNNs and Transformers (the "output projection" of attention is a 1×1 conv applied to the channel axis).

1×1 conv · the math, with a worked example

A 1×1 conv = a small linear (FC) layer applied at every pixel independently.

At pixel (i, j):

  • Input vector: x ∈ R^{C_in} (the C_in channel values at that pixel).
  • Weight matrix W of shape C_out × C_in.
  • Output: y = W x — a matrix–vector product, repeated once per pixel (H × W times).

Worked numeric · 3 channels (RGB) → 2 channels. Pixel value x = (R, G, B).

  • Recipe 1 (grayscale): y₁ = 0.3 R + 0.6 G + 0.1 B
  • Recipe 2 (red − green): y₂ = R − G

Output at this pixel: (y₁, y₂). The network learns the best recipes during training.

Used everywhere in modern networks · reduce channels before a 3×3 (GoogLeNet bottleneck) · expand channels after one (ResNet bottleneck) · the attention output projection in Transformers.

1×1 conv · worked example

Input tensor (256, 14, 14) — 256 channels at 14×14 spatial resolution.
Apply Conv2d(256, 64, kernel_size=1) → output (64, 14, 14).

  • Parameters: 256 × 64 = 16,384 weights (plus 64 biases).
  • FLOPs: ≈ 16,384 × 14 × 14 ≈ 3.2 million multiply-adds per forward pass.

What did we just do? Took 256 channels at each spatial position, mixed them linearly to 64 channels. Spatial structure preserved. Channel structure compressed.

A 3×3 conv immediately after now runs 4× cheaper because the depth is 4× smaller. That's the bottleneck trick: sandwich 3×3 convs between 1×1 compressions.
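
The bottleneck in PyTorch, a sketch using the shapes from this example:

import torch
import torch.nn as nn

x = torch.randn(1, 256, 14, 14)
squeeze = nn.Conv2d(256, 64, kernel_size=1)             # 1x1: channel mixing only
y = squeeze(x)
print(y.shape)                                          # torch.Size([1, 64, 14, 14])
print(sum(p.numel() for p in squeeze.parameters()))     # 16,448 = 16,384 weights + 64 biases

spatial = nn.Conv2d(64, 64, kernel_size=3, padding=1)   # the 3x3 now runs on 64 channels, not 256
print(spatial(y).shape)                                 # torch.Size([1, 64, 14, 14])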

AlexNet → VGG · the "just add depth" years

Between 2012 and 2014, the field converged on a recipe:

  1. Keep convolution kernels small (3×3) and many.
  2. Stack deeper until you run out of memory or accuracy plateaus.
  3. Use ReLU + dropout + batch-norm to make deeper nets trainable.

By 2015, people tried to go deeper than 25 layers and networks stopped learning. Adding layers made training loss worse — not a generalization issue. This pointed at the optimization problem that ResNet (next lecture) solved with skip connections.

PART 4

Inductive biases

Why convolution beats MLP on images

Three biases baked into convolution

In words · what each bias does

1. Locality

Each output sees only a small window. Forces the network to first extract local features (edges, textures) before combining them.

Why correct for images · nearby pixels are semantically related (same edge, same object).

2. Translation equivariance

Shift input → shift output. The same feature detector runs at every position.

Why correct for images · a cat is a cat whether it's in the corner or centre.

3. Hierarchy of scales

Stacking convs builds larger receptive fields. Deep networks compose features at each scale.

Why correct for images · visual world is hierarchical (edge → texture → part → object).

Assumptions are a shortcut

Analogy · searching for keys.

  • MLP · no idea where they could be. Check every square inch of the house — ceiling, inside the toaster, under the rug. Forever.
  • CNN · apply assumptions:
    1. Locality · "they're near other things." → check surfaces.
    2. Translation invariance · "the concept keys-on-a-table is the same in every room." → use the same search pattern.

The right assumptions shrink the search space enormously.

Parameter math · MLP vs CNN

Scenario. One hidden layer. Input: 224×224 RGB image = 150,528 values.

MLP (fully connected). Hidden layer = 4096 neurons. Every input connects to every neuron: 150,528 × 4096 ≈ 616 million weights.

CNN. 3×3 kernel, 3 input channels, 64 output channels: 3 × 3 × 3 × 64 = 1,728 weights.

Difference: 616,000,000 vs. 1,728 — a factor of ~350,000.

This is what locality (small kernel) and weight sharing (sliding the same kernel) buy. The CNN is forced to learn reusable patterns; the MLP must learn every connection from scratch.
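
The same count via PyTorch modules (the 4096-unit hidden layer and 64 conv channels follow the scenario above; bias terms add a few extra parameters):

import torch.nn as nn

mlp  = nn.Linear(224 * 224 * 3, 4096)
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)

mlp_params  = sum(p.numel() for p in mlp.parameters())    # 616,566,784 (incl. 4,096 biases)
conv_params = sum(p.numel() for p in conv.parameters())   # 1,792 (1,728 weights + 64 biases)
print(mlp_params // conv_params)                          # ~344,000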

Vision Transformers (L18) give up most of this inductive bias — they need pretraining on far more data to compensate.

Inductive bias · the data-efficiency plot

Small data (≤ 10⁴ images)

  • CNN wins — the prior does the heavy lifting.
  • MLP barely learns anything; ViT is worse than CNN.

Huge data (≥ 10⁸ images)

  • ViT matches or beats CNN — enough data to overcome the missing prior.
  • The "ImageNet-21k threshold" from Dosovitskiy 2020.

The bias is a free data multiplier. A CNN at 50k images behaves like a ViT at 500k. If your dataset is small, use a CNN (or start from a pretrained CNN).

A CNN block in PyTorch

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv → BatchNorm → ReLU: the basic unit of most classic CNNs."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # 'same' spatial size
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

Every classic CNN is a stack of these (plus pooling or stride-2 for downsampling). VGG is literally this block repeated.
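
A minimal usage sketch, reusing the imports above (the channel counts and input size are illustrative, not VGG's exact configuration):

stage = nn.Sequential(
    ConvBlock(3, 64),
    ConvBlock(64, 64),
    nn.MaxPool2d(2),                                 # halve the spatial size
)
print(stage(torch.randn(1, 3, 32, 32)).shape)        # torch.Size([1, 64, 16, 16])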

Feature visualization · what each layer learns

Layer Typical features (for a trained CNN on natural images)
Conv1 oriented edges, colour blobs (~like Gabor filters)
Conv2 junctions, simple textures
Conv3 repeated patterns (fur, grid, stripes)
Conv4 object parts (eyes, wheels)
Conv5 whole objects / object arrangements

Zeiler & Fergus 2014 · Visualizing and Understanding Convolutional Networks — canonical reference for layer visualization.

What's next

L7 covered the "classic CNN" era — the 1998–2014 progression that ended with VGG.

L8 (next lecture) picks up where we paused: Inception modules, residual networks (ResNet), MobileNet's depthwise separable convs, EfficientNet scaling, and transfer learning — the one practical skill you'll use most often.

Common questions · FAQ

Q. What's a typical kernel size in 2026?
A. 3×3 almost everywhere; the stem layer sometimes uses 7×7 (more RF for the first layer). Very occasionally 5×5 for specific blocks.

Q. When do I need big-kernel convs (ConvNeXt 7×7)?
A. When replicating Transformer-style long-range mixing in CNNs. ConvNeXt showed 7×7 depthwise conv can match attention on some tasks.

Q. How do I choose number of channels?
A. Double channels every time you halve spatial (32→64→128→256→512). Keeps params-per-layer roughly constant. VGG, ResNet, EfficientNet all follow this.

Q. Padding · 'same' vs 'valid'?
A. padding='same' (PyTorch) keeps output size = input size. Default for most blocks. 'valid' (no padding) shrinks · used when downsampling is the point.

Lecture 7 — summary

  • Convolution = sliding window with shared weights · translation-equivariant · three inductive biases (locality, equivariance, hierarchy of scales).
  • Output size · floor((W + 2P − K)/S) + 1; pad to preserve, stride to downsample.
  • Receptive field grows with depth — 3 stacked 3×3 convs see as far as one 7×7, with fewer params.
  • Stacked 3×3 (VGG) beat single 7×7 — fewer params, more non-linearities.
  • 1×1 convs mix channels — they're everywhere.
  • LeNet → AlexNet → VGG is the classic era; ResNet (next) finally solved depth.

Read before Lecture 8

Prince — Ch 10 (advanced), Ch 11 (residual in CNNs).

Next lecture

Modern CNNs + Transfer Learning — GoogLeNet bottlenecks, ResNet, MobileNet, EfficientNet, fine-tuning pretrained backbones.

Notebook 7 · 07-cnn-from-scratch.ipynb — build a VGG-style mini-CNN for CIFAR-10; print tensor shapes at each layer; compute receptive field per layer.