Modern CNNs & Transfer Learning

Lecture 8 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. Explain the Inception multi-branch idea and 1×1 bottlenecks.
  2. Describe ResNet basic / bottleneck blocks with skip connections.
  3. Apply depthwise separable convolutions (MobileNet) for efficiency.
  4. Use EfficientNet's compound scaling rule.
  5. Execute the three transfer-learning recipes correctly.
  6. Pick a backbone from timm for a given dataset size.

Recap · where we are

  • Classic CNN era (L7): LeNet, AlexNet, VGG — stacked 3×3 convs work.
  • Receptive field grows with depth.
  • Inductive biases — sparse connectivity, weight sharing, equivariance.

Today maps to UDL Ch 10 (advanced sections) and Ch 11 (residual / skip connections in CNNs).

Two halves today:

  1. The architectures that came after VGG — Inception, ResNet, MobileNet, EfficientNet.
  2. Transfer learning — the single most practical CNN skill.

PART 1

Inception · parallel kernels

Let SGD pick the right receptive field

Inception · the buffet idea

VGG taught us · stacking 3×3 convs is powerful. But every layer has the same fixed view of the world.

What if some features are tiny (need 1×1) and others are large (need 5×5)?

Inception's answer · don't pick · offer a buffet of kernel sizes (1×1, 3×3, 5×5, pool) at every layer · let SGD pick the right ratio.

The buffet is concatenated along the channel axis · the next layer sees all kernel sizes' outputs simultaneously.

The Inception module (Szegedy 2014)
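
A minimal sketch of the module in PyTorch (channel counts are illustrative; BatchNorm and ReLUs omitted for brevity):

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Four parallel branches, concatenated along the channel axis."""
    def __init__(self, c_in):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, 64, 1)                               # 1×1
        self.b2 = nn.Sequential(nn.Conv2d(c_in, 96, 1),                # 1×1 reduce
                                nn.Conv2d(96, 128, 3, padding=1))      # then 3×3
        self.b3 = nn.Sequential(nn.Conv2d(c_in, 16, 1),                # 1×1 reduce
                                nn.Conv2d(16, 32, 5, padding=2))       # then 5×5
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),  # pool
                                nn.Conv2d(c_in, 32, 1))                # then 1×1
    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock(192)(x).shape)   # torch.Size([1, 256, 28, 28]) · 64 + 128 + 32 + 32 channels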

Why 1×1 convolutions matter · derivation

Conv-layer cost formula: params = K × K × C_in × C_out (ignoring bias).

Direct 3×3, 256 → 256: 3 × 3 × 256 × 256 = 589,824 params.

1×1 bottleneck. Three steps:

  1. Squeeze (1×1, 256 → 64): 256 × 64 = 16,384.
  2. Spatial mix (3×3, 64 → 64): 3 × 3 × 64 × 64 = 36,864.
  3. Expand (1×1, 64 → 256): 64 × 256 = 16,384.

Total: 16,384 + 36,864 + 16,384 = 69,632 params.

Ratio: 589,824 / 69,632 ≈ 8.5× cheaper. Same input/output shape, roughly 88% fewer parameters.
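
A quick check of these counts in PyTorch (bias=False so only kernel weights are counted; n_params is a small helper defined here):

import torch.nn as nn

def n_params(m):                                     # total learnable parameters in a module
    return sum(p.numel() for p in m.parameters())

direct = nn.Conv2d(256, 256, 3, padding=1, bias=False)
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1, bias=False),               # squeeze
    nn.Conv2d(64, 64, 3, padding=1, bias=False),     # spatial mix
    nn.Conv2d(64, 256, 1, bias=False),               # expand
)
print(n_params(direct), n_params(bottleneck))        # 589824 69632 → ~8.5×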

Worked numeric · 1×1 bottleneck on small numbers

Input (14, 14, 16) → Output (14, 14, 32), kernel 3×3.

Without bottleneck. 3 × 3 × 16 × 32 = 4,608 params.

With bottleneck (squeeze to 4 channels).

  1. Squeeze (1×1, 16 → 4): 16 × 4 = 64.
  2. 3×3 (4 → 4): 3 × 3 × 4 × 4 = 144.
  3. Expand (1×1, 4 → 32): 4 × 32 = 128.

Total: 64 + 144 + 128 = 336 params. ~13.7× cheaper (4,608 / 336) for the same external shape.

1×1 convs appear everywhere modern · GoogLeNet reduction · ResNet bottleneck · Transformer FFN projections.

PART 2

ResNet in CNNs

Skip connections · bottleneck blocks

Vanishing gradients · the telephone game

Analogy. A game of telephone over 100 people. By the end, the message is gibberish. Gradients are the correction message sent backwards: "Hey person 1, you said 'purple monkey' — should've been 'purple donkey'."

By the time that correction reaches person 1, it's diluted to a meaningless whisper. Person 1 can't learn. Vanishing gradient.

A skip connection is a gradient superhighway — a direct, uninterrupted path from the end back to the beginning.

Skip connections · derive the gradient

Plain layer. h_{l+1} = f(h_l). Chain rule:

∂L/∂h_l = ∂L/∂h_{l+1} · ∂f(h_l)/∂h_l

With 100 layers: 100 of these factors multiplied together. If each ∂f/∂h is below 1 (say 0.9, then 0.9^100 ≈ 3 × 10⁻⁵), the product collapses to 0. Vanishing.

Residual layer. h_{l+1} = h_l + f(h_l).

So:

∂h_{l+1}/∂h_l = 1 + ∂f(h_l)/∂h_l

The +1 is the highway. Even when ∂f/∂h is near zero, the gradient still flows through.

Worked numeric · gradient flow

Upstream gradient = 1.0. Tiny weights → per-layer ∂f/∂h ≈ 0.1.

            Plain              Residual
One layer   0.1                1 + 0.1 = 1.1
10 layers   0.1^10 = 10⁻¹⁰     1.1^10 ≈ 2.6

Plain → completely vanished after 10 layers. Residual → still strong.

This is why ResNet trains 152 layers easily, while a plain 34-layer net couldn't even fit the training data. The identity path stops vanishing at construction time, not through training luck. Same idea in vector form: each layer's Jacobian is I + ∂f/∂h, so the product of Jacobians never collapses to zero.
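
A small experiment to see this in code (a sketch · layer width, depth, and the 0.01 weight scale are arbitrary choices): compare the gradient that reaches the input through 10 plain vs 10 residual layers.

import torch
import torch.nn as nn

def input_grad_norm(residual, depth=10, width=64):
    layers = [nn.Linear(width, width) for _ in range(depth)]
    for l in layers:                          # deliberately tiny weights → small per-layer Jacobian
        nn.init.normal_(l.weight, std=0.01)
        nn.init.zeros_(l.bias)
    x = torch.randn(1, width, requires_grad=True)
    h = x
    for l in layers:
        fx = torch.tanh(l(h))
        h = h + fx if residual else fx        # residual: identity path + f(h) · plain: f(h) only
    h.sum().backward()
    return x.grad.norm().item()

print(input_grad_norm(residual=False))   # vanishingly small · shrinks exponentially with depth
print(input_grad_norm(residual=True))    # stays large · the identity path carries the gradient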

Why skip connections also help forward pass

When a new block isn't useful yet, residual = 0 is the easiest thing to learn — the identity mapping just passes through.

Adding blocks can only improve or at worst no-op, never hurt. Before ResNet, adding layers to a working network often made it worse (degradation problem). After ResNet, deeper ≥ shallower — you just keep adding.

Same idea shows up as LSTM's cell state (L10) and Transformer's residual stream (L13). Skip connections are the single most load-bearing design in modern deep learning.

The ResNet-CNN block

Basic vs bottleneck · annotated

When the shortcut shape doesn't match

The simple skip h = x + f(x) only works if x and f(x) have the same shape.

What if the main path:

  1. Downsamples with stride 2? f(x) is half the spatial size of x. Can't add 14×14 to 28×28.
  2. Changes channel depth? f(x) has 512 channels, x has 256. Can't add.

Analogy · adapter plug. Your wall socket (the main-path output f(x)) has one shape; the new appliance plug (the skip input x) is different. You need an adapter (the projection W_s). The projection shortcut is a learnable adapter for the skip connection.

Projection shortcuts · the math

When dimensions don't match, project the skip:

h = f(x) + W_s x

W_s is a 1×1 conv with the same stride as the main branch:

  • Same stride (e.g. 2) → matches spatial size.
  • Same C_out → matches channels.

Now we can add. Everything else stays residual.

import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    expansion = 4
    def __init__(self, c_in, c, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c, 1)                      # 1×1 squeeze
        self.conv2 = nn.Conv2d(c, c, 3, stride, padding=1)      # 3×3 spatial mix
        self.conv3 = nn.Conv2d(c, c * self.expansion, 1)        # 1×1 expand
        self.bn1, self.bn2, self.bn3 = [nn.BatchNorm2d(x) for x in (c, c, c*self.expansion)]
        self.shortcut = nn.Conv2d(c_in, c*self.expansion, 1, stride) if stride != 1 or c_in != c*self.expansion else nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + self.shortcut(x))
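
Quick shape check (using the forward pass above): with stride 2 the projection shortcut halves the spatial size and expands the channels, so the two branches still add.

import torch
block = Bottleneck(c_in=256, c=128, stride=2)
print(block(torch.randn(1, 256, 56, 56)).shape)   # torch.Size([1, 512, 28, 28])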

The ResNet family by depth

Model Depth Params ImageNet top-1
ResNet-18 18 12 M 69.8%
ResNet-34 34 22 M 73.3%
ResNet-50 50 26 M 76.1%
ResNet-101 101 45 M 77.4%
ResNet-152 152 60 M 78.3%

ResNet-50 is the workhorse. Unless you have a specific reason, start there for any CNN task you face in 2026.

PART 3

MobileNet · efficient CNNs

Depthwise separable convolutions

Depthwise separable · split the work

Depthwise separable · the smoothie analogy

A standard convolution does two jobs at once · spatial mixing and channel mixing.

Analogy · making a smoothie.

  • Standard conv · one giant blender. Throw all fruits (channels) in at once → mixes them spatially and combines flavours simultaneously.
  • Depthwise separable · two-step process.
    1. Depthwise (spatial): small blenders, one per fruit. Only blends each fruit spatially — never mixes flavours.
    2. Pointwise (1×1): the chef takes one spoonful from each puree and combines flavours into the final smoothie.

Splitting these two jobs makes the operation dramatically cheaper.

Depthwise separable · param math

Standard 3×3 conv with C_in input and C_out output channels: 3 × 3 × C_in × C_out params.

Depthwise-separable splits into:

  1. Depthwise · one 3×3 filter per input channel: 3 × 3 × C_in.
  2. Pointwise · 1×1 conv mixing the C_in channels into C_out new ones: C_in × C_out.

Total: 9·C_in + C_in·C_out.

Numeric (C_in = C_out = 128):

  • Standard: 9 × 128 × 128 = 147,456.
  • Depthwise: 9 × 128 = 1,152. Pointwise: 128 × 128 = 16,384. Total: 17,536.

~8.4× cheaper. Accuracy drop ~1%. Speedup 8–10×. In every mobile model since 2017.

Worked numeric · depthwise separable

Input (14, 14, 16) → Output (14, 14, 32), kernel 3×3.

Standard. 3 × 3 × 16 × 32 = 4,608 params.

Depthwise separable.

  1. Depthwise. One 3×3 filter per channel: 3 × 3 × 16 = 144. Output (14, 14, 16).
  2. Pointwise. 1×1 mixing 16 → 32: 16 × 32 = 512.
  • Total: 144 + 512 = 656 params. ~7× fewer (4,608 / 656), same I/O shape.
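
The same counts in PyTorch, using groups=c_in for the depthwise step (bias=False so only kernel weights are counted):

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv2d(16, 32, 3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(16, 16, 3, padding=1, groups=16, bias=False),   # depthwise: one 3×3 filter per channel
    nn.Conv2d(16, 32, 1, bias=False),                          # pointwise: 1×1 channel mixing
)
print(n_params(standard), n_params(separable))                 # 4608 656 → ~7×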

Why edge models care about FLOPs

A phone has a total power budget of ~10 W; a datacenter GPU draws ~300 W under load. Efficient nets let you run:

Model FLOPs iPhone inference
ResNet-50 4.1 G ~90 ms
MobileNetV2 0.3 G ~15 ms
MobileNetV3-S 0.06 G ~4 ms

Real-time on-device tasks (camera AR, live caption, wake-word) need ≤30 ms budget. MobileNet-style splits are the reason such apps exist.

MobileNet variants · a decade of improvements

Model Year Key idea
MobileNet v1 2017 Depthwise separable + width multiplier
MobileNet v2 2018 Inverted residuals + linear bottlenecks
MobileNet v3 2019 Neural-architecture-search + h-swish
EfficientNet-B0 2019 Compound scaling foundation

PART 4

EfficientNet · compound scaling

Scale depth, width, and resolution together

Compound scaling · tuning a car engine

You can scale up an engine by · making it bigger (depth) · using wider pistons (width) · running on higher-octane fuel (resolution).

Any one alone helps a bit. Doing all three together · in balance · gives the best engine.

EfficientNet's insight · the same is true of neural networks. Depth, width, and input resolution should grow together for a given compute budget · not one at a time.

How do we make a network "bigger"?

You have a baseline net (a small car engine). Three knobs to make it more powerful:

  1. Depth · add more layers (more cylinders).
  2. Width · more channels per layer (wider cylinders).
  3. Resolution · feed it bigger images (higher-octane fuel).

The old way · pick one knob, turn it all the way up (VGG → depth · WideResNet → width · ProGAN → resolution).
EfficientNet's idea · turn all three up in balance.

Compound scaling · the rule

Define a single scaling knob φ. Choose constants α, β, γ once via grid search.

depth d = α^φ,   width w = β^φ,   resolution r = γ^φ

subject to α · β² · γ² ≈ 2 (so each +1 in φ roughly doubles compute).

For EfficientNet-B0..B7, the paper found roughly α ≈ 1.2, β ≈ 1.1, γ ≈ 1.15.

Worked numeric · scaling B0 with φ = 2.

  • Depth: 1.2² = 1.44 → ~44% deeper.
  • Width: 1.1² = 1.21 → ~21% more channels.
  • Resolution: 1.15² ≈ 1.32 → 224 × 1.32 ≈ 296 px (rounded to 300 for B3).

A single φ gives a principled way to scale the whole architecture instead of guessing.
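
A tiny helper that turns φ into the three multipliers (the α, β, γ defaults are the paper's reported values):

def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Depth / width / resolution multipliers for scaling exponent phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

d, w, r = compound_scale(phi=2)
print(f"depth ×{d:.2f}  width ×{w:.2f}  resolution ×{r:.2f}")
# depth ×1.44  width ×1.21  resolution ×1.32 → 224 px becomes ≈296 px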

EfficientNet scale-up at a glance

Model Depth Width Res Params ImageNet top-1
B0 1.0 1.0 224 5.3 M 77.3%
B1 1.1 1.0 240 7.8 M 79.2%
B3 1.4 1.2 300 12 M 81.6%
B5 1.6 1.6 456 30 M 83.6%
B7 2.0 2.0 600 66 M 84.3%

EfficientNet set the accuracy/param Pareto frontier for 2019–2021 before Vision Transformers took over.

PART 5

Transfer learning

The skill you'll use 90% of the time

The premise

What transfer learning is, in one line

Start from a model that already knows something relevant, adapt it to your new task with a fraction of the data and compute.

The pretrained network is a learned prior. You're not training from random init — you're fine-tuning a near-working initialization. In most vision tasks this is worth ≈ 1–2 orders of magnitude of training data.

The premise

ImageNet pretraining gives you a generic vision stack:

  • Early layers detect edges, textures — universal, works for any image domain.
  • Mid layers detect parts — mostly transferable across domains.
  • Late layers detect ImageNet-specific object categories — domain-specific, usually replaced.

Transfer learning rule — more data in your new domain → unfreeze more layers.

Transfer learning · the core problem

Scenario. A botanist gives you 5,000 photos of flowers and wants a 102-class classifier.

Option 1 · train from scratch. Design a ResNet-50, train on 5,000 images.
Problem · 5,000 isn't enough to learn what an edge, texture, or petal even is from random noise. The model overfits badly.

Option 2 · transfer learning. Take a ResNet-50 already trained on ImageNet (1.2M images). It already knows edges, textures, fur, eyes. Adapt this powerful feature extractor to flowers.

Almost always Option 2 wins when labels are scarce. Now · how to adapt?

Jargon unpacked

  • Backbone · the conv body of the pretrained net. The "feature extractor."
  • Head · the final classifier layers. We discard the original ImageNet head (1000 classes) and add our own (e.g. 102 flower classes).
  • Freezing · setting requires_grad=False. Optimizer skips that layer; it's a fixed feature extractor.

Three recipes

Transfer learning · by data size

Your data Recommended recipe What to freeze
< 100 labels Linear probe everything except head
100 - 1k Fine-tune top layers freeze conv1-3
1k - 10k Fine-tune whole backbone nothing (with discriminative LR)
10k - 100k Fine-tune + LR scheduling nothing, bigger LR
> 100k Probably train from scratch

Smaller data → more frozen. Larger data → more trainable. If you have 1M labels in your domain, you likely don't need transfer at all (but it rarely hurts as a warm start).

The bitter truth · negative transfer

Sometimes transfer learning hurts. Medical MRI → ImageNet-pretrained ResNet. Natural photos teach features that don't transfer to grayscale medical modalities.

Symptoms:

  • Val loss starts higher with pretraining than from scratch
  • Fine-tuned performance plateaus
  • Early layers still detect ImageNet-like edges; later layers never adapt

Fix · try from-scratch, or use domain-specific pretraining (RadImageNet for medical, SatCLIP for satellite). Generic features aren't universal.

Discriminative (layer-wise) learning rates

When you do unfreeze early layers, they should learn more slowly than late layers — early layers are already good.

# PyTorch — different LR per param group (torchvision ResNet-50 layer names)
params = [
    {"params": model.conv1.parameters(),  "lr": 1e-5},  # earliest · slowest
    {"params": model.layer1.parameters(), "lr": 1e-5},
    {"params": model.layer2.parameters(), "lr": 3e-5},
    {"params": model.layer3.parameters(), "lr": 1e-4},
    {"params": model.layer4.parameters(), "lr": 3e-4},  # late     · faster
    {"params": model.fc.parameters(),     "lr": 1e-3},  # new head · fastest
]
opt = torch.optim.AdamW(params, weight_decay=0.01)

fastai popularized "1cycle + discriminative LRs" for transfer learning — often the right defaults for small-data fine-tuning.

When transfer learning fails

Large domain gap. ImageNet (natural photos) → medical X-rays, satellite imagery, microscopy. Early-layer features may still transfer; late-layer features definitely won't.

Very different image sizes. ImageNet is 224×224; medical imaging may be 1024+. You may need to resize or re-pretrain.

Very small target dataset (≪ 100 examples). Even linear probing won't save you — consider self-supervised pretraining on your domain first (L17).

In those cases — pre-train on a closer domain, or use self-supervised methods (coming in L17).

Pick-the-right-backbone table

Scenario Backbone Why
ImageNet-like natural photos ResNet-50 or ConvNeXt strong + well-supported
Very small labeled data CLIP ViT features cross-domain generalization
Edge / real-time MobileNet-V3 latency budget
High-res medical / satellite ConvNeXt-large or DINOv2 captures fine detail
Arbitrary RGB, unknown domain DINOv2 frozen features best general-purpose SSL

In 2026, DINOv2 (self-supervised on 142M images) is often the starting point for vision-backbone features — even better than ImageNet-supervised ResNets for downstream tasks.

Loading a pretrained backbone · PyTorch

import torch, torch.nn as nn
import torchvision.models as M

# 1. Load pretrained weights
model = M.resnet50(weights=M.ResNet50_Weights.IMAGENET1K_V2)

# 2. Freeze everything
for p in model.parameters():
    p.requires_grad = False

# 3. Replace the 1000-way classifier with N-way
n_classes = 10
model.fc = nn.Linear(model.fc.in_features, n_classes)   # new, trainable by default

# 4. Train only the new fc
opt = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)

# 5. Later, unfreeze progressively
for p in model.layer4.parameters(): p.requires_grad = True
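
A quick sanity check after freezing (exact counts depend on the torchvision version): before step 5, only the new fc should show up as trainable.

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,}")   # a small fraction until layer4 is unfrozen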

The timm ecosystem

For 2026, stop hand-rolling architectures:

import timm

# 500+ pretrained vision models in one line
model = timm.create_model('resnet50',        pretrained=True, num_classes=10)
model = timm.create_model('efficientnet_b3', pretrained=True, num_classes=10)
model = timm.create_model('convnext_base',   pretrained=True, num_classes=10)
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)

timm by Ross Wightman · the de facto vision model zoo. Every competitive Kaggle vision solution uses it.

Lecture 8 — summary

  • Inception module — parallel branches of 1×1, 3×3, 5×5, pool; let SGD pick the receptive field.
  • 1×1 convolutions — channel mixing + cheap bottlenecks; in every modern architecture.
  • ResNet bottleneck — 1×1 → 3×3 → 1×1 + skip; 17× fewer params per block than basic residual.
  • Depthwise separable — split spatial from channel mixing; MobileNet's foundation.
  • Compound scaling (EfficientNet) — scale depth × width × resolution together.
  • Transfer learning recipes — feature-extract · fine-tune top · fine-tune all. Match to data size.
  • Practically · start from timm.create_model('resnet50', pretrained=True) and go from there.

Read before Lecture 9

Bishop Ch 10 + CS231n OD notes (UDL doesn't cover detection/segmentation).

Next lecture

Detection & Segmentation — R-CNN → Faster R-CNN, YOLO, IoU, NMS, U-Net, Mask R-CNN, zero-shot SAM.

Notebook 8 · 08-transfer-learning.ipynb — fine-tune ResNet-50 on Flowers-102 with discriminative LRs, measure effect of freezing depth.