Input (14, 14, 16) → Output (14, 14, 32), kernel 3×3.
Without bottleneck · one 3×3 conv, 16 → 32 channels: 3·3·16·32 = 4,608 weights, 14·14·4,608 ≈ 0.90 M multiply-adds.
With bottleneck (squeeze to 4 channels) · 1×1 (16 → 4) + 3×3 (4 → 4) + 1×1 (4 → 32): 64 + 144 + 128 = 336 weights.
Total: 14·14·336 ≈ 0.066 M multiply-adds, roughly 13× cheaper for the same input/output shape.
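A quick parameter-count check of the numbers above (a minimal sketch; the layer arrangement is illustrative):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

direct = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False)
bottleneck = nn.Sequential(
    nn.Conv2d(16, 4, kernel_size=1, bias=False),            # squeeze 16 → 4
    nn.Conv2d(4, 4, kernel_size=3, padding=1, bias=False),   # cheap 3×3 in the narrow space
    nn.Conv2d(4, 32, kernel_size=1, bias=False),             # expand 4 → 32
)
print(n_params(direct), n_params(bottleneck))   # 4608 vs 336
```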
1×1 convs appear everywhere in modern architectures · GoogLeNet's dimension reduction · ResNet's bottleneck blocks · Transformer FFN projections.
Skip connections · bottleneck blocks
Analogy. A game of telephone over 100 people. By the end, the message is gibberish. Gradients are the correction message sent backwards: "Hey person 1, you said 'purple monkey' — should've been 'purple donkey'."
By the time that correction reaches person 1, it's diluted to a meaningless whisper. Person 1 can't learn. Vanishing gradient.
A skip connection is a gradient superhighway — a direct, uninterrupted path from the end back to the beginning.
Plain layer. y = f(x), so backprop multiplies by the local gradient dy/dx = f'(x), typically a number smaller than 1.
With 100 layers: 100 of these factors multiply together, and a product of 100 numbers below 1 collapses toward 0.
Residual layer. y = x + f(x).
So: dy/dx = 1 + f'(x).
The +1 is the highway. Even when f'(x) ≈ 0, the gradient through the layer stays ≈ 1.
Upstream gradient flows back through the identity term untouched, no matter how weak f is.
| | Plain | Residual |
|---|---|---|
| One layer | f'(x) | 1 + f'(x) |
| 10 layers | ∏ f'ᵢ, shrinks geometrically | ∏ (1 + f'ᵢ), expands to a sum that always contains the pure-identity term 1 |
Plain → completely vanished after 10 layers. Residual → still strong.
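A minimal numeric sketch of the same effect (the 0.5·tanh "layer" is an arbitrary stand-in whose local gradient is at most 0.5):

```python
import torch

def grad_norm(depth: int, residual: bool) -> float:
    x = torch.randn(16, requires_grad=True)
    h = x
    for _ in range(depth):
        f = torch.tanh(0.5 * h)         # local gradient of this "layer" is at most 0.5
        h = h + f if residual else f    # residual: y = x + f(x) · plain: y = f(x)
    h.sum().backward()
    return x.grad.norm().item()

print(grad_norm(10, residual=False))    # tiny · the plain gradient has vanished
print(grad_norm(10, residual=True))     # order 1 or larger · the identity path keeps it alive
```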
This is why ResNet trains 152 layers easily, while a plain 34-layer network had higher training error than its 18-layer counterpart. The identity path stops vanishing at construction time, not through training luck. Same idea in vector form: y = x + F(x), so the Jacobian is I + ∂F/∂x and the gradient always has the identity route back.
When a new block isn't useful yet, F(x) = 0 is the easiest thing to learn — the identity mapping just passes the input through unchanged.
Adding blocks can only improve or at worst no-op, never hurt. Before ResNet, adding layers to a working network often made it worse (degradation problem). After ResNet, deeper ≥ shallower — you just keep adding.
Same idea shows up as LSTM's cell state (L10) and Transformer's residual stream (L13). Skip connections are the single most load-bearing design in modern deep learning.
The simple skip y = x + F(x) only works when x and F(x) have the same shape.
What if the main path changes the channel count or downsamples spatially?
Analogy · adapter plug. Your wall socket (the skip input) doesn't match your device's plug (the main path's output), so you put a small adapter between them.
When dimensions don't match: replace the identity skip with a 1×1 conv, using the same stride as the main path.
The 1×1 conv maps C_in → C_out and matches channels; its stride matches the spatial size. Now we can add. Everything else stays residual.
```python
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    expansion = 4
    def __init__(self, c_in, c, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c, 1)                    # 1×1 squeeze
        self.conv2 = nn.Conv2d(c, c, 3, stride, padding=1)    # 3×3 in the narrow space
        self.conv3 = nn.Conv2d(c, c * self.expansion, 1)      # 1×1 expand
        self.bn1, self.bn2, self.bn3 = [nn.BatchNorm2d(x) for x in (c, c, c * self.expansion)]
        self.shortcut = nn.Conv2d(c_in, c * self.expansion, 1, stride) if stride != 1 or c_in != c * self.expansion else nn.Identity()  # projection if shapes change

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + self.shortcut(x))
```
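A quick shape check for the block above (dimensions picked arbitrarily):

```python
import torch

block = Bottleneck(c_in=64, c=64, stride=2)   # expands 64 → 256 channels, halves the spatial size
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)                          # torch.Size([1, 256, 28, 28])
```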
| Model | Depth | Params | ImageNet top-1 |
|---|---|---|---|
| ResNet-18 | 18 | 12 M | 69.8% |
| ResNet-34 | 34 | 22 M | 73.3% |
| ResNet-50 | 50 | 26 M | 76.1% |
| ResNet-101 | 101 | 45 M | 77.4% |
| ResNet-152 | 152 | 60 M | 78.3% |
ResNet-50 is the workhorse. Unless you have a specific reason, start there for any CNN task you face in 2026.
Depthwise separable convolutions
A standard convolution does two jobs at once · spatial mixing and channel mixing.
Analogy · making a smoothie. Instead of blending every fruit with every other fruit in one pass, chop each fruit on its own (per-channel spatial work), then do one quick blend across fruits (1×1 channel mixing).
Splitting these two jobs makes the operation dramatically cheaper.
Standard 3×3 conv with C_in input and C_out output channels on an H×W map costs H·W·K²·C_in·C_out multiply-adds (K = 3).
Depthwise-separable splits into: a depthwise 3×3 (one filter per channel, H·W·K²·C_in) followed by a pointwise 1×1 (H·W·C_in·C_out).
Total: H·W·C_in·(K² + C_out), i.e. a saving factor of 1 / (1/C_out + 1/K²).
Numeric (K = 3, C_out = 128): 1 / (1/128 + 1/9) ≈ 8.4.
~8.4× cheaper. Accuracy drop ~1%. Speedup 8–10×. In every mobile model since 2017.
Input (14, 14, 16) → Output (14, 14, 32), kernel 3×3.
Standard. 14·14·3·3·16·32 ≈ 0.90 M multiply-adds.
Depthwise separable. Depthwise 14·14·3·3·16 ≈ 0.028 M + pointwise 14·14·16·32 ≈ 0.100 M ≈ 0.13 M total, about 7× cheaper on this small layer.
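The same split in PyTorch, as a minimal sketch (groups=in_channels is what makes the first conv depthwise):

```python
import torch
import torch.nn as nn

standard = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(16, 16, kernel_size=3, padding=1, groups=16, bias=False),  # depthwise: one 3×3 filter per channel
    nn.Conv2d(16, 32, kernel_size=1, bias=False),                        # pointwise: 1×1 channel mixing
)

x = torch.randn(1, 16, 14, 14)
print(standard(x).shape, separable(x).shape)          # both torch.Size([1, 32, 14, 14])
print(sum(p.numel() for p in standard.parameters()),
      sum(p.numel() for p in separable.parameters())) # 4608 vs 656
```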
A phone has a ~10 W total power budget; a datacenter GPU draws 300–400 W under load. Efficient nets let you run vision models on-device:
| Model | FLOPs | iPhone inference |
|---|---|---|
| ResNet-50 | 4.1 G | ~90 ms |
| MobileNetV2 | 0.3 G | ~15 ms |
| MobileNetV3-S | 0.06 G | ~4 ms |
Real-time on-device tasks (camera AR, live caption, wake-word) need ≤30 ms budget. MobileNet-style splits are the reason such apps exist.
| Model | Year | Key idea |
|---|---|---|
| MobileNet v1 | 2017 | Depthwise separable + width multiplier |
| MobileNet v2 | 2018 | Inverted residuals + linear bottlenecks |
| MobileNet v3 | 2019 | Neural-architecture-search + h-swish |
| EfficientNet-B0 | 2019 | Compound scaling foundation |
Scale depth, width, and resolution together
You can scale up an engine by · making it bigger (depth) · using wider pistons (width) · running on higher-octane fuel (resolution).
Any one alone helps a bit. Doing all three together · in balance · gives the best engine.
EfficientNet's insight · the same is true of neural networks. Depth, width, and input resolution should grow together for a given compute budget · not one at a time.
You have a baseline net (a small car engine). Three knobs to make it more powerful:
The old way · pick one knob, turn it all the way up (VGG → depth · WideResNet → width · ProGAN → resolution).
EfficientNet's idea · turn all three up in balance.
Define a single scaling knob φ and set depth d = α^φ, width w = β^φ, resolution r = γ^φ,
subject to α·β²·γ² ≈ 2 with α, β, γ ≥ 1. FLOPs scale roughly with d·w²·r², so each increment of φ about doubles compute.
For EfficientNet-B0..B7, the paper found roughly α ≈ 1.2, β ≈ 1.1, γ ≈ 1.15 via a small grid search on B0.
Worked numeric · scaling B0 → B2 (φ = 2): depth ×1.2² ≈ 1.44, width ×1.1² ≈ 1.21, resolution ×1.15² ≈ 1.32, total compute ≈ 2² = 4× B0.
A single φ generates the whole B0 → B7 family; the released models round these factors a bit (table below).
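A tiny calculator for the rule above, a sketch using the paper's α, β, γ (pick φ to match your compute budget):

```python
# Compound scaling: depth/width/resolution multipliers for a given phi
alpha, beta, gamma = 1.2, 1.1, 1.15   # EfficientNet's grid-searched constants

def compound_scale(phi: float):
    d, w, r = alpha ** phi, beta ** phi, gamma ** phi
    flops_factor = d * (w ** 2) * (r ** 2)   # ≈ 2**phi, since alpha·beta²·gamma² ≈ 2
    return d, w, r, flops_factor

for phi in (1, 2, 4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth×{d:.2f} width×{w:.2f} res×{r:.2f} flops×{f:.1f}")
```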
| Model | Depth | Width | Res | Params | ImageNet top-1 |
|---|---|---|---|---|---|
| B0 | 1.0 | 1.0 | 224 | 5.3 M | 77.3% |
| B1 | 1.1 | 1.0 | 240 | 7.8 M | 79.2% |
| B3 | 1.4 | 1.2 | 300 | 12 M | 81.6% |
| B5 | 2.2 | 1.6 | 456 | 30 M | 83.6% |
| B7 | 3.1 | 2.0 | 600 | 66 M | 84.3% |
EfficientNet set the accuracy/param Pareto frontier for 2019–2021 before Vision Transformers took over.
The skill you'll use 90% of the time
Start from a model that already knows something relevant, adapt it to your new task with a fraction of the data and compute.
The pretrained network is a learned prior. You're not training from random init — you're fine-tuning a near-working initialization. In most vision tasks this is worth ≈ 1–2 orders of magnitude of training data.
ImageNet pretraining gives you a generic vision stack: edge and color detectors in the early layers, textures and patterns in the middle, object parts near the top; only the task-specific head needs replacing.
Transfer learning rule — more data in your new domain → unfreeze more layers.
Scenario. A botanist gives you 5,000 photos of flowers and wants a 102-class classifier.
Option 1 · train from scratch. Design a ResNet-50, train on 5,000 images.
Problem · 5,000 isn't enough to learn what an edge, texture, or petal even is from random noise. The model overfits badly.
Option 2 · transfer learning. Take a ResNet-50 already trained on ImageNet (1.2M images). It already knows edges, textures, fur, eyes. Adapt this powerful feature extractor to flowers.
Almost always Option 2 wins when labels are scarce. Now · how to adapt?
Jargon unpacked
Freezing a layer · set requires_grad=False on its parameters. The optimizer skips that layer; it's a fixed feature extractor.
| Your data | Recommended recipe | What to freeze |
|---|---|---|
| < 100 labels | Linear probe | everything except head |
| 100 - 1k | Fine-tune top layers | freeze conv1-3 |
| 1k - 10k | Fine-tune whole backbone | nothing (with discriminative LR) |
| 10k - 100k | Fine-tune + LR scheduling | nothing, bigger LR |
| > 100k | Probably train from scratch | — |
Smaller data → more frozen. Larger data → more trainable. If you have 1M labels in your domain, you likely don't need transfer at all (but it rarely hurts as a warm start).
Sometimes transfer learning hurts. Medical MRI → ImageNet-pretrained ResNet. Natural photos teach features that don't transfer to grayscale medical modalities.
Symptoms: the fine-tuned model underperforms, and a from-scratch baseline on the same data matches or beats it.
Fix · try from-scratch, or use domain-specific pretraining (RadImageNet for medical, SatCLIP for satellite). Generic features aren't universal.
When you do unfreeze early layers, they should learn more slowly than late layers — early layers are already good.
```python
# PyTorch — different LR per param group (discriminative learning rates)
import torch
import torchvision.models as M

model = M.resnet50(weights=M.ResNet50_Weights.IMAGENET1K_V2)
params = [
    {"params": model.conv1.parameters(),  "lr": 1e-5},   # stem · slowest (bn1 omitted for brevity)
    {"params": model.layer1.parameters(), "lr": 1e-5},   # early · already good
    {"params": model.layer2.parameters(), "lr": 3e-5},
    {"params": model.layer3.parameters(), "lr": 1e-4},
    {"params": model.layer4.parameters(), "lr": 3e-4},   # late · fast
    {"params": model.fc.parameters(),     "lr": 1e-3},   # new head · fastest
]
opt = torch.optim.AdamW(params, weight_decay=0.01)
```
fastai popularized "1cycle + discriminative LRs" for transfer learning — often the right defaults for small-data fine-tuning.
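A sketch of pairing the param groups above with PyTorch's built-in 1cycle schedule (the epoch and step counts are placeholders, not recommendations):

```python
from torch.optim.lr_scheduler import OneCycleLR

epochs, steps_per_epoch = 10, 200        # placeholder training length
sched = OneCycleLR(
    opt,
    max_lr=[g["lr"] for g in opt.param_groups],  # per-group peaks preserve the discriminative ratios
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
)
# inside the training loop, once per batch:
#   opt.step(); sched.step()
```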
Large domain gap. ImageNet (natural photos) → medical X-rays, satellite imagery, microscopy. Early-layer features may still transfer; late-layer features definitely won't.
Very different image sizes. ImageNet is 224×224; medical imaging may be 1024+. You may need to resize or re-pretrain.
Very small target dataset (≪ 100 examples). Even linear probing won't save you — consider self-supervised pretraining on your domain first (L17).
In those cases — pre-train on a closer domain, or use self-supervised methods (coming in L17).
| Scenario | Backbone | Why |
|---|---|---|
| ImageNet-like natural photos | ResNet-50 or ConvNeXt | strong + well-supported |
| Very small labeled data | CLIP ViT features | cross-domain generalization |
| Edge / real-time | MobileNet-V3 | latency budget |
| High-res medical / satellite | ConvNeXt-large or DINOv2 | captures fine detail |
| Arbitrary RGB, unknown domain | DINOv2 frozen features | best general-purpose SSL |
In 2026, DINOv2 (self-supervised on 142M images) is often the starting point for vision-backbone features — even better than ImageNet-supervised ResNets for downstream tasks.
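A sketch of the frozen-features workflow with DINOv2 via torch.hub (the model name and output size below are for the ViT-B/14 variant; verify the details against the DINOv2 repo):

```python
import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False              # frozen feature extractor

x = torch.randn(1, 3, 224, 224)          # side length must be a multiple of the 14-pixel patch
with torch.no_grad():
    feats = backbone(x)                  # CLS embedding, roughly [1, 768] for ViT-B/14
# train a lightweight head (linear probe, k-NN, small MLP) on these features
```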
```python
import torch, torch.nn as nn
import torchvision.models as M

# 1. Load pretrained weights
model = M.resnet50(weights=M.ResNet50_Weights.IMAGENET1K_V2)

# 2. Freeze everything
for p in model.parameters():
    p.requires_grad = False

# 3. Replace the 1000-way classifier with N-way
n_classes = 10
model.fc = nn.Linear(model.fc.in_features, n_classes)  # new, trainable by default

# 4. Train only the new fc
opt = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)

# 5. Later, unfreeze progressively
for p in model.layer4.parameters():
    p.requires_grad = True
```
The timm ecosystem
For 2026, stop hand-rolling architectures:
```python
import timm

# 500+ pretrained vision models in one line
model = timm.create_model('resnet50', pretrained=True, num_classes=10)
model = timm.create_model('efficientnet_b3', pretrained=True, num_classes=10)
model = timm.create_model('convnext_base', pretrained=True, num_classes=10)
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)
```
timm by Ross Wightman · the de facto vision model zoo. Every competitive Kaggle vision solution uses it.
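One more timm convenience worth knowing: num_classes=0 strips the classifier and returns pooled backbone features, which pairs naturally with the frozen-feature recipes above (a minimal sketch; the feature width shown is specific to convnext_base):

```python
import torch, timm

backbone = timm.create_model('convnext_base', pretrained=True, num_classes=0)
backbone.eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    feats = backbone(x)          # pooled features, e.g. [1, 1024] for convnext_base
print(feats.shape)
```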