Self-Supervised & Contrastive Learning

Lecture 17 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. State the labeling bottleneck and why self-supervision matters.
  2. Describe pretext tasks and give 3 examples (rotation, jigsaw, colorization).
  3. Write the SimCLR pipeline end-to-end · augmentations, projection head, InfoNCE.
  4. Explain InfoNCE as a soft classification problem.
  5. Contrast SimCLR vs BYOL and articulate why BYOL doesn't collapse.
  6. Describe MAE (masked autoencoding) and when it beats contrastive.
  7. Pick an SSL method for a given dataset scale and downstream task.

Where we are

Everything we've seen used labels — classification, MT, LLM pretraining on curated corpora. But labels are expensive, finite, biased.

Meanwhile, the internet has unlimited unlabeled data. Can we learn from it?

Today maps to UDL Ch 14 (unsupervised / contrastive). Papers: Chen 2020 (SimCLR), Grill 2020 (BYOL), He 2021 (MAE), Oquab 2023 (DINOv2).

Four questions:

  1. What is self-supervised learning, formally?
  2. How does SimCLR use augmentations as supervision?
  3. Why does BYOL work without negatives?
  4. How does MAE (masked autoencoding) compare to contrastive?

The labeling bottleneck · in numbers

Task · Typical labeling cost · Typical dataset size
ImageNet class · 0.5 USD per image (crowdsourced) · 14M images
Detection bbox · 5-20 USD per image · 200k images (COCO)
Segmentation mask · 30-100 USD per image · 10k images
Medical annotation · 50-500 USD per image · ~1-10k images

At 14M × $0.5, ImageNet cost ~$7M to label. Segmentation at that scale would be ~$500M. Labels don't scale.

Meanwhile · Common Crawl has 10⁹+ web pages, Flickr has billions of photos, YouTube has zettabytes of video. All unlabeled.

PART 1

The labeling bottleneck

Why self-supervision scaled

Two facts about modern ML

  1. Labeled data is scarce. ImageNet's 14M labels took thousands of human-hours. Medical imaging datasets struggle to reach 10k labeled cases.

  2. Unlabeled data is free. YouTube receives ~500 hours of new video every minute. The web has exabytes of text, images, video.

Self-supervised learning invents a "label" from the data itself — a surrogate task — and uses it to learn representations that transfer to real (labeled) downstream tasks.

Language already won with SSL — every LLM is pretrained with next-token prediction (which needs no labels). Vision caught up in 2020–2023.

What a surrogate task looks like

Given just an image (no label):

  • Predict the next word (works for text, but images?)
  • Colorize a grayscale image (labels are the original color channels)
  • Predict rotation (0°/90°/180°/270°)
  • Fill in missing patches (MAE)
  • Group augmented versions of the same image (SimCLR, BYOL)

All of these train an encoder to produce useful representations — features that later transfer to classification, detection, segmentation.
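
A minimal sketch of the rotation pretext task in PyTorch (the function name and setup are illustrative, not from the course notebook): every unlabeled image yields four (view, rotation-label) pairs for free, and any encoder plus a 4-way head can be trained on them with plain cross-entropy.

```python
import torch

def rotation_pretext_batch(images: torch.Tensor):
    """images: (B, C, H, W) unlabeled batch -> (4B, C, H, W) views + surrogate labels 0..3."""
    views, labels = [], []
    for k in range(4):                       # 0°, 90°, 180°, 270°
        views.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(views), torch.cat(labels)

# Usage: train any encoder + 4-way linear head with cross-entropy on these pairs;
# the "labels" come from the data itself, so no human annotation is needed.
```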

PART 2

Contrastive learning · SimCLR

Augmentations as implicit labels

The SimCLR framework

SimCLR · batch as an 8×8 matrix

SimCLR · the matching-game analogy

You're given a huge pile of photos. Two of them are your cat (cropped differently, lit differently, color-jittered).

The task · find the two that match. Pick yours out of thousands.

That's contrastive learning. It pulls same-image augmentations together in feature space and pushes everything else apart. The model learns what makes "your cat" your cat — without anyone telling it the label "cat".

How SimCLR works, step-by-step

  1. Sample a minibatch of images.
  2. Apply two different augmentations to each → two views per image (2N views total).
  3. The pair from the same image = positive. All others = negatives.
  4. Pass all through a shared encoder (ResNet / ViT).
  5. Pass through a small projection head (2-layer MLP).
  6. Compute similarity (cosine) in projection space.
  7. Apply NT-Xent loss — pull positive together, push negatives apart.
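
A minimal sketch of one SimCLR training step under these assumptions: `augment` applies the strong stochastic augmentations to a batch of image tensors, `encoder` is a ResNet/ViT returning features h, and `nt_xent` is an NT-Xent/InfoNCE loss (a from-scratch sketch appears after the derivation below). Names are illustrative, not the official SimCLR code.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Step 5: small 2-layer MLP mapping h -> z; thrown away after pretraining."""
    def __init__(self, dim_h=2048, dim_z=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_h, dim_h), nn.ReLU(), nn.Linear(dim_h, dim_z))
    def forward(self, h):
        return self.net(h)

def simclr_step(images, augment, encoder, head, nt_xent, opt):
    v1, v2 = augment(images), augment(images)   # steps 1-2: two views of every image
    z1 = head(encoder(v1))                      # steps 4-5: shared encoder + projection head
    z2 = head(encoder(v2))
    loss = nt_xent(z1, z2)                      # steps 3, 6-7: positives share the batch index
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```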

InfoNCE · derive the loss step by step

For one anchor z_i with positive partner z_j in a batch of 2N views (the other 2N − 2 views are negatives):

  1. Score function · cosine similarity, sim(u, v) = uᵀv / (‖u‖ ‖v‖). Higher = more similar.

  2. Make it a probability problem. "Which of the other views is the true partner of z_i?" Use softmax over scores, with temperature τ:

     p(z_k | z_i) = exp(sim(z_i, z_k)/τ) / Σ_{m≠i} exp(sim(z_i, z_m)/τ)

  3. Cross-entropy with the true partner z_j: ℓ_i = −log p(z_j | z_i).

  4. Substitute to get the full form:

     ℓ_{i,j} = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ) ]

It's standard softmax cross-entropy where the "classes" are batch positions and the label is the positive pair. No human labels needed.
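
A from-scratch NT-Xent sketch matching the derivation above (and what Notebook 17 asks you to implement); `z1[i]` and `z2[i]` are assumed to be the projections of two views of image i.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # 2N unit vectors -> cosine sim = dot product
    sim = z @ z.t() / tau                         # (2N, 2N) scaled similarities
    sim.fill_diagonal_(float('-inf'))             # exclude sim(i, i) from the softmax
    N = z1.size(0)
    # the positive of view i is the other augmentation of the same image
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(0, N)])
    return F.cross_entropy(sim, targets)          # softmax cross-entropy over batch positions
```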

Worked numeric · InfoNCE

Tiny batch of 2 images → 4 views. z₁, z₂ from image A; z₃, z₄ from image B. Take τ = 0.1.

Compute the loss for anchor z₁ (positive · z₂, negatives · z₃, z₄). Suppose the model assigns these cosine similarities:

Step 1 · similarities.

  • sim(z₁, z₂) = 0.9 (positive)
  • sim(z₁, z₃) = 0.1, sim(z₁, z₄) = 0.2 (negatives)

Step 2 · scaled exps (divide by τ = 0.1, then exponentiate).

  • exp(0.9 / 0.1) = exp(9) ≈ 8103 (positive — also in denominator)
  • exp(0.1 / 0.1) = exp(1) ≈ 2.7, exp(0.2 / 0.1) = exp(2) ≈ 7.4 (negatives)

Denominator ≈ 8103 + 2.7 + 7.4 ≈ 8113.

Step 3 · loss.
ℓ = −log(8103 / 8113) ≈ 0.001.

Loss is tiny because the positive's similarity dominates. If the model had assigned similarity 0.2 instead of 0.9 to z₂, the loss would be ~0.9 — gradient kicks in.

InfoNCE · geometrically

InfoNCE in plain English

Read the loss one piece at a time:

  • Numerator, exp(sim(z_i, z_j)/τ) — similarity to the one positive. We want this high.
  • Denominator, Σ_{k≠i} exp(sim(z_i, z_k)/τ) — sum of similarities to all other items in the batch. We want this low (except for the one positive already in the sum).
  • Temperature τ — sharpens or softens the softmax. Smaller τ = sharper contrast, harder negatives matter more.

It's a softmax classification problem · "given z_i, which of the 2N − 1 candidates is its positive partner?" Cross-entropy with 2N − 1 classes, no labels needed — the positive is the other augmentation of the same image.

The crumple-zone projection head

InfoNCE forces the final projected reps z to be identical for two augmentations of the same image. But what if a downstream task needs the info we just discarded (e.g. color matters for ripeness)?

Analogy · car crumple zone.

  • Encoder = passenger cabin. Should produce rich, general features h (color, texture, shape).
  • Projection head = crumple zone. Crushes h into z to satisfy the harsh contrastive task.
  • After pretraining · throw away the crumple zone. Use h for downstream tasks.

This way the encoder doesn't have to delete useful info just to win the contrastive game.

Why SimCLR works

Three ingredients (Chen et al. 2020 ablations):

  1. Strong augmentations — especially color jitter + random crop.
  2. Projection head — throw it away after pretraining; the projection absorbs augmentation invariances.
  3. Large batch size — more negatives → sharper contrast. SimCLR used batch 8192 on a TPU pod.

Pretrained SimCLR features, fine-tuned, match or beat supervised ImageNet on many downstream tasks. Surprising in 2020; foundational today.

Temperature · the volume-knob analogy

Three critics give similarity scores — say [0.9, 0.5, 0.3].

  • High τ (= 1.0) · calm discussion. All voices heard.
  • Low τ (= 0.1) · shouting match. The loudest voice (0.9) drowns out everyone.

In SSL, "hard negatives" are the most informative. Low τ makes the model focus on them.

Temperature · numeric demo

Same scores [0.9, 0.5, 0.3], computed with two temperatures.

τ = 1.0. Logits = scores.

  • exps: e^0.9 ≈ 2.46, e^0.5 ≈ 1.65, e^0.3 ≈ 1.35. Sum ≈ 5.46.
  • Probs: [0.45, 0.30, 0.25]. Negatives still pull weight.

τ = 0.1 (SimCLR's choice). Logits = scores × 10.

  • exps: e^9 ≈ 8103, e^5 ≈ 148, e^3 ≈ 20. Sum ≈ 8272.
  • Probs: [0.98, 0.018, 0.002]. The positive completely dominates → model is forced to make positive-similarity much higher than any negative.

Small τ ⇒ sharper distribution ⇒ hard negatives dominate the gradient. Used in SimCLR (~0.1) and as a learnable scalar in CLIP (Lecture 18).
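
The same demo in a few lines of PyTorch (using the assumed scores [0.9, 0.5, 0.3] from above):

```python
import torch

scores = torch.tensor([0.9, 0.5, 0.3])
for tau in (1.0, 0.1):
    print(tau, torch.softmax(scores / tau, dim=0))
# tau=1.0 -> ~[0.45, 0.30, 0.25]   (negatives still pull weight)
# tau=0.1 -> ~[0.98, 0.018, 0.002] (positive dominates)
```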

Augmentation matters · hugely

Chen et al. swept augmentation pairs. Accuracy (linear-probe on ImageNet):

Augmentation pair Accuracy
crop only 40%
color-jitter only 28%
crop + color-jitter 56%
crop + color + blur 64%

Contrastive learning is as much about what invariances you pick as about the loss. You're telling the model · "ignore crops, ignore color shifts, ignore blur — but pay attention to content." Those choices become the downstream invariances of the representation.
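
For concreteness, roughly the crop + color + blur recipe written with torchvision transforms; the exact parameter values below are illustrative, not the paper's settings:

```python
from torchvision import transforms

simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),        # "ignore crops"
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),  # "ignore color shifts"
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),                    # "ignore blur"
    transforms.ToTensor(),
])
```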

PART 3

BYOL · self-distillation without negatives

Two networks chase each other

BYOL · learning without negatives

SimCLR needed lots of negatives. What if we only pull positives together?

Danger · collapse. If the only force is "pull together," the model can output the same constant vector c for every image → zero loss, useless features.

BYOL is a clever recipe to prevent collapse without negatives, using two asymmetric networks.

BYOL · twin networks (student + teacher)

  • Online network ("student", parameters θ) · trained with SGD. Has an extra predictor head.
  • Target network ("teacher", parameters ξ) · not updated by gradients. Weights are a slow EMA of θ.

Game · online sees view 1, predicts what target outputs for view 2.

Mechanism 1 · EMA target. ξ ← m·ξ + (1 − m)·θ.

Worked numeric (toy scalar, assuming decay m = 0.9) · student weight θ = 1.0 held fixed, init ξ = 0.

  • Step 1 · ξ = 0.9 × 0 + 0.1 × 1.0 = 0.1
  • Step 2 · ξ = 0.9 × 0.1 + 0.1 × 1.0 = 0.19

The teacher trails the student smoothly. The student is chasing a stable, slow-moving version of itself.

Mechanism 2 · stop-gradient on the target. Loss = MSE(online_pred, sg(target)). Gradients flow back through online only. The teacher can't "cheat" by moving its output to match the student.

The asymmetry (predictor + EMA + stop-grad) prevents collapse without needing negatives.
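
A minimal sketch of one BYOL step, assuming `online` and `target` are encoder+projector networks of identical architecture, `predictor` is the extra MLP on the online side, and the target starts as a copy of the online network (e.g. copy.deepcopy). Illustrative, not the official BYOL code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target, online, m=0.99):
    # Mechanism 1: teacher weights trail the student as a slow moving average
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(m).add_(p_o, alpha=1 - m)

def byol_step(v1, v2, online, target, predictor, opt, m=0.99):
    p1, p2 = predictor(online(v1)), predictor(online(v2))
    with torch.no_grad():                          # Mechanism 2: stop-gradient on the teacher
        t1, t2 = target(v1), target(v2)
    # symmetric MSE between normalised predictions and teacher targets
    loss = (F.mse_loss(F.normalize(p1, dim=1), F.normalize(t2, dim=1))
            + F.mse_loss(F.normalize(p2, dim=1), F.normalize(t1, dim=1)))
    opt.zero_grad(); loss.backward(); opt.step()   # gradients flow through the online branch only
    ema_update(target, online, m)
    return loss.item()
```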

Why doesn't BYOL collapse?

Without negatives, the obvious failure mode is a constant output — both networks emit the same vector for every image, and the loss is zero. Why doesn't this happen?

Three forces prevent collapse:

  1. Predictor head introduces asymmetry — online must predict target, not match it directly.
  2. Stop-gradient on the target prevents the target from moving to meet the online's output.
  3. EMA update gives the target a momentum that trails the online; the online can never "catch" its target exactly.

Grill et al. had to run the training for 300 epochs to verify it really didn't collapse — nobody initially believed it.

MoCo, SwAV, and the contrastive zoo

The 2019–2021 era had many flavors:

Method Key idea
MoCo FIFO queue of negatives; momentum encoder
SimCLR large batch negatives; projection head
SwAV online clustering (prototype assignments)
BYOL no negatives; predictor + EMA
SimSiam BYOL minus EMA — even simpler
Barlow Twins decorrelate representations across views

By 2023 the community mostly converged on masked autoencoding (MAE) and self-distillation (DINO).

Linear probe vs fine-tune · the chef-and-knife analogy

We've forged a new chef's knife (the pretrained encoder). How do we test its quality?

  • Linear probe · give the knife to a beginner and ask them to slice a tomato. Their technique is fixed and weak. If the cut is clean → the knife itself must be sharp. Tests the inherent feature quality.
  • Fine-tune · give the knife to a master chef. They use all their expertise to get the best slice. Tests max potential, but a great chef can hide a mediocre knife.
Method · What's measured · What's frozen
Linear probe · inherent feature quality · encoder frozen; only 1-layer classifier trained
Fine-tune · ceiling of the representation · nothing frozen (often low LR for encoder)
k-NN · local feature structure · encoder frozen; no classifier
Few-shot · sample efficiency · encoder frozen; tiny labeled set

Linear probe is the cleanest measure — it isolates the encoder. Fine-tune tests the ceiling but can hide a weak encoder.
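
A minimal linear-probe sketch (encoder, feature dimension, and data loader are placeholders): freeze the pretrained encoder and train only a one-layer classifier on the labeled downstream set.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, feat_dim, num_classes, loader, epochs=10, lr=1e-3):
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False                  # encoder frozen: tests feature quality only
    clf = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                h = encoder(x)                   # frozen features
            loss = nn.functional.cross_entropy(clf(h), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return clf
```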

2026 SSL benchmarks · who wins what

Method + backbone · Linear probe (ImageNet top-1) · Fine-tune (detection mAP)
Supervised ResNet-50 · 76.1 · 38
SimCLR ResNet-50 · 69 · 36
MoCo-v3 ViT-B · 76 · 39
MAE ViT-H · 76 · 54
DINOv2 ViT-L · 86 · 52

Observation · DINOv2 (self-distillation at scale, 142M images) dominates linear probe. MAE wins detection. Supervised is no longer SOTA for any frozen-feature evaluation.

The pattern across all these methods

Zoom out. Every SSL method is a variation on "create a task the model can only solve if it learns features."

  • Predict the future (GPT, word2vec) — temporal / sequential structure.
  • Predict the missing part (BERT, MAE) — contextual structure.
  • Match two views of the same thing (SimCLR, MoCo) — invariance to nuisance.
  • Predict what a teacher thinks (BYOL, DINO, knowledge distillation) — relational structure.

The architecture and loss differ, but the meta-idea is the same · make the data supervise itself.

PART 4

MAE · BERT for pixels

Masked autoencoding for images

MAE · the jigsaw-puzzle analogy

Take a photograph · shred 75% of it · ask someone to reconstruct the missing pieces.

To do this they can't just look at local pixels · they must understand what a face looks like, what a tree branch is shaped like.

MAE forces the encoder to learn this deep visual world model by predicting the missing 75% from the visible 25%. It's BERT-for-pixels · masked-then-reconstruct as a self-supervision recipe.

MAE · full pipeline

MAE · mask 75%, reconstruct the rest

MAE · the asymmetric architecture

Analogy · the expert and the intern.

  • Hire a world-class expert (heavy ViT encoder) to analyse a 100-page document.
  • To save money, only show them 25 pages (the visible patches).
  • Hand the expert's brilliant summary to a cheap intern (lightweight decoder) and ask them to write a plausible version of the full 100 pages (reconstruct masked patches).

Why this is brilliant. The encoder — the expensive part — runs on only 25% of the input → ~4× faster pretraining than processing the full image. The decoder is small and only handles reconstruction.
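
A minimal sketch of the random masking that makes this possible, shapes only; the real MAE implementation (He et al. 2021) also handles positional embeddings, mask tokens for the decoder, and normalised pixel targets:

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (B, N, D) patch embeddings -> visible subset + per-sample shuffle indices."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))           # e.g. 49 of 196 patches for a 224px image
    noise = torch.rand(B, N)                     # random score per patch
    shuffle = noise.argsort(dim=1)               # random permutation per sample
    keep = shuffle[:, :n_keep]                   # indices of the visible patches
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, shuffle                      # the heavy encoder sees only `visible` (25% of tokens)
```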

MAE vs contrastive · who wins what

Contrastive (SimCLR)

  • Needs massive batches for negatives
  • Sensitive to augmentation choice
  • Strong on classification
  • Weak on dense tasks (segmentation)

MAE

  • Any batch size works
  • Minimal augmentation
  • Strong on detection / segmentation
  • Slightly weaker on pure classification

He et al. 2021 · ViT-Huge MAE pretraining → SOTA on many downstream vision tasks.

When should you use SSL?

Scenario Use SSL?
Plenty of labeled data, single task No · just supervised
Large unlabeled pool, small labeled Yes · SSL pretrain + fine-tune
Need generic features for many tasks Yes · start from DINOv2
Need fast deployment on consumer GPU Probably not · use CLIP/DINOv2 frozen
Novel domain (medical, satellite) Yes · in-domain SSL then fine-tune

Rule of thumb (2026) · if labels cost more than compute, use SSL. In most real-world contexts, labels ARE the bottleneck. SSL tilts the equation.

SSL in text · the original success

GPT is self-supervised · next-token prediction on a trillion-token corpus. BERT is self-supervised · masked-LM.

All LLMs are self-supervised models, pretrained without a single human label (before RLHF tuning). The text modality has had SSL baked in since word2vec (2013). Vision caught up only around 2020 (SimCLR, MAE).

Contrast · NLP went straight to SSL because text is abundant and labels are expensive. Vision started supervised because ImageNet was cheap at 1M labels. The convergence · both modalities now use SSL as the foundation.

DINO and DINOv2 · self-distillation at scale

DINO (Caron et al. 2021) applied BYOL-style self-distillation to ViTs — a student network matches an EMA teacher across multiple crops of the same image, with no negatives and no labels. Emergent properties:

  • Zero-shot object segmentation appears in attention maps without any training on masks.
  • Features transfer across domains without fine-tuning.

DINOv2 (Oquab et al. 2023) scaled this up to ViT-g (1B params) on 142M curated images.

DINOv2 features are the de facto general-purpose vision representation in 2026 — ship it for any vision task where you can't afford full fine-tuning.

PART 5

Where self-supervision lives in 2026

The landscape

Modality Dominant SSL approach
Text Next-token prediction (every LLM)
Vision MAE + DINO-style distillation
Speech Wav2Vec 2.0 / HuBERT (masked frame prediction)
Video MAE extended to spacetime patches
Multimodal CLIP-style contrastive image-text (next lecture)

Self-supervision created the foundation model era. Every modality now has its own canonical SSL recipe. Supervised learning survives only at the end of the pipeline — fine-tuning on small labeled data.

Lecture 17 — summary

  • Self-supervision turns unlabeled data into training signal via surrogate tasks.
  • SimCLR · pull two augmentations of the same image together, push all others apart. Needs large batches (for negatives).
  • BYOL · two networks with EMA + stop-gradient; no negatives needed. Still works.
  • MAE · mask 75% of patches; reconstruct; asymmetric encoder-decoder; the 2022 winner.
  • DINO(v2) · self-distillation for ViTs; emergent segmentation in attention; the 2026 general-purpose vision representation.

Read before Lecture 18

Prince Ch 12 §12.5 (ViT) + CLIP paper (Radford 2021).

Next lecture

Vision-Language Models — ViT, CLIP, LLaVA, multimodal LLMs.

Notebook 17 · 17-simclr-mini.ipynb — implement NT-Xent from scratch; pretrain on CIFAR-10; t-SNE the embeddings to see class clustering without labels.