For one anchor view $z_i$:
Score function · cosine similarity, $\mathrm{sim}(z_i, z_k) = \frac{z_i^\top z_k}{\|z_i\|\,\|z_k\|}$. Higher = more similar.
Make it a probability problem. "Which of the other views in the batch is my positive partner?" Softmax over temperature-scaled similarities: $p(k \mid i) = \frac{\exp(\mathrm{sim}(z_i, z_k)/\tau)}{\sum_{k' \neq i} \exp(\mathrm{sim}(z_i, z_{k'})/\tau)}$.
Cross-entropy with the true partner $z_j$: $\ell_i = -\log p(j \mid i)$.
Substitute to get the full form:
$$\ell_i = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$
It's standard softmax cross-entropy where the "classes" are batch positions and the label is the positive pair. No human labels needed.
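A minimal NumPy sketch of this loss. The batch layout (adjacent rows are the two views of one image) and the temperature value are assumptions for illustration, not SimCLR's exact recipe.

```python
import numpy as np

def info_nce_loss(z, temperature=0.5):
    """InfoNCE over a batch of 2N embeddings.

    Assumed layout: z[2k] and z[2k+1] are the two views of image k,
    so the positive partner of row i is row i ^ 1.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine sim = dot product after normalizing
    sim = z @ z.T / temperature                        # pairwise scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # never contrast a view with itself
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))  # row-wise log-softmax
    positives = np.arange(len(z)) ^ 1                  # index of each row's positive partner
    return -log_prob[np.arange(len(z)), positives].mean()

# toy batch: 2 images x 2 views = 4 embeddings
rng = np.random.default_rng(0)
print(info_nce_loss(rng.normal(size=(4, 16))))
```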
Tiny batch of 2 images → 4 views.
Compute the loss for one anchor view, say the first view of image 1.
Step 1 · similarities between the anchor and every other view in the batch (1 positive, 2 negatives).
Step 2 · scaled exps, $\exp(\mathrm{sim}/\tau)$ for each candidate.
Denominator · the sum of those scaled exps.
Step 3 · loss, $-\log(\text{positive's scaled exp} / \text{denominator})$.
Loss is tiny because the positive's similarity dominates. If the model had assigned similarity 0.2 instead of 0.9 to the positive, the loss would be much larger.
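The same arithmetic in code. The positive scores 0.9 as in the text; the two negative scores and the temperature are illustrative guesses.

```python
import numpy as np

tau = 0.5                                    # assumed temperature
def anchor_loss(sims):
    """sims: positive first, then negatives."""
    e = np.exp(sims / tau)                   # step 2: scaled exps
    return -np.log(e[0] / e.sum())           # step 3: -log(positive / denominator)

print(anchor_loss(np.array([0.9, 0.1, 0.05])))   # small: positive dominates
print(anchor_loss(np.array([0.2, 0.1, 0.05])))   # much larger once the positive drops to 0.2
```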
Read the loss
It's a softmax classification problem · "given this anchor, which of the other views in the batch is its augmented partner?" The encoder wins only by making the two views of the same image more similar to each other than to everything else.
InfoNCE forces the final reps to be identical for two augmentations of the same image. But what if a downstream task needs the info we just discarded (e.g. color matters for ripeness)?
Analogy · car crumple zone. SimCLR puts a small projection head $g(\cdot)$ on top of the encoder and applies InfoNCE to $z = g(h)$, not to $h$ itself. The head is the crumple zone: it absorbs the invariance pressure and is thrown away after pretraining, while the encoder output $h$ stays richer.
This way the encoder doesn't have to delete useful info just to win the contrastive game.
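A sketch of that wiring, assuming a SimCLR-style nonlinear head and illustrative dimensions: the InfoNCE loss only ever sees z, while downstream tasks read h.

```python
import torch.nn as nn

class SimCLRModel(nn.Module):
    def __init__(self, encoder, feat_dim=2048, proj_dim=128):
        super().__init__()
        self.encoder = encoder                    # backbone, e.g. a ResNet trunk
        self.projector = nn.Sequential(           # the "crumple zone": discarded after pretraining
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.encoder(x)        # representation kept for downstream tasks
        z = self.projector(h)      # representation fed to the InfoNCE loss
        return h, z
```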
Three ingredients (Chen et al. 2020 ablations):
1. Strong, composed augmentations (crop + color jitter + blur) to define the positive pairs.
2. A nonlinear projection head between the encoder and the loss.
3. Large batches of in-batch negatives with the temperature-scaled InfoNCE loss, trained long.
Pretrained SimCLR features, fine-tuned, match or beat supervised ImageNet on many downstream tasks. Surprising in 2020; foundational today.
Analogy · three critics give similarity scores to the same candidates; the temperature $\tau$ decides how harshly those scores are turned into a verdict.
In SSL, "hard negatives" are the most informative. Low $\tau$ sharpens the softmax, so the negatives scoring almost as high as the positive dominate the gradient.
Same scores, different $\tau$ · a high temperature spreads probability nearly evenly, so easy negatives dilute the signal.
Small $\tau$ → near winner-take-all focus on the hardest negatives; too small and a single noisy negative can destabilize training.
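A small sketch of how temperature reshapes the same scores; the three scores and the temperatures are assumed for illustration.

```python
import numpy as np

def softmax(scores, tau):
    e = np.exp(scores / tau)
    return e / e.sum()

scores = np.array([0.9, 0.7, 0.1])        # positive, hard negative, easy negative
print(softmax(scores, tau=1.0))           # nearly flat: every candidate matters
print(softmax(scores, tau=0.1))           # sharp: the hard negative at 0.7 still gets real weight
print(softmax(scores, tau=0.01))          # effectively one-hot on the top score
```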
Chen et al. swept augmentation pairs. Accuracy (linear-probe on ImageNet):
| Augmentation pair | Accuracy |
|---|---|
| crop only | 40% |
| color-jitter only | 28% |
| crop + color-jitter | 56% |
| crop + color + blur | 64% |
Contrastive learning is as much about what invariances you pick as about the loss. You're telling the model · "ignore crops, ignore color shifts, ignore blur — but pay attention to content." Those choices become the downstream invariances of the representation.
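A minimal torchvision version of the winning crop + color + blur combination; the magnitudes and probabilities here are assumptions, not the paper's tuned values.

```python
from torchvision import transforms as T

simclr_augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),                   # "ignore crops"
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),    # "ignore color shifts"
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=23)], p=0.5),       # "ignore blur"
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# two independent draws of the pipeline give the positive pair for one image:
# view1, view2 = simclr_augment(img), simclr_augment(img)
```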
Two networks chase each other
SimCLR needed lots of negatives. What if we only pull positives together?
Danger · collapse. If the only force is "pull together," the model can output the same constant vector for every image · the loss hits zero and the representation is useless.
BYOL is a clever recipe to prevent collapse without negatives, using two asymmetric networks.
Game · online sees view 1, predicts what target outputs for view 2.
Mechanism 1 · EMA target. The target's weights are an exponential moving average of the online weights · $\theta_{\text{target}} \leftarrow \lambda\,\theta_{\text{target}} + (1-\lambda)\,\theta_{\text{online}}$, with $\lambda$ close to 1.
Worked numeric · with $\lambda = 0.99$ the target covers 1% of the remaining gap per update, so after 100 updates it has covered about 63% of it ($1 - 0.99^{100} \approx 0.63$).
The teacher trails the student smoothly. The student is chasing a stable, slow-moving version of itself.
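The same arithmetic as a loop, with the assumed decay of 0.99 standing in for BYOL's much slower schedule.

```python
decay = 0.99                 # assumed; BYOL uses a decay much closer to 1
target, online = 0.0, 1.0    # one scalar weight standing in for the whole networks

for step in range(1, 101):
    target = decay * target + (1 - decay) * online   # teacher trails the student
    if step in (1, 10, 100):
        print(step, round(target, 3))                # ~0.01, ~0.096, ~0.634
```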
Mechanism 2 · stop-gradient on the target. Loss = MSE(online_pred, sg(target)). Gradients flow back through online only. The teacher can't "cheat" by moving its output to match the student.
The asymmetry (predictor + EMA + stop-grad) prevents collapse without needing negatives.
Without negatives, the obvious failure mode is collapse · every input maps to the same vector, the MSE is trivially zero, and the features are useless.
Three forces prevent collapse:
1. The predictor exists only on the online branch, breaking the symmetry between the two networks.
2. The EMA target moves slowly, so the online network chases a stable, non-degenerate signal rather than its own latest output.
3. Stop-gradient means the loss never optimizes the target directly, so the pair can't race together to the constant solution.
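A sketch of one BYOL-style update under assumed module names (online, target, predictor); the loss is written as MSE on normalized vectors, which matches BYOL's objective up to a constant.

```python
import torch
import torch.nn.functional as F

def byol_step(online, predictor, target, view1, view2, opt, decay=0.996):
    # online branch predicts the target's output; target output is detached (stop-gradient)
    p = predictor(online(view1))
    with torch.no_grad():
        t = target(view2)
    loss = F.mse_loss(F.normalize(p, dim=-1), F.normalize(t, dim=-1))

    opt.zero_grad()
    loss.backward()          # gradients flow through online + predictor only
    opt.step()

    # EMA update: the target slowly trails the online network
    with torch.no_grad():
        for po, pt in zip(online.parameters(), target.parameters()):
            pt.mul_(decay).add_(po, alpha=1 - decay)
    return loss.item()
```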
Grill et al. had to run the training for 300 epochs to verify it really didn't collapse — nobody initially believed it.
The 2019–2021 era had many flavors:
| Method | Key idea |
|---|---|
| MoCo | FIFO queue of negatives; momentum encoder |
| SimCLR | large batch negatives; projection head |
| SwAV | online clustering (prototype assignments) |
| BYOL | no negatives; predictor + EMA |
| SimSiam | BYOL minus EMA — even simpler |
| Barlow Twins | decorrelate representations across views |
By 2023 the community mostly converged on masked autoencoding (MAE) and self-distillation (DINO).
We've forged a new chef's knife (the pretrained encoder). How do we test its quality?
| Method | What's measured | What's frozen |
|---|---|---|
| Linear probe | inherent feature quality | encoder frozen; only 1-layer classifier trained |
| Fine-tune | ceiling of the representation | nothing frozen (often low LR for encoder) |
| k-NN | local feature structure | encoder frozen; no classifier |
| Few-shot | sample efficiency | encoder frozen; tiny labeled set |
Linear probe is the cleanest measure — it isolates the encoder. Fine-tune tests the ceiling but can hide a weak encoder.
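A minimal linear-probe loop, assuming a frozen encoder and a loader of (image, label) batches; the hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, feat_dim, num_classes, loader, epochs=10, lr=0.1):
    encoder.eval()                                   # frozen: only the 1-layer head trains
    for p in encoder.parameters():
        p.requires_grad = False

    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = encoder(x)                   # feature quality is all that's measured
            loss = ce(head(feats), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return head
```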
| Backbone | ImageNet linear probe (top-1 %) | Detection fine-tune (mAP) |
|---|---|---|
| Supervised ResNet-50 | 76.1 | 38 |
| SimCLR ResNet-50 | 69 | 36 |
| MoCo-v3 ViT-B | 76 | 39 |
| MAE ViT-H | 76 | 54 |
| DINOv2 ViT-L | 86 | 52 |
Observation · DINOv2 (self-distillation at scale, 142M images) dominates linear probe. MAE wins detection. Supervised is no longer SOTA for any frozen-feature evaluation.
Zoom out. Every SSL method is a variation on "create a task the model can only solve if it learns features."
The architecture and loss differ, but the meta-idea is the same · make the data supervise itself.
Masked autoencoding for images
Take a photograph · shred 75% of it · ask someone to reconstruct the missing pieces.
To do this they can't just look at local pixels · they must understand what a face looks like, what a tree branch is shaped like.
MAE forces the encoder to learn this deep visual world model by predicting the missing 75% from the visible 25%. It's BERT-for-pixels · masked-then-reconstruct as a self-supervision recipe.
Analogy · the expert and the intern. The expert (the big encoder) studies only the visible patches; the intern (a lightweight decoder) takes the expert's notes plus mask tokens and fills in the missing pixels.
Why this is brilliant. The encoder — the expensive part — runs on only 25% of the input → ~4× faster pretraining than processing the full image. The decoder is small and only handles reconstruction.
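A sketch of the 75% random masking that makes this cheap, assuming the image has already been split into a sequence of patch tokens; shapes are illustrative.

```python
import torch

def random_mask(patches, mask_ratio=0.75):
    """patches: (batch, num_patches, dim). Keep a random 25% for the encoder."""
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)                          # one random score per patch
    keep_idx = noise.argsort(dim=1)[:, :n_keep]       # lowest scores = kept patches
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return visible, keep_idx                          # the encoder sees only `visible`

# e.g. 196 patches per image -> the encoder processes just 49 of them
vis, idx = random_mask(torch.randn(2, 196, 768))
print(vis.shape)   # torch.Size([2, 49, 768])
```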
He et al. 2021 · ViT-Huge MAE pretraining → SOTA on many downstream vision tasks.
| Scenario | Use SSL? |
|---|---|
| Plenty of labeled data, single task | No · just supervised |
| Large unlabeled pool, small labeled | Yes · SSL pretrain + fine-tune |
| Need generic features for many tasks | Yes · start from DINOv2 |
| Need fast deployment on consumer GPU | Probably not · use CLIP/DINOv2 frozen |
| Novel domain (medical, satellite) | Yes · in-domain SSL then fine-tune |
Rule of thumb (2026) · if labels cost more than compute, use SSL. In most real-world contexts, labels ARE the bottleneck. SSL tilts the equation.
GPT is self-supervised · next-token prediction on a trillion-token corpus. BERT is self-supervised · masked-LM.
All LLMs are self-supervised models, pretrained without a single human label (before RLHF tuning). The text modality has had SSL baked in since word2vec (2013). Vision caught up only around 2020 (SimCLR, MAE).
Contrast · NLP went straight to SSL because text is abundant and labels are expensive. Vision started supervised because ImageNet was cheap at 1M labels. The convergence · both modalities now use SSL as the foundation.
DINO (Caron et al. 2021) applied BYOL-style self-distillation (EMA teacher, stop-gradient, no negatives) to ViTs, using multiple crops of each image as views. Emergent properties:
- Self-attention maps that segment the main object, with no segmentation labels anywhere in training.
- Features strong enough to classify ImageNet with a plain k-NN lookup, no fine-tuning.
DINOv2 (Oquab et al. 2023) scaled this up to ViT-g (1B params) on 142M curated images.
DINOv2 features are the de facto general-purpose vision representation in 2026 — ship it for any vision task where you can't afford full fine-tuning.
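A sketch of using DINOv2 frozen, assuming the torch.hub entry points published in the facebookresearch/dinov2 repo are available in your environment.

```python
import torch

# assumes internet access and the facebookresearch/dinov2 hub entry points
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
model.eval()

with torch.no_grad():
    feats = model(torch.randn(1, 3, 224, 224))   # one global feature vector per image
print(feats.shape)                               # e.g. torch.Size([1, 1024]) for ViT-L/14
```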
| Modality | Dominant SSL approach |
|---|---|
| Text | Next-token prediction (every LLM) |
| Vision | MAE + DINO-style distillation |
| Speech | Wav2Vec 2.0 / HuBERT (masked frame prediction) |
| Video | MAE extended to spacetime patches |
| Multimodal | CLIP-style contrastive image-text (next lecture) |
Self-supervision created the foundation model era. Every modality now has its own canonical SSL recipe. Supervised learning survives only at the end of the pipeline — fine-tuning on small labeled data.