Interactive Explainer
CLIP, Zero-Shot, from the Embeddings Up
In 2021, OpenAI's CLIP model did something no image classifier had done before: it could tell you whether a photo was "a photo of a parrot" or "a photo of a skateboard" without ever being fine-tuned on those classes. This page runs the real CLIP-ViT-B/32 in your browser, embeds whatever labels you type into a shared space alongside your photo, and shows every cosine similarity and softmax along the way.
The zero-shot idea
A regular image classifier has a fixed output head: 1,000 ImageNet classes, 80 COCO classes, whatever. You cannot ask it about "a photo of a tarantula eating sushi" because it doesn't have that output unit.
CLIP punts on output heads entirely. Instead, it trains two encoders—one for images, one for text—so that matching pairs end up close together in a shared 512-dim vector space. At inference time, you encode the image once, encode any set of candidate labels as text, and just pick whichever label is closest in cosine similarity. The model has no idea which labels you'll ask about; that's the zero-shot part.
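If you want to run the same loop outside the browser, here is a minimal sketch assuming the Hugging Face `transformers` library and its zero-shot image-classification pipeline; `parrot.jpg` is a hypothetical placeholder path, not a file this page ships.

```python
from transformers import pipeline

# Same checkpoint this page runs; downloads from the Hugging Face Hub.
classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

# Candidate labels are free-form text; CLIP was never fine-tuned on them.
results = classifier(
    "parrot.jpg",  # hypothetical local file, for illustration only
    candidate_labels=["a photo of a parrot", "a photo of a skateboard"],
)
for r in results:
    print(f"{r['label']}: {r['score']:.3f}")
```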
OpenAI trained CLIP on 400 million image-text pairs scraped from the web, with a simple contrastive loss: within a big batch, the correct (image, caption) pair should have high similarity, and every wrong pair should have low similarity. After training, the embedding space is meaningful enough that any caption can act as a class prototype.
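For concreteness, here is a minimal numpy sketch of that symmetric contrastive objective. It follows the description above rather than any particular training codebase; the 0.01 temperature matches the learned value discussed later on this page.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.01):
    """img_emb, txt_emb: (batch, 512) L2-normalised embeddings; row i of each
    matrix comes from the same (image, caption) pair."""
    logits = img_emb @ txt_emb.T / temperature      # (batch, batch) scaled similarities
    idx = np.arange(len(logits))                    # correct pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> matching caption
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # caption -> matching image
    return (loss_i2t + loss_t2i) / 2
```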
Pick a photo
The model (~150 MB) downloads once and caches in your browser's IndexedDB. First load takes ~20-40 s on a typical connection.
Six CC-licensed stock photos.
Two encoders, one shared space
CLIP is not one network. It's two. The vision encoder is a Vision Transformer (or ResNet, depending on the variant); the text encoder is a small Transformer. Each maps its input to a single 512-dim vector. That shared 512-dim space is where all the magic happens.
Both vectors are $\ell_2$-normalised after the final projection, so they live on the unit hypersphere. Similarity between them is measured by the cosine of the angle between them, which for unit vectors equals their dot product and ranges from $-1$ (opposite) to $+1$ (identical).
Your image, through the vision encoder
The 512 numbers below are the real output of CLIP's vision encoder on your current photo. If you swap photos, the whole vector changes.
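Outside the browser, the same step looks roughly like this, assuming the Hugging Face `transformers` port of CLIP; `parrot.jpg` is again a hypothetical placeholder.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("parrot.jpg")                         # hypothetical local photo
inputs = processor(images=image, return_tensors="pt")    # resize, crop, normalise
img_emb = model.get_image_features(**inputs)             # shape: (1, 512)
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)   # unit-normalise, as above
```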
Your labels, through the text encoder
Type whatever you want. CLIP will tokenize each label, push the tokens through its text Transformer, and emit another 512-dim vector. These become your class prototypes. Whichever prototype is closest to your image in the shared space wins.
Prompt template
CLIP was trained on captions, not bare class names. Wrapping your label in a natural-sounding template like "a photo of a {label}" typically gives a much better text vector than the label alone. Swap templates and compare:
Edit your candidate labels
Each row is a separate candidate. The live similarity and the softmax probability are shown on the right.
First text prototype (for reference)
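The text side is the mirror image. Here is a sketch with the same assumed `transformers` setup; the labels are illustrative and the template mirrors the one above.

```python
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["parrot", "skateboard", "cup of coffee"]        # illustrative candidates
template = "a photo of a {}"                              # the template discussed above
prompts = [template.format(label) for label in labels]

inputs = processor(text=prompts, return_tensors="pt", padding=True)
txt_emb = model.get_text_features(**inputs)               # shape: (3, 512)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)    # one unit vector per label
```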
Cosine similarity: the one operation that matters
Because both vectors are unit-norm, the dot product is the cosine similarity: $\text{sim}(\mathbf{v}_{\text{img}}, \mathbf{v}_{\text{txt}}) = \frac{\mathbf{v}_{\text{img}} \cdot \mathbf{v}_{\text{txt}}}{\lVert\mathbf{v}_{\text{img}}\rVert \, \lVert\mathbf{v}_{\text{txt}}\rVert} = \mathbf{v}_{\text{img}} \cdot \mathbf{v}_{\text{txt}}$.
One dot product per label: 512 multiplications and 511 additions, times the number of labels. Once the image and text embeddings are in hand, that's the entire cost of zero-shot classification in CLIP.
Raw similarities (before softmax)
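As a toy numpy sketch (random stand-in vectors in place of real CLIP embeddings), the whole classification step is:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

img_emb = unit(rng.normal(size=512))                      # stand-in for the image vector
txt_embs = np.stack([unit(rng.normal(size=512)) for _ in range(3)])  # three label prototypes

sims = txt_embs @ img_emb                                 # cosine similarities, shape (3,)
best = int(np.argmax(sims))                               # index of the winning label
```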
Softmax and temperature
To turn similarity scores into probabilities we apply a softmax. CLIP scales the logits by a learned temperature $\tau \approx 0.01$ (equivalently, multiplies by $1/\tau \approx 100$): $p_i = \dfrac{\exp(s_i / \tau)}{\sum_j \exp(s_j / \tau)}$.
The scaling makes the softmax very sharp. A similarity gap of just 0.05 becomes a logit gap of 5, roughly a 150× ratio between two labels' softmax scores. Slide the temperature below to see the effect: lower $\tau$ = sharper predictions; higher $\tau$ = flatter.
CLIP's pretrained temperature is ~0.01. Raise it to flatten confidence, lower it to spike it.
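Here is that temperature-scaled softmax as a small numpy sketch with made-up similarity scores; compare the two $\tau$ values.

```python
import numpy as np

def softmax_with_temperature(sims, tau=0.01):
    logits = np.asarray(sims) / tau
    logits = logits - logits.max()        # shift for numerical stability
    e = np.exp(logits)
    return e / e.sum()

sims = [0.28, 0.23, 0.19]                 # made-up raw cosine similarities
print(softmax_with_temperature(sims, tau=0.01))  # CLIP's value: very sharp
print(softmax_with_temperature(sims, tau=0.10))  # flatter distribution
```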
Live predictions
Why this works at all
A model that's never seen your labels nonetheless gets them right most of the time. The reason is the scale and diversity of the contrastive pretraining data. CLIP saw 400 million (image, caption) pairs from the open web, which means it saw:
- Dozens of ways to describe a cat ("cat", "kitten", "kitty", "tabby cat", "a photo of a cat on a rug"…)
- Thousands of different cats, from every angle and in every lighting condition.
- Negatives: within each training batch, every caption is contrasted against many images it does not describe.
By the end, the embedding of any cat photo sits close to the text embedding of any plausible caption about cats. The pretraining buys you a surprisingly general image-to-language similarity function that transfers to unseen label sets.
Why the 512-dim head is enough
512 dimensions in the shared space is plenty of room for millions of distinct concepts, because the embeddings don't need to be orthogonal; they just need to cluster by semantic similarity. "cat", "kitten", and "tabby" don't fight for axes; they pile on top of each other.
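One quick way to probe that claim, assuming the same `transformers` setup as the earlier sketches, is to compare a few text prototypes directly. Typically "cat" and "kitten" land far closer to each other than either does to "skateboard".

```python
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a cat", "a photo of a kitten", "a photo of a skateboard"]
inputs = processor(text=prompts, return_tensors="pt", padding=True)
emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)

print(emb @ emb.T)   # pairwise cosine similarities between the three prototypes
```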
Beyond classification
Zero-shot classification is the simplest use of CLIP. The same embedding space powers:
| Task | Trick | Example |
|---|---|---|
| Image retrieval | Embed a text query; find images whose embeddings are closest. | Search "golden hour landscape" across your photo library. |
| Image → text retrieval | Embed an image; find captions whose embeddings are closest. | Suggest alt-text for an accessibility pass. |
| Guidance for diffusion | Use CLIP similarity as a loss to nudge image generators toward a prompt. | CLIP-guided diffusion (e.g. GLIDE's CLIP-guidance variant, Disco Diffusion). |
| Open-vocabulary detection / segmentation | Replace the detector's class head with CLIP's text embeddings. | OWL-ViT, CLIPSeg, X-Decoder. |
| OCR-free document understanding | Embed receipts, slides, charts; search by semantic content. | Paper search, slide organization. |
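As an illustration of the first row, here is a retrieval sketch over precomputed image embeddings (faked with random vectors here; in practice they come from running the vision encoder over your library once):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 10,000 precomputed, unit-normalised image embeddings.
library = rng.normal(size=(10_000, 512))
library /= np.linalg.norm(library, axis=1, keepdims=True)

# Stand-in for the text embedding of a query like "golden hour landscape".
query = rng.normal(size=512)
query /= np.linalg.norm(query)

scores = library @ query            # one dot product per image
top10 = np.argsort(-scores)[:10]    # indices of the ten closest photos
```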
Four places CLIP will fool you
"CLIP understands what it classifies."
CLIP learns a similarity function, not a truth function. It will happily call an apple an "iPod" if someone tapes a piece of paper with that word to it (the typographic attack of Goh et al., 2021). It doesn't reason; it matches.
"Zero-shot means no training."
Someone spent 400 million pairs and a GPU cluster training the encoders. What you get for free is generalization to new labels, not a free lunch.
"Adding more labels can only help."
Extra near-synonym labels split probability mass, hurting any single label's score. Extra irrelevant labels soak up probability and add noise. More is not always better; pick a focused, well-separated label set.
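A toy numpy sketch of that splitting effect, with made-up similarity scores: adding a "kitten" prototype next to "cat" steals probability from "cat" even though the image hasn't changed.

```python
import numpy as np

def clip_softmax(sims, tau=0.01):
    logits = np.asarray(sims) / tau
    e = np.exp(logits - logits.max())
    return e / e.sum()

print(clip_softmax([0.30, 0.22]))        # labels: ["cat", "skateboard"] -> cat ~ 1.00
print(clip_softmax([0.30, 0.29, 0.22]))  # add "kitten" -> cat's mass splits with it
```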
"CLIP is unbiased because it wasn't trained on labels."
It was trained on 400M captions scraped from the web. The biases of the web (gendered professions, racial stereotypes, cultural blind spots) are baked in. Many papers have probed these; deploy with care.