
Interactive Explainer

CLIP, Zero-Shot, from the Embeddings Up

In 2021, OpenAI's CLIP model did something no image classifier had done before: it could tell you whether a photo was "a photo of a parrot" or "a photo of a skateboard" without ever being fine-tuned on those classes. This page runs the real CLIP-ViT-B/32 in your browser, embeds whatever labels you type into a shared space with your photo, and shows every cosine similarity and softmax on the way.

Prelude

The zero-shot idea

A regular image classifier has a fixed output head: 1,000 ImageNet classes, 80 COCO classes, whatever. You cannot ask it about "a photo of a tarantula eating sushi" because it doesn't have that output unit.

CLIP punts on output heads entirely. Instead, it trains two encoders—one for images, one for text—so that matching pairs end up close together in a shared 512-dim vector space. At inference time, you encode the image once, encode any set of candidate labels as text, and just pick whichever label is closest in cosine similarity. The model has no idea which labels you'll ask about; that's the zero-shot part.

OpenAI trained CLIP on 400 million image-text pairs scraped from the web, with a simple contrastive loss: within a big batch, the correct (image, caption) pair should have high similarity, and every wrong pair should have low similarity. After training, the embedding space is meaningful enough that any caption can act as a class prototype.
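For readers who like to see the objective written out, here is a minimal PyTorch sketch of that symmetric contrastive loss (not OpenAI's training code; the fixed temperature here is only illustrative, since CLIP actually learns it): every image in the batch should pick out its own caption, and every caption its own image.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss for a batch of matching pairs.

    image_emb, text_emb: (batch, dim) outputs of the two encoders, where the
    i-th image and i-th caption belong together; every off-diagonal entry of
    the similarity matrix is treated as a wrong pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)              # unit-norm, as CLIP does
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.T / temperature           # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0))                  # correct pairs sit on the diagonal

    loss_image_to_text = F.cross_entropy(logits, targets)   # each image must pick its caption
    loss_text_to_image = F.cross_entropy(logits.T, targets) # each caption must pick its image
    return (loss_image_to_text + loss_text_to_image) / 2
```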

Pick a photo

The model (~150 MB) downloads once and caches in your browser's IndexedDB. First load takes ~20-40 s on a typical connection.

Six CC-licensed stock photos.

Model CLIP-ViT-B/32 · Shared dim 512 · Inference time shown live
Step 1

Two encoders, one shared space

CLIP is not one network. It's two. The vision encoder is a Vision Transformer (or ResNet, depending on the variant); the text encoder is a small Transformer. Each maps its input to a single 512-dim vector. That shared 512-dim space is where all the magic happens.

Both vectors are $\ell_2$-normalised after the final projection, so they live on the unit hypersphere. Closeness between them is measured by cosine similarity—equivalently, their dot product—which ranges from $-1$ (opposite) to $+1$ (identical).

Your image, through the vision encoder

The 512 numbers below are the real output of CLIP's vision encoder on your current photo. If you swap photos, the whole vector changes.

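If you want to reproduce this step outside the browser, the same checkpoint is available on Hugging Face; a minimal sketch using the transformers library (the filename is just a stand-in for any photo):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("parrot.jpg")                       # hypothetical example photo
inputs = processor(images=image, return_tensors="pt")  # resize, crop, normalise pixels

with torch.no_grad():
    v = model.get_image_features(**inputs)             # shape (1, 512)
v = v / v.norm(dim=-1, keepdim=True)                   # unit-normalise, as described above
print(v.shape, v[0, :8])                               # first few of the 512 numbers
```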
Step 2

Your labels, through the text encoder

Type whatever you want. CLIP will tokenize each label, push the tokens through its text Transformer, and emit another 512-dim vector. These become your class prototypes. Whichever prototype is closest to your image in the shared space wins.

Prompt template

CLIP was trained on captions, not bare class names. Wrapping your label in a natural-sounding template like "a photo of a {label}" typically gives a much better text vector than the label alone. Swap templates and compare.
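Outside the widget, the template is just string formatting before the text encoder runs. A rough sketch, continuing from the Step 1 code (same model and processor; the label list is made up), including the prompt-ensembling trick from the CLIP paper of averaging a label's prototypes over several templates:

```python
labels = ["parrot", "skateboard", "tarantula eating sushi"]   # hypothetical labels
templates = ["a photo of a {}", "a blurry photo of a {}", "a drawing of a {}"]

prototypes = []
for label in labels:
    prompts = [t.format(label) for t in templates]
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        t_emb = model.get_text_features(**inputs)             # (num_templates, 512)
    t_emb = t_emb / t_emb.norm(dim=-1, keepdim=True)
    prototypes.append(t_emb.mean(dim=0))                      # average over templates...
prototypes = torch.stack(prototypes)
prototypes = prototypes / prototypes.norm(dim=-1, keepdim=True)  # ...then re-normalise: (num_labels, 512)
```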

Edit your candidate labels

Each row is a separate candidate. The live similarity and the softmax probability are shown on the right.

First text prototype (for reference)

Step 3

Cosine similarity: the one operation that matters

Because both vectors are unit-norm, the dot product between the image embedding $\mathbf{v}$ and a label embedding $\mathbf{t}$ is exactly their cosine similarity:

$$\operatorname{sim}(\mathbf{v},\mathbf{t}) \;=\; \frac{\mathbf{v}\cdot\mathbf{t}}{\lVert\mathbf{v}\rVert\,\lVert\mathbf{t}\rVert} \;=\; \mathbf{v}\cdot\mathbf{t}, \qquad \lVert\mathbf{v}\rVert=\lVert\mathbf{t}\rVert=1.$$

One dot product per label. 512 multiplications and 511 additions, times the number of labels. That's the entire cost of zero-shot classification in CLIP.
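Picking up the sketches from the earlier steps (v is the image embedding, prototypes the stacked label embeddings), that cost is a single line:

```python
sims = prototypes @ v.squeeze(0)          # one cosine similarity per label
for label, s in zip(labels, sims.tolist()):
    print(f"{label:30s} {s:+.3f}")
```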

Raw similarities (before softmax)

Why unit-normalising matters. The contrastive loss CLIP trained with rewards high dot product for matching pairs and low for non-matching. Without normalisation, the model could cheat by making all pairs' similarities big in absolute terms. Normalising puts every vector on the same scale; similarity is only about direction.
Step 4

Softmax and temperature

To turn similarity scores into probabilities we apply a softmax. CLIP scales the logits by a learned temperature $\tau \approx 0.01$ (equivalently, scales by $1/\tau \approx 100$):

$$p_i \;=\; \frac{\exp(s_i/\tau)}{\sum_j \exp(s_j/\tau)},$$

where $s_i = \mathbf{v}\cdot\mathbf{t}_i$ is the cosine similarity between the image and the $i$-th label.

The scaling makes the softmax very sharp: a similarity gap of just 0.05 becomes a logit gap of 5, roughly an $e^5 \approx 150\times$ ratio in softmax weight between the two labels. Slide the temperature below to see the effect: lower $\tau$ = sharper predictions; higher $\tau$ = flatter.

CLIP's pretrained temperature is ~0.01. Raise it to flatten confidence, lower it to spike it.
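Here is a tiny numeric illustration (the similarity scores are made up, not read from the widget) of what the $1/\tau$ scaling does:

```python
import torch

sims = torch.tensor([0.31, 0.26, 0.18])      # hypothetical cosine similarities

for tau in (1.0, 0.1, 0.01):                 # CLIP's learned value is roughly 0.01
    probs = torch.softmax(sims / tau, dim=0)
    print(f"tau={tau:<5} probs={[round(p, 3) for p in probs.tolist()]}")

# tau=1.0 gives a nearly uniform distribution; at tau=0.01 the 0.05 similarity
# gap between the top two labels becomes a logit gap of 5, i.e. about e^5 ≈ 150x
# more softmax weight on the winner.
```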

Live predictions

Top label · Top probability · Entropy (nats)
Step 5

Why this works at all

A model that's never seen your labels nonetheless gets them right most of the time. The reason is the scale and diversity of the contrastive pretraining data. CLIP saw 400 million (image, caption) pairs from the open web, covering a huge range of objects, scenes, styles, and ways of describing them.

By the end, the image of any cat is close to the text embedding of any plausible caption about cats. The pretraining buys you a surprisingly general image-to-language similarity function that transfers to unseen label sets.

Why the 512-dim head is enough

512 dimensions in the shared space. That's plenty of room for millions of distinct concepts because the embeddings don't need to be orthogonal—they just need to cluster by semantic similarity. cat, kitten, and tabby don't fight for axes; they pile on top of each other.
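A quick way to feel how roomy 512 dimensions are (a toy numpy experiment, nothing CLIP-specific): random unit vectors in that space are already almost orthogonal to one another, so directions only collide when training deliberately pulls them together.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2_000, 512))
x /= np.linalg.norm(x, axis=1, keepdims=True)     # 2,000 random unit vectors

cos = x[:1000] @ x[1000:].T                       # a million random pairings
print(cos.mean(), cos.std())                      # mean ~0.0, std ~0.044 (about 1/sqrt(512))
```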

Step 6

Beyond classification

Zero-shot classification is the simplest use of CLIP. The same embedding space powers:

| Task | Trick | Example |
| --- | --- | --- |
| Image retrieval | Embed a text query; find images whose embeddings are closest. | Search "golden hour landscape" across your photo library. |
| Image → text retrieval | Embed an image; find captions whose embeddings are closest. | Suggest alt-text for an accessibility pass. |
| Guidance for diffusion | Use CLIP similarity as a loss to nudge image generators toward a prompt. | CLIP-guided diffusion, e.g. GLIDE's CLIP guidance or Disco Diffusion. |
| Open-vocabulary detection / segmentation | Replace the detector's class head with CLIP's text embeddings. | OWL-ViT, CLIPSeg, X-Decoder. |
| OCR-free document understanding | Embed receipts, slides, charts; search by semantic content. | Paper search, slide organization. |
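Image retrieval, for instance, is the same dot product run against a library of precomputed image embeddings. A rough sketch, reusing the model and processor from the Step 1 code and assuming image_embs is a unit-normalised (num_images, 512) tensor you built in advance:

```python
import torch

def search(query: str, image_embs: torch.Tensor, top_k: int = 5):
    """Return the top_k best-matching images for a free-text query."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)            # unit-normalise the query embedding
    scores = image_embs @ q.squeeze(0)              # one cosine similarity per image
    return scores.topk(top_k)                       # (values, indices) of the best matches

# search("golden hour landscape", image_embs)
```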
Step 7

Four places CLIP will fool you

Myth

"CLIP understands what it classifies."
CLIP learns a similarity function, not a truth function. It's happy to call an apple an "iPod" if someone tapes a sign with that word onto it (the typographic attack, Goh et al. 2021). It doesn't reason; it matches.

Myth

"Zero-shot means no training."
Someone still trained the encoders on 400 million pairs and a GPU cluster's worth of compute. What you get for free is generalization to new labels, not a free lunch.

Myth

"Adding more labels can only help."
Extra near-synonym labels split probability mass, hurting any single label's score. Extra garbage labels soak up mass too and add noise to the ranking. More is not always better; pick a focused, well-separated label set.

Myth

"CLIP is unbiased because it wasn't trained on labels."
It was trained on 400M captions scraped from the web. The biases of the web—gendered professions, racial stereotypes, cultural blind spots—are baked in. Many papers have probed these; deploy with care.

Final takeaway. Once the two encoders have run, zero-shot classification with CLIP is a dot product per label and a softmax. The magic isn't the math; it's the 400 million captioned images whose structure got distilled into the shared embedding space. You've now done the full pipeline yourself on your own photo with your own labels—on a model that genuinely does not know in advance what you were going to ask.