
Interactive Explainer

Vision Transformers, on Real Photos

A pretrained MobileNet v2 runs in your browser to turn every patch of your photo into a real 1280-dimensional feature vector. We use those real features as stand-ins for the Q/K/V of a Transformer encoder block, compute honest dot-product softmax attention over them, and paint the result back on the photo. Nothing below is synthesised.

Prelude

From CNNs to a sentence of patches

A CNN processes an image by sliding small kernels over nearby pixels, layer after layer. Its priors—locality, translation equivariance, a pyramid of scales—are hand-wired. A Vision Transformer tosses all of that. It chops the image into non-overlapping patches, flattens each into a vector, and drops the sequence into a vanilla Transformer. No convolutions, no pooling—just self-attention.

A production ViT (vit-base-patch16, DINOv2) doesn't run easily in-browser yet, but a real MobileNet v2 does, and crucially its spatial feature map at the penultimate layer is a 7×7×1280 tensor—i.e. a 7×7 grid of real 1280-dim vectors learned from ImageNet. We use those vectors wherever ViT would normally use Q/K/V projections of patch embeddings. The math is identical: real features, real dot products, real softmax.

Pick a photo Loading MobileNet…

Six CC-licensed stock photos.

Step 1

Chop the image into patches

The first—and really the only image-specific—step in a ViT is splitting the input into a grid of equal patches. For a 224×224 image with 16×16 patches you get a 14×14 = 196 patch grid. That grid is the "sentence" the Transformer consumes.

Image with patch grid overlay. Click any patch.
Grid
Patches
Pixels / patch
Flat vector dim
Patch-size tradeoff. Smaller patches = more tokens = finer spatial resolution but quadratic attention cost. Larger patches = fewer tokens but blurrier spatial reasoning. ViT-Base used 16×16 patches on 224×224 images → 196 patch tokens.
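The bookkeeping is simple enough to check in a few lines. The sketch below is plain TypeScript with no libraries; patchGrid is an illustrative helper, not code from this page.

```ts
// Grid, patch count, and flattened dimension for a square image.
function patchGrid(imageSize: number, patchSize: number) {
  const gridSide = Math.floor(imageSize / patchSize); // patches per row and per column
  const numPatches = gridSide * gridSide;             // patch tokens before [CLS]
  const flatDim = 3 * patchSize * patchSize;          // raw RGB values per patch
  return { gridSide, numPatches, flatDim };
}

// ViT-Base defaults: a 224×224 image cut into 16×16 patches.
console.log(patchGrid(224, 16)); // { gridSide: 14, numPatches: 196, flatDim: 768 }
```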
Step 2

Flatten each patch into a vector

A patch is a tiny image: $P \times P$ pixels, each with 3 RGB channels. The ViT just reads the $3 P^2$ numbers in raster order:

$$x_p = \big[\, r_{1,1},\ g_{1,1},\ b_{1,1},\ \ldots,\ r_{P,P},\ g_{P,P},\ b_{P,P} \,\big] \in \mathbb{R}^{3P^2}$$

A learned $D \times 3P^2$ projection matrix then maps this to the Transformer's hidden dim $D$. Click a patch in Step 1 or below to see its raw flattened bytes.

Click a patch to inspect.

Selected patch

No patch selected yet. Click one above.

First 12 dims of the flattened raw vector

Why the projection isn't trivial. The raw flattened vector is too low-level; it's just pixel brightnesses. The $D$-dim projection learns which linear combinations of those pixels carry useful signal—and in practice ends up computing gradient-like features, the same thing the first layer of a CNN does. For this page, we substitute the real MobileNet backbone's penultimate feature map, which has already learned a thousand ImageNet classes' worth of features.
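As a sketch of what this step computes, the TypeScript below flattens a $P \times P \times 3$ patch in raster order and applies a $D \times 3P^2$ projection. flattenPatch and project are illustrative helpers; in a trained ViT the projection weights are learned parameters, not anything shown here.

```ts
// patch[y][x] is an [r, g, b] triple; raster order means row by row,
// channels interleaved, exactly 3 * P * P numbers in total.
function flattenPatch(patch: number[][][]): number[] {
  const flat: number[] = [];
  for (const row of patch) {
    for (const [r, g, b] of row) flat.push(r, g, b);
  }
  return flat;
}

// y = W x: a D-by-3P² matrix maps the raw pixel values to the hidden dim D.
function project(W: number[][], x: number[]): number[] {
  return W.map(row => row.reduce((sum, w, i) => sum + w * x[i], 0));
}
```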
Step 3

Add position embeddings

Self-attention is permutation-invariant—shuffle the tokens and the output is identical. That's fatal for an image. The fix is cheap: add a learned position embedding $p_i$ to each patch token so the Transformer knows where it came from.
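A minimal sketch of the addition itself, in illustrative TypeScript; the real position embeddings are learned weights read from a checkpoint, not computed here.

```ts
// tokens[i] is the D-dim embedding of patch i; posEmbed[i] is the learned
// embedding for grid slot i. The sum is elementwise, so after this step
// shuffling the patches genuinely changes what attention sees.
function addPositionEmbeddings(tokens: number[][], posEmbed: number[][]): number[][] {
  return tokens.map((tok, i) => tok.map((v, d) => v + posEmbed[i][d]));
}
```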

Sequence the Transformer sees

Waiting for photo…

In Step 5 you'll be able to compare attention with vs. without this positional signal. Without it, attention becomes a pure feature-similarity search—identical-looking patches in different corners get equal weight.

Step 4

Prepend the [CLS] token

ViT borrows a BERT trick: prepend a learnable [CLS] token whose job is to aggregate information from all patches. It has no pixels behind it—it's a free parameter vector the model learns to use. After self-attention runs, the CLS token's final embedding feeds a small MLP that outputs class logits.

Why a special token, not just an average? Averaging is fine, and newer models (DeiT-III, many MAE variants) do exactly that. A learnable CLS just gives the model a free parameter budget to decide how to aggregate per-class—different heads may steal different patches for different parts of the label.
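In code the whole trick is a one-element prepend; the sketch below is illustrative TypeScript, with cls standing in for the learned parameter vector.

```ts
// cls is a free D-dim vector learned during training (no pixels behind it).
// Prepending it makes the sequence 1 + N tokens long.
function prependCls(tokens: number[][], cls: number[]): number[][] {
  return [cls, ...tokens];
}
```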
Step 5

Self-attention: patches listen to patches

Here's the payoff. Every token projects itself into a query, key, and value. Attention weights are a softmax of scaled query-key dot products, and each token's new value is a weighted sum of everybody's values:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Live attention heatmap — on real MobileNet features

Click any patch below. We use its real 1280-dim MobileNet feature vector as the query $q$, every other patch's feature as a key $k$, and compute the softmax of $q^\top k$ over all patches. The result is a real learned-feature attention map, drawn back on your photo. It's not what a trained ViT would produce exactly (it's CNN features, not ViT Q/K projections), but every number is a real dot product over real features—not a stand-in.
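For readers who want the exact arithmetic, here is a sketch of that map in plain TypeScript. features is assumed to hold the 49 real 1280-dim MobileNet vectors from the 7×7 grid, and the $1/\sqrt{d_k}$ scaling follows the standard Transformer convention; the live widget may differ in such minor details.

```ts
function dot(a: number[], b: number[]): number {
  return a.reduce((s, ai, i) => s + ai * b[i], 0);
}

function softmax(xs: number[]): number[] {
  const m = Math.max(...xs);                  // subtract the max for numerical stability
  const exps = xs.map(x => Math.exp(x - m));
  const z = exps.reduce((s, e) => s + e, 0);
  return exps.map(e => e / z);
}

// One attention row: the clicked patch is the query, every patch is a key.
function attentionMap(features: number[][], queryIndex: number): number[] {
  const q = features[queryIndex];
  const scale = 1 / Math.sqrt(q.length);                // 1/sqrt(d_k)
  const scores = features.map(k => dot(q, k) * scale);  // q·k for every patch, itself included
  return softmax(scores);                               // one weight per patch, sums to 1
}
```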

Click any patch to make it the query.
Query patch
Top attended

Top 5 attention weights

Click a patch first.
high
medium
low
What to try. On a portrait, click a patch on the face—attention clusters on the other face/skin patches. On a landscape, click the sky and it snaps to other sky patches across the whole image (long-range attention that CNNs struggle with). Turn off position embeddings (Step 3) and you'll see even farther patches attended to, because the content-only similarity ignores spatial layout.
Step 6

The full recipe

One encoder block of a ViT is: LayerNorm → self-attention → residual add → LayerNorm → MLP → residual add (pre-norm, the ordering the ViT paper uses). ViT-Base stacks 12 of them. The final CLS token goes through a one-hidden-layer MLP head to produce class logits.
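A sketch of that wiring in TypeScript. It is illustrative only: attention and mlp are passed in as stand-ins for multi-head self-attention and the two-layer GELU MLP, and LayerNorm's learned scale and shift are omitted.

```ts
type Seq = number[][]; // (1 + N) tokens, each a D-dim array

function layerNorm(x: number[], eps = 1e-6): number[] {
  const mean = x.reduce((s, v) => s + v, 0) / x.length;
  const variance = x.reduce((s, v) => s + (v - mean) ** 2, 0) / x.length;
  return x.map(v => (v - mean) / Math.sqrt(variance + eps));
}

function encoderBlock(
  z: Seq,
  attention: (tokens: Seq) => Seq,   // multi-head self-attention in a real block
  mlp: (token: number[]) => number[] // two-layer MLP in a real block
): Seq {
  // z' = z + MSA(LN(z))
  const attnOut = attention(z.map(t => layerNorm(t)));
  const zPrime = z.map((t, i) => t.map((v, d) => v + attnOut[i][d]));
  // z'' = z' + MLP(LN(z'))
  return zPrime.map(t => {
    const m = mlp(layerNorm(t));
    return t.map((v, d) => v + m[d]);
  });
}
```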

Stage | What it does | For our photo

Sequence length fed to the Transformer

= 1 CLS + N patch tokens. Every token attends to every other; that's $N^2$ pairwise comparisons per layer.

Compute comparison across patch sizes

Patch size | Grid | Tokens (incl. CLS) | Attention ops (rel.)
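The numbers behind that table are one line of arithmetic each; a sketch, assuming a 224×224 input and counting the CLS token:

```ts
// Relative attention cost across patch sizes, with 16×16 as the baseline.
// Cost scales with the square of the token count.
const tokensFor = (patch: number) => (224 / patch) ** 2 + 1; // patches + [CLS]
const baseline = tokensFor(16);

for (const p of [32, 16, 8]) {
  const grid = 224 / p;
  const tokens = tokensFor(p);
  const rel = (tokens / baseline) ** 2;
  console.log(`${p}×${p}: grid ${grid}×${grid}, tokens ${tokens}, ~${rel.toFixed(2)}× attention ops`);
}
```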
Step 7

Four things people get wrong about ViTs

Myth

"ViTs have no inductive bias for images."
They have less than a CNN, but far from none. The patch embedding is literally a stride-$P$ convolution. Position embeddings encode the grid. And self-attention itself is a learnable prior, just a weaker one: it favours smooth, content-driven mixing over arbitrary permutations of the input.

Myth

"They need hundreds of millions of images."
The original ViT did, because of its weak inductive bias. Later variants (DeiT, DINO, Swin, MAE pre-training) train on ImageNet-1k alone and match or beat CNNs. Data hunger was a recipe choice, not a fundamental limit.

Myth

"Attention weights tell you what the model sees."
Tempting, but noisy. Attention is one signal among many in a deep residual stack. The same prediction can be supported by many different attention patterns, and gradient attributions frequently disagree. Treat heatmaps as suggestive, not evidence.

Myth

"Smaller patches are always better."
Halving the patch size quadruples the token count, and because attention cost is quadratic in tokens that means roughly 16× the compute. Hybrid models (Swin, MaxViT) use local attention windows to keep the fine spatial resolution without the full $N^2$ bill.

Final takeaway. A Vision Transformer is a bag of patches plus their positions, poured through the exact same attention block you'd use on text. You've now seen every number along the way—the patch grid, the flattened vectors, the position tags, the CLS token, and the attention heatmap over real pretrained features—on an actual photo. The rest of the field is scaling this up, finding better pretraining objectives, and adding locality back in cheaper ways.