Interactive Explainer
Vision Transformers, on Real Photos
A pretrained MobileNet v2 runs in your browser to turn every patch of your photo into a real 1280-dimensional feature vector. We use those real features as stand-ins for the Q/K/V of a Transformer encoder block, compute honest dot-product softmax attention over them, and paint the result back on the photo. Nothing below is synthesised.
From CNNs to a sentence of patches
A CNN processes an image by sliding small kernels over nearby pixels, layer after layer. Its priors—locality, translation equivariance, a pyramid of scales—are hand-wired. A Vision Transformer tosses all of that. It chops the image into non-overlapping patches, flattens each into a vector, and drops the sequence into a vanilla Transformer. No convolutions, no pooling—just self-attention.
A production ViT (vit-base-patch16, DINOv2) doesn't run easily in-browser yet, but a real MobileNet v2 does, and crucially its spatial feature map at the penultimate layer is a 7×7×1280 tensor—i.e. a 7×7 grid of real 1280-dim vectors learned from ImageNet. We use those vectors wherever ViT would normally use Q/K/V projections of patch embeddings. The math is identical: real features, real dot products, real softmax.
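If you want to pull the same kind of feature grid out of a MobileNet v2 yourself, the sketch below shows one way to do it in the browser with TensorFlow.js. The model URL and the intermediate node name are placeholders that depend on which MobileNet v2 export you load; this is an illustrative sketch, not the demo's actual source.

```ts
import * as tf from '@tensorflow/tfjs';

// Placeholder URL and node name: they depend on the MobileNet v2 export you load.
async function patchFeatures(img: HTMLImageElement): Promise<number[][][]> {
  const model = await tf.loadGraphModel('https://example.com/mobilenet_v2/model.json');
  const input = tf.tidy(() =>
    tf.image
      .resizeBilinear(tf.browser.fromPixels(img), [224, 224])
      .div(255)
      .expandDims(0) // shape [1, 224, 224, 3]
  );
  // Ask the graph for the penultimate spatial activation instead of the logits.
  const featureMap = model.execute(input, 'feature_map_node') as tf.Tensor;
  return (await featureMap.squeeze().array()) as number[][][]; // [7][7][1280]
}
```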
Pick a photo
Six CC-licensed stock photos.
Chop the image into patches
The first—and really the only image-specific—step in a ViT is splitting the input into a grid of equal patches. For a 224×224 image with 16×16 patches you get a 14×14 = 196 patch grid. That grid is the "sentence" the Transformer consumes.
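As a sanity check on that arithmetic, here is a tiny helper (the names are illustrative, not the demo's code) that computes the grid for any square image, assuming the image dimension divides evenly by the patch size:

```ts
// Patch grid for a square image; assumes imageSize is a multiple of patchSize.
function patchGrid(imageSize: number, patchSize: number) {
  const side = imageSize / patchSize;   // 224 / 16 = 14
  return { side, count: side * side };  // 14 * 14 = 196 patch tokens
}

patchGrid(224, 16); // { side: 14, count: 196 }
```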
Flatten each patch into a vector
A patch is a tiny image: $P \times P$ pixels, each with 3 RGB channels. The ViT just reads the $3 P^2$ numbers in raster order:
A learned $D \times 3P^2$ projection matrix then maps this to the Transformer's hidden dim $D$. Click a patch in Step 1 or below to see its raw flattened bytes.
Selected patch
First 12 dims of the flattened raw vector
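A minimal sketch of this step, assuming a patch arrives as a P×P×3 array of RGB values and the learned projection is a plain D×3P² matrix (`flattenPatch` and `project` are illustrative names, not the demo's code):

```ts
type Patch = number[][][]; // [P][P][3]: rows, columns, RGB channels

// Read the 3·P² numbers in raster order: row by row, pixel by pixel, R then G then B.
function flattenPatch(patch: Patch): number[] {
  const flat: number[] = [];
  for (const row of patch)
    for (const pixel of row)
      for (const channel of pixel) flat.push(channel);
  return flat; // length 3 * P * P
}

// z = E·x: the learned D×3P² matrix maps the raw values to the hidden dim D.
function project(E: number[][], x: number[]): number[] {
  return E.map(rowOfE => rowOfE.reduce((sum, w, j) => sum + w * x[j], 0));
}
```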
Add position embeddings
Self-attention is permutation-invariant—shuffle the tokens and the output is identical. That's fatal for an image. The fix is cheap: add a learned position embedding $p_i$ to each patch token so the Transformer knows where it came from.
Sequence the Transformer sees
In Step 5 you'll be able to compare attention with vs without this positional signal. Without it, attention becomes a pure feature-similarity search— identical-looking patches in different corners get equal weight.
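In code, the fix really is one addition per token. A minimal sketch with illustrative names (`tokens` is the N×D patch-token matrix, `posEmbeddings` the learned N×D table):

```ts
// Add the learned position embedding p_i to patch token i, element-wise.
function addPositionEmbeddings(tokens: number[][], posEmbeddings: number[][]): number[][] {
  return tokens.map((token, i) => token.map((v, d) => v + posEmbeddings[i][d]));
}
```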
Prepend the [CLS] token
ViT borrows a BERT trick: prepend a learnable [CLS] token whose job is to aggregate information from all patches. It has no pixels behind it—it's a free parameter vector the model learns to use. After self-attention runs, the CLS token's final embedding feeds a small MLP that outputs class logits.
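The prepend itself is trivially cheap. A sketch with illustrative names (`clsToken` stands for the learned D-dimensional free parameter):

```ts
// Turn N patch tokens into an (N + 1)-token sequence with [CLS] at position 0.
function prependCls(clsToken: number[], tokens: number[][]): number[][] {
  return [clsToken, ...tokens];
}
```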
Self-attention: patches listen to patches
Here's the payoff. Every token projects itself into a query, key, and value. Attention weights are a softmax of scaled query-key dot products, and each token's new value is a weighted sum of everybody's values:
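$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Here $Q$, $K$, $V$ stack every token's query, key, and value as rows, and $d_k$ is the key dimension; the $\sqrt{d_k}$ scaling keeps the dot products in a range where the softmax doesn't saturate.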
Live attention heatmap — on real MobileNet features
Click any patch below. We use its real 1280-dim MobileNet feature vector as the query $q$, every other patch's feature as a key $k$, and compute the softmax of $q^\top k$ over all patches. The result is a real learned-feature attention map, drawn back on your photo. It's not what a trained ViT would produce exactly (it's CNN features, not ViT Q/K projections), but every number is a real dot product over real features—not a stand-in.
Top 5 attention weights
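Concretely, the heatmap is nothing more than the following computation (a sketch with illustrative names, assuming `features` is the 7×7 MobileNet grid flattened to a [49][1280] array):

```ts
// One row of the attention matrix: softmax over the query patch's dot products
// with every patch's feature vector (computed stably by subtracting the max).
function attentionMap(features: number[][], queryIndex: number): number[] {
  const q = features[queryIndex];
  const scores = features.map(k => k.reduce((sum, v, d) => sum + v * q[d], 0));
  const max = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / total); // 49 weights summing to 1, painted back as the heatmap
}
```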
The full recipe
One encoder block of a ViT is: LayerNorm → self-attention → residual add → LayerNorm → MLP → residual add. ViT-Base stacks 12 of them. The final CLS token goes through a one-hidden-layer MLP head to produce class logits.
| Stage | What it does | For our photo |
|---|---|---|
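A minimal sketch of that block wiring (pre-norm, with LayerNorm applied before each sub-module), where the three learned sub-modules are passed in as plain functions; all names here are illustrative:

```ts
type Tokens = number[][]; // [numTokens][hiddenDim]

// One pre-norm encoder block: normalize, transform, then add the residual back.
function encoderBlock(
  x: Tokens,
  layerNorm1: (t: Tokens) => Tokens,
  selfAttention: (t: Tokens) => Tokens,
  layerNorm2: (t: Tokens) => Tokens,
  mlp: (t: Tokens) => Tokens,
): Tokens {
  const afterAttention = add(x, selfAttention(layerNorm1(x)));
  return add(afterAttention, mlp(layerNorm2(afterAttention)));
}

function add(a: Tokens, b: Tokens): Tokens {
  return a.map((row, i) => row.map((v, j) => v + b[i][j]));
}
```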
Sequence length fed to the Transformer = 1 CLS + N patch tokens. Every token attends to every other; that's $N^2$ pairwise comparisons per layer.
Compute comparison across patch sizes
| Patch size | Grid | Tokens (incl. CLS) | Attention ops (rel.) |
|---|---|---|---|
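The table's arithmetic, for a 224×224 input, amounts to this (illustrative names; the patch-size-32 baseline is just a choice of reference for the ratios):

```ts
// Token count and all-pairs attention cost (proportional to tokens squared).
function attentionCost(imageSize: number, patchSize: number) {
  const side = imageSize / patchSize;
  const tokens = side * side + 1; // + 1 for the [CLS] token
  return { grid: `${side}x${side}`, tokens, pairwise: tokens * tokens };
}

const base = attentionCost(224, 32).pairwise;
for (const p of [32, 16, 8]) {
  const { grid, tokens, pairwise } = attentionCost(224, p);
  console.log(`patch ${p}: ${grid} grid, ${tokens} tokens, ${(pairwise / base).toFixed(1)}x attention`);
}
```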
Four things people get wrong about ViTs
"ViTs have no inductive bias for images."
They have less than a CNN, but far from none. The patch embedding is literally a stride-$P$ convolution. Position embeddings encode the grid. And self-attention itself is a smooth, learnable prior against gibberish permutations.
"They need hundreds of millions of images."
The original ViT did, because of its weak inductive bias. Later variants (DeiT, DINO, Swin, MAE pre-training) train on ImageNet-1k alone and match or beat CNNs. Data hunger was a recipe choice, not a fundamental limit.
"Attention weights tell you what the model sees."
Tempting, but noisy. Attention is one signal among many in a deep residual stack. The same prediction can be supported by many different attention patterns, and gradient attributions frequently disagree. Treat heatmaps as suggestive, not evidence.
"Smaller patches are always better."
Halving the patch size quadruples the token count, and because attention cost scales with the square of the token count, that means roughly sixteen times the attention cost. Hybrid models (Swin, MaxViT) use local attention windows to keep the spatial precision without the $N^2$ bill.