Tiny example · flatten one 16×16×3 patch into a 768-vector, multiply by the learned projection matrix, and you have that patch's input embedding (a sketch of the arithmetic follows).
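A minimal sketch of that step in PyTorch, with random placeholder values standing in for real pixels and trained weights:

import torch

patch = torch.randn(16, 16, 3)     # one 16×16 RGB patch
flat = patch.reshape(-1)           # flatten → 768 values
W = torch.randn(768, 768)          # learned projection (random placeholder here)
b = torch.randn(768)
embedding = flat @ W + b           # shape (768,): this patch's input embedding

The nn.Conv2d with kernel_size=patch and stride=patch in the ViT code below computes exactly this for every patch in one call (up to how the 768 input values are ordered).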
| | CNN | ViT |
|---|---|---|
| Receptive field growth | local → global slowly | global from layer 1 |
| Weight sharing | spatial | none (each position has own attention) |
| Translation equivariance | baked in | learned (if at all) |
| Data efficiency (small data) | strong | weak |
| Data efficiency (large data) | plateaus | keeps improving |
CNNs encode vision priors; ViTs learn them. With small data, priors win. With massive data (300M+), learning wins.
| Model | Patch | Params | Notes |
|---|---|---|---|
| ViT-B/16 | 16×16 | 86M | "Base" — most common |
| ViT-L/14 | 14×14 | 307M | Used by CLIP |
| ViT-H/14 | 14×14 | 632M | SOTA around 2021 |
| ViT-g (DINOv2) | 14×14 | 1.1B | 2023 general-purpose vision |
Swin Transformer (Liu et al. 2021) adds hierarchical windowing — bridges ViT and CNN. Popular in practice.
import torch
import torch.nn as nn

class ViT(nn.Module):
    def __init__(self, image_size=224, patch=16, dim=768, depth=12, heads=12, classes=1000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        n_patches = (image_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.randn(1, n_patches + 1, dim))
        self.blocks = nn.ModuleList([TransformerBlock(dim, heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        for b in self.blocks:
            x = b(x)
        return self.head(self.norm(x[:, 0]))  # CLS → class logits
Literally a Transformer from L13 with a patch-embed preamble. No vision-specific ops.
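The code assumes the TransformerBlock from L13. A minimal pre-norm sketch that makes the class above runnable end to end (not necessarily identical to the L13 version):

class TransformerBlock(nn.Module):
    def __init__(self, dim, heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # self-attention + residual
        x = x + self.mlp(self.norm2(x))                    # MLP + residual
        return x

model = ViT()
logits = model(torch.randn(2, 3, 224, 224))  # → shape (2, 1000)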
The paper that launched zero-shot vision
Train on (image, caption) pairs scraped from the web. Push matching pairs close in embedding space, push mismatched pairs far apart.
Build a shared space where images and captions live together. Once it exists, you can classify, retrieve, and search images using nothing but text.
The training objective is InfoNCE (L17), extended across two modalities instead of two augmentations.
Tiny batch worked example · embed a handful of (image, caption) pairs and compute the similarity matrix (image i vs. caption j). Row 1 of that matrix is the logits for image 1; softmax it; the loss for row 1 is the cross-entropy with index 0 (its own caption) as the correct answer.
If an off-diagonal entry were higher than the diagonal, the loss would spike immediately. CLIP just runs this over ~32k images simultaneously.
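A sketch of that computation over a tiny batch, with random vectors standing in for real encoder outputs and an illustrative temperature (CLIP learns its own):

import torch
import torch.nn.functional as F

B, D = 4, 512
img = F.normalize(torch.randn(B, D), dim=-1)   # image embeddings (unit length)
txt = F.normalize(torch.randn(B, D), dim=-1)   # caption embeddings (unit length)

temperature = 0.07
logits = img @ txt.T / temperature             # (B, B) similarity matrix
labels = torch.arange(B)                       # pair i matches caption i

loss_i2t = F.cross_entropy(logits, labels)     # image → caption direction
loss_t2i = F.cross_entropy(logits.T, labels)   # caption → image direction
loss = (loss_i2t + loss_t2i) / 2               # symmetric InfoNCE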
import torch
import clip

model, preprocess = clip.load("ViT-L/14")
image = preprocess(pil_image).unsqueeze(0)   # pil_image: any PIL.Image loaded elsewhere
texts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_tokens = clip.tokenize(texts)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_embs = model.encode_text(text_tokens)

# Normalize, then cosine similarity → pick the closest text
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_embs = txt_embs / txt_embs.norm(dim=-1, keepdim=True)
sims = (100.0 * img_emb @ txt_embs.T).softmax(dim=-1)  # 100 ≈ CLIP's learned logit scale
# sims ≈ [[0.89, 0.08, 0.03]] → "cat"
No training on cats. CLIP has never seen an ImageNet label. Yet, zero-shot, it matches a fully supervised ResNet-50 on ImageNet classification.
Image · a photo of a tabby cat. CLIP image encoder produces a 768-dim vector (after L2-normalize · unit length).
Build text prompts · "a photo of a {cat / dog / car}". Run through CLIP text encoder.
| class | text emb | cos · img emb | softmax |
|---|---|---|---|
| cat | t_cat | 0.32 | 0.89 |
| dog | t_dog | 0.19 | 0.08 |
| car | t_car | 0.04 | 0.03 |
Pick argmax · "cat" wins with 89% confidence. Total inference cost · 1 image-encoder call + 3 text-encoder calls. No fine-tuning. Add 100 classes · still 100 text-encoder calls (a few seconds, total) and you're done.
CLIP is the default general-purpose vision-language model in 2026. For retrieval, search, zero-shot classification, and content moderation, CLIP features are the first thing to try.
Zero-shot CLIP is sensitive to how you phrase the class label. Changing the prompt template affects accuracy by ~10% absolute on ImageNet.
| Template | ImageNet zero-shot |
|---|---|
"{label}" (bare word) |
63% |
"a photo of a {label}" |
68% |
"a photo of a {label}, a type of pet" (for pets dataset) |
72% |
The model's "understanding" of a class is mediated by the caption distribution it was trained on. If most cat images were captioned "a photo of a cat", matching that template works best. Prompt-engineering for vision is not a trick — it's matching the training distribution.
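One practical consequence, used in the CLIP paper, is prompt ensembling: embed several templates per class and average the normalized text embeddings. A sketch, reusing model, clip, and img_emb from the snippet above (the template list is illustrative):

templates = ["a photo of a {}", "a blurry photo of a {}", "a drawing of a {}"]
classes = ["cat", "dog", "car"]

with torch.no_grad():
    class_embs = []
    for label in classes:
        toks = clip.tokenize([t.format(label) for t in templates])
        embs = model.encode_text(toks)
        embs = embs / embs.norm(dim=-1, keepdim=True)
        class_embs.append(embs.mean(dim=0))                 # average over templates
    class_embs = torch.stack(class_embs)
    class_embs = class_embs / class_embs.norm(dim=-1, keepdim=True)

sims = (100.0 * img_emb @ class_embs.T).softmax(dim=-1)     # same zero-shot recipe as before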
Project image features into token space
image → ViT-L/14 (CLIP encoder) → 256 image tokens (frozen)
↓
Linear projection
↓
"Image tokens" in LLM space
↓
Llama 2 / Vicuna (autoregressive)
↓
text response
LLaVA recipe (Liu et al. 2023):
1. Encode the image with a frozen CLIP ViT-L/14 → 256 image tokens.
2. Project each one into the LLM's embedding space with a linear layer.
3. Concatenate [image_tokens, text_tokens] and feed to an LLM.
4. Fine-tune on (image, question, answer) triples.

Surprisingly good. The LLM brings reasoning; CLIP brings vision understanding; the projection layer glues them together.
Two experts speak different languages.
We need a translator · a linear map. CLIP hands over patch features of shape [256, 1024]; the LLM expects token embeddings of width 4096. The translator is a learned matrix W of shape [1024, 4096] plus a bias of shape [4096]. Per patch: llm_token = patch @ W + b. Shape check · [1, 1024] @ [1024, 4096] → [1, 4096] ✓.
Do this for all 256 patches → 256 vectors that "look like" tokens to the LLM. Prepend them to the user's text and let the LLM process the combined sequence as usual.
Stage 1 · freeze CLIP and the LLM, train only the projection.
Stage 2 · unfreeze the LLM, fine-tune on instruction data.
New params · ~10M in the projection. A 7B LLM becomes multimodal with < 0.15% extra weights.
Worked example · take one CLIP patch vector, compute llm_token = patch @ W + b, and you have a 4096-dim vector sitting in the LLM's embedding space.
The LLM treats this just like the embedding for the word "cat".
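A sketch of the projection plus concatenation, with random tensors in place of real CLIP features and prompt embeddings (shapes follow the setup above):

import torch
import torch.nn as nn

clip_patches = torch.randn(1, 256, 1024)   # frozen CLIP ViT-L/14 patch features
text_embeds = torch.randn(1, 32, 4096)     # embedded user prompt (from the LLM's embedding table)

proj = nn.Linear(1024, 4096)               # the only new, trainable weights
image_tokens = proj(clip_patches)          # (1, 256, 4096): now "look like" LLM tokens

llm_input = torch.cat([image_tokens, text_embeds], dim=1)  # (1, 288, 4096) → fed to the LLM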
Flamingo (Alayrac et al. 2022) · an alternative approach: instead of projecting image features into the LLM's token space, keep them separate and inject them via cross-attention (more below).
Benefit: the LLM never "sees" images as tokens; it queries them via cross-attention when it needs to.
Both approaches work. LLaVA is simpler and became dominant in open source; Flamingo-style cross-attention persists in frontier labs (Anthropic's vision, Google's Gemini variants).
The frontier
An analogy · you want a dog that understands both verbal and hand commands. You can raise it on both from day one (native multimodal) or teach hand signals to a dog that already knows voice commands (bolt-on, like LLaVA).
The 2026 frontier leans native. Bolt-on dominates open source · it's the only feasible approach when you can't pretrain a 100B+ model from scratch.
Instead of converting the image into LLM-space tokens, Flamingo keeps vision features separate and injects them via cross-attention layers inserted between LLM blocks.
Benefit · LLM never "sees" pixels as tokens; it asks for image info when it needs it. Drawback · more complex architecture, more params to train. LLaVA's simpler "concat tokens" won on ease; Flamingo-style persists in frontier labs.
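A minimal sketch of one injected layer: a plain cross-attention block without Flamingo's tanh gating or Perceiver resampler, so a simplification rather than the real architecture (it assumes the image features are already projected to the LLM width):

import torch
import torch.nn as nn

class VisionCrossAttention(nn.Module):
    def __init__(self, dim=4096, heads=32):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_hidden, image_feats):
        # queries come from the LLM's hidden states; keys/values from vision features
        q = self.norm(text_hidden)
        attended, _ = self.attn(q, image_feats, image_feats)
        return text_hidden + attended  # residual: the text stream, enriched with visual info

layer = VisionCrossAttention()
text_hidden = torch.randn(1, 32, 4096)    # hidden states from one LLM block
image_feats = torch.randn(1, 256, 4096)   # vision features, already projected to LLM width
out = layer(text_hidden, image_feats)     # same shape as text_hidden → next LLM block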
| Model | What it sees | What it does |
|---|---|---|
| GPT-4V / GPT-5 | image, video | general reasoning + tool use |
| Claude 4 (Anthropic) | image, PDF, video frames | general reasoning + computer use |
| Gemini 2 Ultra (Google) | image, audio, video | natively multimodal from pretraining |
| LLaVA / Qwen-VL (open) | image | open-source equivalent of GPT-4V |
Key trend: the input side is increasingly "anything", the output is still mostly text. True any-to-any (text→image→video→audio) is the 2026+ frontier.
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-4",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "url", "url": "https://..."}},
                {"type": "text", "text": "What does this plot suggest?"}
            ]
        }
    ]
)
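To read the reply (the response carries a list of content blocks; the text lives in .text):

print(response.content[0].text)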
A few lines in; a paragraph of reasoning out. The API abstracts away all the CLIP + projection + LLM machinery.
| Benchmark | Tests | Example task |
|---|---|---|
| VQAv2 | general visual QA | "what color is the car?" |
| GQA | compositional reasoning | "is the animal to the left of the tree brown?" |
| MMMU | multi-discipline expert | college-level chemistry diagram |
| DocVQA | document understanding | read text from scanned form |
| ChartQA | chart reading | "what was Q3 revenue?" |
| AI2D | diagram reasoning | "which step comes after X?" |
Claude / GPT-4o / Gemini all push 85%+ on VQAv2 now. MMMU remains hard (50-60%) · genuine multi-modal reasoning is the frontier.
The LLM has seen "a cup of coffee on a desk" millions of times in text. The vision encoder gives it a blurry image of a desk.
If the visual signal is weak, the language prior can override it · the model adds a coffee cup that isn't there.
This is a battle between prior (text statistics from training) and evidence (current visual input). When evidence is strong (clear photo), priors don't matter much. When evidence is weak, they take over and the model "fills in" things confidently.
Vision-language models sometimes describe things that aren't in the image:
Image · a cat on a mat.
Prompt · "describe the room in detail."
Output · "... next to a blue vase on the wooden table ..."
Cause · the language model has a strong prior from text pretraining ("rooms have vases") that overrides weak visual signal. Fixes · stronger vision encoder, more VQA training data, CFG-style text-image alignment.
GPT-4o and Claude 3.5+ dramatically reduced hallucination via more curated training and RLHF specifically on vision tasks. Still not solved.
The agentic side (Claude computer use, OpenAI's Operator) is where multimodal is most valuable in 2026.