Vision-Language Models

Lecture 18 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. Turn an image into tokens (ViT patch embedding).
  2. Explain CLIP contrastive training over image-text pairs.
  3. Use CLIP for zero-shot classification.
  4. Describe LLaVA's linear-projection bridge to an LLM.
  5. Contrast native vs bolt-on multimodal architectures.
  6. Diagnose VLM hallucination (language-prior override).

Where we are

  • CNNs (L7–L9) · vision-specific inductive bias.
  • Transformers (L13–L14) · originally for text.
  • Self-supervised (L17) · contrastive and masked pretraining.

Today: what if one model handled both text AND vision?

Today maps to Prince Ch 12 §12.5 (ViT) + the CLIP, LLaVA, Flamingo papers. This is where the full "multimodal" LLM came from.

Four questions:

  1. How do Transformers process images (no convolutions)?
  2. What did CLIP unlock?
  3. How does LLaVA give an LLM eyes?
  4. What's the 2026 multimodal state?

PART 1

Vision Transformer (ViT)

Apply Transformer to images — no convolutions

The 2020 bet

Dosovitskiy et al. 2020 · "An Image is Worth 16×16 Words" — split image into patches, treat them as tokens, apply vanilla Transformer. Drop convolutions entirely.

Controversial at the time — CNNs had reigned for 8 years. The bet: if you have enough data and compute, the right architecture is the one with the fewest inductive biases.

It worked. ViT-Huge pretrained on 300M images beat CNN SOTA on ImageNet by 2021.

Why throw away CNNs · the bet of 2020

CNNs have vision baked in · they are forced to look at local pixels first, then build up to objects.

The 2020 bet · what if we hand a generic Transformer the whole image at once and let it figure out what's important?

The wager · with enough data, a model that has fewer prior assumptions but more capacity will outlearn a hand-engineered architecture. By 2021 (ViT-Huge pretrained on 300M images) the bet paid off · ViT surpassed the best ResNet-based models on ImageNet.

How can a Transformer "read" an image?

Analogy · describing a photo over the phone. You don't list every pixel. You break it into chunks: "top-left, blue sky · below that, a green tree…" We teach the model to do the same · chop the image into a grid of patches ("image words").

From pixels to tokens · with a tiny example

A 4×4 grayscale image, 2×2 patches. We get a 2×2 grid of patches · 4 patches in total.

Step 1 · patchify. Patch 1 (top-left) = the 2×2 block of pixel values.
Step 2 · flatten. Patch 1 → a vector x with 4 entries.
Step 3 · linear-project. Multiply by a learned W of size 4×3 (for output dim 3) → a 3-d embedding xW.
Step 4 · prepend [CLS], add position embeddings. Sequence: [CLS, p1, p2, p3, p4] + [pe_0, pe_1, …].

For real 224×224 RGB with 16×16 patches:

  • Each patch · 16×16×3 = 768 numbers.
  • Project to D = 768 (ViT-Base width). Patches per image · (224/16)² = 196.
  • Final sequence · 197 tokens (196 patches + [CLS]). Standard Transformer from here.

The patchify step is conceptually a strided conv with kernel = stride = patch size.

Worked numeric · patch embedding

Tiny example. Flattened patch x = (x_1, x_2, x_3, x_4) · the top-left 2×2 patch from above.

Learned W of size 4×3 (one weight column per output dimension).

Compute z = xW:

  • Component 1 · z_1 = x_1 W_11 + x_2 W_21 + x_3 W_31 + x_4 W_41
  • Component 2 · z_2 = x_1 W_12 + x_2 W_22 + x_3 W_32 + x_4 W_42
  • Component 3 · z_3 = x_1 W_13 + x_2 W_23 + x_3 W_33 + x_4 W_43

z = (z_1, z_2, z_3) — the first "token" the Transformer sees, representing the top-left of the image.
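
A minimal sketch of this projection in PyTorch, with illustrative values (not the lecture's numbers):

import torch

x = torch.tensor([0.1, 0.8, 0.3, 0.5])   # flattened 2x2 patch, 4 values (made up)
W = torch.randn(4, 3)                     # learned projection; here just random
z = x @ W                                 # the 3-d patch embedding
print(z.shape)                            # torch.Size([3])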

From image to sequence · picture

How ViT works

ViT vs CNN · inductive biases

                               CNN                      ViT
Receptive field growth         local → global slowly    global from layer 1
Weight sharing                 spatial                  none (each position has own attention)
Translation equivariance       baked in                 learned (if at all)
Data efficiency (small data)   strong                   weak
Data efficiency (large data)   plateaus                 keeps improving

CNNs encode vision priors; ViTs learn them. With small data, priors win. With massive data (300M+), learning wins.

ViT variants you'll encounter

Model Patch Params Notes
ViT-B/16 16×16 86M "Base" — most common
ViT-L/14 14×14 307M Used by CLIP
ViT-H/14 14×14 632M SOTA around 2021
ViT-g (DINOv2) 14×14 1.1B 2023 general-purpose vision

Swin Transformer (Liu et al. 2021) adds hierarchical windowing — bridges ViT and CNN. Popular in practice.

ViT in PyTorch · 30 lines

import torch
import torch.nn as nn

class ViT(nn.Module):
    def __init__(self, image_size=224, patch=16, dim=768, depth=12, heads=12, classes=1000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patchify
        n_patches = (image_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.randn(1, n_patches + 1, dim))
        self.blocks = nn.ModuleList([TransformerBlock(dim, heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)      # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        for b in self.blocks: x = b(x)
        return self.head(self.norm(x[:, 0]))                    # CLS → class logits

Literally a Transformer from L13 with a patch-embed preamble. No vision-specific ops.
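
TransformerBlock above is the standard pre-norm block from L13; a minimal version for reference, assuming the usual self-attention + MLP layout:

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim, heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn  = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp   = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                   nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # self-attention + residual
        return x + self.mlp(self.norm2(x))                   # MLP + residual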

PART 2

CLIP · contrastive image-text pretraining

The paper that launched zero-shot vision

CLIP · training as contrast matrix

CLIP · dual encoder

CLIP · the core idea

Train on (image, caption) pairs scraped from the web. Push matching pairs close in embedding space, push mismatched pairs far apart.

Build a shared space where images and captions live together. Once it exists, you can:

  • Retrieve images from a text query (nearest caption embedding).
  • Classify zero-shot · pick the closest caption template to the image.
  • Guide generation · diffusion models condition on CLIP text features.

The training objective is InfoNCE (L17), extended across two modalities instead of two augmentations.

CLIP · the matching-game derivation

Tiny batch · N image-text pairs (I_1, T_1), …, (I_N, T_N).

  1. Step 1 · embed. Encode each image and text → unit vectors u_i = f_img(I_i) and v_i = f_txt(T_i).
  2. Step 2 · similarity matrix. S_ij = u_i · v_j.

    Diagonal = correct matches → maximize. Off-diagonal → minimize.
  3. Step 3 · scale by temperature. logits = S / τ.
  4. Step 4 · row-wise CE. Each row of logits → softmax → CE against the diagonal label. (Image i should match text i.)
  5. Step 5 · column-wise CE. Same for columns (text j should match image j).
  6. Final loss. L = (L_row + L_col) / 2. (A PyTorch sketch follows the bullets below.)
  • Image encoder · ViT-L/14 (or ResNet-50 for small).
  • Text encoder · 12-layer Transformer.
  • Data · 400M image-text pairs from the web. Batch size 32,768.
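
A minimal sketch of this symmetric loss, assuming img_emb and txt_emb are already unit-normalized (N, D) tensors:

import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, tau=0.07):
    logits = img_emb @ txt_emb.T / tau             # (N, N) similarities / temperature
    labels = torch.arange(len(logits))             # diagonal = correct pairs
    loss_row = F.cross_entropy(logits, labels)     # image → text
    loss_col = F.cross_entropy(logits.T, labels)   # text → image
    return (loss_row + loss_col) / 2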

Worked numeric · CLIP loss for one row

Take N = 3 pairs. Normalized image embeddings u_1, u_2, u_3 and text embeddings v_1, v_2, v_3:

  • u_1 · v_1 is high (good match)
  • u_2 · v_2 is high (good match)
  • the off-diagonal similarities u_i · v_j (i ≠ j) are low.

Similarity matrix. S_ij = u_i · v_j, a 3×3 matrix with a strong diagonal.

Logits row 1. [S_11, S_12, S_13] / τ.

Softmax. Exponentiate each logit, then normalize → probabilities (p_1, p_2, p_3) with p_1 close to 1.

Loss row 1 (correct = index 0). L_1 = −log p_1 — very low.

If the off-diagonal were higher, the loss spikes immediately. CLIP just runs this over 32k images simultaneously.
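
Plugging in illustrative numbers (made up, not the lecture's) for row 1:

import torch
import torch.nn.functional as F

sims_row1 = torch.tensor([0.9, 0.1, 0.05])   # u_1·v_1, u_1·v_2, u_1·v_3 (illustrative)
logits = sims_row1 / 0.07                     # divide by temperature τ
loss_row1 = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
print(loss_row1.item())                       # ≈ 1.6e-5 · the correct pair dominates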

CLIP's killer feature · zero-shot

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

image = preprocess(pil_image).unsqueeze(0).to(device)
texts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    img_emb  = model.encode_image(image)
    txt_embs = model.encode_text(text_tokens)

# Normalize, then cosine similarity (scaled by CLIP's logit scale) → pick the closest text
img_emb  = img_emb  / img_emb.norm(dim=-1, keepdim=True)
txt_embs = txt_embs / txt_embs.norm(dim=-1, keepdim=True)
sims = (100.0 * img_emb @ txt_embs.T).softmax(dim=-1)
# sims ≈ [[0.89, 0.08, 0.03]]  → "cat"

No ImageNet training. CLIP has never seen an ImageNet label. Yet zero-shot it matches a fully supervised ResNet-50 on ImageNet classification.

Worked example · zero-shot a single image

Image · a photo of a tabby cat. CLIP image encoder produces a 768-dim vector (after L2-normalize · unit length).

Build text prompts · "a photo of a {cat / dog / car}". Run through CLIP text encoder.

class   text emb   cos(text emb, img emb)   softmax
cat     t_cat      0.32                     0.89
dog     t_dog      0.19                     0.08
car     t_car      0.04                     0.03

Pick argmax · "cat" wins with 89% confidence. Total inference cost · 1 image-encoder call + 3 text-encoder calls. No fine-tuning. Add 100 classes · still 100 text-encoder calls (a few seconds, total) and you're done.

Why CLIP mattered

  1. Universal image representation — one encoder works on any domain.
  2. Prompt engineering for vision — "a photo of X" vs "a sketch of X" vs "a satellite image of X" shifts behavior without retraining.
  3. Foundation for everything that came next — Stable Diffusion uses CLIP's text encoder; DALL-E 2 and LLaVA build directly on CLIP embeddings; Flamingo trains its vision encoder with the same contrastive recipe.

CLIP is the default general-purpose vision-language model in 2026. For retrieval, search, zero-shot classification, content moderation, CLIP features are the first thing you try.

Prompt engineering · for vision

Zero-shot CLIP is sensitive to how you phrase the class label. Changing the prompt template shifts ImageNet accuracy by around five points absolute (see the table below).

Template ImageNet zero-shot
"{label}" (bare word) 63%
"a photo of a {label}" 68%
"a photo of a {label}, a type of pet" (for pets dataset) 72%

The model's "understanding" of a class is mediated by the caption distribution it was trained on. If most cat images were captioned "a photo of a cat", matching that template works best. Prompt-engineering for vision is not a trick — it's matching the training distribution.
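
A small sweep you can run with the CLIP snippet from earlier (it reuses model, image, clip, and torch from above; the templates are just examples):

templates = ["{}", "a photo of a {}", "a photo of a {}, a type of pet"]
labels = ["cat", "dog", "car"]

with torch.no_grad():
    img = model.encode_image(image)
    img = img / img.norm(dim=-1, keepdim=True)
    for t in templates:
        txt = model.encode_text(clip.tokenize([t.format(l) for l in labels]).to(image.device))
        txt = txt / txt.norm(dim=-1, keepdim=True)
        probs = (100.0 * img @ txt.T).softmax(dim=-1)
        print(f"{t!r:45s} → {labels[probs.argmax().item()]} ({probs.max().item():.2f})")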

PART 3

LLaVA · give an LLM eyes

Project image features into token space

LLaVA · full stack

LLaVA · the architecture

image → ViT-L/14 (CLIP encoder) → 256 image tokens (frozen)
                                              ↓
                                     Linear projection
                                              ↓
                                       "Image tokens" in LLM space
                                              ↓
                          Llama 2 / Vicuna (autoregressive)
                                              ↓
                                       text response

LLaVA recipe (Liu et al. 2023):

  1. Pretrained CLIP ViT-L extracts 256 image features.
  2. A simple projection (a single linear layer in LLaVA-1, a small MLP in LLaVA-1.5) maps them into the LLM's token embedding space.
  3. Concatenate [image_tokens, text_tokens] and feed to an LLM.
  4. Fine-tune with instruction data: (image, question, answer) triples.

Surprisingly good. The LLM brings reasoning; CLIP brings vision understanding; the projection layer glues them.

LLaVA · the translator analogy

Two experts speak different languages.

  • CLIP understands images (vector language · 1024-d patch features).
  • Llama understands text (vector language · 4096-d token embeddings).

We need a translator · a linear map W that converts an image vector into something Llama can read. Both spaces already encode similar concepts (puppy near dog, etc.) — the translator only needs to rotate and scale.

LLaVA · the linear projection, with shapes

  1. CLIP output · 256 patch embeddings, each 1024-d. Tensor [256, 1024].
  2. LLM expects · token embeddings of dim 4096.
  3. Bridge · linear layer W of shape [1024, 4096] plus bias [4096].

Per patch: llm_token = patch @ W + b. Shape check · [1, 1024] @ [1024, 4096] → [1, 4096] ✓.

Do this for all 256 patches → 256 vectors that "look like" tokens to the LLM. Prepend them to the user's text tokens and run the LLM as usual. Training happens in two stages:

Stage 1 · freeze CLIP and LLM, train only W and b, on caption data.
Stage 2 · unfreeze the LLM, fine-tune on instruction data.

New params · ~10M in the projection. A 7B LLM becomes multimodal with < 0.15% extra weights.

Worked numeric · the projection

CLIP patch · a 3-d vector x (3-d for clarity). LLM expects 4-d tokens, so here W has shape [3, 4] and b has shape [4].

Compute llm_token = xW + b · a 4-d vector.

The LLM treats this 4-d vector just like the embedding for the word "cat".
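
The same step in PyTorch, with toy shapes matching this example (the patch values are made up):

import torch
import torch.nn as nn

patch = torch.tensor([[0.2, -1.1, 0.7]])   # (1, 3) toy CLIP patch embedding
proj  = nn.Linear(3, 4)                     # learned projection, 3 → 4 dims (W plus bias b)
llm_token = proj(patch)                     # (1, 4) · looks like a text token to the LLM
print(llm_token.shape)                      # torch.Size([1, 4])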

Flamingo · cross-attention bridge

Alayrac et al. 2022 · an alternative approach:

  • Keep the LLM frozen.
  • Insert cross-attention layers between LLM blocks that attend to image features.
  • Use a Perceiver Resampler — a learned set of query tokens that distill image features.

Benefit: the LLM never "sees" images as tokens; it queries them via cross-attention when it needs to.

Both approaches work. LLaVA is simpler and became dominant in open-source; Flamingo-style cross-attention persists in frontier labs (Anthropic's vision, Google's Gemini variants).

PART 4

Multimodal LLMs in 2026

The frontier

Native-multimodal vs bolt-on · adopt or raise?

You want a dog that understands verbal and hand commands.

  • Bolt-on (LLaVA) · adopt a brilliant adult dog (LLM) that knows verbal commands. Hire a translator (projection) to whisper hand-signal meanings. Cheap, fast — but the dog's brain never natively saw signals.
  • Native (Gemini, GPT-4o) · raise a puppy from birth using both. Brain processes them as fundamental, intertwined inputs. Deeper understanding — but you must pretrain from scratch on multimodal data.

Native vs bolt-on · trade-offs

Bolt-on (LLaVA, Flamingo)

  • Vision tower stays "foreign" to the LLM.
  • Often weaker on tight text-vision interaction.
  • Cheap — only train the bridge.

Native (Gemini, GPT-4o)

  • Unified tokenization across modalities.
  • Stronger multi-modal reasoning.
  • Expensive — pretrain from zero on interleaved data.

2026 frontier leans native. Bolt-on dominates open-source · only feasible approach when you can't pretrain a 100B+ model from scratch.

Cross-attention · Flamingo's design

Instead of converting the image into LLM-space tokens, Flamingo keeps vision features separate and injects them via cross-attention layers inserted between LLM blocks.

  • Visible to LLM · text tokens (normal self-attention).
  • Every few LLM blocks · an added cross-attention layer that queries image features.
  • Image features condensed via a Perceiver Resampler · learned query tokens distill arbitrary-resolution images into a fixed set.

Benefit · LLM never "sees" pixels as tokens; it asks for image info when it needs it. Drawback · more complex architecture, more params to train. LLaVA's simpler "concat tokens" won on ease; Flamingo-style persists in frontier labs.
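
A rough sketch of the gated cross-attention idea (a simplification, not DeepMind's code) · text tokens act as queries over image features already projected to the LLM width, and a tanh gate initialized at zero lets the frozen LLM start out unchanged:

import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))        # tanh(0) = 0 → gate starts closed

    def forward(self, text_tokens, image_feats):
        # text tokens query the image features (e.g. Perceiver Resampler output)
        attended, _ = self.attn(text_tokens, image_feats, image_feats)
        return text_tokens + torch.tanh(self.gate) * attended   # gated residual into the LLM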

2026 multimodal state

Model What it sees What it does
GPT-4V / GPT-5 image, video general reasoning + tool use
Claude 4 (Anthropic) image, PDF, video frames general reasoning + computer use
Gemini 2 Ultra (Google) image, audio, video natively multimodal from pretraining
LLaVA / Qwen-VL (open) image open-source equivalent of GPT-4V

Key trend: the input side is increasingly "anything", the output is still mostly text. True any-to-any (text→image→video→audio) is the 2026+ frontier.

Practical multimodal prompting

from anthropic import Anthropic
client = Anthropic()

response = client.messages.create(
    model="claude-4",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "url", "url": "https://..."}},
                {"type": "text", "text": "What does this plot suggest?"}
            ]
        }
    ]
)

A few lines of code in; a paragraph of reasoning out. The API abstracts away all the vision-encoder + projection + LLM machinery.

VLM benchmarks · what we measure

Benchmark Tests Example task
VQAv2 general visual QA "what color is the car?"
GQA compositional reasoning "is the animal to the left of the tree brown?"
MMMU multi-discipline expert college-level chemistry diagram
DocVQA document understanding read text from scanned form
ChartQA chart reading "what was Q3 revenue?"
AI2D diagram reasoning "which step comes after X?"

Claude / GPT-4o / Gemini all push 85%+ on VQAv2 now. MMMU remains hard (50-60%) · genuine multi-modal reasoning is the frontier.

Hallucination · priors fighting evidence

The LLM has seen "a cup of coffee on a desk" millions of times in text. The vision encoder gives it a blurry image of a desk.

If the visual signal is weak, the language prior can override it · the model adds a coffee cup that isn't there.

This is a battle between prior (text statistics from training) and evidence (current visual input). When evidence is strong (clear photo), priors don't matter much. When evidence is weak, they take over and the model "fills in" things confidently.

Why VLMs hallucinate

Vision-language models sometimes describe things that aren't in the image:

Image · a cat on a mat.
Prompt · "describe the room in detail."
Output · "... next to a blue vase on the wooden table ..."

Cause · the language model has a strong prior from text pretraining ("rooms have vases") that overrides weak visual signal. Fixes · stronger vision encoder, more VQA training data, CFG-style text-image alignment.

GPT-4o and Claude 3.5+ dramatically reduced hallucination via more curated training and RLHF specifically on vision tasks. Still not solved.

Emerging applications

Vision tasks, zero-shot

  • Segment Anything prompts
  • OCR without explicit training
  • Diagram understanding
  • Chart reading

Agentic loops

  • Computer use — click buttons from screenshots
  • Robotic manipulation — VLM → action tokens
  • Content moderation
  • Code + diagram refactoring

The agentic side (Claude computer use, GPT operator) is where multimodal is most valuable in 2026.

Lecture 18 — summary

  • ViT · patches → tokens → Transformer. No convolutions. Scales with data; CNNs plateau.
  • CLIP · dual encoder, contrastive on 400M image-text pairs · zero-shot vision via text prompts.
  • LLaVA · CLIP features → linear projection → LLM token space. Simple, effective.
  • Flamingo · cross-attention bridge with Perceiver Resampler. Alternate approach.
  • 2026 · every frontier LLM is multimodal (GPT-5, Claude 4, Gemini 2). Output side still text-dominated.

Read before Lecture 19

Prince Ch 17 · Variational Autoencoders.

Next lecture

Autoencoders & VAEs — compression, latent spaces, the reparameterization trick, ELBO.

Notebook 18 · 18-clip-zero-shot.ipynb — load pretrained CLIP; zero-shot classify custom images; sweep text prompts to see how wording changes the result.