Interactive Explainer
Optical Flow, Frame by Frame
Optical flow is the 2-D velocity field that says, for every pixel in a video frame, where it moved in the next frame. This page runs a real Lucas-Kanade solver and a real dense flow estimator in your browser, on real video. Step through a sports clip one frame at a time and watch the vectors fall out of the brightness-constancy equations.
Motion is a 2-D vector field
Play a video for a few frames and every pixel has a story: stationary background pixels don't move; a tennis ball travels a huge distance; a runner's arms and legs move differently from their torso. Collecting all those per-pixel displacements between two consecutive frames gives you a flow field. Each cell is a 2-D vector $(u, v)$ in pixels-per-frame. That's it—that's optical flow.
Historically, motion estimation drove the compression behind every video codec (MPEG, H.264, AV1), the tracking in camcorders, and the slo-mo in modern smartphones. Modern SLAM, 3-D reconstruction, and video diffusion models still lean on it.
Pick a clip
Four public-domain / CC-licensed sample clips.
Two frames, one question
Every optical-flow algorithm starts with the same setup: the current frame $I_t$ and the previous frame $I_{t-1}$. For each pixel at position $(x, y)$ in frame $t$, we ask: where was this pixel in frame $t-1$? Call the answer $(x - u, y - v)$. The pair $(u, v)$ is the flow at $(x, y)$.
Step through the clip and look at the before/after below. The stationary parts look almost identical; the moving parts are ghosted or duplicated. That's exactly the signal flow algorithms exploit.
The difference image $I_t - I_{t-1}$
Subtract frame $t-1$ from frame $t$ pixel-by-pixel. Static pixels cancel out and go grey; moving pixels leave a bright residual that traces out the motion.
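The subtraction itself is trivial; here is a minimal pure-Python sketch (not the page's actual code), assuming grayscale frames stored as 2-D lists of intensities:

```python
def frame_difference(prev, curr):
    """Per-pixel difference I_t - I_{t-1} for two grayscale frames.

    Static pixels cancel to 0; moving pixels leave a signed residual
    (negative where the object left, positive where it arrived).
    """
    return [[c - p for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]

# A bright 'object' moves one pixel to the right between frames:
prev = [[0, 9, 0, 0],
        [0, 9, 0, 0]]
curr = [[0, 0, 9, 0],
        [0, 0, 9, 0]]
diff = frame_difference(prev, curr)   # each row: [0, -9, 9, 0]
```

Mapping the signed residual to grey (zero) / dark / bright is exactly the visualisation above.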
The brightness constancy assumption
The foundation of (almost) every classical optical flow method is a very simple claim:
A pixel's brightness at $(x, y, t)$ equals its brightness at $(x + u, y + v, t + 1)$. That's the brightness constancy assumption. First-order Taylor-expand around $(x, y, t)$ and you get the optical flow constraint equation:

$$I_x u + I_y v + I_t = 0$$
Here $I_x$, $I_y$, $I_t$ are the spatial and temporal derivatives of the image. This is one equation in two unknowns $(u, v)$. That's the aperture problem: locally, you can only measure the component of motion along the intensity gradient; motion along an edge (perpendicular to the gradient) is invisible.
Lucas-Kanade: constrain the missing equation
Lucas and Kanade (1981) get around the aperture problem by assuming the flow is locally constant: every pixel in a small window $W$ around $(x, y)$ shares the same $(u, v)$. That turns the single OFCE at one pixel into an over-determined system of many OFCEs at neighbouring pixels, which you solve by least squares:

$$\begin{pmatrix} \sum_W I_x^2 & \sum_W I_x I_y \\ \sum_W I_x I_y & \sum_W I_y^2 \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = -\begin{pmatrix} \sum_W I_x I_t \\ \sum_W I_y I_t \end{pmatrix}$$
The $2 \times 2$ matrix $\sum \nabla I\, \nabla I^\top$ is the classic structure tensor. Its eigenvalues tell you whether the window has reliable flow: two big eigenvalues = a corner (well-conditioned, unique answer); one big, one small = an edge (aperture problem); two near-zero eigenvalues = a textureless region (flow is hopeless).
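For a symmetric $2 \times 2$ matrix the smaller eigenvalue has a closed form, which is all a Shi-Tomasi corner score needs. A sketch, assuming the window-summed tensor entries are already computed (function name is illustrative):

```python
import math

def min_eigenvalue(sxx, sxy, syy):
    """Smaller eigenvalue of the 2x2 structure tensor
    [[sxx, sxy], [sxy, syy]] -- the Shi-Tomasi corner score.

    Eigenvalues of a symmetric 2x2 matrix are
    trace/2 +/- sqrt((half-difference)^2 + off-diagonal^2).
    """
    half_trace = (sxx + syy) / 2.0
    radius = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return half_trace - radius

corner = min_eigenvalue(10.0, 0.0, 8.0)   # both eigenvalues big -> 8.0
edge = min_eigenvalue(10.0, 0.0, 0.0)     # one eigenvalue zero -> 0.0
```

Thresholding this score and keeping local maxima is one standard way to pick trackable points.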
Real Lucas-Kanade, live
Below, we pick out the strongest corner points in the current frame using a small Shi-Tomasi score (the min-eigenvalue of the structure tensor), solve Lucas-Kanade at each corner, and draw the flow vectors. The math is all in the browser: spatial gradients by Sobel, temporal gradient by frame difference, window-local least squares per corner.
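The per-corner solve can be sketched in a few lines of pure Python. This is a stand-in, not the page's code: central differences replace the Sobel kernels, frames are 2-D lists of grayscale intensities, and the $2 \times 2$ normal equations are solved by Cramer's rule:

```python
def lucas_kanade(prev, curr, cx, cy, half=2):
    """Solve the window-local least-squares LK system at (cx, cy).

    Gradients by central differences; temporal derivative by plain
    frame difference. Returns the flow (u, v) in pixels per frame.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # I_x
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # I_y
            it = curr[y][x] - prev[y][x]                  # I_t
            sxx += ix * ix; sxy += ix * iy; syy += iy * iy
            sxt += ix * it; syt += iy * it
    det = sxx * syy - sxy * sxy          # structure-tensor determinant
    if abs(det) < 1e-9:                  # edge or textureless window
        return 0.0, 0.0
    # Cramer's rule on the 2x2 normal equations (RHS is -[sxt, syt])
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v

# Synthetic check: a bilinear ramp I(x, y) = x*y shifted 0.3 px right.
W = 11
prev = [[float(x * y) for x in range(W)] for y in range(W)]
curr = [[(x - 0.3) * y for x in range(W)] for y in range(W)]
u, v = lucas_kanade(prev, curr, 5, 5)    # recovers u = 0.3, v = 0.0
```

On this bilinear test image the Taylor expansion is exact, so LK recovers the sub-pixel shift perfectly; on real frames the answer is only an approximation.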
Dense flow: one vector per pixel
Sparse LK on a handful of corners is great for tracking but leaves most of the image unexplained. Dense methods (Horn-Schunck 1981, Farnebäck 2003, RAFT 2020) produce one vector per pixel. Here we use a simple approach: run LK on a regular grid of positions across the image, yielding a dense-enough flow field to visualise.
The standard visualisation maps every flow vector to an HSV colour: hue = direction, saturation = 1, value = magnitude (clamped). Red for rightward motion, green for up, cyan for left, magenta for down. A still pixel is black.
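One way to implement that colour mapping, using the standard library's `colorsys` (a sketch; the exact hue convention varies between tools):

```python
import colorsys
import math

def flow_to_rgb(u, v, max_mag=10.0):
    """Map a flow vector to an RGB colour: hue = direction,
    saturation = 1, value = magnitude clamped to max_mag.

    Image y grows downward, so v is negated to put 'up' in the
    green part of the hue circle (red = right, cyan = left).
    """
    angle = math.atan2(-v, u)                    # direction in radians
    hue = (angle / (2 * math.pi)) % 1.0          # [0, 1) around the circle
    mag = min(math.hypot(u, v) / max_mag, 1.0)   # clamped magnitude
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, mag)
    return int(r * 255), int(g * 255), int(b * 255)

right = flow_to_rgb(10, 0)    # full-speed rightward -> (255, 0, 0), red
still = flow_to_rgb(0, 0)     # still pixel -> (0, 0, 0), black
left = flow_to_rgb(-10, 0)    # full-speed leftward -> (0, 255, 255), cyan
```

Clamping the magnitude matters: without it, one fast pixel washes out the value scale for the whole frame.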
Why classical methods fail on big motion
Taylor expansion around the current pixel is only accurate for small displacements, typically 1-2 pixels. On a 60 fps video of slow-moving objects that's fine. On a 24 fps video of a pitched baseball, the ball moves dozens of pixels per frame and LK's Taylor expansion breaks down.
The classical fix is pyramidal Lucas-Kanade: downsample the image into a Gaussian pyramid, estimate flow at the coarse level (where motion is small in pixels), warp the next-level frame using that coarse estimate, and refine. Run on the same clip at 4× downscale and you can track the ball again.
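The pyramid half of that fix can be sketched as follows (an illustrative stand-in: 2x2 box-filter averaging replaces the Gaussian blur a real pyramid would use):

```python
def downsample(img):
    """Halve a grayscale frame by averaging 2x2 blocks (a box-filter
    stand-in for the Gaussian blur of a real pyramid)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*y][2*x] + img[2*y][2*x + 1] +
              img[2*y + 1][2*x] + img[2*y + 1][2*x + 1]) / 4.0
             for x in range(w)] for y in range(h)]

def pyramid(img, levels):
    """Finest-to-coarsest stack of frames. Estimate flow at the last
    (coarsest) level, then, descending, double the estimate, warp the
    frame with it, and refine with another LK pass."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr

# A 20-pixel displacement at full resolution is only 5 pixels at the
# third level (4x downscale) -- back inside LK's small-motion regime.
levels = pyramid([[float(x) for x in range(16)] for _ in range(16)], 3)
sizes = [len(level[0]) for level in levels]    # widths: [16, 8, 4]
```

Each level halves the motion in pixels, which is why two or three levels are enough to bring a fast ball back within LK's reach.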
| Family | Key idea | Strength | Weakness |
|---|---|---|---|
| Lucas-Kanade (this page) | Local window least-squares under brightness constancy. | Fast, interpretable, no training. | Small motions only, aperture problem, brightness-dependent. |
| Pyramidal LK | LK at coarse scale then iterative refinement. | Handles large motions, real-time. | Still brightness-dependent; no occlusion handling. |
| Horn-Schunck | Global energy: brightness constancy + smoothness prior. | Dense output, smooth fields. | Over-smooths motion boundaries; slow. |
| Farnebäck | Fit local polynomial to image; solve for affine displacement. | Sub-pixel accurate, dense, real-time on CPU. | Still locally affine assumption; tuned hyperparameters. |
| Deep flow (FlowNet, PWC, RAFT) | Learned matching on feature pyramids, iterative refinement with GRUs. | State-of-the-art; handles occlusion & lighting. | Needs GPU; expensive training; can hallucinate. |
The big number
Flow is dense by definition. Every pixel, every frame, two numbers. For one second of 720p video at 30 fps that's:
Flow vectors in one second of 720p @ 30 fps
1280 × 720 pixels × 30 frames × 2 components = 55.3 million numbers. That's why every flow algorithm you'll ever use is designed to be fast per pixel—classical methods do one matrix-vector solve; neural methods do one GPU sweep.
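The arithmetic behind that figure, as a quick sanity check:

```python
# One second of dense flow at 720p / 30 fps:
w, h, fps, components = 1280, 720, 30, 2
total = w * h * fps * components
# 921,600 pixels x 30 frames x 2 components = 55,296,000 numbers
```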
Four ways optical flow trips people up
"Optical flow = motion in 3-D."
Optical flow is 2-D displacement in the image plane. A camera zooming in produces large flow without any object moving; flow from a rotating camera can look like translation. Recovering 3-D motion requires stereo, depth sensors, or structure-from-motion.
"Brightness constancy holds for modern cameras."
Auto-exposure, auto-white-balance, HDR merging, and rolling shutters routinely violate it. A fluorescent light flickering at 50/60 Hz is a classic trap. Real pipelines photometrically correct for these effects or learn to ignore them.
"The flow is well-defined at occlusions."
Pixels that appear for the first time (newly disoccluded) or disappear (newly occluded) have no meaningful predecessor or successor. Classical methods produce garbage there; modern methods explicitly predict an occlusion mask alongside the flow.
"More pixels = better flow."
A textureless wall has zero spatial gradient, which means the structure tensor is singular and LK fails no matter how many pixels you feed it. Flow quality depends on image content, not resolution.