Interactive Explainer
Optical Flow, Frame by Frame
Optical flow is the 2-D velocity field that says, for every pixel in a video frame, where it moved in the next frame. This page runs a real Lucas-Kanade solver and a real dense flow estimator in your browser, on real video. Step through a sports clip one frame at a time and watch the vectors fall out of the brightness-constancy equations.
Motion is a 2-D vector field
Play a video for a few frames and every pixel has a story: stationary background pixels don't move; a tennis ball travels a huge distance; a runner's arms and legs move differently from their torso. Collecting all those per-pixel displacements between two consecutive frames gives you a flow field. Each cell is a 2-D vector $(u, v)$ in pixels-per-frame. That's it—that's optical flow.
Historically, motion estimation drove the compression behind every video codec (MPEG, H.264, AV1), the tracking in camcorders, and the slo-mo in modern smartphones. Modern SLAM, 3-D reconstruction, and video diffusion models still lean on it.
Pick a clip
Four public-domain / CC-licensed sample clips.
Two frames, one question
Every optical-flow algorithm starts with the same setup: the current frame $I_t$ and the previous frame $I_{t-1}$. For each pixel at position $(x, y)$ in frame $t$, we ask: where was this pixel in frame $t-1$? Call the answer $(x - u, y - v)$. The pair $(u, v)$ is the flow at $(x, y)$.
Step through the clip and look at the before/after below. The stationary parts look almost identical; the moving parts are ghosted or duplicated. That's exactly the signal flow algorithms exploit.
The difference image $I_t - I_{t-1}$
Subtract frame $t-1$ from frame $t$ pixel-by-pixel. Static pixels cancel out and go grey; moving pixels leave a bright residual that traces out the motion.
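The subtraction itself is trivial; here is a minimal pure-Python sketch (not the page's actual code), assuming grayscale frames stored as 2-D lists of intensities:

```python
def frame_difference(prev, curr):
    """Per-pixel difference I_t - I_{t-1} for two grayscale frames.

    Static pixels cancel to 0; moving pixels leave a signed residual
    (negative where the object left, positive where it arrived).
    """
    return [[c - p for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]

# A bright 'object' moves one pixel to the right between frames:
prev = [[0, 9, 0, 0],
        [0, 9, 0, 0]]
curr = [[0, 0, 9, 0],
        [0, 0, 9, 0]]
diff = frame_difference(prev, curr)   # each row: [0, -9, 9, 0]
```

Mapping the signed residual to grey (zero) / dark / bright is exactly the visualisation above.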
The brightness constancy assumption
The foundation of (almost) every classical optical flow method is a very simple claim:
A pixel's brightness at $(x, y, t)$ equals its brightness at $(x + u, y + v, t + 1)$. That's the brightness constancy assumption. First-order Taylor-expand around $(x, y, t)$ and you get the optical flow constraint equation:

$$I_x u + I_y v + I_t = 0$$
Here $I_x$, $I_y$, $I_t$ are the spatial and temporal derivatives of the image. This is one equation in two unknowns $(u, v)$. That's the aperture problem: locally, you can only measure the component of motion along the intensity gradient; motion along an edge (perpendicular to the gradient) is invisible.
Lucas-Kanade: constrain the missing equation
Lucas and Kanade (1981) get around the aperture problem by assuming the flow is locally constant: every pixel in a small window $W$ around $(x, y)$ shares the same $(u, v)$. That turns the single OFCE at one pixel into an over-determined system of many OFCEs at neighbouring pixels, which you solve by least squares:

$$\begin{pmatrix} \sum_W I_x^2 & \sum_W I_x I_y \\ \sum_W I_x I_y & \sum_W I_y^2 \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = -\begin{pmatrix} \sum_W I_x I_t \\ \sum_W I_y I_t \end{pmatrix}$$
The $2 \times 2$ matrix $\sum \nabla I\, \nabla I^\top$ is the classic structure tensor. Its eigenvalues tell you whether the window has reliable flow: two big eigenvalues = a corner (well-conditioned, unique answer); one big, one small = an edge (aperture problem); two near-zero eigenvalues = a textureless region (flow is hopeless).
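For a symmetric $2 \times 2$ matrix the smaller eigenvalue has a closed form, which is all a Shi-Tomasi corner score needs. A sketch, assuming the window-summed tensor entries are already computed (function name is illustrative):

```python
import math

def min_eigenvalue(sxx, sxy, syy):
    """Smaller eigenvalue of the 2x2 structure tensor
    [[sxx, sxy], [sxy, syy]] -- the Shi-Tomasi corner score.

    Eigenvalues of a symmetric 2x2 matrix are
    trace/2 +/- sqrt((half-difference)^2 + off-diagonal^2).
    """
    half_trace = (sxx + syy) / 2.0
    radius = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return half_trace - radius

corner = min_eigenvalue(10.0, 0.0, 8.0)   # both eigenvalues big -> 8.0
edge = min_eigenvalue(10.0, 0.0, 0.0)     # one eigenvalue zero -> 0.0
```

Thresholding this score and keeping local maxima is one standard way to pick trackable points.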
Real Lucas-Kanade, live
Below, we pick out the strongest corner points in the current frame using a small Shi-Tomasi score (the min-eigenvalue of the structure tensor), solve Lucas-Kanade at each corner, and draw the flow vectors. The math is all in the browser: spatial gradients by Sobel, temporal gradient by frame difference, window-local least squares per corner.
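The per-corner solve can be sketched in a few lines of pure Python. This is a stand-in, not the page's code: central differences replace the Sobel kernels, frames are 2-D lists of grayscale intensities, and the $2 \times 2$ normal equations are solved by Cramer's rule:

```python
def lucas_kanade(prev, curr, cx, cy, half=2):
    """Solve the window-local least-squares LK system at (cx, cy).

    Gradients by central differences; temporal derivative by plain
    frame difference. Returns the flow (u, v) in pixels per frame.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # I_x
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # I_y
            it = curr[y][x] - prev[y][x]                  # I_t
            sxx += ix * ix; sxy += ix * iy; syy += iy * iy
            sxt += ix * it; syt += iy * it
    det = sxx * syy - sxy * sxy          # structure-tensor determinant
    if abs(det) < 1e-9:                  # edge or textureless window
        return 0.0, 0.0
    # Cramer's rule on the 2x2 normal equations (RHS is -[sxt, syt])
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v

# Synthetic check: a bilinear ramp I(x, y) = x*y shifted 0.3 px right.
W = 11
prev = [[float(x * y) for x in range(W)] for y in range(W)]
curr = [[(x - 0.3) * y for x in range(W)] for y in range(W)]
u, v = lucas_kanade(prev, curr, 5, 5)    # recovers u = 0.3, v = 0.0
```

On this bilinear test image the Taylor expansion is exact, so LK recovers the sub-pixel shift perfectly; on real frames the answer is only an approximation.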
Dense flow: one vector per pixel
Sparse LK on a handful of corners is great for tracking but leaves most of the image unexplained. Dense methods (Horn-Schunck 1981, Farnebäck 2003, RAFT 2020) produce one vector per pixel. Here we use a simple approach: run LK on a regular grid of positions across the image, yielding a dense-enough flow field to visualise.
The standard visualisation maps every flow vector to an HSV colour: hue = direction, saturation = 1, value = magnitude (clamped). Red for rightward motion, green for up, cyan for left, magenta for down. A still pixel is black.
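One way to implement that colour mapping, using the standard library's `colorsys` (a sketch; the exact hue convention varies between tools):

```python
import colorsys
import math

def flow_to_rgb(u, v, max_mag=10.0):
    """Map a flow vector to an RGB colour: hue = direction,
    saturation = 1, value = magnitude clamped to max_mag.

    Image y grows downward, so v is negated to put 'up' in the
    green part of the hue circle (red = right, cyan = left).
    """
    angle = math.atan2(-v, u)                    # direction in radians
    hue = (angle / (2 * math.pi)) % 1.0          # [0, 1) around the circle
    mag = min(math.hypot(u, v) / max_mag, 1.0)   # clamped magnitude
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, mag)
    return int(r * 255), int(g * 255), int(b * 255)

right = flow_to_rgb(10, 0)    # full-speed rightward -> (255, 0, 0), red
still = flow_to_rgb(0, 0)     # still pixel -> (0, 0, 0), black
left = flow_to_rgb(-10, 0)    # full-speed leftward -> (0, 255, 255), cyan
```

Clamping the magnitude matters: without it, one fast pixel washes out the value scale for the whole frame.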
Why classical methods fail on big motion
Taylor expansion around the current pixel is only accurate for small displacements, typically 1-2 pixels. On a 60 fps video of slow-moving objects that's fine. On a 24 fps video of a pitched baseball, the ball moves dozens of pixels per frame and LK's Taylor expansion breaks down.
The classical fix is pyramidal Lucas-Kanade: downsample the image into a Gaussian pyramid, estimate flow at the coarse level (where motion is small in pixels), warp the next-level frame using that coarse estimate, and refine. Run on the same clip at 4× downscale and you can track the ball again.
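The pyramid half of that fix can be sketched as follows (an illustrative stand-in: 2x2 box-filter averaging replaces the Gaussian blur a real pyramid would use):

```python
def downsample(img):
    """Halve a grayscale frame by averaging 2x2 blocks (a box-filter
    stand-in for the Gaussian blur of a real pyramid)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*y][2*x] + img[2*y][2*x + 1] +
              img[2*y + 1][2*x] + img[2*y + 1][2*x + 1]) / 4.0
             for x in range(w)] for y in range(h)]

def pyramid(img, levels):
    """Finest-to-coarsest stack of frames. Estimate flow at the last
    (coarsest) level, then, descending, double the estimate, warp the
    frame with it, and refine with another LK pass."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr

# A 20-pixel displacement at full resolution is only 5 pixels at the
# third level (4x downscale) -- back inside LK's small-motion regime.
levels = pyramid([[float(x) for x in range(16)] for _ in range(16)], 3)
sizes = [len(level[0]) for level in levels]    # widths: [16, 8, 4]
```

Each level halves the motion in pixels, which is why two or three levels are enough to bring a fast ball back within LK's reach.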
| Family | Key idea | Strength | Weakness |
|---|---|---|---|
| Lucas-Kanade (this page) | Local window least-squares under brightness constancy. | Fast, interpretable, no training. | Small motions only, aperture problem, brightness-dependent. |
| Pyramidal LK | LK at coarse scale then iterative refinement. | Handles large motions, real-time. | Still brightness-dependent; no occlusion handling. |
| Horn-Schunck | Global energy: brightness constancy + smoothness prior. | Dense output, smooth fields. | Over-smooths motion boundaries; slow. |
| Farnebäck | Fit local polynomial to image; solve for affine displacement. | Sub-pixel accurate, dense, real-time on CPU. | Still locally affine assumption; tuned hyperparameters. |
| Deep flow (FlowNet, PWC, RAFT) | Learned matching on feature pyramids, iterative refinement with GRUs. | State-of-the-art; handles occlusion & lighting. | Needs GPU; expensive training; can hallucinate. |
The big number
Flow is dense by definition. Every pixel, every frame, two numbers. For one second of 720p video at 30 fps that's:
Flow vectors in one second of 720p @ 30 fps
1280 × 720 pixels × 30 frames × 2 components = 55.3 million numbers. That's why every flow algorithm you'll ever use is designed to be fast per pixel—classical methods do one matrix-vector solve; neural methods do one GPU sweep.
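The arithmetic behind that figure, as a quick sanity check:

```python
# One second of dense flow at 720p / 30 fps:
w, h, fps, components = 1280, 720, 30, 2
total = w * h * fps * components
# 921,600 pixels x 30 frames x 2 components = 55,296,000 numbers
```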
Four ways optical flow trips people up
"Optical flow = motion in 3-D."
Optical flow is 2-D displacement in the image plane. A camera zooming in produces large flow without any object moving; flow from a rotating camera can look like translation. Recovering 3-D motion requires stereo, depth sensors, or structure-from-motion.
"Brightness constancy holds for modern cameras."
Auto-exposure, auto-white-balance, HDR merging, and rolling shutters routinely violate it. A fluorescent light flickering at 50/60 Hz is a classic trap. Real pipelines photometrically correct for these effects or learn to ignore them.
"The flow is well-defined at occlusions."
Pixels that appear for the first time (newly disoccluded) or disappear (newly occluded) have no meaningful predecessor or successor. Classical methods produce garbage there; modern methods explicitly predict an occlusion mask alongside the flow.
"More pixels = better flow."
A textureless wall has zero spatial gradient, which means the structure tensor is singular and LK fails no matter how many pixels you feed it. Flow quality depends on image content, not resolution.