
Interactive Explainer

Image Segmentation, with a Real Segmenter

Classification labels the image. Detection labels boxes. Segmentation labels every pixel. This page runs a real pretrained DeepLab v3+ segmenter on whatever photo you hand it, then lets you paint your own mask and watch mean-IoU and pixel accuracy tick up toward the model's output.

Prelude

A real segmenter running in your browser

The model below is DeepLab v3+ (MobileNetV2 backbone) trained on the Pascal VOC benchmark's 21 classes (background + 20 foreground: person, cat, dog, bicycle, car, chair, sofa, tv, …). TensorFlow.js downloads it once (~10 MB) and runs every inference locally on your GPU.
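If you want to reproduce the demo outside this page, the sketch below uses the published @tensorflow-models/deeplab wrapper. Whether this page goes through that wrapper or a raw graph model is an assumption; load() and segment() are the wrapper's own entry points, and the function name here is illustrative.

```ts
import '@tensorflow/tfjs';
import * as deeplab from '@tensorflow-models/deeplab';

// Load the 21-class Pascal VOC variant and segment one image element.
// quantizationBytes: 2 shrinks the download at little accuracy cost.
async function segmentPhoto(image: HTMLImageElement) {
  const model = await deeplab.load({ base: 'pascal', quantizationBytes: 2 });
  const { legend, width, height, segmentationMap } = await model.segment(image);
  // legend maps each detected class name to an RGB colour;
  // segmentationMap is a flat RGBA buffer (4 bytes per pixel) ready to draw on a canvas.
  console.log('classes found:', Object.keys(legend), `at ${width}x${height}`);
  return segmentationMap;
}
```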

Three flavours of segmentation to keep in mind:
Semantic: every pixel gets a class label, with no notion of which object it belongs to (this is what DeepLab does on this page).
Instance: each object gets its own mask, so two dogs yield two masks.
Panoptic: both at once; every pixel gets a class, and countable objects also keep separate identities.

Pick a photo

Six CC-licensed stock photos with Pascal VOC classes.

The segmentation overlay appears here once the model and photo are ready, along with readouts for the model (DeepLab v3+), the classes found, and the inference time.

Classes detected in this image

Step 1

A segmentation is a function from pixels to labels

The model's output is a $W \times H \times C$ tensor of class logits—one logit per pixel per class. Taking the argmax at every pixel collapses it to a label map with one integer per pixel. Below is the model's raw label map rendered as a colour overlay. Every pixel has exactly one colour, because every pixel has exactly one winner.
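A minimal sketch of that collapse in TensorFlow.js terms (the function name and the [H, W, C] shape convention are illustrative, not the page's source):

```ts
import * as tf from '@tensorflow/tfjs';

// Per-pixel class logits in, one integer label per pixel out.
async function toLabelMap(logits: tf.Tensor3D): Promise<Int32Array> {
  const labelMap = tf.argMax(logits, -1);              // shape [H, W], int32: winner per pixel
  const labels = (await labelMap.data()) as Int32Array;
  labelMap.dispose();
  // The hover readout below needs nothing more than a lookup into this array:
  // the class at pixel (x, y) is labels[y * width + x].
  return labels;
}
```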

Toggle between the image and the model's semantic mask.

Move the mouse over the image to see what class the model assigned to that exact pixel.

Step 2

Paint your own mask; grade it live

The canvas on the left is the image (the model's mask faintly overlaid as a guide). The canvas on the right is your blank mask. Pick a class, paint, and watch mean-IoU, pixel accuracy, and per-class IoU update after every stroke—against the real model output as ground truth.

Image with model mask (guide).
Your mask (paint here).
Pixel accuracy
Mean IoU
Pixels painted: 0

Per-class IoU
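For the curious, here is a minimal sketch of the grading behind these readouts, assuming both masks arrive as flat Int32Array label maps of equal length (one class id per pixel); the function and variable names are illustrative, not the page's source. The model's mask plays the role of ground truth, exactly as in the demo.

```ts
// Grade a painted mask against a reference label map.
function gradeMask(painted: Int32Array, truth: Int32Array, numClasses = 21) {
  const inter = new Array(numClasses).fill(0);   // pixels where both agree on class c
  const union = new Array(numClasses).fill(0);   // pixels where either says class c
  let correct = 0;
  for (let i = 0; i < truth.length; i++) {
    const p = painted[i], t = truth[i];
    if (p === t) { correct++; inter[t]++; union[t]++; }
    else { union[p]++; union[t]++; }
  }
  const perClassIoU = inter.map((n, c) => (union[c] > 0 ? n / union[c] : NaN));
  const present = perClassIoU.filter((x) => !Number.isNaN(x));
  return {
    pixelAccuracy: correct / truth.length,
    perClassIoU,
    meanIoU: present.reduce((a, b) => a + b, 0) / present.length,
  };
}
```

Note one design choice: mean IoU here averages only over classes that appear in either mask; averaging over all 21 classes would be another defensible convention.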

Step 3

Region growing: the pre-neural baseline

Before CNNs, segmentation started from a seed pixel and grew outward while neighbours stayed similar in colour. The rule is four lines of code: BFS over 4-connected neighbours; include a pixel if its RGB distance to the seed (or the running mean of the region) is within a threshold $\tau$.
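Here is a minimal sketch of that rule, assuming the image arrives as a flat RGBA buffer (e.g. from canvas getImageData) and using distance to the seed colour rather than the running mean. With the BFS bookkeeping spelled out it runs past four lines, but the entire decision is the single dist(j) <= tau test.

```ts
// Grow a region from a seed pixel: BFS over 4-connected neighbours,
// keeping any pixel whose RGB distance to the seed colour is within tau.
function growRegion(
  rgba: Uint8ClampedArray, width: number, height: number,
  seedX: number, seedY: number, tau: number,
): Uint8Array {
  const rgbAt = (i: number) => [rgba[i * 4], rgba[i * 4 + 1], rgba[i * 4 + 2]];
  const seed = rgbAt(seedY * width + seedX);
  const dist = (i: number) => {
    const [r, g, b] = rgbAt(i);
    return Math.hypot(r - seed[0], g - seed[1], b - seed[2]);
  };
  const inRegion = new Uint8Array(width * height);   // 1 = included (the orange overlay)
  const queue = [seedY * width + seedX];
  inRegion[queue[0]] = 1;
  while (queue.length) {
    const i = queue.shift()!;
    const x = i % width, y = (i / width) | 0;
    for (const [nx, ny] of [[x - 1, y], [x + 1, y], [x, y - 1], [x, y + 1]]) {
      if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
      const j = ny * width + nx;
      if (inRegion[j]) continue;
      if (dist(j) <= tau) { inRegion[j] = 1; queue.push(j); }
    }
  }
  return inRegion;
}
```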

Compare its output to the neural network's on the same photo.

Click any pixel to drop a seed. Orange = included in the region.
Seed colour
Region pixels: 0
IoU vs model
The pre-neural weakness. Drop a seed on the sky, raise $\tau$ slowly. At low $\tau$, you get a clean sky mask. At high $\tau$, the region leaks into buildings the moment their colour is similar enough. That's the core problem: "same colour" is a fragile proxy for "same object." A neural net learned to ignore colour when it doesn't matter (shaded vs sunlit grass is still grass) and to pay attention when it does.
Step 4

Architecture families that beat region growing

Family | Key move | Representatives | Strength
Encoder-decoder CNN | Downsamples to coarse features, then upsamples; skip connections glue fine detail back. | FCN, U-Net, SegNet | Clean medical-imaging masks; small, fast, interpretable.
Dilated / multi-scale (this page) | Keeps high resolution; grows the receptive field with dilated convolutions + atrous spatial pyramid pooling. | DeepLab v1–v3+, PSPNet | Big receptive field without resolution loss; great on natural images.
Mask-prediction / Transformer | Decoder queries emit (class, binary mask) pairs directly, with no per-pixel argmax. | Mask R-CNN, MaskFormer, Mask2Former, SAM | Natively handles instance and panoptic; state of the art.
Why resolution vs receptive field is a real tension. To identify a pixel as car the model needs context from many surrounding pixels (big receptive field, deep features). To place that pixel right at the car's edge the model needs full input resolution. Every architecture above is one particular truce: skip connections, dilation, mask queries. DeepLab (this page) uses dilated convolutions + atrous spatial pyramid pooling.
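To make the dilated-convolution side of that truce concrete, here is a toy ASPP-style block in TensorFlow.js layers. The filter counts and dilation rates are assumptions, and the real DeepLab v3+ head also includes an image-level pooling branch that is omitted here; this is a sketch of the idea, not the model's graph.

```ts
import * as tf from '@tensorflow/tfjs';

// Several convolutions read the same full-resolution feature map with
// different dilation rates, so each pixel sees context at several scales
// without any downsampling; the branches are then fused with a 1x1 conv.
function asppBlock(features: tf.SymbolicTensor): tf.SymbolicTensor {
  const branches = [1, 6, 12, 18].map((rate) =>
    tf.layers
      .conv2d({ filters: 64, kernelSize: 3, dilationRate: rate, padding: 'same', activation: 'relu' })
      .apply(features) as tf.SymbolicTensor,
  );
  const merged = tf.layers.concatenate().apply(branches) as tf.SymbolicTensor;
  return tf.layers
    .conv2d({ filters: 64, kernelSize: 1, activation: 'relu' })
    .apply(merged) as tf.SymbolicTensor;
}
```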
Step 5

Why cross-entropy alone fails

Per-pixel cross-entropy is the obvious loss, but on most real images 80% of pixels are background. A model that always predicts "background" gets 80% accuracy while being useless. Three losses fix the imbalance:
Weighted cross-entropy: rarer classes get a larger per-pixel weight, so background can no longer dominate the gradient.
Focal loss: down-weights pixels the model already classifies confidently, concentrating training on the hard, usually minority-class pixels.
Dice loss: optimises per-class overlap directly, so a class's influence does not grow with how many pixels it covers.
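As one concrete example, here is a minimal soft Dice loss sketch in TensorFlow.js (shapes assumed [batch, H, W, classes] with one-hot labels and softmax probabilities; the smoothing constant and names are arbitrary choices, not this page's code):

```ts
import * as tf from '@tensorflow/tfjs';

// Soft Dice: per-class overlap ratio, averaged over classes and batch.
// Because it is a ratio per class, a class covering 80% of the pixels
// gets no more say than one covering 2%.
function softDiceLoss(labels: tf.Tensor4D, probs: tf.Tensor4D, smooth = 1e-6): tf.Scalar {
  const axes = [1, 2];                                       // sum over the spatial dims
  const intersection = tf.sum(tf.mul(labels, probs), axes);
  const total = tf.add(tf.sum(labels, axes), tf.sum(probs, axes));
  const dice = tf.div(tf.add(tf.mul(intersection, 2), smooth), tf.add(total, smooth));
  return tf.sub(1, tf.mean(dice)) as tf.Scalar;              // 0 = perfect overlap
}
```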

Step 6

The big number

A segmentation prediction is one decision per pixel. Scale matters: even a small photo has hundreds of thousands of them.

Per-pixel predictions for this photo

Every one has to get a class. A 99% accurate model still gets thousands of pixels wrong—which usually shows up as jagged boundaries and missed thin structures.
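To make that concrete with an illustrative size: a $640 \times 480$ photo has $640 \times 480 = 307{,}200$ pixels, so a model that is right 99% of the time still mislabels roughly $0.01 \times 307{,}200 \approx 3{,}000$ of them, easily enough to fray an object boundary or erase a thin structure.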

Step 7

Four things that still trip people up

Myth

"Pixel accuracy tells you if the model is good."
On a sky-heavy photo, the background-predicting trivial model wins. Report mean IoU alongside pixel accuracy, always.

Myth

"A mask that's a superset is safe."
A mask that covers the truth plus extra has IoU = truth / mask, which shrinks as the mask grows. Tight wins; sprawl is penalised.
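In symbols, with prediction $P$ and ground truth $T$: $\mathrm{IoU} = |P \cap T| \, / \, |P \cup T|$. When the prediction covers everything ($T \subseteq P$) this reduces to $|T|/|P|$, so doubling the predicted area halves the score no matter how much of the truth it contains.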

Myth

"Segmentation is classification, done more."
Per-pixel argmax would give flickery, isolated wrong pixels. Real architectures inject structural priors: skip connections, dilated context, CRFs, decoder queries.

Myth

"IoU 0.9 looks perfect."
It can still miss a 10-pixel telephone wire threading across the image—maybe the one pixel your autopilot needs. Qualitative failure review is not optional.

Final takeaway. A segmenter is a function from pixels to labels. You've now watched a real DeepLab do this live on your photo, painted a competing mask yourself, graded it against the model, and seen a pre-neural region-grower succeed and fail in exactly the ways that motivated modern architectures. Everything newer (Mask R-CNN, SAM, SegFormer) is the same task with bigger networks and smarter decoders.