Interactive Explainer
Image Segmentation, with a Real Segmenter
Classification labels the image. Detection labels boxes. Segmentation labels every pixel. This page runs a real pretrained DeepLab v3+ segmenter on whatever photo you hand it, then lets you paint your own mask and watch mean-IoU and pixel accuracy tick up toward the model's output.
A real segmenter running in your browser
The model below is DeepLab v3+ (MobileNetV2 backbone) trained on the Pascal VOC benchmark's 21 classes (background + 20 foreground: person, cat, dog, bicycle, car, chair, sofa, tv, …). TensorFlow.js downloads it once (~10 MB) and runs every inference locally on your GPU.
Three flavours of segmentation to keep in mind:
- Semantic: per-pixel class. Two people merge into one "person" blob.
- Instance: per-pixel class and instance ID. Two people get two masks.
- Panoptic: the union—"things" (countable) get instance IDs, "stuff" (sky, road) just gets a class. DeepLab does semantic; Mask R-CNN / Mask2Former handle the others.
Pick a photo
Six CC-licensed stock photos with Pascal VOC classes.
Classes detected in this image
A segmentation is a function from pixels to labels
The model's output is a $W \times H \times C$ tensor of class logits—one logit per pixel per class. Taking the argmax at every pixel collapses it to a label map with one integer per pixel. Below is the model's raw label map rendered as a colour overlay. Every pixel has exactly one colour, because every pixel has exactly one winner.
Move the mouse over the image to see what class the model assigned to that exact pixel.
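That argmax step is short enough to sketch in full. A minimal version over a flat logit array, assuming row-major `[pixel][class]` layout (the function name and layout are illustrative, not the page's actual code):

```javascript
// Collapse a flat [W*H*C] logit array to a per-pixel label map.
// Layout assumed: logits for pixel p occupy indices p*C .. p*C + C-1.
function argmaxLabelMap(logits, width, height, numClasses) {
  const labels = new Uint8Array(width * height);
  for (let p = 0; p < width * height; p++) {
    let best = 0;
    let bestLogit = logits[p * numClasses];
    for (let c = 1; c < numClasses; c++) {
      const v = logits[p * numClasses + c];
      if (v > bestLogit) { bestLogit = v; best = c; }
    }
    labels[p] = best; // exactly one winner per pixel
  }
  return labels;
}
```

Every pixel gets exactly one integer out, which is why the overlay above has exactly one colour per pixel.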
Paint your own mask; grade it live
The canvas on the left is the image (the model's mask faintly overlaid as a guide). The canvas on the right is your blank mask. Pick a class, paint, and watch mean-IoU, pixel accuracy, and per-class IoU update after every stroke—against the real model output as ground truth.
Per-class IoU
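All three numbers fall out of a single pass over the two label maps. A sketch over flat arrays of class indices (names are illustrative; classes absent from both maps are skipped when averaging):

```javascript
// Grade a painted mask against a reference label map.
// pred and truth are flat arrays of class indices of equal length.
function gradeMask(pred, truth, numClasses) {
  const inter = new Array(numClasses).fill(0);
  const union = new Array(numClasses).fill(0);
  let correct = 0;
  for (let p = 0; p < pred.length; p++) {
    if (pred[p] === truth[p]) correct++;
    for (let c = 0; c < numClasses; c++) {
      const inP = pred[p] === c, inT = truth[p] === c;
      if (inP && inT) inter[c]++;
      if (inP || inT) union[c]++;
    }
  }
  // NaN marks classes present in neither map; they don't enter the mean.
  const perClassIoU = inter.map((i, c) => union[c] ? i / union[c] : NaN);
  const present = perClassIoU.filter(x => !Number.isNaN(x));
  return {
    pixelAccuracy: correct / pred.length,
    perClassIoU,
    meanIoU: present.reduce((a, b) => a + b, 0) / present.length,
  };
}
```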
Region growing: the pre-neural baseline
Before CNNs, segmentation started from a seed pixel and grew outward while neighbours stayed similar in colour. The rule is four lines of code: BFS over 4-connected neighbours; include a pixel if its RGB distance to the seed (or the running mean of the region) is within a threshold $\tau$.
Compare its output to the neural network's on the same photo.
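The whole baseline fits in one function. A sketch of the seed-distance variant, assuming a flat `[W*H*3]` RGB array (all names here are illustrative):

```javascript
// Region growing: BFS from a seed pixel over 4-connected neighbours,
// accepting a pixel while its RGB distance to the seed is within tau.
function regionGrow(rgb, width, height, seedX, seedY, tau) {
  const dist2 = (a, b) => {
    const dr = rgb[a * 3] - rgb[b * 3];
    const dg = rgb[a * 3 + 1] - rgb[b * 3 + 1];
    const db = rgb[a * 3 + 2] - rgb[b * 3 + 2];
    return dr * dr + dg * dg + db * db; // compare squared distances, no sqrt
  };
  const seed = seedY * width + seedX;
  const inRegion = new Uint8Array(width * height);
  const queue = [seed];
  inRegion[seed] = 1;
  while (queue.length) {
    const p = queue.shift();
    const x = p % width, y = (p - x) / width;
    for (const [nx, ny] of [[x - 1, y], [x + 1, y], [x, y - 1], [x, y + 1]]) {
      if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
      const n = ny * width + nx;
      if (!inRegion[n] && dist2(n, seed) <= tau * tau) {
        inRegion[n] = 1;
        queue.push(n);
      }
    }
  }
  return inRegion; // 1 where the region grew, 0 elsewhere
}
```

Swapping the seed comparison for a running mean of the region changes one line but makes the result order-dependent, which is part of why this baseline is brittle.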
Architecture families that beat region growing
| Family | Key move | Representatives | Strength |
|---|---|---|---|
| Encoder-decoder | CNN downsamples to coarse features, then upsamples; skip connections glue fine detail back. | FCN, U-Net, SegNet | Clean medical-imaging masks; small, fast, interpretable. |
| Dilated / multi-scale (this page) | Keep high resolution; grow receptive field with dilated conv + atrous spatial pyramid pooling. | DeepLab v1–v3+, PSPNet | Big receptive field without resolution loss; great on natural images. |
| Mask-prediction / Transformer | Decoder queries emit (class, binary mask) pairs directly—no per-pixel argmax. | Mask R-CNN, MaskFormer, Mask2Former, SAM | Natively handles instance and panoptic; state of the art. |
Why cross-entropy alone fails
Per-pixel cross-entropy is the obvious loss, but on most real images 80% of pixels are background. A model that always predicts "background" gets 80% accuracy while being useless. Three losses fix the imbalance:
- Dice loss (Milletari et al., 2016): maximize the soft Dice coefficient. Numerator and denominator shrink together, so the model still gets signal on rare classes.
- IoU / Lovász: directly optimize the metric you care about. Lovász makes it differentiable.
- Focal (Lin et al., 2017): cross-entropy re-weighted so easy pixels contribute less. Mostly a detection trick, but widely borrowed for segmentation too.
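To make the Dice idea concrete, here is a one-class soft Dice loss written over plain probability arrays (a simplification of the batched tensor form in Milletari et al.; the epsilon guards against empty masks):

```javascript
// Soft Dice loss for a single class.
// probs: predicted foreground probabilities in [0,1]; target: 0/1 labels.
function softDiceLoss(probs, target, eps = 1e-6) {
  let inter = 0, sumP = 0, sumT = 0;
  for (let i = 0; i < probs.length; i++) {
    inter += probs[i] * target[i];
    sumP += probs[i];
    sumT += target[i];
  }
  const dice = (2 * inter + eps) / (sumP + sumT + eps);
  return 1 - dice; // minimizing the loss maximizes Dice overlap
}
```

Because numerator and denominator both scale with the size of the class, a rare class contributes as strongly to the gradient as a dominant one, which is exactly the property cross-entropy lacks.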
The big number
A segmentation prediction is one decision per pixel. Scale matters: even a small photo has hundreds of thousands of them.
Per-pixel predictions for this photo
Every one has to get a class. A 99% accurate model still gets thousands of pixels wrong—which usually shows up as jagged boundaries and missed thin structures.
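The arithmetic behind that claim, for a modest photo (the 640×480 size is illustrative):

```javascript
// Back-of-envelope: one class decision per pixel, and the residue
// a 99%-accurate model leaves behind.
const width = 640, height = 480;
const pixels = width * height;               // 307,200 decisions
const wrongAt99 = Math.round(pixels * 0.01); // 3,072 mislabelled pixels
```

Three thousand wrong pixels rarely scatter uniformly; they cluster exactly where labels are hardest, at object boundaries and thin structures.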
Four things that still trip people up
"Pixel accuracy tells you if the model is good."
On a sky-heavy photo, the trivial model that always predicts background wins. Report mean IoU alongside pixel accuracy, always.
"A mask that's a superset is safe."
A mask that covers the truth plus extra has IoU = |truth| / |mask|, which shrinks as the mask grows: a 100-pixel object swallowed by a 200-pixel mask scores only 0.5. Tight wins; sprawl is penalised.
"Segmentation is classification, done more."
Per-pixel argmax would give flickery, isolated wrong pixels. Real architectures inject structural priors: skip connections, dilated context, CRFs, decoder queries.
"IoU 0.9 looks perfect."
It can still miss a 10-pixel telephone wire threading across the image—maybe the one pixel your autopilot needs. Qualitative failure review is not optional.