
Interactive Explainer

Image Segmentation, with a Real Segmenter

Classification labels the image. Detection labels boxes. Segmentation labels every pixel. This page runs a real pretrained DeepLab v3+ segmenter on whatever photo you hand it, then lets you paint your own mask and watch mean-IoU and pixel accuracy tick up toward the model's output.

Prelude

A real segmenter running in your browser

The model below is DeepLab v3+ (MobileNetV2 backbone) trained on the Pascal VOC benchmark's 21 classes (background + 20 foreground: person, cat, dog, bicycle, car, chair, sofa, tv, …). TensorFlow.js downloads it once (~10 MB) and runs every inference locally on your GPU.
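If you want to reproduce the demo outside this page, the sketch below uses the published @tensorflow-models/deeplab wrapper. Whether this page goes through that wrapper or a raw graph model is an assumption; load() and segment() are the wrapper's own entry points, and the function name here is illustrative.

```ts
import '@tensorflow/tfjs';
import * as deeplab from '@tensorflow-models/deeplab';

// Load the 21-class Pascal VOC variant and segment one image element.
// quantizationBytes: 2 shrinks the download at little accuracy cost.
async function segmentPhoto(image: HTMLImageElement) {
  const model = await deeplab.load({ base: 'pascal', quantizationBytes: 2 });
  const { legend, width, height, segmentationMap } = await model.segment(image);
  // legend maps each detected class name to an RGB colour;
  // segmentationMap is a flat RGBA buffer (4 bytes per pixel) ready to draw on a canvas.
  console.log('classes found:', Object.keys(legend), `at ${width}x${height}`);
  return segmentationMap;
}
```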

Three flavours of segmentation to keep in mind:
Semantic: every pixel gets a class label, with no notion of which object it belongs to (this is what DeepLab does on this page).
Instance: each object gets its own mask, so two dogs yield two masks.
Panoptic: both at once; every pixel gets a class, and countable objects also keep separate identities.

Pick a photo

Six CC-licensed stock photos with Pascal VOC classes.

The segmentation overlay appears here once the model and photo are ready, along with readouts for the model (DeepLab v3+), the classes found, and the inference time.

Classes detected in this image

Step 1

A segmentation is a function from pixels to labels

The model's output is a $W \times H \times C$ tensor of class logits—one logit per pixel per class. Taking the argmax at every pixel collapses it to a label map with one integer per pixel. Below is the model's raw label map rendered as a colour overlay. Every pixel has exactly one colour, because every pixel has exactly one winner.
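A minimal sketch of that collapse in TensorFlow.js terms (the function name and the [H, W, C] shape convention are illustrative, not the page's source):

```ts
import * as tf from '@tensorflow/tfjs';

// Per-pixel class logits in, one integer label per pixel out.
async function toLabelMap(logits: tf.Tensor3D): Promise<Int32Array> {
  const labelMap = tf.argMax(logits, -1);              // shape [H, W], int32: winner per pixel
  const labels = (await labelMap.data()) as Int32Array;
  labelMap.dispose();
  // The hover readout below needs nothing more than a lookup into this array:
  // the class at pixel (x, y) is labels[y * width + x].
  return labels;
}
```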

Toggle between the image and the model's semantic mask.

Move the mouse over the image to see what class the model assigned to that exact pixel.

Step 2

Paint your own mask; grade it live

The canvas on the left is the image (the model's mask faintly overlaid as a guide). The canvas on the right is your blank mask. Pick a class, paint, and watch mean-IoU, pixel accuracy, and per-class IoU update after every stroke—against the real model output as ground truth.

Image with model mask (guide).
Your mask (paint here).
Pixel accuracy
Mean IoU
Pixels painted: 0

Per-class IoU
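For the curious, here is a minimal sketch of the grading behind these readouts, assuming both masks arrive as flat Int32Array label maps of equal length (one class id per pixel); the function and variable names are illustrative, not the page's source. The model's mask plays the role of ground truth, exactly as in the demo.

```ts
// Grade a painted mask against a reference label map.
function gradeMask(painted: Int32Array, truth: Int32Array, numClasses = 21) {
  const inter = new Array(numClasses).fill(0);   // pixels where both agree on class c
  const union = new Array(numClasses).fill(0);   // pixels where either says class c
  let correct = 0;
  for (let i = 0; i < truth.length; i++) {
    const p = painted[i], t = truth[i];
    if (p === t) { correct++; inter[t]++; union[t]++; }
    else { union[p]++; union[t]++; }
  }
  const perClassIoU = inter.map((n, c) => (union[c] > 0 ? n / union[c] : NaN));
  const present = perClassIoU.filter((x) => !Number.isNaN(x));
  return {
    pixelAccuracy: correct / truth.length,
    perClassIoU,
    meanIoU: present.reduce((a, b) => a + b, 0) / present.length,
  };
}
```

Note one design choice: mean IoU here averages only over classes that appear in either mask; averaging over all 21 classes would be another defensible convention.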

Step 3

Region growing: the pre-neural baseline

Before CNNs, segmentation started from a seed pixel and grew outward while neighbours stayed similar in colour. The rule is four lines of code: BFS over 4-connected neighbours; include a pixel if its RGB distance to the seed (or the running mean of the region) is within a threshold $\tau$.
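Here is a minimal sketch of that rule, assuming the image arrives as a flat RGBA buffer (e.g. from canvas getImageData) and using distance to the seed colour rather than the running mean. With the BFS bookkeeping spelled out it runs past four lines, but the entire decision is the single dist(j) <= tau test.

```ts
// Grow a region from a seed pixel: BFS over 4-connected neighbours,
// keeping any pixel whose RGB distance to the seed colour is within tau.
function growRegion(
  rgba: Uint8ClampedArray, width: number, height: number,
  seedX: number, seedY: number, tau: number,
): Uint8Array {
  const rgbAt = (i: number) => [rgba[i * 4], rgba[i * 4 + 1], rgba[i * 4 + 2]];
  const seed = rgbAt(seedY * width + seedX);
  const dist = (i: number) => {
    const [r, g, b] = rgbAt(i);
    return Math.hypot(r - seed[0], g - seed[1], b - seed[2]);
  };
  const inRegion = new Uint8Array(width * height);   // 1 = included (the orange overlay)
  const queue = [seedY * width + seedX];
  inRegion[queue[0]] = 1;
  while (queue.length) {
    const i = queue.shift()!;
    const x = i % width, y = (i / width) | 0;
    for (const [nx, ny] of [[x - 1, y], [x + 1, y], [x, y - 1], [x, y + 1]]) {
      if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
      const j = ny * width + nx;
      if (inRegion[j]) continue;
      if (dist(j) <= tau) { inRegion[j] = 1; queue.push(j); }
    }
  }
  return inRegion;
}
```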

Compare its output to the neural network's on the same photo.

Click any pixel to drop a seed. Orange = included in the region.
Seed colour
Region pixels: 0
IoU vs model
The pre-neural weakness. Drop a seed on the sky, raise $\tau$ slowly. At low $\tau$, you get a clean sky mask. At high $\tau$, the region leaks into buildings the moment their colour is similar enough. That's the core problem: "same colour" is a fragile proxy for "same object." A neural net learned to ignore colour when it doesn't matter (shaded vs sunlit grass is still grass) and to pay attention when it does.
Step 4

Architecture families that beat region growing

Family | Key move | Representatives | Strength
Encoder-decoder CNN | Downsamples to coarse features, then upsamples; skip connections glue fine detail back. | FCN, U-Net, SegNet | Clean medical-imaging masks; small, fast, interpretable.
Dilated / multi-scale (this page) | Keeps high resolution; grows the receptive field with dilated convolutions + atrous spatial pyramid pooling. | DeepLab v1–v3+, PSPNet | Big receptive field without resolution loss; great on natural images.
Mask-prediction / Transformer | Decoder queries emit (class, binary mask) pairs directly, with no per-pixel argmax. | Mask R-CNN, MaskFormer, Mask2Former, SAM | Natively handles instance and panoptic; state of the art.
Why resolution vs receptive field is a real tension. To identify a pixel as car the model needs context from many surrounding pixels (big receptive field, deep features). To place that pixel right at the car's edge the model needs full input resolution. Every architecture above is one particular truce: skip connections, dilation, mask queries. DeepLab (this page) uses dilated convolutions + atrous spatial pyramid pooling.
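To make the dilated-convolution side of that truce concrete, here is a toy ASPP-style block in TensorFlow.js layers. The filter counts and dilation rates are assumptions, and the real DeepLab v3+ head also includes an image-level pooling branch that is omitted here; this is a sketch of the idea, not the model's graph.

```ts
import * as tf from '@tensorflow/tfjs';

// Several convolutions read the same full-resolution feature map with
// different dilation rates, so each pixel sees context at several scales
// without any downsampling; the branches are then fused with a 1x1 conv.
function asppBlock(features: tf.SymbolicTensor): tf.SymbolicTensor {
  const branches = [1, 6, 12, 18].map((rate) =>
    tf.layers
      .conv2d({ filters: 64, kernelSize: 3, dilationRate: rate, padding: 'same', activation: 'relu' })
      .apply(features) as tf.SymbolicTensor,
  );
  const merged = tf.layers.concatenate().apply(branches) as tf.SymbolicTensor;
  return tf.layers
    .conv2d({ filters: 64, kernelSize: 1, activation: 'relu' })
    .apply(merged) as tf.SymbolicTensor;
}
```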
Step 5

Why cross-entropy alone fails

Per-pixel cross-entropy is the obvious loss, but on most real images 80% of pixels are background. A model that always predicts "background" gets 80% accuracy while being useless. Three losses fix the imbalance:
Weighted cross-entropy: rarer classes get a larger per-pixel weight, so background can no longer dominate the gradient.
Focal loss: down-weights pixels the model already classifies confidently, concentrating training on the hard, usually minority-class pixels.
Dice loss: optimises per-class overlap directly, so a class's influence does not grow with how many pixels it covers.
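As one concrete example, here is a minimal soft Dice loss sketch in TensorFlow.js (shapes assumed [batch, H, W, classes] with one-hot labels and softmax probabilities; the smoothing constant and names are arbitrary choices, not this page's code):

```ts
import * as tf from '@tensorflow/tfjs';

// Soft Dice: per-class overlap ratio, averaged over classes and batch.
// Because it is a ratio per class, a class covering 80% of the pixels
// gets no more say than one covering 2%.
function softDiceLoss(labels: tf.Tensor4D, probs: tf.Tensor4D, smooth = 1e-6): tf.Scalar {
  const axes = [1, 2];                                       // sum over the spatial dims
  const intersection = tf.sum(tf.mul(labels, probs), axes);
  const total = tf.add(tf.sum(labels, axes), tf.sum(probs, axes));
  const dice = tf.div(tf.add(tf.mul(intersection, 2), smooth), tf.add(total, smooth));
  return tf.sub(1, tf.mean(dice)) as tf.Scalar;              // 0 = perfect overlap
}
```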

Step 6

The big number

A segmentation prediction is one decision per pixel. Scale matters: even a small photo has hundreds of thousands of them.

Per-pixel predictions for this photo

Every one has to get a class. A 99% accurate model still gets thousands of pixels wrong—which usually shows up as jagged boundaries and missed thin structures.
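To make that concrete with an illustrative size: a $640 \times 480$ photo has $640 \times 480 = 307{,}200$ pixels, so a model that is right 99% of the time still mislabels roughly $0.01 \times 307{,}200 \approx 3{,}000$ of them, easily enough to fray an object boundary or erase a thin structure.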

Step 7

Four things that still trip people up

Myth

"Pixel accuracy tells you if the model is good."
On a sky-heavy photo, the background-predicting trivial model wins. Report mean IoU alongside pixel accuracy, always.

Myth

"A mask that's a superset is safe."
A mask that covers the truth plus extra has IoU = truth / mask, which shrinks as the mask grows. Tight wins; sprawl is penalised.
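In symbols, with prediction $P$ and ground truth $T$: $\mathrm{IoU} = |P \cap T| \, / \, |P \cup T|$. When the prediction covers everything ($T \subseteq P$) this reduces to $|T|/|P|$, so doubling the predicted area halves the score no matter how much of the truth it contains.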

Myth

"Segmentation is classification, done more."
Per-pixel argmax would give flickery, isolated wrong pixels. Real architectures inject structural priors: skip connections, dilated context, CRFs, decoder queries.

Myth

"IoU 0.9 looks perfect."
It can still miss a 10-pixel telephone wire threading across the image—maybe the one pixel your autopilot needs. Qualitative failure review is not optional.

Final takeaway. A segmenter is a function from pixels to labels. You've now watched a real DeepLab do this live on your photo, painted a competing mask yourself, graded it against the model, and seen a pre-neural region-grower succeed and fail in exactly the ways that motivated modern architectures. Everything newer (Mask R-CNN, SAM, SegFormer) is the same task with bigger networks and smarter decoders.