Keep the CNN. Add two heads:
```python
import torch.nn as nn

class ClsLocHead(nn.Module):
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.cls = nn.Linear(feat_dim, n_classes)  # class scores
        self.box = nn.Linear(feat_dim, 4)          # (x, y, w, h)

    def forward(self, feats):                      # feats: (batch, feat_dim) from the CNN
        return self.cls(feats), self.box(feats)
```
For one image:
One image of a cat. True class index 1, plus its ground-truth box.
The model predicts class scores and box coordinates.
Class loss · cross-entropy on the scores. Box loss · MSE between predicted and true box.
Total = class loss + box loss = 3.18 in this example.
This 3.18 is what backprop sees → updates both the class head and the box head simultaneously.
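A minimal sketch of how the two losses combine into one scalar; the tensor values below are made-up placeholders for illustration:

```python
import torch
import torch.nn.functional as F

# made-up placeholder values for one image
cls_scores = torch.tensor([[0.3, 2.1, -0.5]])          # logits over 3 classes
true_cls   = torch.tensor([1])                          # cat
pred_box   = torch.tensor([[0.45, 0.52, 0.30, 0.40]])   # predicted (x, y, w, h)
true_box   = torch.tensor([[0.50, 0.50, 0.35, 0.35]])   # ground-truth (x, y, w, h)

cls_loss = F.cross_entropy(cls_scores, true_cls)        # class head
box_loss = F.mse_loss(pred_box, true_box)               # box head (MSE)
total = cls_loss + box_loss   # one scalar; backprop through it updates both heads
```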
Two primitives every detector needs
Interactive: draw boxes on a canvas, see IoU + NMS live — object-detection.
Predicted box · area 10,000 px².
Ground-truth box · area 10,000 px².
Intersection · 2,500 px².
Union · 10,000 + 10,000 − 2,500 = 17,500.
IoU · 2,500 / 17,500 ≈ 0.14.
An IoU below 0.5 is almost always considered a miss. Detectors report mAP at multiple IoU thresholds because a box that's 90% right (IoU 0.9) is much more useful than one that's 60% right (IoU 0.6) — the metric penalizes imprecision.
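A minimal sketch of box IoU, assuming axis-aligned boxes stored as (x1, y1, x2, y2) tensors; this also serves as the `iou` helper that the NMS snippet further down calls:

```python
import torch

def iou(box_a, box_b):
    # box format assumed: (x1, y1, x2, y2) with x2 > x1 and y2 > y1
    x1 = torch.max(box_a[0], box_b[0])
    y1 = torch.max(box_a[1], box_b[1])
    x2 = torch.min(box_a[2], box_b[2])
    y2 = torch.min(box_a[3], box_b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)                   # intersection over union
```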
After predictions you get a pile of overlapping boxes on the same object · like several people shouting the same answer.
NMS · keep the most confident shout · silence the others.
Concretely · sort by confidence, keep the best, discard any later box that overlaps it (IoU above threshold). Repeat until none left. Standard since R-CNN; survives in YOLO and most detectors.
5 predicted boxes for one car. IoU threshold 0.5.
| Box | Score |
|---|---|
| A | 0.95 |
| B | 0.90 |
| C | 0.80 |
| D | 0.75 |
| E | 0.70 |
Step 1 · pick A (highest score). Add to keep.
Compare IoU(A, others): IoU(A,B)=0.8 → drop B; IoU(A,C)=0.2 → keep; IoU(A,D)=0.7 → drop D; IoU(A,E)=0.1 → keep.
Pool now [C, E].
Step 2 · pick C. Add to keep.
IoU(C, E) = 0.15 → keep E.
Pool now [E].
Step 3 · pick E. Add to keep. Pool empty. Stop.
Final · keep = [A, C, E]. 5 boxes → 3 final detections.
```python
import torch

def nms(boxes, scores, iou_threshold=0.5):
    # boxes: (N, 4) tensor in (x1, y1, x2, y2); scores: (N,) confidences
    idx = scores.argsort(descending=True).tolist()   # indices, most confident first
    keep = []
    while idx:
        i = idx[0]
        keep.append(i)
        # drop any later box that overlaps > threshold with this one
        idx = [j for j in idx[1:] if iou(boxes[i], boxes[j]) < iou_threshold]
    return keep
```
Greedy — highest-confidence box wins in its neighborhood. Every detector uses some form of NMS (Faster R-CNN, YOLO) or its learned replacement (DETR's Hungarian matching).
Analogy · search engine for "cat". You return 10 results. Precision · how many of those 10 really are cats. Recall · how many of all the cat pages out there made it into your 10.
Trade-off · 100% recall = return everything; 100% precision = return only one sure answer. mAP summarizes the trade-off curve.
3 ground-truth cats. 5 predictions, sorted by confidence. IoU > 0.5 → True Positive.
| Rank | Conf | TP/FP | TP cum | FP cum | Recall (TP/3) | Precision (TP/(TP+FP)) |
|---|---|---|---|---|---|---|
| 1 | 0.98 | TP | 1 | 0 | 0.33 | 1.00 |
| 2 | 0.95 | TP | 2 | 0 | 0.67 | 1.00 |
| 3 | 0.88 | FP | 2 | 1 | 0.67 | 0.67 |
| 4 | 0.75 | TP | 3 | 1 | 1.00 | 0.75 |
| 5 | 0.60 | FP | 3 | 2 | 1.00 | 0.60 |
AP = area under the precision–recall curve built from these points.
mAP = average AP over all classes.
mAP@0.5 = AP at IoU threshold 0.5. mAP@[0.5:0.95] = average over 10 IoU thresholds (COCO standard).
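A small sketch of the AP computation (VOC-style all-point interpolation; COCO's mAP additionally averages over IoU thresholds and uses 101 recall points), run on the table above:

```python
import numpy as np

def average_precision(tp_flags, n_gt):
    # tp_flags: 1/0 per detection, already sorted by descending confidence
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp, fp = np.cumsum(tp_flags), np.cumsum(1.0 - tp_flags)
    recall, precision = tp / n_gt, tp / (tp + fp)

    # add sentinels, then take the precision envelope (monotone non-increasing)
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])

    # area under the step curve wherever recall increases
    steps = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[steps + 1] - r[steps]) * p[steps + 1]))

# the five detections from the table: TP, TP, FP, TP, FP with 3 ground-truth cats
print(average_precision([1, 1, 0, 1, 0], n_gt=3))   # ≈ 0.92
```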
R-CNN family · YOLO · DETR
R-CNN (2014) → Fast R-CNN (2015) → Faster R-CNN (2015)
Two-stage idea: first propose regions that might contain an object, then classify each region and refine its box.
Evolution:
| Model | How regions proposed | Inference time (per image) |
|---|---|---|
| R-CNN | selective search (outside CNN) | ~50 s |
| Fast R-CNN | selective search + shared backbone | ~2 s |
| Faster R-CNN | Region Proposal Network (RPN) inside CNN | ~0.1 s |
Two-stage detectors still win on accuracy for small objects. They are slower and more complex — often replaced by YOLO in production.
Divide the image into a 7×7 grid. Each cell fills out a form with three questions: is an object centred in me (a confidence score)? Where exactly is its box (x, y, w, h)? Which class is it (class probabilities)?
YOLO's total loss = box loss + confidence loss + class loss, summed over every cell (the real paper adds λ weights, ignored here).
Cell holds a dog (class index 2).
Ground truth · the true box and class for that cell.
Prediction · the cell's predicted box, confidence, and class probabilities.
Components · box loss, confidence loss, class loss.
Total per cell (no weighting) · the three components added together.
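A simplified per-cell sketch under these assumptions: no λ weighting, plain MSE on raw width/height (the YOLOv1 paper uses square roots), and binary cross-entropy for the confidence term. Names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def yolo_cell_loss(pred_box, true_box, pred_conf, pred_cls, true_cls):
    # pred_box, true_box: (4,) tensors (x, y, w, h) relative to the cell
    # pred_conf: scalar tensor (objectness logit); pred_cls: (n_classes,) logits
    box_loss  = F.mse_loss(pred_box, true_box, reduction="sum")                   # where is it
    conf_loss = F.binary_cross_entropy_with_logits(pred_conf, torch.tensor(1.0))  # is there an object
    cls_loss  = F.cross_entropy(pred_cls.unsqueeze(0), torch.tensor([true_cls]))  # which class
    return box_loss + conf_loss + cls_loss

loss = yolo_cell_loss(torch.rand(4), torch.rand(4), torch.tensor(0.7),
                      pred_cls=torch.randn(20), true_cls=2)   # dog = class index 2
```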
Analogy · giving directions. You don't read out raw GPS coordinates; you say "two blocks past the station", a small correction to a known landmark.
Anchor boxes are the landmarks: fixed starting boxes tiled across the image. Conv layers are translation-equivariant — the same filter runs at every spatial location, so it should predict the same correction style everywhere, not different absolute coordinates.
Anchor · a fixed reference box (x_a, y_a, w_a, h_a); the network predicts four offsets (t_x, t_y, t_w, t_h) relative to it.
Centre — small shift, scaled by anchor size: x = x_a + t_x · w_a,  y = y_a + t_y · h_a.
Size — log-space correction; exp keeps width/height positive: w = w_a · exp(t_w),  h = h_a · exp(t_h).
(If all four offsets are zero, the prediction is exactly the anchor.)
Worked numeric · take an anchor and a set of predicted offsets, push them through the two formulas above, and read off the final box.
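A minimal numeric sketch, with anchor and offset values assumed purely for illustration:

```python
import math

# assumed values: anchor centred at (50, 50), size 100×100; predicted offsets below
xa, ya, wa, ha = 50.0, 50.0, 100.0, 100.0
tx, ty, tw, th = 0.1, -0.2, 0.2, -0.3

x = xa + tx * wa          # 50 + 0.1·100  = 60.0
y = ya + ty * ha          # 50 − 0.2·100  = 30.0
w = wa * math.exp(tw)     # 100·e^0.2     ≈ 122.1
h = ha * math.exp(th)     # 100·e^−0.3    ≈ 74.1
print(x, y, w, h)         # final box (x, y, w, h)
```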
| Detector | mAP (COCO) | FPS (V100) | Notes |
|---|---|---|---|
| Faster R-CNN | 42 | ~5 | accuracy king, slow |
| YOLOv8-m | 50 | ~250 | production default |
| DETR | 44 | ~30 | elegant, data-hungry |
| RT-DETR | 53 | ~100 | real-time Transformer-based |
Choose by constraint · real-time camera feed → YOLO. Short on labeled data → DETR with strong augmentations. Highest accuracy → large backbone + Faster R-CNN variant. There is no universally best detector.
| Version | Year | Key contribution |
|---|---|---|
| YOLOv1 | 2015 | grid formulation, one shot |
| YOLOv3 | 2018 | multi-scale predictions, anchor clustering |
| YOLOv5 | 2020 | mosaic augmentation, practical toolkit |
| YOLOv8 | 2023 | anchor-free, efficient |
| YOLOv11 | 2024 | current production default |
For any real-time detection task in 2026, start with ultralytics YOLOv11. pip install ultralytics → model downloads + runs in 10 lines.
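A minimal usage sketch with the ultralytics package; the exact weight filename below is an assumption and may differ by release:

```python
from ultralytics import YOLO          # pip install ultralytics

model = YOLO("yolo11n.pt")            # downloads pretrained weights on first use
results = model("street.jpg")         # path, URL, numpy array, or PIL image
for box in results[0].boxes:          # one entry per surviving detection (NMS already applied)
    print(box.xyxy, box.conf, box.cls)  # corners, confidence, class index
```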
YOLO and Faster R-CNN are conceptually two-step: predict densely (thousands of boxes), then clean up with NMS.
DETR's question · can we just directly predict the final, clean set?
It outputs a fixed-size set of 100 predictions. No grid, no anchors, no NMS.
Challenge · the model outputs 100 boxes; the ground truth might have only 3. Prediction order is arbitrary — pred #47 might match ground truth #1.
Solution · Hungarian matching. Imagine 3 tasks (the GT objects) and 100 workers (predictions). Assign one worker per task to minimize total cost. The Hungarian algorithm finds this optimal one-to-one matching. Once matched, compute loss on the matched pairs; the other 97 are matched to "no object."
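A sketch of the matching step using SciPy's Hungarian solver; the random cost matrix is a stand-in for DETR's real matching cost (class term + L1 box term + generalized IoU term):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

n_preds, n_gt = 100, 3
cost = np.random.rand(n_preds, n_gt)            # cost[i, j] = how badly prediction i fits GT j

pred_idx, gt_idx = linear_sum_assignment(cost)  # optimal one-to-one assignment
# prediction pred_idx[k] is supervised by ground truth gt_idx[k];
# the remaining 97 predictions are trained to output "no object".
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))
```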
DETR cleans up detection conceptually but is slower and data-hungry. YOLO still wins speed; DETR wins elegance.
Pixel-level classification
Detection gives boxes around objects. Segmentation gives a label per pixel.
Key architectural change: we need to go back up in spatial resolution — the feature map shrinks through convs/pooling, but the output must match the input size.
Solution: encoder-decoder with upsampling.
Interactive: click an image region, see segmentation fill in — image-segmentation.
Think of the encoder as creating a rich but blurry summary of the image · "I see a kidney here, a vessel there" · but exact edges have been smudged by pooling.
The skip connections give the decoder a cheat sheet from the matching-resolution encoder layer · the original crisp pixel detail.
Result · the decoder uses high-level summary for what and the cheat sheet for exactly where the boundary is. Net effect · sharp accurate masks.
The encoder compresses spatial info into richer features. But spatial precision is lost — a 16×16 feature map can't localize edges accurately.
Skip connections let the decoder concatenate encoder features at matching resolution:
```python
def forward(self, x):
    # shorthand: pool = 2×2 max-pool, up = 2× upsample, cat = channel-wise concatenation
    e1 = self.enc1(x)                    # 128×128
    e2 = self.enc2(pool(e1))             # 64×64
    e3 = self.enc3(pool(e2))             # 32×32
    b  = self.bridge(pool(e3))           # 16×16 bottleneck
    d3 = self.dec3(cat([up(b),  e3]))    # ← skip from e3
    d2 = self.dec2(cat([up(d3), e2]))    # ← skip from e2
    d1 = self.dec1(cat([up(d2), e1]))    # ← skip from e1
    return self.final(d1)                # per-pixel class logits at full resolution
```
Every modern segmentation net (DeepLab, SegFormer) uses this pattern.
For boxes we used IoU. For masks we can do the same, counting pixels: IoU = |prediction ∩ mask| / |prediction ∪ mask|.
Good metric — but its gradient is "sharp," tricky for SGD. The Dice coefficient is the smoother cousin: Dice = 2·|prediction ∩ mask| / (|prediction| + |mask|).
Optimizers minimize, so: Dice loss = 1 − Dice.
Perfect prediction → Dice 1 → loss 0. No overlap → loss 1.
Why it handles imbalance. It only counts pixels in the predicted and ground-truth masks; the ocean of correctly predicted background pixels (true negatives) never enters the formula, so it cannot dominate the loss.
Task · segment the top-left pixel.
Compare the ground-truth mask with the model's soft prediction pixel by pixel; the Dice loss for this example works out to 0.28.
The model backprops this 0.28.
Class imbalance is the #1 issue. On a medical image that is 99% background, plain cross-entropy is minimized by "predict background always". Reach for Dice (or weighted CE) first.
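A minimal soft-Dice loss sketch for binary masks, assuming predictions are already probabilities in [0, 1]:

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    # pred: (N, H, W) probabilities; target: (N, H, W) binary ground-truth masks
    pred, target = pred.flatten(1), target.flatten(1)
    intersection = (pred * target).sum(dim=1)
    dice = (2 * intersection + eps) / (pred.sum(dim=1) + target.sum(dim=1) + eps)
    return 1 - dice.mean()   # perfect overlap → 0, no overlap → 1
```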
Ronneberger et al. 2015 targeted electron-microscopy cell segmentation. Two reasons it spread: it trains well from very few labeled images (helped by heavy augmentation), and the skip-connected encoder-decoder gives sharp boundaries while staying simple to reimplement.
By 2020, U-Net was the default segmentation network not just in medicine but in satellite imagery, materials science, audio-spectrogram analysis, and later diffusion models (L21 / L22) — where the same encoder-decoder-with-skips handles noise-to-image mapping.
Built on Faster R-CNN. Adds a third head: a small per-region FCN that predicts a binary mask, alongside the existing class and box heads.
He et al. 2017 · Mask R-CNN — cleanly combines detection and segmentation. Standard baseline for instance tasks.
Zero-shot segmentation by prompting
A foundation model for segmentation: prompt it with a point, a box, or a rough mask and it returns a mask for the indicated object, with no task-specific training.
SAM changed segmentation the way CLIP changed classification — you don't need to train for your specific dataset; you just prompt a pretrained model.
In 2026: for most segmentation tasks, start with SAM-2 and fine-tune only if the domain is truly specialized (medical, satellite).
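A point-prompt sketch using the original segment-anything package (the SAM-2 predictor interface is analogous); the checkpoint filename and click coordinates are assumptions:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor  # pip install segment-anything

image = np.array(Image.open("scan.png").convert("RGB"))

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # checkpoint path assumed
predictor = SamPredictor(sam)
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),   # one click on the object of interest
    point_labels=np.array([1]),            # 1 = foreground point, 0 = background point
)
```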
Traditional detectors are like a translator with a fixed dictionary. Train on "cat / dog / car" → it can only ever detect those 3 things. Ask for "bicycle" → "sorry, not in my dictionary."
The open-vocabulary goal · build a universal translator that understands concepts, not fixed labels.
Embedding · a vector that represents data. Image embedding = vector representing an image; text embedding = vector representing text.
CLIP · OpenAI 2021. Trained on millions of (image, caption) pairs. Learns a shared embedding space: the vector for a dog photo lands close to the vector for "a photo of a dog"; both far from "a photo of a cat."
How open-vocab detectors use this: embed each candidate region with the image encoder, embed the text query with the text encoder, and score every region by its similarity to the query. Any phrase you can type becomes a detectable category.
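A conceptual sketch of the scoring step with stand-in embeddings; in a real system the vectors come from a CLIP-style image encoder (per region) and text encoder (per query):

```python
import torch
import torch.nn.functional as F

n_regions, dim = 300, 512
region_emb = F.normalize(torch.randn(n_regions, dim), dim=-1)   # stand-in region embeddings
text_emb   = F.normalize(torch.randn(dim), dim=-1)              # stand-in embedding of "a bicycle"

scores = region_emb @ text_emb        # cosine similarity of every region to the text query
top = scores.topk(5)                  # regions most likely to contain the queried concept
print(top.indices, top.values)
```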
Examples · OWLv2, GroundingDINO, SAM-with-text. The 2024–2026 frontier is fully prompt-driven vision.