Detection & Segmentation

Lecture 9 · ES 667: Deep Learning

Prof. Nipun Batra
IIT Gandhinagar · Aug 2026

Learning outcomes

By the end of this lecture you will be able to:

  1. Extend a classifier to localization + detection with multiple heads.
  2. Compute IoU and explain NMS.
  3. Contrast R-CNN family vs YOLO vs DETR.
  4. Pick anchor boxes and explain delta parameterization.
  5. Use a U-Net for per-pixel segmentation; choose appropriate loss.
  6. Prompt SAM for zero-shot segmentation.

Recap · where we are

Module 4 so far:

  • L7 · CNN mechanics, receptive field, classic architectures
  • L8 · Modern CNNs (ResNet bottleneck, MobileNet, EfficientNet) + transfer learning

All of that was classification — one label per image.

Today is different: UDL does NOT cover detection / segmentation. Read Bishop Ch 10 or CS231n OD notes alongside these slides.

Today's jump: one label per pixel, per object, per region.

Four questions

  1. How do we go from classification to bounding boxes?
  2. How do we compare boxes — what is IoU? How do we deduplicate predictions — NMS?
  3. How does YOLO do all of it in one forward pass?
  4. How do we predict a mask per pixel? U-Net and beyond.

PART 1

Classification → Localization → Detection

Three increasing levels of spatial specificity

The spectrum of "what's in this image"

Task · Output · Example
Classification · 1 label · "cat"
Classification + localization · 1 label + 1 bbox · "cat @ (60, 80, 200, 180)"
Object detection · many labels + many bboxes · "cat @ ... dog @ ... car @ ..."
Semantic segmentation · label per pixel · every pixel → class (no instance)
Instance segmentation · label + instance per pixel · "cat 1 vs cat 2 as separate masks"

Today we cover the last four.

Training for two goals at once

Analogy · grading a two-section exam. Section 1 = multiple-choice (classification). Section 2 = a drawing (localization). Final score = MCQ + drawing.

If the drawing is more important: final = MCQ + 2 × drawing. Our loss does the same — sum two losses, weight one with λ.

Classification + localization · the multi-task loss

Keep the CNN. Add two heads:

import torch.nn as nn

class ClsLocHead(nn.Module):
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.cls = nn.Linear(feat_dim, n_classes)   # class scores
        self.box = nn.Linear(feat_dim, 4)           # (x, y, w, h)

    def forward(self, feats):
        # one shared feature vector feeds both heads
        return self.cls(feats), self.box(feats)

For one image:

  • True class c, model predicts class logits.
  • True box (x, y, w, h), model predicts (x̂, ŷ, ŵ, ĥ).

Total loss: L = L_cls + λ L_box. λ is the balancing knob: λ > 1 → box matters more; λ < 1 → class matters more. Typical L_box is Smooth L1 (L2 near 0, L1 far → robust).
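
A minimal sketch of this two-head loss in PyTorch (the function name and the λ default are illustrative, not from the slides):

import torch.nn.functional as F

def multitask_loss(cls_logits, box_pred, cls_target, box_target, lam=1.0):
    # classification term: cross-entropy over the class logits
    l_cls = F.cross_entropy(cls_logits, cls_target)
    # localization term: Smooth L1 on the 4 box numbers
    l_box = F.smooth_l1_loss(box_pred, box_target)
    return l_cls + lam * l_box   # lam = λ, the balancing knob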

Worked numeric · multi-task loss

One image of a cat. True class index 1. True box (60, 80, 200, 180).

Model predicts:

  • Class logits → confidently "cat" → say L_cls = 0.18.
  • Box (62, 82, 202, 180).

Box loss L_box = ((62−60)² + (82−80)² + (202−200)² + (180−180)²) / 4 = (4 + 4 + 4 + 0) / 4 = 3.0 (MSE).

Total L = 0.18 + 1 × 3.0 = 3.18 (with λ = 1).

This 3.18 is what backprop sees → updates both the class head and the box head simultaneously.

PART 2

IoU and NMS

Two primitives every detector needs

IoU · the metric · NMS · the cleanup

▶ Interactive: draw boxes on a canvas, see IoU + NMS live — object-detection.

IoU · visual examples

IoU · a quick worked example

Predicted box · (0, 0, 100, 100) · area = 100 × 100 = 10,000.
Ground-truth box · (50, 50, 150, 150) · area = 10,000.

Intersection · the square from (50, 50) to (100, 100) · area = 50 × 50 = 2,500.
Union · 10,000 + 10,000 − 2,500 = 17,500.
IoU · 2,500 / 17,500 ≈ 0.14.

An IoU below 0.5 is almost always considered a miss. Detectors report mAP at multiple IoU thresholds because a box that's 90% right (IoU 0.9) is much more useful than one that's 60% right (IoU 0.6) — the metric penalizes imprecision.
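
A minimal IoU helper, assuming boxes are given as (x1, y1, x2, y2) corners; it reproduces the numbers above:

def iou(box_a, box_b):
    # corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # 2500 / 17500 ≈ 0.143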

NMS · the "shouting answers" analogy

After predictions you get a pile of overlapping boxes on the same object · like several people shouting the same answer.

NMS · keep the most confident shout · silence the others.

Concretely · sort by confidence, keep the best, discard any later box that overlaps it (IoU above threshold). Repeat until none left. Standard since R-CNN; survives in YOLO and most detectors.

NMS · step-by-step example

5 predicted boxes for one car. IoU threshold 0.5.

Box · Score
A · 0.95
B · 0.90
C · 0.80
D · 0.75
E · 0.70

Step 1 · pick A (highest score). Add to keep.
Compare IoU(A, others): IoU(A,B)=0.8 → drop B; IoU(A,C)=0.2 → keep; IoU(A,D)=0.7 → drop D; IoU(A,E)=0.1 → keep.
Pool now [C, E].

Step 2 · pick C. Add to keep.
IoU(C, E) = 0.15 → keep E.
Pool now [E].

Step 3 · pick E. Add to keep. Pool empty. Stop.

Final · keep = [A, C, E]. 5 boxes → 3 final detections.

NMS · pseudocode

def nms(boxes, scores, iou_threshold=0.5):
    # boxes: list of (x1, y1, x2, y2); scores: list of confidences
    # uses the iou() helper defined earlier
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while idx:
        i = idx[0]
        keep.append(i)
        # drop any later box that overlaps > threshold with this one
        idx = [j for j in idx[1:] if iou(boxes[i], boxes[j]) < iou_threshold]
    return keep

Greedy — the highest-confidence box wins in its neighborhood. Every detector uses some form of NMS (Faster R-CNN, YOLO) or sidesteps it with set prediction (DETR's Hungarian matching).

How do we grade a detector?

Analogy · search engine for "cat". You return 10 results.

  • Precision = fraction that actually are cats. (8/10 → 80%.) Of the answers you gave, how many were right?
  • Recall = fraction of all cat photos on the internet you found. (8 of 100 → 8%.) Of all right answers, how many did you find?

Trade-off · 100% recall = return everything; 100% precision = return only one sure answer. mAP summarizes the trade-off curve.

Computing AP · a worked example

3 ground-truth cats. 5 predictions, sorted by confidence. IoU > 0.5 → True Positive.

Rank · Conf · TP/FP · TP cum · FP cum · Recall (TP/3) · Precision (TP/(TP+FP))
1 · 0.98 · TP · 1 · 0 · 0.33 · 1.00
2 · 0.95 · TP · 2 · 0 · 0.67 · 1.00
3 · 0.88 · FP · 2 · 1 · 0.67 · 0.67
4 · 0.75 · TP · 3 · 1 · 1.00 · 0.75
5 · 0.60 · FP · 3 · 2 · 1.00 · 0.60

AP = area under the precision–recall curve built from these points.
mAP = average AP over all classes.

mAP@0.5 = AP at IoU threshold 0.5. mAP@[0.5:0.95] = average over 10 IoU thresholds (COCO standard).
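
A sketch of the standard all-point AP computation on the table above (NumPy; the variable names are mine):

import numpy as np

# precision/recall points from the worked table
recall    = np.array([0.33, 0.67, 0.67, 1.00, 1.00])
precision = np.array([1.00, 1.00, 0.67, 0.75, 0.60])

# make precision monotonically non-increasing from the right,
# then integrate over recall
prec = np.maximum.accumulate(precision[::-1])[::-1]
ap = np.sum(np.diff(np.concatenate([[0.0], recall])) * prec)
print(f"AP ≈ {ap:.2f}")   # ≈ 0.92 for these five detections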

PART 3

One-stage vs two-stage detectors

R-CNN family · YOLO · DETR

R-CNN family · two-stage (brief)

R-CNN (2014) → Fast R-CNN (2015) → Faster R-CNN (2015)

Two-stage idea:

  1. Propose candidate regions (where might objects be?).
  2. Classify each region (what's in it?).

Evolution:

Model · How regions are proposed · Inference time
R-CNN · selective search (outside CNN) · ~50 s
Fast R-CNN · selective search + shared backbone · ~2 s
Faster R-CNN · Region Proposal Network (RPN) inside CNN · ~0.1 s

Two-stage detectors still win on accuracy for small objects. They are slower and more complex — often replaced by YOLO in production.

YOLO · you only look once

YOLO · the grid-of-responsibilities idea

Divide the image into a 7×7 grid. Each cell fills out a form with three questions:

  1. Is there an object centred here? (yes/no + confidence) → objectness loss (L_obj)
  2. If yes, where exactly? (4 numbers · x, y, w, h) → box loss (L_box)
  3. If yes, what class? (cat / dog / car…) → class loss (L_cls)

YOLO's total loss = sum of all three across every cell: L = L_box + L_obj + L_cls.

  • Box · active only if a cell contains an object. MSE on (x, y, √w, √h) (sqrt so small boxes matter).
  • Objectness · active always — BCE pushes toward 1 if object, 0 if not.
  • Class · active only if cell contains an object — cross-entropy.

Worked numeric · YOLO loss for one cell

Cell holds a dog (class index 2).

Ground truth. Box (0.50, 0.50, 0.20, 0.30). Objectness 1. Class one-hot (0, 0, 1).

Prediction. Box (0.45, 0.55, 0.25, 0.25). Objectness 0.85. Probs (0.05, 0.10, 0.85).

Components.

  • L_box = (0.50−0.45)² + (0.50−0.55)² + (√0.20−√0.25)² + (√0.30−√0.25)² ≈ 0.01
  • L_obj = −log 0.85 ≈ 0.163 (BCE, target 1)
  • L_cls = −log 0.85 ≈ 0.163 (cross-entropy on the dog probability)

Total per cell (no weighting): 0.01 + 0.163 + 0.163 ≈ 0.34. Sum across all cells = full image loss.
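
Recomputing those components (the numbers are the illustrative ones above):

import math

gt = (0.50, 0.50, 0.20, 0.30)   # ground-truth (x, y, w, h)
pr = (0.45, 0.55, 0.25, 0.25)   # predicted box

# box term uses sqrt of w, h so small boxes matter
l_box = ((gt[0] - pr[0]) ** 2 + (gt[1] - pr[1]) ** 2
         + (math.sqrt(gt[2]) - math.sqrt(pr[2])) ** 2
         + (math.sqrt(gt[3]) - math.sqrt(pr[3])) ** 2)
l_obj = -math.log(0.85)          # BCE with target 1
l_cls = -math.log(0.85)          # CE picks out the true-class probability
print(l_box, l_obj, l_cls, l_box + l_obj + l_cls)  # ≈ 0.01, 0.163, 0.163, 0.34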

Why predict deltas · relative directions

Analogy · giving directions.

  • Absolute: "Go to GPS coordinates (40.7128, -74.0060)." Hard. Different from every starting point.
  • Relative (deltas): "50 m forward, 10 m right." Easy. Works from any starting corner.

Anchor boxes are starting corners. Conv layers are translation-equivariant — the same filter runs at every spatial location, so it should predict the same correction style everywhere, not different absolute coordinates.

Decoding the box prediction · with example

Anchor (x_a, y_a, w_a, h_a) is known per cell. Network predicts deltas (t_x, t_y, t_w, t_h).

Centre — small shift, scaled by anchor size:

x = x_a + t_x · w_a        y = y_a + t_y · h_a

Size — log-space correction; exp keeps width/height positive:

w = w_a · e^(t_w)          h = h_a · e^(t_h)

(If t_w = 0, e^0 = 1 → width unchanged.)

Worked numeric. Anchor (x_a, y_a, w_a, h_a) = (100, 100, 50, 80), deltas (t_x, t_y, t_w, t_h) = (0.2, −0.1, 0.4, 0).

x = 100 + 0.2 × 50 = 110 · y = 100 − 0.1 × 80 = 92 · w = 50 × e^0.4 ≈ 74.6 · h = 80 × e^0 = 80.

Final box · (110, 92, 74.6, 80). Network only had to learn small corrections, anchor did the heavy lifting.
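
The same decode as code (a sketch; the (x, y, w, h) tuple layout is assumed):

import math

def decode(anchor, deltas):
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = deltas
    return (xa + tx * wa,        # centre shift scaled by anchor width
            ya + ty * ha,        # ... and by anchor height
            wa * math.exp(tw),   # log-space size correction, always > 0
            ha * math.exp(th))

print(decode((100, 100, 50, 80), (0.2, -0.1, 0.4, 0.0)))
# → (110.0, 92.0, ≈74.6, 80.0)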

Speed vs accuracy · a detector comparison

Detector · mAP (COCO) · FPS (V100) · Notes
Faster R-CNN · 42 · ~5 · accuracy king, slow
YOLOv8-m · 50 · ~250 · production default
DETR · 44 · ~30 · elegant, data-hungry
RT-DETR · 53 · ~100 · real-time Transformer-based

Choose by constraint · real-time camera feed → YOLO. Labeled-data poor → start from pretrained weights and lean on strong augmentations (DETR trained from scratch is data-hungry). Highest accuracy → large backbone + Faster R-CNN variant. There is no universally best detector.

The YOLO lineage

Version · Year · Key contribution
YOLOv1 · 2015 · grid formulation, one shot
YOLOv3 · 2018 · multi-scale predictions, anchor clustering
YOLOv5 · 2020 · mosaic augmentation, practical toolkit
YOLOv8 · 2023 · anchor-free, efficient
YOLOv11 · 2024 · current production default

For any real-time detection task in 2026, start with ultralytics YOLOv11. pip install ultralytics → model downloads + runs in 10 lines.
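
A sketch of those 10 lines with the Ultralytics API (the weight filename may differ by release):

from ultralytics import YOLO                 # pip install ultralytics

model = YOLO("yolo11n.pt")                   # downloads pretrained weights on first use
results = model("street.jpg")                # run detection on an image
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)       # class id, confidence, box corners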

DETR · ditch the post-processing

YOLO and Faster R-CNN are conceptually two-step: predict densely (thousands of boxes), then clean up with NMS.

DETR's question · can we just directly predict the final, clean set?

It outputs a fixed-size set of 100 predictions. No grid, no anchors, no NMS.

Challenge · the model outputs 100 boxes; the ground truth might have only 3. Prediction order is arbitrary — pred #47 might match ground truth #1.

Solution · Hungarian matching. Imagine 3 tasks (the GT objects) and 100 workers (predictions). Assign one worker per task to minimize total cost. The Hungarian algorithm finds this optimal one-to-one matching. Once matched, compute loss on the matched pairs; the other 97 are matched to "no object."
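
A toy version with SciPy's Hungarian solver (the cost matrix here is made up; DETR's real matching cost mixes class probability and box distance):

import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([            # rows: 3 GT objects, cols: 5 of the predictions
    [0.2, 0.9, 0.8, 0.7, 0.6],
    [0.8, 0.1, 0.9, 0.6, 0.7],
    [0.9, 0.8, 0.7, 0.3, 0.5],
])
gt_idx, pred_idx = linear_sum_assignment(cost)   # optimal one-to-one matching
print(list(zip(gt_idx, pred_idx)))               # [(0, 0), (1, 1), (2, 3)]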

DETR cleans up detection conceptually but is slower and data-hungry. YOLO still wins speed; DETR wins elegance.

PART 4

Semantic segmentation · U-Net

Pixel-level classification

From detection to segmentation

Detection gives boxes around objects. Segmentation gives a label per pixel.

Key architectural change: we need to go back up in spatial resolution — the feature map shrinks through convs/pooling, but the output must match the input size.

Solution: encoder-decoder with upsampling.
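
Two standard ways to go back up in resolution, sketched in PyTorch (the shapes are illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)                        # small, deep feature map
up_fixed   = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
up_learned = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
print(up_fixed(x).shape)    # (1, 64, 32, 32): fixed interpolation
print(up_learned(x).shape)  # (1, 32, 32, 32): learned upsampling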

U-Net architecture

U-Net · with channels and sizes

▶ Interactive: click an image region, see segmentation fill in — image-segmentation.

U-Net skips · the cheat sheet analogy

Think of the encoder as creating a rich but blurry summary of the image · "I see a kidney here, a vessel there" · but exact edges have been smudged by pooling.

The skip connections give the decoder a cheat sheet from the matching-resolution encoder layer · the original crisp pixel detail.

Result · the decoder uses high-level summary for what and the cheat sheet for exactly where the boundary is. Net effect · sharp accurate masks.

Why skip connections are essential

The encoder compresses spatial info into richer features. But spatial precision is lost — a 16×16 feature map can't localize edges accurately.

Skip connections let the decoder concatenate encoder features at matching resolution:

def forward(self, x):
    # shorthand: pool = 2×2 max-pool, up = 2× upsample, cat = channel concat
    e1 = self.enc1(x)           # 128×128
    e2 = self.enc2(pool(e1))    #  64×64
    e3 = self.enc3(pool(e2))    #  32×32
    b  = self.bridge(pool(e3))  #  16×16

    d3 = self.dec3(cat([up(b),  e3]))    # ← skip from e3
    d2 = self.dec2(cat([up(d3), e2]))    # ← skip from e2
    d1 = self.dec1(cat([up(d2), e1]))    # ← skip from e1
    return self.final(d1)
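
Filled out as a runnable module (a minimal sketch: the class name, the 32→256 channel widths, and ConvTranspose2d upsampling are my choices, not the reference U-Net):

import torch
import torch.nn as nn

def block(c_in, c_out):
    # two 3×3 convs with ReLU: the standard U-Net unit
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, n_classes=2):
        super().__init__()
        self.enc1, self.enc2, self.enc3 = block(in_ch, 32), block(32, 64), block(64, 128)
        self.bridge = block(128, 256)
        self.pool = nn.MaxPool2d(2)
        self.up3 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec3, self.dec2, self.dec1 = block(256, 128), block(128, 64), block(64, 32)
        self.final = nn.Conv2d(32, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        b = self.bridge(self.pool(e3))
        d3 = self.dec3(torch.cat([self.up3(b), e3], dim=1))   # skip from e3
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))  # skip from e2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip from e1
        return self.final(d1)

print(TinyUNet()(torch.randn(1, 3, 128, 128)).shape)  # (1, 2, 128, 128)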

Nearly every modern segmentation net (DeepLab, SegFormer) builds on this encode-then-decode, multi-scale-fusion pattern.

Segmentation loss · IoU to Dice

For boxes we used IoU. For masks we can do the same:

IoU(P, G) = |P ∩ G| / |P ∪ G|

Good metric — but its gradient is "sharp," tricky for SGD. The Dice coefficient is the smoother cousin:

Dice(P, G) = 2 |P ∩ G| / (|P| + |G|)

Optimizers minimize, so:

L_Dice = 1 − Dice

Perfect prediction → Dice 1 → loss 0. No overlap → loss 1.

Why it handles imbalance. It only counts pixels in P or G — the vast background of true negatives doesn't dominate (unlike plain CE).

Worked numeric · Dice loss on a 2×2 image

Task · segment the top-left pixel.

Ground truth G = [[1, 0], [0, 0]], prediction (probabilities) P = [[0.9, 0.2], [0.2, 0.2]].

  1. Intersection Σ P·G = 0.9 (only the top-left pixel is in G).
  2. |P| + |G| = (0.9 + 0.2 + 0.2 + 0.2) + 1 = 2.5.
  3. Dice = 2 × 0.9 / 2.5 = 0.72.
  4. Loss = 1 − 0.72 = 0.28.

The model backprops this 0.28.
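
A soft Dice loss sketch in PyTorch (the eps smoothing term is a common addition, not from the slides); it reproduces the 0.28:

import torch

def dice_loss(pred, target, eps=1e-6):
    # pred: probabilities in [0, 1]; target: binary mask
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

pred   = torch.tensor([[0.9, 0.2], [0.2, 0.2]])
target = torch.tensor([[1.0, 0.0], [0.0, 0.0]])
print(dice_loss(pred, target))   # ≈ 0.28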

Other segmentation losses

  • Pixel-wise cross-entropy · default. Per-pixel softmax over classes.
  • Focal loss · down-weights easy pixels → model focuses on boundaries.
  • Boundary loss · explicit pixel accuracy near edges.

Class imbalance is the #1 issue. A medical image with 99% background pixels optimizes to "predict background always" under plain CE. Reach for Dice (or weighted CE) first.

Why U-Net caught on in medical imaging

Ronneberger 2015 targeted electron microscopy cell segmentation. Two reasons it spread:

  1. Small data tolerance · medical datasets are in the hundreds. U-Net's strong spatial prior (encoder-decoder + skip) makes this workable.
  2. Permissive license + simple architecture · anyone could port it to any framework in a day.

By 2020, U-Net was the default segmentation network not just in medicine but in satellite imagery, materials science, audio-spectrogram analysis, and later diffusion models (L21 / L22) — where the same encoder-decoder-with-skips handles noise-to-image mapping.

Instance segmentation · Mask R-CNN (brief)

Built on Faster R-CNN. Adds a third head:

  • Class head (from Faster R-CNN)
  • Bbox head (from Faster R-CNN)
  • Mask head — a small FCN producing a pixel mask per region

He et al. 2017 · Mask R-CNN — cleanly combines detection and segmentation. Standard baseline for instance tasks.

PART 5

2026 frontier · SAM

Zero-shot segmentation by prompting

Segment Anything · SAM (Meta, 2023)

A foundation model for segmentation:

  • Trained on 1.1 billion masks across 11M images.
  • Takes a prompt — a point, a box, or text — and returns a mask.
  • Zero-shot on unseen categories and domains.
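
A point-prompt sketch with Meta's segment-anything package (the checkpoint name is the public ViT-H release; image_rgb is assumed already loaded as an HxWx3 uint8 array):

import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_rgb)                 # embed the image once
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),       # one click as the prompt
    point_labels=np.array([1]))                # 1 = foreground point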

SAM changed segmentation the way CLIP changed classification — you don't need to train for your specific dataset; you just prompt a pretrained model.

In 2026: for most segmentation tasks, start with SAM-2 and fine-tune only if the domain is truly specialized (medical, satellite).

The fixed-dictionary problem

Traditional detectors are like a translator with a fixed dictionary. Train on "cat / dog / car" → it can only ever detect those 3 things. Ask for "bicycle" → "sorry, not in my dictionary."

The open-vocabulary goal · build a universal translator that understands concepts, not fixed labels.

The open-vocabulary shift · how it works

Embedding · a vector that represents data. Image embedding = vector representing an image; text embedding = vector representing text.

CLIP · OpenAI 2021. Trained on 400 million (image, caption) pairs. Learns a shared embedding space: the vector for a dog photo lands close to the vector for "a photo of a dog"; both far from "a photo of a cat."

How open-vocab detectors use this:

  1. User provides a text prompt — e.g. "a red bicycle".
  2. Model computes the text embedding via CLIP's text encoder.
  3. Model processes the image with a CNN → spatial features.
  4. Search the image for regions whose embeddings are close to the text embedding.
  5. Match → draw a box.
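
Steps 2–4 sketched with Hugging Face's CLIP (the checkpoint name is the real openai/clip-vit-base-patch32; the fixed crops stand in for a proper region-proposal step):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc  = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")
crops = [image.crop((0, 0, 200, 200)),         # candidate regions
         image.crop((200, 0, 400, 200))]
inputs = proc(text=["a red bicycle"], images=crops,
              return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
print(out.logits_per_image)                    # each region's match to the prompt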

Examples · OWLv2, GroundingDINO, SAM-with-text. The 2024–2026 frontier is fully prompt-driven vision.

Lecture 9 — summary

  • Detection = classification + localization at multiple objects.
  • IoU quantifies bbox quality; NMS deduplicates; mAP is the headline metric.
  • R-CNN family — propose regions then classify (two-stage, accurate, slow).
  • YOLO — one forward pass on a grid (one-stage, fast, production standard).
  • DETR — Transformer-based set prediction; no NMS.
  • U-Net — encoder-decoder with skip connections · canonical for segmentation.
  • SAM — zero-shot segmentation via prompting (2023+).

Read before Lecture 10

Bishop Ch 12 · sequences and recurrence.

Next lecture

RNNs, LSTMs, GRUs — why MLPs fail on sequences, BPTT, gating mechanisms.

Notebook 9 · 09-yolo-unet.ipynb — run pretrained YOLOv11 on sample images; train a small U-Net on a toy segmentation task; measure IoU per class.