Mean Squared Error on coordinates:
| True | Predicted | Squared Error |
|---|---|---|
| x=100 | x̂=105 | (105 − 100)² = 25 |
| y=80 | ŷ=75 | (75 − 80)² = 25 |
| w=60 | ŵ=55 | (55 − 60)² = 25 |
| h=40 | ĥ=45 | (45 − 40)² = 25 |
Total box loss (MSE): (25 + 25 + 25 + 25) / 4 = 25
```python
import torch.nn as nn

class LocalizationNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = ...  # CNN backbone producing 512-dim features
        self.fc_class = nn.Linear(512, 3)  # 3 classes
        self.fc_box = nn.Linear(512, 4)    # 4 box coords

    def forward(self, x):
        features = self.cnn(x)
        class_out = self.fc_class(features)  # [batch, 3]
        box_out = self.fc_box(features)      # [batch, 4]
        return class_out, box_out
```
```python
import torch.nn.functional as F

# Two losses!
class_loss = F.cross_entropy(class_pred, class_true)
box_loss = F.mse_loss(box_pred, box_true)
total_loss = class_loss + 5 * box_loss  # weight the box loss higher
```
How do we know if a predicted box is good?
IoU = Intersection over Union
| IoU Value | Meaning |
|---|---|
| 1.0 | Perfect match! |
| 0.7 | Good overlap |
| 0.5 | Acceptable (threshold) |
| 0.0 | No overlap at all |
IoU is used everywhere: loss functions, evaluation, NMS
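As a concrete reference, here is a minimal IoU computation for axis-aligned boxes in (x1, y1, x2, y2) corner format (a sketch; real libraries vary in box format):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes don't overlap)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # Union = sum of areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(round(iou((30, 30, 100, 100), (50, 50, 120, 120)), 2))  # 0.34
```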
Localization works great for ONE object.
But what if there are 3 cats and 2 dogs?
Image with 5 objects → Neural Network → ???
| Problem | Why It's Hard |
|---|---|
| Variable output size | 1 image might have 2 objects, another has 10 |
| Different classes | Mix of cats, dogs, cars... |
| Overlapping objects | Objects can be on top of each other |
Neural networks want FIXED output size!
Idea: Slide a window, classify each patch.
Problems:
| Issue | Why It's Bad |
|---|---|
| Many sizes | Small cat? Big cat? Try all! |
| Many positions | Thousands of patches |
| Slow | Classify each one separately |
Result: ~50 seconds per image! Way too slow.
What we want: all detections from a single, fast forward pass.
The insight:
Instead of sliding a window...
Divide the image into a grid!
Divide image into S × S grid (e.g., 7×7)
| Concept | Meaning |
|---|---|
| Grid | 49 cells covering image |
| Responsibility | Each cell detects objects whose CENTER is in it |
| Output | Each cell predicts boxes + classes |
Rule: If object's center is in a cell, that cell predicts it!
Think of it like assigning responsibility:
| Without Grid | With Grid |
|---|---|
| "Find all objects... somehow" | "Cell (3,4), is there something in you?" |
| Variable-length output (hard!) | Fixed 7×7 output (easy!) |
| Complex architecture | Simple regression |
Each cell answers TWO questions:
1. Is there an object whose center falls in me?
2. If so, what class is it, and where exactly is its box?
Grid converts variable detection into fixed-size prediction!
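The responsibility rule is two lines of arithmetic (assuming center coordinates normalized to [0, 1] and a 7×7 grid):

```python
S = 7  # grid size

def responsible_cell(cx, cy):
    """Return (row, col) of the cell that must predict an object centered at (cx, cy)."""
    col = min(int(cx * S), S - 1)  # clamp so cx = 1.0 stays in the last column
    row = min(int(cy * S), S - 1)
    return row, col

print(responsible_cell(0.52, 0.31))  # (2, 3)
```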
For EACH cell, predict:
| Output | Size | Meaning |
|---|---|---|
| x, y | 2 | Box center (relative to cell) |
| w, h | 2 | Box width & height (relative to image) |
| confidence | 1 | P(object) × IoU |
| class probs | C | P(class \| object) |
Per cell: 5 + C numbers (one box: x, y, w, h, confidence, plus C class probabilities)
Example: 20 classes → each cell outputs 25 numbers
Confidence = "Is there an object here, and how good is my box?"
| Scenario | P(object) | IoU | Confidence |
|---|---|---|---|
| No object in cell | 0 | - | 0 |
| Object, perfect box | 1 | 1.0 | 1.0 |
| Object, okay box | 1 | 0.7 | 0.7 |
| Object, bad box | 1 | 0.3 | 0.3 |
High confidence = "I'm sure there's an object AND my box is accurate"
What if two objects have centers in the same cell?
Solution: Each cell predicts B boxes (usually B=2)
| Per Cell Output | Count |
|---|---|
| Box 1: (x, y, w, h, conf) | 5 numbers |
| Box 2: (x, y, w, h, conf) | 5 numbers |
| Class probs: P(cat), P(dog), ... | C numbers |
| Total | B×5 + C |
Example: B=2 boxes, C=20 classes → 30 numbers per cell
For an S × S grid with B boxes and C classes:
Output shape: S × S × (B × 5 + C)
Example: S=7, B=2, C=20
| Dimension | Value | Meaning |
|---|---|---|
| 7 × 7 | 49 cells | Grid covering image |
| × 30 | 30 numbers/cell | 2 boxes + 20 classes |
| Total | 1470 numbers | Full detection output! |
ONE forward pass → 1470 predictions!
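A quick sanity check of the arithmetic:

```python
S, B, C = 7, 2, 20         # grid size, boxes per cell, classes
per_cell = B * 5 + C       # 2 × (x, y, w, h, conf) + 20 class probs = 30
total = S * S * per_cell   # 49 cells × 30 numbers = 1470
print(per_cell, total)  # 30 1470
```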
The full YOLO pipeline:
| Stage | Input | Output |
|---|---|---|
| Image | 448×448×3 | Raw pixels |
| CNN | Pixels | Features |
| FC/Conv | Features | 7×7×30 tensor |
Output tensor: 7×7 grid × 30 numbers per cell = 1470 predictions
Key insight: Detection as a single regression problem!
One forward pass → all boxes + classes + confidences
Total loss = Box loss + Confidence loss + Class loss
| Loss | What It Penalizes |
|---|---|
| Box loss | Wrong box coordinates (x, y, w, h) |
| Confidence loss | Wrong confidence scores |
| Class loss | Wrong class predictions |
Only for cells that HAVE an object:
Why only those cells? A cell with no object has no ground-truth box to regress against.
Two cases:
| Case | What We Want |
|---|---|
| Object present | Confidence → high (close to IoU) |
| No object | Confidence → 0 |
Only for cells WITH an object:
Simple squared error on class probabilities.
If the cell contains a dog: the target is P(dog) = 1 and 0 for every other class.
YOLO Training Summary:
For each training image:
1. Divide into 7×7 grid
2. For each object, assign to cell containing its center
3. Forward pass → get 7×7×30 predictions
4. Compute all three losses
5. Backpropagate and update weights
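Steps 1–2 above amount to building a target grid per image. A pure-Python sketch (illustrative: one box per cell shown; real YOLO targets also fill full class vectors):

```python
S = 7  # grid size

# Each object: (cx, cy, w, h, class_id), coordinates normalized to [0, 1]
objects = [(0.52, 0.31, 0.20, 0.40, 2)]

target = [[None] * S for _ in range(S)]  # empty 7×7 grid

for cx, cy, w, h, cls in objects:
    col = min(int(cx * S), S - 1)  # grid column containing the center
    row = min(int(cy * S), S - 1)  # grid row containing the center
    target[row][col] = {
        "box": (cx * S - col, cy * S - row, w, h),  # center offsets within the cell
        "confidence": 1.0,  # an object is present here
        "class": cls,
    }
```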
At test time, speed is what separates the detector families:
| Method | Approach | Speed |
|---|---|---|
| R-CNN | 2000 region proposals, classify each | ~50 sec |
| Fast R-CNN | Share CNN features | ~2 sec |
| YOLO | One forward pass, predict all | ~0.02 sec |
YOLO processes 45 frames per second!
Traditional: Image → Find regions → Classify each → Slow
YOLO: Image → CNN → All detections at once → Fast!
Problem: Different objects have different shapes!
| Object | Typical Shape |
|---|---|
| Person | Tall and thin |
| Car | Wide and short |
| Dog | Medium, horizontal |
Solution: Pre-define "anchor boxes" of different shapes
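As an illustration, a detector carries a small set of anchor shapes and assigns each ground-truth box to the anchor it most resembles. The shapes below are made up for the example; YOLOv2 derives its anchors by k-means over the training boxes:

```python
# Illustrative anchor shapes (width, height), normalized to the image.
anchors = [
    (0.10, 0.30),  # tall and thin  (person-like)
    (0.40, 0.15),  # wide and short (car-like)
    (0.25, 0.20),  # medium, horizontal (dog-like)
]

def shape_iou(a, b):
    """IoU of two boxes compared by shape only (both centered at the origin)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def best_anchor(gt_shape):
    """Index of the anchor whose shape best matches a ground-truth box."""
    return max(range(len(anchors)), key=lambda i: shape_iou(anchors[i], gt_shape))

print(best_anchor((0.38, 0.16)))  # a wide, short box matches the car-like anchor
```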
Problem: Multiple cells may detect the same object!
Before NMS: [Dog: 0.9], [Dog: 0.8], [Dog: 0.7]
After NMS: [Dog: 0.9] ← keep the highest; the rest are removed
Algorithm:
1. Sort detections by confidence.
2. Keep the highest-confidence box.
3. Remove remaining boxes that overlap it too much (IoU > threshold).
4. Repeat with the boxes left over.
Result: One clean box per object!
Think of it like election results:
| Without NMS | With NMS |
|---|---|
| 3 people claim they found the dog | Best one wins |
| Overlapping claims | Remove duplicates |
| Messy output | Clean output |
Why do duplicates happen? Neighboring grid cells (and multiple boxes per cell) often fire on the same object.
NMS = "Only the best detection survives!"
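A minimal greedy NMS in plain Python (a sketch; in practice NMS is usually run per class, and libraries like torchvision ship optimized versions):

```python
def nms(detections, iou_threshold=0.5):
    """Greedy NMS. Each detection: (x1, y1, x2, y2, confidence)."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    remaining = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-confidence survivor wins
        kept.append(best)
        # Drop everything that overlaps the winner too much
        remaining = [d for d in remaining if iou(best, d) < iou_threshold]
    return kept
```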
```python
from ultralytics import YOLO

# Load a pre-trained model
model = YOLO('yolov8n.pt')

# Detect objects in an image
results = model('photo.jpg')

# Print detections
for result in results:
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        confidence = box.conf[0].item()
        class_id = int(box.cls[0].item())
        name = result.names[class_id]
        print(f"{name} ({confidence:.2f}) at ({x1:.0f}, {y1:.0f}) to ({x2:.0f}, {y2:.0f})")
```
| Model | Speed | Accuracy | Best For |
|---|---|---|---|
| YOLOv8n | Fastest | Lower | Mobile phones |
| YOLOv8s | Fast | Medium | General use |
| YOLOv8m | Medium | Good | Better accuracy |
| YOLOv8x | Slowest | Best | Maximum accuracy |
"n" = nano (smallest), "x" = extra-large
| Version | Year | Key Innovation |
|---|---|---|
| YOLOv1 | 2016 | One-stage detection |
| YOLOv2 | 2016 | Batch norm, anchor boxes |
| YOLOv3 | 2018 | Multi-scale predictions |
| YOLOv4 | 2020 | Bag of tricks |
| YOLOv5 | 2020 | PyTorch, easy to use |
| YOLOv8 | 2023 | State-of-the-art, anchor-free |
Use YOLOv8 — it's the most modern and easy to use!
| Type | How It Works | Example |
|---|---|---|
| Two-stage | 1) Find regions, 2) Classify | R-CNN, Faster R-CNN |
| One-stage | Predict everything at once | YOLO, SSD |
| | Two-Stage | One-Stage |
|---|---|---|
| Speed | Slower | Faster |
| Accuracy | Slightly better | Good enough |
| Use case | When accuracy critical | Real-time |
Why does this matter?
| Application | Needs | Model Choice |
|---|---|---|
| Self-driving car | Real-time (30+ FPS) | YOLOv8n or s |
| Medical diagnosis | High accuracy | YOLOv8x |
| Phone app | Low battery usage | YOLOv8n |
| Surveillance | Balance | YOLOv8s or m |
You choose based on your constraints!
Ground Truth box: (30, 30) to (100, 100)
Predicted box: (50, 50) to (120, 120)
| Step | Calculation |
|---|---|
| 1. Find intersection | x: max(30,50)=50 to min(100,120)=100; y: max(30,50)=50 to min(100,120)=100; area = 50 × 50 = 2,500 |
| 2. Find union | GT area = 70 × 70 = 4,900; Pred area = 70 × 70 = 4,900; union = 4,900 + 4,900 − 2,500 = 7,300 |
| 3. IoU | 2,500 / 7,300 ≈ 0.34 |
IoU = 0.34 → Not a good match (need > 0.5)
| Application | What YOLO Detects |
|---|---|
| Self-driving cars | People, cars, traffic signs |
| Retail stores | Customers, products |
| Sports analysis | Players, ball |
| Security cameras | People, vehicles |
| Medical imaging | Tumors, lesions |
| Challenge | Why It's Hard |
|---|---|
| Small objects | Few pixels, hard to see |
| Crowded scenes | Objects overlap |
| Unusual angles | Different from training data |
| Real-time speed | Must process 30+ FPS |
| Class imbalance | Rare objects (e.g., fire) |
Modern detectors (YOLO v8) handle most of these well!
Common Objects in Context (COCO):
| Property | Value |
|---|---|
| Images | 330,000+ |
| Object instances | 2.5 million |
| Classes | 80 (person, car, dog, pizza, ...) |
| Annotations | Bounding boxes + segmentation |
If your model works well on COCO, it probably works in the real world!
| Task | Output | Use Case |
|---|---|---|
| Classification | One label | "Is this a cat?" |
| Detection | Boxes + labels | "Where are all cats?" |
| Segmentation | Pixel-level masks | "Exact shape of each cat" |
Segmentation = Detection's precise cousin
Steps to train YOLO on your data:
```python
# 1. Collect and label images (tools like Roboflow help)
# 2. Export in YOLO format
# 3. Train
from ultralytics import YOLO

model = YOLO('yolov8n.pt')  # start from pre-trained weights
model.train(data='my_data.yaml', epochs=50)

# 4. Use it!
results = model('new_image.jpg')
```
Transfer learning: Start from pre-trained weights, fine-tune on your data!
| Step | What Happens |
|---|---|
| 1. Image | Grid of pixels (numbers) |
| 2. CNN | Extract features (edges → shapes → objects) |
| 3. Detection | Predict box coordinates + class |
| 4. Output | List of (box, class, confidence) |
Images are grids of numbers (pixels)
CNNs use filters to detect patterns
Object Detection = Classification + Location
YOLO enables real-time detection
| Topic | What It Is |
|---|---|
| Convolution math | Detailed filter operations |
| CNN architectures | ResNet, VGG, EfficientNet |
| Non-Maximum Suppression | Removing duplicate detections |
| Segmentation | Pixel-level object boundaries |
| Pose estimation | Detecting body keypoints |
You'll learn these in advanced CV courses!