Computer Vision

How Machines See

From Pixels to Object Detection

Nipun Batra | IIT Gandhinagar

The Story So Far

Lecture What We Learned
5 Neural networks, layers, training
6 How neural networks see images

A Child vs. A Computer

A 3-year-old child can:

  • Recognize their parent in any photo
  • Spot a dog across the street
  • Know if a picture is upside down

A computer sees:

  • Just numbers (pixels)
  • No concept of "dog" or "parent"
  • Millions of numbers
Today: How do we bridge this gap?

Why Computer Vision Matters

Application What It Does
Self-driving cars Detect pedestrians, cars, signs
Medical imaging Find tumors, diagnose diseases
Phone cameras Face detection, filters
Security cameras Person detection
Manufacturing Defect detection

Today's Agenda

  1. Images as Data - How computers "see"
  2. CNNs - Neural networks for images
  3. Object Detection - Finding objects and their locations
  4. YOLO - Real-time detection

Part 1: Images as Data

What a Computer Actually Sees

What Is an Image to a Computer?

An image is just a grid of numbers!

Image Size What Computer Sees
28 × 28 grayscale 784 numbers (0-255)
224 × 224 color 150,528 numbers

Each number = brightness of one pixel

Human View vs Computer View

Same image, different perspectives:

We see Computer sees
A picture Numbers
Shapes, objects 0 to 255 values
Meaning Just math
Every photo on your phone is millions of numbers!

Grayscale Images

Each pixel = one number (0-255)

Value Meaning
0 Black
128 Gray
255 White
import numpy as np
from PIL import Image

img = Image.open('digit.png').convert('L')  # Grayscale
pixels = np.array(img)
print(pixels.shape)  # (28, 28)
print(pixels[14, 14])  # Value at center pixel

Color Images (RGB)

Each pixel = 3 numbers (Red, Green, Blue)

Color RGB Values
Red (255, 0, 0)
Green (0, 255, 0)
Blue (0, 0, 255)
White (255, 255, 255)
img = Image.open('photo.jpg').convert('RGB')  # filename is an example
pixels = np.array(img)
print(pixels.shape)  # (224, 224, 3) = Height × Width × RGB

MNIST: The "Hello World" of Vision

Handwritten digits: 28×28 grayscale

  • Each image is a single digit (0-9)
  • Just 784 numbers per image
  • Task: Classify which digit it is

This is what your phone keyboard uses for handwriting!

MNIST: What the Computer Sees

Same image, two views:

Human Computer
"It's a 7!" 784 numbers
Instant recognition Needs ML to learn

Each pixel is a number from 0 (black) to 255 (white).

The Challenge

Same dog, different positions:

Position Pixel Values
Dog on left [14, 82, 201, ...]
Dog on right [201, 55, 14, ...]

To a human: Same dog!
To a computer: Completely different!

We need neural networks that understand "dog" regardless of position.
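The position sensitivity is easy to demonstrate with NumPy (a toy 1-D sketch with made-up pixel values):

```python
import numpy as np

# Toy image row: a bright blob (the "dog") on a dark background
left = np.array([200, 210, 205, 10, 10, 10, 10, 10])
right = np.roll(left, 4)  # same blob, shifted 4 pixels to the right

print(left)                        # blob at the start
print(right)                       # blob at the end
print(int(np.sum(left != right)))  # 6 of 8 positions now disagree
```

To a pixel-by-pixel comparison the two rows are almost entirely different, even though a human would call them the "same dog".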

MNIST: Real Examples

What the computer actually sees:

Digit Human View Computer View
"3" The number 3 784 numbers (0-255)
"7" The number 7 Different 784 numbers
# Loading real MNIST data
from torchvision import datasets
mnist = datasets.MNIST('./data', download=True)
image, label = mnist[0]
print(image.size)  # (28, 28)
print(label)       # 5 (it's a digit 5!)

Why is Vision Hard?

Challenge Example
Viewpoint Front vs side
Scale Tiny vs huge
Deformation Sitting vs jumping
Occlusion Partially hidden
Lighting Bright vs dark
Background Camouflage

A good vision system must handle ALL of these!

Part 2: CNNs

Neural Networks for Images

Why Not Fully Connected Networks?

Problem 1: Too many parameters

Network Parameters
Fully connected (150,528 pixels × 1,000 hidden units) ~150 MILLION
One 3×3×3 conv filter Just 27!

Problem 2: No spatial understanding

  • FC treats every pixel independently
  • CNN looks at neighboring pixels together

The Key Idea: Look at Small Regions

Instead of looking at entire image at once...

Look at small patches and find patterns!

A small filter (like a 3x3 edge detector) slides across the entire image, detecting patterns everywhere.

Same edge detector works EVERYWHERE in the image!

Filters (Kernels)

A filter is a small grid of numbers that detects a pattern:

Filter Type What It Detects
Edge filter Boundaries between regions
Corner filter Sharp corners
Texture filter Repeating patterns

The network LEARNS which filters are useful!

Convolution: Step by Step

How a 3×3 filter slides across an image:

Image (5×5):          Filter (3×3):        Output:
[1 2 3 4 5]           [1 0 -1]
[1 2 3 4 5]     *     [1 0 -1]      =     [? ? ?]
[1 2 3 4 5]           [1 0 -1]            [? ? ?]
[1 2 3 4 5]                               [? ? ?]
[1 2 3 4 5]
  1. Place filter at top-left
  2. Multiply element-wise, sum up → one output number
  3. Slide right, repeat
  4. Slide down, repeat
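The four steps above can be written out directly in NumPy (a minimal sketch, helper name illustrative; no stride, no padding):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image: multiply element-wise, sum up."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1   # output height
    ow = image.shape[1] - kw + 1   # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.tile(np.arange(1, 6), (5, 1))   # the 5x5 image from the slide
kernel = np.array([[1, 0, -1]] * 3)        # vertical edge filter
print(conv2d(image, kernel))
# Every entry is -6: the image brightens left-to-right at a constant rate
```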

Convolution Example

Detecting vertical edges:

Input patch:     Filter:         Calculation:
[10 10 100]     [1  0 -1]       10×1 + 10×0 + 100×(-1) +
[10 10 100]  ×  [1  0 -1]   =   10×1 + 10×0 + 100×(-1) +
[10 10 100]     [1  0 -1]       10×1 + 10×0 + 100×(-1)

                            =   30 - 300 = -270 (strong edge!)

Large output magnitude (positive or negative) = strong edge detected at this location!

Why Does This Filter Detect Edges?

Look at what the filter does:

Left Column Middle Right Column
+1 (add it) 0 (ignore) -1 (subtract it)

If left = right: Sum ≈ 0 (no edge)
If left ≠ right: Sum is large (EDGE!)

Example:

  • Uniform region [100, 100, 100]: +100 - 100 = 0
  • Edge region [10, 10, 200]: +10 - 200 = -190 (strong!)

The filter computes: "Is left different from right?"

Vertical vs Horizontal Filters


Same image, different filters → different edges detected!

Edge Detection in Action


Combining filters reveals all edges in an image!

Multiple Filters, Multiple Features

Input Filters Output
1 image 32 different filters 32 feature maps

Each filter detects something different:

  • Filter 1: Vertical edges
  • Filter 2: Horizontal edges
  • Filter 3: Diagonal edges
  • Filter 4: Corners
  • ... and so on

The network learns what filters are useful for the task!

Channels: From Grayscale to Color

Input Type Channels Filter Shape
Grayscale 1 3×3×1
Color (RGB) 3 3×3×3
After Conv1 (32 filters) 32 3×3×32

Each channel gets its own filter weights, then sum them up!

CNN: The Big Picture

Convolutional Neural Network:

Image → [Conv → ReLU → Pool] → [Conv → ReLU → Pool] → FC → Output
         ↓                       ↓                      ↓
      Detect               Detect complex          Classify
      edges                patterns (ears, eyes)
Layer What It Does
Conv Apply filters to detect patterns
ReLU Add non-linearity (like before!)
Pool Shrink the image (keep important info)
FC Final classification (like before!)

Pooling: Shrinking the Image

Max Pooling: Take the maximum value from each region

Why Pool? Benefit
Reduces size Fewer parameters, faster
Translation invariance Small shifts don't matter
Keeps important features Max = strongest signal
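A 2×2 max pool can be sketched in a few lines of NumPy (helper name illustrative; assumes even height and width):

```python
import numpy as np

def max_pool_2x2(x):
    """Take the max of each non-overlapping 2x2 region."""
    h, w = x.shape
    # Group into (row-pair, 2, col-pair, 2) blocks, then max within each block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [9, 2, 1, 0],
              [3, 4, 5, 6]])
print(max_pool_2x2(x))
# [[7 8]
#  [9 6]]
```

The 4×4 input shrinks to 2×2, keeping only the strongest response in each region.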

Pooling: The Intuition

Think of it like summarizing:

Full Description Pooled Summary
"There's a strong edge at pixel (10,20), a medium edge at (10,21), a weak edge at (11,20)" "There's an edge around (10,20)"

Why max pooling specifically?

  • If ANY pixel detected a feature strongly → keep it!
  • Exact location doesn't matter, presence does
  • Cat moved 2 pixels left? Same features detected!

Pooling says: "I don't care WHERE exactly, just WHETHER!"

Stride and Padding

Two important concepts:

Concept What It Does Example
Stride How far filter moves each step Stride 2 = skip every other position
Padding Add zeros around edges Keeps output size = input size
Stride 1: Filter moves 1 pixel → larger output
Stride 2: Filter moves 2 pixels → smaller output (half size)
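Output size follows the standard formula out = (in + 2·padding − kernel) // stride + 1. A quick shape check in PyTorch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)  # batch, channels, height, width

# Stride 1, no padding: output shrinks by kernel_size - 1
a = nn.Conv2d(1, 8, kernel_size=3, stride=1)(x)
# Stride 1, padding 1: output size preserved
b = nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1)(x)
# Stride 2, padding 1: output roughly halves
c = nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1)(x)

print(a.shape, b.shape, c.shape)  # 26x26, 28x28, 14x14
```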

Receptive Field

Each neuron in later layers "sees" a larger part of the image:

Layer Receptive Field
Conv1 (3×3) 3×3 pixels
Conv2 (3×3) 5×5 pixels
Conv3 (3×3) 7×7 pixels
Deeper... Larger and larger

Deep layers see the whole image! This is how CNNs understand global context.
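For a stack of stride-1 convolutions, each layer adds kernel_size − 1 pixels to the receptive field; a tiny sketch (helper name illustrative) reproducing the table:

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1                        # a pixel sees itself
    for _ in range(num_layers):
        rf += kernel_size - 1     # each layer adds k - 1
    return rf

for n in range(1, 4):
    rf = receptive_field(n)
    print(f"Conv{n}: {rf}x{rf}")  # 3x3, 5x5, 7x7
```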

A Simple CNN in PyTorch

import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional layers
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3)
        self.pool = nn.MaxPool2d(2)

        # Classification layer
        self.fc = nn.Linear(64 * 5 * 5, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * 5 * 5)  # Flatten
        return self.fc(x)

Why CNNs Work So Well

Feature Why It Helps
Weight sharing Same filter applied everywhere = fewer parameters
Local patterns Detect edges, corners, textures locally
Hierarchy Early layers → edges; Later layers → complex shapes
Position invariance Cat on left ≈ Cat on right

The Feature Hierarchy

Why CNNs work: Each layer builds on the previous one!

  • Layer 1: Edges, lines (simple)
  • Layer 2: Corners, textures (combinations)
  • Layer 3: Parts like eyes, ears (meaningful)
  • Layer 4+: Full objects and concepts (abstract)

Key insight: Deeper = more abstract. Early layers (edges) are shared across ALL object types!

What Does the CNN "See"?

Imagine you're the network looking at a cat photo:

Layer What Activates Analogy
Layer 1 "There's a sharp edge here!" Seeing brush strokes
Layer 2 "This looks like fur texture!" Seeing patterns
Layer 3 "This could be an ear shape!" Seeing parts
Layer 4 "This has cat-like features!" Understanding
Output "95% confident: CAT" Decision

The network builds up understanding layer by layer!

Famous CNN Moment: ImageNet 2012

Before CNNs (hand-crafted features): 25.8% error
AlexNet (deep CNN):                  16.4% error  ← HUGE jump!
Human performance:                   ~5% error

This started the deep learning revolution!

ImageNet: The Game Changer

Property Value
Images 14 million+
Classes 1,000 categories
Examples Dog breeds, cars, foods, objects
Challenge Annual competition (2010-2017)

Why it mattered:

  • Large enough to train deep networks
  • Diverse enough to test generalization
  • Competition drove rapid progress!

The ImageNet Journey

Year Model Error Rate Key Innovation
2011 Hand-crafted 25.8% SIFT, HOG features
2012 AlexNet 16.4% Deep CNNs, GPU training
2014 VGG 7.3% Deeper (19 layers)
2015 ResNet 3.6% Skip connections (152 layers!)
2017 SENet 2.3% Attention mechanisms

2015: ResNet beat human-level performance!

Famous CNN Architectures

Year Architecture Key Innovation
2012 AlexNet Deep CNNs work! GPU training
2014 VGGNet Deeper = better (16-19 layers)
2015 ResNet Skip connections (152 layers!)
2019 EfficientNet Smart scaling
2020 ViT Transformers for vision

Trend: Deeper networks with clever tricks to train them!

Data Augmentation

Problem: Not enough training images?

Solution: Create variations of existing images!

Augmentation Effect
Flip Mirror horizontally
Rotate Small angle rotations
Crop Random crops
Color Brightness, contrast

Data Augmentation in PyTorch

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2),
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

# Apply during training
train_dataset = ImageFolder('train/', transform=train_transform)

Result: 1000 images become effectively 10,000+!

Transfer Learning

The big insight: CNNs trained on ImageNet learn general features!

Layer What It Learned Reusable?
Early Edges, textures Yes! (universal)
Middle Shapes, parts Mostly yes
Late Specific objects Needs fine-tuning

Transfer Learning in Practice

from torchvision import models

# Load pre-trained ResNet
model = models.resnet18(pretrained=True)

# Freeze early layers (don't train them)
for param in model.parameters():
    param.requires_grad = False

# Replace final layer for our task (e.g., 10 classes)
model.fc = nn.Linear(512, 10)

# Now train only the new layer!

Result: Train on 1000 images instead of millions!

When to Use Transfer Learning?

Your Dataset Strategy
Small (< 1000) Freeze all, train only final layer
Medium (1000-10K) Freeze early, fine-tune later layers
Large (> 10K) Fine-tune everything (or train from scratch)

Transfer learning is almost always better than training from scratch!

Transfer Learning: The School Analogy

Instead of teaching a child from scratch:

From Scratch Transfer Learning
Teach what "edge" means Already knows edges!
Teach what "shapes" are Already knows shapes!
Teach to recognize cats Just teach this part!

ImageNet pre-training gives you:

  • 10+ years of "visual education"
  • Knowledge of edges, textures, shapes
  • Just add your specific knowledge on top!

This is why models fine-tuned on 100 images beat models trained from scratch on 1000!

Part 3: Object Detection

Building Up Step by Step

Let's Build Up Gradually

We'll go step by step:

Step Task Output
1 Classification "This is a cat"
2 Localization "Cat is HERE" (one box)
3 Detection "Cat HERE, dog THERE" (multiple boxes)

Each step adds complexity. Let's master each one!

Step 1: Classification (Review)

What the neural network outputs:

Image → CNN → FC → Softmax → [0.9, 0.05, 0.05]
                              cat   dog   bird
Output Meaning
1 number per class Probability of each class
Sum = 1.0 Probabilities add up

Loss: Cross-entropy (how wrong is the class prediction?)

Step 2: Localization - Add 4 Numbers!

New idea: Predict class AND location!

Network outputs TWO things:

Output Size Meaning
Class probs C numbers What is it?
Bounding box 4 numbers Where is it?

Same CNN backbone, two output heads!

What Do the 4 Numbers Mean?

Bounding box = 4 numbers:

Number Meaning Example
x Center x-coordinate 120 pixels from left
y Center y-coordinate 80 pixels from top
w Width of box 50 pixels wide
h Height of box 40 pixels tall

Two common formats:

Format Values Used By
Center (x, y, w, h) YOLO
Corner (x1, y1, x2, y2) COCO, VOC
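Converting between the two formats is simple arithmetic (hypothetical helper names, using the example box from the table above):

```python
def center_to_corner(x, y, w, h):
    """(center x, center y, width, height) -> (x1, y1, x2, y2)"""
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

def corner_to_center(x1, y1, x2, y2):
    """(x1, y1, x2, y2) -> (center x, center y, width, height)"""
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

print(center_to_corner(120, 80, 50, 40))   # (95.0, 60.0, 145.0, 100.0)
print(corner_to_center(95, 60, 145, 100))  # (120.0, 80.0, 50, 40)
```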

Localization: Two Losses!

We need to train BOTH outputs:

Loss What It Measures Type
Class loss Wrong class? Cross-entropy
Box loss Wrong position? MSE (L2 loss)

Total = class loss + λ × box loss

λ balances the two losses (usually 1-10).

Box Loss: How Far Off?

Mean Squared Error on coordinates:

True Predicted Squared Error
x=100 x̂=105 (105 − 100)² = 25
y=80 ŷ=75 (75 − 80)² = 25
w=60 ŵ=55 (55 − 60)² = 25
h=40 ĥ=45 (45 − 40)² = 25

Total box loss: 25 + 25 + 25 + 25 = 100

Localization in PyTorch

import torch.nn as nn
import torch.nn.functional as F

class LocalizationNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = ...  # CNN backbone
        self.fc_class = nn.Linear(512, 3)   # 3 classes
        self.fc_box = nn.Linear(512, 4)     # 4 box coords

    def forward(self, x):
        features = self.cnn(x)
        class_out = self.fc_class(features)  # [batch, 3]
        box_out = self.fc_box(features)      # [batch, 4]
        return class_out, box_out

# Two losses!
class_loss = F.cross_entropy(class_pred, class_true)
box_loss = F.mse_loss(box_pred, box_true)
total_loss = class_loss + 5 * box_loss

Measuring Box Quality: IoU

How do we know if a predicted box is good?

IoU = Intersection over Union

IoU Value Meaning
1.0 Perfect match!
0.7 Good overlap
0.5 Acceptable (threshold)
0.0 No overlap at all

IoU is used everywhere: loss functions, evaluation, NMS

The Challenge: Multiple Objects!

Localization works great for ONE object.

But what if there are 3 cats and 2 dogs?

Image with 5 objects → Neural Network → ???
Problem Why It's Hard
Variable output size 1 image might have 2 objects, another has 10
Different classes Mix of cats, dogs, cars...
Overlapping objects Objects can be on top of each other

Neural networks want FIXED output size!

Naive Approach: Sliding Window

Idea: Slide a window, classify each patch.

Problems:

Issue Why It's Bad
Many sizes Small cat? Big cat? Try all!
Many positions Thousands of patches
Slow Classify each one separately

Result: ~50 seconds per image! Way too slow.

We Need a Smarter Approach!

What we want:

  • Fixed output size (neural nets like this!)
  • Find ALL objects in ONE forward pass
  • Fast enough for real-time (30+ FPS)

The insight:

Instead of sliding a window...
Divide the image into a grid!

Part 4: YOLO

You Only Look Once

YOLO's Big Idea

Divide image into S × S grid (e.g., 7×7)

Concept Meaning
Grid 49 cells covering image
Responsibility Each cell detects objects whose CENTER is in it
Output Each cell predicts boxes + classes

Rule: If object's center is in a cell, that cell predicts it!

Why Grid? The Clever Insight

Think of it like assigning responsibility:

Without Grid With Grid
"Find all objects... somehow" "Cell (3,4), is there something in you?"
Variable-length output (hard!) Fixed 7×7 output (easy!)
Complex architecture Simple regression

Each cell answers TWO questions:

  1. "Is there an object centered in me?"
  2. "If yes, what class and where exactly?"

Grid converts variable detection into fixed-size prediction!

What Does Each Cell Predict?

For EACH cell, predict:

Output Size Meaning
x, y 2 Box center (relative to cell)
w, h 2 Box width & height (relative to image)
confidence 1 P(object) × IoU
class probs C P(class | object)

Per cell: 5 + C numbers

Example: 20 classes → each cell outputs 25 numbers

Understanding Confidence Score

Confidence = "Is there an object here, and how good is my box?"

Scenario P(object) IoU Confidence
No object in cell 0 - 0
Object, perfect box 1 1.0 1.0
Object, okay box 1 0.7 0.7
Object, bad box 1 0.3 0.3

High confidence = "I'm sure there's an object AND my box is accurate"

Multiple Boxes Per Cell

What if two objects have centers in the same cell?

Solution: Each cell predicts B boxes (usually B=2)

Per Cell Output Count
Box 1: (x, y, w, h, conf) 5 numbers
Box 2: (x, y, w, h, conf) 5 numbers
Class probs: P(cat), P(dog), ... C numbers
Total B×5 + C

Example: B=2 boxes, C=20 classes → 30 numbers per cell

The Full YOLO Output

For an S × S grid with B boxes and C classes:

Output shape: S × S × (B × 5 + C)

Example: S=7, B=2, C=20

Dimension Value Meaning
7 × 7 49 cells Grid covering image
× 30 30 numbers/cell 2 boxes + 20 classes
Total 1470 numbers Full detection output!

ONE forward pass → 1470 predictions!

YOLO Predicts EVERYTHING at Once

The full YOLO pipeline:

Stage Input Output
Image 448×448×3 Raw pixels
CNN Pixels Features
FC/Conv Features 7×7×30 tensor

Output tensor: 7×7 grid × 30 numbers per cell = 1470 predictions

Key insight: Detection as a single regression problem!
One forward pass → all boxes + classes + confidences
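The output tensor can be sliced apart cell by cell. A sketch with random numbers standing in for real network output (assuming the per-cell layout [box 1, box 2, class probs]):

```python
import torch

S, B, C = 7, 2, 20
output = torch.rand(S, S, B * 5 + C)  # random stand-in for network output

cell = output[3, 4]       # all predictions for grid cell (3, 4)
box1 = cell[0:5]          # x, y, w, h, confidence
box2 = cell[5:10]         # second box
class_probs = cell[10:]   # one score per class

print(output.shape)       # torch.Size([7, 7, 30])
print(output.numel())     # 1470 numbers from one forward pass
```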

YOLO Loss Function (3 Parts!)

Total loss = Box loss + Confidence loss + Class loss

Loss What It Penalizes
Box Wrong box coordinates (x, y, w, h)
Confidence Wrong confidence scores
Class Wrong class predictions

Box Loss in Detail

Only for cells that HAVE an object:

𝟙_ij^obj = 1 when box j of cell i is responsible for an object, else 0:

λ_coord · Σ_i Σ_j 𝟙_ij^obj [ (x_i − x̂_i)² + (y_i − ŷ_i)² + (√w_i − √ŵ_i)² + (√h_i − √ĥ_i)² ]

Why √w and √h?

  • Small boxes: 2 pixel error is big deal!
  • Large boxes: 2 pixel error doesn't matter much
  • Square root makes errors more equal across sizes

Confidence Loss

Two cases:

Object cells:  Σ_i Σ_j 𝟙_ij^obj (C_i − Ĉ_i)²
Empty cells:   λ_noobj · Σ_i Σ_j 𝟙_ij^noobj (C_i − Ĉ_i)²

Case What We Want
Object present Confidence → high (close to IoU)
No object Confidence → 0

λ_noobj = 0.5 — don't penalize empty cells too much!

Class Loss

Only for cells WITH an object:

Σ_i 𝟙_i^obj Σ_c (p_i(c) − p̂_i(c))²

Simple squared error on class probabilities.

If cell has a dog:

  • Want: P(dog) = 1, P(cat) = 0, P(car) = 0
  • Penalize deviation from this

Putting It All Together

YOLO Training Summary:

For each training image:
1. Divide into 7×7 grid
2. For each object, assign to cell containing its center
3. Forward pass → get 7×7×30 predictions
4. Compute all three losses
5. Backpropagate and update weights

At test time:

  1. Forward pass (one!)
  2. Get 7×7×30 = 1470 numbers
  3. Filter low-confidence boxes
  4. Non-max suppression to remove duplicates

Why YOLO is Fast

Method Approach Speed
R-CNN 2000 region proposals, classify each ~50 sec
Fast R-CNN Share CNN features ~2 sec
YOLO One forward pass, predict all ~0.02 sec

YOLO processes 45 frames per second!

Traditional: Image → Find regions → Classify each → Slow
YOLO:        Image → CNN → All detections at once → Fast!

Anchor Boxes (Brief Intuition)

Problem: Different objects have different shapes!

Object Typical Shape
Person Tall and thin
Car Wide and short
Dog Medium, horizontal

Solution: Pre-define "anchor boxes" of different shapes

  • YOLO uses anchor boxes to predict adjustments
  • Easier to predict "adjust anchor by 10%" than "raw box from scratch"

Non-Maximum Suppression (NMS)

Problem: Multiple cells may detect the same object!

Before NMS:        After NMS:
[Dog: 0.9]         [Dog: 0.9]  ← Keep highest
[Dog: 0.8]         (removed)
[Dog: 0.7]         (removed)

Algorithm:

  1. Keep the detection with highest confidence
  2. Remove all overlapping boxes (IoU > 0.5)
  3. Repeat for remaining detections

Result: One clean box per object!
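The three-step algorithm above, as a sketch (corner-format boxes; `iou` and `nms` are hypothetical helpers defined here, not a library API):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, threshold=0.5):
    """Keep the best box, drop overlapping duplicates, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining confidence
        keep.append(best)
        order = [i for i in order    # drop boxes overlapping the winner
                 if iou(boxes[best], boxes[i]) <= threshold]
    return keep

# Three overlapping "dog" detections -> only the best survives
boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (11, 9, 61, 59)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0]
```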

NMS Intuition: Voting with Cleanup

Think of it like election results:

Without NMS With NMS
3 people claim they found the dog Best one wins
Overlapping claims Remove duplicates
Messy output Clean output

Why do duplicates happen?

  • Object spans multiple grid cells
  • Each cell predicts its own box
  • All point to roughly the same location

NMS = "Only the best detection survives!"

Using YOLO in Practice

from ultralytics import YOLO

# Load pre-trained model
model = YOLO('yolov8n.pt')

# Detect objects in an image
results = model('photo.jpg')

# Print detections
for result in results:
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        confidence = box.conf[0].item()
        class_id = int(box.cls[0].item())
        name = result.names[class_id]  # class id -> label, e.g. 'dog'
        print(f"Found {name} ({confidence:.2f}) at ({x1:.0f}, {y1:.0f}) to ({x2:.0f}, {y2:.0f})")

YOLO Model Sizes

Model Speed Accuracy Best For
YOLOv8n Fastest Lower Mobile phones
YOLOv8s Fast Medium General use
YOLOv8m Medium Good Better accuracy
YOLOv8x Slowest Best Maximum accuracy

"n" = nano (smallest), "x" = extra-large

YOLO Versions: A Quick History

Version Year Key Innovation
YOLOv1 2016 One-stage detection
YOLOv2 2016 Batch norm, anchor boxes
YOLOv3 2018 Multi-scale predictions
YOLOv4 2020 Bag of tricks
YOLOv5 2020 PyTorch, easy to use
YOLOv8 2023 State-of-the-art, anchor-free

Use YOLOv8 — it's the most modern and easy to use!

One-Stage vs Two-Stage Detectors

Type How It Works Example
Two-stage 1) Find regions, 2) Classify R-CNN, Faster R-CNN
One-stage Predict everything at once YOLO, SSD
Two-Stage One-Stage
Speed Slower Faster
Accuracy Slightly better Good enough
Use case When accuracy critical Real-time

Speed vs Accuracy Trade-off

Why does this matter?

Application Needs Model Choice
Self-driving car Real-time (30+ FPS) YOLOv8n or s
Medical diagnosis High accuracy YOLOv8x
Phone app Low battery usage YOLOv8n
Surveillance Balance YOLOv8s or m

You choose based on your constraints!

IoU Calculation: Step by Step

Ground Truth box: (30, 30) to (100, 100)
Predicted box: (50, 50) to (120, 120)

Step Calculation
1. Find intersection x: max(30,50)=50 to min(100,120)=100
y: max(30,50)=50 to min(100,120)=100
Area = 50 × 50 = 2,500
2. Find union GT area = 70×70 = 4,900
Pred area = 70×70 = 4,900
Union = 4,900 + 4,900 - 2,500 = 7,300
3. IoU 2,500 / 7,300 = 0.34

IoU = 0.34 → Not a good match (need > 0.5)
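The three steps above, written out in Python:

```python
# Step 1: intersection (overlap along x and y)
ix = min(100, 120) - max(30, 50)    # 50
iy = min(100, 120) - max(30, 50)    # 50
inter = ix * iy                     # 2,500

# Step 2: union = both areas minus the double-counted overlap
gt_area = (100 - 30) * (100 - 30)       # 4,900
pred_area = (120 - 50) * (120 - 50)     # 4,900
union = gt_area + pred_area - inter     # 7,300

# Step 3: IoU
print(round(inter / union, 2))  # 0.34
```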

Real-World Detection

Application What YOLO Detects
Self-driving cars People, cars, traffic signs
Retail stores Customers, products
Sports analysis Players, ball
Security cameras People, vehicles
Medical imaging Tumors, lesions

Detection Challenges

Challenge Why It's Hard
Small objects Few pixels, hard to see
Crowded scenes Objects overlap
Unusual angles Different from training data
Real-time speed Must process 30+ FPS
Class imbalance Rare objects (e.g., fire)

Modern detectors (YOLO v8) handle most of these well!

COCO Dataset: The Benchmark

Common Objects in Context (COCO):

Property Value
Images 330,000+
Object instances 2.5 million
Classes 80 (person, car, dog, pizza, ...)
Annotations Bounding boxes + segmentation

If your model works well on COCO, it probably works in the real world!

From Detection to Segmentation

Task Output Use Case
Classification One label "Is this a cat?"
Detection Boxes + labels "Where are all cats?"
Segmentation Pixel-level masks "Exact shape of each cat"

Segmentation = Detection's precise cousin

  • Self-driving needs to know exact road boundaries
  • Medical imaging needs exact tumor boundaries

Training Your Own Detector

Steps to train YOLO on your data:

# 1. Collect and label images (use tools like Roboflow)
# 2. Export in YOLO format

# 3. Train
from ultralytics import YOLO
model = YOLO('yolov8n.pt')  # Start from pre-trained
model.train(data='my_data.yaml', epochs=50)

# 4. Use it!
results = model('new_image.jpg')

Transfer learning: Start from pre-trained weights, fine-tune on your data!

Summary: The Vision Pipeline

Step What Happens
1. Image Grid of pixels (numbers)
2. CNN Extract features (edges → shapes → objects)
3. Detection Predict box coordinates + class
4. Output List of (box, class, confidence)

Key Takeaways

  1. Images are grids of numbers (pixels)

  2. CNNs use filters to detect patterns

    • Same filter works everywhere (weight sharing)
    • Build hierarchy: edges → shapes → objects
  3. Object Detection = Classification + Location

    • Predict 4 coordinates for each box
    • IoU measures overlap quality
  4. YOLO enables real-time detection

    • One forward pass for entire image
    • Fast enough for self-driving cars!

What We Skipped (Advanced Topics)

Topic What It Is
Convolution math Detailed filter operations
CNN architectures ResNet, VGG, EfficientNet
Segmentation Pixel-level object boundaries
Pose estimation Detecting body keypoints

You'll learn these in advanced CV courses!

You Now Understand Computer Vision!

Next: Language Models - How Machines Understand Text

Key takeaways:

  • Images = grids of pixels
  • CNNs detect patterns hierarchically
  • Detection = predict box coordinates
  • YOLO = real-time detection

Questions?