State-of-the-art in almost EVERY field:
| Domain | Application | What It Does |
|---|---|---|
| Vision | Instagram filters, Face ID | Recognizes objects, faces |
| Language | ChatGPT, Claude, Google Translate | Understands & generates text |
| Audio | Siri, Alexa, Spotify recommendations | Speech recognition, music |
| Science | AlphaFold, drug discovery | Predicts protein structures |
| Games | AlphaGo, game bots | Superhuman game play |
The same basic idea powers ALL of these.
Lecture 5A (today): Building the network
1. The Paradigm Change — why neural networks are different
2. The Perceptron — the simplest neuron
3. Logic Gates — what one neuron can do
4. The XOR Problem — where one neuron FAILS
5. Multi-Layer Perceptrons — the fix!
6. Activation Functions — why non-linearity matters
7. Forward Propagation — how data flows through
8. Softmax — multi-class output
Lecture 5B (next): Training the network
Problem: Classify handwritten digits (is this a 2? a 7?)
The old approach (before deep learning):
Image → [Human designs features] → [Classifier learns] → Prediction
Step 1: Human expert designs features:
Step 2: Feed these features into logistic regression / SVM
The features are HAND-CRAFTED — designed by a human expert
The classifier is TRAINABLE — learned from data
| Component | Who designs it? |
|---|---|
| Feature extractor | Human (slow, error-prone, domain-specific) |
| Classifier | Algorithm (learned from data) |
What's wrong with this?
EVERYTHING is trainable!
| Component | Who? |
|---|---|
| Low-level features | Learned |
| High-level features | Learned |
| Classifier | Learned |
Paradigm change: Stop hand-crafting features. Let the network learn them from data.
Old way: New problem = hire new domain expert, design new features
Neural network way: Same architecture works for everything!
| Problem | Input | Output | Same NN idea? |
|---|---|---|---|
| Digit recognition | 28x28 image | Which digit (0-9) | Yes |
| Spam detection | Email text | Spam or not | Yes |
| House price | Features | Price | Yes |
| Next word | Text so far | Next word | Yes |
The only thing that changes is the data.
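You already know logistic regression: $\hat{y} = \sigma(w_1 x_1 + \dots + w_n x_n + b)$
This is: a weighted sum of the inputs plus a bias, passed through the sigmoid function.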
Key insight: This is exactly ONE neuron!
Logistic regression = a single neuron with sigmoid activation.
A neural network = MANY of these neurons, connected together.
How your brain works:
The artificial neuron mimics this!
The first artificial neuron — inspired by biological neurons
A neuron does two things:
1. Summation: Weighted sum of inputs
2. Activation: Apply a non-linear function
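For the original perceptron, the activation is a step function: output 1 if the weighted sum exceeds the threshold, otherwise 0.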
The neuron "fires" when the weighted sum exceeds the threshold!
What happens if we use different activation functions?
| Activation | Model Name |
|---|---|
| Step function | Perceptron |
| Sigmoid | Logistic Regression! |
| Identity (linear) | Linear Regression! |
You already know two kinds of neurons!
- Linear regression = neuron with identity activation
- Logistic regression = neuron with sigmoid activation
- Perceptron = neuron with step activation
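The three models above share the same weighted sum and differ only in the activation applied to it. Here is a minimal sketch of that idea; the inputs, weights, bias, and function names are hypothetical, and the step convention (fire when the sum is strictly positive) is an assumption.

```python
import math

def neuron(x, w, b, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

def identity(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step(z):
    # Assumed convention: fire only when the weighted sum is strictly positive
    return 1 if z > 0 else 0

# Hypothetical inputs, weights, and bias (illustration only); weighted sum z = 0.5
x, w, b = [1.0, 2.0], [0.5, -0.25], 0.5

linear_out = neuron(x, w, b, identity)      # linear regression neuron
logistic_out = neuron(x, w, b, sigmoid)     # logistic regression neuron
perceptron_out = neuron(x, w, b, step)      # perceptron neuron
print(linear_out, round(logistic_out, 3), perceptron_out)
```

Swapping the `activation` argument is the only change needed to move between the three models.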
Let's see what a single neuron can compute
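Challenge: Can we find weights $w_1$, $w_2$ and a bias $b$ that make a perceptron compute AND?
The perceptron computes: $\hat{y} = \text{step}(w_1 x_1 + w_2 x_2 + b)$
Where $\text{step}(z) = 1$ if $z > 0$, and $0$ otherwise.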
| $x_1$ | $x_2$ | AND |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
Solution:
Let's verify:
| $x_1$ | $x_2$ | OR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
Solution:
Let's verify:
| $x$ | NOT |
|---|---|
| 0 | 1 |
| 1 | 0 |
Solution:
Verify:
Intuition: A negative weight means "this input DECREASES the chance of firing."
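The AND, OR, and NOT gates can each be checked with a few lines of code. The weights and biases below are an assumption: one valid choice among many, not necessarily the values used in the lecture. The function names are hypothetical as well.

```python
def perceptron(inputs, weights, bias):
    """Step-activated neuron: outputs 1 when the weighted sum is positive."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0

# Assumed weights and biases (illustrative, many other choices also work)
def and_gate(x1, x2):
    return perceptron([x1, x2], [1, 1], -1.5)   # fires only if both inputs are 1

def or_gate(x1, x2):
    return perceptron([x1, x2], [1, 1], -0.5)   # fires if at least one input is 1

def not_gate(x):
    return perceptron([x], [-1], 0.5)           # negative weight: input suppresses firing

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", and_gate(x1, x2), "OR:", or_gate(x1, x2))
print("NOT 0:", not_gate(0), "NOT 1:", not_gate(1))
```

Only the bias differs between the AND and OR sketches: lowering the threshold from 1.5 to 0.5 lets a single active input fire the neuron.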
The decision boundary is a straight line!
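| Gate | Decision Boundary | Side with $\hat{y} = 1$ |
|---|---|---|
| AND | $w_1 x_1 + w_2 x_2 + b = 0$ (with the AND weights) | Above the line |
| OR | $w_1 x_1 + w_2 x_2 + b = 0$ (with the OR weights) | Above the line |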
A single perceptron divides the input space with a straight line.
This is exactly like logistic regression's decision boundary!

| $x_1$ | $x_2$ | NAND |
|---|---|---|
| 0 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Can you find weights $w_1$, $w_2$ and a bias $b$ that compute NAND?
Hint 1: NAND = NOT(AND). What if you just flip the sign?
Hint 2: Try
Or: Compose AND → NOT (two neurons in sequence!)
| $x_1$ | $x_2$ | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Output is 1 if inputs are DIFFERENT.
Can we find weights $w_1$, $w_2$ and a bias $b$ so that a single perceptron computes XOR?
Try it! ... You'll find it's impossible.
Look at the decision boundary diagram we saw earlier:
No matter how you draw one straight line, you can't separate the 1s from the 0s!
XOR is NOT linearly separable.
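A brute-force search makes the claim concrete. The sketch below tries every combination of weights and bias on a coarse grid. The grid range and helper names are assumptions, and an exhaustive grid is evidence rather than a proof. It finds weights for AND but none for XOR.

```python
import itertools

def perceptron(x1, x2, w1, w2, b):
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def find_weights(truth_table, grid):
    """Return the first (w1, w2, b) on the grid that reproduces the table, else None."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all(perceptron(x1, x2, w1, w2, b) == y
               for (x1, x2), y in truth_table.items()):
            return (w1, w2, b)
    return None

grid = [i / 2 for i in range(-8, 9)]   # assumed search range: -4.0, -3.5, ..., 4.0
AND_TABLE = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

and_solution = find_weights(AND_TABLE, grid)
xor_solution = find_weights(XOR_TABLE, grid)
print("AND:", and_solution)
print("XOR:", xor_solution)   # None: no single line separates the XOR points
```

The algebra agrees with the search: XOR needs $b \le 0$, $w_1 + b > 0$, $w_2 + b > 0$, and $w_1 + w_2 + b \le 0$, and adding the two middle inequalities contradicts the last one.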
This was a huge deal in the 1960s!
Marvin Minsky and Seymour Papert published "Perceptrons" (1969), showing that a single-layer perceptron cannot represent XOR.
The reaction: "Neural networks are useless!"
The consequence: Funding dried up → First AI Winter (1970s-80s)
But the pessimistic reaction overlooked one thing:
They proved single-layer perceptrons are limited.
They did NOT prove multi-layer networks are limited!
The fix was already known: just add more layers.
Approach 1: Hand-Craft a New Feature
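Add feature $x_3 = x_1 \cdot x_2$ (the AND of the two inputs):
| $x_1$ | $x_2$ | $x_3 = x_1 x_2$ | XOR |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 |
Now a single neuron with inputs $x_1$, $x_2$, $x_3$ can compute XOR, because the extra feature makes the two classes linearly separable.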
But this is hand-crafting features again! We want the network to learn them.
Approach 2: Add More Neurons (the neural network way!)
What if we use TWO layers of neurons?
Input       Hidden Layer       Output
x1 ────┐
       ├──→ [h1] ───┐
x2 ────┤            ├──→ [output] ──→ ŷ
       └──→ [h2] ───┘
Two lines can separate XOR!
The hidden layer doesn't classify — it TRANSFORMS the data!
In the original space: XOR is not linearly separable.
In the hidden layer's space: it becomes separable!
Analogy: You can't cut a crumpled piece of paper with one straight cut. But if you unfold the paper first, you can!
The hidden layer unfolds the data.
A Multi-Layer Perceptron has:
| Layer | What it does |
|---|---|
| Input | Receives raw features |
| Hidden | Learns intermediate representations |
| Output | Makes the final prediction |
Each neuron in a layer connects to ALL neurons in the next layer.
Also called "fully connected" or "dense" network.
Input layer: We see the data going in.
Output layer: We see the predictions coming out.
Hidden layer: We DON'T tell it what to compute — it figures that out on its own!
Layer 1 learns: "Is there a curve in the top half?"
Layer 2 learns: "Is there a loop at the bottom?"
Layer 3 learns: "Does this look like a 2?"
The network discovers its own features!
This is the paradigm change — no hand-crafting needed.
Critical question: What happens if we stack linear neurons?
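Layer 1: $h = W_1 x + b_1$
Layer 2: $\hat{y} = W_2 h + b_2$
Substitute: $\hat{y} = W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2)$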
Still just a linear function! 100 linear layers = 1 linear layer.
Without non-linearity, depth is useless! We MUST add activation functions between layers.
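The collapse of stacked linear layers can be checked numerically. In this sketch all weights, biases, and the input are hypothetical values picked so the arithmetic is easy to follow. Two linear layers give exactly the same output as one merged layer, while inserting a ReLU between them changes the result.

```python
import numpy as np

# Hypothetical parameters: 2 inputs -> 3 hidden units -> 1 output
W1 = np.array([[1.0, -1.0],
               [0.5,  2.0],
               [-1.0, 0.0]])
b1 = np.array([0.0, 1.0, 0.5])
W2 = np.array([[1.0, 1.0, 1.0]])
b2 = np.array([0.0])
x = np.array([1.0, 2.0])

# Two stacked linear layers (no activation in between)
two_layers = W2 @ (W1 @ x + b1) + b2

# The same computation as a single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

# With a ReLU in between, the network no longer collapses
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2

print(two_layers, one_layer, with_relu)
```

The ReLU zeroes the negative hidden values before the second layer sees them, which is what a single merged matrix cannot reproduce.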

Think of it like paper folding:
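| Activation | Formula | Range | Used Where |
|---|---|---|---|
| Step | $1$ if $z > 0$, else $0$ | 0 or 1 | Original perceptron |
| Sigmoid | $\frac{1}{1 + e^{-z}}$ | $(0, 1)$ | Output (binary) |
| Tanh | $\frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ | $(-1, 1)$ | Hidden layers (older) |
| ReLU | $\max(0, z)$ | $[0, \infty)$ | Hidden layers (modern!) |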

def relu(z):
    return max(0, z)
| Input $z$ | ReLU$(z)$ | Gradient |
|---|---|---|
| -5.0 | 0 | 0 |
| -0.1 | 0 | 0 |
| 0.1 | 0.1 | 1 |
Why ReLU won: it is just $\max(0, z)$, which is cheap to compute, and its gradient is exactly 1 for positive inputs.
Hidden layer: reuse our AND and OR perceptrons!
Output layer: OR but NOT AND = XOR!
| $x_1$ | $x_2$ | $h_1$ (OR) | $h_2$ (AND) | Output (XOR) |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 ✓ |
| 0 | 1 | 1 | 0 | 1 ✓ |
| 1 | 0 | 1 | 0 | 1 ✓ |
| 1 | 1 | 1 | 1 | 0 ✓ |
We'll also solve this with PyTorch in the notebook — and let PyTorch LEARN the weights!
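Before letting a library learn the weights, the two-layer construction can be written out by hand. This sketch assumes one workable set of weights and biases, not necessarily the lecture's: an OR neuron and an AND neuron in the hidden layer, and an output neuron that fires for "OR but not AND".

```python
def step(z):
    return 1 if z > 0 else 0

def xor_network(x1, x2):
    """Two-layer perceptron for XOR with hand-set (assumed) weights."""
    h1 = step(1 * x1 + 1 * x2 - 0.5)    # hidden neuron 1: OR
    h2 = step(1 * x1 + 1 * x2 - 1.5)    # hidden neuron 2: AND
    y = step(1 * h1 - 1 * h2 - 0.5)     # output: OR but NOT AND
    return h1, h2, y

for x1 in (0, 1):
    for x2 in (0, 1):
        h1, h2, y = xor_network(x1, x2)
        print(f"x1={x1} x2={x2} -> h1={h1} h2={h2} output={y}")
```

The negative weight on `h2` is what turns the AND neuron into a veto when both inputs are on.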
The hidden layer TRANSFORMED the data!
| Original space | Hidden space |
|---|---|
| Not separable | Separable |
In the hidden layer's space:
Now a single line CAN separate them!
Key insight: Hidden layers transform data into a space where it becomes linearly separable!

Raw Data → [Hidden Layer 1] → [Hidden Layer 2] → ... → [Output]
           Transform data     Transform again          Classify
           (simple features)  (complex features)       (final decision)
| Depth | What It Can Learn | Example |
|---|---|---|
| 0 hidden layers | Linear boundaries | Logistic regression |
| 1 hidden layer | Simple curves | XOR, basic patterns |
| 2-5 hidden layers | Complex patterns | Image classification |
| 10+ hidden layers | Very complex | GPT, modern AI |
Deeper = more complex transformations = more powerful
Data flows left → right through the network:
Input → [Multiply + Add] → [Activation] → [Multiply + Add] → [Activation] → Output
        (weights, bias)       (ReLU)      (weights, bias)     (sigmoid)
That's it! Each layer does two things: a linear step (multiply by the weights, add the bias), then an activation.
Network: 2 inputs → 2 hidden (ReLU) → 1 output (sigmoid)
All weights and biases:
| | Neuron $h_1$ | Neuron $h_2$ |
|---|---|---|
| Weight from $x_1$ | | |
| Weight from $x_2$ | | |
| Bias | | |
Output neuron: weights
Input:
Follow along on paper or in the notebook!
Step 1: Hidden layer — weighted sums
Step 2: ReLU activation
Step 3: Output neuron — weighted sum
Step 4: Sigmoid activation
Prediction: 59% probability of class 1.
That's the entire forward pass! Just multiply, add, activate, repeat.
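The four steps translate directly into NumPy. The weights, biases, and input below are hypothetical values chosen for easy arithmetic; they are not the numbers from the worked example, so the final probability differs from the 59% above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for a 2 -> 2 (ReLU) -> 1 (sigmoid) network
x = np.array([1.0, 2.0])
W1 = np.array([[0.5, -1.0],    # weights into hidden neuron 1
               [1.0,  0.5]])   # weights into hidden neuron 2
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, 0.5])      # weights into the output neuron
b2 = -0.5

z1 = W1 @ x + b1           # Step 1: hidden weighted sums -> [-1.5, 1.0]
h = np.maximum(0, z1)      # Step 2: ReLU                 -> [0.0, 1.0]
z2 = W2 @ h + b2           # Step 3: output weighted sum  -> 0.0
y_hat = sigmoid(z2)        # Step 4: sigmoid              -> 0.5

print(z1, h, z2, y_hat)
```

Note how the ReLU silences the first hidden neuron entirely, so only the second one contributes to the output.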
| Connection | Weights | Biases | Total |
|---|---|---|---|
| 2 inputs → 2 hidden | 2 × 2 = 4 | 2 | 6 |
| 2 hidden → 1 output | 2 × 1 = 2 | 1 | 3 |
| Total | | | 9 parameters |
For MNIST (784 → 128 → 10): 101,770 parameters!
The network needs to learn all of these from data.
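The parameter counts above follow one rule per layer: inputs times outputs for the weights, plus one bias per output neuron. A short sketch reproduces both totals; the helper name `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Total weights plus biases for a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + one bias per output neuron
    return total

small = count_parameters([2, 2, 1])        # the toy network
mnist = count_parameters([784, 128, 10])   # the MNIST network
print(small, mnist)   # 9 101770
```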
For MNIST (10 digits), we need 10 outputs:
Softmax converts raw scores $z_1, \dots, z_K$ to probabilities: $\text{softmax}(z)_i = \dfrac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
Example: Raw scores
All probabilities sum to 1! Predicted class = argmax = class 0.

MNIST digit classifier outputs raw scores for each digit:
| Digit | Raw Score $z$ | $e^{z}$ | Probability |
|---|---|---|---|
| 0 | 0.5 | 1.65 | 5.5% |
| 1 | 0.2 | 1.22 | 4.1% |
| 2 | 3.1 | 22.2 | 74.0% |
| 3 | 1.0 | 2.72 | 9.1% |
| 4 | -0.5 | 0.61 | 2.0% |
| 5-9 | ... | ... | 5.3% |
Prediction: It's a 2 with 74% confidence!
Key: Softmax amplifies the largest score and suppresses the rest.
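Softmax itself is only a few lines: exponentiate, then normalize. The sketch below uses three hypothetical scores rather than the table's values. Subtracting the maximum score first is a common numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # stability shift; does not change the output
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])      # hypothetical raw scores
probs = softmax(scores)
predicted_class = int(np.argmax(probs))

print(np.round(probs, 3), probs.sum(), predicted_class)
```

The largest score takes roughly two thirds of the probability mass here, even though it is only one point above the runner-up.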
What you've learned:
We built the architecture. We can compute the forward pass.
But all the weights were GIVEN to us!
# We just made these up!
W1 = np.array([[0.2, 0.4], [-0.5, 0.1], [0.3, -0.2]])
The critical question:
How do we FIND good weights from data?
Answer: Training! (Next lecture)
The network starts with random weights → makes terrible predictions → gradually improves.
Part 1 gave us:
The problem:
Today's question: How do we find good weights from data?
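import numpy as np

def forward(x, W1, b1, W2, b2):
    h = np.maximum(0, W1 @ x + b1)   # hidden layer with ReLU
    z = W2 @ h + b2                  # raw scores for the 10 digits
    e = np.exp(z - z.max())          # softmax (shifted for stability)
    return e / e.sum()

# Random weights!
W1 = np.random.randn(128, 784) * 0.01
b1 = np.zeros(128)
W2 = np.random.randn(10, 128) * 0.01
b2 = np.zeros(10)

# Show a "7" to the network (placeholder pixels stand in for a real image here)
image_of_7 = np.random.rand(784)
prediction = forward(image_of_7, W1, b1, W2, b2)
# → roughly [0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]
# "I have no idea: every digit is equally likely"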
We need the network to output [0, 0, 0, 0, 0, 0, 0, 1, 0, 0] for a "7".
How far off are we? → That's what the loss function measures.
You already know these from Lecture 3!
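For regression (predicting a number): mean squared error, $L = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
For classification (predicting a class): cross-entropy, $L = -\sum_{c} y_c \log \hat{y}_c$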
Cross-entropy punishes confident wrong predictions HARD:
| Prediction for correct class | Cross-entropy loss |
|---|---|
| 99% confident (good!) | 0.01 |
| 50% confident (meh) | 0.69 |
| 1% confident (terrible!) | 4.6 |
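The numbers in the table are the negative natural log of the probability assigned to the correct class. This minimal sketch reproduces them for a single example; the function name is hypothetical.

```python
import math

def cross_entropy(p_correct):
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(p_correct)

losses = {p: cross_entropy(p) for p in (0.99, 0.50, 0.01)}
for p, loss in losses.items():
    print(f"p = {p:.2f} -> loss = {loss:.2f}")
```

Going from 50% to 1% confidence multiplies the loss by more than six, which is the "punishes HARD" behaviour described above.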
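Training = finding weights that minimize the loss: $\theta^* = \arg\min_{\theta} L(\theta)$
where $\theta$ is the collection of all weights and biases in the network.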
For our MNIST network: 101,770 parameters to optimize!
How? The same tool you learned in Lecture 3: gradient descent.
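But how do we compute the gradient $\nabla_{\theta} L$?
We need to know: For each weight, how much does the loss change if I change this weight a tiny bit?
Naive approach: Change each weight by a tiny amount $\epsilon$, re-run the forward pass, and see how much the loss moved.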
For 100K parameters → 100K forward passes. Way too slow!
Backpropagation: Compute ALL gradients in ONE backward pass.
Every forward pass is a chain of operations:
x → [multiply by W1] → z1 → [ReLU] → h → [multiply by W2] → z2 → [sigmoid] → ŷ → [loss] → L
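The chain rule lets us work backward: $\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1}$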
Each factor is simple! We just multiply them together going backward.
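A scalar version of the chain shows both ideas side by side: the backward pass multiplies one local derivative per arrow, and the naive "nudge the weight" estimate agrees with it. This is a sketch under assumptions: each layer is a single number with no bias, the loss is binary cross-entropy, and all values are hypothetical.

```python
import math

def forward(w1, w2, x, y):
    z1 = w1 * x
    h = max(0.0, z1)                       # ReLU
    z2 = w2 * h
    y_hat = 1.0 / (1.0 + math.exp(-z2))    # sigmoid
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return z1, h, z2, y_hat, loss

x, y, w1, w2 = 1.5, 1.0, 0.8, -0.4         # hypothetical input, label, weights
z1, h, z2, y_hat, loss = forward(w1, w2, x, y)

# Backward pass: one local derivative per arrow in the chain
dL_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)
dyhat_dz2 = y_hat * (1 - y_hat)
dz2_dh = w2
dh_dz1 = 1.0 if z1 > 0 else 0.0
dz1_dw1 = x
grad_w1 = dL_dyhat * dyhat_dz2 * dz2_dh * dh_dz1 * dz1_dw1

# Naive check: nudge w1 up and down and re-run the forward pass
eps = 1e-6
loss_plus = forward(w1 + eps, w2, x, y)[4]
loss_minus = forward(w1 - eps, w2, x, y)[4]
numeric_grad = (loss_plus - loss_minus) / (2 * eps)

print(grad_w1, numeric_grad)
```

The naive check costs two extra forward passes for this one weight, whereas the backward pass reuses the same intermediate factors for every weight in the network.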
Think of it like tracing responsibility in a company:
CEO made a bad decision (high loss)
↓ Who contributed?
VP of Sales contributed 60%, VP of Engineering 40%
↓ Within Engineering?
Backend team 70%, Frontend team 30%
↓ Within Backend?
Alice's code contributed 50%, Bob's 50%
Backprop = assigning blame to each weight for the final error.
The beautiful thing about modern deep learning:
# PyTorch computes ALL gradients automatically!
# Forward pass
predictions = model(images)
loss = criterion(predictions, labels)
# Backward pass — ONE line!
loss.backward() # Computes gradients for ALL parameters
# That's it. PyTorch traced the computation graph
# and applied the chain rule automatically.
This is called "automatic differentiation" (autograd).
PyTorch builds the computational graph during the forward pass, then walks backward through it.
import torch
# Create a tensor that tracks gradients
x = torch.tensor([2.0], requires_grad=True)
# Forward: compute y = x² + 3x
y = x**2 + 3*x # y = 4 + 6 = 10
# Backward: compute dy/dx
y.backward()
print(x.grad) # tensor([7.])
# Because dy/dx = 2x + 3 = 2(2) + 3 = 7 ✓
PyTorch figured out the derivative automatically!
Now imagine this for a network with millions of operations — same idea, just bigger graph.
┌─────────────────────────────────────────┐
│ for each batch of training data: │
│ │
│ 1. FORWARD: predictions = model(inputs)│
│ 2. LOSS: loss = how_wrong(pred, true)│
│ 3. BACKWARD: loss.backward() (gradients)│
│ 4. UPDATE: weights -= lr * gradients │
│ │
│ Repeat until loss is small enough │
└─────────────────────────────────────────┘
That's it. This is how EVERY neural network trains. From a 2-neuron XOR network to GPT-4.
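import torch
import torch.nn as nn

# 1. Define the network
model = nn.Sequential(
    nn.Linear(784, 128),  # Input → Hidden
    nn.ReLU(),            # Activation
    nn.Linear(128, 10)    # Hidden → Output (10 digits)
)

# 2. Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# 3. Train! (train_loader is assumed to yield flattened image batches and labels)
for epoch in range(10):
    for images, labels in train_loader:
        output = model(images)            # forward pass
        loss = criterion(output, labels)  # compute loss
        optimizer.zero_grad()             # clear previous gradients
        loss.backward()                   # backprop
        optimizer.step()                  # update weights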
| Line | What It Does | Why |
|---|---|---|
| `model(images)` | Forward pass | Get predictions |
| `criterion(output, labels)` | Compute loss | How wrong are we? |
| `optimizer.zero_grad()` | Clear previous gradients | Gradients accumulate by default |
| `loss.backward()` | Backprop | Compute all gradients |
| `optimizer.step()` | Update weights | Move toward better weights |
Why `zero_grad()`? PyTorch accumulates gradients. Without clearing, old gradients add up with new ones, which is usually not what we want.
The learning rate controls how big a step each weight update takes.
| Learning Rate | What Happens |
|---|---|
| Too small | Very slow training, might get stuck |
| Just right | Steady progress toward minimum |
| Too large | Unstable! Loss may explode |
Analogy: Walking down a mountain in fog.
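The three regimes are easy to reproduce on a toy problem. This sketch runs gradient descent on $f(w) = w^2$ from $w = 1$. The function and the three learning rates are hypothetical choices for illustration, not values from the lecture.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2, whose gradient is 2*w, starting from w."""
    for _ in range(steps):
        w -= lr * 2 * w
    return w

too_small = gradient_descent(0.01)   # barely moves toward the minimum at 0
just_right = gradient_descent(0.1)   # gets close to 0
too_large = gradient_descent(1.1)    # overshoots further on every step

for name, w in [("too small", too_small), ("just right", just_right), ("too large", too_large)]:
    print(f"{name}: w after 20 steps = {w:.4f}")
```

With the largest rate every update jumps past the minimum and lands further away than it started, so the distance to the minimum grows instead of shrinking.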
How many examples per weight update?
| Method | Examples per Update | Trade-off |
|---|---|---|
| Batch GD | ALL training data | Stable but slow (one update per epoch) |
| SGD | 1 example | Fast but very noisy |
| Mini-batch | 32-256 examples | Best of both worlds! |
Why mini-batch wins:
Common batch sizes: 32, 64, 128, 256
| Term | Meaning | Example (10K images, batch=100) |
|---|---|---|
| Batch | One group of examples | 100 images |
| Iteration | One weight update (one batch) | 1 update |
| Epoch | One pass through ALL data | 100 iterations |
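The batch, iteration, and epoch arithmetic can be seen by slicing a shuffled index array. This is a sketch: the generator name is hypothetical, and a real project would usually rely on a data loader instead.

```python
import numpy as np

def minibatches(n_examples, batch_size, rng):
    """Yield index arrays that cover a shuffled dataset exactly once (one epoch)."""
    order = rng.permutation(n_examples)
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(minibatches(10_000, 100, rng))

iterations_per_epoch = len(batches)
print(iterations_per_epoch, len(batches[0]))   # 100 iterations, 100 examples each
```

Reshuffling at the start of each epoch gives every weight update a different mix of examples.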
Typical training: 10–100 epochs
Epoch 1: Loss = 2.30 Accuracy = 10% (random guessing!)
Epoch 5: Loss = 0.50 Accuracy = 85% (learning!)
Epoch 20: Loss = 0.08 Accuracy = 97% (pretty good!)
Epoch 50: Loss = 0.01 Accuracy = 99% (excellent!)
Good signs:
Training loss ↓ AND Validation loss ↓ → Learning well!
Bad sign — Overfitting:
Training loss ↓ BUT Validation loss ↑ → Memorizing, not learning!
| Epoch | Train Loss | Val Loss | Diagnosis |
|---|---|---|---|
| 1 | 2.3 | 2.3 | Starting |
| 10 | 0.1 | 0.3 | Training well |
| 50 | 0.001 | 0.5 | Overfitting! |
When val loss starts going up → stop training! (Early stopping)
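Early stopping is often implemented with a "patience" counter: keep the best validation loss seen so far and stop once it has failed to improve for a few epochs in a row. The sketch below works on a list of hypothetical validation losses; the function name and the patience value are assumptions.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, stopping after `patience` epochs without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break   # validation loss keeps rising: stop training
    return best_epoch

# Hypothetical validation losses: improving at first, then overfitting
val_losses = [2.3, 1.1, 0.6, 0.35, 0.30, 0.33, 0.41, 0.50]
best = early_stopping_epoch(val_losses)
print("Best epoch (0-indexed):", best)
```

In a real training loop the model weights from the best epoch would be saved and restored when training stops.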
For image recognition:
| Layer | What It Learns | Example |
|---|---|---|
| Layer 1 | Edges, gradients | ─, \|, \\, / |
| Layer 2 | Corners, curves | ⌐, ⌞, ⌒ |
| Layer 3 | Parts | eyes, wheels, loops |
| Layer 4+ | Objects | faces, cars, digits |
Each layer builds on the previous one:
The network discovers this hierarchy automatically from data!
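Universal Approximation Theorem: a neural network with ONE hidden layer (with enough neurons and a non-linear activation) can approximate ANY continuous function on a bounded input range as closely as we like.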
This is incredible! But there's a catch:
| What the theorem says | Reality |
|---|---|
| Any continuous function can be represented | May need millions of neurons |
| One hidden layer suffices | Deeper is more efficient |
| Solution exists | Finding it may be hard |
In practice: We use deeper networks (more layers, fewer neurons per layer) rather than one giant layer.
Deep networks learn hierarchical features — much more efficient!
| Approach | Example | Trade-off |
|---|---|---|
| Wide | 1 layer x 10,000 neurons | Can represent anything, but hard to train |
| Deep | 10 layers x 100 neurons | Learns hierarchical features, easier to train |
Why deep beats wide:
Each layer reuses what the previous layer learned. That's efficient!
A wide network has to learn everything from scratch in one shot.