Scenario: You're a real estate agent. A client asks:
"I'm looking at a 1750 sqft house. What should I expect to pay?"
You have data from recent sales:
| Size (sqft) | Price (₹ lakhs) |
|---|---|
| 1000 | 40 |
| 1500 | 60 |
| 2000 | 80 |
| 2500 | 100 |
Can you see the pattern?
When we plot the data:
The points seem to follow a line!
Linear regression = finding the best line through the points
Every 500 sqft adds ₹20 lakhs.
| Size | Price | Pattern |
|---|---|---|
| 1000 | 40 | |
| 1500 | 60 | +500 sqft → +₹20 lakhs |
| 2000 | 80 | +500 sqft → +₹20 lakhs |
| 2500 | 100 | +500 sqft → +₹20 lakhs |
So 1750 sqft should cost... ₹70 lakhs!
You just did linear regression in your head.
The model: $\hat{y} = wx + b$

| Symbol | Name | Meaning | Our Example |
|---|---|---|---|
| $x$ | Input | Feature value | Size (sqft) |
| $\hat{y}$ | Output | Predicted value | Price |
| $w$ | Weight | Slope | 0.04 |
| $b$ | Bias | Intercept | 0 |
The "hat" on y means it's our prediction!
Weight w = 0.04 means:
"For every 1 sqft increase, price increases by ₹0.04 lakhs"
Or equivalently:
"For every 100 sqft increase, price increases by ₹4 lakhs"
The weight tells you the sensitivity — how much does output change when input changes?
Bias b = 0 means:
"A 0 sqft house would cost ₹0"
In reality, bias captures the baseline cost:
If b = 10, then even a tiny house costs at least ₹10 lakhs.
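A minimal sketch of the prediction rule, using the $w = 0.04$ from this example; the $b = 10$ call illustrates the hypothetical baseline cost above:

```python
def predict_price(size_sqft, w=0.04, b=0.0):
    """Linear model: price (in lakhs) = w * size + b."""
    return w * size_sqft + b

print(predict_price(1750))          # 70.0: the client's house
print(predict_price(1750, b=10.0))  # 80.0: a baseline cost shifts every prediction up
```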
What if price depends on more than just size?

$$\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b$$

Or in vector form (both notations are equivalent): $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$
| Symbol | Shape | Example |
|---|---|---|
| $\mathbf{x}$ | (d,) | [1500, 3, 2] — size, beds, baths |
| $\mathbf{w}$ | (d,) | [0.03, 5.0, 8.0] — learned weights |
| $b$ | scalar | -10 |
Note:

Going forward, we combine weights and bias into one vector $\theta = [b, w_1, \dots, w_d]$.

Trick: Add a column of 1s to $X$:

| Original | Augmented |
|---|---|
| $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$ | $\hat{y} = \theta^\top [1, \mathbf{x}]$ |

Now $\hat{y} = X\theta$ computes every prediction at once, and the bias is just another weight.
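The column-of-1s trick in code, a small sketch using the house sizes from this example:

```python
import numpy as np

X = np.array([[1000.], [1500.], [2000.], [2500.]])  # one feature: size
X_aug = np.column_stack([np.ones(len(X)), X])       # prepend the column of 1s

theta = np.array([0.0, 0.04])                       # theta = [b, w]
y_hat = X_aug @ theta                               # bias rides along as "weight 0"
print(y_hat)                                        # [40. 60. 80. 100.]
```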
# After training on multiple features (size, bedrooms, bathrooms):
# model.coef_      = [0.03, 5.0, 8.0]
# model.intercept_ = -10
| Feature | Weight | Interpretation |
|---|---|---|
| Size (sqft) | 0.03 | +100 sqft → +₹3 lakhs |
| Bedrooms | 5.0 | +1 bedroom → +₹5 lakhs |
| Bathrooms | 8.0 | +1 bathroom → +₹8 lakhs |
Each weight shows that feature's independent contribution to price!
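Plugging the weights from the table into the model, the arithmetic for one house looks like this:

```python
# Features and learned weights from the table above
x = [1500, 3, 2]          # size (sqft), bedrooms, bathrooms
w = [0.03, 5.0, 8.0]
b = -10

price = sum(wi * xi for wi, xi in zip(w, x)) + b
print(round(price, 2))    # 0.03*1500 + 5.0*3 + 8.0*2 - 10 = 66.0 lakhs
```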
Real data has noise — points don't fall exactly on a line.
Three candidate lines:

| Line | Equation |
|---|---|
| A | |
| B | $\hat{y} = 0.038x + 3.5$ |
| C | |
Which line is "best"?
The one with the smallest total error!
Residual = Actual − Predicted: $r_i = y_i - \hat{y}_i$
| Size | Actual | Predicted | Residual | Residual² |
|---|---|---|---|---|
| 1000 | 42 | 40 | +2 | 4 |
| 1500 | 58 | 60 | -2 | 4 |
| 2000 | 83 | 80 | +3 | 9 |
| 2500 | 97 | 100 | -3 | 9 |
Goal: Find the line whose residuals are smallest overall.

We minimize the Sum of Squared Errors (SSE):

$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
| Why Square? | Reason |
|---|---|
| Errors don't cancel | +3 and -3 both contribute positively |
| Penalizes big errors more | Error of 10 costs 100, not 10 |
| Has nice math properties | Differentiable, convex |
More commonly, we use MSE (average of squared errors):

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This is also called the Loss Function or Cost Function: $L(w, b)$.

Our goal: Find $w^*, b^* = \arg\min_{w,b} L(w, b)$.
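Plugging the residuals from the table above into these formulas:

```python
residuals = [2, -2, 3, -3]            # from the residual table above
sse = sum(r ** 2 for r in residuals)  # 4 + 4 + 9 + 9
mse = sse / len(residuals)
print(sse, mse)  # 26 6.5
```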
| Method | How It Works | When to Use |
|---|---|---|
| Normal Equation | Direct formula, one-shot | Small datasets |
| Gradient Descent | Iterative, step-by-step | Large datasets, neural nets |
Let's learn both!
| Concept | What it means | Example |
|---|---|---|
| Derivative | Rate of change (1 variable) | $\frac{df}{dx}$ |
| Partial Derivative | Rate of change w.r.t. one variable (others fixed) | $\frac{\partial L}{\partial w}$ |
| Gradient | Vector of all partial derivatives | $\nabla L = \left[\frac{\partial L}{\partial w}, \frac{\partial L}{\partial b}\right]$ |
Key insight: Gradient points in direction of steepest increase. To minimize, go opposite to gradient!
At a point $x$, the derivative $f'(x)$ is the slope of $f$ there.

Interpretation: positive slope means $f$ is increasing (step left to go down); negative slope means $f$ is decreasing (step right to go down).

Setting $f'(x) = 0$ finds the flat points: candidates for a minimum.

For simple functions like $f(x) = x^2$, we can solve this directly:
| Step | What we do | Result |
|---|---|---|
| 1 | Write the function | $f(x) = x^2$ |
| 2 | Take derivative | $f'(x) = 2x$ |
| 3 | Set = 0 | $2x = 0$ |
| 4 | Solve | $x = 0$ → Minimum! |
Can we do the same for linear regression?
Yes! Our loss is $L(\theta) = \frac{1}{n} \lVert y - X\theta \rVert^2$.

Take derivative, set to 0, solve...

Setting $\nabla_\theta L = 0$ and solving gives the Normal Equation:

$$\theta^* = (X^\top X)^{-1} X^\top y$$

How to read this: "The best parameters $\theta^*$ come straight out of one matrix formula, with no iteration."
import numpy as np
X = np.array([[1, 1000], [1, 1500], [1, 2000], [1, 2500]]) # column of 1s for bias
y = np.array([42, 58, 83, 97])
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(f"bias = {theta[0]:.1f}, weight = {theta[1]:.4f}")
# bias = 3.5, weight = 0.0380 → Line B from our plot!
Limitation: Requires matrix inversion — too slow for millions of features. We need gradient descent!
The idea: Take small steps downhill until you reach the minimum!

Update rule: $\theta \leftarrow \theta - \alpha \nabla L(\theta)$

| Symbol | Name | Meaning |
|---|---|---|
| $\alpha$ | Learning rate | How big each step is |
| $\nabla L$ | Gradient | Direction of steepest increase |
| $-\nabla L$ | Negative gradient | Direction of steepest decrease |
Let's walk through ONE step:
| Current | Value |
|---|---|
| Weight $w$ | 0.01 |
| Loss | 150 |
| Gradient | -80 |
| Learning rate $\alpha$ | 0.001 |

The update: $w \leftarrow w - \alpha \cdot \text{gradient} = 0.01 - 0.001 \times (-80) = 0.09$
Gradient was negative → we moved weight UP!
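The same single step in code, with the numbers from the table:

```python
w = 0.01
lr = 0.001
gradient = -80

w_new = w - lr * gradient   # subtracting a negative gradient moves w UP
print(round(w_new, 3))      # 0.09
```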
How should we nudge $w$ after seeing one example?

Think about one data point $(x, y)$:
| Situation | Error $(y - \hat{y})$ | What to do with $w$ |
|---|---|---|
| Predicted too low | Positive | Increase $w$ |
| Predicted too high | Negative | Decrease $w$ |
| Predicted correctly | Zero | Don't change! |
The gradient captures exactly this: $\frac{\partial L}{\partial w} = -2\,(y - \hat{y})\,x$, i.e. error × feature value (with a sign flip).
Big error + big feature = big update. Zero error = no update.
def gradient_descent(X, y, lr=0.01, epochs=1000):
    theta = np.zeros(X.shape[1])                  # Start with zeros
    for epoch in range(epochs):
        y_pred = X @ theta                        # Predictions
        error = y - y_pred                        # Residuals
        gradient = (-2 / len(y)) * (X.T @ error)  # Gradient of MSE
        theta = theta - lr * gradient             # Update!
    return theta
Just 8 lines of code! This is all of gradient descent.
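A runnable sketch that checks gradient descent against the normal equation on the house data. The rescaling of size to thousands of sqft and the lr/epochs values are choices for this sketch, picked so the iteration converges:

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=1000):
    theta = np.zeros(X.shape[1])
    for epoch in range(epochs):
        error = y - X @ theta
        theta = theta - lr * (-2 / len(y)) * (X.T @ error)
    return theta

# House data; sizes rescaled to thousands of sqft so the steps are well-behaved
X = np.array([[1, 1.0], [1, 1.5], [1, 2.0], [1, 2.5]])  # column of 1s for bias
y = np.array([42, 58, 83, 97])

theta_gd = gradient_descent(X, y, lr=0.1, epochs=20000)
theta_ne = np.linalg.inv(X.T @ X) @ X.T @ y              # normal equation
print(np.round(theta_gd, 3))  # ≈ [3.5, 38.0]
print(np.round(theta_ne, 3))  # same answer, computed in one shot
```

Note the weight is 38 per *thousand* sqft here, i.e. the same 0.038 per sqft as before.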

| Too small ($\alpha$ = 0.0001) | Just right ($\alpha$ = 0.01) | Too large ($\alpha$ = 1.0) |
|---|---|---|
| Slow convergence | Fast convergence ✓ | Diverges! |
Imagine walking down a hill blindfolded:
| Learning Rate | What Happens |
|---|---|
| Tiny steps (0.0001) | Safe, but takes forever to reach bottom |
| Normal steps (0.01) | Good progress, reach bottom reasonably |
| Giant leaps (1.0) | Overshoot, end up on the other side! |
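A tiny experiment on $f(w) = w^2$ (gradient $2w$, minimum at $w = 0$) makes the three regimes concrete. The rates here are illustrative; 1.1 is used for the "too large" case because on this particular function a step of exactly 1.0 just oscillates instead of diverging:

```python
def minimize(lr, steps=100, w0=1.0):
    """Gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - lr * (2 * w)
    return w

print(minimize(0.0001))  # barely moved from 1.0: too slow
print(minimize(0.1))     # essentially 0: just right
print(minimize(1.1))     # blown up: diverged
```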
How to choose? A common recipe: start near 0.01 and adjust by factors of 10; if the loss blows up, shrink it, and if training crawls, grow it.
| Normal Equation | Gradient Descent |
|---|---|
| One-shot computation | Iterative process |
| Exact solution | Approximate (but close enough) |
| $O(d^3)$ in the number of features $d$ | $O(nd)$ per iteration |
| Only works for linear models | Works for ANY differentiable model! |
This is the foundation of neural network training!
Different features have different scales:
| Feature | Range | Scale |
|---|---|---|
| House size | 500 - 5000 sqft | ~1000s |
| Bedrooms | 1 - 6 | ~1s |
| Age | 0 - 100 years | ~10s |
Problem: Large-scale features dominate gradient descent!
Without scaling: the loss surface is a long, narrow valley, and gradient descent zigzags slowly along it.

With scaling: the loss surface is roughly round, and gradient descent heads straight for the minimum.
Imagine training with mixed currencies:
| Feature | Value | Scale |
|---|---|---|
| Price in rupees | 50,00,000 | Millions |
| Number of rooms | 3 | Single digits |
Without scaling: The model thinks rupees matter MORE (bigger numbers!).
Standardization: Convert everything to "standard units"
Now both features speak the same language!
| Method | Formula | Result |
|---|---|---|
| Standardization | $x' = \frac{x - \mu}{\sigma}$ | Mean=0, Std=1 |
| Min-Max Scaling | $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$ | Range [0, 1] |
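Standardization is just subtract-the-mean, divide-by-std, column by column. A hand-rolled sketch (the feature values here are made up for illustration):

```python
import numpy as np

X = np.array([[1000., 2.], [1500., 3.], [2000., 3.], [2500., 4.]])  # size, bedrooms

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column
print(X_scaled.mean(axis=0))  # ≈ [0. 0.]
print(X_scaled.std(axis=0))   # [1. 1.]
```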
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization (most common)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Min-Max (when you need bounded range)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# CORRECT way:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit AND transform
X_test_scaled = scaler.transform(X_test) # Only transform!
# WRONG (data leakage!):
X_scaled = scaler.fit_transform(X) # Fitting on all data
Never fit the scaler on test data! It would leak information.
| Algorithm | Needs Scaling? | Why |
|---|---|---|
| Linear/Logistic Regression | Yes | Gradient descent |
| Neural Networks | Yes | Gradient descent |
| Decision Trees | No | Split-based, scale-invariant |
| K-Nearest Neighbors | Yes | Distance-based |
| Random Forest | No | Tree-based |
from sklearn.linear_model import LinearRegression
import numpy as np
# Our data
X = np.array([[1000], [1500], [2000], [2500]])
y = np.array([40, 60, 80, 100])
# Create and train model
model = LinearRegression()
model.fit(X, y)
# Now predict!
model.predict([[1750]]) # → 70.0 (₹70 lakhs)
print(f"Weight (w): {model.coef_[0]}") # 0.04
print(f"Intercept (b): {model.intercept_}") # 0.0
The equation it learned: $\hat{y} = 0.04x + 0$

Verify: $0.04 \times 1750 = 70$ ✓
import torch
import torch.nn as nn
# Data as tensors
X = torch.tensor([[1000.], [1500.], [2000.], [2500.]])
y = torch.tensor([[40.], [60.], [80.], [100.]])
# Normalize for stable training
X_norm = X / 1000
# Linear model: y = wx + b
model = nn.Linear(1, 1) # 1 input, 1 output
criterion = nn.MSELoss() # Loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1) # Optimizer
for epoch in range(100):
    y_pred = model(X_norm)       # 1. Forward pass
    loss = criterion(y_pred, y)  # 2. Compute loss
    optimizer.zero_grad()        # 3. Clear gradients
    loss.backward()              # 4. Compute gradients
    optimizer.step()             # 5. Update weights
After training: model.weight ≈ 40 on the normalized input (₹0.04 lakhs per sqft once the ÷1000 is undone), model.bias ≈ 0: the same line sklearn found!
The 5-step training cycle (memorize this!):
| Step | Code | What it does |
|---|---|---|
| 1 | `y_pred = model(X)` | Forward pass: compute predictions |
| 2 | `loss = criterion(y_pred, y)` | Measure error |
| 3 | `optimizer.zero_grad()` | Clear old gradients |
| 4 | `loss.backward()` | Compute new gradients |
| 5 | `optimizer.step()` | Update $\theta$ using gradients |
This exact loop works for ANY neural network - from linear regression to GPT!
| Aspect | sklearn | PyTorch |
|---|---|---|
| Simplicity | 2 lines of code | 10+ lines |
| Method | Closed-form (SVD) | Gradient descent |
| Customization | Limited | Full control |
| Neural nets | Basic only | Full support |
| GPU support | No | Yes! |
Start with sklearn, move to PyTorch when you need more power!
Scenario: You're building a spam filter.
| Email | Exclamation marks | Has "FREE" | Is Spam? |
|---|---|---|---|
| 1 | 5 | Yes | Spam |
| 2 | 0 | No | Not Spam |
| 3 | 3 | Yes | Spam |
| 4 | 1 | No | Not Spam |
The output is a category, not a number!
If we use linear regression: $\text{score} = w_1 x_1 + w_2 x_2 + b$
Problem: This gives any number (-∞ to +∞)
| Score | What does it mean? |
|---|---|
| -2.5 | ??? |
| 0.3 | ??? |
| 1.5 | ??? |
| 147 | ??? |
We need something between 0 and 1 (a probability)!
Solution: Squash any number into the range (0, 1) using the sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The sigmoid is an S-curve:
| Region | Behavior |
|---|---|
| Very negative z | Output ≈ 0 |
| z near 0 | Output changes rapidly (decision boundary) |
| Very positive z | Output ≈ 1 |
Key insight: It converts any number to a probability!
Plug in some numbers:
| Linear score $z$ | $\sigma(z)$ | Meaning |
|---|---|---|
| $z = -5$ | 0.007 | ~0% chance |
| $z = 0$ | 0.5 | 50/50 |
| $z = +5$ | 0.993 | ~100% chance |
No matter what z is, output is always between 0 and 1!
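The sigmoid is two lines of code; plugging in the scores from the table above:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for z in (-5, 0, 5):
    print(z, round(sigmoid(z), 3))  # -5 → 0.007, 0 → 0.5, 5 → 0.993
```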
Two steps:

Linear: Compute a score (same as linear regression!): $z = \mathbf{w}^\top \mathbf{x} + b$

Sigmoid: Convert to probability: $P(\text{spam}) = \sigma(z)$
Email features: 5 exclamation marks, has "FREE" (=1)

Learned weights (illustrative): $w_1 = 0.5$, $w_2 = 2.0$, $b = -1$

Step 1: Linear score: $z = 0.5 \times 5 + 2.0 \times 1 - 1 = 3.5$

Step 2: Sigmoid: $\sigma(3.5) \approx 0.97$

Decision: 97% → This is spam!
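The same two-step computation in code. The weights $w_1 = 0.5$, $w_2 = 2.0$, $b = -1$ are illustrative assumptions, not values from a trained model:

```python
import math

w1, w2, b = 0.5, 2.0, -1.0   # illustrative weights, not learned ones
x1, x2 = 5, 1                # 5 exclamation marks, has "FREE"

z = w1 * x1 + w2 * x2 + b    # Step 1: linear score
p = 1 / (1 + math.exp(-z))   # Step 2: sigmoid
print(z, round(p, 2))        # 3.5 0.97
```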

| If P(spam) | Decision | Threshold can be tuned! |
|---|---|---|
| > 0.5 | Predict SPAM | Lower → catch more spam |
| ≤ 0.5 | Predict NOT SPAM | Higher → fewer false alarms |
from sklearn.linear_model import LogisticRegression
X = [[5, 1], [0, 0], [3, 1], [1, 0]] # [exclamations, has_FREE]
y = [1, 0, 1, 0] # 1=spam, 0=not spam
model = LogisticRegression()
model.fit(X, y)
model.predict([[4, 1]]) # → [1] (spam)
model.predict_proba([[4, 1]]) # → [[0.12, 0.88]] = [P(not spam), P(spam)]
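Threshold tuning in action, using the 0.88 probability from the `predict_proba` example; the 0.9 threshold is an illustrative choice:

```python
p_spam = 0.88  # e.g. P(spam) from model.predict_proba above

for threshold in (0.5, 0.9):
    label = "SPAM" if p_spam > threshold else "NOT SPAM"
    print(threshold, label)  # 0.5 flags it as SPAM; 0.9 lets it through
```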
import torch
import torch.nn as nn

# Data as tensors: [exclamations, has_FREE] → spam label
X = torch.tensor([[5., 1.], [0., 0.], [3., 1.], [1., 0.]])
y = torch.tensor([[1.], [0.], [1.], [0.]])

# Model: Linear + Sigmoid
class LogisticRegression(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        return self.sigmoid(self.linear(x))

model = LogisticRegression(input_dim=2)

# Binary Cross-Entropy Loss (for classification)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    y_pred = model(X)            # Forward pass
    loss = criterion(y_pred, y)  # Compute loss (cross-entropy, not MSE!)
    optimizer.zero_grad()        # Backward pass
    loss.backward()
    optimizer.step()
For classification, MSE doesn't work well — Cross-Entropy penalizes confident wrong predictions severely!

Key insight: Being confident AND wrong = very high loss!
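For a positive example ($y = 1$), binary cross-entropy reduces to $-\log p$; a few values show how confident-and-wrong blows up:

```python
import math

# Binary cross-entropy for a true positive (y = 1): loss = -log(p)
for p in (0.99, 0.6, 0.01):
    print(p, round(-math.log(p), 2))
# 0.99 → 0.01 (confident & right), 0.6 → 0.51 (unsure), 0.01 → 4.61 (confident & wrong!)
```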
Problem: What if the relationship isn't linear?

Example: Ice cream sales vs temperature — clearly not a straight line!
Solution: Transform the inputs using basis functions.

Instead of: $\hat{y} = wx + b$

Use: $\hat{y} = w_1 x + w_2 x^2 + b$

| Original Feature | Basis-Expanded Features |
|---|---|
| $x$ | $[1, x, x^2]$ |

The model is still linear in the weights $w$, so all the linear-regression machinery still applies.

By adding $x^2$ as a feature, the straight line in the expanded feature space becomes a curve in the original input space.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
X = np.array([[15], [20], [25], [30], [35]]) # Temperature
y = np.array([15, 10, 20, 55, 120]) # Ice cream sales
poly = PolynomialFeatures(degree=2) # Transform x → [1, x, x²]
X_poly = poly.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y) # Now it can fit curves!
Same idea with Logistic Regression:
from sklearn.linear_model import LogisticRegression
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model = LogisticRegression()
model.fit(X_poly, y) # Now decision boundary can be curved!
Key insight: Basis functions let linear models learn nonlinear patterns!
| Concept | Key Takeaway |
|---|---|
| Linear Regression | Fit a line: $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$ |
| Loss Function | MSE measures how wrong we are |
| Gradient Descent | Step opposite the gradient: $\theta \leftarrow \theta - \alpha \nabla L$ |
| Logistic Regression | Linear + Sigmoid for classification |
| Cross-Entropy | Loss for classification |
| Basis Functions | Transform inputs for nonlinear patterns |
Linear Regression fits a line through data
Two ways to find optimal weights
Logistic Regression classifies using the sigmoid
sklearn → PyTorch uses the same concepts!