from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# X, y: your feature matrix and labels
model = RandomForestClassifier(n_estimators=100)
# Run 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Fold scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")
Fold scores: [0.82, 0.85, 0.84, 0.81, 0.83]
Mean: 0.830
Std: 0.014
Report as: 83.0% ± 1.4%
Problem: If classes are imbalanced, random folds may end up with different class ratios.
Solution: Stratified K-Fold ensures each fold has the same class distribution as the full dataset.
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Each fold has the same % of positive/negative examples
Use stratified CV for classification problems.
| Data Type | Problem | Solution |
|---|---|---|
| Time series | Future data leaks into past | TimeSeriesSplit |
| Grouped data | Same patient in train & test | GroupKFold |
| Very small data | K folds too small | Leave-One-Out CV |
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
# Split 1: Train on [1], Test on [2]
# Split 2: Train on [1,2], Test on [3]
# Split 3: Train on [1,2,3], Test on [4]
# ...
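GroupKFold works the same way; a minimal sketch, assuming a hypothetical groups array with one patient ID per row:
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # The same patient never lands in both train and test
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]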
Total Error = Bias² + Variance + Irreducible Noise
| | High Bias | High Variance |
|---|---|---|
| Meaning | Model too simple | Model too complex |
| Symptom | Underfitting | Overfitting |
| Train error | High | Low |
| Test error | High | High |
| Example | Linear model on curved data | Deep tree on small data |
Key insight: You cannot minimize both simultaneously. The goal is to find the sweet spot.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Train: {train_acc:.1%}, Test: {test_acc:.1%}")
| Train Acc | Test Acc | Diagnosis | Fix |
|---|---|---|---|
| 70% | 68% | Underfitting | More complex model |
| 99% | 75% | Overfitting | Regularization, more data |
| 85% | 83% | Good fit | You're done! |
A large gap between train and test accuracy is the signature of overfitting.
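To watch the gap appear, a minimal sketch sweeping tree depth (assumes the train/test split from above):
from sklearn.tree import DecisionTreeClassifier
# The train/test gap widens as the model grows more complex
for depth in [1, 3, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"depth={depth}: train={tree.score(X_train, y_train):.1%}, "
          f"test={tree.score(X_test, y_test):.1%}")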
Fixes for overfitting:
- More training data - best solution if available
- Regularization - penalize complexity:
  LogisticRegression(C=0.1)  # Smaller C = more regularization
- Simpler model - fewer parameters:
  DecisionTreeClassifier(max_depth=5)  # Limit tree depth
- Early stopping - stop training before overfitting sets in (see the sketch after this list)
- Dropout (neural networks) - randomly drop neurons during training
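A minimal early-stopping sketch using XGBoost's scikit-learn API (assumes xgboost is installed and a held-out X_val/y_val split; the keyword's location varies across xgboost versions):
from xgboost import XGBClassifier
# Stop adding trees once the validation score stalls for 10 rounds
model = XGBClassifier(n_estimators=1000, early_stopping_rounds=10)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])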
Fixes for underfitting:
- More complex model - e.g. from linear to polynomial (see the pipeline sketch after this list):
  PolynomialFeatures(degree=3)
- More features - engineer better inputs
- Less regularization:
  LogisticRegression(C=10)  # Larger C = less regularization
- Train longer (neural networks)
- Remove noise from data
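A minimal sketch wiring polynomial features into a pipeline (assumes numeric features; max_iter raised as a precaution):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression
# Polynomial features let a linear model fit curved boundaries
model = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LogisticRegression(C=10, max_iter=1000),
)
model.fit(X_train, y_train)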
Complexity vs Accuracy:
5. Deep Neural Network ← Only if data is huge
4. Gradient Boosting ← Often best for tabular
3. Random Forest ← Great default
2. Logistic Regression ← Start here
1. Majority Class ← Your baseline floor
Rule: Climb one step at a time. Only go up if the improvement justifies complexity.
from sklearn.dummy import DummyClassifier
# Always predict most common class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"Baseline: {dummy.score(X_test, y_test):.1%}")
If 70% of movies succeed, predicting "success" always = 70% accuracy.
Any real model must beat this!
Idea: Weighted sum of features → probability
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.1%}")
# Interpretable: see feature weights
print(f"Weights: {model.coef_}")
Pros: Fast, interpretable, works well for linearly separable data
Idea: Sequence of if-else rules
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)
Visualization:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plot_tree(tree, feature_names=feature_names, filled=True)
plt.show()
Pros: Interpretable, handles non-linear relationships
Cons: High variance (overfits easily)
Ensemble: Train many trees and take the majority vote.
Bagging (Bootstrap Aggregating): each tree trains on a random sample of the rows, drawn with replacement.
Feature randomness: each split considers only a random subset of the features, so the trees decorrelate.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=100,  # Number of trees
    max_depth=10,      # Limit tree depth
    random_state=42
)
rf.fit(X_train, y_train)
# Which features matter most?
importances = rf.feature_importances_
for name, imp in sorted(zip(feature_names, importances),
                        key=lambda x: -x[1])[:5]:
    print(f"{name}: {imp:.3f}")
budget: 0.25
star_power: 0.18
genre_action: 0.12
runtime: 0.10
is_sequel: 0.08
Useful for: Feature selection, model interpretation, debugging
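Putting the ladder together, a minimal sketch comparing the rungs with the same 5-fold CV (assumes X, y from earlier):
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Compare each rung of the ladder with identical folds
for name, clf in [
    ("Majority class", DummyClassifier(strategy="most_frequent")),
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Random Forest", RandomForestClassifier(n_estimators=100)),
]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.1%} ± {scores.std():.1%}")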
The manual way:
1. Try Logistic Regression... okay
2. Try Random Forest... better
3. Try XGBoost... similar
4. Try different hyperparameters...
5. Try feature engineering...
6. Repeat for days...
AutoML automates this entire process.
Automatically searches through models, hyperparameters, and ensembles.
from autogluon.tabular import TabularPredictor
# Just point to your data
predictor = TabularPredictor(label='target_column')
predictor.fit(train_data)
# Predict
predictions = predictor.predict(test_data)
What happens inside? The leaderboard shows every model it trained:
predictor.leaderboard(test_data)
model score_val fit_time
0 WeightedEnsemble_L2 0.873 180s
1 CatBoost 0.856 60s
2 LightGBM 0.851 40s
3 XGBoost 0.848 55s
4 RandomForest 0.832 30s
5 LogisticRegression 0.789 10s
The ensemble combines the best models!
# Quick run (5 minutes)
predictor.fit(train_data, time_limit=300)
# Full run (1 hour)
predictor.fit(train_data, time_limit=3600)
| Time | What AutoGluon Does |
|---|---|
| 1 min | Basic models (LR, RF) |
| 5 min | + XGBoost, LightGBM |
| 30 min | + Neural nets, tuning |
| 2+ hours | Full tuning, multi-layer stacking |
# Different quality levels
predictor.fit(train_data, presets='medium_quality')
| Preset | Speed | Accuracy | Use Case |
|---|---|---|---|
| best_quality | Slow | Highest | Competitions |
| high_quality | Medium | High | Production |
| good_quality | Fast | Good | Prototyping |
| medium_quality | Faster | Decent | Quick tests |
Good for: tabular data, quick strong baselines, and projects without time for hand-tuning.
Be careful: long runtimes, large ensembles that are hard to interpret, and the temptation to skip understanding your data.
| | Train from Scratch | Transfer Learning |
|---|---|---|
| Data needed | Millions | Hundreds |
| Training time | Days/weeks | Minutes/hours |
| Hardware | Multiple GPUs | 1 GPU or CPU |
| Expertise | High | Low |
Key insight: Someone already trained on massive data. Use their work!
For Images (ResNet, ViT):
| Layer | What It Learns |
|---|---|
| Early | Edges, textures |
| Middle | Shapes, parts |
| Late | Objects, scenes |
For Text (BERT, RoBERTa):
| Layer | What It Learns |
|---|---|
| Early | Word meanings |
| Middle | Syntax, grammar |
| Late | Context, semantics |
Lower layers = General (reusable)
Higher layers = Task-specific (replace)
| | Feature Extraction | Fine-Tuning |
|---|---|---|
| Pretrained layers | Frozen | Trained (slowly) |
| New head | Trained | Trained |
| Speed | Fast | Slower |
| Data needed | Less | More |
| Accuracy | Good | Better |
Start with feature extraction. Fine-tune if you need more accuracy.
import timm
import torch
# Load pretrained model
model = timm.create_model('resnet50', pretrained=True, num_classes=10)
# Freeze all layers except the head
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True
# Now train only the final layer
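If feature extraction isn't enough, a minimal fine-tuning sketch (an assumption here: discriminative learning rates, with the backbone updated 10x slower than the new head):
import torch
# Unfreeze everything, but update the pretrained layers gently
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW([
    {"params": [p for n, p in model.named_parameters()
                if not n.startswith("fc")], "lr": 1e-5},  # backbone: small steps
    {"params": model.fc.parameters(), "lr": 1e-4},        # new head: larger steps
])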
timm has 700+ pretrained models!
| Model | Size | Accuracy (ImageNet) | Speed |
|---|---|---|---|
| ResNet-50 | 25M | 76% | Fast |
| EfficientNet-B0 | 5M | 77% | Fast |
| ViT-B/16 | 86M | 81% | Medium |
| ConvNeXt-Base | 89M | 84% | Medium |
Recommendation: start with EfficientNet-B0 (small and fast); move to ViT or ConvNeXt only if you need the extra accuracy.
from torchvision import transforms, datasets
import timm
# Load pretrained model
model = timm.create_model('efficientnet_b0', pretrained=True, num_classes=5)
# Data transforms
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
# Load your data
dataset = datasets.ImageFolder('movie_posters/', transform=transform)
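A minimal training-loop sketch to finish the example (assumes the feature-extraction freezing shown earlier, so only unfrozen parameters are updated):
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
loader = DataLoader(dataset, batch_size=32, shuffle=True)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
model.train()
for epoch in range(3):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()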
from transformers import pipeline
# Load pretrained sentiment classifier
classifier = pipeline("sentiment-analysis")
# Use immediately - no training!
result = classifier("This movie was fantastic!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]
Zero training required for many tasks!
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments
)
# Load pretrained model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Fine-tune on your data
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./results", num_train_epochs=3),
    train_dataset=train_dataset,
)
trainer.train()
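One detail the snippet glosses over: Trainer needs tokenized inputs. A minimal sketch, assuming a hypothetical HuggingFace datasets object raw_dataset with "text" and "label" columns:
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")
# raw_dataset is a hypothetical datasets.Dataset; map() tokenizes in batches
train_dataset = raw_dataset.map(tokenize, batched=True)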
| Model | Size | Best For |
|---|---|---|
| BERT-base | 110M | General NLP tasks |
| DistilBERT | 66M | Faster, 97% of BERT quality |
| RoBERTa | 125M | Better than BERT |
| DeBERTa | 140M | State-of-the-art |
Recommendation: start with DistilBERT for speed; switch to DeBERTa if you need the extra accuracy.
No training at all - just describe your classes!
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
text = "The new iPhone has amazing battery life"
labels = ["technology", "sports", "politics", "entertainment"]
result = classifier(text, labels)
print(result['labels'][0]) # "technology"
Connection to Week 6: Similar to LLM classification, but smaller/faster models.
What type of data?
│
├── Tabular (spreadsheet) ──► AutoML (AutoGluon)
│
├── Images ──► Transfer Learning (timm, ResNet, ViT)
│
├── Text ──► Transfer Learning (HuggingFace, BERT)
│
└── Audio ──► Transfer Learning (Whisper, Wav2Vec)
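For the audio branch, a minimal sketch (clip.wav is a hypothetical local file; decoding typically requires ffmpeg):
from transformers import pipeline
# Pretrained Whisper via the speech-recognition pipeline - no training needed
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
print(asr("clip.wav")["text"])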
Always: establish a simple baseline and cross-validate before reaching for the heavier tools.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from autogluon.tabular import TabularPredictor
# Load data
movies = pd.read_csv('movies.csv')
# Split into features and target (assumes numeric feature columns)
X = movies.drop(columns='success')
y = movies['success']
# Baseline with cross-validation
rf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(rf, X, y, cv=5)
print(f"RF Baseline: {scores.mean():.1%} ± {scores.std():.1%}")
# AutoML
predictor = TabularPredictor(label='success')
predictor.fit(movies, time_limit=300)
print(predictor.leaderboard())
| Concept | Key Point |
|---|---|
| Cross-validation | Always use K-fold, never single split |
| Bias-variance | Underfitting vs overfitting |
| Baselines | Start simple before complex |
| AutoML | Automates model selection for tabular |
| Transfer learning | Use pretrained for images/text |
| Overfitting | Train acc >> Test acc = problem |
1. Why use cross-validation instead of a single train/test split?
2. High train accuracy, low test accuracy - what's wrong?
3. When would you NOT use standard K-fold CV?
4. What's the difference between feature extraction and fine-tuning?
Cross-validation (20 min)
Baseline models (20 min)
AutoGluon (30 min)
Transfer learning (30 min)
Remember: Simple first, complex only if needed!