| Train Acc | Test Acc | Gap | Diagnosis | Action |
|---|---|---|---|---|
| 70% | 68% | 2% | Underfitting | More complex model, better features |
| 85% | 83% | 2% | Good fit | Ship it |
| 95% | 80% | 15% | Mild overfitting | Regularize, more data |
| 99% | 65% | 34% | Severe overfitting | Simplify drastically |
The train-test gap is your overfitting detector. Gap > 10% = red flag.
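The gap check can be sketched in a few lines. Here `make_classification` and the RandomForest settings are stand-ins for your own data and model, not the lecture's movie dataset:

```python
# Sketch: measure the train-test gap as an overfitting signal.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
# Gap = train accuracy minus test accuracy.
gap = model.score(X_train, y_train) - model.score(X_test, y_test)
if gap > 0.10:
    print(f"Red flag: train-test gap is {gap:.1%}")
```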
Idea: Penalize complexity. "Fit the data, but keep weights small."
| Type | What It Does | Code |
|---|---|---|
| L2 (Ridge) | Shrinks all weights toward zero | LogisticRegression(C=0.1) |
| L1 (Lasso) | Drives some weights to exactly zero | LogisticRegression(penalty='l1') |
| Tree depth | Limits how deep trees can grow | max_depth=5 |
| Dropout | Randomly drops neurons during training | Dropout(0.5) |
Smaller C = more regularization (in sklearn, C is the inverse of the regularization strength).
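A quick way to see this: smaller C shrinks the learned weight vector. The dataset and C values below are illustrative:

```python
# Sketch: smaller C = stronger L2 regularization = smaller weights.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = np.linalg.norm(clf.coef_)  # size of the weight vector
    print(f"C={C:>6}: ||w|| = {norms[C]:.2f}")
```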
OK -- so how do we actually measure if our model is good?
How to actually trust your numbers
# Run 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
print(model.score(X_test, y_test)) # 87.3%
# Run 2 (same code, different random seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
print(model.score(X_test, y_test)) # 79.8%
Which is the real accuracy? 87%? 80%? Something else?
You wouldn't bet $500M on a single coin flip. Don't bet your model evaluation on a single random split.
Different splits create different test sets:
Same model, same dataset, wildly different results.
The fundamental issue: One test set is a sample. Samples have variance.
Split data into K equal parts (folds). Rotate which one is the test set.
Key insight: Every data point is used for testing exactly once.
Final score = Average of all 5 test scores
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
# One line. That's it.
scores = cross_val_score(model, X, y, cv=5)
print(f"Fold scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")
Fold scores: [0.823, 0.851, 0.842, 0.815, 0.834]
Mean: 0.833
Std: 0.013
Wrong: "Our model achieves 87% accuracy"
Right: "Our model achieves 83.3% +/- 1.3% accuracy (5-fold CV)"
| Std Dev | Interpretation |
|---|---|
| +/- 1% | Very reliable estimate |
| +/- 3% | Reasonable |
| +/- 5% | Noisy -- need more data or folds |
| +/- 10% | Don't trust this number |
The standard deviation tells you how much to trust the mean.
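You can also report a rough uncertainty on the mean itself. Using the fold scores from above (the standard-error formula is a rough guide here, since CV folds are not fully independent):

```python
# Sketch: summarizing CV fold scores with mean, spread, and a rough
# standard error of the mean (std / sqrt(K)).
import numpy as np

scores = np.array([0.823, 0.851, 0.842, 0.815, 0.834])  # fold scores from above
mean, std = scores.mean(), scores.std()
stderr = std / np.sqrt(len(scores))
print(f"Mean {mean:.3f}, per-fold spread {std:.3f}, stderr ~ {stderr:.3f}")
```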
| K | Train Size | Bias | Variance | Speed |
|---|---|---|---|---|
| 2 | 50% | High (less training data) | High | Fast |
| 5 | 80% | Low | Low | Good |
| 10 | 90% | Very low | Medium (correlated folds) | Slower |
| N (LOO) | N-1 | Lowest | High (!) | Very slow |
Default: K=5. Use K=10 if dataset is small. LOO only for < 100 samples.
Why LOO has high variance: Each test set has 1 sample. That's a very noisy estimate per fold.
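You can see this directly: with LOO every fold score is either 0 or 1, because each fold contains one prediction. The dataset below is synthetic:

```python
# Sketch: LeaveOneOut fold scores are all-or-nothing.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=40, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
# Each fold tests exactly one sample, so its accuracy is 0.0 or 1.0.
print(sorted(set(scores)))
```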
Problem: Our movie data is 70% success, 30% failure.
Random splits might give:
Stratified K-Fold ensures every fold maintains the 70/30 ratio.
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
Good news: cross_val_score uses stratified folds by default for classifiers.
| Data Type | Problem | Solution |
|---|---|---|
| Time series | Training on future, testing on past | TimeSeriesSplit |
| Grouped data | Same patient in train AND test | GroupKFold |
| Very small | K folds too small to be useful | LeaveOneOut |
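A minimal sketch of the grouped-data fix: GroupKFold keeps every sample from the same group (e.g. the same patient) entirely in train or entirely in test. The groups below are invented for illustration:

```python
# Sketch: GroupKFold never puts one group in both train and test.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.tile([0, 1], 6)
groups = np.repeat([0, 1, 2, 3], 3)  # 4 "patients", 3 samples each

for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # Train and test groups never overlap.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```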
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
# Split 1: Train [2018], Test [2019]
# Split 2: Train [2018-2019], Test [2020]
# Split 3: Train [2018-2020], Test [2021]
# Always: past predicts future. Never the reverse.
Leakage: Information from the test set "leaks" into training.
# WRONG: Scaler sees ALL data (including test)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # <-- Leakage!
scores = cross_val_score(model, X_scaled, y, cv=5)
# RIGHT: Use a Pipeline (scaler fits only on training fold)
from sklearn.pipeline import Pipeline
pipe = Pipeline([
('scaler', StandardScaler()),
('model', RandomForestClassifier())
])
scores = cross_val_score(pipe, X, y, cv=5) # <-- Clean
The Pipeline ensures preprocessing happens inside each fold.
| Leakage Type | Example | Fix |
|---|---|---|
| Preprocessing | Scaling on full data | Use Pipeline |
| Feature selection | Selecting features using full data | Select inside CV |
| Target leakage | Feature that encodes the label | Remove the feature |
| Temporal leakage | Using future data | TimeSeriesSplit |
| Duplicate leakage | Same sample in train and test | Deduplicate first |
Data leakage gives you optimistic results that won't hold in production.
Your model looks great in the notebook, fails in the real world.
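The same Pipeline trick fixes the feature-selection leak from the table: put the selector inside the pipeline so features are chosen from training folds only. Dataset and `k` are illustrative:

```python
# Sketch: "select inside CV" -- SelectKBest runs within each training fold.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=10)),       # fit on train folds only
    ('model', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)  # no leakage
```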
Plot score vs training set size. Diagnoses whether you need more data.
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
model, X, y, train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0], cv=5
)
| Shape | Diagnosis | Action |
|---|---|---|
| Big gap, both rising | Overfitting -- more data would help | Collect more data |
| Both flat at low score | Underfitting -- more data won't help | More complex model |
| Converged at high score | Good fit | You're done |
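Reading the diagnosis off the curve can be automated: compare the final train and validation scores. A sketch on synthetic data (the threshold interpretation follows the table above):

```python
# Sketch: compute the final train-validation gap from learning_curve output.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=400, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=[0.25, 0.5, 1.0], cv=5
)
# Gap at the largest training size: big gap -> overfitting, more data may help.
gap = train_scores.mean(axis=1)[-1] - val_scores.mean(axis=1)[-1]
print(f"Final train-validation gap: {gap:.3f}")
```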
Plot score vs hyperparameter value. Finds the sweet spot.
from sklearn.model_selection import validation_curve
train_scores, val_scores = validation_curve(
RandomForestClassifier(), X, y,
param_name="max_depth", param_range=[1, 2, 5, 10, 20, 50], cv=5
)
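Finding the sweet spot is then a one-liner: pick the parameter value with the highest mean validation score. The dataset and range below are illustrative:

```python
# Sketch: selecting the best max_depth from validation_curve output.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=300, random_state=0)
param_range = [1, 2, 5, 10]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=param_range, cv=3
)
# Best value = highest mean validation score across folds.
best = param_range[int(np.argmax(val_scores.mean(axis=1)))]
print(f"Best max_depth by validation score: {best}")
```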
| Situation | Use This | Code |
|---|---|---|
| Classification | StratifiedKFold | cross_val_score(model, X, y, cv=5) |
| Regression | KFold | cross_val_score(model, X, y, cv=5) |
| Time series | TimeSeriesSplit | TimeSeriesSplit(n_splits=5) |
| Grouped data | GroupKFold | GroupKFold(n_splits=5) |
| Avoid leakage | Pipeline | Pipeline([('scaler', ...), ('model', ...)]) |
| Find right K | Validation curve | validation_curve(...) |
| Need more data? | Learning curve | learning_curve(...) |
Finding the best knobs to turn
Parameters: The model figures these out from data (weights, thresholds).
Hyperparameters: You decide these before training.
| Model | Hyperparameter | What It Controls | Typical Range |
|---|---|---|---|
| Logistic Reg | C | Regularization strength | 0.001 - 1000 |
| Decision Tree | max_depth | Tree complexity | 1 - 50 |
| Random Forest | n_estimators | Number of trees | 50 - 500 |
| Random Forest | min_samples_leaf | Minimum leaf size | 1 - 20 |
| XGBoost | learning_rate | Step size | 0.01 - 0.3 |
| XGBoost | max_depth | Tree complexity | 3 - 10 |
How do you find the best combination?
# Monday
model = RandomForestClassifier(n_estimators=100, max_depth=10)
# Accuracy: 83.2%
# Tuesday
model = RandomForestClassifier(n_estimators=200, max_depth=15)
# Accuracy: 84.1%
# Wednesday
model = RandomForestClassifier(n_estimators=200, max_depth=20)
# Accuracy: 82.8% ... wait, was Tuesday's result with max_depth=15 or 20?
# Thursday: give up, use the Tuesday one. Probably.
Problems: No record of what you tried. No systematic coverage. Easy to miss the best combination.
Try every combination on a predefined grid.
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15, None],
'min_samples_leaf': [1, 2, 5]
}
grid = GridSearchCV(
RandomForestClassifier(), param_grid, cv=5, scoring='accuracy'
)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.3f}")
Parameters:
n_estimators: [50, 100, 200] → 3 values
max_depth: [5, 10, 15, None] → 4 values
min_samples_leaf: [1, 2, 5] → 3 values
Total combinations: 3 × 4 × 3 = 36
Cross-validation: 36 × 5 folds = 180 model fits
Now add two more parameters with 5 values each: 36 × 5 × 5 = 900 combinations, or 4,500 model fits with 5-fold CV.
Grid search doesn't scale. Each new parameter multiplies the cost.
Sample random combinations instead of trying all of them.
Bergstra & Bengio (2012): A landmark result.
Key insight: Not all hyperparameters matter equally.
If max_depth matters a lot but min_samples_leaf barely affects performance, random search still tests many distinct values of the important parameter, while a grid wastes most of its budget varying the unimportant one.
In practice: 60 random trials often beat a full grid search.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions = {
'n_estimators': randint(50, 500), # Sample integers 50-500
'max_depth': randint(3, 30), # Sample integers 3-30
'min_samples_leaf': randint(1, 20), # Sample integers 1-20
'max_features': uniform(0.1, 0.9), # Sample floats 0.1-1.0
}
search = RandomizedSearchCV(
RandomForestClassifier(),
param_distributions,
n_iter=60, # Only 60 random trials
cv=5,
random_state=42
)
search.fit(X, y)
print(f"Best: {search.best_score_:.3f} with {search.best_params_}")
| | Grid Search | Random Search |
|---|---|---|
| Combinations | All | Sampled |
| Budget | Grows exponentially | You control it (n_iter) |
| Coverage | Even but sparse per dimension | Better for important params |
| Best for | 1-2 hyperparameters | 3+ hyperparameters |
| Guarantees | Finds best in grid | May miss the best |
Rule of thumb: Use grid for quick searches (few params, few values). Use random for everything else.
Idea: Use results so far to decide what to try next.
Grid and random search are blind -- they don't learn from previous trials.
Bayesian optimization builds a model of "hyperparameter → score" and picks the next point intelligently.
Trial 1: max_depth=5, lr=0.1 → 82%
Trial 2: max_depth=10, lr=0.01 → 85%
Trial 3: max_depth=8, lr=0.05 → ??? (model predicts ~86%, tries this region)
Explores uncertain regions + exploits promising regions.
import optuna
def objective(trial):
params = {
'n_estimators': trial.suggest_int('n_estimators', 50, 500),
'max_depth': trial.suggest_int('max_depth', 3, 30),
'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20),
}
model = RandomForestClassifier(**params)
scores = cross_val_score(model, X, y, cv=5)
return scores.mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(f"Best score: {study.best_value:.3f}")
print(f"Best params: {study.best_params}")
# Which hyperparameters matter most?
optuna.visualization.plot_param_importances(study)
# How did optimization progress over trials?
optuna.visualization.plot_optimization_history(study)
# How do parameters interact?
optuna.visualization.plot_contour(study)
Optuna also supports:
| | Grid | Random | Bayesian (Optuna) |
|---|---|---|---|
| Intelligence | None | None | Learns from trials |
| Efficiency | Low | Medium | High |
| Setup | Easy | Easy | Moderate |
| Best for | Few params | Many params | Expensive models |
| Parallelizable | Yes | Yes | Partially |
Practical advice: start with RandomizedSearchCV (simple, effective); reach for Bayesian optimization (Optuna) when each model fit is expensive.
A subtle but critical mistake:
# WRONG: Tune and evaluate on the SAME cross-validation
grid = GridSearchCV(model, params, cv=5)
grid.fit(X, y)
print(f"Best score: {grid.best_score_:.3f}") # Optimistic!
Why this is wrong: You searched over many configurations and picked the best one. By definition, it's the luckiest. This is selection bias.
The score from GridSearchCV.best_score_ is always optimistic.
Solution: Separate the tuning loop from the evaluation loop.
from sklearn.model_selection import cross_val_score, GridSearchCV
# Inner loop: tune hyperparameters
inner_cv = GridSearchCV(
RandomForestClassifier(),
param_grid={'max_depth': [5, 10, 15], 'n_estimators': [100, 200]},
cv=3 # 3-fold inner CV for tuning
)
# Outer loop: evaluate the tuned model
outer_scores = cross_val_score(inner_cv, X, y, cv=5) # 5-fold outer CV
print(f"Nested CV score: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
This is the gold standard for reporting tuned model performance.
Common tuning mistakes:
- Trusting GridSearchCV.best_score_ -- it is optimistic
- Absurdly fine grids like learning_rate: [0.001, 0.0011, 0.0012, ...] -- a waste of compute
- Tuning parameters independently: max_depth and n_estimators interact -- tune them together
- Reporting best_score_ as final performance: always use nested CV

"Which run was the good one?"
After a week of tuning, your notebook has dozens of overwritten cells and stray print() statements, but no record of which configuration produced which score.
You need a system. Tools exist for this:
| Tool | Type | Best For |
|---|---|---|
| Weights & Biases | Cloud service | Teams, dashboards, sweeps |
| MLflow | Self-hosted | Privacy, local-first |
| TensorBoard | Built into TF/PyTorch | Training curves |
| Optuna Dashboard | Built into Optuna | Hyperparameter viz |
| Category | Examples |
|---|---|
| Hyperparameters | max_depth=10, lr=0.01, n_estimators=200 |
| Metrics | Accuracy, F1, loss, training time |
| Data | Dataset version, split seed, preprocessing steps |
| Code | Git commit hash, notebook version |
| Artifacts | Saved model file, plots, confusion matrix |
We'll cover these tools in depth next week (Reproducibility).
For now: if you're using Optuna, its built-in tracking is a great start.
What if the computer did all of this for you?
Step 1: Pick a model (complexity ladder)
Step 2: Choose hyperparameters (grid/random/Bayesian search)
Step 3: Evaluate properly (nested cross-validation)
Step 4: Try another model (repeat steps 1-3)
Step 5: Compare all models (pick the best)
Step 6: Maybe ensemble the top ones (combine for better accuracy)
AutoML automates steps 1-6.
from autogluon.tabular import TabularPredictor
# 1. Create predictor
predictor = TabularPredictor(label='success')
# 2. Fit (give it a time budget)
predictor.fit(train_data, time_limit=300) # 5 minutes
# 3. Predict
predictions = predictor.predict(test_data)
That's it. No manual model selection. No hand-tuned hyperparameters. No hand-built ensembles.
AutoGluon does all of it.
AutoGluon: Starting fit...
Preprocessing data...
15 numeric features, 3 categorical features
Fitting 11 models...
LightGBM ✓ (32s) val_acc=0.851
CatBoost ✓ (45s) val_acc=0.856
XGBoost ✓ (38s) val_acc=0.848
RandomForest ✓ (25s) val_acc=0.832
ExtraTrees ✓ (28s) val_acc=0.828
NeuralNetTorch ✓ (65s) val_acc=0.819
LogisticRegression ✓ (5s) val_acc=0.789
...
Ensembling top models... ✓ (15s)
Best: WeightedEnsemble_L2 (val_acc=0.873)
predictor.leaderboard(test_data)
model score_val fit_time pred_time
0 WeightedEnsemble_L2 0.873 180s 0.5s
1 CatBoost 0.856 60s 0.1s
2 LightGBM 0.851 40s 0.1s
3 XGBoost 0.848 55s 0.1s
4 RandomForest 0.832 30s 0.2s
5 LogisticRegression 0.789 10s 0.0s
The ensemble beats every individual model. That's the power of stacking.
# Quick exploration (minutes)
predictor.fit(train_data, presets='medium_quality', time_limit=60)
# Balanced (recommended default)
predictor.fit(train_data, presets='good_quality', time_limit=300)
# Production
predictor.fit(train_data, presets='high_quality', time_limit=3600)
# Competition (best possible, very slow)
predictor.fit(train_data, presets='best_quality', time_limit=14400)
| Preset | Time | Models | Ensembling |
|---|---|---|---|
| medium_quality | ~1 min | 5-6 | Simple |
| good_quality | ~5 min | 8-10 | Weighted |
| high_quality | ~1 hour | 10+ | Multi-layer stacking |
| best_quality | Hours | 15+ | Deep stacking |
Good for:
Be careful when:
| Approach | Effort | Control | Accuracy |
|---|---|---|---|
| DummyClassifier() | Zero | N/A | Baseline |
| LogisticRegression() | Low | High | Good |
| RandomizedSearchCV(RF, ...) | Medium | High | Better |
| optuna.optimize(objective, ...) | Medium | High | Better |
| TabularPredictor().fit() | Low | Low | Best |
They're not mutually exclusive. Use AutoML to find the ceiling, then manually build an interpretable model that gets close.
# Step 1: Know your floor
dummy = cross_val_score(DummyClassifier(), X, y, cv=5).mean()
# Step 2: Simple interpretable model
lr = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
# Step 3: Strong default with tuning
search = RandomizedSearchCV(RandomForestClassifier(), params, n_iter=60, cv=5)
outer = cross_val_score(search, X, y, cv=5) # Nested CV
# Step 4: AutoML ceiling
predictor = TabularPredictor(label='target').fit(train_data, time_limit=300)
# Step 5: Decide
# If LR is close to AutoML → deploy LR (interpretable, fast)
# If RF is close to AutoML → deploy RF (good balance)
# If only AutoML is good enough → deploy AutoML (accept complexity)
| Concept | One-Liner |
|---|---|
| Cross-validation | Never trust a single train/test split |
| Stratified CV | Preserve class ratios in each fold |
| Data leakage | Preprocessing must happen inside CV, not before |
| Learning curves | Tells you if more data would help |
| Validation curves | Tells you the best hyperparameter value |
| Random > Grid | Better coverage of important dimensions |
| Nested CV | Tune inside, evaluate outside |
| AutoML | Automates model selection + tuning + ensembling |
Q1: Why use CV instead of a single train/test split?
Single split can be lucky/unlucky. CV averages over K splits for a reliable estimate with a standard error.
Q2: Train accuracy 99%, test accuracy 70%. What's wrong?
Overfitting. Model memorized training data. Fix: regularize, simplify, more data.
Q3: You scale all features, then run cross_val_score. What's wrong?
Data leakage. The scaler saw test fold data. Fix: use a Pipeline.
Q4: Why does random search often beat grid search?
Important hyperparameters get more unique values tested (Bergstra & Bengio 2012).
Q5: What is nested cross-validation and when do you need it?
Inner loop tunes hyperparameters, outer loop evaluates. Needed whenever you report performance of a tuned model, because best_score_ from GridSearchCV is optimistically biased.
Q6: When would you NOT use standard K-fold CV?
Time series (use TimeSeriesSplit), grouped data (use GroupKFold).
Q7: AutoML gets 88% accuracy. Your logistic regression gets 85%. Which do you deploy?
Depends on context. If interpretability, latency, or model size matter, the 3% gap may not justify AutoML's complexity. There's no universal right answer -- articulate the tradeoffs.
| Task | Time | What You'll Do |
|---|---|---|
| 1. CV Exploration | 15 min | Single split vs 5-fold: see the variance yourself |
| 2. Validation Curves | 15 min | Plot max_depth vs accuracy for a Random Forest |
| 3. Grid vs Random | 20 min | Same budget, compare coverage and best score |
| 4. Optuna | 20 min | Bayesian optimization with visualizations |
| 5. Nested CV | 10 min | Compare best_score_ vs nested CV score |
| 6. AutoGluon | 20 min | Run AutoML, analyze leaderboard |
This week's message:
Measure properly (CV). Tune systematically (random/Bayesian search).
Report honestly (nested CV). Automate when it makes sense (AutoML).
Start simple. Climb the ladder only when justified.
Next week: Reproducibility & Environments