Sample random combinations instead of trying all.

Bergstra & Bengio (2012):
"Random search is more efficient than grid search because not all hyperparameters are equally important."
Grid search wastes evaluations varying unimportant parameters. Random search covers each dimension more uniformly.
J. Bergstra and Y. Bengio. "Random Search for Hyper-Parameter Optimization." JMLR, 13:281-305, 2012.
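The coverage argument fits in a tiny sketch (illustrative numbers, not from the paper): with a budget of 9 evaluations in 2D, a 3×3 grid tests only 3 distinct values per dimension, while 9 random points test 9.

```python
import numpy as np

rng = np.random.default_rng(0)

# 3x3 grid: 9 evaluations, but only 3 distinct values per dimension
grid = [(a, b) for a in (0.1, 0.5, 0.9) for b in (0.1, 0.5, 0.9)]
grid_values_dim0 = {a for a, _ in grid}

# 9 random points: 9 distinct values per dimension (almost surely)
rand = rng.uniform(0, 1, size=(9, 2))
rand_values_dim0 = set(rand[:, 0])

print(len(grid), len(grid_values_dim0), len(rand_values_dim0))  # 9 3 9
```

If only one of the two dimensions matters, the grid effectively tried 3 settings of it; random search tried 9.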
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

search = RandomizedSearchCV(
    RandomForestClassifier(),
    {'n_estimators': randint(50, 500),
     'max_depth': randint(3, 30),
     'min_samples_leaf': randint(1, 20),
     'max_features': uniform(0.1, 0.9)},  # uniform(loc, scale): samples [0.1, 1.0]
    n_iter=60, cv=5, random_state=42)
search.fit(X, y)
```
A budget of 60 random trials often beats a full grid of 900+ combinations.
Notebook Part 1c: Compare Grid vs Random search on the same budget.
| | Grid Search | Random Search |
|---|---|---|
| Coverage | Exhaustive but sparse in each dim | Uniform in each dim |
| Budget | Fixed by grid size | You choose n_iter |
| Scales to | 3-4 parameters | 10+ parameters |
| Distributions | Discrete values only | Continuous distributions |
Practical rule: Use grid for 2-3 params, random for everything else.
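To see why grids stop scaling, just count combinations (a sketch with made-up grids):

```python
from sklearn.model_selection import ParameterGrid

small = {'C': [0.1, 1, 10], 'kernel': ['rbf', 'poly']}   # 2 params -> grid is fine
big = {'n_estimators': [50, 100, 200, 350, 500],         # 4 params -> grid explodes
       'max_depth': [3, 10, 20, 30],
       'min_samples_leaf': [1, 5, 10, 20],
       'max_features': [0.1, 0.3, 0.5, 0.7, 0.9]}

print(len(ParameterGrid(small)))  # 6 combos
print(len(ParameterGrid(big)))    # 400 combos; random search with n_iter=60 is far cheaper
```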
Use past results to decide what to try next
Grid and Random are blind — they don't learn from previous evaluations.
Bayesian Optimization:
The surrogate predicts both:
- the expected value of the objective at each point
- the uncertainty of that prediction

This lets us balance:
- exploitation: sample where the predicted value is high
- exploration: sample where the uncertainty is high

The acquisition function (e.g., Expected Improvement) combines both.
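As a concrete sketch of Expected Improvement (closed form for a Gaussian surrogate prediction; the `mu`, `sigma`, and `best` numbers are hypothetical):

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI for maximization, given a surrogate prediction N(mu, sigma^2)."""
    imp = mu - best - xi          # predicted improvement over the best so far
    z = imp / sigma
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

best_so_far = 0.84
# Confident point near the incumbent vs an uncertain point below it:
safe = expected_improvement(mu=0.85, sigma=0.01, best=best_so_far)
risky = expected_improvement(mu=0.80, sigma=0.10, best=best_so_far)
print(risky > safe)  # True: high uncertainty can outweigh a lower mean
```

This is exactly the exploration term at work: the lower-mean but uncertain point wins.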
```python
from bayes_opt import BayesianOptimization

def black_box(x):
    return -((x - 2)**2) + 1  # unknown to optimizer

optimizer = BayesianOptimization(
    f=black_box,
    pbounds={'x': (-5, 5)},
    random_state=42)
optimizer.maximize(init_points=3, n_iter=7)
```
After 3 random points, the GP surrogate "sees" the landscape and focuses on the peak.
Notebook Part 2a: Visualize the GP surrogate step by step as it learns.
| | Gaussian Process (GP) | Tree Parzen Estimator (TPE) |
|---|---|---|
| Surrogate | GP regression | Density estimators |
| Library | bayesian-optimization | Optuna |
| Strengths | Exact uncertainty, smooth functions | Scales well, handles categorical |
| Best for | < 20 params, continuous | Any size, mixed types |
GP-based: Models the objective function directly.
TPE: Models the distribution of good vs bad configurations.
Notebook Part 3: Side-by-side comparison of GP vs TPE behavior.
```python
from bayes_opt import BayesianOptimization
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def rf_objective(n_estimators, max_depth, min_samples_leaf):
    # bayes_opt passes floats; cast to int for tree parameters
    model = RandomForestClassifier(
        n_estimators=int(n_estimators),
        max_depth=int(max_depth),
        min_samples_leaf=int(min_samples_leaf))
    return cross_val_score(model, X, y, cv=5).mean()

optimizer = BayesianOptimization(
    f=rf_objective,
    pbounds={'n_estimators': (50, 500),
             'max_depth': (3, 30),
             'min_samples_leaf': (1, 20)})
optimizer.maximize(init_points=10, n_iter=50)
```
Notebook Part 2b: Full GP-based tuning with convergence plot.
```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 30),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20),
    }
    model = RandomForestClassifier(**params)
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(f"Best: {study.best_value:.3f}")
print(f"Params: {study.best_params}")
```
Notebook Part 2c: Full Optuna tuning with visualization.
```python
import optuna
import torch

def objective(trial):
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    hidden = trial.suggest_int('hidden_size', 32, 512)
    model = build_network(hidden)  # build_network, train_one_epoch, evaluate: your code
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(50):
        train_loss = train_one_epoch(model, optimizer)
        val_acc = evaluate(model)
        trial.report(val_acc, epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()  # stop early!
    return val_acc
```
Optuna monitors intermediate results and kills unpromising trials, saving compute.
Notebook Part 2d: Optuna with pruning — see how it skips bad trials.
| | Grid | Random | GP-BayesOpt | Optuna (TPE) |
|---|---|---|---|---|
| Intelligence | None | None | High | High |
| Efficiency | Low | Medium | High | High |
| Scales to many params | No | Yes | No (< 20) | Yes |
| Handles categorical | Yes | Yes | No | Yes |
| Pruning support | No | No | No | Yes |
Practical rule: Random for quick exploration, Optuna for serious tuning.
```python
# WRONG: Tune and evaluate on the SAME cross-validation
grid = GridSearchCV(model, params, cv=5)
grid.fit(X, y)
print(f"Best score: {grid.best_score_:.3f}")  # Optimistic!
```
Why? You tried many configs and picked the best. By definition, it's the luckiest.
GridSearchCV.best_score_ is always optimistically biased.
Separate the tuning loop from the evaluation loop.

The inner loop finds the best hyperparameters. The outer loop tells you how well the process of tuning + training will perform on new data.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV

# Inner loop: tune hyperparameters
inner_cv = GridSearchCV(
    RandomForestClassifier(),
    param_grid={'max_depth': [5, 10, 15],
                'n_estimators': [100, 200]},
    cv=3)

# Outer loop: evaluate the tuned model
outer_scores = cross_val_score(inner_cv, X, y, cv=5)
print(f"Nested CV: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```
This is the gold standard for reporting tuned model performance.
Notebook Part 4: Compare grid.best_score_ vs nested CV scores.
What if the computer did all of this for you?
Instead of manually choosing one model and tuning it:
```
For each model family (LogReg, KNN, SVM, RF, GB, ...):
    For each hyperparameter combination:
        Run cross-validation
    Keep the best config
Return the overall best model + hyperparameters
```
This is exactly what tools like AutoGluon, FLAML, and auto-sklearn do.
```python
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

model_configs = {
    'LogReg': (LogisticRegression(),
               {'C': [0.01, 0.1, 1, 10]}),
    'KNN': (KNeighborsClassifier(),
            {'n_neighbors': [3, 5, 11]}),
    'SVM': (SVC(),
            {'C': [0.1, 1, 10], 'kernel': ['rbf', 'poly']}),
    'RF': (RandomForestClassifier(),
           {'n_estimators': [100, 200]}),
    'GB': (GradientBoostingClassifier(),
           {'learning_rate': [0.01, 0.1]}),
}

results = {}
for name, (model, params) in model_configs.items():
    gs = GridSearchCV(model, params, cv=5)
    gs.fit(X_train, y_train)
    results[name] = gs.best_score_
```
Notebook Part 5: Full DIY AutoML with 6 model families.
```
Model                 Combos   Best CV   Time
==============================================
Logistic Regression       10    0.8575   0.3s
KNN                       12    0.9050   0.5s
SVM                       12    0.9325   1.2s
Random Forest             36    0.9338   4.1s
Gradient Boosting         27    0.9400   6.8s
Extra Trees               36    0.9313   3.9s

Winner: Gradient Boosting (CV=0.9400)
```
No extra packages. Just loop over model families with their grids.
| Good for | Be careful when |
|---|---|
| Tabular data (CSVs) | Model must be interpretable |
| Quick baselines | Latency matters (real-time serving) |
| Lack time or ML expertise | Model must fit on edge device |
| Kaggle competitions | Non-tabular data (images, text) |
Use AutoML to find the ceiling, then manually build an interpretable model that gets close.
```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_score)

# Step 1: Know your floor (dummy baseline)
dummy = cross_val_score(DummyClassifier(), X, y, cv=5).mean()

# Step 2: Simple interpretable model
lr = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# Step 3: Strong model with tuning (nested CV)
search = RandomizedSearchCV(RandomForestClassifier(), params, n_iter=60, cv=5)
outer = cross_val_score(search, X, y, cv=5)

# Step 4: AutoML ceiling (model_configs as on the earlier slide)
for name, (model, params) in model_configs.items():
    gs = GridSearchCV(model, params, cv=5)
    gs.fit(X_train, y_train)
```
If LR is close to the best → deploy LR (interpretable, fast).
Making experiments repeatable
Without reproducibility: you can't verify a result, debug a regression, or tell whether an improvement is real or luck.
With reproducibility: the same code, data, and seed give the same result, every time.
```python
# One parameter is enough
model = RandomForestClassifier(n_estimators=100, random_state=42)
```
Every run with random_state=42 gives the exact same result.
sklearn uses NumPy's random number generator, which is fully deterministic given a seed.
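A quick check of that claim (synthetic data for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Two independent fits with the same seed
preds_a = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y).predict(X)
preds_b = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y).predict(X)

print(np.array_equal(preds_a, preds_b))  # True: bit-identical predictions
```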
PyTorch has many sources of randomness:
```python
import os
import random

import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)                          # Python
    np.random.seed(seed)                       # NumPy
    torch.manual_seed(seed)                    # PyTorch CPU
    torch.cuda.manual_seed_all(seed)           # PyTorch GPU
    torch.backends.cudnn.deterministic = True  # cuDNN
    torch.backends.cudnn.benchmark = False     # cuDNN
    torch.use_deterministic_algorithms(True)   # All ops
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```
Miss any one of these → non-reproducible results.
Notebook Part 7: Test what happens when you skip each seed.
| Source | What It Affects | Setting |
|---|---|---|
| Python random | Data shuffling, augmentation | random.seed() |
| NumPy | Data splits, noise generation | np.random.seed() |
| PyTorch CPU | Weight init, dropout masks | torch.manual_seed() |
| PyTorch GPU | GPU random ops | torch.cuda.manual_seed_all() |
| cuDNN | Conv algorithm selection | deterministic=True |
| CUBLAS | Matrix multiplication order | CUBLAS_WORKSPACE_CONFIG |
Each component has its own random number generator.
Full determinism is not always practical. Report variance instead:
```python
results = []
for seed in [42, 123, 456, 789, 1024]:
    set_seed(seed)
    acc = train_and_evaluate()
    results.append(acc)

print(f"Accuracy: {np.mean(results):.3f} ± {np.std(results):.3f}")
```
More informative than a single deterministic result.
Papers increasingly require multi-seed results (NeurIPS, ICML checklist).
No more spreadsheets
```python
# Monday:    lr=0.01,  depth=10 → 83.2%
# Tuesday:   lr=0.001, depth=15 → 84.1%
# Wednesday: ... was Tuesday depth=15 or 20?
# Thursday:  "I think the best was Tuesday's run. Probably."
```
Sound familiar? You need a system that automatically records every experiment.
| Category | Examples |
|---|---|
| Config | Hyperparameters, model type, dataset version |
| Metrics | Accuracy, loss, F1 — per epoch and final |
| Artifacts | Model weights, plots, confusion matrices |
| Environment | Python version, package versions, git hash |
| Metadata | Run name, tags, notes, timestamp |
Tracking all of this manually in a spreadsheet breaks down after 10 runs.
```python
import trackio

trackio.init(project="netflix-predictor", config={
    "learning_rate": 0.01,
    "n_estimators": 100,
    "seed": 42})
model = train(trackio.config)
trackio.log({"accuracy": accuracy, "f1": f1_score})
trackio.finish()
```
Trackio (Hugging Face): free, local-first, W&B-compatible API.
Notebook Part 8a: Log your first sklearn experiment.
```python
trackio.init(project="mnist-cnn", config={
    "lr": 1e-3, "epochs": 20, "batch_size": 64})

for epoch in range(20):
    for batch_x, batch_y in train_loader:
        loss = train_step(model, batch_x, batch_y)
        trackio.log({"train_loss": loss})
    val_acc = evaluate(model, val_loader)
    trackio.log({"epoch": epoch, "val_acc": val_acc})

trackio.finish()
```
Trackio auto-generates loss curves and accuracy plots in its local dashboard.
Notebook Part 8b-8c: Log training curves and neural network runs.
| Feature | Details |
|---|---|
| Local storage | SQLite in ~/.cache/huggingface/trackio/ |
| Dashboard | Gradio-based, runs locally |
| W&B-compatible API | init, log, finish |
| Free forever | No cloud account needed |
| Share | Optionally sync to Hugging Face Spaces |
```bash
pip install trackio
trackio  # launches local dashboard
```
```python
# Run 1: baseline
trackio.init(project="nlp",
             config={"model": "lstm", "lr": 1e-3})
# ... train ...
trackio.finish()

# Run 2: improved
trackio.init(project="nlp",
             config={"model": "transformer", "lr": 5e-4})
# ... train ...
trackio.finish()
```
Open the local dashboard to see both runs side-by-side with their configs, metrics, and curves.
Notebook Part 8d: Compare multiple learning rates visually.
Name runs descriptively (lr0.01_depth10, not run_42) and tag them (baseline, augmented, final).

| Tool | Hosting | Best For |
|---|---|---|
| Trackio | Local | Free, simple, course projects |
| MLflow | Self-hosted | Enterprise, model registry |
| W&B | Cloud | Teams, sweeps, rich visualizations |
| TensorBoard | Local | TF/PyTorch training curves |
Pick based on your needs. Start with Trackio, graduate to MLflow or W&B for team projects.
| Concept | Key Idea |
|---|---|
| Grid search | Exhaustive but doesn't scale (curse of dimensionality) |
| Random search | Better coverage, you control the budget |
| GP-based BayesOpt | Models the function, uses uncertainty |
| Optuna (TPE) | Scales well, handles categorical, supports pruning |
| Nested CV | Tune inside, evaluate outside — unbiased estimate |
| Concept | Key Idea |
|---|---|
| DIY AutoML | Loop over model families + GridSearchCV |
| sklearn reproducibility | random_state=42 is enough |
| PyTorch reproducibility | 8 settings needed for full determinism |
| Multi-seed reporting | Report mean ± std across 5 seeds |
| Trackio | Local-first, free experiment tracking |
Q1: Why does random search often beat grid search?
Not all hyperparameters are equally important. Grid wastes evaluations varying unimportant params. Random covers each dimension more uniformly. (Bergstra & Bengio, JMLR 2012)
Q2: What is the difference between GP-based and TPE-based Bayesian optimization?
GP models the objective function directly with uncertainty. TPE models the distribution of good vs bad configurations. GP works better for small, continuous spaces; TPE scales to more parameters and handles categorical.
Q3: What is nested CV and when do you need it?
Inner loop tunes hyperparameters, outer loop evaluates. Needed because GridSearchCV.best_score_ is optimistically biased — it reports the luckiest result from many tries.
Q4: You train a neural net with torch.manual_seed(42) but get different results each run. Why?
PyTorch has multiple RNG sources. Also need:
np.random.seed(), random.seed(), cuda.manual_seed_all(), cudnn.deterministic=True, cudnn.benchmark=False, use_deterministic_algorithms(True), CUBLAS_WORKSPACE_CONFIG.
Q5: Why should you track experiments programmatically instead of in a spreadsheet?
Spreadsheets don't scale, are error-prone, and can't automatically capture configs, metrics over time, or model artifacts. Tools like Trackio automate all of this.
Tune systematically (random or Bayesian, not manual).
Report honestly (nested CV, multi-seed).
Let AutoML find the ceiling. Track everything locally.
Next week: Git — Version Your Code