def lf_positive(text):
return 1 if any(w in text for w in ["great", "excellent"]) else -1
def lf_negative(text):
return 0 if any(w in text for w in ["bad", "terrible"]) else -1
labels = [lf_positive(t), lf_negative(t)]
noisy_label = Counter(l for l in labels if l != -1).most_common(1)[0][0]
Key insight: 70–90% of manual-labeling accuracy from hand-written
rules + a denoiser. That's the Snorkel playbook.
| Domain | Techniques |
|---|---|
| Images | flip, rotate, crop, color jitter, RandAugment |
| Text | synonym replacement, back-translation |
| Audio | time shift, pitch shift, noise injection |
| Tabular | SMOTE, noise injection, mixup |
Canonical mistake: flipping a handwritten
6 into a 9.
Always ask: does this transform preserve the label?
"A simple model you can trust beats a complex one you can't."
from google import genai
client = genai.Client()
response = client.models.generate_content(
model="gemini-2.0-flash",
contents=prompt,
config={"response_mime_type": "application/json",
"response_schema": ReviewSentiment},
)
Four skills: structured outputs · multimodal · streaming · prompt versioning.
| Task | Better choice |
|---|---|
| Exact arithmetic | A calculator |
| Date parsing | datetime.strptime |
| Structured extraction | regex / small fine-tuned model |
| Classification with lots of labels | logistic regression / BERT |
| Anything that must be deterministic | Pretty much anything else |
LLMs are a tool, not a default.
| Concept | Why it matters |
|---|---|
| Cross-validation | Single split is a coin flip |
| Stratified K-Fold | Preserves class balance |
| Nested CV | Needed when you tune hyperparameters |
| TimeSeriesSplit | Time-ordered data needs time-ordered folds |
| GroupKFold | Patients / users must not span folds |
Minimum viable evaluation: 5-fold stratified CV.
Test-time information sneaking into training:
log_of_price to predict price)Rule of thumb: 99% where everyone else is at 85% → you have a leak.
| Method | Tries | When |
|---|---|---|
| Grid Search | all combos | < 3 hyperparams |
| Random Search | random subset | 3+ hyperparams |
| Bayesian (Optuna) | adaptive | expensive evals |
| Hyperband / ASHA | early stopping | deep learning |
Bergstra & Bengio 2012: only ~2 of your 5 hyperparams matter,
and random explores those coordinates more densely than grid.
import trackio
run = trackio.init(project="moons", config={"lr": 1e-3, "hidden": 16})
for epoch in range(100):
trackio.log({"loss": loss.item(), "acc": acc.item()})
run.finish()
Answers "which run was the good one?" — without this you're
re-running experiments by hand.
"A model in a notebook is not a model."
| Layer | Symptom | Cure |
|---|---|---|
| The Math | Different results each run | seeds everywhere |
| The Memory | Lost the best hyperparams | experiment tracking |
| The Machine | "Works on my laptop" | Docker / pyproject.toml |
All three matter. Missing one is a reproducibility bug.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "app.py"]
docker build turns your code into a portable imagedocker run -p 7860:7860 ships to any Linux / Mac / CI server| Test | Detects |
|---|---|
| KS test | Shift in continuous features |
| PSI | Population stability (train vs prod) |
| Chi-squared | Categorical feature shifts |
| Jensen–Shannon / Wasserstein | General distribution distance |
Key insight: 95% accuracy at deployment → silent 70% two months
later if the input distribution shifts.
print(time.time() - t0) # first check: how slow?
cProfile # which function?
line_profiler # which line?
memory_profiler # which allocation?
torch.profiler # GPU timing & kernel breakdown
Key insight: the #1 speedup in student projects this semester was
loading the model once at startup instead of per request.
def quantize_tensor(x):
s = x.abs().max().item() / 127.0
q = torch.round(x / s).clamp(-127, 127).to(torch.int8)
return q, s
4× smaller weights, typically < 1% accuracy drop.
Same 3-line routine applied to progressively bigger models:
| Notebook | Model | Params | INT8 vs FP32 |
|---|---|---|---|
09-…from-scratch |
MLP · make_moons |
~400 | ≈ 0 |
11-…llm-from-scratch |
2-layer Transformer · Hamlet | ~60K | Δ loss ≈ 0.005 |
12-…real-llm |
Qwen 2.5 0.5B (HF) | 494M | < 1% CE |
The math doesn't care about model size.
| Technique | Idea | Typical gain |
|---|---|---|
| Quantization | Fewer bits per weight | 4× (INT8), 8× (INT4) |
| Pruning | Zero out small weights | 50–90% sparsity |
| Distillation | Small "student" mimics big "teacher" | 5–10× smaller model |
| ONNX export | Portable graph format | Runs anywhere (mobile, browser, edge) |
Real deployments stack these: distill → prune → quantize → ONNX.
# Gradio
import gradio as gr
gr.Interface(predict, "text", "label").launch()
# Streamlit
import streamlit as st
x = st.text_input("Prompt"); st.write(model(x))
# FastAPI
@app.post("/predict")
def predict(body: Input) -> Output: ...
Gradio for quick ML demos · Streamlit for dashboards ·
FastAPI for production APIs.
# .github/workflows/ci.yml
on: [push]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install -r requirements.txt
- run: pytest
- run: docker build -t app .
Every commit: install, test, build a Docker image. If it breaks,
you find out in 30 seconds, not two weeks later.
"The same pattern that builds Claude Code builds yours."
| Piece | Role | Analogy |
|---|---|---|
| LLM | Think, reason, decide | Brain |
| Tools | Functions the LLM can call | Hands |
| Loop | Keep going until done | Work ethic |
You built a 4-tool agent with Gemma 4 on free Colab in ~100 lines.
TOOLS = [{"type": "function",
"function": {"name": "calculate",
"description": "Do math",
"parameters": {...}}}]
while not done:
response = llm(messages, tools=TOOLS)
for call in response.tool_calls:
result = dispatch(call)
messages.append({"role": "tool", "content": result})
Bigger tools, same architecture. Claude Code, Cursor, Devin, Perplexity — all follow this shape.
The lab exercise: upgrade calculate to handle sqrt, sin, log, factorial, np.mean, np.dot, matmul, and keep eval() safe.
{"__builtins__": {}}, AST walking)This is the real-world recipe: LLM + expert tool >> LLM alone.
requirements.txt → unreproducible/home/student/...) → breaks for the next person.gitignore docsAvoid these 10 → you're ahead of most production codebases.
| Symptom | Diagnosis | Lecture |
|---|---|---|
| "Accuracy changes every run" | No seed | 9 |
| "My script doesn't run for the TA" | Docker + pinned deps | 9 |
| "Accuracy dropped in production" | Data drift | 10 |
| "Test 99%, prod 70%" | Data leakage | 7 |
| "Model too slow" | Profile first | 11 |
| Symptom | Diagnosis | Lecture |
|---|---|---|
| "Model too big for device" | Quantize / prune / distill | 11 |
| "LLM hallucinates numbers" | Give it a calculator tool | 12 |
| "Lost my best hyperparams" | TrackIO / W&B | 8 |
| "Can't get enough labels" | Active / weak supervision | 4 |
| "Not enough data to train" | Augmentation | 5 |
Screenshot these two slides. Come back in 5 years.
my-project/
├── data/ # raw + processed (gitignored)
├── notebooks/ # exploration only
├── src/
│ ├── train.py # training with seeds + tracking
│ ├── evaluate.py # reproducible metrics
│ └── app.py # Gradio / Streamlit / FastAPI
├── tests/ # at least one smoke test
├── requirements.txt # pinned versions
├── Dockerfile # recipes that always work
└── README.md # runnable in < 5 min
Must have: seeds · 5-fold CV · tracking · a working demo.
| Topic | Where to learn |
|---|---|
| Deep Learning (CNNs / transformers / fine-tuning) | CS 337, fast.ai |
| MLOps (registries, pipelines) | Made With ML |
| Cloud Deployment (AWS / GCP / Azure / k8s) | cloud docs |
| Data Engineering (ETL, warehouses) | dbt, Airflow |
| ML System Design (scaling, A/B, feedback) | Chip Huyen's book |
| MCP / A2A (agent tool standards) | Anthropic / Google docs |
Foundations → yours. Next steps → above.
requirements.txt.train.py. Run it. Does it reproduce?You now know enough to be the person who does these things.
This was CS 203: Software Tools and Techniques for AI.
You started with scripts. You end with reproducible, monitored,
optimized ML systems — and agents that can take actions.
All slides, code, videos, labs:
https://nipunbatra.github.io/stt-ai-teaching/
Go build something.