Compare model training to going to college.
Training compute is the years spent in college learning general knowledge.
After college, on a hard exam question, you don't just answer instantly. You pause, sketch on scratch paper, double-check. That's test-time compute.
Reasoning models (o1, Claude thinking) are exactly this · same base capability as a "college graduate" model · but allowed to use scratch paper before responding. The scratch paper is internal chain-of-thought that the user never sees.
Until 2024, the scaling story was almost entirely about training compute. From 2024 on, test-time compute became a second axis.
Two knobs now: (a) spend more on pretraining → better general capability; (b) spend more on per-query reasoning → better on hard problems.
Both pay off. OpenAI reported that o1's performance on math benchmarks scales smoothly with inference-time compute budget.
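One concrete way to "spend more per query" without retraining anything is self-consistency: sample several chains of thought and majority-vote the final answer. A minimal sketch below; `sample_chain_of_thought` is a hypothetical stand-in (stubbed with randomness here), where a real version would call an LLM at temperature > 0 and extract its final answer.

```python
# Sketch: spending more test-time compute via self-consistency
# (sample N reasoning chains, majority-vote the answer).
import random
from collections import Counter

def sample_chain_of_thought(question: str) -> str:
    """Hypothetical stub: a real implementation would call an LLM with
    temperature > 0 and return the final answer it extracts."""
    return random.choice(["42", "42", "41"])  # noisy reasoner, usually right

def answer_with_test_time_compute(question: str, n_samples: int = 16) -> str:
    # More samples = more inference-time compute = higher chance the
    # majority vote lands on the correct answer (up to a ceiling).
    answers = [sample_chain_of_thought(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(answer_with_test_time_compute("What is 6 * 7?"))
```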
| Model | AIME 2024 (math) | Codeforces |
|---|---|---|
| GPT-4 | 12% | ~800 Elo |
| o1 | 74% | ~1800 Elo |
| o3 | 97% | ~2700 Elo (grandmaster) |
Comparable gains on HumanEval, MATH, GPQA. This is an entirely new capability curve.
Practical rule · if the problem needs multi-step reasoning, use a reasoning model. If it needs a fast response or is simple, use a regular model.
What is the model actually doing?
A 70B Llama has 70 billion parameters. We trained it, and it works. But if it produces a wrong answer — or a dangerous one — we can't read the weights and know why.
Mechanistic interpretability (mech interp) tries to reverse-engineer specific computations inside trained networks.
Anthropic's interpretability team · circuits research (Olah et al., 2020+), dictionary learning with sparse autoencoders (2023+), scaled to production models (2024+).
A Transformer layer = experts working on a shared whiteboard.
The Transformer equation just says: read from the whiteboard (the residual stream) → add a contribution → write back.
Toy model · a 4-feature residual stream x (the whiteboard), one vector per token.
Step 1 · attention update. Attention reads x and the other tokens' streams, writes a contribution Δ_attn
(boost features 2 and 3 from related tokens).
Step 2 · FFN update. The FFN processes the updated stream, writes a contribution Δ_ffn
(slightly boost feature 1, decrease feature 2, clarify feature 4).
Step 3 · sum (the residual update). x_out = x + Δ_attn + Δ_ffn · nothing is overwritten, only added.
Same equation for every block: x ← x + Attn(LN(x)), then x ← x + FFN(LN(x)).
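A minimal numeric sketch of the three steps above, with made-up 4-feature vectors; the deltas mirror the parenthetical notes (boost features 2 and 3, then tweak 1, 2, 4).

```python
# Toy residual-stream view (illustrative numbers, not from a real model):
# each sublayer only *adds* a contribution to the shared 4-feature vector.
import numpy as np

x = np.array([1.0, 0.2, 0.0, 0.5])             # residual stream for one token

# Step 1: attention reads other tokens and writes its contribution
delta_attn = np.array([0.0, 0.6, 0.7, 0.0])    # boosts features 2 and 3
x = x + delta_attn                              # x is now [1.0, 0.8, 0.7, 0.5]

# Step 2: the FFN processes the updated stream and writes its contribution
delta_ffn = np.array([0.1, -0.3, 0.0, 0.4])    # boost 1, decrease 2, sharpen 4
x = x + delta_ffn                               # x is now [1.1, 0.5, 0.7, 0.9]

print(x)  # the "whiteboard" after one Transformer block
```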
Inside a model, a single neuron must represent many ideas at once · "bank" means river bank AND financial bank, plus context-dependent shades. This is superposition · confusing for analysis.
A sparse autoencoder forces the model to use a giant explicit dictionary. Instead of one ambiguous "bank" neuron, distinct river_bank and financial_institution features fire.
The dictionary is much wider than the residual stream (e.g., 100k features for a 12k-dim residual). Most features are 0 for any input · the active ones become human-interpretable concepts.
A normal autoencoder · narrow bottleneck. "Compress 100 words to 5."
A sparse autoencoder (SAE) · inverted bottleneck. "Have a 100,000-word dictionary, but only allowed to use 5 of them." You must pick extremely precise words.
Problem · superposition. A single neuron in the residual stream might fire for "Golden Gate Bridge", "the colour red", AND "Python syntax errors". Confusing.
SAE recipe:
1. Pick a layer and record its residual-stream activations x over a large corpus.
2. Encode into a much wider feature vector: f = ReLU(W_enc · x + b_enc).
3. Decode back: x̂ = W_dec · f + b_dec.
4. Train to minimize reconstruction error ‖x − x̂‖² plus an L1 sparsity penalty on f.
Worked numeric (toy) · push one activation vector through the SAE; out of the ~100k dictionary features, only a handful fire, one of them feature 6.
Researchers collect all inputs that activate feature 6 → all about the Golden Gate Bridge. Label: "Feature 6 = Golden Gate Bridge".
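A minimal sketch of that recipe, assuming PyTorch and toy sizes (d_model = 8, dictionary = 64) rather than the 12k-dim residual and 100k features of a real run.

```python
# Toy sparse autoencoder: reconstruct residual activations through a wide,
# sparse dictionary of features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 8, dict_size: int = 64):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_size)   # wide dictionary
        self.decoder = nn.Linear(dict_size, d_model)   # back to residual dim

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # feature activations, mostly ~0
        x_hat = self.decoder(f)           # reconstruction of the residual
        return x_hat, f

sae = SparseAutoencoder()
x = torch.randn(32, 8)                    # batch of residual-stream vectors
x_hat, f = sae(x)

# Training objective: reconstruct faithfully, but keep features sparse.
l1_coeff = 1e-3
loss = ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
loss.backward()
```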
Anthropic 2024 · SAE on Claude 3 Sonnet → millions of human-readable features. Clamp features → control behavior ("Golden Gate Claude" demo).
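A toy illustration of the clamping idea, with invented decoder weights, an invented feature index, and an invented scale; real steering (as in the "Golden Gate Claude" demo) operates on learned SAE features inside the running model.

```python
# Hypothetical "clamp a feature → control behavior" sketch: force one
# dictionary feature high and add its decoder direction to the residual stream.
import numpy as np

d_model, dict_size = 8, 64
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(dict_size, d_model)) * 0.1   # decoder: feature -> residual

x = rng.normal(size=d_model)          # a token's residual-stream vector
golden_gate_feature = 6               # pretend feature 6 = "Golden Gate Bridge"
clamp_value = 10.0                    # much higher than it would fire naturally

# Nudge every downstream computation toward that concept.
x_steered = x + clamp_value * W_dec[golden_gate_feature]
print(x_steered)
```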
Two reasons this matters: (a) debugging · when the model gives a wrong answer, trace which internal features fired; (b) safety · look for dangerous computations (deception, hidden goals) that the output alone won't reveal.
Still a young field. Most interp results are about small models or narrow circuits. Scaling interp to frontier-level models is a 2026+ research agenda.
What the next decade of DL research looks like
As models become more capable, the cost of misalignment grows:
| | 2018 | 2024 | 2030 (?) |
|---|---|---|---|
| Failure mode | Misclassify an image | Give a wrong factual answer | Autonomously execute a bad plan |
| Cost | Annoy a user | Spread misinformation | Catastrophic |
Claude, GPT, Gemini all ship with elaborate safety stacks · constitutional AI, RL from safety feedback, red-teaming, classifier filters, refusal training. Safety is not a layer; it's the product.
| Module | Covered |
|---|---|
| Foundations (L1-L2) | why DL, UAT, depth vs width, residuals |
| Training craft (L3-L6) | recipe, SGD / Adam, schedules, regularization |
| Vision (L7-L9) | CNN mechanics, ResNet family, detection, SAM |
| Sequences → Transformers (L10-L14) | RNN/LSTM/GRU, Seq2Seq, attention, Transformer, tokenization |
| LLMs (L15-L16) | scaling laws, RoPE, GQA, LoRA, RLHF, DPO |
| Self-supervision + VLMs (L17-L18) | SimCLR, MAE, CLIP, LLaVA |
| Generative (L19-L22) | VAE, GAN, DDPM, CFG, latent diffusion |
| Systems + frontier (L23-L24) | KV-cache, quantization, agents, reasoning, interp |
Each is a PhD's worth of work. Pick one.
Predictions (take with a grain of salt):
What you learned
| Module | Lectures | Big ideas |
|---|---|---|
| 1 Foundations | L1–L3 | MLP, ResNets, training recipe |
| 2 Optimization | L4–L5 | SGD, momentum, Adam, schedules |
| 3 Regularization | L6 | double descent, augmentation, norm, dropout |
| 4 CNNs | L7–L9 | architecture evolution, detection, SAM |
| 5 Sequences | L10–L11 | LSTM, Seq2Seq, bottleneck |
| 6 Transformers | L12–L14 | attention, nanoGPT, tokenization |
| 7 LLMs | L15–L16 | scaling laws, LoRA, RLHF, DPO |
| 8 SSL + VLM | L17–L18 | SimCLR, CLIP, LLaVA |
| 9 Generative | L19–L22 | VAE, GAN, DDPM, Stable Diffusion |
| 10 Frontier | L23–L24 | inference, agents, interp |
This is the current skill floor for a DL engineer or research student.