How wrong was our prediction?
| True Char | P(true) | Loss |
|---|---|---|
| a | 0.95 | 0.05 (good!) |
| a | 0.50 | 0.69 (okay) |
| a | 0.01 | 4.6 (terrible!) |
Lower probability for the correct answer → higher loss!
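For the correct class, cross-entropy reduces to -log P(true). A quick check of the table's numbers in plain Python (using the natural log, as PyTorch does):

```python
import math

# Cross-entropy for the correct class is just -log P(true):
for p in (0.95, 0.50, 0.01):
    print(f"P(true) = {p:.2f}  ->  loss = {-math.log(p):.2f}")
```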
Cross-entropy punishes confident wrong predictions:
| Scenario | Loss |
|---|---|
| 95% confident, correct | 0.05 |
| 95% confident, WRONG | 3.0 |
The model learns to be confident only when right!
Context "aar" → True next char: 'a' (index 1)
Model predicts:
logits = [0.5, 2.1, 0.3, ...] # 'a' has score 2.1
probs = softmax(logits) = [0.12, 0.58, 0.08, ...]
Loss calculation:
| If P(correct) was... | Loss would be... |
|---|---|
| 0.95 | 0.05 (great!) |
| 0.58 | 0.54 (okay) |
| 0.10 | 2.30 (bad!) |
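The softmax step above can be sketched in plain Python. Note this toy version uses only three logits, so the probabilities differ from the slide's numbers, which come from the full 27-character vocabulary:

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-char vocab (the real model outputs 27 logits)
probs = softmax([0.5, 2.1, 0.3])
loss = -math.log(probs[1])  # index 1 = 'a', the true next char
```

The highest logit (2.1 for 'a') always gets the highest probability, and the loss is just -log of that probability.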
import torch
import torch.nn as nn

model = NameGenerator()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for epoch in range(1000):
    # Forward pass
    logits = model(X)  # X is our context tensor

    # Compute loss
    loss = criterion(logits, Y)  # Y is target tensor

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss = {loss.item():.4f}")
| Component | What It Learns |
|---|---|
| Embeddings | Vector for each character |
| Hidden layer | Patterns like "aa" → likely "r" |
| Output layer | Which chars follow which patterns |
All weights are learned from data!
Epoch 0: Loss = 3.29 (random guessing)
Epoch 100: Loss = 2.45 (learning patterns)
Epoch 500: Loss = 1.82 (getting better)
Epoch 1000: Loss = 1.54 (good predictions!)
Loss decreases as model learns patterns in names!
Algorithm:
def generate_name(model, context_size=3):
    context = [0] * context_size  # Start with "..."
    name = ""
    while True:
        # Get probabilities
        logits = model(torch.tensor([context]))
        probs = torch.softmax(logits, dim=-1)

        # Sample next character
        next_idx = torch.multinomial(probs, 1).item()
        if next_idx == 0:  # End token
            break
        name += chars[next_idx]
        context = context[1:] + [next_idx]  # Slide window
    return name
After training on Indian names:
>>> generate_name(model)
'arya'
>>> generate_name(model)
'priti'
>>> generate_name(model)
'kavish'
>>> generate_name(model)
'neha'
The model learned patterns of Indian names!
Let's appreciate this moment:
| What You Built | What It Learned (Without Being Told!) |
|---|---|
| 27-char vocabulary | Which chars are common |
| Embedding layer | Vowels cluster together |
| Simple MLP | "aa" often followed by "r" |
| Next-token prediction | Patterns in names |
You gave it names. It learned the RULES of names!
This is the SAME idea that powers ChatGPT. Just scaled up!
If we always pick the most likely character:
generate() → "a"
generate() → "a"
generate() → "a" (same every time!)
Sampling gives variety:
generate() → "arya"
generate() → "priti"
generate() → "neel" (different each time!)
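The greedy-vs-sampling contrast can be shown with the standard library alone. The probabilities here are hypothetical, standing in for one softmax output:

```python
import random

probs = {"a": 0.6, "r": 0.3, "y": 0.1}  # toy next-char distribution

# Greedy decoding: always pick the argmax -> same character every time
greedy = max(probs, key=probs.get)

# Sampling: draw in proportion to probability -> varies run to run
random.seed(42)
samples = random.choices(list(probs), weights=list(probs.values()), k=10)
print(greedy, samples)
```

Greedy always returns 'a'; sampling mostly returns 'a' but mixes in 'r' and 'y', which is exactly what gives generation its variety.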
Temperature controls how "peaked" the distribution is:
| Temperature | Effect |
|---|---|
| T → 0 | Always pick highest (deterministic) |
| T = 1 | Standard sampling |
| T → ∞ | Uniform random |
How T changes the distribution:
| T | Effect |
|---|---|
| 0.5 | Very peaked (safe) |
| 1.0 | Balanced |
| 2.0 | Flat (creative) |
Low T = predictable, High T = surprising
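The "peaking" effect is easy to see numerically. Dividing the logits by T before softmax sharpens the distribution for T < 1 and flattens it for T > 1 (the logit values here are made up for illustration):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three characters
tops = {}
for T in (0.5, 1.0, 2.0):
    p = softmax([x / T for x in logits])
    tops[T] = max(p)
    print(f"T={T}: {[round(q, 2) for q in p]}")
```

The top character's probability shrinks as T grows, so high temperatures give the unlikely characters a real chance of being sampled.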
def generate_with_temp(model, temperature=1.0, context_size=3):
    context = [0] * context_size
    name = ""
    while True:
        logits = model(torch.tensor([context]))

        # Apply temperature: divide logits before softmax!
        logits = logits / temperature
        probs = torch.softmax(logits, dim=-1)

        next_idx = torch.multinomial(probs, 1).item()
        if next_idx == 0:  # End token
            break
        name += chars[next_idx]
        context = context[1:] + [next_idx]
    return name
Same model, different temperatures:
| T | Generated Names |
|---|---|
| 0.5 | arya, priya, aarav (common) |
| 1.0 | kavish, neeti, rohan (varied) |
| 1.5 | xylon, qira, zvak (unusual) |
Low T = safe, High T = creative
Imagine the model is choosing where to eat:
| Temperature | Behavior |
|---|---|
| T = 0.1 | "I'll just go to my favorite place. Always." |
| T = 1.0 | "I'll try any good restaurant, weighted by preference." |
| T = 2.0 | "Let's try that weird new place! Might be great, might be terrible." |
When to use what?
| Task | Temperature | Why |
|---|---|---|
| Code generation | Low (0.2-0.5) | Want correct, predictable code |
| Creative writing | High (0.8-1.2) | Want variety and surprise |
| Brainstorming | Higher (1.0-1.5) | Want unusual ideas |
| Feature | Our Model | GPT-4 |
|---|---|---|
| Vocab | 27 chars | 100K tokens |
| Context | 3 chars | 128K tokens |
| Embedding | 10 dim | 12,288 dim |
| Layers | 2 | ~120 |
| Parameters | ~3,000 | ~1 trillion (estimated) |
Same principle. Different scale!
| Our Model | Real LLMs |
|---|---|
| Characters | Subword tokens |
| MLP | Transformer (attention) |
| 3 char context | Thousands of tokens |
| Train on names | Train on internet |
Real LLMs use subword tokens:
| Method | Problem |
|---|---|
| Characters | Too slow |
| Words | Vocab too big |
| Subwords | Just right! |
| Text | Tokens |
|---|---|
| "Hello" | ["Hello"] |
| "ChatGPT" | ["Chat", "G", "PT"] |
| "unhappiness" | ["un", "happiness"] |
| "Nipun" | ["N", "ip", "un"] |
Common words = 1 token, rare words = multiple tokens
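Real tokenizers learn their vocabulary with byte-pair encoding (BPE), but the splitting step can be sketched as a greedy longest-match over a fixed vocabulary. The vocabulary here is hypothetical, chosen to reproduce one row of the table:

```python
def tokenize(text, vocab):
    """Toy greedy longest-match tokenizer (stand-in for real BPE)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown char becomes its own token
            i += 1
    return tokens

vocab = {"un", "happi", "ness", "happiness"}  # hypothetical vocabulary
print(tokenize("unhappiness", vocab))
```

Because "happiness" is in the vocabulary, the word splits into just two tokens; a rarer word would shatter into many small pieces.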
"How many r's in strawberry?"
The model sees: ["str", "aw", "berry"]
It doesn't see individual letters!
| Task | LLMs struggle because... |
|---|---|
| Counting letters | Tokens ≠ characters |
| Spelling | Can't see each letter |
| Anagrams | No character access |
https://platform.openai.com/tokenizer
Type any text and see how it gets tokenized!
| Your Name | Tokens |
|---|---|
| "Nipun" | ? |
| "Aarav" | ? |
| Your name | Try it! |
Problem: Long-range dependencies
"The cat sat on the mat. It was comfortable."
What does "It" refer to?
Attention: Let each position "look at" all other positions!
Our MLP can only look at a fixed window of 3 characters.
What if we need to look at something 100 characters ago?
| MLP (Fixed Window) | Attention (Dynamic) |
|---|---|
| Always looks at last 3 | Looks at whatever is relevant |
| "mat" can't see "cat" | "It" can attend to "cat" |
| Fixed pattern | Learns where to look! |
Key insight: The model LEARNS which positions to attend to!
"When predicting after 'It', pay attention to 'cat' not 'mat'"
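Scaled dot-product attention, the core operation, fits in a few lines of plain Python. The query/key/value vectors below are made-up 2-d toys, set up so that "It" aligns with "cat" more than "mat":

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for a single query (toy version)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # Output = weighted sum of the value vectors
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return output, weights

q_it   = [1.0, 0.0]                 # query for "It" (hypothetical)
keys   = [[0.9, 0.1], [0.1, 0.9]]   # keys for "cat", "mat"
values = [[1.0, 0.0], [0.0, 1.0]]   # values for "cat", "mat"
out, w = attention(q_it, keys, values)
print(w)  # the "cat" position gets the larger weight
```

Because the query for "It" points in nearly the same direction as the key for "cat", most of the attention weight lands there, and in a real model those directions are learned, not hand-set.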
"Attention Is All You Need"
| Innovation | Benefit |
|---|---|
| Self-attention | Look at all context |
| Parallel processing | Very fast training |
| Stacking layers | Deep understanding |
This enabled GPT, BERT, and all modern LLMs!
Pre-training alone gives a text completer, not an assistant.
| Stage | What It Learns |
|---|---|
| Pre-training | Predict next token (internet text) |
| Fine-tuning | Follow instructions |
| RLHF | Be helpful, safe, honest |
Lecture 08 covers this journey!
| Concept | Key Insight |
|---|---|
| Next-token prediction | The ONLY task LLMs do |
| Embeddings | Tokens become meaningful vectors |
| Context window | How much history the model sees |
| Softmax | Turn scores into probabilities |
| Cross-entropy | Punish wrong predictions |
| Sampling | Create variety in generation |
| Temperature | Control creativity vs safety |
Lecture 08: From Language Model to Assistant
| Topic | Question |
|---|---|
| Pre-training | How to train on internet scale? |
| Fine-tuning | How to follow instructions? |
| RLHF | How to be helpful and safe? |
| ChatGPT | How does it all come together? |