Add LF₃: rating < 4 → NEG. Now we have 3 pairwise agreements:
| Pair | Agreement | Equation |
|---|---|---|
| LF₁, LF₂ | 85% | α₁α₂ + (1-α₁)(1-α₂) = 0.85 |
| LF₁, LF₃ | 80% | α₁α₃ + (1-α₁)(1-α₃) = 0.80 |
| LF₂, LF₃ | 90% | α₂α₃ + (1-α₂)(1-α₃) = 0.90 |
3 equations, 3 unknowns → Can solve!
With N labeling functions:
| N (# LFs) | Pairwise Equations | Unknowns (α's) | Status |
|---|---|---|---|
| 2 | 1 | 2 | ✗ Underdetermined |
| 3 | 3 | 3 | ✓ Exactly determined |
| 4 | 6 | 4 | ✓ Overdetermined |
| 5 | 10 | 5 | ✓ Overdetermined |
| N | N(N-1)/2 | N | ✓ if N ≥ 3 |
Key insight: We need at least 3 LFs to solve for accuracies!
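Where do the observed agreement rates come from? They are counted over the unlabeled pool: for each pair of LFs, the fraction of examples where both vote and give the same label. A minimal sketch with a made-up vote matrix:

```python
import numpy as np

# Hypothetical LF vote matrix: rows = examples, columns = LFs
# 1 = POS, 0 = NEG, -1 = abstain
L = np.array([
    [1, 1, -1],
    [1, 1,  1],
    [0, 0,  0],
    [1, 0,  1],   # ... thousands more rows in practice
])

def pairwise_agreement(L, i, j):
    """Fraction of examples where LF i and LF j both vote and agree."""
    both_vote = (L[:, i] != -1) & (L[:, j] != -1)
    return (L[both_vote, i] == L[both_vote, j]).mean()

for i in range(L.shape[1]):
    for j in range(i + 1, L.shape[1]):
        print(f"LF{i+1} vs LF{j+1}: {pairwise_agreement(L, i, j):.2f}")
```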
| Term | Equations vs Unknowns | Example | Reliability |
|---|---|---|---|
| Underdetermined | Fewer equations | 1 eq, 2 unknowns | ✗ Cannot solve uniquely |
| Exactly determined | Equal | 3 eq, 3 unknowns | ✓ Solvable, but sensitive to noise |
| Overdetermined | More equations | 6 eq, 4 unknowns | ✓ Most reliable — noise averages out |
Why is overdetermined better?
Analogy: Asking 3 witnesses vs 10 witnesses — more witnesses = more reliable!
| K (# examples) | Agreement Estimate Quality | Solution Reliability |
|---|---|---|
| 100 | Noisy (±5-10%) | ✗ Unreliable accuracy estimates |
| 1,000 | Reasonable (±1-3%) | ✓ Good estimates |
| 10,000+ | Very accurate (±0.5%) | ✓ Very reliable estimates |
Both the number of LFs (N) and the number of unlabeled examples (K) matter for reliability. Let's solve the 3-LF system from above numerically:
import torch

# Observed agreement rates for LF pairs: (1,2), (1,3), (2,3)
obs = torch.tensor([0.85, 0.80, 0.90])

# Initialize α's as learnable parameters
α = torch.tensor([0.7, 0.7, 0.7], requires_grad=True)

for i in range(100):
    # Predicted agreement: αᵢαⱼ + (1-αᵢ)(1-αⱼ)
    # (torch.stack keeps the autograd graph intact)
    pred = torch.stack([
        α[0]*α[1] + (1-α[0])*(1-α[1]),  # LF1, LF2
        α[0]*α[2] + (1-α[0])*(1-α[2]),  # LF1, LF3
        α[1]*α[2] + (1-α[1])*(1-α[2]),  # LF2, LF3
    ])
    loss = ((pred - obs)**2).sum()
    loss.backward()
    α.data -= 0.1 * α.grad
    α.grad.zero_()
    α.data.clamp_(0.5, 1.0)  # keep accuracies in the valid range
Iter 10: α=[0.798, 0.899, 0.849], loss=0.0001
Iter 50: α=[0.800, 0.900, 0.850], loss=0.0000 ✓
Notebook:
lecture-demos/week04/solve_lf_accuracy_torch.ipynb
Weight formula (log-odds of accuracy): w = log(α / (1 − α))
| LF | Accuracy (α) | Calculation | Weight (w) |
|---|---|---|---|
| LF₁ | 0.80 | log(0.8/0.2) = log(4) | 1.39 |
| LF₂ | 0.90 | log(0.9/0.1) = log(9) | 2.20 |
Intuition: Higher accuracy → higher weight → more influence in vote
Using learned weights: w₁ = 1.39 (α₁=0.80), w₂ = 2.20 (α₂=0.90)
Formula: Score(y) = Σⱼ wⱼ · 1[LFⱼ votes y], then P(POS) = e^Score(POS) / (e^Score(POS) + e^Score(NEG))
| # | Review Text | LF₁ Vote | LF₂ Vote | Score POS | Score NEG |
|---|---|---|---|---|---|
| 1 | "good movie" (8.0) | ✓ POS | ✓ POS | 1.39+2.20=3.59 | 0 |
| 2 | "good but boring" (5.0) | ✓ POS | — | 1.39 | 0 |
| 3 | "terrible" (2.0) | — | ✓ NEG | 0 | 2.20 |
| 4 | "decent film" (7.5) | — | ✓ POS | 2.20 | 0 |
| # | Score POS | Score NEG | e^(POS) | e^(NEG) | Sum | P(POS) |
|---|---|---|---|---|---|---|
| 1 | 3.59 | 0 | 36.2 | 1.0 | 37.2 | 97.3% |
| 2 | 1.39 | 0 | 4.0 | 1.0 | 5.0 | 80.0% |
| 3 | 0 | 2.20 | 1.0 | 9.0 | 10.0 | 10.0% |
| 4 | 2.20 | 0 | 9.0 | 1.0 | 10.0 | 90.0% |
Interpretation: when both LFs vote POS (Review 1), confidence is highest (97.3%); when only the weaker LF₁ fires (Review 2), confidence drops to 80%; a lone vote from the stronger LF₂ (Reviews 3-4) gives 90% confidence in its label.
What if LFs disagree? The more accurate LF wins!
| # | Review | LF₁ (w=1.39) | LF₂ (w=2.20) | Score POS | Score NEG | P(POS) |
|---|---|---|---|---|---|---|
| 5 | "good acting, bad plot" | ✓ POS | ✓ NEG | 1.39 | 2.20 | 31% |
| 6 | "poor quality, great story" | ✓ NEG | ✓ POS | 2.20 | 1.39 | 69% |
Calculation for Review 5: P(POS) = e^1.39 / (e^1.39 + e^2.20) = 4.0 / (4.0 + 9.0) ≈ 0.31
→ LF₂ has higher accuracy (α₂=0.90 > α₁=0.80), so its vote dominates!
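To check the arithmetic, a few lines that turn the two accuracies into log-odds weights and reproduce the Review 5 probability (values from the tables above):

```python
import math

# Learned accuracies -> log-odds weights
alpha = {"LF1": 0.80, "LF2": 0.90}
w = {name: math.log(a / (1 - a)) for name, a in alpha.items()}  # ≈ {LF1: 1.39, LF2: 2.20}

# Review 5: LF1 votes POS, LF2 votes NEG
score_pos = w["LF1"]   # total weight voting POS
score_neg = w["LF2"]   # total weight voting NEG

# Softmax over the two scores
p_pos = math.exp(score_pos) / (math.exp(score_pos) + math.exp(score_neg))
print(f"P(POS) = {p_pos:.2f}")  # 0.31 — the more accurate LF2 wins
```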
# Get probabilistic labels from the Snorkel LabelModel
probs = label_model.predict_proba(L_train)  # shape (n, 2): columns are P(NEG), P(POS)
# Train any classifier on these labels (sklearn needs hard labels, so threshold P(POS))
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_features, (probs[:, 1] > 0.5).astype(int))
Full pipeline summary: write LFs → apply them to unlabeled data to get the label matrix L → LabelModel learns LF accuracies from agreement patterns → probabilistic labels → train a downstream classifier on those labels.
Complete code:
demos/snorkel_weak_supervision.py
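For reference, a condensed sketch of such a pipeline with the Snorkel API (the toy dataframe and LF bodies are illustrative, not the contents of the demo file):

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

POS, NEG, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_keyword_good(x):
    return POS if "good" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_high_rating(x):
    return POS if x.rating > 7 else ABSTAIN

@labeling_function()
def lf_low_rating(x):
    return NEG if x.rating < 4 else ABSTAIN

# Tiny toy dataframe; in practice this is thousands of unlabeled reviews
df_train = pd.DataFrame({
    "text": ["good movie", "good but boring", "terrible", "decent film"],
    "rating": [8.0, 5.0, 2.0, 7.5],
})

# 1) Apply LFs -> label matrix L (rows = examples, columns = LF votes)
applier = PandasLFApplier(lfs=[lf_keyword_good, lf_high_rating, lf_low_rating])
L_train = applier.apply(df_train)

# 2) LabelModel learns LF accuracies from their agreements/disagreements
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=42)

# 3) Probabilistic labels for training a downstream classifier
probs = label_model.predict_proba(L_train)  # shape (n, 2)
```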
Good candidates: tasks where simple heuristics (keywords, ratings, regexes, existing metadata) correlate with the label and plenty of unlabeled data is available.
Bad candidates: subjective or nuanced tasks with no reliable heuristics, where each example needs individual human judgment.
AI labeling your data
2022-2024: Large Language Models became viable annotators.
# Before: hire human annotators
cost_per_label = 0.30    # USD
human_labels = 10_000
total_cost = cost_per_label * human_labels   # $3,000

# Now: use GPT-4 / Claude
cost_per_label = 0.002   # USD (roughly)
llm_labels = 10_000
total_cost = cost_per_label * llm_labels     # $20

# ~150x cost reduction!
But: Are LLM labels as good as human labels?

LLMs = Crowdsourced human knowledge, distilled into a model

It's this simple! Just ask the LLM to classify text. But how do we scale this to 10,000 reviews?
| Role | Purpose | Example |
|---|---|---|
| System | Define the task, persona, output format | "You are a movie critic. Classify as POSITIVE/NEGATIVE/NEUTRAL" |
| User | The actual text to classify | "Review: 'Mind-blowing visuals!'" |
messages = [
    {"role": "system", "content": "Classify movie reviews as POSITIVE/NEGATIVE/NEUTRAL"},
    {"role": "user", "content": "Review: 'Mind-blowing visuals! Nolan does it again!'"}
]
# Response: "POSITIVE"
from openai import OpenAI

client = OpenAI()

def label_review(review):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Classify as POSITIVE/NEGATIVE/NEUTRAL. Reply with only the label."},
            {"role": "user", "content": f"Review: {review}"}
        ],
        max_tokens=10
    )
    return response.choices[0].message.content.strip()

# Label 1000 reviews
labels = [label_review(r) for r in reviews]
Demo:
lecture-demos/week04/llm_labeling.py
| Technique | Example | Benefit |
|---|---|---|
| Clear labels | "POSITIVE/NEGATIVE/NEUTRAL" | No ambiguity |
| Definitions | "POSITIVE: Reviewer enjoyed the movie" | Consistent criteria |
| Few-shot | Give 2-3 examples first | Much higher accuracy |
| JSON output | "Respond in JSON: {label, confidence}" | Easy parsing |
prompt = """Examples:
"Loved it!" → POSITIVE
"Terrible" → NEGATIVE
Now classify: "The movie was okay I guess"
Label:"""
Few-shot examples can improve accuracy by 10-20%!

Always validate! Sample 50-100 labels and check human-LLM agreement before trusting the full batch.
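A spot-check is just an agreement rate over a small hand-labeled sample; a minimal sketch (toy lists stand in for the real 50-100 sampled labels):

```python
# Toy stand-ins: labels from the LLM loop above vs. labels you assign by hand
llm_sample   = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEUTRAL", "NEGATIVE"]
human_sample = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEUTRAL", "NEGATIVE"]

agreement = sum(l == h for l, h in zip(llm_sample, human_sample)) / len(human_sample)
print(f"Human-LLM agreement: {agreement:.0%}")  # 80% here; low agreement -> revise the prompt
```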
1. Subjective Tasks
"This movie is so bad it's good"
LLM: NEGATIVE (wrong - it's ironic praise!)
2. Domain-Specific Knowledge
"The mise-en-scene was pedestrian but the diegetic sound..."
LLM: ? (needs film theory knowledge)
3. Nuanced Categories
5-point scale: Very Negative, Negative, Neutral, Positive, Very Positive
LLM accuracy drops significantly with more categories
4. Ambiguous Guidelines
What exactly counts as "slightly negative"?
def hybrid_labeling(texts, confidence_threshold=0.8):
    llm_labels = []
    human_queue = []
    for i, text in enumerate(texts):
        # label_with_confidence: an LLM call returning (label, confidence) — see the sketch below
        label, confidence = label_with_confidence(text)
        if confidence >= confidence_threshold:
            llm_labels.append((i, label, "llm"))
        else:
            human_queue.append(i)
    print(f"LLM labeled: {len(llm_labels)}")
    print(f"Need human: {len(human_queue)}")
    # Send human_queue to your annotation platform
    return llm_labels, human_queue
Use LLMs for easy examples, humans for hard ones!
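The hybrid pipeline above assumes a `label_with_confidence` helper that returns both a label and a confidence score. One possible sketch, combining the JSON-output technique from the prompting table with the OpenAI client used earlier (model name, prompt wording, and the self-reported confidence are all assumptions):

```python
import json
from openai import OpenAI

client = OpenAI()

def label_with_confidence(text):
    """Ask the LLM for a label plus a self-reported confidence in [0, 1]."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works; this one is just an example
        messages=[
            {"role": "system", "content": (
                "Classify the movie review as POSITIVE, NEGATIVE, or NEUTRAL. "
                'Respond with JSON only, e.g. {"label": "POSITIVE", "confidence": 0.9}'
            )},
            {"role": "user", "content": f"Review: {text}"},
        ],
        max_tokens=30,
    )
    # In practice, wrap this in try/except in case the model returns malformed JSON
    data = json.loads(response.choices[0].message.content)
    return data["label"], float(data["confidence"])
```

Self-reported confidence is only a rough signal, so tune the routing threshold against a small human-labeled sample rather than trusting it blindly.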
| Method | Cost/1000 labels | INR/1000 | Quality | Speed |
|---|---|---|---|---|
| Expert humans | $300-500 | ₹25,000-42,000 | Highest | Slow |
| Crowdsourcing | $50-100 | ₹4,200-8,400 | Medium | Medium |
| GPT-4 | $20-50 | ₹1,700-4,200 | Good | Fast |
| GPT-3.5 | $2-5 | ₹170-420 | Moderate | Very Fast |
| Claude Haiku | $1-3 | ₹85-250 | Moderate | Very Fast |
| Open source LLM | ~$0 | ~₹0 (compute) | Varies | Depends |
Sweet spot: GPT-3.5/Claude Haiku for first pass, humans for validation
Garbage in, garbage out?
| Source | Example | How common |
|---|---|---|
| Annotator error | Tired annotator clicks wrong button | 5-15% |
| Task ambiguity | "The movie was okay" - POS or NEG? | 10-20% |
| Weak supervision | Heuristic "good" → POS catches "not good" | 15-30% |
| Data entry errors | Columns swapped, typos | 1-5% |
Real-world datasets often have 5-20% label noise!
Idea: Train model → find where model strongly disagrees with label
| Review | Given Label | Model says | Suspicious? |
|---|---|---|---|
| "Loved it!" | POS | 95% POS | |
| "Not good at all" | POS | 92% NEG | |
| "Meh, it was fine" | NEG | 60% NEG |
from cleanlab import Datalab

# Audit labels using the model's predicted probabilities
# (ideally out-of-sample, e.g. from cross-validation)
lab = Datalab(data={"X": X, "y": y}, label_name="y")
lab.find_issues(pred_probs=model.predict_proba(X))

issues = lab.get_issues()
mislabeled = issues[issues["is_label_issue"]].index
Notebook:
lecture-demos/week04/cleanlab_noisy_labels.ipynb
| Strategy | When to use | Code |
|---|---|---|
| Remove | Few errors, enough data | X_clean = X.drop(mislabeled) |
| Re-label | Important examples | Send back to humans |
| Label smoothing | Many errors | y = [0.9, 0.05, 0.05] instead of [1,0,0] |
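For the label-smoothing row, a minimal sketch that produces exactly the [0.9, 0.05, 0.05] targets from the table:

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """Soften one-hot targets: correct class -> 1-eps, others -> eps/(num_classes-1)."""
    one_hot = np.eye(num_classes)[y]
    return one_hot * (1 - eps) + (1 - one_hot) * eps / (num_classes - 1)

print(smooth_labels(np.array([0]), num_classes=3))  # [[0.9  0.05 0.05]]
```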
Rule of thumb: remove when errors are few and data is plentiful, re-label when the flagged examples matter, and use label smoothing when noise is widespread.
The best of all worlds
| Data Size | First Choice | Add if... |
|---|---|---|
| <1,000 | Manual labeling | — |
| 1k-10k | Active Learning | + Weak supervision if patterns exist |
| >10k | Weak Supervision or LLM | + Active Learning for hard cases |
Quick decision guide:
| Question | Yes → | No → |
|---|---|---|
| Can you write labeling heuristics? | Weak Supervision | LLM Labeling |
| Do you have budget for LLM API? | LLM Labeling | Weak Supervision |
| Is high precision critical? | Active Learning + humans | LLM or Weak Supervision |
| Approach | Setup Cost | Per-Label Cost | Quality |
|---|---|---|---|
| Manual only | Low | $0.30 | High |
| + Active Learning | Medium | $0.30 (fewer) | High |
| + Weak Supervision | High | ~$0 | Medium |
| + LLM Labeling | Low | $0.002 | Medium-High |
| + Noise Cleaning | Medium | ~$0 | Improved |
Typical savings: 5-20x cost reduction with hybrid approach
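A back-of-the-envelope check using the per-label costs from the tables above (the 80/20 LLM/human split is an assumed illustration, not a measurement):

```python
# 10,000 examples: all-manual vs. hybrid (LLM first pass, humans for hard cases)
n = 10_000
human_cost = 0.30   # USD per label, from the cost table
llm_cost = 0.002    # USD per label, rough LLM API cost

manual_total = n * human_cost                          # $3,000
hybrid_total = n * llm_cost + 0.2 * n * human_cost     # assume humans handle 20%

print(f"Manual: ${manual_total:,.0f}  Hybrid: ${hybrid_total:,.0f}  "
      f"Savings: {manual_total / hybrid_total:.1f}x")  # ~4.8x
```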
Active Learning - Let model pick what to label (2-10x savings)
Weak Supervision - Write labeling functions (10-100x savings)
LLM Labeling - Use GPT/Claude as annotators (10-50x cost reduction)
Noisy Labels - Detect and handle with cleanlab
Combine approaches - Hybrid pipelines give best results
Quality matters - Validate with human spot-checks
Tools exist - modAL, Snorkel, cleanlab, OpenAI API
What you'll build today
Hands-on Practice:
Active Learning from scratch
Weak Supervision with Snorkel
LLM Labeling
# Install required packages
pip install snorkel
pip install cleanlab
pip install openai scikit-learn
# Verify installations
python -c "import snorkel; print('Snorkel OK')"
python -c "import cleanlab; print('cleanlab OK')"
python -c "import sklearn; print('sklearn OK')"
You'll implement a complete labeling optimization pipeline!
Week 5: Data Augmentation
More data from existing data - without labeling!
Libraries:
Papers:
Reading:
See you in the lab!