Email: 1 2 3 4 5 6 7 8 9 10
Ann A: S N S S N S N N S S (6 Spam, 4 Not)
Ann B: S N S N N S N S S S (6 Spam, 4 Not)
Agree: ✓ ✓ ✓ ✗ ✓ ✓ ✓ ✗ ✓ ✓ → 8/10 = 80%
| Step | Calculation | Result |
|---|---|---|
| P_observed | 8/10 | 0.80 |
| P_expected | (0.6×0.6) + (0.4×0.4) | 0.52 |
| κ | (0.80 - 0.52) / (1 - 0.52) | 0.58 |
80% sounds good, but κ=0.58 reveals it's only moderately better than chance!
from sklearn.metrics import cohen_kappa_score
ann_a = ["S", "N", "S", "S", "N", "S", "N", "N", "S", "S"]  # Annotator A
ann_b = ["S", "N", "S", "N", "N", "S", "N", "S", "S", "S"]  # Annotator B
kappa = cohen_kappa_score(ann_a, ann_b)  # 0.58
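Where does 0.52 come from? Both annotators marked 6 of 10 emails as Spam, so by chance alone they'd both say Spam 0.6×0.6 of the time and both say Not 0.4×0.4 of the time. A quick check by hand, reusing the lists above:

p_a = ann_a.count("S") / len(ann_a)  # 0.6
p_b = ann_b.count("S") / len(ann_b)  # 0.6
p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / len(ann_a)  # 0.80
p_e = p_a * p_b + (1 - p_a) * (1 - p_b)  # 0.52
print((p_o - p_e) / (1 - p_e))  # 0.58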
| Kappa | Level | Action |
|---|---|---|
| < 0 | Worse than chance | Check for label swap! |
| 0.0–0.40 | Slight/Fair | Major guideline rewrite |
| 0.41–0.60 | Moderate | Refine guidelines |
| 0.61–0.80 | Substantial | Minor tweaks |
| 0.81–1.0 | Almost Perfect | Production ready! |
Target: κ ≥ 0.8 for production | If κ < 0.6: Fix guidelines before collecting more data!
| Scenario | Use This Metric |
|---|---|
| 2 annotators, categorical | Cohen's Kappa |
| 3+ annotators, categorical | Fleiss' Kappa |
| Bounding boxes / masks | IoU (Intersection over Union) |
| Text spans (NER) | Span F1 |
| Transcription | WER (Word Error Rate) |
Same intuition: How much better than chance?
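For the 3+ annotator case, here is a minimal Fleiss' Kappa sketch using statsmodels (assuming it is installed; the ratings matrix is made up for illustration):

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([  # rows = items, columns = annotators
    ["S", "S", "S"],
    ["N", "N", "S"],
    ["S", "N", "S"],
    ["N", "N", "N"],
])
counts, _ = aggregate_raters(ratings)  # item x category count table
print(fleiss_kappa(counts, method="fleiss"))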

IoU > 0.5: Generally considered a match | IoU > 0.7: Good agreement

import numpy as np

def segmentation_iou(mask1, mask2):
    """IoU between two boolean segmentation masks."""
    intersection = np.logical_and(mask1, mask2).sum()
    union = np.logical_or(mask1, mask2).sum()
    return intersection / union if union > 0 else 0.0
Also common: Dice = 2×Intersection / (|A| + |B|)
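For comparison, a minimal Dice sketch on the same boolean masks (function name illustrative):

import numpy as np

def dice_coefficient(mask1, mask2):
    """Dice = 2 x intersection / (|A| + |B|) on boolean masks."""
    intersection = np.logical_and(mask1, mask2).sum()
    total = mask1.sum() + mask2.sum()
    return 2 * intersection / total if total > 0 else 0.0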
| Task | Metric | Typical Good IAA |
|---|---|---|
| Text Classification | Cohen's Kappa | > 0.8 |
| Sentiment Analysis | Cohen's Kappa | > 0.7 |
| NER | Span F1 | > 0.85 |
| Object Detection | Mean IoU | > 0.7 |
| Segmentation | Mean IoU | > 0.8 |
| Transcription | WER | < 5% between annotators |
| Emotion Recognition | Cohen's Kappa | > 0.6 (subjective) |
Brief overview — details in lab
| Pillar | Key Point |
|---|---|
| 1. GUIDELINES | Clear definitions + edge cases |
| 2. TRAINING | Calibration rounds until κ > 0.8 |
| 3. GOLD STANDARD | 10% known-correct items mixed in |
| 4. REDUNDANCY | 2-3 annotators per item |
| 5. MONITORING | Track IAA and accuracy over time |
| 6. ADJUDICATION | Majority vote (sketch below) or expert review |
Workflow: Pilot (50-100 items) → Measure IAA → Fix guidelines → Production
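Pillar 6 in practice: a minimal majority-vote sketch (helper name illustrative); ties go to an expert:

from collections import Counter

def adjudicate(labels):
    """Majority vote over one item's labels; None means escalate to an expert."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie -> expert review
    return counts[0][0]

print(adjudicate(["S", "S", "N"]))  # S
print(adjudicate(["S", "N"]))       # None -> expert review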
Common interview questions on data labeling:
"How would you ensure label quality in a large annotation project?"
"What is Cohen's Kappa and why use it instead of percent agreement?"
Labels are the bottleneck - 80% of AI project time
Different tasks, different challenges - NER != Classification != Segmentation
Start with Label Studio - Free, flexible, supports all modalities
Measure agreement - Cohen's Kappa >= 0.8 before production
Invest in guidelines - Decision trees, examples, edge cases
Quality control is ongoing - Gold questions, redundancy, monitoring
Budget realistically - complexity * redundancy * domain expertise
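To make the budget formula concrete, a back-of-the-envelope sketch (all numbers hypothetical):

n_items, redundancy, cost_per_label = 10_000, 3, 0.08  # hypothetical values
expert_multiplier = 2.0  # assumption: domain expertise roughly doubles the rate
print(f"${n_items * redundancy * cost_per_label * expert_multiplier:,.0f}")  # $4,800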
What you'll build today
Hands-on Practice:
Install Label Studio - Set up local annotation server
Create annotation project - Configure for sentiment analysis
Write guidelines - Clear definitions and examples
Label data - Annotate 30 movie reviews
Calculate IAA - Measure Cohen's Kappa with a partner
Discuss disagreements - Refine guidelines based on edge cases
# Install Label Studio
pip install label-studio
# Start the server
label-studio start
# Access at http://localhost:8080
You'll create a sentiment analysis project and experience the full annotation workflow!
Week 4: Optimizing Labeling
The techniques to make labeling 10x more efficient!
See you in the lab!
Reference slides — not covered in lecture
Task: Convert speech to text with timestamps.
{
  "audio": "interview.wav",
  "transcription": [
    {"start": 0.0, "end": 3.2, "speaker": "A", "text": "Hello, how are you?"},
    {"start": 3.5, "end": 5.8, "speaker": "B", "text": "I'm doing well, thank you."}
  ]
}
Challenges: Speaker diarization, overlapping speech, accents, filler words
Task: Identify and timestamp sound events.
{
  "audio": "home_audio.wav",
  "events": [
    {"start": 2.3, "end": 3.1, "label": "door_slam"},
    {"start": 5.0, "end": 8.2, "label": "dog_bark"},
    {"start": 10.5, "end": 11.0, "label": "glass_break"}
  ]
}
Applications: Surveillance, healthcare monitoring, environmental sounds
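Agreement on timestamped events is typically scored by interval overlap. A minimal temporal-IoU sketch (function name illustrative):

def temporal_iou(a, b):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.3, 3.1), (2.5, 3.3)))  # 0.6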
Task: Follow the same object across multiple frames.
{
  "video": "traffic.mp4",
  "tracks": [
    {
      "track_id": 1,  // Same ID across all frames!
      "category": "car",
      "bboxes": [
        {"frame": 0, "bbox": [100, 200, 50, 80]},
        {"frame": 1, "bbox": [105, 202, 50, 80]},
        {"frame": 2, "bbox": [110, 204, 50, 80]}
      ]
    }
  ]
}
Challenge: Re-identification after occlusion
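To see why box overlap drives track association, here is a minimal sketch comparing the consecutive-frame boxes above (assuming the [x, y, w, h] convention; helper name illustrative):

def iou_xywh(a, b):
    """IoU for boxes given as [x, y, w, h]."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# Frames 0 and 1 of track 1: high IoU -> same track_id
print(iou_xywh([100, 200, 50, 80], [105, 202, 50, 80]))  # ~0.78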
Task: Divide video into meaningful segments.

| Segment | Start | End | Label |
|---|---|---|---|
| 1 | 0:00 | 0:15 | gather_ingredients |
| 2 | 0:15 | 0:45 | chop_vegetables |
| 3 | 0:45 | 1:30 | cook_in_pan |
Applications: Sports analysis, surgical videos, tutorials

| Matching Mode | Agreement (example with 3 entity spans) |
|---|---|
| Exact Match | 2/3 = 0.67 (spans must match exactly) |
| Partial Match | 3/3 = 1.0 (overlapping spans count) |
Span F1 = Harmonic mean of Precision & Recall on entity spans
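A minimal exact-match span F1 sketch (spans as (start, end, label) tuples; function name illustrative):

def span_f1(gold, pred):
    """Exact-match F1 over entity spans given as (start, end, label) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # spans that match exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0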

import jiwer

# jiwer.wer(reference, hypothesis): edit errors divided by reference word count
wer = jiwer.wer("The cat sat on the mat", "The cat sit on a mat")
print(f"WER: {wer:.1%}")  # 33.3% (2 substitutions / 6 reference words)
def calculate_iou(box1, box2):
    """Calculate IoU between two boxes [x1, y1, x2, y2]."""
    x1, y1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    x2, y2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    if x2 < x1 or y2 < y1:
        return 0.0  # No overlap
    intersection = (x2 - x1) * (y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return intersection / (area1 + area2 - intersection)
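For example, two partially overlapping boxes (coordinates illustrative):

print(calculate_iou([0, 0, 100, 100], [50, 50, 150, 150]))  # ~0.14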