Data Collection for Machine Learning

Week 1 · CS 203: Software Tools and Techniques for AI

Prof. Nipun Batra
IIT Gandhinagar

Part 1: The Motivation

Why do we need to collect data?

Imagine: You Work at Netflix

NETFLIX — Your Boss: "We have $500M budget for movie acquisitions. Which movies should we license?"

The Question: Can we predict which movies will succeed?

Your Role: Data Scientist

Your Mission: Build a model to predict movie success

The Problem Statement

Goal: Predict box office revenue based on movie attributes

But wait... What features? What data? Where does it come from?

What We Need: The Target Dataset

Title | Year | Genre | Budget | Revenue | Rating | Director | Cast
Inception | 2010 | Sci-Fi | $160M | $836M | 8.8 | C. Nolan | DiCaprio
Avatar | 2009 | Action | $237M | $2.9B | 7.9 | Cameron | Worthington
The Room | 2003 | Drama | $6M | $1.9M | 3.9 | Wiseau | Wiseau
... | ... | ... | ... | ... | ... | ... | ...

We need 10,000+ movies with complete information.

Question: Where does this data come from?

The Reality Check

  • This data doesn't exist in one place
  • No single CSV file with everything
  • Can't just "download" the dataset
  • We must BUILD the dataset ourselves

This is the real world of data science.

The ML Pipeline Reality

The uncomfortable truth:

  • 80% of ML work is data engineering
  • Models are the easy part
  • Garbage In = Garbage Out

Why Is Data Collection So Hard?

The Data Collection Paradox: The data you need rarely exists in the form you need it.

Challenge | Example
Scattered sources | IMDb, Box Office Mojo, Rotten Tomatoes
Different formats | JSON, HTML, CSV
Missing values | Budget missing for 40% of movies
Inconsistent naming | "The Dark Knight" vs "Dark Knight, The"
Rate limits | Only 100 requests/day

Today's Mission

By the end of this lecture, you will know how to:

  1. Find data sources for any project
  2. Understand how the web works (HTTP)
  3. Use Chrome DevTools to inspect network traffic
  4. Make requests using curl from the command line
  5. Write Python scripts with the requests library
  6. Handle different data formats
  7. Scrape websites when APIs don't exist

Part 2: Where Does Data Come From?

Finding the right sources

Three Ways to Get Data

Option 1: Pre-built Datasets

Where to find them:

Source | Example Datasets | Pros | Cons
Kaggle | Movies, Titanic, Housing | Ready to use, competitions | May be outdated
UCI ML Repository | Classic ML datasets | Well-documented | Academic focus
HuggingFace | NLP datasets, models | Easy loading | Specialized
Government Portals | Census, economic data | Authoritative | Limited scope

Verdict: Great starting point, but often not enough for real projects.

Option 2: APIs (Application Programming Interface)

APIs = Structured way to request data from servers

Examples for our Netflix project:

  • OMDb API: Movie metadata (title, year, ratings)
  • TMDb API: Detailed movie info, cast, crew
  • Box Office Mojo: Revenue data

Option 3: Web Scraping

When APIs don't exist or don't have what you need:

When to scrape: Reviews, prices, content not in APIs.

Our Strategy for Netflix Project

Data Needed | Source | Method
Movie titles, years | OMDb API | API calls
Ratings, genres | OMDb API | API calls
Budget, revenue | TMDb API | API calls
User reviews | IMDb website | Scraping
Critic reviews | Rotten Tomatoes | Scraping

Today's focus: Learn both API calls and scraping.

Decision Tree: How to Get Data

Ask these questions in order:

  1. Does a ready-made dataset exist?
    → YES: Download it (Kaggle, HuggingFace)
    → NO: Continue to step 2...

  2. Does an official API exist?
    → YES: Is it free/affordable? → Use the API
    → NO: Continue to step 3...

Decision Tree (continued)

  3. Can you scrape the website?
    → Check robots.txt and ToS first
    → YES: Scrape ethically
    → NO: Look for alternatives

  4. None of the above?
    → Manual data collection
    → Partner with data owner
    → Reframe the problem

Most real projects use a combination of all methods!

Part 3: What is an API?

The contract between programs

API: A Restaurant Analogy

Restaurant | API
Menu | Documentation
Order | Request
Kitchen | Server
Food | Response

Our Sample Database

Try it yourself: sqlite3 data/movies.db "SELECT * FROM movies"

API: The Formal Definition

API (Application Programming Interface)

A defined set of rules and protocols for building and interacting with software applications.

# Without API (direct database access - dangerous!)
cursor.execute("SELECT * FROM movies WHERE title = 'Inception'")
# Returns: (1, 'Inception', 2010, 'Sci-Fi', 'Christopher Nolan', 8.8, 160.0, 836.0)

# With API (safe, controlled access)
requests.get("https://nipun-api-testing.hf.space/items")
# Returns: {"items": [...], "count": 3}

APIs provide:

  • Security (no direct DB access)
  • Rate limiting (fair usage)
  • Versioning (backwards compatibility)
  • Documentation (how to use it)

Why Do APIs Exist?

APIs are like a bank teller window. You can't walk into the vault, but you can request transactions through a controlled interface.

APIs Provide Protection

Without APIs | With APIs
Anyone reads ALL data | Only expose what you want
Anyone can modify/delete | Validate every request
No tracking | Log and monitor usage
Server overwhelmed | Rate limiting protects resources

Reading API Documentation

Before making any API call, check the docs for:

  1. Base URL - Where do requests go?
  2. Authentication - API key? Where does it go?
  3. Endpoints - What resources are available?
  4. Rate limits - How many requests per day?

Example: OMDb API Docs

Base URL: https://www.omdbapi.com/
Auth: apikey parameter in URL
Parameters: t (title), i (IMDb ID), y (year)
Rate limit: 1,000 requests/day (free tier)

Get your free API key

Types of APIs

Type | Description | Example
REST API | HTTP-based, stateless, resource-oriented | OMDb, GitHub
GraphQL | Query language, get exactly what you need | GitHub v4, Shopify
SOAP | XML-based, enterprise | Legacy banking
WebSocket | Real-time, bidirectional | Chat apps, live data

For data collection, we focus on REST APIs (most common).

REST API: Key Principles

REST = REpresentational State Transfer

  1. Stateless: Server doesn't remember previous requests
  2. Resource-based: URLs represent things (nouns)
  3. HTTP Methods: Standard verbs (GET, POST, PUT, DELETE)
  4. Standard formats: JSON or XML responses
Good URL Design:
GET  /movies           → List all movies
GET  /movies/123       → Get movie with ID 123
POST /movies           → Create new movie
PUT  /movies/123       → Update movie 123
DELETE /movies/123     → Delete movie 123

Anatomy of an API Call

https://api.omdbapi.com/?apikey=abc123&t=Inception&y=2010
└─┬─┘  └──────┬──────┘ └─┬─┘└──────────┬───────────────┘
  │           │          │             │
Protocol    Domain     Path     Query Parameters
(HTTPS)   (server)   (endpoint)   (key=value pairs)

Query Parameters (after the ?):

  • apikey=abc123 → Authentication
  • t=Inception → Movie title
  • y=2010 → Year (optional filter)

Multiple parameters joined with &

API Authentication

Most APIs require authentication to:

  • Track usage
  • Enforce rate limits
  • Bill customers

Common methods:

# 1. API Key in URL (simplest)
GET /movies?apikey=YOUR_KEY

# 2. API Key in Header
GET /movies
X-API-Key: YOUR_KEY

# 3. Bearer Token (OAuth)
GET /movies
Authorization: Bearer YOUR_TOKEN
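The same three patterns in Python's requests (a sketch; the api.example.com URLs are placeholders, as in the curl examples elsewhere in these slides):

import requests

# 1. API key as a query parameter
requests.get("https://www.omdbapi.com/", params={"apikey": "YOUR_KEY", "t": "Inception"})

# 2. API key in a custom header
requests.get("https://api.example.com/movies", headers={"X-API-Key": "YOUR_KEY"})

# 3. Bearer token (OAuth)
requests.get("https://api.example.com/movies", headers={"Authorization": "Bearer YOUR_TOKEN"})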

Rate Limiting

Why? Servers have limited resources.

Tier Requests/Day
Free 100
Basic 1,000
Pro 10,000

If you exceed: HTTP 429 (Too Many Requests)

Check headers: X-RateLimit-Remaining: 42

Dealing with Rate Limits

Strategy 1: Simple delay

import time
import requests

for movie in movies:                      # movies = list of titles to fetch
    response = requests.get(api_url, params={"t": movie}, timeout=10)
    time.sleep(1)  # Wait 1 second between requests

Exponential Backoff

Wait longer after each failure: 1s → 2s → 4s → 8s → success!
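A minimal sketch of this strategy (get_with_backoff is a hypothetical helper, not from any library):

import time
import requests

def get_with_backoff(url, params=None, max_retries=5):
    """Retry on HTTP 429, doubling the wait each time: 1s, 2s, 4s, 8s, ..."""
    delay = 1
    for attempt in range(max_retries):
        response = requests.get(url, params=params, timeout=10)
        if response.status_code != 429:   # not rate limited: return immediately
            return response
        time.sleep(delay)                 # wait before retrying
        delay *= 2                        # exponential backoff
    return response                       # still rate limited after all retries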

Part 4: HTTP Fundamentals

The language of the web

What is HTTP?

HTTP = HyperText Transfer Protocol

The foundation of data communication on the web.

Key characteristics:

  • Stateless: Each request is independent
  • Text-based: Human-readable (mostly)
  • Port 80 (HTTP) or Port 443 (HTTPS)

Understanding "Stateless"

The Goldfish Analogy: Server forgets you after every request.

Request 1: "I'm Alice. Show me Inception." → "Here's data."
Request 2: "Now show me Avatar." → "Who are you?"

Why stateless? Scalability - any server can handle any request.

Workaround: Cookies, tokens, session IDs (sent with every request)
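For example, requests.Session re-sends cookies for you; a quick sketch using httpbin.org (which simply echoes cookies back):

import requests

session = requests.Session()

# The server sets a cookie on the first response...
session.get("https://httpbin.org/cookies/set/user/alice")

# ...and the session automatically sends it with every later request
print(session.get("https://httpbin.org/cookies").json())
# {'cookies': {'user': 'alice'}}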

The Client-Server Model

Chrome DevTools: Your HTTP Inspector

Open DevTools: F12 or Cmd+Option+I (Mac) / Ctrl+Shift+I (Windows)

DevTools lets you see:

  • Every HTTP request your browser makes
  • Request/response headers and body
  • Copy requests as curl commands!

Navigate to the "Network" tab → This is where the magic happens

Try It: Visit iitgn.ac.in

What your browser sends:

GET / HTTP/1.1
Host: iitgn.ac.in
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...

What server responds:

HTTP/1.1 200 OK
Server: Apache
Content-Type: text/html; charset=UTF-8

Try It: Visit Teaching API

Filter by "Fetch/XHR" to see API calls:

GET /items HTTP/1.1
Host: nipun-api-testing.hf.space
Accept: */*

Response (JSON):

{"items": [{"id": 1, "name": "Apple", "price": 1.5}, ...], "count": 3}

Right-click → "Copy as cURL" to replay in terminal!

URL Anatomy

https://api.omdbapi.com:443/v1/movies?t=Inception&y=2010
└─┬──┘ └──────┬───────┘└┬─┘└───┬───┘└─────────┬────────┘
Protocol    Host      Port   Path          Query
Component Example
Protocol https:// (secure)
Host api.omdbapi.com
Path /v1/movies
Query ?t=Inception&y=2010

Key HTTP Headers

Header What it does
Host Which server to contact
User-Agent Identifies your browser/script
Accept What format you want back
Content-Type Format of data you're sending
Authorization Your API key or token

HTTP Methods: GET vs POST

Aspect | GET | POST
Purpose | Retrieve data | Submit data
Parameters | In URL (?key=value) | In body
Example | Search, fetch details | Login, upload, create
Data collection | 90% of the time | 10% of the time
# GET - parameters in URL
curl "https://api.example.com/search?q=inception"

# POST - data in body
curl -X POST https://api.example.com/login -d '{"user":"alice"}'

HTTP Status Codes

Status codes are grouped by category:

Range | Category | Meaning
1xx | Informational | Request received, processing
2xx | Success | Request succeeded
3xx | Redirection | Further action needed
4xx | Client Error | Your fault
5xx | Server Error | Their fault

Common Status Codes

Code | Meaning | When
200 OK | Success | Request succeeded
201 Created | Created | POST created a resource
400 Bad Request | Client error | Malformed request
401 Unauthorized | Auth needed | Missing credentials
403 Forbidden | Denied | Not allowed
404 Not Found | Missing | Resource doesn't exist
429 Too Many Requests | Rate limit | Slow down!
500 Internal Error | Server crash | Their fault

Status Code Intuition

First digit = who's to blame: 2xx = OK, 4xx = your fault, 5xx = their fault

if response.status_code == 200:
    data = response.json()       # Success!
elif response.status_code == 404:
    print("Not found")           # Bad ID
elif response.status_code == 429:
    time.sleep(60)               # Rate limited
elif response.status_code >= 500:
    time.sleep(5)                # Server error

Part 6: Response Formats

Same data, different representations

Why Different Formats?

Same movie data can be represented in different formats:

Format | Full Name | Use Case
JSON | JavaScript Object Notation | APIs, Web apps
XML | eXtensible Markup Language | Enterprise, Legacy
CSV | Comma Separated Values | Spreadsheets, ML
HTML | HyperText Markup Language | Web pages
Protobuf | Protocol Buffers | High-performance

Content-Type header tells you the format:

  • application/json → JSON
  • application/xml → XML
  • text/html → HTML
  • text/csv → CSV

Format 1: JSON

The most common API format today. Try it live:

curl https://nipun-api-testing.hf.space/format/json
{
  "format": "JSON",
  "content_type": "application/json",
  "data": {"name": "Alice", "age": 30, "city": "Mumbai"}
}

Pros: Human-readable, lightweight, native to JavaScript
Cons: No schema validation, no comments

JSON Data Types

{
  "string": "Hello World",
  "number": 42,
  "decimal": 3.14159,
  "boolean": true,
  "null_value": null,
  "array": [1, 2, 3],
  "object": {
    "nested": "value"
  }
}

Only 6 data types: string, number, boolean, null, array, object

Note: No native date type! Dates are typically strings: "2010-07-16"

JSON Gotchas

# Numbers might be strings!
data = {"year": "2010"}     # String, not int!
year = int(data["year"])    # Must convert

# Missing keys crash your code
data["director"]                # KeyError!
data.get("director", "Unknown") # Safe!

More JSON Gotchas

# null becomes None in Python
data = {"budget": None}
if data["budget"]:        # This is False!
    print("Has budget")

# Empty string vs null vs missing
{"rating": ""}      # Empty string
{"rating": None}    # Null
{}                  # Missing key
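A small sketch of how to tell these three cases apart in Python:

data = {"rating": "", "budget": None}

# "in" distinguishes a missing key from one that is present but empty/null
print("rating" in data)    # True  -> present, empty string
print("budget" in data)    # True  -> present, but null (None)
print("revenue" in data)   # False -> missing entirely

# .get() only covers the missing case; "" and None still need explicit checks
print(data.get("revenue", 0.0))        # 0.0  (key missing, default used)
print(data.get("budget", 0.0))         # None (key exists, so the default is NOT used)
print(data.get("rating") or "N/A")     # "N/A" ("" is falsy, so the fallback kicks in)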

Format 2: XML

The enterprise standard (still used in SOAP APIs). Try it live:

curl https://nipun-api-testing.hf.space/format/xml
<?xml version="1.0" encoding="UTF-8"?>
<response>
    <format>XML</format>
    <data>
        <user>
            <name>Alice</name>
            <age>30</age>
            <city>Mumbai</city>
        </user>
    </data>
</response>

Pros: Schema validation (XSD), attributes, widespread support
Cons: Verbose, heavier than JSON

JSON vs XML: Same Data

Aspect | JSON | XML
Syntax | {"name": "Inception"} | <name>Inception</name>
Structure | Curly braces {} | Tags <tag></tag>
Size | Lighter (~30% smaller) | More verbose
Attributes | Not supported | Supported
Arrays | [1, 2, 3] | Repeated elements
Usage | Modern APIs | Legacy/Enterprise

Format 3: CSV

The data scientist's friend. Try it live:

curl https://nipun-api-testing.hf.space/format/csv
id,name,price,quantity,description
1,Apple,1.50,100,Fresh red apple
2,Banana,0.75,150,Yellow banana
3,Orange,2.00,80,Juicy orange

Pros: Opens in Excel, pd.read_csv(), very compact
Cons: Flat structure only, no data types, escaping issues
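When an API returns CSV, pandas can parse the response body directly; a minimal sketch using the /format/csv endpoint above:

import io

import pandas as pd
import requests

response = requests.get("https://nipun-api-testing.hf.space/format/csv", timeout=10)
df = pd.read_csv(io.StringIO(response.text))   # parse the CSV text into a DataFrame
print(df.head())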

Format 4: HTML

What you get when scraping websites.

<div class="movie-card">
  <h2 class="title">Inception</h2>
  <span class="year">2010</span>
  <ul class="genres">
    <li>Sci-Fi</li>
    <li>Action</li>
  </ul>
  <p class="rating">Rating: 8.8/10</p>
</div>

Not designed for data exchange!

  • Mixed with presentation (CSS, layout)
  • Need to parse and extract relevant data
  • Structure varies by website

Format 5: Protocol Buffers (Protobuf)

Google's high-performance binary format.

// movie.proto (schema definition)
message Movie {
  string title = 1;
  int32 year = 2;
  repeated string genres = 3;
  float rating = 4;
}
# After compiling: protoc --python_out=. movie.proto
from movie_pb2 import Movie
movie = Movie(title="Inception", year=2010, genres=["Sci-Fi", "Action"], rating=8.8)
binary_data = movie.SerializeToString()  # Only 25 bytes!
print(binary_data.hex())  # 0a09496e63657074696f6e10da0f...

Pros: 10x smaller, 100x faster parsing
Cons: Need schema, binary format, requires tooling

Format Comparison: Same Movie

Format | Size | Readability | Use Case
JSON | 150 bytes | High | REST APIs
XML | 200 bytes | Medium | Enterprise
CSV | 50 bytes | High | Data exchange
HTML | 300 bytes | Low | Web pages
Protobuf | 30 bytes | None | High-perf APIs

For this course: Focus on JSON and HTML

Part 7: Making Requests with curl

The command-line HTTP client

What is curl?

curl = "Client URL" - a command-line tool for transferring data.

# Your first curl command
curl "https://www.omdbapi.com/?t=Inception&apikey=[API_KEY]"

Why learn curl?

  • Universal (works everywhere)
  • Quick debugging
  • Foundation for understanding HTTP
  • Copy from DevTools, paste and run

curl: Basic Syntax

curl [options] [URL]

Common options:

Option | Meaning | Example
-X | HTTP method | -X POST
-H | Add header | -H "Accept: application/json"
-d | Send data (body) | -d '{"key": "value"}'
-o | Output to file | -o movie.json
-I | Headers only | -I
-v | Verbose output | -v
-s | Silent mode | -s

curl: GET Request

# Try these right now! (no API key needed)
curl https://nipun-api-testing.hf.space/hello
# {"message": "Hello, World!"}

curl https://nipun-api-testing.hf.space/items
# {"items": [{"id": 1, "name": "Apple", ...}], "count": 3}

curl "https://nipun-api-testing.hf.space/greet?name=Alice"
# {"greeting": "Hello, Alice!"}

Important: Quote URLs with ? or & (prevents shell interpretation)

curl: Real API Example (OMDb)

For actual movie data, use OMDb API (free tier: 1000 requests/day)

# Get movie by title (requires API key)
curl "https://www.omdbapi.com/?t=Inception&apikey=YOUR_KEY"
{
  "Title": "Inception", "Year": "2010", "Rated": "PG-13",
  "Genre": "Action, Adventure, Sci-Fi",
  "Director": "Christopher Nolan",
  "imdbRating": "8.8", "imdbID": "tt1375666"
}

Get your free key: https://www.omdbapi.com/apikey.aspx

curl: Adding Headers

curl "https://www.omdbapi.com/?t=Inception&apikey=[API_KEY]" \
     -H "Accept: application/json" \
     -H "Authorization: Bearer YOUR_TOKEN" \
     -H "User-Agent: MyApp/1.0"

Common headers to add:

  • Accept: application/json - Request JSON response
  • Authorization: Bearer TOKEN - Authentication
  • Content-Type: application/json - When sending JSON

curl: Viewing Response Headers

# Show only response headers (no body)
curl -I "https://www.omdbapi.com/?t=Inception&apikey=[API_KEY]"

Output:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 1024
Cache-Control: public, max-age=86400
X-RateLimit-Remaining: 999

curl: Verbose Mode

curl -v "https://www.omdbapi.com/?t=Inception&apikey=[API_KEY]"

Shows everything (request AND response):

> GET /?apikey=[API_KEY]&t=Inception HTTP/2
> Host: www.omdbapi.com
> User-Agent: curl/7.79.1
> Accept: */*
>
< content-length: 1024
<
{"Title":"Inception"...}

> = What you sent (request)
< = What you received (response)

Pretty Printing with jq

Raw JSON is hard to read. Pipe to jq for formatting:

curl -s https://nipun-api-testing.hf.space/items | jq .
{"items": [{"id": 1, "name": "Apple", ...}, ...], "count": 3}

jq: Extracting and Transforming Data

# Get just the items array
curl -s https://nipun-api-testing.hf.space/items | jq '.items'

# Get first item only
curl -s ... | jq '.items[0]'
# {"id": 1, "name": "Apple", "price": 1.5, ...}

# Get all names
curl -s ... | jq '.items[].name'
# "Apple"  "Banana"  "Orange"

# Create new structure
curl -s ... | jq '.items[] | {product: .name, cost: .price}'

More on jq next week!

curl: Saving to File

# Save response to file
curl "https://www.omdbapi.com/?t=Inception&apikey=[API_KEY]" \
     -o inception.json

# Silent mode (no progress bar)
curl -s "https://www.omdbapi.com/?t=Inception&apikey=[API_KEY]" -o output.json
# Save with pretty formatting
curl -s ... | jq . > formatted.json

curl: POST Request

curl -X 'POST' \
  'https://nipun-api-testing.hf.space/items' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "name": "Laptop",
  "price": 999.99,
  "quantity": 1,
  "description": "A powerful laptop"
}'

Components:

  • -X POST - Use POST method
  • -H "Content-Type: application/json" - Tell server we're sending JSON
  • -d '...' - The data (request body)

curl: POST with Form Data

curl -X POST "https://nipun-api-testing.hf.space/form/contact" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "name=Alice" \
  -d "email=alice@example.com" \
  -d "subject=Hello" \
  -d "message=Nice API!"

curl: File Upload

# Upload a file
curl -X POST "https://nipun-api-testing.hf.space/upload/file" -F "file=@dummy.txt"

-F = multipart form data (for file uploads)
@ = read from file

curl: Useful Options

# Retry on failure
curl --retry 3 "https://www.omdbapi.com/data"

# Set timeout (seconds)
curl --max-time 10 "https://www.omdbapi.com/slow"

# Follow redirects
curl -L "https://short.url/abc"

# Fail silently on HTTP errors
curl -f "https://www.omdbapi.com/notfound"
# (exits with error code instead of showing error page)

Part 9: Python requests Library

Programmatic data collection

Why Python requests?

curl is great for testing, but for automation you need Python.

# Install
pip install requests

Benefits over curl:

  • Loop over many URLs
  • Parse JSON automatically
  • Handle errors gracefully
  • Store data in variables
  • Integrate with pandas, ML pipelines

requests: Simple GET

import requests

# Make a GET request to OMDb API
response = requests.get(
    "https://www.omdbapi.com/",
    params={
        "apikey": "demo",   # replace with your real API key
        "t": "Inception"
    }
)

# Check HTTP status code
print(response.status_code)  # 200 means OK

# Parse JSON response
data = response.json()

# Access fields from JSON
print(data["Title"])  # Inception
print(data["Year"])   # 2010

requests: Using params

Don't manually build query strings!

# Bad (manual string building)
url = "https://www.omdbapi.com/?apikey=demo&t=Inception&y=2010"

# Good (use params dict)
response = requests.get(
    "https://www.omdbapi.com/",
    params={
        "apikey": "demo",
        "t": "Inception",
        "y": 2010
    }
)

Python handles URL encoding automatically!
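For example, a title with spaces is encoded for you; response.url shows the final URL that was actually requested:

import requests

response = requests.get(
    "https://www.omdbapi.com/",
    params={"apikey": "demo", "t": "The Dark Knight", "y": 2008}
)

# Spaces and special characters are encoded automatically
print(response.url)
# https://www.omdbapi.com/?apikey=demo&t=The+Dark+Knight&y=2008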

requests: Adding Headers

import requests

response = requests.get(
    "https://httpbin.org/headers",
    headers={
        "Authorization": "Bearer test-token-123",
        "Accept": "application/json",
        "User-Agent": "MyApp/1.0"
    }
)

print(response.status_code)
print(response.json())

requests: Response Object

import requests
response = requests.get("https://nipun-api-testing.hf.space/items")

response.status_code        # 200
response.headers["Content-Type"]  # 'application/json'
response.text               # Raw text (string)
response.json()             # Parsed as Python dict
response.ok                 # True for 2xx status codes
# Example output
>>> response.json()
{'items': [{'id': 1, 'name': 'Apple', ...}], 'count': 3}

requests: POST with JSON

import requests

response = requests.post(
    "https://nipun-api-testing.hf.space/items",
    json={"name": "Laptop", "price": 999.99, "quantity": 1}
)
print(response.status_code)  # 201 (Created)
print(response.json())       # {'id': 4, 'name': 'Laptop', ...}

requests: POST with Form Data

response = requests.post(
    "https://nipun-api-testing.hf.space/form/contact",
    data={"name": "Alice", "email": "alice@example.com", "message": "Hello!"}
)
print(response.json())  # {'status': 'received', 'name': 'Alice', ...}

Remember:

  • json= → sends JSON (Content-Type: application/json)
  • data= → sends form data (Content-Type: application/x-www-form-urlencoded)
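There is also files= for uploads, the requests equivalent of the curl -F example earlier:

import requests

# Upload a file as multipart/form-data (like: curl -F "file=@dummy.txt")
with open("dummy.txt", "rb") as f:
    response = requests.post(
        "https://nipun-api-testing.hf.space/upload/file",
        files={"file": f}
    )
print(response.status_code)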

requests: Error Handling

try:
    response = requests.get("https://nipun-api-testing.hf.space/items", timeout=10)
    response.raise_for_status()  # Raises exception for 4xx/5xx
    data = response.json()
except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e}")
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")

Key points:

  • Always set timeout to avoid hanging forever
  • raise_for_status() converts bad status codes to exceptions
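For longer collection runs, you can also let the library retry transient failures for you; a sketch using urllib3's Retry via a mounted adapter (the retry counts and status list here are illustrative):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(total=3, backoff_factor=1,
              status_forcelist=[429, 500, 502, 503, 504])  # statuses worth retrying
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))

response = session.get("https://nipun-api-testing.hf.space/items", timeout=10)
print(response.status_code)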

requests: Looping Over Multiple Items

movies = ["Inception", "Avatar", "The Matrix"]
results = []

for title in movies:
    response = requests.get(
        "https://www.omdbapi.com/",
        params={"apikey": "YOUR_KEY", "t": title}, timeout=10
    )
    if response.ok and response.json().get("Response") == "True":
        results.append(response.json())
        print(f"Got: {title}")
    time.sleep(1)  # Be polite - don't hammer the server

print(f"Collected {len(results)} movies")

requests: Practical Example

import time
import requests
import pandas as pd

def fetch_movies(titles, api_key):
    movies = []
    for title in titles:
        r = requests.get("https://www.omdbapi.com/",
                        params={"apikey": api_key, "t": title}, timeout=10)
        if r.ok and r.json().get("Response") == "True":
            movies.append(r.json())
        time.sleep(0.5)
    return pd.DataFrame(movies)

df = fetch_movies(["Inception", "Avatar", "The Matrix"], "YOUR_KEY")
print(df[["Title", "Year", "Genre", "imdbRating"]])

Data Collection Best Practices

  1. Save raw responses - Save the full JSON, not just extracted fields
  2. Log everything - Track successes, failures, and why
  3. Use checkpoints - Resume after crashes
  4. Handle edge cases - Missing budgets, directors, etc.
  5. Validate as you go - Check data types early

Why? Don't re-collect 10,000 movies because you missed a field!
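A sketch of practices 1 and 3 together (fetch_and_cache and the raw_responses folder are hypothetical names, not part of any library):

import json
import os

import requests

RAW_DIR = "raw_responses"             # folder for raw API responses
os.makedirs(RAW_DIR, exist_ok=True)

def fetch_and_cache(title, api_key):
    """Save the full raw JSON once; skip titles already on disk (checkpoint)."""
    path = os.path.join(RAW_DIR, f"{title}.json")
    if os.path.exists(path):          # already collected: resume without re-fetching
        with open(path) as f:
            return json.load(f)
    response = requests.get("https://www.omdbapi.com/",
                            params={"apikey": api_key, "t": title}, timeout=10)
    data = response.json()
    with open(path, "w") as f:        # save the raw response, not just extracted fields
        json.dump(data, f)
    return data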

curl vs requests: Comparison

Aspect | curl | Python requests
Use case | Quick testing | Automation
Learning | Interactive exploration | Production code
Looping | Bash scripts | Native Python
JSON parsing | Needs jq | Built-in .json()
Error handling | Exit codes | Exceptions
DevTools | Copy as curl | Convert from curl

Workflow: DevTools → Copy as curl → Test → Convert to Python

Part 10: Web Scraping

When APIs don't exist

When to Scrape?

DO scrape when:

  • No API available
  • API doesn't have the data you need
  • API is too expensive
  • Public information on public websites

DON'T scrape when:

  • robots.txt disallows it
  • Terms of Service prohibit it
  • Data is behind login (personal data)
  • It would harm the website

API vs Scraping Comparison

Aspect | API | Scraping
Reliability | Stable | Fragile (HTML changes)
Speed | Fast | Slower
Data Format | Structured JSON | Unstructured HTML
Rate Limits | Documented | Unknown
Legality | Clear ToS | Gray area
Maintenance | Low | High

Rule: Always prefer APIs when available.

HTML Structure Basics

HTML = Nested elements forming a tree (DOM)

<!DOCTYPE html>
<html>
  <head>
    <title>Movie Database</title>
  </head>
  <body>
    <div class="movie" id="movie-123">
      <h2 class="title">Inception</h2>
      <span class="year">2010</span>
      <p class="plot">A thief who steals...</p>
    </div>
  </body>
</html>

The DOM Tree

                    html
                    /  \
                 head   body
                  |       |
               title    div.movie
                       /    |    \
                   h2.title span.year  p.plot
                      |         |         |
                 "Inception"  "2010"   "A thief..."

DOM = Document Object Model
Scraping = Navigating this tree to extract data

CSS Selectors: Finding Elements

Selector | Meaning | Example match
div | Element type | <div>...</div>
.movie | Class name | <div class="movie">
#main | Element ID | <div id="main">
div.movie | Tag with class | <div class="movie">
.movie .title | Nested element | .title inside .movie
a[href="/movies"] | Attribute value | <a href="/movies">

BeautifulSoup: Setup

pip install beautifulsoup4 requests
import requests
from bs4 import BeautifulSoup

# Fetch the hosted sample movie page
url = "https://nipunbatra.github.io/stt-ai-teaching/html/sample-movie-website.html"
response = requests.get(url)
html = response.text

# Parse it
soup = BeautifulSoup(html, 'html.parser')

# Now we can search and extract elements
print(soup.title.string)  # "My Movie Library"

BeautifulSoup: Finding Elements


html = """
<div class="movie">
    <h2 class="title">Inception</h2>
    <span class="year">2010</span>
    <span class="rating">8.8</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find single element
title = soup.find('h2', class_='title')
print(title.text)  # "Inception"

all_movies = soup.find_all('div', class_='movie') # Find all elements (if multiple movies)
print(f"Found {len(all_movies)} movie(s)")

BeautifulSoup: CSS Selectors

soup = BeautifulSoup(html, 'html.parser')

# Select first match
title = soup.select_one('.movie .title')
print(title.text)  # "Inception"

# Select all matches
all_titles = soup.select('.movie .title')
for t in all_titles:
    print(t.text)

# Example: all links starting with "/movies/"
links = soup.select('a[href^="/movies/"]')
for link in links:
    print(link.get('href'))

BeautifulSoup: Extracting Data

# Get text content
element = soup.select_one('.title')
print(element.text)            # "Inception"
print(element.get_text())      # "Inception"
print(element.get_text(strip=True))  # Remove extra whitespace

# Get attributes
link = soup.select_one('a')
print(link.get('href'))        # "/movies/123"
print(link['href'])            # "/movies/123"
print(link.attrs)              # {'href': '/movies/123', 'class': ['btn']}

Scraping Example: Movie List

import requests
from bs4 import BeautifulSoup

url = "https://nipunbatra.github.io/stt-ai-teaching/html/sample-movie-website.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
movies = []

for card in soup.select('.movie-card'):
    movie = {
        'title': card.select_one('.title').text.strip(),
        'year': card.select_one('.year').text.strip(),
        'genre': card.select_one('.genre').text.strip(),
        'rating': card.select_one('.rating').text.strip(),
        'plot': card.select_one('.plot').text.strip()
    }
    movies.append(movie)

for m in movies:
    print(f"{m['title']} ({m['year']}) - {m['genre']} - Rating: {m['rating']}") # Print results

Scraping Ethics & Best Practices

import time
import requests

headers = {'User-Agent': 'MyBot/1.0 (contact@example.com)'}

for url in urls:
    response = requests.get(url, headers=headers)
    time.sleep(1)  # Wait between requests

Rules:

  1. Check robots.txt first
  2. Add delays between requests
  3. Identify yourself (User-Agent)
  4. Cache responses when possible
  5. Respect rate limits

Common Scraping Mistakes

Mistake Solution
No delays Add time.sleep(1)
Hardcoded selectors Handle missing elements
No error handling Wrap in try/except
Ignoring encoding Check response.encoding
Not saving raw HTML Save before parsing

Defensive Scraping Pattern

try:
    title = card.select_one('.title')
    movie['title'] = title.text.strip() if title else "Unknown"
except Exception as e:
    logging.error(f"Failed to parse: {url}, error: {e}")

Always handle missing elements gracefully!

Checking robots.txt - Real Examples

curl https://www.google.com/robots.txt
User-agent: *
Disallow: /search        # Can't scrape search results
Allow: /search/about     # But info pages are OK
Disallow: /?             # No query parameters
curl https://www.amazon.com/robots.txt
User-agent: *
Disallow: /gp/cart       # No shopping carts
Disallow: /gp/sign-in    # No login pages
Disallow: /gp/yourstore  # No personalized pages

Always check before scraping!
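You can also check programmatically with Python's built-in robotparser; a quick sketch:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")
rp.read()                              # download and parse the rules

# can_fetch(user_agent, url): is this bot allowed to fetch this URL?
print(rp.can_fetch("MyBot/1.0", "https://www.google.com/search?q=inception"))  # expected False (Disallow: /search)
print(rp.can_fetch("MyBot/1.0", "https://www.google.com/search/about"))        # expected True (Allow: /search/about)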

Part 11: Putting It All Together

Back to our Netflix mission

Remember Our Goal?

Build a dataset for movie success prediction:

Title Year Genre Budget Revenue Rating Director
? ? ? ? ? ? ?

We now have the tools!

  • DevTools to find APIs
  • curl to test requests
  • requests to automate collection
  • BeautifulSoup for scraping

Our Data Collection Pipeline

Step 1: Collect from API

API_KEY = "your_omdb_key"  # Replace with your actual OMDb API key
movies_to_fetch = ["Inception", "Avatar", "The Matrix"]
results = []

for title in movies_to_fetch:
    response = requests.get(
        "https://www.omdbapi.com/",
        params={"apikey": API_KEY, "t": title},
        timeout=10  # Prevent hanging requests
    )

    if response.ok: # Check HTTP-level success
        data = response.json()

        # Check API-level success
        if data.get("Response") == "True":
            results.append(data)
            print(f"Fetched: {title}")
        else:
            print(f"Movie not found: {title} ({data.get('Error')})")
    else:
        print(f"HTTP error for {title}: {response.status_code}")

print(f"\nCollected {len(results)} movies")

Step 2: Extract Relevant Fields

movies = []

for data in results:
    movie = {
        "title": data.get("Title"),
        "year": data.get("Year"),
        "genre": data.get("Genre"),
        "director": data.get("Director"),
        "rating": data.get("imdbRating"),
        "votes": data.get("imdbVotes"),
        "runtime": data.get("Runtime"),
        "imdb_id": data.get("imdbID")
    }
    movies.append(movie)

Step 3: Save to CSV

import pandas as pd

# Convert to DataFrame
df = pd.DataFrame(movies)

# Clean data
df['year'] = pd.to_numeric(df['year'], errors='coerce')
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
df['votes'] = df['votes'].str.replace(',', '').astype(float)

# Save
df.to_csv('netflix_movie_data.csv', index=False)

print(df.head())

The Result

         title  year                  genre            director  rating
0    Inception  2010  Action, Adventure...  Christopher Nolan     8.8
1       Avatar  2009  Action, Adventure...       James Cameron     7.9
2   The Matrix  1999  Action, Sci-Fi        Lana Wachowski...     8.7

Now ready for ML modeling!

What We Learned: Three Tools

Tool | When to Use | Key Commands
Chrome DevTools | Discover APIs, inspect requests | Network tab, Copy as curl
curl | Test requests quickly | -X, -H, -d
Python requests | Automate collection | .get(), .post(), .json()

Plus BeautifulSoup for scraping when needed!

Part 12: Looking Ahead

Lab preview and next week

This Week's Lab

Hands-on Practice:

  1. Chrome DevTools - Inspect API calls on real websites
  2. curl exercises - Making API requests from terminal
  3. OMDb API - Collecting movie metadata
  4. Python requests - Building a data collection script
  5. BeautifulSoup - Scraping a sample website

Goal: Build a working data collection pipeline.

Lab Environment Setup

# Install dependencies
pip install requests beautifulsoup4 pandas

# Get your API keys
# OMDb: https://www.omdbapi.com/apikey.aspx (free tier)

# Verify installation
python -c "import requests; print('Ready!')"

Next Week Preview

Week 2: Data Validation & Cleaning

  • Schema validation with Pydantic
  • Handling missing data
  • Type conversion and normalization
  • Data quality checks
  • Building validation pipelines

The data we collect today needs cleaning tomorrow!

Key Takeaways

  1. Data collection is 80% of ML work - don't underestimate it
  2. DevTools reveals hidden APIs - always check before scraping
  3. curl for quick testing - then convert to Python
  4. requests for automation - handle loops, errors, storage
  5. Scraping is plan B - use when APIs don't exist
  6. Be ethical - respect robots.txt, rate limits, ToS

Resources

Documentation:

Free APIs for Practice:

Questions?

Thank You!

See you in the lab!