# Install required packages (uncomment if needed)
# !pip install requests beautifulsoup4 pandas
Week 1 Lab: Data Collection for Machine Learning
CS 203: Software Tools and Techniques for AI
Lab Overview
In this lab, you will learn to collect data from the web using:
- HTTP fundamentals - Understanding how the web works
- curl - Command-line HTTP client
- Python requests - Programmatic API calls
- BeautifulSoup - Web scraping when APIs don’t exist
Goal: Build a movie data collection pipeline for Netflix-style movie prediction.
Setup
First, let’s install and import the required libraries.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import time
print("All imports successful!")Part 1: HTTP Fundamentals
Before we start collecting data, we need to understand how the web works.
1.1 Understanding URLs
A URL (Uniform Resource Locator) has several components:
https://api.omdbapi.com:443/v1/movies?t=Inception&y=2010#details
- Protocol (scheme): https
- Host: api.omdbapi.com
- Port: 443
- Path: /v1/movies
- Query: t=Inception&y=2010
- Fragment: details
Question 1.1 (Solved): Parse a URL
Use Python’s urllib.parse to break down a URL into its components.
# SOLVED EXAMPLE
from urllib.parse import urlparse, parse_qs
url = "https://api.omdbapi.com/?apikey=demo&t=Inception&y=2010"
parsed = urlparse(url)
print(f"Scheme (protocol): {parsed.scheme}")
print(f"Host (domain): {parsed.netloc}")
print(f"Path: {parsed.path}")
print(f"Query string: {parsed.query}")
# Parse query parameters into a dictionary
params = parse_qs(parsed.query)
print(f"\nParsed parameters: {params}")Question 1.2: Parse a Different URL
Parse the following GitHub API URL and extract:
1. The host
2. The path
3. All query parameters as a dictionary
URL: https://api.github.com/search/repositories?q=machine+learning&sort=stars&order=desc
# YOUR CODE HERE
url = "https://api.github.com/search/repositories?q=machine+learning&sort=stars&order=desc"
# Parse the URL
# Print the host
# Print the path
# Print the query parameters as a dictionary
1.2 HTTP Status Codes
HTTP status codes tell you what happened with your request:
| Range | Category | Common Examples |
|---|---|---|
| 2xx | Success | 200 OK, 201 Created |
| 3xx | Redirect | 301 Moved, 302 Found |
| 4xx | Client Error | 400 Bad Request, 401 Unauthorized, 404 Not Found |
| 5xx | Server Error | 500 Internal Error, 503 Service Unavailable |
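A quick way to see these codes in practice is httpbin.org, which echoes back whatever status code you request (we use the same service later in this lab). A minimal sketch, assuming httpbin.org is reachable from your environment:
# Observe a few status codes in practice: httpbin.org/status/<code> returns that code
import requests
for code in [200, 404, 500]:
    r = requests.get(f"https://httpbin.org/status/{code}")
    print(f"Requested {code} -> got {r.status_code} (ok={r.ok})")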
Question 1.3: Match Status Codes
Match each scenario to the most likely HTTP status code:
- You requested a movie that doesn’t exist in the database
- You made too many requests and hit the rate limit
- Your API key is invalid
- The request was successful and data was returned
- The server crashed while processing your request
Status codes to choose from: 200, 401, 404, 429, 500
# YOUR ANSWERS HERE
answers = {
"movie_not_found": None, # Replace None with the status code
"rate_limited": None,
"invalid_api_key": None,
"success": None,
"server_crashed": None
}
print(answers)
Part 2: Making Requests with curl
curl is a command-line tool for making HTTP requests. It’s essential for quick testing.
2.1 Basic curl Commands
You can run shell commands in Jupyter using the ! prefix.
Question 2.1 (Solved): Your First API Call
Let’s call a simple public API that requires no authentication.
# SOLVED EXAMPLE
# JSONPlaceholder is a free fake API for testing
!curl -s "https://jsonplaceholder.typicode.com/posts/1"Question 2.2: Pretty Print with jq
The output above is hard to read. Use jq to format it nicely.
Hint: Pipe the curl output to jq: curl ... | jq .
# YOUR CODE HERE
# Fetch the same post but format the output with jq
Question 2.3: Extract Specific Fields with jq
Fetch all posts from https://jsonplaceholder.typicode.com/posts and extract only the title field from each post.
Hint: Use jq '.[].title' to get the title from each element in the array.
# YOUR CODE HERE
Question 2.4: View Response Headers
Use the -I flag to fetch only the response headers (no body) from: https://api.github.com
What is the value of the X-RateLimit-Limit header?
# YOUR CODE HERE
Question 2.5: Add Custom Headers
Make a request to https://httpbin.org/headers with the following custom headers:
- User-Agent: CS203-Lab/1.0
- Accept: application/json
Hint: Use -H "Header-Name: value" for each header.
# YOUR CODE HERE
Part 3: Python requests Library
While curl is great for testing, we need Python for automation.
3.1 Basic GET Requests
Question 3.1 (Solved): Simple GET Request
Make a GET request and inspect the response object.
# SOLVED EXAMPLE
import requests
response = requests.get("https://jsonplaceholder.typicode.com/posts/1")
print(f"Status Code: {response.status_code}")
print(f"Content-Type: {response.headers['Content-Type']}")
print(f"Response OK: {response.ok}")
print(f"\nJSON Data:")
print(response.json())
Question 3.2: Fetch Multiple Posts
Fetch posts from https://jsonplaceholder.typicode.com/posts and:
1. Print the total number of posts
2. Print the titles of the first 5 posts
# YOUR CODE HERE
Question 3.3 (Solved): Using Query Parameters
The proper way to add query parameters is using the params argument.
# SOLVED EXAMPLE
import requests
# Bad way (manual string building)
# url = "https://jsonplaceholder.typicode.com/posts?userId=1"
# Good way (using params)
response = requests.get(
"https://jsonplaceholder.typicode.com/posts",
params={"userId": 1}
)
posts = response.json()
print(f"User 1 has {len(posts)} posts")
print(f"\nActual URL used: {response.url}")Question 3.4: Filter Posts by User
Fetch all posts by user 5 and user 7. Compare how many posts each user has.
Hint: Make two separate requests with different userId values.
# YOUR CODE HERE
3.2 Working with Real APIs
Let’s work with some real-world APIs.
Question 3.5 (Solved): GitHub API - Public Repositories
The GitHub API is free to use (with rate limits) and doesn’t require authentication for public data.
# SOLVED EXAMPLE
import requests
# Fetch information about a popular repository
response = requests.get(
"https://api.github.com/repos/pandas-dev/pandas",
headers={"Accept": "application/vnd.github.v3+json"}
)
if response.ok:
repo = response.json()
print(f"Repository: {repo['full_name']}")
print(f"Description: {repo['description']}")
print(f"Stars: {repo['stargazers_count']:,}")
print(f"Forks: {repo['forks_count']:,}")
print(f"Language: {repo['language']}")
else:
print(f"Error: {response.status_code}")Question 3.6: Compare Popular ML Libraries
Fetch information about these ML-related repositories and create a comparison table:
- scikit-learn/scikit-learn
- pytorch/pytorch
- tensorflow/tensorflow
Show: name, stars, forks, and primary language.
Hint: Loop through the repos and collect data into a list of dictionaries, then create a DataFrame.
# YOUR CODE HERE
repos = [
"scikit-learn/scikit-learn",
"pytorch/pytorch",
"tensorflow/tensorflow"
]
# Fetch data for each repo
# Create a DataFrame
# Display the comparison
Question 3.7: Search GitHub Repositories
Use the GitHub search API to find the top 10 most starred repositories with “machine learning” in their description.
API endpoint: https://api.github.com/search/repositories
Parameters:
- q: search query (e.g., “machine learning”)
- sort: “stars”
- order: “desc”
- per_page: 10
Print the name and star count of each repository.
# YOUR CODE HERE
3.3 Error Handling
Real-world APIs fail. We need to handle errors gracefully.
Question 3.8 (Solved): Handling HTTP Errors
# SOLVED EXAMPLE
import requests
def fetch_with_error_handling(url):
"""Fetch URL with proper error handling."""
try:
response = requests.get(url, timeout=10)
response.raise_for_status() # Raises exception for 4xx/5xx
return response.json()
except requests.exceptions.Timeout:
print(f"Timeout: Request took too long")
except requests.exceptions.HTTPError as e:
print(f"HTTP Error: {e.response.status_code}")
except requests.exceptions.RequestException as e:
print(f"Request failed: {e}")
return None
# Test with valid URL
print("Valid URL:")
data = fetch_with_error_handling("https://jsonplaceholder.typicode.com/posts/1")
if data:
print(f" Got post: {data['title'][:50]}...")
# Test with invalid URL (404)
print("\nInvalid URL (404):")
fetch_with_error_handling("https://jsonplaceholder.typicode.com/posts/99999")Question 3.9: Robust Fetcher Function
Write a function safe_fetch(url, max_retries=3) that:
- Attempts to fetch the URL
- If it fails with a 5xx error, retries up to max_retries times
- Waits 1 second between retries
- Returns the JSON data if successful, None otherwise
Test it with https://httpbin.org/status/500 (always returns 500) and https://jsonplaceholder.typicode.com/posts/1 (always works).
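If you get stuck, here is one possible shape of the retry loop (a sketch only; it deliberately skips the exception handling from Question 3.8, which your version should still include):
# Sketch of a retry loop (one possible structure, not the only valid one)
import time
import requests
def retry_sketch(url, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code < 500:       # success or client error: stop retrying
            return response.json() if response.ok else None
        print(f"Attempt {attempt + 1} got {response.status_code}; retrying in 1s...")
        time.sleep(1)                          # wait before the next attempt
    return None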
# YOUR CODE HERE
import time
def safe_fetch(url, max_retries=3):
"""Fetch URL with retry logic for server errors."""
pass # Implement this
# Test your function
# print("Testing with working URL:")
# result = safe_fetch("https://jsonplaceholder.typicode.com/posts/1")
# print(f"Result: {result}")
# print("\nTesting with failing URL (500):")
# result = safe_fetch("https://httpbin.org/status/500")
# print(f"Result: {result}")Part 4: The OMDb Movie API
Now let’s work with the OMDb API - our main data source for the Netflix project.
Note: You need an API key from https://www.omdbapi.com/apikey.aspx (free tier available).
For this lab, we’ll use a demo key that has limited functionality.
# Set your API key here
# Get a free key from: https://www.omdbapi.com/apikey.aspx
OMDB_API_KEY = "YOUR_API_KEY_HERE" # Replace with your actual key
# For demo purposes, you can try with key "demo" but it's very limited
# OMDB_API_KEY = "demo"Question 4.1 (Solved): Fetch a Single Movie
# SOLVED EXAMPLE
import requests
def fetch_movie(title, year=None, api_key=OMDB_API_KEY):
"""Fetch movie data from OMDb API."""
params = {
"apikey": api_key,
"t": title, # Search by title
"type": "movie"
}
if year:
params["y"] = year
response = requests.get("https://www.omdbapi.com/", params=params)
if response.ok:
data = response.json()
if data.get("Response") == "True":
return data
else:
print(f"Movie not found: {data.get('Error')}")
return None
# Fetch Inception
movie = fetch_movie("Inception", 2010)
if movie:
print(f"Title: {movie['Title']}")
print(f"Year: {movie['Year']}")
print(f"Director: {movie['Director']}")
print(f"IMDB Rating: {movie['imdbRating']}")
print(f"Genre: {movie['Genre']}")Question 4.2: Explore the Response
Fetch data for “The Dark Knight” and print ALL available fields in the response.
Which fields might be useful for predicting movie success?
# YOUR CODE HERE
Question 4.3: Fetch Multiple Movies
Create a function fetch_movies(titles) that:
1. Takes a list of movie titles
2. Fetches data for each movie
3. Returns a list of movie dictionaries (only successful fetches)
4. Adds a 0.5 second delay between requests (to respect rate limits)
Test it with: ["Inception", "The Matrix", "Interstellar", "NonExistentMovie123"]
# YOUR CODE HERE
def fetch_movies(titles):
"""Fetch multiple movies from OMDb API."""
pass # Implement this
# Test
# test_titles = ["Inception", "The Matrix", "Interstellar", "NonExistentMovie123"]
# movies = fetch_movies(test_titles)
# print(f"Successfully fetched {len(movies)} out of {len(test_titles)} movies")Question 4.4: Create a Movie DataFrame
Using the movies you fetched, create a pandas DataFrame with these columns:
- title
- year (as integer)
- genre
- director
- imdb_rating (as float)
- imdb_votes (as integer, remove commas)
- runtime_minutes (as integer, extract from “148 min”)
- box_office (keep as string for now)
Hint: You’ll need to clean the data types.
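The kind of string-to-number cleaning involved looks roughly like this; the values below are made up toy strings, not the movie data itself:
# Toy example of the cleaning needed (values are made up)
import pandas as pd
votes = pd.Series(["2,412,108", "1,234", None])
runtime = pd.Series(["148 min", "136 min", "N/A"])
print(pd.to_numeric(votes.str.replace(",", ""), errors="coerce"))        # strip commas, then convert
print(pd.to_numeric(runtime.str.extract(r"(\d+)")[0], errors="coerce"))  # pull the minutes out of "148 min"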
# YOUR CODE HERE
Question 4.5: Search Movies by Title
OMDb also has a search endpoint that returns multiple results.
Use the s parameter instead of t to search for movies containing “Star Wars”.
API endpoint: https://www.omdbapi.com/?apikey=YOUR_KEY&s=Star Wars&type=movie
Print the title and year of each result.
# YOUR CODE HERE
Question 4.6: Handle Pagination
The OMDb search API returns 10 results per page and includes a totalResults field.
Write a function search_all_movies(query) that:
1. Searches for movies matching the query
2. Fetches ALL pages of results (use the page parameter)
3. Returns a list of all movies found
Hint: totalResults tells you how many movies exist. Divide by 10 to get the number of pages.
Test with a query that has many results like “Batman”.
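The page arithmetic is the only subtle part: with 534 results you need ceil(534 / 10) = 54 pages. A minimal sketch of that calculation (assuming you have already parsed totalResults from the first response):
# Sketch: turning totalResults into a page count
import math
total_results = 534                        # e.g. int(data["totalResults"]) from the first search response
num_pages = math.ceil(total_results / 10)  # OMDb returns up to 10 search results per page
print(num_pages)                           # -> 54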
# YOUR CODE HERE
def search_all_movies(query, api_key=OMDB_API_KEY):
"""Search OMDb and return ALL matching movies across all pages."""
pass # Implement this
# Test
# all_batman = search_all_movies("Batman")
# print(f"Found {len(all_batman)} Batman movies")Part 5: Web Scraping with BeautifulSoup
When APIs don’t exist or don’t have what we need, we scrape.
5.1 HTML Basics
Question 5.1 (Solved): Parse HTML
# SOLVED EXAMPLE
from bs4 import BeautifulSoup
html = """
<html>
<body>
<div class="movie" id="movie-1">
<h2 class="title">Inception</h2>
<span class="year">2010</span>
<span class="rating">8.8</span>
<a href="/movies/inception">More Info</a>
</div>
<div class="movie" id="movie-2">
<h2 class="title">The Matrix</h2>
<span class="year">1999</span>
<span class="rating">8.7</span>
<a href="/movies/matrix">More Info</a>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
# Find all movie divs
movies = soup.find_all('div', class_='movie')
print(f"Found {len(movies)} movies\n")
# Extract data from each
for movie in movies:
title = movie.find('h2', class_='title').text
year = movie.find('span', class_='year').text
rating = movie.find('span', class_='rating').text
link = movie.find('a')['href']
print(f"{title} ({year}) - Rating: {rating} - Link: {link}")Question 5.2: CSS Selectors
Rewrite the above extraction using CSS selectors (.select() and .select_one()) instead of .find() and .find_all().
Hint:
- .movie selects elements with class “movie”
- .movie .title selects elements with class “title” inside class “movie”
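For the syntax itself, here is a tiny demonstration on the soup from Question 5.1; it uses an id selector rather than the class selectors the question asks for, purely to show the two method calls:
# select() returns a list of matches; select_one() returns the first match or None
first_link = soup.select_one('#movie-1 a')   # element with id="movie-1", then its <a>
print(first_link['href'])                     # -> /movies/inception
print(len(soup.select('div.movie')))          # -> 2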
# YOUR CODE HERE
# Use the same 'soup' from above
# Extract using CSS selectors
Question 5.3: Scrape a Real Website
Let’s scrape the example website http://quotes.toscrape.com/ which is designed for scraping practice.
Extract all quotes from the first page, including:
- The quote text
- The author name
- The tags
Return the results as a list of dictionaries.
# YOUR CODE HERE
import requests
from bs4 import BeautifulSoup
# Fetch the page
url = "http://quotes.toscrape.com/"
# Parse the HTML
# Extract quotes
# Print results
Question 5.4: Handle Pagination in Scraping
The quotes website has multiple pages. Scrape the first 3 pages and collect all quotes.
Pages follow the pattern:
- Page 1: http://quotes.toscrape.com/page/1/
- Page 2: http://quotes.toscrape.com/page/2/
- etc.
Remember: Add a delay between requests to be polite!
# YOUR CODE HERE
Question 5.5: Extract Table Data
Scrape the table from https://www.w3schools.com/html/html_tables.asp.
The table contains company data. Extract all rows and create a pandas DataFrame.
Hint: Look for <table>, <tr> (table row), <th> (header), and <td> (data cell) elements.
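The general shape of manual table extraction looks like this, shown on a tiny inline table rather than the real page (a sketch; the actual page's rows and values will differ):
# Sketch: walking <tr>/<th>/<td> elements in a toy table (toy HTML, not the real page)
from bs4 import BeautifulSoup
toy = "<table><tr><th>Company</th><th>Country</th></tr><tr><td>Acme</td><td>Sweden</td></tr></table>"
table = BeautifulSoup(toy, "html.parser").find("table")
headers = [th.text for th in table.find_all("th")]
rows = [[td.text for td in tr.find_all("td")] for tr in table.find_all("tr")[1:]]
print(headers, rows)   # -> ['Company', 'Country'] [['Acme', 'Sweden']]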
# YOUR CODE HERE
# Hint: pandas has a read_html() function that can do this automatically!
# But try doing it manually first to understand the process.
Part 6: Building the Movie Data Pipeline
Now let’s put everything together to build a complete data collection pipeline for our Netflix project.
6.1 The Complete Pipeline
Question 6.1 (Solved): Movie Data Collector Class
# SOLVED EXAMPLE
import requests
import pandas as pd
import time
from typing import List, Dict, Optional
class MovieDataCollector:
"""Collect movie data from OMDb API."""
def __init__(self, api_key: str):
self.api_key = api_key
        self.base_url = "https://www.omdbapi.com/"
self.delay = 0.5 # Seconds between requests
def fetch_movie(self, title: str, year: Optional[int] = None) -> Optional[Dict]:
"""Fetch a single movie by title."""
params = {
"apikey": self.api_key,
"t": title,
"type": "movie"
}
if year:
params["y"] = year
try:
response = requests.get(self.base_url, params=params, timeout=10)
response.raise_for_status()
data = response.json()
if data.get("Response") == "True":
return data
except Exception as e:
print(f"Error fetching {title}: {e}")
return None
def fetch_movies(self, titles: List[str]) -> List[Dict]:
"""Fetch multiple movies."""
movies = []
for i, title in enumerate(titles):
print(f"Fetching {i+1}/{len(titles)}: {title}")
movie = self.fetch_movie(title)
if movie:
movies.append(movie)
time.sleep(self.delay)
return movies
def to_dataframe(self, movies: List[Dict]) -> pd.DataFrame:
"""Convert movie data to cleaned DataFrame."""
if not movies:
return pd.DataFrame()
# Extract relevant fields
rows = []
for m in movies:
rows.append({
"title": m.get("Title"),
"year": m.get("Year"),
"genre": m.get("Genre"),
"director": m.get("Director"),
"actors": m.get("Actors"),
"imdb_rating": m.get("imdbRating"),
"imdb_votes": m.get("imdbVotes"),
"runtime": m.get("Runtime"),
"box_office": m.get("BoxOffice"),
"imdb_id": m.get("imdbID")
})
df = pd.DataFrame(rows)
# Clean data types
df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")
df["imdb_rating"] = pd.to_numeric(df["imdb_rating"], errors="coerce")
df["imdb_votes"] = df["imdb_votes"].str.replace(",", "").pipe(pd.to_numeric, errors="coerce").astype("Int64")
# Fix: str.extract returns a DataFrame, we need column 0 to get a Series
df["runtime_min"] = df["runtime"].str.extract(r"(\d+)")[0].pipe(pd.to_numeric, errors="coerce").astype("Int64")
return df
# Usage example
# collector = MovieDataCollector(OMDB_API_KEY)
# movies = collector.fetch_movies(["Inception", "The Matrix"])
# df = collector.to_dataframe(movies)
# print(df)
Question 6.2: Add Search Functionality
Extend the MovieDataCollector class to add a search_movies(query, max_results=50) method that:
1. Searches for movies matching the query
2. Handles pagination to get up to max_results movies
3. For each search result, fetches the full movie details
4. Returns the detailed movie data
Hint: Search results only contain basic info (title, year, poster, imdbID). You need to use the imdbID to fetch full details.
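OMDb supports lookup by IMDb ID via the i parameter; a minimal sketch of that call (tt1375666 is used here purely as an example ID):
# Sketch: fetching full details for one search result by its imdbID
import requests
details = requests.get(
    "https://www.omdbapi.com/",
    params={"apikey": OMDB_API_KEY, "i": "tt1375666"},  # example IMDb ID
    timeout=10
).json()
print(details.get("Title"), details.get("imdbRating"))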
# YOUR CODE HERE
# Extend the MovieDataCollector class or add a method
Question 6.3: Build a Genre-Based Dataset
Use your collector to build a dataset of popular movies from different genres:
- Search for 10 movies each for: “action”, “comedy”, “drama”, “horror”, “sci-fi”
- Combine all results into a single DataFrame
- Remove any duplicates (some movies might appear in multiple searches)
- Save to CSV
Note: This might take a while due to rate limiting. Start with fewer movies for testing.
# YOUR CODE HERE
Question 6.4: Data Quality Analysis
Using the dataset you created:
- How many movies have missing IMDB ratings?
- How many movies have missing box office data?
- What’s the distribution of ratings? (min, max, mean, median)
- Which directors appear most frequently?
- What’s the average runtime by genre?
These quality checks will be important for Week 2 (Data Validation)!
# YOUR CODE HERE
Part 7: Challenge Problems
These are optional advanced exercises for those who finish early.
Challenge 7.1: Rate Limit Handler
Create a RateLimiter class that:
1. Tracks how many requests have been made
2. Automatically adds delays to stay under a rate limit
3. Handles 429 (Too Many Requests) responses by waiting and retrying
limiter = RateLimiter(requests_per_minute=30)
response = limiter.get("https://api.example.com/data")
# YOUR CODE HERE
Challenge 7.2: Async Movie Collector
The synchronous approach is slow because we wait for each request to complete.
Create an async version using aiohttp that can fetch multiple movies concurrently (while still respecting rate limits).
Compare the time to fetch 20 movies with sync vs async approach.
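If you have never used aiohttp, the basic shape of a single async GET looks roughly like this (a sketch only; the concurrency, gathering, and rate limiting are the actual challenge):
# Sketch: one async GET with aiohttp (requires: pip install aiohttp)
import asyncio
import aiohttp
async def fetch_json(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.json()
# In a notebook you can usually just `await fetch_json(...)`;
# in a plain script, use asyncio.run(fetch_json(...)).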
# YOUR CODE HERE
# Hint: You'll need to install aiohttp: pip install aiohttp
# And use asyncio to run the async code
Challenge 7.3: Multi-Source Data Fusion
Create a data collection pipeline that:
1. Fetches basic movie data from OMDb
2. Enriches it with additional data from another source (e.g., Wikipedia API for plot summaries)
3. Merges the data based on movie title/year
4. Handles cases where data is missing from one source
Wikipedia API example:
https://en.wikipedia.org/api/rest_v1/page/summary/Inception_(film)
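A quick way to peek at what that endpoint returns; the plot summary is typically in the response's extract field, but treat the exact field names as something to verify yourself:
# Peek at the Wikipedia REST summary endpoint for one movie
import requests
wiki = requests.get(
    "https://en.wikipedia.org/api/rest_v1/page/summary/Inception_(film)",
    headers={"User-Agent": "CS203-Lab/1.0"},  # Wikipedia asks clients to send a descriptive User-Agent
    timeout=10
).json()
print(wiki.get("title"))
print(wiki.get("extract", "")[:200])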
# YOUR CODE HERE
Summary
In this lab, you learned:
- HTTP Fundamentals: URLs, status codes, headers
- curl: Command-line HTTP requests
- Python requests: Programmatic data collection
- Error handling: Timeouts, retries, status codes
- OMDb API: Real-world movie data
- BeautifulSoup: Web scraping when APIs don’t exist
- Data pipelines: Building reusable collection code
Next Week
Week 2: Data Validation & Quality
The data we collected today is messy! Next week we’ll learn:
- Schema validation with Pydantic
- Data type cleaning
- Handling missing values
- Quality metrics
Submission
Save your completed notebook and submit:
1. This notebook with all cells executed
2. The CSV file of movies you collected
3. A brief summary (1 paragraph) of what you learned