Week 1 Lab: Data Collection for Machine Learning

CS 203: Software Tools and Techniques for AI


Lab Overview

In this lab, you will learn to collect data from the web using:

  1. HTTP fundamentals - Understanding how the web works
  2. curl - Command-line HTTP client
  3. Python requests - Programmatic API calls
  4. BeautifulSoup - Web scraping when APIs don’t exist

Goal: Build a movie data collection pipeline for Netflix-style movie prediction.


Setup

First, let’s install and import the required libraries.

# Install required packages (uncomment if needed)
# !pip install requests beautifulsoup4 pandas
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import time

print("All imports successful!")

Part 1: HTTP Fundamentals

Before we start collecting data, we need to understand how the web works.

1.1 Understanding URLs

A URL (Uniform Resource Locator) has several components:

https://api.omdbapi.com:443/v1/movies?t=Inception&y=2010#details
└─┬──┘ └──────┬───────┘└┬─┘└───┬───┘└─────────┬────────┘└───┬───┘
  │           │         │      │              │             │
Protocol    Host      Port   Path          Query        Fragment

Question 1.1 (Solved): Parse a URL

Use Python’s urllib.parse to break down a URL into its components.

# SOLVED EXAMPLE
from urllib.parse import urlparse, parse_qs

url = "https://api.omdbapi.com/?apikey=demo&t=Inception&y=2010"

parsed = urlparse(url)

print(f"Scheme (protocol): {parsed.scheme}")
print(f"Host (domain): {parsed.netloc}")
print(f"Path: {parsed.path}")
print(f"Query string: {parsed.query}")

# Parse query parameters into a dictionary
params = parse_qs(parsed.query)
print(f"\nParsed parameters: {params}")

Question 1.2: Parse a Different URL

Parse the following GitHub API URL and extract:

  1. The host
  2. The path
  3. All query parameters as a dictionary

URL: https://api.github.com/search/repositories?q=machine+learning&sort=stars&order=desc

# YOUR CODE HERE
url = "https://api.github.com/search/repositories?q=machine+learning&sort=stars&order=desc"

# Parse the URL

# Print the host

# Print the path

# Print the query parameters as a dictionary

1.2 HTTP Status Codes

HTTP status codes tell you what happened with your request:

Range   Category        Common Examples
2xx     Success         200 OK, 201 Created
3xx     Redirect        301 Moved Permanently, 302 Found
4xx     Client Error    400 Bad Request, 401 Unauthorized, 404 Not Found
5xx     Server Error    500 Internal Server Error, 503 Service Unavailable
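
As a quick illustration (not one of the exercises), httpbin.org has a /status/<code> route that simply returns whatever status code you ask for, which makes it easy to see these categories in action:

# Quick demo: httpbin.org/status/<code> echoes back the requested status code
import requests

for code in [200, 404, 500]:
    r = requests.get(f"https://httpbin.org/status/{code}", timeout=10)
    print(f"Requested {code} -> got {r.status_code} (ok={r.ok})")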

Question 1.3: Match Status Codes

Match each scenario to the most likely HTTP status code:

  1. You requested a movie that doesn’t exist in the database
  2. You made too many requests and hit the rate limit
  3. Your API key is invalid
  4. The request was successful and data was returned
  5. The server crashed while processing your request

Status codes to choose from: 200, 401, 404, 429, 500

# YOUR ANSWERS HERE
answers = {
    "movie_not_found": None,      # Replace None with the status code
    "rate_limited": None,
    "invalid_api_key": None,
    "success": None,
    "server_crashed": None
}

print(answers)

Part 2: Making Requests with curl

curl is a command-line tool for making HTTP requests. It’s essential for quick testing.

2.1 Basic curl Commands

You can run shell commands in Jupyter using the ! prefix.
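
For example, the same prefix works for any shell command, not just curl:

# The ! prefix hands the rest of the line to the shell
!echo "Hello from the shell"
!python --version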

Question 2.1 (Solved): Your First API Call

Let’s call a simple public API that requires no authentication.

# SOLVED EXAMPLE
# JSONPlaceholder is a free fake API for testing
!curl -s "https://jsonplaceholder.typicode.com/posts/1"

Question 2.2: Pretty Print with jq

The output above is hard to read. Use jq to format it nicely.

Hint: Pipe the curl output to jq: curl ... | jq .

# YOUR CODE HERE
# Fetch the same post but format the output with jq

Question 2.3: Extract Specific Fields with jq

Fetch all posts from https://jsonplaceholder.typicode.com/posts and extract only the title field from each post.

Hint: Use jq '.[].title' to get the title from each element in the array.

# YOUR CODE HERE

Question 2.4: View Response Headers

Use the -I flag to fetch only the response headers (no body) from: https://api.github.com

What is the value of the X-RateLimit-Limit header?

# YOUR CODE HERE

Question 2.5: Add Custom Headers

Make a request to https://httpbin.org/headers with the following custom headers:

  - User-Agent: CS203-Lab/1.0
  - Accept: application/json

Hint: Use -H "Header-Name: value" for each header.

# YOUR CODE HERE

Part 3: Python requests Library

While curl is great for testing, we need Python for automation.

3.1 Basic GET Requests

Question 3.1 (Solved): Simple GET Request

Make a GET request and inspect the response object.

# SOLVED EXAMPLE
import requests

response = requests.get("https://jsonplaceholder.typicode.com/posts/1")

print(f"Status Code: {response.status_code}")
print(f"Content-Type: {response.headers['Content-Type']}")
print(f"Response OK: {response.ok}")
print(f"\nJSON Data:")
print(response.json())

Question 3.2: Fetch Multiple Posts

Fetch posts from https://jsonplaceholder.typicode.com/posts and:

  1. Print the total number of posts
  2. Print the titles of the first 5 posts

# YOUR CODE HERE

Question 3.3 (Solved): Using Query Parameters

The proper way to add query parameters is using the params argument.

# SOLVED EXAMPLE
import requests

# Bad way (manual string building)
# url = "https://jsonplaceholder.typicode.com/posts?userId=1"

# Good way (using params)
response = requests.get(
    "https://jsonplaceholder.typicode.com/posts",
    params={"userId": 1}
)

posts = response.json()
print(f"User 1 has {len(posts)} posts")
print(f"\nActual URL used: {response.url}")

Question 3.4: Filter Posts by User

Fetch all posts by user 5 and user 7. Compare how many posts each user has.

Hint: Make two separate requests with different userId values.

# YOUR CODE HERE

3.2 Working with Real APIs

Let’s work with some real-world APIs.

Question 3.5 (Solved): GitHub API - Public Repositories

The GitHub API is free to use (with rate limits) and doesn’t require authentication for public data.

# SOLVED EXAMPLE
import requests

# Fetch information about a popular repository
response = requests.get(
    "https://api.github.com/repos/pandas-dev/pandas",
    headers={"Accept": "application/vnd.github.v3+json"}
)

if response.ok:
    repo = response.json()
    print(f"Repository: {repo['full_name']}")
    print(f"Description: {repo['description']}")
    print(f"Stars: {repo['stargazers_count']:,}")
    print(f"Forks: {repo['forks_count']:,}")
    print(f"Language: {repo['language']}")
else:
    print(f"Error: {response.status_code}")

Question 3.7: Search GitHub Repositories

Use the GitHub search API to find the top 10 most starred repositories with “machine learning” in their description.

API endpoint: https://api.github.com/search/repositories

Parameters:

  - q: search query (e.g., “machine learning”)
  - sort: “stars”
  - order: “desc”
  - per_page: 10

Print the name and star count of each repository.

# YOUR CODE HERE

3.3 Error Handling

Real-world APIs fail. We need to handle errors gracefully.

Question 3.8 (Solved): Handling HTTP Errors

# SOLVED EXAMPLE
import requests

def fetch_with_error_handling(url):
    """Fetch URL with proper error handling."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raises exception for 4xx/5xx
        return response.json()
    except requests.exceptions.Timeout:
        print(f"Timeout: Request took too long")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP Error: {e.response.status_code}")
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    return None

# Test with valid URL
print("Valid URL:")
data = fetch_with_error_handling("https://jsonplaceholder.typicode.com/posts/1")
if data:
    print(f"  Got post: {data['title'][:50]}...")

# Test with invalid URL (404)
print("\nInvalid URL (404):")
fetch_with_error_handling("https://jsonplaceholder.typicode.com/posts/99999")

Question 3.9: Robust Fetcher Function

Write a function safe_fetch(url, max_retries=3) that:

  1. Attempts to fetch the URL
  2. If it fails with a 5xx error, retries up to max_retries times
  3. Waits 1 second between retries
  4. Returns the JSON data if successful, None otherwise

Test it with https://httpbin.org/status/500 (always returns 500) and https://jsonplaceholder.typicode.com/posts/1 (always works).

# YOUR CODE HERE
import time

def safe_fetch(url, max_retries=3):
    """Fetch URL with retry logic for server errors."""
    pass  # Implement this

# Test your function
# print("Testing with working URL:")
# result = safe_fetch("https://jsonplaceholder.typicode.com/posts/1")
# print(f"Result: {result}")

# print("\nTesting with failing URL (500):")
# result = safe_fetch("https://httpbin.org/status/500")
# print(f"Result: {result}")

Part 4: The OMDb Movie API

Now let’s work with the OMDb API - our main data source for the Netflix project.

Note: You need an API key from https://www.omdbapi.com/apikey.aspx (free tier available).

If you don’t have your own key yet, you can try the demo key below, but it is very limited.

# Set your API key here
# Get a free key from: https://www.omdbapi.com/apikey.aspx
OMDB_API_KEY = "YOUR_API_KEY_HERE"  # Replace with your actual key

# For demo purposes, you can try with key "demo" but it's very limited
# OMDB_API_KEY = "demo"

Question 4.1 (Solved): Fetch a Single Movie

# SOLVED EXAMPLE
import requests

def fetch_movie(title, year=None, api_key=OMDB_API_KEY):
    """Fetch movie data from OMDb API."""
    params = {
        "apikey": api_key,
        "t": title,  # Search by title
        "type": "movie"
    }
    if year:
        params["y"] = year
    
    response = requests.get("https://www.omdbapi.com/", params=params)
    
    if response.ok:
        data = response.json()
        if data.get("Response") == "True":
            return data
        else:
            print(f"Movie not found: {data.get('Error')}")
    return None

# Fetch Inception
movie = fetch_movie("Inception", 2010)
if movie:
    print(f"Title: {movie['Title']}")
    print(f"Year: {movie['Year']}")
    print(f"Director: {movie['Director']}")
    print(f"IMDB Rating: {movie['imdbRating']}")
    print(f"Genre: {movie['Genre']}")

Question 4.2: Explore the Response

Fetch data for “The Dark Knight” and print ALL available fields in the response.

Which fields might be useful for predicting movie success?

# YOUR CODE HERE

Question 4.3: Fetch Multiple Movies

Create a function fetch_movies(titles) that:

  1. Takes a list of movie titles
  2. Fetches data for each movie
  3. Returns a list of movie dictionaries (only successful fetches)
  4. Adds a 0.5 second delay between requests (to respect rate limits)

Test it with: ["Inception", "The Matrix", "Interstellar", "NonExistentMovie123"]

# YOUR CODE HERE
def fetch_movies(titles):
    """Fetch multiple movies from OMDb API."""
    pass  # Implement this

# Test
# test_titles = ["Inception", "The Matrix", "Interstellar", "NonExistentMovie123"]
# movies = fetch_movies(test_titles)
# print(f"Successfully fetched {len(movies)} out of {len(test_titles)} movies")

Question 4.4: Create a Movie DataFrame

Using the movies you fetched, create a pandas DataFrame with these columns:

  - title
  - year (as integer)
  - genre
  - director
  - imdb_rating (as float)
  - imdb_votes (as integer, remove commas)
  - runtime_minutes (as integer, extract from “148 min”)
  - box_office (keep as string for now)

Hint: You’ll need to clean the data types.

# YOUR CODE HERE

Question 4.5: Search Movies by Title

OMDb also has a search endpoint that returns multiple results.

Use the s parameter instead of t to search for movies containing “Star Wars”.

API endpoint: https://www.omdbapi.com/?apikey=YOUR_KEY&s=Star+Wars&type=movie

Print the title and year of each result.

# YOUR CODE HERE

Question 4.6: Handle Pagination

The OMDb search API returns 10 results per page and includes a totalResults field.

Write a function search_all_movies(query) that:

  1. Searches for movies matching the query
  2. Fetches ALL pages of results (use the page parameter)
  3. Returns a list of all movies found

Hint: totalResults tells you how many movies exist. Divide by 10 and round up to get the number of pages.

Test with a query that has many results like “Batman”.
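
If it helps, the page-count arithmetic looks like this (a sketch with a made-up totalResults value, not a full solution):

# Page-count arithmetic only; 437 is a hypothetical totalResults value
import math

total_results = 437
results_per_page = 10          # OMDb search returns 10 results per page
num_pages = math.ceil(total_results / results_per_page)
print(num_pages)               # 44 -> request page=1 .. page=44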

# YOUR CODE HERE
def search_all_movies(query, api_key=OMDB_API_KEY):
    """Search OMDb and return ALL matching movies across all pages."""
    pass  # Implement this

# Test
# all_batman = search_all_movies("Batman")
# print(f"Found {len(all_batman)} Batman movies")

Part 5: Web Scraping with BeautifulSoup

When APIs don’t exist or don’t have what we need, we scrape.

5.1 HTML Basics

Question 5.1 (Solved): Parse HTML

# SOLVED EXAMPLE
from bs4 import BeautifulSoup

html = """
<html>
<body>
    <div class="movie" id="movie-1">
        <h2 class="title">Inception</h2>
        <span class="year">2010</span>
        <span class="rating">8.8</span>
        <a href="/movies/inception">More Info</a>
    </div>
    <div class="movie" id="movie-2">
        <h2 class="title">The Matrix</h2>
        <span class="year">1999</span>
        <span class="rating">8.7</span>
        <a href="/movies/matrix">More Info</a>
    </div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find all movie divs
movies = soup.find_all('div', class_='movie')
print(f"Found {len(movies)} movies\n")

# Extract data from each
for movie in movies:
    title = movie.find('h2', class_='title').text
    year = movie.find('span', class_='year').text
    rating = movie.find('span', class_='rating').text
    link = movie.find('a')['href']
    
    print(f"{title} ({year}) - Rating: {rating} - Link: {link}")

Question 5.2: CSS Selectors

Rewrite the above extraction using CSS selectors (.select() and .select_one()) instead of .find() and .find_all().

Hint:

  - .movie selects elements with class “movie”
  - .movie .title selects elements with class “title” inside class “movie”

# YOUR CODE HERE
# Use the same 'soup' from above

# Extract using CSS selectors

Question 5.3: Scrape a Real Website

Let’s scrape the example website http://quotes.toscrape.com/ which is designed for scraping practice.

Extract all quotes from the first page, including:

  - The quote text
  - The author name
  - The tags

Return the results as a list of dictionaries.

# YOUR CODE HERE
import requests
from bs4 import BeautifulSoup

# Fetch the page
url = "http://quotes.toscrape.com/"

# Parse the HTML

# Extract quotes

# Print results

Question 5.4: Handle Pagination in Scraping

The quotes website has multiple pages. Scrape the first 3 pages and collect all quotes.

Pages follow the pattern:

  - Page 1: http://quotes.toscrape.com/page/1/
  - Page 2: http://quotes.toscrape.com/page/2/
  - etc.

Remember: Add a delay between requests to be polite!

# YOUR CODE HERE

Question 5.5: Extract Table Data

Scrape the table from https://www.w3schools.com/html/html_tables.asp.

The table contains company data. Extract all rows and create a pandas DataFrame.

Hint: Look for <table>, <tr> (table row), <th> (header), and <td> (data cell) elements.

# YOUR CODE HERE
# Hint: pandas has a read_html() function that can do this automatically!
# But try doing it manually first to understand the process.
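
Once your manual version works, you can cross-check it against pandas’ built-in parser. A minimal sketch (assumes lxml or html5lib is installed; the page may also require a browser-like User-Agent):

# Optional cross-check after the manual attempt
# tables = pd.read_html("https://www.w3schools.com/html/html_tables.asp")
# print(tables[0].head())   # read_html returns one DataFrame per <table> on the page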

Part 6: Building the Movie Data Pipeline

Now let’s put everything together to build a complete data collection pipeline for our Netflix project.

6.1 The Complete Pipeline

Question 6.1 (Solved): Movie Data Collector Class

# SOLVED EXAMPLE
import requests
import pandas as pd
import time
from typing import List, Dict, Optional

class MovieDataCollector:
    """Collect movie data from OMDb API."""
    
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "http://www.omdbapi.com/"
        self.delay = 0.5  # Seconds between requests
    
    def fetch_movie(self, title: str, year: Optional[int] = None) -> Optional[Dict]:
        """Fetch a single movie by title."""
        params = {
            "apikey": self.api_key,
            "t": title,
            "type": "movie"
        }
        if year:
            params["y"] = year
        
        try:
            response = requests.get(self.base_url, params=params, timeout=10)
            response.raise_for_status()
            data = response.json()
            
            if data.get("Response") == "True":
                return data
        except Exception as e:
            print(f"Error fetching {title}: {e}")
        
        return None
    
    def fetch_movies(self, titles: List[str]) -> List[Dict]:
        """Fetch multiple movies."""
        movies = []
        
        for i, title in enumerate(titles):
            print(f"Fetching {i+1}/{len(titles)}: {title}")
            movie = self.fetch_movie(title)
            
            if movie:
                movies.append(movie)
            
            time.sleep(self.delay)
        
        return movies
    
    def to_dataframe(self, movies: List[Dict]) -> pd.DataFrame:
        """Convert movie data to cleaned DataFrame."""
        if not movies:
            return pd.DataFrame()
        
        # Extract relevant fields
        rows = []
        for m in movies:
            rows.append({
                "title": m.get("Title"),
                "year": m.get("Year"),
                "genre": m.get("Genre"),
                "director": m.get("Director"),
                "actors": m.get("Actors"),
                "imdb_rating": m.get("imdbRating"),
                "imdb_votes": m.get("imdbVotes"),
                "runtime": m.get("Runtime"),
                "box_office": m.get("BoxOffice"),
                "imdb_id": m.get("imdbID")
            })
        
        df = pd.DataFrame(rows)
        
        # Clean data types
        df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")
        df["imdb_rating"] = pd.to_numeric(df["imdb_rating"], errors="coerce")
        df["imdb_votes"] = df["imdb_votes"].str.replace(",", "").pipe(pd.to_numeric, errors="coerce").astype("Int64")
        # str.extract returns a DataFrame; take column 0 to get a Series
        df["runtime_min"] = df["runtime"].str.extract(r"(\d+)")[0].pipe(pd.to_numeric, errors="coerce").astype("Int64")
        
        return df

# Usage example
# collector = MovieDataCollector(OMDB_API_KEY)
# movies = collector.fetch_movies(["Inception", "The Matrix"])
# df = collector.to_dataframe(movies)
# print(df)

Question 6.2: Add Search Functionality

Extend the MovieDataCollector class to add a search_movies(query, max_results=50) method that:

  1. Searches for movies matching the query
  2. Handles pagination to get up to max_results movies
  3. For each search result, fetches the full movie details
  4. Returns the detailed movie data

Hint: Search results only contain basic info (title, year, poster, imdbID). You need to use the imdbID to fetch full details.
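
For reference, a single detail lookup by IMDb ID uses the i parameter instead of t. A minimal sketch (tt1375666 is Inception’s IMDb ID, used purely as an example):

# Detail lookup by IMDb ID (the "i" parameter) -- uncomment once your API key is set
# response = requests.get(
#     "https://www.omdbapi.com/",
#     params={"apikey": OMDB_API_KEY, "i": "tt1375666"},
#     timeout=10
# )
# print(response.json().get("Title"))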

# YOUR CODE HERE
# Extend the MovieDataCollector class or add a method

Question 6.3: Build a Genre-Based Dataset

Use your collector to build a dataset of popular movies from different genres:

  1. Search for 10 movies each for: “action”, “comedy”, “drama”, “horror”, “sci-fi”
  2. Combine all results into a single DataFrame
  3. Remove any duplicates (some movies might appear in multiple searches)
  4. Save to CSV

Note: This might take a while due to rate limiting. Start with fewer movies for testing.

# YOUR CODE HERE

Question 6.4: Data Quality Analysis

Using the dataset you created:

  1. How many movies have missing IMDB ratings?
  2. How many movies have missing box office data?
  3. What’s the distribution of ratings? (min, max, mean, median)
  4. Which directors appear most frequently?
  5. What’s the average runtime by genre?

These quality checks will be important for Week 2 (Data Validation)!

# YOUR CODE HERE

Part 7: Challenge Problems

These are optional advanced exercises for those who finish early.

Challenge 7.1: Rate Limit Handler

Create a RateLimiter class that:

  1. Tracks how many requests have been made
  2. Automatically adds delays to stay under a rate limit
  3. Handles 429 (Too Many Requests) responses by waiting and retrying

limiter = RateLimiter(requests_per_minute=30)
response = limiter.get("https://api.example.com/data")
# YOUR CODE HERE

Challenge 7.2: Async Movie Collector

The synchronous approach is slow because we wait for each request to complete.

Create an async version using aiohttp that can fetch multiple movies concurrently (while still respecting rate limits).

Compare the time to fetch 20 movies with the sync approach vs the async approach.

# YOUR CODE HERE
# Hint: You'll need to install aiohttp: pip install aiohttp
# And use asyncio to run the async code

Challenge 7.3: Multi-Source Data Fusion

Create a data collection pipeline that:

  1. Fetches basic movie data from OMDb
  2. Enriches it with additional data from another source (e.g., Wikipedia API for plot summaries)
  3. Merges the data based on movie title/year
  4. Handles cases where data is missing from one source

Wikipedia API example:

https://en.wikipedia.org/api/rest_v1/page/summary/Inception_(film)
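
A minimal sketch of calling that endpoint (field names such as extract come from the REST v1 summary response and are worth double-checking against the docs):

# Fetch a short plot-style summary for one film from the Wikipedia REST API
import requests

wiki_url = "https://en.wikipedia.org/api/rest_v1/page/summary/Inception_(film)"
resp = requests.get(wiki_url, headers={"User-Agent": "CS203-Lab/1.0"}, timeout=10)
if resp.ok:
    summary = resp.json()
    print(summary.get("title"))
    print(summary.get("extract", "")[:200])  # first 200 characters of the summary
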
# YOUR CODE HERE

Summary

In this lab, you learned:

  1. HTTP Fundamentals: URLs, status codes, headers
  2. curl: Command-line HTTP requests
  3. Python requests: Programmatic data collection
  4. Error handling: Timeouts, retries, status codes
  5. OMDb API: Real-world movie data
  6. BeautifulSoup: Web scraping when APIs don’t exist
  7. Data pipelines: Building reusable collection code

Next Week

Week 2: Data Validation & Quality

The data we collected today is messy! Next week we’ll learn:

  - Schema validation with Pydantic
  - Data type cleaning
  - Handling missing values
  - Quality metrics


Submission

Save your completed notebook and submit:

  1. This notebook with all cells executed
  2. The CSV file of movies you collected
  3. A brief summary (1 paragraph) of what you learned