Week 2 Lab: Data Validation & Quality

CS 203: Software Tools and Techniques for AI


Lab Overview

In this lab, you will learn to validate and clean data using:

  1. Unix CLI tools - Quick data inspection
  2. jq - JSON processing
  3. pandas - Data profiling and cleaning
  4. Pydantic - Schema validation
  5. Data quality assertions - Automated quality testing

Goal: Transform messy movie data into clean, validated data ready for ML.


Setup

# Install required packages (run this first!)
!sudo apt-get install -y jq -qq  # JSON processor for CLI
!pip install -q pandas pydantic csvkit

# Verify installations
!echo "jq version:" && jq --version
!echo "csvkit version:" && csvlook --version 2>&1 | head -1
import pandas as pd
import numpy as np
import json
from typing import Optional, List
from pydantic import BaseModel, Field, field_validator, ValidationError

print("All imports successful!")

Part 1: Sample Messy Data

Let’s create some realistic messy movie data that mimics what you’d get from APIs.

# Create sample messy data (simulating what we collected in Week 1)
messy_movies = [
    {"Title": "Inception", "Year": "2010", "Runtime": "148 min", "imdbRating": "8.8", "BoxOffice": "$292,576,195", "Genre": "Action, Sci-Fi"},
    {"Title": "Avatar", "Year": "2009", "Runtime": "162 min", "imdbRating": "7.9", "BoxOffice": "$760,507,625", "Genre": "Action, Adventure"},
    {"Title": "The Room", "Year": "2003", "Runtime": "99 min", "imdbRating": "3.9", "BoxOffice": "N/A", "Genre": "Drama"},
    {"Title": "Inception", "Year": "2010", "Runtime": "148 min", "imdbRating": "8.8", "BoxOffice": "$292,576,195", "Genre": "Action, Sci-Fi"},  # Duplicate!
    {"Title": "Tenet", "Year": "N/A", "Runtime": "150 min", "imdbRating": "7.3", "BoxOffice": "N/A", "Genre": "Action, Sci-Fi"},  # Missing year
    {"Title": "The Matrix", "Year": "1999", "Runtime": "136 min", "imdbRating": "8.7", "BoxOffice": "$171,479,930", "Genre": "Action, Sci-Fi"},
    {"Title": "Interstellar", "Year": "2014", "Runtime": "", "imdbRating": "8.6", "BoxOffice": "$188,020,017", "Genre": "Adventure, Drama"},  # Empty runtime
    {"Title": "", "Year": "2020", "Runtime": "120 min", "imdbRating": "7.0", "BoxOffice": "$50,000,000", "Genre": "Comedy"},  # Missing title!
    {"Title": "Joker", "Year": "2019", "Runtime": "122 min", "imdbRating": "invalid", "BoxOffice": "$335,451,311", "Genre": "Crime, Drama"},  # Invalid rating
    {"Title": "Parasite", "Year": "2019", "Runtime": "132 min", "imdbRating": "8.5", "BoxOffice": "$53,369,749", "Genre": "Drama, Thriller"},
]

# Save as JSON for CLI exercises
with open('messy_movies.json', 'w') as f:
    json.dump(messy_movies, f, indent=2)

# Create DataFrame
df_messy = pd.DataFrame(messy_movies)
df_messy.to_csv('messy_movies.csv', index=False)

print("Created messy_movies.json and messy_movies.csv")
print(f"\nDataFrame shape: {df_messy.shape}")
df_messy

Part 2: CLI Data Inspection

Before writing code, always inspect your data with CLI tools.

Question 2.1 (Solved): Basic File Inspection

# SOLVED EXAMPLE
# Check file sizes and line counts
!echo "=== File sizes ==="
!ls -lh messy_movies.json messy_movies.csv

!echo "\n=== Line counts ==="
!wc -l messy_movies.json messy_movies.csv

!echo "\n=== First 5 lines of CSV ==="
!head -5 messy_movies.csv

Question 2.2: Inspect JSON with jq

Use jq to:

  1. Pretty-print the JSON file
  2. Get the length (number of movies)
  3. Extract just the titles

# YOUR CODE HERE
# Pretty print

# Get length

# Extract titles
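
One possible set of commands (a sketch; jq supports several equivalent filters for each task):

# Pretty-print the JSON file
!jq '.' messy_movies.json

# Number of movies
!jq 'length' messy_movies.json

# Just the titles
!jq '.[].Title' messy_movies.json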

Question 2.3: Find Data Issues with jq

Use jq to find:

  1. Movies where Year is “N/A”
  2. Movies where BoxOffice is “N/A”
  3. Movies where Title is empty

Hint: Use jq's select() function.

# YOUR CODE HERE
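
A sketch of how select() can express each filter; other formulations work too:

# Movies where Year is "N/A"
!jq '.[] | select(.Year == "N/A")' messy_movies.json

# Titles of movies where BoxOffice is "N/A"
!jq '.[] | select(.BoxOffice == "N/A") | .Title' messy_movies.json

# Movies where Title is empty
!jq '.[] | select(.Title == "")' messy_movies.json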

Part 3: Data Profiling with Pandas

Now let’s do systematic data profiling.

Question 3.1 (Solved): Basic Data Profiling

# SOLVED EXAMPLE
print("=== Data Types ===")
print(df_messy.dtypes)

print("\n=== Missing Values ===")
print(df_messy.isnull().sum())

print("\n=== Unique Values per Column ===")
print(df_messy.nunique())

print("\n=== Sample Values ===")
for col in df_messy.columns:
    print(f"{col}: {df_messy[col].unique()[:5]}")

Question 3.2: Identify All Data Quality Issues

Write a function profile_data(df) that returns a dictionary summarizing:

  1. Total rows
  2. Duplicate rows
  3. Missing values per column (including “N/A” strings)
  4. Empty strings per column
  5. Data type issues (strings that should be numbers)

# YOUR CODE HERE
def profile_data(df):
    """Generate a data quality profile."""
    profile = {}
    
    # TODO: Implement
    
    return profile

# Test
# profile = profile_data(df_messy)
# print(json.dumps(profile, indent=2))
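
If you get stuck, here is one way the profile could be assembled. This is a sketch, not the reference solution; the dictionary keys and the numeric-detection heuristic are our own choices:

def profile_data_sketch(df):
    """Illustrative data quality profile."""
    return {
        "total_rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        # Count real NaNs plus 'N/A' placeholder strings
        "missing_per_column": {col: int(df[col].isna().sum() + (df[col] == 'N/A').sum())
                               for col in df.columns},
        "empty_strings_per_column": {col: int((df[col] == '').sum())
                                     for col in df.columns},
        # Heuristic: object columns where at least one value parses as a number
        "type_issues": [col for col in df.columns
                        if df[col].dtype == object
                        and pd.to_numeric(df[col], errors='coerce').notna().any()],
    }

print(json.dumps(profile_data_sketch(df_messy), indent=2))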

Question 3.3: Find Duplicates

Find all duplicate rows in the dataset. How many duplicates are there? Which movies are duplicated?

# YOUR CODE HERE
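
A minimal sketch using pandas' duplicated(); keep=False marks every copy of a duplicated row, not just the later ones:

dupes = df_messy[df_messy.duplicated(keep=False)]
print(f"Duplicate rows (beyond the first copy): {df_messy.duplicated().sum()}")
print(dupes[['Title', 'Year']])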

Part 4: Data Cleaning

Now let’s clean the data systematically.

Question 4.1 (Solved): Clean Runtime Column

# SOLVED EXAMPLE
import re

def clean_runtime(runtime_str):
    """Convert '148 min' to integer 148."""
    if pd.isna(runtime_str) or runtime_str == '' or runtime_str == 'N/A':
        return None
    # Extract the first run of digits (e.g., 148 from '148 min')
    match = re.search(r'(\d+)', str(runtime_str))
    if match:
        return int(match.group(1))
    return None

# Test
print(clean_runtime('148 min'))  # 148
print(clean_runtime('N/A'))      # None
print(clean_runtime(''))         # None

Question 4.2: Clean BoxOffice Column

Write a function clean_box_office(value) that converts:

  - "$292,576,195" → 292576195 (integer)
  - "N/A" → None
  - "" → None

# YOUR CODE HERE
def clean_box_office(value):
    """Convert '$292,576,195' to integer."""
    pass

# Test
# print(clean_box_office('$292,576,195'))  # 292576195
# print(clean_box_office('N/A'))           # None
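
One possible implementation (a sketch; stripping characters is just one of several ways to parse the value):

def clean_box_office_sketch(value):
    """Convert '$292,576,195' to 292576195; 'N/A' and '' become None."""
    if pd.isna(value) or value in ('', 'N/A'):
        return None
    digits = str(value).replace('$', '').replace(',', '')
    return int(digits) if digits.isdigit() else None

print(clean_box_office_sketch('$292,576,195'))  # 292576195
print(clean_box_office_sketch('N/A'))           # None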

Question 4.3: Clean Year Column

Write a function clean_year(value) that:

  - Converts valid year strings to integers
  - Returns None for “N/A” or invalid values
  - Validates that the year is between 1888 (the first film) and the current year + 2

# YOUR CODE HERE
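
A sketch under the stated range rule, taking the current year from the standard library:

from datetime import date

def clean_year_sketch(value):
    """Convert a year string to int; None for 'N/A' or out-of-range values."""
    if pd.isna(value) or value in ('', 'N/A'):
        return None
    try:
        year = int(value)
    except (TypeError, ValueError):
        return None
    # 1888 (first film) through two years in the future
    return year if 1888 <= year <= date.today().year + 2 else None

print(clean_year_sketch('2010'))  # 2010
print(clean_year_sketch('N/A'))   # None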

Question 4.4: Clean Rating Column

Write a function clean_rating(value) that:

  - Converts valid rating strings to floats
  - Returns None for invalid values
  - Validates that the rating is between 0.0 and 10.0

# YOUR CODE HERE
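
A sketch following the same pattern as the year cleaner:

def clean_rating_sketch(value):
    """Convert a rating string to float; None for invalid or out-of-range values."""
    if pd.isna(value) or value in ('', 'N/A'):
        return None
    try:
        rating = float(value)
    except (TypeError, ValueError):
        return None
    return rating if 0.0 <= rating <= 10.0 else None

print(clean_rating_sketch('8.8'))      # 8.8
print(clean_rating_sketch('invalid'))  # None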

Question 4.5: Complete Cleaning Pipeline

Create a function clean_movie_data(df) that:

  1. Removes duplicates
  2. Removes rows with empty titles
  3. Cleans all columns using the functions above
  4. Returns a clean DataFrame with proper data types

# YOUR CODE HERE
def clean_movie_data(df):
    """Complete cleaning pipeline."""
    pass

# Test
# df_clean = clean_movie_data(df_messy)
# print(f"Before: {len(df_messy)} rows")
# print(f"After: {len(df_clean)} rows")
# df_clean
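
One way the pipeline could chain the steps. This sketch assumes the _sketch helpers above; substitute your own function names:

def clean_movie_data_sketch(df):
    """Illustrative cleaning pipeline using the helpers sketched above."""
    out = df.drop_duplicates().copy()
    out = out[out['Title'].str.strip() != '']  # drop rows with empty titles
    out['Runtime'] = out['Runtime'].apply(clean_runtime)
    out['Year'] = out['Year'].apply(clean_year_sketch)
    out['imdbRating'] = out['imdbRating'].apply(clean_rating_sketch)
    out['BoxOffice'] = out['BoxOffice'].apply(clean_box_office_sketch)
    # Nullable Int64 keeps integer columns that contain missing values
    out = out.astype({'Year': 'Int64', 'Runtime': 'Int64', 'BoxOffice': 'Int64'})
    return out.reset_index(drop=True)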

Part 5: Schema Validation with Pydantic

Pydantic provides type validation and data parsing.

Question 5.1 (Solved): Define a Movie Schema

# SOLVED EXAMPLE
from pydantic import BaseModel, Field, field_validator
from typing import Optional
import re

class Movie(BaseModel):
    """Validated movie schema."""
    title: str = Field(..., min_length=1, description="Movie title")
    year: int = Field(..., ge=1888, le=2030, description="Release year")
    runtime_minutes: Optional[int] = Field(None, ge=1, le=1000)
    imdb_rating: Optional[float] = Field(None, ge=0, le=10)
    box_office: Optional[int] = Field(None, ge=0)
    genre: str = Field(..., min_length=1)
    
    @field_validator('title')
    @classmethod
    def title_not_empty(cls, v):
        if not v or not v.strip():
            raise ValueError('Title cannot be empty')
        return v.strip()

# Test with valid data
movie = Movie(
    title="Inception",
    year=2010,
    runtime_minutes=148,
    imdb_rating=8.8,
    box_office=292576195,
    genre="Action, Sci-Fi"
)
print("Valid movie:")
print(movie.model_dump())

Question 5.2: Test Validation Errors

Try creating Movie objects with invalid data and observe the validation errors:

  1. Empty title
  2. Year before 1888
  3. Rating above 10
  4. Negative box office

# YOUR CODE HERE
# Try each invalid case and catch ValidationError
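
A sketch of the first case; the other three follow the same try/except pattern:

from pydantic import ValidationError

try:
    Movie(title="", year=2010, genre="Drama")
except ValidationError as e:
    print(e)  # reports that the title fails the min_length constraint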

Question 5.3: Validate and Convert Raw Data

Write a function validate_movies(raw_data) that:

  1. Takes a list of raw movie dictionaries (from JSON)
  2. Attempts to clean and validate each movie
  3. Returns two lists: valid_movies and invalid_movies (with error messages)

# YOUR CODE HERE
def validate_movies(raw_data):
    """Validate a list of raw movie dictionaries."""
    valid_movies = []
    invalid_movies = []
    
    # TODO: Implement
    
    return valid_movies, invalid_movies

# Test
# valid, invalid = validate_movies(messy_movies)
# print(f"Valid: {len(valid)}, Invalid: {len(invalid)}")

Part 6: Data Quality Assertions

Write assertions that should pass for clean data.

Question 6.1 (Solved): Data Quality Checks

# SOLVED EXAMPLE
def check_data_quality(df):
    """Run data quality assertions."""
    checks = []
    
    # Check 1: No duplicate rows
    duplicates = df.duplicated().sum()
    checks.append(("No duplicates", duplicates == 0, f"Found {duplicates} duplicates"))
    
    # Check 2: No empty titles
    empty_titles = (df['Title'] == '').sum() + df['Title'].isna().sum()
    checks.append(("No empty titles", empty_titles == 0, f"Found {empty_titles} empty titles"))
    
    # Check 3: Year in valid range (coerce to numeric first, since the raw
    # column holds strings like '2010' and 'N/A'; comparing strings to ints
    # would raise a TypeError)
    if 'Year' in df.columns:
        years = pd.to_numeric(df['Year'], errors='coerce')
        invalid_years = ((years < 1888) | (years > 2030)).sum()
        checks.append(("Valid years", invalid_years == 0, f"Found {invalid_years} invalid years"))
    
    # Print results
    print("Data Quality Checks:")
    print("-" * 50)
    for name, passed, message in checks:
        status = "✓" if passed else "✗"
        print(f"{status} {name}: {message if not passed else 'OK'}")
    
    return all(passed for _, passed, _ in checks)

# Test on messy data (should fail)
print("Checking messy data:")
check_data_quality(df_messy)

Question 6.2: Add More Quality Checks

Extend the check_data_quality function to include:

  1. Rating values between 0 and 10
  2. No negative box office values
  3. Runtime between 1 and 1000 minutes
  4. At least 90% of rows have non-null ratings

# YOUR CODE HERE
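
A sketch of the fourth check to get you started; the others follow the same (name, passed, message) tuple pattern used in check_data_quality:

ratings = pd.to_numeric(df_messy['imdbRating'], errors='coerce')
non_null_share = ratings.notna().mean()
checks_entry = ("90% ratings present",
                non_null_share >= 0.9,
                f"Only {non_null_share:.0%} of rows have valid ratings")
print(checks_entry)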

Part 7: Complete Pipeline

Put everything together into a complete validation pipeline.

Question 7.1: Build the Complete Pipeline

Create a DataValidationPipeline class that:

  1. Loads data from JSON or CSV
  2. Profiles the data
  3. Cleans the data
  4. Validates with Pydantic
  5. Runs quality checks
  6. Exports clean data

# YOUR CODE HERE
class DataValidationPipeline:
    """Complete data validation pipeline."""
    
    def __init__(self, input_file):
        self.input_file = input_file
        self.raw_data = None
        self.clean_data = None
    
    def load(self):
        """Load data from file."""
        pass
    
    def profile(self):
        """Generate data profile."""
        pass
    
    def clean(self):
        """Clean the data."""
        pass
    
    def validate(self):
        """Validate with Pydantic."""
        pass
    
    def check_quality(self):
        """Run quality checks."""
        pass
    
    def export(self, output_file):
        """Export clean data."""
        pass
    
    def run(self, output_file):
        """Run complete pipeline."""
        self.load()
        print("\n1. Data Profile:")
        self.profile()
        print("\n2. Cleaning...")
        self.clean()
        print("\n3. Validation...")
        self.validate()
        print("\n4. Quality Checks:")
        self.check_quality()
        print("\n5. Exporting...")
        self.export(output_file)

# Test
# pipeline = DataValidationPipeline('messy_movies.json')
# pipeline.run('clean_movies.csv')
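
If the file-handling steps are unfamiliar, here are standalone sketches of what load() and export() might do; the function names here are illustrative, not part of the class:

def load_movies(input_file):
    """Load a JSON or CSV file into a DataFrame (sketch for the load() step)."""
    if input_file.endswith('.json'):
        with open(input_file) as f:
            return pd.DataFrame(json.load(f))
    return pd.read_csv(input_file)

def export_movies(df, output_file):
    """Write the cleaned DataFrame to CSV (sketch for the export() step)."""
    df.to_csv(output_file, index=False)
    print(f"Wrote {len(df)} rows to {output_file}")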

Part 8: Challenge Problems

Challenge 8.1: Fuzzy Duplicate Detection

Sometimes duplicates have slight variations (e.g., “The Matrix” vs “Matrix, The”).

Write a function that finds potential duplicates using fuzzy string matching.

Hint: Use the fuzzywuzzy or rapidfuzz library.

# YOUR CODE HERE
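
A sketch using rapidfuzz (install it with pip install rapidfuzz first; the threshold of 90 is an arbitrary starting point):

from itertools import combinations
from rapidfuzz import fuzz, utils

def find_fuzzy_duplicates(titles, threshold=90):
    """Return title pairs whose similarity score meets the threshold."""
    pairs = []
    for a, b in combinations(titles, 2):
        # token_sort_ratio ignores word order; default_process lowercases
        # and strips punctuation, so "Matrix, The" matches "The Matrix"
        score = fuzz.token_sort_ratio(a, b, processor=utils.default_process)
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs

print(find_fuzzy_duplicates(["The Matrix", "Matrix, The", "Inception"]))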

Challenge 8.2: Automatic Type Inference

Write a function that automatically infers the correct data type for each column by analyzing the values.

# YOUR CODE HERE
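
One heuristic sketch: treat 'N/A' and empty strings as missing, then check whether the remaining values parse as numbers. The 80% threshold is our own choice:

def infer_column_type(series):
    """Guess a column's intended type from its values (heuristic)."""
    cleaned = series.replace({'N/A': None, '': None}).dropna()
    if len(cleaned) == 0:
        return 'str'
    numeric = pd.to_numeric(cleaned, errors='coerce')
    if numeric.notna().mean() >= 0.8:  # mostly parseable as numbers
        valid = numeric.dropna()
        return 'int' if (valid % 1 == 0).all() else 'float'
    return 'str'

for col in df_messy.columns:
    print(col, '->', infer_column_type(df_messy[col]))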

Summary

In this lab, you learned:

  1. CLI Inspection: Using jq and Unix tools for quick data exploration
  2. Data Profiling: Systematic analysis of data quality issues
  3. Data Cleaning: Converting messy strings to proper types
  4. Schema Validation: Using Pydantic for type safety
  5. Quality Checks: Automated assertions for data quality

Next Week

Week 3: Data Labeling & Annotation

  - Setting up Label Studio
  - Annotation workflows
  - Measuring inter-annotator agreement