# Install required packages (run this first!)
!sudo apt-get install -y -qq jq  # JSON processor for CLI
!pip install -q pandas pydantic csvkit

# Verify installations
!echo "jq version:" && jq --version
!echo "csvkit version:" && csvlook --version 2>&1 | head -1

Week 2 Lab: Data Validation & Quality
CS 203: Software Tools and Techniques for AI
Lab Overview
In this lab, you will learn to validate and clean data using:

1. Unix CLI tools - Quick data inspection
2. jq - JSON processing
3. pandas - Data profiling and cleaning
4. Pydantic - Schema validation
5. Assertion-based quality checks - Automated data quality testing
Goal: Transform messy movie data into clean, validated data ready for ML.
Setup
import pandas as pd
import numpy as np
import json
from typing import Optional, List
from pydantic import BaseModel, Field, field_validator, ValidationError
print("All imports successful!")Part 1: Sample Messy Data
Let’s create some realistic messy movie data that mimics what you’d get from APIs.
# Create sample messy data (simulating what we collected in Week 1)
messy_movies = [
{"Title": "Inception", "Year": "2010", "Runtime": "148 min", "imdbRating": "8.8", "BoxOffice": "$292,576,195", "Genre": "Action, Sci-Fi"},
{"Title": "Avatar", "Year": "2009", "Runtime": "162 min", "imdbRating": "7.9", "BoxOffice": "$760,507,625", "Genre": "Action, Adventure"},
{"Title": "The Room", "Year": "2003", "Runtime": "99 min", "imdbRating": "3.9", "BoxOffice": "N/A", "Genre": "Drama"},
{"Title": "Inception", "Year": "2010", "Runtime": "148 min", "imdbRating": "8.8", "BoxOffice": "$292,576,195", "Genre": "Action, Sci-Fi"}, # Duplicate!
{"Title": "Tenet", "Year": "N/A", "Runtime": "150 min", "imdbRating": "7.3", "BoxOffice": "N/A", "Genre": "Action, Sci-Fi"}, # Missing year
{"Title": "The Matrix", "Year": "1999", "Runtime": "136 min", "imdbRating": "8.7", "BoxOffice": "$171,479,930", "Genre": "Action, Sci-Fi"},
{"Title": "Interstellar", "Year": "2014", "Runtime": "", "imdbRating": "8.6", "BoxOffice": "$188,020,017", "Genre": "Adventure, Drama"}, # Empty runtime
{"Title": "", "Year": "2020", "Runtime": "120 min", "imdbRating": "7.0", "BoxOffice": "$50,000,000", "Genre": "Comedy"}, # Missing title!
{"Title": "Joker", "Year": "2019", "Runtime": "122 min", "imdbRating": "invalid", "BoxOffice": "$335,451,311", "Genre": "Crime, Drama"}, # Invalid rating
{"Title": "Parasite", "Year": "2019", "Runtime": "132 min", "imdbRating": "8.5", "BoxOffice": "$53,369,749", "Genre": "Drama, Thriller"},
]
# Save as JSON for CLI exercises
with open('messy_movies.json', 'w') as f:
    json.dump(messy_movies, f, indent=2)
# Create DataFrame
df_messy = pd.DataFrame(messy_movies)
df_messy.to_csv('messy_movies.csv', index=False)
print("Created messy_movies.json and messy_movies.csv")
print(f"\nDataFrame shape: {df_messy.shape}")
df_messy

Part 2: CLI Data Inspection
Before writing code, always inspect your data with CLI tools.
Question 2.1 (Solved): Basic File Inspection
# SOLVED EXAMPLE
# Check file sizes and line counts
!echo "=== File sizes ==="
!ls -lh messy_movies.json messy_movies.csv
!echo "\n=== Line counts ==="
!wc -l messy_movies.json messy_movies.csv
!echo "\n=== First 5 lines of CSV ==="
!head -5 messy_movies.csv

Question 2.2: Inspect JSON with jq
Use jq to:

1. Pretty-print the JSON file
2. Get the length (number of movies)
3. Extract just the titles
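If you want a reference point, these are typical jq invocations for each task (a sketch; adapt as needed):

# 1. Pretty-print ('.' is the identity filter; jq re-indents the output)
!jq '.' messy_movies.json

# 2. Number of movies (length of the top-level array)
!jq 'length' messy_movies.json

# 3. Just the titles ('.[]' iterates over the array elements)
!jq '.[].Title' messy_movies.json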
# YOUR CODE HERE
# Pretty print
# Get length
# Extract titles

Question 2.3: Find Data Issues with jq
Use jq to find:

1. Movies where Year is “N/A”
2. Movies where BoxOffice is “N/A”
3. Movies where Title is empty
Hint: Use the select() function in jq.
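For example, select() keeps only the elements that match a predicate (a sketch):

# Movies with no recorded year
!jq '.[] | select(.Year == "N/A")' messy_movies.json

# Movies with no box office figure
!jq '.[] | select(.BoxOffice == "N/A")' messy_movies.json

# Movies with an empty title
!jq '.[] | select(.Title == "")' messy_movies.json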
# YOUR CODE HERE

Part 3: Data Profiling with Pandas
Now let’s do systematic data profiling.
Question 3.1 (Solved): Basic Data Profiling
# SOLVED EXAMPLE
print("=== Data Types ===")
print(df_messy.dtypes)
print("\n=== Missing Values ===")
print(df_messy.isnull().sum())
print("\n=== Unique Values per Column ===")
print(df_messy.nunique())
print("\n=== Sample Values ===")
for col in df_messy.columns:
    print(f"{col}: {df_messy[col].unique()[:5]}")

Question 3.2: Identify All Data Quality Issues
Write a function profile_data(df) that returns a dictionary summarizing:

1. Total rows
2. Duplicate rows
3. Missing values per column (including “N/A” strings)
4. Empty strings per column
5. Data type issues (strings that should be numbers)
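Keep in mind that isnull() will not flag the literal strings "N/A" or "", so those need direct comparisons. For example (a sketch):

# Count literal "N/A" markers per column (invisible to isnull())
print((df_messy == 'N/A').sum())

# Count empty strings per column
print((df_messy == '').sum())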
# YOUR CODE HERE
def profile_data(df):
    """Generate a data quality profile."""
    profile = {}
    # TODO: Implement
    return profile
# Test
# profile = profile_data(df_messy)
# print(json.dumps(profile, indent=2))

Question 3.3: Find Duplicates
Find all duplicate rows in the dataset. How many duplicates are there? Which movies are duplicated?
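One possible approach (a sketch) is pandas' duplicated():

# keep=False marks every copy of a duplicated row, not just the later ones
dupes = df_messy[df_messy.duplicated(keep=False)]
print(f"Duplicate rows beyond the first occurrence: {df_messy.duplicated().sum()}")
print(f"Duplicated titles: {dupes['Title'].unique()}")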
# YOUR CODE HERE

Part 4: Data Cleaning
Now let’s clean the data systematically.
Question 4.1 (Solved): Clean Runtime Column
# SOLVED EXAMPLE
import re

def clean_runtime(runtime_str):
    """Convert '148 min' to integer 148."""
    if pd.isna(runtime_str) or runtime_str in ('', 'N/A'):
        return None
    # Extract the first run of digits, e.g. '148' from '148 min'
    match = re.search(r'(\d+)', str(runtime_str))
    if match:
        return int(match.group(1))
    return None
# Test
print(clean_runtime('148 min')) # 148
print(clean_runtime('N/A')) # None
print(clean_runtime('')) # None

Question 4.2: Clean BoxOffice Column
Write a function clean_box_office(value) that converts:

- "$292,576,195" → 292576195 (integer)
- "N/A" → None
- "" → None
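One possible implementation (a sketch with a placeholder name; it assumes values look like the examples above):

def clean_box_office_sketch(value):
    """Sketch: convert '$292,576,195' to the integer 292576195."""
    if pd.isna(value) or value in ('', 'N/A'):
        return None
    # Strip the currency symbol and thousands separators, then convert
    digits = str(value).replace('$', '').replace(',', '')
    return int(digits) if digits.isdigit() else None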
# YOUR CODE HERE
def clean_box_office(value):
    """Convert '$292,576,195' to integer."""
    pass
# Test
# print(clean_box_office('$292,576,195')) # 292576195
# print(clean_box_office('N/A')) # None

Question 4.3: Clean Year Column
Write a function clean_year(value) that:

- Converts valid year strings to integers
- Returns None for “N/A” or invalid values
- Validates that the year is between 1888 (first film) and the current year + 2
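A sketch of one approach (the name is a placeholder; anything unparseable or out of range becomes None):

import datetime

def clean_year_sketch(value):
    """Sketch: parse a year string, returning None when invalid."""
    try:
        year = int(value)
    except (TypeError, ValueError):
        return None
    # The first film dates to 1888; allow announced titles up to 2 years out
    current_year = datetime.date.today().year
    return year if 1888 <= year <= current_year + 2 else None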
# YOUR CODE HERE

Question 4.4: Clean Rating Column
Write a function clean_rating(value) that:

- Converts valid rating strings to floats
- Returns None for invalid values
- Validates that the rating is between 0.0 and 10.0
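The same try/except pattern works here (a sketch with a placeholder name):

def clean_rating_sketch(value):
    """Sketch: parse a rating string, returning None when invalid."""
    try:
        rating = float(value)
    except (TypeError, ValueError):
        return None
    return rating if 0.0 <= rating <= 10.0 else None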
# YOUR CODE HERE

Question 4.5: Complete Cleaning Pipeline
Create a function clean_movie_data(df) that:

1. Removes duplicates
2. Removes rows with empty titles
3. Cleans all columns using the functions above
4. Returns a clean DataFrame with proper data types
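The skeleton of one approach (a sketch; it assumes the cleaning helpers from 4.1-4.4 are defined):

def clean_movie_data_sketch(df):
    """Sketch: dedupe, drop empty titles, then clean column by column."""
    out = df.drop_duplicates().copy()
    out = out[out['Title'].str.strip() != '']
    out['Runtime'] = out['Runtime'].apply(clean_runtime)
    # Year, imdbRating, and BoxOffice would go through your
    # clean_year / clean_rating / clean_box_office functions the same way.
    return out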
# YOUR CODE HERE
def clean_movie_data(df):
    """Complete cleaning pipeline."""
    pass
# Test
# df_clean = clean_movie_data(df_messy)
# print(f"Before: {len(df_messy)} rows")
# print(f"After: {len(df_clean)} rows")
# df_clean

Part 5: Schema Validation with Pydantic
Pydantic provides type validation and data parsing.
Question 5.1 (Solved): Define a Movie Schema
# SOLVED EXAMPLE
from pydantic import BaseModel, Field, field_validator
from typing import Optional
import re
class Movie(BaseModel):
    """Validated movie schema."""
    title: str = Field(..., min_length=1, description="Movie title")
    year: int = Field(..., ge=1888, le=2030, description="Release year")
    runtime_minutes: Optional[int] = Field(None, ge=1, le=1000)
    imdb_rating: Optional[float] = Field(None, ge=0, le=10)
    box_office: Optional[int] = Field(None, ge=0)
    genre: str = Field(..., min_length=1)

    @field_validator('title')
    @classmethod
    def title_not_empty(cls, v):
        if not v or not v.strip():
            raise ValueError('Title cannot be empty')
        return v.strip()
# Test with valid data
movie = Movie(
    title="Inception",
    year=2010,
    runtime_minutes=148,
    imdb_rating=8.8,
    box_office=292576195,
    genre="Action, Sci-Fi",
)
print("Valid movie:")
print(movie.model_dump())

Question 5.2: Test Validation Errors
Try creating Movie objects with invalid data and observe the validation errors:

1. Empty title
2. Year before 1888
3. Rating above 10
4. Negative box office
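Each case follows the same try/except pattern. For example, the empty-title case (a sketch):

try:
    Movie(title="", year=2010, genre="Drama")
except ValidationError as e:
    print(e)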
# YOUR CODE HERE
# Try each invalid case and catch ValidationError

Question 5.3: Validate and Convert Raw Data
Write a function validate_movies(raw_data) that:

1. Takes a list of raw movie dictionaries (from JSON)
2. Attempts to clean and validate each movie
3. Returns two lists: valid_movies and invalid_movies (with error messages)
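The core of the loop might look like this (a sketch; clean_year_sketch is the hypothetical helper from 4.3, and raw keys like "Title" come from the messy data):

for raw in raw_data:
    try:
        movie = Movie(
            title=raw.get('Title', ''),
            year=clean_year_sketch(raw.get('Year')),
            genre=raw.get('Genre', ''),
        )
        valid_movies.append(movie)
    except ValidationError as e:
        invalid_movies.append({'raw': raw, 'error': str(e)})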
# YOUR CODE HERE
def validate_movies(raw_data):
    """Validate a list of raw movie dictionaries."""
    valid_movies = []
    invalid_movies = []
    # TODO: Implement
    return valid_movies, invalid_movies
# Test
# valid, invalid = validate_movies(messy_movies)
# print(f"Valid: {len(valid)}, Invalid: {len(invalid)}")Part 6: Data Quality Assertions
Write assertions that should pass for clean data.
Question 6.1 (Solved): Data Quality Checks
# SOLVED EXAMPLE
def check_data_quality(df):
    """Run data quality assertions."""
    checks = []

    # Check 1: No duplicate rows
    duplicates = df.duplicated().sum()
    checks.append(("No duplicates", duplicates == 0, f"Found {duplicates} duplicates"))

    # Check 2: No empty titles
    empty_titles = (df['Title'] == '').sum() + df['Title'].isna().sum()
    checks.append(("No empty titles", empty_titles == 0, f"Found {empty_titles} empty titles"))

    # Check 3: Year in valid range
    # (coerce to numeric first so string-typed columns don't raise a TypeError)
    if 'Year' in df.columns:
        years = pd.to_numeric(df['Year'], errors='coerce')
        invalid_years = ((years < 1888) | (years > 2030)).sum()
        checks.append(("Valid years", invalid_years == 0, f"Found {invalid_years} invalid years"))

    # Print results
    print("Data Quality Checks:")
    print("-" * 50)
    for name, passed, message in checks:
        status = "✓" if passed else "✗"
        print(f"{status} {name}: {message if not passed else 'OK'}")

    return all(passed for _, passed, _ in checks)
# Test on messy data (should fail)
print("Checking messy data:")
check_data_quality(df_messy)

Question 6.2: Add More Quality Checks
Extend the check_data_quality function to include:

1. Rating values between 0 and 10
2. No negative box office values
3. Runtime between 1 and 1000 minutes
4. At least 90% of rows have non-null ratings
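Each addition follows the existing checks.append(...) pattern. For example, a rating-range check might look like this (a sketch, assuming the column is still named imdbRating):

ratings = pd.to_numeric(df['imdbRating'], errors='coerce')
bad_ratings = ((ratings < 0) | (ratings > 10)).sum()
checks.append(("Valid ratings", bad_ratings == 0, f"Found {bad_ratings} out-of-range ratings"))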
# YOUR CODE HERE

Part 7: Complete Pipeline
Put everything together into a complete validation pipeline.
Question 7.1: Build the Complete Pipeline
Create a DataValidationPipeline class that:

1. Loads data from JSON or CSV
2. Profiles the data
3. Cleans the data
4. Validates with Pydantic
5. Runs quality checks
6. Exports clean data
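A hint for load(): dispatch on the file extension (a sketch):

def load(self):
    """Load raw data from JSON or CSV based on the file extension."""
    if self.input_file.endswith('.json'):
        with open(self.input_file) as f:
            self.raw_data = json.load(f)
    else:
        self.raw_data = pd.read_csv(self.input_file).to_dict('records')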
# YOUR CODE HERE
class DataValidationPipeline:
    """Complete data validation pipeline."""

    def __init__(self, input_file):
        self.input_file = input_file
        self.raw_data = None
        self.clean_data = None

    def load(self):
        """Load data from file."""
        pass

    def profile(self):
        """Generate data profile."""
        pass

    def clean(self):
        """Clean the data."""
        pass

    def validate(self):
        """Validate with Pydantic."""
        pass

    def check_quality(self):
        """Run quality checks."""
        pass

    def export(self, output_file):
        """Export clean data."""
        pass

    def run(self, output_file):
        """Run complete pipeline."""
        self.load()
        print("\n1. Data Profile:")
        self.profile()
        print("\n2. Cleaning...")
        self.clean()
        print("\n3. Validation...")
        self.validate()
        print("\n4. Quality Checks:")
        self.check_quality()
        print("\n5. Exporting...")
        self.export(output_file)
# Test
# pipeline = DataValidationPipeline('messy_movies.json')
# pipeline.run('clean_movies.csv')

Part 8: Challenge Problems
Challenge 8.1: Fuzzy Duplicate Detection
Sometimes duplicates have slight variations (e.g., “The Matrix” vs “Matrix, The”).
Write a function that finds potential duplicates using fuzzy string matching.
Hint: Use the fuzzywuzzy or rapidfuzz library.
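For example, with rapidfuzz (a sketch; you may need to !pip install rapidfuzz first):

from rapidfuzz import fuzz, utils

# token_sort_ratio ignores word order, and default_process strips
# punctuation and case, so reordered titles score near 100
score = fuzz.token_sort_ratio("The Matrix", "Matrix, The",
                              processor=utils.default_process)
print(score)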
# YOUR CODE HERE

Challenge 8.2: Automatic Type Inference
Write a function that automatically infers the correct data type for each column by analyzing the values.
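One heuristic (a sketch with a placeholder name): try numeric conversion and see what fraction of values survives.

def infer_column_type_sketch(series, threshold=0.8):
    """Sketch: guess a column's type from how its values parse."""
    numeric = pd.to_numeric(series, errors='coerce')
    if numeric.notna().mean() >= threshold:
        # If every parsed value is whole, call it int; otherwise float
        return 'int' if (numeric.dropna() % 1 == 0).all() else 'float'
    return 'str'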
# YOUR CODE HERE

Summary
In this lab, you learned:
- CLI Inspection: Using jq and Unix tools for quick data exploration
- Data Profiling: Systematic analysis of data quality issues
- Data Cleaning: Converting messy strings to proper types
- Schema Validation: Using Pydantic for type safety
- Quality Checks: Automated assertions for data quality
Next Week
Week 3: Data Labeling & Annotation

- Setting up Label Studio
- Annotation workflows
- Measuring inter-annotator agreement