Data Validation & Quality

Week 2 · CS 203: Software Tools and Techniques for AI

Prof. Nipun Batra
IIT Gandhinagar

Part 1: The Motivation

What did we actually collect?

Last Week: We Collected Data!

Remember our Netflix movie prediction project?

# We wrote this beautiful code
movies = []
for title in movie_list:
    response = requests.get(OMDB_API, params={"t": title})
    movies.append(response.json())

df = pd.DataFrame(movies)
df.to_csv("netflix_movies.csv")
print(f"Collected {len(df)} movies!")

Output: Collected 1000 movies!

Feeling: Victory! Time to train models!

Reality Check: Let's Look at the Data

import pandas as pd
df = pd.read_csv("lecture-demos/week02/data/movies.csv")
print(df.head())
   title          year    runtime    rating  boxoffice      genre                     rated
0  Inception      2010    148 min    8.8     $292576195  Action, Adventure, Sci-Fi   PG-13
1  Avatar         2009    162 min    7.9     $2923706026 Action, Adventure, Fantasy  PG-13
2  The Room       2003    99 min     3.9     N/A         Drama                       R
3  Inception      2010    148 min    8.8     $292576195  Action, Adventure, Sci-Fi   PG-13
4  Tenet          N/A     150 min    7.3     N/A         Action, Sci-Fi, Thriller    PG-13

Wait... something's wrong here.

The Problems Emerge

| # | Issue        | Example                                    |
| - | ------------ | ------------------------------------------ |
| 1 | DUPLICATES   | Inception appears twice (rows 0 and 3)     |
| 2 | MISSING      | Year is "N/A" for Tenet (row 4)            |
| 3 | WRONG TYPES  | Runtime is "148 min", not integer 148      |
| 4 | INCONSISTENT | BoxOffice has "$" and commas               |
| 5 | N/A VALUES   | Some BoxOffice entries are literally "N/A" |

Let's Dig Deeper

print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Title       1000 non-null   object    <- All strings!
 1   Year        987 non-null    object    <- String, not int!
 2   Runtime     1000 non-null   object    <- "148 min" string
 3   imdbRating  892 non-null    object    <- String, not float!
 4   BoxOffice   634 non-null    object    <- "$292,576,195" string

Every column is a string (object)!
366 movies have no BoxOffice data!

What Happens If We Ignore This?

# Naive approach: just train the model!
from sklearn.linear_model import LinearRegression

X = df[['Year', 'Runtime', 'imdbRating']]
y = df['BoxOffice']

model = LinearRegression()
model.fit(X, y)
ValueError: could not convert string to float: '148 min'

The model refuses to train.

Or Worse: Silent Failures

# "Fix" by forcing numeric conversion
df['Year'] = pd.to_numeric(df['Year'], errors='coerce')
df['Rating'] = pd.to_numeric(df['imdbRating'], errors='coerce')

# Now 13 movies have NaN year, 108 have NaN rating
# We lost data silently!

# Train anyway
model.fit(df[['Year', 'Rating']].dropna(), y.dropna())
# Model trains on 521 movies instead of 1000 - and because X and y
# are dropna()'d independently, their rows may no longer even align!

You trained on half your data without realizing.

Real-World Data Quality Disasters

| Company           | What Happened                                              | Cost                                  |
| ----------------- | ---------------------------------------------------------- | ------------------------------------- |
| NASA Mars Orbiter | Lockheed used pound-seconds, NASA expected newton-seconds   | $327 million spacecraft lost           |
| Knight Capital    | Old code reactivated on 1 of 8 servers during deployment    | $440 million in 45 minutes             |
| UK COVID Stats    | Excel .xls format limited to 65,536 rows                    | 16,000 cases unreported                |
| Zillow iBuying    | Home price algorithm couldn't handle market volatility      | $500 million loss, program shut down   |

Data quality is not optional. It's survival.

The Data Quality Pyramid


You can't skip layers. Each depends on the one below.

The Cost of Skipping Validation

The 1-10-100 Rule: It costs $1 to verify data at entry, $10 to fix it later, and $100 to recover from bad decisions made with bad data.

Where do problems get discovered?

| Stage           | Discovery Cost | Example                               |
| --------------- | -------------- | ------------------------------------- |
| Data Entry      | $1             | Validation rejects bad input          |
| Processing      | $10            | ETL* pipeline fails                   |
| Analysis        | $50            | Analyst spots anomaly in report       |
| Production      | $100+          | Model makes bad predictions           |
| Business Impact | $1000+         | Wrong decisions based on flawed data  |

*ETL = Extract, Transform, Load - the process of moving data from sources to a destination (e.g., database or data warehouse)

Earlier is always cheaper.

Today's Mission

Transform messy raw data into clean, validated data.

Tools we'll learn:

  • Unix commands: head, tail, wc, file, sort, uniq
  • jq: JSON processing powerhouse
  • CSVkit: CSV Swiss Army knife
  • JSON Schema: Language-agnostic data contracts
  • Pydantic: Pythonic data validation

Principle: Inspect before you trust. Validate before you use.

Part 2: Types of Data Problems

Know your enemy

A Taxonomy of Data Problems

The Six Data Quality Dimensions

| Dimension    | Question                        | Example Problem               |
| ------------ | ------------------------------- | ----------------------------- |
| Completeness | Is all expected data present?   | Missing ratings, null values  |
| Accuracy     | Is the data correct?            | Year 2099 for a 1999 movie    |
| Consistency  | Does data agree across sources? | "USA" vs "United States"      |
| Validity     | Does data conform to rules?     | Rating of 15.0 (max is 10)    |
| Uniqueness   | Are there duplicates?           | Same movie appears 3 times    |
| Timeliness   | Is data up-to-date?             | Using 2019 prices in 2024     |

Let's see examples of each...

Problem 1: Missing Values

The data simply isn't there.

title,year,rating,revenue
Inception,2010,8.8,292576195
Avatar,2009,7.9,2923706026
The Room,2003,3.9,
Tenet,,7.3,363656624

Types of missingness:

  • Empty string: ""
  • Null/None: null in JSON
  • Sentinel value: "N/A", "NULL", -1, 9999
  • Missing key: Key doesn't exist in JSON

Why it matters: ML models can't handle missing values directly.
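
A quick way to surface all four kinds at once is to read everything as raw strings and count each flavor separately (a minimal pandas sketch; the sentinel list is an assumption to adapt per dataset):

import pandas as pd

SENTINELS = ["N/A", "NULL", "-1", "9999"]  # assumed sentinel values

df = pd.read_csv("movies.csv", dtype=str, keep_default_na=False)
for col in df.columns:
    empties = (df[col] == "").sum()
    sentinels = df[col].isin(SENTINELS).sum()
    print(f"{col}: {empties} empty, {sentinels} sentinel values")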

Problem 2: Wrong Data Types

Data exists but in wrong format.

{
  "title": "Inception",
  "year": "2010",          // String, should be integer
  "rating": "8.8",         // String, should be float
  "runtime": "148 min",    // String with unit, should be integer
  "released": "16 Jul 2010" // String, should be date
}

Common type issues:

  • Numbers stored as strings
  • Dates in various string formats
  • Booleans as "true"/"false"/"yes"/"no"/"1"/"0"
  • Lists stored as comma-separated strings

Problem 3: Inconsistent Formats

Same concept, different representations.

# Date formats
2010-07-16
07/16/2010
16 Jul 2010
July 16, 2010

# Currency formats
$292,576,195
292576195
$292.5M
292,576,195 USD

# Boolean formats
true, True, TRUE, 1, yes, Yes, Y

Why it matters: Can't compare or aggregate inconsistent data.
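
A common remedy is to parse every variant into one canonical form before comparing (a sketch; the candidate format list is an assumption):

from datetime import datetime

FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"]  # assumed variants

def to_iso(date_str):
    """Try each known format; return an ISO 8601 date or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_str.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(to_iso("16 Jul 2010"))    # 2010-07-16
print(to_iso("July 16, 2010"))  # 2010-07-16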

Problem 4: Duplicates

Same record appears multiple times.

title,year,rating
Inception,2010,8.8
Avatar,2009,7.9
Inception,2010,8.8      <- Exact duplicate
The Matrix,1999,8.7
inception,2010,8.8      <- Case variation duplicate
Inception,2010,8.9      <- Near duplicate (different rating?)

Types of duplicates:

  • Exact: Identical in every field
  • Partial: Same key, different values (which is correct?)
  • Fuzzy: Similar but not identical ("Spiderman" vs "Spider-Man")
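
pandas catches the first two kinds directly once you normalize case; fuzzy matches need string similarity on top (a sketch, assuming the column names from the example above):

import pandas as pd

df = pd.read_csv("movies.csv")

# Exact duplicates: identical in every field
exact = df[df.duplicated(keep=False)]

# Case-variation duplicates: normalize before comparing
df["title_norm"] = df["title"].str.strip().str.lower()
case_dupes = df[df.duplicated(subset="title_norm", keep=False)]

# Partial duplicates: same title but conflicting ratings
conflicts = df.groupby("title_norm").filter(lambda g: g["rating"].nunique() > 1)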

Problem 5: Outliers and Anomalies

Values that are technically valid but suspicious.

title,year,rating,budget
Inception,2010,8.8,160000000
Avatar,2009,7.9,237000000
The Room,2003,3.9,6000000
Avengers,2012,8.0,-50000000     <- Negative budget?
Unknown,2025,9.9,999999999999   <- Future year, impossible rating

Questions to ask:

  • Is this value within reasonable range?
  • Is this value possible given business rules?
  • Is this value consistent with other fields?
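
Each question becomes a boolean mask (a sketch over the sample rows above; the bounds are assumptions to replace with your own business rules):

import io
import pandas as pd

raw = """title,year,rating,budget
Inception,2010,8.8,160000000
Avengers,2012,8.0,-50000000
Unknown,2025,9.9,999999999999"""
df = pd.read_csv(io.StringIO(raw))

suspicious = df[
    ~df["budget"].between(0, 1_000_000_000)  # negative or implausibly large
    | ~df["year"].between(1888, 2024)        # before cinema or in the future
    | ~df["rating"].between(0, 10)           # outside the rating scale
]
print(suspicious["title"].tolist())          # ['Avengers', 'Unknown']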

Problem 6: Encoding Issues

Text looks garbled or contains strange characters.

Expected: "Amelie"
Got:      "Amélie"      <- UTF-8 read as Latin-1

Expected: "Japanese text"
Got:      "æ¥æ¬èª"        <- Wrong encoding

Expected: "Zoe"
Got:      "Zo\xeb"        <- Raw bytes shown

Common encoding issues:

  • UTF-8 vs Latin-1 (ISO-8859-1)
  • Windows-1252 vs UTF-8
  • BOM (Byte Order Mark) at file start
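
You can reproduce the first garbling in one line of Python, which is also the key to repairing it: re-encode with the codec that was wrongly used to decode, then decode correctly.

text = "Amélie"

# How mojibake happens: UTF-8 bytes decoded as Latin-1
garbled = text.encode("utf-8").decode("latin-1")
print(garbled)    # AmÃ©lie

# How to undo it: reverse the mistaken step
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)   # Amélie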

Problem 7: Schema Violations

Data structure doesn't match expectations.

// Expected schema
{"title": "string", "year": "integer", "genres": ["string"]}

// Actual data
{"title": "Inception", "year": 2010, "genres": ["Sci-Fi", "Action"]}  // OK
{"title": "Avatar", "year": "2009", "genres": "Action"}               // year is string, genres is string not array
{"Title": "Matrix", "Year": 1999}                                      // Wrong case, missing genres
{"title": null, "year": 2020, "genres": []}                           // Null title

Schema defines: Field names, types, required fields, constraints.

Summary: Data Problem Checklist

| Problem    | Question to Ask               | Tool to Detect          |
| ---------- | ----------------------------- | ----------------------- |
| Missing    | Are there nulls/empty values? | csvstat, pandas         |
| Types      | Are numbers actually numbers? | file, schema validation |
| Format     | Is date format consistent?    | grep, regex             |
| Duplicates | Are there repeated rows?      | sort, uniq, csvsql      |
| Outliers   | Are values in valid range?    | csvstat, histograms     |
| Encoding   | Is text readable?             | file, iconv             |
| Schema     | Does structure match spec?    | JSON Schema, Pydantic   |

Part 3: First Look at Your Data

Unix tools for initial inspection

Demo Files Location

All demos use data from:

lecture-demos/week02/
├── data/
│   ├── movies.csv        # 96 movies with quality issues
│   ├── movies.json       # 25 movies with issues (JSON)
│   ├── movie.json        # Single movie (OMDB format)
│   └── movie_schema.json # JSON Schema definition
├── 01_unix_inspection.sh # Unix CLI demos
├── 02_jq_basics.sh       # jq JSON processing
├── 03_csvkit_demo.sh     # CSVkit tools
├── 04_json_schema_validation.py
├── 05_pydantic_basics.py
├── 06_data_profiling.py
└── 07_validation_pipeline.py

Run demos from: cd lecture-demos/week02/data

Before You Do Anything: Look at the Data

Golden Rule: Never process data you haven't inspected.

# What kind of file is this?
file movies.csv

# How big is it?
ls -lh movies.csv
wc -l movies.csv

# What does it look like?
head movies.csv
tail movies.csv

These 5 commands should be muscle memory.

The file Command

Tells you what type of file you're dealing with.

# 01_unix_inspection.sh → PART 1
$ file movies.csv
movies.csv: UTF-8 Unicode text

$ file movies.json
movies.json: JSON text data

$ file movie.json
movie.json: JSON text data

# Check encoding specifically
$ file -i movies.csv
movies.csv: text/plain; charset=utf-8

Reveals: Text encoding, line endings, file format

The wc Command

Word count - but more useful for lines and characters.

# 01_unix_inspection.sh → PART 2
$ wc movies.csv
    97    496  6847 movies.csv
    |     |    |
    |     |    +-- bytes
    |     +------- words
    +------------- lines

# Just line count (most common)
$ wc -l movies.csv
97 movies.csv
# 97 lines = 1 header + 96 data rows

$ wc -l movies.json
27 movies.json

Quick sanity check: Does line count match expectations?

The head Command

See the first N lines of a file.

# 01_unix_inspection.sh → PART 3
$ head -5 movies.csv
title,year,runtime,rating,boxoffice,genre,rated
Inception,2010,148 min,8.8,$292576195,"Action, Adventure, Sci-Fi",PG-13
Avatar,2009,162 min,7.9,$2923706026,"Action, Adventure, Fantasy",PG-13
The Room,2003,99 min,3.9,N/A,Drama,R
Inception,2010,148 min,8.8,$292576195,"Action, Adventure, Sci-Fi",PG-13

$ head -3 movies.json
[
  {"Title": "Inception", "Year": "2010", ...},
  {"Title": "Avatar", "Year": "2009", ...},

Use case: Quickly see headers and sample data.

The tail Command

See the last N lines of a file.

# 01_unix_inspection.sh → PART 4
$ tail -5 movies.csv
Blackfish,2013,83 min,8.1,$2073582,"Documentary, Drama",PG-13
The Cove,2009,92 min,8.4,$864000,Documentary,PG-13
An Inconvenient Truth,2006,96 min,7.4,$50000000,Documentary,PG
March of the Penguins,2005,80 min,7.5,$127400000,"Documentary, Family",G
...

# Skip header (everything except first line)
$ tail -n +2 movies.csv | head -3

Use case: Check if file ends properly, skip headers.

Combining head and tail

See a slice of the file:

# Lines 100-110 (skip 99, take 11)
$ head -110 movies.csv | tail -11

# See header + specific row range
$ head -1 movies.csv && sed -n '500,510p' movies.csv

Practical example:

# File has 1 million rows, peek at middle
$ head -500000 huge.csv | tail -10

The sort Command

Sort lines alphabetically or numerically.

# 01_unix_inspection.sh → PART 5

# Sort by title (first 5)
$ tail -n +2 movies.csv | sort -t',' -k1 | head -5

# Sort by year descending (first 5)
$ tail -n +2 movies.csv | sort -t',' -k2 -nr | head -5
Future Movie,2030,120 min,...
Unknown Movie,2025,90 min,...
...
# Note: with -n, non-numeric years like "N/A" sort as 0, so they land last

sort Flags

| Flag  | Meaning                  |
| ----- | ------------------------ |
| -t',' | Field delimiter is comma |
| -k3   | Sort by 3rd field        |
| -n    | Numeric sort             |
| -r    | Reverse (descending)     |
| -u    | Remove duplicates        |

# Combine flags: sort by rating (field 4), descending, unique
$ sort -t',' -k4 -nr -u movies.csv

The uniq Command

Find or remove duplicate lines.

# 01_unix_inspection.sh → PART 6

# Remove adjacent duplicates (MUST sort first!)
$ sort movies.csv | uniq

# Count occurrences of each line
$ sort movies.csv | uniq -c

Important: uniq only detects adjacent duplicates. Always sort first!

uniq Options

| Option | What it shows                   |
| ------ | ------------------------------- |
| (none) | Deduplicated lines              |
| -c     | Count of each line              |
| -d     | Only duplicated lines           |
| -u     | Only unique lines (appear once) |

# Show only duplicates
$ sort movies.csv | uniq -d

Finding Duplicates: Practical Example

# 01_unix_inspection.sh → PART 6

# Find duplicate titles
$ cut -d',' -f1 movies.csv | sort | uniq -d
Inception
Spider-Man
The Matrix

Counting Duplicates

# 01_unix_inspection.sh → PART 6

# How many times does each title appear?
$ cut -d',' -f1 movies.csv | sort | uniq -c | sort -rn | head -5
   3 Spider-Man
   2 The Matrix
   2 Inception
   1 Your Name
   1 WALL-E

Found 3 duplicate titles! (Spider-Man appears 3x, others 2x)

The cut Command

Extract columns from delimited data.

# 01_unix_inspection.sh → PART 7

# Get titles (first 5)
$ cut -d',' -f1 movies.csv | head -5
title
Inception
Avatar
The Room
Inception

# Get title and rating (columns 1 and 4)
$ cut -d',' -f1,4 movies.csv | head -5
title,rating
Inception,8.8
Avatar,7.9
The Room,3.9
Inception,8.8

The grep Command

Search for patterns in text.

# 01_unix_inspection.sh → PART 8

# Find rows containing "Inception"
$ grep "Inception" movies.csv
Inception,2010,148 min,8.8,$292576195,"Action, Adventure, Sci-Fi",PG-13
Inception,2010,148 min,8.8,$292576195,"Action, Adventure, Sci-Fi",PG-13

# Count N/A values
$ grep -c "N/A" movies.csv
15

grep Options

| Option | Effect                      |
| ------ | --------------------------- |
| -c     | Count matches               |
| -n     | Show line numbers           |
| -v     | Invert (lines NOT matching) |
| -i     | Case insensitive            |

# 01_unix_inspection.sh → PART 8

# N/A with line numbers (first 5)
$ grep -n "N/A" movies.csv | head -5

# Case insensitive search for "matrix"
$ grep -i "matrix" movies.csv
The Matrix,1999,136 min,8.7,$463517383,"Action, Sci-Fi",R
The Matrix,1999,136 min,8.7,$463517383,"Action, Sci-Fi",R

Putting It Together: Initial Inspection

# Run: bash inspect_data.sh (in lecture-demos/week02/)

# Quick one-liner inspection
file movies.csv && wc -l movies.csv && head -3 movies.csv

# Check for issues
echo "N/A values: $(grep -c 'N/A' movies.csv)"
echo "Empty fields: $(grep -c ',,' movies.csv)"
echo "Duplicates: $(cut -d',' -f1 movies.csv | sort | uniq -d | wc -l)"

See full script: lecture-demos/week02/inspect_data.sh

Part 4: jq - JSON Processing

The Swiss Army knife for JSON

Why jq?

JSON is everywhere:

  • API responses
  • Configuration files
  • Log files
  • NoSQL databases

Problem: JSON is hard to read and process in shell.

# Raw JSON - unreadable mess
$ cat movie.json
{"Title":"Inception","Year":"2010","Rated":"PG-13","Released":"16 Jul 2010","Runtime":"148 min","Genre":"Action, Adventure, Sci-Fi"}

Solution: jq - a lightweight JSON processor.

The jq Mental Model

Think of jq as a pipeline: Data flows in, gets transformed, flows out. Each filter transforms the data for the next filter.

Input JSON  -->  Filter 1  -->  Filter 2  -->  Filter 3  -->  Output
    .            .movies       .[0]          .title         "Inception"
 (whole doc)   (get field)   (first elem)  (get title)

Key concepts:

  • . = current data (identity)
  • | = pipe to next filter
  • [] = iterate over array
  • .field = access object field

jq is like SQL for JSON - query and transform in one line.

jq Basics: Pretty Printing

# 02_jq_basics.sh → PART 1

$ cat movie.json | jq .
{
  "Title": "Inception",
  "Year": "2010",
  "Rated": "PG-13",
  "Runtime": "148 min",
  "Genre": "Action, Adventure, Sci-Fi",
  "Director": "Christopher Nolan",
  "imdbRating": "8.8",
  "BoxOffice": "$292,576,195"
}

The . is the identity filter - it means "the whole input".

jq: Extracting Fields

# 02_jq_basics.sh → PART 2

# Get a single field
$ cat movie.json | jq '.Title'
"Inception"

# Get multiple fields
$ cat movie.json | jq '.Title, .Year'
"Inception"
"2010"

# Get first Rating (nested array)
$ cat movie.json | jq '.Ratings[0]'
{"Source": "Internet Movie Database", "Value": "8.8/10"}

Syntax: .fieldname extracts that field.

jq: Working with Arrays

# 02_jq_basics.sh → PART 3 (movies.json has 25 movies with issues)

# Get number of movies
$ cat movies.json | jq 'length'
25

# Get first movie
$ cat movies.json | jq '.[0]'
{"Title": "Inception", "Year": "2010", "Runtime": "148 min", ...}

# Get all titles (first 5)
$ cat movies.json | jq '.[].Title' | head -5
"Inception"
"Avatar"
"The Room"
"Inception"
"Tenet"

jq: The Array Iterator []

# .[] iterates over array elements
$ cat movies.json | jq '.[]'
{"Title": "Inception", "Year": "2010"}
{"Title": "Avatar", "Year": "2009"}
{"Title": "The Matrix", "Year": "1999"}

# Chain with field extraction
$ cat movies.json | jq '.[].Title'
"Inception"
"Avatar"
"The Matrix"

# Same as:
$ cat movies.json | jq '.[] | .Title'

The pipe | passes output to next filter.

jq: Building New Objects

# 02_jq_basics.sh → PART 4

# Transform structure (first 3)
$ cat movies.json | jq '.[:3] | .[] | {name: .Title, year: .Year, rating: .imdbRating}'
{"name": "Inception", "year": "2010", "rating": "8.8"}
{"name": "Avatar", "year": "2009", "rating": "7.9"}
{"name": "The Room", "year": "2003", "rating": "3.9"}

# Collect into array
$ cat movies.json | jq '[.[:3][] | {name: .Title, year: .Year}]'
[
  {"name": "Inception", "year": "2010"},
  {"name": "Avatar", "year": "2009"},
  {"name": "The Room", "year": "2003"}
]

jq: Filtering with select()

# 02_jq_basics.sh → PART 5

# Find movies with N/A year
$ cat movies.json | jq '.[] | select(.Year == "N/A") | .Title'
"Tenet"

# Find movies with N/A BoxOffice
$ cat movies.json | jq '.[] | select(.BoxOffice == "N/A") | .Title'
"The Room"
"Tenet"
"Old Silent Film"

# Find movies with null/empty title
$ cat movies.json | jq '.[] | select(.Title == null or .Title == "")'

jq: Type Conversion

Remember: API data often has numbers as strings!

# 02_jq_basics.sh → PART 6

# Convert string to number
$ echo '{"Year": "2010"}' | jq '.Year | tonumber'
2010

# Safe year extraction (first 5 valid)
$ cat movies.json | jq '[.[] | select(.Year != "N/A" and .Year != null) | {title: .Title, year: (.Year | tonumber)}] | .[:5]'
[
  {"title": "Inception", "year": 2010},
  {"title": "Avatar", "year": 2009},
  ...
]

jq: Handling Missing Data

# 02_jq_basics.sh → PART 7

# Default value with //
$ echo '{"title": "Test"}' | jq '.rating // "N/A"'
"N/A"

# Check if field exists
$ cat movie.json | jq 'has("BoxOffice")'
true
$ cat movie.json | jq 'has("Budget")'
false

# Count non-null ratings
$ cat movies.json | jq '[.[] | select(.imdbRating != null and .imdbRating != "N/A")] | length'
23

jq: Aggregation Functions

# 02_jq_basics.sh → PART 8

# Count elements
$ cat movies.json | jq 'length'
25

# Get unique Rated values
$ cat movies.json | jq '[.[].Rated] | unique'
["NR", "Not Rated", "PG", "PG-13", "R", "XX"]

# Count by Rated (simplified)
$ cat movies.json | jq 'group_by(.Rated) | map({rated: .[0].Rated, count: length})'
[
  {"rated": "NR", "count": 1},
  {"rated": "PG", "count": 2},
  {"rated": "PG-13", "count": 9},
  ...
]

jq: Sorting

# 02_jq_basics.sh → PART 9

# Sort by Year (first 5 titles)
$ cat movies.json | jq '[.[] | select(.Year != "N/A")] | sort_by(.Year) | .[:5] | .[].Title'
"The Matrix"
"Amelie"
"Spider-Man"
...

# Top 5 by Year (newest)
$ cat movies.json | jq '[.[] | select(.Year != "N/A")] | sort_by(.Year) | reverse | .[:5] | .[] | "\(.Title) (\(.Year))"'
"Unknown Movie (2025)"
"Avengers: Endgame (2019)"
"Parasite (2019)"
...

jq: Grouping

# Group movies by year
$ cat movies.json | jq 'group_by(.Year)'
[
  [{"Title": "The Matrix", "Year": "1999"}],
  [{"Title": "Avatar", "Year": "2009"}],
  [{"Title": "Inception", "Year": "2010"}, {"Title": "Toy Story 3", "Year": "2010"}]
]

# Count movies per year
$ cat movies.json | jq 'group_by(.Year) | map({year: .[0].Year, count: length})'
[
  {"year": "1999", "count": 1},
  {"year": "2009", "count": 1},
  {"year": "2010", "count": 2}
]

jq: Raw Output Mode

# 02_jq_basics.sh → PART 10

# Raw strings (without quotes)
$ cat movies.json | jq -r '.[0:3][].Title'
Inception
Avatar
The Room

# CSV output (first 5)
$ cat movies.json | jq -r '.[:5][] | [.Title, .Year, .imdbRating] | @csv'
"Inception","2010","8.8"
"Avatar","2009","7.9"
"The Room","2003","3.9"
...

# TSV output (first 3)
$ cat movies.json | jq -r '.[:3][] | [.Title, .Year] | @tsv'
Inception	2010
Avatar	2009

jq: Finding Data Issues

# 02_jq_basics.sh → PART 11

# Find movies with "N/A" years
$ cat movies.json | jq '[.[] | select(.Year == "N/A")] | length'
1

# Find movies with null/empty titles
$ cat movies.json | jq '.[] | select(.Title == null or .Title == "") | .Year'
"2020"
"2018"

# Find movies with invalid ratings (not a number)
$ cat movies.json | jq '.[] | select(.imdbRating == "invalid") | .Title'
"Joker"

jq: Data Quality Checks

# 02_jq_basics.sh → PART 11 (Data Summary)

# Full data quality summary
$ cat movies.json | jq '{
  total: length,
  null_titles: [.[] | select(.Title == null or .Title == "")] | length,
  na_years: [.[] | select(.Year == "N/A")] | length,
  na_boxoffice: [.[] | select(.BoxOffice == "N/A")] | length
}'
{
  "total": 25,
  "null_titles": 2,
  "na_years": 1,
  "na_boxoffice": 3
}

jq Cheat Sheet - Basics

| Task          | Command                    |
| ------------- | -------------------------- |
| Pretty print  | jq .                       |
| Get field     | jq '.fieldname'            |
| Get nested    | jq '.a.b.c'                |
| Array element | jq '.[0]'                  |
| All elements  | jq '.[]'                   |
| Filter        | jq '.[] | select(.x > 5)'  |

jq Cheat Sheet - Advanced

| Task         | Command               |
| ------------ | --------------------- |
| Build object | jq '{a: .x, b: .y}'   |
| Count        | jq 'length'           |
| Sort         | jq 'sort_by(.field)'  |
| Unique       | jq 'unique'           |
| Raw strings  | jq -r                 |

Part 5: CSVkit

The CSV Swiss Army Knife

Why CSVkit?

CSV looks simple but hides complexity:

  • Quoted fields with commas inside
  • Multiline values
  • Different delimiters
  • Inconsistent escaping

CSVkit: A suite of command-line tools for CSV files.

# Installation
pip install csvkit

Tools we'll cover:
csvlook, csvstat, csvcut, csvgrep, csvsort, csvjson, csvsql

csvlook: Pretty Print CSV

Makes CSV readable in terminal.

# 03_csvkit_demo.sh → PART 1

$ csvlook movies.csv | head -7
| title      | year | runtime  | rating | boxoffice    | genre                     | rated |
| ---------- | ---- | -------- | ------ | ------------ | ------------------------- | ----- |
| Inception  | 2010 | 148 min  |    8.8 | $292576195   | Action, Adventure, Sci-Fi | PG-13 |
| Avatar     | 2009 | 162 min  |    7.9 | $2923706026  | Action, Adventure, Fantasy| PG-13 |
| The Room   | 2003 | 99 min   |    3.9 | N/A          | Drama                     | R     |
| Inception  | 2010 | 148 min  |    8.8 | $292576195   | Action, Adventure, Sci-Fi | PG-13 |
| Tenet      | N/A  | 150 min  |    7.3 | N/A          | Action, Sci-Fi, Thriller  | PG-13 |

Compare to raw CSV - much easier to read!

csvstat: Data Profiling

Get statistics for every column automatically!

# 03_csvkit_demo.sh → PART 2

$ csvstat -c title movies.csv
  1. "title"
        Type of data:          Text
        Contains null values:  True
        Unique values:         92
        Longest value:         29 characters
        Most common values:    Spider-Man (3x)
                               The Matrix (2x)
                               Inception (2x)

# Just counts
$ csvstat --count movies.csv
96

csvstat: Specific Columns

# Stats for just one column
$ csvstat -c rating movies.csv
  3. "rating"
        Type of data:          Number
        Contains null values:  True (108 nulls)
        Smallest value:        1.2
        Largest value:         9.3
        Mean:                  6.84
        Median:                7.1
        StDev:                 1.23

# Stats for multiple columns
$ csvstat -c year,rating movies.csv

# Just show counts
$ csvstat --count movies.csv
96

csvcut: Select Columns

# 03_csvkit_demo.sh → PART 3

# List column names
$ csvcut -n movies.csv
  1: title
  2: year
  3: runtime
  4: rating
  5: boxoffice
  6: genre
  7: rated

# Select by name (first 5)
$ csvcut -c title,year movies.csv | head -6
title,year
Inception,2010
Avatar,2009
The Room,2003
Inception,2010
Tenet,N/A

csvgrep: Filter Rows

# 03_csvkit_demo.sh → PART 4

# Exact match: Year = 2019
$ csvgrep -c year -m "2019" movies.csv | csvlook

# Titles starting with 'The'
$ csvgrep -c title -r "^The" movies.csv | csvcut -c title | head -10

# Rows without N/A in boxoffice (count)
$ csvgrep -c boxoffice -m "N/A" -i movies.csv | wc -l

# Rows with N/A rating
$ csvgrep -c rating -r "^N/A$" movies.csv | csvlook

csvsort: Sort Data

# 03_csvkit_demo.sh → PART 5

# Sort by year (first 5)
$ csvsort -c year movies.csv | head -6

# Sort by rating descending (first 5)
$ csvsort -c rating -r movies.csv | head -6

# Sort by multiple columns
$ csvsort -c year,rating movies.csv | head -10

# Numeric sort happens automatically for number columns!

csvjson: Convert to JSON

# 03_csvkit_demo.sh → PART 6

# First 3 rows as JSON
$ head -4 movies.csv | csvjson | jq '.'
[
  {"title": "Inception", "year": "2010", "runtime": "148 min", ...},
  {"title": "Avatar", "year": "2009", "runtime": "162 min", ...},
  {"title": "The Room", "year": "2003", "runtime": "99 min", ...}
]

# Indented output
$ head -3 movies.csv | csvjson -i 2

Great for converting between formats!

csvsql: Query CSV with SQL!

Yes, you can run SQL on CSV files.

# 03_csvkit_demo.sh → PART 7

# Basic select
$ csvsql --query "SELECT title, rating FROM movies WHERE rating > 8.5 ORDER BY rating DESC" movies.csv | csvlook

# Find duplicates
$ csvsql --query "SELECT title, COUNT(*) as count FROM movies GROUP BY title HAVING count > 1" movies.csv | csvlook
| title      | count |
| ---------- | ----- |
| Inception  |     2 |
| Spider-Man |     3 |
| The Matrix |     2 |

csvsql: Data Validation Queries

# 03_csvkit_demo.sh → PART 7

# Movies per year (sample)
$ csvsql --query "SELECT year, COUNT(*) as count FROM movies GROUP BY year ORDER BY count DESC LIMIT 5" movies.csv | csvlook

# Count N/A boxoffice by year
$ csvsql --query "SELECT year, COUNT(*) as missing FROM movies WHERE boxoffice = 'N/A' GROUP BY year ORDER BY missing DESC LIMIT 5" movies.csv | csvlook

csvclean: Fix Common Issues

# 03_csvkit_demo.sh → PART 8

# Check for structural issues (dry run)
$ csvclean -n movies.csv
(no issues found)

# If issues existed, it would create:
# - movies_out.csv (cleaned)
# - movies_err.csv (errors with line numbers)

# Common fixes:
# - Removes rows with wrong column count
# - Normalizes quoting
# - Reports line numbers of errors

CSVkit Pipeline Example

# 03_csvkit_demo.sh → PART 9

# Top rated movies by genre (sample)
$ csvcut -c title,rating,genre movies.csv \
  | csvgrep -c rating -r "^[0-9]" \
  | csvsort -c rating -r \
  | head -10 \
  | csvlook

# Data quality summary
$ echo "Total rows: $(csvstat --count movies.csv)"
$ echo "Unique titles: $(csvcut -c title movies.csv | tail -n +2 | sort -u | wc -l)"
$ echo "N/A in boxoffice: $(csvgrep -c boxoffice -m 'N/A' movies.csv | wc -l)"
$ echo "N/A in rating: $(csvgrep -c rating -m 'N/A' movies.csv | wc -l)"

CSVkit Cheat Sheet - Core Tools

| Tool    | Purpose        | Example                      |
| ------- | -------------- | ---------------------------- |
| csvlook | Pretty print   | csvlook data.csv             |
| csvstat | Statistics     | csvstat -c column data.csv   |
| csvcut  | Select columns | csvcut -c col1,col2 data.csv |
| csvgrep | Filter rows    | csvgrep -c col -m "value"    |
| csvsort | Sort           | csvsort -c col -r data.csv   |

CSVkit Cheat Sheet - Advanced Tools

| Tool     | Purpose     | Example                   |
| -------- | ----------- | ------------------------- |
| csvjson  | To JSON     | csvjson data.csv          |
| csvsql   | SQL queries | csvsql --query "..."      |
| csvclean | Fix issues  | csvclean data.csv         |
| csvjoin  | Join files  | csvjoin -c id a.csv b.csv |
| csvstack | Concatenate | csvstack a.csv b.csv      |

Part 6: Data Profiling

Understanding your data before using it

What is Data Profiling?

Data profiling = Analyzing data to understand its structure, content, and quality.

| Aspect       | Questions to Ask                        |
| ------------ | --------------------------------------- |
| Structure    | How many rows? Columns? What types?     |
| Completeness | How many nulls per column?              |
| Uniqueness   | How many distinct values? Duplicates?   |
| Distribution | Min, max, mean, median? Outliers?       |
| Patterns     | What formats are used? Any anomalies?   |

Profiling Step 1: Basic Shape

# 03_csvkit_demo.sh → PART 2 (csvstat)

# How many rows and columns?
$ head -1 movies.csv | tr ',' '\n' | wc -l      # columns
7

$ wc -l movies.csv                              # rows (including header)
97

# Or with csvstat
$ csvstat --count movies.csv
96

First sanity check: Does shape match expectations?

Profiling Step 2: Column Types

$ csvstat movies.csv 2>&1 | grep "Type of data"
        Type of data:          Text
        Type of data:          Number
        Type of data:          Number
        Type of data:          Number
        Type of data:          Text

# Expected: title(text), year(int), rating(float), revenue(int), genre(text)
# Actual: Matches! But let's verify...

Profiling Step 3: Null Analysis

# Count nulls per column
$ csvstat movies.csv 2>&1 | grep -A1 "Contains null"
        Contains null values:  False
--
        Contains null values:  True (13 nulls)
--
        Contains null values:  True (108 nulls)
--
        Contains null values:  True (366 nulls)

Results:

  • Title: 0 nulls (good!)
  • Year: 13 nulls (1.3%)
  • Rating: 108 nulls (10.8%)
  • Revenue: 366 nulls (36.6%) - problem!

Profiling Step 4: Unique Values

# How many distinct values?
$ csvstat movies.csv 2>&1 | grep "Unique values"
        Unique values:         987      # title - expect 1000, so ~13 duplicates
        Unique values:         85       # year - reasonable range
        Unique values:         78       # rating - 1.0 to 10.0 scale
        Unique values:         634      # revenue - 634 non-null values

# Most common values (find duplicates, common patterns)
$ csvstat movies.csv 2>&1 | grep -A5 "Most common values"

Profiling Step 5: Value Ranges

$ csvstat -c year movies.csv
        Smallest value:        1920
        Largest value:         2024
        Mean:                  2005.3
        Median:                2010
        StDev:                 15.2

# Check for suspicious outliers
# 1920 seems old - is it valid?
# 2024 is current year - any future years?
# Find extremes
$ csvsort -c year movies.csv | head -5      # oldest
$ csvsort -c year -r movies.csv | head -5   # newest

Profiling Step 6: Pattern Detection

# What values does 'rating' column have?
$ csvcut -c rating movies.csv | sort | uniq -c | sort -rn | head
    892 (valid numbers 1.0-10.0)
     47 N/A
     38
     23 Not Rated

# Aha! Three types of "missing":
# 1. Empty string
# 2. "N/A" string
# 3. "Not Rated" string

This is why automated profiling misses things!
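
Once you know every disguise that "missing" wears, collapse them into a single representation (a pandas sketch using the three found above):

import pandas as pd

df = pd.read_csv("movies.csv", dtype={"rating": str}, keep_default_na=False)

# Collapse all three representations into one true missing value
df["rating"] = df["rating"].replace(["", "N/A", "Not Rated"], pd.NA)
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")

print(df["rating"].isna().sum(), "missing ratings after normalization")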

Profiling Summary: Movies Dataset

| Column  | Type  | Nulls | Unique | Issues                          |
| ------- | ----- | ----- | ------ | ------------------------------- |
| title   | Text  | 0     | 987    | 13 duplicates                   |
| year    | Int   | 13    | 85     | 1920-2024 range                 |
| rating  | Float | 108   | 78     | "N/A", empty, "Not Rated"       |
| revenue | Int   | 366   | 634    | 36% missing!                    |
| genre   | Text  | 0     | 23     | Multi-value ("Action, Drama")   |

Key findings:

  1. Revenue is missing for 1/3 of movies
  2. Rating has multiple representations of "missing"
  3. There are 13 duplicate titles
  4. Genre contains multiple values in one field

Part 7: Schema Validation

Contracts for your data

What is a Schema?

Schema = A formal description of expected data structure.

| Schema Defines | Examples                                 |
| -------------- | ---------------------------------------- |
| Field names    | What columns/keys should exist?          |
| Data types     | String, integer, float, boolean, array?  |
| Constraints    | Required? Min/max? Pattern? Enum?        |
| Relationships  | References to other data?                |

Analogy: A schema is like a contract between data producer and consumer.

Schema: The Blueprint Analogy

Schema = Blueprint: Before construction, everyone agrees on what to build. The blueprint defines structure - and the building must match.

| Without Schema                | With Schema                 |
| ----------------------------- | --------------------------- |
| Builder guesses what's needed | Clear expectations upfront  |
| Can't verify if correct       | Automatic verification      |
| Inconsistent decisions        | Everyone builds the same    |
| Problems found when it breaks | Problems caught early       |

Why Schemas Matter

Without schema:

# What is this?
data = {"yr": 2010, "rt": "8.8", "ttl": "Inception"}
# Who knows what yr means? Is rt a string or should it be float?

With schema:

# Clear expectations
schema = {
    "title": {"type": "string", "required": True},
    "year": {"type": "integer", "minimum": 1880, "maximum": 2030},
    "rating": {"type": "number", "minimum": 0, "maximum": 10}
}

Schemas enable:

  • Automatic validation
  • Documentation
  • Code generation
  • Early error detection

JSON Schema: The Standard

JSON Schema is a vocabulary for validating JSON data.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "minLength": 1
    },
    "year": {
      "type": "integer",
      "minimum": 1880,
      "maximum": 2030
    },
    "rating": {
      "type": "number",
      "minimum": 0,
      "maximum": 10
    }
  },
  "required": ["title", "year"]
}

JSON Schema: Type Keywords

| Keyword           | Valid Values    |
| ----------------- | --------------- |
| "type": "string"  | "hello", ""     |
| "type": "integer" | 42, -1, 0       |
| "type": "number"  | 3.14, 42, -1.5  |
| "type": "boolean" | true, false     |
| "type": "null"    | null            |
| "type": "array"   | [1, 2, 3], []   |
| "type": "object"  | {"a": 1}, {}    |

Multiple types:

{"type": ["string", "null"]}   // String or null

JSON Schema: String Constraints

{
  "type": "string",
  "minLength": 1,              // At least 1 character
  "maxLength": 100,            // At most 100 characters
  "pattern": "^[A-Z].*$",      // Must start with uppercase
  "format": "email"            // Must be valid email
}

Common formats:

  • "email" - Email address
  • "date" - ISO 8601 date (2010-07-16)
  • "date-time" - ISO 8601 datetime
  • "uri" - Valid URI
  • "uuid" - UUID format

JSON Schema: Number Constraints

{
  "type": "number",
  "minimum": 0,                // >= 0
  "maximum": 10,               // <= 10
  "exclusiveMinimum": 0,       // > 0
  "exclusiveMaximum": 10,      // < 10
  "multipleOf": 0.1            // Must be multiple of 0.1
}

Example for rating:

{
  "type": "number",
  "minimum": 0,
  "maximum": 10,
  "multipleOf": 0.1
}

JSON Schema: Arrays

{
  "genres": {
    "type": "array",
    "items": {"type": "string"},
    "minItems": 1,
    "maxItems": 10,
    "uniqueItems": true
  }
}

| Keyword     | Meaning                 |
| ----------- | ----------------------- |
| items       | Schema for each element |
| minItems    | Minimum array length    |
| maxItems    | Maximum array length    |
| uniqueItems | No duplicates allowed   |

JSON Schema: Enums

Restrict to specific values:

{
  "rated": {
    "type": "string",
    "enum": ["G", "PG", "PG-13", "R", "NC-17", "Not Rated"]
  }
}

Validation result:

  • "PG-13" - Valid
  • "PG13" - Invalid (not in enum)
  • "M" - Invalid (not in enum)

JSON Schema: Required Fields

{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "year": {"type": "integer"},
    "rating": {"type": "number"},
    "revenue": {"type": "integer"}
  },
  "required": ["title", "year"]    // Only title and year are required
}

Validation:

  • {"title": "X", "year": 2010} - Valid (rating, revenue optional)
  • {"title": "X"} - Invalid (missing required field: year)

Complete Movie Schema Example

// lecture-demos/week02/data/movie_schema.json
{
  "type": "object",
  "properties": {
    "title": {"type": "string", "minLength": 1},
    "year": {"type": "integer", "minimum": 1888, "maximum": 2030},
    "rating": {"type": ["number", "null"], "minimum": 0, "maximum": 10},
    "genres": {"type": "array", "items": {"type": "string"}},
    "rated": {"enum": ["G", "PG", "PG-13", "R", "NC-17", "Not Rated"]}
  },
  "required": ["title", "year"]
}

Full schema: cat data/movie_schema.json | jq .

Validating with Python

# 04_json_schema_validation.py

from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer", "minimum": 1880}
    },
    "required": ["title", "year"]
}

movie = {"title": "Inception", "year": 2010}

try:
    validate(instance=movie, schema=schema)
    print("Valid!")
except ValidationError as e:
    print(f"Invalid: {e.message}")

Schema-First Development

Traditional approach:

  1. Collect data
  2. Write code to process it
  3. Discover problems in production

Schema-first approach:

  1. Define schema (contract)
  2. Validate data against schema on ingestion
  3. Reject invalid data early
  4. Process only valid data

Part 8: Pydantic

Pythonic data validation

Why Pydantic?

JSON Schema limitations:

  • Separate from your Python code
  • No IDE autocompletion
  • Manual validation calls
  • Verbose error handling

Pydantic advantages:

  • Uses Python type hints (you already know this!)
  • Automatic validation on object creation
  • IDE support (autocomplete, type checking)
  • Clear, readable error messages
  • Used by FastAPI, LangChain, and many modern libraries

Pydantic: Basic Model

# 05_pydantic_basics.py

from pydantic import BaseModel

class Movie(BaseModel):
    title: str
    year: int
    rating: float

# Valid data - works!
movie = Movie(title="Inception", year=2010, rating=8.8)
print(movie.title)  # "Inception"
print(movie.year)   # 2010 (as int, not string!)

Key insight: Just define a class with type hints. Pydantic does the rest.

Pydantic: The Immigration Officer Analogy

Think of Pydantic like an immigration officer: Before entering the country (your code), your documents (data) are checked. Wrong passport type? Rejected. Missing visa? Rejected. Once you're through, everyone inside is guaranteed to have valid documents.

class Movie(BaseModel):  # <- The document checklist
    title: str           # Must have a title (like name on passport)
    year: int            # Must be a valid year (like birth date)
    rating: float        # Must have a rating (like visa number)

# Immigration check happens at entry (object creation)
movie = Movie(**raw_data)  # <- Validation happens HERE

# Once inside, you're guaranteed valid
print(movie.year + 1)  # Safe - year is definitely an int

No more "is this a string or int?" questions inside your code.

Pydantic: Automatic Type Coercion

# Pydantic converts types automatically when possible
movie = Movie(title="Inception", year="2010", rating="8.8")
print(movie.year)    # 2010 (converted from string to int)
print(movie.rating)  # 8.8 (converted from string to float)

# But invalid conversions fail
movie = Movie(title="Inception", year="not a year", rating=8.8)
# ValidationError: Input should be a valid integer

Principle: Be strict about structure, flexible about representation.
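
If you'd rather reject "2010" outright, Pydantic v2 also offers strict mode, which disables coercion entirely (a sketch, assuming Pydantic v2):

from pydantic import BaseModel, ConfigDict, ValidationError

class StrictMovie(BaseModel):
    model_config = ConfigDict(strict=True)  # no type coercion at all

    title: str
    year: int

try:
    StrictMovie(title="Inception", year="2010")  # string year now rejected
except ValidationError as e:
    print(e.error_count(), "validation error")   # 1 validation error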

Pydantic: Validation Errors

from pydantic import ValidationError

# assumes title: str = Field(min_length=1), as shown on the next slide
try:
    movie = Movie(title="", year=2010, rating=8.8)
except ValidationError as e:
    print(e)
1 validation error for Movie
title
  String should have at least 1 character [type=string_too_short]

Errors are clear: Field name, what's wrong, and why.

Pydantic: Field Constraints

# 05_pydantic_basics.py

from pydantic import BaseModel, Field

class Movie(BaseModel):
    title: str = Field(min_length=1)
    year: int = Field(ge=1880, le=2030)  # ge = greater or equal
    rating: float = Field(ge=0, le=10)
    revenue: int | None = None  # Optional field

Movie(title="X", year=1850, rating=8.0)
# ValidationError: year - Input should be >= 1880

Pydantic: Optional and Default Values

from pydantic import BaseModel
from typing import Optional

class Movie(BaseModel):
    title: str
    year: int
    rating: Optional[float] = None      # Can be None
    genres: list[str] = []              # Default empty list
    is_released: bool = True            # Default value
movie = Movie(title="Tenet", year=2020)
print(movie.rating)      # None
print(movie.genres)      # []
print(movie.is_released) # True

Pydantic vs JSON Schema

| Aspect         | JSON Schema          | Pydantic             |
| -------------- | -------------------- | -------------------- |
| Language       | JSON (separate file) | Python (in your code)|
| Type hints     | No                   | Yes                  |
| IDE support    | Limited              | Full autocomplete    |
| Validation     | Manual call          | Automatic on create  |
| Error messages | Technical            | Human-readable       |
| Learning curve | New syntax           | Just Python          |

Recommendation: Use Pydantic for Python projects, JSON Schema for APIs/cross-language.

Pydantic: The Mental Model

The Three-Step Workflow

| Step      | Code                         | What Happens                        |
| --------- | ---------------------------- | ----------------------------------- |
| 1. DEFINE | class Movie(BaseModel): ...  | Declare your schema with type hints |
| 2. CREATE | movie = Movie(**raw_data)    | Validation happens automatically    |
| 3. USE    | movie.title, movie.year + 1  | Data is guaranteed valid            |

At step 2, one of two things happens:

  • Valid data → Object created, ready to use
  • Invalid data → ValidationError raised immediately

Pydantic: Practical Example

# 05_pydantic_basics.py - MovieFromAPI class

from pydantic import BaseModel, Field
from typing import Optional

class MovieFromAPI(BaseModel):
    """Validates movie data from OMDB API."""
    Title: str = Field(min_length=1)
    Year: str  # API returns string, we'll convert later
    imdbRating: Optional[str] = None
    BoxOffice: Optional[str] = None

# Parse API response - validation happens automatically
raw = {"Title": "Inception", "Year": "2010", "imdbRating": "8.8"}
movie = MovieFromAPI(**raw)  # Works!

raw_bad = {"Title": "", "Year": "2010"}
movie = MovieFromAPI(**raw_bad)  # ValidationError!

What We'll Cover in Lab

Pydantic deep dive:

  • Nested models (Movie with Director, Actors)
  • Custom validators (@validator decorator)
  • Parsing JSON files with Pydantic
  • Model serialization (.model_dump(), .model_dump_json())
  • Strict mode vs coercion mode

The lab is where you'll get hands-on practice!

Part 9: Encoding & Edge Cases

When text isn't just text

The Encoding Problem

Computers store text as numbers. But which numbers?

Character 'A' = 65 (ASCII)
Character 'e' with accent = ??? (depends on encoding!)

Encoding = The mapping between characters and bytes.

| Encoding     | Characters     | Use Case                      |
| ------------ | -------------- | ----------------------------- |
| ASCII        | 128            | English only                  |
| Latin-1      | 256            | Western European              |
| UTF-8        | 1,112,064      | Everything (modern standard)  |
| UTF-16       | Same as UTF-8  | Different byte format         |
| Windows-1252 | 256            | Microsoft's Latin-1 variant   |

UTF-8: The Modern Standard

UTF-8 is the dominant encoding for the web and modern systems.

Why UTF-8?

  • Backwards compatible with ASCII
  • Supports all languages
  • Variable length (1-4 bytes per character)
  • Self-synchronizing
# Check file encoding
$ file movies.csv
movies.csv: UTF-8 Unicode text

$ file old_data.csv
old_data.csv: ISO-8859-1 text

Encoding Problems in Practice

What you expect:

Amélie
Crouching Tiger, Hidden Dragon (Chinese title)

What you get:

AmÃ©lie                      <- UTF-8 decoded as Latin-1
Crouching Tiger (????????)   <- Wrong encoding

Common scenarios:

  1. File saved in one encoding, read in another
  2. Copy-paste from web with different encoding
  3. Database with mixed encodings
  4. Legacy systems using old encodings

Detecting Encoding

# The file command guesses encoding
$ file -i movies.csv
movies.csv: text/plain; charset=utf-8

# For more accuracy, use chardet (Python)
$ pip install chardet
$ chardetect movies.csv
movies.csv: utf-8 with confidence 0.99

# Or with Python
$ python -c "import chardet; print(chardet.detect(open('movies.csv','rb').read()))"
{'encoding': 'utf-8', 'confidence': 0.99}

Converting Encodings

# Convert from Latin-1 to UTF-8
$ iconv -f ISO-8859-1 -t UTF-8 old_file.csv > new_file.csv

# Convert from Windows-1252 to UTF-8
$ iconv -f WINDOWS-1252 -t UTF-8 windows_file.csv > utf8_file.csv

# List available encodings
$ iconv -l

Python approach:

# Read with specific encoding
with open('file.csv', encoding='latin-1') as f:
    content = f.read()

# Write as UTF-8
with open('file_utf8.csv', 'w', encoding='utf-8') as f:
    f.write(content)

CSV Edge Cases: Quoting

What if your data contains commas?

title,year,description
Inception,2010,A mind-bending, complex thriller    <- WRONG! Extra column
"Inception",2010,"A mind-bending, complex thriller" <- Correct: quoted

What if your data contains quotes?

title,year,tagline
Say "Hello",2020,A movie about "greetings"   <- WRONG!
"Say ""Hello""",2020,"A movie about ""greetings""" <- Correct: escaped

Rule: Fields with commas, quotes, or newlines must be quoted.
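
Python's csv module applies these quoting rules for you on both write and read (a minimal sketch):

import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['Say "Hello"', 2020, 'About "greetings", sort of'])
print(buf.getvalue())
# "Say ""Hello""",2020,"About ""greetings"", sort of"

row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)  # ['Say "Hello"', '2020', 'About "greetings", sort of']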

CSV Edge Cases: Line Endings

Different systems use different line endings:

| System         | Line Ending | Bytes             |
| -------------- | ----------- | ----------------- |
| Unix/Linux/Mac | LF          | \n (0x0A)         |
| Windows        | CRLF        | \r\n (0x0D 0x0A)  |
| Old Mac        | CR          | \r (0x0D)         |

Problems occur when mixing:

# Detect line endings
$ file data.csv
data.csv: ASCII text, with CRLF line terminators

# Convert Windows to Unix
$ sed -i 's/\r$//' data.csv
# Or
$ dos2unix data.csv

CSV Edge Cases: Multiline Values

Values can contain newlines (if quoted):

title,year,plot
"Inception",2010,"A thief who steals corporate secrets through dream-sharing
technology is given the inverse task of planting an idea into the mind
of a C.E.O."
"Avatar",2009,"A paraplegic Marine..."

This is valid CSV! But many simple parsers break.

Solution: Use proper CSV parsers (pandas, csvkit), not line-by-line reading.

CSV Edge Cases: Empty vs Null

What does this mean?

title,year,rating
Inception,2010,8.8
Avatar,2009,
The Room,2003,""

| Row | rating value | Interpretation          |
| --- | ------------ | ----------------------- |
| 1   | 8.8          | Rating is 8.8           |
| 2   | (nothing)    | Rating is null/missing  |
| 3   | ""           | Rating is empty string  |

Is empty string the same as null? Depends on your interpretation!
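
pandas normally collapses both into NaN on read; if the decision matters, disable the default conversion and handle empties yourself (a sketch):

import io
import pandas as pd

raw = 'title,year,rating\nInception,2010,8.8\nAvatar,2009,\nThe Room,2003,""\n'

# Default: the empty field and "" both silently become NaN
print(pd.read_csv(io.StringIO(raw))["rating"].tolist())  # [8.8, nan, nan]

# keep_default_na=False keeps them as empty strings - the call is yours
df = pd.read_csv(io.StringIO(raw), keep_default_na=False)
print(df["rating"].tolist())  # ['8.8', '', '']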

Handling Edge Cases: Best Practices

1. Always specify encoding explicitly:

pd.read_csv('file.csv', encoding='utf-8')

2. Use proper CSV parsers:

# Good - handles quoted commas
import csv
with open('file.csv') as f:
    reader = csv.reader(f)

# Bad - breaks on "Action, Drama"
fields = line.split(',')

Handling Edge Cases: Validation

3. Validate after reading:

assert df['year'].dtype == 'int64', "Year should be integer"
assert df['rating'].between(0, 10).all(), "Rating out of range"

4. Handle missing values explicitly:

# Don't guess - be explicit
df['year'] = pd.to_numeric(df['year'], errors='coerce')
missing_count = df['year'].isna().sum()
print(f"Converted {missing_count} invalid years to NaN")

Part 10: Validation Principles

Best practices for data quality

Principle 1: Validate at the Boundary

Check data when it enters your system, not later.

External Data → Validation Layer → Your System

Invalid data rejected

Why?

  • Invalid data doesn't spread through your system
  • Easier to debug (you know exactly where it failed)
  • Clear separation of concerns

Principle 2: Fail Fast

Stop immediately when you find invalid data.

# Bad: Continue and hope for the best
for movie in movies:
    try:
        process(movie)
    except:
        pass  # Silent failure!

# Good: Fail fast and loud
for movie in movies:
    validate(movie)  # Raises exception if invalid
    process(movie)

Benefits:

  • Find problems early
  • Don't waste time processing bad data
  • Easier debugging

Principle 3: Be Explicit About Missing Data

Don't guess. Document and handle explicitly.

# Bad: Implicit handling
rating = movie.get('rating', 0)  # Is 0 a valid rating or missing?

# Good: Explicit handling
rating = movie.get('rating')
if rating is None:
    raise ValidationError("Rating is required")
# Or
if rating is None:
    rating = DEFAULT_RATING  # Explicitly documented default

Principle 4: Validate Types AND Values

Type checking isn't enough.

# Type is correct (integer), but value is invalid
year = -500     # Negative year
year = 9999     # Far future
year = 1066     # Before cinema existed

# Need both type AND range validation
def validate_year(year):
    if not isinstance(year, int):
        raise TypeError("Year must be integer")
    if year < 1880 or year > 2030:
        raise ValueError(f"Year {year} out of valid range")

Principle 5: Log Validation Failures

Keep records of what failed and why.

import logging

def validate_movies(movies):
    valid = []
    for i, movie in enumerate(movies):
        try:
            validate(movie)
            valid.append(movie)
        except ValidationError as e:
            logging.warning(f"Row {i}: {e.message} - {movie}")

    logging.info(f"Validated {len(valid)}/{len(movies)} movies")
    return valid

Why?

  • Understand data quality trends
  • Debug upstream issues
  • Audit trail

Principle 6: Separate Validation from Cleaning

Two different operations:

| Validation              | Cleaning            |
| ----------------------- | ------------------- |
| Checks if data is valid | Fixes invalid data  |
| Returns true/false      | Modifies data       |
| Should not modify       | Requires decisions  |
| Objective               | Subjective          |

# Validation: Does it pass?
def is_valid_year(year):
    return isinstance(year, int) and 1880 <= year <= 2030

# Cleaning: Make it pass
def clean_year(year_str):
    return int(year_str.strip())

Principle 7: Test Your Validation

Validation code needs tests too!

def test_year_validation():
    # Valid cases
    assert validate_year(2010) == True
    assert validate_year(1880) == True  # Boundary
    assert validate_year(2030) == True  # Boundary

    # Invalid cases
    assert validate_year(1879) == False  # Just below
    assert validate_year(2031) == False  # Just above
    assert validate_year("2010") == False  # Wrong type
    assert validate_year(None) == False   # Null

Edge cases are where bugs hide!

Common Validation Mistakes

Mistakes that let bad data slip through:

| Mistake             | Example                | Better Approach                        |
| ------------------- | ---------------------- | -------------------------------------- |
| Only checking type  | isinstance(x, int)     | Also check range: 0 < x < 1000         |
| Trusting "not None" | if value:              | Empty string "" is falsy but not None  |
| Case sensitivity    | if status == "active"  | if status.lower() == "active"          |
| Whitespace          | if name == "John"      | if name.strip() == "John"              |
| Encoding            | Reading UTF-8 as ASCII | Always specify encoding                |
| Off-by-one          | year < 2024            | Should it be <= 2024?                  |

Rule of thumb: If something CAN go wrong, it WILL. Validate defensively.

Part 11: Building a Validation Pipeline

Putting it all together

The Validation Pipeline

Ingest → Inspect → Validate → Clean

Reject invalid records

| Stage       | Action                 | Tools                     |
| ----------- | ---------------------- | ------------------------- |
| 1. Ingest   | Load raw data          | curl, requests            |
| 2. Inspect  | Profile and understand | jq, csvstat, pandas       |
| 3. Validate | Check against rules    | JSON Schema, Pydantic     |
| 4. Clean    | Fix and transform      | pandas, custom functions  |

Stage 1: Ingest

# Download or receive data
curl -o movies_raw.json "$API_URL"

# Check what we got
file movies_raw.json
wc -l movies_raw.json
head movies_raw.json | jq .

# Load with explicit encoding
import json
with open('movies_raw.json', encoding='utf-8') as f:
    movies = json.load(f)
print(f"Loaded {len(movies)} movies")

Stage 2: Inspect and Profile

# Quick profile with jq
cat movies_raw.json | jq 'length'                           # Count
cat movies_raw.json | jq '[.[].year] | unique | sort'       # Year range
cat movies_raw.json | jq '[.[].rating | select(. == null)] | length'  # Null ratings

# Or with Python/pandas
df = pd.DataFrame(movies)
print(df.info())
print(df.describe())
print(df.isnull().sum())

Stage 3: Validate - Define Schema

# 07_validation_pipeline.py - CleanMovie schema

from pydantic import BaseModel, Field
from typing import Optional, List

class CleanMovie(BaseModel):
    title: str = Field(..., min_length=1)
    year: int = Field(..., ge=1888, le=2030)
    rating: Optional[float] = Field(None, ge=0, le=10)
    revenue: Optional[int] = Field(None, ge=0)
    runtime_minutes: Optional[int] = None
    genres: List[str] = []

Stage 3: Validate - Run Validation

# 07_validation_pipeline.py - validate_batch method

valid_movies = []
invalid_movies = []

for i, raw in enumerate(data):
    try:
        cleaned = transform_movie(raw)  # Transform first
        movie = CleanMovie(**cleaned)   # Validate with Pydantic
        valid_movies.append(movie)
    except (ValidationError, ValueError) as e:
        invalid_movies.append({'index': i, 'raw_data': raw, 'error': str(e)})

print(f"Valid: {len(valid_movies)}, Invalid: {len(invalid_movies)}")

Stage 4: Clean and Transform

# 07_validation_pipeline.py - transform_movie function

def transform_movie(raw: dict) -> dict:
    """Transform raw API data to clean format."""
    return {
        'title': raw.get('Title', raw.get('title', '')),
        'year': clean_year(raw.get('Year', raw.get('year'))),
        'rating': clean_rating(raw.get('imdbRating', raw.get('rating'))),
        'revenue': clean_revenue(raw.get('BoxOffice')),
        'runtime_minutes': clean_runtime(raw.get('Runtime')),
        'genres': clean_genres(raw.get('Genre')),
    }

Stage 4: Helper Functions

# 07_validation_pipeline.py - cleaning functions

import re

def clean_revenue(value):
    """Convert '$292,576,195' to 292576195"""
    if value is None or value == '' or value == 'N/A':
        return None
    cleaned = int(str(value).replace('$', '').replace(',', ''))
    return cleaned if cleaned >= 0 else None

def clean_runtime(value):
    """Convert '148 min' to 148"""
    if value is None or value == '' or value == 'N/A':
        return None
    match = re.search(r'(\d+)', str(value))
    return int(match.group(1)) if match else None
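
transform_movie also calls clean_year, clean_rating, and clean_genres; they aren't shown on the slide, but plausible implementations follow the same pattern (a sketch, not the script verbatim):

def clean_year(value):
    """Convert '2010' to 2010; 'N/A'/empty/unparseable to None."""
    if value is None or value == '' or value == 'N/A':
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def clean_rating(value):
    """Convert '8.8' to 8.8; anything unparseable to None."""
    if value is None or value == '' or value == 'N/A':
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def clean_genres(value):
    """Split 'Action, Adventure, Sci-Fi' into a list of strings."""
    if value is None or value == '' or value == 'N/A':
        return []
    return [g.strip() for g in str(value).split(',') if g.strip()]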

Complete Pipeline Script (Part 1)

#!/bin/bash
# validate_movies.sh

INPUT=$1
OUTPUT_VALID="movies_valid.json"
OUTPUT_INVALID="movies_invalid.json"

echo "=== Stage 1: Ingest ==="
echo "Input file: $INPUT"
file "$INPUT"
cat "$INPUT" | jq 'length'

Complete Pipeline Script (Part 2)

echo -e "\n=== Stage 2: Profile ==="
echo "Null years: $(cat "$INPUT" | jq '[.[].year | select(. == null)] | length')"
echo "Null ratings: $(cat "$INPUT" | jq '[.[].rating | select(. == null)] | length')"

echo -e "\n=== Stage 3: Validate ==="
python validate.py "$INPUT" "$OUTPUT_VALID" "$OUTPUT_INVALID"

echo -e "\n=== Stage 4: Summary ==="
echo "Valid records: $(cat $OUTPUT_VALID | jq 'length')"
echo "Invalid records: $(cat $OUTPUT_INVALID | jq 'length')"

Pipeline Output

=== Stage 1: Ingest ===
Input file: movies_raw.json
movies_raw.json: JSON data, UTF-8 Unicode text
1000

=== Stage 2: Profile ===
Null years: 13
Null ratings: 108

=== Stage 3: Validate ===
Processing 1000 movies...
Valid: 879, Invalid: 121

=== Stage 4: Summary ===
Valid records: 879
Invalid records: 121

Validation complete. Check movies_invalid.json for details.

Back to Netflix: Cleaned Data

# Before cleaning
{"Title": "Inception", "Year": "2010", "imdbRating": "8.8",
 "BoxOffice": "$292,576,195", "Genre": "Action, Adventure, Sci-Fi"}

# After pipeline
{"title": "Inception", "year": 2010, "rating": 8.8,
 "revenue": 292576195, "genres": ["Action", "Adventure", "Sci-Fi"]}

Now we can train our model!

df = pd.DataFrame(cleaned_movies)
X = df[['year', 'rating']]  # Numeric columns
y = df['revenue']
model.fit(X, y)  # Works!

Part 12: Looking Ahead

Lab preview and next week

This Week's Lab

Hands-on Practice:

  1. Unix inspection - head, tail, wc, file, sort, uniq
  2. jq exercises - JSON querying and transformation
  3. CSVkit - Profile and query CSV files
  4. Pydantic deep dive - Nested models, custom validators
  5. Build a pipeline - End-to-end validation of messy data

Goal: Take raw messy data and produce clean validated dataset.

Lab Dataset

You'll receive:

  • movies_raw.json - 1000 movies with various quality issues
  • schema.json - Partial schema (you'll complete it)

Issues to find and fix:

  • Missing values (null, "N/A", empty string)
  • Wrong types (numbers as strings)
  • Duplicates
  • Inconsistent formats
  • Outliers

Next Week Preview

Week 3: Data Labeling

  • Why labeling is the bottleneck
  • Labeling tools and platforms
  • Quality control for labels
  • Inter-annotator agreement
  • Managing labeling projects

The data we cleaned now needs labels for ML!

Interview Questions

Common interview questions on data validation:

  1. "How would you handle missing values in a dataset?"

    • Identify types of missingness (MCAR, MAR, MNAR)
    • Strategies: deletion, imputation, flagging
    • Context matters: dropping vs filling depends on data and use case
  2. "What's the difference between validation and cleaning?"

    • Validation: checking if data meets rules (returns true/false)
    • Cleaning: transforming data to meet rules (modifies data)
    • Validation should come first to understand the problems

Key Takeaways

  1. Look before you process - Never trust raw data
  2. Know your enemy - Understand types of data problems
  3. Tools matter - jq, CSVkit, Pydantic save hours
  4. Schema-first - Define expectations before processing
  5. Validate at the boundary - Catch problems early
  6. Fail fast - Don't propagate bad data
  7. Use Pydantic - Pythonic validation with type hints

Resources

Tools:

Practice:

Questions?

Thank You!

See you in the lab!

YouTube videos for reference:

  • NASA Mars Orbiter: "Metric vs Imperial" by Everyday Astronaut
  • Knight Capital: youtube.com/watch?v=7BKNnpJfWII

DEMO: lecture-demos/week02/01_unix_inspection.sh

DEMO: lecture-demos/week02/02_jq_basics.sh

DEMO: lecture-demos/week02/03_csvkit_demo.sh

DEMO: lecture-demos/week02/06_data_profiling.py

DEMO: lecture-demos/week02/04_json_schema_validation.py

DEMO: lecture-demos/week02/05_pydantic_basics.py

DEMO: lecture-demos/week02/07_validation_pipeline.py