Problem	`X`	`Y`
House pricing	size, location, age	price
Spam detection	words in email	spam / not spam
Digit recognition	pixel values	digit 0-9

Stage	What happens
Training	We learn patterns from old labeled data
Deployment	The model sees new real-world inputs
Monitoring	We check whether the new world still looks like the old one

What changed?	Plain English
Inputs changed	the `X` values look different
Rule changed	the correct answer for the same input changed
Outcome mix changed	some labels became much more or less common

Type	One-line meaning	Ask this question
Data drift	Inputs changed	"Do the incoming `X` values still look like training data?"
Concept drift	The rule changed	"For the same `X`, is the correct `Y` now different?"
Label drift	Outcome mix changed	"Are some labels much more common now?"

Question	Answer
Did the task change?	No, we still predict car price
Did the correct relationship change?	No, more km still means lower price
Did the input range change?	Yes
Does the model now see unfamiliar inputs?	Yes

Metric	Clean test	Drifted test
`R^2`	`0.89`	`-10.3`
`MAE`	`Rs 0.28L`	`Rs 2.26L`
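Both metrics in this table can be computed by hand. The sketch below is a minimal illustration in plain Python. The prices and predictions are hypothetical values, not the lecture's dataset. They are chosen only to show that `R^2` turns negative once the predictions are worse than always guessing the mean price.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average size of the miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot. Negative when worse than predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical prices in Rs lakh (illustration only).
true_prices = [2.0, 2.5, 3.0, 3.5]
close_preds = [2.1, 2.4, 3.1, 3.4]  # predictions on familiar inputs
far_preds = [5.0, 5.5, 6.0, 6.5]    # predictions on unfamiliar inputs

clean_r2, clean_mae = r2(true_prices, close_preds), mae(true_prices, close_preds)
drift_r2, drift_mae = r2(true_prices, far_preds), mae(true_prices, far_preds)

print(f"familiar inputs:   R^2 = {clean_r2:.2f}, MAE = {clean_mae:.2f}")
print(f"unfamiliar inputs: R^2 = {drift_r2:.2f}, MAE = {drift_mae:.2f}")
```

A negative `R^2` is not a bug in the metric. It means the squared error of the model is larger than the squared error of a constant guess at the mean.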

Training message	Likely label
"Dear team, please find the invoice attached."	not spam
"Reminder: meeting moved to 3 PM."	not spam
"Congratulations, you won a free iPhone. Click now."	spam

Production message	Why this is tricky
"send ppt asap"	short, informal, almost no familiar office words
"bro claim reward fast"	slang-heavy, different spam style
"meeting at 3? send loc"	same intent as email, very different wording

Domain	Training world	Production world	What changed in `X`?
OCR / document AI	flat scanned forms	phone photos of folded forms	lighting, blur, perspective
Speech recognition	quiet office audio	classroom or railway-station audio	noise, echo, microphone quality
Retail demand forecasting	normal weekdays	festival season	basket size, time-of-day, product mix
Loan scoring	salaried urban users	rural self-employed users	income pattern, transaction pattern, document quality

Feature	Drifted?	Why?
mileage	Yes	more older cars are arriving
car age	Yes	the dealership now receives older models
brand	Maybe not	the same brands may still dominate
fuel type	Maybe not	petrol vs diesel mix may stay similar

Domain	Same input	Old label	New label	What changed in the world?
Customer scoring	4 orders/week, `Rs 500` spend	premium	regular	subscription launch changed user behavior
Fraud detection	late-night purchase at a new merchant	suspicious	often normal	quick-commerce habits became common
Review helpfulness	"delivery in 2 days"	not helpful	helpful	delays changed what users valued

Quantity	Training	Production
Fraud rate	`1%`	`5%`
Non-fraud rate	`99%`	`95%`

Score `p`	Old action with cutoff `0.80`	Plain reading
`0.92`	review	clearly risky
`0.78`	auto-approve	suspicious, but below the old cutoff
`0.71`	auto-approve	suspicious, but below the old cutoff
`0.30`	auto-approve	probably normal
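One way to see why the old cutoff becomes too strict is to re-weight each score for the new fraud rate. The sketch below is a minimal illustration under two loudly stated assumptions: the score `p` is a calibrated probability at the training fraud rate of `1%`, and only the fraud rate changed (to `5%`), not what fraud looks like. The function names are hypothetical, and this is one common correction, not necessarily the one used in the lecture.

```python
def adjust_for_new_base_rate(p, old_rate, new_rate):
    """Re-weight a calibrated score when only the class base rate changes."""
    pos = p * (new_rate / old_rate)
    neg = (1 - p) * ((1 - new_rate) / (1 - old_rate))
    return pos / (pos + neg)


def action(score, cutoff=0.80):
    """Route a transaction using the cutoff from the table."""
    return "review" if score >= cutoff else "auto-approve"


scores = [0.92, 0.78, 0.71, 0.30]

# Assumption: fraud rate moved from 1% (training) to 5% (production).
adjusted = [adjust_for_new_base_rate(p, old_rate=0.01, new_rate=0.05) for p in scores]
old_actions = [action(p) for p in scores]
new_actions = [action(p) for p in adjusted]

for p, q, old, new in zip(scores, adjusted, old_actions, new_actions):
    print(f"p = {p:.2f} -> adjusted {q:.2f}: {old} -> {new}")
```

Under these assumptions the scores `0.78` and `0.71` move above `0.80` and would be sent for review, while `0.30` stays below the cutoff.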

Domain	Training world	Production world	What became more common?
Fraud detection	ordinary months	festive scam wave	fraud cases
App support triage	stable release	buggy payment release	payment complaints
Review sentiment	old product version	improved redesign	positive reviews
Medical screening	normal district	local outbreak zone	positive cases

Type	What changed?	Notation	Main signal to watch
Data drift	inputs `X`	`P(X)`	feature distributions
Concept drift	rule from `X -> Y`	`P(Y given X)`	labeled performance
Label drift	label frequencies	`P(Y)`	class proportions

Symptom	More likely drift	More likely bug
feature mean moves slowly over weeks	Yes	No
new kinds of users appear gradually	Yes	No
a feature suddenly becomes all zeros	No	Yes
a currency value becomes 100x larger overnight	No	Yes
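The two "more likely bug" rows can be turned into cheap automatic checks. The sketch below is a rough illustration. The function name, the `scale_jump` threshold of `50`, and the sample values are all hypothetical. It also looks at one snapshot only, so it cannot tell "overnight" from "over weeks" by itself.

```python
def pipeline_red_flags(train_values, prod_values, scale_jump=50.0):
    """Return symptoms that look more like a pipeline bug than real drift."""
    flags = []
    # A feature that was non-zero in training but is now all zeros.
    if all(v == 0 for v in prod_values) and any(v != 0 for v in train_values):
        flags.append("all zeros")
    # A mean that is many times larger, e.g. a unit or currency mix-up.
    train_mean = sum(train_values) / len(train_values)
    prod_mean = sum(prod_values) / len(prod_values)
    if train_mean > 0 and prod_mean / train_mean >= scale_jump:
        flags.append("scale jump")
    return flags


# Hypothetical order amounts (illustration only).
train_amounts = [500, 520, 480]

flags_scaled = pipeline_red_flags(train_amounts, [50000, 52000, 48000])
flags_zeros = pipeline_red_flags(train_amounts, [0, 0, 0])
flags_normal = pipeline_red_flags(train_amounts, [510, 530, 495])

print(flags_scaled, flags_zeros, flags_normal)
```

An empty list does not prove there is no drift. It only says these two bug patterns were not seen.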

New Sample A	New Sample B	Mean gap
`24, 25, 25, 26, 24`	`31, 32, 33, 31, 32`	`7.0`
`24, 25, 25, 31, 32`	`26, 24, 33, 31, 32`	`1.8`
`24, 25, 31, 32, 33`	`25, 26, 24, 31, 32`	`1.4`
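The re-split idea behind this table can be checked by brute force. The sketch below pools the ten numbers from the first row, tries every way of choosing five of them for Sample A, and counts how many re-splits give a mean gap at least as large as the observed `7.0`. It assumes a two-sided reading (the gap is taken as an absolute value), and the variable names are illustrative.

```python
from itertools import combinations

sample_a = [24, 25, 25, 26, 24]
sample_b = [31, 32, 33, 31, 32]
pooled = sample_a + sample_b
total = sum(pooled)

# Mean gap = |sum_b - sum_a| / 5, so comparing integer sums avoids float ties.
observed_gap = abs(sum(sample_b) - sum(sample_a))

n_splits = 0
n_as_extreme = 0
for idx in combinations(range(len(pooled)), len(sample_a)):
    sum_a = sum(pooled[i] for i in idx)
    sum_b = total - sum_a
    n_splits += 1
    if abs(sum_b - sum_a) >= observed_gap:
        n_as_extreme += 1

p_value = n_as_extreme / n_splits
print(f"{n_as_extreme} of {n_splits} re-splits are at least as extreme")
print(f"p-value = {p_value:.4f}")
```

A small p-value here means that random re-shuffling almost never produces a gap this large, so the two samples are unlikely to come from the same world.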

Toy example	KS example
two samples of `5` numbers	two samples of many numbers
compute one mean for each sample	compute one CDF for each sample
summarize by the mean gap	summarize by the largest CDF gap `D`
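The "largest CDF gap `D`" in the right-hand column can be computed directly for small samples. The sketch below is a minimal pure-Python version, applied to two of the toy rows from the earlier table. The function name is illustrative. In practice a library routine such as `scipy.stats.ks_2samp` returns both `D` and a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the gap can only change at an observed value
        cdf_a = bisect_right(a, x) / len(a)  # share of A that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of B that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Fully separated samples: every value of A is below every value of B.
d_separated = ks_statistic([24, 25, 25, 26, 24], [31, 32, 33, 31, 32])
# Mixed samples: the two groups overlap heavily.
d_mixed = ks_statistic([24, 25, 25, 31, 32], [26, 24, 33, 31, 32])

print(f"D (separated) = {d_separated:.2f}")
print(f"D (mixed)     = {d_mixed:.2f}")
```

`D` always lies between `0` and `1`. The separated pair reaches the maximum, while the mixed pair stays much smaller.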

Categorical Features Need a Different Test

Suppose one input row looks like this:

X = (city, income, age)
Y = loan approved or rejected

Here we are checking only one input column inside X: the city column.

So we ignore income, age, and even Y for the moment.
We just ask:

"Did the mix of cities in the input data change?"

City	Train count	Production count
Ahmedabad	400	200
Rajkot	350	200
Surat	250	600

This is not a job for the KS test.

Why not?

KS expects numeric values, such as mileage or apartment size.
Ahmedabad, Rajkot, and Surat are categories. They have no natural order, so there is no meaningful CDF to compare.

Here we compare category counts.
The usual tool is the chi-squared (`chi^2`) test.

Use it when the thing you compare is a categorical input feature, such as city, device, or plan type.

For label counts, we will come back to the same idea later.

City	Train	Production	Total
Ahmedabad	`400`	`200`	`600`
Rajkot	`350`	`200`	`550`
Surat	`250`	`600`	`850`
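The chi-squared score for this table can be worked out by hand. The sketch below is a minimal pure-Python version. For each cell it computes the count expected if the city mix had not changed (`row total * column total / grand total`), then adds up `(observed - expected)^2 / expected`. The variable names are illustrative. A library routine such as `scipy.stats.chi2_contingency` does the same job for tables of any size.

```python
from math import exp

# (train count, production count) for each city, from the table above.
observed = {
    "Ahmedabad": (400, 200),
    "Rajkot": (350, 200),
    "Surat": (250, 600),
}

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

chi2 = 0.0
for city, row in observed.items():
    row_total = sum(row)
    for j, count in enumerate(row):
        # Count we would expect if train and production had the same city mix.
        expected = row_total * col_totals[j] / grand_total
        chi2 += (count - expected) ** 2 / expected

dof = (len(observed) - 1) * (2 - 1)  # (rows - 1) * (columns - 1)

# With exactly 2 degrees of freedom, the chi-squared tail is exp(-x / 2).
p_value = exp(-chi2 / 2)

print(f"chi^2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.3g}")
```

The closed-form p-value holds only because this table has exactly 2 degrees of freedom. For other table sizes, use a library function.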

PSI	Interpretation
`< 0.10`	little or no drift
`0.10 - 0.25`	moderate drift
`> 0.25`	significant drift
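PSI (Population Stability Index) compares the share of data in each bin between training and production. The sketch below is a minimal illustration using the common formula `sum((actual - expected) * ln(actual / expected))` over bins. It is applied to the city counts from the earlier table purely as an example. For a numeric feature the bins would be value ranges. The function names and the small `eps` guard for empty bins are assumptions of this sketch.

```python
from math import log


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over matching bins."""
    exp_total = sum(expected_counts)
    act_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / exp_total, eps)  # guard against empty bins
        a_share = max(a / act_total, eps)
        score += (a_share - e_share) * log(a_share / e_share)
    return score


def interpret(score):
    """Map a PSI value to the bands in the table above."""
    if score < 0.10:
        return "little or no drift"
    if score <= 0.25:
        return "moderate drift"
    return "significant drift"


train_counts = [400, 350, 250]  # Ahmedabad, Rajkot, Surat
prod_counts = [200, 200, 600]

city_psi = psi(train_counts, prod_counts)
print(f"PSI = {city_psi:.3f} -> {interpret(city_psi)}")
```

The thresholds are rules of thumb, not statistical guarantees, so treat the label as a prompt to look at the plot.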

Drift type	Mostly watch	Need fresh labels?
Data drift	inputs `X`	No
Concept drift	prediction vs true label	Yes, usually delayed
Label drift	outcomes `Y`	Yes, usually delayed

If this changed	Look here first	Good first tool
one numeric input feature	input `X`	Kolmogorov-Smirnov (KS) or PSI
one categorical input feature	input `X`	chi-squared (`chi^2`) test
many input features together	input `X`	Evidently
rule from `X -> Y`	labeled production predictions	accuracy / precision / recall over time
label frequencies `Y`	production labels	class proportions or chi-squared (`chi^2`)

Scenario	Accuracy
Clean data	`81.5%`
One feature drifted	`74.5%`
All three important features drifted	`83.5%`

Type	What changed?	How to detect it	Typical action
Data drift	`X` changed	KS / PSI / chi-squared / Evidently	retrain with recent inputs
Concept drift	`X -> Y` changed	labeled production metrics	relearn the rule
Label drift	`Y` frequencies changed	class proportion tracking	recalibrate thresholds

Data Drift and Model Monitoring

Week 10: CS 203 - Software Tools and Techniques for AI

By the End of This Lecture

Quick Recap: What a Model Learns

Train World vs Real World

A Model Can Decay Without Any Code Change

What Happened: The World Moved

The Mental Model to Keep in Mind

Three Different Things Can Change

The Three Types of Drift

Type 1: Data Drift

The inputs changed, but the task stayed the same.

Data Drift: First Build the Idea Slowly

Step 1: The Training World

Step 2: Why the Market Changed

Step 3: What the Model Sees Now

What Stayed the Same? What Changed?

Data Drift Visualization

Read This Plot Slowly

The Metrics Tell the Same Story

Data Drift Example: Code, Part 1

Data Drift Example: Code, Part 2

The Same Idea Appears in Images

Images Are Still Input Features

Modern ML Note: Embeddings Can Drift Too

The Same Idea Appears in Text

Modern ML Note: Text Embeddings Can Drift Too

Text Drift: What Training Looked Like

Text Drift: What Production Looks Like Now

Real-World Data Drift Examples

Why Data Drift Happens in Practice

Not Every Feature Has To Drift

Type 2: Concept Drift

The meaning of the input changed.

Concept Drift: The Key Difference

Example Setup: Premium Customers in January

Same Inputs, New Labels Later

Concept Drift Visualization

More Real Concept Drift Examples

Concept Drift Cases: Visual Summary

Why Concept Drift Is the Hardest

Type 3: Label Drift

The outcome mix changed.

Label Drift: What It Means

A Visual Way to Think About Label Drift

Example: Fraud Detection with a Score

Same Model, New Fraud Rate

Same Model, New Fraud Rate: What Goes Wrong?

Label Drift: Read It Slowly

More Real Label Drift Examples

One More Label Drift Example

Compare the Three Types Once More

How to Detect Drift

Start with intuition, then move to tests.

First Question: Did the World Change, or Did the Pipeline Break?

First Question: Quick Clues

Step 0: Always Start With a Plot

Before the Kolmogorov-Smirnov (KS) Test: What Are We Trying to Do?

Before the Kolmogorov-Smirnov (KS) Test: What Is a CDF?

Why Use a CDF for the Kolmogorov-Smirnov (KS) Test?

Kolmogorov-Smirnov (KS) Test: First Compare Two Cases

Kolmogorov-Smirnov (KS) Test: What Is D?

Kolmogorov-Smirnov (KS) Test: How to Read D

Before KS p-values: The General Idea

Tiny Toy Example for p-values

Same Example, Now a Big Gap

How Many Re-splits Are Possible?

A Few Example Re-splits

p-value Intuition from All 252 Re-splits

Exact p-value for the Toy Cases

Exact p-value for the Different Case

Kolmogorov-Smirnov (KS) Test: From Samples to D

Kolmogorov-Smirnov (KS) Test: Same Logic as the Toy Example

Kolmogorov-Smirnov (KS) Test: What Is a p-value Here?

Kolmogorov-Smirnov (KS) Test: The p-value Question

Kolmogorov-Smirnov (KS) Test: How to Read p

Kolmogorov-Smirnov (KS) Test in 4 Steps

Kolmogorov-Smirnov (KS) Test: What the Library Returns

Kolmogorov-Smirnov (KS) Test: Code

Categorical Features Need a Different Test

Chi-Squared (`chi^2`) Test: What Is the Task?

Chi-Squared (`chi^2`) Test: What Would "No Change" Look Like?

Chi-Squared (`chi^2`) Test: Where Does the Score Come From?

Chi-Squared (`chi^2`) Test: How Does This Connect to p-value?

Chi-Squared (`chi^2`) Test: Code

Chi-Squared (`chi^2`) Test: How to Read It

But Do We Have `Y` in Production?