Interactive Explainer
Feeling p-values through re-splitting
Forget formulas for a moment. Start with two tiny groups of numbers, mix them up, re-deal them every possible way, and watch a p-value emerge as a plain fraction—no black box, no hand-waving.
The courtroom analogy
In a courtroom, the defendant is innocent until proven guilty. The jury doesn't ask "did they do it?" — they ask "is the evidence so overwhelming that we can't reasonably explain it away?"
A p-value works exactly the same way. We start by assuming the treatment did nothing (the "defendant is innocent"). Then we ask: how surprising is the data we actually observed, if that assumption were true? If the answer is "extremely surprising"—the p-value is tiny—we reject the assumption.
By the end of this page, you'll compute an exact p-value with your own hands. No library calls, no memorised thresholds—just counting.
Pick an experiment
Every statistical test starts with data. We have two small groups of 5 subjects each—one received a treatment, the other didn't. Pick a scenario to work with:
Scores on a cognitive test
Group A (Control) and Group B (Treatment) are shown with their five values each, their means, and the Observed Difference between those means.
Is this gap real, or just lucky chance?
The skeptic's assumption (Null Hypothesis)
To test whether the treatment worked, we play devil's advocate. We assume the harshest possible stance—the Null Hypothesis (H0): the treatment did nothing, and any gap between the two groups is pure chance.
If H0 is true, then the labels "Control" and "Treatment" are meaningless—they're just arbitrary tags we slapped on. The gap we saw in Step 1 would be nothing more than the luck of the draw.
To simulate this world where group labels don't matter, let's erase the labels entirely. Throw all 10 numbers into one bucket:
Bucket is empty. Click the button above to mix!
Good. The group identity is gone. Now we have just 10 raw numbers in a pile, and we can re-deal them however we like.
Re-split a few times by hand
Imagine re-running the experiment in the null world: randomly pick 5 numbers for "Group A" and the remaining 5 become "Group B." Compute the gap between their means. This is what random noise looks like when the treatment truly does nothing.
Click below to try it. Each click is one random re-split of the 10 numbers:
Click a button to draw random splits.
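What each click does can be sketched in a few lines of Python. The ten values below are illustrative stand-ins—each scenario on this page has its own numbers:

```python
import random

# Illustrative stand-in data -- each scenario has its own 10 values
group_a = [82, 75, 90, 68, 77]   # "Control"
group_b = [88, 93, 79, 85, 91]   # "Treatment"
pool = group_a + group_b          # one unlabeled bucket of 10 numbers

def one_resplit(pool):
    """Randomly re-deal the pooled numbers 5-vs-5 and return the gap in means."""
    shuffled = random.sample(pool, len(pool))   # a random shuffle of the bucket
    new_a, new_b = shuffled[:5], shuffled[5:]
    return sum(new_b) / 5 - sum(new_a) / 5

for _ in range(5):                # five "clicks"
    print(round(one_resplit(pool), 2))
```

Each printed number is one sample of what pure noise looks like in the null world.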
Enumerate every possible split
Drawing a few random splits gives us intuition, but it's not rigorous. We might just have been unlucky. To be exact, we need to check every possible way to deal 10 numbers into two groups of 5.
How many ways are there? The binomial coefficient tells us:

$$\binom{10}{5} = \frac{10!}{5!\,5!} = 252$$

Exactly 252. A computer can enumerate all of them in milliseconds. Let's do it and map out the complete landscape of random chance:
The p-value is just a fraction
Count the orange bars. Divide by 252. That's it—that's the p-value. No calculus, no lookup tables. Just:

$$p = \frac{\text{number of splits with a gap at least as extreme as observed}}{252}$$
Your exact p-value is:
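The whole enumeration fits in a short Python sketch (again with illustrative placeholder data); the p-value falls out as a literal count divided by 252:

```python
from itertools import combinations

group_a = [82, 75, 90, 68, 77]   # illustrative "Control" values
group_b = [88, 93, 79, 85, 91]   # illustrative "Treatment" values
pool = group_a + group_b
observed = sum(group_b) / 5 - sum(group_a) / 5

# Every way to choose which 5 of the 10 positions form "Group A"
gaps = []
for idx in combinations(range(10), 5):
    a = [pool[i] for i in idx]
    b = [pool[i] for i in range(10) if i not in idx]
    gaps.append(sum(b) / 5 - sum(a) / 5)

assert len(gaps) == 252           # the complete landscape of random chance

# Count splits at least as extreme as what we observed (both tails)
extreme = sum(1 for g in gaps if abs(g) >= abs(observed))
print(f"p = {extreme}/252 = {extreme / 252:.4f}")
```

No statistics library in sight—just `itertools` and counting.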
What does the p-value actually tell you?
- p = probability. Specifically, the probability of seeing a gap this large or larger if the treatment truly did nothing.
- Small p (typically < 0.05): the observed gap is very unlikely under pure chance → we reject the null hypothesis and conclude the treatment likely had an effect.
- Large p: gaps this big happen all the time by chance → we fail to reject H0. We can't say the treatment did anything.
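The decision rule itself is nothing more than a threshold comparison (α = 0.05 is a convention, not a law):

```python
def decide(p, alpha=0.05):
    """Classic decision rule: reject H0 when p falls below the chosen threshold."""
    return "reject H0" if p < alpha else "fail to reject H0"

print(decide(0.03))   # → reject H0
print(decide(0.40))   # → fail to reject H0
```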
Three things a p-value is not
"The probability that the treatment works."
Nope. The p-value says nothing about how likely the treatment is to be effective. It only measures how surprising the data is under H0.
"The probability that H0 is true."
Also no. The p-value assumes H0 is true and then asks how likely the data is. It doesn't flip this around.
"p < 0.05 means the result is important."
Statistical significance ≠ practical significance. A drug that lowers blood pressure by 0.1 mmHg might yield p = 0.001 with a huge sample—but no doctor would care.
From 252 to 137 billion: why we need shortcuts
What you just did is called a Permutation Test (or Randomization Test). It's an exact test—no assumptions about bell curves or equal variances.
But it has a catch. With 10 numbers split 5-vs-5, there are only 252 combinations. Scale up to 20-vs-20 (40 numbers total) and you hit:

$$\binom{40}{20} = 137{,}846{,}528{,}820$$

Over 137 billion. Even fast computers would sweat. So statisticians developed two families of shortcuts:
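Python's standard library can verify both counts directly:

```python
from math import comb

print(comb(10, 5))    # → 252
print(comb(40, 20))   # → 137846528820, over 137 billion
```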
Shortcut 1: Parametric tests
Assume the data follows a known shape (often a bell curve) and use calculus to compute the p-value with a formula. The t-test, ANOVA, and Chi-Square test are all examples. They're fast and elegant, but they break down when the assumption is wrong.
Shortcut 2: Monte Carlo sampling
Don't enumerate all splits—just randomly sample 10,000 of them (exactly what you did in Step 3, but many more times). The fraction of extreme samples converges to the true p-value. No distributional assumptions needed.
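A Monte Carlo version of the test is only a small change to the enumeration: sample random splits instead of listing all of them. The data below is an illustrative stand-in, and the seed is just for reproducibility:

```python
import random

group_a = [82, 75, 90, 68, 77]   # illustrative values
group_b = [88, 93, 79, 85, 91]
pool = group_a + group_b
observed = sum(group_b) / 5 - sum(group_a) / 5

random.seed(0)                    # reproducible demo
n_samples = 10_000
extreme = 0
for _ in range(n_samples):
    shuffled = random.sample(pool, len(pool))   # one random re-split
    gap = sum(shuffled[5:]) / 5 - sum(shuffled[:5]) / 5
    if abs(gap) >= abs(observed):
        extreme += 1

print(f"Monte Carlo p ≈ {extreme / n_samples:.4f}")
```

By the law of large numbers, this fraction converges to the exact permutation p-value as `n_samples` grows.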
Visualizing the Shortcuts
Let's see both shortcuts in action on our current scenario.
1. Parametric (t-test)
Instead of counting combinations, a parametric test assumes the distribution of random noise forms a perfect, continuous Bell Curve (a normal distribution).
Because the Null Hypothesis assumes the treatment does nothing, the centre of the curve is exactly zero (μ = 0). To draw the curve, we estimate the Standard Deviation (σ) directly from the spread of our 10 data points using a shortcut formula.
- Group A: ... (Mean: ..., Var: ...)
- Group B: ... (Mean: ..., Var: ...)
- Observed Gap (The "Signal"): ...
In statistics, the variance of an average shrinks by the number of data points ($n$). Because we are subtracting two independent groups to find the gap, their variances add up. By taking the square root, the formula below estimates the Standard Deviation (σ) of random differences. This represents the expected "Noise":

$$\sigma = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}} = \sqrt{\frac{s_A^2}{5} + \frac{s_B^2}{5}}$$
Now we calculate the famous Z-score (sometimes called a Z-statistic). The formula is beautifully simple: it is just the ratio of Signal to Noise:

$$Z = \frac{\text{Signal}}{\text{Noise}} = \frac{\bar{x}_B - \bar{x}_A}{\sigma}$$
To compute the exact shaded area in the tails without manually counting combinations, we use calculus to integrate the continuous curve from our $Z$ value all the way to infinity (and double it to cover both tails):

$$p = 2\int_{|Z|}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\,dt$$
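Put together, the parametric shortcut fits in a few lines. Two assumptions in this sketch are worth flagging: it uses the population variance of each group (the page's exact "shortcut formula" may differ), and it doubles the one-sided tail area. The data is again an illustrative stand-in:

```python
from math import sqrt, erfc
from statistics import pvariance   # population variance; an assumption here

group_a = [82, 75, 90, 68, 77]     # illustrative values
group_b = [88, 93, 79, 85, 91]

observed = sum(group_b) / 5 - sum(group_a) / 5                 # the "Signal"
sigma = sqrt(pvariance(group_a) / 5 + pvariance(group_b) / 5)  # the "Noise"
z = observed / sigma                                           # Signal-to-Noise ratio

# Two-tailed area under the standard normal beyond |z|:
# 2 * integral from |z| to infinity of (1/sqrt(2*pi)) * exp(-t^2/2) dt
p = erfc(abs(z) / sqrt(2))
print(f"Z = {z:.2f}, parametric p ≈ {p:.4f}")
```

The `erfc(|z|/√2)` identity is just that two-tailed integral evaluated via the complementary error function, so no numerical integration is needed.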
2. Monte Carlo Simulation
We randomly draw splits and build a histogram. As we draw more and more, it approximates the exact 252-split distribution from Step 4.
Comparing the Three Methods
We've now seen three different ways to calculate a p-value for the exact same data. Let's compare them:
| Method | How it works | Estimated p-value |
|---|---|---|
| 1. Exact Permutation Test | Calculated exactly by computing all 252 splits. (The gold standard for small data). | ... |
| 2. Parametric (z-approximation) | Assumes a Gaussian curve and uses standard deviation formulas. | ≈ ... |
| 3. Monte Carlo Simulation | Randomly shuffles data. (Converges to the exact test with enough samples). | ≈ Not run yet |
If you run the "New Fertilizer" or "Coffee" scenarios, you might notice the Parametric test gives a much smaller p-value than the Exact test. Did the math fail?
Yes! This is the fundamental danger of shortcuts. A perfect Bell Curve has very "thin tails." But when dealing with small, noisy datasets ($N=10$), extreme random events happen far more often than a perfect Bell Curve expects. A normal distribution underestimates the probability of these extreme tail events, artificially driving the p-value down.
This exact problem is why Gosset invented the "Student's t-distribution"—a special curve with fatter tails designed for small samples. But even that is an approximation. The Exact Permutation Test (Method 1) makes zero assumptions and gives the absolute ground truth for your specific data!
The permutation test you built today is the conceptual foundation under all of these. When someone says "p = 0.03," they're saying: "Only 3% of the null world looks this extreme." Now you know exactly what that means—because you've counted it yourself.