
Interactive Explainer

Feeling p-values through re-splitting

Forget formulas for a moment. Start with two tiny groups of numbers, mix them up, re-deal them every possible way, and watch a p-value emerge as a plain fraction—no black box, no hand-waving.

Prelude

The courtroom analogy

In a courtroom, the defendant is innocent until proven guilty. The jury doesn't ask "did they do it?" — they ask "is the evidence so overwhelming that we can't reasonably explain it away?"

A p-value works exactly the same way. We start by assuming the treatment did nothing (the "defendant is innocent"). Then we ask: how surprising is the data we actually observed, if that assumption were true? If the answer is "extremely surprising"—the p-value is tiny—we reject the assumption.

By the end of this page, you'll compute an exact p-value with your own hands. No library calls, no memorised thresholds—just counting.

Step 1

Pick an experiment

Every statistical test starts with data. We have two small groups of 5 subjects each—one received a treatment, the other didn't. Pick a scenario to work with:

Scores on a cognitive test

Group A (Control)

Mean:

Group B (Treatment)

Mean:

Observed Difference

Is this gap real, or just lucky chance?

Pause and think. Look at those two groups. Does the gap feel large to you? Could you get a gap that big just by randomly splitting ten people into two groups? Hold that intuition—we'll test it rigorously.
Step 2

The skeptic's assumption (Null Hypothesis)

To test whether the treatment worked, we play devil's advocate. We assume the harshest possible stance:

The Null Hypothesis (H0): The treatment had zero effect. Every subject would have gotten the exact same score no matter which group they landed in.

If H0 is true, then the labels "Control" and "Treatment" are meaningless—they're just arbitrary tags we slapped on. The gap we saw in Step 1 would be nothing more than the luck of the draw.

To simulate this world where group labels don't matter, let's erase the labels entirely. Throw all 10 numbers into one bucket:


Good. The group identity is gone. Now we have just 10 raw numbers in a pile, and we can re-deal them however we like.

Step 3

Re-split a few times by hand

Imagine re-running the experiment in the null world: randomly pick 5 numbers for "Group A" and the remaining 5 become "Group B." Compute the gap between their means. This is what random noise looks like when the treatment truly does nothing.
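A single re-split in this null world takes only a few lines of Python. The pooled numbers below are illustrative stand-ins, not the scenario data above:

```python
import random

# Illustrative pooled scores (stand-ins for the page's scenario data).
pool = [78, 82, 75, 90, 85, 88, 92, 79, 83, 95]

def one_resplit(pool, size_a=5, rng=random):
    """Deal the pooled numbers into two random groups and return the mean gap."""
    shuffled = pool[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    group_a, group_b = shuffled[:size_a], shuffled[size_a:]
    return sum(group_b) / len(group_b) - sum(group_a) / len(group_a)

gap = one_resplit(pool)
print(f"one random-split gap: {gap:+.2f}")
```

Each call is one click of the button: a fresh gap produced by pure chance.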

Click below to try it. Each click is one random re-split of the 10 numbers:


Your random draws so far. Each dot is a gap produced by pure chance. The dashed orange line is the gap you actually observed.
What do you notice? Are most of your random gaps smaller than the observed gap, or about the same size? If you can easily produce gaps as large as the observed one just by shuffling, that's a hint the treatment might not have done anything special.
Step 4

Enumerate every possible split

Drawing a few random splits gives us intuition, but it's not rigorous. We might just have been unlucky. To be exact, we need to check every possible way to deal 10 numbers into two groups of 5.

How many ways are there? The binomial coefficient tells us: $\binom{10}{5} = \frac{10!}{5!\,5!} = 252$.

Exactly 252. A computer can enumerate all of them in milliseconds. Let's do it and map out the complete landscape of random chance:
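The full enumeration is a few lines with `itertools.combinations` (again with illustrative stand-in numbers, not the scenario data):

```python
from itertools import combinations

pool = [78, 82, 75, 90, 85, 88, 92, 79, 83, 95]  # illustrative stand-ins

# Every way to choose 5 of the 10 indices for "Group A";
# the remaining 5 automatically form "Group B".
gaps = []
for idx_a in combinations(range(len(pool)), 5):
    group_a = [pool[i] for i in idx_a]
    group_b = [pool[i] for i in range(len(pool)) if i not in idx_a]
    gaps.append(sum(group_b) / 5 - sum(group_a) / 5)

print(len(gaps))  # 252 splits, exactly C(10, 5)
```

Note that every split has a mirror image with the groups swapped, so the gaps come in plus/minus pairs and the histogram is symmetric around zero.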

Every bar is a group of splits that produced a similar gap. Blue = smaller than what we observed. Orange = as large or larger.
Read the histogram. The orange region is the fraction of the "null world" that looks at least as extreme as our actual experiment. If that fraction is tiny, our observed gap is hard to explain by chance alone.
Step 5

The p-value is just a fraction

Count the orange bars. Divide by 252. That's it—that's the p-value. No calculus, no lookup tables. Just: $p = \frac{\#\{\text{splits at least as extreme}\}}{252}$.
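In code, the whole test collapses to a counting loop and one division. The numbers and the observed gap here are hypothetical stand-ins for the Step 1 scenario:

```python
from itertools import combinations

pool = [78, 82, 75, 90, 85, 88, 92, 79, 83, 95]   # illustrative stand-ins
observed_gap = 7.0                                 # hypothetical Step 1 gap

extreme = 0
total = 0
for idx_a in combinations(range(10), 5):
    group_a = [pool[i] for i in idx_a]
    group_b = [pool[i] for i in range(10) if i not in idx_a]
    gap = sum(group_b) / 5 - sum(group_a) / 5
    total += 1
    if abs(gap) >= abs(observed_gap):   # two-sided: "as large or larger"
        extreme += 1

p_value = extreme / total
print(f"p = {extreme}/{total} = {p_value:.4f}")
```

That fraction is the exact p-value for this data, with no approximations anywhere.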

Your exact p-value is:

What does the p-value actually tell you?

Three things a p-value is not

False

"The probability that the treatment works."
Nope. The p-value says nothing about how likely the treatment is to be effective. It only measures how surprising the data is under H0.

False

"The probability that H0 is true."
Also no. The p-value assumes H0 is true and then asks how likely the data is. It doesn't flip this around.

False

"p < 0.05 means the result is important."
Statistical significance ≠ practical significance. A drug that lowers blood pressure by 0.1 mmHg might yield p = 0.001 with a huge sample—but no doctor would care.

Bonus

From 252 to 137 billion: why we need shortcuts

What you just did is called a Permutation Test (or Randomization Test). It's an exact test—no assumptions about bell curves or equal variances.

But it has a catch. With 10 numbers split 5-vs-5, there are only 252 combinations. Scale up to 20-vs-20 (40 numbers total) and you hit:

Over 137 billion ($\binom{40}{20} = 137{,}846{,}528{,}820$, to be exact). Even fast computers would sweat. So statisticians developed two families of shortcuts:
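The jump is easy to verify with the standard library (`math.comb`, available since Python 3.8):

```python
import math

toy = math.comb(10, 5)    # splits for our 5-vs-5 experiment
big = math.comb(40, 20)   # splits for a 20-vs-20 experiment
print(toy, big)           # 252 vs. 137,846,528,820
```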

Shortcut 1: Parametric tests

Assume the data follows a known shape (often a bell curve) and use calculus to compute the p-value with a formula. The t-test, ANOVA, and Chi-Square test are all examples. They're fast and elegant, but they break down when the assumption is wrong.

Shortcut 2: Monte Carlo sampling

Don't enumerate all splits—just randomly sample 10,000 of them (exactly what you did in Step 3, but many more times). The fraction of extreme samples converges to the true p-value. No distributional assumptions needed.

Visualizing the Shortcuts

Let's see both shortcuts in action on our current scenario.

1. Parametric (z-approximation)

Instead of counting combinations, a parametric test assumes the distribution of random noise forms a perfect, continuous Bell Curve (a normal distribution).

Because the Null Hypothesis assumes the treatment does nothing, the centre of the curve is exactly zero (μ = 0). To draw the curve, we estimate the Standard Deviation (σ) directly from the spread of our 10 data points using a shortcut formula.

Worked out for our data:
  • Group A: ... (Mean: ..., Var: ...)
  • Group B: ... (Mean: ..., Var: ...)
  • Observed Gap (The "Signal"): ...

In statistics, the variance of an average shrinks in proportion to the number of data points ($n$). Because we are subtracting two independent group means to find the gap, their variances add up. Taking the square root, the formula $\sigma = \sqrt{s_A^2/n_A + s_B^2/n_B}$ estimates the Standard Deviation (σ) of random differences. This represents the expected "Noise".

Now we calculate the famous Z-score (sometimes called a Z-statistic). The formula is beautifully simple: it is just the ratio of Signal to Noise, $Z = (\bar{x}_B - \bar{x}_A)/\sigma$.

To compute the exact shaded area in the tails without manually counting combinations, we use calculus to integrate the continuous curve from our $Z$ value all the way to infinity, doubling the result because extreme gaps can land in either tail: $p = 2\int_{|Z|}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-x^{2}/2}\,dx$.
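A minimal sketch of this z-approximation, assuming sample variances ($n-1$ denominator) and a two-tailed area computed with the standard library's complementary error function; the two groups are illustrative stand-ins, not the scenario data:

```python
import math

def z_approx_p_value(group_a, group_b):
    """Two-tailed p-value from the normal (z) approximation.

    Noise is estimated as sqrt(var_a/n_a + var_b/n_b), using the
    sample variance (n - 1 denominator) of each group.
    """
    def sample_var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    n_a, n_b = len(group_a), len(group_b)
    signal = sum(group_b) / n_b - sum(group_a) / n_a
    noise = math.sqrt(sample_var(group_a) / n_a + sample_var(group_b) / n_b)
    z = signal / noise
    # P(|Z| >= z) for a standard normal: erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# Illustrative stand-in groups (not the page's scenario data):
p = z_approx_p_value([78, 82, 75, 79, 83], [90, 85, 88, 92, 95])
print(f"z-approximation p = {p:.6f}")
```

No counting happens here at all: the entire 252-split landscape is replaced by one smooth curve and one integral.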

Theoretical continuous distribution. The shaded tail area is the p-value.

2. Monte Carlo Simulation

We randomly draw splits and build a histogram. As we draw more and more, it approximates the exact 252-split distribution from Step 4.
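The Monte Carlo version is just Step 3's shuffle repeated many times, with the extreme fraction tallied along the way. As before, the pool and observed gap are hypothetical stand-ins, and the generator is seeded only for repeatability:

```python
import random

pool = [78, 82, 75, 90, 85, 88, 92, 79, 83, 95]   # illustrative stand-ins
observed_gap = 7.0                                 # hypothetical Step 1 gap
rng = random.Random(0)                             # seeded for repeatability

extreme = 0
n_samples = 10_000
for _ in range(n_samples):
    shuffled = pool[:]
    rng.shuffle(shuffled)
    gap = sum(shuffled[5:]) / 5 - sum(shuffled[:5]) / 5
    if abs(gap) >= abs(observed_gap):
        extreme += 1

p_estimate = extreme / n_samples
print(f"Monte Carlo estimate after {n_samples} shuffles: p ~ {p_estimate:.4f}")
```

With 10,000 shuffles the estimate typically lands within about a percentage point of the exact answer, and more shuffles tighten it further.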


Comparing the Three Methods

We've now seen three different ways to calculate a p-value for the exact same data. Let's compare them:

  • Method 1, Exact Permutation Test: calculated exactly by computing all 252 splits (the gold standard for small data). Estimated p-value: ...
  • Method 2, Parametric (z-approximation): assumes a Gaussian curve and uses standard deviation formulas. Estimated p-value: ...
  • Method 3, Monte Carlo Simulation: randomly shuffles data; converges to the exact test with enough samples. Estimated p-value: not run yet
Wait, why is the Parametric test so different?
Wait, why is the Parametric test so different?
If you run the "New Fertilizer" or "Coffee" scenarios, you might notice the Parametric test gives a much smaller p-value than the Exact test. Did the math fail?

In a sense, yes. This is the fundamental danger of shortcuts. A perfect Bell Curve has very "thin tails." But when dealing with small, noisy datasets ($N=10$), extreme random events happen far more often than a perfect Bell Curve expects. A normal distribution underestimates the probability of these extreme tail events, artificially driving the p-value down.

This exact problem is why Gosset invented the "Student's t-distribution"—a special curve with fatter tails designed for small samples. But even that is an approximation. The Exact Permutation Test (Method 1) makes zero assumptions and gives the absolute ground truth for your specific data!
Final Takeaway
The permutation test you built today is the conceptual foundation under all of these. When someone says "p = 0.03," they're saying: "Only 3% of the null world looks this extreme." Now you know exactly what that means—because you've counted it yourself.