Interactive Explainer
Seeing the Multivariate Normal
You already know the bell curve. This article shows you how it generalizes to two dimensions, why the covariance matrix controls everything, and what happens when you condition on one variable to learn about the other. Every formula below has an interactive figure right next to it — move the sliders and watch the math update in real time.
I. One dimension: density, area, and the 68-95-99.7 rule
A normal distribution describes a random variable that clusters around a center value
and drops off symmetrically in both directions. It is specified by exactly two numbers:
the mean μ, which sets the center, and the standard deviation
σ, which sets the width. Nothing else matters. Every bell curve in
the world, once you know those two numbers, is the same shape up to a shift and a
stretch.
The density function tells you the relative likelihood of landing near any particular
value of x. It is not a probability — it can actually be
greater than 1 for narrow distributions. Probability comes only from area under the
curve between two endpoints:

f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
The exponential in the formula is doing all the work. The quantity
(x−μ)² / (2σ²) measures how far x
sits from the center in units of the standard deviation. At the mean, this exponent is
zero and the density is at its peak. One standard deviation away, the exponent is
−½, so the density has dropped by a factor of
e−0.5 ≈ 0.61. Two standard deviations away, the
exponent is −2, and the density has dropped to
e−2 ≈ 0.14 of the peak. The drop-off is fast.
This leads to the famous 68-95-99.7 rule: about 68% of the probability mass lies within ±1σ of the mean, about 95% within ±2σ, and about 99.7% within ±3σ. These numbers do not depend on which particular μ and σ you have — they are universal properties of the bell-curve shape. You can verify them in the figure below by adjusting the interval edges.
Concrete example. Suppose the temperature sensor on a weather station reads 20.0°C on average (μ = 20) with a measurement noise of σ = 0.5°C. Then about 68% of readings will fall between 19.5°C and 20.5°C, and a reading outside the range 18.5–21.5°C (±3σ) would occur less than 0.3% of the time. That is the kind of quantitative statement the normal distribution lets you make with just two numbers.
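These interval probabilities can be checked with a few lines of Python using only the standard library's error function. This is a sketch for verification; the helper name normal_interval_prob is ours, not something from the article.

```python
from math import erf, sqrt

def normal_interval_prob(mu, sigma, lo, hi):
    """P(lo <= X <= hi) for X ~ N(mu, sigma^2), via the error function."""
    cdf = lambda x: 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))
    return cdf(hi) - cdf(lo)

# Sensor example from the text: mu = 20.0 C, sigma = 0.5 C.
p1 = normal_interval_prob(20.0, 0.5, 19.5, 20.5)   # +/- 1 sigma, ~0.6827
p3 = normal_interval_prob(20.0, 0.5, 18.5, 21.5)   # +/- 3 sigma, ~0.9973
```

The 68-95-99.7 numbers fall out directly, with no dependence on the particular μ and σ chosen.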
What to notice. Watch the peak drop as σ grows. The total area
under the curve is always exactly 1, so spreading the mass wider forces the peak down.
For the standard normal (μ=0, σ=1), the peak height is
1/√(2π) ≈ 0.399. For σ=2 it is half that.
The standardized variable z = (x−μ)/σ converts any
normal to the standard normal. This is not just a trick for looking up tables —
it is the conceptual move that makes higher dimensions manageable. In 2D, the analogue
of “how many standard deviations away?” is the Mahalanobis
distance, which accounts for the fact that spread can be different in different
directions.
II. Two dimensions: the covariance matrix and what it controls
Now imagine measuring two quantities at once — say, hours studied and exam score, or latitude and longitude from a noisy GPS. Each measurement pair is a point in the plane. If the noise in both coordinates is roughly Gaussian, the joint distribution is a bivariate normal. Instead of a single bell curve, you get a bell-shaped hill in 3D. Viewed from above, the level curves of that hill are ellipses.
The bivariate normal is parametrized by five numbers: two means (μx, μy), two standard deviations (σx, σy), and one correlation coefficient ρ. Those five numbers get packaged into a mean vector and a 2×2 covariance matrix:

μ = (μx, μy),    Σ = [ σx²      ρσxσy ]
                     [ ρσxσy    σy²   ]
Reading the covariance matrix. The diagonal entries σx² and σy² are the variances you already know from 1D — they control spread along each axis independently. The off-diagonal entry ρσxσy is the covariance. It tells you whether the two variables tend to increase together (positive), move in opposite directions (negative), or vary independently (zero). The correlation ρ is just the covariance normalized to lie between −1 and +1.
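Packaging the three spread parameters into Σ, and recovering ρ from it, takes one line each in NumPy. The values below are arbitrary, chosen only for illustration:

```python
import numpy as np

def cov_matrix(sx, sy, rho):
    """Package (sigma_x, sigma_y, rho) into the 2x2 covariance matrix."""
    return np.array([[sx**2,       rho*sx*sy],
                     [rho*sx*sy,   sy**2    ]])

Sigma = cov_matrix(1.0, 2.0, 0.7)
# Recover rho by normalizing the covariance by the two standard deviations.
rho = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
```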
Why ellipses? The density formula for the 2D case involves the quadratic form (x−μ)T Σ−1 (x−μ). Setting that quadratic form equal to a constant gives you the equation of an ellipse. Points on the same ellipse have the same density, so the contours of the distribution are ellipses. The eigenvalues of Σ determine the lengths of the ellipse axes; the eigenvectors determine the tilt.
The prefactor 1/(2π√|Σ|) is the normalization
constant — it ensures the density integrates to 1 over the whole plane. The
determinant |Σ| measures the “volume” of the ellipse: larger
determinant means more spread, which means lower peak density, just like the 1D case.
What to notice. The dots are random samples from the current distribution. The nested ellipses are contour lines at 1, 1.8, 2.45, and 3.1 standard deviations (Mahalanobis distance). When ρ=0, the ellipses are axis-aligned — the axes of the ellipse line up with the coordinate axes. As |ρ| increases toward 1, the ellipses tilt and narrow. At ρ≈±1, all the mass concentrates near a line, and the two variables become nearly deterministic functions of each other.
Eigenvalue interpretation. The covariance matrix Σ can be decomposed as Σ = QΛQT, where Q is a rotation matrix (the eigenvectors) and Λ is diagonal (the eigenvalues). The eigenvectors give the directions of the ellipse axes. The eigenvalues give the variances along those axes. This decomposition is the reason the bivariate normal “looks like” an axis-aligned Gaussian that has been rotated: because that is exactly what it is.
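The decomposition is easy to verify numerically. The sketch below uses illustrative parameter values and NumPy's eigh (the eigensolver for symmetric matrices), then reconstructs Σ from its eigenvalues and eigenvectors:

```python
import numpy as np

sx, sy, rho = 1.0, 2.0, 0.6
Sigma = np.array([[sx**2, rho*sx*sy], [rho*sx*sy, sy**2]])

# Sigma = Q @ diag(evals) @ Q.T, with orthonormal eigenvectors in Q's columns.
evals, Q = np.linalg.eigh(Sigma)

# Eigenvalues are the variances along the ellipse axes; their square
# roots are the semi-axis lengths of the 1-Mahalanobis-unit contour.
axis_lengths = np.sqrt(evals)

# Reconstruction check: rotating the diagonal matrix back gives Sigma.
reconstructed = Q @ np.diag(evals) @ Q.T
```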
Concrete examples
- Hours studied vs. exam score (ρ≈0.7): more study time is associated with higher scores, but not perfectly — the ellipse is tilted upward but still has substantial width.
- Speed vs. travel time (ρ≈−0.8): faster speeds mean shorter trips, producing a downward tilt.
- GPS latitude vs. longitude on a road: the uncertainty along the road is much larger than across it, producing a very narrow ellipse aligned with the road direction.
- Manufacturing width vs. thickness (ρ≈0.9): if the production line drifts, both measurements drift together, so the ellipse is narrow and tilted.
III. Sampling: turning a distribution into actual data
Knowing the formula for a distribution is one thing. Being able to generate actual draws from it is another. In practice, every Monte Carlo simulation, every Bayesian posterior sample, and every generative model relies on being able to sample efficiently. For the multivariate normal, the algorithm is elegant and relies on a single idea: start with something simple and transform it.
1. Draw a vector z of independent standard normals — each component is an independent N(0,1). This gives you a spherical distribution centered at the origin.
2. Find a matrix L such that LLT = Σ. This is the Cholesky decomposition of the covariance matrix — it exists whenever Σ is positive definite, which holds for any non-degenerate covariance matrix.
3. Compute x = μ + Lz. The multiplication by L stretches and rotates the spherical distribution into the correct elliptical shape, and adding μ shifts it to the right location.
Why does this work? If z has covariance
I, then Lz has covariance
L · I · LT = LLT = Σ. That is
the entire proof. The linearity of expectations gives us the correct mean, and the
covariance transformation rule gives us the correct spread. This is why linear algebra
and probability fit together so naturally.
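The three-step recipe above can be sketched in a few lines of NumPy. The particular μ and Σ here are made up for illustration; np.linalg.cholesky returns the lower-triangular factor L with LLT = Σ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.8], [0.8, 2.0]])

L = np.linalg.cholesky(Sigma)         # lower-triangular, L @ L.T == Sigma
z = rng.standard_normal((50_000, 2))  # rows of independent N(0,1) draws
x = mu + z @ L.T                      # x = mu + L z, vectorized over rows

# Sample statistics should sit close to the true parameters at this n.
sample_mean = x.mean(axis=0)
sample_cov = np.cov(x, rowvar=False)
```

With 50,000 draws the sample mean and covariance land very close to μ and Σ; drop the count to 20 and they scatter visibly, which is exactly the sampling variability the figure demonstrates.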
The figure below draws actual samples from the bivariate normal you configured in Section II. Each time you hit “Redraw,” a fresh set of independent standard normals is generated and transformed through the Cholesky factor. Notice how the sample mean and sample covariance fluctuate from draw to draw, especially at small sample sizes. This is sampling variability, and it is the fundamental reason why statistical estimation is needed at all.
What to notice. At n=20, the sample estimates can deviate substantially from the true parameters; at n=300, they settle down close to them. This convergence is the law of large numbers in action. Try setting the count to 20 and hitting Redraw several times: the sample mean jumps around and the sample covariance matrix changes noticeably between draws. Then set it to 300 and do the same — the estimates become much more stable. The estimation error shrinks roughly in proportion to 1/√n.
IV. Estimation: fitting parameters from observed points
In practice, you rarely know μ and Σ — you have to estimate them from data. For the Gaussian, the maximum-likelihood estimates have a beautifully simple form: the sample mean is the average of the data points, and the sample covariance is the average of the outer products of the centered data:

μ̂ = (1/n) Σi xi,    Σ̂ = (1/n) Σi (xi − μ̂)(xi − μ̂)T
What the outer product does. For each data point, the term
(xi−μ̂)(xi−μ̂)T
is a 2×2 matrix. Its (1,1) entry is the squared x-deviation. Its (2,2) entry is
the squared y-deviation. Its off-diagonal entries are the product of the x- and
y-deviations. Averaging these over all data points gives you the sample variances on the
diagonal and the sample covariance off the diagonal. There is no black box here —
it is literally averaging squared deviations.
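In code, the estimator really is just averaging. A minimal NumPy sketch on synthetic data (the true parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=1000)

mu_hat = data.mean(axis=0)            # sample mean: just an average
centered = data - mu_hat
# Average of the outer products (x_i - mu_hat)(x_i - mu_hat)^T.
# centered.T @ centered sums all n outer products at once.
Sigma_hat = centered.T @ centered / len(data)
```

No iteration and no optimizer — one pass over the data produces both estimates.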
The figure below shows a small toy dataset. The solid ellipse is the fitted Gaussian using the sample mean and covariance. The dashed ellipse is the true distribution that generated the data. Click anywhere inside the plot to add new data points and watch the fitted ellipse adjust in real time. Adding points near the center stabilizes the estimates; adding outliers can tilt the ellipse dramatically.
(Interactive data table: point index, x, and y coordinates of the current dataset.)
What to notice. With 10–15 points, the fitted ellipse is often visibly tilted away from the true one. With 50+ points it becomes a close match. Try clicking a cluster of points far from the current mean — you will see the mean get pulled toward the new cluster and the covariance expand to cover the wider spread. This is exactly what happens in practice when your data contains subpopulations or outliers.
V. Marginals: projecting away one axis
A marginal distribution answers the question: if I completely ignore one of the two coordinates, what does the remaining coordinate look like on its own? Formally, you integrate (sum) the joint density over the coordinate you want to forget:

fX(x) = ∫ f(x, y) dy
For the Gaussian, this integration has a remarkable result: the marginal is itself Gaussian, and its parameters are simply the corresponding entries from the mean vector and the diagonal of the covariance matrix. You do not need ρ to compute the marginal. The marginal of X depends only on μx and σx². The marginal of Y depends only on μy and σy².
This is counterintuitive. You might expect that if two variables are correlated, the marginal of one should somehow reflect that correlation. It does not. Correlation controls the shape of the joint distribution — it determines how the ellipse tilts — but when you project that ellipse onto a single axis, the shadow has the same width regardless of the tilt. Try it: go back to Section II, change ρ while keeping σx and σy fixed, and watch the marginal curves below stay unchanged.
Geometrically, the marginal is the “shadow” of the 2D distribution projected onto one axis. Imagine shining a light straight down onto the xy-plane and looking at the resulting silhouette on the x-axis. The shape of that silhouette is the x-marginal. No matter how you tilt the ellipse, as long as its horizontal extent stays the same, the shadow stays the same.
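You can confirm the shadow claim numerically: sample with several different ρ while holding σx fixed, and the spread of the x-coordinates does not move. A sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = [0.0, 0.0]
sx, sy = 1.2, 0.8

x_stds = []
for rho in (0.0, 0.5, 0.9):
    Sigma = [[sx**2, rho*sx*sy], [rho*sx*sy, sy**2]]
    samples = rng.multivariate_normal(mu, Sigma, size=100_000)
    x_stds.append(samples[:, 0].std())  # spread of the x "shadow"
```

All three sample standard deviations come out near σx = 1.2, regardless of the tilt ρ imposes on the joint ellipse.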
(Interactive figure: the joint model with the X marginal projected on top, labeled e.g. X ~ N(0.00, 1.20²).)
Why this matters. In many applications, you observe only one variable at a time. If you are looking at a patient’s blood pressure without knowing their cholesterol, you are looking at a marginal. The marginal tells you about the spread of that one variable in isolation, but it throws away all information about how the variables relate. That relationship information is exactly what conditionals recover.
VI. Conditionals: the Gaussian version of “given that”
A conditional distribution answers a much sharper question than the marginal: if I know that X takes a specific value x0, what can I say about Y? For the Gaussian, the answer is again Gaussian, with two specific changes:

Y | X = x0  ~  N( μy + ρ(σy/σx)(x0 − μx),  (1 − ρ²)σy² )
The conditional mean shifts. It does not stay at μy. Instead, it moves by an amount proportional to how far x0 is from μx, scaled by the regression coefficient ρσy/σx. This is exactly the slope of the least-squares regression line of Y on X. The conditional mean traces out a straight line as x0 varies — and that line is the regression line.
The conditional variance shrinks. It is reduced from σy² to (1−ρ²)σy². The factor (1−ρ²) is the fraction of variance in Y that is not explained by X. When ρ=0, no variance is explained and the conditional equals the marginal. When |ρ|=1, all variance is explained and the conditional collapses to a point — if you know X exactly, you know Y exactly.
Connection to R². In linear regression, R² is the proportion of variance explained by the predictor. For a bivariate normal, R² = ρ². The conditional variance is (1−R²) times the marginal variance. So the entire machinery of simple linear regression is already hiding inside the bivariate normal — the conditional distribution is literally what regression computes.
The figure below adds two slice lines to the joint distribution: a vertical slice at x0 (which reveals Y|X) and a horizontal slice at y0 (which reveals X|Y). Drag the slice lines or use the sliders. Compare each conditional curve to its marginal (shown in teal): the conditional is always narrower or equal, never wider.
(Interactive figure: marginal and conditional curves with a slice readout for x0 and y0. Teal: X marginal; gold: X | Y conditional; coral: Y | X conditional.)
What to notice. Move x0 to +2 standard deviations with high positive ρ. The conditional mean of Y jumps well above μy, and the conditional curve becomes noticeably narrower than the marginal. Now set ρ=0. The conditional becomes identical to the marginal: observing X tells you nothing about Y. This is the operational definition of independence for Gaussian variables.
The closure property. The fact that conditioning a Gaussian on a subset of variables produces another Gaussian is the single most important algebraic property of this family. It is the reason Kalman filters work (prediction + update are both Gaussian operations), the reason Gaussian processes produce closed-form posteriors, and the reason Bayesian linear regression has an exact solution. Every time you see a model that says “assume the data is Gaussian,” this closure property is usually the real motivation.
VII. Summary and connections
- Density is not probability. The value of the density at a point is a relative likelihood, not a probability. Probability is area under the curve between two values. The density can exceed 1.
- Two parameters control everything in 1D. μ sets the center, σ sets the width. The 68-95-99.7 rule gives you quick probability estimates without computing any integrals.
- The covariance matrix is the whole story in 2D. Diagonal entries are variances. The off-diagonal entry is the covariance. The eigenvalues and eigenvectors of Σ directly give you the ellipse shape and orientation.
- Sampling = transform a standard normal. Draw z ~ N(0, I), compute x = μ + Lz where LLT = Σ. The Cholesky factor L does the stretching and rotating. This is how every Monte Carlo sampler generates multivariate normals.
- MLE is just averaging. The maximum-likelihood mean is the sample average. The maximum-likelihood covariance is the average outer product of centered observations. No gradient descent, no iteration — just summation.
- Marginals ignore correlation. The marginal of X depends only on μx and σx², no matter how strong the correlation. Marginals are projections, and projections discard relationship information.
- Conditionals are regression. Conditioning on X = x0 gives a Gaussian for Y whose mean follows the regression line and whose variance is (1 − ρ²)σy². The conditional distribution is what linear regression computes.
- Closure is the real superpower. Marginals, conditionals, and linear combinations of Gaussians are all Gaussian. This closure property is why the family dominates: Kalman filters, Gaussian processes, Bayesian linear regression, and factor analysis all rely on it.
Where to go from here
Higher dimensions. Everything above generalizes directly: μ becomes a d-dimensional vector, Σ becomes a d×d matrix, contours become ellipsoids. Marginals and conditionals remain Gaussian with the same formulas (just involving submatrices of Σ).
Gaussian processes. A GP is a distribution over functions, defined by the property that any finite collection of function values is jointly Gaussian. Conditioning on observed data points is exactly the conditional formula from Section VI, applied to a potentially infinite-dimensional covariance structure.
Kalman filters. The state of a linear dynamical system at each time step is Gaussian. The predict step is a marginal (project forward in time). The update step is a conditional (incorporate a new measurement). Both operations stay Gaussian, which is why the Kalman filter is exact and efficient.
Maximum entropy. Among all distributions with a given mean and covariance, the Gaussian has the highest entropy — it makes the fewest additional assumptions. This is the information-theoretic justification for using Gaussians as default models of uncertainty.