Interactive Explainer
Seeing the Multivariate Normal
You already know the bell curve. This article shows you how it generalizes to two dimensions, why the covariance matrix controls everything, and what happens when you condition on one variable to learn about the other. Every formula below has an interactive figure right next to it — move the sliders and watch the math update in real time.
I. One dimension: density, area, and the 68-95-99.7 rule
A normal distribution describes a random variable that clusters around a center value
and drops off symmetrically in both directions. It is specified by exactly two numbers:
the mean μ, which sets the center, and the standard deviation
σ, which sets the width. Nothing else matters. Every bell curve in
the world, once you know those two numbers, is the same shape up to a shift and a
stretch.
The density function tells you the relative likelihood of landing near any particular
value of x. It is not a probability — it can actually be
greater than 1 for narrow distributions. Probability comes only from area under the
curve between two endpoints:

f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
The exponential in the formula is doing all the work. The quantity
(x−μ)² / (2σ²) measures how far x
sits from the center in units of the standard deviation. At the mean, this exponent is
zero and the density is at its peak. One standard deviation away, the exponent is
−½, so the density has dropped by a factor of
e−0.5 ≈ 0.61. Two standard deviations away, the
exponent is −2, and the density has dropped to
e−2 ≈ 0.14 of the peak. The drop-off is fast.
This leads to the famous 68-95-99.7 rule: about 68% of the probability mass lies within ±1σ of the mean, about 95% within ±2σ, and about 99.7% within ±3σ. These numbers do not depend on which particular μ and σ you have — they are universal properties of the bell-curve shape. You can verify them in the figure below by adjusting the interval edges.
Concrete example. Suppose the temperature sensor on a weather station reads 20.0°C on average (μ = 20) with a measurement noise of σ = 0.5°C. Then about 68% of readings will fall between 19.5°C and 20.5°C, and a reading outside the range 18.5–21.5°C (±3σ) would occur less than 0.3% of the time. That is the kind of quantitative statement the normal distribution lets you make with just two numbers.
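These interval probabilities can be checked with a few lines of Python using only the standard library's error function. This is a sketch for verification; the helper name normal_interval_prob is ours, not something from the article.

```python
from math import erf, sqrt

def normal_interval_prob(mu, sigma, lo, hi):
    """P(lo <= X <= hi) for X ~ N(mu, sigma^2), via the error function."""
    cdf = lambda x: 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))
    return cdf(hi) - cdf(lo)

# Sensor example from the text: mu = 20.0 C, sigma = 0.5 C.
p1 = normal_interval_prob(20.0, 0.5, 19.5, 20.5)   # +/- 1 sigma, ~0.6827
p3 = normal_interval_prob(20.0, 0.5, 18.5, 21.5)   # +/- 3 sigma, ~0.9973
```

The 68-95-99.7 numbers fall out directly, with no dependence on the particular μ and σ chosen.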
What to notice. Watch the peak drop as σ grows. The total area
under the curve is always exactly 1, so spreading the mass wider forces the peak down.
For the standard normal (μ=0, σ=1), the peak height is
1/√(2π) ≈ 0.399. For σ=2 it is half that.
The standardized variable z = (x−μ)/σ converts any
normal to the standard normal. This is not just a trick for looking up tables —
it is the conceptual move that makes higher dimensions manageable. In 2D, the analogue
of “how many standard deviations away?” is the Mahalanobis
distance, which accounts for the fact that spread can be different in different
directions.
II. Two dimensions: the covariance matrix and what it controls
Now imagine measuring two quantities at once — say, hours studied and exam score, or latitude and longitude from a noisy GPS. Each measurement pair is a point in the plane. If the noise in both coordinates is roughly Gaussian, the joint distribution is a bivariate normal. Instead of a single bell curve, you get a bell-shaped hill in 3D. Viewed from above, the level curves of that hill are ellipses.
The bivariate normal is parametrized by five numbers: two means (μx, μy), two standard deviations (σx, σy), and one correlation coefficient ρ. Those five numbers get packaged into a mean vector and a 2×2 covariance matrix:

μ = (μx, μy),    Σ = [ σx²      ρσxσy ]
                     [ ρσxσy    σy²   ]
Reading the covariance matrix. The diagonal entries σx² and σy² are the variances you already know from 1D — they control spread along each axis independently. The off-diagonal entry ρσxσy is the covariance. It tells you whether the two variables tend to increase together (positive), move in opposite directions (negative), or vary independently (zero). The correlation ρ is just the covariance normalized to lie between −1 and +1.
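Packaging the three spread parameters into Σ, and recovering ρ from it, takes one line each in NumPy. The values below are arbitrary, chosen only for illustration:

```python
import numpy as np

def cov_matrix(sx, sy, rho):
    """Package (sigma_x, sigma_y, rho) into the 2x2 covariance matrix."""
    return np.array([[sx**2,       rho*sx*sy],
                     [rho*sx*sy,   sy**2    ]])

Sigma = cov_matrix(1.0, 2.0, 0.7)
# Recover rho by normalizing the covariance by the two standard deviations.
rho = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
```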
Why ellipses? The density formula for the 2D case involves the quadratic form (x−μ)T Σ−1 (x−μ). Setting that quadratic form equal to a constant gives you the equation of an ellipse. Points on the same ellipse have the same density, so the contours of the distribution are ellipses. The eigenvalues of Σ determine the lengths of the ellipse axes; the eigenvectors determine the tilt.
The prefactor 1/(2π√|Σ|) is the normalization
constant — it ensures the density integrates to 1 over the whole plane. The
determinant |Σ| measures the “volume” of the ellipse: larger
determinant means more spread, which means lower peak density, just like the 1D case.
What to notice. The dots are random samples from the current distribution. The nested ellipses are contour lines at 1, 1.8, 2.45, and 3.1 standard deviations (Mahalanobis distance). When ρ=0, the ellipses are axis-aligned — the axes of the ellipse line up with the coordinate axes. As |ρ| increases toward 1, the ellipses tilt and narrow. At ρ≈±1, all the mass concentrates near a line, and the two variables become nearly deterministic functions of each other.
Eigenvalue interpretation. The covariance matrix Σ can be decomposed as Σ = QΛQT, where Q is a rotation matrix (the eigenvectors) and Λ is diagonal (the eigenvalues). The eigenvectors give the directions of the ellipse axes. The eigenvalues give the variances along those axes. This decomposition is the reason the bivariate normal “looks like” an axis-aligned Gaussian that has been rotated: because that is exactly what it is.
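The decomposition is easy to verify numerically. The sketch below uses illustrative parameter values and NumPy's eigh (the eigensolver for symmetric matrices), then reconstructs Σ from its eigenvalues and eigenvectors:

```python
import numpy as np

sx, sy, rho = 1.0, 2.0, 0.6
Sigma = np.array([[sx**2, rho*sx*sy], [rho*sx*sy, sy**2]])

# Sigma = Q @ diag(evals) @ Q.T, with orthonormal eigenvectors in Q's columns.
evals, Q = np.linalg.eigh(Sigma)

# Eigenvalues are the variances along the ellipse axes; their square
# roots are the semi-axis lengths of the 1-Mahalanobis-unit contour.
axis_lengths = np.sqrt(evals)

# Reconstruction check: rotating the diagonal matrix back gives Sigma.
reconstructed = Q @ np.diag(evals) @ Q.T
```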
Concrete examples
- Hours studied vs. exam score (ρ≈0.7): more study time is associated with higher scores, but not perfectly — the ellipse is tilted upward but still has substantial width.
- Speed vs. travel time (ρ≈−0.8): faster speeds mean shorter trips, producing a downward tilt.
- GPS latitude vs. longitude on a road: the uncertainty along the road is much larger than across it, producing a very narrow ellipse aligned with the road direction.
- Manufacturing width vs. thickness (ρ≈0.9): if the production line drifts, both measurements drift together, so the ellipse is narrow and tilted.
III. Sampling: turning a distribution into actual data
Knowing the formula for a distribution is one thing. Being able to generate actual draws from it is another. In practice, every Monte Carlo simulation, every Bayesian posterior sample, and every generative model relies on being able to sample efficiently. For the multivariate normal, the algorithm is elegant and relies on a single idea: start with something simple and transform it.
1. Draw a vector z of independent standard normals — each component is an independent N(0,1). This gives you a spherical distribution centered at the origin.
2. Find a matrix L such that LLT = Σ. This is the Cholesky decomposition of the covariance matrix — it exists whenever Σ is positive definite, which holds for any non-degenerate covariance matrix.
3. Compute x = μ + Lz. The multiplication by L stretches and rotates the spherical distribution into the correct elliptical shape, and adding μ shifts it to the right location.
Why does this work? If z has covariance
I, then Lz has covariance
L · I · LT = LLT = Σ. That is
the entire proof. The linearity of expectations gives us the correct mean, and the
covariance transformation rule gives us the correct spread. This is why linear algebra
and probability fit together so naturally.
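The three-step recipe above can be sketched in a few lines of NumPy. The particular μ and Σ here are made up for illustration; np.linalg.cholesky returns the lower-triangular factor L with LLT = Σ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.8], [0.8, 2.0]])

L = np.linalg.cholesky(Sigma)         # lower-triangular, L @ L.T == Sigma
z = rng.standard_normal((50_000, 2))  # rows of independent N(0,1) draws
x = mu + z @ L.T                      # x = mu + L z, vectorized over rows

# Sample statistics should sit close to the true parameters at this n.
sample_mean = x.mean(axis=0)
sample_cov = np.cov(x, rowvar=False)
```

With 50,000 draws the sample mean and covariance land very close to μ and Σ; drop the count to 20 and they scatter visibly, which is exactly the sampling variability the figure demonstrates.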
The figure below draws actual samples from the bivariate normal you configured in Section II. Each time you hit “Redraw,” a fresh set of independent standard normals is generated and transformed through the Cholesky factor. Notice how the sample mean and sample covariance fluctuate from draw to draw, especially at small sample sizes. This is sampling variability, and it is the fundamental reason why statistical estimation is needed at all.
What to notice. At n=20, the sample estimates can deviate substantially from the true parameters; at n=300, they settle down close to them. This convergence is the law of large numbers in action. Try setting the count to 20 and hitting Redraw several times: the sample mean jumps around and the sample covariance matrix changes noticeably between draws. Then set it to 300 and do the same — the estimates become much more stable. The estimation error shrinks roughly in proportion to 1/√n.
IV. Estimation: fitting parameters from observed points
In practice, you rarely know μ and Σ — you have to estimate them from data. For the Gaussian, the maximum-likelihood estimates have a beautifully simple form: the sample mean is the average of the data points, and the sample covariance is the average of the outer products of the centered data:

μ̂ = (1/n) Σi xi,    Σ̂ = (1/n) Σi (xi − μ̂)(xi − μ̂)T
What the outer product does. For each data point, the term
(xi−μ̂)(xi−μ̂)T
is a 2×2 matrix. Its (1,1) entry is the squared x-deviation. Its (2,2) entry is
the squared y-deviation. Its off-diagonal entries are the product of the x- and
y-deviations. Averaging these over all data points gives you the sample variances on the
diagonal and the sample covariance off the diagonal. There is no black box here —
it is literally averaging squared deviations.
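In code, the estimator really is just averaging. A minimal NumPy sketch on synthetic data (the true parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=1000)

mu_hat = data.mean(axis=0)            # sample mean: just an average
centered = data - mu_hat
# Average of the outer products (x_i - mu_hat)(x_i - mu_hat)^T.
# centered.T @ centered sums all n outer products at once.
Sigma_hat = centered.T @ centered / len(data)
```

No iteration and no optimizer — one pass over the data produces both estimates.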
The figure below shows a small toy dataset. The solid ellipse is the fitted Gaussian using the sample mean and covariance. The dashed ellipse is the true distribution that generated the data. Click anywhere inside the plot to add new data points and watch the fitted ellipse adjust in real time. Adding points near the center stabilizes the estimates; adding outliers can tilt the ellipse dramatically.
(Interactive data table: point index, x, and y coordinates of the current dataset.)
What to notice. With 10–15 points, the fitted ellipse is often visibly tilted away from the true one. With 50+ points it becomes a close match. Try clicking a cluster of points far from the current mean — you will see the mean get pulled toward the new cluster and the covariance expand to cover the wider spread. This is exactly what happens in practice when your data contains subpopulations or outliers.
V. Marginals: projecting away one axis
A marginal distribution answers the question: if I completely ignore one of the two coordinates, what does the remaining coordinate look like on its own? Formally, you integrate (sum) the joint density over the coordinate you want to forget:

fX(x) = ∫ f(x, y) dy
For the Gaussian, this integration has a remarkable result: the marginal is itself Gaussian, and its parameters are simply the corresponding entries from the mean vector and the diagonal of the covariance matrix. You do not need ρ to compute the marginal. The marginal of X depends only on μx and σx². The marginal of Y depends only on μy and σy².
This is counterintuitive. You might expect that if two variables are correlated, the marginal of one should somehow reflect that correlation. It does not. Correlation controls the shape of the joint distribution — it determines how the ellipse tilts — but when you project that ellipse onto a single axis, the shadow has the same width regardless of the tilt. Try it: go back to Section II, change ρ while keeping σx and σy fixed, and watch the marginal curves below stay unchanged.
Geometrically, the marginal is the “shadow” of the 2D distribution projected onto one axis. Imagine shining a light straight down onto the xy-plane and looking at the resulting silhouette on the x-axis. The shape of that silhouette is the x-marginal. No matter how you tilt the ellipse, as long as its horizontal extent stays the same, the shadow stays the same.
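You can confirm the shadow claim numerically: sample with several different ρ while holding σx fixed, and the spread of the x-coordinates does not move. A sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = [0.0, 0.0]
sx, sy = 1.2, 0.8

x_stds = []
for rho in (0.0, 0.5, 0.9):
    Sigma = [[sx**2, rho*sx*sy], [rho*sx*sy, sy**2]]
    samples = rng.multivariate_normal(mu, Sigma, size=100_000)
    x_stds.append(samples[:, 0].std())  # spread of the x "shadow"
```

All three sample standard deviations come out near σx = 1.2, regardless of the tilt ρ imposes on the joint ellipse.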
(Interactive figure: the joint model with the X marginal projected on top, labeled e.g. X ~ N(0.00, 1.20²).)
Why this matters. In many applications, you observe only one variable at a time. If you are looking at a patient’s blood pressure without knowing their cholesterol, you are looking at a marginal. The marginal tells you about the spread of that one variable in isolation, but it throws away all information about how the variables relate. That relationship information is exactly what conditionals recover.
VI. Conditionals: the Gaussian version of “given that”
A conditional distribution answers a much sharper question than the marginal: if I know that X takes a specific value x0, what can I say about Y? For the Gaussian, the answer is again Gaussian, with two specific changes:

Y | X = x0  ~  N( μy + ρ(σy/σx)(x0 − μx),  (1 − ρ²)σy² )
The conditional mean shifts. It does not stay at μy. Instead, it moves by an amount proportional to how far x0 is from μx, scaled by the regression coefficient ρσy/σx. This is exactly the slope of the least-squares regression line of Y on X. The conditional mean traces out a straight line as x0 varies — and that line is the regression line.
The conditional variance shrinks. It is reduced from σy² to (1−ρ²)σy². The factor (1−ρ²) is the fraction of variance in Y that is not explained by X. When ρ=0, no variance is explained and the conditional equals the marginal. When |ρ|=1, all variance is explained and the conditional collapses to a point — if you know X exactly, you know Y exactly.
Connection to R². In linear regression, R² is the proportion of variance explained by the predictor. For a bivariate normal, R² = ρ². The conditional variance is (1−R²) times the marginal variance. So the entire machinery of simple linear regression is already hiding inside the bivariate normal — the conditional distribution is literally what regression computes.
The figure below adds two slice lines to the joint distribution: a vertical slice at x0 (which reveals Y|X) and a horizontal slice at y0 (which reveals X|Y). Drag the slice lines or use the sliders. Compare each conditional curve to its marginal (shown in teal): the conditional is always narrower or equal, never wider.
(Interactive figure: marginal and conditional curves with a slice readout for x0 and y0. Teal: X marginal; gold: X | Y conditional; coral: Y | X conditional.)
What to notice. Move x0 to +2 standard deviations with high positive ρ. The conditional mean of Y jumps well above μy, and the conditional curve becomes noticeably narrower than the marginal. Now set ρ=0. The conditional becomes identical to the marginal: observing X tells you nothing about Y. This is the operational definition of independence for Gaussian variables.
The closure property. The fact that conditioning a Gaussian on a subset of variables produces another Gaussian is the single most important algebraic property of this family. It is the reason Kalman filters work (prediction + update are both Gaussian operations), the reason Gaussian processes produce closed-form posteriors, and the reason Bayesian linear regression has an exact solution. Every time you see a model that says “assume the data is Gaussian,” this closure property is usually the real motivation.
VII. Summary and connections
- Density is not probability. The value of the density at a point is a relative likelihood, not a probability. Probability is area under the curve between two values. The density can exceed 1.
- Two parameters control everything in 1D. μ sets the center, σ sets the width. The 68-95-99.7 rule gives you quick probability estimates without computing any integrals.
- The covariance matrix is the whole story in 2D. Diagonal entries are variances. The off-diagonal entry is the covariance. The eigenvalues and eigenvectors of Σ directly give you the ellipse shape and orientation.
- Sampling = transform a standard normal. Draw z ~ N(0, I), compute x = μ + Lz where LLT = Σ. The Cholesky factor L does the stretching and rotating. This is how every Monte Carlo sampler generates multivariate normals.
- MLE is just averaging. The maximum-likelihood mean is the sample average. The maximum-likelihood covariance is the average outer product of centered observations. No gradient descent, no iteration — just summation.
- Marginals ignore correlation. The marginal of X depends only on μx and σx², no matter how strong the correlation. Marginals are projections, and projections discard relationship information.
- Conditionals are regression. Conditioning on X = x0 gives a Gaussian for Y whose mean follows the regression line and whose variance is (1 − ρ²)σy². The conditional distribution is what linear regression computes.
- Closure is the real superpower. Marginals, conditionals, and linear combinations of Gaussians are all Gaussian. This closure property is why the family dominates: Kalman filters, Gaussian processes, Bayesian linear regression, and factor analysis all rely on it.
Where to go from here
Higher dimensions. Everything above generalizes directly: μ becomes a d-dimensional vector, Σ becomes a d×d matrix, contours become ellipsoids. Marginals and conditionals remain Gaussian with the same formulas (just involving submatrices of Σ).
Gaussian processes. A GP is a distribution over functions, defined by the property that any finite collection of function values is jointly Gaussian. Conditioning on observed data points is exactly the conditional formula from Section VI, applied to a potentially infinite-dimensional covariance structure.
Kalman filters. The state of a linear dynamical system at each time step is Gaussian. The predict step is a marginal (project forward in time). The update step is a conditional (incorporate a new measurement). Both operations stay Gaussian, which is why the Kalman filter is exact and efficient.
Maximum entropy. Among all distributions with a given mean and covariance, the Gaussian has the highest entropy — it makes the fewest additional assumptions. This is the information-theoretic justification for using Gaussians as default models of uncertainty.