Joint Distributions - What Multiple Random Variables Know About Each Other
Helpful context:
- Random Variables - Putting Numbers on Uncertain Outcomes
- Expectation, Variance & Covariance - The Center, the Spread, and the Relationship
- Conditional Probability - What You Already Know Changes Everything
A student studies hard for the midterm and gets an 85. You want to predict their final exam score. Should you expect an 85 again? Higher? Lower?
There’s a real question buried here: does midterm performance tell you something about final performance? And if so, how much? To answer it, you can’t treat the midterm score and the final score as two separate random variables living independent lives. You need to model them together - as a pair - and understand how they move together.
That’s what joint distributions do. They are the mathematics of two (or more) random variables considered simultaneously.
The Joint PMF and PDF
For two discrete random variables $X$ and $Y$, the joint PMF is simply:
$$p(x, y) = P(X = x,\ Y = y).$$
This assigns a probability to every possible pair of values. If $X$ is midterm score and $Y$ is final score, then $p(80, 90)$ is the probability that a randomly chosen student scored 80 on the midterm and 90 on the final. The joint PMF must satisfy $p(x,y) \geq 0$ and $\sum_x \sum_y p(x,y) = 1$.
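As a minimal sketch, a joint PMF over a small grid can be stored as a 2D array. The score bands and the probabilities below are hypothetical numbers chosen purely for illustration; rows index the midterm $X$ and columns index the final $Y$.

```python
import numpy as np

# Hypothetical score bands (assumed for illustration only).
x_vals = np.array([70, 80, 90])   # midterm X
y_vals = np.array([70, 80, 90])   # final Y

# Hypothetical joint PMF: joint[i, j] = P(X = x_vals[i], Y = y_vals[j]).
joint = np.array([
    [0.20, 0.10, 0.05],
    [0.10, 0.20, 0.10],
    [0.05, 0.10, 0.10],
])

# The two conditions every joint PMF must satisfy.
is_nonnegative = bool(np.all(joint >= 0))
sums_to_one = bool(np.isclose(joint.sum(), 1.0))

# Look up p(80, 90): find the row for X = 80 and the column for Y = 90.
i = int(np.where(x_vals == 80)[0][0])
j = int(np.where(y_vals == 90)[0][0])
p_80_90 = joint[i, j]

print(is_nonnegative, sums_to_one, p_80_90)
```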
For continuous random variables, we instead have a joint PDF $f(x,y) \geq 0$ such that probabilities are computed by integrating over regions:
$$P((X,Y) \in A) = \iint_A f(x,y) dx dy.$$
The normalization condition is $\iint f(x,y) dx dy = 1$. The density $f(x,y)$ can exceed 1 at a point - what matters is that it integrates to 1 over all of $\mathbb{R}^2$.
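A quick worked example, chosen here only for illustration, shows a density that exceeds 1 everywhere it is nonzero. Take $(X, Y)$ uniform on the square $[0, \tfrac12] \times [0, \tfrac12]$:

$$f(x,y) = \begin{cases} 4, & 0 \le x \le \tfrac12,\ 0 \le y \le \tfrac12, \\ 0, & \text{otherwise,} \end{cases} \qquad \iint f(x,y)\, dx\, dy = 4 \cdot \tfrac12 \cdot \tfrac12 = 1.$$

The height is 4, but the square has area $\tfrac14$, so the total probability is still 1.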
Marginal Distributions: Zooming Out
Suppose you have the full joint distribution but you only care about $X$ by itself - you want to forget $Y$ exists. How do you recover $X$’s individual distribution from the joint?
You sum out (or integrate out) $Y$:
$$f_X(x) = \int_{-\infty}^{\infty} f(x,y) dy, \qquad p_X(x) = \sum_y p(x,y).$$
This is called the marginal distribution of $X$. Similarly for $Y$:
$$f_Y(y) = \int_{-\infty}^{\infty} f(x,y) dx.$$
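Think of it geometrically. The joint PDF $f(x,y)$ is a landscape over the plane. The marginal $f_X(x)$ is what you get by collapsing that landscape onto the $x$-axis: at each $x$, you add up all the mass along the $y$ direction, so the height of the resulting curve is the area of the cross-section at that $x$. All information about $y$ is folded into a single curve.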
The word “marginal” comes from an old statistical convention: when you display a joint distribution as a table, you write the row-sums and column-sums in the margins of the table.
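In the discrete case, the "sums in the margins" picture translates directly into array operations. A minimal sketch, using a hypothetical joint table (rows are values of $X$, columns are values of $Y$):

```python
import numpy as np

# Hypothetical joint PMF: rows index X, columns index Y.
joint = np.array([
    [0.20, 0.10, 0.05],
    [0.10, 0.20, 0.10],
    [0.05, 0.10, 0.10],
])

# Marginal of X: sum out Y, i.e. sum across each row.
p_x = joint.sum(axis=1)

# Marginal of Y: sum out X, i.e. sum down each column.
p_y = joint.sum(axis=0)

print("p_X:", p_x)
print("p_Y:", p_y)
```

Each marginal is itself a valid PMF, so both arrays sum to 1.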
Independence: When Variables Don’t Talk to Each Other
Two random variables are independent if knowing the value of one tells you nothing about the other. More precisely:
$$X \perp Y \iff f(x,y) = f_X(x) \cdot f_Y(y) \text{ for all } x, y.$$
The joint factors as a product of the marginals. This is the mathematical statement that $X$ and $Y$ have no relationship.
Independence has a useful consequence: if $g$ and $h$ are any functions, then
$$E[g(X) h(Y)] = E[g(X)] \cdot E[h(Y)].$$
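The expectation of a product separates into a product of expectations (provided the expectations exist). Without independence this factorization is not guaranteed; in fact, it holds for every choice of $g$ and $h$ only when $X$ and $Y$ are independent.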
For our student example: midterm and final scores are definitely not independent. A student who does well on the midterm is more likely to do well on the final. The joint distribution is not a product of marginals.
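For a discrete joint table, the factorization test can be sketched as a comparison between the table and the outer product of its marginals. The helper name `is_independent` and both tables are hypothetical, introduced only for this illustration.

```python
import numpy as np

def is_independent(joint, tol=1e-12):
    """Return True if joint[i, j] == p_X[i] * p_Y[j] for every cell."""
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(p_x, p_y), atol=tol))

# Hypothetical dependent table: mass is concentrated near the diagonal.
dependent = np.array([
    [0.20, 0.10, 0.05],
    [0.10, 0.20, 0.10],
    [0.05, 0.10, 0.10],
])

# A table built as a product of marginals is independent by construction.
product = np.outer([0.35, 0.40, 0.25], [0.35, 0.40, 0.25])

print(is_independent(dependent), is_independent(product))
```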
Conditional Distributions: Slicing the Joint
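Here is the most useful operation in the joint distribution toolkit. Suppose you’ve observed that $Y = y$, and you want the distribution of $X$ given that observation. (In the student example, where $X$ is the midterm and $Y$ is the final, predicting the final from an observed midterm is the same operation with the roles of $X$ and $Y$ swapped.)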
This is the conditional distribution of $X$ given $Y = y$:
$$f_{X|Y}(x \mid y) = \frac{f(x,y)}{f_Y(y)}.$$
This is not just a number - it’s a full probability distribution over $x$, for a fixed value of $y$. You’ve taken the joint distribution and sliced it at $Y = y$, then renormalized so it integrates to 1.
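The definition requires $f_Y(y) > 0$. Rearranging it gives the multiplication rule: $f(x,y) = f_{X|Y}(x \mid y) \cdot f_Y(y)$. The joint equals the conditional times the marginal. Writing the same rule with the roles swapped, $f(x,y) = f_{Y|X}(y \mid x) \cdot f_X(x)$, and equating the two expressions gives Bayes' theorem for densities.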
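The slice-and-renormalize step can be sketched for a discrete table as follows. The function name `conditional_x_given_y` and the table are hypothetical; rows index $X$ and columns index $Y$.

```python
import numpy as np

def conditional_x_given_y(joint, j):
    """PMF of X given Y = (value in column j): slice column j, then renormalize."""
    column = joint[:, j]          # the slice p(x, y_j) for all x
    p_y_j = column.sum()          # marginal probability P(Y = y_j)
    if p_y_j == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return column / p_y_j

# Hypothetical joint PMF.
joint = np.array([
    [0.20, 0.10, 0.05],
    [0.10, 0.20, 0.10],
    [0.05, 0.10, 0.10],
])

cond = conditional_x_given_y(joint, j=2)
print(cond, cond.sum())
```

The raw slice sums to $P(Y = y_j)$, not to 1; dividing by the marginal is what turns it into a proper distribution over $x$.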
Covariance from the Joint
If you have the joint distribution, you can compute the covariance directly:
$$\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = \iint xy f(x,y) dx dy - E[X] E[Y].$$
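The quantity $E[XY] = \iint xy f(x,y) dx dy$ is the cross-moment, computed by integrating $xy$ against the joint density. On its own it mixes co-movement with the size of the means; subtracting $E[X]E[Y]$ removes the contribution of the means, leaving a measure of how $X$ and $Y$ move together on average.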
If $X$ and $Y$ are independent, the joint factors and $E[XY] = E[X]E[Y]$, so $\text{Cov}(X,Y) = 0$. Independent variables are always uncorrelated. The converse is not true - uncorrelated does not mean independent.
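A standard counterexample for the converse is $X$ uniform on $\{-1, 0, 1\}$ with $Y = X^2$. The sketch below computes the covariance from the joint PMF (the discrete analogue of the integral above) and checks that the joint does not factor. The function name `covariance_from_joint` is a hypothetical helper.

```python
import numpy as np

def covariance_from_joint(x_vals, y_vals, joint):
    """Cov(X, Y) = E[XY] - E[X]E[Y], computed from a discrete joint PMF."""
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    e_x = x_vals @ p_x
    e_y = y_vals @ p_y
    e_xy = x_vals @ joint @ y_vals   # sum over i, j of x_i * y_j * p(x_i, y_j)
    return e_xy - e_x * e_y

# X is uniform on {-1, 0, 1}; Y = X**2 takes values in {0, 1}.
x_vals = np.array([-1.0, 0.0, 1.0])
y_vals = np.array([0.0, 1.0])
joint = np.array([
    [0.0, 1 / 3],   # X = -1 forces Y = 1
    [1 / 3, 0.0],   # X =  0 forces Y = 0
    [0.0, 1 / 3],   # X =  1 forces Y = 1
])

cov = covariance_from_joint(x_vals, y_vals, joint)

# Independence would require joint == outer(p_X, p_Y); here it fails.
factors = bool(np.allclose(joint, np.outer(joint.sum(axis=1), joint.sum(axis=0))))

print(cov, factors)
```

Here $Y$ is completely determined by $X$, yet the covariance is zero, because covariance only detects linear co-movement.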
The 2D Gaussian: The Most Important Joint Distribution
The bivariate Gaussian is the joint distribution you will encounter everywhere in statistics and machine learning. For zero-mean variables with unit variance, it has the form:
$$f(x,y) \propto \exp\left(-\frac{1}{2(1-\rho^2)}\left(x^2 - 2\rho xy + y^2\right)\right).$$
The single parameter $\rho \in (-1, 1)$ is the correlation between $X$ and $Y$.
The level curves of $f$ - the sets where $f(x,y)$ is constant - are ellipses. The shape of the ellipse depends on $\rho$:
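- $\rho = 0$: the ellipses are circles. $X$ and $Y$ are independent (for a jointly Gaussian pair, zero correlation does imply independence, because the density factors). Knowing $X$ tells you nothing about $Y$.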
- $\rho = 0.9$: thin ellipse tilted at $45°$. Large $X$ strongly predicts large $Y$.
- $\rho = -0.9$: thin ellipse tilted at $-45°$. Large $X$ strongly predicts small $Y$.
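- $\rho \to 1$ or $\rho \to -1$ (degenerate limit): the ellipse collapses to a line, and $Y$ becomes a deterministic linear function of $X$. The density above is no longer defined there, which is why $\rho$ is restricted to $(-1, 1)$.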
The conditional distribution of $X$ given $Y = y$ in the bivariate Gaussian is also Gaussian:
$$X \mid Y = y \sim \mathcal{N}(\rho y,\ 1 - \rho^2).$$
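The conditional mean $\rho y$ is exactly the regression line. The conditional variance $1 - \rho^2$ tells you how much uncertainty remains about $X$ after seeing $Y$. When $|\rho|$ is close to 1, observing $Y$ almost completely pins down $X$. This formula is why linear regression and the bivariate Gaussian are so closely related. It also answers the opening question, at least if exam scores are roughly bivariate Gaussian: in standardized units, a student who scores $y$ standard deviations above average on one exam is expected to land only $\rho y$ standard deviations above average on the other, so whenever $|\rho| < 1$ the best guess is pulled back toward the mean.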
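The conditional mean and variance can be checked by simulation. This is a rough sketch under assumed settings: $\rho = 0.8$, a fixed seed, and 200,000 samples are arbitrary choices. It fits a least-squares line of $X$ on $Y$ and compares the slope and residual variance with $\rho$ and $1 - \rho^2$.

```python
import numpy as np

rho = 0.8                         # assumed correlation for the experiment
rng = np.random.default_rng(0)    # fixed seed so the run is reproducible

# Zero-mean, unit-variance bivariate Gaussian with correlation rho.
cov = np.array([[1.0, rho],
                [rho, 1.0]])
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=200_000)
x, y = samples[:, 0], samples[:, 1]

# Least-squares line for X as a function of Y; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(y, x, deg=1)

# Variance of what is left of X after removing the fitted line.
residual_var = np.var(x - (slope * y + intercept))

print("slope:", slope, "theory:", rho)
print("residual variance:", residual_var, "theory:", 1 - rho**2)
```

The estimates fluctuate with the seed and sample size, so they should be close to the theoretical values rather than equal to them.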
Transformations: Distributions of Sums
One common operation: you have $X$ and $Y$ with known distributions, and you want the distribution of $Z = X + Y$.
For independent discrete variables, the convolution formula is:
$$P(Z = z) = \sum_x P(X = x) P(Y = z - x).$$
For independent continuous variables:
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z - x) dx.$$
This integral is called the convolution of $f_X$ and $f_Y$. You can think of it as sliding one density over the other and measuring overlap.
For example, the sum of two independent normal random variables is normal: if $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$, then $X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$. The convolution of two Gaussian densities is another Gaussian density - a fact that becomes much easier to prove using moment-generating functions.
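The discrete convolution formula can be tried out on the sum of two fair six-sided dice, an example picked here for illustration. `np.convolve` computes exactly the sum $\sum_x P(X = x) P(Y = z - x)$ when each PMF is stored as an array over consecutive integer values.

```python
import numpy as np

# PMF of one fair die over the faces 1..6.
die = np.full(6, 1 / 6)

# PMF of the sum of two independent dice. The result has 11 entries,
# corresponding to the totals 2..12 in order.
pmf_sum = np.convolve(die, die)
totals = np.arange(2, 13)

p_seven = pmf_sum[totals == 7][0]
print(dict(zip(totals.tolist(), np.round(pmf_sum, 4).tolist())))
```

The familiar triangular shape appears, peaking at a total of 7 with probability $6/36$.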
Discomfort check. Here is a subtlety that trips up almost everyone: the marginals do not determine the joint.
Two pairs $(X, Y)$ can have exactly the same marginal distributions for $X$ and for $Y$, but completely different joint distributions - and completely different dependence structures. For example, $X$ and $Y$ could both be standard normal marginally, yet be independent ($\rho = 0$) or almost perfectly correlated ($\rho = 0.99$). The marginals alone cannot distinguish these cases.
This means that if you only measure $X$ and $Y$ separately, you cannot reconstruct their joint behavior. The joint distribution is genuinely more information than both marginals combined.
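The smallest version of this phenomenon uses two binary variables. In the sketch below (both tables are hypothetical), the two joints have identical marginals, yet one describes independent coin flips and the other describes two coins that always agree.

```python
import numpy as np

# Two fair coins flipped independently.
independent = np.array([
    [0.25, 0.25],
    [0.25, 0.25],
])

# Two fair coins that always show the same face.
coupled = np.array([
    [0.5, 0.0],
    [0.0, 0.5],
])

def marginals(joint):
    """Return (p_X, p_Y) for a joint PMF with X on rows and Y on columns."""
    return joint.sum(axis=1), joint.sum(axis=0)

same_marginals = all(
    np.allclose(a, b) for a, b in zip(marginals(independent), marginals(coupled))
)
same_joint = bool(np.allclose(independent, coupled))

print(same_marginals, same_joint)
```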
The modern theory for describing this extra structure - the “dependence structure” of $(X, Y)$ separate from their individual marginals - is called copula theory. A copula captures all the dependence information (not just linear correlation) and none of the marginal information. It’s the tool practitioners use when they want to model, say, the joint behavior of stock returns while specifying the marginal distribution of each stock separately.
Connection to Machine Learning
In machine learning, data is almost never truly independent. Your training examples $(x_1, y_1), \ldots, (x_n, y_n)$ are drawn from some process, and that process usually induces correlations.
The clean fiction we make is i.i.d.: independent and identically distributed. Under this assumption, the joint distribution over the whole dataset factors:
$$f(x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i).$$
This factorization is what turns likelihood functions into products and log-likelihoods into sums. It’s also part of why stochastic gradient descent works: because the objective is a sum over examples, the gradient computed on a randomly sampled mini-batch is an unbiased estimate of the full-data gradient.
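The product-to-sum step can be checked numerically. The sketch below assumes a Gaussian model with made-up data and parameters (all hypothetical) and confirms that the log of the product of per-example densities equals the sum of per-example log-densities.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2), evaluated elementwise."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Hypothetical observations and model parameters.
data = np.array([0.3, -1.2, 0.8, 2.1])
mu, sigma = 0.5, 1.0

# Under i.i.d., the log-likelihood is a sum over examples ...
log_lik = gaussian_logpdf(data, mu, sigma).sum()

# ... which matches the log of the product of individual densities.
lik_product = np.prod(np.exp(gaussian_logpdf(data, mu, sigma)))

print(log_lik, np.log(lik_product))
```

In practice the sum form is preferred, since a product of many small densities underflows to zero in floating point long before the sum of logs causes any trouble.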
But the truth is that the real object is always the full joint distribution over $(X_1, \ldots, X_n)$. I.i.d. is an assumption - sometimes a good one, sometimes not. Text data has strong sequential dependencies. Medical records within a hospital are correlated. Images in the same scene share lighting and context. When the i.i.d. assumption breaks down, you need joint distribution models - graphical models, sequence models, Gaussian processes - that explicitly represent the dependence structure.
Summary
| Concept | Key Formula |
|---|---|
| Joint PMF | $p(x,y) = P(X=x, Y=y)$ |
| Joint PDF | $P((X,Y) \in A) = \iint_A f(x,y) dx dy$ |
| Marginal of $X$ | $f_X(x) = \int f(x,y) dy$ |
| Independence | $f(x,y) = f_X(x) f_Y(y)$ |
| Conditional distribution | $f_{X \mid Y}(x \mid y) = f(x,y)/f_Y(y)$ |
| Covariance from joint | $\text{Cov}(X,Y) = \iint xy f(x,y) dx dy - E[X]E[Y]$ |
| Convolution (sum $Z = X+Y$) | $f_Z(z) = \int f_X(x) f_Y(z-x) dx$ |
| Bivariate Gaussian ($\rho = 0$) | Independent; circular contours |
| Bivariate Gaussian ($\rho \neq 0$) | Correlated; elliptical contours |
| Marginals determine joint? | No - the joint has strictly more information |
The joint distribution is the full story. Marginals are what you see when you look at each variable in isolation. Conditional distributions are what you see when you fix one variable and look at the other. All of these are projections and slices of a single object: $f(x,y)$.