# Joint Distributions
A single random variable describes one quantity. Joint distributions describe the simultaneous behavior of two or more quantities - how they co-vary, what knowing one tells you about the other, and whether they are truly independent. Joint distributions are the mathematical foundation for regression, graphical models, and generative modeling.
## Joint PMF and PDF
Definition (Discrete). The joint PMF of discrete random variables $X$ and $Y$ is the function $p: \mathbb{R}^2 \to [0,1]$ defined by:
$$p(x, y) = P(X = x,\, Y = y).$$
It satisfies $p(x,y) \geq 0$ for all $(x,y)$ and $\sum_{x}\sum_{y} p(x,y) = 1$.
Definition (Continuous). The joint PDF of continuous random variables $X$ and $Y$ is a function $f: \mathbb{R}^2 \to [0,\infty)$ such that for any (measurable) region $A \subseteq \mathbb{R}^2$:
$$P((X,Y) \in A) = \iint_A f(x,y)\, dx\, dy.$$
It satisfies $f(x,y) \geq 0$ and $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\, dx\, dy = 1$.
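To make the normalization condition concrete, here is a minimal NumPy sketch (the density and grid resolution are illustrative choices, not from the text) that checks $\iint f = 1$ numerically and computes the probability of a rectangular region:

```python
import numpy as np

# Illustrative joint PDF: the standard bivariate Gaussian,
# f(x, y) = exp(-(x^2 + y^2) / 2) / (2*pi).
def f(x, y):
    return np.exp(-(x**2 + y**2) / 2) / (2 * np.pi)

# Midpoint rule over [-6, 6]^2 (the mass outside is negligible).
m = -6 + (np.arange(1200) + 0.5) * 0.01
X, Y = np.meshgrid(m, m, indexing="ij")
print(f(X, Y).sum() * 0.01**2)   # ~1.0: the density integrates to one

# P((X, Y) in A) for the rectangle A = [0, 1] x [0, 1].
a = (np.arange(200) + 0.5) / 200
XA, YA = np.meshgrid(a, a, indexing="ij")
print(f(XA, YA).sum() / 200**2)  # ~0.1165
```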
## Joint CDF
The joint cumulative distribution function is defined for all $(x,y) \in \mathbb{R}^2$:
$$F(x,y) = P(X \leq x,\, Y \leq y).$$
For continuous distributions, $f(x,y) = \partial^2 F(x,y)/\partial x\, \partial y$ wherever the mixed partial derivative exists.
## Marginal Distributions
The marginal distribution of $X$ is recovered by summing (or integrating) over $Y$:
$$p_X(x) = \sum_{y} p(x,y), \qquad f_X(x) = \int_{-\infty}^{\infty} f(x,y)\, dy.$$
Similarly for $Y$: $p_Y(y) = \sum_x p(x,y)$ and $f_Y(y) = \int_{-\infty}^{\infty} f(x,y)\, dx$.
Marginalization discards information about the relationship between $X$ and $Y$. Two joint distributions can have identical marginals yet very different joint structure (e.g., independent vs. perfectly correlated).
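A small NumPy sketch makes the last point concrete (the two tables are illustrative): two different joint PMFs over $\{0,1\} \times \{0,1\}$ with identical marginals.

```python
import numpy as np

# Two joint PMFs with identical uniform marginals.
independent = np.array([[0.25, 0.25],
                        [0.25, 0.25]])   # p(x,y) = p_X(x) p_Y(y)
correlated  = np.array([[0.5, 0.0],
                        [0.0, 0.5]])     # Y = X with probability 1

for p in (independent, correlated):
    p_X = p.sum(axis=1)   # marginal of X: sum over y (columns)
    p_Y = p.sum(axis=0)   # marginal of Y: sum over x (rows)
    print(p_X, p_Y)       # both print [0.5 0.5] [0.5 0.5]
```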
## Conditional Distributions
Definition. The conditional PMF of $X$ given $Y = y$ (with $p_Y(y) > 0$) is:
$$p_{X|Y}(x \mid y) = \frac{p(x,y)}{p_Y(y)}.$$
Definition. The conditional PDF of $X$ given $Y = y$ (with $f_Y(y) > 0$) is:
$$f_{X|Y}(x \mid y) = \frac{f(x,y)}{f_Y(y)}.$$
For a fixed $y$, the conditional distribution $p_{X|Y}(\cdot \mid y)$ is a valid PMF (or PDF) over $x$.
Multiplication rule. The joint factors as:
$$f(x,y) = f_{X|Y}(x \mid y)\, f_Y(y) = f_{Y|X}(y \mid x)\, f_X(x).$$
This is Bayes' theorem in disguise: dividing one factorization by the other gives $f_{X|Y}(x \mid y) = f_{Y|X}(y \mid x)\, f_X(x) / f_Y(y)$.
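The following NumPy sketch (with an arbitrary illustrative joint table) computes conditional PMFs by dividing each column of the joint by the corresponding marginal, and checks the multiplication rule:

```python
import numpy as np

# Joint PMF p(x, y) as a table: rows index x, columns index y.
p = np.array([[0.10, 0.20],
              [0.30, 0.40]])

p_Y = p.sum(axis=0)              # marginal of Y
p_X_given_Y = p / p_Y            # column y divided by p_Y(y) (broadcasts)

print(p_X_given_Y.sum(axis=0))   # each column sums to 1: valid PMFs
print(np.allclose(p_X_given_Y * p_Y, p))  # multiplication rule recovers joint
```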
## Independence of Random Variables
Definition. $X$ and $Y$ are independent if their joint distribution factors:
$$p(x,y) = p_X(x)\,p_Y(y) \quad \text{(discrete)}, \qquad f(x,y) = f_X(x)\,f_Y(y) \quad \text{(continuous)}.$$
Equivalently, $X$ and $Y$ are independent iff $F(x,y) = F_X(x)F_Y(y)$ for all $(x,y)$, or iff $f_{X|Y}(x \mid y) = f_X(x)$ for all $y$.
Theorem. If $X$ and $Y$ are independent, then for any functions $g, h$:
$$E[g(X)\,h(Y)] = E[g(X)]\cdot E[h(Y)].$$
This follows directly from the factorization of the joint: the double integral (or sum) separates.
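A quick Monte Carlo check of this factorization (the distributions and test functions are arbitrary illustrative choices with finite expectations):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Independent draws: X ~ Exp(1), Y ~ N(0, 1).
x = rng.exponential(1.0, n)
y = rng.standard_normal(n)

g = np.sin(x)       # arbitrary bounded g
h = y**2            # arbitrary h with finite mean

print(np.mean(g * h))            # E[g(X) h(Y)]
print(np.mean(g) * np.mean(h))   # E[g(X)] E[h(Y)]: matches up to MC error
```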
## Covariance and Correlation
For jointly distributed $X$ and $Y$:
$$\text{Cov}(X,Y) = E[(X - \mu_X)(Y-\mu_Y)] = E[XY] - E[X]E[Y].$$
$$\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}, \quad -1 \leq \rho \leq 1.$$
The covariance matrix of the random vector $\mathbf{X} = (X_1,\ldots,X_n)^T$ is:
$$\Sigma = E[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^T], \quad \Sigma_{ij} = \text{Cov}(X_i,X_j).$$
$\Sigma$ is symmetric and positive semidefinite. Independence implies $\text{Cov}(X_i, X_j) = 0$ for $i \neq j$, so $\Sigma$ is diagonal - but a diagonal covariance matrix does not imply independence in general.
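A short NumPy sketch (on synthetic data; the dependence structure is an illustrative choice) that estimates a covariance matrix and checks symmetry and positive semidefiniteness:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Correlated data: X2 = X1 + noise; X3 independent of both.
x1 = rng.standard_normal(n)
x2 = x1 + 0.5 * rng.standard_normal(n)
x3 = rng.standard_normal(n)
data = np.stack([x1, x2, x3])          # shape (3, n): rows are variables

Sigma = np.cov(data)                   # sample covariance matrix, shape (3, 3)
print(np.allclose(Sigma, Sigma.T))     # symmetric: True
print(np.linalg.eigvalsh(Sigma))       # all nonnegative: positive semidefinite
```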
## The Multivariate Gaussian
The multivariate Gaussian is the most important joint distribution. It is entirely characterized by its mean vector and covariance matrix.
Definition. A random vector $\mathbf{X} \in \mathbb{R}^n$ follows the multivariate Gaussian distribution $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ if it has PDF:
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right),$$
where $\boldsymbol{\mu} \in \mathbb{R}^n$ is the mean vector and $\Sigma \in \mathbb{R}^{n \times n}$ is a symmetric positive definite covariance matrix.
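The density can be evaluated directly from this formula; here is a minimal NumPy sketch (the parameters and evaluation point are illustrative):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, evaluated straight from the formula."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.3, -0.2]), mu, Sigma))
```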
### Geometric Interpretation
The level sets of $f$ are ellipsoids: $\{(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}) = c^2\}$. The eigenvectors of $\Sigma$ give the principal axes of the ellipsoid, and the eigenvalues give the squared semi-axis lengths up to scale (the semi-axis along eigenvector $\mathbf{v}_i$ has length $c\sqrt{\lambda_i}$). When $\Sigma = \sigma^2 I$ (spherical Gaussian), the level sets are spheres.
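The eigendecomposition below illustrates this for a concrete $2 \times 2$ covariance (illustrative numbers):

```python
import numpy as np

Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

# Eigenvectors = principal axes of the contour ellipses;
# semi-axis lengths scale with sqrt(eigenvalue).
eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending order
print(eigvals)   # [0.1, 1.9]: a thin, elongated ellipse
print(eigvecs)   # columns: +-(1,-1)/sqrt(2) (short axis), (1,1)/sqrt(2) (long)
```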
### Marginals and Conditionals
Theorem (Marginals). Any marginal of a multivariate Gaussian is Gaussian. If we partition $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2)$ with corresponding partition of $\boldsymbol{\mu} = (\boldsymbol{\mu}_1, \boldsymbol{\mu}_2)$ and
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
then $\mathbf{X}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_{11})$.
Theorem (Conditionals). The conditional distribution of $\mathbf{X}_1$ given $\mathbf{X}_2 = \mathbf{x}_2$ is also Gaussian:
$$\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \sim \mathcal{N}\!\left(\boldsymbol{\mu}_{1|2},\, \Sigma_{1|2}\right),$$
where:
$$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),$$
$$\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
The matrix $\Sigma_{1|2}$ is the Schur complement of $\Sigma_{22}$ in $\Sigma$. The conditional mean is linear in $\mathbf{x}_2$ - this is why least-squares linear regression coincides with maximum likelihood estimation under Gaussian assumptions.
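These formulas translate directly into code; a minimal NumPy sketch with illustrative scalar blocks (the same lines work for blocks of any size):

```python
import numpy as np

# Partitioned parameters for X = (X1, X2); numbers are illustrative.
mu1, mu2 = np.array([0.0]), np.array([0.0])
S11 = np.array([[1.0]]); S12 = np.array([[0.9]])
S21 = S12.T;             S22 = np.array([[1.0]])

def conditional(x2):
    """Parameters of X1 | X2 = x2 via the Schur complement."""
    K = S12 @ np.linalg.inv(S22)      # regression coefficient matrix
    mu_cond = mu1 + K @ (x2 - mu2)    # linear in x2
    Sigma_cond = S11 - K @ S21        # Schur complement; does not depend on x2
    return mu_cond, Sigma_cond

print(conditional(np.array([2.0])))   # mean 1.8 = 0.9*2, variance 0.19
```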
Key property. For the multivariate Gaussian, zero covariance implies independence. That is, if $\Sigma_{12} = 0$ (i.e., $\mathbf{X}_1$ and $\mathbf{X}_2$ are uncorrelated), then they are independent. This is special to Gaussians - it fails for general distributions.
## Examples
Bivariate Gaussian contours. Let $(X,Y) \sim \mathcal{N}(\mathbf{0}, \Sigma)$ with $\Sigma = \begin{pmatrix}1 & \rho \\ \rho & 1\end{pmatrix}$. The correlation parameter $\rho$ controls the shape:
- $\rho = 0$: circular contours, $X$ and $Y$ are independent.
- $\rho = 0.9$: thin ellipse tilted at $45°$, $X$ and $Y$ strongly co-vary.
- $\rho = -0.9$: thin ellipse tilted at $-45°$; large $X$ tends to go with small $Y$.
The conditional distribution of $X$ given $Y = y$ is:
$$X \mid Y = y \sim \mathcal{N}(\rho y,\, 1 - \rho^2).$$
The conditional mean $\rho y$ is the regression line, and the conditional variance $1 - \rho^2$ decreases as $|\rho|$ increases (knowing $Y$ reduces uncertainty about $X$).
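A simulation check of this conditional (a NumPy sketch; the slice width and sample size are arbitrary): conditioning on $Y$ in a thin band around $y_0$ should give mean $\approx \rho y_0$ and variance $\approx 1 - \rho^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.9, 2_000_000

# Sample (X, Y) with unit variances and correlation rho:
# X = rho*Y + sqrt(1 - rho^2)*Z with Y, Z independent standard normals.
y = rng.standard_normal(n)
x = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Condition on Y near y0 by selecting a thin slice.
y0 = 1.0
sel = np.abs(y - y0) < 0.01
print(x[sel].mean())  # ~ rho * y0 = 0.9
print(x[sel].var())   # ~ 1 - rho^2 = 0.19
```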
Conditional Gaussian in regression. In Bayesian linear regression, the prior on weights is $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \tau^2 I)$ and the likelihood is $\mathbf{y} \mid \mathbf{w} \sim \mathcal{N}(X\mathbf{w}, \sigma^2 I)$. The joint distribution of $(\mathbf{w}, \mathbf{y})$ is Gaussian. The posterior $\mathbf{w} \mid \mathbf{y}$ is also Gaussian with mean and covariance given by the conditional Gaussian formulas above. The posterior mean is the ridge regression estimator $\hat{\mathbf{w}} = (X^TX + (\sigma^2/\tau^2)I)^{-1}X^T\mathbf{y}$, and the posterior covariance quantifies remaining uncertainty.
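A sketch verifying this equivalence on synthetic data (dimensions, noise levels, and the true weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
sigma, tau = 0.5, 1.0

# Synthetic regression data.
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + sigma * rng.standard_normal(n)

# Posterior for w | y under the conjugate Gaussian model.
A = X.T @ X / sigma**2 + np.eye(d) / tau**2
Sigma_post = np.linalg.inv(A)                 # posterior covariance
w_post = Sigma_post @ X.T @ y / sigma**2      # posterior mean

# Equivalent ridge form from the text.
w_ridge = np.linalg.solve(X.T @ X + (sigma**2 / tau**2) * np.eye(d), X.T @ y)
print(np.allclose(w_post, w_ridge))           # True
```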
Discrete joint distribution. Let $X$ = number of heads in 2 fair coin flips and $Y$ = indicator that at least one head occurred. The joint PMF:
| $p(x,y)$ | $Y=0$ | $Y=1$ |
|---|---|---|
| $X=0$ | $1/4$ | $0$ |
| $X=1$ | $0$ | $1/2$ |
| $X=2$ | $0$ | $1/4$ |
Marginals: $p_X(0)=1/4$, $p_X(1)=1/2$, $p_X(2)=1/4$; $p_Y(0)=1/4$, $p_Y(1)=3/4$. The joint does not factor into the product of marginals: $p(0,0) = 1/4$ while $p_X(0)p_Y(0) = 1/16$, and $p(1,0) = 0 \neq p_X(1)p_Y(0) = 1/8$ - not independent (as expected, since $Y$ is a function of $X$).
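The same check in NumPy, encoding the table above:

```python
import numpy as np

# Joint PMF from the table: rows x = 0, 1, 2 heads; columns y = 0, 1.
p = np.array([[0.25, 0.00],
              [0.00, 0.50],
              [0.00, 0.25]])

p_X = p.sum(axis=1)   # [0.25, 0.5, 0.25]
p_Y = p.sum(axis=0)   # [0.25, 0.75]

# Independence would require p == outer(p_X, p_Y); it fails here.
print(np.outer(p_X, p_Y))                  # entry (0,0) is 1/16, not 1/4
print(np.allclose(p, np.outer(p_X, p_Y)))  # False
```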
## Summary
Joint distributions are the language for describing multiple random variables simultaneously:
- Joint PMF/PDF specifies all joint probabilities.
- Marginals recover individual distributions by summing/integrating out the other variable.
- Conditional distributions describe one variable given the other; they factor the joint via the multiplication rule.
- Independence holds iff the joint equals the product of marginals.
- Covariance measures linear dependence; zero covariance implies independence for jointly Gaussian variables, but not in general.
- Multivariate Gaussian: marginals and conditionals are Gaussian, with explicit formulas via the Schur complement. The conditional mean is linear - the link to linear regression.