Expectation, Variance & Covariance
Expectation, variance, and covariance are the first three numbers you extract from a random variable. Together they characterize location, spread, and the relationship between variables. They are the building blocks of everything from the central limit theorem to the bias-variance tradeoff in machine learning.
Expectation
Definition (Discrete). Let $X$ be a discrete random variable taking values in a countable set $\mathcal{X}$ with PMF $p$. The expected value (or mean) of $X$ is:
$$E[X] = \sum_{x \in \mathcal{X}} x \, p(x),$$
provided the sum converges absolutely: $\sum_{x} |x| \, p(x) < \infty$.
Definition (Continuous). Let $X$ be a continuous random variable with PDF $f$. The expected value is:
$$E[X] = \int_{-\infty}^{\infty} x \, f(x)\, dx,$$
provided the integral converges absolutely.
Interpretation. $E[X]$ is the long-run average of $X$ over many independent repetitions. It is a weighted average of outcomes, where the weights are probabilities.
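As a quick numerical illustration of this interpretation (a minimal sketch using NumPy; the fair-die setup is just a convenient example), the sample mean of many independent draws settles near $E[X]$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fair six-sided die: E[X] = (1 + 2 + ... + 6) / 6 = 3.5
rolls = rng.integers(1, 7, size=100_000)

# The long-run average of independent repetitions approaches E[X]
print(rolls.mean())  # close to 3.5
```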
Expectation via LOTUS
The Law of the Unconscious Statistician (LOTUS) allows computing $E[g(X)]$ without finding the distribution of $g(X)$ first.
Theorem (LOTUS). For a function $g: \mathbb{R} \to \mathbb{R}$:
$$E[g(X)] = \sum_{x} g(x)\, p(x) \quad \text{(discrete)}, \qquad E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx \quad \text{(continuous)}.$$
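A small LOTUS illustration in the discrete case (a sketch; the PMF below is a made-up example): $E[g(X)]$ is obtained by weighting $g(x)$ with $p(x)$, without ever working out the distribution of $g(X)$.

```python
import numpy as np

# Hypothetical PMF on {0, 1, 2, 3}
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])

def g(t):
    return (t - 1) ** 2  # any function of X

# LOTUS: E[g(X)] = sum_x g(x) p(x)
print(np.sum(g(x) * p))
```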
Linearity of Expectation
Theorem. For random variables $X$ and $Y$ on the same probability space, and constants $a, b \in \mathbb{R}$:
$$E[aX + bY] = a\,E[X] + b\,E[Y].$$
Proof (discrete case). Let $(X, Y)$ have joint PMF $p(x,y)$.
$$E[aX + bY] = \sum_{x,y}(ax + by)\,p(x,y) = a\sum_{x,y} x\,p(x,y) + b\sum_{x,y} y\,p(x,y).$$
Since $\sum_y p(x,y) = p_X(x)$ and $\sum_x p(x,y) = p_Y(y)$:
$$= a\sum_x x\,p_X(x) + b\sum_y y\,p_Y(y) = a\,E[X] + b\,E[Y]. \quad \square$$
Key point. Linearity holds regardless of whether $X$ and $Y$ are independent. This is its great power: sums of dependent random variables are just as easy to handle as independent ones when computing expectations.
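A quick simulation of this point (a sketch; the particular dependent pair below is an arbitrary choice): even when $Y$ is built directly from $X$, the expectation of $aX + bY$ still splits linearly.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.normal(size=200_000)
y = x ** 2 + rng.normal(size=200_000)  # Y is strongly dependent on X

a, b = 2.0, -3.0
lhs = np.mean(a * x + b * y)       # E[aX + bY], estimated
rhs = a * x.mean() + b * y.mean()  # a E[X] + b E[Y], estimated

print(lhs, rhs)  # agree up to simulation noise, despite the dependence
```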
Variance
Definition. The variance of $X$ (with $E[X] = \mu$) is:
$$\text{Var}(X) = E\!\left[(X - \mu)^2\right].$$
The standard deviation is $\sigma = \sqrt{\text{Var}(X)}$.
Computational formula. Expanding $(X - \mu)^2 = X^2 - 2\mu X + \mu^2$ and taking expectations:
$$\text{Var}(X) = E[X^2] - 2\mu\,E[X] + \mu^2 = E[X^2] - \mu^2 = E[X^2] - (E[X])^2.$$
This is usually easier to compute than applying the definition directly.
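A minimal check of the two formulas against each other (the PMF below is a made-up example):

```python
import numpy as np

# Hypothetical PMF on {0, 1, 2}
x = np.array([0, 1, 2])
p = np.array([0.5, 0.3, 0.2])

mu = np.sum(x * p)        # E[X]
ex2 = np.sum(x ** 2 * p)  # E[X^2]

print(ex2 - mu ** 2)              # computational formula: E[X^2] - (E[X])^2
print(np.sum((x - mu) ** 2 * p))  # definition: E[(X - mu)^2]; same value
```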
Theorem (Affine transformation). For constants $a, b$:
$$\text{Var}(aX + b) = a^2\,\text{Var}(X).$$
Proof. Let $\mu' = E[aX+b] = a\mu + b$. Then:
$$\text{Var}(aX+b) = E\!\left[(aX+b - \mu')^2\right] = E\!\left[(aX - a\mu)^2\right] = a^2\,E\!\left[(X-\mu)^2\right] = a^2\,\text{Var}(X). \quad \square$$
Note: the shift $b$ has no effect on variance (shifting does not change spread).
Non-linearity. Unlike expectation, $\text{Var}(X + Y) \neq \text{Var}(X) + \text{Var}(Y)$ in general. The correct formula involves covariance, derived below.
Covariance
Definition. The covariance of random variables $X$ and $Y$ (with means $\mu_X$ and $\mu_Y$) is:
$$\text{Cov}(X, Y) = E\!\left[(X - \mu_X)(Y - \mu_Y)\right].$$
Computational formula. Expanding the product and using linearity:
$$\text{Cov}(X, Y) = E[XY] - \mu_Y\,E[X] - \mu_X\,E[Y] + \mu_X\mu_Y = E[XY] - E[X]\,E[Y].$$
Note that $\text{Cov}(X, X) = E[(X-\mu)^2] = \text{Var}(X)$.
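A short numerical check of the computational formula and of $\text{Cov}(X,X) = \text{Var}(X)$ (a sketch; the linear-plus-noise construction of $Y$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)

x = rng.normal(size=200_000)
y = 0.5 * x + rng.normal(size=200_000)  # Y moves with X

cov_xy = np.mean(x * y) - x.mean() * y.mean()  # E[XY] - E[X]E[Y]
print(cov_xy)                                  # about 0.5

cov_xx = np.mean(x * x) - x.mean() ** 2        # Cov(X, X)
print(cov_xx, np.var(x))                       # both about Var(X) = 1
```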
Sign interpretation.
- $\text{Cov}(X,Y) > 0$: $X$ and $Y$ tend to move together.
- $\text{Cov}(X,Y) < 0$: $X$ and $Y$ tend to move in opposite directions.
- $\text{Cov}(X,Y) = 0$: $X$ and $Y$ are uncorrelated (no linear relationship).
Correlation
Covariance depends on the units of $X$ and $Y$. The correlation coefficient normalizes it:
$$\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)}\,\sqrt{\text{Var}(Y)}}.$$
Theorem. $-1 \leq \rho(X,Y) \leq 1$, with $|\rho| = 1$ iff $Y = aX + b$ almost surely for some constants $a \neq 0$, $b$.
Proof sketch. By the Cauchy-Schwarz inequality for the inner product $\langle U, V \rangle = E[UV]$:
$$|E[UV]|^2 \leq E[U^2]\,E[V^2].$$
Apply this with $U = X - \mu_X$ and $V = Y - \mu_Y$ to get $\text{Cov}(X,Y)^2 \leq \text{Var}(X)\,\text{Var}(Y)$, which is exactly $|\rho| \leq 1$; equality holds iff $V$ is almost surely a scalar multiple of $U$. $\square$
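A quick empirical look at the bounds (a sketch using `np.corrcoef`): an exact affine relationship gives $|\rho| = 1$, and adding independent noise pulls $\rho$ strictly inside $(-1, 1)$.

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.normal(size=50_000)
noise = rng.normal(size=50_000)

print(np.corrcoef(x, 2 * x + 1)[0, 1])   # 1 (up to floating point): Y = aX + b, a > 0
print(np.corrcoef(x, -3 * x + 5)[0, 1])  # -1: a < 0
print(np.corrcoef(x, x + noise)[0, 1])   # strictly between -1 and 1
```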
Variance of a Sum
Theorem. For any random variables $X$ and $Y$:
$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X,Y).$$
Proof. Let $\mu = E[X+Y] = \mu_X + \mu_Y$.
$$\text{Var}(X+Y) = E\!\left[\big((X-\mu_X)+(Y-\mu_Y)\big)^2\right] = E[(X-\mu_X)^2] + 2E[(X-\mu_X)(Y-\mu_Y)] + E[(Y-\mu_Y)^2] = \text{Var}(X) + 2\,\text{Cov}(X,Y) + \text{Var}(Y). \quad \square$$
For $n$ random variables:
$$\text{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \text{Var}(X_i) + 2\sum_{i < j}\text{Cov}(X_i, X_j).$$
If the $X_i$ are pairwise uncorrelated (in particular, if they are independent), all covariance terms vanish and variance is additive.
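A numerical check of the two-variable identity (a sketch; `ddof=0` keeps the sample variance and sample covariance on the same convention, so the identity holds exactly for the data):

```python
import numpy as np

rng = np.random.default_rng(4)

x = rng.normal(size=200_000)
y = 0.8 * x + rng.normal(size=200_000)  # correlated with X

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, ddof=0)[0, 1]

print(lhs, rhs)  # identical up to floating-point rounding
```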
Independence and Covariance
Theorem. If $X$ and $Y$ are independent, then $\text{Cov}(X,Y) = 0$.
Proof. Independence implies $E[XY] = E[X]\,E[Y]$ (factorization of expectations). Then $\text{Cov}(X,Y) = E[XY] - E[X]\,E[Y] = 0$. $\square$
The converse is false. Zero covariance does not imply independence.
Counterexample. Let $X \sim \text{Uniform}(\{-1, 0, 1\})$ and $Y = X^2$. Then:
$$E[X] = 0, \quad E[XY] = E[X^3] = \frac{1}{3}(-1 + 0 + 1) = 0.$$
So $\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = 0 - 0 = 0$. But $X$ and $Y$ are far from independent: knowing $X$ completely determines $Y$.
Covariance only detects linear dependence. Variables can be strongly dependent in a nonlinear way and still have zero covariance.
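The counterexample can be verified directly from the PMF (a minimal sketch of the calculation above):

```python
import numpy as np

# X uniform on {-1, 0, 1}, Y = X^2
x = np.array([-1, 0, 1])
p = np.full(3, 1 / 3)
y = x ** 2

e_x = np.sum(x * p)       # 0
e_y = np.sum(y * p)       # 2/3
e_xy = np.sum(x * y * p)  # E[X^3] = 0

print(e_xy - e_x * e_y)   # Cov(X, Y) = 0, yet Y is a deterministic function of X
```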
Examples
Expected value in a dice game. Roll a fair die and win $X^2$ dollars, where $X$ is the outcome. By LOTUS:
$$E[X^2] = \frac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) = \frac{91}{6} \approx 15.17.$$
Compare $E[X] = 3.5$, so $(E[X])^2 = 12.25 \neq E[X^2]$. The difference is the variance: $\text{Var}(X) = 91/6 - (7/2)^2 = 91/6 - 49/4 = 35/12 \approx 2.92$.
Portfolio variance. Suppose two assets have returns $X$ and $Y$ with $\text{Var}(X) = \sigma_X^2$, $\text{Var}(Y) = \sigma_Y^2$, and $\text{Cov}(X,Y) = \sigma_{XY}$. An equally weighted portfolio $P = (X+Y)/2$ has:
$$\text{Var}(P) = \frac{1}{4}\left(\sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY}\right).$$
If the assets are uncorrelated ($\sigma_{XY} = 0$) and have equal variance $\sigma^2$, then $\text{Var}(P) = \sigma^2/2$: diversification halves the variance. If the assets are perfectly correlated ($\rho = 1$, $\sigma_{XY} = \sigma^2$), then $\text{Var}(P) = \sigma^2$: no diversification benefit.
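The same computation in a few lines (a sketch; the variance value 0.04 is a hypothetical choice):

```python
# Hypothetical inputs: equal variances sigma^2 = 0.04
var_x = var_y = 0.04

for cov_xy in (0.0, 0.04):  # uncorrelated vs. perfectly correlated
    var_p = 0.25 * (var_x + var_y + 2 * cov_xy)
    print(cov_xy, var_p)    # 0.02 (variance halved) vs. 0.04 (no benefit)
```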
Indicator random variables. Let $X_i = 1$ if event $A_i$ occurs and $0$ otherwise. Then $E[X_i] = P(A_i)$ and $E[\sum X_i] = \sum P(A_i)$ by linearity. This is often the fastest way to compute expected counts.
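One classic use of this trick (a hypothetical illustration, not from the text above: fixed points of a uniformly random permutation) is counting with dependent indicators. Each position is fixed with probability $1/n$, so the expected number of fixed points is $n \cdot (1/n) = 1$ by linearity, even though the indicators are dependent.

```python
import numpy as np

rng = np.random.default_rng(5)

n, trials = 10, 100_000
perms = np.array([rng.permutation(n) for _ in range(trials)])

# X_i = 1 if the permutation fixes position i; count fixed points per trial
fixed_points = (perms == np.arange(n)).sum(axis=1)

print(fixed_points.mean())  # close to 1 = sum of P(A_i)
```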
The Covariance Matrix
For a random vector $\mathbf{X} = (X_1, \ldots, X_n)^T$, the covariance matrix is:
$$\Sigma_{ij} = \text{Cov}(X_i, X_j).$$
The diagonal entries are variances, the off-diagonal entries are covariances. $\Sigma$ is always symmetric and positive semidefinite. For any vector $\mathbf{v}$:
$$\mathbf{v}^T \Sigma \mathbf{v} = \text{Var}(\mathbf{v}^T \mathbf{X}) \geq 0.$$
The covariance matrix is the central object in multivariate statistics, PCA, and the multivariate Gaussian distribution.
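A short sketch with `np.cov` (the three-variable construction is arbitrary): the sample covariance matrix is symmetric, and the quadratic form $\mathbf{v}^T \Sigma \mathbf{v}$ matches the sample variance of $\mathbf{v}^T \mathbf{X}$, hence is nonnegative.

```python
import numpy as np

rng = np.random.default_rng(6)

# Three correlated variables; np.cov expects one row per variable
z = rng.normal(size=(3, 100_000))
x = np.vstack([z[0], z[0] + z[1], z[0] - 0.5 * z[2]])

sigma = np.cov(x)                   # 3x3 covariance matrix
v = np.array([1.0, -2.0, 0.5])

print(np.allclose(sigma, sigma.T))  # symmetric
print(v @ sigma @ v)                # v^T Sigma v
print(np.var(v @ x, ddof=1))        # = Var(v^T X) >= 0 (same ddof as np.cov)
```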