Expectation, Variance & Covariance - The Center, the Spread, and the Relationship
You’re offered a game. Roll a fair six-sided die. Whatever number comes up, you win twice that amount in dollars. But to play, you pay 7 dollars upfront.
Should you play?
Think about it intuitively. The die shows 1 through 6, each with probability 1/6. You win 2, 4, 6, 8, 10, or 12 dollars. The average payout is… somewhere around 7? Let’s compute: $(2 + 4 + 6 + 8 + 10 + 12)/6 = 42/6 = 7$. You pay 7 to play, and the average payout is 7. Expected gain: zero.
This game is fair in a precise sense - not because each individual outcome is guaranteed to be fair, but because if you played thousands of times, you’d approximately break even. The concept that makes this precise is expectation.
The Expected Value
For a discrete random variable $X$ taking values in a countable set with PMF $p$, the expected value (also called the mean or expectation) is:
$$E[X] = \sum_x x \cdot p(x).$$
This is a weighted average of the possible values, where each value is weighted by its probability. For the die game above, taking $X$ to be the payout in dollars: $E[X] = 2 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} + 8 \cdot \frac{1}{6} + 10 \cdot \frac{1}{6} + 12 \cdot \frac{1}{6} = 7$.
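The weighted-average definition translates almost word for word into code. The following is a minimal sketch using exact fractions; the names `payout_pmf` and `expectation` are illustrative choices, not part of any library.

```python
from fractions import Fraction

# Payout is twice the die face; each face has probability 1/6.
payout_pmf = {2 * face: Fraction(1, 6) for face in range(1, 7)}

def expectation(pmf):
    """Weighted average of the values, weighted by their probabilities."""
    return sum(value * prob for value, prob in pmf.items())

expected_payout = expectation(payout_pmf)
expected_gain = expected_payout - 7  # subtract the entry fee
print(expected_payout, expected_gain)  # 7 0
```

Using `Fraction` instead of floats keeps the arithmetic exact, so the result is exactly 7 rather than something like 6.999999.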
For a continuous random variable with PDF $f$, the sum becomes an integral:
$$E[X] = \int_{-\infty}^\infty x \cdot f(x)dx.$$
Center of mass. Think of the PDF as a density of material spread along the real line. The expected value is the point where you’d balance the distribution on a fulcrum - the center of mass. If the distribution is symmetric around some point $\mu$, then $E[X] = \mu$ (provided the expectation exists).
Expectation doesn’t have to be a value $X$ can actually take. A fair die has $E[X] = 3.5$, but a die never shows 3.5. The expectation is a property of the distribution, not a prediction of a single outcome.
Single trial vs. long run. This is worth being precise about. The expected value guarantees nothing about any single roll. You could roll the die once and get 1 (win 2 dollars, for a net loss of 5). But if you roll 10,000 times, the law of large numbers says the average payout will, with overwhelming probability, be very close to 7. Expectation is a long-run average, not a per-trial promise. This matters enormously in practice: a game with positive expected value can still bankrupt you if you run out of money before the long run arrives. Expected value answers “is this a good game to play repeatedly?” not “will I win this time?”
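A quick simulation makes the single-trial versus long-run distinction visible. This is only a sketch: the seed and the sample sizes are arbitrary choices, and `average_payout` is a hypothetical helper name.

```python
import random

def average_payout(num_rolls, rng):
    """Average payout (twice the face) over num_rolls simulated rolls."""
    total = sum(2 * rng.randint(1, 6) for _ in range(num_rolls))
    return total / num_rolls

rng = random.Random(0)  # fixed seed, chosen arbitrarily, for repeatability
for n in (10, 1_000, 100_000):
    print(n, average_payout(n, rng))
```

With only 10 rolls the average can land well away from 7; as the number of rolls grows it should settle close to 7, though any particular run is still random.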
The Law of the Unconscious Statistician
What if you want $E[g(X)]$ for some function $g$, like $E[X^2]$ or $E[\sqrt{X}]$?
You could find the distribution of $g(X)$ first, then take expectation. But there’s a shortcut - the Law of the Unconscious Statistician (LOTUS). You can compute $E[g(X)]$ directly from the distribution of $X$ without finding the distribution of $g(X)$.
The name is a joke. A “conscious” statistician would first derive the PMF or PDF of $g(X)$ - the proper, careful procedure. An “unconscious” statistician just applies $g$ to each value of $X$ and averages using the original probabilities, without thinking about whether this is valid. It turns out to be valid, and the reason is just what you’d expect: a weighted average of $g(x)$ values, weighted by $P(X=x)$, is the right thing to compute.
$$E[g(X)] = \sum_x g(x) \cdot p(x) \quad \text{(discrete)}, \qquad E[g(X)] = \int_{-\infty}^\infty g(x) \cdot f(x)dx \quad \text{(continuous)}.$$
Example. If a fair die shows $X$, what’s $E[X^2]$?
$$E[X^2] = \frac{1}{6}(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2) = \frac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) = \frac{91}{6} \approx 15.17.$$
Note: $E[X^2] \neq (E[X])^2$. We have $E[X] = 3.5$, so $(E[X])^2 = 12.25$, which is not $15.17$. The gap between these two numbers is, as we’ll see, the variance.
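To see that the shortcut really agrees with the careful procedure, the sketch below computes an expectation both ways. The helper names `lotus` and `via_distribution` are made up for this illustration, and $g(x) = (x - 3)^2$ is chosen only because it is not one-to-one on the die faces, so the distribution of $g(X)$ genuinely differs from that of $X$.

```python
from collections import defaultdict
from fractions import Fraction

die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def lotus(g, pmf):
    """'Unconscious' route: average g(x) using the probabilities of X."""
    return sum(g(x) * p for x, p in pmf.items())

def via_distribution(g, pmf):
    """'Conscious' route: build the PMF of g(X) first, then average."""
    g_pmf = defaultdict(Fraction)
    for x, p in pmf.items():
        g_pmf[g(x)] += p  # values of x that share g(x) pool their probability
    return sum(y * q for y, q in g_pmf.items())

square = lambda x: x ** 2
shifted_square = lambda x: (x - 3) ** 2  # not one-to-one on 1..6

print(lotus(square, die_pmf))  # 91/6
assert lotus(shifted_square, die_pmf) == via_distribution(shifted_square, die_pmf)
```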
Linearity of Expectation
This is arguably the most useful single fact in probability, and it’s worth pausing on.
Theorem. For any random variables $X$ and $Y$, and constants $a, b$:
$$E[aX + bY] = a E[X] + b E[Y].$$
The extraordinary thing is: this holds whether or not $X$ and $Y$ are independent. It holds even if $X$ and $Y$ are perfectly correlated, or negatively correlated, or related in some complicated nonlinear way. Expectation is always linear.
Why? Because expectation is just a weighted average, and weighted averages are always linear. The dependence structure between $X$ and $Y$ affects how their values move together, but it doesn’t change the fact that the average of a sum is the sum of the averages - that is a basic property of averages themselves, not a fact about probability. If you have ten pairs of numbers $(x_i, y_i)$, the average of the sums $x_i + y_i$ equals the average of the $x_i$ plus the average of the $y_i$, no matter how the two lists are paired up or related to each other.
Contrast this with variance, which is not linear in general (we’ll get there).
Example: Expected number of heads in $n$ fair coin flips.
Method 1 (hard way): define $X \sim \text{Binomial}(n, 1/2)$ and compute $E[X] = \sum_{k=0}^n k \binom{n}{k} (1/2)^n$. You’d need to work through a somewhat tedious calculation.
Method 2 (easy way - indicator trick): let $X_i = 1$ if flip $i$ is heads, $0$ otherwise. Then $X = X_1 + X_2 + \cdots + X_n$. Each $X_i$ is $\text{Bernoulli}(1/2)$, so $E[X_i] = 1/2$.
By linearity:
$$E[X] = E[X_1] + E[X_2] + \cdots + E[X_n] = n \cdot \frac{1}{2} = \frac{n}{2}.$$
No combinatorics required. The $X_i$’s are independent here, but the calculation didn’t use that - it just used linearity.
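The two methods can be compared numerically. The sketch below evaluates the binomial sum from Method 1 exactly and checks it against $n/2$ from Method 2; the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def mean_heads_direct(n):
    """Hard way: sum k * P(X = k) over the Binomial(n, 1/2) PMF."""
    return sum(k * comb(n, k) * Fraction(1, 2 ** n) for k in range(n + 1))

def mean_heads_indicators(n):
    """Easy way: n indicators, each with expectation 1/2."""
    return n * Fraction(1, 2)

for n in (1, 5, 20):
    assert mean_heads_direct(n) == mean_heads_indicators(n)
    print(n, mean_heads_direct(n))
```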
The indicator trick is a universal strategy: to find the expected count of events that occur, write the count as a sum of indicators, and use linearity plus $E[\mathbf{1}_A] = P(A)$.
A harder example: expected number of fixed points in a random shuffle. Shuffle a deck of 52 cards. A fixed point is a card that ends up in its original position. What is the expected number of fixed points? The variables $X_i = \mathbf{1}[\text{card } i \text{ is in position } i]$ are highly dependent - knowing card 1 is in place affects the probability that card 2 is in place. But linearity doesn’t care. $E[X_i] = P(\text{card } i \text{ is in position } i) = 1/52$ for each $i$. So the expected number of fixed points is $52 \cdot \frac{1}{52} = 1$. Exactly 1, regardless of deck size. The dependence between the $X_i$’s is completely irrelevant to the calculation.
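Enumerating all $52!$ orderings is out of reach, but for small decks the claim can be checked by brute force. The sketch below averages the number of fixed points over every permutation of $n$ cards for a few small $n$; `expected_fixed_points` is a hypothetical helper written for this check.

```python
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n):
    """Average number of fixed points over all n! equally likely orderings."""
    total = 0
    count = 0
    for perm in permutations(range(n)):
        total += sum(1 for position, card in enumerate(perm) if position == card)
        count += 1
    return Fraction(total, count)

for n in range(1, 8):
    print(n, expected_fixed_points(n))  # the second column is always 1
```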
Variance: Measuring Spread
Two distributions can have the same expected value but be very different. Consider:
- $X$: always equals 5. (Deterministic.)
- $Y$: equals 0 with probability 1/2, equals 10 with probability 1/2.
Both have $E[X] = E[Y] = 5$. But $Y$ is wildly unpredictable while $X$ is perfectly predictable. Expected value alone doesn’t capture this. What we need is a measure of spread.
Variance measures how far $X$ tends to deviate from its mean $\mu = E[X]$:
$$\text{Var}(X) = E\left[(X - \mu)^2\right].$$
We square the deviation $(X - \mu)$ to make it positive (deviations above and below the mean shouldn’t cancel out), and then take the expectation. Variance is the average squared distance from the mean.
Why square rather than just take the absolute value $|X - \mu|$? Both prevent cancellation. The practical answer is that squaring is mathematically tractable - it is differentiable, it works cleanly with the algebra of expectations, and as we’ll see, it connects directly to the covariance formula. The absolute value creates corners that make calculus harder. The cost is that variance lives in squared units: if $X$ is a temperature in degrees, variance is in degrees-squared, which has no physical interpretation. Standard deviation fixes this by taking the square root.
Risk interpretation. Variance is the mathematician’s measure of risk. Two investments with the same expected return but different variances have the same long-run average but very different experiences along the way. High variance means some outcomes are much better and some much worse than the mean - a riskier ride. Low variance means outcomes cluster tightly around the mean - predictable. This is literally how portfolio theory works: you optimize the trade-off between expected return (which you want high) and variance (which you want low).
Computational shortcut. Expanding $(X - \mu)^2 = X^2 - 2\mu X + \mu^2$ and taking expectations:
$$\text{Var}(X) = E[X^2] - 2\mu \cdot E[X] + \mu^2 = E[X^2] - 2\mu^2 + \mu^2 = E[X^2] - (E[X])^2.$$
This is almost always easier than using the definition directly. For the die:
$$\text{Var}(X) = E[X^2] - (E[X])^2 = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{91}{6} - \frac{49}{4} = \frac{182}{12} - \frac{147}{12} = \frac{35}{12} \approx 2.92.$$
Standard deviation is $\sigma = \sqrt{\text{Var}(X)}$. The advantage: it’s in the same units as $X$, not squared units. For the die, $\sigma = \sqrt{35/12} \approx 1.71$ - the typical deviation from the mean of 3.5 is about 1.71 on the die scale.
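As a check on the arithmetic, the variance of the die can be computed both from the definition and from the shortcut. This is a small sketch with illustrative names, using exact fractions until the final square root.

```python
from fractions import Fraction
from math import sqrt

die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expect(g, pmf):
    """E[g(X)] for a discrete PMF given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expect(lambda x: x, die_pmf)
var_definition = expect(lambda x: (x - mu) ** 2, die_pmf)   # E[(X - mu)^2]
var_shortcut = expect(lambda x: x ** 2, die_pmf) - mu ** 2  # E[X^2] - (E[X])^2
sigma = sqrt(var_shortcut)
print(var_definition, var_shortcut, round(sigma, 2))  # 35/12 35/12 1.71
```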
Scaling rules. Adding a constant shifts the distribution but doesn’t change spread:
$$\text{Var}(X + b) = \text{Var}(X).$$
Multiplying by a constant scales the spread quadratically:
$$\text{Var}(aX) = a^2 \text{Var}(X).$$
Combined: $\text{Var}(aX + b) = a^2 \text{Var}(X)$.
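The scaling rules apply directly to the opening game, where the net gain is $2X - 7$ with $X$ the die face, so $a = 2$ and $b = -7$. The sketch below confirms that the shift has no effect and the factor of 2 multiplies the variance by 4; `variance` is an illustrative helper.

```python
from fractions import Fraction

die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def variance(g, pmf):
    """Variance of g(X), computed as E[g(X)^2] - (E[g(X)])^2."""
    mean = sum(g(x) * p for x, p in pmf.items())
    second = sum(g(x) ** 2 * p for x, p in pmf.items())
    return second - mean ** 2

var_face = variance(lambda x: x, die_pmf)
var_net = variance(lambda x: 2 * x - 7, die_pmf)  # a = 2, b = -7
print(var_face, var_net)  # 35/12 35/3
assert var_net == 2 ** 2 * var_face
```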
Covariance: How Two Variables Move Together
When you have two random variables $X$ and $Y$, you might want to know: do they tend to be large at the same time? Do they move in opposite directions? Are they unrelated?
The covariance captures this:
$$\text{Cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right].$$
When $X$ is above its mean and $Y$ is also above its mean, the product $(X - \mu_X)(Y - \mu_Y)$ is positive. When both are below, it’s also positive (negative times negative). When they move in opposite directions - one high while the other is low - the product is negative. So covariance is positive when $X$ and $Y$ tend to go up and down together, and negative when they tend to move oppositely.
When $X$ and $Y$ are unrelated (independent), the covariance is zero - but why? Because the positive and negative products cancel on average. Knowing that $X$ is above its mean tells you nothing about where $Y$ sits, so the deviation $Y - \mu_Y$ still averages to zero, and the same holds when $X$ is below its mean. The positive and negative products therefore balance out in expectation. More generally, covariance being zero means the product $(X - \mu_X)(Y - \mu_Y)$ has no systematic tendency to be positive or negative - no consistent direction.
Computational shortcut:
$$\text{Cov}(X, Y) = E[XY] - E[X] E[Y].$$
Note that $\text{Cov}(X, X) = E[X^2] - (E[X])^2 = \text{Var}(X)$. Variance is a special case of covariance.
Concrete examples. Temperature and ice cream sales have positive covariance: hot days are above the mean in both. Temperature and heating bills have negative covariance: hot days are above the mean temperature and below the mean heating bill. The height and weight of a random person have positive covariance: taller people tend to be heavier. Two stocks in the same industry sector typically have positive covariance - they get hit by the same economic conditions. A bond and a stock often have negative covariance - bonds go up when stocks fall, which is why a portfolio with both has lower total variance than a portfolio with only one.
Discomfort check. This is the key place where people get confused: $E[X \cdot Y] \neq E[X] \cdot E[Y]$ in general. Covariance measures exactly this gap: $\text{Cov}(X,Y) = E[XY] - E[X]E[Y]$. The product formula holds exactly when the covariance is zero. Independence guarantees that, but it is not the only way it can happen. Don’t assume $E[XY] = E[X]E[Y]$ without first checking that $X$ and $Y$ are independent, or at least uncorrelated.
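Here is a small worked case where the gap between $E[XY]$ and $E[X]E[Y]$ is clearly nonzero. The setup is an assumption made for illustration: roll two fair dice, let $X$ be the first die and $Y$ the sum of both, so $Y$ depends on $X$ by construction.

```python
from fractions import Fraction
from itertools import product

# Joint outcomes: X is the first die, Y is the sum of both dice.
outcomes = [(d1, d1 + d2) for d1, d2 in product(range(1, 7), repeat=2)]
prob = Fraction(1, len(outcomes))  # all 36 outcomes equally likely

e_x = sum(x * prob for x, _ in outcomes)
e_y = sum(y * prob for _, y in outcomes)
e_xy = sum(x * y * prob for x, y in outcomes)

cov_definition = sum((x - e_x) * (y - e_y) * prob for x, y in outcomes)
cov_shortcut = e_xy - e_x * e_y
print(e_xy, e_x * e_y, cov_shortcut)  # 329/12 49/2 35/12
```

The covariance comes out to $35/12$, which is exactly $\text{Var}(X)$: the second die is independent of the first and contributes nothing to the covariance.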
Variance of a sum. Now we can state the correct formula for the variance of a sum:
$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y).$$
The covariance term is the correction that handles dependence. If $X$ and $Y$ are independent (or even just uncorrelated, meaning $\text{Cov}(X,Y) = 0$), then:
$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y). \quad \text{(independent or uncorrelated)}$$
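Variance is additive only when there’s no covariance. This is why diversification works in finance: if assets are uncorrelated, the variance of a portfolio is the sum of the individual variances, each weighted by the square of its portfolio weight (by the scaling rule $\text{Var}(aX) = a^2 \text{Var}(X)$). With $n$ uncorrelated assets of equal variance $\sigma^2$ held in equal weights $1/n$, the portfolio variance is $n \cdot \sigma^2/n^2 = \sigma^2/n$, so spreading across many uncorrelated assets reduces the total variance.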
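The variance-of-a-sum formula can be checked on three contrasting cases built from two fair dice: $Y$ independent of $X$, $Y$ identical to $X$, and $Y = 7 - X$. These cases and the helper names are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # 36 equally likely (d1, d2) pairs
p = Fraction(1, len(rolls))

def mean(f):
    return sum(f(d1, d2) * p for d1, d2 in rolls)

def var(f):
    return mean(lambda a, b: f(a, b) ** 2) - mean(f) ** 2

def cov(f, g):
    return mean(lambda a, b: f(a, b) * g(a, b)) - mean(f) * mean(g)

x = lambda d1, d2: d1
cases = {
    "independent": lambda d1, d2: d2,   # Cov = 0
    "identical": lambda d1, d2: d1,     # Cov = Var(X)
    "opposite": lambda d1, d2: 7 - d1,  # Cov = -Var(X)
}
for name, y in cases.items():
    total = var(lambda a, b: x(a, b) + y(a, b))
    assert total == var(x) + var(y) + 2 * cov(x, y)
    print(name, total)
```

The three totals are $35/6$, $35/3$, and $0$: positive covariance inflates the variance of the sum, and perfectly opposed variables cancel it entirely.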
Correlation: The Dimensionless Version
Covariance has a problem: its magnitude depends on the units of $X$ and $Y$. If you measure height in centimeters instead of meters, the covariance with weight changes by a factor of 100, even though the relationship between height and weight hasn’t changed.
The correlation coefficient normalizes away the units:
$$\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}$$
where $\sigma_X = \sqrt{\text{Var}(X)}$ and $\sigma_Y = \sqrt{\text{Var}(Y)}$.
Why is $\rho$ always between $-1$ and $1$? This follows from the Cauchy-Schwarz inequality, which in the language of random variables says:
$$|E[UV]| \leq \sqrt{E[U^2] E[V^2]}$$
for any random variables $U$ and $V$. Apply it with $U = X - \mu_X$ and $V = Y - \mu_Y$:
$$|\text{Cov}(X,Y)| = |E[(X-\mu_X)(Y-\mu_Y)]| \leq \sqrt{E[(X-\mu_X)^2] E[(Y-\mu_Y)^2]} = \sigma_X \sigma_Y.$$
Dividing both sides by $\sigma_X \sigma_Y$ gives $|\rho| \leq 1$, i.e. $-1 \leq \rho \leq 1$.
Geometric intuition. There is a way to think of random variables as vectors in a high-dimensional space, where the inner product of two random variables $U$ and $V$ is $E[UV]$ and the length of $U$ is $\sqrt{E[U^2]}$. In that language, Cauchy-Schwarz just says the inner product of two vectors cannot exceed the product of their lengths - which is the same as saying the cosine of the angle between them lies in $[-1, 1]$. The correlation $\rho$ is literally the cosine of the angle between the centered variables $X - \mu_X$ and $Y - \mu_Y$ in this space. $\rho = 1$ means the two vectors point in exactly the same direction; $\rho = -1$ means they point in exactly opposite directions; $\rho = 0$ means they are perpendicular.
The endpoints are achieved exactly when $Y - \mu_Y = c(X - \mu_X)$ for some constant $c$ - i.e., when $Y = aX + b$ for some $a \neq 0$ and $b$. One variable is an exact linear function of the other. $\rho = 1$ when $a > 0$ (same direction); $\rho = -1$ when $a < 0$ (opposite direction).
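The endpoint behaviour is easy to check numerically. In the sketch below (two fair dice, illustrative helper names), an exact linear function of the first die gives $\rho = \pm 1$, an independent die gives $\rho = 0$, and the sum of both dice lands in between.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

rolls = list(product(range(1, 7), repeat=2))  # 36 equally likely (d1, d2) pairs
p = Fraction(1, len(rolls))

def mean(f):
    return sum(f(d1, d2) * p for d1, d2 in rolls)

def correlation(f, g):
    """Cov(f, g) / (sigma_f * sigma_g), returned as a float."""
    mu_f, mu_g = mean(f), mean(g)
    cov = mean(lambda a, b: (f(a, b) - mu_f) * (g(a, b) - mu_g))
    var_f = mean(lambda a, b: (f(a, b) - mu_f) ** 2)
    var_g = mean(lambda a, b: (g(a, b) - mu_g) ** 2)
    return float(cov) / sqrt(var_f * var_g)

first = lambda d1, d2: d1
print(correlation(first, lambda d1, d2: 3 * d1 + 2))   # a > 0
print(correlation(first, lambda d1, d2: -3 * d1 + 2))  # a < 0
print(correlation(first, lambda d1, d2: d2))           # independent dice
print(correlation(first, lambda d1, d2: d1 + d2))      # partial linear link
```

The last value is $1/\sqrt{2} \approx 0.707$, which in the geometric picture is the cosine of a 45-degree angle between the two centered variables.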
What different values of $\rho$ look like. Imagine a scatter plot of $(X, Y)$ pairs. At $\rho = 1$, all points fall exactly on an upward-sloping line - no scatter at all. At $\rho = 0.9$, points cluster tightly around an upward line with a little scatter. At $\rho = 0.5$, you can see the upward trend but the cloud is wide. At $\rho = 0.1$, the trend is barely visible - nearly a random cloud. At $\rho = 0$, there is no linear trend at all: typically a circular or otherwise shapeless cloud, though a nonlinear pattern is still possible. A common mistake is treating $\rho = 0.3$ as “correlated” when it explains only $0.3^2 = 0.09$, i.e. 9%, of the variance ($\rho^2$ is the fraction of variance in $Y$ explained by a linear relationship with $X$). As a rough rule of thumb, strong correlation in practice typically means $|\rho| > 0.7$.
Independence vs. Zero Covariance
Independence implies zero covariance. Here’s why: if $X$ and $Y$ are independent, then $E[XY] = E[X] \cdot E[Y]$ (expectations factor for independent random variables), so $\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = 0$.
But the converse is false. Zero covariance does not imply independence.
Counterexample. Let $X$ be uniform on $\{-1, 0, 1\}$ and let $Y = X^2$.
$$E[X] = \frac{-1 + 0 + 1}{3} = 0.$$
$$E[XY] = E[X^3] = \frac{(-1)^3 + 0^3 + 1^3}{3} = \frac{-1 + 0 + 1}{3} = 0.$$
$$\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = 0 - 0 \cdot E[Y] = 0.$$
So $X$ and $Y$ are uncorrelated. But they are far from independent: knowing $X$ completely determines $Y = X^2$. If $|X| = 1$ then $Y = 1$; if $X = 0$ then $Y = 0$.
The issue is that covariance only measures linear relationships. $Y = X^2$ is a nonlinear relationship - the parabola has no upward or downward linear trend. Covariance is blind to it.
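The counterexample is small enough to verify exhaustively. The sketch below recomputes the covariance and then tests the factorization $P(X = 0, Y = 0) = P(X = 0) \cdot P(Y = 0)$ that independence would require; the variable names are illustrative.

```python
from fractions import Fraction

x_pmf = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

def expect(g):
    """E[g(X)] under the uniform distribution on {-1, 0, 1}."""
    return sum(g(x) * p for x, p in x_pmf.items())

# With Y = X^2: Cov(X, Y) = E[X * X^2] - E[X] * E[X^2].
cov_xy = expect(lambda x: x * x ** 2) - expect(lambda x: x) * expect(lambda x: x ** 2)

# Independence would require P(X = 0, Y = 0) == P(X = 0) * P(Y = 0).
p_x0 = x_pmf[0]
p_y0 = sum(p for x, p in x_pmf.items() if x ** 2 == 0)
p_joint = sum(p for x, p in x_pmf.items() if x == 0 and x ** 2 == 0)
print(cov_xy, p_joint, p_x0 * p_y0)  # 0 1/3 1/9
```

The covariance is exactly zero, yet the joint probability $1/3$ does not match the product $1/9$, so the two variables are dependent.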
Tail Probabilities and Concentration Inequalities
The probability that a random variable takes a value far from its mean is called a tail probability - it lives in the “tails” of the distribution, the regions far to the left or right of the bulk. Questions like “what is the probability that $X$ exceeds 100?” or “what is the probability that $X$ deviates from its mean by more than $10\sigma$?” are tail probability questions.
Concentration inequalities are upper bounds on these tail probabilities. They tell you how small the tails must be, given what you know about the distribution. The less you know, the weaker the bound you can prove. The two most fundamental ones - Markov and Chebyshev - are both concentration inequalities, and they come in sequence: Markov only needs the mean, Chebyshev needs the mean and variance, and the extra information typically buys a tighter bound far out in the tails.
Markov’s Inequality: Concentration from the Mean Alone
Before Chebyshev, there is a simpler and more primitive bound that only uses the mean.
Markov’s inequality. For any non-negative random variable $X$ and any $a > 0$:
$$P(X \geq a) \leq \frac{E[X]}{a}.$$
The intuition: if the average value of $X$ is $\mu$, it cannot be the case that a large fraction of the probability sits far above $\mu$ - that would push the average too high. More precisely: if $P(X \geq a) = p$, then $E[X] \geq p \cdot a$ (since the contribution from that region alone is at least $p \cdot a$), which rearranges to $p \leq E[X]/a$.
Proof (continuous case; for a discrete $X$, replace the integrals with sums). Write $E[X] = \int_0^\infty xf(x)dx \geq \int_a^\infty xf(x)dx \geq a \int_a^\infty f(x)dx = a \cdot P(X \geq a)$. Dividing both sides by $a$ gives the result.
The bound is distribution-free: it holds for any non-negative $X$ with finite mean, no other assumptions. It is also tight - there are distributions that achieve equality. The cost is that it is often quite loose. If $E[X] = 1$, Markov says $P(X \geq 100) \leq 0.01$. For many distributions the actual probability is astronomically smaller. Markov’s inequality is useful when you know almost nothing about the distribution; when you know more (like the variance), you can do better.
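Both the looseness and the tightness can be seen on small examples. The sketch below compares the true tail probability with the Markov bound for a fair die, then for a two-point distribution constructed (as an assumption for this illustration) to meet the bound with equality. `markov_check` is a hypothetical helper.

```python
from fractions import Fraction

def markov_check(pmf, a):
    """Return (actual tail probability, Markov bound) for a non-negative PMF."""
    assert all(x >= 0 for x in pmf), "Markov needs a non-negative variable"
    tail = sum(p for x, p in pmf.items() if x >= a)
    bound = sum(x * p for x, p in pmf.items()) / a
    return tail, bound

die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
for a in (2, 4, 6):
    tail, bound = markov_check(die_pmf, a)
    assert tail <= bound
    print(a, tail, bound)

# A two-point distribution that meets the bound with equality at a = 10.
two_point = {0: Fraction(9, 10), 10: Fraction(1, 10)}
print(markov_check(two_point, 10))
```

For the die the bound is far from the truth (at $a = 2$ it even exceeds 1), while for the two-point distribution the tail probability and the bound are both $1/10$.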
Chebyshev’s Inequality: Concentration from the Mean and Variance
The mean and variance together give you a weak but universal bound on how often $X$ deviates far from its mean.
Chebyshev’s inequality. For any random variable $X$ with mean $\mu$ and standard deviation $\sigma > 0$:
$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}.$$
At most $1/k^2$ of the probability can lie $k$ or more standard deviations from the mean.
- $k=2$: at most 25% of the probability lies 2 or more standard deviations from the mean
- $k=3$: at most about 11.1%
- $k=10$: at most 1%
This bound is weak - for specific distributions like the normal, the actual probabilities are much smaller than $1/k^2$. But Chebyshev’s inequality holds for every distribution with finite variance, with no other assumptions. That universality is its value. Any distribution with finite variance has most of its probability concentrated near the mean - and Chebyshev quantifies “most.”
Proof. Let $Z = (X - \mu)^2$. Then $Z \geq 0$ always, and $E[Z] = \text{Var}(X) = \sigma^2$. For any $\epsilon > 0$, by Markov’s inequality (expectation of non-negative variable):
$$P(Z \geq \epsilon) \leq \frac{E[Z]}{\epsilon} = \frac{\sigma^2}{\epsilon}.$$
Set $\epsilon = k^2 \sigma^2$. Then $P((X-\mu)^2 \geq k^2\sigma^2) \leq 1/k^2$, which is $P(|X-\mu| \geq k\sigma) \leq 1/k^2$.
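To get a feel for how loose the bound can be, the sketch below compares the exact two-sided tail of a Binomial(20, 1/2) variable with $1/k^2$. The choice of distribution and the helper name `chebyshev_check` are assumptions for illustration. Following the proof, it compares squared deviations with $k^2\sigma^2$, which keeps everything in exact fractions.

```python
from fractions import Fraction
from math import comb

def chebyshev_check(pmf, k):
    """Return (P(|X - mu| >= k*sigma), 1/k^2), comparing squares to stay exact."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    tail = sum(p for x, p in pmf.items() if (x - mu) ** 2 >= k ** 2 * var)
    return tail, Fraction(1, k ** 2)

# Binomial(20, 1/2): number of heads in 20 fair flips.
binom_pmf = {h: Fraction(comb(20, h), 2 ** 20) for h in range(21)}
for k in (2, 3):
    tail, bound = chebyshev_check(binom_pmf, k)
    assert tail <= bound
    print(k, float(tail), float(bound))
```

In both cases the actual tail probability is well below the bound, which is the price of an inequality that makes no assumption about the shape of the distribution.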
You might know a different inequality by the same name. There is a separate result also called Chebyshev’s inequality - the Chebyshev sum inequality - which says: if $a_1 \geq a_2 \geq \cdots \geq a_n$ and $b_1 \geq b_2 \geq \cdots \geq b_n$ are both sorted in the same order, then
$$n\sum_{i=1}^n a_i b_i \geq \left(\sum_{i=1}^n a_i\right)\left(\sum_{i=1}^n b_i\right).$$
In words: the average of products is at least the product of averages, when both sequences are aligned. If one goes up while the other goes down, the inequality reverses. This is a purely algebraic fact about sorted sequences - no probability involved. Both results are attributed to Pafnuty Chebyshev (1821 - 1894), who was extraordinarily prolific, and both legitimately carry his name. In a probability or statistics context, “Chebyshev’s inequality” almost always means the concentration bound above. Elsewhere it may mean the sum inequality.
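For completeness, the sum inequality can be checked on a small example. The sequences below are arbitrary numbers picked for illustration, and `chebyshev_sum_gap` is a hypothetical helper that returns the left side minus the right side.

```python
def chebyshev_sum_gap(a, b):
    """n * sum(a_i * b_i) - sum(a) * sum(b) for equal-length sequences."""
    n = len(a)
    return n * sum(x * y for x, y in zip(a, b)) - sum(a) * sum(b)

a = [9, 7, 4, 1]           # sorted in decreasing order
b_same = [8, 5, 5, 2]      # sorted the same way as a
b_opposite = b_same[::-1]  # sorted the opposite way

print(chebyshev_sum_gap(a, b_same))      # 96  (non-negative)
print(chebyshev_sum_gap(a, b_opposite))  # -96 (non-positive)
```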
Summary
| Concept | Formula | What It Measures |
|---|---|---|
| Expectation (discrete) | $E[X] = \sum x \cdot p(x)$ | Center of mass of the distribution |
| Expectation (continuous) | $E[X] = \int x \cdot f(x)dx$ | Center of mass of the distribution |
| LOTUS | $E[g(X)] = \sum g(x)p(x)$ | Expected value of a function of $X$ |
| Linearity | $E[aX+bY] = aE[X]+bE[Y]$ | Always holds, even when $X,Y$ are dependent |
| Indicator trick | $E[\mathbf{1}_A] = P(A)$ | Expected counts via sum of probabilities |
| Variance | $\text{Var}(X) = E[X^2] - (E[X])^2$ | Average squared deviation from mean |
| Standard deviation | $\sigma = \sqrt{\text{Var}(X)}$ | Spread, in original units |
| Var of sum | $\text{Var}(X+Y) = \text{Var}(X)+\text{Var}(Y)+2\text{Cov}(X,Y)$ | Includes cross-term when dependent |
| Covariance | $\text{Cov}(X,Y) = E[XY] - E[X]E[Y]$ | Tendency to move together |
| Correlation | $\rho = \text{Cov}(X,Y)/(\sigma_X\sigma_Y) \in [-1,1]$ | Dimensionless linear association |
| Zero cov $\not\Rightarrow$ independence | $X, Y=X^2$ are uncorrelated but dependent | Cov detects only linear relationships |
| Markov | $P(X \geq a) \leq E[X]/a$ | Concentration for non-negative $X$, mean only |
| Chebyshev | $P(\lvert X-\mu \rvert \geq k\sigma) \leq 1/k^2$ | Concentration using variance; follows from Markov |
Back to the opening question: should you play the die game? Expected gain is 7 - 7 = 0, so in the long run you break even. But variance matters too: the possible payoffs range from 2 to 12, and because the payout is twice the die face, its standard deviation is $2\sqrt{35/12} = \sqrt{35/3} \approx 3.42$ dollars. Whether to play depends on whether the variance is something you’re willing to take on - a question that depends on your risk tolerance, not just the expectation. Expectation tells you where you’re going on average. Variance tells you how bumpy the ride will be.