Random Variables - Putting Numbers on Uncertain Outcomes
Helpful context:
You roll a die. You don’t really care which of the six physical faces lands up. You care about the number printed on it - 1, 2, 3, 4, 5, or 6 - and what it means for your game.
You flip a coin ten times. You don’t care about the exact sequence of heads and tails. You care about the total count of heads.
You measure how long a bus takes to arrive. You don’t need to specify the exact state of the universe at that moment. You care about one number: the waiting time in minutes.
In each case, you have a random experiment with some sample space of outcomes, but the outcome itself isn’t what matters - a numerical summary of it is. A random variable is the map from outcomes to numbers. It’s what lets you stop talking about specific outcomes and start talking about quantities.
A Function, Despite the Name
Here’s the thing that trips people up: a random variable is not actually a variable in the algebraic sense, and the randomness isn’t baked into the object itself. A random variable is a function.
Formally, let $\Omega$ be a sample space. A random variable is a function
$$X: \Omega \to \mathbb{R}$$
that assigns a real number $X(\omega)$ to each outcome $\omega \in \Omega$.
When you roll a die, $\Omega = \{1,2,3,4,5,6\}$ and the most natural random variable is the identity: $X(\omega) = \omega$. But you could define $Y(\omega) = \omega^2$ (the square of the roll) or $Z(\omega) = 1$ if $\omega$ is even and $0$ otherwise. All of these are perfectly valid random variables on the same sample space. The randomness comes from the random experiment picking $\omega$ - the function $X$ itself is deterministic.
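To make the "function, not variable" point concrete, here is a minimal Python sketch. It assumes a fair die and uses a seeded generator (both are choices made for this illustration); the names `omega_space`, `X`, `Y`, and `Z` simply mirror the symbols above.

```python
import random

# Sample space for one die roll.
omega_space = [1, 2, 3, 4, 5, 6]

# Three random variables on the same sample space.
# Each one is an ordinary deterministic function of the outcome.
def X(omega):
    return omega                       # identity: the face value

def Y(omega):
    return omega ** 2                  # square of the roll

def Z(omega):
    return 1 if omega % 2 == 0 else 0  # 1 if the roll is even, else 0

# The only random step is the experiment choosing omega.
rng = random.Random(0)                 # seeded so the run is repeatable
omega = rng.choice(omega_space)        # assumes a fair die

# Given omega, every value below is fully determined.
print(omega, X(omega), Y(omega), Z(omega))
```

Calling `Z(4)` always returns 1, no matter how often you call it. Randomness enters only through `rng.choice`.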
Why does this matter? Because once you think of $X$ as a function, you can ask mathematical questions about it. What values does it take? How frequently does it take each value? What’s its average? You can compose it with other functions. You can add two random variables together, getting a new one. All of the tools of analysis become available because a random variable is just a function.
The name “random variable” is a historical accident that stuck. Don’t let it mislead you.
Discrete Random Variables
A random variable is discrete if it takes values in a finite or countably infinite set - a list you could in principle enumerate.
The distribution of a discrete random variable is captured by its probability mass function (PMF):
$$p(x) = P(X = x).$$
This is the probability that $X$ takes the specific value $x$. The PMF must satisfy two conditions:
- $p(x) \geq 0$ for all $x$ (probabilities are non-negative)
- $\sum_x p(x) = 1$ (the total probability is 1)
Example: Bernoulli random variable. The simplest possible random variable. Flip a coin once. Let $X = 1$ if heads, $X = 0$ if tails. With $P(\text{heads}) = p$:
$$p(1) = p, \quad p(0) = 1 - p.$$
That’s it. Bernoulli is just “success or failure” with probability $p$ of success. It models a single binary trial.
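Example: Binomial random variable. Flip a coin $n$ times, where each flip independently lands heads with probability $p$ (a fair coin is the special case $p = 1/2$). Let $X$ be the number of heads. Here $X$ can take values $0, 1, 2, \ldots, n$. The PMF is: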
$$p(k) = P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}.$$
Why this formula? There are $\binom{n}{k}$ ways to choose which $k$ of the $n$ flips land heads. Each such arrangement has probability $p^k(1-p)^{n-k}$. The binomial is just $n$ independent Bernoulli trials glued together, with $X$ counting the successes.
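As a quick sanity check, the Binomial PMF can be computed directly from the formula. This is a small sketch; the function name `binomial_pmf` and the values $n = 10$, $p = 0.5$ are chosen only for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5  # illustrative values: ten flips of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(pmf[5])    # probability of exactly 5 heads: 252 / 1024
print(sum(pmf))  # a valid PMF sums to 1 (up to floating-point rounding)
```

Both PMF conditions can be read off the output: every entry is non-negative, and the entries sum to 1.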
Example: Poisson random variable. Suppose events occur randomly over time at a constant average rate - buses arriving, emails landing in your inbox, radioactive atoms decaying. If the average rate is $\lambda$ events per unit time, and events occur independently of one another, then the number of events $X$ in one unit of time follows a Poisson distribution. Here is where the PMF comes from.
Divide the unit time interval into $n$ tiny sub-intervals, each of length $1/n$. Assume at most one event can happen in each sub-interval (an approximation that becomes exact as $n$ grows), and each sub-interval independently has probability $p = \lambda/n$ of containing an event. The total count is then Binomial$(n, \lambda/n)$, so:
$$P(X = k) = \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k}.$$
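Now let $n \to \infty$ - make the sub-intervals infinitely fine. Write $\binom{n}{k}\left(\frac{\lambda}{n}\right)^k = \frac{\lambda^k}{k!} \cdot \frac{n(n-1)\cdots(n-k+1)}{n^k}$ and split $\left(1 - \frac{\lambda}{n}\right)^{n-k}$ into $\left(1 - \frac{\lambda}{n}\right)^{n}\left(1 - \frac{\lambda}{n}\right)^{-k}$. For fixed $k$, three things happen:
- $\displaystyle\frac{n(n-1)\cdots(n-k+1)}{n^k} \to 1$ (the $k$ factors each approach 1)
- $\displaystyle\left(1 - \frac{\lambda}{n}\right)^n \to e^{-\lambda}$ (the standard limit $(1 + x/n)^n \to e^x$, with $x = -\lambda$)
- $\displaystyle\left(1 - \frac{\lambda}{n}\right)^{-k} \to 1$
Multiplying these limits by the remaining factor $\lambda^k/k!$ gives: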
$$p(k) = P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{for } k = 0, 1, 2, \ldots$$
The $e^{-\lambda}$ is not magic - it is the limit of the Binomial’s $(1 - \lambda/n)^n$ term as the sub-intervals get infinitely fine. The Poisson distribution is simply the Binomial distribution taken to the limit of infinitely many trials, each with vanishingly small success probability, with the expected count $np = \lambda$ held fixed.
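The limit can also be checked numerically. The sketch below compares Binomial$(n, \lambda/n)$ with the Poisson PMF as $n$ grows; the values $\lambda = 3$ and $k = 2$ are arbitrary illustrative choices.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

lam, k = 3.0, 2  # illustrative rate and count
target = poisson_pmf(k, lam)

# Finer sub-intervals: more trials, each with smaller probability lam / n.
gaps = {}
for n in (10, 100, 10_000):
    approx = binomial_pmf(k, n, lam / n)
    gaps[n] = abs(approx - target)
    print(n, approx, gaps[n])

print("Poisson:", target)
```

The gap between the two PMFs shrinks as $n$ increases, which is what the derivation predicts.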
The parameter $\lambda$ is called the Poisson rate - the average number of events per unit time. Small $\lambda$ means rare events; large $\lambda$ means a steady stream. The Poisson distribution shows up whenever you’re counting occurrences of independent events over a fixed window of time or space: emails arriving per hour, buses per afternoon, radioactive decays per second.
A less obvious example: database backups within a backup window. Even though backups are “scheduled,” they typically run within a window (say, 2am - 6am) rather than at a precise instant. Within that window, the exact moment each backup starts depends on how long the previous one took, resource availability, and queue ordering - all sources of genuine randomness. Multiple independent systems are running their backups in the same window. The number of backups completing in any given 10-minute sub-interval of that window behaves like a Poisson random variable: events are approximately independent, the average rate is roughly constant throughout the window, and you’re counting how many land in a fixed slice of time.
Continuous Random Variables
Some quantities vary continuously - a waiting time, a temperature, the height of a randomly selected person. For these, the range of the random variable is an interval (or union of intervals) in $\mathbb{R}$.
For a continuous random variable, the probability of taking any exact value is zero. You can’t assign positive probability to each of uncountably many points and have them sum to 1 - there are simply too many of them. Instead, probability is distributed as a density over the real line.
A continuous random variable has a probability density function (PDF) $f(x)$ satisfying:
- $f(x) \geq 0$ for all $x$
- $\int_{-\infty}^\infty f(x)dx = 1$
The probability that $X$ falls in an interval is:
$$P(a \leq X \leq b) = \int_a^b f(x)dx.$$
A PDF is not a probability - it’s a density, and it can exceed 1 at specific points. Only integrals of $f$ give probabilities.
Example: Uniform distribution on $[a,b]$. Every point in the interval is equally likely (in the continuous sense). The PDF is constant:
$$f(x) = \frac{1}{b-a} \quad \text{for } a \leq x \leq b, \quad f(x) = 0 \text{ elsewhere.}$$
Probability is proportional to length. $P(X \in [c,d]) = (d-c)/(b-a)$ - just the fraction of the interval covered.
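The remark that a density can exceed 1 is easy to see with a short interval. The sketch below uses Uniform$[0, 0.5]$, whose density is 2 everywhere on the interval, and a simple midpoint Riemann sum as the integrator. The helper names and the interval are assumptions made for illustration.

```python
def uniform_pdf(x, a, b):
    """Density of Uniform[a, b]: constant 1/(b - a) inside, 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, steps=100_000):
    """Midpoint Riemann sum of f over [lo, hi] (a crude numerical integral)."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

a, b = 0.0, 0.5  # illustrative interval, shorter than 1 on purpose

def pdf(x):
    return uniform_pdf(x, a, b)

density_value = pdf(0.25)          # equals 2.0: a density, not a probability
total = integrate(pdf, -1.0, 1.0)  # total area under the density, close to 1
prob = integrate(pdf, 0.1, 0.3)    # P(0.1 <= X <= 0.3), close to 0.2 / 0.5

print(density_value, total, prob)
```

The density value 2 is not a probability. Only the integrals are, and they stay between 0 and 1.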
Example: Exponential distribution. If buses arrive according to a Poisson process with rate $\lambda$ (as above), the waiting time $X$ until the next arrival has an exponential distribution. Here is where the PDF comes from - it follows directly from the Poisson PMF we just defined.
The key question is: what is the probability that the waiting time exceeds $t$? That happens exactly when zero buses arrive in the interval $[0, t]$. The number of arrivals in a window of length $t$ is Poisson with mean $\lambda t$ (rate $\lambda$ per unit time, over $t$ units of time), so:
$$P(X > t) = P(\text{zero arrivals in } [0,t]) = \frac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-\lambda t}.$$
This is the survival function - the probability of waiting longer than $t$. Flipping it gives the probability of waiting at most $t$, which is $P(X \leq t) = 1 - e^{-\lambda t}$. This is called the cumulative distribution function (CDF) - it accumulates all the probability up to the point $t$, and we will define it properly in the next section. For now, just note that since probability is area under the PDF, the CDF is simply the integral of the PDF from 0 to $t$. Differentiating to recover the PDF:
$$f(x) = \frac{d}{dx}(1 - e^{-\lambda x}) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0.$$
The formula isn’t pulled from thin air - it’s the derivative of one minus the Poisson zero-count probability. The parameter $\lambda$ is the rate (arrivals per unit time), and $1/\lambda$ is the average waiting time. The exponential has a remarkable property: it’s memoryless. If you’ve already been waiting 10 minutes, the additional waiting time has the exact same distribution as if you’d just arrived. The bus has no memory of how long you’ve been standing there.
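Memorylessness follows from the survival function in one line: $P(X > s + t \mid X > s) = e^{-\lambda(s+t)} / e^{-\lambda s} = e^{-\lambda t} = P(X > t)$. A small numerical sketch, assuming a hypothetical rate of $\lambda = 0.2$ buses per minute:

```python
from math import exp

def survival(t, lam):
    """P(X > t) for X ~ Exponential(lam), valid for t >= 0."""
    return exp(-lam * t)

lam = 0.2          # hypothetical rate: on average one bus every 5 minutes
s, t = 10.0, 5.0   # already waited s minutes; ask about t more minutes

# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
conditional = survival(s + t, lam) / survival(s, lam)
fresh = survival(t, lam)  # P(X > t) for someone who has just arrived

print(conditional, fresh)  # the two agree up to floating-point rounding
```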
Discomfort check. Why is $P(X = x) = 0$ for a continuous random variable, even though $x$ is a perfectly valid value it could take? The real line contains uncountably many points. If each point had positive probability - even just $0.000001$ - then the total probability would be infinite, since you’d need to add that positive number uncountably many times. The only consistent assignment is to give every individual point probability zero. But intervals can still have positive probability: $P(a \leq X \leq b) > 0$ whenever the density is positive somewhere inside $[a, b]$. Think of it this way: probability is like water. You can have a puddle (an interval with nonzero area), but a single infinitely thin line holds no water. Points have probability zero; intervals have probability equal to the area under the density curve. This is not a flaw - it’s just how the continuous world works.
The Cumulative Distribution Function
Both discrete and continuous random variables have a cumulative distribution function (CDF):
$$F(x) = P(X \leq x).$$
The CDF works for every type of random variable, no exceptions. It always satisfies:
- Non-decreasing: if $x \leq y$ then $F(x) \leq F(y)$
- Right-continuous: $\lim_{h \downarrow 0} F(x + h) = F(x)$
- Limits: $F(x) \to 0$ as $x \to -\infty$ and $F(x) \to 1$ as $x \to +\infty$
For a continuous random variable, the PDF and CDF are related by differentiation: $f(x) = F'(x)$ wherever the derivative exists.
For a discrete random variable, the CDF is a staircase: constant between mass points, with jumps of size $p(x)$ at each value $x$ in the range.
The CDF is often the most useful object in practice. “What’s the probability of waiting less than 5 minutes?” - that’s $F(5)$. “What’s the probability of waiting between 3 and 7 minutes?” - that’s $F(7) - F(3)$.
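Both flavors of CDF are short to write down. The sketch below evaluates the exponential CDF derived earlier and the staircase CDF of a fair die. The rate $\lambda = 0.2$ per minute and the function names are illustrative assumptions.

```python
from math import exp

def exponential_cdf(x, lam):
    """F(x) = P(X <= x) for X ~ Exponential(lam)."""
    return 1.0 - exp(-lam * x) if x >= 0 else 0.0

def die_cdf(x):
    """F(x) = P(X <= x) for a fair die: the fraction of faces that are <= x."""
    return sum(1 for face in range(1, 7) if face <= x) / 6

lam = 0.2  # hypothetical rate in arrivals per minute

# "Probability of waiting less than 5 minutes" is F(5).
print(exponential_cdf(5, lam))

# "Probability of waiting between 3 and 7 minutes" is F(7) - F(3).
print(exponential_cdf(7, lam) - exponential_cdf(3, lam))

# The discrete CDF is flat between mass points and jumps by 1/6 at each face.
print(die_cdf(3), die_cdf(3.5), die_cdf(4))
```

Note that `die_cdf(3)` and `die_cdf(3.5)` are equal. No probability mass lies strictly between 3 and 4, so the staircase is flat there.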
Functions of Random Variables
If $X$ is a random variable and $g$ is a function, then $Y = g(X)$ is also a random variable. You’ve changed the measurement.
For discrete $X$: the PMF of $Y = g(X)$ is $P(Y = y) = \sum_{x: g(x) = y} p_X(x)$. Sum over all inputs that map to $y$.
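The discrete rule is just "group the inputs by their output and add up the probabilities". Here is a minimal sketch, using a hypothetical $X$ that is uniform on $\{-2, -1, 0, 1, 2\}$ and $g(x) = x^2$; the helper name `pushforward_pmf` is made up for this example.

```python
from collections import defaultdict
from fractions import Fraction

def pushforward_pmf(pmf_x, g):
    """PMF of Y = g(X): for each y, sum p_X(x) over all x with g(x) == y."""
    pmf_y = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for x, prob in pmf_x.items():
        pmf_y[g(x)] += prob
    return dict(pmf_y)

# Hypothetical example: X uniform on {-2, -1, 0, 1, 2}, Y = X^2.
pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}
pmf_y = pushforward_pmf(pmf_x, lambda x: x * x)

print(pmf_y)  # Y = 4 and Y = 1 each collect mass from two inputs
```

Because $g$ is not one-to-one here, the values $y = 1$ and $y = 4$ each get probability $2/5$, while $y = 0$ gets $1/5$.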
For continuous $X$ with a strictly monotone, differentiable $g$: the PDF of $Y = g(X)$ is
$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy} g^{-1}(y)\right|.$$
The extra derivative term is a scaling correction. Here is why it is needed. Probability is area under the density curve, and the total area must stay 1 after the transformation. If $g$ stretches the $x$-axis by a factor of 2 near some point, then the same chunk of probability is now spread over twice the length on the $y$-axis - so the density must be halved there to keep the area the same. The factor $|(g^{-1})'(y)|$ supplies exactly this correction: it is the rate at which the inverse map moves on the $x$-axis as $y$ changes, which is the reciprocal of how fast $g$ moves. Where $g$ stretches by a factor of 2, the factor is $1/2$.
Example. If $X \sim \text{Uniform}(0,1)$ and $Y = -\ln X$, then $g^{-1}(y) = e^{-y}$ and $|(g^{-1})'(y)| = e^{-y}$, giving $f_Y(y) = e^{-y}$ for $y > 0$ - that’s $\text{Exponential}(1)$. Taking the negative log of a uniform random variable gives you an exponential. This trick is used extensively in simulation.
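A simulation sketch of this trick: draw uniforms, take the negative log, and compare against what $\text{Exponential}(1)$ predicts (mean 1 and $P(Y > 1) = e^{-1}$). The seed and sample size are arbitrary choices. The code uses `1 - rng.random()`, which is also uniform but lies in $(0, 1]$, to avoid taking the log of zero.

```python
import random
from math import exp, log

rng = random.Random(0)  # seeded for repeatability
n = 100_000             # arbitrary sample size

# rng.random() is in [0, 1), so 1 - rng.random() is in (0, 1] and log is safe.
samples = [-log(1.0 - rng.random()) for _ in range(n)]

sample_mean = sum(samples) / n
frac_above_1 = sum(1 for y in samples if y > 1.0) / n

print(sample_mean)            # should be near 1, the Exponential(1) mean
print(frac_above_1, exp(-1))  # should be near e^{-1}, about 0.368
```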
The Indicator Variable Trick
One of the most useful objects in probability is also the simplest.
For any event $A$, the indicator random variable is:
$$\mathbf{1}_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A \end{cases}$$
It’s 1 when $A$ occurs, 0 when it doesn’t.
Indicators are useful because they convert events into numbers, which you can then add, multiply, and take expectations of. The key fact:
$$E[\mathbf{1}_A] = 1 \cdot P(A) + 0 \cdot P(A^c) = P(A).$$
The expected value of an indicator is the probability of the event. This sounds trivial but it’s enormously productive, as we’ll see in the next post.
Indicators also interact cleanly with set operations:
- $\mathbf{1}_{A \cap B} = \mathbf{1}_A \cdot \mathbf{1}_B$ (both occur iff both indicators are 1)
- $\mathbf{1}_{A \cup B} = \mathbf{1}_A + \mathbf{1}_B - \mathbf{1}_A \cdot \mathbf{1}_B$
If you want to count how many events in a collection $A_1, A_2, \ldots, A_n$ occur, define:
$$N = \sum_{i=1}^n \mathbf{1}_{A_i}.$$
Then $N$ is a random variable. To find its expected value, we use a fact called linearity of expectation: the expected value of a sum is always the sum of the expected values, $E[X + Y] = E[X] + E[Y]$, regardless of whether $X$ and $Y$ are independent or dependent. Intuitively this holds because expectation is just a weighted average, and averages distribute over sums - if you average a total, it’s the same as summing the averages. This will be proved carefully in the next post; for now, applying it gives:
$$E[N] = \sum_{i=1}^n E[\mathbf{1}_{A_i}] = \sum_{i=1}^n P(A_i).$$
No matter how complicated the dependence structure between the $A_i$’s, the expected count is just the sum of the probabilities. This is the indicator trick in action - and it’s one of the most powerful shortcuts in all of probability.
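Here is a small exact check on a fair die, with $N$ defined as the number of events that occur. The three events are hypothetical and deliberately overlapping (so they are far from independent): "even", "at least 4", and "six". The expected count is computed two ways: directly from the definition of expectation, and as the sum of the event probabilities.

```python
from fractions import Fraction

omega_space = range(1, 7)  # fair die, each outcome has probability 1/6

# Hypothetical, heavily overlapping events.
events = [
    {2, 4, 6},  # A1: the roll is even
    {4, 5, 6},  # A2: the roll is at least 4
    {6},        # A3: the roll is a six
]

def prob(event):
    """P(A) for an event A on the fair die."""
    return Fraction(len(event), 6)

def N(omega):
    """Number of events that occur: the sum of the indicators at omega."""
    return sum(1 for A in events if omega in A)

# Direct expectation: average N over the six equally likely outcomes.
expected_N = sum(Fraction(1, 6) * N(omega) for omega in omega_space)

# Indicator trick: just add up the probabilities.
sum_of_probs = sum(prob(A) for A in events)

print(expected_N, sum_of_probs)  # both equal 7/6
```

The second computation never looks at how the events overlap, which is the whole point of the trick.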
Independent and Identically Distributed Random Variables
One of the most common setups in all of probability and statistics: a sequence $X_1, X_2, \ldots, X_n$ that is independent and identically distributed, written i.i.d.
- Identically distributed means every $X_i$ has the same distribution - the same PMF or PDF, the same mean, the same variance. They’re all draws from the same underlying population.
- Independent means the outcome of any subset of the $X_i$’s gives you no information about the rest. Formally, in the discrete case the joint distribution factors: $P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^n P(X_i = x_i)$. For continuous variables, the joint density factors in the same way.
Rolling a die $n$ times: each roll is i.i.d. Uniform$\{1,\ldots,6\}$. Flipping a coin $n$ times: i.i.d. Bernoulli$(p)$. Measuring the heights of $n$ randomly selected people: approximately i.i.d. Normal$(\mu, \sigma^2)$.
The i.i.d. assumption makes calculations tractable. For i.i.d. $X_1, \ldots, X_n$ with mean $\mu$ and variance $\sigma^2$:
$$E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \mu, \qquad \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{\sigma^2}{n}.$$
The sample mean has the right expected value and its variance shrinks as $n$ grows. This is the foundation of the Law of Large Numbers: the sample mean converges to the true mean. And the Central Limit Theorem says the normalized sum converges in distribution to a Normal, whatever the original distribution, as long as its variance is finite. Both results, in the forms stated here, rely on independence.
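The $\sigma^2/n$ scaling is easy to see in simulation. The sketch below averages $n$ i.i.d. fair die rolls (mean $3.5$, variance $35/12$), repeats that many times, and compares the empirical variance of the sample mean with $\sigma^2/n$. The seed, the number of repetitions, and the helper names are arbitrary choices for this illustration.

```python
import random

rng = random.Random(42)  # seeded for repeatability
mu = 3.5                 # mean of one fair die roll
sigma2 = 35 / 12         # variance of one fair die roll
trials = 5_000           # how many sample means to generate per n

def sample_mean_of_rolls(n):
    """Average of n i.i.d. fair die rolls."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

results = {}
for n in (1, 10, 100):
    means = [sample_mean_of_rolls(n) for _ in range(trials)]
    results[n] = (sum(means) / trials, variance(means))
    # empirical mean, empirical variance, theoretical variance sigma^2 / n
    print(n, results[n][0], results[n][1], sigma2 / n)
```

The empirical mean stays near $3.5$ for every $n$, while the empirical variance drops by roughly a factor of 10 each time $n$ grows tenfold.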
The i.i.d. assumption is an idealization. Real data has correlations (time series), heterogeneity (different subpopulations), and distribution shift (train vs. test in ML). Knowing when the assumption holds - and when it breaks - is as important as knowing how to use it.
Summary
| Concept | What It Is |
|---|---|
| Random variable | A function $X: \Omega \to \mathbb{R}$ |
| Discrete PMF | $p(x) = P(X = x)$, must sum to 1 |
| Continuous PDF | $f(x) \geq 0$, integrates to 1; probability = area |
| $P(X = x)$ for continuous $X$ | Always zero - probability lives in intervals, not points |
| CDF | $F(x) = P(X \leq x)$, works for any type of RV |
| Bernoulli$(p)$ | Single binary trial with success probability $p$ |
| Binomial$(n,p)$ | Count of successes in $n$ independent Bernoulli$(p)$ trials |
| Poisson$(\lambda)$ | Count of independent events in a fixed window; rate $\lambda$; limit of Binomial$(n, \lambda/n)$ |
| Uniform$[a,b]$ | Constant density on an interval |
| Exponential$(\lambda)$ | Waiting times; rate $\lambda$; memoryless |
| Indicator $\mathbf{1}_A$ | 1 if $A$ occurs, 0 if not; $E[\mathbf{1}_A] = P(A)$ |
Random variables are the bridge between probability theory and computation. Once you have a random variable, you can ask quantitative questions: What’s the typical value? How spread out are the values? How do two quantities move together? Those questions - expectation, variance, covariance - are next.