Probability Distributions - The Shapes That Randomness Takes
Helpful context:
- Random Variables - Putting Numbers on Uncertain Outcomes
- Expectation, Variance & Covariance - The Center, the Spread, and the Relationship
Have you noticed that the same shapes keep appearing?
The number of cars passing through an intersection per minute. The number of typos on a page. The number of radioactive particles emitted in a second. These are completely different phenomena - traffic, typing, physics - and yet, to a good approximation, their counts all follow the same distribution. One formula describes all of them.
Or consider: the time until the next earthquake. The lifetime of a light bulb. The waiting time until your next phone call. Again: completely different things, same mathematical shape.
And heights of adults. Test scores in a large class. Measurement errors in a physics experiment. Same shape again - the bell curve - arising from completely different mechanisms.
This isn’t a coincidence. There are deeper reasons why certain shapes recur, and understanding those reasons is the point of this post. Let’s meet the main families, understand where they come from, and develop an intuition for when to reach for each one.
The Discrete Zoo
A discrete random variable takes values in a countable set - usually non-negative integers. Its distribution is described by a PMF $p(k) = P(X = k)$.
Bernoulli$(p)$: The Atom of Probability
The simplest non-trivial distribution. A single trial that either succeeds (with probability $p$) or fails (with probability $1-p$):
$$P(X = 1) = p, \qquad P(X = 0) = 1-p.$$
Mean: $p$. Variance: $p(1-p)$.
The Bernoulli is the hydrogen atom of probability - everything else is built from it. Flip a coin: Bernoulli(1/2). A patient either responds to treatment or doesn’t: Bernoulli($p$) for some unknown $p$. A user either clicks an ad or doesn’t: Bernoulli($p$).
Binomial$(n, p)$: Counting Successes
Now repeat the Bernoulli trial $n$ times, independently. How many successes do you get? That’s the Binomial.
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n.$$
The factor $\binom{n}{k}$ counts the number of ways to arrange $k$ successes among $n$ trials. The factor $p^k(1-p)^{n-k}$ is the probability of any particular arrangement.
Mean: $np$. Variance: $np(1-p)$.
These follow from writing $X = X_1 + \cdots + X_n$ where each $X_i \sim \text{Bernoulli}(p)$. Linearity of expectation gives the mean immediately. Since the $X_i$ are independent, variances add.
Think of it as: 1000 fair coin flips, how many heads? $n = 1000$, $p = 0.5$, expected value $= 500$, standard deviation $= \sqrt{250} \approx 15.8$. You’ll almost always see between 450 and 550 heads.
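To make the coin-flip numbers concrete, here is a minimal sketch that evaluates the Binomial PMF directly and adds up the exact probability of landing between 450 and 550 heads. The helper name `binomial_pmf` is just an illustrative choice, not something from a library.

```python
from math import comb, sqrt


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), straight from the formula."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 1000, 0.5
mean = n * p                      # np
std = sqrt(n * p * (1 - p))       # sqrt(np(1-p))

# Exact probability of seeing between 450 and 550 heads (inclusive).
prob_450_550 = sum(binomial_pmf(k, n, p) for k in range(450, 551))

print(f"mean={mean}, std={std:.1f}, P(450 <= X <= 550)={prob_450_550:.4f}")
```

The probability comes out well above 99%, which is what "almost always" means here: 450 and 550 sit a little more than three standard deviations from the mean.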
Geometric$(p)$: Waiting for the First Success
You keep flipping a biased coin until you get heads. How many flips does it take?
$$P(X = k) = (1-p)^{k-1} p, \quad k = 1, 2, 3, \ldots$$
The intuition: you fail $k-1$ times in a row (probability $(1-p)^{k-1}$), then succeed on flip $k$ (probability $p$).
Mean: $1/p$. Variance: $(1-p)/p^2$.
The Geometric has a remarkable property called memorylessness. If you’ve already failed 10 times, the distribution of the number of additional trials is the same as if you were starting fresh. Formally:
$$P(X > m + n \mid X > m) = P(X > n).$$
This is the discrete analog of having no memory of the past. The coin doesn’t “remember” its previous outcomes.
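Memorylessness is easy to check numerically. The sketch below uses the fact that $P(X > k) = (1-p)^k$ (the first $k$ trials all fail) and compares the conditional and unconditional probabilities. The values of `p`, `m`, and `n` are arbitrary illustrative choices.

```python
def geometric_survival(k: int, p: float) -> float:
    """P(X > k) for X ~ Geometric(p): the first k trials all fail."""
    return (1 - p) ** k


p, m, n = 0.2, 10, 5

# P(X > m + n | X > m) = P(X > m + n) / P(X > m)
conditional = geometric_survival(m + n, p) / geometric_survival(m, p)
unconditional = geometric_survival(n, p)

print(conditional, unconditional)  # the two agree up to floating-point rounding
```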
Poisson$(\lambda)$: Counts of Rare Events
You’ll encounter the Poisson everywhere that rare events accumulate: phone calls arriving at a call center, car accidents in a city per day, typos in a document, defects in a manufacturing batch.
$$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$$
Mean: $\lambda$. Variance: $\lambda$. (The mean equals the variance - this is the hallmark of the Poisson.)
The Poisson is not an ad hoc formula. It emerges naturally from a limiting argument.
Discomfort check. The Poisson is not the Binomial with $n = \infty$. That would be undefined. The Poisson is the limit of Binomial$(n, \lambda/n)$ as $n \to \infty$ - you take more and more trials, each with smaller and smaller success probability, keeping the expected count fixed at $\lambda$.
Here’s the derivation. Let $X_n \sim \text{Binomial}(n, \lambda/n)$. Then:
$$P(X_n = k) = \binom{n}{k}\left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k}.$$
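As $n \to \infty$ with $k$ held fixed, three things happen. First, $\binom{n}{k}\left(\frac{\lambda}{n}\right)^k = \frac{n(n-1)\cdots(n-k+1)}{n^k}\cdot\frac{\lambda^k}{k!} \to \frac{\lambda^k}{k!}$, because $\binom{n}{k}/n^k \to 1/k!$. Second, $\left(1 - \lambda/n\right)^{n} \to e^{-\lambda}$ (using $(1 + x/n)^n \to e^x$ with $x = -\lambda$). Third, the leftover factor $\left(1 - \lambda/n\right)^{-k} \to 1$. The result: $P(X_n = k) \to e^{-\lambda}\lambda^k/k!$. The Poisson PMF.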
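The limit can also be watched numerically. This sketch compares the Binomial$(n, \lambda/n)$ probability of a fixed $k$ with the Poisson probability as $n$ grows. The choices $\lambda = 3$ and $k = 2$ are arbitrary, and the helper names are illustrative.

```python
from math import comb, exp, factorial


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


lam, k = 3.0, 2
gaps = []
for n in (10, 100, 1000, 10000):
    # Keep the expected count n * (lam / n) = lam fixed while n grows.
    gap = abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
    gaps.append(gap)
    print(n, gap)
```

The gap shrinks steadily as $n$ increases, which is the convergence the derivation describes.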
The practical interpretation: if you’re counting events that happen at an average rate of $\lambda$ per unit time, and each event is independent of all others, then the count follows Poisson$(\lambda)$.
Hypergeometric$(N, K, n)$: Sampling Without Replacement
The Binomial assumes each trial is independent - you sample with replacement, so the population doesn’t change. But what if you’re drawing from a finite population without putting items back?
You have a population of $N$ items, $K$ of which are “successes.” You draw $n$ without replacement. Let $X$ be the number of successes in the sample.
$$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}, \quad \max(0, n-(N-K)) \leq k \leq \min(n, K).$$
The numerator counts the ways to pick $k$ successes from $K$ and $n-k$ failures from $N-K$. The denominator is the total number of ways to pick any $n$ items.
Mean: $n\frac{K}{N}$. Variance: $n\frac{K}{N}\cdot\frac{N-K}{N}\cdot\frac{N-n}{N-1}$.
The mean matches Binomial$(n, K/N)$ - the expected fraction of successes is the same whether you replace or not. But the variance is smaller by the finite population correction $\frac{N-n}{N-1}$: sampling without replacement reduces variability because you can’t repeatedly draw the same item. As $N \to \infty$ with $K/N \to p$ fixed, the correction factor approaches 1 and the Hypergeometric converges to Binomial$(n, p)$. So the Binomial is what you use when the population is large relative to the sample.
Typical use: quality control (inspecting $n$ items from a batch of $N$ with $K$ defects), card games (probability of drawing $k$ aces when dealt $n$ cards), and ecological capture-recapture estimation.
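As a small worked case, take the card example: the number of aces in a 5-card hand from a 52-card deck is Hypergeometric$(52, 4, 5)$. The sketch below builds the PMF from the formula and checks the mean and variance against the closed forms. The function name `hypergeom_pmf` is an illustrative choice.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) when drawing n items without replacement from N, K of them successes."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0  # outside the support
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


N, K, n = 52, 4, 5  # deck size, aces in the deck, cards dealt
pmf = [hypergeom_pmf(k, N, K, n) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

formula_mean = n * K / N
formula_var = n * (K / N) * ((N - K) / N) * ((N - n) / (N - 1))
binomial_var = n * (K / N) * (1 - K / N)  # what sampling with replacement would give

print(mean, var, binomial_var)
```

The variance is smaller than the Binomial variance by exactly the finite population correction $(N-n)/(N-1)$.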
The Continuous Zoo
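A continuous random variable has a probability density function (PDF) $f(x) \geq 0$ with $\int_{-\infty}^{\infty} f(x)\,dx = 1$. Probabilities are areas under the density: $P(a \leq X \leq b) = \int_a^b f(x)\,dx$.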
Uniform$[a,b]$: Complete Ignorance
When you know a value falls in $[a,b]$ and nothing else - no reason to think any subinterval is more likely than any other - you use the Uniform:
$$f(x) = \frac{1}{b-a}, \quad a \leq x \leq b.$$
Mean: $(a+b)/2$. Variance: $(b-a)^2/12$.
The Uniform is the distribution of maximum entropy over a bounded interval. It’s what you use when you genuinely have no information about where in $[a,b]$ the value falls. It’s also the building block for generating other distributions: if $U \sim \text{Uniform}[0,1]$, you can transform it into almost any other distribution using the inverse CDF method.
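Here is a minimal sketch of the inverse CDF method, using the Exponential distribution from the next section as the target. Its CDF is $F(x) = 1 - e^{-\lambda x}$, so solving $u = F(x)$ gives $x = -\ln(1-u)/\lambda$. The helper name `sample_exponential`, the seed, and the rate are illustrative assumptions.

```python
import math
import random


def sample_exponential(lam: float, rng: random.Random) -> float:
    """Turn one Uniform[0, 1) draw into an Exponential(lam) draw via the inverse CDF."""
    u = rng.random()                   # U ~ Uniform[0, 1)
    return -math.log(1.0 - u) / lam    # F^{-1}(u) for F(x) = 1 - exp(-lam * x)


rng = random.Random(0)  # fixed seed so the run is reproducible
lam = 2.0
samples = [sample_exponential(lam, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

print(sample_mean)  # should land close to the theoretical mean 1 / lam
```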
Exponential$(\lambda)$: Waiting Times
The Exponential describes how long you wait for an event to occur, when events arrive at a constant average rate.
$$f(x) = \lambda e^{-\lambda x}, \quad x \geq 0.$$
Mean: $1/\lambda$. Variance: $1/\lambda^2$.
If phone calls arrive at an average rate of $\lambda$ calls per minute, the waiting time between calls is Exponential$(\lambda)$. If a radioactive atom decays at rate $\lambda$, the time until decay is Exponential$(\lambda)$.
Like the Geometric, the Exponential is memoryless: if you’ve already waited $s$ minutes, the additional waiting time still has the same Exponential distribution:
$$P(X > s + t \mid X > s) = P(X > t).$$
The Exponential is the unique continuous memoryless distribution, just as the Geometric is the unique discrete one.
The Exponential and Poisson are deeply connected: if events follow a Poisson process (Poisson count per unit time with rate $\lambda$), then the inter-arrival times are i.i.d. Exponential$(\lambda)$. Count and waiting time are two faces of the same process.
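The two faces can be seen in a small simulation: generate arrival times by adding up Exponential$(\lambda)$ gaps, then count how many arrivals fall in each unit interval. If the connection holds, the counts should have mean and variance both close to $\lambda$. This is only a sketch. The function name, seed, and rate are illustrative, and `random.expovariate` is the standard-library Exponential sampler.

```python
import random


def counts_per_unit_interval(lam: float, horizon: int, rng: random.Random) -> list:
    """Simulate arrivals with Exponential(lam) gaps; count arrivals in each [i, i+1)."""
    counts = [0] * horizon
    t = rng.expovariate(lam)           # time of the first arrival
    while t < horizon:
        counts[int(t)] += 1            # arrival falls in the interval [int(t), int(t)+1)
        t += rng.expovariate(lam)      # wait an independent Exponential(lam) gap
    return counts


rng = random.Random(1)
lam = 4.0
counts = counts_per_unit_interval(lam, 20_000, rng)

mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)

print(mean, var)  # both should be near lam, the Poisson hallmark
```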
Normal$(\mu, \sigma^2)$: The Universal Shape
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
Mean: $\mu$. Variance: $\sigma^2$.
The Normal is special because of the Central Limit Theorem: sums and averages of large numbers of independent random variables with finite variance, once centered and rescaled, converge to a Normal distribution, regardless of what distribution those variables come from. This is why measurement errors are approximately normally distributed (they’re sums of many small random perturbations), why heights are approximately normal (genetics is approximately additive), and why the Normal appears so pervasively in nature.
The 68-95-99.7 rule is the most useful fact about the Normal: about 68% of the probability mass falls within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3. This is where “within 2 sigma” and “3 sigma events” come from.
Standardization: if $X \sim \mathcal{N}(\mu, \sigma^2)$, then the Z-score $Z = (X - \mu)/\sigma$ has a standard Normal distribution $\mathcal{N}(0, 1)$. Every Normal distribution is just a rescaled and shifted version of the standard Normal.
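Both facts can be checked with the standard library, since the standard Normal CDF can be written as $\Phi(z) = \tfrac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)$. The sketch below computes the mass within 1, 2, and 3 standard deviations and standardizes one value. The example numbers ($\mu = 100$, $\sigma = 15$, $x = 130$) are made up for illustration.

```python
from math import erf, sqrt


def standard_normal_cdf(z: float) -> float:
    """Phi(z) = P(Z <= z) for Z ~ N(0, 1), written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def prob_within(k: float) -> float:
    """P(|Z| <= k): mass within k standard deviations of the mean."""
    return standard_normal_cdf(k) - standard_normal_cdf(-k)


def z_score(x: float, mu: float, sigma: float) -> float:
    """Standardize x under N(mu, sigma^2)."""
    return (x - mu) / sigma


within = {k: prob_within(k) for k in (1, 2, 3)}
print(within)

# A value of 130 under N(100, 15^2) sits two standard deviations above the mean.
z = z_score(130.0, 100.0, 15.0)
print(z, 1.0 - standard_normal_cdf(z))  # z-score and upper-tail probability
```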
Beta$(\alpha, \beta)$: Distributions Over Probabilities
The Beta distribution is defined on the interval $(0, 1)$:
$$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \quad x \in (0,1),$$
where $B(\alpha,\beta)$ is a normalizing constant.
Mean: $\alpha/(\alpha + \beta)$. Variance: $\dfrac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$.
The Beta is the natural distribution to put over a probability $p \in [0,1]$. When you don’t know the bias of a coin and want to quantify your uncertainty about it, you describe your belief as a Beta distribution. If your prior belief is Beta$(\alpha, \beta)$, then after observing $k$ heads in $n$ flips your updated belief is Beta$(k + \alpha,\ n - k + \beta)$ - Bayes' theorem in clean closed form.
The shape changes dramatically with $\alpha$ and $\beta$. Beta(1,1) is flat (complete ignorance). Beta(2,2) is a symmetric hill. Beta(10,2) is concentrated near 1 (biased toward heads). Beta(0.5, 0.5) is U-shaped (concentrated near 0 and 1).
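The update rule is simple enough to write down in a few lines. This sketch starts from the flat Beta(1,1) prior and updates on a made-up observation of 7 heads in 10 flips. The function names and the data are illustrative assumptions.

```python
def beta_update(alpha: float, beta: float, heads: int, flips: int) -> tuple:
    """Posterior Beta parameters after observing `heads` successes in `flips` trials."""
    return alpha + heads, beta + (flips - heads)


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)


prior = (1.0, 1.0)  # flat prior: every bias equally plausible
posterior = beta_update(*prior, heads=7, flips=10)

print(posterior)              # Beta(8, 4)
print(beta_mean(*prior))      # prior mean 0.5
print(beta_mean(*posterior))  # posterior mean 8/12, pulled toward the observed 7/10
```

The posterior mean sits between the prior mean and the observed frequency, and moves closer to the data as more flips come in.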
Gamma$(\alpha, \beta)$: Generalized Waiting Time
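The Gamma distribution generalizes the Exponential. If you wait for $\alpha$ events to occur (with $\alpha$ a positive integer and events arriving independently at rate $\beta$), the total waiting time is Gamma$(\alpha, \beta)$. The density below is also defined for non-integer $\alpha > 0$: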
$$f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0.$$
Mean: $\alpha/\beta$. Variance: $\alpha/\beta^2$.
When $\alpha = 1$, this reduces to Exponential$(\beta)$. More generally, when $\alpha$ is a positive integer, the Gamma is the distribution of the sum of $\alpha$ independent Exponential$(\beta)$ random variables.
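A quick simulation illustrates the sum-of-Exponentials picture for integer $\alpha$: add up $\alpha$ independent Exponential$(\beta)$ waiting times many times over and compare the sample mean and variance with $\alpha/\beta$ and $\alpha/\beta^2$. The parameter values and seed below are arbitrary illustrative choices.

```python
import random

rng = random.Random(2)
alpha, beta = 3, 2.0  # wait for 3 events, each arriving at rate 2

# Each total is the sum of alpha independent Exponential(beta) waiting times.
totals = [
    sum(rng.expovariate(beta) for _ in range(alpha))
    for _ in range(50_000)
]

sample_mean = sum(totals) / len(totals)
sample_var = sum((t - sample_mean) ** 2 for t in totals) / len(totals)

print(sample_mean, alpha / beta)      # compare with the Gamma mean alpha / beta
print(sample_var, alpha / beta**2)    # compare with the Gamma variance alpha / beta^2
```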
Why the Normal Is Special
The Normal appears so often that it deserves a moment of reflection. Why this particular bell curve, out of all possible smooth symmetric distributions?
The answer - which will be made precise in the CLT post - is that the Normal is the only distribution with finite variance that is stable under addition: add two independent Normals and you get another Normal, and no other finite-variance shape reproduces itself this way. When you average together many independent random variables, the randomness coming from each individual variable cancels out in a specific way. The details of the original distribution wash away. Only the mean and variance survive in the limit. The Normal is the shape that remains.
This is why, in practice, you can often treat sample averages as approximately Normal even if the underlying data is Binomial, Poisson, or any other distribution with finite variance, provided the sample size is large enough.
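To see the washing-out in action, this sketch averages draws from a strongly skewed distribution, Exponential(1), standardizes each average, and checks how much of the mass falls within 1 and 2 standard deviations. If the averages are close to Normal, the fractions should be near 68% and 95%. The sample size, number of trials, and seed are illustrative assumptions.

```python
import math
import random

rng = random.Random(3)
n, trials = 40, 10_000

# Exponential(1) has mean 1 and variance 1, so an average of n draws
# has mean 1 and standard deviation 1 / sqrt(n).
mu, sigma = 1.0, 1.0
z_values = []
for _ in range(trials):
    xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    z_values.append((xbar - mu) / (sigma / math.sqrt(n)))

frac_within_1 = sum(abs(z) <= 1 for z in z_values) / trials
frac_within_2 = sum(abs(z) <= 2 for z in z_values) / trials

print(frac_within_1, frac_within_2)  # compare with about 0.68 and 0.95
```

With small `n` the skew of the Exponential is still visible in the tails. Increasing `n` brings the fractions closer to the Normal values.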
Which Distribution When?
Here is a rough decision guide:
- Counting events in a fixed period or region, when events are rare and independent → Poisson
- Number of successes in a fixed number of independent trials → Binomial
- Number of successes when drawing without replacement from a finite population → Hypergeometric
- Waiting time until the first (or $k$-th) event → Exponential (or Gamma)
- Proportion or probability that lies in $[0,1]$ → Beta
- Continuous measurement of a quantity that is the sum of many small contributions → Normal
- No information about a quantity except its range → Uniform
These aren’t rigid rules - they’re starting points. The real work is checking whether the distribution fits the data.
Summary
| Distribution | Parameters | $E[X]$ | $\text{Var}(X)$ | Typical use |
|---|---|---|---|---|
| Bernoulli$(p)$ | $p \in [0,1]$ | $p$ | $p(1-p)$ | Single binary trial |
| Binomial$(n,p)$ | $n \in \mathbb{N}$, $p \in [0,1]$ | $np$ | $np(1-p)$ | Count of successes in $n$ trials |
| Geometric$(p)$ | $p \in (0,1]$ | $1/p$ | $(1-p)/p^2$ | Trials until first success |
| Poisson$(\lambda)$ | $\lambda > 0$ | $\lambda$ | $\lambda$ | Count of rare independent events |
| Hypergeometric$(N,K,n)$ | $N,K,n \in \mathbb{N}$, $K \leq N$, $n \leq N$ | $nK/N$ | $n\frac{K}{N}\frac{N-K}{N}\frac{N-n}{N-1}$ | Successes when sampling without replacement |
| Uniform$[a,b]$ | $a < b$ | $(a+b)/2$ | $(b-a)^2/12$ | Complete ignorance over an interval |
| Exponential$(\lambda)$ | $\lambda > 0$ | $1/\lambda$ | $1/\lambda^2$ | Waiting time between events |
| Normal$(\mu,\sigma^2)$ | $\mu \in \mathbb{R}$, $\sigma > 0$ | $\mu$ | $\sigma^2$ | Sums of many small contributions |
| Beta$(\alpha,\beta)$ | $\alpha,\beta > 0$ | $\frac{\alpha}{\alpha+\beta}$ | $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ | Prior distribution over a probability |
| Gamma$(\alpha,\beta)$ | $\alpha,\beta > 0$ | $\alpha/\beta$ | $\alpha/\beta^2$ | Waiting time for $\alpha$ events |