A probability distribution specifies the likelihood of each possible outcome of a random variable. This post develops the major discrete and continuous families rigorously, including derivations of their moments and key structural properties.

Discrete Distributions

A discrete random variable $X$ takes values in a countable set $\mathcal{X}$. Its distribution is fully described by a probability mass function (PMF) $p: \mathcal{X} \to [0,1]$ satisfying $\sum_{x \in \mathcal{X}} p(x) = 1$.

Bernoulli Distribution

The simplest non-trivial distribution models a single trial with success probability $p \in [0,1]$.

Definition. $X \sim \text{Bernoulli}(p)$ if $P(X=1) = p$ and $P(X=0) = 1-p$.

Compactly, $P(X=k) = p^k(1-p)^{1-k}$ for $k \in \{0,1\}$.

Moments. $E[X] = p$, $\text{Var}(X) = p(1-p)$.

Binomial Distribution

The Binomial counts the number of successes in $n$ independent Bernoulli$(p)$ trials.

Definition. $X \sim \text{Binomial}(n, p)$ has PMF

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n.$$

This follows from choosing which $k$ of the $n$ trials succeed (the factor $\binom{n}{k}$) and multiplying the independent probabilities.

Moments. Writing $X = \sum_{i=1}^n X_i$ where $X_i \sim \text{Bernoulli}(p)$ are i.i.d., linearity of expectation gives

$$E[X] = np, \qquad \text{Var}(X) = np(1-p).$$
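
As a quick sanity check, these closed-form moments can be compared against scipy.stats (an assumed dependency; the post itself does not prescribe a library), with hypothetical parameter values:

```python
# Sanity check of the Binomial moment formulas against scipy.stats.
from scipy.stats import binom

n, p = 20, 0.3          # hypothetical example parameters
dist = binom(n, p)

print(dist.mean(), n * p)            # 6.0 6.0
print(dist.var(), n * p * (1 - p))   # 4.2 4.2
```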

Geometric Distribution

The Geometric distribution models the number of trials until the first success.

Definition. $X \sim \text{Geometric}(p)$ has PMF $P(X = k) = (1-p)^{k-1}p$ for $k = 1, 2, \ldots$

Moments. $E[X] = 1/p$, $\text{Var}(X) = (1-p)/p^2$.

Memoryless Property. For integers $m, n \geq 1$,

$$P(X > m + n \mid X > m) = P(X > n).$$

Proof. Since $P(X > k) = (1-p)^k$,

$$P(X > m+n \mid X > m) = \frac{P(X > m+n)}{P(X > m)} = \frac{(1-p)^{m+n}}{(1-p)^m} = (1-p)^n = P(X > n). \qquad \square$$

The Geometric is the unique memoryless distribution supported on the positive integers.
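
The memoryless property is also easy to see empirically. Below is a minimal Monte Carlo sketch using numpy (an assumed dependency), with hypothetical parameter choices:

```python
# Monte Carlo illustration of the Geometric memoryless property:
# the conditional tail P(X > m+n | X > m) should match P(X > n).
import numpy as np

rng = np.random.default_rng(0)
p, m, n = 0.2, 3, 5                 # hypothetical parameters

# numpy's geometric counts trials up to and including the first success
# (support 1, 2, ...), matching the convention used in this post.
samples = rng.geometric(p, size=1_000_000)

cond = (samples[samples > m] > m + n).mean()   # estimates P(X > m+n | X > m)
uncond = (samples > n).mean()                  # estimates P(X > n)
print(cond, uncond, (1 - p) ** n)              # all close to 0.8^5 = 0.32768
```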

Poisson Distribution

The Poisson distribution arises naturally as the limit of Binomial$(n, \lambda/n)$ as $n \to \infty$.

Definition. $X \sim \text{Poisson}(\lambda)$ for $\lambda > 0$ has PMF

$$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$$

Derivation as Binomial limit. Let $X_n \sim \text{Binomial}(n, \lambda/n)$. Then

$$P(X_n = k) = \binom{n}{k}\left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k}.$$

As $n \to \infty$: $\binom{n}{k}/n^k \to 1/k!$, $\left(1 - \lambda/n\right)^n \to e^{-\lambda}$, and $\left(1 - \lambda/n\right)^{-k} \to 1$, so the product converges to the Poisson PMF.

Moments. $E[X] = \lambda$, $\text{Var}(X) = \lambda$. The mean equals the variance, a hallmark of the Poisson.
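
The Binomial-to-Poisson convergence can be watched numerically; here is a short sketch using scipy.stats, with hypothetical values of $\lambda$ and $k$:

```python
# Binomial(n, lam/n) PMF converging to the Poisson(lam) PMF as n grows.
from scipy.stats import binom, poisson

lam, k = 4.0, 2                          # hypothetical parameters
for n in (10, 100, 1_000, 10_000):
    print(n, binom.pmf(k, n, lam / n))
print("limit", poisson.pmf(k, lam))      # e^{-4} * 4^2 / 2! ≈ 0.1465
```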

Negative Binomial Distribution

$X \sim \text{NegBin}(r, p)$ counts the number of independent Bernoulli$(p)$ trials needed to obtain $r$ successes.

$$P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}, \quad k = r, r+1, \ldots$$

Moments. $E[X] = r/p$, $\text{Var}(X) = r(1-p)/p^2$.

When $r = 1$ this reduces to the Geometric distribution.
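
One practical caveat when checking these formulas numerically: scipy.stats.nbinom uses the "number of failures before the $r$-th success" convention, so a shift by $r$ is needed to match the trial-counting convention above. A small sketch with hypothetical parameters:

```python
# scipy.stats.nbinom counts failures before the r-th success, not trials;
# adding r converts its mean to this post's trial-counting convention.
from scipy.stats import nbinom

r, p = 3, 0.4                               # hypothetical parameters
failures = nbinom(r, p)

print(failures.mean() + r, r / p)           # 7.5 7.5     (E[X] = r/p)
print(failures.var(), r * (1 - p) / p**2)   # 11.25 11.25 (shift leaves variance unchanged)
```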

Continuous Distributions

A continuous random variable $X$ has a probability density function (PDF) $f: \mathbb{R} \to [0,\infty)$ satisfying $\int_{-\infty}^{\infty} f(x)\,dx = 1$ and $P(a \leq X \leq b) = \int_a^b f(x)\,dx$.

Uniform Distribution

Definition. $X \sim \text{Uniform}(a, b)$ has PDF

$$f(x) = \frac{1}{b-a}, \quad a \leq x \leq b,$$

and $f(x) = 0$ otherwise.

Moments. $E[X] = (a+b)/2$, $\text{Var}(X) = (b-a)^2/12$.

Exponential Distribution

Definition. $X \sim \text{Exp}(\lambda)$ has PDF $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$.

Moments. $E[X] = 1/\lambda$, $\text{Var}(X) = 1/\lambda^2$.

Memoryless Property. For $s, t \geq 0$, $P(X > s + t \mid X > s) = P(X > t)$.

Proof. $P(X > x) = e^{-\lambda x}$, so

$$P(X > s+t \mid X > s) = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t). \qquad \square$$

The Exponential is the unique continuous memoryless distribution, making it the natural model for waiting times between events in a Poisson process.
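
Since the proof runs entirely through the survival function $P(X > x)$, it translates directly into a one-line numerical check; the sketch below uses scipy.stats with hypothetical values of $\lambda$, $s$, and $t$:

```python
# Numerical check of the Exponential memoryless identity via survival functions.
from scipy.stats import expon

lam, s, t = 0.5, 2.0, 3.0    # hypothetical parameters
X = expon(scale=1 / lam)     # scipy parameterizes Exp(lambda) by scale = 1/lambda

lhs = X.sf(s + t) / X.sf(s)  # P(X > s+t | X > s)
rhs = X.sf(t)                # P(X > t)
print(lhs, rhs)              # both equal e^{-1.5} ≈ 0.2231
```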

Normal (Gaussian) Distribution

Definition. $X \sim \mathcal{N}(\mu, \sigma^2)$ has PDF

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}.$$

Moments. $E[X] = \mu$, $\text{Var}(X) = \sigma^2$.

68-95-99.7 Rule. For $X \sim \mathcal{N}(\mu, \sigma^2)$:

  • $P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 0.6827$
  • $P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 0.9545$
  • $P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 0.9973$

These follow from the standard normal CDF $\Phi$. The Normal is closed under linear transformations: if $X \sim \mathcal{N}(\mu, \sigma^2)$ and $a \neq 0$, then $aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$.
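
The three probabilities in the rule drop out of $\Phi$ directly; a minimal check with scipy.stats:

```python
# Recovering the 68-95-99.7 rule from the standard normal CDF.
from scipy.stats import norm

for k in (1, 2, 3):
    # P(mu - k*sigma <= X <= mu + k*sigma) = Phi(k) - Phi(-k) for any mu, sigma
    print(k, norm.cdf(k) - norm.cdf(-k))
# 1 0.6826894921370859
# 2 0.9544997361036416
# 3 0.9973002039367398
```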

Gamma Distribution

Definition. $X \sim \text{Gamma}(\alpha, \beta)$ with shape $\alpha > 0$ and rate $\beta > 0$ has PDF

$$f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}, \quad x > 0,$$

where $\Gamma(\alpha) = \int_0^\infty t^{\alpha-1} e^{-t}\,dt$ is the Gamma function.

Moments. $E[X] = \alpha/\beta$, $\text{Var}(X) = \alpha/\beta^2$.

Special cases: $\text{Gamma}(1, \beta) = \text{Exp}(\beta)$; the sum of $k$ i.i.d. $\text{Exp}(\lambda)$ variables is $\text{Gamma}(k, \lambda)$.
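
The Exponential-sum characterization is straightforward to test by simulation; here is a sketch using numpy and a Kolmogorov-Smirnov test from scipy (both assumed dependencies, with hypothetical parameters):

```python
# Monte Carlo check that a sum of k i.i.d. Exp(lam) draws is Gamma(k, lam).
import numpy as np
from scipy.stats import gamma, kstest

rng = np.random.default_rng(1)
k, lam = 5, 2.0                        # hypothetical parameters

# numpy's exponential takes scale = 1/lambda; sum k draws per sample.
sums = rng.exponential(scale=1 / lam, size=(100_000, k)).sum(axis=1)

# A large KS p-value is consistent with the claimed Gamma(k, lam) law.
print(kstest(sums, gamma(a=k, scale=1 / lam).cdf))
```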

Beta Distribution

Definition. $X \sim \text{Beta}(\alpha, \beta)$ with $\alpha, \beta > 0$ has PDF

$$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \quad x \in (0,1),$$

where $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$.

Moments. $E[X] = \alpha/(\alpha+\beta)$, $\text{Var}(X) = \alpha\beta / [(\alpha+\beta)^2(\alpha+\beta+1)]$.

The Beta distribution is the conjugate prior for the Bernoulli/Binomial likelihood in Bayesian inference.
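
Conjugacy means a posterior update is just parameter arithmetic: a $\text{Beta}(\alpha, \beta)$ prior combined with $k$ successes in $n$ trials yields a $\text{Beta}(\alpha + k, \beta + n - k)$ posterior. A minimal sketch with hypothetical counts:

```python
# Beta-Binomial conjugate update: prior Beta(a, b) + data (k successes in
# n trials) -> posterior Beta(a + k, b + n - k).
from scipy.stats import beta

a, b = 2.0, 2.0    # hypothetical prior pseudo-counts
n, k = 10, 7       # hypothetical data

posterior = beta(a + k, b + n - k)
print(posterior.mean())    # (a + k) / (a + b + n) = 9/14 ≈ 0.643
```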

Summary Table

| Distribution | Parameters | Mean | Variance |
|---|---|---|---|
| Bernoulli$(p)$ | $p \in [0,1]$ | $p$ | $p(1-p)$ |
| Binomial$(n,p)$ | $n \in \mathbb{N}$, $p \in [0,1]$ | $np$ | $np(1-p)$ |
| Geometric$(p)$ | $p \in (0,1]$ | $1/p$ | $(1-p)/p^2$ |
| NegBin$(r,p)$ | $r \in \mathbb{N}$, $p \in (0,1]$ | $r/p$ | $r(1-p)/p^2$ |
| Poisson$(\lambda)$ | $\lambda > 0$ | $\lambda$ | $\lambda$ |
| Uniform$(a,b)$ | $a < b$ | $(a+b)/2$ | $(b-a)^2/12$ |
| Exponential$(\lambda)$ | $\lambda > 0$ | $1/\lambda$ | $1/\lambda^2$ |
| Normal$(\mu,\sigma^2)$ | $\mu \in \mathbb{R}$, $\sigma > 0$ | $\mu$ | $\sigma^2$ |
| Gamma$(\alpha,\beta)$ | $\alpha,\beta > 0$ | $\alpha/\beta$ | $\alpha/\beta^2$ |
| Beta$(\alpha,\beta)$ | $\alpha,\beta > 0$ | $\alpha/(\alpha+\beta)$ | $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ |

Examples

Poisson Processes and Arrival Modeling

A Poisson process with rate $\lambda$ models events that arrive independently at a constant average rate. If $N(t)$ denotes the number of arrivals in $[0, t]$, then $N(t) \sim \text{Poisson}(\lambda t)$. The inter-arrival times are i.i.d. $\text{Exp}(\lambda)$.

For example, if a server receives $\lambda = 100$ requests per second on average, the probability of exactly $k$ requests in a given second is $e^{-100} \cdot 100^k / k!$. The probability of zero requests is $e^{-100} \approx 3.7 \times 10^{-44}$, essentially impossible.
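
These figures are direct one-liners with scipy.stats (the tail query below is a hypothetical addition to the example):

```python
# Request-count probabilities for the lambda = 100 requests/second example.
from scipy.stats import poisson

N = poisson(100)

print(N.pmf(0))     # P(exactly 0 requests) = e^{-100} ≈ 3.7e-44
print(N.pmf(100))   # the most likely count, ≈ 0.040
print(N.sf(120))    # P(more than 120 requests), ≈ 0.02
```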

Normal Distribution in Machine Learning

Weight initialization in neural networks exploits the Normal distribution. For a layer with $n$ inputs, He initialization sets weights $W_{ij} \sim \mathcal{N}(0, 2/n)$. This choice is motivated by the variance of the pre-activation $z = \sum_{i=1}^n W_i x_i$: if the inputs $x_i$ are i.i.d. with mean 0 and variance 1, then

$$\text{Var}(z) = n \cdot \text{Var}(W_i) \cdot \text{Var}(x_i) = n \cdot \frac{2}{n} \cdot 1 = 2.$$

Maintaining unit-order variance through layers prevents vanishing and exploding gradients. The 68-95-99.7 rule tells us that with He initialization, roughly 99.7% of initial weights lie within $3\sqrt{2/n}$ of zero, a tight initialization regime especially important for deep networks.
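
The variance calculation above is easy to confirm empirically; the following numpy sketch (with a hypothetical layer width) draws He-initialized weights and unit-variance inputs and checks the pre-activation variance:

```python
# Empirical check of the He-initialization variance argument:
# with W ~ N(0, 2/n) and unit-variance inputs, Var(z) should be ~2.
import numpy as np

rng = np.random.default_rng(42)
n, trials = 512, 10_000                                  # hypothetical sizes

W = rng.normal(0.0, np.sqrt(2.0 / n), size=(trials, n))  # He-initialized weights
x = rng.normal(0.0, 1.0, size=(trials, n))               # unit-variance inputs
z = (W * x).sum(axis=1)                                  # pre-activations

print(z.var())   # close to 2, matching the derivation above
```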

