Moment-Generating Functions - All a Distribution's Moments in One Package // Megha Bose

Consider two random variables, both with mean 0 and variance 1. One is the standard normal $\mathcal{N}(0,1)$. The other is a rescaled version of the difference of two exponentials - symmetric, mean 0, variance 1, but with heavier tails and a sharper peak.

Same mean. Same variance. Completely different shapes. If I only told you the first two moments, you could not reconstruct the distribution. So here is a natural question: is there a single function - some compact encoding - that captures an entire distribution, not just its first two moments?

Yes. It’s called the moment-generating function, and it turns out to be one of the most powerful tools in probability.

What Is the MGF?

The moment-generating function of a random variable $X$ is:

$$M_X(t) = E\left[e^{tX}\right].$$

For a discrete variable: $M_X(t) = \sum_x e^{tx} p(x)$. For a continuous variable: $M_X(t) = \int_{-\infty}^\infty e^{tx} f(x) dx$.

The function is defined for those values of $t$ where this expectation is finite. When the MGF exists in some open interval around $t = 0$ - say, $(-h, h)$ - all moments of $X$ are finite and every result below applies.
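As a minimal sketch of the discrete formula, the helper below evaluates $M_X(t) = \sum_x e^{tx} p(x)$ for a finite pmf. The name `mgf_discrete` and the fair-die example are illustrative assumptions, not part of the text above.

```python
import math

def mgf_discrete(pmf, t):
    """Evaluate M_X(t) = sum_x exp(t*x) * p(x) for a finite pmf {value: probability}."""
    return sum(math.exp(t * x) * p for x, p in pmf.items())

# Illustrative example (an assumption of this sketch): a fair six-sided die.
die = {k: 1 / 6 for k in range(1, 7)}

print(mgf_discrete(die, 0.0))  # M_X(0) = E[e^0] = 1 for every distribution, up to rounding
print(mgf_discrete(die, 0.5))
```

Because the pmf has finitely many values, the sum is finite for every real $t$, so this MGF exists everywhere.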

But first: why $e^{tX}$? This seems like an odd thing to take the expectation of. The reason becomes clear when you expand the exponential.

Why the Exponential? The Moment-Generating Trick

Recall the Taylor series: $e^u = 1 + u + u^2/2! + u^3/3! + \cdots$

Substitute $u = tX$:

$$e^{tX} = 1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \cdots$$

Now take the expectation of both sides (swapping $E$ and the sum, which is justified when the MGF is finite on an open interval around $t = 0$):

$$M_X(t) = E[e^{tX}] = 1 + t E[X] + \frac{t^2}{2!} E[X^2] + \frac{t^3}{3!} E[X^3] + \cdots = \sum_{k=0}^\infty \frac{E[X^k]}{k!} t^k.$$

This is a power series in $t$, and its coefficients are exactly the moments $E[X^k]$. The MGF is a generating function for all the moments - hence the name.

Extracting the $k$-th moment is now easy: differentiate $k$ times and evaluate at $t = 0$. Differentiation removes every term of order below $k$, and setting $t = 0$ removes every term of order above $k$, leaving exactly $E[X^k]$:

$$M_X'(0) = E[X], \qquad M_X''(0) = E[X^2], \qquad M_X^{(k)}(0) = E[X^k].$$

So the variance, which requires $E[X^2]$ and $E[X]$, is $\text{Var}(X) = M_X''(0) - (M_X'(0))^2$.
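The extraction rule can be checked numerically. The sketch below approximates $M_X'(0)$ and $M_X''(0)$ with central finite differences for a fair six-sided die (an illustrative choice; the function name `mgf_die` and the step size `h` are assumptions of this example).

```python
import math

def mgf_die(t):
    """MGF of a fair six-sided die: (1/6) * sum_{k=1}^{6} exp(t*k)."""
    return sum(math.exp(t * k) for k in range(1, 7)) / 6

h = 1e-4  # finite-difference step, an arbitrary choice for this sketch

# Central differences approximate the first and second derivatives at t = 0.
first = (mgf_die(h) - mgf_die(-h)) / (2 * h)
second = (mgf_die(h) - 2 * mgf_die(0.0) + mgf_die(-h)) / h**2

mean = first                     # approximates E[X]
variance = second - first**2     # approximates E[X^2] - (E[X])^2

print(mean, variance)  # close to 3.5 and 35/12
```

Finite differences are only an approximation; when a closed form for the MGF is available, differentiating it by hand (as in the worked examples) gives exact moments.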

Worked Example: Bernoulli$(p)$

Let $X \sim \text{Bernoulli}(p)$. Then $X$ takes value 1 with probability $p$ and 0 with probability $1-p$. The MGF is:

$$M_X(t) = E[e^{tX}] = e^{t \cdot 0}(1-p) + e^{t \cdot 1}\cdot p = (1-p) + pe^t.$$

Let’s verify the moments. Differentiate:

$$M_X'(t) = pe^t, \quad \text{so } M_X'(0) = p = E[X].$$

$$M_X''(t) = pe^t, \quad \text{so } M_X''(0) = p = E[X^2].$$

For a Bernoulli variable, $X^2 = X$ (since $0^2 = 0$ and $1^2 = 1$), so $E[X^2] = E[X] = p$. And indeed the variance is $p - p^2 = p(1-p)$.

Worked Example: Normal$(0, 1)$

Let $X \sim \mathcal{N}(0, 1)$. The MGF is:

$$M_X(t) = \int_{-\infty}^\infty e^{tx} \cdot \frac{1}{\sqrt{2\pi}} e^{-x^2/2} dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{tx - x^2/2} dx.$$

Complete the square in the exponent: $tx - x^2/2 = -(x-t)^2/2 + t^2/2$. So:

$$M_X(t) = e^{t^2/2} \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{-(x-t)^2/2} dx = e^{t^2/2}.$$

The remaining factor, $\frac{1}{\sqrt{2\pi}}$ times the integral, equals 1 because it is the integral of the $\mathcal{N}(t, 1)$ density. The result: $M_X(t) = e^{t^2/2}$.
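As a sanity check on the closed form, the sketch below approximates the defining integral with a plain trapezoid rule and compares it with $e^{t^2/2}$. The integration range and step count are assumptions chosen for this illustration.

```python
import math

def normal_mgf_numeric(t, lo=-12.0, hi=12.0, steps=24_000):
    """Approximate E[exp(t*X)] for X ~ N(0,1) with the trapezoid rule on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        weight = 0.5 if i in (0, steps) else 1.0  # half weight at the endpoints
        total += weight * math.exp(t * x - x * x / 2)
    return total * dx / math.sqrt(2 * math.pi)

for t in (0.0, 0.5, 1.0, 2.0):
    # numerical integral next to the closed form exp(t^2 / 2)
    print(t, normal_mgf_numeric(t), math.exp(t * t / 2))
```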

Let’s extract the moments:

$$M_X'(t) = t e^{t^2/2}, \quad M_X'(0) = 0 = E[X].$$

$$M_X''(t) = (1 + t^2)e^{t^2/2}, \quad M_X''(0) = 1 = E[X^2].$$

So the variance is $E[X^2] - (E[X])^2 = 1 - 0 = 1$. Correct.

More generally, $\mathcal{N}(\mu, \sigma^2)$ has MGF $M_X(t) = e^{\mu t + \sigma^2 t^2/2}$.
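The general formula follows from the standard case in one line. Write $X = \mu + \sigma Z$ with $Z \sim \mathcal{N}(0, 1)$ (the symbol $Z$ is introduced here only for this derivation):

$$M_X(t) = E\left[e^{t(\mu + \sigma Z)}\right] = e^{\mu t} E\left[e^{(\sigma t) Z}\right] = e^{\mu t} M_Z(\sigma t) = e^{\mu t + \sigma^2 t^2/2}.$$

The same argument shows that for any constants $a$ and $b$, $M_{aX + b}(t) = e^{bt} M_X(at)$.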

The Uniqueness Theorem

Here is the most important property of MGFs:

If $M_X(t) = M_Y(t)$ for all $t$ in some open interval around 0, then $X$ and $Y$ have the same distribution.

The MGF uniquely determines the distribution. If two random variables have the same MGF in a neighborhood of 0, they have the same distribution - no other distribution shares that MGF.

This is a powerful identification tool. To show that some random variable $Z$ is normally distributed, you compute $M_Z(t)$ and check that it equals $e^{\mu t + \sigma^2 t^2/2}$. If it does, $Z$ is Normal. No other argument needed.

Independence and Sums: The Key Algebraic Fact

Here is where MGFs become computationally indispensable. Suppose $X$ and $Y$ are independent. Then:

$$M_{X+Y}(t) = M_X(t) \cdot M_Y(t).$$

The MGF of a sum of independent variables is the product of the individual MGFs.

Proof: $M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}]$. Since $X$ and $Y$ are independent, $e^{tX}$ and $e^{tY}$ are independent, so the expectation of their product is the product of expectations: $E[e^{tX}] E[e^{tY}] = M_X(t) M_Y(t)$.

By induction, for $n$ independent variables:

$$M_{X_1 + \cdots + X_n}(t) = \prod_{i=1}^n M_{X_i}(t).$$
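The product rule can be verified directly on a small discrete example. The sketch below builds the pmf of $X + Y$ by brute-force convolution and compares its MGF with the product of the individual MGFs. The helper names `mgf` and `convolve`, and the die and biased-coin distributions, are illustrative assumptions.

```python
import math
from collections import defaultdict

def mgf(pmf, t):
    """MGF of a finite pmf {value: probability}."""
    return sum(math.exp(t * x) * p for x, p in pmf.items())

def convolve(pmf_x, pmf_y):
    """pmf of X + Y for independent X and Y, summing over all pairs of values."""
    out = defaultdict(float)
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            out[x + y] += px * py  # independence: P(X=x, Y=y) = P(X=x) * P(Y=y)
    return dict(out)

die = {k: 1 / 6 for k in range(1, 7)}  # fair die
coin = {0: 0.3, 1: 0.7}                # biased coin

pmf_sum = convolve(die, coin)
t = 0.4
lhs = mgf(pmf_sum, t)              # MGF computed from the convolved pmf
rhs = mgf(die, t) * mgf(coin, t)   # product of the individual MGFs

print(lhs, rhs)  # the two agree up to floating-point rounding
```

The convolution loop touches every pair of values; the right-hand side needs only two single sums, which is the computational saving the product rule describes.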

This converts convolution (the complicated integral formula for the density of a sum) into multiplication. It’s why MGFs make CLT proofs tractable.

Example: The sum of $n$ independent Bernoulli$(p)$ variables is Binomial$(n, p)$. The MGF of the sum is $(1-p+pe^t)^n$ - just the Bernoulli MGF raised to the $n$-th power. And you can verify directly that this is the MGF of a Binomial$(n,p)$.
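The direct verification is an application of the binomial theorem. Starting from the Binomial$(n, p)$ pmf:

$$\sum_{k=0}^n e^{tk} \binom{n}{k} p^k (1-p)^{n-k} = \sum_{k=0}^n \binom{n}{k} (pe^t)^k (1-p)^{n-k} = (1 - p + pe^t)^n.$$

Both routes give the same function, and by uniqueness the sum of the Bernoulli variables is therefore Binomial$(n, p)$.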

The Cumulant Generating Function

Take the logarithm of the MGF. This gives the cumulant generating function (CGF):

$$K_X(t) = \log M_X(t).$$

The reason to do this: the CGF of a sum of independents is the sum of the CGFs. While MGFs multiply, log-MGFs add. Since $\log(M_X \cdot M_Y) = \log M_X + \log M_Y$:

$$K_{X+Y}(t) = K_X(t) + K_Y(t) \quad \text{(for independent } X, Y\text{)}.$$

This is cleaner than the MGF rule, because addition is easier to work with than multiplication.

The coefficients of the CGF’s Taylor series are called cumulants:

$$K_X(t) = \sum_{k=1}^\infty \frac{\kappa_k}{k!} t^k.$$

The first two cumulants are the most familiar:

$$\kappa_1 = K_X'(0) = \text{mean}, \qquad \kappa_2 = K_X''(0) = \text{variance}.$$
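These two identities follow from the chain and quotient rules applied to $K_X = \log M_X$:

$$K_X'(t) = \frac{M_X'(t)}{M_X(t)}, \qquad K_X''(t) = \frac{M_X''(t) M_X(t) - (M_X'(t))^2}{M_X(t)^2}.$$

Setting $t = 0$ and using $M_X(0) = 1$, $M_X'(0) = E[X]$, $M_X''(0) = E[X^2]$:

$$K_X'(0) = E[X], \qquad K_X''(0) = E[X^2] - (E[X])^2 = \text{Var}(X).$$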

For a sum of $n$ independent, identically distributed variables, all cumulants scale by $n$. The mean of the sum is $n$ times the individual mean; the variance of the sum is $n$ times the individual variance - both facts you already knew, but now they’re consequences of a single principle.
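A small numerical sketch of this scaling, using Bernoulli$(p)$ summands: the CGF of the sum is $n$ times the Bernoulli CGF, and its first two derivatives at 0 should match the Binomial mean $np$ and variance $np(1-p)$. The function names, the parameter values, and the finite-difference step are assumptions of this example.

```python
import math

def cgf_bernoulli(p, t):
    """K(t) = log(1 - p + p * e^t), the CGF of a Bernoulli(p) variable."""
    return math.log(1 - p + p * math.exp(t))

def cgf_sum(n, p, t):
    """CGF of a sum of n i.i.d. Bernoulli(p) variables: CGFs add, so multiply by n."""
    return n * cgf_bernoulli(p, t)

n, p, h = 20, 0.3, 1e-4

# Central differences approximate the first two cumulants of the sum.
k1 = (cgf_sum(n, p, h) - cgf_sum(n, p, -h)) / (2 * h)
k2 = (cgf_sum(n, p, h) - 2 * cgf_sum(n, p, 0.0) + cgf_sum(n, p, -h)) / h**2

print(k1, n * p)            # first cumulant next to n * (Bernoulli mean)
print(k2, n * p * (1 - p))  # second cumulant next to n * (Bernoulli variance)
```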

For the Normal$(\mu, \sigma^2)$: $K_X(t) = \mu t + \sigma^2 t^2/2$, so all cumulants of order 3 and higher are zero. This characterizes the Normal: by a theorem of Marcinkiewicz, it is the only distribution with finitely many non-zero cumulants (counting a constant as the degenerate case $\sigma^2 = 0$).

Discomfort check. What if the MGF doesn’t exist?

The MGF $M_X(t) = E[e^{tX}]$ requires that $e^{tX}$ has a finite expectation. For heavy-tailed distributions, this can fail - the exponential grows faster than the tails fall. The classic example is the Cauchy distribution, whose tails are so heavy that even its mean is undefined, let alone any higher moment. Its MGF doesn’t exist for any $t \neq 0$.

In these cases, we use the characteristic function instead:

$$\varphi_X(t) = E[e^{itX}]$$

where $i = \sqrt{-1}$. Since $|e^{itX}| = 1$ for all real $t$ and $X$, this expectation is always finite - no matter how heavy the tails. The characteristic function always exists and still uniquely determines the distribution.
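To see this on the Cauchy example, the sketch below numerically approximates the characteristic function of a standard Cauchy variable and compares it with its known closed form $e^{-|t|}$. The truncation limit and step count are assumptions of this illustration, and the truncated integral is only an approximation.

```python
import math

def cauchy_cf_numeric(t, limit=1000.0, steps=200_000):
    """Approximate E[cos(tX)] for a standard Cauchy X by the trapezoid rule on [-limit, limit].

    The imaginary part E[sin(tX)] vanishes by symmetry of the density,
    so the real part computed here is the whole characteristic function.
    """
    dx = 2 * limit / steps
    total = 0.0
    for i in range(steps + 1):
        x = -limit + i * dx
        weight = 0.5 if i in (0, steps) else 1.0  # half weight at the endpoints
        total += weight * math.cos(t * x) / (math.pi * (1 + x * x))
    return total * dx

for t in (0.5, 1.0, 2.0):
    # numerical value next to the closed form exp(-|t|)
    print(t, cauchy_cf_numeric(t), math.exp(-abs(t)))
```

Replacing `math.cos(t * x)` with `math.exp(t * x)` would give the truncated MGF integrand instead, and that integral grows without bound as `limit` increases.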

When $X$ has a density, the characteristic function is the Fourier transform of that density (up to sign convention). It satisfies the same multiplication rule for sums of independents. The CLT proof works just as cleanly with characteristic functions as with MGFs (in fact, the standard rigorous proof uses characteristic functions precisely because they always exist).

The cost: you need complex analysis. The benefit: universality.

Sketch of the CLT via MGFs

Here is the central payoff. Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu = 0$ and variance $\sigma^2 = 1$. Define the normalized sum:

$$Z_n = \frac{X_1 + \cdots + X_n}{\sqrt{n}}.$$

The MGF of $Z_n$ is:

$$M_{Z_n}(t) = \left[M_X\left(\frac{t}{\sqrt{n}}\right)\right]^n.$$

Taylor-expand $M_X$ around 0 using $M_X(0) = 1$, $M_X'(0) = E[X] = 0$, $M_X''(0) = E[X^2] = 1$:

$$M_X\left(\frac{t}{\sqrt{n}}\right) = 1 + 0 \cdot \frac{t}{\sqrt{n}} + \frac{1}{2}\left(\frac{t}{\sqrt{n}}\right)^2 + O(n^{-3/2}) = 1 + \frac{t^2}{2n} + O(n^{-3/2}).$$

Therefore:

$$M_{Z_n}(t) = \left(1 + \frac{t^2}{2n} + O(n^{-3/2})\right)^n \xrightarrow{n \to \infty} e^{t^2/2}.$$

(This uses the standard limit $(1 + x_n/n)^n \to e^x$ whenever $x_n \to x$; here $x_n = t^2/2 + O(n^{-1/2})$.)

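The limit $e^{t^2/2}$ is exactly the MGF of $\mathcal{N}(0,1)$. The uniqueness theorem identifies the limit, and the continuity theorem for MGFs (pointwise convergence of MGFs on an interval around 0 implies convergence in distribution) completes the step: $Z_n$ converges in distribution to $\mathcal{N}(0,1)$.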
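The convergence can be watched numerically for one concrete choice of $X_i$. The sketch below takes $X_i = \pm 1$ with probability $1/2$ each, which has mean 0, variance 1, and MGF $\cosh(t)$. This distribution and the function names are assumptions chosen for the illustration.

```python
import math

def mgf_rademacher(t):
    """MGF of X = +1 or -1 with probability 1/2 each: (e^t + e^-t) / 2 = cosh(t)."""
    return math.cosh(t)

def mgf_normalized_sum(t, n):
    """MGF of Z_n = (X_1 + ... + X_n) / sqrt(n) for i.i.d. X_i as above."""
    return mgf_rademacher(t / math.sqrt(n)) ** n

t = 1.0
target = math.exp(t * t / 2)  # MGF of N(0, 1) at t

for n in (1, 10, 100, 10_000):
    # M_{Z_n}(t) next to the limiting value exp(t^2 / 2)
    print(n, mgf_normalized_sum(t, n), target)
```

The gap shrinks as $n$ grows, in line with the Taylor-expansion argument; the check covers a single value of $t$ and a single distribution, so it illustrates the limit rather than proving it.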

The MGF converts a difficult question about the limiting distribution of a sum into an algebraic calculation. This calculation works for any distribution of the $X_i$ whose MGF exists near 0 - all that matters is that the mean is 0 and the variance is 1, because those are the only moments that appear in the leading terms of the Taylor expansion.

Summary

Concept	Formula
Definition	$M_X(t) = E[e^{tX}]$
Moment extraction	$M_X^{(k)}(0) = E[X^k]$
Bernoulli$(p)$ MGF	$M(t) = 1-p+pe^t$
Normal$(0,1)$ MGF	$M(t) = e^{t^2/2}$
Normal$(\mu,\sigma^2)$ MGF	$M(t) = e^{\mu t + \sigma^2 t^2/2}$
Uniqueness	$M_X = M_Y$ on interval $\Rightarrow$ $X \overset{d}{=} Y$
Independence rule	$M_{X+Y}(t) = M_X(t) M_Y(t)$
CGF	$K_X(t) = \log M_X(t)$; first cumulant = mean, second = variance
CGF of sum	$K_{X+Y}(t) = K_X(t) + K_Y(t)$
Fallback for heavy tails	Characteristic function $\varphi_X(t) = E[e^{itX}]$ (always exists)

The MGF is a single function that encodes the entire distribution. It makes computing moments mechanical, reduces the distribution of sums of independent variables to multiplication, and powers the clean proof of the Central Limit Theorem.
