Parametric vs Non-Parametric Statistics

Parametric statistics assumes data come from a distribution belonging to a family indexed by a finite-dimensional parameter $\theta \in \Theta \subseteq \mathbb{R}^d$. For example, assuming normality means $\theta = (\mu, \sigma^2)$. Parametric methods are efficient when the model is correctly specified, but can be badly biased otherwise.

Non-parametric statistics makes minimal distributional assumptions. Estimands are functionals of an unknown distribution $F$ - for instance, the median, a quantile, or a density. Non-parametric methods sacrifice efficiency for robustness. The bootstrap (discussed below) is a cornerstone non-parametric technique.

Estimators: Bias, Variance, and MSE

Let $X_1, \ldots, X_n$ be i.i.d. with distribution depending on parameter $\theta$. A statistic is any measurable function of the data. An estimator $\hat{\theta} = T(X_1, \ldots, X_n)$ is a statistic used to estimate $\theta$.

Definition. The bias of an estimator is

$$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta.$$

An estimator with $\text{Bias}(\hat{\theta}) = 0$ for all $\theta$ is called unbiased.

Definition. The mean squared error decomposes as

$$\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta}).$$

Proof of decomposition. Let $b = \text{Bias}(\hat{\theta})$. Then

$$E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta} - E[\hat{\theta}] + b)^2] = E[(\hat{\theta} - E[\hat{\theta}])^2] + 2b \cdot E[\hat{\theta} - E[\hat{\theta}]] + b^2 = \text{Var}(\hat{\theta}) + b^2. \quad \square$$

Definition. An estimator $\hat{\theta}_n$ is consistent if $\hat{\theta}_n \xrightarrow{P} \theta$ as $n \to \infty$ for all $\theta$. A sufficient condition is $\text{MSE}(\hat{\theta}_n) \to 0$.
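
As a quick sanity check on the decomposition, the sketch below (plain NumPy; sample size, variance, and seed are illustrative) estimates the bias, variance, and MSE of the biased variance estimator $\frac{1}{n}\sum(X_i - \bar{X})^2$ by Monte Carlo and confirms that $\text{MSE} \approx \text{Bias}^2 + \text{Var}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, n_reps = 20, 4.0, 200_000

# Biased variance estimator (divides by n) applied to repeated Gaussian samples.
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_reps, n))
theta_hat = samples.var(axis=1, ddof=0)        # (1/n) * sum (X_i - Xbar)^2

bias = theta_hat.mean() - sigma2               # E[theta_hat] - theta  (~ -sigma2/n)
var = theta_hat.var()                          # Var(theta_hat)
mse = ((theta_hat - sigma2) ** 2).mean()       # E[(theta_hat - theta)^2]

print(f"bias^2 + var = {bias**2 + var:.4f}")   # should match the line below
print(f"mse          = {mse:.4f}")
```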

Sample Mean and Variance

The sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ satisfies $E[\bar{X}_n] = \mu$ (unbiased) and $\text{Var}(\bar{X}_n) = \sigma^2/n$, so $\text{MSE}(\bar{X}_n) = \sigma^2/n \to 0$ (consistent).

The sample variance $S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2$ is an unbiased estimator of $\sigma^2$. The MLE under a Gaussian model, $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X}_n)^2$, is biased: $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$.

Theorem. $E[S_n^2] = \sigma^2$.

Proof. Expand:

$$\sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=1}^n X_i^2 - n\bar{X}^2.$$

Taking expectations: $E[\sum X_i^2] = n(\sigma^2 + \mu^2)$ and $E[n\bar{X}^2] = n(\sigma^2/n + \mu^2) = \sigma^2 + n\mu^2$. Therefore $E[\sum(X_i - \bar{X})^2] = (n-1)\sigma^2$, so $E[S_n^2] = \sigma^2$. $\square$
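
A minimal Monte Carlo check of the theorem (NumPy; the settings are illustrative): the $1/(n-1)$ estimator averages to $\sigma^2$, while the $1/n$ version averages to $\frac{n-1}{n}\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, n_reps = 10, 9.0, 100_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(n_reps, n))
s2_unbiased = x.var(axis=1, ddof=1)   # divides by n-1
s2_mle      = x.var(axis=1, ddof=0)   # divides by n

print(s2_unbiased.mean())             # ~ 9.0
print(s2_mle.mean())                  # ~ 9.0 * (n-1)/n = 8.1
```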

Method of Moments

Given a model $p(x \mid \theta)$ with $\theta \in \mathbb{R}^d$, the method of moments estimator sets the first $d$ theoretical moments equal to the corresponding sample moments. Denote $\mu_k(\theta) = E_\theta[X^k]$ and $\hat{\mu}_k = \frac{1}{n}\sum X_i^k$. Solve

$$\mu_k(\hat{\theta}) = \hat{\mu}_k, \quad k = 1, \ldots, d.$$

Method of moments estimators are consistent (by the LLN) and asymptotically normal (by the CLT and delta method), though often less efficient than the MLE.
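
For concreteness, here is a small sketch (NumPy; the Gamma parameterization and values are illustrative) of the method of moments for $\text{Gamma}(\alpha, \beta)$ with mean $\alpha/\beta$ and variance $\alpha/\beta^2$: matching the first two moments gives $\hat{\alpha} = \hat{\mu}_1^2 / (\hat{\mu}_2 - \hat{\mu}_1^2)$ and $\hat{\beta} = \hat{\mu}_1 / (\hat{\mu}_2 - \hat{\mu}_1^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha_true, beta_true = 3.0, 2.0           # shape, rate
x = rng.gamma(shape=alpha_true, scale=1.0 / beta_true, size=5_000)

# Sample moments.
m1 = x.mean()                              # ~ alpha / beta
m2 = (x ** 2).mean()                       # second raw moment
var_hat = m2 - m1 ** 2                     # ~ alpha / beta^2

# Solve the two moment equations for (alpha, beta).
alpha_mom = m1 ** 2 / var_hat
beta_mom = m1 / var_hat
print(alpha_mom, beta_mom)                 # ~ (3.0, 2.0)
```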

Sufficient Statistics and the Neyman-Fisher Factorization

Definition. A statistic $T = T(X_1, \ldots, X_n)$ is sufficient for $\theta$ if the conditional distribution of $(X_1, \ldots, X_n)$ given $T$ does not depend on $\theta$.

Sufficient statistics capture all information about $\theta$ in the data. The following theorem gives a practical criterion.

Theorem (Neyman-Fisher Factorization). $T$ is sufficient for $\theta$ if and only if the joint density factors as

$$p(x_1, \ldots, x_n \mid \theta) = g(T(x), \theta) \cdot h(x_1, \ldots, x_n)$$

for non-negative functions $g$ and $h$, where $h$ does not depend on $\theta$.

Example. For $X_i \sim \text{Bernoulli}(p)$, the joint pmf is $p^{\sum x_i}(1-p)^{n - \sum x_i}$, which factors with $T = \sum x_i$ and $h \equiv 1$. So $T = \sum X_i$ (equivalently $\bar{X}$) is sufficient for $p$.

The Rao-Blackwell theorem states that for any estimator $\hat{\theta}$ and sufficient statistic $T$, the conditional expectation $\tilde{\theta} = E[\hat{\theta} | T]$ is at least as good as $\hat{\theta}$ in MSE. This motivates always conditioning on sufficient statistics.
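
A small simulation (NumPy, illustrative settings) of Rao-Blackwell for Bernoulli data: start from the crude unbiased estimator $\hat{\theta} = X_1$ and condition on the sufficient statistic $T = \sum_i X_i$, which gives $E[X_1 \mid T] = T/n = \bar{X}$. Both estimators are unbiased, but the conditioned one has far smaller variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, n_reps = 30, 0.3, 100_000

x = rng.binomial(1, p, size=(n_reps, n))
crude = x[:, 0].astype(float)      # theta_hat = X_1: unbiased but noisy
rb = x.mean(axis=1)                # E[X_1 | sum X_i] = Xbar (Rao-Blackwellized)

print(crude.mean(), rb.mean())     # both ~ 0.3 (unbiased)
print(crude.var(), rb.var())       # ~ p(1-p) = 0.21 vs p(1-p)/n = 0.007
```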

Exponential Families

Many common distributions belong to an exponential family of the form

$$p(x \mid \theta) = h(x) \exp\bigl(\eta(\theta)^\top T(x) - A(\theta)\bigr),$$

where $\eta(\theta)$ are the natural parameters, $T(x)$ is the sufficient statistic, $A(\theta)$ is the log-partition function, and $h(x)$ is the base measure. Key properties:

  • $T(x)$ is a sufficient statistic by the factorization theorem.
  • Moments of $T$ are derivatives of $A$ with respect to the natural parameter: $E[T(X)] = \nabla A(\eta)$ and $\text{Var}(T(X)) = \nabla^2 A(\eta)$.
  • The MLE satisfies $E_{\hat{\theta}}[T(X)] = \bar{T} = \frac{1}{n}\sum T(x_i)$, a moment-matching condition.

Gaussian. $p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. Here $T(x) = (x, x^2)^\top$, $\eta = (\mu/\sigma^2, -1/(2\sigma^2))^\top$.

Binomial. $\binom{n}{x}p^x(1-p)^{n-x} = \binom{n}{x}\exp\bigl(x \log\frac{p}{1-p} - n\log\frac{1}{1-p}\bigr)$. Here $T(x) = x$ and $\eta = \log\frac{p}{1-p}$ (the log-odds).
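
To illustrate the derivative identity on the binomial family: in the natural parameterization, $A(\eta) = n\log(1 + e^\eta)$, so $A'(\eta) = n\,e^\eta/(1+e^\eta) = np = E[X]$. The sketch below (NumPy; the values of $n$ and $p$ are illustrative) checks this numerically against a sample mean of $T(X) = X$.

```python
import numpy as np

rng = np.random.default_rng(4)
n_trials, p = 10, 0.35
eta = np.log(p / (1 - p))                              # natural parameter (log-odds)

# Derivative of the log-partition function A(eta) = n * log(1 + e^eta).
A_prime = n_trials * np.exp(eta) / (1 + np.exp(eta))   # = n * p

x = rng.binomial(n_trials, p, size=200_000)            # samples of T(X) = X
print(A_prime)                                         # 3.5
print(x.mean())                                        # ~ 3.5
```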

Bootstrap

The bootstrap is a non-parametric resampling technique for quantifying estimator uncertainty.

Non-parametric bootstrap: Draw $B$ bootstrap samples $x^{(1)}, \ldots, x^{(B)}$ by sampling $n$ observations with replacement from the data. Compute $\hat{\theta}^{(b)} = T(x^{(b)})$ for each. The empirical distribution of $\{\hat{\theta}^{(b)}\}$ approximates the sampling distribution of $\hat{\theta}$. Bootstrap confidence intervals: for a 95% percentile interval, use the 2.5th and 97.5th percentiles of $\{\hat{\theta}^{(b)}\}$.
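
A minimal sketch of this procedure (NumPy; the data-generating process and the statistic, here the median, are illustrative) producing a 95% percentile interval:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.exponential(scale=2.0, size=100)       # illustrative sample
B = 2_000

boot_stats = np.empty(B)
for b in range(B):
    # Resample n observations with replacement and recompute the statistic.
    resample = rng.choice(data, size=data.size, replace=True)
    boot_stats[b] = np.median(resample)

lo, hi = np.percentile(boot_stats, [2.5, 97.5])   # 95% percentile interval
print(np.median(data), (lo, hi))
```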

Parametric bootstrap: Fit the model to data, obtaining $\hat{\theta}$. Draw $B$ samples from $p(\cdot \mid \hat{\theta})$ and re-estimate on each. Useful when the model is well-specified and sampling is expensive.
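
A parametric counterpart (same conventions, assuming an exponential model fit by its MLE, the sample mean of the scale): resample from the fitted model rather than from the data.

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.exponential(scale=2.0, size=100)
scale_hat = data.mean()                        # MLE of the exponential scale

B = 2_000
boot_scales = np.empty(B)
for b in range(B):
    # Draw from the *fitted* model p(. | theta_hat), then re-estimate.
    sim = rng.exponential(scale=scale_hat, size=data.size)
    boot_scales[b] = sim.mean()

print(scale_hat, np.percentile(boot_scales, [2.5, 97.5]))
```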

Consistency of the bootstrap requires the plug-in estimator $\hat{F}_n$ (the empirical CDF) to be a good approximation of $F$. By the Glivenko-Cantelli theorem, $\sup_x |\hat{F}_n(x) - F(x)| \xrightarrow{a.s.} 0$, which underpins bootstrap validity for smooth functionals.

Examples

Comparing estimators. For $X_i \sim \text{Uniform}(0, \theta)$, consider two estimators: $\hat{\theta}_1 = 2\bar{X}$ (method of moments) and $\hat{\theta}_2 = \frac{n+1}{n}X_{(n)}$, where $X_{(n)} = \max_i X_i$. Both are unbiased. However, $\text{Var}(\hat{\theta}_1) = \theta^2/(3n)$ while $\text{Var}(\hat{\theta}_2) = \theta^2/(n(n+2))$. For large $n$, $\hat{\theta}_2$ is far more efficient; it is built from the sufficient statistic $X_{(n)}$.
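
A quick simulation (NumPy; $\theta$, $n$, and the number of replications are illustrative) comparing the two estimators' variances against the formulas above:

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, n_reps = 5.0, 50, 100_000

x = rng.uniform(0.0, theta, size=(n_reps, n))
theta1 = 2 * x.mean(axis=1)                    # method of moments
theta2 = (n + 1) / n * x.max(axis=1)           # based on the maximum X_(n)

print(theta1.var(), theta**2 / (3 * n))        # ~ 0.167
print(theta2.var(), theta**2 / (n * (n + 2)))  # ~ 0.0096
```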

Exponential family for the Poisson. For $X_i \sim \text{Poisson}(\lambda)$, $p(x|\lambda) = e^{-\lambda}\lambda^x/x! = \frac{1}{x!}\exp(x\log\lambda - \lambda)$. Here $T(x) = x$, $\eta = \log\lambda$, $A(\eta) = e^\eta$. The MLE satisfies $E_{\hat{\lambda}}[X] = \bar{X}$, giving $\hat{\lambda} = \bar{X}$ - the sample mean, as expected.

