Density Estimation - Learning the Shape of Your Data
Helpful context:
- Probability Distributions
- Bayesian Inference
- Clustering - Finding Structure When Nobody Gave You Labels
Most supervised learning works like this: you have inputs $x$ and outputs $y$, and you learn a function $f(x) \approx y$. The question is always “what does this input map to?”
Density estimation asks a different question entirely. You have data points $x_1, x_2, \ldots, x_n$ and you want to learn the probability distribution those points were drawn from. Not “what category does this belong to” but “how likely is a point like this to exist at all?”
Call this distribution $p(x)$. Density estimation is the problem of recovering $p$ from samples.
Why This Matters
Knowing $p(x)$ unlocks several things.
Anomaly detection. If you know the distribution of your normal data, then any point with very low probability under that distribution is a candidate anomaly. It is not that the point is wrong - it is that it is surprising, given everything you have seen before. Fraud, equipment failure, and network intrusions all share this property: they are low-probability events under the distribution of normal behavior.
Generative modeling. If you can learn $p(x)$, you can sample new points from it. These new points look like your training data but are not copies of it. This is what generative models do: VAEs and normalizing flows explicitly learn $p(x)$ and sample from it; GANs learn to sample from it implicitly without ever writing down the density.
Classification via Bayes' rule. Suppose you want to classify whether an email is spam or not spam. Bayes' rule says:
$$p(\text{spam} \mid x) \propto p(x \mid \text{spam}) \cdot p(\text{spam})$$
Here $p(x \mid \text{spam})$ is the density of emails given that they are spam - a class-conditional density. Estimate it, multiply by the prior probability that any email is spam, and you have a probabilistic classifier. This is exactly what Naive Bayes and Gaussian discriminant analysis do.
Histograms - the Intuitive Starting Point
The histogram is the simplest density estimator. The procedure:
- Divide the feature space into bins of equal width.
- Count how many data points fall into each bin.
- Normalize so that the total area (count times bin width) sums to 1.
The result is a piecewise constant function that approximates $p(x)$.
Worked example. Five data points: ${1.1, 1.4, 2.0, 2.3, 4.8}$. Using bins $[1,2), [2,3), [3,4), [4,5)$ of width 1: the bin $[1,2)$ contains 2 points, $[2,3)$ contains 2 points, $[3,4)$ contains 0, $[4,5)$ contains 1. Dividing by (total points $\times$ bin width) $= 5 \times 1 = 5$, the estimated densities are $0.4, 0.4, 0.0, 0.2$. The integral over all bins is $0.4 + 0.4 + 0.0 + 0.2 = 1.0$.
The problems are fundamental. Bin width sensitivity: a narrow bin width makes the estimate spiky, following noise in the data. A wide bin width over-smooths, hiding real structure. The “right” bin width is unclear and must be set by the user. Discontinuity: probability jumps at bin boundaries, which is rarely a real feature of the underlying distribution. Dimensionality: in $d$ dimensions with $B$ bins per dimension, the total number of bins is $B^d$. With $B=10$ and $d=10$, that is $10^{10}$ bins - most of which will be empty. The histogram becomes useless in more than 2 or 3 dimensions.
Parametric Density Estimation
The parametric approach assumes the data comes from a known distribution family, then estimates the family’s parameters.
The most common choice is a Gaussian (normal distribution). The Gaussian is described by two parameters: mean $\mu$ and variance $\sigma^2$. To fit a Gaussian density estimator, compute the sample mean and sample variance:
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i \qquad \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \hat{\mu})^2$$
The fitted Gaussian $\mathcal{N}(\hat{\mu}, \hat{\sigma}^2)$ is your density estimate. It is smooth, integrates to 1, and is fully described by two numbers.
The strength: if the true distribution really is Gaussian (or close to it), this estimate is excellent - it uses the data efficiently and generalizes well. The weakness: if the true distribution is not Gaussian - if it is skewed, has heavy tails, or has multiple peaks - the fitted Gaussian will be badly wrong in ways that are hard to detect.
Gaussian Mixture Models (GMMs) extend parametric estimation to handle multi-modal distributions. A GMM with $k$ components is:
$$p(x) = \sum_{j=1}^k \pi_j \cdot \mathcal{N}(x \mid \mu_j, \Sigma_j)$$
Here $\pi_j$ are the mixture weights (how much each component contributes; they sum to 1), $\mu_j$ is the mean of component $j$, and $\Sigma_j$ is its covariance matrix. A GMM with enough components can approximate any smooth distribution. The parameters ${\pi_j, \mu_j, \Sigma_j}$ are fit using the Expectation-Maximization (EM) algorithm, which alternates between:
- E-step: for each data point, compute the probability that it was generated by each component (“soft assignments”).
- M-step: update the component parameters to maximize the likelihood, using the soft assignments from the E-step.
The number of components $k$ is a hyperparameter chosen by cross-validation or the Bayesian information criterion (BIC).
Kernel Density Estimation - the Non-Parametric Approach
What if you do not want to assume a distribution family at all? Kernel Density Estimation (KDE) is the non-parametric answer.
The intuition: for each observed data point, place a small smooth “bump” centered at that point. Sum all the bumps and normalize so the total integrates to 1. The result is a smooth continuous density estimate that follows the actual data, with no assumption about its shape.
Formally, with $n$ data points $x_1, \ldots, x_n$ and a kernel function $K$:
$$\hat{p}(x) = \frac{1}{n h} \sum_{i=1}^n K!\left(\frac{x - x_i}{h}\right)$$
The kernel $K$ is typically a Gaussian: $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$. Each data point contributes a Gaussian bell curve of width $h$ centered at its location.
The parameter $h$ is called the bandwidth. It plays the same role as bin width in a histogram. A small $h$ produces a spiky estimate that closely tracks individual points - high variance, potentially capturing noise. A large $h$ over-smooths and misses real structure in the distribution.
Bandwidth selection. The bandwidth $h$ can be chosen by cross-validation: hold out some data points, fit KDE on the rest, and evaluate the log-likelihood of the held-out points under the estimated density. Repeat for multiple values of $h$ and pick the best. For 1D data with an approximately Gaussian distribution, Silverman’s rule of thumb gives a quick estimate:
$$h \approx 1.06 \cdot \hat{\sigma} \cdot n^{-1/5}$$
Here $\hat{\sigma}$ is the sample standard deviation and $n$ is the number of data points. This rule is derived by minimizing the mean integrated squared error under the assumption that the true distribution is Gaussian - so it works well for near-Gaussian data and less well for strongly multi-modal data.
Worked example. Three data points: $x = 1$, $x = 2$, $x = 5$. Using a Gaussian kernel with bandwidth $h = 0.5$, the estimated density at any point $t$ is:
$$\hat{p}(t) = \frac{1}{3 \cdot 0.5} \left[ \mathcal{N}(t; 1, 0.25) + \mathcal{N}(t; 2, 0.25) + \mathcal{N}(t; 5, 0.25) \right]$$
At $t = 1.5$ (between the first two points), contributions from $x=1$ and $x=2$ are both high. At $t = 3.5$ (between the cluster and the isolated point), all three contributions are small, producing a valley. At $t = 5$, only the third point contributes significantly, giving a smaller peak. The result: high density near 1 and 2, a valley around 3-4, and a smaller peak near 5.
from sklearn.neighbors import KernelDensity
import numpy as np
X = np.array([[1], [2], [5]])
kde = KernelDensity(kernel="gaussian", bandwidth=0.5)
kde.fit(X)
# Evaluate log-density on a grid
t = np.linspace(-1, 8, 500).reshape(-1, 1)
log_density = kde.score_samples(t)
density = np.exp(log_density)
The curse of dimensionality in KDE. In high dimensions, the bandwidth required to capture a smooth estimate grows exponentially with dimension. With $d$ dimensions, you need roughly $n \sim h^{-d}$ data points to populate the neighborhood around each query point. For $d = 10$ and $h = 0.1$, this is $10^{10}$ points - impractical. KDE works well in 1 to 3 dimensions; beyond 5 or so, you need either parametric assumptions or enormous amounts of data.
Applications
Anomaly detection. Train a KDE (or GMM) on normal data. At inference time, compute $\hat{p}(x_{\text{new}})$. If this value is below a threshold $\tau$, flag the point as anomalous. The threshold $\tau$ is chosen based on the acceptable false positive rate. This approach makes the anomaly score interpretable: it is a probability, not an opaque score from a tree ensemble.
Generative modeling. Once you have fit $\hat{p}(x)$, you can sample from it. For a KDE, sampling is easy: pick a data point $x_i$ uniformly at random, then draw a sample from $\mathcal{N}(x_i, h^2 I)$. The result is a new point that looks like your training data but is not a copy. VAEs and normalizing flows are deep learning extensions of this idea - they learn $p(x)$ over complex manifolds like images, where KDE would fail due to dimensionality.
Bayesian classification. By Bayes' rule, the posterior probability of class $c$ given input $x$ is:
$$p(c \mid x) = \frac{p(x \mid c) \cdot p(c)}{\sum_{c'} p(x \mid c') \cdot p(c')}$$
The class-conditional density $p(x \mid c)$ can be estimated separately for each class using KDE or a GMM. Multiply by the class prior $p(c)$ (the fraction of training examples in each class), normalize, and you have a full probabilistic classifier. This generalizes Naive Bayes (which assumes $p(x \mid c)$ factors over features) to arbitrary class-conditional densities.
The EM algorithm that fits GMMs has the same structure as the soft k-means clustering algorithm: it alternates between assigning points to clusters and updating cluster parameters. The main difference is that GMM EM uses soft probabilistic assignments and explicitly models the covariance of each cluster, whereas k-means uses hard assignments and implicitly assumes spherical equal-variance clusters.
| Concept | Key point |
|---|---|
| Density estimation | Learn $p(x)$ from samples; enables anomaly detection, generation, and Bayesian classification |
| Histogram | Simplest estimator; sensitive to bin width; fails in high dimensions |
| Parametric estimation | Assume a distribution family; efficient when correct; brittle when wrong |
| GMM | Mixture of Gaussians; fit by EM; $k$ is a hyperparameter |
| KDE | Place a kernel bump at each data point; bandwidth $h$ controls smoothness |
| Silverman’s rule | $h \approx 1.06 \hat{\sigma} n^{-1/5}$; quick bandwidth for near-Gaussian 1D data |
| Curse of dimensionality | KDE needs exponentially more data as dimension increases |
Read Next: