Anomaly Detection - Finding the Needles Without Knowing the Haystack
Helpful context:
- Clustering - Finding Structure When Nobody Gave You Labels
- Ensembles - Weak Learners That Combine Into Something Stronger
A credit card transaction processing system handles ten million transactions per day. Fraud accounts for perhaps 50 of them. You cannot train a standard classifier because you have almost no positive examples, and the ones you do have are already labeled - meaning they are old fraud patterns that fraudsters have since stopped using. You need to find the transactions that are weird relative to everything else, without someone labeling what “weird” looks like.
This is the anomaly detection problem. It appears across fraud detection, server monitoring, manufacturing quality control, medical diagnosis, and network intrusion - anywhere that normal behavior is abundant and you are hunting for rare deviations. What makes it hard is the same thing that makes it interesting: you often cannot define what you are looking for, only what you are not looking for.
The Core Idea
Anomaly detection methods work by modeling normal. Once you have a model of what normal looks like, anything that deviates significantly from that model is flagged as anomalous.
This sounds simple but has two deep complications. First, you need enough “normal” data to model it well. Second, you need to decide what “significantly deviate” means - that threshold is always a judgment call that trades off false positives (normal things flagged as anomalous) against false negatives (real anomalies missed).
Isolation Forest: Normal Is Hard to Isolate
Isolation Forest (Liu et al., 2008) is the most widely used anomaly detection algorithm in industry, and its insight is beautiful in its simplicity.
Anomalies are rare and different. Normal points cluster together in dense regions of feature space. Anomalies sit in sparse regions, far from everything else. The key observation: it takes fewer random splits to isolate an anomaly than to isolate a normal point.
Imagine randomly selecting a feature and a split value within its range, repeatedly, until each point is alone. A normal point surrounded by similar neighbors requires many splits before it is isolated - each split separates it from only a few neighbors. An anomaly in sparse space is isolated after just a few splits - a single cut might separate it from the entire rest of the dataset.
The algorithm builds $T$ random trees:
- Randomly subsample $\psi$ points from the dataset (typically $\psi = 256$).
- Recursively split by randomly selecting a feature and a random split value within that feature’s range, until every point is isolated or the tree reaches maximum depth.
- For a new point, the isolation depth is the number of splits required to isolate it in the tree.
The anomaly score is computed from the average path length across all $T$ trees, normalized by the expected path length for a random binary tree. Points with shorter average path length (easier to isolate) get higher anomaly scores.
What works well: Isolation Forest is fast to train ($O(T \psi \log \psi)$, largely independent of $n$), requires no distance computations, handles high-dimensional data reasonably, and works unsupervised. The $\psi = 256$ default is remarkably robust - using a small subsample per tree speeds up training without harming performance, because isolated points are identified with few points anyway.
Where it struggles: anomalies that form their own dense cluster (a group of fraudsters using the same pattern) may not be isolated quickly because they are dense relative to each other. Points near the boundary of the normal distribution may score as anomalous even when they are not.
from sklearn.ensemble import IsolationForest
clf = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
clf.fit(X_train) # trains on all data; no labels needed
scores = clf.score_samples(X_test) # more negative = more anomalous
labels = clf.predict(X_test) # -1 = anomaly, 1 = normal
The contamination parameter is your estimate of the fraction of anomalies in the data. It sets the decision threshold. If you do not know this fraction, set it conservatively low (0.001 to 0.05) and examine the score distribution manually.
Local Outlier Factor: Anomaly Relative to Neighbors
Isolation Forest uses global density. Local Outlier Factor (LOF, Breunig et al., 2000) uses local density - comparing each point’s density to its neighbors' density. This matters when the data has regions of different density: a point in a sparse region might be normal for that region, while the same distance from its neighbors in a denser region would be highly anomalous.
For each point $p$, LOF computes:
- Find the $k$ nearest neighbors of $p$.
- Compute the reachability distance from $p$ to each neighbor (max of actual distance and the neighbor’s $k$-th nearest neighbor distance - this smoothing prevents instability with points that are very close together).
- Compute the local reachability density of $p$: the inverse of the average reachability distance to $p$’s neighbors.
- The LOF score of $p$ is the ratio of the average local reachability density of $p$’s neighbors to $p$’s own local reachability density.
A LOF score near 1 means $p$ has the same density as its neighbors - it is normal. A LOF score much greater than 1 means $p$ is in a sparser region than its neighbors - it is an outlier relative to its local context.
What works well: LOF detects local anomalies that global methods miss. In a dataset where fraud occurs in multiple different “styles” each with its own density, LOF can detect deviations within each style separately.
Where it struggles: LOF requires computing pairwise distances, making it $O(n^2)$ in the worst case - unusable for datasets above a few hundred thousand points. The choice of $k$ significantly affects results, and the right $k$ varies across different regions of the data.
Statistical Baselines
Before reaching for Isolation Forest, statistical baselines deserve consideration. They are transparent, fast, and surprisingly effective for low-dimensional data with known distributions.
Z-score: flag points where $|x - \mu| / \sigma > 3$ (more than 3 standard deviations from the mean). Works well for univariate Gaussian data. Breaks down for skewed distributions or when the mean and standard deviation are themselves distorted by the anomalies you are trying to find.
IQR method: flag points below $Q1 - 1.5 \times IQR$ or above $Q3 + 1.5 \times IQR$. More robust than z-score because quartiles are not affected by extreme values.
Multivariate Gaussian: fit a Gaussian to the data (estimate $\mu$ and $\Sigma$), then compute the Mahalanobis distance of each new point. Points with high Mahalanobis distance are anomalous. The Mahalanobis distance accounts for correlations between features: a point can be normal on every individual feature but anomalous in their combination.
Statistical methods require assuming a distribution. When the assumption holds, they are powerful and interpretable. When it does not, Isolation Forest or LOF are safer bets.
One-Class SVM: A Tight Boundary Around Normal
One-Class SVM (Schölkopf et al., 2001) learns a boundary in feature space (or in a kernel-induced space) that contains the normal data, then flags anything outside that boundary as anomalous.
The standard SVM separates two classes with a margin. One-Class SVM has only one class, so it separates the data from the origin in a high-dimensional feature space, finding the boundary that contains most of the normal points with maximum margin from the origin.
What works well: the kernel trick allows One-Class SVM to learn complex boundaries (an RBF kernel can wrap tightly around clusters of arbitrary shape). Works in high dimensions.
Where it struggles: very sensitive to the regularization parameter $\nu$ (fraction of training points allowed to be outside the boundary) and the kernel bandwidth. Getting these right requires validation data with known labels - which often defeats the purpose of unsupervised anomaly detection. Training scales poorly ($O(n^2)$ to $O(n^3)$) for large datasets.
Autoencoder-Based Detection
An autoencoder is a neural network that learns to compress data into a low-dimensional representation (the bottleneck) and then reconstruct the original input. Trained on normal data, it learns to reconstruct normal patterns efficiently but fails to reconstruct anomalies well - because the bottleneck has not learned to represent patterns it never saw.
The anomaly score is the reconstruction error: the difference between the input and the autoencoder’s output. Normal points have low reconstruction error. Anomalous points have high reconstruction error.
What works well: autoencoders can model extremely complex normal patterns - images, time series, text, multi-sensor industrial data. They are the right tool when the normal data is high-dimensional and structured (factory sensor readings, network packet flows, medical scans).
Where it struggles: requires enough normal training data to learn the data manifold. Requires careful tuning of architecture and training. The reconstruction error threshold is a hyperparameter. Anomalies that look similar to normal data in the latent space (same broad structure, wrong fine-grained detail) may not be caught.
Choosing the Right Tool
| Algorithm | Speed | Labels needed | Strengths | Weakness |
|---|---|---|---|---|
| Isolation Forest | Fast | None | Scales to millions, robust, high-dim | Clustered anomalies, boundary cases |
| LOF | Slow | None | Local anomalies, varying density | $O(n^2)$, sensitive to $k$ |
| Z-score / IQR | Very fast | None | Transparent, interpretable | Assumes distribution, univariate |
| One-Class SVM | Slow | None | Complex boundaries, high-dim | Very sensitive to hyperparameters |
| Autoencoder | Moderate | None (but more data) | Complex structured data, images | Requires tuning, significant data |
In production anomaly detection, the workflow is almost always:
- Train Isolation Forest as the baseline - fast, robust, requires minimal tuning.
- Examine flagged anomalies manually to understand what kinds of anomalies the data actually contains.
- If local anomalies matter, add LOF or a local variant.
- If the data is high-dimensional and structured (images, sequences), consider an autoencoder.
- Set thresholds based on the acceptable false positive rate for your application.
The contamination fraction and score threshold are the decisions that matter most in deployment. A fraud system that flags 0.01% of transactions for review has very different operational requirements than one that flags 0.1%. The ML model produces scores; the threshold determines the operating point.
| Concept | Key point |
|---|---|
| Anomaly detection | Model normal; flag deviations; no labels needed |
| Isolation Forest | Short isolation path = anomaly; fast; scales to millions |
| Contamination | Fraction of anomalies; sets decision threshold |
| LOF | Compare density to neighbors; finds local anomalies; $O(n^2)$ |
| Mahalanobis distance | Multivariate outlier score; accounts for correlations |
| Autoencoder | High reconstruction error = anomaly; for complex structured data |
| Threshold tuning | The operating point is always a judgment call; use false positive rate requirements |
Read Next: