Normalizing Flows - Invertible Maps From Simple Noise to Complex Data // Megha Bose

Every generative model faces the same fundamental tension: the data distribution is complex, but we need to either sample from it or evaluate its density. VAEs resolve this by approximating the density with a tractable lower bound. GANs sidestep density entirely by learning to sample implicitly. Normalizing flows take a different path - they learn an exact, invertible transformation between the data distribution and a simple reference distribution (usually a Gaussian), so that both sampling and exact density evaluation are possible.

The idea is conceptually clean. Suppose $f: \mathbb{R}^d \to \mathbb{R}^d$ is a bijection, and $z = f(x)$ maps data points $x$ to latent codes $z$. If $z$ follows a standard Gaussian $p_Z(z) = \mathcal{N}(0, I)$, what density does $x$ follow? The change-of-variables formula gives the answer exactly: the density of $x$ is the density of $f(x)$ scaled by the absolute Jacobian determinant. To generate a new sample, sample $z \sim \mathcal{N}(0,I)$ and apply $f^{-1}(z)$. To evaluate the likelihood of a data point $x$, apply $f(x)$, evaluate the Gaussian density, and scale by the Jacobian determinant. Nothing is approximate - the likelihood is exact.

The difficulty is entirely in the construction of $f$. For the formula to be useful, $f$ must be invertible (so we can both sample and evaluate likelihood), and its Jacobian determinant must be cheap to compute (otherwise the likelihood computation is intractable). A general neural network satisfies neither constraint. The ingenuity in normalizing flows is in designing architectural families - coupling layers, autoregressive flows, continuous flows - that are simultaneously expressive, invertible, and have tractable Jacobians. These design choices shape a rich model family with exact likelihoods but different tradeoffs in expressivity, sampling speed, and training speed.

Change of Variables

Let $z \sim p_Z(z)$ be a simple distribution (standard Gaussian) and let $f: \mathbb{R}^d \to \mathbb{R}^d$ be a differentiable bijection. Define $x = f^{-1}(z)$, so $z = f(x)$. The density of $x$ is given by:

$$p_X(x) = p_Z(f(x)) \left|\det J_f(x)\right|,$$

where $J_f(x) = \partial f(x) / \partial x \in \mathbb{R}^{d \times d}$ is the Jacobian of $f$ at $x$. In log form:

$$\log p_X(x) = \log p_Z(f(x)) + \log \left|\det J_f(x)\right|.$$

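The Jacobian determinant accounts for how the transformation $f$ stretches or compresses volume. If $f$ expands a small region around $x$ (large $|\det J_f(x)|$), that region receives the probability mass of a larger region in latent space, so $p_X(x)$ is higher than $p_Z(f(x))$. If $f$ compresses the region (small $|\det J_f(x)|$), the density at $x$ is correspondingly lower. Viewed from the sampling direction the statement flips: where $f^{-1}$ expands volume, mass is spread thinner and the density drops, which is what keeps the total probability at 1.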

Why the Jacobian determinant matters. Without the Jacobian correction, $p_Z(f(x))$ alone would generally not integrate to 1 over $x$, so it would not be a valid density. The correction term $\log |\det J_f(x)|$ is what makes the transformed distribution a valid probability distribution. It is also the key source of difficulty: computing $\det J$ for a general $d \times d$ matrix costs $O(d^3)$, which is prohibitive for image data where $d$ can run into the millions.
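A small numerical check can make the role of the correction term concrete. The sketch below is illustrative only: the bijection $f(x) = x + x^3$ and the integration grid are assumptions chosen for convenience, not anything prescribed above. It integrates $p_Z(f(x))$ over $x$ with and without the factor $|f'(x)|$.

```python
import numpy as np

def std_normal_pdf(z):
    """Density of the 1-D standard Gaussian base distribution."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def f(x):
    """Illustrative bijection on the real line (strictly increasing)."""
    return x + x**3

def abs_det_jacobian(x):
    """|f'(x)|; in 1-D the Jacobian determinant is just the derivative."""
    return 1.0 + 3.0 * x**2

# Riemann sum on a grid wide enough that f maps it onto
# essentially all of the Gaussian's probability mass.
x = np.linspace(-5.0, 5.0, 20001)
dx = x[1] - x[0]

with_correction = np.sum(std_normal_pdf(f(x)) * abs_det_jacobian(x)) * dx
without_correction = np.sum(std_normal_pdf(f(x))) * dx

print(f"with Jacobian term:    {with_correction:.4f}")
print(f"without Jacobian term: {without_correction:.4f}")
```

With the Jacobian factor the integral comes out at 1 up to discretization error; without it the total mass falls well short of 1, so the uncorrected expression is not a density.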

Composing flows. A single bijection may not be expressive enough. A normalizing flow stacks $K$ bijections: $z = f_K \circ f_{K-1} \circ \cdots \circ f_1(x)$. By the chain rule and the multiplicativity of determinants:

$$\log p_X(x) = \log p_Z(z) + \sum_{k=1}^K \log \left|\det J_{f_k}(x_{k-1})\right|,$$

where $x_0 = x$ and $x_k = f_k(x_{k-1})$ is the output of the $k$-th layer, so each Jacobian is evaluated at the input of its own layer and $z = x_K$. Each $f_k$ must be invertible with tractable Jacobian; the composition inherits both properties. With enough sufficiently flexible layers, a normalizing flow can approximate very complex distributions.
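As a rough sketch of how the per-layer terms accumulate, the snippet below composes two 1-D bijections and sums their log-determinants, evaluating each at the input of its own layer. The two layers (`2x + 1` and `x + x^3`) and the helper names `flow_forward` and `log_prob` are hypothetical choices for illustration.

```python
import numpy as np

# Each layer is a pair: (forward map, log|det J| evaluated at the layer's input).
layers = [
    (lambda x: 2.0 * x + 1.0, lambda x: np.log(2.0) * np.ones_like(x)),
    (lambda x: x + x**3,      lambda x: np.log1p(3.0 * x**2)),
]

def flow_forward(x, layers):
    """Apply the layers in order, accumulating the log-determinant terms."""
    total_logdet = np.zeros_like(x)
    for forward, logdet in layers:
        total_logdet = total_logdet + logdet(x)  # Jacobian at this layer's input
        x = forward(x)                           # output becomes the next input
    return x, total_logdet

def log_prob(x, layers):
    """Exact log p_X(x) under a standard Gaussian base distribution."""
    z, total_logdet = flow_forward(x, layers)
    log_pz = -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)
    return log_pz + total_logdet

x = np.array([-0.7, 0.0, 0.4])
z, total_logdet = flow_forward(x, layers)

# Cross-check against a central finite difference of the composed map.
eps = 1e-6
numeric = np.log(np.abs(
    (flow_forward(x + eps, layers)[0] - flow_forward(x - eps, layers)[0]) / (2 * eps)
))
print(total_logdet, numeric)
```

The summed per-layer terms agree with the log-derivative of the composed map, which is the content of the sum in the formula.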

Training. Given a dataset $\{x^{(1)}, \ldots, x^{(n)}\}$, maximize the exact log-likelihood:

$$\mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^n \left[\log p_Z(f_\theta(x^{(i)})) + \log \left|\det J_{f_\theta}(x^{(i)})\right|\right].$$

This is maximum likelihood estimation with an exact density - no variational approximation, no adversarial training.
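To make the objective tangible, here is a minimal sketch that fits a one-layer affine flow $z = e^{s} x + t$ to synthetic 1-D data by gradient ascent on the exact log-likelihood. The data, learning rate, step count, and hand-derived gradients are all assumptions for this toy case; a real flow would use automatic differentiation and a richer $f_\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=2000)   # synthetic 1-D "dataset"

# One-layer affine flow z = exp(s) * x + t, so log|det J| = s.
s, t = 0.0, 0.0
lr = 0.05

def mean_log_likelihood(s, t, x):
    """The objective L(theta) for this toy flow."""
    z = np.exp(s) * x + t
    log_pz = -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)
    return np.mean(log_pz + s)

initial_ll = mean_log_likelihood(s, t, x)

for step in range(3000):
    z = np.exp(s) * x + t
    # Hand-derived gradients of the mean log-likelihood.
    grad_s = np.mean(-z * np.exp(s) * x) + 1.0
    grad_t = np.mean(-z)
    s += lr * grad_s      # gradient ascent: we maximize the likelihood
    t += lr * grad_t

final_ll = mean_log_likelihood(s, t, x)

# Invert the flow to read off the implied data distribution: x = (z - t) * exp(-s).
fitted_std = np.exp(-s)
fitted_mean = -t * np.exp(-s)
print(fitted_mean, fitted_std, initial_ll, final_ll)
```

For this affine family the maximum-likelihood solution is the Gaussian with the sample mean and (biased) sample standard deviation, which gives a simple sanity check on the training loop.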

Coupling Layers: RealNVP

The challenge is designing $f$ with a tractable Jacobian. The coupling layer, introduced in NICE (Dinh et al., 2015) and extended in RealNVP (Dinh et al., 2017), achieves this through a structured transformation.

Split the input $x \in \mathbb{R}^d$ into two halves: $x = (x_a, x_b)$ with $x_a \in \mathbb{R}^{d/2}$ and $x_b \in \mathbb{R}^{d/2}$. The coupling layer is:

$$z_a = x_a, \qquad z_b = x_b \odot \exp(s_\theta(x_a)) + t_\theta(x_a),$$

where $s_\theta$ and $t_\theta$ are arbitrary neural networks (scale and translation functions), and $\odot$ is elementwise multiplication. The first half $x_a$ passes through unchanged; the second half $x_b$ is scaled and shifted by a function of $x_a$.

Why this is invertible. Given $z = (z_a, z_b)$, recover $x$:

$$x_a = z_a, \qquad x_b = (z_b - t_\theta(z_a)) \odot \exp(-s_\theta(z_a)).$$

The inversion is exact and does not require solving any equations - just a forward pass of $s_\theta$ and $t_\theta$. Importantly, $s_\theta$ and $t_\theta$ never need to be inverted; they can be arbitrarily complex neural networks.

Why the Jacobian is triangular. The Jacobian of the coupling layer has the structure:

$$J_f = \begin{pmatrix} I & 0 \\ \partial z_b / \partial x_a & \text{diag}(\exp(s_\theta(x_a))) \end{pmatrix}.$$

The top-left block is the identity (since $z_a = x_a$). The bottom-right block is diagonal (since $z_{b,i}$ depends on $x_{b,i}$ only through the elementwise scale). The off-diagonal blocks are zero (top-right) or unconstrained (bottom-left, which does not contribute to the determinant). The determinant of a lower-triangular matrix is the product of its diagonal entries:

$$\log \left|\det J_f(x)\right| = \sum_{i=1}^{d/2} s_\theta(x_a)_i.$$

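Given the output of $s_\theta$, this is an $O(d)$ sum, and $s_\theta$ is evaluated in the forward pass anyway. No $d \times d$ determinant is ever formed, so the log-determinant adds only $O(d)$ work per layer on top of the conditioner networks, regardless of how complex $s_\theta$ is.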

In practice, the split alternates which half is left unchanged, so that after a few layers every dimension has been transformed as a function of every other. This is the main architectural trick of RealNVP: complex transformations built from simple, tractably invertible pieces.
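The coupling layer equations above can be sketched in a few lines of NumPy. The tiny random (untrained) networks standing in for $s_\theta$ and $t_\theta$, the `tanh` bound on the log-scale, and the function names are assumptions for illustration, not the RealNVP reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, half, hidden = 6, 3, 16

# Randomly initialised (untrained) weights; stand-ins for s_theta and t_theta.
W1 = rng.normal(scale=0.5, size=(half, hidden))
Ws = rng.normal(scale=0.5, size=(hidden, half))
Wt = rng.normal(scale=0.5, size=(hidden, half))

def scale_and_shift(x_a):
    h = np.tanh(x_a @ W1)
    s = np.tanh(h @ Ws)      # bounded log-scale keeps exp(s) numerically tame
    t = h @ Wt
    return s, t

def coupling_forward(x):
    """x -> z, plus log|det J| = sum of the log-scales."""
    x_a, x_b = x[:half], x[half:]
    s, t = scale_and_shift(x_a)
    z_b = x_b * np.exp(s) + t
    return np.concatenate([x_a, z_b]), np.sum(s)

def coupling_inverse(z):
    """z -> x using the same forward pass of the conditioner; nothing is inverted."""
    z_a, z_b = z[:half], z[half:]
    s, t = scale_and_shift(z_a)
    x_b = (z_b - t) * np.exp(-s)
    return np.concatenate([z_a, x_b])

x = rng.normal(size=d)
z, logdet = coupling_forward(x)
x_rec = coupling_inverse(z)

# Brute-force check: finite-difference Jacobian and its O(d^3) log-determinant.
eps = 1e-6
J = np.stack([
    (coupling_forward(x + eps * e)[0] - coupling_forward(x - eps * e)[0]) / (2 * eps)
    for e in np.eye(d)
], axis=1)
_, logdet_numeric = np.linalg.slogdet(J)
print(logdet, logdet_numeric)
```

The finite-difference Jacobian is only there to confirm the structure: its upper triangle vanishes and its log-determinant matches the cheap sum of log-scales.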

Autoregressive Flows

Coupling layers transform half the input as a function of the other half. Autoregressive flows generalize this: each dimension $z_i$ is transformed as a function of all previous dimensions $x_1, \ldots, x_{i-1}$.

The autoregressive transformation is:

$$z_i = h(x_i;\; \mu_i(x_{<i}),\; \sigma_i(x_{<i})),$$

where $h$ is an invertible scalar function (e.g., an affine map: $z_i = \sigma_i(x_{<i}) \cdot x_i + \mu_i(x_{<i})$ with $\sigma_i > 0$) and $\mu_i, \sigma_i$ are functions of the preceding variables. The Jacobian of this transformation is lower-triangular: $\partial z_i / \partial x_j = 0$ for $j > i$ (since $z_i$ does not depend on $x_j$ for $j > i$). For the affine case the determinant is therefore the product of the diagonal entries, $\prod_i \sigma_i(x_{<i})$, so $\log |\det J| = \sum_i \log \sigma_i(x_{<i})$.

Masked Autoregressive Flow (MAF) (Papamakarios et al., 2017) implements the autoregressive functions $\mu_i(x_{<i}), \sigma_i(x_{<i})$ using a masked neural network (MADE) that enforces the autoregressive constraint by zeroing out connections that would violate it. Given a full data point $x$, a single network pass computes all $\mu_i$ and $\sigma_i$ in parallel, so density evaluation takes one pass. Sampling is sequential: draw $z \sim p_Z$, compute $\mu_1, \sigma_1$ (from an empty conditioning set) and invert $h$ to get $x_1$; computing $\mu_2, \sigma_2$ then needs $x_1$, so $x_2$ can only be recovered after $x_1$, and so on. Sampling thus takes $d$ sequential passes.

Inverse Autoregressive Flow (IAF) (Kingma et al., 2016) inverts the direction. It is an autoregressive transformation conditioned on the latent code $z$, so sampling takes one parallel pass while evaluating the density of an arbitrary data point takes $d$ sequential passes - the exact reverse of MAF. IAF is useful as a variational posterior in VAEs, where you need fast samples and the density of those samples (which IAF gets for free during sampling), but never the density of an externally supplied point.

The MAF/IAF duality is a fundamental tradeoff: autoregressive transformations in data space (MAF) give fast density evaluation but slow sampling; in latent space (IAF) give fast sampling but slow density evaluation.
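The asymmetry between the two directions can be seen in a small sketch of an affine autoregressive transform in data space (the MAF arrangement). The strictly lower-triangular linear conditioner is an assumed stand-in for a MADE network, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Strictly lower-triangular weights: row i only sees x_1..x_{i-1}.
# A linear conditioner is an illustrative stand-in for a MADE network.
mask = np.tril(np.ones((d, d)), k=-1)
W_mu = rng.normal(scale=0.5, size=(d, d)) * mask
W_ls = rng.normal(scale=0.3, size=(d, d)) * mask   # predicts log sigma

def conditioner(x):
    """Return mu_i(x_<i) and log sigma_i(x_<i) for all i at once."""
    return W_mu @ x, W_ls @ x

def density_direction(x):
    """x -> z in ONE conditioner pass, since all of x is known."""
    mu, log_sigma = conditioner(x)
    z = np.exp(log_sigma) * x + mu
    return z, np.sum(log_sigma)          # log|det J| = sum_i log sigma_i

def sampling_direction(z):
    """z -> x needs d sequential passes: x_i depends on x_<i."""
    x = np.zeros(d)
    for i in range(d):
        # Row i of the masked weights only reads the already-filled x_<i.
        mu, log_sigma = conditioner(x)
        x[i] = (z[i] - mu[i]) * np.exp(-log_sigma[i])
    return x

x = rng.normal(size=d)
z, logdet = density_direction(x)
x_rec = sampling_direction(z)
print(np.max(np.abs(x - x_rec)), logdet)
```

Swapping which direction is treated as "sampling" gives the IAF arrangement: the one-pass function becomes the generator and the sequential loop becomes the density evaluation for external points.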

Continuous Normalizing Flows

Discrete flows apply $K$ bijections sequentially. Continuous Normalizing Flows (CNFs), introduced by Chen et al. (2018) with the Neural ODE framework, take the limit $K \to \infty$ and define the transformation via an ordinary differential equation.

Define a vector field $v_\theta(x, t): \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d$. The transformation is the solution to:

$$\frac{dx}{dt} = v_\theta(x, t), \qquad x(0) = z \sim p_Z.$$

Running the ODE from $t=0$ to $t=1$ maps $z$ to a sample $x = x(1)$. The density at any time $t$ evolves according to the instantaneous change of variables (Liouville’s theorem):

$$\frac{\partial \log p_t(x(t))}{\partial t} = -\text{tr}\left(\frac{\partial v_\theta}{\partial x}\right).$$

The rate of change of the log-density is the negative trace of the Jacobian of the vector field. A trace is much cheaper than a determinant, and it can be estimated without bias by Hutchinson’s estimator using $O(1)$ vector-Jacobian products per sample. The log-likelihood is:

$$\log p_X(x) = \log p_Z(x(0)) - \int_0^1 \text{tr}\left(\frac{\partial v_\theta}{\partial x(t)}\right) dt.$$

This integral is computed by augmenting the ODE state with the log-density and solving the joint ODE numerically. The vector field $v_\theta$ can be any sufficiently smooth neural network - no coupling or autoregressive structure is required. This freedom makes CNFs very flexible in principle, but the ODE integration makes training slower than for discrete flows (many function evaluations per gradient step).
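A minimal sketch of the augmented ODE, under strong simplifying assumptions: the vector field is a fixed linear map $v(x, t) = Ax$ (so its Jacobian is just $A$ and the answer is known in closed form), and the solver is a hand-written fixed-step RK4 rather than an adaptive ODE library. The last function illustrates Hutchinson's estimator on the explicit matrix; with a neural $v_\theta$ the product $\epsilon^\top J$ would come from automatic differentiation instead.

```python
import numpy as np

A = np.array([[0.3, -0.5],
              [0.4,  0.1]])          # illustrative linear vector field v(x, t) = A x

def v(x, t):
    return A @ x

def jacobian_trace(x, t):
    return np.trace(A)               # dv/dx = A for a linear field

def augmented_rhs(state, t):
    """Joint dynamics of (x, accumulated trace integral)."""
    x = state[:-1]
    return np.append(v(x, t), jacobian_trace(x, t))

def rk4(state, rhs, t0=0.0, t1=1.0, steps=100):
    """Classic fixed-step fourth-order Runge-Kutta integrator."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = rhs(state, t)
        k2 = rhs(state + 0.5 * h * k1, t + 0.5 * h)
        k3 = rhs(state + 0.5 * h * k2, t + 0.5 * h)
        k4 = rhs(state + h * k3, t + h)
        state = state + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return state

rng = np.random.default_rng(0)
z = rng.normal(size=2)
log_pz = -0.5 * z @ z - np.log(2.0 * np.pi)   # log N(z; 0, I) in 2-D

final = rk4(np.append(z, 0.0), augmented_rhs)
x, trace_integral = final[:-1], final[-1]
log_px = log_pz - trace_integral              # log p_X(x) for the sample x = x(1)

def hutchinson_trace(J, n_probes, rng):
    """Stochastic trace estimate E[eps^T J eps] with Rademacher probes."""
    eps = rng.choice([-1.0, 1.0], size=(n_probes, J.shape[0]))
    vjp = eps @ J                     # each row is a vector-Jacobian product eps^T J
    return np.mean(np.sum(vjp * eps, axis=1))

trace_estimate = hutchinson_trace(A, n_probes=20000, rng=rng)
print(log_px, trace_integral, trace_estimate)
```

For this linear field the flow map is the matrix exponential of $A$ and the trace integral equals $\text{tr}(A)$, so both the integrator and the stochastic estimate can be checked against known values.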

Flow Matching. Training a CNF by maximum likelihood is expensive because the ODE must be integrated at each training step. Flow Matching (Lipman et al., 2022) avoids this by training $v_\theta$ to match a target conditional vector field that is known in closed form. For straight-line paths $x_t = (1-t)\,z + t\,x$ between $z \sim \mathcal{N}(0,I)$ and $x \sim p_{\text{data}}$, the conditional target is simply $x - z$, constant along each path. This reduces CNF training to a simple regression problem, with no ODE integration required during training.
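The regression view can be sketched as follows, assuming straight-line paths from noise at $t=0$ to data at $t=1$. The 1-D synthetic data and the small hand-picked feature model fitted by least squares are illustrative assumptions; in practice $v_\theta$ would be a neural network trained by stochastic gradient descent on the same squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

x1 = rng.normal(loc=2.0, scale=0.5, size=n)   # stand-in "data" samples
z = rng.normal(size=n)                        # base noise samples
t = rng.uniform(size=n)                       # random times in [0, 1]

x_t = (1.0 - t) * z + t * x1                  # point on the straight-line path
target = x1 - z                               # conditional target velocity

def features(x_t, t):
    """Hand-picked features; a real v_theta would be a neural network."""
    return np.stack([np.ones_like(x_t), x_t, t, x_t * t], axis=1)

# Least-squares fit of v_theta(x_t, t) = features @ theta to the regression target.
Phi = features(x_t, t)
theta, *_ = np.linalg.lstsq(Phi, target, rcond=None)

def fm_loss(pred, target):
    """Flow-matching loss: mean squared error against the target velocity."""
    return np.mean((pred - target) ** 2)

loss_fitted = fm_loss(Phi @ theta, target)
loss_zero = fm_loss(np.zeros(n), target)      # baseline: predict zero velocity
print(loss_fitted, loss_zero)
```

Note that no ODE solver appears anywhere in the fitting step; integration is only needed later, when the learned field is used to generate samples.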

Exact Likelihood vs. Other Generative Models

The central advantage of normalizing flows is exact likelihood. VAEs, and diffusion models as they are usually trained, work with a lower bound (ELBO) rather than the true log-likelihood. GANs define no explicit density at all. Flows evaluate the exact log-likelihood via the change-of-variables formula.

What exact likelihood buys. With exact density, flows can be used as density estimators for anomaly detection (flag points with low $\log p_X(x)$), for lossless compression (optimal codes require exact densities via arithmetic coding), and for model comparison (exact held-out log-likelihood). The ELBO from a VAE is a lower bound, so comparing ELBO values across models is unreliable; comparing exact log-likelihoods is clean.
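As a hedged illustration of two of these uses, the sketch below converts a log-likelihood in nats to bits per dimension and flags anomalies by thresholding the log-likelihood. The function `log_prob` is a stand-in (a standard Gaussian) for a trained flow's exact $\log p_X$, and the 1% quantile threshold is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

def log_prob(x):
    """Stand-in for a trained flow's exact log p_X(x): here a standard Gaussian."""
    return -0.5 * np.sum(x**2, axis=-1) - 0.5 * d * np.log(2.0 * np.pi)

def bits_per_dim(log_p, d):
    """Convert a log-likelihood in nats to bits per dimension."""
    return -log_p / (d * np.log(2.0))

train = rng.normal(size=(1000, d))
threshold = np.quantile(log_prob(train), 0.01)     # flag the lowest 1% as anomalous

typical = np.zeros(d)
outlier = np.full(d, 6.0)
is_anomaly = log_prob(np.stack([typical, outlier])) < threshold

print(is_anomaly, bits_per_dim(log_prob(typical), d))
```

Because the log-likelihood is exact, the same number serves both purposes: averaged over held-out data it gives a comparable bits-per-dimension score, and per point it gives an anomaly score.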

The bijection constraint. The cost of exactness is the constraint that $f$ must be a bijection - the input and output must have the same dimension, and the transformation must be invertible. This is a hard architectural constraint. You cannot use a flow to model a distribution that lives on a lower-dimensional manifold (such as images, which are believed to concentrate near a low-dimensional manifold embedded in pixel space) - the flow treats the full ambient dimension as the effective dimension. VAEs and GANs can use a low-dimensional latent space and map up to data space, naturally encoding a dimensionality reduction. Flows cannot.

In practice, images are pre-processed with a small amount of dequantization noise to ensure they have positive density in the continuous ambient space, partially mitigating this. But flows fundamentally cannot reduce dimensionality the way VAEs can.
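One common form of this pre-processing is uniform dequantization, sketched below. The text above does not say which noise distribution is used, so uniform noise on 8-bit pixels is an assumption here, and `dequantize` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(2, 4, 4), dtype=np.uint8)   # fake 8-bit images

def dequantize(pixels, rng):
    """Uniform dequantization: spread each integer level over a bin of width 1/256."""
    noise = rng.uniform(size=pixels.shape)           # u ~ U[0, 1)
    return (pixels.astype(np.float64) + noise) / 256.0

x = dequantize(pixels, rng)

# The original discrete values are recoverable by flooring.
recovered = np.floor(x * 256.0).astype(np.uint8)
print(x.min(), x.max())
```

The noise turns a distribution concentrated on a discrete grid into one with a density on $[0, 1)^d$, which is what a continuous flow can be fitted to.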

Applications

Density estimation. Flows were originally proposed for density estimation - fitting a flexible distribution to a dataset and evaluating likelihoods. Glow (Kingma and Dhariwal, 2018) applied flows to face images and showed competitive bits-per-dimension with other generative models while enabling high-quality synthesis and semantic face editing by interpolating in latent space.

Variational inference. Flows make variational posteriors more flexible. A normalizing flow posterior $q_\phi(z \mid x)$, obtained by pushing a sample $\epsilon \sim \mathcal{N}(0,I)$ through a flow $f_\phi(\cdot\,; x)$ conditioned on $x$, can represent complex, multi-modal, or non-Gaussian posteriors, reducing the gap between the ELBO and the true log-likelihood in a VAE. This is one of the main applications of IAF.

Sampling and generation. Flows generate samples by mapping $z \sim \mathcal{N}(0,I)$ through $f^{-1}$. For image generation, discrete flows (Glow) and continuous flows (FFJORD) produce competitive samples but are generally inferior to diffusion models at the same computational budget.

Molecule generation and scientific applications. Flows are well-suited to structured continuous data: molecular conformations, 3D protein structures, and lattice field theory configurations. The exact likelihood enables principled comparison of models and direct optimization of physical objectives.

Summary

Model	Likelihood	Sample quality	Training stability	Inference speed	Architecture constraint
VAE	Approximate (ELBO)	Moderate (blurry)	High	Fast (one pass)	Flexible, encoder-decoder
GAN	None	High (sharp)	Low (mode collapse)	Fast (one pass)	Flexible, generator-discriminator
Normalizing Flow	Exact	Moderate	High	Fast sampling for RealNVP/IAF; slow (sequential) for MAF	Bijection, same dim in/out
Diffusion	Approximate (ELBO via SDE)	Very high	High	Slow (many steps)	Flexible, U-Net typical

Concept	Definition
Change of variables	$\log p_X(x) = \log p_Z(f(x)) + \log \lvert \det J_f(x) \rvert$
Normalizing flow	Composition of invertible, tractable-Jacobian bijections
Coupling layer	$z_b = x_b \odot \exp(s(x_a)) + t(x_a)$; triangular Jacobian
Coupling layer Jacobian	Lower-triangular; $\log \lvert \det J \rvert = \sum_i s(x_a)_i$, $O(d)$
Autoregressive flow (MAF)	Each $z_i = h(x_i; \mu_i(x_{<i}), \sigma_i(x_{<i}))$; fast density, slow sampling
Inverse autoregressive flow (IAF)	Autoregressive on latent; fast sampling, slow density
Continuous normalizing flow	$dx/dt = v_\theta(x,t)$; log-density via $-\text{tr}(\partial v / \partial x)$
Instantaneous change of variables	$d \log p_t / dt = -\text{tr}(\partial v_\theta / \partial x)$
Flow matching	Regress $v_\theta$ to known conditional vector field; avoids ODE integration during training
Bijection constraint	$f$ must be invertible with same input/output dimension; cannot reduce dimensionality
Exact likelihood advantage	Enables anomaly detection, lossless compression, clean model comparison
