Score-Based Diffusion - Learning to Reverse Noise
Helpful context:
- Vector Calculus for ML - Derivatives in Every Direction at Once
- Variational Autoencoders - Encoding Structure Into the Latent Space
Suppose someone hands you a blurry, noise-corrupted photograph and asks you to restore it. You do not know the original image, but you have seen thousands of photographs before and you have a strong sense of what natural images look like. The most useful piece of information at any moment is not “what does the clean image look like” but “in which direction should I adjust the pixels to make this image more plausible?” That direction - the gradient of the log probability of the image - is the score function, and it is the central object in score-based diffusion.
The score function $s(x) = \nabla_x \log p(x)$ points from low-density regions toward high-density ones. If you follow it with small steps, you climb the probability landscape and eventually reach a region of high density - a plausible sample. The challenge is that we do not know $p(x)$ for natural images; we only have samples from it. Score matching, introduced by Hyvärinen in 2005, is a method to learn an approximation $s_\theta(x)$ of the score directly from data without ever computing the density. The key insight that turns this into a modern generative model is that you do not need to learn a single score function - you need one for each noise level, from pure Gaussian noise down to the clean data distribution. If you have this family of scores, you can start from noise and iteratively denoise your way to a real image.
Diffusion models formalize this as a stochastic process. The forward process gradually destroys a data point by adding Gaussian noise, transforming any image into pure noise over time. The reverse process, if you know the score at each noise level, undoes this destruction step by step. Modern diffusion models - DDPM, DDIM, score SDEs - are different ways of parameterizing and solving this reverse process. They produce state-of-the-art image quality without the training instability of GANs, and unlike VAEs they do not need a learned encoder to approximate a posterior. The price is sampling speed: generating a single image can require hundreds of sequential network evaluations.
The Score Function
For a distribution $p(x)$ over $\mathbb{R}^d$, the score function is:
$$s(x) = \nabla_x \log p(x).$$
This is a vector field on $\mathbb{R}^d$. At every point $x$, the score vector points in the direction of steepest ascent of the log density. Langevin dynamics turns the score into a sampler:
$$x_{t+1} = x_t + \frac{\epsilon}{2} \nabla_x \log p(x_t) + \sqrt{\epsilon} z_t, \qquad z_t \sim \mathcal{N}(0, I).$$
Start at any point, repeatedly add the score (gradient of log density) plus a small noise term, and in the limit $\epsilon \to 0$ and steps $\to \infty$, the iterates converge to samples from $p$. No density evaluation required - just the gradient.
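The update rule can be tried on a distribution whose score is known exactly. The following is a minimal sketch, assuming a toy one-dimensional Gaussian target; the target parameters, step size, and step count are illustrative choices and do not come from any particular paper.

```python
import numpy as np

def langevin_sample(score, x_init, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + (eps/2) * score(x) + sqrt(eps) * z."""
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score(x) + np.sqrt(step_size) * z
    return x

# Toy target (an assumption for illustration): p = N(mu, sigma^2).
mu, sigma = 3.0, 0.5

def gaussian_score(x):
    # Gradient of log N(x; mu, sigma^2) with respect to x.
    return -(x - mu) / sigma**2

rng = np.random.default_rng(0)
x_init = rng.uniform(-10.0, 10.0, size=5000)   # start far from the target
samples = langevin_sample(gaussian_score, x_init, step_size=1e-3, n_steps=3000, rng=rng)
print(samples.mean(), samples.std())           # should land near mu and sigma
```

Only `gaussian_score` touches the target distribution; the sampler itself never evaluates a density.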
The catch is that computing $\nabla_x \log p(x)$ exactly requires knowing $p(x)$, which we do not have. Score matching sidesteps this. The explicit score matching objective is:
$$J_{\text{ESM}}(\theta) = \mathbb{E}_{p(x)}\left[\tfrac{1}{2}\|s_\theta(x) - \nabla_x \log p(x)\|^2\right].$$
Minimizing this over $\theta$ fits a network $s_\theta$ to the true score. By integration by parts, the unknown $\nabla_x \log p(x)$ can be eliminated, giving an equivalent objective requiring only samples from $p$:
$$J_{\text{ISM}}(\theta) = \mathbb{E}_{p(x)}\left[\text{tr}(\nabla_x s_\theta(x)) + \frac{1}{2}\|s_\theta(x)\|^2\right].$$
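The trace term involves the Jacobian of the score network, that is, second derivatives of the modeled log density - expensive to compute at scale, since an exact trace needs roughly one backward pass per input dimension. Sliced score matching approximates this trace by projecting onto random directions, making it practical for high-dimensional data.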
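To see that the implicit objective really needs only samples, consider a model simple enough that the trace is available by hand. The sketch below assumes an isotropic Gaussian score model $s(x) = -(x - m)/v$, whose Jacobian is $-I/v$; the model family, the toy data, and the grid search are illustrative assumptions, not a practical training procedure.

```python
import numpy as np

def ism_loss_gaussian(x, m, v):
    """Implicit score matching loss for the model s(x) = -(x - m) / v.

    The Jacobian of s is -I/v, so its trace is -d/v; the true score never appears.
    """
    d = x.shape[1]
    trace_term = -d / v
    norm_term = 0.5 * np.sum((x - m) ** 2, axis=1) / v**2
    return np.mean(trace_term + norm_term)

rng = np.random.default_rng(0)
data = 1.5 + 0.7 * rng.standard_normal((20000, 2))   # toy data: N(1.5, 0.7^2 I)

# Grid search over the variance parameter, with the mean fixed at the sample mean.
m_hat = data.mean(axis=0)
v_grid = np.linspace(0.1, 2.0, 381)
losses = [ism_loss_gaussian(data, m_hat, v) for v in v_grid]
v_best = v_grid[int(np.argmin(losses))]
print(v_best)   # should be close to the data variance 0.7**2
```

Minimizing the loss recovers the data variance even though the code never evaluates $\nabla_x \log p(x)$.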
Why score functions fail near low-density regions. Score matching works well in regions of high probability, where there are many training points. In the low-density regions between modes - the gaps in the data distribution - there are almost no training samples, so the learned score is inaccurate there. Langevin dynamics starting from noise must cross these gaps to reach any mode, and it will get lost if the score in the gaps is wrong. This is the core problem that the noising schedule solves.
The Forward Process
The forward process progressively corrupts data $x_0 \sim p_{\text{data}}$ by adding Gaussian noise. In DDPM (Ho et al., 2020), this is a discrete Markov chain:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\; \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t I\right), \qquad t = 1, \ldots, T.$$
The noise schedule $\{\beta_t\}$ increases from small to large. Using the notation $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$, the marginal at time $t$ has a closed form:
$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\; \sqrt{\bar{\alpha}_t}\, x_0,\; (1 - \bar{\alpha}_t) I\right).$$
This means: to get $x_t$ from $x_0$, sample $\epsilon \sim \mathcal{N}(0, I)$ and set $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$. As $t \to T$, $\bar{\alpha}_t$ decays toward 0, so $x_T$ is approximately distributed as $\mathcal{N}(0, I)$ regardless of $x_0$. The data has been completely destroyed.
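The closed form can be checked against the step-by-step chain. This is a sketch under assumptions: the linear schedule from $10^{-4}$ to $0.02$ with $T = 1000$ is the one commonly associated with Ho et al. (2020), and the helper names `q_sample` and `q_chain` are made up for this example.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear schedule; an assumption here
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)            # alpha_bar[t-1] holds \bar{alpha}_t

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bar[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps

def q_chain(x0, t, rng):
    """Reach step t by applying q(x_s | x_{s-1}) one step at a time."""
    x = x0.copy()
    for s in range(t):
        x = np.sqrt(alphas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
x0 = np.full(20000, 2.0)                  # a point mass at 2.0 keeps the statistics easy to read
xt_direct, _ = q_sample(x0, 300, rng)
xt_chain = q_chain(x0, 300, rng)
print(xt_direct.mean(), xt_chain.mean(), np.sqrt(alpha_bar[299]) * 2.0)
print(xt_direct.std(), xt_chain.std(), np.sqrt(1.0 - alpha_bar[299]))
```

Both routes should agree in mean and standard deviation up to sampling error, which is why training can jump straight to any timestep without simulating the chain.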
The continuous-time version (Song et al., 2021) describes the same process as a stochastic differential equation:
$$dx = f(x, t) dt + g(t) dW,$$
where $dW$ is the increment of a Wiener process (Brownian motion), $f(x, t)$ is the drift coefficient, and $g(t)$ is the diffusion coefficient. For the variance-preserving SDE used in DDPM-like models:
$$dx = -\frac{\beta(t)}{2} x dt + \sqrt{\beta(t)} dW.$$
The drift $-\frac{\beta(t)}{2} x$ shrinks the signal; the diffusion $\sqrt{\beta(t)} dW$ adds noise. At time 0 the process starts at data; at time $T$ it has converged to $\mathcal{N}(0, I)$.
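A short Euler-Maruyama simulation makes the two roles visible. This is a sketch, assuming time runs over $[0, 1]$ (so $T = 1$) and a linear $\beta(t)$ from 0.1 to 20; both choices are illustrative.

```python
import numpy as np

def beta_fn(t, beta_min=0.1, beta_max=20.0):
    """Linear beta(t) on t in [0, 1]; the endpoints are illustrative assumptions."""
    return beta_min + t * (beta_max - beta_min)

def simulate_vp_forward(x0, n_steps, rng):
    """Euler-Maruyama for dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dW on [0, 1]."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        b = beta_fn(i * dt)
        x = x - 0.5 * b * x * dt + np.sqrt(b * dt) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
x0 = np.full(20000, 4.0)                  # every path starts at the same "data" value
x1 = simulate_vp_forward(x0, n_steps=1000, rng=rng)
print(x1.mean(), x1.std())                # should be near 0 and 1
```

The starting value 4.0 is almost entirely forgotten by the end of the simulation, while the injected noise brings the standard deviation to about 1.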
The Reverse Process
The key mathematical fact is Anderson’s theorem (1982): the time-reversal of the forward SDE is also an SDE,
$$dx = \left[f(x, t) - g(t)^2 \nabla_x \log p_t(x)\right] dt + g(t) d\bar{W},$$
where $d\bar{W}$ is a Wiener process running backward in time and $p_t(x)$ is the marginal density of the forward process at time $t$. The reverse SDE has the same structure as the forward, but the drift is modified by the score function $\nabla_x \log p_t(x)$.
This is the core insight: if you know the score at every noise level $t$, you can reverse the forward process. Start from $x_T \sim \mathcal{N}(0, I)$ and integrate the reverse SDE from $T$ to $0$. The resulting $x_0$ is a sample from $p_{\text{data}}$.
For the variance-preserving SDE, the reverse SDE is:
$$dx = \left[-\frac{\beta(t)}{2} x - \beta(t) \nabla_x \log p_t(x)\right] dt + \sqrt{\beta(t)} d\bar{W}.$$
The score $\nabla_x \log p_t(x)$ is unknown, so we approximate it with a neural network $s_\theta(x, t)$ trained by score matching at each noise level $t$.
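Before bringing in a network, the reverse SDE can be tested with an exact score. The sketch below assumes one-dimensional Gaussian data $\mathcal{N}(M, S^2)$, for which every marginal $p_t$ is Gaussian with a closed-form score, plus the same illustrative linear $\beta(t)$ on $[0, 1]$; a learned $s_\theta(x, t)$ would take the place of `true_score`.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0      # illustrative linear schedule on t in [0, 1]
M, S = 2.0, 0.5                     # toy data distribution N(M, S^2)

def beta_fn(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def marginal_params(t):
    """Mean and variance of p_t when the data are N(M, S^2) under the VP SDE."""
    a = np.exp(-0.5 * (BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2))
    return a * M, a**2 * S**2 + 1.0 - a**2

def true_score(x, t):
    mean, var = marginal_params(t)
    return -(x - mean) / var

def reverse_sde_sample(n_samples, n_steps, rng):
    """Integrate the reverse VP SDE from t = 1 down to t = 0 with Euler-Maruyama."""
    x = rng.standard_normal(n_samples)                  # x_T ~ N(0, 1)
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i * dt
        b = beta_fn(t)
        drift = -0.5 * b * x - b * true_score(x, t)     # f - g^2 * score
        # Stepping backward in time: subtract drift * dt, add fresh noise.
        x = x - drift * dt + np.sqrt(b * dt) * rng.standard_normal(n_samples)
    return x

rng = np.random.default_rng(0)
samples = reverse_sde_sample(20000, 1000, rng)
print(samples.mean(), samples.std())                    # should be near M and S
```

Starting from pure noise, the samples end up close to the data mean and standard deviation, using nothing but the score at each time.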
Denoising Score Matching
Explicit score matching requires the true score, which is intractable. Denoising score matching (Vincent, 2011) gives an equivalent training objective that only requires the noised data.
The key identity: for the forward process with transition kernel $q(x_t \mid x_0)$, the score of the marginal is related to the conditional score by:
$$\nabla_{x_t} \log p_t(x_t) = \mathbb{E}_{q(x_0 \mid x_t)}\left[\nabla_{x_t} \log q(x_t \mid x_0)\right].$$
For a Gaussian transition $q(x_t \mid x_0) = \mathcal{N}(x_t; \mu_t(x_0), \sigma_t^2 I)$, the conditional score is:
$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \mu_t(x_0)}{\sigma_t^2}.$$
Since $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$ with $\epsilon \sim \mathcal{N}(0,I)$, we have $x_t - \mu_t(x_0) = \sqrt{1-\bar{\alpha}_t} \epsilon$, so:
$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}.$$
The denoising score matching objective is then:
$$\mathcal{L}_{\text{DSM}}(\theta) = \mathbb{E}_{t, x_0, \epsilon}\left[\left\|s_\theta(x_t, t) + \frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}}\right\|^2\right].$$
In words: train the network to predict $-\epsilon / \sqrt{1-\bar{\alpha}_t}$ given the noised image $x_t$ and time $t$. This is the score of the noised distribution, not the clean data distribution - but that is exactly what the reverse SDE needs.
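The claim that regressing on the conditional score recovers the marginal score can be verified numerically. The sketch below assumes Gaussian data, where the marginal score is linear in $x_t$, so an ordinary least-squares fit stands in for the network; the data parameters and the single noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s = 2.0, 0.5                 # toy data distribution N(m, s^2), chosen for illustration
alpha_bar_t = 0.6               # one fixed noise level, also illustrative
n = 200000

x0 = m + s * rng.standard_normal(n)
eps = rng.standard_normal(n)
xt = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# DSM regression target: the conditional score -eps / sqrt(1 - alpha_bar_t).
target = -eps / np.sqrt(1.0 - alpha_bar_t)

# For Gaussian data the marginal score is linear in x_t, so s(x) = a*x + b suffices.
A = np.stack([xt, np.ones(n)], axis=1)
(a_fit, b_fit), *_ = np.linalg.lstsq(A, target, rcond=None)

# Closed-form marginal score of p_t = N(mean_t, var_t): s(x) = -(x - mean_t) / var_t.
var_t = alpha_bar_t * s**2 + 1.0 - alpha_bar_t
mean_t = np.sqrt(alpha_bar_t) * m
a_true, b_true = -1.0 / var_t, mean_t / var_t
print(a_fit, a_true)
print(b_fit, b_true)
```

The fitted slope and intercept should match the analytic marginal score, even though each individual regression target is a noisy quantity that depends on the particular $x_0$.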
DDPM: Predicting Noise is Predicting the Score
Ho et al. (2020) reparameterize the network. Instead of directly learning $s_\theta(x_t, t) \approx \nabla_{x_t} \log p_t(x_t)$, they train a noise prediction network $\epsilon_\theta(x_t, t)$ to predict the noise $\epsilon$ that was added to produce $x_t$ from $x_0$.
Since $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t) / \sqrt{1-\bar{\alpha}_t}$, these are equivalent. The DDPM training objective is:
$$\mathcal{L}_{\text{DDPM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[1,T],\, x_0 \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0,I)}\left[\|\epsilon - \epsilon_\theta(\underbrace{\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon}_{x_t}, t)\|^2\right].$$
This is a simple denoising loss: up to a per-timestep weight of $1/(1-\bar{\alpha}_t)$, it is the denoising score matching objective with the network rescaled to predict noise. For each training step: (1) sample a clean image $x_0$, (2) sample a timestep $t$ uniformly, (3) sample noise $\epsilon$, (4) form the noisy image $x_t$, (5) ask the network to predict $\epsilon$, (6) take a gradient step. There is no adversarial training, no posterior approximation, no likelihood term. The network $\epsilon_\theta$ is typically a U-Net with time conditioning via sinusoidal embeddings.
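Steps (1) through (5) fit in a few lines. The sketch below computes one Monte Carlo estimate of the loss for any callable `eps_model(x_t, t)`; that callable, the linear schedule, and the stand-in data are assumptions for illustration, and the gradient step (6) is left to whatever autodiff framework trains the real network.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # linear schedule; an illustrative choice
alpha_bar = np.cumprod(1.0 - betas)           # alpha_bar[t-1] holds \bar{alpha}_t

def ddpm_loss(eps_model, x0_batch, rng):
    """One Monte Carlo estimate of the DDPM loss for a batch of clean samples.

    eps_model(x_t, t) is any callable returning an array shaped like x_t;
    it stands in for the U-Net and is hypothetical here.
    """
    batch = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=batch)                    # (2) t ~ Uniform{1..T}
    eps = rng.standard_normal(x0_batch.shape)                 # (3) noise
    ab = alpha_bar[t - 1].reshape(batch, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(ab) * x0_batch + np.sqrt(1.0 - ab) * eps    # (4) noisy input
    pred = eps_model(x_t, t)                                  # (5) predicted noise
    sq_err = np.sum((eps - pred) ** 2, axis=tuple(range(1, x0_batch.ndim)))
    return np.mean(sq_err)

def zero_model(x_t, t):
    """Baseline that always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4096, 8))           # (1) stand-in "images" with 8 pixels each
print(ddpm_loss(zero_model, x0, rng))         # roughly the pixel count, since E||eps||^2 = d
```

A model that ignores its input pays about one unit of loss per pixel, which gives a reference point for judging whether a trained network has learned anything.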
The reverse process is: starting from $x_T \sim \mathcal{N}(0, I)$, iterate
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t)\right) + \sigma_t z_t, \qquad z_t \sim \mathcal{N}(0, I),$$
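for $t = T, T-1, \ldots, 1$, where $\sigma_t^2$ is a fixed per-step variance (Ho et al. use either $\beta_t$ or the posterior variance $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$). Each step uses the predicted noise to estimate the direction toward $x_0$, takes a step in that direction, and adds a small amount of noise to maintain stochasticity. After $T$ steps (typically $T = 1000$), the result is a sample from (approximately) $p_{\text{data}}$.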
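The sampling loop can be exercised without a trained network by substituting an exact noise predictor. The sketch below assumes Gaussian data $\mathcal{N}(M, S^2)$, for which $\mathbb{E}[\epsilon \mid x_t]$ has a closed form (`eps_oracle`), and picks $\sigma_t^2 = \beta_t$; the schedule and the skipped noise on the last step are common conventions assumed here.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)                # alpha_bar[t-1] holds \bar{alpha}_t
M, S = 2.0, 0.5                               # toy data distribution N(M, S^2)

def eps_oracle(x_t, t):
    """Exact E[eps | x_t] for Gaussian data; stands in for a trained eps_theta."""
    ab = alpha_bar[t - 1]
    var_t = ab * S**2 + 1.0 - ab
    return np.sqrt(1.0 - ab) * (x_t - np.sqrt(ab) * M) / var_t

def ddpm_sample(eps_model, n_samples, rng):
    """Ancestral sampling: iterate the reverse update for t = T, ..., 1."""
    x = rng.standard_normal(n_samples)                         # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        a_t, ab_t, b_t = alphas[t - 1], alpha_bar[t - 1], betas[t - 1]
        mean = (x - b_t / np.sqrt(1.0 - ab_t) * eps_model(x, t)) / np.sqrt(a_t)
        z = rng.standard_normal(n_samples) if t > 1 else 0.0   # no noise on the final step
        x = mean + np.sqrt(b_t) * z                            # sigma_t^2 = beta_t (assumed)
    return x

rng = np.random.default_rng(0)
samples = ddpm_sample(eps_oracle, 20000, rng)
print(samples.mean(), samples.std())                           # should be near M and S
```

With a real model, `eps_oracle` would be replaced by the trained network and `x` would carry image-shaped arrays; the loop itself stays the same.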
The ODE/SDE Duality: DDIM vs DDPM
DDPM sampling is stochastic: each step adds fresh Gaussian noise. Song et al. (2021) showed that every diffusion model has a deterministic counterpart - a probability flow ODE - that samples from the same distribution without any stochasticity:
$$\frac{dx}{dt} = f(x, t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x).$$
This ODE has the same marginals $p_t(x)$ as the forward SDE - meaning that trajectories of the ODE and the SDE visit the same distributions at each noise level $t$. Integrating this ODE backward from $x_T \sim \mathcal{N}(0, I)$ to $t = 0$ gives a sample from $p_{\text{data}}$.
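For Gaussian data the probability flow ODE has an exact solution, which gives a convenient check. The sketch below integrates the VP probability flow ODE backward with plain Euler steps, assuming data $\mathcal{N}(M, S^2)$, the closed-form score, and an illustrative linear $\beta(t)$ on $[0, 1]$.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0      # illustrative linear schedule on t in [0, 1]
M, S = 2.0, 0.5                     # toy data distribution N(M, S^2)

def beta_fn(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def marginal_params(t):
    """Mean and variance of p_t for Gaussian data under the VP SDE."""
    a = np.exp(-0.5 * (BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2))
    return a * M, a**2 * S**2 + 1.0 - a**2

def true_score(x, t):
    mean, var = marginal_params(t)
    return -(x - mean) / var

def probability_flow_sample(x_T, n_steps):
    """Euler integration of dx/dt = f - 0.5 * g^2 * score from t = 1 down to t = 0."""
    x = np.array(x_T, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i * dt
        b = beta_fn(t)
        velocity = -0.5 * b * x - 0.5 * b * true_score(x, t)
        x = x - velocity * dt                 # step backward in time; no noise is added
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(20000)
x_0 = probability_flow_sample(x_T, 1000)

# For Gaussian marginals the ODE rescales deviations from the mean by sqrt(var_0 / var_T).
mean_T, var_T = marginal_params(1.0)
x_0_exact = M + S * (x_T - mean_T) / np.sqrt(var_T)
print(x_0.std(), np.max(np.abs(x_0 - x_0_exact)))
```

Each noise value maps to exactly one output, and the numerical trajectory should track the closed-form map closely, which is the property that makes latent encoding possible.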
DDIM (Denoising Diffusion Implicit Models, Song et al., 2021) is a discrete-time version of the probability flow ODE applied to the DDPM model. The DDIM update is:
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\underbrace{\frac{x_t - \sqrt{1-\bar{\alpha}_t} \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}}_{\text{predicted } x_0} + \sqrt{1 - \bar{\alpha}_{t-1}} \epsilon_\theta(x_t, t).$$
Crucially, DDIM is deterministic: given $x_T$, the trajectory $x_T \to x_{T-1} \to \cdots \to x_0$ is fixed. This has three consequences:
- Latent encoding: The ODE can be run forward to encode a real image $x_0$ into a noise vector $x_T$, and backward to decode it. DDPM cannot do this because the forward path is stochastic.
- Faster sampling: DDIM can skip timesteps - instead of running all $T$ steps, take $S \ll T$ large steps along the ODE trajectory. In practice, 50 DDIM steps produce quality comparable to 1000 DDPM steps.
- No stochasticity: DDIM produces cleaner, more consistent outputs but cannot match the diversity of DDPM when $T$ is large. DDPM adds noise at each step, which can correct small errors; DDIM propagates errors deterministically.
The interpolation $\eta \in [0, 1]$ in DDIM controls the noise added at each step - $\eta = 0$ is the deterministic ODE, $\eta = 1$ recovers DDPM.
| Property | DDPM | DDIM |
|---|---|---|
| Sampling | Stochastic (SDE) | Deterministic (ODE) |
| Steps needed | $\sim 1000$ | $\sim 50$ |
| Latent encoding | No (forward is stochastic) | Yes (forward ODE is invertible) |
| Diversity | High | Lower with few steps |
| Error correction | Yes (noise at each step) | No (deterministic) |
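The DDIM update and its $\eta$ interpolation can be sketched on a strided timestep schedule. As assumptions: the noise predictor is the closed-form oracle for Gaussian data $\mathcal{N}(M, S^2)$ rather than a trained network, the schedule is linear, and the $\sigma$ expression for $\eta > 0$ follows the form given in the DDIM paper as best recalled here, so it should be checked against the paper before reuse.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
# alpha_bar[t] for t = 0..T, with alpha_bar[0] = 1 standing for clean data.
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])
M, S = 2.0, 0.5                               # toy data distribution N(M, S^2)

def eps_oracle(x_t, t):
    """Exact E[eps | x_t] for Gaussian data; stands in for a trained eps_theta."""
    ab = alpha_bar[t]
    var_t = ab * S**2 + 1.0 - ab
    return np.sqrt(1.0 - ab) * (x_t - np.sqrt(ab) * M) / var_t

def ddim_sample(eps_model, x_T, n_steps, eta=0.0, rng=None):
    """DDIM sampling on a subsequence of n_steps timesteps.

    eta = 0 gives the deterministic update; eta > 0 re-injects noise at each step.
    """
    if rng is None:
        rng = np.random.default_rng()
    ts = np.rint(np.linspace(T, 0, n_steps + 1)).astype(int)   # e.g. 1000, 980, ..., 0
    x = np.array(x_T, dtype=float)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        ab, ab_prev = alpha_bar[t], alpha_bar[t_prev]
        eps = eps_model(x, t)
        x0_pred = (x - np.sqrt(1.0 - ab) * eps) / np.sqrt(ab)
        sigma = eta * np.sqrt((1.0 - ab_prev) / (1.0 - ab)) * np.sqrt(1.0 - ab / ab_prev)
        noise = sigma * rng.standard_normal(x.shape) if eta > 0 else 0.0
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev - sigma**2) * eps + noise
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(5000)
run_a = ddim_sample(eps_oracle, x_T, n_steps=50)
run_b = ddim_sample(eps_oracle, x_T, n_steps=50)
print(np.array_equal(run_a, run_b), run_a.mean(), run_a.std())
```

Two runs from the same $x_T$ coincide exactly when `eta=0`. With only 50 steps the match to the data statistics is approximate rather than exact, which illustrates the discretization error that the deterministic sampler cannot correct.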
Why Diffusion Beats GANs
GANs and diffusion models are both learned samplers that transform a simple distribution into a complex one. The differences in their training objectives lead to stark differences in behavior.
Mode coverage. GAN training minimizes an adversarial objective that, with an optimal discriminator, reduces to minimizing the Jensen-Shannon divergence (JSD) between generated and real data. The generator can increase its score by concentrating probability mass on a few high-density modes - the discriminator is fooled more easily when the generator’s output is consistently in a small set of convincing examples. Diffusion training minimizes a denoising loss that is evaluated over the entire data distribution at every noise level. There is no incentive for mode collapse: the network must correctly denoise images from all modes to minimize the loss.
Training stability. GAN training is a minimax game between two networks with competing objectives. The loss landscape is non-stationary: as the generator improves, the optimal discriminator changes, which changes the gradient signal for the generator. This can lead to oscillation, vanishing gradients, and training instability. Diffusion training is a standard supervised regression problem - predict the noise $\epsilon$ from the noisy image $x_t$. The objective is fixed, the loss tends to improve smoothly as network capacity grows, and scaling behavior is comparatively predictable.
Sample quality and diversity. Modern diffusion models (DALL-E 2, Stable Diffusion, Imagen) produce samples with lower (better) FID scores than state-of-the-art GANs, while maintaining diversity over the full distribution. The stochastic sampling process in DDPM actively explores the space of plausible completions at each denoising step, naturally producing diverse outputs from the same model.
The cost of diffusion over GANs is inference speed. A GAN generates a sample in one forward pass; DDPM requires $T = 1000$ forward passes. DDIM reduces this to $\sim 50$ passes, and consistency models (Song et al., 2023) reduce it further, but diffusion is still substantially slower than GAN sampling.
Classifier-Free Guidance
Unconditional diffusion generates diverse samples from $p_{\text{data}}$. Conditional generation requires sampling from $p_{\text{data}}(x \mid c)$ for a condition $c$ (a class label, a text prompt, or any other signal). Guidance is a technique to sharpen conditional samples by trading diversity for fidelity to the condition.
Classifier guidance (Dhariwal and Nichol, 2021) uses a separately trained classifier $p_\phi(c \mid x_t)$ and modifies the score at each noise level:
$$\nabla_{x_t} \log p(x_t \mid c) = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_\phi(c \mid x_t).$$
The classifier gradient pushes the sample toward regions that look like class $c$. This works but requires training a separate classifier on noisy images at all timesteps.
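The score decomposition is Bayes' rule in gradient form, and it can be checked on a toy problem. The sketch below assumes a one-dimensional two-class Gaussian mixture with an exact classifier $p(c \mid x)$ and uses finite differences instead of autodiff; all numerical values are made up for illustration.

```python
import numpy as np

MUS = np.array([-2.0, 2.0])      # class means (toy values)
PRIOR = np.array([0.3, 0.7])     # class prior p(c)

def log_normal(x, mu):
    """Log density of N(mu, 1)."""
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2.0 * np.pi)

def log_marginal(x):
    """log p(x) for the two-component mixture."""
    comps = np.stack([np.log(PRIOR[c]) + log_normal(x, MUS[c]) for c in range(2)])
    return np.logaddexp(comps[0], comps[1])

def log_classifier(x, c):
    """log p(c | x) by Bayes' rule; plays the role of the noisy-image classifier."""
    return np.log(PRIOR[c]) + log_normal(x, MUS[c]) - log_marginal(x)

def grad(f, x, h=1e-5):
    """Central finite difference, used here to stay dependency-free."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = np.linspace(-4.0, 4.0, 9)
c = 1
guided = grad(log_marginal, x) + grad(lambda u: log_classifier(u, c), x)
conditional = -(x - MUS[c])          # exact score of p(x | c) = N(MUS[c], 1)
print(np.max(np.abs(guided - conditional)))
```

The unconditional score plus the classifier gradient reproduces the class-conditional score at every test point, up to finite-difference error.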
Classifier-free guidance (Ho and Salimans, 2022) avoids the separate classifier entirely. During training, the conditioning information $c$ is randomly dropped (replaced with a null token $\emptyset$) with probability $p_{\text{uncond}}$, so the network learns both $\epsilon_\theta(x_t, c)$ and $\epsilon_\theta(x_t, \emptyset)$ from a single model. At sampling time, the effective noise prediction is:
$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \emptyset) + w \cdot \left(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset)\right),$$
where $w \geq 1$ is the guidance scale. This moves the score in the direction that increases $p(c \mid x_t)$, the same direction as classifier guidance but estimated purely from the diffusion model itself.
When $w = 1$, the formula recovers $\epsilon_\theta(x_t, c)$ - standard conditional generation. When $w > 1$, the score is exaggerated in the direction of the condition: the sample is pushed toward the extreme of what satisfies condition $c$. The tradeoff: higher $w$ increases sharpness, coherence, and condition-alignment but decreases diversity and can produce oversaturated, unrealistic artifacts. In text-to-image models (Stable Diffusion, DALL-E 2), $w \in [5, 15]$ is typical.
Geometrically: $\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset)$ estimates the direction in noise space that makes the image more conditional-consistent. The guidance scale $w$ is the step length in that direction. Larger $w$ means larger steps - the sampler trades exploration (covering all plausible images satisfying $c$) for exploitation (maximizing one specific aspect of satisfying $c$).
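Both halves of classifier-free guidance are short in code: the conditioning dropout used in training and the extrapolation used in sampling. This is a sketch with made-up prediction vectors; the function names, the integer null token `-1`, and the dropout rate are hypothetical choices for the example.

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def drop_condition(cond, p_uncond, null_token, rng):
    """Training-time dropout: replace each label by null_token with probability p_uncond."""
    mask = rng.random(cond.shape[0]) < p_uncond
    return np.where(mask, null_token, cond)

rng = np.random.default_rng(0)
eps_c = np.array([0.2, -0.1, 0.4])      # made-up conditional prediction
eps_u = np.array([0.1, 0.0, 0.1])       # made-up unconditional prediction
for w in (1.0, 7.5):
    print(w, cfg_eps(eps_c, eps_u, w))  # w = 1 returns eps_c; larger w extrapolates past it

labels = rng.integers(0, 10, size=10000)
dropped = drop_condition(labels, p_uncond=0.1, null_token=-1, rng=rng)
print((dropped == -1).mean())           # fraction of labels replaced by the null token
```

In a sampler, `cfg_eps` would be called once per denoising step with two forward passes of the same network, one with the condition and one with the null token.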
Flow Matching and the Continuous SDE View
Song et al. (2021) unify DDPM, score matching, and related methods under a single SDE framework. The key objects are:
- A family of distributions $\{p_t\}_{t=0}^T$ interpolating from $p_{\text{data}}$ ($t=0$) to $\mathcal{N}(0,I)$ ($t=T$).
- A score network $s_\theta(x, t)$ approximating $\nabla_x \log p_t(x)$.
- Either a stochastic reverse SDE (DDPM-like) or a deterministic probability flow ODE (DDIM-like) for sampling.
Different choices of forward SDE give different models:
- Variance-Preserving (VP) SDE: the DDPM forward process. Keeps the marginal variance bounded, and exactly at 1 when the data has unit variance.
- Variance-Exploding (VE) SDE: leaves the clean signal unscaled and adds noise of increasing variance. Used in NCSN (Noise Conditional Score Network).
- Sub-VP SDE: a modification of VP whose marginal variance is bounded above by the VP variance at every time; Song et al. report that it does particularly well on likelihood.
Flow Matching (Lipman et al., 2022) simplifies the training objective further. Instead of learning a score function and solving an SDE, flow matching directly learns a vector field $v_\theta(x, t)$ that transports samples from $\mathcal{N}(0,I)$ to $p_{\text{data}}$ via an ODE $dx/dt = v_\theta(x,t)$. The training target is a simple regression against the conditional flow field, which has lower variance than score matching and allows straight-line trajectories (requiring fewer integration steps).
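The regression target is simple enough to write down directly. The sketch below assumes straight-line conditional paths $x_t = (1 - t)\,x_0 + t\,x_1$ with $x_0$ noise and $x_1$ data, which is one common choice and not the only one covered by flow matching; note that time here runs from noise at $t = 0$ to data at $t = 1$, the opposite of the diffusion convention used earlier. The function names and `v_model` are hypothetical.

```python
import numpy as np

def cfm_batch(x1, rng):
    """Build one conditional flow matching batch with straight-line paths.

    x1: data samples. Returns (x_t, t, v_target) where x_t = (1 - t) * x0 + t * x1,
    x0 ~ N(0, I), and the regression target is the constant velocity x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)
    t = rng.random((x1.shape[0],) + (1,) * (x1.ndim - 1))
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, t, x1 - x0

def cfm_loss(v_model, x1, rng):
    """Mean squared error between a velocity model and the conditional target."""
    x_t, t, v_target = cfm_batch(x1, rng)
    return np.mean(np.sum((v_model(x_t, t) - v_target) ** 2, axis=-1))

rng = np.random.default_rng(0)
data = 2.0 + 0.5 * rng.standard_normal((4096, 3))      # stand-in data batch
x_t, t, v_target = cfm_batch(data, rng)
# Following the target velocity for the remaining time (1 - t) lands exactly on the data point.
print(np.max(np.abs(x_t + (1.0 - t) * v_target - data)))
print(cfm_loss(lambda x, t: np.zeros_like(x), data, rng))
```

Because each conditional path is a straight line with constant velocity, a model that fits the target well can be integrated with few Euler steps.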
Summary
| Model | Process | Training objective | Sampling | Exact likelihood |
|---|---|---|---|---|
| DDPM | Discrete Markov chain | Predict noise $\epsilon$ at each $t$ | Stochastic, $\sim 1000$ steps | No (lower bound via ELBO) |
| DDIM | Probability flow ODE | Same as DDPM | Deterministic, $\sim 50$ steps | Yes (via ODE) |
| Score SDE (VP/VE) | Continuous SDE | Score matching at all $t$ | Stochastic or ODE | Yes (via probability flow ODE) |
| Flow Matching | ODE (straight paths) | Predict velocity field $v$ | Deterministic, $\sim 10$ steps | Yes |
| GAN | Implicit (one step) | Adversarial minimax | One forward pass | No |
| VAE | Latent variable | ELBO (reconstruction + KL) | One forward pass | Approximate (ELBO) |
| Concept | Definition |
|---|---|
| Score function | $s(x) = \nabla_x \log p(x)$; gradient of log density |
| Langevin dynamics | $x_{t+1} = x_t + \frac{\epsilon}{2} s(x_t) + \sqrt{\epsilon} z_t$; converges to samples from $p$ |
| Forward process | $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$, $\epsilon \sim \mathcal{N}(0,I)$ |
| Denoising score matching | Learn $\epsilon_\theta(x_t, t) \approx \epsilon$; equivalent to learning $\nabla_{x_t} \log p_t(x_t)$ |
| DDPM objective | $\mathbb{E}_{t, x_0, \epsilon}\lVert\epsilon - \epsilon_\theta(x_t, t)\rVert^2$ |
| Reverse SDE | $dx = [f - g^2 \nabla_x \log p_t] dt + g d\bar{W}$; reverses the forward noise |
| Probability flow ODE | $dx/dt = f - \frac{1}{2} g^2 \nabla_x \log p_t$; deterministic, same marginals as SDE |
| DDIM | Deterministic sampler derived from probability flow ODE; enables latent encoding |
| Classifier-free guidance | $\tilde{\epsilon} = \epsilon_\theta(\emptyset) + w(\epsilon_\theta(c) - \epsilon_\theta(\emptyset))$; sharpens condition alignment |
| Guidance scale $w$ | $w > 1$ sharpens samples toward condition at cost of diversity |
| Mode collapse comparison | DDPM: none (denoising loss over full distribution); GAN: common (adversarial incentive) |
| Training stability | DDPM: supervised regression, stable; GAN: minimax, unstable |