Consider the problem of learning to draw. A student artist makes a sketch and submits it to a critic who says “fake” or “real.” The student does not know the rules of good drawing in advance - no one has written them down - but the critic’s feedback is enough. Over thousands of iterations the student’s sketches improve, the critic learns to spot subtler flaws, and eventually the student produces work indistinguishable from genuine art. Neither party is given a loss function describing what “good art” looks like. They only play against each other.

Generative Adversarial Networks, introduced by Goodfellow et al. in 2014, formalize this intuition precisely. Two neural networks compete in a minimax game. The generator $G$ takes random noise as input and produces synthetic data. The discriminator $D$ takes data as input and outputs a probability that the data is real rather than generated. The generator wants to fool the discriminator; the discriminator wants to catch the generator. The remarkable claim is that this adversarial pressure, sustained long enough, drives the generator to learn the true data distribution - not because anyone told it what the data distribution is, but because that is the only strategy that defeats a perfect discriminator.

The framework is philosophically different from maximum likelihood estimation. MLE asks the model to assign high probability to training data. GANs ask the model to produce samples that cannot be distinguished from training data. These are not the same objective, and the distinction matters. MLE requires a tractable density; GANs do not. This makes GANs applicable to high-dimensional continuous distributions - images, audio, video - where explicit density estimation is computationally intractable.


The Setup

The GAN framework involves three objects: a data distribution and two parameterized networks.

Let $p_{\text{data}}$ denote the true data distribution over some space $\mathcal{X}$ (e.g., the space of natural images $\mathbb{R}^{d}$). We have access to samples from $p_{\text{data}}$ but not to the density itself. The goal is to construct a distribution $p_G$ that matches $p_{\text{data}}$ so closely that samples from $p_G$ are indistinguishable from samples from $p_{\text{data}}$.

The generator $G_\theta: \mathcal{Z} \to \mathcal{X}$ is a neural network mapping from a latent space $\mathcal{Z}$ to the data space $\mathcal{X}$. The latent distribution $p_z$ is fixed and simple - typically the standard Gaussian $\mathcal{N}(0, I)$ or the uniform distribution on $[0,1]^k$. The generator implicitly defines a distribution $p_G$ as the pushforward of $p_z$ through $G$: if $z \sim p_z$, then $G(z) \sim p_G$. Crucially, $G$ never needs to compute $p_G$ explicitly - it only needs to produce samples.
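The pushforward idea can be illustrated with a toy example. The sketch below assumes a hypothetical one-dimensional "generator" that is just an affine map (the names `toy_generator`, `shift`, and `scale` are illustrative, not from the text); it never evaluates a density, yet its samples follow a well-defined distribution.

```python
import random
import statistics


def toy_generator(z, shift=3.0, scale=2.0):
    """Hypothetical stand-in for G_theta: an affine map of the latent sample."""
    return shift + scale * z


def sample_p_g(n, seed=0):
    """Draw n samples from p_G by pushing z ~ N(0, 1) through the generator."""
    rng = random.Random(seed)
    return [toy_generator(rng.gauss(0.0, 1.0)) for _ in range(n)]


samples = sample_p_g(10_000)
# For this affine map p_G is N(3, 4), so the estimates should be near 3 and 2.
print(statistics.mean(samples), statistics.stdev(samples))
```

A real generator replaces the affine map with a neural network, but the sampling recipe is the same: draw $z$, apply $G$.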

The discriminator $D_\phi: \mathcal{X} \to [0,1]$ is a neural network that acts as a binary classifier. On input $x$, it outputs $D(x)$, interpreted as the probability that $x$ was drawn from $p_{\text{data}}$ rather than from $p_G$. When $D(x) \approx 1$ the discriminator believes $x$ is real; when $D(x) \approx 0$ it believes $x$ is fake.

The latent space $\mathcal{Z}$ is low-dimensional relative to $\mathcal{X}$. For image generation, $\mathcal{X}$ might be $\mathbb{R}^{64 \times 64 \times 3}$ (12,288 dimensions) while $\mathcal{Z}$ might be $\mathbb{R}^{100}$. The generator must learn a mapping from this low-dimensional latent space into the high-dimensional data space, in effect learning to navigate the data manifold - the low-dimensional curved surface within $\mathcal{X}$ on which real data concentrates. Think of all possible human face images as a thin curved surface within the space of all possible pixel grids. The generator must learn to walk along that surface.


The Zero-Sum Game

The GAN objective is a minimax game:

$$\min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

To understand why each term appears, consider each player’s incentives separately.

The discriminator $D$ wants to maximize $V$:

  • The first term $\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]$ is maximized when $D(x) \approx 1$ on real data - the discriminator correctly identifies real samples.
  • The second term $\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$ is maximized when $D(G(z)) \approx 0$ on generated samples - the discriminator correctly identifies fakes.

The generator $G$ wants to minimize $V$. It has no influence over the first term (it does not produce real data), so it focuses on the second: it wants $D(G(z)) \approx 1$, meaning the discriminator classifies generated samples as real. Because $\log(1 - d)$ is strictly decreasing in $d$, minimizing $\mathbb{E}[\log(1 - D(G(z)))]$ pushes $D(G(z))$ toward 1 on generated samples.

This is a zero-sum game: the payoff to $G$ is exactly $-V(D,G)$ and the payoff to $D$ is $+V(D,G)$. Every gain for one player is a loss for the other. Game theory tells us that such games have Nash equilibria - pairs $(D^*, G^*)$ from which neither player benefits by unilaterally deviating. The GAN framework bets that gradient descent finds this equilibrium, and that the equilibrium corresponds to $p_G = p_{\text{data}}$.

Notice that the value function $V$ is, up to sign, the cross-entropy loss for a binary classifier. The first term is the expected log-likelihood of labeling real data as real; the second is the expected log-likelihood of labeling generated data as fake. Training the discriminator to maximize $V$ is exactly training a binary classifier by minimizing cross-entropy on a balanced dataset of real and generated samples. This connection makes implementation straightforward: discriminator training is standard supervised learning.
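The link between $V$ and binary cross-entropy can be checked numerically. The following is a minimal sketch on hand-picked discriminator outputs; the helper names (`value_estimate`, `bce`, `total_bce`) are illustrative and not part of any library.

```python
import math


def value_estimate(d_real, d_fake):
    """Minibatch estimate of V(D, G) from discriminator outputs in (0, 1)."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


def bce(pred, label):
    """Binary cross-entropy of one prediction against a 0/1 label."""
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))


def total_bce(d_real, d_fake):
    """Mean BCE on real samples (label 1) plus mean BCE on fakes (label 0)."""
    real = sum(bce(d, 1) for d in d_real) / len(d_real)
    fake = sum(bce(d, 0) for d in d_fake) / len(d_fake)
    return real + fake


d_real = [0.9, 0.8]   # D(x) on real samples
d_fake = [0.2, 0.1]   # D(G(z)) on generated samples
print(value_estimate(d_real, d_fake), -total_bce(d_real, d_fake))  # the two agree
```

Maximizing `value_estimate` and minimizing `total_bce` are therefore the same optimization problem for the discriminator.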


Optimal Discriminator

For a fixed generator $G$ (and hence fixed $p_G$), what is the optimal discriminator? We want the function $D: \mathcal{X} \to [0,1]$ that maximizes $V(D, G)$.

Key observation: $V$ decomposes into a sum over points in $\mathcal{X}$. At each individual point $x$, the discriminator chooses its output $D(x) \in (0,1)$ independently. So we can maximize $V$ by independently maximizing the integrand at each $x$ - a pointwise optimization.

Rewrite $V$ as an integral:

$$V(D, G) = \int_x \left[ p_{\text{data}}(x) \log D(x) + p_G(x) \log(1 - D(x)) \right] dx$$

For a fixed $x$, maximize $f(d) = p_{\text{data}}(x) \log d + p_G(x) \log(1 - d)$ over $d \in (0, 1)$:

$$\frac{df}{dd} = \frac{p_{\text{data}}(x)}{d} - \frac{p_G(x)}{1 - d} = 0$$

$$p_{\text{data}}(x)(1 - d) = p_G(x) \cdot d$$

$$p_{\text{data}}(x) = d \left( p_{\text{data}}(x) + p_G(x) \right)$$

$$\boxed{D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}}$$

What does this mean? This is the Bayes-optimal classifier for the binary task of distinguishing $p_{\text{data}}$ from $p_G$. Given a point $x$ drawn with equal probability from either distribution, the posterior probability that it came from $p_{\text{data}}$ is exactly $D^*(x)$.

The formula is geometrically clear:

  • Where $p_{\text{data}}(x) \gg p_G(x)$: the point is much more likely to be real, so $D^*(x) \approx 1$.
  • Where $p_G(x) \gg p_{\text{data}}(x)$: the point is likely generated, so $D^*(x) \approx 0$.
  • Where $p_{\text{data}}(x) = p_G(x)$: the discriminator is maximally uncertain, $D^*(x) = 1/2$.

When the generator perfectly matches the data distribution, $D^* = 1/2$ everywhere - it cannot do better than random guessing. This is the generator “winning”: a perfect discriminator is indistinguishable from a coin flip.
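To make the closed form concrete, here is a small sketch that assumes two unit-variance Gaussians in one dimension (data centered at 0, generator at 2; both choices are arbitrary). It evaluates $D^*$ and also confirms the pointwise maximization by brute-force grid search.

```python
import math


def gaussian_pdf(x, mean, std=1.0):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def optimal_discriminator(x, data_mean=0.0, gen_mean=2.0):
    """D*(x) = p_data(x) / (p_data(x) + p_G(x)) for two unit-variance Gaussians."""
    p_data = gaussian_pdf(x, data_mean)
    p_g = gaussian_pdf(x, gen_mean)
    return p_data / (p_data + p_g)


def pointwise_argmax(p_data, p_g, grid=9999):
    """Grid search for the d in (0, 1) maximizing p_data*log(d) + p_g*log(1-d)."""
    candidates = [(i + 1) / (grid + 1) for i in range(grid)]
    return max(
        candidates,
        key=lambda d: p_data * math.log(d) + p_g * math.log(1.0 - d),
    )


for x in (-3.0, 1.0, 5.0):
    print(x, optimal_discriminator(x))          # near 1, exactly 0.5, near 0
print(pointwise_argmax(0.3, 0.1), 0.3 / (0.3 + 0.1))  # grid optimum vs formula
```

At the midpoint $x = 1$ the two densities are equal, so the optimal discriminator is maximally uncertain there.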


Optimal Generator and the Nash Equilibrium

Now substitute $D^*$ back into $V$ to find the objective the generator effectively faces. This tells us what the generator is actually optimizing when the discriminator plays optimally.

$$C(G) = V(D^*, G) = \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)} \right] + \mathbb{E}_{x \sim p_G} \left[ \log \frac{p_G(x)}{p_{\text{data}}(x) + p_G(x)} \right]$$

To identify what this expression equals, we rewrite it by inserting a factor of 2 in the denominators (and subtracting $\log 2$ twice to compensate):

$$C(G) = \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log \frac{p_{\text{data}}(x)}{\frac{p_{\text{data}}(x) + p_G(x)}{2}} \right] + \mathbb{E}_{x \sim p_G} \left[ \log \frac{p_G(x)}{\frac{p_{\text{data}}(x) + p_G(x)}{2}} \right] - \log 4$$

The first expectation is $\text{KL}\!\left(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_G}{2}\right)$ and the second is $\text{KL}\!\left(p_G \,\Big\|\, \frac{p_{\text{data}} + p_G}{2}\right)$, where $\text{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}[\log P(x) - \log Q(x)]$ is the Kullback-Leibler divergence.

Recall the Jensen-Shannon divergence between two distributions $P$ and $Q$:

$$\text{JSD}(P \,\|\, Q) = \frac{1}{2} \text{KL}\!\left(P \,\Big\|\, \frac{P+Q}{2}\right) + \frac{1}{2} \text{KL}\!\left(Q \,\Big\|\, \frac{P+Q}{2}\right)$$

Therefore:

$$\boxed{C(G) = 2 \cdot \text{JSD}(p_{\text{data}} \,\|\, p_G) - \log 4}$$

This is the central theoretical result. When the discriminator plays optimally, minimizing the generator’s objective is equivalent to minimizing the Jensen-Shannon divergence between the true data distribution and the generated distribution.

Since $\text{JSD}(P \,\|\, Q) \geq 0$ always, we have $C(G) \geq -\log 4$, with equality if and only if $p_G = p_{\text{data}}$. The Nash equilibrium is therefore:

  • The generator produces the true data distribution: $p_G = p_{\text{data}}$
  • The discriminator is uniformly $1/2$: $D^* = 1/2$ everywhere
  • The minimax value is $V^* = -\log 4$
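The identity $C(G) = 2 \cdot \text{JSD} - \log 4$ is easy to verify on small discrete distributions. The sketch below uses hand-picked probability vectors as stand-ins for $p_{\text{data}}$ and $p_G$; the function names are illustrative.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions, with 0 * log(0 / q) taken as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence via KL to the equal-weight mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def c_of_g(p_data, p_g):
    """V(D*, G) computed directly with D*(x) = p_data / (p_data + p_g)."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total


p_data = [0.5, 0.3, 0.2]
p_g = [0.1, 0.6, 0.3]
print(c_of_g(p_data, p_g), 2 * jsd(p_data, p_g) - math.log(4))  # equal
print(c_of_g(p_data, p_data), -math.log(4))                     # minimum value
```

When the two vectors coincide the value drops to $-\log 4$, matching the equilibrium listed above.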

Why JSD and not KL? KL divergence is asymmetric, and $\text{KL}(p_{\text{data}} \,\|\, p_G)$ is infinite when $p_G$ assigns zero probability to regions where $p_{\text{data}}$ is positive (mode dropping). JSD is symmetric and always finite (it is bounded above by $\log 2$), because it measures KL to the mixture $\frac{p_{\text{data}} + p_G}{2}$, which is positive wherever either component is. More importantly: the discriminator’s job is to estimate $p_{\text{data}}(x) / (p_{\text{data}}(x) + p_G(x))$, which is exactly the ratio needed to compute the JSD - without ever computing the individual densities explicitly. The discriminator converts an intractable density estimation problem into a tractable classification problem.

Why can’t we just minimize KL directly? $\text{KL}(p_G \,\|\, p_{\text{data}})$ (mode-seeking) penalizes heavily when $p_G$ puts mass where $p_{\text{data}}$ has none - leading to sharp but incomplete coverage. $\text{KL}(p_{\text{data}} \,\|\, p_G)$ (mode-covering) penalizes missing modes of $p_{\text{data}}$. Evaluating either one requires explicit densities: the reverse KL needs $p_{\text{data}}(x)$, which is unknown, and the forward KL needs $p_G(x)$, which an implicit generator does not provide. JSD, by contrast, is estimated through a classifier: a tractable stand-in for an intractable density problem.


Training Procedure

The theory requires simultaneous minimax optimization. In practice, this is approximated by alternating gradient updates:

For each training iteration:

  1. Sample a minibatch of $m$ noise vectors $z^{(1)}, \ldots, z^{(m)}$ from $p_z$.
  2. Sample a minibatch of $m$ real data points $x^{(1)}, \ldots, x^{(m)}$ from $p_{\text{data}}$.
  3. Update the discriminator by ascending its stochastic gradient:

$$\nabla_\phi \frac{1}{m} \sum_{i=1}^m \left[ \log D_\phi(x^{(i)}) + \log(1 - D_\phi(G_\theta(z^{(i)}))) \right]$$

  4. Sample a fresh minibatch of noise $z^{(1)}, \ldots, z^{(m)}$.
  5. Update the generator by descending its stochastic gradient:

$$\nabla_\theta \frac{1}{m} \sum_{i=1}^m \log(1 - D_\phi(G_\theta(z^{(i)})))$$

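In the original paper, the discriminator update (steps 1 through 3) is performed $k$ times for each generator update (steps 4 and 5). The paper’s experiments use $k = 1$, which remains a common choice in practice.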
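The alternating procedure can be sketched end to end on a deliberately tiny problem. The example below is an illustration under strong assumptions, not a practical implementation: real data is $\mathcal{N}(3, 1)$, the generator is a hypothetical shift $G_\theta(z) = \theta + z$, the discriminator is logistic regression $D(x) = \sigma(wx + b)$, and gradients are written out by hand so no deep learning library is needed.

```python
import math
import random


def sigmoid(a):
    # Numerically stable logistic function.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def train_toy_gan(steps=200, m=64, lr=0.05, k=1, seed=0):
    """Alternating updates for G(z) = theta + z and D(x) = sigmoid(w * x + b)."""
    rng = random.Random(seed)
    theta, w, b = 0.0, 0.0, 0.0      # generator shift, discriminator weight/bias
    for _ in range(steps):
        for _ in range(k):
            # Sample noise and real minibatches (real data assumed N(3, 1)).
            fake = [theta + rng.gauss(0.0, 1.0) for _ in range(m)]
            real = [rng.gauss(3.0, 1.0) for _ in range(m)]
            # Ascend the discriminator's gradient of the minibatch value.
            # With logit a: d/da log(sigmoid(a)) = 1 - sigmoid(a),
            #               d/da log(1 - sigmoid(a)) = -sigmoid(a).
            grad_w = grad_b = 0.0
            for x in real:
                g = 1.0 - sigmoid(w * x + b)
                grad_w += g * x
                grad_b += g
            for x in fake:
                g = -sigmoid(w * x + b)
                grad_w += g * x
                grad_b += g
            w += lr * grad_w / m
            b += lr * grad_b / m
        # Fresh noise, then descend the generator's (saturating) gradient:
        # d/dtheta log(1 - D(theta + z)) = -D(theta + z) * w.
        fake = [theta + rng.gauss(0.0, 1.0) for _ in range(m)]
        grad_theta = sum(-sigmoid(w * x + b) * w for x in fake) / m
        theta -= lr * grad_theta
    return theta, w, b


print(train_toy_gan(steps=5))    # theta and w both move in the positive direction
print(train_toy_gan(steps=200))  # longer run; exact values depend on the seed
```

Even in this toy, the generator only learns through the discriminator's weight `w`: it never sees the real data directly.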

The saturating gradient problem. There is a critical issue with the generator’s objective $\min_G \mathbb{E}[\log(1 - D(G(z)))]$. Consider what happens early in training. The generator is essentially random and produces garbage, so the discriminator easily identifies all generated samples as fake: $D(G(z)) \approx 0$. Plugging in, $\log(1 - D(G(z))) \approx \log(1) = 0$. Write $d = \sigma(a)$, where $a$ is the discriminator’s pre-sigmoid logit. The derivative with respect to $d$ is $\partial/\partial d [\log(1-d)] = -1/(1-d) \approx -1$, which is bounded, but the signal must pass back through the sigmoid, and $\partial/\partial a [\log(1-\sigma(a))] = -\sigma(a) = -d$, which vanishes as $d \to 0$. The sigmoid saturates, so the generator receives almost no gradient signal precisely when it needs guidance the most - an early-training death spiral.

Non-saturating loss. The fix is simple: instead of minimizing $\mathbb{E}[\log(1 - D(G(z)))]$, the generator maximizes $\mathbb{E}[\log D(G(z))]$. Near $d \approx 0$, the derivative of $\log d$ is $1/d$, which is large; in terms of the discriminator’s pre-sigmoid logit $a$ (with $d = \sigma(a)$), the gradient is $1 - d \approx 1$ instead of vanishing. This provides strong signal when the discriminator is winning. The two objectives share the same fixed point ($D^* = 1/2$, $p_G = p_{\text{data}}$) but have completely different gradient landscapes. In the original formulation, the gradient vanishes when the discriminator is winning; in the non-saturating version, the gradient is strongest when the discriminator is winning. This is the version used in most practical implementations of the standard GAN loss.
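The contrast between the two generator losses is easiest to see as a function of the discriminator's pre-sigmoid logit $a$, with $D(G(z)) = \sigma(a)$. The sketch below (function names are illustrative) evaluates both gradients at a few logits.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def saturating_grad(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a): vanishes when D(G(z)) is near 0."""
    return -sigmoid(a)


def non_saturating_grad(a):
    """d/da log(sigmoid(a)) = 1 - sigmoid(a): near 1 when D(G(z)) is near 0."""
    return 1.0 - sigmoid(a)


# a = -6 models a confident discriminator rejecting the fake sample.
for a in (-6.0, 0.0, 6.0):
    print(a, sigmoid(a), saturating_grad(a), non_saturating_grad(a))
```

At $a = -6$ the saturating gradient has magnitude below $0.01$ while the non-saturating gradient is above $0.99$, which is the difference the text describes.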


Training Instabilities

The elegant theory gives way to significant practical difficulties. Training a GAN requires navigating a minimax landscape that gradient descent was not designed for.

Mode Collapse

Mode collapse is the most common failure: the generator produces a narrow range of outputs that fool the discriminator, but ignores large portions of the true data distribution. A face GAN might generate hundreds of variants of the same person instead of the full diversity of human faces.

Why it happens. Suppose $G$ finds a region where $D$ currently outputs high values (classifies as real). $G$ has a strong incentive to concentrate all its probability mass in that region - placing all mass on one mode makes $\mathbb{E}_z[\log D(G(z))]$ large. Spreading mass across modes would trade probability on the discovered mode for probability on undiscovered modes, which $D$ may not yet reward. So $G$ collapses.

$D$ then adapts: it learns to detect this specific repeated output. $G$ then shifts to a different mode. The cycle continues. This is a coordination failure: $G$ and $D$ are updated alternately with finite batches, not simultaneously at equilibrium. The optimal discriminator $D^*$ for a given $G$ is only optimal for that specific $G$ - it tells $G$ nothing about how it will be evaluated after $D$ updates.

Vanishing Gradients

If the discriminator is trained to near-optimality before the generator has learned much, $D(G(z)) \approx 0$ for essentially all $z$. Even the non-saturating loss suffers: when $\log D(G(z))$ becomes very negative, floating-point representations lose precision. Training stalls.

Non-Convergence

Gradient descent on a minimax objective does not necessarily converge. The generator overshoots to beat the current discriminator; the discriminator updates to catch the new generator; neither settles. This oscillation is theoretically guaranteed to happen for some loss landscapes and learning rates. Standard minimization convergence guarantees do not apply to saddle-point problems.
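A standard toy illustration of this failure is the bilinear game $V(x, y) = xy$, where $x$ minimizes and $y$ maximizes. This game is a stand-in chosen for simplicity, not the GAN objective itself. Its unique equilibrium is $(0, 0)$, yet simultaneous gradient steps spiral away from it: each step multiplies the squared distance to the origin by $1 + \eta^2$.

```python
import math


def simultaneous_gda(x=1.0, y=0.0, lr=0.1, steps=100):
    """Simultaneous gradient descent-ascent on V(x, y) = x * y."""
    radii = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x                      # dV/dx = y, dV/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y    # x descends, y ascends
        radii.append(math.hypot(x, y))
    return radii


radii = simultaneous_gda()
print(radii[0], radii[-1])   # distance from the equilibrium grows every step
```

Shrinking the learning rate slows the divergence but does not remove it, since the growth factor $\sqrt{1 + \eta^2}$ exceeds 1 for every $\eta > 0$.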

Practical stabilization techniques:

  • Label smoothing: replace the real label “1” with 0.9 to prevent the discriminator from becoming overconfident.
  • Spectral normalization of discriminator weights: divides each weight matrix by its largest singular value, which bounds the discriminator’s Lipschitz constant and prevents it from becoming too powerful.
  • Dropout in the discriminator: adds noise that prevents overfitting to the current generator.
  • Separate learning rates: often $\eta_D < \eta_G$ to keep the discriminator from running too far ahead.
  • Minibatch discrimination: the discriminator looks at multiple samples at once, making it harder for the generator to collapse to a single output.
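As one concrete example from the list above, spectral normalization can be sketched with power iteration on a single weight matrix. This is a minimal illustration on a plain list-of-lists matrix; the helper names are hypothetical, and practical implementations typically run far fewer iterations per training step and reuse the vector between steps.

```python
import math


def matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def spectral_norm(weight, iters=50):
    """Estimate the largest singular value of `weight` by power iteration."""
    wt = transpose(weight)
    v = [1.0] * len(weight[0])
    for _ in range(iters):
        u = matvec(weight, v)            # apply W
        v = matvec(wt, u)                # apply W^T, so v <- (W^T W) v
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]        # keep v at unit length
    u = matvec(weight, v)
    return math.sqrt(sum(c * c for c in u))   # ||W v|| for the top direction


def spectrally_normalize(weight, iters=50):
    """Rescale the matrix so its largest singular value is approximately 1."""
    sigma = spectral_norm(weight, iters)
    return [[c / sigma for c in row] for row in weight]


W = [[3.0, 0.0], [0.0, 1.0]]
print(spectral_norm(W))                        # approximately 3
print(spectral_norm(spectrally_normalize(W)))  # approximately 1
```

A linear layer whose weight has spectral norm 1 is 1-Lipschitz in the Euclidean norm, which is why this rescaling limits how sharply the discriminator can vary.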

Wasserstein GAN

The instability of standard GAN training is intimately connected to the JSD as a distance measure. Early in training, $p_G$ and $p_{\text{data}}$ are almost certainly supported on non-overlapping regions of $\mathcal{X}$ - a random generator produces noise, not natural images. When supports are disjoint, the JSD equals the constant $\log 2$ regardless of how close the distributions are geometrically. The gradient of JSD with respect to generator parameters is exactly zero. The discriminator becomes a perfect classifier and provides no gradient to the generator at all. Training stalls.

To build intuition: the earth mover’s distance (also called Wasserstein-1) measures the minimum “work” needed to reshape one distribution into another, where work is mass times distance. Imagine $p_{\text{data}}$ and $p_G$ as two piles of sand. JSD either says “not the same pile” or “identical” with no gradation when supports are disjoint. The earth mover’s distance instead says: it would cost this much (in grams times meters) to move the sand from one pile to match the other. Even when the piles don’t overlap, the distance is finite and varies smoothly as the generator pile moves. This smooth variation gives useful gradients throughout training.

Formally, the Wasserstein-1 distance between distributions $P$ and $Q$ over metric space $(\mathcal{X}, d)$ is:

$$W_1(P, Q) = \inf_{\gamma \in \Pi(P,Q)} \mathbb{E}_{(x,y) \sim \gamma}[d(x, y)]$$

where $\Pi(P,Q)$ is the set of all joint distributions (couplings) $\gamma(x,y)$ with marginals $P$ and $Q$. The coupling specifies which “sand particle” from $P$ gets transported to which location in $Q$; the infimum is over all possible transportation plans.
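In one dimension with cost $d(x, y) = |x - y|$, the optimal coupling between two equal-size samples simply matches sorted points, which gives a short way to see the contrast with JSD. The sketch below compares the two for point masses at $0$ and $\theta$; the function names are illustrative.

```python
import math


def w1_empirical(xs, ys):
    """W1 between two equal-size 1-D samples: match sorted points pairwise."""
    if len(xs) != len(ys):
        raise ValueError("samples must have equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


def jsd_point_masses(theta):
    """JSD between a point mass at 0 and a point mass at theta."""
    return 0.0 if theta == 0 else math.log(2.0)


# As the generated pile slides toward the data pile, W1 shrinks smoothly
# while JSD stays at log 2 until the piles coincide exactly.
for theta in (2.0, 1.0, 0.5, 0.0):
    print(theta, w1_empirical([0.0] * 4, [theta] * 4), jsd_point_masses(theta))
```

The sorted-matching shortcut is specific to one dimension; in higher dimensions the infimum over couplings is a genuine optimization problem.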

This infimum over all couplings is computationally intractable. But by the Kantorovich-Rubinstein duality theorem, there is an equivalent formulation:

$$W_1(P, Q) = \sup_{\|f\|_L \leq 1} \left[ \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \right]$$

where the supremum is over all 1-Lipschitz functions $f: \mathcal{X} \to \mathbb{R}$ (functions that don’t change the output by more than 1 per unit change in input). The intuition: the best “witness function” $f$ that is 1-Lipschitz and maximally separates the two distributions is exactly measuring the earth mover’s cost.

This dual form is amenable to learning: instead of a discriminator outputting probabilities, WGAN uses a critic $f_w: \mathcal{X} \to \mathbb{R}$ (no final sigmoid, so output is an unbounded real number) trained to maximize $\mathbb{E}_{x \sim p_{\text{data}}}[f_w(x)] - \mathbb{E}_{z \sim p_z}[f_w(G_\theta(z))]$ subject to $f_w$ being 1-Lipschitz.

The WGAN objective is:

$$\min_G \max_{\|f\|_L \leq 1} \; \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{z \sim p_z}[f(G(z))]$$

Enforcing the Lipschitz constraint is the main technical challenge. The original WGAN paper proposes weight clipping: after each critic update, clip all weights to $[-c, c]$. Weight clipping works but constrains the critic toward very simple functions.

Gradient Penalty (WGAN-GP). A better enforcement is to penalize the gradient norm of the critic (Gulrajani et al., 2017). A differentiable 1-Lipschitz function has $\|\nabla f\|_2 \leq 1$ everywhere; the optimal critic has $\|\nabla f\|_2 = 1$ almost everywhere along the straight lines connecting coupled real and generated points. So we penalize deviations from norm 1 at points sampled along straight lines between real and generated samples:

$$\mathcal{L}_{\text{GP}} = \lambda \; \mathbb{E}_{\hat{x} \sim p_{\hat{x}}} \left[ \left( \|\nabla_{\hat{x}} f_w(\hat{x})\|_2 - 1 \right)^2 \right]$$

where $\hat{x} = \epsilon x + (1-\epsilon) G(z)$ with $\epsilon \sim \text{Uniform}[0,1]$.
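The penalty can be sketched without an autodiff library if the critic's gradient is available in closed form. The example below assumes a hypothetical linear critic $f(x) = w \cdot x$, whose gradient is $w$ everywhere; in practice the gradient with respect to $\hat{x}$ would come from automatic differentiation. The default `lam=10.0` is an illustrative choice.

```python
import math
import random


def gradient_penalty(critic_grad, real, fake, lam=10.0, seed=0):
    """Average of lam * (||grad f(x_hat)||_2 - 1)^2 over interpolated points."""
    rng = random.Random(seed)
    total = 0.0
    for x, g in zip(real, fake):
        eps = rng.random()   # eps drawn uniformly from [0, 1)
        x_hat = [eps * xi + (1.0 - eps) * gi for xi, gi in zip(x, g)]
        grad = critic_grad(x_hat)
        norm = math.sqrt(sum(c * c for c in grad))
        total += lam * (norm - 1.0) ** 2
    return total / len(real)


def linear_critic_grad(w):
    """Gradient of the hypothetical critic f(x) = w . x (constant in x)."""
    return lambda x_hat: list(w)


real = [[0.0, 0.0], [1.0, 1.0]]
fake = [[2.0, 2.0], [3.0, 3.0]]
print(gradient_penalty(linear_critic_grad([0.6, 0.8]), real, fake))  # unit-norm gradient
print(gradient_penalty(linear_critic_grad([3.0, 4.0]), real, fake))  # gradient norm 5
```

A critic whose gradient norm is exactly 1 incurs no penalty, while one with gradient norm 5 pays $\lambda (5 - 1)^2$.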

WGAN practical advantages:

  • The critic objective $\mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{z \sim p_z}[f(G(z))]$ estimates $W_1(p_{\text{data}}, p_G)$ (up to the quality of the Lipschitz approximation) and tends to track sample quality as training progresses - unlike the standard GAN discriminator loss, which oscillates and is hard to interpret.
  • Training is more stable, mode collapse is less frequent.
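  • Less sensitivity to architecture choices than the standard GAN. One caveat: with the gradient penalty, batch normalization in the critic is usually avoided, because the penalty is defined per sample; Gulrajani et al. suggest layer normalization instead.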
  • The cost: slower training (critics must be near-optimal before each generator update, so the original paper uses 5 critic updates per generator update) and sensitivity to the gradient penalty weight $\lambda$.

Conditional GAN

The standard GAN generates samples unconditionally - it learns $p_{\text{data}}(x)$ without control over what it generates. Conditional GAN (cGAN) extends this to learn the conditional distribution $p_{\text{data}}(x \mid y)$, where $y$ is a conditioning variable: a class label, a text description, a paired reference image, or any other signal.

Both networks are conditioned on $y$:

  • Generator: $G(z, y) \to x$ - given noise and condition, produce a sample consistent with $y$
  • Discriminator: $D(x, y) \to [0,1]$ - given a sample and condition, output probability that $x$ is real given label $y$

The objective becomes:

$$\min_G \max_D \; V(D, G) = \mathbb{E}_{(x,y) \sim p_{\text{data}}}[\log D(x, y)] + \mathbb{E}_{z \sim p_z, y \sim p_y}[\log(1 - D(G(z, y), y))]$$

The discriminator now assesses whether a sample is not only realistic but also consistent with the given condition. This prevents the generator from producing realistic samples of the wrong class. Conditioning is typically implemented by concatenating the label embedding (or other signal) to the input of both networks, or via feature-wise linear modulation (FiLM) layers that use the condition to shift and scale intermediate feature maps.
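Both conditioning mechanisms mentioned above are simple at the tensor level. The sketch below uses plain Python lists and illustrative function names; in a real model the FiLM parameters `gamma` and `beta` would be produced from the condition $y$ by a learned network, whereas here they are passed in directly.

```python
def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def concat_condition(z, label, num_classes):
    """Generator input under concatenation: latent vector, then one-hot label."""
    return list(z) + one_hot(label, num_classes)


def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature channel."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]


print(concat_condition([0.1, -0.2], label=2, num_classes=3))
print(film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0]))
```

Concatenation injects the condition once at the input, while FiLM can re-apply it at every layer, which is one reason it is often preferred for deep generators.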

Pix2Pix (Isola et al., 2017) uses a conditional GAN where $y$ is an entire image: paired training examples $(y, x)$ where $y$ is a source image (a semantic label map, a daytime photo, a sketch) and $x$ is the corresponding target image (a photograph, a nighttime photo, a realistic rendering). Because the discriminator sees both the output and the conditioning image, it detects not just unrealistic outputs but outputs that fail to correspond to the input. The pix2pix objective also adds an L1 reconstruction term that keeps the output close to the target image; L1 is used in place of L2 because it produces less blurring.


Applications

Image synthesis. Progressive GAN, StyleGAN, and BigGAN produce photorealistic images at high resolution. StyleGAN2 generates faces indistinguishable from photographs in controlled studies. The key architectural innovation in StyleGAN is the style-based generator: instead of feeding the latent code $z$ directly into the network, it is mapped through a series of fully connected layers to a style code $w$ that controls affine transformations (scale and shift) of feature maps at every layer of the synthesis network. This disentangles high-level semantic attributes (identity, pose) from fine-grained details (texture, lighting) and enables meaningful interpolation in latent space.

Data augmentation. Generated samples can augment training sets, particularly where labeled data is scarce. Medical imaging is a canonical example: synthetic CT scans, histology slides, or retinal images improve classifier performance when real annotated examples are expensive.

Domain adaptation. CycleGAN addresses unpaired domain transfer: two generators $G_{A \to B}$ and $G_{B \to A}$ together with two discriminators $D_A$ and $D_B$, with a cycle-consistency loss $\|G_{B \to A}(G_{A \to B}(x)) - x\|_1$ that prevents degenerate mappings. This allows style transfer without paired training data.
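The cycle-consistency term is just an L1 distance between a sample and its round trip through both generators. A minimal sketch, using hypothetical toy "generators" that double and halve each coordinate:

```python
def l1_norm(vec):
    return sum(abs(c) for c in vec)


def cycle_consistency_loss(x, g_ab, g_ba):
    """L1 distance between x and its round trip G_BA(G_AB(x))."""
    reconstructed = g_ba(g_ab(x))
    return l1_norm([r - xi for r, xi in zip(reconstructed, x)])


def double(v):
    return [2.0 * c for c in v]


def halve(v):
    return [0.5 * c for c in v]


def identity(v):
    return list(v)


x = [1.0, -2.0]
print(cycle_consistency_loss(x, double, halve))     # exact inverse, zero loss
print(cycle_consistency_loss(x, double, identity))  # round trip misses x
```

The loss is zero only when the second generator undoes the first on the sample, which is what rules out mappings that ignore their input.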

Scientific applications. GANs have been applied to molecule generation for drug discovery (generators on graph-structured molecular representations), turbulence simulation (high-resolution fluid dynamics from coarse simulations), and anomaly detection (using the discriminator as an anomaly scorer after training on normal data only).


Summary

| Concept | Key formula or fact |
| --- | --- |
| GAN objective | $\min_G \max_D \; \mathbb{E}_{p_{\text{data}}}[\log D(x)] + \mathbb{E}_{p_z}[\log(1 - D(G(z)))]$ |
| Optimal discriminator | $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_G(x))$ |
| Generator objective under $D^*$ | $C(G) = 2 \cdot \text{JSD}(p_{\text{data}} \parallel p_G) - \log 4$ |
| Nash equilibrium | $p_G = p_{\text{data}}$, $V^* = -\log 4$, $D^* = 1/2$ everywhere |
| Non-saturating loss | Maximize $\mathbb{E}[\log D(G(z))]$ instead of minimizing $\mathbb{E}[\log(1-D(G(z)))]$; avoids vanishing gradient |
| Mode collapse | $p_G$ concentrates on a subset of the support of $p_{\text{data}}$; coordination failure between $G$ and $D$ |
| Wasserstein distance | $W_1(P,Q) = \inf_\gamma \mathbb{E}_\gamma[d(x,y)]$; finite and smooth even when supports are disjoint |
| Kantorovich-Rubinstein | $W_1(P,Q) = \sup_{\lVert f \rVert_L \leq 1} \mathbb{E}_P[f] - \mathbb{E}_Q[f]$; converts to critic training |
| Gradient penalty | Penalize $(\lVert \nabla_{\hat{x}} f(\hat{x}) \rVert_2 - 1)^2$ on interpolated samples; better than weight clipping |
| Conditional GAN | Condition $G(z,y)$ and $D(x,y)$ on label or image $y$; enables controllable generation |

GANs represent a fundamental shift in how generative models are trained - not by maximizing likelihood but by winning a game. The theory is clean and the practical results are striking, but the gap between them - mode collapse, training instability, oscillation - remains an active research area. Wasserstein GANs address the gradient signal problem by replacing JSD with a geometrically sensible distance. Diffusion models, which have largely superseded GANs for image synthesis, address the stability problem by replacing adversarial training with a fixed denoising objective. Understanding why GANs fail in specific ways is as important as understanding why they succeed.

