Suppose you want to build a spam filter. You have a collection of emails, each labeled spam or not-spam. There are two fundamentally different strategies for building a classifier. Strategy one: study each class of emails deeply. Learn what a spam email looks like - its vocabulary, its sentence structure, the kinds of links it contains. Then, when a new email arrives, ask: “does this email look more like the spam distribution or the not-spam distribution?” Strategy two: forget about what emails look like in general. Just learn a decision boundary - a rule that maps features directly to the label. You don’t need to know what a typical spam email looks like; you just need to know which features predict “spam.”

These two strategies correspond to two fundamentally different probabilistic formulations. The first strategy models the joint distribution $P(X, Y)$ - it learns what the data looks like for each class. At prediction time, it uses Bayes' theorem to invert this into a class probability. The second strategy models the conditional distribution $P(Y | X)$ directly - it learns to map from features to labels without ever modeling the input distribution. The first approach is generative; the second is discriminative.

This distinction is not just a modeling choice - it determines what assumptions you make, what data you can use, what you can do with the model at test time, and how quickly you converge to good performance as data grows. The choice between them has practical consequences that are easy to get wrong.


The Generative Approach

A generative model learns the joint distribution $P(X, Y)$, which factorizes as:

$$P(X, Y) = P(X | Y) \cdot P(Y)$$

$P(Y)$ is the prior over classes (e.g., 30% of emails are spam). $P(X | Y)$ is the class-conditional distribution: what do features look like given the class? At prediction time, invert via Bayes' theorem:

$$P(Y | X) = \frac{P(X | Y) \cdot P(Y)}{P(X)}$$

Since $P(X) = \sum_y P(X | Y=y) P(Y=y)$ does not depend on the class being scored, it is only a normalizing constant: classification amounts to computing $P(X | Y=y) \cdot P(Y=y)$ for each class $y$ and picking the largest.
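
As a small worked example of this rule, the sketch below plugs in made-up numbers: the 30% spam prior from the text, and two hypothetical likelihood values for a single email. The function name `posterior` and all values are illustrative assumptions, not part of any library.

```python
import numpy as np

def posterior(likelihoods, priors):
    """Return P(Y=y | X=x) for each class from P(x | Y=y) and P(Y=y)."""
    joint = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
    # Dividing by P(x) = sum of the joint terms rescales but never changes the argmax.
    return joint / joint.sum()

# Hypothetical numbers: 30% of emails are spam, and the observed email is
# five times more likely under the spam model than under the not-spam model.
priors = [0.3, 0.7]            # [spam, not-spam]
likelihoods = [0.010, 0.002]   # P(x | spam), P(x | not-spam)

post = posterior(likelihoods, priors)
predicted = int(np.argmax(post))  # 0 = spam, 1 = not-spam
print(post, predicted)
```

Even though spam is the less common class a priori, the likelihood ratio is large enough here to tip the posterior toward spam.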

Naive Bayes is the simplest generative classifier. It models $P(X | Y)$ with the strong assumption that features are conditionally independent given the class:

$$P(X | Y=y) = \prod_{j=1}^d P(X_j | Y=y)$$

For text classification, $X_j$ might be the count of word $j$, and $P(X_j | Y = \text{spam})$ is the distribution of that word count in spam emails. This assumption is almost always wrong - words are correlated - but Naive Bayes often works surprisingly well in practice because the classification decision depends only on comparing these products across classes, and the errors in the joint probability estimates can cancel out.
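
A minimal sketch of Naive Bayes on word counts follows. It assumes the multinomial variant with add-one (Laplace) smoothing, and works in log space so that products of many small probabilities do not underflow. The four-word vocabulary, the toy counts, and the function names are hypothetical.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log P(Y) and log P(word | Y) from a count matrix X (docs x vocab)."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Per-class word totals with add-alpha smoothing to avoid zero probabilities.
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_word = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_word

def predict_nb(X, classes, log_prior, log_word):
    """Pick the class maximizing log P(Y) + sum_j x_j * log P(word_j | Y)."""
    scores = X @ log_word.T + log_prior
    return classes[np.argmax(scores, axis=1)]

# Toy vocabulary (hypothetical): ["free", "winner", "meeting", "report"]
X_train = np.array([[3, 2, 0, 0],
                    [2, 1, 0, 1],
                    [0, 0, 2, 3],
                    [0, 1, 3, 1]])
y_train = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not-spam

model = fit_multinomial_nb(X_train, y_train)
X_new = np.array([[2, 1, 0, 0],   # "free free winner"
                  [0, 0, 1, 2]])  # "meeting report report"
preds = predict_nb(X_new, *model)
print(preds)
```

The multinomial coefficient is omitted from the score because it is identical for every class and so cannot change which class wins.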

Gaussian Discriminant Analysis (GDA) models $P(X | Y=y) = \mathcal{N}(\mu_y, \Sigma)$ - each class has a Gaussian feature distribution with its own mean $\mu_y$ and a shared covariance $\Sigma$. Under this model, the decision boundary between two classes is linear (this is linear discriminant analysis, LDA). If each class has its own covariance $\Sigma_y$, the boundary becomes quadratic (QDA).
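
To make the linear boundary concrete, here is a sketch for the two-class, shared-covariance case. Expanding the log-odds of the two Gaussians gives $w = \Sigma^{-1}(\mu_1 - \mu_0)$ and $b = -\tfrac{1}{2}\mu_1^\top \Sigma^{-1} \mu_1 + \tfrac{1}{2}\mu_0^\top \Sigma^{-1} \mu_0 + \log\frac{P(Y=1)}{P(Y=0)}$. The synthetic data and the function name are assumptions made for illustration.

```python
import numpy as np

def fit_gda_shared_cov(X, y):
    """Maximum-likelihood fit of two-class GDA with one shared covariance."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    pi1 = y.mean()
    # Pooled covariance: center each class on its own mean, then average.
    centered = np.vstack([X0 - mu0, X1 - mu1])
    sigma = centered.T @ centered / len(X)
    # Log-odds log P(Y=1|x) - log P(Y=0|x) simplifies to w @ x + b.
    w = np.linalg.solve(sigma, mu1 - mu0)
    b = (-0.5 * mu1 @ np.linalg.solve(sigma, mu1)
         + 0.5 * mu0 @ np.linalg.solve(sigma, mu0)
         + np.log(pi1 / (1 - pi1)))
    return w, b, (mu0, mu1, sigma, pi1)

# Synthetic two-class data (assumed for the example).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0.0, 0.0], size=(200, 2)),
               rng.normal(loc=[2.0, 2.0], size=(200, 2))])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

w, b, params = fit_gda_shared_cov(X, y)
accuracy = np.mean((X @ w + b > 0).astype(int) == y)
print(w, b, accuracy)
```

The quadratic terms in $x$ cancel only because both classes share $\Sigma$; with per-class covariances they remain, which is where the quadratic boundary of QDA comes from.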

Hidden Markov Models (HMMs) are generative models for sequences. The hidden state sequence follows a Markov chain; each observation is generated from the hidden state at that step. Given observations, inference inverts the generation process: the forward-backward algorithm (the E-step of Baum-Welch) computes the posterior over hidden states at each step, and the Viterbi algorithm finds the single most probable hidden state sequence. HMMs were the dominant approach in speech recognition for decades.
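
The Viterbi algorithm can be sketched in a few lines of dynamic programming. The version below works in log space and assumes discrete observations; the two-state model and its probabilities are hypothetical values chosen so the result is easy to check by hand.

```python
import numpy as np

def viterbi(obs, log_start, log_trans, log_emit):
    """Most probable hidden state path for a discrete-observation HMM."""
    n_states = log_start.shape[0]
    T = len(obs)
    score = np.empty((T, n_states))            # best log-prob of any path ending in each state
    back = np.zeros((T, n_states), dtype=int)  # argmax predecessor for each state
    score[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        # cand[i, j]: best path ending in state i at t-1, then moving to state j.
        cand = score[t - 1][:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(n_states)] + log_emit[:, obs[t]]
    # Trace the best final state back to the start.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical two-state model: states are "sticky" and each prefers one symbol.
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
emit = np.array([[0.9, 0.1],    # state 0 mostly emits symbol 0
                 [0.1, 0.9]])   # state 1 mostly emits symbol 1
obs = [0, 0, 1, 1, 1]

path = viterbi(obs, np.log(start), np.log(trans), np.log(emit))
print(path)
```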


The Discriminative Approach

A discriminative model directly models $P(Y | X)$ without ever representing $P(X)$ or $P(X, Y)$.

Logistic regression models $P(Y=1 | X)$ as:

$$P(Y=1 | X=x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}$$

It learns the weights $w$ directly from labels by maximizing the conditional log-likelihood $\sum_i \log P(Y_i | X_i)$. It never asks “what does a typical positive example look like?” It only learns the boundary.
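
A minimal sketch of this fitting procedure, using plain gradient ascent on the mean conditional log-likelihood. The learning rate, step count, and synthetic data are assumptions chosen for the example; a practical implementation would typically add regularization and a convergence check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_steps=2000):
    """Fit w, b by gradient ascent on the mean of log P(y_i | x_i)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)
        # Gradient of the mean conditional log-likelihood w.r.t. w and b.
        w += lr * X.T @ (y - p) / n
        b += lr * np.mean(y - p)
    return w, b

def mean_log_likelihood(X, y, w, b):
    p = sigmoid(X @ w + b)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic, overlapping two-class data (assumed for the example).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0.0, 0.0], size=(200, 2)),
               rng.normal(loc=[2.0, 2.0], size=(200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)])

w, b = fit_logistic(X, y)
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
print(w, b, accuracy)
```

Note that nothing in the code estimates class means, covariances, or any other property of the inputs: only the labels drive the updates.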

Support Vector Machines are discriminative by design - they find the maximum-margin boundary between classes (linear in the input features, or in a kernel-induced feature space). There is no probabilistic model of the input at all.

Neural networks model $P(Y | X)$ through a composition of nonlinear transformations. Like logistic regression, the parameters are fit by maximizing conditional likelihood. The model makes no commitment to any distributional form for $X$.

Conditional Random Fields (CRFs) are the discriminative counterpart to HMMs for sequence labeling. Rather than modeling the joint $P(X, Y)$ as HMMs do, CRFs model $P(Y | X)$ directly over entire label sequences. This allows the inclusion of arbitrary overlapping features of the input sequence, which generative models cannot easily accommodate.


Why Generative Models Use More Capacity

A generative model represents $P(X)$, the distribution over inputs, at least implicitly through $P(X | Y)$ and $P(Y)$. For the spam classifier, this means modeling what email text looks like in general - the vocabulary, the syntax, the typical lengths. Most of this structure has little to do with the classification task. Generative models spend representational capacity on the input distribution even though, at prediction time, only the contrast between $P(X | Y=1)$ and $P(X | Y=0)$ matters.

Discriminative models skip this. They model only the boundary, which is often a much simpler object than the full data distribution. The decision boundary between spam and not-spam can be a linear function of a few word frequencies; modeling the full distribution of email text requires capturing an enormous amount of unrelated structure.

This is sometimes called “wasted capacity”: generative models learn about the input distribution, much of which is irrelevant to classification. The argument for discriminative models is that if your goal is classification, you should model exactly what you need - $P(Y | X)$ - and nothing more.


Why Generative Models Can Win

If generative models are less efficient, why use them? Several reasons.

Limited labeled data with abundant unlabeled data. Discriminative models require labels to train. Generative models can use unlabeled data to learn $P(X)$, even without knowing the class. If you have 100 labeled emails and 100,000 unlabeled emails, a generative model can use all 100,100 emails during training (semi-supervised learning), while a discriminative model uses only the 100 labeled ones. When labels are expensive (medical imaging, legal documents), this is a decisive advantage.
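
One way this can work is expectation-maximization on a class-conditional model, where labeled points keep their known class and unlabeled points receive soft class assignments. The sketch below assumes a deliberately simple model (two classes, one-dimensional Gaussian features, a shared variance) and synthetic data; it illustrates the idea rather than any specific published method.

```python
import numpy as np

def semi_supervised_em(x_lab, y_lab, x_unl, n_iter=50):
    """EM for a two-class 1-D Gaussian model with a shared variance."""
    x_all = np.concatenate([x_lab, x_unl])
    r_lab = np.eye(2)[y_lab]                      # labeled points: fixed one-hot responsibilities
    # Initialize the class means from the few labeled points only.
    mu = np.array([x_lab[y_lab == 0].mean(), x_lab[y_lab == 1].mean()])
    var = np.var(x_all)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: soft class assignments for unlabeled points.
        log_p = np.log(pi) - 0.5 * (x_unl[:, None] - mu) ** 2 / var
        log_p -= log_p.max(axis=1, keepdims=True)
        r_unl = np.exp(log_p)
        r_unl /= r_unl.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from labeled and unlabeled points together.
        r = np.vstack([r_lab, r_unl])
        nk = r.sum(axis=0)
        pi = nk / nk.sum()
        mu = (r * x_all[:, None]).sum(axis=0) / nk
        var = (r * (x_all[:, None] - mu) ** 2).sum() / len(x_all)
    return pi, mu, var

# Synthetic data (assumed): 5 labeled and 1,000 unlabeled points per class.
rng = np.random.default_rng(0)
x_lab = np.concatenate([rng.normal(-2.0, 1.0, 5), rng.normal(2.0, 1.0, 5)])
y_lab = np.array([0] * 5 + [1] * 5)
x_unl = np.concatenate([rng.normal(-2.0, 1.0, 1000), rng.normal(2.0, 1.0, 1000)])

pi, mu, var = semi_supervised_em(x_lab, y_lab, x_unl)
print(pi, mu, var)
```

The labeled points pin down which mixture component corresponds to which class; the unlabeled points then sharpen the estimates of each component's parameters.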

Covariate shift. Suppose the input distribution $P(X)$ shifts between training and deployment (e.g., the mix of topics in incoming email changes over time), but the relationship $P(Y | X)$ stays the same. A discriminative model that fit $P(Y | X)$ well on the training distribution may still degrade, because its fit is only reliable in regions where training data was dense. A generative model, which carries an explicit model of the inputs, can in principle detect that the inputs have shifted and adapt to it.

Out-of-distribution detection. If an input $x$ is very unlikely under $P(X)$, a generative model can flag it as anomalous. A discriminative model will confidently assign it to some class - it has no concept of “this input doesn’t look like anything I was trained on.” This is important for safety-critical applications.
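
A minimal sketch of density-based flagging, assuming a single Gaussian is an adequate model of $P(X)$ and that "very unlikely" means falling below the 1st percentile of training log-densities. Both choices, along with the synthetic data, are illustrative assumptions.

```python
import numpy as np

def fit_gaussian(X):
    """Fit a single multivariate Gaussian as a crude model of P(X)."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_density(X, mu, sigma):
    """Log of the Gaussian density at each row of X."""
    d = X.shape[1]
    diff = X - mu
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))
mu, sigma = fit_gaussian(X_train)

# Assumed rule: flag anything less likely than 99% of the training inputs.
threshold = np.quantile(log_density(X_train, mu, sigma), 0.01)

x_typical = np.array([[0.1, -0.2]])
x_far = np.array([[8.0, 8.0]])
flags = [bool(log_density(x, mu, sigma)[0] < threshold) for x in (x_typical, x_far)]
print(flags)
```

In high-dimensional settings a raw density threshold can behave less intuitively than in this two-dimensional toy case, so the rule above should be read as a starting point.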

Generation. If you want to generate new examples - synthesize new images, augment your training data, explore the input manifold - you need a model of $P(X)$. Discriminative models cannot generate; generative models can sample.


The Ng-Jordan Analysis

A landmark 2002 paper by Andrew Ng and Michael Jordan compared Naive Bayes (generative) and logistic regression (discriminative) on the same classification problem - an important pair because, as they showed, these two models are actually two ways to fit the same functional form.

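Under the Naive Bayes assumption, and for common choices of class-conditional distribution (discrete features, or Gaussian features whose variances do not depend on the class), the posterior $P(Y | X)$ takes the form of logistic regression: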

$$\log \frac{P(Y=1|X)}{P(Y=0|X)} = w^\top x + b$$

So the two models make the same predictions if they agree on $w$ and $b$. But they estimate $w$ and $b$ differently. Naive Bayes estimates the parameters of $P(X | Y)$ and $P(Y)$ from the data, then derives $w$ and $b$ via Bayes' theorem. Logistic regression estimates $w$ and $b$ directly by maximizing conditional likelihood.
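
For binary features the correspondence can be checked numerically. Writing $\theta_{j,y} = P(X_j = 1 | Y = y)$, the Naive Bayes log-odds expand to $w_j = \log\frac{\theta_{j,1}(1-\theta_{j,0})}{\theta_{j,0}(1-\theta_{j,1})}$ and $b = \log\frac{P(Y=1)}{P(Y=0)} + \sum_j \log\frac{1-\theta_{j,1}}{1-\theta_{j,0}}$. The parameter values in this sketch are hypothetical.

```python
import itertools
import numpy as np

# Hypothetical Bernoulli Naive Bayes parameters for three binary features.
theta1 = np.array([0.8, 0.6, 0.1])  # P(X_j = 1 | Y = 1)
theta0 = np.array([0.2, 0.5, 0.4])  # P(X_j = 1 | Y = 0)
pi1 = 0.3                           # P(Y = 1)

# Weights and bias implied by those parameters.
w = np.log(theta1 * (1 - theta0) / (theta0 * (1 - theta1)))
b = np.log(pi1 / (1 - pi1)) + np.sum(np.log((1 - theta1) / (1 - theta0)))

def nb_log_odds(x):
    """Log-odds computed the generative way: likelihood times prior per class."""
    like1 = np.prod(theta1 ** x * (1 - theta1) ** (1 - x))
    like0 = np.prod(theta0 ** x * (1 - theta0) ** (1 - x))
    return np.log(like1 * pi1) - np.log(like0 * (1 - pi1))

# Compare the two routes on every possible binary input.
max_gap = max(abs(nb_log_odds(np.array(x)) - (w @ np.array(x) + b))
              for x in itertools.product([0, 1], repeat=3))
print(max_gap)
```

The gap is zero up to floating-point rounding, so any difference between the two classifiers comes from how $w$ and $b$ are estimated, not from the functional form.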

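The key finding: the generative classifier approaches its asymptotic error faster (with a number of training examples on the order of $\log d$ for $d$ features, versus on the order of $d$ for logistic regression), but the discriminative classifier converges to a lower, or at worst equal, asymptotic error.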

With very little data, the generative model’s structural assumption (Naive Bayes) acts as a strong prior that reduces variance. The discriminative model, estimating the boundary directly, has higher variance and does worse. As data grows, the discriminative model eventually learns the correct boundary without assuming conditional independence of features. The generative model’s incorrect assumption starts hurting it. There is a crossover: at some amount of data, discriminative takes the lead and eventually wins.

This crossover can happen early (a few dozen examples) or late (thousands), depending on how wrong the generative assumption is. The practical lesson: if you have very little data, consider generative models for their sample efficiency. With abundant data, discriminative models are generally better.


Modern Generative Models

Contemporary generative models - Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion models - are not primarily classifiers. They are trained to learn $P(X)$ (or $P(X | Y)$ for conditional generation) for synthesis and representation learning.

VAEs learn a latent variable model: $P(X) = \int P(X | Z) P(Z) dZ$. The encoder learns $Q(Z | X)$ and the decoder learns $P(X | Z)$. The training objective (the ELBO) balances reconstruction quality against the structure of the latent space.
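
For reference, the ELBO for a single input $x$ is usually written as the following lower bound on the log-likelihood, using the same $P$ and $Q$ as above:

$$\log P(x) \ge \mathbb{E}_{Q(Z | x)}\left[\log P(x | Z)\right] - \mathrm{KL}\left(Q(Z | x) \,\|\, P(Z)\right)$$

The first term rewards accurate reconstruction of $x$ from a sampled $Z$; the second penalizes encoders whose $Q(Z | x)$ drifts far from the prior $P(Z)$, which is what keeps the latent space structured.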

GANs learn $P(X)$ implicitly, through a game between a generator that produces samples and a discriminator that distinguishes real from fake. The generator never models $P(X)$ explicitly; it just learns to produce samples indistinguishable from the training data.

Diffusion models learn to reverse a noise-addition process: starting from Gaussian noise, they iteratively denoise to produce a sample from $P(X)$. They produce some of the highest-quality image and audio samples currently available.
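
The noise-addition half of this process is simple enough to sketch directly. The code below assumes a DDPM-style formulation with a linear variance schedule over 1,000 steps, in which a noised sample at step $t$ can be drawn in one shot from the clean input; the schedule values and the toy one-dimensional "dataset" are assumptions. The learned denoiser that reverses the process is not shown.

```python
import numpy as np

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * eps."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps  # eps is what a denoising network would be trained to predict

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # fraction of signal variance remaining at each step

rng = np.random.default_rng(0)
x0 = np.full(10_000, 3.0)               # toy "data": every point equals 3.0
x_early, _ = forward_noise(x0, 10, alpha_bar, rng)
x_late, _ = forward_noise(x0, 999, alpha_bar, rng)

print(x_early.mean(), x_early.std())    # still close to the data
print(x_late.mean(), x_late.std())      # close to a standard Gaussian
```

By the final step almost none of the original signal remains, which is why sampling can start from pure Gaussian noise.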

These models are used for data augmentation, representation learning, creative applications, and as components in larger systems (e.g., a generative model trained on unlabeled data can provide a feature representation for downstream classification). They blur the generative/discriminative line: a diffusion model used as a feature extractor for classification is being used discriminatively despite being trained generatively.


| Aspect | Generative | Discriminative |
|---|---|---|
| What is modeled | $P(X, Y) = P(X \mid Y)P(Y)$ | $P(Y \mid X)$ directly |
| Examples | Naive Bayes, GDA, HMM, VAE, GAN | Logistic regression, SVM, NN, CRF |
| Uses unlabeled data | Yes | No (standard) |
| Can generate samples | Yes | No |
| Asymptotic error (abundant data) | Higher (model misspecification) | Lower |
| Small-data performance | Often better (strong inductive bias) | Often worse (high variance) |
| Out-of-distribution detection | Yes (via $P(X)$) | No |
