Bayesian Classifiers - Classification by Prior Belief and Evidence // Megha Bose

Helpful context:

Every classifier is implicitly making a claim about probability. When you draw a decision boundary through feature space, you are saying: on this side, the label is more likely to be 0; on that side, it is more likely to be 1. Bayesian decision theory makes this claim explicit and then asks: given that we know the true class probabilities at every point in feature space, what is the best possible classifier? The answer is the Bayes classifier, which assigns to each input the class with the highest posterior probability. This is not just a heuristic – it is provably optimal under the assumption that we are trying to minimize the probability of misclassification. Everything that follows is a consequence of this one idea: classify according to the posterior, and you cannot do better.

The difficulty is that the posterior $P(C=k \mid x)$ is not directly observable. In practice, we observe training data and must estimate the distributions that enter Bayes' rule. The generative approach models the class-conditional density $p(x \mid C=k)$ – the distribution of inputs within each class – and the class prior $P(C=k)$, then combines them via Bayes' rule to recover the posterior. This leads to Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) when the class-conditional densities are Gaussian, and to Naive Bayes when we impose a conditional independence structure on the features. The Naive Bayes classifier, despite its strong and often violated independence assumption, performs surprisingly well in practice because the decision boundary it produces is more robust to density estimation errors than the density estimates themselves.

Bayesian decision theory also provides the right framework for thinking about asymmetric costs. In most real problems, different types of errors carry different consequences: missing a cancer diagnosis is far worse than an unnecessary follow-up scan; flagging a legitimate transaction as fraudulent is far worse than missing a rare fraud case. The full theory replaces the simple goal of minimizing error rate with the goal of minimizing expected loss under an arbitrary loss matrix, and the resulting optimal classifier still takes the same form – but the decision thresholds shift in a principled way that reflects the asymmetry of costs. This generalization is what makes Bayesian decision theory a complete framework for classification, not merely an elegant one.

Bayesian Decision Theory

The setup is as follows. We have $K$ classes, a prior distribution $P(C=k)$ over classes, and a class-conditional density $p(x \mid C=k)$ that describes the distribution of the input $x$ given that the true class is $k$. By Bayes' rule, the posterior probability of class $k$ given observation $x$ is:

$$P(C=k \mid x) = \frac{p(x \mid C=k) P(C=k)}{p(x)},$$

where the marginal density $p(x) = \sum_{k=1}^K p(x \mid C=k) P(C=k)$ normalizes the posterior to sum to one over all classes.

The Bayes classifier assigns the class with the highest posterior probability:

$$k^* = \arg\max_k P(C=k \mid x) = \arg\max_k p(x \mid C=k) P(C=k).$$

The second equality holds because $p(x)$ is the same for all classes and does not affect the argmax. This rule minimizes the probability of misclassification for each individual $x$: at every point in feature space, it picks the class that is most likely to be correct given the evidence. Because it minimizes the pointwise error probability, it also minimizes the overall error rate, which is just the expected pointwise error over all $x$.

The Bayes error rate is the error rate of this optimal classifier:

$$\varepsilon^* = 1 - \mathbb{E}_x\left[\max_k P(C=k \mid x)\right].$$

This is the irreducible error – no classifier, no matter how complex or data-rich, can do better. To understand why it is irreducible, consider a point $x$ where $P(C=1 \mid x) = 0.6$ and $P(C=2 \mid x) = 0.4$. The Bayes classifier assigns class 1, and it is correct 60% of the time at this point. The remaining 40% is error that cannot be eliminated – it is genuine ambiguity in the data-generating process. The features $x$ simply do not carry enough information to distinguish the classes with certainty. This irreducible error reflects the overlap between class-conditional distributions: wherever $p(x \mid C=1)$ and $p(x \mid C=2)$ both have support, there will be some ambiguity, and the Bayes classifier absorbs this irreducible fraction of errors. Any real classifier – one that must estimate the posterior from finite data – will have an error rate at least as large as $\varepsilon^*$, and typically larger due to estimation error.

Discriminant Functions

The Bayes classifier is often reformulated in terms of discriminant functions. Define $g_k(x)$ to be any function of $x$ such that assigning $x$ to class $k^* = \arg\max_k g_k(x)$ gives the same classification as the Bayes rule. The simplest choice is $g_k(x) = P(C=k \mid x)$. But any monotone transformation of the posterior works equally well, because a monotone transformation preserves the argmax.

A particularly useful choice is the log-posterior:

$$g_k(x) = \log P(C=k \mid x) = \log p(x \mid C=k) + \log P(C=k) - \log p(x).$$

Since $\log p(x)$ does not depend on $k$, we can drop it and work with:

$$g_k(x) = \log p(x \mid C=k) + \log P(C=k).$$

The decision boundary between classes $i$ and $j$ is the set of points where both discriminant functions are equal:

$$\{x \mid g_i(x) = g_j(x)\} = \{x \mid \log p(x \mid C=i) - \log p(x \mid C=j) = \log P(C=j) - \log P(C=i)\}.$$

When the class-conditional densities are Gaussian, the geometry of this boundary depends critically on whether the covariances are shared across classes.

Linear Discriminant Analysis (LDA) assumes $p(x \mid C=k) = \mathcal{N}(x; \mu_k, \Sigma)$ with a common covariance matrix $\Sigma$ for all classes. The log-likelihood of class $k$ is:

$$\log p(x \mid C=k) = -\frac{1}{2}(x - \mu_k)^\top \Sigma^{-1}(x - \mu_k) + \text{const}.$$

Expanding the quadratic, the term $x^\top \Sigma^{-1} x$ is the same for all classes and drops out of the discriminant. What remains is linear in $x$:

$$g_k(x) = \mu_k^\top \Sigma^{-1} x - \frac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \log P(C=k).$$

The decision boundary between classes $i$ and $j$ is therefore a hyperplane in $x$. For two classes with equal priors, this hyperplane is the set of points equidistant (in the Mahalanobis sense) from both class means.

Quadratic Discriminant Analysis (QDA) allows each class its own covariance matrix $\Sigma_k$. The term $x^\top \Sigma_k^{-1} x$ now depends on $k$ and does not cancel, leaving a quadratic function of $x$ in the discriminant. The decision boundary becomes a quadric surface (ellipse, hyperbola, or parabola in 2D). QDA is more expressive but requires estimating $K$ separate covariance matrices, which demands more data.

Loss Functions, Conditional Risk, and Overall Risk

So far, we have assumed that all errors are equally bad – the 0-1 loss. In practice, errors have different costs. In a medical screening test for a serious disease, failing to diagnose a sick patient (false negative) can be catastrophic, while incorrectly flagging a healthy patient (false positive) leads to further testing but not immediate harm. The right framework is to specify a loss matrix $L$, where $L[i][j]$ is the cost incurred when we predict class $i$ and the true class is $j$.

Under 0-1 loss, $L[i][j] = 1 - \delta_{ij}$: correct predictions cost 0, and any error costs 1. But with an arbitrary loss matrix, the optimal decision changes.

The conditional risk of predicting class $i$ given observation $x$ is the expected loss under the posterior:

$$R(\alpha = i \mid x) = \sum_{j=1}^K L[i][j] P(C=j \mid x).$$

This is a weighted sum of losses, where the weights are the posterior probabilities of each true class. The minimum risk classifier chooses the action that minimizes conditional risk at each point:

$$\alpha^*(x) = \arg\min_i R(\alpha = i \mid x).$$

The overall Bayes risk is the expected conditional risk under the data distribution:

$$R^* = \mathbb{E}_x\left[\min_i R(\alpha = i \mid x)\right].$$

This generalizes the standard Bayes error rate: with 0-1 loss, $R^* = \varepsilon^*$. With asymmetric loss, the decision boundaries shift away from the equal-posterior surface and toward regions where the high-cost errors are less likely. Specifically, a high cost for false negatives (missing disease) pushes the decision boundary so that we classify as positive (diseased) even at lower posterior probabilities – accepting more false positives to avoid the catastrophic false negatives.

Decision Boundaries and Error Rates

For two classes, the error rate has a clean decomposition. Let $R_1$ and $R_2$ be the decision regions – the subsets of input space assigned to class 1 and class 2 respectively. The total probability of error is:

$$P(\text{error}) = \int_{R_2} p(x \mid C=1) P(C=1) dx + \int_{R_1} p(x \mid C=2) P(C=2) dx.$$

The first term is the probability of misclassifying a class-1 point as class 2, and the second term is the reverse. The Bayes classifier minimizes this by assigning each $x$ to the class with the larger joint probability $p(x \mid C=k) P(C=k)$, which is equivalent to maximizing the posterior.

For Gaussian classes with equal covariance $\Sigma$, the Bayes decision boundary has an explicit form. Setting $g_1(x) = g_2(x)$ gives the hyperplane:

$$(\mu_1 - \mu_2)^\top \Sigma^{-1} x = \frac{1}{2}(\mu_1 - \mu_2)^\top \Sigma^{-1}(\mu_1 + \mu_2) - \log\frac{P(C=1)}{P(C=2)}.$$

The left side is the Mahalanobis projection of $x$ onto the direction connecting the two class means. The right side is a threshold that adjusts for the class means, the geometry of $\Sigma$, and the prior ratio. When the priors are equal, the boundary passes through the midpoint between the two means (in Mahalanobis distance). When one class is more probable a priori, the boundary shifts toward the less probable class – we require more evidence to classify into the rare class.

The Mahalanobis distance from $x$ to class $k$ is $d_k(x) = (x - \mu_k)^\top \Sigma^{-1}(x - \mu_k)$, and LDA is equivalent to assigning $x$ to the class with the smallest Mahalanobis distance (corrected for the log-prior). This gives LDA a natural interpretation as a nearest-centroid classifier in the metric defined by $\Sigma^{-1}$.

Maximum Likelihood Estimation for Class-Conditional Densities

To use the Bayes classifier in practice, we need to estimate $p(x \mid C=k)$ and $P(C=k)$ from training data. The prior is estimated straightforwardly as the fraction of training points in each class: $\hat{P}(C=k) = n_k / n$, where $n_k$ is the number of training points in class $k$ and $n = \sum_k n_k$.

For the class-conditional density, we assume a parametric form and estimate the parameters by maximum likelihood (MLE) within each class. Under the Gaussian assumption $p(x \mid C=k) = \mathcal{N}(x; \mu_k, \Sigma_k)$, the MLE estimates are the class-specific sample mean and sample covariance:

$$\hat{\mu}k = \frac{1}{n_k} \sum{x \in C_k} x, \qquad \hat{\Sigma}k = \frac{1}{n_k} \sum{x \in C_k} (x - \hat{\mu}_k)(x - \hat{\mu}_k)^\top.$$

These are QDA estimates. For LDA, we additionally assume $\Sigma_k = \Sigma$ for all $k$, and pool the within-class covariances:

$$\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^K \sum_{x \in C_k} (x - \hat{\mu}_k)(x - \hat{\mu}_k)^\top.$$

The pooled covariance is the weighted average of the per-class covariances, with weights proportional to class sizes. LDA is more data-efficient than QDA: it estimates $O(d^2)$ parameters for a single shared $\Sigma$ rather than $K \cdot O(d^2)$ parameters for $K$ separate covariance matrices. When $n$ is small relative to $K d^2$, QDA tends to overfit, and LDA’s regularization through parameter sharing improves generalization.

MLE with a Gaussian assumption gives well-defined, closed-form estimates. But if we believe the class-conditional density has a different parametric form – a mixture of Gaussians, a log-normal, or something else – MLE still applies: we maximize the class-conditional log-likelihood within each class separately. The Bayesian classification framework is agnostic about the form of $p(x \mid C=k)$; what matters is that we estimate it well enough to recover the decision boundary accurately.

Naive Bayes Classifier

When the feature dimension $d$ is large, estimating the full joint density $p(x \mid C=k)$ is statistically and computationally costly. The number of parameters in a full Gaussian grows as $O(d^2)$ per class – far more than can be reliably estimated from typical training sets. The Naive Bayes classifier addresses this by making a strong structural assumption: the features are conditionally independent given the class label. That is:

$$p(x \mid C=k) = \prod_{j=1}^d p(x_j \mid C=k).$$

This decomposes the joint density into a product of $d$ univariate densities, reducing the estimation problem from one $d$-dimensional joint to $d$ independent one-dimensional problems. The posterior then becomes:

$$P(C=k \mid x) \propto P(C=k) \prod_{j=1}^d p(x_j \mid C=k),$$

and the Naive Bayes classifier assigns class $k^* = \arg\max_k \left[\log P(C=k) + \sum_{j=1}^d \log p(x_j \mid C=k)\right].$

The parametric form chosen for each $p(x_j \mid C=k)$ depends on the feature type:

Gaussian Naive Bayes assumes $p(x_j \mid C=k) = \mathcal{N}(x_j; \mu_{jk}, \sigma_{jk}^2)$. The MLE estimates are the per-feature, per-class sample mean and variance: $\hat{\mu}{jk} = \frac{1}{n_k}\sum{x \in C_k} x_j$ and $\hat{\sigma}{jk}^2 = \frac{1}{n_k}\sum{x \in C_k}(x_j - \hat{\mu}_{jk})^2$. This requires estimating $2Kd$ scalars rather than a $Kd \times d$ covariance matrix – a dramatic reduction.

Bernoulli Naive Bayes handles binary features. Each $p(x_j = 1 \mid C=k) = \theta_{jk}$ is estimated as the fraction of class-$k$ training points with $x_j = 1$. This is the natural model when each feature is an indicator variable.

Multinomial Naive Bayes is the standard for text classification with a bag-of-words representation. Here $x_j$ is the count of word $j$ in a document, and $p(\text{word}j \mid C=k) = \theta{jk}$ is estimated as the count of word $j$ in all class-$k$ documents divided by the total word count in class-$k$ documents. Classification scores documents by the log-probability of the word sequence under each class model, plus the log-prior.

Why does Naive Bayes work despite the independence assumption being violated? In practice, features are almost never conditionally independent given the class. In text, words co-occur in strong patterns; in images, adjacent pixels are highly correlated. Yet Naive Bayes often achieves competitive accuracy. The reason is that we ultimately only need the decision boundary to be correct, not the density estimates themselves. The Naive Bayes posterior $\tilde{P}(C=k \mid x) \propto P(C=k)\prod_j p(x_j \mid C=k)$ is a distorted version of the true posterior, but the argmax – the class it assigns – can agree with the true Bayes classifier even when the probability values are badly miscalibrated. The decision boundary is determined by the sign of the log-posterior ratio, and this can be correct even when the individual log-posteriors are off by a constant or a multiplicative factor. The independence assumption biases the density estimates, but it biases them in a correlated way that largely preserves the ranking of classes.

A useful way to see this: even if $\prod_j p(x_j \mid C=1)$ underestimates the true $p(x \mid C=1)$ because it ignores within-class correlations, it may underestimate $p(x \mid C=2)$ by a similar factor, leaving the ratio – and thus the decision – unchanged. Naive Bayes is biased but robust.

Bayesian Learning vs MLE

The Naive Bayes classifier and LDA/QDA both use MLE to estimate their parameters from training data. MLE finds the single parameter vector $\hat{\theta}$ that maximizes the likelihood of the training data:

$$\hat{\theta}{\text{MLE}} = \arg\max\theta p(\mathcal{D} \mid \theta).$$

This treats the parameters as fixed but unknown. Bayesian learning takes a different view: parameters are random variables with a prior distribution $p(\theta)$, and training data updates this prior to a posterior:

$$p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta) p(\theta).$$

The MAP estimate (maximum a posteriori) finds the mode of this posterior:

$$\hat{\theta}{\text{MAP}} = \arg\max\theta \left[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\right].$$

MAP is equivalent to MLE with a regularization term: the log-prior $\log p(\theta)$ penalizes parameter values that are unlikely under the prior. A Gaussian prior on $\theta$ with mean zero corresponds to $L_2$ regularization. A Laplace prior corresponds to $L_1$ regularization. MLE is the limiting case of MAP with a uniform prior.

Full Bayesian prediction goes further, averaging over all parameter values weighted by the posterior rather than committing to a single estimate:

$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta) p(\theta \mid \mathcal{D}) d\theta.$$

This predictive distribution is more calibrated than a point estimate – it incorporates parameter uncertainty. However, this integral is usually intractable for complex models, which is why MAP and MLE remain the workhorses in practice.

For categorical distributions – as in Multinomial Naive Bayes – the conjugate prior is the Dirichlet distribution. If we place a Dirichlet prior $\text{Dir}(\alpha_1, \ldots, \alpha_V)$ on the word probabilities $(\theta_{1k}, \ldots, \theta_{Vk})$ for class $k$, the posterior given the observed word counts $n_{jk}$ is again Dirichlet with updated parameters $\alpha_j + n_{jk}$. The MAP estimate adds $\alpha_j - 1$ pseudo-counts to each word. Setting $\alpha_j = 2$ for all $j$ (a uniform Dirichlet) gives Laplace smoothing: $\hat{\theta}{jk} = (n{jk} + 1)/(n_k + V)$. This prevents zero probabilities for words not seen in the training set of class $k$, which would otherwise make the Naive Bayes posterior zero for any document containing an unseen word – a severe failure mode without smoothing.

Example: Face Recognition with Eigenfaces

Face recognition is a canonical example of Bayesian classification applied to high-dimensional data. Each face image is a vector $x \in \mathbb{R}^d$, where $d$ is the number of pixels (e.g., $d = 100 \times 100 = 10{,}000$ for a modest grayscale image). Each class $k$ corresponds to the identity of a person.

Directly fitting a Gaussian $\mathcal{N}(\mu_k, \Sigma_k)$ in the full pixel space is infeasible: the covariance matrix has $d^2 = 10^8$ entries, far more than the number of training images available. The standard approach, Eigenfaces, first reduces dimensionality using PCA.

Let $U \in \mathbb{R}^{d \times m}$ be the matrix of the top $m$ principal components of the training data (the eigenfaces). Each image is projected to a low-dimensional code $z = U^\top (x - \bar{x}) \in \mathbb{R}^m$, where $\bar{x}$ is the mean face. In this $m$-dimensional PCA space, we fit a Gaussian per identity class: $p(z \mid C=k) = \mathcal{N}(z; \hat{\mu}_k, \hat{\Sigma})$. With LDA’s shared covariance assumption, the Bayes classifier reduces to assigning a new face to the nearest class mean in Mahalanobis distance – often simplified to Euclidean distance in the eigenface space when $\hat{\Sigma} \approx I$.

The decision boundary in PCA space is a Voronoi diagram: the space is partitioned into regions, one per identity, where each region is the set of points closer to that identity’s class mean than to any other. This is the natural geometry of LDA with equal priors and spherical covariance.

The error rate of the system depends on the ratio of within-class variance (variation in the same person’s face under different lighting, pose, and expression) to between-class variance (variation between different people’s face means). The Bayes error rate in PCA space quantifies the irreducible overlap between the class-conditional Gaussians – the fundamental limit set by how much face variation within a person overlaps with differences between people. Choosing the right number of principal components $m$ involves a bias-variance tradeoff: too few components lose discriminative information, too many reintroduce noise dimensions that inflate within-class variance and harm classification.

Multi-Class Classification

Bayesian classifiers are inherently multi-class. Computing $K$ posteriors and taking the argmax handles any number of classes without modification. This is in contrast to binary classifiers like SVMs, which must be extended to multi-class settings through one-vs-rest (train $K$ binary classifiers, each separating one class from all others, then pick the class with the highest score) or one-vs-one (train $\binom{K}{2}$ classifiers, one for each pair of classes, and use a tournament or vote). The Directed Acyclic Graph SVM (DDAG) arranges one-vs-one classifiers into a binary decision tree for efficient inference.

These multi-class strategies are necessary for SVMs because the SVM decision function is not a posterior probability – it does not naturally extend to comparing more than two classes simultaneously. Bayesian classifiers do not have this limitation: LDA, QDA, Gaussian Naive Bayes, and Multinomial Naive Bayes all produce a score $g_k(x)$ for every class $k$ simultaneously, and classification is a single argmax over all $K$ classes. This is a genuine architectural advantage when the number of classes is large.

Summary

Concept	Definition
Bayes classifier	$k^* = \arg\max_k P(C=k \mid x) = \arg\max_k p(x \mid C=k) P(C=k)$
Bayes error rate	$\varepsilon^* = 1 - \mathbb{E}_x[\max_k P(C=k \mid x)]$; irreducible lower bound
Discriminant function	$g_k(x) = \log p(x \mid C=k) + \log P(C=k)$; classify as $\arg\max_k g_k(x)$
LDA	Gaussian classes, shared covariance $\Sigma$; linear decision boundary
QDA	Gaussian classes, per-class covariance $\Sigma_k$; quadratic decision boundary
Conditional risk	$R(\alpha=i \mid x) = \sum_j L[i][j] P(C=j \mid x)$; expected loss of action $i$
Minimum risk rule	$\alpha^*(x) = \arg\min_i R(\alpha=i \mid x)$; generalizes 0-1 loss case
MLE for LDA	$\hat{\mu}k = \frac{1}{n_k}\sum{x \in C_k} x$; $\hat{\Sigma} = \frac{1}{n}\sum_k \sum_{x \in C_k}(x-\hat{\mu}_k)(x-\hat{\mu}_k)^\top$
Naive Bayes assumption	$p(x \mid C=k) = \prod_j p(x_j \mid C=k)$; conditional independence given class
Gaussian Naive Bayes	Each $p(x_j \mid C=k) = \mathcal{N}(x_j; \hat{\mu}{jk}, \hat{\sigma}{jk}^2)$; $2Kd$ parameters
Multinomial Naive Bayes	$\hat{\theta}{jk} = n{jk}/n_k$; Laplace smoothing adds pseudo-counts
Naive Bayes robustness	Decision boundary can be correct even when posteriors are miscalibrated
MAP estimate	$\hat{\theta}{\text{MAP}} = \arg\max\theta [\log p(\mathcal{D} \mid \theta) + \log p(\theta)]$; MLE with regularization
Dirichlet-Multinomial	Conjugate prior for categorical; Laplace smoothing prevents zero probabilities
Eigenfaces	PCA reduces dimension; LDA in PCA space; decision boundary is Voronoi diagram
Multi-class	Bayesian classifiers natively handle $K$ classes; SVM needs one-vs-rest or one-vs-one

Read next: