Conditional Probability & Bayes' Theorem
Conditional probability is the operation of updating uncertainty in light of new information. It is the mathematical core of Bayesian reasoning, and understanding it carefully dispels a great many intuitive errors - including some that have sent innocent people to prison.
Conditional Probability
Definition. Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $B \in \mathcal{F}$ with $P(B) > 0$. The conditional probability of $A$ given $B$ is:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
Interpretation. Conditioning on $B$ restricts the sample space to $B$ and renormalizes. Formally, the map $A \mapsto P(A \mid B)$ is itself a probability measure on $(\Omega, \mathcal{F})$ - it satisfies all three Kolmogorov axioms.
Multiplication rule. The definition rearranges to:
$$P(A \cap B) = P(A \mid B)\cdot P(B).$$
Iterated, for a chain of $n$ events:
$$P\left(\bigcap_{i=1}^n A_i\right) = P(A_1)\cdot P(A_2 \mid A_1)\cdot P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}).$$
This is sometimes called the chain rule of probability.
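For example, the probability of drawing two aces in a row from a standard 52-card deck, without replacement, factors by the multiplication rule:
$$P(A_1 \cap A_2) = P(A_1)\cdot P(A_2 \mid A_1) = \frac{4}{52}\cdot\frac{3}{51} = \frac{1}{221} \approx 0.0045.$$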
Law of Total Probability
Definition. A collection of events $\{B_1, B_2, \ldots, B_n\}$ is a partition of $\Omega$ if the events are pairwise disjoint and their union is $\Omega$: $B_i \cap B_j = \emptyset$ for $i \neq j$ and $\bigcup_i B_i = \Omega$.
Theorem (Law of Total Probability). If $\{B_1, \ldots, B_n\}$ is a partition of $\Omega$ with $P(B_i) > 0$ for all $i$, then for any event $A$:
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i).$$
Proof. Because the $B_i$ partition $\Omega$, the events $A \cap B_i$ are pairwise disjoint and $\bigcup_i (A \cap B_i) = A$. By countable additivity:
$$P(A) = \sum_{i=1}^n P(A \cap B_i) = \sum_{i=1}^n P(A \mid B_i)\, P(B_i). \quad \square$$
The law of total probability lets us compute $P(A)$ by decomposing the sample space into cases.
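As a quick illustration with made-up numbers: suppose a part comes from machine $B_1$ with probability $0.6$ or from machine $B_2$ with probability $0.4$, with defect rates $P(A \mid B_1) = 0.01$ and $P(A \mid B_2) = 0.05$. Then
$$P(A) = (0.01)(0.6) + (0.05)(0.4) = 0.006 + 0.020 = 0.026.$$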
Bayes' Theorem
Theorem (Bayes). Let $\{B_1, \ldots, B_n\}$ be a partition of $\Omega$ with $P(B_i) > 0$, and let $A$ be any event with $P(A) > 0$. Then:
$$P(B_i \mid A) = \frac{P(A \mid B_i)\, P(B_i)}{\displaystyle\sum_{j=1}^n P(A \mid B_j)\, P(B_j)}.$$
Proof. Apply the definition of conditional probability to $P(B_i \mid A)$, use the multiplication rule for the numerator $P(A \cap B_i) = P(A \mid B_i)P(B_i)$, and the law of total probability for the denominator $P(A)$. $\square$
Bayesian Terminology
In Bayesian inference we update beliefs about a hypothesis $H$ after observing evidence $E$:
$$\underbrace{P(H \mid E)}_{\text{posterior}} = \frac{\overbrace{P(E \mid H)}^{\text{likelihood}} \cdot \overbrace{P(H)}^{\text{prior}}}{\underbrace{P(E)}_{\text{normalizer}}}.$$
- Prior $P(H)$: our belief about $H$ before seeing evidence.
- Likelihood $P(E \mid H)$: how probable the evidence is if $H$ is true.
- Posterior $P(H \mid E)$: updated belief after observing $E$.
- Normalizer $P(E)$: ensures the posterior is a proper probability; computed via total probability over all hypotheses.
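A minimal computational sketch of this update, with priors and likelihoods passed as plain dictionaries; the function name and layout are illustrative rather than taken from any particular library:

```python
def bayes_update(prior, likelihood):
    """Posterior P(H | E) for each hypothesis H.

    prior:      dict H -> P(H)
    likelihood: dict H -> P(E | H) for the observed evidence E
    """
    # Unnormalized posterior: likelihood times prior for each hypothesis.
    unnorm = {h: likelihood[h] * prior[h] for h in prior}
    # Normalizer P(E), computed via the law of total probability.
    p_e = sum(unnorm.values())
    return {h: v / p_e for h, v in unnorm.items()}

# Example: the disease-testing numbers used in the next section.
print(bayes_update({"D": 0.01, "no D": 0.99}, {"D": 0.99, "no D": 0.05}))
# {'D': 0.1666..., 'no D': 0.8333...}
```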
Base Rate Neglect: A Medical Testing Example
Consider a disease that affects 1% of the population. A diagnostic test has:
- Sensitivity (true positive rate): $P(\text{positive} \mid \text{disease}) = 0.99$
- Specificity (true negative rate): $P(\text{negative} \mid \text{no disease}) = 0.95$, so $P(\text{positive} \mid \text{no disease}) = 0.05$
A patient tests positive. What is the probability they have the disease?
Let $D$ = “has disease”, $+$ = “tests positive”. The prior is $P(D) = 0.01$.
By the law of total probability:
$$P(+) = P(+ \mid D)\,P(D) + P(+ \mid D^c)\,P(D^c) = (0.99)(0.01) + (0.05)(0.99) = 0.0099 + 0.0495 = 0.0594.$$
By Bayes' theorem:
$$P(D \mid +) = \frac{(0.99)(0.01)}{0.0594} = \frac{0.0099}{0.0594} \approx 0.167.$$
Even with a very accurate test, a positive result means only a 17% chance of disease. This is base rate neglect: the prior $P(D) = 0.01$ is small, so most positives are false positives. Ignoring the base rate and assuming a positive test means near-certain disease is a serious inferential error.
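A small sketch that recomputes this posterior and shows how strongly it depends on the base rate; the alternative prevalence values below are arbitrary illustrations:

```python
sens, fpr = 0.99, 0.05   # P(+ | D) and P(+ | no D)

def posterior(prior):
    """P(D | +) via Bayes' theorem."""
    return sens * prior / (sens * prior + fpr * (1 - prior))

for prior in [0.001, 0.01, 0.10, 0.50]:
    print(f"prevalence {prior:>5}: P(D | +) = {posterior(prior):.3f}")
# prevalence 0.001: P(D | +) = 0.019
# prevalence  0.01: P(D | +) = 0.167
# prevalence   0.1: P(D | +) = 0.688
# prevalence   0.5: P(D | +) = 0.952
```

The same test yields wildly different posteriors depending on the prior, which is exactly why ignoring the base rate is so misleading.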
Intuition via Frequencies
Imagine 10,000 people. Of these, 100 have the disease and 9,900 do not.
|        | Disease | No Disease | Total  |
|---|---|---|---|
| Test + | 99      | 495        | 594    |
| Test − | 1       | 9,405      | 9,406  |
| Total  | 100     | 9,900      | 10,000 |
Of 594 positives, only 99 truly have the disease: $99/594 \approx 16.7\%$.
Independence and Conditional Independence
Events $A$ and $B$ are independent iff $P(A \cap B) = P(A)\,P(B)$; when $P(B) > 0$, this is equivalent to $P(A \mid B) = P(A)$ (conditioning on $B$ is irrelevant). More generally:
Definition. $A$ and $C$ are conditionally independent given $B$ (written $A \perp C \mid B$) if:
$$P(A \cap C \mid B) = P(A \mid B)\cdot P(C \mid B),$$
equivalently $P(A \mid B \cap C) = P(A \mid B)$ (when $P(B \cap C) > 0$).
Conditional independence does not imply marginal independence, and vice versa. For example, two diagnostic tests may be conditionally independent given disease status yet strongly correlated marginally, because both track the same underlying condition. These distinctions are at the heart of graphical models.
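A tiny enumeration makes the "vice versa" direction concrete, using a standard toy example: two independent fair coin flips are marginally independent, but conditioning on the event that they agree makes them perfectly dependent.

```python
from fractions import Fraction
from itertools import product

# Sample space: two independent fair coin flips; every outcome has probability 1/4.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event, given as a predicate on outcomes."""
    return sum(Fraction(1, 4) for w in outcomes if event(w))

A = lambda w: w[0] == "H"     # first flip is heads
C = lambda w: w[1] == "H"     # second flip is heads
B = lambda w: w[0] == w[1]    # the two flips agree

# Marginally independent: P(A and C) = 1/4 = P(A) * P(C).
print(prob(lambda w: A(w) and C(w)), prob(A) * prob(C))      # 1/4 1/4

# Not conditionally independent given B:
# P(A and C | B) = 1/2, but P(A | B) * P(C | B) = 1/4.
cond = lambda E: prob(lambda w: E(w) and B(w)) / prob(B)
print(cond(lambda w: A(w) and C(w)), cond(A) * cond(C))      # 1/2 1/4
```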
Naive Bayes Classifier
Given a document represented as words $(w_1, \ldots, w_k)$, we want to classify it into category $C$ (e.g., spam vs. not-spam). Bayes' theorem gives:
$$P(C \mid w_1, \ldots, w_k) \propto P(w_1, \ldots, w_k \mid C)\, P(C).$$
The likelihood $P(w_1, \ldots, w_k \mid C)$ requires estimating a distribution over all possible word sequences - intractable with limited data. The naive Bayes assumption is that words are conditionally independent given the class:
$$P(w_1, \ldots, w_k \mid C) = \prod_{i=1}^k P(w_i \mid C).$$
This gives the classifier:
$$\hat{C} = \arg\max_C P(C)\prod_{i=1}^k P(w_i \mid C).$$
The independence assumption is almost never literally true, yet naive Bayes works surprisingly well in practice, particularly for text classification.
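A minimal sketch of such a classifier, assuming the word probabilities have already been estimated from training data; the parameter tables below are illustrative placeholders, not real estimates:

```python
import math

# Illustrative hand-set parameters; in practice these are estimated from a corpus.
priors = {"spam": 0.3, "not spam": 0.7}
word_probs = {
    "spam":     {"lottery": 0.8,  "meeting": 0.05, "winner": 0.5},
    "not spam": {"lottery": 0.01, "meeting": 0.4,  "winner": 0.02},
}
UNSEEN = 1e-6  # small floor probability for words absent from the tables

def classify(words):
    """Return argmax_C [ log P(C) + sum_i log P(w_i | C) ]."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in words:
            score += math.log(word_probs[c].get(w, UNSEEN))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify(["lottery", "winner"]))   # -> 'spam'
print(classify(["meeting"]))             # -> 'not spam'
```

Working with log-probabilities rather than raw products avoids numerical underflow when the product runs over many words.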
Examples
Spam filtering. Email contains the word “lottery”. From training data:
- $P(\text{spam}) = 0.3$, $P(\text{not spam}) = 0.7$
- $P(\text{“lottery”} \mid \text{spam}) = 0.8$, $P(\text{“lottery”} \mid \text{not spam}) = 0.01$
$$P(\text{spam} \mid \text{“lottery”}) = \frac{(0.8)(0.3)}{(0.8)(0.3) + (0.01)(0.7)} = \frac{0.24}{0.247} \approx 0.972.$$
The posterior probability of spam is 97.2%, despite a relatively modest prior.
Medical diagnosis with two tests. Suppose that after the first positive test, the patient takes a second test with the same sensitivity and specificity, conditionally independent of the first given disease status. We update: the posterior from the first test, $P(D \mid +_1) \approx 0.167$, becomes the new prior. Running Bayes' theorem again with $P(D) = 0.167$:
$$P(+_2) = (0.99)(0.167) + (0.05)(0.833) \approx 0.165 + 0.042 = 0.207.$$
$$P(D \mid +_1, +_2) = \frac{(0.99)(0.167)}{0.207} \approx 0.80.$$
Two positive tests raise the probability to about 80%. Sequential Bayesian updating is exactly this process: each observation sharpens the posterior.
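A short sketch of this sequential update, under the same assumptions as above (sensitivity 0.99, false-positive rate 0.05, tests conditionally independent given disease status):

```python
def update(prior, positive, sens=0.99, fpr=0.05):
    """One Bayesian update of P(disease) after a single test result."""
    like_d  = sens if positive else 1 - sens      # P(result | disease)
    like_nd = fpr  if positive else 1 - fpr       # P(result | no disease)
    evidence = like_d * prior + like_nd * (1 - prior)
    return like_d * prior / evidence

p = 0.01                          # start from the 1% base rate
for result in (True, True):       # two positive tests in sequence
    p = update(p, result)
    print(round(p, 3))            # 0.167, then 0.798 (about 80%)
```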
Summary
Conditional probability is the fundamental operation for reasoning under uncertainty:
- Definition: $P(A \mid B) = P(A \cap B)/P(B)$ - restrict and renormalize.
- Multiplication rule: $P(A \cap B) = P(A \mid B)P(B)$.
- Total probability: $P(A) = \sum_i P(A \mid B_i)P(B_i)$ over any partition.
- Bayes' theorem: posterior $\propto$ likelihood $\times$ prior.
- Base rate neglect: failing to weight the prior properly leads to severe over- or underestimation.
- Conditional independence: the modeling assumption that makes naive Bayes and graphical models tractable.