Conditional Probability - What You Already Know Changes Everything // Megha Bose

Helpful context:

Probability as a Language - The Grammar of Uncertainty

Here is a scenario that fools almost everyone the first time.

A disease affects 1 in 1000 people in the general population. A test for this disease is 99% accurate in both directions: if you have the disease, it correctly returns a positive result 99% of the time; if you don’t have the disease, it correctly returns a negative result 99% of the time.

You take the test. It comes back positive. What is the probability that you actually have the disease?

Most people say somewhere around 99%. The test is 99% accurate, you tested positive, so surely you probably have the disease. Right?

The actual answer is roughly 9%.

Before we work through why, sit with that number for a moment. Even with an exceptionally accurate test, a positive result can leave you with a less than 1 in 10 chance of having the disease. This isn’t a trick or a paradox - it’s what happens when you ignore the base rate, the fact that the disease is rare to begin with. This single example contains the entire lesson of this post. Understanding Bayes' theorem is understanding why the answer is 9%, not 99%.

Restricting the Sample Space

The core operation is simple to state. Suppose you’re partway through an experiment and you learn that some event $B$ has occurred. You want to compute the probability that another event $A$ also occurred, given this new information.

What you know is that the outcome is somewhere in $B$. The rest of the sample space is now irrelevant. So you restrict your attention to $B$ and ask: within $B$, what fraction of outcomes are also in $A$?

That fraction is $P(A \cap B) / P(B)$ - the probability of both $A$ and $B$, divided by the probability of $B$ (which normalizes so that the new probabilities still sum to 1 within $B$).

Conditional probability: For any events $A$ and $B$ with $P(B) > 0$:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

Read this as “the probability of $A$ given $B$.” It’s a new probability assignment - one that restricts the universe to $B$ and renormalizes.

Example. Roll a fair die. You learn the result was an even number. What’s the probability it was a 6?

Without any information: $P(\{6\}) = 1/6$. With the information “even”: the sample space shrinks to $\{2, 4, 6\}$. The event $\{6\}$ is in there, and within $\{2, 4, 6\}$, it accounts for one of three equally likely outcomes. So $P(\{6\} \mid \text{even}) = 1/3$.

Using the formula: $P(\{6\} \mid \text{even}) = P(\{6\} \cap \text{even}) / P(\text{even}) = (1/6) / (1/2) = 1/3$. Same answer.
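The "restrict and renormalize" recipe can be checked mechanically on the die example. This is a minimal sketch that assumes a finite sample space of equally likely outcomes; the helper names `prob` and `cond_prob` are made up for illustration.

```python
from fractions import Fraction

def prob(event, space):
    """Probability of `event` when every outcome in `space` is equally likely."""
    return Fraction(len(event & space), len(space))

def cond_prob(a, b, space):
    """P(A | B) = P(A and B) / P(B); only defined when P(B) > 0."""
    p_b = prob(b, space)
    if p_b == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return prob(a & b, space) / p_b

die = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
six = {6}

print(prob(six, die))             # 1/6
print(cond_prob(six, even, die))  # 1/3
```

Exact fractions are used instead of floats so the results match the hand calculation digit for digit.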

The Multiplication Rule and Chain Rule

Rearranging the definition of conditional probability gives you the multiplication rule:

$$P(A \cap B) = P(A \mid B) \cdot P(B).$$

This is how you compute the probability of two things both happening: you find the probability that one happens, then multiply by the conditional probability that the other happens given the first.

Example. A bag has 5 red and 3 blue marbles. You draw two without replacement. What’s the probability both are red?

$$P(\text{1st red}) = 5/8.$$

$$P(\text{2nd red} \mid \text{1st red}) = 4/7. \quad \text{(Only 4 red left out of 7 remaining.)}$$

$$P(\text{both red}) = \frac{5}{8} \cdot \frac{4}{7} = \frac{20}{56} = \frac{5}{14}.$$
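As a sanity check, the same answer falls out of brute-force counting. The sketch below lists every ordered pair of distinct marbles (treating each physical marble as distinguishable, which is an assumption of this illustration) and compares the count with the multiplication rule.

```python
from fractions import Fraction
from itertools import permutations

bag = ["R"] * 5 + ["B"] * 3

# Every ordered pair of distinct marbles is equally likely (8 * 7 = 56 pairs).
draws = list(permutations(range(len(bag)), 2))
both_red = [d for d in draws if bag[d[0]] == "R" and bag[d[1]] == "R"]

by_counting = Fraction(len(both_red), len(draws))
by_rule = Fraction(5, 8) * Fraction(4, 7)

print(by_counting, by_rule)  # 5/14 5/14
```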

The multiplication rule extends to any number of events - this is the chain rule of probability. For events $A_1, A_2, \ldots, A_n$:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2 \mid A_1) \cdot P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}).$$

At each step you multiply by “given everything before, how likely is the next?”
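For the marble bag, the chain rule becomes a simple running product. This is a sketch with a hypothetical helper `prob_all_red`; it assumes draws without replacement and no more draws than there are marbles.

```python
from fractions import Fraction

def prob_all_red(red, blue, draws):
    """P(the first `draws` marbles are all red), drawing without replacement."""
    p = Fraction(1)
    for k in range(draws):
        # P(next draw is red | the k earlier draws were all red):
        # k red marbles are already gone from the bag.
        p *= Fraction(red - k, red + blue - k)
    return p

print(prob_all_red(5, 3, 2))  # 5/14
print(prob_all_red(5, 3, 3))  # 5/28
```

The two-draw case reproduces the 5/14 from the example; the three-draw case just multiplies in one more conditional factor, 3/6.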

The Law of Total Probability

Sometimes you want $P(A)$, but you don’t know how to compute it directly. What you do know is how to compute the probability of $A$ within each of several different scenarios, and you know the probability of each scenario.

Say the scenarios form a partition of the sample space: events $B_1, B_2, \ldots, B_n$ that are mutually exclusive (no two can both happen) and exhaustive (one of them must happen). Together they tile the sample space perfectly.

Then $A$ must occur jointly with exactly one of the $B_i$’s. Visualize a tree: from the root, branches lead to $B_1, B_2, \ldots, B_n$. From each $B_i$, a branch leads to $A$ or not-$A$. The probability of reaching $A$ via $B_i$ is $P(B_i) \cdot P(A \mid B_i)$. Summing over all branches that end in $A$:

$$P(A) = \sum_{i=1}^n P(A \mid B_i) \cdot P(B_i).$$

This is the Law of Total Probability. It lets you compute $P(A)$ as a weighted average over cases, with each case $B_i$ weighted by its probability $P(B_i)$.
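A quick check on the fair die makes the formula concrete. The event "result is greater than 3" and the even/odd partition are choices made for this illustration, not part of the examples above.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}
a = {4, 5, 6}                        # hypothetical event: result is greater than 3
partition = [{2, 4, 6}, {1, 3, 5}]   # even / odd: mutually exclusive and exhaustive

def prob(event):
    return Fraction(len(event), len(die))

def cond_prob(event, given):
    return prob(event & given) / prob(given)

# Sum of P(A | B_i) * P(B_i) over the partition.
total = sum(cond_prob(a, b) * prob(b) for b in partition)

print(total, prob(a))  # 1/2 1/2
```

The even branch contributes (2/3)(1/2) = 1/3 and the odd branch contributes (1/3)(1/2) = 1/6, which add up to the direct answer of 1/2.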

Bayes' Theorem: Inverting the Conditioning

Here is the key insight. Conditional probability is not symmetric. $P(A \mid B)$ and $P(B \mid A)$ are usually completely different numbers - and confusing them causes real harm, as we’ll see.

Bayes' theorem is the formula for going from $P(B \mid A)$ to $P(A \mid B)$.

Derivation. Start from the definition:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

Apply the multiplication rule to the numerator: $P(A \cap B) = P(B \mid A) \cdot P(A)$.

Apply the law of total probability (with partition $\{A, A^c\}$) to the denominator: $P(B) = P(B \mid A) \cdot P(A) + P(B \mid A^c) \cdot P(A^c)$.

Substituting:

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) + P(B \mid A^c) \cdot P(A^c)}.$$

This is Bayes' theorem. In the language of hypothesis $H$ and evidence $E$:

$$\underbrace{P(H \mid E)}_{\text{posterior}} = \frac{\overbrace{P(E \mid H)}^{\text{likelihood}} \cdot \overbrace{P(H)}^{\text{prior}}}{\underbrace{P(E)}_{\text{normalizer}}}.$$

The prior $P(H)$ is what you believed about $H$ before seeing the evidence.
The likelihood $P(E \mid H)$ is how probable the evidence would be if $H$ were true.
The posterior $P(H \mid E)$ is your updated belief after seeing the evidence.
The normalizer $P(E)$ ensures the posterior is a proper probability.

Bayes' theorem is the mathematical engine of belief updating. Every time you see new evidence and update your opinion, you’re doing something qualitatively like Bayes - and Bayes' theorem is the precise, quantitative version.
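The two-hypothesis form of the theorem fits in a few lines. The sketch below uses a hypothetical function `posterior` and tests it on the die from earlier, with $H$ = "rolled a 6" and $E$ = "rolled an even number", where the answer is already known to be 1/3.

```python
from fractions import Fraction

def posterior(prior, likelihood, likelihood_if_not):
    """P(H | E) from P(H), P(E | H), and P(E | not H)."""
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)  # P(E)
    return likelihood * prior / evidence

p = posterior(
    prior=Fraction(1, 6),              # P(6)
    likelihood=Fraction(1),            # P(even | 6)
    likelihood_if_not=Fraction(2, 5),  # P(even | not 6): 2 and 4 out of {1,2,3,4,5}
)
print(p)  # 1/3
```

Going through Bayes' theorem gives the same 1/3 as restricting the sample space directly, as it should.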

The Medical Test, Worked Through

Now let’s return to the opening problem with the exact numbers from the post’s premise.

Disease prevalence: $P(D) = 0.001$ (1 in 1000)
Test sensitivity (true positive rate): $P(+ \mid D) = 0.99$
Test specificity (true negative rate): $P(- \mid D^c) = 0.99$, so the false positive rate is $P(+ \mid D^c) = 0.01$

You test positive. What is $P(D \mid +)$?

Step 1: Find $P(+)$ via the law of total probability.

$$P(+) = P(+ \mid D) \cdot P(D) + P(+ \mid D^c) \cdot P(D^c)$$

$$= (0.99)(0.001) + (0.01)(0.999)$$

$$= 0.00099 + 0.00999 = 0.01098.$$

Step 2: Apply Bayes' theorem.

$$P(D \mid +) = \frac{P(+ \mid D) \cdot P(D)}{P(+)} = \frac{(0.99)(0.001)}{0.01098} = \frac{0.00099}{0.01098} \approx 0.0902.$$

About 9%. You have roughly a 1 in 11 chance of actually having the disease.
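The two steps translate directly into code. This is a small sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

prevalence = Fraction(1, 1000)    # P(D)
sensitivity = Fraction(99, 100)   # P(+ | D)
specificity = Fraction(99, 100)   # P(- | not D)
false_positive_rate = 1 - specificity  # P(+ | not D)

# Step 1: law of total probability.
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Step 2: Bayes' theorem.
p_disease_given_positive = sensitivity * prevalence / p_positive

print(float(p_positive))                          # 0.01098
print(p_disease_given_positive)                   # 11/122
print(round(float(p_disease_given_positive), 4))  # 0.0902
```

The exact posterior is 11/122, which is where "roughly 1 in 11" comes from.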

Why is it so low? Think about 100,000 people. About 100 have the disease, and about 99,900 don’t.

	Disease	No Disease	Total
Test +	99	999	1098
Test −	1	98901	98902
Total	100	99900	100000

Of the 1098 people who test positive, only 99 actually have the disease. That’s 99/1098 ≈ 9%. The false positive rate is just 1% - but applied to a pool of 99,900 people without the disease, it produces about 999 false positives. These swamp the true positives.
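The table can be rebuilt with plain integer arithmetic, which is a useful habit when the percentages start to blur. The sketch assumes the rates apply exactly to a population of 100,000, as in the table.

```python
population = 100_000
sick = population // 1000        # 1 in 1000 -> 100
healthy = population - sick      # 99,900

true_pos = sick * 99 // 100      # 99% of the sick test positive -> 99
false_neg = sick - true_pos      # 1
false_pos = healthy // 100       # 1% of the healthy test positive -> 999
true_neg = healthy - false_pos   # 98,901

positives = true_pos + false_pos
print(true_pos, false_pos, positives)  # 99 999 1098
print(round(true_pos / positives, 4))  # 0.0902
```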

The culprit is the base rate. The disease is rare, so most of the population is disease-free, so most positive tests are false positives - even if the test is highly accurate. This is not a flaw in Bayes' theorem; it’s precisely what Bayes' theorem is telling you.
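To see how strongly the base rate drives the result, the sketch below holds the test fixed at 99% sensitivity and specificity and varies only the prevalence. The prevalence values other than 0.001 are hypothetical, chosen for illustration.

```python
def p_disease_given_positive(prevalence, sensitivity=0.99, specificity=0.99):
    """Posterior probability of disease after one positive test."""
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / p_pos

for prevalence in (0.001, 0.01, 0.1, 0.5):
    print(f"{prevalence:>6}: {p_disease_given_positive(prevalence):.3f}")
```

With the same test, the posterior climbs from about 9% at a prevalence of 1 in 1000 to about 50% at 1 in 100 and about 92% at 1 in 10. Only when the disease is as common as a coin flip does the posterior match the 99% that intuition expects.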

Independence: When Information Tells You Nothing

Conditional probability leads naturally to the concept of independence.

Events $A$ and $B$ are independent if knowing that $B$ occurred tells you nothing about whether $A$ occurred:

$$P(A \mid B) = P(A).$$

Substituting the definition of conditional probability, this is equivalent to:

$$P(A \cap B) = P(A) \cdot P(B).$$

This product formula is the standard definition - it has the advantage of being symmetric and of working even when $P(B) = 0$.

Independence means information is irrelevant. The outcome of coin flip 1 tells you nothing about coin flip 2. Whether a red marble was drawn first (with replacement) tells you nothing about the second draw. These are genuinely independent.
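The product formula is easy to test on the fair die. In this sketch the event "at most four" is picked for illustration, and `independent` is a hypothetical helper name.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}

def prob(event):
    return Fraction(len(event), len(die))

def independent(a, b):
    """Product-formula test: P(A and B) == P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}   # hypothetical event chosen for illustration
six = {6}

print(independent(even, at_most_four))  # True:  1/3 == 1/2 * 2/3
print(independent(even, six))           # False: 1/6 != 1/2 * 1/6
print(independent(even, set()))         # True, even though P(B) = 0
```

The last line shows why the product form is preferred as the definition: it still gives an answer when conditioning on $B$ would mean dividing by zero.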

Two common mistakes about independence:

Treating correlated events as independent. If stock A tends to rise when stock B rises, they’re not independent - they’re positively correlated. Knowing one went up changes your expectation for the other.
The gambler’s fallacy. This is the belief that after a long streak of heads, tails is “due.” Independent events have no memory. If flips of a fair coin are independent, the 1001st flip has exactly a 50% chance of heads regardless of the first 1000. The coin doesn’t know what happened before.
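The gambler's fallacy can be tested empirically. The sketch below simulates a fair coin with Python's pseudo-random generator (an assumption standing in for truly independent flips) and looks at what happens immediately after every run of three heads; the streak length and sample size are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
flips = [random.random() < 0.5 for _ in range(200_000)]  # True = heads

streak = 3
after_streak = [
    flips[i]
    for i in range(streak, len(flips))
    if all(flips[i - streak:i])  # the previous `streak` flips were all heads
]

freq = sum(after_streak) / len(after_streak)
print(len(after_streak), round(freq, 3))
```

If tails were ever "due", the frequency of heads after a streak would sit noticeably below 0.5. In a simulation like this it stays close to 0.5, up to sampling noise.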

The Prosecutor’s Fallacy

This one has put innocent people in prison. It is the confusion of $P(E \mid H)$ with $P(H \mid E)$.

In a criminal trial, the prosecutor presents DNA evidence and says: “The probability that an innocent person would have DNA matching this sample is 1 in a million. The defendant’s DNA matches. Therefore, there’s only a 1 in a million chance the defendant is innocent.”

This reasoning is wrong. It confuses the likelihood $P(\text{match} \mid \text{innocent})$ with the posterior $P(\text{innocent} \mid \text{match})$.

To find the posterior, you need the prior. If the database that was searched contained one million profiles, then roughly one innocent person would be expected to match by chance - so a match alone, with no other evidence, leaves the probability of innocence nowhere near 1 in a million.

Bayes' theorem is the correct tool. The prosecutor’s informal argument is: “the probability of this evidence if innocent is tiny, so the probability of innocence given the evidence must also be tiny.” But this ignores the prior probability of guilt, and the posterior $P(\text{guilty} \mid \text{match})$, which is the quantity actually on trial, cannot be computed without it. The error can be devastating.
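Putting rough numbers on this shows the size of the gap. The sketch below rests on simplifying assumptions that are not facts of any real case: a pool of one million possible sources, exactly one of whom is guilty, no evidence other than the match, and a guilty person who always matches.

```python
from fractions import Fraction

population = 1_000_000                        # assumed pool of possible sources
p_match_if_innocent = Fraction(1, 1_000_000)  # the prosecutor's figure
p_match_if_guilty = Fraction(1)               # assumed: the true source always matches
prior_guilt = Fraction(1, population)         # assumed: no other evidence

p_match = (p_match_if_guilty * prior_guilt
           + p_match_if_innocent * (1 - prior_guilt))
p_guilt_given_match = p_match_if_guilty * prior_guilt / p_match

print(float(p_guilt_given_match))      # roughly 0.5
print(float(1 - p_guilt_given_match))  # roughly 0.5, not 0.000001
```

Under these assumptions the probability of innocence given the match is about one half, not one in a million. Other evidence would raise the prior and change the answer, which is exactly why the prior cannot be left out.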

Discomfort check. Why is $P(A \mid B)$ undefined when $P(B) = 0$? The formula $P(A \mid B) = P(A \cap B) / P(B)$ involves dividing by $P(B)$, and division by zero is undefined. In discrete settings this is rarely a problem - you just don’t condition on impossible events. In continuous settings it’s trickier. If $X$ is uniformly distributed on $[0,1]$, then $P(X = 0.5) = 0$, but conditioning on $X = 0.5$ is a meaningful and useful operation. Making this precise requires measure theory and the concept of conditional expectation as a Radon-Nikodym derivative. For now: in the finite and discrete worlds that these early posts inhabit, $P(B) > 0$ is almost always satisfied for any event you’d actually want to condition on.

Summary

Concept	Formula	What It Means
Conditional probability	$P(A \mid B) = P(A \cap B)/P(B)$	Restrict to $B$, renormalize
Multiplication rule	$P(A \cap B) = P(A \mid B) \cdot P(B)$	Probability of both
Chain rule	$P(A_1 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2 \mid A_1) \cdots$	Sequence of conditional steps
Law of total probability	$P(A) = \sum_i P(A \mid B_i) P(B_i)$	Average over a partition
Bayes' theorem	$P(H \mid E) \propto P(E \mid H) \cdot P(H)$	Posterior ∝ likelihood × prior
Independence	$P(A \mid B) = P(A) \Leftrightarrow P(A \cap B) = P(A)P(B)$	Information doesn’t help

The medical test example is the whole post in miniature. Your prior belief that you have the disease was 0.1%. The positive test was strong evidence - but not strong enough to overcome how rare the disease is. The posterior is 9%, not 99%. That gap - between the accuracy of a test and the probability you have the disease - is bridged by Bayes' theorem.
