Probability is not intuition dressed up in symbols. It is a formal mathematical language - built on measure theory - for reasoning about uncertainty. Getting the foundations right matters because every result in statistics, machine learning, and information theory ultimately rests on these axioms.

Sample Spaces and Events

Definition. A sample space $\Omega$ is the set of all possible outcomes of an experiment. An event is any subset $A \subseteq \Omega$.

Events form a $\sigma$-algebra $\mathcal{F}$ - a collection of subsets containing $\Omega$ and closed under complementation and countable union. The pair $(\Omega, \mathcal{F})$ is a measurable space.

Examples of sample spaces:

  • Flipping a coin: $\Omega = \{H, T\}$
  • Rolling a die: $\Omega = \{1,2,3,4,5,6\}$
  • Measuring a voltage: $\Omega = \mathbb{R}$
  • Infinite sequence of coin flips: $\Omega = \{H,T\}^{\mathbb{N}}$
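
To make the definitions concrete, here is a minimal Python sketch (the names `omega` and `power_set` are my own, not from any library). For a finite sample space the power set is the largest $\sigma$-algebra, and closure under complement and union can be checked directly:

```python
# Representing a finite sample space and its events as Python sets.
from itertools import combinations

omega = frozenset({"H", "T"})  # sample space for one coin flip

def power_set(s):
    """All subsets of s: the discrete sigma-algebra on a finite space."""
    items = list(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

events = power_set(omega)      # [empty set, {H}, {T}, {H, T}]

# Closure under complement: Omega \ A is also an event.
assert all(omega - A in events for A in events)
# Closure under (finite) union.
assert all(A | B in events for A in events for B in events)
print(len(events))             # 4 = 2^|Omega|
```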

Kolmogorov’s Axioms

Definition. A probability measure is a function $P: \mathcal{F} \to \mathbb{R}$ satisfying:

  1. (Non-negativity) $P(A) \geq 0$ for all $A \in \mathcal{F}$.
  2. (Normalization) $P(\Omega) = 1$.
  3. (Countable additivity) If $A_1, A_2, \ldots$ are pairwise disjoint events, then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$

The triple $(\Omega, \mathcal{F}, P)$ is a probability space.
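
The axioms can be verified mechanically on a finite space, where countable additivity reduces to finite additivity. A sketch (the function `is_probability_measure` is my own, not a standard API):

```python
# Check Kolmogorov's axioms for a measure given as a dict over a finite space.
from itertools import combinations
import math

def is_probability_measure(p, omega):
    P = lambda A: sum(p[w] for w in A)
    if any(p[w] < 0 for w in omega):          # Axiom 1: non-negativity
        return False
    if not math.isclose(P(omega), 1.0):       # Axiom 2: normalization
        return False
    subsets = [frozenset(c) for r in range(len(omega) + 1)
               for c in combinations(omega, r)]
    for A in subsets:                         # Axiom 3 (finite form):
        for B in subsets:                     # additivity on disjoint pairs
            if not (A & B) and not math.isclose(P(A | B), P(A) + P(B)):
                return False
    return True

die = frozenset({1, 2, 3, 4, 5, 6})
print(is_probability_measure({w: 1/6 for w in die}, die))  # True
print(is_probability_measure({w: 1/5 for w in die}, die))  # False (sums to 1.2)
```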

Derived Properties

All standard probability facts are consequences of the three axioms.

Theorem (Empty set). $P(\emptyset) = 0$.

Proof. Take $A_i = \emptyset$ for all $i$. The sets are pairwise disjoint and their union is $\emptyset$, so countable additivity gives $P(\emptyset) = \sum_{i=1}^{\infty} P(\emptyset)$. Writing $x = P(\emptyset)$: if $x > 0$ the right-hand side diverges, so the only non-negative solution is $x = 0$. $\square$

Theorem (Complement rule). $P(A^c) = 1 - P(A)$.

Proof. $A$ and $A^c$ are disjoint with $A \cup A^c = \Omega$. By additivity, $P(A) + P(A^c) = P(\Omega) = 1$. $\square$

Theorem (Monotonicity). If $A \subseteq B$ then $P(A) \leq P(B)$.

Proof. Write $B = A \cup (B \setminus A)$ as a disjoint union. Then $P(B) = P(A) + P(B \setminus A) \geq P(A)$ since $P(B \setminus A) \geq 0$. $\square$
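
These properties are easy to sanity-check numerically. A quick sketch using the uniform measure on a die (my own example):

```python
# Numerical check of the derived properties under the uniform measure.
omega = frozenset({1, 2, 3, 4, 5, 6})
P = lambda A: len(A) / len(omega)

A = frozenset({1, 2})
B = frozenset({1, 2, 3})

assert P(frozenset()) == 0           # empty set
assert P(omega - A) == 1 - P(A)      # complement rule
assert A <= B and P(A) <= P(B)       # monotonicity (A is a subset of B)
print(P(A), P(omega - A), P(B))      # 0.333... 0.666... 0.5
```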

Theorem (Inclusion-exclusion). For any events $A, B$:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

More generally, for $n$ events:

$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_{k=1}^{n}(-1)^{k+1}\sum_{S \subseteq \{1,\ldots,n\},\ |S|=k} P\left(\bigcap_{i\in S} A_i\right).$$

Theorem (Boole’s inequality / Union bound). $P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)$.

For two events this is immediate from inclusion-exclusion, since the subtracted term $P(A \cap B)$ is non-negative. The general case follows by induction, or by writing the union as the disjoint union of the sets $B_i = A_i \setminus \bigcup_{j < i} A_j$ and noting that $P(B_i) \leq P(A_i)$ by monotonicity.
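
Both facts can be verified by enumeration. A sketch on a single die (the three events are my own examples):

```python
# Exact inclusion-exclusion vs. the union bound on a finite uniform space.
from itertools import combinations

omega = frozenset(range(1, 7))                 # one fair die
events = [frozenset({1, 2, 3}), frozenset({2, 4, 6}), frozenset({3, 6})]
P = lambda A: len(A) / len(omega)

def inclusion_exclusion(events):
    total = 0.0
    for k in range(1, len(events) + 1):
        for S in combinations(events, k):      # all subsets of size k
            total += (-1) ** (k + 1) * P(frozenset.intersection(*S))
    return total

print(P(frozenset.union(*events)))             # exact: 5/6
print(inclusion_exclusion(events))             # matches: 5/6
print(sum(P(A) for A in events))               # union bound: 4/3 >= 5/6
```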

Discrete vs Continuous Probability Spaces

Discrete Spaces

When $\Omega$ is finite or countably infinite, probability is specified by a probability mass function $p: \Omega \to [0,1]$ with $\sum_{\omega \in \Omega} p(\omega) = 1$. The measure of any event is:

$$P(A) = \sum_{\omega \in A} p(\omega).$$
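
In code, a PMF is just a map from outcomes to weights that sum to 1. A minimal sketch with a biased coin (my own example):

```python
# A discrete probability space: PMF as a dict, P(A) as a sum over A.
pmf = {"H": 0.7, "T": 0.3}                    # p: Omega -> [0, 1]
assert abs(sum(pmf.values()) - 1.0) < 1e-12   # normalization

P = lambda A: sum(pmf[w] for w in A)

print(P({"H"}))        # 0.7
print(P({"H", "T"}))   # 1.0 = P(Omega)
print(P(set()))        # 0.0 = P(empty set)
```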

Continuous Spaces

When $\Omega \subseteq \mathbb{R}^n$, probability is specified by a probability density function $f: \Omega \to [0, \infty)$ with $\int_\Omega f(\omega)\, d\omega = 1$. The measure of a (measurable) set $A$ is:

$$P(A) = \int_A f(\omega)\, d\omega.$$

Note that $f(\omega)$ is not a probability - it is a density, and it can exceed $1$. For a single point $\omega_0$ in a continuous space, $P(\{\omega_0\}) = 0$.
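
A numerical sketch of the same point, assuming SciPy is available (the choice of the $\mathrm{Exp}(1)$ density is mine):

```python
# A continuous space: P(A) is an integral of the density, and points are null.
import math
from scipy.integrate import quad   # numerical integration

f = lambda x: math.exp(-x)         # density of Exp(1) on [0, inf)

total, _ = quad(f, 0, math.inf)    # integrates to 1
p_interval, _ = quad(f, 1, 2)      # P([1, 2]) = e^{-1} - e^{-2}
p_point, _ = quad(f, 1.5, 1.5)     # P({1.5}) = 0: a point has zero measure

print(total, p_interval, p_point)  # ~1.0  ~0.2325  0.0
```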

Uniform Probability on Finite Spaces

When $\Omega$ is finite and every outcome is equally likely, the uniform measure assigns $p(\omega) = 1/|\Omega|$ to each $\omega$, giving:

$$P(A) = \frac{|A|}{|\Omega|}.$$

This reduces probability to combinatorics: counting favorable outcomes divided by total outcomes. Most elementary probability problems use this model.
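
For example (my own), the probability a fair die shows a prime:

```python
# Uniform counting: P(A) = |A| / |Omega|.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 3, 5}                  # the face is prime
print(len(A) / len(omega))     # 0.5
```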

Independence

Definition. Events $A$ and $B$ are independent if

$$P(A \cap B) = P(A)\cdot P(B).$$

Intuitively: knowing $B$ occurred gives no information about whether $A$ occurred, and vice versa.

Pairwise vs Mutual Independence

A collection of events $\{A_1, \ldots, A_n\}$ is pairwise independent if $P(A_i \cap A_j) = P(A_i)P(A_j)$ for all $i \neq j$. It is mutually independent (or simply independent) if for every subset $S \subseteq \{1,\ldots,n\}$:

$$P\left(\bigcap_{i \in S} A_i\right) = \prod_{i \in S} P(A_i).$$

Counterexample (pairwise $\not\Rightarrow$ mutual). Roll two fair dice. Let:

  • $A$ = first die is odd
  • $B$ = second die is odd
  • $C$ = sum is odd

Then $P(A) = P(B) = P(C) = 1/2$, $P(A \cap B) = P(A \cap C) = P(B \cap C) = 1/4$, so the events are pairwise independent. But $A \cap B \cap C = \emptyset$ (if both dice are odd, the sum is even), so $P(A \cap B \cap C) = 0 \neq 1/8 = P(A)P(B)P(C)$. The events are not mutually independent.
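
The counterexample is easy to confirm by enumerating all 36 outcomes (the helper names below are mine):

```python
# Pairwise independent but not mutually independent: check by enumeration.
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes
P = lambda event: sum(1 for w in omega if event(w)) / len(omega)

A = lambda w: w[0] % 2 == 1                    # first die odd
B = lambda w: w[1] % 2 == 1                    # second die odd
C = lambda w: (w[0] + w[1]) % 2 == 1           # sum odd

print(P(lambda w: A(w) and B(w)), P(A) * P(B))   # 0.25 0.25
print(P(lambda w: A(w) and C(w)), P(A) * P(C))   # 0.25 0.25
print(P(lambda w: B(w) and C(w)), P(B) * P(C))   # 0.25 0.25
print(P(lambda w: A(w) and B(w) and C(w)))       # 0.0, not 1/8 = P(A)P(B)P(C)
```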

Conditional Probability

Definition. The conditional probability of $A$ given $B$ (with $P(B) > 0$) is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

This is a new probability measure on $\Omega$ that assigns zero probability to all outcomes outside $B$ and renormalizes the remaining probabilities to sum to 1.

Multiplication rule. Rearranging the definition:

$$P(A \cap B) = P(A \mid B)\cdot P(B) = P(B \mid A)\cdot P(A).$$

For a chain of events:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\cdot P(A_2 \mid A_1)\cdot P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}).$$

Connection to independence. If $A$ and $B$ are independent (with $P(B) > 0$), then $P(A \mid B) = P(A)$: conditioning on $B$ does not change the probability of $A$.

Examples

Coin flips. Toss a fair coin twice. $\Omega = \{HH, HT, TH, TT\}$ with uniform probability $1/4$ each. Let $A = \text{“first flip is H”}$ and $B = \text{“at least one H”}$.

$$P(A) = 2/4 = 1/2, \quad P(B) = 3/4, \quad P(A \cap B) = 2/4 = 1/2.$$

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1/2}{3/4} = \frac{2}{3}.$$
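
The $2/3$ answer can be confirmed directly from the definition by enumeration (sketch, my own code):

```python
# Conditional probability by enumeration on the four equally likely outcomes.
omega = ["HH", "HT", "TH", "TT"]
P = lambda event: sum(1 for w in omega if event(w)) / len(omega)

A = lambda w: w[0] == "H"                      # first flip is H
B = lambda w: "H" in w                         # at least one H

print(P(lambda w: A(w) and B(w)) / P(B))       # 0.666... = 2/3
```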

Dice. Roll two fair dice. What is the probability the sum is 7?

Favorable outcomes: $(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)$ - there are 6. Total outcomes: 36. So $P(\text{sum}=7) = 6/36 = 1/6$.
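
Or by brute-force enumeration:

```python
# Count the pairs of dice faces that sum to 7.
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
favorable = [w for w in outcomes if sum(w) == 7]
print(len(favorable), len(outcomes))             # 6 36
print(len(favorable) / len(outcomes))            # 0.1666... = 1/6
```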

Cards. Draw two cards without replacement from a standard 52-card deck. What is the probability both are aces?

$$P(\text{both aces}) = P(\text{1st ace}) \cdot P(\text{2nd ace} \mid \text{1st ace}) = \frac{4}{52} \cdot \frac{3}{51} = \frac{12}{2652} = \frac{1}{221}.$$
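
The same answer falls out of exhaustive counting over all two-card hands (sketch; the deck encoding is mine):

```python
# Count 2-card combinations that are both aces: C(4,2) / C(52,2).
from itertools import combinations

ranks, suits = "23456789TJQKA", "CDHS"
deck = [r + s for r in ranks for s in suits]     # 52 cards

hands = list(combinations(deck, 2))              # 1326 unordered hands
aces = [h for h in hands if h[0][0] == "A" and h[1][0] == "A"]
print(len(aces) / len(hands))                    # 6/1326 = 1/221 ~ 0.00452
```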

Checking independence. For a single fair die roll, let $A = \{1,2,3\}$ (outcome $\leq 3$) and $B = \{1,2,4\}$ (outcome is 1, 2, or 4). Then $P(A) = 1/2$, $P(B) = 1/2$, and $A \cap B = \{1,2\}$ so $P(A \cap B) = 1/3 \neq 1/4 = P(A)P(B)$. The events are not independent.

Why This Formalism Matters

The axiomatic framework forces clarity. In particular:

  • $\sigma$-algebra: specifies which events are measurable. In continuous spaces, not every subset can be assigned a probability consistently - Vitali's non-measurable set is the standard example, and the Banach-Tarski paradox is the extreme case.
  • Countable additivity (not merely finite additivity): needed to extend probability to limits and to prove the law of large numbers rigorously.
  • Measure-theoretic foundation: unifies discrete and continuous probability, enables conditional expectation to be defined rigorously (as a Radon-Nikodym derivative), and underpins stochastic processes.

The language of sample spaces, $\sigma$-algebras, and probability measures is the substrate on which all of probability theory, statistics, and modern machine learning theory is built.

