Probability as a Language
Probability is not intuition dressed up in symbols. It is a formal mathematical language - built on measure theory - for reasoning about uncertainty. Getting the foundations right matters because every result in statistics, machine learning, and information theory ultimately rests on these axioms.
Sample Spaces and Events
Definition. A sample space $\Omega$ is the set of all possible outcomes of an experiment. An event is any subset $A \subseteq \Omega$.
Events form a $\sigma$-algebra $\mathcal{F}$ - a collection of subsets closed under complementation and countable union. The pair $(\Omega, \mathcal{F})$ is a measurable space.
Examples of sample spaces:
- Flipping a coin: $\Omega = \{H, T\}$
- Rolling a die: $\Omega = \{1,2,3,4,5,6\}$
- Measuring a voltage: $\Omega = \mathbb{R}$
- Infinite sequence of coin flips: $\Omega = \{H,T\}^{\mathbb{N}}$
Kolmogorov’s Axioms
Definition. A probability measure is a function $P: \mathcal{F} \to \mathbb{R}$ satisfying:
- (Non-negativity) $P(A) \geq 0$ for all $A \in \mathcal{F}$.
- (Normalization) $P(\Omega) = 1$.
- (Countable additivity) If $A_1, A_2, \ldots$ are pairwise disjoint events, then
$$P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$
The triple $(\Omega, \mathcal{F}, P)$ is a probability space.
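For a finite sample space the axioms can be checked directly. A minimal sketch (the fair-die measure and the chosen events are illustrative):

```python
# Check Kolmogorov's axioms on a finite probability space:
# a fair six-sided die, with events as subsets of omega.
omega = {1, 2, 3, 4, 5, 6}
p = {w: 1/6 for w in omega}  # probability mass function

def P(event):
    """Probability measure induced by the mass function."""
    return sum(p[w] for w in event)

# Non-negativity and normalization
assert all(P({w}) >= 0 for w in omega)
assert abs(P(omega) - 1.0) < 1e-12

# Additivity on a disjoint pair (countable additivity reduces
# to finite additivity on a finite space)
A, B = {1, 2}, {5, 6}
assert abs(P(A | B) - (P(A) + P(B))) < 1e-12
```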
Derived Properties
All standard probability facts are consequences of the three axioms.
Theorem (Empty set). $P(\emptyset) = 0$.
Proof. Take $A_i = \emptyset$ for all $i$. The sets are pairwise disjoint and their union is $\emptyset$. By countable additivity, $P(\emptyset) = \sum_{i=1}^{\infty} P(\emptyset)$. If $P(\emptyset)$ were positive, the right-hand side would diverge; hence $P(\emptyset) = 0$. $\square$
Theorem (Complement rule). $P(A^c) = 1 - P(A)$.
Proof. $A$ and $A^c$ are disjoint with $A \cup A^c = \Omega$. By additivity, $P(A) + P(A^c) = P(\Omega) = 1$. $\square$
Theorem (Monotonicity). If $A \subseteq B$ then $P(A) \leq P(B)$.
Proof. Write $B = A \cup (B \setminus A)$ as a disjoint union. Then $P(B) = P(A) + P(B \setminus A) \geq P(A)$ since $P(B \setminus A) \geq 0$. $\square$
Theorem (Inclusion-exclusion). For any events $A, B$:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$
More generally, for $n$ events:
$$P\!\left(\bigcup_{i=1}^n A_i\right) = \sum_{k=1}^{n}(-1)^{k+1}\sum_{|S|=k} P\!\left(\bigcap_{i\in S} A_i\right).$$
Theorem (Boole’s inequality / Union bound). $P\!\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)$.
For two events this follows from inclusion-exclusion, since the subtracted term $P(A \cap B)$ is non-negative; the general case then follows by induction on $n$.
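A quick numerical check of both results, enumerating two fair dice (the events here are an arbitrary illustration):

```python
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def P(event):
    """Uniform probability of an event given as a predicate on outcomes."""
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

A1 = lambda w: w[0] >= 5   # first die shows 5 or 6
A2 = lambda w: w[1] >= 5   # second die shows 5 or 6

union = P(lambda w: A1(w) or A2(w))

# Union bound: P(A1 u A2) <= P(A1) + P(A2)
assert union <= P(A1) + P(A2)

# Inclusion-exclusion gives the exact value:
both = P(lambda w: A1(w) and A2(w))
assert abs(union - (P(A1) + P(A2) - both)) < 1e-12
```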
Discrete vs Continuous Probability Spaces
Discrete Spaces
When $\Omega$ is finite or countably infinite, probability is specified by a probability mass function $p: \Omega \to [0,1]$ with $\sum_{\omega \in \Omega} p(\omega) = 1$. The measure of any event is:
$$P(A) = \sum_{\omega \in A} p(\omega).$$
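A minimal sketch of a discrete space in Python (the loaded-die mass function is an arbitrary illustration):

```python
# Discrete probability via a mass function: a loaded die where
# face 6 carries half the mass and the rest is spread evenly.
p = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}
assert abs(sum(p.values()) - 1.0) < 1e-12  # normalization

def P(A):
    """P(A) = sum of p(w) over outcomes w in A."""
    return sum(p[w] for w in A)

# Probability the roll is even: 0.1 + 0.1 + 0.5 = 0.7
assert abs(P({2, 4, 6}) - 0.7) < 1e-12
```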
Continuous Spaces
When $\Omega \subseteq \mathbb{R}^n$, probability is specified by a probability density function $f: \Omega \to [0, \infty)$ with $\int_\Omega f(\omega)\, d\omega = 1$. The measure of a (measurable) set $A$ is:
$$P(A) = \int_A f(\omega)\, d\omega.$$
Note that $f(\omega)$ is not a probability - it is a density. For a single point $\omega_0$ in a continuous space, $P(\{\omega_0\}) = 0$.
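A numeric illustration (the exponential density, rate, and grid size are arbitrary choices): integrating the density over an interval gives a probability, while the density's value at a point is not one - it can even exceed 1.

```python
import math

# Exponential density with rate 2: f(x) = 2*exp(-2x) on [0, inf).
# Note f(0) = 2 > 1 -- a density value is not a probability.
f = lambda x: 2 * math.exp(-2 * x)

def P(a, b, n=100_000):
    """Approximate P(a <= X <= b) by a midpoint Riemann sum."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

assert f(0) > 1                # density exceeds 1 at a point...
assert P(0.5, 0.5) == 0.0      # ...yet any single point has probability 0

# Closed form: P(a <= X <= b) = exp(-2a) - exp(-2b).
assert abs(P(0, 1) - (1 - math.exp(-2))) < 1e-6
```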
Uniform Probability on Finite Spaces
When $\Omega$ is finite and every outcome is equally likely, the uniform measure assigns $p(\omega) = 1/|\Omega|$ to each $\omega$, giving:
$$P(A) = \frac{|A|}{|\Omega|}.$$
This reduces probability to combinatorics: counting favorable outcomes divided by total outcomes. Most elementary probability problems use this model.
Independence
Definition. Events $A$ and $B$ are independent if
$$P(A \cap B) = P(A)\cdot P(B).$$
Intuitively: knowing $B$ occurred gives no information about whether $A$ occurred, and vice versa.
Pairwise vs Mutual Independence
A collection of events $\{A_1, \ldots, A_n\}$ is pairwise independent if $P(A_i \cap A_j) = P(A_i)P(A_j)$ for all $i \neq j$. It is mutually independent (or simply independent) if for every subset $S \subseteq \{1,\ldots,n\}$:
$$P\!\left(\bigcap_{i \in S} A_i\right) = \prod_{i \in S} P(A_i).$$
Counterexample (pairwise $\not\Rightarrow$ mutual). Roll two fair dice. Let:
- $A$ = first die is odd
- $B$ = second die is odd
- $C$ = sum is odd
Then $P(A) = P(B) = P(C) = 1/2$, $P(A \cap B) = P(A \cap C) = P(B \cap C) = 1/4$, so the events are pairwise independent. But $A \cap B \cap C = \emptyset$ (if both dice are odd, the sum is even), so $P(A \cap B \cap C) = 0 \neq 1/8 = P(A)P(B)P(C)$. The events are not mutually independent.
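The counterexample can be verified by brute-force enumeration; a minimal sketch:

```python
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def P(event):
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

A = lambda w: w[0] % 2 == 1            # first die odd
B = lambda w: w[1] % 2 == 1            # second die odd
C = lambda w: (w[0] + w[1]) % 2 == 1   # sum odd

# Pairwise independence holds for all three pairs:
for X, Y in [(A, B), (A, C), (B, C)]:
    assert P(lambda w: X(w) and Y(w)) == P(X) * P(Y) == 0.25

# But mutual independence fails: the triple intersection is empty.
assert P(lambda w: A(w) and B(w) and C(w)) == 0.0
assert P(A) * P(B) * P(C) == 0.125
```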
Conditional Probability
Definition. The conditional probability of $A$ given $B$ (with $P(B) > 0$) is:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
This is a new probability measure on $\Omega$ that assigns zero probability to all outcomes outside $B$ and renormalizes the remaining probabilities to sum to 1.
Multiplication rule. Rearranging the definition:
$$P(A \cap B) = P(A \mid B)\cdot P(B) = P(B \mid A)\cdot P(A).$$
For a chain of events:
$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\cdot P(A_2 \mid A_1)\cdot P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}).$$
Connection to independence. If $A$ and $B$ are independent (with $P(B) > 0$), then $P(A \mid B) = P(A)$: conditioning on $B$ does not change the probability of $A$.
Examples
Coin flips. Toss a fair coin twice. $\Omega = \{HH, HT, TH, TT\}$ with uniform probability $1/4$ each. Let $A = \text{“first flip is H”}$ and $B = \text{“at least one H”}$.
$$P(A) = 2/4 = 1/2, \quad P(B) = 3/4, \quad P(A \cap B) = 2/4 = 1/2.$$ $$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1/2}{3/4} = \frac{2}{3}.$$
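The same numbers fall out of direct enumeration; a minimal sketch:

```python
from itertools import product

# The four equally likely outcomes of two coin flips.
omega = list(product("HT", repeat=2))

def P(event):
    return sum(1 for w in omega if event(w)) / len(omega)

A = lambda w: w[0] == "H"   # first flip is heads
B = lambda w: "H" in w      # at least one heads

def cond(A, B):
    """P(A | B) = P(A and B) / P(B); requires P(B) > 0."""
    return P(lambda w: A(w) and B(w)) / P(B)

assert P(A) == 0.5 and P(B) == 0.75
assert abs(cond(A, B) - 2/3) < 1e-12
```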
Dice. Roll two fair dice. What is the probability the sum is 7?
Favorable outcomes: $(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)$ - there are 6. Total outcomes: 36. So $P(\text{sum}=7) = 6/36 = 1/6$.
Cards. Draw two cards without replacement from a standard 52-card deck. What is the probability both are aces?
$$P(\text{both aces}) = P(\text{1st ace}) \cdot P(\text{2nd ace} \mid \text{1st ace}) = \frac{4}{52} \cdot \frac{3}{51} = \frac{12}{2652} = \frac{1}{221}.$$
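Exact arithmetic with the standard-library `fractions` module confirms the multiplication-rule computation:

```python
from fractions import Fraction

# Multiplication rule: P(both aces) = P(1st ace) * P(2nd ace | 1st ace)
p_first = Fraction(4, 52)    # 4 aces among 52 cards
p_second = Fraction(3, 51)   # 3 aces left among 51 remaining cards
p_both = p_first * p_second

assert p_both == Fraction(1, 221)  # 12/2652 reduced
```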
Checking independence. For a single fair die roll, let $A = \{1,2,3\}$ (outcome $\leq 3$) and $B = \{1,2,4\}$ (outcome is 1, 2, or 4). Then $P(A) = 1/2$, $P(B) = 1/2$, and $A \cap B = \{1,2\}$ so $P(A \cap B) = 1/3 \neq 1/4 = P(A)P(B)$. The events are not independent.
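Checking the definition directly, with events as subsets of a uniform sample space (a minimal sketch):

```python
# Single fair die roll under the uniform measure.
omega = {1, 2, 3, 4, 5, 6}

def P(A):
    return len(A & omega) / len(omega)

A = {1, 2, 3}   # outcome <= 3
B = {1, 2, 4}   # outcome is 1, 2, or 4

# Independence would require P(A & B) == P(A) * P(B).
assert P(A & B) == 2/6           # = 1/3
assert P(A) * P(B) == 1/4
assert P(A & B) != P(A) * P(B)   # hence not independent
```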
Why This Formalism Matters
The axiomatic framework forces clarity. In particular:
- $\sigma$-algebra: specifies which events are measurable. In continuous spaces, not every subset can be assigned a probability consistently (Vitali's non-measurable set is the classic example; the Banach-Tarski paradox is the extreme case).
- Countable additivity (not merely finite additivity): needed to extend probability to limits and to prove the law of large numbers rigorously.
- Measure-theoretic foundation: unifies discrete and continuous probability, enables conditional expectation to be defined rigorously (as a Radon-Nikodym derivative), and underpins stochastic processes.
The language of sample spaces, $\sigma$-algebras, and probability measures is the substrate on which all of probability theory, statistics, and modern machine learning theory is built.