Probability as a Language - The Grammar of Uncertainty
Helpful context:
- Counting & Choice - How to Count Without Counting Everything
- Permutations & Combinations - Order and Selection, Made Precise
- Inclusion-Exclusion - Overcounting Your Way to the Right Answer
You flip a coin 1000 times and get 513 heads. Should you be suspicious? Is the coin rigged?
Your gut says something. Maybe 513 doesn’t feel too far from 500. Or maybe it does - 13 extra heads, that’s a bit much. But your gut is not a reliable instrument. People’s intuitions about randomness are notoriously wrong. We think truly random sequences should alternate more than they do. We expect “balance” that probability doesn’t guarantee. We are surprised by streaks. We are surprised by the birthday paradox - in a room of 23 people, there’s a better than even chance that two share a birthday. We are surprised by the Monty Hall problem - switching doors doubles your chance of winning, even though it feels like the odds should be 50-50 after one door is opened.
The reason our intuitions fail is that we don’t have a precise language for uncertainty. We have feelings and heuristics. Probability theory is the language - the grammar, the vocabulary, the rules - that makes uncertain questions answerable with something better than vibes.
A natural objection: if probability doesn’t guarantee any particular outcome, what is it actually good for? The answer is that probability isn’t competing with certainty - it’s competing with gut feelings, and it wins. Consider what probability does that intuition can’t:
It makes aggregates predictable even when individual outcomes aren’t. A casino doesn’t know who will win any particular hand. But the house edge means they can predict, with near-mathematical certainty, that they’ll profit over thousands of bets. An insurance company doesn’t know which customers will get sick - but it knows what fraction will, reliably enough to price policies and stay solvent. When you aggregate over many uncertain outcomes, the uncertainty shrinks. Probability tells you exactly how fast, and by how much.
It also quantifies how uncertain you are - which turns out to be enormously useful. “It might rain” is useless for planning. “70% chance of rain” tells you to bring an umbrella. “99% chance of rain” tells you to reschedule. Probability converts vague feelings into numbers you can act on.
And it gives you a framework for making the best possible decision under uncertainty even when you can’t know the outcome. You may not control what happens, but you can control how you respond to the odds. Expected value, Bayesian reasoning, hypothesis testing - all of this follows from the same foundation.
This post builds that foundation from the ground up. By the end, you’ll have a formal framework that can handle coin flips, rainfall totals, and the reliability of medical tests.
How This Field Got Started
Probability theory was born from a gambling dispute.
In 1654, a French nobleman and compulsive gambler named Antoine Gombaud - known as the Chevalier de Méré - wrote to Blaise Pascal with a puzzle he couldn’t resolve. Two players are mid-game in a fair bet. The game gets interrupted before anyone wins. Given the current score, how should the pot be split? This became known as the “Problem of Points,” and it had stumped mathematicians for over a century. Pascal found it interesting enough to write to Pierre de Fermat about it. Their exchange of letters over the next few months is considered the founding moment of probability as a mathematical discipline.
What’s remarkable is that Pascal and Fermat weren’t just solving a gambling problem. They were doing something philosophically new: they were reasoning systematically about events that hadn’t happened yet. For most of history, uncertain outcomes were treated as the domain of God or fate - you couldn’t calculate chance, you could only accept it. The Pascal-Fermat correspondence said: no, there is structure here, and we can reason about it rigorously.
Gerolamo Cardano had actually gotten there a century earlier, in his book Liber de Ludo Aleae (Book on Games of Chance), where he worked out basic combinatorics of dice. But Cardano’s work wasn’t published until 1663, and he was largely ignored. What Pascal and Fermat did was put probability in conversation with the mathematical establishment of the day.
The field grew quickly after that. Christiaan Huygens published the first printed text on probability in 1657, introducing the concept of expected value - the average outcome you’d expect over many repetitions of a bet. Jacob Bernoulli, writing around 1700, proved the first rigorous version of the Law of Large Numbers: as you repeat an experiment more and more times, the observed frequency of an event converges to its true probability. This was the mathematical justification for the intuition that probability means something real - it’s not just a number, it’s a long-run prediction that actually comes true.
By the 18th century, mathematicians were applying probability far outside gambling. Laplace used it to analyze errors in astronomical measurements. Actuaries used it to price life insurance. In the early 19th century, Gauss and Laplace developed the normal distribution as a model of measurement noise. The same tool that settled a gambling dispute was now powering scientific inference.
The modern axiomatic foundation - the one this post builds - was written by Andrei Kolmogorov in 1933. Before Kolmogorov, probability was a collection of clever techniques without rigorous underpinnings. Kolmogorov reformulated it as a branch of measure theory, writing down the three axioms that everything else follows from. This is what gave probability the same logical solidity as geometry or algebra - and opened the door to applying it to any domain where uncertainty lives, from quantum mechanics to machine learning.
The through-line from gambling to modern AI is genuinely surprising. The problem of how to split the pot in an interrupted game became the problem of how to reason about anything you don’t know for certain.
What Are We Trying to Describe?
Start with an experiment whose outcome you don’t know in advance: a coin flip, a die roll, tomorrow’s temperature, how many customers walk into a store. The thing all these share is that before the experiment runs, multiple outcomes are possible. After it runs, exactly one outcome occurred.
The first thing we need is a way to catalog what could happen.
The sample space $\Omega$ is the set of all possible outcomes of an experiment.
Some examples:
- Flip a coin once: $\Omega = \{H, T\}$
- Roll a standard die: $\Omega = \{1, 2, 3, 4, 5, 6\}$
- Flip a coin twice: $\Omega = \{HH, HT, TH, TT\}$
- Measure tomorrow’s rainfall in millimeters: $\Omega = [0, \infty)$
- Record the exact time (in seconds) a radioactive atom decays: $\Omega = (0, \infty)$
The sample space can be finite, countably infinite, or uncountably infinite - like all the real numbers in an interval. The first two cases are called discrete; the last is continuous. They behave somewhat differently, as we’ll see.
Once you have a sample space, you can talk about events.
An event is any subset $A \subseteq \Omega$.
An event is a collection of outcomes - the ones where the event “occurs.” When you roll a die and ask “did I roll an even number?”, the event is $\{2, 4, 6\}$. When you flip a coin twice and ask “did I get at least one head?”, the event is $\{HH, HT, TH\}$. When you measure rainfall and ask “did it rain more than 10 mm?”, the event is the interval $(10, \infty)$.
The complement of an event $A$ is $A^c = \Omega \setminus A$: everything in the sample space that isn’t in $A$. The union $A \cup B$ is “at least one of $A$ or $B$ occurs.” The intersection $A \cap B$ is “both occur.” These are ordinary set operations, but now they carry probabilistic meaning.
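Since events are just sets, the definitions above map directly onto Python's built-in set type. The sketch below uses the two-flip sample space from the examples; the second event ("first flip is tails") is an illustrative choice that does not appear in the text.

```python
from itertools import product

# Sample space for two coin flips: every ordered pair of H/T.
omega = {"".join(flips) for flips in product("HT", repeat=2)}

# Events are subsets of the sample space.
at_least_one_head = {w for w in omega if "H" in w}
first_flip_tails = {w for w in omega if w[0] == "T"}  # illustrative second event

complement = omega - at_least_one_head               # "no heads at all"
union = at_least_one_head | first_flip_tails         # at least one of the two occurs
intersection = at_least_one_head & first_flip_tails  # both occur

print(sorted(omega))
print(sorted(complement))
print(sorted(union))
print(sorted(intersection))
```

Nothing probabilistic has happened yet: this is only the bookkeeping of which outcomes belong to which event.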
Probability: Three Rules That Govern Everything
Now we want to assign probabilities - numbers between 0 and 1 - to events. But which numbers? You can’t just assign them arbitrarily. Any assignment that’s going to be coherent and useful has to satisfy some minimal consistency conditions.
In 1933, the mathematician Andrei Kolmogorov wrote down three axioms. These aren’t assumptions pulled from thin air - they’re the minimal rules that any reasonable assignment of probabilities must satisfy. Here they are, with a brief argument for why each one is necessary.
Axiom 1 (Non-negativity): $P(A) \geq 0$ for every event $A$.
Why it has to hold: Probability is supposed to measure how likely something is. Negative likelihoods don’t make sense - there’s no meaningful interpretation of “a minus-30% chance of rain.”
Axiom 2 (Normalization): $P(\Omega) = 1$.
Why it has to hold: Something always happens. After you flip the coin, you get either H or T - you don’t get nothing. The probability that some outcome occurs should be 1, because it’s certain. You can think of 1 as the total “budget” of probability to distribute.
Axiom 3 (Countable additivity): If $A_1, A_2, A_3, \ldots$ are pairwise disjoint events (no two overlap), then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$
Why it has to hold: If two events can’t both happen at the same time (disjoint events), then the probability that at least one happens should be the sum of their individual probabilities. Rolling a 3 and rolling a 5 are mutually exclusive outcomes; the probability of rolling either should be $P(\{3\}) + P(\{5\})$. The countably infinite version (not just finitely many events) is needed to make probability work with limits - to prove things like the law of large numbers rigorously.
The pair $(\Omega, P)$ - a sample space equipped with a probability function satisfying these axioms - is a probability space.
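For a small finite sample space, the three axioms can be checked by brute force. The sketch below is one way to do it, assuming probabilities are built from per-outcome weights; `powerset`, `make_probability`, and `satisfies_axioms` are hypothetical helper names, and on a finite space countable additivity is checked only in its two-event form.

```python
from fractions import Fraction
from itertools import combinations

def powerset(omega):
    """Yield every subset of a finite sample space as a frozenset."""
    items = list(omega)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def make_probability(weights):
    """Build P(event) by summing per-outcome weights (hypothetical helper)."""
    def P(event):
        return sum((weights[w] for w in event), Fraction(0))
    return P

def satisfies_axioms(omega, P):
    events = list(powerset(omega))
    non_negative = all(P(A) >= 0 for A in events)   # Axiom 1
    normalized = P(frozenset(omega)) == 1           # Axiom 2
    # Axiom 3, restricted to pairs of disjoint events (enough on a finite space).
    additive = all(
        P(A | B) == P(A) + P(B)
        for A in events for B in events if not (A & B)
    )
    return non_negative and normalized and additive

omega = {1, 2, 3, 4, 5, 6}
fair_die = make_probability({w: Fraction(1, 6) for w in omega})
overweight = make_probability({w: Fraction(1, 5) for w in omega})  # weights sum to 6/5

print(satisfies_axioms(omega, fair_die))
print(satisfies_axioms(omega, overweight))
```

The second assignment fails only because of normalization: its weights are non-negative and additive, but the total budget is 6/5 instead of 1.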
What Follows from the Axioms
Remarkably, everything else in probability is a consequence of these three rules. You never need additional assumptions; you just keep applying the axioms carefully.
$P(\emptyset) = 0$. The empty set (the event containing no outcomes) has probability zero.
Why: Take infinitely many copies of the empty set: $A_i = \emptyset$ for all $i$. They’re disjoint, and their union is $\emptyset$. By Axiom 3: $P(\emptyset) = \sum_{i=1}^\infty P(\emptyset)$. The only number that equals its own infinite sum is zero.
$P(A^c) = 1 - P(A)$. The probability that something doesn’t happen is one minus the probability that it does.
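Why: $A$ and $A^c$ are disjoint, and $A \cup A^c = \Omega$. Axiom 3 also covers finitely many disjoint events (pad the list with copies of $\emptyset$, each of which has probability zero), so together with Axiom 2: $P(A) + P(A^c) = P(\Omega) = 1$.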
This one gets used constantly. Rather than computing $P(A)$ directly, sometimes it’s easier to compute $P(A^c)$ and subtract from 1. “At least one head in ten flips” is much easier to find as $1 - P(\text{no heads})$ than by summing up all the cases.
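The ten-flip example is small enough to check both routes. The sketch below assumes a fair coin, so that all $2^{10}$ sequences are equally likely, and compares direct enumeration with the complement shortcut.

```python
from fractions import Fraction
from itertools import product

n = 10
outcomes = list(product("HT", repeat=n))  # 2**10 equally likely sequences (fair coin assumed)

# Direct route: count every sequence containing at least one head.
direct = Fraction(sum(1 for seq in outcomes if "H" in seq), len(outcomes))

# Complement route: only the all-tails sequence has no heads.
via_complement = 1 - Fraction(1, 2) ** n

print(direct, via_complement)  # 1023/1024 1023/1024
```

The direct route walks through 1024 sequences; the complement route is a single subtraction.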
Inclusion-exclusion: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
Why: When $A$ and $B$ overlap, adding $P(A) + P(B)$ double-counts the intersection. You subtract $P(A \cap B)$ to correct. This generalizes: for three events, you add three, subtract three pairwise intersections, add back the triple intersection. That’s exactly the inclusion-exclusion principle from your combinatorics prerequisites - now it lives inside probability.
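Both the two-event and three-event versions can be checked on a single fair die. The events below (even, greater than 3, at most 2) are illustrative choices, not taken from the text.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))  # one fair six-sided die

def P(event):
    """Uniform probability: favorable outcomes over total outcomes."""
    return Fraction(len(event), len(omega))

A = frozenset({2, 4, 6})  # even
B = frozenset({4, 5, 6})  # greater than 3
C = frozenset({1, 2})     # at most 2

# Two events: add both, subtract the overlap.
two_event = P(A) + P(B) - P(A & B)

# Three events: add singles, subtract pairwise overlaps, add back the triple overlap.
three_event = (P(A) + P(B) + P(C)
               - P(A & B) - P(A & C) - P(B & C)
               + P(A & B & C))

print(two_event, P(A | B))
print(three_event, P(A | B | C))
```

In each printed pair the formula on the left agrees with the probability of the union computed directly on the right.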
Monotonicity: If $A \subseteq B$, then $P(A) \leq P(B)$.
Why: $B = A \cup (B \setminus A)$ is a disjoint union, so $P(B) = P(A) + P(B \setminus A) \geq P(A)$ since $P(B \setminus A) \geq 0$. Bigger events (more outcomes) can’t have smaller probability.
The Uniform Model: Where Combinatorics Plugs In
For many problems, we have a finite sample space and every outcome is equally likely. This is the uniform or classical probability model.
If $|\Omega| = N$ and every outcome has equal probability $1/N$, then for any event $A$:
$$P(A) = \frac{|A|}{|\Omega|} = \frac{\text{number of outcomes in } A}{\text{total number of outcomes}}.$$
This formula turns probability into a counting problem - which is why the combinatorics posts were prerequisites. Counting outcomes in $A$ is exactly what permutations, combinations, and the inclusion-exclusion principle are for.
Worked example: two dice, sum equals 7.
Roll two fair six-sided dice. What’s the probability the sum is 7?
The sample space is all pairs $(d_1, d_2)$ where $d_1, d_2 \in \{1,2,3,4,5,6\}$. That’s $|\Omega| = 36$ equally likely outcomes.
The event $A = \{\text{sum} = 7\}$ consists of the pairs where $d_1 + d_2 = 7$. Let’s list them: $(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)$. That’s $|A| = 6$.
$$P(\text{sum} = 7) = \frac{6}{36} = \frac{1}{6}.$$
Notice that 7 is the most probable sum when rolling two dice - more ways to make it than any other total. That’s a pure counting fact, which now translates directly into probability.
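The same count can be done by enumeration, which also confirms the claim about 7 being the most probable sum. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 ordered pairs (d1, d2), each equally likely for fair dice.
rolls = list(product(range(1, 7), repeat=2))

# Count how many pairs produce each sum, then divide by |Omega|.
sum_counts = Counter(d1 + d2 for d1, d2 in rolls)
prob = {total: Fraction(count, len(rolls)) for total, count in sum_counts.items()}

most_likely = max(prob, key=prob.get)

print(prob[7])       # 1/6
print(most_likely)   # 7
```

Swapping `repeat=2` for a larger number extends the same enumeration to more dice, at the cost of a sample space that grows as $6^n$.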
Continuous Spaces: Probability Without Counting
The uniform model breaks completely for continuous sample spaces. If $\Omega = [0, \infty)$ (say, tomorrow’s rainfall in mm), you can’t count outcomes - there are uncountably many. The formula $P(A) = |A|/|\Omega|$ has no meaning when both numerator and denominator are infinite.
Continuous probability works differently. Instead of assigning probabilities to individual outcomes, you assign them to intervals - and you use a probability density function $f(x)$ to do it. Writing $X$ for the measured value:
$$P(a \leq X \leq b) = \int_a^b f(x) dx$$
The probability is the area under the density curve over an interval, not a count. Think of $f(x)$ as describing how concentrated probability is near the point $x$ - like mass per unit length. To get an actual probability (an actual mass), you need a region, not a point.
Example. Suppose rainfall is uniformly distributed between 0 and 10 mm - all amounts equally likely. Then $f(x) = 1/10$ for $x \in [0, 10]$ and zero outside. The probability of getting between 3 and 7 mm is:
$$P(3 \leq X \leq 7) = \int_3^7 \frac{1}{10} dx = \frac{4}{10} = 0.4.$$
That’s 40% of the area of the density - exactly right geometrically.
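The integral can also be approximated numerically, which is how you would proceed with a density that has no convenient antiderivative. The sketch below uses a midpoint Riemann sum; `uniform_density` and `integrate` are hypothetical helper names, and the step count is an arbitrary choice.

```python
def uniform_density(x, low=0.0, high=10.0):
    """Density of the uniform rainfall model: 1/10 on [0, 10], zero elsewhere."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with a midpoint Riemann sum."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

p_3_to_7 = integrate(uniform_density, 3, 7)
total = integrate(uniform_density, 0, 10)

print(round(p_3_to_7, 6))  # 0.4
print(round(total, 6))     # 1.0
```

For a constant density the midpoint rule is exact up to floating-point rounding, so this mostly illustrates the mechanics rather than the approximation error.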
The strange consequence: $P(X = x) = 0$ for every single point.
$$P(X = 3.7) = \int_{3.7}^{3.7} f(x) dx = 0.$$
This feels wrong. If you measure rainfall and get exactly 3.7 mm, didn’t that happen? Yes - but the probability of that exact value, specified to infinite decimal places, is zero. This is not the same as impossible. Think of throwing a dart at a dartboard: the probability of hitting any precise geometric point is zero (a point has no area), yet the dart hits somewhere. Zero probability and impossibility are not the same thing in continuous spaces. What the density $f(x)$ tells you is not “how likely is exactly $x$” but “how densely packed is probability near $x$” - meaning nearby intervals get higher probability.
A density $f(x)$ is not itself a probability and doesn’t need to be below 1. It just needs to integrate to 1 over all of $\Omega$ (Axiom 2 in continuous form):
$$\int_{-\infty}^{\infty} f(x) dx = 1.$$
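A quick way to see that a density may exceed 1 is to squeeze the uniform model onto a shorter interval. The sketch below assumes a uniform distribution on $[0, 0.5]$ (an illustrative choice, not from the text), where the density must equal 2 for the total area to remain 1.

```python
def narrow_uniform_density(x, low=0.0, high=0.5):
    """Uniform density on [0, 0.5]: height 2 so that the total area is 1."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with a midpoint Riemann sum."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

peak = narrow_uniform_density(0.25)
total = integrate(narrow_uniform_density, 0.0, 0.5)
p_slice = integrate(narrow_uniform_density, 0.1, 0.2)

print(peak)                # 2.0
print(round(total, 6))     # 1.0
print(round(p_slice, 6))   # 0.2
```

The density value 2.0 is not a probability; the probabilities are the areas, and those stay between 0 and 1.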
How to interpret this. Under the frequentist view: if you repeat the experiment many times, the fraction of outcomes landing in $[a, b]$ converges to $\int_a^b f(x) dx$. The individual outcomes still vary unpredictably - but the distribution of where they land matches the density. Under the Bayesian view: the density represents your degree of belief spread across possible values, with more area meaning more credence.
Two Ways to Interpret Probability
We have the axioms. We have the formula for uniform spaces. But what does “$P(A) = 1/6$” actually mean? There are two major interpretations that mathematicians and statisticians debate, and they lead to genuinely different ways of doing statistics.
The frequentist interpretation. $P(A) = 1/6$ means: in a long run of repeated experiments, the fraction of trials where $A$ occurs converges to $1/6$. Probability is long-run relative frequency. If you roll a die one million times, the fraction of 3s will be very close to $1/6$. Under this view, probability only makes sense for repeatable experiments - you can’t assign a probability to a one-time event like “it will rain in London on my birthday next year” because there’s no sequence of trials to average over.
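The long-run claim is easy to watch in simulation. The sketch below uses Python's pseudo-random generator as a stand-in for a fair die; `relative_frequency` is a hypothetical helper, and the seed is fixed only so that reruns give the same numbers.

```python
import random

def relative_frequency(target, n_rolls, seed=0):
    """Fraction of simulated fair-die rolls that equal `target`."""
    rng = random.Random(seed)  # seeded for reproducibility (an arbitrary choice)
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == target)
    return hits / n_rolls

for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency(3, n))

print("target:", 1 / 6)
```

The printed frequencies should drift toward 1/6 as the number of rolls grows, though any single run can wander, especially at small sample sizes.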
The Bayesian (or subjective) interpretation. $P(A) = 1/6$ means: you believe, with confidence $1/6$, that $A$ will occur. Probability is a degree of belief, a measure of personal uncertainty. Under this view, you can assign probabilities to one-off events - even “what’s the probability that Shakespeare wrote the plays attributed to him?” - as long as you’re willing to commit to a number representing your credence.
Here’s the beautiful part: Kolmogorov’s axioms are neutral between these interpretations. Both frequentists and Bayesians accept the axioms. The axioms don’t tell you what probability means - they tell you how it must behave once you’ve decided what it means. This is why the axiomatic approach is so powerful: it unifies two schools of thought under a common mathematical structure.
Why Not Just Use Intuition?
The Dutch book argument makes the case for the axioms in a hard-nosed way: if your probability assignments violate the axioms, a clever opponent can construct a series of bets against you that guarantee they win money no matter what happens.
Suppose you think $P(A) = 0.4$ and $P(A^c) = 0.7$. These add up to 1.1, not 1, so the complement rule is violated - and with it the axioms it follows from. An opponent offers you two bets: “I’ll sell you a ticket for $A$ that pays 1 if $A$ occurs, for a price of 0.4” and “I’ll sell you a ticket for $A^c$ that pays 1 if $A^c$ occurs, for a price of 0.7.” By your own probabilities both prices look fair, so you buy both. You spend $0.4 + 0.7 = 1.1$ upfront. But exactly one of $A$ or $A^c$ occurs, so exactly one ticket pays out 1. You always lose 0.1. Your opponent has a guaranteed profit.
The axioms are precisely the conditions that make you Dutch-book-proof - internally consistent enough that no guaranteed money pump can be constructed against you.
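The bookkeeping of the two-ticket example can be written out explicitly. The sketch below uses exact fractions to avoid floating-point noise; `bettor_net` is a hypothetical helper name.

```python
from fractions import Fraction

price_A = Fraction(4, 10)      # what you pay for the ticket on A
price_not_A = Fraction(7, 10)  # what you pay for the ticket on A^c
payout = 1                     # each ticket pays 1 if its event occurs

def bettor_net(a_occurs):
    """Net result for the bettor who bought both tickets."""
    cost = price_A + price_not_A
    winnings = (payout if a_occurs else 0) + (0 if a_occurs else payout)
    return winnings - cost

print(bettor_net(True))   # -1/10
print(bettor_net(False))  # -1/10
```

The loss is the same in both branches, which is what makes it a sure loss rather than a bad bet: it equals the amount by which the two stated probabilities overshoot 1.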
Discomfort check. You might have noticed a technical term skipped over: the $\sigma$-algebra. The full setup of a probability space is actually a triple $(\Omega, \mathcal{F}, P)$, where $\mathcal{F}$ is a collection of subsets of $\Omega$ (the “measurable events”) that contains $\Omega$ itself and is closed under complement and countable union. Why do you need this? For finite sample spaces, you don’t - just let $\mathcal{F}$ be all subsets of $\Omega$ and move on. The complication arises for continuous spaces like $\Omega = \mathbb{R}$. It turns out you cannot consistently assign probabilities to every subset of the real line - there exist “non-measurable sets” where the assignment breaks down (Vitali sets are the standard example; the Banach-Tarski paradox is the dramatic extreme case in three dimensions). The $\sigma$-algebra specifies exactly which subsets are allowed to be events - the ones that can be assigned a probability without creating contradictions. For the problems in this and the next few posts, you’ll never need to worry about it.
Summary
| Concept | What It Is |
|---|---|
| Sample space $\Omega$ | The set of all possible outcomes |
| Event $A$ | A subset $A \subseteq \Omega$ |
| Probability $P$ | A function from events to $[0,1]$ satisfying the axioms |
| Axiom 1 | $P(A) \geq 0$ - probabilities are non-negative |
| Axiom 2 | $P(\Omega) = 1$ - something always happens |
| Axiom 3 | $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$ for pairwise disjoint $A_1, A_2, \ldots$ |
| Complement rule | $P(A^c) = 1 - P(A)$ |
| Inclusion-exclusion | $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ |
| Uniform model | $P(A) = \lvert A \rvert / \lvert \Omega \rvert$ when all outcomes are equally likely |
The axioms don’t tell you which probabilities to assign to which events - that depends on your model of the world. What they do is constrain the assignment: once you fix some probabilities, the rest are forced on you by the rules. That constraint is what makes probability a language rather than a free-for-all.
To answer the opening question: is your coin fair if you got 513 heads in 1000 flips? Probability theory gives you the tools to answer this precisely - using a binomial model, you can compute exactly how surprising 513 is under the fair-coin hypothesis. The answer, it turns out, is: not very. But that calculation requires random variables, which come next.
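As a preview, the number can be computed with nothing more than binomial coefficients. The sketch below rests on two assumptions beyond the text: the flips are independent with probability 1/2 each, and "surprising" means landing at least as far from 500 as 513 does, in either direction.

```python
from math import comb

n, observed = 1000, 513
deviation = abs(observed - n // 2)  # 13 heads away from the expected 500

# Count the sequences whose head count is at least as far from 500 as 513 is,
# in either direction, then divide by the 2**n equally likely sequences.
extreme_sequences = sum(
    comb(n, k) for k in range(n + 1) if abs(k - n // 2) >= deviation
)
p_two_sided = extreme_sequences / 2**n

print(round(p_two_sided, 3))
```

The result is a probability of roughly four in ten: a fair coin produces a result at least this lopsided often enough that 513 heads is no evidence of rigging.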
Read next: