Helpful context:


In 1977, Robert Solovay and Volker Strassen discovered something unsettling: you can check whether a 100-digit number is prime in milliseconds, but the answer is only correct with probability $1 - 2^{-100}$.

You could be wrong.

But think about what $2^{-100}$ actually means. A 100-bit number requires a specific sequence of 100 coin flips to be wrong. The probability that a cosmic ray flipped a bit in your computer’s RAM during the computation is substantially higher. The probability that you made an arithmetic error if you’d done the check by hand is orders of magnitude higher.

When is “almost certainly correct” good enough? For primality testing, cryptography, scientific simulation, and a dozen other applications: always. The question isn’t whether to use randomized algorithms - it’s understanding what kind of randomness you’re introducing.


Two Flavors of Randomized Algorithms

Randomized algorithms come in two types, and the distinction matters.

Las Vegas algorithms are always correct. Their running time is random - it varies from run to run - but they never return a wrong answer. Randomized quicksort is the canonical example: it always produces a sorted array, but the time taken depends on pivot choices.

Monte Carlo algorithms always run in bounded time. But they may return an incorrect answer with some probability. Primality testing (Miller-Rabin) is the canonical example: it runs in $O(k \log^2 n)$ time after $k$ rounds, but it has a false-positive rate of at most $4^{-k}$ for composite numbers.

You can run a Las Vegas algorithm until it terminates, then you have a correct answer with probability 1. For Monte Carlo, you drive the error probability below an acceptable threshold by running more rounds.


Randomized Quicksort: Expected $O(n \log n)$

Standard quicksort picks a pivot deterministically - often the first or last element. On sorted input, this gives worst-case $O(n^2)$. An adversary who knows your pivot rule can construct inputs that break your algorithm.

The fix: choose the pivot uniformly at random from the remaining elements.

Claim. Randomized quicksort has expected $O(n \log n)$ comparisons, regardless of input.

Proof. Let $z_1 < z_2 < \cdots < z_n$ be the sorted elements. Define indicator variables:

$$X_{ij} = \mathbf{1}[z_i \text{ and } z_j \text{ are compared during the sort}], \quad i < j$$

Two elements $z_i$ and $z_j$ are compared if and only if one of them is chosen as a pivot before any other element from the set $\{z_i, z_{i+1}, \ldots, z_j\}$. Why? Because once any element strictly between $z_i$ and $z_j$ is chosen as pivot, $z_i$ and $z_j$ are separated into different subproblems and will never be compared.

By symmetry, each of the $j - i + 1$ elements in $\{z_i, \ldots, z_j\}$ is equally likely to be the first pivot chosen among them. So the probability that $z_i$ or $z_j$ is that first pivot - triggering a comparison between them - is exactly $2/(j - i + 1)$.

By linearity of expectation:

$$\mathbb{E}\left[\sum_{i < j} X_{ij}\right] = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \frac{2}{j - i + 1} = 2 \sum_{i=1}^{n} \sum_{k=2}^{n-i+1} \frac{1}{k} \leq 2n \sum_{k=1}^{n} \frac{1}{k} = 2n H_n = O(n \log n)$$

where $H_n \approx \ln n$ is the harmonic number. This is a Las Vegas algorithm: always sorts correctly, expected $O(n \log n)$ time.


Karger’s Min-Cut Algorithm

Given an undirected graph $G = (V, E)$, a cut is a partition of vertices into two non-empty sets. The min-cut is the cut with the fewest edges crossing it.

Karger’s algorithm: Repeatedly pick a random edge $(u, v)$ and contract it - merge $u$ and $v$ into a single vertex, remove self-loops, but keep parallel edges. Repeat until two vertices remain. The edges between them form a candidate cut.

One run takes $O(n^2)$ time and succeeds (finds the min-cut) with probability at least $2/(n(n-1))$. That’s $\Omega(1/n^2)$, which sounds terrible - but if we repeat $O(n^2 \log n)$ times and take the smallest cut found, we succeed with probability at least $1 - 1/n$.

Why does it work? Let $k$ be the min-cut size. Every vertex must have degree $\geq k$ (otherwise that vertex would be a smaller cut), so $|E| \geq nk/2$. The probability of contracting a min-cut edge at any step when $m$ vertices remain is at most $k/(mk/2) = 2/m$.

The probability of never contracting a min-cut edge through all $n - 2$ contractions:

$$\Pr[\text{success}] \geq \prod_{m=3}^{n} \left(1 - \frac{2}{m}\right) = \frac{1}{n} \cdot \frac{2}{n-1} \cdot \frac{3}{n-2} \cdots = \frac{2}{n(n-1)}$$

This is a Monte Carlo algorithm: fast, but with a small probability of returning the wrong cut.


Bloom Filters

A Bloom filter answers set membership queries probabilistically: “Is this element in the set?” It can say “definitely not” (no false negatives) or “probably yes” (occasional false positives). It uses far less memory than storing the set exactly.

The structure: a bit array $B$ of $m$ bits, all initialized to 0, and $k$ independent hash functions $h_1, \ldots, h_k : U \to \{1, \ldots, m\}$.

Insert $x$: Set $B[h_i(x)] = 1$ for all $i = 1, \ldots, k$.

Query $x$: Return “yes” if $B[h_i(x)] = 1$ for all $i$; return “no” otherwise.

A “no” is certain: if any bit $B[h_i(x)]$ is 0, then $x$ was never inserted. A “yes” could be a false positive - all $k$ bits happen to be 1 from other insertions.

The false positive probability after inserting $n$ elements into $m$ bits with $k$ hash functions:

$$p_{\text{FP}} \approx \left(1 - e^{-kn/m}\right)^k$$

The optimal number of hash functions is $k^{\ast} = (m/n) \ln 2$, giving $p_{\text{FP}} \approx (0.618)^{m/n}$. With 10 bits per element and 7 hash functions, the false positive rate is about 0.8%.

Bloom filters are everywhere: web browsers use them to check URLs against lists of malicious sites, databases use them to avoid unnecessary disk lookups, distributed systems use them to summarize large datasets.


Reservoir Sampling

A data stream is arriving: user events, sensor readings, log entries. You don’t know how many items there will be in total. You want to maintain a uniform random sample of exactly $k$ items from everything seen so far, using $O(k)$ memory.

ReservoirSample(stream, k):
    reservoir = first k elements
    i = k + 1
    for each new item x:
        j = random integer in [1, i]
        if j <= k:
            reservoir[j] = x
        i = i + 1
    return reservoir

Invariant. After seeing $i$ items, each of the $i$ items is in the reservoir with probability exactly $k/i$.

Proof. Base case: after $k$ items, each is in the reservoir with probability $k/k = 1$. Inductive step: when item $i+1$ arrives, it enters the reservoir with probability $k/(i+1)$. An existing reservoir item survives this step with probability $1 - (k/(i+1)) \cdot (1/k) = i/(i+1)$. Combined probability that an item seen before step $i+1$ is still in the reservoir:

$$\frac{k}{i} \cdot \frac{i}{i+1} = \frac{k}{i+1}$$

Exactly what the invariant requires. Reservoir sampling uses $O(k)$ space regardless of stream length - even if the stream never ends.


Discomfort check. Can we derandomize these algorithms - make them deterministic with the same guarantees? Sometimes yes. For algorithms that use random hash functions, carefully constructed families of hash functions (universal hash families, $k$-wise independent families) can replace true randomness. The method of conditional expectations can sometimes derandomize probabilistic proofs. But sometimes randomness is genuinely useful - there are problems where the best known randomized algorithm is simpler or faster than the best known deterministic algorithm. Primality testing is the famous case: the AKS algorithm (2002) proved primality can be tested in deterministic polynomial time, but it’s much slower in practice than Miller-Rabin. The randomized version is used in every real cryptographic library.


Why Randomized Data Structures Work Better

A deterministic hash table can always be broken: if an adversary knows your hash function, they can feed you a sequence of keys that all collide, reducing performance to $O(n)$ per operation.

A hash table with a randomly chosen hash function from a universal family has expected $O(1)$ per operation regardless of the input. The adversary can’t predict your hash function.

A skip list is a randomized alternative to a balanced BST. Each element is promoted to higher “express lane” levels with probability $1/2$. With high probability, search, insert, and delete all take $O(\log n)$ - not because of careful rebalancing, but because random promotion creates a balanced structure in expectation.

The theme: randomization can replace the careful bookkeeping that deterministic algorithms require. Sometimes the randomized version is dramatically simpler to implement and analyze.


Yao’s Minimax Principle

When we analyze a randomized algorithm, we ask: what is the expected cost on the worst-case input? But there is a dual question: what is the expected cost of the best deterministic algorithm on a random input? Yao’s minimax principle says these two quantities are related in a precise way.

Setup. Fix a computational problem. Let $\mathcal{A}$ be the set of all deterministic algorithms for the problem, and let $\mathcal{I}$ be the set of all inputs. Let $\text{cost}(A, I)$ denote the cost of algorithm $A$ on input $I$.

A randomized algorithm is a distribution over deterministic algorithms. A distribution $\mu$ over inputs is called a “hard distribution” if every deterministic algorithm does poorly in expectation under $\mu$.

Yao’s Minimax Principle. For any distribution $\mu$ over inputs:

$$\min_{A \in \mathcal{A}} \mathbb{E}_{I \sim \mu}[\text{cost}(A, I)] \leq \min_{\text{rand } A} \max_{I \in \mathcal{I}} \mathbb{E}[\text{cost}(A, I)]$$

The left side is the best any deterministic algorithm can do in expectation against the distribution $\mu$. The right side is the expected cost of the best randomized algorithm on its worst-case input. The principle says: the left side is a lower bound on the right side.

Why does this hold? Any randomized algorithm can be written as a distribution $p$ over deterministic algorithms. Its expected cost on worst-case input is $\max_I \mathbb{E}_{A \sim p}[\text{cost}(A, I)]$. By minimax duality:

$$\max_I \mathbb{E}_{A \sim p}[\text{cost}(A, I)] \geq \min_A \mathbb{E}_{I \sim \mu}[\text{cost}(A, I)]$$

for any distribution $\mu$. Minimizing the left side over all randomized algorithms preserves the inequality.

Application: lower bound for comparison-based sorting. Consider sorting under a uniform random permutation. Every deterministic comparison-based sorting algorithm must make $\Omega(n \log n)$ comparisons in expectation on a uniformly random permutation - this follows from information theory (there are $n!$ permutations, each comparison gives 1 bit). By Yao’s principle, any randomized comparison-based sorting algorithm must also make $\Omega(n \log n)$ expected comparisons on its worst-case input.

Key insight. To prove a lower bound on randomized complexity, it suffices to find one hard distribution $\mu$ such that every deterministic algorithm is slow in expectation under $\mu$. The hard distribution becomes the proof witness.


The Coupon Collector Problem

A classic problem that appears everywhere in the analysis of randomized algorithms: there are $n$ types of coupons. Each draw gives you a coupon chosen uniformly at random (with replacement). How many draws $T$ do you need to collect all $n$ types?

Analysis via phases. Decompose the process into $n$ phases, where phase $i$ is the period spent collecting the $i$-th new (distinct) coupon. At the start of phase $i$, you already have $i - 1$ distinct coupons, so each draw gives a new coupon with probability $(n - (i-1))/n$. The number of draws in phase $i$ is geometric with success probability $(n - i + 1)/n$, so its expected value is $n/(n - i + 1)$.

$$\mathbb{E}[T] = \sum_{i=1}^{n} \frac{n}{n - i + 1} = n \sum_{j=1}^{n} \frac{1}{j} = n H_n$$

where $H_n = 1 + 1/2 + \cdots + 1/n \approx \ln n + 0.577$ is the $n$-th harmonic number. So $\mathbb{E}[T] = n \ln n + O(n)$.

Concentration via union bound. Let’s show $T$ is concentrated around $n \ln n$. Fix $t = 2n \ln n$. For each coupon type $i$, the probability that coupon $i$ is not seen in $t$ draws is:

$$\Pr[\text{coupon } i \text{ not in } t \text{ draws}] = \left(1 - \frac{1}{n}\right)^t \leq e^{-t/n} = e^{-2 \ln n} = \frac{1}{n^2}$$

By a union bound over all $n$ coupon types:

$$\Pr[T > 2n \ln n] \leq n \cdot \frac{1}{n^2} = \frac{1}{n}$$

So with probability at least $1 - 1/n$, you collect all coupons within $2n \ln n$ draws. This is a “with high probability” bound - the probability of failure shrinks polynomially in $n$.

Applications. Load balancing: if $n$ tasks are assigned to $n$ servers uniformly at random, the coupon collector bound describes when every server gets at least one task. Random graph coverage: in a random walk on a graph with $n$ nodes, how long before every node is visited? In hashing: how many distinct keys before every bucket has been touched?


Balls and Bins

Throw $m$ balls uniformly at random into $n$ bins. This simple model captures hashing, load balancing, and birthday-style collision problems.

Birthday paradox. When $m = \Theta(\sqrt{n})$ balls are thrown, with constant probability two balls land in the same bin. The exact calculation: the probability that all $m$ balls land in distinct bins is:

$$\Pr[\text{no collision}] = \prod_{i=0}^{m-1} \left(1 - \frac{i}{n}\right) \approx e^{-\sum_{i=0}^{m-1} i/n} = e^{-m(m-1)/(2n)}$$

Setting $e^{-m^2/(2n)} = 1/2$ gives $m \approx \sqrt{2n \ln 2} \approx 1.177\sqrt{n}$. With about $1.177\sqrt{n}$ people in a room, there is a 50% chance two share a birthday ($n = 365$ gives $m \approx 23$, the familiar result).

Maximum load. When $m = n$ (equal balls and bins), what is the maximum load - the number of balls in the most loaded bin? The answer is $\Theta(\log n / \log \log n)$ with high probability. The intuition: the most loaded bin acts like the maximum of $n$ roughly-Poisson random variables with mean 1. The Poisson$(1)$ tail decays fast enough that the maximum concentrates around $\log n / \log \log n$.

Concretely, the probability that any fixed bin receives $c \log n / \log \log n$ or more balls (for large enough $c$) is less than $1/n^2$ by a Chernoff bound, and a union bound over $n$ bins closes the argument.

Two-choice paradigm. A striking improvement: instead of throwing each ball into a uniformly random bin, choose 2 bins uniformly at random and place the ball in the less loaded one (break ties arbitrarily). The maximum load drops from $\Theta(\log n / \log \log n)$ to $\Theta(\log \log n)$ - an exponential improvement in the exponent.

Why such a dramatic difference? With one random choice, once a bin starts slightly ahead, it keeps accumulating balls proportional to its load. With two choices, you actively avoid the heavier bin, preventing any single bin from running away. The “power of two choices” is a general phenomenon - even a tiny amount of information about the current state dramatically reduces imbalance.


WalkSAT and Random Walks for 2-SAT

2-SAT can be solved deterministically in $O(n + m)$ time via strongly connected components. But the random walk approach reveals something deeper: local search with randomness can efficiently find satisfying assignments by making structured progress toward a solution.

Papadimitriou’s algorithm for 2-SAT. Given $n$ variables and $m$ clauses, all of size 2:

start with any assignment
repeat n^2 times:
    if current assignment satisfies all clauses: output it and halt
    pick any unsatisfied clause
    flip one of its two literals uniformly at random
output "unsatisfiable"

Analysis as a random walk. Fix any satisfying assignment $s^{\ast}$. Define $X_t$ = the Hamming distance from the current assignment to $s^{\ast}$ at step $t$, i.e., the number of variables where they differ.

When we pick an unsatisfied clause, $s^{\ast}$ must satisfy it (since $s^{\ast}$ satisfies all clauses). So at least one of the clause’s two literals agrees with $s^{\ast}$. When we flip a uniformly random literal from the clause:

  • If we flip the literal that agrees with $s^{\ast}$: $X_{t+1} = X_t + 1$ (we move away from $s^{\ast}$).
  • If we flip the literal that disagrees with $s^{\ast}$: $X_{t+1} = X_t - 1$ (we move toward $s^{\ast}$).

Since at least one of the two literals agrees with $s^{\ast}$, $\Pr[X_{t+1} = X_t - 1] \geq 1/2$. This is a random walk on $\{0, 1, \ldots, n\}$ with a bias of at least $1/2$ toward 0 and a reflecting barrier at 0.

The expected number of steps to reach 0 from position $i$ in such a walk is $O(n^2)$. Starting from any assignment (which has $X_0 \leq n$), we reach a satisfying assignment in expected $O(n^2)$ steps. Running for $n^2$ steps gives success with constant probability. Repeating $O(\log(1/\delta))$ times drives failure probability below $\delta$.

Schoning’s algorithm for 3-SAT. For 3-SAT, each unsatisfied clause has 3 literals, and $s^{\ast}$ satisfies at least one. Flipping a random literal gives $\Pr[X \text{ decreases}] \geq 1/3$ and $\Pr[X \text{ increases}] \leq 2/3$. Now the walk drifts away from $s^{\ast}$. The drift ratio is 2-to-1 against us, and a careful gambler’s ruin calculation shows the expected hitting time from position $i$ grows as $\Theta((4/3)^n)$.

Running the walk for $O((4/3)^n)$ steps and repeating gives a 3-SAT algorithm with runtime $O((4/3)^n \cdot \text{poly}(n))$. This is better than brute force ($O(2^n)$) by a constant factor in the exponent, and it is one of the best-known algorithms for 3-SAT in terms of the base of the exponential for random instances.


Derandomization via Method of Conditional Expectations

Randomized algorithms establish that good solutions exist by showing the expected value of some objective is large. Derandomization converts this into a deterministic algorithm that achieves at least as good a solution.

General framework. Suppose a randomized algorithm achieves $\mathbb{E}[f(X_1, \ldots, X_n)] \geq c$ where $X_1, \ldots, X_n$ are independent random variables (e.g., random bits). Define the conditional expectation:

$$\Phi(v_1, \ldots, v_i) = \mathbb{E}[f(X_1, \ldots, X_n) \mid X_1 = v_1, \ldots, X_i = v_i]$$

Since $\mathbb{E}[\Phi(v_1, \ldots, v_{i-1}, X_i)] = \Phi(v_1, \ldots, v_{i-1})$, at least one choice of $v_i$ keeps $\Phi \geq c$. Concretely: for each variable, compute $\Phi$ for both choices, and pick the one that does not decrease the conditional expectation. After fixing all $n$ variables, we have a deterministic assignment with $f \geq c$.

This works whenever $\Phi$ can be computed efficiently. The conditional expectation serves as a “pessimistic estimator” - it tracks our progress toward a good solution.

Example: MAX-CUT. Given a graph $G = (V, E)$ with $m$ edges, we want to partition $V$ into two sets $S$ and $\bar{S}$ to maximize the number of edges crossing the cut.

Randomized baseline: assign each vertex to $S$ or $\bar{S}$ independently with probability $1/2$. Each edge $(u, v)$ is cut if $u$ and $v$ land on opposite sides, which happens with probability $1/2$. By linearity of expectation, the expected cut size is $m/2$.

To derandomize: process vertices $v_1, v_2, \ldots, v_n$ one at a time. When placing vertex $v_i$, assign it to whichever side ($S$ or $\bar{S}$) maximizes the number of edges already cut among edges incident to previously placed vertices. This greedy rule maintains the invariant that the conditional expectation does not decrease, guaranteeing a cut of size $\geq m/2$ deterministically.

The randomized algorithm proved that cuts of size $\geq m/2$ always exist. The derandomization extracts one explicitly, without any randomness in the final algorithm.


Summary

Algorithm Type Complexity Guarantee
Randomized quicksort Las Vegas $O(n \log n)$ expected Always correct, random time
Karger’s min-cut Monte Carlo $O(n^2)$ per trial, $O(n^4 \log n)$ total Correct w.p. $\geq 1 - 1/n$ after $O(n^2 \log n)$ trials
Bloom filter query Monte Carlo $O(k)$ per operation No false negatives; false positive rate $\approx (0.618)^{m/n}$
Reservoir sampling Las Vegas $O(1)$ per item, $O(k)$ space Exactly uniform sample
Miller-Rabin primality Monte Carlo $O(k \log^2 n)$ Composite declared prime w.p. $\leq 4^{-k}$
Coupon collector - $\mathbb{E}[T] = n H_n = \Theta(n \log n)$ All $n$ types collected; $T = O(n \log n)$ w.h.p.
Balls into $n$ bins ($m = n$) - Max load $\Theta(\log n / \log \log n)$ w.h.p. Two-choice variant: max load $\Theta(\log \log n)$ w.h.p.
Papadimitriou 2-SAT Las Vegas $O(n^2)$ expected steps Finds satisfying assignment if one exists
Schoning 3-SAT Monte Carlo $O((4/3)^n \cdot \text{poly}(n))$ Correct w.h.p.; better than brute-force $O(2^n)$

The probabilistic guarantee is not a weakness. It’s a design choice. When you can make the error probability $2^{-100}$, “probably correct” and “certainly correct” are indistinguishable in practice - and the randomized version may be orders of magnitude faster.


Read next: