Minimum Spanning Trees & Union-Find // Megha Bose

Prerequisites: Graph Algorithms & Traversal · Greedy Algorithms

In the previous post on greedy algorithms, we used proof by contradiction to show that two greedy strategies for the robot tour problem are incorrect. Proof by contradiction can both prove and disprove correctness depending on how it is applied, and it is often used in combination with other techniques. Here we introduce another essential tool - proof by mathematical induction - and use it to prove that a greedy algorithm is, in fact, correct.

Mathematical Induction

The essence of mathematical induction is using past success to guarantee future success. If a property is satisfied at some base case and its truth at step $k$ guarantees its truth at step $k+1$, then the property holds for all steps.

Formally: if a statement $P(n)$ is true for $n = 1$, and the truth of $P(k)$ implies the truth of $P(k+1)$ for every natural number $k$, then $P(n)$ is true for all natural numbers $n$.

Anywhere we can assign natural numbers to a process - such as counting steps in an algorithm - induction may apply. It often feels almost magical: it doesn’t matter how far along we are, as long as we have maintained the property so far and can guarantee it in the immediate next step.

The Minimum Spanning Tree Problem

Suppose you are the ruler of a kingdom and need to connect all cities so that any city is reachable from any other, directly or indirectly, using the minimum total cost of roads. Building a road between different pairs of cities has different costs. This is the Minimum Spanning Tree (MST) problem.

To minimise cost, the simplest greedy idea is to repeatedly pick the cheapest available road that does not create a cycle (since a cycle wastes resources - removing any edge from a cycle preserves connectivity). We continue until all cities are connected. This is Kruskal’s algorithm.

Consider four cities with the following road costs:

graph LR A((A)) -- 1 --- B((B)) A((A)) -- 3 --- C((C)) B((B)) -- 2 --- C((C)) B((B)) -- 4 --- D((D)) C((C)) -- 5 --- D((D))

Kruskal’s picks edges in order of cost: A-B (1), B-C (2), then skips A-C (3) because A, B, C are already connected - adding it would form a cycle. Then picks B-D (4). Result:

graph LR A((A)) -- 1 --- B((B)) B((B)) -- 2 --- C((C)) B((B)) -- 4 --- D((D))

Total cost = 7, and all cities are connected with no cycles.

But can we assume it is correct? No - we must prove it.

Proving Kruskal’s Algorithm Correct

The property we want to hold at every step: the set of roads chosen so far is a subset of at least one optimal solution.

Base case (step 1, no roads chosen): An empty set is a subset of every set, including any optimal solution. ✓

Inductive step: Assume the property holds after choosing $k$ roads. We now pick the $(k+1)$-th road - call it $e$ - the cheapest remaining edge that does not form a cycle. We want to show that the $k$ old roads together with $e$ is still a subset of some optimal solution.

Suppose for contradiction that it is not. Then there exists an optimal solution $T$ containing the $k$ old roads but not $e$. Since $T$ connects all cities, adding $e$ to $T$ creates a cycle. If every road in this cycle other than $e$ already belonged to our $k$ chosen roads, we would never have selected $e$ (it would have completed a cycle). So at least one edge $f$ in this cycle is not among our $k$ chosen roads, meaning $f$ has not been selected yet - and therefore $f$ must be at least as expensive as $e$ (since $e$ is the cheapest available edge at this step).

Now remove $f$ from $T$ and add $e$ in its place. Since $e \leq f$ in cost, this new solution is no more expensive than $T$. And since $f$ was part of a cycle, removing it preserves connectivity. This contradicts $T$ being optimal - unless $e$ and $f$ have equal cost, in which case the new solution is also optimal and contains all $k$ old roads plus $e$.

Either way, the $k$ roads plus $e$ form a subset of some optimal solution. By induction, this holds all the way until all cities are connected - at which point Kruskal’s algorithm has found an MST.

What This Tells Us

The greedy strategy of always picking the cheapest edge that doesn’t create a cycle is provably correct for the MST problem. Contrast this with the robot tour from the previous post, where the greedy nearest-neighbour strategy failed. The difference is not greed itself - it is whether the problem structure guarantees that local optimal choices accumulate into a global optimum. Mathematical induction, combined with proof by contradiction, is one of the primary tools for establishing that guarantee.

Union-Find (disjoint-set) data structures make cycle detection in Kruskal’s algorithm efficient - allowing the full algorithm to run in $O(E \log E)$ time, where $E$ is the number of edges. We will explore those structures in detail alongside other MST approaches in later posts.

Read Next: Network Flow & Matching