Divide & Conquer - Small Savings That Compound
Helpful context:
- Sorting Algorithms - How Order Turns Hard Problems Easy
- Amortized Analysis - The True Cost of a Sequence
In 1969, a mathematician named Volker Strassen published a paper that surprised the entire field of numerical computing.
For decades, everyone assumed the only way to multiply two $n \times n$ matrices was to compute $n^3$ scalar products - three nested loops. Strassen showed you could do it with $n^{2.807}$ operations. The trick? When multiplying two matrices split into 2x2 blocks, you need 8 sub-multiplications naively. Strassen found a way to do it with only 7.
That difference - 8 versus 7, a reduction by one eighth - doesn’t sound like much. But through recursion, that single saved multiplication at every level of decomposition compounds into a 4× speedup at $n = 10{,}000$.
This is divide and conquer. Not just “split the problem in half and recurse” - that part is easy. The insight is that sometimes you can save work at the sub-problem level, and through recursion, small savings become large ones.
The Template
Every divide-and-conquer algorithm has the same three-step structure.
Divide: split the problem into $a$ smaller subproblems, each of size roughly $n/b$.
Conquer: solve each subproblem recursively. (Base case: when the problem is small enough, solve directly.)
Combine: merge the subproblem solutions into a solution for the original problem. This step costs $O(n^c)$ for some $c$.
The running time satisfies a recurrence relation - an equation that defines the cost $T(n)$ in terms of the cost of smaller subproblems:
$$T(n) = a \cdot T(n/b) + O(n^c)$$
The Big-O - Why the Right Algorithm Beats Faster Hardware solves this based on the critical exponent $\log_b a$:
- Case 1 ($c \lt \log_b a$): leaves dominate - $T(n) = \Theta(n^{\log_b a})$
- Case 2 ($c = \log_b a$): every level equal - $T(n) = \Theta(n^c \log n)$
- Case 3 ($c \gt \log_b a$): root dominates - $T(n) = \Theta(n^c)$
Merge Sort, Revisited
Merge sort has $a = 2$, $b = 2$, $c = 1$: split into 2 subproblems of half the size, combine in $O(n)$.
$$\log_b a = \log_2 2 = 1 = c$$
We’re in Case 2. Every level of the recursion tree does $O(n)$ work. There are $O(\log n)$ levels. Total:
$$T(n) = \Theta(n \log n)$$
Binary Search, Revisited
Binary search has $a = 1$, $b = 2$, $c = 0$: one subproblem of half the size, constant work to determine which half.
$$\log_b a = \log_2 1 = 0 = c$$
Case 2 again:
$$T(n) = \Theta(\log n)$$
The “combine” step is empty - no merging needed. All you do is look at the midpoint and recurse into one half.
Karatsuba Multiplication: Saving One Multiply
To multiply two $n$-digit numbers, the schoolbook algorithm uses $O(n^2)$ single-digit multiplications. Can we do better?
Write the two numbers as $x = x_1 \cdot B^{n/2} + x_0$ and $y = y_1 \cdot B^{n/2} + y_0$, where $B$ is the base. The product is:
$$xy = x_1 y_1 \cdot B^n + (x_1 y_0 + x_0 y_1) \cdot B^{n/2} + x_0 y_0$$
This seems to require four multiplications: $x_1 y_1$, $x_1 y_0$, $x_0 y_1$, and $x_0 y_0$.
Karatsuba’s observation: you can compute $x_1 y_0 + x_0 y_1$ using only one extra multiplication instead of two:
$$x_1 y_0 + x_0 y_1 = (x_1 + x_0)(y_1 + y_0) - x_1 y_1 - x_0 y_0$$
The quantities $x_1 y_1$ and $x_0 y_0$ were already computed. So you need only three multiplications total: $x_1 y_1$, $x_0 y_0$, and $(x_1 + x_0)(y_1 + y_0)$.
The recurrence: $a = 3$, $b = 2$, $c = 1$ (the additions for combining take $O(n)$).
$$\log_b a = \log_2 3 \approx 1.585 \gt 1 = c$$
Case 1 applies:
$$T(n) = \Theta(n^{\log_2 3}) \approx \Theta(n^{1.585})$$
For $n = 10^6$ digits: schoolbook does $10^{12}$ operations; Karatsuba does roughly $10^{9.5}$ - about 30 times fewer.
Closest Pair of Points: O(n log n) via Divide and Conquer
Given $n$ points in the plane, find the pair with minimum Euclidean distance. The naive approach checks all $\binom{n}{2}$ pairs: $O(n^2)$.
Here’s the divide-and-conquer approach that achieves $O(n \log n)$:
- Sort the points by $x$-coordinate (pay $O(n \log n)$ once at the start).
- Split at the median $x$-coordinate into left half $L$ and right half $R$.
- Recursively find $\delta_L$ = minimum distance within $L$, and $\delta_R$ = minimum distance within $R$. Let $\delta = \min(\delta_L, \delta_R)$.
- The closest pair might straddle the dividing line. Consider only points within distance $\delta$ of the dividing line (a “strip” of width $2\delta$). Sort these by $y$-coordinate and check each point against its next 7 neighbors in $y$-order.
Why only 7 neighbors? Because within the strip, no two points on the same side can be within distance $\delta$. A geometric argument shows at most 8 points can fit in any $\delta \times 2\delta$ rectangle while maintaining pairwise distance $\geq \delta$.
With the pre-sorting trick, the strip step is $O(n)$ per level (maintain two sorted arrays and merge). The recurrence is $T(n) = 2T(n/2) + O(n)$, giving $T(n) = O(n \log n)$.
Strassen Matrix Multiplication: The 7-Multiply Trick
The naive way to multiply two $n \times n$ matrices does $n^3$ scalar multiplications - three nested loops. Strassen showed you can do it in $O(n^{2.807})$ using a non-obvious algebraic identity.
Why you can split a matrix into blocks at all.
Start from the definition: every entry of $C$ is a dot product. $C[i][j] = \sum_{k=0}^{n-1} A[i][k] \cdot B[k][j]$ - row $i$ of $A$ dotted with column $j$ of $B$.
Now split both the row and the column at their midpoints. The sum breaks into two halves:
$$C[i][j] = \sum_{k=0}^{n/2-1} A[i][k] \cdot B[k][j] + \sum_{k=n/2}^{n-1} A[i][k] \cdot B[k][j]$$
The first sum runs over $k$ in the first half, the second over $k$ in the second half.
This is not a trick - it is always valid to split a sum. The question is just: which block of $A$ and which block of $B$ do those two halves correspond to?
Since this holds for every $(i, j)$ in the top-left quadrant, it holds for the whole block: $C_{11} = A_{11}B_{11} + A_{12}B_{21}$. Not magic - just a sum split. The same argument gives the other three blocks, just with different choices of which half of $i$ and $j$ you’re looking at:
| Output block | $i$ in… | $j$ in… | First half of $k$ | Second half of $k$ |
|---|---|---|---|---|
| $C_{11}$ | top | left | $A_{11} \times B_{11}$ | $A_{12} \times B_{21}$ |
| $C_{12}$ | top | right | $A_{11} \times B_{12}$ | $A_{12} \times B_{22}$ |
| $C_{21}$ | bottom | left | $A_{21} \times B_{11}$ | $A_{22} \times B_{21}$ |
| $C_{22}$ | bottom | right | $A_{21} \times B_{12}$ | $A_{22} \times B_{22}$ |
Notice the pattern: which block of $A$ you use depends only on which half of $i$ you are in (top rows use $A_{11}$/$A_{12}$, bottom rows use $A_{21}$/$A_{22}$). Which block of $B$ you use depends only on which half of $j$ you are in (left columns use $B_{11}$/$B_{21}$, right columns use $B_{12}$/$B_{22}$). The $k$ split always contributes one term from each side.
You might notice something that looks odd: in the term $A_{12}B_{21}$, the column-block index of $A$ (the 2 in $A_{12}$) matches the row-block index of $B$ (the 2 in $B_{21}$). This is not a coincidence - it is exactly the same contraction that happens at the scalar level. Scalar multiplication is $C[i][j] = \sum_k A[i][k] \cdot B[k][j]$: column index of $A$ contracts with row index of $B$. Block multiplication is $C_{IJ} = \sum_K A_{IK} \cdot B_{KJ}$: the same formula with block indices. $C_{11} = A_{11}B_{11} + A_{12}B_{21}$ is just $K=1$ and $K=2$ in that sum. Blocks behave algebraically like scalars under matrix multiplication - the structure is preserved exactly.
Since blocks behave like scalars, you could split into $k \times k$ blocks for any $k$, not just 2. The asymptotic complexity would be $O(n^{\log_k m})$ where $m$ is the number of block multiplications needed at that level. The question is whether larger blocks let you save proportionally more multiplications. The answer, so far, is no: for $3 \times 3$ blocks the naive count is 27 multiplications, and the best known algorithm (Laderman, 1976) reduces this to 23, giving $O(n^{\log_3 23}) \approx O(n^{2.854})$ - worse than Strassen’s $O(n^{2.807})$. You saved 4 out of 27 multiplications but the block size grew faster than the savings. This is why Strassen’s 2×2 decomposition remains a strong baseline despite decades of effort on larger block sizes.
Computing these directly requires 8 block multiplications. Since each block is $n/2 \times n/2$, the recurrence is $T(n) = 8T(n/2) + O(n^2)$, giving $T(n) = O(n^3)$. No improvement yet.
Strassen’s trick. Instead of computing each output block directly, compute 7 intermediate products using sums and differences of the blocks - then combine them:
You can verify $C_{11}$ by expanding: $M_1$ contributes $A_{11}B_{11} + A_{11}B_{22} + A_{22}B_{11} + A_{22}B_{22}$; $M_4$ adds $A_{22}B_{21} - A_{22}B_{11}$; $-M_5$ subtracts $A_{11}B_{22} + A_{12}B_{22}$; $M_7$ adds $A_{12}B_{21} + A_{12}B_{22} - A_{22}B_{21} - A_{22}B_{22}$. Most terms cancel, leaving $A_{11}B_{11} + A_{12}B_{21}$, which is exactly $C_{11}$. The same algebra works for the other three blocks.
This uses 7 block multiplications and 18 additions. Since additions cost $O(n^2)$ and multiplications recurse, the recurrence is $T(n) = 7T(n/2) + O(n^2)$.
$$\log_b a = \log_2 7 \approx 2.807 \gt 2 = c$$
Case 1 applies:
$$T(n) = \Theta(n^{2.807})$$
At $n = 10{,}000$: naive takes $10^{12}$ operations; Strassen takes roughly $10^{11.2}$ - about 6 times fewer.
def strassen(A, B):
n = len(A)
if n == 1:
return [[A[0][0] * B[0][0]]]
mid = n // 2
def split(M):
top = [row[:mid] for row in M[:mid]]
right = [row[mid:] for row in M[:mid]]
bot_left = [row[:mid] for row in M[mid:]]
bot_right = [row[mid:] for row in M[mid:]]
return top, right, bot_left, bot_right
def add(X, Y):
return [[X[i][j] + Y[i][j] for j in range(len(X[0]))] for i in range(len(X))]
def sub(X, Y):
return [[X[i][j] - Y[i][j] for j in range(len(X[0]))] for i in range(len(X))]
def combine(C11, C12, C21, C22):
top = [C11[i] + C12[i] for i in range(mid)]
bot = [C21[i] + C22[i] for i in range(mid)]
return top + bot
A11, A12, A21, A22 = split(A)
B11, B12, B21, B22 = split(B)
M1 = strassen(add(A11, A22), add(B11, B22))
M2 = strassen(add(A21, A22), B11)
M3 = strassen(A11, sub(B12, B22))
M4 = strassen(A22, sub(B21, B11))
M5 = strassen(add(A11, A12), B22)
M6 = strassen(sub(A21, A11), add(B11, B12))
M7 = strassen(sub(A12, A22), add(B21, B22))
C11 = add(sub(add(M1, M4), M5), M7)
C12 = add(M3, M5)
C21 = add(M2, M4)
C22 = add(sub(add(M1, M3), M2), M6)
return combine(C11, C12, C21, C22)
In practice, Strassen is only beneficial for $n \gtrsim 500$ on modern hardware - the constant factor from the 18 additions and the overhead of splitting and combining outweigh the savings for small matrices. Real libraries (like NumPy’s BLAS backend) use highly tuned naive multiplication for most sizes and fall back to Strassen or its successors only at large $n$.
Trivia. Strassen is not the theoretical state of the art. The race to minimize the matrix multiplication exponent $\omega$ has continued for decades: Coppersmith and Winograd (1987) reached $O(n^{2.376})$, and a series of refinements brought it to $O(n^{2.371})$ as of 2024. These algorithms use a technique called the laser method and have constant factors so astronomically large that nobody implements them - they are purely theoretical. The conjecture is that $\omega = 2$ (matrix multiplication might be doable in $O(n^2)$) but this is unproven.
In practice, Strassen itself is rarely used - and the reason is instructive. Modern CPUs have two weapons that the asymptotic analysis ignores: SIMD (a single instruction that multiplies 8 or 16 floats simultaneously) and cache blocking (tiling the matrix into chunks that fit in L1 cache so data is reused before being evicted). A naive triple loop, written carefully, maps perfectly to both - the inner loop is sequential memory access that a compiler can turn into a stream of AVX-512 vector instructions, and the blocking keeps every element in cache while all arithmetic involving it is done. Strassen breaks both of these. It needs to allocate and fill 7 extra $n/2 \times n/2$ matrices for M1-M7 - those additions scan the entire matrix, burning memory bandwidth before a single multiply happens. The recursive structure also creates irregular access patterns that fight the cache prefetcher. On top of that, the subtractions in Strassen introduce floating-point cancellation errors that make it numerically less stable than the naive algorithm (numerical stability means how much small rounding errors in the inputs get amplified by the computation - Strassen’s intermediate subtractions can cause these errors to grow, which matters in scientific computing where precision is critical). The crossover where Strassen wins is around $n \sim 500$-$1000$, and most practical matrix multiplications never reach that size or are handled by batching smaller ones. The gap between the best theoretical algorithm and what actually runs fast in production is wider here than almost anywhere else in algorithms.
Discomfort check. The 7 products look like magic - how did Strassen find them? He didn’t derive them by hand in the obvious way. He treated the problem algebraically: the entries of a $2 \times 2$ matrix product are bilinear forms in the inputs, and he searched for a decomposition of those forms using fewer rank-1 terms. The “7” is not obviously optimal - it’s the best known for $2 \times 2$ block matrices, but the exact minimum (whether 6 works or 7 is tight) was an open problem for decades. In 2022, DeepMind’s AlphaTensor found new algorithms for small matrix sizes using reinforcement learning, rediscovering known results and finding some new ones. The problem is harder than it looks.
Summary
| Algorithm | $a$ | $b$ | $c$ | Master Theorem case | Complexity |
|---|---|---|---|---|---|
| Binary search | 1 | 2 | 0 | Case 2 | $O(\log n)$ |
| Merge sort | 2 | 2 | 1 | Case 2 | $O(n \log n)$ |
| Karatsuba | 3 | 2 | 1 | Case 1 | $O(n^{1.585})$ |
| Strassen | 7 | 2 | 2 | Case 1 | $O(n^{2.807})$ |
| Closest pair | 2 | 2 | 1 | Case 2 | $O(n \log n)$ |
| Naive matrix multiply | 8 | 2 | 2 | Case 2 (equal) | $O(n^3)$ |
Read next: