Dynamic Programming - Never Solve the Same Problem Twice
Helpful context:
In 1950, Richard Bellman was a mathematician at the RAND Corporation, and he had a problem. He was doing research on multistage decision processes - essentially, how to make a sequence of optimal choices - but the US military was funding his work. The word “mathematical” in a project title, Bellman later wrote, would have “red-flagged” the research. He needed a name that sounded impressive but said nothing.
He chose “dynamic programming.”
The word “dynamic” was meant to suggest time-varying, multistage processes. The word “programming” had nothing to do with computers - it meant planning, as in “linear programming.” Bellman later admitted he picked the name because it was impossible to use pejoratively. No one could say dynamic programming was a bad name; it was simply a meaningless one.
And yet the technique the name describes is one of the most powerful ideas in all of computing.
The Core Observation
Many hard problems share two properties. Once you notice them, you can’t unsee them.
Optimal substructure. The optimal answer to the problem uses optimal answers to smaller subproblems. This sounds tautological, but it isn’t - many problems don’t have this property.
The clearest counterexample is the longest simple path (a path that visits no vertex twice). Take this graph:
The longest simple path from q to t is q→r→s→t (3 edges). Now suppose optimal substructure held and you tried to decompose through the midpoint r: the longest simple path from q to r is q→s→r (it detours through s to get 2 edges instead of 1), and the longest simple path from r to t is r→s→t (also detours through s). But you can’t concatenate them - s appears in both subpaths, so the combined route q→s→r→s→t repeats a vertex. The optimal subpaths interfere with each other.
Shortest paths don’t have this problem: if you split the shortest path from q to t at any intermediate vertex r, both halves are themselves shortest paths and they never share vertices (a repeated vertex would mean a cycle, which only makes the path longer). That no-sharing guarantee is what makes shortest paths decomposable - and longest simple paths lack it entirely.
Overlapping subproblems. The recursive approach to solving the problem visits the same subproblems repeatedly. This is what separates DP from divide-and-conquer: in merge sort, the subproblems are disjoint - you never sort the same subarray twice. In DP, you do.
When both properties hold, you can trade memory for time. Compute each subproblem once, store the result, and look it up whenever you need it again. This transforms exponential algorithms into polynomial ones.
How to Recognize a DP Problem
The two conditions for DP are necessary but recognizing them in a new problem is non-trivial. Here is the diagnostic process.
Step 1: Is there a natural notion of “subproblem”? DP requires that the problem decompose into smaller instances of the same problem. If you can describe the problem as “find the best X of length/size n,” ask: does the best X of size n always contain the best X of size n-1 as a prefix/suffix/substructure? If yes, that’s optimal substructure.
Step 2: Do subproblems overlap? If the recursion tree has repeated subproblems, memoization saves work. If every recursive call touches a fresh subproblem (like merge sort, which never revisits the same subarray), divide-and-conquer suffices and DP adds no benefit.
Step 3: Design the state. The hardest part of DP is often choosing what to memoize. The state is the key you use to look up a cached answer - the minimum information needed to uniquely identify a subproblem. The state space is the total number of distinct states: if your state is two indices each from 0 to n, you have $n^2$ states - fine. If your state needs to track which subset of n cities you’ve visited, you have $2^n$ states - exponential, which defeats the purpose. The state must capture everything needed to compute the subproblem’s answer, but nothing extra. For sequence problems, the state is usually a pair of indices. For graph problems, it might be (current node, visited set) - but “visited set” has $2^n$ possibilities, so only works for small n (the Traveling Salesman Problem approach).
Step 4: Write the recurrence and check base cases. The recurrence should have strictly smaller subproblems on the right-hand side - otherwise the recursion may not terminate.
When DP beats greedy. Greedy works when a locally optimal choice is globally optimal - when the problem has the greedy choice property (taking the best immediate option never closes off a better future). DP is needed when the best local choice depends on what else you’ve already chosen. Coin change is the canonical example: with coins {1, 3, 4} and target 6, greedy picks 4 then two 1s (cost 3), while DP finds 3+3 (cost 2). Greedy fails because the best first coin depends on future choices.
When neither works. If subproblem A depends on subproblem B, which depends back on A, there’s no valid order to fill the table - the dependency graph has a cycle, and you’d need the answer before you’ve computed it. Neither DP nor greedy handles this directly. Separately, many problems are NP-hard - meaning no polynomial-time algorithm is known for them, and it’s widely believed none exists. These problems often have optimal substructure in principle, but the state space required to express it is exponential: the Traveling Salesman Problem needs to track which subset of cities have been visited (there are $2^n$ such subsets), and subset sum with large arbitrary weights has a weight capacity too big to table efficiently. DP still works on these - just in exponential time, with better constants than naive brute force.
The Fibonacci Disaster
Start with the simplest example. The Fibonacci sequence: $F(0) = 0$, $F(1) = 1$, $F(n) = F(n-1) + F(n-2)$.
A naive recursive implementation is correct but catastrophic:
fib(n):
if n <= 1: return n
return fib(n-1) + fib(n-2)
How many calls does fib(50) make? Let $T(n)$ be the call count. Then $T(n) = T(n-1) + T(n-2) + 1$, which solves to $T(n) = O(\phi^n)$ where $\phi \approx 1.618$. For $n = 50$, that’s roughly $2^{50} \approx 10^{15}$ function calls. Your laptop would take thousands of years.
Where does φ = 1.618 come from? Try dividing consecutive Fibonacci numbers: 5/3 = 1.667, 8/5 = 1.600, 13/8 = 1.625, 55/34 = 1.618. The ratio keeps hovering closer and closer to some fixed value. Call that value φ.
To find it exactly, start from the recurrence F(n) = F(n-1) + F(n-2) and divide every term by F(n-1):
$$\frac{F(n)}{F(n-1)} = 1 + \frac{F(n-2)}{F(n-1)}$$
At the limit, the left side is φ. The right side is 1 + 1/φ (since F(n-2)/F(n-1) = 1 divided by F(n-1)/F(n-2), which is also approaching φ). So:
$$\phi = 1 + \frac{1}{\phi}$$
Multiply both sides by φ: $\phi^2 = \phi + 1$, which rearranges to $\phi^2 - \phi - 1 = 0$.
Now just use the quadratic formula. For $ax^2 + bx + c = 0$, the solution is $\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$. Here a = 1, b = -1, c = -1, so the discriminant is $b^2 - 4ac = (-1)^2 - 4(1)(-1) = 1 + 4 = 5$. That’s exactly where the 5 comes from - it’s just arithmetic. This gives $\phi = \frac{1 \pm \sqrt{5}}{2}$. Since φ is a positive ratio, we take the positive root: $\phi = \frac{1 + \sqrt{5}}{2} \approx 1.618$.
There’s actually a deeper way to derive this using Generating Functions - Sequences Disguised as Power Series - a technique where you encode the entire Fibonacci sequence as coefficients of a power series and use algebra to extract a closed form. That approach also yields Binet’s formula $F(n) = (\phi^n - \psi^n)/\sqrt{5}$ directly, without any guesswork.
Now look at the subproblem structure. To compute $F(50)$, we need $F(49)$ and $F(48)$. To compute $F(49)$, we need $F(48)$ and $F(47)$. $F(48)$ is computed twice. $F(47)$ is computed three times. $F(46)$ is computed five times. The redundancy is exponential.
There are only 50 distinct subproblems: $F(0), F(1), \ldots, F(49)$. If we compute each one once and remember the result, we make exactly 50 function calls.
memo = {}
fib(n):
if n <= 1: return n
if n in memo: return memo[n]
memo[n] = fib(n-1) + fib(n-2)
return memo[n]
Or iteratively, filling a table from the bottom up:
dp[0] = 0
dp[1] = 1
for i in range(2, n+1):
dp[i] = dp[i-1] + dp[i-2]
return dp[n]
Both approaches take $O(n)$ time. The difference from naive recursion is not a constant factor - it’s the difference between milliseconds and the heat death of the universe.
Top-Down vs. Bottom-Up
There are two ways to implement DP, and they’re equivalent in asymptotic terms.
Top-down (memoization): Write the natural recursive solution, but cache results in a table. The call structure is the same as naive recursion - you just never compute anything twice. Natural to write; works on problems where you can’t easily determine the order in which to fill the table.
Bottom-up (tabulation): Figure out the dependency order among subproblems, and fill the table in that order. No recursion overhead. Better cache locality. Sometimes lets you save memory by discarding old rows.
For Fibonacci, the bottom-up approach is clearly better - you only ever need the last two values, so you can reduce space from $O(n)$ to $O(1)$. For problems with more complex dependency structures, top-down is often easier to code correctly.
Coin Change
Here’s a problem where greedy fails. You have coins of denominations $c_1, c_2, \ldots, c_k$ (unlimited supply of each), and you want to make exactly $n$ cents using the fewest coins possible.
Greedy says: always take the largest coin that fits. For US coins $\{25, 10, 5, 1\}$, this works perfectly. But for coins $\{10, 6, 1\}$ and target $n = 12$? Greedy takes $10 + 1 + 1 = 3$ coins. DP finds $6 + 6 = 2$ coins.
Let $\text{dp}[t]$ = minimum coins needed to make $t$ cents. The recurrence:
$$\text{dp}[t] = 1 + \min_{c_i \leq t} \text{dp}[t - c_i]$$
with $\text{dp}[0] = 0$ and $\text{dp}[t] = \infty$ if no coins can make $t$. Fill from $t = 1$ to $t = n$.
For $n = 12$, $k = 3$, coins $= \{10, 6, 1\}$:
| $t$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\text{dp}[t]$ | 0 | 1 | 2 | 3 | 4 | 5 | 1 | 2 | 3 | 4 | 1 | 2 | 2 |
$\text{dp}[12] = 2$: two coins of 6. Time: $O(kn)$. Space: $O(n)$.
Longest Common Subsequence
Given two strings $s$ of length $m$ and $t$ of length $n$, find the longest subsequence common to both. A subsequence doesn’t need to be contiguous - “ACE” is a subsequence of “ABCDE.” This matters for DNA alignment, diff tools, plagiarism detection.
Define $\text{dp}[i][j]$ = length of the LCS of $s[1..i]$ and $t[1..j]$. The recurrence:
$$\text{dp}[i][j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ \text{dp}[i-1][j-1] + 1 & \text{if } s_i = t_j \\ \max(\text{dp}[i-1][j],\ \text{dp}[i][j-1]) & \text{otherwise} \end{cases}$$
The logic: if the last characters match, the LCS includes that character and extends the LCS of the two prefixes. If they don’t match, we try dropping either the last character of $s$ or the last character of $t$, and take the better option.
Fill the table row by row. The answer is $\text{dp}[m][n]$. Time: $O(mn)$. Space: $O(mn)$, reducible to $O(\min(m,n))$ by keeping only the current and previous row.
To recover the actual sequence (not just the length), trace back through the table: when $s_i = t_j$, that character is in the LCS; otherwise, move in the direction of the larger value.
Edit Distance (Levenshtein)
The edit distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to transform one into the other. “kitten” → “sitting” takes 3 operations. This is how spell checkers rank suggestions, how Git diffs work, how DNA sequences are compared.
Define $\text{dp}[i][j]$ = edit distance between $s[1..i]$ and $t[1..j]$. The recurrence:
$$\text{dp}[i][j] = \min\begin{cases} \text{dp}[i-1][j] + 1 & \text{(delete } s_i\text{)} \\ \text{dp}[i][j-1] + 1 & \text{(insert } t_j\text{)} \\ \text{dp}[i-1][j-1] + \mathbf{1}[s_i \neq t_j] & \text{(substitute or match)} \end{cases}$$
Base cases: $\text{dp}[i][0] = i$ (delete all of $s$), $\text{dp}[0][j] = j$ (insert all of $t$).
Time: $O(mn)$. Space: $O(\min(m,n))$ using rolling arrays. The table for “kitten” vs “sitting”:
| s | i | t | t | i | n | g | ||
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
| k | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| i | 2 | 2 | 1 | 2 | 3 | 4 | 5 | 6 |
| t | 3 | 3 | 2 | 1 | 2 | 3 | 4 | 5 |
| t | 4 | 4 | 3 | 2 | 1 | 2 | 3 | 4 |
| e | 5 | 5 | 4 | 3 | 2 | 2 | 3 | 4 |
| n | 6 | 6 | 5 | 4 | 3 | 3 | 2 | 3 |
The answer, $\text{dp}[6][7] = 3$, is in the bottom-right corner.
0/1 Knapsack
You have $n$ items, each with weight $w_i$ and value $v_i$. A knapsack holds total weight at most $W$. Which items should you pack to maximize value?
Greedy by value-to-weight ratio fails on 0/1 knapsack. Try: items with $(v, w) \in \{(6, 2), (10, 5), (12, 6)\}$, $W = 8$. Greedy by ratio picks item 1 and item 2 (total: 16, weight: 7). But items 1 and 3 give value 18 at weight 8. Greedy misses it.
Define $\text{dp}[i][w]$ = maximum value using a subset of items $1..i$ with total weight at most $w$.
$$\text{dp}[i][w] = \begin{cases} 0 & \text{if } i = 0 \\ \text{dp}[i-1][w] & \text{if } w_i > w \\ \max(\text{dp}[i-1][w],\ v_i + \text{dp}[i-1][w - w_i]) & \text{otherwise} \end{cases}$$
Either we skip item $i$ (take the best solution without it), or we include it (add $v_i$ to the best solution for the remaining weight). Time: $O(nW)$. Space: $O(W)$ with rolling arrays.
Discomfort check. Knapsack runs in $O(nW)$. Isn’t that polynomial? No - this is the key subtlety. $W$ is a number in the input, but it’s encoded in $\log_2 W$ bits. “Polynomial time” means polynomial in the input size (number of bits), not in the numeric values. An algorithm running in $O(nW)$ is polynomial in $n$ and the value of $W$, but exponential in the bit-length of $W$. Double the number of bits used to store $W$ - say from 10 bits to 20 bits - and $W$ goes from about 1,000 to about 1,000,000. Runtime multiplies by 1,000, not 2. This is why knapsack is NP-hard despite having a “polynomial-looking” DP. The term for this is pseudopolynomial.
Failure Modes: When DP Goes Wrong
State space explosion. The 0/1 knapsack runs in $O(nW)$ - pseudopolynomial, not polynomial. When $W$ is exponentially large (e.g., weights are exponential in $n$), this is no better than brute force. Choosing the right state representation is crucial.
Wrong subproblem definition. A subtle trap: defining subproblems that don’t compose correctly. If OPT(n) doesn’t contain OPT(n-1) as a substructure - if the optimal solution to size n uses a completely different approach than size n-1 - the recurrence is wrong. This happens with problems where the “best so far” depends on global context that can’t be captured in a local state.
Confusing overlapping with repeated. Just because a recursive solution is slow doesn’t mean DP helps. If the slowness is from exponential branching with no repeated subproblems, memoization doesn’t help. Profile the recursion tree before adding DP.
DP on Trees and DAGs
Not all DP problems fit neatly in a 1D or 2D table. Two important generalizations.
Longest Path in a DAG
On a directed acyclic graph (DAG), there are no cycles, so the subproblem dependencies never loop back on themselves. You can therefore fill a DP table in topological order - an ordering where every edge goes from earlier to later in the order, so every subproblem is solved before anything that depends on it.
The classic problem: given node weights, find the path through the graph with the maximum total weight.
$$\text{dp}[v] = w_v + \max_{u \to v} \text{dp}[u]$$
Process nodes in topological order. When you reach $v$, every predecessor $u$ is already solved.
Concrete example: five nodes, each with a weight.
Topological order: A, B, C, D, E.
| Node | Predecessors | dp value |
|---|---|---|
| A | none | 1 |
| B | A (dp=1) | 1 + 3 = 4 |
| C | A (dp=1) | 1 + 2 = 3 |
| D | B (dp=4), C (dp=3) | max(4, 3) + 4 = 8 |
| E | D (dp=8) | 8 + 1 = 9 |
The longest path is A → B → D → E with total weight 9.
from collections import defaultdict
def longest_path(nodes, edges, weight):
# nodes must be given in topological order
children = defaultdict(list)
for u, v in edges:
children[u].append(v)
dp = {v: weight[v] for v in nodes}
for v in nodes:
for c in children[v]:
dp[c] = max(dp[c], dp[v] + weight[c])
return max(dp.values())
nodes = ['A', 'B', 'C', 'D', 'E']
edges = [('A','B'), ('A','C'), ('B','D'), ('C','D'), ('D','E')]
weight = {'A':1, 'B':3, 'C':2, 'D':4, 'E':1}
print(longest_path(nodes, edges, weight)) # 9
Maximum Weight Independent Set on a Tree
A tree is a special case of a DAG: no cycles, and a natural root-to-leaf direction. DP on trees works bottom-up - compute answers for leaves first, then propagate up to the root (post-order traversal).
Problem: choose a subset of nodes with no two adjacent (no parent-child pair both selected), maximising total weight.
For each node $v$, track two values:
- $\text{dp}[v][0]$: best total weight in $v$’s subtree when $v$ is excluded
- $\text{dp}[v][1]$: best total weight in $v$’s subtree when $v$ is included
$$\text{dp}[v][1] = w_v + \sum_{c \in \text{children}(v)} \text{dp}[c][0]$$ $$\text{dp}[v][0] = \sum_{c \in \text{children}(v)} \max\bigl(\text{dp}[c][0],\ \text{dp}[c][1]\bigr)$$
If $v$ is included, no child can be (add their excluded values). If $v$ is excluded, each child can go either way - take whichever is better.
Concrete example:
Fill bottom-up:
| Node | dp[v][0] | dp[v][1] | Working |
|---|---|---|---|
| 4 (leaf) | 0 | 2 | no children |
| 3 (leaf) | 0 | 5 | no children |
| 2 | max(0, 2) = 2 | 3 + dp[4][0] = 3 | child is node 4 |
| 1 | max(2,3) + max(0,5) = 8 | 4 + dp[2][0] + dp[3][0] = 6 | children are 2 and 3 |
Answer = max(dp[1][0], dp[1][1]) = max(8, 6) = 8, achieved by selecting nodes {2, 3}.
from collections import defaultdict
def max_independent_set(adj, weights, root):
def dfs(v, parent):
excl, incl = 0, weights[v]
for child in adj[v]:
if child == parent:
continue
c_excl, c_incl = dfs(child, v)
excl += max(c_excl, c_incl) # child can go either way
incl += c_excl # v included => child must be excluded
return excl, incl
excl, incl = dfs(root, -1)
return max(excl, incl)
adj = defaultdict(list)
for u, v in [(1,2), (1,3), (2,4)]:
adj[u].append(v)
adj[v].append(u)
weights = {1:4, 2:3, 3:5, 4:2}
print(max_independent_set(adj, weights, root=1)) # 8: nodes {2, 3}
Both run in $O(n)$ - every node and edge is visited exactly once.
Summary
| Problem | Recurrence structure | Complexity | Key insight |
|---|---|---|---|
| Fibonacci | 1D, depends on two previous | $O(n)$ time, $O(1)$ space | Overlapping subproblems |
| Coin change | 1D, depends on all smaller | $O(kn)$ | Greedy fails for general denominations |
| Longest Common Subsequence | 2D, match or skip | $O(mn)$ | Optimal substructure via last characters |
| Edit distance | 2D, three operations | $O(mn)$ | Insert, delete, substitute |
| 0/1 Knapsack | 2D, include or skip | $O(nW)$ pseudopolynomial | $W$ value is exponential in bit-length |
| Tree independent set | Tree DP, two states per node | $O(n)$ | Post-order: children before parents |
The pattern is always the same. Identify the subproblems. Write the recurrence. Decide on top-down or bottom-up. Optimize space if needed.
Bellman’s meaningless name turned out to describe something profound: the insight that optimal solutions are built from optimal sub-solutions, and that we only need to find each sub-solution once.
Read next: