Shortest Paths - When Greed Works and When It Doesn't // Megha Bose

Helpful context:

In 1956, Edsger Dijkstra was on a coffee date with his fiancée in Amsterdam. He had no pen or paper. Sitting at a café table, he worked out an algorithm in his head for finding the shortest path between two cities in a road network.

He published it three years later. The paper is half a page long - one of the shortest papers in computer science history. It describes one of the most influential algorithms ever written.

Today, a version of that algorithm runs on every phone that uses GPS navigation. It finds the shortest route between two points on a road network with tens of millions of nodes in milliseconds.

The Problem

Given a weighted directed graph $G = (V, E, w)$ and a source vertex $s$, find the shortest (minimum-weight) path from $s$ to every other reachable vertex. Formally:

$$\delta(s, v) = \min_{p : s \leadsto v} \sum_{e \in p} w(e)$$

If no path exists, $\delta(s, v) = \infty$. If a negative-weight cycle lies on some path to $v$, the “shortest path” is $-\infty$ (you can loop forever to reduce the cost).

The right algorithm depends on two properties of the graph:

Do edges have negative weights?
Do you need one source to all others, or all pairs?

Dijkstra’s Algorithm: Greedy, Non-Negative Weights

Dijkstra’s algorithm is a greedy approach that works when all edge weights are non-negative. It maintains a priority queue of tentative distances and finalizes vertices in order of increasing distance.

import heapq

def dijkstra(graph, s):
    # graph[u] = list of (v, weight) pairs
    dist = {v: float('inf') for v in graph}
    parent = {v: None for v in graph}
    dist[s] = 0
    pq = [(0, s)]  # (distance, vertex)

    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue  # stale entry; skip
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                parent[v] = u
                heapq.heappush(pq, (dist[v], v))

    return dist, parent

The key operation is relaxation: if we find a shorter path to $v$ through $u$, update dist[v].

Correctness invariant: When vertex $u$ is extracted from the priority queue, dist[u] equals the true shortest-path distance $\delta(s, u)$.

Proof. By induction on the order of extraction.

Base case: $s$ is extracted first with dist[s] = 0 = $\delta(s, s)$.

Inductive step: Assume every vertex extracted before $u$ has a correct dist value. Let dist[u] = d. Suppose for contradiction that $\delta(s, u) < d$. Then there exists a shortest path $P: s \leadsto u$ with total weight $\delta(s, u) < d$.

Since $s$ is extracted and $u$ is not yet correctly finalized, somewhere along $P$ there must be a transition from an extracted vertex to an unexplored one. Define:

$x$ = the last extracted vertex on path $P$
$y$ = the immediate successor of $x$ on $P$ (so $y$ has not yet been extracted)

Since $x$ was extracted before $u$, the inductive hypothesis gives $\text{dist}[x] = \delta(s, x)$. When $x$ was extracted, we relaxed all edges out of $x$, so:

$$\text{dist}[y] \leq \text{dist}[x] + w(x, y) = \delta(s, x) + w(x, y)$$

Since $P$ is a shortest path, optimal substructure holds: the sub-path $s \leadsto x \leadsto y$ along $P$ is itself a shortest path from $s$ to $y$, so:

$$\delta(s, x) + w(x, y) = \delta(s, y)$$

Since $y$ lies on $P$ and all remaining edges to $u$ have non-negative weight:

$$\delta(s, y) \leq \delta(s, u)$$

Chaining all three:

$$\text{dist}[y] \leq \delta(s, x) + w(x, y) = \delta(s, y) \leq \delta(s, u) < d = \text{dist}[u]$$

So $\text{dist}[y] < \text{dist}[u]$, which means $y$ would have been extracted from the priority queue before $u$ - contradicting our assumption that $y$ is unexplored when $u$ is extracted. Therefore $\delta(s, u) = d$. $\blacksquare$

Time complexity. With a binary heap: $V$ extract_min operations at $O(\log V)$ each, plus at most $E$ decrease_key operations at $O(\log V)$ each. Total: $O((V + E) \log V)$.

With a Fibonacci heap: decrease_key becomes $O(1)$ amortized. Total: $O(E + V \log V)$.

For dense graphs ($E = \Theta(V^2)$) with an adjacency matrix: scan all vertices to find the minimum, giving $O(V^2)$ - better than the heap version in this regime.

Why Dijkstra Fails on Negative Edges

Discomfort check. Why can’t Dijkstra handle negative edge weights? The invariant depends on a crucial property: once you extract a vertex $u$ and finalize its distance, no future path can be shorter. This holds when all weights are non-negative - any path through a vertex not yet extracted is at least as long as the direct path. But with negative edges, a future path might shortcut back through an already-finalized vertex via a negative edge, improving its distance. By the time you’d discover this, you’ve already removed $u$ from the queue and assumed its distance was final. Bellman-Ford handles this by simply not finalizing anything prematurely - it repeatedly relaxes all edges.

Bellman-Ford: Dynamic Programming, Negative Weights

Bellman-Ford handles negative edge weights (but not negative cycles). It’s a DP algorithm: define the subproblem by the number of edges allowed.

$$\text{dp}[v][k] = \text{shortest path from } s \text{ to } v \text{ using at most } k \text{ edges}$$

Recurrence:

$$\text{dp}[v][k] = \min\left(\text{dp}[v][k-1],\ \min_{(u,v) \in E} \text{dp}[u][k-1] + w(u,v)\right)$$

In practice, we run $V - 1$ rounds of edge relaxation over all edges:

def bellman_ford(vertices, edges, s):
    # edges: list of (u, v, weight)
    dist = {v: float('inf') for v in vertices}
    dist[s] = 0

    for _ in range(len(vertices) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w

    # Detect negative cycles
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("Negative cycle detected")

    return dist

Why does this work? Here is the core intuition, then the proof.

Think of Bellman-Ford as sending a wave of information outward from $s$, one hop at a time. After round 1, every vertex knows the best way to reach $s$ in a single hop. After round 2, it knows the best two-hop route. After round $k$, it knows the best $k$-hop route. Since any simple path in a $V$-vertex graph uses at most $V - 1$ edges, running $V - 1$ rounds is enough to settle every reachable vertex.

flowchart LR s(["s"]) -->|"round 1"| A(["v₁"]) s -->|"round 1"| B(["v₂"]) A -->|"round 2"| C(["v₃"]) B -->|"round 2"| D(["v₄"]) C -->|"round 3"| E(["v₅"])

Formal claim. Let $\delta_k(s, v)$ denote the minimum weight of any path from $s$ to $v$ using at most $k$ edges ($\infty$ if none exists). After round $k$, for every vertex $v$:

$$\text{dist}[v] \leq \delta_k(s, v)$$

Proof by induction on $k$.

Base case ($k = 0$): After initialisation, $\text{dist}[s] = 0 = \delta_0(s, s)$, and $\text{dist}[v] = \infty = \delta_0(s, v)$ for all $v \neq s$, since no 0-edge path to any other vertex exists.

Inductive step: Assume after round $k - 1$, for all $v$:

$$\text{dist}[v] \leq \delta_{k-1}(s, v)$$

Fix any vertex $v$ and let $P^\star$ be an optimal path from $s$ to $v$ using at most $k$ edges. Two cases:

Case 1: $P^\star$ uses at most $k - 1$ edges. The shortest path using at most $k$ edges equals the shortest path using at most $k - 1$ edges, so the inductive hypothesis directly gives $\text{dist}[v] \leq \delta_k(s, v)$ after round $k$.
Case 2: $P^\star$ uses exactly $k$ edges. Let $u$ be the second-to-last vertex on $P^\star$, so $P^\star = s \leadsto u \to v$. The prefix $s \leadsto u$ uses $k - 1$ edges and, by optimal substructure, is itself a shortest $(k-1)$-edge path to $u$. By the inductive hypothesis, after round $k - 1$:

$$\text{dist}[u] \leq \delta_{k-1}(s, u)$$

In round $k$ we relax edge $(u, v, w)$:

$$\text{dist}[v] \leq \text{dist}[u] + w \leq \delta_{k-1}(s, u) + w = \delta_k(s, v)$$

Why $V - 1$ rounds suffice. In a graph with no negative cycles, every shortest path is simple - it never repeats a vertex, because repeating a vertex would mean traversing a cycle, and any non-negative cycle only adds weight. A simple path on at most $V$ vertices uses at most $V - 1$ edges, so:

$$\delta_{V-1}(s, v) = \delta(s, v)$$

After $V - 1$ rounds the claim gives $\text{dist}[v] \leq \delta(s, v)$. Since dist[v] is always the weight of some actual path, $\text{dist}[v] \geq \delta(s, v)$. Together: $\text{dist}[v] = \delta(s, v)$. $\blacksquare$

Negative cycle detection. After $V - 1$ rounds we have the best possible simple-path distances. If any edge $(u, v, w)$ can still be relaxed - i.e. $\text{dist}[u] + w < \text{dist}[v]$ - then the only way to improve beyond a simple path is to traverse a cycle. Since non-negative cycles never help, this cycle must be negative. Hence a negative-weight cycle is reachable from $s$.

Time complexity. $V - 1$ rounds, each scanning all $E$ edges: $O(VE)$.

Floyd-Warshall: All-Pairs Shortest Paths

Need shortest paths between all pairs of vertices? Floyd-Warshall does it in $O(V^3)$.

Intuition: grow the set of allowed intermediaries.

Label the vertices $1, 2, \ldots, V$. Run through them one by one and ask at each step:

For each pair $(i, j)$: what is the shortest path from $i$ to $j$ if only vertices $\{1, 2, \ldots, k\}$ may be used as intermediate stops?

When $k = 0$, no intermediaries are allowed - only direct edges. When $k = V$, all vertices are available: these are the true shortest paths.

At each step, adding vertex $k$ to the allowed set introduces exactly one new option for every pair $(i, j)$: route through $k$. The shortest path either does not use $k$ (answer unchanged from step $k - 1$) or uses $k$ exactly once, splitting into two sub-paths - $i \leadsto k$ and $k \leadsto j$ - both of which use only intermediaries $\{1, \ldots, k-1\}$.

A concrete example:

flowchart LR A(["A"]) -->|3| B(["B"]) B -->|2| C(["C"]) A -->|10| C

With no intermediaries ($k = 0$): the only known path from A to C is the direct edge, cost 10. When B enters the allowed set: check $A \to B \to C = 3 + 2 = 5 < 10$. Update. The direct edge is beaten by the two-hop path.

The recurrence. Define:

$$\text{dp}[i][j][k] = \text{shortest distance from } i \text{ to } j \text{ using only vertices } \{1, \ldots, k\} \text{ as intermediaries}$$

Base case: $\text{dp}[i][j][0] = w(i,j)$ if edge $(i, j)$ exists, else $\infty$; and $\text{dp}[i][i][0] = 0$.

For $k \geq 1$:

$$\text{dp}[i][j][k] = \min\Bigl(\underbrace{\text{dp}[i][j][k-1]}{\text{skip } k},\quad \underbrace{\text{dp}[i][k][k-1] + \text{dp}[k][j][k-1]}{\text{route through } k}\Bigr)$$

Why does this decomposition work? If the shortest path from $i$ to $j$ routes through $k$, it splits as $i \leadsto k \leadsto j$. Could either sub-path loop back through $k$? In a graph without negative cycles, shortest paths are simple - they visit each vertex at most once. So both sub-paths use only intermediaries from $\{1, \ldots, k-1\}$, and $\text{dp}[\cdot][\cdot][k-1]$ gives their lengths correctly by the inductive hypothesis.

Correctness. By induction on $k$. The base case holds by definition. Assuming $\text{dp}[\cdot][\cdot][k-1]$ is correct, the recurrence considers both cases (use $k$ or not) and takes the minimum - which is correct by hypothesis. After $V$ iterations, $\text{dp}[i][j][V] = \delta(i, j)$ for all pairs. $\blacksquare$

In practice, the $k$ dimension is eliminated by updating in-place. This is safe because $\text{dp}[k][k] = 0$ throughout: routing through $k$ to reach $k$ itself adds nothing, so the values $\text{dp}[i][k]$ and $\text{dp}[k][j]$ used on the right-hand side are already their correct $k - 1$ values.

def floyd_warshall(vertices, edges):
    # edges: list of (u, v, weight)
    dist = {u: {v: float('inf') for v in vertices} for u in vertices}
    for v in vertices:
        dist[v][v] = 0
    for u, v, w in edges:
        dist[u][v] = w

    for k in vertices:
        for i in vertices:
            for j in vertices:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]

    return dist

Negative cycles: if dist[i][i] becomes negative, vertex $i$ lies on a negative-weight cycle.

Floyd-Warshall works with negative edges as long as there are no negative cycles. Time: $O(V^3)$. Space: $O(V^2)$.

A*: Dijkstra with Direction

Dijkstra finds the shortest path to all vertices. When you only care about one destination $t$, you can stop early and guide the search using a heuristic - a function $h(v)$ estimating the remaining cost from $v$ to $t$.

A* assigns each vertex a priority $f(v) = g(v) + h(v)$, where $g(v)$ is the best known distance from $s$ to $v$. Vertices with low $f$-value get explored first: cheap to reach from $s$ and estimated to be close to $t$.

Admissibility. A heuristic is admissible if it never overestimates: $h(v) \leq \delta(v, t)$ for all $v$. An admissible heuristic is optimistic - it may underestimate, but never inflates the true remaining cost. This is what guarantees A* finds the optimal path.

Intuition. Dijkstra expands outward from $s$ uniformly in all directions - it has no information about where $t$ is. A* biases the expansion toward $t$: a vertex that is cheap to reach but far from $t$ (high $h$) gets deprioritized; a vertex pointing toward $t$ (low $h$) gets promoted to the front of the queue.

Example

flowchart LR S(["S
h = 4"]) -->|2| A(["A
h = 5"]) S -->|3| B(["B
h = 2"]) A -->|6| T(["T
h = 0"]) B -->|2| T

The $h$ values are admissible: $h(S) = 4 \leq 5$, $h(A) = 5 \leq 6$, $h(B) = 2 \leq 2$, $h(T) = 0$. The optimal path is $S \to B \to T$ with cost 5.

Dijkstra (priority by $g$):

Extracted	$g$	Open set after
S	0	A $(g=2)$, B $(g=3)$
A	2	B $(g=3)$, T $(g=8)$
B	3	T $(g=5)$
T	5	done

Dijkstra extracts A even though it leads to a suboptimal path - it has no way to know A is a dead end.

A* (priority by $f = g + h$):

Extracted	$f = g + h$	Open set after
S	$0 + 4 = 4$	A $(f=7)$, B $(f=5)$
B	$3 + 2 = 5$	A $(f=7)$, T $(f=5)$
T	$5 + 0 = 5$	done

A* never extracts A. The heuristic recognized that A, despite being cheaply reachable ($g = 2$), is estimated to be far from T ($h = 5$), giving total estimated cost 7. B costs a bit more to reach ($g = 3$) but sits close to T ($h = 2$), giving estimated total 5 - matching the true optimal. A* finds the answer with one fewer node explored.

Consistency

A heuristic is consistent (or monotone) if for every edge $(u, v, w)$:

$$h(u) \leq w + h(v)$$

This is the triangle inequality on heuristic estimates: the estimated cost to $t$ should not improve by taking a detour through $v$. Consistency implies admissibility. With a consistent heuristic, $f$-values are non-decreasing along any explored path, so A* never needs to re-expand a vertex - it behaves exactly like Dijkstra with modified priorities.

Euclidean distance satisfies both properties: it never overestimates road distance, and it obeys the geometric triangle inequality.

Proof of Correctness

Claim. If $h$ is admissible and $h(t) = 0$, then when A* first extracts $t$, $g(t) = \delta(s, t)$.

Proof. Let $P^\star$ be a shortest $s$-$t$ path of cost $\delta(s, t)$.

At the moment A* is about to extract $t$, consider the vertices of $P^\star$ from $s$ toward $t$. Either all predecessors of $t$ on $P^\star$ have been extracted - in which case the progressive relaxations along $P^\star$ gave $g(t) \leq \delta(s, t)$ already - or there is a first vertex $v$ on $P^\star$ still in the open set.

In the second case, let $u$ be $v$’s predecessor on $P^\star$ (already extracted). By the same inductive argument as Dijkstra’s correctness proof, restricted to the extracted prefix of $P^\star$: $g(u) = \delta(s, u)$ when $u$ is extracted. The edge $u \to v$ is then relaxed, giving:

$$g(v) \leq g(u) + w(u, v) = \delta(s, u) + w(u, v) = \delta(s, v)$$

where the last equality uses the fact that $u \to v$ is a segment of the optimal path $P^\star$. Applying admissibility:

$$f(v) = g(v) + h(v) \leq \delta(s, v) + \delta(v, t) = \delta(s, t)$$

Since A* extracted $t$ before $v$, we have $f(t) \leq f(v) \leq \delta(s, t)$. Since $h(t) = 0$, this gives $g(t) \leq \delta(s, t)$.

In both cases, $g(t) \leq \delta(s, t)$. Since $g(t)$ is always the weight of some actual path, $g(t) \geq \delta(s, t)$. Therefore $g(t) = \delta(s, t)$. $\blacksquare$

Connection to Dijkstra

With $h(v) = 0$ for all $v$, A* reduces to Dijkstra exactly. With the perfect heuristic $h(v) = \delta(v, t)$, the $f$-value of every vertex on an optimal path equals $\delta(s, t)$ - they all share the same priority - and A* reaches $t$ without expanding a single off-path vertex. In practice, a good heuristic can reduce exploration by orders of magnitude on large graphs like road networks.

Applications

GPS navigation. Google Maps and similar apps use bidirectional Dijkstra or A* on road networks with hundreds of millions of nodes. Contraction hierarchies - a preprocessing technique that “shortcutts” the graph - allow queries to answer in milliseconds.

Internet routing. OSPF (Open Shortest Path First) uses Dijkstra on each router to compute its forwarding table. BGP (Border Gateway Protocol) uses a distributed Bellman-Ford variant.

Currency arbitrage. Model currencies as vertices; exchange rates as edge weights $-\log(\text{rate})$. A negative cycle in this graph corresponds to a sequence of conversions that yields a profit. Bellman-Ford detects it: if after $V - 1$ rounds a relaxation is still possible, an arbitrage opportunity exists. Financial systems scan for this constantly.

Summary

Algorithm	Handles negative weights?	Complexity	Use case
Dijkstra (binary heap)	No	$O((V + E) \log V)$	Single-source, non-negative weights
Dijkstra (dense)	No	$O(V^2)$	Dense graphs
Dijkstra (Fibonacci heap)	No	$O(E + V \log V)$	Single-source, theoretical best
Bellman-Ford	Yes (detects negative cycles)	$O(VE)$	Single-source, negative weights
Floyd-Warshall	Yes (no negative cycles)	$O(V^3)$	All-pairs
A*	No (same as Dijkstra)	$O((V + E) \log V)$ worst case	Single-target, with heuristic

The shortest path problem is one of the most thoroughly studied in algorithms. Dijkstra’s 1959 half-page paper launched a literature that spans decades. What makes it remarkable is not just the algorithm - it’s the insight that greedy finalization works when costs are non-negative, that DP on path length works when they’re not, and that the two approaches are complementary faces of the same underlying idea.

Read next:

Network Flow & Matching - The Max, the Min, and Why They’re Equal