Simulated Annealing - Escaping Local Optima by Accepting Bad Moves // Megha Bose

Helpful context:

When you cool molten steel too fast, it becomes brittle. The atoms don’t have time to find low-energy configurations - they freeze into whatever disordered arrangement they happen to be in. When you cool it slowly (the process called annealing), the atoms can keep rearranging as the temperature drops, settling eventually into a crystal structure with minimal energy. The slow cooling gives the system time to find its ground state.

In 1983, Kirkpatrick, Gelatt, and Vecchi realized this is exactly what happens when you try to escape local optima in combinatorial optimization. Cool the “temperature” slowly - accept worse solutions less and less often as time goes on - and you can escape local traps. The slower you cool, the better the solution you find.

This is simulated annealing.

The Boltzmann Acceptance Criterion

Start with any feasible solution $s$. At each step:

Propose a random neighbor $s'$ of the current solution.
Compute $\Delta E = f(s') - f(s)$ (the change in objective value - positive means worse, for minimization).
If $\Delta E < 0$: accept $s'$ unconditionally (it’s an improvement).
If $\Delta E \geq 0$: accept $s'$ with probability

$$P(\text{accept}) = e^{-\Delta E / T}$$

where $T > 0$ is the current temperature.

This is the Metropolis criterion, from statistical physics. At high temperature, $e^{-\Delta E/T} \approx 1$ for moderate $\Delta E$ - almost any move is accepted, enabling broad exploration. At low temperature, $e^{-\Delta E/T} \approx 0$ for any $\Delta E > 0$ - only improvements are accepted, just like hill climbing.

The temperature schedule interpolates between these: early in the algorithm, we explore broadly; late in the algorithm, we exploit the best region found so far.

The Acceptance Probability in Practice

Let’s make this concrete. Suppose we’re minimizing a cost function and a candidate move worsens the cost by $\Delta E$. At temperature $T = 5$:

A small worsening $\Delta E = 1$: accepted with probability $e^{-1/5} \approx 0.82$.
A moderate worsening $\Delta E = 5$: accepted with probability $e^{-5/5} \approx 0.37$.
A large worsening $\Delta E = 20$: accepted with probability $e^{-20/5} \approx 0.018$.

Large deteriorations are accepted much less often than small ones, even at the same temperature. The algorithm takes careful steps away from good regions rather than random leaps - it preferentially accepts small uphill moves and rarely accepts large ones.

As temperature drops, even small uphill moves become less likely:

At $T = 1$: $\Delta E = 1$ is accepted with probability $e^{-1} \approx 0.37$.
At $T = 0.1$: $\Delta E = 1$ is accepted with probability $e^{-10} \approx 0.00005$.

The algorithm gradually freezes into a near-optimal solution.

Temperature Schedules

The annealing schedule $T(t)$ controls how quickly we cool.

Geometric cooling (practical). The standard:

$$T_{k+1} = \alpha \cdot T_k, \quad \alpha \in [0.90, 0.99].$$

Start at $T_0$ chosen so that roughly 80% of moves are accepted (meaning the initial temperature is calibrated to the typical energy differences in the problem). Cool by a factor $\alpha$ every $L$ iterations (where $L$ is also a parameter). Stop when $T$ is near zero or no improvement occurs for many consecutive steps.

Geometric cooling is fast and effective in practice, but provides no guarantee. The rate matters: $\alpha = 0.99$ (slow cooling) gives much better solutions than $\alpha = 0.90$ (fast cooling), at the cost of more computation.

Logarithmic cooling (theoretical). If $T(t) = c / \log(t + 1)$ for a sufficiently large constant $c$ (related to the largest energy barrier between any local optimum and the global optimum), then SA converges to the global optimum in probability. The constant $c$ is the depth of the deepest local trap in the landscape.

The catch: “converges in probability” is misleading. The required number of steps is astronomical - essentially exponential. This schedule is theoretically optimal but practically useless.

Reheating. Some implementations periodically raise $T$ after the algorithm has cooled and converged to a local optimum. The idea: reheat enough to escape, then cool again. This is sometimes called iterated SA and can improve results when the problem has many local optima of similar depth.

The Connection to MCMC

At a fixed temperature $T$, the Metropolis acceptance criterion defines a Markov chain over the solution space. The chain’s stationary distribution is the Boltzmann distribution:

$$\pi_T(s) \propto e^{-f(s)/T}.$$

At high $T$, this is nearly uniform - all solutions are roughly equally likely. As $T \to 0$, the distribution concentrates on the global minimum: solutions with lower energy have exponentially higher probability.

Simulated annealing is running a sequence of Markov chains at decreasing temperatures, hoping that at each temperature the chain mixes well enough (reaches near-stationarity) before we cool further. If we cool too fast, the chain doesn’t mix - we don’t sample from $\pi_T$ accurately - and we can get stuck. The logarithmic cooling schedule is slow precisely because it gives enough time at each temperature for the chain to mix.

This connection to MCMC means everything we know about Markov chain mixing - spectral gaps, conductance, the relationship between mixing time and energy barriers - applies to understanding when SA works well and when it doesn’t.

Simulated Annealing for TSP

The same 2-opt neighborhood as pure local search, but now we accept worse tours with probability $e^{-\Delta L / T}$, where $\Delta L > 0$ is the increase in tour length.

What this buys you: when 2-opt gets stuck at a local optimum, SA can accept a slightly longer tour to escape. From that new (slightly worse) position, a different sequence of improving 2-opt moves may lead to a much better local optimum.

In practice, SA with 2-opt moves finds tours dramatically better than pure 2-opt hill climbing on hard TSP instances - often within 1 - 2% of optimal on benchmark instances. The key parameter is the cooling schedule: too fast and the algorithm gets stuck; too slow and you waste computation.

Discomfort check. Simulated annealing is guaranteed to find the global optimum if you cool slowly enough. So why not always cool slowly?

“Slowly enough” means the logarithmic cooling schedule $T(t) = c / \log(t+1)$. To see why this is impractical: the mixing time of the Markov chain at temperature $T$ near 0 is exponential in $1/T$ (because escaping a local trap requires a sequence of uphill moves, each exponentially unlikely). The logarithmic schedule decays so slowly that the total time needed to escape any local trap is still exponential. You end up needing exponentially many steps - just as many as exhaustive search, roughly speaking. The theoretical guarantee exists, but it’s not useful. In practice, you use geometric cooling with a carefully tuned $\alpha$, accept that you might get a near-optimal solution rather than the exact optimum, and run multiple restarts.

When Simulated Annealing Works Well

SA works well when:

The search space is large and well-connected (many paths between any two solutions)
The energy landscape is relatively smooth (small uphill moves can connect local optima)
Starting fresh from a random point is expensive or unlikely to find good solutions quickly
You have time to tune the cooling schedule to the problem

SA works poorly when:

The energy landscape is rugged with many deep isolated valleys
Good solutions are in tiny, disconnected regions of the space
The problem has hard constraints that are difficult to satisfy for random solutions

Summary

Concept	What it means
Metropolis criterion	Accept worse solution with probability $e^{-\Delta E/T}$
Temperature $T$	Controls exploration-exploitation tradeoff
Geometric cooling	$T_k = \alpha T_{k-1}$; practical, no convergence guarantee
Logarithmic cooling	$T(t) = c/\log t$; provably optimal, astronomically slow
Boltzmann distribution	Stationary distribution of SA chain at fixed $T$
Connection to MCMC	SA = sequence of MH chains at decreasing temperatures
Reheating	Raise $T$ to escape deep local optima, then cool again

Simulated annealing turns local search’s fatal flaw - getting stuck at local optima - into a feature: you accept worse solutions strategically, with a probability that decreases as the algorithm matures. The physics analogy is not just poetic; it is exact. SA is running a physical cooling process on the combinatorial optimization landscape, and the theory of statistical mechanics tells you exactly what happens as the temperature goes to zero.

Read next:

Tabu Search - Memory-Guided Optimization That Forbids Its Own History