Simulated Annealing
Prerequisite: Local Search & Hill Climbing
The Problem with Hill Climbing
Hill climbing gets trapped in local optima. Once at a local peak, no improving neighbour exists and the search stalls - even if the global optimum is far better. The only remedies within pure hill climbing are random restarts (expensive) or richer neighbourhood structures (problem-specific). Simulated annealing offers a more principled escape mechanism: occasionally accept worse solutions, with a probability that decreases as the algorithm progresses.
Physical Analogy: Annealing in Metallurgy
When molten metal cools slowly, atoms settle into a low-energy crystalline lattice - the global energy minimum. If the metal cools too fast (quenching), atoms get locked in a disordered, higher-energy state. The key is a careful temperature schedule: cool slowly enough to allow atoms to keep rearranging toward better configurations.
Simulated annealing (Kirkpatrick, Gelatt, Vecchi, 1983) applies this idea to combinatorial optimisation.
The Algorithm
Let $f(s)$ be the cost function (to be minimised). At each step, propose a random neighbour $s'$ of the current solution $s$. Let $\Delta E = f(s') - f(s)$.
- If $\Delta E < 0$ (improvement), accept $s'$ unconditionally.
- If $\Delta E \geq 0$ (non-improving), accept $s'$ with probability
$$P(\Delta E, T) = e^{-\Delta E / T}$$
where $T > 0$ is the current temperature. This is the Metropolis criterion, originally introduced in statistical physics for Monte Carlo simulation of particle systems.
At high temperature, $e^{-\Delta E/T} \approx 1$ for moderate $\Delta E$: almost any move is accepted, enabling broad exploration. As $T \to 0$, $e^{-\Delta E/T} \to 0$ for $\Delta E > 0$: only improvements are accepted, reducing to hill climbing. The algorithm thus transitions smoothly from global exploration to local exploitation.
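The acceptance rule itself is only a few lines of code. Here is a minimal sketch in Python; the function name `accept` and the use of `random.random` for the coin flip are illustrative choices, not fixed by the algorithm:

```python
import math
import random

def accept(delta_e: float, temperature: float) -> bool:
    """Metropolis criterion: improvements are always accepted; a worsening
    move of size delta_e is accepted with probability exp(-delta_e / T).
    Assumes temperature > 0."""
    if delta_e < 0:
        return True
    return random.random() < math.exp(-delta_e / temperature)
```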
Temperature Schedules
The function $T(t)$ controls how quickly the algorithm “cools.”
Theoretical schedule. If $T(t) = c / \log(t+1)$ for a sufficiently large constant $c$ (related to the largest energy barrier separating any local optimum from the global optimum), then SA converges in probability to the global optimum. The catch: the required number of steps is exponential, making this schedule impractical.
Geometric cooling (practical). The standard schedule sets
$$T_{k+1} = \alpha \cdot T_k, \quad \alpha \in [0.9, 0.99]$$
starting from an initial temperature $T_0$ chosen so that roughly 80% of moves are accepted. Cooling continues until $T$ is near zero or no improvement has been found for many consecutive iterations. This is fast but provides no convergence guarantee.
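Combining the Metropolis rule with geometric cooling gives a complete minimiser. The following is a minimal sketch, assuming the caller supplies the cost function `f` and a `random_neighbour` generator; the parameter defaults are illustrative:

```python
import math
import random

def simulated_annealing(initial, f, random_neighbour,
                        t0=10.0, alpha=0.95, steps_per_temp=100,
                        t_min=1e-3):
    """Minimise f by simulated annealing with geometric cooling.

    `random_neighbour(s)` must return a random neighbour of s.
    Returns the best solution seen, not merely the final one.
    """
    current, current_cost = initial, f(initial)
    best, best_cost = current, current_cost
    t = t0
    while t > t_min:
        for _ in range(steps_per_temp):
            candidate = random_neighbour(current)
            delta_e = f(candidate) - current_cost
            # Metropolis criterion: accept improvements outright,
            # worsenings with probability exp(-delta_e / t).
            if delta_e < 0 or random.random() < math.exp(-delta_e / t):
                current, current_cost = candidate, current_cost + delta_e
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha  # geometric cooling: T_{k+1} = alpha * T_k
    return best, best_cost
```

For instance, `simulated_annealing(5.0, lambda x: x * x, lambda x: x + random.gauss(0, 1))` drives a scalar toward the minimum of $x^2$.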
Reheating. Some implementations periodically raise $T$ to escape deep local optima after the temperature has dropped, then cool again. This is sometimes called iterated SA.
Acceptance Probability Analysis
Consider two moves: one that worsens the objective by $\Delta E = 1$ and one that worsens it by $\Delta E = 10$. At temperature $T = 5$:
$$P(\Delta E = 1) = e^{-1/5} \approx 0.82, \quad P(\Delta E = 10) = e^{-10/5} \approx 0.14$$
Large deteriorations are accepted much less often than small ones, even at the same temperature. This is a natural and desirable property: the algorithm takes careful steps away from good regions rather than random leaps.
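These numbers are easy to tabulate. A quick check in Python, sweeping the temperature to show the transition from near-indiscriminate acceptance to near-pure hill climbing:

```python
import math

for t in (10.0, 5.0, 1.0, 0.1):
    p_small = math.exp(-1 / t)   # worsening of Delta E = 1
    p_large = math.exp(-10 / t)  # worsening of Delta E = 10
    print(f"T={t:5.1f}  P(dE=1)={p_small:.3f}  P(dE=10)={p_large:.3g}")
```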
Parallel Tempering
A powerful extension, also known as replica exchange, runs $k$ SA chains in parallel at fixed temperatures $T_1 < T_2 < \cdots < T_k$. Periodically, adjacent chains attempt to swap configurations: the chain at $T_i$ holds solution $s_i$ and the chain at $T_{i+1}$ holds $s_{i+1}$. The swap is accepted with probability
$$\min\!\left(1,\ e^{(\beta_i - \beta_{i+1})(E(s_i) - E(s_{i+1}))}\right)$$
where $\beta_i = 1/T_i$ and $E$ denotes the cost $f$. This swap rule satisfies detailed balance, so it preserves the joint target distribution. The hot chains explore freely; good configurations discovered at high temperature propagate down to the cold chain, which focuses on refinement. Parallel tempering dramatically accelerates mixing in multimodal landscapes.
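A sketch of the swap step, assuming each chain caches its current energy; the container names `chains`, `energies`, and `betas` are illustrative:

```python
import math
import random

def try_swap(chains, energies, betas, i):
    """Attempt to swap the configurations of chains i and i+1.

    `chains[j]` holds the current solution of the chain at inverse
    temperature betas[j]; `energies[j]` is its cached energy.
    """
    delta = (betas[i] - betas[i + 1]) * (energies[i] - energies[i + 1])
    # delta >= 0 means exp(delta) >= 1: the swap is always accepted,
    # moving the lower-energy configuration to the colder chain.
    if delta >= 0 or random.random() < math.exp(delta):
        chains[i], chains[i + 1] = chains[i + 1], chains[i]
        energies[i], energies[i + 1] = energies[i + 1], energies[i]
        return True
    return False
```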
Relationship to Markov Chains
SA is a time-inhomogeneous Markov chain over the solution space. At a fixed temperature $T$, the Metropolis criterion defines a transition matrix $P_T$ whose stationary distribution is the Boltzmann distribution:
$$\pi_T(s) = \frac{1}{Z(T)} e^{-f(s)/T}$$
As $T \to 0$, this distribution concentrates on the global minimum. The theoretical convergence guarantee says: if the chain mixes fast enough at each temperature, decreasing $T$ logarithmically gives convergence to the global minimum. The exponential time requirement reflects the mixing time of $P_T$ near $T = 0$.
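The stationary distribution can be checked empirically on a toy state space. The sketch below runs Metropolis at a fixed temperature over five states with hand-picked costs, using a symmetric $\pm 1$ (wraparound) proposal, then compares visit frequencies to the Boltzmann distribution; the specific costs and temperature are illustrative:

```python
import math
import random
from collections import Counter

costs = [0.0, 1.0, 3.0, 1.5, 0.5]  # f(s) for states s = 0..4
T = 1.0

s, visits = 0, Counter()
for _ in range(200_000):
    s_new = (s + random.choice((-1, 1))) % len(costs)  # symmetric proposal
    delta_e = costs[s_new] - costs[s]
    if delta_e < 0 or random.random() < math.exp(-delta_e / T):
        s = s_new
    visits[s] += 1

z = sum(math.exp(-c / T) for c in costs)  # partition function Z(T)
total = sum(visits.values())
for state, cost in enumerate(costs):
    empirical = visits[state] / total
    boltzmann = math.exp(-cost / T) / z
    print(f"state {state}: empirical {empirical:.3f}, pi_T {boltzmann:.3f}")
```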
Practical Considerations
- Initial temperature: Run a short random walk and set $T_0$ so that roughly 80% of moves are accepted (see the calibration sketch after this list).
- Cooling rate: Slower cooling ($\alpha = 0.99$) gives better solutions at the cost of more computation.
- Stopping criterion: Fixed budget of evaluations, or no improvement after $k$ consecutive temperature reductions.
- Neighbourhood design: Solution quality depends critically on the neighbourhood structure. For the TSP, for instance, the standard neighbourhoods are 2-opt and 3-opt moves.
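A sketch of the initial-temperature calibration from the first bullet, using one common heuristic: record the worsening moves along a short all-accepting random walk and solve $e^{-\bar{\Delta E}/T_0} = 0.8$ for $T_0$. Function and parameter names are illustrative:

```python
import math
import random

def calibrate_t0(initial, f, random_neighbour,
                 samples=500, target_acceptance=0.8):
    """Choose T0 so an average-sized worsening move is accepted
    with probability ~target_acceptance."""
    s, cost = initial, f(initial)
    worsenings = []
    for _ in range(samples):        # random walk: accept every move
        s = random_neighbour(s)
        new_cost = f(s)
        if new_cost > cost:
            worsenings.append(new_cost - cost)
        cost = new_cost
    if not worsenings:              # landscape looks flat or downhill;
        return 1.0                  # any small T0 will do
    mean_delta = sum(worsenings) / len(worsenings)
    return -mean_delta / math.log(target_acceptance)
```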
Examples
TSP. SA with 2-opt or 3-opt moves regularly finds tours within 1–2% of optimal on large benchmark instances. The key parameter is the cooling schedule: too fast and the algorithm gets stuck; too slow and computation is wasted.
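A random 2-opt move is a one-liner: cut the tour at two points and reverse the segment between them, which replaces exactly two edges. A sketch, representing a tour as a list of city indices:

```python
import random

def two_opt_neighbour(tour):
    """Reverse a random segment of the tour (a 2-opt move)."""
    i, j = sorted(random.sample(range(len(tour)), 2))
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
```

Plugged into the `simulated_annealing` sketch above with a tour-length objective, this yields a complete (if unoptimised) TSP solver.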
VLSI Placement. Assigning circuit components to chip locations to minimise wire length. The solution space is a permutation; neighbourhoods are pairwise swaps of component positions. SA was used at IBM in the 1980s to produce competitive chip layouts, which motivated the original 1983 paper.
Protein Folding (Rosetta). Predicting three-dimensional protein structure from amino acid sequence. Energy functions are rugged and multimodal. SA with fragment insertion moves (replacing local backbone segments with library fragments) is a core component of the Rosetta suite.
Hyperparameter Tuning. Treating hyperparameter configurations as solutions and validation loss as the objective, SA explores the hyperparameter space, accepting slightly worse configurations occasionally to avoid premature convergence to a poor local optimum.
Read Next: Tabu Search