Genetic Algorithms - Evolution as an Optimization Strategy // Megha Bose

Helpful context:

In the 1970s, John Holland asked a question: can we solve optimization problems the same way evolution solves problems in nature?

Evolution doesn’t improve a single organism. It maintains a population. Better organisms reproduce more, passing their structure to offspring. Mutation introduces random variation. Sexual reproduction combines features of two individuals. After enough generations, the population concentrates around increasingly fit solutions - not because any organism has a plan, but because the selection pressure steers the population toward better and better regions of the fitness landscape.

Holland’s genetic algorithms apply this mechanism to combinatorial optimization. Maintain a population of candidate solutions. Better solutions produce more offspring. Crossover combines features of two solutions. Mutation perturbs individual solutions. After enough generations, the population converges toward good solutions.

The Genetic Algorithm Template

Every GA has the same basic structure:

Initialize a population of $\mu$ random solutions (chromosomes).
Evaluate fitness $f(x)$ for each solution.
Select parents based on fitness.
Crossover: combine two parents to produce offspring.
Mutate: randomly perturb offspring.
Replace the old population with the new one (possibly with elitism).
Repeat from step 2 until a stopping criterion is met.

The design choices - representation, selection, crossover, mutation, replacement - determine everything about the algorithm’s behavior.

Representation Matters

How you encode a solution as a chromosome is the most important design decision. The encoding shapes the neighborhood structure implicitly defined by crossover and mutation.

Binary strings. Each solution is a bit string in $\{0, 1\}^n$. Classic for theoretical analysis and combinatorial problems where features are binary decisions (include/exclude, yes/no).

Real-valued vectors. Each solution is $x \in \mathbb{R}^n$. Natural for continuous optimization - engineering parameters, neural network weights.

Permutations. Each solution is a permutation of $\{1, \ldots, n\}$. Required for routing (TSP), scheduling, and assignment problems. Standard crossover breaks permutation validity; specialized crossover operators (Order Crossover, Partially Mapped Crossover) preserve it.

Trees. Used in genetic programming, where you’re evolving programs rather than parameter vectors. More on this below.

Selection

Roulette wheel selection (fitness-proportionate). Select individual $i$ with probability proportional to $f(x_i)$:

$$p_i = \frac{f(x_i)}{\sum_{j=1}^\mu f(x_j)}.$$

Simple, but sensitive to scaling. If one individual has fitness far higher than the others, it dominates reproduction and diversity collapses - the population converges prematurely.

Tournament selection. Sample $k$ individuals at random; the one with the highest fitness wins and becomes a parent. Repeat for the second parent. Tournament size $k$ controls selection pressure: $k = 2$ is mild, $k = 7$ is strong. Tournament selection is robust to fitness scaling and is the standard in practice.

Rank selection. Sort by fitness and assign selection probability based on rank. Avoids the scaling problem but reduces selective pressure compared to fitness-proportionate selection.

Crossover

Crossover combines two parent chromosomes to produce offspring.

Single-point crossover. Choose a random cut point $c$. Offspring 1 takes genes $1, \ldots, c$ from parent 1 and $c+1, \ldots, n$ from parent 2.

Uniform crossover. Each gene position independently takes the gene from parent 1 with probability 0.5 and from parent 2 otherwise. Allows mixing from the entire chromosome simultaneously, not just across one cut.

Order crossover (for permutations). Copy a random segment from parent 1 directly. Fill remaining positions with genes from parent 2 in the order they appear, skipping genes already placed. This preserves relative ordering while enabling recombination.

The crossover probability $p_c$ (fraction of offspring produced by crossover rather than direct copy) is typically 0.6 - 0.9.

Mutation

Mutation introduces random variation, preventing premature convergence.

Bit flip (binary): flip each bit with probability $p_m \approx 1/n$.
Gaussian perturbation (real-valued): add Gaussian noise $\mathcal{N}(0, \sigma^2)$ to each dimension.
Swap mutation (permutation): choose two positions at random and swap them.
Inversion mutation (permutation): choose a segment and reverse it.

Mutation rates require careful calibration. Too low, and the algorithm relies entirely on crossover - diversity collapses, the population stagnates. Too high, and offspring are random - effectively random search with no inheritance from parents.

Elitism: copy the best $e$ solutions from the current generation directly to the next without modification. Even $e = 1$ significantly stabilizes convergence by ensuring the best solution found is never lost.

The Schema Theorem

Holland’s Schema Theorem attempts to explain mathematically why GAs work. A schema is a template over $\{0, 1, \ast\}^n$ where * matches either bit. The schema $H = 1\ast 0\ast\ast$ matches any 5-bit string with $x_1 = 1$ and $x_3 = 0$.

Define:

Order $o(H)$: the number of fixed (non-wildcard) positions.
Defining length $\delta(H)$: the distance between the outermost fixed positions.
$f(H)$: average fitness of chromosomes matching $H$.
$\bar{f}$: population average fitness.

With fitness-proportionate selection and single-point crossover, the expected count of schema $H$ satisfies:

$$m(H, t+1) \geq m(H, t) \cdot \frac{f(H)}{\bar{f}} \cdot \left(1 - p_c \frac{\delta(H)}{n-1} - o(H) p_m\right).$$

Short schemas (small $\delta$) and low-order schemas (small $o$) with above-average fitness grow exponentially. Crossover rarely disrupts them; selection amplifies them.

The building block hypothesis follows: GAs work by identifying short, low-order, high-fitness building blocks (schemata) and combining them through crossover to produce increasingly fit solutions. The GA is effectively running a parallel search over an exponential number of schemata simultaneously.

Discomfort check. Is there real theory explaining why GAs work, or do they just work empirically?

Mostly empirically. The Schema Theorem is real, but it doesn’t fully explain GA performance. It says short fit schemas grow - it doesn’t say the combination of short schemas gives globally good solutions. The No Free Lunch (NFL) theorem complicates matters further: averaged over all possible optimization problems, every optimization algorithm performs equally well. No algorithm - including GAs - is universally better than any other. GAs work when problems have exploitable structure: near-decomposability (good solutions can be assembled from good sub-solutions), moderate epistasis (fitness of one gene doesn’t strongly depend on every other gene), and many local optima of varying quality that can be escaped by population-level crossover. When those structural properties are absent, GAs offer no particular advantage.

Genetic Programming

Genetic programming evolves programs rather than fixed-length strings. Solutions are represented as trees: internal nodes are functions (arithmetic operations, conditionals, Boolean operators); leaves are terminals (constants, input variables).

Crossover swaps subtrees between parents. Mutation replaces a subtree with a randomly generated one. The fitness function is performance on a test set - how well the program solves the target task.

GP has been used for symbolic regression (rediscovering physical equations from data), automatic circuit design (NASA’s evolved antenna for the ST5 spacecraft), and game-playing strategy evolution. In symbolic regression, GP can rediscover equations like $F = ma$ from experimental data, given the right terminal and function sets.

CMA-ES: State of the Art for Continuous Problems

For continuous optimization, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is the state of the art among evolutionary approaches. It maintains:

A mean vector $m \in \mathbb{R}^n$ (the current best estimate of the optimum).
A covariance matrix $C \in \mathbb{R}^{n \times n}$ (the shape of the search distribution).
A step size $\sigma$.

At each generation, sample $\lambda$ offspring from $\mathcal{N}(m, \sigma^2 C)$, evaluate fitness, and update $m$, $C$, and $\sigma$ based on the top-$\mu$ individuals. The covariance matrix adapts to the local geometry of the fitness landscape - if the function has strong correlations between variables, $C$ learns to reflect that, allowing efficient search in elongated valleys.

CMA-ES achieves near-quadratic convergence on smooth functions and is competitive with L-BFGS on hard non-convex problems where gradient information is unavailable.

GAs vs. Simulated Annealing

Property	Simulated annealing	Genetic algorithm
Population	Single solution	Population of $\mu$ solutions
Diversity mechanism	Thermal acceptance	Crossover + mutation
Escape from local optima	Temperature (probabilistic)	Population diversity + crossover
Parallelism	Single thread per run	Naturally parallel (evaluate population in parallel)
Theoretical basis	MCMC, Boltzmann	Schema theorem (incomplete), NFL
When best	Smooth landscapes, clear temperature analogy	Problems with compositional structure, discrete search spaces

The fundamental advantage of GAs over SA is the population: good partial solutions can be combined via crossover in ways that individual perturbation cannot achieve. If the global optimum contains a particular substructure that appears in several population members, crossover can assemble it. SA can only discover it through a sequence of random perturbations.

Summary

Component	Options	Effect
Representation	Binary, real, permutation, tree	Shapes neighborhood implicitly
Selection	Roulette wheel, tournament, rank	Controls selection pressure
Crossover	Single-point, uniform, order	Combines parent building blocks
Mutation	Bit flip, Gaussian, swap, inversion	Maintains diversity
Elitism	Copy top $e$ individuals	Stabilizes convergence
Population size	Larger = more diversity, more expensive	Tradeoff: exploration vs. cost

Genetic algorithms are population-based search that exploits the compositional structure of good solutions. They work well when problems are near-decomposable and when good partial solutions can be usefully combined. They fail when fitness is highly epistatic (every variable interacts with every other), when the search space is smooth (gradient methods are better), or when the encoding doesn’t respect the problem’s structure. The art is in the representation and in calibrating selection pressure so the population explores broadly early and converges narrowly late.

Read next:

Beam Search - The Best Path You Can Afford to Search For