Helpful context:


In 1945, the US Navy had a problem. They needed to sort 10 million punched cards containing personnel records - and their best method was slow, error-prone, and required enormous physical floor space. John von Neumann, then at the Institute for Advanced Study, thought carefully about the problem and invented merge sort.

He didn’t invent it to solve a puzzle. He invented it because the Navy needed it and the obvious approach was too slow.

Today, sorting is woven into almost every piece of software you use. Database indexes are sorted. Search results are ranked (a kind of sorting). File systems store directory entries in sorted order. Compression algorithms sort symbols by frequency. The ubiquity makes sorting more than a textbook problem - it’s a design philosophy. This post traces through the main ideas.


Why Sorting Unlocks Everything

Before examining specific algorithms, it’s worth understanding why sorting is so useful.

A sorted array enables binary search: instead of scanning all $n$ elements for a value, you can find it in $O(\log n)$ comparisons. Merging two sorted arrays takes $O(n)$ instead of $O(n^2)$. Finding duplicates becomes a single pass. Database join operations between two sorted tables are linear. Median, percentiles, and range queries all become easy.

Sorting is often not the goal - it’s the preprocessing step that makes the actual goal tractable.


The Comparison Lower Bound

Here’s a question worth asking before writing a single line of code: how fast can sorting possibly be?

If your algorithm sorts by comparing pairs of elements, here is what that means concretely. Take $[3, 1, 2]$. You don’t know the sorted order yet. You ask: “is $3 > 1$?” - yes, so $1$ comes before $3$. You ask: “is $1 > 2$?” - no, so $1$ comes before $2$. You ask: “is $3 > 2$?” - yes, so $2$ comes before $3$. Now you know the full order: $1 < 2 < 3$. You move the elements into those positions - that is the rearranging step. Every comparison-based sorting algorithm works this way: it asks yes/no questions about pairs, accumulates enough answers to know the complete sorted order, then places elements accordingly.

Now here is the key observation. Each of the $n!$ possible input orderings requires a different sequence of rearrangements to sort correctly. This means each ordering must produce a different sequence of yes/no answers - because if two different orderings produced the exact same answers, the algorithm would make the same moves for both and get at least one of them wrong. So the algorithm needs at least $n!$ distinct answer sequences. After $k$ comparisons, each giving a binary answer, there are at most $2^k$ possible sequences. So:

$$2^k \geq n! \implies k \geq \log_2(n!)$$

Now, $\log_2(n!)$ grows roughly as $n \log_2 n$. To see why: Stirling’s approximation is a formula that estimates $n!$ without computing it directly - it says $n! \approx \sqrt{2\pi n}(n/e)^n$. Taking the log of both sides gives $\log_2(n!) \approx n \log_2 n - n \log_2 e$, which is $\Omega(n \log n)$. Intuitively, $\log(n!)$ is the sum $\log 1 + \log 2 + \cdots + \log n$, and the average term is around $\log(n/2) \approx \log n$, so the whole sum is around $n \log n$.

This is a lower bound: any algorithm that learns about input order only through comparisons must make at least $\Omega(n \log n)$ comparisons in the worst case. The algorithms below that achieve $O(n \log n)$ are hitting this ceiling - they’re optimal in this model.


Bubble Sort: The Teaching Algorithm

Bubble sort repeatedly scans the array, comparing adjacent elements and swapping them if they’re in the wrong order. After the first pass, the largest element has “bubbled” to the end. After the second pass, the second-largest is in place. After $n-1$ passes, the array is sorted.

The inner loop on pass $i$ compares positions $0$ through $n-2-i$, since the last $i$ elements are already in their final positions. Total comparisons: $(n-1) + (n-2) + \cdots + 1 = \frac{n(n-1)}{2} = O(n^2)$. Both worst-case and average-case are $\Theta(n^2)$.

One optimization: track whether any swap occurred during a pass. If no swap occurred, the array is already sorted and you can exit early. This makes bubble sort $O(n)$ on already-sorted input - the same as insertion sort’s best case.

Bubble sort is stable (equal elements preserve their relative order) and in-place ($O(1)$ extra space). In practice it is slower than insertion sort by a constant factor, because each swap requires three assignments while insertion sort shifts elements one position at a time. It is taught for its simplicity, not its performance. No serious implementation uses it.

When to use it. Essentially never. Insertion sort is strictly better on every practical criterion: same asymptotic complexity, fewer writes per element, and better cache behavior on nearly-sorted inputs.

def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(n - 1 - i):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:
            break

Insertion Sort: Simple and Surprising

Insertion sort works the way you’d sort a hand of playing cards. Pick up one card at a time and insert it in the right position among the cards you already hold.

For an array of $n$ elements, you scan from left to right. For each element, you scan backward to find where it belongs, shifting elements right to make room.

In the worst case (reverse-sorted input), every element has to travel all the way to the beginning: $n-1 + n-2 + \cdots + 1 = O(n^2)$ total moves. This is slow for large inputs.

But here’s the surprising part: insertion sort is optimal for nearly-sorted arrays. If each element is at most $k$ positions from its final location, insertion sort runs in $O(kn)$. When $k$ is small - say, the array was already sorted and you just added a few elements - insertion sort is actually faster than $O(n \log n)$ algorithms. This is why real sorting implementations use insertion sort for small subarrays.

def insertion_sort(arr):
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0 and arr[j] > key:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key

Merge Sort: Divide, Conquer, Combine

Merge sort is von Neumann’s insight. To sort $n$ elements:

  1. Split the array in half.
  2. Recursively sort each half.
  3. Merge the two sorted halves into one sorted array.

The merge step is the key idea. Given two sorted arrays, you can produce one sorted array in $O(n)$ time: look at the first element of each, take the smaller one, repeat.

The running time satisfies the recurrence:

$$T(n) = 2T(n/2) + O(n)$$

Two subproblems of half the size, plus $O(n)$ to merge. By the Master Theorem (Case 2: $a = 2$, $b = 2$, so $n^{\log_b a} = n$, and $f(n) = \Theta(n)$):

$$T(n) = \Theta(n \log n)$$

Merge sort is stable: equal elements maintain their original relative order. This matters when sorting by multiple keys (sort by date, then stably sort by name - you get alphabetical within each date).

The main cost is space: the merge step needs an auxiliary array of $O(n)$ elements. For large datasets, this matters.

def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i]); i += 1
        else:
            result.append(right[j]); j += 1
    return result + left[i:] + right[j:]

Quicksort: Why the Fast Algorithm Isn’t Guaranteed Fast

Quicksort is typically the fastest sorting algorithm in practice. It’s also the one with the most interesting gap between theory and reality.

The idea: pick a pivot element. Rearrange the array so everything smaller than the pivot is to its left, everything larger is to its right. Now the pivot is in its final position. Recursively sort each side.

The running time depends entirely on pivot quality. If the pivot always splits the array exactly in half, you get the same $O(n \log n)$ recurrence as merge sort. If the pivot is always the minimum or maximum element - say, you always pick the first element and the input happens to be already sorted - one side has $n-1$ elements and the other has $0$. The recurrence becomes $T(n) = T(n-1) + O(n) = O(n^2)$.

The fix is randomization. Pick the pivot uniformly at random. Now the worst case is exponentially unlikely - any particular bad sequence of pivot choices has probability $(1/n)^n$. The expected running time is $O(n \log n)$.

Quicksort is in-place (it doesn’t need an auxiliary array), and it has better cache behavior than merge sort because the partition step scans the array sequentially. For large in-memory sorts, these practical advantages often outweigh the theoretical concerns about worst-case behavior.

Discomfort check. “Quicksort is $O(n \log n)$ average but $O(n^2)$ worst - why does everyone use it?” Three reasons combine. First, with random pivot selection, the probability of hitting the worst case is negligible in practice. Second, real implementations add a fallback: if the recursion depth exceeds $O(\log n)$ (a sign you’re close to the worst case), they switch to heapsort, which guarantees $O(n \log n)$. This hybrid is called introsort. Third, quicksort’s cache locality makes it roughly 2-3 times faster than merge sort on random data, and real data is often close to random.

import random

def quicksort(arr, lo=0, hi=None):
    if hi is None:
        hi = len(arr) - 1
    if lo < hi:
        p = _partition(arr, lo, hi)
        quicksort(arr, lo, p - 1)
        quicksort(arr, p + 1, hi)

def _partition(arr, lo, hi):
    rand = random.randint(lo, hi)
    arr[rand], arr[hi] = arr[hi], arr[rand]
    pivot, i = arr[hi], lo - 1
    for j in range(lo, hi):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[hi] = arr[hi], arr[i + 1]
    return i + 1

Heap Sort: O(n log n) Guaranteed, In-Place

Heapsort uses a data structure - the heap - to achieve $O(n \log n)$ in the worst case with $O(1)$ extra space. No randomization required.

Step one: build a max-heap from the array (the largest element is at the top). This takes $O(n)$ - we’ll see why in Trees & Heaps - When Flat Structures Hit Their Limit . Step two: repeatedly extract the maximum element (swap it to the end of the array and restore the heap property). Each extraction costs $O(\log n)$ and there are $n$ of them, so the total is $O(n \log n)$.

Heapsort is not stable (elements can move past each other during heap operations), and it has worse cache behavior than quicksort - accessing a heap involves jumping between distant array positions, not scanning sequentially. In practice, heapsort is slower than quicksort despite having the same asymptotic complexity.

Its main advantage is the worst-case guarantee: $O(n \log n)$ on any input. This is why it appears as the fallback in introsort.

def _sift_down(arr, i, n):
    # push node i down until it is larger than both its children
    largest = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < n and arr[left] > arr[largest]:
        largest = left
    if right < n and arr[right] > arr[largest]:
        largest = right
    if largest != i:
        arr[i], arr[largest] = arr[largest], arr[i]
        _sift_down(arr, largest, n)

def heapify(arr):
    # every leaf is already a valid heap; start from the last
    # non-leaf node and sift each node down into its correct position
    n = len(arr)
    for i in range(n // 2 - 1, -1, -1):
        _sift_down(arr, i, n)

def heap_sort(arr):
    n = len(arr)
    heapify(arr)                          # build max-heap in O(n)
    for i in range(n - 1, 0, -1):
        arr[0], arr[i] = arr[i], arr[0]  # move current max to end
        _sift_down(arr, 0, i)            # restore heap on remaining elements

Counting Sort and Radix Sort: Beating the Lower Bound

The $\Omega(n \log n)$ lower bound applies to comparison-based sorting. If you know something about the structure of the keys, you can do better.

Counting sort works when all keys are integers in a known range $[0, k)$. Count how many times each value appears, then reconstruct the sorted array from the counts. Total time: $O(n + k)$. When $k = O(n)$, this is linear.

Radix sort extends counting sort to multi-digit integers. Sort by the least significant digit first, then the next digit, and so on - using a stable sort at each step. With $d$ digit positions and a base-$b$ representation:

$$T(n) = O(d \cdot (n + b))$$

For 32-bit integers with $b = 256$ (byte-by-byte), $d = 4$ and each pass is $O(n + 256) = O(n)$. Total: $O(4n) = O(n)$.

The catch: radix sort requires the keys to have a fixed-width digit representation. It doesn’t generalize to arbitrary comparison keys (strings of variable length, complex structs). It also isn’t in-place - it needs $O(n)$ auxiliary space.

def counting_sort(arr, k):
    count = [0] * k
    for x in arr:
        count[x] += 1
    return [x for x, freq in enumerate(count) for _ in range(freq)]


def radix_sort(arr):
    max_val = max(arr)
    exp = 1
    while max_val // exp > 0:
        buckets = [[] for _ in range(10)]
        for x in arr:
            buckets[(x // exp) % 10].append(x)
        arr = [x for bucket in buckets for x in bucket]
        exp *= 10
    return arr

Timsort: What Python Actually Uses

Python’s built-in sort (and Java’s sort for objects) uses Timsort, a hybrid algorithm designed to be fast on the kinds of data that actually appear in practice.

Real data often has structure: regions that are already sorted, or nearly sorted, or sorted in reverse. Timsort exploits this.

The algorithm scans the input for natural runs - contiguous subsequences that are already sorted or reverse-sorted. Reverse-sorted runs get flipped. Short runs get extended using insertion sort (which is fast on nearly-sorted data). Then runs are merged together using a strategy that maintains a stack of “pending merges” with size constraints to guarantee $O(n \log n)$ total work.

On an already-sorted array, Timsort detects a single run and does no work: $O(n)$. On truly random data, it falls back to $O(n \log n)$ merge sort behavior. In practice, with partially structured data, it frequently beats pure quicksort.

arr.sort()               # in-place, uses Timsort
sorted_arr = sorted(arr) # returns a new sorted list

Summary

Algorithm Best Average Worst Space Stable When to use
Insertion sort $O(n)$ $O(n^2)$ $O(n^2)$ $O(1)$ Yes Small $n$ or nearly-sorted
Merge sort $O(n \log n)$ $O(n \log n)$ $O(n \log n)$ $O(n)$ Yes Stable sort, linked lists, external sort
Quicksort $O(n \log n)$ $O(n \log n)$ $O(n^2)$ $O(\log n)$ No General in-memory; fastest in practice
Heap sort $O(n \log n)$ $O(n \log n)$ $O(n \log n)$ $O(1)$ No Worst-case guarantee, in-place
Counting sort $O(n + k)$ $O(n + k)$ $O(n + k)$ $O(k)$ Yes Small integer range
Radix sort $O(d(n+b))$ $O(d(n+b))$ $O(d(n+b))$ $O(n+b)$ Yes Fixed-width integers
Timsort $O(n)$ $O(n \log n)$ $O(n \log n)$ $O(n)$ Yes Real-world data with structure

Read next: