Big-O - Why the Right Algorithm Beats Faster Hardware // Megha Bose

Helpful context:

How to Think Like a Programmer - Breaking Problems Into Pieces That Yield

In 2002, Google engineers faced a problem: they had crawled roughly 2 billion web pages and needed to sort links between them. The naive approach - compare every pair of pages - would require around $4 \times 10^{18}$ comparisons. On the hardware of the day, that would take thousands of years.

They didn’t buy faster computers. They changed the algorithm.

This is what Big-O is about. Not microseconds. Not clock cycles. The fundamental question: as your input grows, how does the running time grow with it?

What Does “Fast” Mean?

Suppose you’re looking up a name in a phone book with 1 million entries. The names are sorted alphabetically.

Approach 1: Start on page one, read every name until you find it. In the worst case, you read all 1 million entries. If the book doubled in size, you’d read 2 million entries. The work grows linearly with the size.

Approach 2: Open to the middle. If the name you want comes before the middle alphabetically, throw away the second half. Repeat on the remaining half. Every step cuts the remaining work in half. With 1 million entries you need at most 20 steps. With 2 million entries, 21 steps. The work barely grows at all.

Both approaches are “correct.” One is dramatically better. The difference isn’t a matter of implementation detail - it’s the shape of the relationship between input size and work. That shape is what Big-O captures.

We call the input size $n$. The first approach does roughly $n$ operations: we say it runs in $O(n)$ time. The second does roughly $\log_2 n$ operations: we say it runs in $O(\log n)$ time. For $n = 10^6$, that’s the difference between 1,000,000 steps and 20 steps.

Big-O Intuition: Drop the Clutter

Imagine your algorithm does exactly $3n^2 + 7n + 12$ operations on an input of size $n$. What matters?

At $n = 1000$:

The $3n^2$ term contributes $3{,}000{,}000$
The $7n$ term contributes $7{,}000$
The constant $12$ contributes $12$

The $n^2$ term crushes everything else. At $n = 10^6$, the gap is even more extreme. So we say the algorithm is $O(n^2)$ - we drop the constant multiplier 3, and we drop the lower-order terms $7n + 12$.

Why drop the constant? Because it’s hardware-dependent. The same algorithm running on a machine twice as fast would have constant $1.5n^2$. Whether it’s $3n^2$ or $1.5n^2$ doesn’t change the shape of the curve - it’s still quadratic growth. The constant tells you about the machine, not the algorithm.

Why drop lower-order terms? Because they become irrelevant as $n$ grows. You should care about what happens for large inputs, and for large $n$ the dominant term tells the whole story.

Formal Definitions

Now that you have the intuition, here are the precise definitions.

Big-O - $O(g(n))$ (loose upper bound): $f(n) = O(g(n))$ if $\exists$ positive constants $c$ and $n_0$ such that $f(n) \leq c \cdot g(n)$ for all $n \geq n_0$.

Informally: $f$ grows no faster than $g$, up to a constant factor, for all large enough $n$.

Little-o - $o(g(n))$ (strict upper bound): $f(n) = o(g(n))$ if for all positive constants $c$, $\exists\ n_0$ such that $f(n) < c \cdot g(n)$ for all $n \geq n_0$. The growth rate of $f$ is strictly less than that of $g$. Note that $o(n)$ implies $O(n)$, but not vice-versa - $\log n$ is both $o(n)$ and $O(n)$, but $n$ is only $O(n)$.

Big-Omega - $\Omega(g(n))$ (loose lower bound): $f(n) = \Omega(g(n))$ if $\exists$ positive constants $c$ and $n_0$ such that $c \cdot g(n) \leq f(n)$ for all $n \geq n_0$. Informally: $f$ grows at least as fast as $g$.

Little-omega - $\omega(g(n))$ (strict lower bound): $f(n) = \omega(g(n))$ if for all positive constants $c$, $\exists\ n_0$ such that $c \cdot g(n) < f(n)$ for all $n \geq n_0$. The growth rate of $f$ is strictly greater than that of $g$. Similarly, $\omega(n)$ implies $\Omega(n)$, but not vice-versa - $n^2$ is both $\omega(n)$ and $\Omega(n)$, but $n$ is only $\Omega(n)$.

Big-Theta - $\Theta(g(n))$ (tight bound): $f(n) = \Theta(g(n))$ if $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$. This pins down the exact growth rate. There is no “small theta” - no function can simultaneously be $o(g(n))$ and $\omega(g(n))$, i.e., strictly above and strictly below $g(n)$ at the same time.

When we want to be precise - not just “no faster than” but “exactly as fast as” - we use $\Theta$. In casual usage, people often say $O$ when they mean $\Theta$; this is imprecise but common.

Reading the notations intuitively. If your algorithm runs in:

$O(n)$ - upper-bounded by linear growth, but may be faster. A loose ceiling.
$o(n)$ - strictly faster than any linear algorithm; e.g. $O(\log n)$.
$\Omega(n)$ - takes at least linear time; a loose floor.
$\omega(n)$ - strictly slower than linear; e.g. $\Omega(n^2)$.
$\Theta(n)$ - grows exactly linearly; the tight characterization.

The Common Complexity Classes

Here are the classes you’ll encounter most, ordered from best to worst, with concrete examples.

$O(1)$ - constant time. The running time doesn’t depend on $n$ at all. Example: looking up element at index $i$ in an array - the address is computed directly as base + i × size. No matter how large the array, this takes the same time.

$O(\log n)$ - logarithmic. Time grows by one step every time $n$ doubles. Example: the phone book approach from earlier - halving the remaining entries each step. This technique is called binary search. At $n = 10^9$, you need only about 30 steps.

$O(n)$ - linear. Time grows proportionally with input. Example: finding the maximum element in an unsorted array - you must look at every element at least once.

$O(n \log n)$ - linearithmic. Time is a little worse than linear but much better than quadratic. Example: merge sort, which sorts $n$ elements in $O(n \log n)$ comparisons. This is the best possible time for comparison-based sorting.

$O(n^2)$ - quadratic. Time grows with the square of the input. Example: the naive algorithm for finding all pairs of duplicate elements in an array - compare each element with every other element.

$O(2^n)$ - exponential. Time doubles with every new element. Example: generating all subsets of an $n$-element set. At $n = 30$, this is already over 1 billion operations.

$O(n!)$ - factorial. Time grows faster than any exponential. Example: the traveling salesman problem - given $n$ cities, find the shortest route that visits each exactly once and returns home. The brute-force approach tries all possible orderings. At $n = 20$, this is $2.4 \times 10^{18}$ operations.

Discomfort check. “But my $O(n^2)$ algorithm runs faster than their $O(n \log n)$ algorithm in practice!” This is possible - and common - for small $n$. If the $O(n^2)$ algorithm has a small constant (say, 1 operation per comparison) and the $O(n \log n)$ algorithm has a large constant (say, 100 operations per merge), then for small $n$ the $O(n^2)$ algorithm wins. For large $n$, the $O(n \log n)$ algorithm will always win eventually, no matter how large the constant gap. This crossover point is why we write benchmarks - to find out whether our actual $n$ is in the “small” regime or the “large” regime.

Worst, Average, and Best Case

Big-O describes how time scales with input size - but for a fixed $n$, different inputs can cause very different behavior.

Best case: the input that lets the algorithm finish as quickly as possible. Example: linear search where the target happens to be the first element - $O(1)$.

Worst case: the input that forces the most work. This is the most commonly analyzed case, because it gives a guarantee: “no matter what you throw at this algorithm, it will finish within this time.” Example: linear search where the target is the last element or missing - $O(n)$.

Average case: the expected time over a random input. This is harder to compute because it requires a model of what “random input” means. Example: linear search on average finds the target after scanning half the array - $O(n/2) = O(n)$.

For most algorithm analysis, we focus on worst case. It gives guarantees. But knowing average case matters when you’re choosing between algorithms: quicksort has $O(n^2)$ worst case but $O(n \log n)$ average case, and in practice (with random pivot selection) the worst case almost never occurs.

Amortized Thinking

Sometimes an individual operation is expensive, but expensive operations are rare enough that the average cost per operation is low.

A dynamic array doubles its capacity when full. Most insertions cost $O(1)$. The occasional doubling costs $O(n)$. Yet if you do $n$ insertions total, the sum of all copying costs is $O(n)$, so the average per insertion is $O(1)$. We say the operation is $O(1)$ amortized.

Amortized analysis is the subject of a full post later. For now, recognize that “amortized $O(1)$” is different from “always $O(1)$” - it’s a statement about the cost spread across a sequence of operations.

Recurrence Relations

Divide-and-conquer algorithms split a problem into smaller subproblems. The running time satisfies a recurrence relation. Solving it tells you the complexity.

The classic example is merge sort. To sort $n$ elements, you sort two halves (each of size $n/2$) and merge them in $O(n)$ time. This gives:

$$T(n) = 2T(n/2) + O(n)$$

How do we solve this? Draw the recursion tree. At the top level, you do $O(n)$ work. At the next level, two subproblems of size $n/2$ - together they do $2 \cdot O(n/2) = O(n)$ work. And so on, for $\log_2 n$ levels total. Each level does $O(n)$ work, and there are $O(\log n)$ levels, so the total is $O(n \log n)$.

The Master Theorem

The recurrence $T(n) = aT(n/b) + O(n^c)$ captures a wide class of divide-and-conquer algorithms. The Master Theorem gives the solution based on the relationship between the leaf work ($a$ subproblems, each at the bottom contributes $1$ unit) and the root work ($O(n^c)$ per level call).

Define the critical exponent as $\log_b a$.

Case 1 - Leaves dominate: if $c < \log_b a$, then the work is dominated by the leaf level:

$$T(n) = \Theta(n^{\log_b a})$$

Example: Strassen’s matrix multiplication - a clever algorithm for multiplying two $n \times n$ matrices that splits each matrix into four $n/2 \times n/2$ blocks and combines them with only 7 recursive multiplications instead of the naive 8 - has $a = 7$, $b = 2$, $c = 2$. Since $\log_2 7 \approx 2.807 > 2$, we’re in Case 1: $T(n) = \Theta(n^{2.807})$.

Case 2 - All levels equal: if $c = \log_b a$, the work is the same at every level, and there are $\log n$ levels:

$$T(n) = \Theta(n^c \log n)$$

Example: merge sort has $a = 2$, $b = 2$, $c = 1$. Since $\log_2 2 = 1 = c$, we’re in Case 2: $T(n) = \Theta(n \log n)$.

Case 3 - Root dominates: if $c > \log_b a$ (and a technical regularity condition holds), the root level does the most work:

$$T(n) = \Theta(n^c)$$

Example: a hypothetical algorithm with $a = 2$, $b = 4$, $c = 1$. Since $\log_4 2 = 0.5 < 1$, we’re in Case 3: $T(n) = \Theta(n)$.

Why the Master Theorem is true. Think about the recursion tree for $T(n) = aT(n/b) + O(n^c)$.

At the root you do $n^c$ work. You then branch into $a$ subproblems each of size $n/b$. At depth $k$, there are $a^k$ subproblems each of size $n/b^k$, so the work at that level is:

$$a^k \cdot \left(\frac{n}{b^k}\right)^c = n^c \cdot \left(\frac{a}{b^c}\right)^k$$

Let $r = a / b^c$. As you go deeper, the work per level scales by $r$ at each step.

If $r < 1$ (equivalently, $c > \log_b a$): the per-level work shrinks geometrically as you go deeper. The root level dominates. The total is a converging geometric series, giving $\Theta(n^c)$.
If $r = 1$ (equivalently, $c = \log_b a$): every level does the same amount of work. There are $\log_b n$ levels, so the total is $\Theta(n^c \log n)$.
If $r > 1$ (equivalently, $c < \log_b a$): the per-level work grows as you go deeper. The leaf level dominates. The number of leaves is $a^{\log_b n} = n^{\log_b a}$, so the total is $\Theta(n^{\log_b a})$.

The critical exponent $\log_b a$ is just asking: how fast does the number of subproblems grow (captured by $a$) versus how fast does each subproblem shrink in cost (captured by $b^c$)? Whichever side wins determines where the work concentrates.

Summary

Complexity	Name	Example algorithm	Operations at $n = 10^6$
$O(1)$	Constant	Array index lookup	1
$O(\log n)$	Logarithmic	Binary search	20
$O(n)$	Linear	Linear scan	$10^6$
$O(n \log n)$	Linearithmic	Merge sort	$2 \times 10^7$
$O(n^2)$	Quadratic	Bubble sort	$10^{12}$
$O(2^n)$	Exponential	Subset enumeration	beyond astronomical
$O(n!)$	Factorial	Brute-force TSP	beyond astronomical

Big-O tells you the shape of the curve, not the exact number of operations. It tells you how an algorithm scales. And scaling is everything - it’s the difference between an algorithm that works on a laptop and one that would take longer than the age of the universe.

Read next:

Data Structures - Storage Is Strategy