Every algorithm is only as fast as the data structure beneath it. Choosing the wrong container - even with a perfect algorithm - can blow up a solution from $O(n \log n)$ to $O(n^2)$. This post builds a mental model for each core structure: what it guarantees, what it costs, and where it shines.

Arrays and Dynamic Arrays

An array stores elements in a contiguous block of memory. The payoff is random access in $O(1)$: given index $i$, the element lives at address $\text{base} + i \cdot \text{size}$. No pointer chasing. This also means excellent cache locality - sequential access patterns load entire cache lines at once, giving real-world speedups that raw Big-O cannot capture.

The cost is rigidity. Inserting or deleting in the middle requires shifting elements: $O(n)$ in the worst case.

A dynamic array (Python list, Java ArrayList, C++ std::vector) wraps a fixed array and doubles its capacity when full. Doubling gives amortized $O(1)$ appends - the expensive resize happens rarely enough that the average cost per insert stays constant. See Amortized Analysis for the full proof.
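
As a minimal sketch, here is the doubling strategy in Python (the `DynArray` name is illustrative; real implementations also tune the growth factor and handle shrinking):

```python
class DynArray:
    """Append-only dynamic array illustrating capacity doubling."""

    def __init__(self):
        self._capacity = 1
        self._size = 0
        self._data = [None] * self._capacity

    def append(self, value):
        if self._size == self._capacity:      # full: pay O(n) once...
            self._capacity *= 2               # ...then it can't happen again
            new_data = [None] * self._capacity  # until size doubles
            new_data[:self._size] = self._data
            self._data = new_data
        self._data[self._size] = value        # the common case: O(1)
        self._size += 1

    def __getitem__(self, i):
        if not 0 <= i < self._size:
            raise IndexError(i)
        return self._data[i]                  # random access: O(1)
```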

Linked Lists

A singly linked list stores each element in a node containing a value and a pointer to the next node. There is no contiguous memory; each node can live anywhere on the heap.

  • Insert or delete after a known node: $O(1)$ - just rewire a couple of pointers.
  • Access by index: $O(n)$ - must traverse from head.
  • Cache behavior: poor. Each pointer dereference likely causes a cache miss.

The pointer overhead is real: on a 64-bit system, each pointer costs 8 bytes regardless of the payload. For small integers this doubles or triples memory usage.

Doubly linked lists add a previous pointer, enabling $O(1)$ delete given a node reference (no need to find the predecessor). Sentinel (dummy) head and tail nodes eliminate edge-case checks for empty lists.
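
As a sketch, a doubly linked list with one circular sentinel node (the `Node`, `insert_after`, and `unlink` names are illustrative):

```python
class Node:
    __slots__ = ("value", "prev", "next")    # avoid per-node dict overhead

    def __init__(self, value=None):
        self.value = value
        self.prev = self.next = self         # self-links: works as a sentinel

class DoublyLinkedList:
    def __init__(self):
        self.sentinel = Node()               # circular sentinel: no empty-list checks

    def insert_after(self, node, value):     # O(1): rewire four pointers
        new = Node(value)
        new.prev, new.next = node, node.next
        node.next.prev = new
        node.next = new
        return new

    def unlink(self, node):                  # O(1) given the node itself
        node.prev.next = node.next
        node.next.prev = node.prev
```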

Stacks (LIFO) and queues (FIFO) are the two most common disciplined uses of sequences:

| Structure | Push/Enqueue | Pop/Dequeue | Peek |
|---|---|---|---|
| Stack (array) | $O(1)$ amortized | $O(1)$ | $O(1)$ |
| Stack (linked) | $O(1)$ | $O(1)$ | $O(1)$ |
| Queue (array, circular) | $O(1)$ amortized | $O(1)$ | $O(1)$ |
| Queue (linked) | $O(1)$ | $O(1)$ | $O(1)$ |

Array-based stacks win on cache; linked-list queues avoid the complexity of circular buffer management.
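
For illustration, a fixed-capacity ring-buffer queue; the head index wraps with modular arithmetic (a production version would resize instead of raising when full):

```python
class RingQueue:
    def __init__(self, capacity):
        self._buf = [None] * capacity
        self._head = 0                       # index of the front element
        self._size = 0

    def enqueue(self, value):
        if self._size == len(self._buf):
            raise OverflowError("queue full")
        tail = (self._head + self._size) % len(self._buf)  # wrap around
        self._buf[tail] = value
        self._size += 1

    def dequeue(self):
        if self._size == 0:
            raise IndexError("queue empty")
        value = self._buf[self._head]
        self._buf[self._head] = None                       # drop the reference
        self._head = (self._head + 1) % len(self._buf)
        self._size -= 1
        return value
```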

Hash Tables

A hash table maps keys to values in amortized $O(1)$ for insert, lookup, and delete - the most used guarantee in practical programming.

Hash Functions

A hash function $h : \text{Key} \to \{0, 1, \ldots, m-1\}$ maps a key to a bucket index. A good hash function distributes keys uniformly across buckets. For integer keys, a common choice is:

$$h(k) = k \bmod m$$

where $m$ is prime. For strings, a polynomial rolling hash is standard, where $p$ is a small base (e.g., 31) and $m$ a large prime:

$$h(s) = \left(\sum_{i=0}^{n-1} s[i] \cdot p^i\right) \bmod m$$
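
A direct translation of the formula into Python; the base $p = 31$ and modulus $10^9 + 7$ are common but illustrative choices:

```python
def poly_hash(s: str, p: int = 31, m: int = 10**9 + 7) -> int:
    """Polynomial rolling hash: (sum of s[i] * p^i) mod m."""
    h, power = 0, 1
    for ch in s:
        h = (h + ord(ch) * power) % m
        power = (power * p) % m              # p^i, computed incrementally
    return h
```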

Collision Resolution

Two keys can hash to the same bucket - a collision. The two dominant resolution strategies are:

Chaining: each bucket holds a linked list of all keys that hashed there. Lookup scans the list. Expected list length is $\alpha = n/m$ (the load factor). Operations are $O(1 + \alpha)$ on average.
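
A minimal chaining sketch (the `ChainedMap` name is illustrative; Python lists stand in for the per-bucket chains):

```python
class ChainedMap:
    def __init__(self, num_buckets=8):
        self._buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                     # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))          # expected O(1 + alpha)

    def get(self, key):
        for k, v in self._bucket(key):       # scan the chain
            if k == key:
                return v
        raise KeyError(key)
```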

Open addressing: all elements live in the table itself. On collision, probe to the next slot according to some sequence. Linear probing checks $h(k), h(k)+1, h(k)+2, \ldots$ - simple but suffers from primary clustering. Double hashing uses a second hash function for the step size, spreading probes more evenly.

Tombstones and Resizing

When an element is deleted under open addressing, simply clearing the slot breaks probe chains for keys that passed through it. The fix: leave a tombstone marker. Subsequent lookups skip tombstones; subsequent inserts can reuse them.

Tombstones accumulate over time and degrade performance. Standard practice: resize the table (rehash everything into a fresh array, dropping tombstones) when the load factor crosses a threshold, typically $\alpha > 0.7$ for open addressing or $\alpha > 1.0$ for chaining. A resize costs $O(n)$, but with capacity doubling it happens only $O(\log n)$ times over $n$ inserts, keeping the amortized cost per operation $O(1)$.
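
A sketch of linear probing with tombstone deletion (the `ProbedMap` name and `_EMPTY`/`_TOMBSTONE` sentinels are illustrative; resizing is omitted for brevity):

```python
_EMPTY, _TOMBSTONE = object(), object()

class ProbedMap:
    def __init__(self, capacity=8):
        self._slots = [_EMPTY] * capacity    # (key, value) pairs or sentinels

    def _probe(self, key):
        """Yield slot indices in linear-probe order: h, h+1, h+2, ..."""
        start = hash(key) % len(self._slots)
        for i in range(len(self._slots)):
            yield (start + i) % len(self._slots)

    def put(self, key, value):
        target = None                        # first reusable tombstone seen
        for idx in self._probe(key):
            slot = self._slots[idx]
            if slot is _EMPTY:               # key absent: insert, preferring
                self._slots[idx if target is None else target] = (key, value)
                return                       # an earlier tombstone slot
            if slot is _TOMBSTONE:
                if target is None:
                    target = idx             # remember it; keep probing for key
            elif slot[0] == key:
                self._slots[idx] = (key, value)   # overwrite existing key
                return
        if target is not None:
            self._slots[target] = (key, value)
        else:
            raise OverflowError("table full; a real table would resize")

    def get(self, key):
        for idx in self._probe(key):
            slot = self._slots[idx]
            if slot is _EMPTY:
                break                        # probe chain ended: key absent
            if slot is not _TOMBSTONE and slot[0] == key:
                return slot[1]
        raise KeyError(key)

    def delete(self, key):
        for idx in self._probe(key):
            slot = self._slots[idx]
            if slot is _EMPTY:
                raise KeyError(key)
            if slot is not _TOMBSTONE and slot[0] == key:
                self._slots[idx] = _TOMBSTONE  # keep the probe chain intact
                return
        raise KeyError(key)
```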

Trees

Binary Search Tree (BST)

A BST stores elements such that for every node $x$: all keys in the left subtree are less than $x.\text{key}$, and all keys in the right subtree are greater. This BST property enables $O(h)$ search, insert, and delete, where $h$ is the tree’s height.

The catch: without balancing, $h$ can reach $O(n)$ for sorted input - the tree degenerates into a linked list.
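
A minimal unbalanced BST in Python, showing the $O(h)$ search/insert pattern (duplicates are ignored for brevity):

```python
class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = self.right = None

def search(node, key):
    while node is not None:                  # walks one root-to-leaf path: O(h)
        if key == node.key:
            return node
        node = node.left if key < node.key else node.right
    return None

def insert(node, key):
    if node is None:
        return BSTNode(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return node                              # equal keys fall through: ignored
```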

AVL Trees

An AVL tree enforces the height-balance invariant: for every node, the heights of its left and right subtrees differ by at most 1. Formally, define $\text{bf}(x) = h(x.\text{left}) - h(x.\text{right})$; an AVL tree maintains $|\text{bf}(x)| \leq 1$ at every node.

When an insert or delete violates this, one of four rotations restores balance:

  • Left rotation (right-heavy): pivot the right child up.
  • Right rotation (left-heavy): pivot the left child up.
  • Left-Right rotation: left rotate the left child, then right rotate the root.
  • Right-Left rotation: right rotate the right child, then left rotate the root.

The height of an AVL tree with $n$ nodes satisfies $h < 1.44 \log_2(n+2)$, guaranteeing $O(\log n)$ operations.
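
As a sketch, the two primitive rotations (AVL height bookkeeping omitted; the double rotations compose them):

```python
def rotate_left(x):
    """Pivot x's right child up; returns the new subtree root."""
    y = x.right
    x.right = y.left                         # y's left subtree moves under x
    y.left = x
    return y

def rotate_right(x):
    """Mirror image: pivot x's left child up."""
    y = x.left
    x.left = y.right
    y.right = x
    return y
```

The Left-Right case then composes the two: `x.left = rotate_left(x.left)` followed by `rotate_right(x)`; the Right-Left case is the mirror image.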

Red-Black Trees

Red-Black trees impose a different invariant using node colors:

  1. Every node is red or black.
  2. The root is black.
  3. No two consecutive red nodes on any path (a red node’s children must be black).
  4. Every path from a node to a null leaf has the same number of black nodes (black-height).

These rules ensure height $h \leq 2 \log_2(n+1)$. Red-Black trees need fewer rotations than AVL trees on insert/delete and are used in std::map (C++), TreeMap (Java), and the Linux kernel’s scheduling rbtree.
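
As a sketch, a checker for properties 2-4, assuming a hypothetical node shape with `color`, `left`, and `right` attributes; `None` children count as black leaves:

```python
RED, BLACK = "red", "black"

def check_rb(node, is_root=True):
    """Return the subtree's black-height if the invariants hold, else raise.
    None children count as black leaves with black-height 1."""
    if node is None:
        return 1
    if is_root and node.color != BLACK:
        raise ValueError("property 2: root must be black")
    if node.color == RED:                    # property 3: no red-red edges
        for child in (node.left, node.right):
            if child is not None and child.color == RED:
                raise ValueError("property 3: red node with red child")
    left_bh = check_rb(node.left, is_root=False)
    right_bh = check_rb(node.right, is_root=False)
    if left_bh != right_bh:                  # property 4: equal black-heights
        raise ValueError("property 4: unequal black-heights")
    return left_bh + (1 if node.color == BLACK else 0)
```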

Graphs

A graph $G = (V, E)$ has vertices and edges. Two representations dominate:

Adjacency matrix: a $|V| \times |V|$ boolean (or weight) matrix. Edge query: $O(1)$. Space: $O(V^2)$. Best for dense graphs where $|E| \approx |V|^2$.

Adjacency list: each vertex holds a list of its neighbors. Edge query: $O(\text{degree})$. Space: $O(V + E)$. Best for sparse graphs where $|E| \ll |V|^2$. This is the standard representation for BFS, DFS, Dijkstra, and most graph algorithms.
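
Both representations in a few lines, with vertices labeled $0$ to $n-1$ for simplicity:

```python
n = 4
edges = [(0, 1), (0, 2), (1, 3)]

# Adjacency matrix: O(1) edge query, O(V^2) space.
matrix = [[False] * n for _ in range(n)]
for u, v in edges:
    matrix[u][v] = matrix[v][u] = True       # undirected: set both directions

# Adjacency list: O(V + E) space, natural for BFS/DFS.
adj = [[] for _ in range(n)]
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

assert matrix[0][1] and 3 in adj[1]
```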

Examples

Example 1 - Choosing a hash table vs. BST: You need a dictionary that supports insert, delete, and lookup, but not ordered iteration or range queries. Use a hash table: amortized $O(1)$ vs. $O(\log n)$ for BST. If you need min, max, or sorted order, use a BST.

Example 2 - Memory layout matters: Summing all elements of a $1000 \times 1000$ matrix row-by-row vs. column-by-column takes measurably different time. Row-major traversal reads consecutive addresses and uses every byte of each loaded cache line; column-major traversal jumps a full row between accesses, wasting most of each line. The array's contiguous layout makes row-major traversal significantly faster on most architectures.
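
A quick, unscientific way to observe this yourself (pure-Python interpreter overhead dampens the effect, and exact timings vary by machine):

```python
import time

n = 1000
matrix = [[1] * n for _ in range(n)]

def traverse(row_major: bool) -> float:
    """Time a full sum of the matrix in the given access order."""
    start = time.perf_counter()
    total = 0
    for i in range(n):
        for j in range(n):
            total += matrix[i][j] if row_major else matrix[j][i]
    return time.perf_counter() - start

print(f"row-major:    {traverse(True):.4f}s")
print(f"column-major: {traverse(False):.4f}s")
```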

Example 3 - Open addressing in practice: CPython’s dict uses open addressing with a pseudo-random probe sequence. The load factor threshold is 2/3; above that the table is resized by roughly 4×. This keeps average probe length below 2.

When to Use Which Structure

| Need | Best fit |
|---|---|
| Fast random access | Array |
| Fast insert/delete at known position | Linked list |
| LIFO discipline | Stack (array-backed) |
| FIFO discipline | Queue (ring buffer or linked) |
| $O(1)$ average lookup by key | Hash table |
| $O(\log n)$ ordered operations | Balanced BST (AVL / Red-Black) |
| BFS/DFS on sparse graph | Adjacency list |
| Dense graph edge queries | Adjacency matrix |

The vocabulary of data structures is the vocabulary of algorithms. Every design decision - from cache locality to deletion semantics - feeds back into the runtime of whatever code you build on top.

