Hash Maps - Trading Space for Constant-Time Lookup // Megha Bose

Helpful context:

Data Structures - Storage Is Strategy

Imagine you are running the world’s largest library - 10 million books. A patron walks in and asks for a specific title. In the worst version of this library, the books are shelved in no particular order. You start at shelf one and check every book. On a good day, the book is near the front. On a bad day, it is the last one. On average, you check 5 million books. This is $O(n)$ lookup.

The phone book solved a version of this problem with alphabetical ordering: $O(\log n)$ binary search. Find the midpoint, decide left or right, repeat. Much better. But even binary search on a million entries takes about 20 comparisons. Could we do it in one?

Hans Peter Luhn, an IBM engineer, asked this question in 1953. His answer - credited as one of the earliest descriptions of hashing - was startling in its simplicity: instead of searching for the key, compute a number from the key, and use that number directly as the index where the value lives. If you can compute “where this key should be,” you never have to search at all.

This is the hash table. It is one of the most important data structures in computing, and understanding why it works - and crucially, when it breaks - is essential for anyone writing real software.

The Core Idea: From Key to Index

A hash function takes a key (a string, a number, anything) and returns an integer. That integer, reduced modulo the table size, gives you an index into an array. The value lives at that index.

# Python's built-in hash function
print(hash("alice"))   # some large integer, e.g. 7485320218934759040
print(hash(42))        # 42  (integers often hash to themselves)
print(hash("alice") % 16)   # e.g. 0 - index into a 16-bucket table

The lookup then works like this: to find “alice”, compute hash("alice") % table_size, jump directly to that index, done. One computation, one array access - $O(1)$.

Note that Python randomizes hash seeds at startup (since Python 3.3). This means hash("alice") returns a different integer each time you run Python, though it stays stable within a single run. This is an important security decision we will return to.

Why Collisions Are Inevitable: The Birthday Paradox

The immediate question is: what happens when two different keys hash to the same index? This is called a collision, and the surprising truth is that collisions are not rare edge cases - they are mathematically inevitable long before the table is full.

The birthday paradox gives the intuition. In a room of 23 people, there is a 50% chance two share a birthday - even though there are 365 possible birthdays. With 57 people, the probability rises to 99%. The reason: you are comparing every pair, not each person to a fixed date. The number of pairs grows as $O(n^2)$.

The same math applies to hash tables. Even with a large table, the probability of at least one collision grows quickly as entries accumulate. This is not a failure of the hash function - it is a mathematical fact. Good hash tables handle collisions gracefully; they do not pretend they will not happen.

Collision Resolution: Two Wars

The field has fought over two main approaches for seventy years.

Chaining (also called open hashing or closed addressing): each bucket holds a linked list (or similar structure) of all key-value pairs that hashed to that index. On lookup, you hash to the bucket, then search the list.

bucket 0: [("alice", 30) -> ("zara", 25)]   # two keys collided here
bucket 1: [("bob", 22)]
bucket 2: []
bucket 3: [("carol", 28)]

As long as chains stay short, lookup is effectively $O(1)$. In the worst case - all keys collide into one bucket - lookup degrades to $O(n)$.

Open addressing (also called closed hashing): collisions are resolved by probing for the next open slot in the array. Linear probing tries the next slot, then the next, and so on. Quadratic probing jumps by $i^2$ on the $i$-th probe. Double hashing uses a second hash function to determine the probe step.

# Conceptual linear probing (not Python's actual implementation)
def insert(table, key, value):
    i = hash(key) % len(table)
    while table[i] is not None:
        i = (i + 1) % len(table)   # probe next slot
    table[i] = (key, value)

Open addressing has better cache behavior than chaining - everything lives in a single array, so probing through a few slots is mostly cache hits. But it is more sensitive to load factor, and deletions are tricky (you cannot simply erase a slot, because it might break a probe chain).

Neither approach is universally better. Python’s dict uses a variant of open addressing. Java’s HashMap traditionally used chaining. The practical performance difference is often small compared to the load factor and hash quality.

Load Factor and Rehashing

The load factor is the ratio of stored entries to total buckets. If you have 7 entries in a 10-bucket table, the load factor is 0.7. As it rises, collisions become more frequent - more slots are occupied, so new keys are more likely to land somewhere that’s already taken.

Most implementations trigger a rehash when the load factor exceeds a threshold (Python uses about 2/3). Rehashing allocates a new, larger array (typically 2x) and reinserts all entries.

Before rehash: 7 entries in 10 buckets (load factor 0.7) → rehash!
After rehash:  7 entries in 20 buckets (load factor 0.35) → breathing room

Rehashing is $O(n)$ - every entry must be reinserted. But like array resizing, it happens exponentially less often as the table grows, so the amortized cost per insertion remains $O(1)$. You pay occasionally for a bulk operation, but averaged across many insertions, the cost is constant.

Python Dict Internals: More Interesting Than You Think

Python’s dict has evolved significantly. Before Python 3.6, dicts were unordered - the iteration order was arbitrary. In Python 3.6, CPython 3.6 switched to a compact dict implementation designed by Raymond Hettinger, with a critical side effect: insertion order is preserved. In Python 3.7, this became an official language guarantee.

The trick is clever. Instead of one large array holding key-value pairs directly, the compact dict uses two separate structures:

A small indices array mapping hash positions to positions in the entries array
A dense entries array storing key-value pairs in insertion order

# Indices array (sparse, holds positions into entries array)
# [-1, -1, 2, -1, 0, 1, -1, -1]   (-1 = empty)

# Entries array (dense, insertion order preserved)
# [(hash_alice, "alice", 30), (hash_bob, "bob", 22), (hash_carol, "carol", 28)]

This is more memory-efficient than the old approach (the indices array is small integers, not full entries), and the dense entries array is cache-friendly for iteration. It is also why Python dicts are faster than you might expect - iteration walks a dense array, not a sparse one.

user = {"name": "alice", "age": 30, "city": "berlin"}

# All O(1) average case
print(user["name"])           # "alice"
user["email"] = "a@b.com"    # insert
del user["city"]              # delete
print("age" in user)          # True - O(1) membership test

# Safe access with default
print(user.get("phone", "not provided"))   # "not provided" without KeyError

# Iteration preserves insertion order (guaranteed since Python 3.7)
for key, val in user.items():
    print(key, val)   # name, age, email - in that order

The Discomfort: When O(1) Becomes O(n)

The phrase “$O(1)$ average case” contains a dangerous qualifier. The worst case for a hash table is $O(n)$, and understanding when that worst case actually materializes matters enormously.

Hash flooding attacks. If an attacker can predict your hash function, they can craft input keys that all hash to the same bucket. Your $O(1)$ lookups degrade to $O(n)$ chain traversal. In 2011, a group of researchers demonstrated this against multiple web frameworks - by sending crafted POST data with carefully chosen field names, they could consume nearly all CPU time on a web server with a single request.

The fix Python uses is SipHash, a hash function designed specifically to be fast but also resistant to adversarial input when seeded with a secret. Because Python randomizes the hash seed at startup, an attacker cannot precompute which inputs will collide. The attack becomes infeasible.

# hash("alice") returns a DIFFERENT number each Python invocation
# This is intentional - it prevents precomputed collision attacks
import os; print(os.getenv("PYTHONHASHSEED", "random"))

Hash quality matters too. A naive hash function that only looks at the first few characters of a string will collide badly for inputs like URL parameters that start with the same prefix. Good hash functions distribute entropy evenly across all bits of the output.

Memory overhead. Hash tables are not memory-efficient. A table at 2/3 load factor is storing data in 2/3 of its buckets, leaving 1/3 empty. Add the overhead of storing keys alongside values, and hash tables often use 3-5x more memory than a plain array of values would. For small datasets or memory-constrained environments, this matters.

No ordering beyond insertion. You cannot efficiently ask “what is the 5th smallest key?” or “give me all keys between ‘alice’ and ‘carol’.” Hash tables trade sorted-order access for constant-time lookup. When you need range queries, sorted structures (BSTs, B-trees) are the right tool.

Hash Tables in Production Systems

Hash tables are everywhere in real systems, often under different names.

Database hash indexes. PostgreSQL and MySQL support hash indexes - an alternative to B-tree indexes for exact-equality lookups. A hash index is $O(1)$ for WHERE email = 'alice@example.com' but cannot support range queries like WHERE age BETWEEN 25 AND 35. Most databases default to B-trees precisely because range queries are common.

Consistent hashing in distributed systems. When you have a distributed cache or database sharded across many machines, you need to decide which machine holds which key. Naive modulo hashing (key % num_machines) breaks whenever you add or remove a machine - almost every key remaps. Consistent hashing uses a ring-based approach where only a fraction of keys need to remap when the cluster changes. Amazon DynamoDB, Cassandra, and Redis Cluster all use consistent hashing under the hood.

Bloom filters. A Bloom filter is a probabilistic data structure that answers “is this key definitely not in the set?” in $O(1)$ space that is much smaller than a full hash table. It can have false positives (says something is present when it isn’t) but never false negatives. Databases use Bloom filters to avoid unnecessary disk reads - before checking disk for a key, check if the Bloom filter says it might be there. Chrome uses Bloom filters to check URLs against a list of malicious sites locally before hitting a server.

Sets: Hash Maps Without Values

A set is a hash map where each key maps to nothing - it stores values with no associated data. Membership testing is $O(1)$, and set operations are natural and efficient:

seen = set()
seen.add("alice")
seen.add("bob")
print("alice" in seen)   # True - O(1)

a = {"alice", "bob", "carol"}
b = {"bob", "diana"}
print(a & b)    # {'bob'} - intersection
print(a | b)    # {'alice', 'bob', 'carol', 'diana'} - union
print(a - b)    # {'alice', 'carol'} - difference

Sets are underused by beginners and overused by experienced programmers who forget that a list is simpler when you don’t need membership tests. If you’re doing if x in my_list in a hot loop, convert to a set. If you’re just iterating once, a list is fine.

The Two-Sum Classic

The canonical example of hash table power: given a list of integers and a target, find two numbers that sum to the target. The naive approach is $O(n^2)$ - try every pair. The hash table approach is $O(n)$:

def two_sum(nums: list[int], target: int) -> tuple[int, int] | None:
    seen = {}   # value -> index

    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return (seen[complement], i)
        seen[num] = i

    return None

print(two_sum([2, 7, 11, 15], 9))   # (0, 1)

The insight: for each number, you need to know if its complement has been seen before. A hash table turns that question from a linear scan into a single lookup. This pattern - store things you have seen in a hash map, then check against that map - appears constantly in algorithm problems and real code.

Future Directions

Hash tables are mature but not static. A few active areas:

Robin Hood hashing reduces worst-case lookup variance by stealing slots from “richer” entries (entries that are close to their ideal position) for “poorer” ones (entries that have been displaced far). This improves cache behavior and reduces the probability of pathologically long probe chains.

Cuckoo hashing uses two hash functions and two tables, guaranteeing $O(1)$ worst-case lookup (not just average). When a collision occurs, the existing entry is “kicked” to its alternate position, potentially cascading. It is more complex but useful when worst-case guarantees matter more than simplicity.

As CPU cache architectures evolve, the tradeoff between chaining and open addressing continues to shift. Cache-friendly hash tables that store keys and values in adjacent memory, with SIMD instructions to compare multiple keys simultaneously, are an active area of systems programming research.

Luhn’s Other Algorithm: Validation Without Any Lookup

The Hans Peter Luhn mentioned at the top of this post - the IBM engineer credited with early hashing in 1953 - also invented one of the most widely deployed algorithms in existence: the Luhn checksum, patented in 1954. Where hashing asks “can we find data without searching?”, the Luhn algorithm asks a different question: “can we detect data entry errors without querying a database at all?”

Every credit card on Earth is checked by this algorithm in the browser before any network request leaves your machine. The “invalid card number” error you see before submitting a form is Luhn, not a server response. It also validates IMEI numbers (your phone’s hardware identifier), Canadian Social Insurance Numbers, US National Provider Identifiers in healthcare, and ISIN securities codes.

The problem it solves. A user types a 16-digit card number. Human typos follow patterns: the most common single error is changing one digit; the most common two-digit error is transposing two adjacent digits (writing 48 as 84). A checksum algorithm that catches both of these can reject obvious typos instantly, without a round trip to a payment processor.

The algorithm.

Starting from the rightmost digit, label every position from right to left as 1, 2, 3, …. Leave digits at odd positions (1, 3, 5, …) as they are. Double digits at even positions (2, 4, 6, …). If doubling produces a number greater than 9, subtract 9. Sum everything. If the total is divisible by 10, the number passes.

The rightmost digit is the check digit - chosen at card issuance so the sum lands on a multiple of 10.

Worked example. Card number 4539 1488 0343 6467:

Digit	Position from right	Operation	Result
7	1 (odd)	keep	7
6	2 (even)	6 × 2 = 12, 12 - 9	3
4	3 (odd)	keep	4
6	4 (even)	6 × 2 = 12, 12 - 9	3
3	5 (odd)	keep	3
4	6 (even)	4 × 2	8
3	7 (odd)	keep	3
0	8 (even)	0 × 2	0
8	9 (odd)	keep	8
8	10 (even)	8 × 2 = 16, 16 - 9	7
4	11 (odd)	keep	4
1	12 (even)	1 × 2	2
9	13 (odd)	keep	9
3	14 (even)	3 × 2	6
5	15 (odd)	keep	5
4	16 (even)	4 × 2	8

Sum: 7 + 3 + 4 + 3 + 3 + 8 + 3 + 0 + 8 + 7 + 4 + 2 + 9 + 6 + 5 + 8 = 80. $80 \bmod 10 = 0$. Valid.

Implementation.

def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in number if d.isdigit()]
    # Double every second digit from the right (0-indexed: positions -2, -4, ...)
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return sum(digits) % 10 == 0

def luhn_check_digit(partial: str) -> int:
    """Given a number without its check digit, compute what the check digit should be."""
    digits = [int(d) for d in partial if d.isdigit()]
    for i in range(len(digits) - 1, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return (10 - sum(digits) % 10) % 10

print(luhn_valid("4539148803436467"))   # True
print(luhn_valid("4539148803436468"))   # False - one digit changed
print(luhn_valid("4539148803436476"))   # False - two adjacent digits transposed
print(luhn_check_digit("453914880343646"))   # 7

What it catches and what it doesn’t. Luhn detects every single-digit error (changing any one digit always changes the sum modulo 10). It detects almost all adjacent transpositions - the one exception is swapping a 0 and a 9, because both contribute 0 and 9 to the sum respectively regardless of which position they occupy. For practical card validation, this exception is acceptable; it occurs infrequently enough that the tradeoff is worth the simplicity.

Luhn is not a cryptographic checksum. It provides no secrecy, no authentication, and no resistance to deliberate forgery - anyone who knows the algorithm can construct a valid Luhn number. A payment processor does far more than Luhn validation before authorizing a charge. The algorithm’s purpose is pure typo detection: filter out the obvious before contacting the server.

The connection to hashing. Both hashing and the Luhn algorithm share Luhn’s core intuition: compute a number from the data, and let that number tell you something useful without searching through a database. A hash function tells you where to look. A checksum tells you whether to look at all. Different answers to the same underlying question.

Summary

Property	Hash Table
Lookup	$O(1)$ average, $O(n)$ worst case
Insert	$O(1)$ amortized
Delete	$O(1)$ average
Memory overhead	High (table typically 1/3 empty + key storage)
Cache behavior	Moderate under load, poor when chaining
Ordering	Insertion order only (Python 3.7+); no range queries
Worst case trigger	Hash flooding, poor hash function, high load factor
Fix for adversarial input	SipHash with random seed (Python’s approach)

Hash tables are the right default for lookup-by-key, deduplication, counting, and caching. Know their failure modes, understand that $O(1)$ average is not $O(1)$ guaranteed, and treat memory overhead as a real cost when designing systems at scale.

Read Next:

Trees & Heaps - When Flat Structures Hit Their Limit