Hash Maps and O(1) Lookup
Prerequisite: Data Structures: Arrays to Queues
Searching a list for a value takes $O(n)$ time - every element must be checked until you find a match (or exhaust the list). For a million-item list, that could mean a million comparisons. Hash maps solve this problem by giving you $O(1)$ average-case lookup regardless of size. Understanding how they work explains both their power and their limits.
The Problem with Linear Search
names = ["alice", "bob", "charlie", "diana"]
# O(n) - Python checks every element until it finds "charlie"
if "charlie" in names:
print("found")
This is fine for small lists. At scale it becomes a bottleneck. The fix is a data structure that maps a key directly to where its value lives in memory.
How Hash Functions Work
A hash function takes a key and returns an integer - the hash. That integer, reduced modulo the table size, is used as an index into an underlying array of buckets.
# Python's built-in hash function
print(hash("alice")) # some large integer
print(hash(42)) # 42 (integers hash to themselves)
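The hash itself is usually far larger than the table, so implementations reduce it modulo the bucket count. A minimal sketch (the bucket count of 8 is just illustrative):
num_buckets = 8  # illustrative table size
index = hash("alice") % num_buckets  # always lands in 0..7
print(index)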
A good hash function distributes keys uniformly across buckets, minimising the chance that two keys land in the same slot. When they do, that is called a collision.
Handling Collisions
Two strategies are common:
- Chaining: each bucket holds a linked list of all key-value pairs that hashed to that index. Lookup walks the chain. Average case stays $O(1)$ as long as chains stay short (a sketch follows below).
- Open addressing: if a bucket is taken, probe neighbouring slots until an empty one is found. More cache-friendly, but performance degrades sharply as the table fills.
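Here is a minimal chaining sketch - a toy for illustration, not how CPython's dict works internally (CPython in fact uses open addressing):
class ChainedHashMap:
    def __init__(self, num_buckets: int = 8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key) -> int:
        return hash(key) % len(self.buckets)

    def put(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # key already present: overwrite
                return
        bucket.append((key, value))  # collision: append to the chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

m = ChainedHashMap()
m.put("alice", 30)
print(m.get("alice"))  # 30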
Load Factor and Resizing
The load factor is the ratio of stored entries to total buckets. As it rises, collisions become more frequent. Most hash map implementations resize - allocating a larger bucket array and rehashing every entry - when the load factor crosses a threshold (CPython's dict resizes when it is roughly two-thirds full). Python handles this automatically.
Resizing is $O(n)$, but it happens infrequently enough that the amortised cost per insertion remains $O(1)$.
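On the toy ChainedHashMap above, a resize might look like this - a hypothetical helper, shown standalone for clarity:
def resize(m: ChainedHashMap) -> None:
    old_buckets = m.buckets
    m.buckets = [[] for _ in range(len(old_buckets) * 2)]  # double capacity
    for bucket in old_buckets:
        for key, value in bucket:
            m.put(key, value)  # rehash: every index changes with the new size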
Python Dicts
Python’s dict is a production-grade hash map:
user = {"name": "alice", "age": 30, "city": "berlin"}
# O(1) average: get, set, delete
print(user["name"]) # "alice"
user["email"] = "a@b.com" # insert
del user["city"] # delete
print("age" in user) # True - O(1) membership test
Since Python 3.7, dicts preserve insertion order. get() is safer than bracket access when a key might be missing:
print(user.get("phone", "not provided")) # "not provided" without KeyError
When Hash Maps Fail
Hash maps are not always the right tool:
- Hash collisions as a DoS vector: if an attacker can predict your hash function, they can craft inputs that all collide in one bucket, degrading $O(1)$ lookups to $O(n)$. Python randomises string hash seeds at process startup to mitigate this.
- Non-hashable keys: lists, dicts, and sets cannot be dict keys because they are mutable. Use immutable equivalents such as tuples or frozensets; a short demonstration follows this list.
- No ordering guarantees beyond insertion order: you cannot efficiently ask “what is the 5th smallest key?” A sorted structure is needed for that.
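To see the non-hashable key restriction in action (the variable names here are just illustrative):
point_values = {}
# point_values[[1, 2]] = "a"  # TypeError: unhashable type: 'list'
point_values[(1, 2)] = "a"    # tuples are immutable, so they hash fine
print(point_values[(1, 2)])   # "a"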
OrderedDict and Counter
from collections import OrderedDict, Counter
# OrderedDict: useful for LRU caches and ordered operations
od = OrderedDict()
od["first"] = 1
od["second"] = 2
od.move_to_end("first") # "first" is now last
# Counter: a dict subclass optimised for counting
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
counts = Counter(words)
print(counts.most_common(2)) # [('apple', 3), ('banana', 2)]
# Counter arithmetic
a = Counter(["a", "b", "a"])
b = Counter(["a", "c"])
print(a + b) # Counter({'a': 3, 'b': 1, 'c': 1})
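As an example of the LRU-cache use case mentioned above, here is a minimal sketch with a fixed capacity - the class and its names are illustrative, not a standard-library API:
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value) -> None:
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used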
Sets
A set is a hash map that stores only keys, with no associated values. Membership testing is $O(1)$.
seen = set()
seen.add("alice")
seen.add("bob")
print("alice" in seen) # True - O(1)
print("carol" in seen) # False - O(1)
# Set operations
a = {"alice", "bob", "carol"}
b = {"bob", "diana"}
print(a & b) # {'bob'} - intersection
print(a | b) # union
print(a - b) # {'alice', 'carol'} - difference
Common Patterns
Counting frequencies - reach for Counter or a defaultdict(int):
from collections import defaultdict
def char_frequency(s: str) -> dict:
    freq = defaultdict(int)
    for ch in s:
        freq[ch] += 1
    return dict(freq)
Caching / memoisation - store computed results so you never compute the same input twice:
cache = {}
def fib(n: int) -> int:
    if n in cache:
        return cache[n]
    if n <= 1:
        return n
    result = fib(n - 1) + fib(n - 2)
    cache[n] = result
    return result
(Python’s functools.lru_cache decorator does this automatically.)
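The decorator version of the same memoised fib, using the standard library (renamed here to avoid shadowing the definition above):
from functools import lru_cache

@lru_cache(maxsize=None)  # None means the cache grows without bound
def fib_cached(n: int) -> int:
    if n <= 1:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)

print(fib_cached(50))  # 12586269025, computed in linear time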
Examples
Two-Sum in $O(n)$
Given a list of integers and a target, find two numbers that sum to the target. The naive approach tries every pair in $O(n^2)$. With a hash map it is $O(n)$:
def two_sum(nums: list[int], target: int) -> tuple[int, int] | None:
    seen = {}  # value -> index
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return (seen[complement], i)
        seen[num] = i
    return None
print(two_sum([2, 7, 11, 15], 9)) # (0, 1) - nums[0] + nums[1] = 9
Finding Duplicates
def has_duplicate(items: list) -> bool:
    return len(items) != len(set(items))

def first_duplicate(items: list):
    seen = set()
    for item in items:
        if item in seen:
            return item
        seen.add(item)
    return None
print(first_duplicate([3, 1, 4, 1, 5, 9])) # 1
Group Anagrams
Anagrams share the same sorted characters. Use that as the hash map key:
from collections import defaultdict
def group_anagrams(words: list[str]) -> list[list[str]]:
    groups = defaultdict(list)
    for word in words:
        key = tuple(sorted(word))
        groups[key].append(word)
    return list(groups.values())
words = ["eat", "tea", "tan", "ate", "nat", "bat"]
print(group_anagrams(words))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
Summary
Hash maps give you $O(1)$ average-case lookup, insert, and delete by using a hash function to map keys to array indices. Python’s dict and set are fast, battle-tested implementations. Reach for them whenever you need to count, group, deduplicate, or cache results - and reach for Counter and defaultdict to avoid boilerplate around common patterns.
Read Next: Searching Algorithms: Binary Search