Hash Maps and O(1) Lookup
Prerequisite: Data Structures: Arrays to Queues
Searching a list for a value takes $O(n)$ time - every element must be checked until you find a match (or exhaust the list). For a million-item list, that could mean a million comparisons. Hash maps solve this problem by giving you $O(1)$ average-case lookup regardless of size. Understanding how they work explains both their power and their limits.
The Problem with Linear Search
names = ["alice", "bob", "charlie", "diana"]
# O(n) - Python checks every element until it finds "charlie"
if "charlie" in names:
print("found")
This is fine for small lists. At scale it becomes a bottleneck. The fix is a data structure that maps a key directly to where its value lives in memory.
How Hash Functions Work
A hash function takes a key and returns an integer - the hash. That integer, reduced modulo the table size, is used as an index into an underlying array of buckets.
# Python's built-in hash function
print(hash("alice")) # some large integer
print(hash(42)) # 42 (integers hash to themselves)
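The hash itself is usually far larger than the table, so implementations reduce it modulo the bucket count. A minimal sketch (the bucket count of 8 is just illustrative):
num_buckets = 8  # illustrative table size
index = hash("alice") % num_buckets  # always lands in 0..7
print(index)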
A good hash function distributes keys uniformly across buckets, minimising the chance that two keys land in the same slot. When they do, that is called a collision.
Handling Collisions
Two strategies are common:
- Chaining: each bucket holds a linked list of all key-value pairs that hashed to that index. Lookup walks the chain. Average case stays $O(1)$ as long as chains stay short (a sketch follows below).
- Open addressing: if a bucket is taken, probe neighbouring slots until an empty one is found. More cache-friendly, but performance degrades sharply as the table fills.
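Here is a minimal chaining sketch - a toy for illustration, not how CPython's dict works internally (CPython in fact uses open addressing):
class ChainedHashMap:
    def __init__(self, num_buckets: int = 8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key) -> int:
        return hash(key) % len(self.buckets)

    def put(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # key already present: overwrite
                return
        bucket.append((key, value))  # collision: append to the chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

m = ChainedHashMap()
m.put("alice", 30)
print(m.get("alice"))  # 30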
Load Factor and Resizing
The load factor is the ratio of stored entries to total buckets. As it rises, collisions become more frequent. Most hash map implementations resize - allocating a larger bucket array and rehashing every entry - when the load factor crosses a threshold (CPython's dict resizes when it is roughly two-thirds full). Python handles this automatically.
Resizing is $O(n)$, but it happens infrequently enough that the amortised cost per insertion remains $O(1)$.
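On the toy ChainedHashMap above, a resize might look like this - a hypothetical helper, shown standalone for clarity:
def resize(m: ChainedHashMap) -> None:
    old_buckets = m.buckets
    m.buckets = [[] for _ in range(len(old_buckets) * 2)]  # double capacity
    for bucket in old_buckets:
        for key, value in bucket:
            m.put(key, value)  # rehash: every index changes with the new size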
Python Dicts
Python’s dict is a production-grade hash map:
user = {"name": "alice", "age": 30, "city": "berlin"}
# O(1) average: get, set, delete
print(user["name"]) # "alice"
user["email"] = "a@b.com" # insert
del user["city"] # delete
print("age" in user) # True - O(1) membership test
Since Python 3.7, dicts preserve insertion order. get() is safer than bracket access when a key might be missing:
print(user.get("phone", "not provided")) # "not provided" without KeyError
When Hash Maps Fail
Hash maps are not always the right tool:
- Hash collisions as a DoS vector: if an attacker can predict your hash function, they can craft inputs that all collide in one bucket, degrading $O(1)$ lookups to $O(n)$. Python randomises string hash seeds at process startup to mitigate this.
- Non-hashable keys: lists, dicts, and sets cannot be dict keys because they are mutable. Use immutable equivalents such as tuples or frozensets; a short demonstration follows this list.
- No ordering guarantees beyond insertion order: you cannot efficiently ask “what is the 5th smallest key?” A sorted structure is needed for that.
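To see the non-hashable key restriction in action (the variable names here are just illustrative):
point_values = {}
# point_values[[1, 2]] = "a"  # TypeError: unhashable type: 'list'
point_values[(1, 2)] = "a"    # tuples are immutable, so they hash fine
print(point_values[(1, 2)])   # "a"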
OrderedDict and Counter
from collections import OrderedDict, Counter
# OrderedDict: useful for LRU caches and ordered operations
od = OrderedDict()
od["first"] = 1
od["second"] = 2
od.move_to_end("first") # "first" is now last
# Counter: a dict subclass optimised for counting
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
counts = Counter(words)
print(counts.most_common(2)) # [('apple', 3), ('banana', 2)]
# Counter arithmetic
a = Counter(["a", "b", "a"])
b = Counter(["a", "c"])
print(a + b) # Counter({'a': 3, 'b': 1, 'c': 1})
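As an example of the LRU-cache use case mentioned above, here is a minimal sketch with a fixed capacity - the class and its names are illustrative, not a standard-library API:
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value) -> None:
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used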
Sets
A set is a hash map that stores only keys, with no associated values. Membership testing is $O(1)$.
seen = set()
seen.add("alice")
seen.add("bob")
print("alice" in seen) # True - O(1)
print("carol" in seen) # False - O(1)
# Set operations
a = {"alice", "bob", "carol"}
b = {"bob", "diana"}
print(a & b) # {'bob'} - intersection
print(a | b) # union
print(a - b) # {'alice', 'carol'} - difference
Common Patterns
Counting frequencies - reach for Counter or a defaultdict(int):
from collections import defaultdict
def char_frequency(s: str) -> dict:
    freq = defaultdict(int)
    for ch in s:
        freq[ch] += 1
    return dict(freq)
Caching / memoisation - store computed results so you never compute the same input twice:
cache = {}
def fib(n: int) -> int:
    if n in cache:
        return cache[n]
    if n <= 1:
        return n
    result = fib(n - 1) + fib(n - 2)
    cache[n] = result
    return result
(Python’s functools.lru_cache decorator does this automatically.)
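The decorator version of the same memoised fib, using the standard library (renamed here to avoid shadowing the definition above):
from functools import lru_cache

@lru_cache(maxsize=None)  # None means the cache grows without bound
def fib_cached(n: int) -> int:
    if n <= 1:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)

print(fib_cached(50))  # 12586269025, computed in linear time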
Examples
Two-Sum in $O(n)$
Given a list of integers and a target, find two numbers that sum to the target. The naive approach tries every pair in $O(n^2)$. With a hash map it is $O(n)$:
def two_sum(nums: list[int], target: int) -> tuple[int, int] | None:
    seen = {}  # value -> index
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return (seen[complement], i)
        seen[num] = i
    return None
print(two_sum([2, 7, 11, 15], 9)) # (0, 1) - nums[0] + nums[1] = 9
Finding Duplicates
def has_duplicate(items: list) -> bool:
    return len(items) != len(set(items))

def first_duplicate(items: list):
    seen = set()
    for item in items:
        if item in seen:
            return item
        seen.add(item)
    return None
print(first_duplicate([3, 1, 4, 1, 5, 9])) # 1
Group Anagrams
Anagrams share the same sorted characters. Use that as the hash map key:
from collections import defaultdict
def group_anagrams(words: list[str]) -> list[list[str]]:
    groups = defaultdict(list)
    for word in words:
        key = tuple(sorted(word))
        groups[key].append(word)
    return list(groups.values())
words = ["eat", "tea", "tan", "ate", "nat", "bat"]
print(group_anagrams(words))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
Summary
Hash maps give you $O(1)$ average-case lookup, insert, and delete by using a hash function to map keys to array indices. Python’s dict and set are fast, battle-tested implementations. Reach for them whenever you need to count, group, deduplicate, or cache results - and reach for Counter and defaultdict to avoid boilerplate around common patterns.
Read Next: Searching Algorithms: Binary Search