Bloom Filters - Answering "Definitely Not" in Constant Space
Helpful context:
- Databases & Indexes - The Structures That Make Queries Fast
- LSM Trees - Why Databases Gave Up on Updating in Place
There is a question that systems ask millions of times per day: “is this thing in this set?” Is this URL on the malware blocklist? Is this username already taken? Is this key in this SSTable? The naive answer is to store the set and check it. But what if the set contains billions of items and checking it requires a disk read? A bloom filter answers this question differently: it tells you “definitely not” or “possibly yes” - and it does so in constant time using a fixed amount of memory, regardless of how many items are in the set. The tradeoff is a controlled probability of being wrong in one direction.
Burton Bloom introduced this data structure in 1970 in a paper titled “Space/Time Trade-offs in Hash Coding with Allowable Errors.” The name is his own. The paper is one of the most-cited in computer science, and the idea has turned out to be useful in almost every domain of systems engineering.
The Problem It Solves
Consider an LSM tree database. When a read arrives for key $k$, the engine checks the memtable (fast, in memory), then checks SSTable files from newest to oldest (slower, on disk). If $k$ does not exist in the database at all, this degenerates into reading every SSTable file to confirm its absence. On a database with dozens or hundreds of SSTables, each potentially requiring a disk seek, this is catastrophically expensive.
What you want is a way to quickly ask, before touching disk: “is key $k$ definitely not in this SSTable?” If the answer is yes (definitely not), skip the SSTable entirely. If the answer is maybe (possibly in there), proceed to check.
A bloom filter gives exactly this. It cannot tell you for certain that a key is present. But it can tell you for certain that a key is absent.
How It Works
A bloom filter is a bit array of $m$ bits, all initialized to 0, combined with $k$ independent hash functions. Each hash function maps an item to one of the $m$ bit positions.
Adding an item: hash it with all $k$ hash functions. Set the bit at each resulting position to 1.
Checking membership: hash the query item with all $k$ hash functions. If any of the resulting bit positions is 0, the item is definitely not in the set - you would have set all those bits when you added it. If all positions are 1, the item is probably in the set.
The “probably” is the core of the tradeoff. It is possible that all $k$ bits for a query item happen to be set by combinations of other items that were added - even though this particular item was never added. This is a false positive: the filter says “possibly in set” but the item is actually absent. False positives are acceptable in many applications. What is not possible is a false negative: if an item was added, all its bits were set, and they remain set (bits are never cleared). A “definitely not” answer is always correct.
The Math
The false positive probability $p$ is approximately:
$$p \approx \left(1 - e^{-kn/m}\right)^k$$
where $n$ is the number of items inserted, $m$ is the number of bits, and $k$ is the number of hash functions.
For a given $m$ and $n$, the optimal number of hash functions that minimizes $p$ is:
$$k = \frac{m}{n} \ln 2 \approx 0.693 \cdot \frac{m}{n}$$
Substituting back, the minimum false positive rate for a given number of bits per item $b = m/n$ is approximately:
$$p \approx (0.6185)^b$$
In concrete terms: 10 bits per element gives a false positive rate of roughly 1%. 20 bits per element gives roughly 0.0001%. A hash set storing the full items takes 64 bits per element (an 8-byte pointer) or more. A bloom filter achieves 1% error with 10 bits - over 6x smaller, and it scales to any set size with zero additional overhead.
The space is fixed at $m$ bits regardless of how many items you add. This is the killer feature: a bloom filter for a billion-item set uses the same memory as one tuned for that capacity, and inserting the billionth item is exactly as fast as inserting the first.
Where Bloom Filters Are Used
LSM trees and databases: Cassandra, RocksDB, HBase, and LevelDB all store a bloom filter per SSTable. Before reading an SSTable from disk, the engine checks the bloom filter. A false-positive rate of 1% means that for keys that don’t exist, 99% of SSTable reads are avoided. The entire point of LSM trees is high write throughput - bloom filters are what keep read performance from collapsing.
Google Chrome and browser security: Chrome’s Safe Browsing feature downloads a bloom filter of millions of known malicious URLs. When you visit a URL, Chrome hashes it and checks the filter locally - no network request needed. Only on a positive (either true or false) does Chrome query Google’s servers for confirmation. This reduces network traffic by orders of magnitude and avoids sending your browsing history to Google for the vast majority of safe URLs.
CDNs and caching: a CDN does not want to cache every object - caching a resource that is only ever fetched once wastes memory. One strategy: only cache an object on the second request. A bloom filter tracks which URLs have been seen once. First request: not in filter, add it to filter, don’t cache. Second request: in filter (true positive), cache it. False positives (caching something on its second “apparent” visit despite it being a true first) are acceptable - the cost is slightly over-caching, not a correctness issue.
Databases: “username taken” checks: before querying the database to check if a username exists, check a bloom filter. If the filter says “definitely not,” the username is available without a database round trip. If it says “possibly,” query the database to confirm. At signup rates of millions per day, this matters.
Bitcoin SPV nodes: Simplified Payment Verification nodes do not store the full blockchain. They use bloom filters to tell full nodes which transactions they care about. A full node filters transactions through the SPV node’s bloom filter before sending them - drastically reducing bandwidth. False positives result in a few extra transactions being sent; false negatives are impossible.
Spell checkers: the original application Bloom described. Store the dictionary as a bloom filter. A word that produces a false positive looks “correct” to the spell checker when it is not. Acceptable at low false positive rates.
Akamai: Akamai uses bloom filters to avoid caching “one-hit wonders” - items requested exactly once that would displace more popular items from cache. A bloom filter tracks all requested URLs; something is only admitted to cache if it has been seen before (i.e., it is in the filter).
Limitations
No deletion: setting bits to 1 is irreversible. Removing an item would require clearing bits, but those bits may be shared with other items. Clearing them creates false negatives - which violate the core guarantee.
Counting bloom filter: extends each bit to a counter. Increment on add, decrement on remove. Deletion is now possible. Cost: 4x the memory (4 bits per cell instead of 1). Used where deletion is required.
Cuckoo filter: a more recent structure that supports deletion, has comparable false positive rates, and is faster for lookup. It stores fingerprints in a cuckoo hash table. Generally preferred over counting bloom filters when deletion is needed.
Scalable bloom filter: when you do not know in advance how many items you will add, a scalable bloom filter dynamically adds new filter layers as the current one fills, maintaining a target false positive rate.
The false positive rate is not free: a bloom filter only helps when the false positive cost is acceptable. If a false positive triggers a disk read that returns “not found,” you have paid for a useless disk read. At 1% false positive rate, 1 in 100 lookups for absent keys does an unnecessary disk read. For 99% of absent key lookups, the disk read is completely avoided. The net win is enormous.
Sizing in Practice
To build a bloom filter:
- Decide on the acceptable false positive rate $p$ (typical: 0.01 to 0.001).
- Know the expected number of items $n$.
- Compute the required bits: $m = -n \ln p / (\ln 2)^2$.
- Compute the optimal number of hash functions: $k = (m/n) \ln 2$.
For $n = 10^6$ items and $p = 0.01$: $m \approx 9.6 \times 10^6$ bits $\approx$ 1.2 MB. Ten bits per element, roughly six hash functions. Storing the same million items in a hash set (with 8-byte keys) would take 8 MB - nearly 7x more.
The bloom filter is one of the most elegant data structures in existence: a fixed-size structure that grows in capability (more items) without growing in size, at the cost of a mathematically controlled and tunable error rate. The error is always in the safe direction. “Definitely not” is always true. “Possibly yes” sometimes lies. Every production system that cares about avoiding unnecessary I/O has one somewhere.
Read next: