LSM Trees - Why Databases Gave Up on Updating in Place
Helpful context:
If you have ever wondered why Cassandra handles millions of writes per second while a traditional relational database struggles with a fraction of that, the answer is the data structure underneath: the Log-Structured Merge-tree, or LSM tree. It is the write-optimized storage engine that powers Cassandra, RocksDB, LevelDB, HBase, InfluxDB, and TiKV - and understanding it requires first understanding what it was designed to escape.
The Problem With Updating in Place
The dominant storage structure for decades has been the B-tree. A B-tree stores data in a balanced tree of pages, each page typically 4KB or 16KB. To update a record, you find the page it lives on, load it into memory, modify it, and write it back to disk.
This works. But the write pattern is brutal: a single user update might touch one page buried deep in a tree with millions of nodes. That page lives at a random location on disk. The disk head must seek to that location, make a small write (perhaps a few bytes inside a 4KB page), and move on. The next write is to a different random location. Repeat millions of times per second.
Random I/O is the killer. A spinning hard disk can do roughly 100-200 random seeks per second. Even an SSD, while dramatically faster, has meaningful random write amplification: writes smaller than the flash erase block size get promoted to full-block rewrites internally. A workload of tiny random writes is the worst-case pattern for both storage technologies.
What if you never did random writes? What if every write was sequential, always appending to the end of a file? Sequential writes are orders of magnitude faster than random writes. This is the insight behind LSM trees.
The Core Idea: Buffer, Sort, Merge
An LSM tree converts random writes into sequential writes through a simple mechanism: buffer writes in memory, sort them, flush to disk as immutable sequential files, and periodically merge those files together.
The structure has three components:
The memtable is an in-memory sorted data structure - typically a red-black tree or a skip list. Every write goes here first. Because it is in memory, writes are fast. Because it is sorted, reads can find keys quickly. The memtable grows until it reaches a size threshold (typically 64MB to 256MB).
SSTables (Sorted String Tables) are immutable files on disk. When the memtable fills up, it is flushed to disk as an SSTable - a file where keys are written in sorted order, sequentially. This flush is a single sequential write: no seeking, no updating. The file is written from start to finish and never modified again.
Compaction is the background process that merges SSTables. As flushes accumulate, you end up with many SSTable files. Compaction reads multiple SSTables, merges them (like merge sort, leveraging the fact that each is already sorted), discards superseded versions of keys, and writes a new, larger SSTable. Old SSTables are deleted after the merge. Again: all sequential I/O.
The Write Path
When a write arrives:
- It is appended to the Write-Ahead Log (WAL) on disk. This is a sequential append that ensures durability: if the process crashes before the memtable is flushed, the WAL can replay the writes.
- It is inserted into the memtable in memory.
- The write is acknowledged to the client. Done.
That is it. No disk seeking, no index updating, no page splitting. The actual work of organizing data on disk happens later, in the background, as flushing and compaction run asynchronously. This is why LSM trees deliver consistently high write throughput even under heavy load.
The Read Path
Reads are more involved, because data is spread across multiple layers.
When a read arrives for key $k$:
- Check the current memtable. If $k$ is there, return it.
- Check any immutable memtables that have been flushed but not yet written to disk as SSTables.
- Check SSTables from newest to oldest. The newest SSTable has the most recent version of any key. Keep checking older SSTables until found.
In the worst case, you read every SSTable on disk before concluding a key does not exist. This is the fundamental read tradeoff of LSM trees: reads are more expensive than writes. Two mechanisms mitigate this.
Bloom filters: each SSTable has an associated bloom filter - a compact probabilistic data structure that answers “is key $k$ definitely not in this SSTable?” in $O(1)$ with no disk access. If the bloom filter says the key is absent, you skip the SSTable entirely. Bloom filters eliminate most unnecessary SSTable reads, making not-found lookups cheap.
Index blocks: within each SSTable, a sparse index maps keys to byte offsets. Instead of scanning the entire file, you binary search the index to jump to the right region.
Compaction Strategies
Compaction determines how SSTables are organized and merged. The strategy matters enormously for the balance between read performance, write performance, and space usage.
Leveled compaction (used by RocksDB, LevelDB, and optionally Cassandra) organizes SSTables into levels. Level 0 is where fresh flushes land. Each level $L$ has a size limit roughly 10x that of level $L-1$. When level $L$ exceeds its limit, a compaction merges some of its SSTables with overlapping SSTables in level $L+1$. The key property: except at level 0, SSTables within a level have non-overlapping key ranges. A read therefore touches at most one SSTable per level, bounding read amplification logarithmically. The cost is higher write amplification - a key may be rewritten many times as it migrates through levels.
Size-tiered compaction (Cassandra’s default) groups SSTables by size and merges them when you have enough of similar size. Fewer, simpler compactions mean lower write amplification, but SSTables at the same tier can have overlapping key ranges, so reads may scan multiple files. Better for write-heavy workloads, worse for read-heavy ones.
FIFO compaction simply drops the oldest SSTables when total size exceeds a limit. Useful only for time series data where old data is irrelevant.
The Three Amplification Factors
The classic way to compare storage engines is through three amplification factors:
Write amplification is the ratio of bytes written to disk to bytes written by the application. Leveled compaction has high write amplification because keys migrate through levels. A key might be rewritten 10-30x across its lifetime. Size-tiered has lower write amplification. B-trees have moderate write amplification (a write touches one page plus the WAL).
Read amplification is how many I/Os a single point read may require. Leveled LSM: reads may touch one SSTable per level - $O(\log(\text{total size}))$ files. Size-tiered: potentially many SSTables per read. B-trees: one I/O per level of the tree, usually 3-4 for large datasets. B-trees win on point reads.
Space amplification is how much disk space is used relative to actual data size. During compaction, old and new versions of SSTables coexist. Leveled compaction has low space amplification (roughly 1.1x). Size-tiered can temporarily double the space during a merge.
There is no free lunch: optimizing one factor worsens another. The choice of storage engine and compaction strategy is the choice of which amplification factor matters least for your workload.
B-tree versus LSM Tree
| Property | B-tree | LSM tree |
|---|---|---|
| Write pattern | Random (update in place) | Sequential (append only) |
| Write throughput | Moderate | Very high |
| Read throughput | High (single location per key) | Moderate (multiple levels) |
| Space efficiency | Good | Moderate (during compaction) |
| Compaction overhead | None | Background CPU and I/O |
| Worst-case read | Predictable | Variable (without bloom filters) |
B-trees are the right choice when reads outnumber writes significantly, latency predictability matters, or the workload involves many updates to existing keys (the key is already in one place). LSM trees are the right choice when writes dominate, throughput matters more than individual read latency, or data is largely append-only (logs, time series, events).
Why This Matters in Practice
RocksDB, which is built on LSM trees, is embedded inside MySQL (as MyRocks at Facebook), TiKV (the storage layer of TiDB), CockroachDB’s storage, and dozens of other systems. The reason: at sufficient write scale, the alternative (random I/O into a B-tree) saturates disk bandwidth long before the application saturates CPU or network.
Cassandra’s LSM-based design is why it can ingest millions of events per second from IoT sensors or user activity streams. InfluxDB’s TSM (Time Series Merge-tree, an LSM variant) is why it handles billions of data points per day. The pattern - accept writes in memory, flush sequentially, merge in background - is one of the most durable ideas in storage system design.
Read next: