Write-Ahead Log - The Invisible Guarantee Behind Every Database Write // Megha Bose

Helpful context:

Every time you insert a row, update a record, or commit a transaction, a database makes you a promise: if it tells you the write succeeded, that write will survive a crash. Not “probably survive.” Not “survive unless the power goes out.” Survive. This guarantee is called durability - the D in ACID - and delivering it without making the database impossibly slow is one of the oldest engineering problems in systems design. The solution, used by virtually every serious database in existence, is the write-ahead log.

The Problem: Memory Is Fast, Disk Is Durable, You Need Both

The naive design is obvious: when a write arrives, update the data on disk and return success. This is durable by definition - the data is on disk. But disk writes are slow: a random write to an HDD takes ~10ms (the disk head must physically seek). A database that wrote every row change directly to its B-tree page on disk would top out at a few hundred writes per second. Unusable.

The opposite extreme is equally obvious: accept writes into memory and return success immediately. Memory writes are nanoseconds. But if the machine crashes before flushing to disk, everything in memory is gone. You told the user their write succeeded. It didn’t. Unacceptable.

The write-ahead log is the resolution. It gives you disk durability with near-memory write speed by exploiting one key property of disk I/O: sequential writes are orders of magnitude faster than random writes.

Appending to the end of a file is sequential. Updating a B-tree page somewhere in the middle of a large file is random. Sequential writes on an HDD can sustain hundreds of megabytes per second. Random writes on the same disk: a few megabytes per second.

The Mechanism

The write-ahead log is an append-only file on disk. Before the database applies any change to its primary data structures (the B-tree, the memtable, anything), it first appends a record describing that change to the WAL. Only after the WAL record is durably on disk does the database proceed.

The term “write-ahead” means exactly this: write to the log ahead of making the actual change.

The record in the WAL contains everything needed to reproduce the change: what table, what record, what the old value was, what the new value is, and which transaction it belongs to. In PostgreSQL this is called a WAL record; in MySQL InnoDB it is part of the redo log; in SQLite it is the journal.

A transaction commit works like this:

Accumulate changes in memory (the buffer pool / memtable).
Append WAL records for all changes to the WAL file on disk. fsync - wait for the OS to confirm the data has reached stable storage, not just the OS buffer cache.
Declare the transaction committed. Return success to the client.
At some later point (asynchronously), write the changed pages to their actual locations in the main data file.

Step 4 is eventually done - but it is not required before telling the client the write succeeded. Why? Because step 2 already made the change durable. If the database crashes after step 2 but before step 4, the WAL record exists on disk. On restart, the database reads the WAL and re-applies any committed changes that never made it to the main data file. This is crash recovery.

Crash Recovery: Replaying the Log

When a database restarts after a crash, it cannot know which in-memory changes made it to the main data file and which did not. The answer: it does not matter. The WAL is the authoritative record.

The recovery process reads the WAL from the last known good point and replays every committed transaction that appears in it. This is called redo recovery: redo the work that was logged but may not have been applied.

For uncommitted transactions (transactions that had WAL records but never had a commit record), the database performs undo recovery: roll back the partial changes. The WAL contains enough information to reverse them.

This combination - redo committed transactions, undo uncommitted ones - restores the database to a consistent state regardless of when the crash happened. The algorithm that formalizes this is ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), developed at IBM in the 1990s and still the foundation of most relational database recovery systems.

Checkpointing: Keeping the Log from Growing Forever

If the WAL is never truncated, crash recovery would require replaying the entire history of the database from the beginning. That would take hours. The solution is checkpointing.

A checkpoint is a moment when the database guarantees that all changes up to a certain WAL position have been written to the main data file. Once a checkpoint is established, the WAL before that checkpoint is no longer needed for recovery and can be truncated (or the space reused).

Checkpoints involve flushing all dirty pages from memory to disk - which is a batch of sequential writes if done intelligently. After a checkpoint, recovery only needs to replay the WAL from the checkpoint forward. A database that checkpoints frequently recovers quickly but pays more I/O cost during normal operation. A database that checkpoints rarely has lower I/O overhead but longer recovery time after a crash.

WAL and Group Commit

The dominant cost of WAL writes is not the bytes written but the fsync: the call that forces the OS to flush its write buffers to stable storage. An fsync on modern hardware takes roughly 1-10ms. If you fsync per transaction, you cap throughput at 100-1000 transactions per second. For a high-traffic web service, this is nowhere near enough.

Group commit is the solution. Instead of fsyncing after every transaction, the database accumulates multiple transactions' WAL records and issues a single fsync for the batch. All transactions in the batch receive their commit confirmation after that one fsync. This amortizes the fsync latency across many transactions, allowing thousands of transactions per second with the same disk hardware.

PostgreSQL has had group commit since version 8.3. MySQL InnoDB’s innodb_flush_log_at_trx_commit setting controls this tradeoff: at 1, fsync per commit (full durability, lower throughput); at 2, write to OS buffer per commit but fsync only every second (risk up to one second of data loss); at 0, fsync only every second (faster, but data loss on OS crash too).

The WAL Is Not Just for Relational Databases

The concept is not specific to SQL databases. Any system that needs durability with high write throughput uses the same pattern.

RocksDB/LSM trees: the memtable is in memory. Before inserting into the memtable, RocksDB appends to a WAL. If the process crashes, the WAL is replayed on startup to reconstruct the memtable.

etcd and distributed consensus: etcd stores Raft log entries on disk using a WAL before applying them to the state machine. The WAL is what makes Raft’s guarantees work: a leader can only claim a log entry is committed after it is durably written on a majority of nodes.

Kafka: Kafka is essentially a distributed WAL exposed as a service. Producers append to a partition log (sequential writes); consumers read from arbitrary positions. The log is the product, not just an implementation detail.

Filesystems: ext4 and other journaling filesystems use a journal (a WAL) to ensure filesystem metadata operations are atomic. Without it, a crash mid-operation could leave directory entries pointing to non-existent blocks.

WAL in Streaming Replication

In PostgreSQL, the WAL is also the mechanism for replication. Standby servers receive WAL records from the primary and replay them in real time, maintaining an identical copy of the data. The standby is always some small number of WAL records behind the primary. When the primary fails, the standby can be promoted - it is in a consistent state and the WAL records it has replayed are durable.

This means the WAL serves triple duty in a replicated database: crash recovery, streaming replication, and point-in-time recovery (restoring the database to any historical state by replaying the WAL from a base backup up to a specific timestamp).

What Happens Without It

Omitting the WAL - or relaxing durability as a configuration option - is a choice some systems make for performance. Redis, by default, does not fsync the AOF (Append Only File, its WAL equivalent) on every write. It fsyncs every second. The consequence: up to one second of data loss on crash. For a cache, this is acceptable. For a bank, it is not.

SQLite’s WAL mode is actually an inversion of the default journal mode: instead of writing the old page to a rollback journal before modifying the main file, WAL mode appends new versions of pages to the WAL and reads redirect to the WAL when needed. Readers can proceed without locking even during concurrent writes - a significant concurrency improvement.

The write-ahead log is invisible during normal operation and makes itself known only when something goes wrong. That invisibility is the point. A database that correctly implements WAL-based recovery is one where crashes are a nuisance (a few seconds of recovery) rather than a disaster.

Read next: