Prerequisite: Databases & Indexes


The CAP theorem, conjectured by Eric Brewer and later formalized and proven by Gilbert and Lynch, states that a distributed system can provide at most two of three properties: Consistency, Availability, and Partition tolerance. It is one of the most cited - and most misapplied - results in distributed systems. Understanding it precisely matters because the trade-off it describes shapes the behavior of every distributed database you will operate.

The Three Properties

Consistency in CAP means linearizability: every read returns the result of the most recent write, as if there is a single copy of the data. This is not the same “C” as in ACID, which refers to application-level invariants.

Availability means every request to a non-failed node receives a response - not an error, not a timeout. The response does not need to reflect the latest write.

Partition tolerance means the system continues operating when the network drops or delays messages between nodes.

The uncomfortable truth is that network partitions happen in any real distributed system - cables fail, switches drop packets, cloud provider networks have incidents. This makes partition tolerance non-optional. The real trade-off is between CP (favor consistency, sacrifice availability during a partition) and AP (favor availability, sacrifice consistency during a partition).

CP systems (ZooKeeper, etcd, HBase) refuse to serve reads or writes on nodes that cannot confirm they hold the latest data. During a partition, the minority partition becomes unavailable. You get correct answers or no answer.

AP systems (Cassandra, DynamoDB, CouchDB) continue serving requests from all nodes during a partition. Nodes in the minority partition may return stale data. You get some answer, possibly outdated.

PACELC

CAP only addresses behavior during partitions. PACELC extends the framework: even when the network is healthy (Else, no partition), systems must choose between Latency and Consistency. A strongly consistent write must wait for acknowledgment from a quorum of replicas - that latency is unavoidable. An eventually consistent write can return as soon as the local node acknowledges.

Cassandra with CONSISTENCY ONE is PA/EL: available under partition, low latency in normal operation. DynamoDB configured for strongly consistent reads leans toward PC/EC. This framing is more useful for day-to-day engineering decisions than CAP alone.
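The latency/consistency trade-off often reduces to quorum arithmetic: with N replicas, a read quorum R and write quorum W that satisfy R + W > N must overlap, so reads see the latest acknowledged write - at the cost of waiting on more replicas. A minimal sketch (replica counts here are illustrative):

```python
# Sketch: the quorum rule R + W > N guarantees read/write overlap,
# trading latency for consistency.
N = 3  # total replicas

def overlaps(r: int, w: int, n: int = N) -> bool:
    """A read quorum of r and a write quorum of w always intersect
    when r + w > n, so every read touches a node with the latest write."""
    return r + w > n

# CONSISTENCY ONE style: fastest, but no overlap guarantee (stale reads possible)
assert not overlaps(r=1, w=1)
# Quorum reads and writes: each request waits on 2 of 3 replicas
assert overlaps(r=2, w=2)
# Write-all / read-one: cheap reads, but every write pays full replication latency
assert overlaps(r=1, w=3)
```

Tuning R and W per operation is exactly how systems like Cassandra let you slide along the EL/EC axis without changing the database.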

Consistency Models

Consistency is not binary. There is a spectrum from weak to strong, each with different performance and usability trade-offs.

Linearizability (strong consistency) is the gold standard. Operations appear to take effect atomically at some point between their start and end. There is a total global order. Any operation that completes before another begins will be reflected in the later operation’s result. This is expensive: it requires coordination across replicas for every read and write.

Sequential consistency is weaker: all operations appear to execute in some sequential order, and each process’s operations appear in program order - but that order need not reflect real time. Two processes may disagree on which of two concurrent writes happened “first.”

Causal consistency preserves causality: if operation A causally precedes B (A’s result influenced B), every replica reflects A before B. Concurrent operations with no causal relationship may be seen in different orders by different replicas. MongoDB uses causal sessions.

Eventual consistency guarantees only that if no new writes occur, all replicas will converge to the same value. In the interim, reads may return stale data. This is the weakest useful guarantee and the most scalable.

Read-your-writes is a session-level guarantee: a client that writes a value will always see that value in subsequent reads, even if those reads go to a replica that hasn’t yet propagated the write. Stronger than pure eventual consistency for interactive applications.

Monotonic reads guarantee that if a client reads a value at time T, it will never read an older value at time T+1. Without this, a client refreshing a page might see updates appear and then disappear.
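Both session guarantees can be enforced client-side by tracking versions, assuming each write is tagged with a monotonically increasing version number. The class and method names below are illustrative, not a real client API:

```python
# Sketch of session guarantees via client-side version tracking
# (hypothetical API; assumes writes carry increasing version numbers).

class Session:
    def __init__(self):
        self.last_written = 0   # highest version this client wrote
        self.last_seen = 0      # highest version this client has read

    def write(self, replica_version: int) -> None:
        self.last_written = max(self.last_written, replica_version)

    def can_serve_read(self, replica_version: int) -> bool:
        """Read-your-writes: the replica must have applied our writes.
        Monotonic reads: the replica must not be older than anything we saw."""
        return replica_version >= max(self.last_written, self.last_seen)

    def read(self, replica_version: int) -> None:
        assert self.can_serve_read(replica_version)
        self.last_seen = replica_version

s = Session()
s.write(5)                       # our write landed at version 5
assert not s.can_serve_read(3)   # stale replica: would violate read-your-writes
assert s.can_serve_read(6)       # caught-up replica is fine
```

A client that finds a replica too stale can retry against another replica or wait, which is how these guarantees stay cheap: no cross-replica coordination, only per-session bookkeeping.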

BASE vs ACID

Traditional relational databases offer ACID: Atomicity, Consistency, Isolation, Durability - strong transactional guarantees at the cost of horizontal scalability.

Highly available distributed systems often operate on BASE: Basically Available (the system remains operational even if some data is stale), Soft state (state may change without input as replicas converge), Eventually consistent (replicas converge given time). BASE accepts weaker guarantees in exchange for availability and partition tolerance.

Vector Clocks and Causal Ordering

To detect whether two events have a causal relationship, distributed systems use vector clocks: each node maintains a counter for every node in the system. When a node sends a message, it increments its own counter and attaches the entire vector. The receiver updates its vector to the component-wise maximum of its own and the received vector, then increments its own counter. Two events are causally related if one vector dominates the other; otherwise they are concurrent.
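The update rules above fit in a few lines. A minimal sketch, representing a vector clock as a dict from node name to counter:

```python
# Minimal vector clock following the rules above (illustrative).

def increment(clock: dict, node: str) -> dict:
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(local: dict, received: dict, node: str) -> dict:
    """On receive: component-wise max, then increment own counter."""
    merged = {k: max(local.get(k, 0), received.get(k, 0))
              for k in local.keys() | received.keys()}
    return increment(merged, node)

def dominates(a: dict, b: dict) -> bool:
    keys = a.keys() | b.keys()
    return all(a.get(k, 0) >= b.get(k, 0) for k in keys) and a != b

def concurrent(a: dict, b: dict) -> bool:
    return not dominates(a, b) and not dominates(b, a) and a != b

a = increment({}, "A")   # event on node A: {"A": 1}
b = increment({}, "B")   # event on node B: {"B": 1}
assert concurrent(a, b)  # neither dominates: no causal relationship
c = merge(b, a, "B")     # B receives A's message: {"A": 1, "B": 2}
assert dominates(c, a)   # c causally follows a
```

Concurrent vectors are exactly the conflicts an AP system must surface or resolve, which leads directly to the next section.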

Amazon's Dynamo and Riak have used vector clocks (Riak later adopted a variant called dotted version vectors) to detect conflicting writes during network partitions.

Conflict Resolution

When AP systems allow concurrent writes to the same key on different replicas, conflicts arise. How to resolve them?

Last-write-wins (LWW) uses wall-clock timestamps to pick the “latest” write. Simple, but clock skew between nodes means the “winner” might not actually be the last write. Data can be silently overwritten.
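The clock-skew failure mode is easy to demonstrate. In this sketch the timestamps are hypothetical wall-clock readings from two nodes, one of which runs fast:

```python
# Demonstrates LWW silently dropping the newer write under clock skew.

def lww_merge(a, b):
    """Pick the write with the larger timestamp (ties broken arbitrarily)."""
    return a if a[0] >= b[0] else b

# Node X's clock runs 5 seconds fast. The user's *second* update goes
# to node Y, but X's earlier write carries a later timestamp.
write_on_x = (1_000_005, "first update")   # real time 1_000_000, skewed clock
write_on_y = (1_000_002, "second update")  # real time 1_000_002, correct clock

winner = lww_merge(write_on_x, write_on_y)
assert winner[1] == "first update"  # the genuinely newer write is lost
```

This is why LWW is acceptable only when occasional silent overwrites are tolerable, such as for caches or metrics.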

CRDTs (Conflict-free Replicated Data Types) are data structures mathematically designed to merge concurrent updates deterministically without coordination. Examples include a grow-only counter (G-Counter), a set that only allows additions (G-Set), and more complex structures like OR-Sets (observed-remove sets) that handle adds and removes without conflict. CRDTs guarantee convergence without a separate conflict resolution step. Riak, Redis Enterprise (Active-Active), and several other systems use them.
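The simplest CRDT, the G-Counter, shows the core trick: each replica increments only its own slot, and merge is a component-wise max, so merges commute and every replica converges. A minimal sketch:

```python
# Minimal G-Counter CRDT: per-replica slots, merge by component-wise max.

class GCounter:
    def __init__(self, node: str):
        self.node = node
        self.counts: dict = {}

    def increment(self, n: int = 1) -> None:
        # A replica only ever advances its own slot.
        self.counts[self.node] = self.counts.get(self.node, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Component-wise max: commutative, associative, idempotent.
        for k, v in other.counts.items():
            self.counts[k] = max(self.counts.get(k, 0), v)

a, b = GCounter("A"), GCounter("B")
a.increment(3)   # concurrent updates on two partitioned replicas
b.increment(2)
a.merge(b)       # exchange state in either order...
b.merge(a)
assert a.value() == b.value() == 5   # ...and both replicas converge
```

Because max is idempotent, delivering the same state twice is harmless, which makes gossip-style replication safe.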

Examples

Shopping cart with Dynamo-style AP: Amazon’s original Dynamo (the internal system, not DynamoDB) used an AP design for shopping carts. The insight: if a customer adds an item to their cart during a network partition, it’s better to accept the write and reconcile later than to return an error. A cart that shows one extra item is better UX than a cart that refuses to add items at all. Conflict resolution used a merge of all concurrent versions.
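The merge step can be sketched as a set union over all concurrent versions, so no added item is ever lost. This is a simplification: the real Dynamo tracked version metadata so that deletions could also survive a merge.

```python
# Sketch of Dynamo-style cart reconciliation: union of concurrent versions.

def merge_carts(versions: list) -> set:
    merged = set()
    for cart in versions:
        merged |= cart   # keep every item seen in any version
    return merged

# Two replicas diverged during a partition:
replica_1 = {"book", "headphones"}
replica_2 = {"book", "charger"}   # item added on the other side

assert merge_carts([replica_1, replica_2]) == {"book", "headphones", "charger"}
```

The visible failure mode is the one the text describes: a deleted item can reappear after a merge, which Amazon judged an acceptable cost for never refusing an add.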

Bank account with CP requirement: Financial systems cannot allow two ATMs to simultaneously read the same balance, each subtract $200, and both succeed - that would allow overdrafts. This requires an atomic read-modify-write on the balance, not just a linearizable read. Systems like Google Spanner achieve this globally using TrueTime (GPS and atomic clocks), Paxos-replicated groups, and two-phase commit across those groups. The cost: cross-region write latency on every transaction.
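One common way to get the required atomic read-modify-write is a compare-and-set (conditional write), which many databases expose. The class below is an illustrative single-node model, not any real system's API:

```python
# Sketch: compare-and-set makes read-modify-write atomic, preventing
# the double-withdrawal scenario (hypothetical, single-node model).

class Account:
    def __init__(self, balance: int):
        self.balance = balance
        self.version = 0   # bumped on every successful write

    def compare_and_set(self, expected_version: int, new_balance: int) -> bool:
        """Succeeds only if no other write happened since our read."""
        if self.version != expected_version:
            return False
        self.balance = new_balance
        self.version += 1
        return True

acct = Account(300)
v = acct.version
# Two ATMs both read balance 300 at the same version, both try to withdraw 200:
assert acct.compare_and_set(v, 100)       # first withdrawal succeeds
assert not acct.compare_and_set(v, 100)   # second is rejected: retry, no overdraft
```

The rejected client re-reads the balance and retries, now seeing only $100 available. In a distributed CP system, the version check itself must be linearizable, which is exactly what consensus protocols provide.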


Read Next: Fault Tolerance