Consistency & CAP - The Tradeoff Distributed Systems Cannot Escape
Helpful context:
- Load Balancing & Proxies - Distributing Work So No Single Machine Drowns
- Databases & Indexes - The Structures That Make Queries Fast
You update your DynamoDB user profile - change your display name from “alice” to “Alice.” You immediately read it back. The response shows “alice.” The old value. You did not misread the API. The write succeeded. This is not a bug.
Is it acceptable? The answer depends entirely on your workload. And the framework for thinking about why this happens, and when it matters, is what the CAP theorem - and its more useful successor, PACELC - are actually about.
The Origin: A Keynote That Became a Theorem
In 1999, Eric Brewer was chief scientist at Inktomi, one of the largest web infrastructure companies of the era - their technology powered Yahoo’s search engine and served a significant fraction of internet traffic. Running systems at that scale, Brewer kept noticing the same pattern: the harder his team tried to make a system consistent, the harder it became to keep it available. These seemed to trade off against each other. But nobody had formalized why.
In 2000, Brewer presented this observation as a conjecture at PODC - the Principles of Distributed Computing symposium, one of the top academic venues for distributed systems research. He called it the CAP conjecture: you cannot simultaneously guarantee all three of Consistency, Availability, and Partition tolerance.
At the time it was not a theorem. It was a heuristic from engineering practice. Brewer had no formal proof.
Two years later, Seth Gilbert and Nancy Lynch at MIT produced the formal proof in a paper titled “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services” (2002). The conjecture became a theorem. The distributed systems community had its first widely accepted impossibility result for real networked systems.
Around the same time, the industry was having a broader argument about ACID vs BASE. ACID - Atomicity, Consistency, Isolation, Durability - was the gold standard for relational databases: every transaction is a safe, fully isolated, durable unit. BASE - Basically Available, Soft state, Eventual consistency - was the name Dan Pritchett at eBay gave to what large-scale web systems were actually doing in practice (he coined it in a 2008 ACM Queue article). They were not achieving ACID guarantees. They were trading strict consistency for availability and scale. BASE was not an aspiration - it was an honest description. The CAP theorem gave this tradeoff a theoretical foundation.
The Setup: Why One Machine Is Easy and Many Is Hard
On a single machine, consistency is trivial. There is one copy of the data. A write updates it. A subsequent read sees the write. Done. Every operation is implicitly ordered because everything runs through one processor and one disk.
Distribute data across multiple machines and everything changes. You replicate data for two reasons: fault tolerance (if one machine dies, others can still serve) and proximity (a replica closer to the user responds faster). But now multiple machines each hold a copy of the same data, and those copies must be kept in sync.
Synchronization happens over the network. And networks fail. Not occasionally - regularly. Cables get cut (undersea fiber cables connecting continents have been cut by ship anchors multiple times in recorded history). Cloud providers have internal network events that affect connectivity between machines in the same data center. AWS documented a major internal networking event in 2019 that caused widespread service degradation. Network partitions are not edge cases. They are a normal operating condition.
So here is the core tension. Imagine two database replicas - one in New York, one in London. Your account balance is stored in both. The undersea fiber cable connecting them gets cut.
The London replica now has to make a choice:
- Refuse to serve requests until it can reconnect with New York and confirm it has the latest data. London is safe - it will never return stale data - but it is also dark. Nobody in London can read their balance.
- Serve requests using the data it has. London stays open. But if you withdrew money from a New York ATM five minutes ago, London does not know yet. A second withdrawal from London could overdraw the account.
There is no third option that keeps London open and guarantees fresh data while the cable is cut. That impossibility is the CAP theorem.
CAP: The Trilemma
Consistency in CAP means linearizability - the property that every read reflects the most recent write, as if the entire distributed system were a single machine with no replication. This is not the same as the “C” in ACID, which is about application-level invariants. CAP’s consistency is specifically about the ordering and visibility of operations across nodes.
Availability means every request to a non-failed node receives a response - not an error, not a timeout, not “try again later.” The response does not need to be fresh. It just needs to arrive.
Partition tolerance means the system continues operating when the network drops or delays messages between nodes. Nodes on both sides of a partition keep running.
The theorem says you can have at most two of the three. But the critical insight, almost always glossed over in introductions to CAP, is that partition tolerance is not optional in any real distributed system. You cannot choose to not have partitions. Partitions happen whether you want them to or not. Choosing not to handle them just means your system breaks unpredictably when they occur.
This collapses the trilemma to a binary: during a partition, do you favor Consistency or Availability?
CP Systems: Correct or Nothing
CP systems refuse to serve requests when they cannot confirm they have the latest data. During a partition, the nodes that cannot reach the authoritative source return errors rather than potentially stale data. You get the right answer or no answer.
ZooKeeper and etcd
ZooKeeper is the canonical CP system - built by Yahoo in 2007 for coordination tasks across distributed clusters. Etcd (used by Kubernetes) follows the same design. Both use a consensus protocol (ZooKeeper uses ZAB; etcd uses Raft - we will get to Raft) where writes must be acknowledged by a majority of nodes before being committed, and reads go through the leader who has the authoritative state.
Why is this useful? Consider a cron job that needs to run exactly once across a fleet of 50 servers. Each server wants to check: “Am I the one who should run this?” You need a distributed lock. If the lock service were AP and during a partition it told three servers simultaneously “you have the lock,” all three would run the job. Tripled billing emails, tripled order processing, whatever it is - bad. ZooKeeper/etcd being CP means during a partition, some servers will get an error trying to acquire the lock, and only one will succeed. Correct behavior, some unavailability.
Kubernetes itself uses etcd for leader election: at any moment, exactly one control-plane node is the leader orchestrating the cluster. If etcd were AP and two nodes both believed they were the leader (called “split-brain”), both would be issuing conflicting commands to the cluster - scheduling the same pod twice, making contradictory scaling decisions. CP is not a nice-to-have here; it is a correctness requirement.
Airline seat reservation
When the last seat on a flight is available and two people click “book” simultaneously, exactly one of them should succeed. The other should see “sold out.” This requires CP behavior - the system must coordinate across replicas to ensure only one booking is accepted. An AP system during a partition might accept both bookings. That is an overbooking incident, a stranded passenger, and a compensation payout.
Ticketing (Radiohead reunion tour, one ticket remaining)
Same pattern. The last ticket must go to exactly one buyer. If the ticketing system is AP and the network partitions between two data centers, both data centers might sell that last ticket to different people. One person’s purchase later gets silently cancelled when the partition heals and the system reconciles. That customer is not happy.
HBase for financial ledgers
HBase is a distributed database that deliberately sacrifices availability for consistency. It is commonly used for financial ledgers - systems where the number matters precisely and stale reads can cause real monetary harm. During a partition, an HBase region server that cannot reach ZooKeeper (its coordination layer) will stop serving reads and writes until it can. This is the right call when the data involved is an account balance.
AP Systems: Available, Possibly Stale
AP systems keep serving requests during a partition, even if some nodes might return data that is slightly behind. You get some answer, possibly outdated. The bet is that the cost of an inconsistent answer is lower than the cost of no answer.
Shopping carts (Amazon’s Dynamo paper)
The 2007 Amazon Dynamo paper - one of the most influential distributed systems papers ever written - explicitly motivated its design around the shopping cart. During a network partition, should Amazon let customers add items to their cart? Or should it refuse the operation because it cannot confirm all replicas agree on the current cart state?
Amazon chose AP. The reasoning: the cost of unavailability (customer cannot add items, leaves the site, buys elsewhere) is far higher than the cost of inconsistency (the cart might briefly show the same item twice if two writes conflicted, which gets resolved at checkout). Dynamo made the AP choice and built a system around it. DynamoDB is its successor.
Social media like counts and view counts
When you see “4.2 million views” on a YouTube video, that is not the exact real-time count. Different servers have slightly different tallies depending on which writes have propagated to which replica. The number you see might be 30 seconds old, or 2 minutes old. This is AP behavior. Nobody cares whether the true count is 4,201,847 or 4,201,903. The alternative - requiring all replicas to agree on the exact count before returning it - would mean read latency measured in seconds and a system that breaks during any network event. The tradeoff is obvious.
DNS (Domain Name System)
DNS is the internet’s phone book - it maps domain names to IP addresses. DNS is the most AP system most people interact with daily, though few think of it that way.
When you change your website’s IP address, the old IP keeps being served to users around the world for hours or days, until their local DNS resolver’s cached answer expires (controlled by a TTL - Time To Live - value). Different resolvers around the world have different cached values. The system is globally eventually consistent. If DNS were CP and refused to answer queries while propagating updates, the internet would be unusable. The cost of stale DNS (old IP served briefly) is far lower than the cost of DNS being unavailable.
Your recently played history on Spotify
Play a song on your phone. Open Spotify on your laptop. For a few seconds, or sometimes longer, the song is not in your laptop’s “recently played” list. The write (play event on phone) has not propagated to the replica your laptop is reading from. This is AP behavior - Spotify’s system stays available on both devices and accepts that the history briefly disagrees. Blocking your phone’s playback until all Spotify servers have synchronized your play history would be absurd.
Cassandra for IoT sensor data
An IoT temperature sensor writes a reading every second. If the database cluster has a network issue and behaves as CP - refusing writes until consensus is reached - you lose those data points permanently. A gap in your time-series data. For sensor data where you’d rather have a slightly stale last reading than a gap, AP is the right call. Cassandra, which is AP by default, is commonly used here precisely because it keeps accepting writes under degraded network conditions.
Search engine indexes
When you publish a new article, it does not appear in Google search results immediately. Google’s crawlers discover it, index it, and distribute the index update across Google’s data centers. Different data centers might have different versions of the index for hours. This is AP behavior at planetary scale. If Google’s search index were CP, it would need to lock all replicas before serving a search result to ensure every data center has the latest index - making search completely unusable.
The Honest Critique of CAP
The CAP theorem is one of the most cited and most misapplied results in distributed systems. A few important caveats:
CAP applies to individual operations, not systems. A system can be CP for some operations and AP for others. DynamoDB is AP by default (eventually consistent reads) but CP when you pay for strongly consistent reads. Calling DynamoDB an “AP system” is a simplification that conceals its actual tunable behavior.
“Consistency” and “Availability” are not binary. CAP treats them as on/off, but real systems live on a spectrum. Cassandra with CONSISTENCY ONE (only one replica needs to acknowledge a write) is highly available and weakly consistent. With CONSISTENCY ALL (every replica must acknowledge), it is strongly consistent but effectively unavailable if any replica is down. Most real deployments use CONSISTENCY QUORUM - a majority of replicas must acknowledge, a middle ground between the two extremes.
Martin Kleppmann’s 2015 critique argued that CAP conflates different properties in misleading ways and that the theorem’s practical guidance is too coarse to drive real engineering decisions. He is right. CAP is a useful conversation starter, not an engineering specification.
PACELC: The Better Model
PACELC, proposed by Daniel Abadi in 2012, extends CAP to capture the tradeoff that matters most during normal operation - when there is no partition at all.
Abadi’s observation: CAP only talks about what happens when the network fails. But most of the time, the network is fine. And even during normal operation, there is a tradeoff: if you want strong consistency, every write must wait for acknowledgment from multiple replicas before returning to the client. Each acknowledgment is a network round-trip. That takes time. If you relax to eventual consistency, the write returns as soon as the local node acknowledges it - much faster, but subsequent reads from other replicas might lag.
The acronym: Partition → choose C or A; Else (no partition) → choose Latency or Consistency.
A few real systems through the PACELC lens:
- Cassandra with CONSISTENCY ONE - PA/EL: available under partition, low latency under normal operation
- DynamoDB strongly consistent reads - PC/EC: consistent under partition, consistent-but-slower under normal operation
- Google Spanner - PC/EC: linearizable globally using TrueTime (GPS receivers and atomic clocks in Google data centers assign globally ordered timestamps, solving the “what time is it really?” problem across continents), but with cross-region write latency of 100 - 300ms
- MySQL primary-replica - PC/EL: consistent (single primary means no split-brain), but writes return quickly (the primary does not wait for replicas to confirm before acknowledging the client)
PACELC is more useful than CAP for day-to-day decisions because it forces you to ask: “What is my tolerance for stale reads, and what latency am I willing to pay to avoid them?”
Consistency Models: A Spectrum
Consistency is not binary. Between “perfectly synchronized” and “anything goes” there is a spectrum, and real systems land at different points on it.
Linearizability (strong consistency) is the gold standard. Operations appear to take effect atomically at some point between their start and end, and there is a total global order. If write W completes before read R begins, R must see W’s value. This is expensive: it requires synchronous coordination across replicas for every operation. Google Spanner and etcd implement this.
Sequential consistency is weaker: operations appear in some sequential order consistent with each process’s program order, but the order need not reflect wall-clock time. Two clients might disagree about which of two concurrent writes “happened first” - but each client’s own operations appear in the right order.
Causal consistency preserves causality: if A causally precedes B (A’s result influenced B’s existence), every replica applies A before B. Concurrent operations with no causal relationship may appear in different orders on different replicas. MongoDB’s causal sessions implement this - if you read the result of your own write, your subsequent reads will see at least that version.
Eventual consistency guarantees only that, absent new writes, all replicas will converge to the same value. In the interim, stale reads are possible. The uncomfortable question “eventual consistency” leaves open is: how eventual is eventual? Milliseconds? Seconds? Minutes? The answer depends on replication topology, network conditions, and write rate. Under low write load in a healthy cluster, DynamoDB’s eventual consistency window is typically under a second. Under high write load during a partial network partition, it could be much longer. The SLA does not tell you this; you need to measure it for your workload.
Read-your-writes is a session-level guarantee: a client that writes a value will always see that value in subsequent reads, even if those reads hit a replica that has not fully propagated the write. Stronger than pure eventual consistency for interactive applications - essential for anything like a settings page where a user saves a change and immediately sees the updated page.
Monotonic reads guarantee that if a client reads value V at time T, it will never read an older value at time T+1. Without this, a user refreshing a page might see an update appear and then disappear - confusing enough to generate a support ticket.
DynamoDB’s Tunable Consistency
DynamoDB is a useful case study because it exposes the tradeoff explicitly and charges you for the difference.
By default, reads from DynamoDB are eventually consistent - they may return a slightly stale value if the write has not propagated to all replicas yet. This is the scenario from the opening: write a profile update, read it back, see the old value.
Strongly consistent reads (ConsistentRead: true) return the most recent write, always. The cost: higher read capacity consumption (twice as many read units as eventually consistent reads), higher latency (must coordinate with the primary replica rather than any available replica), and reduced availability during some partition scenarios.
For most application reads - displaying a social feed, loading a product catalog - eventual consistency is fine. The user will not notice a 500ms-old value. For reads where staleness causes correctness problems - inventory count before a purchase, account balance before a transfer - strongly consistent reads are necessary.
This is the practical application of PACELC: choose per-operation based on what the operation actually needs.
Why Financial Systems Need CP
Consider two ATMs simultaneously reading the same bank account balance: 400. Both subtract 200. Both write back 200. The account now shows 200, but 400 was withdrawn. This is the classic double-spend problem. It requires strong consistency to prevent.
The ATM must use a compare-and-swap operation - an atomic read-then-write that checks whether the value changed between the read and the write, and refuses to proceed if it did. “Write 200 only if the current balance is still 400.” If another ATM already changed the balance to 200, the compare-and-swap fails, the operation is retried, and the retry correctly refuses the withdrawal.
This kind of coordination requires the linearizability that CP systems provide. An AP system during a partition might allow both ATMs to proceed, because it cannot coordinate across the partition to detect the conflict.
Google Spanner achieves global linearizability using TrueTime: GPS receivers and atomic clocks deployed in Google data centers give each data center a precise physical time with bounded uncertainty. Spanner uses this to assign globally ordered timestamps to transactions and implement external consistency - the strongest consistency guarantee - across geographically distributed replicas. The cost is non-trivial: cross-region commits take 100 - 300ms. For financial transaction processing, this is acceptable. For a social media feed, it is not.
The Future: CRDTs and Hybrid Logical Clocks
Two directions make eventual consistency more tractable for real applications.
CRDTs (Conflict-free Replicated Data Types) are data structures mathematically designed to merge concurrent updates without coordination. Take a distributed counter - if two replicas each increment the counter independently during a partition, when the partition heals, how do you merge? With a naive integer, you cannot - you do not know if the two replicas both started at 5 and each incremented to 6 (final value should be 7) or if one just read what the other wrote (final value should be 6). A CRDT counter tracks each replica’s contribution separately and merges by summing them. No conflict, no coordination needed.
An observed-remove set handles add and remove operations across replicas without coordination. The key insight CRDTs exploit: not all operations need to be serialized. Some operations commute - the order does not matter. Increment by 1 and increment by 1 commute; the result is the same regardless of which happens “first.” CRDTs identify exactly those operations and build data structures around them. Riak, Redis Cluster, and Figma’s multiplayer canvas use CRDTs. They do not make eventually consistent systems strongly consistent - they make eventual convergence guaranteed and deterministic, eliminating the need for manual conflict resolution.
Hybrid Logical Clocks (HLCs) combine physical clocks with logical clocks (Lamport timestamps - a counter that each node increments with every event, ensuring causally related events are ordered) to provide causally consistent timestamps that are close to real time but preserve happens-before relationships even when physical clocks drift. CockroachDB and YugabyteDB use HLCs to implement serializable transactions in geo-distributed databases without the GPS hardware requirements of Google Spanner.
Summary
| Concept | Key Insight |
|---|---|
| CAP origin | Brewer’s 1999 engineering observation, formalized by Gilbert & Lynch in 2002 |
| P is not optional | Network partitions happen in every real distributed system; the real choice is C vs A |
| CP examples | ZooKeeper locks, Kubernetes leader election, airline seat booking, financial ledgers |
| AP examples | Shopping carts, social media counts, DNS, recently played history, sensor data |
| CAP’s limits | Applies per-operation; consistency and availability are spectrums, not binary |
| PACELC | Adds the latency vs consistency tradeoff that exists even without a partition |
| Eventual consistency | “How eventual?” depends on workload and cluster health - always measure it |
| Linearizability | The strongest guarantee: reads always see the most recent write |
| DynamoDB tunable | Eventually consistent = cheaper + faster; strongly consistent = correct + costlier |
| CRDTs | Make eventual convergence deterministic by using only commutative operations |
Read Next: