Sharding & Replication - Splitting Data So No Single Machine Drowns // Megha Bose

Helpful context:

Databases & Indexes - The Structures That Make Queries Fast

Your database hits 10TB and queries start taking seconds. You’ve already tuned every index, added read replicas, and upgraded the hardware twice. The largest available cloud instance is maxed out. The business is growing and there’s no bigger box to buy. Now what?

This is the moment every successful data-intensive system eventually faces: the point where a single machine is no longer enough, and the architecture must distribute data across multiple machines. Replication and sharding are the two fundamental techniques for doing this - and they solve different problems, come with different tradeoffs, and are far more commonly combined than used alone.

The Case for Doing Neither (Yet)

Before discussing how to distribute data, it’s worth stating the most important advice in this space: most teams shard prematurely.

Sharding is operationally complex. It breaks SQL joins across shards. It eliminates cross-shard transactions (or makes them very expensive). It forces every query to be shard-key-aware. Engineers who’ve never operated a sharded system routinely underestimate what they’re taking on.

Read replicas - adding one or more database servers that receive copies of all writes and serve reads - solve 80% of database scaling problems without any of sharding’s complexity. If your bottleneck is read throughput (the common case: reads vastly outnumber writes), read replicas scale read capacity horizontally while leaving the write path simple.

Vertical scaling (“buy a bigger server”) is also consistently underrated. A modern cloud instance can have 192 vCPUs, 12TB of RAM, and NVMe storage delivering millions of IOPS. For many workloads that “need sharding,” the real need is just a bigger primary and a few read replicas.

Shard when the write throughput or storage genuinely exceeds what one machine can handle. This is a higher bar than most people think.

Replication: Copies for Resilience and Scale

Replication copies data from one server (the primary, also called master) to one or more servers (replicas, also called secondaries or standbys). Every write goes to the primary. Reads can go to any replica.

Synchronous vs Asynchronous

Synchronous replication: the primary waits for at least one replica to write the change to its durable log before acknowledging the write to the client. If the primary dies, that replica has every committed write - zero data loss. The cost: every write waits for a network round trip to the replica. On a cross-region link (say, us-east-1 to eu-west-1, ~85ms RTT), this adds 85ms to every write operation.

Asynchronous replication: the primary acknowledges writes immediately and ships the changes to replicas in the background. Write latency is fast (no waiting for replicas). The risk: if the primary dies before a write is replicated, that write is lost. This is called replication lag: the gap between what the primary has committed and what replicas have applied.

Most production systems use asynchronous replication or semi-synchronous (wait for exactly one replica, not all). PostgreSQL streaming replication is asynchronous by default but can be configured to require synchronous acknowledgment from specific standbys. Aurora uses a distributed storage layer that replicates writes to 6 copies in 3 AZs before acknowledging - you get durability without paying the cross-region latency cost of traditional synchronous replication.

The Read-Your-Own-Write Problem

Replication lag creates a subtle but infuriating user experience problem. A user updates their profile photo. The write goes to the primary. The response page fetches the user’s profile from a read replica - but the replica hasn’t received the update yet. The user sees their old photo, becomes confused, tries again. The same problem affects “post a comment, immediately see it in the list.”

Solutions: route reads to the primary for a short window after a write (e.g., for 1 second). Or use a monotonic read guarantee: once a user sees a given version of the data, subsequent reads go to a replica that is at least as up-to-date. Or use session-level sticky routing: route all reads from a given user session to the same replica. Each approach adds complexity. The simplest is to route reads to the primary for “sensitive” flows (profile updates, payment confirmations) and to replicas for everything else.

Failover and the Split-Brain Danger

When the primary fails, a replica must be promoted. Automated failover tools (Patroni for PostgreSQL, AWS RDS Multi-AZ) detect primary failure and elect a new primary within seconds to minutes. During this window, writes fail.

The more dangerous failure is split-brain: two nodes both believe they are the primary and accept writes simultaneously. This happens when the primary becomes unreachable due to a network partition but is still running. Both the primary and the promoted replica accept writes. When the partition heals, the two divergent write histories must be reconciled - some data is lost.

The standard countermeasure is fencing. Before a replica promotes, it ensures the old primary cannot accept writes. In cloud environments, this means revoking the old primary’s network access or terminating its instance. In bare-metal environments, STONITH (Shoot The Other Node In The Head) literally powers off the old primary via a remote management interface. Fencing sounds violent; it prevents data loss.

Sharding: Split the Data Itself

Sharding horizontally partitions the dataset across multiple servers. Each server owns a subset of the data - a shard. A write for user 5001 goes to the shard that owns users 5000 - 6000; a write for user 1001 goes to a different shard. Total storage and write throughput scale with the number of shards.

Range Sharding

Partition by value ranges. Shard 1 holds user IDs 1 - 1,000,000; Shard 2 holds 1,000,001 - 2,000,000; and so on.

Advantage: range queries on the shard key (WHERE user_id BETWEEN 50000 AND 75000) hit exactly one shard.

Danger: hot spots. If user IDs are assigned sequentially, all new writes go to the last shard. Earlier shards sit nearly idle while the newest shard is saturated. This is the single biggest operational problem with range sharding - the distribution is only even in theory, not in the workload.

Range sharding works well when the access pattern naturally distributes across the range (time-series data where older data is accessed less, geographic data where different regions have different load patterns). It fails when the range has a natural ordering that concentrates writes at one end.

Hash Sharding

Apply a hash function to the shard key and use modulo:

shard_id = hash(user_id) % num_shards

Advantage: distributes data uniformly regardless of key ordering. No hot spots.

Disadvantage: range queries require hitting all shards (scatter-gather). WHERE user_id BETWEEN 50000 AND 75000 could have users scattered across every shard - the database must query all of them and merge results. Aggregate queries (COUNT, SUM, AVG) require collecting partial results from every shard and combining.

Hash sharding is the default choice when write uniformity is more important than range query efficiency. For most key-value access patterns (look up a user by ID, look up an order by ID), hash sharding is appropriate.

Directory Sharding

A lookup table maps each key (or key range) to a specific shard. The routing table is consulted on every request.

Advantage: maximum flexibility. Any mapping logic is possible. Individual tenants can be moved between shards without changing the sharding function. You can move a “hot” tenant to a dedicated shard without affecting others.

Disadvantage: the lookup table is a bottleneck and a single point of failure unless it’s replicated and cached. Adding a caching layer (memcached, Redis) in front of the directory is standard, but this adds operational complexity and cache invalidation concerns.

Directory sharding is common in multi-tenant SaaS applications where each customer (tenant) is a sharding unit and the business occasionally needs to move large customers to dedicated infrastructure.

Consistent Hashing: The Resharding Solution

The fundamental problem with hash(key) % N: when you add a shard (change N from 4 to 5), almost every key maps to a different shard. Virtually all data must migrate simultaneously. This is catastrophic for a live system.

Consistent hashing solves this. Arrange all possible hash output values on a circle (the “ring”). Place each shard at one or more points on the ring. A key maps to the first shard clockwise from the key’s hash position.

When you add a new shard, it takes over only the keys between it and its predecessor on the ring - roughly 1/n of the data, distributed from adjacent shards. When a shard is removed, its keys go to the next shard clockwise. Only adjacent shards are affected.

Virtual nodes improve load balance. Instead of placing each shard at one point, place it at many points (virtual nodes, often 100 - 300 per physical shard). Virtual nodes make the distribution more uniform: each physical shard has many small arcs of the ring rather than one large arc. They also improve fault tolerance: when a shard fails, its virtual nodes' keys are distributed across many other shards rather than all piling onto one successor.

Ring with virtual nodes (simplified):
[shard-A-v0, shard-C-v2, shard-B-v1, shard-A-v1, shard-B-v2, shard-C-v0, shard-A-v2, ...]

Key hashes to position 0.45 → next clockwise: shard-B-v1 → routes to Shard B

This is how DynamoDB and Cassandra route requests. DynamoDB’s partition key is hashed, and the result determines which storage partition owns the data. Cassandra uses a similar ring with configurable virtual node count.

DynamoDB Partition Keys and Hot Partitions

DynamoDB is essentially a managed hash-sharded database. The partition key determines which physical partition receives the write. DynamoDB allocates read and write capacity uniformly across partitions.

The critical design rule: choose a partition key with high cardinality and uniform access frequency. If you use user_id as the partition key and user 12345 is a celebrity with 10 million followers who all load their feed simultaneously, all those reads hit the same partition - a hot partition. DynamoDB throttles the partition when it exceeds its allocated capacity.

Common patterns to avoid hot partitions:

Add randomness to the key: for a heavily accessed item, shard it across multiple partition keys (item-1, item-2, …, item-N) and read from all of them, aggregating client-side.
Use composite partition keys: combine attributes to increase cardinality.
Write sharding: for items with high write rates, distribute writes across sub-keys.

AWS provides DynamoDB Adaptive Capacity (and later On-Demand mode) which automatically adjusts capacity allocation in response to uneven traffic patterns - but this only helps within the throughput ceiling of the table, not if you’ve genuinely exhausted capacity.

Cassandra: Partition Key vs Clustering Key

Cassandra’s data model makes the sharding semantics explicit in the schema. The partition key determines which node stores the data (hashed to the ring). The clustering key determines the sort order within a partition.

CREATE TABLE user_posts (
    user_id    UUID,          -- partition key: determines which node
    created_at TIMESTAMP,     -- clustering key: sort order within partition
    post_id    UUID,
    content    TEXT,
    PRIMARY KEY (user_id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

All posts for a user live in one partition, sorted by time descending. “Get the 20 most recent posts for user X” is a single-node, single-partition query - efficient. “Get all posts from the last hour across all users” is a scatter-gather across every node - expensive.

This explicit exposure of the data model’s distribution logic forces engineers to think about query patterns before designing the schema - the opposite of relational databases where schema design and query optimization are partially decoupled.

Cross-Shard Queries and Their Cost

Sharding trades query flexibility for write scalability. A query that once ran against one database may now span multiple shards, requiring scatter-gather:

Fan out the query to all N shards.
Each shard executes the query locally.
The coordinator (your application or a proxy) collects all N results.
Merge and return.

Scatter-gather latency is the maximum latency across all shards, not the average - the slow shard determines the total response time. This tail latency problem compounds as the number of shards grows.

Cross-shard joins are generally not supported natively. If user data is on one shard and order data is on another, joining them requires fetching one result set and doing the join in application memory. At scale, this means denormalizing data so related entities are co-located on the same shard, or giving up on online joins entirely and running cross-entity analytics in a data warehouse (BigQuery, Redshift, Snowflake) where scatter-gather over many nodes is the normal operational model.

MySQL Sharding at YouTube: Vitess

YouTube sharded MySQL long before the managed alternatives existed. By 2010, they were running hundreds of MySQL shards manually. The operational overhead was brutal: schema changes required coordinating across hundreds of databases, shard management was manual, and query routing was application-side.

Vitess was born from this pain. It’s a middleware layer that sits in front of MySQL and provides:

Query routing: automatically routes queries to the correct shard based on a sharding key
Connection pooling: thousands of application connections multiplexed to a smaller number of MySQL connections
Schema management: apply schema changes across all shards in a controlled, rolling fashion
Resharding: split or merge shards with minimal downtime

Vitess is now used by Slack, GitHub, and many others, and is a CNCF graduated project. It’s the pragmatic answer to “we need to shard MySQL without rewriting our application.”

PlanetScale is a managed Vitess offering - MySQL sharding as a service, with a GitHub-like branching model for database schema changes.

The Critique: You Probably Don’t Need to Shard Yet

Sharding adds complexity that’s easy to overlook before you’ve operated it. Common second-order costs:

Schema changes are hard. In a single database, ALTER TABLE runs once. Across 256 shards, it must be applied 256 times, ideally online and without locking. Tools like Vitess and gh-ost (online schema change for MySQL) help, but it’s still more complex than a single database.

Cross-shard transactions are either absent or expensive. You give up relational guarantees across shards. If you need atomicity across shard boundaries, you’re back to 2PC or Sagas.

Read replicas often solve the problem. If reads outnumber writes by 10x (common), one primary with three read replicas gives you 4x the read capacity without any data distribution complexity. Most “database performance problems” are read problems.

Connection pooling matters more than you think. A database server can handle ~200 concurrent connections efficiently. A hundred application servers each opening 20 connections exhausts this. PgBouncer (PostgreSQL) and ProxySQL (MySQL) pool connections at the middleware layer, drastically reducing the connection load on the database - often eliminating the performance problem that was mistaken for a need to shard.

Future Outlook

Serverless databases are attempting to make horizontal scaling automatic and invisible. Aurora Serverless v2 scales storage and compute independently, handling traffic spikes without pre-provisioned capacity. Neon (serverless PostgreSQL) separates storage from compute, allowing compute to scale to zero and storage to grow arbitrarily. PlanetScale’s database branching model treats sharding as a deployment concern rather than a schema design concern.

NewSQL databases (CockroachDB, Google Spanner, TiDB) implement distributed transactions with serializable isolation across shards, attempting to provide the scalability of sharding with the ACID guarantees of a single database. Spanner uses TrueTime for global consistency; CockroachDB uses Raft consensus per partition. These are no longer research projects - they run in production at scale.

The direction of the field is toward making sharding a database concern rather than an application concern. Whether managed services (Aurora, PlanetScale) or NewSQL databases fulfill this promise at every scale point will determine whether the next generation of engineers ever needs to hand-implement consistent hashing.

Concept	Key Point
Vertical scaling	Buy bigger hardware; underrated; works longer than expected
Read replicas	Horizontal read scaling; solves 80% of scaling problems; simple to operate
Replication lag	Async replica may lag behind primary; read-your-own-write bugs
Split-brain	Two primaries accepting writes; data loss; fencing is the defense
Range sharding	Sort by key range; efficient range queries; hot spot risk
Hash sharding	Uniform distribution; range queries require scatter-gather
Directory sharding	Lookup table routing; flexible; lookup table is a bottleneck
Consistent hashing	Adding/removing shards migrates only ~1/n of data
Virtual nodes	Many ring positions per shard; improves balance and fault tolerance
DynamoDB partition key	Hash-sharded; hot partitions are the main design mistake
Cassandra partition key	Explicit distribution model in schema; query pattern drives design
Scatter-gather	Fan-out to all shards; tail latency is max of all shards
Vitess	MySQL sharding middleware; used at YouTube, Slack, GitHub
Premature sharding	Most teams shard too early; read replicas often sufficient
Aurora Serverless	Automatic scaling; separates storage from compute
CockroachDB / Spanner	NewSQL: ACID across shards; removes sharding from application concern

Read Next:

Consistency & CAP - The Tradeoff Distributed Systems Cannot Escape