Prerequisite: CAP Theorem

A single database server has physical limits: the disk fills up, the CPU saturates, write throughput hits a ceiling. Replication and sharding are the two fundamental techniques for pushing past those limits. Replication copies data across servers for fault tolerance and read scaling. Sharding splits data across servers for write scaling and storage. In practice, most large-scale systems use both.

Vertical Scaling and Its Limits

The simplest response to a struggling database is to give it more resources - a machine with more RAM, faster SSDs, more CPU cores. This is vertical scaling, and it works surprisingly well. A modern cloud instance can have terabytes of RAM and many CPU cores, and for many applications that is enough for years.

But vertical scaling has a hard ceiling. At some point the largest available instance is not enough, or the cost of that instance exceeds what you can justify. That is when you turn to horizontal scaling.

Replication

Replication copies data from one server (the primary) to one or more others (replicas). Reads can be served by any replica, which distributes read load. The primary handles all writes.

Synchronous replication: The primary waits for at least one replica to acknowledge each write before confirming it to the client. If the primary dies, the replica has all committed data. The trade-off: every write waits for a network round trip to the replica, adding latency.

Asynchronous replication: The primary confirms writes immediately and replicates in the background. Lower write latency, but the replica may lag behind - and if the primary dies before a write is replicated, that write is lost.

Most systems use async replication or semi-sync (wait for one replica, not all). PostgreSQL streaming replication is async by default. You can configure it to require at least one synchronous standby for critical data.
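As a sketch of what the semi-synchronous setup looks like in PostgreSQL, assuming a standby registered under the illustrative name standby1:

```ini
# postgresql.conf on the primary (standby name is illustrative)
synchronous_standby_names = 'FIRST 1 (standby1)'  # wait for one named standby
synchronous_commit = on                           # commits block until it acknowledges
```

With this configuration, a commit is not confirmed to the client until standby1 has acknowledged the WAL; if no synchronous standby is reachable, writes stall rather than risk data loss.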

Replication lag is the delay between a write on the primary and its appearance on a replica. Reads from replicas during high write load may see stale data. Applications that read their own writes (a user saves a profile and immediately reloads it) need to route that read to the primary, or wait for replication to catch up.
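One common read-your-own-writes tactic is a sticky routing window: after a user writes, send that user's reads to the primary for a short period that comfortably exceeds typical replication lag. A minimal sketch (class name and window length are illustrative):

```python
import time

class ReplicaRouter:
    """Route reads to a replica unless the user wrote recently."""

    def __init__(self, stickiness_seconds=5.0):
        self.stickiness = stickiness_seconds
        self.last_write = {}  # user_id -> monotonic timestamp of last write

    def record_write(self, user_id):
        self.last_write[user_id] = time.monotonic()

    def target_for_read(self, user_id):
        # Within the stickiness window the replica may still lag,
        # so route to the primary to guarantee read-your-writes.
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.stickiness:
            return "primary"
        return "replica"

router = ReplicaRouter()
router.record_write("user-42")
print(router.target_for_read("user-42"))  # primary
print(router.target_for_read("user-99"))  # replica
```

The window is a heuristic; systems that need a guarantee instead track the replication position (e.g. a log sequence number) and wait for the replica to reach it.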

Failover and Split-Brain

When the primary fails, a replica must be promoted. Automated failover tools (Patroni for PostgreSQL, MHA for MySQL) handle this, but promotion takes seconds to minutes. During that window, writes fail.

The more dangerous failure is split-brain: two nodes both believe they are the primary. This happens when a primary becomes unreachable from the rest of the cluster (network partition) but is still running. Both sides accept writes. When connectivity is restored, the two write histories must be reconciled - some data is lost or overridden. Fencing (STONITH - shoot the other node in the head) is the standard countermeasure: before a replica promotes, it first ensures the old primary cannot accept writes, either by revoking its network access or sending a hardware fencing command.

Sharding

Sharding horizontally partitions a dataset across multiple nodes. Each node - a shard - owns a subset of the data. A user lookup for user ID 5001 goes to the shard that owns IDs 5000–6000; user ID 1001 goes to a different shard. Total storage and write throughput scale with the number of shards.

Range Sharding

Data is divided by value ranges: shard 1 holds user IDs 1–100000, shard 2 holds 100001–200000, and so on. Range queries (WHERE user_id BETWEEN 50000 AND 60000) hit a single shard efficiently.

The risk is hot spots. If user IDs are assigned sequentially and all new users are high-numbered, all writes land on the last shard while earlier shards sit idle. Choosing a shard key that distributes writes evenly is critical - sequential IDs are often a poor choice for range sharding.
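Routing under range sharding is a sorted-boundary lookup. A small sketch using the ranges above (boundaries are the illustrative upper bounds of each shard):

```python
import bisect

# Upper bound of each shard's ID range:
# shard 0 owns 1-100000, shard 1 owns 100001-200000, shard 2 owns 200001-300000.
boundaries = [100_000, 200_000, 300_000]

def shard_for(user_id):
    # bisect_left returns the index of the first boundary >= user_id,
    # which is exactly the owning shard.
    return bisect.bisect_left(boundaries, user_id)

print(shard_for(50_000))   # 0
print(shard_for(100_001))  # 1
```

A range query's bounds map to a contiguous slice of this list, which is why BETWEEN queries stay on one shard (or a few adjacent ones).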

Hash Sharding

Apply a hash function to the shard key and use the result modulo the number of shards:

shard = hash(user_id) % num_shards

This distributes data uniformly regardless of access patterns. The cost: range queries across the shard key now require hitting all shards (scatter-gather), since hashing destroys ordering.

Directory Sharding

A lookup table maps each key (or key range) to a shard. This is maximally flexible - any routing logic is possible, and individual tenants can be moved between shards without changing the shard key function. The downside: the lookup table is a bottleneck and a single point of failure unless it is itself replicated and cached.
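At its core the directory is just a key-to-shard map consulted on every request. A minimal sketch with illustrative tenant and shard names:

```python
# Explicit tenant -> shard mapping, consulted before every query.
directory = {
    "tenant-acme": "shard-1",
    "tenant-globex": "shard-2",
}
DEFAULT_SHARD = "shard-1"  # where unmapped tenants land

def route(tenant_id):
    return directory.get(tenant_id, DEFAULT_SHARD)

# Moving a tenant is a single directory update: no rehashing,
# no change to any shard-key function.
directory["tenant-acme"] = "shard-3"
print(route("tenant-acme"))  # shard-3
```

In production the directory lives in a replicated store and is cached by every client, with the data migration completed before the mapping is flipped.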

Consistent Hashing

Adding a shard to a hash-sharded system with hash(key) % N is painful. Changing N from 4 to 5 remaps most keys to different shards - almost all data must be migrated. Consistent hashing solves this.
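The scale of the problem is easy to measure. This sketch counts how many keys change shards when the modulus grows from 4 to 5 (key set and hash choice are illustrative):

```python
import hashlib

def shard(key, n):
    # A stable hash (Python's built-in hash() is salted per process).
    h = int(hashlib.sha256(str(key).encode()).hexdigest(), 16)
    return h % n

keys = range(100_000)
moved = sum(1 for k in keys if shard(k, 4) != shard(k, 5))
print(f"{moved / len(keys):.0%} of keys remapped")  # roughly 80%
```

For a uniform hash, a key keeps its shard only when its hash agrees modulo both 4 and 5, which happens for 4 of every 20 residues - so about 80% of all data would have to migrate for a single added shard.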

The idea: arrange all possible hash values on a ring. Place each shard at one or more points on the ring. A key maps to the shard whose ring position is the first one clockwise from the key’s hash. When a new shard is added, it takes over only the keys between itself and its predecessor - roughly $1/n$ of the total data, rather than remapping everything.

Virtual nodes improve balance. Instead of placing each shard at one point on the ring, place it at many points (virtual nodes). This distributes load more evenly, and when a shard is lost its keys are redistributed across many neighbors rather than dumped on a single one.

Ring: [shard-A-v1, shard-C-v2, shard-B-v1, shard-A-v2, shard-B-v2, shard-C-v1]
Key hashes to position 0.45 → nearest clockwise = shard-B-v1 → routes to Shard B

Adding a new shard inserts its virtual nodes at random ring positions, absorbing a proportional share from each existing shard.

Cross-Shard Queries

Sharding improves write throughput and storage at the cost of query complexity. A query that once ran in one database now potentially spans multiple shards:

  • Single-shard query: Filter by the shard key. Efficient.
  • Scatter-gather: Fan out the query to all shards, collect results, merge. Latency is the max of all shard latencies. Aggregates (COUNT, SUM) must be computed per shard and then combined by the coordinator.
  • Cross-shard joins: Generally not supported natively. The application must fetch rows from multiple shards and join in memory.
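The scatter-gather pattern for an aggregate can be sketched as follows (the per-shard results stand in for real network calls):

```python
# Scatter-gather COUNT: run the query on every shard, combine partial results.
shards = [
    {"orders": 120},   # stand-ins for per-shard query results
    {"orders": 87},
    {"orders": 203},
]

def count_orders(shard):
    # In a real system this is a network round trip:
    # SELECT COUNT(*) FROM orders WHERE ...
    return shard["orders"]

# The coordinator merges partial counts; for AVG it would need
# per-shard (sum, count) pairs rather than per-shard averages.
total = sum(count_orders(s) for s in shards)
print(total)  # 410
```

In practice the fan-out happens concurrently, so total latency is bounded by the slowest shard rather than the sum of all of them.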

The practical rule: design your shard key so that the most frequent queries hit one shard. For a multi-tenant SaaS, sharding by tenant ID means most queries filter by tenant and stay on one shard. Cross-tenant analytics is done offline in a data warehouse rather than online in the sharded database.

Examples

Consistent hashing with virtual nodes:

Suppose you have three shards and decide on 150 virtual nodes per shard (450 total points on the ring). Hash each virtual node label (shard-A-0, shard-A-1, …) to get its ring position. Store these in a sorted array. For a given key, hash it, then binary-search the sorted array for the next position clockwise - $O(\log N)$ lookup.

When a fourth shard is added with 150 virtual nodes, it inserts 150 random points on the ring. On average it absorbs 25% of keys from each of the three existing shards, spreading migration load evenly.
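The sorted-array-plus-binary-search design above can be sketched directly; the class name and shard labels are illustrative:

```python
import bisect
import hashlib

def _hash(s):
    # Stable hash for ring positions (Python's hash() is salted per process).
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self.positions = []   # sorted ring positions
        self.owner = {}       # ring position -> shard name

    def add_shard(self, name):
        for i in range(self.vnodes):
            pos = _hash(f"{name}-{i}")       # one point per virtual node
            self.owner[pos] = name
            bisect.insort(self.positions, pos)

    def shard_for(self, key):
        pos = _hash(str(key))
        # First ring position clockwise from the key's hash, wrapping around.
        i = bisect.bisect_right(self.positions, pos) % len(self.positions)
        return self.owner[self.positions[i]]

ring = ConsistentHashRing()
for name in ("shard-A", "shard-B", "shard-C"):
    ring.add_shard(name)

before = {k: ring.shard_for(k) for k in range(10_000)}
ring.add_shard("shard-D")
moved = sum(1 for k in before if ring.shard_for(k) != before[k])
print(f"{moved / len(before):.0%} of keys moved")  # roughly 25%
```

Note that every key that moves lands on the new shard - existing shards never trade keys with each other, which is the property that keeps migrations proportional.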

Choosing a shard key for a social network:

A social network’s most common queries: “get posts by user X” and “get feed for user Y” (posts from users Y follows). Sharding by user_id means all of user X’s posts are on one shard - single-shard writes and reads per user. The feed query requires fetching posts from multiple followed users, but those can be pre-computed (fanout-on-write) into each follower’s timeline shard, also keyed by user.

Sharding by post ID instead would spread one user’s posts across all shards - simple writes but scatter-gather on every profile page load.
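The fanout-on-write idea above can be sketched in a few lines (the follower graph and in-memory timelines are illustrative stand-ins for sharded storage):

```python
from collections import defaultdict

# Fanout-on-write: when a user posts, append the post to every
# follower's precomputed timeline.
followers = {"alice": ["bob", "carol"]}   # who follows alice
timelines = defaultdict(list)             # follower -> precomputed feed

def publish(author, post):
    for follower in followers.get(author, []):
        # In a sharded system each append is a single-shard write,
        # routed by the follower's user_id.
        timelines[follower].append((author, post))

publish("alice", "hello")
print(timelines["bob"])  # [('alice', 'hello')]
```

The trade-off is write amplification: a user with millions of followers turns one post into millions of timeline appends, which is why some systems fall back to fanout-on-read for very high-follower accounts.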


Read Next: Distributed Storage