Sharding distributes data across multiple nodes so that no single node holds everything or handles all traffic. In theory, sharding scales linearly: double the shards, double the capacity. In practice, one shard often receives far more traffic than the others. When this happens, that shard becomes a bottleneck - its CPU saturates, its disk IOPS are exhausted, and its tail latency spikes - while neighboring shards sit mostly idle. Adding more shards does not fix the problem: the new capacity is spread evenly across the cluster, but the hot shard's share of traffic does not shrink.

Root Causes

Non-uniform key distribution is the most common cause. Hash-based sharding spreads keys evenly across the keyspace, but it spreads load evenly only if traffic is spread evenly across keys. If 1% of keys account for 90% of reads - and power-law distributions are the norm in user-generated content - the shards holding those keys run hot.
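
To see the effect, here is a minimal simulation in Go (the shard count and the Zipf skew parameter are illustrative choices, not measurements): keys hash uniformly across ten shards, yet power-law traffic still concentrates on whichever shards hold the few hottest keys.

package main

import (
    "fmt"
    "hash/fnv"
    "math/rand"
)

// shardFor hashes a key onto one of nShards shards. The keys themselves
// spread evenly - regardless of how often each key is accessed.
func shardFor(key string, nShards int) int {
    h := fnv.New32a()
    h.Write([]byte(key))
    return int(h.Sum32() % uint32(nShards))
}

func main() {
    const nShards = 10
    counts := make([]int, nShards)
    // Zipf-distributed popularity: a handful of keys dominate traffic.
    zipf := rand.NewZipf(rand.New(rand.NewSource(1)), 1.2, 1, 1_000_000)
    for i := 0; i < 1_000_000; i++ {
        key := fmt.Sprintf("user:%d", zipf.Uint64())
        counts[shardFor(key, nShards)]++
    }
    // Expect one or two shards far above the rest: the ones that
    // happen to hold the most popular keys.
    fmt.Println(counts)
}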

Celebrity or viral content is an acute version of the above. A post from an account with 50 million followers receives a burst of reads orders of magnitude larger than a typical post. The shard holding that post’s data - or that user’s follower list - absorbs the spike alone.

Time-based keys are monotonically increasing, which concentrates writes. If you use a timestamp or an auto-increment value as the shard key, every write goes to the “latest” partition. In Bigtable and HBase this is one of the most common operational mistakes: a table keyed on timestamp always writes to the last tablet, making it hot while earlier tablets sit cold.
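
A common mitigation, sketched below under illustrative assumptions (the bucket count and key layout are arbitrary), is to prefix the time-ordered key with a stable hash bucket so concurrent writers land on different tablets. The trade-off: a time-range scan must now fan out across every bucket and merge.

package main

import (
    "fmt"
    "hash/fnv"
    "time"
)

// saltedRowKey spreads time-ordered writes across `buckets` key ranges
// by prefixing a stable hash of the source ID.
func saltedRowKey(sourceID string, ts time.Time, buckets uint32) string {
    h := fnv.New32a()
    h.Write([]byte(sourceID))
    return fmt.Sprintf("%02d#%s#%d", h.Sum32()%buckets, sourceID, ts.UnixNano())
}

func main() {
    // Two sensors writing at the same instant land in different key ranges.
    now := time.Now()
    fmt.Println(saltedRowKey("sensor-42", now, 16))
    fmt.Println(saltedRowKey("sensor-99", now, 16))
}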

Detecting Hot Partitions

You cannot fix what you cannot see. Monitoring must expose per-partition metrics: request rate, CPU, read/write throughput, and latency - not just aggregate cluster metrics. A healthy cluster aggregate can hide a saturated single partition.

Signs of a hot partition:

  • One partition's throughput is 10x the median.
  • P99 latency for requests to one shard is 100ms while the others sit at 5ms.
  • One Kafka partition's consumer lag grows while the others stay current.
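
A minimal detection sketch over per-partition request rates (how the rates are collected, and the 10x threshold, are assumptions to make the example concrete):

package main

import (
    "fmt"
    "sort"
)

// hotPartitions flags partitions whose request rate exceeds
// `factor` times the median rate across all partitions.
func hotPartitions(rates map[string]float64, factor float64) []string {
    all := make([]float64, 0, len(rates))
    for _, r := range rates {
        all = append(all, r)
    }
    sort.Float64s(all)
    median := all[len(all)/2]
    var hot []string
    for id, r := range rates {
        if r > factor*median {
            hot = append(hot, id)
        }
    }
    return hot
}

func main() {
    rates := map[string]float64{"p0": 120, "p1": 95, "p2": 2400, "p3": 110}
    fmt.Println(hotPartitions(rates, 10)) // [p2]
}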

Most managed services give you partition-level visibility: DynamoDB's CloudWatch Contributor Insights surfaces the most-accessed and most-throttled keys, Kafka exposes per-partition offset lag, and Redis can sample the most frequently accessed keys with redis-cli --hotkeys.

Solution: Key Salting

The most direct fix is to artificially spread writes across multiple virtual shards by appending a random suffix to the hot key.

For a hot key user:celebrity_id, instead of writing to one location, write to ten:

user:celebrity_id:0
user:celebrity_id:1
...
user:celebrity_id:9

Pick the shard suffix randomly on write (or hash-deterministically from some secondary attribute). On read, issue a scatter-gather query to all ten shards and merge the results. The hot key’s traffic is now spread 10x across different physical shards.
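
A sketch of both halves against a stand-in key-value interface (KV and the toy in-memory mapKV are placeholders for whatever client you actually use):

package main

import (
    "fmt"
    "math/rand"
)

// KV is a stand-in for any sharded key-value client.
type KV interface {
    Set(key, value string)
    Get(key string) (string, bool)
}

const saltCount = 10

// saltedWrite picks one of the ten variants at random, spreading the
// hot key's write traffic across ten physical locations.
func saltedWrite(kv KV, baseKey, value string) {
    kv.Set(fmt.Sprintf("%s:%d", baseKey, rand.Intn(saltCount)), value)
}

// saltedRead scatter-gathers every variant and merges whatever exists.
func saltedRead(kv KV, baseKey string) []string {
    var merged []string
    for i := 0; i < saltCount; i++ {
        if v, ok := kv.Get(fmt.Sprintf("%s:%d", baseKey, i)); ok {
            merged = append(merged, v)
        }
    }
    return merged
}

// mapKV is a toy in-memory store so the sketch runs standalone.
type mapKV map[string]string

func (m mapKV) Set(k, v string)             { m[k] = v }
func (m mapKV) Get(k string) (string, bool) { v, ok := m[k]; return v, ok }

func main() {
    kv := mapKV{}
    for i := 0; i < 5; i++ {
        saltedWrite(kv, "user:celebrity_id", fmt.Sprintf("event-%d", i))
    }
    fmt.Println(saltedRead(kv, "user:celebrity_id"))
}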

The cost: reads require N requests instead of one. For write-heavy hot keys this is a good trade; for read-heavy hot keys the scatter-gather overhead matters. The shard count is a tunable parameter - use more shards for hotter keys.

Write-Time vs Read-Time Fan-Out

Related to key salting is the choice of when to do the work.

Write-time fan-out: When a celebrity posts, immediately write the post to each follower’s feed shard (or cache slot). Reads are fast - each user’s feed is pre-computed. But writes are amplified by the follower count: a post from a 10M-follower account requires 10M writes. This is only feasible below a follower threshold.

Read-time fan-out (pull model): Compute the feed at read time by querying each followee’s recent posts and merging. Writes are cheap. Reads are expensive - proportional to the number of accounts followed. For users who follow 2,000 accounts, this is 2,000 shard reads per timeline load.

Hybrid approach: Use write-time fan-out for users below a follower threshold (say, 100,000 followers) and read-time fan-out for celebrities. At read time, merge the pre-computed feed (from push) with a real-time pull from celebrity accounts the user follows. This bounds both write amplification and read cost.
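
A sketch of the hybrid, with hypothetical helpers (precomputedFeed, celebritiesFollowedBy, recentPostsOf) standing in for real storage calls; the threshold constant echoes the 100,000 figure above:

package main

import (
    "fmt"
    "sort"
)

const fanOutThreshold = 100_000 // push below this follower count, pull above

type Post struct {
    ID string
    TS int64 // unix timestamp, for merge ordering
}

// The three helpers below are hypothetical stand-ins for real storage calls.
func precomputedFeed(userID string) []Post         { return nil }
func celebritiesFollowedBy(userID string) []string { return nil }
func recentPostsOf(authorID string) []Post         { return nil }

// shouldPush decides at write time: small accounts fan out to followers,
// celebrity accounts are left for read-time pull.
func shouldPush(followerCount int) bool { return followerCount < fanOutThreshold }

// loadTimeline merges the pre-computed (pushed) feed with a real-time
// pull from the few celebrity accounts this user follows. Write
// amplification is bounded by the threshold; read cost is bounded by
// the number of celebrities followed.
func loadTimeline(userID string, limit int) []Post {
    feed := precomputedFeed(userID)
    for _, celeb := range celebritiesFollowedBy(userID) {
        feed = append(feed, recentPostsOf(celeb)...)
    }
    sort.Slice(feed, func(i, j int) bool { return feed[i].TS > feed[j].TS })
    if len(feed) > limit {
        feed = feed[:limit]
    }
    return feed
}

func main() {
    fmt.Println(shouldPush(5_000), shouldPush(10_000_000))
    fmt.Println(loadTimeline("u123", 50))
}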

Adaptive Hot Key Detection

For dynamic workloads where hot keys are not known in advance, the system must detect them at runtime. Top-K hot key detection uses probabilistic data structures (Count-Min Sketch, SpaceSaving-style heavy-hitter algorithms) to identify the most frequently accessed keys within bounded memory.
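
A minimal Count-Min Sketch in a few dozen lines (the width and depth here are illustrative; real deployments derive them from target error bounds). Increment bumps one counter per row; Estimate takes the row-wise minimum, so collisions can make it over-count but never under-count:

package main

import (
    "fmt"
    "hash/fnv"
)

// cms is a Count-Min Sketch: depth hash rows of width counters each.
type cms struct {
    width, depth uint32
    counts       [][]uint64
}

func newCMS(width, depth uint32) *cms {
    c := &cms{width: width, depth: depth, counts: make([][]uint64, depth)}
    for i := range c.counts {
        c.counts[i] = make([]uint64, width)
    }
    return c
}

// idx derives a per-row counter index by seeding FNV with the row number.
func (c *cms) idx(key string, row uint32) uint32 {
    h := fnv.New32a()
    h.Write([]byte{byte(row)})
    h.Write([]byte(key))
    return h.Sum32() % c.width
}

func (c *cms) Increment(key string) {
    for row := uint32(0); row < c.depth; row++ {
        c.counts[row][c.idx(key, row)]++
    }
}

// Estimate is the minimum across rows: an over-estimate at worst,
// never an under-estimate.
func (c *cms) Estimate(key string) uint64 {
    min := ^uint64(0)
    for row := uint32(0); row < c.depth; row++ {
        if v := c.counts[row][c.idx(key, row)]; v < min {
            min = v
        }
    }
    return min
}

func main() {
    sketch := newCMS(1024, 4)
    for i := 0; i < 5000; i++ {
        sketch.Increment("user:celebrity_id")
    }
    sketch.Increment("user:ordinary")
    fmt.Println(sketch.Estimate("user:celebrity_id"), sketch.Estimate("user:ordinary"))
}

Keys whose estimate crosses a threshold get promoted to one of the hot-key strategies above.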

Once a hot key is identified, the system can apply a specific strategy automatically:

  • Promote the key to a local in-process cache (bypassing the shard entirely for subsequent reads within the same node).
  • Replicate the key to additional shards and use read replicas.
  • Apply key salting specifically to that key.

DynamoDB calls this adaptive capacity: it automatically shifts throughput within a table from cold partitions to hot ones. This buys time but does not solve a sustained hot partition - each partition still has a hard physical ceiling (on the order of 3,000 read capacity units and 1,000 write capacity units per second).

Local Caching of Hot Items

If a key is read thousands of times per second, the per-read cost of a network round trip to the shard may dominate. Caching hot items in process memory on the application servers eliminates the network hop entirely.

The challenge is cache invalidation: how does each application server learn that the hot item's value has changed? The options are TTL-based expiry, a pub/sub channel that broadcasts invalidations to every server, or simply accepting that hot reads are slightly stale by design. For social media follower counts, slight staleness is acceptable. For inventory levels, it may not be.
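
A sketch of a process-local TTL cache (the loader callback stands in for the real shard read; the TTL is whatever your staleness budget allows):

package main

import (
    "fmt"
    "sync"
    "time"
)

type entry struct {
    value   string
    expires time.Time
}

// ttlCache is a process-local cache: reads within the TTL never touch
// the shard, so a key read thousands of times per second costs one
// shard round trip per TTL window per application server.
type ttlCache struct {
    mu    sync.RWMutex
    ttl   time.Duration
    items map[string]entry
    load  func(key string) string // fetches from the shard on miss/expiry
}

func newTTLCache(ttl time.Duration, load func(string) string) *ttlCache {
    return &ttlCache{ttl: ttl, items: map[string]entry{}, load: load}
}

func (c *ttlCache) Get(key string) string {
    c.mu.RLock()
    e, ok := c.items[key]
    c.mu.RUnlock()
    if ok && time.Now().Before(e.expires) {
        return e.value // local memory, no network hop; at most one TTL stale
    }
    v := c.load(key)
    c.mu.Lock()
    c.items[key] = entry{value: v, expires: time.Now().Add(c.ttl)}
    c.mu.Unlock()
    return v
}

func main() {
    cache := newTTLCache(100*time.Millisecond, func(k string) string {
        return "value-of-" + k // stand-in for the real shard read
    })
    fmt.Println(cache.Get("user:celebrity_id"))
    fmt.Println(cache.Get("user:celebrity_id")) // served locally
}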

Request Coalescing

When many concurrent requests arrive for the same key within a short window, a coalescing layer deduplicates them: the first request fetches from the shard; subsequent requests for the same key within the same in-flight window wait for the first request’s result and share it. This is also called request collapsing or thundering herd protection.

Varnish coalesces requests by default, and nginx enables it with the proxy_cache_lock directive - both at the HTTP caching layer. Application-level implementations share a future/promise across the concurrent goroutines or threads requesting the same key.
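
In Go, the golang.org/x/sync/singleflight package provides this directly; a sketch (fetchFromShard is a hypothetical loader, with a sleep to simulate the shard round trip):

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"

    "golang.org/x/sync/singleflight"
)

var (
    group   singleflight.Group
    fetches int64
)

// fetchFromShard stands in for the expensive read against the hot shard.
func fetchFromShard(key string) (string, error) {
    atomic.AddInt64(&fetches, 1)
    time.Sleep(50 * time.Millisecond) // simulated shard round trip
    return "value-of-" + key, nil
}

// get coalesces concurrent callers: all in-flight requests for the same
// key wait on one fetch and share its result.
func get(key string) (string, error) {
    v, err, _ := group.Do(key, func() (interface{}, error) {
        return fetchFromShard(key)
    })
    if err != nil {
        return "", err
    }
    return v.(string), nil
}

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            _, _ = get("user:celebrity_id")
        }()
    }
    wg.Wait()
    // Typically 1: the other 99 callers shared the first request's result.
    fmt.Println("fetches:", atomic.LoadInt64(&fetches))
}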

Pre-Splitting in Bigtable/HBase

Bigtable and HBase split tablets automatically when they grow too large. But the split is reactive - it happens after the tablet is already hot. The better approach for predictable key distributions is pre-splitting: create tablets at known key boundaries before writing. For a timestamp-keyed table, pre-split on day boundaries before data arrives so that bulk loads and backfills are distributed from the start; note that live, strictly time-ordered writes still target the newest tablet unless the key also carries a salt prefix, as described under Root Causes.
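
A sketch using the Go Bigtable admin client (the project, instance, and table names are placeholders), creating one split point per day before any data arrives:

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "cloud.google.com/go/bigtable"
)

func main() {
    ctx := context.Background()
    admin, err := bigtable.NewAdminClient(ctx, "my-project", "my-instance")
    if err != nil {
        log.Fatal(err)
    }
    defer admin.Close()

    // One split point per day for the next 30 days, so date-prefixed
    // rows land on different tablets from the start.
    var splits []string
    day := time.Now().Truncate(24 * time.Hour)
    for i := 0; i < 30; i++ {
        splits = append(splits, day.AddDate(0, 0, i).Format("2006-01-02"))
    }
    if err := admin.CreatePresplitTable(ctx, "events", splits); err != nil {
        log.Fatal(err)
    }
    fmt.Println("created pre-split table")
}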

Examples

Twitter fanout: Twitter uses a hybrid push/pull model. For most users, a post is fanned out to all followers' timelines at write time (push). For accounts above ~10,000 followers (celebrities), the post is not pushed; instead, at read time, the celebrity’s recent tweets are pulled and merged into the requesting user’s pre-computed timeline. This keeps write amplification bounded.

Redis cluster hot slot: Redis Cluster uses hash slots (16,384 total). All keys that hash to the same slot live on the same node. If multiple high-traffic keys hash to the same slot, that slot and its node are hot. Key salting with plain suffixes (key:0 through key:9) sends each variant to a different slot, at the price of application-level scatter-gather. Beware hash tags here: with {key}:suffix only the text inside the braces is hashed, so every variant is pinned to the same slot - the opposite of spreading. Redis Cluster does not rebalance load within a slot; the application must spread keys itself.

Kafka partition imbalance: If a Kafka topic uses user ID as the partition key, and a bot or celebrity generates 10,000 messages per second while average users generate 1 per minute, that user’s partition will have a consumer lag growing continuously while others are caught up. The fix: use a more uniform partition key, or increase the partition count and re-hash, or route high-volume producers to dedicated overflow topics.
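
A sketch of the salted-key fix on the producer side (the hot-producer set and salt count are illustrative; any client that lets you choose the message key works the same way). Note the trade-off the comment calls out: salting sacrifices strict per-user ordering for the salted producer.

package main

import (
    "fmt"
    "math/rand"
)

// hotProducers would be fed by runtime detection (e.g. the Count-Min
// Sketch above); hard-coded here for illustration.
var hotProducers = map[string]bool{"user:bot_1234": true}

const saltBuckets = 8

// partitionKey returns the Kafka message key. Normal users keep their
// plain ID (preserving per-user ordering); hot producers get a salted
// key so their traffic spreads across up to saltBuckets partitions -
// at the cost of losing strict ordering for that producer.
func partitionKey(userID string) string {
    if hotProducers[userID] {
        return fmt.Sprintf("%s:%d", userID, rand.Intn(saltBuckets))
    }
    return userID
}

func main() {
    fmt.Println(partitionKey("user:alice"))
    fmt.Println(partitionKey("user:bot_1234"))
}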

