
Picture the scene: a whiteboard, a marker, an interviewer who says “design Twitter.” Most candidates freeze. Not because they lack knowledge - they can recite what a CDN is, or explain what Redis does - but because they have no mental framework for assembling those pieces under pressure. The ones who don’t freeze have internalized something more valuable than a list of technologies: they have a repeatable thinking process, and they know the numbers that separate plausible from impossible.

That is what this post is about. System design is not a test of memorization. It is a test of articulated tradeoffs.

Why System Design Became a Discipline

In the early days of the web, “design” meant picking a database and writing some server code. A single machine could handle a modest website. Then traffic scaled, and things got interesting. The 2000s eBay outages, the Twitter fail whale, the early Facebook scaling crises - all of them exposed a gap between software engineering (writing correct programs) and systems engineering (building things that stay up under real-world stress). Google’s SRE model, Amazon’s internal service decomposition, Netflix’s migration to AWS - these weren’t just engineering projects. They were the gradual formalization of a discipline: how do you reason about a system you cannot hold entirely in your head?

System design interviews are, imperfectly, trying to probe that discipline.

The Framework: Six Steps Before a Single Box Gets Drawn

Step 1 - Clarify requirements. Separate what the system does (functional) from how it performs (non-functional). “Build a URL shortener” leaves open: custom aliases? Link expiry? Analytics? How many redirects per second? The questions you ask reveal how you think. Don’t assume; ask.

Step 2 - Estimate scale. Back-of-envelope calculations identify what will be stressed. They don’t need precision - an order of magnitude is enough. This is the step most candidates skip, and it’s the one that matters most, because the right architecture at 1,000 RPS is wrong at 1,000,000 RPS.

Step 3 - Design the data model. What entities exist? What are the read/write patterns? A write-heavy workload points toward different storage choices than a read-heavy one. The data model drives everything downstream.

Step 4 - Design the API. What do clients call, and with what parameters? Doing this before internal design forces clarity: if you can’t articulate the API, you don’t understand the system.

Step 5 - Design components top-down. Work from client to storage. Where does each request go, and why?

Step 6 - Identify bottlenecks. Where does this break as scale grows? What fails first - CPU, disk I/O, network, the database? Address each bottleneck with a known pattern.

There is no “correct” answer in system design. This is the central discomfort that trips people up. Two experienced engineers can propose very different architectures for the same requirements and both be right - because they’ve made different but defensible tradeoff choices. The interview is grading your reasoning, not your conclusion. The more explicitly you name your tradeoffs, the better you perform.

Internalize These Numbers

These are the widely circulated “latency numbers every programmer should know,” popularized by Jeff Dean at Google, and you should commit them to memory. Not as trivia, but because these numbers instantly reveal whether a design is physically plausible.

| Operation | Latency |
| --- | --- |
| L1 cache hit | 0.5 ns |
| L2 cache hit | 7 ns |
| RAM access | ~100 ns |
| SSD random read | ~100 µs |
| Network round-trip (same datacenter) | ~500 µs |
| HDD disk seek | ~10 ms |
| Network round-trip (cross-region) | ~100 - 150 ms |
| Packet from California to Netherlands | ~150 ms |

What do these numbers tell you? They tell you that accessing a remote database is roughly a million times slower than an L1 cache hit. That cross-region replication adds 100 - 150ms of irreducible latency - physics, not engineering. That if your service makes four serial network calls within a datacenter, you’ve already spent ~2ms before doing any actual computation. Design for the numbers, not the words.
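To make that arithmetic concrete, here is a minimal sketch that encodes the approximate latencies from the table and recomputes the two claims above. The dictionary keys and helper names are illustrative, not from any library.

```python
# Approximate latencies from the table above, in nanoseconds.
LATENCY_NS = {
    "l1_cache": 0.5,
    "ram": 100,
    "ssd_random_read": 100_000,
    "same_dc_round_trip": 500_000,
    "hdd_seek": 10_000_000,
    "cross_region_round_trip": 150_000_000,
}


def ratio(slow: str, fast: str) -> float:
    """How many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


def serial_cost_ms(operation: str, calls: int) -> float:
    """Total latency of `calls` back-to-back operations, in milliseconds."""
    return LATENCY_NS[operation] * calls / 1_000_000


print(ratio("same_dc_round_trip", "l1_cache"))  # 1000000.0
print(serial_cost_ms("same_dc_round_trip", 4))  # 2.0
```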

For quick estimation: 1 million requests/day ≈ 12 requests/second (divide by 86,400). A typical web request payload is 1 - 10 KB. A database row is 100 bytes to 1 KB. A single well-tuned PostgreSQL instance handles thousands of reads per second. A single Nginx instance handles tens of thousands of RPS.
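The same estimates can be written as two small helpers. This is a sketch; the function names and the example workload (10 million writes per day at 1 KB per row) are assumptions chosen for illustration.

```python
SECONDS_PER_DAY = 86_400


def average_rps(requests_per_day: float) -> float:
    """Average requests per second for a given daily volume."""
    return requests_per_day / SECONDS_PER_DAY


def daily_storage_gb(writes_per_day: float, bytes_per_row: int) -> float:
    """Raw storage added per day, ignoring indexes and replication."""
    return writes_per_day * bytes_per_row / 1e9


print(round(average_rps(1_000_000), 1))     # 11.6
print(daily_storage_gb(10_000_000, 1_000))  # 10.0
```

Averages hide peaks, so in practice you would multiply the RPS figure by whatever peak-to-average factor you assume for the workload.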

Horizontal vs Vertical Scaling

Vertical scaling means adding more resources to one machine - bigger CPU, more RAM, faster disk. It is operationally simple and requires no application changes. It also has a hard ceiling, and a single machine is always a single point of failure.

Horizontal scaling means adding more machines. It is theoretically unbounded, but it requires one critical architectural property: the application must be stateless. If a request can be routed to any server and produce the same result, you can scale horizontally. If request routing depends on which server holds the user’s session in memory, each added server complicates routing instead of absorbing load, because every user stays pinned to the one machine that holds their state.

Designing services to be stateless from the beginning is the highest-leverage architecture decision. It costs almost nothing early. It enables every scaling strategy later. State - sessions, caches, in-progress work - belongs in an external store: Redis, a database, or a distributed cache, not server memory.
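A minimal sketch of what externalized state looks like. `SessionStore` and `AppServer` are hypothetical names, and the store is a plain dict standing in for Redis or a database so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class SessionStore:
    """Stand-in for an external store such as Redis (a dict keeps this runnable)."""
    _data: dict = field(default_factory=dict)

    def get(self, session_id: str) -> dict:
        return self._data.get(session_id, {})

    def put(self, session_id: str, session: dict) -> None:
        self._data[session_id] = session


class AppServer:
    """Holds no per-user state; everything lives in the shared store."""

    def __init__(self, name: str, store: SessionStore):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id: str, item: str) -> list:
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {**session, "cart": cart})
        return cart


store = SessionStore()
server_a, server_b = AppServer("a", store), AppServer("b", store)
server_a.add_to_cart("s1", "book")
print(server_b.add_to_cart("s1", "pen"))  # ['book', 'pen']
```

Because neither server keeps the cart in its own memory, the second request can land on a different machine and still see the first request's effect.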

The AWS Well-Architected Framework calls this out explicitly in its Performance Efficiency pillar: use stateless compute, externalize state, and scale horizontally. GCP’s best practices for Compute Engine echo the same principle.

The Three Bottlenecks: CPU, I/O, Network

Most performance problems in web services reduce to one of three things:

CPU-bound: The bottleneck is computation. Encoding video, running ML inference, compressing data. Solutions: more cores, better algorithms, offload to specialized hardware (GPUs, TPUs).

I/O-bound: The bottleneck is waiting for disk or database reads. This is the most common bottleneck in web services. Solutions: caching (avoid the I/O entirely), read replicas (distribute I/O across machines), indexing (reduce I/O per query), SSDs over HDDs.

Network-bound: The bottleneck is data transfer. Uploading large files, streaming video, high-fan-out notifications. Solutions: compression, CDNs (serve from edge, not origin), batching (fewer round trips).

A system design that doesn’t identify which bottleneck it’s solving isn’t really a design - it’s a list of services.

The Layered Architecture

Most large-scale web systems share a recognizable layered structure. From client to data:

Client - Browser, mobile app, or API consumer. The client’s job is to make requests and render responses. Keep logic minimal here; it’s the layer you control least.

CDN - A geographically distributed cache of static assets (images, JS, CSS) and sometimes dynamic responses. A user in Singapore hitting your CDN edge node at 5ms instead of your origin server at 180ms is a free latency win. AWS CloudFront, GCP Cloud CDN, and Cloudflare all operate on this principle.

Load Balancer - Distributes requests across your application servers. Provides health checking and horizontal scale. AWS ALB (Application Load Balancer) operates at Layer 7 - it can route based on HTTP path, headers, or host. AWS NLB (Network Load Balancer) operates at Layer 4 - faster, but cannot inspect application-layer content.

Application Servers - Stateless compute. These handle business logic and call downstream services. Because they’re stateless, you can add or remove them freely.

Cache Layer - In-memory key-value stores (Redis, Memcached) that sit between your application and database. The cache-aside pattern: read from cache; on miss, read from database and populate cache; on write, update database and invalidate cache. Redis is the default choice today - it adds persistence, Lua scripting, pub/sub, and richer data structures beyond Memcached’s pure key-value model.
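The cache-aside read and write paths described above can be sketched as follows. The `CacheAside` class is hypothetical, and both the cache and the database are dicts standing in for Redis and a real system of record.

```python
class CacheAside:
    def __init__(self, database: dict):
        self.database = database  # stand-in for the system of record
        self.cache = {}           # stand-in for Redis or Memcached
        self.db_reads = 0         # counts how often we fall through to the database

    def read(self, key):
        if key in self.cache:      # cache hit: no database I/O
            return self.cache[key]
        self.db_reads += 1         # cache miss: read the database...
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value  # ...and populate the cache
        return value

    def write(self, key, value):
        self.database[key] = value  # update the system of record first
        self.cache.pop(key, None)   # then invalidate the cached copy


users = CacheAside({"user:1": "Ada"})
users.read("user:1")
users.read("user:1")
print(users.db_reads)        # 1
users.write("user:1", "Grace")
print(users.read("user:1"))  # Grace
```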

Database - The system of record, where data lives durably, and the write bottleneck in most systems. The cache layer shields it from read load; connection pooling (PgBouncer for Postgres) shields it from connection exhaustion.

The 20% of Patterns That Cover 80% of Interview Questions

Read/Write Split (CQRS-flavored): Separate your read path from your write path. Writes go to a primary database; reads go to replicas. This handles the common case where reads vastly outnumber writes. AWS Aurora supports up to 15 read replicas. GCP Cloud Spanner handles this with automatic multi-region replication.
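As a rough illustration of the read/write split, the sketch below routes statements by inspecting their first keyword. `RoutingPool` and the connection names are hypothetical, and real drivers or proxies do this more carefully; in particular, asynchronous replicas can lag, so a read that must see the caller's own recent write may still need to go to the primary.

```python
import itertools


class RoutingPool:
    def __init__(self, primary: str, replicas: list):
        self.primary = primary
        # Round-robin over replicas; fall back to the primary if there are none.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str) -> str:
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary


pool = RoutingPool("primary", ["replica-1", "replica-2"])
print(pool.route("SELECT * FROM users"))   # replica-1
print(pool.route("UPDATE users SET ..."))  # primary
print(pool.route("select 1"))              # replica-2
```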

Cache-Aside (Lazy Loading): Don’t preload the cache; populate it on miss. Simple, avoids wasted cache capacity on data that’s never read. The risk: a cache stampede when a popular key expires and thousands of requests simultaneously hit the database. Mitigation: probabilistic early expiration (refresh before the key fully expires) or a mutex lock on first population.
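The mutex mitigation can be sketched with a per-key lock: the first caller to miss loads the value, and everyone else waits and then reads the freshly cached result. `SingleLoadCache` and `slow_load` are hypothetical names; a distributed deployment would need a lock shared across processes rather than `threading.Lock`.

```python
import threading
import time
from collections import defaultdict


class SingleLoadCache:
    def __init__(self, loader):
        self.loader = loader
        self.cache = {}
        self._locks = defaultdict(threading.Lock)  # one lock per key
        self._locks_guard = threading.Lock()       # protects the lock table itself

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        with self._locks_guard:
            lock = self._locks[key]
        with lock:
            if key in self.cache:  # another thread populated it while we waited
                return self.cache[key]
            value = self.loader(key)
            self.cache[key] = value
            return value


calls = []


def slow_load(key):
    calls.append(key)  # record every trip to the "database"
    time.sleep(0.05)
    return key.upper()


cache = SingleLoadCache(slow_load)
threads = [threading.Thread(target=cache.get, args=("hot",)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))  # 1
```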

Fan-out on Write vs Fan-out on Read: When a Twitter user with 10 million followers posts a tweet, do you write that tweet to 10 million followers' feeds immediately (fan-out on write), or do you assemble the feed at read time (fan-out on read)? Fan-out on write means fast reads but expensive writes for high-follower users. Fan-out on read means cheap writes but slow reads for users with many followees. Twitter’s actual architecture uses a hybrid: fan-out on write for most users, fan-out on read for celebrity accounts. Knowing this pattern exists is more important than knowing Twitter’s exact implementation.
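A toy version of the hybrid approach is sketched below. The threshold, names, and in-memory structures are all assumptions for illustration; the sketch ignores ordering by time and does not handle an author crossing the threshold after earlier posts were already fanned out.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff, tiny so the example is easy to follow

followers = defaultdict(set)         # author -> users who follow them
following = defaultdict(set)         # user -> authors they follow
timelines = defaultdict(list)        # user -> precomputed feed (fan-out on write)
posts_by_author = defaultdict(list)  # author -> their own posts


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def is_celebrity(author):
    return len(followers[author]) >= CELEBRITY_THRESHOLD


def post(author, text):
    posts_by_author[author].append(text)
    if not is_celebrity(author):  # fan-out on write for ordinary accounts
        for user in followers[author]:
            timelines[user].append(text)


def read_feed(user):
    feed = list(timelines[user])
    for author in following[user]:  # fan-out on read for celebrity accounts
        if is_celebrity(author):
            feed.extend(posts_by_author[author])
    return feed


follow("ann", "bob")
for fan in ("ann", "cat", "dan"):
    follow(fan, "star")
post("bob", "hi from bob")
post("star", "hi from star")
print(read_feed("ann"))  # ['hi from bob', 'hi from star']
```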

Event-Driven Decoupling: Instead of service A calling service B synchronously, A publishes an event to a queue, and B consumes it asynchronously. This decouples availability - B can be down without affecting A - and enables multiple consumers (fan-out). AWS SQS and SNS, Kafka, and GCP Pub/Sub all implement this.
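The decoupling can be illustrated with the standard library's in-process queue standing in for SQS, Kafka, or Pub/Sub. The service functions and event shape are hypothetical; the point is only that the producer returns without calling the consumer.

```python
import queue

events = queue.Queue()  # stand-in for a durable message queue


def place_order(order_id):
    """Service A: publishes an event and returns, without calling B."""
    events.put({"type": "order_placed", "order_id": order_id})
    return "accepted"


def drain_emails(sent):
    """Service B: consumes whatever is queued whenever it happens to be running."""
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return
        sent.append(f"email for order {event['order_id']}")


# B is "down" while both orders are accepted...
print(place_order(1), place_order(2))  # accepted accepted
# ...and catches up later.
sent = []
drain_emails(sent)
print(sent)  # ['email for order 1', 'email for order 2']
```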

Multi-Region Design Considerations

A single-region system, no matter how well designed, is a latency problem for users far from your region and a risk for regional failures. Multi-region design introduces new tradeoffs.

Active-active multi-region: All regions serve traffic and accept writes. Requires cross-region replication and conflict resolution (or strong consistency at the cost of cross-region write latency). AWS Global Accelerator routes users to the nearest healthy region. DynamoDB Global Tables provide active-active replication with eventual consistency.

Active-passive (primary-standby): One region handles writes; others are warm standbys that serve reads (or stay idle for failover). Simpler consistency model; slower failover. Amazon RDS cross-Region read replicas combined with Route 53 failover routing is a common pattern here.

Data sovereignty: GDPR restricts transfers of personal data outside the EU/EEA, and some national or sector-specific regulations go further and require that certain data physically reside within specific geographic boundaries. AWS and GCP both let you choose the region in which resources and data are stored. For EU deployments this isn’t optional - it’s a legal requirement that constrains your architecture.

What Actually Breaks Systems

The preceding sections describe what to build toward. Equally useful - and often more practically instructive - is asking the opposite question: what concretely causes each property to break down? The failure modes of scalability, reliability, and maintainability are specific, diagnosable, and usually present long before the system actually fails. Knowing them gives you a diagnostic lens: when you see the symptom, you know what’s coming.

What Makes a System Non-Scalable

A system fails to scale when its bottlenecks grow linearly with load instead of being absorbed by the architecture.

Stateful servers. The moment session data lives in a server’s local memory, adding servers makes routing harder, not easier. A user must always hit the same server. You cannot redistribute load freely. You cannot replace a server without losing in-flight sessions. Every horizontal scaling decision is constrained by this one early choice. The fix is cheap - externalize state from day one - but the damage of not doing it is compounding.

Synchronous call chains. If a request to your service triggers a synchronous chain A → B → C → D, your worst-case latency is the sum of all four hops plus business logic. More critically, the throughput of the entire chain is bounded by its slowest service. You cannot scale around a synchronous dependency; you can only wait for it.

Write-path centralization. Reads distribute easily: caches, read replicas, CDNs all absorb read load. Writes are harder because they require coordination and durability. A single primary database handling all writes is the most common scaling ceiling - it looks generous early (a tuned Postgres primary handles thousands of writes/second) and hits like a wall when approached. Systems that never separate their write path from their read path discover this at the worst possible time.

Hot partitions. Sharding only helps if the load is actually distributed across shards. Sharding by a non-uniform key - timestamp, geographic region, a user attribute that clusters - concentrates traffic on a few shards. One shard runs hot and becomes the bottleneck. The system looks sharded but behaves as if it weren’t. Uniform hash-based sharding by a randomly distributed key (e.g., hashed user ID) is the standard countermeasure.
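The difference is easy to see in a small simulation. The sketch below assumes four shards and a burst of events that all share one timestamp; the shard functions are hypothetical and exist only to contrast a clustered key with a hashed one.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4


def shard_by_hash(user_id: str) -> int:
    """Hash the key so that placement is independent of its structure."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shard_by_day(timestamp: str) -> int:
    """Non-uniform key: everything written on the same day shares a shard."""
    day = int(timestamp[8:10])
    return day % NUM_SHARDS


events = [(f"user-{i}", "2024-05-17T12:00:00") for i in range(10_000)]
by_day = Counter(shard_by_day(ts) for _, ts in events)
by_hash = Counter(shard_by_hash(user) for user, _ in events)

print(sorted(by_day.items()))   # [(1, 10000)] - one shard takes all the traffic
print(sorted(by_hash.items()))  # four shards with roughly 2,500 events each
```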

Unbounded resource accumulation. Memory that grows without bound, database connections that are never returned to the pool, file descriptors that leak, queues that grow faster than they drain - these are all forms of the same failure: a resource that accumulates under load until the process crashes or slows to a halt. They don’t show up in synthetic load tests. They appear in production after sustained traffic.

What Makes a System Non-Reliable

A system fails to be reliable when individual failures propagate outward rather than being contained.

Single points of failure. Any component with no redundancy means a single failure takes down the system. This includes the obvious (one server, one availability zone) and the less obvious: a shared configuration database, a single deployment pipeline, a manually maintained server that no one knows how to recreate, a third-party API with no fallback. The question to ask of every component: what happens when this specific thing fails? If the answer is “everything stops,” that’s your single point of failure.

Cascading failures. When service B becomes slow, A’s threads pile up waiting for it. A’s thread pool exhausts. A becomes unresponsive. A’s callers time out and pile up. The failure propagates upstream through the call graph, taking down services that had nothing to do with the original problem. This is the most common failure mode in microservices architectures. The countermeasures - timeouts, circuit breakers, bulkheads - must be present before the failure, not added after. A circuit breaker that doesn’t exist cannot trip.
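A minimal circuit breaker sketch shows the idea: after a run of consecutive failures, callers fail fast instead of tying up threads on a dependency that is presumed down. The class, its thresholds, and the `flaky` function are assumptions for illustration; production libraries add more nuanced half-open handling, metrics, and per-error policies.

```python
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast; dependency presumed down")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # any success closes the circuit fully
        return result


def flaky():
    raise TimeoutError("service B is slow")


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=30.0)
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        print("short-circuited")         # third iteration
    except TimeoutError:
        print("called B and it failed")  # first two iterations
```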

No graceful degradation. A system that returns an error when any single dependency fails has made those dependencies part of its critical path. The search page that errors out because the recommendations service is down, instead of showing results without recommendations, has coupled two things that didn’t need to be coupled. Every external dependency is a potential failure injection point. The design question is not “will this dependency fail?” but “what should the system do when it does?”
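In code, the search-page example amounts to deciding which calls sit on the critical path. The sketch below uses hypothetical `search` and `recommend` callables; a real handler would also log or count the swallowed failure so that the degradation stays visible.

```python
def search_page(query, search, recommend):
    results = search(query)  # critical path: let failures propagate
    try:
        recommendations = recommend(query)
    except Exception:
        recommendations = []  # optional dependency: degrade instead of failing
    return {"results": results, "recommendations": recommendations}


def broken_recommend(query):
    raise ConnectionError("recommendations service is down")


page = search_page("redis", lambda q: [f"doc about {q}"], broken_recommend)
print(page)  # {'results': ['doc about redis'], 'recommendations': []}
```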

Invisible failures. A system without metrics, without structured logging, without health endpoints, without alerting, doesn’t fail loudly - it degrades silently. By the time a user reports it, the failure has been real for minutes or hours. The system was already broken; it just wasn’t observable. Observability is not an operational nicety added after the system works. It is a reliability property: a system you cannot observe is a system you cannot trust.

Large blast radius deployments. Releasing a change to all production traffic simultaneously means every bug is a production-wide incident. Canary deployments, feature flags, and staged rollouts are not process overhead; they are the mechanism by which the cost of a bug is bounded. A bad deploy that reaches 1% of traffic before rollback is a near-miss. The same deploy to 100% is an incident.
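One common way to implement a staged rollout is to hash a stable identifier into a bucket from 0 to 99 and compare it with the current rollout percentage. The sketch below is an assumption about how such a gate might look, not a description of any particular feature-flag product.

```python
import hashlib


def in_canary(user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in a bucket from 0 to 99 and compare."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


users = [f"user-{i}" for i in range(10_000)]
exposed = sum(in_canary(u, 1) for u in users)
print(exposed)  # close to 100, i.e. about 1% of users
```

Because the bucket depends only on the user ID, a given user stays in or out of the canary across requests, and raising the percentage only ever adds users.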

What Makes a System Non-Maintainable

A system becomes unmaintainable when the cost of understanding and changing it grows faster than the team that owns it.

Tight coupling. When changing service A requires coordinating changes in B, C, and D, the services share a boundary in name only. Every feature requires a multi-team deployment window. Every bug fix risks breaking something in another service. The system has the operational complexity of microservices without the independence that makes microservices worth the complexity. This is a distributed monolith, and it is worse than an actual monolith because at least a monolith deploys as a unit.

Implicit contracts. A service that returns undocumented fields that callers depend on, that has timing behavior other services assume, or that makes ordering guarantees it never explicitly promised, accumulates hidden coupling. The contract is real; it’s just invisible. When it changes - because the owning team didn’t know about it - something downstream breaks in a way that is difficult to trace. Explicit, versioned APIs are the mechanism for making implicit contracts explicit before they become incidents.

Snowflake infrastructure. Servers that were manually configured at some point in the past, whose configuration is not in source control, that cannot be reproduced from scratch in a reasonable time, and which no one fully understands are not components - they are organizational risks. When they fail, they become extended incidents rather than routine replacements. Infrastructure as code is the difference between a replaceable part and a sacred artifact that no one wants to touch.

No operational tooling. A system that cannot be deployed safely, rolled back quickly, diagnosed under load, or reproduced locally resists change. Each change requires either courage or extensive manual testing. Engineers avoid changes they should make because the operational risk of making them exceeds the pain of leaving them alone. Technical debt accumulates not because engineers don’t see it but because the cost of paying it down is too high. The solution is investment in deployment pipelines, rollback mechanisms, local development environments, and runbooks - before the debt is too heavy to lift.


Summary

| Concept | Key Insight |
| --- | --- |
| Stateless services | Required for horizontal scaling; externalize all state |
| Back-of-envelope math | Order of magnitude is enough; use it to find bottlenecks |
| The latency hierarchy | L1 cache to cross-region: roughly 8 orders of magnitude difference |
| Cache-aside pattern | Populate on miss; invalidate on write |
| Read/write split | Replicas absorb read load; primaries handle writes |
| Fan-out tradeoffs | Write-time vs read-time materialization |
| Multi-region | Active-active vs active-passive; data sovereignty constraints |
| System design interviews | Grade reasoning, not conclusions; name your tradeoffs |
