System Design From First Principles - The Complete Playbook
Helpful context:
- The Vocabulary of Systems - Demystifying the Buzzwords That Fill Engineering Meetings
- System Design - Building Things That Don’t Fall Over
Most people approach system design by memorizing architectures. They study how Uber designed their location service, how Netflix handles streaming, how WhatsApp scaled to a billion users. Then they walk into a whiteboard session and freeze, because the problem in front of them is not Uber or Netflix - it is some variant they have never seen before, and their library of memorized diagrams does not pattern-match.
The engineers who do not freeze are not the ones who memorized more architectures. They are the ones who internalized a reasoning process. They know how to derive an architecture from requirements, how to justify technology choices from first principles, and how to name their tradeoffs explicitly. They do not recall the answer. They construct it.
This post is that reasoning process, written out in full.
Phase 1 - Understand the Problem Before Touching Architecture
The most expensive mistake in system design is building a solution to the wrong problem. Requirements look simple on the surface and hide enormous complexity underneath. Spend the first 10-15 minutes here. Do not draw a single box until this phase is complete.
Functional Requirements: What Must the System Do?
Write down exactly what operations the system supports. Be specific about the verbs.
For a social media feed:
- Users can post content (text, images, video)
- Users can follow other users
- Users see a feed of content from people they follow, in reverse-chronological order (or ranked?)
- Users can like and comment on posts
Each of these is a feature. Each feature is a read path or a write path or both, and each has different scale characteristics. “Ranked feed” instead of “reverse-chronological” changes the entire compute story - one is a database read, the other is a machine learning inference job.
Questions to always ask:
- Does the system need to support all users globally or a specific region?
- Are there features that seem natural but are not explicitly required? (Analytics, search, notifications - these often get bolted on later and should be clarified upfront)
- What is the read/write ratio? A read-heavy system and a write-heavy system of equal scale require fundamentally different architectures.
Non-Functional Requirements: How Well Must It Do It?
These are the constraints that eliminate whole categories of architectural choices.
Scale: How many users, active simultaneously? How many requests per second? How much data per day, per year? What is the growth rate? A system at 10,000 users and one at 10 million users are not the same design problem.
Latency: What is acceptable end-to-end latency for each operation? p50? p99? A “fast” feed load might mean under 300ms at p50. A real-time bidding system might require under 10ms at p99. These have entirely different implications for caching strategy and data model.
Availability: What is the acceptable downtime? 99.9% is 8.7 hours/year. 99.99% is 52 minutes/year. 99.999% is 5 minutes/year. Each additional nine requires significantly more investment. Most consumer products are fine at 99.9%. Financial transaction systems often require 99.99%+.
Consistency: If a user posts something, when must it be visible to their followers? Immediately? Within a second? Within a minute? The answer constrains whether you can use eventual consistency (simpler, faster, cheaper) or need strong consistency (harder, slower, expensive).
Durability: What is the cost of losing data? Losing a log message is acceptable. Losing a financial transaction is catastrophic. Durability requirements drive storage choices.
Throughput: Distinct from latency. A system can have low latency per request but need to sustain high throughput (many requests simultaneously). Or it can batch-process with high throughput but high latency per item (data pipelines).
Write down all of these explicitly. The answers become constraints that eliminate options in every subsequent phase. “We need 99.99% availability” closes the door on any design with a single point of failure. “We need strong consistency for financial data” closes the door on eventually-consistent datastores for that data.
Phase 2 - Estimate Scale: The Numbers That Determine Your Architecture
The right architecture at 1,000 RPS is wrong at 1,000,000 RPS. Back-of-envelope estimation identifies what will be stressed before you commit to a design that cannot handle it.
The Numbers You Must Internalize
| Operation | Latency |
|---|---|
| L1 cache hit | ~0.5 ns |
| L2 cache hit | ~7 ns |
| RAM read | ~100 ns |
| SSD random read | ~100 µs |
| Network round-trip (same datacenter) | ~500 µs |
| HDD disk seek | ~10 ms |
| Network round-trip (cross-region) | ~100-150 ms |
| Component | Typical capacity |
|---|---|
| Single CPU core | ~100M simple operations/sec |
| Single SSD | ~100K IOPS random read, ~500 MB/s sequential |
| Single network interface | ~10 Gbps = ~1.25 GB/s |
| Single Postgres primary (well-tuned) | ~5,000-10,000 writes/sec, ~50,000 reads/sec |
| Single Redis instance | ~100,000-1,000,000 ops/sec |
| Single Nginx instance | ~50,000+ RPS for static, ~10,000 RPS for proxied |
| Single Kafka broker | millions of messages/sec, hundreds MB/s |
These numbers are orders of magnitude, not precise specs. Their value is in quickly identifying when a naive design is impossible. “Each of our 10M daily active users fetches their feed 5 times a day, and we want to pre-compute feeds” - that is 50M write operations per day just for feed maintenance, before actual user posts. Is that acceptable? Can the write path handle it?
Worked Estimation: URL Shortener
Requirements: 100M URLs shortened per day. 10:1 read/write ratio. URLs kept for 5 years.
Writes: 100M / 86,400 seconds ≈ 1,200 writes/second (call it 2,000 with variance)
Reads: 10:1 ratio → 12,000 reads/second peak (call it 20,000)
Storage: Average URL = 200 bytes. 100M URLs/day × 365 × 5 years = 182 billion URLs. At 200 bytes each = 36 TB of raw URL data. With indexes and overhead, call it 100 TB.
Bandwidth: 20,000 reads/second, each read returning ~200 bytes → 4 MB/s outbound. Trivial.
Conclusions from these numbers: writes are not the bottleneck (2,000/sec is fine for a single Postgres primary). Reads are significant but manageable (20,000/sec is addressable with caching). Storage is the real constraint - 100TB cannot fit on one machine and grows continuously.
This tells you: the design needs a distributed or tiered storage strategy. It does not need a complex write path. It needs aggressive caching on reads (a cache hit rate of 99% reduces database reads to 200/sec). Before touching architecture, you know where the pressure is.
Phase 3 - Design the Data Model: Start With Access Patterns
The data model is the most important design decision and the hardest to change later. Get it wrong here and every other decision is fighting the data layer. The correct way to design a data model is to start with access patterns, not with entities.
The Core Question: How Is the Data Accessed?
List every operation from Phase 1. For each operation, ask:
- What data does it need?
- How is it looked up? (By user ID? By time range? By full-text search?)
- How often is it read vs written?
- How many records are typically involved? (One record, or millions aggregated?)
A “get user by ID” access pattern maps naturally to a key-value or relational lookup. A “get all posts by users I follow, sorted by time” access pattern is a join across multiple tables or a pre-computed feed. A “search for posts containing a keyword” access pattern requires a text index (Elasticsearch, not a relational database). The access pattern determines the data structure.
Choosing Storage: The Decision Framework
Use a relational database (PostgreSQL, MySQL) when:
- Data has clear relationships between entities
- You need ACID transactions (money, inventory, anything where partial writes are catastrophic)
- The access patterns are known and query-able with indexes
- You need joins, aggregations, or complex queries
- Strong consistency is required
Use a document store (MongoDB, DynamoDB document mode) when:
- Data is hierarchical and accessed as a unit (user profiles, product catalogs)
- Schema evolves frequently
- You want to avoid joins by denormalizing into a single document
- The primary access pattern is “get this whole thing by ID”
Use a key-value store (Redis, DynamoDB, Memcached) when:
- Access pattern is always “get by key” or “set by key”
- Extremely high throughput required (Redis handles 1M+ ops/sec)
- Data fits in memory (Redis) or is small per-item (DynamoDB)
- Often used as a cache layer on top of another database
Use a wide-column store (Cassandra, HBase) when:
- Write-heavy workload at massive scale
- Data is time-series or naturally partitioned (events by user ID + timestamp)
- Linear horizontal scale is more important than consistency
- Queries are always by partition key (never ad-hoc)
Use a search engine (Elasticsearch, Solr) when:
- The primary access pattern is text search or fuzzy matching
- You need faceted filtering (filter by price range AND category AND brand)
- Full-text indexing is required
- Never as primary storage - Elasticsearch is not durable in the relational sense
Use a graph database (Neo4j, Neptune) when:
- The primary queries involve traversing relationships (friends of friends, shortest path)
- The data is genuinely a graph and relationship traversal is the bottleneck
- Social networks' recommendation queries, fraud detection link analysis
The anti-pattern: choosing a database because it’s fashionable (“we’ll use Cassandra for scale”) without mapping it to access patterns. Cassandra is extraordinarily good at write-heavy time-series data accessed by partition key. It is painful for anything else. The right database is the one whose strengths match your access patterns, not the one on the trending list.
Normalize vs Denormalize
Normalization eliminates data duplication. Denormalization deliberately introduces duplication to eliminate joins.
A normalized relational model for a social feed: users table, posts table, follows table. Getting a user’s feed requires joining all three. At 100M users, this join becomes expensive.
A denormalized model: each user has a pre-computed feed stored as a list of post IDs. Reads are instant - no join. Writes are expensive - every post must be pushed to all followers' feed lists. This is fan-out on write.
Neither is universally right. The choice depends on read/write ratio and whether the fan-out is manageable. At 10:1 read-heavy, denormalization is usually worth it. For a user with 10M followers, fan-out on write is catastrophically expensive - you need a hybrid.
Phase 4 - Design the API: Define the Contract First
Before drawing internal components, define what clients will call. This forces clarity: if you cannot articulate the API, you have not understood the system.
API Design Principles
Name operations by intent, not implementation. POST /shorten is better than POST /create-record-and-store-in-database. The API should be decoupled from internals.
Choose the right protocol. REST for public APIs accessed by browsers or external clients. gRPC for high-performance internal service-to-service communication. WebSockets or Server-Sent Events for real-time push (notifications, live feeds). GraphQL when clients need flexible querying of nested data and you want to avoid under/over-fetching.
Design for idempotency. Any write operation that might be retried should be idempotent: calling it multiple times produces the same result as calling it once. Stripe does this with idempotency keys on payment requests. URL shorteners often hash the long URL so the same URL always produces the same short code.
Version from the start. /v1/shorten instead of /shorten. When the API changes, you can introduce /v2/ without breaking existing clients.
Worked Example: URL Shortener API
POST /v1/urls
Body: { "long_url": "https://...", "alias": "optional-custom", "expires_at": "optional" }
Response: { "short_url": "https://short.ly/abc123", "expires_at": "..." }
GET /{short_code}
Response: 301 Redirect to long_url (or 302 for tracking clicks)
GET /v1/urls/{short_code}/stats
Response: { "clicks": 1234, "created_at": "...", "last_accessed": "..." }
DELETE /v1/urls/{short_code}
Response: 204 No Content
Note the choice of 301 vs 302 redirect: 301 is permanent and browsers cache it, meaning after the first visit, no subsequent request hits your servers - good for throughput, bad for analytics (you lose click tracking). 302 is temporary and browsers re-request every time - good for analytics, costs one server round-trip per click. This is a tradeoff that should be made explicit in the API design, not discovered later.
Phase 5 - Design the Architecture: Component by Component
Now draw the boxes. Work top-down from the client to the data. For every component you add, be able to answer: why does this exist, what problem does it solve, and what would break if I removed it?
The Standard Layers (and When to Skip Them)
DNS and CDN: DNS routes the domain to your infrastructure. A CDN sits in front of everything and serves static assets (images, CSS, JS) from edge nodes close to users. Every web system needs DNS. CDN is justified when serving significant static content or when geographic latency matters. For a purely API-driven service with no assets, CDN is less critical initially.
Load Balancer: Required as soon as you have more than one application server. At Layer 7 (HTTP-aware), it can route based on path, headers, or host. At Layer 4 (TCP), it is faster but cannot inspect content. Most web systems use L7 (AWS ALB, GCP HTTP Load Balancer, Nginx, HAProxy). Add an L4 load balancer in front for very high-throughput non-HTTP protocols.
Application Servers: The stateless business logic layer. Stateless is non-negotiable: any per-request state lives in an external store. Design application servers to be horizontally scalable from day one - this single decision enables every future scaling maneuver.
Cache: Sits between application servers and the database. Reduces database read load and latency. Add a cache when: read traffic is high, data is frequently re-read (high temporal locality), and data changes slowly relative to its read rate. If every read is for unique data that is never re-requested, a cache adds overhead without benefit.
Message Queue: Sits between components that need to communicate asynchronously. Add a queue when: you want to decouple producer throughput from consumer processing speed; you want multiple consumers for the same event; you want to buffer spikes; you need retries on failure. Do not add a queue where synchronous response is required (the user is waiting for the result).
Database: The system of record. Multiple databases for different access patterns is normal and often necessary. A relational database for transactional data, a search engine for full-text search, a time-series database for metrics. Each component should use the storage primitive best suited to its access pattern.
Drawing the Write Path and Read Path Separately
The most clarifying exercise in system design is separating the write path from the read path and drawing each independently.
Write path for URL shortener: Client → Load Balancer → Application Server → Generate short code → Write to database → (async) update analytics
Read path for URL shortener: Client → Load Balancer → Application Server → Check cache for short code → Cache hit: return redirect / Cache miss: query database → populate cache → return redirect
Drawing these separately makes it obvious where the bottlenecks are and which components serve which purpose. The write path needs the database to be durable. The read path needs the cache to be fast. These are different requirements and their solutions are independent.
Phase 6 - Technology Selection: How to Choose and Justify
This is where system design becomes differentiated. Anyone can name technologies. Being able to justify a choice from first principles - and explain what you give up by making it - is the skill.
The Justification Framework
For every technology choice, articulate four things:
- What problem does this solve? State the specific requirement it addresses.
- Why this over the alternatives? Name the alternatives and explain why each is a worse fit.
- What are the tradeoffs? Every technology has downsides. Name them explicitly.
- What would trigger a different choice? Under what conditions would you switch?
Database: Worked Justification
“For the primary URL storage, I’d use PostgreSQL.
The access patterns are: write a URL record (once per shortened URL), read a URL record by short code (on every redirect), and read analytics by URL. These are straightforward key-value lookups and simple aggregations - relational database strengths.
We need ACID guarantees on writes: if a short code is generated but the URL record fails to persist, we have a dangling pointer. A relational database gives us transactional safety here.
Why not Cassandra? Cassandra is exceptional for write-heavy time-series data at massive scale, but it requires careful schema design around partition keys and does not support transactions across partition keys. For our read-heavy, relational-ish workload, Postgres is simpler and better-suited. Why not DynamoDB? DynamoDB would work for key-value access, but it adds operational complexity and cost without giving us anything Postgres cannot. At our scale estimate, a well-tuned Postgres primary handles our write rate comfortably.
The tradeoffs: Postgres is a single-primary system. When we outgrow vertical scaling on writes, we will need to shard. For reads, we can add read replicas. At 10x current scale, this would be the revisit point.
What would change this choice: if our write rate exceeded ~10,000/second sustained, or if we needed built-in multi-region active-active writes, I’d look at CockroachDB (distributed SQL with automatic sharding and multi-region support) or Cassandra with careful schema design.”
This is the pattern. Every technology choice should follow this structure.
The Technology Decision Map
Relational DB (Postgres, MySQL): Strong consistency, ACID, complex queries, foreign key integrity, schema enforcement. Use for: financial records, user data with relationships, any data requiring transactions.
Document DB (MongoDB, DynamoDB): Flexible schema, hierarchical data, high throughput on simple key lookups. Use for: product catalogs, user profiles, content management, anything accessed whole-document.
Key-Value Store (Redis, Memcached): Extreme throughput, in-memory, simple access. Use for: caching, session storage, rate limiting counters, leaderboards, pub/sub. Redis specifically: sessions, distributed locks, sorted sets for rankings, geospatial queries.
Wide-Column (Cassandra, HBase): Linear write scale, partition-key access, time-series. Use for: event streams, IoT data, write-heavy analytics, time-series metrics.
Search (Elasticsearch): Full-text search, faceted filtering, relevance ranking. Use for: product search, log analytics, anything requiring text matching. Not for primary storage.
Message Queue (Kafka, SQS, RabbitMQ): Kafka for high-throughput durable event streaming with replay. SQS for simple task queues in AWS ecosystems. RabbitMQ for complex routing logic between producers and consumers.
Cache (Redis, Memcached): Redis for richness (sorted sets, persistence, pub/sub). Memcached for pure key-value caching with simpler operability. The default choice in new systems is Redis.
Object Storage (S3, GCS): Unlimited scale, cheap, durable. Latency of ~100-200ms per get. Use for: large files, images, videos, backups, data lakes. Not for low-latency access.
Phase 7 - Identify and Address Bottlenecks
Every architecture has a bottleneck. The goal is not to eliminate bottlenecks - it is to choose which bottleneck you can live with and ensure the actual bottleneck matches your weakest constraint, not your strongest one.
The Bottleneck Analysis Method
For each component in your architecture, estimate its capacity against the scale numbers from Phase 2. Find where the headroom runs out first.
“Can the database handle the write rate?" Take your estimated writes per second. Compare to database capacity. If writes are 2,000/sec and Postgres handles 10,000/sec, there is headroom. If writes are 15,000/sec, this is your bottleneck and the design needs to address it.
“What is the cache hit rate and what happens on a miss?" If your cache hit rate is 99% and you have 20,000 reads/sec, only 200 reads/sec reach the database. If the cache goes down or hit rate drops to 80%, 4,000 reads/sec hit the database. Does the database survive?
“What happens when a downstream service is slow?" If your application server calls three downstream services synchronously, the p99 latency of your service is roughly the sum of the p99 latencies of all three. A slow payment service makes your checkout slow even if everything else is fast.
Addressing Common Bottlenecks
Database read bottleneck: Add read replicas. Add a cache in front of the database. Ensure the hottest queries are indexed. Denormalize to avoid expensive joins.
Database write bottleneck: Optimize writes (batch inserts, write-behind caching). Vertical scale the primary (faster CPU, more memory, NVMe SSDs). Shard by a well-distributed key. Move non-critical writes to a queue for async processing.
Application server bottleneck: Application servers are stateless, so this is the easiest bottleneck to address: add more servers and let the load balancer distribute. Ensure the bottleneck is not downstream (if the database is saturated, adding application servers makes it worse).
Cache bottleneck: Redis is so fast that the cache itself is rarely the bottleneck. If it is, scale Redis with clustering (Redis Cluster) or add more cache instances behind a consistent hash.
Network bottleneck: Move computation and data closer together (same availability zone, same datacenter). Use a CDN for static content. Compress payloads. Reduce the number of network calls by batching or by restructuring data access.
Phase 8 - Design for Failure: Reliability From First Principles
Reliable systems are not built by hoping components do not fail. They are built by assuming every component will fail and designing accordingly.
Classify Every Component by Failure Mode
For each component, ask: what happens when this fails? There are three classes of failure:
Hard dependency: The component fails and the request cannot complete. The database for a URL shortener is a hard dependency for writes - if the database is down, no new URLs can be shortened.
Soft dependency: The component fails and the system degrades gracefully. The analytics service is a soft dependency for redirects - if it is down, the redirect still works, we just lose click tracking for that period.
Transparent failure: The component fails and a fallback handles it transparently. The cache is a transparent failure for reads - if Redis is down, reads fall through to the database. Slower, but correct.
Categorize every component. Ensure hard dependencies have redundancy (replication, failover). Design soft dependencies to degrade gracefully. Ensure transparent failures do not cascade (a Redis outage should not cause 100x more database load than the database can handle).
Replication and Failover
Every stateful component needs replication. The purpose of replication:
- Availability: if the primary fails, a replica takes over
- Read scale: reads can be distributed across replicas (with eventual consistency implications)
- Geographic redundancy: replicas in different zones or regions survive zone/region failures
For databases: use synchronous replication within an availability zone (zero data loss on failover), asynchronous replication across regions (low latency writes, possible data loss on region failure).
Automatic failover is essential. Manual failover during an incident adds minutes to the recovery time. Every stateful component should have automated health checking and automatic promotion of a replica when the primary is unhealthy.
Circuit Breakers and Timeouts
Without circuit breakers, a slow dependency cascades into a full outage. The pattern:
A circuit breaker wraps each external call. When the error rate for a dependency exceeds a threshold, the circuit breaker “trips” - subsequent calls immediately return an error (or cached fallback) without contacting the dependency. After a cooldown period, the circuit breaker allows a test request. If it succeeds, the circuit closes and normal traffic resumes.
Without this, your application server’s thread pool fills up waiting for a slow dependency, and your service becomes unavailable - even though your service’s own code is fine. A tripped circuit breaker fails fast, freeing threads, and allows your service to continue operating in a degraded state while the dependency recovers.
Every external call must also have an explicit timeout. “Wait forever” is not a strategy.
Idempotency and Retries
At scale, failures happen. Network packets are dropped. Database connections time out. Requests fail mid-flight. The correct response is to retry. But retries are only safe if the operation is idempotent - calling it multiple times produces the same result.
Design write operations to be idempotent:
- Use a client-generated request ID. On retry, the server detects the duplicate ID and returns the previous result without re-executing.
- Use conditional writes (“write only if this value is X”). Prevents double-writes.
- Generate deterministic IDs from content (hash the input). The same input always produces the same ID, so a duplicate write has the same effect as the original.
Combine retries with exponential backoff and jitter. Immediate retries after failures create thundering herds. Exponential backoff reduces the retry rate over time. Jitter (random delay) desynchronizes retries from multiple clients hitting the same failed service simultaneously.
Phase 9 - A Complete Worked Example: Designing a Pastebin
Rather than design Twitter (overused), here is a less-discussed system that exercises all the concepts above.
Requirements: Users can paste text. Each paste gets a unique URL. Pastes can be public or private (with a secret URL). Pastes expire after 30 days by default, or a custom duration. Read rate is 10x write rate. Analytics: view count per paste.
Estimation
Writes: 10M pastes/day → ~120/sec average, ~500/sec peak. Reads: 10x → 1,200/sec average, ~5,000/sec peak. Storage: average paste = 10KB. 10M/day × 30 days at any given time = 300M active pastes × 10KB = 3TB. Manageable on object storage. Data persisted for 30 days then deleted. Cleanup is a significant write workload.
Data Model
Three access patterns drive the model:
- Create paste → generate unique ID, store content
- Read paste by ID → return content if not expired
- Get view count for paste
Paste record: paste_id (8-char random string), content stored in object storage (S3/GCS), metadata in relational database (created_at, expires_at, view_count, is_public, secret_key if private, owner_id if authenticated).
Why split content from metadata? Content is large (up to several MB for code), accessed as a blob, and cheap to store in S3. Metadata is small, queried by ID, and needs fast point lookups. Mixing them in one database means paying database I/O for large blob reads on every access.
Private pastes: generate a second longer random key that becomes part of the URL. The “secret URL” approach - knowing the URL is proof of authorization. No access control table needed, which simplifies the read path significantly.
API
POST /pastes
Body: { "content": "...", "expires_in": 86400, "is_public": false }
Response: { "paste_id": "abc12345", "url": "https://paste.io/abc12345", "expires_at": "..." }
GET /pastes/{paste_id}
Response: Paste content (with view_count increment happening asynchronously)
GET /pastes/{paste_id}/stats
Response: { "view_count": 1234, "created_at": "...", "expires_at": "..." }
View count increment is async intentionally: if the counter increment is on the read path, every view hits the database for a write. Instead, publish a “paste-viewed” event to a queue; a consumer updates the counter in batches.
Architecture
Write path: Client → Load Balancer → App Server → Generate paste ID → Write content to S3 → Write metadata to Postgres → Return paste URL.
Read path: Client → Load Balancer → App Server → Cache lookup by paste_id → Hit: return content URL from cache → Miss: query Postgres for metadata → Check expiry → Return S3 signed URL → Client fetches content from S3 (CDN-accelerated).
Why a signed S3 URL instead of proxying? Proxying the content through the app server means every large paste read saturates app server network bandwidth. A signed URL redirects the client to fetch directly from S3/CDN, removing the app server from the data path entirely.
Expiry cleanup: a daily batch job queries Postgres for expired pastes and deletes them from both S3 and Postgres. This is a background operation with no latency requirement.
Analytics: “paste-viewed” events are published to a Kafka topic. A consumer reads the topic and increments a Redis counter for each paste. The /stats endpoint reads from Redis. Eventual consistency is acceptable here - being 1 second behind on view counts is not a correctness problem.
Technology choices justified:
- Postgres: metadata has a clear schema, transactions needed for atomic paste creation, relational database strengths match access patterns
- S3: content is large blobs, cheapest per-GB storage, durable, CDN-integrable
- Redis: view count is a counter, Redis INCR is atomic and extremely fast, no need for durable storage here
- Kafka: decouples read path from analytics write, survives consumer downtime (events are replayed), enables multiple future analytics consumers without changing the read path
Failure Analysis
- Postgres down → write path fails (hard dependency). Read path falls back to cache but new data unavailable. Mitigation: Postgres primary + synchronous replica with automatic failover (RDS Multi-AZ).
- S3 unavailable → content unreachable. S3 offers 99.99% availability. Cross-region replication as additional insurance.
- Redis down → view counts stop incrementing (acceptable degradation). Cache misses fall to Postgres (watch database load spike). Kafka events queue up until Redis recovers; counts are eventually consistent when consumer catches up.
- App server down → load balancer health check detects and stops routing. Replaced by auto-scaling group. Stateless - no impact.
The Trade-Off Reference
Every design decision involves trading one property for another. The fluency to name these explicitly, and explain why you made the choice you did, is the difference between a junior and senior design.
| You want | You sacrifice | The mechanism |
|---|---|---|
| Higher availability | Consistency | Eventual consistency, replica reads |
| Lower latency | Freshness | Caching, read replicas with replication lag |
| Higher write throughput | Durability | Write-behind caching, async durability |
| Stronger consistency | Availability | Synchronous replication, quorum writes (no writes during partition) |
| Simpler operations | Optimal performance | Managed services over self-hosted |
| Lower cost | Performance ceiling | Smaller instances, fewer replicas, object storage over block storage |
| Faster reads | Slower writes | Denormalization, pre-computation, fan-out on write |
| Faster writes | Slower reads | Normalization, fan-out on read |
| Resilience to dependency failure | Complexity | Circuit breakers, caches, graceful degradation code |
| Horizontal write scale | Query flexibility | Sharding (restricts which queries are efficient) |
| Schema flexibility | Data integrity | NoSQL (no foreign keys, no transactions across documents) |
| Exactly-once processing | Throughput | Distributed transactions (slower, more coordination) |
What the Best System Designs Have in Common
The best designs share properties that have nothing to do with choosing the right technologies:
They start simple and scale deliberately. A monolith with one database is often the right starting point. It is easier to split a well-factored monolith into services when scale demands it than to build a distributed system before you understand the load patterns.
Every component can be explained. “We added this because…” followed by a specific requirement. If a component cannot be explained by a requirement, it should not be in the design.
Failure modes are named. Every dependency has an explicit answer to “what happens when this fails?”
The data model matches the access patterns. Storage is chosen to serve the reads and writes that actually happen, not the reads and writes that seem natural.
Tradeoffs are explicit and justified. Choosing eventual consistency is fine. Choosing it without acknowledging that users may see stale data is not.
The goal is not a beautiful architecture diagram. The goal is a system that does what it is supposed to do, at the scale required, with the failure characteristics acceptable to the business - and a team that understands why every piece is there.
Read next: