The Vocabulary of Systems - Demystifying the Buzzwords That Fill Engineering Meetings // Megha Bose

You join a meeting. Someone mentions the reverse proxy is being reconfigured, which will affect the rate limits on the API gateway, and by the way the new service’s SLO is 99.9% and we need to wire up distributed tracing before it goes to prod. You nod. Half the words landed; the other half slid past without purchase.

This post is a field guide to that vocabulary. Not just what each term means, but why it was coined - what problem it was solving when someone first needed a word for it - and where two terms that seem interchangeable actually diverge. These words are not arbitrary jargon. Each one crystallized around a real engineering problem, and understanding the problem is what makes the word stick.

Client and Server

The words “client” and “server” come from the mundane world of commerce. A client is someone who requests a service. A server is someone who provides it. The engineering metaphor emerged in the 1970s and 1980s as computing moved away from the mainframe model - where one huge machine did everything and dumb terminals just displayed the results - toward a model where the machine making the request and the machine responding to it could be different, specialized computers.

A client is anything that initiates a request. A server is anything that listens for requests and responds to them. Your browser is a client. The machine at Google’s data center that responded when you typed a URL is a server. So is the database that the Google machine queried internally. And the cache it checked before that.

The important thing the metaphor conveys is direction: clients initiate, servers respond. But in any real system, the same machine often plays both roles at once. The server handling your HTTP request is itself a client to the database it queries. “Client” and “server” describe a relationship within a single interaction, not a fixed identity of a machine.

Request, Response, Protocol, and the Language They Speak

A request is a message from a client asking for something. A response is what the server sends back. These are not informally defined - they follow a protocol, which is a shared contract about message format, sequence, and meaning. Both sides must speak the same protocol or the conversation fails.

The dominant protocol for web communication is HTTP (Hypertext Transfer Protocol), which Tim Berners-Lee began designing at CERN in 1989 as part of his proposal for the World Wide Web. HTTP is a request-response protocol: the client sends a request with a method (such as GET, POST, PUT, or DELETE), a URL, headers, and an optional body. The server responds with a status code, headers, and an optional body. The status codes are a vocabulary of their own - 200 means success, 404 means the resource was not found, 500 means the server failed, 429 means you’ve sent too many requests and are being told to slow down.
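To make the shape of the exchange concrete, here is a minimal sketch that builds the text of an HTTP/1.1 request and splits a response status line. The helper names (build_request, parse_status_line) are made up for illustration, and a real client would use an HTTP library rather than assembling strings by hand.

```python
from http import HTTPStatus


def build_request(method, path, host, body=""):
    """Assemble a plain-text HTTP/1.1 request: request line, headers, blank line, body."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    if body:
        lines.append(f"Content-Length: {len(body.encode('utf-8'))}")
    return "\r\n".join(lines) + "\r\n\r\n" + body


def parse_status_line(line):
    """Split a status line such as 'HTTP/1.1 404 Not Found' into (code, reason)."""
    _version, code, reason = line.split(" ", 2)
    return int(code), reason


request_text = build_request("GET", "/users/42", "api.example.com")
code, reason = parse_status_line("HTTP/1.1 429 Too Many Requests")

# The standard library carries the same vocabulary of status codes.
standard_reason = HTTPStatus(code).phrase
```

The blank line (an empty line ending in CRLF) is what separates the headers from the optional body.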

Headers and Payload

Every HTTP message has two distinct parts: headers and a body, where the body is also called the payload.

Headers are metadata - key-value pairs that describe the message without being the message. They tell the recipient what type of content is in the body (Content-Type: application/json), how long it is (Content-Length: 348), who is making the request (Authorization: Bearer eyJh...), which compression encodings the client can accept (Accept-Encoding: gzip), and dozens of other things. Headers are read by the network infrastructure - proxies, load balancers, caches - as well as the application.

The payload is the actual data being carried. When you submit a login form, the username and password are in the payload. When a server returns a list of users, the JSON array is the payload. When you upload a file, the file bytes are the payload.

The word “payload” comes from the freight industry: the payload of a truck is the cargo you are being paid to deliver - distinct from the truck itself, the fuel, and the shipping labels. In networking, the same distinction holds at every layer: each layer wraps the layer above it in its own headers, and the content from the layer above becomes its payload. An HTTP response body is the payload of the HTTP message, the HTTP message is the payload of a TCP segment, the TCP segment is the payload of an IP packet, and the IP packet is the payload of an Ethernet frame. This nesting is called encapsulation.
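The nesting can be sketched in a few lines. This is a toy model only: real headers are binary structures, not bracketed text, and the wrap helper, the addresses, and the MAC address below are invented for illustration.

```python
def wrap(header, payload):
    """Prefix a payload with one layer's header (a text stand-in for a binary header)."""
    return f"[{header}]".encode("ascii") + payload


http_body = b'{"id": 42}'
http_message = wrap("HTTP 200 OK", http_body)           # HTTP wraps the body
tcp_segment = wrap("TCP src_port=443", http_message)    # TCP wraps the HTTP message
ip_packet = wrap("IP src=93.184.216.34", tcp_segment)   # IP wraps the TCP segment
frame = wrap("ETH dst=aa:bb:cc:dd:ee:ff", ip_packet)    # Ethernet wraps the IP packet
```

Each layer treats everything it was handed as opaque cargo, which is why the original body is still sitting untouched at the end of the frame.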

HTTPS is HTTP with encryption layered on top via TLS (Transport Layer Security). The “S” stands for Secure. Without TLS, every network intermediary between client and server can read both the headers and the payload. With TLS, only the two endpoints can - everyone in the middle sees ciphertext.

URL, Endpoint, Route

These three terms describe the same thing from different angles and are often used interchangeably, sometimes incorrectly.

A URL (Uniform Resource Locator) is the full address: https://api.example.com/users/42?format=json. It has several distinct components: the scheme (https) tells you which protocol to use; the host (api.example.com) identifies the machine; the path (/users/42) identifies the resource on that machine; the query string (?format=json) passes optional parameters.
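As a quick illustration, Python’s standard urllib.parse module splits the same example URL into these components. The fallback table of default ports is an assumption added for the sketch.

```python
from urllib.parse import urlsplit, parse_qs

parts = urlsplit("https://api.example.com/users/42?format=json")

scheme = parts.scheme          # which protocol to use
host = parts.hostname          # which machine to contact
path = parts.path              # which resource on that machine
query = parse_qs(parts.query)  # optional parameters, as a dict of lists

# No port is written in this URL, so the scheme's default applies.
port = parts.port or {"http": 80, "https": 443}[scheme]
```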

An endpoint is a specific URL on a server that handles a particular type of request - usually from the server’s perspective. “The /users/{id} endpoint returns a user by ID.” The word comes from the idea that it is the destination end of a communication channel.

A route is the pattern that a server uses to match incoming request paths to handler functions. In code: app.GET("/users/:id", getUserHandler). Route is a code-side concept; endpoint is the network-facing concept. In casual speech, they’re used interchangeably.
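To show what “matching a pattern to a handler” involves, here is a minimal sketch of a router. It is not how any particular framework does it; compile_route, dispatch, and get_user_handler are hypothetical names, and real routers add method matching, ordering rules, and much more.

```python
import re


def compile_route(pattern):
    """Turn '/users/:id' into a regex with one named group per ':param' segment."""
    segments = []
    for segment in pattern.strip("/").split("/"):
        if segment.startswith(":"):
            segments.append(f"(?P<{segment[1:]}>[^/]+)")
        else:
            segments.append(re.escape(segment))
    return re.compile("^/" + "/".join(segments) + "$")


def get_user_handler(params):
    return f"user {params['id']}"


# The route table: (compiled pattern, handler function) pairs.
routes = [(compile_route("/users/:id"), get_user_handler)]


def dispatch(path):
    """Find the first route whose pattern matches and call its handler."""
    for regex, handler in routes:
        match = regex.match(path)
        if match:
            return handler(match.groupdict())
    return "404 Not Found"
```

Calling dispatch("/users/42") runs the handler with the captured id, while a path that matches no pattern falls through to the 404 branch.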

The Network Underneath

IP Address and DNS

Every machine reachable on the internet has an IP address - a number that uniquely identifies it on the network. IPv4 addresses are 32 bits long and written as four numbers separated by dots: 142.250.80.46. IPv6 addresses are 128 bits long and written in hexadecimal, an address space designed to give every grain of sand on Earth an IP address and still have room to spare.

You don’t type 142.250.80.46 to reach Google. You type google.com. DNS (Domain Name System) is the global directory that translates human-readable names into IP addresses. The process of performing that lookup is called DNS resolution.

A DNS resolver is the service that does the resolution on your behalf. Your operating system is configured to use one - usually your home router, which forwards to your ISP’s resolver, or a public one like Google’s 8.8.8.8 or Cloudflare’s 1.1.1.1. When you type google.com, your computer asks its configured resolver: “what is the IP for google.com?” The resolver either answers from its cache (if it has a recent record for that name) or walks down the DNS hierarchy from the root: ask a root name server who handles .com, then ask the .com name server who handles google.com, then ask Google’s own authoritative name server for the final IP. The authoritative name server is the one that actually knows the answer - it is authoritative for that domain because the domain owner configured it. This full walk takes several round trips, but resolvers cache the result for a duration the domain owner specifies (called the TTL - Time to Live). A cached answer returns in milliseconds; an uncached full walk typically takes 20-100ms.
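The cache-or-walk decision can be sketched with a simulated hierarchy. Nothing here touches the network: the three dictionaries stand in for the root, TLD, and authoritative servers, the server labels and the 300-second TTL are invented, and the clock is passed in so the behavior is easy to follow.

```python
ROOT = {"com": "tld-com"}                                   # root: who handles each TLD
TLD = {"tld-com": {"google.com": "ns-google"}}              # TLD: who handles each domain
AUTHORITATIVE = {"ns-google": {"google.com": ("142.250.80.46", 300)}}  # (IP, TTL seconds)


class Resolver:
    def __init__(self):
        self.cache = {}   # name -> (ip, expires_at)
        self.walks = 0    # how many full hierarchy walks were needed

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached is not None and cached[1] > now:
            return cached[0]                           # cache hit: no walk
        self.walks += 1
        tld_server = ROOT[name.rsplit(".", 1)[-1]]     # 1. root: who handles .com?
        name_server = TLD[tld_server][name]            # 2. TLD: who handles google.com?
        ip, ttl = AUTHORITATIVE[name_server][name]     # 3. authoritative: what is the IP?
        self.cache[name] = (ip, now + ttl)             # remember it until the TTL expires
        return ip
```

Resolving the same name twice within the TTL performs one walk; resolving it again after the TTL has passed performs a second.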

DNS was invented in 1983. Before that, there was literally a single file (HOSTS.TXT), maintained by the Network Information Center at SRI (the Stanford Research Institute), that listed every hostname on the network. When the internet grew beyond a few hundred machines, that stopped working. DNS replaced it with a distributed hierarchy of name servers - the root servers, then TLD servers (.com, .org), then authoritative servers for each domain.

The key insight: DNS decouples names (stable, human-meaningful) from addresses (unstable, machine-specific). A company can change which server handles api.example.com without telling anyone - just update the DNS record. Clients keep using the same name.

Private vs Public IPs and NAT

Not everything with an IP address is on the internet. Your laptop probably has an IP like 192.168.1.5. Your phone has one too. So do millions of other devices worldwide - and none of them conflict, because these are private IPs, meaningful only inside a local network. The ranges 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 are reserved for private use by RFC 1918. No packet carrying a private destination address is supposed to be forwarded onto the public internet.
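Checking whether an address falls in one of those three ranges is a small exercise with Python’s standard ipaddress module. The function name is_private_range is made up for this sketch.

```python
import ipaddress

PRIVATE_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_private_range(address):
    """True if the IPv4 address falls inside one of the three private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in PRIVATE_RANGES)
```

The /12 range is the one that trips people up: it covers 172.16.0.0 through 172.31.255.255, so 172.32.0.1 is a public address.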

Your home router has two IP addresses: one private-facing address for your local network, and one public IP assigned by your ISP. That public IP is the one that exists on the actual internet - globally unique, routable anywhere.

When your laptop sends a request to Google, your router rewrites the packet before it leaves: it replaces your private source IP (and usually the source port) with its own public IP and a port of its choosing, and keeps a record of the substitution so it can map the response back to you when it arrives. This is NAT (Network Address Translation). Google sees the router’s public IP, not your laptop’s. From the internet’s perspective, your entire home network appears as a single machine at one IP address.
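The “record of the substitution” is essentially a lookup table. The sketch below assumes the common variant in which the router also rewrites the source port so it can tell devices apart; the Nat class, the starting port 40000, and the addresses are all illustrative, and a real implementation would reuse mappings per connection and expire them.

```python
class Nat:
    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 40000   # arbitrary starting port for this sketch
        self.table = {}          # public port -> (private ip, private port)

    def outbound(self, src_ip, src_port):
        """Rewrite a private source address to the router's public address."""
        public_port = self.next_port
        self.next_port += 1
        self.table[public_port] = (src_ip, src_port)   # remember the substitution
        return self.public_ip, public_port

    def inbound(self, dst_port):
        """Map a response arriving at a public port back to the private sender."""
        return self.table[dst_port]


nat = Nat("203.0.113.7")
laptop = nat.outbound("192.168.1.5", 51000)
phone = nat.outbound("192.168.1.6", 51000)
```

Both devices appear to the outside world as 203.0.113.7; only the port tells the router which one a reply belongs to.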

NAT was invented partly as a workaround for IPv4 address exhaustion - there are only about 4 billion possible IPv4 addresses, far fewer than the number of devices in the world. By hiding thousands of private devices behind one public IP, NAT stretched the IPv4 address space enormously. IPv6, with its 128-bit addresses, was designed to make NAT unnecessary - enough addresses for every device, ever. The transition is ongoing.

Router, Switch, and How the Internet Is Actually Built

The internet is not a cloud. It is a physical graph of interconnected devices, and the devices doing the forwarding are routers.

A router is a device that receives packets and decides where to send them next. It operates at the IP layer - it reads the destination IP address on each packet and consults its routing table: a list of IP address prefixes and corresponding next hops. “For packets destined to 10.0.0.0/8, send to the device at 192.168.1.1.” The router does not need to know the full path to the destination - just the next step. The next router does the same, and so on, hop by hop, until the packet arrives.
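A routing table lookup can be sketched as follows. The table entries are invented, and the sketch assumes the usual rule that when several prefixes match, the most specific (longest) one wins.

```python
import ipaddress

ROUTING_TABLE = [
    (ipaddress.ip_network("0.0.0.0/0"), "203.0.113.1"),     # default route
    (ipaddress.ip_network("10.0.0.0/8"), "192.168.1.1"),
    (ipaddress.ip_network("10.20.0.0/16"), "192.168.1.2"),  # more specific than the /8
]


def next_hop(destination):
    """Return the next hop for a destination using longest-prefix match."""
    ip = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in ROUTING_TABLE if ip in net]
    return max(matches, key=lambda entry: entry[0].prefixlen)[1]
```

Note that the function returns only the next step, never the full path, which is exactly the hop-by-hop behavior described above.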

This local decision-making is what makes the internet resilient. If a link goes down, routing protocols detect the change and update routing tables across the network, typically within seconds to a few minutes. Packets automatically take alternate paths.

Your home router is both a router (it connects your local network to your ISP’s network, making forwarding decisions between the two) and a switch (it connects the devices in your home to each other).

A switch operates at the layer below IP - the data link layer (layer 2), where devices are identified by MAC addresses. It forwards frames (not packets) between devices on the same local network based on those MAC addresses, which are hardware identifiers burned into network interface cards. A switch doesn’t see IP addresses; it only knows about the local network segment it serves. A packet going to Google leaves your machine, the switch forwards it to the router based on MAC address, and the router takes over from there using IP.

The short version: switches connect devices within a network. Routers connect networks to each other.

The internet is a network of networks. No single organization owns or operates it. There are tens of thousands of independently owned networks - your ISP, Google, AWS, universities, governments, telcos - each running their own infrastructure. Each of these networks is called an Autonomous System (AS) and has an assigned AS number. The protocol that connects them all is BGP (Border Gateway Protocol): each AS announces to its neighbors “I can reach these IP address ranges, route through me.” Every major router on the internet runs BGP and uses those announcements to build its routing table. When you send a packet from your ISP to Google, BGP is what determined the path.

Physically, the internet runs on fiber optic cables - glass threads that carry pulses of light at roughly two-thirds of the speed of light in a vacuum. Terrestrial fiber runs under roads and rail lines. Undersea fiber connects continents: there are hundreds of submarine cables crossing the world’s oceans, each carrying terabits per second, owned by consortia of telcos and increasingly by technology companies like Google, Meta, and Microsoft who laid their own. The physical paths of these cables explain why cross-continental latency has a hard floor. New York and London are roughly 5,600 km apart, so light in fiber needs about 28ms each way: no optimization can make that round trip faster than about 56ms, and because real cables do not follow the shortest path, measured round trips are closer to 70ms.
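The latency floor is simple arithmetic. This back-of-the-envelope sketch assumes a great-circle distance of about 5,570 km between New York and London and light in fiber at two-thirds of its vacuum speed; both figures are approximations, and real cable routes are longer than the great-circle path, so measured round trips come out higher than the result.

```python
SPEED_OF_LIGHT_KM_S = 299_792                    # in a vacuum
FIBER_SPEED_KM_S = SPEED_OF_LIGHT_KM_S * 2 / 3   # roughly two-thirds of that in glass


def min_round_trip_ms(distance_km):
    """Theoretical best-case round-trip time over a straight fiber of this length."""
    one_way_seconds = distance_km / FIBER_SPEED_KM_S
    return 2 * one_way_seconds * 1000


NEW_YORK_TO_LONDON_KM = 5_570   # approximate great-circle distance (assumption)
floor_ms = min_round_trip_ms(NEW_YORK_TO_LONDON_KM)
```

Under these assumptions floor_ms lands in the mid-50s of milliseconds, and no amount of software tuning moves it.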

Internet Exchange Points (IXPs) are the physical meeting places where different networks interconnect. An IXP is essentially a neutral building with a large switch inside, where ISPs, content providers, and cloud networks plug in next to each other and exchange traffic directly - without routing through intermediaries. If two networks exchange a lot of traffic, peering directly at an IXP is cheaper and faster than paying a third network to carry it. Major IXPs include DE-CIX in Frankfurt, AMS-IX in Amsterdam, and Equinix facilities in dozens of cities.

Data centers are where servers actually live - rooms full of rack-mounted computers with no screens or keyboards, connected to redundant power and fiber. When you request google.com, you reach a machine in one of Google’s data centers: in Oregon or Iowa or Dublin or Singapore. Cloud providers like AWS, GCP, and Azure each operate dozens of data center regions globally. The geographic spread is not redundancy theater - it determines how far your packet has to travel, and therefore how fast the response arrives.

Nobody owns the internet as a whole. The undersea cables are owned by telco consortia and tech companies. The backbone fiber is owned by tier-1 ISPs. The last mile to your home is owned by your ISP. The data centers are owned by cloud providers and the companies that use them. The IXPs are often run by non-profit industry cooperatives. All of it works because every participant speaks the same protocols - IP and BGP chief among them - and because the economic incentive to route traffic globally outweighs the cost of cooperation.

Port

An IP address identifies a machine. A port identifies a specific service running on that machine. The same computer can run a web server, a database, and an SSH daemon simultaneously - they share one IP address but listen on different ports. HTTP uses port 80 by convention; HTTPS uses 443; SSH uses 22; PostgreSQL uses 5432.

Ports exist because the operating system needs to know which process to hand an incoming packet to. The port is an integer from 0 to 65535. On Unix-like systems, ports below 1024 are privileged: a process needs elevated permissions to listen on them. When you visit https://example.com, your browser is implicitly connecting to port 443 - it just doesn’t show it because 443 is the default port for HTTPS. A URL can name a different port explicitly, as in https://example.com:8443.

TCP and UDP

IP sends packets across the network, but with no guarantee. Packets can be lost, duplicated, or arrive out of order. TCP (Transmission Control Protocol) builds a reliable, ordered byte stream on top of IP’s chaos. Before any data flows, TCP performs a handshake to establish the connection. Lost packets are retransmitted. Packets are reordered if they arrive scrambled. This is why HTTP, SSH, database connections, and most things you care about use TCP: reliability matters more than raw speed.

UDP (User Datagram Protocol) throws packets at the destination and does not look back. No handshake, no retransmission, no ordering. If a packet is lost, it’s gone. This sounds bad, but for some applications it’s the right tradeoff. A live video call cannot wait for a retransmitted frame from 300ms ago - it would arrive too late to display. DNS queries are tiny and idempotent - if the reply is lost, you just ask again. A fast-paced online game might send 60 position updates per second - skipping one is fine, waiting for it is not. UDP gives those applications the minimal overhead that TCP cannot.

Socket

A socket is a programming abstraction for a network connection. It is identified by a combination of IP address and port - both on the local side and the remote side. When your browser opens a connection to api.example.com:443, it creates a socket. The server on the other end creates its own socket. Data flows through this socket in both directions.

The word comes from electrical connectors: you plug something into a socket and current flows. The network socket is the endpoint you plug into to send and receive data. In code, sockets are file-like objects: you write to them, you read from them, you close them when done.
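The write, read, close rhythm looks like this in Python. To keep the sketch free of any real network, it uses socket.socketpair(), which returns two sockets that are already connected to each other; the variable names client and server are just labels for the two ends.

```python
import socket

# Two already-connected sockets, so no server address or port is needed here.
client, server = socket.socketpair()

client.sendall(b"ping")        # write to one end
received = server.recv(1024)   # read from the other end

server.sendall(b"pong")        # data flows in both directions
reply = client.recv(1024)

client.close()                 # close both ends when done
server.close()
```

A real client would instead call socket.create_connection with a host and port, but everything after the connection is established looks the same.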

The Middlemen

Proxy and Reverse Proxy

A proxy is an intermediary that stands between two parties and acts on behalf of one. The word means “authority to act on behalf of another” - from Latin via Old French.

The distinction that confuses people is forward proxy versus reverse proxy - and the names are confusing because neither is truly “forward” or “backward.”

A forward proxy sits on the client side. Clients are configured to send their requests to the proxy, which forwards them to the destination server. The client knows about the proxy. The server sees the proxy’s IP, not the client’s. Corporate networks use forward proxies for content filtering; Tor is a forward proxy for anonymity; some clients use them for caching.

A reverse proxy sits on the server side. Clients send requests to it thinking they are talking to the server. The reverse proxy receives those requests and forwards them to one of the actual backend servers. The client does not know the reverse proxy exists. The name “reverse” means it is the mirror image: instead of representing clients to servers (forward), it represents servers to clients (reverse).

Nginx, HAProxy, and Caddy are software commonly used as reverse proxies.

Load Balancer

A load balancer is a reverse proxy that distributes incoming requests across a pool of servers. The name is self-explanatory: it balances the load. As traffic grows, one server eventually saturates. The solution is more servers - but then you need something to decide which request goes to which server. That is what a load balancer does.

Load balancers implement distribution strategies: round-robin (each server in turn), least-connections (send to whichever server has the fewest active requests), IP hashing (same client always goes to same server - useful when sessions are stored server-side). They also perform health checking: if a server stops responding, the load balancer stops sending traffic to it and routes around it.
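The three strategies are each only a few lines when stripped to their core. This is a minimal sketch with made-up server names; it leaves out health checking, weights, and concurrency, and it uses CRC32 as the hash only because it is stable across runs.

```python
import itertools
import zlib

SERVERS = ["app-1", "app-2", "app-3"]

_rotation = itertools.cycle(SERVERS)


def round_robin():
    """Each server in turn."""
    return next(_rotation)


def least_connections(active):
    """Pick the server with the fewest active requests (active: server -> count)."""
    return min(active, key=active.get)


def ip_hash(client_ip):
    """The same client IP always maps to the same server."""
    return SERVERS[zlib.crc32(client_ip.encode("ascii")) % len(SERVERS)]
```

One design consequence worth noticing: with ip_hash, changing the number of servers changes the modulus, so most clients get remapped to a different server.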

The difference between a load balancer and a reverse proxy is mostly conceptual. A load balancer is a reverse proxy whose primary purpose is distribution across a pool. In practice, both do TLS termination, health checking, and request routing. Software like Nginx can act as either or both simultaneously.

API Gateway

An API gateway is a reverse proxy that adds business logic. Where a load balancer’s job is distribution, an API gateway’s job is also authentication, authorization, rate limiting, request transformation, logging, and routing to different backend services based on request content.

The term emerged with microservices architectures. When an application is split into dozens of services, every service would have to independently implement auth and rate limiting. An API gateway centralizes those concerns. Clients talk to one endpoint; the gateway decides which service handles each request and enforces cross-cutting policies.
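A toy gateway shows how those concerns stack up in front of the services. Everything here is hypothetical: the Gateway class, the API keys, the service names, and the crude counter that never resets (a real rate limiter would track time windows or tokens).

```python
class Gateway:
    def __init__(self, routes, api_keys, limit):
        self.routes = routes       # path prefix -> backend service name
        self.api_keys = api_keys   # set of keys allowed to call the API
        self.limit = limit         # max requests per key (no window reset in this sketch)
        self.counts = {}           # api key -> requests seen so far

    def handle(self, path, api_key):
        """Return (status code, backend service or None)."""
        if api_key not in self.api_keys:
            return 401, None                        # authentication
        self.counts[api_key] = self.counts.get(api_key, 0) + 1
        if self.counts[api_key] > self.limit:
            return 429, None                        # rate limiting
        for prefix, service in self.routes.items():
            if path.startswith(prefix):
                return 200, service                 # routing
        return 404, None


gateway = Gateway(
    routes={"/users": "user-service", "/orders": "order-service"},
    api_keys={"key-1"},
    limit=2,
)
```

The backend services never see the rejected requests, which is the point: they no longer need their own copies of the auth and rate-limit logic.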

AWS API Gateway, Kong, and Nginx (with plugins) are common examples. The distinction from a plain reverse proxy is the presence of that business logic layer.

CDN

A CDN (Content Delivery Network) is a globally distributed set of servers (edge nodes) that cache content close to users. The insight: physics limits network speed. A request from Tokyo to a server in Virginia takes roughly 150ms just for the signal to travel there and back through fiber. If a Tokyo user requests a static image or JavaScript file that hasn’t changed in weeks, there is no reason to make that round trip. A CDN keeps a copy in Tokyo.

CDNs were invented in the late 1990s, primarily by Akamai, as the web began serving large media files and the geography of the internet became a bottleneck. Today they serve static assets, video, software downloads, and increasingly dynamic content through edge computing.

The terms “edge” and “origin” come from CDN vocabulary: the origin is the authoritative server where content is created; the edge is the CDN node close to the user. A cache hit at the edge is fast. A cache miss requires fetching from origin and adds the full round-trip latency.

What Actually Happens When You Visit a URL

When you type https://api.example.com/users/42 into a browser, a chain of about eight distinct steps happens before you see a response. Every term used in these steps is defined above - this is where they all connect.

1. Parse the URL. The browser splits the URL into scheme (https), host (api.example.com), path (/users/42), and query string (none here). The scheme tells it to use HTTP over TLS on port 443.

2. DNS resolution. The browser checks its local DNS cache. If it has a recent record for api.example.com, it uses that. Otherwise it asks the operating system’s configured DNS resolver (usually your router, which in turn asks your ISP’s DNS server, which eventually queries the authoritative name server for example.com). The result is an IP address: say 93.184.216.34. With a cached record this takes a millisecond or so; an uncached lookup typically takes 20-100ms.

3. TCP handshake. The browser opens a TCP connection to 93.184.216.34:443. TCP performs a three-way handshake (SYN, SYN-ACK, ACK) to establish the connection. This costs one full round-trip time - if the server is 80ms away, you’ve spent 80ms before sending a single byte of application data.

4. TLS handshake. Over the newly established TCP connection, TLS negotiates an encrypted channel (described in detail in the next section). This costs one more round trip with TLS 1.3, or two with TLS 1.2. By the end, both sides share a symmetric session key and all subsequent data is encrypted.

5. HTTP request. The browser sends an HTTP GET request over the encrypted connection. The request includes the path (/users/42), the host header (Host: api.example.com), and other headers (accepted content types, cookies, authorization if any).

6. Server routing. The request arrives at whatever is listening on port 443 - probably a load balancer or reverse proxy, not the application server directly. The load balancer decrypts the TLS (TLS termination), reads the path, and forwards the plaintext HTTP request to one of the application server instances in the pool.

7. Application logic. The application server matches the path /users/42 to its route handler for GET /users/:id. The handler queries a database or cache for user 42, assembles the response body (a JSON object), and sends back an HTTP 200 response with that body.

8. Response travels back. The response is encrypted, sent back over TCP, arrives at the browser. The browser decrypts it, parses the JSON, and renders or processes it.

The total time is roughly: DNS latency + TCP RTT + TLS RTT(s) + application processing time + return trip. For a server in the same region, this is typically 50-200ms. For a cross-continent request, easily 500ms+, which is why CDNs and regional deployments exist.
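The sum above can be turned into a small calculator. The input numbers are invented for illustration, not measurements, and the model ignores connection reuse, packet loss, and transfer time for large bodies.

```python
def first_response_ms(dns_ms, rtt_ms, tls_round_trips, processing_ms):
    """Rough time from typing the URL to receiving the response on a fresh connection."""
    tcp_handshake = rtt_ms                     # SYN, SYN-ACK, ACK: one round trip
    tls_handshake = tls_round_trips * rtt_ms   # one or two round trips
    request_and_response = rtt_ms              # request out, response back
    return dns_ms + tcp_handshake + tls_handshake + request_and_response + processing_ms


same_region = first_response_ms(dns_ms=5, rtt_ms=20, tls_round_trips=1, processing_ms=30)
cross_continent = first_response_ms(dns_ms=50, rtt_ms=150, tls_round_trips=2, processing_ms=30)
```

With these inputs the same-region request comes to 95ms and the cross-continent one to 680ms. The round-trip time is multiplied three or four times before any application work happens, which is why moving the server closer helps more than speeding up the handler.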

TLS: How the Encryption Layer Actually Works

TLS (Transport Layer Security) is the protocol that makes HTTPS secure. Understanding it requires understanding why plaintext HTTP was dangerous.

HTTP sends everything as readable text. Your ISP, your office network admin, the operator of the coffee shop Wi-Fi, and every router your packet passes through on the internet can read every byte. In the early web this was considered acceptable - the web was academic, data wasn’t sensitive. Once banking, email, and healthcare moved online, it was not acceptable at all.

TLS solves three problems simultaneously:

Confidentiality: no one can read the data in transit
Integrity: no one can modify the data without detection
Authentication: you know you are talking to the real server, not an impersonator

The mechanism is a handshake that happens before any application data flows:

Step 1 - ClientHello. The browser connects and says: “I want to talk TLS. Here are the cipher suites I support” - a list of encryption algorithms ordered by preference.

Step 2 - ServerHello + Certificate. The server picks a cipher suite and sends its certificate. A certificate is a file containing the server’s public key and a signature from a Certificate Authority (CA) - a trusted third party like DigiCert, Let’s Encrypt, or Comodo. The signature mathematically binds the public key to the domain name.

Step 3 - Verification. The browser checks three things: Is this certificate signed by a CA I trust? (Browsers ship with a pre-installed list of ~150 trusted root CAs.) Is the certificate issued for this domain? Is it still within its validity period? If any check fails, the browser shows the “connection not secure” warning. If all pass, the browser trusts that the public key it received really belongs to the server it intended to reach.

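Step 4 - Key exchange. The browser and server agree on a shared session key without that key ever being transmitted over the network. Modern TLS does this with an ephemeral Diffie-Hellman exchange, and the server signs its part of the handshake with the private key matching its certificate to prove its identity. (Older versions also allowed the browser to encrypt a secret with the server’s public key and send it across.) An eavesdropper who captured all the handshake traffic cannot derive the session key. Because the Diffie-Hellman values are discarded after each session, even someone who later steals the server’s private key cannot decrypt recorded traffic - this is the “forward secrecy” property of modern TLS.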

Step 5 - Encrypted communication. From this point on, all data is encrypted with the session key using a fast symmetric cipher (AES-256-GCM is common). Symmetric encryption is used rather than the public-key cryptography from step 4 because it is orders of magnitude faster.
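The key agreement in step 4 can be illustrated with a toy Diffie-Hellman exchange. The numbers are deliberately tiny and the base is arbitrary, so this is insecure and for intuition only; real TLS uses elliptic curves or groups of 2048 bits and more, and derives the session key from the shared secret through a key derivation function.

```python
import secrets

# Public parameters: an eavesdropper is allowed to know both.
P = 0xFFFFFFFB   # 2**32 - 5, a small prime (far too small for real use)
G = 5            # arbitrary base chosen for this sketch


def keypair():
    private = secrets.randbelow(P - 3) + 2   # secret exponent, never sent
    public = pow(G, private, P)              # safe to send over the network
    return private, public


client_private, client_public = keypair()
server_private, server_public = keypair()

# Each side combines its own private value with the other side's public value.
client_secret = pow(server_public, client_private, P)
server_secret = pow(client_public, server_private, P)
```

Both sides arrive at the same number even though only the public values crossed the wire. Recovering it from those public values alone is what becomes infeasible at real key sizes.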

What TLS does not hide is the fact that a connection was made. Your ISP can see that you connected to 93.184.216.34 - they just cannot see what you sent or received. DNS queries are also traditionally unencrypted (though DNS-over-HTTPS is changing this). Metadata - who you talk to, when, and how much data you exchange - leaks even over HTTPS.

One practical term: TLS termination is the act of decrypting an HTTPS connection. Load balancers and reverse proxies typically terminate TLS - they receive encrypted traffic from the outside, decrypt it, and pass plaintext HTTP to the backend application servers on the internal network. This centralizes certificate management and lets application servers be simpler.

Data at Rest: Storage Vocabulary

Database

A database is a system designed to store, organize, and retrieve data reliably. Why not just use files? Files work until you have multiple processes reading and writing simultaneously, or until you need to find all entries where city = 'Berlin', or until a crash happens mid-write and leaves the file in a corrupted half-written state. Databases solve these problems systematically: concurrent access, efficient queries, and durability in the face of failures.

Relational databases (PostgreSQL, MySQL, SQLite) organize data into tables with rows and columns. The word “relational” comes from relational algebra, a mathematical framework for describing relationships between sets. Each table is a relation. SQL (Structured Query Language) is the language for querying them. SQL is not the same as “relational” - SQL is the language, relational is the model. You can have a relational database without SQL (though in practice you usually don’t).

Non-relational databases (MongoDB, Redis, Cassandra, DynamoDB) organize data differently - as documents, key-value pairs, wide columns, or graphs - and are often tuned for specific access patterns that relational databases serve poorly. “NoSQL” is a marketing term that means roughly “not primarily relational.” It tells you what these systems are not more than what they are.

Schema

A schema is the definition of structure: what fields exist, what types they have, what constraints they obey. In a relational database, the schema defines your tables and columns before you insert any data. If you try to insert a string into an integer column, a strictly typed database such as PostgreSQL refuses. This strictness is a feature: it catches bugs at write time instead of read time.

Schema-less (or schema-flexible) databases like MongoDB let you insert any document shape without pre-declaring fields. This feels liberating until the day you realize half your documents have username and half have user_name because two developers made different choices in different sprints and nothing stopped them.

Index

An index is an auxiliary data structure that speeds up reads at the cost of writes and storage. Without an index, finding all users with last_name = 'Chen' requires scanning every row in the table. With an index on last_name, the database can jump directly to the matching rows - like a book’s index letting you jump to “Chen” instead of reading every page.

The tradeoff: every write must also update the index. Inserting one row into a table with ten indexes means ten index updates on top of the data write itself. Heavy indexing speeds reads, slows writes, and uses disk space. The art is indexing the columns you actually query.
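The read-versus-write tradeoff shows up even in a toy version. In this sketch a list stands in for the table and a dictionary for the index; real databases typically use B-trees rather than hash maps, and the rows and function names are invented.

```python
from collections import defaultdict

users = [
    {"id": 1, "last_name": "Chen"},
    {"id": 2, "last_name": "Okafor"},
    {"id": 3, "last_name": "Chen"},
]


def scan(last_name):
    """No index: examine every row in the table."""
    return [row["id"] for row in users if row["last_name"] == last_name]


# Build the index once: last_name -> positions of matching rows.
index = defaultdict(list)
for position, row in enumerate(users):
    index[row["last_name"]].append(position)


def lookup(last_name):
    """With the index: jump straight to the matching rows."""
    return [users[position]["id"] for position in index.get(last_name, [])]


def insert(row):
    """Every write pays twice: once for the table, once for the index."""
    users.append(row)
    index[row["last_name"]].append(len(users) - 1)
```

scan and lookup return the same rows; the difference is that scan touches every row while lookup touches only the matches, and insert is where the index collects its fee.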

Transaction and ACID

A transaction is a sequence of operations that should be treated as a single atomic unit: either all succeed, or none of them take effect. The canonical example is a bank transfer: debit account A, credit account B. If the system crashes after the debit but before the credit, the money has vanished into a black hole. A transaction wraps both operations so they either both succeed or both are rolled back.

ACID is the set of properties a database transaction should guarantee:

Atomicity: all or nothing - no partial transactions
Consistency: the database moves from one valid state to another, never violating defined rules
Isolation: concurrent transactions do not interfere with each other - each sees the database as if it were running alone
Durability: once a transaction commits, it survives crashes - it is persisted to disk, not just memory

These properties are not free. Enforcing full ACID, especially isolation, requires locking or complex concurrency control. Some databases relax some properties to gain performance. Understanding which properties your system actually needs is part of system design.
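Atomicity is easy to see with the bank transfer and an in-memory SQLite database. This is a minimal sketch: the table layout, the transfer function, and the simulated crash are made up for illustration, and it relies on the standard sqlite3 behavior that using a connection as a context manager commits on success and rolls back if an exception escapes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 0)")
conn.commit()


def transfer(amount, crash_midway=False):
    """Debit A and credit B as one transaction; return True if it committed."""
    try:
        with conn:  # commit if the block finishes, roll back if it raises
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'A'", (amount,))
            if crash_midway:
                raise RuntimeError("simulated crash after the debit")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'B'", (amount,))
        return True
    except RuntimeError:
        return False


def balances():
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling transfer(30, crash_midway=True) leaves both balances unchanged because the debit is rolled back, while a plain transfer(30) moves the money in one step.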

Cache

A cache is storage that is faster and smaller than the authoritative source. It exists because there is always a speed hierarchy: a CPU register is faster than L1 cache, which is faster than L2, which is faster than RAM, which is faster than SSD, which is faster than a hard disk, which is faster than a network call to a database, which is faster than a call to an external API. Adjacent layers in this hierarchy differ by anywhere from a few times to several orders of magnitude.

Caching is the practice of keeping a copy of frequently-accessed data at a faster layer so you do not pay the cost of reaching the slower authoritative source repeatedly. Redis, Memcached, and CDN edge caches are all caches at different layers. The word comes from the French “cacher” meaning to hide: the cache hides the slow storage behind a fast interface.

The fundamental tension in caching: the cache and the source can disagree. How long do you serve a cached response before checking for updates? Too short and the cache adds no value. Too long and users see stale data. Cache invalidation - knowing when to discard a cached value - is famously described as one of the two hard problems in computer science.
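The staleness tradeoff fits in a short sketch. The TtlCache class, the 60-second TTL, and the price example are all invented; the clock is passed in explicitly so the stale window is easy to see.

```python
class TtlCache:
    def __init__(self, ttl_seconds, load):
        self.ttl = ttl_seconds
        self.load = load     # function that fetches from the slow authoritative source
        self.entries = {}    # key -> (value, expires_at)

    def get(self, key, now):
        entry = self.entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]                  # hit: fast, but possibly stale
        value = self.load(key)               # miss or expired: go to the source
        self.entries[key] = (value, now + self.ttl)
        return value

    def invalidate(self, key):
        """Explicit invalidation: discard the cached value so the next read reloads it."""
        self.entries.pop(key, None)


source = {"price": 10}
cache = TtlCache(ttl_seconds=60, load=lambda key: source[key])

first = cache.get("price", now=0)     # loaded from the source
source["price"] = 12                  # the source changes behind the cache's back
stale = cache.get("price", now=30)    # still within the TTL: old value served
fresh = cache.get("price", now=61)    # TTL expired: reloaded from the source
```

Here first and stale are both 10 and fresh is 12. Calling invalidate right after the write would close the stale window, but only if the writer knows the cache exists, which is the hard part.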

Scale and Reliability

Latency, Throughput, Bandwidth

These three are distinct and constantly conflated.

Latency is the time for a single operation to complete - from initiating a request to receiving a response. It is measured per-request and expressed in milliseconds or microseconds. A p99 latency of 200ms means 99% of requests complete within 200ms; the remaining 1% take longer.
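Percentiles are worth computing by hand once. This sketch uses the nearest-rank definition (monitoring systems often interpolate or approximate instead) on 100 made-up latencies.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


latencies_ms = [20] * 98 + [180, 900]   # 100 invented request latencies

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
```

The median is 20ms and the mean about 30ms, both of which look healthy, while the p99 of 180ms reveals the slow tail. That gap is why latency targets are usually stated as percentiles rather than averages.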

Throughput is how many operations a system completes per unit of time - requests per second, transactions per second, messages per second. High throughput does not imply low latency and vice versa. A batch processing system might have high throughput (millions of records per hour) but high latency per record.

Bandwidth is the theoretical maximum rate at which data can be transferred across a channel, typically measured in megabits or gigabits per second. Bandwidth is a physical property of the network link. You cannot exceed it, but you often do not reach it either due to overhead.

The analogy: on a highway, the time one car takes to drive from one end to the other is latency, the number of cars that actually pass a point each hour is throughput, and the number of lanes is bandwidth, the ceiling on how many cars can pass at once. Widening the highway (adding bandwidth) lets more cars travel simultaneously. Increasing the speed limit (reducing latency) gets each car there faster. These dimensions are largely independent.
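To make the difference between a tail percentile and an average concrete, here is a small sketch. It assumes the nearest-rank definition of a percentile (monitoring tools may interpolate differently), and the sample latencies are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def throughput(completed_requests, window_seconds):
    """Operations completed per second over a measurement window."""
    return completed_requests / window_seconds


# Made-up data: 98 fast requests and two slow ones, all finished within 2 seconds.
latencies_ms = [20] * 98 + [250, 900]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
rps = throughput(len(latencies_ms), 2)
```

The median here is 20 ms while the p99 is 250 ms, which is why latency is usually reported as percentiles rather than as an average. The throughput figure says nothing about either number.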

Horizontal and Vertical Scaling

Vertical scaling means making one machine bigger - more CPU cores, more RAM, faster disk. It is simple: no application changes required, no coordination between machines. It also has a hard ceiling: there is a maximum size machine you can buy, it is expensive, and one machine is always a single point of failure.

Horizontal scaling means adding more machines and distributing the load across them. This can theoretically continue indefinitely and avoids single points of failure. The cost is complexity: the application must work correctly across multiple instances, state must be handled carefully, and you need a load balancer. Stateless applications - those that do not store per-user state in memory - scale horizontally almost trivially. Stateful ones require more care.

Replication and Sharding

Replication is copying data to multiple machines. The copies are called replicas. Replication serves two purposes: availability (if one replica fails, others serve reads) and performance (reads can be spread across replicas).

Sharding (also called partitioning) is splitting data horizontally: different rows go to different machines based on a shard key. A user database might shard by user_id: users 0-999999 go to shard 1, users 1000000-1999999 go to shard 2. Each shard holds a subset of the data. Sharding is necessary when a dataset is too large for one machine.

Replication and sharding are often combined: each shard has multiple replicas.
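A rough sketch of how a request might be routed when the two are combined. It uses the range scheme from the example above; the node names and the rule of sending writes to the first node in each list are hypothetical, chosen only to keep the example small.

```python
USERS_PER_SHARD = 1_000_000

# Hypothetical topology: each shard has a primary followed by read replicas.
SHARD_NODES = {
    1: ["shard1-primary", "shard1-replica-a", "shard1-replica-b"],
    2: ["shard2-primary", "shard2-replica-a", "shard2-replica-b"],
}


def shard_for(user_id):
    """Range sharding: users 0-999999 -> shard 1, 1000000-1999999 -> shard 2."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    return user_id // USERS_PER_SHARD + 1


def route(user_id, write):
    """Pick the node that should serve this request."""
    nodes = SHARD_NODES[shard_for(user_id)]
    if write:
        return nodes[0]                    # writes go to the shard's primary
    return nodes[user_id % len(nodes)]     # reads are spread across all copies
```

Many systems hash the shard key instead of using ranges so that sequential IDs do not pile onto one shard; the routing shape stays the same.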

The word “shard” in the database context is often traced to online gaming in the late 1990s. Ultima Online, launched in 1997, used the term “shards” for parallel game world instances (a lore-justified name for separate game servers). Database engineers borrowed the word - a shard of a database is a fragment of the whole, just as a shard of glass is a fragment.

Availability, Reliability, Durability

Three words that are related but carry distinct, precise meanings.

Availability is the fraction of time a system is operational and serving requests. “Four nines” availability (99.99%) means no more than ~52 minutes of downtime per year. Availability is about uptime.
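The arithmetic behind the "nines" is simple enough to check directly. A small sketch, assuming a 365-day year; the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability_percent):
    """Maximum yearly downtime permitted by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten: about 525.6 minutes per year at 99.9%, about 52.6 at 99.99%, and about 5.3 at 99.999%.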

Reliability is whether the system performs its function correctly and consistently when it is up. A system can be available (responding to requests) but unreliable (returning wrong results). Reliability is about correctness.

Durability applies specifically to data: will the data survive failures? A durable storage system ensures that once you commit data, it persists even if servers crash, disks fail, or power is cut. A system that acknowledges writes before flushing to disk may be fast but not durable - a crash can lose the most recent acknowledged writes.

Fault tolerance is the ability to continue operating correctly despite component failures. A fault-tolerant system may have replicas, failover mechanisms, and circuit breakers to degrade gracefully rather than collapse. Redundancy is the mechanism: having extra capacity that takes over when something fails. Failover is the act of switching to a redundant component when the primary fails.

Control: Quotas, Rate Limits, and Throttling

Why These Exist

Shared systems need protection. A single client sending a million requests per second can deny service to every other client. A runaway process can consume all disk quota. These concepts emerged alongside the growth of multi-tenant systems - cloud services, public APIs, shared infrastructure - where one customer’s behavior can affect another’s experience.

Quota, Rate Limit, Throttling, Backpressure

Quota is a limit on total consumption over a period. “You may make 10,000 API calls per day.” After the quota is exhausted, further requests are rejected until the period resets. Quotas prevent runaway long-term consumption and are used for billing and fairness. The word comes from the Medieval Latin “quota pars” - “how great a part.”

Rate limiting is a limit on the rate of requests over a short window. “You may make 100 requests per minute.” Rate limiting prevents burst abuse. You might have a daily quota of 10,000 and a per-minute rate limit of 100 - both apply simultaneously.

Throttling is intentional slowing, not hard rejection. Rather than refusing a request that exceeds a rate limit, a throttled system slows it down - adds artificial delay, queues it, or processes it at a reduced priority. Throttling is gentler than rejection: the client eventually gets a response, just more slowly.

Backpressure is the signal that propagates upstream when a downstream system is overwhelmed. If a queue is full, it pushes back on the producer: stop sending. If a database connection pool is exhausted, it signals the service: wait before making more queries. Backpressure is how systems communicate “I cannot keep up” without dropping data. The word comes from fluid dynamics: pressure that resists flow in a pipe. In engineering, it is the mechanism by which a saturated system avoids catastrophic failure by pushing the congestion signal to the source.
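One common way to implement a rate limit is a token bucket; the text above does not prescribe an algorithm, so treat this as one possible sketch. The class and method names are illustrative. The same bucket can either reject (rate limiting) or report how long to delay (throttling).

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock                  # injectable so time can be faked
        self.tokens = float(capacity)
        self.updated_at = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now

    def try_acquire(self):
        """Rate limiting: reject immediately when no token is available."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def wait_time(self):
        """Throttling: seconds to delay a request instead of rejecting it."""
        self._refill()
        if self.tokens >= 1:
            return 0.0
        return (1 - self.tokens) / self.rate
```

A caller that gets False from try_acquire would typically answer with an error such as HTTP 429, while a throttling caller would sleep for wait_time() and then proceed. A daily quota would be a separate counter checked alongside the bucket.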

Observability: Logs, Metrics, Traces

These are called the three pillars of observability, and they answer three different questions about a system’s behavior.

Log

A log is a timestamped record of discrete events. “User 42 logged in at 14:32:17.” “Request to /checkout failed with error: payment gateway timeout.” Logs are arbitrary text (or structured JSON). They are high-fidelity but high-volume - a busy system generates millions of log lines per minute.

Logs answer the question: what happened? They are best for debugging specific incidents. You look at the logs around the time something went wrong and reconstruct the sequence of events.
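A structured log line can be produced with Python's standard logging module. This is a minimal sketch: the formatter class, the logger name, and the convention of passing extra data under a "fields" key are all assumptions made for the example, not a standard.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # "fields" is a convention invented for this sketch.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout-example")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.error(
    "payment gateway timeout",
    extra={"fields": {"path": "/checkout", "user_id": 42}},
)
```

Structured lines cost a little readability but let a log system filter on fields such as path or user_id instead of matching free text.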

The word “log” comes from ship’s logs - the record a captain kept of the ship’s speed (measured by throwing a wooden log overboard on a rope with knots at regular intervals, then counting knots per unit time - “knots” for ship speed derives from exactly this). A ship’s log was the authoritative record of events over time. Software logs are the same idea: the authoritative timestamped record.

Metric

A metric is a number tracked over time. CPU utilization, requests per second, error rate, p99 latency, queue depth, active connections. Metrics are aggregates - they summarize many events into a single number at each point in time.

Metrics answer the question: how is the system behaving overall? They are best for detecting anomalies (“error rate just spiked from 0.1% to 12%”), triggering alerts, and understanding trends over time. A metric alone cannot tell you why something went wrong - for that you go to logs or traces.

Metrics are cheap to store compared to logs because they are just numbers. You can retain years of metrics at low cost.

Trace

A trace (specifically, a distributed trace) is the record of a single request’s journey through a distributed system. When a user request arrives at the API gateway, goes to the auth service, to the user service, to the database, and back - a trace captures all those hops, how long each took, and any errors that occurred at each step.

Traces answer the question: where did time go for this specific request? They are best for diagnosing latency problems in systems where a request touches many services. Without tracing, you know a request took 2 seconds but have no idea if the time was spent in the auth service, the database, or network transit.
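The data behind a trace can be pictured as a list of timed spans sharing one trace ID. The sketch below is deliberately simplified: the Span fields and the timings are made up, and real tracing systems also record parent/child relationships between spans.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str   # shared by every hop of the same request
    name: str       # which service or component did the work
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def slowest_span(spans):
    """Answer 'where did the time go?' for one request."""
    return max(spans, key=lambda span: span.duration_ms)


# Hypothetical timings for one request that took 2 seconds end to end.
spans = [
    Span("req-1", "api-gateway", 0, 15),
    Span("req-1", "auth-service", 15, 60),
    Span("req-1", "user-service", 60, 140),
    Span("req-1", "database", 140, 1950),
    Span("req-1", "response-serialization", 1950, 2000),
]
culprit = slowest_span(spans)
```

Here the database span accounts for 1,810 of the 2,000 milliseconds, which is exactly the information a single end-to-end latency number hides.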

The three pillars are complementary. Metrics tell you something is wrong. Logs tell you what happened. Traces tell you where in the system it happened.

Monitoring is the practice of watching metrics and logs and alerting on anomalies. Observability is a broader property of a system: the degree to which you can infer the system’s internal state from external signals. A system is highly observable if, when something unexpected happens, you can diagnose it from outside without needing to add new instrumentation. An alert is an automated notification triggered by a metric crossing a threshold. A dashboard is a visual display of metrics, usually arranged to give an at-a-glance view of system health.

Asynchronous Patterns

Synchronous vs. Asynchronous

A synchronous operation blocks: the caller waits until the operation completes before proceeding. An asynchronous operation does not block: the caller continues and is notified when the result is ready. Most HTTP requests are synchronous from the client’s perspective - you send a request and wait for the response. Most queue-based systems are asynchronous - you put a job in the queue and move on; the result arrives later.

Asynchronous patterns are used when you want to decouple the rate at which work is produced from the rate at which it is consumed, or when work can be done in the background without the user needing to wait.
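A small sketch of non-blocking calls using Python's asyncio. The fetch function and its delays are stand-ins for real network calls; the point is only the order in which things start and finish.

```python
import asyncio


async def fetch(name, delay_seconds, events):
    events.append(f"start {name}")
    await asyncio.sleep(delay_seconds)  # stands in for waiting on the network
    events.append(f"done {name}")
    return name.upper()


async def main():
    events = []
    # Both calls are started before either finishes; neither blocks the other.
    results = await asyncio.gather(
        fetch("a", 0.05, events),
        fetch("b", 0.01, events),
    )
    return results, events


results, events = asyncio.run(main())
```

Call "b" is started second but finishes first, because the caller did not sit waiting on "a". A synchronous version would run the two calls back to back and take the sum of both delays.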

Queue

A queue is a buffer between producers (who create work) and consumers (who process it). Work items enter at one end and leave at the other in order. If the consumer is slow, work accumulates in the queue rather than being dropped. If the producer is slow, the consumer waits.

Queues were invented in computing to solve the classic mismatch problem: a fast CPU produces print jobs faster than a slow printer can print them. The print spooler - invented in the 1960s - is a queue. Modern message queues (RabbitMQ, SQS, Kafka) apply the same concept to distributed systems at enormous scale.
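The buffering behavior, and what happens when the buffer fills, can be sketched with the standard library's queue module. The capacity of two and the job names are arbitrary; returning False from offer is this example's way of signalling backpressure to the producer.

```python
import queue


def offer(buffer, job):
    """Try to enqueue without blocking; False tells the producer to slow down."""
    try:
        buffer.put_nowait(job)
        return True
    except queue.Full:
        return False


jobs = queue.Queue(maxsize=2)  # bounded buffer between producer and consumer

# The producer is faster than the consumer: the third job does not fit.
accepted = [offer(jobs, job) for job in ("job-1", "job-2", "job-3")]

# The consumer takes work in FIFO order, which frees a slot.
first_out = jobs.get_nowait()
retry_accepted = offer(jobs, "job-3")
```

An unbounded queue would accept everything and hide the mismatch until memory ran out; the bound is what turns a slow consumer into a visible signal.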

Pub/Sub

Pub/Sub (Publish/Subscribe) decouples producers from consumers more completely than a queue. In a queue, a specific consumer takes each item. In pub/sub, producers publish events to a topic. Any number of subscribers can listen to that topic and receive a copy of every event. Producers do not know who the subscribers are. Subscribers do not know who the producers are.

This decoupling is powerful for building systems that can evolve independently. When an order is placed, the order service publishes an “order-placed” event. The inventory service, the email service, and the analytics service all subscribe to that topic and react independently. Adding a new subscriber requires no change to the order service.
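The order-placed example can be sketched with an in-process broker. Real pub/sub systems deliver over the network and persist events; the Broker class here is a toy written only to show the decoupling.

```python
class Broker:
    """Toy in-memory pub/sub: every subscriber to a topic gets every event."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of handler functions

    def subscribe(self, topic, handler):
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, event):
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)      # each subscriber receives its own delivery
        return len(handlers)


broker = Broker()
inventory_seen, email_seen = [], []
broker.subscribe("order-placed", inventory_seen.append)
broker.subscribe("order-placed", email_seen.append)

# The publisher names only the topic, never the subscribers.
delivered = broker.publish("order-placed", {"order_id": 1001})
```

Adding an analytics subscriber would be one more subscribe call; the publishing code would not change.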

Pub/Sub architectures can have exactly-once, at-least-once, or at-most-once delivery semantics, each with different guarantees and tradeoffs around durability and performance.

Stream

A stream is a continuous, ordered sequence of events that consumers can read and re-read. Unlike a queue where a consumed message is typically deleted, a stream retains events for a configurable period. Different consumers can read the same stream at different positions - one consumer might be reading live events, another might be replaying events from yesterday to reprocess them.

Kafka popularized this model. The word “stream” emphasizes the continuous, unbounded nature of the data - it flows like a river; consumers dip in at whatever position they need.
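The difference from a queue comes down to who tracks the read position. In this simplified sketch (class names are illustrative, and retention limits are omitted), the stream keeps every event and each consumer holds its own offset.

```python
class Stream:
    """Append-only log: reading an event does not remove it."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset assigned to the new event

    def read(self, offset, limit=100):
        return self._events[offset:offset + limit]


class Consumer:
    """Each consumer remembers its own position in the stream."""

    def __init__(self, stream, offset=0):
        self.stream = stream
        self.offset = offset

    def poll(self, limit=100):
        batch = self.stream.read(self.offset, limit)
        self.offset += len(batch)
        return batch


stream = Stream()
for event in ("signup", "payment", "upload"):
    stream.append(event)

live = Consumer(stream, offset=2)    # only wants the newest events
replay = Consumer(stream, offset=0)  # reprocesses history from the start
live_batch = live.poll()
replay_batch = replay.poll()
```

Both consumers read the same stored events without interfering with each other, which is what makes replay possible.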

Event

An event is a record that something happened, in the past tense: “user-signed-up,” “payment-processed,” “file-uploaded.” Events are immutable records of fact. An event-driven architecture is one where services communicate primarily by emitting and reacting to events rather than calling each other directly.

Webhook

A webhook is the inverse of polling. In polling, your service periodically asks “did anything change?” - whether or not anything changed. A webhook inverts this: you register a URL with the external system, and when something happens, it calls your URL. Instead of you asking, it tells you.

Webhooks are how GitHub notifies CI systems of new commits, how Stripe notifies your backend of completed payments, how Slack sends messages to bots. The name is a blend of “web” (HTTP callback) and “hook” (a programming pattern for inserting custom code into a system’s event flow).

Identity and State

Stateless and Stateful

A stateless service does not remember anything about previous requests. Every request contains everything the service needs to handle it. The service can scale horizontally trivially - any instance can handle any request. HTTP, by design, is stateless: each request is independent.

A stateful service maintains state between requests. A gaming server that tracks your character’s position is stateful. A service with a server-side session is stateful. Stateful services are harder to scale horizontally because you must route a user back to the same instance (sticky sessions) or share state across instances (expensive).

When a user logs in, the server needs to remember who they are for subsequent requests (since HTTP is stateless). The three major ways to do this reflect the evolution of the web.

A session stores state on the server. The server creates a session record (user ID, permissions, etc.) and gives the client a session ID - a random string. On every subsequent request, the client sends the session ID; the server looks it up. Sessions are simple but require server-side storage that scales with user count.

A cookie is a small piece of data the browser stores and automatically sends with every request to the same domain. The session ID is often stored in a cookie. Cookies are also used for tracking, preferences, and shopping cart contents. They have attributes like HttpOnly (not accessible to JavaScript), Secure (HTTPS only), and SameSite (controls cross-site sending).

A token - and specifically a JWT (JSON Web Token) - is a self-contained credential. The server encodes the user’s identity and permissions into a cryptographically signed string and sends it to the client. On subsequent requests, the client sends the token. The server verifies the signature and reads the claims from the token itself, without any database lookup. Tokens are stateless: the server does not need to remember anything. This makes them well suited to distributed systems. The tradeoff: a token is hard to revoke before it expires - if a token is stolen, the server cannot invalidate it without reintroducing server-side state such as a revocation list, which is why tokens are usually given short expiry times.
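The idea of a self-contained, signed credential can be sketched with an HMAC. This is not the JWT format and should not be used in place of a real JWT library; the secret, the claim names, and both function names are assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"example-signing-key"  # hypothetical; a real key would be kept secret


def _b64(data):
    return base64.urlsafe_b64encode(data).decode()


def _sign(body):
    return _b64(hmac.new(SECRET, body.encode(), hashlib.sha256).digest())


def issue_token(claims):
    """Encode the claims and append a signature over them."""
    body = _b64(json.dumps(claims, sort_keys=True).encode())
    return f"{body}.{_sign(body)}"


def verify_token(token, now):
    """Return the claims if the signature is valid and the token is unexpired."""
    body, _, signature = token.partition(".")
    if not hmac.compare_digest(signature, _sign(body)):
        return None                      # tampered with, or signed by someone else
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] <= now:
        return None                      # expired
    return claims
```

Verification needs only the secret and the token itself, with no session lookup. The flip side is visible too: nothing in verify_token can tell that a still-valid token has been stolen.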

Authentication and Authorization

These two are confused constantly. They answer different questions.

Authentication is “who are you?” - proving identity. Logging in with a username and password is authentication. So is presenting a token. The result of authentication is knowing your identity.

Authorization is “what can you do?” - checking permissions. Once the system knows who you are, it checks whether you are allowed to perform the requested action. An authenticated user might not be authorized to access admin endpoints.

You can have authentication without authorization (a system that identifies users but treats them all equally) and you need authentication before authorization (you cannot check permissions without knowing who to check them for). The two are distinct layers, and conflating them leads to security bugs: a system that authenticates correctly but checks the wrong permissions grants access it should not.
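Keeping the two checks as separate steps makes the distinction hard to blur. A minimal sketch with hypothetical in-memory tables; the tokens, user names, and permission names are invented for the example.

```python
# Hypothetical lookup tables standing in for a user store and a permission store.
TOKEN_TO_USER = {"token-alice": "alice", "token-bob": "bob"}
PERMISSIONS = {"alice": {"read", "admin"}, "bob": {"read"}}


def authenticate(token):
    """Who are you? Returns a user name, or None if identity is not proven."""
    return TOKEN_TO_USER.get(token)


def authorize(user, action):
    """What can you do? Checks the action against the user's permissions."""
    return action in PERMISSIONS.get(user, set())


def handle(token, action):
    user = authenticate(token)
    if user is None:
        return 401  # identity unknown: authentication failed
    if not authorize(user, action):
        return 403  # identity known, but the action is not permitted
    return 200
```

HTTP reflects the same split: 401 means the caller has not proven who they are, and 403 means the server knows who they are and is refusing anyway.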

Infrastructure Vocabulary

Process and Thread

A process is an executing program. The operating system gives each process its own isolated memory space, its own file descriptors, its own CPU time. Processes are isolated from each other: a crash in one process does not corrupt another’s memory.

A thread is a unit of execution within a process. Multiple threads share the same process memory and can communicate directly by reading and writing shared variables. This shared memory is a double-edged sword: it makes communication fast but introduces race conditions - bugs where two threads modify shared data in an interleaved way that produces wrong results.
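The usual remedy for a race on shared data is a lock around the read-modify-write step. A minimal sketch with Python's threading module; the function name and the counts are arbitrary.

```python
import threading


def count_with_lock(workers, increments_each):
    """Several threads increment one shared counter, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def work():
        nonlocal total
        for _ in range(increments_each):
            with lock:       # only one thread at a time may read-modify-write
                total += 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


final_count = count_with_lock(workers=4, increments_each=10_000)
```

With the lock, the result is always workers times increments_each. Without it, `total += 1` is a separate read, add, and write, and two threads can interleave those steps so that updates are lost; whether that actually happens on a given run depends on the interpreter and on timing, which is what makes such bugs hard to reproduce.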

A coroutine (also called an async task, and closely related to green threads) is a unit of cooperative multitasking. Instead of the operating system preempting one thread and switching to another, coroutines explicitly yield control, typically while waiting on I/O. Python’s asyncio and Node.js’s event loop run coroutines this way on a single thread. Go’s goroutines are a close relative: lightweight threads that the Go runtime schedules across a small pool of OS threads. Coroutines are excellent for I/O-bound workloads (waiting for network responses, disk reads) where threads would spend most of their time blocked anyway.

Service, Monolith, Microservice

A service is a process that runs continuously, listening for requests. The word implies a persistent, long-running program that others can call - as opposed to a script that runs to completion.

A monolith is a single deployable unit that contains all of an application’s functionality. The user interface, business logic, and database access are all compiled and deployed together. Monoliths are simpler to develop initially, easier to debug (all the code is in one place), and avoid network overhead between components. They become painful at scale: the codebase is large and hard to understand, changes anywhere risk breaking everything, and scaling requires running the entire application.

A microservice is a service that does one thing and does it independently. It can be deployed, scaled, and updated without touching other services. Microservices architecture splits a monolith into many small services that communicate over the network. The benefits are isolation and independent deployability; the costs are network overhead, distributed debugging complexity, and the operational burden of running many services.

The “micro” in microservice is about scope and responsibility, not size. A microservice might be quite large in terms of code if its responsibility is complex.

Container and Virtual Machine

A virtual machine (VM) emulates an entire computer - CPU, memory, disk, and all. A hypervisor (VMware, KVM, VirtualBox) runs on a physical machine and creates multiple isolated VMs on top of it. Each VM runs its own full operating system. VMs provide strong isolation but are heavy: each one carries a full OS kernel, consumes gigabytes of memory, and can take from tens of seconds to minutes to boot.

A container is a lighter form of isolation. Containers share the host operating system’s kernel but are isolated in terms of process space, filesystem, and network. Docker popularized containers. Where a VM boots an entire OS, a container starts in seconds and uses megabytes. The isolation is weaker - containers share the host kernel, so a kernel exploit affects all containers. But for most applications, this tradeoff is fine.

The mental model: a VM is a house (complete with its own foundation and utilities). A container is an apartment (shared building infrastructure, isolated living space).

Cloud, Region, and Zone

Cloud computing is renting computation, storage, and networking from a provider (AWS, GCP, Azure) instead of owning hardware. You pay for what you use, can scale up or down in minutes, and outsource the physical infrastructure management. The word “cloud” is a metaphor - network diagrams historically drew the internet as a cloud (an amorphous shape representing “stuff out there”), and cloud computing is computation that happens “out there” rather than on your own machines.

A region is a geographically distinct area where a cloud provider has infrastructure - us-east-1 (Northern Virginia), eu-west-1 (Ireland), ap-southeast-1 (Singapore). Each region is largely independent. Data in one region does not automatically appear in another. Choosing a region is a latency, compliance, and cost decision.

An availability zone (AZ) is an isolated datacenter, or group of datacenters, within a region. A region typically has three or more AZs, connected by low-latency fiber but separated enough that a power outage or natural disaster affecting one would not affect another. Deploying across multiple AZs within a region is the standard way to achieve availability without the latency cost of multi-region deployments.

Cloud Vocabulary

Cloud computing introduced its own dense vocabulary. These terms come up in every infrastructure discussion, and many of them are genuinely confusing because they describe concepts that did not exist before cloud providers invented them.

IaaS, PaaS, SaaS, and Serverless

These four terms describe levels of abstraction in cloud services. The question they answer is: how much of the stack is the provider managing, and how much are you?

IaaS (Infrastructure as a Service) gives you raw virtual machines. You get a computer in the cloud. You choose the OS, install your software, configure the network, manage updates, and handle security patches. AWS EC2, GCP Compute Engine, and Azure VMs are IaaS. The provider manages the physical hardware; you manage everything above it. This is the most flexible option - you control the whole stack - but also the most work.

PaaS (Platform as a Service) gives you a runtime environment, not a machine. You bring your code; the platform runs it. Heroku, Google App Engine, and AWS Elastic Beanstalk are PaaS. You do not choose an OS or configure Nginx - you just push code and the platform handles deployment, scaling, and the server. The tradeoff: less flexibility, less operational burden.

SaaS (Software as a Service) gives you a finished application accessed over the web. Gmail, Salesforce, Slack, and GitHub are SaaS. You are a user of the software, not a deployer. Zero infrastructure concern.

The hierarchy is: with IaaS you manage the most, with SaaS you manage the least. PaaS sits in between.

Serverless (also called FaaS - Function as a Service) is a step beyond PaaS. Instead of running a persistent server that waits for requests, you write individual functions that execute in response to events and then stop. AWS Lambda, GCP Cloud Functions, and Azure Functions are the canonical examples. You upload a function; the cloud provider runs it on demand, scales it to zero when there’s no traffic, and charges you only for the milliseconds it runs.

The name “serverless” is misleading - there are absolutely servers involved. The difference is that you do not provision, manage, or think about them. A server runs for your function’s duration and disappears. Serverless is ideal for spiky, event-driven workloads. It is less ideal for long-running processes, latency-sensitive applications (functions have a “cold start” delay when they haven’t run recently), or anything with persistent connections.

Managed service is a broader term that applies across all these layers. A managed database like AWS RDS or Cloud SQL is a database where the provider handles installation, backups, failover, and patching - you just connect and query. A managed Kubernetes service (GKE, EKS, AKS) means the provider runs the Kubernetes control plane. “Managed” always means: you use the service, the provider operates it. You trade control for reduced operational burden.

VPC, Subnet, and Network Isolation

When you run things in the cloud, they need to communicate - with each other and with the internet. But not everything should be reachable from everywhere. The vocabulary for controlling this comes from traditional networking but with cloud-specific names.

A VPC (Virtual Private Cloud) is your own isolated private network inside the cloud provider’s infrastructure. Think of it as fencing off a section of the provider’s data center and saying “this is my network.” Resources inside a VPC can communicate with each other; nothing outside can reach them unless you explicitly allow it. You choose the IP address range for your VPC (e.g., 10.0.0.0/16, giving you 65,536 addresses), and you have full control over its network configuration.

A subnet is a subdivision of a VPC. You carve your VPC’s address space into smaller blocks, each called a subnet, and associate each with an availability zone. The key distinction: public subnets have a route to the internet (via an Internet Gateway) so instances in them can be reached from the outside world. Private subnets have no such route - instances in them can initiate outbound connections (via a NAT gateway) but cannot be reached from the internet directly. Your database should live in a private subnet. Your load balancer goes in a public subnet.
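The address arithmetic can be checked with Python's ipaddress module. The choice of /24 subnets and the sample addresses are assumptions for the example; providers let you pick other sizes.

```python
import ipaddress
from itertools import islice

vpc = ipaddress.ip_network("10.0.0.0/16")
vpc_size = vpc.num_addresses  # 2 ** (32 - 16) addresses

# Carve the first four /24 blocks out of the VPC's range.
subnets = list(islice(vpc.subnets(new_prefix=24), 4))


def subnet_of(address):
    """Return the carved subnet that contains this address, if any."""
    ip = ipaddress.ip_address(address)
    for network in subnets:
        if ip in network:
            return network
    return None


home = subnet_of("10.0.1.17")
```

Each /24 holds 256 addresses, so a /16 VPC has room for 256 such subnets. Whether a subnet is public or private is not a property of its address range; it depends on the routes attached to it.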

A security group is a stateful firewall attached to a resource (a VM, a database, a load balancer). It specifies rules: allow inbound TCP on port 443 from anywhere, allow inbound TCP on port 5432 only from the application servers’ security group. Anything not explicitly allowed is denied. Security groups are the main tool for controlling which resources can talk to which. You do not open a port on the machine itself - you attach a security group rule.

Ingress and egress in a cloud networking context mean inbound and outbound traffic. Ingress is traffic coming into your resource from outside. Egress is traffic leaving your resource going out. This matters for billing: most cloud providers charge for egress (traffic leaving their network to the internet) but not ingress. Sending a terabyte of data to users costs money; receiving a terabyte from users is often free. Systems that serve large amounts of data should account for egress costs.

A NAT gateway (Network Address Translation) lets instances in a private subnet make outbound connections to the internet without being reachable from the internet. The NAT gateway sits in a public subnet, has a public IP, and translates the private instance’s address for outbound traffic. The instance can fetch a software update from the internet; the internet cannot initiate a connection back to it.

Compute: Instances, Auto-Scaling, and Pricing Models

A compute instance is a virtual machine in the cloud - an EC2 instance, a GCP VM, an Azure VM. It has a type (which determines how many vCPUs, how much RAM, what kind of storage it has), an OS image, and a lifecycle. Instances can be started, stopped, rebooted, and terminated. When you terminate an instance, it is gone.

Auto-scaling is the mechanism that adjusts the number of running instances automatically based on load. You define a minimum and maximum instance count and a scaling policy (“add one instance when CPU > 70% for 5 minutes, remove one when CPU < 30% for 10 minutes”). When traffic spikes, the auto-scaling group launches new instances and registers them with the load balancer. When traffic drops, it terminates instances and deregisters them. This is the cloud’s killer feature: you pay for what you use, and the system grows and shrinks on its own.
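The scaling policy quoted above reduces to a small decision function. This sketch ignores the "for 5 minutes" and "for 10 minutes" windows and assumes the CPU figure passed in is already an average over the relevant period; the function and parameter names are illustrative.

```python
def desired_instances(current, cpu_percent, minimum, maximum,
                      scale_out_above=70, scale_in_below=30):
    """Apply a simple threshold policy, clamped to the allowed range."""
    if cpu_percent > scale_out_above:
        target = current + 1   # under pressure: add one instance
    elif cpu_percent < scale_in_below:
        target = current - 1   # mostly idle: remove one instance
    else:
        target = current       # inside the comfortable band: do nothing
    return max(minimum, min(maximum, target))
```

The gap between the two thresholds matters: if scale-out and scale-in used the same number, the group would flap between sizes whenever load hovered near it.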

Cloud providers offer instances on three pricing models, which are confusingly named differently across providers but follow the same logic:

On-demand (pay-as-you-go): you pay for every hour an instance runs, at the full listed price, with no commitment. Maximum flexibility, maximum cost. Good for unpredictable workloads.

Reserved (committed use): you commit to using an instance type for one or three years and get a significant discount (30-70%). Good for baseline, stable workloads. The word “reserved” means you’ve reserved capacity; the instance is not literally always running.

Spot (AWS) / Preemptible (GCP): spare capacity that the provider rents at a steep discount (70-90% off on-demand) but can be reclaimed with short notice (2 minutes on AWS, 30 seconds on GCP) when demand rises. Good for fault-tolerant, interruptible workloads like batch processing, CI jobs, or distributed training runs that can checkpoint and restart.

Storage: Object, Block, and File

There are three fundamental models for cloud storage, and they are suited to completely different use cases.

Object storage (AWS S3, GCS, Azure Blob Storage) stores arbitrary blobs of data - files, images, videos, backups, model weights - addressed by a key in a flat namespace. There are no directories, just keys that look like paths (photos/2024/january/img001.jpg). You upload an object, you retrieve it by key. Object storage scales to exabytes, is highly durable (data is replicated across AZs within a region automatically), and is cheap. The word blob (Binary Large Object) is used interchangeably with object - it just means an arbitrary chunk of bytes. A bucket is the container that holds objects - a named top-level namespace within which you store things.

Object storage is not a filesystem: you cannot modify part of an object in place (changing it means rewriting the whole object), listing keys by prefix is a billed API request rather than a cheap directory read, and access latency is typically in the tens to hundreds of milliseconds. It is best for: storing things durably and retrieving them by name.

Block storage (AWS EBS, GCP Persistent Disk) behaves like a hard disk attached to a server. Your operating system mounts it, creates a filesystem on it, reads and writes files to it. Block storage has low latency (single-digit milliseconds), supports random reads and writes, and looks identical to local disk from the OS’s perspective. It is attached to one instance at a time (typically). When you run a database on a cloud VM, the database files live on block storage.

File storage (AWS EFS, GCP Filestore) is a managed network filesystem that multiple instances can mount simultaneously. Where block storage is exclusive to one machine, file storage is shared. If you have ten application servers all needing to read from a shared directory of assets, file storage is the right tool. It is more expensive than object storage and slower than local block storage.

The rule of thumb: static assets, backups, and large files → object storage. Database files, OS volumes, anything a single server writes to → block storage. Shared filesystems that multiple servers read from simultaneously → file storage.

Cold storage or archival storage (AWS Glacier, GCS Archive) is object storage optimized for data you rarely access - compliance archives, old backups, historical logs. Retrieval can take minutes to hours, but the cost per gigabyte is an order of magnitude lower than standard object storage. The name “Glacier” is evocative: the data is frozen, available eventually, not quickly.

IAM: Identity and Permissions in the Cloud

IAM (Identity and Access Management) is the cloud’s permission system. It controls who (or what) can do what to which resources. “Who” can be a human engineer, a service account running on a VM, or another cloud service. “What” can be any API call: read an S3 bucket, create a VM, delete a database. “Which resources” can be specific - this particular S3 bucket, all EC2 instances in this region, all resources owned by this project.

The core concepts:

A principal is the entity whose identity is being established - a user, a group, a service account, or a role.

A policy is a document that specifies permissions: “allow s3:GetObject on arn:aws:s3:::my-bucket/*.” Policies are attached to principals or resources.

A role is a set of permissions that can be assumed by any eligible principal. Instead of giving a VM’s code a username and password, you attach a role to the VM that grants it specific permissions. The code running on that VM can then call AWS/GCP APIs and the cloud authenticates it automatically - no credentials stored in the application.

The principle of least privilege applies here: every principal should have only the permissions it actually needs and nothing more. A VM that reads from one S3 bucket should not have access to all S3 buckets. A service that only reads from a database should not have write access. Overly broad permissions are a security liability: a compromised service can do everything its role allows, so a narrow role limits the damage.
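How a policy statement turns into an allow or deny decision can be sketched with simple wildcard matching. This is a heavy simplification made up for illustration: real IAM evaluation also involves explicit denies, conditions, and several policy types, and the statement layout below is not the provider's actual document format.

```python
from fnmatch import fnmatchcase

# One statement, modelled on the example in the text above.
READ_ONLY_POLICY = [
    {"effect": "allow", "action": "s3:GetObject",
     "resource": "arn:aws:s3:::my-bucket/*"},
]


def is_allowed(policy, action, resource):
    """Default deny: a request passes only if some allow statement matches it."""
    return any(
        statement["effect"] == "allow"
        and fnmatchcase(action, statement["action"])
        and fnmatchcase(resource, statement["resource"])
        for statement in policy
    )
```

Least privilege shows up as the absence of matches: with this policy a read of an object in my-bucket is allowed, while a delete on the same object, or a read from any other bucket, falls through to the default deny.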

Kubernetes: Containers at Scale

Containers solved the “works on my machine” problem by packaging an application with its dependencies. But when you have dozens or hundreds of containers to run, you need something to schedule them onto machines, restart them when they crash, route traffic to them, and roll out updates without downtime. That something is a container orchestrator, and Kubernetes (K8s) is the dominant one.

Kubernetes introduced its own vocabulary that comes up constantly in infrastructure discussions.

A cluster is the top-level unit: a set of machines managed together by Kubernetes. The cluster has a control plane (the brain, which manages state and scheduling) and worker nodes (the machines that actually run containers).

A node is one machine in the cluster - a VM or physical server. The control plane decides which workloads run on which nodes.

A pod is the smallest deployable unit in Kubernetes. A pod is one or more containers that always run together on the same node, share the same network namespace (same IP address), and share storage volumes. In practice, most pods contain one container. The reason pods exist rather than just containers: some applications need a main container and a sidecar (a helper container that handles logging, metrics, or proxying) that must run together.

A deployment is a description of desired state: “run three replicas of this container image, restart them if they crash, roll out updates with zero downtime.” Kubernetes continuously reconciles actual state with desired state. If a node fails and a pod is lost, Kubernetes schedules a replacement pod on another node.
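The reconciliation idea can be sketched in a few lines. This is a toy model, not how Kubernetes is implemented: the function name reconcile, the pod naming scheme, and the in-memory list of pods are all assumptions made for illustration.

```python
def reconcile(desired_replicas: int, running_pods: list[str], next_id: int) -> tuple[list[str], int]:
    """One pass of a toy reconciliation loop.

    Compares actual state (running_pods) with desired state (desired_replicas)
    and returns the corrected pod list plus the next unused pod id.
    """
    pods = list(running_pods)
    while len(pods) < desired_replicas:   # too few: start replacements
        pods.append(f"web-{next_id}")
        next_id += 1
    while len(pods) > desired_replicas:   # too many: stop the extras
        pods.pop()
    return pods, next_id


pods, next_id = reconcile(3, [], 0)           # initial rollout: three pods
pods.remove("web-1")                          # simulate a node failure taking a pod with it
pods, next_id = reconcile(3, pods, next_id)   # the loop notices the gap and fills it
```

The point of the design is that nobody issues a “restart pod” command. The system is only ever told the desired state, and the loop keeps closing the gap.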

A service in Kubernetes (distinct from the general concept of a service) is a stable network endpoint for a set of pods. Pods come and go - they are ephemeral - so you do not route traffic directly to pod IPs. A Kubernetes Service selects pods by label and gives them a stable IP and DNS name. Traffic sent to the service is load-balanced across the matching pods.

A namespace is a logical partition within a cluster. Different teams or environments (dev, staging, prod) can each get their own namespace within one cluster, with separate resource quotas and access controls.

An ingress in Kubernetes is a resource that defines how external HTTP traffic enters the cluster - which hostname maps to which service, how TLS is terminated, how paths are routed. It is essentially the cluster’s reverse proxy configuration, managed as Kubernetes resources.

Deployment Strategies

Once you have a new version of an application to release, the question is: how do you switch from the old version to the new one without taking down the service? Several strategies exist, each with different safety and complexity tradeoffs.

A rolling deployment replaces instances one (or a few) at a time. The load balancer keeps sending traffic to old instances while new instances are being brought up. Once a new instance is healthy, it starts receiving traffic and an old one is removed. During the rollout, both versions are simultaneously live. Safe for most changes; problematic if old and new versions are incompatible with each other (e.g., old version cannot read data written by new version).

A blue/green deployment maintains two identical environments - blue (current production) and green (new version). You deploy the new version to green, test it, and then flip the load balancer to send all traffic to green. Blue is kept running as a fallback. If something goes wrong, you flip back to blue instantly. The tradeoff: you need double the infrastructure during the switchover, and there is a hard cutover moment rather than a gradual ramp.

A canary deployment (named after the canary in a coal mine) sends a small fraction of traffic - say 1% or 5% - to the new version while the rest continues hitting the old version. You monitor error rates and latency on the canary. If metrics look good, you gradually increase the percentage. If something goes wrong, only a small fraction of users were affected. Canary deployments are the safest strategy for high-traffic services where you want early signal without broad impact.
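A canary split boils down to a weighted coin flip per request plus a health gate before ramping up. The sketch below assumes hypothetical names (pick_version, canary_looks_healthy) and a single error-rate threshold. Real rollout tools typically also watch latency and other signals.

```python
import random


def pick_version(canary_fraction: float, rng: random.Random) -> str:
    """Route one request: 'canary' with probability canary_fraction, else 'stable'."""
    return "canary" if rng.random() < canary_fraction else "stable"


def canary_looks_healthy(errors: int, requests: int, max_error_rate: float) -> bool:
    """Gate for increasing the canary percentage."""
    if requests == 0:
        return False  # no signal yet, so do not ramp up
    return errors / requests <= max_error_rate
```

A rollout controller would call canary_looks_healthy periodically and raise canary_fraction step by step only while it keeps returning True.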

Infrastructure as Code

Infrastructure as Code (IaC) is the practice of defining cloud infrastructure - VPCs, VMs, databases, load balancers, IAM roles - in declarative configuration files rather than by clicking around in a web console. Terraform and Pulumi are the dominant tools.

The motivation is the same as for version-controlling application code: you want infrastructure changes to be reviewed, tracked, reproducible, and reversible. If you click through the AWS console to set up a VPC, no one knows what you did or can reproduce it. If you write a Terraform file that defines the VPC, it can be reviewed, committed, applied consistently across environments, and recovered if something is deleted.

The key word is declarative: you describe the desired end state (“I want a VPC with this CIDR block, two subnets, and a security group with these rules”), and the tool figures out what API calls are needed to get there - creating, modifying, or deleting resources as necessary. You do not write “call CreateVPC, then call CreateSubnet, then call CreateSecurityGroup.” You write “here is what I want to exist,” and the tool makes it so.
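The “tool figures out the API calls” step is essentially a diff between desired and actual state. Here is a minimal sketch under simplifying assumptions: resources are plain dictionaries keyed by a hypothetical name, and the function plan only lists actions. Real tools such as Terraform also track dependencies between resources and keep a state file.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Compute the actions needed to turn actual state into desired state."""
    actions = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            actions.append(("create", name))      # wanted but missing
        elif name not in desired:
            actions.append(("delete", name))      # exists but no longer wanted
        elif desired[name] != actual[name]:
            actions.append(("update", name))      # exists but drifted
    return actions


desired = {"vpc": {"cidr": "10.0.0.0/16"}, "subnet-a": {"cidr": "10.0.1.0/24"}}
actual = {"vpc": {"cidr": "10.0.0.0/16"}, "subnet-a": {"cidr": "10.0.9.0/24"}, "old-sg": {}}
steps = plan(desired, actual)  # [("delete", "old-sg"), ("update", "subnet-a")]
```

Notice that the unchanged VPC produces no action at all. Applying the same description twice is a no-op, which is what makes declarative configuration safe to re-run.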

Provisioning is the act of creating and configuring infrastructure. “Provision a new environment” means create all the cloud resources it needs. Running terraform apply provisions (or updates) the resources described in your Terraform files.

RTO and RPO

When something goes wrong badly enough that data is lost or the system is down for an extended period, two metrics define your recovery requirements.

RTO (Recovery Time Objective) is the maximum acceptable downtime - how long the system can be unavailable after a failure before the business impact becomes unacceptable. An RTO of 4 hours means: get the system back up within 4 hours of an outage. An RTO of 0 means no downtime is acceptable - every failure must be handled by automatic failover.

RPO (Recovery Point Objective) is the maximum acceptable data loss - how far back in time you can afford to roll back. An RPO of 1 hour means: in the worst case, you can lose the last hour’s worth of data. An RPO of 0 means zero data loss - every transaction must be durably committed before it is acknowledged.

These two objectives directly drive architecture decisions. A low RTO requires redundant active systems with automatic failover. A low RPO requires synchronous replication or continuous backup. Both close to zero requires active-active multi-region architecture - expensive but achievable. An organization’s RTO and RPO are negotiated with stakeholders and drive every disaster recovery decision downstream.
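A quick way to sanity-check a recovery plan against these objectives is simple arithmetic. The sketch below assumes periodic backups (so the worst case is losing one full interval) and a recovery that consists of a detection phase followed by a restore phase. The function names are illustrative only.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, a failure hits just before the next backup,
    so everything since the last backup (one full interval) is lost."""
    return backup_interval <= rpo


def meets_rto(time_to_detect: timedelta, time_to_restore: timedelta, rto: timedelta) -> bool:
    """Downtime runs from the start of the failure until service is restored."""
    return time_to_detect + time_to_restore <= rto
```

Nightly backups, for example, cannot satisfy an RPO of 1 hour, no matter how fast the restore is.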

Environment

An environment is a running context for software - a configuration of servers, databases, and services with specific data and settings. The standard environments are:

Development (dev): runs on a developer’s machine or a shared dev server. Data is fake or minimal. Breaking things is fine.
Staging: mirrors production as closely as possible. Used for final testing before release. Sometimes called “pre-prod.”
Production (prod): the real thing. Real users, real data, real consequences.

Config, Secrets, and Environment Variables

Configuration (config) is everything that controls how a program behaves without being part of the program’s logic. The key insight behind config is the separation of what the code does from how it behaves in this specific context. The code is compiled once. Config is injected at runtime and varies by deployment.

Why does this separation matter? Consider a service that connects to a database. The database address is localhost:5432 on your laptop, db.staging.internal in staging, and db.prod.internal in production. If you hardcode the address into the source code, you need a separate build for each environment - or worse, you accidentally deploy staging code pointing at the production database. Config solves this: the code reads the address from its environment at startup, and the address is supplied differently in each context.

More broadly: anything that would need to change if you moved the application to a different environment is config. Database hostnames, API base URLs, which S3 bucket to write to, how many worker threads to run, whether to enable verbose logging, which payment gateway to call. These values are not logic - they do not affect what the program computes, only where it sends data and what resources it uses.

What happens if you hardcode instead of configuring: the code is brittle. Changing any deployment detail requires modifying source code, going through code review, and redeploying - even if the change is trivial. In practice teams end up with if environment == "prod" branches scattered through the code, which is fragile and hard to test. Config eliminates this by making context an external input rather than an internal assumption.

How config is delivered to a program:

Environment variables are the most fundamental mechanism. They are key-value string pairs that the operating system makes available to every process. A process reads them with os.environ['DATABASE_URL'] (Python), process.env.DATABASE_URL (Node.js), or equivalent. Every process inherits the environment of its parent. In Docker, environment variables are set in the Dockerfile or passed at docker run. In Kubernetes, they are specified in the pod spec. In a shell: export DATABASE_URL=postgres://... before running the program. Environment variables are simple, universal, and require no config-reading library - which is why they are the standard.

Config files (.env, config.yaml, settings.json) are structured files the application reads on startup. They are more readable than a list of environment variables for complex configuration. The .env file format (one KEY=VALUE per line) is particularly common for local development: you create a .env file that is listed in .gitignore (so it never enters version control) and a library like python-dotenv loads it into environment variables automatically.

Config services (AWS Systems Manager Parameter Store, GCP Secret Manager, HashiCorp Vault) are centralized stores for config and secrets that the application fetches at startup or runtime. They provide versioning, access control, audit logs, and automatic rotation. Instead of an environment variable containing the database password, the application calls the config service at startup: “give me the value of /prod/db/password” - and the config service checks that the calling service’s IAM role has permission to read that value before returning it.
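Of the three mechanisms, environment variables are the easiest to show in code. The sketch below reads settings once at startup. DATABASE_URL comes from the example above, while WORKER_THREADS, VERBOSE_LOGGING, and the Settings class are hypothetical names chosen for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    worker_threads: int
    verbose_logging: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables (or any mapping, for tests)."""
    env = os.environ if env is None else env
    return Settings(
        database_url=env["DATABASE_URL"],  # required: raises KeyError if missing
        worker_threads=int(env.get("WORKER_THREADS", "4")),
        verbose_logging=env.get("VERBOSE_LOGGING", "false").lower() == "true",
    )
```

Failing loudly on a missing required value at startup is a deliberate choice: it is better for the process to refuse to start than to run against a guessed database address.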

Secrets are a subset of config that require protection. A database hostname is config - it is not sensitive and can be committed to version control or logged. A database password, an API key, a TLS private key - these are secrets. If a secret leaks, an attacker can impersonate your service, read your data, or make API calls billed to your account. Two rules: secrets are never committed to source control (even accidentally; there are tools that scan for this), and secrets are never logged. Dedicated secret management systems (Vault, AWS Secrets Manager) exist specifically for secrets: they encrypt at rest, restrict access by identity, rotate credentials automatically, and log every access.

Feature flags are a type of config that enables or disables features at runtime without code changes or redeployment. “Show the new checkout UI to 10% of users.” “Enable the experimental search algorithm for beta testers.” “Turn off the third-party analytics integration immediately if it starts causing errors.” Feature flags decouple deploying code from releasing features: you can deploy code that is dark (disabled by flag), turn the flag on for internal users first, then ramp to 1%, 10%, 100% of traffic. If something goes wrong, flip the flag off instantly - no rollback, no incident. The alternative - baking feature rollout into the deployment pipeline - requires a redeploy to reverse a bad release, which can take minutes to hours.
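A percentage rollout is usually made sticky per user, so the same person does not see the feature flicker on and off between requests. One common way to get that, sketched here with a hypothetical is_enabled function, is to hash the flag name and user id into a bucket from 0 to 99 and compare it to the rollout percentage.

```python
import hashlib


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in a bucket 0..99 for this flag.

    The same user always lands in the same bucket, and raising the
    percentage only ever adds users. It never removes anyone already in.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Including the flag name in the hash means different flags pick different 10% slices of users, rather than always experimenting on the same group.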

Failure Vocabulary

Timeout, Retry, Idempotency

A timeout is a limit on how long you will wait for an operation before giving up. Without timeouts, a slow or dead service causes callers to wait forever, eventually exhausting all their threads and becoming unresponsive themselves - a cascading failure. Timeouts are how you fail fast and prevent failures from spreading.
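As a rough illustration, here is one way to put a deadline on an arbitrary call using a worker thread. The names call_with_timeout and slow_dependency are hypothetical. Note one limitation of this approach: the caller stops waiting, but the worker thread itself is not cancelled and runs to completion in the background.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s: float, fallback):
    """Run fn in a worker thread and wait at most timeout_s seconds for it."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback  # fail fast instead of waiting indefinitely
    finally:
        # Do not block on the (possibly still running) worker thread.
        executor.shutdown(wait=False)


def slow_dependency():
    """Stand-in for a slow downstream service."""
    time.sleep(0.5)
    return "late answer"
```

In practice, most network libraries accept a timeout parameter directly, and setting it explicitly is usually preferable to wrapping calls like this.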

A retry is trying again after a failure. Retries are often the right response to transient failures (a brief network hiccup, a server that was briefly overloaded). But retries are only safe if the operation is idempotent - if calling it twice has the same effect as calling it once. Charging a credit card is not idempotent: retrying a charge double-charges the customer. Creating a session is not idempotent: retrying creates a duplicate. Reading data is idempotent: reading twice is fine.

The word “idempotent” comes from Latin “idem” (same) + “potens” (having power): the same effect, regardless of how many times applied. Engineers use it constantly, and it is one of the most important properties to consider when designing operations that may be retried.
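An operation that is not naturally idempotent can often be made safe to retry. One common technique (not the only one) is an idempotency key: the caller attaches a unique key to the request, and the server remembers the result for each key. The PaymentService class below is a hypothetical in-memory sketch of that idea. A real service would persist the keys durably.

```python
class PaymentService:
    def __init__(self):
        self.charges = []            # side effects actually performed
        self._results_by_key = {}    # idempotency key -> stored result

    def charge(self, idempotency_key: str, customer: str, cents: int) -> dict:
        """Charge once per key; a retry with the same key returns the first result."""
        if idempotency_key in self._results_by_key:
            return self._results_by_key[idempotency_key]
        result = {"charge_id": len(self.charges) + 1, "customer": customer, "cents": cents}
        self.charges.append(result)
        self._results_by_key[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("order-42", "alice", 1999)
retry = service.charge("order-42", "alice", 1999)  # e.g. the first response was lost
```

Here the retry returns the original charge instead of creating a second one, so the customer is billed once.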

Circuit Breaker

A circuit breaker is a pattern borrowed from electrical engineering. In an electrical circuit, a circuit breaker trips when current is too high, breaking the circuit to prevent damage. In software, a circuit breaker monitors calls to a downstream service. If the failure rate exceeds a threshold, the breaker “trips” - subsequent calls are immediately rejected rather than waiting for a timeout. This gives the downstream service time to recover and prevents a slow downstream from making an upstream slow.

The states are: closed (requests flow normally), open (requests are immediately rejected), half-open (a test request is allowed through to see if the service has recovered).
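The three states map naturally onto a small state machine. The sketch below makes two simplifying assumptions: it trips on a count of consecutive failures rather than a failure rate, and it is not thread-safe. The class and parameter names are illustrative.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if self._clock() - self._opened_at < self.recovery_timeout_s:
                raise CircuitOpenError("rejected without calling downstream")
            self.state = "half-open"  # recovery window elapsed: allow one test request
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed test request, or too many failures in a row, opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        self.state = "closed"  # success: resume normal flow
        self._failures = 0
        return result
```

The clock is injectable so the recovery window can be tested without actually waiting.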

Backoff

Backoff is the strategy of waiting before retrying, and waiting longer after each failure. Simple backoff: wait 1 second, then try again. If it fails, wait 2 seconds, then 4, then 8 - exponential backoff. Jitter is randomization added to the backoff interval: rather than all clients retrying at exactly the 4-second mark (causing a stampede), they retry at a random point between 3 and 5 seconds. Backoff with jitter is the standard pattern for distributed system retries.
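Here is a minimal sketch of exponential backoff with jitter. The function names and the 25% jitter width are assumptions chosen so that the nominal 4-second delay lands somewhere between 3 and 5 seconds, matching the example above. Other jitter schemes exist.

```python
import random
import time


def backoff_delay(attempt: int, base_s: float = 1.0, jitter: float = 0.25, rng=None) -> float:
    """Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s... plus or minus jitter."""
    rng = random if rng is None else rng
    nominal = base_s * 2 ** attempt
    return nominal * rng.uniform(1 - jitter, 1 + jitter)


def retry(fn, max_retries: int = 3, sleep=time.sleep, rng=None):
    """Call fn, retrying on any exception with exponential backoff and jitter.

    Only safe for idempotent operations.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(backoff_delay(attempt, rng=rng))
```

A production version would normally retry only on errors known to be transient, rather than on every exception.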

SLA, SLO, SLI, Error Budget

These are the formal vocabulary of reliability commitments.

An SLI (Service Level Indicator) is a specific measurement - a number. Request success rate, p99 latency, data durability.

An SLO (Service Level Objective) is a target for an SLI. “Request success rate will be above 99.9%.” “p99 latency will be below 200ms.” SLOs are internal commitments, owned by the engineering team.

An SLA (Service Level Agreement) is a contract with consequences - usually between a company and its customers. “If availability falls below 99.9%, we give you a credit.” SLAs are the external, legally binding version of SLOs. The SLO is usually set stricter than the SLA to give a buffer: the team notices and reacts to a missed SLO before the contractual threshold is breached.

An error budget is the flip side of an SLO. If your SLO is 99.9% availability, your error budget is 0.1% - the allowed downtime (roughly 43 minutes per month). The budget can be “spent” on planned maintenance, risky deployments, or outages. When the error budget is exhausted, the team focuses on reliability over features. It turns an abstract reliability target into a concrete resource that engineering and product teams jointly manage.
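The arithmetic behind the 43-minute figure is worth seeing once. The sketch below assumes a 30-day window, and the function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining_minutes(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after the downtime already incurred; negative means the SLO is blown."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(0.999)  # 0.1% of 43,200 minutes, about 43.2 minutes
```

Each extra nine shrinks the budget tenfold: at 99.99% the same window allows only about 4.3 minutes.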

Deployment Vocabulary

A build is the process of converting source code into a runnable artifact - compiling, linking, bundling, or containerizing. The output is an artifact: a binary, a JAR file, a Docker image, a compiled JavaScript bundle. Artifacts are versioned so you can deploy a specific version and roll back to a previous one if something goes wrong.

A deploy is putting a built artifact somewhere it can run. Deploying does not mean “shipping to production” specifically - you deploy to dev, staging, and prod, and the process is the same in each.

CI/CD stands for Continuous Integration / Continuous Delivery (or Deployment). CI is the practice of automatically building and testing code on every commit - integrating changes continuously rather than in large batches. CD extends that automation to deploying the built artifact to an environment. The two readings of the “D” differ in the last step: with continuous delivery, every change that passes the pipeline is ready to release, but the push to production may still wait for a manual approval; with continuous deployment, it goes to production automatically. Either way, the goal is to make the path from a committed change to running code fast, automated, and reliable. “Continuous” is the meaningful word: often and automatically, not infrequently and manually.

A rollback is reverting a deployment to a previous version when the new version has problems. Good deployment systems make rollbacks fast - because however carefully you test, you will sometimes need to undo.

Where Terms Blur

Some pairs are worth explicitly clarifying because they are used interchangeably in conversation but mean distinct things:

Proxy vs. Load Balancer: A load balancer is a type of reverse proxy. All load balancers are proxies; not all proxies are load balancers.

Queue vs. Topic (pub/sub): A queue delivers each message to exactly one consumer; a pub/sub topic delivers each message to all subscribers. The architectural implication is whether you want competing consumers (queue) or fanout (pub/sub). Kafka blurs this: it has topics, but each consumer group acts like a queue within a topic.

Cache vs. Database: Both store data. A cache is ephemeral and optimized for reads; losing it is acceptable because the authoritative data is elsewhere. A database is authoritative; losing it means losing data. You cache data you can afford to lose and recompute. You store in a database data you cannot.

Authentication vs. Authorization: Authentication establishes who you are; authorization decides what you are allowed to do. Almost every security bug involving “wrong users accessing things they shouldn’t” is an authorization bug, not an authentication bug. The terms are often confused, which is why the bugs persist.

Availability vs. Reliability: A system can be available (responding) but unreliable (giving wrong answers). A system can be reliable (accurate when up) but unavailable (down for maintenance). High availability is about uptime; high reliability is about correctness.

Latency vs. Throughput: Improving one does not automatically improve the other. You can have low-latency, low-throughput systems (a single request completes quickly, but you can only handle one at a time) and high-throughput, high-latency systems (batch pipelines that process millions of records but each record takes minutes to work through). System design is about optimizing for the right axis given the workload.

Engineering vocabulary exists not to exclude people but because precise words enable precise thinking. When someone says “add a cache,” the follow-up questions - invalidation strategy, eviction policy, consistency requirements - only make sense if both parties share the same precise definition. When someone says “add a rate limit,” distinguishing it from throttling or a quota changes the implementation. The words are handles for complex concepts. Once the concept is clear, the handle snaps into place.
