Load Balancing & Proxies - Distributing Work So No Single Machine Drowns
Helpful context:
- Latency, Throughput & Queues - The Physics of System Performance
- Networking - How Packets Find Their Way Across the Internet
Ten servers. One million requests per second. How do you distribute work fairly - and what happens when three of those servers die at 2am?
The naive answer is “a load balancer,” but that is a starting point, not a conclusion. The real question is: which algorithm distributes work fairly for your workload, what happens when the load balancer itself fails, and how does the statefulness of your application change every assumption in the design?
This is a story that evolved over about thirty years, from a DNS hack to globally-aware routing. The evolution is worth tracing because each step was a response to a concrete failure mode of the previous approach.
The Two Levels: What a Load Balancer Can See
Before anything else, you need to understand what a load balancer actually has access to when a request arrives, because that determines everything it can do.
The internet is organized into layers. Think of sending a letter: the envelope has the address (routing information), and inside is the actual letter (content). A load balancer can operate at the envelope level or it can open the envelope and read the letter.
Layer 4 (the transport layer) is the envelope. At this level, the load balancer sees only: where the packet came from (source IP and port), where it’s going (destination IP and port), and whether it’s using TCP or UDP - the two main protocols for sending data across a network. TCP is connection-oriented and provides reliable, ordered delivery by retransmitting lost segments; UDP skips that machinery, so it has less overhead but is fire-and-forget. The load balancer at Layer 4 has no idea what’s inside the packet - whether it’s an HTTP request, a video stream, or a database query. It just forwards based on the envelope.
This ignorance is also its strength. Because it doesn’t parse anything, a Layer 4 load balancer is extremely fast - capable of handling millions of connections with microsecond-level overhead.
Layer 7 (the application layer) is reading the letter. At this level, the load balancer parses the actual content: HTTP headers, URL paths, cookies, query parameters, request bodies. It knows that this is a POST to /api/checkout with a session cookie for user 4291. This intelligence enables routing decisions impossible at Layer 4: send all /api/* requests to the API fleet, send all /images/* requests to the image-serving pool, send requests from beta users to the new version of the code.
The price of Layer 7 intelligence is CPU overhead for parsing and slightly higher latency per request. Most production systems use Layer 7 at the edge (handling traffic from the internet) and Layer 4 for internal traffic between services where raw throughput matters more than routing flexibility.
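To make the envelope-versus-letter distinction concrete, here is a minimal sketch of what each level can base a decision on. The pool names and the helper functions `l4_pick` and `l7_pool` are hypothetical, and hashing the 5-tuple is only one plausible Layer 4 strategy, not a description of any particular product.

```python
import hashlib

# Hypothetical backend pools; the names are made up for illustration.
API_POOL = ["api-1", "api-2"]
IMAGE_POOL = ["img-1", "img-2"]
DEFAULT_POOL = ["web-1", "web-2"]


def stable_hash(text: str) -> int:
    """Hash that is stable across processes (unlike Python's built-in hash())."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def l4_pick(src_ip, src_port, dst_ip, dst_port, proto, backends):
    """Layer 4 view: only the envelope (the 5-tuple) is available."""
    envelope = f"{proto}|{src_ip}:{src_port}|{dst_ip}:{dst_port}"
    return backends[stable_hash(envelope) % len(backends)]


def l7_pool(path: str):
    """Layer 7 view: the parsed URL path is available, so route on content."""
    if path.startswith("/api/"):
        return API_POOL
    if path.startswith("/images/"):
        return IMAGE_POOL
    return DEFAULT_POOL
```

Note that `l4_pick` never receives the path: at Layer 4 there is nothing to parse, which is exactly why it is cheap.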
A Brief History of Getting Work to the Right Machine
In the early web era, “load balancing” meant DNS round-robin: the same domain resolved to multiple IP addresses in rotation. Browser asks for example.com, gets 10.0.0.1. Next browser asks, gets 10.0.0.2. Beautifully simple. Also deeply broken: DNS responses are cached for minutes or hours, so a user might keep hitting a dead server long after it went down. And DNS knows nothing about server health.
The late 1990s and 2000s brought hardware load balancers - physical appliances from companies like F5 Networks and Cisco, purpose-built for high-throughput packet routing. These were dedicated boxes, expensive (six figures was common), and required networking specialists to configure. They worked well for large enterprises that could afford them. But they were rigid and slow to reconfigure as applications changed.
Then came software load balancers: HAProxy (2001) and Nginx (2004) brought Layer 7 load balancing - the ability to inspect HTTP traffic and route based on its content - to ordinary commodity servers. Engineers could define backend pools in a config file, reload configuration without dropping connections, and run the same software on a $50/month cloud VM that used to require a $100,000 appliance. The F5 market didn’t disappear, but it stopped being the only option.
Finally: cloud-native load balancers. AWS introduced Elastic Load Balancing (ELB) in 2009 and later added ALB (Application Load Balancer, operating at Layer 7) and NLB (Network Load Balancer, operating at Layer 4) alongside the original, now called the Classic Load Balancer. GCP and Azure followed with their own managed services. These are not machines you configure - they are services where the provider scales the load balancer automatically, spreads it across multiple zones, and integrates it with the rest of the cloud platform. The engineering team no longer owns load balancer infrastructure; they configure it through an API.
Balancing Algorithms: More Than Round-Robin
Round-robin cycles through backends sequentially. Simple and effective when requests have similar cost and servers have similar capacity. Falls apart when some requests take 1ms and others take 5 seconds - the unlucky backend accumulates slow requests while others sit idle.
Weighted round-robin assigns a weight to each backend. A server with 8 cores can be given twice the weight of a server with 4 cores, so it receives roughly twice the requests. Useful during rolling upgrades when new and old instance types temporarily coexist in the same pool.
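Both variants fit in a few lines. This is a minimal sketch: the backend names are hypothetical, and the weighted version simply repeats each backend `weight` times, which sends requests in bursts. Production balancers typically interleave the picks more smoothly.

```python
from itertools import cycle


def round_robin(backends):
    """Yield backends in a fixed rotation, forever."""
    return cycle(backends)


def weighted_round_robin(weights):
    """weights maps backend name -> positive integer weight.

    Naive expansion: a backend with weight 2 appears twice per rotation.
    """
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return cycle(expanded)


rr = round_robin(["a", "b", "c"])
first_six = [next(rr) for _ in range(6)]  # a, b, c, a, b, c

wrr = weighted_round_robin({"big-8core": 2, "small-4core": 1})
picks = [next(wrr) for _ in range(6)]
```

Over any full rotation, `big-8core` receives two of every three requests regardless of how long each request takes, which is the blind spot described above.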
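Least connections routes each new request to the backend with the fewest active connections at that moment. Far superior to round-robin for variable-duration requests. The overhead: the load balancer must maintain a count of active connections per backend. On a single load balancer that is a cheap counter, but when several load balancer instances share one pool, each sees only its own connections unless the counts are shared, and sharing them at very high connection rates requires coordination.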
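A minimal single-process sketch of the bookkeeping involved, with a hypothetical `LeastConnections` class. It assumes one load balancer owns all the counters, so there is no cross-instance coordination here.

```python
class LeastConnections:
    """Track active connections per backend and pick the least loaded one."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def acquire(self):
        """Choose the backend with the fewest active connections and count it."""
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        """Call when a request finishes so the count reflects reality."""
        self.active[backend] -= 1
```

A slow request keeps its backend's counter high until `release` is called, so new requests naturally flow to the other backends in the meantime.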
Consistent hashing maps each request to a backend using a hash of some request attribute - typically user ID or session key. The same user always hits the same backend. This matters for stateful workloads: if your backend keeps an in-memory cache per user, routing that user to a different machine means a cache miss. The careful design here uses a “virtual node ring” - rather than mapping each request directly to a server, you create many virtual positions on a ring and each server owns a range of them. When a server is added or removed, only its fraction of the keyspace shifts to a new owner rather than reshuffling everything.
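The virtual node ring can be sketched as follows. The class name `HashRing`, the choice of MD5 as the hash, and the default of 100 virtual nodes per server are assumptions made for illustration, not a reference implementation.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self._points = {}   # ring position -> owning server
        self._sorted = []   # sorted ring positions, for binary search
        for server in servers:
            self.add(server)

    def add(self, server):
        """Place `vnodes` virtual positions for this server on the ring."""
        for i in range(self.vnodes):
            self._points[_hash(f"{server}#{i}")] = server
        self._sorted = sorted(self._points)

    def remove(self, server):
        """Drop all of this server's virtual positions."""
        self._points = {p: s for p, s in self._points.items() if s != server}
        self._sorted = sorted(self._points)

    def lookup(self, key):
        """Walk clockwise from the key's position to the next virtual node."""
        idx = bisect.bisect(self._sorted, _hash(key)) % len(self._sorted)
        return self._points[self._sorted[idx]]
```

Removing a server only deletes that server's positions, so a key whose next position clockwise belonged to a surviving server keeps the same owner. Only the removed server's keys move, and the virtual nodes spread them across the survivors rather than dumping them all on one neighbor.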
Power of two choices picks two backends at random and routes to whichever has fewer connections. This achieves near-optimal load distribution with far less overhead than tracking all backends globally. It’s an elegant result from probability theory - two random samples are enough to avoid badly unbalanced routing.
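A small simulation gives a feel for the effect. This sketch assumes connections never close (so the counts only grow), uses hypothetical function names, and compares purely random assignment against two random choices with the same number of requests.

```python
import random


def pick_two_choices(active, rng=random):
    """Sample two distinct backends and return the one with fewer connections."""
    a, b = rng.sample(list(active), 2)
    return a if active[a] <= active[b] else b


def simulate(num_backends=10, num_requests=10_000, seed=1):
    """Return (max load under random routing, max load under two choices)."""
    rng = random.Random(seed)
    random_load = [0] * num_backends
    two_choice_load = {i: 0 for i in range(num_backends)}
    for _ in range(num_requests):
        random_load[rng.randrange(num_backends)] += 1
        two_choice_load[pick_two_choices(two_choice_load, rng)] += 1
    return max(random_load), max(two_choice_load.values())


max_random, max_two_choices = simulate()
```

With an average of 1,000 requests per backend, the busiest backend under two choices should stay very close to that average, while purely random routing drifts noticeably above it. The balancer only ever compares two counters per request.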
How AWS ALB Actually Works
AWS ALB (Application Load Balancer) is worth understanding in detail because it is the load balancer most teams encounter first and because its design illustrates what cloud-native Layer 7 load balancing looks like.
An ALB sits in front of target groups - logical pools of backends. A target group can contain EC2 instances (virtual machines), IP addresses, or Lambda functions (AWS’s serverless compute service, where you upload code and AWS runs it without you managing servers). Routing rules inspect each incoming request and forward it to the matching target group.
Rules can match on: host header (so api.example.com and app.example.com can share one load balancer), URL path, HTTP method, query string parameters, or source IP range (specified in CIDR notation - a compact way to describe a block of IP addresses, e.g., 10.0.0.0/8 means all addresses from 10.0.0.0 to 10.255.255.255).
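The matching logic can be sketched with Python's standard `ipaddress` module. The `Rule` structure, the `route` function, and the target group names are hypothetical and are not the ALB API; the sketch only illustrates first-match rule evaluation over a few of the conditions listed above.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    target_group: str
    host: Optional[str] = None          # exact host header match
    path_prefix: Optional[str] = None   # URL path prefix match
    source_cidr: Optional[str] = None   # client IP must fall in this block

    def matches(self, host, path, source_ip):
        """A rule matches only if every condition it specifies is satisfied."""
        if self.host is not None and host != self.host:
            return False
        if self.path_prefix is not None and not path.startswith(self.path_prefix):
            return False
        if self.source_cidr is not None:
            network = ipaddress.ip_network(self.source_cidr)
            if ipaddress.ip_address(source_ip) not in network:
                return False
        return True


def route(rules, host, path, source_ip, default="default-tg"):
    """Return the target group of the first matching rule, in priority order."""
    for rule in rules:
        if rule.matches(host, path, source_ip):
            return rule.target_group
    return default


rules = [
    Rule("internal-tg", path_prefix="/admin", source_cidr="10.0.0.0/8"),
    Rule("api-tg", host="api.example.com"),
    Rule("images-tg", path_prefix="/images/"),
]
```

Here `10.255.255.255` falls inside `10.0.0.0/8` and `11.0.0.1` does not, so the same `/admin` request lands in different target groups depending on where it came from.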
This routing flexibility makes ALB the natural place to implement two common deployment strategies:
- Blue-green deployment: maintain two identical production environments (call them blue and green). Deploy new code to green while blue serves all traffic. Switch the load balancer to send 100% of traffic to green. If something breaks, switching back to blue is a one-second config change rather than a rollback deploy.
- Canary deployment: send a small percentage (say 5%) of real production traffic to a new version of your code. If the canary fails, only 5% of users are affected. If it succeeds, gradually shift more traffic over. ALB supports this via weighted routing rules between target groups.
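Both strategies reduce to a weighted choice between target groups. The sketch below uses a hypothetical `choose_target_group` helper and made-up group names; it shows the idea, not how ALB implements weighted forwarding internally.

```python
import random


def choose_target_group(weights, rng=random):
    """Pick a target group with probability proportional to its weight."""
    groups = list(weights)
    return rng.choices(groups, weights=[weights[g] for g in groups], k=1)[0]


# Canary: roughly 5% of requests go to the new version.
rng = random.Random(42)
canary_weights = {"stable": 95, "canary": 5}
sample = [choose_target_group(canary_weights, rng) for _ in range(10_000)]
canary_share = sample.count("canary") / len(sample)

# Blue-green: the cutover is just flipping the weights to 0 / 100.
after_cutover = {"blue": 0, "green": 100}
```

Rolling a canary forward means nudging the weights (5, then 25, then 50, and so on), and rolling back a blue-green switch means restoring the previous weights.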
ALB terminates TLS - that is, it handles the HTTPS encryption layer. TLS (Transport Layer Security) is the protocol that encrypts traffic between clients and servers, turning HTTP into HTTPS. When ALB terminates TLS, it decrypts incoming HTTPS traffic and forwards it to your backend servers, either as plaintext HTTP or re-encrypted over a new HTTPS connection, depending on the target group’s protocol. This moves management of the public-facing certificate (renewing it before it expires) out of your application and into AWS Certificate Manager, which renews the certificates it issues automatically.
ALB performs health checks against each target on a configurable path and interval, and stops sending new requests to targets that fail them. Separately, when a target is deregistered (for example during a rolling deploy), it is drained: in-flight requests are allowed to complete (up to the deregistration delay, by default 300 seconds) before the target is removed from the pool. This prevents requests from being cut off mid-flight during rolling deploys.
ALB itself is not a single machine - it scales automatically across multiple AWS availability zones (physically separate data centers in the same region). The load balancer is itself load-balanced.
The Sticky Session Problem
Sticky sessions - also called session affinity - route requests from the same client to the same backend every time. Cookie-based affinity is the most common form: the load balancer injects a cookie on the first response; subsequent requests carry that cookie, and the load balancer routes accordingly. AWS ALB uses a cookie named AWSALB for this.
Sticky sessions solve a real problem: if a user’s session state lives in memory on backend-3, routing their next request to backend-5 breaks their session. But the solution creates a larger problem.
When backend-3 fails, every user pinned to it loses their session simultaneously. You haven’t eliminated the failure - you’ve concentrated it. Worse, sticky sessions defeat load balancing: if one user’s session is computationally expensive, that backend accumulates load while others sit idle.
The stateless design principle exists to eliminate this entirely. If session state lives in an external store - Redis (an in-memory database commonly used as a fast session store), a relational database, or any shared storage - then any backend can serve any request. The load balancer can route however it likes; there’s no session to lose when a backend fails. This is not just an academic preference; it is the prerequisite for true horizontal scaling. Sticky sessions are the architectural signal that session state has leaked into your compute layer.
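A minimal sketch of the stateless pattern, with a plain dictionary standing in for the external store. `SessionStore`, `Backend`, and `add_to_cart` are hypothetical names; a real deployment would replace the dictionary with a network call to something like Redis.

```python
class SessionStore:
    """Stand-in for an external session store (here just an in-process dict)."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = session


class Backend:
    """A backend that keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        # Read state from the shared store, modify it, write it back.
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {**session, "cart": cart})
        return cart


store = SessionStore()
backend_3 = Backend("backend-3", store)
backend_5 = Backend("backend-5", store)
backend_3.add_to_cart("user-4291", "book")
cart_seen_by_5 = backend_5.add_to_cart("user-4291", "pen")
```

The second request lands on a different backend and still sees the first item, so losing `backend-3` would cost capacity but no sessions.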
Health Checks: Active and Passive
A load balancer that routes to unhealthy backends is worse than no load balancer at all - it keeps feeding a share of traffic into a known failure instead of routing around it.
Active health checks send periodic probes to each backend - typically an HTTP GET to a /health endpoint - and mark backends unhealthy if they fail. AWS ALB checks on a configurable interval (default 30 seconds) and marks a backend unhealthy after a configurable number of consecutive failures (default 2). A good /health endpoint checks not just that the process is running but that it can reach its dependencies: database connections open, downstream services responding.
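The consecutive-failure logic is a small state machine. In this sketch the `HealthTracker` class is hypothetical and the threshold values are arbitrary illustration values, not any product's defaults. The probe itself (the HTTP GET) is left out; only the outcome is recorded.

```python
class HealthTracker:
    """Flip a backend's state only after N consecutive probe results agree."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._fails = 0
        self._successes = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the current health state."""
        if probe_ok:
            self._fails = 0
            self._successes += 1
            if not self.healthy and self._successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._successes = 0
            self._fails += 1
            if self.healthy and self._fails >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy
```

Requiring several consecutive results in both directions keeps a single dropped probe from ejecting a healthy backend, and keeps a single lucky probe from readmitting a broken one.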
Passive health checks observe real traffic. If a backend returns 5xx errors above a threshold or consistently times out, it is temporarily removed without any probe traffic. This catches failure modes that active checks miss - a backend that returns 200 on /health but errors on actual requests because its database connection pool is exhausted.
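One way to sketch passive checking is a sliding window over recent responses. The `PassiveMonitor` class, the window size, and the error-rate threshold are all assumptions chosen for illustration.

```python
from collections import deque


class PassiveMonitor:
    """Watch real responses from one backend and flag it when too many fail."""

    def __init__(self, window=20, max_error_rate=0.5, min_samples=10):
        self.results = deque(maxlen=window)  # True means the request failed
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def observe(self, status_code=None, timed_out=False):
        """Record one real request outcome (a 5xx or a timeout counts as failure)."""
        failed = timed_out or (status_code is not None and status_code >= 500)
        self.results.append(failed)

    def should_eject(self):
        """Eject only once there is enough evidence in the window."""
        if len(self.results) < self.min_samples:
            return False
        return sum(self.results) / len(self.results) > self.max_error_rate
```

No probe traffic is involved: the monitor only sees what users see, which is why it can catch a backend whose /health endpoint still returns 200.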
Most production systems use both. Active checks catch dead backends quickly. Passive checks catch degraded-but-alive backends that fool the health probe.
Global Load Balancing: Routing Across Regions
Once your system spans multiple cloud regions, the question shifts from “which server in this data center” to “which data center on which continent.”
GeoDNS (AWS Route 53, for example) is a variant of DNS where different users get different answers to the same domain name based on where they are. A user in Tokyo resolves api.example.com to an IP address in Tokyo; a user in Frankfurt resolves it to an address in Europe. Route 53 can also integrate health checks - if the Tokyo deployment becomes unhealthy, Route 53 automatically returns the Europe address to Tokyo users instead. One trade-off: DNS responses are cached. A short TTL (60 seconds) means faster failover; a long TTL means cheaper operation but slower response to regional failures.
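The resolution logic can be sketched as a preference list per location, filtered by health. The region names, the country-to-region preferences, and the `resolve` function are hypothetical, and the IP addresses come from ranges reserved for documentation. This is not the Route 53 API.

```python
# Hypothetical regional endpoints (documentation-only IP addresses).
REGION_IPS = {
    "tokyo": "203.0.113.10",
    "frankfurt": "198.51.100.20",
}

# Hypothetical preference order per user location: nearest region first.
PREFERENCE = {
    "JP": ["tokyo", "frankfurt"],
    "DE": ["frankfurt", "tokyo"],
}


def resolve(country, healthy_regions):
    """Return the IP of the most-preferred healthy region, or None if all are down."""
    for region in PREFERENCE.get(country, list(REGION_IPS)):
        if region in healthy_regions:
            return REGION_IPS[region]
    return None
```

A user in Japan gets the Tokyo address while Tokyo is healthy and the Frankfurt address once it is not. Clients that already cached the old answer keep using it until the TTL expires.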
Anycast is a different approach. In ordinary (unicast) routing, each IP address belongs to exactly one machine in exactly one place. Anycast breaks this: the same IP address is announced from servers in multiple data centers simultaneously. The internet’s routing infrastructure - specifically BGP (Border Gateway Protocol, the protocol that routers use to announce which IP addresses they can reach) - delivers each user’s packets to the data center advertising that IP that is nearest in routing terms, which is usually but not always the geographically closest one. GCP Cloud Load Balancing uses anycast: a single IP serves users from the nearest Google edge location (called a Point of Presence, or PoP - a physical facility in a city where the cloud provider has equipment) across more than 100 locations worldwide. AWS Global Accelerator uses the same principle to route traffic to the nearest AWS edge location and then carry it to the right region over AWS’s private backbone, avoiding the congested public internet for the long-haul leg.
Global Server Load Balancing (GSLB) combines health monitoring with DNS-based routing for disaster recovery. The most common pattern: primary region receives 100% of traffic when healthy. On health check failure, the DNS routing policy automatically shifts traffic to a secondary region. Route 53 supports this as an active-passive failover policy.
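A rough way to reason about how long DNS-based failover takes is to add the time needed to detect the failure to the time cached answers may live. The function below is a back-of-the-envelope sketch under the assumptions that resolvers honor the TTL and that clients re-resolve when it expires. The input values are illustrative, not defaults of any service.

```python
def worst_case_failover_seconds(check_interval, failure_threshold, dns_ttl):
    """Rough upper bound: time to detect the failure plus time for caches to expire."""
    detection = check_interval * failure_threshold
    return detection + dns_ttl


short_ttl = worst_case_failover_seconds(check_interval=30, failure_threshold=3, dns_ttl=60)
long_ttl = worst_case_failover_seconds(check_interval=30, failure_threshold=3, dns_ttl=3600)
```

With a 60-second TTL the estimate is 150 seconds; with a one-hour TTL it is 3,690 seconds, and almost all of that is cache lifetime rather than detection.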
The Load Balancer as a Single Point of Failure
Here is the irony: load balancers exist to eliminate single points of failure, but a single load balancer is itself a single point of failure.
The traditional on-premises solution is an active-passive HA pair: two identical load balancers run the same configuration, but only one is active at a time. They share a virtual IP (VIP) - a single IP address that floats between them. Only the active machine “owns” the VIP and receives traffic. The protocol that coordinates this handoff is called VRRP (Virtual Router Redundancy Protocol), and the most common software that implements it is Keepalived. When the primary fails, the secondary detects the failure via VRRP, claims the VIP for itself, and starts receiving traffic - usually within a few seconds.
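The ownership rule can be sketched as "the live node with the highest priority holds the VIP." This is a deliberately simplified model with hypothetical names and priority values; it is not the VRRP protocol, which also involves periodic advertisements and timers for detecting a silent peer.

```python
class Node:
    """One load balancer in the HA pair."""

    def __init__(self, name, priority):
        self.name = name
        self.priority = priority
        self.alive = True


def vip_owner(nodes):
    """Return the name of the live node with the highest priority, if any."""
    alive = [node for node in nodes if node.alive]
    if not alive:
        return None
    return max(alive, key=lambda node: node.priority).name


primary = Node("lb-primary", priority=150)
secondary = Node("lb-secondary", priority=100)
pair = [primary, secondary]

owner_before = vip_owner(pair)   # primary holds the VIP
primary.alive = False            # primary fails
owner_after = vip_owner(pair)    # secondary claims the VIP
```

Clients keep using the same IP address throughout; only the machine answering for it changes.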
Cloud-native load balancers handle this transparently. AWS ALB distributes itself across multiple availability zones and scales automatically. You don’t manage the load balancer infrastructure; you just configure it. The trade-off is control: you cannot inspect or customize the underlying machinery.
Service meshes (Istio, Linkerd) take a fundamentally different approach. Instead of one central load balancer that all traffic passes through, every service instance gets a small proxy process running alongside it - called a sidecar proxy because it sits next to the application like a sidecar on a motorcycle. This sidecar (Envoy is the most common one - an open-source high-performance proxy) handles load balancing, health checking, retries, and circuit breaking for all outbound connections from its parent service.
A central control plane (in Istio, this is a component called istiod) distributes routing configuration to all the sidecars across the cluster. No single load balancing node can fail and take down the system because the load balancing function is distributed across every service instance. The cost: significant operational complexity, and the latency overhead of two extra proxy hops for every inter-service call.
East-west traffic refers to requests that flow between services inside a data center (as opposed to north-south traffic, which flows between end users and the data center). Service meshes are primarily designed for east-west load balancing - they handle the high-volume, low-latency calls between your own microservices, while a traditional load balancer like ALB handles the north-south traffic from the internet.
Summary
| Concept | Key Insight |
|---|---|
| L4 vs L7 | L4 sees only IP and port - fast but blind to content; L7 reads HTTP headers and paths - enables canary routing, TLS termination |
| Round-robin vs least-connections | Least-connections wins for variable-duration requests |
| Consistent hashing | Keeps each user on the same backend for stateful workloads; a virtual node ring keeps distribution even and limits remapping when servers join or leave |
| AWS ALB | Layer 7, content-based routing, TLS termination, automatic health-based draining, blue-green and canary deployments |
| Sticky sessions | Solve stateful routing but defeat horizontal scaling; prefer an external session store like Redis |
| Active + passive health checks | Active catches dead backends; passive catches degraded-but-alive ones that fool the probe |
| GeoDNS vs Anycast | GeoDNS is flexible and configurable but subject to DNS caching; Anycast needs no DNS decision and routes via BGP to the nearest PoP |
| LB as SPOF | Cloud-native LBs (ALB) distribute across availability zones automatically; on-premises LBs need an active-passive HA pair with a floating virtual IP |
| Service mesh | Distributes load balancing to sidecar proxies on every service instance; eliminates central SPOF but adds operational complexity |