Prerequisite: Networking Fundamentals


A single server has two problems at scale: it is a capacity ceiling and a single point of failure. Every request goes to one machine; when that machine is saturated, latency climbs and errors follow. When it crashes, the service is down. Load balancers solve both problems by distributing traffic across a pool of backends and routing around failures automatically.

L4 vs L7 Load Balancing

The layer at which a load balancer operates determines what it can see and what decisions it can make.

Layer 4 (transport layer) load balancers operate on TCP/UDP packets. They see source IP, destination IP, and port - nothing about the application payload. Because they don’t inspect or parse HTTP, they are extremely fast and can handle millions of connections with minimal overhead. A packet arrives, the LB picks a backend based on a simple rule, and the connection is forwarded. TLS termination typically does not happen here - the encrypted bytes pass straight through.
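
To make the pass-through behavior concrete, here is a minimal L4 proxy sketch in Go (the listen port and backend addresses are hypothetical): it picks a backend per TCP connection and copies raw bytes in both directions, never parsing anything above TCP.

package main

import (
	"io"
	"log"
	"net"
)

// Hypothetical backend pool; the proxy rotates through it per connection.
var backends = []string{"10.0.0.1:8080", "10.0.0.2:8080"}

func main() {
	ln, err := net.Listen("tcp", ":9000")
	if err != nil {
		log.Fatal(err)
	}
	for i := 0; ; i++ {
		client, err := ln.Accept()
		if err != nil {
			continue
		}
		// Round-robin at the connection level - the only "decision"
		// an L4 balancer makes.
		go proxy(client, backends[i%len(backends)])
	}
}

// proxy shuttles bytes both ways without inspecting them, so TLS (or any
// other protocol) passes through untouched.
func proxy(client net.Conn, addr string) {
	defer client.Close()
	server, err := net.Dial("tcp", addr)
	if err != nil {
		return
	}
	defer server.Close()
	go io.Copy(server, client) // client -> backend
	io.Copy(client, server)    // backend -> client
}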

Layer 7 (application layer) load balancers parse HTTP (and HTTP/2, gRPC). They can route based on URL path, HTTP method, headers, or cookie values. This enables more sophisticated patterns: route /api/* to the API fleet while /static/* hits a CDN-backed pool; route requests containing an X-Beta: true header to a canary deployment; terminate TLS and re-encrypt (or forward plaintext) to backends. The price is CPU overhead for parsing and latency from the additional processing.
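
A hedged Nginx sketch of both patterns (the pool names, paths, and the X-Beta header are illustrative; the map block belongs in the http context): the map turns the header value into a variable that selects the canary pool, while location blocks split traffic by path.

map $http_x_beta $api_pool {
    default  api_backends;
    "true"   canary_backends;
}

server {
    listen 443 ssl;  # TLS terminates here, at layer 7
    location /api/ {
        proxy_pass http://$api_pool;    # header-driven canary routing
    }
    location /static/ {
        proxy_pass http://static_pool;  # path-based split to a separate pool
    }
}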

Most production systems use L7 at the edge (Nginx, HAProxy, AWS ALB) and may use L4 internally for simpler east-west routing.

Balancing Algorithms

Round-robin cycles through backends in order. Simple, works well when all requests have similar cost and all backends have similar capacity.

Weighted round-robin assigns a weight to each backend - a server with twice the CPU gets twice the requests. Useful when backends are heterogeneous.
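
A minimal Go sketch of both variants (addresses and weights are illustrative): an atomic counter cycles through the pool, and weighting is expressed by repeating a backend in proportion to its weight.

package main

import (
	"fmt"
	"sync/atomic"
)

// RoundRobin hands out backends in a fixed cycle. The atomic counter makes
// Next safe to call from many goroutines at once.
type RoundRobin struct {
	counter  atomic.Uint64
	backends []string
}

func (r *RoundRobin) Next() string {
	n := r.counter.Add(1) - 1
	return r.backends[n%uint64(len(r.backends))]
}

func main() {
	// Weighting by repetition: 10.0.0.1 appears twice, so it receives two
	// of every three requests. (Real implementations such as Nginx use a
	// smoother interleaving, but the proportions are the same.)
	rr := &RoundRobin{backends: []string{"10.0.0.1", "10.0.0.1", "10.0.0.2"}}
	for i := 0; i < 6; i++ {
		fmt.Println(rr.Next())
	}
}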

Least connections routes each new request to the backend with the fewest active connections. Better than round-robin for workloads where requests have highly variable duration (some requests take 1ms, others take 5 seconds).

IP hash (sticky sessions) hashes the client IP to deterministically pick a backend. The same client always hits the same server, which is required when session state lives on the server. The pitfall: if one backend fails, all clients hashed to that backend are disrupted simultaneously, and IP-based hashing breaks behind NAT where many clients share one IP.
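
A Go sketch of the hashing itself (addresses are illustrative): the client IP is hashed and reduced modulo the pool size, so the mapping is deterministic.

package main

import (
	"fmt"
	"hash/fnv"
)

// pickByIP maps a client IP to a backend deterministically. Note that with
// plain modulo hashing, adding or removing a backend remaps most clients;
// consistent hashing is the usual mitigation when that matters.
func pickByIP(clientIP string, backends []string) string {
	h := fnv.New32a()
	h.Write([]byte(clientIP))
	return backends[h.Sum32()%uint32(len(backends))]
}

func main() {
	backends := []string{"10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"}
	// The same client IP yields the same backend on every call.
	fmt.Println(pickByIP("203.0.113.7", backends))
	fmt.Println(pickByIP("203.0.113.7", backends))
}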

Random with two choices (power of two choices) picks two backends at random and routes to the one with fewer connections. This provides most of the benefit of “least connections” with much lower overhead - no need to maintain a global sorted list.
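
A Go sketch of the selection step (connection counts are illustrative): sample two distinct backends, compare only those two, and take the less loaded one.

package main

import (
	"fmt"
	"math/rand"
)

// Backend carries the only state this algorithm needs: a count of
// in-flight connections.
type Backend struct {
	Addr   string
	Active int
}

// pickTwo samples two distinct backends uniformly at random and returns
// the one with fewer active connections. Assumes len(pool) >= 2.
func pickTwo(pool []*Backend) *Backend {
	i := rand.Intn(len(pool))
	j := rand.Intn(len(pool) - 1)
	if j >= i {
		j++ // shift to guarantee j != i
	}
	if pool[j].Active < pool[i].Active {
		return pool[j]
	}
	return pool[i]
}

func main() {
	pool := []*Backend{
		{Addr: "10.0.0.1:8080", Active: 12},
		{Addr: "10.0.0.2:8080", Active: 3},
		{Addr: "10.0.0.3:8080", Active: 7},
	}
	chosen := pickTwo(pool)
	chosen.Active++ // account for the new connection
	fmt.Println(chosen.Addr)
}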

Health Checks

A load balancer is only useful if it routes to backends that are actually serving. Health checks come in two forms.

Active health checks send periodic probes to each backend - typically an HTTP GET to a /health endpoint - and mark backends unhealthy if they fail. The probe interval and failure threshold determine how quickly the LB responds to an outage.
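
A hedged Go sketch of an active prober (backend URLs, interval, and thresholds are illustrative): every two seconds it GETs each backend's /health endpoint and flips a backend's state after three consecutive failures or two consecutive successes - the same fall/rise thresholds the HAProxy example at the end of this section uses.

package main

import (
	"log"
	"net/http"
	"time"
)

// probed is one backend under active probing.
type probed struct {
	url     string
	healthy bool
	fails   int // consecutive failed probes
	oks     int // consecutive successful probes
}

// probe sends one GET and updates the consecutive counters, changing state
// only when a threshold is crossed.
func (p *probed) probe(client *http.Client, fall, rise int) {
	resp, err := client.Get(p.url)
	ok := err == nil && resp.StatusCode == http.StatusOK
	if resp != nil {
		resp.Body.Close()
	}
	if ok {
		p.fails = 0
		p.oks++
		if p.oks >= rise && !p.healthy {
			p.healthy = true
			log.Printf("%s marked healthy", p.url)
		}
	} else {
		p.oks = 0
		p.fails++
		if p.fails >= fall && p.healthy {
			p.healthy = false
			log.Printf("%s marked unhealthy", p.url)
		}
	}
}

func main() {
	backends := []*probed{
		{url: "http://10.0.0.1:8080/health", healthy: true},
		{url: "http://10.0.0.2:8080/health", healthy: true},
	}
	client := &http.Client{Timeout: 2 * time.Second}
	for range time.Tick(2 * time.Second) { // probe interval
		for _, b := range backends {
			b.probe(client, 3, 2) // fall=3, rise=2
		}
	}
}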

Passive health checks observe real traffic. If a backend returns 5xx responses or times out, it accumulates error marks. When the error rate crosses a threshold, the backend is temporarily removed. No extra probe traffic, but slower to detect failures.

Most production systems combine both.

Session Affinity

Some applications store session state on the server - user shopping carts in memory, WebSocket connections, expensive per-request setup. These require that all requests from a client consistently reach the same backend.

Cookie-based affinity is more reliable than IP-based: the load balancer injects a cookie (e.g., SERVERID=backend-3) on the first response; subsequent requests carry the cookie, and the LB routes accordingly. The pitfall is that if the assigned backend fails, the session is lost. Proper architecture avoids server-side session state entirely by externalizing it to a shared store (Redis, a database).
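
In HAProxy this is a few lines (server names, addresses, and cookie values are illustrative): insert tells HAProxy to add the cookie to responses, indirect strips it from requests before they reach the backend, and nocache keeps shared caches from storing the Set-Cookie response.

backend app
    balance roundrobin
    cookie SERVERID insert indirect nocache
    server app1 10.0.0.1:8080 check cookie backend-1
    server app2 10.0.0.2:8080 check cookie backend-2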

Reverse Proxy vs Forward Proxy

A reverse proxy sits in front of servers, accepting requests from clients on behalf of the server fleet. Clients communicate with the proxy and have no direct knowledge of backend addresses. L7 load balancers are reverse proxies. Nginx and HAProxy are the dominant open-source options.

A forward proxy sits in front of clients, forwarding their requests to the internet on their behalf. Corporate firewalls and VPNs are forward proxies. Clients are configured to route traffic through the proxy.

Nginx as a Reverse Proxy

Nginx defines backend pools in upstream blocks and routes traffic with proxy_pass:

upstream api_backends {
    least_conn;
    server 10.0.0.1:8080 weight=3;
    server 10.0.0.2:8080 weight=3;
    server 10.0.0.3:8080 weight=1;  # older, less capacity
}

server {
    listen 443 ssl;
    location /api/ {
        proxy_pass http://api_backends;
        proxy_connect_timeout 1s;
        proxy_read_timeout 30s;
    }
}

The open-source version of Nginx has no built-in active health checks; active probing requires Nginx Plus or a third-party module, while HAProxy ships it in its open-source core.
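
What open-source Nginx does ship is the passive mechanism, configured per server (values here are illustrative): after max_fails failed proxied requests within fail_timeout, the server is skipped for fail_timeout before being retried.

upstream api_backends {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=10s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=10s;
}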

Service Mesh

As microservice counts grow, managing L7 load balancing, retries, mutual TLS, and observability for every service-to-service connection becomes unwieldy. A service mesh handles this at the infrastructure layer.

In the sidecar proxy pattern (Istio with Envoy), a proxy container is injected alongside every application pod. All inbound and outbound traffic is intercepted by the sidecar, which handles load balancing, circuit breaking, mTLS, and emits telemetry - without any application code changes. The control plane (istiod in current Istio, historically Pilot) distributes routing configuration to all sidecars centrally.

The trade-off: significant operational complexity and latency overhead - every service-to-service request now traverses two additional proxy hops, one through the caller's sidecar and one through the callee's.

Global Load Balancing

GeoDNS returns different IP addresses based on the geographic location of the DNS resolver. A user in Tokyo resolves api.example.com to a Tokyo datacenter IP; a user in Frankfurt resolves to a European datacenter IP. GeoDNS has coarse granularity, and when a datacenter goes down, failover waits for cached DNS answers to expire - up to the record's TTL.

Anycast assigns the same IP address to servers in multiple datacenters. BGP routing naturally directs traffic to the topologically closest datacenter. Used by CDNs and DNS providers. Failover is automatic and fast (BGP convergence), but debugging is harder because the same IP routes to different physical machines.

Examples

Nginx upstream pool with a backup server and upstream keepalive:

upstream workers {
    server 10.0.1.10:3000;
    server 10.0.1.11:3000;
    server 10.0.1.12:3000 backup;  # receives traffic only if the primary servers fail
    keepalive 32;  # cache up to 32 idle connections to upstreams per worker process
}

Blue-green deployment with a load balancer:

Run two identical environments (blue = current production, green = new version). Switch the load balancer’s upstream from blue_pool to green_pool atomically. If health checks fail on green, switch back in seconds - no instance restarts required. This makes deployments a router configuration change rather than an in-place upgrade.
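
A hedged Nginx sketch of the switch (pool addresses are hypothetical): cutover is changing one proxy_pass line and reloading, which Nginx applies gracefully without dropping in-flight connections.

upstream blue_pool  { server 10.0.2.10:8080; server 10.0.2.11:8080; }
upstream green_pool { server 10.0.3.10:8080; server 10.0.3.11:8080; }

server {
    listen 80;
    location / {
        proxy_pass http://blue_pool;  # flip to green_pool and run `nginx -s reload` to cut over
    }
}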

HAProxy active health check:

backend api
    balance roundrobin
    option httpchk GET /health
    server api1 10.0.0.1:8080 check inter 2s fall 3 rise 2
    server api2 10.0.0.2:8080 check inter 2s fall 3 rise 2

fall 3 means three consecutive failed checks mark the backend down; rise 2 means two consecutive successful checks restore it. These thresholds prevent flapping.


Read Next: Microservices Architecture