Latency, Throughput, & Queues
Prerequisite: Networking
Two numbers dominate performance conversations: latency and throughput. Engineers routinely confuse them, optimize the wrong one, or design systems that look good in averages but behave badly under real load. This post covers what these numbers mean, how they relate, and the queueing theory that explains why systems fall apart near capacity.
Latency and Throughput
Latency is the time it takes for one unit of work to complete - from the moment a request is sent to the moment a response is received. It is typically measured in milliseconds (or microseconds for in-memory operations). A user waits for latency; they experience it directly.
Throughput is the number of units of work completed per unit time - requests per second, messages per second, bytes per second. It measures the system’s productive capacity.
These two quantities are related but distinct. A system can have low latency and low throughput (a single-threaded server that handles each request in 1ms but processes only one at a time). It can have high throughput and high latency (a batch pipeline that processes a million records per second but takes an hour to return results). Optimizing one does not automatically improve the other.
The Latency-Throughput Trade-off
Batching is the clearest demonstration of the trade-off. Instead of sending each message immediately, a producer accumulates messages and sends them in groups. Throughput rises because each network round trip carries more payload. Latency rises because the first message in a batch waits for the rest.
Kafka producers expose this directly with linger.ms: the maximum time a producer will wait before sending a batch. Setting linger.ms = 0 minimizes latency; setting it higher improves throughput. The right value depends on whether your application is latency-sensitive (user-facing) or throughput-sensitive (batch analytics).
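A minimal sketch of the two configurations, assuming the kafka-python client, a broker at localhost:9092, and a hypothetical "events" topic:

```python
from kafka import KafkaProducer

# Latency-leaning producer: send each message as soon as it arrives.
low_latency = KafkaProducer(
    bootstrap_servers="localhost:9092",
    linger_ms=0,           # do not wait to fill a batch
)

# Throughput-leaning producer: hold messages briefly so batches fill up.
high_throughput = KafkaProducer(
    bootstrap_servers="localhost:9092",
    linger_ms=50,          # wait up to 50 ms for more messages
    batch_size=64 * 1024,  # allow larger batches (bytes)
)

high_throughput.send("events", b"payload")
high_throughput.flush()
```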
Percentiles, Not Averages
Average latency hides the tail. If 95% of requests complete in 10ms and 5% complete in 2000ms, the average is about 110ms - which sounds acceptable. But on a busy service, that 5% is thousands of requests per minute, each one a user waiting two full seconds.
The standard way to report latency:
- p50 (median): Half of requests are faster, half slower. The typical experience.
- p95: 95% of requests are faster. Captures common slow cases.
- p99: 99% of requests are faster. The “bad day” experience.
- p999: 99.9% of requests are faster - the slowest 1 in 1,000. Relevant for high-traffic systems where even rare outliers happen constantly.
At 10,000 RPS, a p99 of 500ms means 100 requests per second are taking at least half a second. Those are real users having a bad experience every second.
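To see the gap between the mean and the tail, here is a small sketch with synthetic numbers mirroring the 95%/5% example above, using NumPy's percentile function:

```python
import numpy as np

# Fake latency samples: 95% of requests around 10 ms, 5% around 2000 ms.
latencies_ms = np.concatenate([
    np.random.normal(10, 2, 9500),
    np.random.normal(2000, 100, 500),
])

print("mean:", latencies_ms.mean())                          # ~110 ms, looks fine
print("p50, p95, p99:", np.percentile(latencies_ms, [50, 95, 99]))
```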
Tail latency is often caused by resource contention (GC pauses, lock waits), background work competing with foreground requests, or simply the occasional unlucky request that gets queued behind a slow one.
Little’s Law
Little’s Law is a mathematical result from queueing theory that holds under extremely general conditions:
$$L = \lambda W$$
Where:
- $L$ = average number of items in the system (queue + being processed)
- $\lambda$ = arrival rate (items per second)
- $W$ = average time an item spends in the system (seconds)
This is useful for sizing. If your request processing time is 50ms and you receive 1,000 requests/second, you need the system to support $L = 1000 \times 0.05 = 50$ concurrent requests in flight at any time. If your thread pool has 20 threads, requests queue up.
Little’s Law also explains why latency spikes when throughput is high. If arrival rate $\lambda$ stays constant but service time $W$ increases (say, due to a slow database query), $L$ grows - more items pile up in the system.
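The same arithmetic as a back-of-the-envelope helper - the 0.2s figure is an illustrative "slow dependency" value, not from the example above:

```python
# Little's Law as a sizing helper: L = lambda * W.
def in_flight(arrival_rate_per_s: float, time_in_system_s: float) -> float:
    """Average number of requests in flight the system must hold."""
    return arrival_rate_per_s * time_in_system_s

print(in_flight(1000, 0.050))  # 50.0  - the example above
print(in_flight(1000, 0.200))  # 200.0 - same traffic, a slower dependency
```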
Queueing Theory: Why Systems Break Near Capacity
The M/M/1 queue is the simplest queueing model: Poisson arrivals, exponential service times, a single server. Its key result is the mean number of items waiting in the queue (not counting the one in service) as a function of utilization $\rho = \lambda / \mu$ (arrival rate divided by service rate):
$$L_q = \frac{\rho^2}{1 - \rho}$$
- At $\rho = 0.5$ (50% utilization): $L_q = 0.5$ - barely any queue.
- At $\rho = 0.8$ (80% utilization): $L_q = 3.2$ - a noticeable queue.
- At $\rho = 0.9$ (90% utilization): $L_q = 8.1$ - requests are waiting.
- At $\rho = 0.95$ (95% utilization): $L_q = 18$ - the queue is growing fast.
- At $\rho \to 1$: $L_q \to \infty$ - the queue grows without bound.
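The same numbers in one loop, so you can see how fast the curve bends (0.99 added for effect):

```python
# Mean M/M/1 queue length: Lq = rho^2 / (1 - rho).
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    lq = rho ** 2 / (1 - rho)
    print(f"utilization {rho:.0%}: mean queue length {lq:.1f}")
```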
This is the practical lesson: never run a system at 100% capacity. Not because you will hit exactly 100%, but because queueing systems become non-linear near their limits. Small increases in load cause dramatic increases in queue length and latency. Most production systems are designed to operate comfortably at 50–70% utilization, leaving headroom for traffic spikes and the non-linearities of real workloads.
Real systems violate M/M/1 assumptions (traffic is not Poisson, service times are not exponential, there are multiple servers). But the qualitative shape - latency explodes as you approach capacity - holds universally.
Queue Types
FIFO (first in, first out): The standard queue. Fair and simple. Each item is processed in the order it arrives.
Priority queue: Items with higher priority are processed first, regardless of arrival order. Useful when some work is more urgent (user-facing vs batch, or SLA tiers). Risk: low-priority items can be starved indefinitely if high-priority items arrive continuously.
Delay queue: Items become visible only after a specified delay. Used for scheduling retries (exponential backoff: retry in 1s, 2s, 4s, 8s…) and time-based triggers. AWS SQS and Redis sorted sets can implement delay queues.
Dead letter queue (DLQ): Items that have failed processing a maximum number of times are moved here for manual inspection. Essential for diagnosing systemic failures without losing messages.
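As one illustration of the delay-queue pattern above, here is a minimal sketch on a Redis sorted set using redis-py. The key name, payload, and retry delays are made up, and the naive poll is not safe with multiple workers - a real implementation would pop atomically (for example, with a Lua script):

```python
import json
import time
import redis

r = redis.Redis()

def schedule(task: dict, delay_s: float) -> None:
    # Score = the time the task becomes visible; until then it stays hidden.
    r.zadd("delayq", {json.dumps(task): time.time() + delay_s})

def poll_due():
    # Fetch everything whose visibility time has passed, then remove it.
    # (Read-then-delete races with other workers; single-worker sketch only.)
    now = time.time()
    due = r.zrangebyscore("delayq", 0, now)
    if due:
        r.zremrangebyscore("delayq", 0, now)
    return [json.loads(item) for item in due]

# Exponential backoff: retry in 1s, 2s, 4s, 8s...
for attempt in range(4):
    schedule({"job": "send-email", "attempt": attempt}, delay_s=2 ** attempt)
```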
Message Queues in Practice
Message queues decouple producers from consumers and are one of the most reliable tools for building resilient systems.
Absorb traffic spikes: A sudden burst of user activity creates a spike in events. The queue absorbs the spike; consumers process at their own rate. Without the queue, the spike either overwhelms downstream services or is dropped.
Retry failed work: If a consumer fails to process a message (external service unavailable, transient error), the message returns to the queue and is retried. Retries are built into the queueing contract, not the application logic.
Decouple services: The producer publishes a message and continues. It does not wait for the consumer. If the consumer is slow or temporarily down, the producer is unaffected. Services can be deployed, restarted, and scaled independently.
Back-pressure: When the queue grows beyond a threshold, producers can be slowed (back-pressure) to prevent unbounded memory growth. Without back-pressure, a slow consumer causes the queue to grow until memory is exhausted. Monitoring queue depth is a health signal - a growing queue indicates consumer throughput is below producer throughput.
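A toy version of back-pressure using Python's bounded in-process queue - the sizes and sleep are illustrative, and a real system would apply the same idea at the message-broker level:

```python
import queue
import threading
import time

q: queue.Queue = queue.Queue(maxsize=100)  # bounded: this is the back-pressure

def consumer():
    while True:
        item = q.get()
        time.sleep(0.01)   # deliberately slow consumer
        q.task_done()

def producer():
    for i in range(1000):
        q.put(i)           # blocks when the queue is full, slowing the producer
        if i % 100 == 0:
            print("queue depth:", q.qsize())  # depth is the health signal

threading.Thread(target=consumer, daemon=True).start()
producer()
q.join()
```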
Examples
Kafka for high-throughput event streaming:
Kafka is a distributed log. Producers append events to named topics; consumers read from an offset and advance it as they process. The log is persisted to disk, so consumers can replay history. Kafka achieves high throughput by batching writes into sequential disk I/O (much faster than random writes) and by splitting each topic into partitions, each consumed by at most one consumer within a consumer group.
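A consumer-side sketch with kafka-python - the broker address, group id, and topic name are assumptions:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",          # consumers sharing a group_id split the partitions
    auto_offset_reset="earliest",  # start from the beginning of retained history
    enable_auto_commit=True,       # offsets advance as messages are processed
)

for msg in consumer:
    print(msg.partition, msg.offset, msg.value)
```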
Kafka is appropriate when: events must be persisted and replayable; multiple independent consumers need to process the same events (fan-out); throughput is in the millions of messages/second.
SQS for task queues:
AWS SQS is a managed FIFO or standard queue. Producers enqueue tasks (send an email, process an image); workers poll and process. Visibility timeout ensures that if a worker crashes mid-processing, the message becomes visible again after a configurable timeout and is retried by another worker. SQS is simpler than Kafka - no partitions, no offset management - and appropriate for workloads where you need reliable delivery of discrete tasks rather than ordered event streaming.
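A hedged boto3 sketch of that loop; the queue URL is a placeholder and the process function is a stand-in for real work:

```python
import boto3

def process(body):
    # Hypothetical task handler.
    print("processing", body)

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/tasks"  # placeholder

# Producer side: enqueue a task.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"task": "send-email"}')

# Worker side: poll, process, delete. If the worker crashes before
# delete_message, the message reappears after the visibility timeout
# and is retried by another worker.
resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=1,
    WaitTimeSeconds=20,     # long polling
    VisibilityTimeout=60,   # seconds the message stays hidden while we work
)
for msg in resp.get("Messages", []):
    process(msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```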
Redis lists as simple queues:
LPUSH queue task and BRPOP queue timeout implement a queue with blocking pop - workers sleep until a task arrives, then process it. Zero infrastructure overhead; works within your existing Redis instance. Appropriate for low-to-moderate volume queues where persistence across Redis restart is not required (or where you use Redis AOF persistence). Not appropriate for millions of messages/second or multi-consumer fan-out - use Kafka there.
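The whole pattern in a few lines with redis-py (key name and task payload are made up):

```python
import redis

r = redis.Redis()

# Producer: push a task onto the left end of the list.
r.lpush("queue", "resize-image:42")

# Worker: block for up to 5 seconds waiting for a task, then process it.
item = r.brpop("queue", timeout=5)
if item is not None:
    _key, task = item   # brpop returns a (key, value) tuple
    print("processing", task)
```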
Read Next: Queueing Theory