Rate Limiting & Throttling - Protecting Services From Their Own Traffic // Megha Bose


On April 3, 2021, a change to a downstream API at a major social platform caused a cascade of retries. Clients got errors, retried immediately, got more errors, retried again. Within minutes, the number of requests to the API was forty times normal volume - not from external attackers, but from the platform’s own clients responding rationally to failures. The service that could handle a million requests per minute was receiving forty million, failing at every one, receiving forty million retries, and failing at those too. The feedback loop ran until someone manually intervened.

This is the retry storm, one of the most common ways that distributed systems destroy themselves. Rate limiting is the control mechanism that prevents it - and it is also what keeps a single badly behaved client from consuming resources that belong to everyone.

Why Services Need Rate Limits

A service without rate limits has an implicit assumption baked in: that every client will be reasonable. This assumption is violated constantly, usually without malice.

A client that is misconfigured might hammer an API in a tight loop. A client written by a new engineer might not implement exponential backoff. A legitimate client running a bulk operation - exporting a year of data, reindexing a search corpus - might make millions of requests that are individually valid but collectively starve interactive users. A client that is being attacked might forward the attacker’s requests directly.

Rate limits are not punishment for misbehavior. They are a contract: here is how fast you can call this API, and here is what happens when you exceed it. This contract protects the service, protects other clients, and - by returning a clear signal (HTTP 429: Too Many Requests) instead of slow failures - actually helps well-behaved clients handle overload gracefully.

Token Bucket

The most widely used rate limiting algorithm is the token bucket. The intuition is almost literal: imagine a bucket that holds tokens, refilled at a fixed rate. To make a request, a client must spend a token. If the bucket is empty, the request is rejected.

The parameters:

Capacity: the maximum number of tokens the bucket can hold
Refill rate: how many tokens are added per second (or per minute)

A bucket with capacity 100 and refill rate 10/second allows bursts of up to 100 requests, then sustains at most 10 requests per second indefinitely. The bucket absorbs short bursts - if a client has been idle for 10 seconds and then sends 100 requests at once, all 100 go through. But it cannot sustain a rate higher than the refill rate over any long interval.

This is the behavior most APIs want. Interactive users generate bursts - a user clicks several things in quick succession - then go quiet. Sustained high-rate requests are the ones that need limiting.

Implementation is simple: each client identifier (API key, IP address, user ID) has its own bucket. On each request, check whether the bucket has a token; if so, decrement and allow; if not, return 429. Periodically add tokens up to the capacity ceiling.

In Redis or Memcached: store the current token count and the last refill timestamp. On each request, compute how many tokens to add based on elapsed time, cap at capacity, then check and decrement atomically. Atomic operations prevent the race condition where two concurrent requests both observe a non-empty bucket and both succeed when only one should.
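As a minimal sketch of the steps above, here is an in-memory token bucket that refills lazily from elapsed time instead of running a background timer. The class name, the `check_request` helper, and the `buckets` registry are illustrative assumptions, and the code is single-process and not thread-safe; a shared-store version would need to run the same refill-check-decrement sequence atomically.

```python
import time


class TokenBucket:
    """In-memory token bucket with lazy refill (single process, not thread-safe)."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity          # maximum tokens the bucket can hold
        self.refill_rate = refill_rate    # tokens added per second
        self.clock = clock                # injectable so the logic can be tested
        self.tokens = float(capacity)     # start full
        self.last_refill = clock()

    def allow(self, cost=1):
        """Return True and spend `cost` tokens, or False if the bucket is short."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical per-client registry keyed by API key, IP address, or user ID.
buckets = {}


def check_request(client_id, capacity=100, refill_rate=10):
    """Return the HTTP status a request from `client_id` should receive."""
    bucket = buckets.get(client_id)
    if bucket is None:
        bucket = buckets[client_id] = TokenBucket(capacity, refill_rate)
    return 200 if bucket.allow() else 429
```

Computing the refill at request time means an idle client costs nothing between requests, which is the same reason the shared-store version keeps only a token count and a timestamp.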

Leaky Bucket

The leaky bucket algorithm is related but inverts the metaphor. Requests flow into a bucket (a queue); the bucket leaks at a fixed rate regardless of how fast requests arrive. If the bucket is full, incoming requests are dropped.

The difference from token bucket is that leaky bucket enforces a smooth output rate. No matter how fast requests arrive, they are processed at exactly the configured rate. This is useful when you need to protect a downstream system that cannot handle bursty load - a database that saturates at 100 writes per second, for instance.

Token bucket allows bursts up to the bucket capacity. Leaky bucket smooths all bursts into a constant rate. The choice depends on whether your backend can tolerate short bursts.
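For contrast with the token bucket, a leaky bucket can be sketched as a bounded queue that a worker drains at a fixed rate. This is an illustration under assumptions: the `offer` and `leak` method names are made up for this example, and it presumes a worker loop calls `leak()` regularly and processes whatever it returns.

```python
import time
from collections import deque


class LeakyBucket:
    """Bounded FIFO queue drained at a fixed rate (single-process sketch)."""

    def __init__(self, capacity, leak_rate, clock=time.monotonic):
        self.capacity = capacity      # maximum number of queued requests
        self.leak_rate = leak_rate    # requests released per second
        self.clock = clock
        self.queue = deque()
        self.last_leak = clock()

    def offer(self, request):
        """Queue a request; return False (drop it) if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            # An empty bucket accrues no leak credit; restart the clock here.
            self.last_leak = self.clock()
        self.queue.append(request)
        return True

    def leak(self):
        """Return the requests that have become due since the last call."""
        now = self.clock()
        due = int((now - self.last_leak) * self.leak_rate)
        count = min(due, len(self.queue))
        released = [self.queue.popleft() for _ in range(count)]
        # Advance only by the time actually consumed, so fractional
        # progress toward the next release carries over.
        self.last_leak += count / self.leak_rate
        return released
```

However large the incoming burst, the downstream system never sees more than `leak_rate` requests per second over time; the burst is either absorbed by the queue or dropped at the door.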

Sliding Window

Fixed window counters, the simplest way to count requests, have an edge case that token bucket and leaky bucket avoid: a client can double its effective rate around window reset boundaries. With a fixed window of “1000 requests per hour, resetting at the top of the hour,” a client can make 1000 requests at 11:59 and another 1000 at 12:00 - 2000 requests in two minutes while technically respecting the limit.

Sliding window eliminates this by maintaining a count over a rolling time window rather than a fixed one. At any point in time, the count is the number of requests in the last N seconds. This prevents the boundary burst at the cost of slightly more bookkeeping.

A practical implementation, the sliding window log: store a sorted list of request timestamps for each client. On each request, discard expired timestamps, count the remaining ones, and reject if the count has reached the limit. This gets expensive for high-traffic clients because the list of timestamps grows large. A compromise is the sliding window counter: divide time into fixed windows the same length as the limit window, keep counters for the current and previous windows, and interpolate: estimated count = current window + previous window * (1 - elapsed fraction of current window). Thirty percent of the way into the current window, for example, 70 percent of the previous window still overlaps the rolling window, so 70 percent of its count is included.
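Both variants can be sketched in a few lines. The class names are illustrative, and the counter version assumes two fixed windows of the same length as the limit window, with the previous window weighted by the share of it that still overlaps the rolling window (one minus the elapsed fraction of the current window).

```python
import time
from collections import deque


class SlidingWindowLog:
    """Exact rolling count: one stored timestamp per accepted request."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.timestamps = deque()

    def allow(self):
        now = self.clock()
        # Discard timestamps that have fallen out of the rolling window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


class SlidingWindowCounter:
    """Approximate rolling count from two fixed-window counters."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.index = None      # index of the current fixed window
        self.current = 0       # requests accepted in the current window
        self.previous = 0      # requests accepted in the window before it

    def allow(self):
        now = self.clock()
        index = int(now // self.window)
        if index != self.index:
            # Roll over; the old count only matters if the windows are adjacent.
            adjacent = self.index is not None and index == self.index + 1
            self.previous = self.current if adjacent else 0
            self.current = 0
            self.index = index
        elapsed_fraction = (now % self.window) / self.window
        # Weight the previous window by the part still inside the rolling window.
        estimate = self.current + self.previous * (1 - elapsed_fraction)
        if estimate >= self.limit:
            return False
        self.current += 1
        return True
```

The log stores one entry per request, while the counter stores two integers per client no matter how busy that client is. The price is that the counter assumes requests in the previous window were evenly spread, so its estimate can be slightly off in either direction.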

Where Rate Limits Live

Rate limiting can happen at multiple layers, and which layer is appropriate depends on what you are protecting.

At the edge (API gateway or load balancer): the first line of defense against external clients. An API gateway like Kong, Envoy, or Google Cloud API Gateway can apply per-API-key rate limits before the request reaches any application server. This is where you enforce quotas for paying tiers - free users get 1000 requests per day, paid users get 100,000. It is also where you block IP addresses or API keys that are obviously misbehaving.

At the service level: a service enforcing its own limits against callers, including internal callers. A search service might limit how many concurrent queries a single upstream service can issue, regardless of whether the caller is external or internal. This prevents a single misbehaving upstream from degrading search for everyone.

At the client level: clients can rate-limit themselves, which sounds counterintuitive until you consider why. A batch job that needs to read a million rows from a database benefits from self-imposed rate limiting because it prevents the batch from overwhelming the database and affecting interactive queries. A well-written client should have a configurable request rate and respect it.
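A self-limiting client can be as simple as spacing out its own calls. The sketch below is one way to do that; `Pacer`, `read_rows`, and `fetch_batch` are hypothetical names, with `fetch_batch` standing in for whatever call the batch job really makes.

```python
import time


class Pacer:
    """Spaces calls so a client never exceeds `rate` requests per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate    # minimum spacing between requests
        self.clock = clock
        self.sleep = sleep
        self.next_slot = clock()      # earliest time the next request may start

    def wait(self):
        """Block until the next request slot, then reserve it."""
        now = self.clock()
        if now < self.next_slot:
            self.sleep(self.next_slot - now)
            now = self.next_slot
        self.next_slot = now + self.interval


def read_rows(fetch_batch, batches, pacer):
    """Hypothetical batch job: `fetch_batch` stands in for the real database call."""
    results = []
    for batch in batches:
        pacer.wait()                  # self-imposed limit before every call
        results.append(fetch_batch(batch))
    return results
```

Making `rate` a configuration value lets an operator slow the batch down during peak hours without touching the job itself.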

Distributed Rate Limiting

Rate limiting a single server is trivial. Rate limiting across a cluster of servers is harder.

The simplest approach is to store rate limit state in a shared cache like Redis. Every server in your fleet checks and updates the same bucket for each client. This is correct but introduces a dependency on Redis: if Redis is unavailable, you cannot rate limit. The choice is to fail open (allow all requests) or fail closed (reject all requests). For most rate limiting scenarios, failing open is safer - a brief window of unthrottled traffic is acceptable, but rejecting all traffic during a Redis outage is not.
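The fail-open versus fail-closed decision usually ends up as a few lines around the call to the shared store. In this sketch, `check_shared_limit` and `StoreUnavailable` are placeholders rather than the API of any real client library; a real integration would catch whatever connection or timeout errors its store client raises.

```python
import logging

logger = logging.getLogger("ratelimit")


class StoreUnavailable(Exception):
    """Hypothetical error raised when the shared store cannot be reached."""


def allow_request(client_id, check_shared_limit, fail_open=True):
    """Ask the shared store for a decision; fall back to a fixed policy on outage.

    `check_shared_limit` is a placeholder for whatever call consults the
    shared store (for example a Redis-backed token bucket) and returns a bool.
    """
    try:
        return check_shared_limit(client_id)
    except StoreUnavailable:
        logger.warning("rate limit store unavailable; fail_open=%s", fail_open)
        # Fail open: admit the request. Fail closed: reject it.
        return fail_open
```

Logging the fallback matters as much as the policy: a limiter that silently fails open can leave a service unprotected for hours before anyone notices.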

A more resilient approach is approximate distributed rate limiting: each server maintains a local counter and syncs with a central store periodically. Between syncs, a client can exceed the limit by an amount that grows with the number of servers and the sync interval. Suppose you have 10 servers with 5-second sync intervals and a limit of 100 requests per second, and each server enforces the full limit against its local counter. A client whose requests are spread across all 10 servers can then reach 1000 requests per second (100 per server), ten times the limit, for up to 5 seconds before the central count catches up. Whether this approximation is acceptable depends on how strict the limit needs to be.
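To make the size of the overage concrete, here is the worst-case arithmetic under one simple model: every server applies the full limit to its own local counter and learns about the others only at sync time. That model and the function name are assumptions for illustration; other sync schemes give different bounds.

```python
def worst_case_overage(servers, limit_per_second, sync_interval):
    """Worst case when every server enforces the full limit on a local counter.

    Assumption (not taken from any specific system): servers learn about each
    other's traffic only at sync time, so until then each one independently
    admits up to `limit_per_second`.
    """
    peak_rate = servers * limit_per_second        # requests/second across the fleet
    admitted = peak_rate * sync_interval          # requests admitted before the next sync
    intended = limit_per_second * sync_interval   # what the global limit would allow
    return {
        "peak_rate": peak_rate,
        "admitted": admitted,
        "intended": intended,
        "overage_factor": admitted / intended,
    }


result = worst_case_overage(servers=10, limit_per_second=100, sync_interval=5)
```

Under this model the overage factor equals the number of servers, and the sync interval sets how long the overage can last. One common mitigation is to give each server only a share of the limit locally, which trades over-admission for occasionally rejecting a client whose traffic lands unevenly.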

Throttling vs Rate Limiting

These terms are often used interchangeably, but they describe different behaviors.

Rate limiting rejects requests that exceed a threshold - the client gets a 429 and must retry later. The service does no work on the rejected request.

Throttling slows down requests rather than rejecting them - the service accepts the request but delays its processing. A throttled request waits in a queue; a rate-limited request is turned away immediately.

Throttling is useful when you want to protect a service without forcing clients to implement retry logic. The client experiences slowness rather than errors. But throttling requires space to queue the requests, and a queue that grows without bound is just a delayed crash. Effective throttling requires both a rate limit on how fast you process and a capacity limit on how many requests you will queue.
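The two limits can live in one small structure. The sketch below assigns each request the next free processing slot and rejects once too many requests are already waiting; `Throttle` and `admit` are illustrative names, and the caller is assumed to sleep (or schedule the work) for the returned delay.

```python
import time


class Throttle:
    """Delays requests to a fixed rate, but rejects once the backlog is too deep."""

    def __init__(self, rate, max_queue, clock=time.monotonic):
        self.interval = 1.0 / rate     # spacing between processed requests
        self.max_queue = max_queue     # most requests allowed to wait at once
        self.clock = clock
        self.next_free = clock()       # when the next processing slot opens

    def admit(self):
        """Return seconds to wait before processing, or None to reject (429)."""
        now = self.clock()
        start = max(now, self.next_free)
        delay = start - now
        # Each waiting request occupies one interval, so the delay measures
        # the backlog; refuse once it exceeds the queue capacity.
        if delay > self.max_queue * self.interval:
            return None
        self.next_free = start + self.interval
        return delay
```

With `rate=2` and `max_queue=2`, four simultaneous requests get delays of 0, 0.5, and 1.0 seconds, and the fourth is rejected. The queue bound is what turns "slow" into "rejected" before slowness becomes a delayed crash.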

Responding to 429s: Backoff with Jitter

A client that receives a 429 should wait before retrying. How long?

Exponential backoff is the standard: after the first failure, wait 1 second. After the second, wait 2 seconds. After the third, 4 seconds. After the fourth, 8 seconds, and so on. This prevents a client from hammering a struggling service with constant retries.

The problem with pure exponential backoff across many clients is synchronization. If 1000 clients all back off after the same overload event, they will all retry at almost the same time, creating the same overload spike that caused the failure. This is the thundering herd.

Jitter is the fix: instead of waiting exactly 8 seconds, wait a random amount between 0 and 8 seconds. The clients spread their retries across the backoff window, smoothing out the retry wave. AWS’s retry logic for SDK clients uses “full jitter” by default: random between 0 and the full backoff ceiling.

The rule: whenever you implement a retry loop, add jitter. A retry loop without jitter is a thundering herd waiting to happen.
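A retry loop with full jitter might look like the following sketch. `RateLimited` is a hypothetical exception standing in for however a client surfaces a 429, and the 60-second `cap` is an assumed ceiling, not a value from any particular SDK.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error a client raises when the server answers 429."""


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(request, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call `request()` until it succeeds or the attempts run out."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise                     # out of attempts: surface the error
            sleep(backoff_delay(attempt, rng=rng))
```

The ceiling doubles with each attempt (1, 2, 4, 8 seconds) while the actual wait is drawn uniformly below it. If the server sends a Retry-After header with its 429, a client would reasonably wait at least that long instead of relying on its own schedule alone.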

Summary

Algorithm	Behavior	Best for
Token bucket	Allows bursts up to capacity; refills at fixed rate	Most APIs - absorbs natural burstiness
Leaky bucket	Enforces smooth output rate regardless of input	Protecting backends that cannot tolerate bursts
Sliding window	Rolling count; no reset boundary artifacts	Strict per-window limits
Distributed (central store)	Exact cross-server limits with Redis dependency	Correctness-critical quotas
Distributed (approximate)	Local counters with periodic sync; may over-allow	High-volume, tolerates small overages

Rate limiting is not defensive programming for edge cases. It is table stakes for any service that accepts requests from clients you do not fully control - which is every production service. The clients that benefit most from clear rate limits are the well-behaved ones, because rate limits give them a signal they can act on rather than opaque degradation that is impossible to diagnose.
