Message Queues - Decoupling Services by Delaying Their Conversations // Megha Bose

Helpful context:

Fault Tolerance - Building Systems That Survive Their Own Failures

A user uploads a photo. Before they see “upload complete,” your service needs to resize the image to four dimensions, strip EXIF metadata (which may contain GPS coordinates), generate a thumbnail, run ML-based content moderation, and tag it with object labels. Synchronously. Waiting for all of that to finish before responding.

Or does it?

The question exposes a design choice that determines whether your upload endpoint takes 200ms or 8 seconds - and whether a spike in uploads cascades into database failures, or is absorbed invisibly. The answer is: no, of course those don’t need to happen before you tell the user “success.” The upload is complete. The rest is work that follows the upload. And the mechanism for separating them is the message queue.

Synchronous vs Asynchronous: The Fundamental Split

Synchronous processing means the caller waits for the work to complete before receiving a response. The latency the user experiences is the sum of every operation in the chain. If fraud detection takes 300ms, inventory check takes 100ms, and email confirmation takes 250ms, the user waits 650ms plus your processing time. Every downstream failure is your failure. Every downstream slowdown is your slowdown.

Asynchronous processing means the caller receives an acknowledgment (“received your upload; we’ll process it”) and the downstream work happens out-of-band, at the consumers’ pace. The user sees a fast response. Independent processing steps can run in parallel. A spike in uploads creates a backlog, not a crash. Downstream services can fail and retry without affecting the user-facing response time.

This is not just a performance optimization - it’s an architectural principle. Asynchronous processing decouples the availability of the caller from the availability of downstream consumers. The upload service doesn’t care if the ML tagging service is down; the work is queued and will be processed when the service recovers.
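A minimal in-process sketch of that split, assuming Python's standard `queue.Queue` as a stand-in for a real broker. The handler name `handle_upload` and the task names are hypothetical; the point is only that the response is produced before any background task has run.

```python
import queue
import threading

jobs = queue.Queue()   # stands in for a real message broker
processed = []         # records background work as it completes


def handle_upload(photo_id: str) -> str:
    """Hypothetical request handler: enqueue follow-up work, respond at once."""
    for task in ("resize", "strip_exif", "thumbnail", "moderate", "tag"):
        jobs.put((photo_id, task))
    return "upload complete"   # returned before any task has run


def worker() -> None:
    """Drains the queue at its own pace; a None item tells it to stop."""
    while True:
        item = jobs.get()
        if item is None:
            jobs.task_done()
            break
        processed.append(item)   # real work (resize, moderation, ...) goes here
        jobs.task_done()


response = handle_upload("photo-1")   # nothing has been processed yet
thread = threading.Thread(target=worker)
thread.start()
jobs.put(None)    # stop signal, queued behind the real work
jobs.join()       # wait until every queued item has been handled
thread.join()
```

If the worker were down, `handle_upload` would still return immediately; the items would simply wait in the queue.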

Point-to-Point vs Pub/Sub

Two fundamental messaging patterns, and getting the choice wrong wastes significant engineering effort.

Point-to-point (task queues): One producer, one consumer pool. Each message is handed to a single consumer. An image resize job is enqueued once and processed by whichever worker picks it up. This is the right model when work is discrete, acknowledgement-driven, and should be handled by one worker rather than broadcast. AWS SQS, in both its Standard and FIFO modes, implements this pattern (Standard queues can occasionally redeliver a message, so workers should still tolerate duplicates).

Publish/subscribe (pub/sub): One producer, many independent consumers. An order-placed event is published once and simultaneously consumed by the inventory service (to update stock), the notification service (to send confirmation email), and the analytics service (to record the transaction). None of these consumers know about each other. Adding a new consumer (say, a fraud detection service) requires no changes to the producer. AWS SNS implements fan-out broadcast; AWS EventBridge adds routing rules on top of SNS-style fan-out.

The choice between them is structural. If the work should happen exactly once and someone should acknowledge it - task queue. If multiple independent services need to react to the same event - pub/sub. Most real systems need both: a pub/sub event triggers multiple task queues, each handling their slice of downstream work.
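The structural difference can be shown with two toy in-memory classes. This is an illustrative sketch, not any broker's API: `TaskQueue` and `Topic` are hypothetical names, and acknowledgements and persistence are left out.

```python
from collections import defaultdict, deque


class TaskQueue:
    """Point-to-point: each message goes to exactly one caller of receive()."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Whichever worker asks first gets the message; nobody else sees it.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Pub/sub: every subscriber gets its own copy of each message."""

    def __init__(self):
        self._inboxes = defaultdict(deque)

    def subscribe(self, name):
        return self._inboxes[name]   # creates the inbox on first use

    def publish(self, message):
        for inbox in self._inboxes.values():
            inbox.append(message)


resize_jobs = TaskQueue()
resize_jobs.send("resize photo-1")
first = resize_jobs.receive()    # one worker gets the job
second = resize_jobs.receive()   # a second worker finds nothing

orders = Topic()
inventory = orders.subscribe("inventory")
notifications = orders.subscribe("notifications")
orders.publish("order-placed 42")   # both inboxes now hold a copy
```

Adding a third subscriber is one more `subscribe` call; the publisher code does not change.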

Group Messaging Fan-Out: 1 Write or 250?

When you send a message to a WhatsApp group with 250 members, a natural question is: what exactly happens in the database?

The intuition is that it must be 250 writes - one per recipient - because all 250 people need to receive the message. That overstates it. The opposite intuition, a single write, understates it. The reality is more interesting than either extreme.

How Messaging Fan-Out Works

The message content is stored once. There is one row in the messages table: message ID, sender, group ID, content, timestamp. That is one write. Duplicating the content 250 times would be wasteful and inconsistent - if you needed to edit or delete the message, you would have 250 rows to update.

What does scale with the group size is delivery metadata. The system needs to track the delivery and read status for each recipient independently. Person A’s message might be delivered but unread; person B might not have been online when the message was sent. This per-recipient state is typically stored as a separate record:

-- One row per (message, recipient) pair
-- For a 250-person group: 250 rows created when the message is sent
INSERT INTO message_delivery (message_id, recipient_id, status, delivered_at)
VALUES ('msg-xyz', 'user-1', 'pending', NULL),
       ('msg-xyz', 'user-2', 'pending', NULL),
       -- ... one row per remaining group member ...
       ('msg-xyz', 'user-250', 'pending', NULL);  -- 250 rows total

So the accurate answer: one write for message content, plus up to N writes for delivery tracking metadata, where N is the group size.
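To make the write count concrete, here is a small sketch using Python's built-in `sqlite3`. The `messages` schema is an assumption based on the columns listed above, and real messaging systems do not necessarily use a relational store; the sketch only shows the 1 + N shape of the writes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (
        message_id TEXT PRIMARY KEY, sender_id TEXT, group_id TEXT,
        content TEXT, sent_at TEXT
    );
    CREATE TABLE message_delivery (
        message_id TEXT, recipient_id TEXT, status TEXT, delivered_at TEXT,
        PRIMARY KEY (message_id, recipient_id)
    );
""")

members = [f"user-{i}" for i in range(1, 251)]   # a 250-person group

with conn:   # one transaction: the content row plus the per-recipient rows
    conn.execute(
        "INSERT INTO messages VALUES (?, ?, ?, ?, datetime('now'))",
        ("msg-xyz", "user-1", "group-1", "hello"),
    )
    conn.executemany(
        "INSERT INTO message_delivery VALUES (?, ?, 'pending', NULL)",
        [("msg-xyz", member) for member in members],
    )

content_rows = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
delivery_rows = conn.execute("SELECT COUNT(*) FROM message_delivery").fetchone()[0]
```

Editing or deleting the message touches the single `messages` row; only status changes touch the per-recipient rows.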

The Delivery System

Beyond database writes, the message must actually reach each member’s device. This is where fan-out happens at the connection layer, not the storage layer.

WhatsApp routes each device through a connection server that holds a persistent connection to it (a WebSocket or similar). When a message arrives, the messaging service looks up which connection servers currently hold active connections for each group member and forwards the message to each. For offline members, the message is queued and delivered when they reconnect.

This is a push model at the connection layer: 250 individual pushes happen, but they do not all require separate database writes. The message is read from storage once; the connection layer handles the fan-out in memory.
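A rough sketch of that connection-layer fan-out, using plain dictionaries. The registry layout and the names `fan_out` and `on_reconnect` are hypothetical and are not a description of WhatsApp's actual implementation.

```python
from collections import defaultdict

# user_id -> connection server currently holding that user's socket (online only)
connections = {"alice": "conn-1", "bob": "conn-2"}
offline_queue = defaultdict(list)   # user_id -> message IDs awaiting reconnect
pushed = []                         # (server, user_id, message_id) pushes made


def fan_out(message_id, recipients):
    """Push to online members; queue the message for everyone else."""
    for user_id in recipients:
        server = connections.get(user_id)
        if server is None:
            offline_queue[user_id].append(message_id)
        else:
            pushed.append((server, user_id, message_id))


def on_reconnect(user_id, server):
    """Register the new connection and flush anything queued while offline."""
    connections[user_id] = server
    for message_id in offline_queue.pop(user_id, []):
        pushed.append((server, user_id, message_id))


fan_out("msg-xyz", ["alice", "bob", "carol"])   # carol is offline
```

Calling `on_reconnect("carol", "conn-3")` later delivers the queued message without another read of the message content from storage.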

Why Groups Have Size Limits

WhatsApp’s long-standing 256-person group limit (later raised to 1,024) and Telegram’s 200,000-person supergroup limit exist for the same reason: fan-out cost.

A 1024-person group means 1024 delivery records created per message and 1024 devices notified per message. At high message volume in a large group, this creates real write amplification. The group size cap is a hard ceiling on fan-out explosion per message.

For very large communities (WhatsApp Communities, Telegram channels), the architecture shifts to a pull model: the message is stored once and members fetch it when they open the channel, rather than having it pushed to each device immediately. This trades latency (the message does not arrive as a notification) for scale (no per-member fan-out).

The Difference Between Messaging and Feed Fan-Out

This is architecturally different from the fan-out patterns in Feed Design - Push vs Pull. In a social feed (Twitter, Instagram), fan-out means writing a post reference into each follower’s timeline - the goal is fast reads from a precomputed feed. In messaging, fan-out means delivering a message to each recipient’s device as a real-time notification - the goal is low-latency delivery. The storage model, the delivery mechanism, and the latency requirements are all different.

Kafka: The Log, Not the Queue

Kafka is frequently described as a message queue, but that framing misses the fundamental design choice that sets Kafka apart from traditional queues.

A traditional queue deletes messages after they’re consumed. The consumer acknowledges processing; the message is gone. This is intuitive but limiting: only one consumer type can process each message, you can’t replay history, and the queue has no memory of what it’s already delivered.

Kafka is a distributed log. Messages are appended to topics (divided into partitions) and are not deleted after consumption; they stay until the topic’s retention policy removes them. Each consumer group tracks an offset per partition - a pointer to its current position in the log. Reading a message advances the pointer; it doesn’t delete the message. This means:

Multiple independent consumer groups can each consume the same topic at their own pace and offset. The analytics team can run a new consumer group and replay three months of order-placed events without affecting the order processing consumer group.
Failed processing can be retried by simply resetting the consumer’s offset and reprocessing.
New downstream services can be added by subscribing a new consumer group - no producer changes required, no historical events missed.
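The three properties above fall out of one data structure: an append-only list plus a per-group position. The toy class below is a sketch of a single partition, not Kafka's client API; `PartitionLog`, `poll`, and `seek` are illustrative names, and retention is ignored.

```python
class PartitionLog:
    """Append-only log for one partition, with a read position per consumer group."""

    def __init__(self):
        self.records = []
        self.offsets = {}   # consumer group -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # offset assigned to the new record

    def poll(self, group, max_records=10):
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)   # advance; nothing is deleted
        return batch

    def seek(self, group, offset):
        self.offsets[group] = offset   # rewind to replay from an earlier point


log = PartitionLog()
for event in ("order-1", "order-2", "order-3"):
    log.append(event)

orders_batch = log.poll("order-processing")   # reads all three
analytics_batch = log.poll("analytics")       # a new group also reads all three
log.seek("order-processing", 1)               # reset the offset to retry
replayed = log.poll("order-processing")       # reads order-2 and order-3 again
```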

Log compaction is an alternative to time-based retention: Kafka keeps only the latest record for each key, discarding older versions. This turns a topic into a compacted changelog of current state - useful for materializing views of a database or maintaining a running summary of entity state.
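The effect of compaction can be sketched in a few lines. This is a simplification: real Kafka compacts segment by segment in the background, leaves the active segment alone, and treats null values as delete markers, none of which is modeled here. The function name `compact` is illustrative.

```python
def compact(records):
    """Keep only the last value written for each key, in log order."""
    last_seen = {}
    for offset, (key, value) in enumerate(records):
        last_seen[key] = (offset, value)   # later records overwrite earlier ones
    survivors = sorted(last_seen.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_, value) in survivors]


changelog = [
    ("user-1", "a@example.com"),
    ("user-2", "b@example.com"),
    ("user-1", "c@example.com"),   # supersedes the first record for user-1
]
current_state = compact(changelog)
```

Here `current_state` keeps one record per user: the `user-2` record, then the newer `user-1` record.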

Kafka achieves high throughput through sequential disk I/O (writes are always appends, which HDDs and SSDs handle optimally) and producer-side batching (linger.ms and batch.size controls). Throughput in the millions of messages per second is achievable on modest hardware.

RabbitMQ: The Traditional Broker

RabbitMQ predates Kafka and takes a fundamentally different approach. Where Kafka delegates routing and offset management to consumers, RabbitMQ does routing at the broker.

Producers publish to an exchange; the exchange routes messages to queues based on binding rules. Exchange types determine routing logic: direct routing matches exact routing keys; topic routing matches wildcard patterns (order.# matches order.placed and order.shipped); fanout ignores routing keys and broadcasts to all bound queues.
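For topic exchanges, the wildcard rules are: `*` matches exactly one dot-separated word and `#` matches zero or more words. The function below is a from-scratch sketch of those rules for illustration; it is not RabbitMQ's implementation, and `topic_matches` is a made-up name.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if routing_key matches an AMQP-style topic pattern."""

    def match(pattern_words, key_words):
        if not pattern_words:
            return not key_words   # both exhausted means a full match
        head, rest = pattern_words[0], pattern_words[1:]
        if head == "#":
            # '#' may absorb zero or more words; try every split point.
            return any(match(rest, key_words[i:]) for i in range(len(key_words) + 1))
        if not key_words:
            return False
        if head == "*" or head == key_words[0]:
            return match(rest, key_words[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))
```

With this, `order.#` matches `order.placed`, `order.shipped`, and `order.shipped.eu`, while `order.*` matches only the two-word keys.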

Consumers acknowledge messages explicitly. An unacknowledged message from a crashed consumer is redelivered to another consumer. A dead letter exchange (DLX) catches messages that exceed a retry limit, expire, or are explicitly rejected - allowing them to be inspected or alerted on without blocking the main queue.

RabbitMQ is the right choice for task queues with rich routing logic, per-message retry/DLQ behavior, and workloads where you want the broker to manage delivery state. Kafka is the right choice when you need replay, multiple independent consumer groups, or high-throughput event streaming.

AWS SQS/SNS/EventBridge: When to Use Which

The AWS messaging ecosystem is mature and covers most use cases - but the options are genuinely confusing.

SQS is a managed task queue. Producers enqueue messages; workers poll and process. The visibility timeout is the key mechanism: when a worker receives a message, it becomes invisible to other workers for the visibility timeout period (default 30 seconds). If the worker processes and deletes the message before the timeout, it’s gone. If the worker crashes or the timeout expires, the message reappears for another worker to claim. This provides at-least-once delivery semantics. SQS Standard queues provide “best-effort” ordering and may deliver messages more than once; FIFO queues preserve order within a message group and deduplicate sends within a five-minute window (what AWS calls exactly-once processing), at lower throughput.
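The visibility timeout is easier to reason about with a toy model. The class below is a simulation, not the SQS API: time is passed in explicitly as `now`, and messages are deleted by ID, whereas real SQS uses receipt handles. All names are hypothetical.

```python
import itertools


class VisibilityQueue:
    """Toy queue with an SQS-style visibility timeout (times in seconds)."""

    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self._messages = {}   # message_id -> [body, visible_at]
        self._ids = itertools.count(1)

    def send(self, body):
        message_id = next(self._ids)
        self._messages[message_id] = [body, 0]   # visible immediately
        return message_id

    def receive(self, now):
        for message_id, entry in self._messages.items():
            body, visible_at = entry
            if visible_at <= now:
                entry[1] = now + self.visibility_timeout   # hide from other workers
                return message_id, body
        return None

    def delete(self, message_id):
        self._messages.pop(message_id, None)   # the worker finished; remove for good


q = VisibilityQueue(visibility_timeout=30)
q.send("resize photo-1")
first = q.receive(now=0)          # worker A claims the message
hidden = q.receive(now=10)        # worker B sees nothing while A holds it
redelivered = q.receive(now=31)   # A never deleted it, so it reappears
q.delete(redelivered[0])
```

The redelivery at `now=31` is exactly why consumers of such a queue should be prepared to see the same message twice.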

SQS is underrated for simpler use cases. No partition management, no offset tracking, no cluster to operate. For a job queue that processes thousands of tasks per day, SQS with Lambda triggers is operationally simpler than a Kafka cluster and requires zero infrastructure maintenance.

SNS is a managed pub/sub broadcast. Publish a message once; SNS delivers it to all subscribed endpoints - SQS queues, Lambda functions, HTTP endpoints, email addresses. SNS is the fan-out layer: one event triggers multiple SQS queues, each consumed by an independent downstream service.

EventBridge adds content-based routing on top of SNS-style pub/sub. An event from a source (your application, AWS services, or SaaS partners) is matched against rules based on event content - route order.placed events with order.total > 1000 to a high-value order queue, route all others to the standard queue. EventBridge is the right choice when you need sophisticated routing logic without writing and maintaining that logic yourself. It also natively integrates with 200+ AWS services as event sources.
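As an illustration of content-based routing, the sketch below matches events against a pattern shaped like an EventBridge rule (lists of allowed values, plus a `numeric` comparison). It is a simplified stand-in written for this article and covers only a small subset of what the real service supports; the `shop.orders` source and the `total` field are assumptions.

```python
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}


def value_matches(candidates, value):
    """True if value equals any candidate or satisfies a numeric condition."""
    for candidate in candidates:
        if isinstance(candidate, dict) and "numeric" in candidate:
            spec = candidate["numeric"]   # e.g. [">", 1000] or [">", 0, "<=", 5]
            pairs = zip(spec[0::2], spec[1::2])
            if isinstance(value, (int, float)) and all(
                OPS[op](value, bound) for op, bound in pairs
            ):
                return True
        elif candidate == value:
            return True
    return False


def matches(pattern, event):
    """Every field named in the pattern must be present in the event and match."""
    for field, expected in pattern.items():
        if field not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[field], dict) or not matches(expected, event[field]):
                return False
        elif not value_matches(expected, event[field]):
            return False
    return True


high_value_rule = {
    "detail-type": ["order.placed"],
    "detail": {"total": [{"numeric": [">", 1000]}]},
}
big_order = {"source": "shop.orders", "detail-type": "order.placed", "detail": {"total": 1500}}
small_order = {"source": "shop.orders", "detail-type": "order.placed", "detail": {"total": 200}}
```

Only `big_order` satisfies `high_value_rule`; a second rule without the `numeric` condition would catch the rest.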

The decision framework: SQS for point-to-point task queues. SNS + SQS for fan-out to multiple consumers. EventBridge when you need content-based routing or SaaS integrations.

Exactly-Once Delivery: The Distributed Systems Lie

Kafka’s marketing material references “exactly-once semantics.” This phrase needs careful unpacking because it is simultaneously true and misleading.

In a distributed system, a producer sending a message and the broker persisting it are two separate operations. The network between them can fail after the broker persists the message but before the broker sends an acknowledgment. The producer doesn’t know if the message was persisted. It retries. The message is written twice.

Kafka addresses this in two layers. The idempotent producer tags each batch with a producer ID and sequence number so the broker can discard retried duplicates. The transactional API goes further, coordinating a two-phase commit between the producer and the broker: messages are written transactionally and consumers can be configured to read only committed messages. This provides exactly-once semantics within the Kafka system itself. But if your consumer’s job is to write the message content to a database, the database write is a separate operation. If the consumer crashes after the database write but before committing its Kafka offset, the message will be reprocessed and the database will receive a duplicate write.

Effectively-exactly-once is the practical alternative: use at-least-once delivery (guaranteed no message loss, possible duplicates) combined with idempotent consumers (processing the same message twice produces the same result as processing it once). Store a processed-message ID in the database; on receipt, check if already processed; skip if so. This is architecturally simpler than true exactly-once and works across any messaging system.
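A minimal sketch of an idempotent consumer, using `sqlite3` so it runs anywhere. The table names, the `handle` function, and the balance example are hypothetical. The important detail is that the processed-ID insert and the side effect commit in the same transaction, so a crash cannot record one without the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY);
    CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL);
    INSERT INTO balances VALUES ('acct-1', 0);
""")


def handle(message_id, account, delta):
    """Apply the message once; return False if it was already processed."""
    try:
        with conn:   # dedup record and side effect commit (or roll back) together
            conn.execute(
                "INSERT INTO processed_messages VALUES (?)", (message_id,)
            )
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (delta, account),
            )
        return True
    except sqlite3.IntegrityError:
        return False   # primary-key clash: this message ID was seen before


first_attempt = handle("msg-1", "acct-1", 100)
redelivery = handle("msg-1", "acct-1", 100)   # duplicate delivery of the same message
balance = conn.execute(
    "SELECT amount FROM balances WHERE account = 'acct-1'"
).fetchone()[0]
```

Relying on the primary-key constraint, rather than a separate "check then insert" step, avoids a race between two workers handling the same redelivered message.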

Kafka’s Operational Reality

Kafka is genuinely powerful, but its operational complexity is real and frequently underestimated.

Partition management: Within a consumer group, Kafka’s parallelism is bounded by partition count. Adding partitions after the fact is possible, but it triggers a rebalance and changes which partition each key hashes to, so per-key ordering is not preserved across the change. Getting partition count right upfront is an art - too few and you can’t scale consumer parallelism; too many and you waste broker resources.

Consumer group rebalancing: When consumers join or leave a consumer group (due to scaling events, restarts, or failures), Kafka triggers a rebalance - reassigning partitions among the active consumers. Under the original eager protocol, all consumption in the group pauses during a rebalance. In large consumer groups with many partitions, rebalances can take tens of seconds. Kafka’s incremental cooperative rebalancing (introduced in 2.4) significantly reduces rebalance disruption.

ZooKeeper to KRaft: Kafka historically used ZooKeeper for metadata management and controller election. ZooKeeper is a separate cluster to operate, adding operational burden. KRaft (Kafka Raft) replaces ZooKeeper with an internal Raft-based consensus protocol: it shipped as early access in Kafka 2.8, was declared production-ready in 3.3, and Kafka 4.0 dropped ZooKeeper support entirely. KRaft simplifies deployment significantly - one fewer cluster to manage.

Consumer lag monitoring: The critical operational metric for Kafka consumers is consumer lag - the gap between the latest offset in a partition and the consumer’s committed offset. Growing lag means consumers are falling behind producers. CloudWatch (for Amazon MSK), Confluent Cloud’s own metrics API, and Prometheus with a Kafka exporter expose this metric. Lag-based auto-scaling of consumer groups is a common response, up to the ceiling set by the partition count.
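Lag itself is simple arithmetic over two sets of offsets. The sketch below assumes the offsets have already been fetched into dictionaries keyed by partition; the scaling rule in `desired_consumers` and its `lag_per_consumer` threshold are hypothetical, not a Kafka feature.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def desired_consumers(total_lag, lag_per_consumer, partition_count):
    """One consumer per lag_per_consumer messages of backlog, capped at partitions."""
    wanted = max(1, -(-total_lag // lag_per_consumer))   # ceiling division
    return min(wanted, partition_count)


end_offsets = {0: 1200, 1: 900, 2: 1500}
committed = {0: 1150, 1: 900, 2: 700}
lag = consumer_lag(end_offsets, committed)
total = sum(lag.values())
```

The cap matters: consumers beyond the partition count would sit idle, so lag that keeps growing at that point calls for more partitions or faster processing, not more instances.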

For simpler use cases - job queues, webhook processing, async email - SQS requires none of this operational knowledge. The engineering team that reaches for Kafka for a few thousand messages per day is paying operational complexity tax they don’t need.

Event Sourcing: State as History

Event sourcing flips the persistence model: instead of storing current state (“user balance: $500”), you store the sequence of events that produced that state (“deposited $1000, withdrew $300, withdrew $200”). Current state is reconstructed by replaying events.

The benefits are real. You have a complete, immutable audit log by construction. You can replay events to reconstruct state at any point in time - invaluable for debugging (“what did the system see when this wrong decision was made?”). You can project the same event stream into multiple read models, each optimized for a different query pattern.

The costs are also real. The read path requires projecting events into a queryable view - typically maintained as a separate database updated by consuming the event stream. Schema evolution is harder: events are immutable, so changing the shape of an event requires versioning the schema and handling multiple versions in your replay logic. For high-write-volume entities, the event log grows large and reconstructing current state from scratch becomes expensive (mitigated by snapshotting: storing periodic checkpoints of materialized state).
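Using the balance example from earlier in this section, replay and snapshotting can be sketched as follows. The `Event` type, the `(version, balance)` snapshot shape, and the function names are illustrative choices, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str     # "deposited" or "withdrew"
    amount: int


def apply(balance, event):
    """Fold one event into the current balance."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrew":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, snapshot=(0, 0)):
    """Rebuild state from a (version, balance) snapshot plus any later events."""
    version, balance = snapshot
    for event in events[version:]:
        balance = apply(balance, event)
    return len(events), balance


events = [Event("deposited", 1000), Event("withdrew", 300), Event("withdrew", 200)]

full = replay(events)                  # replay from the beginning
as_of_second = replay(events[:2])      # state as it stood after two events
from_snapshot = replay(events, snapshot=as_of_second)   # only the third event is applied
```

Replaying a prefix of the log is the "time travel" benefit; starting from a snapshot is the mitigation for long logs.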

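Event sourcing pairs naturally with Kafka: the event log is a Kafka topic with long (or unlimited) retention. Compaction is the wrong setting for the log itself, because it discards older events for a key; it fits derived snapshot or latest-state topics instead. This combination - CQRS (Command Query Responsibility Segregation) + event sourcing on Kafka - appears in systems like LinkedIn’s activity feed and Confluent’s stream processing products, and is discussed in Martin Fowler’s influential writing on microservices.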

Serverless Event Processing and Tiered Storage

The integration of SQS and SNS with AWS Lambda represents the operational end-game for simple event-driven architectures: no consumers to deploy, no instances to manage, no scaling to configure. Lambda polls SQS automatically and invokes your function for each batch of messages. Dead-letter queues handle retry limits. CloudWatch logs capture function output. For moderate-volume workloads, the operational cost approaches zero.

Kafka tiered storage is the architectural direction for very high-retention event logs: store recent segments on fast local broker disks, and offload older segments to S3. S3 acts as effectively unlimited, cheap long-term storage for the event log; brokers only serve recent data from local disk. Confluent Cloud, AWS MSK, and WarpStream (a new entrant that stores all data in S3) all offer variations on this model. The implication: Kafka becomes a streaming layer atop object storage, breaking the traditional coupling between retention and broker disk capacity.

Summary

Concept	Key Insight
Async decoupling	Fast user response + spike absorption; downstream failures don’t cascade
Point-to-point vs pub/sub	Task queue for discrete work; pub/sub for fan-out to independent consumers
Kafka as a log	Replay-friendly; multiple consumer groups at independent offsets
RabbitMQ vs Kafka	RabbitMQ for routing + per-message ACK; Kafka for replay + high throughput
SQS vs SNS vs EventBridge	Task queues vs fan-out vs content-based routing
Exactly-once delivery	Nearly impossible end-to-end; use idempotent at-least-once instead
Kafka operational reality	Partition tuning, rebalancing, consumer lag monitoring; manageable, but only with experience
Event sourcing	Audit log + time-travel by construction; replay complexity is the cost
Tiered storage	S3 as infinite Kafka log; decouples retention from broker disk capacity

Read Next:

Microservices - Small Services, Large Coordination Problems