Kafka Internals - The Distributed Log That Became an Ecosystem // Megha Bose

Helpful context:

Apache Kafka is described as a message queue, a pub/sub system, a streaming platform, and an event store - all at once, all accurately. It is production infrastructure at LinkedIn (where it was built in 2011), Uber, Netflix, Airbnb, and virtually every company that processes large volumes of real-time data. Understanding why requires understanding what Kafka actually is underneath: a distributed, replicated, persistent log. Not a queue. A log.

The distinction is not semantic.

The Problem With Traditional Message Queues

RabbitMQ and similar brokers deliver messages and then delete them. A consumer receives a message, processes it, and it is gone. This is the right model for task queues: you send a job to a worker, the worker does it, done.

But consider activity events: every click, every page view, every purchase. You want the analytics service to process these events. You also want the recommendations engine to process them. And the fraud detection system. And the audit log. And you want to be able to replay them if a bug corrupts your analytics database.

A traditional queue fails at this. When the analytics service consumes a message, it is gone - the recommendations engine cannot see it. You would have to fan out every message to N separate queues for N consumers, which is architecturally ugly and doubles, triples, or N-tuples your storage. And replay is impossible: once consumed, messages are gone.

Kafka’s insight: the log is the right abstraction for events. An append-only log keeps everything. Consumers read from whatever position they want. Multiple consumers can read independently. Replaying from any point in history is trivially re-reading from an earlier offset.

The Anatomy of Kafka

Topic: a named stream of records. “user-clicks,” “order-placed,” “payment-processed.” A topic is the unit of data organization.

Partition: a topic is split into partitions for parallelism and scalability. Each partition is an ordered, immutable sequence of records stored on disk, on one broker. Within a partition, each record has a sequential integer called an offset. Partition 0 of the “user-clicks” topic might have offsets 0 through 4,831,227. Partition 1 has its own independent sequence.

The number of partitions determines the maximum parallelism for consumers. A topic with 12 partitions can be consumed by up to 12 parallel consumers simultaneously. More partitions = more throughput but more file handles, more ZooKeeper znodes (in older Kafka), and more leader elections.

Broker: a Kafka server. A cluster has multiple brokers. Each partition’s data lives on one broker (the leader for that partition), with replicas on others.

Offset: the position of a record within a partition. The offset is the consumer’s bookmark. Unlike a queue (where the broker tracks what has been delivered to each consumer), in Kafka the consumer tracks its own offset. This is a fundamental design choice that has enormous implications.

Consumer group: a set of consumers that collectively consume a topic. Each partition is assigned to exactly one consumer within the group. If the topic has 12 partitions and there are 4 consumers in the group, each consumer gets 3 partitions. If there are 12 consumers, each gets 1. If there are 13 consumers, one sits idle (you cannot have more consumers than partitions in a group).

Different consumer groups are completely independent. The analytics service’s consumer group and the fraud detection service’s consumer group each maintain their own offsets. When analytics reads partition 0 up to offset 5000, fraud detection can still be at offset 3000, and a third group replaying events can be at offset 0. Nobody interferes with anybody.

Producer: a client that writes records to a topic. The producer chooses which partition to write to. The default: round-robin. With a partition key: hash(key) mod N. Using a partition key means all records with the same key always go to the same partition, which guarantees ordering within a key (e.g., all events for a specific user_id are in order).

The Log on Disk

Each partition is physically stored as a directory of files on the broker’s disk. Within the directory, records are organized into segments: files of up to 1GB (configurable). The active segment is the one being written to. Older segments are immutable.

The internal structure of a segment is simple: records are appended sequentially. Each record has its offset, timestamp, key, value, and headers. Reading record at offset $k$ is finding the right segment (binary search on the index file) and reading from the appropriate position.

This sequential layout is why Kafka is fast. Writes are always appends to the active segment - sequential disk writes. Reads for a consumer maintaining sequential offsets are sequential reads. The OS page cache aggressively prefetches sequential reads, meaning hot partitions often serve consumers entirely from memory without touching disk. Kafka uses zero-copy I/O: sendfile() moves data from the page cache directly to the network socket without copying through user space, eliminating a CPU bottleneck that would otherwise limit throughput.

A broker with spinning disks can sustain hundreds of MB/s of sequential I/O. A broker with SSDs can push gigabytes per second. At LinkedIn, individual brokers handle millions of writes per second.

Retention: Time-Based and Size-Based

Unlike a queue, Kafka does not delete records when they are consumed. Records are retained based on time or size.

Time-based retention: delete records older than N days. Default: 7 days. With 7-day retention, any consumer can replay up to a week of history.

Size-based retention: delete the oldest records when total partition size exceeds N bytes.

Compact topics: instead of deleting by time or size, only keep the latest record for each key. This is Kafka’s “log compaction” mode. Useful for topics that represent current state: the compacted topic is eventually a snapshot of the latest value per key, deduplicated by history. Kafka Streams uses compacted topics to maintain KV state.

Retention policy is per-topic. An event topic might keep 7 days of history. A configuration change topic might use log compaction to keep only the latest config.

Replication: Leaders and Followers

Each partition has a configured replication factor (typically 3). One broker is the leader for that partition; the others are followers (also called replicas).

Producers write to the leader. The leader appends to its log and immediately replicates to followers via the followers' fetch requests (followers poll the leader, not the other way). Consumers also read from the leader by default (Kafka 2.4+ supports reading from the nearest replica for latency optimization).

ISR (In-Sync Replicas) is the set of replicas that are “caught up” with the leader - not more than some configurable lag behind. The leader only considers an ISR member eligible to be a new leader if the current leader fails.

acks setting: producers configure how many replicas must acknowledge a write before it is considered successful.

acks=0: fire and forget. Fastest, no durability guarantee.
acks=1: the leader has received and written the record. Default in older Kafka. Risk: if the leader fails before followers replicate, the record is lost.
acks=all (or acks=-1): all ISR members have written the record. Highest durability. Combined with min.insync.replicas=2, guarantees the record survives the loss of any one broker.

The throughput-durability tradeoff is explicit and configurable per producer.

Offset Management and Exactly-Once Semantics

Consumers commit their offsets back to Kafka (to an internal topic called __consumer_offsets). This is how a consumer group remembers where it left off across restarts. On restart, the consumer fetches its last committed offset and resumes from there.

This creates at-least-once delivery by default: the consumer processes some records, then commits the offset. If it crashes between processing and committing, it re-reads and reprocesses those records on restart. Idempotent downstream processing (or deduplication on the consumer side) handles duplicates.

Kafka 0.11 introduced idempotent producers: the broker deduplicates retried writes using producer IDs and sequence numbers, guaranteeing exactly-once delivery from producer to broker even under retries.

Transactions extend this across multiple partitions: a producer can atomically write to multiple partitions, and a consumer can atomically read-process-write in a consume-transform-produce pipeline. Together these give end-to-end exactly-once semantics, which is what Kafka Streams uses internally.

Consumer Lag and Monitoring

Consumer lag is the difference between the latest offset in a partition and the consumer’s current offset. Lag = 0 means the consumer is caught up. Lag = 10,000 means the consumer is 10,000 records behind.

Consumer lag is the primary health metric for Kafka consumers. Rising lag means: the consumer is too slow (scale it out or optimize), the producer is producing faster than expected (is there a traffic spike?), or the consumer has crashed (lag grows indefinitely without processing). Monitoring consumer lag per consumer group per partition is essential. Tools: Kafka’s own consumer-groups.sh, Burrow (LinkedIn’s dedicated lag monitoring service), Prometheus exporters.

Kafka Streams and KSQL

Kafka’s ecosystem has grown beyond the broker.

Kafka Streams is a Java library for building streaming applications that read from Kafka, process data (filter, aggregate, join, window), and write results back to Kafka. It runs inside your application - no separate cluster. Stream processing operations are compiled to a DAG of processing nodes, each processing one record at a time. State stores (backed by RocksDB locally, compacted Kafka topics remotely) enable stateful aggregations.

ksqlDB is a SQL-like interface over Kafka Streams. You write SQL queries against Kafka topics as if they were database tables. Running queries are continuously processing new records. This makes stream processing accessible without writing Java code.

Kafka Connect is a framework for streaming data between Kafka and external systems: pull from a MySQL database into Kafka (CDC - change data capture), push from Kafka into Elasticsearch, S3, Snowflake, BigQuery. Connectors are plug-and-play: you configure which tables to pull or which topic to push, and Connect handles parallelism, offset tracking, and restart behavior.

Kafka vs. Everything Else

	Kafka	RabbitMQ	AWS SQS	Redis Streams
Message retention	Time/size based	Until consumed	14 days max	Configurable
Replay	Yes, any offset	No	No	Yes
Multiple consumers	Yes, independent groups	Competing consumers	Competing consumers	Yes, groups
Throughput	Very high (millions/s)	High (thousands/s)	High	Moderate
Ordering	Per partition	Per queue	No (FIFO queues: per group)	Per stream
Use case	Event streaming, log aggregation	Task queues, routing	Task queues, simple fanout	Lightweight streaming

Kafka is not always the right answer. For simple task queues where you do not need replay or multiple independent consumers, RabbitMQ or SQS are simpler to operate. For very low latency (sub-millisecond), neither is suitable - consider Redis or direct RPCs. Kafka’s operational complexity (the cluster, ZooKeeper or KRaft, topic partitioning decisions, retention tuning) is justified when you need its specific properties: high throughput, persistence, replay, and multiple independent consumer groups.

KRaft mode (Kafka 2.8+): Kafka is replacing ZooKeeper with its own Raft-based metadata management, simplifying the operational model significantly. A KRaft cluster requires no external coordination service.

Read next: