The Cost of the Kernel Network Stack

When a packet arrives at a NIC and you call recv() in your application, the following happens: the NIC raises a hardware interrupt, the kernel interrupt handler runs, the packet is copied into a socket buffer in kernel memory, the kernel processes protocol headers through multiple layers (Ethernet, IP, TCP), the packet is placed in the socket’s receive queue, and your application’s recv() call copies data from kernel space to user space. Context switches between user and kernel mode add overhead at both ends.
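
A minimal blocking UDP receiver makes the baseline concrete - every recvfrom() below pays the full path just described (a sketch; the port number is arbitrary):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    // Ordinary kernel-stack socket: every packet is copied kernel -> user
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);  // example port
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    for (;;) {
        // Blocks until a packet has traversed the full kernel stack, then
        // copies the payload into user space - one syscall per packet
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n < 0)
            break;
        // ... application logic ...
    }
    close(fd);
    return 0;
}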

The total cost for a typical UDP packet on a modern Linux system: 1-10 microseconds of latency. For a single connection this is often acceptable. At high packet rates - millions of packets per second - this model breaks down. The CPU spends more time in interrupt handling and kernel stack processing than in application logic. Throughput plateaus well below the NIC’s rated capacity.

The kernel network stack was not designed for microsecond latency requirements. It was designed for correctness, generality, and security across all possible workloads. For applications that need every nanosecond - high-frequency trading, 100G packet processing, distributed ML training at scale - the right answer is to bypass the kernel entirely.

DPDK: Data Plane Development Kit

DPDK (originally from Intel, now an open-source project) is the dominant framework for kernel bypass networking. Its core idea: move the NIC driver into userspace, and have application threads poll the NIC directly rather than waiting for interrupts.

Poll-mode drivers (PMDs): instead of the NIC raising an interrupt that wakes the kernel, DPDK’s PMD has a dedicated CPU core spinning in a tight loop, calling rte_eth_rx_burst() to drain packets from the NIC’s receive queues. Latency is bounded by the polling interval - typically under 1 microsecond, sometimes under 100 nanoseconds.

Zero-copy packet processing: DPDK allocates packet buffers (rte_mbuf) in hugepage memory that is DMA-mapped to the NIC. The NIC writes packets directly into these buffers. The application reads them directly. No copies between kernel and user space - the data never touches the kernel at all.
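
A sketch of a packet handler makes this concrete - headers are parsed in place, straight out of the buffer the NIC wrote via DMA (the IPv4 check is just an example; this plays the role of the process_packet() called by the polling loop below):

#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_mbuf.h>

static void process_packet(struct rte_mbuf *m) {
    // rte_pktmbuf_mtod() returns a pointer into the hugepage buffer the NIC
    // wrote via DMA - no copy is made before parsing
    struct rte_ether_hdr *eth = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
    if (eth->ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4)) {
        struct rte_ipv4_hdr *ip = (struct rte_ipv4_hdr *)(eth + 1);
        // ... application logic on ip->src_addr, ip->dst_addr, payload ...
        (void)ip;
    }
}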

Dedicated CPU cores: DPDK typically pins one or more cores exclusively to packet processing using isolcpus and rte_eal_remote_launch. These cores do nothing but poll the NIC and process packets. The sacrifice of CPU cores is the price for deterministic latency.

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32  // packets drained per poll

// Typical DPDK polling loop
static int lcore_main(void *arg) {
    struct rte_mbuf *pkts[BURST_SIZE];
    uint16_t port = 0;
    uint16_t queue = 0;
    (void)arg;  // unused in this sketch

    while (1) {
        uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            process_packet(pkts[i]);
            rte_pktmbuf_free(pkts[i]);
        }
        // No sleep, no yield - pure spin
    }
    return 0;
}

A single DPDK core on modern hardware can process 10-40 million packets per second - roughly an order of magnitude beyond what interrupt-driven kernel networking can sustain.

RDMA: Remote Direct Memory Access

RDMA eliminates the CPU from the data path entirely for memory transfers between hosts. The NIC handles all protocol processing and DMA operations autonomously. From the application’s perspective, writing to a remote machine’s memory looks like a local memory write - except the data appears in the remote process’s address space without the remote CPU ever being interrupted.

Latency: RDMA operations complete in 1-2 microseconds end-to-end, including network transit. Compare this to the 50-100μs typical for a TCP round trip over a kernel network stack.

Key RDMA operations:

  • RDMA Write: local application writes data directly into a pre-registered region of remote memory. Remote CPU is not involved.
  • RDMA Read: local application reads from a region of remote memory. Again, remote CPU is not involved.
  • Send/Receive: two-sided operation - remote CPU posts a receive buffer, local CPU sends. More like traditional messaging but still with kernel bypass.

RDMA requires special hardware and a registered memory region - the NIC must be given pinned memory, along with its virtual-to-physical mapping and access keys, so that it can DMA into or out of the buffer:

#include <infiniband/verbs.h>

// Register memory with the RDMA NIC - pins the pages and returns access keys
struct ibv_mr *mr = ibv_reg_mr(pd, buffer, buffer_size,
    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

// Local scatter-gather element describing the registered buffer
struct ibv_sge sge = {
    .addr   = (uint64_t)(uintptr_t)buffer,
    .length = buffer_size,
    .lkey   = mr->lkey,
};

// Post a work request for RDMA Write
struct ibv_send_wr wr = {
    .opcode = IBV_WR_RDMA_WRITE,
    .sg_list = &sge,
    .num_sge = 1,
    .wr.rdma.remote_addr = remote_addr,  // destination in remote memory
    .wr.rdma.rkey = remote_rkey,         // remote key obtained out of band
};
struct ibv_send_wr *bad_wr = NULL;
ibv_post_send(qp, &wr, &bad_wr);
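
Completion is detected by polling the completion queue rather than waiting in the kernel - a minimal sketch, assuming cq is the completion queue associated with the queue pair above:

// Spin on the completion queue until the RDMA Write is confirmed
struct ibv_wc wc;
int n;
do {
    n = ibv_poll_cq(cq, 1, &wc);  // non-blocking; returns 0 if the CQ is empty
} while (n == 0);
if (n < 0 || wc.status != IBV_WC_SUCCESS) {
    // handle error: wc.status describes the failure
}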

InfiniBand, RoCE, and iWARP

RDMA is a capability, not a single protocol. Three transports implement it:

InfiniBand: a purpose-built HPC interconnect with native RDMA support. Used in GPU clusters for distributed training (NCCL’s AllReduce over InfiniBand is the standard for large-scale ML). Offers the lowest latency and highest bandwidth - up to 400Gb/s (NDR) per link. Requires dedicated InfiniBand switches and HCAs (Host Channel Adapters), making it expensive.

RoCE (RDMA over Converged Ethernet): RDMA over an Ethernet fabric. RoCEv1 uses Ethernet framing; RoCEv2 uses UDP/IP framing, making it routable. Requires a “lossless” Ethernet fabric (Priority Flow Control to prevent packet drops, since RoCE’s transport copes poorly with loss - a single drop triggers expensive go-back-N retransmission). Widely used in cloud environments and increasingly replacing InfiniBand in ML clusters due to cost.

iWARP: RDMA over TCP. Inherits TCP’s reliability and congestion control, so it works on standard Ethernet without lossless fabric requirements. Higher overhead than RoCE due to TCP processing, but simpler to deploy.

The choice: InfiniBand for maximum performance when the budget allows; RoCEv2 for RDMA on commodity Ethernet with a controlled fabric; iWARP when the network is not under your control.

io_uring: Async I/O Without Syscall Overhead

io_uring (Linux 5.1+, 2019) takes a different approach. Rather than bypassing the kernel, it minimizes the cost of interacting with it. The kernel and application share two ring buffers in memory: a submission queue (SQ) where the application writes I/O requests, and a completion queue (CQ) where the kernel writes results.

The application can submit batches of I/O operations with a single io_uring_enter syscall - or none at all in sqpoll mode, where a kernel thread polls the submission queue continuously, eliminating syscalls entirely.

#include <liburing.h>

struct io_uring ring;
io_uring_queue_init(256, &ring, IORING_SETUP_SQPOLL);

// Submit a read - no syscall if sqpoll is draining fast enough
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, len, offset);
io_uring_sqe_set_data(sqe, user_data);
io_uring_submit(&ring);  // may be a no-op with SQPOLL

// Collect completions
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
int result = cqe->res;
io_uring_cqe_seen(&ring, cqe);

io_uring also supports network sockets directly - accept and connect since Linux 5.5, send and recv since 5.6, with multishot variants arriving in 5.19 and later. For workloads with many concurrent connections doing moderate I/O (web servers, database backends), io_uring offers substantial throughput improvements without the complexity of full kernel bypass.
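
Submitting socket operations looks just like file I/O - a brief sketch (listen_fd, client_fd, and buf are assumed to have been set up elsewhere):

// Queue an accept on a listening socket
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);

// Queue a recv on an already-accepted connection in the same batch
sqe = io_uring_get_sqe(&ring);
io_uring_prep_recv(sqe, client_fd, buf, sizeof(buf), 0);

// One syscall (or none with SQPOLL) submits both operations
io_uring_submit(&ring);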

AF_XDP: Selective Packet Bypass

AF_XDP (an address family built on XDP, the eXpress Data Path) is a Linux socket type introduced in kernel 4.18 that enables selective kernel bypass at the NIC driver level. Unlike DPDK, which takes over the entire NIC, AF_XDP lets you redirect specific traffic (matched by an eBPF/XDP program) into a userspace ring buffer, while normal traffic continues through the kernel stack.

// XDP program (runs in kernel, written in eBPF/C):
// redirect packets matching a rule to the AF_XDP socket bound to this queue
SEC("xdp")
int xdp_filter(struct xdp_md *ctx) {
    if (is_my_flow(ctx))  // is_my_flow(): your classification logic
        return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
    return XDP_PASS;      // everything else goes to the normal kernel stack
}

This is ideal when you only need bypass for a specific port or flow - you keep the benefits of the kernel network stack for everything else without managing your own TCP/IP implementation.
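
On the userspace side, packets redirected by the XDP program are consumed from the AF_XDP socket's RX ring - a rough sketch using the xsk helpers (from libxdp, or libbpf on older systems), assuming the UMEM area umem_area and the ring rx_ring have already been created:

#include <xdp/xsk.h>  // xsk_ring_cons helpers (older setups: <bpf/xsk.h>)

// Drain up to 64 descriptors from the RX ring
__u32 idx = 0;
unsigned int rcvd = xsk_ring_cons__peek(&rx_ring, 64, &idx);
for (unsigned int i = 0; i < rcvd; i++) {
    const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&rx_ring, idx + i);
    // desc->addr is an offset into the shared UMEM - read the frame in place
    void *frame = xsk_umem__get_data(umem_area, desc->addr);
    // ... process desc->len bytes at frame ...
    (void)frame;
}
xsk_ring_cons__release(&rx_ring, rcvd);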

Use Cases

High-frequency trading: order execution systems need to receive a market data feed, compute a trading signal, and send an order in under 1 microsecond. DPDK and RDMA are standard tools. Some firms use FPGA-based NICs that process packets in hardware without any CPU involvement.

Distributed ML training: AllReduce collectives (used to synchronize gradients across GPUs) over InfiniBand with NCCL. A ring AllReduce across 1000 GPUs must complete in milliseconds - kernel networking cannot deliver the required bandwidth and latency at scale.

100G/400G packet processing: network appliances (firewalls, load balancers, intrusion detection systems) processing line-rate traffic. DPDK lets a single server handle 100Gb/s of traffic inspection that would require a rack of hardware with kernel networking.

Examples

DPDK application structure:

#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_launch.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

int main(int argc, char *argv[]) {
    // Initialize EAL (Environment Abstraction Layer)
    rte_eal_init(argc, argv);

    // Create mempool for packet buffers in hugepages
    struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create(
        "MBUF_POOL", 8192, 256, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    // Configure and start the NIC port (default device configuration)
    struct rte_eth_conf port_conf = {0};
    rte_eth_dev_configure(0, 1, 1, &port_conf);
    rte_eth_rx_queue_setup(0, 0, 128, rte_eth_dev_socket_id(0), NULL, mbuf_pool);
    rte_eth_tx_queue_setup(0, 0, 128, rte_eth_dev_socket_id(0), NULL);
    rte_eth_dev_start(0);

    // Launch the polling loop (lcore_main above) on a dedicated lcore
    rte_eal_remote_launch(lcore_main, NULL, 1);
    rte_eal_mp_wait_lcore();
    return 0;
}

io_uring echo server sketch:

// Submit accept, then for each connection submit recv
// On recv completion, submit send with same buffer
// All via ring buffer - no per-operation syscalls in steady state
// Per-connection state carried through the cqe via the user-data pointer
struct conn_state { int fd; int op; size_t len; char buf[4096]; };
enum { OP_RECV, OP_SEND };

struct io_uring_cqe *cqe;
struct io_uring_sqe *sqe;
while (1) {
    io_uring_wait_cqe(&ring, &cqe);
    struct conn_state *conn = io_uring_cqe_get_data(cqe);
    if (conn->op == OP_RECV && cqe->res > 0) {
        conn->len = cqe->res;
        conn->op = OP_SEND;
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_send(sqe, conn->fd, conn->buf, conn->len, 0);
        io_uring_sqe_set_data(sqe, conn);
        io_uring_submit(&ring);
    }
    io_uring_cqe_seen(&ring, cqe);
}

RDMA Write operation flow:

Local host                          Remote host
──────────────────────────────────────────────
1. App: ibv_post_send(RDMA_WRITE)
2. HCA: reads local buffer via DMA
3. HCA: sends RDMA Write packet →  → NIC receives packet
                                    → HCA writes into registered
                                       remote memory via DMA
                                    → HCA sends ACK
4. HCA: receives ACK
5. HCA: posts completion to CQ
6. App: ibv_poll_cq() sees done
                                    (Remote CPU never interrupted)

The remote CPU is completely uninvolved: the packet receive, the memory write, and the ACK are all handled by the remote HCA. The data appears in its memory region as if by magic - which is exactly the point.

