Kernel-Bypass Networking - Sending Packets Without Asking the OS // Megha Bose

Helpful context:

A 100Gbps NIC can push 12.5 gigabytes per second. That is not a bottleneck. At 100Gbps, a minimum-size 64-byte Ethernet frame occupies the wire for 6.72 nanoseconds once its 20 bytes of preamble and inter-frame gap are counted - roughly 150 million packets per second. The hardware is not the problem.
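The packet-rate figure is quick to check by hand. The sketch below is only arithmetic, and the function names are mine: the 64 bytes of the frame alone would serialize in 5.12 ns, but every frame also costs 20 bytes of preamble, start delimiter, and inter-frame gap on the wire, which is where "roughly 150 million" comes from.

```c
#include <assert.h>

/* Bytes each Ethernet frame occupies on the wire beyond the frame itself:
 * 7-byte preamble + 1-byte start delimiter + 12-byte inter-frame gap. */
#define WIRE_OVERHEAD_BYTES 20

/* Time one frame occupies the link, in nanoseconds.
 * 1 Gbps moves exactly 1 bit per nanosecond, so ns = bits / gbps. */
double frame_time_ns(double link_gbps, unsigned frame_bytes) {
    assert(link_gbps > 0.0 && frame_bytes >= 64);
    return (frame_bytes + WIRE_OVERHEAD_BYTES) * 8.0 / link_gbps;
}

/* Maximum packets per second the link can carry at that frame size. */
double max_pps(double link_gbps, unsigned frame_bytes) {
    return 1e9 / frame_time_ns(link_gbps, frame_bytes);
}
```

For a 100Gbps link and 64-byte frames, frame_time_ns gives 6.72 and max_pps gives about 148.8 million.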

The problem is the software stack between the hardware and your application. A packet arriving at the NIC raises a hardware interrupt. The kernel interrupt handler runs. The packet is copied into a kernel socket buffer. The TCP/IP stack processes protocol headers across multiple layers. The packet is placed in the socket’s receive queue. Your application calls recv(), which copies data from kernel space to user space - a second copy, across the kernel-user boundary. Context switches, cache pollution, memory copies.

The total cost for a typical UDP packet on a modern Linux system: 1 - 10 microseconds of added latency. At moderate packet rates this is acceptable. At 10 million packets per second, the CPU spends more time in the kernel stack than in application logic. At 150 million packets per second, the kernel stack cannot keep up at all - packets are dropped before they ever reach application code.

For most applications - web servers, database backends, message queues - the kernel network stack is correct, well-understood, and fast enough. For applications that need every microsecond - high-frequency trading, 100G packet processing, GPU-to-GPU gradient synchronization across a 2000-node training cluster - the kernel itself is the bottleneck. The answer is to bypass it entirely.

The History: From C10K to Kernel Bypass

The first major inflection point in Linux networking performance was the C10K problem (1999): how do you handle 10,000 simultaneous connections without a thread per connection? The answer on Linux was epoll (added during the 2.5 development series in 2002) - an event-driven I/O multiplexing interface that scales to millions of connections. epoll remains the foundation of the best-known high-performance Linux network servers: nginx, Node.js, Redis.

epoll solved the concurrency problem but not the per-packet overhead problem. Each recv() is still a syscall crossing the kernel-user boundary. Each packet still passes through the full kernel TCP/IP stack.
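To make the "still a syscall per packet" point concrete, here is a minimal sketch of the epoll pattern. The function name epoll_recv_once is hypothetical, and a real server would create the epoll instance once and loop over many descriptors rather than building it per call. What it shows is that even the event-driven path pays one syscall to learn about readiness and another to copy the data out.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait until fd is readable, then read once.
 * Returns bytes read, 0 on timeout or EOF, -1 on error. */
ssize_t epoll_recv_once(int fd, void *buf, size_t cap, int timeout_ms) {
    assert(buf != NULL && cap > 0);

    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(ep);
        return -1;
    }

    struct epoll_event ready;
    int n = epoll_wait(ep, &ready, 1, timeout_ms);  /* syscall: wait for readiness */
    ssize_t got = 0;
    if (n < 0)
        got = -1;
    else if (n == 1)
        got = recv(fd, buf, cap, 0);                /* syscall: copy data to user space */

    close(ep);
    return got;
}
```

Called on one end of a socketpair after the other end has sent four bytes, it returns 4. The two marked syscalls are the per-packet cost that epoll does not remove.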

DPDK (Data Plane Development Kit) emerged from Intel in 2010 as the answer to packet-processing performance. The insight: move the NIC driver entirely into user space and have application threads poll the NIC directly, bypassing the kernel entirely. Early adopters were network appliance vendors (firewalls, load balancers, intrusion detection systems) building high-performance middleboxes on commodity x86 hardware.

RDMA (Remote Direct Memory Access) has older roots - it came from InfiniBand, a purpose-built HPC interconnect developed in the late 1990s for supercomputer clusters. The idea: let the NIC transfer data between hosts without involving either host’s CPU. The CPU sets up the transfer; the NIC does the rest. HPC clusters adopted RDMA for MPI message passing; ML training clusters adopted it for gradient synchronization.

io_uring is the most recent addition (Linux 5.1, 2019). It does not bypass the kernel but minimizes the cost of talking to it, using a shared ring buffer between kernel and user space to submit and collect I/O operations with zero or near-zero syscall overhead.

Why the Kernel Network Stack Is Slow

Understanding the sources of overhead tells you which solution to apply.

Syscall cost: every send() and recv() crosses the user-kernel boundary. On x86, a syscall takes roughly 100-300 nanoseconds on modern hardware (the cost increased after the Meltdown and Spectre mitigations added KPTI, retpoline, and IBRS overhead). At 10 million send operations per second, syscall cost alone consumes 1 - 3 seconds of CPU per second - one to three full cores dedicated to syscall overhead.

Data copies: the kernel cannot directly DMA received packets into user-space memory (it does not know which user-space process the packet belongs to yet, and user memory may not be pinned or physically contiguous). The NIC first places the packet in a kernel buffer via DMA, then recv() copies it again to user space with the CPU. Two trips through memory per packet. At 10Gbps, the CPU-driven copy alone is 1.25 GB/s of copy bandwidth just for the receive path.

Interrupt overhead: the NIC interrupts the CPU for every packet (or batch, with interrupt coalescing). Each interrupt preempts the current thread, invokes the interrupt handler, processes the packet, and returns. On the receive path, this creates latency spikes whenever a high-priority thread gets interrupted.

Lock contention in the network stack: the kernel socket receive queue is protected by a spinlock. Under high receive rates with many cores, this becomes a contention point.

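Context switch overhead: when a thread blocks on recv() waiting for data, it is descheduled. When data arrives, it is rescheduled. The switch itself - saving and restoring registers, a TLB flush when the address space changes, cache warming - costs a few microseconds, and the wake-up delay around it can stretch to hundreds of microseconds in the worst case, for example when the core has dropped into a deep sleep state.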

DPDK: Polling-Mode User-Space Networking

DPDK’s approach: eliminate the kernel from the packet path entirely.

Poll-mode drivers (PMDs): instead of the NIC interrupting the CPU, a DPDK application dedicates one or more CPU cores to spinning in a tight polling loop, calling rte_eth_rx_burst() to drain packets from the NIC’s hardware receive queues. Latency is bounded by the polling interval - typically under 1 microsecond, often under 100 nanoseconds.

Hugepage memory: DPDK allocates packet buffers (rte_mbuf) in hugepages (2MB or 1GB pages). Hugepages reduce TLB pressure (fewer TLB entries needed to cover the same address range), and because each hugepage is pinned and physically contiguous, the NIC gets stable DMA targets without the buffer pool being scattered across thousands of small pages.

Zero-copy: the NIC writes received packets directly into DPDK’s preallocated hugepage buffers via DMA. The application reads packets directly from those buffers. No copies between kernel and user space - the kernel is never involved.

Dedicated CPU isolation: the isolcpus kernel boot parameter keeps the general scheduler off the polling cores, DPDK's EAL pins one thread to each of them, and rte_eal_remote_launch starts the polling function there. The core spins continuously with minimal interruption.
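To put the hugepage point above in numbers, the sketch below counts how many pages (and so, at most, how many TLB entries) it takes to map a buffer pool. The 1 GiB pool size in the note that follows is an arbitrary assumption for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages needed to map `bytes` of memory, rounding up. */
uint64_t pages_needed(uint64_t bytes, uint64_t page_size) {
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

A 1 GiB pool needs 262,144 pages at 4 KiB, 512 pages at 2 MiB, and a single page at 1 GiB.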

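#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <rte_debug.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static volatile bool running = true;    // cleared by a signal handler (not shown)
static struct rte_eth_conf port_conf;   // zero-initialized: default port settings

// Application-specific packet handler (placeholder)
static void process_packet(struct rte_mbuf *pkt) { (void)pkt; }

// DPDK polling loop - runs on a dedicated isolated core
static int lcore_rx(void *arg) {
    (void)arg;
    struct rte_mbuf *pkts[BURST_SIZE];
    const uint16_t port = 0, queue = 0;

    while (running) {
        uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            process_packet(pkts[i]);
            rte_pktmbuf_free(pkts[i]);
        }
        // No sleep, no yield. Pure spin.
        // This core is 100% utilized regardless of traffic rate.
    }
    return 0;
}

int main(int argc, char *argv[]) {
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "pool", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL)
        rte_exit(EXIT_FAILURE, "mbuf pool creation failed\n");

    // Each setup call returns a negative value on failure
    if (rte_eth_dev_configure(0, 1, 1, &port_conf) < 0 ||
        rte_eth_rx_queue_setup(0, 0, 128, rte_eth_dev_socket_id(0), NULL, pool) < 0 ||
        rte_eth_tx_queue_setup(0, 0, 128, rte_eth_dev_socket_id(0), NULL) < 0 ||
        rte_eth_dev_start(0) < 0)
        rte_exit(EXIT_FAILURE, "port 0 setup failed\n");

    rte_eal_remote_launch(lcore_rx, NULL, 1);  // launch on lcore 1
    rte_eal_mp_wait_lcore();
    return 0;
}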

A single DPDK core on modern hardware can sustain 10 - 40 million packets per second - compared to 1 - 3 million with kernel networking on the same hardware. For 100Gbps at minimum frame size (150M pps), you need multiple cores and careful batching.
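The "multiple cores" claim follows from a cycle budget. The sketch below assumes a 3 GHz core and a per-packet cost that you supply; both numbers are illustrative assumptions, not measurements, and the function names are mine.

```c
#include <assert.h>

/* CPU cycles available per packet on one core at a given packet rate. */
double cycles_per_packet(double cpu_hz, double pps) {
    assert(cpu_hz > 0.0 && pps > 0.0);
    return cpu_hz / pps;
}

/* Cores needed if each packet costs `cycles_each` cycles, rounded up. */
unsigned cores_needed(double cpu_hz, double pps, double cycles_each) {
    assert(cpu_hz > 0.0 && pps >= 0.0 && cycles_each >= 0.0);
    double cores = pps * cycles_each / cpu_hz;
    unsigned whole = (unsigned)cores;
    return (cores > (double)whole) ? whole + 1 : whole;
}
```

At 150M pps a single 3 GHz core has 20 cycles per packet, less than one cache miss. If processing costs 100 cycles per packet, cores_needed returns 5. Batching matters because it spreads fixed costs, such as the rte_eth_rx_burst call itself, across many packets.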

The cost of DPDK: you own the entire network stack. DPDK gives you raw packets; if you want TCP, you implement TCP (or use a user-space TCP stack like mTCP or VPP’s built-in stack). The polling core is 100% utilized regardless of traffic - a DPDK application on an idle 100Gbps link still burns a core doing nothing. Operational complexity is significant: hugepage configuration, driver binding (dpdk-devbind), NUMA-aware memory placement. Most applications should not use DPDK.

RDMA: Moving Data Without CPU Involvement

RDMA takes a more radical approach: eliminate the CPU from the data transfer path entirely, on both ends.

The NIC (called an HCA, Host Channel Adapter, in RDMA terminology) handles all protocol processing autonomously. To perform an RDMA write, the application posts a work request to a queue pair; the HCA reads the source buffer via local DMA, constructs the network packet, and transmits it. On the remote end, the receiving HCA receives the packet, validates it, and writes the payload directly into the destination memory region via DMA - without ever interrupting the remote CPU. The remote CPU has no idea a write happened: a plain RDMA write generates no completion on the remote side, so the receiver learns of it only by polling the memory itself, from a follow-up message, or from a write-with-immediate, which does post an entry to the remote completion queue.

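#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

// Fragment from inside a function returning int. Assumes pd, qp, cq, buffer,
// size, remote_addr and remote_rkey were set up during connection
// establishment (not shown).

// Register memory with the RDMA HCA
// The memory must be pinned (cannot be paged out)
struct ibv_mr *mr = ibv_reg_mr(pd, buffer, size,
    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
if (mr == NULL) {
    fprintf(stderr, "ibv_reg_mr failed\n");
    return -1;
}

// Post an RDMA Write work request
struct ibv_sge sge = {
    .addr   = (uintptr_t)buffer,
    .length = size,
    .lkey   = mr->lkey,
};
struct ibv_send_wr wr = {
    .opcode          = IBV_WR_RDMA_WRITE,
    .sg_list         = &sge,
    .num_sge         = 1,
    .wr.rdma.remote_addr = remote_addr,
    .wr.rdma.rkey        = remote_rkey,
    .send_flags      = IBV_SEND_SIGNALED,
};
struct ibv_send_wr *bad_wr;
if (ibv_post_send(qp, &wr, &bad_wr) != 0) {
    fprintf(stderr, "ibv_post_send failed\n");
    return -1;
}

// Poll the local completion queue
struct ibv_wc wc;
int n;
do {
    n = ibv_poll_cq(cq, 1, &wc);  // 0 = nothing yet, negative = error
} while (n == 0);                 // spin until complete
if (n < 0 || wc.status != IBV_WC_SUCCESS) {
    fprintf(stderr, "RDMA write failed\n");
    return -1;
}
// wc.status == IBV_WC_SUCCESS → the write landed in remote memory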

RDMA latency: 1 - 2 microseconds end-to-end for a round trip on InfiniBand. Compare to 50 - 100 microseconds for a TCP round trip on kernel networking. That is a difference of roughly 25 - 100x in latency.

Three transports implement RDMA:

InfiniBand: purpose-built HPC interconnect with native RDMA. No IP addressing - its own fabric protocol. Lowest latency, highest bandwidth (up to 400Gb/s per link, NDR generation). Requires dedicated InfiniBand switches and HCAs (Mellanox/NVIDIA ConnectX series are the standard). Used in almost every large GPU training cluster for NCCL AllReduce operations. The cost is significant: InfiniBand switches are 3-5x more expensive than equivalent Ethernet switches.

RoCE (RDMA over Converged Ethernet): RDMA semantics over Ethernet. RoCEv2 encapsulates RDMA packets in UDP/IP, making them routable. Works on standard Ethernet hardware with RDMA-capable NICs. Requires a lossless fabric in practice - Priority Flow Control (PFC) to prevent packet drops, since the RDMA transport was designed for lossless delivery and its loss recovery (typically go-back-N retransmission in the NIC) degrades sharply under even modest loss. Used in hyperscale data centers (Microsoft Azure uses RoCEv2 extensively; AWS uses a custom protocol in the same design space).

iWARP: RDMA over TCP. Inherits TCP’s reliability and congestion control, eliminating the lossless fabric requirement. Runs on standard Ethernet with no special fabric configuration. Higher overhead than RoCE (TCP processing) but simpler to deploy. Used where the network is not under tight control.

io_uring: Fewer Syscalls, Same Kernel

io_uring (2019, merged into Linux 5.1) takes a different approach from DPDK and RDMA. Instead of bypassing the kernel, it minimizes the cost of interacting with it.

The kernel and application share two ring buffers in a memory region mapped into both address spaces: a submission queue (SQ) where the application writes I/O requests, and a completion queue (CQ) where the kernel writes results. No copying - both sides read and write the same memory.

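In SQPOLL mode, a kernel thread polls the submission queue. While that thread is awake, the application writes requests to the SQ ring and the kernel thread picks them up immediately - no syscall needed for submission. If the queue stays empty for a configurable idle period, the thread goes to sleep and the next submission needs one io_uring_enter() call to wake it. Completions appear in the CQ ring; the application polls the CQ with io_uring_peek_cqe() without a syscall. Under sustained load, the only required syscall is the initial setup.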

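#include <errno.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>

// Fragment from inside a function returning int.
// Assumes fd, buf, len, offset and user_data are defined by the caller.
struct io_uring ring;
int ret = io_uring_queue_init(256, &ring, IORING_SETUP_SQPOLL);
if (ret < 0) {  // liburing returns -errno; SQPOLL may need privileges on older kernels
    fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
    return -1;
}

// Submit a read - no syscall if sqpoll is active and the kernel thread is spinning
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
if (sqe == NULL)  // submission queue is full
    return -1;
io_uring_prep_read(sqe, fd, buf, len, offset);
io_uring_sqe_set_data(sqe, user_data);
io_uring_submit(&ring);  // with SQPOLL: a syscall only if the poller thread is asleep

// Collect completion - no syscall if CQ ring is not empty
struct io_uring_cqe *cqe;
while ((ret = io_uring_peek_cqe(&ring, &cqe)) == -EAGAIN)
    ;  // spin on CQ
if (ret < 0)
    return -1;
int result = cqe->res;  // bytes read, or -errno on failure
io_uring_cqe_seen(&ring, cqe);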

By Linux 5.19, io_uring supports the full set of socket operations: send, recv, accept, connect, sendmsg, recvmsg. For a high-throughput web server or database backend, io_uring reduces syscall overhead from O(requests) to O(1) at steady state, while keeping the full kernel TCP/IP stack for correctness and congestion control.

io_uring is the right choice for: web servers (Rust’s tokio ecosystem supports io_uring through the tokio-uring crate), database backends (ScyllaDB uses io_uring for storage I/O), any high-IOPS application that is not processing raw packets. As a rule of thumb, it provides 80% of the performance benefit of kernel bypass with 10% of the operational complexity.

AF_XDP: Selective Bypass

AF_XDP (eXpress Data Path socket, Linux 4.18+) sits between the full kernel stack and DPDK’s all-or-nothing approach. An eBPF/XDP program runs at the NIC driver level and redirects specific packets - matched by port, protocol, or any arbitrary rule - into a user-space ring buffer, while all other traffic continues through the normal kernel stack.

This is ideal for: processing a specific high-throughput flow (say, all UDP packets on port 8000) with user-space speed, while leaving TCP connections and other traffic to the kernel. You keep the kernel stack for management traffic, control protocols, and general connectivity; you bypass it only for the hot data path.
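The matching rule itself is simple. The sketch below expresses the "UDP port 8000" decision as an ordinary C function so it can be run and tested in user space; wants_fast_path is a hypothetical name. In a real deployment the same checks would live in an XDP program, which would redirect matching packets to the AF_XDP socket and return XDP_PASS for everything else. VLAN tags and IPv6 are deliberately ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN     14
#define ETHERTYPE_IPV4  0x0800
#define IPPROTO_UDP_NUM 17

/* Read a big-endian 16-bit field. */
static uint16_t be16(const uint8_t *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* True if the frame is untagged IPv4/UDP with the given destination port.
 * Every read is bounds-checked first, as an eBPF verifier would insist. */
bool wants_fast_path(const uint8_t *frame, size_t len, uint16_t dst_port) {
    assert(frame != NULL);
    if (len < ETH_HDR_LEN + 20)
        return false;
    if (be16(frame + 12) != ETHERTYPE_IPV4)
        return false;

    const uint8_t *ip = frame + ETH_HDR_LEN;
    if ((ip[0] >> 4) != 4)                      /* IP version */
        return false;
    size_t ihl = (size_t)(ip[0] & 0x0F) * 4;    /* IPv4 header length in bytes */
    if (ihl < 20 || len < ETH_HDR_LEN + ihl + 8)
        return false;
    if (ip[9] != IPPROTO_UDP_NUM)
        return false;
    /* Non-first fragments carry no UDP header; leave them to the kernel. */
    if ((be16(ip + 6) & 0x1FFF) != 0)
        return false;

    const uint8_t *udp = ip + ihl;
    return be16(udp + 2) == dst_port;           /* UDP destination port */
}
```

Anything for which this returns false, including TCP, ARP, and malformed frames, stays on the normal kernel path, which is the whole appeal of the selective approach.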

AWS: How Cloud Abstracts Kernel Bypass

AWS builds DPDK-like kernel bypass into its networking infrastructure so you get the performance without managing it directly.

AWS ENA (Elastic Network Adapter) is the virtual NIC used on most current-generation instances. ENA supports SR-IOV (Single Root I/O Virtualization), which gives EC2 instances direct hardware access to a physical NIC queue - bypassing the hypervisor’s software networking path. The ENA driver is open-source, and DPDK ships an ENA poll-mode driver for applications that need raw packet processing.

AWS EFA (Elastic Fabric Adapter) exposes RDMA-like semantics between EC2 instances. EFA uses the SRD (Scalable Reliable Datagram) protocol - a custom AWS protocol designed for low-latency, high-bandwidth inter-instance communication in ML training and HPC workloads. SRD uses multiple network paths simultaneously (multipathing), spreading load across the full bisection bandwidth of the network, which RDMA over a single path cannot do. NCCL, the NVIDIA collective communication library, has native EFA support; PyTorch distributed training on p4d.24xlarge instances uses NCCL over EFA for gradient AllReduce.

Within a cluster placement group, EFA achieves ~100Gbps bandwidth per instance with sub-100-microsecond latency - competitive with on-premises InfiniBand for most training workloads.

SmartNICs: Programmable Offload

The next evolution is moving the network stack into the NIC itself - programmable network processing without any host CPU involvement.

The AWS Nitro card is a custom ASIC that handles network I/O, EBS storage I/O, and hypervisor functions. The Nitro card processes packets at line rate without using host CPU cycles, which is why C5 and later instances can dedicate all host CPU to the application. The Nitro card is itself a small ARM-based computer running a specialized OS.

NVIDIA BlueField (formerly Mellanox) is the commercial SmartNIC in this space: an ARM processor complex integrated with a 100/200/400Gbps NIC. You can run arbitrary Linux software on the BlueField’s onboard ARM cores - storage encryption, firewall rules, load balancing - while the host CPU is freed from all network processing. BlueField is deployed in data centers for secure multi-tenant networking, DPU (Data Processing Unit) offload, and storage acceleration.

SmartNICs represent the logical endpoint of kernel bypass: the NIC becomes a full network computer, and the host CPU is entirely off the critical path.

The Right Tool for Each Problem

Requirement	Right Approach
>10K concurrent connections, normal latency	epoll-based async I/O (nginx, Node.js model)
High IOPS, reduced syscall overhead	io_uring with SQPOLL
10M+ packets/second, custom protocol	DPDK with user-space stack
Sub-2μs remote memory access	RDMA (InfiniBand / RoCEv2)
Specific flow bypass, normal stack for rest	AF_XDP + eBPF
AWS inter-instance ML training	EFA + NCCL
Line-rate network processing without host CPU	SmartNIC (BlueField / Nitro)

The Critique

DPDK’s polling model wastes CPU on idle systems. A DPDK application spinning on an empty 100Gbps NIC burns a full CPU core doing nothing. This is a real cost in any environment where CPU is shared or expensive. The answer is to only use DPDK when the NIC is near saturation - which is precisely the use case it was designed for, but means you cannot use the same binary for both low-traffic and high-traffic deployments without reconfiguring.

RDMA’s requirement for lossless fabric (for RoCE) or dedicated hardware (for InfiniBand) makes it expensive and operationally complex. A single misconfigured PFC priority can cause head-of-line blocking that cascades into cluster-wide slowdowns. This is not theoretical: RDMA fabric incidents in large ML clusters are a known failure mode, and teams managing them need deep networking expertise most software engineers do not have.

io_uring is the safest choice for most applications that need performance improvement. It is kernel-maintained, does not require dedicated hardware, and integrates with the existing TCP/IP stack. If io_uring + zero-copy (MSG_ZEROCOPY) does not meet your latency budget, then consider DPDK. The vast majority of applications that believe they need DPDK actually need io_uring.

Summary

Technology	Bypass Level	Latency	Use Case
Standard Linux sockets + epoll	None (kernel)	1 - 10μs per packet	General purpose
io_uring + SQPOLL	Syscall bypass	~0.5μs per op	High-IOPS servers
AF_XDP	Selective packet bypass	~200ns for bypassed flows	Hybrid: kernel + fast path
DPDK	Full kernel bypass	<1μs, often <100ns	10M+ pps packet processing
RDMA (InfiniBand/RoCEv2)	Full kernel + CPU bypass	1 - 2μs round trip	Distributed ML, HPC
EFA (AWS)	Managed RDMA-like	~20μs intra-VPC	ML training on EC2
SmartNIC	Host CPU bypass	Line-rate	Data center infrastructure
