Memory Management - The Hidden Allocator in Every Program // Megha Bose

Helpful context:

Operating Systems - The Software That Runs All Other Software

Here is a thought experiment: open two terminal windows and run the same Python program in each. Print the memory address of a variable in both. You might see the same address in both processes - 0x7ffcd3a0b2f0, say. But each program has its own copy of that variable. They cannot read each other’s memory. They are operating in completely separate universes that happen to use the same coordinate system.

How is that possible?

The answer - virtual memory - is the foundation of everything else in this post. Memory management is the story of how programs get memory, use it safely, share it securely, and - in some languages - give it back automatically. Getting it wrong causes the majority of serious security vulnerabilities ever found.

Two Programs, Same Address: Virtual Memory

Every process runs in a virtual address space - an illusion of private, contiguous memory that the OS and hardware maintain on its behalf. The CPU’s Memory Management Unit (MMU) intercepts every memory access and translates virtual addresses to physical RAM addresses using page tables maintained by the kernel.

Physical RAM is divided into fixed-size pages (typically 4 KB on x86-64, 16 KB on Apple Silicon). Each virtual page maps to a physical frame - or to nothing, in which case accessing it triggers a page fault. The kernel handles the fault, allocates a physical frame if appropriate, updates the page table, and restarts the instruction.

This mechanism provides three things:

Isolation: Process A’s virtual address 0x400000 maps to a different physical frame than Process B’s 0x400000. The MMU enforces this separation in hardware. No matter how buggy Process A is, it cannot corrupt Process B’s memory without a kernel vulnerability.

Overcommit: You can allocate 16 GB of virtual address space on a machine with 8 GB of RAM. The kernel only maps physical pages on first access. Many allocations that are reserved but never touched cost nothing real.

Swap: When physical RAM is exhausted, the kernel can evict a page to disk (swap), freeing the physical frame for something else. When that virtual address is accessed again, a page fault fires, the kernel reads the page from disk, and execution resumes. Major page faults - requiring disk I/O - cost ~10 ms. That is six orders of magnitude slower than an L1 cache hit.

The TLB: Caching Translations

Translating every memory access through a multi-level page table would be catastrophically slow. The Translation Lookaside Buffer (TLB) is a small hardware cache of recent virtual-to-physical translations, sitting inside the CPU core. A TLB hit costs ~1 cycle. A TLB miss - requiring a page table walk - costs ~10 - 100 cycles.

This is why huge pages (2 MB or 1 GB instead of 4 KB) exist. One TLB entry covers 2 MB instead of 4 KB, so a 2 GB working set needs 1000 entries vs 500,000. Databases (PostgreSQL, MySQL) configure huge pages to reduce TLB pressure. Kubernetes workloads can configure hugepages-2Mi resource limits for the same reason.

A context switch flushes the TLB (on CPUs without ASID support), which is one reason context switches have hidden costs beyond the register save/restore.

Memory Layout: The Map of a Process

A process’s virtual address space follows a standard layout:

Segment	Contents	Grows
Text	Executable code (read-only)	Fixed
Data	Initialized globals	Fixed
BSS	Uninitialized globals (zero-filled)	Fixed
Heap	Dynamic allocations (`malloc`)	Upward
Stack	Call frames, local variables	Downward

The heap and stack grow toward each other from opposite ends of the address space. Stack overflow (the actual phenomenon, not the website) occurs when the call stack grows deep enough to collide with the heap - typically the result of unbounded recursion.

Stack memory is automatic and fast. Each function call pushes a frame containing the return address, saved registers, and local variables. The frame is popped when the function returns. Stack allocation is a single instruction - subtract from the stack pointer - and costs essentially nothing. The constraint: stack frames cannot outlive their function, and the total stack is small (usually 1 - 8 MB).

Heap memory is dynamic. You request a block of arbitrary size, use it as long as needed, and - in C - release it explicitly. In managed languages, the runtime handles release automatically.

malloc: Not Magic, Just Engineering

malloc sits on top of OS primitives (brk and mmap on Linux) and manages a pool of memory. The allocator’s job is to satisfy arbitrary-sized requests quickly while minimizing wasted space.

Common strategies:

Free list: Maintain a linked list of free blocks. On allocation, find a block that fits (first-fit, best-fit, or worst-fit heuristics); on free, return the block and coalesce adjacent free blocks. glibc’s allocator (ptmalloc2) uses this with size-class buckets for small allocations.

Slab allocator: Pre-allocate pools of fixed-size objects. The Linux kernel’s slab allocator handles objects like task_struct in O(1) - grab the next pre-zeroed slot from the appropriate slab. No search, no fragmentation for uniform-size objects.

Buddy system: Split memory in powers of two. Allocating 256 bytes splits a 512-byte block; freeing coalesces adjacent “buddy” blocks back up. Linux’s page allocator uses this.

The “Heap is Slow” Myth

Here is a common misconception worth addressing: that heap allocation is inherently slow. It is not. malloc for a small object typically takes 20 - 100 ns - fast enough that it is almost never the bottleneck. What is slow is fragmentation and allocation patterns.

External fragmentation occurs when free memory is scattered in holes too small for new requests. A 1 GB heap can have 400 MB free, but fragmented into thousands of tiny gaps, unable to satisfy a 1 MB allocation.

Internal fragmentation occurs when the allocator rounds up your request (you asked for 17 bytes, got 32). The 15 bytes inside the allocation are wasted.

A long-running server that allocates and frees objects of many sizes will gradually fragment its heap. This is why some production systems use a custom pool allocator for objects they create at high frequency - not because malloc is slow, but because they want predictable, unfragmented memory patterns.

Garbage Collection: Automatic Memory Management

In managed languages - Python, Java, Go, C# - the runtime tracks live objects and reclaims dead ones automatically. Several strategies exist, each with different tradeoffs.

Reference Counting

CPython, Swift, and Rust’s Arc use reference counting: each object stores a count of references pointing to it. When the count drops to zero, the object is freed immediately.

This is deterministic - objects are freed exactly when their last reference disappears - which makes resource cleanup predictable. The fatal flaw: cycles. Two objects that reference each other will never reach count zero even when no external reference exists.

class Node:
    def __init__(self):
        self.partner = None

a, b = Node(), Node()
a.partner = b
b.partner = a   # cycle: a→b→a

del a, b        # refcounts don't hit 0; both objects leak

CPython solves this with a supplementary cycle detector that runs periodically and collects unreachable cyclic structures. It is opt-out (gc.disable()) but rarely worth disabling in practice.

Mark-and-Sweep

The GC pauses the program, marks all objects reachable from roots (global variables, stack frames, registers), then sweeps through the heap freeing everything unmarked.

This handles cycles cleanly. The downside is stop-the-world pauses proportional to heap size - genuinely problematic for latency-sensitive applications. Early Java GC was notorious for multi-second pauses on large heaps.

Generational Garbage Collection

The key empirical insight: most objects die young. In typical programs, the vast majority of heap objects are short-lived temporaries - intermediate results, request objects, string buffers. A small fraction survive long enough to be considered “old.”

Generational GC exploits this:

The nursery (young generation) is small and collected frequently. Collection is fast because most objects are dead.
Long-lived objects are promoted to the old generation, which is collected infrequently.

JVM’s G1 collector divides memory into regions and prioritizes collecting regions with the most garbage - hence “Garbage First.” Go’s GC is concurrent (runs alongside your code) with sub-millisecond pauses. CPython’s cycle detector is a simple mark-and-sweep layered on top of reference counting, not a full generational system.

Why Python Objects Are Slow: The PyObject Tax

Every Python value - integer, string, list, True, None - is a PyObject on the heap. The PyObject header alone consumes 16 bytes on 64-bit systems: 8 bytes for the reference count and 8 bytes for a pointer to the type object.

A Python integer like 42 is not just four bytes. It is a heap-allocated PyObject with a refcount, a type pointer, and then the actual integer value - roughly 28 bytes for something that fits in a register in C.

Multiply this by a list of 10 million integers and you have: 10M × 28 bytes = 280 MB, plus the list object itself (8 bytes per pointer for the array of pointers to those 280 MB of PyObjects). A NumPy array of 10M int64 values uses exactly 80 MB - and the data is contiguous, cache-friendly, and can be processed with SIMD instructions.

This is not a Python design mistake. It is a deliberate tradeoff: every value is an object, so every value is first-class, introspectable, and dynamically typed. The overhead is the cost of that flexibility. When you need raw numeric throughput, you use NumPy, which stores data in C arrays and calls into C code.

Reference counting also means every assignment and variable scope exit involves incrementing and decrementing counts - memory traffic that adds up in tight loops.

C Manual Memory Management: Total Control, Total Responsibility

In C, you own every byte:

void process_requests(int n) {
    for (int i = 0; i < n; i++) {
        char *buf = malloc(4096);
        if (!buf) { perror("malloc"); exit(1); }
        handle_request(buf);
        free(buf);    // miss this once → leak 4KB per iteration
    }
}

Common failures:

Use-after-free: Access memory after it has been freed. The memory may have been reallocated to something else. This is the root cause of the majority of browser CVEs.

Double-free: Call free() on the same pointer twice. Corrupts the allocator’s internal state; often exploitable.

Buffer overflow: Write past the end of an allocation. Overwrites adjacent memory - possibly another object’s data, a function pointer, or a return address on the stack. The basis of stack smashing attacks.

Memory leak: Forget to free. In a long-running server, leaked memory accumulates until the OOM killer arrives.

Microsoft, Google, and the Chrome security team have all published analyses showing that ~70% of their critical security vulnerabilities involve memory safety bugs in C and C++. This is not a skill issue - it is a property of the language model.

Rust: A Third Path

Rust’s ownership system enforces memory safety at compile time, with zero runtime overhead.

Each value has exactly one owner. When the owner goes out of scope, the value is dropped (freed). Ownership can be transferred (moved) or temporarily lent out (borrowed). The borrow checker verifies at compile time that:

No two parts of the program can mutably access the same data simultaneously
No reference outlives the data it points to

The result: no use-after-free, no double-free, no data races - guaranteed at compile time, without a garbage collector and without runtime overhead. The tradeoff is learning the borrow checker, which requires thinking about ownership explicitly rather than relying on GC or careful manual tracking.

Rust is why the Linux kernel began accepting Rust code in 2022 - not because Rust is faster than C (it is not, meaningfully) but because the kernel cannot afford use-after-free bugs, and Rust is the first systems language that can prevent them without sacrificing performance.

Kubernetes OOMKilled: Memory Management in Production

When a Kubernetes pod exceeds its memory limit, the container runtime kills the container with an OOM (out-of-memory) kill - visible as OOMKilled in kubectl describe pod. This is the kernel’s OOM killer, triggered when the cgroup’s memory usage hits the configured limit.

Why does this happen?

The application allocates memory that it never frees (a leak).
The application’s working set is genuinely larger than the limit - the limit is set too low.
A managed language’s GC has not yet collected dead objects when the limit is checked - the OS sees a larger RSS than the application “uses.”

Go is notorious for the third case: Go’s GC can hold onto pages even after objects are freed, not immediately returning them to the OS. The RSS grows, Kubernetes sees the limit hit, OOMKilled. The fix is GODEBUG=madvdontneed=1 or tuning the GC with runtime/debug.SetMemoryLimit (Go 1.19+).

JVM applications have similar patterns: the heap is pre-allocated at startup (-Xmx), and the OS sees that entire heap as RSS even if most of it contains dead objects the next GC cycle will collect.

Setting Kubernetes memory limits requires understanding not just what your code does with memory, but what the runtime beneath it does.

Serverless Cold Starts: Memory as Latency

AWS Lambda cold starts happen partly because of memory allocation. A cold start involves:

Provisioning a micro-VM (Firecracker)
Loading the container image
Starting the language runtime (Python: importing the standard library; JVM: class loading)
Running your initialization code

Step 3 is memory allocation under a different name: the runtime must allocate heap space, initialize data structures, and load libraries. Python’s startup allocates memory for the interpreter, the GC structures, the module cache. The JVM’s class loading allocates method tables, constant pools, bytecode arrays.

This is why Lambda’s provisioned concurrency exists: keep warm instances around to avoid the cold start cost. The warm instance already has all that memory allocated and initialized. You pay for idle memory to avoid paying for allocation latency on each request.

Memory-Mapped Files: The Shortcut

mmap maps a file directly into the virtual address space. Reading from those addresses reads the file; writing to them writes the file. The OS handles paging transparently.

import mmap

with open("data.bin", "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        header = mm[:16]          # read bytes 0 - 15
        mm[0:4] = b"NEW!"         # write directly to file

Databases use mmap (SQLite, LMDB, RocksDB’s table files) because it delegates caching to the OS page cache. The OS automatically evicts cold pages when memory is needed, re-faults them on access. The database does not need to implement its own buffer pool - or it implements one on top.

The tradeoff: the OS page cache does not know about your database’s access patterns, so it evicts pages that your database would prefer to keep. High-performance databases (PostgreSQL at larger scales) implement their own buffer pool precisely to take control of this.

Concept	Tradeoff
Virtual memory	Isolation + overcommit vs. TLB pressure; huge pages reduce TLB misses
Stack vs heap	Stack is automatic/fast but size-limited; heap is flexible but requires management
Reference counting	Deterministic destruction vs. cycle problem; CPython’s approach
Mark-and-sweep	Handles cycles vs. stop-the-world pauses
Generational GC	Fast minor GC for short-lived objects vs. occasional major GC pauses
C manual memory	Maximum control vs. use-after-free, double-free, leaks; 70% of CVEs
Rust ownership	Compile-time safety with zero runtime overhead vs. borrow checker learning curve
PyObject overhead	First-class dynamic objects vs. 28 bytes for an integer

Read Next: