Prerequisite: How Computers Execute Programs


Every program you run does so under the supervision of an operating system. The OS is the software layer that stands between your code and the hardware, arbitrating access to the CPU, memory, storage, and network. Without it, every application would need its own device drivers, scheduling logic, and memory isolation - an obviously unworkable situation. Understanding how the OS does its job explains a surprising number of practical questions: why a Python import takes time, why forking a process is cheap, and why kill -9 always works.

Processes

A process is the OS’s unit of isolation. It has:

  • Its own virtual address space (text, data, BSS, heap, stack - see How Computers Execute Programs)
  • One or more threads of execution
  • A file descriptor table (open files, sockets)
  • A process ID (PID) and parent PID
  • Signal handlers and resource limits

All of this bookkeeping is stored in the kernel's Process Control Block (PCB) - in Linux, the task_struct.

Process states form a simple state machine:

  • Running: Actively executing on a CPU core
  • Ready: Runnable, waiting for a CPU slot
  • Blocked: Waiting for I/O, a lock, or a timer - not runnable

State transitions happen thousands of times per second on a busy system.
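On Linux, you can observe a process's current state directly: field 3 of /proc/&lt;pid&gt;/stat (the character after the parenthesised command name, per proc(5)) is the state code. A minimal sketch, Linux-only:

```python
import os

# Read this process's state from /proc/<pid>/stat (Linux-specific).
# R = running/runnable, S = interruptible sleep, D = disk wait, Z = zombie.
with open(f"/proc/{os.getpid()}/stat") as f:
    stat = f.read()

# Parse after the closing ')' so command names containing spaces are safe.
state = stat[stat.rindex(")") + 2]
print(f"state: {state}")
```

A process reading its own stat file is, by definition, on a CPU at that moment, so it reports itself as R.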

The Fork-Exec Model

Unix creates new processes with fork(), which duplicates the current process's address space (lazily, via copy-on-write - pages are shared until either process writes to them), followed by exec(), which replaces that address space with a new program:

char *args[] = {"ls", "-l", NULL};
int status;

pid_t pid = fork();
if (pid == 0) {
    // child: replace this process image with ls
    execvp("ls", args);
    _exit(127);  // only reached if exec fails
} else {
    // parent: block until the child finishes
    waitpid(pid, &status, 0);
}

This model underlies every shell, every web server’s worker spawn, and Docker’s container isolation.
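The same fork-exec-wait pattern is available from Python's os module. A sketch, assuming a POSIX `true` binary on PATH:

```python
import os

pid = os.fork()
if pid == 0:
    # Child: replace this process image with a new program.
    # execvp searches PATH; the first args element is argv[0] by convention.
    os.execvp("true", ["true"])
    os._exit(127)  # only reached if exec fails
else:
    # Parent: block until the child exits, then decode its status.
    _, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status)
    print(f"child {pid} exited with status {code}")
```

This is essentially what subprocess.run does under the hood on Unix.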

CPU Scheduling

The scheduler decides which process runs next. It must balance fairness, throughput, and latency:

First-In, First-Out (FIFO): Run each process to completion before the next. Simple, but a long-running process starves everything else.

Shortest Job First (SJF): Run the shortest expected job next. Optimal for minimising average wait time in theory; requires knowing job length in advance (impossible in general).

Round Robin: Each process gets a fixed time quantum (typically 1–10 ms). When the quantum expires, a timer interrupt fires, the scheduler preempts the current process, and runs the next in queue. Fairness guaranteed; latency bounded by $n \times \text{quantum}$ for $n$ processes.
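A toy simulation makes the round-robin behaviour concrete (burst lengths and the 10 ms quantum are illustrative, not measured values):

```python
from collections import deque

def round_robin(burst_ms, quantum_ms):
    """Simulate round-robin; return each process's completion time in ms."""
    queue = deque(range(len(burst_ms)))   # ready queue of process indices
    remaining = list(burst_ms)
    clock = 0
    finish = [0] * len(burst_ms)
    while queue:
        i = queue.popleft()
        run = min(quantum_ms, remaining[i])   # run for one quantum, or less
        clock += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)                   # preempted: back of the queue
        else:
            finish[i] = clock                 # done: record completion time
    return finish

# Three CPU bursts of 30, 5, and 12 ms with a 10 ms quantum.
print(round_robin([30, 5, 12], 10))
```

Note how the short 5 ms job finishes early instead of waiting behind the 30 ms one, which is exactly the latency benefit over FIFO.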

CFS (Completely Fair Scheduler): Linux’s default since 2.6.23. Each process accumulates “virtual runtime” - time it has run, weighted by its priority (nice value). The scheduler always picks the process with the smallest virtual runtime. CFS uses a red-black tree (ordered by virtual runtime) for $O(\log n)$ scheduling decisions.
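The pick-smallest-vruntime idea can be sketched in a few lines. This toy version uses a heap rather than a red-black tree, and a plain weight as a stand-in for the kernel's nice-value weighting:

```python
import heapq

def schedule(tasks, ticks):
    """tasks: {name: weight}; higher weight = more CPU share.
    Returns the order in which tasks are picked to run."""
    heap = [(0.0, name) for name in tasks]   # (vruntime, task)
    heapq.heapify(heap)
    order = []
    for _ in range(ticks):
        vruntime, name = heapq.heappop(heap)   # smallest vruntime runs
        order.append(name)
        # One tick of real time advances vruntime inversely to weight,
        # so heavier (higher-priority) tasks accumulate vruntime slower.
        heapq.heappush(heap, (vruntime + 1.0 / tasks[name], name))
    return order

# "a" has twice the weight of "b", so it receives twice the CPU share.
print(schedule({"a": 2.0, "b": 1.0}, 6))
```

Over 6 ticks, "a" runs 4 times and "b" runs 2, matching the 2:1 weight ratio.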

Context Switch Cost

A context switch saves the current process’s registers, program counter, and memory mappings to its PCB, then restores another process’s state. This costs:

  • Direct cost: ~1–10 μs for saving/restoring registers and flushing pipelines
  • Indirect cost: TLB flush (address translations are now invalid), cache pollution (new process brings new working set)

At 1000 context switches per second, the direct overhead is ~1–10 ms - manageable. The indirect cache effects are often larger and harder to measure.

Threads

A thread is a lightweight unit of execution within a process. All threads in a process share the same address space, file descriptors, and heap. Each thread has its own stack and registers.

Creating a thread (pthread_create) is much faster than fork() because no address space copy is needed. Communication between threads is trivially fast (shared memory), but requires synchronisation - see Concurrency & Race Conditions.
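A small Python sketch of both halves of that trade-off - the shared heap and the synchronisation it forces:

```python
import threading

# Both threads append to the SAME list: one heap, one address space.
shared = []
lock = threading.Lock()

def worker(tag, n):
    for i in range(n):
        with lock:                   # without this, interleaving corrupts state
            shared.append((tag, i))

t1 = threading.Thread(target=worker, args=("a", 100))
t2 = threading.Thread(target=worker, args=("b", 100))
t1.start(); t2.start()
t1.join(); t2.join()
print(len(shared))   # 200: both threads wrote into the same object
```

Compare with the fork example later in this document, where the child's writes are invisible to the parent.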

Virtual Memory and Paging

Each process sees a private virtual address space. The OS, with hardware support from the Memory Management Unit (MMU), translates virtual addresses to physical RAM addresses via page tables.

Memory is divided into fixed-size pages (typically 4 KB). A page table maps virtual page numbers to physical frame numbers. For a 64-bit address space, a naive page table would be enormous; Linux uses a 4-level page table (PGD → PUD → PMD → PTE) that only allocates entries for mapped regions.

The TLB (Translation Lookaside Buffer) caches recent virtual-to-physical translations. A TLB miss forces a page table walk - expensive. This is why large pages (2 MB, 1 GB “huge pages”) exist: they reduce TLB pressure for large allocations.
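The 4-level split is just bit slicing: on x86-64 with 4 KB pages, a 48-bit virtual address decomposes into four 9-bit table indices plus a 12-bit page offset (9+9+9+9+12 = 48). A sketch:

```python
def split_vaddr(vaddr):
    """Decompose a 48-bit virtual address into 4-level page table indices."""
    offset = vaddr & 0xFFF           # bits 0-11:  offset within the 4 KB page
    pte    = (vaddr >> 12) & 0x1FF   # bits 12-20: page table index
    pmd    = (vaddr >> 21) & 0x1FF   # bits 21-29: page middle directory index
    pud    = (vaddr >> 30) & 0x1FF   # bits 30-38: page upper directory index
    pgd    = (vaddr >> 39) & 0x1FF   # bits 39-47: page global directory index
    return pgd, pud, pmd, pte, offset

print(split_vaddr(0x7FFF_DEAD_BEEF))
```

Each 9-bit index selects one of 512 entries in its table; a 2 MB huge page simply stops the walk one level early, dropping the PTE lookup and widening the offset to 21 bits.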

Page replacement: When physical RAM is full and a new page must be loaded, the OS must evict an existing page. The LRU (Least Recently Used) policy evicts the page unused for the longest time. Linux approximates LRU with the clock algorithm: a circular list of pages, each with an "accessed" bit. As the clock hand sweeps, a page whose accessed bit is set gets the bit cleared (a second chance) and the hand moves on; the first page found with its bit already clear is evicted.
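The clock sweep described above fits in a few lines. A toy sketch:

```python
def clock_evict(pages, accessed, hand):
    """One eviction pass of the clock algorithm.
    pages: list of page ids; accessed: parallel list of 0/1 bits;
    hand: current clock hand index. Returns (victim, new_hand)."""
    while True:
        if accessed[hand]:
            accessed[hand] = 0                   # second chance: clear bit
            hand = (hand + 1) % len(pages)       # ...and keep sweeping
        else:
            victim = pages[hand]                 # bit already clear: evict
            return victim, (hand + 1) % len(pages)

pages    = ["A", "B", "C", "D"]
accessed = [1, 1, 0, 1]
victim, hand = clock_evict(pages, accessed, 0)
print(victim)   # "C": the first page the hand finds with a clear bit
```

Pages A and B survive this pass but lose their accessed bits, so they are evicted next time unless they are touched again before the hand returns.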

File Systems

At the storage level, the OS manages file systems built on raw blocks:

Inodes: Each file is represented by an inode - a data structure storing metadata (permissions, timestamps, size, data block pointers) but not the filename. Directory entries map filenames to inode numbers. ls -i shows inode numbers.
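Because the filename lives in the directory entry rather than the inode, two names can point at the same inode - that is all a hard link is. A sketch using os.stat (which exposes the inode number as st_ino):

```python
import os
import tempfile

# A hard link is a second directory entry for the same inode,
# so both names report the same inode number via stat().
with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "data.txt")
    alias    = os.path.join(d, "alias.txt")
    with open(original, "w") as f:
        f.write("hello")
    os.link(original, alias)     # create a hard link (second name, same inode)
    same = os.stat(original).st_ino == os.stat(alias).st_ino
    print(same)
```

This is also why deleting one name doesn't free the data: the inode is only reclaimed when its link count drops to zero.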

Journaling (ext4): Before modifying file system metadata, write the intended change to a journal. After a crash or power failure, replaying the journal on the next mount restores consistency - avoiding the "inconsistent directory" bugs of earlier file systems.

Copy-on-write (ZFS, Btrfs): Never overwrite data in place. Write new data to a new location, then atomically update the pointer. The old data remains until reclaimed. This makes snapshots trivially cheap - just save the root pointer.

Signals

Signals are asynchronous notifications sent to a process:

  Signal    Number   Default action       Typical cause
  SIGINT    2        Terminate            Ctrl+C
  SIGTERM   15       Terminate            kill PID
  SIGKILL   9        Terminate (forced)   kill -9 PID
  SIGSEGV   11       Core dump            Invalid memory access
  SIGCHLD   17       Ignore               Child process exited
SIGKILL cannot be caught or ignored - the kernel delivers it directly, which is why it always works even on a hung process.
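Both halves of that claim can be demonstrated from Python's signal module - a handler for SIGTERM works, while installing one for SIGKILL is refused by the kernel:

```python
import os
import signal

# SIGTERM can be caught by installing a handler; SIGKILL cannot.
caught = []

def handler(signum, frame):
    caught.append(signum)

signal.signal(signal.SIGTERM, handler)   # installing a SIGTERM handler: allowed
os.kill(os.getpid(), signal.SIGTERM)     # send SIGTERM to ourselves; handler runs
print("caught:", caught)

# The kernel reserves SIGKILL: trying to catch it fails with EINVAL.
kill_refused = False
try:
    signal.signal(signal.SIGKILL, handler)
except OSError:
    kill_refused = True
print("SIGKILL catchable?", not kill_refused)
```

The same restriction applies to SIGSTOP, for the same reason: the kernel guarantees an administrator can always stop or kill a process.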

Inter-Process Communication

Processes need to communicate. The OS provides several mechanisms:

  • Pipes: Unidirectional byte stream; ls | grep foo uses an anonymous pipe. mkfifo creates named pipes.
  • Shared memory (mmap, POSIX shm_open): Fastest IPC - map the same physical pages into two address spaces. Requires synchronisation.
  • Sockets: Full-duplex byte streams, either local (Unix domain sockets) or networked (TCP/IP). Used by databases, web servers, and systemd.
  • Message queues: Structured messages with priorities; POSIX mq_open. Simpler than shared memory, slightly higher overhead.
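The first mechanism in the list - an anonymous pipe between a parent and a forked child - is the machinery behind shell pipelines like ls | grep foo. A minimal sketch:

```python
import os

# Create the pipe BEFORE forking so both processes inherit the two ends.
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    os.close(r)                          # child only writes
    os.write(w, b"hello from child")
    os.close(w)
    os._exit(0)
else:
    os.close(w)                          # parent only reads
    data = os.read(r, 1024)
    os.close(r)
    os.waitpid(pid, 0)                   # reap the child
    print(data.decode())
```

Closing the unused ends matters: a reader only sees end-of-file once every write end of the pipe is closed.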

Examples

strace output for a Python script:

strace -e trace=openat,read,write python3 -c "print('hello')" 2>&1 | head -30

You’ll see hundreds of openat calls as Python imports its standard library - each .pyc file, each .so extension module. This explains why Python startup takes ~30–100 ms: it makes hundreds of system calls before your first line runs.

Reading htop:

  PID USER  PRI  NI  VIRT   RES   SHR S  CPU%  MEM%
12345 alice  20   0  512M  128M   32M R  99.9   1.6
  • VIRT: virtual memory committed (includes mmap’d files, libraries - often looks huge, don’t panic)
  • RES: actual physical RAM in use
  • SHR: shared pages (e.g. shared libraries) - counted in each sharing process’s RES, so summing RES across processes overcounts memory
  • S: state (R=running, S=sleeping, D=disk wait)

fork and process isolation:

import os

x = 1  # set before fork, so both processes start with the same value

pid = os.fork()
if pid == 0:
    # Child: writing x triggers copy-on-write; the parent's copy is untouched
    x = 999
    print(f"Child (PID {os.getpid()}): x = {x}")
    os._exit(0)
else:
    os.waitpid(pid, 0)
    print(f"Parent (PID {os.getpid()}): x = {x}")  # still 1

The OS is not just a convenience layer - it is the enforcer of the abstractions that make multi-process, multi-user computing safe and predictable. Knowing what happens beneath open(), fork(), and malloc() lets you reason about performance, debug mysterious slowdowns, and understand the security boundaries that protect your processes.


Read Next: