Operating Systems - The Software That Runs All Other Software
Helpful context:
Your laptop has 8 cores. Right now, it is probably “running” somewhere between 200 and 500 processes simultaneously. Email clients, browser tabs, background sync services, system daemons, your terminal. 200 things on 8 cores. The math does not work - unless someone is managing the illusion.
That someone is the operating system, and its job is harder than it looks.
The Origin Story: From Batch Processing to Time-Sharing
In the 1950s, computers ran one program at a time. You submitted a deck of punch cards, the machine ran your job for minutes or hours, printed the output, and returned your cards. The computer sat idle between jobs. This was batch processing - efficient for the machine, agonizing for users.
The moment that changed everything was the invention of time-sharing in the early 1960s. The insight: a human takes hundreds of milliseconds to read output and decide what to type next. A computer can do millions of instructions in that time. Why not share the computer among many users, giving each a slice so small they would not notice the sharing?
MIT’s CTSS (1961) was the first time-sharing system. It led directly to Multics - an overambitious project to build the perfect OS - and then to Ken Thompson and Dennis Ritchie at Bell Labs building Unix in 1969, partly as a reaction against Multics' complexity. Unix succeeded by being small, principled, and composable. By 1985, every modern OS concept - processes, virtual memory, the filesystem hierarchy, pipes - had been invented.
The evolution from batch → time-sharing → multitasking → modern OS is the story of how we figured out what an operating system should do. Every design decision reflects a constraint that was real in 1970 and is still real today.
The OS as Enforcer: Privilege Rings
The OS does not ask programs to behave. It enforces behavior using hardware.
Modern CPUs support multiple privilege levels, called rings on x86 (though most OSes only use two). Ring 0 is kernel mode - unrestricted access to hardware, all instructions available. Ring 3 is user mode - restricted. Certain instructions (enabling/disabling interrupts, writing to the MMU’s page tables, accessing I/O ports) are only available in ring 0.
When a user-mode program tries to execute a privileged instruction, the CPU raises an exception - hardware-enforced - and the kernel handles it. This is why one buggy process cannot corrupt another process’s memory: the MMU is configured by the kernel, and user-mode programs cannot modify it.
This privilege separation is the foundation of OS security. Docker “containers” are processes running with restricted kernel interfaces (namespaces, cgroups) - still in user space, still subject to the same MMU enforcement. A container escape vulnerability is typically finding a bug in the kernel’s enforcement of that boundary.
Processes: The Unit of Isolation
A process is the OS’s abstraction for an isolated execution environment. Each process has:
- Its own virtual address space (the MMU enforces isolation between address spaces)
- One or more threads
- A file descriptor table
- A process ID (PID) and parent PID
- Signal handlers, resource limits, credentials (UID, GID)
All this state is stored in the kernel in a Process Control Block - task_struct in Linux, a structure with hundreds of fields. When a process is preempted, its entire state is stored in its PCB, ready to be restored when it gets CPU time again.
Process lifecycle:
- Running: Actively executing on a CPU core
- Ready: Could run, waiting for a CPU time slot
- Blocked: Waiting for I/O, a lock, or a timer - explicitly not runnable
The transition from Blocked to Ready is typically triggered by an interrupt (disk read completes, network packet arrives) or a timer.
Fork-Exec: Unix Process Creation
Unix creates processes with the fork() system call, which creates an exact copy of the calling process. The new process is called the child; the original is the parent.
pid_t pid = fork();
if (pid == 0) {
// This code runs in the child process
execvp("/usr/bin/ls", args); // replace child's address space with ls
} else if (pid > 0) {
// This code runs in the parent
waitpid(pid, &status, 0); // wait for child to finish
}
fork() uses copy-on-write: the child initially shares the parent’s physical memory pages. Pages are only copied when one process writes to them. A fork followed immediately by exec (the typical shell pattern) copies almost nothing - exec replaces the address space entirely anyway.
This model underlies every shell, every subprocess.Popen call in Python, every web server that spawns worker processes. Nginx, Apache, Gunicorn - all use the fork-exec model to create worker processes that handle requests independently.
The fork() call has a curious property: it returns twice - once in the parent (returning the child’s PID) and once in the child (returning 0). This symmetry is an elegant design; a single call produces two execution paths.
CPU Scheduling: The Fairness Problem
The scheduler answers one question continuously: which process runs next?
This seems simple but has deep tradeoffs. Interactive processes need low latency - your keypress should appear on screen within milliseconds. Background jobs need throughput - compile your code as fast as possible without blocking the user. Real-time processes (audio, video) need guaranteed time slots. A good scheduler handles all three simultaneously.
Round-robin: Each process gets a fixed time quantum (typically 1 - 10 ms). When the quantum expires, a timer interrupt fires, the kernel preempts the current process, and the scheduler picks the next one from the queue. Simple, fair, bounded worst-case latency ($n \times \text{quantum}$ for $n$ processes), but not great for mixed interactive/batch workloads.
Completely Fair Scheduler (CFS): Linux’s default since kernel 2.6.23. Each process accumulates “virtual runtime” - time it has actually run, weighted inversely by its priority (nice value). The scheduler always picks the process with the smallest virtual runtime. CFS maintains these processes in a red-black tree ordered by virtual runtime, making scheduling decisions O(log n).
The key insight of CFS: rather than trying to predict whether a process is “interactive” or “batch” (which is hard), give every process exactly its fair share of CPU time, weighted by priority. Interactive processes voluntarily block (waiting for keyboard input, network data) and accumulate little virtual runtime, so when they wake up, they jump to the front of the queue. Batch processes that run continuously accumulate virtual runtime and get preempted more often, automatically stepping aside for interactive work.
Context Switches: The Hidden Tax
A context switch occurs when the CPU stops running one process and starts running another. The kernel must:
- Save the current process’s registers, program counter, and stack pointer to its PCB
- Switch to the kernel’s page table (to avoid leaking kernel addresses)
- Run the scheduler to choose the next process
- Restore the new process’s registers from its PCB
- Switch to the new process’s page table
- Resume execution at the saved instruction pointer
Direct cost: 1 - 10 μs for the register save/restore and PCB manipulation.
Hidden cost: the TLB is now invalid (different process, different address space). The first several hundred memory accesses by the new process will miss the TLB and require page table walks. The L1 and L2 caches contain the old process’s working set - the new process will experience cache misses until its working set is loaded.
At 1000 context switches per second, direct overhead is 1 - 10 ms, manageable. But the indirect cache and TLB effects can add 10 - 50% overhead to the processes involved in frequent context switching.
This is why burstable instances exist on AWS (T3/T4g). These instances give you a CPU credit budget. When you burst above the baseline CPU level, you spend credits; when you run below baseline, you accumulate them. The underlying mechanism is the scheduler: during low-credit periods, your VMs get fewer time slices, context-switching to other tenants more frequently. Burstable instances are cheaper because the cloud provider is betting that your workload is bursty in practice - you will not be constantly spending credits.
Threads: Lighter-Weight Concurrency
A thread is a unit of execution within a process. Unlike processes (separate address spaces), threads share the process’s address space, heap, file descriptors, and signal handlers. Each thread has its own stack and registers.
Creating a thread is much cheaper than forking a process - no address space copy, no file descriptor duplication. Thread-to-thread communication is trivially fast (shared memory). The tradeoff: shared state requires synchronization, and a bug in one thread can corrupt the entire process’s heap.
User-space threads vs kernel threads: Early threading models (Green threads in early Java, early Python gevent) scheduled threads entirely in user space. The kernel saw only one kernel thread and had no idea there were many “threads” above it. This allowed cheap context switches (no kernel involvement) but meant the OS could only schedule one green thread per CPU core. Modern systems use kernel threads - each thread is a schedulable entity from the kernel’s perspective, allowing true parallelism on multi-core hardware.
Linux does not actually distinguish between processes and threads in its internal representation. fork() and pthread_create() both call clone(), differing only in which resources are shared. A “thread” in Linux is just a process that shares its parent’s virtual memory.
Virtual Memory and Paging: Deeper
The Memory Management - The Hidden Allocator in Every Program covers virtual memory from the allocator perspective. From the OS perspective, the kernel’s job is maintaining page tables and handling page faults.
Linux uses a 4-level page table on x86-64: PGD (Page Global Directory) → PUD (Upper) → PMD (Middle) → PTE (Page Table Entry). Each level indexes 9 bits of the 48-bit virtual address, with 12 bits for the offset within a 4 KB page. A full page table for a 48-bit address space would require 128 GB - but only mapped regions have entries, so sparse address spaces use little memory.
Page replacement: When physical RAM is exhausted and a page fault requires loading a new page, something must be evicted. Linux uses a variant of the clock algorithm (second-chance LRU): pages are arranged in a circular list with an “accessed” bit. The clock hand sweeps; if the accessed bit is set, it is cleared and the hand moves on. The first page with a cleared bit is evicted. This approximates LRU without the overhead of a true LRU list.
Memory pressure and OOM: When the system is completely out of physical memory and swap, the kernel’s Out-of-Memory killer selects a process to terminate - typically the one using the most memory with the lowest “OOM score.” In Kubernetes, when a pod exceeds its memory limit, the cgroup limit causes the OOM killer to trigger within that container, producing OOMKilled. This is not a kernel bug; it is the designed enforcement mechanism.
System Calls: The Kernel Boundary
Every interaction between a user program and the kernel crosses via system calls. File operations (open, read, write), network operations (socket, connect, accept), process management (fork, exec, wait), memory mapping (mmap, brk) - all are system calls.
strace python3 -c "print('hello')" 2>&1 | head -30
strace intercepts every system call a program makes. For Python, you will see hundreds of openat calls as the interpreter loads .pyc files and .so extension modules before executing a single line of your code. Python startup takes 30 - 100 ms because of this - dozens of system calls, each with the kernel crossing cost.
The cost of a system call is roughly 100 - 1000 ns - the syscall instruction plus context switch plus kernel validation. For I/O-bound applications issuing thousands of small system calls per request, this overhead adds up.
io_uring (Linux 5.1, 2019) addresses this by allowing applications to submit batches of I/O operations through shared memory ring buffers, bypassing individual syscall overhead. Async I/O operations are submitted to a submission queue; completions appear in a completion queue. The kernel can process a full batch with a single system call boundary crossing. High-performance databases and web servers (Nginx, PostgreSQL, Cloudflare’s async stack) use io_uring for this reason.
Signals: Asynchronous Notifications
Signals are the OS’s mechanism for asynchronous notification to a process. They are the Unix equivalent of hardware interrupts - they can arrive at any point during execution and interrupt the current flow.
| Signal | Number | Default | Typical Cause |
|---|---|---|---|
| SIGINT | 2 | Terminate | Ctrl+C |
| SIGTERM | 15 | Terminate | kill PID |
| SIGKILL | 9 | Terminate (forced) | kill -9 PID |
| SIGSEGV | 11 | Core dump | Invalid memory access |
| SIGCHLD | 17 | Ignore | Child process exited |
| SIGUSR1 | 10 | Terminate | Application-defined |
SIGKILL is special: it cannot be caught, blocked, or ignored. The kernel delivers it directly, bypassing any signal handler. This is why kill -9 always works - it is a kernel operation, not a request.
Nginx uses SIGUSR1 to trigger log rotation without downtime: send SIGUSR1, the master process reopens its log files, worker processes continue serving requests. This pattern - using signals as control channels for running daemons - is pervasive in Unix system administration.
File Deletion Is O(1): The Inode Trick
A 10 GB file can be deleted in milliseconds. Creating or copying the same 10 GB file takes minutes. This asymmetry seems wrong - surely deleting something should take at least as long as creating it? The asymmetry is explained by how filesystems represent files.
What a File Actually Is
A file on disk is not a contiguous block of data with a name stamped on it. It is three separate things:
Data blocks: the actual bytes of content, scattered across the disk in 4 KB chunks. A 10 GB file occupies roughly 2.5 million blocks.
An inode: a small data structure (typically 256 bytes) that contains the file’s metadata - owner, permissions, timestamps, size - and a list of pointers to the data blocks where the content lives. The inode does not contain the filename.
A directory entry: a mapping from the filename to the inode number. The directory is just a table: “photo.jpg” maps to inode 18432. “document.pdf” maps to inode 91274.
When you navigate a directory and see a list of filenames, you are reading the directory’s mapping table. When you open a file by name, the OS looks up the filename in the directory to find the inode number, then reads the inode to find the data block locations.
Why Deletion Is Instant
Deleting a file does two things:
- Removes the filename-to-inode mapping from the directory.
- Decrements the inode’s link count by 1. When the link count reaches 0, the inode and its data blocks are marked as free in the filesystem’s allocation bitmap.
Neither step requires reading or zeroing the 10 GB of data. Marking blocks as “available” in a bitmap is a small fixed operation - O(1) regardless of file size. The 2.5 million data blocks are not touched; they remain on disk, silently ignored, until another file happens to be written and the filesystem reuses those blocks for new data.
This is why file recovery software works: the data survives deletion. The OS has simply stopped tracking where it is. A recovery tool can scan the disk for inode patterns and reconstruct the file before the blocks are overwritten.
Why Copying Is O(n)
Creating or copying the same 10 GB file takes proportional time because you must actually move 10 GB of bytes. The OS allocates new data blocks, reads from the source, and writes to the destination. At 500 MB/s sequential disk throughput, 10 GB takes about 20 seconds. No shortcuts available.
Hard Links and Reference Counting
The link count explains why you can have multiple filenames pointing to the same file without duplicating data - this is a hard link.
ln photo.jpg photo_backup.jpg
Now photo.jpg and photo_backup.jpg both point to the same inode. The inode’s link count is 2. Deleting photo.jpg decrements the count to 1 - the data still exists because photo_backup.jpg still points to the inode. Only when you delete photo_backup.jpg (and the count reaches 0) does the data become eligible for reuse.
This reference counting is the same mechanism that many garbage collectors use for memory management - the language design mirrors the filesystem design, because both are solving the same problem: when can a resource be safely freed?
The Security Implication
“Deleting” a file does not erase the data. It makes the data invisible to the OS. If someone gains physical access to your disk (or forensic access to an image of it), they can recover deleted files as long as those blocks have not been overwritten by new data.
Secure deletion tools (shred on Linux, secure empty trash on macOS) explicitly overwrite the data blocks with random bytes before removing the directory entry. This is O(n) - it must touch every block - but it guarantees the data cannot be recovered.
Containers as Processes in Disguise
Docker containers are not VMs. There is no hypervisor, no guest OS, no separate kernel. A container is a process (or group of processes) running with restricted kernel interfaces:
Namespaces give each container its own view of the system: a separate PID namespace (the container’s init process is PID 1), a separate network namespace (its own network interfaces and routing table), a separate mount namespace (its own filesystem hierarchy), a separate UTS namespace (its own hostname).
cgroups (control groups) enforce resource limits: this container gets at most 2 CPUs and 4 GB of memory. The scheduler respects cgroup limits; the OOM killer enforces memory limits.
That is it. A Docker container is a process that thinks it is alone, using Linux kernel mechanisms that have existed since 2006 (cgroups) and 2002 (namespaces). docker run is essentially a sophisticated fork() + exec() with namespace and cgroup configuration.
This matters for understanding Kubernetes scheduling. Kubernetes is solving the same problem as the OS scheduler - how to allocate compute resources among competing workloads - but at the level of containers on a cluster of machines rather than threads on a single CPU. The abstractions mirror each other: processes → pods, scheduler time slices → CPU requests/limits, memory limits → cgroup memory limits, node → machine. Kubernetes internalized OS scheduling theory and applied it to fleets.
eBPF: The Programmable Kernel
eBPF (extended Berkeley Packet Filter) is the most significant Linux kernel feature of the last decade. It allows you to load small programs into the kernel - verified to be safe before loading - that execute in response to kernel events without modifying the kernel itself.
Originally designed for network packet filtering (tcpdump uses BPF), eBPF now hooks into hundreds of kernel events: system calls, scheduler decisions, network packet processing, memory allocation, file system operations.
# Watch every execve() call (process creation) system-wide
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s\n", str(args->filename)); }'
Cloudflare uses eBPF for DDoS mitigation - packet filtering at line rate in the kernel before packets reach user space. Meta uses eBPF for system-wide performance profiling. Cilium (Kubernetes networking) uses eBPF to replace iptables for service routing. The key insight: eBPF lets you extend the kernel at runtime, without writing kernel modules, with safety guarantees enforced by the kernel’s eBPF verifier.
The Future: io_uring and the Kernel as a Service
The trend in OS design is moving computation closer to hardware while making the kernel boundary thinner and cheaper to cross. io_uring reduces system call overhead for I/O. DPDK (Data Plane Development Kit) bypasses the kernel entirely for network packet processing. eBPF moves application logic into the kernel. io_uring, DPDK, and eBPF are all answering the same question: the kernel crossing is expensive; how do we minimize it without losing protection?
The answer is not to eliminate the kernel - it is to make the kernel programmable, so you can customize what runs at kernel privilege without the overhead of a full system call for every operation.
| Concept | The Real-World Consequence |
|---|---|
| Privilege rings | User-mode programs cannot crash the kernel (usually) |
| fork-exec | Every shell command, subprocess, Docker run |
| CFS scheduler | Responsive interactive apps + fair batch throughput |
| Context switch cost | Why burstable AWS instances exist; thread pool sizing |
| Page tables / TLB | Why huge pages matter; why context switches are expensive |
| Namespaces + cgroups | Docker = process with restrictions, not mini-VM |
| System calls | Why Python startup is slow; why io_uring exists |
| eBPF | Programmable kernel without kernel modules |
Read Next: