How Computers Execute Programs - From Instruction Fetch to Writeback
Helpful context:
- Binary & Number Systems - How Computers Count
- Logic Gates - The Basic Building Blocks of All Computation
Picture this: you type python hello.py and half a second later, “Hello, world!” appears. What actually happened? From your perspective, magic. From the CPU’s perspective, a precise sequence of billions of individual steps - each one embarrassingly simple, each one reducible to transistors switching between two voltage levels. The gap between those two perspectives is what this post closes.
The Revolution That Wasn’t Obvious: Stored Programs
Before 1945, computers were wired. You wanted a machine to do something different, you rewired it. The ENIAC, completed in 1945, had to be physically reconfigured for each new computation - tens of thousands of wire connections rearranged by hand.
John von Neumann’s insight, articulated in his 1945 report on EDVAC, was deceptively simple: store the program in memory, right alongside the data. Code is just numbers. If code and data live in the same memory, the CPU can read instructions the same way it reads data - and you can write programs that modify other programs, load new code dynamically, or treat functions as values.
This “stored-program concept” is why your laptop can run a browser, a music player, and a Python interpreter simultaneously without hardware changes. It is the foundational idea of modern computing, and it still governs every CPU made today.
Von Neumann architecture has one notorious bottleneck: code and data compete for the same memory bus. Every instruction fetch potentially delays a data load. This is the “von Neumann bottleneck,” and the entire history of CPU design - caches, out-of-order execution, prefetching - is largely a story of working around it.
ISA: The Contract Between Hardware and Software
You write C. The compiler produces machine code. The CPU executes machine code. The piece that makes this work is the Instruction Set Architecture - the ISA - a formal specification of what instructions the CPU understands, what registers exist, how memory is addressed, and what the binary encoding of every operation looks like.
The ISA is a contract. Hardware teams can redesign the silicon entirely - change the pipeline depth, add execution units, shrink the transistors - as long as they honor the contract. Software compiled for the ISA ten years ago still runs. This is why you can run a binary compiled in 2005 on a modern CPU without recompilation.
Two ISAs dominate computing today:
x86-64 (also called AMD64) dominates laptops, desktops, and most cloud servers. It is a CISC architecture - Complex Instruction Set Computer - with variable-length instructions ranging from 1 to 15 bytes. It accumulated decades of extensions (MMX, SSE, AVX, AVX-512) that make its encoding table a historical artifact rather than a clean design. It has 16 general-purpose 64-bit registers: rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, and r8 through r15.
ARM (AArch64) dominates mobile, embedded, and increasingly servers. It is RISC - Reduced Instruction Set Computer - with fixed 32-bit instruction width, a simpler instruction set, and a power efficiency advantage that is the direct reason your phone lasts all day and AWS Graviton instances exist.
RISC-V is the new entrant: an open ISA, no licensing fees, designed from scratch with 50 years of hindsight. Its ecosystem is still fragmented today but growing fast - SiFive, StarFive, and dozens of chip startups are betting their futures on it.
Why do cloud providers care about ISA? AWS Graviton (ARM-based) instances run the same workloads as x86 instances at a claimed 20 - 40% better price-performance. Apple Silicon (also ARM) made the “x86 is inevitable” argument obsolete almost overnight. The era of x86 monoculture in servers is ending, not because x86 is bad, but because the ISA contract is just as enforceable on ARM.
Registers: The CPU’s Working Memory
Before the CPU can do anything, it needs somewhere to put the numbers it is working with. That somewhere is registers - a small set of extremely fast storage locations inside the CPU itself, connected directly to the arithmetic logic unit.
Key registers to know:
- rax - the accumulator and the conventional return value register. Function results come back here.
- rsp - the stack pointer, always pointing to the top of the current stack.
- rip - the instruction pointer (program counter), holding the address of the next instruction to execute. The CPU reads this, executes what it finds, then advances rip.
- rbp - the base pointer, historically used to anchor a stack frame’s local variables.
- rflags - a collection of condition bits set by arithmetic operations: the zero flag, carry flag, overflow flag. Conditional branches read these.
x86-64 has 16 general-purpose registers, plus SIMD registers (XMM0 - XMM15, YMM, ZMM) for vectorized arithmetic. ARM has 31 general-purpose registers (X0 - X30). RISC-V has 32.
The number of registers matters for performance. More registers means the compiler can keep more values in fast storage without “spilling” to the stack. AArch64 and RISC-V expose roughly twice as many general-purpose registers as x86-64, which is one reason their compilers need fewer spills. x86-64 cores compensate in hardware by renaming the 16 architectural registers onto a much larger physical register file.
The Fetch-Decode-Execute Cycle
The CPU’s fundamental operation is a loop:
Fetch: Read the bytes at the address in rip from memory (or, in practice, from the L1 instruction cache). On x86-64, this might be 1 to 15 bytes depending on the instruction.
Decode: Parse those bytes. What operation is this? What are the operands? Which registers, which memory addresses? The decode unit translates raw binary opcodes into micro-operations the execution units understand.
Execute: Hand the decoded operation to the appropriate execution unit - the ALU for arithmetic, the memory unit for loads and stores, the branch unit for jumps. Produce a result.
Write back: Store the result to a register or memory.
Then advance rip to the next instruction (or to the branch target, if the instruction was a taken jump) and repeat. Forever.
On a 3 GHz CPU, the clock ticks three billion times per second per core. But “one instruction per clock tick” is a simplification - modern CPUs overlap and reorder instructions, often completing several per cycle, through techniques that fundamentally complicate the simple loop:
Pipelining: Overlap the stages. While instruction N executes, instruction N+1 is being decoded, and instruction N+2 is being fetched. A modern out-of-order CPU has 10 - 20 pipeline stages. The downside: a mispredicted branch flushes the pipeline, wasting those in-flight instructions.
Out-of-order execution: The CPU does not execute instructions in program order. It maintains a reorder buffer, finds instructions whose operands are ready, and executes them ahead of later instructions that are stalled waiting for data. The result is committed in order, but execution is opportunistic.
Branch prediction: Conditional branches depend on values not yet computed. The CPU predicts which way the branch will go - using history tables tracking recent branch behavior - and speculatively executes the predicted path. Modern predictors are right 95 - 99% of the time. When they are wrong, the speculative work is discarded and the pipeline restarts.
This is where Spectre and Meltdown came from. Speculative execution crossed security boundaries: the CPU would speculatively read memory it was not authorized to access, leave traces in the cache, and - even after squashing the speculative result - the cache state change persisted. Side-channel attacks could read that trace. Out-of-order execution and branch prediction are why modern CPUs are fast, and they are also why they had the most serious hardware vulnerabilities ever disclosed.
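To make “history tables tracking recent branch behavior” concrete, here is a minimal sketch of the classic textbook scheme: a single 2-bit saturating counter. Real predictors are far more elaborate; the function name `count_mispredictions` and the starting state are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Count mispredictions of one 2-bit saturating counter over a sequence
   of branch outcomes (nonzero = taken, 0 = not taken).
   Counter states 0 and 1 predict not-taken; states 2 and 3 predict taken. */
static size_t count_mispredictions(const int *outcomes, size_t n)
{
    int counter = 2;               /* assumed start state: weakly taken */
    size_t misses = 0;

    assert(outcomes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int predicted_taken = (counter >= 2);
        int taken = (outcomes[i] != 0);
        if (predicted_taken != taken)
            misses++;
        /* Train: step toward the actual outcome, saturating at 0 and 3. */
        if (taken && counter < 3)
            counter++;
        if (!taken && counter > 0)
            counter--;
    }
    return misses;
}
```

A loop branch that is taken nine times and then falls through costs this predictor a single miss, on the exit. A strictly alternating branch defeats it half the time, which is why real hardware also tracks longer history patterns.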
What “Running” Means at the Silicon Level
A transistor is a switch. It is either on (passing current, representing 1) or off (blocking current, representing 0). A modern CPU has tens of billions of transistors.
Logic gates are built from transistors - an AND gate, an OR gate, a NOT gate. Combine gates into adders, multiplexers, flip-flops (one-bit memory cells), and registers. Combine those into an ALU. Add a control unit, a decode unit, caches, execution ports. The result is a CPU.
The “executing” in “executing a program” is just transistors switching states in coordinated patterns, millions of times per nanosecond. Addition is electrons flowing through gates that implement binary addition. A load from memory is signals traveling from cache to register. There is no magic - just physics, at enormous scale and speed.
This matters for software engineers because transistor switching is not free. Every operation consumes energy, generating heat. This is why CPUs have thermal design power (TDP) ratings, why laptops throttle under sustained load, and why data center operators obsess over joules per computation. The move to ARM in servers is partly a power story: the same instructions per second for fewer watts.
System Calls: Where Programs Meet the Kernel
User programs run in user mode - a restricted execution context where they cannot directly access hardware, other processes' memory, or kernel data structures. This restriction is enforced by the CPU itself: certain instructions are privileged and raise a fault if executed in user mode.
To do anything interesting - write to a file, open a network connection, allocate more memory - a program must cross into kernel mode via a system call.
On x86-64 Linux, the syscall instruction triggers a controlled mode switch. The CPU saves the current state, switches to a kernel stack, and jumps to the kernel’s system call handler. The kernel validates arguments, performs the operation, and returns. The entire round trip costs 100 - 1000 ns.
// This C code:
write(1, "hello\n", 6);
// Compiles to roughly:
mov rax, 1 // syscall number for write
mov rdi, 1 // fd: stdout
mov rsi, buf // pointer to string
mov rdx, 6 // length
syscall // trap into kernel
The kernel boundary matters for performance. Database engines use io_uring (Linux 5.1+) to batch system calls, reducing the per-call overhead. High-performance networking uses DPDK to bypass the kernel entirely for packet processing. Every abstraction the OS provides has a crossing cost.
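The same call can be made from C with the system call number spelled out. This is a Linux-only sketch using the `syscall(2)` wrapper from glibc; the helper name `raw_write` is invented here, and ordinary programs would simply call `write()`.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* exposes the syscall() declaration in unistd.h */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke write through the generic syscall() wrapper so the system call
   number is explicit. Returns the byte count written, or -1 on error. */
static long raw_write(int fd, const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    /* SYS_write is 1 on x86-64 Linux, matching `mov rax, 1` in the listing. */
    return syscall(SYS_write, fd, buf, len);
}
```

Calling `raw_write(1, "hello\n", 6)` prints the string and returns 6. Passing an invalid descriptor such as -1 returns -1, which is the kernel's argument validation showing through.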
The Compilation Pipeline: Source to Silicon
High-level code must be translated into machine instructions before a CPU can run it. For C, that journey has four stages:
Preprocessor: Handles #include and #define macros. Pure textual substitution, producing a single translation unit.
Compiler: Translates C into assembly. This is where optimization happens - constant folding, dead code elimination, loop unrolling, inlining. Modern compilers (Clang, GCC) produce assembly that routinely outperforms what most humans would write by hand.
Assembler: Converts assembly mnemonics into binary machine instructions stored in an .o object file. References to external symbols (like printf) are left as placeholders.
Linker: Combines object files and libraries, resolves all symbol references, assigns final virtual addresses, and produces an executable in ELF format (Linux) or Mach-O (macOS) or PE (Windows).
gcc -E hello.c -o hello.i # stop after preprocessing
gcc -S hello.i -o hello.s # stop at assembly
gcc -c hello.s -o hello.o # stop at object file
gcc hello.o -o hello # link into executable
objdump -d -M intel hello.o # disassemble: see the opcodes
Watch what the optimizer does to a simple function:
int add(int a, int b) { return a + b; }
With -O0 (no optimization), this generates a full stack frame setup and teardown. With -O2:
add:
lea eax, [rdi+rsi]
ret
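Two instructions. The compiler proved the stack frame was unnecessary and eliminated it. That gap - roughly 10 lines of assembly vs 2 - is a small preview of how much work can hide behind one line of source, and of why a Python function, which adds interpreter dispatch and object handling on top, can be 100x slower than its C equivalent for the same mathematical operation.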
Interrupts: The CPU’s Attention Mechanism
The CPU does not poll for external events. Instead, hardware devices signal the CPU via interrupts - asynchronous signals that pause the current execution, save state, and jump to an interrupt handler registered in the interrupt descriptor table.
Your keyboard sends an interrupt when you press a key. The NIC sends one when a network packet arrives. The timer chip sends one every millisecond for the scheduler. Without interrupts, the OS would have to busy-poll every device, burning CPU for nothing.
Interrupts are why your system stays responsive while a CPU-bound process runs. The timer interrupt fires, the kernel preempts the running process, and the scheduler gets to run someone else. The user experience of “simultaneous” execution on a single core is entirely a product of interrupts.
Exceptions are synchronous interrupts raised by the CPU itself - divide-by-zero, invalid opcode, page fault, general protection fault. A segmentation fault is not a hardware exception of its own: SIGSEGV in Unix is the OS translating a page fault (or general protection fault) it cannot resolve into a signal to the process.
From Keypress to Execution: The Full Chain
Before the steps, three concepts that apply at every layer:
What “reading” is. A read always follows the same pattern. Something presents a binary number (an address) on a set of wires. The storage element at that address - a DRAM cell, a flip-flop in a hardware register, a GPIO pin comparator - drives its current value onto a second set of wires (the data bus). A destination register captures those values on the next clock edge. That is a read, at every scale. A LOAD instruction in the CPU, the MCU reading a key matrix, the kernel reading an MMIO register: same pattern, different wires.
What triggers sequential reads. Every processor - the keyboard microcontroller, the main CPU, each GPU shader core - has a program counter register. The clock signal advances it after each instruction. If the current instruction is a LOAD, it presents a memory address and reads from it. If it is a STORE, it presents an address and writes to it. The program, stored as bytes in DRAM, is what determines which addresses get read in which order. The program counter is the engine that drives every sequential step.
What triggers asynchronous events. A key closing, a timer firing, an interrupt arriving are voltage changes on dedicated hardware lines. Hardware state machines - circuits that transition through states when signals change - detect these and respond without waiting for any software instruction.
Step 1: The physical switch. Triggered by: mechanical force. Data form: voltage level - 3.3 V (key up) or 0 V (key down). One bit.
Each column of the key matrix is connected to 3.3 V through a pull-up resistor. When a key is open, nothing pulls the column down - it sits at 3.3 V. When pressed, the switch connects that column to its row line, which the keyboard’s microcontroller drives to 0 V (GND) while scanning that row, pulling the column to 0 V. That is the complete signal at this layer: one voltage on one wire. No encoding yet, no byte, no number. Just electrons and a spring.
Step 2: The keyboard microcontroller scans the matrix. Triggered by: the MCU’s clock advancing its program counter through an infinite scan loop stored in flash memory. Data form: 0 V or 3.3 V on a pin → 1 bit in a GPIO input register flip-flop → scan code byte written to MCU RAM.
The firmware running on the keyboard’s microcontroller is an infinite loop. Each iteration:
A STORE instruction writes to the GPIO output register for row N. This register is a bank of flip-flops wired directly to the row pin’s driver transistor. The write turns the transistor on, pulling the selected row to 0 V; a pressed key on that row then pulls its column from 3.3 V down to 0 V as well.
A LOAD instruction reads the GPIO input register. This register is a bank of flip-flops whose inputs are physically connected to the column pins. For each column: a comparator inside the MCU checks whether the pin voltage is above or below the logic threshold. If the pin is at 3.3 V, the flip-flop bit is 1. If pulled to 0 V by a pressed key, the bit is 0. The LOAD instruction copies all 8 column bits from those flip-flops into an MCU CPU register in one clock cycle.
Comparison instructions detect which bit changed from the previous scan (stored in MCU RAM via a LOAD). A change at (row N, column M) maps to a scan code - a number such as 0x04, the USB HID usage ID for ‘A’ (0x1C in the older PS/2 scan code set 2) - looked up from a table in flash (another LOAD). The scan code is stored to a buffer in MCU RAM (a STORE).
The debounce filter keeps a counter in MCU RAM that resets on any change and only emits the scan code when the column pin voltage has been stable for several consecutive scans - a few milliseconds. This is LOAD, increment, STORE, compare instructions executing each scan cycle.
Every “read” in this step is a LOAD whose address resolves to a flip-flop rather than a DRAM cell. The mechanism is identical.
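The debounce counter described in this step can be sketched in a few lines. This is a simplified illustration under stated assumptions: the struct layout, the name `debounce_update`, and the threshold of five scans are invented here and not taken from any particular keyboard firmware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STABLE_SCANS 5   /* assumed: scans a reading must hold before it counts */

/* Hypothetical per-key debounce state kept in MCU RAM. */
struct debounce {
    bool    stable_pressed;  /* last debounced state that was reported */
    bool    last_raw;        /* raw reading seen on the previous scan */
    uint8_t count;           /* consecutive scans with the same raw reading */
};

/* Feed one raw reading per scan (true = column pin read 0 V, key down).
   Returns true exactly once for each debounced state change. */
static bool debounce_update(struct debounce *d, bool raw_pressed)
{
    assert(d != NULL);
    if (raw_pressed != d->last_raw) {
        d->last_raw = raw_pressed;   /* reading changed: restart the count */
        d->count = 1;
        return false;
    }
    if (d->count < STABLE_SCANS)
        d->count++;
    if (d->count == STABLE_SCANS && raw_pressed != d->stable_pressed) {
        d->stable_pressed = raw_pressed;
        return true;                 /* stable long enough: emit the event */
    }
    return false;
}
```

Feeding it a bouncy sequence such as down, up, down, down, down, down, down produces exactly one key-down event, on the fifth consecutive stable reading.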
Step 3: USB transmission. Triggered by: a hardware timer inside the USB host controller, firing every 1-8 ms. No CPU instruction causes this. Data form: scan code byte in MCU RAM → 8-byte USB HID boot-protocol report → serial bit transitions on D+/D- differential pair → bytes in a DMA buffer in system DRAM.
The USB host controller has its own timer. When it expires, the controller’s hardware state machine - logic gates whose transitions are hardwired, not programmed - automatically initiates a poll transaction to the keyboard’s USB address. The MCU’s USB hardware responds by reading the scan code from its RAM buffer (a LOAD) and feeding it to its USB transmit FIFO, a register that serializes bytes to D+/D-.
Serialization: the MCU takes each byte and emits one bit per USB clock period as a differential voltage state on D+/D-. USB uses NRZI encoding, which maps bits to transitions rather than levels: a 0 means swap the D+/D- state, a 1 means hold it. A long run of 1s would therefore leave the wire static, so the transmitter inserts an extra 0 after every six consecutive 1s (bit stuffing). That guarantees regular transitions and lets the receiver stay synchronized.
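A minimal sketch of that encoding, using USB's rule that a 0 bit toggles the line and a 1 bit holds it, with a 0 stuffed after six consecutive 1s. The function name `nrzi_encode` and the abstract 0/1 line level are assumptions for this example; the mapping to the actual D+/D- states, the sync pattern, and end-of-packet signaling are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode `nbits` data bits (one per array element, 0 or 1) into line
   levels using NRZI with bit stuffing.
   A 0 toggles the line level; a 1 leaves it unchanged. After six
   consecutive 1s, a 0 is inserted so the line is forced to toggle.
   `out` must have room for nbits + nbits / 6 levels.
   Returns the number of line levels written. */
static size_t nrzi_encode(const uint8_t *bits, size_t nbits,
                          uint8_t *out, uint8_t start_level)
{
    uint8_t level = start_level & 1u;
    size_t ones = 0;
    size_t n = 0;

    assert(bits != NULL && out != NULL);
    for (size_t i = 0; i < nbits; i++) {
        if (bits[i]) {
            ones++;              /* 1: hold the current level */
        } else {
            level ^= 1u;         /* 0: transition */
            ones = 0;
        }
        out[n++] = level;
        if (ones == 6) {         /* stuffed 0: forced transition */
            level ^= 1u;
            out[n++] = level;
            ones = 0;
        }
    }
    return n;
}
```

Seven 1 bits in a row come out as eight line levels: six unchanged levels, one forced toggle from the stuffed 0, then the seventh data bit holding the new level.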
On the host controller, a shift register - 8 flip-flops in a chain - clocks in one D+/D- state per USB bit period. Each clock edge shifts all values right by one position and captures a new bit at the left. After 8 clocks, the shift register holds one byte. The DMA engine then writes that byte to a pre-allocated buffer in system DRAM, placing the buffer’s address on the memory bus and the byte on the data bus - exactly what a CPU STORE does, but without CPU involvement.
At the end of the transaction, the HID report sits as 8 bytes in system DRAM: one modifier byte, one reserved byte, and six key-code slots. The host controller sets an “interrupt pending” flip-flop in its status register to 1.
Step 4: The interrupt. Triggered by: the USB controller setting a status flip-flop, which the APIC detects and signals to the CPU. Data form: flip-flop bit in USB controller → message written to APIC address → interrupt vector number (an 8-bit integer) → CPU state save + program counter redirected to kernel handler address.
The APIC (Advanced Programmable Interrupt Controller) monitors the USB controller’s interrupt-pending bit through a message-signaled interrupt: when the USB controller sets its status flip-flop, it writes a specific 32-bit value to a specific memory address that is physically wired to the APIC. The APIC decodes this to an interrupt vector number and asserts a signal to the CPU.
At the end of every executed instruction, the CPU’s control logic checks: is there a pending interrupt, and is the interrupt-enable flag set in rflags? If yes, the CPU’s hardwired interrupt sequence begins - this is not a software instruction, it is state-machine logic baked into the CPU’s control unit:
- The CPU decrements rsp and writes rip, cs, rflags, rsp, ss to the kernel stack - STORE operations where the CPU’s register flip-flops drive the data bus, and DRAM cells at the stack addresses capture the values. The current execution context is now frozen in DRAM.
- The CPU reads the Interrupt Descriptor Table. The IDTR register holds the IDT’s base address in DRAM. The CPU computes IDTR.base + (vector_number × 16), presents that address on the memory bus, and captures 16 bytes from DRAM (or cache). Those 16 bytes contain the handler’s address and privilege level. This is a memory read identical in mechanism to any LOAD instruction - address on bus, DRAM drives data, CPU captures.
- The handler address goes into rip. The next instruction fetch is from the interrupt handler’s address in kernel memory.
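The IDT lookup can be written out as plain address arithmetic. The struct below follows the 16-byte x86-64 gate descriptor layout, in which the handler address is split across three fields. The helper names `idt_entry_address` and `idt_handler_address` are invented for this sketch, which only computes values and does not touch a real IDT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 16-byte x86-64 IDT gate descriptor. */
struct idt_gate {
    uint16_t offset_low;    /* handler address bits 0..15 */
    uint16_t selector;      /* code segment selector */
    uint8_t  ist;           /* interrupt stack table index (low 3 bits) */
    uint8_t  type_attr;     /* gate type, privilege level, present bit */
    uint16_t offset_mid;    /* handler address bits 16..31 */
    uint32_t offset_high;   /* handler address bits 32..63 */
    uint32_t reserved;
};

_Static_assert(sizeof(struct idt_gate) == 16, "gate must be 16 bytes");

/* Address of the descriptor for `vector`: IDTR.base + vector * 16. */
static uint64_t idt_entry_address(uint64_t idtr_base, uint8_t vector)
{
    return idtr_base + (uint64_t)vector * sizeof(struct idt_gate);
}

/* Reassemble the 64-bit handler address that ends up in rip. */
static uint64_t idt_handler_address(const struct idt_gate *g)
{
    assert(g != NULL);
    return (uint64_t)g->offset_low
         | ((uint64_t)g->offset_mid << 16)
         | ((uint64_t)g->offset_high << 32);
}
```

With a hypothetical IDT base of 0x1000, vector 33 resolves to address 0x1000 + 33 × 16, and the three offset fields of that entry are concatenated to form the handler's address.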
Step 5: The kernel handler reads the key.
Triggered by: rip now holding the handler’s first instruction address; the CPU’s program counter advancing through handler code instruction by instruction.
Data form: instructions fetched from kernel text segment in DRAM → executed operations → MMIO address in a LOAD operand → hardware flip-flop value in USB chip → scan code byte → Unicode code point (via table LOAD) → character byte STOREd into ring buffer array in DRAM.
The interrupt handler is compiled kernel code sitting in DRAM, cached in L1i. The CPU fetches each instruction (LOAD from rip), decodes it, executes it - the same fetch-decode-execute loop as always.
One instruction is a LOAD with an operand address in the USB controller’s MMIO region. This address goes on the memory bus. The system bus address decoder - a network of logic gates comparing the address to hardwired range boundaries - routes the transaction not to DRAM but to the USB controller chip. The USB controller puts the value of its internal status register flip-flops onto the data bus. The CPU captures the byte. From the instruction’s point of view this is indistinguishable from a DRAM read - same mechanism, different destination wired to the same address bus.
Subsequent instructions LOAD the HID report bytes from the DMA buffer address in DRAM, LOAD the scan-code-to-Unicode lookup table entry (an array in kernel DRAM, indexed by scan code), and STORE the resulting Unicode byte into the ring buffer at position ring_buffer_base + write_pointer. An ADD increments write_pointer, a STORE saves the new value. Each of these is one or a few machine instructions, each fetched and executed by the program counter advancing.
iret executes last. The CPU LOADs the saved rip, cs, rflags, rsp, ss from the kernel stack addresses, writes them back into the actual register flip-flops, and resumes. The interrupted program’s state is byte-for-byte restored.
Step 6: The waiting application reads the character.
Triggered by: the scheduler LOADing the process’s saved rip into the CPU - nothing inside the application triggers this; the kernel does it.
Data form: byte in ring buffer array in kernel DRAM → copied to user-space buffer address in process memory → returned as integer in rax.
The application had called read(fd, buf, 1). The kernel, finding no data, saved the application’s register state to its kernel stack (same STORE sequence as the interrupt save), moved the process descriptor from the run queue to a wait queue (pointer updates in kernel DRAM), and switched to another process.
After the interrupt handler incremented the write pointer, it called a kernel function that moves the process descriptor back to the run queue (updating the same pointer fields in DRAM). The timer interrupt fires roughly every 1 ms, running the scheduler. The scheduler LOADs the run queue head from DRAM, picks the next process, and performs a context switch: it STOREs the current process’s registers to its kernel stack, then LOADs the resumed process’s saved register values from its kernel stack into the CPU, including restoring rip to the instruction inside read() where execution was suspended.
read() resumes. It finds write_pointer != read_pointer (a LOAD and compare). It LOADs one byte from ring_buffer_base + read_pointer in kernel DRAM, STOREs it to the user-space buffer address, increments read_pointer, and returns 1 in rax. The application’s instruction after the read() call now runs.
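The ring buffer shared by the handler (Step 5) and read() (Step 6) can be sketched as follows. This is a single-producer, single-consumer illustration: `RING_SIZE`, `ring_put`, and `ring_get` are hypothetical names, and the locking and memory barriers a real kernel needs are deliberately omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 16u   /* assumed capacity for the example */

/* Illustrative byte ring; it does not mirror any real kernel structure. */
struct ring {
    uint8_t data[RING_SIZE];
    size_t  write_pointer;   /* advanced by the interrupt handler */
    size_t  read_pointer;    /* advanced by read() */
};

/* Producer side: store one byte. Returns false if the ring is full. */
static bool ring_put(struct ring *r, uint8_t byte)
{
    assert(r != NULL);
    size_t next = (r->write_pointer + 1) % RING_SIZE;
    if (next == r->read_pointer)
        return false;                 /* full: one slot always stays empty */
    r->data[r->write_pointer] = byte;
    r->write_pointer = next;
    return true;
}

/* Consumer side: load one byte. Returns false if the ring is empty. */
static bool ring_get(struct ring *r, uint8_t *out)
{
    assert(r != NULL && out != NULL);
    if (r->write_pointer == r->read_pointer)
        return false;                 /* empty: nothing to read yet */
    *out = r->data[r->read_pointer];
    r->read_pointer = (r->read_pointer + 1) % RING_SIZE;
    return true;
}
```

The empty test in `ring_get` is the `write_pointer != read_pointer` comparison from the text. Keeping one slot unused is one common way to tell a full ring from an empty one without a separate counter.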
Step 7: The character appears on screen. Triggered by: application program counter → GPU shader program counters → display controller hardware timer. Data form: Unicode byte → Bezier curve floats (from font file in DRAM) → alpha bitmap (byte array in DRAM) → draw command struct (in shared DRAM) → RGBA values (32 bits/pixel in VRAM) → digital RGB stream on HDMI → voltage per LCD transistor → photon.
The application STOREs the character into its internal grid array in DRAM and calls its rendering library. The font engine LOADs glyph data from a memory-mapped font file - on first access this triggers a page fault and the kernel reads the font file bytes from disk into DRAM; on subsequent accesses they hit DRAM or cache directly. Glyph outlines are stored as Bezier curve control points: floating-point coordinates. The font engine executes an algorithm that traces those curves at the current pixel size, computing for each pixel in a small bounding box how much of it falls inside the glyph outline. The output is an alpha bitmap - a byte array in DRAM. This is FADD, FMUL, compare, STORE instructions through the normal fetch-decode-execute loop.
The terminal submits a draw call to the GPU. The CPU STOREs a draw command structure into a region of DRAM shared between CPU and GPU, then writes to a GPU doorbell MMIO address - a flip-flop in the GPU chip that, when written, signals the GPU’s command processor. The GPU’s command processor LOADs the command from the shared DRAM and dispatches shader programs to the GPU cores.
Each shader core is a small processor with its own program counter and fetch-decode-execute cycle. Hundreds run simultaneously. Each handles one pixel: LOAD the alpha value from the glyph bitmap at the pixel’s position, LOAD foreground and background colors from a constant buffer in DRAM, compute alpha × foreground + (1 - alpha) × background with floating-point instructions, STORE the resulting 32-bit RGBA into the framebuffer at address vram_base + (y × screen_width + x) × 4. The framebuffer is an array in VRAM - GPU memory - one 32-bit value per screen pixel.
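The per-pixel work is small enough to show on the CPU. The sketch below applies the blend formula to one 8-bit color channel and computes the framebuffer byte offset from the address expression in the text. The function names are invented, and a real shader would run in a GPU shading language rather than C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blend one 8-bit color channel: alpha * fg + (1 - alpha) * bg,
   with alpha in 0..255 standing in for 0.0..1.0. */
static uint8_t blend_channel(uint8_t fg, uint8_t bg, uint8_t alpha)
{
    float a = alpha / 255.0f;
    return (uint8_t)(a * fg + (1.0f - a) * bg + 0.5f);   /* round to nearest */
}

/* Byte offset of pixel (x, y) in a 32-bit-per-pixel framebuffer:
   (y * screen_width + x) * 4, to be added to the framebuffer base. */
static size_t pixel_offset(size_t x, size_t y, size_t screen_width)
{
    assert(x < screen_width);
    return (y * screen_width + x) * 4;
}
```

Fully opaque coverage (alpha 255) returns the foreground value and alpha 0 returns the background. On a 1920-pixel-wide screen, the pixel at x = 3, y = 2 starts 15372 bytes into the framebuffer.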
The display controller has a hardware timer that fires at the monitor’s refresh rate (every 16.67 ms at 60 Hz). On each fire, a state machine LOADs the framebuffer sequentially from VRAM, one pixel at a time, extracts RGB values, and feeds them to a serializer that converts the parallel pixel data into a high-speed differential bit stream on the HDMI or DisplayPort cable.
Inside the monitor, the timing controller drives a thin-film transistor (TFT) per pixel. On LCD: the TFT applies a voltage to a liquid crystal cell between two polarizers - voltage rotates the crystals, controlling how much of the LED backlight passes through. On OLED: the TFT drives current through an organic emitter, and the emitter produces photons in proportion to the current.
The photon reaches your retina.
The complete chain. LOAD, STORE, arithmetic, branch. Every step is one of those four things. The keyboard MCU LOADed a GPIO flip-flop, found a changed bit, STOREd a scan code. A USB state machine shifted bits through a chain of flip-flops and DMA-STOREd bytes into DRAM. A status flip-flop was set to 1, the APIC signaled the CPU, the CPU STOREd five register values (rip, cs, rflags, rsp, ss) to DRAM, LOADed a handler address from the IDT array in DRAM, and set rip. The handler’s instructions LOADed a status value from an MMIO flip-flop, LOADed the scan code from the DMA buffer, LOADed a Unicode value from a table in DRAM, and STOREd it into a ring buffer in DRAM; then iret LOADed the five saved register values back. The scheduler LOADed the waiting process’s rip. The font engine LOADed Bezier floats, computed an alpha bitmap, STOREd it. GPU shader cores each LOADed one alpha and STOREd one RGBA to VRAM. The display controller LOADed VRAM pixel by pixel, serialized to HDMI, drove a transistor, which let light through.
LOAD. STORE. Add. Branch. That is the complete vocabulary.
Who Keeps Everything Running?
One natural question at this point: the keypress example showed the terminal waking up when a key arrived. But what about programs that are always running - the SSH daemon, the window manager, the system logger, the background sync service? Nobody pressed a key to wake them. What drives them? How do they know when to run?
Processes do not know anything. A sleeping process is not a program that is waiting and checking. It is a data structure in kernel memory - a struct containing saved register values, a list of open file descriptors, memory mappings - sitting in a queue. No code runs. No clock advances. The process is completely inert.
The kernel scheduler is the master. The scheduler runs on every timer interrupt (up to roughly 1000 times per second on Linux, depending on kernel configuration). Each time, it examines the run queue - a list of processes that are ready to execute - and picks the next one. It performs a context switch: STOREs the current CPU state to the outgoing process’s kernel stack, LOADs the incoming process’s saved state into the CPU. From the resumed process’s perspective, no time passed; it simply continues from the next instruction.
Events move processes between queues. A process is either:
- Running - currently executing on a CPU core.
- Runnable - in the run queue, waiting for a CPU core to become available.
- Sleeping - in a wait queue, waiting for a specific event.
The kernel has wait queues for everything: waiting for a file to have data (read()), waiting for a lock to be released, waiting for a timer to expire (sleep()), waiting for a network packet. When a process calls read() and no data is available, the kernel moves it to the wait queue for that file descriptor and runs something else. When data arrives (via an interrupt, exactly as shown above), the interrupt handler moves the process back to the run queue. The process does not “check” anything - it is moved.
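The “it is moved” point can be modeled with a state field. This sketch is heavily simplified: real kernels keep linked run and wait queues rather than scanning an array, and every name here (`struct proc`, `block_on`, `wake_all`, `pick_next`) is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum proc_state { RUNNING, RUNNABLE, SLEEPING };

/* Hypothetical, heavily simplified process descriptor. */
struct proc {
    int             pid;
    enum proc_state state;
    int             waiting_on;   /* event id while SLEEPING, else -1 */
};

/* Kernel side of a blocking read(): park the process on an event. */
static void block_on(struct proc *p, int event)
{
    assert(p != NULL);
    p->state = SLEEPING;
    p->waiting_on = event;
}

/* Interrupt-handler side: make every process sleeping on `event`
   runnable again. Returns how many were woken. */
static size_t wake_all(struct proc *procs, size_t n, int event)
{
    size_t woken = 0;
    for (size_t i = 0; i < n; i++) {
        if (procs[i].state == SLEEPING && procs[i].waiting_on == event) {
            procs[i].state = RUNNABLE;
            procs[i].waiting_on = -1;
            woken++;
        }
    }
    return woken;
}

/* Scheduler tick: pick the first RUNNABLE process, or NULL if none. */
static struct proc *pick_next(struct proc *procs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (procs[i].state == RUNNABLE)
            return &procs[i];
    }
    return NULL;
}
```

Nothing in the sleeping process's own code runs between `block_on` and `wake_all`. Its descriptor simply changes state, and `pick_next` skips it until an event handler flips it back to RUNNABLE.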
Daemons wait on events. The SSH daemon calls accept() and sleeps waiting for a new TCP connection. The system logger calls read() on a socket and sleeps waiting for log messages. The cron scheduler calls nanosleep() and sleeps until its next scheduled job. All of them are in wait queues. When the event they need arrives - a new connection, a log message, a timer expiry - the relevant interrupt handler or kernel subsystem moves them to the run queue. The scheduler gives them CPU time on the next tick.
The timer interrupt is the heartbeat. The timer chip (historically the PIT, now the HPET or local APIC timer) is a hardware counter that decrements on every clock cycle and fires an interrupt when it reaches zero. The kernel programs it to fire at a fixed interval, 1 ms in this walkthrough (the actual rate depends on kernel configuration). This timer interrupt is what drives the scheduler, what wakes sleep() calls, what drives the entire illusion of simultaneous execution. If the timer stopped firing, the system would freeze: a CPU-bound process would never be preempted, and nothing else would get CPU time on that core.
Kernel threads are processes that run entirely in kernel space. kswapd wakes up when memory pressure is high and pages memory to disk. ksoftirqd processes network packets that arrived via interrupt but were deferred for later processing. kworker threads run items placed on work queues by device drivers. These threads spend most of their time sleeping; they wake when their work queue has items. The mechanism is the same: they are data structures in wait queues until the interrupt handler or other kernel code moves them to the run queue.
Nothing is magic. There is no background “tick” that programs perceive. There is no invisible thread that checks things. There is a timer that fires an interrupt 1000 times per second, a scheduler function that runs on each interrupt and moves things between queues, and a set of processes that are either running, waiting for a core, or waiting for an event. The entire impression of a “live” system with dozens of things happening simultaneously is this loop, running continuously, switching between processes faster than any human can perceive.
Memory Layout: Where Your Program Lives
Every process gets a virtual address space laid out in sections:
| Segment | Contents |
|---|---|
| Text | Executable machine code (read-only) |
| Data | Initialized global variables |
| BSS | Uninitialized globals (zero-filled, no disk space) |
| Heap | Dynamic allocations (malloc) - grows upward |
| Stack | Call frames, local variables - grows downward |
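In the classic layout the stack and heap grow toward each other. On a modern OS a stack overflow (no pun intended) happens long before they meet: the stack has a size limit (commonly 8 MB by default on Linux), and growing past it touches an unmapped guard page, which raises a fault. Recursive algorithms without tail-call optimization can hit this.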
The text segment is marked read-only and executable. The stack is writable but not executable by default (the NX bit, or “No eXecute”). This is a security measure: if an attacker injects shellcode onto the stack, the CPU will refuse to execute it. Return-oriented programming (ROP) is the attacker’s answer - instead of executing injected code, chain together existing executable bytes in the text segment.
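One way to see the layout is to take the address of one object from each region. The sketch below is illustrative only: the names `sample_layout`, `data_counter`, and `bss_counter` are invented, and with address space layout randomization the exact values, and sometimes the relative order, differ between runs and systems.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

int data_counter = 42;   /* initialized global: Data segment */
int bss_counter;         /* uninitialized global: BSS, zero-filled at load */

/* One sample address from each region, for printing or comparison. */
struct layout {
    uintptr_t text, data, bss, heap, stack;
};

/* Fill `out` with sample addresses. Returns 0, or -1 if malloc fails. */
int sample_layout(struct layout *out)
{
    int local = 0;                    /* lives in this call's stack frame */
    void *heap_obj = malloc(16);      /* lives on the heap */

    assert(out != NULL);
    if (heap_obj == NULL)
        return -1;
    out->text  = (uintptr_t)&sample_layout;   /* function code: Text */
    out->data  = (uintptr_t)&data_counter;
    out->bss   = (uintptr_t)&bss_counter;
    out->heap  = (uintptr_t)heap_obj;
    out->stack = (uintptr_t)&local;
    free(heap_obj);
    return 0;
}
```

Printing the five fields with `printf("%p", (void *)value)` on a typical Linux system usually shows the stack address far above the others, though that ordering is a convention of the platform and not something the C language promises.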
Why Cloud VMs Feel Like Real Hardware
When you launch an EC2 instance, you get what feels like a dedicated server. This illusion is maintained by the ISA abstraction and hardware virtualization.
Modern CPUs have a hypervisor mode (VMX on Intel, SVM on AMD) that sits below ring 0. The hypervisor can intercept privileged instructions, emulate hardware, and enforce isolation between VMs - all without the guest OS knowing. The guest’s syscall instruction lands in the guest kernel, but the guest kernel itself runs in a restricted context where its own privileged instructions are intercepted.
The ISA is the common language. A VM running x86-64 guest code on an x86-64 host runs nearly at native speed because the instruction set is the same - just the privilege level is virtualized. This is why AWS Graviton (ARM) requires running ARM OS images: the ISA contract is the boundary, and crossing it requires translation (emulation), which is slow.
The Future: RISC-V and the End of ISA Lock-In
For more than 30 years, x86 and its 64-bit successor x86-64 won because of backward compatibility and ecosystem momentum. ARM won mobile because of power efficiency. The question for the next decade is whether RISC-V can win on openness.
RISC-V is not just another ISA. It is an open standard, royalty-free, maintained by a non-profit. Any company can implement it without licensing fees. The base ISA is minimal and clean; extensions are modular. SiFive and Ventana make RISC-V server chips. Google is building RISC-V into Android. Western Digital and Seagate put RISC-V cores in storage devices.
The risk: ecosystem fragmentation. The modular extensions mean a binary compiled to use an optional extension - for example the vector extension, RISC-V’s rough counterpart to AVX - will not run on a core that implements only the base ISA. The ISA contract is only as strong as the extensions it specifies, and the RISC-V ecosystem is still working this out.
Apple Silicon demonstrated that a well-executed ARM chip can obliterate x86 chips on performance per watt. The M-series chips are not just fast - they redefined what laptop performance could mean. Every cloud provider is watching and building ARM capacity.
The x86 monopoly in servers is over. The question now is what the new equilibrium looks like.
| Concept | Key Insight |
|---|---|
| Stored-program | Code is data; both live in memory; changed computing permanently |
| ISA | The hardware-software contract; what makes portability possible |
| Fetch-decode-execute | The CPU’s fundamental loop; 3 billion times/second per core |
| Registers | The CPU’s working scratch space; more is faster |
| System calls | The user/kernel boundary; 100 - 1000 ns crossing cost |
| Out-of-order execution | Speed via reordering; the root of Spectre/Meltdown |
| Interrupts | How the OS stays in control; timer interrupts power scheduling |
| RISC-V | Open ISA; the end of x86 lock-in, if the ecosystem cooperates |