How Computers Execute Programs
Prerequisite: None - this is the starting point for the computer systems series.
When you run a Python script or a compiled C binary, something extraordinary happens beneath the surface. Billions of transistors switch state in coordinated patterns, moving numbers through registers and caches, ultimately producing the output you see. Understanding this pipeline - from source code to machine execution - is foundational to reasoning about performance, security, and systems design.
The Compilation Pipeline
High-level code must be translated into machine instructions before a CPU can run it. For a C program, that journey has four stages.
Preprocessor: Handles #include directives and #define macros. It performs textual substitution before any real parsing, producing a single translation unit.
Compiler: Translates C source into assembly language. This is where optimisation happens - constant folding, inlining, loop unrolling. The output is human-readable assembly like mov rax, 5 or call printf.
Assembler: Converts assembly into object code - binary machine instructions specific to the target architecture, stored in an .o file. References to external symbols (like printf) are left unresolved.
Linker: Combines multiple object files and libraries, resolves symbol references, assigns final memory addresses, and produces the executable. The result is a binary in ELF format (Linux) or Mach-O (macOS).
```sh
# Observe each stage explicitly
gcc -E hello.c -o hello.i   # preprocessed output
gcc -S hello.i -o hello.s   # assembly
gcc -c hello.s -o hello.o   # object file
gcc hello.o -o hello        # link → executable

# Inspect the object file
objdump -d hello.o
```
objdump -d disassembles the object file, showing the actual hex opcodes alongside human-readable assembly. This is invaluable for understanding what the compiler actually produced.
Instruction Set Architecture
The ISA is the contract between hardware and software. It defines which instructions exist, what they do, how memory is addressed, and how registers are named. Two programs compiled to different ISAs are completely incompatible at the binary level.
x86-64 (AMD64) dominates desktops and servers. It is a CISC (Complex Instruction Set Computer) architecture with variable-length instructions - some instructions are 1 byte, others up to 15. It has 16 general-purpose 64-bit registers: rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, and r8–r15.
ARM (AArch64) dominates mobile and, increasingly, servers (AWS Graviton, Apple Silicon). RISC-based: fixed 32-bit instruction width, simpler instruction set, typically more energy-efficient.
Key registers to know:
- rax - return value from functions, accumulator
- rsp - stack pointer, always points to the top of the stack
- rip - instruction pointer, the address of the next instruction to execute
- rbp - base pointer, often used to reference local variables in a stack frame
The Fetch-Decode-Execute Cycle
At the hardware level, the CPU repeats three steps endlessly:
- Fetch: Read the instruction at the address in rip from memory (or cache).
- Decode: Interpret the binary opcode - what operation is this? What are its operands?
- Execute: Perform the operation - add two registers, load from memory, branch to a new address.
Modern CPUs complicate this picture significantly. Pipelining overlaps these stages so that while one instruction executes, the next is being decoded, and the one after that is being fetched. Out-of-order execution lets the CPU reorder independent instructions to hide latency. Branch prediction speculatively executes the likely branch before the condition is even resolved. These optimisations are why Spectre and Meltdown were possible - speculation crossed security boundaries.
Memory Layout
Every process gets a virtual address space divided into segments:
| Segment | Contents |
|---|---|
| Text | Executable machine code (read-only) |
| Data | Initialised global variables |
| BSS | Uninitialised global variables (zero-filled) |
| Heap | Dynamic allocations (malloc, new) - grows upward |
| Stack | Function call frames, local variables - grows downward |
The stack is LIFO and automatic. Each function call pushes a new stack frame containing the return address, saved registers, and local variables. When the function returns, the frame is popped. Stack allocation is just a subtraction from rsp - extremely fast, $O(1)$.
The heap is for data whose lifetime outlives a single function call. malloc hands out memory from a pool that the allocator manages, requesting more from the OS (via the brk or mmap system calls) as needed. This is slower than stack allocation and requires explicit deallocation in C.
System Calls: Crossing Into the Kernel
User programs run in user mode - they cannot directly access hardware, other processes' memory, or kernel data structures. To do anything privileged (write to a file, open a network socket, allocate memory), a program must invoke a system call.
On x86-64 Linux, the syscall instruction triggers a mode switch into the kernel. The kernel validates the arguments, performs the operation, and returns. This transition costs roughly 100–1000 ns - significant in tight loops.
```asm
; This C code:
;     write(1, "hello\n", 6);
; compiles to roughly:
mov rax, 1    ; syscall number for write
mov rdi, 1    ; fd = stdout
mov rsi, buf  ; pointer to string
mov rdx, 6    ; length
syscall       ; trap into kernel
```
Process vs Thread
A process is an isolated execution environment with its own virtual address space, file descriptors, and signal handlers. Creating a process with fork() copies the address space (using copy-on-write for efficiency).
A thread shares the address space of its parent process - all threads see the same global variables and heap. Threads are lighter to create and communicate via shared memory, but require synchronisation (mutexes, atomics) to avoid data races.
Von Neumann vs Harvard Architecture
In the Von Neumann architecture (used by all mainstream CPUs), code and data share the same memory bus. In Harvard architecture (used in many microcontrollers), instruction memory and data memory are separate, allowing simultaneous access. Modern CPUs are von Neumann at the DRAM level but use split L1 caches (separate instruction and data caches) - a hybrid sometimes called Modified Harvard.
Examples
Tracing a simple function through compilation:
```c
// add.c
int add(int a, int b) {
    return a + b;
}
```

```sh
gcc -O0 -S add.c -o add.s
```
The resulting assembly (simplified x86-64):
```asm
add:
    push rbp
    mov  rbp, rsp
    mov  DWORD PTR [rbp-4], edi   ; store parameter a
    mov  DWORD PTR [rbp-8], esi   ; store parameter b
    mov  edx, DWORD PTR [rbp-4]
    mov  eax, DWORD PTR [rbp-8]
    add  eax, edx                 ; a + b
    pop  rbp
    ret
```
With -O2, this collapses to just lea eax, [rdi+rsi]; ret - the compiler eliminates the stack frame entirely.
What objdump shows:
```sh
objdump -d -M intel add.o | head -20
```
Output reveals the hex encoding of each instruction alongside the mnemonic. The add eax, edx instruction encodes as 01 d0 - a single byte opcode plus a ModRM byte specifying the operands. This hex representation is what actually lives in the .text section of your binary.
Understanding this pipeline - source to binary to execution - makes it possible to reason about why a 10-line C function can outperform a 3-line Python equivalent by 100x, or how a single missing bounds check becomes a remote code execution vulnerability.