How Computers Execute Programs // Megha Bose

Prerequisite: None - this is the starting point for the computer systems series.

When you run a Python script or a compiled C binary, something extraordinary happens beneath the surface. Billions of transistors switch state in coordinated patterns, moving numbers through registers and caches, ultimately producing the output you see. Understanding this pipeline - from source code to machine execution - is foundational to reasoning about performance, security, and systems design.

The Compilation Pipeline

High-level code must be translated into machine instructions before a CPU can run it. For a C program, that journey has four stages.

Preprocessor: Handles #include directives and #define macros. It performs textual substitution before any real parsing, producing a single translation unit.

Compiler: Translates C source into assembly language. This is where optimisation happens - constant folding, inlining, loop unrolling. The output is human-readable assembly like mov rax, 5 or call printf.

Assembler: Converts assembly into object code - binary machine instructions specific to the target architecture, stored in an .o file. References to external symbols (like printf) are left unresolved.

Linker: Combines multiple object files and libraries, resolves symbol references, assigns final memory addresses, and produces the executable. The result is a binary in ELF format (Linux) or Mach-O (macOS).

# Observe each stage explicitly
gcc -E hello.c -o hello.i        # preprocessed output
gcc -S hello.i -o hello.s        # assembly
gcc -c hello.s -o hello.o        # object file
gcc    hello.o -o hello           # link → executable

# Inspect the object file
objdump -d hello.o

objdump -d disassembles the object file, showing the actual hex opcodes alongside human-readable assembly. This is invaluable for understanding what the compiler actually produced.

Instruction Set Architecture

The ISA is the contract between hardware and software. It defines which instructions exist, what they do, how memory is addressed, and how registers are named. Two programs compiled to different ISAs are completely incompatible at the binary level.

x86-64 (AMD64) dominates desktops and servers. It is a CISC (Complex Instruction Set Computer) architecture with variable-length instructions - some instructions are 1 byte, others up to 15. It has 16 general-purpose 64-bit registers: rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, and r8–r15.

ARM (AArch64) dominates mobile and, increasingly, servers (AWS Graviton, Apple Silicon). RISC-based: fixed 32-bit instruction width, simpler instruction set, typically more energy-efficient.

Key registers to know:

rax - return value from functions, accumulator
rsp - stack pointer, always points to the top of the stack
rip - instruction pointer, the address of the next instruction to execute
rbp - base pointer, often used to reference local variables in a stack frame

The Fetch-Decode-Execute Cycle

At the hardware level, the CPU repeats three steps endlessly:

Fetch: Read the instruction at the address in rip from memory (or cache).
Decode: Interpret the binary opcode - what operation is this? What are its operands?
Execute: Perform the operation - add two registers, load from memory, branch to a new address.

Modern CPUs complicate this picture significantly. Pipelining overlaps these stages so that while one instruction executes, the next is being decoded, and the one after that is being fetched. Out-of-order execution lets the CPU reorder independent instructions to hide latency. Branch prediction speculatively executes the likely branch before the condition is even resolved. These optimisations are why Spectre and Meltdown were possible - speculation crossed security boundaries.

Memory Layout

Every process gets a virtual address space divided into segments:

Segment	Contents
Text	Executable machine code (read-only)
Data	Initialised global variables
BSS	Uninitialised global variables (zero-filled)
Heap	Dynamic allocations (`malloc`, `new`) - grows upward
Stack	Function call frames, local variables - grows downward

The stack is LIFO and automatic. Each function call pushes a new stack frame containing the return address, saved registers, and local variables. When the function returns, the frame is popped. Stack allocation is just a subtraction from rsp - extremely fast, $O(1)$.

The heap is for data whose lifetime outlives a single function call. malloc asks the OS for memory (via brk or mmap system calls) and the allocator manages the pool. This is slower and requires explicit deallocation in C.

System Calls: Crossing Into the Kernel

User programs run in user mode - they cannot directly access hardware, other processes' memory, or kernel data structures. To do anything privileged (write to a file, open a network socket, allocate memory), a program must invoke a system call.

On x86-64 Linux, the syscall instruction triggers a mode switch to the kernel. The kernel validates arguments, performs the operation, and returns. This context switch costs ~100–1000 ns - significant in tight loops.

// This C code:
write(1, "hello\n", 6);

// Compiles to roughly:
mov rax, 1       // syscall number for write
mov rdi, 1       // fd = stdout
mov rsi, buf     // pointer to string
mov rdx, 6       // length
syscall          // trap into kernel

Process vs Thread

A process is an isolated execution environment with its own virtual address space, file descriptors, and signal handlers. Creating a process with fork() copies the address space (using copy-on-write for efficiency).

A thread shares the address space of its parent process - all threads see the same global variables and heap. Threads are lighter to create and communicate via shared memory, but require synchronisation (mutexes, atomics) to avoid data races.

Von Neumann vs Harvard Architecture

In the Von Neumann architecture (used by all mainstream CPUs), code and data share the same memory bus. In Harvard architecture (used in many microcontrollers), instruction memory and data memory are separate, allowing simultaneous access. Modern CPUs are von Neumann at the DRAM level but use split L1 caches (separate instruction and data caches) - a hybrid sometimes called Modified Harvard.

Examples

Tracing a simple function through compilation:

// add.c
int add(int a, int b) {
    return a + b;
}

gcc -O0 -S add.c -o add.s

The resulting assembly (simplified x86-64):

add:
    push   rbp
    mov    rbp, rsp
    mov    DWORD PTR [rbp-4], edi   ; store parameter a
    mov    DWORD PTR [rbp-8], esi   ; store parameter b
    mov    edx, DWORD PTR [rbp-4]
    mov    eax, DWORD PTR [rbp-8]
    add    eax, edx                  ; a + b
    pop    rbp
    ret

With -O2, this collapses to just lea eax, [rdi+rsi]; ret - the compiler eliminates the stack frame entirely.

What objdump shows:

objdump -d -M intel add.o | head -20

Output reveals the hex encoding of each instruction alongside the mnemonic. The add eax, edx instruction encodes as 01 d0 - a single byte opcode plus a ModRM byte specifying the operands. This hex representation is what actually lives in the .text section of your binary.

Understanding this pipeline - source to binary to execution - makes it possible to reason about why a 10-line C function can outperform a 3-line Python equivalent by 100x, or how a single missing bounds check becomes a remote code execution vulnerability.

Read Next: