Assembly & Machine Code - The Language the CPU Actually Speaks // Megha Bose

Helpful context:

Take a simple C function and compile it with gcc -O0. You get perhaps 200 lines of x86-64 assembly: stack frame setup, arguments shuffled through memory, variables loaded and stored on every use. Compile the same function with gcc -O3. You get 15 lines - sometimes fewer. The compiler hoisted loop-invariant computations, eliminated dead stores, scheduled instructions to avoid pipeline stalls, and chose lea over imul because the latency is lower. Same C code. Completely different machine behavior.

That gap is what assembly lets you see. Not because you need to write assembly - almost nobody does anymore. But because understanding what the compiler produces is the difference between treating performance as mysterious and treating it as legible.

Machine Code and Its History

In the early days of computing, before assemblers existed, programmers entered machine code directly - flipping switches, loading binary patterns, patching addresses by hand when code changed. Tools that translated symbolic names into binary arrived in the early 1950s. (Grace Hopper’s A-0, from 1952, is often cited as the first “assembler,” though it was technically a higher-level language translator.) The idea was considered radical: why would you let a machine write its own code?

Assembly is that symbolic representation. One assembly instruction corresponds to one machine instruction. The assembler’s job is simple: translate mnemonic opcodes to binary opcodes, replace label names with actual addresses, and output an object file.

What separates x86-64 from every competing ISA is a moat built on binary compatibility. When IBM shipped the PC in 1981 with Intel’s 8088 (an 8086 variant with a narrower external data bus), a massive installed base of software accumulated. Intel’s commitment was that new processors would always run old binaries. The 286 ran 8086 code. The 386 added 32-bit protected mode but ran 286 code. The x86-64 extension (from AMD, licensed to Intel) added 64-bit mode but remained backward compatible. Every new processor generation - from Pentium to Core to Skylake to Alder Lake - can still execute code written for that original PC. This is why x86 survived the RISC vs CISC wars of the 1990s, when RISC architectures (MIPS, Alpha, SPARC) were architecturally cleaner and often faster. RISC won the argument. x86 won the market. Binary compatibility beat elegance.

The irony: modern x86 processors are not actually CISC machines internally. Since the Pentium Pro (1995), Intel decomposes CISC instructions into micro-operations (µops) internally and executes them on a RISC-style out-of-order engine. The CISC instruction set is a compatibility layer - a translator that converts the legacy encoding into the actual operations the CPU executes. You write x86 assembly; the CPU runs something else entirely.

Registers: The CPU’s Working Memory

Registers are named storage locations inside the CPU itself, directly accessible by the execution units. A memory access that hits L1 cache still takes 4 - 5 cycles. A register operand adds no such latency - it is delivered to the execution unit as part of issuing the instruction.

x86-64 general-purpose registers: rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, and r8 through r15. Each holds 64 bits. The same physical register is accessible at different widths to maintain backward compatibility:

rax: full 64-bit register
eax: low 32 bits (writing to eax zero-extends into rax)
ax: low 16 bits
al: low 8 bits
ah: bits 8 - 15 (historical artifact, limited use)
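The aliasing rules above can be modeled in plain C with masks and shifts. This is only a sketch: the helper names (read_eax, write_ax, and so on) are made up for illustration, and a uint64_t stands in for rax.

```c
#include <stdint.h>

/* Hypothetical helpers: a uint64_t stands in for the rax register. */
uint32_t read_eax(uint64_t rax) { return (uint32_t)rax; }        /* low 32 bits */
uint16_t read_ax(uint64_t rax)  { return (uint16_t)rax; }        /* low 16 bits */
uint8_t  read_al(uint64_t rax)  { return (uint8_t)rax; }         /* low 8 bits  */
uint8_t  read_ah(uint64_t rax)  { return (uint8_t)(rax >> 8); }  /* bits 8 - 15 */

/* Writing eax zero-extends: the upper 32 bits of rax are cleared. */
uint64_t write_eax(uint64_t rax, uint32_t v) {
    (void)rax;                 /* the old value does not survive */
    return (uint64_t)v;
}

/* Writing ax or al leaves every other bit of rax unchanged. */
uint64_t write_ax(uint64_t rax, uint16_t v) {
    return (rax & ~UINT64_C(0xFFFF)) | v;
}
uint64_t write_al(uint64_t rax, uint8_t v) {
    return (rax & ~UINT64_C(0xFF)) | v;
}
```

The asymmetry between write_eax and write_ax is the part worth remembering: only the 32-bit write clears the upper half.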

Special-purpose registers:

rip: instruction pointer, always points to the next instruction. You cannot mov to or from it; jmp, call, and ret write it, and x86-64 can use it as the base of a memory operand (RIP-relative addressing).
rflags: condition bits set by arithmetic instructions - zero flag (ZF), carry flag (CF), sign flag (SF), overflow flag (OF). Conditional jumps read these flags.
rsp: stack pointer, points to the top of the call stack (lowest valid address).
rbp: base pointer, optional but conventional anchor for the stack frame.

Floating-point and SIMD work happens in a separate register file:

xmm0 - xmm15: 128-bit registers (SSE)
ymm0 - ymm15: 256-bit registers (AVX, AVX2); each is the low half of the corresponding zmm register, and its own low 128 bits are the corresponding xmm register
zmm0 - zmm31: 512-bit registers (AVX-512)

When a function receives a double argument or returns one, it goes through xmm0. When a linear algebra kernel processes 8 floats simultaneously, they live in one ymm register.

Instruction Classes

Data movement. mov dst, src copies a value from register to register, memory to register, register to memory, or immediate to register. movzx zero-extends (useful for unsigned integers narrower than 64 bits). movsx sign-extends.

push rax is sugar for: decrement rsp by 8, then write rax to the memory at rsp. pop rbx does the reverse: read memory at rsp into rbx, then increment rsp by 8.
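As a rough illustration of that two-step behavior, here is a toy model in C. The Machine struct, its 64-byte memory, and the function names are assumptions made for this sketch; it does no bounds checking, so it is not a robust stack.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical toy machine: 64 bytes of stack memory and a stack pointer. */
typedef struct {
    uint8_t  mem[64];
    uint64_t rsp;          /* byte offset into mem; the stack grows downward */
} Machine;

void machine_init(Machine *m) {
    memset(m->mem, 0, sizeof m->mem);
    m->rsp = sizeof m->mem;            /* empty stack: rsp at the top */
}

/* push: decrement rsp by 8, then store the value at [rsp]. */
void push64(Machine *m, uint64_t value) {
    m->rsp -= 8;
    memcpy(&m->mem[m->rsp], &value, 8);
}

/* pop: load the value at [rsp], then increment rsp by 8. */
uint64_t pop64(Machine *m) {
    uint64_t value;
    memcpy(&value, &m->mem[m->rsp], 8);
    m->rsp += 8;
    return value;
}
```

Pushing two values and popping them returns them in reverse order, with rsp back where it started.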

Arithmetic. add rax, rbx adds rbx to rax and stores the result in rax, setting flags. sub, mul, imul, div, idiv. imul is used even for unsigned multiplication when you only need the low 64 bits - the hardware doesn’t distinguish signed from unsigned for the low half.
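The claim about the low half can be checked in C. This sketch uses 32-bit operands with 64-bit products as a stand-in for the 64-bit case, so that nothing overflows. The function names are illustrative, and the uint32_t to int32_t conversion assumes two's complement, which holds on the compilers discussed here.

```c
#include <stdint.h>

/* Low 32 bits of the product, operands treated as unsigned. */
uint32_t mul_low_unsigned(uint32_t a, uint32_t b) {
    return (uint32_t)((uint64_t)a * (uint64_t)b);
}

/* Low 32 bits of the product, same bit patterns treated as signed. */
uint32_t mul_low_signed(uint32_t a, uint32_t b) {
    int64_t product = (int64_t)(int32_t)a * (int64_t)(int32_t)b;
    return (uint32_t)(uint64_t)product;
}

/* The high halves, by contrast, depend on signedness. */
uint32_t mul_high_unsigned(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}
uint32_t mul_high_signed(uint32_t a, uint32_t b) {
    int64_t product = (int64_t)(int32_t)a * (int64_t)(int32_t)b;
    return (uint32_t)((uint64_t)product >> 32);
}
```

With a = 0xFFFFFFFF (which is -1 when read as signed) and b = 3, both low halves are 0xFFFFFFFD, while the high halves are 2 and 0xFFFFFFFF.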

lea rax, [rdi + rdi*4] computes an address arithmetic expression (rdi + rdi*4 = 5*rdi) and stores the result in rax without touching memory. Compilers use lea for multiplication by small constants because a simple lea has lower latency than imul and can issue on more execution ports on most Intel microarchitectures.
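In C terms, the expressions a compiler can map onto this form look like the following. The function names are illustrative, and whether a given compiler actually emits lea depends on its version, target, and flags.

```c
#include <stdint.h>

/* x * 5 written as base + index*scale, the shape lea rax, [rdi + rdi*4] encodes. */
uint64_t times5(uint64_t x) { return x + x * 4; }

/* The scale factor is limited to 1, 2, 4, or 8, so base + index*scale
   with the same register yields the multipliers 3, 5, and 9. */
uint64_t times3(uint64_t x) { return x + x * 2; }
uint64_t times9(uint64_t x) { return x + x * 8; }
```

Writing x * 5 directly is just as good in practice. The compiler performs this rewrite itself when it judges it profitable.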

Logic. and, or, xor, not, shl (shift left), shr (shift right logical), sar (shift right arithmetic, preserves sign bit). xor rax, rax zeroes a register in one instruction with fewer bytes than mov rax, 0. Compilers usually emit the 32-bit form, xor eax, eax, which is shorter still and clears all of rax because 32-bit writes zero-extend. Compiler output is full of it.
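The difference between shr and sar is easiest to see on a negative value. The sketch below builds both from unsigned operations so that it does not depend on how a particular C compiler shifts negative signed integers. The names are made up, and n is assumed to be between 0 and 63.

```c
#include <stdint.h>

/* shr: logical right shift, vacated high bits are filled with zero. */
uint64_t shr64(uint64_t x, unsigned n) { return x >> n; }

/* sar: arithmetic right shift, vacated high bits are filled with
   copies of the sign bit. Assumes 0 <= n <= 63. */
uint64_t sar64(uint64_t x, unsigned n) {
    uint64_t sign_fill = (x >> 63) ? ~(~UINT64_C(0) >> n) : 0;
    return (x >> n) | sign_fill;
}

/* A value xor-ed with itself is always zero, which is why
   xor reg, reg clears a register. */
uint64_t xor_self(uint64_t x) { return x ^ x; }
```

For the bit pattern of -16 shifted right by 2, sar64 yields the bit pattern of -4, while shr64 yields a large positive number.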

Control flow. jmp label unconditionally transfers execution by loading label into rip. Conditional jumps - je (jump if equal / ZF=1), jne, jl (jump if less / SF≠OF), jg, jge, jle, jb (below, for unsigned), ja (above, for unsigned) - transfer execution only if the named condition holds in rflags.

The typical sequence for an if (a == b) branch:

cmp rax, rbx   ; sets ZF if rax == rbx
je  equal      ; jumps to 'equal' label if ZF is set
; else branch here
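To make the flag conditions concrete, here is a small C model of cmp and three of the conditional jumps. The Flags struct and function names are assumptions for this sketch; real hardware sets more flags than these four.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool zf, cf, sf, of; } Flags;

/* Model of cmp a, b: compute a - b, discard the result, keep the flags. */
Flags cmp64(uint64_t a, uint64_t b) {
    uint64_t r = a - b;
    Flags f;
    f.zf = (r == 0);                 /* result is zero          */
    f.cf = (a < b);                  /* unsigned borrow         */
    f.sf = (r >> 63) != 0;           /* sign bit of the result  */
    /* Signed overflow: operands differ in sign and the result's
       sign differs from a's sign. */
    f.of = (((a ^ b) & (a ^ r)) >> 63) != 0;
    return f;
}

bool je(Flags f) { return f.zf; }            /* equal                */
bool jl(Flags f) { return f.sf != f.of; }    /* signed less-than     */
bool jb(Flags f) { return f.cf; }            /* unsigned below       */
```

Comparing the all-ones pattern against 1 shows why both families of jumps exist: jl is taken (as signed, -1 < 1) while jb is not (as unsigned, the all-ones value is the maximum).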

call target pushes the return address (the address of the instruction after the call) onto the stack, then jumps to target. ret pops that address and jumps to it. This is the entire mechanism behind function calls - no magic, just two instructions and a convention about what’s on the stack.

The System V AMD64 Calling Convention

On x86-64, Linux and macOS agree on how function arguments are passed and returned. This agreement is the ABI (Application Binary Interface), and it is what allows code compiled by different compilers to interoperate.

Integer and pointer arguments, in order: rdi, rsi, rdx, rcx, r8, r9. Arguments beyond the sixth go on the stack. Floating-point arguments: xmm0 - xmm7. Return value: rax for integers/pointers, xmm0 for floating-point.
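Two ordinary C functions, annotated with where each argument would travel under these rules. The function names are illustrative. The register assignments in the comments follow the convention described above and apply only when the code is compiled for a System V AMD64 target.

```c
#include <stdint.h>

/* a..f arrive in rdi, rsi, rdx, rcx, r8, r9; g and h are passed on
   the stack. The result is returned in rax. */
int64_t sum8(int64_t a, int64_t b, int64_t c, int64_t d,
             int64_t e, int64_t f, int64_t g, int64_t h) {
    return a + b + c + d + e + f + g + h;
}

/* Integer and floating-point arguments are counted separately:
   n arrives in rdi, x in xmm0, m in rsi, y in xmm1.
   The double result is returned in xmm0. */
double mixed(int64_t n, double x, int64_t m, double y) {
    return (double)n * x + (double)m * y;
}
```

Compiling these with -O2 -S and reading the output is a quick way to confirm the assignments on your own toolchain.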

Registers are split into two classes:

Caller-saved (volatile): rax, rcx, rdx, rsi, rdi, r8, r9, r10, r11, xmm0 - xmm15. A function may clobber these freely; the caller must save them before the call if it still needs their values afterward.
Callee-saved (non-volatile): rbx, rbp, r12 - r15. If a function uses these registers, it must save them (push to stack on entry) and restore them (pop from stack on exit) before returning.

The distinction is a performance optimization: most functions don’t need to save many registers, and most callers don’t need their caller-saved registers preserved across calls. The ABI minimizes unnecessary save/restore work in the common case.

Stack Frames in Assembly

When a function is called, the call instruction pushes the return address. The function then sets up a stack frame - an area of stack memory reserved for its local variables and saved registers:

; Function prologue:
push rbp          ; save caller's base pointer
mov  rbp, rsp     ; anchor: rbp now points to the saved rbp
sub  rsp, 32      ; reserve 32 bytes for local variables

; Local variables live at negative offsets from rbp:
; [rbp - 8]  = first local variable
; [rbp - 16] = second local variable

; Incoming arguments (beyond the first six) are at positive offsets:
; [rbp + 16] = seventh argument (first six came in registers)
; [rbp + 8]  = return address (pushed by call)
; [rbp + 0]  = saved rbp (pushed by push rbp)

; Function epilogue:
mov  rsp, rbp     ; restore stack pointer (deallocate locals)
pop  rbp          ; restore caller's base pointer
ret               ; pop return address into rip
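The offsets in the prologue comments follow from simple arithmetic on rsp. The sketch below works them out for a given entry value of rsp; the FrameLayout struct and frame_layout function are hypothetical names, and the addresses are plain numbers, not real pointers.

```c
#include <stdint.h>

typedef struct {
    uint64_t rbp;           /* frame anchor after mov rbp, rsp      */
    uint64_t rsp;           /* stack pointer after sub rsp, N       */
    uint64_t saved_rbp_at;  /* [rbp + 0]                            */
    uint64_t ret_addr_at;   /* [rbp + 8]                            */
    uint64_t arg7_at;       /* [rbp + 16]                           */
    uint64_t local1_at;     /* [rbp - 8]                            */
    uint64_t local2_at;     /* [rbp - 16]                           */
} FrameLayout;

/* entry_rsp is rsp at the function's first instruction, so it points
   at the return address that call just pushed. */
FrameLayout frame_layout(uint64_t entry_rsp, uint64_t locals_bytes) {
    FrameLayout f;
    f.rbp          = entry_rsp - 8;          /* push rbp; mov rbp, rsp */
    f.rsp          = f.rbp - locals_bytes;   /* sub rsp, locals_bytes  */
    f.saved_rbp_at = f.rbp;
    f.ret_addr_at  = f.rbp + 8;
    f.arg7_at      = f.rbp + 16;
    f.local1_at    = f.rbp - 8;
    f.local2_at    = f.rbp - 16;
    return f;
}
```

Note that ret_addr_at always equals entry_rsp: the frame is built entirely below the return address.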

With optimization (-O1 or higher), compilers often omit the frame pointer (rbp) entirely - the frame pointer elimination optimization allows rbp to be used as a general-purpose register, freeing one extra register for computation. The stack is still correct (the compiler tracks offsets relative to rsp instead), but rbp-based stack unwinding no longer works, which is why debuggers and profilers sometimes need frame pointers to attribute call stacks correctly.

Disassembling a Real Function

Take this C function:

int add_two(int a, int b) {
    return a + b;
}

Compiled unoptimized (-O0), disassembled with objdump -d -M intel:

add_two:
  push   rbp
  mov    rbp, rsp
  mov    DWORD PTR [rbp-0x4], edi   ; spill arg a to stack
  mov    DWORD PTR [rbp-0x8], esi   ; spill arg b to stack
  mov    edx, DWORD PTR [rbp-0x4]  ; reload a
  mov    eax, DWORD PTR [rbp-0x8]  ; reload b
  add    eax, edx                   ; a + b → eax (return value)
  pop    rbp
  ret

With -O2:

add_two:
  lea    eax, [rdi + rsi]   ; compute a + b using LEA arithmetic
  ret

The optimized version uses lea (load effective address) to compute the sum without loading or storing anything - edi (the low 32 bits of rdi) holds a, esi holds b, and lea eax, [rdi + rsi] computes rdi + rsi and keeps the low 32 bits in eax. The entire function is 2 instructions. The unoptimized version is 9 instructions, including 4 explicit loads and stores that spill the arguments and reload them, plus a push and pop of rbp - all redundant.

This is what optimization means at the assembly level: eliminating unnecessary memory accesses, choosing lower-latency instructions, removing redundant stack operations. Reading the difference between -O0 and -O2 output teaches you what the compiler considers “unnecessary” - and therefore what you should avoid writing in the first place.

AT&T Syntax vs Intel Syntax

Two syntaxes exist for x86 assembly, and they are confusingly opposite:

Intel syntax (used by NASM, Intel’s documentation, MASM): mov rax, rbx means “copy rbx into rax.” Destination is first. No register prefixes, no immediate prefixes.
AT&T syntax (used by GAS, the GNU assembler; default for objdump): mov %rbx, %rax - source is first, registers are prefixed with %, immediates with $. Memory operands use (%rax) rather than [rax].

When reading objdump -d output on Linux, you see AT&T syntax by default. Pass -M intel to get Intel syntax: objdump -d -M intel. For reading assembly from papers, Intel documentation, or Windows tooling, you’ll see Intel syntax. Most engineers find Intel syntax more readable. The underlying instructions are identical.

How Compilers Use Assembly

You don’t need to write assembly to benefit from understanding it. The workflow is:

Write C or C++.
Compile with gcc -O3 -S -masm=intel myfile.c to emit assembly.
Read the output to understand what the compiler actually produced.

This reveals: whether the compiler vectorized a loop (you’ll see ymm register usage), whether a function was inlined (no call instruction), whether a branch was transformed into a conditional move (cmov), and whether the compiler created an unrolled loop (repeated instruction blocks without a branch).
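Two small functions that are useful test subjects for this workflow. The names are illustrative. What a compiler does with them depends on its version, target, and flags, so treat the comments as things to look for, not guarantees.

```c
#include <stddef.h>

/* A branch with no side effects on either arm. At -O2, compilers
   commonly emit cmp + cmov here instead of a conditional jump. */
int max_int(int a, int b) { return (a > b) ? a : b; }

/* Independent iterations over contiguous arrays: a typical candidate
   for auto-vectorization. restrict promises the arrays do not overlap,
   which removes one obstacle to vectorizing. */
void add_arrays(float *restrict out, const float *restrict x,
                const float *restrict y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = x[i] + y[i];
    }
}
```

Compile the file with the gcc command from the steps above, then search the output for cmov in max_int and for xmm or ymm registers in add_arrays.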

Compiler Explorer (godbolt.org) provides an interactive version of this: paste C code, see assembly for any compiler and optimization level in real time. It is the best way to develop intuition for compiler behavior.

SIMD intrinsics live at the boundary between C and assembly. They are C functions with names like _mm256_add_ps that map almost one-to-one onto SIMD instructions. Writing intrinsics is essentially writing assembly with the syntactic sugar of C: you choose the instructions, and the compiler handles register allocation and scheduling. The generated assembly is largely predictable across compilers. This is the level of abstraction that high-performance numerical code (BLAS, signal processing libraries, video codecs) operates at.
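A minimal sketch of what intrinsics code looks like. It deliberately uses the 128-bit SSE intrinsic _mm_add_ps (4 floats per instruction) instead of the 256-bit _mm256_add_ps named above, because SSE is baseline on x86-64 and needs no extra compiler flags. The add_floats name and the scalar fallback for non-SSE targets are choices made for this example.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Adds n floats, four at a time where SSE is available. */
void add_floats(float *out, const float *x, const float *y, size_t n) {
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 a = _mm_loadu_ps(x + i);            /* unaligned load, 4 floats */
        __m128 b = _mm_loadu_ps(y + i);
        _mm_storeu_ps(out + i, _mm_add_ps(a, b));  /* 4 lanes added at once    */
    }
#endif
    /* Scalar tail, or the whole loop when SSE is not available. */
    for (; i < n; i++) {
        out[i] = x[i] + y[i];
    }
}
```

The AVX version has the same shape with __m256, _mm256_loadu_ps, _mm256_add_ps, and _mm256_storeu_ps, a stride of 8, and a flag such as -mavx at compile time.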

Assembly in Security: Why Attackers Care

Security researchers read assembly because vulnerabilities are properties of machine code, not source code. A buffer overflow is: data written past the end of a stack buffer overwrites the saved return address. If you know the stack layout (readable from the assembly), you can predict exactly how many bytes to write to reach the return address and what address to put there.
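The byte counting can be illustrated with a harmless simulation. Here a 32-byte array plays the role of a stack frame with an assumed layout (16-byte buffer, then saved rbp, then return address). All names and offsets are hypothetical, and every write stays inside the array, so nothing is actually corrupted.

```c
#include <stdint.h>
#include <string.h>

/* Assumed layout: [0,16) buffer, [16,24) saved rbp, [24,32) return address. */
enum { SAVED_RBP_OFF = 16, RET_ADDR_OFF = 24, FRAME_SIZE = 32 };

typedef struct { uint8_t bytes[FRAME_SIZE]; } FrameImage;

void frame_image_init(FrameImage *f, uint64_t saved_rbp, uint64_t ret_addr) {
    memset(f->bytes, 0, FRAME_SIZE);
    memcpy(f->bytes + SAVED_RBP_OFF, &saved_rbp, 8);
    memcpy(f->bytes + RET_ADDR_OFF, &ret_addr, 8);
}

uint64_t frame_image_ret_addr(const FrameImage *f) {
    uint64_t v;
    memcpy(&v, f->bytes + RET_ADDR_OFF, 8);
    return v;
}

/* Models an unchecked copy into the buffer. len is clamped to FRAME_SIZE
   so the simulation itself never writes outside its own array. */
void unchecked_copy(FrameImage *f, const uint8_t *src, size_t len) {
    if (len > FRAME_SIZE) len = FRAME_SIZE;
    memcpy(f->bytes, src, len);
}
```

In this layout, a 16-byte input leaves the return address alone, while a 32-byte input whose last 8 bytes hold a chosen value replaces it: 24 bytes of padding, then the new address.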

Return-Oriented Programming (ROP) is an exploitation technique that uses sequences of existing assembly instructions - “gadgets” - already present in the program or its libraries. A gadget is a short sequence ending in ret. By chaining gadgets (each ret transfers control to the next gadget), an attacker constructs arbitrary computation from existing code without injecting new instructions. Finding gadgets requires reading assembly across the entire binary.
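The chaining idea can be sketched abstractly in C. In this toy model each gadget is a function standing in for a short instruction sequence ending in ret, and an array of function pointers stands in for the attacker-controlled stack. The gadgets, the Regs struct, and run_chain are all invented for illustration and do not exploit anything.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t rax, rdi; } Regs;
typedef void (*Gadget)(Regs *);

/* Hypothetical gadgets, each modeling "<instruction> ; ret". */
void gadget_load_rdi_7(Regs *r)  { r->rdi = 7; }        /* mov rdi, 7   ; ret */
void gadget_mov_rax_rdi(Regs *r) { r->rax = r->rdi; }   /* mov rax, rdi ; ret */
void gadget_add_rax_rax(Regs *r) { r->rax += r->rax; }  /* add rax, rax ; ret */

/* The chain plays the role of the stack contents: each ret pops the
   next entry and transfers control to it. */
void run_chain(const Gadget *chain, size_t n, Regs *r) {
    for (size_t sp = 0; sp < n; sp++) {
        chain[sp](r);
    }
}
```

Chaining load, mov, add, add computes 7 * 4 = 28 in rax without any code written for that purpose. That is the point of the technique: the computation lives in the order of the addresses, not in new instructions.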

Buffer overflow mitigations (stack canaries, ASLR, NX/DEP) are also expressed in assembly: stack canaries insert a random value between local variables and the saved return address and check it before ret; address space layout randomization randomizes the base address of the binary and libraries so gadget addresses are unpredictable; NX marks stack memory non-executable so injected shellcode can’t run. ROP is specifically designed to bypass NX - it doesn’t inject code, it reuses existing code.
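A stack canary check can be modeled the same way. The layout, names, and fixed canary value below are assumptions for this sketch. Real implementations use a random per-process value and compiler-inserted checks.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: [0,16) buffer, [16,24) canary,
   [24,32) saved rbp, [32,40) return address. */
enum { GF_CANARY_OFF = 16, GF_SIZE = 40 };

typedef struct {
    uint8_t  bytes[GF_SIZE];
    uint64_t reference;      /* copy of the canary kept outside the frame */
} GuardedFrame;

/* Prologue: place the canary between the locals and the saved state. */
void guarded_enter(GuardedFrame *f, uint64_t canary) {
    memset(f->bytes, 0, GF_SIZE);
    memcpy(f->bytes + GF_CANARY_OFF, &canary, 8);
    f->reference = canary;
}

/* Models an unchecked copy; len is clamped so the simulation stays in bounds. */
void guarded_write(GuardedFrame *f, const uint8_t *src, size_t len) {
    if (len > GF_SIZE) len = GF_SIZE;
    memcpy(f->bytes, src, len);
}

/* Epilogue check performed before ret: is the canary unchanged? */
bool guarded_canary_intact(const GuardedFrame *f) {
    uint64_t current;
    memcpy(&current, f->bytes + GF_CANARY_OFF, 8);
    return current == f->reference;
}
```

Because the canary sits below the saved return address, a linear overflow from the buffer has to overwrite it first, so the epilogue check catches the overflow before ret uses the corrupted address.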

Understanding these exploits requires reading assembly. Security researchers who don’t read assembly are working blind.

The Practical Case for Assembly Literacy

Almost nobody writes assembly manually anymore. The cases where you would: hand-optimized inner loops in CPU-specific performance libraries (the AVX-512 kernels in Intel MKL, which NumPy can use as a backend), embedded systems where code size constraints are severe, and bootstrapping stages of operating systems and firmware where no C runtime exists yet.

But reading assembly is a different skill with a much broader application. You need it when:

The profiler says a function is slow and you want to know why.
A core dump shows a crash and you need to understand the stack frame.
The debugger stops at an address with no source line and you need to figure out what’s happening.
You want to verify that the compiler vectorized a critical loop.
A fuzzer finds a vulnerability and you need to understand the exploit primitive.

Assembly is not a specialization. It is the language all programs eventually speak, and being able to read it turns hardware behavior from mysterious into legible.

Summary

Concept	x86-64 Detail	Why It Matters
Argument passing	`rdi`, `rsi`, `rdx`, `rcx`, `r8`, `r9` for first 6 ints	Arguments beyond the sixth go through stack memory
Return value	`rax` (int/ptr), `xmm0` (float/double)	Understanding ABI for FFI and debugging
Caller-saved regs	`rax`, `rcx`, `rdx`, `rsi`, `rdi`, `r8 - r11`, `xmm0 - 15`	Function can clobber these; caller must save if needed
Callee-saved regs	`rbx`, `rbp`, `r12 - r15`	Function must save/restore these
Stack frame	`rbp` anchors frame; locals at `[rbp-N]`	Stack frame layout for debugging, exploits
Frame elimination	`-O1` and higher may omit the `rbp` frame pointer	May break stack unwinding in profilers
AT&T vs Intel	Source-first vs dest-first	Know which you’re reading
`lea` for arithmetic	Computes address math without memory access	Compiler prefers this over `imul` for small constants
