Assembly language is the closest thing to direct conversation with a CPU. Every high-level line of code you write eventually collapses into a sequence of bytes - machine code - that the processor fetches, decodes, and executes. Understanding that translation is what separates engineers who debug by intuition from those who debug by evidence.

Machine Code and Assembly

Machine code is raw bytes. The CPU’s decode stage interprets those bytes as instructions: an opcode that names the operation, and operands that name the data. Assembly is machine code with the opcodes replaced by human-readable mnemonics - mov, add, jmp - and addresses replaced by labels. The correspondence is essentially one-to-one: each assembly instruction encodes to exactly one machine instruction, assembler macros and pseudo-instructions aside.

Assemblers like NASM and GAS (GNU Assembler) translate assembly text into object files. Disassemblers like objdump -d do the reverse, turning object files back into assembly so you can read what the compiler produced.

x86-64 Registers

x86-64 gives you a set of named storage locations inside the CPU itself - registers - that are far faster than any memory access.

General-purpose registers: rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, and r8 through r15. Each holds 64 bits. The same physical register is accessible at different widths: rax (64-bit), eax (low 32), ax (low 16), al (low 8).
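
The width aliasing can be modelled in portable C with casts and masks. This is only a sketch: the helper names are made up for illustration, and the write rules it encodes (a 32-bit write zero-extends into the full register, while 16-bit and 8-bit writes leave the upper bits alone) are the architectural behaviour of x86-64, not something the C language does by itself.

```c
#include <stdint.h>

/* Read views of a 64-bit register value (hypothetical helper names). */
uint32_t read_eax(uint64_t rax) { return (uint32_t)rax; }  /* low 32 bits */
uint16_t read_ax(uint64_t rax)  { return (uint16_t)rax; }  /* low 16 bits */
uint8_t  read_al(uint64_t rax)  { return (uint8_t)rax; }   /* low 8 bits  */

/* Writing eax zero-extends: the upper 32 bits of rax become 0,
   so the old value of rax does not influence the result. */
uint64_t write_eax(uint64_t rax, uint32_t v) {
    (void)rax;
    return (uint64_t)v;
}

/* Writing ax or al replaces only the low bits and keeps the rest. */
uint64_t write_ax(uint64_t rax, uint16_t v) {
    return (rax & ~UINT64_C(0xFFFF)) | v;
}

uint64_t write_al(uint64_t rax, uint8_t v) {
    return (rax & ~UINT64_C(0xFF)) | v;
}
```

The asymmetry between write_eax and write_ax is the part that most often surprises people reading disassembly: a mov into eax is enough to clear the whole of rax.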

Special-purpose registers: rip is the instruction pointer - it always points to the next instruction to execute. The flags register (rflags) stores condition bits (zero, carry, sign, overflow) that control conditional jumps.

rsp is the stack pointer, always pointing to the top of the call stack. rbp is the base pointer, used to anchor a function’s stack frame.

Instruction Categories

Data movement - mov dst, src copies a value. push rax decrements rsp by 8 and writes rax to memory at the new rsp. pop rbx reads 8 bytes from memory at rsp into rbx and then increments rsp by 8.

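Arithmetic - add rax, rbx adds rbx into rax, and sub follows the same two-operand pattern. mul and div/idiv are different: they take a single explicit operand and use rax and rdx implicitly. mul rbx computes the 128-bit product rdx:rax = rax × rbx; div rbx divides rdx:rax by rbx, leaving the quotient in rax and the remainder in rdx. imul (signed multiply) also has two- and three-operand forms, and these are commonly used even for unsigned multiplication because they give the same low-64-bit result.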
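
The claim that signed and unsigned multiplication agree in their low bits can be checked in portable C. The sketch below uses 32-bit operands so that the full 64-bit product is representable without overflow; the same argument applies to 64-bit operands and a 128-bit product. The function names are illustrative only.

```c
#include <stdint.h>

/* Low 32 bits of the product, operands treated as unsigned. */
uint32_t low32_unsigned_product(uint32_t a, uint32_t b) {
    return (uint32_t)((uint64_t)a * (uint64_t)b);
}

/* Low 32 bits of the product, the same bit patterns treated as signed.
   The uint32_t -> int32_t conversion is implementation-defined for large
   values; GCC and Clang reinterpret the bits as two's complement. */
uint32_t low32_signed_product(uint32_t a, uint32_t b) {
    int64_t wide = (int64_t)(int32_t)a * (int64_t)(int32_t)b;
    return (uint32_t)(uint64_t)wide;
}

/* The upper halves of the full-width products do differ, which is why
   mul and imul exist as separate instructions. */
uint32_t high32_unsigned_product(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}

uint32_t high32_signed_product(uint32_t a, uint32_t b) {
    int64_t wide = (int64_t)(int32_t)a * (int64_t)(int32_t)b;
    return (uint32_t)((uint64_t)wide >> 32);
}
```

With a = 0xFFFFFFFF and b = 2, both low-half functions return 0xFFFFFFFE, while the high halves are 1 (unsigned) and 0xFFFFFFFF (signed).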

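Logic - and, or, xor, not operate bitwise. xor eax, eax is the canonical way to zero a register: it encodes in 2 bytes versus 5 for mov eax, 0, and because a write to a 32-bit register zero-extends into the upper half, it clears all of rax. (xor rax, rax also works but needs an extra prefix byte.) Modern x86 cores recognise the pattern as a zeroing idiom, so it is at least as fast as the mov. Unlike mov, it modifies the flags.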

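Control flow - jmp label unconditionally transfers execution. Conditional jumps like je (jump if equal), jne, jl (jump if less), jg (jump if greater) read the flags register, which was set by the most recent flag-setting instruction, typically cmp, test, or an arithmetic instruction. jl and jg treat the compared values as signed; jb (below) and ja (above) are the unsigned counterparts. call target pushes the return address then jumps; ret pops it and jumps back.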
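
To make the link between cmp and the conditional jumps concrete, here is a small C model of the four condition bits. It is a sketch under stated assumptions: the struct and function names are hypothetical, and only the 64-bit form of cmp is modelled. cmp a, b computes a - b, discards the result, and keeps the flags.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool zf;  /* zero: result was 0                  */
    bool sf;  /* sign: top bit of the result         */
    bool cf;  /* carry: unsigned borrow occurred     */
    bool of;  /* overflow: signed result overflowed  */
} Flags;

/* Flags produced by a 64-bit "cmp a, b". */
Flags cmp64(uint64_t a, uint64_t b) {
    uint64_t r = a - b;
    Flags f;
    f.zf = (r == 0);
    f.sf = (r >> 63) & 1;
    f.cf = (a < b);
    /* Signed overflow: operands had different signs and the result's
       sign differs from the first operand's sign. */
    f.of = (((a ^ b) & (a ^ r)) >> 63) & 1;
    return f;
}

/* Conditions tested by the jumps named above. */
bool takes_je(Flags f) { return f.zf; }
bool takes_jl(Flags f) { return f.sf != f.of; }           /* signed less    */
bool takes_jg(Flags f) { return !f.zf && f.sf == f.of; }  /* signed greater */
```

Comparing the bit pattern of -1 against 1 takes jl because the values are read as signed, even though the same bits compare as "above" when read as unsigned (cf is clear).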

The System V AMD64 Calling Convention

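Linux and macOS on x86-64 follow the System V AMD64 ABI. When you call a function, the first six integer or pointer arguments go in rdi, rsi, rdx, rcx, r8, r9 in that order. Floating-point arguments go in xmm0 through xmm7. Any further arguments are passed on the stack. Integer and pointer return values come back in rax; floating-point return values come back in xmm0.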

Callee-saved (non-volatile) registers: rbx, rbp, and r12 through r15. If a function uses these registers, it must save and restore them - typically by pushing them on entry and popping on exit. rsp must likewise hold its original value when the function returns. All other registers are caller-saved: a function may clobber them freely.
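
As an illustration, the comments in the following C sketch annotate where each argument would travel. The function names are hypothetical, and the register assignments hold only when the code is compiled for the System V AMD64 ABI and the call is not inlined away.

```c
/*        rdi     rsi     rdx     rcx     r8      r9      stack  */
long sum7(long a, long b, long c, long d, long e, long f, long g) {
    /* The seventh integer argument does not fit in a register,
       so the caller passes it on the stack. */
    return a + b + c + d + e + f + g;   /* result returned in rax */
}

/* Integer and floating-point arguments are numbered independently:
   x is the first floating-point argument (xmm0) and n is the first
   integer argument (rdi). A double result is returned in xmm0. */
double scale(double x, long n) {
    return x * (double)n;
}
```

Disassembling a caller of sum7 built without inlining is a quick way to see the six register loads followed by a stack store or push for the seventh argument.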

Stack Frames

When a function is called, it typically sets up a stack frame:

push rbp          ; save caller's base pointer
mov  rbp, rsp     ; set base pointer to current stack top
sub  rsp, 32      ; allocate 32 bytes of local space

Local variables live at negative offsets from rbp ([rbp - 8], [rbp - 16], …). Function arguments beyond the sixth live at positive offsets ([rbp + 16], …). On exit:

mov  rsp, rbp     ; restore stack pointer
pop  rbp          ; restore caller's base pointer
ret               ; return to caller

call pushes the return address (8 bytes), so in a function that uses this prologue [rbp + 8] is the return address and [rbp] is the saved rbp.
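
The frame layout can be traced with a toy model in C. This is a minimal sketch, not an emulator: the ToyCpu type and the toy_ functions are invented for illustration, addresses are just offsets into a small byte array, and only the stack-pointer bookkeeping of call, the prologue, the epilogue, and ret is modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { TOY_STACK_BYTES = 256 };

typedef struct {
    uint8_t  mem[TOY_STACK_BYTES];  /* toy stack memory */
    uint64_t rsp, rbp, rip;
} ToyCpu;

/* The stack starts empty, with rsp just past the highest address. */
void toy_init(ToyCpu *c) {
    memset(c, 0, sizeof *c);
    c->rsp = TOY_STACK_BYTES;
}

uint64_t toy_load(const ToyCpu *c, uint64_t addr) {
    uint64_t v;
    assert(addr + 8 <= TOY_STACK_BYTES);
    memcpy(&v, &c->mem[addr], sizeof v);
    return v;
}

/* push: decrement rsp by 8, then store at the new rsp. */
void toy_push(ToyCpu *c, uint64_t v) {
    assert(c->rsp >= 8);
    c->rsp -= 8;
    memcpy(&c->mem[c->rsp], &v, sizeof v);
}

/* pop: load from rsp, then increment rsp by 8. */
uint64_t toy_pop(ToyCpu *c) {
    uint64_t v = toy_load(c, c->rsp);
    c->rsp += 8;
    return v;
}

/* call: push the return address, then jump. */
void toy_call(ToyCpu *c, uint64_t target, uint64_t return_addr) {
    toy_push(c, return_addr);
    c->rip = target;
}

/* ret: pop the return address into rip. */
void toy_ret(ToyCpu *c) { c->rip = toy_pop(c); }

/* push rbp; mov rbp, rsp; sub rsp, local_bytes */
void toy_prologue(ToyCpu *c, uint64_t local_bytes) {
    toy_push(c, c->rbp);
    c->rbp = c->rsp;
    c->rsp -= local_bytes;
}

/* mov rsp, rbp; pop rbp */
void toy_epilogue(ToyCpu *c) {
    c->rsp = c->rbp;
    c->rbp = toy_pop(c);
}
```

After toy_call followed by toy_prologue(c, 32), toy_load(c, c->rbp + 8) yields the return address and toy_load(c, c->rbp) yields the caller's rbp; running toy_epilogue and toy_ret restores rsp and rbp to their values before the call.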

AT&T vs Intel Syntax

Two syntaxes exist for x86 assembly. Intel syntax (used by NASM): mov rax, rbx means “copy rbx into rax” - destination first. AT&T syntax (the default in GAS and objdump): mov %rbx, %rax - source first, registers prefixed with %, immediates with $. Memory operands differ too: Intel writes [rbp - 4] where AT&T writes -0x4(%rbp). When reading objdump -d output, you will see AT&T syntax unless you pass -M intel.

Reading objdump Output

0000000000401140 <add_two>:
  401140: 55                   push   %rbp
  401141: 48 89 e5             mov    %rsp,%rbp
  401144: 89 7d fc             mov    %edi,-0x4(%rbp)
  401147: 89 75 f8             mov    %esi,-0x8(%rbp)
  40114a: 8b 55 fc             mov    -0x4(%rbp),%edx
  40114d: 8b 45 f8             mov    -0x8(%rbp),%eax
  401150: 01 d0                add    %edx,%eax
  401152: 5d                   pop    %rbp
  401153: c3                   ret

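The leftmost column is the virtual address of each instruction, the middle column is its raw bytes, and the right column is the mnemonic with its operands. Notice that edi and esi are the low 32 bits of rdi and rsi - the arguments come in as 32-bit int values. This is unoptimised code: it stores both arguments to the stack and immediately reloads them. There is no sub rsp because a leaf function may use the 128-byte red zone below rsp without moving the stack pointer.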
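
A listing of this shape is what an unoptimised build of a two-argument addition function tends to produce. The C source below is a plausible origin for it, offered as an assumption and not as the exact file behind the listing; the file name is hypothetical, and addresses and bytes will vary with compiler, version, and how the binary is linked.

```c
/* Hypothetical file add_two.c. One way to reproduce a similar listing
   with GCC and binutils:

     gcc -O0 -c add_two.c -o add_two.o
     objdump -d add_two.o            (AT&T syntax)
     objdump -d -M intel add_two.o   (Intel syntax)

   In an unlinked object file the function starts at offset 0; an
   address such as 401140 appears only after linking an executable. */
int add_two(int a, int b) {
    return a + b;
}
```

Rebuilding with -O2 and disassembling again is a useful comparison: the stack stores and reloads usually disappear.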

Inline Assembly in C

For rare cases where you need one specific instruction, GCC lets you embed assembly directly:

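int result;                      /* receives eax after cpuid */
__asm__("cpuid"
    : "=a"(result)               /* output: eax -> result */
    : "a"(1)                     /* input: eax = 1 selects leaf 1 */
    : "rbx", "rcx", "rdx");      /* cpuid also overwrites these registers */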

The constraint syntax ("=a", "a") tells the compiler which registers to bind to C variables. This is difficult to use correctly and should be a last resort; intrinsics are almost always preferable.
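
For this particular instruction, GCC and Clang ship a helper header, <cpuid.h>, whose __get_cpuid function wraps the constraint details. The sketch below assumes one of those compilers; the wrapper name read_cpuid_leaf1 is hypothetical, and the preprocessor guard makes the function simply report failure on targets that are not x86.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header; not part of standard C */
#endif

/* Stores the eax value of CPUID leaf 1 in *eax_out.
   Returns 1 on success, 0 if the leaf is unavailable or the
   target is not x86. */
int read_cpuid_leaf1(unsigned int *eax_out) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    *eax_out = eax;
    return 1;
#else
    (void)eax_out;
    return 0;
#endif
}
```

The caller never names a register or a clobber, so there is no constraint list to get wrong.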

Examples

Writing a function in NASM. A simple addition function:

section .text
global add_two

add_two:                ; int add_two(int a, int b)
    mov eax, edi        ; return value = first arg
    add eax, esi        ; + second arg
    ret

Arguments arrive in edi and esi (the low 32 bits of rdi and rsi, since both parameters are 32-bit ints). The sum goes into eax, which is the low half of the 64-bit rax return register.

Understanding call and ret. Before call:

rsp → [ ... other stack data ... ]

After call add_two:

rsp → [ return address (8 bytes) ]
      [ ... other stack data ... ]

ret pops that 8-byte return address back into rip, resuming the caller at exactly the instruction after the call. The stack is returned to its pre-call state.

Reading a disassembly to spot a compiler optimisation. Compile return x * 2 at -O2 and you will typically see lea eax, [rdi+rdi] rather than imul eax, edi, 2. The compiler chose an address-computation instruction because it has lower latency than a multiply on most microarchitectures and can write its result to a different register than its inputs - something invisible at the C level but immediately obvious in the disassembly.
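
A small file for trying this out might look like the following. The file and function names are hypothetical, and the instructions noted in the comments are what an optimising compiler commonly emits, not a guarantee; the output depends on the compiler, its version, and the target.

```c
/* Hypothetical file scale.c. One way to view the generated assembly
   in Intel syntax with GCC:

     gcc -O2 -S -masm=intel scale.c -o scale.s
*/
int times_two(int x)  { return x * 2; }   /* often: lea eax, [rdi+rdi]   */
int times_five(int x) { return x * 5; }   /* often: lea eax, [rdi+rdi*4] */
```

The second function shows why lea is attractive for small constant multipliers: its addressing mode can scale one register by 2, 4, or 8 and add another in a single instruction.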

Assembly is a lens. It lets you see exactly what the CPU will do, exactly how many bytes a branch costs, and exactly which registers carry your data. Every hour spent reading disassembly pays dividends in every performance-sensitive piece of code you write afterward.

