Memory and the Bus - RAM, ROM, and How the CPU Talks to Everything // Megha Bose

A CPU that can compute but cannot store results, read instructions, or communicate with the outside world is useless. This post covers how memory works at the hardware level, how multiple components share the same wires, and how the CPU reaches everything from RAM to keyboard to display through a unified addressing scheme.

Random Access Memory

RAM (Random Access Memory) lets you read or write any location in any order in the same amount of time - unlike sequential storage (tape, disk) where you must seek to the position first. “Random access” means any address, any time.

At the hardware level, RAM is built from storage cells. SRAM (Static RAM) uses a small circuit of cross-coupled inverters - essentially a flip-flop - per bit. Fast but expensive: six transistors per bit. Used for CPU caches.

DRAM (Dynamic RAM) stores each bit as charge in a capacitor. One transistor and one capacitor per bit - much denser - but the capacitor leaks, so the memory controller must periodically refresh it (read and rewrite every row, typically within a 64 ms window). This is also why DRAM is volatile: without power and refresh, the stored charge drains away and the contents are lost.

Address Decoding

A RAM chip with $n$ address lines can hold $2^n$ locations. To select one, you need address decoding: a circuit that takes an $n$-bit address and activates exactly one of the $2^n$ word lines.

A 2-bit address decoder selects one of 4 rows: address 00 activates row 0, 01 activates row 1, 10 activates row 2, and 11 activates row 3.

Each output is a combination of AND and NOT gates: row 10 is active when $A_1 = 1$ AND $A_0 = 0$, which is $A_1 \cdot \overline{A_0}$. For a real 32-bit address space you have $2^{32}$ possible rows - you don’t decode all at once; instead RAM chips decode in stages (row address, then column address).
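The decoder described above can be sketched in a few lines of Python. This is only an illustration: the function names (decode, decode_2bit_gates, split_row_column) are made up for this post, and real decoders are gates in silicon, not software.

```python
def decode(address: int, n_bits: int) -> list[int]:
    """Return the 2**n_bits word lines; exactly one of them is 1."""
    if not 0 <= address < 2 ** n_bits:
        raise ValueError("address does not fit in n_bits")
    return [1 if row == address else 0 for row in range(2 ** n_bits)]


def decode_2bit_gates(a1: int, a0: int) -> list[int]:
    """The same 2-bit decoder written out as AND and NOT gates."""
    not_a1, not_a0 = 1 - a1, 1 - a0
    return [
        not_a1 & not_a0,  # row 00
        not_a1 & a0,      # row 01
        a1 & not_a0,      # row 10
        a1 & a0,          # row 11
    ]


def split_row_column(address: int, column_bits: int) -> tuple[int, int]:
    """Two-stage decoding: high bits pick the row, low bits pick the column."""
    return address >> column_bits, address & ((1 << column_bits) - 1)


if __name__ == "__main__":
    for addr in range(4):
        print(f"{addr:02b} -> {decode(addr, 2)}")
```

Splitting the address means two small decoders instead of one enormous one: a 16-bit address split 8/8 needs two 256-output decoders rather than a single 65,536-output decoder.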

ROM and Non-Volatile Storage

ROM (Read-Only Memory) holds data permanently - it is not lost when power is removed. Early ROM was literally wired at manufacture time. Modern variants include:

EPROM - erasable by ultraviolet light, reprogrammable
EEPROM - electrically erasable, byte by byte
Flash memory - electrically erasable in blocks, the basis of SSDs, USB drives, and phones

Flash stores bits as charge trapped in a floating gate transistor. Writing requires applying high voltage to push charge onto the gate; erasing removes it. It can only be erased in blocks, not individual bytes - this is why your SSD has a write amplification problem: changing one byte may require erasing and rewriting an entire 512 KB block.
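To make the block-erase cost concrete, here is a minimal sketch of the worst case, assuming the 512 KB block size used above and a hypothetical helper named rewrite_one_byte. Real SSD controllers avoid much of this by remapping writes to already-erased blocks, so treat it as a model of the constraint, not of any actual firmware.

```python
BLOCK_SIZE = 512 * 1024  # bytes per erase block (the figure used in the text)


def rewrite_one_byte(block: bytearray, offset: int, value: int) -> int:
    """Change one byte under an erase-in-blocks rule.

    Returns the number of bytes physically written back to the block.
    """
    copy = bytearray(block)            # read the whole block out
    copy[offset] = value & 0xFF        # modify the single byte in RAM
    block[:] = b"\xff" * len(block)    # erase: every cell returns to all 1s
    block[:] = copy                    # program the whole block back
    return len(block)


def write_amplification(bytes_requested: int, bytes_written: int) -> float:
    """Ratio of bytes physically written to bytes the host asked to write."""
    return bytes_written / bytes_requested


if __name__ == "__main__":
    block = bytearray(BLOCK_SIZE)
    written = rewrite_one_byte(block, 10, 0x41)
    print(written, write_amplification(1, written))
```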

ROM is used for firmware - the BIOS/UEFI code that runs when a computer first powers on, before the operating system loads from disk.

The Bus

If every component had its own dedicated wires to the CPU, a modern computer would be physically impossible - you cannot route thousands of separate connections. Instead, components share a set of common wires called a bus.

A bus has three parts:

Address bus - the CPU drives a binary number onto these lines to specify which memory location or device it wants to communicate with. One-directional: CPU to memory/devices.

Data bus - carries the actual data being transferred. Bidirectional: CPU reads by receiving on these lines, writes by driving them.

Control bus - carries signals that coordinate the transaction: read/write select, bus clock, interrupt requests, bus grant (for devices that want to take over the bus).

The width of the data bus determines how many bits move per transfer. A 64-bit data bus moves 8 bytes per transfer. The address bus width determines the maximum addressable memory: with byte addressing, a 32-bit address bus can reach $2^{32}$ bytes = 4 GiB, which is why 32-bit systems have a 4 GB RAM ceiling.
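The arithmetic behind those two widths is short enough to check directly. A small sketch, assuming byte-addressable memory; the function names and the one-billion-transfers-per-second figure in the example are illustrative, not taken from any particular bus.

```python
def addressable_bytes(address_bits: int, bytes_per_location: int = 1) -> int:
    """How much memory an address bus of this width can reach."""
    return (2 ** address_bits) * bytes_per_location


def bytes_per_transfer(data_bits: int) -> int:
    """How many bytes move across the data bus in one transfer."""
    return data_bits // 8


def peak_bandwidth(data_bits: int, transfers_per_second: int) -> int:
    """Upper bound on bytes per second; real buses achieve less."""
    return bytes_per_transfer(data_bits) * transfers_per_second


if __name__ == "__main__":
    print(addressable_bytes(32) // 2 ** 30, "GiB")       # 32-bit address bus
    print(bytes_per_transfer(64), "bytes per transfer")  # 64-bit data bus
    print(peak_bandwidth(64, 1_000_000_000), "bytes/s at an assumed 1e9 transfers/s")
```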

Memory-Mapped I/O

How does the CPU communicate with devices like the keyboard, display, or network card? One approach is dedicated I/O instructions (x86 has in and out instructions for this). The more common modern approach is memory-mapped I/O (MMIO).

In MMIO, devices are assigned addresses in the same address space as RAM. Reading or writing those addresses triggers the device rather than RAM. The CPU uses identical load and store instructions whether it is accessing RAM or hardware registers.

For example, on a hypothetical system:

Write a byte to address 0xFFFF0000 - this goes to a hardware register in the display controller, setting a pixel color
Read from address 0xFFFF8000 - this reads the keyboard status register, telling you which key is pressed

The address decoder on the bus determines which component gets activated based on the high bits of the address. The CPU does not know or care - it just puts an address on the bus and reads or writes data.
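Here is a toy model of that routing, using the two example addresses above. Everything in it is an assumption made for illustration: the class names, the region sizes, and the idea that the display keeps one byte per offset. The point is only that load and store look the same to the caller no matter what answers.

```python
class Ram:
    def __init__(self, size: int):
        self.cells = bytearray(size)

    def read(self, offset: int) -> int:
        return self.cells[offset]

    def write(self, offset: int, value: int) -> None:
        self.cells[offset] = value & 0xFF


class Display:
    """Hypothetical display controller: each offset is a pixel register."""
    def __init__(self):
        self.pixels = {}

    def read(self, offset: int) -> int:
        return self.pixels.get(offset, 0)

    def write(self, offset: int, value: int) -> None:
        self.pixels[offset] = value & 0xFF


class Keyboard:
    """Hypothetical keyboard: one read-only status register."""
    def __init__(self):
        self.status = 0

    def read(self, offset: int) -> int:
        return self.status

    def write(self, offset: int, value: int) -> None:
        pass  # writes to a read-only register are ignored


class Bus:
    def __init__(self):
        self.regions = []  # list of (base, size, component)

    def attach(self, base: int, size: int, component) -> None:
        self.regions.append((base, size, component))

    def _route(self, address: int):
        # The address decoder: find which component owns this address.
        for base, size, component in self.regions:
            if base <= address < base + size:
                return component, address - base
        raise ValueError(f"no component at {address:#x}")

    def load(self, address: int) -> int:
        component, offset = self._route(address)
        return component.read(offset)

    def store(self, address: int, value: int) -> None:
        component, offset = self._route(address)
        component.write(offset, value)


def build_example_bus():
    bus, ram, display, keyboard = Bus(), Ram(0x10000), Display(), Keyboard()
    bus.attach(0x00000000, 0x10000, ram)
    bus.attach(0xFFFF0000, 0x8000, display)   # region size is an assumption
    bus.attach(0xFFFF8000, 0x8000, keyboard)  # region size is an assumption
    return bus, ram, display, keyboard
```

The range comparison in _route stands in for the high-bit check: when regions are aligned to power-of-two sizes, testing the top bits of the address and testing a range give the same answer.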

MMIO and caching. CPU caches assume that reading an address twice returns the same value both times. MMIO registers violate this - reading a device status register twice may return different values if the hardware changed. This is why MMIO regions are marked as non-cacheable in the page table, and why device drivers use volatile in C when accessing hardware registers. The volatile keyword tells the compiler: do not cache this variable in a register, read it from memory every time.
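The stale-read problem is easy to reproduce in miniature. The sketch below is a Python stand-in, not driver code: StatusRegister is a hypothetical device whose value changes between reads, and CachedReader plays the role of a cache (or a compiler that kept the value in a register).

```python
class StatusRegister:
    """Hypothetical device register that changes on its own between reads."""
    def __init__(self, values):
        self._values = iter(values)

    def read(self) -> int:
        return next(self._values)


class CachedReader:
    """Reads the source once, then keeps returning the remembered value."""
    def __init__(self, source):
        self.source = source
        self._cached = None

    def read(self) -> int:
        if self._cached is None:
            self._cached = self.source.read()
        return self._cached


if __name__ == "__main__":
    direct = StatusRegister([0, 0, 1])          # device becomes ready on read 3
    print([direct.read() for _ in range(3)])    # sees the change

    cached = CachedReader(StatusRegister([0, 0, 1]))
    print([cached.read() for _ in range(3)])    # never sees the change
```

A polling loop built on the cached reader would spin forever waiting for a 1 that the device has already produced.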

Why Not One Fast Memory?

The obvious question: why have different kinds of memory at all? Why not build everything as fast as a register?

Physics and economics make it impossible. SRAM - the technology behind registers and CPU caches - needs six transistors per bit to hold a value without refreshing. It is fast because the output is directly readable at all times. DRAM needs only one transistor and one capacitor per bit: six times denser, but the capacitor leaks, so the controller must periodically read and rewrite every row to prevent data loss. That refresh cycle adds latency. Physical distance adds more: a register file sits micrometers from the execution units; DRAM sits centimeters away on a separate chip. Electrical signals travel at a significant fraction of the speed of light, but that still means distance costs cycles.

Cost scales directly with transistor count. 32 GB of SRAM would require roughly 1.5 trillion transistors - about a hundred times the transistor count of a modern CPU - and would cost thousands of dollars per gigabyte. DRAM achieves a few dollars per gigabyte through density. Larger capacity and lower latency cannot coexist in the same technology.
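The transistor estimate is just capacity times bits per byte times transistors per bit. A quick check, assuming decimal gigabytes and the six-transistor cell described above (the function name is made up for this example):

```python
def sram_transistors(capacity_bytes: int, transistors_per_bit: int = 6) -> int:
    """Transistors needed for the storage cells alone (no decoders or sense amps)."""
    return capacity_bytes * 8 * transistors_per_bit


if __name__ == "__main__":
    total = sram_transistors(32 * 10 ** 9)  # 32 GB, decimal
    print(f"{total:,} transistors")         # about 1.5 trillion
```

The figure covers storage cells only; address decoders, sense amplifiers, and wiring would push the real count higher.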

The saving grace is locality. What makes a hierarchy work despite these constraints is that programs do not access memory randomly. Two regularities appear in almost every program:

Temporal locality - recently-accessed data is likely to be accessed again soon. A loop counter, a function on the hot path, a struct you just modified: these are touched repeatedly within a short window.
Spatial locality - nearby addresses are accessed together. Arrays are traversed sequentially. Instructions execute in order. Local variables share the same stack frame.

Because of locality, a small fast memory holding recently-used data handles the vast majority of requests. A cache of a few hundred kilobytes satisfies 90-95% of accesses from the CPU; the slow DRAM is needed only for the cold remainder. Expensive fast memory goes a long way when locality is high.
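A tiny simulation shows how much locality matters. This is a sketch under simplifying assumptions: a fully associative cache with LRU replacement, 64-byte lines, and two synthetic access patterns invented for the example. Real caches are set-associative and real programs are messier.

```python
import random
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed)


def hit_rate(addresses, cache_lines: int) -> float:
    """Fraction of accesses served by a fully associative LRU cache."""
    cache = OrderedDict()  # line number -> None, least recently used first
    hits = 0
    total = 0
    for address in addresses:
        line = address // LINE_SIZE
        total += 1
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = None
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / total if total else 0.0


def looping_accesses(passes: int = 10, working_set: int = 4096, step: int = 8):
    """Walk a small array repeatedly: high temporal and spatial locality."""
    return [a for _ in range(passes) for a in range(0, working_set, step)]


def random_accesses(count: int = 5000, span: int = 64 * 1024 * 1024, seed: int = 0):
    """Jump around a 64 MiB range with no pattern: almost no locality."""
    rng = random.Random(seed)
    return [rng.randrange(span) for _ in range(count)]


if __name__ == "__main__":
    print("loop  :", hit_rate(looping_accesses(), cache_lines=128))
    print("random:", hit_rate(random_accesses(), cache_lines=128))
```

With a 128-line (8 KB) cache, the loop misses only on the first touch of each of its 64 lines and hits on everything else, while the random pattern almost never finds its line already present.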

The Memory Hierarchy

Not all memory is equal. The further from the CPU, the larger but slower: registers at the top, then cache, then main memory (DRAM), then flash/SSD storage.

The speed difference is enormous: an L1 cache hit takes about 4 clock cycles; a DRAM access takes 200+ cycles; an SSD access takes hundreds of thousands to millions of cycles. A program that accesses memory randomly can spend most of its time waiting on DRAM. Cache-friendly code - accessing memory sequentially, keeping working sets small - can be 10-100x faster than cache-unfriendly code doing the same work.

Cache Levels: L1, L2, and L3

The “cache” layer of the hierarchy is actually three distinct levels, each a different point on the size-latency tradeoff.

L1 cache (32-64 KB per core, 4-5 cycles) sits physically inside the CPU core, within a few hundred micrometers of the execution units. It must be tiny to be this fast - a longer wire takes more time. Most designs split L1 into a separate instruction cache (L1i) and data cache (L1d), letting the CPU fetch the next instruction and read data operands simultaneously without conflict. Every core has its own private L1.

L2 cache (256 KB - 1 MB per core, 10-12 cycles) is also private to each core. It is unified - instructions and data share the space. L2 is the first fallback when L1 misses: still fast enough that the CPU does not stall badly, but large enough to hold a wider working set than L1 can.

L3 cache (8-64 MB shared, 30-40 cycles) sits outside the individual cores and is shared across all of them on the same chip. The shared nature is the key point. When two threads running on different cores work with the same data, they exchange it through L3 without going to DRAM. When a core evicts a cache line from its L2, it lands in L3 rather than going all the way back to main memory, so another core that wants the same line finds it there. L3 is the last stop before a full DRAM access.

The rule of thumb: an L1 hit costs 4 cycles, L2 about 12, L3 about 40, DRAM about 200. A loop whose working set fits in L1 runs at full speed. One that thrashes DRAM while doing identical arithmetic runs 5-50x slower. This gap - not the speed of the arithmetic units - is what dominates real-world program performance.
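Those rule-of-thumb latencies can be combined into an average cost per access. The sketch below is a simplified model: the latency table comes from the paragraph above, but the fractions of accesses served by each level are made-up inputs, and the function name is hypothetical.

```python
LATENCY_CYCLES = {"L1": 4, "L2": 12, "L3": 40, "DRAM": 200}


def average_access_cycles(served_by: dict[str, float]) -> float:
    """Weighted average latency, given the fraction of accesses each level serves."""
    if abs(sum(served_by.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * LATENCY_CYCLES[level]
               for level, fraction in served_by.items())


if __name__ == "__main__":
    # Assumed mix for a cache-friendly loop: almost everything hits L1.
    friendly = {"L1": 0.95, "L2": 0.03, "L3": 0.015, "DRAM": 0.005}
    # Assumed worst case: every access goes all the way to DRAM.
    thrashing = {"DRAM": 1.0}
    print(average_access_cycles(friendly))   # roughly 5.76 cycles
    print(average_access_cycles(thrashing))  # 200 cycles
```

Under these assumed fractions the gap is about 35x per access, which sits inside the 5-50x range quoted above.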

Cache lines. Caches do not transfer individual bytes - they move fixed-size chunks called cache lines, 64 bytes on most current CPUs. Accessing any byte fetches the full 64-byte block into cache. This is spatial locality made concrete: striding through an array one element at a time is efficient because each cache line load brings in several consecutive elements. Jumping through memory in strides of 64 bytes or more touches a new cache line on every access, eliminating the benefit.
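Counting the distinct cache lines an access pattern touches gives a rough lower bound on its misses. A minimal sketch, assuming 64-byte lines, an initially empty cache, and no hardware prefetching; the function name is made up for this example.

```python
def lines_touched(n_accesses: int, stride: int, line_size: int = 64) -> int:
    """Distinct cache lines touched by n_accesses spaced `stride` bytes apart."""
    return len({(i * stride) // line_size for i in range(n_accesses)})


if __name__ == "__main__":
    for stride in (4, 8, 64, 128):
        touched = lines_touched(1024, stride)
        print(f"stride {stride:>3}: {touched} lines for 1024 accesses")
```

With an 8-byte stride, 1024 accesses share 128 lines (eight accesses per line). At 64 bytes and beyond, every access lands on its own line.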

Summary

Component	Type	Speed	Persistence	Use
Registers	SRAM (flip-flops)	1 cycle	Volatile	CPU working values
Cache	SRAM	4-50 cycles	Volatile	Fast RAM buffer
Main memory	DRAM	200+ cycles	Volatile	Program data/code
Flash/SSD	Floating-gate	Millions of cycles	Non-volatile	Long-term storage

The bus unifies these components under a single address space. The CPU does not need separate mechanisms for memory and devices - the same address/data/control protocol reaches everything, with the address decoder routing each transaction to the right destination.
