
Single-core clock speeds topped out at roughly 4–5 GHz years ago. Physical limits - power density, heat dissipation, the speed of light across a die - mean that sequential performance has been essentially flat for a decade. The only path to more compute is running things in parallel: more cores, more machines, more specialized hardware. Understanding how to structure parallel programs, and where parallelism breaks down, is essential for anyone building systems that handle serious workloads.

Types of Parallelism

Data parallelism applies the same operation to different chunks of data simultaneously. Processing a batch of images where each image goes to a different worker is data parallelism. It scales well because workers rarely need to coordinate.

Task parallelism runs different operations concurrently. A web server handling one request that queries a database while another request fetches from a cache is task parallelism. The tasks are heterogeneous; coordination is the challenge.

Pipeline parallelism stages a computation like an assembly line. Stage 1 preprocesses data and hands it to Stage 2, which runs inference, which hands off to Stage 3 for postprocessing. At steady state all stages run concurrently on different items.
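
A minimal sketch of the pipeline shape, using threads and queues purely to show the structure. The stage functions here are placeholders, and CPU-heavy stages would use processes instead, for reasons covered next.

import queue
import threading

def preprocess(item):            # placeholder for Stage 1 work
    return item * 2

def run_inference(item):         # placeholder for Stage 2 work
    return item + 1

def stage(fn, in_q, out_q):
    # Each stage loops: take an item from upstream, process it, pass it downstream.
    while True:
        item = in_q.get()
        if item is None:         # sentinel: propagate shutdown and exit
            out_q.put(None)
            break
        out_q.put(fn(item))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
workers = [
    threading.Thread(target=stage, args=(preprocess, q1, q2)),
    threading.Thread(target=stage, args=(run_inference, q2, q3)),
]
for w in workers:
    w.start()
for item in range(10):
    q1.put(item)
q1.put(None)                     # end-of-stream marker
for w in workers:
    w.join()

results = []
while (item := q3.get()) is not None:
    results.append(item)         # [1, 3, 5, ..., 19]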

The Python GIL

Python’s Global Interpreter Lock (GIL) means that only one thread executes Python bytecode at a time, even on a multicore machine. threading in Python is useful for I/O-bound work (where threads spend time waiting, not executing) but provides no speedup for CPU-bound computation.
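
A quick way to see the effect: time a CPU-bound function run twice sequentially and then in two threads. A minimal sketch - absolute numbers vary by machine; the point is that the two timings come out roughly equal.

import threading
import time

def spin(n=10_000_000):
    # Pure-Python CPU-bound loop: holds the GIL for the duration.
    while n:
        n -= 1

start = time.perf_counter()
spin()
spin()
print(f"sequential:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
threads = [threading.Thread(target=spin) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two threads: {time.perf_counter() - start:.2f}s")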

For CPU parallelism in Python, use multiprocessing. Each process gets its own Python interpreter and its own GIL, so they genuinely run on separate cores.

from multiprocessing import Pool

def process_image(path):
    # CPU-intensive work
    return preprocess(load(path))

if __name__ == "__main__":   # guard required where processes are spawned (Windows, macOS)
    with Pool(processes=8) as pool:
        results = pool.map(process_image, image_paths)

Pool.map distributes the list across workers and collects results. starmap handles functions with multiple arguments. apply_async submits work without blocking.
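
As a sketch of the latter two, using a hypothetical multi-argument resize function (not part of the example above):

from multiprocessing import Pool

def resize(path, width, height):
    # Hypothetical worker; stands in for real image resizing.
    return f"{path} -> {width}x{height}"

if __name__ == "__main__":
    jobs = [("a.jpg", 224, 224), ("b.jpg", 224, 224), ("c.jpg", 224, 224)]
    with Pool(processes=4) as pool:
        # starmap unpacks each tuple into resize(path, width, height)
        resized = pool.starmap(resize, jobs)
        # apply_async returns an AsyncResult immediately; .get() blocks for the value
        pending = [pool.apply_async(resize, job) for job in jobs]
        results = [r.get() for r in pending]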

concurrent.futures

concurrent.futures provides a cleaner interface over both threads and processes:

from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as executor:
        futures = [executor.submit(process_image, p) for p in paths]
        results = [f.result() for f in futures]

ProcessPoolExecutor for CPU work, ThreadPoolExecutor for I/O work. The API is identical, making it easy to switch.
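
For the I/O-bound side, a minimal sketch that fetches pages concurrently with ThreadPoolExecutor (the URLs are placeholders):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = ["https://example.com", "https://www.python.org"]   # placeholder URLs

def fetch(url):
    # Network-bound: the thread releases the GIL while it waits on the socket.
    with urlopen(url) as resp:
        return resp.read()

with ThreadPoolExecutor(max_workers=16) as executor:
    pages = list(executor.map(fetch, urls))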

Inter-Process Communication

Separate processes can’t share memory by default. The options are:

  • Queues and Pipes: Pass pickled objects between processes. Simple but involves serialization overhead.
  • Shared memory (multiprocessing.shared_memory): Allocate a block of raw memory that multiple processes map into their address space. Zero-copy, but you manage synchronization manually.
  • multiprocessing.Manager: A manager process hosts shared objects (lists, dicts, queues) that other processes access over a connection. Convenient but adds latency.

For large NumPy arrays shared between workers, shared_memory is significantly faster than pickling.
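
A minimal sketch of the shared_memory approach with a NumPy array; the worker attaches to the block by name instead of receiving a pickled copy:

import numpy as np
from multiprocessing import Process, shared_memory

def worker(name, shape, dtype):
    # Attach to the existing block by name: no copy, no pickling of the data.
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr *= 2                              # modifies the shared buffer in place
    shm.close()

if __name__ == "__main__":
    src = np.arange(1_000_000, dtype=np.float64)
    shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
    arr = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
    arr[:] = src                          # one copy in; workers then see it directly
    p = Process(target=worker, args=(shm.name, src.shape, src.dtype))
    p.start()
    p.join()
    shm.close()
    shm.unlink()                          # release the block when done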

Ray: Distributed Python

Ray abstracts parallelism across a cluster. A @ray.remote function becomes a distributed task; a @ray.remote class becomes an actor with persistent state.

import ray

ray.init()

@ray.remote
def compute_features(batch):
    return extract_features(batch)

# Launch 100 tasks across the cluster
futures = [compute_features.remote(b) for b in batches]
results = ray.get(futures)

Ray’s object store uses shared memory within a node and transfers objects over the network between nodes. It handles fault tolerance, scheduling, and memory management. For ML workloads, Ray Tune (hyperparameter search) and Ray Data (distributed preprocessing) build on top of it.
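
The example above covers remote functions; a minimal sketch of the actor side (persistent state behind @ray.remote on a class), assuming a local ray.init():

import ray

ray.init()

@ray.remote
class Counter:
    # An actor: a long-lived worker process that owns this object's state.
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self.total

counter = Counter.remote()                      # launches the actor process
futures = [counter.add.remote(i) for i in range(5)]
print(ray.get(futures))                         # [0, 1, 3, 6, 10]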

Dask: Parallel NumPy and Pandas

Dask provides a parallel, lazy version of NumPy and pandas that partitions data into chunks and executes operations in a task graph. It scales from a single machine to a cluster without changing the API.

import dask.dataframe as dd

df = dd.read_parquet("s3://bucket/data/*.parquet")
result = df.groupby("user_id")["value"].mean().compute()

The .compute() call triggers execution. Until then, operations build a graph that Dask optimizes and schedules.
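
The same lazy, chunked model applies to arrays. A small sketch with dask.array (sizes are arbitrary):

import dask.array as da

# The array is never materialized in one piece; it lives as 2,000 x 2,000 chunks.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
col_means = x.mean(axis=0)      # builds a task graph; nothing runs yet
result = col_means.compute()    # executes the graph in parallel across workers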

MPI: High-Performance Computing

MPI (Message Passing Interface) is the standard in scientific computing. Processes communicate explicitly by sending and receiving messages. Collective operations like AllReduce (sum all values across all processes and broadcast the result) are the backbone of distributed ML training.

mpirun -np 8 python train.py

Inside the program, MPI.COMM_WORLD.Allreduce aggregates gradients across all ranks. PyTorch’s DistributedDataParallel uses NCCL for GPU-to-GPU communication, which follows the same AllReduce pattern but over NVLink or InfiniBand instead of Ethernet.
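
A minimal mpi4py sketch of the AllReduce pattern (mpi4py is one common Python binding; the script name in the comment is arbitrary):

# run with: mpirun -np 8 python allreduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local_grad = np.full(4, rank, dtype=np.float64)   # this rank's "gradient"
summed = np.empty_like(local_grad)

# Every rank ends up holding the elementwise sum across all ranks.
comm.Allreduce(local_grad, summed, op=MPI.SUM)
if rank == 0:
    print(summed)                                 # with 8 ranks: [28. 28. 28. 28.]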

Data Parallelism in ML

In distributed ML training, the model is replicated on each GPU. Each GPU processes a different shard of the batch, computes gradients, and then an AllReduce synchronizes the gradients across all GPUs. Each GPU updates its weights identically, so they stay in sync.

This is called synchronous data parallelism. It scales to hundreds of GPUs with near-linear throughput. The bottleneck is the gradient synchronization - each AllReduce takes time proportional to the number of parameters and inversely proportional to bandwidth.
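
In PyTorch, this is what DistributedDataParallel implements. A toy sketch of a single step, assuming a torchrun launch with one GPU per process; the Linear model and random batch are placeholders for a real model and a sharded data loader:

# Launch with: torchrun --nproc_per_node=8 ddp_demo.py   (script name is arbitrary)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")              # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(512, 10).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 512, device=f"cuda:{local_rank}")        # this rank's shard of the batch
y = torch.randint(0, 10, (32,), device=f"cuda:{local_rank}")

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()                                      # gradients are AllReduced here
optimizer.step()
dist.destroy_process_group()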

GPU Parallelism: CUDA Thread Hierarchy

Inside a GPU, parallelism is structured as a hierarchy. A thread is the smallest unit; 32 threads form a warp, which executes in lockstep (SIMT - Single Instruction, Multiple Thread). Warps are grouped into blocks, whose threads share fast on-chip memory and can synchronize via barriers. Blocks form a grid that covers the entire computation.

Writing efficient CUDA means ensuring that threads in a warp access adjacent memory addresses (coalesced access), that shared memory is used to reduce HBM bandwidth pressure, and that there is enough work to saturate all streaming multiprocessors.
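
One way to experiment with this hierarchy from Python is Numba's CUDA support (not mentioned above; used here purely for illustration). A sketch of an elementwise kernel in which consecutive threads touch consecutive elements, i.e. coalesced access:

import numpy as np
from numba import cuda

@cuda.jit
def scale(x, y):
    i = cuda.grid(1)             # global index: block index * block size + thread index
    if i < x.size:
        y[i] = 2.0 * x[i]        # neighboring threads read neighboring elements (coalesced)

x = np.arange(1_000_000, dtype=np.float32)
d_x = cuda.to_device(x)
d_y = cuda.device_array_like(d_x)

threads_per_block = 256          # 8 warps per block
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](d_x, d_y)
y = d_y.copy_to_host()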

When Parallelism Hurts

Not everything parallelizes well. Amdahl’s Law states that if a fraction $p$ of a program is parallelizable, the maximum speedup on $n$ cores is:

$$S(n) \leq \frac{1}{1 - p + \frac{p}{n}}$$

If 10% of your program is sequential, the maximum possible speedup is 10x, no matter how many cores you add. At large $n$, the $1 - p$ term dominates entirely.

False sharing is a subtle cache problem in shared-memory parallelism. If two cores write to different variables that happen to live on the same cache line, each write invalidates the other core’s cached copy of the line. The cores thrash even though they’re writing to logically independent locations. Padding data structures to cache line boundaries (typically 64 bytes) eliminates false sharing.
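
A rough Python sketch of the padding idea, using a shared-memory block across processes. In pure Python the interpreter overhead swamps the cache effect, so treat this as an illustration of layout rather than a benchmark:

import numpy as np
from multiprocessing import Process, shared_memory

CACHE_LINE = 64                  # bytes; typical on x86-64
N_WORKERS = 4

def bump(name, slot, stride):
    # Each worker increments only its own counter inside the shared block.
    shm = shared_memory.SharedMemory(name=name)
    counters = np.ndarray(N_WORKERS * stride, dtype=np.int64, buffer=shm.buf)
    for _ in range(1_000_000):
        counters[slot * stride] += 1
    shm.close()

if __name__ == "__main__":
    # stride=1 packs the 8-byte counters onto one cache line (false sharing);
    # stride=8 pads each counter out to its own 64-byte line.
    stride = CACHE_LINE // 8
    shm = shared_memory.SharedMemory(create=True, size=N_WORKERS * stride * 8)
    np.ndarray(N_WORKERS * stride, dtype=np.int64, buffer=shm.buf)[:] = 0
    workers = [Process(target=bump, args=(shm.name, i, stride)) for i in range(N_WORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    shm.close()
    shm.unlink()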

Coordination overhead grows with the number of workers. A reduction that requires all workers to synchronize proceeds at the pace of the slowest worker at every step; one straggler can bottleneck a thousand-node cluster.

Examples

Parallel Data Preprocessing with multiprocessing

from multiprocessing import Pool
from pathlib import Path
import numpy as np

def load_and_preprocess(path: str) -> np.ndarray:
    data = np.load(path)
    return (data - data.mean()) / (data.std() + 1e-8)

paths = list(Path("data/").glob("*.npy"))
if __name__ == "__main__":
    with Pool(processes=16) as pool:
        arrays = pool.map(load_and_preprocess, [str(p) for p in paths])

Ray Remote Function

import ray
ray.init(num_cpus=32)

@ray.remote
def train_fold(fold_idx: int, config: dict) -> float:
    model = build_model(config)
    return cross_validate(model, fold_idx)

configs = [{"lr": lr} for lr in [1e-3, 3e-3, 1e-2]]
futures = [train_fold.remote(i, c) for i, c in enumerate(configs)]
val_losses = ray.get(futures)

Amdahl’s Law Calculation

def max_speedup(p: float, n: int) -> float:
    """p = parallelizable fraction, n = number of cores"""
    return 1 / ((1 - p) + p / n)

for n in [2, 4, 8, 16, 64, 1024]:
    print(f"n={n:5d}: speedup = {max_speedup(0.95, n):.2f}x")
# n=    2: speedup = 1.90x
# n=    4: speedup = 3.48x
# n=    8: speedup = 5.93x
# n=   16: speedup = 9.14x
# n=   64: speedup = 15.42x
# n= 1024: speedup = 19.64x  (hard limit ~20x when p=0.95)

The last line is the lesson: with 5% sequential code, you can never do better than 20x speedup regardless of how many machines you throw at the problem.

