GPU vs TPU - Different Bets on the Shape of Computation // Megha Bose


The 8,760x Question

A neural network that takes 10 years to train on a single CPU trains in 10 hours on a modern GPU cluster. 10 years is 87,600 hours. 10 hours is 10 hours. That is an 8,760x improvement.

No chip got 8,760x faster. CPU clock speeds have been roughly flat since the mid-2000s, and single-threaded performance has improved only incrementally since then. What changed is not clock speed or instruction latency - it is the fundamental model of computation. The CPU is built to execute one thing very fast. The GPU is built to execute thousands of things simultaneously, even if each individual execution is slower.

The choice between these architectures - and the custom silicon that came after - determines what machine learning is possible and what it costs. AlexNet (2012) proved deep learning worked at scale. The hardware that ran AlexNet made the rest of the decade possible.

Why CPUs Are the Wrong Tool

A modern server CPU is an extraordinary machine for its design goal. Each core has a deep out-of-order execution pipeline, large branch predictors, multilevel cache hierarchies (L1/L2/L3), hardware prefetchers, and speculative execution logic. A single core can execute multiple instructions per cycle through superscalar execution. A 128-core server has 128 of these sophisticated engines.

All of this engineering optimizes for latency: completing one task as fast as possible, regardless of what that task is. This is the right design for a web server, an operating system, a compiler, a database. These workloads execute irregular code - lots of branches, data-dependent control flow, mixed memory access patterns - where the ability to hide latency and execute ahead speculatively determines performance.

Matrix multiplication for neural networks does not need any of that. It is perfectly regular: the same multiply-accumulate operation, repeated billions of times, on contiguous memory. There are no branches. There is no data-dependent control flow. The CPU’s branch predictor, speculative execution engine, and out-of-order logic sit idle. Only the arithmetic units do useful work, and 128 cores is not many arithmetic units.

The GPU Architecture: Designed for Throughput

NVIDIA’s bet on GPGPU (General-Purpose GPU computing) in 2007, with the CUDA programming model, was a recognition that the GPU’s design - many simple cores running in lockstep - mapped perfectly onto scientific computing and, later, machine learning.

An NVIDIA A100 has 6,912 CUDA cores organized into 108 Streaming Multiprocessors (SMs), or 64 per SM. Each SM is far simpler than a CPU core: in-order instruction issue, no branch prediction, no speculative execution. What SMs have instead is a large register file shared among thousands of resident threads, plus warp-level parallelism: 32 threads form a warp, and all 32 execute the same instruction simultaneously (SIMT - Single Instruction, Multiple Thread).

If threads in a warp diverge - some take an if branch and others don’t - the warp executes both branches in sequence, masking out the inactive threads. Both paths run back to back, so with two equally long branches the warp delivers at most half its peak throughput. Warp divergence is the enemy of GPU efficiency. Neural network kernels avoid it by design: every element in a batch takes the same code path.
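The cost of divergence can be illustrated with a toy simulation. This is a minimal sketch in plain Python, not GPU code: the function name run_warp and the example branch bodies are hypothetical, and both branches are assumed to cost one pass each.

```python
WARP_SIZE = 32

def run_warp(values):
    """Simulate one warp executing `v * 2 if v > 0 else -v` under SIMT masking.

    Returns (outputs, passes, utilization), where utilization is the fraction
    of lane-slots that did useful work across all passes.
    """
    assert len(values) == WARP_SIZE
    take_if = [v > 0 for v in values]   # per-lane predicate mask
    out = [None] * WARP_SIZE
    passes = 0
    busy_slots = 0

    # Pass over the 'if' side: lanes whose mask is False sit idle.
    if any(take_if):
        passes += 1
        for lane, v in enumerate(values):
            if take_if[lane]:
                out[lane] = v * 2
                busy_slots += 1

    # Pass over the 'else' side: the other lanes sit idle.
    if not all(take_if):
        passes += 1
        for lane, v in enumerate(values):
            if not take_if[lane]:
                out[lane] = -v
                busy_slots += 1

    return out, passes, busy_slots / (passes * WARP_SIZE)

# Uniform warp: every lane takes the same path, one pass, full utilization.
_, uniform_passes, uniform_util = run_warp([1] * WARP_SIZE)

# Divergent warp: a single lane takes the other path, and the whole warp
# still pays for both passes.
_, divergent_passes, divergent_util = run_warp([1] * 31 + [-1])
```

Under these assumptions a single stray lane is enough to double the number of passes, which is why kernels are written so that neighbouring threads follow the same path.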

The A100’s 108 SMs × 64 FP32 cores per SM give 6,912 concurrent “lanes” of computation. Each SM can keep up to 2,048 threads (64 warps of 32) resident, so up to 221,184 threads are in flight across the chip - far more than there are lanes. The scheduler keeps the lanes busy by issuing instructions from ready warps while others wait for memory - latency hiding through parallelism rather than latency reduction.
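The A100 figures quoted in this section can be worked through explicitly. The sketch below uses only numbers from the text (6,912 CUDA cores, 108 SMs, 32-thread warps, up to 2,048 threads per SM); the variable names are illustrative.

```python
CUDA_CORES = 6912
SMS = 108
WARP_SIZE = 32
MAX_THREADS_PER_SM = 2048

# Execution lanes: how many FP32 operations can issue per clock.
lanes_per_sm = CUDA_CORES // SMS                  # 64
total_lanes = lanes_per_sm * SMS                  # 6,912

# Resident threads: how many threads the chip can hold in flight at once.
warps_per_sm = MAX_THREADS_PER_SM // WARP_SIZE    # 64
resident_threads = SMS * MAX_THREADS_PER_SM       # 221,184

# Oversubscription: threads available per lane. This surplus is what the
# scheduler draws on to hide memory latency.
threads_per_lane = resident_threads // total_lanes  # 32
```

The distinction matters: lanes bound the peak arithmetic rate, while resident threads bound how much memory latency the scheduler can cover.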

GPU Memory: The Real Bottleneck

The memory hierarchy on a GPU differs fundamentally from a CPU:

HBM (High Bandwidth Memory): the main GPU memory, physically stacked on the same package as the die. The A100 has 80 GB of HBM2e at roughly 2 TB/s. The H100 has 80 GB of HBM3 at 3.35 TB/s. Bandwidth is high by DRAM standards, but still roughly an order of magnitude lower than on-chip SRAM.

Shared memory / L1 SRAM: each SM has approximately 192 KB of SRAM. Aggregate bandwidth is roughly an order of magnitude higher than HBM and latency is much lower, but capacity is tiny - roughly 20 MB total across the entire A100. This memory is programmer-managed (in CUDA) or compiler-managed (in Triton, XLA).

Registers: private to each thread, sub-nanosecond access. Limited by the total register file size per SM.

The practical implication: a kernel that repeatedly reads the same data from HBM is wasting bandwidth. The usual optimization strategy is to stage data in shared memory, compute as much as possible on it, then write results back. FlashAttention’s core contribution is restructuring the attention operation to be SRAM-resident: instead of reading and writing the full attention matrix to HBM, it tiles the computation so each tile fits within an SM’s 192 KB of SRAM. This alone gives 2-4x speedup over the naive attention implementation.
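The tiling idea can be sketched in plain Python for a single query vector. This is an illustration of tiled attention with a running (online) softmax, not FlashAttention's actual kernel: the function names and the tile parameter are hypothetical, and the usual 1/sqrt(d) scaling is omitted for brevity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, K, V):
    """Materialize every score first (the analogue of the full attention row)."""
    scores = [dot(q, k) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, V)) / total
            for c in range(len(V[0]))]

def tiled_attention(q, K, V, tile=2):
    """Process K/V in small tiles, keeping only O(d) running state."""
    d = len(V[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * d
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [dot(q, k) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (0.0 on the first tile).
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for c in range(d):
                acc[c] += w * v[c]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
K = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
V = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [2.0, 2.0, 1.0],
     [0.0, 3.0, 1.0], [1.0, 1.0, 1.0]]
full = naive_attention(q, K, V)
tiled = tiled_attention(q, K, V, tile=2)
```

The two functions agree up to floating-point rounding, but the tiled version never holds more than one tile of scores at a time, which is the property that lets the working set stay in fast on-chip memory.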

Tensor Cores and Precision

Standard CUDA cores perform scalar FP32 multiply-accumulate operations: one multiply-add per clock per core. Tensor Cores are specialized hardware units for matrix multiply-accumulate at lower precision.

The A100’s Tensor Cores in FP16 mode: 312 TFLOPS. In FP32 mode on regular CUDA cores: 19.5 TFLOPS. That is a 16x difference for the same chip, achievable only by lowering precision.

BF16 (Brain Float 16) has become the standard training precision for LLMs. It allocates 8 bits to the exponent (same as FP32) and 7 bits to the mantissa (versus FP32’s 23). The large exponent range means overflow and underflow are rare - a major practical advantage over FP16’s narrower exponent, which caused training instabilities in early mixed-precision work.
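The range-versus-precision trade-off is easy to see by manipulating bits directly. The sketch below is an illustration using Python's struct module: to_bf16 truncates a float32 to its top 16 bits (real hardware typically rounds to nearest rather than truncating), and fits_fp16 relies on struct's IEEE half-precision format. Both helper names are hypothetical.

```python
import struct

def to_bf16(x: float) -> float:
    """Keep the sign, 8 exponent bits and top 7 mantissa bits of a float32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= 0xFFFF0000          # drop the low 16 mantissa bits (truncation)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fits_fp16(x: float) -> bool:
    """True if x is representable in IEEE FP16 without overflowing."""
    try:
        struct.pack(">e", x)
        return True
    except OverflowError:
        return False

# Precision: BF16 keeps only about 2-3 significant decimal digits.
pi_bf16 = to_bf16(3.14159)      # 3.140625

# Range: a value well beyond FP16's maximum (65,504) still fits in BF16.
big = 1.0e38
big_bf16 = to_bf16(big)         # within 1% of 1e38
big_fits_fp16 = fits_fp16(big)  # False: FP16 overflows
```

BF16 gives up mantissa bits to keep FP32's exponent, so large gradients or activations that would overflow FP16 remain finite.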

INT8 quantization (during inference, not training) delivers 2x the throughput of FP16 at the cost of quantization error. For many large models the accuracy loss is small enough that INT8 inference is common practice. FP8 (introduced in H100) pushes further, enabling training at FP8 precision for models that can tolerate it.
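Where the quantization error comes from can be shown with a minimal sketch of symmetric per-tensor INT8 quantization. This is one simple scheme among several (production systems often use per-channel scales and calibration data); the function names and example weights are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)     # scale is about 0.01
restored = dequantize(q, scale)

# Rounding to the nearest code bounds the per-weight error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Note how the small weight 0.003 collapses to zero: with one scale for the whole tensor, a single large value sets the step size for everything else.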

The precision hierarchy: FP32 (accuracy, slow) → BF16/FP16 (training, fast) → INT8/FP8 (inference, fastest). Moving down the hierarchy requires understanding what accuracy you can tolerate.

The Roofline Model: Memory vs. Compute

Every GPU kernel is bounded by one of two limits: compute throughput or memory bandwidth. The roofline model makes this precise.

The A100’s ridge point: 312 TFLOPS / 2 TB/s = 156 FLOPs per byte of HBM traffic.

A kernel whose arithmetic intensity (FLOPs per byte it moves from HBM) exceeds 156 is compute-bound: the arithmetic units are the bottleneck, and more memory bandwidth wouldn’t help. A kernel below 156 is memory-bound: the chip sits idle waiting for data from HBM.
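The roofline itself is a one-line formula: attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. A minimal sketch using the A100 figures from the text (312 TFLOPS, 2 TB/s); the function names are illustrative.

```python
def attainable_tflops(intensity, peak_tflops=312.0, bandwidth_tbs=2.0):
    """Roofline bound: (FLOPs/byte) x (TB/s) gives TFLOP/s, capped at the peak."""
    return min(peak_tflops, intensity * bandwidth_tbs)

def ridge_point(peak_tflops=312.0, bandwidth_tbs=2.0):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_tflops / bandwidth_tbs

ridge = ridge_point()                    # 156.0 FLOPs/byte
low = attainable_tflops(0.25)            # 0.5 TFLOPS: deeply memory-bound
at_ridge = attainable_tflops(ridge)      # 312.0 TFLOPS
high = attainable_tflops(1000.0)         # 312.0 TFLOPS: compute-bound
```

Below the ridge, attainable throughput grows linearly with intensity; above it, extra intensity buys nothing and only more compute helps.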

Matrix multiplication of large matrices has high arithmetic intensity. For an (M, K) × (K, N) multiply: roughly $2MNK$ FLOPs and $(MK + KN + MN) \times 2$ bytes of I/O (in FP16). For large square matrices (M=N=K=4096), this works out to $4096/3 \approx 1{,}365$ FLOPs/byte - well above the ridge point, solidly compute-bound. Tensor Cores are doing useful work.
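The FLOP and byte counts above translate directly into code. A small sketch, assuming each operand is read from HBM once and the result written once (an idealization, since real kernels may re-read tiles); the function name is illustrative.

```python
def matmul_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte of HBM traffic for an (m, k) x (k, n) multiply."""
    flops = 2 * m * n * k                                   # one multiply + one add
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / bytes_moved

# Square case: the expression simplifies to n / 3 in FP16.
square = matmul_intensity(4096, 4096, 4096)    # about 1365.3

# Intensity grows with matrix size, so small multiplies can fall
# below the A100 ridge point of 156 FLOPs/byte.
small = matmul_intensity(256, 256, 256)        # about 85.3
```

Because intensity scales with matrix dimension, the same operation can be compute-bound at one size and memory-bound at another.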

A simple elementwise ReLU reads each element, applies a comparison, and writes the result: in FP16 that is one operation per 4 bytes moved, or roughly 0.25 FLOPs/byte, deeply memory-bound. Adding more compute units does nothing. The optimization is kernel fusion: combine ReLU with the preceding matrix multiply into one kernel so the activation data flows through registers, never hitting HBM. Compilers like XLA do this automatically, and Triton makes such fused kernels straightforward to write.

Modern LLM inference is primarily memory-bound. During token generation, the model processes one token at a time (batch size 1 or small). The arithmetic intensity of matrix multiplies with a tiny batch is low - you’re paying the full bandwidth cost to read the weight matrices but only performing a small number of FLOPs per weight. This is why inference efficiency is about minimizing weight movement, not maximizing FLOPS - and it is why quantization (smaller weights) and KV-cache compression matter so much for inference.
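The effect of batch size on arithmetic intensity, and the resulting bandwidth ceiling on token generation, can be estimated in a few lines. This is a back-of-the-envelope sketch: the 7-billion-parameter model is a hypothetical example, the 2 TB/s figure is the A100 bandwidth quoted earlier, and KV-cache traffic is ignored.

```python
def decode_intensity(batch, k, n, bytes_per_element=2):
    """FLOPs per byte for a (batch, k) x (k, n) multiply against a weight matrix."""
    flops = 2 * batch * k * n
    bytes_moved = (batch * k + k * n + batch * n) * bytes_per_element
    return flops / bytes_moved

def max_tokens_per_second(params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound at batch size 1: every weight is read once per token."""
    return bandwidth_bytes_per_s / (params * bytes_per_param)

single = decode_intensity(1, 4096, 4096)      # just under 1 FLOP/byte
batched = decode_intensity(256, 4096, 4096)   # about 227.6 FLOPs/byte

# Hypothetical 7B-parameter model on 2 TB/s of HBM bandwidth.
fp16_bound = max_tokens_per_second(7e9, 2, 2e12)   # about 143 tokens/s
int8_bound = max_tokens_per_second(7e9, 1, 2e12)   # about 286 tokens/s
```

At batch size 1 the intensity is roughly 1 FLOP/byte, far below the ridge point, so halving the bytes per weight roughly doubles the ceiling regardless of how many FLOPS the chip offers.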

The TPU: Custom Silicon for One Job

In 2016, Google announced the TPU (Tensor Processing Unit), revealing it had been running in production data centers since 2015. The TPU is an ASIC - application-specific integrated circuit - designed entirely for matrix multiply-accumulate in neural networks.

The core architecture is a systolic array: a grid of multiply-accumulate cells that pass data through in a wave. Inputs stream in from one direction, weights from another; partial sums flow through the array cell by cell. The key property: data moves through the array on every cycle without requiring SRAM reads. The arithmetic units are never waiting for memory during the matrix multiply itself - the data arrives at exactly the right time due to the structured dataflow.
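The wave-like dataflow can be simulated cycle by cycle. The sketch below is a simplified, output-stationary variant chosen for readability: each cell keeps its own partial sum while operands shift past, whereas the description above has partial sums moving between cells. It is not a model of the TPU's actual design, and the function name is hypothetical.

```python
def systolic_matmul(A, B):
    """Multiply A (M x K) by B (K x N) on an M x N grid of MAC cells.

    Rows of A enter from the left and columns of B from the top, each skewed
    by one cycle per row/column so matching operands meet in the right cell.
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]      # partial sum held in each cell
    a_reg = [[0] * N for _ in range(M)]    # A operand currently in each cell
    b_reg = [[0] * N for _ in range(M)]    # B operand currently in each cell
    cycles = M + N + K - 2
    for t in range(cycles):
        # Shift A operands one cell to the right; feed the left edge.
        for i in range(M):
            for j in range(N - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < K else 0
        # Shift B operands one cell down; feed the top edge.
        for j in range(N):
            for i in range(M - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < K else 0
        # Every cell performs one multiply-accumulate per cycle.
        for i in range(M):
            for j in range(N):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles

result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# result is [[19, 22], [43, 50]] after 4 cycles
```

Each operand is fetched from outside the array once and then reused by every cell it passes through, which is the source of the array's efficiency: memory reads scale with the edge of the grid while arithmetic scales with its area.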

Each TPU v4 chip is built from multiple 128×128 systolic arrays, delivering roughly 275 TFLOPS of BF16 compute per chip. TPU v4 pods connect 4,096 chips with a custom high-bandwidth ICI (Inter-Chip Interconnect) at 1.2 Tb/s per link - significantly higher than InfiniBand between GPU nodes. For AllReduce operations during training, the TPU pod’s interconnect is a major advantage.

TPUs are designed for XLA (Accelerated Linear Algebra), Google’s tensor computation compiler. XLA performs aggressive operator fusion, padding, and hardware-aware scheduling. Code that goes through XLA - JAX, TensorFlow, and increasingly PyTorch via torch_xla - benefits automatically. You do not write TPU kernels manually; XLA compiles tensor programs to the TPU’s instruction set.

GPU vs TPU: Where Each Wins

The GPU won the first round of the deep learning hardware race through ecosystem, not architecture. PyTorch chose CUDA. The research community chose PyTorch. Researcher adoption drove production adoption. By the time TPUs were available outside Google, the CUDA ecosystem - libraries, profilers, tooling, documentation - had years of investment behind it.

GPUs win on:

Framework support (PyTorch is first-class; TPU support via torch_xla lags)
Custom kernels (CUDA and Triton give low-level control; XLA is less flexible)
Research flexibility (easy to try novel architectures without compiler support)
Inference versatility (diverse batch sizes, variable sequence lengths, model serving)
Cloud availability (AWS p3/p4d/p5, Azure NDv4, GCP A100 instances - any cloud)

TPUs win on:

Training large Transformer models at scale (the systolic array + pod interconnect is exceptional for this specific workload)
Total cost of ownership for Google-internal workloads (custom hardware at hyperscaler volume)
XLA-compatible frameworks (JAX training is typically faster on TPUs than equivalent GPU setups)
Caveat: TPUs are GCP only - not available on AWS or Azure, which limits adoption

The practical decision for most teams: GPU. The TPU advantage is real but narrow (primarily large-scale Transformer training) and requires committing to the JAX or XLA ecosystem and GCP infrastructure.

Cloud Economics: GPU and TPU Instances

AWS GPU instances follow a naming pattern that reflects generation: p3 (V100, 2017-era), p4d (A100, 2020), p5 (H100, 2023). A p4d.24xlarge has 8 × A100 40GB GPUs with NVLink, 96 vCPUs, and 1.1 TB RAM. On-demand pricing is high (over $30/hour). Spot pricing is roughly 50-70% lower, but spot interruptions are common for high-demand GPU instances. Reserved instances at 1- or 3-year terms reduce cost by 40-60%.
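The pricing ranges above are easier to compare per GPU-hour. A rough sketch: the $32/hour on-demand rate is a hypothetical round number consistent with "over $30/hour", not a quoted price, and the discount percentages are the ranges given in the text.

```python
ON_DEMAND_RATE = 32.0   # hypothetical USD/hour for one 8-GPU instance
GPUS_PER_INSTANCE = 8

def per_gpu_hour(instance_rate, gpus=GPUS_PER_INSTANCE):
    """Hourly cost of a single GPU within the instance."""
    return instance_rate / gpus

def discounted(rate, discount):
    """Apply a fractional discount, e.g. 0.5 for 50% off."""
    return rate * (1 - discount)

on_demand_gpu = per_gpu_hour(ON_DEMAND_RATE)                      # 4.0
spot_gpu_range = (per_gpu_hour(discounted(ON_DEMAND_RATE, 0.7)),
                  per_gpu_hour(discounted(ON_DEMAND_RATE, 0.5)))  # about 1.2 to 2.0
reserved_gpu_range = (per_gpu_hour(discounted(ON_DEMAND_RATE, 0.6)),
                      per_gpu_hour(discounted(ON_DEMAND_RATE, 0.4)))  # about 1.6 to 2.4
```

Spot is cheapest on paper, but the comparison ignores the cost of interruptions: work lost between checkpoints effectively raises the price per useful GPU-hour.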

AWS Trainium (Trn1) and AWS Inferentia (Inf2) are Amazon’s custom silicon - Trainium for training, Inferentia for inference. Both are cheaper per FLOP than NVIDIA GPUs for their respective workloads, but require the AWS Neuron SDK (a layer above the hardware) and support a narrower range of model architectures. For teams running the same model in production at scale (inference serving), Inferentia’s cost savings are substantial. For research and varied workloads, GPUs remain the default.

GCP TPU pods are available in v3, v4, and v5p configurations. TPU v4 pods can be rented by the “TPU slice” (a subset of a pod). GCP’s TPU pricing per TFLOP is competitive with A100 pricing for training workloads that fully utilize the systolic array.

CUDA Lock-In: The uncomfortable truth about the GPU ecosystem is that CUDA is a proprietary platform. AMD’s ROCm aims to provide an open alternative, and AMD’s MI300X GPU is technically competitive with the H100 on some benchmarks. But the software ecosystem - PyTorch, NCCL, cuDNN, cuBLAS, Triton - is CUDA-first. ROCm support exists but is consistently one to two generations behind in maturity. Teams that have written custom CUDA kernels face a substantial rewrite to migrate. This is NVIDIA’s deepest moat.

The Energy Problem

Training GPT-3 (175B parameters) was estimated to consume roughly 1,287 MWh of electricity and produce approximately 550 tons of CO₂e; the emissions figure depends on grid carbon intensity. Inference on a deployed LLM at scale consumes energy continuously and at a level that is only now entering public discourse.

The energy cost is not a reason not to use GPUs, but it is a reason to take efficiency seriously. Quantization, knowledge distillation, and efficient attention implementations (FlashAttention) reduce the compute required per inference. Running inference on hardware with lower energy per FLOP (Inferentia, TPU) reduces operating costs and emissions.

The hardware lottery (a term from Sara Hooker’s 2020 paper) captures a related concern: the research ideas that succeeded in the deep learning era succeeded in part because they happened to fit the hardware that existed - dense matrix multiply on GPUs. Ideas that require sparse, irregular, or symbolic computation were disadvantaged not because they were wrong but because the hardware wasn’t optimized for them. Future hardware may validate different paradigms.

Future: Custom Silicon and the Post-CUDA World

NVIDIA’s dominance is not permanent. The economics of training at hyperscaler scale drive every major technology company to build custom silicon:

Google TPU (established, v5p deployed)
AWS Trainium (Trn2 in development)
Apple Neural Engine (inference on-device, M3/M4 chips)
Intel Gaudi 3 (competitive with H100 on some benchmarks, with an open software stack similar in spirit to AMD’s)
Cerebras WSE-3 (wafer-scale chip with 4 trillion transistors, designed for models that don’t fit in GPU memory)
Groq LPU (deterministic execution for inference, very low latency)

The CUDA ecosystem’s moat is real, but the PyTorch team’s explicit investment in making custom backends (via torch.compile and the PyTorch 2.0 compiler stack) first-class citizens has reduced the barrier to using non-CUDA hardware. Models written in standard PyTorch can increasingly run on non-NVIDIA hardware without manual porting.

Photonic computing (optical matrix multiply) and neuromorphic chips (event-driven, sparse computation inspired by biological neurons) are earlier-stage alternatives that promise fundamentally different efficiency profiles. Whether they achieve commercial relevance for mainstream ML workloads in the next decade is genuinely uncertain.

What is certain is that the landscape of ML hardware in 2030 will look substantially different from today’s NVIDIA-dominated picture.

Summary

Dimension	CPU	GPU (A100/H100)	TPU v4
Core count	64-128 powerful cores	6,912-16,896 simple cores	2×128×128 systolic arrays
Optimization target	Latency (one task, fast)	Throughput (many tasks)	Throughput (matrix multiply)
Memory bandwidth	~300 GB/s (DDR5)	2-3.35 TB/s (HBM)	~1.2 TB/s (HBM)
Interconnect (multi-chip)	PCIe	NVLink + InfiniBand	ICI (custom, 1.2 Tb/s/link)
Framework support	All	CUDA-first (PyTorch excellent)	JAX/TF first; PyTorch limited
Custom kernels	Easy	CUDA, Triton	XLA only; limited
Cloud availability	All	AWS, GCP, Azure	GCP only
Best workload	OS, databases, irregular code	Research, inference, varied training	Large Transformer training at scale
CUDA lock-in risk	N/A	High	Low (but GCP lock-in)
