Sparse Linear Algebra - When Most of the Matrix Is Zero // Megha Bose

Helpful context:

Imagine a graph with one billion nodes - every user on Facebook, every page on the web. The edges between them number maybe three billion. If you tried to store the adjacency matrix of this graph densely, every row would need one billion floats. All one billion rows. That is 8 petabytes. The matrix is 99.9999997% zeros. You would be allocating 8 petabytes to store nothing useful.

The sparse representation of the same graph - three arrays, one for row indices, one for column indices, one for edge weights - needs 24 gigabytes. A factor of 300,000 less memory. And because you are no longer multiplying by billions of zeros, the arithmetic shrinks proportionally too.

This is why sparse linear algebra exists. Not as a theoretical curiosity, but as the difference between a computation that fits in a data center and one that doesn’t fit on the planet.

Why Dense Storage Fails

For an $m \times n$ matrix with $\text{nnz}$ nonzero entries, dense storage requires $O(mn)$ memory regardless of $\text{nnz}$. A $10^6 \times 10^6$ matrix uses $8 \times 10^{12}$ bytes - 8 TB in float64. If it has $10^7$ nonzeros (0.001% density), dense storage wastes a factor of $10^5$ in memory, and every dense operation performs $10^5$ unnecessary multiplications by zero.

This is not just about memory. It is about compute time. Matrix-vector multiplication on a dense $n \times n$ matrix costs $O(n^2)$ operations. If the matrix has $\text{nnz}$ nonzeros, sparse multiplication costs $O(\text{nnz})$. For a graph adjacency matrix with average degree 3, that is $O(3n)$ instead of $O(n^2)$ - a quadratic versus linear gap that dominates everything else at scale.

Sparse formats store only the nonzeros and their locations. Memory drops to $O(\text{nnz})$.

Sparse Storage Formats

Three formats dominate real systems, each optimized for a different phase of computation.

COO (Coordinate Format) - Built for Construction

The simplest format: three arrays - row, col, data - each of length $\text{nnz}$. Entry $(i, j)$ with value $v$ contributes one entry to each array.

# Matrix:
# 1 0 2
# 0 3 0
# 4 0 0

row  = [0, 0, 1, 2]
col  = [0, 2, 1, 0]
data = [1, 2, 3, 4]

COO is the natural format for building a sparse matrix incrementally - just append triples. Libraries that read graph edge lists or finite element meshes build COO first, then convert. The problem is arithmetic: random access to row $i$ requires scanning all of row, making SpMV $O(\text{nnz}^2)$ naive, and any real row operation slow.

COO is a staging format. You build in COO, then convert to something better.

CSR (Compressed Sparse Row) - Built for Row Operations

CSR compresses the row array into an array of row pointers. indptr[i] is the index in data and indices where row $i$ begins. Row $i$ spans data[indptr[i]:indptr[i+1]].

# Same matrix as above, in CSR:
indptr  = [0, 2, 3, 4]   # row 0 starts at 0, row 1 at 2, row 2 at 3
indices = [0, 2, 1, 0]   # column indices of nonzeros
data    = [1, 2, 3, 4]

Memory: $O(\text{nnz} + m)$. Row $i$ is accessible in $O(1)$ via indptr[i]. CSR is the workhorse format: it is what scipy.sparse, PyTorch sparse, and cuSPARSE use internally for row operations. Sparse matrix-vector multiplication, iterative solvers, and GNN message passing all work naturally on CSR.

CSC (Compressed Sparse Column) - Built for Column Operations

CSC is the column analog of CSR. Column $j$ spans data[indptr[j]:indptr[j+1]]. Efficient for column operations, and this matters: sparse direct linear solvers (CHOLMOD, SuperLU) work column by column during factorization. Transposing a CSR matrix gives CSC - which is why the transpose of a transpose gets cached by most libraries.

Choosing between CSR and CSC is purely about access pattern. Multiplying $Ax$, row dot products - use CSR. Solving $A^T y = b$, column slicing - use CSC. Getting this wrong can cost an order of magnitude in performance.

BSR (Block Sparse Row) - Built for Dense Local Structure

BSR groups nonzeros into fixed-size dense blocks, typically $r \times c$ where $r$ and $c$ are small (4, 8, 16). Instead of storing individual scalar nonzeros, it stores small dense matrices. This is efficient when nonzeros cluster in block patterns - which they do in finite element methods (where node connectivity produces block structure) and in graph neural networks applied to graphs with community structure.

BSR allows BLAS-3 operations (matrix-matrix multiply) within each block, which is far more cache-friendly than scalar SpMV and maps well to SIMD registers.

Sparse Matrix-Vector Multiply (SpMV)

SpMV - computing $y = Ax$ for sparse $A$ - is the fundamental primitive. Every iterative solver, every GNN message passing layer, every PageRank iteration reduces to repeated SpMV. In CSR:

for i in range(m):
    for j_idx in range(indptr[i], indptr[i+1]):
        y[i] += data[j_idx] * x[indices[j_idx]]

This looks simple. The problem is the line x[indices[j_idx]].

For a graph with random structure, indices[j_idx] is essentially a random number between 0 and $n$. The memory access pattern into x is therefore random. If x doesn’t fit in L3 cache - and for a billion-node graph it absolutely doesn’t - each access is a full DRAM fetch: 60 - 100 nanoseconds. The multiply itself takes less than a nanosecond. The computation is bottlenecked by memory latency, not arithmetic.

This is what engineers mean when they say SpMV is memory-bandwidth-bound. Worse: it isn’t even bandwidth-bound, it’s latency-bound - random access can’t be prefetched because the hardware prefetcher only recognizes sequential and stride patterns. Achieved memory bandwidth for SpMV is often 10 - 20% of peak.

The mitigations are:

Reordering: Reverse Cuthill-McKee reorders rows and columns to reduce the bandwidth of the matrix, making the nonzeros cluster closer to the diagonal, which improves locality in x.
Blocking: Grouping nearby nonzeros into BSR format allows loading a chunk of x contiguously.
Segmented SpMV: On GPU, assigning multiple threads per row and using warp-level reductions, so that bandwidth pressure is spread across warps.

Even with mitigations, SpMV on irregular graphs is one of the hardest computations to accelerate. The hardware isn’t the problem - the problem is that random memory access is fundamentally expensive.

Sparsity Pattern Matters as Much as Sparsity

Two matrices with the same number of nonzeros can have wildly different performance. A matrix with a banded pattern (nonzeros clustered near the diagonal) has excellent cache behavior - x[indices[j_idx]] always accesses nearby entries. A matrix with a random pattern (social graph, web graph) has terrible cache behavior.

The sparsity pattern also determines algorithm choice:

Symmetric positive definite with banded pattern: direct Cholesky factorization (CHOLMOD) is fast; fill-in is minimal.
General sparse with random pattern: iterative solvers (CG, GMRES) with preconditioning, because direct factorization produces catastrophic fill-in.
Nearly dense blocks: BSR and BLAS-3 operations.
Ultra-sparse graphs (average degree 2 - 3): breadth-first search-style traversal beats SpMV.

Iterative Solvers and Why They’re Necessary

For dense systems, Gaussian elimination costs $O(n^3)$. For sparse systems, even direct factorization (CHOLMOD, SuperLU) can produce dense factors from sparse inputs - a phenomenon called fill-in. Factoring a sparse $10^6 \times 10^6$ matrix might produce a factor with $10^{10}$ nonzeros, which is worse than the original dense problem.

Iterative methods avoid factorization entirely. They only require matrix-vector products.

Conjugate Gradient (CG) solves $Ax = b$ for symmetric positive definite $A$. Each iteration costs one SpMV plus $O(n)$ vector operations. It converges in at most $n$ iterations in exact arithmetic, but for well-conditioned problems it reaches machine precision in $O(\sqrt{\kappa})$ iterations, where $\kappa$ is the condition number.

GMRES (Generalized Minimal Residual) handles non-symmetric and indefinite systems. It builds a Krylov subspace orthogonally (Arnoldi process) and minimizes the residual over it. Storage grows as $O(k \cdot n)$ for $k$ iterations, which is why restart variants (GMRES(k)) are used in practice.

Both methods converge in a number of iterations proportional to the condition number. For poorly conditioned matrices, the iteration count explodes, which is why preconditioning is usually non-optional.

Preconditioning: Reshaping the Problem

A preconditioner $M \approx A$ transforms $Ax = b$ into $M^{-1}Ax = M^{-1}b$. If $M^{-1}A$ has a smaller condition number, the iterative solver converges much faster. The trade-off: computing $M^{-1}v$ (applying the preconditioner) must be cheap.

Diagonal (Jacobi): $M = \text{diag}(A)$. Application is $O(n)$. Modest benefit for diagonal-dominant systems; ineffective for most real problems.
Incomplete LU (ILU): Compute an approximate LU factorization that drops small fill-in entries. Expensive to compute ($O(\text{nnz})$ to $O(n^{1.5})$ depending on the drop tolerance) but can reduce iteration counts by 10 - 100x for elliptic PDE systems.
Algebraic Multigrid (AMG): Builds a hierarchy of coarser problems, solving the coarse problem exactly and interpolating back. Near-optimal for elliptic PDEs ($O(n \log n)$ total work), but expensive to set up and complex to implement. BOOMERAMG (from hypre) is the production implementation.

Choosing a preconditioner is more art than science. ILU works for fluid dynamics PDEs. AMG works for elasticity. Diagonal works only as a sanity check.

Sparse Matrices in Real Systems

Graph Neural Networks

GNNs operate on graphs. The message passing step - aggregating neighbor features - is exactly SpMV:

$$H' = \hat{A} H$$

where $\hat{A}$ is the normalized sparse adjacency matrix ($n \times n$, sparse) and $H$ is the node feature matrix ($n \times d$, dense). This is a sparse-dense matrix-matrix multiply (SpMM), not just SpMV. PyTorch Geometric and DGL both implement this operation with custom CUDA kernels because cuBLAS doesn’t handle sparse inputs.

The challenge: GNN graphs are often power-law distributed (a few nodes have millions of neighbors). The resulting load imbalance between rows - some rows have 3 entries, some have 3 million - kills GPU utilization. Work-partitioning schemes that dynamically assign work to warps based on row length are an active research area.

Recommendation Systems

User-item interaction matrices are among the most extreme sparse matrices in production. Netflix has $\sim 200$ million users and $\sim 15,000$ titles. The interaction matrix is $200M \times 15K$. Most users have rated fewer than 100 titles. Density: 0.003%. A dense matrix would need 24 terabytes. The sparse matrix needs a few gigabytes.

Matrix factorization for collaborative filtering (ALS, implicit feedback) repeatedly multiplies this sparse matrix by dense factor matrices. The sparsity pattern here is more benign than graph sparsity - the user dimension has reasonably balanced row lengths - which is why GPU-accelerated ALS can reach good hardware utilization.

PageRank

PageRank solves the linear system $(I - \alpha A D^{-1}) x = (1-\alpha) \mathbf{1}$ iteratively, where $A$ is the web graph adjacency matrix (billions of nonzeros) and $D$ is the degree diagonal matrix. Each iteration is one sparse matrix-vector product. The original Google implementation in 1998 ran this on a single machine; modern implementations run across thousands of machines, partitioning the matrix by rows.

Finite Element Methods

FEM discretizes PDEs over a 3D mesh. Each element contributes a local stiffness matrix to the global stiffness matrix. The resulting global matrix is symmetric, positive definite, and has a sparsity pattern determined entirely by the mesh topology - each node is connected only to its geometric neighbors. Bandwidth (the maximum distance from the diagonal to any nonzero) depends on how nodes are numbered.

This is the application that drove the development of sparse direct solvers (CHOLMOD, MUMPS) and algebraic multigrid. The matrix structure is benign (symmetric, PD, banded) and the sparsity pattern is predictable, which allows more aggressive optimizations than graph SpMV.

scipy.sparse in Practice

import scipy.sparse as sp
import numpy as np

# Build from COO (natural for edge lists)
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 0])
data = np.array([1., 2., 3., 4.])
A = sp.csr_matrix((data, (rows, cols)), shape=(3, 3))

# SpMV
x = np.array([1., 2., 3.])
y = A @ x  # uses CSR format internally

# Iterative solver with ILU preconditioner
from scipy.sparse.linalg import gmres, spilu, LinearOperator

b = np.array([1., 1., 1.])
ilu = spilu(A.tocsc())  # ILU requires CSC
M = LinearOperator(A.shape, ilu.solve)
x_sol, info = gmres(A, b, M=M)

For GPU sparse operations, NVIDIA’s cuSPARSE library provides GPU-accelerated SpMV, accessible from Python via cupy.sparse. PyTorch also exposes torch.sparse_csr_tensor and torch.sparse.mm for SpMM on GPU, though the performance is highly shape-dependent.

When to Use Dense Instead

Here is an uncomfortable truth: for many ML workloads on GPU, sparse is slower than dense.

The break-even density depends on the operation and the hardware. For SpMV on GPU, dense matrix-vector multiply (cuBLAS) tends to win at densities above 1 - 5% because:

Dense operations saturate memory bandwidth efficiently - sequential access is prefetchable.
GPU SpMV has load imbalance - rows with different lengths cause warp divergence and idle threads.
cuBLAS is extremely well optimized - it uses hand-tuned CUDA kernels, tensor cores, and autotuning.

The rule of thumb: if your matrix is more than ~1% dense and you’re doing repeated SpMV on GPU, benchmark dense vs sparse before committing to a sparse format. You may find the 99% zeros are faster to multiply through than the overhead of the sparse format.

For CPU, the crossover is lower (roughly 0.1 - 0.5% density) because CPU memory bandwidth is more favorable for irregular access patterns, and the overhead of sparse indexing is lower.

The Critique: Sparse is Hard to Parallelize

The fundamental problem with sparse linear algebra is that irregular memory access conflicts with the hardware assumptions of both CPUs (prefetching, cache lines) and GPUs (coalesced memory access, warp uniformity). Every optimization in dense linear algebra - blocking, register tiling, cache reuse - becomes harder to apply when the memory access pattern is data-dependent.

This has produced a fragmented ecosystem. scipy.sparse works well for CPU. cuSPARSE works for GPU but is notoriously hard to use correctly and has historically had poor performance for graph workloads. PyTorch Geometric has its own custom CUDA extensions. DGL has yet another implementation. There is no single library that handles sparse-dense mixed computation efficiently across all sparsity patterns and hardware targets.

The situation is improving: NVIDIA’s A100 introduced native support for 2:4 structured sparsity in tensor cores - where exactly 2 of every 4 weights are nonzero, compressed into a special format. This 50% sparsity doubles throughput with near-zero overhead because the sparsity pattern is regular enough for hardware to handle. Unstructured sparsity remains slow.

Future: Sparse Attention and Hardware Support

The Transformer’s attention mechanism computes $n^2$ attention scores for a sequence of length $n$. For long sequences, this is prohibitive. Sparse attention variants restrict which tokens can attend to which others, producing a sparse attention matrix.

Longformer uses a sliding window (each token attends to its $k$ nearest neighbors) plus global attention for special tokens. This gives $O(n)$ attention complexity. The attention mask is a sparse matrix with a banded pattern - regular enough to implement efficiently.

BigBird adds random attention on top of the window, providing a theoretical guarantee (any sparse matrix can be approximated by a combination of window + random + global attention). The random component is irregular and requires a custom CUDA kernel.

Both approaches expose the core tension: sparse attention is asymptotically better but requires irregular computation that is harder to implement and often slower in practice than the dense $O(n^2)$ baseline on modern hardware, up to sequence lengths of several thousand tokens.

On the hardware side, beyond NVIDIA’s 2:4 sparsity, there is research into hardware support for unstructured sparsity: Cerebras’s CS-2 architecture, for example, is designed around sparse communication patterns. The longer-term question is whether future accelerators will close the gap between regular and irregular memory access, or whether sparse computation will remain a software-level challenge.

Summary

Format	Build	Row Access	SpMV	Best For
COO	$O(1)$ append	$O(\text{nnz})$ scan	Slow	Construction, input/output
CSR	Needs sort	$O(1)$	Fast	Row operations, SpMV, GNNs
CSC	Needs sort	$O(1)$ per column	Slower	Column ops, direct solvers
BSR	From CSR	$O(1)$	Fastest (BLAS-3)	Block-structured problems

Method	Works For	Cost Per Iter	Convergence
CG	SPD only	1 SpMV + $O(n)$	$O(\sqrt{\kappa})$ iters
GMRES	General	1 SpMV + Arnoldi	$O(\kappa)$ iters worst case
Direct (Cholesky)	SPD	$O(n^{1.5})$ setup	Exact in one solve

Read Next:

Inside JAX & XLA - Composable Transformations That Compile to Accelerators