
Most real-world matrices are nearly empty. A social network’s adjacency matrix for 100 million users would have $10^{16}$ entries in dense form - impossible to store, let alone multiply. A finite element mesh over a 3D object produces a stiffness matrix where each row has maybe 20 nonzeros out of millions of columns. Sparse linear algebra is the toolkit for working with these structures efficiently, exploiting their emptiness rather than fighting it.

Why Dense Storage Fails

For an $m \times n$ matrix with $\text{nnz}$ nonzero entries, dense storage requires $O(mn)$ memory regardless of $\text{nnz}$. A $10^6 \times 10^6$ matrix uses $8 \times 10^{12}$ bytes - 8 TB in float64. If it has $10^7$ nonzeros (0.001% density), dense storage wastes a factor of $10^5$ in memory, and a dense matrix-vector product performs roughly $10^5$ times more arithmetic than necessary, almost all of it multiplications by zero.

Sparse formats store only the nonzeros and their locations. Memory drops to $O(\text{nnz})$.
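
A quick back-of-the-envelope comparison for the numbers above (a sketch; the sparse estimate assumes one float64 value plus two 32-bit indices per nonzero):

# 10^6 x 10^6 matrix with 10^7 nonzeros
n, nnz = 10**6, 10**7
dense_bytes  = n * n * 8                 # every entry stored as float64
sparse_bytes = nnz * (8 + 2 * 4)         # value + (row, col) index pair per nonzero
print(f"dense:  {dense_bytes / 1e12:.0f} TB")   # 8 TB
print(f"sparse: {sparse_bytes / 1e6:.0f} MB")   # 160 MB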

Sparse Storage Formats

COO (Coordinate Format)

The simplest format: three arrays - row, col, data - each of length $\text{nnz}$. Entry $(i, j)$ with value $v$ contributes one entry to each array.

Matrix:          COO:
1 0 2            row  = [0, 0, 1, 2]
0 3 0    -->     col  = [0, 2, 1, 0]
4 0 0            data = [1, 2, 3, 4]

COO is easy to construct (just append triples) but inefficient for arithmetic - accessing row $i$ requires scanning the entire row array, since the triples are in no particular order.
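
To make the encoding concrete, here is a plain-Python sketch that rebuilds the dense 3x3 example from its COO triples:

import numpy as np

row  = [0, 0, 1, 2]
col  = [0, 2, 1, 0]
data = [1, 2, 3, 4]

A_dense = np.zeros((3, 3))
for r, c, v in zip(row, col, data):
    A_dense[r, c] += v   # duplicate (r, c) triples accumulate, matching scipy's convention
print(A_dense)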

CSR (Compressed Sparse Row)

CSR compresses the row array into an array of row pointers. indptr[i] is the index in data and indices where row $i$ begins. Row $i$ spans data[indptr[i]:indptr[i+1]].

CSR:
indptr  = [0, 2, 3, 4]
indices = [0, 2, 1, 0]
data    = [1, 2, 3, 4]

Memory: $O(\text{nnz} + m)$ (vs $O(mn)$ dense). Row $i$ can be accessed in $O(1)$ via indptr. Row-wise operations - matrix-vector products, row slicing - are efficient. CSR is the standard format for sparse linear solvers.
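
Accessing a row is then just a pair of slices; a minimal sketch using the arrays above:

import numpy as np

indptr  = np.array([0, 2, 3, 4])
indices = np.array([0, 2, 1, 0])
data    = np.array([1., 2., 3., 4.])

i = 0
start, end = indptr[i], indptr[i + 1]
print(indices[start:end], data[start:end])   # row 0: columns [0 2], values [1. 2.]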

CSC (Compressed Sparse Column)

CSC is the column analog of CSR. Column $j$ spans data[indptr[j]:indptr[j+1]]. Efficient for column operations, direct linear solvers (which work column by column), and the backward pass through CSR operations. Transposing a CSR matrix gives CSC.

Choosing between CSR and CSC depends on the access pattern of your computation. For matrix-vector products $Ax$, CSR is natural (row dot products). For solving $A^T y = b$, CSC is natural.
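
In scipy this duality is visible directly: transposing a CSR matrix yields a CSC matrix over the same nonzeros, as the small sketch below shows.

import numpy as np
import scipy.sparse as sp

A = sp.csr_matrix(np.array([[1., 0., 2.],
                            [0., 3., 0.],
                            [4., 0., 0.]]))
print(A.format)     # 'csr'
print(A.T.format)   # 'csc' - same data, reinterpreted column-wise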

Sparse Matrix-Vector Multiply (SpMV)

SpMV - computing $y = Ax$ for sparse $A$ - is the fundamental operation in iterative solvers, graph algorithms, and GNN message passing. For a CSR matrix:

# y = A @ x with A stored in CSR (indptr, indices, data), m rows
y = np.zeros(m)
for i in range(m):
    for j_idx in range(indptr[i], indptr[i + 1]):
        y[i] += data[j_idx] * x[indices[j_idx]]

The irregular memory access pattern - x[indices[j_idx]] jumps to arbitrary positions in x - is cache-unfriendly. If $x$ doesn’t fit in L3 cache, each access becomes a DRAM fetch. This is why SpMV’s achieved bandwidth is often 10–20% of peak: the hardware prefetcher cannot predict the irregular access pattern.

Optimizations include reordering rows and columns to improve locality (reverse Cuthill-McKee), blocking nearby nonzeros together (BSR format), and using multiple threads to parallelize over rows (OpenMP #pragma omp parallel for).
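
As an illustration of the first point, scipy exposes reverse Cuthill-McKee through scipy.sparse.csgraph; a sketch on a toy random symmetric matrix:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(M):
    """Largest |i - j| over the nonzeros - a rough locality measure."""
    C = M.tocoo()
    return np.abs(C.row - C.col).max()

# Toy symmetric matrix with a scattered sparsity pattern
A = sp.random(200, 200, density=0.02, format='csr', random_state=0)
A = (A + A.T).tocsr()

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm, :][:, perm]                  # permute rows and columns consistently
print(bandwidth(A), "->", bandwidth(A_rcm))  # bandwidth typically shrinks substantially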

Iterative Solvers for Sparse Linear Systems

For dense systems, Gaussian elimination costs $O(n^3)$. For sparse systems, direct factorization (CHOLMOD, SuperLU) is faster but can produce dense factors even from sparse inputs - fill-in. For very large sparse systems, iterative methods are the practical choice.
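
Fill-in is easy to observe: factor a sparse matrix and compare nonzero counts. A sketch using scipy's SuperLU wrapper on a 2D Laplacian (a toy example where fill-in is pronounced):

import scipy.sparse as sp
from scipy.sparse.linalg import splu

# 2D Laplacian (5-point stencil) on a 50 x 50 grid
n = 50
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = sp.kronsum(T, T, format='csc')

lu = splu(A)
print("nnz(A)          =", A.nnz)
print("nnz(L) + nnz(U) =", lu.L.nnz + lu.U.nnz)   # typically far larger: that's fill-in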

Conjugate Gradient (CG) solves $Ax = b$ for symmetric positive definite $A$. Each iteration costs one SpMV. It converges in at most $n$ iterations in exact arithmetic, and in practice in far fewer if the matrix is well-conditioned.
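
The algorithm itself is short; below is a minimal, unpreconditioned NumPy sketch (without the safeguards a production solver would add):

import numpy as np

def conjugate_gradient(A, b, tol=1e-8, maxiter=1000):
    """Solve Ax = b for symmetric positive definite A (dense or sparse)."""
    x = np.zeros_like(b)
    r = b - A @ x                     # residual
    p = r.copy()                      # search direction
    rs = r @ r
    for _ in range(maxiter):
        Ap = A @ p                    # the one SpMV per iteration
        alpha = rs / (p @ Ap)         # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p     # next direction, A-conjugate to the previous ones
        rs = rs_new
    return x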

GMRES (Generalized Minimal Residual) handles non-symmetric and indefinite systems. It builds a Krylov subspace and minimizes the residual over it. Each iteration is more expensive than CG (requires an orthogonalization step), but it handles a broader class of problems.

The iteration counts of both methods depend on the conditioning of the matrix: the classical CG bound grows with the square root of the condition number, and GMRES convergence depends on the distribution of the eigenvalues. Either way, a poorly conditioned matrix requires many iterations, which is why preconditioning matters.

Preconditioning

A preconditioner $M \approx A$ that is cheap to invert is used to solve the equivalent system $M^{-1}Ax = M^{-1}b$; when $M$ captures $A$ well, the preconditioned system has a much smaller condition number and the iteration converges faster.

  • Diagonal (Jacobi) preconditioning: $M = \text{diag}(A)$. Very cheap, modest benefit; a sketch of passing one to scipy's cg follows this list.
  • Incomplete LU (ILU): Compute an approximate LU factorization that drops small fill-in entries. More expensive to compute but dramatically accelerates convergence for many problems.
  • Algebraic multigrid (AMG): Builds a hierarchy of coarser problems. Near-optimal for elliptic PDEs but complex to implement.
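
A minimal sketch of supplying a Jacobi preconditioner to scipy's cg through a LinearOperator; the tridiagonal test matrix is an arbitrary toy example:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

# Toy SPD system with a widely varying diagonal
n = 1000
A = sp.diags([np.full(n - 1, -1.0), np.linspace(2, 100, n), np.full(n - 1, -1.0)],
             [-1, 0, 1], format='csr')
b = np.ones(n)

# Jacobi preconditioner: M^{-1} v = v / diag(A)
d = A.diagonal()
M = LinearOperator((n, n), matvec=lambda v: v / d)

def n_iters(**kw):
    counter = []
    cg(A, b, callback=lambda xk: counter.append(1), **kw)
    return len(counter)

print("CG iterations, no preconditioner:", n_iters())
print("CG iterations, Jacobi:           ", n_iters(M=M))   # typically fewer here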

Sparse Matrices in Graph Algorithms

A graph with $n$ nodes and $m$ edges has a natural sparse adjacency matrix $A$ of size $n \times n$ with $m$ nonzeros ($2m$ for undirected). Graph algorithms become matrix operations:

  • BFS: $k$ steps of BFS from a source is equivalent to the sparsity pattern of $e_s A^k$ (Boolean sparse matrix-vector multiply).
  • PageRank: Solve $(I - \alpha A D^{-1}) x = (1-\alpha) \mathbf{1}$ iteratively.
  • GNN message passing: Aggregating neighbor features is $\hat{A} H$, where $\hat{A}$ is the normalized sparse adjacency and $H$ is the node feature matrix (sketched below).
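
A sketch of one message-passing step in scipy, using the symmetric normalization $\hat{A} = D^{-1/2}(A + I)D^{-1/2}$ common in GCN-style models; the graph and features here are toy placeholders:

import numpy as np
import scipy.sparse as sp

# Toy graph: 4 nodes in a ring, 8-dimensional node features
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
rows, cols = zip(*edges)
A = sp.csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(4, 4))
A = A + A.T                                   # undirected
H = np.random.randn(4, 8)                     # node feature matrix

# GCN-style normalization with self-loops
A_self = A + sp.identity(4, format='csr')
deg = np.asarray(A_self.sum(axis=1)).ravel()
D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
A_hat = D_inv_sqrt @ A_self @ D_inv_sqrt

H_next = A_hat @ H                            # aggregate neighbor features
print(H_next.shape)                           # (4, 8)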

scipy.sparse

Python’s scipy.sparse provides CSR, CSC, COO, and several other formats, along with sparse matrix arithmetic and, through scipy.sparse.linalg, iterative solvers and sparse direct solvers (SuperLU by default, UMFPACK if scikit-umfpack is installed).

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Build sparse matrix from COO data
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 0])
data = np.array([1., 2., 3., 4.])
A = sp.csr_matrix((data, (rows, cols)), shape=(3, 3))

# Sparse matrix-vector multiply
x = np.array([1., 2., 3.])
y = A @ x

# Sparse direct solver
b = np.array([1., 1., 1.])
x_sol = spla.spsolve(A, b)

# Iterative solver (GMRES) with ILU preconditioner
ilu = spla.spilu(A.tocsc())
M = spla.LinearOperator(A.shape, ilu.solve)
x_gmres, info = spla.gmres(A, b, M=M)

For GPU sparse operations, NVIDIA’s cuSPARSE library provides GPU-accelerated SpMV and sparse solvers, accessible from Python through CuPy (the cupyx.scipy.sparse module).

Examples

Sparse Graph Representation

import scipy.sparse as sp
import numpy as np

def build_adjacency(edges, n_nodes):
    """Build CSR adjacency matrix for an undirected graph."""
    src, dst = zip(*edges)
    src, dst = np.array(src), np.array(dst)
    # Undirected: add both directions
    rows = np.concatenate([src, dst])
    cols = np.concatenate([dst, src])
    data = np.ones(len(rows))
    return sp.csr_matrix((data, (rows, cols)), shape=(n_nodes, n_nodes))

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
A = build_adjacency(edges, n_nodes=4)
print(f"Stored {A.nnz} nonzeros vs {4*4} dense entries")

scipy.sparse for a Recommender System Matrix

import scipy.sparse as sp
import numpy as np

# User-item interaction matrix: 1M users, 100K items, ~10M interactions
# Density: 10M / (1e6 * 1e5) = 0.01%
n_users, n_items = 1_000_000, 100_000
user_ids = np.random.randint(0, n_users, 10_000_000)
item_ids = np.random.randint(0, n_items, 10_000_000)
ratings  = np.ones(10_000_000)

R = sp.csr_matrix((ratings, (user_ids, item_ids)),
                  shape=(n_users, n_items))
sparse_mb = (R.data.nbytes + R.indices.nbytes + R.indptr.nbytes) / 1e6
dense_gb = n_users * n_items * 8 / 1e9     # float64 dense equivalent
print(f"Memory: {sparse_mb:.0f} MB sparse vs {dense_gb:.0f} GB dense")

Conjugate Gradient Convergence

from scipy.sparse.linalg import cg
import scipy.sparse as sp
import numpy as np

n = 1000
# Sparse SPD matrix: tridiagonal
diag = 4 * np.ones(n)
off  = -1 * np.ones(n - 1)
A = sp.diags([off, diag, off], [-1, 0, 1], format='csr')

b = np.random.randn(n)
residuals = []

def callback(xk):
    residuals.append(np.linalg.norm(b - A @ xk))

x, info = cg(A, b, callback=callback)
print(f"Converged in {len(residuals)} iterations, info={info}")

Sparse linear algebra is not a niche topic - it underlies search engines (PageRank), recommender systems, physics simulations, and GNNs. The common thread is exploiting the structure of nearly-empty matrices to make infeasible problems tractable.

