Graph Neural Networks - Learning When the Data Has Structure // Megha Bose

Helpful context:

Most machine learning begins with a table: each row is an example, each column is a feature, and the job of the model is to find patterns in that tabular structure. Even image classifiers treat the input as a flat grid of pixel values - a vector. But a huge class of problems have a fundamentally different shape. A social network is a graph: people are nodes, friendships are edges, and predicting whether two users will connect depends on the topology of the whole network, not just on individual profile attributes. A molecule is a graph: atoms are nodes, bonds are edges, and whether a compound will bind to a target protein depends on how those atoms are arranged relative to each other. Citation networks, road networks, protein interaction networks, knowledge graphs - all of these have rich relational structure that a flat vector representation cannot capture without an enormous information loss. The challenge for deep learning is to process graphs directly, without forcing them into an inappropriate fixed-size representation.

The difficulty is not just that graphs are variable-sized. Graphs also have no canonical ordering of their nodes. Give the same molecule to two different software libraries and they may number the atoms in completely different orders. Any useful function on graphs must be permutation invariant: the output should not change if you relabel the nodes. A standard fully connected network applied to a feature matrix of node attributes would produce completely different outputs for the same graph with different node orderings. This invariance requirement, combined with the need to propagate information across edges, is what distinguishes graph learning from all other structured-prediction tasks.

Graph Neural Networks address both requirements through a single design principle: message passing. Rather than assigning a fixed coordinate to each node, a GNN defines computation in terms of a node’s local neighborhood. Each node starts with its own feature vector, and through repeated rounds of aggregation, every node’s representation accumulates evidence from progressively larger neighborhoods. Because the aggregation is defined in terms of the graph topology and not in terms of node indices, the resulting representations are automatically permutation equivariant. After enough layers, a node’s representation encodes information about its entire connected component.

The Message Passing Framework

The general message passing update for layer $l$ at node $i$ is:

$$h_i^{(l+1)} = \text{UPDATE}\left(h_i^{(l)},; \text{AGGREGATE}\left(\left\{h_j^{(l)} : j \in \mathcal{N}(i)\right\}\right)\right)$$

where $\mathcal{N}(i)$ is the set of neighbors of node $i$, AGGREGATE is some permutation-invariant function (sum, mean, max), and UPDATE combines the node’s own representation with the aggregated neighborhood message.

This is the template. Different GNN architectures are different instantiations of AGGREGATE and UPDATE. The key property is that after $k$ layers, each node’s representation $h_i^{(k)}$ encodes information from every node within $k$ hops - the $k$-hop neighborhood. With enough layers and sufficient representational capacity, this can in principle capture any structural information reachable from node $i$.

Why does permutation invariance follow? Because AGGREGATE is applied to an unordered set, not an ordered sequence. Sum, mean, and max are all invariant to the order in which their inputs are presented. The UPDATE function operates on the fixed-dimensional outputs of AGGREGATE, so it also does not depend on node ordering. The entire computation graph is defined by adjacency, not by index.

GCN: Graph Convolutional Networks

Kipf and Welling (2017) proposed a particularly clean instantiation. Add self-loops to the adjacency matrix $A$: let $\tilde{A} = A + I$. Compute the degree matrix $\tilde{D}$ of $\tilde{A}$ (so $\tilde{D}_{ii} = 1 + \deg(i)$). Define the normalized adjacency:

$$\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}.$$

The GCN layer update is then:

$$H^{(l+1)} = \sigma\left(\hat{A} H^{(l)} W^{(l)}\right)$$

where $H^{(l)} \in \mathbb{R}^{n \times d_l}$ is the matrix of all node representations at layer $l$, $W^{(l)} \in \mathbb{R}^{d_l \times d_{l+1}}$ is a learnable weight matrix, and $\sigma$ is a pointwise nonlinearity (typically ReLU).

Why the symmetric normalization? Without it, summing over all neighbors gives a value proportional to the neighbor’s degree - a high-degree hub would dominate aggregation, and activations would scale differently for high-degree and low-degree nodes, destabilizing training. The factor $\tilde{D}^{-1/2}$ on each side normalizes both the sending and receiving node by the square root of their degree. This is analogous to the normalization in the spectral view of graph convolution and keeps the operator’s spectral radius bounded by 1.

Written out for a single node $i$:

$$h_i^{(l+1)} = \sigma\left(W^{(l)\top}\sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{h_j^{(l)}}{\sqrt{\tilde{d}_i \tilde{d}_j}}\right)$$

where $\tilde{d}_i = 1 + \deg(i)$. Each neighbor’s contribution is downweighted by the geometric mean of the two nodes' (augmented) degrees.

The spectral motivation: $\hat{A}$ is a normalized Laplacian-adjacent operator whose eigenvalues lie in $[-1, 1]$. Applying $\hat{A}$ repeatedly performs a smoothing operation on node features, analogous to low-pass filtering on a graph signal. GCN can be understood as applying a fixed graph filter and then learning a linear map - a highly regularized spectral graph convolution.

GraphSAGE: Inductive Learning

GCN is transductive: it operates on a fixed graph and cannot generalize to nodes or graphs not seen during training. If you train on a citation network and a new paper is added, you must retrain. GraphSAGE (Hamilton et al., 2017) makes GNNs inductive by learning an aggregation function that can be applied to any neighborhood.

The key change: rather than using all neighbors, GraphSAGE samples a fixed-size subset $\mathcal{S}_i^{(l)} \subset \mathcal{N}(i)$ at each layer. The aggregation is:

$$h_{\mathcal{N}(i)}^{(l)} = \text{AGG}\left(\left\{h_j^{(l)} : j \in \mathcal{S}_i^{(l)}\right\}\right)$$

$$h_i^{(l+1)} = \sigma\left(W^{(l)} \cdot \left[h_i^{(l)} ;|; h_{\mathcal{N}(i)}^{(l)}\right]\right)$$

where $|$ denotes concatenation. Three choices for AGG are standard:

Mean: $\text{AGG} = \frac{1}{|\mathcal{S}|}\sum_{j \in \mathcal{S}} h_j^{(l)}$ - simple and fast.
Max-pooling: $\text{AGG} = \max_{j \in \mathcal{S}} \sigma(W_{\text{pool}} h_j^{(l)})$ - applies an element-wise max after a linear transform.
LSTM: pass the (randomly ordered) neighbors through an LSTM - more expressive but loses permutation invariance if order matters.

The concatenation of the node’s own representation with the aggregated neighborhood - rather than GCN’s simple averaging including self - preserves the node’s own information more explicitly. The fixed-size neighborhood sampling makes each training step $O(k^L)$ in the hop depth $L$, independent of the full graph size. At inference time, the learned AGGREGATE and UPDATE functions apply to any new node with any neighborhood, enabling generalization to unseen nodes and graphs.

GAT: Graph Attention Networks

In GCN, all neighbors contribute equally (up to degree normalization). But not all neighbors are equally relevant. A paper in a citation graph is influenced more by its intellectual predecessors than by coincidental citations. An atom in a molecule contributes differently to a property depending on its chemical context. Graph Attention Networks (Velickovic et al., 2018) replace the fixed normalization with learned attention weights.

For each edge $(i, j)$, compute an attention score:

$$e_{ij} = \text{LeakyReLU}\left(a^\top \left[W h_i ;|; W h_j\right]\right)$$

where $W \in \mathbb{R}^{d' \times d}$ is a shared linear transform, $a \in \mathbb{R}^{2d'}$ is a learnable attention vector, and $|$ denotes concatenation. Normalize across the neighbors of $i$:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}.$$

The update is:

$$h_i^{(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j^{(l)}\right).$$

In practice, multi-head attention is used: run $K$ independent attention heads and concatenate (or average) their outputs:

$$h_i^{(l+1)} = \Big|{k=1}^{K} \sigma\left(\sum{j \in \mathcal{N}(i)} \alpha_{ij}^{(k)} W^{(k)} h_j^{(l)}\right).$$

The attention mechanism allows the model to focus on the most relevant neighbors for each node, which is particularly valuable in heterogeneous graphs where different edges have different semantic significance. The weights $\alpha_{ij}$ are also interpretable - you can inspect which neighbors a node attended to for a given prediction.

Expressive Power: The Weisfeiler-Leman Test

How powerful are GNNs? Can they distinguish any two non-isomorphic graphs? The answer is no - and the precise characterization involves the Weisfeiler-Leman (WL) graph isomorphism test.

The 1-WL test works by iteratively assigning colors to nodes based on their current color and the multiset of neighbors' colors:

$$c_i^{(t+1)} = \text{HASH}\left(c_i^{(t)},; \left\{\left\{ c_j^{(t)} : j \in \mathcal{N}(i) \right\}\right\}\right).$$

If two graphs are isomorphic, 1-WL assigns identical color sequences. If 1-WL produces different color histograms, the graphs are definitely non-isomorphic. But 1-WL can fail: there exist non-isomorphic pairs of graphs that 1-WL cannot distinguish.

The key theorem (Xu et al., 2019): any message-passing GNN is at most as powerful as the 1-WL test. A GNN can distinguish two graphs only if 1-WL can. Equality is achieved when the aggregation is injective on multisets - specifically, when AGG uses summation (not mean or max) and the UPDATE is also injective. This is the GIN (Graph Isomorphism Network) architecture:

$$h_i^{(l+1)} = \text{MLP}^{(l)}\left((1 + \epsilon^{(l)}) h_i^{(l)} + \sum_{j \in \mathcal{N}(i)} h_j^{(l)}\right)$$

where $\epsilon$ is learnable. Mean and max aggregators are strictly weaker than sum for distinguishing graph structures.

1-WL fails on regular graphs: if two nodes have the same degree sequence in their neighborhoods at every hop, 1-WL cannot tell them apart. A classic example: two regular graphs with the same degree but different connectivity structure are indistinguishable by any standard message-passing GNN, no matter how many layers or how large the hidden dimension.

Beyond 1-WL: More Expressive GNNs

The k-WL hierarchy provides more powerful tests. 2-WL colors pairs of nodes; k-WL colors $k$-tuples. $k$-WL is strictly more powerful than $(k-1)$-WL, but the cost is $O(n^k)$ in the number of nodes - exponential and impractical for large graphs.

Several practical approaches improve expressiveness without the full k-WL cost:

Random node features. Assign each node a unique random feature (e.g., a Gaussian vector) at the start of each forward pass. Two nodes that were structurally identical are now distinguishable because their random seeds differ. This breaks symmetry and provably exceeds 1-WL on structures where 1-WL fails, at the cost of non-determinism (different runs produce different embeddings for structurally equivalent nodes).

Distance encoding. Augment each node’s input features with structural information: the distance from every node to a set of anchor nodes, or the shortest path distances between pairs. This injects global structural information that message passing cannot compute locally.

Subgraph GNNs. For each node $v$, run a GNN on the subgraph obtained by removing $v$ from the graph. The node’s representation encodes how the graph changes when it is removed - this captures structural information invisible to 1-WL. This is more expensive ($O(n)$ GNN runs per graph) but more expressive.

Equivariant GNNs. For molecular graphs where 3D coordinates are available, architectures like SchNet and DimeNet use geometric invariances (distance, angle, torsion) as structural features.

Task Levels: Node, Edge, and Graph

GNNs are applied at three granularities.

Node-level tasks. After $L$ layers of message passing, each node $i$ has a representation $h_i^{(L)} \in \mathbb{R}^d$. Apply a classifier or regressor to each $h_i^{(L)}$ independently. Example: predict whether each node in a citation graph belongs to a research topic (transductive node classification). Because the GNN aggregates neighborhood information, the prediction for node $i$ is informed by its local context, not just its own features.

Edge-level tasks. Predict whether an edge exists (link prediction) or its properties. A common approach: for each pair of nodes, form a pair representation $[h_i^{(L)} | h_j^{(L)}]$ or compute their dot product $h_i^{(L)} \cdot h_j^{(L)}$, then apply a classifier. In knowledge graphs, this is entity-relation-entity triple completion.

Graph-level tasks. To predict a property of the whole graph (e.g., predict the solubility of a molecule), the per-node representations must be aggregated into a single graph representation:

$$h_G = \text{READOUT}\left(\left\{h_i^{(L)} : i \in V\right\}\right).$$

Simple choices: mean pooling $h_G = \frac{1}{n}\sum_i h_i^{(L)}$ or max pooling. These are permutation invariant by construction. Differentiable pooling (DiffPool, Ying et al., 2018) learns a soft clustering of nodes into a coarser graph at each pooling step, progressively reducing the number of nodes while preserving hierarchical structure. Sort pooling (SortPool) concatenates the top-$k$ node representations sorted by their last feature dimension.

Applications

AlphaFold2. DeepMind’s protein structure prediction system uses a graph-structured representation of protein residues (the “Evoformer”). Residues within a threshold distance are connected by edges, and attention-augmented message passing propagates structural information. The result is a learned representation that captures both sequence and spatial context, enabling prediction of 3D structure from amino acid sequence alone.

Drug discovery. Molecules are naturally represented as graphs: atoms are nodes with features (element, charge, hybridization), bonds are edges with features (type, stereo). GNNs predict molecular properties (solubility, toxicity, binding affinity) directly from graph structure, replacing hand-engineered fingerprints like ECFP. Trained on databases of experimentally measured properties, GNNs enable virtual screening of billions of candidate molecules.

Traffic prediction. Road networks are graphs. DCRNN and related models represent intersections as nodes and road segments as edges, using GNNs with temporal sequence models to predict traffic speed and flow 30-60 minutes ahead. The graph structure allows the model to learn that congestion at one node propagates along connected roads.

Recommendation systems. PinSage (Pinterest, Hamilton et al., 2018) represents users and items as nodes in a bipartite graph with edges for user-item interactions. GraphSAGE on this graph produces item embeddings that capture collaborative filtering information, outperforming matrix factorization approaches.

Scalability

Naive GCN performs full-batch gradient descent: the entire graph must fit in memory, and computing $\hat{A} H W$ requires touching all $n$ nodes and $m$ edges. For billion-node graphs (social networks, web graphs), this is infeasible.

Mini-batch with neighborhood sampling (GraphSAGE). Sample a mini-batch of target nodes. For each target node, sample $k_1$ 1-hop neighbors; for each of those, sample $k_2$ 2-hop neighbors; etc. This gives a computation graph that is $O(k_1 k_2 \cdots k_L)$ per target node - small and independent of graph size. The downside: each node’s representation is computed with a different sampled neighborhood at each mini-batch step, introducing variance in the gradient estimates.

Cluster-GCN (Chiang et al., 2019). Partition the graph into subgraphs using a graph clustering algorithm (METIS). Train on one subgraph at a time. Since nodes within a subgraph are closely connected, mini-batches contain nodes whose $L$-hop neighborhoods are mostly within the subgraph - reducing the “neighborhood explosion” problem. This also improves memory locality.

Simplified GCN (SGC, Wu et al., 2019). Remove all nonlinearities from GCN:

$$H^{(L)} = \hat{A}^L X W.$$

The $L$ applications of $\hat{A}$ can be precomputed once (since $\hat{A}$ is fixed), reducing training to simple linear regression on the precomputed features $\hat{A}^L X$. For many node classification benchmarks, SGC performs comparably to GCN at a fraction of the computational cost. This suggests that much of GCN’s power comes from the graph propagation operator $\hat{A}$, not from the interleaved nonlinearities.

Summary

Architecture	Aggregation	Key property	Limitation
GCN (Kipf & Welling)	Normalized sum (fixed)	Simple, spectral motivation	Transductive, symmetric normalization may not suit all tasks
GraphSAGE	Sampled mean/max/LSTM	Inductive, scales to large graphs	Sampling introduces variance
GAT	Learned attention weights	Flexible per-neighbor weighting	More parameters, slower than GCN
GIN	Injective sum + MLP	As powerful as 1-WL	Still limited to 1-WL expressiveness
DiffPool	Differentiable node clustering	Hierarchical graph-level representations	Quadratic in cluster assignments

Concept	Definition
Message passing	Each node aggregates representations from its neighbors and updates its own
$k$-hop neighborhood	All nodes reachable in at most $k$ steps; captured after $k$ GNN layers
GCN normalization	$\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$; prevents degree-dependent activation scaling
Permutation invariance	Output unchanged under relabeling of nodes
1-WL test	Iterative color refinement; standard message-passing GNNs are at most as powerful
Expressive power gap	1-WL cannot distinguish all non-isomorphic graphs (e.g., some regular graphs)
Graph-level readout	READOUT$(\{h_i^{(L)}\})$ - permutation-invariant pooling to obtain a graph representation
Inductive vs. transductive	Inductive: generalizes to unseen nodes/graphs; transductive: operates on fixed graph
Neighborhood sampling	Sample fixed-size subsets of neighbors for scalable mini-batch training

Read next: