Helpful context:


Suppose you want to model the joint distribution over 30 binary variables - perhaps the presence or absence of 30 symptoms in a patient. Fully specifying this distribution requires $2^{30} - 1 \approx 10^9$ numbers. That’s a billion parameters. You don’t have a billion training examples. You don’t have the memory to store a billion numbers. And even if you did, you haven’t learned anything - you’ve just memorized every possible combination of symptoms. This is the fundamental problem: joint distributions over many variables are exponentially large.

But real distributions are not arbitrary. Symptom A may be independent of symptom B once you know whether the patient has disease C. A patient’s test result depends on the disease, not directly on the patient’s age. These conditional independence relationships mean the distribution has structure - it factorizes. Graphical models are a language for encoding that structure. By representing a distribution as a graph, where nodes are variables and edges indicate direct dependencies, you can specify an exponentially large distribution with a polynomial number of parameters, reason about independence without doing any computation, and design efficient algorithms for inference and learning.

There are two major families. Bayesian networks are directed graphs, natural for modeling causal processes where some variables influence others. Markov Random Fields are undirected graphs, natural for modeling symmetric relationships where no direction is implied. Each has its own factorization, its own inference algorithms, and its own class of problems it handles best. Understanding both - and their relationship - is essential for probabilistic machine learning.


Bayesian Networks

A Bayesian network (also called a directed graphical model or belief network) is a directed acyclic graph (DAG) where each node $X_i$ represents a random variable and each directed edge $X_j \to X_i$ indicates that $X_j$ is a direct cause or parent of $X_i$.

The fundamental factorization: the joint distribution over all variables equals the product of local conditional distributions, one per node:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n P(X_i \mid \text{parents}(X_i))$$

For a node with no parents, the term is just its marginal $P(X_i)$.

A medical example. Consider four variables: Disease $D$, Symptom $S$, TestResult $T$, and Age $A$. Suppose Age influences the prior on Disease: older patients are more likely to have it. Disease causes Symptom and influences TestResult. Nothing else is directly connected.

The graph is: $A \to D$, $D \to S$, $D \to T$. The joint distribution factorizes as:

$$P(A, D, S, T) = P(A) \cdot P(D | A) \cdot P(S | D) \cdot P(T | D)$$

Instead of $2^4 - 1 = 15$ numbers for 4 binary variables, we need: 1 number for $P(A)$, 2 for $P(D | A)$, 2 for $P(S | D)$, 2 for $P(T | D)$ - only 7 parameters. And the structure explicitly encodes that $S$ and $T$ are conditionally independent given $D$: once you know the disease, the symptom carries no information about the test result.

Why acyclic? Cycles would create circular dependencies: $X$ depends on $Y$ which depends on $X$. This makes the factorization undefined (there would be no valid ordering for the product). DAGs have a topological ordering, which ensures the factorization is well-defined.


D-Separation: Reading Independence from the Graph

The graphical structure tells you which variables are conditionally independent without any computation - just by reading the graph. The tool is d-separation (directional separation).

Two sets of variables $A$ and $B$ are d-separated by a third set $C$ if every path between any node in $A$ and any node in $B$ is “blocked” by $C$. A path is blocked if it contains a node with a specific relationship to $C$. There are three elementary patterns:

Chain: $X \to Z \to Y$. Information flows from $X$ to $Y$ through $Z$. If we observe $Z$, the path is blocked: $X \perp Y | Z$. Without observing $Z$, information flows and $X$ and $Y$ are dependent. Example: a disease ($X$) causes inflammation ($Z$) which causes fever ($Y$). Once you know inflammation status, disease status and fever are independent.

Fork: $X \leftarrow Z \rightarrow Y$. A common cause $Z$ induces dependence between $X$ and $Y$. If you observe $Z$, the path is blocked: $X \perp Y | Z$. Without observing $Z$, $X$ and $Y$ are correlated (through their common cause). Example: weather ($Z$) causes both ice cream sales ($X$) and drowning rates ($Y$). Without conditioning on weather, these appear correlated; conditioning on weather removes the spurious correlation.

Collider: $X \to Z \leftarrow Y$. A shared effect $Z$. This is the counterintuitive case: without observing $Z$, the path is blocked and $X \perp Y$. But if you observe $Z$, the path is active: knowing $Z$ induces dependence between $X$ and $Y$. This is called explaining away or Berkson’s paradox.

The explaining away effect deserves careful thought. Suppose two diseases $X$ (flu) and $Y$ (allergy) both cause a runny nose $Z$. Without any information, knowing someone has flu tells you nothing about whether they have allergies. But if you know they have a runny nose ($Z$ is observed) and you learn they have flu, you update downward on the probability they also have allergies - the flu “explains away” the nose. Conditioning on a collider opens a path that was previously closed, inducing dependence between the parents.

A practical consequence: in causal inference, if you condition on a collider (a variable caused by both treatment and outcome), you can introduce spurious associations. Selection bias is often a collider bias.


Markov Random Fields

A Markov Random Field (MRF), also called an undirected graphical model, uses an undirected graph. Each edge indicates that two variables are directly related, without specifying direction.

The joint distribution of an MRF is proportional to a product of potential functions over cliques (fully connected subgraphs):

$$P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(X_c)$$

where $\mathcal{C}$ is the set of maximal cliques, $\psi_c$ is a non-negative potential function on clique $c$, and $Z = \sum_x \prod_c \psi_c(x_c)$ is the partition function that normalizes the distribution.

The potential functions need not be probabilities themselves - they just need to be non-negative. A potential $\psi(X_i, X_j)$ might encode that neighboring pixels in an image tend to have similar labels, without any causal claim about which pixel “causes” the other.

The Markov property in MRFs. In an undirected graph, the Markov blanket of a node $X_i$ is its immediate neighbors. Given its neighbors, $X_i$ is conditionally independent of all other variables:

$$P(X_i | X_{\setminus i}) = P(X_i | \text{neighbors}(X_i))$$

This local structure means each variable depends only on its immediate neighbors, making inference tractable when the graph is sparse.

Bayesian network vs MRF. Every Bayesian network can be converted to an MRF (by “moralizing” the graph: adding edges between co-parents, then dropping directions). The reverse is generally not true. Bayesian networks can encode v-structures (colliders) that MRFs cannot represent. MRFs can encode cyclic undirected dependencies that have no clean directed representation. Neither subsumes the other.


Learning vs Inference

Two distinct computational problems arise with graphical models.

Learning is the problem of fitting parameters (and possibly structure) to data. For a Bayesian network with fully observed data, this is straightforward: estimate each $P(X_i | \text{parents}(X_i))$ independently by counting (for discrete variables) or fitting a regression (for continuous). With hidden (latent) variables, learning requires marginalizing over the unobserved variables, which is the E-step in the EM algorithm.

Structure learning - inferring the graph topology from data - is a harder problem. For Bayesian networks, the search space is the set of DAGs over $n$ nodes, which grows superexponentially in $n$. Methods include constraint-based algorithms (PC, FCI, which test conditional independence in data), score-based algorithms (search the DAG space greedily optimizing a score like BIC), and continuous relaxations (NOTEARS, which parameterizes the DAG constraint as a differentiable equality constraint).

Inference is the problem of computing conditional distributions given observed evidence. Given that the patient has a fever and a positive test result, what is the probability they have the disease? This requires marginalizing over the hidden variables.


Exact Inference: Variable Elimination and Belief Propagation

Variable elimination is the fundamental exact inference algorithm. To compute a marginal or conditional, sum out variables one at a time in an elimination ordering:

$$P(X_1) = \sum_{x_2} \sum_{x_3} \cdots \sum_{x_n} P(X_1, X_2, \ldots, X_n)$$

The trick is to push the sums inside the products, eliminating one variable at a time. The cost depends on the elimination ordering: a good ordering keeps intermediate factors small.

On a tree-structured graph, optimal elimination corresponds to a message-passing algorithm called belief propagation (sum-product algorithm). Messages flow from leaves toward the root (collect phase) and from the root back to leaves (distribute phase). After two passes, every node has the exact marginal distribution. The cost is $O(n k^2)$ where $k$ is the size of the state space (e.g., $k=2$ for binary variables). This is the algorithm that makes HMMs (chain-structured Bayesian networks) tractable.

On general graphs with cycles, exact inference is #P-hard (as hard as counting satisfying assignments to a Boolean formula). The treewidth of the graph determines the cost of exact inference: a graph with treewidth $w$ admits inference in $O(n k^{w+1})$. For small treewidth (sparse, tree-like graphs), this is tractable. For dense graphs, the treewidth can be $O(n)$ and exact inference is infeasible.


Approximate Inference

When exact inference is intractable, three families of approximations are used.

Loopy belief propagation runs the sum-product algorithm on graphs with cycles, treating them as if they were trees. There is no guarantee of correctness or convergence, but in practice it often converges to good approximations, especially on sparse graphs. It is the algorithm behind modern error-correcting codes (Turbo codes, LDPC codes) and is used in computer vision for image segmentation.

Variational inference reframes inference as an optimization problem. Rather than computing $P(\text{hidden} | \text{observed})$ exactly, find the distribution $Q$ in some tractable family (e.g., fully factorized: $Q(Z) = \prod_i Q_i(Z_i)$) that minimizes the KL divergence from the true posterior:

$$Q^* = \arg\min_{Q \in \mathcal{Q}} \text{KL}(Q | P(\cdot | \text{observed}))$$

The fully factorized approximation is called mean field. The optimization is typically done by coordinate ascent, iteratively updating each $Q_i$ while holding others fixed. Variational inference scales to large datasets (via stochastic gradients) and is the backbone of variational autoencoders.

MCMC (Markov Chain Monte Carlo) draws samples from the posterior by constructing a Markov chain whose stationary distribution is the target posterior. The Gibbs sampler iteratively samples each variable from its conditional distribution given the current values of all other variables - this is exact and easy to implement for graphical models where the conditionals are tractable. Metropolis-Hastings accepts or rejects proposed samples to ensure convergence to the correct distribution. MCMC is asymptotically exact but slow to converge on multimodal distributions or highly correlated variables.


Hidden Markov Models as a Special Case

A Hidden Markov Model (HMM) is a Bayesian network with a chain structure. There is a sequence of hidden states $Z_1, Z_2, \ldots, Z_T$ that follow a Markov chain ($P(Z_t | Z_{t-1})$ is the transition distribution). Each hidden state emits an observation $X_t$ according to an emission distribution $P(X_t | Z_t)$.

The joint distribution is:

$$P(X_{1:T}, Z_{1:T}) = P(Z_1) \prod_{t=2}^T P(Z_t | Z_{t-1}) \prod_{t=1}^T P(X_t | Z_t)$$

This is exactly the Bayesian network factorization on a chain-structured DAG. The directed edges run $Z_1 \to Z_2 \to \cdots \to Z_T$ (chain) and $Z_t \to X_t$ (emissions).

The forward-backward algorithm is belief propagation on this chain. The forward pass computes $\alpha_t(z) = P(X_1, \ldots, X_t, Z_t = z)$ iteratively. The backward pass computes $\beta_t(z) = P(X_{t+1}, \ldots, X_T | Z_t = z)$. Their product gives the smoothed posterior: $P(Z_t = z | X_{1:T}) \propto \alpha_t(z) \beta_t(z)$. The Viterbi algorithm replaces the sum in the forward pass with a max to find the most probable state sequence.

HMMs have been applied to speech recognition (phoneme sequences), bioinformatics (gene finding), and natural language processing (part-of-speech tagging). They are now largely superseded by recurrent neural networks and Transformers for sequence modeling, but remain important as a clean example of exact inference in a latent variable model.


Applications

Graphical models appear across many domains:

Medical diagnosis (Bayesian networks). Model diseases as hidden nodes, symptoms and test results as observed nodes. Given symptoms, compute the posterior over diseases. The CPCS network (1989) modeled 448 diseases and 578 findings; belief propagation enabled real-time diagnosis even on the hardware of the era.

Image segmentation (MRFs). Model each pixel label as a node; connect neighboring pixels with edges encoding the prior that adjacent pixels tend to have the same label. The likelihood term comes from color or texture features; the MRF prior encourages spatially smooth labelings. Exact inference is intractable (loopy graph), so loopy BP or graph cuts are used.

Speech recognition (HMMs). Model a speech signal as observations emitted by a sequence of phoneme states. The Viterbi algorithm finds the most probable phoneme sequence given the acoustic signal. HMM-based speech recognition dominated for three decades.

Protein structure prediction. Model the amino acid sequence as a Markov chain; model the 3D structure as a set of pairwise energy potentials between residues. Inference over the structure given the sequence is a graphical model inference problem (though modern neural approaches like AlphaFold have largely displaced this framing).


Model Graph type Key independence Inference algorithm Canonical application
Bayesian network Directed (DAG) d-separation Variable elimination, BP Medical diagnosis, causal models
Markov Random Field Undirected Markov blanket Loopy BP, MCMC Image segmentation, physics
Hidden Markov Model Directed chain Markov: $Z_t \perp Z_{< t-1} | Z_{t-1}$ Forward-backward, Viterbi Speech, sequences
CRF Undirected, conditioned on $X$ Same as MRF BP, MCMC Sequence labeling

Read next: