Monte Carlo Tree Search - Planning by Sampling, Not Exhaustion // Megha Bose

Helpful context:

Imagine playing a game of Go. There are roughly $10^{170}$ possible board positions - more than the number of atoms in the observable universe - so exhaustive search is out of the question. Even with a depth limit, the branching factor at each move (around 250 legal moves) makes minimax search with alpha-beta pruning impractical at competitive depths. For decades, Go was considered too complex for computers to play at a master level. What changed was not faster hardware. What changed was the realization that you do not need to search the entire tree; you need to search the most promising parts of it, and you can identify promising parts by combining lightweight simulation with statistical tracking of which branches have performed well. That is the core idea of Monte Carlo Tree Search.

MCTS is a framework for sequential decision-making under uncertainty. It builds a partial search tree incrementally, guided by a balance between exploiting moves that have worked well in the past and exploring moves that have been tried less often. Each iteration of the algorithm descends the tree, adds a node, simulates from it, and then propagates the result back up to update statistics. Over thousands of iterations, the tree grows in the directions that matter most, and the action with the most visits at the root is selected. In board games, this produces play that surprised even expert human players - not because it solved the game, but because statistical search in the right regions of the tree is enormously powerful.

The reason this matters for language models is a structural analogy. Generating text is also a sequential decision process: at each step, the model chooses the next token (or, at a coarser level, the next reasoning step). The space of possible continuations is exponentially large. Greedy decoding picks the highest-probability token at each step, which is fast but often misses better solutions reachable through apparently suboptimal intermediate steps. MCTS offers an alternative: at each step, simulate multiple continuations, evaluate their quality with a learned value function, and select the continuation with the strongest signal. The alignment between game-playing and language generation - sequential choice, exponential search space, the need for multi-step evaluation - is why MCTS is increasingly central to the frontier of LLM reasoning.

The Four Phases

Each iteration of MCTS executes four steps. Starting from the root (the current game state), the algorithm:

Selection. Traverse the tree from the root downward, at each node choosing the child that maximizes the Upper Confidence Bound for Trees (UCT) score:

$$\text{UCT}(v) = \underbrace{Q(v)}{\text{exploitation}} + C \cdot \underbrace{\sqrt{\frac{\ln N(\text{parent}(v))}{N(v)}}}{\text{exploration}}$$

where $Q(v)$ is the average reward seen from subtree $v$, $N(v)$ is the number of times $v$ has been visited, $N(\text{parent}(v))$ is the parent’s visit count, and $C > 0$ is an exploration constant. Continue until you reach a node with an unvisited child (a node that has not yet been expanded).

Expansion. Add one (or more) new child nodes to the tree for an unvisited action from the selected node. Initialize $Q = 0$, $N = 0$ for the new node.

Simulation (Rollout). From the new node, simulate to a terminal state using a fast rollout policy (typically random or using a lightweight heuristic). Collect the terminal reward $R$.

Backpropagation. Traverse the path from the new node back to the root, updating statistics at each node:

$$N(v) \leftarrow N(v) + 1, \qquad Q(v) \leftarrow Q(v) + \frac{R - Q(v)}{N(v)}.$$

After many iterations, the visit count $N(v)$ at the root’s children reflects the algorithm’s confidence in each action. The final decision is typically the action with the highest visit count (not the highest $Q$) - visit count is more robust to noisy rollouts.

The UCB1 Formula and Exploration-Exploitation

The UCT formula adapts the Upper Confidence Bound 1 (UCB1) algorithm from the multi-armed bandit literature. UCB1 is a strategy for allocating trials among multiple options to balance learning and reward accumulation.

For a bandit with $K$ arms, UCB1 selects the arm maximizing:

$$\hat{\mu}_i + \sqrt{\frac{2 \ln t}{n_i}}$$

where $\hat{\mu}_i$ is the empirical mean reward of arm $i$, $t$ is the total number of trials so far, and $n_i$ is the number of times arm $i$ has been pulled. The first term exploits the best-known arm; the second term grows for arms pulled infrequently (since $n_i$ is small) and shrinks as an arm is pulled more (since $n_i$ grows faster than $\ln t$). UCB1 achieves logarithmic regret - the suboptimal pulls grow only as $O(\log t)$ over time.

In UCT, the “arms” are the children of the current node. The exploration term $\sqrt{\ln N(\text{parent}) / N(v)}$ encourages visiting under-explored branches: a child visited only once has a very high exploration bonus relative to a child visited many times. As more iterations are spent, the tree balances between deepening well-performing branches (exploitation) and broadening less-explored ones (exploration). The constant $C$ trades off between these: larger $C$ explores more widely; smaller $C$ exploits more aggressively.

AlphaGo and AlphaZero: Neural-Augmented MCTS

The breakthrough that took MCTS from competitive amateur play to superhuman performance was replacing the random rollout with a learned value function.

AlphaGo (Silver et al., 2016) trained two networks on human expert games: a policy network $p_\sigma(a \mid s)$ that predicts which moves human experts play, and a value network $V_\theta(s)$ that estimates the probability of winning from position $s$. MCTS was modified to use both:

Selection: use a modified UCT that mixes the value network and visit statistics, weighted by the policy network’s prior $p_\sigma(a \mid s)$ instead of uniform priors.
Simulation: replace random rollout with an evaluation that blends a fast rollout policy and the value network: $V = \lambda V_\theta(s) + (1-\lambda) z$ where $z$ is the fast rollout outcome.

AlphaZero (Silver et al., 2017) removed the human data entirely. Starting from random play, AlphaZero learns both a policy and a value network from self-play alone:

Play games using MCTS with the current networks.
Train the policy network to match the MCTS visit distribution (not just the single best move).
Train the value network to match the game outcome.
Repeat.

The MCTS visit distribution is a richer training signal than a single action: it encodes how confident the search was in each move, essentially distilling the result of thousands of rollouts into a soft target. This self-reinforcing loop - better networks improve search; better search generates better training targets - produces superhuman play in Chess, Go, and Shogi from scratch within hours.

The key insight is that the value network eliminates the need for long rollouts. Random rollouts are noisy because early mistakes compound over many moves. A well-trained value network evaluates any position directly, giving a low-variance estimate of the outcome without any simulation at all.

LLMs as Sequential Decision Processes

Language generation maps naturally onto the MCTS framework. Define:

State $s_t$: the sequence of tokens generated so far: $(x_1, x_2, \ldots, x_t)$.
Action $a_t$: the next token to generate (a choice from a vocabulary of $\sim 50{,}000$ tokens).
Transition: deterministic - $s_{t+1} = (s_t, a_t)$.
Terminal state: end-of-sequence token, or a maximum length.
Reward: correctness of the final answer, verifiability of a proof step, score on a coding test.

The search tree is a trie over token sequences, with each path from root to leaf being one possible generation. The branching factor ($\sim 50{,}000$) is much larger than Go ($\sim 250$), which makes token-level MCTS expensive. The practical solution is to operate at the level of reasoning steps rather than tokens: each “action” is a complete sentence or logical step, reducing the branching factor to a manageable size (typically 3-10 candidate continuations).

The analogy to AlphaZero is direct: the LLM itself serves as the policy network (it generates candidate continuations and assigns them probabilities), and a separate value model evaluates partial solutions. MCTS then allocates the inference compute budget toward the most promising reasoning paths.

Process Reward Models as Value Functions

In the AlphaZero framework, the value network $V(s)$ estimates the probability of winning from position $s$. The analogue for LLM reasoning is a Process Reward Model (PRM): a model that estimates, given the problem and the reasoning steps generated so far, the probability that the current partial solution will eventually lead to a correct final answer.

PRMs are distinct from Outcome Reward Models (ORMs), which only evaluate the final answer. A PRM evaluates intermediate steps, which is essential for MCTS: you need a value estimate at every node in the search tree, not just at terminal leaves.

Training a PRM requires step-level annotations. For mathematical reasoning, this means labeling each reasoning step as correct or incorrect. OpenAI’s PRM800K dataset (Lightman et al., 2023) provides human step-level annotations for 800K math solutions. With a well-trained PRM, MCTS can identify that a particular reasoning path went wrong at step 3 - not just that the final answer was wrong - and prune that branch early.

The quality of MCTS with LLMs is bounded by the quality of the PRM. If the PRM incorrectly scores a wrong reasoning step as promising, the search will waste compute exploring that branch. This is the central challenge: PRMs are trained on human or synthetic data and can be fooled by fluent but incorrect reasoning.

Tree-of-Thought (Yao et al., 2023) formalizes multi-step search for language models. At each step, the model generates multiple candidate “thoughts” (reasoning steps), evaluates them with a scoring function (either the LLM itself used as a judge, or a trained verifier), and selects the most promising candidates to continue from.

This is breadth-first search with pruning: maintain a beam of $k$ candidate partial solutions, expand each by generating $b$ continuations, score all $k \times b$ candidates, and keep the top $k$. The difference from MCTS is that Tree-of-Thought does not use visit counts or an explicit exploration-exploitation balance; it is more like best-first beam search with a verifier guiding the beam.

The key contribution is the separation of roles: the generator (LLM) proposes continuations; the evaluator (also LLM, or a separate model) scores them; the search algorithm (tree-of-thought or MCTS) allocates compute based on scores. This separation allows any of the three components to be improved independently.

AlphaCode 2 (Leblond et al., 2023) applied a similar principle to competitive programming. A large model generates many candidate solutions; a separate discriminator model trained to identify correct and compilable programs filters the candidates. The discriminator plays the role of the value function, and the overall system is a form of best-of-N search with a learned verifier.

Inference-Time Compute Scaling

A recurring theme in recent LLM research is that increasing compute at inference time - rather than just at training time - can substantially improve performance on hard problems. MCTS and related search methods are one mechanism for this.

Greedy decoding uses $O(1)$ model evaluations per token. Best-of-N sampling runs the model $N$ times and takes the best result according to a verifier, using $O(N)$ evaluations but no structured search. Beam search maintains $k$ candidate sequences at each step. MCTS with a PRM performs structured search, spending more evaluations on promising paths and fewer on dead ends.

The key insight of systems like OpenAI o1 (2024) and related work: at test time, the model has a compute budget, and that budget should be spent on structured reasoning search rather than a single forward pass. The scaling law for test-time compute suggests that with enough search budget, smaller models with good value functions can match or exceed larger models that generate single responses.

$$\underbrace{\text{Greedy}}{\text{1 call}} \to \underbrace{\text{Best-of-N}}{\text{N parallel calls}} \to \underbrace{\text{Beam Search}}{\text{N calls, breadth-first}} \to \underbrace{\text{MCTS}}{\text{N calls, structured}}$$

Each step in this hierarchy uses the same total compute budget more efficiently by incorporating value estimates into the search.

Fundamental Tensions

MCTS is most powerful when two conditions hold: the value function is accurate, and the branching factor is manageable. For language, both are challenging.

Branching factor. A vocabulary of 50,000 tokens means 50,000 possible continuations at each step. Full token-level MCTS is intractable. Step-level MCTS reduces this to $\sim 5$-10 candidates per step, but each “action” is now a multi-token generation, introducing variance in the expansion step.

Value function accuracy. PRMs are trained on finite data and can be systematically wrong in ways that are hard to detect. An adversarially-written incorrect proof step that fools the PRM will also fool the MCTS search. The distribution shift between training data and hard test problems further degrades PRM accuracy precisely where it matters most.

Credit assignment. In Go, each move has a clear game-theoretic value. In open-ended reasoning, the quality of an intermediate step is ambiguous - a step may look wrong but lead to a creative valid solution, or look right but be subtly incorrect. PRMs trained on human annotations inherit human biases about which intermediate steps look good.

Despite these challenges, the structural advantages of search - the ability to backtrack, to avoid committing to a path early, to concentrate compute on hard subproblems - make MCTS and search-augmented LLMs a compelling direction for tasks that require extended chains of reasoning.

Summary

Method	Search type	Value model	Compute scaling
Greedy decoding	None	None	$O(L)$ per token
Best-of-N	Embarrassingly parallel	ORM (outcome)	$O(N \cdot L)$
Beam search	Breadth-first bounded	None (or heuristic)	$O(k \cdot L)$
Tree-of-Thought	Breadth-first with pruning	LLM-as-judge or verifier	$O(k \cdot b \cdot L)$
MCTS + PRM	Depth-first with backprop	PRM (process)	$O(N \cdot L)$ concentrated

Concept	Definition
UCT score	$Q(v) + C\sqrt{\ln N(\text{parent})/N(v)}$; balances exploitation and exploration
$Q(v)$	Average reward seen from subtree rooted at node $v$
$N(v)$	Visit count at node $v$
Policy network	In AlphaZero: predicts action probabilities; in LLMs: the LLM itself
Value network	In AlphaZero: estimates game outcome from a position
PRM	Process Reward Model; scores intermediate reasoning steps
ORM	Outcome Reward Model; scores only the final answer
Rollout	Simulation to a terminal state, replaced by value network in neural MCTS
Backpropagation (MCTS)	Update $Q(v)$ and $N(v)$ along the path from new node to root
Test-time compute scaling	Allocating inference budget to structured search rather than single-pass generation

Read next: