Process Reward Models - Supervising the Steps, Not Just the Answer // Megha Bose

Helpful context:

Suppose you are grading a student’s math exam, and the student gets the right answer on a problem but shows completely wrong work. Do you give full credit? The answer should be no: a correct final answer that follows from incorrect reasoning is either a lucky coincidence or an exploitation of the grading rubric. The student has not demonstrated they understand the material, and their method will fail on the next problem. This distinction, between the quality of reasoning and the correctness of the final answer, is exactly the problem that Process Reward Models (PRMs) are designed to solve in the context of large language models.

When we train LLMs to reason carefully, we need to evaluate not just whether they produce correct final answers, but whether the steps they take to get there are valid. The naive approach is to train a reward model on outcomes: if the LLM’s final answer is correct, assign reward 1; if wrong, assign reward 0. This is an Outcome Reward Model (ORM), and it has a well-documented failure mode. A model optimized purely for outcome reward can learn to produce reasoning chains that look plausible but contain fundamental errors, as long as the final answer happens to be correct. More perversely, a model that learns to mimic the surface structure of correct proofs without actually reasoning can score well on an ORM while being completely unreliable when the problem changes even slightly.

The solution is dense reward supervision at the step level: assign a reward to each reasoning step individually, not just to the final answer. A model trained or selected with step-level feedback must produce each intermediate step correctly, not just reach the right endpoint. This is the core idea of Process Reward Models, and it represents a more principled approach to eliciting reliable reasoning from language models.

Outcome Reward Models: The Baseline and Its Limits

An Outcome Reward Model frames LLM evaluation as a binary classification problem. Given a reasoning chain (a sequence of steps leading to a final answer) and the ground-truth answer, the ORM outputs a scalar reward indicating whether the reasoning succeeded:

$$r_{\text{ORM}}(x, c) = \begin{cases} 1 & \text{if final answer in chain } c \text{ matches ground truth} \\ 0 & \text{otherwise} \end{cases}$$

where $x$ is the problem and $c$ is the complete chain. In practice the reward is a learned score from a trained classifier, but the supervision signal is binary correct/incorrect on the final answer.

The appeal of ORMs is obvious: they require only answer-level labels, which are cheap to obtain. For math problems, you can check the final numerical answer automatically. For code, you can run a test suite. For factual questions, you can check against a database. No human annotation of reasoning steps is required.

The failure mode that ORMs cannot detect is reasoning chain hacking: a model that produces a plausible-looking chain of steps, reaches the correct answer by a path that does not follow logically, and receives full ORM reward. Consider a multi-step algebra problem where the correct answer is 42. A model might correctly compute the answer using a completely different (and incorrect) algebraic approach that happens to produce 42 by coincidence, or might recover from an error in step 3 by making a compensating error in step 5. The ORM sees the final 42 and gives full reward.

This is not a purely theoretical concern. Models trained with RLHF on outcome rewards routinely produce reasoning chains with internal inconsistencies, leaps of logic, and errors that cancel out. At scale, a model that optimizes for ORM reward learns to produce text that looks like careful reasoning without being careful reasoning.

Process Reward Models: Step-Level Supervision

A Process Reward Model assigns a reward to each step in the reasoning chain:

$$r_{\text{PRM}}(x, c, k) = \text{score for step } k \text{ of chain } c \text{ on problem } x$$

The PRM is trained to predict whether each individual step is a valid intermediate step toward a correct solution. A step that contains an error in algebra, an unjustified logical leap, or a false claim receives a low PRM score even if the final answer happens to be correct.

This is a strictly richer supervision signal than ORM. Every step-level label implies an outcome label (if every step is correct, the answer should be correct), but an outcome label does not imply step-level correctness. PRMs provide dense reward supervision over the reasoning trajectory rather than sparse terminal reward.

The tradeoff is annotation cost. To train a PRM, you need humans to read each reasoning step and rate its correctness. This is expensive, slow, and requires annotators with domain expertise. For high-school algebra, annotators can be trained quickly. For graduate-level mathematics or advanced programming, expert annotators are scarce and expensive.

PRM800K: Training Data and Empirical Results

The foundational dataset for PRM research is PRM800K (Lightman et al., 2023, OpenAI). The dataset contains 800,000 step-level human feedback labels on GPT-4 solutions to problems from the MATH benchmark, a collection of competition mathematics problems across algebra, geometry, number theory, and combinatorics.

Each label is one of three ratings: positive (the step is correct and helpful), negative (the step contains an error), or neutral (the step is correct but not useful, e.g., a restatement). Annotators reviewed each step in sequence and were instructed to rate based on correctness and relevance to the solution, not on the style or efficiency of the approach.

Training a reward model on PRM800K labels produces a PRM that scores each step of a reasoning chain. Lightman et al. then used this PRM to improve best-of-N selection: generate $N$ candidate solutions to each problem (using the LLM with temperature sampling), score each step of each candidate with the PRM, and select the candidate with the best aggregate PRM score.

The results were striking. For large $N$ (e.g., $N = 1860$), best-of-N selection using a PRM trained on PRM800K significantly outperformed best-of-N selection using an ORM trained on the same problems. The PRM correctly identified higher-quality solutions that the ORM rated similarly to lower-quality ones, because the PRM penalized solutions with errors in intermediate steps even when the final answer happened to be correct.

Using PRMs at Inference Time

Once trained, a PRM can be deployed in several ways at inference time to improve the quality of LLM outputs.

Best-of-N with PRM scoring. Generate $N$ complete reasoning chains from the LLM. For each chain, compute a scalar summary of the step-level scores. Common aggregations include:

Minimum step score: $\min_k r_{\text{PRM}}(x, c, k)$. A chain is only as strong as its weakest step.
Product of step scores (if scores are probabilities): $\prod_k r_{\text{PRM}}(x, c, k)$. This rewards chains where every step is confidently correct.
Average step score: $\frac{1}{K}\sum_k r_{\text{PRM}}(x, c, k)$.

Select the candidate with the highest aggregate score. The minimum step score is often most discriminating, as it penalizes chains that are mostly correct but have one critical error.

PRM-guided beam search. Rather than generating $N$ complete chains and post-hoc selecting, use the PRM to guide generation step by step. At each decoding step, generate $B$ candidate continuations (beam search with width $B$), score each partial chain with the PRM, and keep only the top-$B$ candidates. Prune branches with low PRM scores early, before they generate many additional steps.

This is more compute-efficient than flat best-of-N for large $N$: instead of generating thousands of complete solutions, beam search explores promising directions and discards poor ones early. The PRM serves as a search heuristic that steers generation toward valid intermediate steps.

Monte Carlo Tree Search (MCTS) with PRM. At each reasoning step, the partial chain represents a state in a tree. The PRM provides a value function estimate for each state. MCTS alternates between expanding the tree (sampling new reasoning steps) and evaluating leaf nodes (using PRM scores or full-chain evaluations), updating tree statistics to guide future expansion toward high-reward regions. This is the basis of “thinking” systems that iteratively refine reasoning through structured search.

Contrast with RLVR: Training vs. Selection

It is important to distinguish PRMs from a related approach called RLVR (Reinforcement Learning from Verifiable Rewards).

RLVR trains the generator LLM using RL with a reward signal that comes from a verifier (for math: symbolic checking of the final answer; for code: test suite execution). The generator’s policy is directly updated to produce higher-reward outputs. RLVR changes the model weights.

PRMs, by contrast, provide a scoring function used at inference time to select among candidate outputs. The generator’s weights are not modified by the PRM. PRMs improve selection; RLVR improves the generator.

The two are complementary. A model trained with RLVR produces better distributions of candidate solutions. A PRM then helps select the best candidate from that distribution. In practice, strong reasoning systems (like those used in OpenAI’s o1/o3 series) likely combine both: RLVR training to shift the generator toward higher-quality reasoning, and PRM-guided search at inference time to select among candidates.

RLVR requires verifiable rewards (can automatically check if the answer is correct) and can only improve final-answer correctness directly. PRMs require step-level supervision but can improve intermediate reasoning quality. Neither approach alone is sufficient for domains where neither the final answer is easily verifiable nor intermediate steps can be automatically checked.

The Annotation Bottleneck and Alternatives

Human annotation of reasoning steps does not scale. PRM800K represents an enormous annotation effort, and extending it to new domains (chemistry, law, medical reasoning) would require domain experts at prohibitive cost.

Several approaches attempt to sidestep this bottleneck.

Monte Carlo estimation of step correctness. A step’s quality can be estimated by sampling many completions from that step and checking how often they lead to a correct final answer. If continuing from step $k$ leads to a correct final answer in 80% of sampled completions, step $k$ is likely correct. If it leads to a correct answer in only 10% of completions, it is likely an error.

Formally, define the step-level value:

$$V(x, c_{1:k}) = \Pr(\text{correct final answer} \mid x, c_{1:k})$$

where the probability is over completions sampled from the LLM conditioned on the partial chain. This value function can be estimated by Monte Carlo sampling and used directly as a process reward signal without human annotation. The cost is compute (many completions per step) rather than human labor.

Formal verification. For mathematics and programming, formal proof assistants (Lean, Isabelle, Coq) or type checkers can automatically verify whether an intermediate step is logically valid. A reasoning chain in a formal language can be checked step-by-step with zero human annotation. This approach is limited to problems that can be formalized, but for mathematics and code it represents a fully scalable and noise-free alternative to human annotation.

Weak supervision and transfer. A PRM trained on one domain (high-school math) may transfer partially to related domains (undergraduate math). Training on weaker supervision signals (e.g., soft labels from a model that predicts step correctness) can bootstrap a PRM when human labels are scarce.

Limitations and Open Problems

PRMs are not a complete solution to the reasoning reliability problem.

Reward model gaming. A PRM is itself a learned model, and a generator trained via RL to maximize PRM scores can exploit the PRM’s weaknesses. The PRM may assign high scores to reasoning chains that look locally correct step by step but are globally incoherent. This is the same reward hacking problem that affects ORMs, displaced to the step level.

Distributional shift. PRMs are trained on human-annotated reasoning chains from a specific model and distribution. At inference time, the PRM scores reasoning chains produced by the same or a related model. If the model changes (through RLVR training, for example), the distribution of chains shifts, and the PRM may be poorly calibrated for the new distribution. Regular PRM retraining or online adaptation is needed but rarely done in practice.

Ill-defined correctness. In domains without clear ground truth for intermediate steps (open-ended essay writing, multi-step legal reasoning, scientific hypothesis generation), defining what counts as a “correct step” is itself a hard problem. PRMs are most natural for mathematics and formal reasoning where each step has a definite correct/incorrect status.

Evaluation difficulty. Measuring the true quality of a PRM requires human evaluation of its step-level judgments, which brings back the annotation cost. Standard benchmarks for PRMs measure downstream task performance (best-of-N accuracy on MATH), not the accuracy of step-level scores directly. A PRM might improve final-answer accuracy while being miscalibrated about which steps are actually errors.

Summary

Approach	What it supervises	Supervision cost	Failure mode
ORM	Final answer correctness only	Low (automatic or cheap)	Rewards correct answers with wrong reasoning
PRM (human-annotated)	Each reasoning step	High (domain expert annotation)	Expensive to scale; distributional shift
PRM (Monte Carlo)	Step value via sampled completions	Medium (compute-heavy)	Noisy; requires many samples per step
RLVR	Trains generator via verifiable rewards	Low for verifiable domains	Only final answer; no step-level feedback
Formal verification	Logical validity of each step	Zero (for formal languages)	Requires formalization; not generally applicable

Read next: