RLHF // Megha Bose

Reinforcement Learning from Human Feedback (RLHF) is the dominant method for aligning language model behaviour with human preferences. It converts the ill-defined objective “behave helpfully and harmlessly” into a tractable optimisation problem by learning a reward function from human comparisons and then optimising the policy against that reward.

The Three-Stage Pipeline

RLHF as applied to LLMs (Ouyang et al., InstructGPT, 2022) proceeds in three stages: supervised fine-tuning, reward modelling, and reinforcement learning.

Stage 1: Supervised Fine-Tuning (SFT)

A pretrained LLM is fine-tuned by maximum likelihood on a dataset of high-quality (prompt, response) demonstrations curated by human labellers:

$$L_{\text{SFT}} = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{demo}}} \sum_{t=1}^{|y|} \log \pi_{\text{SFT}}(y_t \mid x, y_{<t})$$

This produces a policy $\pi_{\text{SFT}}$ that knows how to follow instructions and produce well-formed outputs, but is not yet aligned to human preferences beyond the style of the demonstrations.
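As a minimal sketch of the SFT loss for a single (prompt, response) pair, the snippet below sums the negative log-probabilities of the target tokens. The toy logits, the three-token vocabulary, and the helper names `log_softmax` and `sft_loss` are illustrative assumptions, not part of any particular training library.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def sft_loss(token_log_probs):
    """Negative log-likelihood of one response: -sum_t log pi(y_t | x, y_<t)."""
    return -sum(token_log_probs)

# Hypothetical toy example: a 3-token vocabulary and a 2-token response.
step_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # model logits at each step
target_ids = [0, 2]                                # demonstrated tokens y_1, y_2

token_log_probs = [log_softmax(l)[t] for l, t in zip(step_logits, target_ids)]
loss = sft_loss(token_log_probs)
print(f"SFT loss for this response: {loss:.4f}")
```

In a real pipeline the expectation over $\mathcal{D}_{\text{demo}}$ is approximated by averaging this quantity over minibatches of demonstrations.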

Stage 2: Reward Model Training

Human labellers compare pairs of model outputs $(y_w, y_l)$ for the same prompt $x$, where $y_w$ is the preferred response and $y_l$ the rejected one. The reward model $r_\phi(x, y)$ is trained to predict which response is preferred using the Bradley-Terry model of pairwise preference:

$$P(y_w \succ y_l \mid x) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$

where $\sigma$ is the logistic sigmoid. The training loss is the negative log-likelihood of the observed preferences:

$$L_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{pref}}} \left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

The Bradley-Terry model assumes transitivity of preferences and a scalar utility, both of which are violated to varying degrees in practice. The reward model is typically initialised from $\pi_{\text{SFT}}$ with the final token’s hidden state projected to a scalar.
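The reward-model loss can be sketched in a few lines once the scalar rewards for each pair are available. The function names and the example reward values below are assumptions for illustration; in practice the rewards would come from $r_\phi$ evaluated on $(x, y_w)$ and $(x, y_l)$.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(chosen_rewards, rejected_rewards):
    """Mean Bradley-Terry negative log-likelihood over preference pairs."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

# Hypothetical rewards for three (preferred, rejected) pairs.
chosen = [1.2, 0.3, 2.0]
rejected = [0.4, 0.9, -1.0]
print(f"RM loss: {reward_model_loss(chosen, rejected):.4f}")
```

Because only the difference $r_\phi(x, y_w) - r_\phi(x, y_l)$ enters the loss, adding a constant to every reward leaves it unchanged: the reward scale is identified only up to an additive offset.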

Stage 3: PPO Fine-Tuning

The policy $\pi_\theta$ (initialised from $\pi_{\text{SFT}}$) is optimised to maximise the expected reward while staying close to $\pi_{\text{SFT}}$. The objective is:

$$\max_\theta \, \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\!\left[r_\phi(x, y) - \beta \, D_{\text{KL}}\!\left[\pi_\theta(\cdot|x) \,\|\, \pi_{\text{SFT}}(\cdot|x)\right]\right]$$

The KL penalty $-\beta D_{\text{KL}}$ prevents the policy from drifting far from the SFT model, which acts as both a quality floor and a protection against reward hacking.
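One common way to estimate the penalised objective for a single sampled response is to replace the KL term with the summed log-ratio of the tokens actually sampled, which is a single-sample Monte Carlo estimate of the sequence-level KL. The sketch below assumes per-token log-probabilities under both models are already available; the function name and numbers are hypothetical.

```python
def kl_penalised_reward(reward, policy_log_probs, ref_log_probs, beta):
    """r_phi(x, y) - beta * sum_t [log pi_theta(y_t|.) - log pi_SFT(y_t|.)].

    The summed log-ratio over sampled tokens is a single-sample estimate
    of KL(pi_theta || pi_SFT) for this prompt.
    """
    log_ratio = sum(p - q for p, q in zip(policy_log_probs, ref_log_probs))
    return reward - beta * log_ratio

# Hypothetical values: the policy is more confident than the SFT model
# on every sampled token, so the penalty reduces the reward.
policy_lp = [-0.1, -0.2, -0.05]
ref_lp = [-0.5, -0.9, -0.30]
print(kl_penalised_reward(reward=2.0, policy_log_probs=policy_lp,
                          ref_log_probs=ref_lp, beta=0.1))
```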

This objective is optimised with Proximal Policy Optimisation (PPO). The policy ratio is $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$, and the clipped surrogate objective is:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t\right)\right]$$

where $\hat{A}_t$ is the estimated advantage at step $t$ and $\epsilon \approx 0.2$ is the clipping threshold. The clipping prevents excessively large policy updates that could destabilise training. In the LLM setting, the “action” at each step is the next token, and the reward is received only at the end-of-sequence token (with the KL penalty distributed per token).
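The clipped surrogate and the per-token reward layout described above can be sketched as follows. This is an illustration under stated assumptions, not a full PPO implementation: advantage estimation, the value function, and minibatching are omitted, and the function names are hypothetical.

```python
import math

def clipped_surrogate(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Mean PPO clipped objective over time steps (to be maximised)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)                # r_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # clip(r_t, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

def per_token_rewards(reward, policy_log_probs, ref_log_probs, beta):
    """KL penalty on every token; reward-model score added on the last token."""
    rewards = [-beta * (p - q) for p, q in zip(policy_log_probs, ref_log_probs)]
    rewards[-1] += reward
    return rewards

# With a positive advantage, a ratio above 1 + eps earns no extra objective.
print(clipped_surrogate([math.log(2.0)], [0.0], [1.0]))
# With a negative advantage, the unclipped (more pessimistic) term is kept.
print(clipped_surrogate([math.log(2.0)], [0.0], [-1.0]))
```

Taking the minimum of the clipped and unclipped terms makes the surrogate a pessimistic bound: the policy gains nothing from pushing the ratio outside $[1 - \epsilon, 1 + \epsilon]$ in the direction the advantage favours.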

Reward Hacking and Goodhart’s Law

The reward model $r_\phi$ is a proxy for true human preferences, not the true objective itself. PPO is a powerful optimiser: it will find policies that score highly on the reward model through strategies that do not correspond to genuine quality improvements. This is a manifestation of Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure.

Empirically, reward hacking takes forms such as:

Generating excessively long or verbose responses (if the reward model gives higher scores to length).
Repeating certain phrases that the reward model associates with quality.
Hallucinating confidently (if hedging is penalised).

The KL penalty mitigates this by limiting the policy’s deviation from the SFT model, but it is not a complete solution. As $\beta \to 0$, the policy collapses onto whatever pattern maximally exploits the reward model. As $\beta \to \infty$, the policy stays at $\pi_{\text{SFT}}$ regardless of the reward signal.
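The two limits can be checked numerically on a toy problem. For a finite set of candidate responses and an unrestricted policy class, the KL-regularised objective has the well-known closed-form maximiser $\pi^\ast(y \mid x) \propto \pi_{\text{SFT}}(y \mid x)\exp(r_\phi(x, y)/\beta)$. The sketch below evaluates it for three hypothetical responses, the last of which stands in for a reward-model exploit; all numbers are made up for illustration.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_SFT(y) * exp(r(y) / beta), computed in log space."""
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

ref_probs = [0.7, 0.2, 0.1]  # hypothetical SFT probabilities of three responses
rewards = [0.0, 1.0, 3.0]    # hypothetical reward-model scores; last is the exploit

for beta in (0.01, 1.0, 1e6):
    probs = optimal_policy(ref_probs, rewards, beta)
    print(beta, [round(p, 4) for p in probs])
```

For very small $\beta$ nearly all probability mass lands on the highest-reward response, while for very large $\beta$ the result is indistinguishable from `ref_probs`.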

Reward Model Overoptimisation

Even without explicit adversarial exploitation, simply training the PPO policy for too many steps causes reward model overoptimisation: the policy finds regions of the output space that the reward model rates highly but that human evaluators rate poorly. Gao et al. (2023) characterise this in a synthetic setup where a larger “gold” reward model stands in for human judgement: as the KL divergence from the reference policy grows, the proxy reward keeps rising, while the gold reward $r^\ast$ increases, peaks, and then decreases. The best stopping point therefore comes well before the proxy reward model’s score is maximised.
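Gao et al. fit the gold reward under RL with the functional form $R(d) = d\,(\alpha - \beta_{\text{RL}} \log d)$, where $d = \sqrt{D_{\text{KL}}}$. The sketch below evaluates that form and its peak, $d^\ast = \exp(\alpha/\beta_{\text{RL}} - 1)$, obtained by setting $R'(d) = 0$. The coefficient values are hypothetical and chosen only to show the rise-then-fall shape; they are not fitted values from the paper, and $\beta_{\text{RL}}$ here is unrelated to the KL coefficient $\beta$.

```python
import math

def gold_reward(d, alpha, beta_rl):
    """R(d) = d * (alpha - beta_rl * log d), with d = sqrt(KL); R(0) = 0."""
    if d <= 0.0:
        return 0.0
    return d * (alpha - beta_rl * math.log(d))

def peak_distance(alpha, beta_rl):
    """Distance d* at which R(d) is maximised (where R'(d) = 0)."""
    return math.exp(alpha / beta_rl - 1.0)

alpha, beta_rl = 1.0, 0.5  # hypothetical coefficients, for illustration only
d_star = peak_distance(alpha, beta_rl)
for d in (0.5 * d_star, d_star, 2.0 * d_star, 4.0 * d_star):
    print(f"d = {d:6.3f}  gold reward = {gold_reward(d, alpha, beta_rl):7.3f}")
```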

Value Misspecification and Annotation Disagreements

Two deeper problems lie in the preference data itself:

Value misspecification refers to the gap between what annotators label as preferred and what is actually beneficial. Annotators may prefer confident-sounding but incorrect answers, verbose but unhelpful responses, or outputs aligned with their own cultural assumptions. The reward model faithfully learns these preferences, including the errors.

Annotation disagreements arise because human preferences are not a well-defined function: different annotators systematically disagree, and the same annotator may disagree with themselves across sessions. The Bradley-Terry model ignores this noise, treating the preference dataset as arising from a single consistent preference function.
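A small calculation makes the effect of disagreement concrete. Suppose a fraction $p$ of annotators prefer response A over response B for the same prompt. Minimising the expected Bradley-Terry loss over the reward gap $\Delta = r(A) - r(B)$ gives $\sigma(\Delta) = p$, so the disagreement is folded into a single smaller gap rather than being modelled as distinct annotator preferences. The sketch below checks this numerically; the function names and the value of $p$ are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_bt_loss(gap, p_prefer_a):
    """Expected Bradley-Terry loss when a fraction p of labels prefer A."""
    return -(p_prefer_a * math.log(sigmoid(gap))
             + (1.0 - p_prefer_a) * math.log(sigmoid(-gap)))

def best_gap(p_prefer_a):
    """Reward gap minimising the expected loss: logit(p) = log(p / (1 - p))."""
    return math.log(p_prefer_a / (1.0 - p_prefer_a))

p = 0.8  # hypothetical: 80% of annotators prefer A
gap = best_gap(p)
print(f"best gap = {gap:.4f}, implied P(A preferred) = {sigmoid(gap):.4f}")
```

When annotators are evenly split ($p = 0.5$), the best gap is zero and the reward model cannot distinguish the two responses at all.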

Examples

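InstructGPT results. Ouyang et al. found that human raters preferred outputs from a 1.3B InstructGPT model (RLHF-fine-tuned) over those from the 175B GPT-3 baseline (no RLHF), despite the InstructGPT model having over 100 times fewer parameters. At equal size, 175B InstructGPT outputs were preferred over 175B GPT-3 outputs about 85% of the time, and about 71% of the time against few-shot-prompted GPT-3. The alignment tax on standard NLP benchmarks was small (2–4% performance drop on a few tasks), demonstrating that RLHF does not catastrophically harm capabilities.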

Why the KL penalty is essential. In ablation experiments without the KL penalty, the PPO policy learns within a few thousand steps to produce repetitive high-reward outputs that human raters find useless or incoherent. With a well-tuned $\beta$, the policy improves for thousands more steps before degradation begins. The KL penalty is not merely a regulariser: it is what keeps the optimisation in the regime where the reward model’s ratings are reliable.

Read Next:

Direct Preference Optimization