RLHF - Aligning Models by Learning What Humans Prefer
Helpful context:
- LoRA & Quantization - Fine-Tuning at a Fraction of the Cost
- Objective Functions & Loss Design - What You Optimize Is What You Get
- Markov Chains - Where You’re Going Depends Only on Where You Are
A language model trained only on next-token prediction is, strictly speaking, a document completion machine.
Ask it a question and it will produce text that looks like it might follow that question somewhere in its training data. That might be a different question (FAQ pages often list questions back-to-back), a wrong answer stated confidently, an irrelevant continuation, or something harmful. The model has no concept of “answering helpfully.” It only knows “what comes next in text like this.”
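InstructGPT (Ouyang et al., 2022) brought RLHF - Reinforcement Learning from Human Feedback - to instruction-following language models at scale, building on earlier work on learning from human preferences. The goal is to teach models not just to produce plausible text, but to produce text that humans prefer. In the paper’s human evaluations, outputs from a 1.3B-parameter InstructGPT model were preferred over those of the 175B-parameter GPT-3, and assistant-style language models since (ChatGPT, Claude, Gemini, LLaMA-2-chat) have used some version of the method.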
The Alignment Problem
The challenge has a name: alignment. You want your model to be helpful, harmless, and honest. These are properties you can articulate in English but cannot easily encode as a loss function.
Consider what goes wrong with a model trained purely on next-token prediction:
- Harmful content. The internet contains harmful text. A model trained on it will reproduce it when prompted appropriately.
- Confidently wrong answers. A model trained to produce plausible text will produce plausible-sounding false statements. There is no penalty in the training objective for being wrong.
- Sycophancy. If the user’s question implies a particular answer, the model will often agree, because agreement is “plausible” given the context.
- Verbose irrelevance. Long responses are “safe” in the sense that they continue the context plausibly, even if most of the words don’t help.
You want to correct all of this. But defining “helpful, harmless, and honest” as a mathematical loss function is genuinely hard. Human preferences are contextual, nuanced, sometimes inconsistent, and hard to specify in advance.
One observation simplifies things considerably: humans are better at comparing two outputs than rating one output absolutely. Given two responses to the same question, most people can reliably say which is better, even when they couldn’t articulate in advance what “better” means. RLHF exploits this.
Why Not Just Fine-Tune on “Good” Responses?
Before getting into RLHF, it’s worth asking: why not just collect examples of ideal responses and fine-tune directly on them (supervised learning)?
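This is actually Stage 1 of RLHF - supervised fine-tuning (SFT) - and it is the standard first step. But on its own it has limits:
- You can only learn from the examples you collected. The model imitates the demonstrations but never receives feedback on its own outputs, so it cannot discover responses better than the ones it was shown.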
- Quality labeling is expensive. Collecting enough diverse, high-quality demonstrations to cover all situations is infeasible.
- Preference information is richer than demonstration information. “A is better than B” is often easier to obtain and more informative than “here is a perfect response.”
Reinforcement learning offers something supervised learning doesn’t: the ability to explore. The model can generate candidate responses, receive feedback on them, and update accordingly - learning from responses it has never seen before. The feedback signal (the reward model) acts as a proxy for “what humans prefer,” allowing the model to optimize beyond the fixed training set.
The RLHF Pipeline: Three Stages
RLHF as used in InstructGPT involves three sequential training stages. Each produces a new model, which is used as input to the next stage.
Stage 1: Supervised Fine-Tuning (SFT)
Start with a pre-trained base language model. Collect demonstrations: human contractors (or high-quality curated data) provide examples of ideal responses to a variety of prompts. Fine-tune the base model on these demonstration pairs using standard cross-entropy loss.
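As a rough illustration of the cross-entropy objective used in this stage, here is a minimal sketch in plain Python. It assumes we already have the probability the model assigned to each target token, and it averages the loss over response tokens only. Masking out the prompt tokens is a common choice but an assumption here, and the function name `sft_loss` and the toy numbers are hypothetical.

```python
import math

def sft_loss(token_probs, response_mask):
    """Mean negative log-likelihood over response tokens only.

    token_probs[i]   : probability the model assigned to the i-th target token
    response_mask[i] : 1 if token i belongs to the demonstration response,
                       0 if it belongs to the prompt (assumed not trained on)
    """
    nll = [-math.log(p) for p, m in zip(token_probs, response_mask) if m]
    if not nll:
        raise ValueError("no response tokens to train on")
    return sum(nll) / len(nll)

# Toy example: 2 prompt tokens followed by 3 response tokens.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [0, 0, 1, 1, 1]
loss = sft_loss(probs, mask)  # averages -log(0.5), -log(0.25), -log(0.5)
```

Lower loss means the model assigns higher probability to the demonstrated response; nothing in this objective compares one response against another.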
The result is the SFT model: a model that can follow instructions, handle diverse prompts, and produce coherent responses. It’s noticeably better than the raw pre-trained model for task completion. But it’s not yet aligned in the sense we want - it hasn’t been specifically optimized to produce responses humans prefer.
Stage 2: Reward Model Training
Now we train a separate model to predict human preferences.
Sample pairs of responses $(y_1, y_2)$ to the same prompt $x$ from the SFT model. Have human raters compare them and indicate which is better: “$y_1$ is preferred over $y_2$” or vice versa. Collect a large dataset of such comparisons.
The reward model $R_\phi(x, y) \to \mathbb{R}$ takes a prompt $x$ and response $y$ and outputs a scalar quality score. We train it using the Bradley-Terry preference model:
$$P(y_w \succ y_l \mid x) = \sigma\left(R_\phi(x, y_w) - R_\phi(x, y_l)\right),$$
where $y_w$ is the preferred (“won”) response, $y_l$ is the rejected (“lost”) response, and $\sigma$ is the sigmoid function. The reward model should assign higher scores to preferred responses.
The training loss is:
$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(R_\phi(x, y_w) - R_\phi(x, y_l)\right)\right].$$
This is just binary cross-entropy on the comparison outcome: we want the model to correctly predict which response was preferred. The Bradley-Terry model is a classic way to convert pairwise comparisons into scalar scores, and it is closely related to the Elo rating system used in chess and to other tournament ranking methods.
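The preference probability and the loss for a single comparison can be sketched in a few lines of plain Python. The function names are hypothetical, and the rewards are passed in as plain floats rather than computed by a network.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def preference_prob(r_w, r_l):
    """Bradley-Terry probability that the response scored r_w is preferred."""
    return sigmoid(r_w - r_l)

def bt_loss(r_w, r_l):
    """-log sigmoid(r_w - r_l), written as a stable softplus of -(r_w - r_l)."""
    d = r_w - r_l
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

equal = preference_prob(1.0, 1.0)   # equal scores give probability 0.5
good = bt_loss(2.0, 0.0)            # RM agrees with the human label: small loss
bad = bt_loss(0.0, 2.0)             # RM disagrees with the human label: large loss
```

Only the difference between the two scores matters, which is why reward model outputs are meaningful relative to each other rather than on an absolute scale.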
The reward model is typically initialized from the SFT model (same architecture, extra linear head for the scalar output). After training, it acts as a proxy for human judgment: given any (prompt, response) pair, it estimates how much a human would prefer it.
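To make the "extra linear head" concrete, here is a toy sketch that trains only a linear head with the Bradley-Terry loss. It assumes each response has already been mapped to a fixed feature vector (a stand-in for the transformer's final hidden state); the features, learning rate, and epoch count are invented for illustration, and a real reward model would train the full network with an autodiff library.

```python
import math

def reward(w, feats):
    """Scalar reward: dot product of the head weights and response features."""
    return sum(wi * fi for wi, fi in zip(w, feats))

def train_reward_head(pairs, dim, lr=0.5, epochs=200):
    """Plain SGD on -log sigmoid(reward(preferred) - reward(rejected))."""
    w = [0.0] * dim
    for _ in range(epochs):
        for f_w, f_l in pairs:
            d = reward(w, f_w) - reward(w, f_l)
            # Gradient of the loss w.r.t. w is -(1 - sigmoid(d)) * (f_w - f_l),
            # so the descent step adds (1 - sigmoid(d)) * (f_w - f_l).
            coeff = 1.0 - 1.0 / (1.0 + math.exp(-d))
            w = [wi + lr * coeff * (a - b) for wi, a, b in zip(w, f_w, f_l)]
    return w

# Hypothetical (preferred, rejected) feature pairs.
pairs = [
    ([1.0, 0.2], [0.0, 0.9]),
    ([0.8, 0.5], [0.3, 0.4]),
    ([0.9, 0.1], [0.2, 0.1]),
]
w = train_reward_head(pairs, dim=2)
ranked_correctly = all(reward(w, fw) > reward(w, fl) for fw, fl in pairs)
```

After training, the head scores every preferred response above its rejected counterpart on this toy data; whether it does so on unseen responses is exactly the generalization question the rest of the pipeline depends on.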
Stage 3: RL Fine-Tuning with PPO
Now we use the reward model to train the policy - the language model - via reinforcement learning.
The setup:
- Agent: the language model (policy $\pi_\theta$)
- State: the prompt $x$
- Action: the response $y$ (sampled token by token)
- Reward: $R_\phi(x, y)$ minus a KL penalty term
The objective the RL training maximizes:
$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[R_\phi(x, y) - \beta \cdot \text{KL}\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x)\right)\right].$$
The KL penalty term is critical. Without it, the policy quickly finds degenerate responses that fool the reward model - long repetitive strings, sycophantic phrases, or technically correct but useless outputs that the RM happens to rate highly. The KL penalty anchors the RL-trained model close to the SFT model, preventing it from drifting too far from coherent language.
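A minimal sketch of how the penalized reward might be computed for one sampled response, assuming we have the per-token log-probabilities of the sampled tokens under both the policy and the frozen SFT model. Because the response is sampled from the policy, the summed log-ratio is a single-sample estimate of the KL term. The function name and the value of `beta` are illustrative assumptions, not values from InstructGPT.

```python
def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Return (shaped_reward, kl_estimate) for one sampled response.

    logp_policy[t], logp_ref[t]: log-probability of sampled token t under the
    current policy and under the frozen SFT reference, respectively.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob sequences must have the same length")
    # Single-sample Monte Carlo estimate of KL(pi_theta || pi_SFT).
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate, kl_estimate

# A response the policy likes much more than the SFT model does.
shaped, kl = kl_shaped_reward(
    rm_score=2.0,
    logp_policy=[-0.1, -0.2, -0.1],
    logp_ref=[-1.0, -1.2, -0.9],
)
```

The further the policy's token probabilities move from the reference on its own samples, the larger the estimate and the more reward is subtracted.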
The algorithm used for optimization is PPO (Proximal Policy Optimization). PPO clips the policy update at each step to prevent the kind of large destabilizing updates that simpler policy gradient methods suffer from:
$$\mathcal{L}_{\text{PPO}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t\right)\right],$$
where $r_t(\theta) = \pi_\theta(a_t | s_t) / \pi_{\theta_{\text{old}}}(a_t | s_t)$ is the probability ratio between new and old policies, $\hat{A}_t$ is the estimated advantage, and $\epsilon$ (typically $0.1$ to $0.2$) is the clip range.
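The clipped objective is easier to read in code. The sketch below evaluates it for a batch of (token, advantage) samples in plain Python; the function name is hypothetical, and a real implementation would compute this on tensors and differentiate through it.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Average clipped surrogate objective (to be maximized)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                       # r_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)        # pessimistic bound
    return total / len(advantages)

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
# Ratio 1.5 with a negative advantage: no cap, the full penalty applies.
uncapped = ppo_clip_objective([math.log(1.5)], [0.0], [-1.0])
```

The `min` makes the objective a lower bound: the policy gets no extra credit for pushing the ratio outside the clip range in the favorable direction, but is still fully penalized when the move is harmful.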
The result of Stage 3 is the final aligned model: a language model that generates responses maximizing the reward model’s score while staying close to the SFT model’s distribution.
Reward Hacking
The reward model is an imperfect proxy for human preferences. It was trained on a finite set of comparisons, by a particular group of raters, on a particular distribution of prompts. It will have blind spots and biases.
When you run RL optimization against the reward model, you’re optimizing against this imperfect proxy. The policy can - and often does - find strategies that maximize the proxy reward without actually being better:
- Repetition: repeat correct-sounding phrases multiple times. Many RMs reward comprehensive-looking responses.
- Length hacking: produce very long responses. RMs often associate length with thoroughness.
- Sycophancy: agree with the user’s implied view, even when wrong.
- Superficial correctness: produce responses that sound authoritative and confident even when the content is wrong.
This is the fundamental problem with RLHF. The RM is a learned approximation, and any optimization process is good at finding its weaknesses. The KL penalty mitigates this by limiting how far from the SFT model the policy can go - but it’s a blunt instrument.
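A deliberately artificial sketch of the failure mode: a proxy reward with a small length bias picks a different winner than the preference it was meant to approximate. Every number and both scoring functions below are invented purely for illustration and do not come from any real reward model.

```python
# Hypothetical candidate responses with a made-up quality score and length.
candidates = [
    {"name": "concise_correct", "quality": 0.9, "length": 80},
    {"name": "padded_correct", "quality": 0.9, "length": 400},
    {"name": "padded_wrong", "quality": 0.3, "length": 1200},
]

def proxy_reward(c):
    # Toy RM: tracks quality but also leaks a bias toward longer responses.
    return c["quality"] + 0.001 * c["length"]

def true_preference(c):
    # Toy "human" preference: quality, minus a penalty for padding past 100 tokens.
    return c["quality"] - 0.0005 * max(0, c["length"] - 100)

rm_choice = max(candidates, key=proxy_reward)
human_choice = max(candidates, key=true_preference)
```

Here the proxy selects `padded_wrong` while the toy human preference selects `concise_correct`. An optimizer that only sees the proxy will move toward the former.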
In practice, the mitigations are careful RM training, a limited optimization budget, and human evaluation of the final model (rather than trusting the RM score). This is why the RLHF pipeline often involves multiple rounds: train RM, train policy, collect new preferences on the new policy, retrain RM, repeat.
Why PPO Is Hard
Running PPO on a 7B+ parameter language model is not easy.
At any given training step, you need four separate models in memory simultaneously:
- Policy model $\pi_\theta$: the model being trained.
- Reference model $\pi_{\text{SFT}}$: frozen, for computing the KL divergence.
- Reward model $R_\phi$: frozen, for computing rewards.
- Value function $V_\psi$: a critic network for estimating advantages in PPO.
For a 7B parameter model at 2 bytes per parameter, that’s roughly $4 \times 14$ GB = 56 GB just for the model weights in FP16, before gradients, optimizer states, or activations. In practice, LoRA is often used for the policy’s trainable parameters to keep this tractable.
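The arithmetic can be checked with a short script. The weight calculation follows directly from the text; the extra training-state estimate assumes FP16 gradients plus two FP32 Adam moments per trainable parameter, which is one common setup and an assumption here, not something the pipeline requires.

```python
def weights_gb(n_params, bytes_per_param=2):
    """Memory for model weights in GB (1 GB = 1e9 bytes), FP16 by default."""
    return n_params * bytes_per_param / 1e9

def train_state_gb(n_trainable, grad_bytes=2, adam_bytes=8):
    """Assumed extra state for trainable params: FP16 grads + FP32 Adam moments."""
    return n_trainable * (grad_bytes + adam_bytes) / 1e9

n = 7e9
per_model = weights_gb(n)            # 14 GB per model
all_weights = 4 * per_model          # policy + reference + reward + value
full_finetune_extra = train_state_gb(n)  # extra state if all policy weights train
```

Under these assumptions, fully training the policy adds about 70 GB on top of the 56 GB of weights, which is why shrinking the number of trainable parameters matters so much.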
Beyond memory, PPO training is notoriously sensitive:
- The KL coefficient $\beta$ must be tuned carefully. Too small: reward hacking. Too large: the model doesn’t improve.
- The clip ratio $\epsilon$ affects stability.
- The value function can diverge independently of the policy.
- Rewards can collapse to near-zero as the policy drifts too far from the reference.
This instability is one of the primary motivations for DPO (Direct Preference Optimization), which achieves similar alignment without running RL at all.
Discomfort check. “Reinforcement learning from human feedback” sounds like the model is learning in real time from individual human interactions. It’s not. Stages 1 through 3 are an offline training procedure with respect to humans: you collect a dataset, train fixed models, and move on. During Stage 3 the policy does generate fresh responses, but they are scored by a frozen, previously trained reward model - not by a live feedback loop with human raters. Online RLHF, where the model actually interacts with humans in real time and receives live rewards, is a separate and harder research problem. Most deployed RLHF uses the procedure described here.
Constitutional AI: Anthropic’s Variation
Anthropic introduced Constitutional AI (CAI) as a variation on RLHF that reduces dependence on human labeling.
Instead of relying on human preference comparisons for harmlessness in Stage 2, CAI uses a set of written principles - the “Constitution” - to have the model critique, revise, and compare its own outputs. The process:
- The model critiques its own responses according to the principles (“Is this response harmful? How could it be improved?”)
- The model revises based on the critique, and the revised responses are used for supervised fine-tuning.
- The fine-tuned model generates pairs of responses, an AI model judges which one better follows the principles, and these AI-generated comparisons are used to train the reward model (RLAIF: RL from AI feedback, rather than human feedback).
This scales more readily than human labeling: once the Constitution is written, the AI can generate preference data automatically. The tradeoff is that the quality of the reward model depends on the quality of the AI’s self-evaluation, which depends on the quality of the base model.
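The critique-and-revise loop can be sketched as follows. This is only an outline under stated assumptions: `generate` is a hypothetical callable standing in for a language model call, the instruction wording is invented, and `fake_generate` is a stub so the example runs without a model.

```python
from typing import Callable, List, Tuple

def critique_and_revise(
    prompt: str,
    response: str,
    principles: List[str],
    generate: Callable[[str], str],
) -> Tuple[str, List[Tuple[str, str, str]]]:
    """Apply one critique-then-revise step per principle.

    Returns the final revised response and the (principle, critique, revision)
    history. `generate` is a hypothetical stand-in for a model call.
    """
    current = response
    history = []
    for principle in principles:
        critique = generate(
            f"Prompt: {prompt}\nResponse: {current}\n"
            f"Critique the response using this principle: {principle}"
        )
        current = generate(
            f"Prompt: {prompt}\nResponse: {current}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
        history.append((principle, critique, current))
    return current, history

def fake_generate(instruction: str) -> str:
    # Stub so the sketch runs; a real system would call a language model here.
    return "REVISED" if "Rewrite" in instruction else "CRITIQUE"

final, steps = critique_and_revise(
    "Explain how vaccines work.", "draft answer",
    ["Avoid harmful content.", "Be honest."], fake_generate,
)
```

No human appears anywhere in the loop, which is the source of both the scalability and the dependence on the model's own judgment.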
RLVR: Reinforcement Learning from Verifiable Rewards
RLHF requires a learned reward model - a neural network trained on human preference data that scores responses. This creates two problems: reward hacking (the policy learns to fool the reward model rather than actually improve) and the cost of human annotation at scale.
Verifiable rewards largely sidestep both problems. For tasks where correctness can be checked automatically - mathematics, code, formal logic - the reward signal is the answer to “is this solution correct?” A math problem either has the right answer or it doesn’t. Code either passes tests or it doesn’t. No learned reward model is needed, and there is far less room for reward hacking, although a weak verifier (for example, an incomplete test suite) can still be exploited.
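A minimal sketch of what a deterministic verifier for numeric math answers might look like. The `Final answer:` marker and the function names are hypothetical conventions chosen for this example; real systems define their own answer format and usually need more robust parsing.

```python
from fractions import Fraction

def extract_final_answer(completion, marker="Final answer:"):
    """Return the text after the last occurrence of the marker, or None."""
    idx = completion.rfind(marker)
    if idx == -1:
        return None
    return completion[idx + len(marker):].strip()

def verify_math(completion, expected):
    """Reward 1.0 if the final answer equals `expected` exactly, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        # Fraction compares exactly, so "1/2" and "0.5" count as the same value.
        return 1.0 if Fraction(answer) == Fraction(expected) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answers earn no reward

r_correct = verify_math("Half of one is a half. Final answer: 1/2", "0.5")
r_wrong = verify_math("Two plus one. Final answer: 4", "3")
r_missing = verify_math("I am not sure.", "3")
```

The reasoning text before the marker is never inspected; only the final answer determines the reward.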
RLVR setup. The language model generates a chain-of-thought reasoning trace followed by a final answer. A deterministic verifier checks whether the answer is correct and returns reward 1 (correct) or 0 (incorrect). The policy is updated via GRPO (Group Relative Policy Optimization) or similar on-policy RL: generate $G$ solutions per problem, compute rewards, use the group mean and standard deviation to normalise advantages, and update with a clipped policy gradient loss identical in structure to PPO.
$$\mathcal{L}_{\text{GRPO}} = -\mathbb{E}\left[\min\left(\frac{\pi_\theta(o|q)}{\pi_{\theta_{\text{old}}}(o|q)} A,\; \text{clip}\left(\frac{\pi_\theta(o|q)}{\pi_{\theta_{\text{old}}}(o|q)}, 1-\varepsilon, 1+\varepsilon\right) A\right)\right]$$
where $A$ is the advantage normalised within the group of $G$ outputs for the same question, and $o$ is the output sequence.
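The group normalisation can be sketched in plain Python. This version subtracts the group mean and divides by the group standard deviation, which is the common formulation; using the population standard deviation and the small `eps` constant are assumptions of this sketch.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of G outputs for the same question."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    # eps avoids division by zero when every output gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

mixed = group_advantages([1.0, 0.0, 0.0, 1.0])       # some correct, some not
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])   # nothing solved
```

With mixed outcomes, correct outputs get positive advantages and incorrect ones negative. When every output in the group receives the same reward, all advantages are zero and the group contributes no gradient.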
Why it works. The key insight is that the verifiable reward provides a training signal exactly where you want it - at the final answer - without requiring a learned reward model that can be gamed. The chain-of-thought is not directly supervised; the model is free to develop whatever reasoning process leads to correct answers. This is the core recipe behind DeepSeek-R1 and similar “reasoning models”: apply RLVR on math/code problems with ground-truth verifiers, and the model develops longer, more structured reasoning chains without being shown them. DeepSeek-R1-Zero applies this directly to a base model, while DeepSeek-R1 adds supervised fine-tuning stages around the RL.
Limitations. RLVR only works when correctness is verifiable. Open-ended tasks (writing, summarisation, subjective judgements) have no ground-truth verifier. For these, RLHF with a learned reward model remains necessary. RLVR also requires a base model that already has enough capability to solve some fraction of problems - if the initial success rate is near zero, the RL signal vanishes. “Rubric-based” RLVR extends verifiable rewards to partially-correct answers using structured rubrics evaluated by a stronger model.
The Bigger Picture
RLHF is a paradigm, not a single algorithm. The key ideas:
- Human preferences are better captured by comparisons than by absolute ratings.
- A reward model trained on comparisons can generalize to new outputs.
- RL against this reward model can improve the policy beyond the distribution of training demonstrations.
- A KL penalty against a reference model limits reward hacking and maintains coherent language.
These four ideas, combined with careful engineering, produced a step change in the quality of AI assistants. The mathematical machinery - Bradley-Terry models, PPO, KL divergence - had all existed for years. The insight was applying them here, at this scale, in this order.
Summary
| Concept | Description |
|---|---|
| SFT model | Base model fine-tuned on human-written demonstrations |
| Reward model | Predicts human preference; trained on comparison pairs with Bradley-Terry loss |
| Bradley-Terry | $P(y_w \succ y_l) = \sigma(R(y_w) - R(y_l))$; loss is $-\log\sigma(R(y_w) - R(y_l))$ |
| RL objective | $\mathbb{E}[R(x,y)] - \beta\,\text{KL}(\pi_\theta \,\Vert\, \pi_{\text{SFT}})$ |
| KL penalty | Mitigates reward hacking; anchors RL model near SFT model |
| PPO | Clips policy updates; prevents destabilizing large steps |
| Reward hacking | Policy finds RM weaknesses; fundamental challenge of RLHF |
| Constitutional AI | Replaces human comparisons with AI-generated critiques guided by a written constitution |
| Why PPO is hard | Requires 4 models; sensitive to $\beta$, clip ratio, learning rate; can destabilize |
| RLVR | Uses deterministic verifiers (math, code) as reward signal; no learned reward model; trained with GRPO; produces reasoning models like DeepSeek-R1 |
RLHF transformed the field of language model alignment. Its limitations - complexity, memory requirements, reward hacking - directly motivated simpler alternatives. The most important of these is DPO, which aims for comparable results without a separate reward model or an RL loop.