Direct Preference Optimization
Direct Preference Optimization (DPO) is an alignment method that achieves the same goal as RLHF - training a language model to match human preferences - without an explicit reward model and without reinforcement learning. It derives an exact equivalence that turns the RL objective into a supervised loss on the policy itself.
Motivation
PPO-based RLHF requires three separately trained models (SFT model, reward model, RL policy) plus a value function, and training involves a complicated nested loop of sampling from the policy, scoring with the reward model, and computing advantage estimates. The training is sensitive to hyperparameters, prone to instability, and requires significant engineering infrastructure. DPO collapses all of this into a single fine-tuning step.
Derivation
Start from the RLHF objective: maximise expected reward under a KL constraint from the reference policy $\pi_{\text{ref}}$:
$$\max_\pi \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot|x)}\!\left[r(x,y)\right] - \beta \, D_{\text{KL}}\!\left[\pi(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right]$$
This objective has a closed-form optimal solution. For any fixed $x$, it is a KL-regularised reward maximisation over distributions on $y$, whose solution is the Gibbs distribution:
$$\pi^\ast(y|x) = \frac{\pi_{\text{ref}}(y|x) \exp(r(x,y)/\beta)}{Z(x)}$$
where $Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp(r(x,y)/\beta)$ is the partition function that normalises the distribution.
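To see why this is the maximiser, the objective for a fixed $x$ can be rewritten as a single KL divergence against $\pi^\ast$:
$$\mathbb{E}_{y \sim \pi(\cdot|x)}\!\left[r(x,y)\right] - \beta\,D_{\text{KL}}\!\left[\pi(\cdot|x)\,\|\,\pi_{\text{ref}}(\cdot|x)\right] = -\beta\,\mathbb{E}_{y \sim \pi(\cdot|x)}\!\left[\log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)\exp(r(x,y)/\beta)}\right] = -\beta\,D_{\text{KL}}\!\left[\pi(\cdot|x)\,\|\,\pi^\ast(\cdot|x)\right] + \beta \log Z(x)$$
Since $\log Z(x)$ does not depend on $\pi$, the objective is maximised exactly when the remaining KL term is zero, i.e. at $\pi = \pi^\ast$.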
Rearranging this equation to express $r$ in terms of $\pi^\ast$:
$$r(x,y) = \beta \log \frac{\pi^\ast(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$
Now substitute this expression for $r$ into the Bradley-Terry preference model:
$$P(y_w \succ y_l \mid x) = \sigma(r(x,y_w) - r(x,y_l))$$
$$= \sigma\!\left(\beta \log \frac{\pi^\ast(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x) - \beta \log \frac{\pi^\ast(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \beta \log Z(x)\right)$$
The partition function $Z(x)$ cancels, giving:
$$P(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^\ast(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^\ast(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$$
Replacing $\pi^\ast$ with the parameterised policy $\pi_\theta$ and taking the negative log-likelihood over preference pairs yields the DPO loss:
$$L_{\text{DPO}} = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$
This is a standard binary cross-entropy loss computed purely from the log-probabilities that $\pi_\theta$ and the frozen reference $\pi_{\text{ref}}$ assign to sequences in the preference dataset. No reward model, no sampling, no advantage estimation.
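A minimal sketch of this computation in PyTorch, assuming the per-sequence log-probabilities of each chosen and rejected completion have already been gathered from the two models (the function and tensor names are illustrative, not from any particular library):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-sequence log-probabilities.

    Each argument is a (batch,) tensor holding the summed log-probability
    of the completion tokens under the given model; the reference log-probs
    are computed under torch.no_grad() so only pi_theta receives gradients.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each completion.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    margin = chosen_rewards - rejected_rewards
    # -log(sigmoid(margin)) == softplus(-margin), a numerically stable form.
    loss = F.softplus(-margin).mean()
    return loss, margin.detach()
```

Returning the detached margin $\hat{r}_w - \hat{r}_l$ alongside the loss is convenient for the training diagnostic discussed in the Examples section below.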
The Role of $\beta$
The temperature $\beta$ controls the trade-off between reward maximisation and KL divergence from the reference policy. A small $\beta$ allows the policy to deviate substantially from $\pi_{\text{ref}}$ in pursuit of preference alignment; a large $\beta$ keeps the policy close to $\pi_{\text{ref}}$.
In the limit $\beta \to \infty$, the sigmoid saturates as soon as the implicit reward margin turns positive, the gradient vanishes, and the policy barely moves from $\pi_{\text{ref}}$. In the limit $\beta \to 0$, the sigmoid never saturates, so nothing stops the policy from widening the margin indefinitely; it can collapse onto the single highest-probability continuation under the preference data, ignoring the reference policy entirely. In practice $\beta \in [0.05, 0.5]$ works well.
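A quick numeric illustration of both limits, using an example log-ratio margin of 2.0 (the value is arbitrary; the point is how the per-example weight from the gradient analysis below responds to $\beta$):

```python
import math

def dpo_weight(beta, margin):
    """Per-example gradient weight sigma(r_hat_l - r_hat_w) = sigma(-beta * margin),
    where margin is the raw log-ratio margin without the beta factor."""
    return 1.0 / (1.0 + math.exp(beta * margin))

# Suppose the policy already prefers y_w by a log-ratio margin of 2.0.
for beta in (0.01, 0.1, 0.5, 2.0, 10.0):
    print(f"beta={beta:>5}: weight={dpo_weight(beta, 2.0):.4f}")
```

At large $\beta$ the weight collapses to zero as soon as the margin is positive, so training stalls near $\pi_{\text{ref}}$; at small $\beta$ the weight stays near 0.5 no matter how wide the margin already is, so nothing discourages drifting arbitrarily far from the reference.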
Gradient Analysis
The gradient of the DPO loss has an illuminating form:
$$-\nabla_\theta L_{\text{DPO}} = \beta \, \mathbb{E}\!\left[\sigma(\hat{r}_l - \hat{r}_w)\left(\nabla_\theta \log \pi_\theta(y_w|x) - \nabla_\theta \log \pi_\theta(y_l|x)\right)\right]$$
where $\hat{r}_w = \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}$ and $\hat{r}_l = \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}$ are the implicit rewards. The weight $\sigma(\hat{r}_l - \hat{r}_w)$ is large when the implicit reward incorrectly ranks the pair (rejected response ranked above preferred), giving large updates precisely when the policy is most wrong. This mechanism is analogous to the reward-model error signal in PPO.
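This form follows from the chain rule: writing the margin as $\hat{r}_w - \hat{r}_l$, using $\nabla_\theta \log \sigma(z) = \sigma(-z)\,\nabla_\theta z$, and noting that the reference-model terms do not depend on $\theta$,
$$\nabla_\theta L_{\text{DPO}} = -\mathbb{E}\!\left[\sigma(\hat{r}_l - \hat{r}_w)\,\nabla_\theta(\hat{r}_w - \hat{r}_l)\right] = -\beta\,\mathbb{E}\!\left[\sigma(\hat{r}_l - \hat{r}_w)\left(\nabla_\theta \log \pi_\theta(y_w|x) - \nabla_\theta \log \pi_\theta(y_l|x)\right)\right]$$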
Variants
IPO (Identity Preference Optimisation, Azar et al., 2023) replaces the Bradley-Terry model with a direct preference assumption, removing the sigmoid and logarithm from the loss. This avoids the problematic regime of near-deterministic preferences, where DPO drives $\sigma(\cdot) \to 1$ by pushing the log-ratio margin towards infinity; IPO instead regresses the margin to the fixed target $\frac{1}{2\beta}$:
$$L_{\text{IPO}} = \mathbb{E}\!\left[\left(\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \frac{1}{2\beta}\right)^2\right]$$
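Under the same tensor conventions as the DPO sketch above, a minimal version of this loss might look like:

```python
def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """IPO: squared error pushing the log-ratio margin towards the target 1/(2*beta)."""
    log_ratio_margin = ((policy_chosen_logps - ref_chosen_logps)
                        - (policy_rejected_logps - ref_rejected_logps))
    return ((log_ratio_margin - 1.0 / (2.0 * beta)) ** 2).mean()
```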
SimPO (Simple Preference Optimisation, Meng et al., 2024) normalises log-probabilities by sequence length to remove the reference model entirely, using the average per-token log-probability as the implicit reward. This avoids length bias and eliminates the need to run a reference model forward pass during training.
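A sketch under the same conventions; `chosen_lengths` and `rejected_lengths` are the completion token counts, and the target reward margin `gamma` (an extra term from the SimPO paper, not described above) can be set to zero. The default values of `beta` and `gamma` are illustrative only:

```python
import torch.nn.functional as F

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths, beta=2.0, gamma=0.5):
    """SimPO: the implicit reward is the length-averaged policy log-probability;
    no reference model is required."""
    chosen_rewards = beta * policy_chosen_logps / chosen_lengths
    rejected_rewards = beta * policy_rejected_logps / rejected_lengths
    return F.softplus(-(chosen_rewards - rejected_rewards - gamma)).mean()
```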
KTO (Kahneman-Tversky Optimisation, Ethayarajh et al., 2024) uses unpaired preference data - individual (prompt, response, label) triples rather than comparative pairs. The loss is based on prospect theory’s value function, separately modelling the utility of desirable and undesirable outputs:
$$L_{\text{KTO}} = \mathbb{E}\!\left[\lambda_D \left(1 - \sigma\!\left(\beta(\hat{r}(x,y) - z_{\text{ref}})\right)\right) \mathbf{1}[y \text{ desirable}] + \lambda_U \left(1 - \sigma\!\left(\beta(z_{\text{ref}} - \hat{r}(x,y))\right)\right) \mathbf{1}[y \text{ undesirable}]\right]$$
where $\hat{r}(x,y) = \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$, $z_{\text{ref}}$ is an estimate of the KL divergence of the policy from the reference policy, and $\lambda_D$, $\lambda_U$ are loss weights.
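A simplified sketch under the same conventions. The batch-level estimate of $z_{\text{ref}}$ here (mean policy-to-reference log-ratio, clamped at zero and detached from the gradient) is a loose stand-in for the paper's estimator, which uses mismatched prompt-response pairs, so treat this as an assumption of the sketch rather than a faithful reimplementation:

```python
import torch

def kto_loss(policy_logps, ref_logps, desirable_mask,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """KTO on unpaired (prompt, response, label) examples.

    policy_logps / ref_logps: (batch,) summed log-probs of each response.
    desirable_mask: (batch,) bool tensor, True where the response is desirable.
    """
    log_ratio = policy_logps - ref_logps                      # implicit reward r_hat
    # Crude batch estimate of the reference KL term z_ref (detached, non-negative).
    z_ref = log_ratio.detach().mean().clamp(min=0.0)

    desirable_loss = lambda_d * (1 - torch.sigmoid(beta * (log_ratio - z_ref)))
    undesirable_loss = lambda_u * (1 - torch.sigmoid(beta * (z_ref - log_ratio)))
    return torch.where(desirable_mask, desirable_loss, undesirable_loss).mean()
```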
When DPO Fails
Reference policy collapse. DPO’s derivation assumes the reference policy assigns non-negligible probability to both $y_w$ and $y_l$. If $\pi_{\text{ref}}$ assigns near-zero probability to some preferred responses (because they are long or stylistically different from pretraining), the implicit reward $\beta \log \pi_\theta(y_w|x)/\pi_{\text{ref}}(y_w|x)$ is dominated by the reference probability rather than the preference signal.
Over-regularisation at high $\beta$. With large $\beta$, the KL constraint is tight and the policy cannot move far from $\pi_{\text{ref}}$. On datasets where the preferred responses are very different from what $\pi_{\text{ref}}$ would generate, a high $\beta$ prevents the policy from reaching the preferred region at all.
Examples
DPO vs PPO empirical comparison. On the Anthropic Helpful and Harmless dataset, Rafailov et al. (2023) showed that DPO matches or exceeds PPO in win-rate against reference policy outputs as judged by GPT-4, while requiring no reward model and being stable to train. Training time dropped by approximately $3\times$.
Fine-tuning alignment with DPO. Fine-tuning Mistral 7B on a 50k-pair preference dataset with DPO ($\beta = 0.1$, 1 epoch, learning rate $5 \times 10^{-7}$) consistently shifts outputs toward preferred responses. The implicit reward margin $\hat{r}_w - \hat{r}_l$ serves as a useful diagnostic: if it is not increasing during training, the reference policy is mismatched with the preference data and the dataset or $\beta$ should be revised.
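Using the `dpo_loss` sketch from the Derivation section, the diagnostic is just a matter of logging the returned margin. A toy example, with random placeholder tensors standing in for real per-sequence model log-probabilities:

```python
import torch

torch.manual_seed(0)
# Placeholders; in practice these are gathered from the policy and the
# frozen reference model on a batch of preference pairs.
policy_chosen = torch.randn(8, requires_grad=True)
policy_rejected = torch.randn(8, requires_grad=True)
ref_chosen, ref_rejected = torch.randn(8), torch.randn(8)

loss, margin = dpo_loss(policy_chosen, policy_rejected,
                        ref_chosen, ref_rejected, beta=0.1)
print(f"loss={loss.item():.4f}  mean implicit reward margin={margin.mean().item():.4f}")
```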