Research Question

We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing the standard RLHF problem to be solved with a simple classification loss rather than an RL training loop.

Approach

Overview:

Preliminaries - the RLHF pipeline (typically three phases)

  1. Supervised Fine-Tuning (SFT): fine-tuning a pre-trained LM, resulting in a model $\pi^{\text{SFT}}$

  2. Preference Sampling and Reward Learning: the SFT model is prompted to produce pairs of completions, which human labelers rank to obtain preference pairs $y_w \succ y_l$; a reward model $r_{\phi}(x, y)$ is then fit to these preferences (see the Bradley-Terry objective after this list)

  3. RL Optimization: during the RL phase, we maximize the reward given by the learned reward model while penalizing deviation from the reference policy with a KL-divergence term:

    $\max_{\pi_{\theta}} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(y \mid x)} \left[ r_{\phi}(x, y) \right] - \beta\, \mathbb{D}_{\text{KL}} \left[ \pi_{\theta}(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x) \right]$
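
For step 2, the reward model is typically fit by maximum likelihood under the Bradley-Terry preference model. Writing $y_w$ and $y_l$ for the preferred and dispreferred completions and $\sigma$ for the logistic function (notation not introduced above), the objective is

$\mathcal{L}_R(r_{\phi}, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( r_{\phi}(x, y_w) - r_{\phi}(x, y_l) \right) \right].$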

Direct Preference Optimization

Our approach leverages a particular choice of reward-model parameterization that enables extraction of its optimal policy in closed form, without an RL training loop. The key insight is an analytical mapping from reward functions to optimal policies, which lets us transform a loss function over reward functions into a loss function over policies.
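
To sketch how this mapping works: the KL-constrained objective above has a closed-form optimal policy, and inverting that relationship expresses the reward in terms of the policy (here $Z(x)$ denotes the partition function):

$\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\!\left( \frac{1}{\beta} r(x, y) \right), \qquad r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x).$

Substituting this reparameterization into the Bradley-Terry objective cancels $Z(x)$ and yields the DPO loss, a simple classification loss over policies:

$\mathcal{L}_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right].$

A minimal PyTorch-style sketch of this loss is below; the function name and argument layout are illustrative rather than taken from the paper, and the inputs are assumed to be per-sequence (token-summed) log-probabilities under the policy and reference models.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sketch of the DPO objective over a batch of preference pairs.

    Each tensor holds per-sequence log-probabilities, shape (batch,).
    `beta` plays the role of the KL coefficient in the RL objective.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each completion.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Logistic (Bradley-Terry) loss on the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```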