Research Question

Methods and experimental details

High-level methodology

Our methodology follows that of Ziegler et al. (2019) and Stiennon et al. (2020), who applied it in the stylistic continuation and summarization domains. We apply the following three steps:

  1. Collect demonstration data, and train a supervised policy.
    1. Our labelers provide demonstrations of the desired behavior on the input prompt distribution.
    2. We then fine-tune a pretrained GPT-3 model on this data using supervised learning.
  2. Collect comparison data, and train a reward model.
    1. We collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input.
    2. We then train a reward model (RM) to predict the human-preferred output; a minimal sketch of the comparison loss follows this list.
  3. Optimize a policy against the reward model using PPO.
    1. We use the output of the RM as a scalar reward.
    2. We fine-tune the supervised policy to optimize this reward using the PPO algorithm.
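
For concreteness, the RM in step 2 can be trained with a pairwise comparison loss over the labelers' preferences. The code below is a minimal sketch, not our actual implementation: `reward_model`, `prompts`, `preferred`, and `rejected` are hypothetical placeholders, and `reward_model(prompts, completions)` is assumed to return one scalar score per example.

```python
import torch.nn.functional as F

def rm_comparison_loss(reward_model, prompts, preferred, rejected):
    """Pairwise loss: push the score of the human-preferred output above the rejected one.

    reward_model, prompts, preferred, and rejected are hypothetical placeholders;
    reward_model(prompts, completions) is assumed to return one scalar score per example.
    """
    r_preferred = reward_model(prompts, preferred)  # shape: (batch,)
    r_rejected = reward_model(prompts, rejected)    # shape: (batch,)
    # Model P(preferred beats rejected) as sigmoid(r_preferred - r_rejected) and
    # minimize the negative log-likelihood of the labelers' choices.
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```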

Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy.
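
As a rough illustration of how the RM score enters step 3, the sketch below samples completions from the current policy, scores each with the RM, and hands the rollouts to a PPO update. `sample_completion` and `ppo_update` are hypothetical callables standing in for the decoding and PPO machinery; the only detail taken from the text above is that the RM output is used as the scalar reward for each completion.

```python
def ppo_training_step(policy, reward_model, prompts, sample_completion, ppo_update):
    """One iteration of step 3: roll out the policy, score with the RM, update with PPO.

    sample_completion and ppo_update are hypothetical callables, passed in so the
    sketch stays self-contained; they stand in for generation and the clipped PPO update.
    """
    rollouts = []
    for prompt in prompts:
        completion, logprobs = sample_completion(policy, prompt)
        reward = reward_model(prompt, completion)  # scalar reward from the RM
        rollouts.append((prompt, completion, logprobs, reward))
    # Fine-tune the policy against the collected rewards with the PPO objective.
    ppo_update(policy, rollouts)
    return policy
```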

In practice, most of our comparison data comes from our supervised policies, with some coming from our PPO policies.

Dataset

Prompts