Research Question

Methods and experimental details

High-level methodology

Our methodology follows that of Ziegler et al. (2019) and Stiennon et al. (2020), who applied it in the stylistic continuation and summarization domains. We apply the following three steps:

  1. Collect demonstration data, and train a supervised policy.
    1. Our labelers provide demonstrations of the desired behavior on the input prompt distribution.
    2. We then fine-tune a pretrained GPT-3 model on this data using supervised learning.
  2. Collect comparison data, and train a reward model.
    1. We collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input.
    2. We then train a reward model (RM) to predict the human-preferred output; a minimal sketch of the comparison loss follows this list.
  3. Optimize a policy against the reward model using PPO.
    1. We use the output of the RM as a scalar reward.
    2. We fine-tune the supervised policy to optimize this reward using the PPO algorithm.
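
For concreteness, the RM in step 2 can be trained with a pairwise comparison loss over the labelers' preferences. The code below is a minimal sketch, not our actual implementation: `reward_model`, `prompts`, `preferred`, and `rejected` are hypothetical placeholders, and `reward_model(prompts, completions)` is assumed to return one scalar score per example.

```python
import torch.nn.functional as F

def rm_comparison_loss(reward_model, prompts, preferred, rejected):
    """Pairwise loss: push the score of the human-preferred output above the rejected one.

    reward_model, prompts, preferred, and rejected are hypothetical placeholders;
    reward_model(prompts, completions) is assumed to return one scalar score per example.
    """
    r_preferred = reward_model(prompts, preferred)  # shape: (batch,)
    r_rejected = reward_model(prompts, rejected)    # shape: (batch,)
    # Model P(preferred beats rejected) as sigmoid(r_preferred - r_rejected) and
    # minimize the negative log-likelihood of the labelers' choices.
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```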

Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy.
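
As a rough illustration of how the RM score enters step 3, the sketch below samples completions from the current policy, scores each with the RM, and hands the rollouts to a PPO update. `sample_completion` and `ppo_update` are hypothetical callables standing in for the decoding and PPO machinery; the only detail taken from the text above is that the RM output is used as the scalar reward for each completion.

```python
def ppo_training_step(policy, reward_model, prompts, sample_completion, ppo_update):
    """One iteration of step 3: roll out the policy, score with the RM, update with PPO.

    sample_completion and ppo_update are hypothetical callables, passed in so the
    sketch stays self-contained; they stand in for generation and the clipped PPO update.
    """
    rollouts = []
    for prompt in prompts:
        completion, logprobs = sample_completion(policy, prompt)
        reward = reward_model(prompt, completion)  # scalar reward from the RM
        rollouts.append((prompt, completion, logprobs, reward))
    # Fine-tune the policy against the collected rewards with the PPO objective.
    ppo_update(policy, rollouts)
    return policy
```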

In practice, most of our comparison data comes from our supervised policies, with some coming from our PPO policies.

Dataset

Prompts