To determine the step-level predictions at test time, it suffices to perform a single PRM forward pass over the whole solution.
We define the PRM score for a solution as the probability, under the PRM, that every step is correct, and implement it as the product of the per-step correctness probabilities (note: this product is biased against longer solutions, since every additional step can only lower the score).
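A minimal sketch of this scoring rule, assuming the PRM has already produced one correctness probability per step from a single forward pass over the whole solution; the function name and the example values are illustrative, not the paper's code.

```python
import math
from typing import List

def prm_solution_score(step_probs: List[float]) -> float:
    """Score a solution as the probability that every step is correct,
    implemented as the product of per-step correctness probabilities.

    Computed in log-space for numerical stability. Note the step-length
    bias: a solution with more steps is penalized simply for being longer.
    """
    return math.exp(sum(math.log(p) for p in step_probs))

# Per-step correctness probabilities from one PRM forward pass (illustrative).
step_probs = [0.99, 0.97, 0.95, 0.90]
print(prm_solution_score(step_probs))  # ~0.82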
When we provide process supervision, we deliberately choose to supervise only up to the first incorrect step. (If we were to provide additional process supervision beyond the first mistake, then process supervision would have an even greater information advantage.)
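A small sketch of how step labels might be truncated so supervision stops at the first incorrect step; the label encoding (+1 correct, -1 incorrect, 0 neutral) is an assumption for illustration.

```python
from typing import List

def truncate_at_first_error(step_labels: List[int]) -> List[int]:
    """Keep labels only up to and including the first incorrect step (-1).

    Steps after the first mistake receive no supervision, so their labels
    are discarded rather than used in training.
    """
    truncated = []
    for label in step_labels:
        truncated.append(label)
        if label == -1:  # first incorrect step: stop supervising here
            break
    return truncated

print(truncate_at_first_error([1, 1, -1, 1, -1]))  # [1, 1, -1]
```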
We present human data-labelers with step-by-step solutions to MATH problems sampled from the large-scale generator, and ask them to assign each step a label of positive (the step is correct), negative (the step is incorrect), or neutral (the step is ambiguous).
We surface convincing wrong-answer solutions to human labelers, i.e., solutions rated highly by the current best PRM that nevertheless reach an incorrect final answer (a selection sketch follows below).
We also iteratively re-train the PRM on the latest labeled data at several points during data collection.
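A hedged sketch of the surfacing criterion described above: keep only wrong-answer solutions and rank them by the current best PRM's score. The helpers `prm_score` and `is_final_answer_correct` are hypothetical stand-ins, not the paper's interfaces.

```python
from typing import Callable, List

def select_convincing_wrong_answers(
    solutions: List[str],
    prm_score: Callable[[str], float],
    is_final_answer_correct: Callable[[str], bool],
    top_k: int = 10,
) -> List[str]:
    """Surface solutions the current best PRM rates highly but whose final
    answer is wrong; these are the most informative to send to labelers."""
    wrong = [s for s in solutions if not is_final_answer_correct(s)]
    wrong.sort(key=prm_score, reverse=True)
    return wrong[:top_k]
```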
Large-scale Supervision (note that the ORM and PRM training sets are not directly comparable)
ORM: trained on 100 uniformly sampled solutions per problem from the generator. As a result, the ORM training set has no overlap with PRM800K and is an order of magnitude larger (see the sketch after this list).
PRM: trained on the PRM800K dataset
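A sketch of how such an ORM training set could be assembled, under the assumption that each sampled solution is labeled by automatic final-answer grading; `sample_solutions` and `grade_final_answer` are hypothetical helpers.

```python
from typing import Callable, Dict, List, Tuple

def build_orm_dataset(
    problems: List[Dict],
    sample_solutions: Callable[[Dict, int], List[str]],
    grade_final_answer: Callable[[Dict, str], bool],
    samples_per_problem: int = 100,
) -> List[Tuple[str, int]]:
    """Pair each uniformly sampled solution with a binary label indicating
    whether its final answer matches the reference answer."""
    dataset = []
    for problem in problems:
        for solution in sample_solutions(problem, samples_per_problem):
            label = int(grade_final_answer(problem, solution))
            dataset.append((solution, label))
    return dataset
```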
Comparison between ORM and PRM
Small-scale Supervision
Rationale for the small-scale experiment: the large-scale ORM and PRM training sets are not directly comparable, so small-scale ablations are needed to compare process and outcome supervision on equal footing.
Supervision (details in Appendix H): we first sample between 1 and 200 solutions per problem from a small-scale generator. For each dataset, we provide three forms of supervision: process supervision from PRM_large, outcome supervision from PRM_large, and outcome supervision from final-answer checking.
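A hedged sketch of one plausible way these three signals could be derived for a single sampled solution. The PRM_large scoring helper, the 0.5 threshold, and the exact way a solution-level label is read off PRM_large are all assumptions for illustration.

```python
from typing import Callable, Dict, List

def make_supervision(
    solution_steps: List[str],
    prm_large_step_probs: Callable[[List[str]], List[float]],
    final_answer_is_correct: bool,
    threshold: float = 0.5,  # assumed cutoff for treating a step as correct
) -> Dict[str, object]:
    """Derive three supervision signals for a small-scale reward model."""
    step_probs = prm_large_step_probs(solution_steps)
    # (1) Process supervision from PRM_large: one label per step.
    process_labels = [int(p >= threshold) for p in step_probs]
    # (2) Outcome supervision from PRM_large: a single solution-level label,
    #     here taken to mean PRM_large considers every step correct.
    outcome_from_prm_large = int(all(p >= threshold for p in step_probs))
    # (3) Outcome supervision from final-answer checking.
    outcome_from_answer_check = int(final_answer_is_correct)
    return {
        "process_from_prm_large": process_labels,
        "outcome_from_prm_large": outcome_from_prm_large,
        "outcome_from_answer_check": outcome_from_answer_check,
    }
```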
Comparison between ORM and PRM: In Figure 4a, we evaluate each reward model by its best-of-500 selection. In Figure 4b, we evaluate the best reward model from each series by its best-of-N performance across different values of N.
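A minimal sketch of best-of-N selection as used in this evaluation: the reward model picks its highest-scoring candidate out of N samples, and accuracy is the fraction of problems where that pick is correct. The helpers `reward_model_score`, `solutions_per_problem`, and `is_correct` are hypothetical.

```python
from typing import Callable, List, Sequence

def best_of_n(
    candidate_solutions: Sequence[str],
    reward_model_score: Callable[[str], float],
    n: int,
) -> str:
    """Return the highest-scoring solution among the first n candidates."""
    return max(candidate_solutions[:n], key=reward_model_score)

def best_of_n_accuracy(
    problems: List[dict],
    solutions_per_problem: Callable[[dict], Sequence[str]],
    reward_model_score: Callable[[str], float],
    is_correct: Callable[[dict, str], bool],
    n: int = 500,
) -> float:
    """Fraction of problems solved when the reward model selects one of n samples."""
    hits = 0
    for problem in problems:
        chosen = best_of_n(solutions_per_problem(problem), reward_model_score, n)
        hits += int(is_correct(problem, chosen))
    return hits / len(problems)
```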
Active Learning: