To determine the step-level predictions at test time, it suffices to perform a single PRM forward pass over the whole solution.
We define the PRM score for a solution as the probability, under the PRM, that every step is correct, and implement it as the product of the per-step correctness probabilities (note: this product is biased against longer solutions, since every additional step can only lower the score).
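A minimal sketch of this scoring rule, assuming the PRM has already produced one correctness probability per step from a single forward pass over the whole solution; the function name and the example values are illustrative, not the paper's code.

```python
import math
from typing import List

def prm_solution_score(step_probs: List[float]) -> float:
    """Score a solution as the probability that every step is correct,
    implemented as the product of per-step correctness probabilities.

    Computed in log-space for numerical stability. Note the step-length
    bias: a solution with more steps is penalized simply for being longer.
    """
    return math.exp(sum(math.log(p) for p in step_probs))

# Per-step correctness probabilities from one PRM forward pass (illustrative).
step_probs = [0.99, 0.97, 0.95, 0.90]
print(prm_solution_score(step_probs))  # ~0.82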
When we provide process supervision, we deliberately choose to supervise only up to the first incorrect step. (If we were to provide additional process supervision beyond the first mistake, then process supervision would have an even greater information advantage.)
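A small sketch of how step labels might be truncated so supervision stops at the first incorrect step; the label encoding (+1 correct, -1 incorrect, 0 neutral) is an assumption for illustration.

```python
from typing import List

def truncate_at_first_error(step_labels: List[int]) -> List[int]:
    """Keep labels only up to and including the first incorrect step (-1).

    Steps after the first mistake receive no supervision, so their labels
    are discarded rather than used in training.
    """
    truncated = []
    for label in step_labels:
        truncated.append(label)
        if label == -1:  # first incorrect step: stop supervising here
            break
    return truncated

print(truncate_at_first_error([1, 1, -1, 1, -1]))  # [1, 1, -1]
```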
We present human data-labelers with step-by-step solutions to MATH problems sampled from the large-scale generator, and ask them to assign each step a label of positive (the step is correct), negative (the step is incorrect), or neutral (the step is ambiguous).
We surface convincing wrong-answer solutions to human labelers, i.e., solutions rated highly by the current best PRM that nevertheless reach an incorrect final answer (a selection sketch follows below).
We also iteratively re-train the PRM on the latest labeled data at several points during data collection.
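A hedged sketch of the surfacing criterion described above: keep only wrong-answer solutions and rank them by the current best PRM's score. The helpers `prm_score` and `is_final_answer_correct` are hypothetical stand-ins, not the paper's interfaces.

```python
from typing import Callable, List

def select_convincing_wrong_answers(
    solutions: List[str],
    prm_score: Callable[[str], float],
    is_final_answer_correct: Callable[[str], bool],
    top_k: int = 10,
) -> List[str]:
    """Surface solutions the current best PRM rates highly but whose final
    answer is wrong; these are the most informative to send to labelers."""
    wrong = [s for s in solutions if not is_final_answer_correct(s)]
    wrong.sort(key=prm_score, reverse=True)
    return wrong[:top_k]
```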
Large-scale Supervision (note that the ORM and PRM training sets are not directly comparable)
ORM: trained on 100 uniformly sampled solutions per problem from the generator. As a result, the ORM training set has no overlap with PRM800K and is an order of magnitude larger (see the sketch after this list).
PRM: trained on the PRM800K dataset
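A sketch of how such an ORM training set could be assembled, under the assumption that each sampled solution is labeled by automatic final-answer grading; `sample_solutions` and `grade_final_answer` are hypothetical helpers.

```python
from typing import Callable, Dict, List, Tuple

def build_orm_dataset(
    problems: List[Dict],
    sample_solutions: Callable[[Dict, int], List[str]],
    grade_final_answer: Callable[[Dict, str], bool],
    samples_per_problem: int = 100,
) -> List[Tuple[str, int]]:
    """Pair each uniformly sampled solution with a binary label indicating
    whether its final answer matches the reference answer."""
    dataset = []
    for problem in problems:
        for solution in sample_solutions(problem, samples_per_problem):
            label = int(grade_final_answer(problem, solution))
            dataset.append((solution, label))
    return dataset
```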
Comparison between ORM and PRM
Small-scale Supervision
Rationale for the small-scale experiment: the large-scale ORM and PRM training sets are not directly comparable, so small-scale ablations are needed to compare process and outcome supervision on equal footing.
Supervision (details in Appendix H): we first sample between 1 and 200 solutions per problem from a small-scale generator. For each dataset, we provide three forms of supervision: process supervision from PRM_large, outcome supervision from PRM_large, and outcome supervision from final-answer checking.
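A hedged sketch of one plausible way these three signals could be derived for a single sampled solution. The PRM_large scoring helper, the 0.5 threshold, and the exact way a solution-level label is read off PRM_large are all assumptions for illustration.

```python
from typing import Callable, Dict, List

def make_supervision(
    solution_steps: List[str],
    prm_large_step_probs: Callable[[List[str]], List[float]],
    final_answer_is_correct: bool,
    threshold: float = 0.5,  # assumed cutoff for treating a step as correct
) -> Dict[str, object]:
    """Derive three supervision signals for a small-scale reward model."""
    step_probs = prm_large_step_probs(solution_steps)
    # (1) Process supervision from PRM_large: one label per step.
    process_labels = [int(p >= threshold) for p in step_probs]
    # (2) Outcome supervision from PRM_large: a single solution-level label,
    #     here taken to mean PRM_large considers every step correct.
    outcome_from_prm_large = int(all(p >= threshold for p in step_probs))
    # (3) Outcome supervision from final-answer checking.
    outcome_from_answer_check = int(final_answer_is_correct)
    return {
        "process_from_prm_large": process_labels,
        "outcome_from_prm_large": outcome_from_prm_large,
        "outcome_from_answer_check": outcome_from_answer_check,
    }
```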
Comparison between ORM and PRM: In Figure 4a, we evaluate each reward model by its best-of-500 selection. In Figure 4b, we evaluate the best reward model from each series by its best-of-N performance across different values of N.
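A minimal sketch of best-of-N selection as used in this evaluation: the reward model picks its highest-scoring candidate out of N samples, and accuracy is the fraction of problems where that pick is correct. The helpers `reward_model_score`, `solutions_per_problem`, and `is_correct` are hypothetical.

```python
from typing import Callable, List, Sequence

def best_of_n(
    candidate_solutions: Sequence[str],
    reward_model_score: Callable[[str], float],
    n: int,
) -> str:
    """Return the highest-scoring solution among the first n candidates."""
    return max(candidate_solutions[:n], key=reward_model_score)

def best_of_n_accuracy(
    problems: List[dict],
    solutions_per_problem: Callable[[dict], Sequence[str]],
    reward_model_score: Callable[[str], float],
    is_correct: Callable[[dict, str], bool],
    n: int = 500,
) -> float:
    """Fraction of problems solved when the reward model selects one of n samples."""
    hits = 0
    for problem in problems:
        chosen = best_of_n(solutions_per_problem(problem), reward_model_score, n)
        hits += int(is_correct(problem, chosen))
    return hits / len(problems)
```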
Active Learning: