As shown in Figure 4, we train the verifier as follows:
Unless otherwise specified, we use the same model size for the generator and the verifier.
In addition to predicting solution correctness, we also train the verifier with the same language modeling objective as the generator.
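A minimal sketch of what this joint objective could look like in PyTorch, assuming a decoder-only backbone that exposes hidden states and LM logits plus a small scalar head for the correctness prediction; the interface, names, and equal loss weighting here are illustrative assumptions, not details taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class Verifier(nn.Module):
    """Illustrative verifier: an LM backbone plus a scalar value head."""
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # assumed to return (hidden_states, lm_logits)
        self.value_head = nn.Linear(hidden_size, 1)   # scalar correctness score per position

    def forward(self, input_ids):
        hidden, lm_logits = self.backbone(input_ids)
        values = self.value_head(hidden).squeeze(-1)  # (batch, seq_len)
        return lm_logits, values

def joint_loss(lm_logits, values, input_ids, is_correct, lm_weight=1.0):
    """Correctness prediction plus the generator's language modeling objective."""
    # Standard next-token prediction loss, shifted by one position.
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    # Correctness prediction from the final position (one score per solution).
    value_loss = F.binary_cross_entropy_with_logits(values[:, -1], is_correct.float())
    return value_loss + lm_weight * lm_loss
```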
Results:
Ablation study (this is probably the earliest form of the PRM vs. ORM idea): verifiers are trained either to make a single scalar prediction conditioned on the entire generated solution, or to make a scalar prediction after each token in the solution (a token-level value function); see the sketch after this list.
(a) Despite the initially slower training, the token-level verifier ultimately outperforms the solution-level verifier. Moreover, the token-level verifier is still improving late in training, whereas the solution-level verifier quickly shows signs of overfitting. We hypothesize that the full value function provides a useful auxiliary signal that encourages the model to judge the reasoning throughout solutions, rather than merely memorizing the correct final answer.
(b) Including the language modeling objective is a strict improvement.
(c) We find that using a large generator with a small verifier performs significantly better than using a small generator with a large verifier.
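To make the ablation concrete, here is a hedged sketch of the two supervision schemes, reusing the per-position `values` from the sketch above; the token-level variant simply broadcasts the solution's correctness label to every token position. Shapes and names are assumptions for illustration only.

```python
import torch.nn.functional as F

def value_loss(values, is_correct, token_level=True):
    """Solution-level: one scalar prediction conditioned on the whole solution.
    Token-level: a scalar prediction after every token (a token-level value function)."""
    is_correct = is_correct.float()
    if token_level:
        # Every position is supervised with the same solution-level label.
        targets = is_correct.unsqueeze(1).expand_as(values)  # (batch, seq_len)
        return F.binary_cross_entropy_with_logits(values, targets)
    # Solution-level: only the prediction at the final token is trained.
    return F.binary_cross_entropy_with_logits(values[:, -1], is_correct)
```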