As shown in Figure 4, we train the verifier as follows:
Unless otherwise specified, we use the same model size for the generator and the verifier.
In addition to predicting solution correctness, we also train the verifier with the same language modeling objective as the generator.
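A minimal sketch of what this joint objective could look like in PyTorch, assuming a decoder-only backbone that exposes hidden states and LM logits plus a small scalar head for the correctness prediction; the interface, names, and equal loss weighting here are illustrative assumptions, not details taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class Verifier(nn.Module):
    """Illustrative verifier: an LM backbone plus a scalar value head."""
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # assumed to return (hidden_states, lm_logits)
        self.value_head = nn.Linear(hidden_size, 1)   # scalar correctness score per position

    def forward(self, input_ids):
        hidden, lm_logits = self.backbone(input_ids)
        values = self.value_head(hidden).squeeze(-1)  # (batch, seq_len)
        return lm_logits, values

def joint_loss(lm_logits, values, input_ids, is_correct, lm_weight=1.0):
    """Correctness prediction plus the generator's language modeling objective."""
    # Standard next-token prediction loss, shifted by one position.
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    # Correctness prediction from the final position (one score per solution).
    value_loss = F.binary_cross_entropy_with_logits(values[:, -1], is_correct.float())
    return value_loss + lm_weight * lm_loss
```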
Results:
Ablation study (this is probably the earliest form of the PRM vs. ORM idea): verifiers are trained either to make a single scalar prediction conditioned on the entire generated solution, or to make a scalar prediction after each token in the solution (a token-level value function); see the sketch after this list.
(a) Despite the initially slower training, the token-level verifier ultimately outperforms the solution-level verifier. Moreover, the token-level verifier is still improving late in training, whereas the solution-level verifier quickly shows signs of overfitting. We hypothesize that the full value function provides a useful auxiliary signal that encourages the model to judge the reasoning throughout solutions, rather than merely memorizing the correct final answer.
(b) Including the language modeling objective is a strict improvement.
(c) We find that using a large generator with a small verifier performs significantly better than using a small generator with a large verifier.
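To make the ablation concrete, here is a hedged sketch of the two supervision schemes, reusing the per-position `values` from the sketch above; the token-level variant simply broadcasts the solution's correctness label to every token position. Shapes and names are assumptions for illustration only.

```python
import torch.nn.functional as F

def value_loss(values, is_correct, token_level=True):
    """Solution-level: one scalar prediction conditioned on the whole solution.
    Token-level: a scalar prediction after every token (a token-level value function)."""
    is_correct = is_correct.float()
    if token_level:
        # Every position is supervised with the same solution-level label.
        targets = is_correct.unsqueeze(1).expand_as(values)  # (batch, seq_len)
        return F.binary_cross_entropy_with_logits(values, targets)
    # Solution-level: only the prediction at the final token is trained.
    return F.binary_cross_entropy_with_logits(values[:, -1], is_correct)
```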