We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K.
Highlights
We have surpassed the state of the art: the ORM-RL and PRM-RL models achieve a final-answer error rate below 13%, improving on the 16.8% final-answer error rate of the current state-of-the-art model (Li et al., 2022). This is further reduced to 2.7% when the model is allowed to abstain on 30% of questions.
Few-shot+Final-Answer RL can be more token-efficient when only final-answer error rate is considered: The SFT and Few-shot+Final-Answer RL models attain similar final-answer error rates both without an RM (22.3% vs. 23.5%) and with an ORM (14.8% vs. 16.6%). This is notable, as Few-shot+Final-Answer RL only requires demonstrators to provide a final answer, rather than a full reasoning trace. Put another way, Few-shot+Final-Answer RL uses 1-4 tokens of label supervision per question, while SFT uses hundreds.
ORM-supervised reward models approximate PRM labels: ORM predictions tend to agree more with the PRM labels than with the ORM labels themselves (85% vs. 77%, averaged over all steps). We suspect this is because it is simpler for the ORM to learn to recognize when reasoning steps are correct than to verify the final answer by internally carrying out the computation itself.
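To make the step-level agreement figure concrete, the following is a minimal sketch (not the paper's code) of how such agreement can be computed. The array names `orm_step_preds`, `prm_step_labels`, and `orm_step_labels` are hypothetical stand-ins for per-step binary judgements, flattened over all reasoning steps in the evaluation set; the toy data is random and purely illustrative.

```python
import numpy as np

def agreement(preds: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of steps where the prediction matches the label."""
    return float(np.mean(preds == labels))

# Toy data: 1 = step judged correct, 0 = step judged incorrect.
rng = np.random.default_rng(0)
orm_step_preds = rng.integers(0, 2, size=1000)   # RM's per-step predictions
prm_step_labels = rng.integers(0, 2, size=1000)  # process-based (per-step) labels
orm_step_labels = rng.integers(0, 2, size=1000)  # outcome-based labels broadcast to steps

print("agreement with PRM labels:", agreement(orm_step_preds, prm_step_labels))
print("agreement with ORM labels:", agreement(orm_step_preds, orm_step_labels))
```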
No Reward Model (Section 3.1)
With Reward Model for reranking (Section 3.2; see the reranking sketch below)
With Reward Model for RL and reranking (Section 3.3)
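For the reranking setting referenced above, here is a minimal best-of-K reranking sketch under assumed interfaces: `sample_solutions` (a sampler over the policy model) and `rm_score` (a reward-model scorer) are hypothetical placeholders, and the default K is illustrative rather than the paper's setting.

```python
from typing import Callable, List

def rerank_with_reward_model(
    question: str,
    sample_solutions: Callable[[str, int], List[str]],  # hypothetical sampler (e.g. the policy LM)
    rm_score: Callable[[str, str], float],              # hypothetical reward-model scorer
    k: int = 96,                                        # illustrative sample budget
) -> str:
    """Sample K candidate solutions and return the one the RM scores highest."""
    candidates = sample_solutions(question, k)
    return max(candidates, key=lambda sol: rm_score(question, sol))
```

The same skeleton covers ORM- and PRM-based reranking; only the scoring function changes.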
RL significantly improves few-shot performance, but provides more modest gains on top of SFT.
ORM-RL and PRM-RL outperform Final-Answer RL across all three decoding strategies.
Selective Prediction (Section 3.4): allow the model to abstain rather than produce an incorrect output.
Selective prediction greatly reduces final-answer error, particularly for models with low trace error: Figure 5 shows that by abstaining on π = 30% of inputs, we reduce the final-answer error rate from 14.1% to 2.7%, which can be further reduced to 1.5% at π = 50%.
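As a rough illustration of this selective-prediction computation (abstain on the least-confident π fraction of questions and measure final-answer error on the rest), here is a sketch assuming a per-question confidence score such as the reward-model score of the selected solution; the function name and the random data in the usage example are hypothetical.

```python
import numpy as np

def selective_error_rate(confidences: np.ndarray,
                         is_correct: np.ndarray,
                         abstain_frac: float) -> float:
    """Abstain on the `abstain_frac` least-confident questions and return
    the final-answer error rate on the questions that are answered.

    `confidences` could be, e.g., the RM score of the selected solution;
    `is_correct` is a boolean array of final-answer correctness.
    """
    n = len(confidences)
    n_abstain = int(round(abstain_frac * n))
    # Keep the (n - n_abstain) most confident predictions.
    keep = np.argsort(confidences)[n_abstain:]
    return float(1.0 - is_correct[keep].mean())

# Toy usage with random data (numbers are illustrative only).
rng = np.random.default_rng(0)
conf = rng.random(1000)
correct = rng.random(1000) < 0.86
print(selective_error_rate(conf, correct, abstain_frac=0.30))
```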