We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K.
Highlights
We have surpassed the state of the art: the ORM-RL and PRM-RL models achieve a final-answer error rate below 13%, improving on the 16.8% final-answer error rate of the current state-of-the-art model (Li et al., 2022). This is further reduced to 2.7% when the model is allowed to abstain on 30% of questions.
Few-shot+Final-Answer RL can be more token-efficient when only final-answer error rate is considered: The SFT and Few-shot+Final-Answer RL models attain similar final-answer error rates both without an RM (22.3% vs. 23.5%) and with an ORM (14.8% vs. 16.6%). This is notable, as Few-shot+Final-Answer RL only requires demonstrators to provide a final answer, rather than a full reasoning trace. Put another way, Few-shot+Final-Answer RL uses 1-4 tokens of label supervision per question, while SFT uses hundreds.
ORM-supervised reward models approximate PRM labels: ORM predictions tend to agree more with the PRM labels than with the ORM labels themselves (85% vs. 77%, averaged over all steps). We suspect this is because it is simpler for the ORM to learn to recognize when reasoning steps are correct than to verify the final answer by internally carrying out the computation itself.
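To make the step-level agreement figure concrete, the following is a minimal sketch (not the paper's code) of how such agreement can be computed. The array names `orm_step_preds`, `prm_step_labels`, and `orm_step_labels` are hypothetical stand-ins for per-step binary judgements, flattened over all reasoning steps in the evaluation set; the toy data is random and purely illustrative.

```python
import numpy as np

def agreement(preds: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of steps where the prediction matches the label."""
    return float(np.mean(preds == labels))

# Toy data: 1 = step judged correct, 0 = step judged incorrect.
rng = np.random.default_rng(0)
orm_step_preds = rng.integers(0, 2, size=1000)   # RM's per-step predictions
prm_step_labels = rng.integers(0, 2, size=1000)  # process-based (per-step) labels
orm_step_labels = rng.integers(0, 2, size=1000)  # outcome-based labels broadcast to steps

print("agreement with PRM labels:", agreement(orm_step_preds, prm_step_labels))
print("agreement with ORM labels:", agreement(orm_step_preds, orm_step_labels))
```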
No Reward Model (Section 3.1)
With Reward Model for reranking (Section 3.2; see the reranking sketch below)
With Reward Model for RL and reranking (Section 3.3)
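For the reranking setting referenced above, here is a minimal best-of-K reranking sketch under assumed interfaces: `sample_solutions` (a sampler over the policy model) and `rm_score` (a reward-model scorer) are hypothetical placeholders, and the default K is illustrative rather than the paper's setting.

```python
from typing import Callable, List

def rerank_with_reward_model(
    question: str,
    sample_solutions: Callable[[str, int], List[str]],  # hypothetical sampler (e.g. the policy LM)
    rm_score: Callable[[str, str], float],              # hypothetical reward-model scorer
    k: int = 96,                                        # illustrative sample budget
) -> str:
    """Sample K candidate solutions and return the one the RM scores highest."""
    candidates = sample_solutions(question, k)
    return max(candidates, key=lambda sol: rm_score(question, sol))
```

The same skeleton covers ORM- and PRM-based reranking; only the scoring function changes.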
RL significantly improves few-shot performance, but provides more modest gains on top of SFT.
ORM-RL and PRM-RL outperform Final-Answer RL across all three decoding strategies.
Selective Prediction (Section 3.4): allow the model to abstain rather than produce an incorrect output.
Selective prediction greatly reduces final-answer error, particularly for models with low trace error: Figure 5 shows that by abstaining on π = 30% of inputs, we reduce the final-answer error rate from 14.1% to 2.7%, which can be further reduced to 1.5% at π = 50%.
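As a rough illustration of this selective-prediction computation (abstain on the least-confident π fraction of questions and measure final-answer error on the rest), here is a sketch assuming a per-question confidence score such as the reward-model score of the selected solution; the function name and the random data in the usage example are hypothetical.

```python
import numpy as np

def selective_error_rate(confidences: np.ndarray,
                         is_correct: np.ndarray,
                         abstain_frac: float) -> float:
    """Abstain on the `abstain_frac` least-confident questions and return
    the final-answer error rate on the questions that are answered.

    `confidences` could be, e.g., the RM score of the selected solution;
    `is_correct` is a boolean array of final-answer correctness.
    """
    n = len(confidences)
    n_abstain = int(round(abstain_frac * n))
    # Keep the (n - n_abstain) most confident predictions.
    keep = np.argsort(confidences)[n_abstain:]
    return float(1.0 - is_correct[keep].mean())

# Toy usage with random data (numbers are illustrative only).
rng = np.random.default_rng(0)
conf = rng.random(1000)
correct = rng.random(1000) < 0.86
print(selective_error_rate(conf, correct, abstain_frac=0.30))
```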