Overview of ReFT:
Models: We conduct extensive experiments using two foundational models, CodeLLAMA (Roziere et al., 2023) and Galactica (Taylor et al., 2022).
Datasets: We conduct experiments on three math problem datasets: GSM8K (Cobbe et al., 2021a), SVAMP (Patel et al., 2021) and MathQA (Amini et al., 2019).
Data format: In this work, we focus on natural language CoT (N-CoT) (Wei et al., 2022) (Figure 1) and program-based CoT (Gao et al., 2023) (P-CoT) using Python.
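To make the two formats concrete, here is a toy example of the same question in both styles (the wording and the solution() convention are my own illustration, not copied from the paper):

    # Question: Tom has 3 boxes with 4 apples each. How many apples does he have?

    # N-CoT (natural-language chain of thought), ending with the final answer:
    #   "Each box holds 4 apples and there are 3 boxes, so 3 * 4 = 12. The answer is 12."

    # P-CoT (program-based chain of thought): the annotation is a Python program
    # whose execution result is taken as the final answer.
    def solution():
        boxes = 3
        apples_per_box = 4
        return boxes * apples_per_box

    print(solution())  # 12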
Data construction: We perform few-shot prompting (Wei et al., 2022; Gao et al., 2023) using GPT-3.5-turbo to obtain both the N-CoT and P-CoT annotations.
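A minimal sketch of how that data-construction step could look with the OpenAI chat API (the prompt template, exemplar, and function name are my assumptions; the paper's actual prompts are not reproduced here):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # A hand-written (question, P-CoT) exemplar for few-shot prompting (illustrative only).
    FEW_SHOT_PROMPT = """Write a Python program that solves the math problem.

    Question: Tom has 3 boxes with 4 apples each. How many apples does he have?
    def solution():
        return 3 * 4

    Question: {question}
    """

    def annotate_pcot(question: str) -> str:
        """Query GPT-3.5-turbo for a program-based CoT annotation of one question."""
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": FEW_SHOT_PROMPT.format(question=question)}],
            temperature=0.0,
        )
        return resp.choices[0].message.content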
Algorithms:
Overview:
Warm-up stage: In this stage, the policy is fine-tuned for a few epochs on a dataset comprising the “(question, CoT)” tuples: (x, e). This gives the model the basic problem-solving skills needed to generate a proper response. The underlying concept is similar to the fine-tuning done in verifier training (Cobbe et al., 2021a), where the fine-tuned model is then used to generate multiple solutions.
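A minimal sketch of the warm-up stage, i.e. standard supervised fine-tuning on (question, CoT) pairs with next-token prediction; the model name, learning rate, and the choice to mask the loss on question tokens are my assumptions:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "codellama/CodeLlama-7b-hf"  # placeholder; any causal LM works for the sketch
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    policy = AutoModelForCausalLM.from_pretrained(model_name)
    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

    def warmup_step(question: str, cot: str) -> float:
        """One SFT step on a (question, CoT) pair; the loss covers only the CoT tokens."""
        q_ids = tokenizer(question, return_tensors="pt").input_ids
        full_ids = tokenizer(question + cot, return_tensors="pt").input_ids
        labels = full_ids.clone()
        labels[:, : q_ids.shape[1]] = -100  # ignore the question tokens in the loss
        loss = policy(input_ids=full_ids, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()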
Reinforcement Learning stage: In this stage, the policy improves its performance via a form of online self-learning using a dataset comprising (question, answer) tuples: (x, y). Specifically, the policy model learns by repeatedly sampling responses, evaluating each response's answer correctness, and updating its parameters in an online fashion (lines 7-14 in Algorithm 1).
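A sketch of that sample -> evaluate -> update cycle. The paper uses PPO with a value model and GAE (described below); this is a REINFORCE-style simplification with placeholder model/helper names, only meant to show the loop structure:

    import re
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # small stand-in; the paper uses 7B models
    policy = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

    def extract_answer(cot: str):
        """Crude stand-in for answer extraction: take the last number in the CoT."""
        nums = re.findall(r"-?\d+\.?\d*", cot)
        return nums[-1] if nums else None

    def reft_step(question: str, gold_answer: str) -> float:
        """Sample a CoT, score it by answer correctness, and update the policy."""
        prompt_ids = tokenizer(question, return_tensors="pt").input_ids
        sampled = policy.generate(prompt_ids, do_sample=True, max_new_tokens=128,
                                  pad_token_id=tokenizer.eos_token_id)
        cot = tokenizer.decode(sampled[0, prompt_ids.shape[1]:], skip_special_tokens=True)
        reward = 1.0 if extract_answer(cot) == gold_answer else 0.0

        # log-probability of the sampled CoT tokens under the current policy
        logits = policy(sampled).logits[:, :-1, :]
        logprobs = torch.log_softmax(logits, dim=-1)
        token_logprobs = logprobs.gather(-1, sampled[:, 1:].unsqueeze(-1)).squeeze(-1)
        cot_logprob = token_logprobs[:, prompt_ids.shape[1] - 1:].sum()

        loss = -reward * cot_logprob  # simple policy-gradient objective (not the paper's PPO loss)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return reward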
Value Model: Following Ziegler et al. (2019), the value model $V_{\phi}$ is constructed by appending a linear value head on top of the last hidden states of the policy model $\pi_{\theta}$, which is the model after the warm-up stage.
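A minimal sketch of that construction: a scalar linear head applied to the last hidden state at every token position (the class and attribute names are mine):

    import torch
    import torch.nn as nn
    from transformers import AutoModelForCausalLM

    class PolicyWithValueHead(nn.Module):
        """Causal LM policy plus a linear value head over the last hidden states."""
        def __init__(self, model_name: str):
            super().__init__()
            self.policy = AutoModelForCausalLM.from_pretrained(model_name)
            hidden_size = self.policy.config.hidden_size
            self.value_head = nn.Linear(hidden_size, 1)  # one scalar value per token position

        def forward(self, input_ids: torch.Tensor):
            out = self.policy(input_ids, output_hidden_states=True)
            last_hidden = out.hidden_states[-1]                 # (batch, seq_len, hidden)
            values = self.value_head(last_hidden).squeeze(-1)   # (batch, seq_len)
            return out.logits, values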
Reward Model:
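As I understand it, ReFT does not train a learned reward model; the reward comes from a rule that checks the final answer of the sampled CoT against the ground truth, with a small partial reward when an answer can be extracted but is wrong (the 0.1 value and the extraction regex below are how I recall/reconstruct it, so verify against the paper). A minimal sketch:

    import re

    def rule_based_reward(generated_cot: str, gold_answer: str) -> float:
        """Rule-based answer-correctness reward (my reconstruction, not a learned model):
        1.0 for a correct answer, a small partial reward for an extractable-but-wrong
        answer, and 0.0 when no answer can be extracted."""
        nums = re.findall(r"-?\d+\.?\d*", generated_cot)
        if not nums:
            return 0.0                                    # no extractable answer
        return 1.0 if nums[-1] == gold_answer else 0.1    # 0.1 partial reward (as I recall)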
RL techniques I do not understand yet:
The generalized advantage estimate (Schulman et al., 2018) is used for advantage calculation:
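Written out (my transcription of the standard GAE form, using the value model $V_{\phi}$ above, discount $\gamma$, and GAE parameter $\lambda$): $\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\,\delta_{t+l}$, where $\delta_{t} = r_{t} + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_{t})$ and $T$ is the length of the sampled response.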
To estimate the return, we leverage the λ-return $\hat{R}_t$, which can be written as the sum of the generalized advantage estimate and the value estimate:
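That is (my transcription): $\hat{R}_t = \hat{A}_t + V_{\phi}(s_t)$.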
Baselines: We compare ReFT with SFT and self-training (Xie et al., 2020; Amini et al., 2022) baselines.
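For reference, a minimal sketch of the offline self-training baseline as I understand it: sample CoTs from the SFT model, keep only those whose answers are correct, and fine-tune on the augmented data. The callables `sft_sample`, `is_correct`, and `fine_tune` are placeholders:

    def offline_self_training(sft_sample, is_correct, fine_tune, questions, gold_answers, k=8):
        """Offline self-training (my sketch): pseudo-label questions with the SFT model's
        own correct CoTs, then fine-tune on the resulting pairs."""
        pseudo_labeled = []
        for q, y in zip(questions, gold_answers):
            for cot in sft_sample(q, num_samples=k):
                if is_correct(cot, y):
                    pseudo_labeled.append((q, cot))
                    break  # keep at most one correct CoT per question
        fine_tune(pseudo_labeled)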
Experimental Setup:
Results
ReFT Outperforms SFT
Reward Hacking for MathQA:
Majority Voting and Reranking: We perform sampling from both the SFT and ReFT policies, drawing 100 CoT solutions for majority voting and reranking.
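A minimal sketch of the majority-voting step over the sampled CoT answers (reranking would instead score each CoT with a reward model and keep the top one); `extract_answer` is a placeholder:

    from collections import Counter

    def majority_vote(sampled_cots, extract_answer):
        """Return the most common extracted answer across the sampled CoT solutions."""
        answers = [extract_answer(cot) for cot in sampled_cots]
        answers = [a for a in answers if a is not None]
        return Counter(answers).most_common(1)[0][0] if answers else None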
Our best result reported in Table 4, i.e., the CodeLLAMA + ReFT + Reranking setting with P-CoT, even surpasses GPT-3.5-turbo, and it is obtained with a model of only 7B parameters.
Experiments with Small Model
Ablation Study
When ReFT surpasses SFT: We perform ReFT training with different numbers of warm-up epochs from SFT. Specifically, if the warm-up is 3 epochs, the policy is initialized from the 3rd-epoch SFT checkpoint.
I would be really interested in (but the paper lacks):