Overview of ReFT:
Models: We conduct extensive experiments using two foundational models, CodeLLAMA (Roziere et al., 2023) and Galactica (Taylor et al., 2022).
Datasets: We conduct experiments on three math problem datasets: GSM8K (Cobbe et al., 2021a), SVAMP (Patel et al., 2021) and MathQA (Amini et al., 2019).
Data format: In this work, we focus on natural language CoT (N-CoT) (Wei et al., 2022) (Figure 1) and program-based CoT (Gao et al., 2023) (P-CoT) using Python.
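To make the two formats concrete, here is a toy example of the same question in both styles (the wording and the solution() convention are my own illustration, not copied from the paper):

    # Question: Tom has 3 boxes with 4 apples each. How many apples does he have?

    # N-CoT (natural-language chain of thought), ending with the final answer:
    #   "Each box holds 4 apples and there are 3 boxes, so 3 * 4 = 12. The answer is 12."

    # P-CoT (program-based chain of thought): the annotation is a Python program
    # whose execution result is taken as the final answer.
    def solution():
        boxes = 3
        apples_per_box = 4
        return boxes * apples_per_box

    print(solution())  # 12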
Data construction: We perform few-shot prompting (Wei et al., 2022; Gao et al., 2023) using GPT-3.5-turbo to obtain both the N-CoT and P-CoT annotations.
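A minimal sketch of how that data-construction step could look with the OpenAI chat API (the prompt template, exemplar, and function name are my assumptions; the paper's actual prompts are not reproduced here):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # A hand-written (question, P-CoT) exemplar for few-shot prompting (illustrative only).
    FEW_SHOT_PROMPT = """Write a Python program that solves the math problem.

    Question: Tom has 3 boxes with 4 apples each. How many apples does he have?
    def solution():
        return 3 * 4

    Question: {question}
    """

    def annotate_pcot(question: str) -> str:
        """Query GPT-3.5-turbo for a program-based CoT annotation of one question."""
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": FEW_SHOT_PROMPT.format(question=question)}],
            temperature=0.0,
        )
        return resp.choices[0].message.content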
Algorithms:
Overview:
Warm-up stage: In this stage, the policy is fine-tuned for a few epochs on a dataset comprising the “(question, CoT)” tuples: (x, e). This gives the model the basic problem-solving skills needed to generate a proper response. The underlying concept is similar to the fine-tuning done in verifier training (Cobbe et al., 2021a), where the fine-tuned model is then used to generate multiple solutions.
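A minimal sketch of the warm-up stage, i.e. standard supervised fine-tuning on (question, CoT) pairs with next-token prediction; the model name, learning rate, and the choice to mask the loss on question tokens are my assumptions:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "codellama/CodeLlama-7b-hf"  # placeholder; any causal LM works for the sketch
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    policy = AutoModelForCausalLM.from_pretrained(model_name)
    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

    def warmup_step(question: str, cot: str) -> float:
        """One SFT step on a (question, CoT) pair; the loss covers only the CoT tokens."""
        q_ids = tokenizer(question, return_tensors="pt").input_ids
        full_ids = tokenizer(question + cot, return_tensors="pt").input_ids
        labels = full_ids.clone()
        labels[:, : q_ids.shape[1]] = -100  # ignore the question tokens in the loss
        loss = policy(input_ids=full_ids, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()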
Reinforcement Learning stage: In this stage, the policy improves its performance via a form of online self-learning using a dataset comprising (question, answer) tuples: (x, y). Specifically, the policy model learns by repeatedly sampling responses, evaluating each response's answer correctness, and updating its parameters in an online fashion (lines 7-14 in Algorithm 1).
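A sketch of that sample -> evaluate -> update cycle. The paper uses PPO with a value model and GAE (described below); this is a REINFORCE-style simplification with placeholder model/helper names, only meant to show the loop structure:

    import re
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # small stand-in; the paper uses 7B models
    policy = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

    def extract_answer(cot: str):
        """Crude stand-in for answer extraction: take the last number in the CoT."""
        nums = re.findall(r"-?\d+\.?\d*", cot)
        return nums[-1] if nums else None

    def reft_step(question: str, gold_answer: str) -> float:
        """Sample a CoT, score it by answer correctness, and update the policy."""
        prompt_ids = tokenizer(question, return_tensors="pt").input_ids
        sampled = policy.generate(prompt_ids, do_sample=True, max_new_tokens=128,
                                  pad_token_id=tokenizer.eos_token_id)
        cot = tokenizer.decode(sampled[0, prompt_ids.shape[1]:], skip_special_tokens=True)
        reward = 1.0 if extract_answer(cot) == gold_answer else 0.0

        # log-probability of the sampled CoT tokens under the current policy
        logits = policy(sampled).logits[:, :-1, :]
        logprobs = torch.log_softmax(logits, dim=-1)
        token_logprobs = logprobs.gather(-1, sampled[:, 1:].unsqueeze(-1)).squeeze(-1)
        cot_logprob = token_logprobs[:, prompt_ids.shape[1] - 1:].sum()

        loss = -reward * cot_logprob  # simple policy-gradient objective (not the paper's PPO loss)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return reward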
Value Model: Following Ziegler et al. (2019), the value model $V_{\phi}$ is constructed by appending a linear value head on top of the last hidden states of the policy model $\pi_{\theta}$, which is the model after the warm-up stage.
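A minimal sketch of that construction: a scalar linear head applied to the last hidden state at every token position (the class and attribute names are mine):

    import torch
    import torch.nn as nn
    from transformers import AutoModelForCausalLM

    class PolicyWithValueHead(nn.Module):
        """Causal LM policy plus a linear value head over the last hidden states."""
        def __init__(self, model_name: str):
            super().__init__()
            self.policy = AutoModelForCausalLM.from_pretrained(model_name)
            hidden_size = self.policy.config.hidden_size
            self.value_head = nn.Linear(hidden_size, 1)  # one scalar value per token position

        def forward(self, input_ids: torch.Tensor):
            out = self.policy(input_ids, output_hidden_states=True)
            last_hidden = out.hidden_states[-1]                 # (batch, seq_len, hidden)
            values = self.value_head(last_hidden).squeeze(-1)   # (batch, seq_len)
            return out.logits, values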
Reward Model:
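As I understand it, ReFT does not train a learned reward model; the reward comes from a rule that checks the final answer of the sampled CoT against the ground truth, with a small partial reward when an answer can be extracted but is wrong (the 0.1 value and the extraction regex below are how I recall/reconstruct it, so verify against the paper). A minimal sketch:

    import re

    def rule_based_reward(generated_cot: str, gold_answer: str) -> float:
        """Rule-based answer-correctness reward (my reconstruction, not a learned model):
        1.0 for a correct answer, a small partial reward for an extractable-but-wrong
        answer, and 0.0 when no answer can be extracted."""
        nums = re.findall(r"-?\d+\.?\d*", generated_cot)
        if not nums:
            return 0.0                                    # no extractable answer
        return 1.0 if nums[-1] == gold_answer else 0.1    # 0.1 partial reward (as I recall)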
RL techniques I do not understand yet:
The generalized advantage estimate (Schulman et al., 2018) is used for advantage calculation:
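Written out (my transcription of the standard GAE form, using the value model $V_{\phi}$ above, discount $\gamma$, and GAE parameter $\lambda$): $\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\,\delta_{t+l}$, where $\delta_{t} = r_{t} + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_{t})$ and $T$ is the length of the sampled response.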
To estimate the return, we leverage the λ-return $\hat{R}_t$, which can be written as the sum of the generalized advantage estimate and the value estimate:
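That is (my transcription): $\hat{R}_t = \hat{A}_t + V_{\phi}(s_t)$.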
Baselines: We compare ReFT with SFT and self-training (Xie et al., 2020; Amini et al., 2022) baselines.
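For reference, a minimal sketch of the offline self-training baseline as I understand it: sample CoTs from the SFT model, keep only those whose answers are correct, and fine-tune on the augmented data. The callables `sft_sample`, `is_correct`, and `fine_tune` are placeholders:

    def offline_self_training(sft_sample, is_correct, fine_tune, questions, gold_answers, k=8):
        """Offline self-training (my sketch): pseudo-label questions with the SFT model's
        own correct CoTs, then fine-tune on the resulting pairs."""
        pseudo_labeled = []
        for q, y in zip(questions, gold_answers):
            for cot in sft_sample(q, num_samples=k):
                if is_correct(cot, y):
                    pseudo_labeled.append((q, cot))
                    break  # keep at most one correct CoT per question
        fine_tune(pseudo_labeled)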
Experimental Setup:
Results
ReFT Outperforms SFT
Reward Hacking for MathQA:
Majority Voting and Reranking: We perform sampling from both the SFT and ReFT policies, drawing 100 CoT solutions for majority voting and reranking.
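A minimal sketch of the majority-voting step over the sampled CoT answers (reranking would instead score each CoT with a reward model and keep the top one); `extract_answer` is a placeholder:

    from collections import Counter

    def majority_vote(sampled_cots, extract_answer):
        """Return the most common extracted answer across the sampled CoT solutions."""
        answers = [extract_answer(cot) for cot in sampled_cots]
        answers = [a for a in answers if a is not None]
        return Counter(answers).most_common(1)[0][0] if answers else None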
Our best result reported in Table 4, i.e., the CodeLLAMA + ReFT + Reranking setting with P-CoT, even surpasses GPT-3.5-turbo, and it is obtained with a model of only 7B parameters.
Experiments with Small Model
Ablation Study
When ReFT surpasses SFT: We perform ReFT training with different numbers of warm-up epochs from SFT. Specifically, if the warm-up is 3 epochs, the policy is initialized from the 3rd-epoch SFT checkpoint.
I would be really interested in (but the paper lacks):