Notations:
Construction of the MC Tree: For each question, we build a Monte Carlo Tree, as shown in Fig. 1; this tree is the final output of the data-collection process for each question.
Each node $s$ in the tree contains the question $q$ and prefix solution $x_{1:t}$, together with all previous rollouts $\{(s, r_i)\}_{i=1}^k$ from the state. The nodes also store a set of statistics $\{N(s), \text{MC}(s), Q(s, r)\},$
where $N(s)$ denotes the visit count of a state,
$\text{MC}(s)$ represents the Monte Carlo estimation of a state, calculated as the Monte Carlo (MC) ratio: $c_t = \text{MC}(q, x_{1:t}) = \text{MC}(s) = \frac{\text{num(correct rollouts from the } t\text{-th step)}}{\text{num(total rollouts from the } t\text{-th step)}},$ i.e., $c_t$ measures the proportion of rollouts from the $t$-th step that reach a correct final answer.
$Q(s, r)$ is a state-rollout value function that is correlated to the chance of selecting a rollout during the selection phase of tree traversal.
Specifically, $Q(s, r) = \alpha^{1 - \text{MC}(s)} \cdot \beta^{\frac{\text{len}(r)}{L}},$ where $\alpha$ and $\beta$ are constants, and $L$ represents the maximum rollout length.
Each edge $(s, a)$ is either a single step or a sequence of consecutive steps from the node $s$.
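The node statistics above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the class name `Node`, the helper `q_value`, and the default values for $\alpha$, $\beta$, and $L$ are all assumptions for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A tree node holding the question q plus prefix solution x_{1:t}."""
    prefix: str
    # Rollouts from this state, stored as (rollout_text, is_correct) pairs.
    rollouts: list = field(default_factory=list)
    visit_count: int = 0  # N(s)

    def mc(self) -> float:
        """MC(s): fraction of rollouts from this state that are correct."""
        if not self.rollouts:
            return 0.0
        return sum(ok for _, ok in self.rollouts) / len(self.rollouts)

def q_value(node: Node, rollout: str, alpha: float = 0.5,
            beta: float = 0.9, max_len: int = 2048) -> float:
    """Q(s, r) = alpha^(1 - MC(s)) * beta^(len(r) / L).

    alpha, beta, and max_len are placeholder constants for this sketch.
    """
    return alpha ** (1 - node.mc()) * beta ** (len(rollout) / max_len)
```

A shorter rollout from a state with high MC(s) receives a higher Q(s, r), which biases tree traversal toward promising, cheap-to-verify rollouts.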
Monte Carlo Tree Search
Objective: As suggested by Lightman et al. (2023), supervising up to the first incorrect step in a solution is sufficient to train a PRM. Therefore, our objective is to locate the first error efficiently.
Overview of three stages: The dotted lines in the Select stage represent the rollouts available for binary search. The bold colored edges represent steps with correctness estimations: yellow indicates a correct step, i.e., one whose preceding state $s$ satisfies $\text{MC}(s) > 0$, and blue indicates an incorrect step, i.e., one with $\text{MC}(s) = 0$. The number of dashes in each colored edge indicates the number of steps.
Selection Stage: In the selection phase, we maintain a pool of all rollouts $\{(s_i, r_i^t)\}$ from previous searches that satisfy $0 < \text{MC}(s_i) < 1$.
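Since $Q(s, r)$ is only stated to be correlated with the chance of selection, the exact policy is open; one simple choice is a greedy argmax over $Q$, sketched below. The tuple layout of `pool` is an assumption for this sketch.

```python
def select(pool):
    """Pick the (state, rollout) pair with the highest Q(s, r).

    `pool` holds (mc, q, state, rollout) tuples; only entries with
    0 < MC(s) < 1 are eligible, since MC = 0 or 1 means the rollout's
    correctness is already fully resolved.
    """
    candidates = [item for item in pool if 0.0 < item[0] < 1.0]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[1])  # argmax over Q(s, r)
```

A stochastic variant (sampling proportionally to $Q$) would also be consistent with the description above.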
Binary Search Stage (to identify the first error location in the selected rollout): Given the inefficiency of performing rollouts for every step (as done in previous works), we propose a binary search to find the first incorrect step as follows:
Split the solution at the midpoint $m$ and perform rollouts from the prefix $x_{1:m}$.
If $c_m > 0$, indicating at least one correct rollout, the error is in the latter half of the solution.
If $c_m = 0$, the error is in the first half.
This process iterates until the first error is isolated, reducing complexity to $O(k \log M)$ instead of $O(kM)$, where $k$ is the number of rollouts per position and $M$ is the total number of steps in the solution.
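The three rules above amount to a standard binary search for the smallest step index $m$ with $c_m = 0$. A minimal sketch, assuming a `monte_carlo(m)` callback that performs $k$ rollouts from the prefix $x_{1:m}$ and returns the fraction that reach a correct answer (and assuming the full solution is known to end incorrectly, so a first error exists):

```python
def find_first_error(num_steps, monte_carlo):
    """Return the 1-based index of the first incorrect step.

    Invariant: the first error lies in [lo, hi]. Each iteration spends
    O(k) rollouts at the midpoint, for O(k log M) rollouts overall.
    """
    lo, hi = 1, num_steps
    while lo < hi:
        mid = (lo + hi) // 2
        if monte_carlo(mid) > 0:
            # c_m > 0: at least one correct rollout, so steps 1..m are
            # fine and the error lies in the latter half.
            lo = mid + 1
        else:
            # c_m = 0: the error is at or before step m.
            hi = mid
    return lo
```

Compared with rolling out at every step, this touches only $O(\log M)$ positions per solution.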
Visualization:
All divide-and-rollout positions before the first error become new states. The trajectory $s[q] \to s[q, x_{1:4}] \to s[q, x_{1:6}] \to s[q, x_{1:7}]$ is added to the tree after the binary search. The edges $s[q] \to s[q, x_{1:4}]$ and $s[q, x_{1:4}] \to s[q, x_{1:6}]$ are correct, with MC values of 0.25 and 0.5, respectively, while the edge $s[q, x_{1:6}] \to s[q, x_{1:7}]$ is incorrect, with an MC value of 0.
Maintain Stage: After the binary search, the tree statistics $N(s)$, $\text{MC}(s)$, and $Q(s, r)$ are updated. Specifically, $N(s)$ is incremented by 1 for the selected $(s, r)$, and both $\text{MC}(s)$ and $Q(s, r)$ are updated for the new rollouts sampled during the binary search.
PRM Training: For the main results, we train with a pointwise soft label, using the Monte Carlo estimation $\text{MC}(s)$ directly as the correctness label for each step.
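A pointwise soft label typically means regressing the PRM's per-step probability toward $\text{MC}(s)$ with binary cross-entropy. The function below is an illustrative sketch of that objective, not the paper's training code:

```python
import math

def soft_label_bce(p, c, eps=1e-12):
    """Pointwise soft-label BCE: -[c*log(p) + (1-c)*log(1-p)].

    p is the PRM's predicted correctness probability for a step;
    c = MC(s) in [0, 1] is the soft correctness label. The loss is
    minimized when p matches c.
    """
    p = min(max(p, eps), 1 - eps)  # clamp for numerical stability
    return -(c * math.log(p) + (1 - c) * math.log(1 - p))
```

With hard labels, $c$ would be thresholded to 0 or 1; the soft version retains the uncertainty in the Monte Carlo estimate.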
Main results
Step Distribution:
Unlike Lightman et al. (2023) and Wang et al. (2024a,b), which use newlines as step delimiters, we propose a more flexible method for step division, treating any sequence of consecutive tokens in a solution as a valid step. We observe that many step divisions in Math-Shepherd lack semantic coherence to some extent, so we hypothesize that semantically explicit segmentation is not necessary for training a PRM.
During binary search, we aim to divide a full solution into 16 pieces. To calculate the expected step length, we divide the average solution length by 16. The binary search terminates when a step is shorter than this value.
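The termination rule can be stated as a one-line threshold. A minimal sketch, with illustrative names and the target of 16 pieces taken from the text:

```python
def expected_step_length(avg_solution_len, target_pieces=16):
    """Expected step length: average solution length / target piece count."""
    return avg_solution_len / target_pieces

def should_stop_splitting(span_len, avg_solution_len, target_pieces=16):
    """Binary search stops splitting a span shorter than the expected step length."""
    return span_len < expected_step_length(avg_solution_len, target_pieces)
```

Lengths here could be measured in tokens or characters; either way the rule caps the recursion depth so each solution yields roughly 16 steps.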