GenRM represents solution correctness using the LLM’s probability distribution over tokens, instead of predicting a separate numerical score with a scalar reward head.
Direct Verifier: the model is trained to directly answer a question like “Is the answer correct (Yes/No)?”, and the reward is the probability assigned to the ‘Yes’ token.
GenRM-CoT: the model first generates a verification chain-of-thought (rationale) and only then emits the Yes/No verdict.
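As a toy sketch of how a Direct verifier reads a reward off the token distribution: assuming the verification prompt ends with a question such as “Is the answer correct (Yes/No)?”, the score is the softmax probability of the ‘Yes’ token at the next position. The logit values below are made up for illustration; a real implementation would take them from the LLM’s final-position logits.

```python
import math

def yes_probability(logits: dict) -> float:
    """Score a solution as the softmax probability of the 'Yes' token,
    given the model's next-token logits after the verification question."""
    m = max(logits.values())  # subtract max for numerical stability
    exp = {tok: math.exp(v - m) for tok, v in logits.items()}
    return exp["Yes"] / sum(exp.values())

# Hypothetical logits over a tiny slice of the vocabulary.
logits = {"Yes": 2.0, "No": 0.5, "Maybe": -1.0}
score = yes_probability(logits)
```

The reward is thus a continuous value in (0, 1), even though the model only ever predicts tokens.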
Unifying Generation and Verification
Given a verification dataset $\mathcal{D}_{\text{verify}}$, which can be either $\mathcal{D}_{\text{Direct}}$ or $\mathcal{D}_{\text{CoT}}$, consisting of problem–solution pairs with correctness tokens (optionally with CoT rationales), GenRM minimizes the loss:
$\mathcal{L}_{\text{GenRM}}(\theta, \mathcal{D}_{\text{verify}}) = \mathcal{L}_{\text{SFT}}(\theta, \mathcal{D}_{\text{verify}}) + \lambda \mathcal{L}_{\text{SFT}}(\theta, \mathcal{D}_{\text{correct}}),$
where $\lambda > 0$ is a hyperparameter that controls the mixture ratio between verification data ($\mathcal{D}_{\text{verify}}$) and generating correct solutions ($\mathcal{D}_{\text{correct}}$). (It seems $\mathcal{D}_{\text{correct}}$ is just the correct (Q, A) pairs from the generator, i.e., without the verification text, but I need to confirm this.)
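A minimal numeric sketch of this objective, with the token-level SFT cross-entropy stood in by per-target-token probabilities (all values, and the $\lambda = 0.5$ default, are illustrative rather than from the paper):

```python
import math

def sft_nll(token_probs: list) -> float:
    """Mean negative log-likelihood of the target tokens (the SFT loss)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def genrm_loss(verify_probs: list, correct_probs: list, lam: float = 0.5) -> float:
    """L_GenRM = L_SFT(D_verify) + lam * L_SFT(D_correct).

    lam trades off training on verification examples against
    continuing to train on generating correct solutions."""
    return sft_nll(verify_probs) + lam * sft_nll(correct_probs)

loss = genrm_loss(verify_probs=[0.9, 0.8], correct_probs=[0.7], lam=1.0)
```

Because both terms are ordinary next-token prediction losses, a single model can be trained to both verify and generate from the same data pipeline.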
Synthetic verification CoT rationales for training:
One naïve approach is to prompt with ‘Let’s verify step by step’ given a problem–solution pair and keep only the generated rationales whose final verdict matches the solution’s known correctness; however, the surviving rationales are still of poor quality.
Reference-guided grading is the method they use (see Table A.2): a reference solution is provided in addition to the problem and the solution to verify, making it easier for the LLM to point out any reasoning error in the provided solution.
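The verdict-matching filter above might be sketched as follows. The exact verdict string (“(Yes/No)? Yes”) and the `filter_rationales` helper are assumptions for illustration; the real prompt and answer format may differ.

```python
import re

# Assumed verdict pattern at the end of each synthetic rationale.
VERDICT = re.compile(r"\(Yes/No\)\?\s*(Yes|No)")

def filter_rationales(candidates: list, is_correct: bool) -> list:
    """Keep only synthetic CoT rationales whose final Yes/No verdict
    agrees with the solution's known ground-truth correctness."""
    kept = []
    for cot in candidates:
        m = VERDICT.search(cot)
        if m and (m.group(1) == "Yes") == is_correct:
            kept.append(cot)
    return kept

kept = filter_rationales(
    [
        "Step 1 checks out. Is the answer correct (Yes/No)? Yes",
        "Step 2 is wrong. Is the answer correct (Yes/No)? No",
        "rambling text with no verdict",
    ],
    is_correct=True,
)
```

Rationales with no parseable verdict are simply dropped, so the filter is conservative.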
For training discriminative verifiers, they always use a balanced data mixture of correct and incorrect problem–solution pairs.
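One simple way to realize that 1:1 mixture is to downsample the larger class; this is a sketch (`balanced_mixture` is a hypothetical helper, and downsampling is only one of several possible balancing choices):

```python
import random

def balanced_mixture(correct_pairs: list, incorrect_pairs: list, seed: int = 0) -> list:
    """Downsample the larger class so correct and incorrect
    problem-solution pairs appear in a 1:1 ratio, then shuffle."""
    rng = random.Random(seed)
    n = min(len(correct_pairs), len(incorrect_pairs))
    batch = rng.sample(correct_pairs, n) + rng.sample(incorrect_pairs, n)
    rng.shuffle(batch)
    return batch

# Toy data: 5 correct pairs (ints) vs. 3 incorrect pairs (strings).
batch = balanced_mixture([1, 2, 3, 4, 5], ["a", "b", "c"])
```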