GenRM represents solution correctness using the LLM’s probability distribution over tokens, instead of predicting a separate numerical score with a scalar reward head.
Direct Verifier: the model is trained to directly answer a question like “Is the answer correct (Yes/No)?”, and the reward is the probability assigned to the ‘Yes’ token.
GenRM-CoT: the model first generates a verification chain-of-thought (rationale) and only then emits the Yes/No verdict.
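As a toy sketch of how a Direct verifier reads a reward off the token distribution: assuming the verification prompt ends with a question such as “Is the answer correct (Yes/No)?”, the score is the softmax probability of the ‘Yes’ token at the next position. The logit values below are made up for illustration; a real implementation would take them from the LLM’s final-position logits.

```python
import math

def yes_probability(logits: dict) -> float:
    """Score a solution as the softmax probability of the 'Yes' token,
    given the model's next-token logits after the verification question."""
    m = max(logits.values())  # subtract max for numerical stability
    exp = {tok: math.exp(v - m) for tok, v in logits.items()}
    return exp["Yes"] / sum(exp.values())

# Hypothetical logits over a tiny slice of the vocabulary.
logits = {"Yes": 2.0, "No": 0.5, "Maybe": -1.0}
score = yes_probability(logits)
```

The reward is thus a continuous value in (0, 1), even though the model only ever predicts tokens.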
Unifying Generation and Verification
Given a verification dataset $\mathcal{D}_{\text{verify}}$, which can be either $\mathcal{D}_{\text{Direct}}$ or $\mathcal{D}_{\text{CoT}}$, consisting of problem–solution pairs with correctness tokens (optionally with CoT rationales), GenRM minimizes the loss:
$\mathcal{L}_{\text{GenRM}}(\theta, \mathcal{D}_{\text{verify}}) = \mathcal{L}_{\text{SFT}}(\theta, \mathcal{D}_{\text{verify}}) + \lambda \mathcal{L}_{\text{SFT}}(\theta, \mathcal{D}_{\text{correct}}),$
where $\lambda > 0$ is a hyperparameter that controls the mixture ratio between verification data ($\mathcal{D}_{\text{verify}}$) and generating correct solutions ($\mathcal{D}_{\text{correct}}$). (It seems $\mathcal{D}_{\text{correct}}$ is just the correct (Q, A) pairs from the generator, i.e., without the verification text, but I need to confirm this.)
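A minimal numeric sketch of this objective, with the token-level SFT cross-entropy stood in by per-target-token probabilities (all values, and the $\lambda = 0.5$ default, are illustrative rather than from the paper):

```python
import math

def sft_nll(token_probs: list) -> float:
    """Mean negative log-likelihood of the target tokens (the SFT loss)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def genrm_loss(verify_probs: list, correct_probs: list, lam: float = 0.5) -> float:
    """L_GenRM = L_SFT(D_verify) + lam * L_SFT(D_correct).

    lam trades off training on verification examples against
    continuing to train on generating correct solutions."""
    return sft_nll(verify_probs) + lam * sft_nll(correct_probs)

loss = genrm_loss(verify_probs=[0.9, 0.8], correct_probs=[0.7], lam=1.0)
```

Because both terms are ordinary next-token prediction losses, a single model can be trained to both verify and generate from the same data pipeline.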
Synthetic verification CoT rationales for training:
One naïve approach is to prompt with ‘Let’s verify step by step’ given a problem–solution pair and keep only the generated rationales whose final verdict matches the solution’s known correctness; however, the surviving rationales are still of poor quality.
Reference-guided grading is the method they use (see Table A.2): a reference solution is provided in addition to the problem and the solution to verify, making it easier for the LLM to point out any reasoning error in the provided solution.
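The verdict-matching filter above might be sketched as follows. The exact verdict string (“(Yes/No)? Yes”) and the `filter_rationales` helper are assumptions for illustration; the real prompt and answer format may differ.

```python
import re

# Assumed verdict pattern at the end of each synthetic rationale.
VERDICT = re.compile(r"\(Yes/No\)\?\s*(Yes|No)")

def filter_rationales(candidates: list, is_correct: bool) -> list:
    """Keep only synthetic CoT rationales whose final Yes/No verdict
    agrees with the solution's known ground-truth correctness."""
    kept = []
    for cot in candidates:
        m = VERDICT.search(cot)
        if m and (m.group(1) == "Yes") == is_correct:
            kept.append(cot)
    return kept

kept = filter_rationales(
    [
        "Step 1 checks out. Is the answer correct (Yes/No)? Yes",
        "Step 2 is wrong. Is the answer correct (Yes/No)? No",
        "rambling text with no verdict",
    ],
    is_correct=True,
)
```

Rationales with no parseable verdict are simply dropped, so the filter is conservative.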
For training discriminative verifiers, they always use a balanced data mixture of correct and incorrect problem–solution pairs.
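One simple way to realize that 1:1 mixture is to downsample the larger class; this is a sketch (`balanced_mixture` is a hypothetical helper, and downsampling is only one of several possible balancing choices):

```python
import random

def balanced_mixture(correct_pairs: list, incorrect_pairs: list, seed: int = 0) -> list:
    """Downsample the larger class so correct and incorrect
    problem-solution pairs appear in a 1:1 ratio, then shuffle."""
    rng = random.Random(seed)
    n = min(len(correct_pairs), len(incorrect_pairs))
    batch = rng.sample(correct_pairs, n) + rng.sample(incorrect_pairs, n)
    rng.shuffle(batch)
    return batch

# Toy data: 5 correct pairs (ints) vs. 3 incorrect pairs (strings).
batch = balanced_mixture([1, 2, 3, 4, 5], ["a", "b", "c"])
```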