Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models.
The Knights and Knaves (K&K) puzzles [17] constitute an algorithmically generated reasoning dataset. Each puzzle presents statements made by several characters, and the objective is to determine, from those statements alone, whether each character is a knight (who always tells the truth) or a knave (who always lies).
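To make the task concrete, here is a small sketch of an invented two-character instance together with a brute-force consistency check; the statements and the solver are purely illustrative and are not the dataset's actual generator.

```python
from itertools import product

# An invented two-character puzzle (illustration only):
# A claims "A and B are both knaves"; B claims "A is a knave".
# assignment[name] == True means that character is a knight.
statements = {
    "A": lambda a: (not a["A"]) and (not a["B"]),  # "We are both knaves"
    "B": lambda a: not a["A"],                     # "A is a knave"
}

def solve(statements):
    """Return every assignment consistent with all statements:
    a knight's statement must be true, a knave's must be false."""
    names = list(statements)
    solutions = []
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if all(assignment[n] == statements[n](assignment) for n in names):
            solutions.append(assignment)
    return solutions

print(solve(statements))  # [{'A': False, 'B': True}] -> A is a knave, B is a knight
```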
System Prompt
Rewards
Format Reward
Answer Reward
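As a rough illustration of how these two rule-based rewards can be computed, the sketch below assumes a DeepSeek-R1-style output format with `<think>` and `<answer>` tags and uses simple string matching against the ground-truth assignment; the tag names, reward magnitudes, and matching logic are assumptions for illustration, not the exact implementation.

```python
import re

def format_reward(response: str) -> float:
    """Reward the response only if it wraps its reasoning in <think>...</think>
    and its final answer in <answer>...</answer> (assumed output format)."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, response.strip(), flags=re.DOTALL) else -1.0

def answer_reward(response: str, ground_truth: dict) -> float:
    """Compare the predicted knight/knave assignment inside <answer>...</answer>
    against the ground-truth solution (reward values are illustrative)."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return -2.0  # no parsable answer
    answer = match.group(1).lower()
    correct = all(
        (f"{name.lower()} is a knight" in answer) == is_knight
        and (f"{name.lower()} is a knave" in answer) == (not is_knight)
        for name, is_knight in ground_truth.items()
    )
    return 2.0 if correct else -1.5
```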
We adopt a modified version of REINFORCE++ as our baseline algorithm, which has demonstrated superior performance compared to GRPO in our experimental setup.
REINFORCE Return Calculation:
$G_t = \sum_{k=t+1}^{T} \gamma^{k-t} r_k$
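For concreteness, a minimal sketch of this return computation, under the assumption that `rewards[k]` holds $r_{k+1}$ (the formula is 1-indexed while Python lists are 0-indexed):

```python
def discounted_returns(rewards, gamma=1.0):
    """Compute G_t = sum_{k=t+1}^{T} gamma^{k-t} r_k via the backward
    recursion G_t = gamma * (r_{t+1} + G_{t+1}), with G_T = 0."""
    T = len(rewards)
    G = [0.0] * (T + 1)
    for t in range(T - 1, -1, -1):
        G[t] = gamma * (rewards[t] + G[t + 1])
    return G[:T]

# With gamma = 1 (common in RLHF-style training), each G_t is simply
# the sum of all later rewards:
print(discounted_returns([0.0, 0.0, 2.0]))  # [2.0, 2.0, 2.0]
```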
First Modification: Use KL Loss
In the original REINFORCE++ formulation, the KL penalty is folded into the per-token reward:
$r(s_t, a_t) = \mathbb{I}(s_t = [\text{EOS}])\, r(x, y) - \beta\, \text{KL}(t)$
where $\mathbb{I}(s_t = [\text{EOS}])$ is an indicator function that equals 1 when the [EOS] token is reached and 0 otherwise, and $\beta$ controls the weight of the KL penalty.
Note that the GRPO implementation does not include the KL divergence in the reward function; instead, it incorporates the KL divergence directly into the loss. Following this rationale, we likewise remove the KL penalty from the reward and add a KL loss term to the training objective, as GRPO does.
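A minimal sketch of this modification, using hypothetical tensor names and an illustrative $\beta$ value (not the paper's implementation): the KL term is added to the per-token loss rather than subtracted from the per-token reward.

```python
import torch

def reinforce_loss_with_kl(logprobs, advantages, kl_estimate, response_mask, beta=0.001):
    """Policy-gradient loss with the KL penalty added directly to the objective
    (GRPO-style) instead of being folded into the per-token reward r(s_t, a_t).

    logprobs:      (batch, seq_len) log pi_theta of the sampled tokens
    advantages:    (batch, seq_len) per-token returns/advantages G_t
    kl_estimate:   (batch, seq_len) per-token KL estimate (see next subsection)
    response_mask: (batch, seq_len) 1 on response tokens, 0 on prompt/padding
    beta:          KL weight (value here is illustrative)
    """
    pg_loss = -advantages.detach() * logprobs
    per_token_loss = pg_loss + beta * kl_estimate
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1)
```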
Second Modification: KL Estimation
Same as in GRPO:
$D_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] = \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1$
This estimator is always non-negative, whereas the original formulation may yield negative values; GRPO's estimator therefore provides a more stable and reliable measure of divergence during training.
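In code, this estimator needs only the per-token log-probabilities under the current policy and the reference policy; a minimal sketch:

```python
import torch

def kl_estimate(logprobs, ref_logprobs):
    """Per-token KL estimate from the formula above: exp(log r) - log r - 1,
    with log r = log pi_ref - log pi_theta. Since exp(x) - x - 1 >= 0 for all x,
    the estimate is non-negative, unlike the naive log(pi_theta / pi_ref)
    sample estimate, which can go negative."""
    log_ratio = ref_logprobs - logprobs
    return torch.exp(log_ratio) - log_ratio - 1.0
```

This tensor is what the `kl_estimate` argument in the loss sketch above stands for.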