Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models.
The Knights and Knaves (K&K) puzzles [17] constitute an algorithmically generated reasoning dataset. Each puzzle presents statements made by several characters, and the objective is to determine, from those statements alone, whether each character is a knight (who always tells the truth) or a knave (who always lies).
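To make the task concrete, here is a small sketch of an invented two-character instance together with a brute-force consistency check; the statements and the solver are purely illustrative and are not the dataset's actual generator.

```python
from itertools import product

# An invented two-character puzzle (illustration only):
# A claims "A and B are both knaves"; B claims "A is a knave".
# assignment[name] == True means that character is a knight.
statements = {
    "A": lambda a: (not a["A"]) and (not a["B"]),  # "We are both knaves"
    "B": lambda a: not a["A"],                     # "A is a knave"
}

def solve(statements):
    """Return every assignment consistent with all statements:
    a knight's statement must be true, a knave's must be false."""
    names = list(statements)
    solutions = []
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if all(assignment[n] == statements[n](assignment) for n in names):
            solutions.append(assignment)
    return solutions

print(solve(statements))  # [{'A': False, 'B': True}] -> A is a knave, B is a knight
```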
System Prompt
Rewards
Format Reward
Answer Reward
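As a rough illustration of how these two rule-based rewards can be computed, the sketch below assumes a DeepSeek-R1-style output format with `<think>` and `<answer>` tags and uses simple string matching against the ground-truth assignment; the tag names, reward magnitudes, and matching logic are assumptions for illustration, not the exact implementation.

```python
import re

def format_reward(response: str) -> float:
    """Reward the response only if it wraps its reasoning in <think>...</think>
    and its final answer in <answer>...</answer> (assumed output format)."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, response.strip(), flags=re.DOTALL) else -1.0

def answer_reward(response: str, ground_truth: dict) -> float:
    """Compare the predicted knight/knave assignment inside <answer>...</answer>
    against the ground-truth solution (reward values are illustrative)."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return -2.0  # no parsable answer
    answer = match.group(1).lower()
    correct = all(
        (f"{name.lower()} is a knight" in answer) == is_knight
        and (f"{name.lower()} is a knave" in answer) == (not is_knight)
        for name, is_knight in ground_truth.items()
    )
    return 2.0 if correct else -1.5
```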
We adopt a modified version of REINFORCE++ as our baseline algorithm, which has demonstrated superior performance compared to GRPO in our experimental setup.
REINFORCE Return Calculation:
$G_t = \sum_{k=t+1}^{T} \gamma^{k-t} r_k$
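For concreteness, a minimal sketch of this return computation, under the assumption that `rewards[k]` holds $r_{k+1}$ (the formula is 1-indexed while Python lists are 0-indexed):

```python
def discounted_returns(rewards, gamma=1.0):
    """Compute G_t = sum_{k=t+1}^{T} gamma^{k-t} r_k via the backward
    recursion G_t = gamma * (r_{t+1} + G_{t+1}), with G_T = 0."""
    T = len(rewards)
    G = [0.0] * (T + 1)
    for t in range(T - 1, -1, -1):
        G[t] = gamma * (rewards[t] + G[t + 1])
    return G[:T]

# With gamma = 1 (common in RLHF-style training), each G_t is simply
# the sum of all later rewards:
print(discounted_returns([0.0, 0.0, 2.0]))  # [2.0, 2.0, 2.0]
```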
First Modification: Use KL Loss
In the original REINFORCE++ formulation, the KL penalty is folded into the per-token reward:
$r(s_t, a_t) = \mathbb{I}(s_t = [\text{EOS}])\, r(x, y) - \beta\, \text{KL}(t)$
where $\mathbb{I}(s_t = [\text{EOS}])$ is an indicator function that equals 1 when the [EOS] token is reached and 0 otherwise, and $\beta$ controls the weight of the KL penalty.
Note that the GRPO implementation does not include the KL divergence in the reward function; instead, it incorporates the KL divergence directly into the loss. Following this rationale, we likewise remove the KL penalty from the reward and add a KL loss term to the training objective, as GRPO does.
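A minimal sketch of this modification, using hypothetical tensor names and an illustrative $\beta$ value (not the paper's implementation): the KL term is added to the per-token loss rather than subtracted from the per-token reward.

```python
import torch

def reinforce_loss_with_kl(logprobs, advantages, kl_estimate, response_mask, beta=0.001):
    """Policy-gradient loss with the KL penalty added directly to the objective
    (GRPO-style) instead of being folded into the per-token reward r(s_t, a_t).

    logprobs:      (batch, seq_len) log pi_theta of the sampled tokens
    advantages:    (batch, seq_len) per-token returns/advantages G_t
    kl_estimate:   (batch, seq_len) per-token KL estimate (see next subsection)
    response_mask: (batch, seq_len) 1 on response tokens, 0 on prompt/padding
    beta:          KL weight (value here is illustrative)
    """
    pg_loss = -advantages.detach() * logprobs
    per_token_loss = pg_loss + beta * kl_estimate
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1)
```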
Second Modification: KL Estimation
Same as in GRPO:
$D_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] = \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1$
This estimator is always non-negative, whereas the original formulation may yield negative values; GRPO's estimator therefore provides a more stable and reliable measure of divergence during training.
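In code, this estimator needs only the per-token log-probabilities under the current policy and the reference policy; a minimal sketch:

```python
import torch

def kl_estimate(logprobs, ref_logprobs):
    """Per-token KL estimate from the formula above: exp(log r) - log r - 1,
    with log r = log pi_ref - log pi_theta. Since exp(x) - x - 1 >= 0 for all x,
    the estimate is non-negative, unlike the naive log(pi_theta / pi_ref)
    sample estimate, which can go negative."""
    log_ratio = ref_logprobs - logprobs
    return torch.exp(log_ratio) - log_ratio - 1.0
```

This tensor is what the `kl_estimate` argument in the loss sketch above stands for.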