Research Question

Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models.

Approach

Data

The Knights and Knaves (K&K) puzzles [17] constitute an algorithmically generated reasoning dataset. Each character is either a knight, who always tells the truth, or a knave, who always lies. The objective is to determine the nature of each character based on their statements.
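
To make the task concrete, here is a minimal brute-force sketch of how such a puzzle can be checked, assuming the standard convention that knights always tell the truth and knaves always lie. The `solve` function, the lambda encoding of statements, and the two-character puzzle are hypothetical illustrations, not the generator or verifier used for the dataset.

```python
from itertools import product


def solve(names, statements):
    """Return every knight/knave assignment consistent with all statements.

    names: list of character names.
    statements: dict mapping a speaker to a function that takes an
        assignment (name -> True for knight, False for knave) and returns
        whether the speaker's claim holds under that assignment.
    """
    solutions = []
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # A knight's claim must be true; a knave's claim must be false.
        if all(world[speaker] == claim(world)
               for speaker, claim in statements.items()):
            solutions.append(world)
    return solutions


# Hypothetical puzzle (not taken from the dataset):
# A says "B is a knave"; B says "A and I are both knights".
names = ["A", "B"]
statements = {
    "A": lambda w: not w["B"],
    "B": lambda w: w["A"] and w["B"],
}
solutions = solve(names, statements)
print(solutions)
```

Exhaustive search is exponential in the number of characters, so it only suits small puzzles, but it shows why the answer can be verified by a rule: a proposed assignment is correct exactly when every statement's truth value matches its speaker's type.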

Reward Design

RL algorithm

We adopt a modified version of REINFORCE++ as our baseline algorithm; in our experimental setup it outperformed GRPO.
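
As a rough illustration of one way the two algorithm families are commonly described as differing, the sketch below contrasts per-prompt (group) reward normalization, as in GRPO, with normalization over the whole batch, as in REINFORCE++. It is a simplified assumption-laden sketch: it omits the KL penalty, clipping, and token-level details, it does not reproduce the modifications made here, and the function names and reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_normalized(rewards_by_prompt, eps=1e-8):
    """GRPO-style sketch: normalize each reward within its prompt's group."""
    advantages = []
    for group in rewards_by_prompt:
        mu, sigma = mean(group), pstdev(group)
        advantages.append([(r - mu) / (sigma + eps) for r in group])
    return advantages


def batch_normalized(rewards_by_prompt, eps=1e-8):
    """REINFORCE++-style sketch: normalize rewards over the whole batch."""
    flat = [r for group in rewards_by_prompt for r in group]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(r - mu) / (sigma + eps) for r in group]
            for group in rewards_by_prompt]


# Hypothetical rewards: two prompts, two sampled responses each.
rewards = [[1.0, 0.0], [1.0, 1.0]]
group_adv = group_normalized(rewards)
batch_adv = batch_normalized(rewards)
print(group_adv)
print(batch_adv)
```

In this toy batch, the second prompt's responses all receive the same reward, so group normalization gives them zero advantage, while batch normalization still assigns them a positive signal relative to the rest of the batch.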

Training Schedule
