We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
Model: we use DeepSeek-V3-Base as the base model
Data: there is NO supervised data; the training template simply requires the model to first produce its reasoning process, enclosed in <think> </think> tags, and then give the final answer in <answer> </answer> tags.
We intentionally limit our constraints to this structural format, avoiding any content-specific biases - such as mandating reflective reasoning or promoting particular problem-solving strategies - to ensure that we can accurately observe the model’s natural progression during the RL process.
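For concreteness, here is a minimal sketch of that template as a Python prompt string; the wording is paraphrased from the paper (it may differ slightly), and the constant name `R1_ZERO_TEMPLATE` is ours:

```python
# Sketch of the R1-Zero training template (paraphrased; exact wording may differ).
# {question} is filled with the training prompt; the model's completion is expected
# to follow the <think> ... </think> <answer> ... </answer> structure.
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively, i.e., <think> reasoning process here </think> "
    "<answer> answer here </answer>. User: {question}. Assistant:"
)

prompt = R1_ZERO_TEMPLATE.format(question="What is 7 * 8?")
```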
Algorithm: we use GRPO (Shao et al., 2024) as the RL framework, which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead.
GRPO objective: for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \dots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing the following objective:
$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}[q \sim P(Q), \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)]\,\frac{1}{G} \sum_{i=1}^{G} \left( \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i, \text{clip} \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right) - \beta\, \mathbb{D}_{\text{KL}} (\pi_{\theta} \,\|\, \pi_{\text{ref}}) \right),$ with $\mathbb{D}_{\text{KL}} (\pi_{\theta} \,\|\, \pi_{\text{ref}}) = \frac{\pi_{\text{ref}}(o_i|q)}{\pi_{\theta}(o_i|q)} - \log \frac{\pi_{\text{ref}}(o_i|q)}{\pi_{\theta}(o_i|q)} - 1,$
where $\epsilon$ and $\beta$ are hyper-parameters, and $A_i$ is the advantage, computed using a group of rewards $\{r_1, r_2, \dots, r_G\}$ corresponding to the outputs within each group:
$A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \dots, r_G\})}{\text{std}(\{r_1, r_2, \dots, r_G\})}.$
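A minimal PyTorch sketch of this objective for a single question, assuming sequence-level log-probabilities (the paper applies it per token) and placeholder values for the clip range and KL weight; the function name `grpo_loss` and these hyper-parameter values are ours, not the paper's:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """GRPO sketch for one question with G sampled outputs.

    logp_new / logp_old / logp_ref: (G,) summed log-probs of each output under the
    current, old (sampling), and reference policies; rewards: (G,) scalar rewards.
    eps (clip range) and beta (KL weight) are placeholder values, not the paper's.
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped surrogate, as in PPO but with the group-based advantage as baseline.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # KL(pi_theta || pi_ref) estimator from the formula above:
    # pi_ref/pi_theta - log(pi_ref/pi_theta) - 1.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    # The objective is maximized, so return its negative as a loss.
    return -(surrogate - beta * kl).mean()
```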
Reward modeling: we adopt a rule-based reward system that mainly consists of two types of rewards: accuracy rewards, which check whether the final answer is correct (e.g., exact match for math problems or test-case execution for code), and format rewards, which enforce that the model places its thinking process between <think> and </think> tags.
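A minimal sketch of such a rule-based reward; the specific regex, reward values, and the way they are combined are illustrative assumptions, not the paper's exact rules:

```python
import re

# Completion must be <think> ... </think> followed by <answer> ... </answer>.
THINK_ANSWER_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def rule_based_reward(completion: str, ground_truth: str) -> float:
    m = THINK_ANSWER_RE.match(completion.strip())
    format_reward = 1.0 if m else 0.0                 # follows the required tags
    predicted = m.group(1).strip() if m else ""
    accuracy_reward = 1.0 if predicted == ground_truth.strip() else 0.0
    return accuracy_reward + format_reward            # combination is an assumption
```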
Training results
Performance
Besides the results shown in the table, when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero's performance escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912.
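A small sketch of majority voting (self-consistency) over sampled answers; the answer normalization and tie-breaking below are simplified assumptions:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most frequent final answer among sampled generations."""
    normalized = [a.strip() for a in answers]
    return Counter(normalized).most_common(1)[0][0]

# e.g. several sampled answers for one AIME problem -> single voted answer
voted = majority_vote(["070", "070", "071", "070"])
```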
Self-evolution Process
The thinking time of DeepSeek-R1-Zero shows consistent improvement throughout the training process. Sophisticated behaviors such as reflection - where the model revisits and reevaluates its previous steps - and the exploration of alternative approaches to problem-solving arise spontaneously.
Aha Moment of DeepSeek-R1-Zero
In an intermediate version of the model, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach.
Drawbacks of DeepSeek-R1-Zero: poor readability and language mixing.
DeepSeek-R1 uses a small amount of cold-start data and a four-stage training pipeline. The pipeline comprises two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities.
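As a rough reading aid, the four stages can be laid out as an ordered list; the stage descriptions below simply paraphrase the pipeline just described, and the structure is purely illustrative:

```python
# Illustrative outline of the four-stage DeepSeek-R1 pipeline (paraphrased).
R1_PIPELINE = [
    {"stage": 1, "type": "SFT", "purpose": "cold start on a small amount of curated data"},
    {"stage": 2, "type": "RL",  "purpose": "discover improved reasoning patterns"},
    {"stage": 3, "type": "SFT", "purpose": "seed reasoning and non-reasoning capabilities"},
    {"stage": 4, "type": "RL",  "purpose": "align with human preferences"},
]
```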