We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
Model: we use DeepSeek-V3-Base as the base model
Data: there is NO supervised data; the training template simply requires the model to first produce its reasoning process, enclosed in <think> </think> tags, and then give the final answer in <answer> </answer> tags.
We intentionally limit our constraints to this structural format, avoiding any content-specific biases - such as mandating reflective reasoning or promoting particular problem-solving strategies - to ensure that we can accurately observe the model’s natural progression during the RL process.
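For concreteness, here is a minimal sketch of that template as a Python prompt string; the wording is paraphrased from the paper (it may differ slightly), and the constant name `R1_ZERO_TEMPLATE` is ours:

```python
# Sketch of the R1-Zero training template (paraphrased; exact wording may differ).
# {question} is filled with the training prompt; the model's completion is expected
# to follow the <think> ... </think> <answer> ... </answer> structure.
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively, i.e., <think> reasoning process here </think> "
    "<answer> answer here </answer>. User: {question}. Assistant:"
)

prompt = R1_ZERO_TEMPLATE.format(question="What is 7 * 8?")
```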
Algorithm: we use GRPO (Shao et al., 2024) as the RL framework, which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead.
GRPO objective: for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \dots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing the following objective:
$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}[q \sim P(Q), \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)]\,\frac{1}{G} \sum_{i=1}^{G} \left( \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i, \text{clip} \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right) - \beta\, \mathbb{D}_{\text{KL}} (\pi_{\theta} \,\|\, \pi_{\text{ref}}) \right),$ with $\mathbb{D}_{\text{KL}} (\pi_{\theta} \,\|\, \pi_{\text{ref}}) = \frac{\pi_{\text{ref}}(o_i|q)}{\pi_{\theta}(o_i|q)} - \log \frac{\pi_{\text{ref}}(o_i|q)}{\pi_{\theta}(o_i|q)} - 1,$
where $\epsilon$ and $\beta$ are hyper-parameters, and $A_i$ is the advantage, computed using a group of rewards $\{r_1, r_2, \dots, r_G\}$ corresponding to the outputs within each group:
$A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \dots, r_G\})}{\text{std}(\{r_1, r_2, \dots, r_G\})}.$
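A minimal PyTorch sketch of this objective for a single question, assuming sequence-level log-probabilities (the paper applies it per token) and placeholder values for the clip range and KL weight; the function name `grpo_loss` and these hyper-parameter values are ours, not the paper's:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """GRPO sketch for one question with G sampled outputs.

    logp_new / logp_old / logp_ref: (G,) summed log-probs of each output under the
    current, old (sampling), and reference policies; rewards: (G,) scalar rewards.
    eps (clip range) and beta (KL weight) are placeholder values, not the paper's.
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped surrogate, as in PPO but with the group-based advantage as baseline.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # KL(pi_theta || pi_ref) estimator from the formula above:
    # pi_ref/pi_theta - log(pi_ref/pi_theta) - 1.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    # The objective is maximized, so return its negative as a loss.
    return -(surrogate - beta * kl).mean()
```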
Reward modeling: we adopt a rule-based reward system that mainly consists of two types of rewards: accuracy rewards, which check whether the final answer is correct (e.g., exact match for math problems or test-case execution for code), and format rewards, which enforce that the model places its thinking process between <think> and </think> tags.
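A minimal sketch of such a rule-based reward; the specific regex, reward values, and the way they are combined are illustrative assumptions, not the paper's exact rules:

```python
import re

# Completion must be <think> ... </think> followed by <answer> ... </answer>.
THINK_ANSWER_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def rule_based_reward(completion: str, ground_truth: str) -> float:
    m = THINK_ANSWER_RE.match(completion.strip())
    format_reward = 1.0 if m else 0.0                 # follows the required tags
    predicted = m.group(1).strip() if m else ""
    accuracy_reward = 1.0 if predicted == ground_truth.strip() else 0.0
    return accuracy_reward + format_reward            # combination is an assumption
```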
Training results
Performance
Besides the results shown in the table, when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero's performance escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912.
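A small sketch of majority voting (self-consistency) over sampled answers; the answer normalization and tie-breaking below are simplified assumptions:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most frequent final answer among sampled generations."""
    normalized = [a.strip() for a in answers]
    return Counter(normalized).most_common(1)[0][0]

# e.g. several sampled answers for one AIME problem -> single voted answer
voted = majority_vote(["070", "070", "071", "070"])
```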
Self-evolution Process
The thinking time of DeepSeek-R1-Zero shows consistent improvement throughout the training process. Sophisticated behaviors such as reflection - where the model revisits and reevaluates its previous steps - and the exploration of alternative approaches to problem-solving arise spontaneously.
Aha Moment of DeepSeek-R1-Zero
In an intermediate version of the model, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach.
Drawbacks of DeepSeek-R1-Zero: poor readability and language mixing.
DeepSeek-R1 uses a small amount of cold-start data and a four-stage training pipeline. The pipeline comprises two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities.
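As a rough reading aid, the four stages can be laid out as an ordered list; the stage descriptions below simply paraphrase the pipeline just described, and the structure is purely illustrative:

```python
# Illustrative outline of the four-stage DeepSeek-R1 pipeline (paraphrased).
R1_PIPELINE = [
    {"stage": 1, "type": "SFT", "purpose": "cold start on a small amount of curated data"},
    {"stage": 2, "type": "RL",  "purpose": "discover improved reasoning patterns"},
    {"stage": 3, "type": "SFT", "purpose": "seed reasoning and non-reasoning capabilities"},
    {"stage": 4, "type": "RL",  "purpose": "align with human preferences"},
]
```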