Research Question
- We delve into the study of scaling laws and present our distinctive findings that facilitate the scaling of large-scale models in two widely used open-source configurations, 7B and 67B.
- Specifically, we first examined the scaling laws of batch size and learning rate and found their trends with model size (see the fitting sketch after this list).
- Building on this, we conducted a comprehensive study of the scaling laws of data and model scale.
- We discovered that the scaling laws derived from different datasets show significant differences.
- Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective.
- To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens.
- We further conduct supervised fine-tuning (SFT) and direct preference optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models.
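The batch size and learning rate findings amount to power-law trends in the training compute budget. The sketch below is not the paper's procedure and does not use its fitted constants; it only illustrates, with made-up grid-search results and a hypothetical `fit_power_law` helper, how such trends can be fit in log-log space and extrapolated to a larger budget.

```python
import numpy as np

# Hypothetical grid-search results: for each compute budget C (in FLOPs),
# the batch size and learning rate that gave the lowest validation loss.
# The numbers are illustrative placeholders, not measurements from the paper.
compute = np.array([1e17, 1e18, 1e19, 1e20])
best_batch_size = np.array([0.4e6, 0.9e6, 1.9e6, 4.2e6])        # tokens per step
best_learning_rate = np.array([4.2e-4, 3.2e-4, 2.4e-4, 1.8e-4])

def fit_power_law(x, y):
    """Fit y ~ a * x**b by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return np.exp(intercept), slope

a_bs, b_bs = fit_power_law(compute, best_batch_size)
a_lr, b_lr = fit_power_law(compute, best_learning_rate)
print(f"B_opt  ~ {a_bs:.3g} * C^{b_bs:.3f}")   # positive exponent: grows with compute
print(f"lr_opt ~ {a_lr:.3g} * C^{b_lr:.3f}")   # negative exponent: shrinks with compute

# Extrapolate the fitted trends to a larger compute budget.
C_target = 3e22
print("predicted batch size:   ", a_bs * C_target ** b_bs)
print("predicted learning rate:", a_lr * C_target ** b_lr)
```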
Pre-training
Data
To enhance the richness and diversity of the pre-training data, we organized our approach into three essential stages: deduplication, filtering, and remixing.
Deduplication
The deduplication and remixing stages ensure a diverse representation of the data by sampling unique instances.
- We adopted an aggressive deduplication strategy, expanding the deduplication scope across dumps (see the sketch below).
- Deduplicating across 91 dumps eliminates four times more documents than deduplicating within a single dump.
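The notes do not spell out the deduplication algorithm, so the following is a minimal sketch under assumed details: exact-match deduplication on normalized text, with one hash set shared across all dumps so that a document already seen in any earlier dump is dropped. The `normalize` and `dedup_across_dumps` helpers are hypothetical names, not the project's code.

```python
import hashlib

def normalize(text: str) -> str:
    """Light normalization before hashing (lowercase, collapse whitespace)."""
    return " ".join(text.lower().split())

def dedup_across_dumps(dumps):
    """Yield documents whose normalized hash has not been seen in ANY dump.

    Sharing one `seen` set across all dumps is what widens the deduplication
    scope beyond the per-dump setting.
    """
    seen = set()
    for dump in dumps:
        for doc in dump:
            digest = hashlib.md5(normalize(doc).encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                yield doc

# Usage: each dump is treated as an iterable of documents.
dumps = [
    ["A shared document.", "Unique to dump 1."],
    ["A shared document.", "Unique to dump 2."],
]
print(list(dedup_across_dumps(dumps)))
# ['A shared document.', 'Unique to dump 1.', 'Unique to dump 2.']
```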

Filtering
- The filtering stage enhances the density of information, thereby enabling more efficient and effective model training.
- In the filtering stage, we focus on developing robust criteria for document quality assessment.
- This involves a detailed analysis incorporating both linguistic and semantic evaluations, providing a view of data quality from individual and global perspectives (see the sketch below).
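As a rough illustration of combining document-level linguistic checks with a global, model-based quality signal, here is a sketch with invented heuristics and thresholds; none of the scores, weights, or cutoffs come from the paper, and `semantic_score` is only a stand-in for a learned quality classifier.

```python
import re

def linguistic_score(doc: str) -> float:
    """Document-level heuristics: length, alphabetic ratio, repetition.
    The weights and cutoffs are illustrative, not the paper's criteria."""
    words = doc.split()
    if len(words) < 20:
        return 0.0
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    unique_ratio = len(set(words)) / len(words)
    return 0.5 * alpha_ratio + 0.5 * unique_ratio

def semantic_score(doc: str) -> float:
    """Placeholder for a global, model-based quality signal (e.g. a classifier
    trained on reference text); here reduced to a trivial keyword proxy."""
    return 1.0 if re.search(r"\b(the|and|of)\b", doc) else 0.3

def keep(doc: str, lin_thresh: float = 0.6, sem_thresh: float = 0.5) -> bool:
    """A document must clear both the local and the global bar."""
    return linguistic_score(doc) >= lin_thresh and semantic_score(doc) >= sem_thresh

good = ("the history of open source language models is rich and the pace of "
        "progress in this area of research has accelerated in recent years")
spam = "buy now " * 30
print(keep(good), keep(spam))  # True False
```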
Remixing
In the remixing phase, we adjust our approach to address data imbalances, focusing on increasing the presence of underrepresented domains.
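One common way to raise the presence of underrepresented domains is temperature-style reweighting of domain sampling probabilities. The sketch below assumes that approach (the notes do not specify the paper's actual remixing recipe), with an illustrative `alpha` of 0.5 and made-up domain counts.

```python
def remix_weights(domain_counts, alpha=0.5):
    """Temperature-style reweighting: sampling weight ~ (raw share) ** alpha.

    alpha < 1 flattens the distribution, so underrepresented domains are
    sampled more often than their raw share; alpha = 1 keeps raw proportions.
    The value 0.5 is an illustrative choice, not the paper's setting.
    """
    total = sum(domain_counts.values())
    scaled = {d: (n / total) ** alpha for d, n in domain_counts.items()}
    norm = sum(scaled.values())
    return {d: w / norm for d, w in scaled.items()}

# Made-up domain document counts: the smaller domains gain sampling share.
counts = {"web": 900_000, "code": 80_000, "books": 20_000}
print(remix_weights(counts))
# raw shares 0.90 / 0.08 / 0.02  ->  roughly 0.69 / 0.21 / 0.10
```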
Tokenizer