Research Question
- In this work, we introduce QWEN, the first installment of our large language model series.
- It includes QWEN, the base pretrained language models, and QWEN-CHAT, the chat models finetuned with human alignment techniques.
- The chat models possess advanced tool-use and planning capabilities for creating agent applications.
- Furthermore, we have developed coding-specialized models, CODE-QWEN and CODE-QWEN-CHAT, as well as mathematics-focused models, MATH-QWEN-CHAT, which are built upon the base language models.

Pretraining
DATA
Sources:
- Includes public web documents, encyclopedias, books, code, etc.
- Additionally, our dataset is multilingual, with a significant portion of the data being in English and Chinese.
Processing:
- To increase the diversity of our data, we employ deduplication techniques, including exact-match deduplication after normalization and fuzzy deduplication using MinHash and LSH algorithms (see the sketch after this list).
- To filter out low-quality data, we employ a combination of rule-based and machine-learning-based methods (a filtering sketch follows after this list).
- To further enhance the quality of our data, we selectively up-sample data from certain sources to ensure that our models are trained on a diverse range of high-quality content (see the up-sampling sketch below).
- To further enhance the performance of our model, we have incorporated high-quality instruction data into our pretraining process.
- Finally, we have built a dataset of up to 3 trillion tokens.
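
To make the deduplication step concrete, here is a minimal Python sketch of exact-match dedup after normalization plus MinHash/LSH fuzzy dedup. It is an illustration under stated assumptions, not the report's actual pipeline: the normalization rules, shingle size `n=5`, `NUM_PERM`, and band layout are hypothetical choices.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical parameters; the report does not disclose its exact settings.
NUM_PERM = 128          # hash permutations per MinHash signature
BANDS, ROWS = 32, 4     # LSH banding: 32 bands x 4 rows = 128 permutations

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace as a simple normalization before exact-match dedup."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles used as the unit of fuzzy comparison."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def exact_duplicates(docs: dict) -> list:
    """Exact-match dedup: documents whose normalized text hashes collide are duplicates."""
    seen, dupes = {}, []
    for doc_id, text in docs.items():
        key = hashlib.sha256(normalize(text).encode()).hexdigest()
        if key in seen:
            dupes.append((seen[key], doc_id))
        else:
            seen[key] = doc_id
    return dupes

def minhash_signature(doc_shingles: set, num_perm: int = NUM_PERM) -> list:
    """MinHash signature: for each seeded hash function, keep the minimum value over all shingles."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in doc_shingles
        )
        for seed in range(num_perm)
    ]

def fuzzy_duplicate_candidates(docs: dict) -> list:
    """LSH: documents whose signatures agree on any full band fall into the same bucket
    and are flagged as near-duplicate candidates for a final verification pass."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text))
        for b in range(BANDS):
            band_key = (b,) + tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[band_key].append(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]
```

At corpus scale this would run as a distributed job, with bucket collisions verified by an actual Jaccard-similarity check before removal.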
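
The rule-plus-classifier quality filtering can be sketched as below. The specific heuristics (minimum length, symbol density, repeated-line ratio) and the `quality_score` callable are assumptions for illustration; the report does not publish its rules or its classifier.

```python
# Hypothetical thresholds for illustration only.
MIN_WORDS = 50
MAX_SYMBOL_RATIO = 0.3
MIN_QUALITY_SCORE = 0.5

def passes_rules(text: str) -> bool:
    """Cheap rule-based checks: document length, symbol density, and line repetition."""
    words = text.split()
    if len(words) < MIN_WORDS:
        return False
    symbols = sum(1 for c in text if not c.isalnum() and not c.isspace())
    if symbols / max(1, len(text)) > MAX_SYMBOL_RATIO:
        return False
    lines = [line for line in text.splitlines() if line.strip()]
    if lines and len(set(lines)) / len(lines) < 0.5:  # mostly repeated lines
        return False
    return True

def keep_document(text: str, quality_score) -> bool:
    """Combine rules with a learned score in [0, 1]; `quality_score` stands in for
    any machine-learning-based quality classifier applied to the text."""
    return passes_rules(text) and quality_score(text) >= MIN_QUALITY_SCORE
```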
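
Selective up-sampling of certain sources can be expressed as weighted sampling over per-source document pools. The source names and weights below are made up for illustration; the actual mixture ratios are not disclosed.

```python
import random

# Illustrative per-source up-sampling weights; the real mixture ratios are not public.
SOURCE_WEIGHTS = {"encyclopedia": 2.0, "books": 1.5, "web": 1.0}

def build_training_mix(docs_by_source: dict, total_docs: int, seed: int = 0) -> list:
    """Draw documents with weights proportional to SOURCE_WEIGHTS times pool size,
    so favored high-quality sources appear more often in the training mix."""
    rng = random.Random(seed)
    sources = [s for s in docs_by_source if docs_by_source[s]]
    weights = [SOURCE_WEIGHTS.get(s, 1.0) * len(docs_by_source[s]) for s in sources]
    mix = []
    for _ in range(total_docs):
        src = rng.choices(sources, weights=weights, k=1)[0]
        mix.append(rng.choice(docs_by_source[src]))
    return mix
```
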
TOKENIZATION