Research Question
- In this work, we introduce QWEN, the first installment of our large language model series.
- It includes QWEN, the base pretrained language models, and QWEN-CHAT, the chat models finetuned with human alignment techniques.
- The chat models possess advanced tool-use and planning capabilities for creating agent applications.
- Furthermore, we have developed coding-specialized models, CODE-QWEN and CODE-QWEN-CHAT, as well as mathematics-focused models, MATH-QWEN-CHAT, which are built upon the base language models.

Pretraining
DATA
Sources:
- Includes public web documents, encyclopedias, books, code, etc.
- Additionally, our dataset is multilingual, with a significant portion of the data being in English and Chinese.
Processing:
- To increase the diversity of our data, we employ deduplication techniques, including exact-match deduplication after normalization and fuzzy deduplication using MinHash and LSH algorithms (see the sketch after this list).
- To filter out low-quality data, we employ a combination of rule-based and machine-learning-based methods (a filtering sketch follows after this list).
- To further enhance the quality of our data, we selectively up-sample data from certain sources to ensure that our models are trained on a diverse range of high-quality content (see the up-sampling sketch below).
- To further enhance the performance of our model, we have incorporated high-quality instruction data into our pretraining process.
- Finally, we have built a dataset of up to 3 trillion tokens.
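
To make the deduplication step concrete, here is a minimal Python sketch of exact-match dedup after normalization plus MinHash/LSH fuzzy dedup. It is an illustration under stated assumptions, not the report's actual pipeline: the normalization rules, shingle size `n=5`, `NUM_PERM`, and band layout are hypothetical choices.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical parameters; the report does not disclose its exact settings.
NUM_PERM = 128          # hash permutations per MinHash signature
BANDS, ROWS = 32, 4     # LSH banding: 32 bands x 4 rows = 128 permutations

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace as a simple normalization before exact-match dedup."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles used as the unit of fuzzy comparison."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def exact_duplicates(docs: dict) -> list:
    """Exact-match dedup: documents whose normalized text hashes collide are duplicates."""
    seen, dupes = {}, []
    for doc_id, text in docs.items():
        key = hashlib.sha256(normalize(text).encode()).hexdigest()
        if key in seen:
            dupes.append((seen[key], doc_id))
        else:
            seen[key] = doc_id
    return dupes

def minhash_signature(doc_shingles: set, num_perm: int = NUM_PERM) -> list:
    """MinHash signature: for each seeded hash function, keep the minimum value over all shingles."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in doc_shingles
        )
        for seed in range(num_perm)
    ]

def fuzzy_duplicate_candidates(docs: dict) -> list:
    """LSH: documents whose signatures agree on any full band fall into the same bucket
    and are flagged as near-duplicate candidates for a final verification pass."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text))
        for b in range(BANDS):
            band_key = (b,) + tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[band_key].append(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]
```

At corpus scale this would run as a distributed job, with bucket collisions verified by an actual Jaccard-similarity check before removal.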
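
The rule-plus-classifier quality filtering can be sketched as below. The specific heuristics (minimum length, symbol density, repeated-line ratio) and the `quality_score` callable are assumptions for illustration; the report does not publish its rules or its classifier.

```python
# Hypothetical thresholds for illustration only.
MIN_WORDS = 50
MAX_SYMBOL_RATIO = 0.3
MIN_QUALITY_SCORE = 0.5

def passes_rules(text: str) -> bool:
    """Cheap rule-based checks: document length, symbol density, and line repetition."""
    words = text.split()
    if len(words) < MIN_WORDS:
        return False
    symbols = sum(1 for c in text if not c.isalnum() and not c.isspace())
    if symbols / max(1, len(text)) > MAX_SYMBOL_RATIO:
        return False
    lines = [line for line in text.splitlines() if line.strip()]
    if lines and len(set(lines)) / len(lines) < 0.5:  # mostly repeated lines
        return False
    return True

def keep_document(text: str, quality_score) -> bool:
    """Combine rules with a learned score in [0, 1]; `quality_score` stands in for
    any machine-learning-based quality classifier applied to the text."""
    return passes_rules(text) and quality_score(text) >= MIN_QUALITY_SCORE
```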
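
Selective up-sampling of certain sources can be expressed as weighted sampling over per-source document pools. The source names and weights below are made up for illustration; the actual mixture ratios are not disclosed.

```python
import random

# Illustrative per-source up-sampling weights; the real mixture ratios are not public.
SOURCE_WEIGHTS = {"encyclopedia": 2.0, "books": 1.5, "web": 1.0}

def build_training_mix(docs_by_source: dict, total_docs: int, seed: int = 0) -> list:
    """Draw documents with weights proportional to SOURCE_WEIGHTS times pool size,
    so favored high-quality sources appear more often in the training mix."""
    rng = random.Random(seed)
    sources = [s for s in docs_by_source if docs_by_source[s]]
    weights = [SOURCE_WEIGHTS.get(s, 1.0) * len(docs_by_source[s]) for s in sources]
    mix = []
    for _ in range(total_docs):
        src = rng.choices(sources, weights=weights, k=1)[0]
        mix.append(rng.choice(docs_by_source[src]))
    return mix
```
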
TOKENIZATION