Research Question
We release a comprehensive suite of foundational and instruction-tuned language models, spanning 0.5 to 72 billion parameters and comprising both dense models and a Mixture-of-Experts model.
TOKENIZER & MODEL
TOKENIZER
- Following Qwen (Bai et al., 2023a), we employ the identical tokenizer based on byte-level byte-pair encoding.
- Models of all sizes employ a common vocabulary consisting of 151,643 regular tokens and 3 control tokens.
- It should be noted that, owing to considerations in distributed training, the effective size of the embedding matrix is larger than the vocabulary size.
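As a quick illustration (a minimal sketch; the Hugging Face repository id `Qwen/Qwen2-7B` is used here only as an example checkpoint, and the printed values should be checked against the numbers above rather than taken from this sketch), the tokenizer can be loaded and inspected as follows:

```python
# Minimal sketch: load a Qwen2 byte-level BPE tokenizer and inspect its vocabulary.
# The repository id is illustrative; all Qwen2 sizes share the same tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")

# The report states 151,643 regular tokens plus 3 control tokens; the embedding
# matrix itself may be padded to a larger size for distributed training.
print("regular vocabulary size:", tokenizer.vocab_size)
print("control/special tokens:", tokenizer.all_special_tokens)

# Byte-level BPE round-trips arbitrary text without out-of-vocabulary failures.
ids = tokenizer.encode("Qwen2 uses byte-level byte-pair encoding.")
print(ids)
print(tokenizer.decode(ids))
```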
MODEL ARCHITECTURE

QWEN2 DENSE MODEL
- Key differences from Qwen are described below:
- Grouped Query Attention: We adopt Grouped Query Attention (GQA, Ainslie et al., 2023) instead of conventional multi-head attention (MHA); GQA optimizes KV cache usage during inference and improves throughput (a minimal sketch appears after this list).
- Dual Chunk Attention with YARN
- To expand the context window of Qwen2, we implement Dual Chunk Attention (DCA, An et al., 2024), which segments long sequences into chunks of manageable length.
- If the input fits within a single chunk, DCA produces the same result as the original attention.
- Otherwise, DCA helps to effectively capture the relative positional information between tokens within and across chunks, thereby improving long-context performance.
- Moreover, we employ YARN (Peng et al., 2023) to rescale the attention weights for better length extrapolation (see the sketch after this list).
- All other settings are the same as Qwen, including QKV bias.
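As referenced in the GQA item above, the following is a minimal, self-contained sketch of grouped query attention in PyTorch (illustrative only, not the Qwen2 implementation; all shapes are made up). The point is that several query heads share one key/value head, which shrinks the KV cache relative to MHA:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """Illustrative GQA: q has more heads than k/v; each KV head serves a group of query heads.

    q:    (batch, n_q_heads, seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim), with n_q_heads % n_kv_heads == 0
    """
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_q_heads // n_kv_heads
    # Expand each KV head to cover its group of query heads.
    # group_size == 1 recovers MHA; n_kv_heads == 1 is MQA; GQA sits in between.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# Example: 16 query heads sharing 4 KV heads (the KV cache stores 4 heads, not 16).
q = torch.randn(1, 16, 32, 64)
k = torch.randn(1, 4, 32, 64)
v = torch.randn(1, 4, 32, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 16, 32, 64])
```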
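The attention-weight rescaling mentioned in the YARN item can be illustrated with the temperature factor from the YaRN paper, $\sqrt{1/t} = 0.1 \ln s + 1$, where $s$ is the context-extension ratio; the sketch below assumes this published formula and is not Qwen2's exact configuration:

```python
import math
import torch
import torch.nn.functional as F

def yarn_attention_scale(context_scale_factor: float) -> float:
    """YaRN attention 'temperature' factor sqrt(1/t) = 0.1 * ln(s) + 1,
    where s = extended context length / original context length."""
    return 0.1 * math.log(context_scale_factor) + 1.0

def attention_with_yarn_scaling(q, k, v, context_scale_factor: float):
    # Scaling q and k each by sqrt(1/t) is equivalent to dividing the logits by t,
    # which softens the attention distribution at extended context lengths.
    m = yarn_attention_scale(context_scale_factor)
    logits = (q * m) @ (k * m).transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(logits, dim=-1) @ v

# Example: extending a 32K-token window to 128K gives s = 4 (shapes are made up).
q, k, v = (torch.randn(1, 8, 16, 64) for _ in range(3))
print(attention_with_yarn_scaling(q, k, v, context_scale_factor=4.0).shape)
```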
QWEN2 MIXTURE-OF-EXPERTS MODEL
- Expert Granularity: Our model employs fine-grained experts (Dai et al., 2024 / DeepSeekMoE), creating smaller-scale experts while activating a greater number of experts simultaneously.
- Expert Routing: There has been a notable recent trend towards integrating both shared and routing-specific experts within MoE layers (Rajbhandari et al., 2022; Dai et al., 2024). We adopt this approach (a sketch of a shared-plus-routed MoE layer appears after this list).
- Expert Initialization: We initialize the experts in a similar way to upcycling (Komatsuzaki et al., 2023), leveraging the weights of a dense model, with the following modifications (see the initialization sketch after this list):
- Given the designated expert intermediate size $h_E$, the number of experts $n$, and the original FFN intermediate size $h_{FFN}$, the FFN is replicated $\lceil n \times h_E / h_{FFN} \rceil$ times.
- To promote diversity within each FFN copy, parameters are shuffled along the intermediate dimension.
- For each fine-grained expert, 50% of its parameters are randomly reinitialized.
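As referenced in the Expert Routing item, the sketch below shows a generic MoE layer that combines always-active shared experts with top-k routed fine-grained experts (an illustrative DeepSeekMoE-style layout in PyTorch, not Qwen2's implementation; all names and sizes are made up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNExpert(nn.Module):
    """A single fine-grained feed-forward expert (SwiGLU-style)."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class SharedPlusRoutedMoE(nn.Module):
    """Shared experts process every token; routed experts are picked per token by top-k gating."""
    def __init__(self, hidden_size, expert_intermediate_size,
                 n_shared_experts, n_routed_experts, top_k):
        super().__init__()
        self.shared = nn.ModuleList(
            FFNExpert(hidden_size, expert_intermediate_size) for _ in range(n_shared_experts))
        self.routed = nn.ModuleList(
            FFNExpert(hidden_size, expert_intermediate_size) for _ in range(n_routed_experts))
        self.router = nn.Linear(hidden_size, n_routed_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, hidden_size)
        shared_out = sum(expert(x) for expert in self.shared)
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_i = probs.topk(self.top_k, dim=-1)  # (n_tokens, top_k)
        # Naive per-token dispatch for clarity; real implementations batch tokens per expert.
        routed_rows = []
        for t in range(x.shape[0]):
            row = torch.zeros_like(x[t])
            for p, i in zip(top_p[t], top_i[t]):
                row = row + p * self.routed[int(i)](x[t])
            routed_rows.append(row)
        return shared_out + torch.stack(routed_rows)

# Example: 2 shared + 16 fine-grained routed experts, 4 routed experts active per token.
moe = SharedPlusRoutedMoE(hidden_size=64, expert_intermediate_size=128,
                          n_shared_experts=2, n_routed_experts=16, top_k=4)
print(moe(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```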
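And a sketch of the procedure described in the Expert Initialization item: replicate the dense FFN $\lceil n \times h_E / h_{FFN} \rceil$ times, shuffle along the intermediate dimension, slice into fine-grained experts, and randomly re-initialize 50% of each expert's parameters. This is one illustrative reading of that description (only the up- and down-projections are shown; a gate projection would be handled the same way, and all sizes are made up):

```python
import math
import torch

def upcycle_ffn_to_experts(w_up, w_down, n_experts, expert_intermediate_size, reinit_frac=0.5):
    """Build fine-grained expert weights from a dense FFN via the described upcycling recipe.

    w_up:   (h_ffn, hidden) dense up-projection weight
    w_down: (hidden, h_ffn) dense down-projection weight
    """
    h_ffn = w_up.shape[0]
    # 1. Replicate the FFN ceil(n * h_E / h_FFN) times to provide enough intermediate channels.
    n_copies = math.ceil(n_experts * expert_intermediate_size / h_ffn)
    up = w_up.repeat(n_copies, 1)       # (n_copies * h_ffn, hidden)
    down = w_down.repeat(1, n_copies)   # (hidden, n_copies * h_ffn)

    # 2. Shuffle along the intermediate dimension (same permutation for up and down) for diversity.
    perm = torch.randperm(up.shape[0])
    up, down = up[perm], down[:, perm]

    experts = []
    for e in range(n_experts):
        sl = slice(e * expert_intermediate_size, (e + 1) * expert_intermediate_size)
        e_up, e_down = up[sl].clone(), down[:, sl].clone()
        # 3. Randomly re-initialize 50% of each fine-grained expert's parameters.
        for w in (e_up, e_down):
            mask = torch.rand_like(w) < reinit_frac
            w[mask] = 0.02 * torch.randn_like(w)[mask]
        experts.append((e_up, e_down))
    return experts

# Example with small made-up sizes: h_FFN = 1024 split into 8 experts with h_E = 256.
w_up, w_down = torch.randn(1024, 128), torch.randn(128, 1024)
experts = upcycle_ffn_to_experts(w_up, w_down, n_experts=8, expert_intermediate_size=256)
print(len(experts), experts[0][0].shape)  # 8 torch.Size([256, 128])
```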
PRE-TRAINING
In the pre-training of Qwen2, our efforts were focused on refining the dataset and investigating methods to handle extended context lengths effectively.
PRE-TRAINING DATA