Research Question
We release a comprehensive suite of foundational and instruction-tuned language models, spanning 0.5 to 72 billion parameters and comprising both dense models and a Mixture-of-Experts model.
TOKENIZER & MODEL
TOKENIZER
- Following Qwen (Bai et al., 2023a), we employ the identical tokenizer based on byte-level byte-pair encoding.
- Models of all sizes employ a common vocabulary consisting of 151,643 regular tokens and 3 control tokens.
- It should be noted that, owing to considerations in distributed training, the effective size of the embedding matrix is larger than the vocabulary size.
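As a quick illustration (a minimal sketch; the Hugging Face repository id `Qwen/Qwen2-7B` is used here only as an example checkpoint, and the printed values should be checked against the numbers above rather than taken from this sketch), the tokenizer can be loaded and inspected as follows:

```python
# Minimal sketch: load a Qwen2 byte-level BPE tokenizer and inspect its vocabulary.
# The repository id is illustrative; all Qwen2 sizes share the same tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")

# The report states 151,643 regular tokens plus 3 control tokens; the embedding
# matrix itself may be padded to a larger size for distributed training.
print("regular vocabulary size:", tokenizer.vocab_size)
print("control/special tokens:", tokenizer.all_special_tokens)

# Byte-level BPE round-trips arbitrary text without out-of-vocabulary failures.
ids = tokenizer.encode("Qwen2 uses byte-level byte-pair encoding.")
print(ids)
print(tokenizer.decode(ids))
```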
MODEL ARCHITECTURE

QWEN2 DENSE MODEL
- Key differences from Qwen are described below:
- Grouped Query Attention: We adopt Grouped Query Attention (GQA, Ainslie et al., 2023) instead of conventional multi-head attention (MHA); GQA optimizes KV cache usage during inference and improves throughput (a minimal sketch appears after this list).
- Dual Chunk Attention with YARN
- To expand the context window of Qwen2, we implement Dual Chunk Attention (DCA, An et al., 2024), which segments long sequences into chunks of manageable length.
- If the input fits within a single chunk, DCA produces the same result as the original attention.
- Otherwise, DCA helps to effectively capture the relative positional information between tokens within and across chunks, thereby improving long-context performance.
- Moreover, we employ YARN (Peng et al., 2023) to rescale the attention weights for better length extrapolation (see the sketch after this list).
- All other settings are the same as Qwen, including QKV bias.
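As referenced in the GQA item above, the following is a minimal, self-contained sketch of grouped query attention in PyTorch (illustrative only, not the Qwen2 implementation; all shapes are made up). The point is that several query heads share one key/value head, which shrinks the KV cache relative to MHA:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """Illustrative GQA: q has more heads than k/v; each KV head serves a group of query heads.

    q:    (batch, n_q_heads, seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim), with n_q_heads % n_kv_heads == 0
    """
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_q_heads // n_kv_heads
    # Expand each KV head to cover its group of query heads.
    # group_size == 1 recovers MHA; n_kv_heads == 1 is MQA; GQA sits in between.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# Example: 16 query heads sharing 4 KV heads (the KV cache stores 4 heads, not 16).
q = torch.randn(1, 16, 32, 64)
k = torch.randn(1, 4, 32, 64)
v = torch.randn(1, 4, 32, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 16, 32, 64])
```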
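The attention-weight rescaling mentioned in the YARN item can be illustrated with the temperature factor from the YaRN paper, $\sqrt{1/t} = 0.1 \ln s + 1$, where $s$ is the context-extension ratio; the sketch below assumes this published formula and is not Qwen2's exact configuration:

```python
import math
import torch
import torch.nn.functional as F

def yarn_attention_scale(context_scale_factor: float) -> float:
    """YaRN attention 'temperature' factor sqrt(1/t) = 0.1 * ln(s) + 1,
    where s = extended context length / original context length."""
    return 0.1 * math.log(context_scale_factor) + 1.0

def attention_with_yarn_scaling(q, k, v, context_scale_factor: float):
    # Scaling q and k each by sqrt(1/t) is equivalent to dividing the logits by t,
    # which softens the attention distribution at extended context lengths.
    m = yarn_attention_scale(context_scale_factor)
    logits = (q * m) @ (k * m).transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(logits, dim=-1) @ v

# Example: extending a 32K-token window to 128K gives s = 4 (shapes are made up).
q, k, v = (torch.randn(1, 8, 16, 64) for _ in range(3))
print(attention_with_yarn_scaling(q, k, v, context_scale_factor=4.0).shape)
```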
QWEN2 MIXTURE-OF-EXPERTS MODEL
- Expert Granularity: Our model employs fine-grained experts (Dai et al., 2024 / DeepSeekMoE), creating smaller-scale experts while activating a greater number of experts simultaneously.
- Expert Routing: There has been a notable recent trend towards integrating both shared and routing-specific experts within MoE layers (Rajbhandari et al., 2022; Dai et al., 2024). We adopt this approach (a sketch of a shared-plus-routed MoE layer appears after this list).
- Expert Initialization: We initialize the experts in a similar way to upcycling (Komatsuzaki et al., 2023), leveraging the weights of a dense model, with the following modifications (see the initialization sketch after this list):
- Given the designated expert intermediate size $h_E$, the number of experts $n$, and the original FFN intermediate size $h_{FFN}$, the FFN is replicated $\lceil n \times h_E / h_{FFN} \rceil$ times.
- To promote diversity within each FFN copy, parameters are shuffled along the intermediate dimension.
- For each fine-grained expert, 50% of its parameters are randomly reinitialized.
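As referenced in the Expert Routing item, the sketch below shows a generic MoE layer that combines always-active shared experts with top-k routed fine-grained experts (an illustrative DeepSeekMoE-style layout in PyTorch, not Qwen2's implementation; all names and sizes are made up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNExpert(nn.Module):
    """A single fine-grained feed-forward expert (SwiGLU-style)."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class SharedPlusRoutedMoE(nn.Module):
    """Shared experts process every token; routed experts are picked per token by top-k gating."""
    def __init__(self, hidden_size, expert_intermediate_size,
                 n_shared_experts, n_routed_experts, top_k):
        super().__init__()
        self.shared = nn.ModuleList(
            FFNExpert(hidden_size, expert_intermediate_size) for _ in range(n_shared_experts))
        self.routed = nn.ModuleList(
            FFNExpert(hidden_size, expert_intermediate_size) for _ in range(n_routed_experts))
        self.router = nn.Linear(hidden_size, n_routed_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, hidden_size)
        shared_out = sum(expert(x) for expert in self.shared)
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_i = probs.topk(self.top_k, dim=-1)  # (n_tokens, top_k)
        # Naive per-token dispatch for clarity; real implementations batch tokens per expert.
        routed_rows = []
        for t in range(x.shape[0]):
            row = torch.zeros_like(x[t])
            for p, i in zip(top_p[t], top_i[t]):
                row = row + p * self.routed[int(i)](x[t])
            routed_rows.append(row)
        return shared_out + torch.stack(routed_rows)

# Example: 2 shared + 16 fine-grained routed experts, 4 routed experts active per token.
moe = SharedPlusRoutedMoE(hidden_size=64, expert_intermediate_size=128,
                          n_shared_experts=2, n_routed_experts=16, top_k=4)
print(moe(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```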
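And a sketch of the procedure described in the Expert Initialization item: replicate the dense FFN $\lceil n \times h_E / h_{FFN} \rceil$ times, shuffle along the intermediate dimension, slice into fine-grained experts, and randomly re-initialize 50% of each expert's parameters. This is one illustrative reading of that description (only the up- and down-projections are shown; a gate projection would be handled the same way, and all sizes are made up):

```python
import math
import torch

def upcycle_ffn_to_experts(w_up, w_down, n_experts, expert_intermediate_size, reinit_frac=0.5):
    """Build fine-grained expert weights from a dense FFN via the described upcycling recipe.

    w_up:   (h_ffn, hidden) dense up-projection weight
    w_down: (hidden, h_ffn) dense down-projection weight
    """
    h_ffn = w_up.shape[0]
    # 1. Replicate the FFN ceil(n * h_E / h_FFN) times to provide enough intermediate channels.
    n_copies = math.ceil(n_experts * expert_intermediate_size / h_ffn)
    up = w_up.repeat(n_copies, 1)       # (n_copies * h_ffn, hidden)
    down = w_down.repeat(1, n_copies)   # (hidden, n_copies * h_ffn)

    # 2. Shuffle along the intermediate dimension (same permutation for up and down) for diversity.
    perm = torch.randperm(up.shape[0])
    up, down = up[perm], down[:, perm]

    experts = []
    for e in range(n_experts):
        sl = slice(e * expert_intermediate_size, (e + 1) * expert_intermediate_size)
        e_up, e_down = up[sl].clone(), down[:, sl].clone()
        # 3. Randomly re-initialize 50% of each fine-grained expert's parameters.
        for w in (e_up, e_down):
            mask = torch.rand_like(w) < reinit_frac
            w[mask] = 0.02 * torch.randn_like(w)[mask]
        experts.append((e_up, e_down))
    return experts

# Example with small made-up sizes: h_FFN = 1024 split into 8 experts with h_E = 256.
w_up, w_down = torch.randn(1024, 128), torch.randn(128, 1024)
experts = upcycle_ffn_to_experts(w_up, w_down, n_experts=8, expert_intermediate_size=256)
print(len(experts), experts[0][0].shape)  # 8 torch.Size([256, 128])
```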
PRE-TRAINING
In the pre-training of Qwen2, our efforts were focused on refining the dataset and investigating methods to handle extended context lengths effectively.
PRE-TRAINING DATA