In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs).
For dense models, we maintain the Transformer-based decoder architecture (Vaswani et al., 2017; Radford et al., 2018) used in Qwen2.
For MoE models, following the approaches demonstrated in Qwen1.5-MoE (Yang et al., 2024a), we implement fine-grained expert segmentation (Dai et al., 2024) and shared experts routing (Rajbhandari et al., 2022; Dai et al., 2024).
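To make these two routing ideas concrete, the sketch below shows how a dense feed-forward layer could be replaced by an MoE layer that combines always-active shared experts with fine-grained routed experts selected per token by a top-k gate. This is a minimal PyTorch illustration under our own assumptions about module names, expert counts, and sizes; it is not the released Qwen2.5 implementation.

```python
# Minimal sketch: MoE FFN with fine-grained experts and shared-experts routing.
# All sizes and counts below are illustrative assumptions, not Qwen2.5's config.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUExpert(nn.Module):
    """One small FFN expert; fine-grained segmentation means its intermediate
    size is a fraction of a standard dense FFN's."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class MoELayer(nn.Module):
    """Drop-in replacement for a dense FFN: shared experts see every token,
    routed experts are chosen per token by a top-k softmax gate."""
    def __init__(self, hidden_size=512, expert_intermediate=256,
                 num_routed_experts=16, num_shared_experts=2, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_routed_experts, bias=False)
        self.routed_experts = nn.ModuleList(
            [SwiGLUExpert(hidden_size, expert_intermediate)
             for _ in range(num_routed_experts)])
        self.shared_experts = nn.ModuleList(
            [SwiGLUExpert(hidden_size, expert_intermediate)
             for _ in range(num_shared_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size); shared experts process every token.
        out = sum(expert(x) for expert in self.shared_experts)
        # Route each token to its top-k experts and mix their outputs with
        # the renormalized gate probabilities.
        gate_probs = F.softmax(self.router(x), dim=-1)
        weights, indices = torch.topk(gate_probs, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        for slot in range(self.top_k):
            for expert_id in indices[:, slot].unique().tolist():
                mask = indices[:, slot] == expert_id
                out[mask] += (weights[mask, slot].unsqueeze(-1)
                              * self.routed_experts[expert_id](x[mask]))
        return out
```

For instance, `MoELayer()(torch.randn(8, 512))` returns a tensor of the same shape as its input, so the layer can stand in for the dense FFN it replaces.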
Several key architectural components are shared across both the dense and MoE models: Grouped Query Attention (GQA) for efficient KV cache utilization, the SwiGLU activation function, Rotary Positional Embeddings (RoPE) for encoding position information, QKV bias in the attention mechanism, and RMSNorm with pre-normalization for stable training.
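As a rough illustration of how the attention-side components fit together, the following sketch combines grouped-query projections (fewer key/value heads than query heads), bias on the Q/K/V projections, and a rotary embedding applied to queries and keys. The head counts, dimensions, and unbatched interface are illustrative assumptions, not the actual Qwen2.5 configuration.

```python
# Minimal sketch of grouped query attention with QKV bias and RoPE.
# Sizes and the single-sequence (unbatched) interface are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0):
    # x: (seq_len, num_heads, head_dim); rotate channel pairs by
    # position-dependent angles (the "rotate halves" RoPE variant).
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = positions.to(x.dtype)[:, None] * freqs            # (seq_len, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class GroupedQueryAttention(nn.Module):
    def __init__(self, hidden_size=1024, num_q_heads=16, num_kv_heads=4):
        super().__init__()
        self.head_dim = hidden_size // num_q_heads
        self.num_q_heads, self.num_kv_heads = num_q_heads, num_kv_heads
        # Q/K/V projections carry a bias term; the output projection does not.
        self.q_proj = nn.Linear(hidden_size, num_q_heads * self.head_dim, bias=True)
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=True)
        self.v_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=True)
        self.o_proj = nn.Linear(num_q_heads * self.head_dim, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, hidden_size), one unbatched sequence for simplicity.
        seq_len, _ = x.shape
        positions = torch.arange(seq_len, device=x.device)
        q = self.q_proj(x).view(seq_len, self.num_q_heads, self.head_dim)
        k = self.k_proj(x).view(seq_len, self.num_kv_heads, self.head_dim)
        v = self.v_proj(x).view(seq_len, self.num_kv_heads, self.head_dim)
        q, k = apply_rope(q, positions), apply_rope(k, positions)
        # Each group of query heads shares one key/value head, shrinking the
        # KV cache by a factor of num_q_heads / num_kv_heads.
        group = self.num_q_heads // self.num_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        attn = F.scaled_dot_product_attention(
            q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1), is_causal=True)
        return self.o_proj(attn.transpose(0, 1).reshape(seq_len, -1))
```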