Research Question
- Conventional MoE architectures like GShard, which activate the top-$K$ out of $N$ experts, face challenges in ensuring expert specialization, i.e., that each expert acquires non-overlapping and focused knowledge.
- We propose the DeepSeekMoE architecture, which aims at ultimate expert specialization through two principal strategies (see the sketch after this list):
- (1) finely segmenting the experts into $mN$ ones and activating $mK$ from them
- (2) isolating $K_s$ experts as shared ones
- We first demonstrate the capabilities of a model with 2B parameters, then scale DeepSeekMoE up to 16B parameters, and lastly describe our preliminary efforts to scale DeepSeekMoE up to 145B parameters.
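
A minimal sketch can make the two strategies concrete. The PyTorch code below is illustrative only, not the released DeepSeekMoE implementation: the names (`FineGrainedMoE`, `Expert`, `n_routed`, `n_shared`, `top_k`, `d_hidden`) are assumptions, the gate is a plain softmax + top-k without the paper's load-balancing terms, and the per-token loop stands in for an efficient batched dispatch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """One FFN expert; fine-grained experts use a smaller hidden size d_hidden."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.w_out(F.gelu(self.w_in(x)))


class FineGrainedMoE(nn.Module):
    """Strategy (1): many small routed experts, top-k activated per token.
    Strategy (2): K_s shared experts that every token always passes through."""
    def __init__(self, d_model, d_hidden, n_routed, n_shared, top_k):
        super().__init__()
        self.routed = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_routed))
        self.shared = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_shared))
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                               # x: (T, d_model), one token sequence
        scores = F.softmax(self.gate(x), dim=-1)        # token-to-expert affinities
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)

        # Shared experts: no routing, every token goes through all of them.
        shared_out = sum(expert(x) for expert in self.shared)

        # Routed experts: each token only visits its top-k fine-grained experts.
        # (A per-token loop for readability; real systems batch tokens per expert.)
        routed_out = torch.zeros_like(x)
        for t in range(x.size(0)):
            for score, idx in zip(topk_scores[t], topk_idx[t]):
                routed_out[t] = routed_out[t] + score * self.routed[int(idx)](x[t])

        return x + shared_out + routed_out              # residual, mirroring FFN(u) + u
```

With, e.g., `FineGrainedMoE(d_model=1024, d_hidden=512, n_routed=64, n_shared=2, top_k=6)` (example numbers, not necessarily the paper's configuration), each token activates 2 shared experts plus 6 of the 64 routed experts.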

Approach
Motivation
Existing MoE architectures potentially suffer from issues of knowledge hybridity and knowledge redundancy, which limit expert specialization:
- Knowledge Hybridity (from an expert's perspective): existing MoE practices often employ a limited number of experts (e.g., 8 or 16), so a specific expert receives tokens that likely cover diverse types of knowledge.
- Knowledge Redundancy (from the tokens' perspective): tokens assigned to different experts may require common knowledge.
- As a result, multiple experts may converge in acquiring shared knowledge in their respective parameters, thereby leading to redundancy in expert parameters.
In short, DeepSeekMoE aims to reach the state where specialized knowledge is handled by specialized experts and shared knowledge is handled by shared experts.
Preliminaries
In the following formulations, we omit the layer normalization operation for brevity.
Transformer Block
$\mathbf{u}_{1:T}^l = \text{Self-Att}\left( \mathbf{h}_{1:T}^{l-1} \right) + \mathbf{h}_{1:T}^{l-1}$
$\mathbf{h}_t^l = \text{FFN}\left( \mathbf{u}_t^l \right) + \mathbf{u}_t^l$
- $\mathbf{u}_{1:T}^l \in \mathbb{R}^{T \times d}$ are the hidden states of all tokens after the $l$-th attention module, with $T$ being the sequence length,
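
Read literally, the two equations describe a pre-residual block without normalization. The sketch below is a simplified transcription in PyTorch; `nn.MultiheadAttention` is only a stand-in for Self-Att (an assumption, not necessarily the attention variant used in the paper), and layer normalization and causal masking are omitted, matching the formulas above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ffn):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_in = nn.Linear(d_model, d_ffn)
        self.ffn_out = nn.Linear(d_ffn, d_model)

    def forward(self, h_prev):                   # h_prev: (batch, T, d_model) = h_{1:T}^{l-1}
        # u_{1:T}^l = Self-Att(h_{1:T}^{l-1}) + h_{1:T}^{l-1}
        u = self.attn(h_prev, h_prev, h_prev, need_weights=False)[0] + h_prev
        # h_t^l = FFN(u_t^l) + u_t^l   (FFN applied position-wise)
        return self.ffn_out(F.gelu(self.ffn_in(u))) + u


block = TransformerBlock(d_model=64, n_heads=4, d_ffn=256)
h_prev = torch.randn(2, 8, 64)                   # batch of 2 sequences, T = 8 tokens
h_next = block(h_prev)                           # shape (2, 8, 64)
```

In an MoE Transformer such as DeepSeekMoE, the dense FFN in this block is substituted with an MoE layer like the one sketched earlier.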