Research Question
- Conventional MoE architectures like GShard, which activate the top-$K$ out of $N$ experts, face challenges in ensuring expert specialization, i.e., that each expert acquires non-overlapping and focused knowledge.
- We propose the DeepSeekMoE architecture, which aims at ultimate expert specialization through two principal strategies (see the sketch after this list):
- (1) finely segmenting the experts into $mN$ ones and activating $mK$ from them
- (2) isolating $K_s$ experts as shared ones
- We first demonstrate the capabilities of a model with 2B parameters, then scale DeepSeekMoE up to 16B parameters, and lastly describe our preliminary efforts to scale DeepSeekMoE up to 145B parameters.
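
A minimal sketch can make the two strategies concrete. The PyTorch code below is illustrative only, not the released DeepSeekMoE implementation: the names (`FineGrainedMoE`, `Expert`, `n_routed`, `n_shared`, `top_k`, `d_hidden`) are assumptions, the gate is a plain softmax + top-k without the paper's load-balancing terms, and the per-token loop stands in for an efficient batched dispatch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """One FFN expert; fine-grained experts use a smaller hidden size d_hidden."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.w_out(F.gelu(self.w_in(x)))


class FineGrainedMoE(nn.Module):
    """Strategy (1): many small routed experts, top-k activated per token.
    Strategy (2): K_s shared experts that every token always passes through."""
    def __init__(self, d_model, d_hidden, n_routed, n_shared, top_k):
        super().__init__()
        self.routed = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_routed))
        self.shared = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_shared))
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                               # x: (T, d_model), one token sequence
        scores = F.softmax(self.gate(x), dim=-1)        # token-to-expert affinities
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)

        # Shared experts: no routing, every token goes through all of them.
        shared_out = sum(expert(x) for expert in self.shared)

        # Routed experts: each token only visits its top-k fine-grained experts.
        # (A per-token loop for readability; real systems batch tokens per expert.)
        routed_out = torch.zeros_like(x)
        for t in range(x.size(0)):
            for score, idx in zip(topk_scores[t], topk_idx[t]):
                routed_out[t] = routed_out[t] + score * self.routed[int(idx)](x[t])

        return x + shared_out + routed_out              # residual, mirroring FFN(u) + u
```

With, e.g., `FineGrainedMoE(d_model=1024, d_hidden=512, n_routed=64, n_shared=2, top_k=6)` (example numbers, not necessarily the paper's configuration), each token activates 2 shared experts plus 6 of the 64 routed experts.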

Approach
Motivation
Existing MoE architectures potentially suffer from issues of knowledge hybridity and knowledge redundancy, which limit expert specialization:
- Knowledge Hybridity (from an expert's perspective): existing MoE practices often employ a limited number of experts (e.g., 8 or 16), so a specific expert receives tokens that likely cover diverse types of knowledge.
- Knowledge Redundancy (from the tokens' perspective): tokens assigned to different experts may require common knowledge.
- As a result, multiple experts may converge in acquiring shared knowledge in their respective parameters, thereby leading to redundancy in expert parameters.
In short, DeepSeekMoE aims to reach the state where specialized knowledge is handled by specialized experts and shared knowledge is handled by shared experts.
Preliminaries
In the following formulations, we omit the layer normalization operation for brevity.
Transformer Block
$\mathbf{u}_{1:T}^l = \text{Self-Att}\left( \mathbf{h}_{1:T}^{l-1} \right) + \mathbf{h}_{1:T}^{l-1}$
$\mathbf{h}_t^l = \text{FFN}\left( \mathbf{u}_t^l \right) + \mathbf{u}_t^l$
- $\mathbf{u}_{1:T}^l \in \mathbb{R}^{T \times d}$ are the hidden states of all tokens after the $l$-th attention module, with $T$ being the sequence length,
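
Read literally, the two equations describe a pre-residual block without normalization. The sketch below is a simplified transcription in PyTorch; `nn.MultiheadAttention` is only a stand-in for Self-Att (an assumption, not necessarily the attention variant used in the paper), and layer normalization and causal masking are omitted, matching the formulas above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ffn):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_in = nn.Linear(d_model, d_ffn)
        self.ffn_out = nn.Linear(d_ffn, d_model)

    def forward(self, h_prev):                   # h_prev: (batch, T, d_model) = h_{1:T}^{l-1}
        # u_{1:T}^l = Self-Att(h_{1:T}^{l-1}) + h_{1:T}^{l-1}
        u = self.attn(h_prev, h_prev, h_prev, need_weights=False)[0] + h_prev
        # h_t^l = FFN(u_t^l) + u_t^l   (FFN applied position-wise)
        return self.ffn_out(F.gelu(self.ffn_in(u))) + u


block = TransformerBlock(d_model=64, n_heads=4, d_ffn=256)
h_prev = torch.randn(2, 8, 64)                   # batch of 2 sequences, T = 8 tokens
h_next = block(h_prev)                           # shape (2, 8, 64)
```

In an MoE Transformer such as DeepSeekMoE, the dense FFN in this block is substituted with an MoE layer like the one sketched earlier.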