Research Question


Approach

Motivation

Existing MoE architectures potentially suffer from knowledge hybridity and knowledge redundancy, which limit expert specialization.

In short, DeepSeekMoE aims to reach the state where specialized knowledge is truly held by specialized experts and common knowledge is shared!

Preliminaries

In the following formulations, we omit the layer normalization operation for brevity.

Transformer Block

$\mathbf{u}_{1:T}^l = \text{Self-Att}\left( \mathbf{h}_{1:T}^{l-1} \right) + \mathbf{h}_{1:T}^{l-1}$

$\mathbf{h}_t^l = \text{FFN}\left( \mathbf{u}_t^l \right) + \mathbf{u}_t^l$
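Below is a minimal PyTorch sketch of these two equations: self-attention with a residual connection followed by a position-wise FFN with a residual connection. Layer normalization is omitted to mirror the simplified formulas; the names `TransformerBlock`, `d_model`, `n_heads`, and `d_ffn` are illustrative assumptions, not from the paper.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Sketch of the standard block: u = Self-Att(h_prev) + h_prev, h = FFN(u) + u."""

    def __init__(self, d_model: int, n_heads: int, d_ffn: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Position-wise FFN; in DeepSeekMoE this dense FFN is later replaced by MoE experts.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn),
            nn.GELU(),
            nn.Linear(d_ffn, d_model),
        )

    def forward(self, h_prev: torch.Tensor) -> torch.Tensor:
        # u_{1:T}^l = Self-Att(h_{1:T}^{l-1}) + h_{1:T}^{l-1}
        u, _ = self.self_attn(h_prev, h_prev, h_prev)
        u = u + h_prev
        # h_t^l = FFN(u_t^l) + u_t^l, applied independently at each position t
        return self.ffn(u) + u
```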