We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens.
DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE.
Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times.
Computing QKV
$\mathbf{q}_t = W^Q \mathbf{h}_t,$ $\mathbf{k}_t = W^K \mathbf{h}_t,$ $\mathbf{v}_t = W^V \mathbf{h}_t,$
where $\mathbf{h}_t \in \mathbb{R}^d$ is the attention input of the $t$-th token and $W^Q, W^K, W^V \in \mathbb{R}^{d_h n_h \times d}$ are the projection matrices.
Multi-head attention: $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ will be sliced into $n_h$ heads for the multi-head attention computation:
$[\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; \dots; \mathbf{q}_{t,n_h}] = \mathbf{q}_t,$
$[\mathbf{k}_{t,1}; \mathbf{k}_{t,2}; \dots; \mathbf{k}_{t,n_h}] = \mathbf{k}_t,$
$[\mathbf{v}_{t,1}; \mathbf{v}_{t,2}; \dots; \mathbf{v}_{t,n_h}] = \mathbf{v}_t,$
$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \text{Softmax}_j\left(\frac{\mathbf{q}_{t,i}^\top \mathbf{k}_{j,i}}{\sqrt{d_h}}\right) \mathbf{v}_{j,i},$
$\mathbf{u}_t = W^O [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \dots; \mathbf{o}_{t,n_h}]$
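For concreteness, here is a minimal PyTorch sketch of this per-token MHA computation; the dimensions $d$, $n_h$, $d_h$ and the random weights are illustrative assumptions, not DeepSeek-V2's actual configuration.

```python
# Minimal sketch of standard per-token MHA (illustrative, not DeepSeek-V2 code).
import torch
import torch.nn.functional as F

d, n_h, d_h = 512, 8, 64                 # model dim, heads, per-head dim (hypothetical)
W_Q = torch.randn(n_h * d_h, d) / d**0.5
W_K = torch.randn(n_h * d_h, d) / d**0.5
W_V = torch.randn(n_h * d_h, d) / d**0.5
W_O = torch.randn(d, n_h * d_h) / (n_h * d_h)**0.5

T = 10                                   # tokens seen so far
h = torch.randn(T, d)                    # attention inputs h_1..h_T

q = h @ W_Q.T                            # (T, n_h*d_h)
k = h @ W_K.T
v = h @ W_V.T

# slice into heads: (T, n_h, d_h)
q, k, v = (x.view(T, n_h, d_h) for x in (q, k, v))

t = T - 1                                # current token (0-indexed)
scores = torch.einsum('hd,jhd->hj', q[t], k[:t + 1]) / d_h**0.5   # (n_h, t+1)
o = torch.einsum('hj,jhd->hd', F.softmax(scores, dim=-1), v[:t + 1])
u = o.reshape(n_h * d_h) @ W_O.T         # output u_t, shape (d,)
print(u.shape)
```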
Cache space
In standard multi-head attention (MHA), each layer must cache the keys ($\mathbf{k}$) and values ($\mathbf{v}$) of all past tokens during inference.
Each token's key and value vectors are sliced into $n_h$ heads. Each head has dimensionality $d_h$. So each token's key cache needs to store $n_h \times d_h$ elements. Similarly, each token's value cache also needs to store $n_h \times d_h$ elements. Thus, keys + values together require: $2 \times n_h \times d_h$ elements per token.
Now, if the model has $l$ layers, the total number of cached elements becomes: $2 \times n_h \times d_h \times l$ for each token.
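A quick back-of-the-envelope check of this formula; the head count, head dimension, layer count, and 2-byte element size below are assumed for illustration only, not DeepSeek-V2's configuration.

```python
# Sanity check of the MHA cache-size formula with hypothetical dimensions.
n_h, d_h, l = 64, 128, 60                 # heads, per-head dim, layers (assumed)
bytes_per_elem = 2                        # e.g. bf16/fp16

mha_elems_per_token = 2 * n_h * d_h * l   # keys + values, across all layers
print(mha_elems_per_token)                                    # 983040 elements
print(mha_elems_per_token * bytes_per_elem / 2**20, "MiB per token")   # ~1.9 MiB

# For a 128K-token context this cache alone would occupy:
print(mha_elems_per_token * bytes_per_elem * 128 * 1024 / 2**30, "GiB")  # ~240 GiB
```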
Low-rank joint compression for keys and values
The core of MLA is the low-rank joint compression for keys and values to reduce KV cache:
$\mathbf{c}_t^{\text{KV}} = W^{\text{DKV}} \mathbf{h}_t,$
$\mathbf{k}_t^C = W^{\text{UK}} \mathbf{c}_t^{\text{KV}},$
$\mathbf{v}_t^C = W^{\text{UV}} \mathbf{c}_t^{\text{KV}}$
where $\mathbf{c}_t^{\text{KV}} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c$ ($\ll d_h n_h$) denotes the KV compression dimension; $W^{\text{DKV}} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix; and $W^{\text{UK}}, W^{\text{UV}} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively.
During inference, MLA only needs to cache $\mathbf{c}_t^{\text{KV}}$, so its KV cache has only $d_c l$ elements per token (a single latent vector per layer stands in for both the keys and the values).
In addition, during inference, since $W^{\text{UK}}$ can be absorbed into $W^Q$ and $W^{\text{UV}}$ can be absorbed into $W^O$, we do not even need to compute the keys and values explicitly for attention.
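The sketch below illustrates this caching scheme with assumed toy dimensions: only $\mathbf{c}_t^{\text{KV}}$ is stored per token, and keys/values are reconstructed from it on the fly. It materializes $W^{\text{UK}}$ and $W^{\text{UV}}$ for clarity; in practice they would be absorbed into $W^Q$ and $W^O$ as noted above.

```python
# Minimal sketch of low-rank joint KV compression (illustrative dimensions,
# not the DeepSeek-V2 implementation).
import torch

d, n_h, d_h, d_c = 512, 8, 64, 64    # d_c << n_h * d_h  (hypothetical values)
W_DKV = torch.randn(d_c, d) / d**0.5             # down-projection
W_UK  = torch.randn(n_h * d_h, d_c) / d_c**0.5   # up-projection for keys
W_UV  = torch.randn(n_h * d_h, d_c) / d_c**0.5   # up-projection for values

kv_cache = []                         # per layer we only store c_t^KV (d_c each)

def step(h_t):
    """Process one token: cache the latent, reconstruct keys/values on the fly."""
    c_kv = W_DKV @ h_t                # c_t^KV, shape (d_c,)
    kv_cache.append(c_kv)
    C = torch.stack(kv_cache)         # (t, d_c) -- this is all we ever cache
    k = (C @ W_UK.T).view(-1, n_h, d_h)   # k_j^C for j <= t
    v = (C @ W_UV.T).view(-1, n_h, d_h)   # v_j^C for j <= t
    return k, v

k, v = step(torch.randn(d))
print(k.shape, v.shape, len(kv_cache[0]))  # cache holds d_c elements per token
```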
Low-rank joint compression for queries
In order to reduce the activation memory during training, we also perform low-rank compression for the queries, even though this cannot reduce the KV cache:
$\mathbf{c}_t^Q = W^{DQ}\mathbf{h}_t,$
$\mathbf{q}_t^C = W^{UQ}\mathbf{c}_t^Q,$
where $\mathbf{c}_t^Q \in \mathbb{R}^{d'_c}$ is the compressed latent vector for queries; $d'_c$ ($\ll d_h n_h$) denotes the query compression dimension; and $W^{DQ} \in \mathbb{R}^{d'_c \times d}$, $W^{UQ} \in \mathbb{R}^{d_h n_h \times d'_c}$ are the down-projection and up-projection matrices for queries, respectively.
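A minimal sketch of this query path, continuing the KV sketch above with toy dimensions assumed for illustration:

```python
# Minimal sketch of low-rank query compression (illustrative dimensions,
# not the DeepSeek-V2 implementation).
import torch

d, n_h, d_h, d_c_q = 512, 8, 64, 96       # d'_c << n_h * d_h (hypothetical)
W_DQ = torch.randn(d_c_q, d) / d**0.5              # query down-projection
W_UQ = torch.randn(n_h * d_h, d_c_q) / d_c_q**0.5  # query up-projection

h_t = torch.randn(d)
c_q = W_DQ @ h_t                           # c_t^Q, shape (d'_c,)
q_c = (W_UQ @ c_q).view(n_h, d_h)          # q_{t,i}^C, one row per head
print(c_q.shape, q_c.shape)
```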
We intend to use the Rotary Position Embedding (RoPE) (Su et al., 2024) for DeepSeek-V2. However, RoPE is incompatible with low-rank KV compression.
Problem: If RoPE were applied to the compressed keys $\mathbf{k}_t^C$, then $W^{\text{UK}}$ would be coupled with a position-dependent RoPE matrix and could no longer be absorbed into $W^Q$ during inference, since matrix multiplication does not commute. We would then have to recompute the keys of all prefix tokens during generation, which significantly hinders inference efficiency.
Solution:
We propose the decoupled RoPE strategy that uses additional multi-head queries $\mathbf{q}_{t,i}^R \in \mathbb{R}^{d_h^R}$ and a shared key $\mathbf{k}_t^R \in \mathbb{R}^{d_h^R}$ to carry RoPE, where $d_h^R$ denotes the per-head dimension of the decoupled queries and key.
$[\mathbf{q}_{t,1}^R; \mathbf{q}_{t,2}^R; \dots; \mathbf{q}_{t,n_h}^R] = \mathbf{q}_t^R = \text{RoPE}(W^{QR}\mathbf{c}_t^Q),$
$\mathbf{k}_t^R = \text{RoPE}(W^{KR}\mathbf{h}_t),$
$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^C; \mathbf{q}_{t,i}^R],$
$\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^C; \mathbf{k}_t^R],$
$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \text{Softmax}_j\left(\frac{\mathbf{q}_{t,i}^\top\mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right)\mathbf{v}_{j,i},$
$\mathbf{u}_t = W^O[\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \dots; \mathbf{o}_{t,n_h}],$
where $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d'_c}$ and $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ are the matrices that produce the decoupled queries and key, $\text{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices, and $[\cdot;\cdot]$ denotes concatenation. During inference, the decoupled key also needs to be cached, so DeepSeek-V2 requires a total KV cache of $(d_c + d_h^R)l$ elements per token.
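A rough sketch of the decoupled-RoPE attention for a single token is given below; the dimensions, the rotary helper, and the stand-in $\mathbf{q}^C$/$\mathbf{k}^C$ tensors are illustrative assumptions rather than the DeepSeek-V2 implementation (in the full model they come from the compressed query/KV paths above).

```python
# Minimal sketch of decoupled-RoPE attention logits (illustrative dimensions).
import torch

d, n_h, d_h, d_c_q, d_hR = 512, 8, 64, 96, 32   # hypothetical values

def rope(x, pos):
    """Apply a standard rotary embedding to the last dim of x (even size)."""
    half = x.shape[-1] // 2
    freq = 1.0 / (10000 ** (torch.arange(half) / half))
    angle = pos * freq
    cos, sin = torch.cos(angle), torch.sin(angle)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

W_QR = torch.randn(n_h * d_hR, d_c_q) / d_c_q**0.5   # decoupled-query projection
W_KR = torch.randn(d_hR, d) / d**0.5                 # shared decoupled-key projection

t = 3                                     # token position
c_q = torch.randn(d_c_q)                  # c_t^Q from the query compression
h_t = torch.randn(d)

q_r = rope((W_QR @ c_q).view(n_h, d_hR), t)   # per-head decoupled queries q^R_{t,i}
k_r = rope(W_KR @ h_t, t)                     # single shared decoupled key k^R_t

# final per-head queries/keys are concatenations [q^C; q^R] and [k^C; k^R]
q_c = torch.randn(n_h, d_h)               # stand-in for q^C_{t,i}
k_c = torch.randn(n_h, d_h)               # stand-in for k^C_{t,i}
q_full = torch.cat([q_c, q_r], dim=-1)                    # (n_h, d_h + d_hR)
k_full = torch.cat([k_c, k_r.expand(n_h, d_hR)], dim=-1)  # shared k^R per head
score = (q_full * k_full).sum(-1) / (d_h + d_hR) ** 0.5   # logits of token t vs itself
print(q_full.shape, score.shape)
```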
Notes