We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens.
DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE.
Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times.
Computing QKV
$\mathbf{q}_t = W^Q \mathbf{h}_t,$ $\mathbf{k}_t = W^K \mathbf{h}_t,$ $\mathbf{v}_t = W^V \mathbf{h}_t,$
where $\mathbf{h}_t \in \mathbb{R}^d$ is the attention input of the $t$-th token and $W^Q, W^K, W^V \in \mathbb{R}^{d_h n_h \times d}$ are the projection matrices.
Multi-head attention: $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ will be sliced into $n_h$ heads for the multi-head attention computation:
$[\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; \dots; \mathbf{q}_{t,n_h}] = \mathbf{q}_t,$
$[\mathbf{k}_{t,1}; \mathbf{k}_{t,2}; \dots; \mathbf{k}_{t,n_h}] = \mathbf{k}_t,$
$[\mathbf{v}_{t,1}; \mathbf{v}_{t,2}; \dots; \mathbf{v}_{t,n_h}] = \mathbf{v}_t,$
$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \text{Softmax}_j\left(\frac{\mathbf{q}_{t,i}^\top \mathbf{k}_{j,i}}{\sqrt{d_h}}\right) \mathbf{v}_{j,i},$
$\mathbf{u}_t = W^O [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \dots; \mathbf{o}_{t,n_h}]$
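For concreteness, here is a minimal PyTorch sketch of this per-token MHA computation; the dimensions $d$, $n_h$, $d_h$ and the random weights are illustrative assumptions, not DeepSeek-V2's actual configuration.

```python
# Minimal sketch of standard per-token MHA (illustrative, not DeepSeek-V2 code).
import torch
import torch.nn.functional as F

d, n_h, d_h = 512, 8, 64                 # model dim, heads, per-head dim (hypothetical)
W_Q = torch.randn(n_h * d_h, d) / d**0.5
W_K = torch.randn(n_h * d_h, d) / d**0.5
W_V = torch.randn(n_h * d_h, d) / d**0.5
W_O = torch.randn(d, n_h * d_h) / (n_h * d_h)**0.5

T = 10                                   # tokens seen so far
h = torch.randn(T, d)                    # attention inputs h_1..h_T

q = h @ W_Q.T                            # (T, n_h*d_h)
k = h @ W_K.T
v = h @ W_V.T

# slice into heads: (T, n_h, d_h)
q, k, v = (x.view(T, n_h, d_h) for x in (q, k, v))

t = T - 1                                # current token (0-indexed)
scores = torch.einsum('hd,jhd->hj', q[t], k[:t + 1]) / d_h**0.5   # (n_h, t+1)
o = torch.einsum('hj,jhd->hd', F.softmax(scores, dim=-1), v[:t + 1])
u = o.reshape(n_h * d_h) @ W_O.T         # output u_t, shape (d,)
print(u.shape)
```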
Cache space
In standard multi-head attention (MHA), each layer must cache the keys ($\mathbf{k}$) and values ($\mathbf{v}$) of all past tokens during inference.
Each token's key and value vectors are sliced into $n_h$ heads. Each head has dimensionality $d_h$. So each token's key cache needs to store $n_h \times d_h$ elements. Similarly, each token's value cache also needs to store $n_h \times d_h$ elements. Thus, keys + values together require: $2 \times n_h \times d_h$ elements per token.
Now, if the model has $l$ layers, the total number of cached elements becomes: $2 \times n_h \times d_h \times l$ for each token.
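A quick back-of-the-envelope check of this formula; the head count, head dimension, layer count, and 2-byte element size below are assumed for illustration only, not DeepSeek-V2's configuration.

```python
# Sanity check of the MHA cache-size formula with hypothetical dimensions.
n_h, d_h, l = 64, 128, 60                 # heads, per-head dim, layers (assumed)
bytes_per_elem = 2                        # e.g. bf16/fp16

mha_elems_per_token = 2 * n_h * d_h * l   # keys + values, across all layers
print(mha_elems_per_token)                                    # 983040 elements
print(mha_elems_per_token * bytes_per_elem / 2**20, "MiB per token")   # ~1.9 MiB

# For a 128K-token context this cache alone would occupy:
print(mha_elems_per_token * bytes_per_elem * 128 * 1024 / 2**30, "GiB")  # ~240 GiB
```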
Low-rank joint compression for keys and values
The core of MLA is the low-rank joint compression for keys and values to reduce KV cache:
$\mathbf{c}_t^{\text{KV}} = W^{\text{DKV}} \mathbf{h}_t,$
$\mathbf{k}_t^C = W^{\text{UK}} \mathbf{c}_t^{\text{KV}},$
$\mathbf{v}_t^C = W^{\text{UV}} \mathbf{c}_t^{\text{KV}}$
where $\mathbf{c}_t^{\text{KV}} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c$ ($\ll d_h n_h$) denotes the KV compression dimension; $W^{\text{DKV}} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix; and $W^{\text{UK}}, W^{\text{UV}} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively.
During inference, MLA only needs to cache $\mathbf{c}_t^{\text{KV}}$, so its KV cache has only $d_c l$ elements per token (a single latent vector per layer stands in for both the keys and the values).
In addition, during inference, since $W^{\text{UK}}$ can be absorbed into $W^Q$ and $W^{\text{UV}}$ can be absorbed into $W^O$, we do not even need to compute the keys and values explicitly for attention.
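The sketch below illustrates this caching scheme with assumed toy dimensions: only $\mathbf{c}_t^{\text{KV}}$ is stored per token, and keys/values are reconstructed from it on the fly. It materializes $W^{\text{UK}}$ and $W^{\text{UV}}$ for clarity; in practice they would be absorbed into $W^Q$ and $W^O$ as noted above.

```python
# Minimal sketch of low-rank joint KV compression (illustrative dimensions,
# not the DeepSeek-V2 implementation).
import torch

d, n_h, d_h, d_c = 512, 8, 64, 64    # d_c << n_h * d_h  (hypothetical values)
W_DKV = torch.randn(d_c, d) / d**0.5             # down-projection
W_UK  = torch.randn(n_h * d_h, d_c) / d_c**0.5   # up-projection for keys
W_UV  = torch.randn(n_h * d_h, d_c) / d_c**0.5   # up-projection for values

kv_cache = []                         # per layer we only store c_t^KV (d_c each)

def step(h_t):
    """Process one token: cache the latent, reconstruct keys/values on the fly."""
    c_kv = W_DKV @ h_t                # c_t^KV, shape (d_c,)
    kv_cache.append(c_kv)
    C = torch.stack(kv_cache)         # (t, d_c) -- this is all we ever cache
    k = (C @ W_UK.T).view(-1, n_h, d_h)   # k_j^C for j <= t
    v = (C @ W_UV.T).view(-1, n_h, d_h)   # v_j^C for j <= t
    return k, v

k, v = step(torch.randn(d))
print(k.shape, v.shape, len(kv_cache[0]))  # cache holds d_c elements per token
```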
Low-rank joint compression for queries
In order to reduce the activation memory during training, we also perform low-rank compression for the queries, even though this cannot reduce the KV cache:
$\mathbf{c}_t^Q = W^{DQ}\mathbf{h}_t,$
$\mathbf{q}_t^C = W^{UQ}\mathbf{c}_t^Q,$
where $\mathbf{c}_t^Q \in \mathbb{R}^{d'_c}$ is the compressed latent vector for queries; $d'_c$ ($\ll d_h n_h$) denotes the query compression dimension; and $W^{DQ} \in \mathbb{R}^{d'_c \times d}$, $W^{UQ} \in \mathbb{R}^{d_h n_h \times d'_c}$ are the down-projection and up-projection matrices for queries, respectively.
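A minimal sketch of this query path, continuing the KV sketch above with toy dimensions assumed for illustration:

```python
# Minimal sketch of low-rank query compression (illustrative dimensions,
# not the DeepSeek-V2 implementation).
import torch

d, n_h, d_h, d_c_q = 512, 8, 64, 96       # d'_c << n_h * d_h (hypothetical)
W_DQ = torch.randn(d_c_q, d) / d**0.5              # query down-projection
W_UQ = torch.randn(n_h * d_h, d_c_q) / d_c_q**0.5  # query up-projection

h_t = torch.randn(d)
c_q = W_DQ @ h_t                           # c_t^Q, shape (d'_c,)
q_c = (W_UQ @ c_q).view(n_h, d_h)          # q_{t,i}^C, one row per head
print(c_q.shape, q_c.shape)
```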
We intend to use the Rotary Position Embedding (RoPE) (Su et al., 2024) for DeepSeek-V2. However, RoPE is incompatible with low-rank KV compression.
Problem: If RoPE were applied to the compressed keys $\mathbf{k}_t^C$, then $W^{\text{UK}}$ would be coupled with a position-dependent RoPE matrix and could no longer be absorbed into $W^Q$ during inference, since matrix multiplication does not commute. We would then have to recompute the keys of all prefix tokens during generation, which significantly hinders inference efficiency.
Solution:
We propose the decoupled RoPE strategy that uses additional multi-head queries $\mathbf{q}_{t,i}^R \in \mathbb{R}^{d_h^R}$ and a shared key $\mathbf{k}_t^R \in \mathbb{R}^{d_h^R}$ to carry RoPE, where $d_h^R$ denotes the per-head dimension of the decoupled queries and key.
$[\mathbf{q}_{t,1}^R; \mathbf{q}_{t,2}^R; \dots; \mathbf{q}_{t,n_h}^R] = \mathbf{q}_t^R = \text{RoPE}(W^{QR}\mathbf{c}_t^Q),$
$\mathbf{k}_t^R = \text{RoPE}(W^{KR}\mathbf{h}_t),$
$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^C; \mathbf{q}_{t,i}^R],$
$\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^C; \mathbf{k}_t^R],$
$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \text{Softmax}_j\left(\frac{\mathbf{q}_{t,i}^\top\mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right)\mathbf{v}_{j,i},$
$\mathbf{u}_t = W^O[\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \dots; \mathbf{o}_{t,n_h}],$
where $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d'_c}$ and $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ are the matrices that produce the decoupled queries and key, $\text{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices, and $[\cdot;\cdot]$ denotes concatenation. During inference, the decoupled key also needs to be cached, so DeepSeek-V2 requires a total KV cache of $(d_c + d_h^R)l$ elements per token.
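A rough sketch of the decoupled-RoPE attention for a single token is given below; the dimensions, the rotary helper, and the stand-in $\mathbf{q}^C$/$\mathbf{k}^C$ tensors are illustrative assumptions rather than the DeepSeek-V2 implementation (in the full model they come from the compressed query/KV paths above).

```python
# Minimal sketch of decoupled-RoPE attention logits (illustrative dimensions).
import torch

d, n_h, d_h, d_c_q, d_hR = 512, 8, 64, 96, 32   # hypothetical values

def rope(x, pos):
    """Apply a standard rotary embedding to the last dim of x (even size)."""
    half = x.shape[-1] // 2
    freq = 1.0 / (10000 ** (torch.arange(half) / half))
    angle = pos * freq
    cos, sin = torch.cos(angle), torch.sin(angle)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

W_QR = torch.randn(n_h * d_hR, d_c_q) / d_c_q**0.5   # decoupled-query projection
W_KR = torch.randn(d_hR, d) / d**0.5                 # shared decoupled-key projection

t = 3                                     # token position
c_q = torch.randn(d_c_q)                  # c_t^Q from the query compression
h_t = torch.randn(d)

q_r = rope((W_QR @ c_q).view(n_h, d_hR), t)   # per-head decoupled queries q^R_{t,i}
k_r = rope(W_KR @ h_t, t)                     # single shared decoupled key k^R_t

# final per-head queries/keys are concatenations [q^C; q^R] and [k^C; k^R]
q_c = torch.randn(n_h, d_h)               # stand-in for q^C_{t,i}
k_c = torch.randn(n_h, d_h)               # stand-in for k^C_{t,i}
q_full = torch.cat([q_c, q_r], dim=-1)                    # (n_h, d_h + d_hR)
k_full = torch.cat([k_c, k_r.expand(n_h, d_hR)], dim=-1)  # shared k^R per head
score = (q_full * k_full).sum(-1) / (d_h + d_hR) ** 0.5   # logits of token t vs itself
print(q_full.shape, score.shape)
```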
Notes