Research Question

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.

Architecture

Multi-Head Latent Attention


Preliminaries: Standard Multi-Head Attention
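As a reference point for the preliminaries, below is a minimal sketch of standard multi-head attention (MHA). All dimensions (d_model, n_heads, d_head) are hypothetical choices for illustration; the point is that generation must cache 2 * n_heads * d_head values per token per layer, which is the cost MLA's compression targets.

```python
import numpy as np

# Minimal standard MHA sketch (hypothetical dimensions).
rng = np.random.default_rng(0)
d_model, n_heads, d_head = 512, 8, 64

W_q = rng.standard_normal((d_model, n_heads * d_head)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, n_heads * d_head)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, n_heads * d_head)) / np.sqrt(d_model)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mha(h):
    """h: (seq_len, d_model) -> (seq_len, n_heads * d_head)."""
    T = h.shape[0]
    q = (h @ W_q).reshape(T, n_heads, d_head)
    k = (h @ W_k).reshape(T, n_heads, d_head)  # cached during generation
    v = (h @ W_v).reshape(T, n_heads, d_head)  # cached during generation
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d_head)
    out = np.einsum('hqk,khd->qhd', softmax(scores), v)
    return out.reshape(T, n_heads * d_head)

# KV cache cost: 2 * n_heads * d_head = 1024 values per token per layer here.
print(mha(rng.standard_normal((4, d_model))).shape)  # (4, 512)
```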

Low-Rank Joint Compression
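MLA compresses keys and values jointly: the hidden state is down-projected into a shared low-dimensional latent vector, and keys and values are reconstructed from it via up-projections, so only the latent needs to be cached. Below is a minimal sketch under assumed dimensions; the latent size d_c and all other shapes are hypothetical choices, not the paper's configuration.

```python
import numpy as np

# Minimal sketch of low-rank joint key-value compression (hypothetical dims).
rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_c = 512, 8, 64, 128  # d_c << 2 * n_heads * d_head

W_dkv = rng.standard_normal((d_model, d_c)) / np.sqrt(d_model)       # down-projection
W_uk  = rng.standard_normal((d_c, n_heads * d_head)) / np.sqrt(d_c)  # key up-projection
W_uv  = rng.standard_normal((d_c, n_heads * d_head)) / np.sqrt(d_c)  # value up-projection

h = rng.standard_normal((4, d_model))    # (seq_len, d_model)
c_kv = h @ W_dkv                         # cached: only d_c values per token
k_c = (c_kv @ W_uk).reshape(4, n_heads, d_head)
v_c = (c_kv @ W_uv).reshape(4, n_heads, d_head)

# Cache per token shrinks from 2 * n_heads * d_head (= 1024 here) to d_c (= 128).
print(c_kv.shape, k_c.shape, v_c.shape)
```

Moreover, at inference the up-projections need not be applied explicitly: the key up-projection can be absorbed into the query projection and the value up-projection into the output projection, so the full keys and values are never materialized.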

Decoupled Rotary Position Embedding

We intend to use the Rotary Position Embedding (RoPE) (Su et al., 2024) for DeepSeek-V2. However, RoPE is incompatible with low-rank KV compression: RoPE is position-sensitive for both keys and queries, so applying it to the compressed keys inserts a position-dependent rotation between the key up-projection and the query projection. The up-projection can then no longer be absorbed into the query projection at inference, and the keys for all prefix tokens would have to be recomputed during generation. The decoupled RoPE strategy avoids this by carrying positional information in additional per-head query components and a single shared key component, leaving the compressed keys position-agnostic (see the sketch below).
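A minimal sketch of the decoupled positional path, assuming a hypothetical decoupled dimension d_rope and a rotate-half RoPE variant; only the rotated shared key (alongside the KV latent) needs to be cached.

```python
import numpy as np

# Minimal sketch of decoupled RoPE (hypothetical dimensions).
rng = np.random.default_rng(0)
d_model, d_rope = 512, 32  # d_rope: decoupled RoPE dimension (assumed)

def rope(x, pos):
    """Rotate channel pairs of x by position-dependent angles (rotate-half form)."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

W_kr = rng.standard_normal((d_model, d_rope)) / np.sqrt(d_model)

h = rng.standard_normal(d_model)   # hidden state of the token at position t
t = 7
k_rope = rope(h @ W_kr, t)         # shared across heads; cached next to c_kv

# Attention then concatenates the position-free compressed key with k_rope,
# i.e. k_t = [k_c_t ; k_rope_t], with a matching decoupled part on the queries.
print(k_rope.shape)  # (32,)
```

Because the rotation touches only the decoupled components, the compressed keys remain position-independent and the up-projection absorption from the previous section still applies.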