Absorption of $\mathbf{W}^{UK}$ into $\mathbf{W}^{Q}$

Step 1: Original MLA equations for KV compression
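For reference, a sketch of these equations for a single head, leaving out the decoupled RoPE components. The hidden state $\mathbf{h}_t$ and the down-projection $\mathbf{W}^{DKV}$ are assumed symbols borrowed from the usual MLA notation; they do not appear elsewhere in this section:

$$
\begin{aligned}
\mathbf{c}_t^{KV} &= \mathbf{W}^{DKV}\mathbf{h}_t \\
\mathbf{k}_t &= \mathbf{W}^{UK}\mathbf{c}_t^{KV} \\
\mathbf{v}_t &= \mathbf{W}^{UV}\mathbf{c}_t^{KV}
\end{aligned}
$$

Only the latent $\mathbf{c}_t^{KV}$ needs to be cached; keys and values are recoverable from it through the up-projections.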

Step 2: Attention score computation with absorbed key weights

The attention score between the query at position $t$ and the key at position $m$:
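Written out for a single head, assuming the query is $\mathbf{q}_t = \mathbf{W}^Q\mathbf{h}_t$ and the key is $\mathbf{k}_m = \mathbf{W}^{UK}\mathbf{c}_m^{KV}$ (here $\mathbf{h}_t$ is an assumed symbol for the hidden state, and the scaling factor and RoPE terms are omitted), the score would be:

$$
\mathbf{q}_t^T\mathbf{k}_m = (\mathbf{W}^Q\mathbf{h}_t)^T\,\mathbf{W}^{UK}\mathbf{c}_m^{KV} = \mathbf{h}_t^T(\mathbf{W}^Q)^T\mathbf{W}^{UK}\mathbf{c}_m^{KV}
$$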

Defining $\mathbf{W}^{Q'} = (\mathbf{W}^Q)^T\mathbf{W}^{UK}$, we get:
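Assuming a single head with query $\mathbf{q}_t = \mathbf{W}^Q\mathbf{h}_t$ and key $\mathbf{k}_m = \mathbf{W}^{UK}\mathbf{c}_m^{KV}$, and ignoring RoPE and the scaling factor, the score presumably reduces to

$$
\mathbf{q}_t^T\mathbf{k}_m = \mathbf{h}_t^T\,\mathbf{W}^{Q'}\mathbf{c}_m^{KV}
$$

so the key never has to be materialized. The NumPy sketch below checks this identity numerically. The dimension names and sizes are illustrative assumptions, not values from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, d_latent = 8, 4, 3  # illustrative sizes (assumed)

W_Q = rng.standard_normal((d_head, d_model))    # query projection W^Q
W_UK = rng.standard_normal((d_head, d_latent))  # key up-projection W^{UK}
h_t = rng.standard_normal(d_model)              # hidden state at position t (assumed symbol)
c_m = rng.standard_normal(d_latent)             # cached latent c_m^{KV}

# Unabsorbed path: materialize the query and the key, then take their dot product.
score_naive = (W_Q @ h_t) @ (W_UK @ c_m)

# Absorbed path: fold W^{UK} into the query side once, then score against the latent.
W_Q_prime = W_Q.T @ W_UK                        # shape (d_model, d_latent)
score_absorbed = h_t @ W_Q_prime @ c_m

print(np.isclose(score_naive, score_absorbed))
```

The product `W_Q_prime` can be computed once ahead of time, so at decode time only the cached latent is read.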

Absorption of $\mathbf{W}^{UV}$ into $\mathbf{W}^{O}$

Step 1: Output computation with attention weights
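A sketch of this step for a single head, assuming $\alpha_{t,m}$ are the softmax-normalized attention weights over positions $m \le t$, the value is $\mathbf{v}_m = \mathbf{W}^{UV}\mathbf{c}_m^{KV}$, and $\mathbf{u}_t$ (an assumed symbol) denotes the layer output:

$$
\begin{aligned}
\mathbf{o}_t &= \sum_{m=1}^{t}\alpha_{t,m}\mathbf{v}_m = \sum_{m=1}^{t}\alpha_{t,m}\mathbf{W}^{UV}\mathbf{c}_m^{KV} \\
\mathbf{u}_t &= \mathbf{W}^O\mathbf{o}_t
\end{aligned}
$$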

Step 2: Final output with absorbed value weights

Defining $\mathbf{o}_t' = \sum_{m=1}^{t} \alpha_{t,m} \mathbf{c}_m^{KV}$:
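Because $\mathbf{W}^{UV}$ does not depend on $m$, it factors out of the sum. Writing $\mathbf{u}_t$ (an assumed symbol) for the output of a single head after the output projection, this would give:

$$
\mathbf{u}_t = \mathbf{W}^O\sum_{m=1}^{t}\alpha_{t,m}\mathbf{W}^{UV}\mathbf{c}_m^{KV} = \mathbf{W}^O\mathbf{W}^{UV}\mathbf{o}_t'
$$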

Defining $\mathbf{W}^{O'} = \mathbf{W}^O\mathbf{W}^{UV}$:
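With $\mathbf{u}_t$ as an assumed symbol for the single-head layer output, the result would then be

$$
\mathbf{u}_t = \mathbf{W}^{O'}\mathbf{o}_t'
$$

so the attention-weighted sum runs directly over the cached latents. The NumPy sketch below checks that the absorbed and unabsorbed paths agree. Sizes, variable names, and the random attention logits are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, d_latent, t = 8, 4, 3, 5  # illustrative sizes (assumed)

W_O = rng.standard_normal((d_model, d_head))    # output projection W^O
W_UV = rng.standard_normal((d_head, d_latent))  # value up-projection W^{UV}
C = rng.standard_normal((t, d_latent))          # rows are the cached latents c_1 .. c_t

# Softmax over arbitrary logits stands in for the attention weights alpha_{t,m}.
logits = rng.standard_normal(t)
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()

# Unabsorbed path: up-project every latent to a value, mix, then project out.
V = C @ W_UV.T                                  # shape (t, d_head)
u_naive = W_O @ (alpha @ V)

# Absorbed path: mix the latents first, then apply the merged matrix once.
W_O_prime = W_O @ W_UV                          # shape (d_model, d_latent)
o_prime = alpha @ C                             # o_t' in the text
u_absorbed = W_O_prime @ o_prime

print(np.allclose(u_naive, u_absorbed))
```

The absorbed path avoids forming the `(t, d_head)` value matrix, which is the point of merging the two projections.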