Detailed Illustration of the Problem

Step 1: Original MLA equations with RoPE

With RoPE applied after projection, the correct processing sequence is:

Compressed query latent vector: $\mathbf{c}_t^Q = \mathbf{W}^{DQ}\mathbf{h}_t$
Query projection (before RoPE): $\mathbf{q}_t^{pre} = \mathbf{W}^{UQ}\mathbf{c}_t^Q$
Query with RoPE: $\mathbf{q}_t = f_q(\mathbf{q}_t^{pre}, t) = \mathbf{R}_{\Theta,t}^d \mathbf{q}_t^{pre} = \mathbf{R}_{\Theta,t}^d \mathbf{W}^{UQ}\mathbf{c}_t^Q$

Similarly for keys:

Compressed KV latent vector: $\mathbf{c}_m^{KV} = \mathbf{W}^{DKV}\mathbf{h}_m$
Key projection (before RoPE): $\mathbf{k}_m^{pre} = \mathbf{W}^{UK}\mathbf{c}_m^{KV}$
Key with RoPE: $\mathbf{k}_m = f_k(\mathbf{k}_m^{pre}, m) = \mathbf{R}_{\Theta,m}^d \mathbf{k}_m^{pre} = \mathbf{R}_{\Theta,m}^d \mathbf{W}^{UK}\mathbf{c}_m^{KV}$
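The three stages above (compress, up-project, rotate) can be sketched in NumPy. This is a minimal single-head illustration, not a reference implementation: the helper `rope_matrix`, the toy dimensions, the random weights, and the frequency base of 10000 are all assumptions made for the example.

```python
import numpy as np

def rope_matrix(pos, d, base=10000.0):
    """Hypothetical helper: block-diagonal rotation matrix R_{Theta,pos}^d.

    Assumes the common convention of rotating consecutive pairs of
    dimensions by the angle pos * theta_i, with theta_i = base**(-2i/d).
    """
    assert d % 2 == 0, "RoPE needs an even head dimension"
    R = np.zeros((d, d))
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = np.cos(pos * theta), np.sin(pos * theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

rng = np.random.default_rng(0)
d_model, d_cq, d_ckv, d_h = 16, 6, 4, 8  # toy sizes, chosen arbitrarily

W_DQ = rng.standard_normal((d_cq, d_model))    # query down-projection
W_UQ = rng.standard_normal((d_h, d_cq))        # query up-projection
W_DKV = rng.standard_normal((d_ckv, d_model))  # KV down-projection
W_UK = rng.standard_normal((d_h, d_ckv))       # key up-projection

def query(h_t, t):
    c_q = W_DQ @ h_t                     # compressed query latent c_t^Q
    q_pre = W_UQ @ c_q                   # query before RoPE
    return rope_matrix(t, d_h) @ q_pre   # query with RoPE

def key(h_m, m):
    c_kv = W_DKV @ h_m                   # compressed KV latent c_m^KV
    k_pre = W_UK @ c_kv                  # key before RoPE
    return rope_matrix(m, d_h) @ k_pre   # key with RoPE

h_t = rng.standard_normal(d_model)
h_m = rng.standard_normal(d_model)
t, m = 5, 2
score = query(h_t, t) @ key(h_m, m) / np.sqrt(d_h)
```

Because each rotation is orthogonal, applying RoPE changes the direction of the projected vector but not its norm.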

Step 2: Attention score computation with RoPE

The attention score between the query at position $t$ and the key at position $m$ becomes:

$\text{score}_{t,m} = \frac{\mathbf{q}_t^T\mathbf{k}_m}{\sqrt{d_h}} = \frac{(\mathbf{R}_{\Theta,t}^d \mathbf{W}^{UQ}\mathbf{c}_t^Q)^T(\mathbf{R}_{\Theta,m}^d \mathbf{W}^{UK}\mathbf{c}_m^{KV})}{\sqrt{d_h}}$

$\text{score}_{t,m} = \frac{(\mathbf{c}_t^Q)^T(\mathbf{W}^{UQ})^T(\mathbf{R}_{\Theta,t}^d)^T\mathbf{R}_{\Theta,m}^d \mathbf{W}^{UK}\mathbf{c}_m^{KV}}{\sqrt{d_h}}$

Step 3: The key property of RoPE

Recall that RoPE has this key property: $(\mathbf{R}_{\Theta,t}^d)^T\mathbf{R}_{\Theta,m}^d = \mathbf{R}_{\Theta,m-t}^d$

So our attention score becomes: $\text{score}_{t,m} = \frac{(\mathbf{c}_t^Q)^T(\mathbf{W}^{UQ})^T \mathbf{R}_{\Theta,m-t}^d \mathbf{W}^{UK}\mathbf{c}_m^{KV}}{\sqrt{d_h}}$
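The relative-position property can be checked numerically. The sketch below assumes the usual pairwise-rotation form of RoPE; `rope_matrix`, the dimension, and the positions are illustrative choices rather than anything fixed by the derivation.

```python
import numpy as np

def rope_matrix(pos, d, base=10000.0):
    """Hypothetical helper: block-diagonal rotation matrix R_{Theta,pos}^d."""
    assert d % 2 == 0, "RoPE needs an even head dimension"
    R = np.zeros((d, d))
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = np.cos(pos * theta), np.sin(pos * theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

d = 8
t, m = 7, 3

# (R_t)^T R_m should equal R_{m-t}
lhs = rope_matrix(t, d).T @ rope_matrix(m, d)
rhs = rope_matrix(m - t, d)
property_holds = np.allclose(lhs, rhs)

# Shifting both positions by the same offset leaves the product unchanged,
# so the score depends only on the relative position m - t.
shifted = rope_matrix(t + 10, d).T @ rope_matrix(m + 10, d)
shift_invariant = np.allclose(lhs, shifted)
```

The identity follows from each 2x2 block being a plane rotation: transposing inverts the angle, and multiplying rotations adds angles.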

Step 4: The problem with absorption

The issue now becomes clear. In our previous derivation without RoPE, we could absorb $\mathbf{W}^{UK}$ into a modified query weight as: $\mathbf{W}^{Q'} = (\mathbf{W}^{UQ})^T\mathbf{W}^{UK}$

However, with RoPE, we have: $\text{score}_{t,m} = \frac{(\mathbf{c}_t^Q)^T(\mathbf{W}^{UQ})^T \mathbf{R}_{\Theta,m-t}^d \mathbf{W}^{UK}\mathbf{c}_m^{KV}}{\sqrt{d_h}}$
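The contrast between the two cases can be illustrated with a small numerical experiment. This is a sketch under assumed toy dimensions and random weights, with a hypothetical `rope_matrix` helper; it omits the $\sqrt{d_h}$ scaling, which does not affect the comparison.

```python
import numpy as np

def rope_matrix(pos, d, base=10000.0):
    """Hypothetical helper: block-diagonal rotation matrix R_{Theta,pos}^d."""
    assert d % 2 == 0, "RoPE needs an even head dimension"
    R = np.zeros((d, d))
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = np.cos(pos * theta), np.sin(pos * theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

rng = np.random.default_rng(0)
d_cq, d_ckv, d_h = 6, 4, 8  # toy sizes, chosen arbitrarily
W_UQ = rng.standard_normal((d_h, d_cq))
W_UK = rng.standard_normal((d_h, d_ckv))
c_q = rng.standard_normal(d_cq)
c_kv = rng.standard_normal(d_ckv)

# Without RoPE: one fixed matrix (W^UQ)^T W^UK can be computed once
# and reused for every (t, m) pair.
W_absorbed = W_UQ.T @ W_UK
explicit = (W_UQ @ c_q) @ (W_UK @ c_kv)
no_rope_match = np.allclose(explicit, c_q @ W_absorbed @ c_kv)

# With RoPE: the matrix between the two latents depends on m - t.
def middle(t, m):
    return W_UQ.T @ rope_matrix(m - t, d_h) @ W_UK

position_dependent = not np.allclose(middle(5, 2), middle(5, 3))
reduces_at_zero_offset = np.allclose(middle(4, 4), W_absorbed)
```

In this sketch the combined matrix matches the absorbed weight only when $m = t$; for any other offset it is a different matrix.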

The rotation matrix $\mathbf{R}_{\Theta,m-t}^d$ sits between $(\mathbf{W}^{UQ})^T$ and $\mathbf{W}^{UK}$, making it impossible to pre-compute their product because: