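With RoPE applied after projection, the query at position $t$ is obtained by projecting first and then rotating:
$\mathbf{q}_t = \mathbf{R}_{\Theta,t}^d \mathbf{W}^{UQ}\mathbf{c}_t^Q$
Similarly for the key at position $m$:
$\mathbf{k}_m = \mathbf{R}_{\Theta,m}^d \mathbf{W}^{UK}\mathbf{c}_m^{KV}$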
The attention score between query at position t and key at position m becomes:
$\text{score}_{t,m} = \frac{\mathbf{q}_t^T\mathbf{k}_m}{\sqrt{d_h}} = \frac{(\mathbf{R}_{\Theta,t}^d \mathbf{W}^{UQ}\mathbf{c}_t^Q)^T(\mathbf{R}_{\Theta,m}^d \mathbf{W}^{UK}\mathbf{c}_m^{KV})}{\sqrt{d_h}}$
$\text{score}_{t,m} = \frac{(\mathbf{c}_t^Q)^T(\mathbf{W}^{UQ})^T(\mathbf{R}_{\Theta,t}^d)^T\mathbf{R}_{\Theta,m}^d \mathbf{W}^{UK}\mathbf{c}_m^{KV}}{\sqrt{d_h}}$
Recall that RoPE has this key property: $(\mathbf{R}_{\Theta,t}^d)^T\mathbf{R}_{\Theta,m}^d = \mathbf{R}_{\Theta,m-t}^d$
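The relative-position property can be checked numerically. The following is a minimal sketch, assuming a single 2×2 rotation block (one RoPE frequency pair) with a hypothetical frequency `theta=0.5`; the helper names `rot`, `matmul`, `transpose`, and `close` are illustrative and not taken from any library.

```python
import math

def rot(pos, theta=0.5):
    """2x2 rotation by angle pos * theta (one RoPE frequency pair; theta is an assumed value)."""
    a = pos * theta
    return [[math.cos(a), -math.sin(a)],
            [math.sin(a),  math.cos(a)]]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def close(a, b, tol=1e-12):
    """Element-wise comparison within a small tolerance."""
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

t, m = 3, 7
lhs = matmul(transpose(rot(t)), rot(m))  # (R_t)^T R_m
rhs = rot(m - t)                         # R_{m-t}
print(close(lhs, rhs))
```

The full $d$-dimensional RoPE matrix is block-diagonal with one such 2×2 block per frequency, so the same identity holds block by block.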
So our attention score becomes: $\text{score}_{t,m} = \frac{(\mathbf{c}_t^Q)^T(\mathbf{W}^{UQ})^T \mathbf{R}_{\Theta,m-t}^d \mathbf{W}^{UK}\mathbf{c}_m^{KV}}{\sqrt{d_h}}$
The issue now becomes clear. In our previous derivation without RoPE, we could absorb $\mathbf{W}^{UK}$ into a modified query weight as: $\mathbf{W}^{Q'} = (\mathbf{W}^{UQ})^T\mathbf{W}^{UK}$
However, with RoPE, we have: $\text{score}_{t,m} = \frac{(\mathbf{c}_t^Q)^T(\mathbf{W}^{UQ})^T \mathbf{R}_{\Theta,m-t}^d \mathbf{W}^{UK}\mathbf{c}_m^{KV}}{\sqrt{d_h}}$
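To make the contrast concrete, the sketch below compares the position-independent product $(\mathbf{W}^{UQ})^T\mathbf{W}^{UK}$ with the RoPE version $(\mathbf{W}^{UQ})^T\mathbf{R}_{\Theta,m-t}^d\mathbf{W}^{UK}$ at different offsets. It assumes a toy 2×2 setting; the weight values, the frequency `theta=0.5`, and all helper names are hypothetical and chosen only for illustration.

```python
import math

def rot(pos, theta=0.5):
    """2x2 rotation by angle pos * theta (theta is an assumed value)."""
    a = pos * theta
    return [[math.cos(a), -math.sin(a)],
            [math.sin(a),  math.cos(a)]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hypothetical toy weights standing in for W^{UQ} and W^{UK}.
W_UQ = [[1.0, 2.0], [0.0, 1.0]]
W_UK = [[0.5, -1.0], [1.5, 0.0]]

def merged(offset):
    """(W^{UQ})^T R_{offset} W^{UK}: the matrix one would need to pre-compute."""
    return matmul(matmul(transpose(W_UQ), rot(offset)), W_UK)

no_rope = matmul(transpose(W_UQ), W_UK)  # W^{Q'} from the derivation without RoPE

print(max_abs_diff(merged(0), no_rope))    # offset 0: rotation is the identity
print(max_abs_diff(merged(1), merged(4)))  # different offsets give different matrices
```

In this toy setting the merged matrix matches $\mathbf{W}^{Q'}$ only at offset 0 and changes with every other value of $m-t$, so no single pre-computed matrix covers all query-key pairs.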
The rotation matrix $\mathbf{R}_{\Theta,m-t}^d$ sits between $(\mathbf{W}^{UQ})^T$ and $\mathbf{W}^{UK}$, making it impossible to pre-compute their product because: