Research Question
- One major challenge with the Transformer is the speed of incremental inference, which is limited by the memory bandwidth needed to repeatedly reload the large "keys" and "values" tensors at every decoding step.
- We propose a variant called multi-query attention, in which the keys and values are shared across all of the different attention "heads", shrinking these tensors (a rough cache-size sketch follows below).
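As a back-of-the-envelope illustration of why the keys/values dominate memory traffic, the sketch below compares the per-layer key/value cache that incremental decoding must reload for multi-head versus multi-query attention. The batch size, sequence length, and float32 dtype are illustrative assumptions; h and d_k match the baseline configuration in the Setup section.

```python
# Rough per-layer size of the key/value cache reloaded at every decoding step.
# Batch size, sequence length, and dtype are assumptions, not from the paper.
def kv_cache_bytes(batch, seq_len, num_kv_heads, d_head, bytes_per_elem=4):
    # Keys and values each have shape [batch, num_kv_heads, seq_len, d_head].
    return 2 * batch * num_kv_heads * seq_len * d_head * bytes_per_elem

b, m, h, d_k = 128, 1024, 8, 128            # assumed batch and length; h, d_k as in the baseline
multi_head  = kv_cache_bytes(b, m, h, d_k)  # separate keys/values per head
multi_query = kv_cache_bytes(b, m, 1, d_k)  # one set of keys/values shared by all heads

print(f"multi-head  KV cache per layer: {multi_head / 2**20:.0f} MiB")   # 1024 MiB
print(f"multi-query KV cache per layer: {multi_query / 2**20:.0f} MiB")  # 128 MiB
```

Multi-query attention shrinks the cache, and hence the memory traffic to reload it, by roughly a factor of h per layer.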
Approach
- Multi-head attention consists of multiple attention layers (heads) in parallel with different linear transformations on the queries, keys, values and outputs.
- Multi-query attention is identical, except that the different heads share a single set of keys and values (a minimal sketch of both variants follows).
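The following is a minimal NumPy sketch of the two variants, following the einsum formulation of the paper; the names and shapes are mine, so treat it as an illustration rather than the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, M, P_q, P_k, P_v, P_o):
    """X: query sequence [b, n, d]; M: memory sequence [b, m, d].
    P_q, P_k, P_v: per-head projections [h, d, k]; P_o: output projection [h, k, d]."""
    Q = np.einsum("bnd,hdk->bhnk", X, P_q)   # h sets of queries
    K = np.einsum("bmd,hdk->bhmk", M, P_k)   # h sets of keys
    V = np.einsum("bmd,hdk->bhmk", M, P_v)   # h sets of values
    logits = np.einsum("bhnk,bhmk->bhnm", Q, K) / np.sqrt(Q.shape[-1])
    O = np.einsum("bhnm,bhmk->bhnk", softmax(logits), V)
    return np.einsum("bhnk,hkd->bnd", O, P_o)

def multi_query_attention(X, M, P_q, P_k, P_v, P_o):
    """Identical, except P_k and P_v are [d, k]: one shared set of keys/values."""
    Q = np.einsum("bnd,hdk->bhnk", X, P_q)   # queries still have h heads
    K = np.einsum("bmd,dk->bmk", M, P_k)     # single set of keys
    V = np.einsum("bmd,dk->bmk", M, P_v)     # single set of values
    logits = np.einsum("bhnk,bmk->bhnm", Q, K) / np.sqrt(Q.shape[-1])
    O = np.einsum("bhnm,bmk->bhnk", softmax(logits), V)
    return np.einsum("bhnk,hkd->bnd", O, P_o)
```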
Experiment
Setup
- Dataset: WMT 2014 English-German translation task
- Baseline:
- An encoder-decoder Transformer model with 6 layers, d_model = 1024, d_ff = 4096, h = 8, d_k = d_v = 128, learned positional embeddings, and weight-sharing between the token-embedding and output layers.
- Implementation:
- In our "multi-query" model, we replace all of the attention layers in the model with multi-query attention. This includes the encoder self-attention, decoder self-attention, and encoder-decoder attention layers (a decoding-step sketch follows below).
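To make the inference-speed implication concrete, here is a sketch of one incremental decoder self-attention step with multi-query attention and a key/value cache, reusing the (assumed) projection names from the earlier sketch. The point is that the cached keys and values have shape [b, m, k] rather than [b, h, m, k], so roughly h times less data is reloaded per decoded token.

```python
import numpy as np

def multi_query_decode_step(x, cache_K, cache_V, P_q, P_k, P_v, P_o):
    """One incremental decoder self-attention step with multi-query attention.
    x: activations of the newest position [b, d].
    cache_K, cache_V: keys/values of previous positions [b, m, k], shared by all heads
    (with multi-head attention these caches would instead be [b, h, m, k])."""
    q = np.einsum("bd,hdk->bhk", x, P_q)                       # per-head query for the new token
    cache_K = np.concatenate([cache_K, np.einsum("bd,dk->bk", x, P_k)[:, None]], axis=1)
    cache_V = np.concatenate([cache_V, np.einsum("bd,dk->bk", x, P_v)[:, None]], axis=1)
    logits = np.einsum("bhk,bmk->bhm", q, cache_K) / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                           # softmax over memory positions
    o = np.einsum("bhm,bmk->bhk", w, cache_V)
    y = np.einsum("bhk,hkd->bd", o, P_o)
    return y, cache_K, cache_V                                 # caches grow by one position per step
```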
Model Quality

- The multi-query model performed similarly to the baseline, and in fact achieved the highest BLEU score (28.5) with beam-4 decoding.
Speed
