Research Question
- One major challenge with the Transformer is the speed of incremental inference, which is limited by the memory bandwidth needed to repeatedly reload the large "keys" and "values" tensors at every decoding step.
- We propose a variant called multi-query attention, in which the keys and values are shared across all of the different attention "heads", shrinking these tensors (a rough cache-size sketch follows below).
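As a back-of-the-envelope illustration of why the keys/values dominate memory traffic, the sketch below compares the per-layer key/value cache that incremental decoding must reload for multi-head versus multi-query attention. The batch size, sequence length, and float32 dtype are illustrative assumptions; h and d_k match the baseline configuration in the Setup section.

```python
# Rough per-layer size of the key/value cache reloaded at every decoding step.
# Batch size, sequence length, and dtype are assumptions, not from the paper.
def kv_cache_bytes(batch, seq_len, num_kv_heads, d_head, bytes_per_elem=4):
    # Keys and values each have shape [batch, num_kv_heads, seq_len, d_head].
    return 2 * batch * num_kv_heads * seq_len * d_head * bytes_per_elem

b, m, h, d_k = 128, 1024, 8, 128            # assumed batch and length; h, d_k as in the baseline
multi_head  = kv_cache_bytes(b, m, h, d_k)  # separate keys/values per head
multi_query = kv_cache_bytes(b, m, 1, d_k)  # one set of keys/values shared by all heads

print(f"multi-head  KV cache per layer: {multi_head / 2**20:.0f} MiB")   # 1024 MiB
print(f"multi-query KV cache per layer: {multi_query / 2**20:.0f} MiB")  # 128 MiB
```

Multi-query attention shrinks the cache, and hence the memory traffic to reload it, by roughly a factor of h per layer.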
Approach
- Multi-head attention consists of multiple attention layers (heads) in parallel with different linear transformations on the queries, keys, values and outputs.
- Multi-query attention is identical, except that the different heads share a single set of keys and values (a minimal sketch of both variants follows).
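The following is a minimal NumPy sketch of the two variants, following the einsum formulation of the paper; the names and shapes are mine, so treat it as an illustration rather than the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, M, P_q, P_k, P_v, P_o):
    """X: query sequence [b, n, d]; M: memory sequence [b, m, d].
    P_q, P_k, P_v: per-head projections [h, d, k]; P_o: output projection [h, k, d]."""
    Q = np.einsum("bnd,hdk->bhnk", X, P_q)   # h sets of queries
    K = np.einsum("bmd,hdk->bhmk", M, P_k)   # h sets of keys
    V = np.einsum("bmd,hdk->bhmk", M, P_v)   # h sets of values
    logits = np.einsum("bhnk,bhmk->bhnm", Q, K) / np.sqrt(Q.shape[-1])
    O = np.einsum("bhnm,bhmk->bhnk", softmax(logits), V)
    return np.einsum("bhnk,hkd->bnd", O, P_o)

def multi_query_attention(X, M, P_q, P_k, P_v, P_o):
    """Identical, except P_k and P_v are [d, k]: one shared set of keys/values."""
    Q = np.einsum("bnd,hdk->bhnk", X, P_q)   # queries still have h heads
    K = np.einsum("bmd,dk->bmk", M, P_k)     # single set of keys
    V = np.einsum("bmd,dk->bmk", M, P_v)     # single set of values
    logits = np.einsum("bhnk,bmk->bhnm", Q, K) / np.sqrt(Q.shape[-1])
    O = np.einsum("bhnm,bmk->bhnk", softmax(logits), V)
    return np.einsum("bhnk,hkd->bnd", O, P_o)
```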
Experiment
Setup
- Dataset: WMT 2014 English-German translation task
- Baseline:
- An encoder-decoder Transformer model with 6 layers, d_model = 1024, d_ff = 4096, h = 8, d_k = d_v = 128, learned positional embeddings, and weight-sharing between the token-embedding and output layers.
- Implementation:
- In our "multi-query" model, we replace all of the attention layers in the model with multi-query attention. This includes the encoder self-attention, decoder self-attention, and encoder-decoder attention layers (a decoding-step sketch follows below).
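To make the inference-speed implication concrete, here is a sketch of one incremental decoder self-attention step with multi-query attention and a key/value cache, reusing the (assumed) projection names from the earlier sketch. The point is that the cached keys and values have shape [b, m, k] rather than [b, h, m, k], so roughly h times less data is reloaded per decoded token.

```python
import numpy as np

def multi_query_decode_step(x, cache_K, cache_V, P_q, P_k, P_v, P_o):
    """One incremental decoder self-attention step with multi-query attention.
    x: activations of the newest position [b, d].
    cache_K, cache_V: keys/values of previous positions [b, m, k], shared by all heads
    (with multi-head attention these caches would instead be [b, h, m, k])."""
    q = np.einsum("bd,hdk->bhk", x, P_q)                       # per-head query for the new token
    cache_K = np.concatenate([cache_K, np.einsum("bd,dk->bk", x, P_k)[:, None]], axis=1)
    cache_V = np.concatenate([cache_V, np.einsum("bd,dk->bk", x, P_v)[:, None]], axis=1)
    logits = np.einsum("bhk,bmk->bhm", q, cache_K) / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                           # softmax over memory positions
    o = np.einsum("bhm,bmk->bhk", w, cache_V)
    y = np.einsum("bhk,hkd->bd", o, P_o)
    return y, cache_K, cache_V                                 # caches grow by one position per step
```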
Model Quality

- The multi-query model performed similarly to the baseline, and in fact achieved the highest BLEU score (28.5) with beam-4 decoding.
Speed
