How it works
The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$.
We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$; the keys and values are likewise packed into matrices $K$ and $V$. We compute the matrix of outputs as:
$\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V$
Advantage of dot-product attention over additive attention (also widely used at the time): it is much faster and more space-efficient in practice, since it can be implemented with highly optimized matrix multiplication code.
Why use the scaling factor $\frac{1}{\sqrt{d_k}}$: for large $d_k$ the dot products grow large in magnitude (if the components of $q$ and $k$ are independent with mean 0 and variance 1, then $q \cdot k$ has mean 0 and variance $d_k$), pushing the softmax into regions with extremely small gradients; scaling by $\frac{1}{\sqrt{d_k}}$ counteracts this.
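To make the computation concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function and variable names are my own, not from the paper, and the random inputs are only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of the values

# Tiny usage example with random inputs.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 64))  # 3 queries of dimension d_k = 64
K = rng.standard_normal((5, 64))  # 5 keys
V = rng.standard_normal((5, 32))  # 5 values of dimension d_v = 32
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 32)
```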
Instead of performing a single attention function with $d_{model}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k, d_k$ and $d_v$ dimensions, respectively.
Mathematically:
$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$
$\text{where head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
Where the projections are parameter matrices: $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$
In this work we employ $h = 8$ parallel attention layers, or heads. For each of these we use $d_k = d_v = d_{model}/h = 64$.
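Continuing the sketch above (it reuses `scaled_dot_product_attention`), a minimal NumPy version of multi-head attention with $h = 8$ and $d_{model} = 512$ might look as follows; the weight matrices here are random stand-ins for the learned projections $W_i^Q, W_i^K, W_i^V, W^O$.

```python
import numpy as np

d_model, h = 512, 8
d_k = d_v = d_model // h  # 64

rng = np.random.default_rng(0)
W_Q = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)  # per-head query projections
W_K = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)  # per-head key projections
W_V = rng.standard_normal((h, d_model, d_v)) / np.sqrt(d_model)  # per-head value projections
W_O = rng.standard_normal((h * d_v, d_model)) / np.sqrt(h * d_v) # output projection

def multi_head_attention(Q, K, V):
    """Q, K, V: (seq_len, d_model) -> (seq_len, d_model)."""
    heads = [
        scaled_dot_product_attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i])
        for i in range(h)
    ]  # each head has shape (seq_len, d_v)
    return np.concatenate(heads, axis=-1) @ W_O  # concat, then project back to d_model

x = rng.standard_normal((10, d_model))      # a sequence of 10 positions
print(multi_head_attention(x, x, x).shape)  # (10, 512) -- self-attention
```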
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically; it consists of two linear transformations with a ReLU activation in between:
$\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$
The dimensionality of input and output is $d_{model} = 512$, and the inner-layer has dimensionality $d_{ff} = 2048$.
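A correspondingly minimal NumPy sketch of this position-wise feed-forward network, again with random stand-ins for the learned weights:

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    """x: (seq_len, d_model) -> (seq_len, d_model), applied to each position identically."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU between two linear layers

x = rng.standard_normal((10, d_model))
print(ffn(x).shape)  # (10, 512)
```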