How it works
The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$.
We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$; the keys and values are likewise packed into matrices $K$ and $V$. We compute the matrix of outputs as:
$\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V$
Advantage of dot-product attention over additive attention (also widely used at the time): it is much faster and more space-efficient in practice, since it can be implemented with highly optimized matrix multiplication code.
Why use the scaling factor $\frac{1}{\sqrt{d_k}}$: for large $d_k$ the dot products grow large in magnitude (if the components of $q$ and $k$ are independent with mean 0 and variance 1, then $q \cdot k$ has mean 0 and variance $d_k$), pushing the softmax into regions with extremely small gradients; scaling by $\frac{1}{\sqrt{d_k}}$ counteracts this.
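To make the computation concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function and variable names are my own, not from the paper, and the random inputs are only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of the values

# Tiny usage example with random inputs.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 64))  # 3 queries of dimension d_k = 64
K = rng.standard_normal((5, 64))  # 5 keys
V = rng.standard_normal((5, 32))  # 5 values of dimension d_v = 32
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 32)
```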
Instead of performing a single attention function with $d_{model}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k, d_k$ and $d_v$ dimensions, respectively.
Mathematically:
$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$
$\text{where head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
Where the projections are parameter matrices: $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$
In this work we employ $h = 8$ parallel attention layers, or heads. For each of these we use $d_k = d_v = d_{model}/h = 64$.
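Continuing the sketch above (it reuses `scaled_dot_product_attention`), a minimal NumPy version of multi-head attention with $h = 8$ and $d_{model} = 512$ might look as follows; the weight matrices here are random stand-ins for the learned projections $W_i^Q, W_i^K, W_i^V, W^O$.

```python
import numpy as np

d_model, h = 512, 8
d_k = d_v = d_model // h  # 64

rng = np.random.default_rng(0)
W_Q = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)  # per-head query projections
W_K = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)  # per-head key projections
W_V = rng.standard_normal((h, d_model, d_v)) / np.sqrt(d_model)  # per-head value projections
W_O = rng.standard_normal((h * d_v, d_model)) / np.sqrt(h * d_v) # output projection

def multi_head_attention(Q, K, V):
    """Q, K, V: (seq_len, d_model) -> (seq_len, d_model)."""
    heads = [
        scaled_dot_product_attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i])
        for i in range(h)
    ]  # each head has shape (seq_len, d_v)
    return np.concatenate(heads, axis=-1) @ W_O  # concat, then project back to d_model

x = rng.standard_normal((10, d_model))      # a sequence of 10 positions
print(multi_head_attention(x, x, x).shape)  # (10, 512) -- self-attention
```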
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically; it consists of two linear transformations with a ReLU activation in between:
$\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$
The dimensionality of input and output is $d_{model} = 512$, and the inner-layer has dimensionality $d_{ff} = 2048$.
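A correspondingly minimal NumPy sketch of this position-wise feed-forward network, again with random stand-ins for the learned weights:

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    """x: (seq_len, d_model) -> (seq_len, d_model), applied to each position identically."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU between two linear layers

x = rng.standard_normal((10, d_model))
print(ffn(x).shape)  # (10, 512)
```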