Gated Linear Units (GLU) and Variants

Rectified Linear Units (ReLU)

GeLU and Swish

GLU and its Variants

Important Note

Experiments

Model Architecture

| Component | Specification |
|---|---|
| Base model | Raffel et al., 2019 |
| Encoder layers | 12 |
| Decoder layers | 12 |
| Model dimension ($d_{\text{model}}$) | 768 |
| Attention heads ($h$) | 12 |
| Key/value dimension ($d_k = d_v$) | 64 |
| FFN hidden size ($d_{\text{ff}}$), standard FFN | 3072 |
| FFN hidden size ($d_{\text{ff}}$), GLU variants | 2048 ($\frac{8}{3}d_{\text{model}}$, to keep parameter/operation count constant) |
| FFN weight matrices, GLU variants | 3 (vs. 2 in the standard FFN) |
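The parameter bookkeeping above can be checked directly: a two-matrix FFN with $d_{\text{ff}} = 3072$ and a three-matrix GLU-variant FFN with $d_{\text{ff}} = 2048 = \frac{8}{3}d_{\text{model}}$ have the same weight count. A minimal NumPy sketch (variable names and random initialization are illustrative, not from the paper) of a SwiGLU-style feed-forward layer:

```python
import numpy as np

d_model, d_ff_std, d_ff_glu = 768, 3072, 2048

# Standard FFN uses two weight matrices; the GLU variant uses three,
# so its hidden size is scaled by 2/3 to hold parameter count constant.
params_ffn = 2 * d_model * d_ff_std   # W1: d_model x d_ff, W2: d_ff x d_model
params_glu = 3 * d_model * d_ff_glu   # W, V: d_model x d_ff, W2: d_ff x d_model
assert params_ffn == params_glu       # 4,718,592 weights in both cases

def swish(x):
    # Swish(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def ffn_swiglu(x, W, V, W2):
    # FFN_SwiGLU(x) = (Swish(xW) * xV) W2  -- elementwise gating
    return (swish(x @ W) * (x @ V)) @ W2

rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))
W = rng.standard_normal((d_model, d_ff_glu)) * 0.02
V = rng.standard_normal((d_model, d_ff_glu)) * 0.02
W2 = rng.standard_normal((d_ff_glu, d_model)) * 0.02

y = ffn_swiglu(x, W, V, W2)
print(y.shape)  # same shape as the input: (4, 768)
```

Swapping `swish` for GeLU, sigmoid, ReLU, or the identity yields the GEGLU, GLU, ReGLU, and Bilinear variants, respectively; only the gating nonlinearity changes.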

Pre-Training and Perplexity Results

Settings