Rectified-linear units (ReLU)
$\text{FFN}(x, W_1, W_2, b_1, b_2) = \max(0, xW_1 + b_1)W_2 + b_2$
Rectified-linear units (ReLU) without bias (used by T5)
$\text{FFN}_{\text{ReLU}}(x, W_1, W_2) = \max(xW_1, 0)W_2$
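A minimal PyTorch sketch of these two ReLU FFNs (an illustration, not the paper's code; the class name and `bias` flag are my own). The flag switches between the original biased form and the bias-free T5 form:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNReLU(nn.Module):
    """Position-wise FFN: max(0, xW1 + b1)W2 + b2, or the bias-free T5 form."""
    def __init__(self, d_model: int, d_ff: int, bias: bool = True):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=bias)  # W1 (and b1 when bias=True)
        self.w2 = nn.Linear(d_ff, d_model, bias=bias)  # W2 (and b2 when bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.relu(self.w1(x)))  # max(0, xW1 [+ b1]) W2 [+ b2]

x = torch.randn(2, 16, 768)              # (batch, sequence, d_model)
y = FFNReLU(d_model=768, d_ff=3072)(x)   # output has the same shape as x
```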
Gaussian Error Linear Units (GELU)
$\text{FFN}_{\text{GELU}}(x, W_1, W_2) = \text{GELU}(xW_1)W_2$
Swish with $\beta = 1$ (equivalent to SiLU)
$\text{FFN}_{\text{Swish}}(x, W_1, W_2) = \text{Swish}_1(xW_1)W_2$
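The GELU and Swish variants only swap the nonlinearity applied to $xW_1$. A hedged sketch with a pluggable activation (the class name and `act` argument are my own, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNAct(nn.Module):
    """Bias-free FFN with a pluggable activation: act(xW1) W2."""
    def __init__(self, d_model: int, d_ff: int, act=F.gelu):
        super().__init__()
        self.act = act
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(self.act(self.w1(x)))

ffn_gelu = FFNAct(768, 3072, act=F.gelu)   # FFN_GELU
ffn_swish = FFNAct(768, 3072, act=F.silu)  # FFN_Swish; SiLU is Swish with beta = 1
```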
Gated Linear Units (GLU) and Bilinear
$\begin{aligned} \text{GLU}(x, W, V, b, c) &= \sigma(xW + b) \otimes (xV + c) \\ \text{Bilinear}(x, W, V, b, c) &= (xW + b) \otimes (xV + c) \end{aligned}$
GLU variants
$\begin{aligned} \text{ReGLU}(x, W, V, b, c) &= \max(0, xW + b) \otimes (xV + c) \\ \text{GEGLU}(x, W, V, b, c) &= \text{GELU}(xW + b) \otimes (xV + c) \\ \text{SwiGLU}(x, W, V, b, c, \beta) &= \text{Swish}_\beta(xW + b) \otimes (xV + c) \end{aligned}$
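As a sketch, the gating functions can be written straight from the definitions above ($\otimes$ is element-wise multiplication; the function and argument names are my own):

```python
import torch
import torch.nn.functional as F

# x: (..., d_model); W, V: (d_model, d_ff); b, c: (d_ff,)
def glu(x, W, V, b, c):
    return torch.sigmoid(x @ W + b) * (x @ V + c)

def bilinear(x, W, V, b, c):
    return (x @ W + b) * (x @ V + c)

def reglu(x, W, V, b, c):
    return F.relu(x @ W + b) * (x @ V + c)

def geglu(x, W, V, b, c):
    return F.gelu(x @ W + b) * (x @ V + c)

def swiglu(x, W, V, b, c, beta=1.0):
    z = x @ W + b
    return z * torch.sigmoid(beta * z) * (x @ V + c)  # Swish_beta(z) = z * sigmoid(beta * z)
```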
FFN layers using GLU variants, without bias terms (proposed in the paper)
$\begin{aligned} \text{FFN}_{\text{GLU}}(x, W, V, W_2) &= (\sigma(xW) \otimes xV)W_2 \\ \text{FFN}_{\text{Bilinear}}(x, W, V, W_2) &= (xW \otimes xV)W_2 \\ \text{FFN}_{\text{ReGLU}}(x, W, V, W_2) &= (\max(0, xW) \otimes xV)W_2 \\ \text{FFN}_{\text{GEGLU}}(x, W, V, W_2) &= (\text{GELU}(xW) \otimes xV)W_2 \\ \text{FFN}_{\text{SwiGLU}}(x, W, V, W_2) &= (\text{Swish}_1(xW) \otimes xV)W_2 \end{aligned}$
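A hedged PyTorch sketch of one of these layers, $\text{FFN}_{\text{SwiGLU}}$, as a module (class and attribute names are illustrative). The other variants follow by changing the activation applied to $xW$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNSwiGLU(nn.Module):
    """FFN_SwiGLU(x, W, V, W2) = (Swish_1(xW) ⊗ xV) W2, with no bias terms."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.W = nn.Linear(d_model, d_ff, bias=False)
        self.V = nn.Linear(d_model, d_ff, bias=False)
        self.W2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu is Swish with beta = 1; swap in F.relu, F.gelu, torch.sigmoid,
        # or no activation to get FFN_ReGLU, FFN_GEGLU, FFN_GLU, FFN_Bilinear.
        return self.W2(F.silu(self.W(x)) * self.V(x))

y = FFNSwiGLU(d_model=768, d_ff=2048)(torch.randn(2, 16, 768))
```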
| Component | Specification |
|---|---|
| Base Model | T5 Base (Raffel et al., 2019) |
| Encoder Layers | 12 |
| Decoder Layers | 12 |
| Model Dimension ($d_{\text{model}}$) | 768 |
| Attention Heads ($h$) | 12 |
| Key/Value Dimension ($d_k = d_v$) | 64 |
| FFN Hidden Size ($d_{\text{ff}}$) | 3072 (standard FFN) |
| GLU-variant FFN Hidden Size ($d_{\text{ff}}$) | 2048 ($\tfrac{8}{3} d_{\text{model}}$, to keep parameter and operation counts constant) |
| GLU-variant FFN Weights | 3 weight matrices ($W$, $V$, $W_2$) vs. 2 in the standard FFN |
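A quick check of the hidden-size choice, following the table: the standard FFN has two $d_{\text{model}} \times d_{\text{ff}}$ matrices while a GLU-variant FFN has three, so shrinking $d_{\text{ff}}$ from $3072$ to $\tfrac{2}{3} \cdot 3072 = \tfrac{8}{3} d_{\text{model}} = 2048$ keeps the parameter count equal (the arithmetic below is my own illustration):

```python
d_model = 768

standard_ffn = 2 * d_model * 3072     # W1 and W2 of the standard FFN
glu_variant  = 3 * d_model * 2048     # W, V, and W2 of a GLU-variant FFN

assert standard_ffn == glu_variant == 4_718_592  # identical parameter counts
```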