Rectified-linear units (ReLU)
$\text{FFN}(x, W_1, W_2, b_1, b_2) = \max(0, xW_1 + b_1)W_2 + b_2$
Rectified-linear units (ReLU) without bias (used by T5)
$\text{FFN}_{\text{ReLU}}(x, W_1, W_2) = \max(xW_1, 0)W_2$
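A minimal PyTorch sketch of these two ReLU FFNs (an illustration, not the paper's code; the class name and `bias` flag are my own). The flag switches between the original biased form and the bias-free T5 form:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNReLU(nn.Module):
    """Position-wise FFN: max(0, xW1 + b1)W2 + b2, or the bias-free T5 form."""
    def __init__(self, d_model: int, d_ff: int, bias: bool = True):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=bias)  # W1 (and b1 when bias=True)
        self.w2 = nn.Linear(d_ff, d_model, bias=bias)  # W2 (and b2 when bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.relu(self.w1(x)))  # max(0, xW1 [+ b1]) W2 [+ b2]

x = torch.randn(2, 16, 768)              # (batch, sequence, d_model)
y = FFNReLU(d_model=768, d_ff=3072)(x)   # output has the same shape as x
```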
Gaussian Error Linear Units (GELU)
$\text{FFN}_{\text{GELU}}(x, W_1, W_2) = \text{GELU}(xW_1)W_2$
Swish with $\beta = 1$ (equivalent to SiLU)
$\text{FFN}_{\text{Swish}}(x, W_1, W_2) = \text{Swish}_1(xW_1)W_2$
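The GELU and Swish variants only swap the nonlinearity applied to $xW_1$. A hedged sketch with a pluggable activation (the class name and `act` argument are my own, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNAct(nn.Module):
    """Bias-free FFN with a pluggable activation: act(xW1) W2."""
    def __init__(self, d_model: int, d_ff: int, act=F.gelu):
        super().__init__()
        self.act = act
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(self.act(self.w1(x)))

ffn_gelu = FFNAct(768, 3072, act=F.gelu)   # FFN_GELU
ffn_swish = FFNAct(768, 3072, act=F.silu)  # FFN_Swish; SiLU is Swish with beta = 1
```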
Gated Linear Units (GLU) and Bilinear
$\begin{aligned} \text{GLU}(x, W, V, b, c) &= \sigma(xW + b) \otimes (xV + c) \\ \text{Bilinear}(x, W, V, b, c) &= (xW + b) \otimes (xV + c) \end{aligned}$
GLU variants
$\begin{aligned} \text{ReGLU}(x, W, V, b, c) &= \max(0, xW + b) \otimes (xV + c) \\ \text{GEGLU}(x, W, V, b, c) &= \text{GELU}(xW + b) \otimes (xV + c) \\ \text{SwiGLU}(x, W, V, b, c, \beta) &= \text{Swish}_\beta(xW + b) \otimes (xV + c) \end{aligned}$
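As a sketch, the gating functions can be written straight from the definitions above ($\otimes$ is element-wise multiplication; the function and argument names are my own):

```python
import torch
import torch.nn.functional as F

# x: (..., d_model); W, V: (d_model, d_ff); b, c: (d_ff,)
def glu(x, W, V, b, c):
    return torch.sigmoid(x @ W + b) * (x @ V + c)

def bilinear(x, W, V, b, c):
    return (x @ W + b) * (x @ V + c)

def reglu(x, W, V, b, c):
    return F.relu(x @ W + b) * (x @ V + c)

def geglu(x, W, V, b, c):
    return F.gelu(x @ W + b) * (x @ V + c)

def swiglu(x, W, V, b, c, beta=1.0):
    z = x @ W + b
    return z * torch.sigmoid(beta * z) * (x @ V + c)  # Swish_beta(z) = z * sigmoid(beta * z)
```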
FFN layers using GLU variants, without bias terms (proposed in the paper)
$\begin{aligned} \text{FFN}_{\text{GLU}}(x, W, V, W_2) &= (\sigma(xW) \otimes xV)W_2 \\ \text{FFN}_{\text{Bilinear}}(x, W, V, W_2) &= (xW \otimes xV)W_2 \\ \text{FFN}_{\text{ReGLU}}(x, W, V, W_2) &= (\max(0, xW) \otimes xV)W_2 \\ \text{FFN}_{\text{GEGLU}}(x, W, V, W_2) &= (\text{GELU}(xW) \otimes xV)W_2 \\ \text{FFN}_{\text{SwiGLU}}(x, W, V, W_2) &= (\text{Swish}_1(xW) \otimes xV)W_2 \end{aligned}$
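A hedged PyTorch sketch of one of these layers, $\text{FFN}_{\text{SwiGLU}}$, as a module (class and attribute names are illustrative). The other variants follow by changing the activation applied to $xW$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNSwiGLU(nn.Module):
    """FFN_SwiGLU(x, W, V, W2) = (Swish_1(xW) ⊗ xV) W2, with no bias terms."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.W = nn.Linear(d_model, d_ff, bias=False)
        self.V = nn.Linear(d_model, d_ff, bias=False)
        self.W2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu is Swish with beta = 1; swap in F.relu, F.gelu, torch.sigmoid,
        # or no activation to get FFN_ReGLU, FFN_GEGLU, FFN_GLU, FFN_Bilinear.
        return self.W2(F.silu(self.W(x)) * self.V(x))

y = FFNSwiGLU(d_model=768, d_ff=2048)(torch.randn(2, 16, 768))
```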
| Component | Specification |
|---|---|
| Base Model | T5 Base (Raffel et al., 2019) |
| Encoder Layers | 12 |
| Decoder Layers | 12 |
| Model Dimension ($d_{\text{model}}$) | 768 |
| Attention Heads ($h$) | 12 |
| Key/Value Dimension ($d_k = d_v$) | 64 |
| FFN Hidden Size ($d_{\text{ff}}$) | 3072 (standard FFN) |
| GLU-variant FFN Hidden Size ($d_{\text{ff}}$) | 2048 ($\tfrac{8}{3} d_{\text{model}}$, to keep parameter and operation counts constant) |
| GLU-variant FFN Weights | 3 weight matrices ($W$, $V$, $W_2$) vs. 2 in the standard FFN |
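A quick check of the hidden-size choice, following the table: the standard FFN has two $d_{\text{model}} \times d_{\text{ff}}$ matrices while a GLU-variant FFN has three, so shrinking $d_{\text{ff}}$ from $3072$ to $\tfrac{2}{3} \cdot 3072 = \tfrac{8}{3} d_{\text{model}} = 2048$ keeps the parameter count equal (the arithmetic below is my own illustration):

```python
d_model = 768

standard_ffn = 2 * d_model * 3072     # W1 and W2 of the standard FFN
glu_variant  = 3 * d_model * 2048     # W, V, and W2 of a GLU-variant FFN

assert standard_ffn == glu_variant == 4_718_592  # identical parameter counts
```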