Mixtral uses the same architectural modifications as described in Mistral 7B, with the notable exceptions that Mixtral supports a fully dense context length of 32k tokens and the feedforward blocks are replaced by Mixture-of-Experts (MoE) layers.
The details of the MoE layer are as follows:
Output from weighted experts:
$\sum_{i=0}^{n-1} G(x)_i \cdot E_i(x),$
where $G(x)_i$ is the weight assigned by the gating network and $E_i(x)$ is the output of the expert $E_i$.
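As a minimal sketch of this weighted combination, assuming PyTorch and a toy setup in which each expert is just a small linear module (the sizes and the dense gating weights below are illustrative placeholders, not Mixtral's):

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: n experts, each a small feed-forward module.
n_experts, d_model = 4, 8
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))

def moe_output(x: torch.Tensor, gate_weights: torch.Tensor) -> torch.Tensor:
    """Compute sum_i G(x)_i * E_i(x) for a single token x."""
    return sum(gate_weights[i] * experts[i](x) for i in range(n_experts))

x = torch.randn(d_model)
g = torch.softmax(torch.randn(n_experts), dim=-1)  # placeholder gating weights G(x)
y = moe_output(x, g)                                # shape: (d_model,)
```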
Sparse gating:
$G(x) := \text{Softmax}(\text{TopK}(x \cdot W_g)),$
i.e., except for the top-$K$ values, all other entries of $x \cdot W_g$ are set to $-\infty$, so their weights become $0$.
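A minimal sketch of this sparse gating, assuming PyTorch and a hypothetical gating matrix `W_g`: only the top-$K$ logits are kept, the rest are masked to $-\infty$ before the softmax, so their weights come out exactly zero.

```python
import torch

def sparse_gate(x: torch.Tensor, W_g: torch.Tensor, k: int) -> torch.Tensor:
    """G(x) = Softmax(TopK(x @ W_g)): keep the top-k logits, mask the rest to -inf."""
    logits = x @ W_g                              # shape: (n_experts,)
    topk_vals, topk_idx = torch.topk(logits, k)
    masked = torch.full_like(logits, float("-inf"))
    masked[topk_idx] = topk_vals                  # non-top-k entries stay -inf -> weight 0
    return torch.softmax(masked, dim=-1)

# Example: 8 experts, keep the top 2 (dimensions are illustrative).
x = torch.randn(16)
W_g = torch.randn(16, 8)
weights = sparse_gate(x, W_g, k=2)  # exactly 2 nonzero weights, summing to 1
```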
Implementation: For Mixtral we use the same SwiGLU architecture as the expert function $E_i(x)$ and set $K = 2$. The output $y$ for an input token $x$ is computed as:
$y = \sum_{i=0}^{n-1} \text{Softmax}(\text{Top2}(x \cdot W_g))_i \cdot \text{SwiGLU}_i(x).$
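Putting the pieces together, here is a minimal single-token sketch of such a layer in PyTorch. The class names, dimensions, and single-token interface are illustrative assumptions rather than Mixtral's reference implementation; in practice the routing is batched so that each expert only processes the tokens assigned to it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert E_i: a SwiGLU feed-forward block (gate, up, and down projections)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # up projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class MoELayer(nn.Module):
    """Sparse MoE layer: route each token to its top-k SwiGLU experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(SwiGLUExpert(d_model, d_ff) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # W_g
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (d_model,) -- single token for clarity; real implementations batch this.
        logits = self.gate(x)
        topk_vals, topk_idx = torch.topk(logits, self.k)
        weights = torch.softmax(topk_vals, dim=-1)  # softmax over the kept logits only
        return sum(w * self.experts[i](x) for w, i in zip(weights, topk_idx.tolist()))

layer = MoELayer(d_model=32, d_ff=128, n_experts=8, k=2)
y = layer(torch.randn(32))
```

Taking the softmax over only the kept top-$K$ logits is equivalent to the masked formulation above, since entries set to $-\infty$ contribute zero weight after the softmax.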