Mixtral uses the same architectural modifications as described in Mistral 7B, with the notable exceptions that Mixtral supports a fully dense context length of 32k tokens and the feedforward blocks are replaced by Mixture-of-Experts (MoE) layers.
The details of the MoE layer are as follows:
Output from weighted experts:
$\sum_{i=0}^{n-1} G(x)_i \cdot E_i(x),$
where $G(x)_i$ is the weight assigned by the gating network and $E_i(x)$ is the output of the expert $E_i$.
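As a minimal sketch of this weighted combination, assuming PyTorch and a toy setup in which each expert is just a small linear module (the sizes and the dense gating weights below are illustrative placeholders, not Mixtral's):

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: n experts, each a small feed-forward module.
n_experts, d_model = 4, 8
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))

def moe_output(x: torch.Tensor, gate_weights: torch.Tensor) -> torch.Tensor:
    """Compute sum_i G(x)_i * E_i(x) for a single token x."""
    return sum(gate_weights[i] * experts[i](x) for i in range(n_experts))

x = torch.randn(d_model)
g = torch.softmax(torch.randn(n_experts), dim=-1)  # placeholder gating weights G(x)
y = moe_output(x, g)                                # shape: (d_model,)
```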
Sparse gating:
$G(x) := \text{Softmax}(\text{TopK}(x \cdot W_g)),$
i.e., except for the top-$K$ values, all other entries of $x \cdot W_g$ are set to $-\infty$, so their weights become $0$.
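A minimal sketch of this sparse gating, assuming PyTorch and a hypothetical gating matrix `W_g`: only the top-$K$ logits are kept, the rest are masked to $-\infty$ before the softmax, so their weights come out exactly zero.

```python
import torch

def sparse_gate(x: torch.Tensor, W_g: torch.Tensor, k: int) -> torch.Tensor:
    """G(x) = Softmax(TopK(x @ W_g)): keep the top-k logits, mask the rest to -inf."""
    logits = x @ W_g                              # shape: (n_experts,)
    topk_vals, topk_idx = torch.topk(logits, k)
    masked = torch.full_like(logits, float("-inf"))
    masked[topk_idx] = topk_vals                  # non-top-k entries stay -inf -> weight 0
    return torch.softmax(masked, dim=-1)

# Example: 8 experts, keep the top 2 (dimensions are illustrative).
x = torch.randn(16)
W_g = torch.randn(16, 8)
weights = sparse_gate(x, W_g, k=2)  # exactly 2 nonzero weights, summing to 1
```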
Implementation: For Mixtral we use the same SwiGLU architecture as the expert function $E_i(x)$ and set $K = 2$. The output $y$ for an input token $x$ is computed as:
$y = \sum_{i=0}^{n-1} \text{Softmax}(\text{Top2}(x \cdot W_g))_i \cdot \text{SwiGLU}_i(x).$
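Putting the pieces together, here is a minimal single-token sketch of such a layer in PyTorch. The class names, dimensions, and single-token interface are illustrative assumptions rather than Mixtral's reference implementation; in practice the routing is batched so that each expert only processes the tokens assigned to it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert E_i: a SwiGLU feed-forward block (gate, up, and down projections)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # up projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class MoELayer(nn.Module):
    """Sparse MoE layer: route each token to its top-k SwiGLU experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(SwiGLUExpert(d_model, d_ff) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # W_g
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (d_model,) -- single token for clarity; real implementations batch this.
        logits = self.gate(x)
        topk_vals, topk_idx = torch.topk(logits, self.k)
        weights = torch.softmax(topk_vals, dim=-1)  # softmax over the kept logits only
        return sum(w * self.experts[i](x) for w, i in zip(weights, topk_idx.tolist()))

layer = MoELayer(d_model=32, d_ff=128, n_experts=8, k=2)
y = layer(torch.randn(32))
```

Taking the softmax over only the kept top-$K$ logits is equivalent to the masked formulation above, since entries set to $-\infty$ contribute zero weight after the softmax.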