Research Question
- We introduce Mistral 7B, a 7-billion-parameter language model.
- Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation.
- Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost.
- We also provide a model fine-tuned to follow instructions, Mistral 7B – Instruct.
- No pretraining details or information about the training datasets are given in this paper 😟
Architectural details
The main parameters of the architecture are summarized in Table 1.

Compared to Llama, it introduces a few changes that we summarize below.
Sliding Window Attention.
- The hidden state at position $i$ of layer $k$, $h_i$, attends to all hidden states from the previous layer with positions between $i - W$ and $i$.
- Each layer extends the receptive field by $W$, so after $k$ layers a token can indirectly attend to roughly $k \times W$ earlier positions. At the last of the 32 layers, with a window size of $W = 4096$, this gives a theoretical attention span of approximately $32 \times 4096 \approx 131$K tokens (a toy mask sketch is given after the notes below).

- Note:
- The meaning of the window-size parameter here (as in any decoder-only transformer model) is not the same as in the original Longformer paper.
- In the Longformer paper, the attention is bidirectional (encoder-style), and $W$ is defined as the window a token can attend to on both sides, i.e., token $k$ can attend to $\frac{1}{2}W$ preceding and $\frac{1}{2}W$ succeeding tokens.
- In practice, for a sequence length of 16K and $W = 4096$, changes made to FlashAttention [11] and xFormers [18] yield a 2x speed improvement over a vanilla attention baseline.
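
A minimal sketch of the masking rule above (my own toy code, not Mistral's implementation; the function name and shapes are made up for illustration): position $i$ may attend to positions $j$ with $i - W < j \le i$, i.e., itself plus at most $W - 1$ preceding tokens.

```python
# Toy sliding-window causal mask (illustration only, not Mistral's code).
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask of shape (seq_len, seq_len); True = attention allowed."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # no attention to future tokens
    in_window = (i - j) < window      # only the last `window` positions are visible
    return causal & in_window

mask = sliding_window_mask(seq_len=8, window=3)
# Row 5 is True only for columns 3, 4, 5: the token attends to itself
# and the two previous positions, i.e., at most W = 3 tokens.
print(mask.astype(int))
```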
Rolling Buffer Cache
Attention is computed from the queries, keys and values, and since each token's attention window is fixed at $W$, keys and values older than $W$ positions are never needed again. The cache can therefore be limited to a fixed size of $W$: the keys and values for position $i$ are stored at slot $i \bmod W$, overwriting entries that have fallen out of the window.
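
A toy sketch of this circular-buffer idea (my own code, not the paper's implementation; the class name and shapes are hypothetical): the cache holds exactly $W$ entries, and position $i$ is written to slot $i \bmod W$.

```python
# Toy rolling (circular) KV cache with a fixed attention window W.
import numpy as np

class RollingKVCache:
    def __init__(self, window: int, head_dim: int):
        self.window = window
        self.keys = np.zeros((window, head_dim), dtype=np.float32)
        self.values = np.zeros((window, head_dim), dtype=np.float32)
        self.pos = 0  # absolute position of the next token

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        slot = self.pos % self.window  # old entries outside the window get overwritten
        self.keys[slot] = k
        self.values[slot] = v
        self.pos += 1

    def get(self):
        """Return cached K/V in chronological order for the current window."""
        n = min(self.pos, self.window)
        if self.pos <= self.window:
            return self.keys[:n], self.values[:n]
        # Unroll the circular buffer: the oldest kept entry sits at pos % window.
        start = self.pos % self.window
        order = np.concatenate([np.arange(start, self.window), np.arange(0, start)])
        return self.keys[order], self.values[order]

# Usage: with W = 4, after 6 tokens the cache holds only positions 2..5.
cache = RollingKVCache(window=4, head_dim=2)
for t in range(6):
    cache.append(np.full(2, t, dtype=np.float32), np.full(2, t, dtype=np.float32))
k, v = cache.get()
print(k[:, 0])  # -> [2. 3. 4. 5.]
```

The point of the design is that cache memory stays constant at $W$ entries per layer regardless of sequence length, instead of growing linearly with the number of generated tokens.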