Research Question

Architectural details

The main parameters of the architecture are summarized in Table 1.

[Table 1: main architecture parameters — image not reproduced]
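For concreteness, a minimal configuration sketch, assuming Table 1 here is the parameter table from the Mistral 7B paper; the dataclass and field names follow that table but are illustrative, since the image itself is not reproduced above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelArgs:
    # Values as published in Table 1 of the Mistral 7B paper
    # (assumption: that is the table referenced above).
    dim: int = 4096          # model / residual stream width
    n_layers: int = 32       # number of transformer blocks
    head_dim: int = 128      # dimension of each attention head
    hidden_dim: int = 14336  # feed-forward inner dimension
    n_heads: int = 32        # query heads
    n_kv_heads: int = 8      # key/value heads (grouped-query attention)
    window_size: int = 4096  # sliding attention window W
    context_len: int = 8192  # training context length
    vocab_size: int = 32000
```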

Compared to Llama, the architecture introduces a few changes, which we summarize below.

Sliding Window Attention

Each token attends only to the previous W tokens instead of the full context. Information from further back still reaches a token indirectly: each stacked layer extends the receptive field by W positions, so after k layers a token can draw on roughly k·W tokens of history.

[Figure: sliding window attention — image not reproduced]
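Concretely, the attention mask lets query position q see key position k only when k ≤ q (causal) and q − k < W (within the window). A minimal NumPy sketch; the function name build_sliding_window_mask and the boolean-mask convention are my own, not from the paper.

```python
import numpy as np

def build_sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: entry [q, k] is True iff query position q may attend
    to key position k, i.e. k <= q (causal) and q - k < window."""
    q = np.arange(seq_len)[:, None]  # query positions, as a column
    k = np.arange(seq_len)[None, :]  # key positions, as a row
    return (k <= q) & (q - k < window)

# Example: with seq_len=6 and window=3, token 5 attends to tokens 3, 4, 5.
mask = build_sliding_window_mask(6, 3)
print(mask.astype(int))
```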

Rolling Buffer Cache

Attention is computed as softmax(QKᵀ/√d)·V, and since the attention window for each token is fixed at W, a token never attends to keys and values more than W positions back. The KV cache therefore does not need to grow with sequence length: a rolling buffer of size W suffices, where the keys and values for position i are written to slot i mod W, overwriting the entry for position i − W, which has fallen out of the window.
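A minimal sketch of that rolling buffer, assuming a single layer and head and ignoring batching for clarity; the class name RollingKVCache and its methods are illustrative, not an API from the paper or any library.

```python
import numpy as np

class RollingKVCache:
    """Fixed-size KV cache for sliding window attention: slot i % window
    always holds the keys/values of the most recent position mapped there."""

    def __init__(self, window: int, head_dim: int):
        self.window = window
        self.keys = np.zeros((window, head_dim), dtype=np.float32)
        self.values = np.zeros((window, head_dim), dtype=np.float32)
        self.pos = 0  # absolute position of the next token

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        # Overwrite the slot of position pos - window, now outside the window.
        slot = self.pos % self.window
        self.keys[slot] = k
        self.values[slot] = v
        self.pos += 1

    def window_kv(self) -> tuple[np.ndarray, np.ndarray]:
        """Return cached keys/values in chronological order (oldest first)."""
        n = min(self.pos, self.window)  # number of valid entries
        order = [(self.pos - n + t) % self.window for t in range(n)]
        return self.keys[order], self.values[order]
```

At each generation step, attention for the new token is then computed over at most W cached entries, so both memory and per-token compute stay bounded regardless of sequence length.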