Research Question
- Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length.
- Existing approaches partition or shorten the long context into smaller sequences that fall within the typical 512-token limit of BERT-style pretrained models.
- To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length.
Approach
- The original Transformer model has a self-attention component with $O(n^2)$ time and memory complexity where n is the input sequence length.
- To address this challenge, we sparsify the full self-attention matrix according to an “attention pattern” specifying pairs of input locations attending to one another.
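As a concrete (if naive) illustration, the sketch below applies a boolean attention pattern by masking the full score matrix before the softmax. The function and variable names are our own, and unlike the actual Longformer kernels this version still materializes the full $n \times n$ matrix, so it only illustrates what an attention pattern means, not the linear-time implementation.

```python
# Minimal sketch (not the paper's implementation): sparsifying self-attention
# with a boolean "attention pattern". This naive version still builds the full
# n x n score matrix; the patterns below are what make linear scaling possible.
import torch

def masked_attention(q, k, v, pattern):
    # q, k, v: (n, d); pattern: (n, n) boolean, True where position i may attend to j
    scores = q @ k.T / k.shape[-1] ** 0.5          # full O(n^2) score matrix
    scores = scores.masked_fill(~pattern, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v       # weighted sum of values

n, d = 8, 16
q, k, v = (torch.randn(n, d) for _ in range(3))
pattern = torch.eye(n, dtype=torch.bool)           # toy pattern: each token attends only to itself
out = masked_attention(q, k, v, pattern)           # shape (8, 16)
```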
Attention Pattern

Sliding Window
- Implementation
- Employ a fixed-size window attention surrounding each token.
- Using multiple stacked layers of such windowed attention results in a large receptive field, where top layers have access to all input locations and have the capacity to build representations that incorporate information across the entire input, similar to CNNs.
- Given a fixed window size $w$, each token attends to $\frac{1}{2}w$ tokens on each side (Fig. 2b).
- The computational complexity of this pattern is $O(n \times w)$, which scales linearly with input sequence length $n$ (see the sketch at the end of this subsection).
- Properties
- In a transformer with $l$ layers, the receptive field size at the top layer is $l \times w$ (assuming $w$ is fixed for all layers).
- Depending on the application, it might be helpful to use different values of $w$ for each layer to balance between efficiency and model representation capacity (§4.1).
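A minimal sketch of the sliding-window pattern, assuming a PyTorch-style boolean mask (our own helper, not the paper's code): only the roughly $n \times w$ `True` entries would need to be computed, which is where the linear scaling comes from.

```python
# Sketch of a sliding-window attention pattern with window size w:
# token i may attend to tokens j with |i - j| <= w // 2.
import torch

def sliding_window_pattern(n, w):
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= w // 2   # (n, n) boolean mask

pattern = sliding_window_pattern(n=12, w=4)
print(pattern.sum(dim=-1))   # about w + 1 allowed positions per token (fewer at the edges)
# Stacking l such layers grows the receptive field to roughly l * w tokens.
```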
Dilated Sliding Window
- Implementation
- To further increase the receptive field without increasing computation, the sliding window can be “dilated”: the window has gaps of size dilation $d$ (Fig. 2c).
- Properties
- Assuming a fixed $d$ and $w$ for all layers, the receptive field is $l \times d \times w$, which can reach tens of thousands of tokens even for small values of $d$ (see the sketch at the end of this subsection).
- We found that using different dilation configurations per head improves performance by
- allowing some heads without dilation to focus on local context,
- while others with dilation focus on longer context.
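A corresponding sketch for the dilated variant (again our own helper, not the paper's implementation): with dilation $d$ the attended offsets are multiples of $d$, so the window spans roughly $d \times w$ positions while the number of attended tokens per position stays around $w$.

```python
# Sketch of a dilated sliding-window pattern: token i attends to offsets
# 0, ±d, ±2d, ... up to w // 2 offsets per side.
import torch

def dilated_window_pattern(n, w, d):
    idx = torch.arange(n)
    offset = idx[:, None] - idx[None, :]
    return (offset.abs() <= (w // 2) * d) & (offset % d == 0)

pattern = dilated_window_pattern(n=16, w=4, d=2)
# Per-head dilation: e.g. d = 1 for some heads (local detail), d > 1 for others (longer range).
```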
Global Attention
- Implementation
- The windowed and dilated attention are not flexible enough to learn task-specific representations. Accordingly, we add “global attention” at a few pre-selected input locations.
- We make this attention operation symmetric:
- a token with a global attention attends to all tokens across the sequence,
- and all tokens in the sequence attend to it.
- For example
- for classification, global attention is used for the [CLS] token
- in QA, global attention is provided on all question tokens.
- Properties
- The complexity of the combined local and global attention is still $O(n)$.
- Linear Projections for Global Attention
- We use two sets of projections (see the sketch below),
- $Q_s, K_s, V_s$ to compute attention scores for sliding window attention,
- $Q_g, K_g, V_g$ to compute attention scores for the global attention.
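Putting the pieces together, here is a naive sketch of local plus symmetric global attention with the two projection sets. The module name, the `torch.where`-based mixing of projections, and the full-matrix masking are our simplifications for illustration; the released Longformer code uses custom kernels and never builds the full $n \times n$ matrix.

```python
# Sketch: sliding-window attention combined with symmetric global attention,
# using separate projections (Q_s/K_s/V_s vs. Q_g/K_g/V_g). Illustrative only.
import torch
import torch.nn as nn

class LocalGlobalAttentionSketch(nn.Module):
    def __init__(self, d_model, w):
        super().__init__()
        self.w = w
        # two sets of projections: one for sliding-window, one for global attention
        self.q_s, self.k_s, self.v_s = (nn.Linear(d_model, d_model) for _ in range(3))
        self.q_g, self.k_g, self.v_g = (nn.Linear(d_model, d_model) for _ in range(3))

    def forward(self, x, global_idx):
        # x: (n, d_model); global_idx: 1-D LongTensor of globally attending positions
        n, d = x.shape
        idx = torch.arange(n)
        local = (idx[:, None] - idx[None, :]).abs() <= self.w // 2
        is_global = torch.zeros(n, dtype=torch.bool)
        is_global[global_idx] = True
        # symmetric global attention: global tokens attend everywhere,
        # and every token attends to the global tokens
        pattern = local | is_global[:, None] | is_global[None, :]

        # use the global projections for global tokens, the sliding-window
        # projections for all other tokens (a simplification of the paper's scheme)
        q = torch.where(is_global[:, None], self.q_g(x), self.q_s(x))
        k = torch.where(is_global[:, None], self.k_g(x), self.k_s(x))
        v = torch.where(is_global[:, None], self.v_g(x), self.v_s(x))

        scores = q @ k.T / d ** 0.5
        scores = scores.masked_fill(~pattern, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

attn = LocalGlobalAttentionSketch(d_model=32, w=4)
x = torch.randn(10, 32)
out = attn(x, global_idx=torch.tensor([0]))   # e.g. a [CLS]-style token at position 0
```

Because the number of global tokens is small and fixed per task, the extra work they add is linear in $n$, so the combined pattern keeps the overall $O(n)$ scaling noted above.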