Research Question
- Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length.
- Existing approaches partition or shorten the long context into smaller sequences that fall within the typical 512-token limit of BERT-style pretrained models.
- To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length.
Approach
- The original Transformer model has a self-attention component with $O(n^2)$ time and memory complexity where n is the input sequence length.
- To address this challenge, we sparsify the full self-attention matrix according to an “attention pattern” specifying pairs of input locations attending to one another.
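As a concrete (if naive) illustration, the sketch below applies a boolean attention pattern by masking the full score matrix before the softmax. The function and variable names are our own, and unlike the actual Longformer kernels this version still materializes the full $n \times n$ matrix, so it only illustrates what an attention pattern means, not the linear-time implementation.

```python
# Minimal sketch (not the paper's implementation): sparsifying self-attention
# with a boolean "attention pattern". This naive version still builds the full
# n x n score matrix; the patterns below are what make linear scaling possible.
import torch

def masked_attention(q, k, v, pattern):
    # q, k, v: (n, d); pattern: (n, n) boolean, True where position i may attend to j
    scores = q @ k.T / k.shape[-1] ** 0.5          # full O(n^2) score matrix
    scores = scores.masked_fill(~pattern, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v       # weighted sum of values

n, d = 8, 16
q, k, v = (torch.randn(n, d) for _ in range(3))
pattern = torch.eye(n, dtype=torch.bool)           # toy pattern: each token attends only to itself
out = masked_attention(q, k, v, pattern)           # shape (8, 16)
```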
Attention Pattern

Sliding Window
- Implementation
- Employ a fixed-size window attention surrounding each token.
- Using multiple stacked layers of such windowed attention results in a large receptive field, where top layers have access to all input locations and have the capacity to build representations that incorporate information across the entire input, similar to CNNs.
- Given a fixed window size $w$, each token attends to $\frac{1}{2}w$ tokens on each side (Fig. 2b).
- The computational complexity of this pattern is $O(n \times w)$, which scales linearly with input sequence length $n$ (see the sketch at the end of this subsection).
- Properties
- In a transformer with $l$ layers, the receptive field size at the top layer is $l \times w$ (assuming $w$ is fixed for all layers).
- Depending on the application, it might be helpful to use different values of $w$ for each layer to balance between efficiency and model representation capacity (§4.1).
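A minimal sketch of the sliding-window pattern, assuming a PyTorch-style boolean mask (our own helper, not the paper's code): only the roughly $n \times w$ `True` entries would need to be computed, which is where the linear scaling comes from.

```python
# Sketch of a sliding-window attention pattern with window size w:
# token i may attend to tokens j with |i - j| <= w // 2.
import torch

def sliding_window_pattern(n, w):
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= w // 2   # (n, n) boolean mask

pattern = sliding_window_pattern(n=12, w=4)
print(pattern.sum(dim=-1))   # about w + 1 allowed positions per token (fewer at the edges)
# Stacking l such layers grows the receptive field to roughly l * w tokens.
```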
Dilated Sliding Window
- Implementation
- To further increase the receptive field without increasing computation, the sliding window can be “dilated”: the window has gaps of size dilation $d$ (Fig. 2c).
- Properties
- Assuming a fixed $d$ and $w$ for all layers, the receptive field is $l \times d \times w$, which can reach tens of thousands of tokens even for small values of $d$ (see the sketch at the end of this subsection).
- We found that using different dilation configurations per head improves performance by
- allowing some heads without dilation to focus on local context,
- while others with dilation focus on longer context.
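A corresponding sketch for the dilated variant (again our own helper, not the paper's implementation): with dilation $d$ the attended offsets are multiples of $d$, so the window spans roughly $d \times w$ positions while the number of attended tokens per position stays around $w$.

```python
# Sketch of a dilated sliding-window pattern: token i attends to offsets
# 0, ±d, ±2d, ... up to w // 2 offsets per side.
import torch

def dilated_window_pattern(n, w, d):
    idx = torch.arange(n)
    offset = idx[:, None] - idx[None, :]
    return (offset.abs() <= (w // 2) * d) & (offset % d == 0)

pattern = dilated_window_pattern(n=16, w=4, d=2)
# Per-head dilation: e.g. d = 1 for some heads (local detail), d > 1 for others (longer range).
```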
Global Attention
- Implementation
- The windowed and dilated attention are not flexible enough to learn task-specific representations. Accordingly, we add “global attention” at a few pre-selected input locations.
- We make this attention operation symmetric:
- a token with a global attention attends to all tokens across the sequence,
- and all tokens in the sequence attend to it.
- For example
- for classification, global attention is used for the [CLS] token
- in QA, global attention is provided on all question tokens.
- Properties
- The complexity of the combined local and global attention is still $O(n)$.
- Linear Projections for Global Attention
- We use two sets of projections (see the sketch below),
- $Q_s, K_s, V_s$ to compute attention scores for sliding window attention,
- $Q_g, K_g, V_g$ to compute attention scores for the global attention.
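Putting the pieces together, here is a naive sketch of local plus symmetric global attention with the two projection sets. The module name, the `torch.where`-based mixing of projections, and the full-matrix masking are our simplifications for illustration; the released Longformer code uses custom kernels and never builds the full $n \times n$ matrix.

```python
# Sketch: sliding-window attention combined with symmetric global attention,
# using separate projections (Q_s/K_s/V_s vs. Q_g/K_g/V_g). Illustrative only.
import torch
import torch.nn as nn

class LocalGlobalAttentionSketch(nn.Module):
    def __init__(self, d_model, w):
        super().__init__()
        self.w = w
        # two sets of projections: one for sliding-window, one for global attention
        self.q_s, self.k_s, self.v_s = (nn.Linear(d_model, d_model) for _ in range(3))
        self.q_g, self.k_g, self.v_g = (nn.Linear(d_model, d_model) for _ in range(3))

    def forward(self, x, global_idx):
        # x: (n, d_model); global_idx: 1-D LongTensor of globally attending positions
        n, d = x.shape
        idx = torch.arange(n)
        local = (idx[:, None] - idx[None, :]).abs() <= self.w // 2
        is_global = torch.zeros(n, dtype=torch.bool)
        is_global[global_idx] = True
        # symmetric global attention: global tokens attend everywhere,
        # and every token attends to the global tokens
        pattern = local | is_global[:, None] | is_global[None, :]

        # use the global projections for global tokens, the sliding-window
        # projections for all other tokens (a simplification of the paper's scheme)
        q = torch.where(is_global[:, None], self.q_g(x), self.q_s(x))
        k = torch.where(is_global[:, None], self.k_g(x), self.k_s(x))
        v = torch.where(is_global[:, None], self.v_g(x), self.v_s(x))

        scores = q @ k.T / d ** 0.5
        scores = scores.masked_fill(~pattern, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

attn = LocalGlobalAttentionSketch(d_model=32, w=4)
x = torch.randn(10, 32)
out = attn(x, global_idx=torch.tensor([0]))   # e.g. a [CLS]-style token at position 0
```

Because the number of global tokens is small and fixed per task, the extra work they add is linear in $n$, so the combined pattern keeps the overall $O(n)$ scaling noted above.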