Longformer Model
A Longformer Model is a transformer-based model designed to handle long input contexts.
- Context:
- It can (typically) use a combination of Local Attention and Global Attention to reduce the quadratic cost of full self-attention (see the sketch after this list).
- It can remove attention between most token pairs, while granting Global Attention to a few selected tokens (such as delimiter tokens).
- It can (typically) have a Sparse Attention Matrix (versus a dense attention matrix).
- It can allow input sequences with thousands of tokens to be processed.
- ...
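The following is a minimal, illustrative sketch (not taken from any of the cited works) of how Local Attention plus Global Attention is typically exercised through the Hugging Face transformers implementation of Longformer. The checkpoint name allenai/longformer-base-4096 and the choice of giving only the first token global attention are assumptions made for this example.

    import torch
    from transformers import LongformerModel, LongformerTokenizer

    # Illustrative checkpoint; any Longformer checkpoint with a long
    # maximum position length (here 4,096 tokens) would do.
    checkpoint = "allenai/longformer-base-4096"
    tokenizer = LongformerTokenizer.from_pretrained(checkpoint)
    model = LongformerModel.from_pretrained(checkpoint)

    # A long input: thousands of tokens are feasible because most tokens
    # only attend within a local window (sparse attention matrix).
    text = " ".join(["Long documents need sparse attention."] * 300)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

    # Global attention mask: 0 = local-only, 1 = attends to and is attended by all tokens.
    # Here only the first (<s>) token is made global, an illustrative choice.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1

    outputs = model(**inputs, global_attention_mask=global_attention_mask)
    print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)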
- Example(s):
- one used in (Chalkidis et al., 2022)
- one at https://huggingface.co/zedfum/arman-longformer-8k-finetuned-ensani.
- one at https://huggingface.co/lexlms/legal-longformer-base.
- ...
- Counter-Example(s):
- one using the Transformer Model Architecture from (Vaswani et al., 2017).
- A Convolutional Neural Network.
- See Also: Efficient Transformer, Sparse Transformer
References
2023
- (Niklaus & Giofré, 2023) ⇒ Joel Niklaus, and Daniele Giofré. (2023). “Can We Pretrain a SotA Legal Language Model on a Budget From Scratch?.” In: Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP).
- QUOTE: ... In this work, we train Longformer models with the efficient RTD task on long-context legal data to showcase that pretraining efficient LMs is possible using less than 12 GPU days. ...
- QUOTE: ... Longformer (Beltagy et al., 2020) is one of these efficient transformer architectures for long sequences, leveraging windowed and global attention. So far, to the best of our knowledge, there does not yet exist a public Longformer model pretrained on English legal data, although Xiao et al. (2021) have proven the effectiveness of the Longformer in dealing with long legal text in many Chinese-related tasks. This work aims to fill this gap. ...
2022
- https://huggingface.co/docs/transformers/model_doc/longformer
- QUOTE: ... Since the Longformer is based on RoBERTa, it doesn’t have token_type_ids. You don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or </s>).
- QUOTE: A transformer model replacing the attention matrices by sparse matrices to go faster. Often, the local context (e.g., what are the two tokens left and right?) is enough to take action for a given token. Some preselected input tokens are still given global attention, but the attention matrix has way less parameters, resulting in a speed-up. See the local attention section for more information.
- QUOTE: Longformer self attention employs self attention on both a “local” context and a “global” context. Most tokens only attend “locally” to each other, meaning that each token attends to its ½·w previous tokens and ½·w succeeding tokens, with w being the window length as defined in config.attention_window. Note that config.attention_window can be of type List to define a different w for each layer. A selected few tokens attend “globally” to all other tokens, as it is conventionally done for all tokens in BertSelfAttention. ...
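As a hedged illustration of the quoted points (no token_type_ids, segments joined with tokenizer.sep_token, and a per-layer attention_window list), the sketch below tokenizes two segments and builds a small, randomly initialized Longformer; the layer count, hidden sizes, and window lengths are illustrative assumptions, not settings from the quoted documentation.

    from transformers import LongformerConfig, LongformerModel, LongformerTokenizer

    tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")

    # No token_type_ids: two segments are simply joined with the separator token.
    question = "What does windowed attention mean?"
    passage = "Each token attends to half of the window on either side."
    encoded = tokenizer(question + tokenizer.sep_token + passage, return_tensors="pt")

    # config.attention_window can be a List, giving a different window length w per
    # layer; all sizes below are illustrative toy values, not pretrained settings.
    config = LongformerConfig(
        attention_window=[32, 32, 64, 64],  # w for each of the 4 layers (must be even)
        num_hidden_layers=4,
        num_attention_heads=4,
        hidden_size=64,
        intermediate_size=128,
        vocab_size=len(tokenizer),
    )
    model = LongformerModel(config)  # randomly initialized, for illustration only

    # The implementation pads inputs internally to a multiple of the largest window.
    outputs = model(**encoded)
    print(outputs.last_hidden_state.shape)  # (batch_size, sequence length, hidden_size=64)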
2022
- (Cho et al., 2022) ⇒ Sangwoo Cho, Kaiqiang Song, Xiaoyang Wang, Fei Liu, and Dong Yu. (2022). “Toward Unifying Text Segmentation and Long Document Summarization.” doi:10.48550/arXiv.2210.16422
- NOTE: It employs Longformer for token encoding and inter-sentence Transformers, with a Determinantal Point Process (DPP) regularizer for ensuring diversity and relevance in the summaries.
2020
- (Beltagy et al., 2020) ⇒ Iz Beltagy, Matthew E. Peters, and Arman Cohan. (2020). “Longformer: The Long-document Transformer.” In: arXiv preprint arXiv:2004.05150. doi:10.48550/arXiv.2004.05150