Attention Mechanism Computational Complexity Analysis Task
An Attention Mechanism Computational Complexity Analysis Task is a computational complexity analysis task that focuses on analyzing the number of operations and the amount of memory an attention mechanism requires to process input data.
- Context:
- Task Input: an Attention Mechanism (e.g., as part of a Transformer model or a memory-augmented neural network).
- Task Output: an Attention Mechanism Computational Complexity Performance Measure, including the number of operations and the memory requirements.
- It can range from being an Attention Mechanism Time Complexity Analysis Task to being an Attention Mechanism Space Complexity Analysis Task.
- …
- Example(s):
- an Attention Mechanism Time Complexity Analysis Task, such as:
- of the Scaled Dot-Product Attention Mechanism in Transformer models, characterized by [math]\displaystyle{ \mathcal{O}(n^2d) }[/math] time complexity with respect to the input sequence length [math]\displaystyle{ n }[/math] and hidden dimension size [math]\displaystyle{ d }[/math] (see the operation-counting sketch after this example list).
- of the Dilated Attention mechanism in LongNet, characterized by [math]\displaystyle{ \mathcal{O}(Nd) }[/math] time complexity, where [math]\displaystyle{ N }[/math] is the sequence length and [math]\displaystyle{ d }[/math] is the hidden dimension size.
- …
- an Attention Mechanism Space Complexity Analysis Task, such as:
- for a Multi-Head Attention Mechanism with respect to the number of attention heads and the input sequence length.
- for the presence of additional network components, such as residual connections, embeddings, and feed-forward networks.
- …
- …
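The following is a minimal sketch, not taken from any of the cited works, of how the time-complexity and space-complexity analyses above can be made concrete for scaled dot-product attention: it tallies the dominant multiply-add operations (the O(n²d) time term) and the size of the attention score matrices across h heads (the O(hn²) space term). The function names, the NumPy-based reference computation, and the chosen values of n, d, and h are illustrative assumptions rather than an API from the sources.

```python
# A minimal sketch, assuming a single attention layer with h heads of head
# dimension d, that tallies the dominant costs of scaled dot-product attention,
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V. Names are illustrative;
# they do not come from the cited papers.
import numpy as np

def attention_complexity(n: int, d: int, h: int = 1) -> dict:
    """Approximate multiply-add count and score-matrix memory for h heads."""
    qk_scores = n * n * d        # Q (n x d) times K^T (d x n): n^2 * d multiply-adds
    softmax_ops = n * n          # normalizing each of the n^2 score entries
    weighted_sum = n * n * d     # attention weights (n x n) times V (n x d)
    return {
        "time_ops": h * (qk_scores + softmax_ops + weighted_sum),  # ~2hn^2d, i.e. O(n^2 d)
        "score_matrix_floats": h * n * n,                          # O(h n^2) extra memory
    }

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Single-head reference computation matching the counts above."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n, n) score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (n, d) output

if __name__ == "__main__":
    for n in (512, 1024, 2048):   # doubling n roughly quadruples both measures
        print(n, attention_complexity(n, d=64, h=8))
```

Doubling the sequence length roughly quadruples both the operation count and the score-matrix memory, which is the quadratic behaviour that the time- and space-complexity analysis tasks above quantify.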
- Counter-Example(s):
- A DTree Algorithm Complexity Analysis.
- A Computational Complexity Reduction Task for optimizing neural networks.
- A Matrix Multiplication Complexity Analysis Task.
- A Layer Normalization Complexity Analysis Task.
- A Feed-Forward Neural Networks Complexity Analysis Task.
- An Embeddings Complexity Analysis Task.
- A Residual Connections Complexity Analysis Task.
- An Output Projections Complexity Analysis Task.
- …
- See: Computational Complexity Performance Metric.
References
2023
- (Ding, Ma et al., 2023) ⇒ Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, and Furu Wei. (2023). “LongNet: Scaling Transformers to 1,000,000,000 Tokens.” doi:10.48550/arXiv.2307.02486
- QUOTE: ... Another strand of scaling the sequence length is to decrease the complexity of Transformers, i.e., the quadratic complexity of self-attention. Implementing sliding windows or convolution modules over the attention is a straightforward way to make the complexity nearly linear. Nevertheless, this sacrifices the ability to recall the early tokens, forgetting the prompts at the very beginning of the sequence. Sparse attention reduces the computation by sparsifying the attention matrix, preserving the possibility of recalling long-distant information. For example, [CGRS19] obtains O(N√N d) time complexity with a fixed sparse pattern. Besides the heuristic patterns [ZGD+20, BPC20], the learnable patterns prove to be useful for sparse attention [KKL20, ALdJ+23]. There are also some other efficient Transformer-based variants, including low-rank attention [WLK+20, WCL+20], kernel-based methods [KVPF20, CLD+21, QHS+22], downsampling approaches [LLK+19, JGB+21, MKW+21], recurrent models [DYY+19, BKB23], and retrieval-based methods [WRHS22, WDC+23]. Yet, none has been scaled to 1 billion tokens (see Figure 1). ...
- Table 1: Comparison of computation complexity among different methods. N is the sequence length and d is the hidden dimension.
- Recurrent: O(Nd²)
- Vanilla Attention: O(N²d)
- Sparse Attention: O(N√N d)
- Dilated Attention (this work): O(Nd)
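As a rough illustration of the scaling differences summarized in Table 1, the sketch below (not from the LongNet paper) evaluates each leading-order term at sample values of N and d. Constant factors are ignored, so these are scaling estimates rather than measured operation counts, and the chosen N and d values are illustrative assumptions.

```python
# A minimal sketch, not from the LongNet paper, that plugs sample values of N and
# d into the leading-order terms of Table 1 to show how quickly the quadratic
# term of vanilla attention dominates as the sequence length grows.
import math

def leading_terms(N: int, d: int) -> dict:
    """Leading-order operation estimates; constant factors are ignored."""
    return {
        "Recurrent         O(N d^2)":       N * d**2,
        "Vanilla Attention O(N^2 d)":       N**2 * d,
        "Sparse Attention  O(N sqrt(N) d)": N * math.isqrt(N) * d,
        "Dilated Attention O(N d)":         N * d,
    }

if __name__ == "__main__":
    d = 1024                                  # illustrative hidden dimension
    for N in (4_096, 65_536, 1_048_576):      # illustrative sequence lengths
        print(f"N = {N:,}, d = {d}")
        for name, ops in leading_terms(N, d).items():
            print(f"  {name}: {ops:.2e}")
```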