Attention Mechanism
An Attention Mechanism is a neural network component that allows a model to dynamically focus on the parts of its input or of its own internal state (memory) that are most relevant for performing a given task.
- AKA: Neural Attention Model, Neural Network Attention System.
- Context:
- It can (typically) be part of a Neural Network with Attention Mechanism.
- It can (typically) utilize an Attention Pattern Matrix that encodes the pairwise relevance between tokens, allowing the model to selectively focus on different parts of the input when updating each token's representation.
- It can (typically) compute Attention Scores between query vectors (representing the current state) and key vectors (representing the input elements), which are then normalized using a softmax function to obtain attention weights.
- It can (typically) use the computed Attention Weights to take a weighted sum of value vectors, which correspond to the input elements, to obtain a context vector that captures the most relevant information for the current state (see the sketch after this list).
- It can (typically) update its Query Vectors, Key Vectors, and Value Vectors through learnable linear transformations, allowing the model to adapt and learn the most suitable representations for the given task during training.
- It can be described by a Neural Network Attention Function, which mathematically defines how attention scores are computed.
- It can range from being a Local Neural Attention Model to being a Global Neural Attention Model.
- It can range from being a Self-Attention Mechanism (where queries, keys, and values all come from the same sequence) to being a Cross-Attention Mechanism (where queries attend to a different sequence, as in encoder-decoder attention).
- It can range from being an Additive Attention Mechanism (which uses a feed-forward network) to being a Dot Product Attention Mechanism (which computes the dot product between the query and key vectors), based on different scoring methods.
- It can range from being a Deterministic Attention Mechanism (which computes attention weights as a deterministic function of the input) to being a Stochastic Attention Mechanism (which samples attention from a probability distribution).
- It can range from being a Soft Attention Mechanism (which assigns real-valued weights over all input elements) to being a Hard Attention Mechanism (which selects a discrete subset of the input elements).
- It can be an input to an Attention Mechanism Computational Complexity Analysis.
- …
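A minimal NumPy sketch of the query–key–value computation described above. It is illustrative only: the scaled dot-product scoring, function names, and shapes are assumptions, not taken from any specific reference below.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax used to normalize attention scores.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X_q, X_kv, W_q, W_k, W_v):
    """Single-head attention sketch.

    X_q:  (n_queries, d_model)  states that are attending.
    X_kv: (n_keys,    d_model)  states being attended to.
    W_q, W_k, W_v: learnable projection matrices (d_model, d_head).
    """
    Q = X_q @ W_q                       # query vectors
    K = X_kv @ W_k                      # key vectors
    V = X_kv @ W_v                      # value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise relevance (attention pattern matrix)
    weights = softmax(scores, axis=-1)  # attention weights, each row sums to 1
    context = weights @ V               # weighted sum of value vectors (context vectors)
    return context, weights

# Self-attention usage: queries, keys, and values all come from the same sequence.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))            # 5 tokens, model dimension 16
W_q, W_k, W_v = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
context, weights = scaled_dot_product_attention(X, X, W_q, W_k, W_v)
```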
- Example(s):
- Additive and Dot-Product Attention:
- an Additive Attention Mechanism, which computes attention weights by using a feed-forward network with a single hidden layer to combine query and key vectors.
- a Dot Product Attention Mechanism, where attention weights are computed as the dot product between the query and key vectors, often used due to its computational efficiency.
- Task-Specific Attention:
- Encoder-Decoder Attention Mechanism, which is widely used in sequence-to-sequence models for tasks such as machine translation, allowing the decoder to attend over all positions in the input sequence.
- Context-based Attention Mechanism, where the model adjusts its focus based on the surrounding context of a specific input element.
- Stochastic and Hard Attention:
- Hard Stochastic Attention Mechanism, where attention decisions are sampled from a probability distribution, leading to discrete attention focusing.
- Segment-Level and Scaled Attention:
- Segment-Level Attention Mechanism, often used in natural language processing to attend to whole segments or phrases in the input sequence for better semantic understanding.
- Scaled Dot-Product Attention Mechanism, which divides the dot products by the square root of the key dimensionality, improving numerical stability in models with large dimension sizes.
- Deterministic and Sparse Attention:
- a Soft Deterministic Attention Mechanism, which uses a deterministic approach to compute attention weights but allows for a distribution over inputs, balancing between focusing and distributing attention.
- a Block Sparse Attention Mechanism, which introduces sparsity into the attention mechanism by computing attention within blocks or between specific blocks, reducing computational complexity.
- Hierarchical and Multi-Head Attention:
- a Tiered Attention Mechanism, which employs multiple levels of attention, such as focusing first on broader categories and then on more specific details within those categories.
- a Multi-Head Attention Mechanism, which runs several attention mechanisms in parallel, allowing the model to capture different types of relationships in the input data.
- Efficient Attention Mechanisms:
- a Grouped Query Attention (GQA) Mechanism, which combines elements of Multi-Head Attention (MHA) and Multi-Query Attention (MQA) by sharing each key/value head across a group of query heads (see the sketch after this list).
- …
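A minimal NumPy sketch contrasting these head configurations, in which several query heads share one key/value head. All names, shapes, and the grouping scheme are illustrative assumptions, not a specific library's implementation.

```python
import numpy as np

def grouped_query_attention(Q, K, V, n_query_heads, n_kv_heads):
    """Illustrative grouped-query attention.

    Q: (n_query_heads, seq, d_head)  one set of queries per query head.
    K, V: (n_kv_heads, seq, d_head)  fewer key/value heads than query heads.
    With n_kv_heads == n_query_heads this reduces to standard multi-head
    attention; with n_kv_heads == 1 it reduces to multi-query attention.
    """
    assert n_query_heads % n_kv_heads == 0
    group_size = n_query_heads // n_kv_heads
    d_head = Q.shape[-1]
    outputs = []
    for h in range(n_query_heads):
        kv = h // group_size                        # query heads in a group share K/V
        scores = Q[h] @ K[kv].T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ V[kv])
    return np.stack(outputs)                        # (n_query_heads, seq, d_head)

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 6, 32))   # 8 query heads
K = rng.normal(size=(2, 6, 32))   # only 2 key/value heads
V = rng.normal(size=(2, 6, 32))
out = grouped_query_attention(Q, K, V, n_query_heads=8, n_kv_heads=2)
```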
- Counter-Example(s):
- Self-Attention Mechanism without Positional Encoding: A self-attention mechanism that does not incorporate positional information, which may limit its ability to capture sequential or spatial relationships in the input data.
- Uniform Attention Distribution: A mechanism that assigns equal attention weights to all input elements, effectively not focusing on any specific part of the input, which may not be suitable for tasks that require selective attention.
- Static Attention Mechanism: An attention mechanism where the attention weights are fixed and not learned or updated during training, which may not be able to adapt to different input sequences or tasks.
- Single-Head Attention Mechanism: An attention mechanism that uses only one attention head, which may not be able to capture multiple types of relationships or attend to different aspects of the input simultaneously, compared to a Multi-Head Attention Mechanism.
- Attention Mechanism without Query-Key-Value Separation: An attention mechanism that does not separate the input elements into query, key, and value vectors, which may limit its expressiveness and ability to compute complex attention patterns.
- Coverage Mechanism, which prevents the model from attending to the same information repeatedly.
- Gating Mechanism, such as that of a GRU, used to control the flow of information.
- Sequential Memory Cell, such as that of an LSTM Unit, which is designed to remember patterns over time.
- Stacked Memory Cell, where multiple memory cells are stacked to form a deep network.
- See: Transformer Model, Seq2Seq Model with Attention, Attention Alignment, Attention Layer, Attention Map, Attention Mask, Attention Module, Attentional Neural Network, Attentive Neural Network.
References
2018a
- (Brown et al., 2018) ⇒ Andy Brown, Aaron Tuor, Brian Hutchinson, and Nicole Nichols. (2018). “Recurrent Neural Network Attention Mechanisms for Interpretable System Log Anomaly Detection.” In: Proceedings of the First Workshop on Machine Learning for Computing Systems (MLCS'18). ISBN:978-1-4503-5865-1 doi:10.1145/3217871.3217872
- QUOTE: In this work we use dot product attention (Figure 3), wherein an “attention vector” $\mathbf{a}$ is generated from three values: 1) a key matrix $\mathbf{K}$, 2) a value matrix $\mathbf{V}$, and 3) a query vector $\mathbf{q}$. In this formulation, keys are a function of the value matrix:
[math]\displaystyle{ \mathbf{K} = \tanh\left(\mathbf{V}\mathbf{W}^a\right) }[/math] (5)
- parameterized by $\mathbf{W}^a$. The importance of each timestep is determined by the magnitude of the dot product of each key vector with the query vector $\mathbf{q} \in \R^{L_a}$ for some attention dimension hyperparameter, $L_a$. These magnitudes determine the weights, $\mathbf{d}$, on the weighted sum of value vectors, $\mathbf{a}$:
[math]\displaystyle{ \begin{align} \mathbf{d} &= \mathrm{softmax}\left(\mathbf{q}\mathbf{K}^T\right) & (6)\\ \mathbf{a} &= \mathbf{d}\mathbf{V} & (7) \end{align} }[/math]
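A literal reading of Equations (5)–(7) can be sketched in NumPy as follows; this is illustrative only, and the function name and shapes are assumptions, not code from Brown et al. (2018).

```python
import numpy as np

def value_derived_key_attention(V, W_a, q):
    """Sketch of Equations (5)-(7): keys derived from the value matrix.

    V:   (T, d_v)    value vectors, one per timestep.
    W_a: (d_v, L_a)  learnable projection to the attention dimension L_a.
    q:   (L_a,)      query vector.
    """
    K = np.tanh(V @ W_a)          # (5) keys as a function of the values
    scores = q @ K.T              #     dot product of the query with each key
    d = np.exp(scores - scores.max())
    d /= d.sum()                  # (6) attention weights via softmax
    a = d @ V                     # (7) attention vector: weighted sum of values
    return a, d
```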
2018b
- (Yogatama et al., 2018) ⇒ Dani Yogatama, Yishu Miao, Gabor Melis, Wang Ling, Adhiguna Kuncoro, Chris Dyer, and Phil Blunsom. (2018). “Memory Architectures in Recurrent Neural Network Language Models.” In: Proceedings of 6th International Conference on Learning Representations.
- QUOTE: Random access memory. One common approach to retrieve information from the distant past more reliably is to augment the model with a random access memory block via an attention based method. In this model, we consider the previous $K$ states as the memory block, and construct a memory vector $\mathbf{m}_t$ by a weighted combination of these states:[math]\displaystyle{ \mathbf{m}_t = \displaystyle \sum_{i=t−K}^{t−1} a_i\mathbf{h}_i \quad }[/math], where [math]\displaystyle{ \quad a_i \propto \exp\left(\mathbf{w}_{m,i}\mathbf{h}_i + \mathbf{w}_{m,h} \mathbf{h}_t\right) }[/math]
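A minimal NumPy sketch of this memory block, reading the proportionality as a softmax over the previous $K$ states; all names and shapes are assumptions for illustration, not code from Yogatama et al. (2018).

```python
import numpy as np

def random_access_memory(H, w_mi, w_mh, t, K):
    """Sketch of the attention-based memory block quoted above.

    H:    (T, d)  hidden states h_1..h_T (rows).
    w_mi, w_mh: (d,) learnable weight vectors.
    t:    current timestep (0-indexed into H).
    K:    number of previous states kept in the memory block.
    Returns the memory vector m_t as a weighted sum of h_{t-K}..h_{t-1}.
    """
    block = H[max(0, t - K):t]              # previous K states
    logits = block @ w_mi + H[t] @ w_mh     # score for each h_i
    a = np.exp(logits - logits.max())
    a /= a.sum()                            # a_i proportional to exp(...)
    return a @ block                        # m_t
```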
2017a
- (Gupta et al., 2017) ⇒ Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. (2017). “DeepFix: Fixing Common C Language Errors by Deep Learning.” In: Proceedings of AAAI.
- QUOTE: We present an end-to-end solution, called DeepFix, that does not use any external tool to localize or fix errors. We use a compiler only to validate the fixes suggested by DeepFix. At the heart of DeepFix is a multi-layered sequence-to-sequence neural network with attention (Bahdanau, Cho, and Bengio 2014), comprising of an encoder recurrent neural network (RNN) to process the input and a decoder RNN with attention that generates the output. The network is trained to predict an erroneous program location along with the correct statement. DeepFix invokes it iteratively to fix multiple errors in the program one-by-one. (...)
DeepFix uses a simple yet effective iterative strategy to fix multiple errors in a program as shown in Figure 2 (...)
2017b
- (See et al., 2017) ⇒ Abigail See, Peter J. Liu, and Christopher D. Manning. (2017). “Get To The Point: Summarization with Pointer-Generator Networks.” In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). DOI:10.18653/v1/P17-1099.
- QUOTE: The attention distribution $a^t$ is calculated as in Bahdanau et al. (2015):
[math]\displaystyle{ \begin{align} e^t_i &= \nu^T \mathrm{tanh}\left(W_h h_i + W_s s_t + b_{attn}\right) & (1)\\ a^t &= \mathrm{softmax}\left(e^t\right) & (2) \end{align} }[/math]
- where $\nu$, $W_h$, $W_s$ and $b_{attn}$ are learnable parameters. The attention distribution can be viewed as a probability distribution over the source words, that tells the decoder where to look to produce the next word. Next, the attention distribution is used to produce a weighted sum of the encoder hidden states, known as the context vector $h^*_t$
[math]\displaystyle{ h^*_t = \displaystyle\sum_i a^t_i h_i }[/math] (3)
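Equations (1)–(3) describe additive (Bahdanau-style) attention; the following NumPy sketch is one illustrative reading with hypothetical names and shapes, not code from See et al. (2017).

```python
import numpy as np

def additive_attention(h_enc, s_t, W_h, W_s, b_attn, v):
    """Sketch of Equations (1)-(3): additive attention over encoder states.

    h_enc: (T, d_h)   encoder hidden states h_i.
    s_t:   (d_s,)     decoder state at step t.
    W_h:   (d_attn, d_h), W_s: (d_attn, d_s), b_attn: (d_attn,), v: (d_attn,)
    """
    e = np.tanh(h_enc @ W_h.T + W_s @ s_t + b_attn) @ v   # (1) scores e^t_i
    a = np.exp(e - e.max())
    a /= a.sum()                                          # (2) attention distribution a^t
    h_star = a @ h_enc                                    # (3) context vector h*_t
    return h_star, a
```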
2017c
- (Synced Review, 2017) ⇒ Synced (2017). “A Brief Overview of Attention Mechanism.” In: Medium - Synced Review Blog Post.
- QUOTE: And to build context vector is fairly simple. For a fixed target word, first, we loop over all encoders' states to compare target and source states to generate scores for each state in encoders. Then we could use softmax to normalize all scores, which generates the probability distribution conditioned on target states. At last, the weights are introduced to make context vector easy to train. That’s it. Math is shown below:
[math]\displaystyle{ \begin{align} \alpha_{ts} &=\dfrac{\exp\left(\mathrm{score}\left(\mathbf{h}_t,\mathbf{\overline{h}}_s\right)\right)}{\displaystyle\sum_{s'=1}^S \exp\left(\mathrm{score}\left(\mathbf{h}_t,\mathbf{\overline{h}}_{s'}\right)\right)} && \text{[Attention weights]} \quad (1)\\ \mathbf{c}_t &=\displaystyle\sum_s \alpha_{ts}\mathbf{\overline{h}}_s && \text{[Context vector]} \quad (2)\\ \mathbf{a}_t &=f\left(\mathbf{c}_t,\mathbf{h}_t\right) = \mathrm{tanh}\left(\mathbf{W}_c\big[\mathbf{c}_t; \mathbf{h}_t\big]\right) && \text{[Attention vector]} \quad (3) \end{align} }[/math]
- To understand the seemingly complicated math, we need to keep three key points in mind:
- 1. During decoding, context vectors are computed for every output word. So we will have a 2D matrix whose size is # of target words multiplied by # of source words. Equation (1) demonstrates how to compute a single value given one target word and a set of source words.
- 2. Once context vector is computed, attention vector could be computed by context vector, target word, and attention function $f$.
- 3. We need attention mechanism to be trainable. According to equation (4), both styles offer the trainable weights (W in Luong’s, W1 and W2 in Bahdanau’s). Thus, different styles may result in different performance.
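One hedged way to read the quoted Equations (1)–(3) in code, assuming a simple dot-product score (one of the Luong-style options), is the NumPy sketch below; all names and shapes are illustrative assumptions.

```python
import numpy as np

def luong_style_attention(h_t, H_src, W_c):
    """Sketch of Equations (1)-(3) with a dot-product score.

    h_t:   (d,)     current target (decoder) state.
    H_src: (S, d)   source (encoder) states h_bar_s.
    W_c:   (d, 2d)  learnable projection for the attention vector.
    """
    scores = H_src @ h_t                          # score(h_t, h_bar_s) for each s
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                          # (1) attention weights alpha_ts
    c_t = alpha @ H_src                           # (2) context vector c_t
    a_t = np.tanh(W_c @ np.concatenate([c_t, h_t]))  # (3) attention vector a_t
    return a_t, alpha, c_t
```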
2017d
2016a
2016c
2015a
(...)
2015b
2015c
2015d