Attention Mechanism


An Attention Mechanism is a neural network component within a memory-augmented neural network that allows the model to dynamically focus on the parts of its input or its own internal state (memory) that are most relevant for performing a given task.



References

2018a

[math]\displaystyle{ \mathbf{K} = \tanh\left(\mathbf{VW}^a\right) }[/math] (5)

parameterized by $\mathbf{W}^a$. The importance of each timestep is determined by the magnitude of the dot product of each key vector with the query vector $\mathbf{q} \in \mathbb{R}^{L_a}$, for some attention dimension hyperparameter $L_a$. These magnitudes determine the weights, $\mathbf{d}$, on the weighted sum of value vectors, $\mathbf{a}$:

[math]\displaystyle{ \begin{align} \mathbf{d} &= \mathrm{softmax}\left(\mathbf{q}\mathbf{K}^T \right) && (6) \\ \mathbf{a} &= \mathbf{d}\mathbf{V} && (7) \end{align} }[/math]
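The key-value attention of Eqs. (5)-(7) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code; the function name, array names, and toy shapes are assumptions chosen to mirror the notation above.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def key_value_attention(V, W_a, q):
    """Key-value attention following Eqs. (5)-(7).

    V   : (T, d)    value vectors, one per timestep
    W_a : (d, L_a)  learned projection to the attention dimension
    q   : (L_a,)    query vector
    """
    K = np.tanh(V @ W_a)   # Eq. (5): key vectors, shape (T, L_a)
    d = softmax(q @ K.T)   # Eq. (6): weights over timesteps, shape (T,)
    a = d @ V              # Eq. (7): weighted sum of value vectors, shape (d,)
    return a, d

# toy usage with random inputs
rng = np.random.default_rng(0)
T, dim, L_a = 5, 8, 4
a, d = key_value_attention(rng.normal(size=(T, dim)),
                           rng.normal(size=(dim, L_a)),
                           rng.normal(size=(L_a,)))
```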

2018b

2017a

2017 DeepFixFixingCommonCLanguageErr Fig2.png
Figure 2: The iterative repair strategy of DeepFix

2017b

[math]\displaystyle{ \begin{align} e^t_i &= \nu^T \mathrm{tanh}\left(W_h h_i + W_s s_t + b_{attn}\right) && (1) \\ a^t &= \mathrm{softmax}\left(e^t \right) && (2) \end{align} }[/math]

where $\nu$, $W_h$, $W_s$ and $b_{attn}$ are learnable parameters. The attention distribution can be viewed as a probability distribution over the source words that tells the decoder where to look to produce the next word. Next, the attention distribution is used to produce a weighted sum of the encoder hidden states, known as the context vector $h^*_t$:

[math]\displaystyle{ h^*_t = \displaystyle\sum_i a^t_i h_i }[/math] (3)
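The computation in Eqs. (1)-(3) can be illustrated with a short NumPy sketch. It is not the implementation from the paper; the function and argument names (`additive_attention`, `H`, `s_t`) and the shapes are assumptions made to match the notation above.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def additive_attention(H, s_t, W_h, W_s, b_attn, v):
    """Attention distribution and context vector, Eqs. (1)-(3).

    H      : (T, d_h)  encoder hidden states h_i
    s_t    : (d_s,)    decoder state at step t
    W_h    : (d_a, d_h), W_s : (d_a, d_s), b_attn : (d_a,), v : (d_a,)  learned parameters
    """
    e_t = np.tanh(H @ W_h.T + W_s @ s_t + b_attn) @ v  # Eq. (1): scores e^t_i, shape (T,)
    a_t = softmax(e_t)                                 # Eq. (2): attention distribution
    h_star = a_t @ H                                   # Eq. (3): context vector h*_t
    return h_star, a_t
```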

2017 GetToThePointSummarizationwithP Fig2.png
Figure 2: Baseline sequence-to-sequence model with attention. The model may attend to relevant words in the source text to generate novel words, e.g., to produce the novel word "beat" in the abstractive summary "Germany beat Argentina 2-0", the model may attend to the words "victorious" and "win" in the source text.

2017c

[math]\displaystyle{ \begin{align} \alpha_{ts} &=\dfrac{\exp\left(\mathrm{score}\left(\mathbf{h}_t,\mathbf{\overline{h}}_s\right)\right)}{\displaystyle\sum_{s'=1}^S \exp\left(\mathrm{score}\left(\mathbf{h}_t,\mathbf{\overline{h}}_{s'}\right)\right)} && \text{[Attention weights]} \quad (1) \\ \mathbf{c}_t &=\displaystyle\sum_s \alpha_{ts}\mathbf{\overline{h}}_s && \text{[Context vector]} \quad (2) \\ \mathbf{a}_t &=f\left(\mathbf{c}_t,\mathbf{h}_t\right) = \mathrm{tanh}\left(\mathbf{W}_c\big[\mathbf{c}_t; \mathbf{h}_t\big]\right) && \text{[Attention vector]} \quad (3) \end{align} }[/math]
To understand the seemingly complicated math, we need to keep three key points in mind:
1. During decoding, context vectors are computed for every output word, so we end up with a 2D matrix whose size is the number of target words multiplied by the number of source words. Equation (1) demonstrates how to compute a single attention weight given one target word and the set of source words.
2. Once the context vector is computed, the attention vector can be computed from the context vector, the current target hidden state, and the attention function $f$.
3. We need the attention mechanism to be trainable. As equation (4) shows, both styles offer trainable weights ($W$ in Luong's, $W_1$ and $W_2$ in Bahdanau's), so different styles may result in different performance (see the sketch below).
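Assuming the simple dot-product score from Luong et al. (one of the scoring options behind equation (4)), one global attention step of equations (1)-(3) might look like the following NumPy sketch; the names (`luong_global_attention`, `H_bar`, `W_c`) are illustrative and not taken from the source.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def luong_global_attention(h_t, H_bar, W_c):
    """One global (Luong-style) attention step, equations (1)-(3).

    h_t   : (d,)     current target hidden state
    H_bar : (S, d)   source hidden states
    W_c   : (d, 2d)  learned projection for the attention vector
    """
    # Eq. (1) with the dot score: score(h_t, h_bar_s) = h_t . h_bar_s, then softmax over s
    alpha_t = softmax(H_bar @ h_t)
    # Eq. (2): context vector = weighted average of source states
    c_t = alpha_t @ H_bar
    # Eq. (3): attention vector a_t = tanh(W_c [c_t; h_t])
    a_t = np.tanh(W_c @ np.concatenate([c_t, h_t]))
    return a_t, alpha_t
```

Running this for every target position stacks the `alpha_t` rows into exactly the (# of target words) by (# of source words) matrix described in point 1 above.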
2017d
2017 AttentionisallYouNeed Fig2.png
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix [math]\displaystyle{ Q }[/math]. The keys and values are also packed together into matrices [math]\displaystyle{ K }[/math] and [math]\displaystyle{ V }[/math]. We compute the matrix of outputs as:

[math]\displaystyle{ \mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V }[/math] (1)
The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of [math]\displaystyle{ 1/\sqrt{d_k} }[/math]. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
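As a compact illustration of equation (1) in matrix form (not the reference implementation; the helper names are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Eq. (1): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q : (n_q, d_k) queries packed as rows
    K : (n_k, d_k) keys
    V : (n_k, d_v) values
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) compatibility scores
    weights = softmax(scores, axis=-1)  # each query's distribution over the keys
    return weights @ V                  # (n_q, d_v) outputs
```

The single matrix product `Q @ K.T` is what makes dot-product attention fast and space-efficient in practice, compared with an additive score computed by a feed-forward network.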

2016a

2016 HierarchicalAttentionNetworksfo Fig2.png
Figure 2: Hierarchical Attention Network.

2016c

2016 BidirectionalRecurrentNeuralNet Fig1.png
Figure 1: Description of the model predicting punctuation $y_t$ at time step $t$ for the slot before the current input word $x_t$.

2015a

[math]\displaystyle{ c_i = \displaystyle\sum^{T_x}_{j=1} \alpha_{ij}h_j }[/math] (5)

(...)
The probability $\alpha_{ij}$, or its associated energy $e_{ij}$, reflects the importance of the annotation $h_j$ with respect to the previous hidden state $s_{i-1}$ in deciding the next state $s_i$ and generating $y_i$. Intuitively, this implements a mechanism of attention in the decoder. The decoder decides which parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.
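For illustration, the sketch below normalizes a matrix of energies $e_{ij}$ into the probabilities $\alpha_{ij}$ with a row-wise softmax, as defined in the paper, and then forms the context vectors of equation (5); the function name and array layout are assumptions made for this example.

```python
import numpy as np

def context_vectors(E, H):
    """Context vectors c_i = sum_j alpha_ij h_j (equation (5)).

    E : (T_y, T_x)  energies e_ij, one row per decoder step i
    H : (T_x, d)    annotations h_j
    """
    # alpha_ij = exp(e_ij) / sum_k exp(e_ik): a distribution over source positions
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    # weighted sums of annotations, one context vector per decoder step
    return A @ H   # (T_y, d)
```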

2015b

2015 EffectiveApproachestoAttentionbr Fig2.png
Figure 2: Global attentional model - at each time step $t$, the model infers a variable-length alignment weight vector $\mathbf{a}_t$ based on the current target state $\mathbf{h}_t$ and all source states $\mathbf{\overline{h}}_s$. A global context vector $\mathbf{c}_t$ is then computed as the weighted average, according to $\mathbf{a}_t$, over all the source states.

2015 EffectiveApproachestoAttentionbr Fig3.png
Figure 3: Local attention model - the model first predicts a single aligned position $p_t$ for the current target word. A window centered around the source position $p_t$ is then used to compute a context vector $\mathbf{c}_t$, a weighted average of the source hidden states in the window. The weights $\mathbf{a}_t$ are inferred from the current target state $\mathbf{h}_t$ and those source states $\mathbf{\overline{h}}_s$ in the window.
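A rough sketch of the local-attention context computation described in Figure 3, assuming a simple dot-product score over the window and omitting the paper's position-prediction network and Gaussian weighting; all names are illustrative.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def local_attention_context(h_t, H_bar, p_t, D):
    """Context vector from a window of source states centered on p_t.

    h_t   : (d,)    current target hidden state
    H_bar : (S, d)  source hidden states
    p_t   : int     predicted aligned source position for the current target word
    D     : int     half-width of the attention window
    """
    S = H_bar.shape[0]
    lo, hi = max(0, p_t - D), min(S, p_t + D + 1)  # window [p_t - D, p_t + D], clipped
    window = H_bar[lo:hi]
    a_t = softmax(window @ h_t)  # weights over the windowed source states only
    c_t = a_t @ window           # weighted average within the window
    return c_t, a_t
```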


2015c

2015d