Deterministic Attention Mechanism
A Deterministic Attention Mechanism is an Attention Mechanism that computes its context vector deterministically, rather than by sampling attention locations as in a stochastic (probabilistic) attention mechanism.
- Example(s):
- the deterministic attention mechanism for sequence-to-sequence constituent parsing in (Ma et al., 2017),
- the deterministic "soft" attention model for image caption generation in (Xu et al., 2015).
- Counter-Example(s):
- a Stochastic Attention Mechanism, i.e. a probabilistic, sampling-based attention mechanism.
- See: Hierarchical Attention Network, Gated Convolutional Neural Network with Segment-Level Attention Mechanism, Sequential Memory, Bidirectional Recurrent Neural Network with Attention Mechanism, Stack Memory, Neural Machine Translation (NMT), LSTM Network, Dynamic Control Problem, Speech Recognition Task, Image Caption Generation Task, Transduction Task.
References
2017
- (Ma et al., 2017) ⇒ Chunpeng Ma, Lemao Liu, Akihiro Tamura, Tiejun Zhao, and Eiichiro Sumita. (2017). "Deterministic Attention for Sequence-to-Sequence Constituent Parsing." In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI 2017), pp. 3237-3243.
- QUOTE: To address the problems of the probabilistic attention mechanism, we used a novel method to calculate the context vector:
$c_{i}=\sum_{j \in \mathcal{D}_{i}} \mathbf{A}_{m} h_{j}$ (6)
- Here, $\mathcal{D}_{i}$ is a list saving the indices of the words that should be paid attention to at time step $i$ while generating the target-side sequence. $\mathbf{A}_{m}$ is a deterministic alignment matrix with the shape $dim(c) \times dim(h)$, where $dim(c)$ and $dim(h)$ are the dimensions of any $c_i$-s and any $h_j$-s, respectively, with $1 \leq m \leq |\mathcal{D}_{i}|$ denoting the index of the parameter.
Compared with the probabilistic attention mechanism, our deterministic attention mechanism has the following characteristics:
- Instead of calculating the context vector based on all source-side words, it deterministically selects a list of indices of words, i.e. $\mathcal{D}_{i}$, where most of the obviously unrelated source-side words are filtered. This allows the model to focus on the most important words, both improving the decoding accuracy and shortening the decoding time.
- Unlike the $\alpha_{ij}$ (scalars) used in the probabilistic attention model, the parameters $\mathbf{A}_{m}$ (matrices) are not valid probabilities, allowing the parameters to be adjusted more flexibly. Also, the use of matrices rather than scalars as the parameters significantly increases the capacity of the model.
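The deterministic context-vector computation in Eq. (6) can be illustrated with a minimal NumPy sketch. The dimensions, the hand-picked index list $\mathcal{D}_i$, and the randomly initialized alignment matrices $\mathbf{A}_m$ below are illustrative assumptions only; in the paper the $\mathbf{A}_m$ are learned parameters and $\mathcal{D}_i$ is derived deterministically from the parsing state.

```python
import numpy as np

# Toy sizes (assumptions for illustration only).
dim_h, dim_c = 4, 3              # dim(h): encoder state size, dim(c): context size
h = np.random.randn(10, dim_h)   # source-side hidden states h_1 ... h_10

# D_i: indices of the source-side words attended to at target time step i
# (hand-picked here; derived deterministically in the actual parser).
D_i = [2, 5, 7]

# One deterministic alignment matrix A_m per position m in D_i,
# each of shape dim(c) x dim(h) (random here; learned in the model).
A = [np.random.randn(dim_c, dim_h) for _ in range(len(D_i))]

# Eq. (6): c_i = sum over m of A_m h_{j_m}, where j_m is the m-th index in D_i.
c_i = sum(A[m] @ h[j] for m, j in enumerate(D_i))
print(c_i.shape)                 # (dim_c,) == (3,)
```

Because the indices in $\mathcal{D}_i$ are selected deterministically, no attention weights are normalized or sampled, and only the selected source-side states contribute to $c_i$.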
2015
- (Xu et al., 2015) ⇒ Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. (2015). “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Volume 37.
- QUOTE: Learning stochastic attention requires sampling the attention location $s_t$ each time, instead we can take the expectation of the context vector $\mathbf{\hat{z}}_t$ directly,
$\displaystyle \mathbb{E}_{p\left(s_t\vert a\right)}\big[\mathbf{\hat{z}}_t\big]=\sum_{i=1}^L\alpha_{t,i}\mathbf{a}_i$ (8)
- and formulate a deterministic attention model by computing a soft attention weighted annotation vector $\phi\left(\{\mathbf{a}_i\},\{\alpha_i\}\right)=\sum_{i=1}^L\alpha_i\mathbf{a}_i$ as proposed by Bahdanau et al. (2014). This corresponds to feeding in a soft $\alpha$ weighted context into the system. The whole model is smooth and differentiable under the deterministic attention, so learning end-to-end is trivial by using standard back-propagation.
Learning the deterministic attention can also be understood as approximately optimizing the marginal likelihood in Eq. (5) under the attention location random variable $s_t$ from Sec. 4.1.
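The deterministic (soft) attention of Eq. (8) can likewise be sketched in a few lines of NumPy. The sizes and the softmax over random scores below are illustrative assumptions standing in for the learned attention network that, in the full caption-generation model, conditions on the annotations and the previous decoder state.

```python
import numpy as np

# Toy sizes (assumptions for illustration only).
L, D = 5, 6
a = np.random.randn(L, D)        # annotation vectors a_1 ... a_L

# Attention weights alpha_{t,i} for decoding step t; here a softmax over
# random scores stands in for the model's learned attention network.
scores = np.random.randn(L)
alpha = np.exp(scores) / np.exp(scores).sum()

# Eq. (8): E_{p(s_t|a)}[z_hat_t] = sum_i alpha_{t,i} a_i  (soft context vector).
z_hat_t = alpha @ a
print(z_hat_t.shape)             # (D,)
```

Because $\mathbf{\hat{z}}_t$ is a smooth function of the attention weights, gradients flow through it directly, which is what makes end-to-end training with standard back-propagation possible.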