Stochastic Attention Mechanism
A Stochastic Attention Mechanism is an Attention Mechanism in which the attended input location is treated as a latent random variable and sampled from a probability distribution (typically parameterized by the attention weights), rather than computed as a deterministic weighted average (see the illustrative sketch after the context list below).
- AKA: Probabilistic Attention Mechanism.
- Example(s):
  - the stochastic "hard" attention mechanism of Show, Attend and Tell (Xu et al., 2015), which samples the attended image location at each decoding time step.
- Counter-Example(s):
  - a Deterministic (Soft) Attention Mechanism, which computes a weighted average of the annotation vectors.
- See: Gating Mechanism, Neural Network with Attention Mechanism, Coverage Mechanism, Sequential Memory Cell, LSTM Unit, Stacked Memory Cell, Hierarchical Attention Network, Gated Convolutional Neural Network with Segment-Level Attention Mechanism, Bidirectional Recurrent Neural Network with Attention Mechanism, LSTM Network.
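The following is a minimal, illustrative NumPy sketch (not taken from the cited paper's code) contrasting a stochastic (sampled) attention read with a deterministic (soft) one; names such as `annotations` and `attn_weights` are hypothetical placeholders.
```python
# Illustrative sketch only: stochastic ("hard") vs. deterministic ("soft") attention.
import numpy as np

rng = np.random.default_rng(0)

L, D = 196, 512                        # number of annotation vectors, feature size (assumed)
annotations = rng.normal(size=(L, D))  # a_i: one feature vector per image location

scores = rng.normal(size=L)            # unnormalized attention scores
attn_weights = np.exp(scores - scores.max())
attn_weights /= attn_weights.sum()     # alpha_i: sums to 1 over locations

# Stochastic ("hard") attention: treat the attended location as a latent
# random variable and sample it from a categorical distribution over alpha.
location = rng.choice(L, p=attn_weights)
z_stochastic = annotations[location]

# Deterministic ("soft") attention: expected context under the same distribution.
z_soft = attn_weights @ annotations

print(z_stochastic.shape, z_soft.shape)  # (512,) (512,)
```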
References
2015
- (Xu et al., 2015) ⇒ Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. (2015). “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention". In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Volume 37.
- QUOTE: In training the deterministic version of our model, we introduce a form of doubly stochastic regularization that encourages the model to pay equal attention to every part of the image. Whereas the attention at every point in time sums to $1$ by construction (i.e. $\sum_i\alpha_{ti}=1$), the attention $\sum_t\alpha_{ti}$ is not constrained in any way. This makes it possible for the decoder to ignore some parts of the input image. In order to alleviate this, we encourage $\sum_t\alpha_{ti} \approx \tau$ where $\tau \geq \frac{L}{D}$. In our experiments, we observed that this penalty quantitatively improves overall performance and that this qualitatively leads to more descriptive captions.
Additionally, the soft attention model predicts a gating scalar $\beta$ from the previous hidden state $\mathbf{h}_{t-1}$ at each time step $t$, such that, $\phi\left(\{\mathbf{a}_i\},\{\alpha_i\}\right) = \beta\sum_i^L\alpha_{i}\mathbf{a}_i$, where $\beta_t = \sigma\left(f_\beta\left(\mathbf{h}_{t-1}\right)\right)$. This gating variable lets the decoder decide whether to put more emphasis on language modeling or on the context at each time step. Qualitatively, we observe that the gating variable is larger when the decoder describes an object in the image.
The soft attention model is trained end-to-end by minimizing the following penalized negative log-likelihood:
[math]\displaystyle{ L_d=-\log\left(p\left(\mathbf{y}\vert \mathbf{a}\right)\right)+\lambda \sum_i^L\left(1- \sum_t^C\alpha_{ti}\right)^2 }[/math]   (9)
- where we simply fixed $\tau$ to $1$.
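The following is a minimal NumPy sketch of the quoted penalty and gating terms (illustrative only; the paper trains these quantities inside a full encoder-decoder, which is omitted here). The names `alphas`, `neg_log_lik`, `W_beta`, and `h_prev` are hypothetical stand-ins.
```python
# Sketch of the doubly stochastic penalty (Eq. 9 with tau fixed to 1) and the gating scalar beta_t.
import numpy as np

rng = np.random.default_rng(0)

C, L, D, H = 10, 196, 512, 1024              # caption length, locations, feature dim, hidden dim (assumed)
alphas = rng.dirichlet(np.ones(L), size=C)   # alpha_ti: each row sums to 1 over locations
annotations = rng.normal(size=(L, D))        # a_i

# Doubly stochastic regularization: encourage each location i to receive
# total attention ~1 summed over all time steps t.
lam = 1.0
penalty = lam * np.sum((1.0 - alphas.sum(axis=0)) ** 2)

neg_log_lik = 42.0                           # stand-in for -log p(y | a) from the decoder
loss = neg_log_lik + penalty                 # penalized negative log-likelihood L_d

# Gating scalar beta_t = sigmoid(f_beta(h_{t-1})); here f_beta is a linear map.
h_prev = rng.normal(size=H)
W_beta = rng.normal(size=H) * 0.01
beta_t = 1.0 / (1.0 + np.exp(-W_beta @ h_prev))

# Gated soft context vector: phi({a_i}, {alpha_i}) = beta_t * sum_i alpha_ti a_i
context_t = beta_t * (alphas[0] @ annotations)

print(round(float(loss), 3), round(float(beta_t), 3), context_t.shape)
```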