Attention Module
An Attention Module is a neural network module that uses an alignment score function to amplify some parts of the input data while diminishing other parts.
- Context:
- It can range from being a Self-Attention Module or a Hard Attention Module to being a Soft Attention Module.
- …
- Example(s):
- Scaled Dot-Product-based Attention.
- Query-Key-Value-based Attention, composed of a Query Matrix (Q), a Key Matrix (K), and a Value Matrix (V) (see the sketch after this list).
- Hard Attention, Soft Attention, Self-Attention, Cross-Attention, Luong Attention, and Bahdanau Attention.
- …
- Counter-Example(s):
- See: Attention Mechanism, Transformer-based NNet, Differentiable Neural Computer.
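To make the examples above concrete, here is a minimal NumPy sketch of scaled dot-product query-key-value attention; the array names, shapes, and single-head simplification are illustrative assumptions, not any particular library's API:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # alignment scores, shape (n_q, n_k)
    weights = softmax(scores, axis=-1)   # soft weights over the n_k values
    return weights @ V                   # weighted sum of values, (n_q, d_v)

# Toy usage: 3 queries attending over 5 key-value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```

The softmax turns the alignment scores into non-negative weights that sum to 1 per query, which is what lets the module amplify some parts of the input while diminishing others.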
References
2022
- (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Attention_(machine_learning) Retrieved:2022-4-24.
- In neural networks, attention is a technique that mimics cognitive attention. The effect enhances some parts of the input data while diminishing other parts — the thought being that the network should devote more focus to that small but important part of the data. Learning which part of the data is more important than others depends on the context and is trained by gradient descent.
Attention-like mechanisms were introduced in the 1990s under names like multiplicative modules, sigma pi units, and hypernetworks.[1] Its flexibility comes from its role as "soft weights" that can change during runtime, in contrast to standard weights that must remain fixed at runtime. Uses of attention include memory in neural turing machines, reasoning tasks in differentiable neural computers,[2] language processing in transformers, and multi-sensory data processing (sound, images, video, and text) in perceivers.
- ↑ Yann LeCun (2020). Deep Learning course at NYU, Spring 2020, video lecture Week 6. Event occurs at 53:00. Retrieved 2022-03-08.
- ↑ Graves, Alex; Wayne, Greg; Reynolds, Malcolm; Harley, Tim; Danihelka, Ivo; Grabska-Barwińska, Agnieszka; Colmenarejo, Sergio Gómez; Grefenstette, Edward; Ramalho, Tiago; Agapiou, John; Badia, Adrià Puigdomènech; Hermann, Karl Moritz; Zwols, Yori; Ostrovski, Georg; Cain, Adam; King, Helen; Summerfield, Christopher; Blunsom, Phil; Kavukcuoglu, Koray; Hassabis, Demis (2016-10-12). "Hybrid computing using a neural network with dynamic external memory". Nature. 538 (7626): 471–476. Bibcode:2016Natur.538..471G. doi:10.1038/nature20101. ISSN 1476-4687. PMID 27732574. S2CID 205251479.
2022
- (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Attention_(machine_learning)#Variants Retrieved:2022-4-24.
- There are many variants of attention: dot-product, query-key-value, hard, soft, self, cross, Luong, and Bahdanau, to name a few. These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients.
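As a concrete illustration of that re-weighting, here is a small NumPy sketch (array names and shapes are illustrative assumptions): a single decoder state scores each encoder state with a dot product, and the softmaxed scores recombine the encoder states into a context vector.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Encoder-decoder (cross) attention: one target-side state re-weights
# the encoder states via dot-product alignment scores.
rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(6, 8))   # 6 source positions, dim 8
decoder_state = rng.normal(size=(8,))      # current target position

scores = encoder_states @ decoder_state    # one dot product per source position
weights = softmax(scores)                  # re-weighting coefficients, sum to 1
context = weights @ encoder_states         # recombined encoder-side input
print(weights.round(2), context.shape)     # (8,)
```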
2018
- (Weng, 2018) ⇒ https://lilianweng.github.io/posts/2018-06-24-attention/
- QUOTE: Below is a summary table of several popular attention mechanisms and corresponding alignment score functions:
Name | Alignment score function | Citation |
---|---|---|
Content-base attention | $\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \text{cosine}[\boldsymbol{s}_t, \boldsymbol{h}_i]$ | Graves2014 |
Additive(*) | $\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \mathbf{v}_a^\top \tanh(\mathbf{W}_a [\boldsymbol{s}_t; \boldsymbol{h}_i])$ | Bahdanau2015 |
Location-Base | $\alpha_{t,i} = \text{softmax}(\mathbf{W}_a \boldsymbol{s}_t)$; Note: this simplifies the softmax alignment to only depend on the target position. | Luong2015 |
General | $\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \boldsymbol{s}_t^\top \mathbf{W}_a \boldsymbol{h}_i$, where $\mathbf{W}_a$ is a trainable weight matrix in the attention layer. | Luong2015 |
Dot-Product | $\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \boldsymbol{s}_t^\top \boldsymbol{h}_i$ | Luong2015 |
Scaled Dot-Product(^) | $\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \frac{\boldsymbol{s}_t^\top \boldsymbol{h}_i}{\sqrt{n}}$; Note: very similar to the dot-product attention except for a scaling factor, where $n$ is the dimension of the source hidden state. | Vaswani2017 |

Here $\boldsymbol{s}_t$ denotes the target (decoder) hidden state at step $t$ and $\boldsymbol{h}_i$ the $i$-th source (encoder) hidden state.

(*) Referred to as “concat” in Luong, et al., 2015 and as “additive attention” in Vaswani, et al., 2017.
(^) It adds a scaling factor $1/\sqrt{n}$, motivated by the concern that when the input is large, the softmax function may have an extremely small gradient, making learning difficult.
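To ground the table, here is a minimal NumPy sketch of the score functions above; the variable names mirror the table, and the randomly initialized $\mathbf{W}_a$ and $\mathbf{v}_a$ stand in for parameters that would normally be learned:

```python
import numpy as np

def content_based(s_t, h_i):
    # Cosine similarity between target and source states (Graves2014).
    return (s_t @ h_i) / (np.linalg.norm(s_t) * np.linalg.norm(h_i))

def additive(s_t, h_i, W_a, v_a):
    # v_a^T tanh(W_a [s_t; h_i])  (Bahdanau2015).
    return v_a @ np.tanh(W_a @ np.concatenate([s_t, h_i]))

def general(s_t, h_i, W_a):
    # s_t^T W_a h_i  (Luong2015).
    return s_t @ W_a @ h_i

def dot_product(s_t, h_i):
    # s_t^T h_i  (Luong2015).
    return s_t @ h_i

def scaled_dot_product(s_t, h_i):
    # Dot product scaled by sqrt(n), n = source state dimension (Vaswani2017).
    return s_t @ h_i / np.sqrt(h_i.shape[-1])

# Toy check with random states and parameters.
rng = np.random.default_rng(2)
s_t, h_i = rng.normal(size=8), rng.normal(size=8)
W_a, v_a = rng.normal(size=(4, 16)), rng.normal(size=4)
print(content_based(s_t, h_i), additive(s_t, h_i, W_a, v_a),
      dot_product(s_t, h_i), scaled_dot_product(s_t, h_i))
```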
Here is a summary of broader categories of attention mechanisms:

Name | Definition | Citation |
---|---|---|
Self-Attention(&) | Relating different positions of the same input sequence. Theoretically, self-attention can adopt any of the score functions above, just replacing the target sequence with the same input sequence. | Cheng2016 |
Global/Soft | Attending to the entire input state space. | Xu2015 |
Local/Hard | Attending to part of the input state space, i.e. a patch of the input image. | Xu2015; Luong2015 |

(&) Also referred to as “intra-attention” in Cheng et al., 2016 and some other papers.
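As one concrete instance of the first row, here is a short NumPy sketch of self-attention using the scaled dot-product score, where a sequence attends over its own positions; the shapes and the single-head, no-projection simplification are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Each position of X scores every position of the same sequence,
    # i.e. queries, keys, and values all come from X itself.
    d = X.shape[-1]
    weights = softmax(X @ X.T / np.sqrt(d), axis=-1)  # (seq_len, seq_len)
    return weights @ X                                # contextualized states

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 8))      # a sequence of 5 positions, dim 8
print(self_attention(X).shape)   # (5, 8)
```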