Positional Encoding Mechanism
A Positional Encoding Mechanism is a neural architecture pattern that injects information about the absolute or relative position of tokens in a sequence into their representations (typically by adding fixed sinusoidal or learned position vectors to the input embeddings), so that order-agnostic models such as the Transformer can make use of sequence order.
References
2023
- (chat, 2023)
- ... Three key innovations presented in the paper “Attention is All You Need” by Vaswani et al. are:
- ... Positional Encoding: Since the Transformer model does not have any inherent sense of the position of tokens in a sequence, the authors introduced positional encoding to inject information about the position of tokens in the input sequence. Positional encoding is added to the input embeddings before being processed by the self-attention layers, allowing the model to learn and use positional information when making predictions. The authors used a sinusoidal function to generate the positional encodings, ensuring that the encodings can be easily extended to varying sequence lengths.
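The following minimal NumPy sketch (an illustration, not code from the paper or from the chat response above; the function name and array shapes are assumptions) shows the step just described: precomputed positional encodings are added element-wise to the token embeddings before they are passed to the self-attention layers.

```python
import numpy as np

def add_positional_information(token_embeddings: np.ndarray,
                               positional_encodings: np.ndarray) -> np.ndarray:
    """Inject position information by element-wise addition.

    token_embeddings:     (sequence_length, d_model) content vectors
    positional_encodings: (max_length, d_model) precomputed position vectors
    """
    seq_len = token_embeddings.shape[0]
    # Only the first seq_len rows of the encoding table are needed.
    return token_embeddings + positional_encodings[:seq_len]
```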
2019
- (Dai et al., 2019) ⇒ Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. (2019). “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.” In: CoRR, arXiv:1901.02860.
- QUOTE: ... While we found the idea presented in the previous subsection very appealing, there is a crucial technical challenge we haven’t solved in order to reuse the hidden states. That is, how can we keep the positional information coherent when we reuse the states? Recall that, in the standard Transformer, the information of sequence order is provided by a set of positional encodings, denoted as $\mathbf{U} \in \mathbb{R}^{L_{max}\times d}$, where the $i$-th row $\mathbf{U}_i$ corresponds to the $i$-th absolute position within a segment and $L_{max}$ prescribes the maximum possible length to be modeled. Then, the actual input to the Transformer is the element-wise addition of the word embeddings and the positional encodings. If we simply adapt this positional encoding to our recurrence mechanism, the hidden state sequence would be computed schematically by ...
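As a hedged illustration of the coherence problem described in this quote (names such as segment_a and the random values are assumptions, not code from the paper), the sketch below shows that naively adding the same absolute encoding rows $\mathbf{U}_{0:L}$ to every reused segment gives tokens in consecutive segments identical position information:

```python
import numpy as np

L, d = 4, 8                      # segment length and model dimension
U = np.random.randn(64, d)       # absolute positional encodings, U in R^{L_max x d}

segment_a = np.random.randn(L, d)   # word embeddings of one segment
segment_b = np.random.randn(L, d)   # word embeddings of the next segment

# Naive reuse: both segments receive the same rows U[0:L].
input_a = segment_a + U[:L]
input_b = segment_b + U[:L]

# The position signal added to token 0 of segment_a equals the one added
# to token 0 of segment_b, even though the two tokens are L positions apart.
assert np.allclose(input_a[0] - segment_a[0], input_b[0] - segment_b[0])
```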
2017
- (Vaswani et al., 2017) ⇒ Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. (2017). “Attention is all You Need.” In: Advances in Neural Information Processing Systems.
- QUOTE: ... Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension [math]\displaystyle{ d_{model} }[/math] as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed.
In this work, we use sine and cosine functions of different frequencies:
[math]\displaystyle{ PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}}) }[/math]
[math]\displaystyle{ PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}}) }[/math]
where [math]\displaystyle{ pos }[/math] is the position and [math]\displaystyle{ i }[/math] is the dimension.
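A minimal NumPy sketch of the sinusoidal encodings defined by the formulas above (an illustrative implementation under those formulas; the function and variable names are assumptions, not from the paper):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix with
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # the 2i values
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Example: encodings for a 50-token sequence with d_model = 512,
# summed with (here random) input embeddings.
pe = sinusoidal_positional_encoding(50, 512)
embeddings = np.random.randn(50, 512)
inputs = embeddings + pe
```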