Positional Encoding Mechanism
A Positional Encoding Mechanism is a neural architecture pattern that injects information about the absolute or relative position of tokens in a sequence into their representations (typically by adding fixed sinusoidal or learned position vectors to the input embeddings), so that order-agnostic models such as the Transformer can make use of sequence order.
References
2023
- (chat, 2023)
- ... Three key innovations presented in the paper “Attention is All You Need” by Vaswani et al. are:
- ... Positional Encoding: Since the Transformer model does not have any inherent sense of the position of tokens in a sequence, the authors introduced positional encoding to inject information about the position of tokens in the input sequence. Positional encoding is added to the input embeddings before being processed by the self-attention layers, allowing the model to learn and use positional information when making predictions. The authors used a sinusoidal function to generate the positional encodings, ensuring that the encodings can be easily extended to varying sequence lengths.
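The following minimal NumPy sketch (an illustration, not code from the paper or from the chat response above; the function name and array shapes are assumptions) shows the step just described: precomputed positional encodings are added element-wise to the token embeddings before they are passed to the self-attention layers.

```python
import numpy as np

def add_positional_information(token_embeddings: np.ndarray,
                               positional_encodings: np.ndarray) -> np.ndarray:
    """Inject position information by element-wise addition.

    token_embeddings:     (sequence_length, d_model) content vectors
    positional_encodings: (max_length, d_model) precomputed position vectors
    """
    seq_len = token_embeddings.shape[0]
    # Only the first seq_len rows of the encoding table are needed.
    return token_embeddings + positional_encodings[:seq_len]
```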
2019
- (Dai et al., 2019) ⇒ Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. (2019). “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.” In: CoRR, arXiv:1901.02860.
- QUOTE: ... While we found the idea presented in the previous subsection very appealing, there is a crucial technical challenge we haven’t solved in order to reuse the hidden states. That is, how can we keep the positional information coherent when we reuse the states? Recall that, in the standard Transformer, the information of sequence order is provided by a set of positional encodings, denoted as $\mathbf{U} \in \mathbb{R}^{L_{max}\times d}$, where the $i$-th row $\mathbf{U}_i$ corresponds to the $i$-th absolute position within a segment and $L_{max}$ prescribes the maximum possible length to be modeled. Then, the actual input to the Transformer is the element-wise addition of the word embeddings and the positional encodings. If we simply adapt this positional encoding to our recurrence mechanism, the hidden state sequence would be computed schematically by ...
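As a hedged illustration of the coherence problem described in this quote (names such as segment_a and the random values are assumptions, not code from the paper), the sketch below shows that naively adding the same absolute encoding rows $\mathbf{U}_{0:L}$ to every reused segment gives tokens in consecutive segments identical position information:

```python
import numpy as np

L, d = 4, 8                      # segment length and model dimension
U = np.random.randn(64, d)       # absolute positional encodings, U in R^{L_max x d}

segment_a = np.random.randn(L, d)   # word embeddings of one segment
segment_b = np.random.randn(L, d)   # word embeddings of the next segment

# Naive reuse: both segments receive the same rows U[0:L].
input_a = segment_a + U[:L]
input_b = segment_b + U[:L]

# The position signal added to token 0 of segment_a equals the one added
# to token 0 of segment_b, even though the two tokens are L positions apart.
assert np.allclose(input_a[0] - segment_a[0], input_b[0] - segment_b[0])
```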
2017
- (Vaswani et al., 2017) ⇒ Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. (2017). “Attention is all You Need.” In: Advances in Neural Information Processing Systems.
- QUOTE: ... Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension [math]\displaystyle{ d_{model} }[/math] as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed.
In this work, we use sine and cosine functions of different frequencies:
[math]\displaystyle{ PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}}) }[/math]
[math]\displaystyle{ PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}}) }[/math]
where [math]\displaystyle{ pos }[/math] is the position and [math]\displaystyle{ i }[/math] is the dimension.
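A minimal NumPy sketch of the sinusoidal encodings defined by the formulas above (an illustrative implementation under those formulas; the function and variable names are assumptions, not from the paper):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix with
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # the 2i values
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Example: encodings for a 50-token sequence with d_model = 512,
# summed with (here random) input embeddings.
pe = sinusoidal_positional_encoding(50, 512)
embeddings = np.random.randn(50, 512)
inputs = embeddings + pe
```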