Recurrent Neural Network-based Language Model (RNN-LM)
A Recurrent Neural Network-based Language Model (RNN-LM) is a neural network language model that is based on a recurrent neural network (RNN).
- Context:
- It can be produced by a Recurrent Neural Network Language Model Training System (that implements an RNN-based LM algorithm).
- Example(s):
- the RNN-LM described in (Mikolov et al., 2010).
- an LSTM-based Language Model.
- Counter-Example(s):
- a Transformer-based Language Model.
- an n-gram Language Model.
- See: Hidden Layer, Recurrent Connection.
References
2019
- Rani Horev. (2019). “Transformer-XL Explained: Combining Transformers and RNNs into a State-of-the-art Language Model.”
- QUOTE: ... A popular approach for language modeling is Recurrent Neural Networks (RNNs) as they capture dependencies between words well, especially when using modules such as LSTM. However, RNNs tend to be slow and their ability to learn long-term dependencies is still limited due to vanishing gradients. Transformers, invented in 2017, introduced a new approach — attention modules. Instead of processing tokens one by one, attention modules receive a segment of tokens and learn the dependencies between all of them at once ...
2016
- (Bowman et al., 2016) ⇒ Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. (2016). “Generating Sentences from a Continuous Space.” In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, (CoNLL-2016).
- QUOTE: The standard recurrent neural network language model (RNNLM) generates sentences one word at a time and does not work from an explicit global sentence representation.
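The word-at-a-time generation described above can be sketched as a simple decoding loop. The following is only an illustration: the `step` function and toy vocabulary are hypothetical stand-ins, not the paper's model.
```python
import numpy as np

def sample_sentence(step, vocab, eos_id, bos_id, max_len=20, seed=0):
    """Decode word by word, feeding each sampled word back in as the next input."""
    rng = np.random.default_rng(seed)
    state = None                                  # hidden state; None before the first step
    word_id, words = bos_id, []
    for _ in range(max_len):
        probs, state = step(word_id, state)       # next-word distribution y(t) and new state
        word_id = int(rng.choice(len(vocab), p=probs))
        if word_id == eos_id:
            break
        words.append(vocab[word_id])
    return " ".join(words)

# Toy stand-in for a trained RNN-LM step function: uniform next-word distribution.
vocab = ["<s>", "</s>", "the", "cat", "sat"]
toy_step = lambda w, s: (np.full(len(vocab), 1.0 / len(vocab)), s)
print(sample_sentence(toy_step, vocab, eos_id=1, bos_id=0))
```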
2013
- (Mikolov et al., 2013b) ⇒ Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. (2013). “Linguistic Regularities in Continuous Space Word Representations.” In: HLT-NAACL.
- QUOTE: The word representations we study are learned by a recurrent neural network language model (Mikolov et al., 2010), as illustrated in Figure 1. This architecture consists of an input layer, a hidden layer with recurrent connections, plus the corresponding weight matrices. The input vector w(t) represents the input word at time t encoded using 1-of-N coding, and the output layer y(t) produces a probability distribution over words. The hidden layer s(t) maintains a representation of the sentence history. The input vector w(t) and the output vector y(t) have dimensionality of the vocabulary. The values in the hidden and output layers are computed as follows:
:[math]\displaystyle{ \begin{align} \mathbf{s}(t) &= f(\mathbf{U}\mathbf{w}(t) + \mathbf{W}\mathbf{s}(t-1)) \quad (1) \\ \mathbf{y}(t) &= g(\mathbf{V}\mathbf{s}(t)) \quad (2) \end{align} }[/math]
where
:[math]\displaystyle{ f(z) = \frac{1}{1 + e^{-z}}, \quad g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}. \quad (3) }[/math]
In this framework, the word representations are found in the columns of U, with each column representing a word. The RNN is trained with backpropagation to maximize the data log-likelihood under the model.
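Equations (1)-(3) above can be written out directly in numpy. The sketch below uses toy dimensions and random weights (all values here are illustrative assumptions, not the original model's settings) and accumulates the log-likelihood the model assigns to a short word-id sequence.
```python
import numpy as np

N, H = 10, 8                               # vocabulary size and hidden-layer size (toy values)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(H, N))     # input-to-hidden weights; columns are word vectors
W = rng.normal(scale=0.1, size=(H, H))     # recurrent hidden-to-hidden weights
V = rng.normal(scale=0.1, size=(N, H))     # hidden-to-output weights

def f(z):                                  # sigmoid, equation (3)
    return 1.0 / (1.0 + np.exp(-z))

def g(z):                                  # softmax, equation (3)
    e = np.exp(z - z.max())
    return e / e.sum()

def rnnlm_step(word_id, s_prev):
    """One time step: equations (1) and (2)."""
    w = np.zeros(N); w[word_id] = 1.0      # 1-of-N coding of the input word
    s = f(U @ w + W @ s_prev)              # s(t) = f(U w(t) + W s(t-1))
    y = g(V @ s)                           # y(t) = g(V s(t))
    return y, s

sequence = [3, 1, 4, 1, 5]                 # arbitrary word ids for illustration
s = np.zeros(H)
log_lik = 0.0
for w_t, w_next in zip(sequence[:-1], sequence[1:]):
    y, s = rnnlm_step(w_t, s)
    log_lik += np.log(y[w_next])           # log-probability of the observed next word
print("log-likelihood:", log_lik)
```
Because w(t) is a 1-of-N vector, the product Uw(t) simply selects one column of U, which is why the learned word representations are read off the columns of U.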
2011
- (Mikolov et al., 2011) ⇒ Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. (2011). “Extensions of Recurrent Neural Network Language Model.” In: Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing, (ICASSP-2011).
- QUOTE: We present several modifications of the original recurrent neural network language model (RNN LM).
2007
- Yoshua Bengio. http://www.iro.umontreal.ca/~bengioy/yoshua_en/research.html
- One of the highest impact results I obtained was about the difficulty of learning sequential dependencies, either in recurrent neural networks or in dynamical graphical models (such as Hidden Markov Models). [[Bengio et al., 1994|The paper below]] suggests that with parametrized dynamical systems (such as a recurrent neural network), the error gradient propagated through many time steps is a poor source of information for learning to capture statistical dependencies that are temporally remote. The mathematical result is that either information is not easily transmitted (it is lost exponentially fast when trying to propagate it from the past to the future through a context variable, or it is vulnerable to perturbations and noise), or the gradients relating temporally remote events become exponentially small for larger temporal differences.
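The exponential decay of these gradients can be made concrete with a small numerical sketch (an illustration with arbitrary random weights, not a reproduction of the Bengio et al. analysis): for a sigmoid RNN, the gradient of the hidden state s(T) with respect to an earlier state s(T-k) is a product of k per-step Jacobians, and its norm typically shrinks roughly exponentially in k.
```python
import numpy as np

H = 8
rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(H, H))     # recurrent weight matrix (hypothetical values)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulate hidden states s(t) = sigmoid(W s(t-1)) and accumulate the Jacobian product
# d s(T) / d s(T-k) = prod_t diag(s(t) * (1 - s(t))) W  via the chain rule.
s = rng.uniform(size=H)
J = np.eye(H)
for k in range(1, 51):
    s = sigmoid(W @ s)
    J = np.diag(s * (1.0 - s)) @ W @ J     # chain rule through one more time step
    if k in (1, 5, 10, 25, 50):
        print(f"k={k:2d}  ||d s(T)/d s(T-k)|| = {np.linalg.norm(J):.2e}")
```
Running this prints norms that drop by many orders of magnitude as k grows, which is the vanishing-gradient behavior the passage describes.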