Recurrent Neural Network Language Model Training Task

References

Yoshua Bengio. http://www.iro.umontreal.ca/~bengioy/yoshua_en/research.html
- One of the highest impact results I obtained was about the difficulty of learning sequential dependencies, either in recurrent neural networks or in dynamical graphical models (such as Hidden Markov Models). [[Bengio et al., 1994|The paper below]] suggests that with parametrized dynamical systems (such as a recurrent neural network), the error gradient propagated through many time steps is a poor source of information for learning to capture statistical dependencies that are temporally remote. The mathematical result is that either information is not easily transmitted (it is lost exponentially fast when trying to propagate it from the past to the future through a context variable, or it is vulnerable to perturbations and noise), or the gradients relating temporally remote events become exponentially small for larger temporal differences.
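
The exponential shrinking of gradients described above can be illustrated numerically. The following is a minimal sketch, not the construction used in the cited paper: it assumes a toy recurrence h_t = tanh(W h_{t-1}) with no inputs, where W is an orthogonal matrix scaled by 0.9. The function name `jacobian_norms` and all of its parameters are hypothetical, chosen only for this illustration. Because the derivative of tanh is at most 1, the spectral norm of the Jacobian of h_t with respect to h_0 is bounded by 0.9^t under these assumptions.

```python
import numpy as np


def jacobian_norms(steps=50, size=8, scale=0.9, seed=0):
    """Spectral norm of d h_t / d h_0 for a toy RNN h_t = tanh(W h_{t-1}).

    Illustrative assumption: W is `scale` times a random orthogonal matrix,
    so every singular value of W equals `scale`.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    w = scale * q
    h = rng.standard_normal(size)

    total = np.eye(size)  # running product of per-step Jacobians
    norms = []
    for _ in range(steps):
        h = np.tanh(w @ h)
        # Per-step Jacobian: d h_t / d h_{t-1} = diag(1 - h_t^2) W
        step_jac = np.diag(1.0 - h**2) @ w
        # Chain rule: d h_t / d h_0 = (d h_t / d h_{t-1}) (d h_{t-1} / d h_0)
        total = step_jac @ total
        norms.append(np.linalg.norm(total, 2))
    return norms


if __name__ == "__main__":
    norms = jacobian_norms()
    for t in (1, 10, 25, 50):
        print(f"t={t:2d}  ||dh_t/dh_0|| = {norms[t - 1]:.3e}  bound = {0.9 ** t:.3e}")
```

Running the script prints the Jacobian norm next to the 0.9^t bound at a few time steps. In this setup the norm can only shrink as t grows, so an error signal at step t carries less and less information about the state at step 0.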