Real-Time Recurrent Learning (RTRL) Algorithm
A Real-Time Recurrent Learning (RTRL) Algorithm is a Gradient Descent Algorithm for training RNNs online, updating the weights as each timestep's input and error signal become available.
- Context:
- It differs from the BPTT algorithm in that it computes untruncated gradients fully online, at the cost of substantially higher per-step computation and memory.
- Example(s):
- Counter-Example(s):
  - a Backpropagation Through Time (BPTT) Algorithm.
- See: Recurrent Neural Network, Elman Networks, Jordan Networks, Gradient, Backpropagation Algorithm, Decoupled Neural Interfaces, Recurrent Backpropagation Algorithm, Equilibrium Propagation Algorithm.
References
2021
- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Recurrent_neural_network#Gradient_descent Retrieved:2021-3-21.
- Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. In neural networks, it can be used to minimize the error term by changing each weight in proportion to the derivative of the error with respect to that weight, provided the non-linear activation functions are differentiable. Various methods for doing so were developed in the 1980s and early 1990s by Werbos, Williams, Robinson, Schmidhuber, Hochreiter, Pearlmutter and others.
The standard method is called “backpropagation through time” or BPTT, and is a generalization of back-propagation for feed-forward networks. Like that method, it is an instance of automatic differentiation in the reverse accumulation mode of Pontryagin's minimum principle. A more computationally expensive online variant is called “Real-Time Recurrent Learning” or RTRL, which is an instance of automatic differentiation in the forward accumulation mode with stacked tangent vectors. Unlike BPTT, this algorithm is local in time but not local in space.
2020
- (Menick et al., 2020) ⇒ Jacob Menick, Erich Elsen, Utku Evci, Simon Osindero, Karen Simonyan, and Alex Graves (2020). "A Practical Sparse Approximation for Real Time Recurrent Learning". arXiv preprint arXiv:2006.07232.
- QUOTE: Real Time Recurrent Learning (Williams & Zipser, 1989) computes the gradient as:
$\nabla_{\theta} \mathcal{L}=\displaystyle \sum_{t=1}^{T} \dfrac{\partial \mathcal{L}_{t}}{\partial h_{t}} \dfrac{\partial h_{t}}{\partial \theta}=\sum_{t=1}^{T} \dfrac{\partial \mathcal{L}_{t}}{\partial h_{t}}\left(\dfrac{\partial h_{t}}{\partial \theta_{t}}+\dfrac{\partial h_{t}}{\partial h_{t-1}} \dfrac{\partial h_{t-1}}{\partial \theta}\right) \quad (2)$
- This can be viewed as an iterative algorithm, updating $\frac{\partial h_t}{\partial \theta}$ from the intermediate quantity $\frac{\partial h_{t-1}}{\partial\theta}$. To simplify equation 2 we introduce the following notation: $J_{t}:=\frac{\partial h_{t}}{\partial \theta}$, $I_{t}:=\frac{\partial h_{t}}{\partial \theta_{t}}$ and $D_{t}:=\frac{\partial h_{t}}{\partial h_{t-1}}$. $J$ stands for “Jacobian”, $I$ for “immediate Jacobian”, and $D$ for “dynamics”. We sometimes refer to $J$ as the “influence matrix”. The recursion can be rewritten as $J_t = I_t + D_t J_{t-1}$.
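The recursion lends itself to a short worked example. The following NumPy sketch is illustrative rather than taken from Menick et al.: it assumes a vanilla tanh RNN $h_t = \tanh(W_h h_{t-1} + W_x x_t)$, treats only the flattened recurrent weights $W_h$ as $\theta$, and uses a toy squared-error loss; it maintains $J_t = I_t + D_t J_{t-1}$ and accumulates the gradient online.
```python
import numpy as np

# Minimal RTRL sketch (illustrative assumptions): vanilla tanh RNN,
# h_t = tanh(W_h h_{t-1} + W_x x_t), with theta = W_h flattened row-major.
rng = np.random.default_rng(0)
n, d, T = 8, 3, 20                                   # hidden units, input size, sequence length
W_h = rng.normal(scale=0.3, size=(n, n))
W_x = rng.normal(scale=0.3, size=(n, d))
xs = rng.normal(size=(T, d))                         # toy inputs
ys = rng.normal(size=(T, n))                         # toy targets, L_t = 0.5 * ||h_t - y_t||^2

h = np.zeros(n)
J = np.zeros((n, n * n))                             # influence matrix J_t = dh_t/dtheta
grad = np.zeros(n * n)                               # accumulated dL/dtheta

for t in range(T):
    h_prev = h
    h = np.tanh(W_h @ h_prev + W_x @ xs[t])
    gain = 1.0 - h ** 2                              # tanh'(pre-activation)
    D = gain[:, None] * W_h                          # dynamics D_t = dh_t/dh_{t-1}
    I = gain[:, None] * np.kron(np.eye(n), h_prev[None, :])  # immediate Jacobian I_t
    J = I + D @ J                                    # forward-mode RTRL update of the influence matrix
    grad += (h - ys[t]) @ J                          # dL_t/dh_t times J_t, accumulated online

print(grad.reshape(n, n))                            # gradient with respect to W_h
```
Per step, the dominant cost is the product $D_t J_{t-1}$; only the state and $J_t$ are carried forward, which is why no past inputs or activations need to be stored.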
2019
- (Benzing et al., 2019) ⇒ Frederik Benzing, Marcelo Matheus Gauy, Asier Mujika, Anders Martinsson, and Angelika Steger (2019, May). "Optimal Kronecker-Sum Approximation of Real Time Recurrent Learning". In: International Conference on Machine Learning (pp. 604-613). PMLR.
- QUOTE: The advantages of RTRL are that it provides untruncated gradients, which in principle allow the network to learn arbitrarily long-term dependencies, and that it is fully online, so that parameters are updated frequently allowing faster learning. However, its runtime and memory requirements scale poorly with the network size and make RTRL infeasible for practical applications. As a remedy to this problem, Tallec and Ollivier (2017a) proposed replacing the full gradient of RTRL by an unbiased, less computationally costly but noisy approximation (Unbiased Online Recurrent Optimisation, UORO). Recently, Mujika et al. (2018) reduced the noise of this approach and demonstrated empirically that their improvement (Kronecker factored RTRL, KF-RTRL) allows learning complex real world data sets.
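To make the scaling claim concrete (a standard back-of-the-envelope estimate, not a figure from the paper): a fully connected RNN with $n$ hidden units has on the order of $n^2$ recurrent parameters, so the influence matrix $J_t = \frac{\partial h_t}{\partial \theta}$ has $\Theta(n^3)$ entries and the update $J_t = I_t + D_t J_{t-1}$ costs $\Theta(n^4)$ operations per time step, compared with $\Theta(n^2)$ per step for (truncated) BPTT. Closing this gap with cheaper unbiased approximations of $J_t$ is the aim of UORO and KF-RTRL.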
2018
- (Mujika et al., 2018) ⇒ Asier Mujika, Florian Meier, and Angelika Steger (2018). "Approximating Real-Time Recurrent Learning with Random Kronecker Factors". In: NeurIPS 2018.
- QUOTE: Our proposed Kronecker Factored RTRL algorithm (KF-RTRL) is an online learning algorithm for RNNs, which does not require storing any previous inputs or network states.
2017
- (Tallec & Ollivier, 2017) ⇒ Corentin Tallec, Yann Ollivier (2017). "Unbiased Online Recurrent Optimization". ArXiv:1702.05043.
2013
- (Williams & Zipser, 2013) ⇒ Ronald J. Williams, D. Zipser (1 February 2013). "Gradient-based learning algorithms for recurrent networks and their computational complexity". In Chauvin, Yves; Rumelhart, David E. (eds.). Backpropagation: Theory, Architectures, and Applications. Psychology Press. ISBN 978-1-134-77581-1
1989
- (Williams & Zipser, 1989) ⇒ Ronald J. Williams, and David Zipser (1989). “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks". In: Neural Computation 1(2): 270-280.
1987
- (Robinson & Fallside, 1987) ⇒ Anthony J. Robison, and Frank Fallside (1987). "The Utility Driven Dynamic Error Propagation Network". Technical Report CUED/F-INFENG/TR.1. Department of Engineering, University of Cambridge.