Real-Time Recurrent Learning (RTRL) Algorithm
A Real-Time Recurrent Learning (RTRL) Algorithm is a Gradient Descent Algorithm for training RNNs online, updating the weights as each timestep's input and error signal become available.
- Context:
- It differs from the BPTT algorithm in that it computes untruncated gradients fully online, at the cost of substantially higher per-step computation and memory.
- Example(s):
- Counter-Example(s):
  - a Backpropagation Through Time (BPTT) Algorithm.
- See: Recurrent Neural Network, Elman Networks, Jordan Networks, Gradient, Backpropagation Algorithm, Decoupled Neural Interfaces, Recurrent Backpropagation Algorithm, Equilibrium Propagation Algorithm.
References
2021
- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Recurrent_neural_network#Gradient_descent Retrieved:2021-3-21.
- Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. In neural networks, it can be used to minimize the error term by changing each weight in proportion to the derivative of the error with respect to that weight, provided the non-linear activation functions are differentiable. Various methods for doing so were developed in the 1980s and early 1990s by Werbos, Williams, Robinson, Schmidhuber, Hochreiter, Pearlmutter and others.
The standard method is called “backpropagation through time” or BPTT, and is a generalization of back-propagation for feed-forward networks. Like that method, it is an instance of automatic differentiation in the reverse accumulation mode of Pontryagin's minimum principle. A more computationally expensive online variant is called “Real-Time Recurrent Learning” or RTRL, which is an instance of automatic differentiation in the forward accumulation mode with stacked tangent vectors. Unlike BPTT, this algorithm is local in time but not local in space.
2020
- (Menick et al., 2020) ⇒ Jacob Menick, Erich Elsen, Utku Evci, Simon Osindero, Karen Simonyan, and Alex Graves (2020). "A Practical Sparse Approximation for Real Time Recurrent Learning". arXiv preprint arXiv:2006.07232.
- QUOTE: Real Time Recurrent Learning (Williams & Zipser, 1989) computes the gradient as:
$\nabla_{\theta} \mathcal{L}=\displaystyle \sum_{t=1}^{T} \dfrac{\partial \mathcal{L}_{t}}{\partial h_{t}} \dfrac{\partial h_{t}}{\partial \theta}=\sum_{t=1}^{T} \dfrac{\partial \mathcal{L}_{t}}{\partial h_{t}}\left(\dfrac{\partial h_{t}}{\partial \theta_{t}}+\dfrac{\partial h_{t}}{\partial h_{t-1}} \dfrac{\partial h_{t-1}}{\partial \theta}\right) \quad (2)$
- This can be viewed as an iterative algorithm, updating $\frac{\partial h_t}{\partial \theta}$ from the intermediate quantity $\frac{\partial h_{t-1}}{\partial\theta}$. To simplify equation 2 we introduce the following notation: $J_{t}:=\frac{\partial h_{t}}{\partial \theta}$, $I_{t}:=\frac{\partial h_{t}}{\partial \theta_{t}}$ and $D_{t}:=\frac{\partial h_{t}}{\partial h_{t-1}}$. $J$ stands for “Jacobian”, $I$ for “immediate Jacobian”, and $D$ for “dynamics”. We sometimes refer to $J$ as the “influence matrix”. The recursion can be rewritten as $J_t = I_t + D_t J_{t-1}$.
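The recursion lends itself to a short worked example. The following NumPy sketch is illustrative rather than taken from Menick et al.: it assumes a vanilla tanh RNN $h_t = \tanh(W_h h_{t-1} + W_x x_t)$, treats only the flattened recurrent weights $W_h$ as $\theta$, and uses a toy squared-error loss; it maintains $J_t = I_t + D_t J_{t-1}$ and accumulates the gradient online.
```python
import numpy as np

# Minimal RTRL sketch (illustrative assumptions): vanilla tanh RNN,
# h_t = tanh(W_h h_{t-1} + W_x x_t), with theta = W_h flattened row-major.
rng = np.random.default_rng(0)
n, d, T = 8, 3, 20                                   # hidden units, input size, sequence length
W_h = rng.normal(scale=0.3, size=(n, n))
W_x = rng.normal(scale=0.3, size=(n, d))
xs = rng.normal(size=(T, d))                         # toy inputs
ys = rng.normal(size=(T, n))                         # toy targets, L_t = 0.5 * ||h_t - y_t||^2

h = np.zeros(n)
J = np.zeros((n, n * n))                             # influence matrix J_t = dh_t/dtheta
grad = np.zeros(n * n)                               # accumulated dL/dtheta

for t in range(T):
    h_prev = h
    h = np.tanh(W_h @ h_prev + W_x @ xs[t])
    gain = 1.0 - h ** 2                              # tanh'(pre-activation)
    D = gain[:, None] * W_h                          # dynamics D_t = dh_t/dh_{t-1}
    I = gain[:, None] * np.kron(np.eye(n), h_prev[None, :])  # immediate Jacobian I_t
    J = I + D @ J                                    # forward-mode RTRL update of the influence matrix
    grad += (h - ys[t]) @ J                          # dL_t/dh_t times J_t, accumulated online

print(grad.reshape(n, n))                            # gradient with respect to W_h
```
Per step, the dominant cost is the product $D_t J_{t-1}$; only the state and $J_t$ are carried forward, which is why no past inputs or activations need to be stored.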
2019
- (Benzing et al., 2019) ⇒ Frederik Benzing, Marcelo Matheus Gauy, Asier Mujika, Anders Martinsson, and Angelika Steger (2019, May). "Optimal Kronecker-Sum Approximation of Real Time Recurrent Learning". In: International Conference on Machine Learning (pp. 604-613). PMLR.
- QUOTE: The advantages of RTRL are that it provides untruncated gradients, which in principle allow the network to learn arbitrarily long-term dependencies, and that it is fully online, so that parameters are updated frequently allowing faster learning. However, its runtime and memory requirements scale poorly with the network size and make RTRL infeasible for practical applications. As a remedy to this problem, Tallec and Ollivier (2017a) proposed replacing the full gradient of RTRL by an unbiased, less computationally costly but noisy approximation (Unbiased Online Recurrent Optimisation, UORO). Recently, Mujika et al. (2018) reduced the noise of this approach and demonstrated empirically that their improvement (Kronecker factored RTRL, KF-RTRL) allows learning complex real world data sets.
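To make the scaling claim concrete (a standard back-of-the-envelope estimate, not a figure from the paper): a fully connected RNN with $n$ hidden units has on the order of $n^2$ recurrent parameters, so the influence matrix $J_t = \frac{\partial h_t}{\partial \theta}$ has $\Theta(n^3)$ entries and the update $J_t = I_t + D_t J_{t-1}$ costs $\Theta(n^4)$ operations per time step, compared with $\Theta(n^2)$ per step for (truncated) BPTT. Closing this gap with cheaper unbiased approximations of $J_t$ is the aim of UORO and KF-RTRL.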
2018
- (Mujika et al., 2018) ⇒ Asier Mujika, Florian Meier, and Angelika Steger (2018). "Approximating Real-Time Recurrent Learning with Random Kronecker Factors". In: NeurIPS 2018.
- QUOTE: Our proposed Kronecker Factored RTRL algorithm (KF-RTRL) is an online learning algorithm for RNNs, which does not require storing any previous inputs or network states.
2017
- (Tallec & Ollivier, 2017) ⇒ Corentin Tallec, Yann Ollivier (2017). "Unbiased Online Recurrent Optimization". ArXiv:1702.05043.
2013
- (Williams & Zipser, 2013) ⇒ Ronald J. Williams, D. Zipser (1 February 2013). "Gradient-based learning algorithms for recurrent networks and their computational complexity". In Chauvin, Yves; Rumelhart, David E. (eds.). Backpropagation: Theory, Architectures, and Applications. Psychology Press. ISBN 978-1-134-77581-1
1989
- (Williams & Zipser, 1989) ⇒ Ronald J. Williams, and David Zipser (1989). “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks". In: Neural Computation 1(2): 270-280.
1987
- (Robinson & Fallside, 1987) ⇒ Anthony J. Robison, and Frank Fallside (1987). "The Utility Driven Dynamic Error Propagation Network". Technical Report CUED/F-INFENG/TR.1. Department of Engineering, University of Cambridge.