Back-Propagation Through Time (BPTT) Algorithm
A Back-Propagation Through Time (BPTT) Algorithm is a Gradient Descent Algorithm that can be used to train certain types of recurrent neural networks.
- AKA: Backpropagation Through Time.
- Context:
- It can be used to train Elman and Jordan Networks.
- It is a generalization of a Backpropagation of Errors (BP)-based Training Algorithm.
- Example(s):
- Counter-Example(s):
- See: Recurrent Neural Network, Elman Networks, Jordan Networks, Gradient, Backpropagation Algorithm, Decoupled Neural Interfaces, Recurrent Backpropagation Algorithm, Equilibrium Propagation Algorithm.
References
2021
- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Backpropagation_through_time Retrieved:2021-3-21.
- Backpropagation through time (BPTT) is a gradient-based technique for training certain types of recurrent neural networks. It can be used to train Elman networks. The algorithm was independently derived by numerous researchers.
(...)
The training data for a recurrent neural network is an ordered sequence of [math]\displaystyle{ k }[/math] input-output pairs, [math]\displaystyle{ \langle \mathbf{a}_0,\mathbf{y}_0 \rangle, \langle\mathbf{a}_1,\mathbf{y}_1 \rangle,\langle\mathbf{a}_2,\mathbf{y}_2\rangle,...,\langle\mathbf{a}_{k-1},\mathbf{y}_{k-1}\rangle }[/math]. An initial value must be specified for the hidden state [math]\displaystyle{ \mathbf{x}_0 }[/math]. Typically, a vector of all zeros is used for this purpose.
BPTT begins by unfolding a recurrent neural network in time. The unfolded network contains [math]\displaystyle{ k }[/math] inputs and outputs, but every copy of the network shares the same parameters. Then the backpropagation algorithm is used to find the gradient of the cost with respect to all the network parameters.
Consider an example of a neural network that contains a recurrent layer [math]\displaystyle{ f }[/math] and a feedforward layer [math]\displaystyle{ g }[/math]. There are different ways to define the training cost, but the total cost is always the average of the costs of each of the time steps. The cost of each time step can be computed separately. The figure [in the source article] shows how the cost at time [math]\displaystyle{ t + 3 }[/math] can be computed, by unfolding the recurrent layer [math]\displaystyle{ f }[/math] for three time steps and adding the feedforward layer [math]\displaystyle{ g }[/math]. Each instance of [math]\displaystyle{ f }[/math] in the unfolded network shares the same parameters. Thus the weight updates in each instance ([math]\displaystyle{ f_1, f_2, f_3 }[/math]) are summed together.
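The unfolding-and-parameter-sharing procedure described above can be sketched directly in code. The following NumPy snippet is a minimal, illustrative BPTT implementation for one Elman-style recurrent layer [math]\displaystyle{ f }[/math] (a tanh hidden layer) and a linear readout [math]\displaystyle{ g }[/math], with the total cost taken as the average of the per-step squared errors; the weight names (W_xh, W_hh, W_hy) and the toy dimensions are assumptions made for this example, not taken from the cited source.
```python
# Minimal BPTT sketch (illustrative, not from the cited source): an
# Elman-style recurrent layer f (tanh) unfolded over the whole sequence,
# a linear readout g, and a squared-error cost averaged over time steps.
import numpy as np

def bptt(inputs, targets, W_xh, W_hh, W_hy, h0):
    """Return the average cost and the gradients of the shared weights."""
    k = len(inputs)
    hs, ys = [h0], []
    # forward pass: every unfolded copy of f reuses the same weights
    for a in inputs:
        hs.append(np.tanh(W_xh @ a + W_hh @ hs[-1]))
        ys.append(W_hy @ hs[-1])
    cost = np.mean([np.sum((y - t) ** 2) for y, t in zip(ys, targets)])
    # backward pass: gradient contributions from every time step are summed
    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    dh_next = np.zeros_like(h0)           # gradient flowing back through the state
    for t in reversed(range(k)):
        dy = 2.0 * (ys[t] - targets[t]) / k
        dW_hy += np.outer(dy, hs[t + 1])
        dh = W_hy.T @ dy + dh_next
        dz = dh * (1.0 - hs[t + 1] ** 2)  # derivative of tanh
        dW_xh += np.outer(dz, inputs[t])
        dW_hh += np.outer(dz, hs[t])
        dh_next = W_hh.T @ dz
    return cost, dW_xh, dW_hh, dW_hy

# usage: a 4-step toy sequence (3-dim inputs, 5-dim hidden state, 2-dim outputs)
rng = np.random.default_rng(0)
inputs = [rng.normal(size=3) for _ in range(4)]
targets = [rng.normal(size=2) for _ in range(4)]
W_xh, W_hh, W_hy = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
cost, dW_xh, dW_hh, dW_hy = bptt(inputs, targets, W_xh, W_hh, W_hy, h0=np.zeros(5))
```
Note how the contributions to the shared weights from every unfolded copy of [math]\displaystyle{ f }[/math] are accumulated in the backward loop, mirroring the summed weight updates described in the excerpt.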
2020
- (GeeksforGeeks, 2020) ⇒ https://www.geeksforgeeks.org/ml-back-propagation-through-time/ Last Updated: 04 May, 2020.
- QUOTE: This method of Back Propagation through time (BPTT) can be used up to a limited number of time steps like 8 or 10. If we back propagate further, the gradient $\delta$ becomes too small. This problem is called the “Vanishing gradient” problem. The problem is that the contribution of information decays geometrically over time. So, if the number of time steps is >10 (Let’s say), that information will effectively be discarded.
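A common response to the vanishing-gradient behavior described in the quote above is truncated BPTT, in which the backward pass is cut off after a fixed window of recent time steps (e.g. 8 to 10). The sketch below is an illustrative NumPy version under the same assumed setup as the earlier example (tanh Elman-style hidden layer, linear readout, squared-error cost); the function name, window size, and weight names are hypothetical.
```python
# Illustrative truncated-BPTT sketch (same assumed Elman-style setup as the
# previous example): the backward pass stops after a fixed window of steps,
# so vanishing contributions from the distant past are simply dropped.
import numpy as np

def truncated_bptt(inputs, targets, hs, W_xh, W_hh, W_hy, window=10):
    """Backpropagate only through the last `window` time steps."""
    k = len(inputs)
    start = max(0, k - window)           # earliest step the gradient reaches
    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    dh_next = np.zeros(W_hh.shape[0])
    for t in reversed(range(start, k)):
        y = W_hy @ hs[t + 1]
        dy = 2.0 * (y - targets[t]) / (k - start)   # average cost over the window
        dW_hy += np.outer(dy, hs[t + 1])
        dh = W_hy.T @ dy + dh_next
        dz = dh * (1.0 - hs[t + 1] ** 2)            # derivative of tanh
        dW_xh += np.outer(dz, inputs[t])
        dW_hh += np.outer(dz, hs[t])
        dh_next = W_hh.T @ dz            # discarded once t reaches `start`
    return dW_xh, dW_hh, dW_hy

# usage: forward pass over a 30-step sequence, backward pass over the last 10 steps
rng = np.random.default_rng(1)
inputs = [rng.normal(size=3) for _ in range(30)]
targets = [rng.normal(size=2) for _ in range(30)]
W_xh, W_hh, W_hy = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
hs = [np.zeros(5)]
for a in inputs:
    hs.append(np.tanh(W_xh @ a + W_hh @ hs[-1]))
grads = truncated_bptt(inputs, targets, hs, W_xh, W_hh, W_hy, window=10)
```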
2016a
- (Gruslys et al., 2016) ⇒ Audrunas Gruslys, Remi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves (2016). "Memory-Efficient Backpropagation Through Time". arXiv preprint arXiv:1606.03401.
- QUOTE: Backpropagation Through Time algorithm (BPTT) (Rumelhart et al., 1985, Werbos, 1990) is typically used to obtain gradients during training. One important problem is the large memory consumption required by the BPTT. This is especially troublesome when using Graphics Processing Units (GPUs) due to the limitations of GPU memory.
2016b
- (Li, 2016) ⇒ Minchen Li (2016). "A Tutorial On Backward Propagation Through Time (BPTT) In The Gated Recurrent Unit (GRU) RNN".
- QUOTE: A MATLAB program which implements the entire BPTT for GRU and the pseudo-codes describing the algorithms explicitly will be presented. We provide two algorithms for BPTT, a direct but quadratic time algorithm for easy understanding, and an optimized linear time algorithm. This tutorial starts with a specification of the problem followed by a mathematical derivation before the computational solutions.
1990a
- (Werbos, 1990) ⇒ Paul J. Werbos (1990). "Backpropagation through time: what it does and how to do it". In: Proceedings of the IEEE, 78(10):1550–1560.
1990b
- (Williams & Peng, 1990) ⇒ Ronald J. Williams, and Jing Peng (1990). "An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories". In: Neural Computation 2(4): 490-501.
1985
- (Rumelhart et al., 1985) ⇒ David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams (1985). "Learning Internal Representations by Error Propagation". In: Technical report, DTIC Document, 1985.