Truncated Back-Propagation Through Time (TBPTT) Algorithm
A Truncated Back-Propagation Through Time (TBPTT) Algorithm is a Backpropagation Through Time Algorithm that only backpropagates the gradient for a fixed number of time steps (a predefined truncation horizon).
- Context:
- It was initially developed by Williams & Peng (1990).
- It forbids the RNN from learning dependencies beyond the truncation horizon (a minimal sketch appears below the See list).
- Example(s):
- …
- Counter-Example(s):
- See: Recurrent Neural Network, Elman Networks, Jordan Networks, Gradient, Backpropagation Algorithm.
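The following is a minimal, illustrative PyTorch sketch of the truncation idea: the sequence is processed in windows of a fixed horizon k, the hidden state is detached between windows so no gradient flows past the horizon, and parameters are updated once per window. The model, data, and hyperparameters are hypothetical placeholders, not taken from the cited papers.
<pre>
import torch
import torch.nn as nn

# Hypothetical sizes; k is the truncation horizon.
T_total, k, batch, in_dim, hid_dim = 1000, 35, 16, 8, 32

rnn = nn.RNN(in_dim, hid_dim, batch_first=True)
readout = nn.Linear(hid_dim, in_dim)
optim = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

# Synthetic input sequence with next-step prediction targets (placeholders).
x = torch.randn(batch, T_total, in_dim)
y = torch.roll(x, shifts=-1, dims=1)

h = torch.zeros(1, batch, hid_dim)
for start in range(0, T_total - 1, k):
    chunk_x = x[:, start:start + k]
    chunk_y = y[:, start:start + k]

    h = h.detach()              # cut the graph: no gradient flows past the horizon
    out, h = rnn(chunk_x, h)    # unroll the RNN for only k steps
    loss = nn.functional.mse_loss(readout(out), chunk_y)

    optim.zero_grad()
    loss.backward()             # backpropagate within the k-step window only
    optim.step()                # parameters are updated once per window
</pre>
Because the hidden state is carried forward but detached, information from earlier windows can still influence predictions, while gradients (and hence learned dependencies) are limited to the k-step horizon, which is the trade-off discussed in Benzing et al. (2019) below.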
References
2019
- (Benzing et al., 2019) ⇒ Frederik Benzing, Marcelo Matheus Gauy, Asier Mujika, Anders Martinsson, and Angelika Steger (2019, May). "Optimal Kronecker-Sum Approximation of Real Time Recurrent Learning". In: International Conference on Machine Learning (pp. 604-613). PMLR.
- QUOTE: Since Williams and Peng (1990) developed Truncated Backpropagation through Time (TBPTT), it continues to be the most popular training method in many areas (Mnih et al., 2016; Mehri et al., 2017; Merity et al., 2018) - despite the fact that it does not seem to align well with the goal of learning arbitrary long-term dependencies. This is because TBPTT ‘unrolls’ the RNN only for a fixed number of time steps $T$ (the truncation horizon) and backpropagates the gradient for these steps only. This almost categorically forbids learning dependencies beyond the truncation horizon. Unfortunately, extending the truncation horizon makes TBPTT increasingly memory consuming, since long input sequences need to be stored, and considerably slows down learning, since parameters are updated less frequently, a phenomenon known as ‘update lock’ (Jaderberg et al., 2017)
2016
- (Mnih et al., 2016) ⇒ Volodymyr Mnih, Adria Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu (2016). "Asynchronous Methods for Deep Reinforcement Learning". In: Proceedings of The 33rd International Conference on Machine Learning (ICML 2016).
- QUOTE: We found that using the forward view is easier when training neural networks with momentum-based methods and backpropagation through time. In order to compute a single update, the algorithm first selects actions using its exploration policy for up to $t_{max}$ steps or until a terminal state is reached.
1990
- (Williams & Peng, 1990) ⇒ Ronald J. Williams, and Jing Peng (1990). "An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories". In: Neural Computation 2(4): 490-501.
- QUOTE: However, if weights are adjusted as the network operates, as they necessarily must be for any on-line algorithm, use of a gradient computation method based on the assumption that the weights are fixed over all past time involves a different kind of approximation that can actually be mitigated by ignoring dependencies into the distant past, as occurs when using truncated BPTT. (...)
Another potential benefit of the truncation strategy is that it can help provide a useful inductive bias by forcing the learning system to consider only reasonably short-term correlations between input and desired output; of course, this is appropriate only when these short-term correlations are actually the important ones, as is often the case.