Truncated Back-Propagation Through Time (TBPTT) Algorithm
A Truncated Back-Propagation Through Time (TBPTT) Algorithm is a Backpropagation Through Time Algorithm that only backpropagates the gradient for a fixed number of time steps (a predefined truncation horizon).
- Context:
- It was initially developed by Williams & Peng (1990).
- It forbids an RNN from learning dependencies beyond the truncation horizon.
- Example(s):
- …
- Counter-Example(s):
- See: Recurrent Neural Network, Elman Networks, Jordan Networks, Gradient, Backpropagation Algorithm.
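Below is a minimal illustrative sketch of TBPTT written in PyTorch. The network architecture, truncation horizon, hyperparameters, and toy data are assumptions chosen for demonstration only; they are not drawn from the cited references.
```python
# Minimal TBPTT sketch (illustrative; all names and data are assumptions).
import torch
import torch.nn as nn

T = 20                       # truncation horizon: gradients flow back at most T steps
seq_len, batch, n_in, n_hid = 200, 8, 10, 32

rnn = nn.RNN(n_in, n_hid)                # simple Elman-style RNN
readout = nn.Linear(n_hid, 1)
optimizer = torch.optim.SGD(
    list(rnn.parameters()) + list(readout.parameters()), lr=0.01)
loss_fn = nn.MSELoss()

# Toy sequence data (random, for illustration only).
inputs = torch.randn(seq_len, batch, n_in)
targets = torch.randn(seq_len, batch, 1)

hidden = torch.zeros(1, batch, n_hid)
for start in range(0, seq_len, T):
    chunk_x = inputs[start:start + T]
    chunk_y = targets[start:start + T]

    # Detaching the hidden state truncates the computation graph, so no
    # gradient flows into time steps before the current chunk.
    hidden = hidden.detach()

    outputs, hidden = rnn(chunk_x, hidden)
    loss = loss_fn(readout(outputs), chunk_y)

    optimizer.zero_grad()
    loss.backward()          # backpropagates through at most T time steps
    optimizer.step()         # parameters are updated once per chunk
```
In this sketch the detach call implements the truncation: dependencies spanning more than one chunk (here, more than T time steps) contribute nothing to the gradient, which is the limitation discussed in the quotes below.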
References
2019
- (Benzing et al., 2019) ⇒ Frederik Benzing, Marcelo Matheus Gauy, Asier Mujika, Anders Martinsson, and Angelika Steger (2019, May). "Optimal Kronecker-Sum Approximation of Real Time Recurrent Learning". In: International Conference on Machine Learning (pp. 604-613). PMLR.
- QUOTE: Since Williams and Peng (1990) developed Truncated Backpropagation through Time (TBPTT), it continues to be the most popular training method in many areas (Mnih et al., 2016; Mehri et al., 2017; Merity et al., 2018) - despite the fact that it does not seem to align well with the goal of learning arbitrary long-term dependencies. This is because TBPTT ‘unrolls’ the RNN only for a fixed number of time steps $T$ (the truncation horizon) and backpropagates the gradient for these steps only. This almost categorically forbids learning dependencies beyond the truncation horizon. Unfortunately, extending the truncation horizon makes TBPTT increasingly memory consuming, since long input sequences need to be stored, and considerably slows down learning, since parameters are updated less frequently, a phenomenon known as ‘update lock’ (Jaderberg et al., 2017)
2016
- (Mnih et al., 2016) ⇒ Volodymyr Mnih, Adria Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu (2016). "Asynchronous Methods for Deep Reinforcement Learning". In: Proceedings of The 33rd International Conference on Machine Learning (ICML 2016).
- QUOTE: We found that using the forward view is easier when training neural networks with momentum-based methods and backpropagation through time. In order to compute a single update, the algorithm first selects actions using its exploration policy for up to $t_{max}$ steps or until a terminal state is reached.
1990
- (Williams & Peng, 1990) ⇒ Ronald J. Williams, and Jing Peng (1990). "An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories". In: Neural Computation 2(4): 490-501.
- QUOTE: However, if weights are adjusted as the network operates, as they necessarily must be for any on-line algorithm, use of a gradient computation method based on the assumption that the weights are fixed over all past time involves a different kind of approximation that can actually be mitigated by ignoring dependencies into the distant past, as occurs when using truncated BPTT. (...)
Another potential benefit of the truncation strategy is that it can help provide a useful inductive bias by forcing the learning system to consider only reasonably short-term correlations between input and desired output; of course, this is appropriate only when these short-term correlations are actually the important ones, as is often the case.