Learning Rate Annealing Schedule Algorithm
A Learning Rate Annealing Schedule Algorithm is a Learning Rate Schedule Algorithm that gradually decreases the learning rate during training, by analogy with simulated annealing.
- Example(s):
- Counter-Example(s):
- See: Learning Rate, Hyperparameter, Gradient Descent Algorithm.
References
2021a
- (MXNet, 2021) ⇒ https://mxnet.apache.org/versions/1.4.1/tutorials/gluon/learning_rate_schedules_advanced.html Retrieved:2021-7-4.
- QUOTE: Continuing with the idea that smooth decay profiles give improved performance over stepwise decay, Ilya Loshchilov, Frank Hutter (2016) used “cosine annealing” schedules to good effect. As with triangular schedules, the original idea was that this should be used as part of a cyclical schedule, but we begin by implementing the cosine annealing component before the full Stochastic Gradient Descent with Warm Restarts (SGDR) method later in the tutorial.
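A minimal Python sketch of the cosine annealing component described in the quote (not MXNet's API; the function and parameter names below are illustrative), assuming the rate decays from a base value to a final value over a fixed number of updates:

```python
import math

def cosine_annealing(t, max_update, base_lr=0.01, final_lr=0.0):
    """Cosine annealing from base_lr down to final_lr over max_update steps:
    lr_t = final_lr + 0.5 * (base_lr - final_lr) * (1 + cos(pi * t / max_update))."""
    if t >= max_update:
        return final_lr
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * t / max_update))

# Example: learning rate at the start, midpoint, and end of a 300-step schedule.
for step in (0, 150, 300):
    print(step, cosine_annealing(step, max_update=300))
```

In the full SGDR method this curve is restarted periodically; the sketch above covers only the single decay cycle discussed at this point in the tutorial.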
2021b
- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Learning_rate#Learning_rate_schedule Retrieved:2021-7-4.
- QUOTE: Initial rate can be left as system default or can be selected using a range of techniques. A learning rate schedule changes the learning rate during learning and is most often changed between epochs/iterations. This is mainly done with two parameters: decay and momentum. There are many different learning rate schedules but the most common are time-based, step-based and exponential (...)
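A minimal sketch of the three schedule families named in the quote (time-based, step-based, exponential), using one common parameterization for each; the function names and hyperparameter values are illustrative:

```python
import math

def time_based_decay(lr0, decay, iteration):
    """lr_t = lr0 / (1 + decay * t): hyperbolic shrinkage over iterations."""
    return lr0 / (1.0 + decay * iteration)

def step_based_decay(lr0, drop_rate, epochs_per_drop, epoch):
    """Multiply the rate by drop_rate once every epochs_per_drop epochs."""
    return lr0 * drop_rate ** math.floor(epoch / epochs_per_drop)

def exponential_decay(lr0, decay, epoch):
    """lr_t = lr0 * exp(-decay * t): smooth exponential shrinkage."""
    return lr0 * math.exp(-decay * epoch)

# Example: compare the three schedules over the first few epochs.
for epoch in range(5):
    print(epoch,
          round(time_based_decay(0.1, 0.5, epoch), 4),
          round(step_based_decay(0.1, 0.5, 2, epoch), 4),
          round(exponential_decay(0.1, 0.5, epoch), 4))
```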
2018
- (Akbik et al., 2018) ⇒ Alan Akbik, Duncan Blythe, and Roland Vollgraf. (2018). “Contextual String Embeddings for Sequence Labeling.” In: Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018).
- QUOTE: We train the sequence tagging model using vanilla SGD with no momentum, clipping gradients at 5, for 150 epochs. We employ a simple learning rate annealing method in which we halve the learning rate if training loss does not fall for 5 consecutive epochs.
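A minimal sketch of the annealing rule described in this quote (halve the learning rate when training loss has not fallen for 5 consecutive epochs); the class and its interface are illustrative, not the authors' implementation:

```python
class AnnealOnPlateau:
    """Halve the learning rate when training loss has not improved
    for `patience` consecutive epochs."""

    def __init__(self, lr, patience=5, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, train_loss):
        """Call once per epoch with the epoch's training loss; returns the current lr."""
        if train_loss < self.best_loss:
            self.best_loss = train_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

# Example usage inside a training loop (run_one_epoch is a hypothetical training step):
# scheduler = AnnealOnPlateau(lr=0.1)
# for epoch in range(150):
#     train_loss = run_one_epoch(lr=scheduler.lr)
#     scheduler.step(train_loss)
```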
2017
- (Loshchilov & Hutter, 2017) ⇒ Ilya Loshchilov, and Frank Hutter. (2017). “SGDR: Stochastic Gradient Descent with Warm Restarts.” In: Conference Track Proceedings of the 5th International Conference on Learning Representations (ICLR 2017).
2015
- (Sukhbaatar et al., 2015) ⇒ Sainbayar Sukhbaatar, Jason Weston, and Rob Fergus. (2015). “End-to-end Memory Networks.” In: Advances in Neural Information Processing Systems.
- QUOTE: The training procedure we use is the same as the QA tasks, except for the following. For each mini-batch update, the L2 norm of the whole gradient of all parameters is measured[1] and if larger than $L = 50$, then it is scaled down to have norm $L$. This was crucial for good performance. We use the learning rate annealing schedule from Mikolov et al. (2014), namely, if the validation cost has not decreased after one epoch, then the learning rate is scaled down by a factor 1.5. Training terminates when the learning rate drops below $10^{-5}$, i.e. after 50 epochs or so. Weights are initialized using $N(0, 0.05)$ and batch size is set to 128. On the Penn tree dataset, we repeat each training 10 times with different random initializations and pick the one with smallest validation cost. However, we have done only a single training run on Text8 dataset due to limited time constraints.
- ↑ In the QA tasks, the gradient of each weight matrix is measured separately
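Not part of the quoted text: a minimal sketch of the two training-loop ingredients the quote describes, global gradient-norm clipping at $L = 50$ and the validation-based annealing rule attributed to Mikolov et al. (2014). Gradients are assumed to be NumPy arrays, and the function names are illustrative:

```python
import numpy as np

def clip_global_norm(grads, max_norm=50.0):
    """If the L2 norm of the whole gradient exceeds max_norm, rescale it to max_norm."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

def anneal_lr(lr, prev_val_cost, val_cost, factor=1.5, min_lr=1e-5):
    """Scale the learning rate down by `factor` when validation cost has not
    decreased after an epoch; training stops once lr falls below min_lr."""
    if val_cost >= prev_val_cost:
        lr /= factor
    return lr, lr < min_lr  # (new learning rate, should_stop flag)
```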
2012
- (Zeiler, 2012) ⇒ Matthew D. Zeiler. (2012). “ADADELTA: An Adaptive Learning Rate Method.” In: e-print arXiv:1212.5701.
- QUOTE: There have been several attempts to use heuristics for estimating a good learning rate at each iteration of gradient descent. These either attempt to speed up learning when suitable or to slow down learning near a local minima. Here we consider the latter.
When gradient descent nears a minima in the cost surface, the parameter values can oscillate back and forth around the minima. One method to prevent this is to slow down the parameter updates by decreasing the learning rate. This can be done manually when the validation accuracy appears to plateau. Alternatively, learning rate schedules have been proposed Robinds & Monro (1951) to automatically anneal the learning rate based on how many epochs through the data have been done. These approaches typically add additional hyperparameters to control how quickly the learning rate decays.
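As an illustration grounded in the cited reference (not part of the quoted text): the Robbins-Monro stochastic-approximation analysis requires the step sizes $\epsilon_t$ to satisfy $\sum_t \epsilon_t = \infty$ and $\sum_t \epsilon_t^2 < \infty$; an annealing schedule such as $\epsilon_t = \epsilon_0 / (1 + k\,t)$ meets both conditions, with the decay constant $k$ playing the role of the additional hyperparameter mentioned above.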
1951
- (Robbins & Monro, 1951) ⇒ Herbert Robbins, and Sutton Monro. (1951). “A Stochastic Approximation Method.” In: Annals of Mathematical Statistics, 22(3), pp. 400-407.