Adaptive Learning Rate
An Adaptive Learning Rate is a learning rate that is re-adjusted during the training phase.
- AKA: ALR.
- Context:
- It can be automatically tuned by using an adaptive gradient algorithm (see the AdaGrad-style sketch below).
- …
- Example(s):
- A learning rate that decreases with the variance of a stochastic gradient and increases with its norm.
- …
- Counter-Example(s):
- See: Adaptive Gradient Algorithm, Stochastic Optimization, Convex Optimization, Gradient Descent, Proximal Function, Hebb's Rule, Loss Function, Neural Network Topology, Stochastic Gradient Descent, Adam, Backprop.
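The per-parameter tuning mentioned in the context above can be illustrated with a minimal AdaGrad-style sketch (an illustrative assumption, not a specific cited implementation): each weight's effective learning rate shrinks as the accumulated squared gradient for that weight grows.
```python
# Minimal AdaGrad-style sketch (illustrative assumption): each parameter gets
# its own effective learning rate, which shrinks as the accumulated squared
# gradient for that parameter grows.
import numpy as np

def adagrad_step(params, grads, accum, base_lr=0.5, eps=1e-8):
    """One update with per-parameter adaptive learning rates."""
    accum += grads ** 2                      # running sum of squared gradients
    effective_lr = base_lr / (np.sqrt(accum) + eps)
    params -= effective_lr * grads           # smaller steps for frequently large gradients
    return params, accum

# Usage on a toy quadratic loss 0.5 * ||params||^2, whose gradient is params.
params = np.array([1.0, -2.0])
accum = np.zeros_like(params)
for _ in range(100):
    grads = params                           # gradient of the toy loss
    params, accum = adagrad_step(params, grads, accum)
print(params)                                # close to the minimum at [0, 0]
```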
References
2018
- (Ravaut & Gorti, 2018) ⇒ Mathieu Ravaut and Satya Krishna Gorti (unknown year). "Faster gradient descent via an adaptive learning rate". Retrieved: 2018-10-20.
- QUOTE: Any gradient descent requires choosing a learning rate. With deeper and deeper models, tuning that learning rate can easily become tedious and does not necessarily lead to an ideal convergence. We propose a variation of the gradient descent algorithm in which the learning rate η is not fixed. Instead, we learn η itself, either by another gradient descent (first-order method), or by Newton’s method (second-order). This way, gradient descent for any machine learning algorithm can be optimized.
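A minimal sketch of the first-order variant described in this quote, assuming a toy quadratic loss (the code and constants are illustrative, not the authors' implementation): the loss after one step, L(w - η g), is treated as a function of η, and η is itself updated by gradient descent on that quantity.
```python
# Hypothetical first-order sketch of learning eta by gradient descent:
# d/d(eta) L(w - eta * g) = -grad_L(w - eta * g) . g
import numpy as np

def loss(w):            # toy quadratic loss
    return 0.5 * np.dot(w, w)

def grad(w):            # its gradient
    return w

w = np.array([3.0, -1.5])
eta, beta = 0.01, 0.001                          # initial learning rate and hyper learning rate (assumed values)
for _ in range(200):
    g = grad(w)
    hypergrad = -np.dot(grad(w - eta * g), g)    # derivative of the one-step loss w.r.t. eta
    eta -= beta * hypergrad                      # learn eta itself (first-order)
    w -= eta * g                                 # usual gradient step with the learned eta
print(eta, loss(w))                              # eta grows toward the optimal step for this loss
```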
2013
- (Ranganath et al., 2013) ⇒ Rajesh Ranganath, Chong Wang, David M. Blei, and Eric P. Xing. (2013). "An Adaptive Learning Rate for Stochastic Variational Inference.” In: International Conference on Machine Learning (ICML-2013).
- QUOTE: In this paper, we develop an adaptive learning rate for stochastic variational inference. The step size decreases when the variance of the noisy gradient is large, mitigating the risk of taking a large step in the wrong direction. The step size increases when the norm of the expected noisy gradient is large, indicating that the algorithm is far away from the optimal point. With this approach, the user need not set any learning-rate parameters to find a good variational distribution, and it is implemented with computations already made within stochastic inference. Further, we found it consistently led to improved convergence and estimation over the best decreasing and constant rates.
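A hedged sketch of the kind of rule described in this quote (the estimators and the memory parameter below are simplifications, not the paper's exact updates): running averages approximate the expected noisy gradient and its expected squared norm, and the step size is their ratio, so high gradient variance lowers it while a large expected gradient raises it.
```python
# Sketch of a variance/norm-adaptive step size (assumed estimators):
# g_bar approximates E[g], h_bar approximates E[g^T g], rho = g_bar.g_bar / h_bar.
import numpy as np

rng = np.random.default_rng(0)
w = np.array([2.0, -3.0])
g_bar = np.zeros_like(w)
h_bar = 1.0
tau = 10.0                                   # memory size of the moving averages (assumed)

for t in range(500):
    noisy_grad = w + rng.normal(scale=0.5, size=w.shape)   # noisy gradient of 0.5 * ||w||^2
    g_bar = (1 - 1 / tau) * g_bar + (1 / tau) * noisy_grad
    h_bar = (1 - 1 / tau) * h_bar + (1 / tau) * np.dot(noisy_grad, noisy_grad)
    rho = np.dot(g_bar, g_bar) / h_bar       # adaptive step size, no tuning needed
    w -= rho * noisy_grad
print(w)                                     # hovers near the optimum [0, 0]
```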
2001
- (Plagianakos et al., 2001) ⇒ V. P. Plagianakos, G. D. Magoulas, and M. N. Vrahatis. (2001). "Learning rate adaptation in stochastic gradient descent". In Advances in convex analysis and global optimization (pp. 433-444). Springer, Boston, MA.
- QUOTE: BPNN training research usually focuses on deterministic gradient-based algorithms with adaptive learning rate that aim to accelerate the learning process. The following strategies are usually suggested: (i) start with a small learning rate and increase it exponentially, if successive epochs reduce the error, or rapidly decrease it, if a significant error increase occurs [3, 25], (ii) start with a small learning rate and increase it, if successive epochs keep gradient direction fairly constant, or rapidly decrease it, if the direction of the gradient varies greatly at each epoch [6], (iii) for each weight, an individual learning rate is given, which increases if the successive changes in the weights are in the same direction and decreases otherwise [10, 15, 17, 22], and (iv) use a closed formula to calculate a common learning rate for all the weights at each iteration [9, 12, 16] or a different learning rate for each weight [7, 13]. Note that all the above–mentioned strategies employ heuristic parameters in an attempt to enforce the decrease of the learning error at each iteration and to secure the convergence of the training algorithm.
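Strategy (i) above can be sketched as follows; the growth factor, shrink factor, error-increase threshold, and step-rejection rule are illustrative assumptions, not values from the cited works.
```python
# Sketch of strategy (i): grow the learning rate exponentially while the error
# keeps decreasing, and shrink it rapidly on a significant error increase.
import numpy as np

def train(w, grad_fn, loss_fn, lr=1e-3, epochs=100):
    prev_loss = loss_fn(w)
    for _ in range(epochs):
        w_new = w - lr * grad_fn(w)
        new_loss = loss_fn(w_new)
        if new_loss <= prev_loss:
            w, prev_loss = w_new, new_loss
            lr *= 1.05                      # error went down: increase the rate exponentially
        elif new_loss > 1.01 * prev_loss:
            lr *= 0.5                       # significant error increase: rapidly decrease, reject step
        else:
            w, prev_loss = w_new, new_loss  # small increase: accept the step, keep the rate
    return w, lr

w, lr = train(np.array([4.0, -2.0]),
              grad_fn=lambda w: w,                   # gradient of the toy loss
              loss_fn=lambda w: 0.5 * np.dot(w, w))  # toy quadratic loss
print(w, lr)   # the rate has grown from 1e-3 while the error kept decreasing
```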