Learning Rate Annealing Schedule Algorithm
A Learning Rate Annealing Schedule is a Learning Rate Schedule that is based on simulated annealing.
- Example(s):
  - ...
- Counter-Example(s):
  - ...
- See: Learning Rate, Hyperparameter, Gradient Descent Algorithm.
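As a rough illustration of the idea, the sketch below anneals the learning rate exponentially with the epoch index, by analogy to lowering a temperature in simulated annealing. The function name, the exponential cooling rule, and the hyperparameter values are illustrative assumptions, not an algorithm prescribed by the references below.

```python
import math

# Minimal sketch of an annealing-style learning rate schedule.
# The exponential "cooling" rule and all names are illustrative assumptions.

def exponential_annealing(initial_lr: float, epoch: int,
                          decay_rate: float = 0.05,
                          min_lr: float = 1e-6) -> float:
    """Decay the learning rate exponentially with the epoch index,
    analogous to lowering a temperature in simulated annealing."""
    lr = initial_lr * math.exp(-decay_rate * epoch)
    return max(lr, min_lr)

# The rate is gradually "cooled" as more epochs are completed.
for epoch in (0, 10, 50, 100):
    print(epoch, exponential_annealing(0.1, epoch))
```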
References
2012
- (Zeiler, 2012) ⇒ Matthew D. Zeiler. (2012). “ADADELTA: An Adaptive Learning Rate Method.” In: e-print arXiv:1212.5701.
- QUOTE: There have been several attempts to use heuristics for estimating a good learning rate at each iteration of gradient descent. These either attempt to speed up learning when suitable or to slow down learning near a local minima. Here we consider the latter.
When gradient descent nears a minima in the cost surface, the parameter values can oscillate back and forth around the minima. One method to prevent this is to slow down the parameter updates by decreasing the learning rate. This can be done manually when the validation accuracy appears to plateau. Alternatively, learning rate schedules have been proposed Robinds & Monro (1951) to automatically anneal the learning rate based on how many epochs through the data have been done. These approaches typically add additional hyperparameters to control how quickly the learning rate decays.
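A hypothetical sketch of the kind of epoch-based schedule described in the quote: the learning rate is annealed automatically as epochs through the data are completed, with a decay_rate hyperparameter controlling how quickly it decays. The 1/t-style rule and the names used here are assumptions for illustration, not the method of either cited paper.

```python
# Epoch-based annealing: the rate shrinks as epochs complete, and
# decay_rate controls how quickly it decays (illustrative rule only).

def inverse_time_decay(initial_lr: float, epoch: int,
                       decay_rate: float = 0.5) -> float:
    return initial_lr / (1.0 + decay_rate * epoch)

for epoch in range(0, 11, 2):
    print(f"epoch {epoch:2d}: lr = {inverse_time_decay(0.1, epoch):.4f}")
```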
1951
- (Robbins & Monro, 1951) ⇒ H. Robbins, and S. Monro. (1951). “A Stochastic Approximation Method.” In: Annals of Mathematical Statistics, vol. 22, pp. 400-407.