Learning Rate Annealing Schedule Algorithm
A Learning Rate Annealing Schedule Algorithm is a Learning Rate Schedule Algorithm that is based on simulated annealing.
- Example(s):
  - Learning Rate Cosine Annealing Schedule,
  - …
- Counter-Example(s):
  - Learning Rate Time-based Schedule,
  - Learning Rate Step-based Schedule,
  - Learning Rate Exponential Decay Schedule.
- See: Learning Rate, Hyperparameter, Gradient Descent Algorithm.
References
2021a
- (MXNet, 2021) ⇒ https://mxnet.apache.org/versions/1.4.1/tutorials/gluon/learning_rate_schedules_advanced.html Retrieved:2021-7-4.
- QUOTE: Continuing with the idea that smooth decay profiles give improved performance over stepwise decay, Ilya Loshchilov, Frank Hutter (2016) used “cosine annealing” schedules to good effect. As with triangular schedules, the original idea was that this should be used as part of a cyclical schedule, but we begin by implementing the cosine annealing component before the full Stochastic Gradient Descent with Warm Restarts (SGDR) method later in the tutorial.
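The cosine annealing schedule referred to in this excerpt can be written as a closed-form function of the iteration index. Below is a minimal Python sketch of the SGDR-style formulation (illustrative only, not the tutorial's own schedule class; `lr_max`, `lr_min`, and `max_update` are assumed parameter names):

```python
import math

def cosine_annealing_lr(t, max_update, lr_max=0.01, lr_min=0.0):
    """Decay the learning rate from lr_max to lr_min over max_update steps
    along half a cosine period (smooth rather than stepwise decay)."""
    t = min(t, max_update)  # hold at lr_min once the schedule is exhausted
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / max_update))

# Learning rate at the start, midpoint, and end of a 100-step schedule:
print([cosine_annealing_lr(t, 100) for t in (0, 50, 100)])  # ≈ [0.01, 0.005, 0.0]
```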
2021b
- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Learning_rate#Learning_rate_schedule Retrieved:2021-7-4.
- QUOTE: Initial rate can be left as system default or can be selected using a range of techniques. A learning rate schedule changes the learning rate during learning and is most often changed between epochs/iterations. This is mainly done with two parameters: decay and momentum. There are many different learning rate schedules but the most common are time-based, step-based and exponential (...)
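The three schedule families named in this excerpt each have a standard closed-form expression in the epoch index. A minimal Python sketch follows (textbook formulations; the function names and default hyperparameters are illustrative, and exact parameterizations vary between libraries):

```python
import math

def time_based_lr(lr0, epoch, decay=0.01):
    # Hyperbolic decay: the rate shrinks as 1 / (1 + decay * epoch).
    return lr0 / (1.0 + decay * epoch)

def step_based_lr(lr0, epoch, drop=0.5, epochs_per_drop=10):
    # Piecewise-constant decay: multiply by `drop` every `epochs_per_drop` epochs.
    return lr0 * drop ** math.floor(epoch / epochs_per_drop)

def exponential_lr(lr0, epoch, k=0.1):
    # Continuous exponential decay at rate k.
    return lr0 * math.exp(-k * epoch)

# All three schedules evaluated at epoch 20, starting from lr0 = 0.1:
print(time_based_lr(0.1, 20),    # ≈ 0.0833
      step_based_lr(0.1, 20),    # 0.025
      exponential_lr(0.1, 20))   # ≈ 0.0135
```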
2018
- (Akbik et al., 2018) ⇒ Alan Akbik, Duncan Blythe, and Roland Vollgraf. (2018). “Contextual String Embeddings for Sequence Labeling.” In: Proceedings of the 27th International Conference on Computational Linguistics, (COLING 2018).
- QUOTE: We train the sequence tagging model using vanilla SGD with no momentum, clipping gradients at 5, for 150 epochs. We employ a simple learning rate annealing method in which we halve the learning rate if training loss does not fall for 5 consecutive epochs.
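The annealing rule described by Akbik et al. amounts to a small amount of bookkeeping around the training loop. A minimal sketch, assuming a hypothetical `train_one_epoch(lr)` callable that runs one epoch and returns the training loss (the actual training code is not part of the quote):

```python
def anneal_on_plateau(train_one_epoch, lr=0.1, max_epochs=150, patience=5, factor=0.5):
    """Halve the learning rate whenever training loss has not improved
    for `patience` consecutive epochs."""
    best_loss = float("inf")
    epochs_since_improvement = 0
    for epoch in range(max_epochs):
        loss = train_one_epoch(lr)        # hypothetical hook: one epoch of SGD at rate lr
        if loss < best_loss:
            best_loss = loss
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                lr *= factor              # anneal: halve the learning rate
                epochs_since_improvement = 0
    return lr
```

This is the same reduce-on-plateau idea implemented by schedulers such as PyTorch's `ReduceLROnPlateau`, except that Akbik et al. monitor the training loss rather than a validation metric.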
2017
- (Loshchilov & Hutter, 2017) ⇒ Ilya Loshchilov, and Frank Hutter. (2017). “SGDR: Stochastic Gradient Descent with Warm Restarts.” In: Conference Track Proceedings of the 5th International Conference on Learning Representations (ICLR 2017).
2015
- (Sukhbaatar et al., 2015) ⇒ Sainbayar Sukhbaatar, Jason Weston, and Rob Fergus. (2015). “End-to-end Memory Networks.” In: Advances in Neural Information Processing Systems.
- QUOTE: The training procedure we use is the same as the QA tasks, except for the following. For each mini-batch update, the L2 norm of the whole gradient of all parameters is measured[1] and if larger than $L = 50$, then it is scaled down to have norm $L$. This was crucial for good performance. We use the learning rate annealing schedule from Mikolov et al. (2014), namely, if the validation cost has not decreased after one epoch, then the learning rate is scaled down by a factor 1.5. Training terminates when the learning rate drops below $10^{-5}$, i.e. after 50 epochs or so. Weights are initialized using $N(0, 0.05)$ and batch size is set to 128. On the Penn tree dataset, we repeat each training 10 times with different random initializations and pick the one with smallest validation cost. However, we have done only a single training run on Text8 dataset due to limited time constraints.
- ↑ In the QA tasks, the gradient of each weight matrix is measured separately
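The recipe quoted above from Sukhbaatar et al. combines global gradient-norm scaling with validation-based annealing of the learning rate. A minimal sketch under assumed hooks (`epoch_gradients`, `apply_update`, and `validation_cost` are hypothetical stand-ins for the real training code, with gradients as NumPy-like arrays):

```python
import math

def train_with_annealing(epoch_gradients, apply_update, validation_cost,
                         lr=0.01, max_norm=50.0, anneal_factor=1.5, min_lr=1e-5):
    """Scale the whole gradient down to L2 norm max_norm when it is larger,
    and divide the learning rate by anneal_factor whenever the validation
    cost fails to decrease; stop once the rate falls below min_lr."""
    best_cost = float("inf")
    while lr >= min_lr:
        for grads in epoch_gradients():          # hypothetical: mini-batch gradients for one epoch
            norm = math.sqrt(sum(float((g ** 2).sum()) for g in grads))
            if norm > max_norm:
                grads = [g * (max_norm / norm) for g in grads]
            apply_update(grads, lr)              # hypothetical: SGD step with the current rate
        cost = validation_cost()
        if cost < best_cost:
            best_cost = cost
        else:
            lr /= anneal_factor                  # anneal: scale the learning rate down by 1.5
    return lr
```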
2012
- (Zeiler, 2012) ⇒ Matthew D. Zeiler. (2012). “ADADELTA: An Adaptive Learning Rate Method.” In: e-print arXiv:1212.5701.
- QUOTE: There have been several attempts to use heuristics for estimating a good learning rate at each iteration of gradient descent. These either attempt to speed up learning when suitable or to slow down learning near a local minima. Here we consider the latter.
When gradient descent nears a minima in the cost surface, the parameter values can oscillate back and forth around the minima. One method to prevent this is to slow down the parameter updates by decreasing the learning rate. This can be done manually when the validation accuracy appears to plateau. Alternatively, learning rate schedules have been proposed Robinds & Monro (1951) to automatically anneal the learning rate based on how many epochs through the data have been done. These approaches typically add additional hyperparameters to control how quickly the learning rate decays.
1951
- (Robinds & Monro, 1951) ⇒ H. Robinds and S. Monro (1951). “A stochastic approximation method.” In: Annals of Mathematical Statistics, vol. 22, pp. 400-407, 1951.