Learning Rate Annealing Schedule Algorithm
A [[Learning Rate Annealing Schedule Algorithm]] is a [[Learning Rate Schedule Algorithm]] that gradually reduces the [[learning rate]] during [[training]], in a manner inspired by [[simulated annealing]].
* <B>Example(s):</B>
** [[Learning Rate Cosine Annealing Schedule]],
** …
* <B>Counter-Example(s):</B>
** [[Learning Rate Time-based Schedule]],
** [[Learning Rate Step-based Schedule]],
** [[Learning Rate Exponential Decay Schedule]].
* <B>See: </B> [[Learning Rate]], [[Hyperparameter]], [[Gradient Descent Algorithm]].
----
----
== References ==
=== 2021a ===
* (MXNet, 2021) ⇒ https://mxnet.apache.org/versions/1.4.1/tutorials/gluon/learning_rate_schedules_advanced.html Retrieved:2021-7-4.
** QUOTE: Continuing with the idea that smooth [[decay]] profiles give improved [[performance]] over [[stepwise decay]], [[#2017|Ilya Loshchilov, Frank Hutter (2016)]] used [[Cosine Annealing Schedule|“cosine annealing” schedule]]s to good effect. As with [[triangular schedule]]s, the original idea was that this should be used as part of a [[cyclical schedule]], but we begin by implementing the [[cosine annealing]] component before the full [[Stochastic Gradient Descent with Warm Restarts (SGDR) method]] later in the tutorial.
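For illustration, a minimal, framework-independent sketch of the [[cosine annealing]] component described above (this is not the MXNet scheduler API; the names <code>cosine_annealing_lr</code>, <code>base_lr</code>, <code>final_lr</code>, and <code>max_update</code> are illustrative assumptions):
<pre>
import math

def cosine_annealing_lr(t, max_update, base_lr=0.1, final_lr=0.0):
    """Cosine annealing sketch: decay base_lr toward final_lr over max_update steps,
    following lr(t) = final_lr + (base_lr - final_lr) * (1 + cos(pi * t / max_update)) / 2."""
    if t >= max_update:
        return final_lr
    return final_lr + (base_lr - final_lr) * (1 + math.cos(math.pi * t / max_update)) / 2

# Example: learning rate at the start, middle, and end of a 1000-step run.
for step in (0, 500, 1000):
    print(step, round(cosine_annealing_lr(step, max_update=1000), 4))
</pre>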
=== 2021b ===
* (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Learning_rate#Learning_rate_schedule Retrieved:2021-7-4.
** Initial rate can be left as system default or can be selected using a range of techniques. A learning rate schedule changes the learning rate during learning and is most often changed between epochs/iterations. This is mainly done with two parameters: '''decay''' and '''momentum'''. There are many different learning rate schedules but the most common are '''time-based, step-based''' and '''exponential''' (...)
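For illustration, the three common schedule families named above can be sketched as follows (function and parameter names are illustrative; the formulas are the usual time-based $\eta_{n+1} = \eta_n / (1 + d \cdot n)$, step-based $\eta_n = \eta_0 \cdot d^{\lfloor (1+n)/r \rfloor}$, and exponential $\eta_n = \eta_0 e^{-d n}$ decay rules):
<pre>
import math

def time_based_lr(lr_prev, n, decay):
    """Time-based decay: divide the previous rate by (1 + decay * n)."""
    return lr_prev / (1.0 + decay * n)

def step_based_lr(lr0, n, drop=0.5, epochs_per_drop=10):
    """Step-based decay: multiply lr0 by `drop` every `epochs_per_drop` epochs."""
    return lr0 * drop ** math.floor((1 + n) / epochs_per_drop)

def exponential_lr(lr0, n, decay):
    """Exponential decay: lr0 * exp(-decay * n)."""
    return lr0 * math.exp(-decay * n)

# Example: compare the three schedules over a few epochs.
lr0, lr_time = 0.1, 0.1
for epoch in range(5):
    lr_time = time_based_lr(lr_time, epoch, decay=0.01)
    print(epoch, round(lr_time, 5), step_based_lr(lr0, epoch), round(exponential_lr(lr0, epoch, decay=0.1), 5))
</pre>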
=== 2018 ===
* ([[2018_ContextualStringEmbeddingsforSe|Akbik et al., 2018]]) ⇒ [[Alan Akbik]], [[Duncan Blythe]], and [[Roland Vollgraf]]. (2018). &ldquo;[https://www.aclweb.org/anthology/C18-1139.pdf Contextual String Embeddings for Sequence Labeling].&rdquo; In: [[Proceedings of the 27th International Conference on Computational Linguistics, (COLING 2018)]].
** QUOTE: We train the [[sequence tagging model]] using vanilla [[SGD]] with no [[momentum]], [[clipping gradient]]s at 5, for 150 [[epoch]]s. </s> We employ a simple [[learning rate annealing method]] in which we halve the [[learning rate]] if [[training loss]] does not fall for 5 consecutive [[epoch]]s. </s>
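A minimal sketch of the annealing rule described in this quote, i.e. halving the [[learning rate]] when the [[training loss]] has not fallen for 5 consecutive [[epoch]]s (the class name and defaults are illustrative assumptions, not the authors' code):
<pre>
class HalvingOnPlateau:
    """Halve the learning rate if the monitored loss has not fallen
    for `patience` consecutive epochs (illustrative sketch)."""

    def __init__(self, lr=0.1, patience=5):
        self.lr = lr
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, train_loss):
        """Call once per epoch with the epoch's training loss; returns the new lr."""
        if train_loss < self.best_loss:
            self.best_loss = train_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
            if self.epochs_without_improvement >= self.patience:
                self.lr /= 2.0
                self.epochs_without_improvement = 0
        return self.lr
</pre>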
=== 2017 ===
* ([[2017_SGDRStochasticGradientDescentwi|Loshchilov & Hutter, 2017]]) ⇒ [[Ilya Loshchilov]], and [[Frank Hutter]]. (2017). &ldquo;[https://openreview.net/pdf?id=Skq89Scxx SGDR: Stochastic Gradient Descent with Warm Restarts].&rdquo; In: [[Conference Track Proceedings of the 5th International Conference on Learning Representations (ICLR 2017)]].
=== 2015 ===
* ([[2015_EndtoEndMemoryNetworks|Sukhbaatar et al., 2015]]) ⇒ [[Sainbayar Sukhbaatar]], [[Jason Weston]], and [[Rob Fergus]]. (2015). “[http://papers.nips.cc/paper/5846-end-to-end-memory-networks.pdf End-to-end Memory Networks].&rdquo; In: Advances in Neural Information Processing Systems.
** QUOTE: The [[training procedure]] we use is the same as the [[QA task]]s, except for the following. </s> For each [[mini-batch update]], the [[L2 norm]] of the whole [[gradient]] of all [[parameter]]s is [[measured]]<ref name="ftn-5">In the [[QA task]]s, the [[gradient]] of each [[weight matrix]] is measured separately</ref> and if larger than $L = 50$, then it is scaled down to have [[norm]] $L$. </s> This was crucial for good [[performance]]. </s> We use the [[learning rate annealing schedule]] from [[#2014_Mikolov|Mikolov et al. (2014)]], namely, if the [[validation cost]] has not decreased after one epoch, then the [[learning rate]] is scaled down by a factor 1.5. [[Training]] terminates when the [[learning rate]] drops below $10^{-5}$, i.e. after 50 [[epoch]]s or so. </s> [[Weight]]s are initialized using $N(0, 0.05)$ and [[batch size]] is set to 128. </s> On the [[Penn tree dataset]], we repeat each [[training]] 10 times with different [[random initialization]]s and pick the one with smallest [[validation cost]]. </s> However, we have done only a [[single training run]] on [[Text8 dataset]] due to limited [[time constraint]]s. </s>
<references/>
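A minimal sketch of the procedure quoted above, combining global [[gradient]] [[norm]] scaling with annealing by a factor of 1.5 when the [[validation cost]] does not decrease (the <code>run_epoch</code> hook and function names are illustrative placeholders, not code from the paper):
<pre>
def clip_global_norm(gradients, max_norm=50.0):
    """Scale the whole gradient down if its L2 norm exceeds max_norm."""
    total_norm = sum(g ** 2 for g in gradients) ** 0.5
    if total_norm > max_norm:
        scale = max_norm / total_norm
        gradients = [g * scale for g in gradients]
    return gradients

def anneal_until_converged(run_epoch, lr=0.01, factor=1.5, min_lr=1e-5, max_epochs=1000):
    """Anneal as described in the quote: scale lr down by `factor` whenever the
    validation cost does not decrease after an epoch; stop once lr drops below min_lr.
    `run_epoch(lr)` is an illustrative placeholder that trains for one epoch and
    returns the validation cost."""
    best_cost = float("inf")
    for _ in range(max_epochs):
        if lr < min_lr:
            break
        cost = run_epoch(lr)
        if cost < best_cost:
            best_cost = cost
        else:
            lr /= factor
    return best_cost, lr
</pre>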
=== 2012 ===
* ([[2012_ADADELTAAnAdaptiveLearningRateM|Zeiler, 2012]]) ⇒ [[Matthew D. Zeiler]]. (2012). &ldquo;[https://arxiv.org/pdf/1212.5701.pdf ADADELTA: An Adaptive Learning Rate Method].&rdquo; In: e-print [https://arxiv.org/abs/1212.5701 arXiv:1212.5701].
** QUOTE: There have been several attempts to use [[heuristic]]s for [[estimating]] a good [[learning rate]] at each [[iteration]] of [[gradient descent]]. </s> These either attempt to [[speed up]] [[learning]] when suitable or to slow down [[learning]] near a [[local minima]]. </s> Here we consider the latter. </s> <P> When [[gradient descent]] nears a [[minima]] in the [[cost surface]], the [[parameter value]]s can oscillate back and forth around the [[minima]]. </s> One [[method]] to prevent this is to slow down the [[parameter update]]s by decreasing the [[learning rate]]. </s> This can be done manually when the [[validation]] [[accuracy]] appears to [[plateau]]. </s> Alternatively, [[learning rate schedule]]s have been proposed [[#1951_Robinds|Robinds & Monro (1951)]] to [[automatically anneal]] the [[learning rate]] based on how many [[epoch]]s through the [[data]] have been done. </s> These [[approach]]es typically add additional [[hyperparameter]]s to control how quickly the [[learning rate]] [[decay]]s. </s>
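For context, the [[#1951_Robinds|Robbins & Monro (1951)]] analysis cited here requires the annealed step sizes $\eta_t$ to satisfy $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$; a schedule such as $\eta_t = \eta_0 / t$ meets both conditions, which is one classical justification for annealing the [[learning rate]] over [[epoch]]s.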
 
=== 1951 ===
* <span id="1951_Robinds">(Robbins & Monro, 1951)</span> ⇒ H. Robbins, and S. Monro. (1951). &ldquo;A Stochastic Approximation Method.&rdquo; In: Annals of Mathematical Statistics, vol. 22, pp. 400-407.


----
__NOTOC__
[[Category:Concept]]
[[Category:Machine Learning]]
