Learning Rate
A Learning Rate is a hyperparameter that controls the size of the parameter updates made while training a model, such as an artificial neural network.
- Context:
- It can range from being a Manually Tuned Learning Rate to being an Automatically Tuned Learning Rate.
- Example(s):
- An Adaptive Learning Rate.
- A Cyclical Learning Rate.
- In the gradient descent algorithm, the learning rate is the step size [math]\displaystyle{ \eta }[/math] in the update rule (see the sketch just before the References section):
[math]\displaystyle{ w(i+1) = w(i) - \eta \nabla Q(w(i)) }[/math]
where [math]\displaystyle{ w }[/math] is the parameter vector that minimizes the loss function [math]\displaystyle{ Q(w) }[/math] and [math]\displaystyle{ i }[/math] is the iteration index.
- …
- Counter-Example(s):
- See: Adaptive Gradient Algorithm, Stochastic Optimization, Convex Optimization, Gradient Descent, Proximal Function, Hebb's Rule, Loss Function, Neural Network Topology, Stochastic Gradient Descent, Adam, Backprop.
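The role of the learning rate in the gradient descent update above can be illustrated with a minimal, library-free sketch; the quadratic loss [math]\displaystyle{ Q(w) = (w-3)^2 }[/math] is an arbitrary choice for illustration.

```python
def gradient_descent(grad_Q, w0, eta=0.1, num_iters=100):
    """Minimize Q via the update w(i+1) = w(i) - eta * grad_Q(w(i))."""
    w = w0
    for _ in range(num_iters):
        w = w - eta * grad_Q(w)
    return w

# Q(w) = (w - 3)^2, so grad_Q(w) = 2 * (w - 3).
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0, eta=0.1)
print(w_star)  # close to 3.0; a too-large eta (e.g. 1.5) makes the iterates diverge
```

With eta=0.1 each step contracts the error toward the minimizer by a factor of 0.8; with eta=1.5 the factor is -2, so the error grows, which is the divergence behavior a too-large learning rate produces.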
References
2018a
- (Google ML Glossary, 2018) ⇒ "learning rate". In: Machine Learning Glossary. https://developers.google.com/machine-learning/glossary/ Retrieved: 2018-04-22.
- QUOTE: A scalar used to train a model via gradient descent. During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step. Learning rate is a key hyperparameter.
2018b
- (DL4J, 2018) ⇒ "Learning Rate". https://deeplearning4j.org/troubleshootingneuralnets#lrate Retrieved: 2018-04-22.
- QUOTE: The learning rate is one of, if not the most important hyperparameter. If this is too large or too small, your network may learn very poorly, very slowly, or not at all. Typical values for the learning rate are in the range of 0.1 to 1e-6, though the optimal learning rate is usually data (and network architecture) specific. Some simple advice is to start by trying three different learning rates – 1e-1, 1e-3, and 1e-6 – to get a rough idea of what it should be, before further tuning this. Ideally, run models with different learning rates simultaneously to save time.
The usual approach to selecting an appropriate learning rate is to use DL4J’s visualization interface to visualize the progress of training. You want to pay attention to both the loss over time, and the ratio of update magnitudes to parameter magnitudes (a ratio of approximately 1:1000 is a good place to start). For more information on tuning the learning rate, see this link. For training neural networks in a distributed manner, you may need a different (frequently higher) learning rate compared to training the same network on a single machine.
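A framework-independent sketch of the two diagnostics suggested above: a coarse sweep over widely spaced rates, and the ratio of update magnitudes to parameter magnitudes. The helper names `sweep_learning_rates` and `train_fn` are hypothetical placeholders, not part of the DL4J API.

```python
import numpy as np

def update_to_param_ratio(params, update):
    """Ratio of update magnitude to parameter magnitude; values near
    1e-3 match the 1:1000 starting point suggested above."""
    return np.linalg.norm(update) / (np.linalg.norm(params) + 1e-12)

def sweep_learning_rates(train_fn, rates=(1e-1, 1e-3, 1e-6)):
    """Coarse sweep: briefly train at each rate and record the final loss.
    train_fn(eta) is assumed to train a fresh model and return its loss."""
    return {eta: train_fn(eta) for eta in rates}
```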
2018c
- (Wikipedia, 2018) ⇒ https://www.wikiwand.com/en/Stochastic_gradient_descent#/Background Retrieved: 2018-04-22.
- QUOTE: The sum-minimization problem also arises for empirical risk minimization. In this case, [math]\displaystyle{ Q_i(w) }[/math] is the value of the loss function at [math]\displaystyle{ i }[/math]-th example, and [math]\displaystyle{ Q(w) }[/math] is the empirical risk.
When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:
[math]\displaystyle{ w := w - \eta \nabla Q(w) = w - \eta \sum_{i=1}^n \nabla Q_i(w)/n, }[/math]
where [math]\displaystyle{ \eta }[/math] is a step size (sometimes called the learning rate in machine learning).
In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations.
However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems[1].
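A minimal sketch of the subsampling idea just described: each step estimates the full sum-gradient from a random mini-batch of summand functions. The linear least-squares loss is an illustrative assumption, not part of the quoted text.

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.01, batch_size=32, num_steps=1000, seed=0):
    """SGD for Q(w) = (1/n) sum_i (x_i . w - y_i)^2, sampling a subset
    of summand functions at every step instead of summing all n gradients."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        idx = rng.integers(0, n, size=batch_size)        # sampled subset
        grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        w -= eta * grad                                  # eta is the step size
    return w
```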
2018d
- (sklearn, 2018) ⇒ http://scikit-learn.org/stable/modules/neural_networks_supervised.html#algorithms Retrieved: 2018-04-22.
- QUOTE: MLP trains using Stochastic Gradient Descent, Adam, or L-BFGS. Stochastic Gradient Descent (SGD) updates parameters using the gradient of the loss function with respect to a parameter that needs adaptation, i.e.
[math]\displaystyle{ w \leftarrow w - \eta (\alpha \frac{\partial R(w)}{\partial w}+ \frac{\partial Loss}{\partial w}) }[/math]
where [math]\displaystyle{ \eta }[/math] is the learning rate which controls the step-size in the parameter space search. Loss is the loss function used for the network.
More details can be found in the documentation of SGD.
Adam is similar to SGD in a sense that it is a stochastic optimizer, but it can automatically adjust the amount to update parameters based on adaptive estimates of lower-order moments.
With SGD or Adam, training supports online and mini-batch learning.
L-BFGS is a solver that approximates the Hessian matrix which represents the second-order partial derivative of a function. Further it approximates the inverse of the Hessian matrix to perform parameter updates. The implementation uses the Scipy version of L-BFGS.
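As a usage example of the interface quoted above, the snippet below fits scikit-learn's MLPClassifier with the SGD solver and an explicit initial learning rate; the toy dataset and parameter values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# solver='sgd' applies the update rule quoted above: alpha is the L2
# penalty weight and learning_rate_init is the step size eta.
clf = MLPClassifier(solver='sgd', alpha=1e-4, learning_rate_init=0.01,
                    hidden_layer_sizes=(50,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```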
2017
- (Smith, 2017) ⇒ Leslie N. Smith (2017). "Cyclical Learning Rates for Training Neural Networks" (PDF). In: Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 464-472). IEEE. arXiv:1506.01186
- ABSTRACT: It is known that the learning rate is the most important hyper-parameter to tune for training deep neural networks. This paper describes a new method for setting the learning rate, named cyclical learning rates, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates. Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values. Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations. This paper also describes a simple way to estimate "reasonable bounds" -- linearly increasing the learning rate of the network for a few epochs. In addition, cyclical learning rates are demonstrated on the CIFAR-10 and CIFAR-100 datasets with ResNets, Stochastic Depth networks, and DenseNets, and the ImageNet dataset with the AlexNet and GoogLeNet architectures. These are practical tools for everyone who trains neural networks.
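The paper's "triangular" policy can be written in a few lines; this sketch follows the rule described in the paper, with base_lr and max_lr as the boundary values and step_size as the half-cycle length (all user-chosen).

```python
import math

def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Triangular cyclical learning rate: ramps linearly from base_lr to
    max_lr over step_size iterations, back down, and repeats."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

The paper's accompanying range test (linearly increasing the rate for a few epochs) is what suggests reasonable base_lr and max_lr bounds.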
2013
- (Ranganath et al., 2013) ⇒ Rajesh Ranganath, Chong Wang, David M. Blei, and Eric P. Xing (2013, February). "An adaptive learning rate for stochastic variational inference" (PDF). In: Proceedings of The International Conference on Machine Learning (pp. 298-306).
- ABSTRACT: Stochastic variational inference finds good posterior approximations of probabilistic models with very large datasets. It optimizes the variational objective with stochastic optimization, following noisy estimates of the natural gradient. Operationally, stochastic inference iteratively subsamples from the data, analyzes the subsample, and updates parameters with a decreasing learning rate. However, the algorithm is sensitive to that rate, which usually requires hand-tuning to each application. We solve this problem by developing an adaptive learning rate for stochastic variational inference. Our method requires no tuning and is easily implemented with computations already made in the algorithm. We demonstrate our approach with latent Dirichlet allocation applied to three large text corpora. Inference with the adaptive learning rate converges faster and to a better approximation than the best settings of hand-tuned rates.
2012
- (Wilson, 2012) ⇒ Bill Wilson (1998-2012). "learning rate". In: The Machine Learning Dictionary.
- QUOTE: a constant used in error backpropagation learning and other artificial neural network learning algorithms to affect the speed of learning. The mathematics of e.g. backprop are based on small changes being made to the weights at each step: if the changes made to weights are too large, the algorithm may "bounce around" the error surface in a counter-productive fashion. In this case, it is necessary to reduce the learning rate. On the other hand, the smaller the learning rate, the more steps it takes to get to the stopping criterion. See also momentum.
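Since the entry closes by pointing to momentum, here is a minimal sketch of how the two hyperparameters interact in an update; this is illustrative and not taken from the quoted dictionary.

```python
def momentum_step(w, v, grad, eta=0.01, momentum=0.9):
    """One SGD-with-momentum update: the velocity v accumulates a decaying
    sum of past gradients, damping the 'bouncing around' described above,
    while eta scales each new gradient contribution."""
    v = momentum * v - eta * grad
    return w + v, v
```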
2010
- (Schaul & Schmidhuber, 2010) ⇒ Tom Schaul and Juergen Schmidhuber (2010). "Metalearning". In: Scholarpedia, 5(6):4650. doi:10.4249/scholarpedia.4650
- QUOTE: In neural networks, synaptic weights can be changed by a single global algorithm (e.g. Backprop) or by local learning rules on the synapse level (e.g. Hebb's rule). In the latter case, metalearning can be used to determine how to use the local learning rule. For example, the local rules may depend on their position in the network structure, and evolution is used to pick good ones – one application is a robot navigation task by Mondada and Floreano (1996).
Notation for this example (robot navigation task):
- [math]\displaystyle{ D }[/math]: observation-action-reward sequences.
- [math]\displaystyle{ D_T }[/math]: elements of [math]\displaystyle{ D }[/math].
- [math]\displaystyle{ \phi }[/math]: navigation performance.
- [math]\displaystyle{ \pi_\theta }[/math]: neural network controller.
- [math]\displaystyle{ \theta }[/math]: network weights.
- [math]\displaystyle{ L_\mu }[/math]: variants of Hebbian learning.
- [math]\displaystyle{ \mu }[/math]: learning rules, learning rate.
- Metalearning technique: evolution.