Adaptive Gradient (AdaGrad) Algorithm

An Adaptive Gradient (AdaGrad) Algorithm is a gradient descent-based learning algorithm with a learning rate per parameter.

Context:
- It was first developed by Duchi et al. (2011).
- …
Example(s):
- an Adagrad Dual Averaging algorithm (AdagradDA), e.g. tf.train.AdagradDAOptimizer [1];
- a Proximal AdaGrad, e.g. tflearn.optimizers.ProximalAdaGrad [2];
- tf.train.AdagradOptimizer [3];
- ADAGRAD from the CRAN gradDescent Repository [4];
- chainer.optimizers.AdaGrad [5], [6];
- torch.optim.Adagrad [7], [8];
- tflearn.optimizers.AdaGrad [9].
- …
Counter-Example(s):
See: Stochastic Optimization, Convex Optimization, Learning Rate, Gradient Descent, Outer Product, Hadamard Matrix Product, Euclidean Norm, Proximal Function.

References

2018a

(ML Glossary, 2018) ⇒ (2018). AdaGrad. In: Machine Learning Glossary https://developers.google.com/machine-learning/glossary/ Retrieved: 2018-04-22.
- QUOTE: A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each parameter an independent learning rate. For a full explanation, see this paper.

2018b

(Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Stochastic_gradient_descent#AdaGrad Retrieved: 2018-04-22.
- AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent with per-parameter learning rate, first published in 2011.^[1] Informally, this increases the learning rate for more sparse parameters and decreases the learning rate for less sparse ones. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition. It still has a base learning rate [math]\displaystyle{ \eta }[/math], but this is multiplied with the elements of a vector [math]\displaystyle{ \{G_{j,j}\} }[/math] which is the diagonal of the outer product matrix : [math]\displaystyle{ G = \sum_{\tau=1}^t g_\tau g_\tau^\mathsf{T} }[/math] where [math]\displaystyle{ g_\tau = \nabla Q_i(w) }[/math], the gradient, at iteration [math]\displaystyle{ \tau }[/math]. The diagonal is given by : [math]\displaystyle{ G_{j,j} = \sum_{\tau=1}^t g_{\tau,j}^2 }[/math]. This vector is updated after every iteration. The formula for an update is now : [math]\displaystyle{ w := w - \eta\, \mathrm{diag}(G)^{-\frac{1}{2}} \circ g }[/math] or, written as per-parameter updates, : [math]\displaystyle{ w_j := w_j - \frac{\eta}{\sqrt{G_{j,j}}} g_j. }[/math] Each [math]\displaystyle{ \{G_{(i,i)}\} }[/math] gives rise to a scaling factor for the learning rate that applies to a single parameter [math]\displaystyle{ w_i }[/math]. Since the denominator in this factor, [math]\displaystyle{ \sqrt{G_i} = \sqrt{\sum_{\tau=1}^t g_\tau^2} }[/math], is the ℓ₂ norm of previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates.^[2] While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.

↑ Duchi et al., 2011
↑ Perla, Joseph (2014). "Notes on AdaGrad" (PDF).
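
The per-parameter update quoted above can be sketched in a few lines of plain Python. This is a minimal illustration and not any library's implementation: the function name `adagrad_step`, the list-based representation, and the small constant `eps` (added to avoid division by zero, and not part of the quoted formula) are all assumptions made for the example.

```python
import math


def adagrad_step(w, g, G_diag, eta=0.1, eps=1e-8):
    """Apply one AdaGrad update (hypothetical helper, for illustration only).

    w      -- current parameters (list of floats)
    g      -- gradient at w (list of floats)
    G_diag -- running sums of squared gradients, one per parameter
    eta    -- base learning rate
    eps    -- assumed stabiliser to avoid dividing by zero
    Returns the new parameters and the new accumulator.
    """
    # G_{j,j} accumulates the squared gradient of parameter j.
    new_G = [G_j + g_j * g_j for G_j, g_j in zip(G_diag, g)]
    # w_j := w_j - eta / sqrt(G_{j,j}) * g_j
    new_w = [
        w_j - eta / (math.sqrt(G_j) + eps) * g_j
        for w_j, G_j, g_j in zip(w, new_G, g)
    ]
    return new_w, new_G


# Two parameters whose gradients differ by a factor of ten.
w, G = adagrad_step([1.0, 1.0], [2.0, 20.0], [0.0, 0.0], eta=0.1)
print(w, G)
```

In this sketch both parameters move by about the same amount on the first step even though their gradients differ tenfold, because each gradient is divided by the square root of its own accumulated square.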

2018c

(Wijaya et al., 2018) ⇒ Galih Praja Wijaya, Dendi Handian, Imam Fachmi Nasrulloh, Lala Septem Riza, Rani Megasari, Enjun Junaeti (2018). "gradDescent: Gradient Descent for Regression Tasks". Reference manual (PDF).
- QUOTE: An implementation of various learning algorithms based on Gradient Descent for dealing with regression tasks. The variants of gradient descent algorithm are: Mini-Batch Gradient Descent (MBGD), which is an optimization to use training data partially to reduce the computation load. Stochastic Gradient Descent (SGD), which is an optimization to use a random data in learning to reduce the computation load drastically. Stochastic Average Gradient (SAG), which is a SGD-based algorithm to minimize stochastic step to average. Momentum Gradient Descent (MGD), which is an optimization to speed-up gradient descent learning. Accelerated Gradient Descent (AGD), which is an optimization to accelerate gradient descent learning. Adagrad, which is a gradient-descent-based algorithm that accumulate previous cost to do adaptive learning. Adadelta, which is a gradient-descent-based algorithm that use hessian approximation to do adaptive learning. RMSprop, which is a gradient-descent-based algorithm that combine Adagrad and Adadelta adaptive learning ability. Adam, which is a gradient-descent-based algorithm that mean and variance moment to do adaptive learning. Stochastic Variance Reduce Gradient (SVRG), which is an optimization SGD-based algorithm to accelerates the process toward converging by reducing the gradient. Semi Stochastic Gradient Descent (SSGD),which is a SGD-based algorithm that combine GD and SGD to accelerates the process toward converging by choosing one of the gradients at a time. Stochastic Recursive Gradient Algorithm (SARAH), which is an optimization algorithm similarly SVRG to accelerates the process toward converging by accumulated stochastic information. Stochastic Recursive Gradient Algorithm+ (SARAHPlus), which is a SARAH practical variant algorithm to accelerates the process toward converging provides a possibility of earlier termination.

2018d

(DL4J) ⇒ https://deeplearning4j.org/updater#adagrad Retrieved: 2018-04-29
- QUOTE: Adagrad scales alpha for each parameter according to the history of gradients (previous steps) for that parameter. That’s basically done by dividing the current gradient in the update rule by the sum of previous gradients. As a result, when the gradient is very large, alpha is reduced, and vice-versa.
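
As a small worked example of this shrinking step size, assume the scaling uses the square root of the summed squared gradients, as in the Wikipedia formula quoted on this page (the DL4J wording, "sum of previous gradients", is looser). The values below are illustrative assumptions: a base rate of 0.1 and a single parameter whose gradient is 2 at every step.

```latex
\begin{align*}
G_t &= \sum_{\tau=1}^{t} g_\tau^2 = 4t, \\
\frac{\eta}{\sqrt{G_t}} &= \frac{0.1}{2\sqrt{t}}
  \quad\Rightarrow\quad 0.05 \ (t=1), \quad 0.025 \ (t=4), \quad 0.005 \ (t=100).
\end{align*}
```

Under these assumptions the effective rate falls in proportion to the inverse square root of t, so a parameter that keeps receiving large gradients takes progressively smaller steps.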

2017

(Kakade, 2017) ⇒ Sham Kakade (2017). Adaptive Gradient Methods AdaGrad/Adam (PDF). In: Machine Learning for Big Data CSE547/STAT548, University of Washington.

2016

(Pasupat, 2016) ⇒ Panupong Pasupat (2016). AdaGrad - Adaptive Subgradient Methods. Retrieved: 2018-04-22.
- QUOTE: AdaGrad is an optimization method that allows different step sizes for different features. It increases the influence of rare but informative features(...)
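
To illustrate how different step sizes arise for frequent and rare features, the sketch below compares the effective learning rate of two parameters after 100 steps. The helper `effective_rates`, the synthetic gradient history, and the constant `eps` are hypothetical and chosen only for this example; the scaling rule assumed is the base rate divided by the square root of the summed squared gradients.

```python
import math


def effective_rates(grad_history, eta=0.1, eps=1e-8):
    """Return eta / (sqrt(sum of squared gradients) + eps) per parameter.

    grad_history is a list of gradient vectors (lists of floats),
    one per step. Hypothetical helper for illustration only.
    """
    n_params = len(grad_history[0])
    G = [0.0] * n_params
    for g in grad_history:
        for j, g_j in enumerate(g):
            G[j] += g_j * g_j
    return [eta / (math.sqrt(G_j) + eps) for G_j in G]


# Parameter 0 belongs to a feature seen at every step;
# parameter 1 belongs to a feature seen only once in 100 steps.
history = [[1.0, 1.0 if t == 0 else 0.0] for t in range(100)]
rates = effective_rates(history)
print(rates)
```

Here the frequently updated parameter ends with an effective rate of roughly 0.01, while the rarely updated one keeps roughly 0.1, so the next occurrence of the rare feature moves its weight about ten times further.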

2011

(Duchi et al., 2011) ⇒ John Duchi, Elad Hazan, and Yoram Singer. (2011). “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” In: The Journal of Machine Learning Research, 12.
- ABSTRACT: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms.
