2013 AcceleratingStochasticGradientD
- (Johnson & Zhang, 2013) ⇒ Rie Johnson, and Tong Zhang. (2013). “Accelerating Stochastic Gradient Descent Using Predictive Variance Reduction.” In: Proceedings of Advances in Neural Information Processing Systems 26 (NIPS 2013).
Subject Headings: Stochastic Variance Reduced Gradient Algorithm.
Notes
Cited By
Quotes
Abstract
Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning.
1 Introduction
In machine learning, we often encounter the following optimization problem. Let 1L1, . . . 41,, be a sequence of vector functions from Rd to R. Our goal is to find an approximate solution of the following optimization problem
' 1 TL m1nP(w), P<w) :: g 2 114w). (1) 1:1 For example, given a sequence of n training examples (901,111; . . . , (907171171), where x,- 6 Rd and
y,- E R, if we use the squared loss who) : (wai 7 31,-), then we can obtain least squares regression. In this case, ¢,(-) represents a loss function in machine learning. One may also include regularization conditions. For example, if we take $1110) : 1n(1 + exp(inx,-yi)) + 0.5AwTw (y,- 6 {i1}), then the optimization problem becomes regularized logistic regression.
A standard method is gradient descent, which can be described by the following update rule for t : 1, 2, . . .
TL 7 7 7 7] 7 w“) : w“ 1) 7 WPW 1)) = w“ 1) 7 j ZWM“ I)). (2) 1:1
However, at each step, gradient descent requires evaluation of n derivatives, which is expensive. A popular modification is stochastic gradient descent (SGD): where at each iteration t : 1, 2, . . ., we
draw it randomly from {1, . . . , n} and w“) : wail) 7 qupixwfiilb. (3)
The expectation R[wm \w(t’1)] is identical to (2). A more general version of SGD is the following
w“) : wail) 7 Ut9t(w(til)a £07 (4)
where g, is a random variable that may depend on wfi’l), and the expectation (with respect to §,) E[9t(w(t’1),§t)]w(t’l)] : VP(w(t’1)). The advantage of stochastic gradient is that each step only relies on a single derivative WM), and thus the computational cost is 1/7; that of the standard gradient descent. However, a disadvantage of the method is that the randomness introduces variance — this is caused by the fact that gt(w(t’1),§t) equals the gradient VP(w(t’1)) in expectation, but each gt(w(t’1),§t) is different. In particular, if gt(w(t’1),§t) is large, then we have a relatively large variance which slows down the convergence. For example, consider the case that each who) is smooth
Mm) 7 wxw’) 7 0.5LHw 7 w’Hz : Wi<w’)T<w 7 w’), (5)
and convex; and P(w) is strongly convex P(w) *PW') *0~5VHW*W'H§ Z VP(W')T(w*w’)7 (6)
where L 2 y 2 0. As long as we pick m as a constant 7] < 1/L, we have linear convergence of O((l 7 y/LY) Nesterov [2004]. However, for SGD, due to the variance of random sampling, we generally need to choose m : O(l/t) and obtain a slower sub—linear convergence rate of O(l/t). This means that we have a trade—off of fast computation per iteration and slow convergence for SGD versus slow computation per iteration and fast convergence for gradient descent. Although the fast computation means it can reach an approximate solution relatively quickly, and thus has been proposed by various researchers for large scale problems Zhang [2004], ShaleV—Shwartz et al. [2007] (also see Leon Bottou’s Webpage http: / /1eon . bottou . org/project s / sgd), the convergence slows down when we need a more accurate solution.
In order to improve SGD, one has to design methods that can reduce the variance, which allows us to use a larger learning rate m. Two recent papers Le Roux et al. [2012], ShaleV—Shwartz and Zhang [2012] proposed methods that achieve such a variance reduction effect for SGD, which leads to a linear convergence rate when ¢,(w) is smooth and strongly convex. The method in Le Roux et al. [2012] was referred to as SAG (stochastic average gradient), and the method in ShaleV—Shwartz and Zhang [2012] was referred to as SDCA. These methods are suitable for training convex linear prediction problems such as logistic regression or least squares regression, and in fact, SDCA is the method implemented in the popular lib—SVM package Hsieh et al. [2008]. However, both propos— als require storage of all gradients (or dual variables). Although this issue may not be a problem for training simple regularized linear prediction problems such as least squares regression, the re— quirement makes it unsuitable for more complex applications where storing all gradients would be impractical. One example is training certain structured learning problems with convex loss, and an— other example is training nonconvex neural networks. In order to remedy the problem, we propose a different method in this paper that employs explicit variance reduction without the need to store the intermediate gradients. We show that if who) is strongly convex and smooth, then the same con— vergence rate as those of Le Roux et al. [2012], ShaleV—Shwartz and Zhang [2012] can be obtained. Even if who) is nonconvex (such as neural networks), under mild assumptions, it can be shown that asymptotically the variance of SGD goes to zero, and thus faster convergence can be achieved. In summary, this work makes the following three contributions:
0 Our method does not require the storage of full gradients, and thus is suitable for some problems where methods such as Le Roux et al. [2012], ShaleV—Shwartz and Zhang [2012] cannot be applied.
0 We provide a much simpler proof of the linear convergence results for smooth and strongly convex loss, and our View provides a significantly more intuitive explanation of the fast convergence by explicitly connecting the idea to variance reduction in SGD. The resulting insight can easily lead to additional algorithmic development.
c The relatively intuitive variance reduction explanation also applies to nonconveX optimiza— tion problems, and thus this idea can be used for complex problems such as training deep neural networks.
2 Stochastic Variance Reduced Gradient
One practical issue for SGD is that in order to ensure convergence the learning rate [math]\displaystyle{ \eta_t }[/math] has to decay to zero. This leads to slower convergence. The need for a small learning rate is due to the variance of SGD (that is, SGD approximates the full gradient using a small batch of samples or even a single example, and this introduces variance). However, there is a fix described below. At each time, we keep a version of estimated [math]\displaystyle{ w }[/math] as [math]\displaystyle{ \tilde{w} }[/math] that is close to the optimal [math]\displaystyle{ w }[/math]. For example, we can keep a snapshot of [math]\displaystyle{ \tilde{w} }[/math] after every [math]\displaystyle{ m }[/math] SGD iterations. Moreover, we maintain the average gradient
[math]\displaystyle{ \tilde{\mu}=\Delta P (\tilde{w})=\dfrac{1}{n}\sum_i=1^n\Delta \psi_i(\tilde{w}) }[/math],
and its computation requires one pass over the data using [math]\displaystyle{ \tilde{w} }[/math]. Note that the expectation of [math]\displaystyle{ \Delta \psi_i(\tilde{w}) - \tilde{\mu} }[/math] over [math]\displaystyle{ i }[/math] is zero, and thus the following update rule is generalized SGD: randomly draw [math]\displaystyle{ i_t }[/math] from [math]\displaystyle{ \{1,\cdots, n\} }[/math]:
[math]\displaystyle{ ... }[/math]
We thus have
[math]\displaystyle{ ... }[/math]
That is, if we let the random variable g,.[math]\displaystyle{ ... }[/math] then (7) is a special case of (4).
The update rule in (7) can also be obtained by defining the auxiliary function
[math]\displaystyle{ ... }[/math]
Since [math]\displaystyle{ ... }[/math] we know that
[math]\displaystyle{ ... }[/math]
Now we may apply the standard SGD to the new representation [math]\displaystyle{ ... }[/math] and obtain the update rule (7).
To see that the variance of the update rule (7) is reduced, we note that when both [math]\displaystyle{ ... }[/math] and [math]\displaystyle{ ... }[/math] converge to the same parameter [math]\displaystyle{ ... }[/math] then [math]\displaystyle{ ... }[/math]. Therefore if [math]\displaystyle{ ... }[/math], then
[math]\displaystyle{ ... }[/math]
This argument will be made more rigorous in the next section, where we will analyze the algorithm in Figure 1 that summarizes the ideas described in this section. We call this method stochastic variance reduced gradient (SVRG) because it explicitly reduces the variance of SGD. Unlike SGD, the learning rate [math]\displaystyle{ m }[/math] for SVRG does not have to decay, which leads to faster convergence as one can use a relatively large learning rate. This is confirmed by our experiments.
In practical implementations, it is natural to choose option I, or take [math]\displaystyle{ ... }[/math] to be the average of the past 25 iterates. However, our analysis depends on option II. Note that each stage [math]\displaystyle{ ... }[/math] requires [math]\displaystyle{ 2m + n }[/math] gradient computations (for some convex problems, one may save the intermediate gradients [math]\displaystyle{ ... }[/math]), and thus only [math]\displaystyle{ m + n }[/math] gradient computations are needed). Therefore it is natural to choose m to be the same order of [math]\displaystyle{ h }[/math] but slightly larger (for example [math]\displaystyle{ m = 2n }[/math] for convex problems and [math]\displaystyle{ m = 5n }[/math] for nonconvex problems in our experiments). In comparison, standard SGD requires only m gradient computations. Since gradient may be the computationally most intensive operation, for fair comparison, we compare SGD to SVRG based on the number of gradient computations.
3 Analysis
For simplicity we will only consider the case that each 11410) is convex and smooth, and P(w) is strongly convex.
Theorem 1. Consider SVRG in Figure I with option II. Assume that all % are convex and both (5) and (6) hold with 'y > 0. Let 10* : arg minw P(w). Assume that m is suficiently large so that
1 2L7] a = m m then we have geometric convergence in expectation for SVRG: EPWJSEPWQ+MPWM7PWM
<1,
Figure 1: Stochastic Variance Reduced Gradient
Proof. Given any 1’, consider
91(10): WKW) 7 ¢i(w*) 7 V¢¢(w*)T(w 7 10*). We know that 9410*) : minw g)(w) since Vgi(w*) : 0. Therefore
0 = 9100*) S ngnth-(w 7 Wgz-(le
S Hgnbz-(IU) 7 nllng-(IUMIS + 0-5Ln2 llng-(w) H3] = 91W) 7 fiHVm-(Wl? Thatis, llv¢t(w) 7 V¢t(w*)l|§ S 2L[¢¢(W) 7 ¢an 7 V¢i(w*)T(w 7 1%)]-
By summing the above inequality overt : 1, . . . ,n, and using the fact that VP(w*) : 0, we obtain
n71; \vaw) 7 WKMMIE : 2L[P<w) 7 PM]. (8)
We can now proceed to prove the theorem. Let 1}: : Vim: (wt71) 7 Vim: (111) + [1. Conditioned on wt71, we can take expectation with respect to it, and obtain:
E Hvtllg 32E l|v¢it(wt7l) 7 WEN
- 2E Hv¢tt (10:71) 7 V¢u<
7 E [mm 7 mm H2 32E ”VIM (10:71) 7 VIM (10* ”3 + 2 E ”VIP“ (71)) 7 VIM (100”; g4L[P(w,71) 7 PM) + Pan) 7 Pam].
H3 + 2 E HWiflZ-M) 7 V111: (MN 7 VPWMIE H3 + 2 E HWiflz-M) 7 V111: (MN ] 2
v
w>|<
v
w>|<
vv
The first inequality uses Ha + ME 3 2]|a]|§ + 2]|b]|§ and [1 : VP(LD). The second inequality uses IE Hg 7 E&Hg : IE HéHg 7 H E&Hg 3 IE HéHg for any random vector 6. The third inequality uses (8).
Now by noticing that conditioned on wt71, we have IE 12: : VP(wt71); and this leads to
E Hw: 7 MS = sz7i 7 10*”; 7 277(wt71 7 w*)T E 1}: +772E Hvtllg Ellwt71 710*“; 7277(wt71 7w*)T vP(wt71) +4L772 [P (W7 1)7 P(w* )+P(7~U )7 P(w*)l Ellwt71 710*“; 7 277[P(wt7 1)7 PW*)] +4L7i2[P(wt71) 7 PW*) 7P0” )7 P(w*)l = Hwt71 7 MS 7 277(17 2L77)[P (wt71) 7 P(w*)l 7 4L7i2[P(7~3) 7 PW*)]-
The first inequality uses the previously obtained inequality for IE H1}: 3 and the second inequality convexity of P(w), which implies that 7(wt71 7 w*)TVP(wt71) g P(w*) 7 P(wt71).
We consider a fixed stage s, so that LI) : 13571 and 7118 is selected after all of the updates have completed. By summing the previous inequality over 25 : 1, . . . ,m, taking expectation with all the history, and using option II at stage 5, we obtain
IE me 7 1mm 2n<17 2mm [Pm 7 Pm»
- E Hwo 7 1mm 4Lm772 MM) 7 Pm»
7E Ma 7 1mm 4Lm772 MM) 7 Pam]
- % MM) 7 P<w*)l + 4Lm772 MM) 7 Pm
72w + 2m?) MM) 7 Pm.
The second inequality uses the strong convexity property (6). We thus obtain
- 1 2L7; E P 7 P * < — + — [ (108) (u) N _ 771(1 7 2L7y)m 1 7 2L7]
This implies that IE [P(LDS) 7 P(w*)] g as E [P(LDO) 7 P(w*)]. The desired bound follows. El
E[P(IDS71) 7 P(w*)l~
The bound we obtained in Theorem 1 is comparable to those obtained in Le Roux et al. [2012], ShaleV—Shwartz and Zhang [2012] (if we ignore the log factor). To see this, we may consider for simplicity the most indicative case where the condition number L / y : n. Due to the poor condition number, the standard batch gradient descent requires complexity of nln(1/e) iterations over the data to achieve accuracy of e, which means we have to process n2 ln(1/e) number of examples. In comparison, in our procedure we may take 7] : 0.1 / L and m : O(n) to obtain a convergence rate of Oz : 1/2. Therefore to achieve an accuracy of e, we need to process nln(1/e) number of examples. This matches the results of Le Roux et al. [2012], ShaleV—Shwartz and Zhang [2012]. Nevertheless, our analysis is significantly simpler than both Le Roux et al. [2012] and ShaleV—Shwartz and Zhang [2012], and the explicit variance reduction argument provides better intuition on why this method works. In fact, in Section 4 we show that a similar intuition can be used to explain the effectiveness of SDCA.
The SVRG algorithm can also be applied to smooth but not strongly convex problems. A con— vergence rate of O(l/T) may be obtained, which improves the standard SGD convergence rate of O(l/x/T). In order to apply SVRG to nonconveX problems such as neural networks, it is useful to start with an initial vector 1110 that is close to a local minimum (which may be obtained with SGD), and then the method can be used to accelerate the local convergence rate of SGD (which may converge very slowly by itself). If the system is locally (strongly) convex, then Theorem 1 can be directly applied, which implies local geometric convergence rate with a constant learning rate.
4 SDCA as Variance Reduction
It can be shown that both SDCA and SAG are connected to SVRG in the sense they are also a variance reduction methods for SGD, although using different techniques. In the following we present the variance reduction View of SDCA, which provides additional insights into these recently proposed fast convergence methods for stochastic optimization. In SDCA, we consider the following problem with convex (1)410):
10* : argmin P(w), P(w) : l E (171(10) +0.5AwTw. (9) n 1'71
This is the same as our formulation with 11410) : (1)410) + 0.5AwTw.
We can take the derivative of (9) and derive a “dual” representation of u) at the solution w* as:
TL 10*:201: (j:1,...,k), 1:1
where the dual variables
- 1
al. : 7%Vd),(w*). (10) Therefore in the SGD update (3), if we maintain a representation TL W=Zflh (m i:1
then the update of Oz becomes:
(t) _ (17 mm?” 7 deh-(w) z 71’ 0‘11 7 (:71) ~ (12) <1 7 WM 6 7 i This update rule requires m 7 0 when t 7 00.
Alternatively, we may consider starting with SGD by maintaining (l l), and then apply the following Dual Coordinate Ascent rule:
U70 7 _ (:71) 070 _ . a. V 1 w + Anal- 5 7 1 , aé‘)_{ €71) 77“ d" ) ) . (371171) (13) oz) 6 7é ’L and then update w as w“) : wfi’l) + (Oz?) 7 Otitilb. It can be checked that if we take expectation over random 1’ E {1, . . . , n}, then the SGD rule in (12)
and the dual coordinate ascent rule (13) both yield the gradient descent rule E[w(t]w(t’1)] : wail) 7 mVPfiUUTU).
Therefore both can be regarded as different realizations of the more general stochastic gradient rule in (4). However, the advantage of (13) is that we may take a larger step when t 7 00. This is be— cause according to (10), when the primal—dual parameters (w, Oz) converge to the optimal parameters (uh, 01*), we have (Vrbflw) + Anal) 7 0,
which means that even if the learning rate m stays bounded away from zero, the procedure can converge. This is the same effect as SVRG, in the sense that the variance goes to zero asymptotically: as w 7 w* and Oz 7 01*, we have
%Z(V¢i(w) + AW)? 7 0. i:1
That is, SDCA is also a variance reduction method for SGD, which is similar to SVRG.
From this discussion, we can View SVRG as an explicit variance reduction technique for SGD which is similar to SDCA. However, it is simpler, more intuitive, and easier to analyze. This relationship provides useful insights into the underlying optimization problem that may allow us to make further improvements.
5 Experiments
To confirm the theoretical results and insights, we experimented with SVRG (Fig. 1 Option I) in comparison with SGD and SDCA with linear predictors (convex) and neural nets (nonconveX). In all the figures, the x—axis is computational cost measured by the number of gradient computations diVided by n. For SGD, it is the number of passes to go through the training data, and for SVRG in the nonconveX case (neural nets), it includes the additional computation of Vipifib) both in each iteration and for computing the gradient average [1. For SVRG in our convex case, however, Vipifib) does not have to be re—computed in each iteration. Since in this case the gradient is always a multiple of xi, i.e., V¢,(w) : ¢§(wai)xi where 114w) : (1),;(wai), Vipihb) can be compactly saved in memory by only saving scalars d){L-(’ZIJT$¢) with the same memory consumption as SDCA and SAG. The interval m was set to 2n (convex) and 5h (nonconveX). The weights for SVRG were initialized by performing l iteration (convex) or 10 iterations (nonconveX) of SGD; therefore, the line for SVRG starts after 90 : 1 (convex) or x : 10 (nonconveX) in the respective figures.
Figure 2: Multiclass logistic regression (convex) on MNIST[1]. (a) Training loss comparison with SGD with fixed learning rates. The numbers in the legends are the learning rate. (b) Training loss residual P(w) — P(uu ); comparison with best—tuned SGD with learning rate scheduling and SDCA. (c) Variance of weight update (including multiplication with the learning rate).
First, we performed L2—regularized multiclass logistic regression (convex optimization) on MNIST1 with regularization parameter A :le—4. Fig. 2 (a) shows training loss (i.e., the optimization objective P(w)) in comparison with SGD with fixed learning rates. The results are indicative of the known weakness of SGD, which also illustrates the strength of SVRG. That is, when a relatively large learning rate 7] is used with SGD, training loss drops fast at first, but it oscillates above the minimum and never goes down to the minimum. With small 7], the minimum may be approached eventually, but it will take many iterations to get there. Therefore, to accelerate SGD, one has to start with relatively large 7] and gradually decrease it (learning rate scheduling), as commonly practiced. By contrast, using a single relatively large value of 7], SVRG smoothly goes down faster than SGD. This is in line with our theoretical prediction that one can use a relatively large 7] with SVRG, which leads to faster convergence.
Fig. 2 (b) and (c) compare SVRG with best—tuned SGD with learning rate scheduling and SDCA. ‘SGD—best’ is the best—tuned SGD, which was chosen by preferring smaller training loss from a large number of parameter combinations for two types of learning scheduling: exponential decay h(t) : nOaW”) with parameters 710 and a to adjust and t—inverse h(t) : 710(1 + b[t/n] )’1 with 710 and b to adjust. (Not surprisingly, the best—tuned SGD with learning rate scheduling outperformed the best—tuned SGD with a fixed learning rate throughout our experiments.) Fig. 2 (b) shows training loss residual, which is training loss minus the optimum (estimated by running gradient descent for a very long time): P(w) 7 P(w*). We observe that SVRG’s loss residual goes down exponentially, which is in line with Theorem 1, and that SVRG is competitive with SDCA (the two lines are almost overlapping) and decreases faster than SGD—best. In Fig. 2 (c), we show the variance of SVRG update 7n(V¢,(w) 7 Vipifib) + [1) in comparison with the variance of SGD update 7n(t)V¢,-(w) and SDCA. As expected, the variance of both SVRG and SDCA decreases as optimization proceeds, and the variance of SGD with a fixed learning rate (‘SGD:0.001’) stays high. The variance of the best—tuned SGD decreases, but this is due to the forced exponential decay of the learning rate and the variance of the gradients V¢,(w) (the dotted line labeled as ‘SGD—best/7](t)’) stays high.
Fig. 3 shows more convex—case results (L2—regularized logistic regression) in terms of training loss residual (top) and test error rate (bottom) on rcv.binary and covtype.binary from the LIBSVM site[2], protein[3], and CIFAR—10[4]. As protein and covtype do not come with labeled test data, we randomly split the training data into halves to make the training/test split. CIFAR was normalized into [0, 1] by diVision with 255 (which was also done with MNIST and CIFAR in the other figures), and protein was standardized. A was set to le—3 (CIFAR) and le—5 (rest). Overall, SVRG is competitive with SDCA and clearly more advantageous than the best—tuned SGD. It is also worth mentioning that a recent study Schmidt et al. [2013] reports that SAG and SDCA are competitive.
To test SVRG with nonconvex objectives, we trained neural nets (with one fully—connected hidden layer of 100 nodes and ten softmax output nodes; sigmoid activation and L2 regularization) with mini—batches of size 10 on MNIST and CIFAR—10, both of which are standard datasets for deep neural net studies; A was set to le—4 and 1e73, respectively. In Fig. 4 we confirm that the results are similar to the convex case; i.e., SVRG reduces the variance and smoothly converges faster than the best—tuned SGD with learning rate scheduling, which is a de facto standard method for neural net training. As said earlier, methods such as SDCA and SAG are not practical for neural nets due to their memory requirement. We View these results as promising. However, further investigation, in particular with larger/deeper neural nets for which training cost is a critical issue, is still needed.
Figure 3: More convex—case results. Loss residual P(w) — P(wt) (top) and test error rates (down). L2— regularized logistic regression (lO—class for CIFAR—lO and binary for the rest).
Figure 4: Neural net results (nonconvex).
6 Conclusion
This paper introduces an explicit variance reduction method for stochastic gradient descent meth— ods. For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of SDCA and SAG. However, our proof is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, this method does not require the storage of gradients, and thus is more easily applicable to complex problems such as structured prediction or neural network learning.
Acknowledgment
We thank Leon Bottou and Alekh Agarwal for spotting a mistake in the original theorem.
References
;
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year | |
---|---|---|---|---|---|---|---|---|---|---|
2013 AcceleratingStochasticGradientD | Tong Zhang Rie Johnson | Accelerating Stochastic Gradient Descent Using Predictive Variance Reduction | 2013 |
- ↑ http://yann.lecun.com/exdb/mnist/
- ↑ http : //www.csie.ntu.edu.tw/ ~cjlin/1ibsvmtools/datasets/
- ↑ http://Osmot.cs.cornell.edu/kddcup/datasets.html
- ↑ www.cs.toronto.edu/ ~k]:iz/cifa]:.html