2018 VarianceReducedStochasticGradie
- (Jothimurugesan et al., 2018) ⇒ Ellango Jothimurugesan, Ashraf Tahmasbi, Phillip B Gibbons, and Srikanta Tirthapura. (2018). “Variance-Reduced Stochastic Gradient Descent on Streaming Data.” In: Proceedings of 32nd Conference on Neural Information Processing Systems (NIPS 2018).
Subject Headings: STRSAGA, Stochastic Gradient Descent, Streaming Stochastic Variance Reduced Gradient.
Notes
Cited By
Quotes
Abstract
We present an algorithm STRSAGA for efficiently maintaining a machine learning model over data points that arrive over time, quickly updating the model as new training data is observed. We present a competitive analysis comparing the suboptimality of the model maintained by STRSAGA with that of an offline algorithm that is given the entire data beforehand, and analyze the risk-competitiveness of STRSAGA under different arrival patterns. Our theoretical and experimental results show that the risk of STRSAGA is comparable to that of offline algorithms on a variety of input arrival patterns, and its experimental performance is significantly better than prior algorithms suited for streaming data, such as SGD and SSVRG.
1 Introduction
2 Related work
Stochastic Gradient Descent (SGD) [RM51] and its extensions are used extensively in practice for learning from large datasets. While an iteration of SGD is cheap relative to an iteration of a full gradient method, its variance can be high. To control the variance, the learning rate of SGD must decay over time, resulting in a sublinear convergence rate. Newer variance-reduced versions of SGD, on the other hand, achieve linear convergence on strongly convex objective functions, generally by incorporating a correction term in each update step that approximates a full gradient, while still ensuring each iteration is efficient like SGD.
SAG [RSB12] was among the first variance reduction methods proposed and achieves linear convergence rate for smooth and strongly convex problems. SAG requires storage of the last gradient computed for each data point and uses their average in each update. SAGA [DBLJ14] improves on SAG by eliminating a bias in the update. Stochastic Variance-Reduced Gradient (SVRG) [JZ13] is another variance reduction method that does not store the computed gradients, but periodically computes a full-data gradient, requiring more computation than SAGA. Semi-Stochastic Gradient Descent (S2GD) [KR13] differs from SVRG by computing a random stochastic gradient at each iteration and a full gradient occasionally. The gap between two full gradient computations is determined by a geometric law. CHEAPSVRG [SAKS16] is another variant of SVRG. In contrast with SVRG, it estimates the gradient through computing the gradient on a subset of training data points rather than all the data points. However, all of the above variance-reduced methods require [math]\displaystyle{ O(n log n) }[/math] iterations to guarantee convergence to statistical accuracy (to yield a good fit to the underlying data) for [math]\displaystyle{ n }[/math] data points. DYNASAGA [DLH16] achieves statistical accuracy in only [math]\displaystyle{ O(n) }[/math] iterations by using a gradually increasing sample set and running SAGA on it.
So far, all the algorithms we have discussed are offline, and assume the entire dataset is available beforehand. Streaming SVRG (SSVRG) [FGKS15] is an algorithm that handles streaming data arrivals, and processes them in a single pass through data, using limited memory. In our experimental study, we found STRSAGA to be significantly more accurate than SSVRG. Further, our analysis of STRSAGA shows that it handles arrival distributions which allow for burstiness in the stream, while SSVRG is not suited for this case. In many practical situations, restricting a streaming data algorithm to use limited memory is overly restrictive and as our results show, leads to worse accuracy.
3 Model and preliminaries
4 STRSAGA: gradient descent over streaming data
5 Competitive analysis of STRSAGA on specific arrival distribution
6 Experimental results
7 Conclusion
References
- [FGKS15] - Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. (2015). “Competing with the Empirical Risk Minimizer in a Single Pass.” In: Proceedings of In Conference on learning theory (JMLR 2015).;
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year | |
---|---|---|---|---|---|---|---|---|---|---|
2018 VarianceReducedStochasticGradie | Rong Ge Sham M. Kakade Ellango Jothimurugesan Ashraf Tahmasbi Phillip B Gibbons Srikanta Tirthapura Roy Frostig Aaron Sidford | Variance-Reduced Stochastic Gradient Descent on Streaming Data | 2015 2018 |