Stochastic Variance Reduced Gradient (SVRG) Algorithm
A Stochastic Variance Reduced Gradient (SVRG) Algorithm is an SGD-based Algorithm designed to reduce the variance of SGD's stochastic gradient estimates (an illustrative sketch of the update loop is given below, before the references).
- Context:
- It was first developed by Johnson & Zhang (2013).
- Example(s):
- A Distributed Stochastic Variance Reduced Gradient (DSVRG) Algorithm (Lee et al., 2017),
- A Fast Stochastic Variance Reduction Gradient (FSVRG) Algorithm (Shang et al., 2017),
- A Stochastic Variance Reduced Gradient Langevin Dynamics (SVRG-LD) Algorithm (Zou et al., 2018),
- A Streaming Stochastic Variance Reduced Gradient (SSVRG) Algorithm (Frostig et al., 2015).
- …
- Counter-Example(s):
- See: Stochastic Optimization, Convex Optimization, Learning Rate, Gradient Descent, Adaptive Gradient (AdaGrad) Algorithm.
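The following minimal NumPy sketch illustrates the SVRG update loop on a simple least-squares objective. It is a sketch under stated assumptions, not a reference implementation; all names (svrg_least_squares, example_gradient, w_snapshot, step_size, inner_steps) are hypothetical and chosen for exposition.

```python
import numpy as np

def svrg_least_squares(X, y, step_size=0.01, num_epochs=20, inner_steps=None, seed=0):
    """Minimal SVRG sketch (after Johnson & Zhang, 2013) for the least-squares
    objective f(w) = (1/2n) * ||X w - y||^2. Names and defaults are illustrative."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = inner_steps if inner_steps is not None else 2 * n  # inner-loop length
    w = np.zeros(d)

    def example_gradient(w_vec, i):
        # Gradient of the i-th summand: (x_i^T w - y_i) * x_i
        return (X[i] @ w_vec - y[i]) * X[i]

    for _ in range(num_epochs):
        w_snapshot = w.copy()
        # Full-data gradient at the snapshot, computed once per outer iteration.
        mu = X.T @ (X @ w_snapshot - y) / n
        for _ in range(m):
            i = rng.integers(n)
            # Variance-reduced gradient: unbiased for the full gradient at w;
            # its variance shrinks as w and w_snapshot approach the optimum.
            g = example_gradient(w, i) - example_gradient(w_snapshot, i) + mu
            w -= step_size * g  # constant step size; no decay schedule needed
    return w
```

The sketch follows "Option I" of Johnson & Zhang (2013), where the snapshot is the last iterate of the previous inner loop; "Option II" instead draws the snapshot uniformly at random from the inner iterates.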
References
2018a
- (Jothimurugesan et al., 2018) ⇒ Ellango Jothimurugesan, Ashraf Tahmasbi, Phillip B. Gibbons, and Srikanta Tirthapura. (2018). “Variance-Reduced Stochastic Gradient Descent on Streaming Data.” In: Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS 2018).
- QUOTE: SAG [RSB12] was among the first variance reduction methods proposed and achieves linear convergence rate for smooth and strongly convex problems. SAG requires storage of the last gradient computed for each data point and uses their average in each update. SAGA [DBLJ14] improves on SAG by eliminating a bias in the update. Stochastic Variance-Reduced Gradient (SVRG) [JZ13] is another variance reduction method that does not store the computed gradients, but periodically computes a full-data gradient, requiring more computation than SAGA. Semi-Stochastic Gradient Descent (S2GD) [KR13] differs from SVRG by computing a random stochastic gradient at each iteration and a full gradient occasionally. The gap between two full gradient computations is determined by a geometric law. CHEAPSVRG [SAKS16] is another variant of SVRG. In contrast with SVRG, it estimates the gradient through computing the gradient on a subset of training data points rather than all the data points. However, all of the above variance-reduced methods require [math]\displaystyle{ O(n \log n) }[/math] iterations to guarantee convergence to statistical accuracy (to yield a good fit to the underlying data) for [math]\displaystyle{ n }[/math] data points. DYNASAGA [DLH16] achieves statistical accuracy in only [math]\displaystyle{ O(n) }[/math] iterations by using a gradually increasing sample set and running SAGA on it.
So far, all the algorithms we have discussed are offline, and assume the entire dataset is available beforehand. Streaming SVRG (SSVRG) [FGKS15] is an algorithm that handles streaming data arrivals, and processes them in a single pass through data, using limited memory. In our experimental study, we found STRSAGA to be significantly more accurate than SSVRG. Further, our analysis of STRSAGA shows that it handles arrival distributions which allow for burstiness in the stream, while SSVRG is not suited for this case. In many practical situations, restricting a streaming data algorithm to use limited memory is overly restrictive and as our results show, leads to worse accuracy.
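To make the storage trade-off described in the quote concrete, the hypothetical sketch below contrasts one SAGA step, which reads and refreshes a stored per-point gradient table, with one SVRG inner step, which stores no per-point gradients but relies on a snapshot iterate and its periodically recomputed full-data gradient. The function and argument names (saga_step, svrg_step, grad_i, grad_table, snapshot, mu) are illustrative only.

```python
def saga_step(w, i, grad_i, grad_table, step_size):
    """One SAGA step (Defazio et al., 2014): grad_table is an (n, d) NumPy array
    holding the last gradient computed for each data point (O(n d) memory)."""
    g_new = grad_i(w, i)
    # Unbiased estimator built from the previously stored gradients.
    g = g_new - grad_table[i] + grad_table.mean(axis=0)
    grad_table[i] = g_new  # refresh the stored gradient for point i
    return w - step_size * g

def svrg_step(w, i, grad_i, snapshot, mu, step_size):
    """One SVRG inner step (Johnson & Zhang, 2013): no per-point storage; instead,
    `snapshot` and its full-data gradient `mu` are recomputed in an outer loop."""
    g = grad_i(w, i) - grad_i(snapshot, i) + mu  # unbiased, variance-reduced estimator
    return w - step_size * g
```

In practice SAGA maintains the table average incrementally rather than recomputing it each step, but the memory contrast with SVRG's single snapshot is the same.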
2018b
- (Shang et al., 2018) ⇒ Fanhua Shang, Yuanyuan Liu, Kaiwen Zhou, James Cheng, Kelvin Kai Wing Ng, and Yuichi Yoshida. (2018). “Guaranteed Sufficient Decrease for Stochastic Variance Reduced Gradient Optimization.” In: Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (AISTATS 2018), PMLR.
2018c
- (Zou et al., 2018) ⇒ Difan Zou, Pan Xu, and Quanquan Gu. (2018). “Subsampled Stochastic Variance-Reduced Gradient Langevin Dynamics.” In: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI 2018).
2017a
- (Lee et al., 2017) ⇒ Jason D. Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang. (2017). “Distributed Stochastic Variance Reduced Gradient Methods by Sampling Extra Data with Replacement.” In: The Journal of Machine Learning Research, 18(1).
2017b
- (Shang et al., 2017) ⇒ Fanhua Shang, Yuanyuan Liu, James Cheng, and Jiacheng Zhuo. (2017). “Fast Stochastic Variance Reduced Gradient Method with Momentum Acceleration for Machine Learning.” In: arXiv preprint arXiv:1703.07948.
2015
- (Frostig et al., 2015) ⇒ Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. (2015). “Competing with the Empirical Risk Minimizer in a Single Pass.” In: Proceedings of the Conference on Learning Theory (COLT 2015).
2013
- (Johnson & Zhang, 2013) ⇒ Rie Johnson, and Tong Zhang. (2013). “Accelerating Stochastic Gradient Descent Using Predictive Variance Reduction.” In: Advances in Neural Information Processing Systems 26 (NIPS 2013).
- QUOTE: (...) We call this method stochastic variance reduced gradient (SVRG) because it explicitly reduces the variance of SGD. Unlike SGD, the learning rate [math]\displaystyle{ \eta_t }[/math] for SVRG does not have to decay, which leads to faster convergence as one can use a relatively large learning rate.
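In the notation of Johnson & Zhang (2013), the update that permits this constant learning rate [math]\displaystyle{ \eta }[/math] is the variance-reduced inner step
[math]\displaystyle{ w_t = w_{t-1} - \eta \left( \nabla \psi_{i_t}(w_{t-1}) - \nabla \psi_{i_t}(\tilde{w}) + \tilde{\mu} \right), \qquad \tilde{\mu} = \frac{1}{n} \sum_{i=1}^{n} \nabla \psi_i(\tilde{w}), }[/math]
where the snapshot [math]\displaystyle{ \tilde{w} }[/math] and its full gradient [math]\displaystyle{ \tilde{\mu} }[/math] are recomputed every [math]\displaystyle{ m }[/math] inner iterations. The correction term has zero expectation, so the estimator remains unbiased, while its variance shrinks as both [math]\displaystyle{ w_{t-1} }[/math] and [math]\displaystyle{ \tilde{w} }[/math] approach the optimum.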