Stochastic Gradient Descent (SGD)-based Regression System
A Stochastic Gradient Descent (SGD)-based Regression System is a Linear Regression System that implements a Stochastic Gradient Descent Algorithm to solve an SGD Regression Task.
- Example(s):
- Counter-Example(s):
- See: Regression Analysis Task, Classification Task, Random Variable, L2-norm, Linear Support Vector Machine.
References
2017a
- (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/linear_model.html#stochastic-gradient-descent-sgd Retrieved:2017-09-17
- QUOTE: Stochastic gradient descent is a simple yet very efficient approach to fit linear models. It is particularly useful when the number of samples (and the number of features) is very large. The `partial_fit` method allows online/out-of-core learning. The classes `SGDClassifier` and `SGDRegressor` provide functionality to fit linear models for classification and regression using different (convex) loss functions and different penalties. E.g., with `loss="log"`, `SGDClassifier` fits a logistic regression model, while with `loss="hinge"` it fits a linear support vector machine (SVM).
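As a hedged illustration of the `partial_fit`-based online/out-of-core workflow described above, the following minimal sketch (synthetic data, hypothetical parameter values) updates an `SGDRegressor` one mini-batch at a time:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
true_w = np.array([1.0, -2.0, 0.5])      # hypothetical ground-truth weights

# Parameter names follow the 2017-era API quoted above; newer scikit-learn
# releases renamed loss="squared_loss" to loss="squared_error".
model = SGDRegressor(loss="squared_loss", penalty="l2", alpha=1e-4)

for _ in range(50):                       # a stream of 50 mini-batches
    X_batch = rng.randn(100, 3)           # synthetic features
    y_batch = X_batch @ true_w + 0.1 * rng.randn(100)
    model.partial_fit(X_batch, y_batch)   # incremental (online/out-of-core) update

print(model.coef_, model.intercept_)
```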
2017b
- (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/sgd.html#regression Retrieved:2017-09-17
- QUOTE: The class `SGDRegressor` implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. `SGDRegressor` is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend `Ridge`, `Lasso`, or `ElasticNet`.
The concrete loss function can be set via the `loss` parameter. `SGDRegressor` supports the following loss functions:
- `loss="squared_loss"`: Ordinary least squares,
- `loss="huber"`: Huber loss for robust regression,
- `loss="epsilon_insensitive"`: linear Support Vector Regression.
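For illustration only (not part of the quoted documentation), a minimal sketch that fits one `SGDRegressor` per loss setting listed above, using the 2017-era loss names:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# One model per loss listed above; "squared_loss" is the 2017-era name
# (renamed to "squared_error" in later scikit-learn releases).
for loss in ("squared_loss", "huber", "epsilon_insensitive"):
    reg = SGDRegressor(loss=loss, penalty="l2", random_state=0).fit(X, y)
    print(loss, round(reg.score(X, y), 3))
```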
- The Huber and epsilon-insensitive loss functions can be used for robust regression. The width of the insensitive region has to be specified via the parameter `epsilon`. This parameter depends on the scale of the target variables.
`SGDRegressor` supports averaged SGD, just as `SGDClassifier` does. Averaging can be enabled by setting `average=True`.
For regression with a squared loss and an l2 penalty, another variant of SGD with an averaging strategy is available with the Stochastic Average Gradient (SAG) algorithm, available as a solver in `Ridge`.
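A minimal sketch (hypothetical parameter values) of the options mentioned above: setting `epsilon` for the robust losses, enabling averaged SGD with `average=True`, and reaching the SAG solver through `Ridge`:

```python
from sklearn.linear_model import SGDRegressor, Ridge

# Robust regression: epsilon is the width of the insensitive region and
# should be chosen relative to the scale of the targets (0.1 is illustrative).
robust = SGDRegressor(loss="huber", epsilon=0.1)

# Averaged SGD, analogous to the averaging option of SGDClassifier.
averaged = SGDRegressor(loss="squared_loss", penalty="l2", average=True)

# Squared loss + l2 penalty with an averaging strategy is also available
# through the Stochastic Average Gradient (SAG) solver in Ridge.
sag_ridge = Ridge(alpha=1.0, solver="sag")
```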
2017c
- (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/sgd.html#mathematical-formulation Retrieved:2017-09-17
- QUOTE: Given a set of training examples [math]\displaystyle{ (x_1, y_1), \ldots, (x_n, y_n) }[/math] where [math]\displaystyle{ x_i \in \mathbf{R}^m }[/math] and [math]\displaystyle{ y_i \in \{-1,1\} }[/math], our goal is to learn a linear scoring function [math]\displaystyle{ f(x) = w^T x + b }[/math] with model parameters [math]\displaystyle{ w \in \mathbf{R}^m }[/math] and intercept [math]\displaystyle{ b \in \mathbf{R} }[/math]. In order to make predictions, we simply look at the sign of [math]\displaystyle{ f(x) }[/math]. A common choice to find the model parameters is by minimizing the regularized training error given by
[math]\displaystyle{ E(w,b) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w) }[/math]
where [math]\displaystyle{ L }[/math] is a loss function that measures model (mis)fit and [math]\displaystyle{ R }[/math] is a regularization term (aka penalty) that penalizes model complexity; [math]\displaystyle{ \alpha \gt 0 }[/math] is a non-negative hyperparameter.
Different choices for [math]\displaystyle{ L }[/math] entail different classifiers, such as:
- Hinge: (soft-margin) Support Vector Machines.
- Log: Logistic Regression.
- Least-Squares: Ridge Regression.
- Epsilon-Insensitive: (soft-margin) Support Vector Regression.
- (...)
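To make the quoted objective concrete, here is a minimal NumPy sketch (an assumption-laden instantiation, using squared loss and the L2 penalty defined in the list below) that evaluates [math]\displaystyle{ E(w,b) }[/math] for a given parameter vector:

```python
import numpy as np

def regularized_training_error(w, b, X, y, alpha=1e-4):
    """Evaluate E(w, b) = (1/n) * sum_i L(y_i, f(x_i)) + alpha * R(w),
    here instantiated with squared loss and the L2 penalty R(w) = 0.5 * sum_j w_j**2."""
    preds = X @ w + b                      # f(x) = w^T x + b
    data_loss = np.mean((y - preds) ** 2)  # squared loss, averaged over the n samples
    penalty = 0.5 * np.sum(w ** 2)         # L2 regularization term
    return data_loss + alpha * penalty
```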
- Popular choices for the regularization term R include:
- L2 norm: [math]\displaystyle{ R(w) := \frac{1}{2} \sum_{i=1}^{n} w_i^2 }[/math],
- L1 norm: [math]\displaystyle{ R(w) := \sum_{i=1}^{n} |w_i| }[/math], which leads to sparse solutions.
- Elastic Net: [math]\displaystyle{ R(w) := \frac{\rho}{2} \sum_{i=1}^{n} w_i^2 + (1-\rho) \sum_{i=1}^{n} |w_i| }[/math], a convex combination of L2 and L1, where [math]\displaystyle{ \rho }[/math] is given by `1 - l1_ratio`.
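A minimal NumPy sketch of the three penalties above, with the elastic-net mixing written in terms of `l1_ratio` exactly as in the quoted formula (the value 0.15 is only illustrative):

```python
import numpy as np

def l2_penalty(w):
    # R(w) = 0.5 * sum_j w_j**2
    return 0.5 * np.sum(w ** 2)

def l1_penalty(w):
    # R(w) = sum_j |w_j|, which encourages sparse solutions
    return np.sum(np.abs(w))

def elastic_net_penalty(w, l1_ratio=0.15):
    # Convex combination of L2 and L1 with rho = 1 - l1_ratio,
    # matching the quoted formula; the l1_ratio value is hypothetical.
    rho = 1.0 - l1_ratio
    return 0.5 * rho * np.sum(w ** 2) + (1.0 - rho) * np.sum(np.abs(w))
```

In `SGDRegressor`, these choices correspond to `penalty="l2"`, `penalty="l1"`, and `penalty="elasticnet"`, with the mixing controlled by the `l1_ratio` parameter.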