Stochastic Gradient Descent (SGD)-based Classification System
A Stochastic Gradient Descent (SGD)-based Classification System is a supervised classification system that implements a Stochastic Gradient Descent Algorithm to solve an SGD Classification Task.
- Example(s):
- sklearn.linear_model.SGDClassifier [1]:
 - SGD: Maximum margin separating hyperplane.
 - Plot multi-class SGD on the iris dataset.
 - SGD: Weighted samples.
 - Comparing various online solvers.
 - SVM: Separating hyperplane for unbalanced classes.
 - Sample pipeline for text feature extraction and evaluation.
 - Classification of text documents using sparse features.
- Counter-Example(s):
- See: Regression Analysis Task, Classification Task, Random Variable, L2-norm, Linear Support Vector Machine.
References
2017a
- (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/linear_model.html#stochastic-gradient-descent-sgd Retrieved:2017-09-17
- QUOTE: Stochastic gradient descent is a simple yet very efficient approach to fit linear models. It is particularly useful when the number of samples (and the number of features) is very large. The partial_fit method allows online/out-of-core learning.
The classes SGDClassifier and SGDRegressor provide functionality to fit linear models for classification and regression using different (convex) loss functions and different penalties. E.g., with loss="log", SGDClassifier fits a logistic regression model, while with loss="hinge" it fits a linear support vector machine (SVM).
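To make the quoted description concrete, here is a minimal sketch, assuming a small synthetic dataset (the data and parameter values are illustrative assumptions, not part of the source), of fitting SGDClassifier with loss="hinge" and loss="log" and of online/out-of-core learning via partial_fit. Note that loss="log" follows the 2017 documentation quoted here; newer scikit-learn releases spell it loss="log_loss".

```python
# Minimal sketch: SGDClassifier with hinge vs. log loss, and partial_fit
# for online/out-of-core learning. The toy dataset below is an assumption.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# loss="hinge" yields a linear SVM; loss="log" yields logistic regression
# (spelled "log_loss" in recent scikit-learn versions).
svm_like = SGDClassifier(loss="hinge", penalty="l2").fit(X, y)
logreg_like = SGDClassifier(loss="log", penalty="l2").fit(X, y)

# partial_fit consumes mini-batches, so the full dataset never needs to be
# held in memory; the class labels must be passed on the first call.
online_clf = SGDClassifier(loss="log")
for batch in np.array_split(np.arange(len(X)), 10):
    online_clf.partial_fit(X[batch], y[batch], classes=np.array([0, 1]))

print(svm_like.score(X, y), logreg_like.score(X, y), online_clf.score(X, y))
```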
2017b
- (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/sgd.html#classification
- QUOTE: The class SGDClassifier implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties for classification
(...)
SGDClassifier supports the following loss functions:
- loss="hinge": (soft-margin) linear Support Vector Machine,
- loss="modified_huber": smoothed hinge loss,
- loss="log": logistic regression,
- and all regression losses below (...)
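As a sketch of how this loss choice plays out in code (the toy data here is an assumption), the smoothed losses loss="log" and loss="modified_huber" also enable probability estimates via predict_proba, whereas loss="hinge" exposes only decision_function:

```python
# Sketch: comparing SGDClassifier loss functions (toy data is an assumption).
# Newer scikit-learn versions spell the logistic loss "log_loss".
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(42)
X = rng.randn(100, 5)
y = (X[:, 0] > 0).astype(int)

for loss in ["hinge", "modified_huber", "log"]:
    clf = SGDClassifier(loss=loss).fit(X, y)
    # Signed distance to the hyperplane; sign(f(x)) gives the predicted class.
    scores = clf.decision_function(X[:2])
    # Probability estimates exist only for the probabilistic losses.
    proba = clf.predict_proba(X[:2]) if loss != "hinge" else None
    print(loss, scores, proba)
```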
- SGDClassifier supports averaged SGD (ASGD). Averaging can be enabled by setting `average=True`. ASGD works by averaging the coefficients of plain SGD over each iteration over a sample. When using ASGD, the learning rate can be larger and even constant, leading on some datasets to a speed-up in training time.
For classification with a logistic loss, another variant of SGD with an averaging strategy is available with the Stochastic Average Gradient (SAG) algorithm, available as a solver in LogisticRegression.
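A minimal sketch of the two averaging variants mentioned above, assuming a made-up dataset and hyperparameter values:

```python
# Sketch: averaged SGD (average=True) and the SAG solver in LogisticRegression.
# Dataset and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.linear_model import SGDClassifier, LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(500, 20)
y = (X @ rng.randn(20) > 0).astype(int)

# ASGD: coefficients are averaged over the updates; a larger, even constant,
# learning rate can be used.
asgd = SGDClassifier(loss="hinge", average=True,
                     learning_rate="constant", eta0=0.01).fit(X, y)

# SAG: an averaging-based stochastic solver for the logistic loss.
sag = LogisticRegression(solver="sag").fit(X, y)

print(asgd.score(X, y), sag.score(X, y))
```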
2017c
- (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/sgd.html#mathematical-formulation Retrieved:2017-09-17
- QUOTE: Given a set of training examples [math]\displaystyle{ (x_1, y_1), \ldots, (x_n, y_n) }[/math] where [math]\displaystyle{ x_i \in \mathbf{R}^m }[/math] and [math]\displaystyle{ y_i \in \{-1,1\} }[/math], our goal is to learn a linear scoring function [math]\displaystyle{ f(x) = w^T x + b }[/math] with model parameters [math]\displaystyle{ w \in \mathbf{R}^m }[/math] and intercept [math]\displaystyle{ b \in \mathbf{R} }[/math]. In order to make predictions, we simply look at the sign of [math]\displaystyle{ f(x) }[/math]. A common choice to find the model parameters is by minimizing the regularized training error given by
[math]\displaystyle{ E(w,b) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w) }[/math]
where [math]\displaystyle{ L }[/math] is a loss function that measures model (mis)fit and [math]\displaystyle{ R }[/math] is a regularization term (aka penalty) that penalizes model complexity; [math]\displaystyle{ \alpha \gt 0 }[/math] is a non-negative hyperparameter.
Different choices for L entail different classifiers such as
- Hinge: (soft-margin) Support Vector Machines.
- Log: Logistic Regression.
- Least-Squares: Ridge Regression.
- Epsilon-Insensitive: (soft-margin) Support Vector Regression.
- (...)
- Popular choices for the regularization term R include:
- L2 norm: [math]\displaystyle{ R(w) := \frac{1}{2} \sum_{i=1}^{n} w_i^2 }[/math],
- L1 norm: [math]\displaystyle{ R(w) := \sum_{i=1}^{n} |w_i| }[/math], which leads to sparse solutions.
- Elastic Net: [math]\displaystyle{ R(w) := \frac{\rho}{2} \sum_{i=1}^{n} w_i^2 + (1-\rho) \sum_{i=1}^{n} |w_i| }[/math], a convex combination of L2 and L1, where [math]\displaystyle{ \rho }[/math] is given by 1 - l1_ratio.
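To make the formulation concrete, the following sketch (with made-up data, weights, and hyperparameter values) evaluates the regularized training error [math]\displaystyle{ E(w,b) }[/math] for the hinge loss under each of the three penalties above, taking ρ as 1 - l1_ratio:

```python
# Sketch: evaluating E(w,b) = (1/n) * sum_i L(y_i, f(x_i)) + alpha * R(w)
# with the hinge loss and the L2 / L1 / Elastic Net penalties listed above.
# Data, parameters, alpha, and l1_ratio are illustrative assumptions.
import numpy as np

rng = np.random.RandomState(0)
n, m = 50, 4
X = rng.randn(n, m)
y = np.where(X[:, 0] > 0, 1, -1)        # labels in {-1, +1}
w, b = rng.randn(m), 0.1                 # some candidate model parameters
alpha, l1_ratio = 1e-4, 0.15
rho = 1 - l1_ratio

f = X @ w + b                            # linear scoring function f(x) = w^T x + b
hinge = np.maximum(0.0, 1.0 - y * f)     # hinge loss L(y, f(x)) = max(0, 1 - y f(x))

R = {
    "L2": 0.5 * np.sum(w ** 2),
    "L1": np.sum(np.abs(w)),
}
# Elastic Net is the stated convex combination of the L2 and L1 terms.
R["Elastic Net"] = rho * R["L2"] + (1 - rho) * R["L1"]

for name, penalty in R.items():
    E = hinge.mean() + alpha * penalty
    print(name, E)
```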