Stochastic Gradient Descent (SGD)-based Classification System
A Stochastic Gradient Descent (SGD)-based Classification System is a supervised classification system that implements a Stochastic Gradient Descent Algorithm to solve an SGD Classification Task.
- Example(s):
- sklearn.linear_model.SGDClassifier [1]:
 - SGD: Maximum margin separating hyperplane.
 - Plot multi-class SGD on the iris dataset.
 - SGD: Weighted samples.
 - Comparing various online solvers.
 - SVM: Separating hyperplane for unbalanced classes.
 - Sample pipeline for text feature extraction and evaluation.
 - Classification of text documents using sparse features.
- Counter-Example(s):
- See: Regression Analysis Task, Classification Task, Random Variable, L2-norm, Linear Support Vector Machine.
References
2017a
- (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/linear_model.html#stochastic-gradient-descent-sgd Retrieved:2017-09-17
- QUOTE: Stochastic gradient descent is a simple yet very efficient approach to fit linear models. It is particularly useful when the number of samples (and the number of features) is very large. The partial_fit method allows online/out-of-core learning.
The classes SGDClassifier and SGDRegressor provide functionality to fit linear models for classification and regression using different (convex) loss functions and different penalties. E.g., with loss="log", SGDClassifier fits a logistic regression model, while with loss="hinge" it fits a linear support vector machine (SVM).
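To make the quoted description concrete, here is a minimal sketch, assuming a small synthetic dataset (the data and parameter values are illustrative assumptions, not part of the source), of fitting SGDClassifier with loss="hinge" and loss="log" and of online/out-of-core learning via partial_fit. Note that loss="log" follows the 2017 documentation quoted here; newer scikit-learn releases spell it loss="log_loss".

```python
# Minimal sketch: SGDClassifier with hinge vs. log loss, and partial_fit
# for online/out-of-core learning. The toy dataset below is an assumption.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# loss="hinge" yields a linear SVM; loss="log" yields logistic regression
# (spelled "log_loss" in recent scikit-learn versions).
svm_like = SGDClassifier(loss="hinge", penalty="l2").fit(X, y)
logreg_like = SGDClassifier(loss="log", penalty="l2").fit(X, y)

# partial_fit consumes mini-batches, so the full dataset never needs to be
# held in memory; the class labels must be passed on the first call.
online_clf = SGDClassifier(loss="log")
for batch in np.array_split(np.arange(len(X)), 10):
    online_clf.partial_fit(X[batch], y[batch], classes=np.array([0, 1]))

print(svm_like.score(X, y), logreg_like.score(X, y), online_clf.score(X, y))
```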
2017b
- (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/sgd.html#classification
- QUOTE: The class SGDClassifier implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties for classification
(...)
SGDClassifier supports the following loss functions:
- loss="hinge": (soft-margin) linear Support Vector Machine,
- loss="modified_huber": smoothed hinge loss,
- loss="log": logistic regression,
- and all regression losses below (...)
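As a sketch of how this loss choice plays out in code (the toy data here is an assumption), the smoothed losses loss="log" and loss="modified_huber" also enable probability estimates via predict_proba, whereas loss="hinge" exposes only decision_function:

```python
# Sketch: comparing SGDClassifier loss functions (toy data is an assumption).
# Newer scikit-learn versions spell the logistic loss "log_loss".
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(42)
X = rng.randn(100, 5)
y = (X[:, 0] > 0).astype(int)

for loss in ["hinge", "modified_huber", "log"]:
    clf = SGDClassifier(loss=loss).fit(X, y)
    # Signed distance to the hyperplane; sign(f(x)) gives the predicted class.
    scores = clf.decision_function(X[:2])
    # Probability estimates exist only for the probabilistic losses.
    proba = clf.predict_proba(X[:2]) if loss != "hinge" else None
    print(loss, scores, proba)
```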
- SGDClassifier supports averaged SGD (ASGD). Averaging can be enabled by setting `average=True`. ASGD works by averaging the coefficients of plain SGD over each iteration over a sample. When using ASGD, the learning rate can be larger and even constant, leading on some datasets to a speed-up in training time.
For classification with a logistic loss, another variant of SGD with an averaging strategy is available with the Stochastic Average Gradient (SAG) algorithm, available as a solver in LogisticRegression.
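A minimal sketch of the two averaging variants mentioned above, assuming a made-up dataset and hyperparameter values:

```python
# Sketch: averaged SGD (average=True) and the SAG solver in LogisticRegression.
# Dataset and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.linear_model import SGDClassifier, LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(500, 20)
y = (X @ rng.randn(20) > 0).astype(int)

# ASGD: coefficients are averaged over the updates; a larger, even constant,
# learning rate can be used.
asgd = SGDClassifier(loss="hinge", average=True,
                     learning_rate="constant", eta0=0.01).fit(X, y)

# SAG: an averaging-based stochastic solver for the logistic loss.
sag = LogisticRegression(solver="sag").fit(X, y)

print(asgd.score(X, y), sag.score(X, y))
```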
2017c
- (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/sgd.html#mathematical-formulation Retrieved:2017-09-17
- QUOTE: Given a set of training examples [math]\displaystyle{ (x_1, y_1), \ldots, (x_n, y_n) }[/math] where [math]\displaystyle{ x_i \in \mathbf{R}^m }[/math] and [math]\displaystyle{ y_i \in \{-1,1\} }[/math], our goal is to learn a linear scoring function [math]\displaystyle{ f(x) = w^T x + b }[/math] with model parameters [math]\displaystyle{ w \in \mathbf{R}^m }[/math] and intercept [math]\displaystyle{ b \in \mathbf{R} }[/math]. In order to make predictions, we simply look at the sign of [math]\displaystyle{ f(x) }[/math]. A common choice to find the model parameters is by minimizing the regularized training error given by
[math]\displaystyle{ E(w,b) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w) }[/math]
where [math]\displaystyle{ L }[/math] is a loss function that measures model (mis)fit and [math]\displaystyle{ R }[/math] is a regularization term (aka penalty) that penalizes model complexity; [math]\displaystyle{ \alpha \gt 0 }[/math] is a non-negative hyperparameter.
Different choices for L entail different classifiers such as
- Hinge: (soft-margin) Support Vector Machines.
- Log: Logistic Regression.
- Least-Squares: Ridge Regression.
- Epsilon-Insensitive: (soft-margin) Support Vector Regression.
- (...)
- Popular choices for the regularization term R include:
- L2 norm: [math]\displaystyle{ R(w) := \frac{1}{2} \sum_{i=1}^{n} w_i^2 }[/math],
- L1 norm: [math]\displaystyle{ R(w) := \sum_{i=1}^{n} |w_i| }[/math], which leads to sparse solutions.
- Elastic Net: [math]\displaystyle{ R(w) := \frac{\rho}{2} \sum_{i=1}^{n} w_i^2 + (1-\rho) \sum_{i=1}^{n} |w_i| }[/math], a convex combination of L2 and L1, where [math]\displaystyle{ \rho }[/math] is given by 1 - l1_ratio.
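To make the formulation concrete, the following sketch (with made-up data, weights, and hyperparameter values) evaluates the regularized training error [math]\displaystyle{ E(w,b) }[/math] for the hinge loss under each of the three penalties above, taking ρ as 1 - l1_ratio:

```python
# Sketch: evaluating E(w,b) = (1/n) * sum_i L(y_i, f(x_i)) + alpha * R(w)
# with the hinge loss and the L2 / L1 / Elastic Net penalties listed above.
# Data, parameters, alpha, and l1_ratio are illustrative assumptions.
import numpy as np

rng = np.random.RandomState(0)
n, m = 50, 4
X = rng.randn(n, m)
y = np.where(X[:, 0] > 0, 1, -1)        # labels in {-1, +1}
w, b = rng.randn(m), 0.1                 # some candidate model parameters
alpha, l1_ratio = 1e-4, 0.15
rho = 1 - l1_ratio

f = X @ w + b                            # linear scoring function f(x) = w^T x + b
hinge = np.maximum(0.0, 1.0 - y * f)     # hinge loss L(y, f(x)) = max(0, 1 - y f(x))

R = {
    "L2": 0.5 * np.sum(w ** 2),
    "L1": np.sum(np.abs(w)),
}
# Elastic Net is the stated convex combination of the L2 and L1 terms.
R["Elastic Net"] = rho * R["L2"] + (1 - rho) * R["L1"]

for name, penalty in R.items():
    E = hinge.mean() + alpha * penalty
    print(name, E)
```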