Linear Least-Squares L1-Regularized Regression Task
A Linear Least-Squares L1-Regularized Regression Task is a regularized linear least-squares regression task that uses an L1 regularization penalty and thereby also performs feature selection.
- AKA: Least Absolute Shrinkage and Selection Operator (LASSO) Regression.
- Context:
- Task Input: an N-observed Numerically-Labeled Training Dataset [math]\displaystyle{ D=\{(x_1,y_1),(x_2,y_2),\cdots,(x_n,y_n)\} }[/math] that can be represented by a continuous response variable dataset [math]\displaystyle{ \mathbf{Y} }[/math] and a continuous predictor variables dataset [math]\displaystyle{ \mathbf{X} }[/math].
- Task Output:
- [math]\displaystyle{ \boldsymbol{\beta}=\{\beta_0,\beta_1,...,\beta_p\} }[/math], estimated linear model parameters vector, a continuous dataset.
- [math]\displaystyle{ \mathbf{\hat{Y}}=f(x_i,\hat{\beta_j}) }[/math], predicted values (the Fitted Linear Function), a continuous dataset.
- [math]\displaystyle{ \sigma_x,\sigma_y,\rho_{X,Y},\ldots }[/math], standard deviations, correlation coefficient, standard error of the estimate, and other statistical information about the fitted parameters.
- Task Requirements:
- It requires minimizing a regularized objective function in which the regularization function, [math]\displaystyle{ R(f) }[/math], is an L1 Norm. That is,
[math]\displaystyle{ \underset{\boldsymbol{\beta}}{\text{minimize}}\{E(f)+\lambda \sum_{j=1}^p \vert \beta_j\vert \} }[/math],
where [math]\displaystyle{ E(f) }[/math] is usually the linear least-squares task objective function (the intercept [math]\displaystyle{ \beta_0 }[/math] is typically left unpenalized).
- It requires a regression diagnostic test to determine the goodness of fit of the regression model and the statistical significance of the estimated parameters.
- It can be solved by a LASSO Regression System that implements a LASSO Algorithm.
- For the linear regression task represented by the equation [math]\displaystyle{ y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip}+\varepsilon_i }[/math], the lasso regression task can be solved by
[math]\displaystyle{ \underset{\boldsymbol{\beta}}{\text{minimize}} \{ \sum_{i=1}^n ( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j)^2 + \lambda \sum_{j=1}^p \vert\beta_j\vert \} }[/math].
- For the linear regression task represented in the matrix form [math]\displaystyle{ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta} + \mathbf{U} }[/math], the lasso regression task can be solved by
[math]\displaystyle{ \underset{\boldsymbol{\beta}}{\text{minimize}}\{ \parallel \mathbf{Y} - \mathbf{X}\boldsymbol{\beta} \parallel_2^2 + \lambda \parallel \boldsymbol{\beta}\parallel_1 \} }[/math], where [math]\displaystyle{ \parallel\cdot\parallel_1 }[/math] denotes the L1 Norm (a minimal solver sketch is given below).
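The matrix-form objective above can be minimized, for example, by cyclic coordinate descent with soft-thresholding. The following is a minimal illustrative sketch, not a reference implementation: the names lasso_coordinate_descent and soft_threshold are hypothetical, no intercept is fitted (inputs are assumed centered), and the dataset and regularization value are made up for the demonstration.

```python
import numpy as np


def soft_threshold(rho, threshold):
    """Soft-thresholding operator: closed-form solution of the one-dimensional lasso subproblem."""
    return np.sign(rho) * max(abs(rho) - threshold, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize ||y - X beta||_2^2 + lam * ||beta||_1 by cyclic coordinate descent.

    Assumes no column of X is all zeros and that X and y are centered,
    since no intercept term is fitted here.
    """
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq_norms = (X ** 2).sum(axis=0)  # x_j^T x_j for each feature j
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual that excludes feature j's current contribution.
            residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ residual
            # Threshold at lam / 2 because the squared-error term carries no 1/2 factor.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq_norms[j]
    return beta


# Tiny synthetic check: only the first two of five features are relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(lasso_coordinate_descent(X, y, lam=1.0))  # irrelevant coefficients are shrunk toward (often exactly) zero
```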
- Example(s):
- What is the 10-Fold RMSE of LASSO on sklearn's Boston Dataset? It is approximately ____ (based on sklearn.linear_model.Lasso; a cross-validation sketch is given after this list).
- …
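The blank in the example above is left as in the original. As a hedged sketch of how such a figure could be computed with sklearn.linear_model.Lasso, the snippet below runs 10-fold cross-validation; note that the Boston housing dataset has been removed from recent scikit-learn releases, so the sketch substitutes the bundled diabetes dataset, and the alpha value is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Illustrative stand-in dataset (the Boston dataset is no longer shipped with scikit-learn).
X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=1.0)  # alpha plays the role of the regularization parameter lambda

# cross_val_score returns negated MSE, so negate and take the square root for RMSE.
neg_mse = cross_val_score(lasso, X, y, cv=10, scoring="neg_mean_squared_error")
rmse = np.sqrt(-neg_mse)
print(f"10-fold RMSE: {rmse.mean():.3f} ± {rmse.std():.3f}")
```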
- Counter-Example(s):
- a Linear Least-Squares L2-Regularized (Ridge) Regression Task.
- See: Regularization Parameter, Regularization, L1 Regularization, Lp Regularization.
References
2017a
- (Zhang, 2017) ⇒ Xinhua Zhang (2017). “Regularization”. In: “Encyclopedia of Machine Learning and Data Mining” (Sammut & Webb, 2017) pp. 1083-1088. ISBN: 978-1-4899-7687-1, DOI: 10.1007/978-1-4899-7687-1_718
- QUOTE: A common approach to regularization is to penalize a model by its complexity measured by some real-valued function, e.g., a certain “norm” of [math]\displaystyle{ \mathbf{w} }[/math]. We list some examples below.
L1 regularization
L1 regularizer, [math]\displaystyle{ \left \|\mathbf{w}\right \|_{1} :=\sum _{i}\left \vert w_{i}\right \vert }[/math], is a popular approach to finding sparse models, i.e., only a few components of [math]\displaystyle{ \mathbf{w} }[/math] are nonzero, and only a corresponding small number of features are relevant to the prediction. A well-known example is the LASSO algorithm (Tibshirani,1996), which uses a L1-regularized least square:
[math]\displaystyle{ \displaystyle{\min _{\mathbf{w}\in \mathbb{R}^{p}}\left \|X^{\top }\mathbf{w} -\mathbf{ y}\right \|^{2} +\lambda \left \|\mathbf{w}\right \|_{ 1}.} }[/math].
2017b
- (Quadrianto & Buntine, 2017) ⇒ Novi Quadrianto, Wray L. Buntine (2017). "Linear Regression" in "Encyclopedia of Machine Learning and Data Mining (2017)" pp 747-750 DOI:10.1007/978-1-4899-7687-1_481 ISBN: 978-1-4899-7687-1.
- QUOTE: Regularized/Penalized Least Squares Method
The issue of over-fitting as mentioned in Regression is usually addressed by introducing a regularization or penalty term to the objective function. The regularized objective function is now in the form of
[math]\displaystyle{ E_{\mathrm{reg}} = E(w) +\lambda R(w) }[/math] (9)
Here [math]\displaystyle{ E }[/math]([math]\displaystyle{ w }[/math]) measures the quality (such as least squares quality) of the solution on the observed data points, [math]\displaystyle{ R }[/math]([math]\displaystyle{ w }[/math]) penalizes complex solutions, and [math]\displaystyle{ \lambda }[/math] is called the regularization parameter which controls the relative importance between the two. This regularized formulation is sometimes called coefficient shrinkage as it shrinks coefficients/weights toward zero (cf. coefficient subset selection formulation where the best [math]\displaystyle{ k }[/math] out of [math]\displaystyle{ H }[/math] basis functions are greedily selected). Two simple penalty terms [math]\displaystyle{ R }[/math]([math]\displaystyle{ w }[/math]) are given next, but more generally measures of curvature can also be used to penalize non-smooth functions (...)
Lasso Regression
The regularization term is in the form of
[math]\displaystyle{ \displaystyle{ R(w) =\sum _{ d=1}^{D}\vert w_{ d}\vert. }\quad }[/math] (14)
In contrast to ridge regression, lasso regression (Tibshirani,1996) has no closed-form solution. In fact, the non-differentiability of the regularization term has produced many approaches. Most of the methods involve quadratic programming and recently coordinate-wise descent algorithms for large lasso problems (Friedman et al. 2007). Lasso regression leads to sparsity in w, that is, only a subset of w is nonzero, so irrelevant basis functions will be ignored.
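The sparsity-inducing behavior described in this quote can be illustrated with scikit-learn. The sketch below contrasts the number of exactly-zero coefficients produced by Lasso and Ridge; the synthetic dataset, alpha values, and random seed are illustrative assumptions, not part of the quoted source.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data in which only 3 of 20 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [4.0, -3.0, 2.0]
y = X @ true_w + 0.5 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0))  # typically close to 3
print("nonzero ridge coefficients:", np.sum(ridge.coef_ != 0))  # typically all 20
```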
2017c
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Lasso_(statistics) Retrieved:2017-8-27.
- In statistics and machine learning, lasso (least absolute shrinkage and selection operator) (also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. It was introduced by Robert Tibshirani in 1996 based on Leo Breiman’s Nonnegative Garrote.[1] [2] Lasso was originally formulated for least squares models and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression and best subset selection and the connections between lasso coefficient estimates and so-called soft thresholding. It also reveals that (like standard linear regression) the coefficient estimates need not be unique if covariates are collinear.
Though originally defined for least squares, lasso regularization is easily extended to a wide variety of statistical models including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators, in a straightforward fashion.[1][3] Lasso’s ability to perform subset selection relies on the form of the constraint and has a variety of interpretations including in terms of geometry, Bayesian statistics, and convex analysis.
The LASSO is closely related to basis pursuit denoising.
2017d
- (Scikit-Learn, 2017) ⇒ "1.1.3. Lasso" http://scikit-learn.org/stable/modules/linear_model.html#lasso
- QUOTE: The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer parameter values, effectively reducing the number of variables upon which the given solution is dependent. For this reason, the Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero weights (see Compressive sensing: tomography reconstruction with L1 prior (Lasso)). Mathematically, it consists of a linear model trained with [math]\displaystyle{ \ell_1 }[/math] prior as regularizer. The objective function to minimize is:
[math]\displaystyle{ \underset{w}{min\,} { \frac{1}{2n_{samples}} ||X w - y||_2 ^ 2 + \alpha ||w||_1} }[/math]
The lasso estimate thus solves the minimization of the least-squares penalty with [math]\displaystyle{ \alpha ||w||_1 }[/math] added, where [math]\displaystyle{ \alpha }[/math] is a constant and [math]\displaystyle{ ||w||_1 }[/math] is the [math]\displaystyle{ \ell_1 }[/math]-norm of the parameter vector.
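Note that, relative to the unscaled matrix formulation [math]\displaystyle{ \parallel \mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\parallel_2^2 + \lambda \parallel\boldsymbol{\beta}\parallel_1 }[/math] given earlier on this page, scikit-learn's objective divides the squared-error term by [math]\displaystyle{ 2n_{samples} }[/math], so its [math]\displaystyle{ \alpha }[/math] corresponds to [math]\displaystyle{ \alpha = \lambda / (2n_{samples}) }[/math].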
- ↑ 1.0 1.1 Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the lasso”. Journal of the Royal Statistical Society. Series B (methodological) 58 (1). Wiley: 267–88. http://www.jstor.org/stable/2346178.
- ↑ Breiman, Leo. 1995. “Better Subset Regression Using the Nonnegative Garrote”. Technometrics 37 (4). Taylor & Francis, Ltd.: 373–84. doi:10.2307/1269730.
- ↑ Tibshirani, Robert. 1997. “The lasso Method for Variable Selection in the Cox Model”. Statistics in Medicine, Vol. 16, 385-395 (1997)