Lasso Regression Algorithm
A Lasso Regression Algorithm is a linear regression algorithm that performs both coefficient shrinkage and variable selection by penalizing the L1-norm of the coefficient vector.
- AKA: Least Absolute Shrinkage and Selection Operator.
- Context:
- It can be represented as an L1 Regularized Optimization Algorithm of the form [math]\displaystyle{ \hat{\beta}(\lambda)={\rm arg}\ {\rm min}_{\beta}\ L({\rm y},X\beta)+\lambda J(\beta) }[/math], where [math]\displaystyle{ L }[/math] is squared error loss and [math]\displaystyle{ J(\beta) = \|\beta\|_1 }[/math] is the [math]\displaystyle{ \ell_{1} }[/math] norm of [math]\displaystyle{ \beta }[/math] (a minimal fitting sketch appears after the lists below).
- It can be applied by a Lasso Regression System (that can solve a lasso regression task).
- Example(s):
- a LARS-based Lasso solver (one that traces the lasso coefficient path with Least Angle Regression).
- Counter-Example(s):
- a Ridge Regression Algorithm, which uses an L2 penalty and never drives coefficients exactly to zero.
- See: L1-Norm Regularizer, L1-Norm Regularization, Convex Optimization, Least Angle Regression, Elastic Net Regularization.
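The L1-regularized objective in the Context above can be illustrated with a minimal fitting sketch, assuming scikit-learn is available; the synthetic data, the value alpha=0.1, and the variable names below are illustrative assumptions, not part of the definition (scikit-learn's Lasso scales the squared-error term by 1/(2n), so alpha plays the role of the penalty weight only up to that factor).
<syntaxhighlight lang="python">
# Minimal sketch: fit the L1-regularized least-squares objective with scikit-learn.
# The data are synthetic and the penalty value is arbitrary (illustration only).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_beta = np.array([3.0, -2.0] + [0.0] * 8)        # sparse ground truth
y = X @ true_beta + rng.normal(scale=0.5, size=100)

model = Lasso(alpha=0.1)    # alpha acts as the penalty weight, up to sklearn's 1/(2n) scaling
model.fit(X, y)
print(model.coef_)          # several coefficients are shrunk exactly to zero
</syntaxhighlight>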
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Least_squares#Lasso_method Retrieved:2015-1-14.
- An alternative regularized version of least squares is lasso (least absolute shrinkage and selection operator), which uses the constraint that [math]\displaystyle{ \|\beta\|_1 }[/math], the L1-norm of the parameter vector, is no greater than a given value. (As above, this is equivalent to an unconstrained minimization of the least-squares penalty with [math]\displaystyle{ \alpha\|\beta\|_1 }[/math] added.) In a Bayesian context, this is equivalent to placing a zero-mean Laplace prior distribution on the parameter vector. The optimization problem may be solved using quadratic programming or more general convex optimization methods, as well as by specific algorithms such as the least angle regression algorithm. One of the prime differences between Lasso and ridge regression is that in ridge regression, as the penalty is increased, all parameters are reduced while still remaining non-zero, while in Lasso, increasing the penalty will cause more and more of the parameters to be driven to zero. This is an advantage of Lasso over ridge regression, as driving parameters to zero deselects the features from the regression. Thus, Lasso automatically selects more relevant features and discards the others, whereas Ridge regression never fully discards any features. Some feature selection techniques are developed based on the LASSO including Bolasso which bootstraps samples, and FeaLect which analyzes the regression coefficients corresponding to different values of [math]\displaystyle{ \alpha }[/math] to score all the features.
The L1-regularized formulation is useful in some contexts due to its tendency to prefer solutions with fewer nonzero parameter values, effectively reducing the number of variables upon which the given solution is dependent. For this reason, the Lasso and its variants are fundamental to the field of compressed sensing. An extension of this approach is elastic net regularization.
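As an illustration of the lasso-versus-ridge contrast described above (a sketch under assumed synthetic data and an arbitrary penalty grid, not taken from the quoted source), the following counts how many coefficients each method drives exactly to zero as the penalty grows.
<syntaxhighlight lang="python">
# Sketch: compare how lasso and ridge shrink coefficients as the penalty increases.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
beta = np.zeros(20)
beta[:3] = [4.0, -3.0, 2.0]                     # only three informative features
y = X @ beta + rng.normal(scale=0.5, size=200)

for penalty in [0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=penalty).fit(X, y)
    ridge = Ridge(alpha=penalty).fit(X, y)
    n_zero_lasso = int((np.abs(lasso.coef_) < 1e-8).sum())   # grows with the penalty
    n_zero_ridge = int((np.abs(ridge.coef_) < 1e-8).sum())   # typically stays at zero
    print(penalty, n_zero_lasso, n_zero_ridge)
</syntaxhighlight>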
2011
- http://www-stat.stanford.edu/~tibs/lasso.html
- The Lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients. It has connections to soft-thresholding of wavelet coefficients, forward stagewise regression, and boosting methods.
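The connection to soft-thresholding mentioned above can be made concrete with a small sketch: under an orthonormal design (an assumption adopted here purely for illustration), the lasso solution is obtained by soft-thresholding the least-squares coefficients.
<syntaxhighlight lang="python">
# Sketch of the soft-thresholding operator S_t(z) = sign(z) * max(|z| - t, 0),
# which yields the lasso solution from the OLS coefficients when X is orthonormal.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

ols_coefs = np.array([2.5, -0.3, 0.8, -1.7])
print(soft_threshold(ols_coefs, 1.0))   # entries smaller than the threshold become exactly zero
</syntaxhighlight>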
- http://en.wikipedia.org/wiki/Least_squares#LASSO_method
- In some contexts a regularized version of the least squares solution may be preferable. The LASSO (least absolute shrinkage and selection operator) algorithm, for example, finds a least-squares solution with the constraint that [math]\displaystyle{ \|\beta\|_1 }[/math], the L1-norm of the parameter vector, is no greater than a given value. Equivalently, it may solve an unconstrained minimization of the least-squares penalty with [math]\displaystyle{ \alpha\|\beta\|_1 }[/math] added, where [math]\displaystyle{ \alpha }[/math] is a constant (this is the Lagrangian form of the constrained problem). This problem may be solved using quadratic programming or more general convex optimization methods, as well as by specific algorithms such as the least angle regression algorithm. The L1-regularized formulation is useful in some contexts due to its tendency to prefer solutions with fewer nonzero parameter values, effectively reducing the number of variables upon which the given solution is dependent. For this reason, the LASSO and its variants are fundamental to the field of compressed sensing.
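Both the constrained form (a bound on the L1-norm) and the Lagrangian form described above can be written directly for a general convex solver; the sketch below uses cvxpy, with synthetic data and placeholder values for the bound t and the weight alpha (all assumptions for illustration). For a suitable pairing of t and alpha the two solutions coincide, which is the equivalence the quoted passage refers to.
<syntaxhighlight lang="python">
# Sketch: the constrained and Lagrangian formulations of the lasso in cvxpy.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = rng.normal(size=50)
beta = cp.Variable(8)

# Constrained form: minimize ||y - X beta||^2  subject to  ||beta||_1 <= t
t = 2.0
constrained = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ beta)),
                         [cp.norm1(beta) <= t])
constrained.solve()
beta_constrained = beta.value

# Lagrangian form: minimize ||y - X beta||^2 + alpha * ||beta||_1
alpha = 1.5
lagrangian = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ beta) + alpha * cp.norm1(beta)))
lagrangian.solve()
beta_lagrangian = beta.value
</syntaxhighlight>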
2009
- (Hans, 2009) ⇒ Chris Hans. (2009). “Bayesian Lasso Regression.” In: Biometrika, 96(4).
- ABSTRACT: The lasso estimate for linear regression corresponds to a posterior mode when independent, double-exponential prior distributions are placed on the regression coefficients. This paper introduces new aspects of the broader Bayesian treatment of lasso regression. A direct characterization of the regression coefficients’ posterior distribution is provided, and computation and inference under this characterization is shown to be straightforward. Emphasis is placed on point estimation using the posterior mean, which facilitates prediction of future observations via the posterior predictive distribution. It is shown that the standard lasso prediction method does not necessarily agree with model-based, Bayesian predictions. A new Gibbs sampler for Bayesian lasso regression is introduced.
- Key words: Double-exponential distribution, Gibbs sampler, L1 penalty, Laplace distribution, Markov chain Monte Carlo, Posterior predictive distribution, Regularization.
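As a brief sketch of the correspondence stated in the abstract, assume a Gaussian likelihood with variance [math]\displaystyle{ \sigma^2 }[/math] and independent Laplace (double-exponential) priors with scale [math]\displaystyle{ \tau }[/math] on the coefficients (these symbols are notational assumptions, not taken from the paper). The negative log-posterior is then [math]\displaystyle{ -\log p(\beta \mid y) = \tfrac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \tfrac{1}{\tau}\|\beta\|_1 + \text{const} }[/math], so the posterior mode is exactly the lasso estimate with penalty [math]\displaystyle{ \lambda = 2\sigma^2/\tau }[/math] in the [math]\displaystyle{ \|y - X\beta\|_2^2 + \lambda\|\beta\|_1 }[/math] parameterization.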
2008
- (Friedman et al., 2008) ⇒ Jerome Friedman, Trevor Hastie, and Robert Tibshirani. (2008). “Sparse Inverse Covariance Estimation with the Graphical Lasso.” In: Biostatistics, 9(3). doi:10.1093/biostatistics/kxm045.
2007
- (Rosset & Zhu, 2007) ⇒ Saharon Rosset, and Ji Zhu. (2007). “Piecewise Linear Regularized Solution Paths." The Annals of Statistics
- QUOTE: We consider the generic regularized optimization problem [math]\displaystyle{ \hat{\beta}(\lambda)={\rm arg}\ {\rm min}_{\beta}\ L({\rm y},X\beta)+\lambda J(\beta) }[/math]. Efron, Hastie, Johnstone and Tibshirani (Ann. Statist. 32 (2004) 407-499) have shown that for the LASSO - that is, if L is squared error loss and [math]\displaystyle{ J(\beta) = \|\beta\|_1 }[/math] is the [math]\displaystyle{ \ell_{1} }[/math] norm of [math]\displaystyle{ \beta }[/math] - the optimal coefficient path is piecewise linear, that is, [math]\displaystyle{ \partial \hat{\beta}(\lambda) / \partial \lambda }[/math] is piecewise constant.
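The piecewise-linear coefficient path quoted above can be inspected numerically with scikit-learn's lars_path in lasso mode; the synthetic data below are an assumption for illustration.
<syntaxhighlight lang="python">
# Sketch: compute the lasso coefficient path with least angle regression.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.0]) + rng.normal(scale=0.3, size=100)

# coefs[:, k] is the lasso solution at penalty alphas[k]; between the returned
# breakpoints the coefficients change linearly, i.e. the path is piecewise linear.
alphas, active, coefs = lars_path(X, y, method="lasso")
print(alphas.shape, coefs.shape)
</syntaxhighlight>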
2005
- (Tibshirani et al., 2005) ⇒ Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. (2005). “Sparsity and Smoothness via the Fused Lasso.” In: Journal of the Royal Statistical Society (Series B), 67(1).
- ABSTRACT: The lasso penalizes a least squares regression by the sum of the absolute values (the [math]\displaystyle{ L_1 }[/math]-norm) of the coefficients. The form of this penalty encourages sparse solutions (with many coefficients equal to 0). We propose the 'fused lasso', a generalization that is designed for problems with features that can be ordered in some meaningful way. The fused lasso penalizes the [math]\displaystyle{ L_1 }[/math]-norm of both the coefficients and their successive differences. Thus it encourages sparsity of the coefficients and also sparsity of their differences - i.e. local constancy of the coefficient profile. The fused lasso is especially useful when the number of features p is much greater than N, the sample size. The technique is also extended to the 'hinge' loss function that underlies the support vector classifier. We illustrate the methods on examples from protein mass spectroscopy and gene expression data.
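A minimal sketch of the fused-lasso objective described in the abstract, written with cvxpy; the synthetic data and the two penalty weights lambda1 and lambda2 are placeholders, not values from the paper.
<syntaxhighlight lang="python">
# Sketch: the fused lasso penalizes the L1 norm of the coefficients and of their
# successive differences, encouraging sparsity and local constancy.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = rng.normal(size=60)

beta = cp.Variable(30)
lambda1, lambda2 = 1.0, 1.0

objective = cp.Minimize(cp.sum_squares(y - X @ beta)
                        + lambda1 * cp.norm1(beta)            # sparsity of the coefficients
                        + lambda2 * cp.norm1(cp.diff(beta)))  # sparsity of successive differences
cp.Problem(objective).solve()
print(np.round(beta.value, 3))
</syntaxhighlight>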
1996
- (Tibshirani, 1996) ⇒ Robert Tibshirani. (1996). “Regression Shrinkage and Selection via the Lasso.” In: Journal of the Royal Statistical Society, Series B, 58(1).