Regularized Linear Regression Task
A Regularized Linear Regression Task is a linear regression task that is based on the minimization of a regularized objective function.
- Context:
- Task Input:
An N-observed Numerically-Labeled Training Dataset [math]\displaystyle{ D=\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\} }[/math] that can be represented by
- [math]\displaystyle{ \mathbf{Y} }[/math], the response variable dataset, a continuous dataset.
- [math]\displaystyle{ \mathbf{X} }[/math], the predictor variables dataset, a continuous dataset.
- Task Output:
- [math]\displaystyle{ \boldsymbol{\beta} }[/math], the estimated model parameters vector, a continuous dataset.
- [math]\displaystyle{ \mathbf{Y^*} }[/math], the Fitted Linear Function values, a continuous dataset.
- [math]\displaystyle{ \lambda }[/math], the regularization parameter.
- [math]\displaystyle{ \sigma_x,\sigma_y,\rho_{X,Y},\ldots }[/math], standard deviations, correlation coefficient, standard error of estimate, and other statistical information about the fitting parameters.
- Task Requirements:
- It requires the best-fitting [math]\displaystyle{ \beta_j }[/math] linear model parameters and the [math]\displaystyle{ \lambda }[/math] regularization parameter that optimize a regularized objective function of the form (a minimal sketch appears after this list):
[math]\displaystyle{ E_{reg}(f)=E(f)+ \lambda R(f) }[/math]
where [math]\displaystyle{ E(f) }[/math] is the objective function given by the unregularized linear regression task;
and [math]\displaystyle{ R(f) }[/math] is a regularization function that penalizes the complexity of the latent linear regression function.
- A regression diagnostic test to determine the goodness of fit of the regression model and the statistical significance of the estimated parameters.
- It can be solved by a Regularized Linear Regression System that implements a regularized linear regression algorithm.
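As an illustration of this task formulation (not part of the definition above), the following is a minimal sketch of such a regularized linear regression system, assuming a squared-error [math]\displaystyle{ E(f) }[/math] and an L2 regularizer [math]\displaystyle{ R(f) }[/math] (i.e., ridge regression); the toy dataset and the value of [math]\displaystyle{ \lambda }[/math] are arbitrary assumptions:

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Minimize E_reg(beta) = ||X beta - Y||^2 + lam * ||beta||^2 in closed form:
    beta = (X^T X + lam * I)^(-1) X^T Y."""
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    Y_star = X @ beta  # fitted values Y*
    return beta, Y_star

# Toy training dataset D = {(x_i, y_i)} (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=100)

beta, Y_star = ridge_fit(X, Y, lam=0.5)
print(beta)  # close to [2, -1], shrunk slightly toward zero by the penalty
```

With this choice of [math]\displaystyle{ R(f) }[/math] the minimizer has a closed form; other regularizers (e.g., L1) generally require iterative solvers, and [math]\displaystyle{ \lambda }[/math] is typically chosen by a procedure such as cross-validation.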
- Example(s):
- Counter-Example(s):
- See: Regularization, Regularized Learning Task, Regression Analysis, Statistical Classification.
References
2017a
- (Zhang, 2017) ⇒ Xinhua Zhang (2017). “Regularization”. In: “Encyclopedia of Machine Learning and Data Mining” (Sammut & Webb, 2017), pp. 1083-1088. ISBN: 978-1-4899-7687-1. DOI: 10.1007/978-1-4899-7687-1_718.
- QUOTE: In general, a regularizer is a quantifier of the complexity of a model, and many successful machine learning algorithms fall in the framework of regularized risk minimization:
[math]\displaystyle{ (\text{How well the model fits the training data})\quad \quad (1) }[/math]
[math]\displaystyle{ +\lambda \cdot (\text{complexity/regularization of the model})\quad\quad(2) }[/math]
where the positive real number [math]\displaystyle{ \lambda }[/math] controls the trade-off.
There is a variety of regularizers, which yield different statistical and computational properties. In general, there is no universally best regularizer, and a regularization approach must be chosen depending on the dataset.
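To make the trade-off controlled by [math]\displaystyle{ \lambda }[/math] concrete, the following minimal NumPy sketch (the toy data, the [math]\displaystyle{ \lambda }[/math] grid, and the choice of an L2 regularizer are illustrative assumptions, not from Zhang's entry) prints how the data-fit term grows while the model-complexity term shrinks as [math]\displaystyle{ \lambda }[/math] increases:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([3.0, -2.0, 0.5]) + 0.2 * rng.normal(size=50)

for lam in [0.0, 1.0, 10.0, 100.0]:
    # L2-regularized least squares: beta = argmin ||X beta - Y||^2 + lam * ||beta||^2
    beta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ Y)
    data_fit = np.sum((X @ beta - Y) ** 2)   # how well the model fits the training data
    complexity = np.sum(beta ** 2)           # complexity/regularization of the model
    print(f"lambda={lam:7.1f}   fit term={data_fit:9.3f}   ||beta||^2={complexity:7.3f}")
```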
2017b
- (Stats Stack Exchange, 2017) ⇒ http://stats.stackexchange.com/questions/228763/regularization-methods-for-logistic-regression Retrieved: 2017-08-20.
- QUOTE: Yes, regularization can be used in all linear methods, including both regression and classification. I would like to show you that there is not much difference between regression and classification: the only difference is the loss function.
Specifically, there are three major components of a linear method: the loss function, the regularization, and the algorithm. The loss function plus the regularization term is the objective function of the problem in its optimization form, and the algorithm is the way to solve it (assuming the objective function is convex; algorithms are not discussed in this post).
In the loss function setting, we can have different losses in both the regression and classification cases. For example, the least squares and least absolute deviation losses can be used for regression. Their mathematical representations are [math]\displaystyle{ L(\hat y,y)=(\hat y -y)^2 }[/math] and [math]\displaystyle{ L(\hat y,y)=|\hat y -y| }[/math]. (The function [math]\displaystyle{ L( \cdot ) }[/math] is defined on two scalars: [math]\displaystyle{ y }[/math] is the ground truth value and [math]\displaystyle{ \hat y }[/math] is the predicted value.)
On the other hand, the logistic loss and hinge loss can be used for classification. Their mathematical representations are [math]\displaystyle{ L(\hat y, y)=\log (1+ \exp(-\hat y y)) }[/math] and [math]\displaystyle{ L(\hat y, y)= (1- \hat y y)_+ }[/math]. (Here, [math]\displaystyle{ y }[/math] is the ground truth label in [math]\displaystyle{ \{-1,1\} }[/math] and [math]\displaystyle{ \hat y }[/math] is the predicted "score". The definition of [math]\displaystyle{ \hat y }[/math] is a little bit unusual; please see the comment section.)
In the regularization setting, you mentioned the L1 and L2 regularization; there are also other forms, which will not be discussed in this post.
Therefore, at a high level, a linear method is
[math]\displaystyle{ \underset{w}{\text{minimize}}~~~ \sum_{x,y} L(w^{\top} x,y)+\lambda h(w) }[/math]
If you replace the loss function from the regression setting with the logistic loss, you get logistic regression with regularization.
For example, in ridge regression, the optimization problem is
[math]\displaystyle{ \underset{w}{\text{minimize}}~~~ \sum_{x,y} (w^{\top} x-y)^2+\lambda w^\top w }[/math]
If you replace the loss function with logistic loss, the problem becomes
[math]\displaystyle{ \underset{w}{\text{minimize}}~~~ \sum_{x,y} \log(1+\exp{(-w^{\top}x \cdot y)})+\lambda w^\top w }[/math]
Here you have logistic regression with L2 regularization.
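The point of this answer — that the regularizer and the algorithm can stay fixed while only the loss changes — can be sketched as follows; the plain gradient-descent solver, step size, and toy data are assumptions for illustration and are not from the post:

```python
import numpy as np

def fit(X, y, loss_grad, lam=1.0, lr=0.001, steps=5000):
    """Minimize sum_i L(w^T x_i, y_i) + lam * w^T w by plain gradient descent.
    Only loss_grad changes between the regression and classification cases."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * (loss_grad(X, y, w) + 2.0 * lam * w)  # loss gradient + L2 gradient
    return w

# Gradient of the squared loss  sum_i (w^T x_i - y_i)^2
def squared_loss_grad(X, y, w):
    return 2.0 * X.T @ (X @ w - y)

# Gradient of the logistic loss  sum_i log(1 + exp(-y_i * w^T x_i)), with y_i in {-1, +1}
def logistic_loss_grad(X, y, w):
    return X.T @ (-y / (1.0 + np.exp(y * (X @ w))))

# Toy data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y_reg = X @ np.array([1.5, -0.5]) + 0.1 * rng.normal(size=200)  # continuous targets
y_cls = np.sign(X @ np.array([1.5, -0.5]))                      # {-1, +1} labels

w_ridge = fit(X, y_reg, squared_loss_grad)    # ridge regression
w_logreg = fit(X, y_cls, logistic_loss_grad)  # L2-regularized logistic regression
print("ridge:   ", w_ridge)
print("logistic:", w_logreg)
```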
2017c
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Regularization_(mathematics)#Use_of_regularization_in_classification Retrieved: 2017-08-20.
- One particular use of regularization is in the field of classification. Empirical learning of classifiers (learning from a finite data set) is always an underdetermined problem, because in general we are trying to infer a function of any [math]\displaystyle{ x }[/math] given only some examples [math]\displaystyle{ x_1, x_2, \ldots, x_n }[/math].
A regularization term (or regularizer) [math]\displaystyle{ R(f) }[/math] is added to a loss function: [math]\displaystyle{ \min_f \sum_{i=1}^{n} V(f(\hat x_i), \hat y_i) + \lambda R(f) }[/math] where [math]\displaystyle{ V }[/math] is an underlying loss function that describes the cost of predicting [math]\displaystyle{ f(x) }[/math] when the label is [math]\displaystyle{ y }[/math], such as the square loss or hinge loss, and [math]\displaystyle{ \lambda }[/math] is a parameter which controls the importance of the regularization term. [math]\displaystyle{ R(f) }[/math] is typically chosen to impose a penalty on the complexity of [math]\displaystyle{ f }[/math]. Concrete notions of complexity used include restrictions for smoothness and bounds on the vector space norm. A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.
Regularization can be used to learn simpler models, induce models to be sparse, introduce group structure into the learning problem, and more.
The same idea arose in many fields of science. For example, the least-squares method can be viewed as a very simple form of regularization. A simple form of regularization applied to integral equations, generally termed Tikhonov regularization after Andrey Nikolayevich Tikhonov, is essentially a trade-off between fitting the data and reducing a norm of the solution. More recently, non-linear regularization methods, including total variation regularization, have become popular.
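As a small illustration of that trade-off (a toy sketch under assumed data, not from the Wikipedia article), Tikhonov/L2 regularization stabilizes an ill-conditioned least-squares problem whose unregularized solution is dominated by noise:

```python
import numpy as np

rng = np.random.default_rng(3)

# Ill-conditioned design matrix: the two columns are nearly collinear.
x = rng.normal(size=100)
X = np.column_stack([x, x + 1e-6 * rng.normal(size=100)])
Y = X @ np.array([1.0, 1.0]) + 0.01 * rng.normal(size=100)

# Unregularized least squares: huge, unstable coefficients along the ill-determined direction.
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]

# Tikhonov-regularized solution: trades a little data fit for a small-norm, stable solution.
lam = 1e-3
beta_tik = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ Y)

print("OLS:     ", beta_ols, " norm:", np.linalg.norm(beta_ols))
print("Tikhonov:", beta_tik, " norm:", np.linalg.norm(beta_tik))
```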