2013 A Risk Comparison of Ordinary Least Squares Vs Ridge Regression
- (Dhillon et al., 2013) ⇒ Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, and Lyle H. Ungar. (2013). “A Risk Comparison of Ordinary Least Squares Vs Ridge Regression.” In: The Journal of Machine Learning Research, 14(1).
Subject Headings: Ordinary Least Squares Estimate.
Notes
Cited By
- http://scholar.google.com/scholar?q=%222013%22+A+Risk+Comparison+of+Ordinary+Least+Squares+Vs+Ridge+Regression
- http://dl.acm.org/citation.cfm?id=2567709.2567711&preflayout=flat#citedby
Quotes
Abstract
We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR).
1. Introduction
Consider the fixed design setting where we have a set of [math]\displaystyle{ n }[/math] vectors [math]\displaystyle{ \{X_i\} }[/math], and let [math]\displaystyle{ X }[/math] denote the matrix whose [math]\displaystyle{ i }[/math]th row is [math]\displaystyle{ X_i }[/math]. The observed label vector is [math]\displaystyle{ Y \in \mathbb{R}^n }[/math].
Suppose that: [math]\displaystyle{ Y = X\beta + \epsilon }[/math], where [math]\displaystyle{ \epsilon }[/math] is independent noise in each coordinate, with the variance of [math]\displaystyle{ \epsilon_i }[/math] being [math]\displaystyle{ \sigma^2 }[/math]. The objective is to learn [math]\displaystyle{ E[Y] = X\beta }[/math]. The expected loss of an estimator [math]\displaystyle{ \beta }[/math] is: [math]\displaystyle{ L(\beta) = \frac{1}{n} E_Y[\|Y - X\beta\|^2] }[/math]. Let [math]\displaystyle{ \hat{\beta} }[/math] be an estimator of [math]\displaystyle{ \beta }[/math] (constructed with a sample [math]\displaystyle{ Y }[/math]). Denoting [math]\displaystyle{ \Sigma := \frac{1}{n} X^T X }[/math],
we have that the risk (i.e., expected excess loss) is: [math]\displaystyle{ \mathrm{Risk}(\hat{\beta}) := E_{\hat{\beta}}[L(\hat{\beta}) - L(\beta)] = E_{\hat{\beta}} \|\hat{\beta} - \beta\|^2_\Sigma }[/math], where [math]\displaystyle{ \|x\|^2_\Sigma = x^T \Sigma x }[/math] and where the expectation is with respect to the randomness in [math]\displaystyle{ Y }[/math].
We show that a simple variant of ordinary (un-regularized) least squares always compares favorably to ridge regression (as measured by the risk). This observation is based on the following bias-variance decomposition:
[math]\displaystyle{ \mathrm{Risk}(\hat{\beta}) = \underbrace{E\|\hat{\beta} - \bar{\beta}\|^2_\Sigma}_{\text{Variance}} + \underbrace{\|\bar{\beta} - \beta\|^2_\Sigma}_{\text{Prediction Bias}} \quad (1) }[/math]
where [math]\displaystyle{ \bar{\beta} = E[\hat{\beta}] }[/math].
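The decomposition in Equation (1) can be checked numerically. The following is a minimal Monte Carlo sketch (not from the paper; the Gaussian design, the choice of [math]\displaystyle{ \beta }[/math], and the noise level are arbitrary illustration choices) that estimates the risk of the OLS estimator together with its variance and prediction-bias terms:

```python
# Monte Carlo check of the risk decomposition in Eq. (1).
# Assumed setup: isotropic Gaussian fixed design, arbitrary beta and sigma.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 5, 1.0
X = rng.normal(size=(n, p))              # fixed design
beta = rng.normal(size=p)                # true parameter
Sigma = X.T @ X / n                      # Sigma := (1/n) X^T X

def ols(Y):
    """Ordinary least squares estimate for the fixed design X."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Draw many label vectors Y = X beta + eps and compute the estimator for each.
draws = np.array([ols(X @ beta + sigma * rng.normal(size=n)) for _ in range(5000)])
beta_bar = draws.mean(axis=0)            # empirical E[beta_hat]

def sq_norm_Sigma(v):
    """||v||^2_Sigma = v^T Sigma v."""
    return float(v @ Sigma @ v)

risk = np.mean([sq_norm_Sigma(b - beta) for b in draws])
variance = np.mean([sq_norm_Sigma(b - beta_bar) for b in draws])
bias = sq_norm_Sigma(beta_bar - beta)
print(risk, variance + bias)             # the two agree up to Monte Carlo error
```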
1.1 The Risk of Ridge Regression (RR)
Ridge regression, or Tikhonov regularization (Tikhonov, 1963), penalizes the [math]\displaystyle{ l_2 }[/math] norm of a parameter vector [math]\displaystyle{ \beta }[/math] and “shrinks” it towards zero, penalizing large values more. The estimator is: [math]\displaystyle{ \hat{\beta}_\lambda = \arg\min_\beta \left\{ \tfrac{1}{n}\|Y - X\beta\|^2 + \lambda \|\beta\|^2 \right\} }[/math]. The closed form estimate is then: [math]\displaystyle{ \hat{\beta}_\lambda = (\Sigma + \lambda I)^{-1} \left( \tfrac{1}{n} X^T Y \right) }[/math].
Note that [math]\displaystyle{ \hat{\beta}_0 = \hat{\beta}_{\lambda=0} = \arg\min_\beta \|Y - X\beta\|^2 }[/math] is the ordinary least squares estimator. Without loss of generality, rotate [math]\displaystyle{ X }[/math] such that: [math]\displaystyle{ \Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p) }[/math], where the [math]\displaystyle{ \lambda_i }[/math]'s are ordered in decreasing order.
To see the nature of this shrinkage, observe that: [math]\displaystyle{ [\hat{\beta}_\lambda]_j = \frac{\lambda_j}{\lambda_j + \lambda} [\hat{\beta}_0]_j }[/math], where [math]\displaystyle{ \hat{\beta}_0 }[/math] is the ordinary least squares estimator.
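To make this shrinkage concrete, here is a small numpy sketch (again not from the paper; the design, noise, and the penalty lam are arbitrary illustration choices) that computes the ridge estimate from the closed form and, equivalently, by shrinking the OLS coordinates in the PCA basis by [math]\displaystyle{ \lambda_j / (\lambda_j + \lambda) }[/math]:

```python
# Ridge as coordinate-wise shrinkage of OLS in the PCA basis (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 200, 5, 0.1
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)
Sigma = X.T @ X / n

# Closed form: beta_lam = (Sigma + lam * I)^{-1} (1/n) X^T Y
beta_lam = np.linalg.solve(Sigma + lam * np.eye(p), X.T @ Y / n)

# Same estimate via shrinkage in the eigenbasis, Sigma = V diag(l_1, ..., l_p) V^T:
evals, V = np.linalg.eigh(Sigma)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)           # OLS estimate
shrunk = (evals / (evals + lam)) * (V.T @ beta_ols)    # [b_lam]_j = l_j/(l_j+lam) [b_0]_j
print(np.allclose(beta_lam, V @ shrunk))               # True
```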
…
2. Ordinary Least Squares with PCA (PCA-OLS)
Now let us construct a simple estimator based on [math]\displaystyle{ \lambda }[/math]. Note that our rotated coordinate system, where [math]\displaystyle{ \Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p) }[/math], corresponds to the PCA coordinate system.
Consider the following ordinary least squares estimator on the “top” PCA subspace, which uses the least squares estimate on coordinate [math]\displaystyle{ j }[/math] if [math]\displaystyle{ \lambda_j \geq \lambda }[/math] and [math]\displaystyle{ 0 }[/math] otherwise:
[math]\displaystyle{ [\hat{\beta}_{\mathrm{PCA},\lambda}]_j = \begin{cases} [\hat{\beta}_0]_j & \text{if } \lambda_j \geq \lambda \\ 0 & \text{otherwise} \end{cases} }[/math]
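In code, the PCA-OLS estimator is a hard threshold in the same eigenbasis rather than a smooth shrinkage. A minimal sketch under the same illustrative assumptions as above (not the authors' code; it assumes [math]\displaystyle{ X^T X }[/math] is invertible):

```python
# PCA-OLS: keep the OLS coordinate where the eigenvalue is at least lam, zero it otherwise.
import numpy as np

def pca_ols(X, Y, lam):
    n, p = X.shape
    Sigma = X.T @ X / n
    evals, V = np.linalg.eigh(Sigma)                  # eigenbasis (PCA coordinate system)
    beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)      # OLS estimate
    coords = V.T @ beta_ols                           # OLS coordinates in the PCA basis
    kept = np.where(evals >= lam, coords, 0.0)        # hard threshold at lam
    return V @ kept                                   # rotate back to original coordinates

# Example use on arbitrary synthetic data; coordinates whose eigenvalue
# falls below lam are simply dropped.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
Y = X @ rng.normal(size=5) + rng.normal(size=200)
print(pca_ols(X, Y, lam=1.0))
```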
…
3. Experiments
First, we generated synthetic data with [math]\displaystyle{ p = 100 }[/math] and varying values of [math]\displaystyle{ n \in \{20, 50, 80, 110\} }[/math]. …
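The excerpt does not spell out the rest of the protocol, so the sketch below only reproduces its flavor: the design distribution, the true [math]\displaystyle{ \beta }[/math], the noise level, and the single value of [math]\displaystyle{ \lambda }[/math] are assumptions, not the paper's settings. It estimates the risk of ridge regression and PCA-OLS by Monte Carlo for each [math]\displaystyle{ n }[/math]:

```python
# Hedged reproduction sketch: p = 100 and n in {20, 50, 80, 110} come from the
# text; everything else (Gaussian design, Gaussian beta, sigma, lam, trials)
# is an assumption made for illustration.
import numpy as np

rng = np.random.default_rng(3)
p, sigma, lam, trials = 100, 1.0, 0.1, 200

def ridge_and_pca_ols(X, Y, lam):
    n, p = X.shape
    Sigma = X.T @ X / n
    evals, V = np.linalg.eigh(Sigma)
    # Ridge: (Sigma + lam * I)^{-1} (1/n) X^T Y
    ridge = np.linalg.solve(Sigma + lam * np.eye(p), X.T @ Y / n)
    # OLS coordinates in the PCA basis: [b_0]_j = (XV)_j^T Y / (n l_j) for l_j > 0,
    # which also works when n < p (rank-deficient Sigma).
    proj = (X @ V).T @ Y
    ols_coords = np.divide(proj, n * evals, out=np.zeros(p), where=evals > 1e-12)
    pca_ols = V @ np.where(evals >= lam, ols_coords, 0.0)
    return ridge, pca_ols

for n in (20, 50, 80, 110):
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    Sigma = X.T @ X / n
    rr_risk, pca_risk = [], []
    for _ in range(trials):
        Y = X @ beta + sigma * rng.normal(size=n)
        rr, po = ridge_and_pca_ols(X, Y, lam)
        rr_risk.append((rr - beta) @ Sigma @ (rr - beta))
        pca_risk.append((po - beta) @ Sigma @ (po - beta))
    print(f"n={n}: RR risk ~ {np.mean(rr_risk):.3f}, PCA-OLS risk ~ {np.mean(pca_risk):.3f}")
```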
…
4. Conclusion
We showed that the risk inflation of a particular ordinary least squares estimator (on the “top” PCA subspace) is within a factor of 4 of the risk of the ridge estimator. It turns out the converse is not true: this PCA estimator may be arbitrarily better than the ridge one.
References
- 1. D. P. Foster and E. I. George. The Risk Inflation Criterion for Multiple Regression. The Annals of Statistics, Pages 1947-1975, 1994.
- 2. A. N. Tikhonov. Solution of Incorrectly Formulated Problems and the Regularization Method. Soviet Math Dokl 4, Pages 501-504, 1963.
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year
---|---|---|---|---|---|---|---|---|---
Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar | 14(1) | 2013 | A Risk Comparison of Ordinary Least Squares Vs Ridge Regression | | The Journal of Machine Learning Research | | | | 2013