
Page 1

A Generalized Approach to Ridge Regression

Kathryn Vasilaky¹

December 7, 2016

¹Postdoc @ Columbia University, Earth Institute; [email protected], www.kvasilaky.com

Page 2

My Background

1. Applied economist with an interest in development economics, policy, and information communication

2. RCTs, lab experiments in the field, natural experiments, as well as broader methods in data science

3. Past work looked at how groups and social networks affect technology adoption and learning, particularly for females

4. Recent work combines the use of crowdsourced and sensor (IoT) data for prediction and feedback to larger information systems

5. As a complement to this applied work, I work on learning and improving applied statistical methodologies

Page 3

Overview of talk

1. Part I
   1.1 Review Standard Ridge
   1.2 Derivation of Generalized Ridge Results

2. Part II
   2.1 Test Generalized Ridge under a simulated case
   2.2 Test Generalized Ridge with real data and cross validation
   2.3 Reconstruct an image from projections

Page 4

Motivation of Inverse Problems

I Inverse problems: compute information about some “interior” properties using “exterior” measurements.

I Inference: Covariates → Coefficients → Outcome

I Tomography: X-ray source → Object → X-ray dampening

I ML: Features → Effect Size → Classifier/Prediction

Page 5

Why is Regularization Used?

I OLS is BLUE when the covariate matrix (A) is full rank

I But when A is ill-conditioned (covariates correlated), estimators will be sensitive to noise

I Regularization methods are used to dampen the effects of this sensitivity to noise

I I present a generalization of the frequently used Ridge Regression
  I Performs as well as or better than Standard Ridge
  I Allows for a more flexible weighting of singular values than Standard Ridge
  I Useful for data where covariates are correlated: large consumer data sets, health data

Page 6

Reviewing the Basics of Gauss Markov

Recall the Gauss Markov estimator for the least squares problem:

y = Ax + e

min_x ‖Ax − y‖²₂,  where A is n×p with rank p,

x̂ = (A′A)⁻¹A′y

Var(x̂) = σ²(A′A)⁻¹

and MSE(x̂) = σ² Tr[(A′A)⁻¹] = σ² Σᵢ 1/σᵢ²,

where σᵢ² is the ith eigenvalue of A′A.

This will serve as a comparison later on.
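The quantities above can be sketched in NumPy; this is a minimal illustration in which the design matrix, noise level, and true x are made up for the example:

```python
import numpy as np

# Illustrative design: 50 observations, 3 covariates (not data from the talk).
rng = np.random.default_rng(0)
n, p = 50, 3
A = rng.normal(size=(n, p))
x_true = np.ones(p)
y = A @ x_true + 0.1 * rng.normal(size=n)

# Gauss-Markov / OLS estimate: x_hat = (A'A)^{-1} A'y
x_hat = np.linalg.solve(A.T @ A, A.T @ y)

# Tr[(A'A)^{-1}] = sum_i 1/sigma_i^2, the factor appearing in MSE(x_hat)
sigma = np.linalg.svd(A, compute_uv=False)   # singular values of A
trace_inv = np.trace(np.linalg.inv(A.T @ A))
print(x_hat, trace_inv, np.sum(1.0 / sigma**2))
```

With a well-conditioned A, x̂ lands close to the true x; the next slides show what changes when A is ill-conditioned.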

Page 7

When is Gauss Markov MSE Large?

I When A′A is ill-conditioned (nearly rank deficient), the solution to OLS is sensitive to noise (y = y + ε)

I This can occur when covariates are highly correlated, the number of covariates exceeds the number of observations, or A is sparse

I OLS is still BLUE

I But the standard errors of x̂, and the MSE, will be large (the σᵢ's are small)

Var(x̂) = σ²(A′A)⁻¹

MSE(x̂) = σ² Tr[(A′A)⁻¹] = σ² Σᵢ 1/σᵢ²

I And x̂ can deviate very far from the true solution x

Page 8

Example: Noise is magnified if the covariate matrix is ill-conditioned

A = UΣV′ = [1 0; 0 1] [1 0; 0 10⁻⁶] [1 0; 0 1]    (1)

(A′A)⁻¹A′y = [1 0; 0 10⁶] [1; 1] = [1; 10⁶]    Noiseless solution (2)

(A′A)⁻¹A′(y + e) = [1 0; 0 10⁶] [1; 1.1] = [1; 10⁶ + 10⁵]    Naive solution (3)
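Equations (1)–(3) can be checked numerically; a small sketch:

```python
import numpy as np

# The 2x2 example above: U = V = I, so A = Sigma = diag(1, 1e-6).
A = np.diag([1.0, 1e-6])
y = np.array([1.0, 1.0])
e = np.array([0.0, 0.1])                 # small noise in the second entry

pinv = np.linalg.inv(A.T @ A) @ A.T      # (A'A)^{-1} A' = diag(1, 1e6)
x_clean = pinv @ y                        # noiseless solution: [1, 1e6]
x_noisy = pinv @ (y + e)                  # naive solution: [1, 1e6 + 1e5]
print(x_clean, x_noisy)
```

A perturbation of 0.1 in y moves the solution by 10⁵: the inverse of the small singular value magnifies the noise.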

Page 9

Regularization formulates a nearby problem, which has a more stable solution:

min_x (‖Ax − y‖²₂ + λ‖x‖²₂),  λ > 0

where we introduce the term λ‖x‖²₂, perturbing the least-squares formulation.

I The estimated x̂ is no longer unbiased; however, the MSE = Variance + Bias² may be smaller than that of BLUE.

The regularization can be L1 (Lasso) or L2. The objective is to choose a λ that brings x̂ close to the noiseless solution.

Page 10

The regularized least squares problem becomes:

min_x ‖Ax − y‖²₂ + λ‖x − x₀‖²₂

x̂ = (A′A + λIₙ)⁻¹A′y   (taking x₀ = 0)

MSE = σ² Σᵢ₌₁ⁿ σᵢ²/(σᵢ² + λ)² + λ² Σᵢ₌₁ⁿ αᵢ²/(σᵢ² + λ)²

(Hoerl and Kennard, 1970)²

²Note: where α = Vx, A′A = VΣ²V′
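A sketch of the closed-form ridge estimator above; the data are synthetic and λ is illustrative:

```python
import numpy as np

# Closed-form ridge: x_hat = (A'A + lam*I)^{-1} A'y
def ridge(A, y, lam):
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ y)

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 4))
A[:, 3] = A[:, 2] + 1e-4 * rng.normal(size=30)   # two nearly collinear columns
y = A @ np.ones(4) + 0.1 * rng.normal(size=30)

x_ols = np.linalg.solve(A.T @ A, A.T @ y)
x_ridge = ridge(A, y, lam=0.1)
# The penalty shrinks the estimate toward zero, stabilizing it.
print(np.linalg.norm(x_ols), np.linalg.norm(x_ridge))
```

The shrinkage is visible in the norms: the ridge solution is always shorter than the OLS solution for λ > 0.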

Page 11

Other regularization methods include:

I Lasso, with an L1 penalty weighted by λ

I Truncated SVD

I Elastic Net, a weighted sum of L1 and L2 penalties

I ‘Generalized’ Tikhonov

Page 12

Introduction to Generalized Iterative Ridge

I Rather than computing one solution, I will compute a sequence of solutions which approximates the noiseless solution (A′A)⁻¹A′y on its way to the naive solution (A′A)⁻¹A′(y + e).

I Since the operator λ(A′A + λI)⁻¹ = V Diag[λ/(σᵢ² + λ)] V′ has eigenvalues that are less than one, it is contracting.

I I show that repeated application of this operator converges to the naive solution.

The normal equations with the added constraint become:

(A′A + λI)x₁ = A′y + λx₀

Page 13

Introduction to Generalized Iterative Ridge

(A′A + λI)x₁ = A′y + λx₀

The solution for x₁ is:

x¹_λ = (A′A + λI)⁻¹A′y + λ(A′A + λI)⁻¹x₀   (Standard Ridge when x₀ = 0)

Then substituting x¹_λ for x₀ we obtain x²_λ, and repeating the substitution up to step k we obtain:

x^k_λ = Σᵢ₌₁ᵏ λ^(i−1) (A′A + λI)^(−i) A′y + λ^k (A′A + λI)^(−k) x₀

where Σᵢ₌₁ᵏ λ^(i−1) (A′A + λI)^(−i) A′y is a geometric sum in the contracting operator λ(A′A + λI)⁻¹.
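The recursion above can be sketched directly; this is my reading of the update, not code from the talk, and λ, k, and the matrix are illustrative:

```python
import numpy as np

# Iterate (A'A + lam*I) x_j = A'y + lam*x_{j-1}, starting from x_0 = 0.
def iterative_ridge(A, y, lam, k):
    p = A.shape[1]
    M = A.T @ A + lam * np.eye(p)
    x = np.zeros(p)
    for _ in range(k):
        x = np.linalg.solve(M, A.T @ y + lam * x)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
y = A @ np.ones(3)

x1 = iterative_ridge(A, y, lam=0.1, k=1)       # k = 1 is Standard Ridge
x500 = iterative_ridge(A, y, lam=0.1, k=500)   # large k approaches the naive/OLS solution
x_ols = np.linalg.solve(A.T @ A, A.T @ y)
print(np.linalg.norm(x1 - x_ols), np.linalg.norm(x500 - x_ols))
```

Each step reuses the same factorizable matrix M, so the iteration costs one extra solve per step.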

Page 14

Generalized Iterative Solution, x^k_λ

Now we sum the geometric series using the singular value decomposition (SVD) of A: A = UΣV′, where Σ = Diag[σ₁, ..., σₙ, 0, ..., 0] ∈ R^(m×n) with σ₁ ≥ σ₂ ≥ ... ≥ σₙ > 0, and where n is the rank of A. After k iterations the solution is:

x^k_λ = V Diag[ (1/σ₁)(1 − (λ/(λ+σ₁²))^k), ..., (1/σₙ)(1 − (λ/(λ+σₙ²))^k), 0, ..., 0 ] U′y

For practical ease we can set x₀ = 0. This does not affect the solution, as the last term, attached to x₀, is pulled quickly towards 0.
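One can check numerically that the closed form above matches the literal iteration; a small sketch with a random matrix and illustrative λ and k:

```python
import numpy as np

def gir_svd(A, y, lam, k):
    """Closed form: V Diag[(1/s_i)(1 - (lam/(lam+s_i^2))^k)] U'y."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    filt = (1.0 / s) * (1.0 - (lam / (lam + s**2)) ** k)
    return Vt.T @ (filt * (U.T @ y))

def gir_iter(A, y, lam, k):
    """Literal iteration of (A'A + lam*I) x_j = A'y + lam*x_{j-1}, x_0 = 0."""
    p = A.shape[1]
    M = A.T @ A + lam * np.eye(p)
    x = np.zeros(p)
    for _ in range(k):
        x = np.linalg.solve(M, A.T @ y + lam * x)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 4))
y = rng.normal(size=10)
print(np.linalg.norm(gir_svd(A, y, 0.5, 7) - gir_iter(A, y, 0.5, 7)))  # ~0
```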

Page 15

The limits of Generalized Iterative Ridge (GIR)

The GIR solution tends towards the naive solution as k → ∞:

(A′A)⁻¹A′y = VΣ⁻¹U′y = V Diag[1/σ₁, ..., 1/σₙ, 0, ..., 0] U′y

Page 16

The limits of Generalized Iterative Ridge (GIR)

When k = 1, GIR collapses to Standard Ridge:

V Diag[ σ₁/(λ+σ₁²), ..., σₙ/(λ+σₙ²), 0, ..., 0 ] U′y

The diagonal entries are the filters for the singular values, and it is clear that the iterative solution's filter is a generalization of the two extremes.

Page 17

Summary, Generalized Iterative Ridge

I Falls between Standard Ridge and OLS

I The filter is more general than Standard Ridge

Page 18

Choosing k and λ using the Residual Error

The residual error, ‖Ax^k − y‖, has a simple expression as a function of k and λ; we choose its lowest point or the point of maximum curvature:

RE(λ, k) = ‖ Diag[ (λ/(λ+σ₁²))^k, ..., (λ/(λ+σₙ²))^k, 0, ..., 0 ] U′y ‖

Page 19

Choosing k and λ using the Residual Error

We can see that the residual error, ‖Ax^k − y‖, is convex and decreases as k increases, for a given λ.

[Figure: residual error as a function of k, given λ. Choose k at the elbow.]
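A sketch of RE(λ, k) and of picking the elbow by maximum discrete curvature; this is one reasonable reading of the rule, not the talk's exact algorithm:

```python
import numpy as np

def residual_error(A, y, lam, k):
    """RE(lam, k) = || Diag[(lam/(lam+s_i^2))^k] U'y ||."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    filt = (lam / (lam + s**2)) ** k
    return np.linalg.norm(filt * (U.T @ y))

def elbow_k(A, y, lam, kmax=50):
    re = np.array([residual_error(A, y, lam, k) for k in range(1, kmax + 1)])
    curv = np.diff(re, 2)            # second difference of a convex decreasing curve
    return int(np.argmax(curv)) + 2  # +2 offsets the double differencing

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 4))
y = rng.normal(size=10)
print(elbow_k(A, y, lam=1.0))
```

Because the curve is convex and decreasing, the largest second difference marks where the decline flattens out.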

Page 20

Intuition for choosing k at the residual error's elbow

The residual error's convexity can be seen when we take the difference of x^k_λ and the noiseless OLS solution x:

I The first term, the iteration error, declines monotonically as k → ∞

I The second term, the noise term, increases monotonically with k (to the OLS residual)

(S1)  −V Diag[ (1/σ₁)(λ/(λ+σ₁²))^k, ..., (1/σₙ)(λ/(λ+σₙ²))^k, 0, ..., 0 ] U′y

(S2)  V Diag[ (1/σ₁)(1 − (λ/(λ+σ₁²))^k), ..., (1/σₙ)(1 − (λ/(λ+σₙ²))^k), 0, ..., 0 ] U′e

Page 21

Summary, Choosing k and λ

I Grid search over λ

I Choose k at the elbow

Page 22

Comparing the Filters of Standard Ridge and GIR

I A key contribution of the iterative solution is in the filters.

I The filters dampen the effects of small singular values.

BLUE:            1/σᵢ
Standard Ridge:  (1/σᵢ)(1 − λ/(λ+σᵢ²)) = (1/σᵢ) · σᵢ²/(λ+σᵢ²)
Iterative Ridge: (1/σᵢ)(1 − (λ/(λ+σᵢ²))^k)

What is the extra advantage of the k iteration parameter in the filter?
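The three filters above can be compared on a grid of singular values; λ and k here are illustrative, not values from the talk:

```python
import numpy as np

def filters(sigma, lam, k):
    blue = 1.0 / sigma
    ridge = (1.0 / sigma) * (sigma**2 / (lam + sigma**2))
    iterative = (1.0 / sigma) * (1.0 - (lam / (lam + sigma**2)) ** k)
    return blue, ridge, iterative

sigma = np.array([10.0, 1.0, 0.1, 0.01])
blue, ridge, it_k = filters(sigma, lam=0.1, k=20)
# BLUE blows up for small sigma; Ridge damps hardest; Iterative Ridge sits in between.
print(np.column_stack([sigma, blue, ridge, it_k]))
```

Since 1 − q^k ≥ 1 − q for q in (0, 1), the iterative filter always lies between the Standard Ridge filter and the unfiltered 1/σᵢ.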

Page 23

Iterative Ridge introduces additional flexibility in the weighting of each covariate

I With Standard Ridge/Tikhonov, even medium-valued σ's are heavily penalized (for λ = 0.1)

I Notice that Standard Ridge with λ = 0.0001 and Iterative Ridge with λ = 0.01, k = 20 have similar filter shapes; however, a small λ increases the noise.

Page 24

Comparing the MSEs for BLUE, Standard Ridge, and GIR

Last but not least, the generalized MSE for Iterative Ridge exposes the variance and the bias:³

MSE OLS:

σ² Σᵢ 1/σᵢ²

MSE Ridge:

σ² Σᵢ₌₁ⁿ σᵢ²/(σᵢ² + λ)²  +  λ² Σᵢ₌₁ⁿ αᵢ²/(σᵢ² + λ)²

MSE Iterative Ridge:

σ² Σᵢ₌₁ⁿ (1/σᵢ²)(1 − (λ/(λ + σᵢ²))^k)²  +  λ^(2k) Σᵢ₌₁ⁿ αᵢ² (1/(σᵢ² + λ))^(2k)

³Note: where α = Vx, A′A = VΣ²V′
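The three MSE expressions can be compared numerically; a sketch with made-up σᵢ, αᵢ, λ, and k, and noise variance σ² = 1:

```python
import numpy as np

sig = np.array([3.0, 1.0, 0.05])      # singular values, one of them small
alpha = np.ones(3)                    # alpha = Vx components
lam, k = 0.1, 5
q = lam / (lam + sig**2)

mse_ols = np.sum(1.0 / sig**2)
mse_ridge = (np.sum(sig**2 / (sig**2 + lam)**2)
             + lam**2 * np.sum(alpha**2 / (sig**2 + lam)**2))
mse_gir = (np.sum((1.0 / sig**2) * (1 - q**k)**2)
           + lam**(2 * k) * np.sum(alpha**2 * (1.0 / (sig**2 + lam))**(2 * k)))
print(mse_ols, mse_ridge, mse_gir)    # regularization cuts the MSE sharply here
```

The small singular value dominates the OLS MSE through its 1/σᵢ² term; both regularized variants trade a little bias for a large variance reduction.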

Page 25

Part II

Page 26

How do we know we can do better with Generalized Ridge?

I Provide a small simulated example with known x, noisy y, and ill-conditioned A

I Provide a real data example where we test our out-of-sample predictions

I Provide an image example where we recover an image

Page 27

Example 1: Simulated Data

Page 28

Let's take the Golub matrix, an ill-conditioned matrix:

[  1    3   11    0  -11  -15]
[ 18   55  209   15 -198 -277]
[-23  -33  144  532  259   82]
[  9   55  405  437 -100 -285]
[  3   -4 -111 -180   39  219]
[-13   -9  202  346  401  253]

The singular values are:

s = [9.545e+02, 7.240e+02, 1.767e+02, 7.301e+01, 3.536e-01, 3.169e-10]

So the matrix's condition number is 9.545e+02 / 3.169e-10, or about 3 trillion.
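The singular values and condition number above can be verified directly; floating point will reproduce the smallest one only approximately:

```python
import numpy as np

# The Golub matrix from the slide.
A = np.array([
    [  1,   3,   11,    0,  -11,  -15],
    [ 18,  55,  209,   15, -198, -277],
    [-23, -33,  144,  532,  259,   82],
    [  9,  55,  405,  437, -100, -285],
    [  3,  -4, -111, -180,   39,  219],
    [-13,  -9,  202,  346,  401,  253],
], dtype=float)

s = np.linalg.svd(A, compute_uv=False)
cond = s[0] / s[-1]
print(s)        # largest ~9.5e2, smallest ~3e-10 (up to floating-point error)
print(cond)     # ~3e12
```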

Page 29

Framework of the Simulated Problem

Suppose we know the true x = [1, 1, 1, 1, 1, 1].

Then we also know the true y = Ax.

We add a small normally distributed error to y, with s.d. 0.1.

Now we would like to recover the known x in the presence of noise.

Page 30

Recall, the noiseless solution for x = [1, 1, 1, 1, 1, 1] is:

V Diag[1/σ₁, ..., 1/σₙ, 0, ..., 0] U′y

And the naive solution, (A′A)⁻¹A′(y + e), is:

V Diag[1/σ₁, ..., 1/σₙ, 0, ..., 0] U′(y + e)

Page 31

Solutions to the Simulated Problem

x̂ with OLS, Ridge, and Iterative Ridge:

OLS with noise:
[3.08e+08, −1.44e+08, 1.12e+07, 1.36e+06, −9.11e+04, −1.49e+04], ‖x̂ − x‖ = 340,863,251!

Ridge, λ = 0.1:
[0.067, 0.24, 1.25, 0.89, 0.88, 1.054], ‖x̂ − x‖ = 0.92

Iterative Ridge, λ = 0.1, k = 5:
[0.08, 0.29, 1.23, 0.89, 0.89, 1.05], ‖x̂ − x‖ = 0.68

λ was chosen through a grid search, and k via the elbow algorithm.
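A sketch of this experiment; the noise draw here differs from the talk's, so the numbers will not match the slide exactly:

```python
import numpy as np

# The Golub matrix from the earlier slide.
A = np.array([
    [  1,   3,   11,    0,  -11,  -15],
    [ 18,  55,  209,   15, -198, -277],
    [-23, -33,  144,  532,  259,   82],
    [  9,  55,  405,  437, -100, -285],
    [  3,  -4, -111, -180,   39,  219],
    [-13,  -9,  202,  346,  401,  253],
], dtype=float)

def iterative_ridge(A, y, lam, k):
    p = A.shape[1]
    M = A.T @ A + lam * np.eye(p)
    x = np.zeros(p)
    for _ in range(k):
        x = np.linalg.solve(M, A.T @ y + lam * x)
    return x

rng = np.random.default_rng(0)
x_true = np.ones(6)
y = A @ x_true + rng.normal(scale=0.1, size=6)   # noisy y, s.d. 0.1

x_ols = np.linalg.lstsq(A, y, rcond=None)[0]     # naive solution: huge error
x_ridge = iterative_ridge(A, y, lam=0.1, k=1)    # Standard Ridge
x_gir = iterative_ridge(A, y, lam=0.1, k=5)      # Iterative Ridge

for name, x in [("OLS", x_ols), ("Ridge", x_ridge), ("GIR", x_gir)]:
    print(name, np.linalg.norm(x - x_true))
```

The qualitative pattern survives any noise draw: OLS error blows up by roughly 1/σ₆, while both regularized estimates stay within order one of the truth.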

Page 32

Example 2: Real Data

Page 33

I Using the Body Fat data set (Carnegie Mellon Data Statistics Library)

I Predict body fat based on 17 variables

I Covariates include neck, chest, abdomen, hip, thigh, and wrist circumference, and are hence correlated

I The data are ill conditioned, with σ₁₇ = 0.000027

I We split the data into a training set [202 × 17] and a test set [50 × 17]

Page 34

Results

I BLUE, unperturbed: MSPE = 0.7740

I With added N(0,1) error⁴: MSPE = 2.47, a 302% increase⁵

I Ridge, λ = 100, k = 1: MSPE = 2.16, about a 15% improvement over BLUE

I Iterative Ridge, k = 2: MSPE = 1.73, about a 30% improvement over BLUE

I Iterative Ridge, λ = 300, k = 5: MSPE = 1.72

I Iterative Ridge, λ = 500, k = 7: MSPE = 1.71

⁴std(y) = 7.7509, so that e with std = 1 is not excessive
⁵Mean Squared Prediction Error (MSPE) is the norm of the predicted and true y.

Page 35

Example 3: Image Data

Page 36

Example of an image that is 64 × 64 pixels, where A is sparse:

Page 37

[Figures: reconstructions of the image]

BLUE

Ridge, λ = 0.0034⁶

Iterative Ridge, λ = 1, k = 20

⁶Computed by a method known as the L-curve.

Page 38

Summary

I Developed a generalization of ridge regression

I The additional parameter k provides more flexibility in balancingbias and noise

I Iterative Ridge performs better when there are a number of small singular values, A is sparse, and y is noisy

Page 39

Future Work

I Improve on the grid search for λ and k

I Prove conditions under which Generalized Ridge does better for any given λ

I Apply to “big” data problems: consumer data (FB); weather & sensor data

I Overall, research is converging toward using larger data sets for applied problems