Regression, Ridge Regression, Lasso

Fabio G. Cozman - [email protected]

October 4, 2020

A general definition

Regression studies the relationship between a response variable $Y$ and covariates $X_1, \dots, X_n$.

A covariate is also called a feature or a predictor.

The term “regression” is usually employed when $Y$ is a continuous variable.

That is, variables are quantitative rather than qualitative.

Basic model and basic questions

Model:

$$Y = f(X) + \varepsilon,$$

where $X$ are the covariates and $\varepsilon$ is some “noise”.

Questions:

Is there a relationship? A linear relationship? How strong?

Are all covariates useful?

How can we predict $Y$?

Usually referred to as statistical inference.

Linear regression

Linear regression adopts a linear relationship:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n.$$

Examples

[Figures: Sales plotted against TV; $Y$ plotted against $X_1$ and $X_2$.]

Textbook, Figures 3.1 and 3.4 (Pages 62 and 73).

The least-squares solution

Suppose we have observations

$$x_{1,j}, x_{2,j}, \dots, x_{n,j}, y_j$$

for $j \in \{1, \dots, N\}$.

Consider the residual sum of squares

$$\mathrm{RSS} = \sum_{j=1}^{N} (y_j - \beta_0 - \beta_1 x_{1,j} - \cdots - \beta_n x_{n,j})^2.$$

We might choose the $\beta_i$ to minimize RSS.

The solution for a single covariate

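Suppose we have

$$Y = \beta_0 + \beta_1 X_1.$$

To minimize RSS, we must set:

$$\hat{\beta}_0 = \frac{1}{N}\sum_j y_j - \hat{\beta}_1 \frac{1}{N}\sum_j x_{1,j},$$

$$\hat{\beta}_1 = \frac{\sum_j \left(x_{1,j} - \frac{1}{N}\sum_k x_{1,k}\right)\left(y_j - \frac{1}{N}\sum_k y_k\right)}{\sum_j \left(x_{1,j} - \frac{1}{N}\sum_k x_{1,k}\right)^2}.$$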

Often used notation

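Adopt $\bar{x}_1 = \frac{1}{N}\sum_{j=1}^{N} x_{1,j}$ and $\bar{y} = \frac{1}{N}\sum_{j=1}^{N} y_j$.

Then:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}_1,$$

$$\hat{\beta}_1 = \frac{\sum_j (x_{1,j} - \bar{x}_1)(y_j - \bar{y})}{\sum_j (x_{1,j} - \bar{x}_1)^2}.$$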
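
The closed-form expressions above can be checked numerically. The following is a minimal sketch in plain Python; the function name `fit_simple_linear` and the data are hypothetical and chosen only for illustration.

```python
def fit_simple_linear(xs, ys):
    """Return (beta0, beta1) minimizing the RSS for Y = beta0 + beta1 * X1."""
    n_obs = len(xs)
    x_bar = sum(xs) / n_obs
    y_bar = sum(ys) / n_obs
    # Numerator and denominator of the slope formula
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Hypothetical data, roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
beta0, beta1 = fit_simple_linear(xs, ys)
print(beta0, beta1)
```

The slope is undefined when all values of the covariate are equal, since the denominator is then zero.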

Deriving the expressions

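Differentiate $\mathrm{RSS} = \sum_{j=1}^{N} (y_j - \beta_0 - \beta_1 x_{1,j})^2$:

$$\frac{\partial \mathrm{RSS}}{\partial \beta_0} = -2 \sum_j (y_j - \beta_0 - \beta_1 x_{1,j}),$$

$$\frac{\partial \mathrm{RSS}}{\partial \beta_1} = -2 \sum_j x_{1,j}(y_j - \beta_0 - \beta_1 x_{1,j}).$$

Solve (set both derivatives to zero and divide by $-2N$):

$$\bar{y} - \beta_0 - \beta_1 \bar{x}_1 = 0, \qquad \overline{x_1 y} - \beta_0 \bar{x}_1 - \beta_1 \overline{x_1^2} = 0,$$

where $\overline{x_1 y} = \frac{1}{N}\sum_j x_{1,j} y_j$ and $\overline{x_1^2} = \frac{1}{N}\sum_j x_{1,j}^2$.

Get:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}_1, \qquad \hat{\beta}_1 = \frac{\overline{x_1 y} - \bar{x}_1 \bar{y}}{\overline{x_1^2} - \bar{x}_1^2}.$$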
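
The step from the two equations to the final expressions can be spelled out as follows. The first equation gives $\beta_0 = \bar{y} - \beta_1 \bar{x}_1$; substituting it into the second equation leaves one equation in $\beta_1$:

$$\begin{aligned}
\overline{x_1 y} - (\bar{y} - \beta_1 \bar{x}_1)\,\bar{x}_1 - \beta_1 \overline{x_1^2} &= 0, \\
\overline{x_1 y} - \bar{x}_1 \bar{y} &= \beta_1 \left(\overline{x_1^2} - \bar{x}_1^2\right).
\end{aligned}$$

This agrees with the earlier form of the slope because $\frac{1}{N}\sum_j (x_{1,j} - \bar{x}_1)(y_j - \bar{y}) = \overline{x_1 y} - \bar{x}_1 \bar{y}$ and $\frac{1}{N}\sum_j (x_{1,j} - \bar{x}_1)^2 = \overline{x_1^2} - \bar{x}_1^2$.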

A probabilistic version

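Suppose we assume

$$Y = \beta_0 + \beta_1 X_1 + Z,$$

where $Z \sim N(0, \sigma^2)$ is a “probabilistic disturbance”.

Then the previous $\hat{\beta}_0$ and $\hat{\beta}_1$ are the maximum likelihood estimates of $\beta_0$ and $\beta_1$, and moreover

$$\hat{\sigma}^2 = \frac{1}{N} \sum_j (y_j - \hat{y}_j)^2$$

maximizes likelihood for

$$\hat{y}_j = \hat{\beta}_0 + \hat{\beta}_1 x_{1,j}.$$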

Maximum likelihood versus unbiasedness

Funny: the maximum likelihood estimator for $\sigma^2$ is not unbiased; often the following unbiased estimator is used:

$$\hat{\sigma}^2 = \frac{1}{N - 2} \sum_j (y_j - \hat{y}_j)^2.$$

(The square root of this estimator is sometimes called the residual standard error, RSE.)
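
To see the two estimators side by side, here is a small sketch in plain Python. The function name `variance_estimates` and the data are hypothetical; the two values differ only in dividing the RSS by $N$ or by $N - 2$.

```python
def variance_estimates(xs, ys):
    """Return (maximum likelihood, unbiased) estimates of the noise variance."""
    n_obs = len(xs)
    x_bar = sum(xs) / n_obs
    y_bar = sum(ys) / n_obs
    beta1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    beta0 = y_bar - beta1 * x_bar
    # Residual sum of squares at the least-squares fit
    rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    return rss / n_obs, rss / (n_obs - 2)


# Hypothetical data, roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
sigma2_mle, sigma2_unbiased = variance_estimates(xs, ys)
print(sigma2_mle, sigma2_unbiased)
```

The unbiased estimate is always the larger of the two, and the gap shrinks as $N$ grows.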

Some properties

Maximum likelihood estimators of $\beta_0$ and $\beta_1$ are consistent and asymptotically normal.

There are closed-form expressions for the variance of these estimators and for confidence intervals (details, not covered in this course, in the Textbook, page 66).

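Checking whether there is a relationship

There is a standard (asymptotic) test for $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$.

Reject $H_0$ when

$$\left| \frac{\hat{\beta}_1}{\sqrt{\hat{\sigma}^2 / \sum_j (x_{1,j} - \bar{x}_1)^2}} \right| > z_{\alpha/2},$$

where $\hat{\sigma}^2$ is (usually) the unbiased estimate of $\sigma^2$ and $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard Gaussian.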
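
The test can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `slope_test` and the data are hypothetical, the unbiased variance estimate is used, and the critical value comes from `statistics.NormalDist` in the standard library.

```python
from statistics import NormalDist


def slope_test(xs, ys, alpha=0.05):
    """Return (statistic, reject) for H0: beta1 = 0 in Y = beta0 + beta1 * X1."""
    n_obs = len(xs)
    x_bar = sum(xs) / n_obs
    y_bar = sum(ys) / n_obs
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma2_hat = rss / (n_obs - 2)  # unbiased estimate of the noise variance
    statistic = beta1 / (sigma2_hat / s_xx) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile
    return statistic, abs(statistic) > z_crit


# Hypothetical data with a clear linear trend
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
statistic, reject = slope_test(xs, ys)
print(statistic, reject)
```

The sketch assumes the fit is not perfect; with a zero RSS the statistic is undefined.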

And the p-value

Test is of the form: $T > c$ for some statistic $T$ that depends on the data and some threshold $c$.

So, p-value: probability, when $H_0$ is true (that is, $\beta_1 = 0$), that $T$ is at least as large as the observed $T$.

Small p-value: evidence against $H_0$.
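
For the asymptotic test on a single slope, where $T$ is the absolute value of a statistic that is approximately standard Gaussian under $H_0$, the p-value can be computed as in the sketch below. The function name `two_sided_p_value` is hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(statistic):
    """Probability that |Z| is at least |statistic| for a standard Gaussian Z."""
    return 2 * (1 - NormalDist().cdf(abs(statistic)))


# A statistic near the usual 5% threshold gives a p-value near 0.05
print(two_sided_p_value(1.96))
```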

Usual measure of fit

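The $R^2$ statistic

$$R^2 = 1 - \frac{\mathrm{RSS}}{\sum_{j=1}^{N} (y_j - \bar{y})^2}$$

(the proportion of variability of $Y$ that can be explained by covariates, between 0 and 1; higher is better).

For one covariate, $R^2$ is equal to the square of the correlation between $X_1$ and $Y$:

$$\left( \frac{\sum_j (x_{1,j} - \bar{x}_1)(y_j - \bar{y})}{\sqrt{\sum_j (x_{1,j} - \bar{x}_1)^2}\,\sqrt{\sum_j (y_j - \bar{y})^2}} \right)^2.$$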
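
The equality between $R^2$ and the squared correlation for one covariate can be checked numerically. The sketch below uses hypothetical data and a hypothetical function name; both quantities are computed from their own definitions.

```python
def r_squared_and_squared_correlation(xs, ys):
    """Return (R^2 from the RSS, squared sample correlation) for one covariate."""
    n_obs = len(xs)
    x_bar = sum(xs) / n_obs
    y_bar = sum(ys) / n_obs
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)  # total sum of squares
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    r2 = 1 - rss / s_yy
    corr2 = s_xy ** 2 / (s_xx * s_yy)
    return r2, corr2


# Hypothetical data, roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r2, corr2 = r_squared_and_squared_correlation(xs, ys)
print(r2, corr2)
```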

Many covariates

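Now suppose

$$Y = \beta_1 X_1 + \cdots + \beta_n X_n + Z,$$

where $E[Z] = 0$ (we can take $X_1$ identically equal to one, so that $\beta_1$ plays the role of the intercept).

The RSS is then

$$(C - AB)^T (C - AB),$$

where

$$A = \begin{pmatrix} x_{1,1} & x_{2,1} & \cdots & x_{n,1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1,N} & x_{2,N} & \cdots & x_{n,N} \end{pmatrix}, \quad B = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_n \end{pmatrix}, \quad C = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}.$$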

Regression for many covariates

For

$$Y = \beta_1 X_1 + \cdots + \beta_n X_n + Z,$$

the least-squares estimate is

$$\hat{B} = (A^T A)^{-1} A^T C.$$
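
A minimal sketch of this computation in plain Python follows. It solves the normal equations $(A^T A) B = A^T C$ by Gaussian elimination rather than forming the inverse explicitly, which is a choice made here and not something stated in the slides. The helper names and the data are hypothetical; the first column of `A` is constant, so the first coefficient acts as the intercept.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    # Work on an augmented copy so the inputs are left untouched
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * size
    for row in reversed(range(size)):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, size))
        solution[row] = (aug[row][size] - tail) / aug[row][row]
    return solution


def least_squares(A, C):
    """Return B solving the normal equations (A^T A) B = A^T C."""
    n_cols = len(A[0])
    gram = [[sum(row[i] * row[k] for row in A) for k in range(n_cols)]
            for i in range(n_cols)]
    moment = [sum(row[i] * c for row, c in zip(A, C)) for i in range(n_cols)]
    return solve_linear_system(gram, moment)


# Hypothetical noise-free data generated from y = 1 + 2*x2 - 1*x3
A = [[1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0],
     [1.0, 2.0, 2.0],
     [1.0, 3.0, 1.0]]
C = [0.0, 3.0, 3.0, 6.0]
B = least_squares(A, C)
print(B)
```

The sketch assumes that $A^T A$ is invertible, which fails when one covariate is a linear function of the others.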

Some properties

This estimator is consistent.

If $Z$ has variance $\sigma^2$, then $\hat{B}$ has variance $\sigma^2 (A^T A)^{-1}$.

The estimator $\hat{B}$ is approximately Gaussian, for large $N$.

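Checking whether there is a relationship

There is a standard (asymptotic) test for $H_0: \beta_1 = \cdots = \beta_n = 0$ versus the alternative $H_1$ that at least one $\beta_i$ is nonzero.

Test is “Reject $H_0$ when the $F$-statistic is large”.

The $F$-statistic (for a model with $n$ covariates plus an intercept):

$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/n}{\mathrm{RSS}/(N - n - 1)},$$

where TSS is the total sum of squares:

$$\mathrm{TSS} = \sum_{j=1}^{N} (y_j - \bar{y})^2.$$

The $F$-statistic has an $F$-distribution...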
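
As an illustration, the sketch below computes the $F$-statistic from observed and fitted values, with the residual sum of squares in the denominator as in the Textbook's definition. The function name `f_statistic` and the data are hypothetical. For a single covariate the result coincides with the square of the statistic used to test $\beta_1 = 0$.

```python
def f_statistic(ys, fitted, n_covariates):
    """F-statistic for H0: all slopes are zero, given fitted values of the full model."""
    n_obs = len(ys)
    y_bar = sum(ys) / n_obs
    tss = sum((y - y_bar) ** 2 for y in ys)
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return ((tss - rss) / n_covariates) / (rss / (n_obs - n_covariates - 1))


# Hypothetical data and a single-covariate least-squares fit
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
s_xx = sum((x - x_bar) ** 2 for x in xs)
beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
beta0 = y_bar - beta1 * x_bar
fitted = [beta0 + beta1 * x for x in xs]

f_value = f_statistic(ys, fitted, n_covariates=1)
print(f_value)
```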

Checking subset of parameters

There are also tests to verify whether subsets of parameters are equal to zero.

Take $H_0: \beta_{n-m+1} = \beta_{n-m+2} = \cdots = \beta_n = 0$.

Reject $H_0$ when the following $F$-statistic is “large”:

$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/m}{\mathrm{RSS}/(N - n - 1)},$$

where $\mathrm{RSS}_0$ is the residual sum of squares for the model containing only $\beta_1, \dots, \beta_{n-m}$.

Possible to check each parameter at a time ($m = 1$).

Usual measure of fit

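The $R^2$ statistic

$$R^2 = 1 - \frac{\mathrm{RSS}}{\sum_{j=1}^{N} (y_j - \bar{y})^2}$$

(the proportion of variability of $Y$ that can be explained by covariates, between 0 and 1; higher is better).

But: the more covariates, the higher the $R^2$ statistic (adding a covariate never decreases $R^2$ on the training data, even when the covariate is useless).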

Feature selection

Usually we must discard some covariates.

Too many covariates lead to small bias but large variance, while too few covariates lead to small variance but large bias (the bias-variance trade-off).

In particular, a covariate should not be a linear function of other covariates...

We must “score” each set of covariates, to choose the best one. A possible score is empirical error (by cross-validation).

The Akaike Information Criterion

One popular score is the AIC:

$$L_S - |S|,$$

where $S$ is the set of covariates in the scored model, and $L_S$ is the log-likelihood of the model with covariates in $S$, evaluated at the maximum likelihood estimates.

That is, “goodness of fit - model complexity”.

Another score

The BIC (Bayesian Information Criterion):

$$L_S - (|S|/2) \log N.$$

For large $N$, the posterior probability of the scored model is approximately proportional to $e^{\mathrm{BIC}}$, when all possible models get identical prior probability.
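
Both scores can be sketched in Python for a Gaussian linear model, where the maximized log-likelihood depends on the data only through the RSS. The function names and the two RSS values below are hypothetical. The sketch counts only the covariates in $|S|$ and treats the intercept and the variance as present in every model; other conventions shift all scores by the same constant.

```python
import math


def gaussian_log_likelihood(rss, n_obs):
    """Maximized Gaussian log-likelihood, with the variance set to its MLE rss / n_obs."""
    sigma2_mle = rss / n_obs
    return -0.5 * n_obs * (math.log(2 * math.pi * sigma2_mle) + 1)


def aic(rss, n_obs, n_covariates):
    """AIC in the 'higher is better' form: log-likelihood minus |S|."""
    return gaussian_log_likelihood(rss, n_obs) - n_covariates


def bic(rss, n_obs, n_covariates):
    """BIC in the 'higher is better' form: log-likelihood minus (|S| / 2) log N."""
    return gaussian_log_likelihood(rss, n_obs) - 0.5 * n_covariates * math.log(n_obs)


# Hypothetical comparison with N = 5: no covariates versus one covariate
print(aic(39.708, 5, 0), aic(0.107, 5, 1))
print(bic(39.708, 5, 0), bic(0.107, 5, 1))
```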

Yet another score

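Mallows' $C_p$:

$$2|S|\hat{\sigma}^2 + \sum_{j=1}^{N} (\hat{y}^S_j - y_j)^2,$$

where $\hat{\sigma}^2$ is the estimate of variance with all covariates, while the $\hat{y}^S_j$ are produced with covariates in $S$.

This score estimates the prediction error that is expected from a model with covariates in $S$: the sum is the training error, and the first term corrects for its optimism.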

Structure search

Forward stepwise regression: start with no covariates, add one that leads to best score, then add one that leads to best score, etc.

Backward stepwise regression: start with all covariates, drop one that leads to best score, etc.

There are many other search schemes.

Additional details, not covered in this course, in the Textbook, Section 6.1.
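
Forward stepwise search can be sketched independently of the score being used. In the sketch below, `score` is any function that maps a list of covariates to a number where higher is better; stopping as soon as no addition improves the score is an assumption made here, since the slides do not state a stopping rule. The covariate names and the toy score are hypothetical.

```python
def forward_stepwise(candidates, score):
    """Greedily add the covariate that most improves the score; stop when none does."""
    selected = []
    best_score = score(selected)
    remaining = list(candidates)
    while remaining:
        # Score every model obtained by adding one remaining covariate
        scored = [(score(selected + [c]), c) for c in remaining]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break
        selected.append(top)
        remaining.remove(top)
        best_score = top_score
    return selected, best_score


# Toy score: a hypothetical gain in fit per covariate, minus a penalty of 1 per covariate
gains = {"tv": 5.0, "radio": 3.0, "newspaper": 0.2}


def toy_score(covariates):
    return sum(gains[c] for c in covariates) - 1.0 * len(covariates)


selected, best = forward_stepwise(gains, toy_score)
print(selected, best)
```

In practice `score` would fit a regression on the chosen covariates and return, for instance, the AIC or a cross-validated error with its sign flipped.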

Qualitative features in linear regression

Use some encoding.

If values are ordered: turn into numbers.

If not appropriate to do so: one-hot encoding.
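
A minimal sketch of one-hot encoding in plain Python follows; the function name `one_hot` and the example values are hypothetical. Each distinct value gets its own 0/1 column.

```python
def one_hot(values):
    """Return (sorted categories, one 0/1 row per value)."""
    categories = sorted(set(values))
    index = {category: k for k, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


categories, rows = one_hot(["red", "green", "red", "blue"])
print(categories)
print(rows)
```

When the model also has an intercept, the 0/1 columns sum to the constant column, so one of them is usually dropped to avoid a covariate that is a linear function of the others.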

Other challenges in linear regression

Relationship is not linear: detect with tests and residual plots.

Disturbances are correlated or vary with features.

Outliers and high leverage points (must be tested for, discarded).

Collinearity (leads to numerical problems and higher variance).

These issues, not covered in this course, are discussed in the Textbook, Section 3.3.3.

Introducing non-linear features

Suppose we have $X_1$ and $X_2$. We can assume:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{12} X_1 X_2,$$

or any other set of functions of $X_1$ and $X_2$.

If functions are polynomials of covariates, we have polynomial regression.

If functions are all produced by a set of transformations, they are referred to as basis functions.

Other functions lead to Generalized Additive Models (not discussed in this course; see Textbook Section 7.7).

Nonparametric regression

We might assume that $Y$ depends on splines ($Y$ is a piecewise but smooth function of covariates).

Or we might assume that $Y$ is a function of neighboring points (kNN regression):

1 To obtain $y$ corresponding to given covariates, weigh each point in the neighborhood.

2 Then run weighted regression with those points only.

These topics are not covered in this course; they are discussed in Textbook, Sections 7.5 and 7.6, and Section 3.5.

Shrinkage methods: Regularization

The idea is to penalize the “size” of parameters, to “shrink” them, to reduce variance (but with increase in bias...).

Useful to reduce overfitting, particularly when there are too many covariates.

Two main strategies: ridge regression, and lasso (least absolute shrinkage and selection operator).

Ridge regression

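Suppose we minimize

$$\sum_{j=1}^{N} \left( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^{n} \beta_i^2.$$

The larger the parameter $\lambda$, the smaller (in magnitude) the values of $\beta_i$.

Tuning $\lambda$: usually by cross-validation.

Solution is easy (but biased!):

$$\hat{B} = (A^T A + \lambda I)^{-1} A^T C.$$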
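
With a single covariate and no intercept, the matrix formula reduces to a scalar, $\sum_j x_{1,j} y_j / (\sum_j x_{1,j}^2 + \lambda)$, which makes the shrinkage easy to see. The sketch below uses this special case; the function name and the centered data are hypothetical.

```python
def ridge_one_covariate(xs, ys, lam):
    """Ridge estimate for Y = beta * X with penalty lam * beta**2 (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


# Hypothetical centered data, roughly following y = 2x
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 1.8, 4.0]
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, ridge_one_covariate(xs, ys, lam))
```

With $\lambda = 0$ the estimate is the ordinary least-squares value; it moves toward zero as $\lambda$ grows but does not reach zero for any finite $\lambda$.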

Standardization

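In ridge regression, the measurement scale for covariates is important...

In linear regression, multiplying a covariate by a constant $c$ simply divides its estimate by $c$.

Usual assumption: data are standardized (mean 0, variance 1). The scaling step is

$$\tilde{x}_{i,j} = \frac{x_{i,j}}{\sqrt{(1/N)\sum_{k=1}^{N} (x_{i,k} - \bar{x}_i)^2}},$$

and subtracting $\bar{x}_i$ from each $x_{i,j}$ beforehand gives mean 0.

Usually $Y$ is centered (mean is subtracted).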
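
Standardizing one covariate can be sketched as follows. The function name `standardize` and the values are hypothetical; the sketch both centers and scales, using the $1/N$ form of the variance, so that the result has mean 0 and variance 1.

```python
def standardize(values):
    """Center to mean 0 and scale to variance 1 (variance computed with 1/N)."""
    n_obs = len(values)
    mean = sum(values) / n_obs
    std = (sum((v - mean) ** 2 for v in values) / n_obs) ** 0.5
    return [(v - mean) / std for v in values]


z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)
```

A constant covariate has zero standard deviation and cannot be standardized this way.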

The bias-variance trade-off

Ridge regression increases bias, decreases variance (compared to linear regression).

So it is useful when variance is large.

For instance, when the number of covariates is very large.

Bias-variance

[Figure: two panels of Mean Squared Error, plotted against $\lambda$ and against $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$.]

Textbook, Figure 6.5.

The Bayesian interpretation

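Suppose we have Gaussian likelihood and prior proportional to $e^{-\lambda \sum_i \beta_i^2}$.

Then maximizing the posterior (the optimal decision for 0-1 loss) leads to minimization of

$$\sum_{j=1}^{N} \left( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^{n} \beta_i^2$$

(with $\lambda$ rescaled by a constant that depends on the noise variance).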

Lasso I

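Suppose we minimize

$$\sum_{j=1}^{N} \left( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^{n} |\beta_i|.$$

The larger the parameter $\lambda$, the smaller (in magnitude) the values of $\beta_i$.

Tuning $\lambda$: usually by cross-validation.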

Lasso II

Minimize

$$\sum_{j=1}^{N} \left( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^{n} |\beta_i|.$$

Usual assumption: $X_i$ standardized (mean 0, variance 1), $Y$ centered.

Amazing fact: if $\lambda$ is “large enough”, many $\beta_i$s are set to zero: so covariate selection is done automatically!

The result is a sparse solution.
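
The way coefficients become exactly zero can be seen in the special case of a single covariate and no intercept, where the minimizer has a closed form: with $\rho = \sum_j x_{1,j} y_j$, the estimate is $\mathrm{sign}(\rho)\max(|\rho| - \lambda/2, 0) / \sum_j x_{1,j}^2$ (soft thresholding). The sketch below implements this special case only; the function name and the data are hypothetical.

```python
import math


def lasso_one_covariate(xs, ys, lam):
    """Minimizer of sum (y - beta * x)^2 + lam * |beta| over beta (no intercept)."""
    rho = sum(x * y for x, y in zip(xs, ys))
    # Soft thresholding: shrink |rho| by lam / 2, and stop at zero
    shrunk = max(abs(rho) - lam / 2, 0.0)
    return math.copysign(shrunk, rho) / sum(x * x for x in xs)


# Hypothetical centered data, roughly following y = 2x
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 1.8, 4.0]
for lam in [0.0, 19.9, 50.0]:
    print(lam, lasso_one_covariate(xs, ys, lam))
```

Unlike the ridge estimate, which only approaches zero, this estimate is exactly zero once $\lambda \geq 2|\rho|$.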

Working with lasso

No closed-form solution, but convex optimization does it.

Gains in accuracy have been observed, particularly when too many covariates are present.

Too many covariates can overfit any input data...

With too many covariates, many may have the same predictive power, many may be highly correlated and useless.

Thus selection of covariates is a big plus.
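
One common convex optimization scheme for the lasso is coordinate descent, which updates one coefficient at a time by soft thresholding while the others are held fixed. The slides do not name a specific algorithm, so the sketch below is one possible choice and not the course's method; the function names, the fixed number of sweeps, and the data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Move value toward zero by threshold, stopping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(columns, ys, lam, n_sweeps=100):
    """Minimize sum (y - sum_i beta_i x_i)^2 + lam * sum_i |beta_i| by cyclic updates."""
    n_cov = len(columns)
    betas = [0.0] * n_cov
    for _ in range(n_sweeps):
        for i in range(n_cov):
            # Partial residual: what is left of y once the other covariates are removed
            partial = [
                y - sum(betas[k] * columns[k][j] for k in range(n_cov) if k != i)
                for j, y in enumerate(ys)
            ]
            rho = sum(x * r for x, r in zip(columns[i], partial))
            betas[i] = soft_threshold(rho, lam / 2) / sum(x * x for x in columns[i])
    return betas


# Hypothetical data: y = 3 * x1 + 0.1 * x2, with centered, orthogonal covariates
columns = [[1.0, -1.0, 1.0, -1.0],
           [1.0, 1.0, -1.0, -1.0]]
ys = [3.1, -2.9, 2.9, -3.1]
sparse = lasso_coordinate_descent(columns, ys, lam=2.0)
dense = lasso_coordinate_descent(columns, ys, lam=0.0)
print(sparse, dense)
```

With the penalty switched on, the weak second covariate is dropped (its coefficient is exactly zero), while the first coefficient is only shrunk.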

Another perspective, with some intuition

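It is possible to show that both ridge regression and lasso minimize RSS, but

Ridge regression: subject to $\sum_{i=1}^{n} \beta_i^2 \leq s$.

Lasso: subject to $\sum_{i=1}^{n} |\beta_i| \leq s$.

Here $s$ is a bound that corresponds to $\lambda$ (the larger $\lambda$, the smaller $s$).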

Intuition

Textbook, Figure 6.7.

A final note

Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.