A general definition
Regression studies the relationship between a response variable Y and covariates X1, . . . , Xn.
A covariate is also called a feature or a predictor.
The term “regression” is usually employed when Y is a continuous variable.
That is, variables are quantitative rather than qualitative.
Basic model and basic questions
Model:

    Y = f(X) + ε,

where X are the covariates and ε is some “noise”.

Questions:
Is there a relationship? A linear relationship? How strong?
Are all covariates useful?
How can we predict Y?

Answering these questions is usually referred to as statistical inference.
Linear regression
Linear regression adopts a linear relationship:
Y = β0 + β1X1 + β2X2 + · · ·+ βnXn.
Examples
[Figures omitted: Sales plotted against TV advertising with a fitted line, and a fit for two covariates X1, X2 and response Y.]
Textbook, Figures 3.1 and 3.4 (Pages 62 and 73).
The least-squares solution
Suppose we have observations

    x1,j, x2,j, . . . , xn,j, yj

for j ∈ {1, . . . , N}.

Consider the residual sum of squares

    RSS = \sum_{j=1}^{N} (y_j - \beta_0 - \beta_1 x_{1,j} - \cdots - \beta_n x_{n,j})^2.

We might choose the βi to minimize RSS.
The solution for a single covariate
Suppose we have
Y = β0 + β1X1.
To minimize RSS, we must set:
    \beta_0 = \sum_j y_j / N - \beta_1 \sum_j x_{1,j} / N,

    \beta_1 = \frac{\sum_j \left(x_{1,j} - \sum_k x_{1,k}/N\right)\left(y_j - \sum_k y_k/N\right)}{\sum_j \left(x_{1,j} - \sum_k x_{1,k}/N\right)^2}.
Often used notation
Adopt \bar{x}_1 = \frac{1}{N}\sum_{j=1}^{N} x_{1,j} and \bar{y} = \frac{1}{N}\sum_{j=1}^{N} y_j.

Then:

    \beta_0 = \bar{y} - \beta_1 \bar{x}_1, \qquad
    \beta_1 = \frac{\sum_j (x_{1,j} - \bar{x}_1)(y_j - \bar{y})}{\sum_j (x_{1,j} - \bar{x}_1)^2}.
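As a small illustration (not part of the original slides), here is a NumPy sketch of these closed-form expressions; the arrays x1 and y are made up for the example.

    import numpy as np

    # Hypothetical data, invented for illustration.
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    x1_bar, y_bar = x1.mean(), y.mean()

    # Closed-form least-squares estimates from the slide.
    beta1 = np.sum((x1 - x1_bar) * (y - y_bar)) / np.sum((x1 - x1_bar) ** 2)
    beta0 = y_bar - beta1 * x1_bar
    print(beta0, beta1)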
Deriving the expressions
Differentiate RSS = \sum_{j=1}^{N} (y_j - \beta_0 - \beta_1 x_{1,j})^2:

    \frac{\partial RSS}{\partial \beta_0} = -2 \sum_j (y_j - \beta_0 - \beta_1 x_{1,j}),
    \qquad
    \frac{\partial RSS}{\partial \beta_1} = -2 \sum_j x_{1,j} (y_j - \beta_0 - \beta_1 x_{1,j}).

Solve (set both derivatives to zero and divide by N, writing \overline{x_1 y} = \frac{1}{N}\sum_j x_{1,j} y_j and \overline{x_1^2} = \frac{1}{N}\sum_j x_{1,j}^2):

    \bar{y} - \beta_0 - \beta_1 \bar{x}_1 = 0, \qquad
    \overline{x_1 y} - \beta_0 \bar{x}_1 - \beta_1 \overline{x_1^2} = 0.

Get:

    \beta_0 = \bar{y} - \beta_1 \bar{x}_1, \qquad
    \beta_1 = \frac{\overline{x_1 y} - \bar{x}_1\, \bar{y}}{\overline{x_1^2} - \bar{x}_1^2}.
A probabilistic version
Suppose we assume
Y = β0 + β1X1 + Z ,
where Z ∼ N(0, σ²) is a “probabilistic disturbance”.

Then the previous β0 and β1 are the maximum likelihood estimates of β0 and β1, and moreover

    \hat{\sigma}^2 = \frac{1}{N} \sum_j (y_j - \hat{y}_j)^2

maximizes the likelihood, where

    \hat{y}_j = \beta_0 + \beta_1 x_{1,j}.
Maximum likelihood versus unbiasedness
Funny: the maximum likelihood estimator for σ² is not unbiased; often the following unbiased estimator is used:

    \hat{\sigma}^2 = \frac{1}{N-2} \sum_j (y_j - \hat{y}_j)^2.
(The square root of this estimator is often called the RSE, the residual standard error.)
Some properties
Maximum likelihood estimators of β0 and β1 are consistent and asymptotically normal.
There are closed-form expressions for the variance of these estimators and for confidence intervals (details, not covered in this course, in the Textbook, page 66).
Checking whether there is relationship
There is a standard (asymptotic) test for H0 : β1 = 0 versus H1 : β1 ≠ 0.

Reject H0 when

    \left| \frac{\beta_1}{\sqrt{\hat{\sigma}^2 / \sum_j (x_{1,j} - \bar{x}_1)^2}} \right| > z_{\alpha/2},

where \hat{\sigma}^2 is (usually) the unbiased estimate of σ².
And the p-value
The test is of the form: reject when T > c, where T depends on the data and c is some threshold.
So the p-value is the probability that T exceeds the observed value of T when H0 is true (that is, when β1 = 0).
Small p-value: evidence against H0.
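A sketch of this test in Python (not from the slides): it reuses the made-up data above, estimates σ² with the unbiased N − 2 divisor, and computes the two-sided p-value from the normal approximation via SciPy.

    import numpy as np
    from scipy.stats import norm

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    N = len(y)

    x1_bar, y_bar = x1.mean(), y.mean()
    beta1 = np.sum((x1 - x1_bar) * (y - y_bar)) / np.sum((x1 - x1_bar) ** 2)
    beta0 = y_bar - beta1 * x1_bar

    # Unbiased variance estimate (divide by N - 2).
    resid = y - (beta0 + beta1 * x1)
    sigma2 = np.sum(resid ** 2) / (N - 2)

    # Asymptotic test of H0: beta1 = 0, and its two-sided p-value.
    T = beta1 / np.sqrt(sigma2 / np.sum((x1 - x1_bar) ** 2))
    p_value = 2 * norm.sf(abs(T))
    print(T, p_value)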
Usual measure of fit
The R2 statistic
    R^2 = 1 - \frac{RSS}{\sum_{j=1}^{N} (y_j - \bar{y})^2}

(the proportion of variability of Y that can be explained by covariates, between 0 and 1; higher is better).

For one covariate, R² is equal to the square of the correlation between X1 and Y:

    \left( \frac{\sum_j (x_{1,j} - \bar{x}_1)(y_j - \bar{y})}{\sqrt{\sum_j (x_{1,j} - \bar{x}_1)^2}\, \sqrt{\sum_j (y_j - \bar{y})^2}} \right)^{2}.
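Continuing the same made-up example (again, not from the slides), the following sketch checks numerically that R² equals the squared correlation in the single-covariate case.

    import numpy as np

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    x1_bar, y_bar = x1.mean(), y.mean()
    beta1 = np.sum((x1 - x1_bar) * (y - y_bar)) / np.sum((x1 - x1_bar) ** 2)
    beta0 = y_bar - beta1 * x1_bar

    rss = np.sum((y - beta0 - beta1 * x1) ** 2)
    tss = np.sum((y - y_bar) ** 2)
    r2 = 1 - rss / tss

    # Squared correlation between X1 and Y.
    corr = np.sum((x1 - x1_bar) * (y - y_bar)) / np.sqrt(
        np.sum((x1 - x1_bar) ** 2) * np.sum((y - y_bar) ** 2))
    print(r2, corr ** 2)   # the two numbers coincide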
Many covariates
Now suppose
Y = β1X1 + · · ·+ βnXn + Z ,
where E[Z] = 0 (we can take X1 identically equal to one to “simulate” a fixed value when the other covariates are zero).
The RSS is then

    (C - AB)^{T} (C - AB),

where

    A = \begin{pmatrix} x_{1,1} & x_{2,1} & \cdots & x_{n,1} \\ \vdots & \vdots & & \vdots \\ x_{1,N} & x_{2,N} & \cdots & x_{n,N} \end{pmatrix}, \quad
    B = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_n \end{pmatrix}, \quad
    C = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}.
Regression for many covariates
For

    Y = \beta_1 X_1 + \cdots + \beta_n X_n + Z,

the minimizer of RSS is

    B = (A^{T}A)^{-1} A^{T} C.
Some properties
This estimator is consistent.
If Z has variance σ², then B has variance σ²(A^T A)^{-1}.
The estimator B is approximately Gaussian, for large N.
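A NumPy sketch of these formulas (data simulated for the example; dividing by N − n in the variance estimate is an unbiased variant, my assumption rather than something stated on the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    N, n = 100, 3
    # First column of ones "simulates" the intercept, as described above.
    A = np.column_stack([np.ones(N), rng.normal(size=(N, n - 1))])
    true_B = np.array([1.0, 2.0, -0.5])
    C = A @ true_B + rng.normal(scale=0.3, size=N)

    # Least-squares solution B = (A^T A)^{-1} A^T C.
    B = np.linalg.solve(A.T @ A, A.T @ C)

    # Estimated disturbance variance and variance of B.
    resid = C - A @ B
    sigma2 = np.sum(resid ** 2) / (N - n)
    cov_B = sigma2 * np.linalg.inv(A.T @ A)
    print(B)
    print(np.sqrt(np.diag(cov_B)))   # standard errors of the estimates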
Checking whether there is relationship
There is a standard (asymptotic) test for H0 : β1 = · · · = βn = 0 versus the alternative H1.

The test is “Reject H0 when the F-statistic is large”. The F-statistic is

    F = \frac{(TSS - RSS)/n}{RSS/(N - n - 1)},

where TSS is the total sum of squares:

    TSS = \sum_{j=1}^{N} (y_j - \bar{y})^2.

Under H0, the F-statistic has an F-distribution...
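A sketch of the F-test on simulated data (not from the slides); here n counts the non-constant covariates, which is my reading of the degrees of freedom above.

    import numpy as np
    from scipy.stats import f as f_dist

    rng = np.random.default_rng(1)
    N, n = 100, 3
    X = rng.normal(size=(N, n))
    y = 1.0 + X @ np.array([2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=N)

    A = np.column_stack([np.ones(N), X])       # intercept plus covariates
    B = np.linalg.solve(A.T @ A, A.T @ y)

    rss = np.sum((y - A @ B) ** 2)
    tss = np.sum((y - y.mean()) ** 2)

    F = ((tss - rss) / n) / (rss / (N - n - 1))
    p_value = f_dist.sf(F, n, N - n - 1)       # survival function of the F-distribution
    print(F, p_value)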
Checking subset of parameters
There are also tests to verify whether subsets of parameters are equal to zero.

Take H0 : βn−m+1 = βn−m+2 = · · · = βn = 0.

Reject H0 when the following F-statistic is “large”:

    F = \frac{(RSS_0 - RSS)/m}{RSS/(N - n - 1)},

where RSS0 is the residual sum of squares for the model containing only β1, . . . , βn−m.

It is possible to check one parameter at a time (m = 1).
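And a sketch of the partial test, dropping the last m covariates from the same simulated data (again my own illustration, not the slides'):

    import numpy as np
    from scipy.stats import f as f_dist

    rng = np.random.default_rng(1)
    N, n, m = 100, 3, 2
    X = rng.normal(size=(N, n))
    y = 1.0 + X @ np.array([2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=N)

    A_full = np.column_stack([np.ones(N), X])
    A_red = A_full[:, : n - m + 1]             # intercept plus the first n - m covariates

    def rss_of(A, y):
        B = np.linalg.solve(A.T @ A, A.T @ y)
        return np.sum((y - A @ B) ** 2)

    rss, rss0 = rss_of(A_full, y), rss_of(A_red, y)
    F = ((rss0 - rss) / m) / (rss / (N - n - 1))
    p_value = f_dist.sf(F, m, N - n - 1)
    print(F, p_value)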
Usual measure of fit
The R2 statistic
    R^2 = 1 - \frac{RSS}{\sum_{j=1}^{N} (y_j - \bar{y})^2}

(the proportion of variability of Y that can be explained by covariates, between 0 and 1; higher is better).

But: the more covariates, the higher the R² statistic.
Feature selection
Usually we must discard some covariates.
Too many covariates lead to small bias but large variance, while too few covariates lead to small variance but large bias (the bias-variance trade-off).
In particular, a covariate should not be a linear function of other covariates...
We must “score” each set of covariates, to choose the best one. A possible score is empirical error (by cross-validation).
The Akaike Information Criterion
One popular score is the AIC:
    L_S - |S|,

where S is the set of covariates in the scored model, and L_S is the log-likelihood of the model with covariates in S, evaluated at the maximum likelihood estimates.

That is, “goodness of fit − model complexity”.
Another score
The BIC (Bayesian Information Criterion):
    L_S - (|S|/2) \log N.

For large N, the posterior probability of the scored model is proportional to e^{BIC}, when all possible models get identical prior probability.
Yet another score
Mallow’s Cp:
    2|S|\,\hat{\sigma}^2 + \sum_{j=1}^{N} (\hat{y}_j^S - y_j)^2,

where \hat{\sigma}^2 is the estimate of the variance with all covariates, while \hat{y}_j^S are the fitted values produced with covariates in S.

This score adjusts the training error to estimate the test error expected from a model with covariates in S.
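A sketch comparing the three scores on simulated data, using the definitions above (signs and scalings differ between textbooks); counting the intercept column in |S| and using the Gaussian log-likelihood at the maximum likelihood estimates are my own conventions here.

    import numpy as np

    rng = np.random.default_rng(2)
    N, n = 100, 4
    X = rng.normal(size=(N, n))
    y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=N)

    def fit_rss(cols):
        A = np.column_stack([np.ones(N), X[:, cols]])
        B = np.linalg.solve(A.T @ A, A.T @ y)
        return np.sum((y - A @ B) ** 2), A.shape[1]

    rss_full, _ = fit_rss(list(range(n)))
    sigma2_full = rss_full / (N - n - 1)       # variance estimate with all covariates

    def scores(cols):
        rss, k = fit_rss(cols)                 # k plays the role of |S|
        loglik = -0.5 * N * (np.log(2 * np.pi * rss / N) + 1)   # Gaussian log-likelihood at the MLE
        aic = loglik - k
        bic = loglik - 0.5 * k * np.log(N)
        cp = 2 * k * sigma2_full + rss
        return aic, bic, cp

    print(scores([0, 1]))          # subset used to generate the data
    print(scores([0, 1, 2, 3]))    # all covariates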
Structure search
Forward stepwise regression: start with no covariates, add the one that leads to the best score, then add another one that leads to the best score, etc.
Backward stepwise regression: start with all covariates, drop the one that leads to the best score, etc.
There are many other search schemes.
Additional details, not covered in this course, in the Textbook, Section 6.1.
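A minimal sketch of forward stepwise selection (my own illustration, scoring candidate subsets by their error on a held-out split rather than by one of the criteria above):

    import numpy as np

    rng = np.random.default_rng(3)
    N, n = 200, 6
    X = rng.normal(size=(N, n))
    y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=N)

    train, test = np.arange(0, 150), np.arange(150, N)

    def held_out_rss(cols):
        A_tr = np.column_stack([np.ones(len(train)), X[np.ix_(train, cols)]])
        A_te = np.column_stack([np.ones(len(test)), X[np.ix_(test, cols)]])
        B = np.linalg.solve(A_tr.T @ A_tr, A_tr.T @ y[train])
        return np.sum((y[test] - A_te @ B) ** 2)

    selected, remaining = [], list(range(n))
    best = np.sum((y[test] - y[train].mean()) ** 2)   # score of the empty model
    while remaining:
        scores = {i: held_out_rss(selected + [i]) for i in remaining}
        i_best = min(scores, key=scores.get)
        if scores[i_best] >= best:
            break                                     # no candidate improves the score
        best = scores[i_best]
        selected.append(i_best)
        remaining.remove(i_best)
    print(selected)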
Qualitative features in linear regression
Use some encoding.
If values are ordered: turn into numbers.
If not appropriate to do so: one-hot encoding.
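A small sketch of both encodings with pandas (column names invented for the example):

    import pandas as pd

    df = pd.DataFrame({
        "size": ["small", "medium", "large", "medium"],   # ordered: map to numbers
        "color": ["red", "green", "blue", "red"],         # unordered: one-hot encode
    })

    df["size_num"] = df["size"].map({"small": 1, "medium": 2, "large": 3})
    df = pd.get_dummies(df, columns=["color"])            # adds one indicator column per color
    print(df)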
Other challenges in linear regression
Relationship is not linear: detect with tests and residual plots.
Disturbances are correlated or vary with features.
Outliers and high leverage points (must be tested for and discarded).
Collinearity (leads to numerical problems and higher variance).
These issues, not covered in this course, are discussed in the Textbook, Section 3.3.3.
Introducing non-linear features
Suppose we have X1 and X2. We can assume:
Y = β0 + β1X1 + β2X2 + β12X1X2,
or any other set of functions of X1 and X2.
If the functions are polynomials of the covariates, we have polynomial regression. If the functions are all produced by a set of transformations, they are referred to as basis functions.
Other functions lead to Generalized Additive Models (not discussed in this course; see Textbook, Section 7.7).
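A sketch of this idea on made-up data: build the extra feature by hand and reuse the same least-squares machinery.

    import numpy as np

    rng = np.random.default_rng(4)
    N = 200
    x1, x2 = rng.normal(size=N), rng.normal(size=N)
    y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(scale=0.3, size=N)

    # Design matrix with an intercept, the two covariates, and their product.
    A = np.column_stack([np.ones(N), x1, x2, x1 * x2])
    B = np.linalg.solve(A.T @ A, A.T @ y)
    print(B)   # roughly recovers (1, 2, -1, 0.5)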
Nonparametric regression
We might assume that Y depends on splines (Y is a piecewise-polynomial but smooth function of the covariates).
Or we might assume that Y is a function of neighboring points (kNN regression):
1. To obtain the y corresponding to given covariates, weigh each point in the neighborhood.
2. Then run a weighted regression with those points only.
These topics are not covered in this course; they are discussed in the Textbook, Sections 7.5 and 7.6, and Section 3.5.
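A sketch of kNN regression with scikit-learn on made-up data; distance weighting is one possible choice for the weighing step above.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(5)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

    # Average the 10 nearest neighbours, weighted by inverse distance.
    knn = KNeighborsRegressor(n_neighbors=10, weights="distance")
    knn.fit(X, y)
    print(knn.predict([[0.5]]))   # close to sin(0.5)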
Shrinkage methods: Regularization
The idea is to penalize the “size” of parameters, to “shrink” them, to reduce variance (but with an increase in bias...).
Useful to reduce overfitting, particularly when there are too many covariates.
Two main strategies: ridge regression, and lasso (least absolute shrinkage and selection operator).
Ridge regression
Suppose we minimize

    \sum_{j=1}^{N} \left( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^{n} \beta_i^2.

The larger the parameter λ, the smaller the values of βi.

Tuning λ: usually by cross-validation.

Solution is easy (but biased!):

    B = (A^{T}A + \lambda I)^{-1} A^{T} C.
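A sketch of this closed form on simulated data; the covariates are standardized and Y centered, for the reasons discussed on the next slide.

    import numpy as np

    rng = np.random.default_rng(6)
    N, n = 100, 5
    X = rng.normal(size=(N, n)) * np.array([1.0, 10.0, 0.1, 1.0, 1.0])   # mixed scales
    y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=N)

    A = (X - X.mean(axis=0)) / X.std(axis=0)   # mean 0, variance 1 per covariate
    C = y - y.mean()

    for lam in [0.0, 1.0, 100.0]:
        B = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ C)
        print(lam, np.round(B, 3))   # coefficients shrink toward zero as lambda grows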
Standardization
In ridge regression, the measurement scale for covariates is important...
In plain linear regression, rescaling a covariate simply rescales its estimate; with the penalty this is no longer true.
Usual assumption: data are standardized (mean 0, variance 1): each x_{i,j} is replaced by

    \frac{x_{i,j}}{\sqrt{(1/N)\sum_{j=1}^{N} (x_{i,j} - \bar{x}_i)^2}}.
Usually Y is centered (mean is subtracted).
The bias-variance trade-off
Ridge regression increases bias and decreases variance (compared to linear regression). So it is useful when variance is large.
For instance, when the number of covariates is very large.
Bias-variance
[Plots omitted: mean squared error as a function of λ and of \|\beta^R_\lambda\|_2 / \|\beta\|_2.]
Textbook, Figure 6.5.
The Bayesian interpretation
Suppose we have a Gaussian likelihood and a prior proportional to e^{-\lambda \sum_i \beta_i^2}.

Then maximizing the posterior (for 0-1 loss) leads to minimization of

    \sum_{j=1}^{N} \left( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^{n} \beta_i^2.
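To see why (a sketch in my own notation, not from the slides): take negative logarithms of posterior ∝ likelihood × prior and drop the terms that do not involve β,

    -\log p(\beta \mid \text{data})
      = \frac{1}{2\sigma^2} \sum_{j=1}^{N} \Bigl( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \Bigr)^2
        + \lambda \sum_{i=1}^{n} \beta_i^2 + \text{const}.

Minimizing this is the same as minimizing the ridge objective above, with penalty parameter 2σ²λ in place of λ (the slides absorb that constant into λ).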
Lasso I
Suppose we minimize

    \sum_{j=1}^{N} \left( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^{n} |\beta_i|.

The larger the parameter λ, the smaller the values of βi.
Tuning λ: usually by cross-validation.
Lasso II
Minimize

    \sum_{j=1}^{N} \left( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^{n} |\beta_i|.

Usual assumption: Xi standardized (mean 0, variance 1), Y centered.
Amazing fact: if λ is “large enough”, many βi are set to zero: so covariate selection is done automatically!
The result is a sparse solution.
Working with lasso
There is no closed-form solution, but convex optimization handles it.
Gains in accuracy have been observed, particularly when too many covariates are present.
Too many covariates can overfit any input data... With too many covariates, many may have the same predictive power, and many may be highly correlated and useless. Thus selection of covariates is a big plus.
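A sketch of the sparsity effect with scikit-learn (made-up data; sklearn's alpha plays the role of λ, up to its own scaling conventions):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(7)
    N, n = 100, 20
    X = rng.normal(size=(N, n))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=N)

    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize covariates
    y = y - y.mean()                           # center the response

    lasso = Lasso(alpha=0.5).fit(X, y)
    print(np.count_nonzero(lasso.coef_), "non-zero coefficients out of", n)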
Another perspective, with some intuition
It is possible to show that both ridge regression and lasso minimize RSS, but
Ridge regression: subject to \sum_{i=1}^{n} \beta_i^2 \le \lambda.
Lasso: subject to \sum_{i=1}^{n} |\beta_i| \le \lambda.
Intuition
Textbook, Figure 6.7.
A final note
Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with Applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.