A general definition
Regression studies the relationship between a response variable Y and covariates X1, . . . , Xn.
A covariate is also called a feature or a predictor.
The term “regression” is usually employed when Y is a continuous variable.
That is, variables are quantitative rather than qualitative.
Basic model and basic questions
Model:

    Y = f(X) + ε,

where X are the covariates and ε is some “noise”.

Questions:
Is there a relationship? A linear relationship? How strong?
Are all covariates useful?
How can we predict Y?

Answering these questions is usually referred to as statistical inference.
Linear regression
Linear regression adopts a linear relationship:
Y = β0 + β1X1 + β2X2 + · · ·+ βnXn.
Examples
[Figures omitted: Sales plotted against TV advertising with a fitted line, and a fit for two covariates X1, X2 and response Y.]
Textbook, Figures 3.1 and 3.4 (Pages 62 and 73).
The least-squares solution
Suppose we have observations

    x1,j, x2,j, . . . , xn,j, yj

for j ∈ {1, . . . , N}.

Consider the residual sum of squares

    RSS = \sum_{j=1}^{N} (y_j - \beta_0 - \beta_1 x_{1,j} - \cdots - \beta_n x_{n,j})^2.

We might choose the βi to minimize RSS.
The solution for a single covariate
Suppose we have
Y = β0 + β1X1.
To minimize RSS, we must set:
    \beta_0 = \sum_j y_j / N - \beta_1 \sum_j x_{1,j} / N,

    \beta_1 = \frac{\sum_j \left(x_{1,j} - \sum_k x_{1,k}/N\right)\left(y_j - \sum_k y_k/N\right)}{\sum_j \left(x_{1,j} - \sum_k x_{1,k}/N\right)^2}.
Often used notation
Adopt \bar{x}_1 = \frac{1}{N}\sum_{j=1}^{N} x_{1,j} and \bar{y} = \frac{1}{N}\sum_{j=1}^{N} y_j.

Then:

    \beta_0 = \bar{y} - \beta_1 \bar{x}_1, \qquad
    \beta_1 = \frac{\sum_j (x_{1,j} - \bar{x}_1)(y_j - \bar{y})}{\sum_j (x_{1,j} - \bar{x}_1)^2}.
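As a small illustration (not part of the original slides), here is a NumPy sketch of these closed-form expressions; the arrays x1 and y are made up for the example.

    import numpy as np

    # Hypothetical data, invented for illustration.
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    x1_bar, y_bar = x1.mean(), y.mean()

    # Closed-form least-squares estimates from the slide.
    beta1 = np.sum((x1 - x1_bar) * (y - y_bar)) / np.sum((x1 - x1_bar) ** 2)
    beta0 = y_bar - beta1 * x1_bar
    print(beta0, beta1)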
Deriving the expressions
Differentiate RSS = \sum_{j=1}^{N} (y_j - \beta_0 - \beta_1 x_{1,j})^2:

    \frac{\partial RSS}{\partial \beta_0} = -2 \sum_j (y_j - \beta_0 - \beta_1 x_{1,j}),
    \qquad
    \frac{\partial RSS}{\partial \beta_1} = -2 \sum_j x_{1,j} (y_j - \beta_0 - \beta_1 x_{1,j}).

Solve (set both derivatives to zero and divide by N, writing \overline{x_1 y} = \frac{1}{N}\sum_j x_{1,j} y_j and \overline{x_1^2} = \frac{1}{N}\sum_j x_{1,j}^2):

    \bar{y} - \beta_0 - \beta_1 \bar{x}_1 = 0, \qquad
    \overline{x_1 y} - \beta_0 \bar{x}_1 - \beta_1 \overline{x_1^2} = 0.

Get:

    \beta_0 = \bar{y} - \beta_1 \bar{x}_1, \qquad
    \beta_1 = \frac{\overline{x_1 y} - \bar{x}_1\, \bar{y}}{\overline{x_1^2} - \bar{x}_1^2}.
A probabilistic version
Suppose we assume
Y = β0 + β1X1 + Z ,
where Z ∼ N(0, σ²) is a “probabilistic disturbance”.

Then the previous β0 and β1 are the maximum likelihood estimates of β0 and β1, and moreover

    \hat{\sigma}^2 = \frac{1}{N} \sum_j (y_j - \hat{y}_j)^2

maximizes the likelihood, where

    \hat{y}_j = \beta_0 + \beta_1 x_{1,j}.
Maximum likelihood versus unbiasedness
Funny: the maximum likelihood estimator for σ² is not unbiased; often the following unbiased estimator is used:

    \hat{\sigma}^2 = \frac{1}{N-2} \sum_j (y_j - \hat{y}_j)^2.
(The square root of this estimator is often called the RSE, the residual standard error.)
Some properties
Maximum likelihood estimators of β0 and β1 are consistent and asymptotically normal.
There are closed-form expressions for the variance of these estimators and for confidence intervals (details, not covered in this course, in the Textbook, page 66).
Checking whether there is relationship
There is a standard (asymptotic) test for H0 : β1 = 0 versus H1 : β1 ≠ 0.

Reject H0 when

    \left| \frac{\beta_1}{\sqrt{\hat{\sigma}^2 / \sum_j (x_{1,j} - \bar{x}_1)^2}} \right| > z_{\alpha/2},

where \hat{\sigma}^2 is (usually) the unbiased estimate of σ².
And the p-value
The test is of the form: reject when T > c, where T depends on the data and c is some threshold.
So the p-value is the probability that T exceeds the observed value of T when H0 is true (that is, when β1 = 0).
Small p-value: evidence against H0.
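A sketch of this test in Python (not from the slides): it reuses the made-up data above, estimates σ² with the unbiased N − 2 divisor, and computes the two-sided p-value from the normal approximation via SciPy.

    import numpy as np
    from scipy.stats import norm

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    N = len(y)

    x1_bar, y_bar = x1.mean(), y.mean()
    beta1 = np.sum((x1 - x1_bar) * (y - y_bar)) / np.sum((x1 - x1_bar) ** 2)
    beta0 = y_bar - beta1 * x1_bar

    # Unbiased variance estimate (divide by N - 2).
    resid = y - (beta0 + beta1 * x1)
    sigma2 = np.sum(resid ** 2) / (N - 2)

    # Asymptotic test of H0: beta1 = 0, and its two-sided p-value.
    T = beta1 / np.sqrt(sigma2 / np.sum((x1 - x1_bar) ** 2))
    p_value = 2 * norm.sf(abs(T))
    print(T, p_value)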
Usual measure of fit
The R2 statistic
    R^2 = 1 - \frac{RSS}{\sum_{j=1}^{N} (y_j - \bar{y})^2}

(the proportion of variability of Y that can be explained by covariates, between 0 and 1; higher is better).

For one covariate, R² is equal to the square of the correlation between X1 and Y:

    \left( \frac{\sum_j (x_{1,j} - \bar{x}_1)(y_j - \bar{y})}{\sqrt{\sum_j (x_{1,j} - \bar{x}_1)^2}\, \sqrt{\sum_j (y_j - \bar{y})^2}} \right)^{2}.
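Continuing the same made-up example (again, not from the slides), the following sketch checks numerically that R² equals the squared correlation in the single-covariate case.

    import numpy as np

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    x1_bar, y_bar = x1.mean(), y.mean()
    beta1 = np.sum((x1 - x1_bar) * (y - y_bar)) / np.sum((x1 - x1_bar) ** 2)
    beta0 = y_bar - beta1 * x1_bar

    rss = np.sum((y - beta0 - beta1 * x1) ** 2)
    tss = np.sum((y - y_bar) ** 2)
    r2 = 1 - rss / tss

    # Squared correlation between X1 and Y.
    corr = np.sum((x1 - x1_bar) * (y - y_bar)) / np.sqrt(
        np.sum((x1 - x1_bar) ** 2) * np.sum((y - y_bar) ** 2))
    print(r2, corr ** 2)   # the two numbers coincide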
Many covariates
Now suppose
Y = β1X1 + · · ·+ βnXn + Z ,
where E[Z] = 0 (we can take X1 identically equal to one to “simulate” a fixed value when the other covariates are zero).
The RSS is then

    (C - AB)^{T} (C - AB),

where

    A = \begin{pmatrix} x_{1,1} & x_{2,1} & \cdots & x_{n,1} \\ \vdots & \vdots & & \vdots \\ x_{1,N} & x_{2,N} & \cdots & x_{n,N} \end{pmatrix}, \quad
    B = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_n \end{pmatrix}, \quad
    C = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}.
Regression for many covariates
For

    Y = \beta_1 X_1 + \cdots + \beta_n X_n + Z,

the minimizer of RSS is

    B = (A^{T}A)^{-1} A^{T} C.
Some properties
This estimator is consistent.
If Z has variance σ², then B has variance σ²(A^T A)^{-1}.
The estimator B is approximately Gaussian, for large N.
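A NumPy sketch of these formulas (data simulated for the example; dividing by N − n in the variance estimate is an unbiased variant, my assumption rather than something stated on the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    N, n = 100, 3
    # First column of ones "simulates" the intercept, as described above.
    A = np.column_stack([np.ones(N), rng.normal(size=(N, n - 1))])
    true_B = np.array([1.0, 2.0, -0.5])
    C = A @ true_B + rng.normal(scale=0.3, size=N)

    # Least-squares solution B = (A^T A)^{-1} A^T C.
    B = np.linalg.solve(A.T @ A, A.T @ C)

    # Estimated disturbance variance and variance of B.
    resid = C - A @ B
    sigma2 = np.sum(resid ** 2) / (N - n)
    cov_B = sigma2 * np.linalg.inv(A.T @ A)
    print(B)
    print(np.sqrt(np.diag(cov_B)))   # standard errors of the estimates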
Checking whether there is relationship
There is a standard (asymptotic) test for H0 : β1 = · · · = βn = 0 versus the alternative H1.

The test is “Reject H0 when the F-statistic is large”. The F-statistic is

    F = \frac{(TSS - RSS)/n}{RSS/(N - n - 1)},

where TSS is the total sum of squares:

    TSS = \sum_{j=1}^{N} (y_j - \bar{y})^2.

Under H0, the F-statistic has an F-distribution...
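A sketch of the F-test on simulated data (not from the slides); here n counts the non-constant covariates, which is my reading of the degrees of freedom above.

    import numpy as np
    from scipy.stats import f as f_dist

    rng = np.random.default_rng(1)
    N, n = 100, 3
    X = rng.normal(size=(N, n))
    y = 1.0 + X @ np.array([2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=N)

    A = np.column_stack([np.ones(N), X])       # intercept plus covariates
    B = np.linalg.solve(A.T @ A, A.T @ y)

    rss = np.sum((y - A @ B) ** 2)
    tss = np.sum((y - y.mean()) ** 2)

    F = ((tss - rss) / n) / (rss / (N - n - 1))
    p_value = f_dist.sf(F, n, N - n - 1)       # survival function of the F-distribution
    print(F, p_value)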
Checking subset of parameters
There are also tests to verify whether subsets of parameters are equal to zero.

Take H0 : βn−m+1 = βn−m+2 = · · · = βn = 0.

Reject H0 when the following F-statistic is “large”:

    F = \frac{(RSS_0 - RSS)/m}{RSS/(N - n - 1)},

where RSS0 is the residual sum of squares for the model containing only β1, . . . , βn−m.

It is possible to check one parameter at a time (m = 1).
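And a sketch of the partial test, dropping the last m covariates from the same simulated data (again my own illustration, not the slides'):

    import numpy as np
    from scipy.stats import f as f_dist

    rng = np.random.default_rng(1)
    N, n, m = 100, 3, 2
    X = rng.normal(size=(N, n))
    y = 1.0 + X @ np.array([2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=N)

    A_full = np.column_stack([np.ones(N), X])
    A_red = A_full[:, : n - m + 1]             # intercept plus the first n - m covariates

    def rss_of(A, y):
        B = np.linalg.solve(A.T @ A, A.T @ y)
        return np.sum((y - A @ B) ** 2)

    rss, rss0 = rss_of(A_full, y), rss_of(A_red, y)
    F = ((rss0 - rss) / m) / (rss / (N - n - 1))
    p_value = f_dist.sf(F, m, N - n - 1)
    print(F, p_value)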
Usual measure of fit
The R2 statistic
    R^2 = 1 - \frac{RSS}{\sum_{j=1}^{N} (y_j - \bar{y})^2}

(the proportion of variability of Y that can be explained by covariates, between 0 and 1; higher is better).

But: the more covariates, the higher the R² statistic.
Feature selection
Usually we must discard some covariates.
Too many covariates lead to small bias but large variance, while too few covariates lead to small variance but large bias (the bias-variance trade-off).
In particular, a covariate should not be a linear function of other covariates...
We must “score” each set of covariates, to choose the best one. A possible score is empirical error (by cross-validation).
The Akaike Information Criterion
One popular score is the AIC:
    L_S - |S|,

where S is the set of covariates in the scored model, and L_S is the log-likelihood of the model with covariates in S, evaluated at the maximum likelihood estimates.

That is, “goodness of fit − model complexity”.
Another score
The BIC (Bayesian Information Criterion):
    L_S - (|S|/2) \log N.

For large N, the posterior probability of the scored model is proportional to e^{BIC}, when all possible models get identical prior probability.
Yet another score
Mallow’s Cp:
    2|S|\,\hat{\sigma}^2 + \sum_{j=1}^{N} (\hat{y}_j^S - y_j)^2,

where \hat{\sigma}^2 is the estimate of the variance with all covariates, while \hat{y}_j^S are the fitted values produced with covariates in S.

This score adjusts the training error to estimate the test error expected from a model with covariates in S.
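A sketch comparing the three scores on simulated data, using the definitions above (signs and scalings differ between textbooks); counting the intercept column in |S| and using the Gaussian log-likelihood at the maximum likelihood estimates are my own conventions here.

    import numpy as np

    rng = np.random.default_rng(2)
    N, n = 100, 4
    X = rng.normal(size=(N, n))
    y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=N)

    def fit_rss(cols):
        A = np.column_stack([np.ones(N), X[:, cols]])
        B = np.linalg.solve(A.T @ A, A.T @ y)
        return np.sum((y - A @ B) ** 2), A.shape[1]

    rss_full, _ = fit_rss(list(range(n)))
    sigma2_full = rss_full / (N - n - 1)       # variance estimate with all covariates

    def scores(cols):
        rss, k = fit_rss(cols)                 # k plays the role of |S|
        loglik = -0.5 * N * (np.log(2 * np.pi * rss / N) + 1)   # Gaussian log-likelihood at the MLE
        aic = loglik - k
        bic = loglik - 0.5 * k * np.log(N)
        cp = 2 * k * sigma2_full + rss
        return aic, bic, cp

    print(scores([0, 1]))          # subset used to generate the data
    print(scores([0, 1, 2, 3]))    # all covariates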
Structure search
Forward stepwise regression: start with no covariates, add the one that leads to the best score, then add another one that leads to the best score, etc.
Backward stepwise regression: start with all covariates, drop the one that leads to the best score, etc.
There are many other search schemes.
Additional details, not covered in this course, in the Textbook, Section 6.1.
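A minimal sketch of forward stepwise selection (my own illustration, scoring candidate subsets by their error on a held-out split rather than by one of the criteria above):

    import numpy as np

    rng = np.random.default_rng(3)
    N, n = 200, 6
    X = rng.normal(size=(N, n))
    y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=N)

    train, test = np.arange(0, 150), np.arange(150, N)

    def held_out_rss(cols):
        A_tr = np.column_stack([np.ones(len(train)), X[np.ix_(train, cols)]])
        A_te = np.column_stack([np.ones(len(test)), X[np.ix_(test, cols)]])
        B = np.linalg.solve(A_tr.T @ A_tr, A_tr.T @ y[train])
        return np.sum((y[test] - A_te @ B) ** 2)

    selected, remaining = [], list(range(n))
    best = np.sum((y[test] - y[train].mean()) ** 2)   # score of the empty model
    while remaining:
        scores = {i: held_out_rss(selected + [i]) for i in remaining}
        i_best = min(scores, key=scores.get)
        if scores[i_best] >= best:
            break                                     # no candidate improves the score
        best = scores[i_best]
        selected.append(i_best)
        remaining.remove(i_best)
    print(selected)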
Qualitative features in linear regression
Use some encoding.
If values are ordered: turn into numbers.
If not appropriate to do so: one-hot encoding.
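A small sketch of both encodings with pandas (column names invented for the example):

    import pandas as pd

    df = pd.DataFrame({
        "size": ["small", "medium", "large", "medium"],   # ordered: map to numbers
        "color": ["red", "green", "blue", "red"],         # unordered: one-hot encode
    })

    df["size_num"] = df["size"].map({"small": 1, "medium": 2, "large": 3})
    df = pd.get_dummies(df, columns=["color"])            # adds one indicator column per color
    print(df)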
Other challenges in linear regression
Relationship is not linear: detect with tests and residual plots.
Disturbances are correlated or vary with features.
Outliers and high leverage points (must be tested for and discarded).
Collinearity (leads to numerical problems and higher variance).
These issues, not covered in this course, are discussed in the Textbook, Section 3.3.3.
Introducing non-linear features
Suppose we have X1 and X2. We can assume:
Y = β0 + β1X1 + β2X2 + β12X1X2,
or any other set of functions of X1 and X2.
If the functions are polynomials of the covariates, we have polynomial regression. If the functions are all produced by a set of transformations, they are referred to as basis functions.
Other functions lead to Generalized Additive Models (not discussed in this course; see Textbook, Section 7.7).
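A sketch of this idea on made-up data: build the extra feature by hand and reuse the same least-squares machinery.

    import numpy as np

    rng = np.random.default_rng(4)
    N = 200
    x1, x2 = rng.normal(size=N), rng.normal(size=N)
    y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(scale=0.3, size=N)

    # Design matrix with an intercept, the two covariates, and their product.
    A = np.column_stack([np.ones(N), x1, x2, x1 * x2])
    B = np.linalg.solve(A.T @ A, A.T @ y)
    print(B)   # roughly recovers (1, 2, -1, 0.5)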
Nonparametric regression
We might assume that Y depends on splines (Y is a piecewise-polynomial but smooth function of the covariates).
Or we might assume that Y is a function of neighboring points (kNN regression):
1. To obtain the y corresponding to given covariates, weigh each point in the neighborhood.
2. Then run a weighted regression with those points only.
These topics are not covered in this course; they are discussed in the Textbook, Sections 7.5 and 7.6, and Section 3.5.
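A sketch of kNN regression with scikit-learn on made-up data; distance weighting is one possible choice for the weighing step above.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(5)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

    # Average the 10 nearest neighbours, weighted by inverse distance.
    knn = KNeighborsRegressor(n_neighbors=10, weights="distance")
    knn.fit(X, y)
    print(knn.predict([[0.5]]))   # close to sin(0.5)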
Shrinkage methods: Regularization
The idea is to penalize the “size” of parameters, to “shrink” them, to reduce variance (but with an increase in bias...).
Useful to reduce overfitting, particularly when there are too many covariates.
Two main strategies: ridge regression, and lasso (least absolute shrinkage and selection operator).
Ridge regression
Suppose we minimize

    \sum_{j=1}^{N} \left( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^{n} \beta_i^2.

The larger the parameter λ, the smaller the values of βi.

Tuning λ: usually by cross-validation.

Solution is easy (but biased!):

    B = (A^{T}A + \lambda I)^{-1} A^{T} C.
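A sketch of this closed form on simulated data; the covariates are standardized and Y centered, for the reasons discussed on the next slide.

    import numpy as np

    rng = np.random.default_rng(6)
    N, n = 100, 5
    X = rng.normal(size=(N, n)) * np.array([1.0, 10.0, 0.1, 1.0, 1.0])   # mixed scales
    y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=N)

    A = (X - X.mean(axis=0)) / X.std(axis=0)   # mean 0, variance 1 per covariate
    C = y - y.mean()

    for lam in [0.0, 1.0, 100.0]:
        B = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ C)
        print(lam, np.round(B, 3))   # coefficients shrink toward zero as lambda grows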
Standardization
In ridge regression, the measurement scale for covariates is important...
In plain linear regression, rescaling a covariate simply rescales its estimate; with the penalty this is no longer true.
Usual assumption: data are standardized (mean 0, variance 1): each x_{i,j} is replaced by

    \frac{x_{i,j}}{\sqrt{(1/N)\sum_{j=1}^{N} (x_{i,j} - \bar{x}_i)^2}}.
Usually Y is centered (mean is subtracted).
The bias-variance trade-off
Ridge regression increases bias and decreases variance (compared to linear regression). So it is useful when variance is large.
For instance, when the number of covariates is very large.
Bias-variance
[Plots omitted: mean squared error as a function of λ and of \|\beta^R_\lambda\|_2 / \|\beta\|_2.]
Textbook, Figure 6.5.
The Bayesian interpretation
Suppose we have a Gaussian likelihood and a prior proportional to e^{-\lambda \sum_i \beta_i^2}.

Then maximizing the posterior (for 0-1 loss) leads to minimization of

    \sum_{j=1}^{N} \left( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^{n} \beta_i^2.
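To see why (a sketch in my own notation, not from the slides): take negative logarithms of posterior ∝ likelihood × prior and drop the terms that do not involve β,

    -\log p(\beta \mid \text{data})
      = \frac{1}{2\sigma^2} \sum_{j=1}^{N} \Bigl( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \Bigr)^2
        + \lambda \sum_{i=1}^{n} \beta_i^2 + \text{const}.

Minimizing this is the same as minimizing the ridge objective above, with penalty parameter 2σ²λ in place of λ (the slides absorb that constant into λ).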
Lasso I
Suppose we minimize

    \sum_{j=1}^{N} \left( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^{n} |\beta_i|.

The larger the parameter λ, the smaller the values of βi.
Tuning λ: usually by cross-validation.
Lasso II
Minimize

    \sum_{j=1}^{N} \left( y_j - \sum_{i=1}^{n} \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^{n} |\beta_i|.

Usual assumption: Xi standardized (mean 0, variance 1), Y centered.
Amazing fact: if λ is “large enough”, many βi are set to zero: so covariate selection is done automatically!
The result is a sparse solution.
Working with lasso
There is no closed-form solution, but convex optimization handles it.
Gains in accuracy have been observed, particularly when too many covariates are present.
Too many covariates can overfit any input data... With too many covariates, many may have the same predictive power, and many may be highly correlated and useless. Thus selection of covariates is a big plus.
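A sketch of the sparsity effect with scikit-learn (made-up data; sklearn's alpha plays the role of λ, up to its own scaling conventions):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(7)
    N, n = 100, 20
    X = rng.normal(size=(N, n))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=N)

    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize covariates
    y = y - y.mean()                           # center the response

    lasso = Lasso(alpha=0.5).fit(X, y)
    print(np.count_nonzero(lasso.coef_), "non-zero coefficients out of", n)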
Another perspective, with some intuition
It is possible to show that both ridge regression and lasso minimize RSS, but
Ridge regression: subject to \sum_{i=1}^{n} \beta_i^2 \le \lambda.
Lasso: subject to \sum_{i=1}^{n} |\beta_i| \le \lambda.
Intuition
Textbook, Figure 6.7.
A final note
Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with Applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.