
ST 102: Elementary Statistical Theory
Lecture 36 – Linear Regression: Prediction and Diagnostics

Piotr Fryzlewicz
[email protected]

Department of Statistics, LSE


Objectives:

Confidence intervals for E(y)
Predictive intervals for y
Regression diagnostics: a summary

Based on the observations $\{(x_i, y_i) : i = 1, \ldots, n\}$, we fit a regression model $y = \beta_0 + \beta_1 x$.

Goal. Predict the (unobserved) y corresponding to the (known) x.

Point prediction: $\hat{y} = \hat\beta_0 + \hat\beta_1 x$.


For the analysis to be more informative, we would like to have some ‘error bars’ for our prediction. We introduce two methods:

Confidence interval for $\mu(x) \equiv E(y) = \beta_0 + \beta_1 x$
Predictive interval for y

Remark. A confidence interval is an interval estimator for an unknown parameter (i.e. for a constant), while a predictive interval is for a random variable. They are different and serve different purposes.

We assume the model is normal, i.e. $\varepsilon = y - \beta_0 - \beta_1 x \sim N(0, \sigma^2)$.


Confidence interval for $\mu(x) = Ey$

Let $\hat\mu(x) = \hat\beta_0 + \hat\beta_1 x$. Then $\hat\mu(x)$ is an unbiased estimator for $\mu(x)$.

Theorem. $\hat\mu(x)$ is normally distributed with mean $\mu(x)$ and variance
$$\mathrm{Var}\{\hat\mu(x)\} = \frac{\sigma^2}{n} \cdot \frac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}.$$

Proof. Note that both $\hat\beta_0$ and $\hat\beta_1$ are linear estimators. Therefore $\hat\mu(x)$ may be written in the form $\hat\mu(x) = \sum_{i=1}^n b_i y_i$, where $b_1, \ldots, b_n$ are some constants. Hence $\hat\mu(x)$ is normally distributed. To determine its distribution entirely, we only need to find its mean and variance.
$$E\{\hat\mu(x)\} = E(\hat\beta_0) + E(\hat\beta_1)x = \beta_0 + \beta_1 x = \mu(x).$$


$$\mathrm{Var}\{\hat\mu(x)\} = E[\{(\hat\beta_0 - \beta_0) + (\hat\beta_1 - \beta_1)x\}^2] = \mathrm{Var}(\hat\beta_0) + x^2\,\mathrm{Var}(\hat\beta_1) + 2x\,\mathrm{Cov}(\hat\beta_0, \hat\beta_1).$$

In Lecture 33 we derived
$$\mathrm{Var}(\hat\beta_0) = \frac{\sigma^2 \sum_{i=1}^n x_i^2}{n \sum_{j=1}^n (x_j - \bar{x})^2}, \qquad \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{j=1}^n (x_j - \bar{x})^2}.$$

In Workshop 18 we showed
$$\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = \frac{-\sigma^2 \bar{x}}{\sum_{j=1}^n (x_j - \bar{x})^2}.$$

Hence
$$\frac{\sum_{j=1}^n (x_j - \bar{x})^2}{\sigma^2}\,\mathrm{Var}\{\hat\mu(x)\} = \frac{1}{n}\sum_{i=1}^n x_i^2 + x^2 - 2x\bar{x} = \frac{1}{n}\Big(\sum_{i=1}^n x_i^2 + n x^2 - 2x \sum_{i=1}^n x_i\Big) = \frac{1}{n}\sum_{i=1}^n (x_i - x)^2,$$

i.e. $\mathrm{Var}\{\hat\mu(x)\} = \dfrac{\sigma^2}{n}\,\dfrac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}$.
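As an aside (not part of the original slides), the variance formula just derived can be checked by simulation; all parameter values in the sketch below are arbitrary illustrative choices.

import numpy as np

# Monte Carlo check of Var{mu_hat(x)} = (sigma^2/n) * sum_i (x_i - x)^2 / sum_j (x_j - xbar)^2.
# beta0, beta1, sigma, the design points and x_new are illustrative choices, not from the slides.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 2.0, 0.5
xs = np.linspace(0, 10, 20)           # fixed design points x_1, ..., x_n
x_new = 7.0                           # the x at which we estimate mu(x) = beta0 + beta1*x
n = len(xs)

mu_hats = []
for _ in range(20000):
    ys = beta0 + beta1 * xs + rng.normal(0, sigma, n)
    b1 = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
    b0 = ys.mean() - b1 * xs.mean()
    mu_hats.append(b0 + b1 * x_new)

empirical = np.var(mu_hats)
theoretical = sigma**2 / n * np.sum((xs - x_new) ** 2) / np.sum((xs - xs.mean()) ** 2)
print(empirical, theoretical)         # the two values should be close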


Now
$$\frac{\hat\mu(x) - \mu(x)}{\left[\dfrac{\sigma^2}{n}\,\dfrac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\right]^{1/2}} \sim N(0, 1),$$
and $(n-2)\hat\sigma^2/\sigma^2 \sim \chi^2_{n-2}$, where $\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$.

Furthermore, $\hat\mu(x)$ and $\hat\sigma^2$ are independent. Hence
$$\frac{\hat\mu(x) - \mu(x)}{\left[\dfrac{\hat\sigma^2}{n}\,\dfrac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\right]^{1/2}} \sim t_{n-2}.$$

A $(1-\alpha)$ confidence interval for $\mu(x)$ is
$$\hat\mu(x) \pm t_{\alpha/2,\,n-2}\,\hat\sigma\left[\frac{\sum_{i=1}^n (x_i - x)^2}{n\sum_{j=1}^n (x_j - \bar{x})^2}\right]^{1/2}.$$

Recall: the above interval contains the true expectation $Ey = \mu(x)$ with probability $1-\alpha$. It does not cover y with probability $1-\alpha$.
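As a sketch of how this interval could be computed outside Minitab, the following Python function implements the formula above directly; the function name and code organisation are illustrative, not taken from the course materials.

import numpy as np
from scipy import stats

def mean_response_ci(xs, ys, x_new, alpha=0.05):
    """(1 - alpha) confidence interval for mu(x_new) = E(y) at x = x_new,
    computed directly from the formula on this slide."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    n = len(xs)
    b1 = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
    b0 = ys.mean() - b1 * xs.mean()
    resid = ys - b0 - b1 * xs
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))        # hat(sigma)
    mu_hat = b0 + b1 * x_new                                  # hat(mu)(x_new)
    half_width = (stats.t.ppf(1 - alpha / 2, n - 2) * sigma_hat
                  * np.sqrt(np.sum((xs - x_new) ** 2) / (n * np.sum((xs - xs.mean()) ** 2))))
    return mu_hat - half_width, mu_hat + half_width

Called on the used-car data introduced later (e.g. mean_response_ci(mileage, price, 40.0)), it should reproduce Minitab's 95% CI up to rounding.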


Predictive interval – an interval that contains y with probability $1-\alpha$

We may assume that the y to be predicted is independent of $y_1, \ldots, y_n$ used in the estimation.

Hence $y - \hat\mu(x)$ is normal with mean 0 and variance
$$\mathrm{Var}(y) + \mathrm{Var}\{\hat\mu(x)\} = \sigma^2 + \frac{\sigma^2}{n}\,\frac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}.$$

Therefore
$$\frac{y - \hat\mu(x)}{\left[\hat\sigma^2\left(1 + \dfrac{\sum_{i=1}^n (x_i - x)^2}{n\sum_{j=1}^n (x_j - \bar{x})^2}\right)\right]^{1/2}} \sim t_{n-2}.$$

An interval covering y with probability $1-\alpha$ is
$$\hat\mu(x) \pm t_{\alpha/2,\,n-2}\,\hat\sigma\left[1 + \frac{\sum_{i=1}^n (x_i - x)^2}{n\sum_{j=1}^n (x_j - \bar{x})^2}\right]^{1/2}.$$
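A corresponding sketch for the predictive interval (again purely illustrative; only the '1 +' term inside the square root changes relative to the confidence interval):

import numpy as np
from scipy import stats

def prediction_interval(xs, ys, x_new, alpha=0.05):
    """(1 - alpha) predictive interval for a new observation y at x = x_new,
    using the formula on this slide."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    n = len(xs)
    b1 = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
    b0 = ys.mean() - b1 * xs.mean()
    sigma_hat = np.sqrt(np.sum((ys - b0 - b1 * xs) ** 2) / (n - 2))
    mu_hat = b0 + b1 * x_new
    half_width = (stats.t.ppf(1 - alpha / 2, n - 2) * sigma_hat
                  * np.sqrt(1 + np.sum((xs - x_new) ** 2) / (n * np.sum((xs - xs.mean()) ** 2))))
    return mu_hat - half_width, mu_hat + half_width

The extra '1 +' inside the square root is what makes the predictive interval wider than the confidence interval for E(y).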


Remark. (i) It holds that
$$P\left(y \in \hat\mu(x) \pm t_{\alpha/2,\,n-2}\,\hat\sigma\left[1 + \frac{\sum_{i=1}^n (x_i - x)^2}{n\sum_{j=1}^n (x_j - \bar{x})^2}\right]^{1/2}\right) = 1 - \alpha.$$

(ii) The predictive interval for y is longer than the confidence interval for E(y). The former contains the unobserved random variable y with probability $1-\alpha$; the latter contains the unknown constant E(y) with probability $1-\alpha$.


Example. The data set ‘usedFord.mtw’ contains the prices (y, in $1,000) of 100 three-year-old Ford Tauruses together with their mileages (x, in 1,000 miles) when they were sold at auction. Based on those data, a car dealer needs to make two decisions:

1. to prepare cash for bidding on a three-year-old Ford Taurus with a mileage of x = 40;
2. to prepare for buying several three-year-old Ford Tauruses with mileages close to x = 40 from a rental company.

For the first task, a predictive interval would be more appropriate. For the second task, he needs to know the average price and, therefore, a confidence interval.

This can be done easily using Minitab.


MTB > regr c1 1 c2;
SUBC> predict 40.

Price = 17.2 - 0.0669 Mileage

Predictor   Coef       SE Coef    T        P
Constant    17.2487    0.1821     94.73    0.000
Mileage     -0.066861  0.004975   -13.44   0.000

S = 0.326489   R-Sq = 64.8%   R-Sq(adj) = 64.5%

Analysis of Variance

Source          DF  SS      MS      F       P
Regression       1  19.256  19.256  180.64  0.000
Residual Error  98  10.446   0.107
Total           99  29.702

... ...

Predicted Values for New Observations

New Obs  Fit      SE Fit   95% CI                95% PI
      1  14.5743  0.0382   (14.4985, 14.6501)    (13.9220, 15.2266)

New Obs  Mileage
      1  40.0

We predict that a Ford Taurus will sell for between $13,922 and $15,227. The average selling price of several 3-year-old Ford Tauruses is estimated to be between $14,499 and $14,650. Because predicting the selling price for one car is more difficult, the corresponding interval is wider.
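For readers without Minitab, the same fit and intervals can be reproduced with Python's statsmodels; the sketch below assumes the data have been exported from usedFord.mtw to a CSV file with columns Price and Mileage (the file name and column names are assumptions, not taken from the slides).

import pandas as pd
import statsmodels.api as sm

# Assumed export of usedFord.mtw to CSV with columns 'Price' and 'Mileage'.
cars = pd.read_csv("usedFord.csv")
X = sm.add_constant(cars["Mileage"])
fit = sm.OLS(cars["Price"], X).fit()
print(fit.summary())                          # coefficients, R-sq, F statistic

# 95% CI for E(y) and 95% PI for y at Mileage = 40 (compare with the Minitab output above).
new = sm.add_constant(pd.DataFrame({"Mileage": [40.0]}), has_constant="add")
pred = fit.get_prediction(new)
print(pred.summary_frame(alpha=0.05))         # mean_ci_* columns: CI; obs_ci_* columns: PI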


To produce the plots with both confidence intervals for E(y) and predictive intervals for y:

MTB > Fitline c1 c2;
SUBC> Confidence 95;
SUBC> Ci;
SUBC> Pi.

[Figure: fitted line plot with 95% confidence and prediction intervals]


Regression Diagnostics

The usefulness of a fitted regression model rests on a basic assumption:
$$Ey = \beta_0 + \beta_1 x.$$

Furthermore, inference such as the tests, the confidence intervals and the predictive intervals only makes sense if $\varepsilon_1, \ldots, \varepsilon_n$ are (approximately) independent and normal with constant variance $\sigma^2$.

Therefore it is important to check that those conditions are met in practice; this task is called regression diagnostics.

Basic idea: look into the residuals $\hat\varepsilon_i$ or the normalised residuals $\hat\varepsilon_i / \hat\sigma$.


What to look for?

Do the residuals manifest i.i.d. normal behaviour?
Is the scatter plot of $\hat\varepsilon_i$ versus $x_i$ patternless?
Is the scatter plot of $\hat\varepsilon_i$ versus $\hat{y}_i$ patternless?
Is the scatter plot of $\hat\varepsilon_i$ versus $i$ patternless?

If you see trends, periodic patterns or increasing variation in any one of the above scatter plots, it is very likely that at least one assumption is not met.


The various residual plots can be obtained in Minitab as follows (using the same example):

MTB > Fitline c1 c2;
SUBC> gfourpack;
SUBC> gvars c2.
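A rough Python equivalent of Minitab's four-in-one residual display, assuming the statsmodels fit from the earlier sketch, might look as follows (the panel layout is chosen to mirror Minitab and is not prescribed by the slides):

import matplotlib.pyplot as plt
from scipy import stats

# Assumes 'fit' (statsmodels OLS results) from the earlier sketch.
resid = fit.resid
fitted = fit.fittedvalues

fig, ax = plt.subplots(2, 2, figsize=(9, 7))
stats.probplot(resid, dist="norm", plot=ax[0, 0])        # normal probability plot
ax[0, 0].set_title("Normal probability plot of residuals")
ax[0, 1].scatter(fitted, resid)
ax[0, 1].axhline(0)
ax[0, 1].set_title("Residuals versus fitted values")
ax[1, 0].hist(resid, bins=20)
ax[1, 0].set_title("Histogram of residuals")
ax[1, 1].plot(resid.values, marker="o", linestyle="-")
ax[1, 1].set_title("Residuals versus observation order")
plt.tight_layout()
plt.show()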


 

[Figure: residual plots for the regression of Price on Mileage]


Two other issues in regression diagnostics: outliers and influential observations.

Outlier: an unusually small or unusually large $y_i$ which lies outside the majority of observations.

An outlier is often caused by an error in either sampling or recording data. If so, we should correct it before proceeding with the regression analysis.

If an observation which looks like an outlier indeed belongs to the sample and no errors in sampling or recording were discovered, we may use a more complex model or distribution to accommodate this ‘outlier’. For example, stock returns often exhibit extreme values and they often cannot be modelled satisfactorily by a normal regression model.

Remark. Strictly speaking, outliers are defined with respect to the model: y is very unlikely to be more than $2\sigma$ away from $Ey = \beta_0 + \beta_1 x$ under the normal regression model. This is how Minitab identifies potential outliers.


Influential observation: an $x_i$ which is far away from the other x's. Such an observation may have a large influence on the fitted regression line.

 

[Figure: effect of an influential observation on the fitted regression line]


Remark. (i) Minitab output marks both outliers and influential observations.

MTB > regr c1 1 c2;
SUBC> predict 40.

Price = 17.2 - 0.0669 Mileage

... ...

Unusual Observations

Obs  Mileage  Price    Fit      SE Fit  Residual  St Resid
  8  19.1     15.7000  15.9717  0.0902  -0.2717   -0.87 X
 14  34.5     15.6000  14.9420  0.0335   0.6580    2.03R
 19  48.6     14.7000  13.9993  0.0706   0.7007    2.20R
 63  21.2     15.4000  15.8313  0.0806  -0.4313   -1.36 X
 74  21.0     16.4000  15.8446  0.0815   0.5554    1.76 X
 78  44.3     13.6000  14.2868  0.0526  -0.6868   -2.13R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

... ...
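A rough Python analogue of these 'R' and 'X' flags, assuming the statsmodels fit from the earlier sketch, is to list observations with |standardised residual| greater than 2 and observations with large leverage; the 2(k+1)/n leverage cut-off is a common rule of thumb, not a threshold stated on these slides.

import numpy as np

# Assumes 'fit' (statsmodels OLS results) from the earlier sketch.
influence = fit.get_influence()
std_resid = influence.resid_studentized_internal   # standardised residuals
leverage = influence.hat_matrix_diag               # h_ii, large when x_i is far from the other x's

n, k = len(std_resid), 1                            # k = number of predictors
outliers = np.where(np.abs(std_resid) > 2)[0]             # 'R'-type points
high_leverage = np.where(leverage > 2 * (k + 1) / n)[0]   # 'X'-type points (rule of thumb)
print("Large standardised residuals:", outliers)
print("High-leverage x values:", high_leverage)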


(ii) To mitigate the impact of both outliers and influential observations, we could use robust regression, i.e. estimate $\beta_0$ and $\beta_1$ by minimising the sum of absolute deviations:
$$SAD(\beta_0, \beta_1) = \sum_{i=1}^n |y_i - \beta_0 - \beta_1 x_i|.$$

However, note that since the function $f(x) = |x|$ is not differentiable at the point where it attains its minimum, we would not be able to find $\hat\beta_0$ and $\hat\beta_1$ by differentiating $SAD(\beta_0, \beta_1)$ with respect to $\beta_0$ and $\beta_1$ and equating the partial derivatives to zero. More complex minimisation techniques would have to be used. This may be viewed as a drawback of this approach.
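One such technique is direct numerical minimisation. Below is a minimal sketch using scipy's derivative-free Nelder–Mead method, started from the least-squares estimates; the set-up is illustrative, and least-absolute-deviations regression can equally be solved by linear programming or quantile regression.

import numpy as np
from scipy.optimize import minimize

def sad(params, xs, ys):
    """Sum of absolute deviations SAD(beta0, beta1)."""
    b0, b1 = params
    return np.sum(np.abs(ys - b0 - b1 * xs))

def lad_fit(xs, ys):
    """Estimate (beta0, beta1) by minimising SAD with a derivative-free method,
    since SAD is not differentiable at its minimum."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    # Start from the ordinary least-squares estimates.
    b1_ls = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
    b0_ls = ys.mean() - b1_ls * xs.mean()
    res = minimize(sad, x0=[b0_ls, b1_ls], args=(xs, ys), method="Nelder-Mead")
    return res.x  # (beta0_hat, beta1_hat)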


Workshop 19

In this workshop we apply the simple linear regression method to study the relationship between two financial return series: a regression of Cisco Systems stock returns y on S&P500 Index returns x. This regression model is an example of the CAPM (Capital Asset Pricing Model).

Stock returns:
$$\text{return} = \frac{\text{current price} - \text{previous price}}{\text{previous price}} \approx \log\frac{\text{current price}}{\text{previous price}}$$
when the difference between the two prices is small.
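A quick numerical check of this approximation (the two prices below are arbitrary illustrative values):

import math

prev, curr = 100.0, 101.5                  # illustrative prices
simple_return = (curr - prev) / prev       # 0.0150
log_return = math.log(curr / prev)         # about 0.0149, close when the change is small
print(simple_return, log_return)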

Dataset: “return4.mtw” (on Moodle). Daily returns, 3 January – 29 December 2000 (n = 252 observations). The dataset has 5 columns: c1 – date, c2 – 100×(S&P500 return), c3 – 100×(Cisco return), and c4 and c5 are two other stock returns.


Remark. Daily prices are definitely not independent. However, daily returns may be seen as a sequence of uncorrelated random variables.

MTB > describe c2 c3.

Descriptive Statistics: S&P500, Cisco

Variable  N    N*  Mean     SE Mean  StDev   Minimum  Q1       Median   Q3      Maximum
S&P500    252  0   -0.0424  0.0882   1.4002  -6.0045  -0.8543  -0.0379  0.8021  4.6546
Cisco     252  0   -0.134   0.267    4.234   -13.439  -3.104   -0.115   2.724   15.415

For the S&P500, the average daily return is -0.04%, the maximum daily return is 4.65%, the minimum daily return is -6.00%, and the standard deviation is 1.40.

For Cisco, the average daily return is -0.13%, the maximum daily return is 15.42%, the minimum daily return is -13.44%, and the standard deviation is 4.23.


Remark. Cisco is much more volatile than the S&P500.

MTB > tsplot c2 c3;
SUBC> overlay.

[Figure: overlaid time series plots of the daily S&P500 and Cisco returns]


There is clear synchronisation between the movements of the two return series.

MTB > corr c2 c3
Pearson correlation of S&P500 and Cisco = 0.687
P-Value = 0.000


We fit a regression model: $\text{Cisco} = \beta_0 + \beta_1\,\text{S\&P500} + \varepsilon$.

Rationale: part of the fluctuation in Cisco returns was driven by the fluctuation of the S&P500 return.

MTB > regr c3 1 c2

The regression equation is Cisco = - 0.045 + 2.08 S&P500

Predictor   Coef     SE Coef  T      P
Constant    -0.0455  0.1943   -0.23  0.815
S&P500      2.0771   0.1390   14.94  0.000

S = 3.08344   R-Sq = 47.2%   R-Sq(adj) = 47.0%

Analysis of Variance

Source          DF   SS      MS      F       P
Regression       1   2123.1  2123.1  223.31  0.000
Residual Error  250  2376.9     9.5
Total           251  4500.0


Unusual Observations

Obs  S&P500   Cisco     Fit     SE Fit  Residual  St Resid
  2   -3.91    -5.771   -8.167   0.572     2.396    0.79 X
 27   -2.10     2.357   -4.415   0.346     6.772    2.21R
 36    0.63    11.208    1.259   0.215     9.949    3.23R
 51    2.40    -2.396    4.936   0.391    -7.332   -2.40R
 52    4.65     2.321    9.623   0.681    -7.302   -2.43RX
... ...
210    1.37    -5.328    2.808   0.277    -8.135   -2.65R
211    2.17    11.431    4.470   0.364     6.961    2.27R
234    0.74    -5.706    1.487   0.222    -7.193   -2.34R
235    3.82    12.924    7.886   0.571     5.038    1.66 X
244    0.80   -11.493    1.624   0.227   -13.117   -4.27R
246   -3.18   -13.439   -6.650   0.477    -6.789   -2.23RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.


The estimated slope is $\hat\beta_1 = 2.077$. The null hypothesis $H_0: \beta_1 = 0$ is rejected with p-value 0.000: extremely significant.

Attempted interpretation: when the market index goes up by 1%, the Cisco stock goes up by 2.077% on average. However, the error term ε in the model is large, with estimated $\hat\sigma = 3.08\%$.

The p-value for testing $H_0: \beta_0 = 0$ is 0.815, so we cannot reject the hypothesis $\beta_0 = 0$. Recall that $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$, and both $\bar{y}$ and $\bar{x}$ are very close to 0.

There are many standardised residual values ≥ 2 or ≤ −2, indicating a non-normal error distribution.

$R^2 = 47.2\%$ of the variation of the Cisco stock may be explained by the variation of the S&P500 index; in other words, 47.2% of the risk in the Cisco stock is market-related risk — see CAPM below.


CAPM — a simple asset pricing model in finance:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,$$
where $y_i$ is a stock return and $x_i$ is a market return at time $i$.

Total risk of the stock:
$$\frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{n}\sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2.$$

Market-related (or systematic) risk:
$$\frac{1}{n}\sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \frac{1}{n}\,\hat\beta_1^2 \sum_{i=1}^n (x_i - \bar{x})^2.$$

Firm-specific risk:
$$\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2.$$

Remark. (i) $\beta_1$ measures the market-related (or systematic) risk of the stock.


(ii) Market-related risk is unavoidable, while firm-specific risk may be “diversified away” through hedging.

(iii) Variance is a simple and one of the most frequently used measures of risk in finance.
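The risk decomposition above can be verified numerically from a fitted CAPM regression. The sketch below assumes the S&P500 and Cisco return columns have been loaded into arrays x and y (how the .mtw worksheet is exported is not specified in the slides):

import numpy as np

def capm_risk_decomposition(x, y):
    """Split (1/n) * sum (y_i - ybar)^2 into market-related and firm-specific risk."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    fitted = b0 + b1 * x
    total = np.sum((y - y.mean()) ** 2) / n
    market = b1 ** 2 * np.sum((x - x.mean()) ** 2) / n   # = (1/n) sum (yhat_i - ybar)^2
    firm = np.sum((y - fitted) ** 2) / n
    return total, market, firm                           # total == market + firm

# Example (with x = S&P500 returns, y = Cisco returns):
# total, market, firm = capm_risk_decomposition(x, y)
# market / total equals R^2, so it should be about 0.472 for these data.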

To plot the data with the fitted regression line together with confidence bounds for E(y) and predictive bounds for y:

MTB > Fitline c3 c2;
SUBC> gfourpack;
SUBC> confidence 95;
SUBC> ci;
SUBC> pi.


[Figure: four-in-one residual plots for the CAPM regression]

Top-left panel: points below the line in the top-right corner, above the line in the bottom-left corner — the residual distribution has heavier tails than $N(0, \sigma^2)$.