32
Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

Embed Size (px)

Citation preview

Page 1: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

Multiple Linear Regression

Andy WangCIS 5930-03

Computer SystemsPerformance Analysis

Page 2: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

2

Multiple Linear Regression

• Models with more than one predictor variable

• But each predictor variable has a linear relationship to the response variable

• Conceptually, plotting a regression line in n-dimensional space, instead of 2-dimensional

Page 3: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

3

Basic Multiple Linear Regression Formula

• Response y is a function of k predictor variables x1,x2, . . . , xk

exbxbxbby kk 22110

Page 4: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

4

A Multiple Linear Regression Model

Given sample of n observations

model consists of n equations (note + vs. - typo in book):

nknnnk yxxxyxxx ,,,,,,,,,, 21112111

nknknnn

kk

kk

exbxbxbby

exbxbxbby

exbxbxbby

22110

2222212102

1121211101

Page 5: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

5

Looks Like It’s Matrix Arithmetic Time

y = Xb +e

nkknnn

k

k

n e

e

e

b

b

b

xxx

xxx

xxx

y

y

y

.

.

.

.

.

.

1

.....

.....

.....

1

1

.

.

.2

1

1

0

22

22212

12111

2

1

Page 6: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

6

Analysis ofMultiple Linear

Regression• Listed in box 15.1 of Jain• Not terribly important (for our purposes)

how they were derived– This isn’t a class on statistics

• But you need to know how to use them• Mostly matrix analogs to simple linear

regression results

Page 7: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

7

Example ofMultiple Linear

Regression• IMDB keeps numerical popularity

ratings of movies• Postulate popularity of Academy Award

winning films is based on two factors:– Year made– Running time

• Produce a regressionrating = b0 + b1(year) +b2(length)

Page 8: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

8

Some Sample Data

Title Year LengthRating

Silence of the Lambs 1991 118 8.1Terms of Endearment 1983 132 6.8Rocky 1976 119 7.0Oliver! 1968 153 7.4Marty 1955 91 7.7Gentleman’s Agreement 1947 118 7.5Mutiny on the Bounty 1935 132 7.6It Happened One Night 1934 105 8.0

Page 9: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

9

Now for Some Tedious Matrix Arithmetic

• We need to calculate X, XT, XTX, (XTX)-1, and XTy• Because• We will see that

b = (18.5430, -0.0051, -0.0086 )T

• Meaning the regression predicts:rating = 18.5430 – 0.0051*year

– 0.0086*length

yXXXb T1T

Page 10: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

10

X Matrix for Example

10519341

13219351

11819471

9119551

15319681

11919761

13219831

11819911

X

Page 11: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

11

Transpose to Get XT

10513211891153119132118

19341935194719551968197619831991

11111111TX

Page 12: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

12

Multiply To Get XTX

1195721899083968

18990833077138515689

968156898

XXT

Page 13: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

13

Invert to Get C=(XTX)-1

0004.00001.01328.0

0001.00003.062400

1328.06240.07585.1207

.1TXXC

Page 14: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

14

Multiply to Get XTy

57247

7.117840

160

.

.

yXT

Page 15: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

15

Multiply (XTX)-1(XTy)to Get b

00860

0051.0

5430.18

.

b

Page 16: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

16

How Good Is ThisRegression Model?

• How accurately does the model predict the rating of a film based on its age and running time?

• Best way to determine this analytically is to calculate the errors

or yXbyy TTTSSE

2ieSSE

Page 17: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

17

Calculating the ErrorsEstimated

Rating Year Length Rating ei ei^28.1 1991 118 7.4 -0.71 0.516.8 1983 132 7.3 0.51 0.267.0 1976 119 7.5 0.45 0.217.4 1968 153 7.2 -0.20 0.047.7 1955 91 7.8 0.10 0.017.5 1947 118 7.6 0.11 0.017.6 1935 132 7.6 -0.05 0.008.0 1934 105 7.8 -0.21 0.04

Page 18: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

18

Calculating the Errors, Continued

• So SSE = 1.08• SSY =• SS0 = • SST = SSY - SS0 = 452.9- 451.5 = 1.4• SSR = SST - SSE = .33

• In other words, this regression stinks

914522 .y i

54512 .yn

23.41.1

33.2 SST

SSRR

Page 19: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

19

Why Does It Stink?

• Let’s look at properties of the regression parameters

• Now calculate standard deviations of the regression parameters

46.5

08.1

3

n

SSEse

Page 20: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

20

Calculating STDEVof Regression Parameters

• Estimations only, since we’re working with a sample

• Estimated stdev of

16.1676.120746.000 csb e

0084.0003.46.111 csb e

0097.0004.46.222 csb e

Page 21: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

21

Calculating Confidence Intervals of STDEVs

• We will use 90% level• Confidence intervals for

• None is significant, at this level

10.51,02.1416.16015.254.180 b

012,.022.0084.015.2005.1 b

011,.028.0097.015.2009.2 b

Page 22: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

22

Analysis of Variance

• So, can we really say that none of the predictor variables are significant?– Not yet; predictors may be correlated

• F-tests can be used for this purpose– E.g., to determine if the SSR is significantly

higher than the SSE– Equivalent to testing that y does not

depend on any of the predictor variables• Alternatively, that no bi is significantly nonzero

Page 23: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

23

Running an F-Test

• Need to calculate SSR and SSE• From those, calculate mean squares of

regression (MSR) and errors (MSE)• MSR/MSE has an F distribution• If MSR/MSE > F-table, predictors explain

a significant fraction of response variation• Note typos in book’s table 15.3

– SSR has k degrees of freedom– SST matches not

yy yy ˆ

Page 24: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

24

F-Test for Our Example

• SSR = .33• SSE = 1.08• MSR = SSR/k = .33/2 = .16• MSE = SSE/(n-k-1) = 1.08/(8 - 2 - 1) = .22• F-computed = MSR/MSE = .76• F[90; 2,5] = 3.78 (at 90%)• So it fails the F-test at 90% (miserably)

Page 25: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

25

Multicollinearity

• If two predictor variables are linearly dependent, they are collinear– Meaning they are related– And thus second variable does not improve

the regression– In fact, it can make it worse

X1

1 0

X2 1 data No data

0 No data data

Page 26: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

26

Multicollinearity

• Typical symptom is inconsistent results from various significance tests

Page 27: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

27

Finding Multicollinearity

• Must test correlation between predictor variables

• If it’s high, eliminate one and repeat regression without it

• If significance of regression improves, it’s probably due to collinearity between the variables

Page 28: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

28

Is Multicollinearity a Problem in Our

Example?• Probably not, since significance tests are

consistent• But let’s check, anyway• Calculate correlation of age and length• After tedious calculation, 0.25

– Not especially correlated• Important point - adding a predictor

variable does not always improve a regression– See example on p. 253 of book

Page 29: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

29

Why Didn’t RegressionWork Well Here?

• Check scatter plots– Rating vs. age– Rating vs. length

• Regardless of how good or bad regressions look, always check the scatter plots

Page 30: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

30

Rating vs. Length

6.0

6.5

7.0

7.5

8.0

8.5

9.0

80 100 120 140 160Length

Rating

Page 31: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

31

Rating vs. Age

6.0

6.5

7.0

7.5

8.0

8.5

9.0

0 20 40 60 80Age

Rating

Page 32: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis

White Slide