
Outline: Simple linear regression; Multiple linear regression

Statistics, part 5

[Figure: scatter plot of ANN.RISK against AV.CRED with a fitted straight line]

We take the least squares line.

What is the story behind this line?

The simple linear probabilistic model y = β0 + β1 x + ε

y = dependent variable
x = independent variable
E(y) = β0 + β1 x is the deterministic component
ε = random error
β0 = y intercept of the line
β1 = slope of the line

The linear model

Line of means: y = β0 + β1 x

Model: yi = β0 + β1 xi + εi

Random error: εi = yi − (β0 + β1 xi)

[Figure: the line of means y = β0 + β1 x with intercept β0; at x = xi the observed value yi deviates from the mean value β0 + β1 xi by the random error εi]

Suppose we can describe the relation between x and y by a straight line. Then we can make statistical inferences about x and y, e.g. for some specific x we can predict (with some confidence) the outcome of y.

What steps are needed to construct a probabilistic straight line (= linear) model, and what do we assume?

Executing a simple linear regression (to investigate the relation between x and y):
1. Assume the straight line model can be used
2. Estimate β0 and β1 with the least squares line using a sample
3. Test or make a confidence interval to check the validity of the coefficients
4. Check whether the complete model is useful
5. Use the model for predictions

Example: Let y be supply of some good and let x be the price of this good.

1. Assume the straight line model can be used

We assume that the straight line model can be used to describe the relation between y and x: y = β0 + β1 x + ε

Hence the deterministic component, which is the mean of y, is linear (a straight line).

2. Estimate β0 and β1 with the least squares line using a sample

Example: Suppose we have the following data: (x, y) = (0.5, 2), (1, 1), (1.5, 3)

[Figure: scatter plot of the three data points]

Find the straight line that fits these points best: ŷ = β̂0 + β̂1 x

Problem: if ŷ = β̂0 + β̂1 x is the least squares line, how do we find the coefficients β̂0 and β̂1?

We have explicit formulas:

β̂1 = Cov(x, y) / s_x²
β̂0 = ȳ − β̂1 x̄

Hence ŷ = 1 + x
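The explicit formulas can be checked directly. As an illustration (not part of the original slides, which use Excel), a minimal Python/numpy sketch for the three data points:

import numpy as np

# The three sample points from the example: (0.5, 2), (1, 1), (1.5, 3)
x = np.array([0.5, 1.0, 1.5])
y = np.array([2.0, 1.0, 3.0])

# Explicit least squares formulas:
# beta1_hat = Cov(x, y) / s_x^2,  beta0_hat = y_bar - beta1_hat * x_bar
beta1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()

print(beta0_hat, beta1_hat)   # 1.0 1.0, i.e. the least squares line is y_hat = 1 + x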

Using Excel (analyse/regression/linear ):

              Coefficients   Standard Error   t Stat   P-value   Lower 95%     Upper 95%
Intercept     1              1.870828693      0.5345   0.6875    -22.7710306   24.7710306
X Variable 1  1              1.732050808      0.5774   0.6667    -21.0076979   23.0076979

(The Intercept coefficient is the estimate β̂0; the X Variable 1 coefficient is the estimate β̂1.)

Note that we can conclude that there is no relationship between x and y if β1 = 0: a change of x then does not affect y.
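The same table can be reproduced with any regression routine. A sketch using Python's statsmodels (an assumption of this note; the slides themselves use Excel):

import numpy as np
import statsmodels.api as sm

x = np.array([0.5, 1.0, 1.5])
y = np.array([2.0, 1.0, 3.0])

# Add a constant column so the model is y = beta0 + beta1 * x + eps
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

print(model.params)                 # [1.0, 1.0]         (Coefficients)
print(model.bse)                    # [1.8708, 1.7321]   (Standard Error)
print(model.tvalues)                # [0.5345, 0.5774]   (t Stat)
print(model.pvalues)                # [0.6875, 0.6667]   (P-value)
print(model.conf_int(alpha=0.05))   # 95% lower and upper bounds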

3. test or make confidence interval to check validity of coefficients

a. Using a test to check the validity of the slope

Hypotheses: H0: β1 = 0 vs. Ha: β1 ≠ 0

Test with a level of significance of α = 0.05

Hence β1 may be 0, so based on this sample we cannot conclude that there is a relation between x and y.

Excel yields (analyse/regression/linear)

Since the p-value (0.6667) is larger than the significance level α = 0.05, the sample does not contain enough evidence to reject H0.

              Coefficients   Standard Error   t Stat   P-value   Lower 95%     Upper 95%
Intercept     1              1.870828693      0.5345   0.6875    -22.7710306   24.7710306
X Variable 1  1              1.732050808      0.5774   0.6667    -21.0076979   23.0076979

b. Using a confidence interval to check the validity of the slope

A 100(1 − α)% confidence interval for β1 is:

( β̂1 − t[n−2; α/2] · s[β̂1] ,  β̂1 + t[n−2; α/2] · s[β̂1] )

where s[β̂1] is the standard error of β̂1.

Forget the formula, but observe that its structure is similar to the confidence intervals for a mean and for a population proportion.
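For the three-point example the interval can be computed directly from this formula. A short sketch with scipy (the values SSE = 1.5 and SSxx = 0.5 follow from the sample data):

import math
from scipy import stats

n = 3
beta1_hat = 1.0
SSE = 1.5                        # residual sum of squares of the least squares line
SSxx = 0.5                       # sum of (x_i - x_bar)^2
s = math.sqrt(SSE / (n - 2))     # estimate of the error standard deviation
se_beta1 = s / math.sqrt(SSxx)   # standard error of beta1_hat (= 1.7321)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # t_{n-2; alpha/2} = 12.706

print(beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
# approximately -21.0077 and 23.0077, matching the Excel output below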

Excel yields

The conclusion is that 0 is contained in the 95% confidence interval. Hence β1 can take the value 0, which implies that there may be no relation between x and y.

              Coefficients   Standard Error   t Stat   P-value   Lower 95%     Upper 95%
Intercept     1              1.870828693      0.5345   0.6875    -22.7710306   24.7710306
X Variable 1  1              1.732050808      0.5774   0.6667    -21.0076979   23.0076979

So the 95% confidence interval for β1 is (-21.0077, 23.0077).

4. Check whether complete model is useful

Checks the strength of the linear relationship: the coefficient of determination

R² = (SSyy − SSE) / SSyy = (explained variation in y) / (total variation in y)

where SSyy = ∑ (yi − ȳ)² and SSE = ∑ (yi − ŷi)²

4. Check whether complete model is useful

Checks the strength of the linear relationship: the coefficient of determination

R² = (SSyy − SSE) / SSyy = (explained variation in y) / (total variation in y)

Properties:
a. 0 ≤ R² ≤ 1
b. 100(R²)% of the sample variation in y can be explained by using x to predict y in a linear model

Explanation of property b:

If (SSyy − SSE) / SSyy = 0, then SSyy = SSE, so x contributes no information about y, since the observed points lie at the same distance from the line y = ȳ as from the least squares line.

If (SSyy − SSE) / SSyy = 1, then SSE = 0, hence all observed points lie on the least squares line.
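For the three-point example both sums of squares are easy to compute. A short Python check (an illustration, not part of the slides):

import numpy as np

x = np.array([0.5, 1.0, 1.5])
y = np.array([2.0, 1.0, 3.0])
y_hat = 1.0 + 1.0 * x                 # fitted values of the least squares line y_hat = 1 + x

SSyy = np.sum((y - y.mean()) ** 2)    # total variation in y    (= 2.0)
SSE = np.sum((y - y_hat) ** 2)        # unexplained variation   (= 1.5)

print((SSyy - SSE) / SSyy)            # R^2 = 0.25: 25% of the sample variation in y is explained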

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.5
R Square            0.25
Adjusted R Square   -0.5
Standard Error      1.224745
Observations        3

ANOVA
             df   SS    MS    F          Significance F
Regression   1    0.5   0.5   0.333333   0.666667
Residual     1    1.5   1.5
Total        2    2

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept     1              1.870829         0.534522   0.687494   -22.771     24.77103    -22.771       24.77103
X Variable 1  1              1.732051         0.57735    0.666667   -21.0077    23.0077     -21.0077      23.0077

Hence, only 25% of the sample variation in y can be explained by using x in the linear model.

Excel yields:

Simple linear regression: a complete example
y = price of house, x = area in square feet

Sample (x = area, y = price):

   x       y          x       y
2306    145541     1753    129900
2677    179900     3206    235000
2324    149000     2474    129900
1447    113900     2933    199500
3333    189000     3987    319000
3004    184500     2598    185500
4142    339717     4934    375000
2923    228000     2253    169000
2902    209000     2998    185900
1847    133000     2791    189800
2148    168000     2865    192000
2819    205000     4417    379900

Step 1. We assume a linear relation between x and y. Hence, we use the model y = β0 + β1 x + ε.

Step 2. Determine the least squares line:

ŷ = -39001.1 + 84.987 x

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     -39001.1       18237.94         -2.13846   0.043834   -76824.4    -1177.92
X Variable 1  84.98698       6.095676         13.94217   2.12E-12   72.3453     97.62865

Step 3. Is there a linear relation between x and y?

a. Use a test (with a significance level of 5%).

Hypotheses: H0: β1 = 0 vs. Ha: β1 ≠ 0

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     -39001.1       18237.94         -2.13846   0.043834   -76824.4    -1177.92
X Variable 1  84.98698       6.095676         13.94217   2.12E-12   72.3453     97.62865

The p-value (2.12E-12) is smaller than 0.05, so based on this sample we can reject H0.

b. Use the 95% confidence interval of β1.

The 95% CI is (72.345, 97.629). Because 0 is not in the confidence interval, we can conclude that there is a linear relation between x and y.

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     -39001.1       18237.94         -2.13846   0.043834   -76824.4    -1177.92
X Variable 1  84.98698       6.095676         13.94217   2.12E-12   72.3453     97.62865

Step 4. Check whether the complete model is useful

Hence, 89.8% of the sample variation in y can be explained by the linear model.

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.947802

R Square 0.898329

Adjusted R Square 0.893708

Standard Error 24383.43

Observations 24

Step 5. Use model to make estimations or predictions

Question: Predict the selling price of a home with an area of 3000 square feet. Use a 95% confidence interval (prediction interval).

C.I. for “estimation” and “prediction”

[Figure: the fitted line with 95% limits for “estimation” (mean of y) and 95% limits for “prediction” (individual y), plotted against x]

Property: the prediction interval is narrowest at x = x̄.

Using SPSS gives:

   x      95% PI lower    95% PI upper
 3000     164326.0100     267593.5720

Interpretation of the PI (prediction interval): we are 95% confident that the selling price of a house with an area of 3000 square feet will be between $164,326 and $267,593.
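The slides use SPSS for this prediction interval. As a cross-check (my addition, not part of the original workflow), a statsmodels sketch with the 24 (area, price) pairs as read from the sample table above; get_prediction returns both the interval for the mean price and the prediction interval for an individual house, which should agree with the SPSS numbers up to rounding.

import numpy as np
import statsmodels.api as sm

# The 24 observations (area in square feet, price) from the sample table above
area = np.array([2306, 1753, 2677, 3206, 2324, 2474, 1447, 2933, 3333, 3987, 3004, 2598,
                 4142, 4934, 2923, 2253, 2902, 2998, 1847, 2791, 2148, 2865, 2819, 4417])
price = np.array([145541, 129900, 179900, 235000, 149000, 129900, 113900, 199500,
                  189000, 319000, 184500, 185500, 339717, 375000, 228000, 169000,
                  209000, 185900, 133000, 189800, 168000, 192000, 205000, 379900])

fit = sm.OLS(price, sm.add_constant(area)).fit()
print(fit.params)   # should be approximately [-39001.1, 84.987]

# 95% intervals for a house with an area of 3000 square feet
pred = fit.get_prediction(np.array([[1.0, 3000.0]]))
print(pred.summary_frame(alpha=0.05))
# mean_ci_lower/upper: CI for the mean price; obs_ci_lower/upper: prediction interval
# (the PI should be roughly 164000 to 268000, as in the SPSS output above)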

Application of simple linear regression: 1. Capital Asset Pricing Model (revisited)

Excess share = β · Excess market + α

[Figure: scatter plot of the excess company return against the excess market return with the fitted line y = 0.7553x + 0.0026]

(capm.xlsx)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.846514
R Square            0.716586
Adjusted R Square   0.692968
Standard Error      0.016815
Observations        14

ANOVA
             df   SS         MS         F          Significance F
Regression   1    0.008579   0.008579   30.34084   0.000134465
Residual     12   0.003393   0.000283
Total        13   0.011971

              Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%   Lower 90.0%    Upper 90.0%
Intercept     0.002637       0.004621         0.570526   0.578848   -0.007432405   0.0127056   -0.005599936   0.0108731
X Variable 1  0.755344       0.13713          5.508252   0.000134   0.456564681    1.0541242   0.510940035    0.9997488

excess market   excess company
 0.08            0.09
 0.06            0.03
 0.02            0.02
 0.02           -0.02
-0.03           -0.02
 0.01            0.01
-0.01           -0.02
-0.03           -0.02
 0.00            0.00
-0.03           -0.01
-0.01            0.00
 0.04            0.03
 0.01            0.03
-0.02            0.00
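A sketch of the same CAPM regression in Python (statsmodels is an assumption of this note; the slides use Excel on capm.xlsx):

import numpy as np
import statsmodels.api as sm

# The 14 pairs of excess returns from the table above
excess_market = np.array([0.08, 0.06, 0.02, 0.02, -0.03, 0.01, -0.01,
                          -0.03, 0.00, -0.03, -0.01, 0.04, 0.01, -0.02])
excess_company = np.array([0.09, 0.03, 0.02, -0.02, -0.02, 0.01, -0.02,
                           -0.02, 0.00, -0.01, 0.00, 0.03, 0.03, 0.00])

# CAPM regression: excess company return = alpha + beta * excess market return + error
capm = sm.OLS(excess_company, sm.add_constant(excess_market)).fit()

print(capm.params)     # approximately [0.0026, 0.7553], i.e. alpha = 0.0026 and beta = 0.7553
print(capm.rsquared)   # approximately 0.72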


Application of simple linear regression: 2. The money market rate (time series)

Rt+1 = µ + ρ (Rt - µ) + εt+1

Rt+1 = c + ρ Rt + εt+1 with c = µ (1- ρ) (longtermrate.xlsx)

period   interest rate
1        7.229
2        7.725
3        7.671
4        8.037
5        7.516
6        6.996
7        6.719
8        7.056
9        7.243
(first rows of the series shown)

y = R_t+1   x = R_t
7.725       7.229
7.671       7.725
8.037       7.671
7.516       8.037
6.996       7.516
6.719       6.996
7.056       6.719
7.243       7.056
7.109       7.243

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.905166
R Square            0.819326
Adjusted R Square   0.816817
Standard Error      0.501022
Observations        74

ANOVA
             df   SS         MS         F          Significance F
Regression   1    81.9613    81.9613    326.5086   1.83E-28
Residual     72   18.07368   0.251023
Total        73   100.035

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 90.0%   Upper 90.0%
Intercept     0.592209       0.338067         1.751747   0.084075   -0.08172    1.266134    0.028889      1.155528
X Variable 1  0.907522       0.050224         18.06955   1.83E-28   0.807403    1.007641    0.823834      0.99121

ρ = 0.9075, c = 0.5922

Substitution in c = µ(1 − ρ) gives 0.5922 = µ(1 − 0.9075). Hence µ = 6.4038.

Rt+1 = c + ρ Rt + εt+1
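The slides estimate this with Excel on longtermrate.xlsx. A sketch of the same lagged regression in Python; only the first rates from the table above are included here, so the printed numbers will differ from the full-sample results (c ≈ 0.5922, ρ ≈ 0.9075, µ ≈ 6.40).

import numpy as np
import statsmodels.api as sm

# Interest-rate series R_1, R_2, ... (only the first values from the table above)
R = np.array([7.229, 7.725, 7.671, 8.037, 7.516, 6.996, 6.719, 7.056, 7.243, 7.109])

# Regress R_{t+1} on R_t:  R_{t+1} = c + rho * R_t + eps_{t+1}
y = R[1:]     # R_{t+1}
x = R[:-1]    # R_t
fit = sm.OLS(y, sm.add_constant(x)).fit()
c_hat, rho_hat = fit.params

# Long-run mean from c = mu * (1 - rho)
mu_hat = c_hat / (1.0 - rho_hat)
print(c_hat, rho_hat, mu_hat)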

Application of simple linear regression: 3. Difference in income (dummy variable)

Difference in income between men and women. Let y denote the income, and introduce a dummy variable x, with x = 1 for a man and x = 0 for a woman. Regression model: y = β0 + β1 x

Observe that the interpretation of the slope β1 changes: β1 = µ0 − µ1, with µ0 the mean income of men and µ1 the mean income of women. Test whether the mean income of men is higher than the mean income of women:

H0: µ0 − µ1 = 0
Ha: µ0 − µ1 > 0

(incomedifference2012.xlsx)

Year   Average income single man   Average income single woman
       (x 1000 euro)               (x 1000 euro)
2000   15.4                        13.9
2001   17.2                        15.1
2002   17.6                        15.8
2003   17.4                        16
2004   17.6                        16.2
2005   17.8                        16.5
2006   18.9                        17.1
2007   19.7                        17.7
2008   20.2                        18
2009   20.1                        18.2
2010   20                          18.1

y (income)   x (man = 1 / woman = 0)
15.4         1
17.2         1
17.6         1
17.4         1
17.6         1
17.8         1
18.9         1
19.7         1
20.2         1
20.1         1
20           1
13.9         0
15.1         0
15.8         0
16           0
16.2         0
16.5         0
17.1         0
17.7         0
18           0
18.2         0
18.1         0

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.53318
R Square            0.284281
Adjusted R Square   0.248495
Standard Error      1.459919
Observations        22

ANOVA
             df   SS         MS         F          Significance F
Regression   1    16.93136   16.93136   7.943911   0.010614
Residual     20   42.62727   2.131364
Total        21   59.55864

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept     16.6           0.440182         37.71166   4.67E-20   15.6818     17.5182     15.6818       17.5182
X Variable 1  1.754545       0.622512         2.818495   0.010614   0.456009    3.053082    0.456009      3.053082

The p-value of the slope is below 0.05, so we reject H0 and conclude µ0 − µ1 > 0, i.e. µ0 > µ1 (significance level of 0.05).

Conclusion: mean income single men > mean income single women
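The same dummy-variable regression can be reproduced in Python (a sketch; the slides use Excel on incomedifference2012.xlsx):

import numpy as np
import statsmodels.api as sm

# Average incomes (x 1000 euro), 2000-2010, from the table above
income_men = np.array([15.4, 17.2, 17.6, 17.4, 17.6, 17.8, 18.9, 19.7, 20.2, 20.1, 20.0])
income_women = np.array([13.9, 15.1, 15.8, 16.0, 16.2, 16.5, 17.1, 17.7, 18.0, 18.2, 18.1])

y = np.concatenate([income_men, income_women])
x = np.concatenate([np.ones(11), np.zeros(11)])   # dummy: x = 1 for men, x = 0 for women

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)    # intercept = 16.6 (mean income of women), slope = 1.7545 (difference in means)
print(fit.pvalues)   # two-sided p-value of the slope = 0.0106; halve it for the one-sided test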

Executing a multiple linear regression (to investigate the relation between x1, x2, ..., xk and y):
1. Assume the multiple regression model can be used
2. Estimate β0, β1, ..., βk with the least squares method using a sample
3. Test or make confidence intervals to check the contribution of the independent variables
4. Use other relationship tests
5. If the model is OK, use it for predictions

Multiple regression: a complete example

y = sales revenue, x1 = number of employees, x2 = R&D expenses

   y       x1      x2
10012     50.24   1072
  326      1.44     20
13376     64.71   1354
13767     49.14   1199
  662      3.61     26
  857      2.84    503
 1259      7.89     64
18842     82.3    1634
 6763     26.8    4239
16681     45.2    5269
 7094     35      3383
10021     43.8    3472
 5142     28      1621
 5104     20.1    2098
 7039     37      2006

Step 1. We assume a linear relation between x1, x2 and y. Hence, we use the model y = β0 + β1 x1 + β2 x2 + ε.

Step 2. Determine the least squares plane.

              Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept     -754.711       895.5027         -0.84278   0.415835   -2705.843455   1196.422214
X Variable 1  217.493        21.96107         9.903567   3.98E-07   169.6438844    265.3420219
X Variable 2  0.713124       0.327711         2.176075   0.050246   -0.000897191   1.427144982

ŷ = -754.711 + 217.493 x1 + 0.713 x2
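A sketch of the same fit with Python's statsmodels, assuming the 15 observations were read off the table above correctly (the slides use Excel):

import numpy as np
import statsmodels.api as sm

# The 15 observations from the table above
y = np.array([10012, 326, 13376, 13767, 662, 857, 1259, 18842,
              6763, 16681, 7094, 10021, 5142, 5104, 7039])              # sales revenue
x1 = np.array([50.24, 1.44, 64.71, 49.14, 3.61, 2.84, 7.89, 82.3,
               26.8, 45.2, 35.0, 43.8, 28.0, 20.1, 37.0])               # number of employees
x2 = np.array([1072, 20, 1354, 1199, 26, 503, 64, 1634,
               4239, 5269, 3383, 3472, 1621, 2098, 2006])               # R&D expenses

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.params)                 # approximately [-754.7, 217.5, 0.713]
print(fit.rsquared)               # approximately 0.919
print(fit.fvalue, fit.f_pvalue)   # overall F test, approximately 68.2 and 2.8e-07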

Step 3. Is there a linear relation between x1, x2 and y?

• F test (complete model)
• t test (on coefficients)
• coefficient of determination R²

Use an F-test (with a significance level of 5%).

Hypotheses: H0: β1 = β2 = 0 vs. Ha: at least one of the parameters is not zero

3a. Is the complete model valid?

Because the p-value (Significance F = 2.79E-07) is smaller than 0.05, we reject H0.

ANOVA
             df   SS         MS         F          Significance F
Regression   2    4.52E+08   2.26E+08   68.24643   2.78519E-07
Residual     12   39727837   3310653
Total        14   4.92E+08

Step 3a showed that the model is useful. Now perform t-tests on the individual coefficients of the variables.

Hence we reject H0, so x1 contributes to the explanation of y. However, we cannot reject G0, so it is not clear whether x2 contributes to the explanation of y.

Test at a level of significance of 0.05

H0: β1 = 0 vs. Ha: β1 ≠ 0
G0: β2 = 0 vs. Ga: β2 ≠ 0

              Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept     -754.711       895.5027         -0.84278   0.415835   -2705.843455   1196.422214
X Variable 1  217.493        21.96107         9.903567   3.98E-07   169.6438844    265.3420219
X Variable 2  0.713124       0.327711         2.176075   0.050246   -0.000897191   1.427144982

How much of the variability of y is explained by the model?

R² = 0.9192 (adjusted R² = 0.9057). Hence, 91.92% of the sample variation in y can be explained by the linear model.

Regression Statistics

Multiple R 0.958743

R Square 0.919188

Adjusted R Square 0.905719

Standard Error 1819.52

Observations 15

Step 4. Use the model to estimate and predict

Question: What are the revenues for a company that has 50 employees (x1) and R&D expenses of 1000 million dollars (x2)?

We use SPSS

     y     x1     x2      lci         uci         lpi         upi
 10012     50   1072    9398.11    12475.08     6684.15    15189.05
   326      1     20   -2334.43     1479.91    -4826.54     3972.029
 13376     65   1354   12322.22    16247.43     9861.22    18708.42
 13767     49   1199    9332.70    12243.15     6564.88    15010.97
   662      4     26   -1801.68     1899.64    -4326.10     4424.06
   857      3    503   -1532.50     1975.84    -4113.48     4556.82
  1259      8     64    -735.34     2749.24    -3323.40     5337.30
 18842     82   1634   15688.44    20931.96    13557.30    23063.10
  6763     27   4239    6000.59    10193.46     3612.45    12581.61
 16681     45   5269   10328.59    15338.24     8144.01    17522.83
  7094     35   3383    7799.05    10741.02     5041.54    13498.54
 10021     44   3472    9764.18    12730.70     7014.66    15480.23
  5142     28   1621    5438.21     7543.91     2389.24    10592.88
  5104     20   2098    3870.05     6356.00      958.34     9267.71
  7039     37   2006    7684.96     9761.15     4624.99    12821.11
     .     50   1000    9272.79    12393.32     6572.67    15093.44

(lci/uci = 95% interval for the mean of y, lpi/upi = 95% prediction interval; the last row is the new company with x1 = 50 and x2 = 1000)
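As a cross-check of the SPSS intervals (my addition, not part of the original workflow), the same prediction can be made with statsmodels; the data are repeated from the Step 2 sketch so that this example is self-contained.

import numpy as np
import statsmodels.api as sm

# Same 15 observations as in the Step 2 sketch
y = np.array([10012, 326, 13376, 13767, 662, 857, 1259, 18842,
              6763, 16681, 7094, 10021, 5142, 5104, 7039])
x1 = np.array([50.24, 1.44, 64.71, 49.14, 3.61, 2.84, 7.89, 82.3,
               26.8, 45.2, 35.0, 43.8, 28.0, 20.1, 37.0])
x2 = np.array([1072, 20, 1354, 1199, 26, 503, 64, 1634,
               4239, 5269, 3383, 3472, 1621, 2098, 2006])

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# New company: x1 = 50 employees, x2 = 1000 (million dollar) R&D expenses
pred = fit.get_prediction(np.array([[1.0, 50.0, 1000.0]]))
print(pred.summary_frame(alpha=0.05))
# mean_ci_lower/upper correspond to lci/uci (about 9273 to 12393),
# obs_ci_lower/upper to the 95% prediction interval lpi/upi (about 6573 to 15093)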