
30C00200

Econometrics

2) Linear regression model and the OLS estimator

Timo Kuosmanen, Professor, Ph.D.

http://nomepre.net/index.php/timokuosmanen

Today’s topics

• Multiple linear regression – what changes?
  – Model and its interpretation
  – Deriving the OLS estimator

• Standard error

• Multicollinearity

• Goodness of fit: R2 and Adj.R2

• Fitting nonlinear functions using linear regression



Hedonic model of housing market

Dependent variable (y)
• Market price of apartment (€)

Explanatory variables (x) and their expected signs
• Size (m2): positive
• Number of bedrooms (#): negative
• Age (years): negative

Model

Regression equation with K parameters:

yi = β1 + β2x2i + β3x3i + … + βKxKi + εi

• β1 is the intercept term (constant)
• βk is the slope coefficient of variable xk (marginal effect)
• εi is the disturbance term for observation i

Interpretation

Interpretation of the slope β2 for the variable "size" (m2):

Single regression model
• Value of an additional m2, on average, for apartments located in Tapiola

Multiple regression model
• Value of an additional m2, on average, for apartments with the same number of bedrooms and age in Tapiola

Interpretation

Why is the intercept β1?

• Consider a model without a constant term:

  yi = β1x1i + β2x2i + β3x3i + … + βKxKi + εi

  where x1 = (1, 1, 1, …, 1)
• This model is identical to

  yi = β1 + β2x2i + β3x3i + … + βKxKi + εi

• We can think of the intercept term as the slope coefficient of a regressor that is constant (1) for all observations.

Excel output of the hedonic model – single regression with m2

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.808717
R Square           0.654023
Adjusted R Square  0.648701
Standard Error     110242
Observations       67

ANOVA
            df   SS        MS        F        Significance F
Regression   1   1.49E+12  1.49E+12  122.874  1.26E-16
Residual    65   7.9E+11   1.22E+10
Total       66   2.28E+12

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  -31430.8      33447.06        -0.93972  0.350841  -98229.2   35367.56
size m2    5460.683      492.6256        11.08486  1.26E-16  4476.842   6444.524

Excel output of the hedonic model – multiple regression

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.905971
R Square           0.820784
Adjusted R Square  0.81225
Standard Error     80593.26
Observations       67

ANOVA
            df   SS        MS        F         Significance F
Regression   3   1.87E+12  6.25E+11  96.17703  1.76E-23
Residual    63   4.09E+11  6.5E+09
Total       66   2.28E+12

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept     141366.7      36401.58        3.883532  0.000249  68623.96   214109.5
size m2       6972.404      857.5745        8.130378  2.11E-11  5258.678   8686.13
nr. bedrooms  -65182.6      20401.6         -3.19498  0.002186  -105952    -24413.3
age           -2820.1       495.0578        -5.6965   3.46E-07  -3809.39   -1830.8

OLS estimator

OLS problem with K parameters:

$$\min \text{RSS} = \sum_{i=1}^{n} e_i^2 \quad \text{s.t.} \quad y_i = b_1 + b_2 x_{2i} + b_3 x_{3i} + \dots + b_K x_{Ki} + e_i$$

Or equivalently

$$\min \text{RSS} = \sum_{i=1}^{n} \left( y_i - b_1 - b_2 x_{2i} - b_3 x_{3i} - \dots - b_K x_{Ki} \right)^2$$

OLS estimator

First-order conditions:

$$\frac{\partial \text{RSS}}{\partial b_1} = -2 \sum_{i=1}^{n} \left( y_i - b_1 - b_2 x_{2i} - \dots - b_K x_{Ki} \right) = 0$$

$$\frac{\partial \text{RSS}}{\partial b_2} = -2 \sum_{i=1}^{n} x_{2i} \left( y_i - b_1 - b_2 x_{2i} - \dots - b_K x_{Ki} \right) = 0$$

$$\vdots$$

$$\frac{\partial \text{RSS}}{\partial b_K} = -2 \sum_{i=1}^{n} x_{Ki} \left( y_i - b_1 - b_2 x_{2i} - \dots - b_K x_{Ki} \right) = 0$$

A system of K linear equations with K unknowns – best solved by using matrix algebra.

OLS estimator

Closed-form solution in matrix form:

$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$

Note: to study econometrics at a more advanced level, you have to master matrix algebra.
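As an illustration (not part of the original slides), a minimal Python sketch of the closed-form solution on simulated data; all variable names and numbers below are made up:

```python
import numpy as np

# Hypothetical data: n observations, K = 3 parameters (intercept + 2 regressors)
rng = np.random.default_rng(0)
n = 100
x2 = rng.uniform(20, 120, n)           # e.g. size in m2
x3 = rng.uniform(1, 50, n)             # e.g. age in years
y = 5 + 2.0 * x2 - 0.5 * x3 + rng.normal(0, 10, n)   # true betas: 5, 2.0, -0.5

# First column of ones plays the role of the constant regressor x1
X = np.column_stack([np.ones(n), x2, x3])

# Closed-form OLS: b = (X'X)^{-1} X'y  (solve() is numerically safer than inv())
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)   # estimates close to [5, 2.0, -0.5]
```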

OLS estimator

Closed-form solutions: the intercept term

$$b_1 = \bar{y} - b_2 \bar{x}_2 - b_3 \bar{x}_3 - \dots - b_K \bar{x}_K$$

OLS estimator

Closed-form solutions: special case with two regressors x2 and x3

Slope b2:

$$b_2 = \frac{\text{Est.Cov}(x_2, y)\,\text{Est.Var}(x_3) - \text{Est.Cov}(x_3, y)\,\text{Est.Cov}(x_2, x_3)}{\text{Est.Var}(x_2)\,\text{Est.Var}(x_3) - \left[\text{Est.Cov}(x_2, x_3)\right]^2}$$

OLS estimator

Important: explanatory variable x3 influences the slope of regressor x2 through the sample covariances.

Note: if the regressor x2 does not correlate with the other regressor x3, that is, the sample covariance is

$$\text{Est.Cov}(x_2, x_3) = 0,$$

then the slope b2 estimated from the multiple regression model,

$$b_2 = \frac{\text{Est.Cov}(x_2, y)}{\text{Est.Var}(x_2)},$$

is exactly the same as that of the single regression of y on x2, leaving the effects of x3 to the disturbance term.
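A quick numerical check of the two-regressor slope formula (a sketch with simulated data; the names and numbers are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x2 = rng.normal(size=n)
x3 = 0.6 * x2 + rng.normal(size=n)            # deliberately correlated regressors
y = 1 + 2 * x2 - 3 * x3 + rng.normal(size=n)

# Sample (estimated) variances and covariances
c2y, c3y = np.cov(x2, y)[0, 1], np.cov(x3, y)[0, 1]
v2, v3 = np.var(x2, ddof=1), np.var(x3, ddof=1)
c23 = np.cov(x2, x3)[0, 1]

# Slope b2 from the covariance formula on the slide
b2_formula = (c2y * v3 - c3y * c23) / (v2 * v3 - c23 ** 2)

# Compare with the slope from the full multiple regression
X = np.column_stack([np.ones(n), x2, x3])
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b2_formula, b[1])   # the two numbers agree
```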

OLS estimator as a random variable

OLS estimates are calculated from observed data using the formula:

$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$

Important: the OLS estimator b is itself a random variable. (Why?)

That is, the elements of vector b have a probability distribution with expected value E(b) and variance Var(b).

The estimated standard deviation of b is called the standard error.
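One way to see why b is random: redraw the disturbances many times and watch the estimates vary (an illustrative simulation, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])

# X is held fixed; a fresh disturbance draw each replication changes y, hence b
slopes = []
for _ in range(1000):
    y = 1 + 2 * x + rng.normal(0, 3, n)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    slopes.append(b[1])

# Mean near the true slope 2; the spread is the sampling variation of b
print(np.mean(slopes), np.std(slopes))
```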

Excel output of the hedonic model – multiple regression: the same output as shown above; the Standard Error column reports the standard error of each coefficient estimate.

Standard error

An unbiased estimator of σ2 = Var(ε) is

$$s_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - K}$$

The standard error of parameter estimate bk is

$$\text{s.e.}(b_k) = s_e \sqrt{\left[ (\mathbf{X}'\mathbf{X})^{-1} \right]_{kk}}$$

The general expression requires matrix algebra.
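These two formulas in Python, continuing the simulated example (a sketch; names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 80, 3
x2, x3 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 2 * x2 - 3 * x3 + rng.normal(0, 2, n)
X = np.column_stack([np.ones(n), x2, x3])

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b                               # residuals
s2_e = (e @ e) / (n - K)                    # unbiased estimator of sigma^2
XtX_inv = np.linalg.inv(X.T @ X)
se_b = np.sqrt(s2_e * np.diag(XtX_inv))     # s.e.(b_k) = s_e * sqrt[(X'X)^{-1}]_kk
print(se_b)                                 # one standard error per coefficient
```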

Standard error

Special case of two regressors (x2, x3):

$$\text{s.e.}(b_2) = \sqrt{\frac{s_e^2}{(n-1)\,\text{Est.Var}(x_2)\,\left(1 - r_{x_2 x_3}^2\right)}}$$

where r_{x2,x3} is the sample correlation coefficient.

Decomposed as a product of 4 components:

$$\text{s.e.}(b_2) = s_e \cdot \frac{1}{\sqrt{n-1}} \cdot \frac{1}{\sqrt{\text{Est.Var}(x_2)}} \cdot \frac{1}{\sqrt{1 - r_{x_2 x_3}^2}}$$

Standard error

$$\text{s.e.}(b_2) = s_e \cdot \frac{1}{\sqrt{n-1}} \cdot \frac{1}{\sqrt{\text{Est.Var}(x_2)}} \cdot \frac{1}{\sqrt{1 - r_{x_2 x_3}^2}}$$

Four potential ways to improve precision:
1) Decrease se (improve empirical fit; add explanatory variables?)
2) Increase sample size n
3) Increase sample variance of x2
4) Decrease correlation of x2 and x3
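To illustrate the fourth factor, an indicative calculation (not from the slides) of how correlation alone inflates s.e.(b2) through 1/√(1 − r²):

```python
import numpy as np

# Multiplier applied to s.e.(b2) as the correlation of x2 and x3 grows
for r in [0.0, 0.5, 0.8, 0.9, 0.95, 0.99]:
    print(f"r = {r:.2f}  ->  s.e. multiplier = {1 / np.sqrt(1 - r**2):.2f}")
```

At r = 0.9 the standard error is already more than doubled relative to uncorrelated regressors, and it explodes as r approaches 1.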

Multicollinearity

• High correlation between the explanatory variables x can cause loss of precision
• In practice, there is always some correlation
  – Example: the sample correlation between the variables "size (m2)" and "nr. of bedrooms" is 0.887
• Multiple regression analysis takes the correlation explicitly into account – correlation as such is not a problem
  – Whether it becomes one depends on circumstances, particularly the sample size n, the variance of the disturbance ε, and the sample variances of the regressors x
• When the high correlation is considered a problem, it is referred to as multicollinearity

Multicollinearity

Typical symptoms of multicollinearity:
• Model has a high R2 and is jointly significant in the F-test
• Slope coefficients are large (or small) but still insignificant
• Slope coefficients have high standard errors
• Slope coefficients have unexpected signs
• Slope coefficients are much smaller or larger than expected

Note: if two variables are perfectly correlated, the OLS estimator is ill-defined and cannot be computed. This is the extreme form of multicollinearity.
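The extreme case is easy to demonstrate (a minimal sketch, assuming numpy; the data are made up): when one regressor is an exact linear function of another, X'X loses rank and its inverse does not exist.

```python
import numpy as np

n = 30
x2 = np.linspace(1, 10, n)
x3 = 2 * x2 + 5                    # exact linear function of x2: perfect multicollinearity
X = np.column_stack([np.ones(n), x2, x3])

# X'X has rank 2 instead of 3, so (X'X)^{-1} does not exist
print(np.linalg.matrix_rank(X.T @ X))   # -> 2
print(np.linalg.cond(X.T @ X))          # astronomically large condition number
```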

Multicollinearity

Indirect methods to influence the other factors:
1) Improve the empirical fit to decrease se
  – Include additional explanatory variables
2) Increase the sample size n
3) Increase the sample variance of the problem variables

Direct treatments:
• Exclude one of the problem variables
  – Problem: omitted variable bias
• Impose a theoretical restriction

Goodness of fit

ANOVA and the coefficient of determination R2:

$$\text{TSS} = \text{ESS} + \text{RSS}$$

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

$$R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}$$
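The decomposition is easy to verify numerically (a sketch on simulated data; names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(size=n)
y = 1 + 2 * x + rng.normal(0, 0.5, n)
X = np.column_stack([np.ones(n), x])

b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b

TSS = np.sum((y - y.mean()) ** 2)       # total sum of squares
ESS = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
RSS = np.sum((y - y_hat) ** 2)          # residual sum of squares

print(np.isclose(TSS, ESS + RSS))       # decomposition holds (model includes an intercept)
print(ESS / TSS, 1 - RSS / TSS)         # two equal ways to compute R^2
```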

Excel output of the hedonic model – multiple regression

ANOVA (sums of squares shown in full)
            df   SS                 MS        F         Significance F
Regression   3   1 874 088 309 956  6.25E+11  96.17703  1.76E-23
Residual    63   409 202 213 242    6.5E+09
Total       66   2 283 290 523 198

(The Regression Statistics and the coefficient table are unchanged from above. Note that ESS/TSS = 1 874 088 309 956 / 2 283 290 523 198 ≈ 0.8208, which is the reported R Square.)

Goodness of fit

Question: What happens to R2 if we include an additional explanatory variable xK+1 in the model?

Answer:
• TSS remains the same
• RSS tends to decrease and ESS to increase; thus R2 is likely to increase
• ESS and R2 can never decrease
  – In the worst case, bK+1 = 0, and there is no effect

$$R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}$$

Adjusted R2

• A simple (but rather arbitrary) degrees-of-freedom correction to the usual R2:

$$\text{Adj.}R^2 = R^2 - \frac{K-1}{n-K}\left(1 - R^2\right) = 1 - \frac{n-1}{n-K}\left(1 - R^2\right)$$

• Adding a new variable xK+1 will increase Adj. R2 if and only if |tK+1| > 1
• Dougherty concludes: "There is little to be gained by fine-tuning [R2] with a 'correction' of dubious value."
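A quick check of the correction against the Excel output above (n = 67 observations, K = 4 parameters, R Square = 0.820784):

```python
# Adjusted R^2 for the hedonic multiple regression reported above
n, K = 67, 4
R2 = 0.820784
adj_R2 = R2 - (K - 1) / (n - K) * (1 - R2)
print(adj_R2)   # about 0.81225, matching the reported Adjusted R Square
```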

Linearity of the model

Regression equation:

yi = β1 + β2x2i + β3x3i + … + βKxKi + εi

• β1 is the intercept term (constant)
• βk is the slope coefficient of variable xk (marginal effect)
• εi is the disturbance term for observation i

Important: the model must be linear in the parameters βk and the disturbance εi. It does not need to be linear in the variables x.

Examples

• Polynomial functional form:

$$y_i = \beta_1 + \beta_2 x_{2i} + \beta_3 x_{2i}^2 + \dots + \beta_K x_{2i}^{K-1} + \varepsilon_i$$

• Log-linear (Cobb-Douglas, 1928) functional form:

$$\ln y_i = \beta_1 + \beta_2 \ln x_{2i} + \beta_3 \ln x_{3i} + \dots + \beta_K \ln x_{Ki} + \varepsilon_i$$
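A sketch of the log-linear case: transform the data, then run ordinary OLS on the transformed model (simulated Cobb-Douglas example; all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x2 = rng.uniform(1, 10, n)
x3 = rng.uniform(1, 10, n)
# Cobb-Douglas with multiplicative error: y = exp(b1) * x2^b2 * x3^b3 * exp(eps)
y = np.exp(0.5) * x2 ** 0.7 * x3 ** 0.2 * np.exp(rng.normal(0, 0.1, n))

# Taking logs makes the model linear in the parameters:
# ln y = b1 + b2 ln x2 + b3 ln x3 + eps
X = np.column_stack([np.ones(n), np.log(x2), np.log(x3)])
b = np.linalg.solve(X.T @ X, X.T @ np.log(y))
print(b)   # estimates close to [0.5, 0.7, 0.2]
```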

Exercise

Exam question (Fall 2014):

We are interested in estimating parameters β1, β2, and β3. Which of the following 10 models can be stated as a multiple linear regression (MLR) model, possibly after some variable transformations or other mathematical operations, such that parameters β1, β2, and β3 can be estimated by OLS? If some adjustments are required, briefly state the required operations and the resulting MLR equation that can be estimated by OLS. If the parameters cannot be estimated by OLS, briefly point out the problem.

[The ten model specifications a)–j) are not recoverable from this transcription.]

Notation: y and x denote the dependent and the independent variables, the coefficients β are the parameters to be estimated, and ε are the random disturbance terms, respectively.

Next time – Mon 14 Sept

Topic:

• Statistical properties of the OLS estimator
