30C00200
Econometrics
2) Linear regression model
and the OLS estimator
Timo Kuosmanen, Professor, Ph.D.
http://nomepre.net/index.php/timokuosmanen
Today’s topics
• Multiple linear regression – what changes?
– Model and its interpretation
– Deriving the OLS estimator
• Standard error
• Multicollinearity
• Goodness of fit: R2 and Adj.R2
• Fitting nonlinear functions using linear regression
Dependent variable (y)
• Market price of apartment (€)
Explanatory variables (x) Expected signs
• Size (m2) positive
• Number of bedrooms (#) negative
• Age (years) negative
Hedonic model of housing market
Regression equation with K parameters:
yi = β1 + β2x2i + β3x3i + … + βKxKi + εi
• β1 is the intercept term (constant)
• βk is the slope coefficient of variable xk (marginal effect)
• εi is the disturbance term for observation i
Model
Interpretation of slope β2 for the variable "size" (m2)
Single regression model
• Value of an additional m2, on average, in apartments
located in Tapiola
Multiple regression model
• Value of an additional m2, on average, in apartments with
the same number of bedrooms and age in Tapiola
Interpretation
Why is the intercept β1?
• Consider a model without a constant term:
yi = β1x1i + β2x2i + β3x3i + … + βKxKi + εi
• Where x1 = (1, 1, 1, …, 1)
• This model is identical to
yi = β1 + β2x2i + β3x3i + … + βKxKi + εi
• We can think of the intercept term as the slope coefficient of a
regressor that is constant (1) for all observations.
Interpretation
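The intercept-as-constant-regressor idea can be checked numerically. Below is a minimal sketch in Python with NumPy, using made-up exact data (the numbers are illustrative, not from the Tapiola sample):

```python
import numpy as np

# Made-up exact data: y = 1000 + 50 * x2 (no noise, for illustration)
x2 = np.array([30.0, 45.0, 60.0, 75.0, 90.0])
y = 1000.0 + 50.0 * x2

# Treat the intercept as the slope of a constant regressor x1 = (1, ..., 1)
X = np.column_stack([np.ones_like(x2), x2])

# Least-squares fit: b[0] is the intercept, b[1] is the slope of x2
b, *_ = np.linalg.lstsq(X, y, rcond=None)
# b is approximately (1000, 50): the "slope" of the constant column
# is exactly the intercept of the usual model
```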
Excel output of the hedonic model
- Single regression with m2

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.808717
R Square            0.654023
Adjusted R Square   0.648701
Standard Error      110242
Observations        67

ANOVA
             df    SS        MS        F        Significance F
Regression    1    1.49E+12  1.49E+12  122.874  1.26E-16
Residual     65    7.9E+11   1.22E+10
Total        66    2.28E+12

             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept    -31430.8      33447.06        -0.93972  0.350841  -98229.2   35367.56
size m2      5460.683      492.6256        11.08486  1.26E-16  4476.842   6444.524
Excel output of the hedonic model
- Multiple regression

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.905971
R Square            0.820784
Adjusted R Square   0.81225
Standard Error      80593.26
Observations        67

ANOVA
             df    SS        MS        F         Significance F
Regression    3    1.87E+12  6.25E+11  96.17703  1.76E-23
Residual     63    4.09E+11  6.5E+09
Total        66    2.28E+12

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept     141366.7      36401.58        3.883532  0.000249  68623.96   214109.5
size m2       6972.404      857.5745        8.130378  2.11E-11  5258.678   8686.13
nr. bedrooms  -65182.6      20401.6         -3.19498  0.002186  -105952    -24413.3
age           -2820.1       495.0578        -5.6965   3.46E-07  -3809.39   -1830.8
OLS problem with K parameters:

min RSS = Σi ei²
s.t. yi = b1 + b2x2i + b3x3i + … + bKxKi + ei

Or equivalently:

min RSS = Σi (yi − b1 − b2x2i − b3x3i − … − bKxKi)²

OLS estimator
First-order conditions:

∂RSS/∂b1 = −2 Σi (yi − b1 − b2x2i − … − bKxKi) = 0
∂RSS/∂b2 = −2 Σi x2i (yi − b1 − b2x2i − … − bKxKi) = 0
⋮
∂RSS/∂bK = −2 Σi xKi (yi − b1 − b2x2i − … − bKxKi) = 0

System of K linear equations with K unknowns
- best solved by using matrix algebra

OLS estimator
Closed form solution in matrix form:

b = (X′X)⁻¹X′y

Note: to study econometrics at a more advanced level, you have
to master matrix algebra.

OLS estimator
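As a sketch of how the closed form is used in practice, the following Python/NumPy snippet computes b = (X′X)⁻¹X′y on simulated data (the data and coefficients are invented for illustration, not the housing sample):

```python
import numpy as np

# Simulate a sample that follows y = 2 + 1.5*x2 - 0.5*x3 + eps
rng = np.random.default_rng(42)
n = 200
x2 = rng.normal(70.0, 20.0, n)
x3 = rng.normal(3.0, 1.0, n)
y = 2.0 + 1.5 * x2 - 0.5 * x3 + rng.normal(0.0, 1.0, n)

# Data matrix with the constant regressor in the first column
X = np.column_stack([np.ones(n), x2, x3])

# Closed-form OLS, b = (X'X)^(-1) X'y; solving the normal equations
# X'X b = X'y is numerically safer than inverting X'X explicitly
b = np.linalg.solve(X.T @ X, X.T @ y)
```

The estimated slopes land close to the true values (1.5 and −0.5) used in the simulation.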
Closed form solutions:
Special case with two regressors x2 and x3

Slope b2:

b2 = [Est.Cov(x2, y) · Est.Var(x3) − Est.Cov(x3, y) · Est.Cov(x2, x3)] / [Est.Var(x2) · Est.Var(x3) − Est.Cov(x2, x3)²]

OLS estimator
Important: explanatory variable x3 influences the slope of
regressor x2 through the sample covariances.

Note: if the regressor x2 does not correlate with the other
regressor x3, that is, the sample covariance is Est.Cov(x2, x3) = 0,
then the slope b2 estimated from the multiple regression model,

b2 = Est.Cov(x2, y) / Est.Var(x2),

is exactly the same as that of the single regression of y on x2,
leaving the effects of x3 to the disturbance term.

OLS estimator
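The role of the sample covariances can be verified numerically. A sketch with simulated data (all numbers invented), comparing the two-regressor closed form with the single-regression slope:

```python
import numpy as np

# Simulated data where x3 is deliberately correlated with x2
rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x3 = 0.6 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x2 - 1.0 * x3 + rng.normal(size=n)

# Sample variances and covariances (n-1 denominators)
v2 = np.var(x2, ddof=1)
v3 = np.var(x3, ddof=1)
c23 = np.cov(x2, x3)[0, 1]
c2y = np.cov(x2, y)[0, 1]
c3y = np.cov(x3, y)[0, 1]

# Closed-form slope of x2 in the two-regressor model
b2 = (c2y * v3 - c3y * c23) / (v2 * v3 - c23 ** 2)

# Single-regression slope of y on x2; differs because Est.Cov(x2, x3) != 0
b2_single = c2y / v2
```

Here b2 matches the multiple-regression OLS slope, while b2_single is visibly different because x3 correlates with x2.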
OLS estimates are calculated from observed data using the
formula:

b = (X′X)⁻¹X′y

Important: OLS estimator b is itself a random variable. (Why?)
That is, elements of vector b have a probability distribution with
the expected value E(b) and variance Var(b).
The estimated standard deviation of b is called the standard error.

OLS estimator as a random variable
Unbiased estimator of σ² = Var(ε) is

se² = Σi ei² / (n − K)

Standard error of parameter estimate bk is

s.e.(bk) = se · √[(X′X)⁻¹]kk

The general expression requires matrix algebra.

Standard error
Special case of two regressors (x2, x3):

s.e.(b2)² = se² / [(n − 1) · Est.Var(x2) · (1 − r²x2,x3)]

where rx2,x3 is the sample correlation coefficient.

Decomposed as product of 4 components:

s.e.(b2) = se · √(1/(n − 1)) · √(1/Est.Var(x2)) · √(1/(1 − r²x2,x3))

Standard error
Four potential ways to improve precision:
1) Decrease se (improve empirical fit; add explanatory variables?)
2) Increase sample size n
3) Increase sample variance of x2
4) Decrease correlation of x2 and x3
s.e.(b2) = se · √(1/(n − 1)) · √(1/Est.Var(x2)) · √(1/(1 − r²x2,x3))

Standard error
• High correlation between explanatory variables x can cause
loss of precision
• In practice, there is always some correlation
– Example: sample correlation between variables "size (m2)" and "nr. of
bedrooms" is 0.887
• Multiple regression analysis takes the correlation explicitly
into account – correlation as such is not a problem
– Whether it is a problem depends on the circumstances, particularly the
sample size n, the variance of the disturbance ε, and the sample variances
of the regressors x
• When high correlation is considered a problem, it is
referred to as multicollinearity
Multicollinearity
Typical symptoms of multicollinearity:
• Model has a high R2 and is jointly significant in the F-test
• Slope coefficients are large (or small) but still insignificant
• Slope coefficients have high standard errors
• Slope coefficients have unexpected signs
• Slope coefficients are much smaller or larger than expected

Note: If two variables are perfectly correlated, the OLS estimator
is ill-defined and cannot be computed. This is the extreme type
of multicollinearity.
Multicollinearity
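The extreme case can be demonstrated directly: with perfectly correlated regressors, X′X is singular, so its inverse and hence the OLS estimator do not exist. A small sketch with invented data:

```python
import numpy as np

# x3 is an exact linear function of x2: extreme multicollinearity
rng = np.random.default_rng(5)
n = 50
x2 = rng.normal(size=n)
x3 = 2.0 * x2
X = np.column_stack([np.ones(n), x2, x3])

# X'X has rank 2 < 3, so (X'X)^(-1) -- and hence b = (X'X)^(-1) X'y --
# is not defined
rank = np.linalg.matrix_rank(X.T @ X)
```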
Indirect methods to influence the other factors:
1) Improve the empirical fit to decrease se
– Include additional explanatory variables
2) Increase sample size n
3) Increase sample variance of the problem variables

Direct treatments:
• Exclude one of the problem variables
– problem: omitted variable bias
• Impose a theoretical restriction
Multicollinearity
ANOVA and the coefficient of determination R2:

TSS = ESS + RSS

Σi (yi − ȳ)² = Σi (ŷi − ȳ)² + Σi (yi − ŷi)²

R2 = ESS/TSS = 1 − RSS/TSS

Goodness of fit
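The decomposition holds exactly for any OLS fit that includes an intercept, which a short simulation (invented data) confirms:

```python
import numpy as np

# Simulated sample; any OLS fit with an intercept satisfies TSS = ESS + RSS
rng = np.random.default_rng(3)
n = 80
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x2 - 1.5 * x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x2, x3])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

TSS = np.sum((y - y.mean()) ** 2)       # total sum of squares
ESS = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
RSS = np.sum((y - y_hat) ** 2)          # residual sum of squares

R2 = ESS / TSS
```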
Excel output of the hedonic model
- Multiple regression

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.905971
R Square            0.820784
Adjusted R Square   0.81225
Standard Error      80593.26
Observations        67

ANOVA
             df    SS                 MS        F         Significance F
Regression    3    1,874,088,309,956  6.25E+11  96.17703  1.76E-23
Residual     63    409,202,213,242    6.5E+09
Total        66    2,283,290,523,198

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept     141366.7      36401.58        3.883532  0.000249  68623.96   214109.5
size m2       6972.404      857.5745        8.130378  2.11E-11  5258.678   8686.13
nr. bedrooms  -65182.6      20401.6         -3.19498  0.002186  -105952    -24413.3
age           -2820.1       495.0578        -5.6965   3.46E-07  -3809.39   -1830.8
Question: What happens to R2 if we include an additional
explanatory variable xK+1 in the model?

Answer:
• TSS remains the same
• RSS tends to decrease / ESS to increase. Thus, R2 is likely to
increase
• ESS and R2 can never decrease
– In the worst case, bK+1 = 0, and there is no effect.

R2 = ESS/TSS = 1 − RSS/TSS

Goodness of fit
• Simple (but rather arbitrary) degrees of freedom correction to
the usual R2
• Adding a new variable xK+1 will increase Adj. R2 if and only if
|tK+1| ≥ 1
• Dougherty concludes: "There is little to be gained by
fine-tuning [R2] with a 'correction' of dubious value."

Adj. R2 = 1 − [(n − 1)/(n − K)] · (1 − R2)

Adjusted R2
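The correction can be verified against the multiple-regression Excel output above (n = 67 observations, K = 4 estimated parameters, R² ≈ 0.820784):

```python
# Values from the multiple-regression Excel output of the hedonic model
n = 67          # observations
K = 4           # estimated parameters: intercept + 3 slopes
R2 = 0.820784

# Degrees-of-freedom correction
adj_R2 = 1 - (n - 1) / (n - K) * (1 - R2)
print(round(adj_R2, 5))  # → 0.81225, matching the Excel output
```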
Regression equation:
yi = β1 + β2x2i + β3x3i + … + βKxKi + εi
• β1 is the intercept term (constant)
• βk is the slope coefficient of variable xk (marginal effect)
• εi is the disturbance term for observation i
Important: the model must be linear in the parameters βk and the disturbance εi.
It does not need to be linear in the variables x.
Linearity of the model
• Polynomial functional form:
yi = β1 + β2x2i + β3x2i² + … + βKx2i^(K−1) + εi
• Log-linear (Cobb-Douglas, 1928) functional form:
ln yi = β1 + β2 ln x2i + β3 ln x3i + … + βK ln xKi + εi
Examples
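The log-linear case can be sketched in Python/NumPy with simulated Cobb-Douglas data (all parameter values invented for illustration): taking logs turns the multiplicative model into one that is linear in its parameters, so plain OLS recovers the elasticities.

```python
import numpy as np

# Simulated Cobb-Douglas data: y = 3 * x2^0.7 * x3^0.2 * exp(eps)
rng = np.random.default_rng(4)
n = 300
x2 = rng.uniform(1.0, 10.0, n)
x3 = rng.uniform(1.0, 10.0, n)
y = 3.0 * x2 ** 0.7 * x3 ** 0.2 * np.exp(rng.normal(0.0, 0.05, n))

# Taking logs gives a model that is linear in the parameters:
# ln y = ln(b1) + b2*ln(x2) + b3*ln(x3) + eps
X = np.column_stack([np.ones(n), np.log(x2), np.log(x3)])
b, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

b1 = np.exp(b[0])   # back-transform the intercept
b2, b3 = b[1], b[2]
```

The estimates land close to the true values 3, 0.7, and 0.2 used in the simulation.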
Exam question (Fall 2014):

Exercise

We are interested in estimating parameters β1, β2, and β3. Which of the following 10 models can be stated
as a multiple linear regression (MLR) model, possibly after some variable transformations or other
mathematical operations, such that the parameters β1, β2, and β3 can be estimated by OLS? If some
adjustments are required, briefly state the required operations and the resulting MLR equation that can be
estimated by OLS. If the parameters cannot be estimated by OLS, briefly point out the problem.
a) yi = β1 + β2xi + β3xi² + εi
b) yi = β1 · x1i^β2 · x2i^β3 + εi
c) yi = β1 · x1i^β2 · x2i^β3 · exp(εi)
d) yi = β1 + β2 ln xi + β3xi^0.4 + εi
e) yi = β1 / (β2x1i + β3x2i + εi)
f) yi = β1 + β2x1i + β3x2i + εi
g) ln yi = β1 + β2x1i + β3x2i + εi
h) yi = β1 · x1i^β2 · x2i^β3 · exp(εi), εi ≥ 0
i) 2/yi = β1 + β2x1i + β3x2i + εi
j) β1 + β2x1i + β3x2i + εi = ln yi³
Notation: y and x denote the dependent and the independent variables, coefficients β are the parameters
to be estimated, and ε are the random disturbance terms, respectively.