Multicollinearity, Model Specification: Precision and Bias

Introduction · Multicollinearity and Micronumerosity · Model Specification
Walter Sosa-Escudero
Econ 507. Econometric Analysis. Spring 2009
February 9, 2009
Walter Sosa-Escudero Multicollinearity, Model Specification: Precision and Bias
The Classical Linear Model:

1. Linearity: $Y = X\beta + u$.
2. Strict exogeneity: $E(u|X) = 0$.
3. No multicollinearity: $\rho(X) = K$, w.p.1.
4. No heteroskedasticity / serial correlation: $V(u|X) = \sigma^2 I_n$.

Gauss/Markov: $\hat\beta = (X'X)^{-1}X'Y$ is best linear unbiased.

This does not mean that $\hat\beta$ is good. It is interesting to explore what things make it worse: less precise (higher variance) and more biased.
Multicollinearity, Micronumerosity and Imprecision

A crucial assumption is the no-multicollinearity assumption, $\rho(X) = K$, which guarantees that $(X'X)$ is invertible, so the OLS problem has a unique solution.

Any violation of this assumption, i.e. $\rho(X) < K$, will be referred to as exact multicollinearity, and it eliminates the possibility of finding unique OLS estimates.

High multicollinearity is a rather contradictory notion where $\rho(X) = K$, but the correlation among variables, while not exact, is high. In such a case, no classical assumptions are violated, so the Gauss/Markov result holds.
The following result suggests why practitioners worry about high multicollinearity.

Result:
$$V(\hat\beta_j) = \frac{\sigma^2}{(1 - R_j^2)\, S_{jj}}$$
where $R_j^2$ is the $R^2$ coefficient of regressing $X_j$ on all the other explanatory variables, and $S_{jj} = \sum_{i=1}^n (X_{ji} - \bar X_j)^2$.
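The result is easy to verify numerically. The sketch below (a minimal check on invented data, using only numpy) computes $V(\hat\beta_j)$ both directly as $\sigma^2[(X'X)^{-1}]_{jj}$ and through $\sigma^2/((1-R_j^2)S_{jj})$; the two routes give the same number.

```python
import numpy as np

# Minimal numerical check of V(beta_j) = sigma^2 / ((1 - R2_j) * S_jj).
# All data below are invented for illustration.
rng = np.random.default_rng(0)
n = 500
sigma2 = 4.0  # assumed (known) error variance

# A constant plus two correlated regressors.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x1, x2])

# Direct route: V(beta_1) = sigma^2 * [(X'X)^{-1}]_{11} (index 1 = slope on x1).
var_direct = sigma2 * np.linalg.inv(X.T @ X)[1, 1]

# Formula route: regress x1 on the remaining regressors (constant and x2)
# to get R2_1, and compute S_11 = sum_i (x1_i - xbar_1)^2.
Z = np.column_stack([np.ones(n), x2])
resid = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
S_11 = np.sum((x1 - x1.mean()) ** 2)
R2_1 = 1 - (resid @ resid) / S_11
var_formula = sigma2 / ((1 - R2_1) * S_11)

print(var_direct, var_formula)  # the two routes agree
```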
Proof: By the FWL theorem,
$$\hat\beta_j = \frac{\sum_{i=1}^n \tilde X_{ji} Y_i}{\sum_{i=1}^n \tilde X_{ji}^2}$$
and
$$V(\hat\beta_j) = \frac{\sigma^2}{\sum_{i=1}^n \tilde X_{ji}^2} = \frac{\sigma^2}{\frac{\sum_{i=1}^n \tilde X_{ji}^2}{S_{jj}}\, S_{jj}}$$
where $\tilde X_j \equiv M_j X_j$ and $M_j$ is the matrix that gets residuals of regressing $X_j$ on all the other explanatory variables in the model. The result follows by noting
$$R_j^2 = 1 - \frac{\sum_{i=1}^n \tilde X_{ji}^2}{S_{jj}} = 1 - \frac{\sum_{i=1}^n \tilde X_{ji}^2}{\sum_{i=1}^n (X_{ji} - \bar X_j)^2}$$
Factors affecting $V(\hat\beta_j)$

Go back to our result:
$$V(\hat\beta_j) = \frac{\sigma^2}{(1 - R_j^2)\, S_{jj}} = \frac{\sigma^2}{n} \cdot \frac{1}{(1 - R_j^2)(S_{jj}/n)}$$

Later on we will see that $S_{jj}/n$ should be a rather stable magnitude. So there are three main factors that contribute to the variance:

1. $\sigma^2$, the error variance.
2. $n$, the sample size.
3. $R_j^2$, the correlation between $X_j$ and all the other variables.

It is important to note that high multicollinearity affects the variance in the same manner as a small number of observations (micronumerosity).
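A quick back-of-the-envelope sketch of these three factors (pure Python, illustrative numbers only): doubling $\sigma^2$ doubles the variance, doubling $n$ halves it, and $R_j^2 = 0.5$ doubles it, exactly as if the sample size had been cut in half.

```python
# Illustrative sketch of V(beta_j) = sigma^2 / ((1 - R2_j) * S_jj),
# writing S_jj = n * s, with s = S_jj / n held fixed (the "stable magnitude").
def slope_variance(sigma2, n, R2_j, s=1.0):
    """Variance of beta_j implied by the formula."""
    return sigma2 / ((1 - R2_j) * n * s)

base = slope_variance(sigma2=1.0, n=100, R2_j=0.0)

# Doubling the error variance doubles V(beta_j)...
assert slope_variance(2.0, 100, 0.0) == 2 * base
# ...doubling the sample size halves it...
assert slope_variance(1.0, 200, 0.0) == base / 2
# ...and R2_j = 0.5 doubles it, exactly like halving the sample size.
assert slope_variance(1.0, 100, 0.5) == slope_variance(1.0, 50, 0.0)
```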
It is interesting to remark that under high multicollinearity there may be situations with really low $t$ significance statistics together with a high $R^2$ and a high global-significance $F$ statistic.

We have already seen that high multicollinearity induces high variance, and hence is compatible with low $t$'s.

$R^2$ is related to the distance between $Y$ and the span of $X$, which does not depend on the degree of correlation among its components.

Check carefully what the significance $t$'s mean and what the global-significance $F$ means.
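A simulated illustration of this point (invented data; numpy only): with two nearly identical regressors the overall fit is excellent, yet both slope standard errors are large, so the individual $t$'s can easily come out insignificant.

```python
import numpy as np

# Illustrative simulation (all numbers invented): two nearly collinear
# regressors give an excellent overall fit, yet each slope has a large
# standard error, so individual t statistics can easily be insignificant.
rng = np.random.default_rng(42)
n = 100
x = rng.normal(size=n)
z = x + rng.normal(scale=0.02, size=n)      # almost a copy of x
Y = x + z + rng.normal(scale=0.5, size=n)   # true slopes are both 1

X = np.column_stack([np.ones(n), x, z])
beta = np.linalg.lstsq(X, Y, rcond=None)[0]
resid = Y - X @ beta
s2 = resid @ resid / (n - 3)                        # error variance estimate
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))  # classical std. errors
t_stats = beta / se
R2 = 1 - resid @ resid / np.sum((Y - Y.mean()) ** 2)

print(R2)        # high: the overall fit is excellent
print(se[1:])    # but both slope standard errors are large
```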
Model a) High multicollinearity
cor(x,y)=0.998983
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.04171 0.04426 0.943 0.348
y 0.57840 0.83608 0.692 0.491
x 1.33508 0.83893 1.591 0.115
Residual standard error: 0.4415 on 97 degrees of freedom
Multiple R-squared: 0.9635, Adjusted R-squared: 0.9628
F-statistic: 1282 on 2 and 97 DF, p-value: < 2.2e-16
Model b) Low multicollinearity
cor(x,y1)= 0.4047114
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0009127 0.0465794 -0.02 0.984
y1 0.9773821 0.0220314 44.36
Specification errors, bias and imprecision

So far we have assumed that our linear model $Y = X\beta + u$ is correct.

Consider the following case:
$$Y = X_1\beta_1 + X_2\beta_2 + u$$
where all classical assumptions hold. $K_1$ and $K_2$ are the numbers of columns of $X_1$ and $X_2$. Trivially, our original model corresponds to $X = [X_1\; X_2]$, with $K = K_1 + K_2$.
Consider the following scenarios regarding $\beta_2$ and the corresponding estimation strategies:

Omission of relevant variables: $\beta_2 \neq 0$, but we wrongly proceed as if $\beta_2 = 0$, that is, we regress $Y$ on $X_1$ only.

Inclusion of irrelevant variables: $\beta_2 = 0$, but we wrongly proceed as if $\beta_2$ might be $\neq 0$, that is, we regress $Y$ on $X_1$ and $X_2$ when we could have ignored $X_2$.
Biases

Let us compare results for the estimation of $\beta_1$ in the two scenarios.

I) Omission of relevant variables

First note that in this case
$$Y = X_1\beta_1 + u^*$$
with $u^* = X_2\beta_2 + u$. Let $\tilde\beta_1 = (X_1'X_1)^{-1}X_1'Y$.

It is easy to see that $\tilde\beta_1$ will be biased unless $E(X_2|X_1) = 0$. This is a really important result: not all omissions lead to biases.
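A small Monte Carlo sketch of this result (all numbers invented): here $X_2$ is relevant but generated independently of $X_1$, so $E(X_2|X_1) = 0$ and the short regression remains centered on the true $\beta_1$.

```python
import numpy as np

# Monte Carlo check: omitting a relevant but independent regressor
# does not bias the estimate of beta_1 (here beta_1 = 1, beta_2 = 3).
rng = np.random.default_rng(1)
n, reps = 200, 2000
estimates = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)              # independent of x1: E(x2|x1) = 0
    y = 1.0 * x1 + 3.0 * x2 + rng.normal(size=n)
    estimates[r] = (x1 @ y) / (x1 @ x1)  # short regression: y on x1 only

print(estimates.mean())  # centered on the true beta_1 = 1 despite the omission
```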
II) Inclusion of Irrelevant Variables

In this case we would estimate $\beta_1$ jointly with $\beta_2$ by regressing $Y$ on $X_1$ and $X_2$; that is, $\hat\beta_1$ is a subvector of
$$\hat\beta = \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} = (X'X)^{-1}X'Y$$

It is important to see that the classical assumptions hold for this model, and hence $\hat\beta_1$ will be unbiased. Why?
Variances

Let us compute the bias of $\tilde\beta_1$ explicitly:
$$\tilde\beta_1 = (X_1'X_1)^{-1}X_1'Y = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + u)$$
$$E(\tilde\beta_1|X_1) = \beta_1 + \underbrace{(X_1'X_1)^{-1}X_1'E(X_2|X_1)\beta_2}_{\text{bias}}$$

From here, it is easy to check that
$$V(\tilde\beta_1|X) = \sigma^2(X_1'X_1)^{-1}$$

Using the FWL theorem,
$$V(\hat\beta_1|X) = \sigma^2(X_1'M_2X_1)^{-1}$$
with $M_2 = I - X_2(X_2'X_2)^{-1}X_2'$.
Now:
$$V(\hat\beta_1|X) - V(\tilde\beta_1|X) = \sigma^2\left[(X_1'M_2X_1)^{-1} - (X_1'X_1)^{-1}\right]$$

Aside: if $A - B$ is psd, then $B^{-1} - A^{-1}$ is psd (Greene (2000, p. 49)).

Note: $X_1'X_1 - X_1'M_2X_1 = X_1'(I - M_2)X_1 = X_1'P_2X_1$. Since $P_2$ is symmetric and idempotent, for every $c$, $c'X_1'P_2X_1c = (P_2X_1c)'(P_2X_1c) \geq 0$, so $X_1'P_2X_1$ is psd. By the aside, $V(\hat\beta_1|X) - V(\tilde\beta_1|X)$ is psd: including $X_2$ cannot reduce the variance of the estimator of $\beta_1$.
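The variance ranking can be confirmed numerically; the sketch below (invented data, taking $\sigma^2 = 1$) checks that $(X_1'M_2X_1)^{-1} - (X_1'X_1)^{-1}$ has no negative eigenvalues.

```python
import numpy as np

# Numerical check of the psd claim, with sigma^2 = 1 and invented data.
rng = np.random.default_rng(7)
n = 100
X1 = rng.normal(size=(n, 3))
X2 = 0.5 * X1[:, :2] + rng.normal(size=(n, 2))  # correlated with X1

# M2 = I - P2, the residual-maker of a regression on X2.
P2 = X2 @ np.linalg.inv(X2.T @ X2) @ X2.T
M2 = np.eye(n) - P2

diff = np.linalg.inv(X1.T @ M2 @ X1) - np.linalg.inv(X1.T @ X1)
eigs = np.linalg.eigvalsh(diff)  # eigenvalues of the symmetric difference
print(eigs.min())  # should be >= 0 up to rounding error
```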
Bias-variance trade-off

To summarize:

In practice we do not know which model holds (the large one or the small one).

The trade-off: estimating a small model (omitting variables) implies a gain in precision and a likely bias. A large model is less likely to be biased but will be less efficient.

Variable omission does not necessarily lead to biases.
Omitted Variable Bias: an example

Computer-generated data, based on Appleton, French and Vanderpump ("Ignoring a Covariate: an Example of Simpson's Paradox", The American Statistician, 50, 4, 1996).

$Y$ = risk of death.
SMOKE = consumption of cigarettes.
. reg y smoke
Source | SS df MS Number of obs = 100
-------------+------------------------------ F( 1, 98) = 194.34
Model | 7613.25147 1 7613.25147 Prob > F = 0.0000
Residual | 3839.18734 98 39.1753811 R-squared = 0.6648
-------------+------------------------------ Adj R-squared = 0.6614
Total | 11452.4388 99 115.6812 Root MSE = 6.259
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoke | -1.819348 .1305081 -13.94 0.000 -2.078337 -1.560359
_cons | 158.5975 4.774249 33.22 0.000 149.1231 168.0718
------------------------------------------------------------------------------
Walter Sosa-Escudero Multicollinearity, Model Specification: Precision and Bias
IntroductionMulticollinearity and Micronumerosity
Model Specification
. reg y smoke age
Source | SS df MS Number of obs = 100
-------------+------------------------------ F( 2, 97) = 5424.58
Model | 11350.9524 2 5675.47622 Prob > F = 0.0000
Residual | 101.486373 97 1.04625126 R-squared = 0.9911
-------------+------------------------------ Adj R-squared = 0.9910
Total | 11452.4388 99 115.6812 Root MSE = 1.0229
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoke | .9431267 .050902 18.53 0.000 .8421004 1.044153
age | .9804631 .0164039 59.77 0.000 .9479059 1.01302
_cons | 12.84084 2.560392 5.02 0.000 7.759169 17.92251
------------------------------------------------------------------------------
. cor y smoke age
(obs=100)
| y smoke age
-------------+---------------------------
y | 1.0000
smoke | -0.8153 1.0000
age | 0.9797 -0.9080 1.0000
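The sign flip in the two Stata regressions above can be reproduced with a simulation in the same spirit (all data-generating numbers below are invented, not those behind the Stata output): because age is negatively correlated with smoke and raises the risk of death, omitting it drives the short-regression slope on smoke negative even though the true partial effect is positive.

```python
import numpy as np

# Illustrative recreation of the sign flip: age is an omitted confounder,
# negatively correlated with smoke, so the short regression of y on smoke
# alone picks up a negative slope even though the true effect is +1.
rng = np.random.default_rng(3)
n = 100
age = rng.uniform(20, 80, size=n)
smoke = 40 - 0.5 * age + rng.normal(scale=3, size=n)  # heavy smokers are young
y = 10 + 1.0 * smoke + 1.0 * age + rng.normal(size=n)

def ols(cols, y):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

short_slope = ols([smoke], y)[1]        # omits age: absorbs the confounding
long_slope = ols([smoke, age], y)[1]    # includes age

print(short_slope, long_slope)  # short slope negative, long slope near +1
```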