Chapter Specific Appendices


Appendix 2A

Minimising the Sum of Squared Residuals

We show that the OLS estimates $\hat\beta_0$ and $\hat\beta_1$ do minimise the sum of squared residuals, as asserted in Section 2.2. Formally, the problem is to characterise the solutions $\hat\beta_0$ and $\hat\beta_1$ to the minimisation problem

$$\min_{b_0, b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2,$$

where $b_0$ and $b_1$ are the dummy arguments for the optimisation problem; for simplicity, call this function $Q(b_0, b_1)$. By a fundamental result from multivariable calculus (see the online Appendix A), a necessary condition for $\hat\beta_0$ and $\hat\beta_1$ to solve the minimisation problem is that the partial derivatives of $Q(b_0, b_1)$ with respect to $b_0$ and $b_1$ must be zero when evaluated at $\hat\beta_0, \hat\beta_1$: $\partial Q(\hat\beta_0, \hat\beta_1)/\partial b_0 = 0$ and $\partial Q(\hat\beta_0, \hat\beta_1)/\partial b_1 = 0$. Using the chain rule from calculus, these two equations become

$$-2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$$

$$-2 \sum_{i=1}^{n} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0.$$

These two equations are just (2.14) and (2.15) multiplied by $-2n$ and, therefore, are solved by the same $\hat\beta_0$ and $\hat\beta_1$.

How do we know that we have actually minimised the sum of squared residuals? The first order conditions are necessary but not sufficient conditions. One way to verify that we have minimised the sum of squared residuals is to write, for any $b_0$ and $b_1$,

$$Q(b_0, b_1) = \sum_{i=1}^{n} [y_i - \hat\beta_0 - \hat\beta_1 x_i + (\hat\beta_0 - b_0) + (\hat\beta_1 - b_1) x_i]^2$$

$$= \sum_{i=1}^{n} [\hat u_i + (\hat\beta_0 - b_0) + (\hat\beta_1 - b_1) x_i]^2$$

$$= \sum_{i=1}^{n} \hat u_i^2 + n(\hat\beta_0 - b_0)^2 + (\hat\beta_1 - b_1)^2 \sum_{i=1}^{n} x_i^2 + 2(\hat\beta_0 - b_0)(\hat\beta_1 - b_1) \sum_{i=1}^{n} x_i,$$


where we have used equations (2.30) and (2.31). The first term does not depend on $b_0$ or $b_1$, while the sum of the last three terms can be written as

$$\sum_{i=1}^{n} [(\hat\beta_0 - b_0) + (\hat\beta_1 - b_1) x_i]^2,$$

as can be verified by straightforward algebra. Because this is a sum of squared terms, the smallest it can be is zero. Therefore, it is smallest when $b_0 = \hat\beta_0$ and $b_1 = \hat\beta_1$.
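To make the argument concrete, here is a minimal numerical sketch (not part of the original appendix; the simulated data and variable names are illustrative) showing that the closed-form OLS estimates coincide with a brute-force minimiser of $Q(b_0, b_1)$:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)          # true b0 = 1, b1 = 2

# Closed-form OLS estimates implied by the first order conditions
b1_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0_hat = y.mean() - b1_hat * x.mean()

# Brute-force minimisation of the sum of squared residuals Q(b0, b1)
Q = lambda b: np.sum((y - b[0] - b[1] * x) ** 2)
res = minimize(Q, x0=[0.0, 0.0])

print(b0_hat, b1_hat)   # closed-form solution
print(res.x)            # the numerical minimiser agrees to high precision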


Appendix 3A

3A.1 Derivation of the First Order Conditions in Equation (3.13)

The analysis is very similar to the simple regression case. We must characterise the solutions to the problem

$$\min_{b_0, b_1, \ldots, b_k} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - \cdots - b_k x_{ik})^2.$$

Taking the partial derivatives with respect to each of the $b_j$ (see the online Appendix A), evaluating them at the solutions, and setting them equal to zero gives

$$-2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik}) = 0$$

$$-2 \sum_{i=1}^{n} x_{ij} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik}) = 0, \quad \text{for all } j = 1, \ldots, k.$$

Cancelling the $-2$ gives the first order conditions in (3.13).
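As an illustration (a sketch on simulated data, not from the original text), the first order conditions are the normal equations $X'Xb = X'y$, and the residuals at the solution are orthogonal to every regressor:

import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept plus k regressors
beta = np.array([1.0, 0.5, -2.0, 3.0])
y = X @ beta + rng.normal(size=n)

# Setting the partial derivatives of the sum of squared residuals to zero
# (and cancelling the -2) gives the normal equations X'X b = X'y.
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At the solution, the residuals are orthogonal to every regressor,
# which is exactly what the first order conditions in (3.13) say.
u_hat = y - X @ b_hat
print(X.T @ u_hat)   # numerically zero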

    3A.2 Derivation of Equation (3.22)

To derive (3.22), write $x_{i1}$ in terms of its fitted value and its residual from the regression of $x_1$ on $x_2, \ldots, x_k$: $x_{i1} = \hat x_{i1} + \hat r_{i1}$, for all $i = 1, \ldots, n$. Now, plug this into the second equation in (3.13):

$$\sum_{i=1}^{n} (\hat x_{i1} + \hat r_{i1})(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik}) = 0. \quad [3.63]$$

By the definition of the OLS residual $\hat u_i$, since $\hat x_{i1}$ is just a linear function of the explanatory variables $x_{i2}, \ldots, x_{ik}$, it follows that $\sum_{i=1}^{n} \hat x_{i1} \hat u_i = 0$. Therefore, equation (3.63) can be expressed as

$$\sum_{i=1}^{n} \hat r_{i1}(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik}) = 0. \quad [3.64]$$

Since the $\hat r_{i1}$ are the residuals from regressing $x_1$ on $x_2, \ldots, x_k$, $\sum_{i=1}^{n} x_{ij} \hat r_{i1} = 0$ for all $j = 2, \ldots, k$. Therefore, (3.64) is equivalent to $\sum_{i=1}^{n} \hat r_{i1}(y_i - \hat\beta_1 x_{i1}) = 0$. Finally, we use the fact that $\sum_{i=1}^{n} \hat x_{i1} \hat r_{i1} = 0$, which means that $\hat\beta_1$ solves


$$\sum_{i=1}^{n} \hat r_{i1}(y_i - \hat\beta_1 \hat r_{i1}) = 0.$$

Now, straightforward algebra gives (3.22), provided, of course, that $\sum_{i=1}^{n} \hat r_{i1}^2 > 0$; this is ensured by Assumption MLR.3.
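The following sketch (simulated data; all names are illustrative) demonstrates this partialling-out characterisation of $\hat\beta_1$ numerically:

import numpy as np

rng = np.random.default_rng(1)
n = 1000
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)            # x1 correlated with x2
y = 1 + 2 * x1 - 1 * x2 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
b_full = ols(np.column_stack([ones, x1, x2]), y)        # full regression
# Residuals r_i1 from regressing x1 on a constant and x2
r1 = x1 - np.column_stack([ones, x2]) @ ols(np.column_stack([ones, x2]), x1)
b1_partial = (r1 @ y) / (r1 @ r1)                        # equation (3.22)

print(b_full[1], b1_partial)   # identical up to rounding error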

    3A.3 Proof of Theorem 3.1

We prove Theorem 3.1 for $\hat\beta_1$; the proof for the other slope parameters is virtually identical. (See the online Appendix E for a more succinct proof using matrices.) Under Assumption MLR.3, the OLS estimators exist, and we can write $\hat\beta_1$ as in (3.22). Under Assumption MLR.1, we can write $y_i$ as in (3.32); substitute this for $y_i$ in (3.22). Then, using $\sum_{i=1}^{n} \hat r_{i1} = 0$, $\sum_{i=1}^{n} x_{ij} \hat r_{i1} = 0$ for all $j = 2, \ldots, k$, and $\sum_{i=1}^{n} x_{i1} \hat r_{i1} = \sum_{i=1}^{n} \hat r_{i1}^2$, we have

$$\hat\beta_1 = \beta_1 + \left( \sum_{i=1}^{n} \hat r_{i1} u_i \right) \bigg/ \left( \sum_{i=1}^{n} \hat r_{i1}^2 \right). \quad [3.65]$$

Now, under Assumptions MLR.2 and MLR.4, the expected value of each $u_i$, given all independent variables in the sample, is zero. Since the $\hat r_{i1}$ are just functions of the sample independent variables, it follows that

$$E(\hat\beta_1 | X) = \beta_1 + \left( \sum_{i=1}^{n} \hat r_{i1} E(u_i | X) \right) \bigg/ \left( \sum_{i=1}^{n} \hat r_{i1}^2 \right) = \beta_1 + \left( \sum_{i=1}^{n} \hat r_{i1} \cdot 0 \right) \bigg/ \left( \sum_{i=1}^{n} \hat r_{i1}^2 \right) = \beta_1,$$

where $X$ denotes the data on all independent variables and $E(\hat\beta_1 | X)$ is the expected value of $\hat\beta_1$, given $x_{i1}, \ldots, x_{ik}$, for all $i = 1, \ldots, n$. This completes the proof.

    3A.4 General Omitted Variable Bias

We can derive the omitted variable bias in the general model in equation (3.31) under the first four Gauss-Markov assumptions. In particular, let the $\hat\beta_j$, $j = 0, 1, \ldots, k$, be the OLS estimators from the regression using the full set of explanatory variables. Let the $\tilde\beta_j$, $j = 0, 1, \ldots, k-1$, be the OLS estimators from the regression that leaves out $x_k$. Let $\tilde\delta_j$, $j = 1, \ldots, k-1$, be the slope coefficient on $x_j$ in the auxiliary regression of $x_{ik}$ on $x_{i1}, x_{i2}, \ldots, x_{i,k-1}$, $i = 1, \ldots, n$. A useful fact is that

$$\tilde\beta_j = \hat\beta_j + \hat\beta_k \tilde\delta_j. \quad [3.66]$$

This shows explicitly that, when we do not control for $x_k$ in the regression, the estimated partial effect of $x_j$ equals the partial effect when we include $x_k$ plus the partial effect of $x_k$ on $\hat y$ times the partial relationship between the omitted variable, $x_k$, and $x_j$, $j < k$. Conditional on the entire set of explanatory variables, $X$, we know that the $\hat\beta_j$ are all unbiased for the corresponding $\beta_j$, $j = 1, \ldots, k$. Further, since $\tilde\delta_j$ is just a function of $X$, we have

$$E(\tilde\beta_j | X) = E(\hat\beta_j | X) + E(\hat\beta_k | X) \tilde\delta_j = \beta_j + \beta_k \tilde\delta_j. \quad [3.67]$$


Equation (3.67) shows that $\tilde\beta_j$ is biased for $\beta_j$ unless $\beta_k = 0$ (in which case $x_k$ has no partial effect in the population) or $\tilde\delta_j$ equals zero, which means that $x_{ik}$ and $x_{ij}$ are partially uncorrelated in the sample. The key to obtaining equation (3.67) is equation (3.66).

To show equation (3.66), we can use equation (3.22) a couple of times. For simplicity, we look at $j = 1$. Now, $\tilde\beta_1$ is the slope coefficient in the simple regression of $y_i$ on $\tilde r_{i1}$, $i = 1, \ldots, n$, where the $\tilde r_{i1}$ are the OLS residuals from the regression of $x_{i1}$ on $x_{i2}, x_{i3}, \ldots, x_{i,k-1}$. Consider the numerator of the expression for $\tilde\beta_1$: $\sum_{i=1}^{n} \tilde r_{i1} y_i$. But for each $i$, we can write $y_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik} + \hat u_i$ and plug in for $y_i$. Now, by properties of the OLS residuals, the $\tilde r_{i1}$ have zero sample average and are uncorrelated with $x_{i2}, x_{i3}, \ldots, x_{i,k-1}$ in the sample. Similarly, the $\hat u_i$ have zero sample average and zero sample correlation with $x_{i1}, x_{i2}, \ldots, x_{ik}$. It follows that the $\tilde r_{i1}$ and $\hat u_i$ are uncorrelated in the sample (since the $\tilde r_{i1}$ are just linear combinations of $x_{i1}, x_{i2}, \ldots, x_{i,k-1}$). So

$$\sum_{i=1}^{n} \tilde r_{i1} y_i = \hat\beta_1 \left( \sum_{i=1}^{n} \tilde r_{i1} x_{i1} \right) + \hat\beta_k \left( \sum_{i=1}^{n} \tilde r_{i1} x_{ik} \right). \quad [3.68]$$

Now, $\sum_{i=1}^{n} \tilde r_{i1} x_{i1} = \sum_{i=1}^{n} \tilde r_{i1}^2$, which is also the denominator of $\tilde\beta_1$. Therefore, we have shown that

$$\tilde\beta_1 = \hat\beta_1 + \hat\beta_k \left( \sum_{i=1}^{n} \tilde r_{i1} x_{ik} \right) \bigg/ \left( \sum_{i=1}^{n} \tilde r_{i1}^2 \right) = \hat\beta_1 + \hat\beta_k \tilde\delta_1.$$

This is the relationship we wanted to show.
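A quick numerical check of equation (3.66), on simulated data of our own construction (not from the text):

import numpy as np

rng = np.random.default_rng(2)
n = 2000
x1 = rng.normal(size=n)
xk = 0.6 * x1 + rng.normal(size=n)                 # omitted variable, related to x1
y = 1 + 2 * x1 + 1.5 * xk + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
b_long = ols(np.column_stack([ones, x1, xk]), y)   # regression including x_k
b_short = ols(np.column_stack([ones, x1]), y)      # regression omitting x_k
delta1 = ols(np.column_stack([ones, x1]), xk)[1]   # auxiliary regression slope

print(b_short[1], b_long[1] + b_long[2] * delta1)  # equal: equation (3.66)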

    3A.5 Proof of Theorem 3.2

Again, we prove this for $j = 1$. Write $\hat\beta_1$ as in equation (3.65). Now, under MLR.5, $\mathrm{Var}(u_i | X) = \sigma^2$ for all $i = 1, \ldots, n$. Under random sampling, the $u_i$ are independent, even conditional on $X$, and the $\hat r_{i1}$ are nonrandom conditional on $X$. Therefore,

$$\mathrm{Var}(\hat\beta_1 | X) = \left( \sum_{i=1}^{n} \hat r_{i1}^2 \, \mathrm{Var}(u_i | X) \right) \bigg/ \left( \sum_{i=1}^{n} \hat r_{i1}^2 \right)^2 = \left( \sum_{i=1}^{n} \hat r_{i1}^2 \, \sigma^2 \right) \bigg/ \left( \sum_{i=1}^{n} \hat r_{i1}^2 \right)^2 = \sigma^2 \bigg/ \left( \sum_{i=1}^{n} \hat r_{i1}^2 \right).$$

Now, since $\sum_{i=1}^{n} \hat r_{i1}^2$ is the sum of squared residuals from regressing $x_1$ on $x_2, \ldots, x_k$, $\sum_{i=1}^{n} \hat r_{i1}^2 = \mathrm{SST}_1(1 - R_1^2)$. This completes the proof.
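On a single simulated sample, one can verify that this partialled-out variance expression matches the usual matrix formula $\hat\sigma^2 (X'X)^{-1}$ for the slope on $x_1$ (a sketch with illustrative names; both sides use the same estimate of $\sigma^2$):

import numpy as np

rng = np.random.default_rng(3)
n = 500
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
sigma2 = np.sum((y - X @ b) ** 2) / (n - 3)        # estimate of sigma^2

# Matrix formula: Var(b1 | X) is the second diagonal entry (0-indexed)
var_matrix = sigma2 * np.linalg.inv(X.T @ X)[1, 1]

# Partialled-out formula: sigma^2 over the SSR of x1 on the other regressors
Z = np.column_stack([np.ones(n), x2])
r1 = x1 - Z @ np.linalg.solve(Z.T @ Z, Z.T @ x1)
var_partial = sigma2 / np.sum(r1 ** 2)             # = sigma^2 / (SST_1 (1 - R_1^2))

print(var_matrix, var_partial)                     # identical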

    3A.6 Proof of Theorem 3.4

We show that, for any other linear unbiased estimator $\tilde\beta_1$ of $\beta_1$, $\mathrm{Var}(\tilde\beta_1) \ge \mathrm{Var}(\hat\beta_1)$, where $\hat\beta_1$ is the OLS estimator. The focus on $j = 1$ is without loss of generality. For $\tilde\beta_1$ as in equation (3.60), we can plug in for $y_i$ to obtain

$$\tilde\beta_1 = \beta_0 \sum_{i=1}^{n} w_{i1} + \beta_1 \sum_{i=1}^{n} w_{i1} x_{i1} + \beta_2 \sum_{i=1}^{n} w_{i1} x_{i2} + \cdots + \beta_k \sum_{i=1}^{n} w_{i1} x_{ik} + \sum_{i=1}^{n} w_{i1} u_i.$$


Now, since the $w_{i1}$ are functions of the $x_{ij}$,

$$E(\tilde\beta_1 | X) = \beta_0 \sum_{i=1}^{n} w_{i1} + \beta_1 \sum_{i=1}^{n} w_{i1} x_{i1} + \beta_2 \sum_{i=1}^{n} w_{i1} x_{i2} + \cdots + \beta_k \sum_{i=1}^{n} w_{i1} x_{ik} + \sum_{i=1}^{n} w_{i1} E(u_i | X)$$

$$= \beta_0 \sum_{i=1}^{n} w_{i1} + \beta_1 \sum_{i=1}^{n} w_{i1} x_{i1} + \beta_2 \sum_{i=1}^{n} w_{i1} x_{i2} + \cdots + \beta_k \sum_{i=1}^{n} w_{i1} x_{ik}$$

because $E(u_i | X) = 0$, for all $i = 1, \ldots, n$, under MLR.2 and MLR.4. Therefore, for $E(\tilde\beta_1 | X)$ to equal $\beta_1$ for any values of the parameters, we must have

$$\sum_{i=1}^{n} w_{i1} = 0, \quad \sum_{i=1}^{n} w_{i1} x_{i1} = 1, \quad \sum_{i=1}^{n} w_{i1} x_{ij} = 0, \; j = 2, \ldots, k. \quad [3.69]$$

Now, let $\hat r_{i1}$ be the residuals from the regression of $x_{i1}$ on $x_{i2}, \ldots, x_{ik}$. Then, from (3.69), it follows that

$$\sum_{i=1}^{n} w_{i1} \hat r_{i1} = 1 \quad [3.70]$$

because $x_{i1} = \hat x_{i1} + \hat r_{i1}$ and $\sum_{i=1}^{n} w_{i1} \hat x_{i1} = 0$. Now, consider the difference between $\mathrm{Var}(\tilde\beta_1 | X)$ and $\mathrm{Var}(\hat\beta_1 | X)$ under MLR.1 through MLR.5:

$$\sigma^2 \sum_{i=1}^{n} w_{i1}^2 - \sigma^2 \bigg/ \left( \sum_{i=1}^{n} \hat r_{i1}^2 \right). \quad [3.71]$$

Because of (3.70), we can write the difference in (3.71), without $\sigma^2$, as

$$\sum_{i=1}^{n} w_{i1}^2 - \left( \sum_{i=1}^{n} w_{i1} \hat r_{i1} \right)^2 \bigg/ \left( \sum_{i=1}^{n} \hat r_{i1}^2 \right). \quad [3.72]$$

But (3.72) is simply

$$\sum_{i=1}^{n} (w_{i1} - \hat\gamma_1 \hat r_{i1})^2, \quad [3.73]$$

where $\hat\gamma_1 = \left( \sum_{i=1}^{n} w_{i1} \hat r_{i1} \right) \big/ \left( \sum_{i=1}^{n} \hat r_{i1}^2 \right)$, as can be seen by squaring each term in (3.73), summing, and then cancelling terms. Because (3.73) is just the sum of squared residuals from the simple regression of $w_{i1}$ on $\hat r_{i1}$ (remember that the sample average of $\hat r_{i1}$ is zero), (3.73) must be nonnegative. This completes the proof.


Appendix 5A

Asymptotic Normality of OLS

We sketch a proof of the asymptotic normality of OLS [Theorem 5.2(i)] in the simple regression case. Write the simple regression model as in equation (5.16). Then, by the usual algebra of simple regression, we can write

$$\sqrt{n}(\hat\beta_1 - \beta_1) = (1/s_x^2)\left[ n^{-1/2} \sum_{i=1}^{n} (x_i - \bar x) u_i \right],$$

where we use $s_x^2$ to denote the sample variance of $\{x_i: i = 1, 2, \ldots, n\}$. By the law of large numbers (see the online Appendix C), $s_x^2 \xrightarrow{p} \sigma_x^2 = \mathrm{Var}(x)$. Assumption MLR.3 rules out perfect collinearity, which means that $\mathrm{Var}(x) > 0$ ($x_i$ varies in the sample, and therefore $x$ is not constant in the population). Next,

$$n^{-1/2} \sum_{i=1}^{n} (x_i - \bar x) u_i = n^{-1/2} \sum_{i=1}^{n} (x_i - \mu) u_i + (\mu - \bar x)\left[ n^{-1/2} \sum_{i=1}^{n} u_i \right],$$

where $\mu = E(x)$ is the population mean of $x$. Now $\{u_i\}$ is a sequence of i.i.d. random variables with mean zero and variance $\sigma^2$, and so $n^{-1/2} \sum_{i=1}^{n} u_i$ converges to the Normal(0, $\sigma^2$) distribution as $n \to \infty$; this is just the central limit theorem from the online Appendix C. By the law of large numbers, $\mathrm{plim}(\mu - \bar x) = 0$. A standard result in asymptotic theory is that if $\mathrm{plim}(w_n) = 0$ and $z_n$ has an asymptotic normal distribution, then $\mathrm{plim}(w_n z_n) = 0$. [See Wooldridge (2010, Chapter 3) for more discussion.] This implies that $(\mu - \bar x)[n^{-1/2} \sum_{i=1}^{n} u_i]$ has zero plim. Next, $\{(x_i - \mu) u_i: i = 1, 2, \ldots\}$ is an indefinite sequence of i.i.d. random variables with mean zero (because $u$ and $x$ are uncorrelated under Assumption MLR.4) and variance $\sigma^2 \sigma_x^2$ by the homoskedasticity Assumption MLR.5. Therefore, $n^{-1/2} \sum_{i=1}^{n} (x_i - \mu) u_i$ has an asymptotic Normal(0, $\sigma^2 \sigma_x^2$) distribution. We just showed that the difference between $n^{-1/2} \sum_{i=1}^{n} (x_i - \bar x) u_i$ and $n^{-1/2} \sum_{i=1}^{n} (x_i - \mu) u_i$ has zero plim. A result in asymptotic theory is that if $z_n$ has an asymptotic normal distribution and $\mathrm{plim}(v_n - z_n) = 0$, then $v_n$ has the same asymptotic normal distribution as $z_n$. It follows that $n^{-1/2} \sum_{i=1}^{n} (x_i - \bar x) u_i$ also has an asymptotic Normal(0, $\sigma^2 \sigma_x^2$) distribution. Putting all of the pieces together gives

$$\sqrt{n}(\hat\beta_1 - \beta_1) = (1/\sigma_x^2)\left[ n^{-1/2} \sum_{i=1}^{n} (x_i - \bar x) u_i \right] + [(1/s_x^2) - (1/\sigma_x^2)]\left[ n^{-1/2} \sum_{i=1}^{n} (x_i - \bar x) u_i \right],$$

and since $\mathrm{plim}(1/s_x^2) = 1/\sigma_x^2$, the second term has zero plim. Therefore, the asymptotic distribution of $\sqrt{n}(\hat\beta_1 - \beta_1)$ is Normal(0, $\{\sigma^2 \sigma_x^2\}/\{\sigma_x^2\}^2$) = Normal(0, $\sigma^2/\sigma_x^2$). This completes the proof in the simple regression case, as $a_1^2 = \sigma_x^2$ in this case. See Wooldridge (2010, Chapter 4) for the general case.
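A small Monte Carlo sketch (our own simulation, not from the text) illustrates the theorem: even with a deliberately skewed error distribution, $\sqrt{n}(\hat\beta_1 - \beta_1)$ behaves like a Normal(0, $\sigma^2/\sigma_x^2$) draw.

import numpy as np

rng = np.random.default_rng(4)
n, reps, b1 = 500, 5000, 2.0
draws = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    u = rng.exponential(size=n) - 1.0          # skewed error, mean zero, variance 1
    y = 1.0 + b1 * x + u
    b1_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    draws[r] = np.sqrt(n) * (b1_hat - b1)

# Theoretical asymptotic variance: sigma^2 / sigma_x^2 = 1 / 1 = 1
print(draws.mean(), draws.var())    # roughly 0 and 1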


Appendix 6A

A Brief Introduction to Bootstrapping

In many cases where formulas for standard errors are hard to obtain mathematically, or where they are thought not to be very good approximations to the true sampling variation of an estimator, we can rely on a resampling method. The general idea is to treat the observed data as a population that we can draw samples from. The most common resampling method is the bootstrap. (There are actually several versions of the bootstrap, but the most general, and most easily applied, is called the nonparametric bootstrap, and that is what we describe here.)

Suppose we have an estimate, $\hat\theta$, of a population parameter, $\theta$. We obtained this estimate, which could be a function of OLS estimates (or estimates that we cover in later chapters), from a random sample of size $n$. We would like to obtain a standard error for $\hat\theta$ that can be used for constructing $t$ statistics or confidence intervals. Remarkably, we can obtain a valid standard error by computing the estimate from different random samples drawn from the original data.

Implementation is easy. If we list our observations from 1 through $n$, we draw $n$ numbers randomly, with replacement, from this list. This produces a new data set (of size $n$) that consists of the original data, but with many observations appearing multiple times (except in the rather unusual case that we resample the original data). Each time we randomly sample from the original data, we can estimate $\theta$ using the same procedure that we used on the original data. Let $\hat\theta^{(b)}$ denote the estimate from bootstrap sample $b$. Now, if we repeat the resampling and estimation $m$ times, we have $m$ new estimates, $\{\hat\theta^{(b)}: b = 1, 2, \ldots, m\}$. The bootstrap standard error of $\hat\theta$ is just the sample standard deviation of the $\hat\theta^{(b)}$, namely,

$$\mathrm{bse}(\hat\theta) = \left[ (m-1)^{-1} \sum_{b=1}^{m} (\hat\theta^{(b)} - \bar\theta)^2 \right]^{1/2}, \quad [6.48]$$

where $\bar\theta$ is the average of the bootstrap estimates. If obtaining an estimate of $\theta$ on a sample of size $n$ requires little computational time, as in the case of OLS and all the other estimators we encounter in this text, we can afford to choose $m$ (the number of bootstrap replications) to be large. A typical value is $m = 1{,}000$, but even $m = 500$ or a somewhat smaller value can produce a reliable standard error. Note that the size of $m$ (the number of times we resample the original data) has nothing to do with the sample size, $n$. (For certain estimation problems beyond the scope of this text, a large $n$ can force one to do fewer bootstrap replications.) Many statistics and econometrics packages have built-in bootstrap commands, and this makes the calculation of bootstrap standard errors simple, especially compared with the work often required to obtain an analytical formula for an asymptotic standard error.

One can actually do better in most cases by using the bootstrap sample to compute $p$-values for $t$ statistics (and $F$ statistics), or for obtaining confidence intervals, rather than obtaining a bootstrap standard error to be used in the construction of $t$ statistics or confidence intervals. See Horowitz (2001) for a comprehensive treatment.
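A minimal implementation of this procedure, for the slope of a simple regression on simulated data (all names are illustrative), looks as follows:

import numpy as np

rng = np.random.default_rng(5)
n, m = 200, 1000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

def slope(x, y):
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

theta_b = np.empty(m)
for b in range(m):
    idx = rng.integers(0, n, size=n)        # draw n indices with replacement
    theta_b[b] = slope(x[idx], y[idx])      # re-estimate on the bootstrap sample

bse = theta_b.std(ddof=1)                   # sample standard deviation, equation (6.48)
print(bse)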


Appendix 13A

13A.1 Assumptions for Pooled OLS Using First Differences

In this appendix, we provide careful statements of the assumptions for the first-differencing estimator. Verification of these claims is somewhat involved, but it can be found in Wooldridge (2010, Chapter 10).

Assumption FD.1
For each $i$, the model is

$$y_{it} = \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it}, \quad t = 1, \ldots, T,$$

where the $\beta_j$ are the parameters to estimate and $a_i$ is the unobserved effect.

Assumption FD.2
We have a random sample from the cross section.

Assumption FD.3
Each explanatory variable changes over time (for at least some $i$), and no perfect linear relationships exist among the explanatory variables.

For the next assumption, it is useful to let $X_i$ denote the explanatory variables for all time periods for cross-sectional observation $i$; thus, $X_i$ contains $x_{itj}$, $t = 1, \ldots, T$, $j = 1, \ldots, k$.

Assumption FD.4
For each $t$, the expected value of the idiosyncratic error given the explanatory variables in all time periods and the unobserved effect is zero: $E(u_{it} | X_i, a_i) = 0$.

When Assumption FD.4 holds, we sometimes say that the $x_{itj}$ are strictly exogenous conditional on the unobserved effect. The idea is that, once we control for $a_i$, there is no correlation between the $x_{isj}$ and the remaining idiosyncratic error, $u_{it}$, for all $s$ and $t$.

As stated, Assumption FD.4 is stronger than necessary. We use this form of the assumption because it emphasises that we are interested in the equation

$$E(y_{it} | X_i, a_i) = E(y_{it} | x_{it}, a_i) = \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i,$$


so that the $\beta_j$ measure partial effects of the observed explanatory variables holding fixed, or controlling for, the unobserved effect, $a_i$. Nevertheless, an important implication of FD.4, and one that is sufficient for the unbiasedness of the FD estimator, is $E(u_{it} | X_i) = 0$, $t = 2, \ldots, T$. In fact, for consistency we can simply assume that $x_{itj}$ is uncorrelated with $u_{it}$ for all $t = 2, \ldots, T$ and $j = 1, \ldots, k$. See Wooldridge (2010, Chapter 10) for further discussion.

Under these first four assumptions, the first-difference estimators are unbiased. The key assumption is FD.4, which is strict exogeneity of the explanatory variables. Under these same assumptions, we can also show that the FD estimator is consistent with a fixed $T$ and as $N \to \infty$ (and perhaps more generally).

The next two assumptions ensure that the standard errors and test statistics resulting from pooled OLS on the first differences are (asymptotically) valid.

Assumption FD.5
The variance of the differenced errors, conditional on all explanatory variables, is constant: $\mathrm{Var}(\Delta u_{it} | X_i) = \sigma^2$, $t = 2, \ldots, T$.

Assumption FD.6
For all $t \neq s$, the differences in the idiosyncratic errors are uncorrelated (conditional on all explanatory variables): $\mathrm{Cov}(\Delta u_{it}, \Delta u_{is} | X_i) = 0$, $t \neq s$.

Assumption FD.5 ensures that the differenced errors, $\Delta u_{it}$, are homoskedastic. Assumption FD.6 states that the differenced errors are serially uncorrelated, which means that the $u_{it}$ follow a random walk across time (see Chapter 11). Under Assumptions FD.1 through FD.6, the FD estimator of the $\beta_j$ is the best linear unbiased estimator (conditional on the explanatory variables).

Assumption FD.7
Conditional on $X_i$, the $\Delta u_{it}$ are independent and identically distributed normal random variables.

When we add Assumption FD.7, the FD estimators are normally distributed, and the $t$ and $F$ statistics from pooled OLS on the differences have exact $t$ and $F$ distributions. Without FD.7, we can rely on the usual asymptotic approximations.

13A.2 Computing Standard Errors Robust to Serial Correlation and Heteroskedasticity of Unknown Form

Because the FD estimator is consistent as $N \to \infty$ under Assumptions FD.1 through FD.4, it would be very handy to have a simple method of obtaining proper standard errors and test statistics that allow for any kind of serial correlation or heteroskedasticity in the FD errors, $e_{it} = \Delta u_{it}$. Fortunately, provided $N$ is moderately large, and $T$ is not too large, fully robust standard errors and test statistics are readily available. As mentioned in the text, a detailed treatment is above the level of this text. The technical arguments combine the insights described in Chapters 8 and 12, where statistics robust to heteroskedasticity and serial correlation are discussed. Actually, there is one important advantage with panel data: because we have a (large) cross section, we can allow unrestricted serial correlation in the errors $\{e_{it}\}$ provided $T$ is not too large. We can contrast this situation with the Newey-West approach in Section 12.5, where the estimated covariances must be downweighted as the observations get farther apart in time.


The general approach to obtaining fully robust standard errors and test statistics in the context of panel data is known as clustering, and ideas have been borrowed from the cluster sampling literature. The idea is that each cross-sectional unit is defined as a cluster of observations over time, and arbitrary correlation (serial correlation) and changing variances are allowed within each cluster. Because of the relationship to cluster sampling, many econometric software packages have options for clustering standard errors and test statistics. Most commands look something like

regress cy cx1 cx2 ... cxk, cluster(id)

where id is a variable containing unique identifiers for the cross-sectional units (and the c before each variable denotes change). The option cluster(id) at the end of the regress command tells the software to report all standard errors and test statistics (including $t$ statistics and $F$-type statistics) so that they are valid, in large cross sections, with any kind of serial correlation or heteroskedasticity. Reporting such statistics is very common in modern empirical work with panel data. Often the corrected standard errors will be substantially larger than either the usual standard errors or those that only correct for heteroskedasticity. The larger standard errors better reflect the sampling error in the pooled OLS coefficients.
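As a Python analogue of the command above (a sketch on a simulated panel; statsmodels is assumed to be available, and the variable names are our own), pooled OLS on first differences with clustered standard errors can be computed as:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
N, T = 300, 5
df = pd.DataFrame({
    "id": np.repeat(np.arange(N), T),
    "x": rng.normal(size=N * T),
})
a = np.repeat(rng.normal(size=N), T)                  # unobserved effect a_i
df["y"] = 2.0 * df["x"] + a + rng.normal(size=N * T)

# First differences within each cross-sectional unit remove a_i
d = df.groupby("id").diff().dropna()
ids = df.loc[d.index, "id"]

res = sm.OLS(d["y"], sm.add_constant(d["x"])).fit(
    cov_type="cluster", cov_kwds={"groups": ids}      # cluster by id
)
print(res.summary())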


Appendix 14A

14A.1 Assumptions for Fixed and Random Effects

In this appendix, we provide statements of the assumptions for fixed and random effects estimation. We also provide a discussion of the properties of the estimators under different sets of assumptions. Verification of these claims is somewhat involved, but can be found in Wooldridge (2010, Chapter 10).

Assumption FE.1
For each $i$, the model is

$$y_{it} = \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it}, \quad t = 1, \ldots, T,$$

where the $\beta_j$ are the parameters to estimate and $a_i$ is the unobserved effect.

Assumption FE.2
We have a random sample from the cross section.

Assumption FE.3
Each explanatory variable changes over time (for at least some $i$), and no perfect linear relationships exist among the explanatory variables.

Assumption FE.4
For each $t$, the expected value of the idiosyncratic error given the explanatory variables in all time periods and the unobserved effect is zero: $E(u_{it} | X_i, a_i) = 0$.

Under these first four assumptions (which are identical to the assumptions for the first-differencing estimator), the fixed effects estimator is unbiased. Again, the key is the strict exogeneity assumption, FE.4. Under these same assumptions, the FE estimator is consistent with a fixed $T$ as $N \to \infty$.

Assumption FE.5
$\mathrm{Var}(u_{it} | X_i, a_i) = \mathrm{Var}(u_{it}) = \sigma_u^2$, for all $t = 1, \ldots, T$.

Assumption FE.6
For all $t \neq s$, the idiosyncratic errors are uncorrelated (conditional on all explanatory variables and $a_i$): $\mathrm{Cov}(u_{it}, u_{is} | X_i, a_i) = 0$.


Under Assumptions FE.1 through FE.6, the fixed effects estimator of the $\beta_j$ is the best linear unbiased estimator. Since the FD estimator is linear and unbiased, it is necessarily worse than the FE estimator. The assumption that makes FE better than FD is FE.6, which implies that the idiosyncratic errors are serially uncorrelated.

Assumption FE.7
Conditional on $X_i$ and $a_i$, the $u_{it}$ are independent and identically distributed as Normal(0, $\sigma_u^2$).

Assumption FE.7 implies FE.4, FE.5, and FE.6, but it is stronger because it assumes a normal distribution for the idiosyncratic errors. If we add FE.7, the FE estimator is normally distributed, and $t$ and $F$ statistics have exact $t$ and $F$ distributions. Without FE.7, we can rely on asymptotic approximations. But, without making special assumptions, these approximations require large $N$ and small $T$.

The ideal random effects assumptions include FE.1, FE.2, FE.4, FE.5, and FE.6. (FE.7 could be added, but it gains us little in practice because we have to estimate $\theta$.) Because we are only subtracting a fraction of the time averages, we can now allow time-constant explanatory variables. So, FE.3 is replaced with

Assumption RE.3
There are no perfect linear relationships among the explanatory variables.

The cost of allowing time-constant regressors is that we must add assumptions about how the unobserved effect, $a_i$, is related to the explanatory variables.

Assumption RE.4
In addition to FE.4, the expected value of $a_i$ given all explanatory variables is constant: $E(a_i | X_i) = \beta_0$.

This is the assumption that rules out correlation between the unobserved effect and the explanatory variables, and it is the key distinction between fixed effects and random effects. Because we are assuming $a_i$ is uncorrelated with all elements of $x_{it}$, we can include time-constant explanatory variables. (Technically, the quasi-time-demeaning only removes a fraction of the time average, and not the whole time average.) We allow for a nonzero expectation for $a_i$ in stating Assumption RE.4 so that the model under the random effects assumptions contains an intercept, $\beta_0$, as in equation (14.7). Remember, we would typically include a set of time-period intercepts, too, with the first year acting as the base year.

We also need to impose homoskedasticity on $a_i$ as follows:

Assumption RE.5
In addition to FE.5, the variance of $a_i$ given all explanatory variables is constant: $\mathrm{Var}(a_i | X_i) = \sigma_a^2$.

Under the six random effects assumptions (FE.1, FE.2, RE.3, RE.4, RE.5, and FE.6), the RE estimator is consistent and asymptotically normally distributed as $N$ gets large for fixed $T$. Actually, consistency and asymptotic normality follow under the first four assumptions, but


without the last two assumptions the usual RE standard errors and test statistics would not be valid. In addition, under the six RE assumptions, the RE estimators are asymptotically efficient. This means that, in large samples, the RE estimators will have smaller standard errors than the corresponding pooled OLS estimators (when the proper, robust standard errors are used for pooled OLS). For coefficients on time-varying explanatory variables (the only ones estimable by FE), the RE estimator is more efficient than the FE estimator, often much more efficient. But FE is not meant to be efficient under the RE assumptions; FE is intended to be robust to correlation between $a_i$ and the $x_{itj}$. As often happens in econometrics, there is a tradeoff between robustness and efficiency. See Wooldridge (2010, Chapter 10) for verification of the claims made here.

14A.2 Inference Robust to Serial Correlation and Heteroskedasticity for Fixed Effects and Random Effects

One of the key assumptions for performing inference using the FE, RE, and even the CRE approach to panel data models is the assumption of no serial correlation in the idiosyncratic errors, $\{u_{it}: t = 1, \ldots, T\}$; see Assumption FE.6. Of course, heteroskedasticity can also be an issue, but this is also ruled out for standard inference (see Assumption FE.5). As discussed in the online appendix to Chapter 13, the same issues can arise with first differencing estimation when we have $T \ge 3$ time periods.

Fortunately, as with FD estimation, there are now simple solutions for fully robust inference, that is, inference that is robust to arbitrary violations of Assumptions FE.5 and FE.6 and, when applying the RE or CRE approaches, to Assumption RE.5. As with FD estimation, the general approach to obtaining fully robust standard errors and test statistics is known as clustering. Now, however, the clustering is applied to a different equation. For example, for FE estimation, the clustering is applied to the time-demeaned equation (14.5). For RE estimation, the clustering gets applied to the quasi-time-demeaned equation (14.11). [A similar comment holds for CRE, but there the time averages are included as separate explanatory variables.] The details, which can be found in Wooldridge (2010, Chapter 10), are too advanced for this course. But understanding the purpose of clustering is not: if possible, we should compute standard errors, confidence intervals, and test statistics that are valid in large cross sections under the weakest set of assumptions. The FE estimator requires only Assumptions FE.1 to FE.4 for unbiasedness and consistency (as $N \to \infty$ with $T$ fixed). Thus, a careful researcher at least checks whether inference made robust to serial correlation and heteroskedasticity in the errors affects inference. Experience shows that it often does.

Applying cluster robust inference to account for serial correlation within a panel data context is easily justified when $N$ is substantially larger than $T$, but cannot be justified when $N$ is small and $T$ is larger. Computing the cluster robust statistics after FE or RE estimation is simple in many econometrics packages, often requiring only a qualifier of the form cluster(id) appended to the end of FE and RE estimation commands, as in the sketch below. As in the FD case, id refers to a cross-section identifier.
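Here is the sketch referred to above: fixed effects via an explicit within transformation, with standard errors clustered by id (simulated data and illustrative names; a dedicated panel package would also adjust the degrees of freedom for the demeaning):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
N, T = 300, 5
df = pd.DataFrame({
    "id": np.repeat(np.arange(N), T),
    "x": rng.normal(size=N * T),
})
a = np.repeat(rng.normal(size=N), T)                   # unobserved effect a_i
df["y"] = 2.0 * df["x"] + a + rng.normal(size=N * T)

# Time-demeaning removes a_i, as in equation (14.5)
demeaned = df[["x", "y"]] - df.groupby("id")[["x", "y"]].transform("mean")

res = sm.OLS(demeaned["y"], demeaned["x"]).fit(
    cov_type="cluster", cov_kwds={"groups": df["id"]}  # cluster by id
)
print(res.params, res.bse)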


Appendix 15A

15A.1 Assumptions for Two Stage Least Squares

This appendix covers the assumptions under which 2SLS has desirable large sample properties. We first state the assumptions for cross-sectional applications under random sampling. Then, we discuss what needs to be added for them to apply to time series and panel data.

    15A.2 Assumption 2SLS.1 (Linear in Parameters)

The model in the population can be written as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u,$$

where $\beta_0, \beta_1, \ldots, \beta_k$ are the unknown parameters (constants) of interest, and $u$ is an unobserved random error or random disturbance term. The instrumental variables are denoted as $z_j$.

It is worth emphasising that Assumption 2SLS.1 is virtually identical to MLR.1 (with the minor exception that 2SLS.1 mentions the notation for the instrumental variables, $z_j$). In other words, the model we are interested in is the same as that for OLS estimation of the $\beta_j$. Sometimes it is easy to lose sight of the fact that we can apply different estimation methods to the same model. Unfortunately, it is not uncommon to hear researchers say "I estimated an OLS model" or "I used a 2SLS model." Such statements are meaningless. OLS and 2SLS are different estimation methods that are applied to the same model. It is true that they have desirable statistical properties under different sets of assumptions on the model, but the relationship they are estimating is given by the equation in 2SLS.1 (or MLR.1). The point is similar to that made for the unobserved effects panel data model covered in Chapters 13 and 14: pooled OLS, first differencing, fixed effects, and random effects are different estimation methods for the same model.

    15A.3 Assumption 2SLS.2 (Random Sampling)

We have a random sample on $y$, the $x_j$, and the $z_j$.

    15A.4 Assumption 2SLS.3 (Rank Condition)

(i) There are no perfect linear relationships among the instrumental variables. (ii) The rank condition for identification holds.


With a single endogenous explanatory variable, as in equation (15.42), the rank condition is easily described. Let $z_1, \ldots, z_m$ denote the exogenous variables, where $z_k, \ldots, z_m$ do not appear in the structural model (15.42). The reduced form of $y_2$ is

$$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \cdots + \pi_{k-1} z_{k-1} + \pi_k z_k + \cdots + \pi_m z_m + v_2.$$

Then, we need at least one of $\pi_k, \ldots, \pi_m$ to be nonzero. This requires at least one exogenous variable that does not appear in (15.42) (the order condition). Stating the rank condition with two or more endogenous explanatory variables requires matrix algebra. [See Wooldridge (2010, Chapter 5).]

    15A.5 Assumption 2SLS.4 (Exogenous Instrumental Variables)

    The error term u has zero mean, and each IV is uncorrelated with u.

Remember that any $x_j$ that is uncorrelated with $u$ also acts as an IV.

    15A.6 Theorem 15A.1

    Under Assumptions 2SLS.1 through 2SLS.4, the 2SLS estimator is consistent.

    15A.7 Assumption 2SLS.5 (Homoskedasticity)

Let $z$ denote the collection of all instrumental variables. Then, $E(u^2 | z) = \sigma^2$.

    15A.8 Theorem 15A.2

Under Assumptions 2SLS.1 through 2SLS.5, the 2SLS estimators are asymptotically normally distributed. Consistent estimators of the asymptotic variance are given as in equation (15.43), where $\sigma^2$ is replaced with $\hat\sigma^2 = (n - k - 1)^{-1} \sum_{i=1}^{n} \hat u_i^2$, and the $\hat u_i$ are the 2SLS residuals.

The 2SLS estimator is also the best IV estimator under the five assumptions given. We state the result here. A proof can be found in Wooldridge (2010, Chapter 5).

    15A.9 Theorem 15A.3

Under Assumptions 2SLS.1 through 2SLS.5, the 2SLS estimator is asymptotically efficient in the class of IV estimators that uses linear combinations of the exogenous variables as instruments.

If the homoskedasticity assumption does not hold, the 2SLS estimators are still asymptotically normal, but the standard errors (and $t$ and $F$ statistics) need to be adjusted; many econometrics packages do this routinely. Moreover, the 2SLS estimator is no longer the asymptotically efficient IV estimator, in general. We will not study more efficient estimators here [see Wooldridge (2010, Chapter 8)].

For time series applications, we must add some assumptions. First, as with OLS, we must assume that all series (including the IVs) are weakly dependent: this ensures that the law of large numbers and the central limit theorem hold. For the usual standard errors and test statistics to be valid, as well as for asymptotic efficiency, we must add a no serial correlation assumption.

    15A.10 Assumption 2SLS.6 (No Serial Correlation)

    Equation (15.54) holds.

A similar no serial correlation assumption is needed in panel data applications. Tests and corrections for serial correlation were discussed in Section 15.7.
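To fix ideas, here is a minimal two stage least squares sketch on simulated data with one endogenous regressor and one instrument (all names are illustrative; note that the naive second-stage standard errors are not valid and must be computed as in (15.43), or by a dedicated IV routine):

import numpy as np

rng = np.random.default_rng(8)
n = 2000
z = rng.normal(size=n)
e = rng.normal(size=n)                       # common error creating endogeneity
y2 = 1.0 + 0.8 * z + e + rng.normal(size=n)  # reduced form: z is relevant
y1 = 1.0 + 2.0 * y2 - 1.5 * e + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
# First stage: regress the endogenous variable on the exogenous variables
y2_hat = np.column_stack([ones, z]) @ ols(np.column_stack([ones, z]), y2)
# Second stage: replace y2 with its first-stage fitted value
b_2sls = ols(np.column_stack([ones, y2_hat]), y1)
b_ols = ols(np.column_stack([ones, y2]), y1)

print(b_ols[1], b_2sls[1])   # OLS is biased away from 2; 2SLS is close to 2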


Appendix 17A

17A.1 Maximum Likelihood Estimation with Explanatory Variables

The online Appendix C provides a review of maximum likelihood estimation (MLE) in the simplest case of estimating the parameters in an unconditional distribution. But most models in econometrics have explanatory variables, whether we estimate those models by OLS or MLE. The latter is indispensable for nonlinear models, and here we provide a very brief description of the general approach.

All of the models covered in this chapter can be put in the following form. Let $f(y | x, \beta)$ denote the density function for a random draw $y_i$ from the population, conditional on $x_i = x$. The maximum likelihood estimator (MLE) of $\beta$ maximises the log-likelihood function,

$$\max_{b} \sum_{i=1}^{n} \log f(y_i | x_i, b), \quad [17.53]$$

where the vector $b$ is the dummy argument in the maximisation problem. In most cases, the MLE, which we write as $\hat\beta$, is consistent and has an approximate normal distribution in large samples. This is true even though we cannot write down a formula for $\hat\beta$ except in very special circumstances.

For the binary response case (logit and probit), the conditional density is determined by two values, $f(1 | x, \beta) = P(y_i = 1 | x_i) = G(x_i \beta)$ and $f(0 | x, \beta) = P(y_i = 0 | x_i) = 1 - G(x_i \beta)$. In fact, a succinct way to write the density is $f(y | x, \beta) = [1 - G(x\beta)]^{1-y} [G(x\beta)]^{y}$ for $y = 0, 1$. Thus, we can write (17.53) as

$$\max_{b} \sum_{i=1}^{n} \{(1 - y_i) \log[1 - G(x_i b)] + y_i \log[G(x_i b)]\}. \quad [17.54]$$

Generally, the solutions to (17.54) are quickly found by modern computers using iterative methods to maximise a function. The total computation time even for fairly large data sets is typically quite low.

The log-likelihood functions for the Tobit model and for censored and truncated regression are only slightly more complicated, depending on an additional variance parameter in addition to $\beta$. They are easily derived from the densities obtained in the text. See Wooldridge (2010) for details.
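As a sketch of (17.54) in practice (simulated data; a generic optimiser rather than the specialised iterations real packages use), the logit log-likelihood can be maximised directly:

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit             # numerically stable logistic G(z)

rng = np.random.default_rng(9)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.5])
y = (rng.uniform(size=n) < expit(X @ beta_true)).astype(float)

def neg_loglik(b):
    # Clip to guard against log(0) during the optimiser's trial steps
    p = np.clip(expit(X @ b), 1e-10, 1 - 1e-10)
    # Equation (17.54), with a sign flip because we minimise
    return -np.sum((1 - y) * np.log(1 - p) + y * np.log(p))

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(res.x)   # close to beta_true in large samples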


Appendix 17B

17B.1 Asymptotic Standard Errors in Limited Dependent Variable Models

Derivations of the asymptotic standard errors for the models and methods introduced in this chapter are well beyond the scope of this text. Not only do the derivations require matrix algebra, but they also require advanced asymptotic theory of nonlinear estimation. The background needed for a careful analysis of these methods and several derivations are given in Wooldridge (2010).

It is instructive to see the formulas for obtaining the asymptotic standard errors for at least some of the methods. Given the binary response model $P(y = 1 | x) = G(x\beta)$, where $G(\cdot)$ is the logit or probit function, and $\beta$ is the $k \times 1$ vector of parameters, the asymptotic variance matrix of $\hat\beta$ is estimated as

$$\widehat{\mathrm{Avar}}(\hat\beta) \equiv \left\{ \sum_{i=1}^{n} \frac{[g(x_i \hat\beta)]^2 \, x_i' x_i}{G(x_i \hat\beta)[1 - G(x_i \hat\beta)]} \right\}^{-1}, \quad [17.55]$$

which is a $k \times k$ matrix. (See the online Appendix D for a summary of matrix algebra.) Without the terms involving $g(\cdot)$ and $G(\cdot)$, this formula looks a lot like the estimated variance matrix for the OLS estimator, minus the term $\hat\sigma^2$. The expression in (17.55) accounts for the nonlinear nature of the response probability, that is, the nonlinear nature of $G(\cdot)$, as well as the particular form of heteroskedasticity in a binary response model: $\mathrm{Var}(y | x) = G(x\beta)[1 - G(x\beta)]$.

The square roots of the diagonal elements of (17.55) are the asymptotic standard errors of the $\hat\beta_j$, and they are routinely reported by econometrics software that supports logit and probit analysis. Once we have these, (asymptotic) $t$ statistics and confidence intervals are obtained in the usual ways.

The matrix in (17.55) is also the basis for Wald tests of multiple restrictions on $\beta$ [see Wooldridge (2010, Chapter 15)].

The asymptotic variance matrix for Tobit is more complicated but has a similar structure. Note that we can obtain a standard error for $\hat\sigma$ as well. The asymptotic variance for Poisson regression, allowing for $\sigma^2 \neq 1$ in (17.35), has a form much like (17.55):

$$\widehat{\mathrm{Avar}}(\hat\beta) = \hat\sigma^2 \left[ \sum_{i=1}^{n} \exp(x_i \hat\beta) \, x_i' x_i \right]^{-1}. \quad [17.56]$$


The square roots of the diagonal elements of this matrix are the asymptotic standard errors. If the Poisson assumption holds, we can drop $\hat\sigma^2$ from the formula (because $\sigma^2 = 1$). Asymptotic standard errors for censored regression, truncated regression, and the Heckit sample selection correction are more complicated, although they share features with the previous formulas. [See Wooldridge (2010) for details.]
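As a closing sketch (simulated logit data; statsmodels assumed available), one can compute (17.55) directly and confirm that its square-rooted diagonal matches the standard errors a packaged logit routine reports. For the logit, $g(z) = G(z)[1 - G(z)]$, so the weight in (17.55) simplifies:

import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(10)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.uniform(size=n) < expit(X @ np.array([0.5, 1.5]))).astype(float)

fit = sm.Logit(y, X).fit(disp=0)
b = fit.params

G = expit(X @ b)                     # G(x_i b)
g = G * (1 - G)                      # logit density g(z) = G(z)(1 - G(z))
# Equation (17.55): [g(x_i b)]^2 x_i' x_i / (G(x_i b)[1 - G(x_i b)]),
# which for the logit reduces to the weight g itself
w = g ** 2 / (G * (1 - G))
avar = np.linalg.inv((X * w[:, None]).T @ X)

print(np.sqrt(np.diag(avar)))        # matches the packaged standard errors
print(fit.bse)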
