
Chapter 6

Bivariate Least Squares

In economics we frequently encounter models where the relationship between variables enables us to classify them into independent variables, which are not modeled, and a dependent variable, which is generated by the model. In order to better understand the underlying structure of such models we begin by restricting our attention to the bivariate case, where there is only one independent variable, and the linear case, where the relationship among the variables is linear. Since the relationship is not exact, we specify an additive error term, which is assumed to be a random variable. An example would be the wage equation w_i = \alpha + \beta E_i + u_i, where w_i is the wage earned by individual i and E_i is the education level attained by that individual.

6.1 Introduction

6.1.1 The Bivariate Linear Model

Consider the model

yi = α+ βxi + ui, i = 1, 2, . . . , n (6.1)

where y_i is the dependent variable, x_i is the independent or explanatory variable, and u_i is the unobservable disturbance. This is the bivariate linear regression model. It is linear in the variables (given \alpha and \beta), linear in the parameters (given x_i), and linear in the disturbances.

In matrix form, we can gather all the observations together as

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix},   (6.2)

or, more compactly,

y = X\beta + u,   (6.3)


where y is an n × 1 vector of all n observations on y_i, u is an n × 1 vector of all n observations on u_i, X is an n × 2 matrix with the first column a vector of ones and the second column the observations on x_i, and \beta = (\alpha\ \ \beta)'.

Example 6.1. The consumption example from the first chapter is linear:

Ct = α+ βDt + ut

where C_t is aggregate consumption and D_t is aggregate disposable income at time t. Except for the definition of the variables and the use of the index t instead of i, it is the same model. □

Example 6.2. Neither of the following two equations is linear:

yi = (α+ βxi)ui

y_i = \alpha x_i^{\beta} + u_i.

The first is nonlinear in the variables while the second is nonlinear in the parameters. □

6.1.2 Assumptions

For the disturbances, we assume that

(i) E(u_i) = 0, for all i

(ii) E(u_i^2) = \sigma^2, for all i

(iii) E(u_i u_j) = 0, for all i \neq j

For the independent variable, we suppose

(iv) x_i nonstochastic for all i

(v) x_i nonconstant

For purposes of inference in finite samples, we sometimes assume

(vi) u_i \sim i.i.d. N(0, \sigma^2), for all i.

6.1.3 Line Fitting

Consider a scatter of points for two unspecified variables x and y, as shown in Figure 6.1.

[Figure 6.1: Scatter Plot — a scatter of the five data points (x, y) = (2, 12), (3, 7), (4, 8), (5, 5), (6, 3), with the x and y axes running from 0 to 14.]

We assume that x and y are, in some way, linearly related. If this were exactly true, then we could write

y_i = \alpha + \beta x_i

for some \alpha and \beta. Unfortunately, we find that no single line "matches" all the observations. Since no single line is entirely consistent with the data, we might choose the \alpha and \beta that best "fit" the data, in some sense. Define

e_i = y_i - (\alpha + \beta x_i), \quad \text{for } i = 1, 2, \ldots, n   (6.4)

as the discrepancy between the chosen line and the observations. Our objective then is to choose \alpha and \beta to minimize the discrepancy.

6.1.4 Least Squares

A possible criterion for minimum discrepancy is

\min_{\alpha,\beta} \Big| \sum_i e_i \Big|,

but this will be zero for any line passing through (\bar{x}, \bar{y}). Another possibility is

\min_{\alpha,\beta} \sum_i |e_i|,

which is called the minimum absolute distance (MAD) or L_1 estimator. The MAD estimator has problems since the mathematics (and the statistical distributions) are relatively complicated. A closely related choice is

\min_{\alpha,\beta} \sum_i e_i^2.   (6.5)

This yields the least squares or L2 estimator.


6.2 Least Squares Regression

6.2.1 The First-Order Conditions

Let

\psi = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2.   (6.6)

Then the minimizing values, say \hat{\alpha} and \hat{\beta}, must satisfy the following first-order conditions:

0 = \frac{\partial \psi}{\partial \alpha} = \sum_{i=1}^{n} -2(y_i - \hat{\alpha} - \hat{\beta} x_i)

0 = \frac{\partial \psi}{\partial \beta} = \sum_{i=1}^{n} -2(y_i - \hat{\alpha} - \hat{\beta} x_i) x_i

These first-order conditions may be rewritten as the normal equations

\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} (\hat{\alpha} + \hat{\beta} x_i)   (6.7)

\sum_{i=1}^{n} y_i x_i = \sum_{i=1}^{n} (\hat{\alpha} x_i + \hat{\beta} x_i^2)   (6.8)

Clearly, (6.7) implies

\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x},   (6.9)

where \bar{y} = \sum_{i=1}^{n} y_i / n and \bar{x} = \sum_{i=1}^{n} x_i / n.

Substituting (6.9) into (6.8) yields

\sum_{i=1}^{n} y_i x_i = \sum_{i=1}^{n} [\,(\bar{y} - \hat{\beta}\bar{x}) x_i + \hat{\beta} x_i^2\,] = \bar{y} \sum_{i=1}^{n} x_i + \hat{\beta} \sum_{i=1}^{n} x_i (x_i - \bar{x}),

and rearranging we obtain

\sum_{i=1}^{n} y_i x_i - n\bar{y}\bar{x} = \hat{\beta} \Big( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \Big)

\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) = \hat{\beta} \sum_{i=1}^{n} (x_i - \bar{x})^2.

Provided \sum_{i=1}^{n} (x_i - \bar{x})^2 \neq 0, which is assured by Assumption (v) above, this may be solved for \hat{\beta} to obtain

\hat{\beta} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.   (6.10)
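The closed-form solutions (6.9) and (6.10) are easy to evaluate directly. The following is a minimal Python sketch, not part of the original notes: the function name bivariate_ols and the use of plain lists are our choices, and the data are simply the scatter-plot values from Figure 6.1.

def bivariate_ols(x, y):
    """Return (alpha_hat, beta_hat) from formulas (6.9) and (6.10)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # beta_hat = sum (y_i - y_bar)(x_i - x_bar) / sum (x_i - x_bar)^2
    sxy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sxy / sxx
    # alpha_hat = y_bar - beta_hat * x_bar
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Example with the scatter-plot data of Figure 6.1:
x = [2, 3, 4, 5, 6]
y = [12, 7, 8, 5, 3]
print(bivariate_ols(x, y))   # expected (15.0, -2.0), as in Section 6.5

The same numbers are derived by hand in Section 6.5, so the sketch doubles as a check on the worked example.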


6.2.2 The Second-Order Conditions

Note that the least squares estimators \hat{\beta}, and in turn \hat{\alpha}, are unique. Taking the second derivatives of the objective function yields

\frac{\partial^2 \psi}{\partial \alpha^2} = 2n, \qquad \frac{\partial^2 \psi}{\partial \beta^2} = 2\sum_{i=1}^{n} x_i^2,

and

\frac{\partial^2 \psi}{\partial \alpha\,\partial \beta} = 2\sum_{i=1}^{n} x_i.

Thus, the Hessian matrix is

H = \begin{pmatrix} 2n & 2\sum_{i=1}^{n} x_i \\ 2\sum_{i=1}^{n} x_i & 2\sum_{i=1}^{n} x_i^2 \end{pmatrix},

which is a positive definite matrix provided \sum_{i=1}^{n} (x_i - \bar{x})^2 \neq 0, so \hat{\alpha} and \hat{\beta} yield a unique minimum.

6.2.3 Matrix Interpretation

Matrix algebra provides an alternative route to the same solutions. Note that the normal equations (6.7) and (6.8) are linear in \hat{\alpha} and \hat{\beta}. In matrix form, we have

\begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n y_i x_i \end{pmatrix} = \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix} \begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix},

which has the solution

\begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix}
= \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n y_i x_i \end{pmatrix}   (6.11)
= \frac{1}{n\sum_{i=1}^n x_i^2 - (\sum_{i=1}^n x_i)^2} \begin{pmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{pmatrix} \begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n y_i x_i \end{pmatrix}
= \frac{1}{n\sum_{i=1}^n x_i^2 - (\sum_{i=1}^n x_i)^2} \begin{pmatrix} \sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i - \sum_{i=1}^n x_i \sum_{i=1}^n x_i y_i \\ -\sum_{i=1}^n x_i \sum_{i=1}^n y_i + n\sum_{i=1}^n x_i y_i \end{pmatrix}.

Now, \hat{\beta}, according to this formula, is

\hat{\beta} = \frac{n\sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n\sum_{i=1}^n x_i^2 - (\sum_{i=1}^n x_i)^2}
= \frac{\sum_{i=1}^n x_i y_i - \bar{x}\sum_{i=1}^n y_i}{\sum_{i=1}^n x_i^2 - \bar{x}\sum_{i=1}^n x_i}
= \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}
= \frac{\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2},

while

\hat{\alpha} = \frac{\sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i - \sum_{i=1}^n x_i \sum_{i=1}^n x_i y_i}{n\sum_{i=1}^n x_i^2 - (\sum_{i=1}^n x_i)^2}
= \frac{\bar{y}\sum_{i=1}^n x_i^2 - \bar{x}\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2 - \bar{x}\sum_{i=1}^n x_i}
= \frac{\bar{y}\sum_{i=1}^n x_i^2 - \bar{y}\bar{x}\sum_{i=1}^n x_i + \bar{y}\bar{x}\sum_{i=1}^n x_i - \bar{x}\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2 - \bar{x}\sum_{i=1}^n x_i}
= \frac{\bar{y}(\sum_{i=1}^n x_i^2 - \bar{x}\sum_{i=1}^n x_i) + \bar{x}^2\sum_{i=1}^n y_i - \bar{x}\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2 - \bar{x}\sum_{i=1}^n x_i}
= \bar{y} + \frac{\bar{x}^2\sum_{i=1}^n y_i - \bar{x}\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2 - \bar{x}\sum_{i=1}^n x_i}
= \bar{y} + \frac{\bar{x}\sum_{i=1}^n y_i - \sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2 - \bar{x}\sum_{i=1}^n x_i}\,\bar{x}
= \bar{y} - \hat{\beta}\bar{x}.

Note that, in the compact notation given following (6.2), we can write the solution to (6.11) using the general solution from the next chapter, namely

\hat{\beta} = (X'X)^{-1}X'y.

6.3 Basic Statistical Properties

6.3.1 Method Of Moments Interpretation

Suppose, for the moment, that x_i is a random variable that is uncorrelated with u_i. Let \mu_x = E(x_i). Then

µy = E(yi) = E(α+ βxi + ui) = α+ βµx. (6.12)

Thus, subtracting (6.12) from (6.1) yields

(yi − µy) = α+ βxi + ui − (α+ βµx) = β(xi − µx) + ui,


and

E[(y_i - \mu_y)(x_i - \mu_x)] = \beta E(x_i - \mu_x)^2 + E[(x_i - \mu_x)u_i] = \beta E(x_i - \mu_x)^2,

since x_i and u_i are uncorrelated. Solving for \beta, we have

\beta = \frac{E[(x_i - \mu_x)(y_i - \mu_y)]}{E[(x_i - \mu_x)^2]},

while

\alpha = \mu_y - \beta\mu_x.

These values are unique provided the variance of the independent variable is nonzero.

These expressions form the basis for estimators by replacing the various population moments with sample moments. Specifically, we have, by analogy,

\hat{\mu}_x = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat{\mu}_y = \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i,

\hat{\beta} = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}_x)(y_i - \hat{\mu}_y)}{\frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}_x)^2} = \frac{\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2},

\hat{\alpha} = \hat{\mu}_y - \hat{\beta}\hat{\mu}_x = \bar{y} - \hat{\beta}\bar{x}.

Thus the least-squares estimators are method of moments estimators, with sample moments replacing the population moments.

6.3.2 Mean of Estimators

Next, note that

\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})\,y_i}{\sum_{i=1}^n (x_i - \bar{x})^2}
= \frac{\sum_{i=1}^n (x_i - \bar{x})(\alpha + \beta x_i + u_i)}{\sum_{i=1}^n (x_i - \bar{x})^2}
= \alpha\,\frac{\sum_{i=1}^n (x_i - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2} + \beta\,\frac{\sum_{i=1}^n (x_i - \bar{x})x_i}{\sum_{i=1}^n (x_i - \bar{x})^2} + \frac{\sum_{i=1}^n (x_i - \bar{x})u_i}{\sum_{i=1}^n (x_i - \bar{x})^2}
= \beta + \frac{\sum_{i=1}^n (x_i - \bar{x})u_i}{\sum_{i=1}^n (x_i - \bar{x})^2}.   (6.13)


So E[\hat{\beta}] = \beta, since E[u_i] = 0 for all i. Thus, \hat{\beta} is an unbiased estimator of \beta. Further,

\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = \sum_{i=1}^n \frac{y_i}{n} - \bar{x}\,\frac{\sum_{i=1}^n (x_i - \bar{x})y_i}{\sum_{i=1}^n (x_i - \bar{x})^2}
= \sum_{i=1}^n \Big( \frac{1}{n} - \bar{x}\,\frac{x_i - \bar{x}}{\sum_{i=1}^n (x_i - \bar{x})^2} \Big) y_i
= \alpha \sum_{i=1}^n \Big( \frac{1}{n} - \bar{x}\,\frac{x_i - \bar{x}}{\sum_{i=1}^n (x_i - \bar{x})^2} \Big) + \beta \sum_{i=1}^n \Big( \frac{x_i}{n} - \bar{x}\,\frac{(x_i - \bar{x})x_i}{\sum_{i=1}^n (x_i - \bar{x})^2} \Big) + \sum_{i=1}^n \Big( \frac{1}{n} - \bar{x}\,\frac{x_i - \bar{x}}{\sum_{i=1}^n (x_i - \bar{x})^2} \Big) u_i
= \alpha + \sum_{i=1}^n \Big( \frac{1}{n} - \bar{x}\,\frac{x_i - \bar{x}}{\sum_{i=1}^n (x_i - \bar{x})^2} \Big) u_i.   (6.14)

So we have E[\hat{\alpha}] = \alpha, and \hat{\alpha} is also an unbiased estimator of \alpha.

6.3.3 Variance of Estimators

Now, from (6.13) we have

\hat{\beta} - \beta = \frac{\sum_{i=1}^n (x_i - \bar{x})u_i}{\sum_{i=1}^n (x_i - \bar{x})^2} = \sum_{i=1}^n w_i u_i,

where

w_i = \frac{x_i - \bar{x}}{\sum_{j=1}^n (x_j - \bar{x})^2}.

So,

\mathrm{Var}(\hat{\beta}) = E[(\hat{\beta} - \beta)^2]   (6.15)
= E\Big[\Big(\sum_{i=1}^n w_i u_i\Big)^2\Big] = E[(w_1 u_1 + w_2 u_2 + \cdots + w_n u_n)^2]
= E[\,w_1^2 u_1^2 + w_1 u_1 w_2 u_2 + \cdots + w_1 u_1 w_n u_n
\quad + w_2 u_2 w_1 u_1 + w_2^2 u_2^2 + \cdots + w_2 u_2 w_n u_n
\quad \vdots
\quad + w_n u_n w_1 u_1 + w_n u_n w_2 u_2 + \cdots + w_n^2 u_n^2\,]
= w_1^2\sigma^2 + w_2^2\sigma^2 + \cdots + w_n^2\sigma^2
= \sigma^2 \sum_{i=1}^n w_i^2 = \sigma^2 \sum_{i=1}^n \Big( \frac{x_i - \bar{x}}{\sum_{j=1}^n (x_j - \bar{x})^2} \Big)^2
= \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}.   (6.16)


Next, we note from (6.14) that

\hat{\alpha} - \alpha = \sum_{i=1}^n \Big( \frac{1}{n} - \bar{x}\,\frac{x_i - \bar{x}}{\sum_{j=1}^n (x_j - \bar{x})^2} \Big) u_i = \sum_{i=1}^n v_i u_i,

where

v_i = \frac{1}{n} - \bar{x}\,\frac{x_i - \bar{x}}{\sum_{j=1}^n (x_j - \bar{x})^2}.

So, in a fashion similar to the above, we find that

\mathrm{Var}(\hat{\alpha}) = \sigma^2\,\frac{\sum_{i=1}^n x_i^2}{n\sum_{i=1}^n (x_i - \bar{x})^2},   (6.17)

and

\mathrm{Cov}(\hat{\alpha}, \hat{\beta}) = E[(\hat{\alpha} - \alpha)(\hat{\beta} - \beta)] = -\sigma^2\,\frac{\sum_{i=1}^n x_i}{n\sum_{i=1}^n (x_i - \bar{x})^2}.   (6.18)
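Since the x_i are treated as fixed, the unbiasedness result and the variance formula (6.16) can be illustrated by simulation. Below is a small Monte Carlo sketch; the parameter values, the fixed x vector, and the number of replications are all our own assumptions for illustration, not values from the text.

# Monte Carlo check (assumed setup) of E[beta_hat] = beta and of formula (6.16),
# Var(beta_hat) = sigma^2 / sum (x_i - x_bar)^2, with a fixed regressor vector.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, sigma = 15.0, -2.0, 1.0
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])        # fixed (nonstochastic) regressor
q = np.sum((x - x.mean()) ** 2)                # q = sum (x_i - x_bar)^2

beta_hats = []
for _ in range(20000):
    y = alpha + beta * x + rng.normal(0.0, sigma, size=x.size)
    b = np.sum((x - x.mean()) * y) / q         # beta_hat as in (6.10)
    beta_hats.append(b)

beta_hats = np.array(beta_hats)
print(beta_hats.mean())   # close to beta = -2        (unbiasedness)
print(beta_hats.var())    # close to sigma^2 / q = 0.1 (formula (6.16))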

6.3.4 Estimation of σ2

Next, we would like to obtain an estimate of \sigma^2. Let

s^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2,   (6.19)

where

e_i = y_i - \hat{\alpha} - \hat{\beta} x_i
= (y_i - \bar{y}) - \hat{\beta}(x_i - \bar{x})
= \beta(x_i - \bar{x}) + (u_i - \bar{u}) - \hat{\beta}(x_i - \bar{x})
= -(\hat{\beta} - \beta)(x_i - \bar{x}) + (u_i - \bar{u}),

using \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} and y_i - \bar{y} = \beta(x_i - \bar{x}) + (u_i - \bar{u}). So,

\sum_{i=1}^n e_i^2 = \sum_{i=1}^n [-(\hat{\beta} - \beta)(x_i - \bar{x}) + (u_i - \bar{u})]^2
= (\hat{\beta} - \beta)^2\sum_{i=1}^n (x_i - \bar{x})^2 - 2(\hat{\beta} - \beta)\sum_{i=1}^n (x_i - \bar{x})(u_i - \bar{u}) + \sum_{i=1}^n (u_i - \bar{u})^2.

Now, we have

E\Big[(\hat{\beta} - \beta)^2\sum_{i=1}^n (x_i - \bar{x})^2\Big] = \sigma^2,

E\Big[\sum_{i=1}^n (u_i - \bar{u})^2\Big] = E\Big[\sum_{i=1}^n u_i^2 - n\bar{u}^2\Big] = E\Big[\sum_{i=1}^n u_i^2\Big] - \frac{1}{n}E\Big[\Big(\sum_{i=1}^n u_i\Big)^2\Big] = (n-1)\sigma^2,

and

E\Big[(\hat{\beta} - \beta)\sum_{i=1}^n (x_i - \bar{x})(u_i - \bar{u})\Big]
= E\Big[\Big(\sum_{i=1}^n w_i u_i\Big)\Big(\sum_{i=1}^n (x_i - \bar{x})u_i - \bar{u}\sum_{i=1}^n (x_i - \bar{x})\Big)\Big]
= E\Big[\Big(\sum_{i=1}^n w_i u_i\Big)\sum_{i=1}^n (x_i - \bar{x})u_i\Big] = \sum_{i=1}^n w_i(x_i - \bar{x})\sigma^2 = \sigma^2.

Therefore, collecting terms, we have

E\Big[\sum_{i=1}^n e_i^2\Big] = \sigma^2 - 2\sigma^2 + (n-1)\sigma^2 = (n-2)\sigma^2,

and

E[s^2] = \sigma^2.   (6.20)

6.4 Statistical Properties Under Normality

6.4.1 Distribution of β

Suppose, under Assumption (vi), that u_i \sim i.i.d. N(0, \sigma^2). Recall from (6.13) that

\hat{\beta} = \beta + \sum_{i=1}^n w_i u_i,

where w_i = (x_i - \bar{x})/\sum_{j=1}^n (x_j - \bar{x})^2, so \hat{\beta} is also a normal random variable. Specifically,

\hat{\beta} \sim N(\beta, \sigma^2/q),   (6.21)

where q = \sum_{i=1}^n (x_i - \bar{x})^2. It follows that, for a given \beta_0,

\hat{\beta} - \beta_0 \sim N(\beta - \beta_0, \sigma^2/q),

and

z_{\beta} = \frac{\hat{\beta} - \beta_0}{\sqrt{\sigma^2/q}} \sim N\Big(\frac{\beta - \beta_0}{\sqrt{\sigma^2/q}},\ 1\Big).

This ratio forms the basis for inferences regarding \beta. Under the null hypothesis H_0: \beta = \beta_0,

\frac{\hat{\beta} - \beta_0}{\sqrt{\sigma^2/q}} \sim N(0, 1),


and the probability of values exceeding, say, 1.96 is well known to be .025. Under the alternative hypothesis H_1: \beta = \beta_1 > \beta_0 we have

\frac{\hat{\beta} - \beta_0}{\sqrt{\sigma^2/q}} \sim N\Big(\frac{\beta_1 - \beta_0}{\sqrt{\sigma^2/q}},\ 1\Big),

and we would expect the statistic to be centered to the right of zero, so the probability of exceeding 1.96 will be increased. Thus, for values exceeding 1.96 we are faced with a choice between the null hypothesis, under which such a value is unlikely, and the alternative hypothesis, under which it is more likely.

6.4.2 Distribution of α

Again, suppose that u_i \sim i.i.d. N(0, \sigma^2). From (6.14) above,

\hat{\alpha} = \alpha + \sum_{i=1}^n v_i u_i,

where v_i = 1/n - \bar{x}(x_i - \bar{x})/\sum_{j=1}^n (x_j - \bar{x})^2, so \hat{\alpha} is a normal random variable. Specifically,

\hat{\alpha} \sim N(\alpha, \sigma^2 p/q),   (6.22)

where p = \sum_{i=1}^n x_i^2/n. For a given \alpha_0,

\hat{\alpha} - \alpha_0 \sim N(\alpha - \alpha_0, \sigma^2 p/q),

and

z_{\alpha} = \frac{\hat{\alpha} - \alpha_0}{\sqrt{\sigma^2 p/q}} \sim N\Big(\frac{\alpha - \alpha_0}{\sqrt{\sigma^2 p/q}},\ 1\Big).

Now, suppose that H_0: \alpha = \alpha_0. Then

\frac{\hat{\alpha} - \alpha_0}{\sqrt{\sigma^2 p/q}} \sim N(0, 1),   (6.23)

while for H_1: \alpha = \alpha_1 \neq \alpha_0 we have

\frac{\hat{\alpha} - \alpha_0}{\sqrt{\sigma^2 p/q}} \sim N\Big(\frac{\alpha_1 - \alpha_0}{\sqrt{\sigma^2 p/q}},\ 1\Big),

and we would expect the statistic not to be centered around zero.

It is important to note that

p/q = \frac{\sum_{i=1}^n x_i^2}{n\sum_{i=1}^n (x_i - \bar{x})^2},

and so \sqrt{\sigma^2 p/q} grows small as n grows large. Thus, the noncentrality of the distribution of the statistic z_{\alpha} will grow under the alternative. A similar statement can be made concerning q and the statistic z_{\beta}.


6.4.3 t-Distribution

In most cases, we do not know the value of \sigma^2. A natural alternative is to use s^2, whereupon we find

\frac{\hat{\alpha} - \alpha_0}{\sqrt{s^2 p/q}} \sim t_{n-2},   (6.24)

under H_0: \alpha = \alpha_0, and

\frac{\hat{\beta} - \beta_0}{\sqrt{s^2/q}} \sim t_{n-2},   (6.25)

under H_0: \beta = \beta_0.

The t-distribution is quite similar to the standard normal except for being slightly fatter in the tails. This reflects the added uncertainty introduced by using s^2 rather than \sigma^2. However, as n increases and the precision of s^2 improves, the t-distribution grows closer and closer to the N(0, 1). We lose two degrees of freedom (and so t_{n-2}) because we estimated two coefficients, namely \alpha and \beta.

Just as in the case of z_{\alpha} and z_{\beta}, we would expect the t-statistic to be off-center if the null hypothesis were not true. Specifically, it would have a noncentral t-distribution. This will be covered in more detail for the more general regression model.
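The statistic (6.25) and its tail probability are straightforward to compute. The sketch below, which is our own illustration rather than part of the notes, uses the Figure 6.1 data and scipy.stats for the t_{n-2} tail probability; the variable names are assumptions.

# Sketch: the t-statistic of (6.25) with beta_0 = 0 and its two-sided p-value.
import numpy as np
from scipy import stats

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([12.0, 7.0, 8.0, 5.0, 3.0])
n = x.size

q = np.sum((x - x.mean()) ** 2)
beta_hat = np.sum((x - x.mean()) * y) / q
alpha_hat = y.mean() - beta_hat * x.mean()
e = y - (alpha_hat + beta_hat * x)
s2 = np.sum(e ** 2) / (n - 2)                   # s^2 as in (6.19)

beta0 = 0.0
t_stat = (beta_hat - beta0) / np.sqrt(s2 / q)   # (6.25)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(t_stat, p_value)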

6.4.4 Maximum Likelihood

Suppose that u_i \sim i.i.d. N(0, \sigma^2). Then

y_i \sim N(\alpha + \beta x_i, \sigma^2), \quad \text{independently across } i.   (6.26)

The pdf of y_i is given by

f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{1}{2\sigma^2}[\,y_i - (\alpha + \beta x_i)\,]^2\Big\}.

Since the observations are independent, we can write the joint likelihood function as

f(y_1, y_2, \ldots, y_n) = f(y_1)f(y_2)\cdots f(y_n)
= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n [\,y_i - (\alpha + \beta x_i)\,]^2\Big\} = L(\alpha, \beta, \sigma^2 \mid \mathbf{y}, \mathbf{x}).

Now, let \mathcal{L}(\alpha, \beta, \sigma^2 \mid \mathbf{y}, \mathbf{x}) = \ln L(\alpha, \beta, \sigma^2 \mid \mathbf{y}, \mathbf{x}). We seek to maximize

\mathcal{L}(\alpha, \beta, \sigma^2 \mid \mathbf{y}, \mathbf{x}) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n [\,y_i - (\alpha + \beta x_i)\,]^2.   (6.27)

Note that for (6.27) to be a maximum with respect to \alpha and \beta, we must minimize \sum_{i=1}^n [\,y_i - (\alpha + \beta x_i)\,]^2.


The first-order conditions for maximizing (6.27) are

\frac{\partial\mathcal{L}}{\partial\alpha} = \frac{1}{\sigma^2}\sum_{i=1}^n [\,y_i - (\alpha + \beta x_i)\,] = 0,   (6.28)

\frac{\partial\mathcal{L}}{\partial\beta} = \frac{1}{\sigma^2}\sum_{i=1}^n [\,y_i - (\alpha + \beta x_i)\,]x_i = 0,   (6.29)

and

\frac{\partial\mathcal{L}}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n [\,y_i - (\alpha + \beta x_i)\,]^2 = 0.   (6.30)

Note that the first two conditions imply that

\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x},

and

\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2},

since these are the same as the normal equations (aside from the factor 1/\sigma^2). The third condition yields

\hat{\sigma}^2 = \frac{\sum_{i=1}^n [\,y_i - (\hat{\alpha} + \hat{\beta}x_i)\,]^2}{n} = \frac{\sum_{i=1}^n e_i^2}{n} = \frac{n-2}{n}\,s^2.   (6.31)

6.5 An Example

Consider the scatter graph given in Figure 6.1. From this, we construct Table 6.1.

x_i   y_i   x_i - \bar{x}   y_i - \bar{y}   (x_i - \bar{x})^2   (x_i - \bar{x})y_i
2     12    -2              5               4                   -24
3     7     -1              0               1                   -7
4     8     0               1               0                   0
5     5     1               -2              1                   5
6     3     2               -4              4                   6
20    35    0               0               10                  -20

Table 6.1: Summary table. (Note that \sum_i (x_i - \bar{x})y_i = \sum_i (x_i - \bar{x})(y_i - \bar{y}), since \sum_i (x_i - \bar{x})\bar{y} = 0.)


Thus, we have

\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{-20}{10} = -2,

and

\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 7 - (-2)\cdot 4 = 15.

Now, we calculate the residuals in Table 6.2,

\hat{\beta}x_i   \hat{\alpha} + \hat{\beta}x_i   e_i    e_i^2
-4               11                              1      1
-6               9                               -2     4
-8               7                               1      1
-10              5                               0      0
-12              3                               0      0
                                                 0      6

Table 6.2: Residual calculation.

and find

s^2 = \frac{\sum_{i=1}^n e_i^2}{n-2} = \frac{6}{3} = 2.

An estimate of the variance of \hat{\beta}, namely \sigma^2/q, is provided by

s^2/q = s^2\,\frac{1}{\sum_{i=1}^n (x_i - \bar{x})^2} = 2\cdot\frac{1}{10} = 0.2.

Suppose we wish to test H_0: \beta = 0 against H_1: \beta \neq 0. Then

\frac{\hat{\beta} - 0}{\sqrt{s^2/q}} \sim t_3

under the null hypothesis. The critical values for a symmetric 5% test are ±3.182, which mark off the 2.5% tails. Since

\frac{\hat{\beta} - 0}{\sqrt{s^2/q}} = \frac{-2}{\sqrt{0.2}} = \frac{-2}{0.447} = -4.47

is clearly in the left-hand 2.5% tail of the t_3-distribution, we reject the null hypothesis at the 5% significance level.
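As a quick cross-check of the worked example, the sketch below (our own, not part of the notes) recovers the fitted line with numpy's polyfit and retrieves the t_3 critical value used above from scipy.

# Cross-check of the Section 6.5 example; np.polyfit is used only as an
# independent reference for the fitted line.
import numpy as np
from scipy import stats

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([12.0, 7.0, 8.0, 5.0, 3.0])

slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)              # -2.0 and 15.0, matching beta_hat and alpha_hat

# Critical value for the symmetric 5% test with 3 degrees of freedom:
print(stats.t.ppf(0.975, df=3))      # about 3.182, as used above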


Chapter 7

Multivariate Least Squares

The bivariate regression model studied in the previous chapter enables us to study the relationship between two variables. One is taken as the dependent variable and the other as the independent or explanatory variable. We simplified by assuming that the relationship was linear in the parameters given the variables. In much of economic analysis, however, the models considered involve more than two variables. Accordingly, in this chapter we begin to consider models that have more than one independent or explanatory variable. Again, we simplify by assuming that the relationship is linear in the parameters given the variables and that the error term is additive.

7.1 Introduction

7.1.1 Multiple Regression Model

The general k-variable linear model can be written as

yi = β1xi1 + β2xi2 + · · ·+ βkxik + ui i = 1, 2, . . . , n (7.1)

where x_{ij} denotes the i-th observation on the j-th explanatory variable. For the bivariate model studied in the previous chapter, k = 2; x_{i1} = 1 yields \beta_1 as the intercept and x_{i2} = x_i yields \beta_2 as the slope coefficient. In general, for any model with an intercept term, the corresponding "variable" will be identically unity. The variables x_{ij} are known as regressors or covariates.

Using matrix notation, we can equivalently write all observations on this model as

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix},


or, more compactly, as

y = X\beta + u,   (7.2)

where y (n × 1), X (n × k), \beta (k × 1), and u (n × 1) are

y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}, \quad u = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}.

7.1.2 Assumptions

For the disturbances, we assume

(i) E(u_i) = 0 for all i

(ii) E(u_i^2) = \sigma^2 for all i

(iii) E(u_i u_{\ell}) = 0 for all i \neq \ell.

For the independent or explanatory variables, we assume

(iv) x_{ij} nonstochastic for all i, j

(v) the regressors (x_{1j}, x_{2j}, \ldots, x_{nj}), j = 1, 2, \ldots, k, are linearly independent.

For the purposes of inference in finite samples, we sometimes also assume

(vi) u_i \sim i.i.d. N(0, \sigma^2) for all i.

The assumptions on the disturbances are the same as for the bivariate model. Likewise, the assumption that the explanatory variables are all nonstochastic is the same, and it covers the case of one of the variables being unity for the intercept term. The linear independence assumption is introduced to make sure that no term is redundant: if one of the variables were a constant linear combination of the other variables for all i, then we could eliminate that variable and use a model with one less explanatory variable.

7.1.3 Matrix Form Assumptions

The assumptions can be rewritten in terms of the matrix form of the model. Specifically, we have, for the disturbances,

(i) E(u ) = 0


and

(ii),(iii) Cov(u ) = E(uu′ ) = σ2In,

where In is an n×n identity matrix. And for the explanatory variables, we have

(iv) X is nonstochastic.

(v) X has full column rank (the columns are linearly independent).

And for inferences in finite samples, we have

(vi) u ∼ N( 0, σ2In ).

7.1.4 Plane Fitting

It is useful to consider the approach taken in choosing an estimator for \beta in the trivariate model, where the relationships may still be visualized geometrically. Accordingly, we have k = 3 and x_{i1} = 1, so

y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + u_i, \quad i = 1, 2, \ldots, n.

Now the triples (y_i, x_{i2}, x_{i3}) define a cloud of n points in the three-dimensional space of y, x_2, and x_3, and for some choice of, say, \hat{\beta}_1, \hat{\beta}_2, and \hat{\beta}_3,

\hat{y}_i = \hat{\beta}_1 + \hat{\beta}_2 x_{i2} + \hat{\beta}_3 x_{i3}, \quad i = 1, 2, \ldots, n,

defines a plane in that space. We seek to choose \hat{\beta}_1, \hat{\beta}_2, and \hat{\beta}_3 so that the points on the plane corresponding to x_{i2} and x_{i3}, namely \hat{y}_i, will be close to y_i. That is, we will "fit" a plane to the cloud of observations.

As in the two-dimensional case, we choose to measure closeness by vertical distance. Specifically, define

e_i = y_i - (\hat{\beta}_1 + \hat{\beta}_2 x_{i2} + \hat{\beta}_3 x_{i3}), \quad i = 1, 2, \ldots, n,

and we choose \hat{\beta}_1, \hat{\beta}_2, and \hat{\beta}_3 to minimize the sum of these squared residuals. The result of this minimization defines the least-squares estimator.

More generally, for the k-variate case, we choose \hat{\beta}_1 through \hat{\beta}_k so that the points lying on the hyperplane in the k-dimensional space of the y's and x's defined by

\hat{y}_i = \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik}, \quad i = 1, 2, \ldots, n,

are close to y_i. Specifically, we define

e_i = y_i - (\hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik}), \quad i = 1, 2, \ldots, n,


and our measure of distance is given by

\psi(\beta) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n [\,y_i - (\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})\,]^2.   (7.3)

The least-squares estimator for this general case is then given, as before, by

\hat{\beta} = \operatorname*{argmin}_{\beta}\,\psi(\beta).   (7.4)

7.2 Least Squares Regression

7.2.1 The OLS Estimator

A necessary condition for the minimization of this function is that the first partial derivatives be zero at the solution. These first-order conditions are

0 = \frac{\partial\psi}{\partial\beta_1} = -2\sum_{i=1}^n [\,y_i - (\hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik})\,]x_{i1},

0 = \frac{\partial\psi}{\partial\beta_2} = -2\sum_{i=1}^n [\,y_i - (\hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik})\,]x_{i2},

\vdots

0 = \frac{\partial\psi}{\partial\beta_k} = -2\sum_{i=1}^n [\,y_i - (\hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik})\,]x_{ik},

where (\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k) are the solutions. Rearranging, we have the normal equations:

\hat{\beta}_1\sum_{i=1}^n x_{i1}^2 + \hat{\beta}_2\sum_{i=1}^n x_{i1}x_{i2} + \cdots + \hat{\beta}_k\sum_{i=1}^n x_{i1}x_{ik} = \sum_{i=1}^n x_{i1}y_i

\hat{\beta}_1\sum_{i=1}^n x_{i2}x_{i1} + \hat{\beta}_2\sum_{i=1}^n x_{i2}^2 + \cdots + \hat{\beta}_k\sum_{i=1}^n x_{i2}x_{ik} = \sum_{i=1}^n x_{i2}y_i

\vdots

\hat{\beta}_1\sum_{i=1}^n x_{ik}x_{i1} + \hat{\beta}_2\sum_{i=1}^n x_{ik}x_{i2} + \cdots + \hat{\beta}_k\sum_{i=1}^n x_{ik}^2 = \sum_{i=1}^n x_{ik}y_i


which can be rewritten using matrix notation as

\begin{pmatrix} \sum_{i=1}^n x_{i1}^2 & \sum_{i=1}^n x_{i1}x_{i2} & \cdots & \sum_{i=1}^n x_{i1}x_{ik} \\ \sum_{i=1}^n x_{i2}x_{i1} & \sum_{i=1}^n x_{i2}^2 & \cdots & \sum_{i=1}^n x_{i2}x_{ik} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i=1}^n x_{ik}x_{i1} & \sum_{i=1}^n x_{ik}x_{i2} & \cdots & \sum_{i=1}^n x_{ik}^2 \end{pmatrix} \begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \\ \vdots \\ \hat{\beta}_k \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^n x_{i1}y_i \\ \sum_{i=1}^n x_{i2}y_i \\ \vdots \\ \sum_{i=1}^n x_{ik}y_i \end{pmatrix}.   (7.5)

Analogous to the bivariate model, we find

X'X = \begin{pmatrix} \sum_{i=1}^n x_{i1}^2 & \sum_{i=1}^n x_{i1}x_{i2} & \cdots & \sum_{i=1}^n x_{i1}x_{ik} \\ \sum_{i=1}^n x_{i2}x_{i1} & \sum_{i=1}^n x_{i2}^2 & \cdots & \sum_{i=1}^n x_{i2}x_{ik} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i=1}^n x_{ik}x_{i1} & \sum_{i=1}^n x_{ik}x_{i2} & \cdots & \sum_{i=1}^n x_{ik}^2 \end{pmatrix},   (7.6)

and

X'y = \begin{pmatrix} \sum_{i=1}^n x_{i1}y_i \\ \sum_{i=1}^n x_{i2}y_i \\ \vdots \\ \sum_{i=1}^n x_{ik}y_i \end{pmatrix}.   (7.7)

Substituting, we can write the normal equations (7.5) as

X'X\hat{\beta} = X'y,   (7.8)

where \hat{\beta}' = (\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k). Therefore, we have the unique solution

\hat{\beta} = (X'X)^{-1}X'y,   (7.9)

as long as |X'X| \neq 0, which is assured by Assumption (v). This estimator is known as the ordinary least squares (OLS) estimator. It is ordinary in the sense that it works well under the standard assumptions. When we attempt to deal with relaxation of these assumptions, more complicated estimators are suggested.
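Formula (7.9) translates directly into code. The sketch below is our own illustration (function name and data are assumptions, reusing the Chapter 6 example): rather than forming the explicit inverse, it solves the normal equations (7.8).

# Sketch of the OLS estimator (7.9): solve X'X beta = X'y as in (7.8).
import numpy as np

def ols(X, y):
    """Return beta_hat solving the normal equations X'X beta = X'y."""
    XtX = X.T @ X
    Xty = X.T @ y
    return np.linalg.solve(XtX, Xty)

# Hypothetical usage: first column of ones for the intercept.
X = np.column_stack([np.ones(5), np.array([2.0, 3.0, 4.0, 5.0, 6.0])])
y = np.array([12.0, 7.0, 8.0, 5.0, 3.0])
print(ols(X, y))   # [15., -2.], agreeing with the bivariate example of Chapter 6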

7.2.2 Some Algebraic Results

Define the fitted value for each i as \hat{y}_i = x_{i1}\hat{\beta}_1 + x_{i2}\hat{\beta}_2 + \cdots + x_{ik}\hat{\beta}_k = x_i'\hat{\beta}, whereupon

\hat{y} = X\hat{\beta}.   (7.10)

Next define the OLS residual for each i as e_i = y_i - \hat{y}_i, so

e = y - \hat{y}.   (7.11)

Then,

X'e = X'(y - \hat{y}) = X'y - X'X(X'X)^{-1}X'y = 0,   (7.12)


and we say that the residuals are orthogonal to the regressors. Also, we find

\hat{y}'y = (X\hat{\beta})'(X\hat{\beta} + e) = \hat{\beta}'X'X\hat{\beta} + \hat{\beta}'X'e = \hat{\beta}'X'X\hat{\beta}   [by (7.12)]
= \hat{y}'\hat{y}.   (7.13)

Now, suppose that the first coefficient is the intercept. Then the first column of X, and hence the first row of X', are all ones. This means that

0 = X'e = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_{12} & x_{22} & \cdots & x_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1k} & x_{2k} & \cdots & x_{nk} \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^n e_i \\ \sum_{i=1}^n x_{i2}e_i \\ \vdots \\ \sum_{i=1}^n x_{ik}e_i \end{pmatrix}.

So \sum_{i=1}^n e_i = 0, which means that

\sum_{i=1}^n y_i = \sum_{i=1}^n (\hat{y}_i + e_i) = \sum_{i=1}^n \hat{y}_i.   (7.14)

Finally, we note that

e = y - X\hat{\beta}
= y - X(X'X)^{-1}X'y
= [\,I_n - X(X'X)^{-1}X'\,]y
= My
= M(X\beta + u)
= [\,I_n - X(X'X)^{-1}X'\,](X\beta) + Mu
= Mu.   (7.15)

We see that the OLS residuals are a linear transformation of the underlying disturbances. The matrix M = I_n - X(X'X)^{-1}X', which is called the influence function, plays an important role in the sequel. Note that it is both symmetric and idempotent, the latter meaning M = M\cdot M.
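These algebraic facts are easy to confirm numerically. The sketch below is our own check under an assumed random design (the seed, sizes, and variable names are arbitrary); it verifies (7.12), the idempotence of M, and e = My from (7.15).

# Numerical check of the algebraic results (7.12) and (7.15).
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # assumed design
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)                # M = I - X(X'X)^{-1}X'

print(np.allclose(X.T @ e, 0))    # X'e = 0
print(np.allclose(M, M @ M))      # M idempotent
print(np.allclose(M @ y, e))      # e = My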

7.2.3 The R2 Statistic

Define the following:

SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2,   (7.16)


SST = \sum_{i=1}^n (y_i - \bar{y})^2,   (7.17)

SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2.   (7.18)

Note that SSE is the variation of the actual values around the fitted plane and is called the unexplained or error sum-of-squares. SSR, the regression (or explained) sum-of-squares, is the variation of the fitted values around the sample mean, and SST, the total sum-of-squares, is the variation of the actual values around the sample mean.

The three sums-of-squares are closely related. Consider

SST - SSE = \sum_{i=1}^n [\,(y_i - \bar{y})^2 - (y_i - \hat{y}_i)^2\,]
= \sum_{i=1}^n (y_i^2 - 2y_i\bar{y} + \bar{y}^2) - \sum_{i=1}^n (y_i^2 - 2y_i\hat{y}_i + \hat{y}_i^2)
= \sum_{i=1}^n \bar{y}^2 - 2\bar{y}\sum_{i=1}^n y_i + 2\sum_{i=1}^n y_i\hat{y}_i - \sum_{i=1}^n \hat{y}_i^2
= \sum_{i=1}^n \bar{y}^2 - 2\bar{y}\sum_{i=1}^n y_i + \sum_{i=1}^n \hat{y}_i^2   [by (7.13)]
= \sum_{i=1}^n \bar{y}^2 - 2\bar{y}\sum_{i=1}^n \hat{y}_i + \sum_{i=1}^n \hat{y}_i^2   [by (7.14)]
= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = SSR.

Thus we see that the explained sum-of-squares SSR is the total sum-of-squares SST less the unexplained SSE. Alternatively, we have SST = SSE + SSR.

We now define

R^2 = 1 - \frac{SSE}{SST} = \frac{SST - SSE}{SST} = \frac{SSR}{SST}   (7.19)

as one minus the proportion of the total variation that is not explained by the model, or the proportion of the total variation explained by the model.

This statistic can also be interpreted as a squared correlation coefficient. Consider the sample second moments

\widehat{\mathrm{Var}}(y) = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{n}SST,

\widehat{\mathrm{Var}}(\hat{y}) = \frac{1}{n}\sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \frac{1}{n}SSR,


and

\widehat{\mathrm{Cov}}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})(\hat{y}_i - \bar{y})
= \frac{1}{n}\sum_{i=1}^n (y_i\hat{y}_i - \bar{y}\hat{y}_i - \bar{y}y_i + \bar{y}^2)
= \frac{1}{n}\sum_{i=1}^n (\hat{y}_i^2 - 2\bar{y}\hat{y}_i + \bar{y}^2)
= \frac{1}{n}\sum_{i=1}^n (\hat{y}_i - \bar{y})^2
= \frac{1}{n}SSR,

using (7.13) and (7.14). Then the sample correlation between y_i and \hat{y}_i can be written as

r_{y,\hat{y}} = \frac{\widehat{\mathrm{Cov}}(y, \hat{y})}{\sqrt{\widehat{\mathrm{Var}}(y)\,\widehat{\mathrm{Var}}(\hat{y})}} = \frac{\frac{1}{n}SSR}{\sqrt{\frac{1}{n}SST\cdot\frac{1}{n}SSR}} = \frac{\sqrt{SSR}}{\sqrt{SST}}.

Squaring, we have the R^2 as a squared correlation between y and \hat{y}:

r_{y,\hat{y}}^2 = \frac{SSR}{SST} = R^2.

This statistic is variously called the coefficient of determination, the multiple correlation statistic, and the "R-squared" statistic. It is a measure of goodness-of-fit and always lies between 0, in the event that there is no linear relationship between y_i and the x's, and 1, in the event that there is a perfect relationship. It is worth noting that it is typically large in the case of time-series data, due to shared trends, and much smaller with cross-sectional data. While an invaluable tool, it is subject to a great deal of misuse in making selections between model alternatives. The topic of model selection will receive extended treatment in Chapter 15.
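The sums-of-squares identity and the squared-correlation interpretation are both easy to verify. The following sketch is our own illustration (reusing the Chapter 6 data; variable names are assumptions), checking SST = SSE + SSR and R^2 = r^2_{y,\hat{y}}.

# Sketch of (7.16)-(7.19): sums of squares, R^2, and the correlation view.
import numpy as np

X = np.column_stack([np.ones(5), np.array([2.0, 3.0, 4.0, 5.0, 6.0])])
y = np.array([12.0, 7.0, 8.0, 5.0, 3.0])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
e = y - y_hat

SSE = np.sum(e ** 2)
SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum((y_hat - y.mean()) ** 2)

R2 = 1 - SSE / SST
print(np.isclose(SST, SSE + SSR))             # SST = SSE + SSR
print(R2, np.corrcoef(y, y_hat)[0, 1] ** 2)   # R^2 equals the squared correlation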


7.3 Basic Statistical Results

7.3.1 Mean and Covariance of β

From the previous section, under Assumption (v), we have

\hat{\beta} = (X'X)^{-1}X'y
= (X'X)^{-1}X'(X\beta + u)
= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'u
= \beta + (X'X)^{-1}X'u.   (7.20)

Since X is nonstochastic by Assumption (iv), we have, also using Assumption (i),

E(\hat{\beta}) = \beta + E[(X'X)^{-1}X'u] = \beta + (X'X)^{-1}X'E(u) = \beta.   (7.21)

Thus, the OLS estimator is unbiased. Also, using Assumptions (ii) and (iii),

\mathrm{Cov}(\hat{\beta}) = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)']
= E[(X'X)^{-1}X'uu'X(X'X)^{-1}]
= (X'X)^{-1}X'E[uu']X(X'X)^{-1}
= (X'X)^{-1}X'(\sigma^2 I_n)X(X'X)^{-1}
= \sigma^2(X'X)^{-1}.   (7.22)

7.3.2 Best Linear Unbiased Estimator (BLUE)

The OLS estimator is linear in y and is an unbiased estimator, as we saw above. Let

\tilde{\beta} = Ay,

where A is a k × n nonstochastic matrix, be any other linear unbiased estimator. Define

\tilde{A} = A - (X'X)^{-1}X';

then

\tilde{\beta} = [\tilde{A} + (X'X)^{-1}X']y
= [\tilde{A} + (X'X)^{-1}X'][X\beta + u]
= \tilde{A}X\beta + \beta + [\tilde{A} + (X'X)^{-1}X']u.

Now, \tilde{\beta} is an unbiased estimator, so

E(\tilde{\beta}) = \tilde{A}X\beta + \beta + [\tilde{A} + (X'X)^{-1}X']E(u) = \tilde{A}X\beta + \beta = \beta,


for any true value of \beta, which implies that \tilde{A}X = 0. Thus,

\tilde{\beta} = \beta + [\tilde{A} + (X'X)^{-1}X']u,   (7.23)

and

\mathrm{Cov}(\tilde{\beta}) = E[(\tilde{\beta} - \beta)(\tilde{\beta} - \beta)']
= E\{[\tilde{A} + (X'X)^{-1}X']uu'[\tilde{A} + (X'X)^{-1}X']'\}
= [\tilde{A} + (X'X)^{-1}X']E(uu')[\tilde{A} + (X'X)^{-1}X']'
= [\tilde{A} + (X'X)^{-1}X']\sigma^2 I_n[\tilde{A} + (X'X)^{-1}X']'
= \sigma^2[\tilde{A}\tilde{A}' + (X'X)^{-1}]   [since \tilde{A}X = 0]
= \sigma^2\tilde{A}\tilde{A}' + \sigma^2(X'X)^{-1}.

This shows that the covariance matrix of any other linear unbiased estimator exceeds the covariance matrix of the OLS estimator by a positive semi-definite matrix \sigma^2\tilde{A}\tilde{A}'. Hence, OLS is said to be the best linear unbiased estimator (BLUE). This result is known as the Gauss-Markov theorem. Note that we have used all of the Assumptions (i)-(v) to get to this point.

7.3.3 Consistency

Typically, the elements of X'X are unbounded (they go to infinity) as n gets very large. For example, the (1,1) element is n when the first regressor is the intercept, and the (j,j) element is \sum_{i=1}^n x_{ij}^2. Therefore, we typically have

\lim_{n\to\infty} (X'X)^{-1} = 0,

and the variances of \hat{\beta} converge to zero. This means that the distribution collapses about its expected value, namely \beta. So, by convergence in quadratic mean,

\operatorname*{plim}_{n\to\infty} \hat{\beta} = \beta,   (7.24)

and OLS estimation is consistent. A more formal proof of this property will be given in the chapter on stochastic regressors.

7.3.4 Estimation of σ2

Recall that e = Mu; then

e'e = (Mu)'Mu = u'M'Mu = u'Mu,

since M is symmetric and idempotent. Now,

e'e = \mathrm{tr}(e'e) = \mathrm{tr}(u'Mu) = \mathrm{tr}(Muu'),

since e'e is a scalar and \mathrm{tr}(AB) = \mathrm{tr}(BA) for any matrices A and B, provided both products are conformable. Thus,

E[e'e] = E[\mathrm{tr}(Muu')]
= \mathrm{tr}(M\,E[uu'])   [since E and tr are linear operators]
= \mathrm{tr}(M\sigma^2 I_n)
= \sigma^2\,\mathrm{tr}\,M.

But M = I_n - X(X'X)^{-1}X', so

\mathrm{tr}\,M = \mathrm{tr}(I_n - X(X'X)^{-1}X')
= \mathrm{tr}\,I_n - \mathrm{tr}(X(X'X)^{-1}X')
= n - \mathrm{tr}((X'X)^{-1}X'X)
= n - k.

Now, define

s^2 = \frac{e'e}{n-k};   (7.25)

then it follows that

E(s^2) = \frac{E(e'e)}{n-k} = \frac{\sigma^2(n-k)}{n-k} = \sigma^2,   (7.26)

so s^2 is an unbiased estimator of \sigma^2.

We can also establish, given the consistency of \hat{\beta}, that s^2 is a consistent estimator of \sigma^2. That is,

\operatorname*{plim}_{n\to\infty} s^2 = \sigma^2.   (7.27)

This result will be given more extensive attention in the chapter on stochastic regressors.

7.4 Statistical Properties Under Normality

7.4.1 Distribution Of β

Suppose that we introduce Assumption (vi), so the u_i's are jointly normal:

u \sim N(0, \sigma^2 I_n).   (7.28)

Recall that

\hat{\beta} = \beta + (X'X)^{-1}X'u,

so \hat{\beta} is linear in u, since (X'X)^{-1}X', which is nonstochastic, may be treated as a constant matrix. It follows that \hat{\beta} is also jointly normally distributed. We already know the mean and covariance matrix, so we have

\hat{\beta} \sim N(\beta, \sigma^2(X'X)^{-1})   (7.29)


as the joint distribution of \hat{\beta}.

The normality of the estimator will prove useful in performing inferences in finite samples, as discussed in the next chapter. The individual elements of \hat{\beta} will be marginally normal, so

\hat{\beta}_j \sim N(\beta_j, \sigma^2 d_{jj})

for j = 1, 2, \ldots, k, where d_{jj} = [(X'X)^{-1}]_{jj}. We may now apply the standard normal transformation to obtain

z_j = \frac{\hat{\beta}_j - \beta_j}{\sqrt{\sigma^2 d_{jj}}} \sim N(0, 1),   (7.30)

which will form the basis for testing hypotheses on a single coefficient. Under H_0: \beta_j = \beta_j^0 we have

\frac{\hat{\beta}_j - \beta_j^0}{\sqrt{\sigma^2 d_{jj}}} \sim N(0, 1),

while for H_1: \beta_j = \beta_j^1 \neq \beta_j^0 we have

\frac{\hat{\beta}_j - \beta_j^0}{\sqrt{\sigma^2 d_{jj}}} \sim N\Big(\frac{\beta_j^1 - \beta_j^0}{\sqrt{\sigma^2 d_{jj}}},\ 1\Big),

and we would expect the statistic not to be centered around zero.

7.4.2 t-Distribution

In most cases, we do not know the value of \sigma^2. A natural alternative is to use s^2, whereupon we find

\frac{\hat{\beta}_j - \beta_j}{\sqrt{s^2 d_{jj}}} \sim t_{n-k}.   (7.31)

Under H_0: \beta_j = \beta_j^0,

\frac{\hat{\beta}_j - \beta_j^0}{\sqrt{s^2 d_{jj}}} \sim t_{n-k},

while for H_1: \beta_j = \beta_j^1 \neq \beta_j^0,

\frac{\hat{\beta}_j - \beta_j^0}{\sqrt{s^2 d_{jj}}} = \frac{\hat{\beta}_j - \beta_j^1}{\sqrt{s^2 d_{jj}}} + \frac{\beta_j^1 - \beta_j^0}{\sqrt{s^2 d_{jj}}}

will have its distribution shifted to the right or left depending on whether \beta_j^1 - \beta_j^0 is positive or negative. Specifically, it has a noncentral t-distribution. This will be covered in more detail in the next chapter.
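In practice one computes the standard error \sqrt{s^2 d_{jj}} for each coefficient and forms the ratio in (7.31). The sketch below is our own illustration with a made-up design and parameter vector (the seed, sizes, and names are assumptions), testing H_0: \beta_j = 0 for every coefficient.

# Sketch: standard errors sqrt(s^2 d_jj) and the t-ratios of (7.31).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
s2 = e @ e / (n - k)

se = np.sqrt(s2 * np.diag(XtX_inv))                 # sqrt(s^2 d_jj)
t_stats = beta_hat / se                             # tests of H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k)
print(beta_hat, se, t_stats, p_values)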


7.4.3 Maximum Likelihood Estimation

Now,

y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + u_i = x_i'\beta + u_i

is linear in u_i, so y_i is also normal given x_i' = (x_{i1}, x_{i2}, \ldots, x_{ik}):

y_i \sim N(x_i'\beta, \sigma^2).

Thus, the density for y_i is given by

f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{1}{2\sigma^2}[\,y_i - x_i'\beta\,]^2\Big\}.

Since the u_i's, and hence the y_i's, are independent, the joint likelihood function is

f(y_1, y_2, \ldots, y_n) = f(y_1)f(y_2)\cdots f(y_n)
= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n [\,y_i - x_i'\beta\,]^2\Big\} = L(\beta, \sigma^2 \mid y, X).

We may maximize this function with respect to \beta and \sigma^2 directly or, equivalently and more conveniently, maximize the log-likelihood function

\mathcal{L}(\beta, \sigma^2 \mid y, X) = \ln L(\beta, \sigma^2 \mid y, X)   (7.32)
= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n [\,y_i - x_i'\beta\,]^2,

which is a monotonic transformation. Notice that maximizing \mathcal{L} with respect to \beta is equivalent to minimizing the sum of squares \psi(\beta) = \sum_{i=1}^n [\,y_i - x_i'\beta\,]^2, so

\hat{\beta}_{MLE} = (X'X)^{-1}X'y,   (7.33)

which is the OLS estimator from above. It is easily shown that

\hat{\sigma}^2_{MLE} = \frac{e'e}{n} = \frac{n-k}{n}\,s^2.   (7.34)

The distribution of s^2, and hence of \hat{\sigma}^2_{MLE}, will be developed and utilized in the next chapter.

7.4.4 Efficiency of β and s2

Since \hat{\beta} is the MLE and is unbiased, it is the minimum variance unbiased estimator (BUE). At the same time, s^2 is not the MLE, so it is not necessarily BUE. On the other hand, \hat{\sigma}^2_{MLE} is biased, so it is not BUE either. The bias disappears asymptotically, however, so the two will be equivalent in large samples and asymptotically BUE.


7.5 An Example

We now examine a simple trivariate example taken from Wonnacott and Wonnacott. The details of the calculations are instructive and will be exhibited in full. The model considered is

Y_t = \beta_1 + \beta_2 X_{t,2} + \beta_3 X_{t,3} + u_t,   (7.35)

where Y_t is wheat yield, X_{t,2} is the amount of fertilizer applied, and X_{t,3} is the annual rainfall. The data are given in Table 8.1.

Wheat Yield      Fertilizer       Rainfall
(Bushels/Acre)   (Pounds/Acre)    (Inches/Year)
40               100              36
45               200              33
50               300              37
65               400              37
70               500              34
70               600              32
80               700              36

Table 8.1: Wheat yield data.

For computational convenience, we rescale the data (yield in tens of bushels per acre, fertilizer in hundreds of pounds per acre) and calculate the elements of the moment matrix:

Y_t    X_{t2}   X_{t3}   X_{t2}Y_t   X_{t3}Y_t   X_{t2}^2   X_{t2}X_{t3}   X_{t3}^2
4.0    1        3.6      4           14.40       1          3.6            12.96
4.5    2        3.3      9           14.85       4          6.6            10.89
5.0    3        3.7      15          18.50       9          11.1           13.69
6.5    4        3.7      26          24.05       16         14.8           13.69
7.0    5        3.4      35          23.80       25         17.0           11.56
7.0    6        3.2      42          22.40       36         19.2           10.24
8.0    7        3.6      56          28.80       49         25.2           12.96
42     28       24.5     187         146.80      140        97.5           85.99

Table 8.2: Moment matrix calculations.


Thus,

X'X = \begin{pmatrix} n & \sum x_{t2} & \sum x_{t3} \\ \sum x_{t2} & \sum x_{t2}^2 & \sum x_{t2}x_{t3} \\ \sum x_{t3} & \sum x_{t3}x_{t2} & \sum x_{t3}^2 \end{pmatrix}   (7.36)
= \begin{pmatrix} 7 & 28 & 24.5 \\ 28 & 140 & 97.5 \\ 24.5 & 97.5 & 85.99 \end{pmatrix},   (7.37)

X'y = \begin{pmatrix} \sum y_t \\ \sum x_{t2}y_t \\ \sum x_{t3}y_t \end{pmatrix} = \begin{pmatrix} 42.0 \\ 187.0 \\ 146.8 \end{pmatrix},   (7.38)

and

\det(X'X) = 84{,}270.2 + 66{,}885.0 + 66{,}885.0 - 84{,}035.0 - 67{,}416.16 - 66{,}543.75 = 45.29,

\mathrm{cof}(X'X) = \begin{pmatrix} 2532.35 & -18.97 & -700 \\ -18.97 & 1.68 & 3.5 \\ -700 & 3.5 & 196.0 \end{pmatrix},

(X'X)^{-1} = \frac{1}{\det(X'X)}\,\mathrm{cof}(X'X) = \begin{pmatrix} 55.9141 & -0.4189 & -15.4560 \\ -0.4189 & 0.0371 & 0.0773 \\ -15.4560 & 0.0773 & 4.3277 \end{pmatrix}.

We can now calculate the least squares estimator \hat{\beta} = (X'X)^{-1}X'y:

\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \\ \hat{\beta}_3 \end{pmatrix} = \begin{pmatrix} 55.9141 & -0.4189 & -15.4560 \\ -0.4189 & 0.0371 & 0.0773 \\ -15.4560 & 0.0773 & 4.3277 \end{pmatrix} \begin{pmatrix} 42.0 \\ 187.0 \\ 146.8 \end{pmatrix} = \begin{pmatrix} 1.132 \\ 0.689 \\ 0.603 \end{pmatrix}.

In order to conduct inference and evaluate the quality of these estimates, we need the relevant sums-of-squares. Accordingly, we obtain


Y_t    \hat{Y}_t   e_t      e_t^2    Y_t - \bar{Y}   (Y_t - \bar{Y})^2
4.0    3.99        0.01     0.0001   -2.0            4.00
4.5    4.50        0.00     0.0000   -1.5            2.25
5.0    5.43        -0.43    0.1849   -1.0            1.00
6.5    6.12        0.38     0.1444   0.5             0.25
7.0    6.63        0.37     0.1369   1.0             1.00
7.0    7.20        -0.20    0.0400   1.0             1.00
8.0    8.13        -0.13    0.0169   2.0             4.00
42     42.0        0.0      0.5232   0.0             13.50

It follows that

s^2 = \sum_t e_t^2/(n-3) = 0.5232/4 = 0.1308,

R^2 = 1 - 0.5232/13.5 = 0.9612,

since SSE = \sum_t e_t^2 = 0.5232 and SST = \sum_t (Y_t - \bar{Y})^2 = 13.50.

Recall

E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'] = \mathrm{Cov}(\hat{\beta}) = \sigma^2(X'X)^{-1},   (7.39)

which we estimate using

\widehat{\mathrm{Cov}}(\hat{\beta}) = s^2(X'X)^{-1}.   (7.40)

Thus,

\widehat{\mathrm{Var}}(\hat{\beta}_1) = s^2 d_{11} = 7.3133,
\widehat{\mathrm{Var}}(\hat{\beta}_2) = s^2 d_{22} = 0.0049,
\widehat{\mathrm{Var}}(\hat{\beta}_3) = s^2 d_{33} = 0.5660.

For H_0: \beta_3 = 0 vs. H_1: \beta_3 \neq 0, we have

\frac{\hat{\beta}_3 - \beta_3^0}{\sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_3)}} = \frac{0.603}{0.75236} = 0.80119 \sim t_4.   (7.41)

Now, a 95% acceptance region for a t_4 distribution is -2.776 ≤ t_4 ≤ 2.776. Thus, we fail to reject the null hypothesis that rainfall has no effect on yield once we control for fertilizer.

For H_0: \beta_2 = 0 vs. H_1: \beta_2 \neq 0, we have

\frac{\hat{\beta}_2 - \beta_2^0}{\sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_2)}} = \frac{0.6893}{0.0697} = 9.8965 \sim t_4,   (7.42)

and we reject the null hypothesis at the 5% significance level. In fact, we reject even at the 0.2% level, where the acceptance region is -7.173 ≤ t_4 ≤ 7.173. Clearly, fertilizer is important in determining yield once we control for rainfall.
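The hand calculations above can be reproduced in a few lines. The sketch below is our own check; it uses the rescaled data from Table 8.2, so the output should be close to the figures computed by hand (small differences reflect rounding in the cofactor arithmetic).

# Sketch re-doing the wheat-yield example with numpy.
import numpy as np

Y  = np.array([4.0, 4.5, 5.0, 6.5, 7.0, 7.0, 8.0])
X2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])     # fertilizer (rescaled)
X3 = np.array([3.6, 3.3, 3.7, 3.7, 3.4, 3.2, 3.6])     # rainfall
X  = np.column_stack([np.ones(7), X2, X3])

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
s2 = e @ e / (7 - 3)
R2 = 1 - (e @ e) / np.sum((Y - Y.mean()) ** 2)

se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
print(beta_hat)          # approximately [1.13, 0.69, 0.60]
print(s2, R2)            # approximately 0.131 and 0.961
print(beta_hat / se)     # t-ratios, roughly 9.9 for fertilizer and 0.80 for rainfall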


Chapter 8

Confidence Intervals and Hypothesis Tests

In the previous chapter we established a number of properties of the least-squares estimator for the linear regression model. These results provide point estimates and the distributions of the coefficients in question. Ultimately, we are interested in making statements regarding the reasonableness of our preferred values for certain coefficients. This entails making a probability statement regarding the most likely values of the estimators relative to the preferred values. We can formulate these statements as hypothesis tests, confidence intervals, or p-values, which will be developed below. First we review and extend the previous results.

8.1 Preliminaries

8.1.1 Model and Assumptions

The model is a k-variable linear model:

y = Xβ + u,

where y and u are both n × 1 vectors, X is an n × k matrix, and \beta is a k × 1 vector. We make the following assumptions about the disturbances:

(i) E[u ] = 0

and

(ii),(iii) Cov(u ) = E[uu′ ] = σ2In,

where In is an n× n identity matrix. The nonstochastic assumptions are

(iv) X is nonstochastic.


(v) X has full column rank (the columns are linearly independent).

For inferences, we assume that the disturbances are normally distributed. That is,

(vi) u ∼ N( 0, σ2In ).

8.1.2 Ordinary Least Squares Estimation

For a candidate value \beta of the coefficients, define

e = y - X\beta

and

\psi(\beta) = e'e.

Choosing \beta to minimize \psi(\beta) yields the ordinary least squares (OLS) estimator

\hat{\beta} = (X'X)^{-1}X'y.

Substitution yields

\hat{\beta} = (X'X)^{-1}X'(X\beta + u)
= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'u
= \beta + (X'X)^{-1}X'u.

This structure, in which the estimator is represented as the target plus an expression that resembles the estimator itself except that y is replaced by u, will prove to be very general.

8.1.3 Properties of β

Since X is nonstochastic,

E[\hat{\beta}] = \beta + E[(X'X)^{-1}X'u] = \beta + (X'X)^{-1}X'E[u] = \beta.

Thus, the OLS estimator is unbiased. Also,

\mathrm{Cov}(\hat{\beta}) = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)']
= E[(X'X)^{-1}X'uu'X(X'X)^{-1}]
= (X'X)^{-1}X'E[uu']X(X'X)^{-1}
= (X'X)^{-1}X'\sigma^2 I_n X(X'X)^{-1}
= \sigma^2(X'X)^{-1}.

The elements of X'X are unbounded as n gets very large. Therefore,

\lim_{n\to\infty} (X'X)^{-1} = 0,


and the variances of \hat{\beta} converge to zero. This means that the distribution collapses about its expected value, namely \beta. So,

\operatorname*{plim}_{n\to\infty} \hat{\beta} = \beta,

and OLS estimation is consistent.

The OLS estimates \hat{\beta} are best linear unbiased (BLUE) in that they have minimum variance in the class of unbiased estimators of \beta that are also linear in y.

Suppose that u is normal; then the linear transformation

\hat{\beta} - \beta = (X'X)^{-1}X'u

is also normal:

\hat{\beta} - \beta \sim N(0, \sigma^2(X'X)^{-1}),

or

\hat{\beta} \sim N(\beta, \sigma^2(X'X)^{-1}).

Moreover, the \hat{\beta} are maximum likelihood and hence minimum variance in the class of unbiased estimators (BUE).

8.1.4 Properties of e

Now, the OLS residuals are

e = y - X\hat{\beta}
= y - X(X'X)^{-1}X'y
= [\,I_n - X(X'X)^{-1}X'\,]y
= My, \qquad M = I_n - X(X'X)^{-1}X',
= M(X\beta + u)
= [\,I_n - X(X'X)^{-1}X'\,](X\beta) + Mu
= Mu,

since MX = 0. Thus, the OLS residuals are a linear transformation of the underlying disturbances. Also,

X'e = X'Mu = 0,

again since MX = 0, so the OLS residuals are orthogonal, or linearly unrelated, to X. When u is normal, the linear transformation e = Mu is also normal. Specifically,

e \sim N(0, \sigma^2 M),

since

E[e] = E[Mu] = ME[u] = 0,


and

E[ee'] = E[Muu'M'] = M(E[uu'])M' = M(\sigma^2 I_n)M' = \sigma^2 M,

since MM' = M. Note that this is not a full-rank distribution since, as we see below, \mathrm{rank}(M) = n - k.

8.2 Tests Based on the χ2 Distribution

8.2.1 The χ2 Distribution

Recall that for z_1, z_2, \ldots, z_m i.i.d. N(0, 1) random variables,

\sum_{i=1}^m z_i^2 \sim \chi^2_m.

8.2.2 Distribution of (n− k)s2/σ2

By definition,

u_i = y_i - x_i'\beta \sim N(0, \sigma^2),

so

\frac{u_i}{\sigma} \sim N(0, 1)

and

\sum_{i=1}^n \Big(\frac{u_i}{\sigma}\Big)^2 = \sum_{i=1}^n \frac{u_i^2}{\sigma^2} \sim \chi^2_n.

Now,

e_i = y_i - x_i'\hat{\beta}

is an estimate of u_i, and we might expect that

\sum_{i=1}^n \frac{e_i^2}{\sigma^2} \sim \chi^2_n.

However, this would be wrong, as only n - k of the residuals are independent, since e satisfies the k equations X'e = 0.

The properties of e = Mu follow from the properties of M = I_n - X(X'X)^{-1}X', which is symmetric, idempotent, and positive semi-definite, and hence has some very special properties. Principal among these is that we can write the decomposition M = QD_{n-k}Q', where D_{n-k} is a diagonal matrix of eigenvalues with its first n - k diagonal elements unity and the remainder zero, and Q is a matrix of corresponding orthonormal eigenvectors such that Q'Q = I_n, or Q' = Q^{-1}. It follows that \mathrm{rank}(M) = \mathrm{tr}(M) = n - k. Let

v = Q'u;

then v \sim N(0, \sigma^2 I_n) and u = Qv. Substitution yields

\frac{1}{\sigma^2}e'e = \frac{1}{\sigma^2}u'Mu
= \frac{1}{\sigma^2}u'QD_{n-k}Q'u
= \frac{1}{\sigma^2}v'Q'QD_{n-k}Q'Qv
= \frac{1}{\sigma^2}v'D_{n-k}v
= \frac{1}{\sigma^2}\sum_{i=1}^{n-k} v_i^2
= \sum_{i=1}^{n-k} \Big(\frac{v_i}{\sigma}\Big)^2 \sim \chi^2_{n-k}.

Thus,

\sum_{i=1}^n \frac{e_i^2}{\sigma^2} \sim \chi^2_{n-k}

and

(n-k)\,\frac{\sum_{i=1}^n e_i^2}{(n-k)\sigma^2} = (n-k)\,\frac{s^2}{\sigma^2} \sim \chi^2_{n-k}.

Not only so, but e = Mu and \hat{\beta} - \beta = (X'X)^{-1}X'u are obviously jointly normal, and

E[e(\hat{\beta} - \beta)'] = E[Muu'X(X'X)^{-1}] = M\sigma^2 I_n X(X'X)^{-1} = \sigma^2 MX(X'X)^{-1} = 0,

so they are uncorrelated and hence stochastically independent. But this means s^2 is independent of \hat{\beta}, since it is a function only of e.

8.2.3 An Hypothesis Test

Occasionally, a model will suggest a specific value of the variance, which can be subjected to a hypothesis test. Suppose that

H_0: \sigma^2 = \sigma_0^2, \qquad H_1: \sigma^2 = \sigma_1^2 \neq \sigma_0^2.

Then we know that

(n-k)\,\frac{s^2}{\sigma_0^2} \sim \chi^2_{n-k}


under the null hypothesis. Under the alternative, we have

(n-k)\,\frac{s^2}{\sigma_0^2} = (n-k)\,\frac{s^2}{\sigma_1^2}\,\frac{\sigma_1^2}{\sigma_0^2} = (n-k)\,\frac{s^2}{\sigma_1^2} + (n-k)\Big(\frac{\sigma_1^2}{\sigma_0^2} - 1\Big)\frac{s^2}{\sigma_1^2},

with the first term having a \chi^2_{n-k} distribution and the second being a shift term. We see that the distribution will be shifted to the right if \sigma_1^2 > \sigma_0^2 and to the left if \sigma_1^2 < \sigma_0^2. The size of the shift increases with the sample size and with the difference between the null and the alternative.

For example, suppose that s^2 = 4.0 and \sigma_0^2 = 1 with n - k = 14; then

(n-k)\,\frac{s^2}{\sigma_0^2} = 14\cdot\frac{4.0}{1} = 56.

Choose \alpha = 0.05, say; then the critical values corresponding to 2.5% tails are 5.63 and 26.12 for n - k = 14. Since 56 > 26.12, we are in the right-hand tail and we reject the null hypothesis.

8.2.4 A Confidence Interval

Now, let a and b be numbers such that

\Pr(b \le \chi^2_{n-k} \le a) = 1 - \alpha = 0.95,

say. Then a and b can be obtained from a table. Thus,

\Pr\Big(b \le (n-k)\frac{s^2}{\sigma^2} \le a\Big) = 0.95,
\Pr\Big(\frac{1}{b} \ge \frac{\sigma^2}{(n-k)s^2} \ge \frac{1}{a}\Big) = 0.95,
\Pr\Big(\frac{(n-k)s^2}{b} \ge \sigma^2 \ge \frac{(n-k)s^2}{a}\Big) = 0.95

establishes a 95% confidence interval for \sigma^2.

For example, for n - k = 14, we have

\Pr\Big(5.63 \le \frac{14 s^2}{\sigma^2} \le 26.12\Big) = 0.95,
\Pr\Big(\frac{14 s^2}{5.63} \ge \sigma^2 \ge \frac{14 s^2}{26.12}\Big) = 0.95,

and if s^2 = 4.0, then

\Pr\Big(\frac{56}{5.63} \ge \sigma^2 \ge \frac{56}{26.12}\Big) = 0.95,


or

\Pr(10 \ge \sigma^2 \ge 2.1) = 0.95

is the confidence interval.

This confidence interval comprises the set of values of \sigma^2 that would not be rejected if used as a null hypothesis when \alpha is chosen as the size. Conversely, any null value that falls outside the interval would lead to rejection when used as the basis of a hypothesis test with that \alpha. In the example above, for \alpha = .05, the null hypothesis \sigma_0^2 = 1 is not included in this interval and would be rejected.
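The quantiles 5.63 and 26.12 and the resulting interval can be obtained from scipy rather than a table. The sketch below is our own illustration using the numbers of the example (s^2 = 4.0, n - k = 14); the variable names are assumptions.

# Sketch of the chi-squared confidence interval for sigma^2.
from scipy import stats

s2, dof, alpha = 4.0, 14, 0.05
b = stats.chi2.ppf(alpha / 2, dof)        # lower 2.5% point, about 5.63
a = stats.chi2.ppf(1 - alpha / 2, dof)    # upper 2.5% point, about 26.12
ci = (dof * s2 / a, dof * s2 / b)         # ((n-k)s^2/a, (n-k)s^2/b)
print(b, a)
print(ci)                                 # roughly (2.1, 10.0)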

8.3 Tests Based on the t Distribution

8.3.1 The t Distribution

Suppose that z is a N(0, 1) random variable and that w \sim \chi^2_m, independent of z; then

\frac{z}{\sqrt{w/m}} \sim t_m.

8.3.2 The Distribution of (\hat{\beta}_j - \beta_j)/(s^2 d_{jj})^{1/2}

We have seen that

\hat{\beta}_j \sim N(\beta_j, \sigma^2 d_{jj}),

where d_{jj} is the (j, j) element of the matrix (X'X)^{-1}. Then

z = \frac{\hat{\beta}_j - \beta_j}{\sqrt{\sigma^2 d_{jj}}} \sim N(0, 1),

while

w = (n-k)\,\frac{s^2}{\sigma^2} \sim \chi^2_{n-k}.

Since \hat{\beta} and s^2 are independent, we have

\frac{(\hat{\beta}_j - \beta_j)/\sqrt{\sigma^2 d_{jj}}}{\sqrt{\frac{(n-k)s^2}{\sigma^2}\Big/(n-k)}} = \frac{\hat{\beta}_j - \beta_j}{\sqrt{s^2 d_{jj}}} \sim t_{n-k}.

8.3.3 Testing a Simple Hypothesis

Suppose that

H_0: \beta_j = \beta_j^0, \qquad H_1: \beta_j = \beta_j^1 \neq \beta_j^0.

We know that

\frac{\hat{\beta}_j - \beta_j^0}{\sqrt{s^2 d_{jj}}} \sim t_{n-k}

under the null hypothesis.


Under the alternative hypothesis, we find that the numerator

z = \frac{\hat{\beta}_j - \beta_j^0}{\sqrt{\sigma^2 d_{jj}}} = \frac{\hat{\beta}_j - \beta_j^1}{\sqrt{\sigma^2 d_{jj}}} + \frac{\beta_j^1 - \beta_j^0}{\sqrt{\sigma^2 d_{jj}}} \sim N\Big(\frac{\beta_j^1 - \beta_j^0}{\sqrt{\sigma^2 d_{jj}}},\ 1\Big)

has a noncentral normal distribution, so the statistic will have a noncentral t-distribution with noncentrality (\beta_j^1 - \beta_j^0)/\sqrt{\sigma^2 d_{jj}}. The size of the noncentrality can depend on a number of factors other than the difference between the null and alternative values, which will be studied in Chapter 10.

Now, choose \alpha = 0.05, say; then critical values corresponding to 2.5% tails of a t distribution with (n - k) degrees of freedom are taken from the tables as ±a, say, so if

\Big| \frac{\hat{\beta}_j - \beta_j^0}{\sqrt{s^2 d_{jj}}} \Big| > a,

we reject the null hypothesis. Otherwise we fail to reject the null hypothesis. Here again we are choosing between a value that is unlikely under the null and one that is more likely under the alternative.

8.3.4 Confidence Interval and P-value for βj

We consider a two-sided interval. First, for n - k degrees of freedom, we obtain a such that

\Pr(-a \le t_{n-k} \le a) = 1 - \alpha = 0.95,

say, from a table. Then,

\Pr\Big(-a \le \frac{\hat{\beta}_j - \beta_j}{\sqrt{s^2 d_{jj}}} \le a\Big) = 0.95,
\Pr\Big(\hat{\beta}_j + a\sqrt{s^2 d_{jj}} \ge \beta_j \ge \hat{\beta}_j - a\sqrt{s^2 d_{jj}}\Big) = 0.95,

and \hat{\beta}_j ± a\sqrt{s^2 d_{jj}} defines a 95% confidence interval for \beta_j.

It is important to emphasize that the interval is random and \beta_j is not. Specifically, the center and width of the interval are random, since \hat{\beta}_j and s^2 are both random variables. The probability statement above says that the random interval has a 95% chance of including the true (non-random) \beta_j.

As with the chi-squared interval, this interval comprises the set of non-rejectable null hypotheses for \beta_j when \alpha is chosen as the size. Conversely, any null value that falls outside the interval would lead to rejection when used as the basis of a hypothesis test with that \alpha. This is a much more informative


procedure than the hypothesis test, which gives a yes-no answer for a particular \alpha and null value.

The other side of the inferential coin from the confidence interval is the p-value. The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. Suppose that \tau is the observed value of the statistic and t is a random variable that has the specified t_{n-k}-distribution; then

p\text{-value} = \Pr[\,|t| \ge |\tau|\,] = 2\,\Pr[\,t \ge |\tau|\,],

since the t-distribution is symmetric. For a two-sided hypothesis test, if \alpha is smaller than the p-value then we would not reject with the realization \tau of the statistic. Thus the p-value yields the set of \alpha for which the given null hypothesis will not be rejected. Again, this is much more informative than a single hypothesis test.
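Both the interval and the p-value follow mechanically once \hat{\beta}_j, its standard error, and the degrees of freedom are in hand. The sketch below is our own helper (the function name and arguments are assumptions); the usage numbers are the fertilizer coefficient and standard error from Section 7.5.

# Sketch: 95% confidence interval and two-sided p-value for one coefficient.
from scipy import stats

def coef_ci_and_pvalue(beta_hat_j, se_j, dof, beta0=0.0, level=0.95):
    a = stats.t.ppf(0.5 + level / 2, dof)            # critical value
    ci = (beta_hat_j - a * se_j, beta_hat_j + a * se_j)
    tau = (beta_hat_j - beta0) / se_j
    p_value = 2 * stats.t.sf(abs(tau), dof)
    return ci, p_value

# Hypothetical usage: beta_hat_j = 0.69, se = 0.07, n - k = 4 (Section 7.5 values).
print(coef_ci_and_pvalue(0.69, 0.07, 4))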

8.3.5 Testing a Linear Restriction on β

We frequently encounter situations where we want to test a linear restriction such as H_0: c'\beta = a, where c and a are known constants. For example, \beta_1 + \beta_2 = 1, or, as a special case, the usual zero restriction \beta_j = 0. Consider the linear combination c'\hat{\beta}. It follows that

c'\hat{\beta} \sim N(c'\beta, \sigma^2 c'(X'X)^{-1}c),

and

\frac{c'(\hat{\beta} - \beta)}{\sqrt{\sigma^2 c'(X'X)^{-1}c}} \sim N(0, 1).

As before, we use s^2 instead of \sigma^2, whereupon we find

\frac{c'(\hat{\beta} - \beta)}{\sqrt{s^2 c'(X'X)^{-1}c}} \sim t_{n-k}.

We can perform inferences and calculate confidence intervals and p-values as before. In the next section we will consider testing more than one linear restriction at the same time.

8.4 Tests Based on the F Distribution

8.4.1 The F Distribution

Suppose that

v \sim \chi^2_l \quad\text{and}\quad w \sim \chi^2_m.

If v and w are independent, then

\frac{v/l}{w/m} \sim F_{l,m}.

8.4.2 Testing a Set of Linear Restrictions on β

Suppose we are interested in testing a set of q linear restrictions. Examples would be \beta_1 + \beta_2 + \cdots + \beta_k = 1 and \beta_3 = 2\beta_2. More generally, we consider

H_0: R\beta = r, \qquad H_1: R\beta \neq r,

where r is a q × 1 known vector and R is a q × k known matrix. Due to the multivariate normality of \hat{\beta}, under the null hypothesis we have

R\hat{\beta} - r \sim N(0, \sigma^2 R(X'X)^{-1}R')

and hence

(R\hat{\beta} - r)'[\sigma^2 R(X'X)^{-1}R']^{-1}(R\hat{\beta} - r) \sim \chi^2_q.

Recall that \hat{\beta} and s^2 are independent, so (n-k)s^2/\sigma^2 \sim \chi^2_{n-k} is independent of the quadratic form above. Thus, under the null hypothesis,

F = \frac{(R\hat{\beta} - r)'[\sigma^2 R(X'X)^{-1}R']^{-1}(R\hat{\beta} - r)/q}{(n-k)\frac{s^2}{\sigma^2}\big/(n-k)} \sim F_{q,n-k},

and after some simplification

F = (R\hat{\beta} - r)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta} - r)/(s^2 q) \sim F_{q,n-k}.   (8.1)

Under the alternative hypothesis R\beta - r = d \neq 0, the numerator is distributed \chi^2_q(\delta), where \delta = d'[\sigma^2 R(X'X)^{-1}R']^{-1}d. Accordingly, F is distributed F_{q,n-k}(\delta), where \delta diverges at the rate n, and we expect large positive values of the statistic with high probability. This is why we typically consult only the right-hand-tail values of the distribution to establish critical values. Values of the statistic exceeding these critical values are rare events under the null but typical under the alternative, so we reject when the realization exceeds the critical value.

8.4.3 A Sums of Squares Approach

An alternative to the quadratic form test just proposed can be based on sums of squares from a restricted and an unrestricted regression. The unrestricted regression uses the usual least squares estimator β̂ = (X′X)⁻¹X′y with corresponding unrestricted residuals
\[ e_u = e = y - X\hat{\beta} \]
and unrestricted sum of squares
\[ SSE_u = e_u'e_u. \]


The restricted regression is obtained as
\[ \tilde{\beta} = \arg\min_{\beta}\ (y - X\beta)'(y - X\beta) \quad\text{s.t.}\quad R\beta - r = 0. \]
Define the corresponding Lagrangian function
\[ \varphi(\beta, \lambda) = (y - X\beta)'(y - X\beta) + \lambda'(R\beta - r), \]
where λ is a q × 1 vector of Lagrange multipliers. The first-order conditions for optimizing this function, evaluated at the solution (β̃, λ̃), are
\[ \frac{\partial \varphi(\beta,\lambda)}{\partial \beta} = -2X'(y - X\tilde{\beta}) + R'\tilde{\lambda} = 0 \]
\[ \frac{\partial \varphi(\beta,\lambda)}{\partial \lambda} = R\tilde{\beta} - r = 0. \]
Multiplying the first equation by R(X′X)⁻¹ and solving for λ̃ yields
\[ \tilde{\lambda} = 2\bigl[R(X'X)^{-1}R'\bigr]^{-1}R(\hat{\beta} - \tilde{\beta}) = 2\bigl[R(X'X)^{-1}R'\bigr]^{-1}(R\hat{\beta} - r) \]
since Rβ̃ = r. Substituting for λ̃ in the first equation and solving for β̃ yields
\[ \tilde{\beta} = \hat{\beta} - (X'X)^{-1}R'\bigl[R(X'X)^{-1}R'\bigr]^{-1}(R\hat{\beta} - r). \tag{8.2} \]
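A short sketch of computing the restricted estimator in (8.2), with illustrative variable names, might look as follows.

```python
# Hedged sketch of the restricted least squares estimator (8.2).
import numpy as np

def restricted_ols(X, y, R, r):
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y                       # unrestricted OLS
    A = XtX_inv @ R.T                                  # (X'X)^{-1} R'
    correction = A @ np.linalg.solve(R @ A, R @ beta_hat - r)
    return beta_hat - correction                       # restricted estimator
```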

Now define the corresponding restricted residuals
\[ e_r = y - X\tilde{\beta} = y - X\hat{\beta} - X(\tilde{\beta} - \hat{\beta}) = e_u - X(\tilde{\beta} - \hat{\beta}). \]
Thus the restricted sum of squares is
\[ SSE_r = e_r'e_r = e_u'e_u - 2e_u'X(\tilde{\beta} - \hat{\beta}) + (\tilde{\beta} - \hat{\beta})'X'X(\tilde{\beta} - \hat{\beta}) = e_u'e_u + (\tilde{\beta} - \hat{\beta})'X'X(\tilde{\beta} - \hat{\beta}) \]
since X′e_u = X′e = 0. Rearranging and substituting for (β̃ − β̂) from (8.2) yields
\begin{align*}
e_r'e_r - e_u'e_u &= (\tilde{\beta} - \hat{\beta})'X'X(\tilde{\beta} - \hat{\beta}) \\
&= (R\hat{\beta} - r)'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}X'X(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(R\hat{\beta} - r) \\
&= (R\hat{\beta} - r)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta} - r).
\end{align*}


Substituting into (8.1), we find the F-statistic can be alternatively and equivalently written as
\[ F = \frac{(e_r'e_r - e_u'e_u)/q}{s^2} = \frac{(SSE_r - SSE_u)/q}{SSE_u/(n-k)} \sim F_{q,n-k}. \]
This is by far the most frequently seen form of the F-statistic and is directly related to the likelihood ratio statistic, as is shown in the Appendix to this chapter.

8.4.4 Testing a Set of Zero Restrictions on β

Although superficially simpler, the sums of squares form of the F-statistic requires obtaining the restricted least squares estimates, which, in general, is no simpler than calculating the quadratic form alternative. In some cases, however, where the restricted estimates are easy to obtain, it has substantial appeal. The leading such case is when the null hypothesis eliminates a set of variables from the regression.

Suppose the “unrestricted” model can be written as
\[ y = X_1\beta_1 + X_2\beta_2 + u \]
and
\[ H_0 : \beta_2 = 0 \qquad H_1 : \beta_2 \ne 0, \]
where β1 is (k1 × 1), β2 is (k2 × 1), and k1 + k2 = k, so q = k2. For β′ = (β1′, β2′), define the unrestricted residuals
\[ e_u = y - X_1\hat{\beta}_1 - X_2\hat{\beta}_2 \]
and
\[ SSE_u = e_u'e_u \]
from the OLS regression of y on X1 and X2, namely β̂ = (X′X)⁻¹X′y. Under the null hypothesis the “restricted” model can be written as
\[ y = X_1\beta_1 + u. \]
Next, define the restricted residuals
\[ e_r = y - X_1\tilde{\beta}_1 \]
and
\[ SSE_r = e_r'e_r \]
from the OLS regression of y on X1 only, namely β̃1 = (X1′X1)⁻¹X1′y. Thus, specialized to the zero restriction case, the F-statistic becomes
\[ F = \frac{(SSE_r - SSE_u)/k_2}{SSE_u/(n - (k_1 + k_2))} \sim F_{k_2,\,n-(k_1+k_2)}. \]


We can consult the tables to find the critical point, c.05, corresponding to α = 0.05, say. Then, if F > c.05, we reject the null hypothesis at the 5% level. Note that for k2 = 1, that is, one restriction, the critical point for the F-distribution is the square of the critical point for the t_{n−(k1+1)} distribution. So, in this simple case, the test based on the F-statistic is equivalent to the two-sided test based on the t-ratio for the β2 coefficient.
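A hedged sketch of this zero-restriction F-test via the two regressions follows; the names X1 (the retained regressors, including the constant), X2 (the excluded regressors), and y are illustrative.

```python
# Hedged sketch of the sums-of-squares F test for H0: beta_2 = 0.
import numpy as np
from scipy import stats

def sse(X, y):
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    return e @ e

def zero_restriction_F(X1, X2, y):
    n = y.shape[0]
    k1, k2 = X1.shape[1], X2.shape[1]
    SSE_r = sse(X1, y)                       # restricted: y on X1 only
    SSE_u = sse(np.hstack([X1, X2]), y)      # unrestricted: y on X1 and X2
    F = ((SSE_r - SSE_u) / k2) / (SSE_u / (n - k1 - k2))
    return F, stats.f.sf(F, k2, n - k1 - k2)
```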

8.5 An Example

We now undertake an extended example which examines the possibility that differences in wages may or may not be explained by gender discrimination. We have a series of observations (527) on wages by individual for a firm. We also have a number of other variables which might be potential explanatory variables in a wage equation.

We begin with a bivariate regression equation with the level of wages explained by education attainment in years. The results are:

\[ w_i = \underset{(1.05)}{-1.54} + \underset{(0.079)}{0.804}\,E_i + e_i \]

with SSE = 10574.16 and R² = 0.166. The estimated standard errors for the coefficient estimates are in parentheses below the estimates, which is a standard way to present the results. The coefficient on E_i is highly significant and indicates that an additional year of schooling adds 80 cents to the wage. A plot of the residuals against E_i, however, reveals a problem:

[Figure: Bivariate Regression — Actual vs. Fitted, wages plotted against education]

It is clear that the spread of the residuals is larger at higher values of education and wages. This is not surprising since most researchers find that income and wage data are best approximated by a log-normal distribution.

Accordingly, we turn to a specification with the log of wages as the dependent variable. The results are

\[ \ln(w_i) = \underset{(0.111)}{0.985} + \underset{(0.0083)}{0.0822}\,E_i + e_i \]

with SSE = 118.13 and R² = 0.157. Note that neither of the latter two statistics is comparable with those of the levels regression since the dependent variable is different. The coefficient on E_i is also highly significant here but has a different interpretation, as the percentage increase in the wage for an additional year of schooling. A plot of the residuals for this form is

[Figure: Bivariate Regression — Actual vs. Fitted, log wages plotted against education]

This plot shows that the problem of the spread increasing with education has largely been corrected, and the residuals appear approximately normal, which is desirable.

Using this regression as a base, we begin to examine the possibility that wages are paid differentially on the basis of gender. Accordingly, we add G_i, which represents gender, to the regression. Now G_i is a “dummy” or indicator variable that takes on a value of 1 for females and 0 for males. This specification allows a different intercept for male and female employees. The results are

\[ \ln(w_i) = \underset{(0.109)}{1.114} + \underset{(0.0080)}{0.0807}\,E_i - \underset{(0.0402)}{0.240}\,G_i + e_i \]

with SSE = 110.60 and R² = 0.211. These results indicate that, on average, a female receives 24 percent less in wages, correcting for education. Testing the null hypothesis of no discrimination, that is, of the coefficient on gender being zero, we obtain the statistic (β̂_G − 0)/s.e.(β̂_G) = −0.240/0.0402 = −5.97. This statistic should have a t_{524} distribution, which is very closely approximated by the standard normal. Looking at the normal tables we see that this realization is far out in the left-hand tail of the distribution and hence very unlikely. The probability of such an extreme value is effectively zero, and we would reject the null for any positive choice of α. There seems to be convincing evidence of discrimination. A plot of the data and fitted lines for males and females is presented below.

[Figure: Bivariate Regression — Actual vs. Fitted, log wages against education, plotted separately by gender]

Males are represented by “o” and females by “x”, with the two regression lines reflecting the shifted intercept.

Some analysts might caution us not to be so bold in our conclusions. They would argue that there may be other factors important to determining the wage that differ by gender and might provide an explanation for the differences. Foremost would be the suggestion that we also consider X_i, or experience, as an explanatory variable. It is likely important in explaining the wage and, due to family interruptions, likely to differ between males and females. Accordingly, we add X_i and X_i² as explanatory variables. The square is added because research has shown that the impact of experience on wages is nonlinear, first increasing the wage and then ultimately decreasing it in later years. The results are

\[ \ln(w_i) = \underset{(0.118)}{0.498} + \underset{(0.0078)}{0.0944}\,E_i + \underset{(0.0054)}{0.0446}\,X_i - \underset{(0.00012)}{0.00074}\,X_i^2 - \underset{(0.0369)}{0.267}\,G_i + e_i \]

with SSE = 92.11 and R² = 0.343. Surprisingly, this did not detract from the finding that women receive a significantly lower wage. In fact, the statistic for testing the null of no discrimination is marginally larger in magnitude at −7.23. Note that both of the experience coefficients are also highly significant.

Our cautious analysts might suggest there is more work to be done. In particular, it might be argued that what we are witnessing is discrimination by race rather than gender. As a justification, they would argue that minorities are over-represented in the female employee group. Accordingly, we add two more “dummy” variables: B_i for black and H_i for Hispanic. The results are

\[ \ln(w_i) = \underset{(0.120)}{0.539} + \underset{(0.0079)}{0.0927}\,E_i + \underset{(0.0054)}{0.0446}\,X_i - \underset{(0.00012)}{0.00074}\,X_i^2 - \underset{(0.055)}{0.101}\,B_i - \underset{(0.087)}{0.094}\,H_i - \underset{(0.0369)}{0.268}\,G_i + e_i \]

with SSE = 91.37 and R² = 0.348. Again, adding the variables did not change the result with respect to gender discrimination. The statistic for testing that null is −7.28 and just as hugely significant. It is interesting that neither of the other dummy variables has a significant coefficient at the 5 percent level when conducting a two-sided test, with values of −1.83 for testing the null that β_B = 0 and −1.08 for testing the null that β_H = 0. If we conduct a one-sided test with negative as the alternative, then the black coefficient is significant at the 5 percent level. We can conduct an F-test of the null that both coefficients are zero. This compares the last two regressions and yields the statistic

\[ \frac{(92.11 - 91.37)/2}{91.37/520} = \frac{0.37}{0.176} = 2.10, \]

which should have an F_{2,520} distribution under the null. This has a p-value of 0.125 and is hence not significant for any of the preferred choices of α.
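As a quick numerical check of the reported statistic, the following snippet recomputes it from the sums of squares quoted above (scipy assumed available); the printed values should be in the neighborhood of those reported in the text.

```python
# Check of the reported F statistic: SSE_r = 92.11, SSE_u = 91.37, q = 2, n - k = 520.
from scipy import stats

F = ((92.11 - 91.37) / 2) / (91.37 / 520)
print(F)                      # roughly 2.1
print(stats.f.sf(F, 2, 520))  # p-value of roughly 0.12
```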

The bottom line is that there is strong evidence of gender discrimination after controlling for the other likely covariates.

8.A Appendix

Since we have the maximum likelihood estimator we can directly apply the results of Chapter 5. There we introduced the trinity of tests for a vector restriction: the likelihood ratio, Wald, and Lagrange multiplier tests. It is instructive to present these tests in some detail for the present case. As above, we seek to test the null hypothesis H0 : Rβ = r, where r is a q × 1 known vector and R is a q × k known matrix with full row rank.

First consider the Wald statistic. From above, Rβ̂ − r ∼ N(0, σ²R(X′X)⁻¹R′), so using the maximum likelihood estimate σ̂² for σ², we have the Wald statistic as the feasible quadratic form
\begin{align*}
W &= (R\hat{\beta} - r)'\bigl[\hat{\sigma}^2 R(X'X)^{-1}R'\bigr]^{-1}(R\hat{\beta} - r) \\
  &= \frac{(R\hat{\beta} - r)'\bigl[s^2 R(X'X)^{-1}R'\bigr]^{-1}(R\hat{\beta} - r)}{q} \times \frac{q\,n}{n-k} = F \times \frac{q\,n}{n-k}.
\end{align*}

Except for the division by q and the use of s² rather than σ̂², we see that the F-statistic is really just a Wald-type test. The critical points for the Wald statistic are just the critical points of the appropriate F-distribution, multiplied by qn/(n − k).


Next, we consider the likelihood ratio statistic. From Chapter 7, the log-likelihood function is given by
\[ \ln L(\beta, \sigma^2) = -\frac{n}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta). \]

Suppose β̂ denotes the unrestricted maximum likelihood estimator used in the Wald-type test and β̃ is the restricted maximum likelihood estimator, so Rβ̃ − r = 0. Let e_u = y − Xβ̂ = e and e_r = y − Xβ̃ denote the unrestricted and restricted residuals; then σ̂² = e_u′e_u/n and σ̃² = e_r′e_r/n are the unrestricted and restricted maximum likelihood estimators of σ².

Now substitution of σ̂² into the unrestricted log-likelihood function yields the concentrated unrestricted log-likelihood function
\begin{align*}
L_u &= -\frac{n}{2}\ln 2\pi\hat{\sigma}^2 - \frac{1}{2\hat{\sigma}^2}(y - X\hat{\beta})'(y - X\hat{\beta}) \\
    &= -\frac{n}{2}\ln 2\pi\hat{\sigma}^2 - \frac{1}{2\hat{\sigma}^2}\,e_u'e_u \\
    &= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \hat{\sigma}^2 - \frac{n}{2}.
\end{align*}

Similarly, the concentrated restricted log-likelihood is
\[ L_r = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \tilde{\sigma}^2 - \frac{n}{2}. \]

Thus we have the likelihood ratio statistic and its limiting distribution
\[ LR = 2(L_u - L_r) = 2\Bigl(-\frac{n}{2}\ln \hat{\sigma}^2 + \frac{n}{2}\ln \tilde{\sigma}^2\Bigr) = n\,\ln\frac{\tilde{\sigma}^2}{\hat{\sigma}^2} \;\xrightarrow{d}\; \chi^2_q. \]

The likelihood ratio statistic is obviously a monotonic function of the ratio of restricted to unrestricted sums of squares. Moreover,
\begin{align*}
\frac{\tilde{\sigma}^2}{\hat{\sigma}^2} &= \frac{e_r'e_r/n}{e_u'e_u/n} = \frac{SSE_r}{SSE_u} \\
&= \frac{SSE_r}{SSE_u} - \frac{SSE_u}{SSE_u} + 1 = \frac{SSE_r - SSE_u}{SSE_u} + 1 \\
&= \frac{(SSE_r - SSE_u)/k_2}{SSE_u/(n - (k_1 + k_2))} \times \frac{k_2}{n - (k_1 + k_2)} + 1 \\
&= F \times \frac{k_2}{n - (k_1 + k_2)} + 1.
\end{align*}

We see that the likelihood ratio is a simple nonstochastic monotonic transformation of the F-statistic for the linear model. Accordingly, the finite-sample critical values for the likelihood ratio are the same nonstochastic transformation of the F-statistic critical values. Both the likelihood ratio and Wald-type tests will yield exactly the same inferences in finite samples as the F-statistic.

The Lagrange multiplier test examines the unrestricted scores (first derivatives of the log-likelihood function) evaluated at the restricted estimates. If the restriction is correct then the average of the scores will converge in probability to zero and, appropriately normalized, be asymptotically normal. For this model we have
\[ \frac{\partial \ln L(\beta)}{\partial \beta} = \frac{1}{\sigma^2}X'(y - X\beta). \]

For simplicity, we consider the case from Section 8.4.3 where the restrictions have been written in zero form, requiring the final q = k2 coefficients to be zero, so β̃ = (β̃1′, 0′)′. Now, by definition, the restricted estimates yield
\[ 0 = \frac{\partial \ln L(\tilde{\beta})}{\partial \beta_1} = \frac{1}{\sigma^2}X_1'(y - X\tilde{\beta}), \]

so for the LM statistic we only consider the final q, possibly nonzero, elements of the score:
\begin{align*}
\frac{\partial \ln L(\tilde{\beta})}{\partial \beta_2} &= \frac{1}{\sigma^2}X_2'(y - X\tilde{\beta}) = \frac{1}{\sigma^2}X_2'e_r \\
&= \frac{1}{\sigma^2}X_2'M_1 y = \frac{1}{\sigma^2}X_2'M_1 u,
\end{align*}
where M1 = I_n − X1(X1′X1)⁻¹X1′.

By normality of u, we have
\[ \frac{\partial \ln L(\tilde{\beta})}{\partial \beta_2} \sim N\Bigl(0,\ \frac{1}{\sigma^2}X_2'M_1X_2\Bigr). \]

The corresponding quadratic form,
\[ \frac{\partial \ln L(\tilde{\beta})}{\partial \beta_2'}\Bigl(\frac{1}{\sigma^2}X_2'M_1X_2\Bigr)^{-1}\frac{\partial \ln L(\tilde{\beta})}{\partial \beta_2} = \frac{u'M_1X_2(X_2'M_1X_2)^{-1}X_2'M_1u}{\sigma^2}, \]
will have a χ²_q distribution. The LM statistic is the feasible version of this quadratic form, replacing σ² by the estimate σ̂² = e_u′e_u/n:
\begin{align*}
LM &= \frac{u'M_1X_2(X_2'M_1X_2)^{-1}X_2'M_1u}{\hat{\sigma}^2} \\
&= \frac{u'M_1u - u'[M_1 - M_1X_2(X_2'M_1X_2)^{-1}X_2'M_1]u}{\hat{\sigma}^2} \\
&= \frac{e_r'e_r - e_u'e_u}{\hat{\sigma}^2} = \frac{e_r'e_r - e_u'e_u}{e_u'e_u/n} \\
&= \frac{(e_r'e_r - e_u'e_u)/k_2}{e_u'e_u/(n-k)} \times \frac{n\,k_2}{n-k} = F \times \frac{n\,k_2}{n-k}.
\end{align*}

And we see that the LM statistic is, like the LR, a monotonically increasing nonstochastic function of the F-statistic. So the finite-sample critical values for the LM statistic are the same nonstochastic transformation of the F-statistic critical values. Not only so, but the LM is exactly the same as the W.
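The following hedged sketch collects the relationships derived in this appendix, computing F, W, LR, and LM from the restricted and unrestricted sums of squares; the inputs are illustrative placeholders.

```python
# Hedged sketch: the trinity of statistics from restricted/unrestricted SSEs,
# following the appendix derivations (q = k2 zero restrictions, k regressors).
import numpy as np

def trinity_from_sse(SSE_r, SSE_u, n, k, q):
    F  = ((SSE_r - SSE_u) / q) / (SSE_u / (n - k))
    W  = F * q * n / (n - k)            # Wald form derived above
    LR = n * np.log(SSE_r / SSE_u)      # likelihood ratio
    LM = F * q * n / (n - k)            # LM as derived above (equals W here)
    return F, W, LR, LM
```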


Chapter 9

Prediction in Linear Models

In the first chapter, one of the three statistical tasks we outlined for econometrics was to forecast the behavior of the variable of interest using an estimated model of the process that generates the variable. For example, we might be interested in predicting the GPA of an incoming college student using a model based on high school grades and test scores. Or we might seek to predict the response of GDP levels to various choices of levels of the policy instruments such as the money supply, government expenditures, and tax levels. In either case we have a set of explanatory variables, whose values we know, and seek to predict the corresponding values of the dependent variable based on an estimated relationship. Estimation and inference concerning this relationship were studied in the previous two chapters. In this chapter, we will examine how to use these results to accomplish prediction when the underlying model is the linear regression model.

9.1 Notation and Review

It is useful in this chapter to introduce a more compact notation for a single observation on the linear relationship. Specifically, from before we have
\[ y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + u_i, \qquad i = 1, 2, \ldots, n, \]
which can be rewritten as
\[ y_i = x_i'\beta + u_i, \qquad i = 1, 2, \ldots, n, \]
where x_i′ = (x_{i1}, x_{i2}, ..., x_{ik}) is the 1 × k vector of explanatory variables for observation i. For the bivariate model studied in Chapter 6, k = 2, and x_{i1} = 1 yields β1 as the intercept while x_{i2} = x_i yields β2 as the slope coefficient.

In terms of the single observation representation, we have the assumptions from Chapter 7. For the disturbances we assume

(i) E(u_i) = 0 for all i

(ii) E(u_i²) = σ² for all i

(iii) E(u_i u_ℓ) = 0 for all i ≠ ℓ.

For the independent or explanatory variables, we assume

(iv) x_{ij} nonstochastic for all i, j

(v) the explanatory variables x_{i1}, x_{i2}, ..., x_{ik} are linearly independent (no exact collinearity).

For the purposes of inference in finite samples, we also assume

(vi) u_i ∼ i.i.d. N(0, σ²) for all i.

Using matrix notation, we can equivalently write all n observations on this model, more compactly, as

y = Xβ + u. (9.1)

where
\[
\underset{n\times 1}{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},\qquad
\underset{n\times k}{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix},\qquad
\underset{k\times 1}{\beta} = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix},\qquad
\underset{n\times 1}{u} = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}.
\]

Note that x_i′ is the ith row of X. The assumptions can be rewritten in terms of the matrix form of the model.

Specifically we have, for the disturbances,

(i) E(u ) = 0

and

(ii),(iii) Cov(u ) = E(uu′ ) = σ2In,

where In is an n×n identity matrix. And for the explanatory variables, we have

(iv) X is nonstochastic.

(v) X has full column rank (the columns are linearly independent).

And for inferences in finite samples, we have


(vi) u ∼ N( 0, σ2In ).

In Chapter 7, we introduced the least squares estimator of the coefficient vector β. Specifically, we obtained
\begin{align*}
\hat{\beta} &= (X'X)^{-1}X'y \tag{9.2} \\
            &= \beta + (X'X)^{-1}X'u \tag{9.3}
\end{align*}
as long as |X′X| ≠ 0, which is assured by Assumption (v). Since X is nonstochastic by assumption (iv), we have, also using assumption (i),
\[ E(\hat{\beta}) = \beta. \]
Thus, the OLS estimator is unbiased. Also, using assumptions (ii) and (iii),
\[ \operatorname{Cov}(\hat{\beta}) = E\bigl[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'\bigr] = \sigma^2(X'X)^{-1}. \]
OLS was shown to be the best unbiased estimator within the class of estimators that are linear in y (BLUE). Note that we have used all of Assumptions (i)–(v) to get to this point.

Suppose that we introduce assumption (vi), so the u_i's are jointly normal:
\[ u \sim N(0, \sigma^2 I_n). \tag{9.4} \]
Since β̂ is linear in u and (X′X)⁻¹X′ is nonstochastic, it follows that β̂ is also jointly normally distributed. We already know the mean and covariance matrix, so we have
\[ \hat{\beta} \sim N\bigl(\beta,\ \sigma^2(X'X)^{-1}\bigr) \tag{9.5} \]
as the joint distribution of β̂. The normality of the estimator proved useful in performing inferences in finite samples, as discussed in the previous chapter. The individual elements of β̂ will be marginally normal, so
\[ \hat{\beta}_j \sim N(\beta_j, \sigma^2 d_{jj}) \]
for j = 1, 2, ..., k and d_{jj} = [(X′X)⁻¹]_{jj}. We may now apply the standard normal transformation to obtain
\[ \frac{\hat{\beta}_j - \beta_j}{\sqrt{\sigma^2 d_{jj}}} \sim N(0, 1), \tag{9.6} \]
which was the basis for testing hypotheses on single coefficients. The problem with using this ratio for inference is that σ² is unknown. An estimator is provided by
\[ s^2 = \frac{e'e}{n-k}, \tag{9.7} \]


which was shown to be unbiased. Moreover, under the normality assumption (vi), we found that
\[ \frac{(n-k)s^2}{\sigma^2} \sim \chi^2_{n-k} \]
and is stochastically independent of β̂. Combining this result with the infeasible standard normal ratio yielded the feasible statistic
\[ \frac{\hat{\beta}_j - \beta_j}{\sqrt{s^2[(X'X)^{-1}]_{jj}}} \sim t_{n-k}. \]
This ratio was the basis for conducting hypothesis tests, confidence intervals, and prob-values in the previous chapter.

9.2 Prediction Properties

Suppose that we wish to predict the value of the dependent variable, possibly outside the sample, given only the values of the independent or explanatory variables. Let the subscript ∗ denote the prediction period; then the value of the dependent variable generated by the model is
\begin{align*}
y_* &= \beta_1 x_{*1} + \beta_2 x_{*2} + \cdots + \beta_k x_{*k} + u_* \tag{9.8} \\
    &= x_*'\beta + u_*, \tag{9.9}
\end{align*}
where x_*′ = (x_{*1}, x_{*2}, ..., x_{*k}). Note that the objective of our inference, y_*, which we label the predictand, is now a random variable. This is fundamentally different from the estimation problem, where our objective is a fixed real number. Since the disturbance term u_* is outside the sample period and has yet to be drawn, it is a priori unknowable. Consistent with previous chapters, we assume that x_* is nonstochastic and known and that u_* has zero mean, variance σ², and is uncorrelated with the sample disturbances u.

We consider prediction strategies or predictors, denoted ŷ_*, that are based on the sample information and the given values of the independent variables. The performance of a predictor is based on the properties of the prediction error
\[ y_* - \hat{y}_* = u_* + x_*'\beta - \hat{y}_*, \]
which is the difference between the predictand and the predictor. The expectation of this error is the prediction bias
\[ B_* = E[\,y_* - \hat{y}_*\,] = x_*'\beta - E[\hat{y}_*], \]
and the expectation of its square is the mean squared prediction error (MSPE)
\[ M_* = E[(y_* - \hat{y}_*)^2]. \]

Rather obviously, we prefer predictors which yield zero values of B_* and small values of M_*.

Some additional structure can be given to the mean squared prediction error. Define μ_* = x_*′β; then we find
\begin{align*}
M_* &= E\bigl[((y_* - \mu_*) + (\mu_* - \hat{y}_*))^2\bigr] \\
&= E[(y_* - \mu_*)^2] + 2E[(y_* - \mu_*)(\mu_* - \hat{y}_*)] + E[(\mu_* - \hat{y}_*)^2] \\
&= E[u_*^2] + 2E[u_*(\mu_* - \hat{y}_*)] + E[(\mu_* - \hat{y}_*)^2] \\
&= \sigma^2 + E[(x_*'\beta - \hat{y}_*)^2]
\end{align*}
since ŷ_* is based on sample information and hence uncorrelated with u_*. We see that there are two components to the mean squared prediction error. The first, σ², arises because this is a prediction problem and u_* has yet to be drawn and is unknowable. The second depends on the precision with which we estimate x_*′β, the expectation of the target given the explanatory variables x_*. This naturally suggests ŷ_* = x_*′β̂ as a predictor, whose behavior we consider in the next two sections.

9.3 Bivariate Least Squares Prediction

We now consider the prediction problem in the bivariate regression model. Although all the results will be special cases of the more general results in the next section, particular insight is gained by examining this simpler problem first. From Chapter 6, the model is

yi = α+ βxi + ui i = 1, 2, . . . , n

so the corresponding predictand is

y∗ = α+ βx∗ + u∗.

From the previous section, the MSPE is minimized by the infeasible predictor μ_* = α + βx_*. Since α and β are unknown, we use the least squares estimators to obtain
\[ \hat{y}_* = \hat{\alpha} + \hat{\beta}x_* \]
as a feasible predictor. Note that this predictor is a single point on the real line and is hence termed a point predictor.

The prediction error for this simpler problem is given by
\[ y_* - \hat{y}_* = (\alpha - \hat{\alpha}) + (\beta - \hat{\beta})x_* + u_*. \]
Clearly, since u_* has mean zero and the coefficient estimators are both unbiased,
\[ B_* = E[\,y_* - \hat{y}_*\,] = 0 \]
and the least squares predictor is unbiased. This is a measure of the location of the predictor relative to the unrealized predictand. On average, we have the correct level.


We are much more interested in the precision of the predictor, which is measured by the MSPE. Specifically, we examine
\begin{align*}
M_* &= E[(y_* - \hat{y}_*)^2] \\
&= E\bigl[((\alpha - \hat{\alpha}) + (\beta - \hat{\beta})x_* + u_*)^2\bigr] \\
&= E[(\alpha - \hat{\alpha})^2] + 2E[(\alpha - \hat{\alpha})(\beta - \hat{\beta})]x_* + E[(\beta - \hat{\beta})^2]x_*^2 + E[u_*^2].
\end{align*}
Using the results for variances and covariances from Chapter 6, we find
\begin{align*}
M_* &= \sigma^2\frac{\sum_{i=1}^n x_i^2}{n\sum_{i=1}^n (x_i - \bar{x})^2} - 2\sigma^2\frac{\sum_{i=1}^n x_i x_*}{n\sum_{i=1}^n (x_i - \bar{x})^2} + \sigma^2\frac{x_*^2}{\sum_{i=1}^n (x_i - \bar{x})^2} + \sigma^2 \\
&= \sigma^2\Bigl[\,1 + \frac{\frac{1}{n}\sum_{i=1}^n x_i^2 - \frac{2}{n}\sum_{i=1}^n x_i x_* + x_*^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\Bigr] \\
&= \sigma^2\Bigl[\,1 + \frac{\frac{1}{n}\sum_{i=1}^n (x_i - x_*)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\Bigr] \\
&= \sigma^2\Bigl[\,1 + \frac{\frac{1}{n}\sum_{i=1}^n ((x_i - \bar{x}) + (\bar{x} - x_*))^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\Bigr] \\
&= \sigma^2\Bigl[\,1 + \frac{1}{n} + \frac{(\bar{x} - x_*)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\Bigr] = \sigma^2 m_*,
\end{align*}
where m_* = 1 + 1/n + (x̄ − x_*)²/∑_{i=1}^n (x_i − x̄)² = M_*/σ². Notice that this measure is quadratic in (x̄ − x_*) and achieves a minimum at x_* = x̄. This means that the predictor's performance will degrade if we attempt to predict in situations where the conditioning variables are far from their historical averages. Also notice that the second and third terms will become very small as n grows large, so the impact of using estimated coefficients becomes negligible in large samples.

The basic limitation of the point predictor ŷ_* is that it is a single point on the real line; the probability of the value predicted being realized is zero. Accordingly, we will move to interval predictors, which have some probability of including the yet-to-be-drawn realization. This requires a distributional assumption, so we add Assumption (vi), normality, to the properties of the u_i and u_*. Since the prediction error is linear in (α̂, β̂), and hence in the u_i, and also in u_*, it will also be normal. Specifically,
\[ (y_* - \hat{y}_*) \sim N(0, \sigma^2 m_*) \]
and
\[ \frac{y_* - \hat{y}_*}{\sqrt{\sigma^2 m_*}} \sim N(0, 1). \]

From the previous chapter we know that (n − 2)s²/σ² ∼ χ²_{n−2} and is independent of (α̂, β̂), and now also of u_*, so
\[ z_* = \frac{(y_* - \hat{y}_*)/\sqrt{\sigma^2 m_*}}{\sqrt{\bigl((n-2)s^2/\sigma^2\bigr)/(n-2)}} = \frac{y_* - \hat{y}_*}{\sqrt{s^2 m_*}} \sim t_{n-2}. \]


This result may now be utilized to form a prediction interval. Choosing a significance level, say α = .05, and a corresponding critical value from the t-tables for n − 2 degrees of freedom, say t.025, we know that
\begin{align*}
0.95 &= \Pr\Bigl[-t_{.025} \le \frac{y_* - \hat{y}_*}{\sqrt{s^2 m_*}} \le t_{.025}\Bigr] \\
&= \Pr\Bigl[\hat{y}_* - t_{.025}\sqrt{s^2 m_*} \le y_* \le \hat{y}_* + t_{.025}\sqrt{s^2 m_*}\Bigr].
\end{align*}

y∗. Notice that like the confidence interval for a coefficient estimate the intervalis a random variable with both the center and width of the interval random.Unlike the confidence interval, the target is also a random variable. We have a95% chance of our random interval including the random predictand. Since thewidth depends directly on m∗, the size of the prediction interval will be smallestfor x∗ = x.

9.4 Multivariate Least Squares Prediction

We now consider the more general problem, where the predictor is
\begin{align*}
\hat{y}_* &= x_*'\hat{\beta} = x_*'(X'X)^{-1}X'y \\
&= x_*'(X'X)^{-1}X'(X\beta + u) \\
&= x_*'\beta + x_*'(X'X)^{-1}X'u. \tag{9.10}
\end{align*}

Note that this predictor is linear in y and hence in u. Now, it is easy to see that
\[ E(\hat{y}_*|x_*) = x_*'\beta, \tag{9.11} \]
so
\[ E[(y_* - \hat{y}_*)|x_*] = B_* = 0, \tag{9.12} \]
whereupon ŷ_* is an unbiased predictor of y_*. In terms of expected squared error, we find that the variance, or expected squared error of the predictor around its mean, is
\[ \operatorname{Var}(\hat{y}_*) = E(\hat{y}_* - x_*'\beta)^2 = \sigma^2 x_*'(X'X)^{-1}x_*, \tag{9.13} \]
and that the mean squared prediction error (MSPE), or expected squared error of the predictor around its stochastic target y_*, is
\[ M_* = E[(y_* - \hat{y}_*)^2] = \sigma^2[\,1 + x_*'(X'X)^{-1}x_*\,] = \sigma^2 m_*. \tag{9.14} \]
Note that the typical diagonal element of X′X is a sum of squares and grows large, and hence (X′X)⁻¹ grows small, with the sample size n. Thus, in large samples, σ², which reflects the variation introduced by the a priori unknowable u_*, dominates the MSPE, and the variability arising from the estimation of β becomes relatively negligible.

Notice that the MSPE is quadratic in x_*. As with the bivariate model, it is of interest to ask what values of x_* yield smaller values of M_*. Specifically, we seek to
\[ \min_{x_*}\ x_*'(X'X)^{-1}x_*, \]
which, following some math, yields the same solution as the bivariate case, namely
\[ x_* = \bar{x}. \]
Thus we see that, in general, the predictions are more accurate when the conditioning information is typical of the sample on which the estimates and predictions are based, and the imprecision of the predictor increases quadratically as we depart from the mean of the conditioning variables.

It is this precision and its dependence on the conditioning variables which leads us to entertain interval rather than point predictors. If we assume, following Assumption (vi), that u_i ∼ i.i.d. N(0, σ²), including for the prediction observation, then
\[ (y_* - \hat{y}_*) = x_*'(\beta - \hat{\beta}) + u_* = -x_*'(X'X)^{-1}X'u + u_* \sim N(0, \sigma^2 m_*), \]
so
\[ \frac{y_* - \hat{y}_*}{\sqrt{\sigma^2 m_*}} \sim N(0, 1). \]

And for the general k-variate case,
\[ (n-k)s^2/\sigma^2 \sim \chi^2_{n-k}, \]
independent of β̂, u_*, and (y_* − ŷ_*), whereupon
\[ \frac{y_* - \hat{y}_*}{\sqrt{s^2 m_*}} \sim t_{n-k}. \]
Proceeding as above, we find the 95% interval
\[ 0.95 = \Pr\Bigl[\hat{y}_* - t_{.025}\sqrt{s^2 m_*} \le y_* \le \hat{y}_* + t_{.025}\sqrt{s^2 m_*}\Bigr], \]
where t.025 is a critical value taken from the t-tables with n − k degrees of freedom.
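A corresponding sketch for the general k-variate case, again with illustrative names (X includes the column of ones and x_star is a conformable vector with first element one), might look as follows.

```python
# Hedged sketch: k-variate point prediction and prediction interval.
import numpy as np
from scipy import stats

def prediction_interval(X, y, x_star, level=0.95):
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    s2 = e @ e / (n - k)
    m_star = 1.0 + x_star @ XtX_inv @ x_star
    y_hat = x_star @ beta_hat
    t_crit = stats.t.ppf(1 - (1 - level) / 2, n - k)
    half = t_crit * np.sqrt(s2 * m_star)
    return y_hat, (y_hat - half, y_hat + half)
```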

The predictor ŷ_* and its standard error √(s²m_*) can be easily calculated from standard least squares statistics produced by an auxiliary regression. Let x_i′ = (1, x_{2i}′) denote the explanatory variables, β′ = (β1, β2′), and recall μ_* = x_*′β as the target of our predictor. Then form the regression
\begin{align*}
y_i - \mu_* &= x_i'\beta + u_i - x_*'\beta \\
&= (x_i - x_*)'\beta + u_i \\
&= (x_{2i} - x_{2*})'\beta_2 + u_i, \qquad i = 1, 2, \ldots, n,
\end{align*}
since the first elements of x_i and x_* are both unity. Thus an equivalent regression is
\begin{align*}
y_i &= \mu_* + (x_{2i} - x_{2*})'\beta_2 + u_i \\
&= \bigl(1, (x_{2i} - x_{2*})'\bigr)\binom{\mu_*}{\beta_2} + u_i, \qquad i = 1, 2, \ldots, n.
\end{align*}
Performing least squares on this regression yields ŷ_* = μ̂_* as the intercept estimate, and the usual standard error reported for the intercept is √(s²x_*′(X′X)⁻¹x_*); adding s² to its square then gives s²m_*. These may be used to find an interval predictor as discussed in the last paragraph.
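The auxiliary-regression device can be sketched as follows; the names X2 (regressors excluding the constant), y, and x2_star are illustrative, and the prediction standard error is recovered by adding s² to the squared intercept standard error.

```python
# Hedged sketch: prediction via the auxiliary regression with centered regressors.
import numpy as np

def prediction_via_auxiliary_regression(X2, y, x2_star):
    n = y.shape[0]
    Z = np.column_stack([np.ones(n), X2 - x2_star])   # shifted design matrix
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    gamma_hat = ZtZ_inv @ Z.T @ y
    e = y - Z @ gamma_hat
    s2 = e @ e / (n - Z.shape[1])
    y_star_hat = gamma_hat[0]                          # intercept = point prediction
    se_mu = np.sqrt(s2 * ZtZ_inv[0, 0])                # s.e. of the intercept
    se_pred = np.sqrt(s2 + se_mu ** 2)                 # sqrt(s^2 m_*)
    return y_star_hat, se_pred
```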

If we restrict our attention to unbiased predictors, which have E(ŷ_*|x_*) = x_*′β, then the least squares predictor has efficiency properties. Consider any predictor, say ỹ_*, that is unbiased; then for any x_*
\[ E(\tilde{y}_*|x_*) = x_*'\beta, \]
which implies that ỹ_* = x_*′β̃ for some unbiased estimator β̃. If ỹ_* is linear in y then so is β̃, and we find that ŷ_* has smaller variance as a consequence of the Gauss-Markov theorem and, in fact, is the best (minimum variance) linear unbiased predictor (BLUP) of y_*. Note that this efficiency applies whether or not the disturbances are normally distributed. When we add the normality assumption this result strengthens and ŷ_* is the best unbiased predictor (BUP).

9.5 An Example

We will now apply the results obtained in this chapter to the extended example presented in the previous chapter. We will generate a series of point and interval predictions for both the full sample of 527 observations and the first 25 observations. This will enable us to see the impact of more observations on the prediction intervals. We will also generate predictions for both the simplest regression considered and the most complicated. This will enable us to see the effect of adding regressors on the predictions.

First we look at predictions for the simplest regression, which was a bivariate regression of the log-wage on education. The regression results for a sample size of 25 are:

\[ \ln(w_i) = \underset{(0.629)}{0.959} + \underset{(0.046)}{0.095}\,E_i + e_i \]

with s² = 0.268 and R² = 0.156. A plot of the in-sample predictions and a 50 percent prediction interval yields Figure 9.1.

[Figure 9.1: Bivariate Regression, n = 25 — actual and fitted log wages against education, with lower and upper bounds of a 50% prediction interval]

Note the quadratic nature of the prediction interval and how it is narrowest at the mean value of the independent variable.

The complete regression over the smaller sample size yields

\[ \ln(w_i) = \underset{(0.791)}{0.705} + \underset{(0.059)}{0.109}\,E_i + \underset{(0.0322)}{0.0157}\,X_i - \underset{(0.00077)}{0.00018}\,X_i^2 - \underset{(0.310)}{0.169}\,B_i - \underset{(0.348)}{0.068}\,H_i - \underset{(0.253)}{0.408}\,G_i + e_i \]

with s² = 0.278 and R² = 0.313. Note that the R² is much larger, which reflects the added regressors. For prediction purposes, we set the explanatory variables other than education to their means and generate point and interval predictions for education levels 1 to 25. A plot of the in-sample predictions for this case is presented in Figure 9.2. Note how the prediction intervals for this case are smaller, which reflects the additional explanatory power of the added regressors.

We now turn to the predictions for the full sample. The regression results for the bivariate regression are (from the last chapter)

\[ \ln(w_i) = \underset{(0.111)}{0.985} + \underset{(0.0083)}{0.0822}\,E_i + e_i \]

with s² = 0.225 and R² = 0.157. Note that both s² and R² are roughly the same as for the smaller sample size, which is to be expected.

[Figure 9.2: Complete Regression, n = 25 — actual and fitted log wages against education, with lower and upper bounds of a 50% prediction interval]

A plot of the point and 50 percent interval predictions for this case yields Figure 9.3. Note how the upper and lower bounds of the interval are now almost linear. This reflects the fact that the prediction error variance σ²m_* = σ²[1 + x_*′(X′X)⁻¹x_*] ≈ σ², since the second term is O(1/n). The estimation error disappears from the prediction error in large samples.

Finally, we look at predictions for the complete regression over the full sample. The regression results are (from the last chapter)

\[ \ln(w_i) = \underset{(0.120)}{0.539} + \underset{(0.0079)}{0.0927}\,E_i + \underset{(0.0054)}{0.0446}\,X_i - \underset{(0.00012)}{0.00074}\,X_i^2 - \underset{(0.055)}{0.101}\,B_i - \underset{(0.087)}{0.094}\,H_i - \underset{(0.0369)}{0.268}\,G_i + e_i \]

with s² = 0.176 and R² = 0.348. Again, both s² and R² are roughly the same as for the smaller sample size; the differences in the size of s² are probably due to sampling error. A plot of the point and interval predictions for this case gives Figure 9.4. The notable differences from the previous plots are the smaller intervals and the almost linear upper and lower bound lines for the intervals. These reflect the impact of additional variables and additional observations, respectively. Both of these phenomena are predicted by the theory.

[Figure 9.3: Bivariate Regression, n = 527 — actual and fitted log wages against education, with lower and upper bounds of a 50% prediction interval]

[Figure 9.4: Complete Regression, n = 527 — actual and fitted log wages against education, with lower and upper bounds of a 50% prediction interval]