Chapter 8: Multiple Regression for Time Series (under revision)
Contents
Introduction
8.1 Graphical analysis and preliminary model development
8.2 The multiple regression model
8.2.1 The method of least squares
8.3 Testing the overall model
8.3.1 The F-test for multiple variables
8.3.2 ANOVA in simple regression
8.3.3 The relationship between F and R2
8.3.4 S and Adjusted R2
8.4 Testing individual coefficients
8.4.1 Testing a group of coefficients
8.5 Checking the assumptions
8.5.1 Analysis of residuals for gas price data
8.6 Forecasting using multiple regression
8.6.1 The point forecast
8.6.2 Prediction intervals
8.6.3 Forecasting more than one period ahead
8.7 Principles
Summary
References
Exercises
Mini-Cases
Mini-case 8.1: The Volatility of Google Stock
Mini-case 8.2: Economic Factors in Homicide Rates
Mini-case 8.3: Forecasting Natural Gas Consumption for the DC Metropolitan Area
Mini-case 8.4: Economic Factors in Property Crime
Mini-case 8.5: U.S. Retail & Food Service Sales
Mini-case 8.6: U.S. Unemployment Rates
Why have you included so many variables in your regression model?
(Anonymous Statistician)
Why have you included so few variables in your regression model?
(Anonymous Economist)
Introduction
One of the key restrictions we faced in Chapter 7 was the inability to consider more than
one explanatory variable at a time. Yet both the discussion there and basic common
sense indicate that events in the business world are typically affected by multiple inputs.
We may not be able to measure all of them, but we do need to identify the main factors
and incorporate them into our forecasting framework. A first step towards identifying an
appropriate set of variables is to examine plots of the data, which we do in section 8.1,
although we need to proceed with caution as multiple dependencies in the data may make
interpretations complex. In section 8.2, we then proceed to formulate a statistical model
that incorporates multiple inputs and to interpret the coefficients in that model.
Estimation of the parameters follows the Method of Least Squares developed in section
7.2 and is extended to cover multiple regression in section 8.2.1.
Once we have developed a model, we need to know whether it is useful. For simple
linear regression, this question was straightforward. We checked to see whether or not
there was a statistically meaningful relationship between X and Y and that completed the
analysis, as in section 7.6. The question is now more complex. For example, sales of a
product may depend upon both advertising expenditures and price. Either variable alone
may provide only a modest description of what is going on, whereas the two taken
together may give a much better level of explanation. Conversely, a model for national
retail sales that includes both consumer expenditures and consumer incomes may be only
marginally better than a model that includes only one of them. The reason for this
apparent anomaly is that if X1 and X2 are highly correlated and X1 is already in the
model, X2 will not bring much new information to the table. To resolve such questions
we need to proceed in two steps:
1. Is the overall model useful? If the answer is NO, we go back to the drawing
board.
2. If the answer is YES, we check whether individual variables in the model are
useful for forecasting purposes.
These two steps are explored in sections 8.3 and 8.4. As we explained in section 7.5, our
analysis is based upon a set of standard assumptions. In section 8.5 we briefly revisit
those assumptions and present graphical procedures to determine whether the
assumptions are reasonable. Taking action to deal with failures of the assumptions is a
more difficult step, which we defer to Chapter 9.
Once the model has been shown to be effective and the assumptions appear to be
reasonable, we are in a position to generate forecasts. Point forecasts and prediction
intervals are considered in section 8.6. Finally in section 8.7, we consider some of the
key principles that underlie the development of multiple regression models.
At the end of the chapter, we provide details of six Mini-cases. Rather than work through
“pre-packaged” problem sets, these examples provide a more realistic approach to model-
building using multiple regression methods. The same mini-cases will be revisited at the
end of Chapter 9, to make use of the more advanced skills developed in that chapter.
8.1 Graphical analysis and preliminary model development
We return to the study of gasoline prices, initially examined in section 7.3 (Gas
prices_1.xlsx). The matrix plot we considered there is reproduced as Figure 8.1 for
convenience. The variables in the plot are:
The price of regular unleaded gasoline (‘Unleaded’; in cents per U.S. gallon)
Total disposable income (‘Disposable Income’; in billions of current Dollars)
The First Purchase Price of Crude Oil (‘L1_crude’; in cents per barrel, lagged one
month)
Unemployment (‘Unemploy’; overall percentage rate for the U.S.)
The S&P 500 Stock Index (‘S&P’).
Examination of the plot already revealed that the strongest linear relationship for gas
price appeared to be the lagged value of the first purchase price of crude oil. However,
we also see a somewhat upward sloping relationship between price and disposable
income, possibly due to the effect of inflation on both series. The general level of
economic activity is reflected in a downward sloping relationship with unemployment
and an upward sloping relationship with the S&P Index. None of these last three
relationships appears to be nearly as strong as that of L1_crude, but they all make
economic sense and might improve the overall ability to forecast gas prices. Also, as we
saw in Table 7.1, all their correlations with gas prices are significantly different from
zero, so potentially they may add value to the model.
With this example as background, we now examine the specification of the multiple
regression model.
Figure 8.1: MINITAB matrix plot for gas prices against disposable income, lagged price
of crude oil, unemployment and the S&P Index.
8.2 The multiple regression model
The multiple regression model is a direct extension of the simple regression model
specified in section 7.5. Note that we now move directly to the specification of the
underlying model, having already motivated the basic ideas in Chapter 7. We consider K
explanatory variables X1, X2, …, XK and assume that the dependent variable Y is
linearly related to them through the following model:
Y = β0 + β1X1 + β2X2 + … + βKXK + ε    (8.1)
The coefficients in (8.1) may be interpreted as follows:
β0 denotes the intercept, which is the expected value of Y when all the {Xj} are
zero, so that the equation reduces to Y = β0.
βj denotes the slope for Xj: when Xj increases by one unit and all the other X’s are
kept fixed, the expected value of Y increases by βj units.
Beyond the extended form of the expected value, or explained component of the model,
the underlying assumptions are the same as for simple regression given in section 7.5.1.
That is, we need only extend Assumption R1 appropriately.
Assumption R1: For given values of the explanatory variables X1, X2, …, XK, the
expected value of Y is written as E(Y) and has the form:

E(Y) = β0 + β1X1 + β2X2 + … + βKXK
Assumption R2: The difference between an observed Y and its expectation is a random
error, denoted by ε. The complete model is:

Y = E(Y) + ε = [Expected value] + [random error]    (8.2)
Assumption R3: The errors have zero means.
Assumption R4: The errors for different observations are uncorrelated with one another
and with other explanatory variables.
Assumption R5: The error terms come from distributions with equal variances.
Assumption R6: The errors are drawn from a normal distribution.
As in section 7.5, if we take assumptions R3 – R6 together, we are making the claim that
the random errors are independent and normally distributed with zero means and equal
variances.
8.2.1 The method of ordinary least squares (OLS)
The Method of Ordinary Least Squares (OLS) may be used to estimate the unknown
parameters. As for simple linear regression, we choose the sample coefficients, now
{b0, b1, …, bK}, to minimize the sum of squared errors, SSE. That is, we seek to minimize

SSE = Σi=1..n ei2 = Σi=1..n (Observedi − Fittedi)2 = Σi=1..n [Yi − (b0 + b1X1i + b2X2i + … + bKXKi)]2    (8.3)
The technical details were summarized in Appendix 7A and we will not consider the
computational issues further1. After we have determined the best fitting model, we
estimate the error terms using the (least squares) residuals, defined as:
ei = Yi − b0 − b1X1i − b2X2i − … − bKXKi    (8.4)
The residuals form the basis of many of the tests and diagnostic checks that we use to
validate the model, as we shall see in later sections.
Example 8.1: Multiple regression model for gasoline prices
We employ the observations for January 1996 – March 2002, as before. The Least
Squares solution is:
Unleaded = 84.6 + 0.0268 L1_crude + 0.00599 PDI
- 6.94 Unemploy - 0.0184 S&P 500
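A minimal sketch of how such a least squares fit can be reproduced in Python is shown below. It uses hypothetical synthetic data (the intercept 85 and slopes 0.027 and 0.006 are made up for illustration, loosely echoing the fitted values above) rather than the actual gas price series in Gas prices_1.xlsx:

```python
import numpy as np

# Hypothetical synthetic data mimicking the structure of model (8.1);
# the coefficients 85, 0.027 and 0.006 are made up for illustration.
rng = np.random.default_rng(42)
n = 74
x1 = rng.uniform(1000, 3000, n)   # e.g. a lagged crude price, cents per barrel
x2 = rng.uniform(6000, 8000, n)   # e.g. disposable income, billions of dollars
y = 85 + 0.027 * x1 + 0.006 * x2 + rng.normal(0, 7, n)

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones(n), x1, x2])

# Least squares estimates (b0, b1, b2), minimizing SSE as in (8.3)
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals as in (8.4); with an intercept they sum to (essentially) zero
e = y - X @ b
print(b)
```

The recovered slopes should land close to the values used to generate the data, with the gap shrinking as the error standard deviation falls or the sample size grows.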
Examination of the coefficients indicates that an increase of $1 in the price of
a barrel of crude may be expected to increase the price at the pump by about 2.7 cents per
gallon, slightly lower than the figure we got with simple regression. Likewise, an
1 Standard texts on regression analysis provide the necessary details; see for example Kutner et al. (2005,
pp. 15 – 20 and 222 – 227).
increase in disposable income produces an increase in the expected price, whereas an
increase in unemployment reduces the expected gas price. The coefficient for the S&P
Index is also negative; initially we might have expected an increase in the S&P to signal
increased economic activity and thus upward pressure on gas prices. However, the issue
of timing is important, and the negative sign could reflect the impact of good news in the
crude oil markets, which would both lower pump prices and boost the overall economy.
No matter how good the statistical fit, the forecaster should always check the face
validity of the proposed forecasting model.2 If the model passes the face-validity test, we
go ahead and check to see whether it makes sense statistically.
8.3 Testing the overall model
We specify the null hypothesis as the claim that the overall model is of no value or, more
explicitly, that none of the explanatory variables affects the expected value. Formally,
this statement is written as:

H0: β1 = β2 = … = βK = 0.
When the null hypothesis is true, none of the variables in the model contributes to
explaining the variation in Y. The alternative hypothesis, HA, states that the overall model
is of value, in that at least one of the explanatory variables has an effect:

HA: Not all βj = 0, j = 1, 2, …, K.
2 Imagine standing in front of your boss in her office. Can you give a plausible justification of the model
and all the coefficients? If not, develop a different model!
That is, there is some statistical relationship between Y and at least one of the Xj. If we
fail to reject H0, we conclude that the overall model is without value and we need to start
over. If we reject H0, we may still wish to eliminate those variables that do not appear to
contribute, so as to arrive at a more parsimonious model.3
8.3.1 The F-test for multiple variables
The test procedure is based upon the partition of the sum of squares and is known as the
Analysis of Variance, often referred to as ANOVA. The sums of squares are:
Total Sum of Squares: SST = SYY = Σi=1..n (Yi − Ȳ)2

Sum of Squared Errors: SSE = Σi=1..n (Yi − Ŷi)2

Sum of Squares explained by the Regression model: SSR = Σi=1..n (Ŷi − Ȳ)2
As was true in the case of simple regression, it may be shown that:
SST = SSR + SSE    (8.5)
The ANOVA test is usually summarized in tabular form and the general framework is
presented in Table 8.1.
The first column describes the partition into the two sources of variation: the sums of
squares explained by the regression model and the sum of squared errors.
3 When the model is based upon sound theoretical considerations, it makes sense to retain all the variables
in the model, even if some are not statistically significant, so long as the parameter estimates make sense.
This is typically true in econometric modeling. On the other hand, if we optimistically include variables on
a “see if it flies” basis, we will usually prefer to prune the model to the smaller number of “useful”
variables.
The second column gives the degrees of freedom (DF) associated with each of the
sums of squares; the total DF is (n-1) since we always start out with a constant term
in the model.
The third column provides the numerical values of the sums of squares.
Column four gives the Mean Squares, defined as [Sum of Squares / DF] for each
source: MSR and MSE respectively.
Column five yields the test statistic F = MSR/MSE.
The reason for introducing the new term Mean Square Error is that, when the null
hypothesis is true, both MSR and MSE have expected values equal to the error variance,
σ2. Thus, the test statistic4 F = MSR/MSE should have a value in the neighborhood of 1.0
if the null hypothesis is appropriate. When the regression model is useful, the amount of
variation explained by the model will increase, so MSR will increase relative to MSE and
F will increase. Thus, we will reject H0 for sufficiently large values of F.
Source           DF       Sums of Squares   Mean Squares           F
Regression       K        SSR               MSR = SSR / K          MSR / MSE
Residual Error   n-K-1    SSE               MSE = SSE / (n-K-1)
__________________________________________________________
Total            n-1      SST
Table 8.1: General form of the ANOVA table
4 The analysis of variance was first derived by Sir Ronald Fisher, the father of modern inferential statistics.
The ratio was labeled F in his honor by another famous statistician, George Snedecor. Ironically, the test is
sometimes known as Snedecor’s F test.
We refer to the observed value of F generated from this table as Fobs. The decision rule
becomes:
Reject H0 if Fobs > Fα(K, n-K-1); otherwise do not reject H0.
The critical value for F depends upon the degrees of freedom for both the SSR (the
numerator DF = ν1 ) and the SSE (the denominator DF = ν2 ). Detailed tables for the
upper 5 percent and 1 percent points of the F-distribution are given in Appendix B4. An
extract of the 5 percent table is shown in Table 8.2.
ν2 \ ν1       1         2         3         4         5         6
1 161.448 199.500 215.707 224.583 230.162 233.986
2 18.513 19.000 19.164 19.247 19.296 19.330
3 10.128 9.552 9.277 9.117 9.013 8.941
4 7.709 6.944 6.591 6.388 6.256 6.163
5 6.608 5.786 5.409 5.192 5.050 4.950
6 5.987 5.143 4.757 4.534 4.387 4.284
7 5.591 4.737 4.347 4.120 3.972 3.866
8 5.318 4.459 4.066 3.838 3.687 3.581
9 5.117 4.256 3.863 3.633 3.482 3.374
10 4.965 4.103 3.708 3.478 3.326 3.217
Table 8.2: Extract of table of upper 5 percent points for the F distribution. Extracted from the
National Institute of Standards and Technology [NIST] website:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm#ONE-05-11-20
For example, if n = 14 and K = 3, we have ν1 = 3 and ν2 = 14-3-1 = 10 so that the critical
value from the table is 3.708. The values are also available through the EXCEL function
FINV; for example, ‘=FINV(0.05,3,10)’ returns 3.708.
Again, these tables are perhaps more of historical interest than anything else, since it is
generally much more convenient to use the P-value with the decision rule:
Reject H0 if P < α; otherwise do not reject H0.
Example 8.2: ANOVA for gas prices.
The ANOVA table for the gas prices model given in Example 8.1 with K = 4 is shown
below. Consistent with our usual convention, we specify α = 0.05.
Source DF SS MS F P
Regression 4 21268.0 5317.0 110.88 0.000
Residual Error 69 3308.9 48.0
Total 73 24576.9
The 5 percent point from the F tables with ν1 = 4 and ν2 = 69 is 2.525; since ν2 = 69 is not
listed, we take the next lowest DF in the table, which is ν2 = 60. Since Fobs = 110.88 is
much greater than 2.525, we emphatically reject the null hypothesis. The probability of
such an extreme value can also be calculated through EXCEL through the function
‘=FDIST(110.88,4,69)’ which delivers a P value of 0.000. More directly, we observe
that P < 0.05 and we reject H0. Recall that a P value of 0.000 does not mean zero, but
rather that P < 0.0005 and the result is rounded down. We conclude there is strong
evidence that the overall model is useful and the next step is to determine which variables
contribute.
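The arithmetic behind this table can be verified directly; a quick sketch using the sums of squares reported above:

```python
# ANOVA arithmetic for the gas price model, using the values in the table above
SSR, SSE = 21268.0, 3308.9
K, n = 4, 74

MSR = SSR / K              # 5317.0
MSE = SSE / (n - K - 1)    # about 48.0, on n - K - 1 = 69 degrees of freedom
F = MSR / MSE              # about 110.9, in line with the reported 110.88

print(MSR, round(MSE, 1), round(F, 1))
```

The tiny discrepancy against the printed F of 110.88 is just rounding in the reported mean squares.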
8.3.2 ANOVA in simple regression
We did not consider the Analysis of Variance in Chapter 7 because it did not provide any
additional information. When there is only one variable in the model, the test of the
overall model is formally equivalent to the test of the single slope. Referring back to the
computer output in Figure 7.11, we see that the P-value for ANOVA is identical with
that for the t-test of the slope. That is, in simple linear regression, the F and (two-sided) t
tests provide identical information. Another way of saying this is that, in simple linear
regression, F = t2 as can readily be verified numerically in Figure 7.11.
8.3.3 The relationship between F and R2
There is a simple relationship between F and R2, so the F test yields exactly the same
conclusion as if we had performed a test using R2. Starting from
F = MSR/MSE = (SSR / K) / (SSE / (n − K − 1))

we divide numerator and denominator by SST so that

F = [(n − K − 1)/K] × (SSR/SST) / (SSE/SST).

Since R2 = SSR/SST and 1 − R2 = SSE/SST, we have that

F = [(n − K − 1)/K] × R2/(1 − R2).    (8.6)
From inspection of equation (8.6), it is evident that an increase in R2 leads to an increase
in F, so that the ANOVA test is completely equivalent to a test based upon the coefficient
of determination. Either from the ANOVA table, or directly from the computer output,
we find for Example 8.2 that:
R2 = SSR/SST = 21268.0/24576.9 = 0.865
Thus, the coefficient of determination has increased compared to the single variable
model, which had R2 = 84.4%. Indeed, whenever we add a variable to the model, we find
that R2 increases (or strictly speaking, cannot decrease). However, the value of the F
statistic often falls because of the K in the denominator of (8.6). There is no
inconsistency here, but we need to recognize that the decline in F does not signal a
weakness in the model.
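Equation (8.6) can be checked numerically against the ANOVA results for the gas price model; a small sketch:

```python
# Verify equation (8.6) for the gas price model: F computed from R-squared
SSR, SST = 21268.0, 24576.9
K, n = 4, 74

R2 = SSR / SST                          # about 0.865
F = ((n - K - 1) / K) * R2 / (1 - R2)   # about 110.9, matching the ANOVA F
print(round(R2, 3), round(F, 1))
```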
8.3.4 S and Adjusted R2
The steady increase in R2 as new variables are added is a matter for some concern. A
better guide to the performance of the model is to look at S, the standard error, now
defined as:
S2 = MSE = SSE/(n − K − 1) = Σi=1..n ei2/(n − K − 1), with S = √MSE    (8.7)
If S is smaller, the model has improved as the result of including the extra variable,
although the improvement may be marginal. As we argued in section 7.7.3, S has a
straightforward interpretation as the accuracy of the predictions since it is an estimate of
the error standard deviation.
An alternative route to interpreting the overall accuracy of a multiple regression model is
through the adjusted form of R2. The construction is as follows:
Question: What value of R2 could we expect if we included K sets of arbitrary numbers in
the model? That is, if H0 was true.

Answer: With K “junk” variables in the model we would have an expected value for R2
equal to K/(n − 1). For example, if n = 21 and you include 10 variables in the model, you
can expect an R2 of 50% even if the model is pure junk.5
Solution: We adjust R2 so that the modified value is zero for “pure junk” but still
increases to 100% when the fit is perfect.
5 Include 20 “junk” variables and you get R2 = 100%! All you would be doing is playing “join the dots”
on the scatter diagram. Try more than 20 variables and the computer program will either fail, or take
corrective action by eliminating excess variables.
The plot of adjusted R2, which we abbreviate to R2(adj), is shown in Figure 8.2 for
n = 21 and K = 5. For example, when the observed value of R2 = 0.20, we find that
R2(adj) = -0.067, a model distinguished only in the sense that you would have been better off
using random numbers!
The general algebraic expression is:

R2(adj) = [(n − 1)/(n − K − 1)] × [R2 − K/(n − 1)]    (8.8)
The term inside the square brackets removes that part of R2 that could arise just by
chance, and the ratio in front of the square brackets then rescales the expression so that
R2(adj) = 1 when R2 = 1. When we add a variable to the model, R2(adj) will increase if
and only if S decreases, so that an evaluation based upon S has been shifted to a more
recognizable index. R2(adj) could be defined as the proportion of variance explained,
above and beyond that part which could be attributed to chance.
Figure 8.2: R2(adj) plotted against R2 when n = 21, K = 5.
Example 8.3: R-Sq(adj) and S for the gas prices model
We have K = 4 and n = 74, so that

S2 = 3308.9/69 = 47.96 and S = 6.92.
Since S2 = MSE, we could have read that value directly from the table in Example 8.2.
We then obtain:

R2(adj) = (73/69) × [0.865 − (4/73)] = 0.858
As we see in this example, when K is small relative to n, the adjusted value is only
marginally less than the original R2.
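The computation in equation (8.8) can be sketched directly from the sums of squares reported for this model:

```python
# Adjusted R-squared for the gas price model via equation (8.8)
SSR, SST = 21268.0, 24576.9
K, n = 4, 74

R2 = SSR / SST
R2_adj = ((n - 1) / (n - K - 1)) * (R2 - K / (n - 1))
print(round(R2_adj, 3))   # 0.858, as reported
```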
Computer programs will generally provide all the information discussed in this section in
summary form:
S = 6.92 R-Sq = 86.5% R2(adj) = 85.8%
8.4 Testing individual coefficients
Once we have established that the overall model is of value, we need to determine which
variables are useful and which, if any, do not contribute. The process is very similar to
that described in section 7.6, but there are some crucial differences. First, since there are
K variables in the model, we will perform K separate tests. We describe the procedure
for variable Xj, j = 1, 2, …, K.
The null hypothesis, now denoted by H0(j), states that the theoretical slope for Xj in the
regression is zero, given that the other variables are already in the model. We are not
testing for a direct relationship between Xj and Y, but rather a conditional relationship.
Given that the other variables are already in the model, does Xj add anything? The null
hypothesis may be written as:
H0(j): βj = 0, given that the Xi, i ≠ j, are in the model.
The alternative hypothesis is now denoted by HA(j) and states that the slope is not zero,
again assuming that the other variables are in the model. That is, there is a relationship
between Xj and Y even after accounting for the contributions of the other variables. We
write the alternative as:
HA(j): βj ≠ 0, given that the Xi, i ≠ j, are in the model.
As before, we assume the null hypothesis to be true, and then test this assumption. We
use the test statistic
t = bj / SE(bj)    (8.9)
Let tobs denote the observed value of this statistic. This value is to be compared with the
appropriate value from a table of Student’s t distribution with (n-K-1) DF; the degrees of
freedom are determined by the number of observations available to estimate S, which is
now (n-K-1), as seen from Table 8.1. If we use a significance level of 100α%, we denote
the value from these tables as tα/2(n-K-1). The decision rule for the test is:
If |tobs| > tα/2(n-K-1), reject H0(j); otherwise, do not reject H0(j).
As in chapter 7, we will usually find it more convenient to perform the test using the P-
value. The decision rule is then written as:
If P < α, reject H0(j); otherwise do not reject H0(j).
A benefit of using the P-value approach is that, once the value of P is available, the
decision rule always has this standard form: reject H0(j) if P < α.6
Since we are performing a test on each slope in turn, standard computer packages
typically summarize the set of K tests in a single table. Figure 8.3 provides the output for
the gas prices example.
Predictor Coef SE Coef T P
Constant 84.64 16.82 5.03 0.000
L1_crude 0.026810 0.001745 15.37 0.000
PDI 0.005990 0.002459 2.44 0.017
Unemploy -6.937 3.254 -2.13 0.037
S&P 500 -0.01838 0.01074 -1.71 0.092
Figure 8.3: Single variable tests for the gas prices model
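Each T value in Figure 8.3 is simply the coefficient divided by its standard error, per equation (8.9); a quick sketch reproducing them (they agree with the printed values up to rounding):

```python
# t ratios for the gas price model: t = b_j / SE(b_j), values from Figure 8.3
coefs = {
    "L1_crude": (0.026810, 0.001745),
    "PDI":      (0.005990, 0.002459),
    "Unemploy": (-6.937,   3.254),
    "S&P 500":  (-0.01838, 0.01074),
}
t_ratios = {name: b / se for name, (b, se) in coefs.items()}
for name, t in t_ratios.items():
    print(name, round(t, 2))
```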
As in chapter 7, we ignore the test for the intercept or constant. Of the four input
variables, three have P < 0.05 but the S&P Index does not. We might drop the S&P
index from the model and repeat the analysis; the results are shown in Figure 8.4A.
R2(adj) has dropped slightly, but now we find that Unemployment fails the single
variable test. Such events are by no means uncommon and reflect the correlations among
the input variables: panel (B) of Figure 8.4 gives these correlations and their P-values
(for direct tests on the correlations). The S&P Index and Unemployment both seem
capable of conveying some information, but the high correlation between them makes the
task of estimating their respective impacts statistically difficult.
6 The authors’ tombstones will probably bear the inscription “RIP < α”, or “Reject if the P-value is less than
the significance level”.
If we continue our pursuit of statistically significant results, we are led to panel (C) of
Figure 8.4, where we retain only two variables: L1_crude and Disposable Income.
Finally we have a model where all the variables have coefficients that differ significantly
from zero and whose signs point in the appropriate direction.
Which model should we use? There are two possible answers to the question at this
stage:
1. It is too early to decide, because we have not checked the validity of our
underlying assumptions.
2. None of them, because we have not explored possible improvements such as the
addition of other variables.
If we have to choose one of the three models, purely statistical criteria are of limited use.
We might stick doggedly to the argument “significant variables only” and use the two
variable version (the “statistician’s view”). Alternatively, we might respond that
theoretical considerations led us to the four variable model and we will use that even if
some of the terms are not statistically significant (the “economist’s view”). For purposes
of exposition, we will stand by the statistical argument and use the two variable model in
the next few sections.
(A) Regression on L1_crude, Disposable Income and Unemployment
Predictor Coef SE Coef T P
Constant 67.78 13.82 4.90 0.000
L1_crude 0.027928 0.001640 17.03 0.000
PDI 0.002343 0.001244 1.88 0.064
Unemploy -2.090 1.624 -1.29 0.202
S = 7.01963 R-Sq = 86.0% R-Sq(adj) = 85.4%
Analysis of Variance
Source DF SS MS F P
Regression 3 21127.6 7042.5 142.92 0.000
Residual Error 70 3449.3 49.3
Total 73 24576.9
(B) Correlations: L1_crude, Disposable Income, Unemploy, S&P
L1_crude PDI Unemploy
PDI 0.455
0.000
Unemploy -0.231 -0.447
0.048 0.000
S&P 500 0.279 0.806 -0.811
0.016 0.000 0.000
(C) Regression on L1_crude and Disposable Income
Predictor Coef SE Coef T P
Constant 53.305 8.072 6.60 0.000
L1_crude 0.028026 0.001645 17.03 0.000
PDI 0.002934 0.001161 2.53 0.014
S = 7.05198 R-Sq = 85.6% R-Sq(adj) = 85.2%
Analysis of Variance
Source DF SS MS F P
Regression 2 21046 10523 211.60 0.000
Residual Error 71 3531 50
Total 73 24577
Figure 8.4: Results for reduced models for gas prices
8.4.1 Testing a group of coefficients
We have just illustrated the t-test and shown how it can be used to simplify a model.
Sometimes we want to go a step further and consider whether it is necessary to include a
group of variables in the model. For example, in predicting an individual applicant's
credit risk, a credit card company's database may include a number of variables capturing
the applicant's credit history. The question is: 'Do these variables add anything to the
predictive power of the model?'
We can solve this problem by comparing two models: M1, which contains all the
variables, and M0, which contains only a subset. In the notation below, M1 contains the
(q+1) parameters {β0, β1, …, βp, βp+1, …, βq}, while the simpler model M0 contains the
(p+1) parameters {β0, β1, …, βp}.
More formally, there are (q − p) restrictions placed on model M1 to obtain the simpler
model M0:
H0: βp+1 = βp+2 = … = βq = 0 vs.
H1: some of βp+1, …, βq are non-zero.
To compare the two models we examine their explanatory power through their residuals.
We therefore estimate the sum of squared errors from the extended model M1 and also
from the simpler model M0. Define:

Model M1: Y = f(X; β0, β1, …, βp, βp+1, …, βq), with residual sum of squares SSR1

Model M0: Y = f(X; β0, β1, …, βp, 0, 0, …, 0), with residual sum of squares SSR0

where SSRi = Σt et2 for the particular model Mi, i = 0 or 1. Calculate:

F = [(SSR0 − SSR1)/(q − p)] / [SSR1/(n − q − 1)]
Discussion question: Why is SSR0 always greater than SSR1?
This statistic has an F distribution with (q − p, n − q − 1) degrees of freedom and the P-value can
be found using the EXCEL function FDIST.
This same approach can of course be used to test just a single coefficient (q = p + 1); this
is equivalent to a t-test, and observed F values of around 4 are then close to the 5 percent
significance level. We will leave an illustration of how this can be used to the next chapter,
where we show how to identify seasonal patterns.
The test is particularly useful as it can also be used for testing non-linearities or any set of
parameter restrictions, so long as the simpler model M0 is a restricted version of the full
model M1.
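As a numerical illustration (not worked in the text itself), this partial F-test can ask whether Unemployment and the S&P Index jointly add anything once L1_crude and Disposable Income are in the gas price model, using the error sums of squares from Example 8.2 and Figure 8.4(C):

```python
# Partial F-test for the gas price model.
# SSR1 = residual SS of the full four-variable model M1 (Example 8.2);
# SSR0 = residual SS of the two-variable model M0 (Figure 8.4C).
SSR1, SSR0 = 3308.9, 3531.0
q, p, n = 4, 2, 74

F = ((SSR0 - SSR1) / (q - p)) / (SSR1 / (n - q - 1))
print(round(F, 2))   # about 2.32, below the 5 percent point of F(2, 69)
```

Since the observed F falls short of the 5 percent critical value (roughly 3.1 for 2 and 69 degrees of freedom), the two extra variables do not contribute significantly as a group, consistent with the single-variable tests of section 8.4.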
8.5 Checking the assumptions
Both in this chapter and the previous one we laid out a set of assumptions. However, to
date, we have not attempted to validate those assumptions; rather, we have proceeded as
though our model was fully and completely specified and that all the assumptions were
valid. In short, we have been living in a forecasting fool’s paradise. In this section we
take the model selected in section 8.4 and try to determine how well it matches up to the
assumptions stated in section 8.2.
We now examine these assumptions and devise ways to check the validity of each. Since
we have available only a sample we can never guarantee a particular assumption, but we
can check whether it seems plausible. We tend to be very pragmatic: if the data suggest
that a particular assumption is OK, we stay with that assumption. Given a reasonably
sized sample, such evidence suggests that any violation of the assumption is likely to be
modest, as will be the likely impact of that violation. However, we should always keep
in mind that this argument applies only if we are confident that the system will
continue to operate under the same regime as in the past; if major structural changes take
place, all bets are off unless we can incorporate such changes into the model. If a
particular assumption breaks down, the nature of the breakdown will often indicate how
the model might be improved. Our diagnostics are developed using the residuals, as
defined in equation (8.4).
Assumption R1: The expected value of Y is linear in the values of the selected
explanatory variables.
Potential violations: We may have missed an important variable, or the relationship may
not be linear in the X’s.
Diagnostics:
1. Plot the residuals against the fitted values. Non-linear relationships will show up
as curvature in the plot.
2. Plot the residuals against potentially important Xs not currently in the model. If a
particular new X has an impact on Y, this should show up on the scatter plot as a
non-zero slope.
Assumption R2: The difference between an observed Y and its expectation is due to
random error.
Discussion: The assumption states that the error is an “add-on” and serves to justify the
least squares formulation for estimating the parameters. The error can always be
expressed in this way, but its properties will depend critically upon the next four
assumptions. Therefore, we do not check this assumption directly, but examine aspects
of it as described below.
Assumption R3: The errors have zero means.
Discussion: Typically, this assumption is not testable, at least when we are looking at a
single series. The inclusion of a constant term in the model ensures the mean of the in-
sample errors is zero. Once out-of-sample, however, it is a different matter and the model
errors may show bias. This can occur because many macroeconomic series are released in
preliminary form, and then updated. The model may have been constructed on one set of
final figures and then used in forecasting based on the preliminary data. Cross checks
between the preliminary and final versions of such variables may reveal biases in the
initial figures. Likewise, the construction of business databases should be regularly
examined to confirm that variables are correctly measured (e.g. the number of returned
items should be deducted from the appropriate month’s sales). Consistent biases in the
inputs are less critical, since they lead to modified coefficients but need not have an
adverse effect on the forecasts.
Assumption R4: The errors for different observations are uncorrelated with other
variables and with one another. Thus, the errors should be uncorrelated with the
explanatory variable or with other variables not included in the model. When examining
observations over time, this assumption implies no correlation between the error at time t
and past errors; otherwise, the errors are autocorrelated.
Possible violations: This assumption lies at the heart of model building and boils down to
the claim that the model contains all the predictable components leaving only noise in the
error term. The residuals therefore should not be related to factors not included in the
model such as the input variables themselves or where the data is a time series, past
values of either the inputs, the dependent variable or past errors. This can occur if there is
a carryover effect from one period to the next, which could be due to such factors as the
weather, brand loyalty or economic trends. Thus, a positive residual in one time period is
likely to be followed by a positive residual in the next period. High-low sequences are
also possible, such as a drop in sales after high volumes due to a special promotion.
Diagnostics:
1. Plot the residuals against the predicted value of Y and also the input variables
included in the model (as well as any others that have been excluded).
2. Plot the residuals against the time order of the observations. If positive
autocorrelation exists, we will see sequences of values above zero, and then below
zero, rather than a random scatter. If the autocorrelation is negative a saw-tooth
pattern will prevail.
3. Plot the sample autocorrelation function (acf) for the residuals (see section 6.3)
and look for departures from a random series by performing tests for the presence
26
of autocorrelation. (A test sometimes recommended for this purpose is based on the
Durbin-Watson statistic, but it has limited validity; examination of the acf is an
easy-to-use, effective substitute. In Chapter 10 we introduce more efficient tests.)
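As a sketch of diagnostic 3, the sample acf of the residuals and its approximate 5% limits can be computed directly. The residual series below is simulated for illustration only:

```python
import numpy as np

def sample_acf(e, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of a residual series."""
    e = np.asarray(e, dtype=float) - np.mean(e)
    denom = float(e @ e)
    return [float(e[k:] @ e[:-k]) / denom for k in range(1, max_lag + 1)]

# simulate an autocorrelated residual series (AR(1) with coefficient 0.6)
rng = np.random.default_rng(1)
e = np.zeros(75)
for t in range(1, len(e)):
    e[t] = 0.6 * e[t - 1] + rng.normal()

acf = sample_acf(e, max_lag=12)
limit = 2.0 / np.sqrt(len(e))               # approximate 5% significance limits
flagged = [k + 1 for k, r in enumerate(acf) if abs(r) > limit]
```

Any lag appearing in `flagged` corresponds to a spike outside the significance limits on the acf plot.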
Assumption R5: The error terms come from distributions with equal variances.
Possible violations: The most common pattern is that the variability increases as the
mean level of Y increases. We naturally talk about percentage movements up or down in
GDP, in sales and in many other series. The implication behind such terminology is that
the variations are proportional to the level of the mean, rather than displaying constant
variance.
Diagnostics:
1. Plot the residuals against the fitted values. If the errors are heteroscedastic, the
scatter will often be greater for the larger fitted values.
2. Various test statistics are available (see Anderson, Sweeney and Williams, 2005,
Chapter 11), but we do not pursue that topic further. Procedures for dealing with
changing variances are discussed in section 9.6.
Assumption R6: The errors are drawn from a normal distribution.
Possible violations: There may be one or more outliers that serve to make the distribution
non-normal, or the whole pattern of the residuals may suggest a non-normal distribution.
Diagnostics:
1. Plot the histogram of the residuals, and look for an approximate bell-shape.
2. Use the normal probability plot (see Appendix A2). If the plot deviates
significantly from a straight line, this indicates non-normality.
3. Examine the plots of residuals against both time order and fitted values for
extreme observations.
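These normality checks are easy to automate. A sketch using scipy on simulated residuals; the Jarque-Bera test is one common numerical supplement to the plots, not the chapter's prescribed method:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
resid = rng.normal(scale=7.0, size=75)      # stand-in for regression residuals

# Jarque-Bera combines skewness and excess kurtosis into one statistic;
# a small p-value signals departure from normality
jb = stats.jarque_bera(resid)
jb_stat, jb_p = jb.statistic, jb.pvalue

# correlation from the normal probability plot: near 1 when residuals are normal
(osm, osr), (slope, intercept, r) = stats.probplot(resid, dist="norm")
```

A value of `r` well below 1, or a small `jb_p`, points to the same departures that show up as curvature in the probability plot.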
It is evident from this summary that some plots, notably that of the residuals against fitted
values, serve multiple purposes. It is important to keep these several objectives in mind
when examining the plots.
8.5.1 Analysis of residuals for gas price data
Most forecasting packages will generate the plots we have just discussed, some more
easily than others. In particular, Minitab produces a ‘Four in One’ plot as part of its
regression component, which is particularly useful for the analyses we have been
discussing.
The residuals plots for the two variable model we identified in Figure 8.4 (C) are shown
in Figure 8.5. We examine these plots in the order they appear in the output:
a. The probability plot (top left) appears at first sight to be close to a straight line.
However, closer examination reveals a slight curvature: the points are below the line
at the ends of the plot, and above it in the middle. Further, the largest observation
appears to be an outlier, an identification confirmed by the other plots.
b. The histogram (bottom left) tells much the same story as the probability plot. There is
some evidence of a departure from the normal curve, with a long tail at the upper end.
[Residual Plots for Unleaded: Normal Probability Plot; Versus Fits; Histogram; Versus Order.]
Figure 8.5a: Residuals plots for the 2-variable gas prices model.
Figure 8.5b: Residual plots versus the input variables in the data set
c. As we noted earlier, the plot of residuals against fitted values (top right) may tell
several stories. The residuals for fitted values below 140 show something of a
downward drift, followed by a sudden jump for larger fitted values. Non-linearity is one
possibility; an omitted variable is another. Also, the scatter of the
points is greater for larger fitted values, indicating some heteroscedasticity. Finally,
we again note the large positive residual, which may be an outlier.
d. The plot of residuals against order (bottom right) shows runs of positive values
followed by runs of negative values, indicative of autocorrelation. Also, we observe
that the string of large values is clustered together at the end of the series, suggesting
a possible change in conditions that should be examined more closely.
The issue of residual autocorrelation is particularly important since it indicates
persistence in the time series that has not been fully captured by the current model. To
investigate this phenomenon, we look at the ACF of the residuals, shown in Figure 8.6.
[Autocorrelation Function for RESIDUALS (with 5% significance limits for the autocorrelations); lags 1-18 shown.]
Figure 8.6: ACF for the residuals of the two variable gas prices model.
The ACF indicates a degree of persistence with a significant positive autocorrelation at
lag 1. The spikes at lag 6 and lag 12 suggest possible seasonality which also merits
further examination.
Collectively, these plots provide plenty of food for thought and indicate that we have
some work ahead of us before we can be satisfied with the model. We will return to the
model-building endeavor in Chapter 9; for now, we explore the use of such models in
forecasting.
8.6 Forecasting using multiple regression
The general procedure for forecasting using several explanatory variables is essentially
the same as the single variable case described in section 7.7. The technical details
become more involved; the interested reader is referred to Kutner et al. (2005, pp. 229 –
232). The first question we must answer relates to the nature of the explanatory variables.
Recall any particular X may arise in one of three ways:
a. X is known ahead of time
b. X is unknown but can itself be forecast
c. X is unknown but we wish to make “what-if” forecasts.
For example, consider a model for sales. Variables that designate particular seasons are
clearly known in advance, as may be substantive variables that have been sufficiently
lagged in time. Policy variables such as price and advertising expenditure may be explored
using the model to make “what-if” forecasts so that the sensitivity of expected sales to
policy changes can be explored. Finally, some variables such as the price charged by
competitors or the level of GDP will require forecasts themselves. Such forecasts are
often generated by industry analysts, government sources or macroeconomic panels (see
for example, www.consensuseconomics.com). Alternatively, time series methods such as
exponential smoothing methods discussed in Chapters 3 and 4 could be used.
8.6.1 The point forecast
We suppose that values for the next time period are available for each of the K variables
and denote these values by X01, X02, ..., X0K. Given the estimated regression equation, the
point forecast is:

F0 = b0 + b1*X01 + b2*X02 + ... + bK*X0K    (8.10)
As before, we need to distinguish between the fitted values Ŷ and the forecast F0. The
two formulae are the same but the fitted values correspond to those observations that
were used in the estimation process whereas the forecasts refer to new observations.
These new values may be part of a hold-out sample or values as yet unobserved, but they
should not be used to estimate the model parameters.
Example 8.4: One-step-ahead forecasts for gas prices
We use the two variable model for gas prices as an illustration. One-step-ahead forecasts
were generated using (8.10) so that, for example, the forecast for May 2002 uses the
crude price for April 2002 and the May PDI; the calculation is as follows. The regression
model (from Figure 8.4) is:
Unleaded = 53.305 + 0.028026*L1_crude + 0.002934*PDI

Assuming X01 (L1_crude) = 2252 and X02 (PDI) = 8910.6, then:

F0 = 53.305 + 0.028026*2252 + 0.002934*8910.6 = 142.6
We illustrate how to carry out this calculation using SPSS (Minitab). The data matrix is
expanded to include the assumed values of the input variables (but of course there is no
corresponding dependent variable observed). As shown in Figure ?, we then save the
predictions and the prediction intervals (see next section) to the data matrix, running the
regression model on the expanded data set.
Figure ? Screenshot showing how prediction intervals are calculated automatically using
SPSS.
Note that this forecast is not a pure ex-ante forecast: we would not know the May PDI at
the time the forecast was made. Exercise 9.3 involves the production of a genuine ex-ante
forecast using a lagged value of PDI.
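The arithmetic of equation (8.10) is just a sum of products, and is easily checked by hand or in a few lines of code. A minimal sketch reproducing the calculation above (the variable names are ours):

```python
# coefficients from the fitted model quoted above (Figure 8.4)
b0 = 53.305
b = [0.028026, 0.002934]                    # L1_crude and PDI coefficients
x0 = [2252.0, 8910.6]                       # assumed predictor values, May 2002

# equation (8.10): F0 = b0 + b1*X01 + b2*X02
forecast = b0 + sum(bj * xj for bj, xj in zip(b, x0))
print(round(forecast, 1))                   # prints 142.6
```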
The complete set of one-step-ahead point forecasts and summary measures is given in
Table 8.3. The various error measures are computed in accordance with the formulae in
section 2.7. From the table, we can see that an upturn in prices predicted for the fall of
2002 did not materialize, whereas the actual upturns in the first part of 2003 and again in
the later part were somewhat underestimated. The forecast root mean square error
calculated from the values in this table is 5.70, a modest improvement over the value for
the single predictor model given in section 7.7.
Forecast Month     Unleaded   Unleaded   Forecast   Absolute   Squared   Absolute %
                   Actual     Forecast   Error      Error      Error     Error
2002 April          140.7      136.2       4.5       4.5        20.2      3.2
2002 May            142.1      142.6      -0.5       0.5         0.2      0.3
2002 June           140.4      145.4      -5.0       5.0        25.4      3.6
2002 July           141.2      142.8      -1.6       1.6         2.7      1.2
2002 August         142.3      145.4      -3.1       3.1         9.8      2.2
2002 September      142.2      149.0      -6.8       6.8        46.0      4.8
2002 October        144.9      152.7      -7.8       7.8        61.0      5.4
2002 November       144.8      150.5      -5.7       5.7        32.9      4.0
2002 December       139.4      145.2      -5.8       5.8        33.9      4.2
2003 January        147.3      150.6      -3.3       3.3        11.1      2.3
2003 February       164.1      159.3       4.8       4.8        23.0      2.9
2003 March          174.8      169.2       5.6       5.6        31.0      3.2
2003 April          165.9      164.3       1.6       1.6         2.4      0.9
2003 May            154.2      151.5       2.7       2.7         7.2      1.7
2003 June           151.4      150.2       1.2       1.2         1.4      0.8
2003 July           152.4      155.6      -3.2       3.2        10.1      2.1
2003 August         162.8      157.6       5.2       5.2        26.9      3.2
2003 September      172.8      158.9      13.9      13.9       194.3      8.1
2003 October        160.3      151.4       8.9       8.9        79.2      5.6
2003 November       153.5      155.2      -1.7       1.7         2.8      1.1
2003 December       149.4      157.2      -7.8       7.8        61.0      5.2

MFE = -0.19    MAE = 4.80    MSE = 32.50    MAPE = 3.13
RMSE = 5.70    MdAPE = 3.20
Table 8.3: One-step-ahead forecasts for gasoline prices, with summary measures: April
2002-December 2003
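The summary measures in Table 8.3 follow the formulae of section 2.7 and can be reproduced directly from the actual and forecast columns (the values below are as rounded in the table, so the results agree to about two decimal places):

```python
# actual and one-step-ahead forecast values from Table 8.3 (April 2002 onward)
actual = [140.7, 142.1, 140.4, 141.2, 142.3, 142.2, 144.9, 144.8, 139.4,
          147.3, 164.1, 174.8, 165.9, 154.2, 151.4, 152.4, 162.8, 172.8,
          160.3, 153.5, 149.4]
forecast = [136.2, 142.6, 145.4, 142.8, 145.4, 149.0, 152.7, 150.5, 145.2,
            150.6, 159.3, 169.2, 164.3, 151.5, 150.2, 155.6, 157.6, 158.9,
            151.4, 155.2, 157.2]

errors = [a - f for a, f in zip(actual, forecast)]
n = len(errors)
mfe = sum(errors) / n                              # mean forecast error (bias)
mae = sum(abs(e) for e in errors) / n              # mean absolute error
mse = sum(e * e for e in errors) / n               # mean squared error
rmse = mse ** 0.5                                  # root mean squared error
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / n
```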
8.6.2 Prediction intervals
We now require prediction intervals to provide an indication of the accuracy of the
forecasts. We omit the technical details and simply note that, relative to the unknown
future value 0Y the point forecast has an estimated standard error that we write as:
0 0 0
( ) v a r ( )S E F Y F (8.11)
Given assumptions R3-R6, the forecast error follows a normal distribution and, after
allowing for the estimation of σ by S, we may specify the prediction interval using the
Student’s t distribution with the appropriate DF:
Prediction interval for the future observation Y0:
0 / 2 0
( 1) * ( )F t n K S E F
(8.12)
The 100(1-α)% prediction interval is a probability statement; it says that the probability
that the future observation will lie in the interval defined by equation (8.12) is (1-α).
Example 8.5: Construction of a prediction interval
We continue our consideration of the forecasts for May 2002, begun in Example 8.4.
Since K = 2 and n = 74 we have DF = 71. The SE is found to be 7.21. Using
t0.025(71)=1.99 the 95% prediction interval is:
142.6 ± (1.99)*(7.21) = 142.6 ± 14.35, giving the interval [128.2, 156.9]
The lower and upper prediction intervals for April 2002 through December 2003 are
given in Table 8.4. The table contains 21 one-step-ahead prediction intervals; all 21
intervals include the actual value.
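The interval calculation of equation (8.12) can be sketched as follows, using the values from this example; only the critical value is computed, while the standard error is taken from the regression output:

```python
from scipy.stats import t

forecast = 142.6
se = 7.21                                   # SE(F0) from the regression output
n, K = 74, 2
df = n - K - 1                              # 71 degrees of freedom
t_crit = t.ppf(0.975, df)                   # two-sided 95% critical value
lower = forecast - t_crit * se
upper = forecast + t_crit * se
```

Widening or narrowing the interval is then just a matter of changing the 0.975 quantile (e.g. 0.95 for a 90 percent interval).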
As for simple regression, when n is large an approximate 95 percent prediction interval is
given by F0 ± 2S. From Figure 8.4, S = 7.05, so that the under-estimation would be
modest in this case.
Forecast Month     Unleaded-Actual   Unleaded-Forecast   Lower PI   Upper PI
2002 April              140.7             136.2            121.8      150.6
2002 May                142.1             142.6            128.2      156.9
2002 June               140.4             145.4            131.0      159.8
2002 July               141.2             142.8            128.4      157.2
2002 August             142.3             145.4            131.0      159.8
2002 September          142.2             149.0            134.6      163.4
2002 October            144.9             152.7            138.3      167.2
2002 November           144.8             150.5            136.1      165.0
2002 December           139.4             145.2            130.8      159.6
2003 January            147.3             150.6            136.2      165.1
2003 February           164.1             159.3            144.8      173.8
2003 March              174.8             169.2            154.5      184.0
2003 April              165.9             164.3            149.7      179.0
2003 May                154.2             151.5            137.0      166.0
2003 June               151.4             150.2            135.7      164.7
2003 July               152.4             155.6            141.0      170.1
2003 August             162.8             157.6            143.0      172.2
2003 September          172.8             158.9            144.3      173.5
2003 October            160.3             151.4            136.8      166.0
2003 November           153.5             155.2            140.6      169.8
2003 December           149.4             157.2            142.6      171.8

Table 8.4: Prediction limits for gas prices data, one-step-ahead: April 2002-December
2003.

8.6.3 Forecasting more than one period ahead

When we wish to forecast more than one period ahead, we must provide values for all the
predictor variables over the forecasting horizon. As we discussed in section 7.7.4, there
are two possible approaches:
1. Generate forecasts for all the Xs and apply these to the original model.
2. Reformulate the model so that all unknown Xs are lagged by two (or more)
periods, as appropriate.
The first approach is more commonly applied, but it does suffer from the drawback that
the uncertainty in X is not reflected in the prediction intervals for the forecasts. It has the
advantage that different “what if” paths for X can be formulated and compared using a
single model. The second approach is somewhat more tedious, but it will be more
valuable when good forecasts for X are unavailable and it will provide more accurate
prediction intervals.
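The second approach amounts to rebuilding the data matrix with every unknown predictor lagged h periods. A sketch using pandas with illustrative numbers (not the chapter's data set):

```python
import pandas as pd

# toy monthly series; the numbers are illustrative, not the chapter's data
data = pd.DataFrame({
    "unleaded": [140.0, 141.5, 143.2, 142.8, 144.1, 146.0],
    "crude":    [2200, 2230, 2252, 2260, 2275, 2290],
})
h = 3                                       # forecast horizon in periods

# lag every unknown predictor by h so that a model fitted on model_frame
# uses only values known at the time the forecast is made
data["crude_lag_h"] = data["crude"].shift(h)
model_frame = data.dropna()                 # rows where the lagged value exists
```

Note that lagging by h sacrifices the first h observations, which is part of the cost of this approach.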
As before, neither approach is always best. Where forecasts of the Xs are unreliable, it
will usually prove better (and easier) to adopt approach 2. Exploration of the gas prices
model for multiple steps ahead is left as Exercise 8.4.
8.7 Principles
As in earlier chapters, our list of principles relies heavily upon the material in Armstrong
(2001); numbers in parentheses refer to principles listed there. In particular, in this
chapter, we draw upon the research in that volume reported by Allen and Fildes (2001).
8.1 Aim for a relatively simple model specification (Allen and Fildes, 2001)
The researcher must strike a balance between failing to include key variables and
cluttering the model with variables that have very little effect upon the outcome. For
example, the number of consumers in a market area clearly has an impact on the level of
sales. However, for a particular market area, that figure is not going to change much over
the course of a few months. Accordingly, we would not bother to include population in a
short-term model for sales forecasting.
8.2 (9.1) Tailor the forecasting model to the horizon (Armstrong, 2001)
As noted in Principle 8.1, we need to identify those variables that are important for the
forecasting horizon under consideration.
8.3 Identify important causal variables based upon underlying theory and earlier
empirical studies. Identify suitable proxy variables when the variables of interest are not
available in timely fashion (adapted from Allen and Fildes, 2001)
Expertise and background knowledge should be used to formulate a model whenever
possible. Statements such as “The stock market goes up when the AFC team wins the
Super Bowl” may be factually correct over a period of years, but they are not a reliable
guide to investment!
8.4 If the aim of the analysis is to provide pure forecasts, you must either know the
explanatory variables in advance or be able to forecast them sufficiently well to justify
their inclusion in the model (adapted from Allen and Fildes, 2001)
This principle is a more formal statement of the necessary response to the questions
“What do you know?” and “When will you know it?”
8.5 Use the Method of Ordinary Least Squares to estimate the parameters (Allen and
Fildes, 2001).
The Method of Ordinary Least Squares is strictly valid only when Assumptions R3-R5
apply but it is often a good place to start and the current form can be extended to deal
with more complex models.
8.6 (9.5) Update the estimates frequently (Armstrong, 2001)
Frequent updating involves little effort beyond recording the latest data. The new
parameter estimates will better reflect the relationships among the variables and also help
to alert the modeler to any structural changes that occur.
Summary
In this chapter we have extended time series regression models to multiple explanatory
variables, thereby greatly increasing the range and value of models that we may use for
forecasting purposes. We have also provided the basic inferential framework in terms of
parameter estimation and model testing, as well as identifying the key assumptions
underlying the models. This structure will enable us to check assumptions and refine the
models in the next chapter.
References
Allen, P.G. and Fildes, R. (2001). Econometric forecasting. In J.S. Armstrong (ed.)
Principles of Forecasting, Kluwer, Boston and Dordrecht. Pp. 300 – 362.
Anderson, D.R., Sweeney, D.J. and Williams, T.A. (2005). Statistics for Business and
Economics. South-Western: Mason, Ohio. Ninth edition.
Armstrong, J.S. (ed., 2001). Principles of Forecasting: A Handbook for Researchers and
Practitioners. Kluwer: Boston and Dordrecht.
Hull, J.C. (2009). Options, Futures, and Other Derivatives. Pearson Prentice Hall: Upper
Saddle River, NJ. Seventh edition.
Kutner, M.H., Nachtsheim, C.J., Neter, J. and Li, W. (2005). Applied Linear Statistical
Models. McGraw-Hill: Boston, MA. Fifth edition.
Exercises
8.1 The table below contains additional data on price, beyond the data quoted in
Exercise 7.1.
a. Conduct a regression analysis for Sales on Spot and Price
b. Carry out tests on the overall model and on the individual coefficients.
Summarize your conclusions.
c. Compare the performance of the two models. Which model would you
recommend?
WEEK Spots Price Sales
1 8 11 25
2 12 11 34
3 16 12 39
4 10 10 32
5 8 12 22
6 12 12 30
7 16 10 43
8 10 10 31
8.2 The table below contains additional data on advertising expenditures, beyond the
data quoted in Exercise 7.3.
a. Conduct a regression analysis for Sales on Advertising and Price
b. Carry out tests on the overall model and on the individual coefficients.
Summarize your conclusions.
c. Compare the performance of the two models. Which model would you
recommend?
Week Price Advertising Sales
1 6 10 28
2 8 20 30
3 10 30 28
4 6 10 30
5 8 10 24
6 10 10 22
7 6 20 34
8 8 10 26
9 10 10 20
10 6 30 36
11 8 30 32
12 10 20 26
8.3 Using the two variable gas prices model generate forecasts three periods ahead for
the period April 2002 to December 2003 by first generating forecasts for both the lagged
crude oil price and disposable income. Compare the estimates with those for the model
developed in the chapter. Compute the forecast accuracy measures and generate the 90
percent prediction intervals. How many of the actual values fall inside these intervals?
8.4 Develop the two variable gas prices model in such a way as to allow for direct
prediction of the prices three periods ahead. Compute the forecast accuracy measures and
generate the 90 percent prediction intervals. How many of the actual values fall inside
these intervals? Compare the results with those obtained for Exercise 8.3.
Mini-Cases
The purpose of these mini-cases is to provide opportunities for data analysis to tackle
important real-world problems. The format is essentially the same in each: a dependent
variable of interest is identified along with a plausible set of explanatory variables. The
aim is to develop a valid forecasting model for at least one period ahead; forecasts
multiple periods ahead should also be considered.
The full set of modeling steps should be examined:
Create plots of the data to look for relationships and possible unusual observations
Perform basic data analysis
Develop a multiple regression model, preferably using a hold-out sample for the
evaluation of forecasting performance
Keep in mind that there are no “right” answers, but some solutions will be more effective
than others. After you complete your statistical analysis, do not fail to ask the following
questions:
Would the data be available to enable me to make timely forecasts?
Would you feel able to justify your model to a senior manager?
If the answer to either question is NO, you have more work to do!
Mini-case 8.1: The Volatility of Google Stock
[Contributors: Christine Choi, Alex Dixon, Melissa Gong, Michael Neches and Greg
Thompson] [Google_Data.xlsx]
Volatility is a measure of uncertainty of the return realized on an asset (Hull, 2009).
Applied to financial markets, volatility of a stock price is a measure of how uncertain we
are of future stock price movements. As volatility increases, the possibility that the stock
price will appreciate or depreciate significantly also increases. This measure has
widespread implications, particularly for stock option valuation and also for volatility
indices (VIX), portfolio management, and hedging strategies. Since its initial public
offering, Google Inc. (GOOG: NASDAQ) stock has become one of the most sought after
and popular investment opportunities. The search engine giant’s stock price has
fluctuated from an IPO price of $85/share, to a high of $741/share (adjusted close
11/27/2007) down to a recent closing price of $345/share (adjusted close 2/25/2009).
This fluctuation reveals the uncertainty associated with any stock, especially for high-
tech companies with web-based models where the monetization of services can confuse
even the most sophisticated investor.
The aim of the project is to develop a multiple regression model to forecast the volatility
of Google’s stock price over the next three months. After an initial review of Google-
specific and macroeconomic data, we identified the following potential explanatory
variables:
STDEV: the volatility measure for Google stock
VOLUME: amount of trading in Google shares
P/E: the price to earnings ratio of Google stock
GDP: quarterly growth in GDP at an annualized rate (quarterly, repeated for each month)
VIX: the market volatility index
CONF: the Conference Board index on Consumer Confidence
JOBLESS: the number of claims posted for benefits
HOUSING: the number of new housing starts
Monthly data are available for the period May 2006 through January 2009 on the
variables listed above. The data were downloaded from Bloomberg.
Mini-case 8.2: Economic Factors in Homicide Rates
[Contributors : Daniel Adcock, Sybil Desangles, Gerald McSwiggan, John Siminerio,
Marc Steining and Katherine Wood] [Homicides.xlsx]
This project is an analysis of annual homicides in the United States, in which historical
data is used to predict future homicide rates. The purpose is to develop a predictive model
that estimates future homicide rates based on a number of potential explanatory and
predictor variables. This project has significant ramifications; if an effective model is
devised to forecast future homicides, the variables underlying the model could be used as
a focal point for police and government officials. It will help law enforcement by
heightening awareness, which will enable more effective homicide prevention.
Annual data are available for the period 1972-2001. The variables measured are:
HOMICIDE: the rate of homicides per 100,000 of the population
GDP: Real GDP per capita,
UNEMPLOY: the national unemployment rate (percent)
DROPOUT: the high-school drop-out rate (percent)
RECESSION: the presence (=1) or absence (=0) of a recession
CABLE: cable subscription rates
Mini-case 8.3: Forecasting Natural Gas Consumption for the DC Metropolitan Area
[Contributors: Sameer Aggarwal, Parshant Dhand, Yulia Egorov and Natasha
Heidenrich] [Natural Gas.xlsx]
The intent of the project was to develop a model to forecast natural gas consumption for
the residential sector in the Washington DC metropolitan area. The level of natural gas
consumption is influenced by a variety of factors, including local weather, the state of the
national and the local economies, dollar purchasing power (since at least some of the
natural gas is imported), and the prices for other commodities. The following variables
have been identified and recorded on a quarterly basis. The data cover the period 1997
Q1 through 2008 Q3; all data are measured quarterly.
GASCONS: Consumption of natural gas in DC metro area (million cubic feet)
AVETEMP: Average temperature for the period in the DC metro area
GDP: annualized percentage change in GDP
UNEMP: percentage unemployment in the DC metro area
GAS_PRICE: price of natural gas ($/100 cubic feet)
DOLLAR: value index for the US Dollar relative to a basket of international currencies
OIL_PRICE: price of crude oil ($/barrel)
RESERVES: reserves of natural gas (million cubic feet)
FUTURES: price of futures contract on natural gas ($/100 cubic feet)
Mini-case 8.4: Economic Factors in Property Crime
[Contributors: Jay Cafarella, Jordan Krawll, Caroline Levington, Allen Lin and
Steven Schuler] [Property crime.xlsx]
The purpose of this project is to determine the relationship between crime and the
country’s economic health. The drop in wealth that accompanies a recession may result
in an increase in the crime rate. If people don’t have jobs, have less money and are no
longer able to pay for their daily expenses, it is reasonable to believe that some will resort
to crime. Various economic indicators were identified to address this question. The
following variables are measured annually for the period 1960 – 2007:
CRIME_pc: number of property crimes reported, per capita
POP_GROWTH: the percentage change in population over the previous year
GDP_GROWTH: change in Real GDP over the previous year
UNEMP: percentage unemployment in the USA
CREDIT: the percentage growth in credit card indebtedness
S&P RETURN: the annual return on the S&P 500 Index
CPI_GROWTH: percentage change in the Consumer Price Index
INCOME_GROWTH: percentage growth in average household income
RECESSION: the presence (=1) or absence (=0) of a recession
Other variables in the spreadsheet were used in the calculation of these rates.
Mini-case 8.5: U.S. Retail & Food Service Sales
[Contributors: Doug Goff, Rich Marsden, Jeff Rodgers and Masaki Takeda] [Retail
Sales.xlsx]
The purpose of this project is to forecast how US Retail & Food Sales will fare over the
coming months. The variables considered include personal income and savings and
consumer sentiment, as well as various macroeconomic variables. Since manufacturing
costs and levels of activity are clearly important, these factors are also included. Also
considered in the analysis were three seasonal factors associated with the Easter,
Thanksgiving and Christmas holidays.
The data set includes monthly figures for the period January 2000 – December 2008.
RSALES: US Retail & Food Service Sales [$ Millions]
CONSENT: Index of US Consumer Sentiment
PRICE_OIL: Spot Price of Oil ($/barrel)
IND_PROD: Index of US Industrial Production
PERSINC: US Personal Income ($ per capita)
PERSSAV: Net US Personal Savings ($ per capita)
POPULATION: Total US Population (000s)
UNEMP: US Unemployment Rate
CPI: US Consumer Price Index
TGIVING: Indicator for Thanksgiving (November)
EASTER: Indicator for Easter (March or April)
XMAS: Indicator for Christmas (December)
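The three holiday indicators are straightforward for Thanksgiving (always November) and Christmas (always December), but Easter moves between March and April. A minimal sketch of constructing them for the January 2000 – December 2008 sample is shown below; the function and column names are our own choices for illustration, not taken from Retail Sales.xlsx. The Easter month is located with the standard Gregorian computus:

```python
def easter_month(year):
    """Month (3 or 4) in which Easter Sunday falls (Gregorian computus)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    return (h + l - 7 * m + 114) // 31

# Build one row per month with the three 0/1 seasonal indicators
rows = []
for year in range(2000, 2009):
    for month in range(1, 13):
        rows.append({
            "YEAR": year,
            "MONTH": month,
            "TGIVING": int(month == 11),                  # Thanksgiving
            "EASTER": int(month == easter_month(year)),   # March or April
            "XMAS": int(month == 12),                     # Christmas
        })
```

For example, Easter fell in April in 2000 and in March in 2008, so the EASTER indicator shifts between those two months across years, while TGIVING and XMAS are fixed.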
Mini-case 8.6: U.S. Unemployment Rates
[Contributors: Matt Egyhazy, Jay Kreider, Jim Platt, Ashley Wall and Rob Whiteside]
[Unemployment_2.xlsx]
Unemployment has multiple causes, among which five important factors are: minimum
wage laws, labor unions, efficiency wages, job search, and general economic
conditions. Minimum wages help maintain a certain standard of living for individuals,
but they also create an artificial floor that prevents wages from falling to a level
at which a greater percentage of workers would be employed. Similar to minimum wage
laws, labor unions create higher wage levels by increasing the wages and benefits of
union workers, potentially at the expense of others. Efficiency wages, the practice of
paying employees above the market equilibrium in order to elicit greater productivity,
create an excess labor supply that may also increase unemployment. The time spent on
job search as workers move from one position to another has a direct effect on
unemployment, since those workers are counted as unemployed between jobs. Finally,
general economic conditions have a direct effect on the unemployment rate: as the
economy slips, unemployment rises.
In order to build an appropriate forecasting model, monthly data were collected for the
following economic indicators:
UNEMPLOYMENT: U.S. Unemployment (percent)
MINIMUM WAGE: Nominal Minimum Wage, as enacted by Congress ($/hour)
NONFARM EARNINGS: Average income in the non-farming sector ($/year, monthly
series, annualized data)
CPI: Consumer Price Index
UNION MEMBERSHIP: percentage of labor force that is unionized
NOM GDP: U.S. Gross Domestic Product (nominal $ millions; quarterly)
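Because NOM GDP is quarterly while the other series are monthly, the frequencies must be reconciled before a monthly regression can be fitted. One simple option, sketched below with made-up figures, is to repeat each quarterly value across the three months of its quarter; interpolating between quarters is an alternative:

```python
# Hypothetical quarterly GDP figures ($ millions) for one year; the real
# series comes from Unemployment_2.xlsx / the BEA.
quarterly_gdp = [14000.0, 14150.0, 14300.0, 14250.0]

# Repeat each quarterly figure across the three months of its quarter so
# NOM GDP matches the monthly frequency of the other regressors.
monthly_gdp = []
for q_value in quarterly_gdp:
    monthly_gdp.extend([q_value] * 3)
```

The resulting step-shaped monthly series can then be aligned with UNEMPLOYMENT, CPI, and the other monthly variables in the design matrix.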
Data Sources for Mini-cases:
1. National Climatic Data Center (National Oceanic and Atmospheric Administration,
Department of Commerce),
http://lwf.ncdc.noaa.gov/oa/climate/research/cag3/md.html
2. Bureau of Labor Statistics (U.S. Department of Labor), http://www.bls.gov/
3. Energy Information Administration (U.S. Department of Energy),
http://www.eia.doe.gov/overview_hd.html and
http://tonto.eia.doe.gov/dnav/ng/ng_stor_sum_dcu_nus_m.htm
4. Bureau of Economic Analysis (U.S. Department of Commerce), http://www.bea.gov/
5. The Bureau of Justice Statistics (U.S. Department of Justice),
http://www.ojp.usdoj.gov/bjs/
6. FBI Uniform Crime Reports (National Archive of Criminal Justice Data),
http://www.fbi.gov/ucr/