Chapter 8: Multiple Regression for Time Series (under revision)
Contents
Introduction
8.1 Graphical analysis and preliminary model development
8.2 The multiple regression model
8.2.1 The method of least squares
8.3 Testing the overall model
8.3.1 The F-test for multiple variables
8.3.2 ANOVA in simple regression
8.3.3 The relationship between F and R2
8.3.4 S and Adjusted R2
8.4 Testing individual coefficients
8.4.1 Testing a group of coefficients
8.5 Checking the assumptions
8.5.1 Analysis of residuals for gas price data
8.6 Forecasting using multiple regression
8.6.1 The point forecast
8.6.2 Prediction intervals
8.6.3 Forecasting more than one period ahead
8.7 Principles
Summary
References
Exercises
Mini-Cases
Mini-case 8.1: The Volatility of Google Stock
Mini-case 8.2: Economic Factors in Homicide Rates
Mini-case 8.3: Forecasting Natural Gas Consumption for the DC Metropolitan Area
Mini-case 8.4: Economic Factors in Property Crime
Mini-case 8.5: U.S. Retail & Food Service Sales
Mini-case 8.6: U.S. Unemployment Rates
Why have you included so many variables in your regression model?
(Anonymous Statistician)
Why have you included so few variables in your regression model?
(Anonymous Economist)
Introduction
One of the key restrictions we faced in Chapter 7 was the inability to consider more than
one explanatory variable at a time. Yet both the discussion there and basic common
sense indicate that events in the business world are typically affected by multiple inputs.
We may not be able to measure all of them, but we do need to identify the main factors
and incorporate them into our forecasting framework. A first step towards identifying an
appropriate set of variables is to examine plots of the data, which we do in section 8.1,
although we need to proceed with caution as multiple dependencies in the data may make
interpretations complex. In section 8.2, we then proceed to formulate a statistical model
that incorporates multiple inputs and to interpret the coefficients in that model.
Estimation of the parameters follows the Method of Least Squares developed in section
7.2 and is extended to cover multiple regression in section 8.2.1.
Once we have developed a model, we need to know whether it is useful. For simple
linear regression, this question was straightforward. We checked to see whether or not
there was a statistically meaningful relationship between X and Y and that completed the
analysis, as in section 7.6. The question is now more complex. For example, sales of a
product may depend upon both advertising expenditures and price. Either variable alone
may provide only a modest description of what is going on, whereas the two taken
together may give a much better level of explanation. Conversely, a model for national
retail sales that includes both consumer expenditures and consumer incomes may be only
marginally better than a model that includes only one of them. The reason for this
apparent anomaly is that if X1 and X2 are highly correlated and X1 is already in the
model, X2 will not bring much new information to the table. To resolve such questions
we need to proceed in two steps:
1. Is the overall model useful? If the answer is NO, we go back to the drawing
board.
2. If the answer is YES, we check whether individual variables in the model are
useful for forecasting purposes.
These two steps are explored in sections 8.3 and 8.4. As we explained in section 7.5, our
analysis is based upon a set of standard assumptions. In section 8.5 we briefly revisit
those assumptions and present graphical procedures to determine whether the
assumptions are reasonable. Taking action to deal with failures of the assumptions is a
more difficult step, which we defer to Chapter 9.
Once the model has been shown to be effective and the assumptions appear to be
reasonable, we are in a position to generate forecasts. Point forecasts and prediction
intervals are considered in section 8.6. Finally in section 8.7, we consider some of the
key principles that underlie the development of multiple regression models.
At the end of the chapter, we provide details of six Mini-cases. Rather than work through
“pre-packaged” problem sets, these examples provide a more realistic approach to model-
building using multiple regression methods. The same mini-cases will be revisited at the
end of Chapter 9, to make use of the more advanced skills developed in that chapter.
8.1 Graphical analysis and preliminary model development
We return to the study of gasoline prices, initially examined in section 7.3 (Gas
prices_1.xlsx). The matrix plot we considered there is reproduced as Figure 8.1 for
convenience. The variables in the plot are:
The price of regular unleaded gasoline (‘Unleaded’; in cents per U.S. gallon)
Total disposable income (‘Disposable Income’; in billions of current Dollars)
The First Purchase Price of Crude Oil (‘L1_crude’; in cents per barrel, lagged one
month)
Unemployment (‘Unemploy’; overall percentage rate for the U.S.)
The S&P 500 Stock Index (‘S&P’).
Examination of the plot already revealed that the strongest linear relationship for gas
price appeared to be the lagged value of the first purchase price of crude oil. However,
we also see a somewhat upward sloping relationship between price and disposable
income, possibly due to the effect of inflation on both series. The general level of
economic activity is reflected in a downward sloping relationship with unemployment
and an upward sloping relationship with the S&P Index. None of these last three
relationships appears to be nearly as strong as that of L1_crude, but they all make
economic sense and might improve the overall ability to forecast gas prices. Also, as we
saw in Table 7.1, all their correlations with gas prices are significantly different from
zero, so potentially they may add value to the model.
With this example as background, we now examine the specification of the multiple
regression model.
Figure 8.1: MINITAB matrix plot for gas prices against disposable income, lagged price
of crude oil, unemployment and the S&P Index.
8.2 The multiple regression model
The multiple regression model is a direct extension of the simple regression model
specified in section 7.5. Note that we now move directly to the specification of the
underlying model, having already motivated the basic ideas in Chapter 7. We consider K
explanatory variables X1, X2, …, XK and assume that the dependent variable Y is
linearly related to them through the following model:
Y = β0 + β1X1 + β2X2 + … + βKXK + ε    (8.1)
The coefficients in (8.1) may be interpreted as follows:
β0 denotes the intercept, which is the expected value of Y when all the {Xj} are
zero, so that the equation reduces to Y = β0.
βj denotes the slope for Xj: when Xj increases by one unit and all the other X’s are
kept fixed, the expected value of Y increases by βj units.
Beyond the extended form of the expected value, or explained component of the model,
the underlying assumptions are the same as for simple regression given in section 7.5.1.
That is, we need only extend Assumption R1 appropriately.
Assumption R1: For given values of the explanatory variables X1, X2, …, XK, the
expected value of Y is written as E(Y) and has the form:

E(Y) = β0 + β1X1 + β2X2 + … + βKXK
Assumption R2: The difference between an observed Y and its expectation is a random
error, denoted by ε. The complete model is:

Y = E(Y) + ε = [Expected value] + [random error]    (8.2)
Assumption R3: The errors have zero means.
Assumption R4: The errors for different observations are uncorrelated with one another
and with other explanatory variables.
Assumption R5: The error terms come from distributions with equal variances.
Assumption R6: The errors are drawn from a normal distribution.
As in section 7.5, if we take assumptions R3 – R6 together, we are making the claim that
the random errors are independent and normally distributed with zero means and equal
variances.
8.2.1 The method of ordinary least squares (OLS)
The Method of Ordinary Least Squares (OLS) may be used to estimate the unknown
parameters. As for simple linear regression, we choose the sample coefficients, now
{b0, b1, …, bK}, to minimize the sum of squared errors, SSE. That is, we seek to minimize

SSE = Σi=1..n ei2 = Σi=1..n (Observedi − Fittedi)2 = Σi=1..n [Yi − (b0 + b1X1i + b2X2i + … + bKXKi)]2    (8.3)
The technical details were summarized in Appendix 7A and we will not consider the
computational issues further1. After we have determined the best fitting model, we
estimate the error terms using the (least squares) residuals, defined as:
ei = Yi − b0 − b1X1i − b2X2i − … − bKXKi    (8.4)
The residuals form the basis of many of the tests and diagnostic checks that we use to
validate the model, as we shall see in later sections.
Example 8.1: Multiple regression model for gasoline prices
We employ the observations for January 1996 – March 2002, as before. The Least
Squares solution is:
Unleaded = 84.6 + 0.0268 L1_crude + 0.00599 PDI
- 6.94 Unemploy - 0.0184 S&P 500
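A minimal sketch of how such a least squares fit can be reproduced in Python is shown below. It uses hypothetical synthetic data (the intercept 85 and slopes 0.027 and 0.006 are made up for illustration, loosely echoing the fitted values above) rather than the actual gas price series in Gas prices_1.xlsx:

```python
import numpy as np

# Hypothetical synthetic data mimicking the structure of model (8.1);
# the coefficients 85, 0.027 and 0.006 are made up for illustration.
rng = np.random.default_rng(42)
n = 74
x1 = rng.uniform(1000, 3000, n)   # e.g. a lagged crude price, cents per barrel
x2 = rng.uniform(6000, 8000, n)   # e.g. disposable income, billions of dollars
y = 85 + 0.027 * x1 + 0.006 * x2 + rng.normal(0, 7, n)

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones(n), x1, x2])

# Least squares estimates (b0, b1, b2), minimizing SSE as in (8.3)
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals as in (8.4); with an intercept they sum to (essentially) zero
e = y - X @ b
print(b)
```

The recovered slopes should land close to the values used to generate the data, with the gap shrinking as the error standard deviation falls or the sample size grows.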
Examination of the coefficients indicates that an increase of $1 in the price of
a barrel of crude may be expected to increase the price at the pump by about 2.7 cents per
gallon, slightly lower than the figure we got with simple regression. Likewise, an
1 Standard texts on regression analysis provide the necessary details; see for example Kutner et al. (2005,
pp. 15 – 20 and 222 – 227).
increase in disposable income produces an increase in the expected price, whereas an
increase in unemployment reduces the expected gas price. The coefficient for the S&P
Index is also negative; initially we might have expected an increase in the S&P to signal
increased economic activity and thus upward pressure on gas prices. However, the issue
of timing is important, and the negative sign could reflect the impact of good news in the
crude oil markets, which would both lower pump prices and boost the overall economy.
No matter how good the statistical fit, the forecaster should always check the face
validity of the proposed forecasting model.2 If the model passes the face-validity test, we
go ahead and check to see whether it makes sense statistically.
8.3 Testing the overall model
We specify the null hypothesis as the claim that the overall model is of no value or, more
explicitly, that none of the explanatory variables affects the expected value. Formally,
this statement is written as:

H0: β1 = β2 = … = βK = 0.
When the null hypothesis is true, none of the variables in the model contributes to
explaining the variation in Y. The alternative hypothesis, HA, states that the overall model
is of value, in that at least one of the explanatory variables has an effect:

HA: Not all βj = 0, j = 1, 2, …, K.
2 Imagine standing in front of your boss in her office. Can you give a plausible justification of the model
and all the coefficients? If not, develop a different model!
That is, there is some statistical relationship between Y and at least one of the Xj. If we
fail to reject H0, we conclude that the overall model is without value and we need to start
over. If we reject H0, we may still wish to eliminate those variables that do not appear to
contribute, so as to arrive at a more parsimonious model.3
8.3.1 The F-test for multiple variables
The test procedure is based upon the partition of the sum of squares and is known as the
Analysis of Variance, often referred to as ANOVA. The sums of squares are:
Total Sum of Squares: SST = SYY = Σi=1..n (Yi − Ȳ)2

Sum of Squared Errors: SSE = Σi=1..n (Yi − Ŷi)2

Sum of Squares explained by the Regression model: SSR = Σi=1..n (Ŷi − Ȳ)2
As was true in the case of simple regression, it may be shown that:
SST = SSR + SSE    (8.5)
The ANOVA test is usually summarized in tabular form and the general framework is
presented in Table 8.1.
The first column describes the partition into the two sources of variation: the sums of
squares explained by the regression model and the sum of squared errors.
3 When the model is based upon sound theoretical considerations, it makes sense to retain all the variables
in the model, even if some are not statistically significant, so long as the parameter estimates make sense.
This is typically true in econometric modeling. On the other hand, if we optimistically include variables on
a “see if it flies” basis, we will usually prefer to prune the model to the smaller number of “useful”
variables.
The second column gives the degrees of freedom (DF) associated with each of the
sums of squares; the total DF is (n-1) since we always start out with a constant term
in the model.
The third column provides the numerical values of the sums of squares.
Column four gives the Mean Squares, defined as [Sum of Squares / DF] for each
source: MSR and MSE respectively.
Column five yields the test statistic F = MSR/MSE.
The reason for introducing the new term Mean Square Error is that, when the null
hypothesis is true, both MSR and MSE have expected values equal to the error variance,
σ2. Thus, the test statistic4 F = MSR/MSE should have a value in the neighborhood of 1.0
if the null hypothesis is appropriate. When the regression model is useful, the amount of
variation explained by the model will increase, so MSR will increase relative to MSE and
F will increase. Thus, we will reject H0 for sufficiently large values of F.
Source           DF       Sums of Squares   Mean Squares           F
Regression       K        SSR               MSR = SSR / K          MSR / MSE
Residual Error   n-K-1    SSE               MSE = SSE / (n-K-1)
__________________________________________________________
Total            n-1      SST
Table 8.1: General form of the ANOVA table
4 The analysis of variance was first derived by Sir Ronald Fisher, the father of modern inferential statistics.
The ratio was labeled F in his honor by another famous statistician, George Snedecor. Ironically, the test is
sometimes known as Snedecor’s F test.
We refer to the observed value of F generated from this table as Fobs. The decision rule
becomes:
Reject H0 if Fobs > Fα(K, n-K-1); otherwise do not reject H0.
The critical value for F depends upon the degrees of freedom for both the SSR (the
numerator DF = ν1 ) and the SSE (the denominator DF = ν2 ). Detailed tables for the
upper 5 percent and 1 percent points of the F-distribution are given in Appendix B4. An
extract of the 5 percent table is shown in Table 8.2.
ν2 \ ν1       1         2         3         4         5         6
1 161.448 199.500 215.707 224.583 230.162 233.986
2 18.513 19.000 19.164 19.247 19.296 19.330
3 10.128 9.552 9.277 9.117 9.013 8.941
4 7.709 6.944 6.591 6.388 6.256 6.163
5 6.608 5.786 5.409 5.192 5.050 4.950
6 5.987 5.143 4.757 4.534 4.387 4.284
7 5.591 4.737 4.347 4.120 3.972 3.866
8 5.318 4.459 4.066 3.838 3.687 3.581
9 5.117 4.256 3.863 3.633 3.482 3.374
10 4.965 4.103 3.708 3.478 3.326 3.217
Table 8.2: Extract of table of upper 5 percent points for the F distribution. Extracted from the
National Institute of Standards and Technology [NIST] website:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm#ONE-05-11-20
For example, if n = 14 and K = 3, we have ν1 = 3 and ν2 = 14-3-1 = 10 so that the critical
value from the table is 3.708. The values are also available through the EXCEL function
FINV; for example, ‘=FINV(0.05,3,10)’ returns 3.708.
Again, these tables are perhaps more of historical interest than anything else, since it is
generally much more convenient to use the P-value with the decision rule:
Reject H0 if P < α; otherwise do not reject H0.
Example 8.2: ANOVA for gas prices.
The ANOVA table for the gas prices model given in Example 8.1 with K = 4 is shown
below. Consistent with our usual convention, we specify α = 0.05.
Source DF SS MS F P
Regression 4 21268.0 5317.0 110.88 0.000
Residual Error 69 3308.9 48.0
Total 73 24576.9
The 5 percent point from the F tables with ν1 = 4 and ν2 = 69 is 2.525; since ν2 = 69 is not
listed, we take the next lowest DF in the table, which is ν2 = 60. Since Fobs = 110.88 is
much greater than 2.525, we emphatically reject the null hypothesis. The probability of
such an extreme value can also be calculated through EXCEL through the function
‘=FDIST(110.88,4,69)’ which delivers a P value of 0.000. More directly, we observe
that P < 0.05 and we reject H0. Recall that a P value of 0.000 does not mean zero, but
rather that P < 0.0005 and the result is rounded down. We conclude there is strong
evidence that the overall model is useful and the next step is to determine which variables
contribute.
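The arithmetic behind this table can be verified directly; a quick sketch using the sums of squares reported above:

```python
# ANOVA arithmetic for the gas price model, using the values in the table above
SSR, SSE = 21268.0, 3308.9
K, n = 4, 74

MSR = SSR / K              # 5317.0
MSE = SSE / (n - K - 1)    # about 48.0, on n - K - 1 = 69 degrees of freedom
F = MSR / MSE              # about 110.9, in line with the reported 110.88

print(MSR, round(MSE, 1), round(F, 1))
```

The tiny discrepancy against the printed F of 110.88 is just rounding in the reported mean squares.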
8.3.2 ANOVA in simple regression
We did not consider the Analysis of Variance in Chapter 7 because it did not provide any
additional information. When there is only one variable in the model, the test of the
overall model is formally equivalent to the test of the single slope. Referring back to the
computer output in Figure 7.11, we see that the P-value for ANOVA is identical with
that for the t-test of the slope. That is, in simple linear regression, the F and (two-sided) t
tests provide identical information. Another way of saying this is that, in simple linear
regression, F = t2 as can readily be verified numerically in Figure 7.11.
8.3.3 The relationship between F and R2
There is a simple relationship between F and R2, so the F test yields exactly the same
conclusion as if we had performed a test using R2. Starting from
F = MSR/MSE = (SSR / K) / (SSE / (n − K − 1))

we divide numerator and denominator by SST so that

F = [(n − K − 1)/K] × (SSR/SST) / (SSE/SST).

Since R2 = SSR/SST and 1 − R2 = SSE/SST, we have that

F = [(n − K − 1)/K] × R2/(1 − R2).    (8.6)
From inspection of equation (8.6), it is evident that an increase in R2 leads to an increase
in F, so that the ANOVA test is completely equivalent to a test based upon the coefficient
of determination. Either from the ANOVA table, or directly from the computer output,
we find for Example 8.2 that:
R2 = SSR/SST = 21268.0/24576.9 = 0.865
Thus, the coefficient of determination has increased compared to the single variable
model, which had R2 = 84.4%. Indeed, whenever we add a variable to the model, we find
that R2 increases (or strictly speaking, cannot decrease). However, the value of the F
statistic often falls because of the K in the denominator of (8.6). There is no
inconsistency here, but we need to recognize that the decline in F does not signal a
weakness in the model.
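Equation (8.6) can be checked numerically against the ANOVA results for the gas price model; a small sketch:

```python
# Verify equation (8.6) for the gas price model: F computed from R-squared
SSR, SST = 21268.0, 24576.9
K, n = 4, 74

R2 = SSR / SST                          # about 0.865
F = ((n - K - 1) / K) * R2 / (1 - R2)   # about 110.9, matching the ANOVA F
print(round(R2, 3), round(F, 1))
```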
8.3.4 S and Adjusted R2
The steady increase in R2 as new variables are added is a matter for some concern. A
better guide to the performance of the model is to look at S, the standard error, now
defined as:
S2 = MSE = SSE/(n − K − 1) = Σi=1..n ei2/(n − K − 1), with S = √MSE    (8.7)
If S is smaller, the model has improved as the result of including the extra variable,
although the improvement may be marginal. As we argued in section 7.7.3, S has a
straightforward interpretation as the accuracy of the predictions since it is an estimate of
the error standard deviation.
An alternative route to interpreting the overall accuracy of a multiple regression model is
through the adjusted form of R2. The construction is as follows:
Question: What value of R2 could we expect if we included K sets of arbitrary numbers in
the model? That is, if H0 was true.

Answer: With K “junk” variables in the model we would have an expected value for R2
equal to K/(n − 1). For example, if n = 21 and you include 10 variables in the model, you
can expect an R2 of 50% even if the model is pure junk.5
Solution: We adjust R2 so that the modified value is zero for “pure junk” but still
increases to 100% when the fit is perfect.
5 Include 20 “junk” variables and you get R2 = 100%! All you would be doing is playing “join the dots”
on the scatter diagram. Try more than 20 variables and the computer program will either fail, or take
corrective action by eliminating excess variables.
The plot of adjusted R2, which we abbreviate to R2(adj), is shown in Figure 8.2 for
n = 21 and K = 5. For example, when the observed value of R2 = 0.20, we find that
R2(adj) = -0.067, a model distinguished only in the sense that you would have been better off
using random numbers!
The general algebraic expression is:

R2(adj) = [(n − 1)/(n − K − 1)] × [R2 − K/(n − 1)]    (8.8)
The term inside the square brackets removes that part of R2 that could arise just by
chance, and the ratio in front of the square brackets then rescales the expression so that
R2(adj) = 1 when R2 = 1. When we add a variable to the model, R2(adj) will increase if
and only if S decreases, so that an evaluation based upon S has been shifted to a more
recognizable index. R2(adj) could be defined as the proportion of variance explained,
above and beyond that part which could be attributed to chance.
Figure 8.2: R2(adj) plotted against R2 when n = 21, K = 5.
Example 8.3: R-Sq(adj) and S for the gas prices model
We have K = 4 and n = 74, so that

S2 = 3308.9/69 = 47.96 and S = 6.92.
Since S2 = MSE, we could have read that value directly from the table in Example 8.2.
We then obtain:

R2(adj) = (73/69) × [0.865 − (4/73)] = 0.858
As we see in this example, when K is small relative to n, the adjusted value is only
marginally less than the original R2.
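The computation in equation (8.8) can be sketched directly from the sums of squares reported for this model:

```python
# Adjusted R-squared for the gas price model via equation (8.8)
SSR, SST = 21268.0, 24576.9
K, n = 4, 74

R2 = SSR / SST
R2_adj = ((n - 1) / (n - K - 1)) * (R2 - K / (n - 1))
print(round(R2_adj, 3))   # 0.858, as reported
```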
Computer programs will generally provide all the information discussed in this section in
summary form:
S = 6.92 R-Sq = 86.5% R2(adj) = 85.8%
8.4 Testing individual coefficients
Once we have established that the overall model is of value, we need to determine which
variables are useful and which, if any, do not contribute. The process is very similar to
that described in section 7.6, but there are some crucial differences. First, since there are
K variables in the model, we will perform K separate tests. We describe the procedure
for variable Xj, j = 1, 2, …, K.
The null hypothesis, now denoted by H0(j), states that the theoretical slope for Xj in the
regression is zero, given that the other variables are already in the model. We are not
testing for a direct relationship between Xj and Y, but rather a conditional relationship.
Given that the other variables are already in the model, does Xj add anything? The null
hypothesis may be written as:
H0(j): βj = 0, given that the Xi, i ≠ j, are in the model.
The alternative hypothesis is now denoted by HA(j) and states that the slope is not zero,
again assuming that the other variables are in the model. That is, there is a relationship
between Xj and Y even after accounting for the contributions of the other variables. We
write the alternative as:
HA(j): βj ≠ 0, given that the Xi, i ≠ j, are in the model.
As before, we assume the null hypothesis to be true, and then test this assumption. We
use the test statistic
t = bj / SE(bj)    (8.9)
Let tobs denote the observed value of this statistic. This value is to be compared with the
appropriate value from a table of Student’s t distribution with (n-K-1) DF; the degrees of
freedom are determined by the number of observations available to estimate S, which is
now (n-K-1), as seen from Table 8.1. If we use a significance level of 100α%, we denote
the value from these tables as tα/2(n-K-1). The decision rule for the test is:
If |tobs| > tα/2(n-K-1), reject H0(j); otherwise, do not reject H0(j).
As in chapter 7, we will usually find it more convenient to perform the test using the P-
value. The decision rule is then written as:
If P < α, reject H0(j); otherwise do not reject H0(j).
A benefit of using the P-value approach is that, once the value of P is available, the
decision rule always has this standard form: reject H0(j) if P < α.6
Since we are performing a test on each slope in turn, standard computer packages
typically summarize the set of K tests in a single table. Figure 8.3 provides the output for
the gas prices example.
Predictor Coef SE Coef T P
Constant 84.64 16.82 5.03 0.000
L1_crude 0.026810 0.001745 15.37 0.000
PDI 0.005990 0.002459 2.44 0.017
Unemploy -6.937 3.254 -2.13 0.037
S&P 500 -0.01838 0.01074 -1.71 0.092
Figure 8.3: Single variable tests for the gas prices model
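Each T value in Figure 8.3 is simply the coefficient divided by its standard error, per equation (8.9); a quick sketch reproducing them (they agree with the printed values up to rounding):

```python
# t ratios for the gas price model: t = b_j / SE(b_j), values from Figure 8.3
coefs = {
    "L1_crude": (0.026810, 0.001745),
    "PDI":      (0.005990, 0.002459),
    "Unemploy": (-6.937,   3.254),
    "S&P 500":  (-0.01838, 0.01074),
}
t_ratios = {name: b / se for name, (b, se) in coefs.items()}
for name, t in t_ratios.items():
    print(name, round(t, 2))
```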
As in chapter 7, we ignore the test for the intercept or constant. Of the four input
variables, three have P < 0.05 but the S&P Index does not. We might drop the S&P
index from the model and repeat the analysis; the results are shown in Figure 8.4A.
R2(adj) has dropped slightly, but now we find that Unemployment fails the single
variable test. Such events are by no means uncommon and reflect the correlations among
the input variables: panel (B) of Figure 8.4 gives these correlations and their P-values
(for direct tests on the correlations). The S&P Index and Unemployment both seem
capable of conveying some information, but the high correlation between them makes the
task of estimating their respective impacts statistically difficult.
6 The authors’ tombstones will probably bear the inscription “RIP < α”, or “Reject if the P-value is less than
the significance level”.
If we continue our pursuit of statistically significant results, we are led to panel (C) of
Figure 8.4, where we retain only two variables: L1_crude and Disposable Income.
Finally we have a model where all the variables have coefficients that differ significantly
from zero and whose signs point in the appropriate direction.
Which model should we use? There are two possible answers to the question at this
stage:
1. It is too early to decide, because we have not checked the validity of our
underlying assumptions.
2. None of them, because we have not explored possible improvements such as the
addition of other variables.
If we have to choose one of the three models, purely statistical criteria are of limited use.
We might stick doggedly to the argument “significant variables only” and use the two
variable version (the “statistician’s view”). Alternatively, we might respond that
theoretical considerations led us to the four variable model and we will use that even if
some of the terms are not statistically significant (the “economist’s view”). For purposes
of exposition, we will stand by the statistical argument and use the two variable model in
the next few sections.
(A) Regression on L1_crude, Disposable Income and Unemployment
Predictor Coef SE Coef T P
Constant 67.78 13.82 4.90 0.000
L1_crude 0.027928 0.001640 17.03 0.000
PDI 0.002343 0.001244 1.88 0.064
Unemploy -2.090 1.624 -1.29 0.202
S = 7.01963 R-Sq = 86.0% R-Sq(adj) = 85.4%
Analysis of Variance
Source DF SS MS F P
Regression 3 21127.6 7042.5 142.92 0.000
Residual Error 70 3449.3 49.3
Total 73 24576.9
(B) Correlations: L1_crude, Disposable Income, Unemploy, S&P
L1_crude PDI Unemploy
PDI 0.455
0.000
Unemploy -0.231 -0.447
0.048 0.000
S&P 500 0.279 0.806 -0.811
0.016 0.000 0.000
(C) Regression on L1_crude and Disposable Income
Predictor Coef SE Coef T P
Constant 53.305 8.072 6.60 0.000
L1_crude 0.028026 0.001645 17.03 0.000
PDI 0.002934 0.001161 2.53 0.014
S = 7.05198 R-Sq = 85.6% R-Sq(adj) = 85.2%
Analysis of Variance
Source DF SS MS F P
Regression 2 21046 10523 211.60 0.000
Residual Error 71 3531 50
Total 73 24577
Figure 8.4: Results for reduced models for gas prices
8.4.1 Testing a group of coefficients
We have just illustrated the t-test and shown how it can be used to simplify a model.
Sometimes we want to go a step further and consider whether it is necessary to include a
group of variables in the model. For example, in predicting an individual applicant's
credit risk, a credit card company's database may include a number of variables capturing
the applicant's credit history. The question is: 'Do these variables add anything to the
predictive power of the model?'
We can solve this problem by comparing two models: M1, which contains all the
variables, and M0, which contains only a subset. In the notation below, M1 contains the
(q+1) parameters {β0, β1, …, βp, βp+1, …, βq}, while the simpler model M0 contains the
(p+1) parameters {β0, β1, …, βp}.
More formally, there are (q − p) restrictions placed on model M1 to obtain the simpler
model M0:
H0: βp+1 = βp+2 = … = βq = 0 vs.
H1: some of βp+1, …, βq are non-zero.
To compare the two models we examine their explanatory power through their residuals.
We therefore estimate the sum of squared errors from the extended model M1 and also
from the simpler model M0. Define:

Model M1: Y = f(X; β0, β1, …, βp, βp+1, …, βq), with residual sum of squares SSR1

Model M0: Y = f(X; β0, β1, …, βp, 0, 0, …, 0), with residual sum of squares SSR0

where SSRi = Σt et2 for the particular model Mi, i = 0 or 1. Calculate:

F = [(SSR0 − SSR1)/(q − p)] / [SSR1/(n − q − 1)]
Discussion question: Why is SSR0 always greater than SSR1?
This statistic has an F distribution with (q − p, n − q − 1) degrees of freedom and the P-value can
be found using the EXCEL function FDIST.
This same approach can of course be used to test just a single coefficient (q = p + 1); this
is equivalent to a t-test, and observed F values of around 4 are then close to the 5 percent
significance level. We will leave an illustration of how this can be used to the next chapter,
where we show how to identify seasonal patterns.
The test is particularly useful as it can also be used for testing non-linearities or any set of
parameter restrictions, so long as the simpler model M0 is a restricted version of the full
model M1.
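As a numerical illustration (not worked in the text itself), this partial F-test can ask whether Unemployment and the S&P Index jointly add anything once L1_crude and Disposable Income are in the gas price model, using the error sums of squares from Example 8.2 and Figure 8.4(C):

```python
# Partial F-test for the gas price model.
# SSR1 = residual SS of the full four-variable model M1 (Example 8.2);
# SSR0 = residual SS of the two-variable model M0 (Figure 8.4C).
SSR1, SSR0 = 3308.9, 3531.0
q, p, n = 4, 2, 74

F = ((SSR0 - SSR1) / (q - p)) / (SSR1 / (n - q - 1))
print(round(F, 2))   # about 2.32, below the 5 percent point of F(2, 69)
```

Since the observed F falls short of the 5 percent critical value (roughly 3.1 for 2 and 69 degrees of freedom), the two extra variables do not contribute significantly as a group, consistent with the single-variable tests of section 8.4.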
8.5 Checking the assumptions
Both in this chapter and the previous one we laid out a set of assumptions. However, to
date, we have not attempted to validate those assumptions; rather, we have proceeded as
though our model was fully and completely specified and that all the assumptions were
valid. In short, we have been living in a forecasting fool’s paradise. In this section we
take the model selected in section 8.4 and try to determine how well it matches up to the
assumptions stated in section 8.2.
We now examine these assumptions and devise ways to check the validity of each. Since
we have available only a sample we can never guarantee a particular assumption, but we
can check whether it seems plausible. We tend to be very pragmatic: if the data suggest
that a particular assumption is OK, we stay with that assumption. Given a reasonably
sized sample, such evidence suggests that any violation of the assumption is likely to be
modest, as will be the likely impact of that violation. However, we should always keep
in mind that this argument applies only if we are confident that the system will
continue to operate under the same regime as in the past; if major structural changes take
place, all bets are off unless we can incorporate such changes into the model. If a
particular assumption breaks down, the nature of the breakdown will often indicate how
the model might be improved. Our diagnostics are developed using the residuals, as
defined in equation (8.4).
Assumption R1: The expected value of Y is linear in the values of the selected
explanatory variables.
Potential violations: We may have missed an important variable, or the relationship may
not be linear in the X’s.
Diagnostics:
1. Plot the residuals against the fitted values. Non-linear relationships will show up
as curvature in the plot.
2. Plot the residuals against potentially important Xs not currently in the model. If a
particular new X has an impact on Y, this should show up on the scatter plot as a
non-zero slope.
Assumption R2: The difference between an observed Y and its expectation is due to
random error.
Discussion: The assumption states that the error is an “add-on” and serves to justify the
least squares formulation for estimating the parameters. The error can always be
expressed in this way, but its properties will depend critically upon the next four
assumptions. Therefore, we do not check this assumption directly, but examine aspects
of it as described below.
Assumption R3: The errors have zero means.
Discussion: Typically, this assumption is not testable, at least when we are looking at a
single series. The inclusion of a constant term in the model ensures the mean of the in-
sample errors is zero. Once out-of-sample, however, it is a different matter and the model
errors may show bias. This can occur because many macroeconomic series are released in
preliminary form, and then updated. The model may have been constructed on one set of
final figures and then used in forecasting based on the preliminary data. Cross checks
between the preliminary and final versions of such variables may reveal biases in the
initial figures. Likewise, the construction of business databases should be regularly
examined to confirm that variables are correctly measured (e.g. the number of returned
items should be deducted from the appropriate month’s sales). Consistent biases in the
inputs are less critical, since they lead to modified coefficients but need not have an
adverse effect on the forecasts.
Assumption R4: The errors for different observations are uncorrelated with other
variables and with one another. Thus, the errors should be uncorrelated with the
explanatory variable or with other variables not included in the model. When examining
observations over time, this assumption implies no correlation between the error at time t
and past errors; otherwise, the errors are autocorrelated.
Possible violations: This assumption lies at the heart of model building and boils down to
the claim that the model contains all the predictable components leaving only noise in the
error term. The residuals therefore should not be related to factors not included in the
model such as the input variables themselves or where the data is a time series, past
values of either the inputs, the dependent variable or past errors. This can occur if there is
a carryover effect from one period to the next, which could be due to such factors as the
weather, brand loyalty or economic trends. Thus, a positive residual in one time period is
likely to be followed by a positive residual in the next period. High-low sequences are
also possible, such as a drop in sales after high volumes due to a special promotion.
Diagnostics:
1. Plot the residuals against the predicted value of Y and also the input variables
included in the model (as well as any others that have been excluded).
2. Plot the residuals against the time order of the observations. If positive
autocorrelation exists, we will see sequences of values above zero, and then below
zero, rather than a random scatter. If the autocorrelation is negative a saw-tooth
pattern will prevail.
3. Plot the sample autocorrelation function (acf) for the residuals (see section 6.3)
and look for departures from a random series by performing tests for the presence
26
of autocorrelation. (A test sometimes recommended for this purpose is based on the
Durbin-Watson statistic, but it has limited validity; examination of the acf is an
easy-to-use, effective substitute. In Chapter 10 we introduce more efficient tests.)
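As a sketch of diagnostic 3, the sample acf of the residuals and its approximate 5% limits can be computed directly. The residual series below is simulated for illustration only:

```python
import numpy as np

def sample_acf(e, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of a residual series."""
    e = np.asarray(e, dtype=float) - np.mean(e)
    denom = float(e @ e)
    return [float(e[k:] @ e[:-k]) / denom for k in range(1, max_lag + 1)]

# simulate an autocorrelated residual series (AR(1) with coefficient 0.6)
rng = np.random.default_rng(1)
e = np.zeros(75)
for t in range(1, len(e)):
    e[t] = 0.6 * e[t - 1] + rng.normal()

acf = sample_acf(e, max_lag=12)
limit = 2.0 / np.sqrt(len(e))               # approximate 5% significance limits
flagged = [k + 1 for k, r in enumerate(acf) if abs(r) > limit]
```

Any lag appearing in `flagged` corresponds to a spike outside the significance limits on the acf plot.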
Assumption R5: The error terms come from distributions with equal variances.
Possible violations: The most common pattern is that the variability increases as the
mean level of Y increases. We naturally talk about percentage movements up or down in
GDP, in sales and in many other series. The implication behind such terminology is that
the variations are proportional to the level of the mean, rather than displaying constant
variance.
Diagnostics:
1. Plot the residuals against the fitted values. If the errors are heteroscedastic, the
scatter will often be greater for the larger fitted values.
2. Various test statistics are available (see Anderson, Sweeney and Williams, 2005,
Chapter 11), but we do not pursue that topic further. Procedures for dealing with
changing variances are discussed in section 9.6.
Assumption R6: The errors are drawn from a normal distribution.
Possible violations: There may be one or more outliers that serve to make the distribution
non-normal, or the whole pattern of the residuals may suggest a non-normal distribution.
Diagnostics:
1. Plot the histogram of the residuals, and look for an approximate bell-shape.
2. Use the normal probability plot (see Appendix A2). If the plot deviates
significantly from a straight line, this indicates non-normality.
3. Examine the plots of residuals against both time order and fitted values for
extreme observations.
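These normality checks are easy to automate. A sketch using scipy on simulated residuals; the Jarque-Bera test is one common numerical supplement to the plots, not the chapter's prescribed method:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
resid = rng.normal(scale=7.0, size=75)      # stand-in for regression residuals

# Jarque-Bera combines skewness and excess kurtosis into one statistic;
# a small p-value signals departure from normality
jb = stats.jarque_bera(resid)
jb_stat, jb_p = jb.statistic, jb.pvalue

# correlation from the normal probability plot: near 1 when residuals are normal
(osm, osr), (slope, intercept, r) = stats.probplot(resid, dist="norm")
```

A value of `r` well below 1, or a small `jb_p`, points to the same departures that show up as curvature in the probability plot.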
It is evident from this summary that some plots, notably that of the residuals against fitted
values, serve multiple purposes. It is important to keep these several objectives in mind
when examining the plots.
8.5.1 Analysis of residuals for gas price data
Most forecasting packages will generate the plots we have just discussed, some more
easily than others. In particular, Minitab produces a ‘Four in One’ plot as part of its
regression component, which is particularly useful for the analyses we have been
discussing.
The residuals plots for the two variable model we identified in Figure 8.4 (C) are shown
in Figure 8.5. We examine these plots in the order they appear in the output:
a. The probability plot (top left) appears at first sight to be close to a straight line.
However, closer examination reveals a slight curvature: the points are below the line
at the ends of the plot, and above it in the middle. Further, the largest observation
appears to be an outlier, an identification confirmed by the other plots.
b. The histogram (bottom left) tells much the same story as the probability plot. There is
some evidence of a departure from the normal curve, with a long tail at the upper end.
[Residual Plots for Unleaded: Normal Probability Plot; Versus Fits; Histogram; Versus Order.]
Figure 8.5a: Residuals plots for the 2-variable gas prices model.
Figure 8.5b: Residual plots versus the input variables in the data set
c. As we noted earlier, the plot of residuals against fitted values (top right) may tell
several stories. The residuals for fitted values below 140 show something of a
downward drift, followed by a sudden jump for larger fitted values. Non-linearity is one
possibility; an omitted variable is another. Also, the scatter of the
points is greater for larger fitted values, indicating some heteroscedasticity. Finally,
we again note the large positive residual, which may be an outlier.
d. The plot of residuals against order (bottom right) shows runs of positive values
followed by runs of negative values, indicative of autocorrelation. Also, we observe
that the string of large values is clustered together at the end of the series, suggesting
a possible change in conditions that should be examined more closely.
The issue of residual autocorrelation is particularly important since it indicates
persistence in the time series that has not been fully captured by the current model. To
investigate this phenomenon, we look at the ACF of the residuals, shown in Figure 8.6.
[Autocorrelation Function for RESIDUALS (with 5% significance limits for the autocorrelations); lags 1-18 shown.]
Figure 8.6: ACF for the residuals of the two variable gas prices model.
The ACF indicates a degree of persistence with a significant positive autocorrelation at
lag 1. The spikes at lag 6 and lag 12 suggest possible seasonality which also merits
further examination.
Collectively, these plots provide plenty of food for thought and indicate that we have
some work ahead of us before we can be satisfied with the model. We will return to the
model-building endeavor in Chapter 9; for now, we explore the use of such models in
forecasting.
8.6 Forecasting using multiple regression
The general procedure for forecasting using several explanatory variables is essentially
the same as the single variable case described in section 7.7. The technical details
become more involved; the interested reader is referred to Kutner et al. (2005, pp. 229 –
232). The first question we must answer relates to the nature of the explanatory variables.
Recall any particular X may arise in one of three ways:
a. X is known ahead of time
b. X is unknown but can itself be forecast
c. X is unknown but we wish to make “what-if” forecasts.
For example, consider a model for sales. Variables that designate particular seasons are
clearly known in advance, as may be substantive variables that have been sufficiently
lagged in time. Policy variables such as price and advertising expenditure may be explored
using the model to make “what-if” forecasts so that the sensitivity of expected sales to
policy changes can be explored. Finally, some variables such as the price charged by
competitors or the level of GDP will require forecasts themselves. Such forecasts are
often generated by industry analysts, government sources or macroeconomic panels (see
for example, www.consensuseconomics.com). Alternatively, time series methods such as
exponential smoothing methods discussed in Chapters 3 and 4 could be used.
8.6.1 The point forecast
We suppose that values for the next time period are available for each of the K variables
and denote these values by X01, X02, ..., X0K. Given the estimated regression equation, the
point forecast is:

F0 = b0 + b1*X01 + b2*X02 + ... + bK*X0K    (8.10)
As before, we need to distinguish between the fitted values Ŷ and the forecast F0. The
two formulae are the same but the fitted values correspond to those observations that
were used in the estimation process whereas the forecasts refer to new observations.
These new values may be part of a hold-out sample or values as yet unobserved, but they
should not be used to estimate the model parameters.
Example 8.4: One-step-ahead forecasts for gas prices
We use the two variable model for gas prices as an illustration. One-step-ahead forecasts
were generated using (8.10) so that, for example, the forecast for May 2002 uses the
crude price for April 2002 and the May PDI; the calculation is as follows. The regression
model (from Figure 8.4) is:
Unleaded = 53.305 + 0.028026*L1_crude + 0.002934*PDI

Assuming X01 (L1_crude) = 2252 and X02 (PDI) = 8910.6, then:

F0 = 53.305 + 0.028026*2252 + 0.002934*8910.6 = 142.6
We illustrate how to carry out this calculation using SPSS (Minitab). The data matrix is
expanded to include the assumed values of the input variables (but of course there is no
corresponding dependent variable observed). As shown in Figure ?, we then save the
predictions and the prediction intervals (see next section) to the data matrix, running the
regression model on the expanded data set.
Figure ? Screenshot showing how prediction intervals are calculated automatically using
SPSS.
Note that this forecast is not a pure ex-ante forecast: we would not know the May PDI at
the time the forecast was made. Exercise 9.3 involves the production of a genuine ex-ante
forecast using a lagged value of PDI.
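The arithmetic of equation (8.10) is just a sum of products, and is easily checked by hand or in a few lines of code. A minimal sketch reproducing the calculation above (the variable names are ours):

```python
# coefficients from the fitted model quoted above (Figure 8.4)
b0 = 53.305
b = [0.028026, 0.002934]                    # L1_crude and PDI coefficients
x0 = [2252.0, 8910.6]                       # assumed predictor values, May 2002

# equation (8.10): F0 = b0 + b1*X01 + b2*X02
forecast = b0 + sum(bj * xj for bj, xj in zip(b, x0))
print(round(forecast, 1))                   # prints 142.6
```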
The complete set of one-step-ahead point forecasts and summary measures is given in
Table 8.3. The various error measures are computed in accordance with the formulae in
section 2.7. From the table, we can see that an upturn in prices predicted for the fall of
2002 did not materialize, whereas the actual upturns in the first part of 2003 and again in
the later part were somewhat underestimated. The forecast root mean square error
calculated from the values in this table is 5.70, a modest improvement over the value for
the single predictor model given in section 7.7.
Forecast Month     Unleaded   Unleaded   Forecast   Absolute   Squared   Absolute %
                   Actual     Forecast   Error      Error      Error     Error
2002 April          140.7      136.2       4.5       4.5        20.2      3.2
2002 May            142.1      142.6      -0.5       0.5         0.2      0.3
2002 June           140.4      145.4      -5.0       5.0        25.4      3.6
2002 July           141.2      142.8      -1.6       1.6         2.7      1.2
2002 August         142.3      145.4      -3.1       3.1         9.8      2.2
2002 September      142.2      149.0      -6.8       6.8        46.0      4.8
2002 October        144.9      152.7      -7.8       7.8        61.0      5.4
2002 November       144.8      150.5      -5.7       5.7        32.9      4.0
2002 December       139.4      145.2      -5.8       5.8        33.9      4.2
2003 January        147.3      150.6      -3.3       3.3        11.1      2.3
2003 February       164.1      159.3       4.8       4.8        23.0      2.9
2003 March          174.8      169.2       5.6       5.6        31.0      3.2
2003 April          165.9      164.3       1.6       1.6         2.4      0.9
2003 May            154.2      151.5       2.7       2.7         7.2      1.7
2003 June           151.4      150.2       1.2       1.2         1.4      0.8
2003 July           152.4      155.6      -3.2       3.2        10.1      2.1
2003 August         162.8      157.6       5.2       5.2        26.9      3.2
2003 September      172.8      158.9      13.9      13.9       194.3      8.1
2003 October        160.3      151.4       8.9       8.9        79.2      5.6
2003 November       153.5      155.2      -1.7       1.7         2.8      1.1
2003 December       149.4      157.2      -7.8       7.8        61.0      5.2

MFE = -0.19    MAE = 4.80    MSE = 32.50    MAPE = 3.13
RMSE = 5.70    MdAPE = 3.20
Table 8.3: One-step-ahead forecasts for gasoline prices, with summary measures: April
2002-December 2003
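The summary measures in Table 8.3 follow the formulae of section 2.7 and can be reproduced directly from the actual and forecast columns (the values below are as rounded in the table, so the results agree to about two decimal places):

```python
# actual and one-step-ahead forecast values from Table 8.3 (April 2002 onward)
actual = [140.7, 142.1, 140.4, 141.2, 142.3, 142.2, 144.9, 144.8, 139.4,
          147.3, 164.1, 174.8, 165.9, 154.2, 151.4, 152.4, 162.8, 172.8,
          160.3, 153.5, 149.4]
forecast = [136.2, 142.6, 145.4, 142.8, 145.4, 149.0, 152.7, 150.5, 145.2,
            150.6, 159.3, 169.2, 164.3, 151.5, 150.2, 155.6, 157.6, 158.9,
            151.4, 155.2, 157.2]

errors = [a - f for a, f in zip(actual, forecast)]
n = len(errors)
mfe = sum(errors) / n                              # mean forecast error (bias)
mae = sum(abs(e) for e in errors) / n              # mean absolute error
mse = sum(e * e for e in errors) / n               # mean squared error
rmse = mse ** 0.5                                  # root mean squared error
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / n
```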
8.6.2 Prediction intervals
We now require prediction intervals to provide an indication of the accuracy of the
forecasts. We omit the technical details and simply note that, relative to the unknown
future value 0Y the point forecast has an estimated standard error that we write as:
0 0 0
( ) v a r ( )S E F Y F (8.11)
Given assumptions R3-R6, the forecast error follows a normal distribution and, after
allowing for the estimation of σ by S, we may specify the prediction interval using the
Student’s t distribution with the appropriate DF:
Prediction interval for the future observation Y0:
0 / 2 0
( 1) * ( )F t n K S E F
(8.12)
The 100(1-α)% prediction interval is a probability statement; it says that the probability
that the future observation will lie in the interval defined by equation (8.12) is (1-α).
Example 8.5: Construction of a prediction interval
We continue our consideration of the forecasts for May 2002, begun in Example 8.4.
Since K = 2 and n = 74 we have DF = 71. The SE is found to be 7.21. Using
t0.025(71)=1.99 the 95% prediction interval is:
142.6 ± (1.99)*(7.21) = 142.6 ± 14.35, giving the interval [128.2, 156.9]
The lower and upper prediction intervals for April 2002 through December 2003 are
given in Table 8.4. The table contains 21 one-step-ahead prediction intervals; all 21
intervals include the actual value.
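The interval calculation of equation (8.12) can be sketched as follows, using the values from this example; only the critical value is computed, while the standard error is taken from the regression output:

```python
from scipy.stats import t

forecast = 142.6
se = 7.21                                   # SE(F0) from the regression output
n, K = 74, 2
df = n - K - 1                              # 71 degrees of freedom
t_crit = t.ppf(0.975, df)                   # two-sided 95% critical value
lower = forecast - t_crit * se
upper = forecast + t_crit * se
```

Widening or narrowing the interval is then just a matter of changing the 0.975 quantile (e.g. 0.95 for a 90 percent interval).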
As for simple regression, when n is large an approximate 95 percent prediction interval is
given by F0 ± 2S. From Figure 8.4, S = 7.05, so that the under-estimation would be
modest in this case.
Forecast Month     Unleaded-Actual   Unleaded-Forecast   Lower PI   Upper PI
2002 April              140.7             136.2            121.8      150.6
2002 May                142.1             142.6            128.2      156.9
2002 June               140.4             145.4            131.0      159.8
2002 July               141.2             142.8            128.4      157.2
2002 August             142.3             145.4            131.0      159.8
2002 September          142.2             149.0            134.6      163.4
2002 October            144.9             152.7            138.3      167.2
2002 November           144.8             150.5            136.1      165.0
2002 December           139.4             145.2            130.8      159.6
2003 January            147.3             150.6            136.2      165.1
2003 February           164.1             159.3            144.8      173.8
2003 March              174.8             169.2            154.5      184.0
2003 April              165.9             164.3            149.7      179.0
2003 May                154.2             151.5            137.0      166.0
2003 June               151.4             150.2            135.7      164.7
2003 July               152.4             155.6            141.0      170.1
2003 August             162.8             157.6            143.0      172.2
2003 September          172.8             158.9            144.3      173.5
2003 October            160.3             151.4            136.8      166.0
2003 November           153.5             155.2            140.6      169.8
2003 December           149.4             157.2            142.6      171.8

Table 8.4: Prediction limits for gas prices data, one-step-ahead: April 2002-December
2003.

8.6.3 Forecasting more than one period ahead

When we wish to forecast more than one period ahead, we must provide values for all the
predictor variables over the forecasting horizon. As we discussed in section 7.7.4, there
are two possible approaches:
1. Generate forecasts for all the Xs and apply these to the original model.
2. Reformulate the model so that all unknown Xs are lagged by two (or more)
periods, as appropriate.
The first approach is more commonly applied, but it does suffer from the drawback that
the uncertainty in X is not reflected in the prediction intervals for the forecasts. It has the
advantage that different “what if” paths for X can be formulated and compared using a
single model. The second approach is somewhat more tedious, but it will be more
valuable when good forecasts for X are unavailable and it will provide more accurate
prediction intervals.
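The second approach amounts to rebuilding the data matrix with every unknown predictor lagged h periods. A sketch using pandas with illustrative numbers (not the chapter's data set):

```python
import pandas as pd

# toy monthly series; the numbers are illustrative, not the chapter's data
data = pd.DataFrame({
    "unleaded": [140.0, 141.5, 143.2, 142.8, 144.1, 146.0],
    "crude":    [2200, 2230, 2252, 2260, 2275, 2290],
})
h = 3                                       # forecast horizon in periods

# lag every unknown predictor by h so that a model fitted on model_frame
# uses only values known at the time the forecast is made
data["crude_lag_h"] = data["crude"].shift(h)
model_frame = data.dropna()                 # rows where the lagged value exists
```

Note that lagging by h sacrifices the first h observations, which is part of the cost of this approach.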
As before, neither approach is always best. Where forecasts of the Xs are unreliable, it
will usually prove better (and easier) to adopt approach 2. Exploration of the gas prices
model for multiple steps ahead is left as Exercise 8.4.
8.7 Principles
As in earlier chapters, our list of principles relies heavily upon the material in Armstrong
(2001); numbers in parentheses refer to principles listed there. In particular, in this
chapter, we draw upon the research in that volume reported by Allen and Fildes (2001).
8.1 Aim for a relatively simple model specification (Allen and Fildes, 2001)
The researcher must strike a balance between failing to include key variables and
cluttering the model with variables that have very little effect upon the outcome. For
example, the number of consumers in a market area clearly has an impact on the level of
sales. However, for a particular market area, that figure is not going to change much over
the course of a few months. Accordingly, we would not bother to include population in a
short-term model for sales forecasting.
8.2 (9.1) Tailor the forecasting model to the horizon (Armstrong, 2001)
As noted in Principle 8.1, we need to identify those variables that are important for the
forecasting horizon under consideration.
8.3 Identify important causal variables based upon underlying theory and earlier
empirical studies. Identify suitable proxy variables when the variables of interest are not
available in timely fashion (adapted from Allen and Fildes, 2001)
Expertise and background knowledge should be used to formulate a model whenever
possible. Statements such as “The stock market goes up when the AFC team wins the
Super Bowl” may be factually correct over a period of years, but they are not a reliable
guide to investment!
8.4 If the aim of the analysis is to provide pure forecasts, you must either know the
explanatory variables in advance or be able to forecast them sufficiently well to justify
their inclusion in the model (adapted from Allen and Fildes, 2001)
This principle is a more formal statement of the necessary response to the questions
“What do you know?” and “When will you know it?”
8.5 Use the Method of Ordinary Least Squares to estimate the parameters (Allen and
Fildes, 2001).
The Method of Ordinary Least Squares is strictly valid only when Assumptions R3-R5
apply but it is often a good place to start and the current form can be extended to deal
with more complex models.
8.6 (9.5) Update the estimates frequently (Armstrong, 2001)
Frequent updating involves little effort beyond recording the latest data. The new
parameter estimates will better reflect the relationships among the variables and also help
to alert the modeler to any structural changes that occur.
Summary
In this chapter we have extended time series regression models to multiple explanatory
variables, thereby greatly increasing the range and value of models that we may use for
forecasting purposes. We have also provided the basic inferential framework in terms of
parameter estimation and model testing, as well as identifying the key assumptions
underlying the models. This structure will enable us to check assumptions and refine the
models in the next chapter.
References
Allen, P.G. and Fildes, R. (2001). Econometric forecasting. In J.S. Armstrong (ed.)
Principles of Forecasting, Kluwer, Boston and Dordrecht. Pp. 300 – 362.
Anderson, D.R., Sweeney, D.J. and Williams, T.A. (2005). Statistics for Business and
Economics. South-Western: Mason, Ohio. Ninth edition.
Armstrong, J.S. (ed., 2001). Principles of Forecasting: A Handbook for Researchers and
Practitioners. Kluwer: Boston and Dordrecht.
Hull, J.C. (2009). Options, Futures, and Other Derivatives. Pearson Prentice Hall: Upper
Saddle River, NJ. Seventh edition.
Kutner, M.H., Nachtsheim, C.J., Neter, J. and Li, W. (2005). Applied Linear Statistical
Models. McGraw-Hill: Boston, MA. Fifth edition.
Exercises
8.1 The table below contains additional data on price, beyond the data quoted in
Exercise 7.1.
a. Conduct a regression analysis for Sales on Spot and Price
b. Carry out tests on the overall model and on the individual coefficients.
Summarize your conclusions.
c. Compare the performance of the two models. Which model would you
recommend?
WEEK Spots Price Sales
1 8 11 25
2 12 11 34
3 16 12 39
4 10 10 32
5 8 12 22
6 12 12 30
7 16 10 43
8 10 10 31
8.2 The table below contains additional data on advertising expenditures, beyond the
data quoted in Exercise 7.3.
a. Conduct a regression analysis for Sales on Advertising and Price
b. Carry out tests on the overall model and on the individual coefficients.
Summarize your conclusions.
c. Compare the performance of the two models. Which model would you
recommend?
Week Price Advertising Sales
1 6 10 28
2 8 20 30
3 10 30 28
4 6 10 30
5 8 10 24
6 10 10 22
7 6 20 34
8 8 10 26
9 10 10 20
10 6 30 36
11 8 30 32
12 10 20 26
8.3 Using the two variable gas prices model generate forecasts three periods ahead for
the period April 2002 to December 2003 by first generating forecasts for both the lagged
crude oil price and disposable income. Compare the estimates with those for the model
developed in the chapter. Compute the forecast accuracy measures and generate the 90
percent prediction intervals. How many of the actual values fall inside these intervals?
8.4 Develop the two variable gas prices model in such a way as to allow for direct
prediction of the prices three periods ahead. Compute the forecast accuracy measures and
generate the 90 percent prediction intervals. How many of the actual values fall inside
these intervals? Compare the results with those obtained for Exercise 8.3.
Mini-Cases
The purpose of these mini-cases is to provide opportunities for data analysis to tackle
important real-world problems. The format is essentially the same in each: a dependent
variable of interest is identified along with a plausible set of explanatory variables. The
aim is to develop a valid forecasting model for at least one period ahead; forecasts
multiple periods ahead should also be considered.
The full set of modeling steps should be examined:
Create plots of the data to look for relationships and possible unusual observations
Perform basic data analysis
Develop a multiple regression model, preferably using a hold-out sample for the
evaluation of forecasting performance
Keep in mind that there are no “right” answers, but some solutions will be more effective
than others. After you complete your statistical analysis, do not fail to ask the following
questions:
Would the data be available to enable me to make timely forecasts?
Would you feel able to justify your model to a senior manager?
If the answer to either question is NO, you have more work to do!
Mini-case 8.1: The Volatility of Google Stock
[Contributors: Christine Choi, Alex Dixon, Melissa Gong, Michael Neches and Greg
Thompson] [Google_Data.xlsx]
Volatility is a measure of uncertainty of the return realized on an asset (Hull, 2009).
Applied to financial markets, volatility of a stock price is a measure of how uncertain we
are of future stock price movements. As volatility increases, the possibility that the stock
price will appreciate or depreciate significantly also increases. This measure has
widespread implications, particularly for stock option valuation and also for volatility
indices (VIX), portfolio management, and hedging strategies. Since its initial public
offering, Google Inc. (GOOG: NASDAQ) stock has become one of the most sought after
and popular investment opportunities. The search engine giant’s stock price has
fluctuated from an IPO price of $85/share, to a high of $741/share (adjusted close
11/27/2007) down to a recent closing price of $345/share (adjusted close 2/25/2009).
This fluctuation reveals the uncertainty associated with any stock, especially for high-
tech companies with web-based models where the monetization of services can confuse
even the most sophisticated investor.
The aim of the project is to develop a multiple regression model to forecast the volatility
of Google’s stock price over the next three months. After an initial review of Google-
specific and macroeconomic data, we identified the following potential explanatory
variables:
STDEV: the volatility measure for Google stock
VOLUME: amount of trading in Google shares
P/E: the price to earnings ratio of Google stock
GDP: quarterly growth in GDP at an annualized rate (quarterly, repeated for each month)
VIX: the market volatility index
CONF: the Conference Board index on Consumer Confidence
JOBLESS: the number of claims posted for benefits
HOUSING: the number of new housing starts
Monthly data are available for the period May 2006 through January 2009 on the
variables listed above. The data were downloaded from Bloomberg.
Mini-case 8.2: Economic Factors in Homicide Rates
[Contributors : Daniel Adcock, Sybil Desangles, Gerald McSwiggan, John Siminerio,
Marc Steining and Katherine Wood] [Homicides.xlsx]
This project is an analysis of annual homicides in the United States, in which historical
data is used to predict future homicide rates. The purpose is to develop a predictive model
that estimates future homicide rates based on a number of potential explanatory and
predictor variables. This project has significant ramifications; if an effective model is
devised to forecast future homicides, the variables underlying the model could be used as
a focal point for police and government officials. It will help law enforcement by
heightening awareness, which will enable more effective homicide prevention.
Annual data are available for the period 1972-2001. The variables measured are:
HOMICIDE: the rate of homicides per 100,000 of the population
GDP: Real GDP per capita,
UNEMPLOY: the national unemployment rate (percent)
DROPOUT: the high-school drop-out rate (percent)
RECESSION: the presence (=1) or absence (=0) of a recession
CABLE: cable subscription rates
Mini-case 8.3: Forecasting Natural Gas Consumption for the DC Metropolitan Area
[Contributors: Sameer Aggarwal, Parshant Dhand, Yulia Egorov and Natasha
Heidenrich] [Natural Gas.xlsx]
The intent of the project was to develop a model to forecast natural gas consumption for
the residential sector in the Washington DC metropolitan area. The level of natural gas
consumption is influenced by a variety of factors, including local weather, the state of the
national and the local economies, dollar purchasing power (since at least some of the
natural gas is imported), and the prices for other commodities. The following variables
have been identified and recorded on a quarterly basis. The data cover the period 1997
Q1 through 2008 Q3; all data are measured quarterly.
GASCONS: Consumption of natural gas in DC metro area (million cubic feet)
AVETEMP: Average temperature for the period in the DC metro area
GDP: annualized percentage change in GDP
UNEMP: percentage unemployment in the DC metro area
GAS_PRICE: price of natural gas ($/100 cubic feet)
DOLLAR: value index for the US Dollar relative to a basket of international currencies
OIL_PRICE: price of crude oil ($/barrel)
RESERVES: reserves of natural gas (million cubic feet)
FUTURES: price of futures contract on natural gas ($/100 cubic feet)
Mini-case 8.4: Economic Factors in Property Crime
[Contributors: Jay Cafarella, Jordan Krawll, Caroline Levington, Allen Lin and
Steven Schuler] [Property crime.xlsx]
The purpose of this project is to determine the relationship between crime and the
country’s economic health. The drop in wealth that accompanies a recession may result
in an increase in the crime rate. If people don’t have jobs, have less money and are no
longer able to pay for their daily expenses, it is reasonable to believe that some will resort
to crime. Various economic indicators were identified to address this question. The
following variables are measured annually for the period 1960 – 2007:
CRIME_pc: number of property crimes reported, per capita
POP_GROWTH: the percentage change in population over the previous year
GDP_GROWTH: change in Real GDP over the previous year
UNEMP: percentage unemployment in the USA
CREDIT: the percentage growth in credit card indebtedness
S&P RETURN: the annual return on the S&P 500 Index
CPI_GROWTH: percentage change in the Consumer Price Index
INCOME_GROWTH: percentage growth in average household income
RECESSION: the presence (=1) or absence (=0) of a recession
Other variables in the spreadsheet were used in the calculation of these rates.
Mini-case 8.5: U.S. Retail & Food Service Sales
[Contributors: Doug Goff, Rich Marsden, Jeff Rodgers and Masaki Takeda] [Retail
Sales.xlsx]
The purpose of this project is to forecast how US Retail & Food Sales will fare over the
coming months. The variables considered include personal income and savings and
consumer sentiment, as well as various macroeconomic variables. Since manufacturing
costs and levels of activity are clearly important, these factors are also included. Also
considered in the analysis were three seasonal factors associated with the Easter,
Thanksgiving and Christmas holidays.
The data set includes monthly figures for the period January 2000 – December 2008.
RSALES: US Retail & Food Service Sales [$ Millions]
CONSENT: Index of US Consumer Sentiment
PRICE_OIL: Spot Price of Oil ($/barrel)
IND_PROD: Index of US Industrial Production
PERSINC: US Personal Income ($ per capita)
PERSSAV: Net US Personal Savings ($ per capita)
POPULATION: Total US Population (000s)
UNEMP: US Unemployment Rate
CPI: US Consumer Price Index
TGIVING: Indicator for Thanksgiving (November)
EASTER: Indicator for Easter (March or April)
XMAS: Indicator for Christmas (December)
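The three holiday indicators are straightforward for Thanksgiving (always November) and Christmas (always December), but Easter moves between March and April. A minimal sketch of constructing them for the January 2000 – December 2008 sample is shown below; the function and column names are our own choices for illustration, not taken from Retail Sales.xlsx. The Easter month is located with the standard Gregorian computus:

```python
def easter_month(year):
    """Month (3 or 4) in which Easter Sunday falls (Gregorian computus)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    return (h + l - 7 * m + 114) // 31

# Build one row per month with the three 0/1 seasonal indicators
rows = []
for year in range(2000, 2009):
    for month in range(1, 13):
        rows.append({
            "YEAR": year,
            "MONTH": month,
            "TGIVING": int(month == 11),                  # Thanksgiving
            "EASTER": int(month == easter_month(year)),   # March or April
            "XMAS": int(month == 12),                     # Christmas
        })
```

For example, Easter fell in April in 2000 and in March in 2008, so the EASTER indicator shifts between those two months across years, while TGIVING and XMAS are fixed.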
Mini-case 8.6: U.S. Unemployment Rates
[Contributors: Matt Egyhazy, Jay Kreider, Jim Platt, Ashley Wall and Rob Whiteside]
[Unemployment_2.xlsx]
Unemployment has multiple causes, among which five important factors are: minimum
wage laws, labor unions, efficiency wages, job search, and general economic
conditions. Minimum wages help maintain a certain standard of living for individuals,
but they also create an artificial floor that prevents wages from falling to a level
at which a greater percentage of workers would be employed. Similar to minimum wage
laws, labor unions create higher wage levels by increasing the wages and benefits of
union workers, potentially at the expense of others. Efficiency wages, the practice of
paying employees above the market equilibrium in order to elicit greater productivity,
create an excess labor supply that may also increase unemployment. The time spent on
job search as workers move from one position to another has a direct effect on
unemployment, since those workers are counted as unemployed between jobs. Finally,
general economic conditions have a direct effect on the unemployment rate: as the
economy slips, unemployment rises.
In order to build an appropriate forecasting model, monthly data were collected for the
following economic indicators:
UNEMPLOYMENT: U.S. Unemployment (percent)
MINIMUM WAGE: Nominal Minimum Wage, as enacted by Congress ($/hour)
NONFARM EARNINGS: Average income in the non-farming sector ($/year, monthly
series, annualized data)
CPI: Consumer Price Index
UNION MEMBERSHIP: percentage of labor force that is unionized
NOM GDP: U.S. Gross Domestic Product (nominal $ millions; quarterly)
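Because NOM GDP is quarterly while the other series are monthly, the frequencies must be reconciled before a monthly regression can be fitted. One simple option, sketched below with made-up figures, is to repeat each quarterly value across the three months of its quarter; interpolating between quarters is an alternative:

```python
# Hypothetical quarterly GDP figures ($ millions) for one year; the real
# series comes from Unemployment_2.xlsx / the BEA.
quarterly_gdp = [14000.0, 14150.0, 14300.0, 14250.0]

# Repeat each quarterly figure across the three months of its quarter so
# NOM GDP matches the monthly frequency of the other regressors.
monthly_gdp = []
for q_value in quarterly_gdp:
    monthly_gdp.extend([q_value] * 3)
```

The resulting step-shaped monthly series can then be aligned with UNEMPLOYMENT, CPI, and the other monthly variables in the design matrix.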
Data Sources for Mini-cases:
1. National Climatic Data Center (National Oceanic and Atmospheric Administration,
Department of Commerce),
http://lwf.ncdc.noaa.gov/oa/climate/research/cag3/md.html
2. Bureau of Labor Statistics (U.S. Department of Labor), http://www.bls.gov/
3. Energy Information Administration (U.S. Department of Energy),
http://www.eia.doe.gov/overview_hd.html and
http://tonto.eia.doe.gov/dnav/ng/ng_stor_sum_dcu_nus_m.htm
4. Bureau of Economic Analysis (U.S. Department of Commerce), http://www.bea.gov/
5. The Bureau of Justice Statistics (U.S. Department of Justice),
http://www.ojp.usdoj.gov/bjs/
6. FBI Uniform Crime Reports (National Archive of Criminal Justice Data),
http://www.fbi.gov/ucr/