
Correct or combine? Mechanically integrating judgmental forecasts with statistical methods


International Journal of Forecasting 16 (2000) 261–275
www.elsevier.com/locate/ijforecast

Paul Goodwin*
Faculty of Computer Studies and Mathematics, University of the West of England, Frenchay, Bristol BS16 1QY, UK

Abstract

A laboratory experiment and two field studies were used to compare the accuracy of three methods that allow judgmental forecasts to be integrated with statistical methods. In all three studies the judgmental forecaster had exclusive access to contextual (or non time-series) information. The three methods compared were: (i) statistical correction of judgmental biases using Theil’s optimal linear correction; (ii) combination of judgmental forecasts and statistical time-series forecasts using a simple average and (iii) correction of judgmental biases followed by combination. There was little evidence in any of the studies that it was worth going to the effort of combining judgmental forecasts with a statistical time-series forecast – simply correcting judgmental biases was usually sufficient to obtain any improvements in accuracy. The improvements obtained through correction in the laboratory experiment were achieved despite its effectiveness being weakened by variations in biases between periods. © 2000 International Institute of Forecasters. Published by Elsevier Science B.V. All rights reserved.

Keywords: Judgmental forecasting; Combining forecasts

An earlier version of this paper was presented at the Nineteenth International Symposium on Forecasting, Washington DC, June 1999.
* Tel.: +44-117-965-6261; fax: +44-117-976-3860. E-mail address: [email protected] (P. Goodwin).
0169-2070/00/$ – see front matter © 2000 International Institute of Forecasters. Published by Elsevier Science B.V. All rights reserved. PII: S0169-2070(00)00038-8

1. Introduction

Several studies have found that, in many contexts, both human judges and statistical methods have valuable and complementary contributions to make to the forecasting process (e.g., Blattberg & Hoch, 1990). For example, statistical methods are adept at filtering regular time series patterns from noisy data while judgmental forecasters tend to see false patterns in noise and to overreact to random movements in series (O’Connor, Remus & Griggs, 1993). On the other hand, when it is known that special events will occur in the future, judgment can be used to anticipate their effects, while statistical estimation of these effects may be precluded by the rarity of the events.

The integration of judgmental forecasts with statistical methods can be carried out in several ways. Voluntary integration involves supplying the judgmental forecaster with a statistical forecast, which the forecaster is then free to ignore, accept or adjust. However, a recent study by Goodwin and Fildes (1999) found that judgmental forecasters carried out voluntary integration inefficiently. They made deleterious adjustments to statistical forecasts when they were reliable and ignored these forecasts in periods when they formed an ideal baseline for adjustment. A similar study by Lim and O’Connor (1995) also found that forecasters tended to underweight statistical forecasts in favour of their own judgments, even when their attention was drawn to the superior accuracy of the statistical forecasts.

In the light of these concerns some researchers have recommended that the integration should be carried out mechanically (Lawrence, Edmundson & O’Connor, 1986; Lim & O’Connor, 1995). Combining and correction are two methods of mechanical integration that have been proposed for situations where the forecasts are expressed as point estimates.¹ In combining, the forecast is obtained by calculating a simple or weighted average of independent judgmental and statistical forecasts (Clemen, 1989). Correction methods involve the use of regression to forecast errors in judgmental forecasts. Each judgmental forecast is then corrected by removing its expected error (e.g. see Theil’s optimal linear correction (Theil, 1971)). Correction has received less attention in the literature than combination. However, arguably correction, in its simplest forms, is more convenient in that it does not require the identification, fitting and testing of an independent statistical method in addition to the elicitation of judgmental forecasts.

¹ Note that the discussion here relates to integration of judgment with statistical methods, not just statistical forecasts.

An obvious concern with using any of these integration methods arises when the judgmental forecaster has access to information about special events that is not available to the statistical method. For example, averaging a judgmental forecast, which reflects the expected high sales that will result from a promotion campaign, with a statistical forecast that takes no account of the campaign may reduce forecast accuracy. Similarly, if correction is employed, estimates of judgmental biases that will occur in forecasts for ‘special’ periods may be contaminated by the different types of biases observed in ‘normal’ periods, and vice versa. In practice, information about special events, and the fact that the judgmental forecaster used this information, may not be made explicit or recorded so that it is not possible to remove its effects from the correction model or to suspend averaging with the statistical forecast when special events apply. This is a particular danger because mechanical integration methods are likely to be most appropriate when employed by recipients of judgmental forecasts rather than the forecasters themselves (Goodwin, 1996).

This paper addresses two research questions in circumstances where the judgmental forecaster has exclusive access to non-time series information that will have an impact on the forecast variable.

1. What is the relative accuracy of forecasts obtained through (i) correction, using Theil’s optimal linear correction, (ii) combination, using a simple average of judgmental and statistical time series forecasts, and (iii) using both approaches in tandem?
2. To what extent, if any, do these methods improve judgmental forecasts, even though the judgmental forecaster has exclusive access to non-time series information?

To answer these questions data was obtained from two sources. First the integration methods were applied to judgmental forecasts made by subjects in a laboratory experiment. This data allowed the research questions to be explored under a range of controlled conditions. Then, to assess the extent to which the laboratory results can be generalised, the methods were applied to judgmental sales forecasts made by managers in two manufacturing companies.


The paper is organised as follows. In the next Section, the theory underlying the mechanical integration methods is explored and examples of the applications of these methods that have been reported in the literature are reviewed. In the subsequent Section the laboratory experiment is outlined, and the results of the application of the mechanical methods to the experimental data are presented and discussed. Following this, the application of the methods to the industrial forecasts is outlined and the results compared with those from the laboratory study.

2. Background and theory

2.1. Correcting judgmental forecasts

Theil (1971) showed that the mean squared error (MSE) of a set of forecasts can be decomposed into three elements:

$$\mathrm{MSE} = \underbrace{(\bar{Y} - \bar{F})^2}_{\text{Term 1}} + \underbrace{(S_F - rS_Y)^2}_{\text{Term 2}} + \underbrace{(1 - r^2)S_Y^2}_{\text{Term 3}} \qquad (1)$$

where $\bar{Y}$ and $\bar{F}$ are the means of the outcomes and point forecasts, respectively, $S_F$ and $S_Y$ are the standard deviations of the point forecasts and outcomes, respectively, and $r$ is the correlation between the point forecasts and outcomes.

In this decomposition, Term 1 represents mean (or level) bias. This is the systematic tendency of the forecasts to be too high or too low. Term 2 represents regression bias. This measures the extent to which the forecasts fail to track the actual observations. For example, forecasts may tend to be too high when outcomes are low and too low when outcomes are high. Term 3 is the remaining unsystematic error, which a linear correction cannot remove. Theil then showed that mean and regression bias can be eliminated from a set of past forecasts (i.e. forecasts for periods where the outcomes have been realised) by using an optimal linear correction. This simply involves regressing the actual outcomes on to the point forecasts and using the resulting intercept and slope estimates, $\hat{a}$ and $\hat{b}$, to make the correction as shown below:

$$Y_t = \hat{a} + \hat{b}F_t \qquad (2)$$

so that

$$P_t = \hat{a} + \hat{b}F_t \qquad (3)$$

where $Y_t$ is the outcome at time $t$, $F_t$ is the point forecast for time $t$ and $P_t$ is the corrected judgmental point forecast for period $t$.

Ahlburg (1984) found that the correction substantially improved forecasts of US prices and housing starts, while Shaffer (1998) found that correction of commercial forecasts of the US implicit GNP price deflator reduced the MSE of out-of-sample forecasts by either 15% or 25%, depending on the forecast lead time. Similarly, Elgers, May and Murray (1995) applied it to analysts’ company earnings forecasts and reported that it reduced the MSEs emanating from systematic bias by about 91%. In a laboratory experiment, Goodwin (1997) found that the correction was most successful where series had high levels of noise. In particular, for white noise series the correction had the effect of smoothing out the variation in the judgmental forecasts which was caused by the forecasters reacting to the random movements in the series.

2.2. Combining judgmental forecasts with statistical forecasts

The effectiveness of combining independent judgmental and statistical forecasts has been examined in several studies (see Clemen (1989) for a review). The general conclusion is that combining improves forecast accuracy because the constituent forecasts are able to capture ‘different aspects of the information available for prediction’ (Clemen). Although it is possible to use a weighted average to achieve the combination, estimating the appropriate weights when there is only a small data base of past observations is problematical (Bunn, 1987). This is likely to be a common problem in industrial contexts, particularly in industries where products are subject to rapid change and development (Watson, 1996). In fact, many studies have found that a simple mean of the two forecasts performs relatively well (de Menezes, Bunn & Taylor, 2000). Moreover, Armstrong and Collopy (1998) argue that the simple mean is particularly appropriate where series have high uncertainty and instability because, under these conditions, there will be considerable uncertainty as to which method is likely to be most accurate. (Hereafter, the term ‘combination’ will refer to the simple mean of two forecasts.)

When the constituent forecasts in a combination are free of mean bias, the MSE of the forecasts is identical to the variance of the forecast errors. In these circumstances, it is easy to show that the variance of the forecast errors of the combined forecasts, $\sigma_c^2$, is given by:

$$\sigma_c^2 = 0.25\,(\sigma_s^2 + \sigma_j^2 + 2\rho\,\sigma_s\sigma_j) \qquad (4)$$

where $\sigma_s^2$ is the variance of the errors of the statistical forecasts, $\sigma_j^2$ is the variance of the errors of the judgmental forecasts and $\rho$ is the correlation between the constituent forecasts’ errors.

It can be shown that this implies that the variance of the errors of the combined forecasts, and hence the MSE, will only be lower than that of the judgmental forecasts when:

$$\frac{\sigma_j}{\sigma_s} > \frac{\rho + (\rho^2 + 3)^{0.5}}{3} = \Phi \qquad (5)$$

If it is also the case that:

$$\frac{\sigma_j}{\sigma_s} < \frac{1}{\Phi} \qquad (6)$$

then the MSE of the combined forecast will be less than that of both the constituent forecasts. Fig. 1 shows when combination will reduce the MSE of either or both of the constituent forecasts for different values of $\rho$. For example, if the two sets of forecast errors are perfectly negatively correlated then combination will improve both forecasts if $\sigma_j/\sigma_s$ is greater than 1/3 and less than 3. Essentially, the vertical axis of the graph represents the relative inaccuracy of the judgmental forecasts when compared to the statistical forecasts, while the horizontal axis can loosely be interpreted as representing the lack of new information brought to the process by the second forecasting method.

Fig. 1. Where combination improves constituent forecasts.

2.3. Correcting judgmental forecasts before combining

When the constituent forecasts in a combination suffer from mean bias the benefits of combination will depend on the relative size and sign of the forecasts’ mean errors (i.e. $\bar{Y} - \bar{F}$). If the mean errors of the judgmental and statistical forecasts are given respectively by $v$ and $w$, then the MSE of the combined forecast will be:

$$\mathrm{MSE} = 0.25\,[(\sigma_s^2 + \sigma_j^2 + 2\rho\,\sigma_s\sigma_j) + (v + w)^2] \qquad (7)$$

Thus if $v = -w$ the bias of one forecast will cancel out that of the other. However, if the statistical forecasts are unbiased, but the mean bias of the judgmental forecasts is $v^2$ units then the combination would only remove 75% of this mean bias. Given the propensity of judgmental forecasts to suffer from biases (Bolger & Harvey, 1998), it may be beneficial to apply correction to them before combining them with the statistical forecasts – that is, a correct-then-combine strategy. Indeed, in their seminal paper on combination, Bates and Granger (1969) argued that forecasts should be corrected for bias before being combined – although their suggested correction only involved the removal of mean bias. Since Bates and Granger’s paper, much of the published theory on combination has been based on the presumption that the constituent forecasts are unbiased (e.g. Bunn, 1987).

There are, however, a number of reasons why applying a correct-then-combine strategy involving Theil’s correction might diminish the potential gains of combination. First, Theil’s correction is also designed to remove regression bias from the MSE of the judgmental forecasts, which will reduce the value of $\sigma_j/\sigma_s$ in (5). This means that, after applying correction to the judgmental forecasts, the probability that this ratio will exceed the threshold in (5), and that combination will thus improve accuracy, is reduced. Put simply, the correction might be so successful that subsequent combination cannot lead to further improvements. Secondly, if Theil’s correction successfully removes mean bias from future forecasts then it will also remove the potential benefits of mean errors of opposite signs tending to cancel each other out in the combination. Finally, it is possible that the smoothing effect that Theil’s method has on the judgmental forecasts (Goodwin, 1997) will increase the correlation of their errors with those of the statistical forecasts. This would again reduce the potential benefits of combination.

Of course, the effectiveness of applying correction to forecasts made for observations that are yet to be realised depends on the validity of the assumption that the pattern of errors is stationary over time (Moriarty, 1985; Goodwin, 1997). In many practical situations the judgmental forecast errors are unlikely to be stationary. For example, the pattern of errors in periods where foreseeable special events will occur may be different from the pattern in ‘normal’ periods when the judge has access only to time series information.

In order to compare the relative improvements in accuracy that could be obtained from mechanical integration methods, under conditions where the errors are unlikely to be stationary, the three strategies, (i) correct, (ii) combine and (iii) correct then combine, were first applied to judgmental forecasts obtained from a laboratory experiment. The details of this application are discussed in the following Section.
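As an illustration of Section 2.1, the following minimal Python sketch computes the three terms of Eq. (1) and applies the correction of Eqs. (2) and (3) to a small set of past forecasts. The function names and the sample figures are hypothetical and are not taken from the studies reported here. Population (divide-by-n) moments are assumed so that the three terms sum exactly to the MSE.

```python
def mean(values):
    return sum(values) / len(values)


def theil_decomposition(outcomes, forecasts):
    """Return (term1, term2, term3, mse) as in Eq. (1), using divide-by-n moments."""
    n = len(outcomes)
    y_bar, f_bar = mean(outcomes), mean(forecasts)
    var_y = sum((y - y_bar) ** 2 for y in outcomes) / n
    var_f = sum((f - f_bar) ** 2 for f in forecasts) / n
    cov = sum((y - y_bar) * (f - f_bar) for y, f in zip(outcomes, forecasts)) / n
    s_y, s_f = var_y ** 0.5, var_f ** 0.5
    r = cov / (s_y * s_f)                  # correlation of forecasts and outcomes
    term1 = (y_bar - f_bar) ** 2           # mean (level) bias
    term2 = (s_f - r * s_y) ** 2           # regression bias
    term3 = (1 - r ** 2) * var_y           # unsystematic error
    mse = sum((y - f) ** 2 for y, f in zip(outcomes, forecasts)) / n
    return term1, term2, term3, mse


def fit_theil_correction(outcomes, forecasts):
    """Least-squares regression of outcomes on forecasts: returns (a_hat, b_hat)."""
    y_bar, f_bar = mean(outcomes), mean(forecasts)
    cov = sum((y - y_bar) * (f - f_bar) for y, f in zip(outcomes, forecasts))
    var_f = sum((f - f_bar) ** 2 for f in forecasts)
    b_hat = cov / var_f
    a_hat = y_bar - b_hat * f_bar
    return a_hat, b_hat


# Hypothetical past outcomes and judgmental forecasts (illustrative only).
outcomes = [100, 120, 90, 130, 110, 105]
forecasts = [115, 125, 112, 128, 120, 118]

a_hat, b_hat = fit_theil_correction(outcomes, forecasts)
corrected = [a_hat + b_hat * f for f in forecasts]   # Eq. (3)

print("before:", theil_decomposition(outcomes, forecasts))
print("after: ", theil_decomposition(outcomes, corrected))
```

Because the regression here is fitted to the same periods that it corrects, Terms 1 and 2 vanish (up to rounding) for the corrected forecasts. For forecasts of outcomes that are yet to be realised, this holds only to the extent that the pattern of errors is stationary.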

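The conditions in Eqs. (4) to (7) of Sections 2.2 and 2.3 can also be checked numerically. The sketch below is an illustration only: the function names are hypothetical, and the inputs are error standard deviations, an error correlation and mean errors chosen purely for demonstration.

```python
import math


def combined_error_variance(sd_s, sd_j, rho):
    """Eq. (4): error variance of the simple mean of two unbiased forecasts."""
    return 0.25 * (sd_s ** 2 + sd_j ** 2 + 2 * rho * sd_s * sd_j)


def phi(rho):
    """Threshold on sd_j / sd_s defined in Eq. (5)."""
    return (rho + math.sqrt(rho ** 2 + 3)) / 3


def combination_improves(sd_s, sd_j, rho):
    """Return (beats_judgmental, beats_statistical) using Eqs. (5) and (6)."""
    ratio = sd_j / sd_s
    return ratio > phi(rho), ratio < 1 / phi(rho)


def combined_mse_with_bias(sd_s, sd_j, rho, v, w):
    """Eq. (7): MSE of the simple mean when the mean errors are v and w."""
    return 0.25 * ((sd_s ** 2 + sd_j ** 2 + 2 * rho * sd_s * sd_j) + (v + w) ** 2)


# Perfectly negatively correlated errors: the ratio must lie between 1/3 and 3.
print(phi(-1.0), 1 / phi(-1.0))
print(combination_improves(sd_s=1.0, sd_j=2.0, rho=-1.0))

# Unbiased statistical forecast (w = 0), judgmental mean error v = 4:
# the squared-bias contribution falls from v**2 to 0.25 * v**2.
bias_part = combined_mse_with_bias(1.0, 1.0, 0.0, v=4.0, w=0.0) - combined_error_variance(1.0, 1.0, 0.0)
print(bias_part / 4.0 ** 2)
```

The last line reproduces the point made after Eq. (7): when only the judgmental forecasts are biased, a simple average leaves a quarter of the squared mean bias in place, which is the motivation for correcting before combining.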

3. Application of methods to experimental data

3.1. Details of experiment

Judgmental forecasts were obtained from one of the treatments in an experiment reported by Goodwin and Fildes (1999). Subjects in this treatment condition saw a computer screen displaying a graph of the last 30 quarterly sales figures of a hypothetical product. These sales were occasionally affected by promotion campaigns and a bar chart showing past promotion expenditures and details of any expenditure in the next quarter was also displayed. The subjects were asked to use their judgment to produce one-period-ahead sales forecasts for the next 40 periods. After each forecast had been made the graphs were updated and subjects were informed of the sales that had occurred. The sixteen subjects, who were finalists on a Business Decision Analysis degree course at the University of the West of England, were randomly assigned to one of eight series which were obtained by varying:

(i) the complexity of the underlying time series signal – the simple signal had a constant mean of 300 units, while the complex signal had an upward trend of 1.5 units per quarter (starting from sales of 210 units at period 0) with a multiplicative seasonal pattern with seasonal indices of 0.7, 1.1, 1.3, and 0.9 for quarters 1 to 4, respectively;
(ii) the level of noise around the signal – this was either ‘low’ (independently normally distributed with a mean of 0 and a standard deviation of 18.8) or ‘high’ (as low noise, but with a standard deviation of 56.4);
(iii) the effectiveness of the promotion expenditure – in promotion periods this was either ‘weak’ (extra sales equal to 0.05 × expenditure were added to the underlying time series signal) or ‘strong’ (extra sales = 0.7 × expenditure). Promotions occurred in 21 of the 71 quarters (12 in quarters requiring forecasts).

In post-promotion periods the underlying time series observation was reduced by 50% of the previous promotion period’s effect. In practice this might occur where consumers simply bring their purchases forward by one period because of the campaign, but reduce purchases in the subsequent period to compensate (Abraham & Lodish, 1987). At the start of the experiment subjects received written instructions which included advice from the ‘sales manager’. This informed them (i) whether or not the sales had a seasonal pattern, (ii) that promotion campaigns might not have a strong effect on sales, but any positive effects were restricted to the quarter in which the campaign took place and (iii) that in quarters following effective promotion campaigns a negative effect on sales was to be expected. As an incentive, a prize of £20 was offered for the most accurate forecasts, after taking into account the estimated level of difficulty associated with forecasting each series.

When the judgmental forecasts (JUDGMENTAL) had been obtained their mechanical integration with statistical methods was carried out as follows. For each of the subjects, the first 15 of their forecasts were used to fit an initial Theil regression model (2). This was then used to produce a corrected forecast for the next period. After each period, this model was then recursively updated to take into account the judgmental forecast and sales for that period. In this way, one-period-ahead corrected forecasts were generated for the last 25 periods (CORRECT).

The statistical time series forecasts were obtained automatically by applying the expert system in the Forecast Pro package (Stellwagen & Goodrich, 1994) to the first 45 observations and using the selected method to produce one-period-ahead forecasts for the remaining 25 periods. Forecast Pro always recommended either simple exponential smoothing or the Holt–Winters method.² Subsequently, two other sets of forecasts were obtained for the last 25 periods by taking (i) the means of the judgmental forecasts and the statistical time series forecasts (COMBINE) and (ii) the means of the Theil corrected judgmental forecasts and the statistical time series forecasts (CORRECT THEN COMBINE).

² Forecast Pro has a facility for handling special events. This was deliberately not used here in order to simulate situations where non-time series information is available only to the judgmental forecaster.

3.2. Results

The forecasts for the last 25 periods were separated into three categories, depending on the type of period that was being forecast: normal, promotion and post-promotion. The evidence of the original Goodwin and Fildes (1999) study was that forecasters tended to forget about the post-promotion reduction in sales. As they were not therefore making use of information that was unavailable to the statistical methods the results for post-promotion periods are of limited interest in this study and will not be discussed here. However, as we shall see, observations for post-promotion periods still had an important influence on the fitting of the statistical models.

Forecast accuracy for the remaining types of period was measured by calculating the median absolute percentage error (MdAPE) in forecasting the time series signal (i.e. the underlying time series signal plus any promotion effects – the forecasters and the statistical methods were not expected to forecast the noise in the series). The MdAPE has been recommended as an error measure by Armstrong and Collopy (1992) when the accuracy of forecasting methods needs to be compared over a number of series. The accuracy of the three mechanical integration methods and the original judgmental forecasts was compared by carrying out, separately for normal and promotion periods, a 2 (between series type) × 2 (between noise level) × 2 (between promotion strength) × 4 (within forecasting method) repeated measures ANOVA on the MdAPEs (there were two replications for each treatment).

Table 1 shows the mean MdAPEs for normal periods (the mean MdAPE of the statistical time series forecasts is also shown for comparison). There were no significant interactions involving forecasting methods in the ANOVA, but there was a highly significant main effect ($F_{3,24} = 7.1328$, $P = 0.0014$). Comparisons of the methods, using Tukey’s honest significant difference (HSD) test, indicated that all three integration methods significantly improved on the original judgmental forecasts (all P values were < 0.05). However, there were no significant differences between the three integration methods.

Table 1
Mean MdAPEs of methods in normal periods

Method                     Mean MdAPE
JUDGMENTAL                 11.06
CORRECTION                  7.78
COMBINE                     7.52
CORRECT THEN COMBINE        5.69
Statistical time series     6.61

The mean MdAPEs for promotion periods are shown in Table 2 (for brevity these have simply been cross-tabulated with promotion effectiveness). When ANOVA was applied to this data there were two significant interactions involving forecasting method: series × noise × method ($F_{3,24} = 4.52$, $P = 0.012$) and series × promotion-effectiveness × method ($F_{3,24} = 4.43$, $P = 0.013$). An analysis of these interactions, again using Tukey’s HSD method, found that all of the
268 P. Goodwin / International Journal of Forecasting 16 (2000) 261 –275

Table 2 integration methods means that there was noMean MdAPEs for promotion periods evidence to suggest that there was anything toMethod Weak Strong be gained by combining judgment with statisti-

promotion promotion cal time series forecasts – simply correctingjudgment appeared to be sufficient. The mech-JUDGMENTAL 11.57 19.29

CORRECT 8.08 16.38 anics underlying these results are discussedCOMBINE 6.32 16.28 next.CORRECT THEN COMBINE 4.43 16.68 In normal periods, when the judgmental fore-

]]]]]casts tended to vary randomly around the signalStatistical time series 5.84 20.90– as forecasters reacted to each random move-ment in the series, the integration methodssucceeded by ‘averaging out’ some of thisrandom variation (as Theil’s method did in

integration methods significantly improved on Goodwin (1997)). This improved the consis-judgment for the trend-seasonal series where tency of the forecasts.there was either high noise or where promotion However, the Theil-corrected forecasts foreffects were weak (all P,0.05), though Theil’s normal periods still had slight mean bias, with amethod failed to improve significantly on judg- predominant tendency to forecast too high (e.g.ment for the latter series. In all other cases, the mean percentage error for flat series asthere was no significant difference between the 22.6%). This bias resulted from contaminationmethods. Thus in promotion periods the integra- of the regression model by observations fortion methods never significantly degraded the ‘non-normal’ periods. Although this bias tendedaccuracy of the judgmental forecasts and in to be reduced by subsequent combination withsome cases improved accuracy, even though the the statistical time series forecast (to 21.3% forjudgmental forecaster had exclusive access to flat series), the improvements were not suffi-information about forthcoming promotions. cient to be significant.

In promotion periods, although the integra-]]]]]]tion methods did not degrade the judgmental3.3. Discussion of laboratory experiment

forecasts, they were also less successful inresultsimproving them for series where the promotion

Three main results emerge from this labora- effects were strong. This appears to be a resulttory experiment. First, even though data used by of a combination of two factors – biasedthe statistical methods was contaminated by judgmental forecasts and integration methodsobservations for special periods, all of the weakened by the effect of observations for non-integration methods were still effective in im- promotion periods. Making forecasts for promo-proving on the judgmental forecasts for normal tion periods will have been particularly difficultperiods. Second, in promotion periods, despite where the underlying time series was complexjudgmental forecasters having exclusive access or subject to high levels of noise (Goodwin &to non-time series information, the use of Wright, 1993) and in these cases biases wereintegration still led either to improvements over likely to occur. For example, Table 3 shows theunaided judgment, or at worst, did not diminish median percentage errors in forecasting thethe accuracy of the forecasts. Third, the absence signal in promotion periods for series where theof significant differences between the three promotion effect was strong. It can be seen that,

P. Goodwin / International Journal of Forecasting 16 (2000) 261 –275 269

Table 3 in promotion periods and low actual sales inMedian percentage error on signal of judgmental forecasts post-promotion periods, thereby reducing thefor promotion periods when promotion effect was strong explanatory power of the Theil regression(note only promotion periods occurring in the last 25

model. Furthermore, in promotion periodsperiods are considered here)where promotion effects were strong, the

Subject Series type Median statistical time series forecasts were relativelypercentage errorless accurate and, since both these forecasts and

1 Flat, low noise 20.29% the Theil-corrected forecasts tended to under-2 Flat, low noise 5.81% estimate sales, their errors were positively corre-3 Flat, high noise 11.86%

lated. All of these factors were detrimental to4 Flat, high noise 20.16%the CORRECT THEN COMBINE method.5 Trend, seasonal, low noise 22.05%

6 Trend, seasonal, low noise 10.53% The laboratory experiment allowed the effec-7 Trend, seasonal, high noise 25.50% tiveness of the mechanical integration methods8 Trend, seasonal, high noise 36.76% to be assessed under controlled conditions.

However, the forecasting task employed in theexperiment may be atypical of many practicalforecasting situations. For example, it only

for the more complex and high noise series, involved a single contextual cue (promotionthere was a substantial tendency to under fore- expenditure) and hard data relating to this cuecast. Theil’s method is, of course designed to was supplied to the forecaster. In practice,correct this type of bias, while the CORRECT managers may base their forecasts on a multip-THEN COMBINE strategy should have ensured licity of cues from many sources (Lim &that the time series pattern was represented in O’Connor, 1996), while much of the infor-the forecast. mation relating to these cues may be ‘soft’, in

Despite this, there was evidence that the that it is of questionable reliability, or presentedsuccess of the mechanical integration methods in an informal verbal manner. Furthermore, inin promotion periods, where promotion effects ‘normal periods’ the pattern of sales in thewere strong, was blunted by contamination of laboratory experiment followed a regular timethe models by observations for non-promotion series pattern, undisturbed by external events. Inperiods – in particular by observations for post- some practical situations the entire time seriespromotion periods. Recall that in post-promo- may be disturbed by these events to the extenttion periods a dip in sales was expected. It that the time series pattern explains a relativeseems that, not only did subjects forget about small percentage of the variation in the series.this effect, but they also tended to make higher Finally, the laboratory forecasts were only madeforecasts for post-promotion periods than for for one period ahead (many organisations adoptnormal periods. On average, for series with a rolling forecast procedure) and the forecastersstrong promotion effects, judgmental forecasts had no expert product knowledge or priorfor post-promotion periods were 11.5% higher information on sales (e.g., as a result of con-than those for normal periods! It appeared that tracts already agreed).subjects tended to anchor on the high sales In order to test the integration methods in theobserved in the preceding promotion period. more complex circumstances that may apply inThis meant that judgmental forecasts of high many practical contexts judgmental sales fore-sales were associated both with high actual sales casts were obtained from two companies. The


next Section describes the application of the methods to this data.

4. Analysis of industrial data

4.1. European textile company

Data was obtained on the monthly forecasts and sales of each of 15 products sold by a European textile manufacturer for the period January 1995 to May 1997 (29 observations). The company manufactures a large number of soft furnishing products for both small and large UK retailers, including one in-house customer. Because the large customers usually specify exact details of their requirements well in advance, sales forecasting is only required for smaller customers and the in-house customer.

The forecasts are produced by the company’s sales department, but used by the operations department to plan production. Preliminary forecasts are made six months ahead, but these are regularly fine-tuned as the forecast period approaches. However, because manufacture of the products takes six weeks, the ‘final’ forecasts, which are the ones analysed here, also have this lead time.

The company usually runs promotion campaigns for its products twice a year in May/June and October/November, but customers also run their own campaigns. Sales staff meet regularly with customers to obtain details of their promotions and other sales information. The forecasters indicated that they used both this market information and past sales history (i.e. time series information) to arrive at their forecasts.

The three forecast integration methods were applied to the data as follows. Because of the six-week production time, the statistical methods could only have access to data up to month t when a forecast for month t+2 was required. The methods were fitted to the data for the first 17 months, allowing months 19 to 29 (the last 11 months) to be used for out-of-sample comparisons of the two-period ahead forecasts. As before, the expert system on the Forecast Pro package was used to obtain statistical time series forecasts automatically and the package selected simple exponential smoothing for all 15 series. To allow Theil’s method some flexibility to adapt to possible changes in judgmental biases over time, the regression equation used in (2) was again recursively updated after each month’s sales figure was known. The estimates of a and b at time t were then used to correct the judgmental forecast made for the sales in month t+2.

Note that the judgmental forecasters had several advantages over the statistical methods. Not only did they have access to non-time series information, but they could also delay their forecasts until 6 weeks before the forecast period and so make use of informal and preliminary sales information that was available within the statistical method’s two month lead time.

The out of sample MdAPEs of the forecasting methods, averaged over the 15 products, are shown in Table 4. The use of significance tests should be treated with caution here as the products were not randomly selected and there may be some dependence between the observations for the different products. Nevertheless, in the light of the laboratory results, significance tests were used to assess (i) whether Theil’s correction significantly improved the judgmental forecasts and (ii) whether COMBINE or CORRECT THEN COMBINE led to any greater accuracy than Theil’s correction. A one-tailed³ paired t-test showed that Theil’s correction had a significantly lower mean MdAPE than the

³ A one-tailed test on Theil’s method was considered to be justified in the light of the evidence from the laboratory study.


Table 4
Accuracy of textile company sales forecasts

Method                                                                  Mean MdAPE
Management judgment (JUDGMENTAL)                                        23.8%
Exponential smoothing                                                   25.7%
Theil’s method (CORRECT)                                                20.9%
Mean of judgment and exponential smoothing (COMBINE)                    21.2%
Mean of Theil and exponential smoothing (CORRECT THEN COMBINE)          23.3%

judgmental forecasts (t₁₄ = 1.97, P = 0.034).

Table 4 shows also that Theil’s method led to the most accurate forecasts of all the methods so there was clearly nothing to be gained by using either form of combination.

4.2. UK-based engineering company

The UK headquarters of an American company, which manufactures and sells drill bits to the international oil industry, provided data on its one-month-ahead judgmental sales forecasts. These are made by personnel who have access to information provided by the sales force. The manager responsible for forecasting estimated that, at the time of making the forecast, on average between 10% and 20% of next month’s sales are already known, because of contracts already agreed. Forecasts and outcomes were obtained for each of seven of the company’s sales regions for the period January 1993 to December 1994 (24 months).

The first 18 months were used to fit the statistical methods and the last 6 were used for out of sample comparisons. As before, Forecast Pro was used to generate the statistical time series forecasts automatically (it always selected simple exponential smoothing), and Theil’s regression equation was recursively updated after each sales value was known. Table 5 shows the MdAPEs of the forecasting methods, averaged over the seven series.

Once again a one-tailed paired t-test was used to investigate whether Theil’s correction led to a lower mean MdAPE than the judgmental forecasts. This suggested that Theil’s method did not lead to significant reductions in the mean MdAPE (t₆ = 1.35, P = 0.11). However, this result should be treated with caution for the reasons stated earlier and also because a sample of only seven sales areas were involved. In fact, Theil’s method outperformed both the original judgmental forecasts and the exponential smoothing forecasts in six of the seven series. Once again, combining did not appear to be worthwhile.
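The one-tailed paired t-tests reported for the two companies can be illustrated with a minimal sketch. It assumes one MdAPE per series for each method and tests whether the mean MdAPE of the judgmental forecasts exceeds that of the corrected forecasts. The function names and the example MdAPE values are hypothetical and are not the study's data; the tail probability is obtained by Simpson's rule integration of the t density, a choice made here only to keep the sketch free of external libraries.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def upper_tail_p(t, df, steps=2000):
    """P(T >= t), via Simpson's rule on the density between 0 and t."""
    if t == 0:
        return 0.5
    h = t / steps  # steps is even; h is negative when t is negative
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def paired_one_tailed_t(uncorrected, corrected):
    """Paired t-test of H1: mean(uncorrected) > mean(corrected)."""
    diffs = [u - c for u, c in zip(uncorrected, corrected)]
    df = len(diffs) - 1
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return t_stat, df, upper_tail_p(t_stat, df)


# Hypothetical per-series MdAPEs (illustrative values only).
judgmental_mdape = [20.0, 25.0, 18.0, 30.0, 22.0]
corrected_mdape = [18.0, 21.0, 19.0, 25.0, 20.0]

t_stat, df, p_value = paired_one_tailed_t(judgmental_mdape, corrected_mdape)
print(f"t({df}) = {t_stat:.2f}, one-tailed P = {p_value:.3f}")
```

Calling `upper_tail_p(1.97, 14)` and `upper_tail_p(1.35, 6)` gives tail probabilities that agree, to the precision quoted, with the P values of 0.034 and 0.11 reported in the text.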

Table 5
Accuracy of engineering company sales forecasts

Method                                                                  Mean MdAPE
Management judgment (JUDGMENTAL)                                        15.4%
Exponential smoothing                                                   22.2%
Theil’s method (CORRECT)                                                11.8%
Mean of judgment and exponential smoothing (COMBINE)                    14.9%
Mean of Theil and exponential smoothing (CORRECT THEN COMBINE)          14.9%
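The out-of-sample procedure behind Tables 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: Theil's correction is taken to be a regression of actual sales on the judgmental forecasts that is re-estimated whenever a new sales figure is known; the hand-written simple exponential smoothing function with a fixed `alpha` merely stands in for the Forecast Pro forecasts; and all function names and the sales figures are hypothetical.

```python
from statistics import mean, median


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        # Assumption: with no spread in the forecasts, correct the level only.
        return my - mx, 1.0
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def theil_correct(judgment, actual, t, lead=1):
    """Corrected judgmental forecast for period t (0-based index).

    Only the actuals known `lead` periods before t are used, so the
    regression is re-estimated each time a new sales figure arrives.
    """
    known = t - lead + 1
    a, b = fit_line(judgment[:known], actual[:known])
    return a + b * judgment[t]


def ses_forecast(actual, t, lead=1, alpha=0.3):
    """Simple exponential smoothing forecast for period t, made `lead` periods earlier."""
    level = actual[0]
    for y in actual[1 : t - lead + 1]:
        level = alpha * y + (1 - alpha) * level
    return level


def mdape(actual, forecast):
    """Median absolute percentage error, in percent."""
    return median(abs(a - f) / abs(a) * 100 for a, f in zip(actual, forecast))


# Hypothetical series in which judgment consistently over-forecasts.
judgment = [100, 120, 90, 150, 110, 130, 95, 140]
actual = [85, 101, 77, 125, 93, 109, 81, 117]
test_periods = range(5, 8)  # last three periods held out

corrected = {t: theil_correct(judgment, actual, t) for t in test_periods}
stat = {t: ses_forecast(actual, t) for t in test_periods}

forecasts = {
    "JUDGMENTAL": [judgment[t] for t in test_periods],
    "CORRECT": [corrected[t] for t in test_periods],
    "COMBINE": [(judgment[t] + stat[t]) / 2 for t in test_periods],
    "CORRECT THEN COMBINE": [(corrected[t] + stat[t]) / 2 for t in test_periods],
}
held_out = [actual[t] for t in test_periods]
results = {name: mdape(held_out, f) for name, f in forecasts.items()}

for name, value in results.items():
    print(f"{name}: MdAPE = {value:.1f}%")
```

Setting `lead=2` mimics the textile company case, where the estimates available at month t were used to correct the forecast for month t+2; `lead=1` corresponds to the one-month-ahead engineering forecasts.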


4.3. Discussion of industrial forecasting results

As in the laboratory study, Theil’s correction method (CORRECT) played a valuable role in improving the accuracy of the judgmental forecasts. It improved the judgmental forecasts for 15 out of the 22 industrial series and rendered the use of combination redundant. This result is consistent with other studies of forecasters in the field which have shown the effectiveness of correction, relative to statistical forecasts or combination, albeit by employing slightly more complex, correction methods (Fildes, 1991; Lawrence, O’Connor & Edmundson, in press). Why was combination not useful in the company forecasts presented here?

An analysis of the forecast errors showed that, after correction, the judgmental forecasts tend to be more accurate than the exponential smoothing forecasts and the errors of the two types of forecast were highly correlated (e.g., for the textile company the mean value of r was 0.84). As Fig. 1 shows, both of these factors were to the detriment of combination. The underlying mechanics can be seen in Fig. 2, which shows the out-of-sample forecasts for one of the products. While there is clear evidence that the judgmental forecaster is using non-time series information to anticipate movements in sales, these forecasts tend to be too high. (Structured interviews with the forecasters provided no evidence that this bias was deliberately created for political reasons or because the forecast loss function was perceived to be asymmetric.) However, once the over forecasting bias in the judgmental forecasts has been mitigated by Theil’s correction, it can be seen

Fig. 2. Actual sales and forecasts for textile company product.


that exponential smoothing has nothing to add to the ability of the forecasts to explain movements in the sales series.

Clearly, there are important differences between the laboratory and industrial data. Unlike the laboratory task, the industrial forecasts were characterised by access to continuous non-time series information (from multiple sources) about events whose effects tended to submerge the relatively ‘weak’ time series pattern. For the engineering company this non-time series information included prior knowledge of some sales. In both companies the forecasters had expert product knowledge and experience of the forecasting tasks so they were able to make good use of the non-time series information. In the case of the textile company the judgmental forecasters had a shorter lead time than the statistical method and so were able to use more recent non-time series information. Contrast this with the laboratory study where the series had a strong time series pattern, non-time series information that was only available sporadically and inexperienced, non-expert forecasters who made inefficient use of this information. Nevertheless, in both of these very different contexts the use of correction appeared to be effective and there was no evidence that greater accuracy could be achieved through combination.

5. Conclusions

This paper is based on the premise that the use of judgment in forecasting is justified when non-time series information, which may be difficult to model statistically, has high predictive power. However, the limitations of judgment mean that integration with a statistical method may be desirable. The results presented here suggest that, where useful, but difficult-to-model, non-time series information is available, the most appropriate role of statistical methods is to correct judgmental forecasts. The simple correction method used in the study was, in many cases, sufficient to obtain significant improvements in accuracy and there was little to be gained by obtaining independent statistical time-series forecasts and then combining these with the judgmental forecasts (or corrected judgmental forecasts). Moreover, the correction method appears to be robust in that it can still improve forecasts, or at worst not degrade them, even when different biases apply in different types of period – though its effectiveness is reduced by these variations. (The method may be less robust when the nature of the biases changes in a non-reversionary way over time (Goodwin, 1997)).

Of course, the extent to which these conclusions can be generalised is limited by the conditions which applied in the laboratory experiment and in the two companies studied. In particular, the relatively small sample size used in the laboratory study may have meant that the effectiveness of the CORRECT THEN COMBINE strategy was underestimated, though, of course, even this strategy involved correction. Indeed, taken together, the results presented here suggest that, relative to combination, correction may have been under represented as a recommended technique for harnessing the complementary strengths of judgment and statistical methods.

Appendix

Note that, in this study, the correction method was applied indiscriminately to all of the series in order to compare their performance. An alternative approach would have involved testing the in-sample judgmental forecasts for bias before deciding whether to apply Theil’s correction. To achieve this an F-test can be employed to test the joint hypothesis that a = 0 and b = 1 in (2) (Johnston, 1972, p. 28). However, evidence from Goodwin (1997, 1998) suggests that


this test has little value in predicting whether the correction will improve judgmental forecasts in the out-of sample periods. The limitations of this test have also been discussed in the economics ‘rational expectations’ literature (Liu & Maddala, 1992; Lopes, 1998). Research is currently being undertaken to try to develop improved methods for identifying when Theil’s correction is appropriate. In the absence of these methods, the evidence of this study is that indiscriminate correction of judgmental forecasts is likely to be worth carrying out.

References

Abraham, M. M., & Lodish, L. M. (1987). PROMOTER: An automated promotion evaluation system. Marketing Science 6, 101–123.
Ahlburg, D. A. (1984). Forecasting evaluation and improvement using Theil’s decomposition. Journal of Forecasting 3, 345–351.
Armstrong, J. S., & Collopy, F. (1992). Error measures for generalizing about forecasting methods: empirical comparisons. International Journal of Forecasting 8, 69–80.
Armstrong, J. S., & Collopy, F. (1998). Integration of statistical methods and judgment for time series forecasting: principles from empirical research. In: Wright, G., & Goodwin, P. (Eds.), Forecasting with judgment, John Wiley, Chichester, pp. 269–293.
Bates, J. M., & Granger, C. W. J. (1969). The combination of forecasts. Operational Research Quarterly 20, 451–468.
Blattberg, R. C., & Hoch, S. J. (1990). Database models and managerial intuition: 50% model + 50% manager. Management Science 36, 887–899.
Bolger, F., & Harvey, N. (1998). Heuristics and biases in judgmental forecasting. In: Wright, G., & Goodwin, P. (Eds.), Forecasting with judgment, John Wiley, Chichester, pp. 113–137.
Bunn, D. (1987). Expert use of forecasts: bootstrapping and linear models. In: Wright, G., & Ayton, P. (Eds.), Judgmental forecasting, John Wiley, Chichester, pp. 229–241.
Clemen, R. T. (1989). Combining forecasts: a review and annotated bibliography. International Journal of Forecasting 5, 559–583.
de Menezes, L. M., Bunn, D. W., & Taylor, J. W. (2000). Review of guidelines for the use of combined forecasts. European Journal of Operational Research 120, 190–204.
Elgers, P. T., May, H. L., & Murray, D. (1995). Note on adjustments to analysts’ earning forecasts based upon systematic cross-sectional components of prior-period errors. Management Science 41, 1392–1396.
Fildes, R. (1991). Efficient use of information in the formation of subjective industry forecasts. Journal of Forecasting 10, 597–617.
Goodwin, P., & Wright, G. (1993). Improving judgmental time series forecasting: a review of the guidance provided by research. International Journal of Forecasting 9, 147–161.
Goodwin, P. (1996). Statistical correction of judgmental point forecasts and decisions. Omega: International Journal of Management Science 24, 551–559.
Goodwin, P. (1997). Adjusting judgmental extrapolations using Theil’s method and discounted weighted regression. Journal of Forecasting 16, 37–46.
Goodwin, P. (1998). Interfacing judgmental forecasts with statistical methods. Unpublished PhD thesis, University of Lancaster, U.K.
Goodwin, P., & Fildes, R. (1999). Judgmental forecasts of time series affected by special events: does providing a statistical forecast improve accuracy? Journal of Behavioral Decision Making 12, 37–53.
Johnston, J. (1972). Econometric methods, 2nd ed., McGraw-Hill, New York.
Lawrence, M. J., Edmundson, R. H., & O’Connor, M. J. (1986). The accuracy of combining judgmental and statistical forecasts. Management Science 32, 1521–1532.
Lawrence, M. J., O’Connor, M., & Edmundson, R. (in press). A field study of sales forecasting accuracy and processes. European Journal of Operational Research.
Lim, J., & O’Connor, M. (1995). Judgmental adjustment of initial forecasts – its effectiveness and biases. Journal of Behavioral Decision Making 8, 149–168.
Lim, J., & O’Connor, M. (1996). Judgmental forecasting with time series and causal information. International Journal of Forecasting 12, 139–153.
Liu, P. C., & Maddala, G. S. (1992). Rationality of survey data and tests for market efficiency in the foreign exchange markets. Journal of International Money and Finance 11, 366–381.
Lopes, A. S. (1998). On the ‘restricted cointegration test’ as a test of the rational expectations hypothesis. Applied Economics 30, 269–278.


Moriarty, M. M. (1985). Design features of forecasting systems involving management judgments. Journal of Marketing Research 22, 353–364.
O’Connor, M., Remus, W., & Griggs, K. (1993). Judgmental forecasting in times of change. International Journal of Forecasting 9, 163–172.
Shaffer, S. (1998). Information content of forecast errors. Economics Letters 59, 45–48.
Stellwagen, E. A., & Goodrich, R. L. (1994). Forecast Pro for Windows, Business Forecast Systems Inc, Belmont, MA.
Theil, H. (1971). Applied economic forecasting, North-Holland Publishing Company, Amsterdam.
Watson, M. C. (1996). Forecasting in the Scottish electronics industry. International Journal of Forecasting 12, 361–371.

Biography: Paul GOODWIN is Principal Lecturer in Operational Research at the University of the West of England. His research interests focus on the role of judgment in forecasting and decision making and he received his PhD from Lancaster University in 1998. He is the co-author of Decision Analysis for Management Judgment (2nd edition) published by Wiley and co-editor of Forecasting with Judgment, also published by Wiley. He has published articles in a number of academic journals including the International Journal of Forecasting, the Journal of Forecasting, the Journal of Behavioral Decision Making and Omega.