International Journal of Forecasting 27 (2011) 777–803

An empirical analysis of neural network memory structures for basin water quality forecasting


David West, Scott Dellana

College of Business, Department of Marketing and Supply Chain Management, East Carolina University, Greenville, NC 27858-4353, United States

    Available online 15 January 2011

    Abstract

This research investigates the cumulative multi-period forecast accuracy of a diverse set of potential forecasting models for basin water quality management. The models are characterized by their short-term (memory by delay or memory by feedback) and long-term (linear or nonlinear) memory structures. The experiments are conducted as a series of forecast cycles, with a rolling origin of a constant fit size. The models are recalibrated with each cycle, and out-of-sample forecasts are generated for a five-period forecast horizon. The results confirm that the JENN and GMNN neural network models are generally more accurate than competitors for cumulative multi-period basin water quality prediction. For example, the JENN and GMNN models reduce the cumulative five-period forecast errors by as much as 50%, relative to exponential smoothing and ARIMA models. These findings are significant in view of the increasing social and economic consequences of basin water quality management, and have the potential for extension to other scientific, medical, and business applications where multi-period predictions of nonlinear time series are critical.

© 2010 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.

    Keywords: Watershed management; Short-term memory; Jordan-Elman neural network; Gamma memory neural network

    1. Introduction

The level of interest in the analysis and prediction of basin water quality has increased substantially in recent years, due to the convergence of environmental concerns and the availability of innovative computational intelligence approaches (Chau, 2006). Accurate

* Corresponding author. Tel.: +1 252 321 6380. E-mail addresses: [email protected] (D. West), [email protected] (S. Dellana).

forecasting models for basin water quality management are vital because of the social consequences of deviations from normal conditions, the complex and cumulative nature of the biological processes being modeled, and the long lead times involved in acquiring process state information.

The social consequences of ineffective watershed management are being expressed through the political actions of several international agencies. The World Health Organization's 2000 Annapolis Protocol (Bartram & Rees, 2000) established a systematic approach to managing recreational waters. In 2002, the Federal Institute of Hydrology in the European Union (EU) initiated the design and development of a computerized decision support system for the integrated river basin management of the Elbe river basin (Matthies, Berlekamp, Lautenbach, Graf, & Reimer, 2006). The EU also implemented the Bathing Water Directive in 2006 to aid in protecting the public by better regulating the pollution of recreation water (Lin, Syed, & Falconer, 2008).

Water quality systems are complex biological processes (Carlsson & Lindberg, 1998; Lindberg, 1997; Spall & Cristion, 1997). There are huge challenges in controlling substance levels in wastewater treatment, as a result of a substantial degree of variance in the composition and flow rate of the influent (Wen & Vassiliadis, 1998). Also, since the process is one of continuous flow through the system, the independence of consecutive samples cannot be assumed. Finally, the effluent quality from basin treatment facilities is studied in aggregate, along with other point and non-point sources (e.g., rainfall runoff), to assess potentially dangerous substance concentration levels.

Forecasting models typically serve as proxies for process data in basin water quality management, due to the significant challenge of obtaining real-time information on critical water quality variables. The standard Biochemical Oxygen Demand (BOD) test, which measures how quickly biological organisms use up the oxygen in a body of water, takes five days to complete (Dogan, Sengorur, & Koklu, 2009). The analysis of fecal coliforms, which can harbor dangerous bacteria or viruses, can take up to four days (He & He, 2008). This is not timely enough to support process adjustments, so real-time decisions in water quality management are often based on forecast models (ReVelle, 2000).

Novel neural network architectures that can model temporal and nonlinear data characteristics effectively are particularly valuable in water quality forecast modeling. The purpose of this research is to rigorously analyze several classes of neural network models and several linear statistical models in a basin water quality management application. An important categorical variable in the experimental design is the structure of the short- and long-term memories. The short-term temporal memory refers to the methodology employed to capture and preserve the relevant information in the time domain of the series being modeled. It can be as simple as presenting lagged values of the input variable(s) (i.e., memory by delay), or can involve current input values and functions of prior inputs (i.e., memory by feedback). The long-term structural memory is a fixed mapping of the information provided by the short-term memory to the prediction domain. The experimental design includes a wide range of forecast models categorized by their short- (delay or feedback) and long-term (linear or nonlinear) memory structures. To the best of our knowledge, this is the first research to investigate neural network models with memory by feedback in a basin water quality application. This research is also novel in its focus on cumulative multi-period forecast errors, which are more significant than single period forecasts in this application (Beck, 2005). Since the concentration of pollutants in a basin is a cumulative process, a measure of the cumulative accuracy of predictive models over a multi-period prediction horizon is particularly valuable for basin water quality management (Beck, 2005). The experimental design rigorously assesses the forecast performances of the potential models, using a rolling origin to generate a relatively large number of fit and prediction cycles. Median based error metrics are reported for both absolute and relative errors.

In the following section we review the literature on water quality management in wastewater treatment plants and basins. In Section 3, we discuss the models used in this research, with a particular focus on the memory structures involved. In Section 4, we introduce the data set used and present the experimental methodology. Sample predictive equations are reported in Section 5 and explained for all seven models. In Section 6, the forecasting experimental results are reported. We close in Section 7 with a discussion of the implications of this research and concluding remarks.

2. Forecast modeling literature for wastewater treatment and basin water quality

The earliest research on basin water quality focused on the point discharges of wastewater treatment plants. Early researchers, influenced by the serial dependence in wastewater processes, employed linear autoregressive integrated moving average (ARIMA) predictive models. They found that including a transfer function in the ARIMA model improved the accuracy of prediction for wastewater measures such as suspended solids (SS) and biochemical oxygen demand (BOD) (Berthouex & Box, 1996; Delleur & Gyasi-Agyei, 1994).

Various non-linear neural network models have been tested for modeling wastewater treatment processes, which are complex in nature, with non-linear relationships, interactions between variables, and time varying dynamics. Many researchers have studied the multilayer perceptron (MLP) neural network and recommend it for water quality prediction for measures such as SS, BOD, chemical oxygen demand (COD), dissolved oxygen, ammonia and fecal coliform. For example, Grieu, Traore, Polit, and Colprim (2005) and Hamoda, Al-Ghusain, and Hassan (1999) concluded that wastewater treatment plant performance could be predicted adequately using the MLP neural network. Hamed, Khalafallah, and Hassanien (2004) and Mjalli, Al-Asheh, and Alfadala (2007) found that MLP neural networks generally outperformed regression-based models for wastewater treatment. Tomenko, Ahmed, and Popov (2007) reported that the MLP and Radial Basis Function (RBF) neural networks produced better results than multiple regression in a wetland treatment system. Pai, Tsai, Lo, Tsai, and Lin (2007) found that an MLP neural network and a Genetic Algorithm performed comparably and were both slightly better than a Grey model (i.e., a non-linear programming differential equation) for forecasts involving wastewater treatment. Lin et al. (2008) reported that MLP neural networks are useful for making predictions of water quality, as required by the EU Bathing Water Directive. He and He (2008) conducted a study of California recreational beach water quality and concluded that MLP neural networks are useful predictors of fecal indicator bacteria. In studies by Dogan et al. (2009) and Singh, Basant, Malik, and Jain (2009), MLP neural networks were also found to be effective for the computation of river water quality.

Recently, other types of neural network architectures have been applied to water quality research as well. Jin and Englande (2006) found the RBF neural network to be better than logistic regression for predicting safe swimming conditions in Lake Pontchartrain, USA. As was mentioned earlier, Tomenko et al. (2007) found that an RBF neural network outperformed multiple regression for a wetland treatment system. Zhu, Zurcher, Rao, and Meng (1998) concluded that the time-delay neural network (TDNN) performed well and should be considered for wastewater treatment plant control. Dellana and West (2009) reported that the TDNN generally outperformed linear ARIMA models for predictions involving wastewater treatment, and recommend the TDNN for basin water quality analysis.

Several opportunities for improving on the design decisions of the collective water quality research cited are evident in Table 1. First, most authors study only a single model (frequently the MLP) or a small set of related models, and only employ a short-term memory by delay (MLP and TDNN). To the best of our knowledge, memory by feedback has not previously been investigated for water quality prediction. The literature could be strengthened by comprehensive studies that include both linear and nonlinear models and contrast the accuracies of memory by delay and memory by feedback. Second, the majority of studies use a classification framework for a time series problem, partitioning the data into training and testing, or training, validation, and testing sets for model calibration and testing. This type of design ignores the way in which most predictive models are used in real world applications. Forecast models are generally re-estimated when new and more recent information becomes available. A third problem is that limited and often inappropriate error measures have been used. Armstrong and Collopy (1992) report the R² and MSE metrics to be ineffective for assessing the performances of predictive models where extreme values are common. Instead, they recommend the use of relative error measures to compare the accuracies of predictive models and, given the extreme variations in water quality data, a median should be used as a measure of the central tendency. Table 1 reveals that the most common metrics used in prior studies are the MSE and R²; median measures have generally not been used. Finally, we observe that the water quality studies focus primarily on single-period predictions. The effect of the cumulative error over a multi-period horizon is neglected, despite the fact that the cumulative accuracy of multi-period predictions is highly significant in basin water quality management (Beck, 2005).

Table 1. Literature summary.

| Author(s) | Algorithms | Modeling methodology | Metrics | Forecast horizon |
|---|---|---|---|---|
| Delleur and Gyasi-Agyei (1994) | ARIMA-I-TF | Modeling & test (20) partitions | Visual & confidence interval | 1 period |
| Berthouex and Box (1996) | ARIMA & ARIMA-TF | Modeling & test (30) partitions | Visual, std deviation, MSE, & confidence interval | 1 period & 5 periods |
| Hamoda et al. (1999) | MLP | NA | R² | NA |
| Grieu et al. (2005) | MLP | Training (80), validation partition (20) | RAE | 1 period |
| Mjalli et al. (2007) | MLP | Training (46), validation (26) & test (13) partitions | R² | 1 period |
| Hamed et al. (2004) | MLP & Regression | Training (92–138) & test (15–61) partitions, limited to 4 cases | MAPE, MSE & R² | 1 period |
| Tomenko et al. (2007) | MLP, RBF & Regression | Training (74) & test (7) partitions | Visual, R², MSE & MAE | 1 period |
| Pai et al. (2007) | MLP, Genetic algorithm & Grey model | Training (100) & validation (46) partitions | MAPE | 1 period |
| Lin et al. (2008) | MLP | Training (420), test (210) & validation (210) partitions | RMSE & R² | 1 period |
| He and He (2008) | MLP | Training (103–118), test (30–45) & validation (36) partitions | RMSE & R² | 1 period |
| Singh et al. (2009) | MLP | Training (576), test (192) & validation (192) partitions | RMSE & R² | 1 period |
| Dogan et al. (2009) | MLP | Training (52) & test (50) partitions | MSE, R² & MARE | 1 period |
| Jin and Englande (2006) | RBF & Logit | Training (419) & validation partition (47) | Percent correct | 1 period |
| Zhu et al. (1998) | TDNN | Training (362), window with threshold | R² & visual assessment | 1 period |
| Dellana and West (2009) | TDNN & ARIMAs | Training (530/630 rolling origin, fixed periods 100/200 each) & test (530/630 rolling origin, fixed periods 5 each) | MSE & MAE | 1–5 periods |

This research investigates the predictive accuracy of both linear and nonlinear models, with short-term memory both by delay and by feedback, including the following models: exponential smoothing, ARIMA-Intervention, ARIMA-Intervention-Transfer function, MLP neural network, TDNN, Jordan-Elman neural network (JENN) and Gamma memory neural network (GMNN). The JENN and GMNN models both have short-term memory by feedback, which, to the best of our knowledge, has not previously been investigated in water quality research. The experimental design assesses the predictive ability of each model using a rolling origin with a constant fit size.


3. Memory structures for forecasting models

Eq. (1) is a generic mathematical model of the predictive models used in this research, with specific terms identified as the long- and short-term memories:

$$\hat{y} = f\Big(\sum_j w_j\, f_j\Big(\sum_i w_{ij}\, x_i\Big)\Big). \qquad (1)$$

The reader should note that this equation is written to output a single predictive value, $\hat{y}$, from a focused network with a single hidden layer. The term focused implies that the feedback is limited to input values. This does not allow for the feedback of network output values, which can result in network instability. The short-term memory consists of $i$ input streams, denoted by the variable $x_i$ (which can include time lagged values). The input values are scaled by a collection of weights, $w_{ij}$, and input to the long-term memory. The weighted inputs are then mapped into the long-term memory by $j$ activation functions (neural network terminology), $f_j$. The values formed in the long-term memory are weighted by $w_j$ and mapped by a second activation function, $f$, to a single predicted output value, $\hat{y}$.

The traditional MLP neural network is a static nonlinear mapping of inputs to outputs (De Vries & Principe, 1992). This is considered as long-term memory, since the information content of the input is converted during training into permanent weight values that are persistent and independent of time. This static mapping capability is effective for causal regression problems, but is a limitation for time series problems, since the architecture of the network is unable to recognize time relationships. This limitation is evident in Eq. (2), which expresses the activation of a single neuron in the network:

$$x_i = f\Big(\sum_j w_{ij}\, x_j + I_i\Big), \qquad (2)$$


where $f$ is a bounded nonlinear transformation, and $I_i$ allows for inputs from external sources (De Vries & Principe, 1992). The term in parentheses in Eq. (2) is frequently defined as $\text{net}_i = \sum_j w_{ij} x_j + I_i$.

The short-term gamma memory represents the recent history of the input with a family of gamma kernels $g_k(t)$, defined for $t > 0$ (Eq. (7)). For this memory structure, the kernel function is constrained as follows:

$$w(t) = \sum_{k=1}^{K} w_k\, g_k(t). \qquad (8)$$

De Vries and Principe (1992) prove that a linear combination of gamma kernels can approximate $w(t)$ with an error that can be made arbitrarily small; they also prove that the gamma memory is stable for $0 < \mu < 2$. Details of the derivation of the GMNN are given by De Vries and Principe (1992).

An effective predictive model for temporal patterns requires the integration of two memory structures in the neural network architecture, as shown in Fig. 2. The short-term gamma memory filters information in the time domain into a K dimensional representational space that preserves the dynamic properties of the system. The long-term memory is a static mapping from the time series representation to a predicted output value. The top portion of Fig. 2 depicts the integration of the short-term dynamic memory and the long-term static mapping. The short-term memory is expanded in the lower portion of Fig. 2 to reveal details of the gamma memory kernels. This form of short-term memory is referred to as memory by feedback. The nature of the delay and feedback loop is evident from the inset in the lower portion of Fig. 2.

    3.2. Memory by delay

The gamma memory neural network of Fig. 2 can be reduced to either the TDNN or the context memory of JENN networks as special cases. If μ = 1, the gamma memory reduces to a TDNN neural network with memory by delay. When μ = 1, the feedback loop of the gamma memory is eliminated, and the K information streams consist of the current observation and K − 1 lagged observations. In neural network terminology, this memory is referred to as a tap delay line (De Vries & Principe, 1992). Waibel, Hanazawa, Hinton, Shikano, and Lang (1989) specifically define a time delay memory structure, which is designed to recognize phonemes. In a similar application, Lang, Waibel, and Hinton (1990) used a time delay memory structure for word recognition.

We should also note that the static MLP neural network can be adapted to a memory by delay mechanism. The fixed nature of the short-term memory by delay ($\sum_i w_{ij} x_i$) is accomplished by restructuring the input data into a series of vectors of size K that include the current observation and K − 1 lagged values. This short-term memory representation is then mapped to a prediction in long-term memory by nonlinear functions $f_j$ and $f$ in the hidden and output layers respectively. Some obvious disadvantages of this include the need to specify input vectors a priori and the need to restructure the input data vectors for different memory depths.

    The TDNN (De Vries & Principe, 1992; Waibelet al., 1989) has memory structures similar to thatof the MLP neural network. The TDNN short-termmemory (

    i j wi j xi ) uses a pre-processor to create

    tap delays, allowing the input of a window ofobservations of fixed size K to the network (thecurrent value at time t and values for the previousK1 time periods). TDNN uses a nonlinear long-termmemory structure similar to that of the MLP neuralnetwork.

The proper selection of the input window size is important for both the MLP and the TDNN. A large value of K increases the network's dimensionality and the number of parameters to be estimated, while a small value of K risks the loss of information that characterizes the process dynamics.
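As a concrete illustration of memory by delay, the sketch below restructures a univariate series into size-K input vectors of the current plus lagged values; the function name, the toy data, and the one-step-ahead target convention are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def delay_line_inputs(series, K):
    """Restructure a univariate series into memory-by-delay input vectors.

    Each row of X holds the current observation and K - 1 lagged values
    (a tap delay line); the target is the next observation.
    """
    X, y = [], []
    for t in range(K - 1, len(series) - 1):
        # window ordered as [x(t), x(t-1), ..., x(t-K+1)]
        X.append(series[t - K + 1 : t + 1][::-1])
        y.append(series[t + 1])
    return np.array(X), np.array(y)

# Toy usage with K = 3: the current value plus two lags per input vector.
bod = np.array([20.0, 22.0, 19.0, 25.0, 30.0, 28.0])
X, y = delay_line_inputs(bod, K=3)  # X.shape == (3, 3), y.shape == (3,)
```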

    3.3. Memory by feedback

Fig. 2. Gamma memory neural network time series model.

The memory structure of the JENN can also be derived as a special case of the gamma memory where K = 1. This creates a two-dimensional representation space consisting of the current observation and a smoothed average of the current observation with the feedback of the previously smoothed value (see Eq. (9)). The concept of memory by feedback (i.e., the use of recurrent links) to provide networks with a dynamic short-term memory was first described by Elman (1990) and Jordan (1986), and is commonly referred to as context units.

The short-term temporal memory ($\sum_i w_{ij} x_i$) is implemented by incorporating a recurrent feedback of information to the input layer through a set of context neurons. The context units are essentially memory neurons that remember past activities. The feedback of temporal information can come from one or more hidden layers (Elman network), from the network output layer (Jordan network), or simply from the input layer. In an exploratory analysis, we found the last alternative, a feedback loop with an exponentially decaying temporal memory of the input stream, to be the most effective for water quality prediction, and this is the architecture used for the JENN in this research.

The determination of an output for a context unit at time t, y(t), is dependent on the prior output y(t − 1) and the current input value x(t), as defined in Eq. (9):

$$y(t) = w_1\, y(t-1) + w_2\, x(t). \qquad (9)$$

In Eq. (9), $w_2$ is an adaptive network weight estimated during the backpropagation of error training, and $w_1$ is referred to as a time constant, a value between zero and one which is defined by the user. Higher values of the time constant $w_1$ provide longer memory depths, with slower rates of exponential decay of the input values.

The GMNN has a more sophisticated temporal memory ($\sum_i w_{ij} g_i$) than the JENN (De Vries & Principe, 1992). The gamma temporal memory creates an input representation that is a series of K parallel information streams, where K is defined by the user. The first stream is the current input observation. Each successive stream creates an output that is a smoothed function of the prior stream, as defined in Eq. (10), where $g_k(t)$ is the memory output for gamma stream k at time t and $g_0(t) = x(t)$. The parameter μ is adaptive, allowing the model to estimate μ from the data.

$$g_k(t) = \mu\, g_{k-1}(t-1) + (1 - \mu)\, g_k(t-1). \qquad (10)$$
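To make Eqs. (9) and (10) concrete, the following sketch computes gamma memory streams for a fixed μ; in the actual GMNN, μ is adapted during training, and the K = 1 stream corresponds (up to the trainable input weight and a one-period input lag) to the Jordan-Elman context unit with time constant w1 = 1 − μ. The function and variable names are illustrative.

```python
import numpy as np

def gamma_memory(x, K, mu):
    """Compute gamma memory streams g_0 ... g_K for input series x (Eq. (10)).

    g[0] is the raw input, g_0(t) = x(t); each later stream is a leaky
    average of the previous one:
        g_k(t) = mu * g_{k-1}(t - 1) + (1 - mu) * g_k(t - 1).
    """
    T = len(x)
    g = np.zeros((K + 1, T))
    g[0] = x
    for t in range(1, T):
        for k in range(1, K + 1):
            g[k, t] = mu * g[k - 1, t - 1] + (1.0 - mu) * g[k, t - 1]
    return g

# Memory traces like those shown in Fig. 3: g0 (raw), g1 and g2 (smoother).
x = np.array([20.0, 22.0, 19.0, 25.0, 30.0, 28.0])
traces = gamma_memory(x, K=2, mu=0.5)
```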

    3.4. Memory properties

Memory depth, D, and memory resolution, R, are two important properties of the short-term memory depicted in Table 2. The memory depth defines the period of time over which the short-term memory can access information from the time series observations, while the memory resolution is the precision of the representation of these time series observations. The gamma short-term memory has a depth of K/μ and a resolution of μ. A notable advantage of the gamma memory is that the parameter μ, which controls the memory depth and resolution, can be adapted from the data during network training and does not have to be specified by the user a priori. This means that the memory properties will be determined empirically from the training data and can evolve over time with changes in the system dynamics. The TDNN memory depth must be specified a priori by the user and remains constant thereafter. The TDNN memory by delay is a 100% resolution of input values for a depth of K observations (the current observation and K − 1 lagged values). There is an abrupt discontinuity beyond K − 1 lags, where no information can be accessed by the TDNN memory. Since there is no filtering by the TDNN memory, any noise in the input observations is transmitted directly to the long-term memory mapping function. The context units of the Jordan-Elman networks have a memory depth of 1/μ and a resolution of μ. A time constant specified by the user establishes the memory depth and resolution.

Table 2. Neural network short-term memory properties.

| Short-term memory | Depth | Resolution | Specification |
|---|---|---|---|
| Gamma | K/μ | μ | Adapted from data |
| Delay (TDNN) | K | 1.0 | User specified |
| Context (Jordan, Elman) | 1/μ | μ | User specified |

For the purpose of illustration, we show typical gamma memory traces (g0 to g2) for BOD time series observations in Fig. 3. We caution the reader that these values have not been normalized for network input. The g0 trace is the time series observation for the current time period, and g1 and g2 are memory traces calculated from recursive smoothing with gamma kernels. Fig. 4 portrays the K = 1 memory trace for the TDNN, the Jordan-Elman context units, and the gamma memory. The TDNN trace is simply the memory delay of one time period, while the gamma and context unit traces are identical, representing a single smoothed average.

Fig. 3. Gamma memory output.

Fig. 4. Gamma, context unit and TDNN memory trace.

    3.5. Neural network training algorithms

The goal of neural network training is to identify a set of weights $w_{ij}$ which ensure that the network outputs $\hat{y}$ are close to the target output values $y$ in the fit data. The most common algorithm for the estimation of neural network weights is the backpropagation of error. While there are a number of potential error metrics, it is typical to use the squared error, which is defined in Eq. (11) for a single output predictive model:

$$E = \sum_{t=1}^{T} \frac{1}{2}\,\big(y(t) - \hat{y}(t)\big)^2. \qquad (11)$$

The backpropagation algorithm calculates the derivative of E with respect to each of the weights $w_{ij}$. The weights are then either increased or decreased, depending on the gradient of the error: weights with positive gradients are decreased, and weights with negative gradients are increased. Werbos (1990) derived conditions for evaluating error gradients and propagating the network error of Eq. (11) from the network output backwards, layer by layer through the network, using ordered partial derivatives. With $\text{net}_i$ representing the output of neuron i (Eq. (2)), the chain rule expressing the network error E with respect to a specific weight $w_{ij}$ is stated in Eq. (12):

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial \text{net}_i}\,\frac{\partial \text{net}_i}{\partial w_{ij}} = \frac{\partial E}{\partial \text{net}_i}\, x_j. \qquad (12)$$

The partial derivative of the error function with respect to $\text{net}_i$ can be derived from a knowledge of the activation functions $f$ and $f_j$ used in the long- and short-term memories in Eq. (1):

$$\frac{\partial E}{\partial \text{net}_i} = (\hat{y} - y)\,\frac{\partial f(\text{net}_i)}{\partial \text{net}_i}. \qquad (13)$$

For the hyperbolic tangent activation function used in this research ($f = \tanh$), the partial derivative in Eq. (13) can be evaluated as follows:

$$\frac{\partial E}{\partial \text{net}_i} = (\hat{y} - y)\big(1 - \tanh^2(\text{net}_i)\big) = \delta_i. \qquad (14)$$

Weight adjustments are then made by multiplying the error gradient with respect to the weight by a learning rate η:

$$\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}} = -\eta\,\delta_i\, x_j. \qquad (15)$$

A more complex training algorithm (backpropagation through time) is required when the network's short-term memory has adaptable weight parameters for the feedback of time information from prior periods. In this situation, $\text{net}_i$ of Eq. (2) has the following form (Werbos, 1990):

$$\text{net}_i = \sum_j w_{ij}\, x_j(t) + \sum_j w_{ij}\, x_j(t-1) + \sum_j w_{ij}\, x_j(t-2) + \cdots. \qquad (16)$$

Now the evaluation of the partial derivatives of Eq. (12) includes gradients ($\partial E/\partial \text{net}_i$) for prior time periods. The basic backpropagation algorithm must therefore be modified to progress backwards in time for evaluating the ordered partial derivatives (Werbos, 1990). The only neural network model used in this research which requires backpropagation through time is the GMNN, with the adaptable parameter μ. The JENN can be trained using conventional backpropagation because the short-term memory parameter is specified by the user. The reader is referred to Haykin (1994) for the details of backpropagation training algorithms.
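A minimal sketch of one static backpropagation update implementing Eqs. (11)–(15) for a single-hidden-layer tanh network follows; the layer sizes, variable names, and learning rate are illustrative assumptions, and backpropagation through time for the GMNN is not shown.

```python
import numpy as np

def backprop_step(x, y_target, W1, w2, eta=0.01):
    """One static backpropagation update for a tanh network (Eqs. (11)-(15)).

    Forward pass: hidden h = tanh(W1 @ x), output y_hat = tanh(w2 @ h).
    Backward pass: delta terms as in Eq. (14), gradient-descent updates
    delta_w = -eta * dE/dw as in Eq. (15).
    """
    net_h = W1 @ x          # short-term memory inputs to the hidden layer
    h = np.tanh(net_h)
    net_o = w2 @ h          # long-term memory mapping to the output
    y_hat = np.tanh(net_o)

    delta_o = (y_hat - y_target) * (1.0 - np.tanh(net_o) ** 2)  # Eq. (14)
    delta_h = (delta_o * w2) * (1.0 - np.tanh(net_h) ** 2)      # chain rule

    w2_new = w2 - eta * delta_o * h             # Eq. (15), output layer
    W1_new = W1 - eta * np.outer(delta_h, x)    # Eq. (15), hidden layer
    error = 0.5 * (y_target - y_hat) ** 2       # Eq. (11), one observation
    return W1_new, w2_new, error
```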

    3.6. Memory structures for linear statistical models

Prediction by exponential smoothing (Eq. (17)) is characterized as a short-term temporal memory with feedback and a linear long-term memory. The short-term temporal memory ($\sum_i w_{ij} x_i$) consists of the current value of the input and a smoothed average saved from the most recent output. The current observation is weighted by α, and the smoothed average is weighted by 1 − α, to output a prediction. The linear long-term memory is an identity function that outputs a value which is identical to the input from the short-term memory function. Eq. (17) can also be rewritten as a weighted average of all prior observations, with weights diminishing over time as a function of α.

$$\hat{y}_t = \alpha\, x_{t-1} + (1 - \alpha)\, \hat{y}_{t-1}. \qquad (17)$$
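The recursion in Eq. (17) can be sketched as follows; the initialization with the first observation and the toy values are assumptions (the study optimizes α per forecast cycle, as described in Section 5.1).

```python
def exp_smooth_forecasts(x, alpha):
    """One-step-ahead exponential smoothing (Eq. (17)).

    y_hat[t] = alpha * x[t - 1] + (1 - alpha) * y_hat[t - 1]; the smoothed
    value is initialized with the first observation, and multi-period
    forecasts simply extend the last one-step value.
    """
    y_hat = [x[0]]
    for t in range(1, len(x)):
        y_hat.append(alpha * x[t - 1] + (1 - alpha) * y_hat[-1])
    return y_hat

forecasts = exp_smooth_forecasts([20.0, 22.0, 19.0, 25.0], alpha=0.3)
```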

This research investigates ARIMA models with an intervention term (ARIMA-I) for modeling process disturbances, and ARIMA-I with a transfer function (ARIMA-I-TF) for modeling the dynamic relationship between the inputs and outputs of the process. The full ARIMA model is expressed by the following equation for a time series $y_t$ (Box & Tiao, 1973):

$$y_t = \frac{\theta(B)}{\phi(B)}\, a_t + \frac{\omega(B)}{\delta(B)}\, I_t + v(B)\, x_{t-b}. \qquad (18)$$

The first term, $\frac{\theta(B)}{\phi(B)} a_t$, is the basic ARIMA model of the undisturbed process (Chen & Liu, 1993). In this equation, B represents the back-shift operator, where $B(y_t) = y_{t-1}$. $\phi(B)$ represents the polynomial expression $(1 - \phi_1 B - \cdots - \phi_p B^p)$, which captures the autoregressive structure of the time series. $\theta(B)$ represents the polynomial $(1 - \theta_1 B - \cdots - \theta_q B^q)$, which captures the moving average structure of the time series. Finally, $a_t$ is a white noise series with distribution $N(0, \sigma_a^2)$. The second term, $\frac{\omega(B)}{\delta(B)} I_t$, is an intervention term that identifies periods where external variation is present in the dataset (Box, Jenkins, & Reinsel, 1994). The coefficient on $I_t$ is a ratio of polynomials that defines the nature of the external variation. Finally, the third term models the transfer function for exogenous variables.

ARIMA is a linear long-term memory with a short-term memory by delay. The short-term memory consists of a mechanism for presenting the current values and a specified number of historical observations and/or errors to the model (i.e., lagged values). This form of temporal memory has 100% memory resolution and a fixed memory depth. The memory depth is determined during the identification stage of the Box-Jenkins methodology, when the orders of the autoregressive and moving average components are determined. The long-term memory is a linear polynomial expression of φ and θ.

    4. Data description and experimental methodology

This section describes the data used in this research and the control variables that form the experimental methodology. These include the model memory structure, the partitioning and formation of experimental forecast cycles, the over-fitting of nonlinear models, and forecast metrics.

    4.1. Data

The data used for this study consist of 638 daily measurements of several input, process state, and effluent properties of an urban wastewater treatment plant, as reported by Poch, Bejar, and Cortes (1993). The complete data set can be downloaded from the UCI Machine Learning Repository (Asuncion & Newman, 2007). The physical systems being modeled are known for extreme variations caused by random exogenous variables. For example, a severe storm can wash away the colony of microorganisms from the BOD treatment process, causing a significant release of pollutants for several days until conditions can be restored. We investigate the accuracy of forecasting models for two time series, BOD effluent and SS effluent, since these two properties are a major concern for basin water quality management. Figs. 5 and 6 graph the BOD and SS time series respectively. Dickey-Fuller tests for both time series yielded p < 0.001. This confirms that both time series are stationary, and thus it is not necessary to either difference the series or include trend terms in the models. We also found no significant seasonal effects in the data; this is consistent with an earlier analysis by Berthouex and Box (1996).

Fig. 5. Daily wastewater BOD effluent.

Fig. 6. Daily wastewater SS effluent.
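For readers who wish to retrieve the data, a minimal sketch follows; the exact file URL, the missing-value code, and the column positions of the BOD and SS effluent variables are assumptions that should be verified against the UCI repository's documentation.

```python
import pandas as pd

# Assumed location of the UCI "Water Treatment Plant" data file (verify
# against the repository page before relying on it).
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "water-treatment/water-treatment.data")

# The raw file is comma-separated, with '?' marking missing values and the
# sample date in the first column.
raw = pd.read_csv(URL, header=None, na_values="?")

# The effluent BOD (DBO-S) and effluent SS (SS-S) columns must be mapped
# from the repository's .names file; the indices below are placeholders.
bod_effluent = raw.iloc[:, 22]  # hypothetical column index for DBO-S
ss_effluent = raw.iloc[:, 24]   # hypothetical column index for SS-S
```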

    4.2. Forecast cycles

The experimental methodology is designed to create a series of forecast cycles from the 638 data observations and to measure out-of-sample test errors for a forecast horizon of five periods. The out-of-sample evaluations are accomplished by partitioning the historical data into a set of fit data T (better known as training data in the neural network literature) and one of test data with forecast horizon N (Tashman, 2000). There is also the issue of validation data for the nonlinear models; this is discussed in Section 4.3.

There are two generic strategies for creating forecast cycles: a single fixed origin, or a rolling origin that advances through the data (Tashman, 2000). For the purposes of this research, we employ a rolling origin of fixed size T that advances the forecast origin by one time period for each forecast cycle. When the forecast origin is advanced, a new observation is added to the fit data, and the oldest data point is purged. A key distinction is that the rolling origin maintains a constant fit size T for each successive forecast cycle, while the fixed origin requires T to increase with progressive forecast cycles. For the purposes of our experiment (contrasting model accuracies for multi-period forecasts), it is important to maintain a constant quantity of fit data T for all forecast cycles. Allowing T to increase with successive forecast cycles introduces confounding sources of variability into the experimental design. As T increases, we would expect the accuracy of the models to increase, but at different rates. As T increases, there is also an increased probability of including cyclical patterns caused by weather variations (i.e., dry and rainy periods). By purging old data, we minimize the variation caused by cyclical events. Tashman (2000) reports that the fixed origin strategy is susceptible to corruption by data occurrences which are unique to that single fixed origin, and that the resulting summary statistics are merely averaging forecast errors across lead times. Tashman (2000) concludes that rolling origin strategies level the playing field in multi-period comparisons of forecasting accuracy.

Two different fit sizes T are used in this research, to ensure that the results reported are not dependent on the idiosyncrasies of a single fit size. A complication of our experiments with two different values of T is that there are more potential forecast cycles with T = 100 than with T = 200. We have chosen to include in our analysis forecast cycles from N = 101 to N = 200 for T = 100 (thereby improving the estimate of the forecast accuracy), even though these cannot be replicated as out-of-sample forecasts for T = 200. There are 530 forecast cycles for the 100 period fit size, and 430 for the 200 period fit size. Because of this decision, the reader is cautioned not to infer any model accuracy differences from the fit size variation.

While the selection of a specific fit size is somewhat arbitrary, our choices are guided by the following logic. There is a generally accepted practice that a minimum of 100 observations should be used to fit an ARIMA model. The first set of experiments is therefore conducted for a fit size of 100. This amount of training data is also sufficient for the neural network architectures, since the networks are reasonably small. To ensure that the results are not dependent on the fit size, all of the experiments are then repeated for a larger fit size of 200 observations. Both fit sizes have a sufficient number of observations to suit the ARIMA and neural network models investigated. The range of values of T investigated represents approximately 3–7 months of historical data. Fit sizes larger than 200 increase the probability of introducing confounding variability from longer term cyclical patterns, particularly from wet or dry weather patterns.

All forecast models are recalibrated for each successive forecast cycle, and out-of-sample test evaluations are generated for a forecast horizon of five periods. While recalibration is computationally more intensive, it results in improved out-of-sample measures (Tashman, 2000). Recalibrated forecast models are not limited to the data idiosyncrasies of the initial model fits, and have the freedom to change as the temporal nature of the data changes. Recalibration includes the re-initialization of weights for neural network models.

During a specific forecast cycle, the origin is fixed at position i in the data set, and the parameters for each forecasting model are calibrated for the two fit periods i + 100 and i + 200. To accommodate models with lagged input values (memory by delay), the first rolling origin starts at i = 4. Out-of-sample forecasts are produced for periods i + 100 + N and i + 200 + N, where N = 1, 2, . . . , 5 (a five-period forecast horizon). Using a fixed fit size rolling origin strategy, the forecast origin is then moved forward by one period, to i + 1, and the sequence of model parameter re-calibration and forecasting is repeated. We refer to a single estimation-forecast activity as a forecast cycle.

There is a technical issue in defining the test data for the linear and nonlinear models. Consider the smaller fit size of 100 observations, where the first window used to calibrate the models ranges from observation 4 (allowing three lagged values) to observation 103. The linear models estimated from this window generate a five-period forecast for time periods 104–108. The nature of model estimation for the nonlinear models is different, requiring pairs of input and target data points. Since observation 104 is the target value paired with the input of observation 103, time period 104 cannot be the single period forecast for nonlinear models. This would bias the results strongly in favor of the nonlinear models, because the pair of observations is in the data set used to estimate the model, and therefore does not constitute an independent test case. For the nonlinear models, the first independent test case is obtained by inputting observation 104 to predict observation 105. The nonlinear model test sets are aligned with this as the starting point, so that the forecast cycles are identical for the linear and nonlinear models.
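The rolling-origin design of Section 4.2 can be organized as a simple generator of fit/test windows; the function below is a schematic with illustrative names, and the model-fitting call it would wrap is omitted.

```python
def rolling_origin_cycles(series, fit_size, horizon=5, first_origin=3):
    """Yield (fit_window, test_window) pairs with a constant fit size.

    The origin advances one period per cycle: the newest observation joins
    the fit window and the oldest observation is purged (Section 4.2).
    """
    i = first_origin
    while i + fit_size + horizon <= len(series):
        fit = series[i : i + fit_size]                        # T observations
        test = series[i + fit_size : i + fit_size + horizon]  # 5-period horizon
        yield fit, test
        i += 1

# For the 638-observation series with fit_size=100, each cycle's model is
# re-estimated on `fit` and evaluated out-of-sample on `test`.
```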

    4.3. Over-fitting of nonlinear models

    For problems that lack temporal information, itis common to fit nonlinear neural network modelsusing observations that are presented in a randomsequence, and to monitor the progression of the out-of-sample error using an independent validation set.The amount of training can then be stopped whenthe validation error reaches its minimum and startsto increase (i.e. early stopping). This prevents modelover-fitting, a condition where the model memorizesdata idiosyncrasies, resulting in a poor generalizationto novel data. The inclusion of models with memoryby feedback in this research creates an experimentaldesign requirement that data be presented in temporalsequence (for example, see Eqs. (9) and (10)).Therefore, if validation data are used, they mustimmediately follow the fit data. The validation datacreate a separation between the fit data and theout-of-sample test data, equal to the length of thevalidation set. For the validation data to be effective,this distance would have to be a minimum of ten totwenty observations. It is obviously not possible in thiscase to generate one step predictions using nonlinearmodels that employ a validation set and early stopping.

  • aD. West, S. Dellana / International Journ

    Our resolution of this dilemma is to validatekey model parameters for the short- and long-termmemories and for the number of training cycles duringa preliminary grid search experiment (see Section 5.2for details). The optimal parameters, including thenumber of training cycles identified during the gridsearch, are used to produce the research results,obviating the need for model validation during eachexperiment. We would argue that the potential ofover-fitting the nonlinear models is minimized in thisproblem domain by the fact that the neural networkarchitectures are relatively simple. The architecturesconsist of a single temporal stream of observations andpossibly a few lagged or smoothed information inputs,a compact hidden layer, and a single output; there arerelatively few weights to estimate.

    4.4. Metrics

The choice of the metrics reported in this research was guided by the findings of Armstrong and Collopy (1992). The model forecast accuracy is measured at each of the five periods in the forecast horizon by the median absolute percent error (MdAPE), the median relative absolute error (MdRAE), and the median cumulative relative absolute error (MdCumRAE) for periods two to five. The relative absolute error is defined as the ratio of the absolute error at a given time horizon to the corresponding error for the random walk model. The forecast error metrics used in this paper are the medians of the 530 and 430 forecast cycles described previously.
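A sketch of the three median-based metrics follows, with the random walk forecast as the benchmark for the relative measures; the exact accumulation used for MdCumRAE is our reading of the definitions above, so treat it as an assumption.

```python
import numpy as np

def mdape(actual, forecast):
    """Median absolute percent error across forecast cycles."""
    return 100.0 * np.median(np.abs((actual - forecast) / actual))

def mdrae(actual, forecast, rw_forecast):
    """Median relative absolute error vs. the random walk benchmark."""
    return np.median(np.abs(actual - forecast) / np.abs(actual - rw_forecast))

def mdcumrae(actual, forecast, rw_forecast):
    """Median cumulative relative absolute error over the horizon.

    Inputs are (cycles, horizon) arrays; absolute errors are accumulated
    across the horizon within each cycle before taking the ratio.
    """
    cum_model = np.sum(np.abs(actual - forecast), axis=1)
    cum_rw = np.sum(np.abs(actual - rw_forecast), axis=1)
    return np.median(cum_model / cum_rw)
```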

    5. Definition of predictive models

This section describes the configuration and calibration of each of the forecasting models investigated. The first subsection defines the linear models, and the second describes the nonlinear neural network models.

5.1. Linear models: exponential smoothing and ARIMA

For each forecast cycle performed in this study, the smoothing constant α is optimized to minimize the mean squared error, subject to the constraints 0.1 ≤ α ≤ 0.5. Outliers at a distance greater than 3 sigma are replaced by interpolated values prior to estimating α. The mean value of α estimated for all forecast cycles is 0.30 for the fit size of 100, and 0.31 for the fit size of 200. The estimated exponential smoothing models for the first BOD and SS effluent forecasting cycles are shown below in Eqs. (19) and (20). Re-estimated models of the same form are then used to generate forecasts for subsequent data windows. Multi-period forecasts of the exponential smoothing models are extensions of the single period forecasts.

BOD effluent forecast:

$$\hat{y}_{1,t+1} = \hat{y}_{1,t} + 0.298\,(y_{1,t} - \hat{y}_{1,t}) \qquad (19)$$

SS effluent forecast:

$$\hat{y}_{2,t+1} = \hat{y}_{2,t} + 0.274\,(y_{2,t} - \hat{y}_{2,t}). \qquad (20)$$

The development of an ARIMA model begins with the identification of an appropriate model from information about the autocorrelation and partial correlation functions of the response variable. The analysis of the correlation functions of the BOD effluent and the SS effluent suggests an MA(1) model for the BOD effluent and an AR(1) model for the SS effluent. We also conducted a forecasting competition of ARIMA models, including AR(1), MA(1), ARMA(1,1) and the Berthouex and Box (1996) model, an MA(1) model of first differences applied to a natural log transformation of the time series data. The competition verified that the AR(1) model minimized the MdRAE and Akaike Information Criterion (AIC) for the SS effluent. However, an ARMA(1,1) model had a slightly lower MdRAE and AIC for the BOD effluent series. These two ARIMA model forms are used in the subsequent development of ARIMA-I models.
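As an illustration of the selected model forms, the ARMA(1,1) and AR(1) fits can be reproduced with the statsmodels package; this sketch omits the intervention and transfer-function terms, which require the outlier-detection procedure of Chen and Liu (1993).

```python
from statsmodels.tsa.arima.model import ARIMA

def fit_and_forecast(fit_window, order, horizon=5):
    """Fit an ARIMA model on one fit window and forecast the horizon."""
    result = ARIMA(fit_window, order=order).fit()
    return result.forecast(steps=horizon)

# BOD effluent: ARMA(1,1) is ARIMA order (1, 0, 1);
# SS effluent:  AR(1)     is ARIMA order (1, 0, 0).
# bod_fc = fit_and_forecast(bod_window, order=(1, 0, 1))
# ss_fc  = fit_and_forecast(ss_window, order=(1, 0, 0))
```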

The ARIMA-I models are estimated from the raw (unadjusted) data, with an intervention term included in those time periods where outliers are identified by the iterative method of Chen and Liu (1993). Summary statistics of the parameter estimates for the BOD effluent ARMA(1,1) model for each forecast cycle (430 for the larger fit size and 530 for the smaller fit size) are given in Table 3. The autoregressive parameter φ1 has a mean of 0.59 for the fit size of 100 and 0.67 for the fit size of 200, and ranges from a minimum of 0.18 to a maximum of 1.0. The moving average parameter θ1 has a mean of 0.15 for the fit size of 100, and 0.22 for the fit size of 200, with a range of −0.64 to 0.86.

Table 3. Estimation of ARIMA model parameters.

| Parameter | Autoregressive (fit size = 100) | Autoregressive (fit size = 200) | Moving average (fit size = 100) | Moving average (fit size = 200) |
|---|---|---|---|---|
| BOD effluent: Average | 0.594402 | 0.672619 | 0.14778 | 0.220821 |
| BOD effluent: Maximum | 1 | 1 | 0.858602 | 0.698399 |
| BOD effluent: Minimum | 0.18842 | 0.161373 | −0.6427 | −0.28748 |
| SS effluent: Average | 0.649516 | 0.663632 | NA | NA |
| SS effluent: Maximum | 0.999979 | 0.998723 | NA | NA |
| SS effluent: Minimum | 0.38866 | 0.431841 | NA | NA |

A typical BOD effluent forecasting equation, estimated for the first forecast cycle (fit size = 100), is given in Eq. (21):

$$\hat{y}_{1,t+1} = 21.6 + 0.95\,y_{1,t} + 0.75\,e_t + 14 I_{1,1} + 23 I_{1,2} + 133 I_{1,3} + 300 I_{1,4} + 64 I_{1,5} + 12 I_{1,6} + 22 I_{1,7}. \qquad (21)$$

The parameter estimates for the SS effluent, a first order autoregressive model, are also given in Table 3. The mean value of φ1 is 0.65, with a minimum of 0.39 and a maximum of 0.99. The model for the first forecasting cycle of the SS effluent series is shown in Eq. (22). Re-estimated models similar to Eqs. (21) and (22) are used to generate forecasts for the subsequent forecast cycles. Multi-step forecasts are generated by iteratively using forecast values from prior periods.

$$\hat{y}_{2,t+1} = 25.8 + 0.7\,y_{2,t} + 17 I_{2,1} + 23 I_{2,2} + 50 I_{2,3} + 99 I_{2,4} + 200 I_{2,5} + 58 I_{2,6} + 43 I_{2,7} + 27 I_{2,8}. \qquad (22)$$

ARIMA-I-TF models are developed by identifying other predictor variables that might provide relevant forecasting information. Potential transfer functions are analyzed by pre-whitening both the predictor series and the response series and examining the cross-correlation structure of the residuals (Box & Tiao, 1973). The cross-correlation reveals the nature of the dynamic response, as well as the time lag. The identification of potential transfer functions in this research is also guided by the work of Berthouex and Box (1996). They identify the following variables as potentially being significant: inflow of BOD (x1), inflow of SS (x2), total inflow (x3), BOD effluent (y1, for the SS forecast only), and SS effluent (y2, for the BOD forecast only).

The analysis of the pre-whitened cross-correlation functions for the wastewater treatment data indicates that the input BOD has a modest cross-correlation with effluent BOD, and that effluent SS is strongly cross-correlated with effluent BOD. Unfortunately, both lags are zero; the dynamic relationship is concurrent in time. The need to forecast the transfer function variable simultaneously introduces an additional source of error. The only significant cross-correlation for the SS effluent model is with BOD effluent, also at a lag of zero. The forms of the two transfer functions are given below in Eqs. (23) and (24):

$$\hat{y}_{1,t+1} = 0.015\,x_{1,t+1} + \frac{0.33 + 0.06B}{1 + 0.03B}\,y_{2,t+1} \qquad (23)$$

$$\hat{y}_{2,t+1} = 0.73\,y_{1,t+1}. \qquad (24)$$

5.2. Nonlinear models: MLP, TDNN, JENN and GMNN

Each nonlinear model must be configured empirically to match the model capacity to the requirements of the physical system being modeled. The most critical design decisions are the configurations of the short-term memory depth and resolution, the capacity of the long-term memory, and the output layer. A simple exploratory grid search is employed to validate the appropriate model configurations. The grid search involved testing all permutations of the number of inputs, the number of hidden layer neurons, and the length of the training cycle. Specifically, the short-term memory inputs varied from two to five by increments of one, the number of hidden neurons varied from two to twelve, and the training cycles varied from 250 to 1000 by increments of 250. In addition, the JENN time constant varied between 0.2 and 0.8 by increments of 0.2, and the trajectory length of the GMNN varied between two and five by increments of one. To avoid contaminating the independent test set, the exploratory analysis is conducted on the first 100 observations (75 training observations and 25 forecast cycles). These data samples are never part of a test set in our experimental methodology. The configuration details for all four neural network models are summarized in Table 4 for the BOD experiments and Table 5 for the SS experiments.

Table 4. Specification of neural network models for BOD experiments.

| Experiment | Model | Short-term memory | Long-term memory | Training |
|---|---|---|---|---|
| BOD 100 | MLP | 3 inputs: current and 2 lagged values | 8 nodes, hyperbolic tangent activation | Static back-propagation, 250 random iterations, momentum = 0.7, learning rate = 0.01 |
| BOD 100 | TDNN | 2 inputs: current observation and 1 lagged value | 4 nodes, hyperbolic tangent activation | Static back-propagation, 250 sequential iterations, momentum = 0.7, learning rate = 0.01 |
| BOD 100 | JENN | Input memory, time constant w1 = 0.8 | 2 nodes, hyperbolic tangent activation | Static back-propagation, 250 sequential iterations, momentum = 0.7, learning rate = 0.01 |
| BOD 100 | GMNN | 3 inputs: current observation and 2 smoothed | 8 nodes, hyperbolic tangent activation | Dynamic back-propagation, 250 sequential iterations, trajectory = 5, momentum = 0.7, learning rate = 0.01 |
| BOD 200 | MLP | 3 inputs: current and 2 lagged values | 8 nodes, hyperbolic tangent activation | Static back-propagation, 250 random iterations, momentum = 0.7, learning rate = 0.01 |
| BOD 200 | TDNN | 4 inputs: current observation and 3 lagged values | 4 nodes, hyperbolic tangent activation | Static back-propagation, 250 sequential iterations, momentum = 0.7, learning rate = 0.01 |
| BOD 200 | JENN | Input memory, time constant w1 = 0.8 | 12 nodes, hyperbolic tangent activation | Static back-propagation, 250 sequential iterations, momentum = 0.7, learning rate = 0.01 |
| BOD 200 | GMNN | 2 inputs: current observation and 1 smoothed | 8 nodes, hyperbolic tangent activation | Dynamic back-propagation, 250 sequential iterations, trajectory = 5, momentum = 0.7, learning rate = 0.01 |

Table 5. Specification of neural network models for SS experiments.

| Experiment | Model | Short-term memory | Long-term memory | Training |
|---|---|---|---|---|
| SS 100 | MLP | 3 inputs: current and 2 lagged values | 8 nodes, hyperbolic tangent activation | Static back-propagation, 250 random iterations, momentum = 0.7, learning rate = 0.01 |
| SS 100 | TDNN | 3 inputs: current and 2 lagged values | 4 nodes, hyperbolic tangent activation | Static back-propagation, 250 sequential iterations, momentum = 0.7, learning rate = 0.01 |
| SS 100 | JENN | Input memory, time constant w1 = 0.8 | 4 nodes, hyperbolic tangent activation | Static back-propagation, 250 sequential iterations, momentum = 0.7, learning rate = 0.01 |
| SS 100 | GMNN | 3 inputs: current observation and 2 smoothed | 12 nodes, hyperbolic tangent activation | Dynamic back-propagation, 250 sequential iterations, trajectory = 2, momentum = 0.7, learning rate = 0.01 |
| SS 200 | MLP | 3 inputs: current and 2 lagged values | 8 nodes, hyperbolic tangent activation | Static back-propagation, 250 random iterations, momentum = 0.7, learning rate = 0.01 |
| SS 200 | TDNN | 4 inputs: current and 3 lagged values | 8 nodes, hyperbolic tangent activation | Static back-propagation, 250 sequential iterations, momentum = 0.7, learning rate = 0.01 |
| SS 200 | JENN | Input memory, time constant w1 = 0.8 | 12 nodes, hyperbolic tangent activation | Static back-propagation, 250 sequential iterations, momentum = 0.7, learning rate = 0.01 |
| SS 200 | GMNN | 3 inputs: current observation and 2 smoothed | 8 nodes, hyperbolic tangent activation | Dynamic back-propagation, 250 sequential iterations, trajectory = 2, momentum = 0.7, learning rate = 0.01 |
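The exploratory grid search can be organized as a loop over configuration permutations, as sketched below; train_and_score is a hypothetical hook standing in for fitting a network on the 75-observation exploratory split and scoring the 25 held-out forecast cycles.

```python
import itertools

# Grid from Section 5.2: inputs 2-5, hidden neurons 2-12, and training
# cycles 250-1000 by 250 (the JENN time constant and the GMNN trajectory
# length would be added as extra axes in the same way).
INPUTS = range(2, 6)
HIDDEN = range(2, 13)
CYCLES = range(250, 1001, 250)

def grid_search(train_and_score):
    """Return the configuration with the lowest exploratory forecast error."""
    best, best_err = None, float("inf")
    for n_in, n_hid, n_cyc in itertools.product(INPUTS, HIDDEN, CYCLES):
        err = train_and_score(n_in, n_hid, n_cyc)  # hypothetical scoring hook
        if err < best_err:
            best, best_err = (n_in, n_hid, n_cyc), err
    return best
```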

6. Forecast model accuracy results for basin water quality management

This section reports the forecast error metrics for the three linear forecasting models and the four nonlinear neural network models. The errors are summarized for two time series, BOD and SS, and for two different model fit sizes, 100 and 200 observations. The error metrics used to compare the forecast performances include the MdAPE (median absolute percent error), the MdRAE (median relative absolute error), and the MdCumRAE (median cumulative relative absolute error) for each period of the five-period forecast horizon. Our discussion of the results will focus on forecast performance distinctions between the individual forecast models and the classes of models with linear and nonlinear long-term memories, and on differences in short-term temporal memory for the nonlinear neural network models. The linear forecast models consist of exponential smoothing, ARIMA-I and ARIMA-I-TF models. The nonlinear models comprise both memory by delay (MLP and TDNN) and memory by feedback (JENN and GMNN) models.

    6.1. Discussion of forecast error effect size by model

Tables 6 and 7 summarize the MdAPE results by model for the BOD and SS variables respectively. The tables also report an effect size that contrasts the magnitude of the error reduction for a given model to that of the least accurate model by forecast period. The least accurate models have effect sizes of zero, while the more accurate models have negative percentages that estimate the reduction in MdAPE achieved by the model.

For the BOD 100 experiment (top portion of Table 6), the ARIMA-I-TF is the least accurate model, with a MdAPE ranging from 19.5% in Period 1 to 24.4% in Period 5. Of the class of linear long-term memory models, exponential smoothing has a negligible effect size in Period 1 of −0.5%, but more prominent effect sizes in other periods, ranging from −7.2% in Period 2 to −9.8% in Period 5. The MdAPEs for those periods range from 20.6% in Period 2 to 22.0% in Period 5. From the nonlinear long-term memory group, the MLP neural network has the largest effect size for Period 1 (−7.4%), and the JENN has the largest effect sizes for Periods 2 to 4 (−13.1%, −17.6%, and −20.2%, respectively). The GMNN has the largest effect size for Period 5 (−18.4%). These results suggest that the effect sizes, which range from −7% to −20% for nonlinear models, are large enough to be of value to practitioners. We also observe that the effect sizes for nonlinear long-term memory models increase with longer forecast horizons, from the fairly modest level of −7% in Period 1 to −20% in Period 5, a pattern that is repeated in subsequent experiments. We attribute this to the fact that, for forecasts of physical systems that exhibit nonlinear behavior, nonlinear forecast projections will become more accurate relative to linear projections as the projection distance increases.


    The MdAPE values and effect sizes for the BOD200 fit size experiment are presented in the bottomportion of Table 6. Similar results to those aboveare observed, with a few noticeable exceptions. TheTDNN is the least accurate model for Period 1, whilethe ARIMA-I is the least accurate model for Periods

    and effect sizes for the average performances of thethree linear models and the four nonlinear models,and for the nonlinear models with memory by delayand feedback. It is clear that nonlinear forecast modelshave small or no effect sizes relative to linear modelsfor single period forecasts. The MdAPE for the linearmodels in Period 1 is 19.47%, vs. 18.44% for thenonlinear models (with a fit size of 100). For thelarger 200 fit size case, the average MdAPE is 18.1%for linear models and 18.65% for nonlinear models.This is a 3% effect size favoring linear models atthe larger fit size, and a 5.3% effect size favoringnonlinear models at the smaller fit size. As the forecasthorizon is extended beyond Period 1, the effect sizeof the nonlinear models, relative to linear models,increases for both fit sizes. The nonlinear effect sizeincreases to 6.3% (100 observations) and 10.2%(200 observations) for Period 2. The nonlinear effectsize is 10.3% and 13.0% for Period 3, 11.9%and 11.2% for Period 4, and 12.7% and 12% forPeriod 5.

The comparison of effect sizes for nonlinear models with memory by delay against those with memory by feedback suggests that, for BOD predictions, memory by feedback has a small but meaningful effect size.

Table 5
Specification of neural network models for SS experiments.

Experiment | Model | Short-term memory | Long-term memory | Training
SS 100 | MLP | 3 inputs: current and 2 lagged values | 8 nodes, hyperbolic tangent activation | Static back-propagation, 250 random iterations, momentum = 0.7, learning rate = 0.01
SS 100 | TDNN | 3 inputs: current and 2 lagged values | 4 nodes, hyperbolic tangent activation | Static back-propagation, 250 sequential iterations, momentum = 0.7, learning rate = 0.01
SS 100 | JENN | Input memory, time constant w1 = 0.8 | 4 nodes, hyperbolic tangent activation | Static back-propagation, 250 sequential iterations, momentum = 0.7, learning rate = 0.01
SS 100 | GMNN | 3 inputs: current observation and 2 smoothed values | 12 nodes, hyperbolic tangent activation | Dynamic back-propagation, 250 sequential iterations, trajectory = 2, momentum = 0.7, learning rate = 0.01
SS 200 | MLP | 3 inputs: current and 2 lagged values | 8 nodes, hyperbolic tangent activation | Static back-propagation, 250 random iterations, momentum = 0.7, learning rate = 0.01
SS 200 | TDNN | 4 inputs: current and 3 lagged values | 8 nodes, hyperbolic tangent activation | Static back-propagation, 250 sequential iterations, momentum = 0.7, learning rate = 0.01
SS 200 | JENN | Input memory, time constant w1 = 0.8 | 12 nodes, hyperbolic tangent activation | Static back-propagation, 250 sequential iterations, momentum = 0.7, learning rate = 0.01
SS 200 | GMNN | 3 inputs: current observation and 2 smoothed values | 8 nodes, hyperbolic tangent activation | Dynamic back-propagation, 250 sequential iterations, trajectory = 2, momentum = 0.7, learning rate = 0.01



Table 6
MdAPE and effect size for BOD forecasts.

                     Period 1         Period 2         Period 3         Period 4         Period 5
Forecast model       MdAPE   Effect   MdAPE   Effect   MdAPE   Effect   MdAPE   Effect   MdAPE   Effect
                     (%)     (%)      (%)     (%)      (%)     (%)      (%)     (%)      (%)     (%)

BOD 100
Exp. smoothing       19.40    0.5     20.60    7.2     20.80   12.6     21.30   14.1     22.00    9.8
ARIMA-I              19.50    0.0     21.90    1.4     23.30    2.1     24.10    2.8     24.10    1.2
ARIMA-I-TF           19.50    0.0     22.20    0.0     23.80    0.0     24.80    0.0     24.40    0.0
MLP                  18.05    7.4     21.15    4.7     21.21   10.9     20.14   18.8     21.02   13.9
TDNN                 19.50    0.0     20.80    6.3     20.70   13.0     21.80   12.1     20.40   16.4
JENN                 18.10    7.2     19.30   13.1     19.60   17.6     19.80   20.2     20.70   15.2
GMNN                 18.10    7.2     19.60   11.7     19.70   17.2     20.70   16.5     19.90   18.4
Linear               19.47    0.0     21.57    0.0     22.63    0.0     23.40    0.0     23.50    0.0
Nonlinear            18.44    5.3     20.21    6.3     20.30   10.3     20.61   11.9     20.51   12.7
Memory by delay      18.78    0.0     20.98    0.0     20.96    0.0     20.97    0.0     20.71    0.0
Memory by feedback   18.10    3.6     19.45    7.3     19.65    6.2     20.25    3.4     20.30    2.0

BOD 200
Exp. smoothing       18.50    4.6     20.50    7.7     20.40   10.1     21.20   10.9     22.00    5.6
ARIMA-I              17.60    9.3     22.20    0.0     22.70    0.0     23.80    0.0     23.30    0.0
ARIMA-I-TF           18.20    6.2     21.30    4.1     22.40    1.3     22.50    5.5     22.60    3.0
MLP                  18.15    3.1     19.58   11.8     18.91   16.7     19.52   18.0     19.77   15.2
TDNN                 19.40    0.0     19.40   12.6     18.60   18.1     20.70   13.0     21.40    8.2
JENN                 18.20    6.2     18.00   18.9     19.20   15.4     19.50   18.1     19.30   17.2
GMNN                 18.83    2.9     19.65   11.5     19.30   15.0     20.22   15.0     19.24   17.4
Linear               18.10    0.0     21.33    0.0     21.83    0.0     22.50    0.0     22.63    0.0
Nonlinear            18.65    3.0     19.16   10.2     19.00   13.0     19.99   11.2     19.93   12.0
Memory by delay      18.78    0.0     19.49    0.0     18.76    0.0     20.11    0.0     20.59    0.0
Memory by feedback   18.52    1.4     18.83    3.4     19.25    2.6     19.86    1.2     19.27    6.4

Table 7
MdAPE and effect size for SS forecasts.

                     Period 1         Period 2         Period 3         Period 4         Period 5
Forecast model       MdAPE   Effect   MdAPE   Effect   MdAPE   Effect   MdAPE   Effect   MdAPE   Effect
                     (%)     (%)      (%)     (%)      (%)     (%)      (%)     (%)      (%)     (%)

SS 100
Exp. smoothing       20.90    0.0     22.30    0.0     22.70    1.3     22.90    6.5     23.20    1.3
ARIMA-I              18.60   11.0     20.70    7.2     21.90    4.8     23.30    4.9     22.50    4.3
ARIMA-I-TF           18.90    9.6     21.10    5.4     23.00    0.0     24.50    0.0     23.50    0.0
MLP                  18.72   10.4     19.46   12.7     20.01   13.0     19.83   19.1     19.74   16.0
TDNN                 19.12    8.5     19.56   12.3     19.86   13.7     19.86   18.9     18.85   19.8
JENN                 20.32    2.8     18.48   17.1     19.20   16.5     20.69   15.6     20.76   11.7
GMNN                 19.44    7.0     20.74    7.0     22.00    4.3     21.40   12.7     20.71   11.9
Linear               19.47    0.0     21.37    0.0     22.53    0.0     23.57    0.0     23.07    0.0
Nonlinear            19.40    0.3     19.56    8.5     20.27   10.1     20.45   13.2     20.02   13.2
Memory by delay      18.92    0.0     19.51    0.0     19.94    0.0     19.85    0.0     19.30    0.0
Memory by feedback   19.88    5.1     19.61    0.5     20.60    3.3     21.05    6.0     20.74    7.5

SS 200
Exp. smoothing       20.26    0.0     22.53    0.0     22.03    0.0     22.72    1.4     22.84    1.3
ARIMA-I              18.19   10.2     20.61    8.5     21.06    4.4     23.04    0.0     22.17    4.2
ARIMA-I-TF           17.56   13.3     20.68    8.2     21.82    1.0     22.97    0.3     23.15    0.0
MLP                  18.07   10.8     19.03   15.5     19.14   13.1     19.64   14.8     18.93   18.2
TDNN                 19.59    3.3     20.19   10.4     20.44    7.2     21.68    5.9     19.91   14.0
JENN                 19.18    5.3     18.40   18.3     19.71   10.5     20.28   12.0     19.45   16.0
GMNN                 19.16    5.4     20.08   10.9     19.67   10.7     19.04   17.4     19.83   14.3
Linear               18.67    0.0     21.27    0.0     21.64    0.0     22.91    0.0     22.72    0.0
Nonlinear            19.00    1.5     19.43    7.0     19.74    5.1     20.16   17.4     19.53   12.8
Memory by delay      18.83    0.0     19.61    0.0     19.79    0.0     20.66    0.0     19.42    0.0
Memory by feedback   19.17    1.3     19.24    3.8     19.69    0.2     19.66    6.3     19.64    0.7


For the BOD 100 experiment, the effect size for memory by feedback increases from 3.6% in Period 1 to a maximum of 7.3% in Period 2, and then diminishes to 6.2%, 3.4%, and 2.0% in the subsequent periods. The pattern for the BOD 200 experiment is similar. Memory by feedback has an effect size of 1.4% for Period 1 and 3.4% for Period 2. Period 3 has an effect size of 2.6%, favoring memory by delay. The last two periods favor memory by feedback, with effect sizes of 1.2% and 6.4%. In total, nine of the ten comparisons favor memory by feedback for multi-period BOD forecasts.

The MdAPE values and effect sizes for the two SS experiments are shown in Table 7. For the SS 100 results, exponential smoothing has the highest MdAPE for Periods 1 and 2, and the ARIMA-I-TF has the highest for Periods 3 to 5. The largest effect size for single-period forecasts is achieved by the linear ARIMA-I model (11.0%). For Periods 2 to 5, the forecast models with the largest effect sizes are the JENN (17.1% for Period 2 and 16.5% for Period 3), the MLP (19.1% for Period 4), and the TDNN (19.8% for Period 5). Again, it is evident that ARIMA models do relatively well for the first-period forecast, with MdAPE values ranging from 17.5% to 18.9% for both fit sizes. The ARIMA-I-TF has the largest effect size for SS 200 in Period 1, at 13.3%. For Periods 2 to 5, the largest effects are the JENN (18.3%) in Period 2, the MLP (13.1%) in Period 3, the GMNN (17.4%) in Period 4, and the MLP (18.2%) in Period 5. The contrast between linear and nonlinear long-term memories is very similar to that of the BOD pattern discussed previously. For Period 1, nonlinear is favored by a modest 0.3% for SS 100, while linear models have a 1.5% advantage at the larger SS 200 fit size. Forecasts beyond Period 1 strongly favor nonlinear models, with effect sizes of 8.5%, 10.1%, 13.2% and 13.2% for Periods 2 to 5 in SS 100. The comparable effect sizes for SS 200 favoring nonlinear models are 7.0%, 5.1%, 17.4%, and 12.8% for Periods 2 to 5.

The performances of memory by delay and memory by feedback differ in the SS experiments. For the SS 100 results, memory by delay has larger effects in all forecast periods (5.1%, 0.5%, 3.3%, 6.0% and 7.5%). For SS 200, memory by delay is favored in Period 1 (1.3%), while memory by feedback is favored in Periods 2 to 4 (3.8%, 0.2%, 6.3%). Period 5 favors memory by delay, with an effect size of 0.7%.

6.2. Discussion of the median absolute relative error results

The results for the median of the absolute forecast error relative to the absolute error of the random walk model (MdRAE) are given in Table 8 for the BOD 100 and BOD 200 experiments. The errors are summarized by model for each of the five periods in the forecast horizon.
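As a sketch of the metric, following the relative absolute error definition of Armstrong and Collopy (1992): for each forecast cycle, the model's absolute error at a given horizon is divided by the absolute error of the random walk (no-change) forecast from the same origin, and the median of those ratios is reported. The array layout below is our own convention, not the paper's.

```python
import numpy as np

def mdrae(actuals, preds, origin_values):
    """Median relative absolute error for one forecast period.
    actuals, preds: realized values and model forecasts, one per cycle;
    origin_values: last observed value at each origin, which is the
    random-walk forecast at every horizon."""
    model_err = np.abs(preds - actuals)
    rw_err = np.abs(origin_values - actuals)  # random-walk benchmark error
    return np.median(model_err / rw_err)
```

Values below 1.0 indicate that the model beats the random walk at that horizon, which is how the entries in Tables 8 and 9 should be read.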

For the BOD 100 results, the lowest MdRAE for Period 1 is achieved by the JENN (0.902) model. In the second period, the TDNN has the lowest MdRAE, at 0.781, while the JENN is lowest for Periods 3 to 5 (0.795, 0.722, and 0.790). The GMNN is a close second to the JENN in Period 1 and Periods 3 to 5. These results reinforce the patterns observed in the effect size analysis of Table 6, where nonlinear models with memory by feedback were increasingly favored at longer forecast horizons. For the BOD 200 results, the ARIMA-I-TF model has the lowest relative error for one-period forecasts, with a MdRAE of 0.937. The JENN has the lowest error for all longer forecasts in Periods 2 to 5, with MdRAEs of 0.783, 0.731, 0.652, and 0.748.

The average of the median relative errors is calculated for both linear and nonlinear long-term memory models, and these averages highlight the pattern observed earlier: the relative accuracy of nonlinear models increases at longer forecast horizons.

Table 9 documents the MdRAE forecast errors for the SS 100 and SS 200 experiments. The ARIMA-I model has the lowest MdRAE for single-period forecasts in both cases, 0.918 for SS 100 and 0.914 for SS 200. For the SS 100 results, the JENN has the lowest MdRAE in Periods 2 (0.812), 3 (0.774), and 5 (0.688), while the GMNN has the lowest MdRAE in Period 4 (0.739).

The analysis of MdRAE values for the SS 200 reveals patterns similar to those for the SS 100 for Periods 2 to 5. The JENN has the lowest error for Periods 2 and 3 (0.769, 0.740), while the GMNN is slightly lower than the JENN for Periods 4 and 5 (0.696, 0.683). The nonlinear models with memory by feedback are again favored for forecast horizons of two to five periods.

Table 8
MdRAE for BOD forecasts.

Forecast model     Period 1  Period 2  Period 3  Period 4  Period 5

BOD 100
Exp. smoothing     0.955     0.902     0.803     0.776     0.836
ARIMA-I            0.943     0.923     0.870     0.858     0.904
ARIMA-I-TF         0.942     0.921     0.867     0.844     0.917
Average linear     0.947     0.915     0.847     0.826     0.886
MLP                0.985     0.905     0.925     0.845     0.882
TDNN               0.935     0.781     0.810     0.786     0.800
JENN               0.902     0.819     0.795     0.722     0.790
GMNN               0.915     0.818     0.796     0.743     0.794
Average nonlinear  0.934     0.831     0.832     0.774     0.816

BOD 200
Exp. smoothing     0.973     0.867     0.799     0.777     0.834
ARIMA-I            0.957     0.933     0.866     0.860     0.968
ARIMA-I-TF         0.937     0.919     0.824     0.808     0.895
Average linear     0.956     0.906     0.830     0.815     0.899
MLP                0.982     0.812     0.812     0.774     0.841
TDNN               1.015     0.849     0.830     0.735     0.834
JENN               1.013     0.783     0.731     0.652     0.748
GMNN               0.971     0.824     0.804     0.767     0.804
Average nonlinear  0.995     0.817     0.794     0.732     0.807

Table 9
MdRAE for SS forecasts.

Forecast model     Period 1  Period 2  Period 3  Period 4  Period 5

SS 100
Exp. smoothing     1.110     0.918     0.825     0.795     0.788
ARIMA-I            0.918     0.874     0.835     0.791     0.775
ARIMA-I-TF         0.948     0.884     0.860     0.803     0.793
Average linear     0.992     0.892     0.840     0.796     0.785
MLP                0.969     0.872     0.775     0.804     0.775
TDNN               1.013     0.894     0.781     0.762     0.702
JENN               0.983     0.812     0.774     0.753     0.688
GMNN               1.035     0.843     0.836     0.739     0.719
Average nonlinear  1.000     0.855     0.792     0.764     0.721

SS 200
Exp. smoothing     1.114     0.876     0.816     0.760     0.756
ARIMA-I            0.914     0.859     0.815     0.800     0.775
ARIMA-I-TF         0.951     0.856     0.830     0.807     0.777
Average linear     0.993     0.863     0.820     0.788     0.769
MLP                0.978     0.772     0.815     0.736     0.711
TDNN               1.010     0.833     0.782     0.714     0.747
JENN               1.018     0.769     0.740     0.701     0.687
GMNN               0.970     0.813     0.762     0.696     0.683
Average nonlinear  0.994     0.796     0.774     0.711     0.706

6.3. Discussion of the cumulative relative absolute error results

The most significant measure of forecast errors for watershed management is the median cumulative relative absolute error (MdCumRAE) for the five-period forecast horizon. This section discusses the MdCumRAE performances for the two, three, four, and five period forecast horizons. We identify this error as MdCumRAE(i), with i being an index of the number of forecast periods cumulated in the error. The calculation of the cumulative error is given by Armstrong and Collopy (1992), and is reported in Table 10 for the BOD forecasts and Table 11 for the SS forecasts.

For the BOD 100 results in Table 10, the JENN has the lowest MdCumRAE(2) at 0.931, only slightly smaller than the ARIMA-I-TF value of 0.934. It is evident that the cumulative forecast ability of the nonlinear models, and the JENN in particular, improves significantly at longer forecast horizons. The JENN has the lowest MdCumRAE for Periods 3 to 5, with MdCumRAE values of 0.893, 0.899 and 0.878, respectively. The JENN with memory by feedback achieves the lowest MdCumRAE(5) (0.878), followed by the TDNN and the GMNN, at 0.894 and 0.896 respectively.

The BOD 200 results are similar to the BOD 100 results, except that the ARIMA-I-TF has the lowest MdCumRAE(2), at 0.920. The JENN again has the lowest MdCumRAE values for Periods 3 to 5 (0.893, 0.886, and 0.870). The three best forecast performances for the MdCumRAE(5) are the JENN (0.870), GMNN (0.889), and TDNN (0.896) models.

Table 10
MdCumRAE for BOD forecasts.

Forecast model   MdCumRAE(2)  MdCumRAE(3)  MdCumRAE(4)  MdCumRAE(5)

BOD 100
Exp. smoothing   1.000        0.937        0.937        0.936
ARIMA-I          0.943        0.953        0.944        0.961
ARIMA-I-TF       0.934        0.941        0.933        0.959
MLP              0.987        0.970        0.964        0.960
TDNN             0.952        0.930        0.919        0.894
JENN             0.931        0.893        0.899        0.878
GMNN             0.960        0.917        0.900        0.896

BOD 200
Exp. smoothing   0.987        0.937        0.937        0.932
ARIMA-I          0.943        0.955        0.951        0.961
ARIMA-I-TF       0.920        0.934        0.928        0.944
MLP              0.960        0.928        0.922        0.902
TDNN             0.989        0.937        0.908        0.896
JENN             0.948        0.893        0.886        0.870
GMNN             0.964        0.908        0.916        0.889

Table 11 summarizes the MdCumRAE values for the SS 100 and SS 200 results. A linear model, the ARIMA-I, has the lowest MdCumRAE(2) value for the SS 100 results (0.923), followed by the ARIMA-I-TF (0.930). For all subsequent forecasts, the lowest cumulative errors are for the JENN (0.894 for three periods, 0.879 for four periods, and 0.851 for five periods). The ranking of the most accurate models for MdCumRAE(5) is the JENN (0.851), followed by the MLP (0.864), the TDNN (0.873), and the GMNN (0.893).

The SS 200 results parallel the discussion of SS 100. The lowest MdCumRAE(2) errors are for the ARIMA-I model at 0.917, and the ARIMA-I-TF model at 0.921. The GMNN has the lowest cumulative errors for Periods 3 (0.893) and 4 (0.859), while the JENN has the lowest MdCumRAE(5), at 0.841, followed by the GMNN (0.855), the TDNN (0.864), and the MLP (0.866).

Table 11
MdCumRAE for SS forecasts.

Forecast model   MdCumRAE(2)  MdCumRAE(3)  MdCumRAE(4)  MdCumRAE(5)

SS 100
Exp. smoothing   1.000        0.937        0.962        0.934
ARIMA-I          0.923        0.904        0.901        0.895
ARIMA-I-TF       0.930        0.929        0.918        0.917
MLP              0.944        0.902        0.890        0.864
TDNN             0.975        0.949        0.900        0.873
JENN             0.976        0.894        0.879        0.851
GMNN             0.980        0.936        0.911        0.893

SS 200
Exp. smoothing   1.000        0.918        0.942        0.931
ARIMA-I          0.917        0.911        0.892        0.894
ARIMA-I-TF       0.921        0.906        0.903        0.900
MLP              0.955        0.919        0.898        0.866
TDNN             0.971        0.933        0.889        0.864
JENN             0.977        0.920        0.882        0.841
GMNN             0.960        0.893        0.859        0.855
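A companion sketch of the cumulative measure: absolute errors are summed over the first i horizons of each forecast cycle before taking the ratio to the corresponding random-walk sum, and the median is then taken across cycles. The definition follows Armstrong and Collopy (1992); the array conventions below are our own.

```python
import numpy as np

def mdcumrae(actuals, preds, origin_values, i):
    """Median cumulative relative absolute error over the first i periods.
    actuals, preds: arrays of shape (n_cycles, horizon);
    origin_values: last observed value at each origin, shape (n_cycles,)."""
    model_cum = np.abs(preds[:, :i] - actuals[:, :i]).sum(axis=1)
    rw_preds = origin_values[:, None]              # random-walk forecast
    rw_cum = np.abs(rw_preds - actuals[:, :i]).sum(axis=1)
    return np.median(model_cum / rw_cum)
```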

We conclude the discussion of MdCumRAE by examining the graphical patterns consolidated for all four experiments: BOD 100, BOD 200, SS 100, and SS 200. Fig. 7 depicts the MdCumRAE values for averages of models with linear and nonlinear long-term memories. The consolidated results confirm earlier findings: the accuracy of nonlinear models for BOD and SS predictions relative to linear models increases as the forecast horizon increases. The consolidated results for the short-term memory structure in Fig. 8 suggest a slight advantage of memory by feedback models at all forecast horizons from one to five.

Fig. 7. Consolidated MdCumRAE by forecast horizon.

Fig. 8. Consolidated MdCumRAE by forecast horizon.

6.4. Discussion of confidence intervals and statistical significance

In this subsection, statistical significance tests are presented for the most critical metric in the experimental design: the five-period cumulative relative error (MdCumRAE(5)). Since the results for the two fit periods are comparable, significance tests for the 100 observation fit size will be discussed. We follow the advice of Armstrong (2007) and report effect sizes and confidence intervals. Fig. 9 depicts the 90% confidence intervals for the MdCumRAE(5) results of the BOD 100 experiment. It is evident from Fig. 9 that the JENN is significantly better than all three linear models and the nonlinear MLP with memory by delay. The JENN confidence intervals and the 50% median are also lower than those for the TDNN and the GMNN, but the difference is not statistically significant.



    Fig. 9. 90% confidence intervals for MdCumRAE(5) BOD 100.

Similar conclusions arise from the 90% confidence intervals for SS 100 given in Fig. 10. In this experiment, the JENN is significantly better than exponential smoothing and the ARIMA-I-TF model, but not better than the ARIMA-I model or the other nonlinear models.

Fig. 10. 90% confidence intervals for MdCumRAE(5) SS 100.
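The paper does not state how the intervals in Figs. 9 and 10 were constructed; one conventional, distribution-free choice for a median is the order-statistic interval based on the binomial distribution, sketched below under that assumption.

```python
import numpy as np
from scipy import stats

def median_ci(sample, level=0.90):
    """Distribution-free confidence interval for a median: bracket the
    sample median with order statistics whose ranks come from
    Binomial(n, 0.5). One common construction; the original method
    behind Figs. 9 and 10 may differ."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    alpha = 1.0 - level
    lo = int(stats.binom.ppf(alpha / 2, n, 0.5))        # lower rank
    hi = int(stats.binom.ppf(1.0 - alpha / 2, n, 0.5))  # upper rank
    return x[max(lo, 0)], x[min(hi, n - 1)]
```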

    6.5. Sensitivity analysis for nonlinear models

The predictions of the nonlinear neural network models are known to be sensitive to configuration decisions, as well as to the stochastic variability resulting from differences in weight initialization. This subsection portrays the sensitivity of the TDNN (Fig. 11), JENN (Fig. 12) and GMNN (Fig. 13) to changes in short-term memory depth and resolution, as well as to changes in long-term memory capacity, defined by the number of hidden nodes. For the sake of brevity, we omit the sensitivity analysis for the MLP, as it is architecturally similar to the TDNN. The sensitivity analysis focuses on the critical metric for basin water quality management, the median of the five-period cumulative relative absolute errors.

The sensitivity of the short-term memory by delay for the TDNN is shown graphically at the top of Fig. 11. It is interesting to note that the median cumulative relative error of BOD predictions is relatively insensitive to the memory depth, with a variability of 1.9% (where variability is (max - min)/max), while another time series from the same data set, SS, is extremely sensitive to the memory depth. The short-term memory for the TDNN displays an optimal value for a memory depth of three delays (resolution = 1) and a variability of approximately 10.0% for SS. The sensitivity results for the TDNN long-term memory are presented in the bottom portion of Fig. 11.


    Fig. 11. Sensitivity analysis for time delay neural network (TDNN).

An optimal hidden layer of four neurons is identified for both BOD and SS. The TDNN sensitivity to the long-term memory is 4.5% for BOD and 3% for SS.

The sensitivity of the JENN memory by feedback (context unit) is portrayed at the top of Fig. 12. The optimal time constant is 0.8 for both BOD and SS. This corresponds to a memory depth of 1.25 delays, with a resolution of 0.8. The variability of the cumulative accuracy caused by the time constant is 4.7% for BOD and 7.6% for SS. The long-term memory (bottom of Fig. 12) shows a minimum at 8 hidden layer nodes for both time series. The sensitivity of the cumulative accuracy to the long-term memory structure is 2.1% for BOD and 2.3% for SS.


Fig. 12. Sensitivity analysis for the Jordan-Elman neural network (JENN).

Fig. 13. Sensitivity analysis for the gamma memory neural network (GMNN).


Fig. 13 depicts the sensitivity of the short-term gamma memory by feedback. A clear minimum is established for a memory with two recursive gamma kernels for both BOD and SS. This short-term memory structure provides a depth of 2.2 and a resolution of 0.9. The variability in MdCumRAE(5) due to the design of the gamma memory by feedback is 4.8% for SS and 3.5% for BOD. The long-term memory for the GMNN is optimal at eight hidden layer units for both BOD and SS, the same value as for the JENN. The range in cumulative predictive variability from the long-term memory is 3.8% for BOD and 3.9% for SS.
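The depth and resolution figures quoted above follow from the gamma memory recursion of De Vries and Principe (1992), in which the depth is approximately K/mu for K kernels with feedback parameter mu. A minimal sketch of the memory stage alone (network weights and training are omitted, and the function name is ours):

```python
import numpy as np

def gamma_memory(x, K=2, mu=0.9):
    """Recursive gamma kernels: g_0(t) = x(t) and, for k = 1..K,
    g_k(t) = (1 - mu) * g_k(t-1) + mu * g_{k-1}(t-1).
    With K = 2 and mu = 0.9 this gives the depth of 2/0.9 = 2.2 and the
    resolution of 0.9 cited above."""
    g = np.zeros((len(x), K + 1))
    g[:, 0] = x
    for t in range(1, len(x)):
        for k in range(1, K + 1):
            g[t, k] = (1 - mu) * g[t - 1, k] + mu * g[t - 1, k - 1]
    return g  # columns form the network input: current value + K smoothed
```

The same arithmetic is consistent with the JENN figures in Fig. 12: a single recursive stage (K = 1) with a time constant of 0.8 implies a depth of 1/0.8 = 1.25 and a resolution of 0.8.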

While it is risky to draw any strong conclusions from this limited sensitivity analysis, the results suggest that the design of both short- and long-term memories can critically affect the predictive accuracy of the neural network model. The variability in predictive accuracy from the design of short- and long-term memory structures is significant, ranging from 1.0% to 10%. In many cases, the proper specification of the short-term memory is more critical than that of the long-term memory.

    7. Concluding remarks

Accurate forecast models for basin water quality management are becoming increasingly vital as political organizations focus on the social consequences of ineffective watershed management. This research contributes to improved predictive models for water quality management in the following ways: (1) the investigation of short-term memory by feedback; (2) a focus on the cumulative forecast accuracy; and (3) a comprehensive and rigorous experimental design that improves the cumulative body of research relating to water quality. Specifically, this research explores a broad range of predictive models that include both linear and nonlinear long-term memory, and short-term memory both by delay and by feedback. To the best of our knowledge, this is the first research to investigate short-term memory by feedback for watershed applications. We also note that most previous studies have focused on single-period forecasts, despite the cumulative nature of basin water quality systems. The use of a rolling origin methodology in this study produces out-of-sample test results that are relatively large, and reduces the variability of the predictive estimates. The cumulative relative error metrics provide insights into important cumulative forecast capabilities beyond single-period predictions.


The long-term memory structure of a forecast model creates a mapping from an input representation space to a prediction (see Fig. 2). A key decision for the long-term memory is the choice of a linear or nonlinear mapping. The introductory section of this paper describes nonlinearities in the basin and wastewater physical systems studied. Given the existence of these nonlinear interactions, we would anticipate that models with nonlinear long-term memories would be more accurate than models with linear memories. Interestingly, we do not find strong evidence of this for single-period forecasts. The single-period effect sizes favor nonlinear models for BOD 100 (3.6%) and SS 100 (0.3%), but linear models at the larger fit size (3.0% and 1.5%). Similar conclusions can be drawn from the MdRAE metric. The model with the lowest MdRAE for single-period forecasts is a linear model for all experiments except BOD 100. These patterns suggest that nonlinear models should not be the favored model for single-period forecasts of physical systems, even when the systems are known to have nonlinear interactions.

The results of this research do, however, provide strong evidence that, for water quality prediction, nonlinear models are increasingly more accurate than linear models as the forecast horizon extends beyond a single period. These effect sizes, which range from 7% to 20% for nonlinear models, are certainly large enough to be of value in water quality prediction. This is clearly shown in Fig. 7, which gives MdCumRAE values (consolidated for all four experiments) as a function of the forecast horizon, and is also noticeable in the Period 5 effect sizes of Tables 6 and 7. The Period 5 effect sizes favor nonlinear models for both BOD experiments (12.7%, 12.0%) and both SS experiments (13.2%, 12.8%). The consolidated results confirm that the accuracy of nonlinear models increases relative to that of linear models as the forecast horizon increases.

The design of the short-term temporal memory is dependent on the physical system being modeled and the data sampling frequency. Memory by delay is advantageous for physical systems that have a short memory depth, low levels of noise, and/or low frequency data sampling. Memory by feedback is better suited to physical systems with long memory depths, high levels of noise, and/or high frequency data sampling. Under these conditions, the recursive nature of either the context units of the JENN or the gamma memory of the GMNN provides an inherently simpler solution. For deep short-term memories, memory by delay requires a large extension of the input layer of the network architecture, and also requires a relatively large number of model weights to be estimated; a parameter-count sketch is given below. The experimental findings indicate that short-term memory by feedback is more accurate for both BOD experiments; nine of the ten comparisons favor memory by feedback for multi-period BOD forecasts. The magnitude of the effect size ranges from 1.2% to 7.3%. The SS experiments suggest that memory by delay is favored at all SS forecast intervals except SS 200 Periods 2 to 4. The consolidated results for the short-term memory structure across all experimental conditions provide a slight advantage for memory by feedback models at all forecast horizons from one to five.
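The parameter-count argument can be made concrete with a small illustrative calculation. The formula below assumes a univariate series, a single hidden layer with bias terms, and one output node; the memory depths are hypothetical, chosen only to show the scaling.

```python
def n_weights(n_inputs, n_hidden):
    # input-to-hidden weights + hidden biases + hidden-to-output + output bias
    return n_inputs * n_hidden + n_hidden + n_hidden + 1

# Memory by delay: a depth-10 tapped delay line means 11 network inputs.
print(n_weights(11, 8))                               # 105 weights
# Memory by feedback: current value plus one recursive context input.
print(n_weights(2, 8), "+ a feedback time constant")  # 33 weights + 1
```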

Since the properties of basin systems are cumulative in nature, the most significant metric investigated in this study is the MdCumRAE, a cumulative relative error. According to this metric, the JENN, a network with memory by feedback, is the most accurate predictive model for cumulative five-period forecast accuracy. The JENN has the lowest MdCumRAE(5) for all four experiments, with values of 0.878 and 0.870 for BOD 100 and BOD 200 respectively, and 0.851 and 0.841 for SS 100 and SS 200.

While this research focuses on variables in basin water quality management, we believe that the findings could potentially be generalized to other forecast applications with nonlinear dynamics. This might include aspects of supply chain management where accurate multi-period forecasts are important. We acknowledge that the neural network architectures investigated in this research are designed from conventional principles. This includes the use of a single hidden layer, the use of hyperbolic tangent activation functions, and the choice of learning parameters. There are many other neural network architectures that could be investigated for basin water quality management applications; a broader investigation of other nonlinear models may result in a more accurate model. Multivariate nonlinear models, which are capable of predicting multiple wastewater effluent time series jointly, may also have potential in this application.


    References

Armstrong, J. S. (2007). Significance tests harm progress in forecasting. International Journal of Forecasting, 23, 321–327.

Armstrong, J. S., & Collopy, F. (1992). Error measures for generalizing about forecasting methods: empirical comparisons. International Journal of Forecasting, 8, 69–80.

Asuncion, A., & Newman, D. J. (2007). UCI machine learning repository. http://archive.ics.uci.edu/ml/datasets/Water+Treatment+Plant.

Bartram, J., & Rees, G. (Eds.) (2000). Monitoring bathing waters: A practical guide to design and implementation of assessments and monitoring programmes. Published on behalf of UNESCO, WHO and UNEP by E & FN Spon, London & New York.

Beck, M. B. (2005). Vulnerability of water quality in intensively developing urban watersheds. Environmental Modelling and Software, 20, 381–400.

Berthouex, P. M., & Box, G. E. (1996). Time series models for forecasting wastewater treatment plant performance. Water Resources, 30, 1865–1875.

Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (1994). Time series analysis: Forecasting and control. Englewood Cliffs, NJ: Prentice Hall Press.

Box, G. E. P., & Tiao, G. C. (1973). Intervention analysis with applications to economic and environmental problems. Journal of the American Statistical Association, 70, 70–79.

Carlsson, B., & Lindberg, C. F. (1998). Some control strategies for the activated sludge process. http://www.cheric.org/ippage/p/ipdata/2000/07/file/control-Of-wwt.pdf.

Chau, K. W. (2006). A review on integration of artificial intelligence into water quality modelling. Marine Pollution Bulletin, 52, 726–733.

Chen, C., & Liu, L. M. (1993). Joint estimation of model parameters and outlier effects in time series. Journal of the American Statistical Association, 88, 284–297.

De Vries, B., & Principe, J. C. (1992). The gamma model: A new neural network for temporal processing. Neural Networks, 5, 565–576.

Dellana, S., & West, D. (2009). Predictive modeling for wastewater applications: Linear and nonlinear approaches. Environmental Modelling and Software, 24, 96–106.

    Delleur, J. W., & Gyasi-Agyei, Y. (1994). Prediction of suspendedsolids in urban sewers by transfer function