
Demand Forecast for Short Life Cycle Products

Mario José Basallo Triana

December, 2012


First of all, to God.
To my parents Luis Mario and Luz Angela, for their sacrifice, dedication, and love.

To my teachers, for showing me the way.


Acknowledgments

I want to thank professors Jesús Andrés Rodríguez Sarasty and Hernán Darío Benítez Restrepo, directors of this project, for their advice, assistance, and support.

I thank the Pontificia Universidad Javeriana for providing the resources needed to carry out this project. This project was supported by the Pontificia Universidad Javeriana through the project Gestión de Inventarios, 020100292.

I thank all the people who in one way or another contributed to the realization of this project.


Intelligence, give me
the exact name of things!

... Let my word be
the thing itself,

created by my soul anew.
Through me may all those

who do not know them go to things;
through me may all those

who already forget them go to things;
through me may even those

who love them go to things ...

Intelligence, give me
the exact name, and yours,
and his, and mine, of things!

Juan Ramón Jiménez, Eternidades (1918)


This work is based on the ideas proposed by professor Jesús Andrés Rodríguez Sarasty for forecasting the demand of short life cycle products.


CONTENTS

1. Introduction

2. Problem statement
   2.1 Forecasting the demand of short life cycle products
       2.1.1 Problem formulation
   2.2 Fundamental research hypothesis

3. Objectives and Scope
   3.1 General objective
   3.2 Specific objectives
   3.3 Scope of the research

4. Literature Review
   4.1 Forecast based on growth models
   4.2 Forecast based on similarity
   4.3 Forecast based on machine learning models
   4.4 Discussion of the current methods
   4.5 Conclusions

5. Time series analysis
   5.1 The datasets of time series
       5.1.1 Real datasets
       5.1.2 Synthetic dataset
   5.2 Stationarity test
       5.2.1 Unit-root test
   5.3 Clustering of time series
       5.3.1 Some insights on clustering time series
       5.3.2 The clustering algorithm
       5.3.3 Fuzzy cluster validity indices
       5.3.4 Clustering results
       5.3.5 Clustering results for the real datasets
   5.4 Conclusions of the chapter

6. Regression Methods
   6.1 Multiple linear regression
   6.2 Support vector regression
   6.3 Artificial Neural Networks
   6.4 Tuning parameters
       6.4.1 Response surface methodology for tuning parameters

7. Experimental procedure
   7.1 Collection and analysis of data
   7.2 Clustering
   7.3 Parameter tuning
   7.4 Forecasts evaluation
   7.5 Some computational aspects
   7.6 Results of the tune parameters procedure
       7.6.1 Tuning parameters for SVR machines
       7.6.2 Tuning parameters for ANN machines
   7.7 Conclusions of the chapter

8. Results
   8.1 Forecasting results using multiple linear regression
       8.1.1 Multiple linear regression results with clustering
       8.1.2 Conclusions of the MLR case
   8.2 Forecasting results using support vector regression
       8.2.1 Support vector regression results with clustering
       8.2.2 Conclusions of the SVR case
   8.3 Forecasting results using artificial neural networks
       8.3.1 Artificial neural network results with clustering
       8.3.2 Conclusions of the ANN case and other cases
   8.4 Comparison of forecasting methods

9. Conclusions

A. Results for the SD1 dataset using the correct partition

B. The effect of the clustering algorithm
   B.1 Multiple linear regression case
   B.2 Support vector regression case

C. Variance of cumulative and non-cumulative data

LIST OF FIGURES

2.1 Short life cycle product time series.
2.2 Generalized pattern of a short life cycle product time series.
5.1 The datasets of short time series.
5.2 Representation of the SD1 data set by its 3 PIPs.
5.3 Representation of the real datasets by its 3 PIPs.
6.1 Illustration of the linear ε-SVR with soft margin.
6.2 Feed-forward neural network.
6.3 Experimental designs.
7.1 Experimental factors.
7.2 Framework of the forecasting procedures.
7.3 Some iterations of the proposed tuning parameters procedure.
8.1 Multiple linear regression results, complete datasets.
8.2 Support vector regression results, complete datasets.
8.3 Artificial neural network results, complete datasets.
8.4 Forecasting results for some time series.
8.5 Pairwise comparison of regression methods.
8.6 Mean absolute error results for each regression method.
C.1 Variance of cumulative and non-cumulative data.


LIST OF TABLES

5.1 Model parameters of synthetic time series.
5.2 Validation results for SD1 dataset using FSTS algorithm.
5.3 Optimal number of clusters for the real datasets.
7.1 Optimal SVR parameters procedure, non-cumulative data.
7.2 Optimal SVR parameters, cumulative data.
7.3 Optimal ANN parameters procedure, non-cumulative data.
7.4 Optimal ANN parameters, cumulative data.
8.1 MLR results, complete datasets.
8.2 MLR results, partitioned non-cumulative data.
8.3 MLR results, partitioned cumulative data.
8.4 SVR results, complete datasets.
8.5 SVR results, partitioned non-cumulative data.
8.6 SVR results, partitioned cumulative data.
8.7 ANN, complete datasets.
8.8 ANN, partitioned non-cumulative data.
8.9 ANN, partitioned cumulative data.
8.10 Summary evaluation of the experimental treatments.
A.1 Preliminary multiple linear regression results for SD1 dataset.
B.1 Comparison of FCM, FMLE and FSTS algorithms for MLR.
B.2 Comparison of FCM, FMLE and FSTS algorithms for SVR.


LIST OF ALGORITHMS

1 Perceptually Important Points Procedure.
2 Kaufman initialization method.
3 Fuzzy C-Means (FCM) clustering algorithm.
4 Fuzzy short time series (FSTS) clustering algorithm.
5 Fuzzy Maximum Likelihood Estimation (FMLE) algorithm.
6 Tuning parameters procedure.
7 Proposed tuning parameters procedure.


ABSTRACT

Accurate demand forecasting for short life cycle products is a subject of special interest for many companies and researchers. However, common forecasting approaches are not appropriate for this type of product due to the characteristics of its demand. This work proposes a method to forecast the demand of short life cycle products. Clustering techniques are used to obtain natural groups in the time series; this analysis allows relevant information to be extracted for the forecasting method. The results of the proposed method are compared with other approaches for forecasting the demand of short life cycle products. Several time series datasets of different types of products are considered.

Keywords: Short life cycle product, time series, forecast, cluster analysis, forecast performance.


ONE

INTRODUCTION

Short life cycle products (SLCPs) are characterized by a demand that only occurs during a short time period, after which they become obsolete¹; this leads in some cases to very short demand time series (Rodríguez & Vidal, 2009). High technology products (e.g., computers, consumer electronics, video games) and fashion products (e.g., toys, apparel, textbooks) are typical examples of SLCPs (Kurawarwala & Matsuo, 1998; Thomassey & Happiette, 2007; Rodríguez & Vidal, 2009). The period of demand can vary from a few years to a few weeks, as can be seen in the Colombian textbook industry (Rodríguez & Vidal, 2009).

The dynamics of new product demand is generally characterized by relatively slow growth in the introduction stage, followed by a stage of rapid growth; afterwards the demand stabilizes and the product enters a maturity stage; finally the demand declines and the product is usually replaced by another product (Trappey & Wu, 2008; Meade & Islam, 2006).

In many industries, particularly in the technology sector, SLCPs are becoming increasingly common (Zhu & Thonemann, 2004). This phenomenon is driven by the continuous introduction of new products as a consequence of highly competitive markets. In this context, the competitive advantage of a company is determined largely by its ability to manage frequent entries and exits of products (Wu et al., 2009).

The demand of an SLCP is highly uncertain and volatile, particularly in the introduction stage (Wu & Aytac, 2008). Additionally, the demand pattern is

¹ Short life cycle products are different from perishable products, which generally deteriorate with time.


transient, non-stationary and non-linear (Rodríguez, 2007); these characteristics hinder the analysis and forecasting of such a demand. On the other hand, the operations management of this type of product is also difficult because high technology and investment are usually required, manufacturing and distribution lead times are usually long, and the risk of excess or shortage of inventory during the life cycle is high (Rodríguez & Vidal, 2009). An accurate prediction of demand reduces production and distribution costs and the excess or shortage of inventory. It is for this reason that an accurate forecast is important for a company.

This work proposes an efficient and effective forecasting method for short life cycle product demand. To this end, we consider regression methods to forecast the demand of an SLCP. These methods are able to produce forecasts at early stages of the product life cycle, an important advantage over current forecasting methods that also represents a significant benefit for a company. In order to improve forecasting performance, different strategies are considered; these strategies involve clustering techniques and cumulative and non-cumulative data.

This document is organized as follows: Chapter 2 states the problem and formulates different forecasting strategies (hypotheses) aimed at improving forecast performance. Chapter 3 presents the objectives of this work and the scope of the research. Chapter 4 discusses a classification and description of the current methods used to predict the demand of short life cycle products. Chapter 5 describes the different time series datasets analyzed in this work and illustrates the use of different clustering techniques. Chapter 6 discusses the regression methods and the parameter tuning procedures followed. Chapter 7 explains in detail the experimental procedure followed in this work. Chapter 8 discusses the results, and Chapter 9 concludes this document.


TWO

PROBLEM STATEMENT

2.1 Forecasting the demand of short life cycle products

There are several situations that make demand forecasting of short life cycle products complex. On one hand, the demand time series of such products appear to be non-stationary, non-linear, and transient (Fig. 2.1, left). Another problem related to demand forecasting of short life cycle products is the scarcity of historical data: once introduced to the market, these products have a short period of sales, so there is little or no historical information on their sales. This is a severe inconvenience because, in spite of the existence of forecasting methods for non-linear time series, such techniques generally require large amounts of data in order to obtain accurate forecasts. Many non-linear models have been proposed in the field of time series analysis¹. These models, in the same way as linear models, require a large amount² of data to obtain accurate results. Furthermore, the proposed nonlinear models employ explicit parametric forms; however, it is usually hard to justify a priori the appropriateness of such explicit models in real applications. The use of non-parametric regression analysis (such as support vector regression and artificial neural networks) can be an effective data-driven approach (Peña et al., 2001).

Traditional forecasting methods³ cannot be used in this context because

¹ Some nonlinear time series models are the Bilinear model, the Threshold Autoregressive model, the Smooth Transition model, and the Markov Switching model; see Tsay (2005); Peña (2005).

² At least 50 periods of historical data.
³ Methods such as moving average, exponential smoothing, ARIMA, and others. In general, most of these methods are suitable for forecasting conventional products with a large history or a stable demand pattern.


Fig. 2.1: Short life cycle product time series. Left: typical demand pattern for a short life cycle product (real demand and a smoothed version). Right: cumulative demand time series (cumulative real demand and a cumulative smoothed version).

short life cycle product time series do not meet the assumptions required by these methods, or there is not sufficient information available to obtain accurate estimates of their parameters (Kurawarwala & Matsuo, 1998; Wu & Aytac, 2008). These difficulties make it necessary to develop forecasting methods specifically designed for this type of product. Forecasting procedures for short life cycle product time series should address all these difficulties and cope with the high uncertainty and volatility of demand typical of these products.

2.1.1 Problem formulation

In order to state the problem, let us define $x_t$ as the value of the time series at time $t$ and $\hat{x}_t$ as the forecasted value for the same period. Then we set $\hat{x}_t = F$, where $F$ is some functional relationship. We need to define $F$ according to some criteria. One suitable criterion is to define $F$ according to some error

metric (forecasting error). Let $e_t = f(x_t - \hat{x}_t)$ be such a metric. Then, given a time series $\{x_t\}$, $t = 1, \dots, T$, we wish to minimize Eq. 2.1:

\[ E = \sum_{t \in T} e_t \tag{2.1} \]

The problem can be stated as follows:

Given a time series $\{x_t\}$, the forecasting problem consists in defining $F$ for which Eq. 2.1 is minimized.

One option is to set $F = f(t, \Theta)$, where $\Theta$ is some set of parameters, which in the context of product innovation corresponds to the parameters of a growth model (see Chapter 4). Another way of defining $F$ is, for example, $F = f(x_{t-1}, \dots, x_{t-p})$, where $p$ is a lag parameter. Finally, $F$ can be defined as $F = f(F_1, \dots, F_n)$, where $n$ is the number of different forecasts obtained with different forecast models.
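As a small illustration of the second formulation, the sketch below builds a lagged feature matrix from a demand series so that any regression method can be trained on pairs $(x_{t-1}, \dots, x_{t-p}) \to x_t$. This is only a minimal sketch in Python; the function name and the toy series are illustrative, not part of the thesis.

```python
import numpy as np

def make_lag_matrix(x, p):
    """Build a supervised dataset from the formulation F = f(x_{t-1}, ..., x_{t-p})."""
    x = np.asarray(x, dtype=float)
    # Row for target x_t holds [x_{t-1}, x_{t-2}, ..., x_{t-p}].
    X = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)])
    y = x[p:]
    return X, y

# Example: p = 3 lags over a short, bell-shaped demand series
X, y = make_lag_matrix([2, 5, 9, 14, 16, 12, 7, 3], p=3)
```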

If we define $X_t = \sum_{i=1}^{t} x_i$ as the cumulative demand, the resulting time series is smoother than the non-cumulative demand time series (see Fig. 2.1, right). As this fact can be used to improve forecasting results (Wu et al., 2009), we forecast the cumulative demand and then recover the non-cumulative demand as $x_t = X_t - X_{t-1}$. This situation motivates the following research hypothesis:

The forecast with cumulative demand is more accurate than the forecast with non-cumulative demand.
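A minimal sketch of how this strategy can be operationalized is shown below: the forecaster is applied on the cumulative scale and the one-step forecast is differenced back. The `fit_predict` callable is a hypothetical placeholder for any one-step-ahead model, and the naive drift forecaster is only for illustration.

```python
import numpy as np

def forecast_cumulative(x_history, fit_predict):
    """Forecast on the cumulative scale, then difference back to demand units."""
    X = np.cumsum(x_history)   # X_t = sum_{i<=t} x_i, a smoother series
    X_next = fit_predict(X)    # forecast of X_{T+1} by any one-step model
    return X_next - X[-1]      # x_{T+1} = X_{T+1} - X_T

# Illustration with a naive drift forecaster as the placeholder model
demand = [2, 5, 9, 14, 16, 12]
naive_drift = lambda s: s[-1] + (s[-1] - s[-2])
print(forecast_cumulative(demand, naive_drift))  # prints 12.0
```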

In order to cope with the scarcity of historical data on product demand, some references in the SLCP literature work with time series of similar products which have already been introduced to the market (see Chapter 4). In this case, a clustering technique may be used to analyze the available information and to organize it into natural clusters (Wu et al., 2009; Li et al., 2012). The forecasts may then be carried out according to the results of that cluster analysis. Therefore, we also consider the following research hypothesis:

Forecasts based on data obtained by means of cluster analysis improve forecast performance.

Finally, a time series of an SLCP may have more complex shapes (see Fig. 2.2)


that are not necessarily similar to the typical bell-shaped pattern in Fig. 2.1 (left). In this situation the forecasting process becomes more difficult, and we need to define forecast models that can deal with any pattern of SLCP time series. One way to do this is to use a machine learning model to capture the complex functional relationship of the time series (Zhang et al., 1998). Accordingly, we assume the following research hypothesis (the methodological procedure for forecasting using machine learning models is based on Rodríguez & Vidal (2009) and Rodríguez (2007)).

The use of machine learning models in the forecasting process improves the forecast performance.


Fig. 2.2: Non bell-shaped pattern of a short life cycle product time series.

2.2 Fundamental research hypothesis

It is possible to determine a forecasting procedure for SLCPs that, based on similar time series of available data, can obtain a minimum forecast error and guarantee a reliable basis for planning activities.


THREE

OBJECTIVES AND SCOPE

3.1 General objective

Propose a forecasting method based on machine learning models to forecast the demand of short life cycle products.

3.2 Specific objectives

• Develop a clustering method for short life cycle product time series to extract relevant information for the forecasting process.

• Design the forecasting method for the demand of short life cycle products using multiple linear regression, support vector machines and/or artificial neural networks.

• Evaluate the performance of the forecast models using appropriate metrics and statistical tests.

3.3 Scope of the research

In this work we use or consider the following:

• Sales data rather than demand data, because demand data is usually not available.

• A validation and comparison of the forecasting methods using at least two different datasets of demand time series.


• No implementation in a company is carried out, and no software is developed.

• The analysis only considers time series that follow the behavior of the demand of short life cycle products.

• This work focuses on the pattern of the time series rather than on the product or the type of product being considered.

Note: Due to the scarcity of real data for this project, one of the databases used in this work contains time series of demand for textbooks and school products. Some of these products cannot be considered strictly short life cycle products¹. However, we use these time series because they follow the demand pattern of a short life cycle product.

¹ The demand for textbooks and school products is mainly related to the season, but during the season such demand follows a short life cycle pattern.


FOUR

LITERATURE REVIEW

This chapter presents a review of the current status of research on forecasting the demand (sales) of short life cycle products. The review is intended to provide a general summary of the work on forecasting for short life cycle products and guidelines for future research.

4.1 Forecast based on growth models

Product growth models are widely used in the analysis of the diffusion of innovations. For this reason, diffusion models can be used to forecast the demand of SLCPs. This approach requires determining a set of parameters; therefore, a parameter estimation procedure is required. A review of forecasting methods based on diffusion models for short life cycle products is presented below. An extensive literature review of the use of diffusion models can be found in Meade & Islam (2006).

Trappey & Wu (2008) present a comparison of the time-varying extended logistic, simple logistic, and Gompertz models. The study analyzes time series of electronic products. Linear and non-linear least squares methods are used to determine the model parameters. The authors found that the time-varying extended logistic model had the best fit and prediction capability for almost all tested cases; however, this forecast procedure may fail to converge in some situations¹. The Gompertz model had the second-best forecasting error.

¹ Cumulative time series of SLCPs converge to an upper limit; see Fig. 2.1 (right) in Chapter 2.


Kurawarwala & Matsuo (1998) analyze three models to forecast the demand of short life cycle products: the linear growth model, the Bass model, and the seasonal trend Bass model². Forecasts are performed using demand data from personal computer time series. Non-linear least squares estimation is used to determine the parameters. Performance measures such as the sum of squared errors (SSE), the RMSE and the MAD show that the seasonal trend Bass model reaches the minimum forecast error.

Tseng & Hu (2009) propose the quadratic interval Bass model to forecast new product sales diffusion. Fuzzy regression estimates the parameters of the Bass model, which is tested on different datasets. The proposed model is compared with the Gompertz, logistic and quadratic Gompertz models and analyzed by means of the confidence interval length as a performance measure. The authors conclude that the proposed method is suitable in cases of insufficient data and should not be used when sufficient data is available.

Wu & Aytac (2008) propose a forecast procedure based on the use of time series of similar products (leading indicators), Bayesian updating and combined forecasts of different diffusion models. An a priori forecast is made with several growth models. Then a sampling distribution is obtained from the forecasts of different time series of similar products (leading indicators), which are obtained by means of the mentioned growth models. Finally, Bayesian updating is performed and the final forecast is obtained as a combination of the different growth models in the a posteriori results. The main advantage of the method is the systematic reduction of the variance of the forecasts. The method is tested on semiconductor demand time series. A similar work is presented in Wu et al. (2009).

Zhu & Thonemann (2004) propose an adaptive forecast procedure for short life cycle products. The authors use the Bass diffusion model and propose to update its parameters using a Bayesian approach. Prior estimation of the parameters is made using non-linear least squares estimation. The forecasts are performed using datasets of personal computers and analyzed using the MAD. The results show that the proposed method performs better than double exponential smoothing and the Bass model.

² The authors consider integrating elements of seasonality into the Bass diffusion model.


4.2 Forecast based on similarity

The lack or scarcity of information in short life cycle product time series is compensated by the existence of information on similar products for which there is sufficient history or which have completed their life cycle. In this section we present a review of forecast models that directly use similar products or time series to obtain forecasts for an SLCP.

Szozda (2010) proposes an analogous forecasting method. The purpose of the method is to find the most similar time series. Calibration and adjustment of the time series is performed if necessary, in order to maximize the similarity measure. The forecast is the value of the most similar time series at a specified period of time. Datasets of different new products on European markets were used. The forecast method was analyzed using the mean squared error (MSE), and the results show that the proposed method performs well, obtaining a forecast error of less than 10% on average.

Thomassey & Fiordaliso (2006) propose a forecast procedure based on clustering (unsupervised learning) and classification (supervised learning) to carry out early forecasts of new products. First, natural groups in the time series are obtained by means of a clustering procedure; then a classification procedure assigns new products to a specific cluster. A forecast of the sales profile is given by the centroid of the cluster to which the new product belongs. A dataset of textile fashion products is used. The forecasts were analyzed using the RMSE, the Mean Absolute Percentage Error (MAPE), and the Median Absolute Percentage Error (MdAPE). A similar work is presented by Thomassey & Happiette (2007).

4.3 Forecast based on machine learning models

Machine learning methods such as neural networks are widely used in forecasting activities (Zhang et al., 1998). Additionally, techniques such as clustering and classification are important for extracting relevant information from time series data and are therefore of great utility in finding similar time series, as can be seen in Section 4.2. In this section we present a review of machine learning methods used in forecasting activities for short life cycle products.

Xu & Zhang (2008) use a Support Vector Machine (SVM) to forecast the demand of short life cycle products under conditions of data deficiency. The authors take into account factors such as the past values of current demand,


the forecast given by the Bass model, and seasonal factors. A dataset of computer products was utilized. The forecasts were analyzed using the RMSE and MAD. The results show that the proposed model outperforms the Bass model.

Meade & Islam (2006) use a multilayer feed-forward neural network accommodated for prediction and a controlled recurrent neural network to predict short time series. The authors use datasets available in the literature and find that the feed-forward neural network accommodated for prediction performs better in one-step-ahead forecasts, but in the case of two-step-ahead forecasts the controlled recurrent neural network improves on the feed-forward neural network.

The capability of artificial neural networks to account for non-linear relationships, and the fact that no assumptions are made about the characteristics of the time series, are interesting features for forecasting. However, a drawback of neural networks is that they have many parameters to set up, and there is no standard procedure for this task that ensures good network performance. The lack of a systematic approach to neural network model building is probably the primary cause of inconsistencies in reported findings (Zhang et al., 2001).

Design of Experiments (DOE) and/or Response Surface Methodology (RSM) are techniques that can be used to tune the parameters of neural network models (Zhang et al., 2001; Bashiri & Geranmayeh, 2011; Chiu et al., 1994; Madadlou et al., 1994; Balestrassi et al., 1994). However, the results obtained by response surface methodology are not necessarily the best (Wang & Wan, 2009; Jian et al., 2009).

4.4 Discussion of the current methods

From our literature review it is possible to describe some limitations of the current forecasting methods for SLCP demand. These limitations are related mainly to the capability of forecasting the demand over the complete life cycle, the appropriateness of the models used, and the effective use of historical information related to other SLCPs. We describe these limitations in more detail below.

• Capability of forecasting over the complete life cycle: This is an important issue because many methods, such as diffusion models, require part of the historical information for parameter tuning. For example, the Bass model (see Section 5.1.2) requires determining three parameters (a, b and c; see Eq. 5.1), so it is necessary to reserve at least 3 data points to estimate them. Hence, if we use the Bass diffusion model, the initial forecast can only be made at the 4th period.

• Appropriateness of the model: Many models used in forecasting SLCP demand employ explicit parametric forms that can, at best, be regarded as rough approximations of the underlying non-linear behavior of the time series being studied. It is necessary to justify a priori the appropriateness of the model used.

• Effective use of information related to other SLCPs: Many forecasting methods for SLCP demand, such as diffusion models, do not exploit information on historical patterns of SLCP demand.

These drawbacks and the hypotheses outlined in Chapter 2 motivate and justify the work presented in this document.

4.5 Conclusions

This chapter presented a review of the forecast models for short life cycle products, including a description of forecasting models based on diffusion models, similarities between time series, and machine learning methods. The use of Bayesian learning is a common technique that generally improves the forecast results. We noted some limitations of the current methods used in forecasting SLCP demand.


FIVE

TIME SERIES ANALYSIS

This chapter provides an analysis of the different time series datasets. Initially, a univariate time series analysis is addressed; this analysis includes testing for stationarity and testing for nonlinearity in the time series. Our experience with the univariate time series analysis does not allow us to reach conclusive results, since some of the techniques used here¹ give undefined results for short time series. To complement the analysis, an analysis of multiple time series is conducted using clustering techniques to extract natural groups in the time series data. Classifying information by means of clustering allows relevant information to be extracted from the data that can be valuable in forecasting tasks.

5.1 The datasets of time series

5.1.1 Real datasets

Demand dataset of textbooks and school products

This dataset corresponds to the weekly sales of textbooks and school products of a Colombian company. The dataset comprises a total of 726 different time series, which have a maximum length of 13 periods². We will refer to this dataset as RD1. Fig. 5.1 (a) shows some of these time series.

¹ Such as maximum likelihood estimation of ARMA models and some nonlinear tests for time series.

² It is important to note the presence of very short time series. This fact imposes constraints on univariate time series analysis, as will be seen later.


Citations of articles dataset

This dataset corresponds to the annual citations received by scientific articles published in several knowledge fields. The articles were published between 1996 and 1997 and are available in the Scopus database. The dataset comprises a total of 600 time series of annual citations covering the years 1996–2014, with a maximum length of 18 periods. We will refer to this dataset as RD2. Fig. 5.1 (c) shows some of these time series. Note that although these time series show a downward trend in the last years, they have not yet completed their life cycle. This does not occur in dataset RD1, where the lifetime of the time series has been completed.

Patent-to-patent citations dataset

This dataset corresponds to annual patent-to-patent citations in the United States Patent and Trademark Office (USPTO). The source of the data is Hall et al. (2001), where the authors study patent citations covering the years 1975–1999. In this work we consider patents that mainly belong to technological fields, and we preprocess the information to obtain the annual patent citations of patents granted between 1976 and 1978³. The time series cover the periods from 1976 to 1999, with a maximum length of 23 periods, and the total number of patents analyzed was 600. We will refer to this dataset as RD3. Fig. 5.1 (d) shows some of these time series.

³ The patent citations are analyzed according to their grant year. We consider the grant year as the year of birth of the patent and the start of the citation counts.

5.1.2 Synthetic dataset

Bass diffusion model

One of the first attempts to model the life cycle of products was the Bass diffusion model (Bass, 1969), which can be expressed using the following recurrence relation:

\[ x_t = a + b \sum_{i=1}^{t-1} x_i + c \left( \sum_{i=1}^{t-1} x_i \right)^2, \tag{5.1} \]

where $a = pm$, $b = q - p$ and $c = -q/m$. Here $p \in (0, 1)$ is called the innovation coefficient, $q \in (0, 1)$ is called the imitation coefficient, and $m = \sum_{i=1}^{\infty} x_i$ represents the total demand/sales over the life cycle of the product; see Bass (1969). With $t = 0, \dots, \infty$ and $x_0 = a = mp$, the recurrence relation 5.1 can be used to obtain the following nonlinear autoregressive process:

\[ x_t = a + b \sum_{i=1}^{t-1} x_i + c \left( \sum_{i=1}^{t-1} x_i \right)^2 + \epsilon_t, \]

where $\epsilon_t$ is an independent and identically distributed random variable. We refer to this relation as the Bass process and construct synthetic time series that follow it, with $\{\epsilon_t\}$ normally distributed with mean 0 and variance $\sigma^2$. Four different groups of time series are generated, each group containing 150 time series, for a total of 600. The length of the time series is set to a maximum of 25 periods. The parameters of the process are considered as normally distributed random variables with the means and standard deviations ($\sigma$) given in Tab. 5.1. We will refer to this dataset as SD1. Fig. 5.1 (b) shows 7 different time series of this dataset.

Tab. 5.1: Mean values of the model parameters for each group of synthetic time series. Each parameter is considered as a random variable. Note: values of p and q outside the range [0, 1] were not considered in the process.

Group   p       q       m
I       0.043   0.425   193.1
II      0.002   0.496   119.5
III     0.031   0.220   105.2
IV      0.015   0.549   193.2
σ       0.071   0.001   14.29
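The generation scheme just described can be sketched in Python as follows. The snippet simulates one series of the Bass process; the non-negativity clamp and the noise standard deviation are illustrative assumptions, since the thesis does not fix them in this passage.

```python
import numpy as np

def simulate_bass_process(p, q, m, T=25, noise_sd=1.0, rng=None):
    """One synthetic series from the Bass process:
    x_t = a + b*S_{t-1} + c*S_{t-1}^2 + eps_t, with S_t the cumulative demand."""
    rng = rng or np.random.default_rng()
    a, b, c = p * m, q - p, -q / m
    x = np.empty(T)
    S = 0.0  # cumulative demand sum_{i < t} x_i
    for t in range(T):
        # Clamp at zero so noise cannot produce negative demand (an assumption).
        x[t] = max(a + b * S + c * S ** 2 + rng.normal(0.0, noise_sd), 0.0)
        S += x[t]
    return x

# Mean parameters of group I in Tab. 5.1
series = simulate_bass_process(p=0.043, q=0.425, m=193.1)
```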

5.2 Stationarity test

Stationarity implies some regularity conditions required for certain hypothesis tests and for the parameter estimation of linear ARMA models (Peña, 2005, Chapter 10). A time series $\{x_t\}$ is said to be weakly stationary if its mean $\mu_t$, its variance $\gamma_0$ and the covariance $\gamma_\ell$ between $x_t$ and $x_{t-\ell}$ are time invariant, where $\ell$ is an arbitrary integer (see Chapter 2 of Tsay (2005)).



Fig. 5.1: The datasets of short life cycle product time series. (a) RD1 dataset. (b)SD1 dataset. (c) RD2 dataset. (d) RD3 dataset.

Strong stationarity implies that the probability distribution function related to the values of the time series is time invariant. Strong stationarity is a key concept for characterizing SLCP time series: it is clear that the probability distribution function of such time series varies with time. In fact, at the initial and final stages of the life cycle we might expect less variability in the time series, while the variability is expected to be greater around the peak of demand/sales. Also, the mean value of the time series depends on the stage of the life cycle, i.e., the mean demand/sales at the beginning and at the end of the life cycle is expected to be close to zero.

5.2.1 Unit-root test

A non-stationary time series is said to be an Autoregressive Integrated Moving Average, ARIMA(p, d, q), process if, after applying the difference operator, $\nabla^d y_t$, the resulting time series is stationary⁴. The original time series is then called unit-root nonstationary (Tsay, 2005, Chapter 2). Consider the autoregressive AR(1) process

\[ y_t = \phi_0 + \phi_1 y_{t-1} + a_t, \]

⁴ The difference is defined as $\nabla y_t = y_t - y_{t-1}$. Let $x_t = y_t - y_{t-1}$; then $\nabla^2 y_t = x_t - x_{t-1} = y_t - 2y_{t-1} + y_{t-2}$.


where $a_t$ is an independent and identically distributed random variable, e.g., a white noise sequence, which is by definition a stationary process. If $\phi_1 = 1$ then

\[ y_t = \phi_0 + y_{t-1} + a_t, \]
\[ \nabla y_t = \phi_0 + a_t. \]

Given that after applying the difference operator to the original time series we obtain a stationary time series, we conclude that the process is non-stationary. Then, for testing unit-root stationarity, it is necessary to test whether $\phi_1 = 1$ or $\phi_1 < 1$. The above analysis can be generalized to ARMA models; however, we omit the details (see Chapter 9 of Peña (2005)). Generally the null hypothesis of the test is that the series is non-stationary. This implies, possibly, differencing the original series, by using the difference operator, to obtain a stationary process. We use the augmented Dickey-Fuller test to test for unit-root nonstationarity, with a significance level of 0.01. Performing the Dickey-Fuller test requires constructing a suitable linear ARMA model beforehand; the parameters of such a model are usually estimated by maximum likelihood estimation.
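A minimal sketch of this test with a standard library is shown below. It assumes the `statsmodels` package, which the thesis does not name, and fixes a small maximum lag because the series are very short (which is also why, as noted next, the results are often inconclusive).

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

def unit_root_not_rejected(x, alpha=0.01):
    """Augmented Dickey-Fuller test. H0: the series has a unit root
    (non-stationary). Returns True when H0 cannot be rejected at level alpha."""
    stat, pvalue, *_ = adfuller(np.asarray(x, dtype=float), maxlag=1)
    return pvalue > alpha

# A short bell-shaped series typically fails to reject H0
print(unit_root_not_rejected([1, 3, 7, 12, 15, 13, 9, 5, 2]))
```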

Results of the unit root test

Our experience testing unit-root nonstationarity gives no conclusive results. This is confirmed when performing parameter estimation of ARMA models, where the algorithms used produce undefined or invalid results⁵. We assume that the time series discussed here are non-stationary, since they do not satisfy the regularity assumptions required for estimating the parameters of ARMA models.

5.3 Clustering of time series

Data mining methods are helpful for finding and describing patterns hidden in datasets. Clustering is a major technique in the data mining, statistics and pattern recognition fields; the clustering problem is also referred to as unsupervised classification and is identified as unsupervised learning.

⁵ In estimating parameters we use exact maximum likelihood estimation based on the Kalman filter. This algorithm requires the calculation of a log-likelihood function, but in some cases the argument of the logarithm is negative. The same circumstances are found in some nonlinear tests. This implies undefined (complex) results.


Clustering consists in finding natural groups in data. This means that an element of a cluster possesses characteristics common to the other elements in that cluster, but is significantly different from the elements of other clusters. In the context of time series data⁶, clustering becomes the task of finding time series with similar characteristics. Several characteristics can be of interest. For example, one characteristic of a cluster may be that each time series in it is generated by approximately the same generating process. Another characteristic of interest may be that the time series of a cluster are highly dependent.

5.3.1 Some insights on clustering time series

Given a set of unlabeled time series, it is often desirable to determine groups of similar (according to our meaning of similar) time series. There are three approaches for clustering time series (see Liao (2005)):

• Raw-data-based approach. This approach uses the complete time series as a feature vector. The clustering is carried out in a high-dimensional space if the time series are long, which can be problematic due to the curse of dimensionality.

• Feature-based approach. This approach extracts some relevant features of the time series. Techniques such as feature extraction or selection, dimensionality reduction and others can be used to extract information from the data.

• Model-based approach. Here the features are obtained using some model. For example, when dealing with SLCPs, we can fit the Bass diffusion model to each time series and use the parameters of that model as the features.

There are three main objectives in clustering time series (Zhang et al., 2011): obtain similarity in time, by analyzing series that vary in a similar way at each time step; obtain similarity in shape, by analyzing together time series with common shape features; and, finally, obtain similarity in change, by analyzing common trends in the data, i.e., similar variation at each time step.

⁶ A review of the literature related to time series data mining can be found in Fu (2010).


In general it is not desirable to work directly with the raw data. The reasons are that such time series are highly noisy (Liao, 2005) and that, as dimensionality increases, all objects become essentially equidistant from each other (this is known as the curse of dimensionality), so classification and clustering lose their meaning (Ratanamahatana et al., 2010). Transformation of the raw data can therefore improve efficiency, by reducing the dimension of the data, or improve the clustering results, by smoothing the trend and giving prominence to typical features (Zhang et al., 2011).

Dimensionality reduction by perceptually important points

Perceptually Important Points (PIP) is a simple method for reducing the dimension of a time series that preserves salient (representative) points. In this method all data points are reordered according to their importance; if a time series of length T must be reduced to dimension n, n < T, the top-n points of the list are selected. The method starts by selecting the initial and final points of the original time series as the first PIPs. The next PIP is the point with maximum distance to the first two PIPs, and each subsequent PIP is the point with maximum distance to its two adjacent PIPs. The procedure continues until n points have been selected. The distance from a point to its two adjacent PIPs is understood as the vertical distance between the point and the line connecting its two adjacent PIPs (Fu, 2010). Let the test point be $p_3 = (x_3, y_3)$ and let its adjacent points be $p_1 = (x_1, y_1)$ and $p_2 = (x_2, y_2)$; then the vertical distance between $p_3$ and the line connecting $p_1$ and $p_2$ is given in Eq. 5.2 (see Chung et al. (2002)):

\[ d(p_3, p_c) = |y_c - y_3| = \left| \left( y_1 + (y_2 - y_1) \frac{x_c - x_1}{x_2 - x_1} \right) - y_3 \right|, \tag{5.2} \]

where $x_c = x_3$. The general PIP procedure is shown in Algorithm 1.

Hard clustering or fuzzy clustering?

It is known that hard clustering⁷ often does not reflect the description of real data, where the boundaries between subgroups might be fuzzy, and numerous problems in the life sciences are better tackled by decision making in a fuzzy environment; fuzzy clustering then becomes the best option (Gath & Geva, 1989a).

⁷ Hard clustering refers to the process of assigning each element to one, and only one, cluster. In fuzzy clustering, on the other hand, each element belongs to each cluster with a certain degree of membership.


Algorithm 1: Perceptually Important Points procedure.

Data: Input sequence {a_t}, t = 1, ..., T; required length n.
Result: Reduced sequence {q_t}, t = 1, ..., n.
1. Set q_1 = a_1 and q_n = a_T;
2. l = 2;
3. while l < n do
4.     Select the point a_j with maximum distance to its adjacent points in {q_t};
5.     Add a_j to {q_t};
6.     l = l + 1;
7. end
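A direct Python transcription of Algorithm 1 might look as follows (a sketch; the function name is illustrative). It returns the indices of the selected points, using the vertical distance of Eq. 5.2.

```python
import numpy as np

def pip_reduce(y, n):
    """Reduce series y to its n perceptually important points (Algorithm 1).
    Returns the sorted time indices of the selected points."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    selected = [0, len(y) - 1]  # first PIPs: the endpoints
    while len(selected) < n:
        sel = sorted(selected)
        best_d, best_j = -1.0, None
        for a, b in zip(sel[:-1], sel[1:]):  # scan each gap between adjacent PIPs
            for j in range(a + 1, b):
                # Vertical distance of point j to the chord from a to b (Eq. 5.2)
                yc = y[a] + (y[b] - y[a]) * (t[j] - t[a]) / (t[b] - t[a])
                d = abs(yc - y[j])
                if d > best_d:
                    best_d, best_j = d, j
        selected.append(best_j)
    return np.array(sorted(selected))

# Keep the 3 PIPs of a bell-shaped demand series
idx = pip_reduce([1, 4, 9, 15, 17, 12, 6, 2], n=3)
```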


Initialization method

A clustering procedure is generally based on the following two major steps⁸:

1. Obtain an initial partition. That partition can be obtained randomly or by a more sophisticated method.

2. Iteratively obtain new partitions, improving the clustering until some termination criterion is met.

The initialization method can improve the clustering performance. The idea of initialization is to use several clustering algorithms, each of which is more sophisticated than the previous one. For example, the Expectation-Maximization algorithm can be initialized with the results of the k-means algorithm, and the k-means algorithm can be initialized at a random partition (Bradley & Fayyad, 1998). An initialization method is proposed by Bradley & Fayyad (1998): the authors developed a procedure for computing refined initial centroids for the k-means algorithm based on an efficient technique for estimating the modes of a distribution. Another initialization method based on refined seeds or centroids can be found in Gath & Geva (1989a); a characteristic of that initialization is that the initial seeds are chosen randomly.

⁸ Referring mainly to partitional clustering algorithms.


Peña et al. (1999) present a comparison of the performance of four initialization methods for the k-means algorithm: random initialization, the Forgy approach, the MacQueen approach and the Kaufman approach. Based on the statistical properties of the squared error, the authors found that the Kaufman initialization method outperforms the other compared methods with respect to the effectiveness and robustness of the k-means algorithm and its convergence speed.

Given the above results, we describe the Kaufman initialization method in more detail. The method is shown in Algorithm 2 (see Kaufman & Rousseeuw (1990); Peña et al. (1999)). Here, d(x, y) corresponds to some distance measure between vectors x and y.

Algorithm 2: Kaufman initialization method.

Data: Dataset of time series X; number of clusters K.
Result: The set of cluster seeds {v_1, ..., v_K}, or an initial partition.
1. Select as the first seed the most centrally located instance;
2. k = 1;
3. while k < K do
4.     for every non-selected instance w_i do
5.         for every non-selected instance w_j do
6.             Calculate C_ji = max{D_j − d_ji, 0}, where d_ji = d(w_i, w_j) and D_j = min_s {d_js}, s being one of the selected seeds;
7.         end
8.         Calculate the gain of selecting w_i as Σ_j C_ji;
9.     end
10.    Select as the next seed the instance l, where l = argmax_i {Σ_j C_ji};
11.    k = k + 1;
12. end
13. To obtain a partition, assign each non-selected instance to the cluster represented by the nearest seed.
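A compact sketch of Algorithm 2 with the Euclidean distance is given below (the helper name is illustrative; the thesis leaves the distance measure generic).

```python
import numpy as np

def kaufman_seeds(X, K):
    """Kaufman initialization (Algorithm 2). X is an (N, T) array of time
    series; returns the indices of the K selected seeds."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    seeds = [int(np.argmin(D.sum(axis=1)))]  # first seed: most central instance
    while len(seeds) < K:
        rest = [i for i in range(len(X)) if i not in seeds]
        Dj = D[:, seeds].min(axis=1)  # D_j: distance of j to its nearest seed
        # Gain of candidate i: sum over non-selected j of C_ji = max(D_j - d_ji, 0)
        gains = [sum(max(Dj[j] - D[j, i], 0.0) for j in rest if j != i)
                 for i in rest]
        seeds.append(rest[int(np.argmax(gains))])
    return seeds
```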

Distance measure

In partitional clustering it is necessary to measure the similarity between two objects; for this purpose some distance measure is considered.


In general, the use of the Euclidean distance is not necessarily the best option. The Euclidean distance is not suitable when two sequences have different scales; in this case it is necessary to normalize the data (Ratanamahatana et al., 2010). The Euclidean distance also does not take into account the temporal order or the length of the sampling intervals when the time series have unevenly distributed points (Möller-Levet et al., 2003, 2005). The so-called Dynamic Time Warping distance can be used when the time series do not line up on the horizontal scale; however, the time required to compute this distance is high (Ratanamahatana et al., 2010).
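The normalization mentioned above is commonly the z-score transform; a minimal sketch (the helper name is illustrative):

```python
import numpy as np

def z_normalize(x):
    """Rescale a series to zero mean and unit variance so that the Euclidean
    distance compares shapes rather than scales."""
    x = np.asarray(x, dtype=float)
    sd = x.std()
    return (x - x.mean()) / sd if sd > 0 else x - x.mean()
```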

When the time series have unevenly distributed points, Möller-Levet et al. (2003) use the so-called short time series (STS) distance, defined in Eq. 5.3. This distance measure also takes into account the shape of the time series considered.

\[ d^2(\mathbf{y}, \mathbf{v}) = \sum_{r=1}^{T-1} \left( \frac{y_{r+1} - y_r}{t_{r+1} - t_r} - \frac{v_{r+1} - v_r}{t_{r+1} - t_r} \right)^2. \tag{5.3} \]
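Eq. 5.3 compares the slopes of the two series on each sampling interval; a direct sketch in Python:

```python
import numpy as np

def sts_distance_sq(y, v, t):
    """Squared short time series distance (Eq. 5.3) between series y and v
    sampled at (possibly unevenly spaced) time points t."""
    y, v, t = (np.asarray(a, dtype=float) for a in (y, v, t))
    dy = np.diff(y) / np.diff(t)  # slopes of y on each interval
    dv = np.diff(v) / np.diff(t)  # slopes of v on each interval
    return float(np.sum((dy - dv) ** 2))

# Identical shapes at different levels are at distance zero
print(sts_distance_sq([0, 2, 5], [10, 12, 15], t=[0, 1, 3]))  # prints 0.0
```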

On the other hand, Pascazio et al. (2007) propose a clustering procedure based on the Hausdorff distance as a similarity measure between cluster elements. The authors use the Hausdorff distance in hierarchical clustering and recommend this tool for the analysis of complex sets with complicated (and even fractal-like) structures. The method is applied to financial time series.

5.3.2 The clustering algorithm

Fuzzy C-Means Clustering Algorithm (FCM)

The idea of the partitional clustering algorithms considered here is based on the compactness of clusters and the separation between clusters. The total sum of the distances between data points and their cluster centroids is often used as a figure of merit. The fuzzy clustering problem can be formulated by minimizing the function given in Eqs. 5.4, where K is the number of clusters, $u_{ij} \in [0, 1]$ is the degree of membership of the jth object in the ith cluster, and m is the fuzzifier, which influences the performance of the clustering algorithm. $Y = \{\mathbf{y}_1, \dots, \mathbf{y}_N\} \subset \Re^T$ are the feature data, $V = \{\mathbf{v}_1, \dots, \mathbf{v}_K\} \subset \Re^T$ are the cluster centroids, and $U = [u_{ij}]_{K \times N}$ is the fuzzy partition matrix.


Minimize:

J_m(Y, V, U) = \sum_{j=1}^{N} \sum_{i=1}^{K} u_{ij}^m \, d^2(\mathbf{y}_j, \mathbf{v}_i). \quad (5.4a)

Subject to:

\sum_{i=1}^{K} u_{ij} = 1, \quad \forall j = 1, \ldots, N, \quad (5.4b)

where u_{ij} must be non-negative for all i, j. The optimal u_{ij} is determined by equating to zero the derivative of the Lagrangian of the optimization problem and solving the resulting system; the result is shown in Eq. 5.5:

u_{ij} = \left[ \sum_{k=1}^{K} \left( \frac{d^2(\mathbf{y}_j, \mathbf{v}_i)}{d^2(\mathbf{y}_j, \mathbf{v}_k)} \right)^{1/(m-1)} \right]^{-1}, \quad (5.5)

The optimization of vi follows the same procedure and the results are shownin Eq. 5.6.

\mathbf{v}_i = \frac{\sum_{j=1}^{N} u_{ij}^m \, \mathbf{y}_j}{\sum_{j=1}^{N} u_{ij}^m}. \quad (5.6)

It is clear that Eqs. 5.5 and 5.6 are coupled, so it is not possible to obtain closed-form solutions. One way to proceed is to iterate between the two updates to obtain estimates of u_{ij} and v_i. This iteration is called the FCM clustering algorithm (see Theodoridis & Koutroumbas (2006), p. 602). Algorithm 3 shows the standard FCM algorithm, which requires an initial set of cluster centroids; we use the Kaufman initialization method (see Algorithm 2) to obtain such centroids.
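For concreteness, the alternating iteration between Eqs. 5.5 and 5.6 can be sketched as follows (an assumed Python rendering, not the thesis's Matlab code):

```python
import numpy as np

def fcm(X, K, m=2.0, eps=1e-5, max_iter=200, seeds=None, rng=0):
    """Minimal sketch of the FCM iteration with Euclidean distance.

    X : (n, T) data matrix. Returns (centroids V, partition matrix U).
    `seeds` can come from the Kaufman method; random rows otherwise.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    idx = seeds if seeds is not None else rng.choice(n, size=K, replace=False)
    V = X[idx].astype(float).copy()
    U = np.zeros((K, n))
    for _ in range(max_iter):
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + 1e-12
        U_new = d2 ** (-1.0 / (m - 1))
        U_new /= U_new.sum(axis=0, keepdims=True)   # Eq. 5.5 (normalized form)
        W = U_new ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)  # Eq. 5.6
        if np.linalg.norm(U_new - U) < eps:
            return V, U_new
        U = U_new
    return V, U
```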

Fuzzy Short Time Series Clustering Algorithm (FSTS)

In time series clustering tasks it is usually helpful to group the data according to their shape or other characteristics, as discussed before; a special-purpose clustering algorithm can then be used. In this work we consider the fuzzy clustering algorithm proposed by Moller-Levet et al. (2003, 2005); this algorithm is the same as the fuzzy c-means algorithm but uses the short time series distance given in Eq. 5.3 rather than the standard Euclidean distance.


Algorithm 3: Fuzzy C-Means (FCM) clustering algorithm.

Data: Time series matrix X; number of clusters K; fuzzifier parameter m; termination tolerance ε; number of time series n; length of the time series T.
Result: The set of cluster centroids {v_1, ..., v_K}; partition matrix U.

1   Obtain the initial partition using the Kaufman initialization method given in Algorithm 2;
2   Get the initial partition matrix, U^0;
3   l = 0;
4   while ||U^l - U^{l-1}|| ≥ ε do
5       Compute the cluster prototypes using Eq. 5.6;
6       Compute the distance of each data point to each cluster centroid using the Euclidean distance;
7       Update the partition matrix using Eq. 5.5;
8       l = l + 1;
9   end

The optimization problem is the same as for the FCM algorithm, given in Eqs. 5.4, and the membership value u_{ij} for the FSTS clustering algorithm is given in Eq. 5.5; these membership values must be calculated using the short time series distance. The optimization of v_i, however, is quite different from that of FCM clustering; the system of equations obtained after differentiating and equating to zero is shown in Eqs. 5.7:

a_r v_{i,r-1} + b_r v_{i,r} + c_r v_{i,r+1} = m_{ir}, \quad (5.7)

where

a_r = -(t_{r+1} - t_r)^2, \quad b_r = -(a_r + c_r), \quad c_r = -(t_r - t_{r-1})^2,

d_r = (t_{r+1} - t_r)^2, \quad e_r = -(d_r + f_r), \quad f_r = (t_r - t_{r-1})^2,

and

m_{ir} = \frac{\sum_{j=1}^{N} u_{ij}^m \, (d_r y_{j,r-1} + e_r y_{j,r} + f_r y_{j,r+1})}{\sum_{j=1}^{N} u_{ij}^m}.

This yields an underdetermined system of equations. However, by adding two fixed time points with value 0 at the beginning and at the end of the time series (such points do not alter the results), it is possible to solve the system (see Moller-Levet et al. (2003) and Moller-Levet et al. (2005) for details). The resulting system is tridiagonal and can be solved efficiently with the so-called tridiagonal matrix algorithm (TDMA); Moller-Levet et al. (2003) also give a closed-form recursive equation for its solution. The fuzzy clustering algorithm for short time series is shown in Algorithm 4.
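As an illustration of the solution step, the following is a standard TDMA (Thomas algorithm) sketch in Python; the interface is our own assumption:

```python
import numpy as np

def thomas(a, b, c, d):
    """Thomas algorithm (TDMA) for a tridiagonal system, as a sketch.

    Solves b[0]x[0]+c[0]x[1]=d[0], a[i-1]x[i-1]+b[i]x[i]+c[i]x[i+1]=d[i],
    and a[-1]x[-2]+b[-1]x[-1]=d[-1]. a, c have length n-1; b, d length n.
    """
    n = len(b)
    cp, dp = np.empty(n - 1), np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):              # forward elimination
        denom = b[i] - a[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / denom
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):     # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```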

Algorithm 4: Fuzzy short time series (FSTS) clustering algorithm.

Data: Time series matrix X; number of clusters K; fuzzifier parameter m; termination tolerance ε; number of time series n; length of the time series T.
Result: The set of cluster centroids {v_1, ..., v_K}; partition matrix U.

1   Add two fixed time points at the beginning and the end of the time series: X = [[0]_{n×1}, X, [0]_{n×1}];
2   Obtain the initial partition using the Kaufman initialization method given in Algorithm 2;
3   Get the initial partition matrix, U^0;
4   l = 0;
5   while ||U^l - U^{l-1}|| ≥ ε do
6       Compute the cluster prototypes by setting v(i, 1) = 0 and v(i, T + 2) = 0 and solving the system given in Eqs. 5.7 to obtain the K centroids;
7       Compute the distance of each data point to each cluster using Eq. 5.3;
8       Update the partition matrix using Eq. 5.5;
9       l = l + 1;
10  end

Fuzzy Maximum Likelihood Estimation (FMLE) Clustering Algorithm

The following analysis is based on Benitez et al. (2013). Consider the clusters as random events, each occurring in the sample space X with a positive probability P(i).


By the theorem of total probability,

P(X) = \sum_{i=1}^{K} P(X|i) \, P(i),

where P(X|i) is the conditional probability of observing the data given that the event (cluster) i occurs. If it is assumed that P(X|i) is the Gaussian density N(v_i, Σ_i), then P(X) can be seen as a Gaussian mixture. The above equation can be written as the likelihood function with parameters Θ = {P(i), v_i, Σ_i}:

P(X) = P(X; \Theta) = \prod_{j=1}^{N} P(\mathbf{y}_j|\Theta) = \prod_{j=1}^{N} \sum_{i=1}^{K} P(\mathbf{y}_j|i) \, P(i) = L(\Theta; X). \quad (5.8)

Given that P(yj|i) is Gaussian, it can be shown that

P(\mathbf{y}_j|i) = \frac{1}{\sqrt{(2\pi)^T |\Sigma_i|}} \exp\left[ -\frac{1}{2} (\mathbf{y}_j - \mathbf{v}_i)' \Sigma_i^{-1} (\mathbf{y}_j - \mathbf{v}_i) \right].

The parameters Θ are obtained by maximizing the likelihood function given in Eq. 5.8. This is done by equating to zero the derivative (with respect to Θ) of the likelihood function and solving the resulting system of equations for each parameter. The results of the optimization problem are shown in Eqs. 5.9.

P(i) = \frac{1}{N} \sum_{j=1}^{N} P(i|\mathbf{y}_j), \quad (5.9a)

\mathbf{v}_i = \frac{\sum_{j=1}^{N} P(i|\mathbf{y}_j) \, \mathbf{y}_j}{\sum_{j=1}^{N} P(i|\mathbf{y}_j)}, \quad (5.9b)

\Sigma_i = \frac{\sum_{j=1}^{N} P(i|\mathbf{y}_j) \, (\mathbf{y}_j - \mathbf{v}_i)(\mathbf{y}_j - \mathbf{v}_i)'}{\sum_{j=1}^{N} P(i|\mathbf{y}_j)}. \quad (5.9c)

Note that the expression for v_i is similar to the one found for the FCM clustering algorithm in Eq. 5.6. The idea is that the posterior probability^9 P(i|y_j) plays the role of the degree of membership u_{ij}^m. The term P(i|y_j) is calculated as follows:

P(i|\mathbf{y}_j) = \left[ \sum_{k=1}^{K} \frac{d_e^2(\mathbf{y}_j, \mathbf{v}_i)}{d_e^2(\mathbf{y}_j, \mathbf{v}_k)} \right]^{-1}, \quad (5.10a)

9 Probability of selecting the ith cluster given the jth feature vector.


where

d_e^2(\mathbf{y}_j, \mathbf{v}_i) = \frac{\sqrt{|\Sigma_i|}}{P(i)} \exp\left[ \frac{1}{2} (\mathbf{y}_j - \mathbf{v}_i)' \Sigma_i^{-1} (\mathbf{y}_j - \mathbf{v}_i) \right]. \quad (5.10b)

The FMLE algorithm uses an exponential distance measure d_e^2 based on maximum likelihood estimation. The characteristics of the FMLE clustering algorithm make it suitable for partitioning the data into hyper-ellipsoidal clusters (see Gath & Geva (1989b)). The FMLE algorithm is shown in Algorithm 5.

Algorithm 5: Fuzzy Maximum Likelihood Estimation (FMLE) clustering algorithm.

Data: Time series matrix X; number of clusters K; termination tolerance ε; number of time series n; length of the time series T.
Result: The set of cluster centroids {v_1, ..., v_K}; posterior probabilities P(i|y_j).

1   Obtain the initial partition using the Kaufman initialization method given in Algorithm 2;
2   Compute the posterior probabilities given in Eq. 5.10a;
3   l = 0;
4   while ||U^l - U^{l-1}|| ≥ ε do
5       Compute the cluster prototypes using Eq. 5.9b;
6       Compute the parameters given in Eqs. 5.9a and 5.9c;
7       Update the posterior probabilities given in Eq. 5.10a;
8       l = l + 1;
9   end
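As a sketch of the update in step 7, the posterior probabilities of Eqs. 5.10 can be computed as follows (illustrative Python with assumed, well-conditioned covariance matrices; not the thesis implementation):

```python
import numpy as np

def fmle_posteriors(X, V, Sigmas, priors):
    """Posterior probabilities P(i|y_j) via the exponential distance (Eq. 5.10).

    X : (n, T) data; V : (K, T) centroids; Sigmas : list of K (T, T)
    covariance matrices; priors : length-K array of P(i).
    """
    n, K = X.shape[0], V.shape[0]
    d2e = np.empty((K, n))
    for i in range(K):
        diff = X - V[i]                                    # (n, T)
        inv = np.linalg.inv(Sigmas[i])
        maha = np.einsum('nt,ts,ns->n', diff, inv, diff)   # Mahalanobis terms
        # Exponential distance of Eq. 5.10b.
        d2e[i] = np.sqrt(np.linalg.det(Sigmas[i])) / priors[i] * np.exp(0.5 * maha)
    inv_d = 1.0 / d2e
    return inv_d / inv_d.sum(axis=0, keepdims=True)        # Eq. 5.10a
```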

5.3.3 Fuzzy cluster validity indices

In general, in most clustering algorithms the number of clusters must be specified beforehand, and selecting a different number of clusters results in a different partition. For this reason it is necessary to evaluate several partitions. The problem of finding the optimal number of clusters is called cluster validity. The selection of the appropriate partition must be done


according to a performance index. The total distance to the cluster centroids (the objective value of the clustering problem, see Eq. 5.4) is not the best option, because this metric tends to decrease as the number of clusters increases. In general, a fuzzy cluster validity index must consider both the partition matrix and the data set itself (Wang & Zhang, 2007).

Wang & Zhang (2007) carry out an evaluation of different fuzzy cluster validity indices using eight synthetic data sets and eight well-known data sets. Their work shows that none of the indices considered identifies the correct number of clusters in all data sets, but some indices obtain good results. The authors find that the PBMF index (see Pakhira et al. (2004)) fails only once to detect the correct number of clusters over the sixteen datasets. Other indices, such as the PCAES, the Granularity-Dissimilarity (GD), and the SC index, also obtain good results. In the following we describe the validity indices used in this work.

Xie and Beni index, VXB

The XB index is defined in Eq. 5.11:

V_{XB} = \frac{\sum_{i=1}^{K} \sum_{j=1}^{n} u_{ij} \, d^2(\mathbf{y}_j, \mathbf{v}_i)}{n \min_{i \neq j} \{ d^2(\mathbf{v}_i, \mathbf{v}_j) \}}. \quad (5.11)

The index focuses on two properties: compactness and separation. The numerator of Eq. 5.11 measures the compactness and the denominator measures the separation between clusters, so lower values are better. The validation problem then becomes finding the partition k for which k = argmin_{c=2,...,n-1} V_{XB}(c).
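A direct transcription of Eq. 5.11 into Python, under the same partition-matrix conventions as above (a sketch, not the code used in the experiments), is:

```python
import numpy as np

def xie_beni(X, V, U):
    """Xie-Beni index of Eq. 5.11: compactness over separation (lower is better).

    X : (n, T) data; V : (K, T) centroids; U : (K, n) fuzzy partition matrix.
    """
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # d^2(y_j, v_i)
    compactness = (U * d2).sum()
    sep = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # d^2(v_i, v_j)
    np.fill_diagonal(sep, np.inf)                             # exclude i == j
    return compactness / (X.shape[0] * sep.min())
```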

PBMF index, VPBMF

The PBMF index (see Pakhira et al. (2004)) is defined in Eq. 5.12:

V_{PBMF} = \left( \frac{1}{K} \times \frac{E_1}{J_m} \times D_K \right)^2, \quad (5.12)

here

E_1 = \sum_{j=1}^{n} d(\mathbf{y}_j, \bar{\mathbf{v}}) \quad \text{and} \quad D_K = \max_{i,j=1,\ldots,K} d(\mathbf{v}_i, \mathbf{v}_j),

where \bar{\mathbf{v}} is the centroid of the whole dataset taken as a single cluster.


J_m is the objective value of the clustering problem,

J_m(Y, V, U) = \sum_{j=1}^{N} \sum_{i=1}^{K} u_{ij}^m \, d^2(\mathbf{y}_j, \mathbf{v}_i).

The index comprises three factors. The first factor, 1/K, penalizes the divisibility of a K-cluster system; it decreases as K increases and helps avoid degenerate solutions at large K. The second factor, E_1/J_m, relates the sum of intra-cluster distances of the complete dataset taken as a single cluster to that of the K-cluster system (the objective value); it measures the compactness of the cluster system and is required to increase (Pakhira et al., 2004). The third factor, D_K, is the maximum inter-cluster separation and is also required to increase. The validation problem becomes finding the partition k for which k = argmax_{c=2,...,n-1} V_{PBMF}(c).

PCAES index, VPCAES

The Partition Coefficient and Exponential Separation (PCAES) index (seeWu & Yang (2005)) is defined in Eq. 5.13

V_{PCAES} = \sum_{i=1}^{K} \frac{\sum_{j=1}^{n} u_{ij}^2}{u_M} - \sum_{i=1}^{K} \exp\left( -\min_{k \neq i} \left\{ \frac{d^2(\mathbf{v}_i, \mathbf{v}_k)}{\beta_T} \right\} \right), \quad (5.13)

where u_M = \min_{i=1,\ldots,K} \left\{ \sum_{j=1}^{n} u_{ij}^m \right\}, \beta_T = \sum_{i=1}^{K} d^2(\mathbf{v}_i, \bar{\mathbf{v}}), and \bar{\mathbf{v}} = \sum_{j=1}^{n} \mathbf{y}_j / n.

The first term of the index measures the compactness of the cluster system relative to the most compact cluster, while the second term provides an exponential-type inter-cluster separation. The validation problem becomes finding the partition k for which k = argmax_{c=2,...,n-1} V_{PCAES}(c).

SC index, VSC

The SC index (see Zahid et al. (1999)) is defined in Eq. 5.14:

V_{SC} = SC_1(K) - SC_2(K), \quad (5.14)

where

SC_1(K) = \frac{\sum_{i=1}^{K} d^2(\mathbf{v}_i, \bar{\mathbf{v}}) / K}{\sum_{i=1}^{K} \left( \sum_{j=1}^{n} u_{ij}^m \, d^2(\mathbf{y}_j, \mathbf{v}_i) / \sum_{j=1}^{n} u_{ij} \right)},


and

SC_2(K) = \frac{\sum_{i=1}^{K-1} \sum_{r=i+1}^{K} \left( \sum_{j=1}^{n} (\min\{u_{ij}, u_{rj}\})^2 / \sum_{j=1}^{n} \min\{u_{ij}, u_{rj}\} \right)}{\sum_{j=1}^{n} \left( \max_{i=1,\ldots,K} \{u_{ij}\} \right)^2 / \sum_{j=1}^{n} \max_{i=1,\ldots,K} \{u_{ij}\}}.

The first and second terms of Eq. 5.14 measure the fuzzy compactness and the fuzzy inter-cluster separation, respectively, considering geometrical properties of the data and of the membership function; the index thus obtains a fuzzy compactness/fuzzy separation degree. The validation problem becomes finding the partition k for which k = argmax_{c=2,...,n-1} V_{SC}(c).

5.3.4 Clustering results

Experiments were conducted in order to validate the clustering results and the fuzzy cluster indices. We use dataset SD1 described in Section 5.1.2, which was generated according to four classes or groups. The idea is to cluster the time series using Algorithm 4 and to validate the clustering using the indices described in Section 5.3.3 in order to determine the number of clusters K. The results are compared with the correct number of clusters.

In these experiments we dealt with the FSTS clustering algorithm. The fuzzifier parameter of the clustering algorithm (see Algorithm 4) was set to m = 1.65, and the convergence criterion was set to ε = 10^{-5}. The algorithm used the short time series distance given in Eq. 5.3. The fuzzy cluster indices were evaluated for different numbers of clusters, from a minimum of K_min = 2 to a maximum of K_max = √n, where n is the number of time series (n = 600, see Section 5.1.2). The cluster validity indices were calculated using the short time series distance.

The results were obtained using a raw-data-based approach in which the complete time series is taken as the feature vector, with the dimensionality of the time series reduced using Algorithm 1. The optimal number of clusters for the different cluster validity indices and dimensionalities is shown in Tab. 5.2. The PCAES index obtained the most accurate results, considering that the correct number of clusters is 4. In Fig. 5.2, where the SD1 dataset has been reduced to 3 features, the presence of four fuzzy clusters can be seen. This reduction to 3 PIPs for a SLCP is obtained by selecting the initial sales, the peak sales, and the final sales of the product.


Tab. 5.2: Validation results for the SD1 dataset using the FSTS algorithm.

                 Dimensionality*
Index    Raw data    13    6    3
XB       19          17    22   23
PBMF     24          21    3    15
PCAES    6           2     2    2
SC       19          24    24   24

*Number of periods in the time series.

5.3.5 Clustering results for the real datasets

The raw datasets and the PCAES index were used to investigate the optimal number of clusters for the real datasets (see Section 5.1.1); the results are shown in Tab. 5.3. Fig. 5.3 shows the representation of the real datasets by their 3 PIPs. Note that these datasets do not have a clustering structure as clear as that of the synthetic SD1 dataset. On the other hand, all real datasets take integer values. This implies that clustering tendency tests such as Hopkins (see Banerjee & Dave (2004)) and Cox-Lewis may erroneously conclude that the data follow a clustering structure^10. The reason is the integrality of the data: for example, in Fig. 5.3 (c) the data may appear to form clusters at the values y_1 = {1, 2, 3, 4, 5, 6}, when in fact these are the only integer values that the dataset can take.

Tab. 5.3: Optimal number of clusters for the real datasets.

Dataset    Number of clusters
RD1        2
RD2        2
RD3        5

10 The Hopkins test, for example, tests the hypothesis that the data are randomly (uniformly) generated within their convex hull, against the hypothesis that the data form clusters or are not generated in a completely random manner. The test is carried out using Monte Carlo simulation: synthetic data are generated uniformly within the convex hull of the real data, and the synthetic and real data are then compared in order to assess the clustering tendency.


Fig. 5.2: Representation of the SD1 data set by its 3 PIPs.

5.4 Conclusions of the chapter

Regarding the univariate analysis of time series, this chapter exposed the difficulties of performing the unit-root stationarity test and some non-linearity tests on these data. The time series do not satisfy the regularity conditions required by these tests; therefore, the time series are non-stationary. Difficulties also arise because the time series are very short, which implies that the parameter estimation process of ARMA models generates inaccurate results. In conclusion, our experience showed that it is difficult to analyze time series of SLCPs using traditional ARIMA or ARMA models.

On the other hand, the multivariate analysis allowed finding groups in the time series data. We presented several clustering algorithms which will be used in the forecasting framework, as will be seen later. In this chapter we tested the results for the FSTS algorithm only and evaluated different cluster validity indices. The cluster validity results for the synthetic dataset do not always reveal the correct partition of the data; however, the PCAES index obtains good partitions. It was shown that the synthetic dataset follows a clustering structure of 4 groups, while the real datasets have no clear clustering structure.


Fig. 5.3: Representation of the real datasets by their 3 PIPs. (a) RD1 dataset; (b) RD2 dataset; (c) RD3 dataset.


SIX

REGRESSION METHODS

Given that this work proposes the use of machine learning models for forecasting the demand of SLCPs, we consider models such as multiple linear regression, support vector regression, and artificial neural networks. In this chapter we briefly describe the theoretical background of these methods. Regression methods are parameter dependent; therefore, it is necessary to define a method for tuning their parameters. This work employs the response surface methodology to tune parameters, since it is an efficient approach to process optimization.

Forecasting methods use historical information to predict what will happen in the future. We can refer to this as the problem of learning from examples, as stated by Vapnik (1999) in the context of statistical learning theory and machine learning. The problem of forecasting can then be stated as a problem of learning, specifically when the functional relationship y_t = f(y_{t-1}, y_{t-2}, ..., y_{t-p}) must be learned from historical information. The problem of finding the function that best predicts the value y_t is also called the problem of regression. This chapter presents the regression-based methods used to forecast the demand of short life cycle products. For convenience we refer to the argument of the function as the p-dimensional vector \mathbf{y} = (y_{t-1}, y_{t-2}, ..., y_{t-p})' and to the result of the function as y, which means that y ≡ y_t. The functional relationship can then be written as y = f(\mathbf{y}), assuming that we have available the historical information

(\mathbf{y}_1, y_1), ..., (\mathbf{y}_n, y_n),

which is called the training set S, with \mathbf{y}_i ∈ X ⊆ R^p and y_i ∈ Y ⊆ R.


The forecasts based on linear regression discussed here were first proposed by Rodrıguez & Vidal (2009) and Rodrıguez (2007) to predict the demand of SLCPs. In this study we compare the performance of this method with nonlinear models such as support vector regression and artificial neural networks.

6.1 Multiple linear regression

Here we assume that the function y = f(\mathbf{y}) is linear, that is,

y = f(\mathbf{y}) = \mathbf{w}'\mathbf{y} + w_0. \quad (6.1)

In terms of forecasting, it is assumed that the values of a time series are linear functions of some of their past values. Eq. 6.1 is completely determined by the parameters w_0 and \mathbf{w}, and the idea is to choose these parameters so as to minimize the regression error. In general, the least squares approach is followed: the linear regression problem is solved by minimizing the sum of squared errors (deviations). The total sum of squared errors defines the loss function L, also known as the square loss (Cristianini & Shawe-Taylor, 2000):

L(\mathbf{w}, w_0) = \sum_{i=1}^{n} (y_i - f(\mathbf{y}_i))^2 = \sum_{i=1}^{n} (y_i - \mathbf{w}'\mathbf{y}_i - w_0)^2, \quad (6.2)

The above equation can be expressed in matrix notation by setting \hat{\mathbf{w}} = (\mathbf{w}', w_0)' and (see Cristianini & Shawe-Taylor (2000))

\hat{Y} = \begin{bmatrix} \hat{\mathbf{y}}_1' \\ \hat{\mathbf{y}}_2' \\ \vdots \\ \hat{\mathbf{y}}_n' \end{bmatrix}, \quad \text{where } \hat{\mathbf{y}}_i = (\mathbf{y}_i', 1)'.

In matrix notation the loss function (Eq. 6.2) becomes

L(\hat{\mathbf{w}}) = (\mathbf{y} - \hat{Y}\hat{\mathbf{w}})'(\mathbf{y} - \hat{Y}\hat{\mathbf{w}});

taking the derivative of the loss function with respect to \hat{\mathbf{w}}, equating it to zero, and solving, we obtain the solution of the least squares problem:

\hat{\mathbf{w}} = (\hat{Y}'\hat{Y})^{-1} \hat{Y}'\mathbf{y}. \quad (6.3)
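For illustration, Eq. 6.3 amounts to an ordinary least squares fit with an appended intercept column; a sketch using numpy (preferring lstsq over the explicit inverse for numerical stability) is:

```python
import numpy as np

def fit_mlr(Y, y):
    """Least-squares fit of Eq. 6.3, sketched with numpy.

    Y : (n, p) matrix of lagged regressors; y : (n,) responses.
    Returns the augmented weight vector (w', w0)'.
    """
    Yh = np.hstack([Y, np.ones((Y.shape[0], 1))])   # append the intercept column
    w_hat, *_ = np.linalg.lstsq(Yh, y, rcond=None)
    return w_hat
```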


6.2 Support vector regression

Support vector machines (SVMs) are commonly used in forecasting tasks and time series analysis; some literature on forecasting with SVMs can be found in Pai et al. (2010), Yang et al. (2007), and Hu et al. (2011). In particular, Xu & Zhang (2008) use the ε-SVR to forecast the demand of a short life cycle product, which is the interest of this work; however, those authors take into account exogenous variables in the learning process rather than past values of the time series.

In regression problems there are mainly two kinds of SVMs: the so-called ν-SVR (nu-Support Vector Regression, see Scholkopf et al. (–)) and the ε-SVR. In this work we consider the Epsilon Support Vector Regression, or ε-SVR, proposed by Vapnik (Smola & Schlkopf, 2004), and we present a more detailed description of this kind of SVM. Consider the simplest case, in which we need to perform a linear regression on some dataset, as illustrated in Fig. 6.1. The idea is to obtain a regression line, contained in a tube of width 2ε, that contains all (or the greatest number of) data points and is as flat as possible. The reason for doing this is to avoid overfitting (see (Lin, 2006, page 48)); it can be shown that this objective is equivalent to minimizing the theoretical maximum of the generalization error (see Cristianini & Shawe-Taylor (2000); Vapnik (1999)).

Fig. 6.1: Illustration of the linear ε-SVR with soft margin.

Let the regression line be given by Eq. 6.1. The flatness of the regression function is determined by


the value of the parameter \mathbf{w}, and it is obtained by minimizing the norm \|\mathbf{w}\|^2 (see Lin (2006); Smola & Schlkopf (2004)). Then the following convex optimization problem is obtained.

Minimize:

\frac{1}{2} \|\mathbf{w}\|^2.

Subject to:

y_i - \mathbf{w}'\mathbf{y}_i - w_0 \leq \varepsilon,
\mathbf{w}'\mathbf{y}_i + w_0 - y_i \leq \varepsilon.

The constraints of this problem require all data points to lie within a tube of width 2ε. However, it is possible to relax these constraints by adding slack variables; the resulting optimization problem is called the soft margin problem.

Soft margin problem

The soft margin problem allows some error (data points outside the tube), measured by the slack variables ξ_i and ξ*_i for data point \mathbf{y}_i. Adding slack variables makes it necessary to penalize their magnitude (the error) in the objective function, according to some penalizing cost C. The resulting optimization problem is shown in Eqs. 6.4.

Minimize:

\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*). \quad (6.4a)

Subject to:

y_i - \mathbf{w}'\mathbf{y}_i - w_0 \leq \varepsilon + \xi_i, \quad (6.4b)
\mathbf{w}'\mathbf{y}_i + w_0 - y_i \leq \varepsilon + \xi_i^*, \quad (6.4c)
\xi_i, \xi_i^* \geq 0. \quad (6.4d)

The optimization problem 6.4 can be solved more easily in its dual formu-lation. More important yet is the fact that the dual problem provides the key


for extending the SVM to nonlinear regression problems (see Smola & Schlkopf(2004)). The Lagrangian of the optimization problem is

L = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) - \sum_{i=1}^{n} \alpha_i (\varepsilon + \xi_i - y_i + \mathbf{w}'\mathbf{y}_i + w_0) - \sum_{i=1}^{n} \alpha_i^* (\varepsilon + \xi_i^* + y_i - \mathbf{w}'\mathbf{y}_i - w_0) - \sum_{i=1}^{n} (\eta_i \xi_i + \eta_i^* \xi_i^*), \quad (6.5)

where L is the Lagrangian and α_i, α*_i, η_i, η*_i are the Lagrange multipliers (dual variables), required to satisfy α_i, α*_i, η_i, η*_i ≥ 0. The partial derivatives of L with respect to the primal variables (\mathbf{w}, w_0, ξ_i, ξ*_i) have to vanish at optimality:

\partial_{w_0} L = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i) = 0, \quad (6.6a)

\partial_{\mathbf{w}} L = \mathbf{w} - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \mathbf{y}_i = 0, \quad (6.6b)

\partial_{\xi_i} L = C - \alpha_i - \eta_i = 0, \quad (6.6c)

\partial_{\xi_i^*} L = C - \alpha_i^* - \eta_i^* = 0. \quad (6.6d)

Substituting the results given in Eqs. 6.6 into the primal optimization problem 6.4 yields the dual optimization problem.

Maximize:

-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \, \mathbf{y}_i'\mathbf{y}_j - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) y_i, \quad (6.7a)

Subject to:

\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0, \quad \text{and} \quad \alpha_i, \alpha_i^* \in [0, C]. \quad (6.7b)


Kernel methods

The SVR discussed in the previous sections can solve linear regression problems; this is a limitation, because many real problems are nonlinear. The concept of a kernel is of great importance for dealing with nonlinearities. The basic idea is to map the data points y in an input space X to a vector space H, called the feature space, via a nonlinear mapping ϕ(·): X → H. The purpose is to translate the nonlinear structure of the data in the space X into linear structure in the higher dimensional space H (see Prez-Cruz & Bousquet (2004)).

When a suitable mapping is used, the data ϕ(y_i), ∀i = 1, ..., n, appear linear, and the regression function is determined using the data in the feature space to obtain y = \mathbf{w}'ϕ(\mathbf{y}) + w_0. There are some mappings for which the inner product ϕ'(\mathbf{x})ϕ(\mathbf{y}) can be computed directly from \mathbf{x} and \mathbf{y} without explicitly computing ϕ(\mathbf{x}) and ϕ(\mathbf{y}); this technique is known as the kernel trick, and the inner product is written in the simpler form K(\mathbf{x}, \mathbf{y}). The objective of the SVR optimization problem given in Eq. 6.7a is then modified to Eq. 6.8. Note that by using the kernel trick it is not necessary to compute the maps ϕ(\mathbf{x}) and ϕ(\mathbf{y}) directly; this is the key of kernel methods such as SVMs, and another reason for using the dual optimization problem rather than the primal.

-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \, K(\mathbf{y}_i, \mathbf{y}_j) - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) y_i. \quad (6.8)

It is important to note that the objective 6.8 represents a quadratic functionand such function must be convex, which implies that the kernel matrix [Kij](i.e. the Hessian matrix of the objective function) must be positive definite.

Gaussian Kernel: The Gaussian kernel (Eq. 6.9) is commonly used in the literature, and for this reason it is the kernel used in this work. For the Gaussian kernel it is necessary to define the parameter σ.

K(\mathbf{x}, \mathbf{y}) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2} \right). \quad (6.9)
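As an illustration only (the thesis uses LIBSVM in Matlab, see Section 7.5), an ε-SVR with the Gaussian kernel can be fit with scikit-learn, where gamma plays the role of 1/(2σ²) in Eq. 6.9; the data below are synthetic placeholders:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical lagged training data: rows (y_{t-1}, ..., y_{t-p}), targets y_t.
rng = np.random.default_rng(0)
Y = rng.random((50, 3))
y = Y @ np.array([0.5, 0.3, 0.2]) + 0.05 * rng.standard_normal(50)

# C, epsilon and gamma are the parameters tuned later in Section 6.4.
model = SVR(kernel="rbf", C=2.0**9, epsilon=2.0**-2, gamma=2.0**-4)
model.fit(Y, y)
print(model.predict(Y[:3]))
```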

6.3 Artificial Neural Networks

In this work we focus on feed-forward neural networks (multilayer perceptrons). An illustration of such a network is shown in Fig. 6.2: a network with two inputs, two hidden layers with three neurons (nodes) in hidden layer 1,


two neurons in hidden layer 2, and one output. The input nodes are connected forward to the hidden neurons, and these neurons are connected forward to the output neuron.

Fig. 6.2: Feed-forward neural network with two hidden layers, two inputs and one output.

The connection between neurons i and j is associated with a weight w_{ij}. When an input vector \mathbf{y} is presented to the network, each element (input) of the vector is propagated through the network, being affected by the connection weights (see Tsay (2005)). The information that neuron j of the first hidden layer receives is a linear combination of the inputs and the weights; this information is processed through an activation function that defines the output of the neuron as follows:

h_j^1 = f_j\left( w_{0j}^1 + \sum_{i \to j} w_{ij}^1 y_i \right),

where w_{0j}^1 is a constant term called the bias, and the summation i → j means summing over all input nodes. The activation function for hidden neurons is usually a tangent sigmoid or logistic function. Given the logistic function

f_j(z) = \frac{e^z}{1 + e^z},


the output of the jth neuron of the first hidden layer is

h_j^1 = \frac{\exp\left( w_{0j}^1 + \sum_{i \to j} w_{ij}^1 y_i \right)}{1 + \exp\left( w_{0j}^1 + \sum_{i \to j} w_{ij}^1 y_i \right)}.

If there are H hidden layers, then the output of the jth neuron of the last hidden layer is

h_j^H = f_j\left( w_{0j}^H + \sum_{i \to j} w_{ij}^H h_i^{H-1} \right),

and this value corresponds to one input of the output layer neuron. Generally, an output layer neuron has a linear activation function or a Heaviside function. In the case of a linear activation function the network output is

o = w_0^o + \sum_{i \to o} w_{io}^o h_i^H.
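The forward pass just described can be sketched in a few lines (an assumed vectorized layout, not the Neural Network Toolbox implementation used later):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(y, hidden_weights, hidden_biases, w_out, b_out):
    """Sketch of the forward pass of a feed-forward network.

    hidden_weights[l] is the (n_{l-1}, n_l) weight matrix of hidden layer l
    and hidden_biases[l] its bias vector; the output neuron is linear.
    """
    h = y
    for W, b in zip(hidden_weights, hidden_biases):
        h = logistic(b + h @ W)   # h^l_j = f_j(w^l_{0j} + sum_i w^l_{ij} h^{l-1}_i)
    return b_out + h @ w_out      # linear output neuron

# Tiny usage example with the 2-3-2-1 architecture of Fig. 6.2 (random weights).
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((2, 3)), rng.standard_normal((3, 2))]
bs = [rng.standard_normal(3), rng.standard_normal(2)]
print(forward(np.array([0.1, 0.9]), Ws, bs, rng.standard_normal(2), 0.0))
```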

The idea of using feed-forward neural networks is, given the pairs (\mathbf{y}_l, y_l), i.e. the input and output patterns, to determine the values of the weights w_{ij} and biases w_{0j} that generate outputs o_l as close as possible to y_l^1. This can be done by minimizing some fitting criterion, such as the least squares error

s^2 = \sum_{l=1}^{n} (o_l - y_l)^2,

so the process of training a neural network becomes a nonlinear optimization problem. Several algorithms have been proposed for this problem; a well-known one is the back propagation (BP) algorithm, based on the gradient descent method, and other optimization algorithms such as Levenberg-Marquardt are also commonly used.

6.4 Tuning parameters

Machine learning models have a set of parameters that must be tuned beforehand. In the case of support vector regression models it is necessary to determine parameters such as the penalty constant C (which penalizes the slack variables ξ_i and ξ*_i),

1 While at the same time ensuring the generalization capability of the neural network.


the width of the tube ε, and the Gaussian kernel parameter σ. In the case of artificial neural network models the most important parameters to determine are the number of hidden layers and the number of neurons per hidden layer.

The tuning process must be carried out according to the generalization (regression) error criterion, so it is necessary to estimate this error with some predefined method. Two possible ways of estimating it are the following (see Chapelle et al. (2002)):

• Cross validation error: The data is divided into two subsets according to some proportion. One subset (the training set) is used in the training process, and the remaining subset (the validation set) is used to estimate the regression error. The process is repeated choosing all possible different training and validation sets from the whole dataset without repetition. The validation error is taken as an estimate of the generalization error.

• Leave one out error: One data point is selected from the whole dataset for validation (error estimation) and the remaining data are used in the training process. The process is repeated until all data points have been chosen for validation.

As can be seen, the leave one out estimates are more computationally expensive, hence the cross validation error is commonly used in practice (see Hsu et al. (2010)). The parameters must be tuned to reach the minimum regression (validation) error. A search strategy can be carried out in two steps: in the first step, a coarse grid is constructed in the parameter space, the error is evaluated at each point of the grid, and the point with minimum error is kept; in the second step, a fine grid centered on the best solution of the first step is constructed, and the point of the fine grid with minimum cross validation error is selected. However, this strategy is very time consuming, and it is necessary to consider more efficient approaches, as discussed below.

6.4.1 Response surface methodology for tuning parameters

The process of tuning parameters is in fact an optimization problem. Some characteristics that make it a difficult problem are:

• The objective value (generalization error) is actually a random variable.


• The evaluation of the objective function is very time consuming.

• The objective function is unknown in practical terms.

On the other hand, some advantages of this optimization problem are that it has a small number of variables and that, in general, the expected objective function is not highly nonlinear. This makes it possible to approximate the function with a low order polynomial. The objective function can initially be written as

E(\mathbf{p}) = f(\mathbf{p}) + \epsilon,

where f(\mathbf{p}) is an unknown function of the parameter vector \mathbf{p} ∈ P, P is the parameter space, and ϵ is an independent and identically distributed random variable. For simplicity, a second order polynomial, computed by linear regression methods, can approximate this objective function. The goal is to optimize this polynomial and estimate the optimal set of parameters \mathbf{p}*. The optimal point is added to the original sample and the optimization process is repeated; it concludes when it is not possible to improve the objective value. Gonen & Alpaydin (2011) used this method for tuning the parameters of a support vector machine.

Initial sample for the fit: The regression method requires a sample to fit the second order model; Gonen & Alpaydin (2011) use design of experiments (DOE) and response surface methodology (RSM) for this task. The authors use the Koshal design (see Myers & Montgomery (2002), p. 384), which is very efficient because it requires only a small sample of data; however, a more robust design such as the central composite design (CCD), the design most commonly used in the literature, can also be employed. Fig. 6.3 shows a two dimensional sample for the Koshal design and the central composite design; it is evident that the Koshal design is the more economical approach.

Once the experiment has been carried out on the sample points given by the experimental design, the following quadratic function is obtained by linear regression:

E = \beta_0 + \sum_{i=1}^{k} \beta_i p_i + \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \beta_{ij} p_i p_j + \sum_{i=1}^{k} \beta_{ii} p_i^2, \quad (6.10)

where the β are the model parameters and k is the dimensionality of the parameter space, i.e. the number of parameters.


Fig. 6.3: Two dimensional sample for a Koshal design (a) and a central composite design (b).

Given that the objective value is a random variable, replicating the experiment at each sample point may give better estimates, so it is necessary to define the number of replications R in the algorithm. Algorithm 6 shows a procedure for tuning parameters using response surface methodology, based on the work of Gonen & Alpaydin (2011). In this algorithm the optimization of problem 6.10 is restricted to some operability region ℓ of the parameters.
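Fitting the second order model of Eq. 6.10 is itself a linear least squares problem; a sketch of the design-matrix construction (with assumed array inputs) is:

```python
import numpy as np

def fit_quadratic_rsm(P, E):
    """Least-squares fit of the second-order model of Eq. 6.10 (a sketch).

    P : (N, k) sampled parameter vectors; E : (N,) observed validation errors.
    Returns coefficients for [1, p_i, p_i*p_j (i<j), p_i^2] in that order.
    """
    N, k = P.shape
    cols = [np.ones(N)]
    cols += [P[:, i] for i in range(k)]                      # linear terms
    cols += [P[:, i] * P[:, j] for i in range(k - 1)
             for j in range(i + 1, k)]                        # interaction terms
    cols += [P[:, i] ** 2 for i in range(k)]                  # pure quadratics
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, E, rcond=None)
    return beta
```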

Proposed metaheuristic for tuning parameters procedure

Algorithm 6 adds a new solution to the current sample at each iteration, which increases the number of sample points. This can be ineffective, in the sense that some of these points may not actually contribute to a good fit. Moreover, near the optimum the objective function resembles a quadratic function, so better results are obtained using points near the optimum; this cannot necessarily be achieved with Algorithm 6, due to the bias induced by sample points far from the optimum. In addition, Algorithm 6 is completely deterministic, which may cause rapid convergence to a local minimum.

According to the discussion presented above, we propose a parameter tuning procedure based on the algorithm of Gonen & Alpaydin (2011).


Algorithm 6: Tuning parameters procedure.

Data: Number of replications per sample point, R; threshold, ε; dimensions of the experimental design, δ; operability region of parameters, ℓ.
Result: Parameters p*.

1   Build the design matrix with dimensions δ, using some experimental design;
2   Perform the experiment for each sample point R times and obtain the validation errors;
3   Fit a second order model for the generalization error function;
4   Solve the quadratic optimization problem given in Eq. 6.10, subject to the operability region ℓ, and get the optimum p*_0;
5   t = 0;
6   while ||p*_t − p*_{t−1}|| ≥ ε do
7       Perform the experiment R times for p*_t and obtain the validation errors;
8       Fit a second order model using all information (sample points) available;
9       Solve the quadratic optimization problem 6.10, subject to the operability region ℓ, and get the optimum p*_{t+1};
10      t = t + 1;
11  end

The proposed algorithm considers a fixed number of sample points at each iteration, some of them obtained in a stochastic manner. The initial sample points are obtained in the same manner as in Algorithm 6, giving a total of N points; a new sample point is then obtained by optimizing the fitted second order model, for a total of N + 1 points. From these N + 1 sample points we select the N − n best solutions and randomly generate n new solutions, recovering a fixed sample size of N. The n new solutions are generated according to a Gaussian distribution with mean equal to the mean of the current N − n best solutions and covariance


matrix

\Sigma = \operatorname{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2), \quad (6.11)

where σ_i^2 is the variance of parameter i in the current sample of N − n points. The reason for using a Gaussian distribution is that the generated points tend to form a spherical group around the mean, and such a spherical sample can improve the fit of the second order model. The process is repeated using the new N sample points. The proposed procedure is shown in Algorithm 7.


Algorithm 7: Proposed tuning parameters procedure.

Data: Number of replications per sample point, R; threshold, ε; dimensions of the experimental design, δ; operability region of parameters, ℓ; number of random solutions, n.
Result: Parameters p*.

1   Build the design matrix with dimensions δ, using some experimental design;
2   Perform the experiment for each sample point R times and obtain the validation errors;
3   Fit a second order model for the generalization error function;
4   Solve the quadratic optimization problem given in Eq. 6.10, subject to the operability region ℓ, and get the optimum p*_0;
5   t = 0;
6   while ||p*_t − p*_{t−1}|| ≥ ε do
7       Perform the experiment R times for p*_t and obtain the validation errors;
8       Select the N − n best solutions from the current sample;
9       Generate n new sample points using a Gaussian distribution with mean equal to the mean of the current N − n points and covariance matrix given in Eq. 6.11;
10      If the generated points lie outside the operability region, project them onto that region;
11      Fit a second order model using the N sample points available;
12      Solve the quadratic optimization problem 6.10, subject to the operability region ℓ and to the constraints

        \min_{j=1,\ldots,N} \{p_{ij}^{cur}\} \leq p_i \leq \max_{j=1,\ldots,N} \{p_{ij}^{cur}\}, \quad \forall i = 1, \ldots, k,

        where p_{ij}^{cur} is the jth sample of the ith parameter in the current sample. Get the optimum p*_{t+1};
13      t = t + 1;
14  end


SEVEN

EXPERIMENTAL PROCEDURE

The purpose of this chapter is to give a complete description of the experimental methodology used in this work. The experiments involve different factors that influence the performance of the forecasting methods; these factors and the methodological framework are described below.

This study aims to solve the SLCP demand/sales forecasting problem as set out in Chapter 2, where we established some hypotheses (strategies) which could improve forecast performance. In order to investigate the set of hypotheses presented in Section 2.1.1, it is necessary to test the effect on the forecasts of using cumulative or non-cumulative data, of partitioning or not partitioning the data, and of the regression model used (see Chapter 6). We refer to these as experimental factors. Fig. 7.1 illustrates the experimental factors with their respective levels: the regression method with 3 options (levels), the clustering usage with 2 options, and the type of data with 2 options. This leads to 3 · 2 · 2 = 12 experimental treatments, which are the combinations of the levels of the factors shown in Fig. 7.1. For example, a particular treatment is to evaluate the forecasts obtained by means of multiple linear regression (MLR), without partitioning the data, and using non-cumulative data. We use a general procedure that standardizes the steps from data collection to the final forecast evaluation; the framework of the forecasting procedures is shown in Fig. 7.2.


Fig. 7.1: Experimental factors: forecasting method (MLR, SVR, or ANN), clustering usage (yes, with FCM, FMLE, or FSTS; or no), and type of data (cumulative or non-cumulative).

7.1 Collection and analysis of data

Usually, forecasting methods try to predict the values of a time series using its own past values. For a short life cycle product (SLCP) such historical information is usually unavailable or scarce, so typical forecasting methods (such as moving averages, exponential smoothing, and others) cannot be applied, or their forecasting results are very poor. Given the scarcity of historical demand information for a new product, useful information can be obtained from products which have completed, or are completing, their life cycles. We assume that the life cycle pattern can be learnt from previous products in order to predict the demand of a new product; to do so, it is necessary to find large enough datasets of time series containing patterns similar to the time series we try to forecast. In a company, this task amounts to finding the demand time series of older SLCPs. When such information is available, it is necessary to clean the data of noise and meaningless information.

In this study each of the considered datasets (see Section 5) is split into two subsets (see Fig. 7.2). The first subset, named the training set, is used in the training process and corresponds to 80 % of the whole dataset.


The training set allows the forecasting machines (see Chapter 6) to be prepared beforehand via a training procedure. The second subset, called the test set, is used to evaluate the performance of each forecasting procedure; it simulates the real-time information available once the SLCP has been introduced to the market, or once the time series to be forecasted appears. Note that the training process is carried out once, beforehand, and does not need to be repeated each time real-time information arrives, which saves considerable computation time.

Fig. 7.2: Framework of the forecasting procedures. (1) Collect and analyze data: preprocessing and data cleaning; the historical sales profiles of SLCP demand form the training-validation data and the real-time data form the testing data; cumulative data are obtained if required. (2) Perform clustering if required: cluster with Alg. 1, Alg. 3, Alg. 4, or Alg. 5; validate the clustering results; get the clusters and cluster centroids. (3) Perform classification if required: classify each time series of the real-time data into the cluster with the nearest centroid. (4) Tune the parameters of the machine (MLR, SVR, or ANN): search for the machine parameters using Alg. 6 or Alg. 7, with n-fold cross validation to estimate the error. (5) Evaluate the forecasts: train the machine using the optimal parameter values (on the cluster dataset if clustering is used, otherwise on the whole dataset), perform the forecasts, obtain the correct forecasts by differencing if cumulative data are used, and evaluate the forecast performance.


7.2 Clustering

This operation is performed when required by an experimental treatment; the clustering is carried out on the training set (either with cumulative or non-cumulative data). In this study the FCM, FMLE, and FSTS clustering algorithms (see Section 5.3.2) are considered. It is necessary to perform a clustering validation in order to tune clustering parameters such as the number of clusters K and the fuzzifier parameter m. This process obtains groups in the data such that the members of a cluster share similar characteristics but differ significantly from the remaining clusters. The information that emerges from this stage, the partition of the data and the cluster centroids, is used in the following stages: the partition of the data is used for training, and the cluster centroids make it possible to classify real-time data (time series) into their most similar cluster. After obtaining the clusters, it is necessary to relate the real-time information (time series) to these groups; the goal is that any real-time series can be classified into one of the previously established clusters. In this work a minimum distance classification is performed: considering the real-time series up to time t, \mathbf{y}^t, and the cluster centroids up to time t, \mathbf{v}_i^t, ∀i = 1, ..., K, the real-time series is classified into the cluster k for which

k = \operatorname{argmin}_{i=1,\ldots,K} \left\{ d\left(\mathbf{y}^t, \mathbf{v}_i^t\right) \right\}. \quad (7.1)

In other words, the real-time series up to time t is classified into the cluster whose centroid, up to time t, is nearest. The selected cluster provides the data to be used in the subsequent training process.
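A minimal sketch of this classification rule (Eq. 7.1), assuming Euclidean distance and array inputs, is:

```python
import numpy as np

def classify_real_time(y_t, V, t):
    """Minimum-distance classification of Eq. 7.1.

    y_t : the real-time series observed up to period t (length t);
    V : (K, T) cluster centroids. Only the first t periods are compared.
    """
    d = np.linalg.norm(V[:, :t] - y_t[None, :], axis=1)
    return int(np.argmin(d))   # index of the cluster with the nearest centroid
```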

7.3 Parameter tuning

The parameter tuning procedure requires an estimate of the forecast error. The forecast error is estimated with 5-fold cross-validation on the training dataset^1, as described in Section 6.4. The objective is then to reach the minimum possible cross validation error using the parameter search procedure described in Algorithm 6 or Algorithm 7. The proposed forecasting

1 Data for training can come from cumulative data, non-cumulative data, cluster data,or any combination, depending on the experimental treatment considered for analysis.


method requires the training of the regression machines (MLR, ANN, or SVR) for each period to be predicted. For example, suppose that the training set consists of 3 time series of length 5, as shown below:

y11, y12, y13, y14, y15;

y21, y22, y23, y24, y25;

y31, y32, y33, y34, y35,

Now consider that the time lag is set to p = 3. To forecast a real-time series at period 4, a machine is trained using the training data of periods 1 to 3 as inputs (regressors) and the data of period 4 as output (response); the same procedure is followed to forecast the other periods. Note that for period 2 only the data of period 1 can be used as regressor, because only one lag period is available; in the same manner, to forecast period 3 the training data of periods 1 to 2 are used. It is impossible to forecast period 1, since no lag periods exist beforehand; a practical way to forecast the first period is to take an average over the training set.
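As a worked illustration of this construction (with hypothetical numeric values standing in for the y_ij above):

```python
import numpy as np

# Worked example for the 3x5 training set above with lag p = 3:
# regressors are periods 1..3, the response is period 4.
train = np.array([[11, 12, 13, 14, 15],
                  [21, 22, 23, 24, 25],
                  [31, 32, 33, 34, 35]], dtype=float)  # placeholder values y_ij

p, target = 3, 4                           # forecast period 4 from 3 lags
X = train[:, target - 1 - p:target - 1]    # columns for periods 1..3
y = train[:, target - 1]                   # column for period 4
print(X, y)                                # one machine is trained per period
```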

According to the above discussion, obtaining an estimate of the forecast error by cross validation requires training T − 1 machines, where T is the length of the time series. This is a time consuming task when T is large, mainly because a parameter search process must be conducted^2. To avoid this problem, in this work we use only 5 equally spaced periods of the time series for training in order to estimate the forecast error. Once the optimal parameter values are obtained, the machine is trained again with these values, considering all T − 1 periods. The results obtained then allow forecasting any real-time series.

7.4 Forecasts evaluation

At this stage the forecasts of the real-time series are obtained using the results of the machine training procedure discussed before. The forecast is performed for each time series of the real-time (test) dataset. It is noteworthy that when the forecasts are based on cumulative data, it is necessary

2 This is particularly true when support vector regression or artificial neural networksare considered. The case of multiple linear regression, however, is very efficient.


to transform the data back correctly by differencing^3. The Root Mean Square Error (RMSE) evaluates the forecast performance and is calculated as follows:

RMSE = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 }, \quad (7.2)

where \hat{y}_t is the forecasted value at period t. This work also uses the Mean Absolute Error (MAE) because, as will be shown later, this metric provides an absolute point of comparison. The MAE is calculated as follows:

MAE = \frac{1}{T} \sum_{t=1}^{T} \left| \frac{y_t - \hat{y}_t}{y_t} \right|. \quad (7.3)
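Both metrics are straightforward to compute; the following sketch matches Eqs. 7.2 and 7.3 as written (note that Eq. 7.3 normalizes each error by the observed value):

```python
import numpy as np

def rmse(y, y_hat):
    """Root Mean Square Error of Eq. 7.2."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    """Relative absolute error of Eq. 7.3 (the MAE variant used in this work)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs((y - y_hat) / y)))
```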

7.5 Some computational aspects

The algorithms were implemented in Matlab 2011b. We use the Neural Network Toolbox of Matlab to train the feed-forward neural networks. For the support vector machine case we use the LIBSVM Toolbox for Matlab by Chang & Lin (2011), from which the Epsilon Support Vector Regression was used in this work. All algorithms presented earlier in this work were programmed in Matlab, on the Windows operating system, on a 32 GB, 3.40 GHz machine.

7.6 Results of the tune parameters procedure

7.6.1 Tuning parameters for SVR machines

As previously mentioned, the parameter tuning procedure relies on the cross validation estimate of the forecasting error for a given training set. In the case of an SVM it is necessary to define the values of the width of the tube ε, the penalty constant C, and the Gaussian kernel parameter σ, as described in Section 6.2. Previous works on the search for these parameters establish that exponentially growing sequences of the parameters are a practical way to identify good values (Hsu et al., 2010; Lin, 2006). For

3 To obtain the non-cumulative demand we use the difference between cumulative demands, x_t = X_t − X_{t−1}, where X_t is the cumulative demand at period t and x_t is the non-cumulative demand.


this reason, the following search space is defined:

\varepsilon \in [2^{-5}, 2^{4}]; \quad C \in [2^{-2}, 2^{15}]; \quad \frac{1}{2\sigma^2} = \gamma \in [2^{-16}, 2^{5}].

This search region is taken as the operability region ℓ of the parameters, and it is considered large enough. For the support vector machines (SVM), the proposed search procedure described in Algorithm 7 is used as the search method. In this work a Central Composite Design (CCD, see Section 6.4.1) provides the initial sample points; the distance between the axial points of the design is given by the extreme values of each parameter in the operability region. The design is chosen to be rotatable (see Myers & Montgomery (2002)), which implies that the following relation must be satisfied:

\alpha = k^{1/4},

where α is a normalized mean distance between the axial points^4 and k is the number of parameters, in this case k = 3. In this work only R = 1 replicate is considered, and the threshold or stopping criterion (see Section 6.4.1) of the algorithm is set to ε = 0.005. Finally, the proposed tuning procedure requires removing the n worst solutions and generating n new random solutions; we set n = 5, given that N = 15 (see Section 6.4.1).

The performance of Algorithm 7 at different iterations is shown in Fig. 7.3 for the different datasets. The example shows that, in general, the algorithm converges to a (possibly local) optimum. The results indicate that the performance of the proposed method is promising. An interesting fact is that the training process is relatively insensitive to the tube-width parameter ε, as shown for the real datasets at the intermediate iteration. Tab. 7.1 shows the results of the tuning procedure for non-cumulative data, and Tab. 7.2 shows the results for cumulative data at the optimal number of lags. Evidently, the search time is much larger when cumulative data is considered.

4 No details of this standardization are presented here; see Myers & Montgomery (2002).


Tab. 7.1: SVR results of the tuning parameters procedure for non-cumulative data.

Dataset   p*   log2 ε   log2 C   log2 γ   Search time (s)
RD1        2    −2.50     9.86   −14.64               167
SD1       16    −1.89     5.68   −10.35               237
RD2        6     1.16    12.35   −19.12               118
RD3       10    −3.59    10.93   −15.26               223

*Optimal number of lags, see Fig. 8.2.

Tab. 7.2: SVR results of the tuning parameters procedure for cumulative data.

Dataset   p*   log2 ε   log2 C   log2 γ   Search time (s)
RD1        1    −3.19    14.90   −19.92               439
SD1       18    −3.01    14.53   −15.99            22 554
RD2        1     0.00    15.00   −23.22               163
RD3        1    −9.48    13.80   −20.35             3 042

*Optimal number of lags, see Fig. 8.2.

7.6.2 Tuning parameters for ANN machines

This work uses the feed-forward neural network, or multilayer perceptron, as regression method, given the relative simplicity of this neural network model. Such networks require setting many parameters, such as the learning rate, the number of neurons per hidden layer, the number of hidden layers, the activation function of each neuron, the training algorithm, the number of iterations of the training algorithm, the weight initialization procedure, and possibly many others that affect the performance of the machine.

The interest here, however, is focused on determining the number of neurons per hidden layer and the number of hidden layers; the other parameters are set in a relatively arbitrary manner according to the author's experience. In this sense the Levenberg-Marquardt training algorithm with Bayesian regularization is considered, the learning rate is set to 0.01, the number of runs (iterations) is set to 300, the weights are initialized to zero, and a tangent sigmoid activation function is used for the hidden neurons and a


linear activation function for the output neuron. Our experience working with artificial neural networks evidences a high computational cost. This may be a consequence of the use of Bayesian regularization in the training algorithm; it implies longer processing times, but good results are obtained, which was the reason for selecting it.

Accordingly, it is necessary to determine the number of neurons per hidden layer and the number of hidden layers. For this purpose we consider up to 2 hidden layers and up to 28 neurons per hidden layer; a hidden layer with 0 neurons indicates that the layer is not used. The tuning procedure for the ANN does not show a clear convergence, in contrast with the SVM case. The fact that the parameters are integers may make the search procedure less effective. A sketch of the resulting network configuration is given below.
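A minimal Matlab sketch of the network configuration just described, using Neural Network Toolbox syntax available in R2011b (the layer sizes are the parameters being tuned; the remaining settings follow the text, and the zero weight initialization would be applied after configuring the network):

    hidden = [25 0];                         % candidate layer sizes; 0 = layer unused
    hidden = hidden(hidden > 0);
    net = feedforwardnet(hidden, 'trainbr'); % Levenberg-Marquardt with Bayesian regularization
    net.trainParam.epochs = 300;             % number of training iterations
    net.layers{1}.transferFcn   = 'tansig';  % tangent sigmoid hidden neurons
    net.layers{end}.transferFcn = 'purelin'; % linear output neuron
    % net = train(net, X, T);                % X: lagged demand inputs, T: targets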

Tab. 7.3: ANN results of the tuning parameters procedure for non-cumulativedata.

Dataset   p*   Neurons in hidden layer 1   Neurons in hidden layer 2   Search time (s)
RD1        4                          25                           0               267
SD1       23                          24                           4               309
RD2        3                           1                           0               339
RD3        2                          14                          28               291

*Optimal number of lags, see Fig. 8.3.

Tab. 7.4: ANN results of the tuning parameters procedure for cumulative data.

Dataset   p*   Neurons in hidden layer 1   Neurons in hidden layer 2   Search time (s)
RD1       11                           1                           0               785
SD1        1                           1                           0               582
RD2        7                           1                           0               892
RD3        1                          28                           0               611

*Optimal number of lags, see Fig. 8.3.

The results of the search are summarized in Tab. 7.3 for the non-cumulative case and in Tab. 7.4 for the cumulative case. Note that the search time is


longer compared with the SVM case; also note that the time is larger for the cumulative case than for the non-cumulative case.

7.7 Conclusions of the chapter

This chapter presented a general theoretical framework of the regression methods to be used in the forecasting framework, as will be seen later. It also considered the problem of tuning parameters for learning machines such as neural networks and support vector machines. We consider the cross-validation regression error as the figure of merit for parameter tuning. The metaheuristic procedure proposed for tuning parameters generates reasonable results.


[Figure: scatter plots of the sampled points in the (log2 ε, log2 C, log2 γ) space at the initial, an intermediate, and the final iteration; panels (a)–(d).]

Fig. 7.3: Sample points given by the proposed tuning parameters procedure at different iterations (start, intermediate and final). (a) RD1 dataset; (b) SD1 dataset; (c) RD2 dataset; (d) RD3 dataset. These results are obtained for a number of lags p = 2; we omit the results for other values of p.


EIGHT

RESULTS

8.1 Forecasting results using multiple linear regression

In this case the parameters of the regression model are obtained from the training set using Eq. 6.3 of Section 6.1 (see Chapter 7). The test set is then used to investigate the performance of the method by means of the Root Mean Square Error (RMSE). The results obtained with non-cumulative and cumulative data are shown in Fig. 8.1 for different numbers of lags. The use of non-cumulative data improves the forecasting results both in mean RMSE and in the variability of the forecast error, as shown by the error bars. This result is confirmed in practically all datasets.
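A minimal Matlab sketch of this step, assuming matrices Xtrain and Xtest whose rows hold the p lagged demand values and target vectors ytrain and ytest (names are illustrative; the backslash operator computes the least-squares solution corresponding to Eq. 6.3):

    beta = [ones(size(Xtrain,1),1) Xtrain] \ ytrain;  % least-squares fit with intercept
    yhat = [ones(size(Xtest,1),1)  Xtest] * beta;     % one-step-ahead forecasts
    rmse = sqrt(mean((ytest - yhat).^2));             % test RMSE (Eq. 7.2)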

Tab. 8.1 compares the results obtained using non-cumulative and cumulative data in the forecast process. As observed, the use of non-cumulative data improves the forecast results in all cases, as demonstrated by the Kruskal-Wallis test¹. As expected, the processing time (training and forecasting) is smaller with non-cumulative data. This is reasonable, since the use of cumulative data requires transforming cumulative results into non-cumulative forecasts, which takes additional processing time. On average, the use of cumulative data increases the processing time to about 3 times that required for non-cumulative data. These results have an important implication: the use of cumulative data does not necessarily improve the forecast performance, which contradicts the hypothesis presented in Section 2.1.1.

1 The Kruskal-Wallis test is a statistical procedure that performs a non-parametric one-way analysis of variance by comparing the medians of the experimental treatments or variables.
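As a minimal illustration of how this test is applied to the error measurements (an assumption about the implementation; rmseNoncum and rmseCum are illustrative vectors of per-series RMSE values), the test is available in Matlab's Statistics Toolbox:

    p = kruskalwallis([rmseNoncum(:) rmseCum(:)], ...
                      {'non-cum', 'cum'}, 'off');   % p near 0: medians differ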


[Figure: RMSE versus number of lags (p); panels (a)–(d).]

Fig. 8.1: Multiple linear regression results for non-cumulative (blue line) and cumulative (red line) data and for different numbers of lags (p). The black points correspond to the optimum. The error bars correspond to 10% of the standard deviation. (a) RD1 dataset. (b) SD1 dataset. (c) RD2 dataset. (d) RD3 dataset.

It is notable that the optimal number of lags for SD1 (the Bass-process synthetic dataset) is very large; this implies a strong linear correlation or lag dependence. In contrast, the optimal number of lags is relatively small for the other datasets, and for cumulative data only one lag is required.

8.1.1 Multiple linear regression results with clustering

Preliminary study

Initially, in a preliminary study, we carried out experiments using the SD1 dataset and its correct partition into four groups (see Section 5.1.2). The purpose of this experiment is to evaluate the effect of partitioning the data on forecast performance when the correct partition is known, as is the case for the SD1 dataset.

According to the Kruskal-Wallis test, in this experiment again the use of


Tab. 8.1: Multiple linear regression results using complete datasets.

Dataset     p*   Mean RMSE   S. Deviation   Time (s)   p-value**
RD1          2       4.069          2.885       0.29        0.00
RD1 cum.     1       5.503          4.513       0.59
SD1         24       1.611          0.355       0.67        0.00
SD1 cum.    24       2.324          0.348       4.2
RD2          5      16.504          8.674       0.41        0.00
RD2 cum.     1      21.592         12.673       0.71
RD3          3       2.259          1.508       0.34        0.02
RD3 cum.     1       2.440          1.537       1.29

*Optimal number of lags, see Fig. 8.1.

**p-value of the Kruskal-Wallis test; values near 0 imply median differences.

non-cumulative data outperforms the use of cumulative data. The results are shown in Tab. A.1 of Appendix A. For non-cumulative data the mean RMSE is smaller when clustering is used in the forecasting process; however, the two methods produce statistically equal results, according to the p-value of the Kruskal-Wallis test. On the other hand, for cumulative data the clustering does not improve the results and the difference between the two methods is statistically significant. An interesting result is that the RMSE variability increases when the data are partitioned. This can be explained because less data is used in each regression and the variability of the estimates increases. However, partitioning the data apparently reduces the processing time.

Use of clustering algorithms

This section assesses, on the real datasets, the effect of partitioning the data in the forecast process. When forecasting with clustering, factors such as the number of clusters K, the fuzzifier parameter m of the clustering algorithm, and the number of lags p may affect forecast performance. With this in mind, and to avoid potential bias in the results, we evaluated several values of these parameters beforehand by a grid search. Tab. B.1 of Appendix B shows the forecasting results obtained with several clustering algorithms for optimal values of the clustering and lag parameters. The


results show that the FMLE clustering algorithm outperforms the other algorithms on the RD1 and RD2 datasets. The FSTS does better than the other clustering algorithms for SD1 and for cumulative RD3, and the FCM performs better than the other algorithms for the non-cumulative RD3 dataset. An interesting result related to the FSTS clustering algorithm is that the optimal number of clusters for each non-cumulative dataset is the same as the optimal number of clusters found by the PCAES index (see Tab. 5.3 in Section 5.3.3).

The effect of clustering is evaluated with the optimal values of the clustering and lag parameters given in Tab. B.1 (Appendix B). The FMLE clustering algorithm is selected for the evaluation because it presents good results on most datasets. According to the results with non-cumulative data (Tab. 8.2), the RMSE for the SD1 dataset is smaller, with a statistically significant difference, when the data are partitioned. This is not the case for the real datasets, for which there is no statistical difference in the error metric. In fact, the mean RMSE for RD1 and RD2 is greater when clustering is used, which may be interpreted as a negative effect of the clustering process, while in SD1 and RD3 the mean RMSE is smaller when clustering is used. Thus, in some cases clustering improves the results but in others it has the opposite effect. A possible explanation is that clustering works well for data in which a cluster structure is evident; when there is no evident clustering tendency, clustering may not improve forecast performance. It is important to note that clustering reduces the size of the training set, limiting the amount of data available for parameter estimation.

According to the results shown in Tab. 8.2, the standard deviation of the error for the SD1 and RD3 datasets² is smaller than the corresponding standard deviation for the RD1 and RD2 datasets. This apparently indicates that when clustering improves the results, the variability of the forecasts is also smaller. It is evident that the processing time increases when clustering is performed: from our results the increase is about 6 times the time required without clustering in the non-cumulative case, and about 3 times in the cumulative case. This, however, does not include the time required to tune the clustering and lag parameters.

The results when cumulative data is used are shown in Tab. 8.3. These

2 Datasets that have smaller mean RMSE when clustering is used.


Tab. 8.2: Optimal result for multiple linear regression partitioning the data, non-cumulative data.

Dataset   Mean RMSE   S. Deviation   Cluster time (s)   Forecast time (s)   p-value*
RD1           4.070          2.917               2.79                0.30       0.96
SD1           1.420          0.320               1.92                0.52       0.00
RD2          16.897          9.244               1.42                0.23       0.94
RD3           2.247          1.406               1.44                0.37       0.91

*p-value of the Kruskal-Wallis test evaluating the effect of partitioning the data or not. The reader may compare these results with those shown in Tab. 8.1 for the non-cumulative case.

results do not improve the forecast performance, as already mentioned. In general, there is an increase in the mean error, the standard deviation of the error, and the processing time.

Tab. 8.3: Optimal result for multiple linear regression partitioning the data, cu-mulative data.

Dataset   Mean RMSE   S. Deviation   Cluster time (s)   Forecast time (s)   p-value*
RD1           5.009          3.437               2.82                0.42       0.62
SD1           1.833          0.456               1.59                0.70       0.00
RD2          18.944         10.259               1.71                0.43       0.02
RD3           2.623          1.677               1.53                0.45       0.12

*p-value of the Kruskal-Wallis test evaluating the effect of partitioning the data or not. The reader may compare these results with those shown in Tab. 8.1 for the cumulative case.

8.1.2 Conclusions of the MLR case

From the results obtained here we can conclude that the use of cumulative data does not improve forecast performance and increases processing time. Partitioning the data does not necessarily improve the forecast results; however, there may well be a marked improvement when the data has an evident clustering structure. It is necessary to note that partitioning the data increases the time required for processing and parameter tuning. We had stated the


hypothesis that partitioning the data could improve prediction performance; however, our datasets failed to validate this hypothesis. A plausible explanation is that the real datasets do not have a clustering structure, as observed in Fig. 5.3, Section 5.3.4.

8.2 Forecasting results using support vector regression

The idea of using support vector machines is to capture the nonlinear process underlying the data in order to obtain better forecasts. An evaluation of the lag parameter is shown in Fig. 8.2. It is shown that the lag dependence in the non-cumulative case is stronger than for linear regression (see Fig. 8.1). For SVR the use of non-cumulative data improves the forecasting results, as expected.
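A minimal Matlab sketch of training ε-SVR on the lagged data and forecasting the test set with the LIBSVM interface; for illustration, the parameter values are taken from the RD1 row of Tab. 7.1, and the variable names are ours:

    opts  = sprintf('-s 3 -t 2 -p %g -c %g -g %g', ...
                    2^-2.50, 2^9.86, 2^-14.64);      % eps-SVR with RBF kernel
    model = svmtrain(ytrain, Xtrain, opts);          % rows of Xtrain: p lagged values
    yhat  = svmpredict(zeros(size(Xtest,1),1), Xtest, model);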

[Figure: forecast error versus number of lags (p); panels (a)–(d).]

Fig. 8.2: Support vector regression results for non-cumulative (blue line) and cumulative (red line) data and for different numbers of lags (p). The black points correspond to the optimum. (a) RD1 dataset. (b) SD1 dataset. (c) RD2 dataset. (d) RD3 dataset.

Tab. 8.4 shows the results for the optimal number of lags (shown in Fig. 8.2). As observed, the use of non-cumulative data improves the forecast results in all cases, as demonstrated by the Kruskal-Wallis test. On the other hand, the processing time (training and forecasting) is smaller


using non-cumulative data in almost all cases. Finally, the standard deviation of the RMSE measurements is smaller when non-cumulative data is used. This corroborates the fact that the use of cumulative data does not improve forecast performance.

Tab. 8.4: Support vector regression results using complete datasets.

Dataset     p*   Mean RMSE   S. Deviation   Train time (s)   Forecast time (s)   p-value**
RD1          2       3.983          3.157             0.29                1.21        0.00
RD1 cum.     1       5.131          4.096             2.62                1.10
SD1         16       1.298          0.269             0.51                2.68        0.00
SD1 cum.    18       1.792          0.451             54.2                2.83
RD2          6      16.409         10.489             0.32                1.50        0.02
RD2 cum.     1      18.417         10.593             0.33                1.41
RD3         10       2.154          1.473             0.52                2.15        0.00
RD3 cum.     1       2.506          1.661             1.96                2.07

*Optimal number of lags, see Fig. 8.2.

**p-value of the Kruskal-Wallis test; values near 0 imply median differences.

8.2.1 Support vector regression results with clustering

In order to evaluate the effect of clustering on forecast performance without possible bias, we performed a grid search over the clustering and lag parameters, as in the multiple linear regression case. Tab. B.2 of Appendix B shows the optimal values of the clustering and lag parameters for the different clustering algorithms; the FMLE clustering algorithm improves the results in almost all cases, and for this reason it is selected for the analysis. Tab. 8.5 shows the results for the non-cumulative case. As observed, there is no improvement in the mean RMSE when clustering is used; in fact, the variability of the measurements increases. According to the Kruskal-Wallis test, the measurements are not statistically different, which indicates that clustering has no effect on the results when SVR is used.

According to the results in Tab. 8.6, there are no improvements comparedwith the case of non-cumulative data. On the other hand, the clustering


Tab. 8.5: Optimal results for support vector regression partitioning the data, non-cumulative data.

Dataset   Mean RMSE   S. Deviation   Cluster time (s)   Train time (s)   Forecast time (s)   p-value*
RD1           3.985          3.185               5.66             0.36                1.58       0.90
SD1           1.301          0.273               4.59             0.54                3.91       0.99
RD2          16.437         10.926               2.42             0.24                2.19       1.00
RD3           2.172          1.661               1.58             1.85                2.49       0.82

*p-value of the Kruskal-Wallis test evaluating the effect of partitioning the data or not. The reader may compare these results with those shown in Tab. 8.4 for the non-cumulative case.

process could achieve lower values of the mean RMSE for the RD1 and SD1 datasets, and the variability of the measurements is smaller for the SD1 dataset. However, there is no statistically significant difference attributable to clustering.

Tab. 8.6: Optimal results for support vector regression partitioning the data, cumulative data.

Dataset   Mean RMSE   S. Deviation   Cluster time (s)   Train time (s)   Forecast time (s)   p-value*
RD1           5.081          3.838               2.86             0.31                1.41       0.98
SD1           1.740          0.457               4.06             3.16                3.56       0.27
RD2          18.91          10.96                1.40             0.21                1.37       0.64
RD3           2.506          1.661               1.58             1.85                2.49       1.00

*p-value of the Kruskal-Wallis test evaluating the effect of partitioning the data or not. The reader may compare these results with those shown in Tab. 8.4 for the cumulative case.

8.2.2 Conclusions of the SVR case

The use of SVR machines generates results very similar to those obtained with MLR. There is strong evidence that the use of cumulative data does not improve forecast performance and requires more computation time. The


use of clustering does not show improvement in forecasting performance, even for the synthetic dataset. It is important to note that the use of SVR machines significantly increases the computation time, mainly due to parameter tuning.

8.3 Forecasting results using artificial neural networks

Artificial neural networks are used as an alternative to support vector machines to capture the nonlinear structure in the data. In the evaluation of the lag parameter for the ANN, some datasets such as SD1 and RD3 showed strong lag dependence in the non-cumulative data. The results show that the use of non-cumulative data improves forecast performance, as expected; in fact, the results for the cumulative case are very unstable. Tab. 8.7 shows the results for the optimal number of lags (shown in Fig. 8.3); there is enough statistical evidence to indicate that the use of cumulative data does not improve the results and, in general, increases the processing time, the error, and the variability of the measurements.

[Figure: forecast error versus number of lags (p); panels (a)–(d).]

Fig. 8.3: Artificial neural network results for non-cumulative (blue line) and cumulative (red line) data and for different numbers of lags (p). The black points correspond to the optimum. (a) RD1 dataset. (b) SD1 dataset. (c) RD2 dataset. (d) RD3 dataset.


Tab. 8.7: Artificial neural network results using complete datasets.

Dataset     p*   Mean RMSE   S. Deviation   Train time (s)   Forecast time (s)   p-value**
RD1          4       4.066          2.824             43.6               26.69        0.00
RD1 cum.    11       5.545          4.287             16.1                35.3
SD1         23       1.163          0.571              326                48.3        0.00
SD1 cum.     1      12.874          0.587             27.1                63.9
RD2          3      18.225         13.811             30.5                31.4        0.00
RD2 cum.     7      23.228         11.699             30.0                31.5
RD3          2       2.194          1.345              293                51.7        0.00
RD3 cum.     1       2.961          2.064             63.7                41.7

*Optimal number of lags, see Fig. 8.3.

**p-value of the Kruskal-Wallis test; values near 0 imply median differences.

8.3.1 Artificial neural network results with clustering

Clustering effects on forecast performance are evaluated in the same manner as in the previous cases. In the non-cumulative case there is no statistical evidence of improvement when clustering is used for the real datasets. For the synthetic dataset, however, the effect of clustering is statistically significant (compare Tab. 8.8 with Tab. 8.7). In the SD1 and RD2 datasets the variability of the RMSE decreases when clustering is used.

The results for the cumulative case are given in Tab. 8.9 and show no improvement for the real datasets. For the synthetic dataset, however, there is a significant improvement in forecast performance.

8.3.2 Conclusions of the ANN case and other cases

The conclusions for the neural network case are similar to those obtained for multiple linear regression and support vector regression: the use of cumulative data does not improve forecast performance, and the use of clustering has no clear effect on forecast performance according to the real datasets. As additional information, Tab. 8.10 presents a summary of the results for each experimental treatment. Note that in no case does the use of cumulative data improve forecast performance.


Tab. 8.8: Optimal results for artificial neural network partitioning the data, non-cumulative data.

Dataset   Mean RMSE   S. Deviation   Cluster time (s)   Train time (s)   Forecast time (s)   p-value*
RD1           4.172          3.034               1.93             60.4                26.8       0.85
SD1           1.294          0.439               4.65              276                51.8       0.00
RD2          19.139         12.933               2.50             7.92                31.2       0.21
RD3           2.194          1.345               2.48              282                47.8       1.00

*p-value of the Kruskal-Wallis test evaluating the effect of partitioning the data or not. The reader may compare these results with those shown in Tab. 8.7 for the non-cumulative case.

On the other hand, the use of clustering seems not to have any effect on forecast performance. According to these results, the idea of using clustering for forecasting loses validity in practice, since the procedure does not improve prediction performance and has a high computational cost, especially in the clustering validation procedure.

The possibility of improving forecasting performance is one of the main reasons for using cumulative data, given the noise reduction (i.e., curve smoothing) it provides. Moreover, many diffusion models consider cumulative time series to facilitate parameter-tuning tasks. In contrast, our results show that prediction performance is not improved, so it is important to explain why the use of cumulative data is not effective in our forecasting framework. A first answer is that cumulating the data increases the range of the time series: even though the cumulative series looks smooth, the error associated with the forecasts increases with the range of the series. Fig. C.1 of Appendix C shows the variances of the datasets at each time period for cumulative and non-cumulative data. The use of cumulative data shows a remarkable increase in the variability of the time series, which provides a clear picture of what happens: the increase in variability implies poor performance of the forecasting method.


Tab. 8.9: Optimal results for artificial neural network partitioning the data, cu-mulative data.

Dataset   Mean RMSE   S. Deviation   Cluster time (s)   Train time (s)   Forecast time (s)   p-value*
RD1           5.634          4.400               2.93             29.8                26.6       0.90
SD1           1.886          0.543               4.11             78.2                44.7       0.00
RD2          36.43          48.84                1.76             61.3                30.2       0.47
RD3           2.983          1.957               1.62             59.9                40.7       0.26

*p-value of the Kruskal-Wallis test evaluating the effect of partitioning the data or not. The reader may compare these results with those shown in Tab. 8.7 for the cumulative case.

8.4 Comparison of forecasting methods

To provide a visual illustration of the forecasting results, we selected one time series from each dataset and show the forecasts of each regression method in Fig. 8.4. The forecasts fit the time series relatively well; it is interesting to note the neural network forecast for the SD1 dataset, for which the predictions are considerably good, as is true for this dataset in general. On the other hand, Fig. 8.5 shows the p-values of the pairwise comparison of regression methods according to the Kruskal-Wallis test: for the real datasets there is no difference between regression methods at the 90% confidence level. It is clear that multiple linear regression and artificial neural networks produce very similar results for the real datasets. In contrast, for the synthetic dataset the regression methods are statistically different with confidence levels well above 95%, and in this case artificial neural networks are the undisputed winners. The absence of statistical differences between regression methods for the real datasets highlights multiple linear regression as a very efficient method to forecast the demand of SLCPs within the forecasting framework discussed in this work (see Chapter 7).

An evaluation of the forecasting and regression methods is performed by considering the Mean Absolute Error of the forecasts at each time period; Fig. 8.6 shows the results. An interesting finding is that the mean absolute error is larger at the beginning and at the end of the time series for the datasets whose time series have clearly completed their life cycle; this fact is


Tab. 8.10: Summary evaluation of the experimental treatments according to themeasurements of RMSE.

Dataset   Regression method   Use of cumulative data   Use of clustering
RD1       MLR                 No                       No effect
SD1       MLR                 No                       Yes
RD2       MLR                 No                       No effect
RD3       MLR                 No                       No effect
RD1       SVR                 No                       No effect
SD1       SVR                 No                       No effect
RD2       SVR                 No                       No effect
RD3       SVR                 No                       No effect
RD1       ANN                 No                       No effect
SD1       ANN                 No                       No
RD2       ANN                 No                       No effect
RD3       ANN                 No                       No effect

not obvious for the RD2 dataset because many of the time series it contains have not yet completed their life cycle, as mentioned in Section 5.1.1. This implies that the absolute error decreases around the sales peak of the time series. As for the comparison of regression methods, they achieve similar results at the beginning and around the peak of the time series; in contrast, there are noticeable differences between regression methods at the end of the life cycle. This is most evident for the RD3 dataset, for which the results at the end of the life cycle look very unstable, in particular for SVRs and ANNs. Given these results, however, it is difficult to establish which regression method performs better for a given set of periods.


[Figure: demand y_t versus period t for one series per dataset, showing the data and the MLR, SVR and ANN forecasts; panels (a)–(d).]

Fig. 8.4: Forecasting results for some time series. (a) RD1 dataset; (b) SD1dataset; (c) RD2 dataset; (d) RD3 dataset. Note: the forecast of thefirst period was obtained as the average of the training set.

[Figure: p-values of the MLR-SVR, MLR-ANN and SVR-ANN pairwise comparisons for each dataset, with a reference level at 0.05.]

Fig. 8.5: Pairwise comparison of regression methods according to the Kruskal-Wallis test.


[Figure: mean absolute error versus period t for MLR, SVR and ANN; panels (a)–(d).]

Fig. 8.6: Mean absolute error results for each regression method and each timeperiod. (a) RD1 Dataset; (b) SD1 dataset; (c) RD2 dataset; (d) RD3dataset. Note that the greatest mean absolute errors are found at thebeginning and the end of the life cycle.


NINE

CONCLUSIONS

This work addressed the problem of forecasting the demand of short life cycle products (SLCPs) using multiple linear regression and machine learning methods such as SVMs and ANNs. The use of regression methods and the methodology followed in this study show a clear advantage over other forecasting methods proposed in the literature, because it is possible to obtain forecasts at early stages of the product life cycle. In fact, the methods discussed in this work only require information from the first demand/sales period, while other methods proposed in the literature require demand/sales information from at least three previous periods. This feature is possible thanks to the effective use of the demand time series of similar products that have already completed their life cycle.

This work considered different strategies (hypotheses) aimed at improving the forecast performance of the regression methods. From the results obtained in this work we can conclude that:

• The use of cumulative data does not improve forecast performance; the results show clear evidence of this statement. A possible explanation is the systematic increase of the variance of the cumulative time series, as a result of summing random variables. Although the cumulative time series is smooth, its values at different periods hide a larger variance than the non-cumulative values, and this increase in variance generates poor forecasting results. Moreover, the use of cumulative data increases the processing time, showing that cumulative data is of little benefit within the framework proposed in this work.


• The effect of clustering on the forecasting results is not clear. Our experience using clustering as a method to extract relevant information for the forecasting process shows that, apparently, an improvement in forecast performance is possible when the data shows a clear clustering structure. Unfortunately, none of our real datasets shows a clear clustering structure. We think that clustering techniques may be a valuable tool in the development of effective forecasting methods; however, the corresponding analysis is beyond the scope of this work. A possible direction is to forecast using the degree of membership of each pattern to each cluster, which is expected to improve the results.

• Non-linear regression methods do not show a significant improvement in forecasting performance for most of the datasets. Multiple linear regression showed results statistically equal to those of support vector regression and artificial neural networks at the 90% confidence level. This allows concluding that multiple linear regression is an efficient and effective method to forecast the demand of an SLCP.

To sum up, according to the results and analysis presented in this document, the application of MLR with non-cumulative data and without clustering is the best option to obtain low prediction errors.


APPENDIX

A

RESULTS FOR THE SD1 DATASET USING THE CORRECT PARTITION

Tab. A.1 shows the effect on forecasting performance of partitioning the SD1 dataset according to its correct partition. The forecasting process is carried out with the optimal number of lags for the complete dataset (see Fig. 8.1 and Tab. 8.1). Tab. B.1 shows that the clustering algorithm improves the results obtained on the SD1 dataset.

Tab. A.1: Preliminary multiple linear regression results for SD1 dataset consider-ing all data and their correct partition.

NON-CUMULATIVE DATA

Dataset      p*   Mean RMSE   S. Deviation   Time (s)   p-value**
SD1          24      1.6099         0.3564     0.7431      0.1114
SD1 clust.   24      1.6005         0.5225     0.4888

CUMULATIVE DATA

Dataset      p*   Mean RMSE   S. Deviation   Time (s)   p-value**
SD1          24      2.3227         0.3475     5.0256      0.0687
SD1 clust.   24      2.4926         0.6497     4.2385

*Optimal number of lags, see Fig. 8.1.

**p-value of the Kruskal-Wallis test; values near 0 imply median differences.


APPENDIX

B

THE EFFECT OF THE CLUSTERING ALGORITHM

B.1 Multiple linear regression case

In this section, the FSTS, FCM and FMLE algorithms are used to cluster the datasets in order to investigate the effect of partitioning on prediction performance (see Section 5.3.2). We investigate the effect of the clustering algorithm, the number of clusters, the value of the fuzzifier parameter, and the number of lags on forecasting performance. The number of partitions ranges from K = 2 to K = 8 clusters, the number of lags goes up to p = 10 (for the SD1 dataset, however, we search over p = {18, . . . , 24} due to its strong lag dependence), and the fuzzifier parameter takes the values m = {1.1, . . . , 2.0, 2.25, 2.5, 2.75, 3}. The results are shown in Tab. B.1; a sketch of the search loop is given below.
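A minimal Matlab sketch of this grid search; clusterAndForecast is a hypothetical helper that returns the mean test RMSE of the cluster-then-regress procedure for one parameter setting:

    best = struct('rmse', inf);
    for K = 2:8
        for m = [1.1:0.1:2.0, 2.25, 2.5, 2.75, 3]
            for p = 1:10                                % for SD1: p = 18:24
                r = clusterAndForecast(data, K, m, p);  % hypothetical helper
                if r < best.rmse
                    best = struct('rmse', r, 'K', K, 'm', m, 'p', p);
                end
            end
        end
    end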

B.2 Support vector regression case

After evaluating the effect of the clustering and lag parameters for the SVR case, we obtained the results shown in Tab. B.2. As observed, the FMLE clustering algorithm improves the results on almost all datasets.


Tab. B.1: Comparison of FCM, FMLE and FSTS algorithms for MLR.

              NON-CUMULATIVE DATA               CUMULATIVE DATA
Dataset       FCM       FMLE      FSTS          FCM       FMLE      FSTS

RD1
  RMSE        4.084     4.070     4.154         5.048     5.009     5.064
  K           2         5         2             2         4         2
  m           1.5       2.8       1.3           1.3       1.1       1.2
  p           2         4         2             1         1         1

SD1
  RMSE        1.431     1.420     1.413         1.841     1.832     1.7861
  K           4         5         6             7         6         8
  m           2         2         1.7           1.7       1.3       1.1
  p           24        24        20            18        18        15

RD2
  RMSE        17.137    16.897    17.143        19.896    18.944    19.449
  K           2         2         2             3         4         2
  m           1.2       2.25      1.3           3         1.1       1.4
  p           5         6         5             1         1         1

RD3
  RMSE        2.215     2.247     2.240         2.615     2.623     2.583
  K           2         2         5             2         2         3
  m           1.4       1.5       1.8           2.5       1.4       3
  p           3         2         1             1         1         1


Tab. B.2: Comparison of FCM, FMLE and FSTS algorithms for SVR.

              NON-CUMULATIVE DATA               CUMULATIVE DATA
Dataset       FCM       FMLE      FSTS          FCM       FMLE      FSTS

RD1
  RMSE        4.030     3.985     4.051         5.148     5.082     5.152
  K           2         8         2             2         4         2
  m           1.8       2.5       1.3           1.2       1.1       1.6
  p           4         5         2             1         1         1

SD1
  RMSE        1.373     1.301     1.320         1.808     1.740     1.723
  K           3         8         2             8         8         8
  m           1.5       1.6       1.6           1.2       1.4       1.4
  p           17        17        16            17        17        17

RD2
  RMSE        17.276    16.437    16.878        19.492    18.911    19.855
  K           2         8         2             2         2         2
  m           1.2       2.25      1.4           2         2.5       2.5
  p           6         5         4             1         1         1

RD3
  RMSE        2.178     2.172     2.188         2.579     2.506     2.564
  K           2         6         2             2         3         2
  m           1.4       1.6       1.8           2.5       1.5       2.75
  p           5         5         5             1         1         1


APPENDIX

C

VARIANCE OF CUMULATIVE AND NON-CUMULATIVE DATA

This appendix assesses the variability of the non-cumulative and cumulative time series by considering the variances at each time period. The variances are calculated using the complete datasets; for example, the variance of the first period of the RD1 dataset is the variance of the values of all time series at that period. We assume that these values follow the same probability distribution, which is not necessarily true given the diverse nature of the time series; this analysis therefore only provides a provisional idea of the variances of cumulative and non-cumulative time series. The variance of the non-cumulative time series is calculated in the usual manner; the variance of the cumulative time series is calculated according to the following expression:

    \mathrm{var}[Y_t] = \mathrm{var}[Y_{t-1}] + \mathrm{var}[y_t] + 2\,\mathrm{cov}[Y_{t-1}, y_t],   (C.1)

where y_t is the current value of the time series and Y_{t−1} is the cumulative value of the time series at period t − 1. Note the significant increase in variability when cumulative data is used.
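A minimal Matlab sketch of this calculation, assuming D is an nSeries-by-T matrix of non-cumulative demands (names are illustrative):

    varNoncum = var(D, 0, 1);   % variance across series at each period
    Y = cumsum(D, 2);           % cumulative demand per series
    varCum = var(Y, 0, 1);      % matches the recursion in Eq. C.1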


[Figure: per-period variance of non-cumulative and cumulative data; panels (a)–(d).]

Fig. C.1: Variance of cumulative and non-cumulative data. (a) RD1 Dataset; (b)SD1 dataset; (c) RD2 dataset; (d) RD3 dataset.


BIBLIOGRAPHY

Balestrassi, P.P., Popova, E., Paiva, A.P., & Lima, J.W. 1994. Design of experiments on neural networks training for nonlinear time series forecasting. Neurocomputing, 1160–1178.

Banerjee, A., & Dave, R. N. 2004. Validating clusters using the Hopkinsstatistic. IEEE, 25–29.

Bashiri, M., & Geranmayeh, A.F. 2011. Tuning parameters of an artificialneural network using central composite design and genetic algorithm. Sci-entia Iranica, 1600–1608.

Bass, F.M. 1969. A new product growth model for consumer durables. Management Science, 15(5), 215–227.

Benitez, H.D., Florez, J.F., Duque, D.P., Benavides, A., Lucia Baquero, O., & Quintero, J.J. 2013. Spatial pattern recognition of seismic events in South West Colombia. Computers & Geosciences, 60–77.

Bradley, P. S., & Fayyad, Usama M. 1998. Refining initial points for K-Means clustering. Pages 91–99 of: -. Morgan Kaufmann.

Chang, C.-C., & Lin, C.-J. 2011. LIBSVM: a library for support vectormachines. ACM Transactions on Intelligent Systems and Technology, 1–27.

Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. 2002. Choosingmultiple parameters for support vector machines. Machine learning, 131–159.


Chiu, C-C., Pignatiello, J.J., & Cook, D.F. 1994. Response surface method-ology for optimal neural network selection. IEEE, 161–167.

Chung, F.-L., Fu, T.-C., Luk, R., & Ng, V. 2002. Evolutionary time seriessegmentation for stock data mining. IEEE, 83–90.

Cristianini, N., & Shawe-Taylor, J. 2000. An introduction to support vectormachines. Cambridge University Press.

Fu, Tak-Chung. 2010. A review on time series data mining. Engineering Applications of Artificial Intelligence.

Gath, I., & Geva, A. B. 1989a. Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7).

Gath, I., & Geva, B. 1989b. Unsupervised optimal fuzzy clustering. IEEE,773–781.

Gönen, M., & Alpaydin, E. 2011. Regularizing multiple kernel learning using response surface methodology. Pattern Recognition, 159–171.

Hall, B. H., Jaffe, A. B., & Trajtenberg, M. 2001. The NBER patent citationdata file: Lessons, insights and methodological tools. NBER working paper8498.

Hsu, C.-H., Chang, C.-C., & Lin, C.-J. 2010. A practical guide to supportvector classification. National Taiwan University, 1–16.

Hu, Y., Wu, C., & Liu, H. 2011. Prediction of passenger flow on the highwaybased on the least square support vector machine. Transport, 26(2), 663–673.

Jian, L., Xiuhua, C., & Hai, W. 2009. Comparison of artificial neural networks with response surface models in characterizing the impact damage resistance of sandwich airframe structures. Pages 210–215 of: -, vol. 2.

Kaufman, L., & Rousseeuw, P. J. 1990. Finding groups in data: An intro-duction to cluster analysis. Wiley.


Kurawarwala, A.A., & Matsuo, H. 1998. Product growth models for mediumterm forecasting of short life cycle products. Technological Forecasting andSocial Change, 169–196.

Li, B., Li, J., Li, W., & Shirodkar, S.A. 2012. Demand forecasting for pro-duction planning decision making based on the new optimized fuzzy shorttime series clustering. Production Planning and Control, 197–203.

Liao, T. Warren. 2005. Clustering of time series data: A survey. Patternrecognition.

Lin, C.-J. 2006. A guide to support vector machines.

Madadlou, A., Emam-Djomeh, Z., Mousavi, M.E., Ehsani, M., Javanmard,M., & Sheehan, D. 1994. Response surface optimization of an artificialneural network for predicting the size of re-assembled casein micelles. Com-puters and Electronics in Agriculture, 216–221.

Meade, N., & Islam, T. 2006. Modelling and forecasting the diffusion ofinnovation – A 25-year review. International Journal of Forecasting, 519–545.

Möller-Levet, C. S., Klawonn, F., Cho, K.-H., & Wolkenhauer, O. 2005. Clustering of unevenly sampled gene expression time series data. Fuzzy Sets and Systems, 49–66.

Möller-Levet, Carla S., Klawonn, Frank, Cho, Kwang-Hyun, & Wolkenhauer, Olaf. 2003. Fuzzy clustering of short time series and unevenly distributed sampling points. Springer-Verlag, 330–340.

Myers, R., & Montgomery, D. 2002. Response surface methodology: Processand product optimization using designed experiments. Wiley.

Pai, P.-F., Lin, K.-P., Lin, C.-S., & Chang, P.-T. 2010. Time series forecast-ing by a seasonal support vector regression model. Expert Systems withApplications, 37(6), 4261–4265.

Pakhira, M. K., Bandyopadhyay, S., & Maulik, U. 2004. Validity index forcrisp and fuzzy clusters. Pattern Recognition, 487–501.

Pascazio, S., Basalto, N., Bellouti, R., Francesco, D. C., Facchi, P., & Pantaleo, E. 2007. Hausdorff clustering of financial time series. Physica, 635–644.


Peña, Daniel. 2005. Análisis de series temporales. Alianza Editorial.

Peña, Daniel, Tiao, George C., & Tsay, Ruey S. 2001. A course in time series analysis. Wiley.

Peña, J. M., Lozano, J. A., & Larrañaga, P. 1999. An empirical comparison of four initialization methods for the k-Means algorithm. Pattern Recognition Letters, 1027–1040.

Pérez-Cruz, F., & Bousquet, O. 2004. Kernel methods and their potential use in signal processing: An overview and guidelines for future development. IEEE Signal Processing Magazine, 54–65.

Ratanamahatana, C. A., Lin, J., Gunopulos, D., Keogh, E., Vlachos, M., &Das, G. 2010. Mining time series data. Pages 1049–1077 of: Data Miningand Knowledge Discovery Handbook. Springer Verlag.

Rodríguez, J.A. 2007. Diseño y evaluación de modelos de abastecimiento de productos de corto ciclo de vida. M.Phil. thesis, Universidad del Valle.

Rodríguez, J.A., & Vidal, C.J. 2009. A heuristic method for the inventory control of short life cycle products. Ingeniería y Competitividad, 37–35.

Schölkopf, B., Bartlett, P., Smola, A., & Williamson, R. –. Shrinking the tube: A new support vector regression algorithm. –.

Smola, A.J., & Schölkopf, B. 2004. A tutorial on support vector regression. Statistics and Computing, 14(3), 199–222.

Szozda, N. 2010. Analogous forecasting of products with a short life cycle.Decision Making in Manufacturing and Services, 4(1-2), 71–85.

Theodoridis, S., & Koutroumbas, K. 2006. Pattern Recognition. Elsevier.

Thomassey, S., & Fiordaliso, A. 2006. A hybrid sales forecasting system basedon clustering and decision trees. Decision Support Systems, 408–421.

Thomassey, S., & Happiette, M. 2007. A neural clustering and classificationsystem for sales forecasting of new apparel items. Applied Soft Computing,1177–1187.


Trappey, C.V., & Wu, H.Y. 2008. An evaluation of the time varying extendedlogistic, simple logistic, and Gompertz models for forecasting short productlifecycles. Advanced Engineering Informatics, 421–430.

Tsay, R.S. 2005. Analysis of Financial Time Series. Wiley.

Tseng, F-M., & Hu, Y-C. 2009. Quadratic interval Bass model for newproduct sales diffusion. Expert System with Applications, 8496–8502.

Vapnik, V.N. 1999. An overview of statistical learning theory. IEEE Trans-actions on Neural Networks, 988–999.

Wang, J., & Wan, W. 2009. Optimization of fermentative hydrogen produc-tion process using genetic algorithm based on neural network and responsesurface methodology. International Journal of Hydrogen Energy, 255–261.

Wang, Weina, & Zhang, Yunjie. 2007. On fuzzy cluster validity indices. Fuzzysets and systems, 2095–2117.

Wu, S.D., & Aytac, B. 2008. Characterization of demand for short life-cycle technology products. Annals of Operations Research.

Wu, S.D., Kempf, K.G., Atan, M.O., Aytac, B., Shirodkar, S.A., & Mishra,A. 2009. Extending Bass for improved new product forecasting. Informs,234–247.

Wu, X. L., & Yang, M. S. 2005. A cluster validity index for fuzzy clustering.Pattern Recognition Letters, 1275–1291.

Xu, X.-H., & Zhang, H. 2008. Forecasting demand of short life cycle products by SVM. Pages 352–356 of: -.

Yang, Y., Fuli, R., Huiyou, C., & Zhijiao, X. 2007. SVR mathematicalmodel and methods for sale prediction. Journal of Systems Engineeringand Electronics, 18(4), 769–773.

Zahid, N., Limouri, M., & Essaid, A. 1999. A new cluster-validity for fuzzyclustering. Pattern Recognition, 1089–1097.

Zhang, G., Patuwo, B.E., & Hu, M.Y. 1998. Forecasting with artificial neuralnetworks: the state of the art. International Journal of Forecasting.


Zhang, G., Patuwo, B.E., & Hu, M.Y. 2001. A simulation study of artifi-cial neural networks for non-linear time series forecasting. Computers &Operations Research, 381–396.

Zhang, X., Liu, J., Du, Y., & Lv, T. 2011. A novel clustering method ontime series data. Expert Systems with Applications, 11891–11900.

Zhu, K., & Thonemann, U.W. 2004. An adaptive forecasting algorithmand inventory policy for products with short life cycles. Naval ResearchLogistic, 633–653.
