Computers and Chemical Engineering 33 (2009) 1602–1616
doi:10.1016/j.compchemeng.2009.04.005

Inductive data mining based on genetic programming: Automatic generation of decision trees from data for process historical data analysis

Chao Y. Ma, Xue Z. Wang (corresponding author; Tel.: +44 113 343 2427; fax: +44 113 343 2405; E-mail address: [email protected])
Institute of Particle Science and Engineering, School of Process, Environmental and Materials Engineering, University of Leeds, Leeds LS2 9JT, UK

Article history: Received 3 September 2008; received in revised form 1 April 2009; accepted 28 April 2009; available online 5 May 2009.

Keywords: Process historical data analysis; Decision trees; Decision forest; Genetic programming; Wastewater treatment plant; Inductive data mining

Abstract

An inductive data mining algorithm based on genetic programming, GPForest, is introduced for the automatic construction of decision trees and applied to the analysis of process historical data. GPForest not only outperforms traditional decision tree generation methods, which are based on a greedy search strategy and therefore necessarily miss regions of the search space, but, more importantly, generates multiple trees in each experimental run. In addition, by varying the initial values of parameters, more decision trees can be generated in new experiments. From the multiple decision trees generated, those with high fitness values are selected to form a decision forest. For prediction, the decision forest instead of a single tree is used, and a voting strategy is employed that combines the predictions of all decision trees in the forest to generate the final prediction. It is demonstrated that, in comparison with decision tree methods in the literature, GPForest gives much improved performance.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Collecting data and storing it in databases has become a routine operation in industry. The data clearly represent a useful 'mine' from which valuable information and knowledge could be extracted. Discovering information and knowledge from data is particularly useful when first-principle models and knowledge are not available, or not applicable due to uncertainties and noise, in real world applications (Johannesmeyer, Singhal, & Seborg, 2002; Singhal & Seborg, 2005; Uraikul, Chan, & Tontiwachwuthikul, 2007; Wang, 1999; Wang & McGreavy, 1998). Knowledge extracted from data has a statistical basis and the advantage of being objective compared with the knowledge and experience of human experts. There has been great research interest in discovering patterns such as abnormal events from process historical data using clustering, classification and visualization techniques, such as the work of Johannesmeyer et al. (2002) and Wang, Medasani, Marhoon, and Albazzaz (2004). Of equal interest in process historical data analysis is to uncover the causal knowledge of processes that accounts for the complex interactions between process variables and process performance and product quality measures. It is well known that the use of manually generated signed digraphs (SDG) for fault isolation has generated massive interest in the process fault detection and diagnosis research community (Iri, Aoki, Oshima, & Matsuyama, 1979; Venkatasubramanian, Rengaswamy, & Kavuri, 2003; Wang, Yang, Veloso, Lu, & McGreavy, 1995). In our opinion, more interesting is the research on the automatic construction of causal models, in the form of decision trees and rules, from process data. Jemwa and Aldrich (2005) applied support vector machines and decision trees to develop causal models for process operational support, using the former for clustering and the latter for causal model construction. Shelokar, Jayaraman, and Kulkarni (2004) adapted an ant colony optimization algorithm for rule generation from process operational data. Nounou, Bakshi, Goel, and Shen (2002) developed a Bayesian latent variable model for process data regression. Bakshi and Stephanopoulos (1994) attempted the integration of wavelet signal processing and inductive learning for deriving tree-type models for process supervisory control. Li and Wang (2001) used digraphs for process modeling, for which the graphs were manually drawn but the nodes and weights were trained with data. Saraiva and Stephanopoulos (1992) investigated the use of inductive and deductive learning for tree-type process model development; the case studies were relatively simple, but the idea was inspirational.

Decision tree generation from data, also known as inductive data mining or inductive learning, aims at generating causal predictive models automatically from data. Decision tree models generated by inductive data mining are attractive because they represent causal, transparent and intuitive knowledge, and can effectively handle non-linear relationships.


Table 1
Comparison between expert systems, multivariate statistical data analysis, neural networks and inductive data mining (+: advantage; -: disadvantage).

Expert systems (ESs)
+ Human knowledge can be used
+ Knowledge is transparent and causal
- Knowledge is subjective
- Data cannot be used effectively
- Output is qualitative
- Not suitable for optimization and automatic control

Multivariate statistical analysis (MSA)
+ Compared with ESs, MSA is data driven and therefore more objective
+ Gives quantitative predictions
- The models are often linear
- Models are largely black boxes
- Human experts' knowledge cannot easily be added once a MSA model is trained
- No input variable feature selection capability

Neural networks (NNs)
+ Compared with ESs, NNs are data driven methods and therefore more objective
+ Give quantitative predictions
+ The models are non-linear
+ Models are easy to set up and train
- Compared with ESs, the models are largely black boxes
- Human experts' knowledge cannot easily be added once a NN model is trained
- No input variable feature selection capability

Inductive data mining (IDM)
+ Combines the advantages of ESs, MSA and NNs
+ Can give both qualitative and quantitative predictions
+ Both databases and human knowledge can be used together effectively; e.g. a decision tree automatically generated from data can be expanded or revised manually by human experts
+ Knowledge is transparent and causal
+ Input variable feature selection is an integral step of the model building process
+ Can capture the models of small numbers of data cases, i.e. abnormal data cases, in a large database consisting of mainly normal operations
- The proposed method needs research to address the challenges with which existing inductive data mining techniques cannot cope

Inductive data mining has another important property that other algorithms such as feedforward neural networks do not possess: the capability of automatic selection of input variables, known as feature variables, during the process of tree model building. In applying feedforward neural networks, researchers tend to include as many variables as possible to ensure that no relevant variables are omitted. This, however, may result in the inclusion of irrelevant or unimportant input variables. For a fixed number of training data patterns, the data become sparser in the multi-dimensional space as the number of input variables increases, which degrades the learning performance. As a result, the generality of the learned model may also be reduced. Table 1 provides a comparison between expert systems, multivariate statistical data analysis, neural networks and inductive data mining. Decision tree models generated by inductive data mining are causal and transparent, can handle non-linear problems effectively, and perform feature variable selection as an integral step of the model building process, so they can deal with large numbers of input variables (thousands to tens of thousands). In addition, once a decision tree is built, it can be expanded or revised by adding human experts' knowledge which might not have been covered by the data. We will also demonstrate in this paper that the genetic programming based decision tree generator can effectively deal with data sets where a small number of cases representing one scenario (e.g. abnormal operational data of a process) are mixed with a large number of cases representing a different scenario (e.g. normal operational data of a process).

A noticeable development in the last five years in process data analysis has been the use of support vector machines (SVMs) (Jemwa & Aldrich, 2005; Zeng, Li, Jiang, Li, & Huang, 2006; Zhang, 2008). SVMs have a robust capacity to establish non-linear models that can offer accurate quantitative predictions in regression and classification. Compared with neural networks, they have demonstrated great advantages in avoiding over-fitting and local minima during training and in building models with good generalisation capability (Vapnik, 1998). However, SVMs have limitations in feature selection (Martens, Baesens, & Van Gestel, 2009). Some researchers have used dimension reduction techniques such as principal component analysis (PCA) and independent component analysis (ICA) (Deniz, Castrillon, & Hernandez, 2003) when dealing with large numbers of input variables. However, dimension reduction using PCA and ICA only transforms the original variables into new latent variables; information from original variables that might be irrelevant to the output is kept in the transformation, potentially adversely affecting model performance. In contrast, the feature selection ability of inductive data mining automatically excludes these irrelevant variables from the tree model.

Though it has so far received little attention, in clear contrast to the overwhelming interest in some other techniques such as multivariate statistical process control (MSPC), the automatic construction of causal models between variables from sensor data is identified as an important component of the FDA (U.S. Food and Drug Administration)'s PAT (process analytical technology) initiative (FDA, 2006).

Fig. 1. Causal link model development is at level three (levels are numbered from bottom to top) of the technology pyramid of the FDA PAT initiative (Hussain, 2009).


Fig. 1 shows that FDA's PAT initiative divides PAT technology into five levels (Hussain, 2009), from bottom to top: individual PAT sensors; understanding and decision making based on a univariate approach; understanding and decision making based on causal models and performance prediction; mechanistic understanding; and first-principles understanding. Although the FDA PAT initiative was primarily proposed for the pharmaceutical industry, it is widely accepted that the principles apply to many other chemical and allied industries.

A variety of inductive learning techniques have been proposed in the literature. At the simplest level, they can be split into those proceeding in a bottom-up manner, starting with one data pattern and building up rules until contradictions occur, and top-down approaches, which start with all the data and recursively split or partition it until stopping criteria are reached. The choice of the descriptors and values in the splitting criterion is usually based on statistical measures, exemplified by formal inference recursive modelling, FIRM (Cho, Shen, & Hermsmeier, 2000), statistical classification of the activities of molecules, SCAM (Rusinko, Farmen, Lambert, Brown, & Young, 1999), the chi-square automatic interaction detector, CHAID, and classification and regression trees, CART (Breiman, Freidman, Olshen, & Stone, 1993), or on entropy measures such as ID3 (Quinlan, 1986) and C4.5 (Quinlan, 1993). Other splitting criteria are employed by CN2 (Clark & Niblett, 1989), which uses a mixture of entropy and the Laplacian error estimate (Clark & Boswell, 1991), and by novel approaches using genetic algorithms (Bala, Huang, Vafaie, DeJong, & Wechsler, 1995). Other approaches include Version Spaces (Mitchell, 1977), which uses candidate elimination to build up a set of rules covering positive and negative examples. Earlier methods could only handle categorical inputs, whereas newer versions, such as C4.5 and C5.0, can handle numeric inputs. An alternative approach uses first order logic (FOL), providing relational statements as well as the usual attributes and values to build up rules fitting the data. This approach is represented by FOIL (Quinlan & Cameron-Jones, 1993), GOLEM (Muggleton & Feng, 1992), PROGOL (King & Srinivasan, 1996) and Warmr (King, Srinivasan, & Dehaspe, 2001). These methods often require complicated encoding prior to the induction.

A departure from the majority of the reported approaches extracts rules from trained neural networks (Bacha, Gruver, Den Hartog, Tamura, & Nutt, 2002; Wang, Chen, Yang, McGreavy, & Lu, 1997). These methods, however, all require a relatively small number of inputs, since the number of possible combinations of weights and inputs to outputs that must be considered grows exponentially. Recent interest has been shown in the development of so-called fuzzy inference systems. This approach avoids crisp rules that partition the data into exclusive sets; instead, each rule applies to a data point to a varying degree, quantified by a fuzzy membership function. These systems use either Mamdani (1974) or Takagi and Sugeno (1985) inference to give predictions.

Most recently, genetic programming (also known as evolutionary programming) has been investigated for the generation of decision tree models from data (Buontempo et al., 2005; DeLisle & Dixon, 2004; Wang, Buontempo, Young, & Osborn, 2006). Applications of these methods to QSAR (quantitative structure–activity relationship) models for chemical toxicity prediction (Buontempo et al., 2005; Wang et al., 2006) have proved their superior performance in comparison with other techniques such as C5.0 (also known as See5) (Quinlan, 1986, 1993). The GPTree algorithm (Buontempo et al., 2005; Wang et al., 2006) was able to handle over a thousand input variables (known as molecular descriptors in QSAR modelling).

The current paper presents a new decision forest method, GPForest, for historical data analysis, which is an extension of the decision tree construction algorithm GPTree, originally developed by Buontempo et al. (2005) and Wang et al. (2006). The novelty of GPForest lies in the idea of a decision forest: it assumes that a single decision tree may not be able to fully represent the relationships between the input and output variables, because a single tree model is optimum only for a specific region of the solution space. A decision forest therefore combines multiple trees to capture the relationships of different regions using different models. Previous studies (Tong, Hong, Fang, Xie, & Perkins, 2003; Tong et al., 2004) have demonstrated the advantages of using a decision forest, but the algorithm they used for generating the multiple trees was not very efficient because it could only generate a single tree in every run. In this study we investigate the use of GPTree to develop a decision forest, and the focus is on deriving methodologies for the selection of trees into the decision forest, and on a methodology to combine the predictions of multiple distinct but comparable decision tree models to reach a consensus. The performance of GPForest is compared with that of GPTree and See5 (See5 is the PC version of C5.0) based on a database of a wastewater treatment plant corresponding to 527 days of operation. See5 is widely regarded as the decision tree generator with the best performance, especially for domain problems in which variables take numerical values (Liu, Hussain, Tan, & Dash, 2002; Yuan, 2002); therefore it is used as a benchmark method in this study.

2. Genetic programming for generation of decision trees and decision forest

2.1. Decision tree generation using genetic programming

Since details of the method can be found in the literature (Buontempo et al., 2005; Wang et al., 2006), the procedure for generating decision trees using GPTree is introduced here only very briefly. The GPTree algorithm contains the following steps.

Firstly, the data is divided into training and test sets. GPTree then generates binary decision trees from the training data by initially growing trees with randomly selected attribute and value pairs from randomly selected rows in the training data to form each splitting node (Fig. 2). For example, randomly picking attribute m with corresponding value s from the randomly selected training row n would form the decision node

If attribute m ≤ value s (1)

Any training data for which this is true is partitioned to the left child node, while the remaining data is partitioned to the right child node. If less than 5% of the training data would be partitioned to one child node, a new row and attribute are chosen at random. This percentage is user configurable. The child nodes are grown as follows. When less than 10% (or double the pre-specified minimum percentage coverage required at a node) of the training data are partitioned to a branch of the tree, so that any further split would cover less than 10% of the data, or all the data at that branch are pure (in the same class), a terminal or leaf node is formed. This leaf predicts the class of the data partitioned there as the majority class of the training data partitioned to that node. This process continues until all nodes have child nodes or are themselves leaf nodes.
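To make the tree-growing rule concrete, the following is a minimal illustrative sketch in Python (the authors' implementation is in C++ and is not reproduced here). Data cases are assumed to be dictionaries of attribute values with a 'class' label, and the 5% and 10% thresholds are the defaults quoted above.

    import random

    MIN_FRACTION = 0.05    # a split must leave at least 5% of the training data in each child
    LEAF_FRACTION = 0.10   # stop splitting once a branch holds less than 10% of the training data

    def majority_class(rows):
        labels = [r["class"] for r in rows]
        return max(set(labels), key=labels.count)

    def grow_tree(rows, n_total, attributes, max_tries=50):
        # Form a leaf if the branch is pure or too small to split any further.
        if len({r["class"] for r in rows}) == 1 or len(rows) < LEAF_FRACTION * n_total:
            return {"leaf": True, "class": majority_class(rows)}
        # Pick a random attribute and a random row's value as the split point;
        # retry if either child would receive less than 5% of the training data.
        for _ in range(max_tries):
            attr = random.choice(attributes)
            row = random.choice(rows)
            value = row[attr]
            left = [r for r in rows if r[attr] <= value]
            right = [r for r in rows if r[attr] > value]
            if min(len(left), len(right)) >= MIN_FRACTION * n_total:
                return {"leaf": False, "attr": attr, "value": value, "row": row,
                        "left": grow_tree(left, n_total, attributes),
                        "right": grow_tree(right, n_total, attributes)}
        return {"leaf": True, "class": majority_class(rows)}   # no usable split found

    def predict(tree, case):
        # Route a data case down the tree and return the class of the leaf it reaches.
        while not tree["leaf"]:
            tree = tree["left"] if case[tree["attr"]] <= tree["value"] else tree["right"]
        return tree["class"]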

Secondly, once the first generation is fully populated, new trees are grown by crossover, splitting the selected parents at a random node and recombining the parts to form new trees (Fig. 3). In order to select the parent trees that will take part in crossover, tournament selection is employed. The number of trees taking part in the tournament is configurable. The fittest tree from the tournament forms the first parent. This process is then repeated to find a second parent.

Fig. 2. Randomly select an attribute, m, as the root node and select a row, n, to use its value, s, as the splitting point for the attribute m in growing one of the trees in the first generation. The subsequent nodes of the same tree are chosen and split in the same way. This process of generating a tree continues until the specified number of trees is generated as the first generation.

Fig. 3. Two parent trees forming a new child by splitting at a random place, indicated by the dotted lines, and crossing-over to generate a new individual that contains part of the solution encoded in each parent.

The fitness function uses the accuracy of the trees in the competition, since it is enforced that a node must contain a certain number of rows during the tree growth process:

fitness(Tree) = \sum_{i=1}^{n} \frac{\text{rows at node } i \text{ with Class}_{m,i}}{\text{Rows}}    (2)

where n is the number of leaf nodes and Class_{m,i} is the majority class at node i. If the number of leaf nodes was included in the fitness function, the population tended to converge to a relatively small set of trees, decreasing the parts of the search space explored, thereby leading to a slower overall increase in accuracy and often appearing to get stuck in regions containing less accurate trees of the same size as those produced without using the leaf node count in the fitness function. The tree taking part in the tournament that maximizes Eq. (2) is selected as a parent.

Thirdly, the child trees may be mutated before being added to the next generation. The pre-defined number of trees are mutated, with a random choice among the following mutation operators (Buontempo et al., 2005); an illustrative sketch of these operators follows the list.

(i) Change the split value (corresponding to choosing a different training row's value for the current attribute).
(ii) Choose a new attribute whilst keeping the same row.
(iii) Choose a new attribute and a new row.
(iv) Re-grow part of the tree from any randomly selected node (apart from leaf nodes).
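The sketch below illustrates the four operators, re-using grow_tree from the earlier sketch; splitting nodes are assumed to remember the training row that supplied their split value. It is an illustration of the idea rather than the authors' code.

    import copy
    import random

    def internal_nodes(tree, acc=None):
        # Collect references to every splitting (non-leaf) node of a tree.
        acc = [] if acc is None else acc
        if not tree["leaf"]:
            acc.append(tree)
            internal_nodes(tree["left"], acc)
            internal_nodes(tree["right"], acc)
        return acc

    def mutate(tree, rows, attributes):
        tree = copy.deepcopy(tree)
        nodes = internal_nodes(tree)
        if not nodes:
            return tree
        node = random.choice(nodes)
        op = random.choice(["new_value", "new_attribute", "new_attribute_and_row", "regrow"])
        if op == "new_value":                 # (i) a different training row's value, same attribute
            node["row"] = random.choice(rows)
            node["value"] = node["row"][node["attr"]]
        elif op == "new_attribute":           # (ii) a new attribute, keeping the same row
            node["attr"] = random.choice(attributes)
            node["value"] = node["row"][node["attr"]]
        elif op == "new_attribute_and_row":   # (iii) a new attribute and a new row
            node["attr"] = random.choice(attributes)
            node["row"] = random.choice(rows)
            node["value"] = node["row"][node["attr"]]
        else:                                 # (iv) re-grow the subtree rooted at this node
            # For brevity the subtree is re-grown here from the full training set;
            # the operator described in the text re-grows it from the rows routed to that node.
            subtree = grow_tree(rows, len(rows), attributes)
            node.clear()
            node.update(subtree)
        return tree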

If either crossover or mutation gives rise to a node previously classed as a leaf node which is no longer pure or can now usefully be split further, that part of the tree is re-grown. If a node becomes a leaf node during either operation, its previous children are not copied to the new tree.

The second and third steps are repeated until the required number of trees has been grown for the new generation, and generations are grown up to the pre-specified number.

2.2. Decision forest

Much research on decision trees has been devoted to improving the prediction accuracy, in particular for unseen data that was not used in building the tree models. One of the most promising methods is the decision forest, which uses ensembles of trees. A decision forest combines the results of multiple distinct but comparable decision tree models to reach a consensus prediction (Keefer & Woody, 2006).

The decision forest has major advantages because the idea of combining multiple trees implicitly assumes that a single decision tree cannot fully represent the relationships between the input variables and the output variable (Tong et al., 2003), or that a single tree model is optimum only for a specific region of the solution space, while a decision forest captures the relationships of different regions using different models.

Two approaches have become popular for generating multiple models (Tong et al., 2004). One generates separate decision trees using different portions of the data, where the data portions are selected from the training set based on resampling techniques such as bagging (Breiman, 1996) and boosting (Schapire, 1996). The second approach focuses on choosing an ensemble of decision trees by random selection of predictor variables (Amit & Geman, 1997; Breiman, 2001). However, the majority of the literature methods for decision tree generation from data suffer from the limitation that they only produce a single tree for a given set of data. The genetic programming method developed by Buontempo and Wang (Buontempo et al., 2005; Wang et al., 2006) for decision tree generation, GPTree, has great advantages for decision forest construction because it is designed to generate multiple trees from generation to generation. In addition, by varying the parameter values, more groups of trees can be produced. This method is distinct from the two decision forest approaches mentioned above because the entire training data is always used, avoiding the need to partition the data first. Four steps constitute GPForest, the genetic programming based decision forest method.
Page 5: Inductive data mining based on genetic programming: Automatic generation of decision trees from data for process historical data analysis


Fig. 4. The procedure for generation of the decision forest (a), and its use for prediction (b).

Firstly, the algorithm is initialised with the following seven parameters:

• Maximum number of generations for each experiment.
• Number of trees to be generated for each generation.
• Number of trees participating in the tournament.
• Number of trees that are allowed to be placed directly into the next generation without cross-over and mutation.
• The minimum tolerable increase in accuracy of the best performing decision tree in a generation compared with the accuracy of the best performing tree of a previous generation. If there is no obvious increase in the accuracy of the best performing trees over a few generations, mutation of trees is forced.
• Mutation rate, i.e. the percentage of trees that undergo mutation.
• Allowable minimum number of data cases covered by a leaf node. If a leaf node does not cover a sufficient number of data cases, that branch of the tree is re-grown.
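For concreteness, the seven parameters can be gathered into a single configuration object. This is an illustrative sketch, not the authors' C++ data structure; the default values are the near-optimum settings reported in Section 4.3 for this data set.

    from dataclasses import dataclass

    @dataclass
    class GPForestConfig:
        # Near-optimum values for this data set as reported in Section 4.3.
        max_generations: int = 200        # maximum number of generations per experiment
        trees_per_generation: int = 1200  # trees generated in each generation
        tournament_size: int = 16         # trees taking part in each tournament
        elite_trees: int = 0              # trees copied unchanged into the next generation
        min_accuracy_gain: float = 0.05   # minimum tolerable increase in best-tree accuracy
        mutation_rate: float = 0.50       # fraction of trees that undergo mutation
        min_cases_per_leaf: int = 2       # minimum number of data cases covered by a leaf

    config = GPForestConfig()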

Assignment of parameter values can be conducted using a trial-and-error approach. Further discussion of this point is given in a later section.

Secondly, a group of decision tree models is generated. By running GPTree again and varying the values of the parameters, new groups of trees can be produced.

After groups of decision tree models have been generated, the third step is model selection, which involves selecting the best-performing trees from all the trees produced in all experiments. The selection criteria need to consider the accuracy on the training data, the complexity of a tree defined by the number of leaf nodes, and a measure of misclassification defined by Eq. (3):

\text{degree of misclassification} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i^{P} - C_i^{T} \right|    (3)

where C_i^P and C_i^T are the predicted and target class assignments for the ith data case. C_i^P and C_i^T each take one of three numerical values: 1 represents 'High', 2 'Normal' and 3 'Low'. N is the number of data cases. In Eq. (3), the ordering information is taken into account. For example, if the true class value is 1 (i.e. 'High'), a prediction of 3 (i.e. 'Low') is considered a larger error than a prediction of 2 (i.e. 'Normal'), although in both cases the predictions are incorrect. Trees with high accuracy, fewer nodes and low degrees of misclassification are the winners.
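For clarity, Eq. (3) can be written as a few lines of code; the class coding (1 = High, 2 = Normal, 3 = Low) follows the text and the function name is illustrative.

    CLASS_CODE = {"H": 1, "N": 2, "L": 3}   # 1 = High, 2 = Normal, 3 = Low

    def degree_of_misclassification(predicted, target):
        # Eq. (3): mean absolute difference between predicted and target class codes,
        # so predicting Low for a High case counts twice as much as predicting Normal.
        return sum(abs(CLASS_CODE[p] - CLASS_CODE[t])
                   for p, t in zip(predicted, target)) / len(target)

    # One High case predicted as Low contributes 2/N to the measure, as Normal only 1/N.
    print(degree_of_misclassification(["L", "N", "N"], ["H", "N", "N"]))   # 0.666...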

Decision trees selected into the decision forest should be as diverse as possible; therefore an additional criterion is to avoid selecting similar trees. In this study, if two trees have over 70% of their nodes identical, they are said to be similar. For trees generated from different runs or experiments, i.e. with different initial parameter values, no more than two similar trees should be selected. If similar trees are generated from the same experiment, then only one of the similar trees should be kept. Nevertheless, these rules should not be applied too rigidly.

When applying the decision forest for prediction, the final prediction is determined by considering the predictions of all the tree models in the forest with a majority voting strategy. This voting strategy predicts the class based on the principle of the minority being subordinate to the majority. If a majority class cannot be found, the final class is determined by the tree that gives the highest accuracy on the training data, which was recorded at the decision forest model building stage. Details of the majority voting strategy and of alternative methods for handling the situation where no majority can be found are discussed further when the case study is presented in a later section. The GPForest algorithm is depicted in Fig. 4 and was implemented in C++.

3. The data and its pre-processing

The data were collected during the period 1990–1991 by Javier Bejar and Ulises Cortes for a wastewater treatment plant in Manresa, a town located near Barcelona, and were made available in the UCI Machine Learning Repository (Wastewaterdatabase). Though available in the literature, for convenience of discussion in the rest of the paper a sketch of the plant is given in Fig. 5. It consists of mainly three stages: pre-treatment, primary treatment by clarification and secondary treatment by means of activated sludge.

A total of 527 sets of data were collected, corresponding to 527 days of operation. Each data set has 38 variables, of which 29 correspond to measurements taken at different points of the plant and the remaining 9 are calculated performance measures for the plant (Sanchez et al., 1997). A complete list of the variables is given in Table 2 (again they are listed here for the convenience of discussion of results later, though they can be found in the literature), together with a brief comment on the meaning and the mean, maximum and minimum value of each variable. The database has some missing values for some variables for about 144 days out of the 527; these have been estimated in previous studies (Albazzaz, Wang, & Marhoon, 2005; Wang et al., 2004). This valuable industrial database has been studied in numerous investigations (Albazzaz & Wang, 2006; Albazzaz et al., 2005; Fuente, Vega, Zarrop, & Poch, 1996; Huang & Wang, 1999; Rodriguez-Roda, Poch, & Banares-Alcantara, 2000; Sanchez et al., 1997; Wang et al., 2004).

In this paper, the following four out of the seven variables characterizing the effluent quality are analyzed:

(i) SS-S, output suspended solids;
(ii) DBO-S, output biological oxygen demand;
(iii) COND-S, output conductivity; and
(iv) DQO-S, output chemical oxygen demand.

The purpose is to construct tree models predicting whether the values of the above four variables are high, normal or low, and explaining why. The values of each variable were scaled to the range between 0 and 1.

Fig. 5. The wastewater treatment plant.


Table 2
The variables for the wastewater treatment plant.

Pre-treatment
1 Q-E (input flow to plant)
2 ZN-E (input Zinc to plant)
3 PH-E (input pH to plant)
4 DBO-E (input biological oxygen demand to plant)
5 DQO-E (input chemical oxygen demand to plant)
6 SS-E (input suspended solids to plant)
7 SSV-E (input volatile suspended solids to plant)
8 SED-E (input sediments to plant)
9 COND-E (input conductivity to plant)

Primary treatment
10 PH-P (input pH to primary settler)
11 DBO-P (input biological oxygen demand to primary settler)
12 SS-P (input suspended solids to primary settler)
13 SSV-P (input volatile suspended solids to primary settler)
14 SED-P (input sediments to primary settler)
15 COND-P (input conductivity to primary settler)

Secondary treatment
16 PH-D (input pH to secondary settler)
17 DBO-D (input biological oxygen demand to secondary settler)
18 DQO-D (input chemical oxygen demand to secondary settler)
19 SS-D (input suspended solids to secondary settler)
20 SSV-D (input volatile suspended solids to secondary settler)
21 SED-D (input sediments to secondary settler)
22 COND-D (input conductivity to secondary settler)

Calculated performance
23 RD-DBO-P (performance input biological oxygen demand in primary settler)
24 RD-SS-P (performance input suspended solids to primary settler)
25 RD-SED-P (performance input sediments to primary settler)
26 RD-DBO-S (performance input biological oxygen demand to secondary settler)
27 RD-DQO-S (performance input chemical oxygen demand to secondary settler)
28 RD-DBO-G (global performance input biological oxygen demand)
29 RD-DQO-G (global performance input chemical oxygen demand)
30 RD-SS-G (global performance input suspended solids)
31 RD-SED-G (global performance input sediments)

Output
32 PH-S (output pH)
33 DBO-S (output biological oxygen demand)
34 DQO-S (output chemical oxygen demand)
35 SS-S (output suspended solids)
36 SSV-S (output volatile suspended solids)
37 SED-S (output sediments)
38 COND-S (output conductivity)

As recommended by Albazzaz and Wang (2006), a Box-Cox transformation was performed on each of the four output variables in order to make the data follow an approximately Gaussian distribution. After the Box-Cox transformation, an output variable is classified as 'Normal' if its value lies within μ ± 2σ, 'Low' if it is smaller than μ − 2σ, and 'High' if it is larger than μ + 2σ, where μ is the mean and σ is the standard deviation.
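A minimal sketch of this labelling step, assuming the scaled values are strictly positive; the one-parameter Box-Cox transform is written out explicitly and lam is simply supplied by the caller (the original study estimated it per variable).

    import math
    import statistics

    def boxcox(x, lam):
        # One-parameter Box-Cox transform; requires x > 0.
        return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

    def label_output(values, lam):
        # Transform, then classify each value against the mean +/- 2 standard deviations.
        y = [boxcox(v, lam) for v in values]
        mu, sigma = statistics.mean(y), statistics.stdev(y)
        labels = []
        for v in y:
            if v > mu + 2 * sigma:
                labels.append("H")        # High
            elif v < mu - 2 * sigma:
                labels.append("L")        # Low
            else:
                labels.append("N")        # Normal
        return labels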

Various methods are available to split an entire data set into training and test data (Wang, Neskovic, & Cooper, 2005; Zhang & Cho, 1999), among which clustering is an established approach. The method classifies the whole data set into clusters, and from each cluster training and test data are selected. Sanchez et al. (1997) classified the WWTP data into five clusters, containing 125, 158, 40, 49 and 64 cases respectively. In theory, if one cluster is disproportionately larger than the others, care should be taken: to avoid a model being biased by the large cluster, the number of data cases selected for training should be of the same order as for the other, smaller clusters. In the current case of the WWTP data the five cluster sizes are not hugely different; therefore, from each cluster, 75% of cases were chosen for model training and 25% for test.
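A sketch of the cluster-based split described above; cluster membership (e.g. the five clusters of Sanchez et al., 1997) is assumed to be given for each data case, and the 75/25 proportion follows the text.

    import random

    def split_by_cluster(rows, cluster_of, train_fraction=0.75, seed=0):
        # Draw 75% of each cluster for training and keep the remaining 25% for testing,
        # so that every cluster is represented in both sets.
        rng = random.Random(seed)
        clusters = {}
        for i, row in enumerate(rows):
            clusters.setdefault(cluster_of[i], []).append(row)
        train, test = [], []
        for members in clusters.values():
            rng.shuffle(members)
            cut = int(round(train_fraction * len(members)))
            train.extend(members[:cut])
            test.extend(members[cut:])
        return train, test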


4. Results and discussion

This section focuses on the selection of trees into the decision forests for predicting the four effluent variables, suspended solids (SS-S), biological oxygen demand (DBO-S), conductivity (COND-S) and chemical oxygen demand (DQO-S); on the majority voting strategy used when applying a decision forest for prediction; and on the comparison with the decision trees obtained by applying See5 (Quinlan, 1986, 1993; Quinlan & Cameron-Jones, 1993) and GPTree (Buontempo et al., 2005; Wang et al., 2006).

In applying the rules discussed before to the selection of decision trees into the decision forest for the current case study, the selection criteria are: 97% for training accuracy, 12 as the maximum number of leaves and 3% as the tolerable value for the measure of misclassification. The numbers of decision trees thus selected into the decision forests are: 24 decision trees for predicting output suspended solids, SS-S; 25 decision trees for predicting the biological oxygen demand of the effluent, DBO-S; 23 trees for the chemical oxygen demand, DQO-S; and 34 decision trees for predicting the effluent conductivity, COND-S. These numbers were used because including more trees was found not to improve the predictive performance further.

Fig. 6 shows three examples out of the 24 decision trees in the decision forest for predicting the output suspended solids of the effluent, SS-S. We use Fig. 6(c) as an example to explain the meaning of some symbols. (N 15/3(H)) means the prediction result is N (Normal), and 15 data cases were correctly classified but 3 data cases were incorrectly classified as H (High). Similarly, (N 228) means the prediction result is N (Normal) and all 228 data cases were correctly classified with no misclassification, while (N 123/1(H)/9(L)) means the prediction is N, and 123 data cases were correctly classified, but 1 data case was misclassified as H and 9 cases were misclassified as L. The values of the misclassified cases are also important because, for example, an H value misclassified as L is a bigger error than one misclassified as N.

It is interesting to examine the three example trees in terms of complexity and prediction accuracy. The simplest tree of the three is Fig. 6(c), which contains two variables, RD-SS-G and SS-E, and misclassified 13 data cases. The Fig. 6(b) tree is more complex and contains one more variable, SED-E, in addition to the same two variables as Fig. 6(c), RD-SS-G and SS-E. Consequently, the Fig. 6(b) tree misclassified 12 data cases, one fewer than the Fig. 6(c) tree. The most complicated decision tree of the three is Fig. 6(a), which contains two more variables, SED-D and PH-P, in addition to all three variables used in Fig. 6(b). Fig. 6(a) misclassified only two data cases. This trend, that a more complex tree gives improved prediction accuracy, holds for the majority of cases but is not always valid, as will be discussed later for the three example trees for predicting the biological oxygen demand, DBO-S.

The three trees shown in Fig. 6 have the same root node, RD-SS-G, the global performance input suspended solids, and share a node, SS-E (input suspended solids to plant). This strongly suggests that these two variables are likely to be the most important variables for the suspended solids of the effluent, which is sensible: process knowledge confirms that they are the most influential variables for the suspended solids level of the effluent.

Similar observations to those for SS-S can be made for the decision forests for predicting the output biological oxygen demand, DBO-S, the chemical oxygen demand of the effluent, DQO-S, and the conductivity of the effluent, COND-S. Figs. 7–9 show example trees selected for the three forests for predicting these three variables. The numbers of decision trees in the forests for DBO-S, DQO-S and COND-S are 25, 23 and 34 (also see Table 3).
Page 8: Inductive data mining based on genetic programming: Automatic generation of decision trees from data for process historical data analysis

C.Y. Ma, X.Z. Wang / Computers and Chemical Engineering 33 (2009) 1602–1616 1609

F olids,7

ovAatef

TP

O

S

D

D

C

ig. 6. Three example trees of the decision forest for predicting output suspended s6 of experiment 9 and (c) generation 45 of experiment 9.

For the three example trees in Fig. 7 for predicting biological oxygen demand, DBO-S, tree (c) is the simplest, containing two variables, RD-DQO-G and DBO-E, and misclassified nine data cases. Although tree (b) is more complex and introduced three more variables, Zn-E, RD-DBO-P and SSV-D, in addition to the two contained in tree (c), it misclassified twelve data cases, more than tree (c). However, tree (a), the most complex of the three, misclassified only four data cases.

Table 3
Performance of trees selected to the decision forests.

Output | No. of trees in the decision forest | Data set | Total of correctly predicted cases | Total of incorrectly predicted cases | % of correctly predicted cases
SS-S | 24 | Training | 378 | 15 | 96.2
SS-S | 24 | Test | 130 | 4 | 97.0
DBO-S | 25 | Training | 375 | 18 | 95.4
DBO-S | 25 | Test | 126 | 8 | 94.0
DQO-S | 23 | Training | 379 | 14 | 96.4
DQO-S | 23 | Test | 129 | 5 | 96.3
COND-S | 34 | Training | 383 | 10 | 97.5
COND-S | 34 | Test | 127 | 7 | 94.7

Of the two example trees in Fig. 8, decision tree (a) is only slightly more complex than (b) but has much better performance: (a) misclassified only one data case, in comparison with twelve data cases misclassified by tree (b).

Fig. 7. Three example trees of the decision forest for predicting output biological oxygen demand (DBO-S), generated by GPForest in: (a) generation 14 of experiment 4, (b) generation 1 of experiment 20 and (c) generation 6 of experiment 8.

Fig. 8. Two example trees of the forest for predicting output chemical oxygen demand (DQO-S), generated by GPForest in: (a) generation 22 of experiment 4 and (b) generation 1 of experiment 8.

Another important capability of the decision trees generated by GPForest is that they can effectively handle small numbers of data cases. In the database analyzed, the numbers of data cases that give H (high) and L (low) output values are much smaller than those that have an N (normal) output value, because the plant operated normally for most of the time. This imbalanced data could lead to biased models if neural networks or linear regression models were used. In fact, dealing with small and exceptional cases within a large database has been a recognized challenge. The results demonstrate that decision tree generation based on the GPForest algorithm can effectively deal with small numbers of data cases. As revealed by the representative decision trees from the forests, Figs. 6–9, the misclassified cases are mainly N (normal) cases, rather than H (high) or L (low) cases. This point will be discussed in more detail when GPForest is compared with GPTree and See5.

4.1. Majority voting

The voting principle is that the minority is subordinate to the majority. As an illustrative example, consider a decision forest consisting of five decision trees. Suppose that for a specific data case the predictions of the five trees are H, H, N, H and H respectively; then the prediction of the decision forest is H. If the true value is N, the prediction is incorrect; if the true value is H, the prediction is correct.

If no conclusion can be reached based on the majority voting strategy, one of the following alternative solutions can be applied:

1. Use the prediction result of the tree that gives the highest accuracy on the training data as the prediction of the decision forest.

2. Assign a weight to the prediction by each decision tree based on its accuracy obtained on the training data, and then rank the predictions, or
3. Leave the decision to the user.

Fig. 9. Three example trees of the decision forest for predicting output conductivity (COND-S), generated by GPForest in: (a) generation 46 of experiment 18, (b) generation 58 of experiment 8 and (c) generation 51 of experiment 13.
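To make the voting rule concrete, the following sketch implements the majority vote with the fall-back of method (1) above; predict is the routing function from the earlier sketches, and the forest is assumed to be stored as (tree, training accuracy) pairs recorded at model-building time.

    from collections import Counter

    def forest_predict(forest, case):
        # forest: list of (tree, training_accuracy) pairs recorded when the forest was built.
        votes = [predict(tree, case) for tree, _ in forest]
        ranked = Counter(votes).most_common()
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            return ranked[0][0]                              # a unique most-voted class exists
        # No majority: fall back on the tree with the highest training accuracy (method 1).
        best_tree, _ = max(forest, key=lambda pair: pair[1])
        return predict(best_tree, case)

    # Example from the text: predictions H, H, N, H, H from five trees give H by majority.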

In this work, method (1) was used. It is obvious that, since a decision forest combines the results of multiple decision trees to provide a solution, analysis of the relative importance of the nodes is less straightforward than for a single decision tree. Nevertheless, there are ways to carry out such an analysis. In this study, the most important variables are identified by counting the attributes that appear consistently in multiple trees. The important variables identified for DBO-S, DQO-S and COND-S, i.e. those that appeared in 80% of the decision trees of the decision forest, are listed in the second column of Table 4.
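The attribute-counting rule just described can be sketched as follows; the 80% threshold is the one quoted in the text, and the tree and forest structures follow the earlier sketches.

    def attributes_in_tree(tree, found=None):
        # Collect the set of attributes used at the splitting nodes of one tree.
        found = set() if found is None else found
        if not tree["leaf"]:
            found.add(tree["attr"])
            attributes_in_tree(tree["left"], found)
            attributes_in_tree(tree["right"], found)
        return found

    def important_attributes(forest, threshold=0.8):
        # Flag attributes that appear in at least `threshold` of the trees of the forest.
        counts = {}
        for tree, _ in forest:
            for attr in attributes_in_tree(tree):
                counts[attr] = counts.get(attr, 0) + 1
        return [a for a, c in counts.items() if c >= threshold * len(forest)]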

Table 4
Identified important attributes.

Output | Most important input attributes identified
SS-S, output suspended solids | RD-SS-G, global performance input suspended solids; SS-E, input suspended solids to plant
DBO-S, output biological oxygen demand | RD-DBO-G, global performance input biological oxygen demand; DQO-E, input chemical oxygen demand to plant
DQO-S, output chemical oxygen demand | RD-DQO-G, global performance input chemical oxygen demand; RD-DQO-S, performance input chemical oxygen demand to secondary settler
COND-S, output conductivity | COND-D, input conductivity to secondary settler

The identified most important variables for the effluent biological oxygen demand, DBO-S, are the global performance input biological oxygen demand RD-DBO-G and the input chemical oxygen demand to plant DQO-E, which are both considered very reasonable. For the effluent chemical oxygen demand, DQO-S, the global performance input chemical oxygen demand RD-DQO-G and the performance input chemical oxygen demand to the secondary settler are flagged as the key influencing variables, which again is considered a reasonable identification.

There is only one variable that was flagged as most influential for the output effluent conductivity, COND-S: the input conductivity to the secondary settler, COND-D. It is worth mentioning that this does not necessarily mean there is definitely only one key variable; it just means that, based on the rules set in the program, only one variable was flagged.

Table 5
Accuracies and confusion matrices of trees induced by GPForest (GPF), GPTree (GPT) and See5 (a). In each confusion-matrix cell the three numbers are the counts for GPF, GPT and See5, respectively; actual classes are given in rows and predicted classes in columns, and cells that are not listed contained no cases.

SS-S, training: GPF 391/393 = 99.5%; GPT 391/393 = 99.5%; See5 380/393 = 96.7%
  Actual H: predicted H 17, 17, 14; predicted N 1, 1, 4
  Actual N: predicted N 366, 366, 366
  Actual L: predicted N 1, 1, 9; predicted L 8, 8, 0
SS-S, test: GPF 133/134 = 99.3%; GPT 132/134 = 98.5%; See5 132/134 = 98.5%
  Actual H: predicted H 3, 3, 3
  Actual N: predicted N 128, 128, 128
  Actual L: predicted N 1, 2, 2; predicted L 2, 1, 1

DBO-S, training: GPF 391/393 = 99.5%; GPT 389/393 = 99.0%; See5 390/393 = 99.2%
  Actual H: predicted H 5, 5, 5; predicted N 1, 1, 1
  Actual N: predicted N 378, 378, 378
  Actual L: predicted N 1, 3, 2; predicted L 8, 6, 7
DBO-S, test: GPF 133/134 = 99.3%; GPT 133/134 = 99.3%; See5 128/134 = 95.5%
  Actual H: predicted H 2, 2, 1; predicted N 0, 0, 1
  Actual N: predicted N 130, 130, 127; predicted L 0, 0, 3
  Actual L: predicted N 1, 1, 2; predicted L 1, 1, 0

DQO-S, training: GPF 392/393 = 99.7%; GPT 392/393 = 99.7%; See5 390/393 = 99.2%
  Actual H: predicted H 10, 10, 9; predicted N 1, 1, 2
  Actual N: predicted N 375, 375, 374; predicted L 0, 0, 1
  Actual L: predicted L 7, 7, 7
DQO-S, test: GPF 131/134 = 97.8%; GPT 131/134 = 97.8%; See5 129/134 = 96.3%
  Actual H: predicted H 2, 2, 0; predicted N 0, 0, 2
  Actual N: predicted N 129, 129, 129
  Actual L: predicted N 3, 3, 3; predicted L 0, 0, 0

COND-S, training: GPF 393/393 = 100%; GPT 393/393 = 100%; See5 389/393 = 99.0%
  Actual H: predicted H 13, 13, 10; predicted N 0, 0, 3
  Actual N: predicted N 375, 375, 375
  Actual L: predicted N 0, 0, 1; predicted L 5, 5, 4
COND-S, test: GPF 132/134 = 98.5%; GPT 131/134 = 97.8%; See5 129/134 = 96.3%
  Actual H: predicted H 1, 1, 1; predicted N 2, 2, 2
  Actual N: predicted N 128, 128, 128
  Actual L: predicted N 0, 1, 3; predicted L 3, 2, 0

(a) SS-S: output suspended solids; DBO-S: output biological oxygen demand; DQO-S: output chemical oxygen demand; COND-S: output conductivity; H: High; N: Normal; L: Low. The SS-S training block is the shaded area referred to in the text.


4.2. Comparison of GPForest results with those of GPTree and See5

The performance comparison between GPForest, GPTree and See5 will be discussed by reference to Table 5. To help readers understand the table, an introduction is first given by discussing the shaded area in Table 5. The shaded area is the confusion matrix for predicting SS-S (output suspended solids) on the training data, showing information about the actual and predicted classifications made by the three decision tree systems. In the cell that contains 17, 17, 14, the first '17' means that GPForest predicted 17 data cases as 'H' in SS-S; these predictions are correct because their actual classes are 'H'. The second 17 in the cell indicates that GPTree also correctly predicted 17 data cases as 'H' in SS-S. The third number in the cell, 14, means that See5 correctly predicted 14 cases that are 'H' for SS-S.

Still in the shaded area, the cell containing 1, 1, 4 indicates that GPForest, GPTree and See5 predicted one, one and four data cases respectively as 'N' (normal) in SS-S, although their actual classifications should all be 'H' (high). Similarly, the cell of 1, 1, 9 shows that GPForest, GPTree and See5 predicted one, one and nine data cases respectively as 'N' (normal) in SS-S, although their actual classifications should all be 'L' (low).

In other words, the cells on the diagonal contain the numbers of data cases that are predicted correctly by GPForest, GPTree and See5, while the cells elsewhere contain the numbers of data cases that are incorrectly predicted by the three approaches.

To summarise the shaded area of Table 5: GPForest, GPTree and See5 incorrectly predicted 2, 2 and 13 data cases respectively, out of a total of 393 training data cases, giving prediction accuracies of 99.5%, 99.5% and 96.7%.

It needs to be pointed out here that comparing the performance of the three methods using accuracies alone can be misleading. Examination of the data reveals that the majority of the data cases are normal operational data, i.e. have an actual value of 'N' (still in the shaded area, 366 normal data cases), with only 18 data cases having an actual value of 'H' in SS-S and 9 cases having an actual value of 'L'. Because of this disproportion in the numbers of cases taking values of H, N and L, an incorrect prediction for a data case that actually belongs to N is considered a smaller error than an incorrect prediction for a data case that actually belongs to H or L. From this viewpoint, the shaded area indicates that for the prediction of SS-S, GPForest and GPTree performed the same, and both significantly outperformed See5, because GPForest and GPTree each correctly predicted 25 H and L cases, while See5 correctly predicted only 14 H cases and zero L cases.

Based on the discussion above, we can analyse the performance of GPForest, GPTree and See5 in predicting SS-S for the test data (now outside the shaded area). The three approaches gave identical performance in correctly predicting H in SS-S for three data cases (cells containing 3, 3, 3). But GPForest performed better than either GPTree or See5 in predicting data cases that are L in SS-S: GPForest correctly predicted two cases, while GPTree and See5 each correctly predicted one. GPForest also performed better than GPTree and See5 in predicting N data cases.

The discussion above is aimed at illustrating how the table should be read. Clearly, comparison of the performance of GPForest, GPTree and See5 should be made for all four variables, SS-S, DBO-S, DQO-S and COND-S, and based on both the test and the training data.

Examination of Table 5 for all four variables, and for both training and test data, reveals that GPForest always either outperforms or matches the performance of GPTree and See5.

It was found that some decision trees generated by genetic programming are very similar in structure to the trees generated by See5. For example, for predicting output suspended solids, SS-S, the tree generated by See5, Fig. 10(a), is similar to one of the trees generated by genetic programming, Fig. 6(c): they share two nodes, the global performance input suspended solids RD-SS-G and the input suspended solids to plant SS-E, though Fig. 6(c) has one more node than Fig. 10(a). Both misclassified 13 data cases that should be N (normal): one misclassified as H (high) and 12 misclassified as L (low). Some other trees generated by genetic programming for SS-S, e.g. the two trees in Fig. 6(a) and (b), are more complicated than the See5 tree and give fewer misclassifications. The Fig. 6(b) tree not only contains the two variables used by See5, RD-SS-G and SS-E, but also introduces one more variable, the input sediments to plant SED-E; it has 12 misclassifications, one fewer than the See5 tree. The Fig. 6(a) tree also contains the two variables used by See5, RD-SS-G and SS-E, but introduces two more variables, SED-D (input sediments to secondary settler) and the input pH to primary settler PH-P. This tree has only two more variables than the See5 tree but improves the performance dramatically: it misclassified only two data cases, whereas See5 misclassified thirteen.

Fig. 10. Decision trees generated by See5 for predicting output suspended solids (SS-S), output biological oxygen demand (DBO-S), output chemical oxygen demand (DQO-S) and output conductivity (COND-S).

Similar observations can be made for trees predicting the output chemical oxygen demand, DQO-S. The See5 tree, Fig. 10(c), shares some nodes with the genetic programming tree, Fig. 8(a), including RD-DBO-S, RD-DQO-G and DQO-E, but the two trees also have a few different nodes. The genetic programming tree, Fig. 8(a), misclassified one data case (an N case misclassified as H), whereas the See5 tree, Fig. 10(c), misclassified three data cases: two N cases misclassified as H and one L case misclassified as N.

A similar comparison can be made for trees predicting output conductivity, COND-S. The See5 tree of Fig. 10(d) misclassified four data cases, while the genetic programming tree of Fig. 9(a) misclassified only one data case.

tput biological oxygen demand (DBO-S), output chemical oxygen demand (DQO-S)

For trees predicting output biological oxygen demand, DBO-S, although the See5 tree of Fig. 10(b) misclassified three data cases, one fewer than the genetic programming tree of Fig. 7(a), which misclassified four, it cannot be concluded that See5 performs better than GPTree and GPForest. This is because Fig. 7(a) is just an example tree from the tree forest. The best performing decision tree generated by genetic programming still performs better than See5, as evidenced by Table 5 (for the row of DBO-S, in particular for the test data).

Table 6
Effects of the number of generations in each experiment and the number of trees in each generation on the prediction accuracy (%) of the best performing tree for each experiment.

Attribute            Generation number required         Trees generated in each generation
                     200    100    60     30            1200   900    600    300
SS-S     Training    97.2   99.2   95.4   96.4          98.7   95.7   99.2   99.2
         Test        99.2   99.3   99.3   98.5          98.5   98.5   99.3   98.5
DBO-S    Training    96.9   99.2   99.5   99.0          98.2   99.0   99.2   97.2
         Test        98.5   98.5   97.0   98.5          99.3   98.5   98.5   98.5
DQO-S    Training    99.7   99.7   99.0   97.7          99.7   97.2   99.8   99.2
         Test        97.8   97.8   97.0   97.0          97.8   97.0   97.8   97.8
COND-S   Training    98.7   99.7   99.0   99.7          99.2   99.7   99.7   98.7
         Test        97.0   97.0   97.0   97.0          97.0   97.8   97.0   97.0

4.3. Parameters of the genetic programming algorithm

After having presented the results, it is appropriate to have a further discussion of the parameters used in the genetic programming algorithm. Different values for the seven parameters (introduced in Section 2) were tested in order to gain knowledge about the sensitivity of the results to them. This was conducted using a trial-and-error approach, which led to a set of parameter values considered near optimum for the data concerned. In the following list, for each parameter the first value is the near-optimum value, and the remaining values were used as variations in further experiments (the near-optimum settings are also collected in the sketch after the list):

• Maximum number of generations for each experiment: 200, 100, 60, 30.
• Number of trees to be generated for each generation: 1200, 900, 600, 300.
• Number of trees participating in the tournament: 16, 8, 32.
• Number of trees that are allowed to be placed directly into the next generation without cross-over and mutation: 0, 1, 5.
• The minimum tolerable increase in accuracy for the best performing decision tree in a generation compared with the accuracy of the best performing tree of a previous generation: 5%, 8%, 10%.
• Mutation rate, i.e. percentage of trees that undergo mutation: 50%, 20%, 40%, 60%, 80%.
• Allowable minimum number of data cases that are covered by a leaf node: 2, 1, 5, 10.
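For reference, the near-optimum values listed first for each parameter above, and the variations tested, can be collected into a single configuration. The sketch below is only an illustration; the key names are invented for readability and are not identifiers from the authors' implementation.

```python
# Near-optimum settings reported in the text; key names are illustrative.
gp_params = {
    "max_generations": 200,        # maximum number of generations per experiment
    "trees_per_generation": 1200,  # number of trees generated in each generation
    "tournament_size": 16,         # trees participating in the tournament
    "elitism_count": 0,            # trees copied unchanged into the next generation
    "min_accuracy_gain": 0.05,     # minimum tolerable increase in best-tree accuracy
    "mutation_rate": 0.5,          # fraction of trees that undergo mutation
    "min_cases_per_leaf": 2,       # minimum number of data cases covered by a leaf
}

# Variations tested in further experiments (one parameter varied at a time).
variations = {
    "max_generations": [100, 60, 30],
    "trees_per_generation": [900, 600, 300],
    "tournament_size": [8, 32],
    "elitism_count": [1, 5],
    "min_accuracy_gain": [0.08, 0.10],
    "mutation_rate": [0.2, 0.4, 0.6, 0.8],
    "min_cases_per_leaf": [1, 5, 10],
}
```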

It needs to be pointed out that although there are seven parameters, for the majority of them it is not difficult to assign a value. This is the case for the maximum number of generations for each experiment (equivalent to the maximum number of iterations in feedforward neural networks), the number of trees to be generated in each generation (both proved not to be sensitive), and the minimum tolerable increase in accuracy (equivalent to the error tolerance in feedforward neural network training). The other parameters are also not difficult to assign values to, and are no more difficult than choosing the numbers of layers and hidden neurons in feedforward neural network training.

As an example, Table 6 shows the results for the maximum number of generations to run and the number of trees to be generated in each generation. Table 6 shows that the maximum numbers of generations to run for each experiment, 30, 60, 100 and 200, do not make a meaningful difference to the chance of finding the best performing decision tree. Fig. 11 shows the highest training accuracy of trees in each generation, plotted against the generations, for output suspended solids, SS-S. This figure shows that the tree with the highest accuracy was obtained at around the 20th generation. Taking into account a safety factor, 100 was assigned as the maximum number of generations to run for each experiment. For the number of trees generated in each generation, 300, 600, 900 and 1200 were tested (Table 6). Again, no significant difference was found; as a result, 600 was considered a reasonably safe value. The first experiment was conducted using these values. All other experiments involved slightly varying the parameter values, one parameter per experiment. It needs to be pointed out that even with the same parameter values, two runs will not generate identical decision trees, because the algorithm involves randomly selecting nodes and split values.

Fig. 11. The highest training accuracy of trees in each generation, plotted against the generations, for output suspended solids, SS-S.
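The experimental procedure just described, re-running the algorithm with one parameter changed at a time and recording the best training accuracy in each generation (the kind of curve plotted in Fig. 11), can be sketched as follows. Everything here is hypothetical: run_gp_experiment merely simulates an accuracy curve and stands in for an actual GPForest run.

```python
import copy
import random

def run_gp_experiment(params, seed=0):
    """Hypothetical stand-in for one GPForest run: returns a simulated
    best-training-accuracy-per-generation curve (random values, not real results)."""
    rng = random.Random(seed)
    best, curve = 0.80, []
    for _ in range(params["max_generations"]):
        best = min(1.0, best + rng.uniform(0.0, 0.01))  # best accuracy never decreases
        curve.append(best)
    return curve

def one_at_a_time_sweep(base_params, variations):
    """Re-run the experiment, changing one parameter from the base settings at a time."""
    results = {"base": run_gp_experiment(base_params)}
    for name, values in variations.items():
        for value in values:
            params = copy.deepcopy(base_params)
            params[name] = value
            results[f"{name}={value}"] = run_gp_experiment(params)
    return results

# Example using the finally adopted settings (100 generations, 600 trees per generation).
base = {"max_generations": 100, "trees_per_generation": 600, "mutation_rate": 0.5}
curves = one_at_a_time_sweep(base, {"max_generations": [200, 60, 30]})
print({k: round(v[-1], 3) for k, v in curves.items()})  # final best accuracy per run
```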

5. Final remarks

Decision tree models generated through inductive data mining bring advantages in addressing some challenges facing practical data analysis.

Firstly, feature variable selection from the input variables is an integral step in decision tree induction. This is an important property because users often want to know which input variables are the key variables that affect the output and which variables do not impact it. Inductive data mining has proved to be very effective in processing problems that have thousands of input variables, reducing them to only a few. It is important to be aware that feature variable selection is different from dimension reduction such as principal component analysis or independent component analysis, because in the latter methods the latent variables still contain information about original variables that are in fact irrelevant to the output. It is well known that inclusion of information from irrelevant input variables in a predictive model can adversely affect the model's performance, e.g. in generalisation.
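The contrast drawn here can be illustrated with standard tools; the sketch below uses scikit-learn (a hypothetical illustration, not the software used in this work). A fitted decision tree concentrates its feature importances on the variables it actually uses, whereas the PCA loadings remain weighted mixtures of all original variables, including irrelevant ones.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))    # ten candidate input variables (synthetic data)
y = (X[:, 2] > 0).astype(int)     # only variable 2 actually determines the output

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("tree importances:", np.round(tree.feature_importances_, 2))  # mass on variable 2 only

pca = PCA(n_components=2).fit(X)
print("PCA loadings:", np.round(pca.components_, 2))  # every variable contributes to each latent variable
```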

Secondly, decision tree models are causal and transparent. A decision tree gives information not only about which variables are the most important to the output, but also about the causal relationships.

Once a decision tree is built, it can be altered by users, e.g. by adding branches and nodes based on their knowledge. This is useful because such newly added expert knowledge could be knowledge that was not reflected in the data.

The decision forest approach presented in this work also demonstrated effectiveness in dealing with a small number of data cases representing a specific scenario (e.g. data about abnormal process operation) when they are mixed with a large number of data cases representing a different scenario (e.g. data of normal process operation). It is known that for some other methods, without pre-selection and treatment of the training data, the models built could be biased towards mainly representing the large cluster.

Traditional decision tree generation methods are often based on greedy search algorithms. Since greedy search based methods generate only one tree at a time, tactics such as using various different subsets of the database must be introduced in order to generate multiple trees which can be used to form a decision forest. In contrast, the method presented here, which is based on a genetic programming approach to the construction of decision trees, can easily generate multiple trees. For instance, in the current work, 600 trees were generated in each generation and each experiment ran for 100 generations, which means 60,000 decision trees were generated in each experiment. In addition, by slightly varying the initial values of the seven parameters of the algorithm, new experiments will generate more decision trees. Therefore the method presented in this paper, GPForest, naturally represents a powerful technique for decision forest construction.
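One straightforward way to combine the predictions of the trees collected into such a forest is a simple majority vote over the class labels. The sketch below assumes each tree exposes a predict(case) method returning H, N or L; this interface and the dummy trees are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter

def forest_predict(trees, case, classes=("H", "N", "L")):
    """Combine a forest's trees by majority vote; each tree is assumed to
    expose a predict(case) -> label method (an assumed interface)."""
    votes = Counter(tree.predict(case) for tree in trees)
    return max(classes, key=lambda c: votes[c])  # ties broken by class order

# Minimal illustration with dummy constant-prediction "trees".
class ConstantTree:
    def __init__(self, label):
        self.label = label
    def predict(self, case):
        return self.label

forest = [ConstantTree("H"), ConstantTree("H"), ConstantTree("N")]
print(forest_predict(forest, case={"SS-E": 190.0}))  # -> "H"
```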

Application of the methodology to the analysis of the data collected from the wastewater treatment plant, corresponding to 527 days of operation, proved that the decision forest approach outperforms See5, probably the most well-established decision tree generator, as well as GPTree. For all four output variables examined, the GPForest approach presented in this paper consistently outperforms See5 and GPTree, in particular with regard to performance on the test data, which were not used in the construction of the decision trees, indicating better generalisation capability.

It needs to be emphasised that despite the above-mentioned advantages of GPForest, it is not intended to completely replace other well-studied data mining and analysis methodologies, such as multivariate statistical process control, support vector machines and multi-dimensional data visualization. Rather, it is always advantageous to use different tools together: one tool can pre-process the data for another tool, and results from different tools can be compared for validation purposes. In addition, several heuristic rules with associated parameters require fine tuning, specifically concerning the use of the genetic algorithm. Future research will investigate methodologies that automate the determination of a set of optimum values for the parameters used in GPForest.

Acknowledgements

The work has benefited from a grant from the UK Engineering and Physical Sciences Research Council (EPSRC, grant reference number EP/D038391). The authors would also like to acknowledge the support of the Dorothy Hodgkin International Postgraduate Award, which is jointly financed by the Research Councils UK and the Shell Group.

References

Albazzaz, H., & Wang, X. Z. (2006). Historical data analysis based on plots of independent and parallel coordinates and statistical control limits. Journal of Process Control, 16, 103–114.
Albazzaz, H., Wang, X. Z., & Marhoon, F. (2005). Multidimensional visualisation for process historical data analysis: A comparative study with multivariate statistical process control. Journal of Process Control, 15, 285–294.
Amit, Y., & Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation, 9, 1545–1588.
Bacha, P. A., Gruver, H. S., Den Hartog, B. K., Tamura, S. Y., & Nutt, R. F. (2002). Rule extraction from a mutagenicity data set using adaptively grown phylogenetic-like trees. Journal of Chemical Information and Computer Sciences, 42, 1104–1111.

Bakshi, B. R., & Stephanopoulos, G. (1994). Representation of process trends 4. Induction of real-time patterns from operating data for diagnosis and supervisory control. Computers & Chemical Engineering, 18, 303–332.
Bala, J., Huang, J., Vafaie, H., DeJong, K., & Wechsler, H. (1995). Hybrid learning using genetic algorithms and decision trees for pattern classification. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95), Vol. 1, Montreal, Quebec, Canada, August 19–25, pp. 719–724.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1993). Classification and regression trees. Boca Raton, London: Chapman & Hall/CRC.
Buontempo, F. V., Wang, X. Z., Mwense, M., Horan, N., Young, A., & Osborn, D. (2005). Genetic programming for the induction of decision trees to model ecotoxicity data. Journal of Chemical Information and Modeling, 45, 904–912.
Cho, S. J., Shen, C. F., & Hermsmeier, M. A. (2000). Binary formal inference-based recursive modeling using multiple atom and physicochemical property class pair and torsion descriptors as decision criteria. Journal of Chemical Information and Computer Sciences, 40, 668–680.
Clark, P., & Boswell, R. (1991). Rule induction with CN2: Some recent improvements. Berlin: Presented at Machine Learning—EWSL-91.
Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261–283.
DeLisle, R. K., & Dixon, S. L. (2004). Induction of decision trees via evolutionary programming. Journal of Chemical Information and Computer Sciences, 44, 862–870.
Deniz, O., Castrillon, M., & Hernandez, M. (2003). Face recognition using independent component analysis and support vector machines. In Presented at the 3rd International Conference on Audio and Video Based Biometric Person Authentication (AVBPA 2001), Halmstad, Sweden.
FDA (2006). FDA PAT Initiative. <http://www.fda.gov/cder/OPS/PAT.htm>.
Fuente, M. J., Vega, P., Zarrop, M., & Poch, M. (1996). Fault detection in a real wastewater plant using parameter-estimation techniques. Control Engineering Practice, 4, 1089–1098.
Huang, Y. C., & Wang, X. Z. (1999). Application of fuzzy causal networks to waste water treatment plants. Chemical Engineering Science, 54, 2731–2738.
Hussain, A. S. (2009). FDA's PAT & cGMP initiative for the 21st century: Status update and challenges to be addressed. <http://www.camppharma.org/secured/display/pdfs/Year2003/Proposals&Presentations/Regulatory/FDA-Ajaz-PAT-cGMP-Initiative-March-2003.pdf> Accessed March 2009.

Iri, M., Aoki, K., Oshima, E., & Matsuyama, H. (1979). An algorithm for diagnosis of system failures in the chemical process. Computers & Chemical Engineering, 3, 489–493.
Jemwa, G. T., & Aldrich, C. (2005). Improving process operations using support vector machines and decision trees. AIChE Journal, 51, 526–543.
Johannesmeyer, M. C., Singhal, A., & Seborg, D. E. (2002). Pattern matching in historical data. AIChE Journal, 48, 2022–2038.
Keefer, C. E., & Woody, N. A. (2006). Rejecting unclassifiable samples with decision forests. Chemometrics and Intelligent Laboratory Systems, 84, 40–45.
King, R. D., & Srinivasan, A. (1996). Prediction of rodent carcinogenicity bioassays from molecular structure using inductive logic programming. Environmental Health Perspectives, 104, 1031–1040.
King, R. D., Srinivasan, A., & Dehaspe, L. (2001). Warmr: A data mining tool for chemical data. Journal of Computer-Aided Molecular Design, 15, 173–181.
Li, R. F., & Wang, X. Z. (2001). Qualitative/quantitative simulation of process temporal behavior using clustered fuzzy digraphs. AIChE Journal, 47, 906–919.
Liu, H., Hussain, F., Tan, C. L., & Dash, M. (2002). Discretization: An enabling technique. Data Mining and Knowledge Discovery, 6, 393–423.
Mamdani, E. H. (1974). Application of fuzzy algorithms for control of simple dynamic plant. In Proceedings of the IEE, Vol. 121 (pp. 1585–1588).
Martens, D., Baesens, B., & Van Gestel, T. (2009). Decompositional rule extraction from support vector machines by active learning. IEEE Transactions on Knowledge and Data Engineering, 21, 178–191.
Mitchell, T. M. (1977). Version spaces: A candidate elimination approach to rule learning. In Presented at Proceedings of the IJCAI-77, Cambridge, MA.
Muggleton, S., & Feng, C. (1992). Efficient induction of logic programs. In S. Muggleton (Ed.), Inductive logic programming (pp. 281–297). Academic Press.
Nounou, M. N., Bakshi, B. R., Goel, P. K., & Shen, X. T. (2002). Process modeling by Bayesian latent variable regression. AIChE Journal, 48, 1775–1793.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann Publishers Inc.
Quinlan, J. R., & Cameron-Jones, R. M. (1993). FOIL: A midterm report. Springer-Verlag.
Rodriguez-Roda, I., Poch, M., & Banares-Alcantara, R. (2000). Conceptual design of wastewater treatment plants using a design support system. Journal of Chemical Technology and Biotechnology, 75, 73–81.
Rusinko, A., Farmen, M. W., Lambert, C. G., Brown, P. L., & Young, S. S. (1999). Analysis of a large structure/biological activity data set using recursive partitioning. Journal of Chemical Information and Computer Sciences, 39, 1017–1026.
Sanchez, M., Cortes, U., Bejar, J., DeGracia, J., Lafuente, J., & Poch, M. (1997). Concept formation in WWTP by means of classification techniques: A compared study. Applied Intelligence, 7, 147–165.
Saraiva, P. M., & Stephanopoulos, G. (1992). Continuous process improvement through inductive and analogical learning. AIChE Journal, 38, 161–183.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In International Conference on Machine Learning. Morgan Kaufmann.


Shelokar, P. S., Jayaraman, V. K., & Kulkarni, B. D. (2004). An ant colony classifier system: Application to some process engineering problems. Computers & Chemical Engineering, 28, 1577–1584.
Singhal, A., & Seborg, D. E. (2005). Clustering multivariate time-series data. Journal of Chemometrics, 19, 427–438.
Takagi, T., & Sugeno, M. (1985). Fuzzy identification of systems and its application to modelling and control. IEEE Transactions on Systems, Man and Cybernetics, 15, 116–132.
Tong, W. D., Hong, H. X., Fang, H., Xie, Q., & Perkins, R. (2003). Decision forest: Combining the predictions of multiple independent decision tree models. Journal of Chemical Information and Computer Sciences, 43, 525–531.
Tong, W. D., Xie, W., Hong, H. X., Fang, H., Shi, L. M., Perkins, R., et al. (2004). Using decision forest to classify prostate cancer samples on the basis of SELDI-TOF MS data: Assessing chance correlation and prediction confidence. Environmental Health Perspectives, 112, 1622–1627.
Uraikul, V., Chan, C. W., & Tontiwachwuthikul, P. (2007). Artificial intelligence for monitoring and supervisory control of process systems. Engineering Applications of Artificial Intelligence, 20, 115–131.
Vapnik, V. N. (1998). Statistical learning theory. New York: Springer-Verlag.
Venkatasubramanian, V., Rengaswamy, R., & Kavuri, S. N. (2003). A review of process fault detection and diagnosis Part II: Quantitative model and search strategies. Computers & Chemical Engineering, 27, 313–326.
Wastewater database. Available from: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/.
Wang, J. G., Neskovic, P., & Cooper, L. N. (2005). Training data selection for support vector machines. In Advances in Natural Computation, Pt 1, Proceedings (pp. 554–564).

Wang, X. Z. (1999). Data mining and knowledge discovery for process monitoring and control. London, UK: Springer.
Wang, X. Z., Buontempo, F. V., Young, A., & Osborn, D. (2006). Induction of decision trees using genetic programming for modelling ecotoxicity data: Adaptive discretization of real-valued endpoints. SAR and QSAR in Environmental Research, 17, 451–471.
Wang, X. Z., Chen, B. H., Yang, S. H., McGreavy, C., & Lu, M. L. (1997). Fuzzy rule generation from data for process operational decision support. Computers & Chemical Engineering, 21, S661–S666.
Wang, X. Z., & McGreavy, C. (1998). Automatic classification for mining process operational data. Industrial & Engineering Chemistry Research, 37, 2215–2222.
Wang, X. Z., Medasani, S., Marhoon, F., & Albazzaz, H. (2004). Multidimensional visualization of principal component scores for process historical data analysis. Industrial & Engineering Chemistry Research, 43, 7036–7048.
Wang, X. Z., Yang, S. A., Veloso, E., Lu, M. L., & McGreavy, C. (1995). Qualitative process modelling—a fuzzy signed directed graph method. Computers & Chemical Engineering, 19, S735–S740.
Yuan, B. (2002). Process data mining using neural networks and inductive learning. PhD Thesis, University of Leeds.
Zeng, G. M., Li, X. D., Jiang, R., Li, J. B., & Huang, G. H. (2006). Fault diagnosis of WWTP based on improved support vector machine. Environmental Engineering Science, 23, 1044–1054.
Zhang, B. T., & Cho, D. Y. (1999). Genetic programming with active data selection. Simulated Evolution and Learning, 146–153.
Zhang, Y. W. (2008). Fault detection and diagnosis of nonlinear processes using improved kernel independent component analysis (KICA) and Support Vector Machine (SVM). Industrial & Engineering Chemistry Research, 47, 6961–6971.