Using Unfold-PCA for batch-to-batch start-up process understanding and steady-state identification in a sequencing batch reactor

Research Article

Received: 15 May 2007, Revised: 27 September 2007, Accepted: 23 October 2007, Published online in Wiley InterScience: 2007

(www.interscience.wiley.com) DOI: 10.1002/cem.1104

D. Aguado a*, A. Ferrer b, A. Seco c and J. Ferrer a

J. Chemometrics 2008, 22: 81–90

In chemical and biochemical processes, steady-state models are widely used for process assessment, control and optimisation. In these models, parameter adjustment requires data collected under nearly steady-state conditions. Several approaches have been developed for steady-state identification (SSID) in continuous processes, but no attempt has been made to adapt them to the particularities of batch processes. The main aim of this paper is to propose an automated method for SSID based on batch-wise unfolding of the three-way batch process data followed by a principal component analysis (Unfold-PCA), in combination with the methodology of Brown and Rhinehart [2]. A second goal of this paper is to illustrate how, by using Unfold-PCA, process understanding can be gained from the analysis of batch-to-batch start-up and transition data. The potential of the proposed methodology is illustrated using historical data from a laboratory-scale sequencing batch reactor (SBR) operated for enhanced biological phosphorus removal (EBPR). The results demonstrate that the proposed approach can be efficiently used to detect when the batches reach the steady-state condition, to interpret the overall batch-to-batch process evolution and also to isolate the causes of changes between batches using contribution plots. Copyright © 2007 John Wiley & Sons, Ltd.

Keywords: diagnosis; PCA; sequencing batch reactor; start-up; steady-state detection; wastewater

* Department of Hydraulic Engineering and Environment, Technical University of Valencia, Camino de Vera s/n, 46022 Valencia, Spain.

E-mail: [email protected]

a D. Aguado, J. Ferrer

Department of Hydraulic Engineering and Environment, Technical University of Valencia, Camino de Vera s/n, 46022 Valencia, Spain

b A. Ferrer

Department of Applied Statistics, Operations Research and Quality, Technical University of Valencia, Camino de Vera s/n, 46022 Valencia, Spain

c A. Seco

Department of Chemical Engineering, University of Valencia, Doctor Moliner, 50, 46100-Burjassot, Valencia, Spain

1. INTRODUCTION

In chemical and biochemical processes, steady-state models are widely used for unit design, process assessment, control and optimisation. Due to the inherent non-stationarity of these processes, model parameters must be adjusted frequently. However, parameter adjustment requires data collected under nearly steady-state conditions. For example, when a wastewater treatment plant (WWTP) needs to be designed for treating a specific wastewater (i.e. one with a non-typical influent composition), a deterministic model of the process is used and the bio-kinetic parameters of the model are usually determined in a pilot plant by running a sequence of experiments (with different operating conditions) and collecting data when the pilot plant is at steady-state. In such cases, an automated method for steady-state identification (SSID) that does not require continual operator surveillance would be of great practical use.

The term ‘steady-state’ can be understood in different ways. In statistics it is related to the concept of a stationary model. A time series (i.e. a sequence of process data sampled at a fixed time interval) is considered to be at steady-state if the statistical properties of the multidimensional random variables associated with the stochastic process do not change with time. However, as pointed out by Cao and Rhinehart [1], for deterministic process modelling purposes the steady-state condition is less strict and only the mean of the process data needs to be relatively constant. Given the particularities of batch processes, the above definition of steady-state is not suitable and an appropriate definition should be stated.

Batch process data consist of several variables measured over the duration of the batch (trajectories), which can be arranged in a three-directional matrix X (batch × variable × time). Matrices of this kind exhibit highly complex dynamics due to the nonlinearities of the trajectories and their complex auto/cross-correlation structure. Taking this into account, our proposal is to consider a batch to be at steady-state when each of its process variable trajectories remains in equilibrium about a constant trajectory (i.e. each trajectory follows a stable pattern (shape) with random noise deviations from it, so that the auto- and cross-correlation of the process variables remain stable).

Different approaches for SSID have been suggested in the literature and a brief review can be found in Cao and Rhinehart [1]. These authors proposed a novel method based on an F-like statistical test that compares the variance of a single process variable computed in two different ways (the method is explained in detail later in the paper). Subsequently, Brown and Rhinehart [2] extended this methodology to multivariable processes. Jiang et al. [3] introduced a wavelet-based method. Ruiz et al. [4] suggested the use of the latent variables from a PCA model instead of the collected process variables in the Brown and Rhinehart [2] approach. Moreover, they also developed a new method named the ‘predicted ratio method’, based on calculating the distance in the latent space between a new observation and a specified number of previous observations. All these approaches have been developed and applied for SSID in continuous processes, but no attempt has been made to adapt them and extend their application to batch processes.

Thus, the main goal of this paper is to develop an automatic procedure able to classify each batch as belonging to the start-up period or to a steady-state period. For this purpose, we propose a structured approach based on batch-wise unfolding of the three-way batch process data followed by a principal component analysis (PCA), in combination with the methodology of Brown and Rhinehart [2] for SSID. The application of the proposed methodology is illustrated using historical data from a laboratory-scale sequencing batch reactor (SBR) operated for enhanced biological phosphorus removal (EBPR).

The SBR is a fill-and-draw activated sludge technology for wastewater treatment. A single tank combines all the treatment steps of a continuous-flow WWTP with multiple reactors for activated sludge and settling. In SBR systems, short-term unsteady-state conditions can be imposed which allow, after several cycles, the development of a stable predominant bacterial population. The main advantage of these systems is their flexibility to meet many different effluent quality limits. The number of full-scale applications is increasing considerably and the SBR has become an alternative to continuous activated sludge processes [5].

The SBR process is characterised by a cyclic sequence of phases: fill, anaerobic, aerobic, anoxic, settle, draw and idle. It is a dynamic and highly nonlinear process in which the on-line recorded variables can be arranged in a three-directional matrix X (I × J × K) containing the values of J variables at K sampling times in I batches. The highly complex structure of matrix X can be analysed by multivariate statistical methods based on projection to latent structures. In particular, bilinear models such as PCA are highly recommended due to their simplicity and effectiveness in compressing the high-dimensional and correlated batch data into a few uncorrelated latent variables, which can be used to visualise and interpret hidden phenomena. Prior to the application of PCA, matrix X must be unfolded into a two-way matrix. Several options exist, but since this work is focused on analysing the ‘batch-to-batch evolution’, batch-wise unfolding is the most appropriate for the application of interest, yielding a two-directional matrix X (I × JK) as shown in Figure 1.

Figure 1. Three-way representation of batch process data (X) and the batch-wise unfolding scheme.

The paper is organised as follows. First, a short description of the lab-scale SBR and its operation mode is given, followed by an in-depth explanation and discussion of the statistical approaches used for the data analysis. Afterwards, Section 3 illustrates how process understanding can be gained by analysing the evolution of the trajectories of the collected batch process variables during the start-up period by means of multivariate projection techniques. Later in that section, the results of applying the proposed SSID methodology, first to real data from the SBR and then to simulated data, are shown and discussed. Finally, the main conclusions from this work are summarised.

2. MATERIALS AND METHODS

2.1. Reactor operation

Data were obtained from a laboratory-scale SBR (7 L working volume) operated under anaerobic/aerobic (A/O) conditions for EBPR. The SBR, equipped with a mechanical stirrer and an air diffuser, was operated with four 6-h cycles per day in a temperature-controlled room (20 ± 1 °C). Each cycle consisted of several phases of constant duration (Figure 2): fill (2 min), anaerobic (1 h 28 min), aerobic (3 h), settle (1 h 28 min) and draw (2 min). Five process variables were recorded by means of inexpensive, reliable and low-maintenance electronic sensors installed in the SBR: pH, electric conductivity (Cond), oxidation reduction potential (ORP), dissolved oxygen (DO) and temperature (Temp). The overall hydraulic retention time and sludge retention time in the SBR system were maintained at 12 h and 9 days, respectively. Activated sludge from a full-scale biological nutrient removal WWTP was used to inoculate the SBR. During the aerobic stage, the DO concentration was maintained at about 3 mg L⁻¹ using an on–off controller. The experiments were carried out using synthetic wastewater.

Figure 2. Scheme of the SBR phases for each cycle. This figure is available in colour online at www.interscience.wiley.com/journal/cem

EBPR is a wastewater treatment process aimed at achieving low phosphorus concentrations in the effluent. The process is based on the ability of certain types of bacteria (polyphosphate accumulating organisms, PAOs) to store phosphorus in amounts higher than those corresponding to their nutritional requirements. Two key factors promote the proliferation of PAOs in the system: the presence of short chain fatty acids in the anaerobic zone and the alternation of the sludge between anaerobic and aerobic conditions.

Under anaerobic conditions, PAOs take up the short chain fatty acids and store them intracellularly as poly-hydroxy-alkanoates (PHA), obtaining the necessary energy from the hydrolysis of internal poly-phosphate (poly-P) and from glycogen degradation. As a result of the poly-P hydrolysis, phosphorus and the associated cations (potassium and magnesium) are released, thus increasing their concentration in the medium during this stage. Afterwards, under aerobic conditions the PHA is consumed, providing energy for bacterial growth and for intracellular storage of glycogen and poly-P. To replenish the poly-P pools, phosphorus as well as potassium and magnesium are taken up from the medium, thus reducing their concentration. Since the phosphorus uptake is higher than the release, a net phosphorus uptake is achieved.

2.2. Three-way data and statistical approaches

Data from the five electronic sensors installed in the SBR were recorded approximately every 1.06 min, resulting in a record of 340 values from each sensor in each batch. For the data analysis 188 batches were used, including those from the start-up of the process. The start-up is the period during which the batches evolve towards a steady-state once the SBR has been inoculated and a given set of operating conditions has been imposed. It is a transition period, although the term ‘transition’ is usually used to refer to the period during which batches evolve from one steady-state to another (after imposing a different set of operating conditions in the SBR). Since this is a biological process, the evolution is due to the adaptation of the microorganism populations to the influent wastewater and to the operating conditions imposed in the SBR, which influence both their behaviour and their predominance.

The trajectories from the three main phases (anaerobic, aerobic and settle) were considered in this study. Since the duration of the process phases is the same in every batch, the data can be structured in a three-way matrix (X) of 188 batches by five process variables by 340 time instants.

As commented before, the three-way matrix can be unfolded into a bi-directional array. Six different ways are possible, although only three of them are not equivalent [6]. In this study, the main interest lies in analysing the differences among the batches; thus, batch-wise unfolding was employed. A schematic representation of the three-way matrix and the unfolding method is shown in Figure 1. Note that the batch direction is kept unaltered and the trajectories of the variables are arranged one after another in consecutive order. In this way, the trajectory of the weights of the PCA model (as well as the contribution plots) can be directly compared with the trajectory of the original collected variables, thus facilitating the interpretation of the results [7,8].

The unfolded matrix (X) was centred and scaled to unit variance. In this way, the differences in measurement units are handled and equal weight is given to each variable at each time interval. As a row of the unfolded matrix (one observation) contains all the information of a batch, centring the data subtracts the average trajectory over all the analysed batches from every process variable. Since in our application the batches are not expected to be similar (there will be a mixture of non-steady-state batches and stationary ones), the shape of the trajectory does matter and the average trajectory serves merely as a reference trajectory. Therefore, multivariate analysis on centred data will summarise the evolution of the deviation of the variable trajectories from the reference trajectory for each particular batch. This will be helpful for understanding the batch-to-batch process evolution and also for SSID. In this application it is not necessary to build the reference trajectory from in-control steady-state batches (as is the case in batch multivariate statistical process control, BMSPC). If the batches reach a steady-state period, their intra-batch variable trajectories will follow the same average pattern with random noise deviations from it. Therefore, the deviations from the reference trajectory will also be at steady-state. This is a key point for understanding the SSID proposal made in this work, explained in the following.
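The unfolding and autoscaling steps described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: the function names `unfold_batchwise` and `autoscale` are hypothetical, the array is filled with synthetic random numbers of the same size as the data set described in the text (188 × 5 × 340), and the guard against zero-variance columns is an added assumption.

```python
import numpy as np

def unfold_batchwise(X3):
    """Unfold an (I, J, K) array into an (I, J*K) matrix.

    Each row is one batch; the K-point trajectory of variable 1 comes first,
    then that of variable 2, and so on (trajectories stay contiguous).
    """
    I, J, K = X3.shape
    return X3.reshape(I, J * K)

def autoscale(X2, eps=1e-12):
    """Centre each column and scale it to unit variance.

    The column means form the 'reference trajectory' mentioned in the text.
    Columns with (almost) zero spread are left unscaled (an assumption).
    """
    mean = X2.mean(axis=0)
    std = X2.std(axis=0, ddof=1)
    std = np.where(std < eps, 1.0, std)
    return (X2 - mean) / std, mean, std

# Synthetic stand-in for the SBR data: 188 batches, 5 variables, 340 samples.
rng = np.random.default_rng(0)
X3 = rng.normal(size=(188, 5, 340))

X2 = unfold_batchwise(X3)
Z, mean_traj, std_traj = autoscale(X2)
print(X2.shape)  # (188, 1700)
```

Because the reshape keeps each variable's trajectory contiguous, columns 340 to 679 of a row hold the full trajectory of the second variable, which is what allows loadings and contributions to be plotted against the original trajectories.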


Since the variable trajectories within a batch are both autocorrelated and cross-correlated, applying PCA after batch-wise unfolding yields new uncorrelated latent variables (and a large dimensionality reduction). This facilitates the later application of other methods; in this case, the extension to multivariable processes, proposed by Brown and Rhinehart [2], of the SSID methodology developed by Cao and Rhinehart [1].

In this multivariable methodology, a statistical test is applied to determine whether each process variable is at steady-state or not, and the process is considered to be at steady-state when all the individual process variables are at steady-state. In our study, the method is not applied to the original collected variable trajectories but to the latent variables, as Ruiz et al. [4] did for a continuous process, and also to the so-called distance to the model (DmodX) statistic obtained from the PCA model. The DmodX statistic is just the residual standard deviation for a particular observation and it is proportional to the square root of the residual sum of squares [9]. Note that considering the residuals is of extreme importance because the covariance structure of the process could be drifting, which would indicate that the batch-to-batch process is not at steady-state (even if the latent variables are stable) but evolving (i.e. in a transition or start-up period). Moreover, the residuals contain valuable information on the adequacy of the PCA model, and for this reason too they must be taken into account in the method for identifying when the overall process is at steady-state.

In the method proposed by Cao and Rhinehart [1] for SSID of a single variable, the variance of a set of data is calculated in two different ways: (a) as the mean square deviation from the sample average and (b) as the mean square difference of consecutive data. However, to reduce the computational effort, instead of using the conventional sample average and variance, the process mean and variance are estimated by using exponentially weighted moving (EWM) filters:

$$x_{f,i} = \lambda_1 x_i + (1 - \lambda_1)\, x_{f,i-1} \tag{1}$$

$$\nu^2_{f,i} = \lambda_2 (x_i - x_{f,i-1})^2 + (1 - \lambda_2)\, \nu^2_{f,i-1} \tag{2}$$

$$\delta^2_{f,i} = \lambda_3 (x_i - x_{i-1})^2 + (1 - \lambda_3)\, \delta^2_{f,i-1} \tag{3}$$

where $x$ is the process variable; $x_f$ is the filtered value of $x$ (EWM average), which provides an estimate of the process mean; $i$ is the time sampling index; $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the filter factors; $\nu^2_{f,i}$ is the filtered value of the variance based on the difference between the data and the EWM average (filtered value of $x$); and $\delta^2_{f,i}$ is the filtered value of the variance based on sequential data differences.

Afterwards, the ratio of variances, named the R-statistic (derivation in Appendix I), is calculated as follows:

$$R = \frac{(2 - \lambda_1)\, \nu^2_{f,i}}{\delta^2_{f,i}} \tag{4}$$
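As a minimal sketch of how Equations (1)–(4) can be evaluated recursively, the following Python function updates the three filters sample by sample. The function name `r_statistic` and the start-up values (filtered mean set to the first sample, both filtered variances set to zero) are assumptions made for illustration; the source does not state how the filters are initialised. The default filter factors are the ones quoted in the text.

```python
import numpy as np

def r_statistic(x, lam1=0.2, lam2=0.1, lam3=0.1):
    """Return the R-statistic of Equations (1)-(4) for every sample of x.

    Initialisation (xf = x[0], nu2 = delta2 = 0) is an assumption; the first
    value is NaN because no consecutive difference is available yet.
    """
    x = np.asarray(x, dtype=float)
    r = np.full(x.size, np.nan)
    xf, nu2, delta2 = x[0], 0.0, 0.0
    for i in range(1, x.size):
        # Eq. (2): uses the filtered mean from the previous step, x_{f,i-1}
        nu2 = lam2 * (x[i] - xf) ** 2 + (1 - lam2) * nu2
        # Eq. (1): update the EWM average
        xf = lam1 * x[i] + (1 - lam1) * xf
        # Eq. (3): variance estimate from consecutive differences
        delta2 = lam3 * (x[i] - x[i - 1]) ** 2 + (1 - lam3) * delta2
        if delta2 > 0:
            # Eq. (4)
            r[i] = (2 - lam1) * nu2 / delta2
    return r

# Synthetic check: white noise (steady) versus a noisy ramp (not steady).
rng = np.random.default_rng(0)
steady = rng.normal(size=2000)
ramp = 1.0 * np.arange(2000) + rng.normal(size=2000)

r_steady = r_statistic(steady)
r_ramp = r_statistic(ramp)
print(np.nanmedian(r_steady), np.nanmedian(r_ramp))
```

On the white-noise series the ratio fluctuates around unity, whereas on the ramp the filtered mean lags behind the data and the ratio becomes much larger than unity.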

The R-statistic is dimensionless and independent of the measurement level. Since it is a ratio of estimated variances, it is also independent of the process variance. Assuming stationary and independent data, the probability density function of the R-statistic (pdf(R)) depends on the filter factors ($\lambda_1$, $\lambda_2$, $\lambda_3$) but is practically insensitive to the nature of the process random variable. Cao and Rhinehart [1] showed that for a wide range of different random distributions (normal, uniform, gamma, ...) the pdf(R) is almost identical.

When the process is at steady-state, the R-statistic will be close to unity, while if the process is not at steady-state the filtered value $x_f$ will lag behind the data, making the numerator term in Equation (4) larger than the denominator, so that the statistic becomes larger than unity. To determine whether the process is at steady-state, the calculated R-statistic is compared to a threshold or critical value (R-crit). Critical values for the R-statistic, assuming a process at steady-state with white noise (independent, identically normally distributed data), were obtained by Cao and Rhinehart [10] for different type I risks ($\alpha$) and filter factors ($\lambda_1$, $\lambda_2$, $\lambda_3$).

The filter values should be selected as a trade-off between rapid tracking of the process and a wide separation between the pdf(R) of the steady-state condition and that of the non-steady-state condition. Small filter values (which mean heavy filtering of the process data) separate the pdf(R) of the steady-state condition widely from that of the non-steady-state condition. Therefore, both the type I risk (probability of triggering a ‘not at steady-state’ response when the process is at steady-state) and the type II risk (probability of not triggering a ‘not at steady-state’ response when the process is not at steady-state) can be minimised. However, small filter values make the R-statistic lag far behind the present process status (which means that a longer time will be needed to detect any change in the process). The contrary occurs when high filter values are used. It should be stressed that the methodology is based on the fact that the pdf(R) is different when the process is at steady-state than when it is not. Therefore, data from the process including steady-state as well as non-steady-state situations are necessary to select the optimal filter parameters. This selection can be done by trying different filter factors and R-crit values, and choosing the combination that yields low type I and type II risks together with fast tracking of the process. Cao and Rhinehart [10], assuming white noise, suggested that the filter factors $\lambda_1 = 0.2$ and $\lambda_2 = \lambda_3 = 0.1$ lead to the best balance between type I and type II risks. These suggested values were used in our study.

The null hypothesis is that the process is at steady-state. Therefore, when the R-statistic is larger than R-crit we are $100(1 - \alpha)\%$ confident that the process is not at steady-state. On the other hand, when the calculated R-statistic is lower than R-crit, there is not enough statistical evidence to reject the null hypothesis and, consequently, we consider that the process may be at steady-state. The authors used a variable (SS) to represent the state of the process and set SS = 0 when it is not at steady-state and SS = 1 when it is. As commented before, the R-statistic is dimensionless and independent of both the measurement level and the process variance. However, the approach fails in those situations in which the single variable under study is autocorrelated when the process is at steady-state.

In the extension to multivariable analysis by Brown and Rhinehart [2], the process is considered to have reached the steady-state when all the variables are at steady-state. For N variables, this is calculated as follows:

$$SS_{\mathrm{process}} = \prod_{i=1}^{N} SS_i \tag{5}$$

If the process is at steady-state and the N variables are independent, the type I risk for the process ($\alpha_{\mathrm{process}}$) can be calculated as follows:

$$\alpha_{\mathrm{process}} = 1 - \prod_{i=1}^{N} (1 - \alpha_i) \tag{6}$$

where $\alpha_i$ is the type I risk for the test on each variable, which can be calculated according to the Bonferroni correction as follows:

$$\alpha_i = 1 - \sqrt[N]{1 - \alpha_{\mathrm{process}}} \approx \frac{\alpha_{\mathrm{process}}}{N} \tag{7}$$
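A short numerical illustration of Equations (5)–(7) may help to see how quickly the per-variable risk shrinks as N grows. The function names below are hypothetical and the values of N are chosen only for illustration; the code assumes, as the equations do, that all individual tests use the same risk and are independent.

```python
def per_test_alpha(alpha_process, n_tests):
    """Eq. (7): exact per-test risk and its alpha/N approximation."""
    exact = 1.0 - (1.0 - alpha_process) ** (1.0 / n_tests)
    approx = alpha_process / n_tests
    return exact, approx

def process_alpha(alphas):
    """Eq. (6): overall type I risk for independent individual tests."""
    prod = 1.0
    for a in alphas:
        prod *= 1.0 - a
    return 1.0 - prod

def process_state(flags):
    """Eq. (5): the process is at steady-state only if every SS_i equals 1."""
    state = 1
    for f in flags:
        state *= int(f)
    return state

for n in (2, 5, 50, 1700):  # illustrative numbers of tested variables
    exact, approx = per_test_alpha(0.05, n)
    print(n, exact, approx, process_alpha([exact] * n))
```

For N = 1700 (the number of columns of the unfolded SBR matrix) the per-test risk would be of the order of 3 × 10⁻⁵, which illustrates why testing the original variables directly is impractical.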

Therefore, the described approach for multivariable processes requires the absence of autocorrelation and independence among the process variables at steady-state. Moreover, from Equations (6) and (7) it can easily be deduced that the method is limited to a few variables: for a given $\alpha_i$, the type I risk for the process ($\alpha_{\mathrm{process}}$) increases notably with the number of variables (N); and if $\alpha_{\mathrm{process}}$ is set to a reasonable value (e.g. 0.05 or 0.01) when there are many variables, extremely low values of $\alpha_i$ are obtained, resulting in a considerable increase in the type II risk of each individual hypothesis test. These limitations are quite severe for modern industrial processes, where huge amounts of highly collinear data are routinely collected. To overcome the autocorrelation problem, Cao and Rhinehart [1] suggested sampling less frequently (i.e. lengthening the sampling interval) so that the autocorrelation becomes negligible. This implies that a longer time will be needed to gather enough data to determine whether the process has reached the steady-state, and although autocorrelation can be eliminated in this way, the other shortcomings still remain: the instantaneous cross-correlation (correlation between the variables at the same time instant) is not removed, and the method can only be applied to a few variables in order to keep the type I risks reasonable.

To solve these last two drawbacks, in our study the latent variables from a PCA are used to summarise the information contained in the original collected batch variable trajectories for every batch. This provides an important advantage, as the latent variables are uncorrelated by construction and therefore the absence of instantaneous cross-correlation is assured. These latent variables summarise all the within-batch correlation structure (i.e. the within-batch autocorrelations and cross-correlations). However, the values of the latent variables for a series of consecutive batches could exhibit some type of correlation (autocorrelation or lagged cross-correlation), which would be due to the presence of batch-to-batch correlation. The proposed method for SSID is aimed at detecting this batch-to-batch correlation: if there is any batch-to-batch correlation the process is not at steady-state (i.e. the process is in a transition period), while if there is no batch-to-batch correlation the process is considered to be at steady-state.

Moreover, using the latent variables from a PCA provides another benefit: the dimension reduction achieved by this technique makes it possible to use the SSID methodology when numerous process variables are collected. In this way, most of the information contained in the data is used for SSID (making unnecessary any variable selection procedure aimed at identifying representative variables for SSID) while reasonable type I risks for each individual variable, as well as for the overall process, remain possible. Nevertheless, it should be kept in mind that ‘variable selection’ based on technical knowledge of the process and PCA are not mutually exclusive approaches (i.e. in some cases both techniques could be combined for SSID).

Furthermore, it should be stressed that when the process reaches a steady-state period, the intra-batch variable trajectories of the batches from this period will follow the same average pattern with some random noise deviations from it. Since the PCA uses the deviation of the variable trajectories from the average trajectory (i.e. a reference trajectory), in the batches from a steady-state period the variability of the PCA scores will be due to that random noise and, consequently, it is reasonable to assume that there is no batch-to-batch autocorrelation when the process is at steady-state.

In brief, the three-way batch process data are unfolded batch-wise and then autoscaled as a data preprocessing stage. Afterwards, the scaled deviations of the trajectories of the collected process variables from the reference trajectories along the batches are summarised by PCA. For each of the uncorrelated latent variables, as well as for the distance to the model (DmodX) from the PCA, the corresponding R-statistics are worked out using Equations (1)–(4). These R-statistics are then compared to the corresponding threshold (R-crit), obtained from Cao and Rhinehart [10] for a given type I risk ($\alpha$) (after Bonferroni correction) and filter factors ($\lambda_1$, $\lambda_2$, $\lambda_3$). In this way, each batch can be classified as belonging to the start-up period or to a steady-state period.
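The procedure summarised above can be sketched end to end as follows. This is an illustrative outline under several assumptions, not the authors' implementation: the PCA is computed by a plain SVD, the DmodX-like quantity is a simplified residual standard deviation (commercial packages normalise it differently), the filter initialisation is arbitrary, the data are synthetic, and `R_CRIT = 2.0` is a placeholder rather than a value taken from Cao and Rhinehart [10].

```python
import numpy as np

def r_statistic(x, lam1=0.2, lam2=0.1, lam3=0.1):
    """R-statistic of Equations (1)-(4); initial filter values are assumed."""
    x = np.asarray(x, dtype=float)
    r = np.full(x.size, np.nan)
    xf, nu2, delta2 = x[0], 0.0, 0.0
    for i in range(1, x.size):
        nu2 = lam2 * (x[i] - xf) ** 2 + (1 - lam2) * nu2
        xf = lam1 * x[i] + (1 - lam1) * xf
        delta2 = lam3 * (x[i] - x[i - 1]) ** 2 + (1 - lam3) * delta2
        if delta2 > 0:
            r[i] = (2 - lam1) * nu2 / delta2
    return r

def pca_scores_and_residual(Z, n_comp):
    """Scores of the first n_comp components and a DmodX-like residual."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    T = U[:, :n_comp] * s[:n_comp]              # scores, one row per batch
    E = Z - T @ Vt[:n_comp]                     # residual matrix
    dmod = np.sqrt((E ** 2).sum(axis=1) / (Z.shape[1] - n_comp))
    return T, dmod

def classify_batches(Z, n_comp, r_crit):
    """SS = 1 for a batch only if every R-statistic is at or below r_crit."""
    T, dmod = pca_scores_and_residual(Z, n_comp)
    series = np.column_stack([T, dmod])
    R = np.column_stack([r_statistic(series[:, j])
                         for j in range(series.shape[1])])
    R_filled = np.where(np.isnan(R), np.inf, R)  # undefined R counts as SS = 0
    ss = np.all(R_filled <= r_crit, axis=1).astype(int)
    return R, ss

# Synthetic unfolded data: 80 drifting start-up batches, then 220 stable ones.
rng = np.random.default_rng(1)
n_batches, n_cols, n_startup = 300, 50, 80
drift = np.concatenate([np.linspace(-20.0, 0.0, n_startup),
                        np.zeros(n_batches - n_startup)])
X = np.outer(drift, np.ones(n_cols)) + rng.normal(size=(n_batches, n_cols))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscaling

R_CRIT = 2.0  # placeholder; real thresholds must come from reference [10]
R, ss = classify_batches(Z, n_comp=1, r_crit=R_CRIT)
print("SS fraction, batches 10-69:  ", ss[10:70].mean())
print("SS fraction, batches 200-299:", ss[200:].mean())
```

Because the filters have memory, the steady-state flag switches on some batches after the drift actually stops, which is the tracking lag discussed earlier for small filter factors.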

3. RESULTS AND DISCUSSION

In this section several results are presented and discussed. First, it is shown how the visualisation capabilities of PCA can be exploited to track the evolution of the trajectories during the start-up of the process, thus providing useful information for process understanding and interpretation. Secondly, the application of the proposed approach for SSID to all the collected trajectories is illustrated, detecting a short period where the process can be considered to be at steady-state (around 4 days). Finally, the performance of the proposed methodology is validated and assessed using a simulated data set where the steady-state periods are perfectly known.

3.1. Start-up batch-to-batch process understanding

After autoscaling the batch-wise unfolded process data matrix, a PCA can be conducted and the evolution of the trajectories of all the collected process variables along the sequence of batches can be tracked in the corresponding score plots. However, a PCA can also be carried out for each individual trajectory; monitoring the resulting scores allows process engineers and operators to assess and interpret how the shape of each particular trajectory evolves along the sequence of batches, and to relate this pattern evolution to the microorganism activity in the SBR. To illustrate this, the results of applying two PCAs (using all the batches), one on the pH trajectory and another on the conductivity trajectory, are presented next. These trajectories were chosen because they can provide valuable information in SBRs operated for EBPR [11]. Although process understanding can also be gained from a PCA conducted on all the variable trajectories, in this application the interpretation of the process evolution was more evident from the PCA of each individual trajectory.

The first principal component from the PCA on the pH trajectory explains 86% of the variance. Its score line plot shows a clear increasing trend until it settles at a nearly constant value around batch 100 [Figure 3(a)]. To discover how the measured trajectory contributes to the formation of this first principal component, the corresponding loadings are shown in Figure 3(b). This plot clearly shows that the pH trajectory evolves in opposite directions in the anaerobic stage and in the aerobic and settle stages. To understand the score evolution along the batches [Figure 3(a)], it is necessary to combine information from the loadings structure [Figure 3(b)] with that from the pH trajectories recorded in different batches [Figure 3(c)]. At the beginning of the process (i.e. during the initial batches), the pH trajectory in the anaerobic stage is higher than the average pH trajectory (worked out from all the batches analysed); since the loadings in this stage are negative, the contribution of this stage to the score is negative. In the aerobic and settle stages, the pH trajectory is lower than the average pH trajectory, and since the loadings in both stages are positive, the contributions of these stages to the score are also negative, leading to highly negative scores for these initial batches [Figure 3(a)]. However, as the process evolves (i.e. as the microorganism population proliferates), the pH profile in the anaerobic stage decreases progressively (and at a given batch the recorded trajectory in the anaerobic phase starts to be lower than the average pH trajectory) until it stabilises. In the aerobic and settle stages the pH level increases progressively (and at a given batch the recorded trajectory in these stages starts to be higher than the average pH trajectory) until stabilisation. This explains the evolution of the scores from negative to positive values until stabilisation shown in Figure 3(a).

Figure 3. (a) Score line plot of the first principal component from a PCA on the pH trajectory; (b) loadings of the pH trajectory in this first principal component; (c) pH trajectories in five different batches (10, 30, 45, 55 and 65).

Process insight can be useful to understand the changes described above. The process was operated to obtain a PAO-enriched sludge, and as the batches passed, this bacterial group developed and became predominant and stable. The biological activity of the PAOs is responsible for the changes in the pH trajectory: phosphorus release (under anaerobic conditions) causes the pH to decrease, while phosphorus uptake (under aerobic conditions) causes the pH to increase.

When a PCA was conducted on the conductivity trajectory, the first principal component explained almost 65% of the variance. Its score line plot shows a decreasing trend until batch 73, when it starts to increase; afterwards it settles, and around batch 150 it decreases again [Figure 4(a)]. The loadings of the conductivity trajectory in this first principal component are shown in Figure 4(b). This plot reveals a change in the trajectory profile between the anaerobic stage and the beginning of the aerobic stage with respect to the remainder of the aerobic stage and the settle stage. To make the interpretation clearer, the conductivity trajectories in different batches are shown in Figure 4(c), where it can be seen that, as the batches progress, the trajectory increases more quickly and to a higher value in the anaerobic stage, while in the aerobic stage it decreases more quickly and reaches a lower value. Note that the trajectories plotted in Figure 4(c) correspond to the first decreasing tendency in the first score (i.e. until batch 73). This was done for the sake of clarity: the conductivity trajectories of the batches from 100 to 150 would lie in an intermediate position in Figure 4(c).

Figure 4. (a) Score line plot of the first principal component from a PCA on the conductivity trajectory; (b) loadings of the conductivity trajectory in this first principal component; (c) conductivity trajectories in five different batches (10, 30, 45, 55 and 65).

The shifts in the conductivity trajectories are again explained by the biological activity of the PAOs: in the anaerobic phase, the conductivity increases during the release of phosphorus (and the associated metal cations, potassium and magnesium), while in the aerobic phase, the conductivity decreases during the uptake of phosphorus (and the associated metal cations). The increase and decrease of the trajectory in each batch can be related to the amount of phosphorus released and taken up by the PAOs, respectively [11–13]. As previously mentioned, the PAO concentration in the SBR increased progressively from the first batch until a stable population was achieved; therefore, the higher the concentration of PAOs in the system, the larger the variations in the conductivity trajectory of each batch (i.e. higher conductivity values are reached in the anaerobic stage and lower values in the aerobic stage).

The batches whose projections do not match the initial descending trend (batches in the approximate range 75–150) reflect that something in the process negatively affected the biological activity of the PAOs. From technical knowledge of the process, it was concluded that there was insufficient nitrogen in the synthetic influent wastewater, and that this inhibited cellular growth.


As has been illustrated, the biological activity that takes place in the SBR affects the shape of the trajectories of the collected process variables. Therefore, tracking the evolution of these trajectory shapes could be useful to get information on how the process is evolving (i.e. how the microbial community is developing), allowing early detection of possible undesirable situations and also detection of when the process reaches the steady-state (i.e. when the shapes of the registered variable trajectories become stable). The information on the evolution of the trajectory shapes can be efficiently summarised using batch-wise unfolding followed by PCA modelling, as illustrated in Figures 3(a) and 4(a). Although in this case one principal component was enough for process understanding, in other cases more than one component may be necessary.
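The batch-wise unfolding and PCA step summarised above can be sketched in a few lines of NumPy. This is only a minimal illustration on synthetic data, not the software used in the paper: the function names, the array layout (batches, variables, time points) and the toy "start-up" signal are assumptions made for the example.

```python
import numpy as np

def unfold_batchwise(X3):
    """Batch-wise unfolding: (batches, variables, times) -> (batches, variables*times)."""
    n_batches = X3.shape[0]
    return X3.reshape(n_batches, -1)

def autoscale(X):
    """Centre each column on its batch-average value and scale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # guard against constant columns
    return (X - mean) / std

def pca(Xs, n_components):
    """PCA by SVD; returns scores, loadings and explained-variance fractions."""
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return scores, loadings, explained

# Synthetic start-up (assumed data): one variable, 50 time points, 60 batches.
# The trajectory shape grows over the first 30 batches and then stays stable.
rng = np.random.default_rng(0)
n_batches, n_vars, n_times = 60, 1, 50
t = np.linspace(0.0, 1.0, n_times)
progress = np.minimum(np.arange(n_batches) / 30.0, 1.0)
X3 = (progress[:, None, None] * np.sin(np.pi * t)[None, None, :]
      + 0.05 * rng.normal(size=(n_batches, n_vars, n_times)))

X = unfold_batchwise(X3)
scores, loadings, explained = pca(autoscale(X), n_components=2)
```

Plotting `scores[:, 0]` against the batch number gives a score line plot of the same kind as Figures 3(a) and 4(a): a trend during start-up followed by a plateau. The sign of a principal component is arbitrary, so the trend may appear increasing or decreasing.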

3.2. SSID using real data from the SBR

To fulfil the main goal of this paper (i.e. SSID), the new uncorrelated variables from the PCA, namely the latent variables and the residual standard deviation (DmodX), are used in the multivariate extension of Brown and Rhinehart [2] for SSID. This combined approach helps the process operator to assess whether the process (i.e. the microbiological activity in the reactor) has reached steady-state conditions and, therefore, to decide when the extensive sampling (to characterise process performance, settling properties of the sludge, ...) can start. Moreover, this approach takes advantage of data from the inexpensive, reliable and low-maintenance sensors installed in the SBR.

As previously mentioned, in this batch process five variables are collected during the evolution of each batch, so a PCA was conducted on the complete unfolded matrix (188 batches × 1700 variables). Three latent variables were necessary to account for 68% of the variance in the data. Therefore, with four variables to be tested (the three latent variables plus the DmodX), for $\alpha_{\mathrm{process}} = 0.05$ (which means $\alpha_i = 0.0127$ after applying the Bonferroni correction from Equation (7)), and using the suggested values for the filter factors, $\lambda_1 = 0.2$ and $\lambda_2 = \lambda_3 = 0.1$, the interpolated R-critical value from the tables of Cao and Rhinehart [10] is 1.7.

The evolution of the four variables that summarise all the process information (three latent variables plus the DmodX) along the sequence of batches is displayed in Figure 5. This figure indicates that, according to the methodology of Brown and Rhinehart [2], the process is at steady-state from batch number 134 to 149 (i.e. approximately 4 days). It should be kept in mind that the steady-state indicator is calculated according to Equation (5) and the process is considered to be at steady-state (SS = 1) only when all the individual process variables are at steady-state.

Figure 5. SSID on the four variables that summarise the multivariate information from the SBR data: (a) three latent variables plus the residuals (DmodX) from the PCA on the complete unfolded matrix; (b) R-statistic for each variable. This figure is available in colour online at www.interscience.wiley.com/journal/cem

During the identified period (from batch number 134 to 149) it was verified that none of the variables (the three latent variables and the DmodX) was autocorrelated by looking at the simple and partial autocorrelation plots (not shown). This can be explained by the fact that the variables from the PCA summarise the evolution of the deviation of the variable trajectories from the reference trajectories (i.e. the average trajectories); thus, it is expected that during steady-state periods there is no batch-to-batch autocorrelation.

When the process is not at steady-state (e.g. due to new operating conditions, microbial population changes, ...), contribution plots can be used to help in the diagnosis and to search for the responsible process variables. To illustrate this, batch number 140 (from the steady-state period) and batch number 160 (from a transition period) are selected as an example, and the contribution plot for the score that showed the highest R-statistic (second score, PC2), obtained as the difference between both batches, is shown in Figure 6(a). From this plot, it can be concluded that the main responsible variable is the conductivity from the second half of the aerobic stage until the end of the batch. The pH trajectory also presents some differences, but they are not as obvious as those of the conductivity. These results reflect that in the batches from the transition period (after the steady-state), the conductivity decrease in the aerobic stage is larger than in batches from the steady-state [see Figure 6(b)]. This was because nitrogen in the influent was no longer a limiting factor, since the ammonium concentration in the synthetic influent wastewater was increased when the aforementioned problem of insufficient nitrogen in the influent was detected. Similar conclusions can be drawn from the analysis of the contributions to the other scores and residuals (plots not shown).

Figure 6. (a) Contribution plot in the second component from the PCA on the complete unfolded matrix, reflecting the differences between batches 140 (from the steady-state period) and 160 (from a transition period); (b) conductivity trajectory in batches 140 and 160.
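As a rough illustration of the computation behind a contribution plot such as Figure 6(a), the sketch below splits the difference in one score between two batches into one term per unfolded column and then sums those terms per process variable. It assumes the common definition in which the contribution of a column is its loading multiplied by the difference between the two scaled batches; the text does not state the exact formula or software used, and the function names and numbers are hypothetical.

```python
import numpy as np

def score_contributions(x_new, x_ref, loading):
    """Per-column contributions to the score difference t_new - t_ref.

    x_new, x_ref: autoscaled, unfolded batches (1-D arrays of equal length)
    loading: loading vector of the chosen component (same length)
    """
    return loading * (x_new - x_ref)

def contributions_per_variable(contrib, n_variables):
    """Sum contributions over time for each process variable,
    assuming the unfolded columns are ordered variable by variable."""
    return contrib.reshape(n_variables, -1).sum(axis=1)

# Toy example with made-up numbers: 2 variables x 4 time points.
loading = np.array([0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5])
x_ref = np.zeros(8)                                   # reference batch
x_new = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.0, -2.0])  # batch to diagnose

contrib = score_contributions(x_new, x_ref, loading)
per_var = contributions_per_variable(contrib, n_variables=2)
```

By construction, the contributions add up to the difference between the two scores, so the largest bars point to the variables and time periods that move the batch away from the reference. In this toy case the second variable, in its last two time points, carries the whole difference.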

3.3. SSID using simulated data

A simulation study was conducted in order to achieve three main goals. First, to perform an objective assessment of the proposed approach for SSID in batch processes. This is only possible with simulated data because the SS periods are perfectly known (i.e. it is known when they start and when they finish). The second goal is to illustrate how the methodology would perform in an on-line application. In this case the reference PCA model is developed using the available data; when new data from the process become available, they are projected onto the PCA model and classified as being from either a transition or a steady-state period according to the proposed methodology. The last aim of the study was to test the robustness of the methodology against the


Figure 7. SSID results from the simulated data: (a) first component and residuals (DmodX) from a PCA on the simulated trajectory (note that the DmodX during the third SS period is far outside the range of the plot: indicated as ""); (b) R-statistic for each variable. This figure is available in colour online at www.interscience.wiley.com/journal/cem


number of batches used to build the reference PCA model and whether or not these batches were similar, that is, to assess whether the results obtained when only steady-state batches were used to build the reference PCA model were notably different from those obtained when a mixture of non-steady-state and stationary batches was used to develop the PCA model.

Based on the previously explained evolution of the conductivity trajectory in this batch process [Figure 4(c)], we simulated new data from this trajectory. These new data were obtained by modifying the trajectory shape (during the transition periods) and adding random noise (in the transition as well as in the steady-state periods). Simulated data from this trajectory in 800 batches, including three steady-state periods, were generated [Figure 7(a)]. The first steady-state period spans from batch 1 to 100, the second from batch 200 to 550 and the last from batch 575 to 800. From this, it is evident that the first operating condition (OP_1) finishes at batch 100, the second operating condition (OP_2) is imposed at batch 101 and kept until batch 550, and the third operating condition (OP_3) is imposed at batch 551 [see Figure 7(a)].

Initially, only one PCA reference model was built using data from 80 batches from the first steady-state period, and the fitted model (89.3% of explained variance) was not updated afterwards. The 800 batches were projected onto this PCA model and classified as being from either a transition or a steady-state period according to the proposed methodology. The results are presented in Figure 7. As can be seen in this figure, the three steady-state periods and the transition periods are clearly identified by the SS-indicator. It should be highlighted that when the process status changes from a SS to a transition period, the R-statistic signals the situation almost immediately (which is a very desirable behaviour), but when the transition finishes and the process reaches another SS, it takes the R-statistic some time to indicate the new situation due to its filtering nature (recall that the process mean and variance are estimated using EWM filters). Another interesting aspect of Figure 7 is that, since the PCA model (which was fitted using data from the first SS) was not updated, the residuals (DmodX values) were only small during this first SS. However, the later steady-states (second and third SS) are also identified despite their large DmodX values, thus demonstrating that what matters for SSID is not how large the DmodX values are but how stable they are.

In order to fulfil the third goal of the simulation study, two new PCA reference models were built using data from the first 200 batches (which include the first SS and the first transition) and from all the available batches (800 batches, which include the three SS and the two transitions), respectively. The 800 batches were projected onto each PCA reference model and classified as being from either a transition or a steady-state period according to the proposed SSID methodology. The classification results of the SSID methodology using these new PCA reference models were compared with those obtained with the first PCA reference model (which was built using 80 batches from the first SS). Since the SS periods were perfectly known (i.e. when they started and finished), both the type I error (triggering a ‘not at steady-state’ response when the process is at steady-state) and the type II error (not triggering a ‘not at steady-state’ response when the process is not at steady-state) were calculated for each PCA reference model. Additionally, the percentage of total misclassifications was obtained. These results are summarised in Table I.

Table I. Summary of classification results from the application of the methodology for SS identification to the simulated data: percentages of misclassifications for each PCA reference model

Error description                                              Reference PCA model
                                                               80 batches   200 batches   800 batches
Type II error: non-steady-state classified as steady-state        1.63          2.43          1.63
Type I error: steady-state classified as non-steady-state        16.98         17.87         12.99
Total misclassifications                                         14.62         15.50         12.25

Recall at this point that in real applications there is an evident interest in running a sequence of experiments (with different operating conditions) and collecting data when the process is at steady-state; therefore, what is important to this end is to obtain a low type II risk. From Table I, it can be concluded that in this sense the results are quite satisfactory and, moreover, they are highly consistent regardless of the reference PCA model used. Therefore, the proposed approach can be efficiently used to detect steady-state as well as transition periods in batch processes.
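Error percentages of the kind reported in Table I can be computed from the known steady-state labels and the SS-indicator along the following lines. This is a sketch, not the authors' code: the helper name and the toy indicator are hypothetical, and the choice of denominators (type I over the truly steady batches, type II over the truly non-steady batches, total over all batches) is an assumption, since the text does not spell it out.

```python
import numpy as np

def ssid_error_rates(true_ss, pred_ss):
    """Return (type I %, type II %, total %) for a steady-state classifier.

    type I : steady-state batches flagged as not at steady-state
    type II: non-steady-state batches flagged as steady-state
    """
    true_ss = np.asarray(true_ss, dtype=bool)
    pred_ss = np.asarray(pred_ss, dtype=bool)
    type_i = 100.0 * np.mean(~pred_ss[true_ss])
    type_ii = 100.0 * np.mean(pred_ss[~true_ss])
    total = 100.0 * np.mean(pred_ss != true_ss)
    return type_i, type_ii, total

# Known SS periods of the simulation study: batches 1-100, 200-550 and 575-800.
batch = np.arange(1, 801)
true_ss = (batch <= 100) | ((batch >= 200) & (batch <= 550)) | (batch >= 575)

# Hypothetical indicator that recognises each new steady-state 20 batches late,
# mimicking the delay introduced by the EWM filters.
pred_ss = true_ss.copy()
pred_ss[(batch >= 200) & (batch < 220)] = False
pred_ss[(batch >= 575) & (batch < 595)] = False

type_i, type_ii, total = ssid_error_rates(true_ss, pred_ss)
```

With this toy indicator every error is of type I, which matches the qualitative pattern discussed above: a delayed recognition of a new steady-state inflates the type I percentage while leaving the type II risk low.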


4. CONCLUSIONS

This paper has presented an automated procedure for SSID in batch processes. The proposed approach is based on batch-wise unfolding of the three-directional batch process data followed by a PCA (Unfold-PCA) in combination with the methodology of Brown and Rhinehart [2] for SSID. The developed methodology makes it possible to classify each batch as being from a transition period or from a steady-state period.

The key idea of the proposed procedure is that, in batches from a steady-state period, the intra-batch variable trajectories follow the same average pattern with random noise deviations from it. Thus, during a steady-state period the batch-to-batch autocorrelation can be negligible even though within each batch the variables can be both autocorrelated and cross-correlated.

In the proposed approach, the three-directional process data are batch-wise unfolded, autoscaled and summarised by PCA. The independent latent variables and the residuals (DmodX) from the PCA are then used in the multivariate extension of Brown and Rhinehart [2] for SSID. In this way, no variable selection procedure aimed at identifying representative variables for SSID is necessary (all the collected variables are used), independence among the new variables is assured and reasonable type I risks for each individual variable as well as for the process are possible. The residuals must be considered in this analysis because they can detect non-steady-state situations (due to covariance structure drift) and they contain information on the validity of the PCA model. All these considerations make it possible to overcome some of the main limitations of the existing approaches for SSID, and extend the traditional application (only valid for continuous processes) to batch processes.

Using real data from a laboratory-scale SBR operated for EBPR, it has been illustrated that the combined approach can be efficiently used for SSID in batch processes, taking advantage of the available data collected by means of inexpensive electronic sensors. Moreover, the statistic used (R-statistic) is independent of the measurement level and robust against process variance changes because it is a ratio of estimated variances.

As a second research aim, it has also been illustrated how the latent variables obtained from the PCA can be used for analysing and interpreting the evolution of the shape of the trajectories of the collected process variables during the start-up of the process. This analysis can provide useful information for process understanding. Moreover, since to some extent the shapes of the trajectories of the collected variables reflect the biological activity in the system, tracking their evolution could be extremely useful to get information on how the process is evolving (allowing early detection of possible undesirable situations) and also to detect when the process reaches the steady-state (i.e. when the shapes of the trajectories become stable).

Further research is being devoted to the analysis of the sensitivity of the proposed approach to the presence of outliers (or anomalous values) in order to make the methodology more robust against this type of inconvenience.

Acknowledgements

Financial support from MCYT (project CTM2005-06919-C03/TECN) is gratefully acknowledged. The authors also acknowledge the anonymous reviewers for their valuable comments and suggestions.

REFERENCES

1. Cao S, Rhinehart RR. An efficient method for on-line identification of steady-state. J. Process Control 1995; 5(6): 363–374.

2. Brown PR, Rhinehart RR. Demonstration of a method for automated steady-state identification in multivariable systems. Hydrocarbon Process. 2000; 79(9): 79–83.

3. Jiang T, Chen B, He X, Stuart P. Application of steady-state detection method based on wavelet transform. Comput. Chem. Eng. 2003; 27: 569–578.

4. Ruiz G, Castellano M, Gonzalez W, Roca E, Lema JM. Algorithm for steady state detection of multivariate process: application to wastewater anaerobic digestion process. Proceedings of 2nd International IWA Conference on Automation in Water Quality Monitoring, AutMoNet, 19–20 April 2004, Vienna, Austria.

5. Wilderer PA, Irvine RL, Goronszy MC. Sequencing Batch Reactor Technology. IWA Publishing: London, 2001.

6. Westerhuis JA, Kourti T, MacGregor JF. Comparing alternative approaches for multivariate statistical analysis of batch process data. J. Chemometrics 1999; 13: 397–413.

7. Kourti T. Multivariate dynamic data modelling for analysis and statistical process control of batch processes, start-ups and grade transitions. J. Chemometrics 2003; 17(1): 93–109.

8. Zarzo M, Ferrer A. Batch process diagnosis: PLS with variable selection versus block-wise PCR. Chemom. Intell. Lab. Syst. 2004; 73(1): 15–27.

9. Eriksson L, Johansson E, Kettaneh-Wold N, Wold S. Multi- and Megavariate Data Analysis: Principles and Applications. Umetrics Academy, 2001, Umea, Sweden.

10. Cao S, Rhinehart RR. Critical values for a steady-state identifier. J. Process Control 1997; 7(2): 149–152.

11. Serralta J, Borras L, Blanco C, Barat R, Seco A. Monitoring pH and electric conductivity in an EBPR sequencing batch reactor. Water Sci. Technol. 2004; 50(10): 145–152.

12. Maurer M, Gujer W. Monitoring of microbial phosphorus release in batch experiments using electric conductivity. Water Res. 1995; 29: 2613–2617.

13. Aguado D, Montoya T, Ferrer J, Seco A. Relating ions concentration variations to conductivity variations in a sequencing batch reactor operated for enhanced biological phosphorus removal. Environ. Model. Softw. 2006; 21: 845–851.

14. Lucas JM, Saccucci MS. Exponentially weighted moving average control schemes: properties and enhancements. Technometrics 1990; 32(1): 1–29.

15. MacGregor JF, Harris TJ. The exponentially weighted moving variance. J. Qual. Technol. 1993; 25(2): 106–118.



5. APPENDIX I: DERIVATION OF THE R-STATISTIC

Consider a process variable where the sequentially recorded observations $x_i$ at each time $i$ are independent random values with constant mean $\eta$ and variance $\sigma^2$ (i.e. the process is at steady-state condition). The R-statistic is a ratio of two estimates of the process variance, obtained from the same data set in two different ways. The first method uses an EWM variance, $\nu_f^2$, based on the difference between the data and the sample mean. To lighten the computation, the sample mean is estimated using the EWM average (i.e. a conventional first-order filter of the process variable $x$), $x_f$. The EWM average and EWM variance at time $i$ are defined as:

$$x_{f,i} = \lambda_1 x_i + (1-\lambda_1)\,x_{f,i-1} = \lambda_1 \sum_{j=0}^{i-1} (1-\lambda_1)^j x_{i-j} + (1-\lambda_1)^i x_{f,0} \tag{I.1}$$

$$\nu_{f,i}^2 = \lambda_2 (x_i - x_{f,i-1})^2 + (1-\lambda_2)\,\nu_{f,i-1}^2 = \lambda_2 \sum_{j=0}^{i-1} (1-\lambda_2)^j (x_{i-j} - x_{f,i-j-1})^2 + (1-\lambda_2)^i \nu_{f,0}^2 \tag{I.2}$$

where $\lambda_1$ and $\lambda_2$ are the filter factors ($0 < \lambda \le 1$), and $x_{f,0}$ and $\nu_{f,0}^2$ are the initial estimates of the process mean and variance, respectively (starting values). Note that in Equation (I.2) $x_i$ and $x_{f,i-1}$ are also independent because the EWM average $x_{f,i-1}$ is obtained before the process observation at time $i$, $x_i$.

As shown in Equations (I.1) and (I.2), EWM filters are geometric moving averages, and from the formula for the sum of a geometric series (and assuming independent observations) the statistical properties of the EWM filters can be easily obtained. In particular, it is straightforward to show that the mean and variance of the EWM filters, $x_f$ and $\nu_f^2$, converge to the following asymptotic values [14,15]:

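$$E(x_f) = E(x) = \eta \tag{I.3}$$

$$\sigma^2(x_f) = \frac{\lambda_1}{2-\lambda_1}\,\sigma^2 \tag{I.4}$$

$$E(\nu_f^2) = E\left[(x_i - x_{f,i-1})^2\right] \tag{I.5}$$

$$\sigma^2(\nu_f^2) = \frac{\lambda_2}{2-\lambda_2}\,\sigma^2\left[(x - x_f)^2\right] \tag{I.6}$$

Given that in Equation (I.5) $x_{f,i-1}$ and $x_i$ are independent, it follows that:

$$E(\nu_f^2) = E\left[(x_i - x_{f,i-1})^2\right] = E\left[\left((x_i - \eta) - (x_{f,i-1} - \eta)\right)^2\right] = \sigma^2(x_i) + \sigma^2(x_f) \tag{I.7}$$

By substituting Equation (I.4) into Equation (I.7) and solving for $\sigma^2$:

$$\sigma^2 = \frac{2-\lambda_1}{2}\,E(\nu_f^2) \tag{I.8}$$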
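For readers who want the intermediate algebra between Equations (I.7) and (I.8), the step can be written out as follows. It uses only $\sigma^2(x_i) = \sigma^2$ and Equation (I.4), so nothing beyond the quantities already defined is assumed:

$$\begin{aligned}
E(\nu_f^2) &= \sigma^2 + \frac{\lambda_1}{2-\lambda_1}\,\sigma^2 = \frac{(2-\lambda_1)+\lambda_1}{2-\lambda_1}\,\sigma^2 = \frac{2}{2-\lambda_1}\,\sigma^2 \\
\Rightarrow\quad \sigma^2 &= \frac{2-\lambda_1}{2}\,E(\nu_f^2)
\end{aligned}$$

For instance, with $\lambda_1 = 0.2$ the correction factor $(2-\lambda_1)/2$ equals 0.9.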

Therefore, an unbiased estimate of the process variance at time $i$ is:

$$s_{1,i}^2 = \frac{2-\lambda_1}{2}\,\nu_{f,i}^2 \tag{I.9}$$

The second method to estimate the variance uses the EWM of the squared differences of successive observations, $\delta_f^2$. Its value at time $i$ is expressed as:

$$\delta_{f,i}^2 = \lambda_3 (x_i - x_{i-1})^2 + (1-\lambda_3)\,\delta_{f,i-1}^2 \tag{I.10}$$

where $\lambda_3$ is a filter factor ($0 < \lambda_3 \le 1$).

Assuming that $x_i$ and $x_{i-1}$ are independent, it is straightforward that:

$$E(\delta_f^2) = E\left[(x_i - x_{i-1})^2\right] = E\left[\left((x_i - \eta) - (x_{i-1} - \eta)\right)^2\right] = 2\sigma^2 \tag{I.11}$$

Therefore, an unbiased estimate of the process variance at time $i$ can also be obtained as follows:

$$s_{2,i}^2 = \frac{1}{2}\,\delta_{f,i}^2 \tag{I.12}$$

By taking the ratio of the two estimates of the process variance given in Equations (I.9) and (I.12), the R-statistic at time $i$ is:

$$R_i = \frac{s_{1,i}^2}{s_{2,i}^2} = \frac{(2-\lambda_1)\,\nu_{f,i}^2}{\delta_{f,i}^2} \tag{I.13}$$
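The recursions (I.1), (I.2) and (I.10) and the ratio (I.13) translate directly into a short loop. The sketch below is one possible implementation under stated assumptions: the function name, the initialisation of the filtered mean with the first observation, the zero starting values for the two filtered variances and the synthetic test signal are all choices made for this example. The filter factors and the critical value 1.7 are those quoted in Section 3.2, and treating values of R above the critical value as 'not at steady-state' follows the usual reading of the method.

```python
import numpy as np

def r_statistic(x, lam1=0.2, lam2=0.1, lam3=0.1, nu2_0=0.0, d2_0=0.0):
    """R-statistic of Equation (I.13) for a 1-D series.

    Returns an array of the same length as x; entry 0 is NaN because the
    statistic needs a previous observation. nu2_0 and d2_0 are assumed
    starting values for the two filtered variances.
    """
    x = np.asarray(x, dtype=float)
    xf, nu2, d2 = x[0], nu2_0, d2_0
    R = np.full(x.size, np.nan)
    for i in range(1, x.size):
        # (I.2) uses the previous filtered mean x_{f,i-1}, so update nu2 first
        nu2 = lam2 * (x[i] - xf) ** 2 + (1 - lam2) * nu2
        # (I.1) EWM average
        xf = lam1 * x[i] + (1 - lam1) * xf
        # (I.10) EWM of squared successive differences
        d2 = lam3 * (x[i] - x[i - 1]) ** 2 + (1 - lam3) * d2
        if d2 > 0:
            R[i] = (2 - lam1) * nu2 / d2  # (I.13)
    return R

# Synthetic series (assumed data): 400 steady samples followed by a ramp.
rng = np.random.default_rng(1)
steady = rng.normal(size=400)
ramp = 0.5 * np.arange(1, 201) + rng.normal(size=200)
x = np.concatenate([steady, ramp])

R = r_statistic(x)
R_CRIT = 1.7                 # critical value quoted in Section 3.2
at_steady_state = R <= R_CRIT  # NaN at index 0 compares as False
```

During the steady part the statistic fluctuates around one, because both variance estimates target the same $\sigma^2$; once the ramp starts, the first estimate is inflated by the lag of the filtered mean and R rises well above the critical value. In the multivariate setting of the paper, the same function would be applied to each latent variable and to the DmodX, and the process would be declared at steady-state only when all of them are.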

