
JOURNAL OF CHEMOMETRICS
J. Chemometrics 2005; 19: 439-447
Published online 12 December 2005 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cem.946

Framework for regression-based missing data imputation methods in on-line MSPC

Francisco Arteaga 1 and Alberto Ferrer 2*
1 Facultad de Estudios de la Empresa, Universidad Católica de Valencia San Vicente Mártir, Guillem de Castro 175, 46008 Valencia, Spain
2 Departamento de Estadística e I.O. Aplicadas y Calidad, Universidad Politécnica de Valencia, Camino de Vera s/n, Edificio I-3, 46022 Valencia, Spain

Received 19 February 2004; Revised 10 October 2005; Accepted 19 October 2005

Missing data are a critical issue in on-line multivariate statistical process control (MSPC). Among the different scores estimation methods for future multivariate incomplete observations from an existing principal component analysis (PCA) model, the most statistically efficient ones are those that estimate the scores for the new incomplete observation as the prediction from a regression model. We have called them regression-based methods. Several approximations have been proposed in the literature to overcome the singularity or ill-conditioning problems that some of the mentioned methods can suffer due to missing data. This is particularly acute in on-line batch process monitoring. In order to ease the comparison of the statistical performance of these methods and to improve the understanding of their relationships, in this paper we propose a framework that allows these regression-based methods to be written as a single expression, a function of a key matrix. From this framework a statistical performance index (PRESV) is introduced as a way to compare the statistical efficiency of the different framework members and to predict the impact of specific missing data combinations on scores estimation without requiring real data. The results are illustrated by application to several continuous and batch industrial data sets. Copyright © 2005 John Wiley & Sons, Ltd.

KEYWORDS: principal component analysis (PCA); missing data; multivariate statistical process control (MSPC)

1. INTRODUCTION

Missing measurements are a common occurrence in on-line multivariate statistical process control (MSPC) [1,2]. Several methods have been proposed to estimate the latent variable scores for new incomplete multivariate observations from a pre-built, fixed and known principal component analysis (PCA) model [2-8]. Arteaga and Ferrer [3] study the equivalence between these methods, concluding that almost every one can be seen as a different imputation method, that is, a different way to impute values for the missing variables. In all the methods studied a regression model appears in one form or another. However, we have called regression-based methods those whose scores for the new incomplete observation are estimated as the prediction from a regression model: the so-called known data regression (KDR) method (also called the conditional mean replacement method) and the trimmed scores regression (TSR) method.

The regression-based methods turn out to be statistically more efficient than the other methods proposed, the most efficient being the KDR method [2,3]. Nevertheless, this method may suffer ill-conditioning problems with highly correlated data. This is particularly acute in on-line batch process monitoring with unfold-PCA models, where singularity problems appear [9]. Biased regression methods should be used to overcome these problems. Although several authors [2,3,9] comment on this, none of them provide a scientific (statistical) comparison of the performance of all these regression-based methods in a wide range of on-line monitoring scenarios.

In order to ease this comparison study and to improve understanding of their relationships and differences, in this paper a framework to express these regression-based scores estimation methods is proposed. This unified description can be useful in simplifying the design of flexible computer programs for missing data estimation methods and in providing a fast and easy way to evaluate the performance of the different members of the framework.

Section 2 introduces the notation. Section 3 outlines the different regression-based methods proposed in the literature facing the problem of scores estimation from an existing PCA model when new observations are incomplete. An alternative formulation for the regression-based scores estimation methods is also introduced. Section 4 shows how this reformulation can be generalised, yielding a general framework. The performance of the different regression-based estimation methods in the framework is studied in Section 5, based on several continuous and batch industrial data sets. Finally, Section 6 presents the conclusions of the present paper.

*Correspondence to: Alberto Ferrer, Universidad Politécnica de Valencia, Departamento de Estadística e I.O. Aplicadas y Calidad, Camino de Vera s/n, Edificio I-3, 46022 Valencia, Spain. E-mail: [email protected]
Contract/grant sponsor: Spanish Government (MICYT). Contract/grant sponsor: European Union (RDE funds); contract/grant number: DPI2001-2749-C02-01.

2. NOTATION

Following the same notation as Nelson et al. [2] and Arteaga and Ferrer [3], lower case bold variables are column vectors and upper case ones are matrices.

Let us assume that a PCA model [5,10] has been built from a reference data set X, with N observations and K variables, representing normal operating conditions (NOC) from a multivariate process,

X = T P^T    (1)

where T is an N × H matrix of scores and P is a K × H matrix of loadings, with H = rank(X).

Consider that a new observation z has some unmeasured variables and that these can be taken to be the first R elements of the data vector without loss of generality. Thus, the vector can be partitioned as z = [z#; z*], where z# denotes the missing measurements and z* the observed variables. This induces the following partition in X, X = [X# X*], where X# is the submatrix containing the first R columns of X, and X* accommodates the remaining K - R columns.

Correspondingly, the P matrix can be partitioned as P = [P#; P*], where P# is the submatrix made up of the first R rows of P, and matrix P* contains the remaining K - R rows.

Assuming that matrix X is of rank H, and only A out of the H components (A ≤ H) are significant, we are only interested in working out the first A elements of the scores vector for the new individual, s_{1:A}. In this situation, the P matrix can be expressed as

P = [P_{1:A}  P_{A+1:H}] = [P#_{1:A}  P#_{A+1:H}; P*_{1:A}  P*_{A+1:H}]    (2)

where P_{1:A} contains the first A loading vectors and P_{A+1:H} the remaining H - A of the PCA model.

From the previous expressions, the first A elements of the scores vector for the new observation can be written as

s_{1:A} = P_{1:A}^T z = P#_{1:A}^T z# + P*_{1:A}^T z* = s#_{1:A} + s*_{1:A}    (3)

This allows the scores vector to be expressed as the sum of two elements: the first is the contribution from the unknown variables, s#_{1:A} = P#_{1:A}^T z#, and the second is the contribution from the known variables, s*_{1:A} = P*_{1:A}^T z*.

The scores matrix from the NOC PCA model can also be expressed as

T_{1:A} = X P_{1:A} = X# P#_{1:A} + X* P*_{1:A} = T#_{1:A} + T*_{1:A}    (4)

The partition in the new incomplete observation also induces the following partition in the covariance matrix

S = X^T X / (N - 1) = (1/(N - 1)) [X#^T X#  X#^T X*; X*^T X#  X*^T X*] = [S##  S#*; S*#  S**]    (5)
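As an illustration of the notation, the following is a minimal NumPy sketch (our own synthetic data; the dimensions N, K, R and A are arbitrary choices, not from the paper) that builds the partitions of Equations (1)-(5) and checks that the trimmed-score split reproduces the full scores:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, R, A = 100, 6, 2, 2            # observations, variables, missing vars, components

X = rng.standard_normal((N, K))
X -= X.mean(axis=0)                   # PCA assumes mean-centred data

# Loadings P (K x H) and scores T = X P of the NOC model, via SVD
P = np.linalg.svd(X, full_matrices=False)[2].T
T = X @ P

# Partition induced by the first R variables being missing
X_h, X_s = X[:, :R], X[:, R:]         # X#, X*
P_h, P_s = P[:R, :A], P[R:, :A]       # P#_{1:A}, P*_{1:A}
T1A = X_h @ P_h + X_s @ P_s           # T#_{1:A} + T*_{1:A}, Equation (4)

# Partition of the covariance matrix S, Equation (5)
S = X.T @ X / (N - 1)
S_hh, S_hs = S[:R, :R], S[:R, R:]
S_sh, S_ss = S[R:, :R], S[R:, R:]

ok = np.allclose(T1A, T[:, :A])       # the split reproduces the full scores
```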

3. REGRESSION-BASED SCORES ESTIMATION METHODS

Given a new incomplete individual z, and assuming the same variables to be missing in each row of data matrix X, Arteaga and Ferrer [3] propose to fit the regression model

T_{1:A} = X* B_1 + U_1    (6)

and to estimate the scores vector from the known variables as ŝ_{1:A} = B̂_1^T z*, with B̂_1 = (X*^T X*)^{-1} X*^T T_{1:A} the least squares estimator of matrix B_1. This results in the KDR estimator

ŝ_{1:A} = Θ_{1:A} P*_{1:A}^T (S**)^{-1} z*    (7)

where Θ_{1:A} is an A × A diagonal matrix holding the largest A eigenvalues of the covariance matrix S (Equation (5)) in decreasing order along the diagonal.

Nelson et al. [2], assuming that z follows a multivariate normal distribution, propose to replace the unknown variables with their conditional expectation given the known variables and the in-control PCA model, ẑ# = E(z# | z*, S), and to calculate the scores of the reconstructed observation as if no measurements were missing. This is the conditional mean replacement (CMR) method, and it results in the same estimator as the above-mentioned KDR method [2,3].
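The equivalence between the least squares fit of Equation (6) and the closed form of Equation (7) can be checked numerically; the sketch below uses simulated data and variable names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, R, A = 200, 5, 1, 2
X = rng.standard_normal((N, K)) @ rng.standard_normal((K, K))  # correlated variables
X -= X.mean(axis=0)

S = X.T @ X / (N - 1)
eigval, eigvec = np.linalg.eigh(S)
order = np.argsort(eigval)[::-1]
theta = np.diag(eigval[order][:A])     # Theta_{1:A}: top-A eigenvalues of S
P = eigvec[:, order]                   # loadings, sorted by eigenvalue
T1A = X @ P[:, :A]

X_s, P_s, S_ss = X[:, R:], P[R:, :A], S[R:, R:]

z = rng.standard_normal(K)             # a new (simulated) observation
z_s = z[R:]                            # observed part z*

# Least squares fit of Equation (6): T_{1:A} = X* B_1 + U_1
B1 = np.linalg.solve(X_s.T @ X_s, X_s.T @ T1A)
s_hat_ls = B1.T @ z_s

# Closed form of Equation (7)
s_hat_kdr = theta @ P_s.T @ np.linalg.solve(S_ss, z_s)

ok = np.allclose(s_hat_ls, s_hat_kdr)
```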

The KDR method requires the inversion of S** (see Equation (7)), a matrix of order K - R (K - R is the number of known variables). S** can be ill-conditioned in data sets with a large number of highly correlated variables. This is particularly acute in on-line batch process monitoring with unfold-PCA models, where singularity problems appear [9]. Nelson et al. [2] suggest overcoming this problem by replacing the ordinary least squares estimate from Equation (6) with a biased regression method such as ridge regression (RR), principal component regression (PCR) or partial least squares (PLS). In the case of S** being singular, Nelson [8] suggests replacing the inverse (S**)^{-1} with the pseudoinverse (S**)^+.

Arteaga and Ferrer [3] propose an alternative to the KDR method, called the TSR method, replacing the KDR regression model (Equation (6)) with a new regression model

T_{1:A} = T*_{1:A} B_2 + U_2    (8)

in which the scores vector is estimated from the known variables' contribution to the scores vector (the trimmed scores), s*_{1:A} = P*_{1:A}^T z*, that corresponds to the new incomplete observation.

The TSR estimator from Equation (8) yields [3]

ŝ_{1:A} = Θ_{1:A} P*_{1:A}^T P*_{1:A} (P*_{1:A}^T S** P*_{1:A})^{-1} P*_{1:A}^T z*    (9)
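The TSR closed form of Equation (9) can likewise be verified against the least squares fit of Equation (8); this is a sketch on simulated data (names and dimensions are our own), illustrating that only an A × A matrix has to be inverted:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, R, A = 150, 6, 2, 2
X = rng.standard_normal((N, K)) @ rng.standard_normal((K, K))
X -= X.mean(axis=0)

S = X.T @ X / (N - 1)
eigval, eigvec = np.linalg.eigh(S)
order = np.argsort(eigval)[::-1]
theta = np.diag(eigval[order][:A])     # Theta_{1:A}
P = eigvec[:, order]
T1A = X @ P[:, :A]

X_s, P_s, S_ss = X[:, R:], P[R:, :A], S[R:, R:]
T_s = X_s @ P_s                        # trimmed scores T*_{1:A}

z = rng.standard_normal(K)
tau = P_s.T @ z[R:]                    # trimmed score of the new observation

# Least squares fit of Equation (8): T_{1:A} = T*_{1:A} B_2 + U_2
B2 = np.linalg.solve(T_s.T @ T_s, T_s.T @ T1A)
s_hat_ls = B2.T @ tau

# Closed form of Equation (9); only the A x A matrix M is inverted
M = P_s.T @ S_ss @ P_s
s_hat_tsr = theta @ P_s.T @ P_s @ np.linalg.solve(M, tau)

ok = np.allclose(s_hat_ls, s_hat_tsr)
```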

The regression models for the KDR (Equation (6)) and TSR (Equation (8)) methods can be replaced by equivalent models in which, instead of straightforward estimation of the scores vector, the unknown variables are estimated first from the measured variables or from the trimmed scores, respectively, based on the following models

X# = X* B_3 + U_3    (10)

X# = T*_{1:A} B_4 + U_4    (11)


From Equation (3), the estimated missing measurements can then be used in a second step in the score calculation, along with the measured data, as if no measurements were missing. The equivalence proof between these models is shown in the Appendix.

Arteaga and Ferrer [3] illustrate, through several continuous processes, the statistical superiority of the KDR method, based on the squared prediction error criterion. Nevertheless, the TSR method is shown to be practically equivalent to the KDR method, with the advantage of requiring the inversion of a matrix of order A (P*_{1:A}^T S** P*_{1:A}), of a much smaller size and with fewer ill-conditioning problems than the S** matrix.

García-Muñoz et al. [9] compare different scores estimation methods to 'fill in' the future unknown trajectories in on-line batch process monitoring. They conclude that KDR with PLS, and TSR, outperform other methods in the accuracy of the forecasts and in the quality of the score estimates, even from the beginning of the batch.

However, none of the above-mentioned papers provide an overall scientific (statistical) comparison of the performance of all these regression-based methods in a wide range of on-line monitoring scenarios. One may ask whether it matters which regression method one uses. The lack of an answer to this question may lead some practitioners to have a favourite method and stick to it without a sound reason.

4. FRAMEWORK FOR THE REGRESSION-BASED SCORES ESTIMATION METHODS FOR NEW INCOMPLETE OBSERVATIONS

Let us briefly review the different regression-based estimation methods presented in this paper.

4.1. KDR method with PCR

The KDR method with principal component regression (instead of the least squares regression model from Equation (10)) can be written as

X# = (X* V_{1:α}) B_5 + U_5    (12)

where α ≤ β = rank(S**), and V_{1:α} is the (K - R) × α matrix whose columns are the eigenvectors of S** associated with the greatest α eigenvalues.

The resulting estimation for the missing values is

ẑ# = S#* V_{1:α} Λ_{1:α}^{-1} V_{1:α}^T z*    (13)

where Λ_{1:α} is an α × α diagonal matrix of the largest α eigenvalues of the covariance matrix S** in decreasing order along the diagonal.

4.2. KDR method with pseudoinverse

If we take all the eigenvectors associated with the positive eigenvalues in Equation (12),

X# = (X* V_{1:β}) B_6 + U_6    (14)

the resulting imputation for the missing variables is

ẑ# = S#* V_{1:β} Λ_{1:β}^{-1} V_{1:β}^T z* = S#* (S**)^+ z*    (15)

That is, the estimator in Equation (15) is the KDR estimator with pseudoinverse. Equation (14) is a restatement of a method proposed by Nelson [8] to overcome the singularity of S**.

4.3. KDR method with PLS

Analogously, the KDR method with PLS can be written as

X# = (X* W*) B_7 + U_7    (16)

W* being the loadings matrix that allows the PLS scores T_PLS to be written as T_PLS = X* W* in the PLS model for estimating X# from X* [12,13].

4.4. TSR method

Since T*_{1:A} = X* P*_{1:A} (Equation (4)), the TSR model (Equation (11)) can be written as

X# = (X* P*_{1:A}) B_4 + U_4    (17)

From Equations (10), (12), (14), (16) and (17), the regression-based methods can be expressed as members of a framework by using the general regression model

X# = (X* L) B + U    (18)

where L is a key matrix that serves to particularise the framework members (I_{K-R} for the KDR method, V_{1:α} for the KDR with PCR method, V_{1:β} for the KDR with pseudoinverse method, W* for the KDR with PLS method and P*_{1:A} for the TSR method).

Table I summarises the different scores estimation methods that are members of the regression framework, as a function of the key matrix L. This framework allows its members to be expressed as different approximations to the KDR method (even the TSR method).

It is interesting to point out that when the KDR method is approximated with PCR or PLS, if the number of components extracted equals the rank of X*, we obtain the same result as when we directly apply the KDR with pseudoinverse. In the case of X* being full rank (and then the KDR method is applicable), all of them match the KDR method.

Solving for B, the ordinary least squares (OLS) estimation matrix from Equation (18) is expressed as B̂ = (L^T X*^T X* L)^{-1} L^T X*^T X# = (L^T S** L)^{-1} L^T S*#, from where the missing data estimation results in

ẑ# = B̂^T L^T z* = S#* L (L^T S** L)^{-1} L^T z*    (19)

Table I. Expression for the different regression-based scores estimation methods for incomplete observations as framework members. α < β = rank(S**); V, eigenvectors matrix of S**; W*, loadings matrix that allows the PLS scores T_PLS to be written as T_PLS = X* W* in the PLS model for estimating X# from X*.

Method                     Key matrix (L)
KDR                        I_{K-R}
KDR with PCR               V_{1:α}
KDR with pseudoinverse     V_{1:β}
KDR with PLS               W*
TSR                        P*_{1:A}


By substituting Equation (19) into Equation (3), the estimation of the scores vector s_{1:A} is obtained:

ŝ_{1:A} = P#_{1:A}^T S#* L (L^T S** L)^{-1} L^T z* + P*_{1:A}^T z*    (20)
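As a minimal sketch of the unified framework (simulated data; the helper name `score_estimate` is ours), the whole family of estimators reduces to one function of the key matrix L, and choosing L = I_{K-R} recovers the KDR estimator of Equation (7):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, R, A = 150, 6, 2, 2
X = rng.standard_normal((N, K)) @ rng.standard_normal((K, K))
X -= X.mean(axis=0)

S = X.T @ X / (N - 1)
eigval, eigvec = np.linalg.eigh(S)
order = np.argsort(eigval)[::-1]
theta = np.diag(eigval[order][:A])     # Theta_{1:A}
P = eigvec[:, order]

P_h, P_s = P[:R, :A], P[R:, :A]
S_hs, S_ss = S[:R, R:], S[R:, R:]

def score_estimate(L, z_s):
    """Equation (20): s_hat = P#^T S#* L (L^T S** L)^{-1} L^T z* + P*^T z*."""
    M = L.T @ S_ss @ L
    z_h_hat = S_hs @ L @ np.linalg.solve(M, L.T @ z_s)   # Equation (19)
    return P_h.T @ z_h_hat + P_s.T @ z_s

z_s = rng.standard_normal(K - R)

# L = I_{K-R} recovers the KDR estimator of Equation (7)
s_kdr_framework = score_estimate(np.eye(K - R), z_s)
s_kdr_direct = theta @ P_s.T @ np.linalg.solve(S_ss, z_s)
ok = np.allclose(s_kdr_framework, s_kdr_direct)

# L = P*_{1:A} gives the TSR member
s_tsr = score_estimate(P_s, z_s)
```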

From Equations (3) and (20) the estimation error vector is worked out as

s_{1:A} - ŝ_{1:A} = P#_{1:A}^T (z# - S#* L (L^T S** L)^{-1} L^T z*)    (21)

The estimation error uncertainty can be measured by the covariance matrix

Var(s_{1:A} - ŝ_{1:A}) = P#_{1:A}^T (S## - S#* L (L^T S** L)^{-1} L^T S*#) P#_{1:A}    (22)

The trace of this covariance matrix is the mean square error (MSE). Given that the scores vector estimator ŝ_{1:A} (Equation (20)) is unbiased (it is a linear transformation of the least squares linear predictor ẑ#, Equation (19)), the MSE is equivalent to the prediction error sum of variances (PRESV) for all the PCA components extracted. We propose PRESV as a suitable performance index to measure the statistical efficiency of the estimators.

Equation (22) shows that the covariance matrix of the estimation error vector does not depend on the particular values of the new incomplete observation z, but on the partition that z induces in the pre-built NOC PCA model and on the choice of the key matrix L. The framework allows this covariance matrix to be worked out analytically, and hence a statistical performance criterion (PRESV) for the different framework members is straightforwardly obtained without the need of a test data set. This is one of the benefits of the proposed framework.
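PRESV can be sketched directly from Equation (22); on simulated data (our own names and dimensions), it also illustrates that the unrestricted KDR member attains the smallest PRESV in the framework:

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, R, A = 150, 6, 2, 2
X = rng.standard_normal((N, K)) @ rng.standard_normal((K, K))
X -= X.mean(axis=0)

S = X.T @ X / (N - 1)
eigval, eigvec = np.linalg.eigh(S)
P = eigvec[:, np.argsort(eigval)[::-1]]

P_h = P[:R, :A]
S_hh, S_hs = S[:R, :R], S[:R, R:]
S_sh, S_ss = S[R:, :R], S[R:, R:]

def presv(L):
    """Trace of the error covariance matrix of Equation (22) for key matrix L."""
    M = L.T @ S_ss @ L
    cov = P_h.T @ (S_hh - S_hs @ L @ np.linalg.solve(M, L.T @ S_sh)) @ P_h
    return np.trace(cov)

presv_kdr = presv(np.eye(K - R))       # L = I: KDR member
presv_tsr = presv(P[R:, :A])           # L = P*_{1:A}: TSR member

# KDR is the unrestricted least squares optimum, so no member can beat it
ok = presv_kdr <= presv_tsr + 1e-10
```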

5. INDUSTRIAL EXAMPLES

The goal of this section is to compare the efficiency (measured by the PRESV criterion) of the different regression-based scores estimation methods for new incomplete observations that are members of the framework. Three industrial data sets are used: the first two come from continuous processes; the third one comes from a batch process.

For each method and missing data combination, the corresponding partitions in matrices S (Equation (5)) and P (Equation (2)) have been made. By substituting these submatrices and the choice of the key matrix L (Table I) into Equation (22), the error covariance matrix is analytically calculated, and PRESV is straightforwardly obtained by summing the elements along the diagonal of this covariance matrix.

5.1. Continuous process: mineral-sorting plant data set (SOVR)

This data set comes from a mineral-sorting plant at LKAB in Sweden and is available in the SIMCA-P software [14]. In this process, raw iron ore is divided into finer material by passing through several grinders. For illustrative purposes, out of the 12 process variables, only 8 have been selected for this study (K = 8). This data set is the same used by Arteaga and Ferrer [1]. From the 230 observations available, 150 observations have been used for building the PCA model. The three largest PCA components jointly explain 94% of the total variance of the process variables (R² = 0.94) and the predictive ability is high (Q² = 0.78).

With the purpose of comparing the mean value of the PRESV, worked out analytically from Equation (22) for the different methods derived from the regression framework discussed in Section 4, an analysis of variance (ANOVA) was run using the unknown variables combination as a block factor. This allows the comparison of the methods to be carried out under similar conditions, increasing the discriminant ability of the comparisons. Given that in this data set the maximum number of components in the KDR with PCR or with PLS methods is K - R (the rank of X*), this depends on the number of missing variables R. For this reason, three ANOVAs have been run for the cases R = 1, R = 2 and R = 3 (representing 12.5%, 25.0% and 37.5% of missing data respectively). Given the positive skewness of the PRESV in the different methods, a logarithmic transformation was applied to the PRESV variable to correct for normality.

Figure 1 displays the least significant difference (LSD) intervals for the average of the logarithm of the PRESV for each method under the three cases studied. As expected, as the number of components extracted increases, PRESV decreases for both the KDR with PCR and with PLS methods. For a small number of components extracted, the KDR method with PLS is statistically better on average than the KDR with PCR method with the same number of components, although both of them are statistically worse on average than the KDR and TSR methods (their LSD intervals do not overlap). Nevertheless, when the number of components extracted is greater than two this difference is no longer statistically significant. When the maximum number of components is extracted in each case, the KDR with PCR and with PLS methods are equivalent to the KDR method. For the most severe missing data cases studied (R > 1) there is no statistically significant difference between the average PRESV of the TSR and KDR methods. However, KDR turns out to be statistically more efficient than TSR in the case of having only one missing variable (R = 1). Anyway, this is the least severe case and all the methods yield good estimations.

Given that, out of the 230 observations available, only 150 have been used for building the model, the 80 observations left have been used as a validation set to check the analytical results shown before. For each missing data combination and method studied, the sample PRESV was obtained from the sample estimation error vector. Results similar to the previous ones, obtained analytically from Equation (22), have been found (not shown).

Up to now, we have shown how the PRESV index can be used to compare the statistical efficiency of the different regression-based methods, members of the proposed framework. Another potential use of this index is to evaluate off-line the future impact of combinations of missing data on score estimation. Those missing data combinations yielding large score estimation errors are called critical combinations. In the simplest cases, these critical combinations can be detected by simple inspection of the loading matrix of the NOC PCA model. Nevertheless, in many cases this has to be done by simulation. See, for example, Arteaga and Ferrer [3].
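The off-line screening of critical combinations can be sketched as follows (on simulated data, not the SOVR set; the helper `presv_kdr` and all constants are our own choices), ranking every R-variable combination by its PRESV under the KDR member:

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
N, K, A, R = 150, 8, 3, 3              # K = 8 variables, mirroring the SOVR setting
X = rng.standard_normal((N, K)) @ rng.standard_normal((K, K))
X -= X.mean(axis=0)

S = X.T @ X / (N - 1)
eigval, eigvec = np.linalg.eigh(S)
P = eigvec[:, np.argsort(eigval)[::-1]]

def presv_kdr(missing):
    """Equation (22) with L = I for an arbitrary set of missing variable indices."""
    miss = list(missing)
    obs = [j for j in range(K) if j not in miss]
    P_h = P[miss, :A]
    S_hh = S[np.ix_(miss, miss)]
    S_hs = S[np.ix_(miss, obs)]
    S_ss = S[np.ix_(obs, obs)]
    cov = P_h.T @ (S_hh - S_hs @ np.linalg.solve(S_ss, S_hs.T)) @ P_h
    return np.trace(cov)

# Rank every R-variable combination off-line; the top entries are the critical ones
ranked = sorted(itertools.combinations(range(K), R), key=presv_kdr, reverse=True)
worst = ranked[0]
```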


From Equation (22), PRESV can be straightforwardly worked out in advance (even before starting the on-line process monitoring) for any given or simulated missing data combination. In this way, critical combinations can be detected without requiring real data. This is illustrated in Figure 2, using the KDR method, where the most critical combination occurs when variables X4, X5 and X6 are missing. This is due to the high correlation of variables X4 and X6 with the second component, and variable X5 with the third component (see Table II).

Similar results (not shown) can be obtained by calculating the PRESV with the other framework methods.

5.2. Continuous process: high-density polyethylene data set (HDPE)

This data set comes from a petrochemical company in Spain. A commercial-scale polymerisation process produces large volumes of a polymer (high-density polyethylene) used in many familiar consumer products. We have used 61 observations with 13 process variables for building a three


Figure 1. SOVR. Least significant difference (LSD) intervals for the average of the log prediction error sum of variances, log(PRESV), for each method under the three cases studied (R = 1, 2, 3). PCR# means known data regression (KDR) with principal component regression (PCR) with # components. PLS# means KDR with partial least squares (PLS) with # components.

Figure 2. Mineral-sorting plant data set (SOVR). Prediction error sum of variances (PRESV) calculated with the KDR method for the different one, two and three missing variables combinations.


components PCA model with R² = 0.85 and Q² = 0.71. The results obtained with this data set generally match the previous results from the SOVR data set, although in this case KDR and some of its approximations with the PCR and PLS methods are statistically more efficient (by the PRESV criterion) than the TSR method for all the missing data combinations studied. This is shown in Figure 3, where the LSD intervals for the average PRESV for the different methods are displayed for several missing variables combinations. To ease the interpretation of the results, only the three and five missing variables combinations (representing 23% and 38% of missing data respectively) are shown. In this case, a logarithmic transformation of PRESV was not needed.

5.3. Batch process: sequencing batch reactor data set (SBR)

This example has been chosen to illustrate a case where the KDR method is not applicable, due to S** being singular. This is not an uncommon case but a very frequent problem in modern industries: for example, in batch processes or any process with more variables than observations.

The data set comes from a sequencing batch reactor (SBR) operated under anaerobic/aerobic conditions for biological phosphorus removal. During the batch run, 4 process variables were measured at regular intervals for a total of 265 time periods. Following Nomikos and MacGregor's approach [15], an unfold-PCA model was built on the unfolded matrix of 72 rows (NOC batches) by 1060 columns (4 variables by 265 time periods). The eight largest PCA components jointly explain 84.5% of the total variance of the process variables (R² = 0.845) and the predictive ability is high (Q² = 0.80).

For real-time monitoring of a new batch run, a natural problem to overcome with Nomikos and MacGregor's approach is that at time k (k < K) there are K - k unknown samples that should be estimated in order to calculate a score value. In this application we deal with the on-line estimation of the scores vector for new batches at five times k in the batch evolution: k1 = 50, k2 = 100, k3 = 150, k4 = 200, k5 = 250.

Figure 4 displays the values of PRESV (worked out from Equation (22)) for the different methods at the different time points studied. As the new batch evolves, the length of the unknown trajectory reduces and, consequently, the scores estimation accuracy increases. This is the reason why PRESV decreases with time for any given estimation method. Figure 4 shows that, as expected, the more components extracted (for both KDR with PCR and with PLS) the lower the PRESV obtained. KDR with PLS tends to perform better than KDR with PCR for the same number of components. The performance of the TSR method is quite similar to the other KDR approximations with a high number of components extracted.

Table II. SOVR; loadings matrix for the NOC PCA model

        X1      X2      X3      X4      X5      X6      X7      X8
p1   -0.446  -0.441  -0.446  -0.184  -0.020  -0.051  -0.432  -0.429
p2    0.084   0.076   0.064  -0.664  -0.217  -0.695   0.100   0.045
p3   -0.017  -0.086   0.031  -0.002  -0.945   0.292  -0.028   0.112

Figure 3. High-density polyethylene data set (HDPE). LSD intervals for the average PRESV for each method, assuming all the possible three (solid line) and five (dotted line) missing-variable combinations. PCR# means KDR with PCR with # components; PLS# means KDR with PLS with # components.


It is important to point out that, for a given time k, calculating PRESV from Equation (22) implies working out the key matrix L. As shown in Table I, this key matrix takes different expressions for the different framework members. In the TSR method, L is the submatrix made up of the first k × J rows of the NOC PCA loading matrix P_{1:A}; this is very easy to work out and has a low computational cost. Nevertheless, in the other approximations to the KDR method, working out L requires either extracting eigenvectors of the S** matrix corresponding to the partition induced by the missing trajectory at every time point k (KDR with PCR or with pseudoinverse) or building as many PLS models as time periods k of the batch run to calculate the W* matrix (KDR with PLS). This makes the latter methods more complex and computationally more expensive than the TSR method.
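As a rough illustration of this computational difference, the key matrix for TSR and for KDR with PCR could be obtained as follows. This is a minimal NumPy sketch with hypothetical names; the bookkeeping of the batch-trajectory partition at time k is omitted:

```python
import numpy as np

def key_matrix_tsr(P_1A, n_known):
    # TSR: L is simply the first n_known rows of the NOC loading
    # matrix P_{1:A}; no decomposition is needed at time k.
    return P_1A[:n_known, :]

def key_matrix_kdr_pcr(S_kk, r):
    # KDR with PCR: L collects the r leading eigenvectors of the
    # S** block induced by the missing trajectory, which has to be
    # recomputed at every time point k of the batch run.
    eigvals, eigvecs = np.linalg.eigh(S_kk)   # ascending eigenvalues
    return eigvecs[:, ::-1][:, :r]            # r largest first
```

For KDR with PLS the analogous step would be fitting a PLS model per time point to obtain W*, which is costlier still.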

6. CONCLUSIONS

We have proposed a framework for several regression-based missing data estimation methods used in on-line MSPC with incomplete observations, assuming an existing in-control PCA model. The framework is based on reconstructing the incomplete observation through the missing variables estimated from the regression model X# = (X* L)B + U (Equation (18)), expressed as a function of a key matrix L. Since all the scores estimation methods that are members of the regression framework are only variants of a single model, all of them can be computed by writing one program and varying the key matrix L (Table I) according to the method chosen. This simplifies the design of flexible computer programs for missing data estimation.
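A minimal NumPy sketch of this single-program idea (variable names are hypothetical; the population covariance blocks are replaced by their NOC sample estimates, and the key matrix L is the only argument that changes between methods):

```python
import numpy as np

def framework_scores(z_known, P_known, P_miss, S_kk, S_mk, L):
    """Score estimation for an incomplete observation, any framework member.

    z_known : measured part z* of the new observation
    P_known : rows of the loading matrix P_{1:A} for the measured variables
    P_miss  : rows of P_{1:A} for the missing variables
    S_kk    : NOC covariance block S** (measured vs measured)
    S_mk    : NOC covariance block S#* (missing vs measured)
    L       : key matrix; e.g. identity -> KDR, P_known -> TSR,
              leading eigenvectors of S** -> KDR with PCR
    """
    # B-hat = (L^T S** L)^{-1} L^T S*#  gives  z#-hat = S#* L (L^T S** L)^{-1} L^T z*
    M = L.T @ S_kk @ L
    z_miss_hat = S_mk @ L @ np.linalg.solve(M, L.T @ z_known)
    # scores of the reconstructed observation
    return P_miss.T @ z_miss_hat + P_known.T @ z_known
```

Switching methods is literally a change of L, which is the practical point of the framework.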

All of the methods can be understood as different approximations to the KDR method (even the TSR method). These approximations can be convenient when KDR suffers from ill-conditioning problems, and they are necessary when singularity problems appear (as in on-line batch MSPC and in processes with more variables than observations).

From the proposed framework, the analytical derivation of the expression for the covariance matrix of the estimation error, Var(s_{1:A} − ŝ_{1:A}) (Equation (22)), is straightforward for the different regression-based estimation methods that are members of the framework. We propose using the PRESV over all the PCA components extracted as a suitable performance index, negatively correlated with the statistical efficiency of the estimators.

Under normality, out of all the regression-based methods, KDR is the best according to the PRESV criterion. Approximations of KDR using PLS or PCR with a large number of components can show better statistical performance than TSR. In any case, the practical advantage of the latter is its low computational cost, because there is no need to choose the number of components to be extracted.

Figure 4. Sequencing batch reactor data set (SBR). PRESV for the different methods at different time points k. PCR# means KDR with PCR with # components; PLS# means KDR with PLS with # components.

One of the advantages of the PRESV statistical performance index is that it does not depend on the particular

values of a given new incomplete observation z, but only on the partition that z induces in the pre-built NOC PCA model and on the choice of the matrix L. The practical benefit of this property is that practitioners do not have to split their data set and use a test set for comparing the performance of different estimation methods; instead, PRESV can be straightforwardly calculated in advance (even before starting the on-line process monitoring) from Equation (22) for any given or simulated missing data combination. Moreover, this can be very useful in practice for early detection of harmful missing data combinations (those causing large errors in the scores estimations). Such studies can assist the design of sensor maintenance schedules and give some insight into sensor redundancy.
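Since Equation (22) itself is not reproduced in this excerpt, the screening idea can still be illustrated with a Monte-Carlo stand-in for PRESV: simulate latent scores from the NOC model, apply one framework member (TSR here), and rank the missing-variable combinations by the total error variance. All names and the toy NOC model below are hypothetical:

```python
import itertools
import numpy as np

def tsr_error_cov(P, H, A, miss, n_sim=1000, seed=0):
    """Monte-Carlo estimate of Var(s_{1:A} - s_hat_{1:A}) for the TSR
    method and a given set of missing variables, assuming the NOC
    model z = P t with latent scores t ~ N(0, H)."""
    rng = np.random.default_rng(seed)
    K = P.shape[0]
    known = [j for j in range(K) if j not in miss]
    S = P @ H @ P.T                       # model covariance of z
    S_kk = S[np.ix_(known, known)]
    S_mk = S[np.ix_(list(miss), known)]
    Pk, Pm = P[known][:, :A], P[list(miss)][:, :A]
    M = Pk.T @ S_kk @ Pk                  # TSR key matrix is L = Pk
    errs = []
    for _ in range(n_sim):
        t = rng.standard_normal(K) * np.sqrt(np.diag(H))
        z = P @ t
        z_m_hat = S_mk @ Pk @ np.linalg.solve(M, Pk.T @ z[known])
        errs.append(t[:A] - (Pm.T @ z_m_hat + Pk.T @ z[known]))
    return np.cov(np.array(errs).T)

# rank every three-variable missing combination by a PRESV-like index
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((6, 6)))
H = np.diag([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
index = {c: np.trace(tsr_error_cov(Q, H, 2, c, n_sim=300))
         for c in itertools.combinations(range(6), 3)}
worst = max(index, key=index.get)         # most harmful combination
```

In the paper's setting the same ranking would come analytically from Equation (22), with no simulation needed.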

In addition, the general expression for the covariance matrix of the estimation error can be very useful for characterising the uncertainty that missing data propagate through the scores and other monitoring statistics (Hotelling's T² and the squared prediction error). This is a key point in on-line MSPC with missing data and deserves future work.

Acknowledgements

We thank the CALAGUA research group (Universidad de Valencia – Universidad Politécnica de Valencia, Spain) for providing us with the SBR data set. Thanks also to the Editorial Board Member for the helpful comments. This research was supported by the Spanish Government (MICYT) and the European Union (RDE funds) under grant DP12001-2749-C02-01.

APPENDIX: EQUIVALENT MODELS FOR THE KDR AND TSR METHODS

Let us consider the NOC PCA model (Equation (1)) and a new incomplete observation that induces the partition in X, X = [X# X*], discussed in Section 2. The elements of the covariance matrix S (Equation (5)) can be expressed as

$$\mathbf{S}^{**} = \frac{\mathbf{X}^{*T}\mathbf{X}^{*}}{N-1} = \frac{\mathbf{P}^{*}\mathbf{T}^{T}\mathbf{T}\,\mathbf{P}^{*T}}{N-1} = \mathbf{P}^{*}\mathbf{H}\mathbf{P}^{*T}, \qquad \mathbf{S}^{\#*} = \mathbf{P}^{\#}\mathbf{H}\mathbf{P}^{*T}, \qquad \mathbf{S}^{*\#} = \mathbf{P}^{*}\mathbf{H}\mathbf{P}^{\#T}$$

where

$$\mathbf{H} = \begin{bmatrix} \mathbf{H}_{1:A} & \mathbf{0} \\ \mathbf{0} & \mathbf{H}_{A+1:H} \end{bmatrix}$$

is the H-order diagonal covariance matrix of the latent variables from the NOC PCA model.

If it is assumed that the missing variables are estimated first from the measured variables (Equation (10)),

$$\mathbf{X}^{\#} = \mathbf{X}^{*}\mathbf{B}_3 + \mathbf{U}_3$$

the least squares estimator of the matrix of coefficients $\mathbf{B}_3$ is

$$\hat{\mathbf{B}}_3 = \left(\mathbf{X}^{*T}\mathbf{X}^{*}\right)^{-1}\mathbf{X}^{*T}\mathbf{X}^{\#} = \left(\mathbf{S}^{**}\right)^{-1}\mathbf{S}^{*\#}$$

and thus the predicted missing variables for the new incomplete observation can be expressed as

$$\hat{\mathbf{z}}^{\#} = \mathbf{S}^{\#*}\left(\mathbf{S}^{**}\right)^{-1}\mathbf{z}^{*} \qquad (24)$$

If the missing variables in Equation (3) are replaced by their estimated values from Equation (24), the following estimation of the score vector $\mathbf{s}_{1:A}$ results:

$$\hat{\mathbf{s}}_{1:A} = \mathbf{P}^{\#T}_{1:A}\hat{\mathbf{z}}^{\#} + \mathbf{P}^{*T}_{1:A}\mathbf{z}^{*} = \mathbf{P}^{\#T}_{1:A}\mathbf{S}^{\#*}\left(\mathbf{S}^{**}\right)^{-1}\mathbf{z}^{*} + \mathbf{P}^{*T}_{1:A}\mathbf{z}^{*} \qquad (25)$$

but

$$\begin{aligned}
\mathbf{P}^{\#T}_{1:A}\mathbf{S}^{\#*} &= \mathbf{P}^{\#T}_{1:A}\mathbf{P}^{\#}\mathbf{H}\mathbf{P}^{*T} \\
&= \mathbf{P}^{\#T}_{1:A}\left[\mathbf{P}^{\#}_{1:A}\;\;\mathbf{P}^{\#}_{A+1:H}\right]\begin{bmatrix}\mathbf{H}_{1:A} & \mathbf{0}\\ \mathbf{0} & \mathbf{H}_{A+1:H}\end{bmatrix}\begin{bmatrix}\mathbf{P}^{*T}_{1:A}\\ \mathbf{P}^{*T}_{A+1:H}\end{bmatrix} \\
&= \left[\mathbf{I}-\mathbf{P}^{*T}_{1:A}\mathbf{P}^{*}_{1:A}\;\;\; -\mathbf{P}^{*T}_{1:A}\mathbf{P}^{*}_{A+1:H}\right]\begin{bmatrix}\mathbf{H}_{1:A}\mathbf{P}^{*T}_{1:A}\\ \mathbf{H}_{A+1:H}\mathbf{P}^{*T}_{A+1:H}\end{bmatrix} \\
&= \mathbf{H}_{1:A}\mathbf{P}^{*T}_{1:A}-\mathbf{P}^{*T}_{1:A}\mathbf{P}^{*}_{1:A}\mathbf{H}_{1:A}\mathbf{P}^{*T}_{1:A}-\mathbf{P}^{*T}_{1:A}\mathbf{P}^{*}_{A+1:H}\mathbf{H}_{A+1:H}\mathbf{P}^{*T}_{A+1:H} \\
&= \mathbf{H}_{1:A}\mathbf{P}^{*T}_{1:A}-\mathbf{P}^{*T}_{1:A}\mathbf{S}^{**}
\end{aligned} \qquad (26)$$

and hence, by substituting Equation (26) into Equation (25), we can write

$$\hat{\mathbf{s}}_{1:A} = \mathbf{P}^{\#T}_{1:A}\hat{\mathbf{z}}^{\#} + \mathbf{P}^{*T}_{1:A}\mathbf{z}^{*} = \left(\mathbf{H}_{1:A}\mathbf{P}^{*T}_{1:A}-\mathbf{P}^{*T}_{1:A}\mathbf{S}^{**}\right)\left(\mathbf{S}^{**}\right)^{-1}\mathbf{z}^{*}+\mathbf{P}^{*T}_{1:A}\mathbf{z}^{*} = \mathbf{H}_{1:A}\mathbf{P}^{*T}_{1:A}\left(\mathbf{S}^{**}\right)^{-1}\mathbf{z}^{*} \qquad (27)$$

which matches the expression for the CMR (KDR) estimator (Equation (7)).
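Identity (26) holds exactly whenever all H components are retained, so that S = P H P^T with no residual. A quick numeric check on hypothetical toy data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, A = 40, 6, 2
X = rng.standard_normal((N, K)) @ rng.standard_normal((K, K))
X = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt.T                            # full loading matrix (all H = K components)
H = np.diag(s**2) / (N - 1)         # diagonal covariance of the latent variables
miss, known = [0, 1], [2, 3, 4, 5]
S = X.T @ X / (N - 1)
# left-hand side of (26): P#_{1:A}^T S#*
lhs = P[miss][:, :A].T @ S[np.ix_(miss, known)]
# right-hand side of (26): H_{1:A} P*_{1:A}^T - P*_{1:A}^T S**
rhs = H[:A, :A] @ P[known][:, :A].T - P[known][:, :A].T @ S[np.ix_(known, known)]
assert np.allclose(lhs, rhs)
```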

Alternatively, if the missing variables are estimated first from the trimmed scores (Equation (11)),

$$\mathbf{X}^{\#} = \mathbf{T}^{*}_{1:A}\mathbf{B}_4 + \mathbf{U}_4$$

solving for $\mathbf{B}_4$, the least squares estimator is

$$\hat{\mathbf{B}}_4 = \left(\mathbf{T}^{*T}_{1:A}\mathbf{T}^{*}_{1:A}\right)^{-1}\mathbf{T}^{*T}_{1:A}\mathbf{X}^{\#} = \left(\mathbf{P}^{*T}_{1:A}\mathbf{X}^{*T}\mathbf{X}^{*}\mathbf{P}^{*}_{1:A}\right)^{-1}\mathbf{P}^{*T}_{1:A}\mathbf{X}^{*T}\mathbf{X}^{\#} = \left(\mathbf{P}^{*T}_{1:A}\mathbf{S}^{**}\mathbf{P}^{*}_{1:A}\right)^{-1}\mathbf{P}^{*T}_{1:A}\mathbf{S}^{*\#}$$

and the estimation of the unknown variables for the new observation results in

$$\hat{\mathbf{z}}^{\#} = \mathbf{S}^{\#*}\mathbf{P}^{*}_{1:A}\left(\mathbf{P}^{*T}_{1:A}\mathbf{S}^{**}\mathbf{P}^{*}_{1:A}\right)^{-1}\mathbf{P}^{*T}_{1:A}\mathbf{z}^{*} \qquad (28)$$

By substituting Equation (28) into Equation (3), the following estimation of the score vector $\mathbf{s}_{1:A}$ is obtained:

$$\hat{\mathbf{s}}_{1:A} = \mathbf{P}^{\#T}_{1:A}\hat{\mathbf{z}}^{\#} + \mathbf{P}^{*T}_{1:A}\mathbf{z}^{*} = \mathbf{P}^{\#T}_{1:A}\mathbf{S}^{\#*}\mathbf{P}^{*}_{1:A}\left(\mathbf{P}^{*T}_{1:A}\mathbf{S}^{**}\mathbf{P}^{*}_{1:A}\right)^{-1}\mathbf{P}^{*T}_{1:A}\mathbf{z}^{*} + \mathbf{P}^{*T}_{1:A}\mathbf{z}^{*} \qquad (29)$$

But, from Equation (26), we can rewrite Equation (29) as

$$\hat{\mathbf{s}}_{1:A} = \left(\mathbf{H}_{1:A}\mathbf{P}^{*T}_{1:A}-\mathbf{P}^{*T}_{1:A}\mathbf{S}^{**}\right)\mathbf{P}^{*}_{1:A}\left(\mathbf{P}^{*T}_{1:A}\mathbf{S}^{**}\mathbf{P}^{*}_{1:A}\right)^{-1}\mathbf{P}^{*T}_{1:A}\mathbf{z}^{*} + \mathbf{P}^{*T}_{1:A}\mathbf{z}^{*} = \mathbf{H}_{1:A}\mathbf{P}^{*T}_{1:A}\mathbf{P}^{*}_{1:A}\left(\mathbf{P}^{*T}_{1:A}\mathbf{S}^{**}\mathbf{P}^{*}_{1:A}\right)^{-1}\mathbf{P}^{*T}_{1:A}\mathbf{z}^{*} \qquad (30)$$

yielding the expression for the TSR estimator (Equation (9)).
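The two routes to the TSR estimator, imputation via Equation (28) followed by scoring (Equation (29)) versus the closed form (30), can be checked numerically in the same kind of toy setting (hypothetical data, all components retained):

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, A = 40, 6, 2
X = rng.standard_normal((N, K)) @ rng.standard_normal((K, K))
X = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P, H = Vt.T, np.diag(s**2) / (N - 1)
miss, known = [0, 1], [2, 3, 4, 5]
S = X.T @ X / (N - 1)
S_mk, S_kk = S[np.ix_(miss, known)], S[np.ix_(known, known)]
Pk, Pm = P[known][:, :A], P[miss][:, :A]
z = X[5]
# route 1: impute z# from the trimmed scores (Equation (28)), then score (Equation (29))
M = Pk.T @ S_kk @ Pk
z_m_hat = S_mk @ Pk @ np.linalg.solve(M, Pk.T @ z[known])
s_hat_29 = Pm.T @ z_m_hat + Pk.T @ z[known]
# route 2: closed form of the TSR estimator (Equation (30))
s_hat_30 = H[:A, :A] @ Pk.T @ Pk @ np.linalg.solve(M, Pk.T @ z[known])
assert np.allclose(s_hat_29, s_hat_30)
```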

REFERENCES

1. Little RJA, Rubin DB. Statistical Analysis with Missing Data. Wiley: New York, 1987.

2. Nelson PRC, Taylor PA, MacGregor JF. Missing data methods in PCA and PLS: score calculations with incomplete observations. Chemometrics Intell. Lab. Syst. 1996; 35: 45–65.

3. Arteaga F, Ferrer A. Dealing with missing data in MSPC: several methods, different interpretations, some examples. J. Chemometrics 2002; 16: 408–418.

4. Wold S, Albano C, Dunn WJ, Esbensen K, Hellberg S, Johansson E, Sjöström M. Pattern recognition: finding and using regularities in multivariate data. In Food Research and Data Analysis, Martens H, Russwurm H Jr (eds). Applied Science Publishers: London and New York, 1983; 183–185.

5. Martens H, Naes T. Multivariate Calibration. Wiley: New York, 1989.

6. Wise BM, Ricker NL. Recent advances in multivariate statistical process control: improving robustness and sensitivity. IFAC Int. Symp., ADCHEM '91, Toulouse, 1991; 125–130.

7. Walczak B, Massart DL. Dealing with missing data: Part I. Chemometrics Intell. Lab. Syst. 2001; 58: 15–27.

8. Nelson PRC. Treatment of missing measurements in PCA and PLS models. Ph.D. Dissertation, Department of Chemical Engineering, McMaster University, Hamilton, Ontario, Canada, 2002.

9. García-Muñoz S, Kourti T, MacGregor JF. Model predictive monitoring for batch processes. Ind. Eng. Chem. Res. 2004; 43: 5929–5941.

10. Jackson JE. A User's Guide to Principal Components. Wiley: New York, 1991.

11. Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometrics Intell. Lab. Syst. 1987; 2: 37–52.

12. Höskuldsson A. PLS regression methods. J. Chemometrics 1988; 2: 211–228.

13. De Jong S. SIMPLS: an alternative approach to partial least squares regression. Chemometrics Intell. Lab. Syst. 1993; 18: 251–263.

14. SIMCA-P 8.0: User Guide and Tutorial. Umetrics AB: Umeå, 1999.

15. Nomikos P, MacGregor JF. Multivariate SPC charts for monitoring batch processes. Technometrics 1995; 37: 41–59.
