JOURNAL OF CHEMOMETRICS. J. Chemometrics 2005; 19: 439–447. Published online 12 December 2005 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cem.946
Framework for regression-based missing data imputation methods in on-line MSPC
Francisco Arteaga1 and Alberto Ferrer2*
1 Facultad de Estudios de la Empresa, Universidad Católica de Valencia San Vicente Mártir, Guillem de Castro 175, 46008 Valencia, Spain
2 Departamento de Estadística e I.O. Aplicadas y Calidad, Universidad Politécnica de Valencia, Camino de Vera s/n, Edificio I-3, 46022 Valencia, Spain
Received 19 February 2004; Revised 10 October 2005; Accepted 19 October 2005
Missing data are a critical issue in on-line multivariate statistical process control (MSPC). Among the different score estimation methods for future incomplete multivariate observations from an existing principal component analysis (PCA) model, the most statistically efficient ones are those that estimate the scores for the new incomplete observation as the prediction from a regression model. We have called them regression-based methods. Several approximations have been proposed in the literature to overcome the singularity or ill-conditioning problems that some of the mentioned methods can suffer due to missing data. This is particularly acute in on-line batch process monitoring. In order to ease the comparison of the statistical performance of these methods and to improve the understanding of their relationships, in this paper we propose a framework that allows these regression-based methods to be written in a unique expression, as a function of a key matrix. From this framework a statistical performance index (PRESV) is introduced as a way to compare the statistical efficiency of the different framework members and to predict the impact of specific missing data combinations on score estimation without requiring real data. The results are illustrated by application to several continuous and batch industrial data sets. Copyright © 2005 John Wiley & Sons, Ltd.

KEYWORDS: principal component analysis (PCA); missing data; multivariate statistical process control (MSPC)
1. INTRODUCTION
Missing measurements are a common occurrence in on-line
multivariate statistical process control (MSPC) [1,2]. Several
methods have been proposed to estimate the latent variables
scores for new incomplete multivariate observations from a
pre-built, fixed and known principal component analysis
(PCA) model [2–8]. Arteaga and Ferrer [3] study the equivalence between these methods, concluding that almost all of them can be seen as different imputation methods, that is, different ways to impute values for the missing variables.
A regression model appears, in one form or another, in all the methods studied. However, we reserve the name regression-based methods for those in which the scores for the new incomplete observation are estimated as the prediction from a regression model: the so-called known data regression (KDR) method (also called the conditional mean replacement method) and the trimmed scores regression (TSR) method.
The regression-based methods turn out to be statistically more efficient than the other methods proposed, the most efficient being the KDR method [2,3]. Nevertheless, this
method may suffer ill-conditioning problems with highly
correlated data. This is particularly acute in on-line batch
process monitoring with unfold-PCA models where singu-
larity problems appear [9]. Biased regression methods
should be used to overcome these problems. Although
several authors [2,3,9] comment on this, none of them
provide a scientific (statistical) comparison of the perfor-
mance of all these regression-based methods in a wide range
of on-line monitoring scenarios.
In order to ease this comparison study and to improve
understanding of their relationships and differences, in this
paper a framework to express these regression-based scores estimation methods is proposed. This unified description can be useful for simplifying the design of flexible computer programs for missing data estimation methods and for providing a fast and easy way to evaluate the performance of the different members of the framework.
Section 2 introduces the notation. Section 3 outlines the different regression-based methods proposed in the literature to face the problem of score estimation from an existing PCA model when new observations are incomplete. An
alternative formulation for the regression-based scores
*Correspondence to: Alberto Ferrer, Universidad Politécnica de Valencia, Departamento de Estadística e I.O. Aplicadas y Calidad, Camino de Vera s/n, Edificio I-3, 46022 Valencia, Spain. E-mail: [email protected]
Contract/grant sponsor: Spanish Government (MICYT). Contract/grant sponsor: European Union (RDE funds); contract/grant number: DPI2001-2749-C02-01.
estimation methods is also introduced. Section 4 shows how
this reformulation can be generalised, yielding a general
framework. The performance of the different regression-based estimation methods, members of the framework, is
studied in Section 5, based on several continuous and batch
industrial data sets. Finally, Section 6 presents the conclu-
sions of the present paper.
2. NOTATION
Following the same notation as Nelson et al. [2] and Arteaga and Ferrer [3], lower-case bold characters denote column vectors and upper-case bold characters denote matrices.
Let us assume that a PCA model [5,10] has been built from
a reference data set X, with N observations and K variables,
representing normal operating conditions (NOC) from a
multivariate process,
X = T P^T    (1)

where T is an N×H matrix of scores and P is a K×H matrix of loadings, with H = rank(X).
Consider that a new observation z has some unmeasured variables, and that these can be taken to be the first R elements of the data vector without loss of generality. Thus, the vector can be partitioned as z = [z^#; z^*], where z^# denotes the missing measurements and z^* the observed variables. This induces the following partition in X, X = [X^# X^*], where X^# is the submatrix containing the first R columns of X, and X^* accommodates the remaining K−R columns.
Correspondingly, the P matrix can be partitioned as P = [P^#; P^*], where P^# is the submatrix made up of the first R rows of P, and P^* contains the remaining K−R rows.
Assuming that matrix X is of rank H, and that only A out of the H components (A ≤ H) are significant, we are only interested in working out the first A elements of the scores vector for the new individual, s_{1:A}. In this situation, the P matrix can be expressed as:

P = [P_{1:A}  P_{A+1:H}] = [P^#_{1:A}  P^#_{A+1:H}; P^*_{1:A}  P^*_{A+1:H}]    (2)

where P_{1:A} contains the first A loading vectors and P_{A+1:H} the remaining H−A of the PCA model.
From the previous expressions, the first A elements of the scores vector for the new observation can be written as

s_{1:A} = P^T_{1:A} z = P^{#T}_{1:A} z^# + P^{*T}_{1:A} z^* = s^#_{1:A} + s^*_{1:A}    (3)

This allows the scores vector to be expressed as the sum of two elements: the first is the contribution from the unknown variables, s^#_{1:A} = P^{#T}_{1:A} z^#, and the second is the contribution from the known variables, s^*_{1:A} = P^{*T}_{1:A} z^*.
The scores matrix from the NOC PCA model can also be expressed as

T_{1:A} = X P_{1:A} = X^# P^#_{1:A} + X^* P^*_{1:A} = T^#_{1:A} + T^*_{1:A}    (4)
The partition in the new incomplete observation also induces the following partition in the covariance matrix

S = X^T X/(N−1) = (1/(N−1)) [X^{#T}X^#  X^{#T}X^*; X^{*T}X^#  X^{*T}X^*] = [S^{##}  S^{#*}; S^{*#}  S^{**}]    (5)
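As a minimal numerical sketch of this notation (not from the paper: the data, the sizes and all variable names are illustrative), the partitions of z and P and the score decomposition of Equation (3) can be reproduced with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for the reference set X: N observations of K
# correlated variables, mean-centred column-wise.
N, K = 200, 6
X = rng.normal(size=(N, 2)) @ rng.normal(size=(2, K)) + 0.1 * rng.normal(size=(N, K))
X -= X.mean(axis=0)

# PCA model X = T P^T via SVD; the columns of P are the loading vectors.
_, svals, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt.T                                  # K x H loadings, H = rank(X)

# A new observation z whose first R variables are missing induces the
# partitions z = [z#; z*] and P = [P#; P*] of Section 2.
R, A = 2, 2
z = X[0]                                  # pretend this row arrives incomplete
z_miss, z_obs = z[:R], z[R:]              # z#, z*
P_miss, P_obs = P[:R, :A], P[R:, :A]      # P#_{1:A}, P*_{1:A}

# Equation (3): s_{1:A} = P#^T z# + P*^T z* equals the complete-data scores.
s = P_miss.T @ z_miss + P_obs.T @ z_obs
assert np.allclose(s, P[:, :A].T @ z)
```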
3. REGRESSION-BASED SCORES ESTIMATION METHODS
Given a new incomplete individual z, and assuming the same variables to be missing in each row of the data matrix X, Arteaga and Ferrer [3] propose to fit the regression model

T_{1:A} = X^* B_1 + U_1    (6)

and to estimate the scores vector from the known variables as ŝ_{1:A} = B̂^T_1 z^*, with B̂_1 = (X^{*T} X^*)^{-1} X^{*T} T_{1:A} the least squares estimator of matrix B_1. This results in the KDR estimator

ŝ_{1:A} = Θ_{1:A} P^{*T}_{1:A} (S^{**})^{-1} z^*    (7)

where Θ_{1:A} is an A×A diagonal matrix with the A largest eigenvalues of the covariance matrix S (Equation (5)) in decreasing order along the diagonal.
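The equivalence between the regression form of Equation (6) and the closed-form KDR estimator of Equation (7) can be checked numerically. The following sketch is illustrative only, with simulated data standing in for X and our own variable names:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, R, A = 300, 6, 2, 2
X = rng.normal(size=(N, 3)) @ rng.normal(size=(3, K)) + 0.2 * rng.normal(size=(N, K))
X -= X.mean(axis=0)

_, svals, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt.T
T = X @ P                              # full scores matrix
S = X.T @ X / (N - 1)
theta = svals**2 / (N - 1)             # eigenvalues of S, decreasing

X_obs = X[:, R:]                       # X*
S_oo = S[R:, R:]                       # S**
P_obs_A = P[R:, :A]                    # P*_{1:A}
z_obs = X[0, R:]                       # observed part z* of a new sample

# KDR as a regression of the scores on the known variables, Eq. (6):
B1 = np.linalg.solve(X_obs.T @ X_obs, X_obs.T @ T[:, :A])
s_reg = B1.T @ z_obs

# Closed form, Eq. (7): s_hat = Theta_{1:A} P*_{1:A}^T (S**)^{-1} z*
s_kdr = np.diag(theta[:A]) @ P_obs_A.T @ np.linalg.solve(S_oo, z_obs)

assert np.allclose(s_reg, s_kdr)
```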
Nelson et al. [2], assuming that z follows a multivariate normal distribution, propose to replace the unknown variables with their conditional expectation given the known variables and the in-control PCA model, ẑ^# = E(z^# | z^*; S), and to calculate the scores of the reconstructed observation as if no measurements were missing. This is the conditional mean replacement (CMR) method, and it results in the same estimator as the above-mentioned KDR method [2,3].
The KDR method requires the inversion of S^{**} (see Equation (7)), a matrix of order K−R (K−R being the number of known variables). S^{**} can be ill-conditioned in data sets with a large number of highly correlated variables. This is particularly acute in on-line batch process monitoring with unfold-PCA models, where singularity problems appear [9]. Nelson et al. [2] suggest overcoming this problem by replacing the ordinary least squares estimate from Equation (6) with a biased regression method such as ridge regression (RR), principal component regression (PCR) or partial least squares (PLS). In the case of S^{**} being singular, Nelson [8] suggests replacing the inverse (S^{**})^{-1} with the pseudoinverse (S^{**})^{+}.
Arteaga and Ferrer [3] propose an alternative to the KDR method, called the TSR method, replacing the KDR regression model (Equation (6)) with a new regression model

T_{1:A} = T^*_{1:A} B_2 + U_2    (8)

in which the scores vector is estimated from the known variables' contribution to the scores vector (the trimmed scores), s^*_{1:A} = P^{*T}_{1:A} z^*, corresponding to the new incomplete observation.

The TSR estimator from Equation (8) yields [3]

ŝ_{1:A} = Θ_{1:A} P^{*T}_{1:A} P^*_{1:A} (P^{*T}_{1:A} S^{**} P^*_{1:A})^{-1} P^{*T}_{1:A} z^*    (9)
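As for KDR, the regression form of Equation (8) and the closed form of Equation (9) can be verified to coincide numerically; the sketch below is illustrative, with simulated data and our own variable names:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, R, A = 300, 6, 2, 2
X = rng.normal(size=(N, 3)) @ rng.normal(size=(3, K)) + 0.2 * rng.normal(size=(N, K))
X -= X.mean(axis=0)

_, svals, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt.T
T = X @ P
S = X.T @ X / (N - 1)
Theta_A = np.diag(svals[:A]**2 / (N - 1))   # Theta_{1:A}

P_obs = P[R:, :A]                 # P*_{1:A}
S_oo = S[R:, R:]                  # S**
z_obs = X[0, R:]                  # z* of a new incomplete sample

# TSR as a regression of the scores on the trimmed scores, Eq. (8):
T_trim = X[:, R:] @ P_obs         # trimmed scores T*_{1:A}, Eq. (4)
B2 = np.linalg.solve(T_trim.T @ T_trim, T_trim.T @ T[:, :A])
s_reg = B2.T @ (P_obs.T @ z_obs)

# Closed form, Eq. (9); note that only an A x A matrix is inverted.
M = P_obs.T @ S_oo @ P_obs
s_tsr = Theta_A @ P_obs.T @ P_obs @ np.linalg.solve(M, P_obs.T @ z_obs)

assert np.allclose(s_reg, s_tsr)
```

The A×A inversion in `M` is the computational advantage of TSR pointed out below.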
The regression models for the KDR (Equation (6)) and
TSR (Equation (8)) methods can be replaced by equivalent
models in which, instead of straightforward estimation of
the scores vector, the unknown variables are estimated first
from the measured variables or from the trimmed scores,
respectively, based on the following models
X^# = X^* B_3 + U_3    (10)

X^# = T^*_{1:A} B_4 + U_4    (11)
From Equation (3), the estimated missing measurements
can then be used in a second step in the score calculation
along with the measured data as if no measurements were
missing. The equivalence proof between these models is
shown in the Appendix.
Arteaga and Ferrer [3] illustrate, through several continuous processes, the statistical superiority of the KDR method, based on the squared prediction error criterion. Nevertheless, the TSR method is shown to be practically equivalent to the KDR method, with the advantage of requiring the inversion of a matrix of order A (P^{*T}_{1:A} S^{**} P^*_{1:A}), of a much smaller size and with fewer ill-conditioning problems than the S^{**} matrix.
García-Muñoz et al. [9] compare different score estimation methods to 'fill in' the future unknown trajectories in on-line batch process monitoring. They conclude that KDR with PLS and TSR outperform the other methods in the accuracy of the forecasts made and in the quality of the score estimates, even from the beginning of the batch.

However, none of the above-mentioned papers provides an overall scientific (statistical) comparison of the performance of all these regression-based methods in a wide range of on-line monitoring scenarios. One may ask whether it matters which regression method one uses. The lack of an answer to this question may lead some practitioners to adopt a favourite method and stick to it without a sound reason.
4. FRAMEWORK FOR THE REGRESSION-BASED SCORES ESTIMATION METHODS FOR NEW INCOMPLETE OBSERVATIONS
Let us briefly review the different regression-based estima-
tion methods presented in this paper.
4.1. KDR method with PCR

The KDR method with principal component regression (instead of the least squares regression model from Equation (10)) can be written as

X^# = (X^* V_{1:γ}) B_5 + U_5    (12)

where γ ≤ ρ = rank(S^{**}), and V_{1:γ} is the (K−R)×γ matrix whose columns are the eigenvectors of S^{**} associated with the γ largest eigenvalues.

The resulting estimation for the missing values is

ẑ^# = S^{#*} V_{1:γ} Λ^{-1}_{1:γ} V^T_{1:γ} z^*    (13)

where Λ_{1:γ} is a γ×γ diagonal matrix with the γ largest eigenvalues of the covariance matrix S^{**} in decreasing order along the diagonal.
4.2. KDR method with pseudoinverse

If we take all the eigenvectors associated with the positive eigenvalues in Equation (12),

X^# = (X^* V_{1:ρ}) B_6 + U_6    (14)

the resulting imputation for the missing variables is

ẑ^# = S^{#*} V_{1:ρ} Λ^{-1}_{1:ρ} V^T_{1:ρ} z^* = S^{#*} (S^{**})^{+} z^*    (15)

That is, the estimator in Equation (15) is the KDR method with pseudoinverse. Equation (14) is a restatement of a method proposed by Nelson [8] to overcome the singularity of S^{**}.
4.3. KDR method with PLS

Analogously, the KDR method with PLS can be written as

X^# = (X^* W^*) B_7 + U_7    (16)

W^* being the loadings matrix that allows the PLS scores T_{PLS} to be written as T_{PLS} = X^* W^* in the PLS model for estimating X^# from X^* [12,13].
4.4. TSR method

Since T^*_{1:A} = X^* P^*_{1:A} (Equation (4)), the TSR method model (Equation (11)) can be written as

X^# = (X^* P^*_{1:A}) B_4 + U_4    (17)
From Equations (10), (12), (14), (16) and (17), the regression-based methods can be expressed as members of a framework by using the general regression model

X^# = (X^* L) B + U    (18)

where L is a key matrix that serves to particularise the framework members (I_{K−R} for the KDR method, V_{1:γ} for the KDR with PCR method, V_{1:ρ} for the KDR with pseudoinverse method, W^* for the KDR with PLS method and P^*_{1:A} for the TSR method).
Table I summarises the different scores estimation methods, members of the regression framework, as a function of the key matrix L. This framework allows its members to be expressed as different approximations to the KDR method (even the TSR method).

It is interesting to point out that when the KDR method is approximated with PCR or PLS, if the number of components extracted equals the rank of X^*, we obtain the same result as when we directly apply the KDR with pseudoinverse. In the case of X^* being full rank (and then the KDR method is applicable), all of them match the KDR method.
Solving for B, the ordinary least squares (OLS) estimation matrix from Equation (18) is expressed as B̂ = (L^T X^{*T} X^* L)^{-1} L^T X^{*T} X^# = (L^T S^{**} L)^{-1} L^T S^{*#}, from which the missing data estimation results in

ẑ^# = B̂^T L^T z^* = S^{#*} L (L^T S^{**} L)^{-1} L^T z^*    (19)
Table I. Expressions for the different regression-based scores estimation methods for incomplete observations as framework members. γ < ρ = rank(S^{**}); V, eigenvector matrix of S^{**}; W^*, loadings matrix that allows the PLS scores T_{PLS} to be written as T_{PLS} = X^* W^* in the PLS model for estimating X^# from X^*

Method                    Key matrix (L)
KDR                       I_{K−R}
KDR with PCR              V_{1:γ}
KDR with pseudoinverse    V_{1:ρ}
KDR with PLS              W^*
TSR                       P^*_{1:A}
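Table I suggests a direct implementation: one generic routine for Equations (19)-(20) plus one key matrix per method. The sketch below is illustrative, not the authors' code: the function name `framework_scores` and the toy data are ours, and the PLS key matrix W^* is omitted to keep the sketch dependency-free.

```python
import numpy as np

def framework_scores(z_obs, S, P, R, A, L):
    """Generic regression-based score estimate, Eqs. (19)-(20), for the
    first-R-variables-missing partition:
    s_hat = P#_{1:A}^T S#* L (L^T S** L)^{-1} L^T z* + P*_{1:A}^T z*."""
    S_mo = S[:R, R:]                  # S#*
    S_oo = S[R:, R:]                  # S**
    P_miss, P_obs = P[:R, :A], P[R:, :A]
    z_miss_hat = S_mo @ L @ np.linalg.solve(L.T @ S_oo @ L, L.T @ z_obs)
    return P_miss.T @ z_miss_hat + P_obs.T @ z_obs

# Key matrices of Table I for a simulated model:
rng = np.random.default_rng(3)
N, K, R, A = 300, 6, 2, 2
X = rng.normal(size=(N, 3)) @ rng.normal(size=(3, K)) + 0.2 * rng.normal(size=(N, K))
X -= X.mean(axis=0)
P = np.linalg.svd(X, full_matrices=False)[2].T
S = X.T @ X / (N - 1)
z_obs = X[0, R:]

evals, V = np.linalg.eigh(S[R:, R:])
V = V[:, ::-1]                        # eigenvectors, decreasing eigenvalues
gamma = 3
keys = {
    "KDR":                np.eye(K - R),
    "KDR with PCR":       V[:, :gamma],   # V_{1:gamma}
    "KDR with pseudoinv": V,              # V_{1:rho}; all eigenvalues positive here
    "TSR":                P[R:, :A],      # P*_{1:A}
}
estimates = {m: framework_scores(z_obs, S, P, R, A, L) for m, L in keys.items()}

# With S** full rank, keeping all eigenvectors reproduces plain KDR:
assert np.allclose(estimates["KDR"], estimates["KDR with pseudoinv"])
```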
By substituting Equation (19) into Equation (3), the estimation of the score vector s_{1:A} is obtained:

ŝ_{1:A} = P^{#T}_{1:A} S^{#*} L (L^T S^{**} L)^{-1} L^T z^* + P^{*T}_{1:A} z^*    (20)

From Equations (3) and (20) the estimation error vector is worked out as

s_{1:A} − ŝ_{1:A} = P^{#T}_{1:A} (z^# − S^{#*} L (L^T S^{**} L)^{-1} L^T z^*)    (21)

The estimation error uncertainty can be measured by the covariance matrix

Var(s_{1:A} − ŝ_{1:A}) = P^{#T}_{1:A} (S^{##} − S^{#*} L (L^T S^{**} L)^{-1} L^T S^{*#}) P^#_{1:A}    (22)
The trace of this covariance matrix is the mean square error (MSE). Given that the score vector estimator ŝ_{1:A} (Equation (20)) is unbiased (it is a linear transformation of the least squares linear predictor ẑ^#, Equation (19)), the MSE is equivalent to the prediction error sum of variances (PRESV) over all the PCA components extracted. We propose PRESV as a suitable performance index to measure the statistical efficiency of the estimators.
Equation (22) shows that the covariance matrix of the estimation error vector does not depend on the particular values of the new incomplete observation z, but on the partition that z induces in the pre-built NOC PCA model and on the choice of the key matrix L. The framework allows this covariance matrix to be worked out analytically, and hence a statistical performance criterion (PRESV) for the different framework members is straightforwardly obtained without the need for a test data set. This is one of the benefits of the proposed framework.
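A sketch of the PRESV computation from Equation (22) follows; it is illustrative only (the helper name `presv` and the simulated data are ours). As a sanity check, for KDR the analytical PRESV coincides exactly with the in-sample average squared score-estimation error, and restricting the regressors through L can never reduce the error below KDR's:

```python
import numpy as np

def presv(S, P, R, A, L):
    """Trace of the error covariance of Eq. (22),
    Var(s - s_hat) = P#^T (S## - S#* L (L^T S** L)^{-1} L^T S*#) P#,
    for the first-R-variables-missing partition."""
    S_mm, S_mo, S_oo = S[:R, :R], S[:R, R:], S[R:, R:]
    P_miss = P[:R, :A]
    M = S_mo @ L @ np.linalg.solve(L.T @ S_oo @ L, L.T @ S_mo.T)
    return np.trace(P_miss.T @ (S_mm - M) @ P_miss)

rng = np.random.default_rng(4)
N, K, R, A = 400, 6, 2, 2
X = rng.normal(size=(N, 3)) @ rng.normal(size=(3, K)) + 0.2 * rng.normal(size=(N, K))
X -= X.mean(axis=0)
P = np.linalg.svd(X, full_matrices=False)[2].T
S = X.T @ X / (N - 1)

L_kdr = np.eye(K - R)
L_tsr = P[R:, :A]

# Cross-check against the in-sample score-estimation errors (no test set needed):
Khat = S[:R, R:] @ np.linalg.solve(S[R:, R:], np.eye(K - R))
E = (X[:, :R] - X[:, R:] @ Khat.T) @ P[:R, :A]
assert np.isclose(presv(S, P, R, A, L_kdr), (E**2).sum() / (N - 1))

# KDR uses all known variables, so its PRESV is never above TSR's:
assert presv(S, P, R, A, L_kdr) <= presv(S, P, R, A, L_tsr) + 1e-12
```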
5. INDUSTRIAL EXAMPLES
The goal of this section is to compare the efficiency
(measured by the PRESV criterion) of the different regres-
sion-based scores estimation methods for new incomplete
observations, members of the framework. Three industrial
data sets are used: the first two are continuous processes; the
third one comes from a batch process.
For each method and missing data combination, the
corresponding partitions in matrices S (Equation (5)) and P
(Equation (2)) have been done. By substituting these sub-
matrices and the choice of the key matrix L (Table I) into
Equation (22), the error covariance matrix is analytically
calculated, and PRESV is straightforwardly obtained by
summing up the elements along the diagonal of this
covariance matrix.
5.1. Continuous process: mineral-sorting plant data set (SOVR)

This data set comes from a mineral-sorting plant at LKAB in Sweden and is available in the SIMCA-P software [14]. In this process, raw iron ore is divided into finer material by passing through several grinders. For illustrative purposes, out of the 12 process variables, only 8 have been selected for this study (K = 8). This data set is the same one used by Arteaga and Ferrer [1]. From the 230 observations available, 150 observations have been used for building the PCA model. The first three PCA components jointly explain 94% of the total variance of the process variables (R² = 0.94) and the predictive ability is high (Q² = 0.78).
In order to compare the mean value of the PRESV, worked out analytically from Equation (22), across the different methods derived from the regression framework discussed in Section 4, an analysis of variance (ANOVA) was run using the unknown-variables combination as a block factor. This allows the comparison of the methods to be carried out under similar conditions, increasing the discriminant ability of the comparisons. Given that in this data set the maximum number of components in the KDR with PCR or with PLS methods is K−R (the rank of X^*), it depends on the number of missing variables R. For this reason, three ANOVAs have been run, for the cases R = 1, R = 2 and R = 3 (representing 12.5%, 25.0% and 37.5% of missing data, respectively). Given the positive skewness of the PRESV in the different methods, a logarithmic transformation was applied to the PRESV variable to correct for normality.
Figure 1 displays the least significant difference (LSD) intervals for the average of the logarithm of the PRESV for each method under the three cases studied. As expected, as the number of components extracted increases, PRESV decreases for both the KDR with PCR and with PLS methods. For a small number of components extracted, the KDR method with PLS is statistically better on average than the KDR with PCR method with the same number of components, although both of them are statistically worse on average than the KDR and TSR methods (their LSD intervals do not overlap). Nevertheless, when the number of components extracted is greater than two this difference is no longer statistically significant. When the maximum number of components is extracted in each case, the KDR with PCR and with PLS methods are equivalent to the KDR method. For the most severe missing data cases studied (R > 1) there is no statistically significant difference between the average PRESV of the TSR and KDR methods. However, KDR turns out to be statistically more efficient than TSR in the case of having only one missing variable (R = 1). In any case, this is the least severe case and all the methods yield good estimations.
Given that, out of the 230 observations available, only 150 have been used for building the model, the remaining 80 observations have been used as a validation set to check the analytical results shown before. For each missing data combination and method studied, the sample PRESV was obtained from the sample estimation error vector. Results similar to those obtained analytically from Equation (22) have been found (not shown).
Up to now, we have shown how the PRESV index can be used to compare the statistical efficiency of the different regression-based methods, members of the proposed framework. Another potential use of this index is to evaluate off-line the future impact of combinations of missing data on score estimation. Those missing data sets yielding large score estimation errors are called critical combinations. In the simplest cases, these critical combinations can be detected by simple inspection of the loading matrix of the NOC PCA model. Nevertheless, in many cases this has to be done by simulation. See, for example, Arteaga and Ferrer [3].
From Equation (22), PRESV can be straightforwardly worked out in advance (even before starting the on-line process monitoring) for any given or simulated missing data combination. In this way, critical combinations can be detected without requiring real data. This is illustrated in Figure 2, using the KDR method, where the most critical combination occurs when variables X4, X5 and X6 are missing. This is due to the high correlation of variables X4 and X6 with the second component, and of variable X5 with the third component (see Table II).
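This kind of off-line scan only needs S, P and Equation (22). A hedged sketch (our own helper `presv_general`, simulated data) that handles arbitrary, not necessarily leading, sets of missing variables and ranks all three-variable combinations by PRESV:

```python
import numpy as np
from itertools import combinations

def presv_general(S, P, A, miss):
    """PRESV of the KDR estimator (Eq. (22) with L = I) for an arbitrary
    set `miss` of missing-variable indices."""
    K = S.shape[0]
    miss = list(miss)
    obs = [j for j in range(K) if j not in miss]
    S_mm = S[np.ix_(miss, miss)]
    S_mo = S[np.ix_(miss, obs)]
    S_oo = S[np.ix_(obs, obs)]
    P_miss = P[miss, :A]
    M = S_mo @ np.linalg.solve(S_oo, S_mo.T)
    return np.trace(P_miss.T @ (S_mm - M) @ P_miss)

rng = np.random.default_rng(5)
N, K, A = 400, 8, 3
X = rng.normal(size=(N, 3)) @ rng.normal(size=(3, K)) + 0.2 * rng.normal(size=(N, K))
X -= X.mean(axis=0)
P = np.linalg.svd(X, full_matrices=False)[2].T
S = X.T @ X / (N - 1)

# Rank all three-variable missing combinations before any data go missing;
# the largest PRESV values flag the critical combinations.
ranking = sorted(((presv_general(S, P, A, c), c)
                  for c in combinations(range(K), 3)), reverse=True)
worst_presv, worst_combo = ranking[0]
```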
Similar results (not shown) can be obtained calculating the
PRESV with the other framework methods.
5.2. Continuous process: high-density polyethylene data set (HDPE)

This data set comes from a petrochemical company in Spain. A commercial-scale polymerisation process produces large volumes of a polymer (high-density polyethylene) used in many familiar consumer products. We have used 61 observations with 13 process variables for building a three
Figure 1. SOVR. Least significant difference (LSD) intervals for the average of the log prediction error sum of variances, log(PRESV), for each method under the three cases studied (R = 1, 2, 3). PCR# means known data regression (KDR) with principal component regression (PCR) with # components. PLS# means KDR with partial least squares (PLS) with # components.
Figure 2. Mineral-sorting plant data set (SOVR). Prediction error sum of variances (PRESV) calculated with the KDR method for the different one, two and three missing variables combinations.
components PCA model with R² = 0.85 and Q² = 0.71. The results obtained with this data set generally match the previous results from the SOVR data set, although in this case the KDR method and some of its approximations with PCR and PLS are statistically more efficient (in terms of the PRESV criterion) than the TSR method for all the missing data combinations studied. This is shown in Figure 3, where the LSD intervals for the average PRESV for the different methods are displayed for several missing variables combinations. To ease the interpretation of the results, only the three and five missing variables combinations (representing 23% and 38% of missing data, respectively) are shown. In this case, a logarithmic transformation of PRESV was not needed.
5.3. Batch process: sequencing batch reactor data set (SBR)

This example has been chosen to illustrate a case where the KDR method is not applicable, due to S^{**} being singular. This is not an uncommon case but a very frequent problem in modern industries: for example, in batch processes or in any process with more variables than observations.

The data set comes from a sequencing batch reactor (SBR) operated under anaerobic/aerobic conditions for biological phosphorus removal. During the batch duration, 4 process variables were measured at regular intervals for a total of 265 time periods. Following Nomikos and MacGregor's approach [15], an unfold-PCA model was built on the unfolded matrix of 72 rows (NOC batches) by 1060 columns (4 variables by 265 time periods). The first eight PCA components jointly explain 84.5% of the total variance of the process variables (R² = 0.845) and the predictive ability is high (Q² = 0.80).
For real-time monitoring of a new batch run, a natural problem to overcome with Nomikos and MacGregor's approach is that at time k (k < K) there are K − k unknown samples that should be estimated in order to calculate a score value. In this application we deal with the on-line estimation of the scores vector for new batches at five times k in the batch evolution: k1 = 50, k2 = 100, k3 = 150, k4 = 200, k5 = 250. Figure 4 displays the values of PRESV (worked out from Equation (22)) for the different methods at the different time points studied. As the new batch evolves, the length of the unknown trajectory reduces and, consequently, the score estimation accuracy increases. This is the reason why PRESV decreases with time for any given estimation method.
Figure 4 shows that, as expected, the more components extracted (for both KDR with PCR and with PLS), the lower the PRESV obtained. KDR with PLS tends to perform better than KDR with PCR for the same number of components. The performance of the TSR method is quite similar to that of the other KDR approximations with a high number of components extracted.
Table II. SOVR; loadings matrix for the NOC PCA model

        X1      X2      X3      X4      X5      X6      X7      X8
p1   −0.446  −0.441  −0.446  −0.184  −0.020  −0.051  −0.432  −0.429
p2    0.084   0.076   0.064  −0.664  −0.217  −0.695   0.100   0.045
p3   −0.017  −0.086   0.031  −0.002  −0.945   0.292  −0.028   0.112
Figure 3. High-density polyethylene data set (HDPE). LSD intervals for the average of the PRESV for each method assuming all the possible three (solid line) and five (dotted line) missing variables combinations. PCR# means KDR with PCR with # components. PLS# means KDR with PLS with # components.
It is important to point out that, for a given time k, calculating PRESV from Equation (22) requires working out the key matrix L. As shown in Table I, this key matrix takes different expressions for the different framework members. In the TSR method, L is the submatrix made up of the first k·J rows of the NOC PCA loading matrix P_{1:A} (J being the number of variables measured at each time period), and this is very easy to work out at a low computational cost. Nevertheless, in the other approximations to the KDR method, working out L requires either extracting eigenvectors of the S^{**} matrix corresponding to the partition induced by the missing trajectory at every time point k (KDR with PCR or with pseudoinverse) or building as many PLS models as time periods k of the batch run to calculate the W^* matrix (KDR with PLS). This makes the latter methods more complex and computationally more costly than the TSR method.
6. CONCLUSIONS
We have proposed a framework for several regression-based missing data estimation methods used in on-line MSPC with incomplete observations, assuming an existing in-control PCA model. The framework is based on the reconstruction of the incomplete observation through the missing variables estimation obtained from the regression model X^# = (X^* L) B + U (Equation (18)), expressed as a function of a key matrix L. Since all the different scores estimation methods, members of the regression framework, are only variants of a single model, all of them can be computed by writing one program code and varying the key matrix L (Table I) according to the chosen method. This simplifies the design of flexible computer programs for missing data estimation methods.
All of the methods can be understood as different approximations to the KDR method (even the TSR method). These approximations can be convenient when KDR suffers from ill-conditioning problems, and they are necessary when singularity problems appear (as in on-line batch MSPC and in processes with more variables than observations).
From the proposed framework, the analytical derivation of the expression for the covariance matrix of the estimation error, Var(s_{1:A} − ŝ_{1:A}) (Equation (22)), for the different regression-based estimation methods, members of the framework, is straightforward. We propose to use the PRESV over all the PCA components extracted as a suitable performance index, negatively correlated with the statistical efficiency of the estimators.
Under normality, out of all the regression-based methods, KDR is the best according to the PRESV criterion. Approximations of KDR using PLS or PCR with a large number of components can show better statistical performance than TSR. In any case, the practical advantage of the latter is that its computational cost is low, because there is no need to define the number of components to be extracted.
One of the advantages of the PRESV statistical perfor-
mance index is that it does not depend on the particular
Figure 4. Sequencing batch reactor data set (SBR). PRESV for the different methods at different time points k. PCR# means KDR with PCR with # components. PLS# means KDR with PLS with # components.
values of a given new incomplete observation z, but only on the partition that z induces in the pre-built NOC PCA model and on the choice of the key matrix L. The practical benefit of this property is that practitioners do not have to split their data set and use a test set for comparing the performance of different estimation methods; instead, PRESV can be straightforwardly calculated in advance (even before starting the on-line process monitoring) from Equation (22) for any given or simulated missing data combination. In addition, this can be very useful in practice for the early detection of harmful missing data combinations (those causing large errors in the score estimations). Such studies can assist the design of sensor maintenance schedules and give some insight into sensor redundancy.
In addition, the general expression for the covariance matrix of the estimation error can be very useful for characterising the uncertainty that missing data propagate through the scores and other monitoring statistics (Hotelling T² and squared prediction error). This is a key point in on-line MSPC with missing data and deserves future work.
Acknowledgements

We thank the CALAGUA research group (Universidad de Valencia – Universidad Politécnica de Valencia, Spain) for providing us with the SBR data set. Thanks also to the Editorial Board Member for the helpful comments. This research was supported by the Spanish Government (MICYT) and the European Union (RDE funds) under grant DPI2001-2749-C02-01.
APPENDIX. EQUIVALENT MODELS FOR THE KDR AND TSR METHODS
Let us consider the NOC PCA model (Equation (1)) and a new incomplete observation that induces the partition in X, $X = [X^{\#} \; X^{*}]$, discussed in Section 2. The elements of the covariance matrix S (Equation (5)) can be expressed as

$$S^{**} = \frac{X^{*T}X^{*}}{N-1} = \frac{P^{*}T^{T}TP^{*T}}{N-1} = P^{*}HP^{*T}, \qquad S^{\#*} = P^{\#}HP^{*T}, \qquad S^{*\#} = P^{*}HP^{\#T}$$

where

$$H = \begin{bmatrix} H_{1:A} & 0 \\ 0 & H_{A+1:H} \end{bmatrix}$$

is the H-order diagonal covariance matrix of the latent variables from the NOC PCA model.
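These partitioned-covariance identities can be checked numerically. The following minimal NumPy sketch (my own illustration, not code from the paper) fits a full PCA model to simulated centred data and verifies that the sample covariances of the measured and missing blocks equal the loading-based expressions above; all names and data are assumptions.

```python
import numpy as np

# Verify S** = P* H P*^T and S#* = P# H P*^T on simulated data,
# where H is the diagonal covariance of the latent variables.
rng = np.random.default_rng(0)
N, K = 100, 6
X = rng.normal(size=(N, K))
X -= X.mean(axis=0)                       # centred NOC data, X = T P^T

U, sv, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt.T                                  # full loading matrix (K x K)
H = np.diag(sv**2 / (N - 1))              # T^T T / (N-1), exactly diagonal

miss = np.array([0, 3])                   # '#' block: missing variables
obs = np.setdiff1d(np.arange(K), miss)    # '*' block: measured variables
P_obs, P_mis = P[obs], P[miss]

S_oo = X[:, obs].T @ X[:, obs] / (N - 1)  # S** computed from the data
S_mo = X[:, miss].T @ X[:, obs] / (N - 1) # S#* computed from the data
assert np.allclose(S_oo, P_obs @ H @ P_obs.T)
assert np.allclose(S_mo, P_mis @ H @ P_obs.T)
```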
If it is assumed that the missing variables are estimated first from the measured variables from Equation (10),

$$X^{\#} = X^{*}B_{3} + U_{3}$$

the least squares estimator of the matrix of coefficients $B_{3}$ is

$$\hat{B}_{3} = \left(X^{*T}X^{*}\right)^{-1}X^{*T}X^{\#} = \left(S^{**}\right)^{-1}S^{*\#}$$

and thus the predicted missing variables for the new incomplete observation can be expressed as

$$\hat{z}^{\#} = S^{\#*}\left(S^{**}\right)^{-1}z^{*} \tag{24}$$
If the missing variables in Equation (3) are replaced by their estimated values from Equation (24), the following estimation of the score vector $s_{1:A}$ results:

$$\hat{s}_{1:A} = P^{\#T}_{1:A}\hat{z}^{\#} + P^{*T}_{1:A}z^{*} = P^{\#T}_{1:A}S^{\#*}\left(S^{**}\right)^{-1}z^{*} + P^{*T}_{1:A}z^{*} \tag{25}$$
but

$$\begin{aligned}
P^{\#T}_{1:A}S^{\#*} &= P^{\#T}_{1:A}P^{\#}HP^{*T} \\
&= P^{\#T}_{1:A}\left[P^{\#}_{1:A} \;\; P^{\#}_{A+1:H}\right]
\begin{bmatrix} H_{1:A} & 0 \\ 0 & H_{A+1:H} \end{bmatrix}
\begin{bmatrix} P^{*T}_{1:A} \\ P^{*T}_{A+1:H} \end{bmatrix} \\
&= \left[I - P^{*T}_{1:A}P^{*}_{1:A} \;\; -P^{*T}_{1:A}P^{*}_{A+1:H}\right]
\begin{bmatrix} H_{1:A}P^{*T}_{1:A} \\ H_{A+1:H}P^{*T}_{A+1:H} \end{bmatrix} \\
&= H_{1:A}P^{*T}_{1:A} - P^{*T}_{1:A}P^{*}_{1:A}H_{1:A}P^{*T}_{1:A} - P^{*T}_{1:A}P^{*}_{A+1:H}H_{A+1:H}P^{*T}_{A+1:H} \\
&= H_{1:A}P^{*T}_{1:A} - P^{*T}_{1:A}S^{**}
\end{aligned} \tag{26}$$
and hence, by substituting Equation (26) in Equation (25), we can write

$$\hat{s}_{1:A} = P^{\#T}_{1:A}\hat{z}^{\#} + P^{*T}_{1:A}z^{*} = \left(H_{1:A}P^{*T}_{1:A} - P^{*T}_{1:A}S^{**}\right)\left(S^{**}\right)^{-1}z^{*} + P^{*T}_{1:A}z^{*} = H_{1:A}P^{*T}_{1:A}\left(S^{**}\right)^{-1}z^{*} \tag{27}$$

which matches the expression for the CMR (KDR) estimator (Equation (7)).
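This equivalence is easy to confirm numerically. The following minimal NumPy sketch (my own illustration, not code from the paper) shows that imputing the missing variables by regression (Equation (24)) and then projecting (Equation (25)) yields exactly the one-step estimator of Equation (27); the simulated data and variable names are assumptions.

```python
import numpy as np

# Two-step route (Eq. (24) then Eq. (25)) versus the one-step
# CMR/KDR score estimator of Eq. (27).
rng = np.random.default_rng(0)
N, K, A = 100, 6, 2
X = rng.normal(size=(N, K))
X -= X.mean(axis=0)                              # centred NOC data

U, sv, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt.T
H = np.diag(sv**2 / (N - 1))

miss = np.array([0, 3])                          # missing variables ('#')
obs = np.setdiff1d(np.arange(K), miss)           # measured variables ('*')
P_obs, P_mis = P[obs], P[miss]
S_oo = P_obs @ H @ P_obs.T                       # S**
S_mo = P_mis @ H @ P_obs.T                       # S#*

z_obs = X[0, obs]                                # measured part of a new z

# Two-step route: impute z# (Eq. (24)), then project (Eq. (25))
z_mis_hat = S_mo @ np.linalg.solve(S_oo, z_obs)
s_two_step = P_mis[:, :A].T @ z_mis_hat + P_obs[:, :A].T @ z_obs

# One-step CMR/KDR form (Eq. (27))
s_direct = H[:A, :A] @ P_obs[:, :A].T @ np.linalg.solve(S_oo, z_obs)
assert np.allclose(s_two_step, s_direct)
```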
Alternatively, if the missing variables are estimated first from the trimmed scores (Equation (11)),

$$X^{\#} = T^{*}_{1:A}B_{4} + U_{4}$$

solving for $B_{4}$, the least squares estimator matrix is expressed as

$$\hat{B}_{4} = \left(T^{*T}_{1:A}T^{*}_{1:A}\right)^{-1}T^{*T}_{1:A}X^{\#} = \left(P^{*T}_{1:A}X^{*T}X^{*}P^{*}_{1:A}\right)^{-1}P^{*T}_{1:A}X^{*T}X^{\#} = \left(P^{*T}_{1:A}S^{**}P^{*}_{1:A}\right)^{-1}P^{*T}_{1:A}S^{*\#}$$

and the estimation of the unknown variables for the new observation results in

$$\hat{z}^{\#} = S^{\#*}P^{*}_{1:A}\left(P^{*T}_{1:A}S^{**}P^{*}_{1:A}\right)^{-1}P^{*T}_{1:A}z^{*} \tag{28}$$
By substituting Equation (28) in Equation (3), the following estimation of the score vector $s_{1:A}$ is obtained:

$$\hat{s}_{1:A} = P^{\#T}_{1:A}\hat{z}^{\#} + P^{*T}_{1:A}z^{*} = P^{\#T}_{1:A}S^{\#*}P^{*}_{1:A}\left(P^{*T}_{1:A}S^{**}P^{*}_{1:A}\right)^{-1}P^{*T}_{1:A}z^{*} + P^{*T}_{1:A}z^{*} \tag{29}$$
But, from Equation (26), we can rewrite Equation (29) as

$$\hat{s}_{1:A} = \left(H_{1:A}P^{*T}_{1:A} - P^{*T}_{1:A}S^{**}\right)P^{*}_{1:A}\left(P^{*T}_{1:A}S^{**}P^{*}_{1:A}\right)^{-1}P^{*T}_{1:A}z^{*} + P^{*T}_{1:A}z^{*} = H_{1:A}P^{*T}_{1:A}P^{*}_{1:A}\left(P^{*T}_{1:A}S^{**}P^{*}_{1:A}\right)^{-1}P^{*T}_{1:A}z^{*} \tag{30}$$
yielding the expression for the TSR estimator (Equation (9)).
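The TSR equivalence can be checked the same way. The following minimal NumPy sketch (my own illustration, not code from the paper) verifies that regressing the missing block on the trimmed scores (Equation (28)) and projecting (Equation (29)) reproduces the one-step TSR estimator of Equation (30); data and names are assumptions.

```python
import numpy as np

# Two-step route (Eq. (28) then Eq. (29)) versus the one-step
# TSR score estimator of Eq. (30).
rng = np.random.default_rng(0)
N, K, A = 100, 6, 2
X = rng.normal(size=(N, K))
X -= X.mean(axis=0)                                   # centred NOC data

U, sv, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt.T
H = np.diag(sv**2 / (N - 1))

miss = np.array([0, 3])
obs = np.setdiff1d(np.arange(K), miss)
P_obs, P_mis = P[obs], P[miss]
S_oo = P_obs @ H @ P_obs.T                            # S**
S_mo = P_mis @ H @ P_obs.T                            # S#*
M = P_obs[:, :A].T @ S_oo @ P_obs[:, :A]              # P*^T_{1:A} S** P*_{1:A}

z_obs = X[0, obs]                                     # measured part of z

# Two-step route: impute z# from the trimmed scores, then project
z_mis_tsr = S_mo @ P_obs[:, :A] @ np.linalg.solve(M, P_obs[:, :A].T @ z_obs)
s_two = P_mis[:, :A].T @ z_mis_tsr + P_obs[:, :A].T @ z_obs

# One-step TSR form (Eq. (30))
s_tsr = (H[:A, :A] @ P_obs[:, :A].T @ P_obs[:, :A]
         @ np.linalg.solve(M, P_obs[:, :A].T @ z_obs))
assert np.allclose(s_two, s_tsr)
```

Note that TSR only requires inverting the A-by-A matrix `M`, which is why it avoids the singularity problems that (S**)^{-1} can suffer with many missing variables.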
REFERENCES
1. Little RJA, Rubin DB. Statistical Analysis with Missing Data. Wiley: New York, 1987.
2. Nelson PRC, Taylor PA, MacGregor JF. Missing data methods in PCA and PLS: score calculations with incomplete observations. Chemometrics Intell. Lab. Syst. 1996; 35: 45–65.
3. Arteaga F, Ferrer A. Dealing with missing data in MSPC: several methods, different interpretations, some examples. J. Chemometrics 2002; 16: 408–418.
4. Wold S, Albano C, Dunn WJ, Esbensen K, Hellberg S, Johansson E, Sjöström M. Pattern recognition: finding and using regularities in multivariate data. In Food Research and Data Analysis, Martens H, Russwurm H Jr (eds). Applied Science Publishers: London and New York, 1983; 183–185.
5. Martens H, Næs T. Multivariate Calibration. Wiley: New York, 1989.
6. Wise BM, Ricker NL. Recent advances in multivariate statistical process control: improving robustness and sensitivity. IFAC Int. Symp., ADCHEM '91, Toulouse, 1991; 125–130.
7. Walczak B, Massart DL. Dealing with missing data: Part I. Chemometrics Intell. Lab. Syst. 2001; 58: 15–27.
8. Nelson PRC. Treatment of missing measurements in PCA and PLS models. Ph.D. Dissertation, Department of Chemical Engineering, McMaster University, Hamilton, Ontario, Canada, 2002.
9. García-Muñoz S, Kourti T, MacGregor JF. Model predictive monitoring for batch processes. Ind. Eng. Chem. Res. 2004; 43: 5929–5941.
10. Jackson JE. A User's Guide to Principal Components. Wiley: New York, 1991.
11. Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometrics Intell. Lab. Syst. 1987; 2: 37–52.
12. Höskuldsson A. PLS regression methods. J. Chemometrics 1988; 2: 211–228.
13. De Jong S. SIMPLS: an alternative approach to partial least squares regression. Chemometrics Intell. Lab. Syst. 1993; 18: 251–263.
14. SIMCA-P 8.0: User Guide and Tutorial. Umetrics AB: Umeå, 1999.
15. Nomikos P, MacGregor JF. Multivariate SPC charts for monitoring batch processes. Technometrics 1995; 37: 41–59.