
Chemometrics and Intelligent Laboratory Systems 101 (2010) 38–42


How to simulate normal data sets with the desired correlation structure

Francisco Arteaga a,⁎, Alberto Ferrer b

a Catholic University of Valencia San Vicente Mártir, Valencia, Spain
b Universidad Politécnica de Valencia, Valencia, Spain

⁎ Corresponding author. E-mail addresses: [email protected] (F. Arteaga), [email protected] (A. Ferrer).

0169-7439/$ – see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2009.12.003

Article info

Article history: Received 26 April 2009. Received in revised form 22 December 2009. Accepted 24 December 2009. Available online 6 January 2010.

Keywords: Simulation; Multivariate normal; Singular value decomposition; Eigenvalues

Abstract

The Cholesky decomposition is a widely used method to draw samples from a multivariate normal distribution with a non-singular covariance matrix. In this work we introduce a simple method based on the singular value decomposition (SVD) to simulate multivariate normal data even if the covariance matrix is singular, which is often the case in chemometric problems. The covariance matrix can be specified by the user or can be generated by specifying a subset of its eigenvalues. The latter can be an advantage for simulating data sets with a particular latent structure. This can be useful for testing the performance of chemometric methods with data sets matching the theoretical conditions for their applicability; checking their robustness when the hypothesized properties fail; or generating data from multi-stage or multi-phase processes.



1. Introduction

Chemometric data are usually multivariate in nature, and their correlation structures affect the performance of the methods used. This justifies the practical usefulness of procedures to simulate data sets with a pre-specified correlation structure.

The Cholesky decomposition [1] is generally used to simulate samples from a multivariate normal distribution with a non-singular covariance matrix. In this work, we introduce a singular value decomposition (SVD) based method that deals with both non-singular and singular covariance matrices (approach 1).

The correlation structure for a data set can also be specified by the eigenvalues of the covariance matrix (instead of the covariance matrix itself), and the goal is then to build a covariance matrix matching the given eigenvalues (approach 2).

The paper is organized as follows: Section 2 introduces both approaches; in Section 3 their performance is illustrated by using several examples; and Section 4 provides an evaluation study of approach 2. The utility of both approaches is discussed in Section 5, and some conclusions are drawn. The MatLab code for both algorithms is detailed in Appendix A.

2. Singular value decomposition (SVD) based method

The key idea of the method is based on the following known property of multivariate normal distributions. Let x be a random vector following a K-dimensional normal distribution with mean vector μ and covariance matrix Σ, i.e. x ∼ N(μ, Σ), and let C be an L×K matrix. If y = Cx is a linear transformation of x, then y follows an L-dimensional normal distribution with mean vector Cμ and covariance matrix CΣC^T [2]:

y ∼ N(Cμ, CΣC^T)

Therefore, by properly choosing the matrices Σ and C, the desired correlation structure can be obtained.
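This property can be checked numerically. The following sketch (ours, not from the paper; μ, Σ and C are arbitrary illustrative choices) draws a large sample of x and verifies that the sample moments of y = Cx approach Cμ and CΣC^T:

```python
import numpy as np

# Empirical check of the property y = Cx ~ N(C mu, C Sigma C^T).
# mu, Sigma and C are arbitrary illustrative choices (L=2, K=3).
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
C = np.array([[1.0, 2.0, 0.0],
              [0.5, -1.0, 1.0]])

x = rng.multivariate_normal(mu, Sigma, size=200_000)  # N x K draws of x
y = x @ C.T                                           # each row is y = C x

# Sample mean and covariance of y should approach C mu and C Sigma C^T.
print(np.allclose(y.mean(axis=0), C @ mu, atol=0.1))
print(np.allclose(np.cov(y, rowvar=False), C @ Sigma @ C.T, atol=0.1))
```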

The user can specify the desired correlation structure by fixing the covariance matrix (approach 1) or by fixing a subset of the eigenvalues of the covariance matrix (approach 2).

2.1. Approach 1

This approach generates an N by K matrix Y with a sample covariance matrix exactly matching the one specified by the user.

Let S be the specified K by K covariance matrix, with r = rank(S). The matrix S can be factorized by singular value decomposition as S = VDV^T = VD^{1/2}(VD^{1/2})^T, where D^{1/2} is a diagonal matrix containing the square roots of the values of the diagonal matrix D, i.e. the square roots of the eigenvalues of S. Since the rank of S is r, the remaining K−r elements of the diagonal of D^{1/2} are zeros. Then S = V_{1:r} D^{1/2}_{1:r} (V_{1:r} D^{1/2}_{1:r})^T, where V_{1:r} contains the first r columns of V, and D^{1/2}_{1:r} the first r rows and r columns of D^{1/2}.

Let N be the desired number of samples (N > r). Let X be an N by r matrix with independent and normally distributed (zero mean and unit variance) random values, which can be generated e.g. by using the Box–Muller transform. The matrix X can be centred by columns and factorized by QR decomposition as X = QR to obtain a column-wise orthogonal matrix Q. Note that QR factorization provides a substantial reduction in computing time compared to alternative approaches such as SVD. Such a saving is particularly relevant if N ≫ r.

The desired data set can be generated as Y = √(N−1) Q D^{1/2}_{1:r} V^T_{1:r}.

X is generated as a multivariate normal data set with zero mean and unit variance, and Q is a linear transformation of X, i.e. Q = XR^{−1}; hence Q is a multivariate normal data set with zero mean and covariance matrix equal to I/(N−1). As Y is a linear transformation of Q, and Y^T Y/(N−1) = S, it follows that Y is a multivariate normal data set with zero mean and covariance matrix equal to S.

Note that N > r is needed to fulfil the requirement that r = rank(S). Otherwise it would be impossible for the sample covariance matrix of the generated data set to exactly match S.

The MatLab code for this algorithm (the randnm function) is outlined in Appendix A.
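For readers outside MatLab, the construction above can be sketched in Python with NumPy as follows. This is an illustrative re-implementation under a name of our own (randnm_sketch), not the authors' code:

```python
import numpy as np

def randnm_sketch(S, N, seed=None):
    """Sketch of approach 1: N samples whose sample covariance is exactly S.

    S may be singular, but N must exceed r = rank(S). Illustrative Python
    re-implementation; not the authors' MatLab randnm function."""
    rng = np.random.default_rng(seed)
    S = np.asarray(S, dtype=float)
    K = S.shape[0]
    V, d, _ = np.linalg.svd(S)                      # S = V diag(d) V^T
    r = int(np.sum(d > K * np.finfo(float).eps * d[0]))  # numerical rank
    D12 = np.diag(np.sqrt(d[:r]))                   # D^{1/2}, r x r block
    X = rng.standard_normal((N, r))
    X = X - X.mean(axis=0)                          # centre by columns
    Q, _ = np.linalg.qr(X)                          # column-wise orthogonal Q
    return np.sqrt(N - 1) * Q @ D12 @ V[:, :r].T    # Y = sqrt(N-1) Q D^{1/2} V^T

# The covariance matrix from Example 1 of the paper:
S = np.array([[1.0, 0.3, 0.8, 0.2],
              [0.3, 1.0, 0.7, 0.8],
              [0.8, 0.7, 1.0, 0.7],
              [0.2, 0.8, 0.7, 1.0]])
Y = randnm_sketch(S, 10, seed=0)
print(np.allclose(np.cov(Y, rowvar=False), S))  # sample covariance matches S
```

The orthogonality of Q is what makes the sample covariance exact rather than approximate, mirroring the argument in the text.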

Fig. 1 illustrates this approach.

Note that this approach can generate different data sets Yi with exactly the same given covariance matrix S. This is a big difference from other programs, e.g. the mvnrnd function included in the MatLab Statistics toolbox. The mvnrnd function generates samples from a multivariate normal distribution with a pre-specified symmetric positive definite or semidefinite covariance matrix S, but the generated samples have sample covariance matrices that do not exactly match the desired covariance matrix S. The advantage of the proposed randnm function over the MatLab mvnrnd function is due to the orthogonality of the transformation of X into the matrix Q, which guarantees that the sample covariance matrix of the generated data is exactly the one expected.

In the R language [3] there is the mvrnorm function, which permits both options by using the logical parameter named empirical. When empirical = TRUE, the sample covariance matrix of the generated sample exactly matches the covariance matrix specified by the user, as obtained with our approach 1. Note that this function also handles singular covariance matrices.

2.2. Approach 2

In this section we propose a more straightforward way to specify a particular latent structure: fixing a subset of the eigenvalues of the covariance matrix. This new approach can be considered as a way to simulate a data set or, alternatively, a way to build a covariance matrix that matches a desired latent structure, defined from a subset of the eigenvalues of the covariance matrix.

Let {v_j}, j = 1,…,K, be the pre-specified variances for the K columns, p = min(N−1, K), and {λ_a}, a = 1,…,A, the desired subset of A eigenvalues, with A < p and λ_1 + … + λ_A ≤ v_1 + … + v_K. The remaining p−A eigenvalues can be generated as random non-negative values that sum to (v_1 + … + v_K) − (λ_1 + … + λ_A) (note that if K ≥ N we must be careful not to specify more than N−1 non-null eigenvalues).

Fig. 1. Scheme for approach 1: simulating an N by K multivariate normal data set with a pre-specified covariance matrix.

Let L be a diagonal p by p matrix with diagonal values equal to the square roots of the eigenvalues, in descending order, i.e. l_{a,a} = √λ_a.

The proposed algorithm is outlined in the following:

Step 0: generate the X matrix with N by K independent and normally distributed (zero mean and unit variance) random values.
Step 1: scale X column-wise to have the desired variances.
Step 2: factorize X by SVD, resulting in X = U_X D_X V_X^T.
Step 3: call U the N by K matrix U_X, autoscaled by columns.
Step 4: generate the data matrix as X = U L V_X^T.
Step 5: repeat Steps 1 to 4 until convergence on X.

The X matrix in Step 4 has the desired eigenvalues, but its columns may not have the desired variances imposed in Step 1 (note that U_X D_X is replaced by U L). Each time the algorithm goes back to Step 1, the X matrix is scaled again to get the desired variances, at the expense of changing the desired eigenvalues. The iterative nature of this approach means that, if the algorithm converges, the differences between the desired and obtained variances and eigenvalues shrink at each iteration. If this is the case, the X matrix in Step 4 will converge to a matrix whose columns have the pre-specified variances.
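Steps 0 to 5 can be sketched compactly in Python with NumPy. This is our own illustrative re-implementation (the name simdataset_sketch, the defaults, and the added iteration cap are ours, not the authors' MatLab interface):

```python
import numpy as np

def simdataset_sketch(N, K, eig_val, var=None, seed=None, tol=1e-10, max_iter=1000):
    """Sketch of approach 2: fix a subset of eigenvalues of the covariance."""
    rng = np.random.default_rng(seed)
    var = np.ones(K) if var is None else np.asarray(var, dtype=float)
    sd, TV = np.sqrt(var), var.sum()
    A, p = len(eig_val), min(N - 1, K)
    assert A <= p and np.sum(eig_val) <= TV
    # Complete the remaining non-null eigenvalues with random non-negative
    # values summing to TV - sum(eig_val), then sort in descending order.
    nal = K - A if N > K else N - 1 - A
    alea = rng.random(nal)
    eigs = np.concatenate([eig_val, (TV - np.sum(eig_val)) * alea / alea.sum()])
    L = np.sqrt(np.sort(eigs)[::-1])                    # target singular values
    X = rng.standard_normal((N, K))
    X = X - X.mean(axis=0)                              # Step 0: centred normal data
    for _ in range(max_iter):
        Xs = X * (sd / X.std(axis=0, ddof=1))           # Step 1: impose variances
        U, _, Vt = np.linalg.svd(Xs, full_matrices=False)  # Step 2: SVD
        U, Vt = U[:, :len(L)], Vt[:len(L)]
        U = U / U.std(axis=0, ddof=1)                   # Step 3: autoscale U
        Xnew = (U * L) @ Vt                             # Step 4: impose eigenvalues
        dif = np.mean((X - Xnew) ** 2)
        X = Xnew                                        # Step 5: iterate
        if dif <= tol:
            break
    return X, np.cov(X, rowvar=False)

# Mimicking Example 2 of the paper: variances {4,3,2,1}, largest eigenvalues {6,3}.
X, S = simdataset_sketch(10, 4, [6.0, 3.0], var=[4.0, 3.0, 2.0, 1.0], seed=0)
ev = np.sort(np.linalg.eigvalsh(S))[::-1]
print(np.round(ev[:2], 4))      # largest two eigenvalues of the covariance
print(np.round(np.diag(S), 2))  # column variances, close to 4, 3, 2, 1
```

By construction the output of Step 4 carries the specified eigenvalues exactly, while the variances are matched only at convergence, as the text explains.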

This approach also provides a multivariate normal data set. This derives from the fact that in Step 2, U_X is a linear transformation of X, i.e. U_X = X(D_X V_X^T)^{−1}, and in Step 4 the new X matrix is built as a linear transformation of U_X. In the first iteration X is generated as a multivariate normal data set; therefore all the U_X and X matrices worked out through all iterations are linear transformations of a multivariate normal distribution.

In Fig. 2 the scheme for this algorithm is shown.

Fig. 2. Scheme for approach 2: simulating an N by K multivariate normal data set with a pre-specified set of eigenvalues for the covariance matrix.

Note that a given set of eigenvalues does not univocally determine either a data set or a particular covariance matrix. Therefore, this approach allows simulating different data sets Yi having different covariance matrices Si with the same subset of eigenvalues {λ_a}, a = 1,…,A. It must also be noted that with this approach one or several eigenvalues can be set to zero, yielding a singular covariance matrix.

The MatLab code for this algorithm (the simdataset function) isgiven in Appendix A.

3. Examples

In this section the performance of both approaches is illustrated by using three examples.

Example 1: two different data sets from the same covariance matrix.

If we specify the covariance matrix

S = [ 1.0  0.3  0.8  0.2
      0.3  1.0  0.7  0.8
      0.8  0.7  1.0  0.7
      0.2  0.8  0.7  1.0 ],

we can simulate two N=10 by K=4 data sets whose covariance matrix is S, as shown in Table 1.

The MatLab command for this example is [Y]=randnm(S,10), with S=[1,0.3,0.8,0.2;0.3,1,0.7,0.8;0.8,0.7,1,0.7;0.2,0.8,0.7,1].
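As a quick sanity check (ours, not part of the paper), one can verify numerically that this S is positive definite, so r = rank(S) = 4 and any N > 4 is admissible:

```python
import numpy as np

# Check (ours) that the Example 1 covariance matrix is positive definite,
# so r = rank(S) = 4 and the requirement N > r is met with N = 10.
S = np.array([[1.0, 0.3, 0.8, 0.2],
              [0.3, 1.0, 0.7, 0.8],
              [0.8, 0.7, 1.0, 0.7],
              [0.2, 0.8, 0.7, 1.0]])
eig = np.linalg.eigvalsh(S)
print(eig.min() > 0)             # all eigenvalues positive
print(np.linalg.matrix_rank(S))  # full rank: 4
```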

Example 2: two different data sets from the same subset of eigenvalues, but different covariance matrices.

In Table 2 we show two different simulated N=10 by K=4 full-rank data sets generated by specifying the column variances as {4, 3, 2, 1} and the largest two eigenvalues as {6, 3}. As N > K, there are K=4 non-null eigenvalues, which sum to 10 (the sum of the variances).

The MatLab command for this example is: [X,S]=simdataset(10,4,[6,3],[4,3,2,1]).

Example 3: a data set with more variables K than observations N.

Table 3 displays a simulated N=5 by K=7 centred and autoscaled by columns data set, obtained by specifying only two eigenvalues, {4, 2}. As N < K, there are only N−1=4 non-null eigenvalues, which sum to 7.

The MatLab command for this example is: [X,S]=simdataset(5,7,[4,2],0).

4. Evaluation study of approach 2

Given the iterative nature of approach 2, in this section an evaluation study of this approach is carried out. Five different matrix sizes were considered, and for each one, four different combinations of pre-specified eigenvalues were evaluated. For each of the twenty matrix-size × eigenvalue combinations, approach 2 was run 100 times, and the number of iterations and the execution time, in seconds, were recorded. Calculations were run on a 2.16 GHz, 2 GB RAM Intel Core Duo computer with a 32-bit operating system. Table 4 shows the statistical summary of the results. In general terms, the algorithm is quite fast; even when dealing with short and fat matrices (i.e. 10×1000) it only takes about 1 s to provide the results.

Table 1. Two N=10 by K=4 simulated data sets with the same covariance matrix.

First simulated data set:
 0.2343  −0.2699   0.3141   0.3536
−0.6358  −0.5915  −0.3103   0.3915
−0.2317  −0.0573  −0.9653  −1.3790
 1.5666  −0.1161   1.2903  −0.4921
−0.1128   0.0664   0.0016   0.2747
 1.0426   1.1709   1.1946   1.2658
 1.1200   0.8266   0.8510   0.5056
−0.5909  −2.2575  −1.6937  −1.8608
−1.6798   0.0435  −1.0324  −0.1378
−0.7126   1.1849   0.3500   1.0785

5. Discussion and conclusions

Simulated data sets with desired statistical properties are useful for the development of new chemometric methods. They allow us to test the performance of the new methods with data sets matching the theoretical conditions for their applicability. In addition, it is also interesting to test the robustness of the chemometric methods when the hypothesized properties fail.

In this paper we propose a method with two approaches to simulate multivariate normal data sets with a desired correlation structure. In the first approach the user specifies the desired covariance matrix (which can be singular), yielding a data set with a sample covariance matrix exactly matching the one specified by the user. In the second approach the user specifies the correlation structure by fixing a subset of the eigenvalues of the covariance matrix and the variances of the variables of the data set, yielding a column-wise centred data set.

The advantage of the first approach proposed in this paper over the widely known Cholesky decomposition is that the former deals not only with non-singular but also with singular covariance matrices.

Other methods are available that handle singular covariance matrices, such as the mvnrnd function in the MatLab Statistics toolbox or the mvrnorm function in R. The advantage of the proposed randnm function over the MatLab mvnrnd function lies in the orthogonalisation of the matrix X into the matrix Q, which guarantees that the sample covariance matrix of the generated data is exactly the one expected. This is a benefit of our approach 1 for MatLab users.

The second approach, the simdataset function, provides a flexible way to generate a particular correlation structure from a given latent structure, and this is the main novelty of this paper.

To illustrate the utility of the second approach we can think of statistical methods whose performance depends on the eigenvalues of the covariance matrix. E.g., Arteaga and Ferrer [4] study the estimation of the scores vector for an incomplete new observation from a fixed and known PCA model and conclude that, for various estimation methods, the relative differences among the eigenvalues of the PCA model influence the estimation error, in the sense that the more similar the values, the higher the estimation error. To test this it is useful to simulate different data sets that, with the same number of principal components, have different distributions of the eigenvalues, yielding different PCA models.

Arteaga and Ferrer [5] use approach 2 to test the performance of several methods for estimating the mean vector and the covariance matrix of incomplete data sets, by simulating data sets with a pre-specified correlation structure.

Another potential use is to test the performance of different cross-validation methods [6] for determining the number of principal components of a PCA model, by simulating data sets with known latent dimension and different structures.

Note that data sets simulated using both approaches can have more variables K than objects N, provided that N is greater than the rank of the pre-specified covariance matrix, r = rank(S), in approach 1, and that the maximum number of non-null eigenvalues specified in approach 2 is the minimum of N−1 and K (see Table 3).

Table 1 (continued). Second simulated data set:
 0.4938   1.2259   0.9596   0.3992
−0.7449  −0.2068  −0.5595   0.4826
−0.4447   1.1622  −0.0084   0.5221
−1.5679  −0.3387  −1.6212  −0.6437
 1.1025   1.1877   1.3538   1.3857
 1.0446  −0.6406   0.4452  −0.1038
−0.6354  −1.3008  −1.1219  −1.5682
−1.0749  −0.7992  −0.3541   0.3115
 0.8424   0.8026   1.2631   0.8572
 0.9847  −1.0923  −0.3568  −1.6428

Table 3. Simulated N=5 by K=7 centred and autoscaled by columns data set, obtained by fixing two eigenvalues of the covariance matrix as {4, 2}.

Simulated data set:
−1.0940   1.0641  −1.7806  −1.1874   1.5452   1.2545  −1.0851
 1.3222  −1.5090   0.5181  −0.5518   0.1806   0.0458   1.3485
−0.4289   0.2127   0.5563  −0.0939  −1.1804  −0.2873  −0.7988
−0.5519  −0.3930   0.3869   1.4628  −0.4100  −1.4627  −0.0660
 0.7526   0.6253   0.3193   0.3702  −0.1354   0.4496   0.6014

Covariance matrix (lower triangle):
 1.0000
−0.6407   1.0000
 0.6053  −0.6276   1.0000
 0.0202  −0.1985   0.6151   1.0000
−0.2052   0.2993  −0.8791  −0.6184   1.0000
−0.0107   0.5151  −0.6981  −0.8653   0.7062   1.0000
 0.9504  −0.7394   0.5882   0.1864  −0.1362  −0.1758   1.0000

Eigenvalues: 4.0000  2.0000  0.5779  0.4221  0.0000  0.0000  0.0000

Numbers in bold emphasis in the original are the specified values for the simulation (here, the eigenvalues 4 and 2).

Table 2. Two N=10 by K=4 simulated centred by columns data sets with variances {4, 3, 2, 1} and with the largest two eigenvalues of the covariance matrix {6, 3}.

Covariance matrices (lower triangles; first data set | second data set):
 4.0000                              |  4.0000
−2.0064   3.0000                     |  1.7802   3.0000
 0.9132   0.8555   2.0000            |  1.6013  −0.3509   2.0000
−1.0668   1.0048   0.1174   1.0000   | −1.1203  −0.1547  −0.7850   1.0000

Data sets (first | second):
 1.2280  −0.7257  −1.3780  −0.0542   |  1.1172   2.0086  −0.3247  −0.8085
−0.6118  −2.3216  −1.1215  −0.2905   | −0.4296   0.6816  −1.4594   1.2601
 1.9510  −0.6283   2.3155  −0.8880   |  0.1419   0.4229   2.0879  −0.5502
−2.1534   2.4193   0.0846   1.0162   |  2.2209   2.8526  −0.1049   0.1405
 0.6593   1.3576   2.0017   1.8779   |  2.9781   0.4058   1.0770  −1.2300
−3.9553   1.4029  −1.4710   0.5736   | −0.6676  −0.3822   0.1703  −0.7618
−0.9094   1.9545   0.3637   0.0239   | −0.1604  −2.0839  −0.0111  −1.0314
 1.2144  −2.4918  −1.6073  −0.0757   | −1.7169  −3.0111   0.9720   0.5681
−0.1208   0.0304   0.7654  −0.4409   |  0.6308  −0.6265   0.5801   1.0582
 2.6980  −0.9972   0.0470  −1.7423   | −4.1144  −0.2679  −2.9873   1.3551

Eigenvalues: 6.0000  3.0000  0.5553  0.4447   |  6.0000  3.0000  0.5391  0.4609

Numbers in bold emphasis in the original are the specified values for the simulation.

Table 4. Iteration counts and execution times, in seconds, for different data set specifications with approach 2.

Size      Eigenvalues      Iterations: mean  median  min  max  std     Time: mean  median  min     max     std
20×10     9                12.44  12.0   9   16   1.6411    0.0090  0.0090  0.0060  0.0120  0.0013
20×10     7 2              22.21  21.0  11   43   7.6109    0.0163  0.0150  0.0070  0.0300  0.0054
20×10     5 3 1            20.93  20.0  12   54   6.3996    0.0151  0.0150  0.0080  0.0400  0.0047
20×10     3 2 2 2          29.25  28.0  19   59   7.5710    0.0211  0.0200  0.0140  0.0410  0.0054
100×10    9                12.44  12.0   8   18   1.8385    0.0156  0.0150  0.0090  0.0420  0.0035
100×10    7 2              21.51  20.5  11   44   6.9928    0.0270  0.0255  0.0140  0.0550  0.0086
100×10    5 3 1            20.63  19.0  11   45   6.3923    0.0259  0.0245  0.0140  0.0550  0.0077
100×10    3 2 2 2          20.55  20.0  14   39   4.4162    0.0257  0.0250  0.0170  0.0490  0.0055
1000×10   9                12.66  12.0   9   17   1.7708    0.0861  0.0845  0.0610  0.1330  0.0126
1000×10   7 2              22.01  21.0  12   46   7.2202    0.1520  0.1450  0.0840  0.3160  0.0489
1000×10   5 3 1            21.02  20.0  12   42   6.0419    0.1451  0.1390  0.0840  0.2860  0.0405
1000×10   3 2 2 2          30.28  28.0  17  151  14.4080    0.2086  0.1920  0.1190  1.0270  0.0976
10×100    90               27.03  27.0  17   43   5.0422    0.0344  0.0350  0.0210  0.0570  0.0063
10×100    70 20            22.79  22.0  18   31   2.5077    0.0300  0.0295  0.0240  0.0400  0.0031
10×100    50 30 10         19.63  20.0  15   24   1.8071    0.0260  0.0260  0.0200  0.0330  0.0025
10×100    30 20 20 20      15.97  16.0  13   19   1.0196    0.0216  0.0220  0.0180  0.0260  0.0015
10×1000   900              24.09  24.0  21   27   1.3566    1.1854  1.1820  1.0520  1.3220  0.0613
10×1000   700 200          22.47  22.0  20   25   0.8463    1.1108  1.0960  1.0100  1.2180  0.0369
10×1000   500 300 100      19.79  20.0  19   22   0.6243    0.9943  1.0020  0.9540  1.0840  0.0266
10×1000   300 200 200 200  15.40  15.0  15   17   0.5125    0.7983  0.7860  0.7670  0.8730  0.0248


It is also important to note that by combining both approaches we can obtain data sets with a great variety of structures. For example, it is possible to generate two different covariance matrices S1 and S2 using the second approach, and combine both into a new covariance matrix

S = [ S1  0
      0   S2 ]

that can be used as the specified covariance matrix in the first approach. This can be useful to generate data sets from processes with several stages or phases, i.e. processes with dynamics of different order and changes in the correlation structure among variables. This is typical of batch processes, where the correlation structure and process dynamics may change as the batch is being processed [7]. The multi-stage nature of both batch and continuous processes can also be the result of the different processing units and the distinguishable operations inside a unit.
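A minimal sketch of this block-diagonal combination (with illustrative S1 and S2 of our own choosing, sampled here with NumPy's generic multivariate normal rather than the paper's functions) shows the intended behaviour: variables from the two blocks come out essentially uncorrelated.

```python
import numpy as np

# Two illustrative stage covariance matrices (our own choices).
S1 = np.array([[2.0, 0.8],
               [0.8, 1.0]])
S2 = np.array([[1.0, -0.4],
               [-0.4, 3.0]])

# Assemble S = [[S1, 0], [0, S2]]: stage-1 variables uncorrelated with stage-2.
K1, K2 = S1.shape[0], S2.shape[0]
S = np.zeros((K1 + K2, K1 + K2))
S[:K1, :K1] = S1
S[K1:, K1:] = S2

rng = np.random.default_rng(0)
Y = rng.multivariate_normal(np.zeros(K1 + K2), S, size=50_000)
C = np.cov(Y, rowvar=False)
print(np.allclose(C[:K1, K1:], 0, atol=0.05))  # cross-block covariances near 0
```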

Acknowledgements

This research was supported by the Spanish Government (MICINN) and the European Union (RDE funds) under grant DPI2008-06880-C03-03. The authors would like to thank the referees for their helpful comments and suggestions, which have improved the work and the speed of the MatLab code of approach 1.

Appendix A

In this appendix we show the MatLab code for both algorithms:

function [Y,srndn]=randnm(S,N,seed)
% Build a multivariate normal data set Y with N objects and
% K variables, centred by columns, with sample covariance
% matrix S (K x K).
% S must be a symmetric positive definite or semidefinite
% matrix with rank(S)=r (check that N>r).
% seed is an optional parameter with the seed for the
% normal generator.
% srndn saves the state of the normal generator.
if nargin>2,
    randn('state',seed);
else
    randn('state',sum(100*clock));
end
srndn=randn('state');
K=size(S,1);
[V,D]=svd(S);
r=sum(diag(D)>K*eps(D(1)));
V=V(:,1:r);
D=sqrt(D(1:r,1:r));
X=randn(N,r);
[Ux,Dx]=qr(X-ones(N,1)*mean(X),0);
Y=sqrt(N-1)*Ux*D*V';

function [X,S,srnd,srndn]=simdataset(N,K,eig_val,var,s1,s2)
% Build N independent samples from a K-multivariate
% normal distribution, whose columns are centred and scaled
% to have the desired variances, specified in the 1 by K
% vector var, and whose covariance matrix has the
% eigenvalues fixed in the 1 by A vector eig_val, by
% completing the K-A remaining eigenvalues in a randomized
% way.
% If var==0, the columns are autoscaled, i.e. var=[1 ... 1].
% s1 and s2 are optional parameters with the seeds for the
% uniform and the normal generators, respectively.
% srnd and srndn save the states of the uniform and
% the normal generator, respectively.
% Example:
% [X,S,srnd,srndn]=simdataset(100,5,[5 3 1],[5 4 3 2 1])
% Build a matrix with 100 rows and 5 columns.
% The columns are centred and have variances: 5,4,3,2,1.
% The covariance matrix verifies that 3 of its
% eigenvalues are 5, 3 and 1.
% We can fix more eigenvalues (check that their sum
% is not greater than the total variance: sum(var)).
if nargin>4,
    rand('state',s1);
    randn('state',s2);
else
    rand('state',sum(100*clock));
    randn('state',sum(100*clock));
end
srnd=rand('state');
srndn=randn('state');
if var==0, var=ones(1,K); end
TV=sum(var);
sd=sqrt(var);
A=size(eig_val,2);
p=min(N-1,K);
if A>p,
    error('No more than %d eigenvalues are allowed.',p);
end
if TV-sum(eig_val)>min(eig_val),
    warning('Non-specified eigenvalues may be larger than the specified ones');
end
if N>K,
    nal=K-A;
else
    nal=N-A-1;
    eig_val=[eig_val 0];
end
alea=rand(1,nal);
L=[eig_val ((TV-sum(eig_val))/sum(alea))*alea];
L=diag(sqrt(sort(L,'descend')));
X=randn(N,K);
X=X-ones(N,1)*mean(X);
dif=100;
while dif>1e-10,
    Xs=X*diag(sd./std(X));
    if N>K,
        [U,D,V]=svd(Xs,0);
    else
        [V,D,U]=svd(Xs',0);
    end
    Xnew=U*diag(1./std(U))*L*V';
    dif=sum(sum((X-Xnew).^2))/(N*K);
    X=Xnew;
end
S=cov(X);

References

[1] J.E. Gentle, Cholesky Factorization, in: Numerical Linear Algebra for Applications in Statistics, Springer-Verlag, Berlin, 1998.

[2] B. Flury, A First Course in Multivariate Statistics, Springer-Verlag, New York, 1997.

[3] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, 2009. URL http://www.R-project.org.

[4] F. Arteaga, A. Ferrer, Dealing with missing data in MSPC: several methods, different interpretations, some examples, Journal of Chemometrics 16 (2002) 408–418.

[5] F. Arteaga, A. Ferrer, Missing Data, in: S. Brown, R. Tauler, B. Walczak (Eds.), Comprehensive Chemometrics, vol. 3, Elsevier, Oxford, 2009, pp. 285–314.

[6] R. Bro, K. Kjeldahl, A.K. Smilde, H.A.L. Kiers, Cross-validation of component models: a critical look at current methods, Analytical and Bioanalytical Chemistry 390 (2008) 1241–1251.

[7] J. Camacho, J. Picó, A. Ferrer, Bilinear modelling of batch processes. Part I: theoretical discussion, Journal of Chemometrics 22 (2008) 299–308.