
1560 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 42, NO. 6, DECEMBER 2012

Multivariate Multilinear Regression

Ya Su, Member, IEEE, Xinbo Gao, Senior Member, IEEE, Xuelong Li, Fellow, IEEE, and Dacheng Tao, Senior Member, IEEE

Abstract—Conventional regression methods, such as multivariate linear regression (MLR) and its extension, principal component regression (PCR), deal well with situations in which the data take the form of low-dimensional vectors. When the dimension grows higher, the under sample problem (USP) arises: the dimensionality of the feature space is much higher than the number of training samples. However, little attention has been paid to this problem. This paper first presents an in-depth investigation of the USP in PCR, which answers three questions: 1) Why is the USP produced? 2) What is the condition for the USP? and 3) How does the USP influence regression? With the help of this analysis, the principal component selection problem of PCR is presented. Subsequently, to address this problem, a multivariate multilinear regression (MMR) model is proposed, which gives a substitutive solution to MLR under the condition of multilinear objects. The basic idea of MMR is to transfer the multilinear structure of the objects into the regression coefficients as a constraint. As a result, the regression problem reduces to finding two low-dimensional coefficient matrices, so that the principal component selection problem is avoided. Moreover, the sample size needed for solving MMR is greatly reduced, so that the USP is alleviated. As there is no closed-form solution for MMR, an alternating projection procedure is designed to obtain the regression matrices. For the sake of completeness, the analysis of computational cost and the proof of convergence are studied subsequently. Furthermore, MMR is applied to model the fitting procedure in the active appearance model (AAM). Experiments conducted on both a carefully designed synthetic data set and AAM fitting databases verify the theoretical analysis.

Index Terms—Active appearance model (AAM), multivariate linear regression (MLR), principal component regression (PCR), under sample problem (USP).

Manuscript received July 10, 2011; revised January 10, 2012; accepted March 30, 2012. Date of publication June 1, 2012; date of current version November 14, 2012. This work is supported by the National Basic Research Program of China (973 Program) (Grant No. 2012CB316400), the National Natural Science Foundation of China (Grant Nos. 61125204, 61172146, 60832005, 61125106, 91120302, 61072093), the Fundamental Research Funds for the Central Universities, the Ph.D. Programs Foundation of the Ministry of Education of China (Grant No. 20090203110002), the State Administration of STIND (Grant No. B1320110042), and the Australian ARC discovery project (ARC DP-120103730).

Y. Su is with the Department of Electronic Engineering, Tsinghua University, Beijing 100084, China (e-mail: [email protected]).

X. Gao is with the School of Electronic Engineering, Xidian University, Xi'an 710071, China (e-mail: [email protected]).

X. Li is with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), State Key Laboratory of Transient Optics and Photonics, Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an 710119, China (e-mail: [email protected]).

D. Tao is with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, Ultimo, NSW 2007, Australia (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSMCB.2012.2195171

I. INTRODUCTION

LINEAR REGRESSION is an important analysis tool in statistics and machine learning. It aims to model the linear relationship between two sets of variables, namely the dependent/response and the independent/predictor variables, and then to predict the values of the dependent variables given new values of the independent variables. In recent decades, regression analysis has been widely used for prediction and forecasting, and it intersects considerably with the fields of pattern recognition and machine learning [3]. For example, it plays important roles in industrial batch process analysis [4], chemical calibration [30], and machine learning [13], [19].

Among the linear regression techniques, multivariate linear regression (MLR) is one of the most important methods. MLR simultaneously models the relationship between multiple dependent variables and a set of independent variables. It can be easily derived as a maximum likelihood estimator under the assumption that the errors are normally distributed. As a result, the model possesses a unique global minimum, which can be given explicitly. Because of this simplicity, MLR has been regarded as a basic tool in the social and natural sciences.

One problem of MLR is the requirement that the sample size be large enough relative to the number of independent variables; otherwise, the model parameters cannot be estimated (the problem is underdetermined). Unfortunately, this assumption is not always satisfied, which leads to the under sample problem (USP). To address it, many researchers have proposed various substitute methods. For example, principal component regression (PCR) [17] is one of the most popular alternatives. It is based on the assumption that the data (samples of the independent variables) lie in a low-dimensional subspace. Thus, it is helpful to decompose the data to find a low-dimensional subspace and project the samples onto it, which is generally done by eigenvalue decomposition techniques. As a result, the sample size becomes sufficient to determine the model parameters of MLR. However, one difficulty remains: there is no general strategy to determine the "optimal" components to retain in the eigenvalue decomposition procedure [11], which limits the accuracy of the result. In addition, another problem of MLR, shared with PCR, is the heavy computational cost: when the number of independent variables becomes large, the problem is hard for MLR to handle. Finally, in the case of multilinear (more than one order) data, the structure information is also destroyed in the vectorization phase.

This paper addresses these problems of MLR in the situation where the independent variables are intrinsically multilinear. Instead of vectorizing the multilinear data samples, the proposed method retains each object in matrix form. This results in a multilinear version of MLR, multivariate multilinear regression (MMR), which possesses three merits: 1) the USP is alleviated because far fewer regression coefficients need to be estimated; 2) the computational cost is greatly reduced; and 3) the structure information is preserved for regression. Consequently, the regression performance is improved.

Recently, multilinear algebra has been widely exploited in the machine learning community [22], [49], [55]. An increasing number of applications have shown that observed data often possess a multiway structure rather than a vector one. In these cases, samples are indexed by two or more independent indices, giving rise to a higher order multiway array. Exploiting this structure requires signal processing tools based on multilinear algebra rather than standard linear algebra [10]. On the one hand, unsupervised learning methods have been studied using the higher order representation [24], [41], [34], [53], [54]. On the other hand, discriminative information has been introduced into multilinear algebra for the purpose of recognition. For example, Liu et al. [25] and Sanguansat et al. [23] made use of 2-D LDA to learn discriminative features from training images. Tao et al. [32] then developed a general tensor discriminant analysis for gait recognition, incorporating high-order tensor analysis and a generalized linear discriminant analysis. Because tensor analysis represents an image as a 2-D tensor, the feature dimension is greatly reduced and spatial information is retained; as a result, the USP is alleviated [42]. Tensor analysis has been widely applied to remote sensing [56], tracking [50], face recognition [29], video semantic analysis [15], 3-D face modeling [45], probabilistic graphical models [46], [47], and feature selection [43]. However, the USP in the multivariate regression problem has seldom been dealt with. Although some nonlinear methods touch on the problem, such as multilinear extensions of partial least-squares regression (PLS) [5], [44], [35], they go beyond the linear category.

In order to study the practical performance of the presented algorithm, we apply MMR to a well-known computer vision model, the active appearance model (AAM) [52]. Since AAM uses MLR/PCR to model the linear relationship between the variations of the high-dimensional texture vectors and the parameters, the USP is inevitable. To this end, this paper substitutes the proposed MMR for MLR in AAM. The result is a more accurate, robust, and efficient fitting algorithm, namely the multilinear/tensor AAM (TAAM).

The remainder of this paper is organized as follows. Section II briefly introduces the fundamentals of regression and the USP. Section III formulates the data generation model to explicitly describe the USP in PCR. The proposed MMR, as well as an alternating projection optimization solution and the analysis of its computational cost, is introduced in Section IV. Then, the model is applied to AAM in Section V. The comparison between PCR and MMR and the evaluation of TAAM are conducted in Section VI. Finally, Section VII concludes the paper.

II. RELATED WORK

In the conventional use of regression models, two procedures can be distinguished: model training and prediction. The main difference among the various models lies in the training phase. In this section, we briefly introduce some related regression techniques, focusing on model training. First, we start with the definition of the basic MLR. Then, more complex methods, such as PCR and PLS, are illustrated.

A. Multivariate Linear Regression

MLR has two procedures, model building and prediction. Consider the independent variables $x = [x_1, x_2, \ldots, x_m]^T \in \mathbb{R}^m$ and the dependent variables $y = [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^n$; without loss of generality, it is assumed that $x$ and $y$ are both centralized. Supposing there is a linear relationship between these two sets of variables, MLR models it with the regression coefficient matrix $T$ such that

$$y_j = T_{j1} x_1 + T_{j2} x_2 + \ldots + T_{jm} x_m + \varepsilon_j = T_j x + \varepsilon_j, \quad j = 1, 2, \ldots, n, \qquad (1)$$

where $T_j = [T_{j1}, T_{j2}, \ldots, T_{jm}]$ is the $j$th row of the regression matrix $T$, and $\varepsilon_j \sim N(0, \sigma^2)$ denotes the estimation error.

In order to estimate $T$, a set of samples of both $x$ and $y$ is needed. Given a set of sample pairs $\{x_i, y_i\}$, $i = 1, 2, \ldots, N$, $T$ can be estimated by the least-squares estimator

$$\min_T \sum_{i=1}^{N} \|y_i - T^T x_i\|^2. \qquad (2)$$

The equivalent matrix formulation is given as

$$\min_T \|Y - T^T X\|^2, \qquad (3)$$

where $X = (x_1, x_2, \ldots, x_N)$, $Y = (y_1, y_2, \ldots, y_N)$, and $\|\cdot\|$ denotes the 2-norm. This optimization problem can be easily solved as

$$T = (XX^T)^{-1} X Y^T. \qquad (4)$$

With the estimated regression coefficient, the dependent variables can be predicted from new independent variables.

Generally, (4) requires that $m = \mathrm{rank}(XX^T) < N$, i.e., that $XX^T$ is invertible. However, this condition is not always satisfied. For example, in the case of digital images, the number of pixels is much larger than the number of samples, i.e., $m \gg N$. As a result, $XX^T$ in (4) is not invertible.
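As a concrete illustration, the following minimal NumPy sketch (our own illustrative code, not from the paper) fits the MLR coefficient matrix via (4) and shows the failure mode that motivates PCR and MMR: when m is much larger than N, XX^T is singular and the inverse does not exist.

import numpy as np

def fit_mlr(X, Y):
    # X: m x N matrix of centralized independent variables (one sample per column)
    # Y: n x N matrix of centralized dependent variables
    # Returns T (m x n) such that Y ~= T^T X, following (4).
    XXt = X @ X.T                           # m x m
    return np.linalg.solve(XXt, X @ Y.T)    # equivalent to (X X^T)^{-1} X Y^T

# Toy usage: works when N > m ...
m, n, N = 10, 3, 50
X = np.random.randn(m, N)
T_true = np.random.randn(m, n)
Y = T_true.T @ X
T_hat = fit_mlr(X, Y)
print(np.allclose(T_hat, T_true))           # True in the noise-free, well-determined case

# ... but fails under the USP (m >> N): X X^T is rank deficient,
# and np.linalg.solve(X_usp @ X_usp.T, ...) would raise LinAlgError.
X_usp = np.random.randn(1000, 20)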

B. Under Sample Problem of PCR

To avoid the invertibility problem of MLR, PCR substitutes principal components for the independent variables to predict the dependent variables. Specifically, PCR first projects $x$ onto a low-dimensional subspace,

$$\tilde{x} = P^T x, \qquad (5)$$

where $\tilde{x} \in \mathbb{R}^k$ with $k < \mathrm{rank}(X) < m$, and $P$ collects the eigenvectors spanning the subspace $\mathbb{R}^k$, which can be obtained by eigenvalue decomposition techniques. The solution then reduces to

$$T_p = (\tilde{X}\tilde{X}^T)^{-1} \tilde{X} Y^T, \qquad (6)$$

where $\tilde{X} = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_N)$. If an appropriate $k$ is chosen, i.e., $\tilde{X}\tilde{X}^T$ is invertible, the optimization problem can be easily solved. The only question is how to determine $P$.


The transformation $P$ is obtained by choosing $k$ eigenvectors of the column subspace of $X$. Consider the singular value decomposition of $X$,

$$X = U_x \Sigma_x V_x^T, \qquad (7)$$

where $U_x$ and $V_x$ are the eigenvectors of the column and row subspaces of $X$, respectively. Then, $P$ can be chosen from $U_x$ by correlation ranking [8], genetic algorithms [40], or the cross-validation strategy [11]. These methods work quite well in some cases and have been applied in machine learning and computer vision [27]. However, the USP in PCR has received little attention in the literature. This paper investigates the problem in depth and further answers what the condition for the USP is and how it influences regression. According to our analysis, even if the best PCs are found, PCR still suffers from the USP.
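For concreteness, here is a minimal NumPy sketch of the PCR training step described above (illustrative only; names such as fit_pcr are ours, and the component count k is assumed to be given rather than selected by one of the cited strategies).

import numpy as np

def fit_pcr(X, Y, k):
    # X: m x N centralized independent variables, Y: n x N centralized dependent variables.
    # 1) Find the leading k left singular vectors of X (the principal components), cf. (7).
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    P = U[:, :k]                                 # m x k projection, cf. (5)
    # 2) Project the samples and run ordinary MLR in the k-dimensional subspace, cf. (6).
    Xt = P.T @ X                                 # k x N projected data
    Tp = np.linalg.solve(Xt @ Xt.T, Xt @ Y.T)    # k x n regression matrix
    return P, Tp

def predict_pcr(P, Tp, x_new):
    # Predict the dependent variables for a new (centralized) independent vector.
    return Tp.T @ (P.T @ x_new)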

It is worth noticing that many nonlinear approaches to the regression problem have received much attention in the industrial and chemical communities [37]. For example, Unfold-PLS [16] unfolds the multiway data into vectors and then performs a PLS. N-way PLS [51] generalizes PLS to high-order data in company with parallel factor analysis (PARAFAC). Similarly, multiway covariates regression [5], [35] extends this structure to more general forms, i.e., not only PARAFAC but also Tucker3 [36]. These formulations differ from MLR and PCR in that they use a nonlinear procedure to recover the latent variables (LVs) and their relations to the dependent and independent variables. Since this paper mainly focuses on linear models, e.g., MLR and PCR, the discussion of these nonlinear models is beyond the scope of this paper and is left for future work.

III. UNDER SAMPLE PROBLEM

The consequence of the USP in MLR is simple: it leads to the invertibility problem in (4). Therefore, much more attention is paid here to the USP in PCR. First, the data generation model is given as the basis for the regression analysis. Then, the USP in PCR is analyzed based on this generation model. Notice that our discussion is limited to the situation where $\mathrm{rank}(XX^T) < m$. Otherwise, the USP turns into an overdetermined problem, which can be dealt with by MLR.

A. Data Generation Model

The generation model of the dependent and the independent variables is predefined as

$$x = \Upsilon \cdot K + s^{-1}\varepsilon, \qquad (8)$$
$$y = K(1\!:\!n), \qquad (9)$$

where $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$ are the centralized independent and dependent variables, $K \in \mathbb{R}^k$ is a random vector with $K_i \sim U(0,1)$, $i = 1, 2, \ldots, k$, that generates both $x$ and $y$, $K(1\!:\!n)$ fetches the first (or a random) $n$ elements of $K$, $\Upsilon$ denotes the regression matrix defined as in (2) and (4), $s$ is the informal signal-to-noise ratio (SNR), and $\varepsilon$ stands for Gaussian noise. Without risk of misunderstanding, we call $K$ the LV for its contribution to the regression problem.

To keep clarity and avoid singularity, $\Upsilon$ is constructed as an orthogonal matrix. As a result, the regression problem can be defined, based on the generation model (8), as finding a set of coefficients $T$ that correctly predicts portions of $K$ from $x$:

$$\Upsilon^T \Upsilon \cdot K = K, \qquad T^T x = y. \qquad (10)$$

Although the orthogonality of $\Upsilon$ is not necessary, practical regression problems are always equivalent or can be transformed to this standard form. There are three different situations.

(a) $\Upsilon$ is not orthogonal but linearly independent. Suppose there is a $\Upsilon'$ that is nonorthogonal but has linearly independent columns; then

$$x' = \Upsilon' \cdot K', \qquad (11)$$
$$y' = K'(1\!:\!n). \qquad (12)$$

It follows that

$$x' = \Upsilon \cdot M \cdot K' = \Upsilon \cdot K, \qquad (13)$$
$$y' = (M^{-1}K)(1\!:\!n) = M^{-1}(1\!:\!n)\,K, \qquad (14)$$

where $M$ denotes the invertible matrix that transforms the orthogonal basis $\Upsilon$ into $\Upsilon'$, such that

$$\Upsilon' = \Upsilon \cdot M. \qquad (15)$$

In the case of PCR, the regression problem between $x'$ and $y'$ is then

$$T'^T x' = y', \qquad (16)$$

where $T'$ consists of the first $n$ columns of the matrix $\Upsilon'$. The problem is thus converted to finding a transformed coefficient matrix $T'$, which is equivalent to the original regression problem (10).

(b) $\Upsilon$ is collinear. This means that there is collinearity in both $K$ and $\Upsilon$. In this case, some preprocessing procedures should be conducted to detect and remove the collinearity in the data set. Many suggestions for solving this problem have been proposed in [21].

(c) $y$ is linearly transformed from $K$. This case is similar to the first one, in which $\Upsilon$ is not orthogonal.

The above discussion shows that the data generation model, (8) and (9), accounts for all multivariate linear regression problems. Moreover, it should be noted that this model also accounts for the inverse of the multivariate regression problem [13]. It addresses the obstacle in the original regression model that prevents researchers from utilizing the bidirectional relationship between $x$ and $y$ and from numerically calculating the accuracy of PCR.
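For concreteness, a minimal sketch of drawing samples from the generation model (8)-(9) follows (our own illustrative code; the orthogonal Υ is built here from a QR factorization, which is one valid choice, and centralization is omitted).

import numpy as np

def generate_samples(m, n, k, N, s=np.inf, rng=np.random.default_rng(0)):
    # Orthogonal regression matrix Upsilon (m x k), cf. (8).
    Upsilon, _ = np.linalg.qr(rng.standard_normal((m, k)))
    K = rng.uniform(0.0, 1.0, size=(k, N))            # latent variables, K_i ~ U(0, 1)
    E = rng.standard_normal((m, N))                   # Gaussian noise
    X = Upsilon @ K + (0.0 if np.isinf(s) else 1.0 / s) * E
    Y = K[:n, :]                                      # y = K(1:n), cf. (9)
    return X, Y, Upsilon

X, Y, Upsilon = generate_samples(m=100, n=5, k=50, N=40)   # m >> N: the USP setting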


B. Under Sample Problems

To better understand the USP, we first consider a simple case without the noise term (i.e., $s \to \infty$),

$$x = \Upsilon \cdot K, \qquad (17)$$
$$y = K(1\!:\!n). \qquad (18)$$

As a result, the transformation $P$ in (5) lies in the subspace spanned by $U_x$. The situation with noise is treated in the next subsection.

According to the linear relationship (17), the following equation is easily obtained:

$$XX^T = \Upsilon K K^T \Upsilon^T. \qquad (19)$$

In the training phase, PCR first decomposes $X$ so that

$$X = U_x \Sigma_x V_x^T. \qquad (20)$$

Then, a similar decomposition is conducted on $K$ such that

$$K = U_k \Sigma_k V_k^T. \qquad (21)$$

Subsequently, (20) and (21) can be substituted into (19), which leads to

$$U_x \Sigma_x U_x^T = \Upsilon U_k \Sigma_k U_k^T \Upsilon^T = \tilde{\Upsilon} \Sigma_k \tilde{\Upsilon}^T, \qquad (22)$$

where $\tilde{\Upsilon} = \Upsilon U_k$. From the above equation and the uniqueness property of the SVD, we immediately obtain

$$U_x = \tilde{\Upsilon} = \Upsilon U_k, \qquad (23)$$
$$\Sigma_x = \Sigma_k. \qquad (24)$$

Thus, by substituting (23) and (21) into the PCR procedure (6), we get

$$Y = T_p^T \tilde{X} = T_p^T U_k^T K. \qquad (25)$$

For the relationship between $U_x$ and $\Upsilon$, we have the following theorem. Due to space constraints, all proofs are provided in the supplementary file.

Theorem 3.1: The necessary and sufficient condition for PCR to accurately model the regression problem is

$$\mathrm{span}(U_x) = \mathrm{span}(\Upsilon). \qquad (26)$$

Here, the term "accurate" means that, given an example $x$ and $y$, the model should predict $y$ from $x$ without error.

Based on Theorem 3.1, it is easy to recognize the USP in PCR. In fact, there are two kinds of $U_k$ corresponding to different training set sizes.

(a) When the training set size is large enough that $KK^T$ is nonsingular, it follows that $U_k \in \mathbb{R}^{k \times k}$. In this case, $U_k$ acts as a rotation matrix, which means that $U_x$ spans the same space as $\Upsilon$,

$$\mathrm{span}(U_x) = \mathrm{span}(\Upsilon). \qquad (27)$$

As a result, the model is accurate and no USP exists.

(b) On the other hand, when the training set size is small, the condition is more complex. Since $KK^T$ is singular, it follows that $U_k \in \mathbb{R}^{k \times k'}$, where $k' < k$. Because such a $U_k$ is not a rotation matrix, $U_x$ spans only a subspace of $\Upsilon$, that is,

$$\mathrm{span}(U_x) \subset \mathrm{span}(\Upsilon). \qquad (28)$$

As a result, the model is not accurate and the USP exists.

We can view the second case from a different viewpoint. If we apply the model to the training set, condition (26) can be considered satisfied, because $U_x$ plays the role of $\Upsilon$ in generating $X$:

$$\mathrm{span}(\Upsilon_t) = \mathrm{span}(U_x), \qquad (29)$$

where $\Upsilon_t$ denotes the regression coefficient associated with the training set. Accordingly, no USP is present, and the model works well.

If we apply the model obtained from the training set to a test set, the situation is different. Given a test set $X'$ different from the training set, with the corresponding "core" $K'$, it follows that

$$X'X'^T = U'_x \Sigma'_x U'^T_x, \qquad (30)$$
$$K'K'^T = U'_k \Sigma'_k U'^T_k, \qquad (31)$$

which finally yields

$$U'_x = \tilde{\Upsilon}' = \Upsilon U'_k, \qquad (32)$$
$$\Sigma'_x = \Sigma'_k. \qquad (33)$$

When we apply the trained model to $X'$, it is generally assumed that

$$\mathrm{span}(U'_x) = \mathrm{span}(U_x), \qquad (34)$$

and consequently that

$$\mathrm{span}(\Upsilon U'_k) = \mathrm{span}(\Upsilon U_k). \qquad (35)$$

Unfortunately, this is not true, because $U'_x$ and $U_x$ each depend on their individual data set, which results in a subspace different from $\mathrm{span}(\Upsilon)$:

$$\mathrm{span}(\Upsilon_t) \neq \mathrm{span}(U'_x). \qquad (36)$$

This is the reason why the USP is produced in PCR. Generally speaking, the discrepancy between (29) and (36) is called overfitting: the model performs well on the training set but does not work on the test set. This error is produced because $U_x$ lies in a subspace of $\Upsilon$. We call it the Type-1 error.

The above deduction uncovers the underlying reason why PCR can or cannot be used to approximate MLR. We can summarize the condition for the USP as follows. When the sample size is large enough (not necessarily larger than $m$), or equivalently the LV covariance matrix $KK^T$ is nonsingular, PCR can be used to approximate the regression. When the sample size is not large enough, i.e., $KK^T$ is singular, PCR suffers from the USP.

To evaluate the Type-1 error, it is useful to introduce the quantity

$$\Gamma = k/\mathrm{Rank}(X), \qquad (37)$$

called the latent ratio. In practice, $\mathrm{Rank}(X)$ can be approximately substituted by $N$, and (37) can be written as

$$\Gamma = \frac{k}{N}. \qquad (38)$$

The latent ratio $\Gamma$ evaluates the degree of the USP. In particular, $\Gamma < 1$ means that the data set is large enough to alleviate the USP; when $\Gamma > 1$, the USP is present, and its degree depends on the value of $\Gamma$. For instance, the general experimental setting in Section VI ($k = 50$, $N = 40$) gives $\Gamma = 1.25 > 1$, so the USP is present there.

We can also conclude how the USP affects PCR. Although the USP for PCR has been well defined, it is generally unmanageable in practice because the LVs, and hence their length, are actually unknown. More particularly, since $U_x$ is specific to the data set, it is generally impossible to know the exact regression coefficient $\Upsilon$. Therefore, under the condition of the USP, there will always be a dilemma in how to select the PCs.

The solutions to the USP in the literature are designed to validate the model on a test set, by genetic algorithms [13] or the cross-validation strategy [11]. The basic idea behind these methods is to find the best $U_k$ given a training set and a probe set. However, they still suffer from the USP, except for large enough training sets, because $U_x$ always lies in a subspace of $\Upsilon$ as in (28).

C. Noise Effect

The final problem is the influence of noise. This paper describes the degree of noise by the SNR. It should be noted that the SNR defined here is a qualitative measure. This is nevertheless reasonable because the term "is sometimes used informally to refer to the ratio of useful information to false or irrelevant data in a conversation or exchange" [27].

Since the impact of noise on regression has been studied by Draper et al. [13], we only give a qualitative analysis for the case of the USP. If we coarsely consider $s$ as the SNR, the SVD is applied to the noise-contaminated independent variables,

$$(X - s^{-1}E)(X - s^{-1}E)^T = U_x \Sigma_x U_x^T. \qquad (39)$$

The resulting eigenvectors span a space different from the noise-free $U_x$. As a result, the prediction will be disturbed; the higher the SNR, the smaller the disturbance.

The other consequence of the noise term appears in the case where there is no USP. Consider $U_x$ disturbed by the noise as in (39). Since parts of the components inevitably fall outside the subspace $\Upsilon$, there exists in the regression coefficient a set difference $D$ between $U_x$ and $\Upsilon$,

$$D = U_x \setminus \Upsilon \subset \Upsilon^C, \qquad (40)$$

which accounts for the effect of noise. According to Theorem 3.1, error is produced. The same happens when applying this model to the test set, which also possesses a $U'_x$ disturbed by noise, with a set difference $D'$ on behalf of the noise,

$$D' = U'_x \setminus \Upsilon \subset \Upsilon^C. \qquad (41)$$

Since $D$ is never equivalent to $D'$ (with probability one), $U_x$ never covers $U'_x$,

$$D' \setminus D \subseteq U'_x \setminus U_x. \qquad (42)$$

Consequently, the model is hardly accurate, and the USP results.

It is remarkable that the error caused by noise (40) is different from that caused by $k$. In the latter case, the Type-1 error, $U'_x$ lies in $\Upsilon$, whereas in the former case $U'_x$ partly lies outside $\Upsilon$; we call this the Type-2 error. In terms of training and testing, the two types of error behave in the same way, because both result in regression coefficients that contain subspaces outside the space spanned by the test set. However, they lead to different degrees of error: the Type-1 error can be avoided by simply increasing the sample number, or intrinsically the rank of the data set, whereas the Type-2 error is unavoidable since the set difference $D$ caused by noise can never be removed. A more vivid evaluation of the different types of error is given in Section VI.

IV. MULTIVARIATE MULTILINEAR REGRESSION

In this section, we address the USP encountered by PCR under the multilinear condition. The basic idea behind MMR is to transform the high-dimensional regression problem into a lower dimensional one by introducing the structure information embedded in the data. As preparation, multilinear algebra is first introduced. Subsequently, we propose the multivariate multilinear regression (MMR) technique, which deals with the situation where the independent variables can intrinsically be considered as multilinear arrays. The computational cost and convergence are analyzed at the end of this section.

A. Multilinear Algebra

In order to concisely describe multilinear regression models, the conventional matrix product is not sufficient. Therefore, we first introduce some multilinear notation from the literature [13], [24], [44].

Given an $M$th-order tensor $X \in \mathbb{R}^{m_1 \times m_2 \times \ldots \times m_M}$, an element of $X$ is addressed by $M$ indices $m_i$, $i = 1, 2, \ldots, M$. Then, the following definitions can be introduced.

$n$-Mode (Matrix) Product: The $n$-mode product of a tensor $X \in \mathbb{R}^{m_1 \times m_2 \times \ldots \times m_M}$ with a matrix $U \in \mathbb{R}^{J \times m_n}$ is denoted by $X \times_n U$. It results in a tensor of size $m_1 \times \ldots \times m_{n-1} \times J \times m_{n+1} \times \ldots \times m_M$.

Matrix Kronecker Product: The Kronecker product of two matrices $A \in \mathbb{R}^{m_1 \times m_2}$ and $B \in \mathbb{R}^{m_3 \times m_4}$ is defined as

$$A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1m_2}B \\ \vdots & \ddots & \vdots \\ a_{m_1 1}B & \cdots & a_{m_1 m_2}B \end{bmatrix}. \qquad (43)$$

The size of the resulting matrix is $m_1 m_3 \times m_2 m_4$.

Matrix Khatri-Rao Product: The Khatri-Rao product of two matrices $A = (\alpha_1, \alpha_2, \ldots, \alpha_n) \in \mathbb{R}^{m_1 \times n}$ and $B = (\beta_1, \beta_2, \ldots, \beta_n) \in \mathbb{R}^{m_2 \times n}$ is defined as

$$A \circ B = [\alpha_1 \otimes \beta_1 \;\; \cdots \;\; \alpha_n \otimes \beta_n]. \qquad (44)$$

The size of the resulting matrix is $m_1 m_2 \times n$.
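As a quick illustration of the last two operators (our own sketch; the Khatri-Rao product is written out explicitly as a column-wise Kronecker product, which is also what scipy.linalg.khatri_rao computes):

import numpy as np

A = np.arange(6.0).reshape(3, 2)          # A in R^{m1 x n}, m1 = 3, n = 2
B = np.arange(8.0).reshape(4, 2)          # B in R^{m2 x n}, m2 = 4, n = 2

# Kronecker product (43): size (3*4) x (2*2)
K = np.kron(A, B)
print(K.shape)                            # (12, 4)

# Khatri-Rao product (44): column-wise Kronecker product, size (3*4) x 2
KR = np.column_stack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])])
print(KR.shape)                           # (12, 2)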

B. Multivariate Multilinear Regression

In particular, consider a multivariate linear function $f(X) = y$, where $X \in \mathbb{R}^{m_1 \times m_2}$ lies in an order-2 multilinear space containing the independent variables, and $y \in \mathbb{R}^n$ is the dependent variable vector associated with $X$. The function $f: \mathbb{R}^{m_1 \times m_2} \to \mathbb{R}^n$ denotes the linear relationship between the order-2 multilinear variables and the dependent variables. Given a set of sample pairs $\{X_i, y_i\}$, $i = 1, 2, \ldots, N$, the regression problem of estimating this linear function can be modeled as

$$\min \sum_{i}^{N} \sum_{j}^{n} \|y_{ij} - X_i \times_1 \alpha_j \times_2 \beta_j\|^2, \qquad (45)$$

where $N$ is the number of samples, and $A = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ and $B = (\beta_1, \beta_2, \ldots, \beta_n)$ are the regression matrices called the mode-1 (or left) matrix and the mode-2 (or right) matrix, respectively, with $\alpha_j \in \mathbb{R}^{m_1}$ and $\beta_j \in \mathbb{R}^{m_2}$. Without loss of generality, we assume that $X_i$ and $y_i$ are both centralized. In particular, $X_i$ is centralized across the samples, which means that

$$\hat{X}_i = X_i - m_x, \qquad (46)$$

where the sample mean $m_x$ is obtained by

$$m_x = \frac{1}{N}\sum_{i=1}^{N} X_i. \qquad (47)$$

The "hat" on the centralized samples is omitted in the following.

The MMR model (45) differs from MLR (2) in three aspects: 1) the independent variable $X$ is an order-2 tensor rather than a vector, so the structure information is preserved; 2) the new formulation requires solving not only the mode-1 matrix $A$ but also the mode-2 matrix $B$, which is actually a bilinear optimization problem; and 3) the number of free coefficients is changed from $m_1 \times m_2$ to $m_1 + m_2$, which leads to a great reduction of both the USP and the computational cost. This can be seen in the analysis later.

In order to uncover the relation between MMR and MLR, it is necessary to transform (45). Using some simple multilinear operators, MMR can be rewritten as

$$\min \sum_{i}^{N} \sum_{j}^{n} \left\|y_{ij} - \mathrm{vec}(X_i)^T \cdot (\alpha_j \otimes \beta_j)\right\|^2, \qquad (48)$$

where $\mathrm{vec}$ denotes the vectorization operator and $\otimes$ refers to the Kronecker product. If we denote $\alpha_j \otimes \beta_j$ by $\gamma_j$ and substitute $\Upsilon = [\gamma_1, \gamma_2, \ldots, \gamma_n]$ into (48), an equivalent formulation is obtained as

$$\min \sum_{i}^{N} \left\|y_i - \Upsilon^T \mathrm{vec}(X_i)\right\|^2. \qquad (49)$$

As a result, $\Upsilon = A \circ B$. This form is similar to (2). With the help of this transformation, we find that the intrinsic difference between MMR and MLR is that the former preserves the structure information of the data in the structured relationship $\Upsilon$. This structured relationship can also be considered as a subspace constructed by the Khatri-Rao product of two low-dimensional subspaces, $A$ in $\mathbb{R}^{m_1}$ and $B$ in $\mathbb{R}^{m_2}$. In this sense, (49) can be considered a unified formulation of MMR (45), MLR (2), and PCR. Without risk of misunderstanding, we generalize the notation $\Upsilon$ to denote either the variable vector lying in the subspace $\mathbb{R}^{m_1 m_2}$ or the regression matrix to be estimated.
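The identity used in passing from (45) to (48), namely vec(X)^T(α ⊗ β) = α^T X β, can be checked numerically. The following is our own small sketch, assuming a row-major vectorization (NumPy's default flatten); with the column-major vec common in the literature, the two Kronecker factors would appear in the opposite order.

import numpy as np

rng = np.random.default_rng(0)
m1, m2 = 4, 3
X = rng.standard_normal((m1, m2))
alpha = rng.standard_normal(m1)
beta = rng.standard_normal(m2)

bilinear = alpha @ X @ beta                      # alpha^T X beta, cf. (45)
vectorized = X.flatten() @ np.kron(alpha, beta)  # vec(X)^T (alpha ⊗ beta), cf. (48)
print(np.isclose(bilinear, vectorized))          # True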

C. Alternating Projection Optimization

Since MMR consists of two low-dimensional regression matrices $A$ and $B$, it leads to a bilinear model. Unfortunately, to the best of our knowledge, it is impossible to find a closed-form solution with respect to the criterion used in MMR. Therefore, we propose an alternating projection optimization procedure to approximate the solution of MMR. We have the following theorems.

Theorem 4.1: Let $A$ and $B$ be the matrices minimizing the objective $E(A,B) = \sum_{i}^{N}\sum_{j}^{n}\|y_{ij} - X_i \times_1 \alpha_j \times_2 \beta_j\|^2$. Then:

• For a given mode-2 matrix $B$, the optimal mode-1 matrix $A$ is given by

$$\alpha_j = H_{\beta_j}^{\dagger} \cdot J_j \cdot \beta_j, \quad j = 1, 2, \ldots, n, \qquad (50)$$

where $H_{\beta_j} = \sum_{i}^{N} \left(X_i \beta_j \beta_j^T X_i^T\right)$ and $J_j = \sum_{i}^{N} (y_{ij} X_i)$.

• For a given mode-1 matrix $A$, the mode-2 matrix $B$ is obtained by

$$\beta_j = H_{\alpha_j}^{\dagger} J_j^T \alpha_j, \quad j = 1, 2, \ldots, n, \qquad (51)$$

where $H_{\alpha_j} = \sum_{i}^{N} \left(X_i^T \alpha_j \alpha_j^T X_i\right)$ and $J_j = \sum_{i}^{N} (y_{ij} X_i)$.

Theorem 4.1 tells us that the solutions of $A$ and $B$ depend on each other. Therefore, an iterative procedure can be utilized for computing the regression matrices. More specifically, with a given $A$, we can solve $B$ by (51); with the solved $B$, we can subsequently update $A$ by (50). Moreover, the convergence of the alternating projection optimization is given by the following theorem.

Theorem 4.2: Let $A$, $B$ be the matrices minimizing the objective function $E(A,B) = \sum_{i}^{N}\sum_{j}^{n}\|\alpha_j^T X_i \beta_j - y_{ij}\|^2$. Then the iterative projection optimization procedure for MMR is convergent.

Based on Theorem 4.2, it can be concluded that, given an initialization, the iterative procedure is repeated until the result converges (to within a certain threshold). The pseudo code of the iterative optimization algorithm is listed in Table I, in which steps 1–11 iteratively find the regression matrices.
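The following NumPy sketch illustrates one way to implement the alternating updates (50) and (51). It is our own illustrative code rather than the authors' implementation, and the stopping rule (a fixed tolerance on the change of the objective) is an assumption.

import numpy as np

def fit_mmr(Xs, Y, n_iter=50, tol=1e-8, rng=np.random.default_rng(0)):
    # Xs: array of shape (N, m1, m2), centralized independent samples.
    # Y:  array of shape (N, n), centralized dependent samples.
    # Returns A (m1 x n) and B (m2 x n) such that y_ij ~= alpha_j^T X_i beta_j.
    N, m1, m2 = Xs.shape
    n = Y.shape[1]
    A = rng.standard_normal((m1, n))
    B = rng.standard_normal((m2, n))
    prev = np.inf
    for _ in range(n_iter):
        for j in range(n):
            Jj = np.einsum('i,ikl->kl', Y[:, j], Xs)                  # sum_i y_ij X_i
            # Update alpha_j with beta_j fixed, cf. (50).
            Hb = sum(np.outer(X @ B[:, j], X @ B[:, j]) for X in Xs)
            A[:, j] = np.linalg.pinv(Hb) @ Jj @ B[:, j]
            # Update beta_j with alpha_j fixed, cf. (51).
            Ha = sum(np.outer(X.T @ A[:, j], X.T @ A[:, j]) for X in Xs)
            B[:, j] = np.linalg.pinv(Ha) @ Jj.T @ A[:, j]
        # Objective value, cf. (45); stop when it no longer changes.
        err = sum((Y[i, j] - A[:, j] @ Xs[i] @ B[:, j]) ** 2
                  for i in range(N) for j in range(n))
        if abs(prev - err) < tol:
            break
        prev = err
    return A, B

def predict_mmr(A, B, X):
    # Predict the dependent vector for a single (centralized) sample X.
    return np.array([A[:, j] @ X @ B[:, j] for j in range(A.shape[1])])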

D. Computational Cost Analysis

We now analyze the computational complexity of the proposed algorithm. On the one hand, the complexity of MLR (4) is

$$CC_{\mathrm{MLR}} = O(m \times N^2 + m^3 + n \times m^2 + n \times N \times m), \qquad (52)$$

where $m = m_1 \times m_2$. On the other hand, with the help of Gaussian elimination-based matrix inversion, the computational cost of MMR turns out to be

$$CC_{\mathrm{MMR}} = O\left(\left(m_1^2 + 4 m_1 m_2 + m_2^2\right) \times N \times t + \left(m_1^3 + m_2^3\right) \times n \times t\right), \qquad (53)$$


TABLE I. ITERATIVE OPTIMIZATION PROCEDURE FOR MMR

where $m_1$ and $m_2$ are the two mode dimensions of the image, $N$ is the size of the training set, $n$ is the number of dependent variables, and $t$ is the number of iterations involved in MMR.

Without loss of generality, suppose the image size satisfies $m_1 = m_2$. When the number of samples is small, i.e., $m_1 < N \ll m$, the computational costs of the two models are

$$CC_{\mathrm{MLR}} = O(m^3), \qquad (54)$$
$$CC_{\mathrm{MMR}} = O\left(6 m_1^2 \times N \times t + 2 m_1^3 \times n \times t\right), \qquad (55)$$

respectively. As a result, $CC_{\mathrm{MLR}} \gg CC_{\mathrm{MMR}}$.

When the number of samples is large, i.e., $m_1 \ll N < m$, the computational costs of the two models are

$$CC_{\mathrm{MLR}} = O(m \times N^2), \qquad (56)$$
$$CC_{\mathrm{MMR}} = O\left(6 m_1^2 \times N \times t\right), \qquad (57)$$

respectively. Consequently, $CC_{\mathrm{MLR}} \gg CC_{\mathrm{MMR}}$.

In summary, the proposed MMR greatly reduces the computational cost of the regression task.
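As a rough numerical illustration (our own numbers, not from the paper), take $m_1 = m_2 = 100$ (so $m = 10^4$), $N = 50$, $n = 10$, and $t = 10$. Then $CC_{\mathrm{MLR}} \approx m^3 = 10^{12}$ operations, whereas $CC_{\mathrm{MMR}} \approx 6 m_1^2 N t + 2 m_1^3 n t = 3 \times 10^7 + 2 \times 10^8 \approx 2.3 \times 10^8$, i.e., more than three orders of magnitude fewer operations.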

V. MMR-BASED AAM

AAM is one of the most powerful appearance models for describing deformable objects [6], [12], [26]. It consists of two procedures, training and fitting. First, in the modeling procedure, AAM uses a statistical technique, principal component analysis (PCA), to model the variation modes of the shape and the texture. As a result, every object is described by a set of compact parameters. Then, in the fitting phase, the model heuristically finds the best parameters for an unseen image of the object by a gradient descent method. This can be done efficiently by modeling the linear relationship between texture residuals and parameter updates. Traditionally, this procedure is modeled as an MLR problem, which can be solved offline based on the training data [20]. As a result, AAM can efficiently and heuristically find the deformable object in the image. Unfortunately, when the dimensionality of the images is high, this method is not practical. There have been many improvements to AAM [8]. For example, researchers have extended the intensity-based texture to multiple features, which enhances the AAM's ability to fit the object in an image [9], [14], [18], [33], [38]. However, more features also lead to a heavier USP and computational cost.

Based on the proposed MMR, we can extend AAM to the tensor framework. Given a training set, we generate statistical models of shape and texture variation, respectively (see [39] for details). First, the shape of the object is extracted and represented as a vector $s$, and likewise the texture as $g$. Then, the shape and the texture are controlled by the appearance parameter $c$ according to

$$s = \bar{s} + Q_s \cdot c, \qquad (58)$$
$$g = \bar{g} + Q_g \cdot c, \qquad (59)$$

where $\bar{s}$ is the mean shape, $\bar{g}$ the mean texture, and $Q_s$ and $Q_g$ are matrices describing the modes of variation derived from the training set. As a result, any instance can be described by the parameter $c$.

Suppose $p^T = (c^T | t^T | u^T) \in \mathbb{R}^n$ denotes all the parameters of the model ($c$ is the appearance model parameter, $t$ is the pose transformation parameter, and $u$ is the texture normalization parameter). The objective of model fitting is to find the optimal parameter $p$ that minimizes the texture residual $r(p) = g_s - g_m$, where $g_s$ and $g_m$ denote the sampled texture and the current model texture, and $r(p) \in \mathbb{R}^m$. This is an optimization problem that can be solved with the gradient descent method by assuming that there is a linear relationship between the texture residual $r(p)$ and the parameter update $\delta p$,

$$\delta p_i = f(r), \qquad (60)$$

where $\delta p_i$ is the $i$th element of $\delta p$, and $f: \mathbb{R}^m \to \mathbb{R}$ is the linear function between $r$ and $\delta p_i$. Originally, AAM treats (60) as a linear regression problem that can be solved by MLR. However, as discussed above, MLR suffers from the USP because of the high-dimensional residual $r$. Though PCR has been proposed to model the fitting procedure [8], the problem still remains.

In order to alleviate the problem, we consider the texture residual as an order-2 tensor. As a result, the problem turns into a tensor-based MLR problem and can be solved using MMR. In particular, given the texture residual $R \in \mathbb{R}^{m_1 \times m_2}$ in order-2 tensor form, the relationship between the texture residual and the parameter update becomes

$$\delta p_i = -\alpha_i^T \cdot R(p) \cdot \beta_i, \quad i = 1, 2, \ldots, n, \qquad (61)$$

where $A = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ and $B = (\beta_1, \beta_2, \ldots, \beta_n)$ are the left and right projection matrices, respectively, with $\alpha_i \in \mathbb{R}^{m_1}$ and $\beta_i \in \mathbb{R}^{m_2}$. They can be considered fixed during the fitting procedure. Therefore, this problem can be formulated as a regression problem and solved by MMR as in Table I.

Fig. 1. Comparison of MMR and PCR with respect to k. (a) k = 10; (b) k = 20; (c) k = 30; (d) k = 40; (e) k = 50; (f) k = 60.

After the MMR model is trained, an unseen image can be fitted using $A$ and $B$. Specifically, given an image and an initialized model parameter $p$, the sampled texture and the model texture are calculated in the form of 2-D tensors, $G_s \in \mathbb{R}^{m_1 \times m_2}$ and $G_m \in \mathbb{R}^{m_1 \times m_2}$. Then, the texture residual $R(p)$ is produced by

$$R(p) = G_s - G_m. \qquad (62)$$

Consequently, the parameter update $\delta p$ can be predicted using (61), and the parameter is updated by

$$p = p - \delta p. \qquad (63)$$

This prediction-update procedure continues until convergence. Since the proposed multilinear framework keeps the image in matrix form, the spatial information between pixels is protected and transferred to $A$ and $B$, and the USP can be greatly alleviated. In contrast to the traditional vector-based AAM, the proposed MMR-based AAM is denoted TAAM.
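A compact sketch of this prediction-update loop follows (our own illustration; residual_fn stands in for the AAM-specific warping and texture sampling that produce R(p), which are not shown, and the convergence criterion is an assumption).

import numpy as np

def taam_fit(p0, residual_fn, A, B, max_iter=30, tol=1e-3):
    # p0:          initial model parameters (length n), cf. p^T = (c^T | t^T | u^T).
    # residual_fn: callable p -> R(p), the m1 x m2 texture residual G_s - G_m, cf. (62).
    # A, B:        mode-1 and mode-2 regression matrices learned by MMR, cf. (61).
    p = p0.copy()
    for _ in range(max_iter):
        R = residual_fn(p)
        # Predict the parameter update, delta_p_i = -alpha_i^T R beta_i, cf. (61).
        dp = np.array([-A[:, i] @ R @ B[:, i] for i in range(A.shape[1])])
        p = p - dp                                   # cf. (63)
        if np.linalg.norm(dp) < tol:                 # assumed stopping rule
            break
    return p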

VI. EXPERIMENTS

In this section, the performance of the proposed regression method is verified by comparing it to PCR. First, numerical experiments are conducted to illustrate the USP of PCR and the advantages of MMR. The key idea of this section is to carefully construct an appropriate data generation model, which makes the component selection of PCR conceptually clear and the different types of error distinguishable. Subsequently, TAAM is examined in terms of accuracy, convergence ratio, and convergence speed.

A. Artificial Experiments

1) Data Generation Model: This paper makes the basic assumption that the data set is generated from the Tucker model [37], which is often used for unsupervised decomposition [48]. The PARAFAC model [2] could alternatively be used.

Particularly, consider a multilinear multivariate regression problem $f(X) = y$, where $X \in \mathbb{R}^{m_1 \times m_2}$ and $y \in \mathbb{R}^n$ denote the independent and dependent variables, respectively. The generation of these variables in the data set is predefined as

$$X = L \cdot K \cdot R^T + s^{-1}E, \qquad (64)$$
$$y = NZ(K, n), \qquad (65)$$

where $K \in \mathbb{R}^{m_1 \times m_2}$ denotes the "core" of the data set, which generates both $X$ and $y$ and has $k > n$ nonzero elements; $NZ(K, n)$ is the "elements-chosen" operator, which fetches the first (or a random) $n$ nonzero elements of $K$ into a vector; $L \in \mathbb{R}^{m_1 \times m_1}$ and $R \in \mathbb{R}^{m_2 \times m_2}$ denote the regression matrices defined as in (45); $s$ is the informal SNR; and $E \in \mathbb{R}^{m_1 \times m_2}$ stands for the Gaussian noise. Without risk of misunderstanding, we call $K$ the LV for its contribution to the regression model, and $k$ is called the length of $K$. For clarity and convenience, $L$ and $R$ are constructed as orthogonal matrices, although this is not necessary for the regression model.

This model is consistent with the model in Section III, which can be shown by some simple deductions. In particular, it follows from (64) and (65) that

$$\mathrm{vec}(X) = (L \circ R)^T \cdot \mathrm{vec}(K) + s^{-1}\mathrm{vec}(E), \qquad (66)$$
$$y = \mathrm{vec}(K)(1\!:\!n), \qquad (67)$$

where $\mathrm{vec}(X)$, $L \circ R$, $\mathrm{vec}(K)$, and $\mathrm{vec}(E)$ correspond to $x$, $\Upsilon$, $K$, and $\varepsilon$ in (8), respectively.

Subsequently, samples of $X$ and $y$ can be drawn based on (64) and (65). For example, given a random matrix $K_i$, a sample pair $\{X_i, y_i\}$ can be generated according to (64) and (65). Before applying the regression models, the samples are centralized. Finally, a total of $N$ sample pairs are generated.
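A short sketch of this sampling procedure (illustrative only; the placement of the k nonzero latent elements and the use of QR factors for L and R are our assumptions, and centralization is omitted):

import numpy as np

def draw_sample(L, R, k, n, s, rng):
    m1, m2 = L.shape[0], R.shape[0]
    K = np.zeros((m1, m2))
    K.flat[:k] = rng.uniform(0.0, 20.0, size=k)    # k nonzero LV elements, y ~ U(0, 20)
    E = rng.standard_normal((m1, m2))
    X = L @ K @ R.T + (0.0 if np.isinf(s) else 1.0 / s) * E   # cf. (64)
    y = K.flat[:n].copy()                                      # NZ(K, n), cf. (65)
    return X, y

rng = np.random.default_rng(0)
m0, n, k, N, s = 10, 5, 50, 40, np.inf             # the paper's general setting
L, _ = np.linalg.qr(rng.standard_normal((m0, m0)))
R, _ = np.linalg.qr(rng.standard_normal((m0, m0)))
samples = [draw_sample(L, R, k, n, s, rng) for _ in range(N)]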

Fig. 2. Comparison of MMR and PCR with respect to N. (a) N = 10; (b) N = 15; (c) N = 20; (d) N = 40; (e) N = 50; (f) N = 60.

The general setting of the data set is described here; some of the settings will be modified in later experiments for specific purposes. Because of the symmetry of $L$ and $R$, we can safely suppose $m_1 = m_2 = m_0$ for convenience, without loss of generality. The settings are $m_0 = 10$, $n = 5$, $k = 50$, and $N = 40$; $y \sim U(0, 20)$ is drawn from a uniform distribution; the SNR is set to $s = \infty$ (no error is added to the signal); and $e \sim N(0, 1)$ is drawn from a Gaussian distribution. The general requirement of these settings is that $m_1 \times m_2 \gg N$, which results in the USP for traditional regression models.

We adopt three groups of experiments. The first investigates the importance of the length k of K; the second examines the size of the training set (STS); the final one checks the SNR. In each group of experiments, four tests are recorded: two models, PCR and MMR, are used to predict on two sets, the training set (TNS) and the test set (TTS). MLR is not involved because of the invertibility problem in (4). Since the choice of the number of principal components (NPC) of PCR is important, all possible NPCs are examined, resulting in an accuracy–NPC curve such as those in Fig. 1.

The strategy for each experiment is as follows. First, a data set of N training samples, as well as 200 test samples, is generated randomly (using the "random" function in Matlab). Then, the models are trained using the training samples. Subsequently, the training set and the test set are used to verify the models, respectively, which results in two evaluation indicators for each model. This train-test procedure is repeated 100 times, and the mean result is recorded.

2) Length k: In this test, k is changed from 10 to 100 in steps of 10. Different accuracy–NPC curves are produced for the distinct values of k; these results are shown in Fig. 1. The prediction accuracy as a function of k is shown in Fig. 3.

On the one hand, the USP is obvious for PCR. First, when k is smaller than Rank(X), i.e., Γ < 1, Ux covers the whole space spanned by Υ. In this case, according to Theorem 3.1, the USP should not affect the model, which means that the model can accurately predict on both the training set and the test set. Moreover, since PCR is able to recover all the PCs, there is an upper bound on the variation of the NPC, which depends on the rank of the training set, Rank(X). This is obvious in Fig. 1(a)–(d), in which the prediction error is zero after all k PCs are selected.

Second, when k grows larger so that Γ > 1, Ux cannot cover the whole space spanned by Υ. As a result, the USP is produced. This is presented in Fig. 1(e)–(f), where k is larger than the training set size N. On the training set, the prediction error decreases to zero when the NPC grows to its maximum. On the contrary, there is always prediction error on the test set even if all the PCs are selected. This is the overfitting problem. It can also be observed in Fig. 3 that PCR always fits best on the training set, and only occasionally on the test set (k ≤ 40); when k > 40, the error never disappears.

On the other hand, we can find that MMR always gives a nearly optimal prediction on both the training set and the test set. Along with the variation of k, the prediction error stays close to zero, even if k exceeds the training set size N. This is because MMR avoids the unnecessary intermediate procedure of estimating PCs and is not limited to a small k. It only requires calculating the inverses of $H_{\beta_j}$ in (50) and $H_{\alpha_j}$ in (51), whose invertibility condition, $\mathrm{Rank}(H_*) = m_0$, is easy to satisfy compared to that of PCR in (4). Nevertheless, the solution of MMR is not exactly optimal because of the numerical method, i.e., the stopping condition of the algorithm in Table I. Except for the prediction accuracy of PCR on the test set, which decreases when k > 40, all other predictions remain at a low error level. This turning point at k > 40 is caused by the STS and Γ = 1. These results verify the USP of PCR and the effectiveness of MMR.

3) STS: We adjust the STS, N, to observe the variation of the prediction results, keeping the other parameters at the general setting. In this process, the STS is changed from 10 to 100. The accuracy–NPC curves for different N are shown in Fig. 2, and the best accuracy for each N is shown in Fig. 5.

Fig. 3. Prediction accuracy variations against k.

We first look at the performance of the PCR model. There are two cases of the prediction during the change of N. First, in the case of Γ > 1 [Fig. 2(a)–(d)], the prediction error on the training set decreases to zero when all the PCs are used, while on the test set the prediction error is only reduced to a nonzero floor. This is the Type-1 error caused by the USP. Moreover, it can be seen from Fig. 5 that this lowest prediction error is approximately a monotonically decreasing function of N. In contrast, in the case of Γ ≤ 1 [Fig. 2(e)–(f)], the prediction error on the test set is gradually reduced to zero at N = k. The overall variation tendency can be observed in Fig. 5, which is opposite to that of Fig. 3, because increasing k and increasing N have completely opposite effects on Γ.

In contrast, MMR suffers from a smaller USP than PCR. It can easily be concluded from Fig. 5 that MMR needs a smaller N than PCR to achieve accurate prediction on the test set, around N = 2m0. This is reasonable because the coefficients to be estimated in MMR are A and B, which have a total of m1 + m2 free parameters per dependent variable. As a result, only N = m1 + m2 (N = 2m0 in our experimental setting) is needed for MMR to overcome the USP. When this condition is not satisfied, there is still USP, or overfitting, for MMR; however, its prediction is much better than that of PCR, as is obvious in Fig. 5.

4) SNR: As discussed in Section III, the Type-2 error, which is related to noise, is unavoidable in practice. In this section, we study by example how noise affects the regression results. This is done by varying the SNR of the data set, s, from 0.5 to 10. Accordingly, different accuracy–NPC curves are obtained.

To observe the Type-2 error clearly, we first set k = 30 so that Γ < 1, in which case the Type-1 error is avoided. The resulting accuracy–NPC curves are shown in Fig. 4, and the best accuracy for each curve is recorded in Fig. 7. On the one hand, when the SNR is low [Fig. 4(a)–(e)], the prediction accuracy of PCR on the test set is greatly reduced compared to Fig. 1(c). This Type-2 error can also be observed in Fig. 7, which also shows that, as the noise decreases, the Type-2 error decreases monotonically. A side effect of the Type-2 error is that the maximum NPC is increased, which makes it difficult to determine the true PCs; this phenomenon is more prominent when the SNR is low [Fig. 4(a)–(b)]. On the other hand, though MMR also suffers from the Type-2 error, its prediction accuracy on the test set is higher than that of PCR. An interesting conclusion can be drawn from Fig. 7: MMR has little overfitting problem. Its prediction accuracy on the training set is not as high as that of PCR and decreases as the noise reduces.

Subsequently, another experiment is conducted to show the joint influence of the two types of error. For this purpose, k is set to 50, and the SNR is changed from 0.5 to 10. As a result, Γ > 1 and the Type-1 error is added. The resulting accuracy–NPC curves, as well as their best accuracies, are shown in Figs. 6 and 8, respectively. Compared to the case of k = 30, we clearly find that the overfitting problem of PCR is greatly magnified by the combination of the two types of error. On the contrary, MMR is hardly affected, as can be seen by comparison with Fig. 7.

5) Discussion: In the above experiments, we have examined how the two algorithms are influenced by each kind of parameter. In summary, PCR suffers from two types of error: the Type-1 error depends on the latent ratio Γ, and the Type-2 error is determined by the SNR. Unfortunately, PCR is sensitive to both types of error, particularly when they are combined (Fig. 8). Comparatively speaking, MMR effectively overcomes the two types of error and is able to obtain good performance even when they are combined. Therefore, MMR is preferred in practice.

B. MMR-Based AAM

1) Setup: To test the performance of various regression models in AAM, two databases are used in our experiments: the XM2VTS database [7] and the IMM face database [28].

• The XM2VTS frontal data set contains 2360 mug shots of 295 individuals, collected over four sessions. The images are taken under completely different lighting conditions and with different cameras. We select 400 images from the database to test the performance of the algorithms. For training the models, every image is labeled with 58 landmarks by hand.

• The IMM face database consists of 240 annotated images of 40 different human faces. Similarly, every image is annotated with 58 landmarks by hand. Example images from the two databases are shown in Fig. 9.

In the model building phase, PCA for the shape model leads to a shape subspace that retains 95% of the variation in the shapes. Then, all images are warped to the mean shape based on Delaunay triangulation to construct the normalized frames; as a result, the normalized frame has around 11 000 pixels. Subsequently, based on these normalized textures, we construct a texture subspace that represents 95% of the variation in the textures. Finally, a combined PCA is conducted on the concatenated parameter vectors of the shape and the texture to represent 99% of the total variation in appearance.

In order to evaluate the fitting result, it is necessary to determine the criterion for accuracy. In this paper, the point-to-point (pt-pt) error is selected for this purpose, which measures the difference between model points and hand-labeled points. In fact, if the shape is represented as

$$x = [x_1, x_2, \ldots, x_w, y_1, y_2, \ldots, y_w]^T, \qquad (68)$$

the pt-pt error is defined by

$$E_{pt\text{-}pt}(x, x_g) = \frac{1}{w}\sum_{i=1}^{w}\sqrt{(x_i - x_{gi})^2 + (y_i - y_{gi})^2}, \qquad (69)$$

where $x_g = [x_{g1}, x_{g2}, \ldots, x_{gw}, y_{g1}, y_{g2}, \ldots, y_{gw}]^T$ is the ground truth of the shape. When the pt-pt error is less than 10 pixels (pts), the search is considered to be successful,

$$\mathrm{hit} = \begin{cases} 1, & \text{if } E_{pt\text{-}pt} < 10 \\ 0, & \text{otherwise.} \end{cases} \qquad (70)$$
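A direct translation of this criterion into code (our own helper names):

import numpy as np

def pt_pt_error(shape, shape_gt):
    # shape, shape_gt: arrays [x_1..x_w, y_1..y_w] as in (68); returns (69).
    w = shape.size // 2
    dx = shape[:w] - shape_gt[:w]
    dy = shape[w:] - shape_gt[w:]
    return np.mean(np.sqrt(dx ** 2 + dy ** 2))

def hit(shape, shape_gt, threshold=10.0):
    # Success indicator, cf. (70).
    return 1 if pt_pt_error(shape, shape_gt) < threshold else 0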


Fig. 4. Comparison of MMR and PCR with respect to SNR (k = 30). (a) s = 0.5; (b) s = 1; (c) s = 1.5; (d) s = 2; (e) s = 2.5; (f) s = 10.

Fig. 5. Prediction accuracy variations against N .

The test strategy for the two models is identical. Given a test set, the accurate shape of each image is shifted by 10 pixels in x or y, respectively, to serve as the initial fitting location. Then, the model iteratively searches for the best parameters from this initial setting until convergence. After all the images are fitted, the average of the successful results is recorded.

We compare TAAM with the standard AAM in terms of accuracy, robustness, and fitting speed. The building procedures for both models are the same. The fitting procedure of AAM is based on PCR [31].

2) Accuracy Comparison: In order to verify the accuracy of the proposed method, we train the models with different STS values, which range from 40 to 200. Given a specific training set, the standard training procedure is performed to generate the two models. There are two kinds of test sets: the first contains the same images as the training set, and the other has no intersection with the training set. The fitting results on these two test sets are shown in Tables II and III.

Since AAM suffers from the USP, an overfitting problem results: the model performs much better on the training set than on the test set. This is evident from Tables II and III. The fitting error of AAM is small on the training set, between 0.72 pts and 1.3 pts, but large on the test set, between 3.79 pts and 5.67 pts. In contrast, this gap is not so large for TAAM, which obtains at most 2.31 pts on the training set and 2.85 pts on the test set. In other words, TAAM alleviates the overfitting problem of AAM. Moreover, TAAM obtains better accuracy than AAM on the test set.

3) Convergence Ratio Comparison: To investigate the performance of the proposed algorithm in terms of convergence ratio, the two models (TAAM and AAM) are trained on the same training set of 200 images. For a labeled image, the correct shape is displaced by up to 60 pixels in x and y. The fitting procedure then starts from these displaced shapes and runs until convergence. The convergence results are shown in Fig. 10. They show that, for the same disturbance, TAAM always obtains a convergence ratio higher than or equal to that of AAM, because TAAM is more robust against disturbance. However, when the disturbance exceeds about 25 pixels, neither algorithm predicts well. This is caused by a limitation of the AAM itself, which has been addressed by other techniques [18], [37].

4) Faster Convergence: To investigate the convergence speed of the two models, the number of iterative steps at a specific parameter disturbance is studied. Given disturbed initial parameters, the fitting procedure is iterated until convergence, and the mean number of iteration steps over successful examples is recorded as a measure of convergence speed. The final results for both models are shown in Fig. 11. They show that TAAM always needs fewer iteration steps than AAM to converge, and the gap widens as the disturbance grows. In this sense, TAAM is more efficient than AAM.

VII. DISCUSSION AND CONCLUSION

In this paper, we have discussed the USP of PCR based on a data generation model. The analysis shows that there are two types of error in PCR, which result from the small sample size and


Fig. 6. Comparison of MMR and PCR with respect to SNR (k = 50). (a) s = 0.5; (b) s = 1; (c) s = 1.5; (d) s = 2; (e) s = 2.5; (f) s = 10.

Fig. 7. Prediction accuracy variations against s (k = 30).

Fig. 8. Prediction accuracy variations against s (k = 50).

noise, respectively. To address the USP of PCR, this paper has developed an MMR model, together with an alternative projection optimization solution, which generalizes the MLR model to multilinear data. The proposed model preserves the structural information of the data to help alleviate the USP. As a result, both types of error of PCR are greatly reduced. Numerical experiments carefully examined the influences of all the factors

Fig. 9. Labeled images in the databases XM2VTS (top) and IMM (bottom).

TABLE II. FITTING ACCURACY ON TRAINING SET

TABLE III. FITTING ACCURACY ON TEST SET

in the models and showed that MMR is more effective for multilinear regression problems.

The proposed MMR has been applied to the AAM, which results in a multilinear AAM (TAAM). TAAM treats the


Fig. 10. Comparison of convergence ratio between AAM and TAAM.

Fig. 11. Convergence rate versus initial displacement.

texture residuals as matrices and utilizes MMR to predict the parameter updates. Since the spatial information of the texture is preserved, the prediction is more accurate than that of the traditional vector-based AAM. Several experiments have been conducted to verify the proposed model in terms of accuracy, convergence speed, and robustness.

Although the proposed MMR only accepts objects in matrix form, higher-order objects are also applicable. Future work will study these more general situations.

REFERENCES

[1] Signal-to-Noise Ratio. [Online]. Available: http://en.wikipedia.org/wiki/Signal-to-noise_ratio
[2] E. Acar and B. Yener, “Unsupervised multiway data analysis: A literature survey,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 1, pp. 6–20, Jan. 2009.
[3] C. M. Bishop, Pattern Recognition and Machine Learning. Berlin, Germany: Springer-Verlag, 2007.
[4] R. Bro, “Exploratory study of sugar production using fluorescence spectroscopy and multi-way analysis,” Chemometrics Intell. Lab. Syst., vol. 46, no. 2, pp. 133–147, Mar. 1999.
[5] R. Bro, “Multiway calibration. Multilinear PLS,” J. Chemometrics, vol. 10, no. 1, pp. 47–61, Jan./Feb. 1996.
[6] T. Ching-Ting and J. J. J. Lien, “Automatic location of facial feature points and synthesis of facial sketches using direct combined model,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 4, pp. 1158–1169, Aug. 2010.

[7] R. A. Harshman, “Foundations of the PARAFAC procedure: Models and conditions for an ‘explanatory’ multi-modal factor analysis,” UCLA Working Papers in Phonetics, vol. 16, pp. 1–84, 1970.

[8] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” in Proc. Eur. Conf. Comput. Vis., 1998, vol. 2, pp. 484–498.
[9] T. F. Cootes and C. J. Taylor, “On representing edge structure for model matching,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., 2001, vol. 1, pp. 1114–1119.
[10] L. De Lathauwer, “Signal processing based on multilinear algebra,” Ph.D. dissertation, Katholieke Universiteit Leuven, Leuven, Belgium, 1997.
[11] U. Depczynski, V. J. Frost, and K. Molt, “Genetic algorithms applied to the selection of factors in principal component regression,” Analytica Chimica Acta, vol. 420, no. 2, pp. 217–227, Sep. 2000.
[12] F. Dornaika and J. Ahlberg, “Fast and reliable active appearance model search for 3-D face tracking,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 4, pp. 1838–1853, Aug. 2004.
[13] N. Draper, H. Smith, and E. Pownell, Applied Regression Analysis, vol. 706. New York: Wiley, 1998.
[14] X. Gao, Y. Su, X. Li, and D. Tao, “A review of active appearance models,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 2, pp. 145–158, Mar. 2010.
[15] X. Gao, Y. Yang, D. Tao, and X. Li, “Discriminative optical flow tensor for video semantic analysis,” Comput. Vis. Image Understand., vol. 113, no. 3, pp. 372–383, Mar. 2009.
[16] S. P. Gurden, J. A. Westerhuis, R. Bro, and A. K. Smilde, “A comparison of multiway regression and scaling methods,” Chemometrics Intell. Lab. Syst., vol. 59, no. 1/2, pp. 121–136, Nov. 2001.
[17] I. T. Jolliffe, “A note on the use of principal components in regression,” J. Roy. Statist. Soc. C, Appl. Statist., vol. 31, no. 3, pp. 300–303, 1982.
[18] P. Kittipanya-ngam and T. F. Cootes, “The effect of texture representations on AAM performance,” in Proc. Int. Conf. Pattern Recog., 2006, vol. 2, pp. 328–331.
[19] D. Kleinbaum, L. Kupper, and K. Muller, Applied Regression Analysis and Other Multivariable Methods. Florence, KY: Duxbury Press, 2007.
[20] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Rev., vol. 51, no. 3, pp. 455–500, Aug. 2009.
[21] P. M. Kroonenberg, Three-Mode Principal Component Analysis: Theory and Applications. Leiden, The Netherlands: DSWO Press, 1983.
[22] X. Li, S. Lin, S. Yan, and D. Xu, “Discriminant locally linear embedding with high-order tensor data,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 2, pp. 342–352, Apr. 2008.
[23] K. Liu, Y.-Q. Cheng, and J.-Y. Yang, “Algebraic feature extraction for image recognition based on an optimal discriminant criterion,” Pattern Recog., vol. 26, no. 6, pp. 903–911, Jun. 1993.
[24] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “MPCA: Multilinear principal component analysis of tensor objects,” IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 18–39, Jan. 2008.
[25] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “Multilinear principal component analysis of tensor objects for recognition,” in Proc. Int. Conf. Pattern Recog., 2006, vol. 2, pp. 776–779.
[26] P. Lucey, J. F. Cohn, I. Matthews, S. Lucey, S. Sridharan, J. Howlett, and K. M. Prkachin, “Automatically detecting pain in video through facial action units,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 3, pp. 664–674, Jun. 2011.
[27] B. Mertens, T. Fearn, and M. Thompson, “The efficient cross-validation of principal components applied to principal component regression,” Statist. Comput., vol. 5, no. 3, pp. 227–235, 1995.
[28] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, XM2VTSDB: The Extended M2VTS Database. [Online]. Available: http://www.ee.surrey.ac.uk/CVSSP/xm2vtsdb/
[29] Y. Mu, D. Tao, X. Li, and F. Murtagh, “Biologically inspired tensor features,” Cognit. Comput., vol. 1, no. 4, pp. 327–341, 2009.
[30] J. Nilsson, S. de Jong, and A. K. Smilde, “Multiway calibration in 3D QSAR,” J. Chemometrics, vol. 11, no. 6, pp. 511–524, Nov./Dec. 1997.
[31] M. M. Nordstrøm, M. Larsen, J. Sierakowski, and M. B. Stegmann, The IMM Face Database—An Annotated Dataset of 240 Face Images. [Online]. Available: http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=3160
[32] P. Sanguansat, W. Asdornwised, S. Jitapunkul, and S. Marukatat, “Two-dimensional linear discriminant analysis of principle component vectors for face recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2006, vol. 2, pp. 2164–2170.
[33] I. M. Scott, T. F. Cootes, and C. J. Taylor, “Improving appearance model matching using local image structure,” in Proc. Int. Conf. Inf. Process. Med. Imag., 2003, pp. 258–269.
[34] A. Shashua and A. Levin, “Linear image coding for regression and classification using the tensor-rank principle,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., 2001, vol. 1, pp. 42–49.
[35] A. K. Smilde and H. A. L. Kiers, “Multiway covariates regression models,” J. Chemometrics, vol. 13, no. 1, pp. 31–48, Jan./Feb. 1999.
[36] A. K. Smilde, J. A. Westerhuis, and R. Boqué, “Multiway multiblock component and covariates regression models,” J. Chemometrics, vol. 14, no. 3, pp. 301–331, May/Jun. 2000.
[37] M. B. Stegmann, “Active appearance models: Theory, extensions & cases,” M.S. thesis, Technical University of Denmark, Kongens Lyngby, Denmark, 2000.
[38] M. B. Stegmann and R. Larsen, “Multi-band modelling of appearance,” Image Vis. Comput., vol. 21, no. 1, pp. 61–67, Jan. 2003.


[39] Y. Su, D. Tao, X. Li, and X. Gao, “Texture representation in AAM using Gabor wavelet and local binary patterns,” in Proc. IEEE Int. Conf. Syst., Man Cybern., 2009, pp. 3274–3279.
[40] J. Sun, “A correlation principal component regression analysis of NIR data,” J. Chemometrics, vol. 9, no. 1, pp. 21–29, Jan./Feb. 1995.
[41] J. Sun, D. Tao, S. Papadimitriou, P. S. Yu, and C. Faloutsos, “Incremental tensor analysis: Theory and applications,” ACM Trans. Knowl. Discovery Data, vol. 2, no. 3, pp. 1–37, Oct. 2008.
[42] D. Tao, X. Li, X. Wu, W. Hu, and S. Maybank, “Supervised tensor learning,” Knowl. Inf. Syst., vol. 13, no. 1, pp. 1–42, 2007.
[43] D. Tao, X. Li, X. Wu, and S. Maybank, “Tensor rank one discriminant analysis—A convergent method for discriminative multilinear subspace selection,” Neurocomputing, vol. 71, no. 10–12, pp. 1866–1882, Jun. 2008.
[44] D. Tao, X. Li, X. Wu, and S. J. Maybank, “General tensor discriminant analysis and Gabor features for gait recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 10, pp. 1700–1715, Oct. 2007.
[45] D. Tao, M. Song, X. Li, J. Shen, J. Sun, X. Wu, C. Faloutsos, and S. J. Maybank, “Bayesian tensor approach for 3-D face modeling,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 10, pp. 1397–1410, Oct. 2008.
[46] D. Tao, J. Sun, J. Shen, X. Wu, X. Li, S. J. Maybank, and C. Faloutsos, “Bayesian tensor analysis,” in Proc. IEEE Int. Joint Conf. Neural Netw., 2008, pp. 1402–1409.
[47] D. Tao, J. Sun, X. Wu, X. Li, J. Shen, S. J. Maybank, and C. Faloutsos, “Probabilistic tensor analysis with Akaike and Bayesian information criteria,” in Proc. Int. Conf. Neural Inf. Process., 2008, pp. 791–801.
[48] L. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, no. 3, pp. 279–311, Sep. 1966.
[49] Q. Wang, F. Chen, and W. Xu, “Tracking by third-order tensor representation,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 2, pp. 385–396, Apr. 2011.
[50] J. Wen, X. Gao, Y. Yuan, D. Tao, and J. Li, “Incremental tensor biased discriminant analysis: A new color-based visual tracking method,” Neurocomputing, vol. 73, no. 4–6, pp. 827–839, Jan. 2010.
[51] S. Wold, P. Geladi, K. Esbensen, and J. Öhman, “Multi-way principal components- and PLS-analysis,” J. Chemometrics, vol. 1, no. 1, pp. 41–56, Jan. 1987.
[52] S. Wold, A. Ruhe, H. Wold, and W. J. Dunn, “The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses,” SIAM J. Sci. Stat. Comput., vol. 5, no. 3, pp. 735–743, 1984.
[53] J. Yang, D. Zhang, A. F. Frangi, and J.-Y. Yang, “Two-dimensional PCA: A new approach to appearance-based face representation and recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 1, pp. 131–137, Jan. 2004.
[54] J. Ye, “Generalized low rank approximations of matrices,” in Proc. Int. Conf. Mach. Learn., 2004, pp. 887–894.
[55] J. Zhang, J. Pu, C. Chen, and R. Fleischer, “Low-resolution gait recognition,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 4, pp. 986–996, Aug. 2010.
[56] L. Zhang, L. Zhang, D. Tao, and X. Huang, “A multifeature tensor for remote-sensing target recognition,” IEEE Geosci. Remote Sens. Lett., vol. 8, no. 2, pp. 374–378, Mar. 2011.

Ya Su (M’03) received the B.Sc., M.Sc., and Ph.D. degrees in signal and information processing from Xidian University, Xi’an, China, in 2003, 2006, and 2010, respectively.

He is currently a Postdoctoral Fellow in the Department of Electronic Engineering, Tsinghua University, Beijing, China. His research interests include machine learning and computer vision.

Xinbo Gao (M’02–SM’07) received the B.Eng., M.Sc., and Ph.D. degrees in signal and information processing from Xidian University, Xi’an, China, in 1994, 1997, and 1999, respectively.

From 1997 to 1998, he was a Research Fellow in the Department of Computer Science at Shizuoka University, Shizuoka, Japan. From 2000 to 2001, he was a Postdoctoral Research Fellow in the Department of Information Engineering at the Chinese University of Hong Kong, Shatin, Hong Kong. In 2001, he joined the School of Electronic Engineering at Xidian University. Currently, he is a Professor of Pattern Recognition and Intelligent Systems and the Director of the VIPS Lab, Xidian University. His research interests are computational intelligence, machine learning, computer vision, pattern recognition, and wireless communications. In these areas, he has published five books and around 150 technical articles in refereed journals and proceedings, including IEEE TIP, TCSVT, TNN, TSMC, Pattern Recognition, etc.

Dr. Gao is on the editorial boards of journals including EURASIP Signal Processing (Elsevier) and Neurocomputing (Elsevier). He has served as a General Chair/Co-Chair, Program Committee Chair/Co-Chair, or PC member for around 30 major international conferences. He is a Fellow of IET and a Senior Member of IEEE.

Xuelong Li (M’02–SM’07–F’12) is a Full Professor with the Center for OPTical IMagery Analysis and Learning, State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an, China.

Dacheng Tao (M’07–SM’12) is a Professor of Computer Science with the Centre for Quantum Computation and Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia. He mainly applies statistics and mathematics to data analysis problems in computer vision, data mining, machine learning, multimedia, and video surveillance. He has authored and coauthored more than 100 scientific articles at top venues, including IEEE T-PAMI, T-IP, AISTATS, ICDM, CVPR, ECCV, and ACM Multimedia, with the best theory/algorithm paper runner-up award at IEEE ICDM’07.