A Deformable 3D Facial Expression Model for Dynamic Human Emotional State Recognition


    A Deformable 3D Facial Expression Model for

Dynamic Human Emotional State Recognition

Yun Tie, Member, IEEE, and Ling Guan, Fellow, IEEE

Abstract: Automatic emotion recognition from facial expression is one of the most intensively researched topics in affective computing and human-computer interaction (HCI). However, it is well known that, due to the lack of 3D features and dynamic analysis, the functional aspect of affective computing is insufficient for natural interaction. In this paper we present an automatic emotion recognition approach from video sequences based on a fiducial point controlled 3D facial model. The facial region is first detected with local normalization in the input frames. The 26 fiducial points are then located on the facial region and tracked through the video sequences by multiple particle filters. Depending on the displacement of the fiducial points, they may be used as landmarked control points to synthesize the input emotional expressions on a generic mesh model. As a physics-based transformation, Elastic Body Spline (EBS) technology is introduced to the facial mesh to generate a smooth warp that reflects the control point correspondences. This also extracts the deformation feature from the realistic emotional expressions. Discriminative Isomap (D-Isomap) based classification is used to embed the deformation feature into a low dimensional manifold that spans an expression space with one neutral and six emotion class centers. The final decision is made by computing the Nearest Class Center (NCC) of the feature space.

Index Terms: Video analysis, Elastic Body Spline, Differential Evolution Markov Chain, Discriminative Isomap, Nearest Class Center.

    I. INTRODUCTION

WITH the rapid development of Human-Machine Interaction (HMI), affective computing is currently gaining popularity in research and flourishing in the industry domain.

    It aims to equip computing devices with effortless and natural

    communication. The ability to recognize human affective state

    will empower the intelligent computer to interpret, under-

    stand, and respond to human emotions, moods, and possibly,

    intentions. This is similar to the way that humans rely on

their senses to assess each other's affective state [1]. Many

    potential applications such as intelligent automobile systems,

    game and entertainment industries, interactive video, indexing

and retrieval of image or video databases can benefit from this ability.

    Emotion recognition is the first and one of the most im-

    portant issues in the affective computing field. It incorporates

    computers with the ability to interact with humans more natu-

    rally and in a friendly manner. Affective interaction can have

    maximal impact when emotion recognition and expression is

Y. Tie and L. Guan are with the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada.

e-mail: [email protected], [email protected]

    available to all parties, human and computers [2]. Most of the

    existing systems attempt to recognize the human prototypic

    emotions. It is widely accepted from psychological theory

    that human emotions can be classified into six archetypal

    emotions: surprise, fear, disgust, anger, happiness, and sadness,

the so-called six basic emotions, as pioneered

    by Ekman and Friesen [3]. According to Ekman, the six-

    basic emotions are not culturally determined, but universal

    to human culture and thus biological in origin. There are also

    several other emotions and many combinations of emotions

that have been studied, but they are unconfirmed as universally distinguishable.

    Facial expression regulates face-to-face interactions, indi-

    cates reciprocity, interpersonal attraction or repulsion, and en-

    ables intersubjectivity between members of different cultures

    [4]. Recent research in the fields of psychology and neurology

    has shown that facial expression is a most natural and primary

    cue for communicating the quality and nature of emotions, and

    that it correlates well with the body and voice [5]. Each of the

    six basic emotions corresponds to a unique facial expression.

With respect to the objectives of an emotion recognition system, facial

    expression analysis is considered to be the major indicator

    of a human affective state.

In the past 20 years there has been much research in recognizing emotion through facial expressions. However,

challenges still remain. Traditionally, the majority of approaches

    for solving human facial expression recognition problems

    attempt to perform the task on two dimensional data, either 2D

    images or 2D video sequences. Unfortunately, such approaches

    have difficulty handling pose variations, lighting illumination

    and subtle facial behavior. The performance of 2D based

    algorithms remains unsatisfactory, and often proves unreliable

    under adverse conditions.

Using 3D visual features to recognize and understand fa-

    cial expressions has been demonstrated to be a more robust

    approach for human emotion recognition [6]. However, the

general 3D emotion recognition approaches are mainly based on static analysis. A growing body of psychological research

supports the view that the timing of expressions is a critical parameter

    in recognizing emotions and the detailed spatial dynamic

    deformation of the expression is important in expression recog-

    nition. Therefore the dynamic analysis for the state transitions

    of 3D faces could be a crucial clue to the investigation of

    human emotional states.

    Another weakness of the existing 3D based approaches is

the complexity and intensive computational cost required to meet the

    challenge of accuracy. The temporal and detailed spatial infor-

    mation in the 3D visual cues, both at local and global scales,


    may cause more difficulties and complexities in describing

    human facial movement. Moreover, automatic detection and

segmentation based on the facial components with respect to

    emotion recognition has not been reported so far. Most of the

    existing works require manual initialization.

    Fig. 1. Overall System Diagram.

    In light of these problems, this paper presents an automatic

    emotion recognition method from video sequences based on

    a deformable 3D facial expression model. We use the elas-

    tic body spline (EBS) based approach for human emotion

classification with the active deformation feature extraction depending on the 3D generic model. This model is driven by

    the key fiducial points and thus makes it possible to generate

    the intrinsic geometries of the emotional space. The block

    diagram of this method is shown in Fig. 1. The rest of the paper

    is organized as follows. Section II gives an overview on state-

    of-the-art for human emotion recognition. We then present the

    proposed 3D facial modeling and feature extraction from video

    sequences using EBS techniques in Section III. Discriminative

    Isomap (D-Isomap) based classification is discussed in Section

    IV. The experimental results are presented in Section V.

    Section VI gives our conclusions.

    II. RELATED WORKS

    The most commonly used vision-based coding system is

    the facial action coding system (FACS) proposed by Ekman

    and Friesen [7] for the manual labeling of facial behavior.

    To recognize emotions from facial clues, FACS enables facial

    expression analysis through standardized coding of changes

    in facial motion in terms of atomic facial actions called

    Action Units (AUs). The changes in the facial expression are

    described with FACS in terms of AUs. FACS decomposes the

    facial muscular actions into 44 basic actions and describes

    the facial expressions as combinations of the AUs. Many

researchers are inspired by this work and try to analyze facial

expressions in image and video processing. Most methods use the distribution of facial features as inputs of a classification

    system, and the outcome is one of the facial expression classes.

    Lyons et al. [8] used a set of multi-scale, multi-orientation

    Gabor filters to transform the images first. The Gabor coef-

    ficients sampled on the grid were combined into one single

    vector. They tested their system and achieved 75% expression

    classification accuracy by using Linear Discriminant Analysis

    (LDA). Silva and Hui [9] determined the eye and lip position

    using low-pass filtering and edge detection methods. They

    achieved an average emotion recognition rate of 60% using

    a neural network (NN). Cohen et al. [10] introduced the

    temporal information from video sequences for recognizing

    human facial expression. They proposed a multi-level hidden

    Markov model (HMM) classifier for dynamic classification, in

    which the temporal information was also taken into account.

    Guo and Dyer [11] introduced a linear programming based

    method for facial expression recognition with a small number

    of training images for each expression. A pairwise framework

    for feature selection was presented and three methods were

    compared in the experimental part. Pantic and Patras [12]

    presented a method to handle a large range of human facial

    behavior by recognizing facial muscle actions that produce

    expressions. The algorithm performed both automatic seg-

    mentation into facial expressions and recognition of temporal

    segments of 27 AUs. Anderson and McOwan [13] presented an

    automated multistage system for real-time recognition of facial

    expression. The system used facial motion to characterize

    monochrome frontal views of facial expressions. It was able to

    operate effectively in cluttered and dynamic scenes, recogniz-

    ing the six emotions universally associated with unique facial

    expressions. Gunes and Piccardi [14] proposed an automatic

method for temporal segment detection and affect recognition from facial and body displays. Wang and Guan [15] con-

    structed a bimodal system for emotion recognition. They used

    a facial detection scheme based on a Hue Saturation Value

    (HSV) color model to detect the face from the background

    and Gabor wavelet features to represent the facial expressions.

    Presently, state-of-the-art 3D facial modeling by physically

    based paradigm has been recognized as a key research area

    of emotion recognition for next-generation human-computer

    interaction (HCI) [16]. Song et al. [17] presented a generic fa-

    cial expression analogy technique to transfer facial expressions

    between arbitrary 3D facial models, and between 2D facial

    images. Geometry encoding for triangle meshes, vertex-tent-

coordinates, was proposed to formulate expression transfer in 2D and 3D cases as a solution to a simple system of linear

    equations. In [18], a 3D features based method for human

    emotion recognition was proposed. 3D geometric information

    plus colour/density information of the facial expressions were

    extracted by 3D Gabor library to construct visual feature

    vectors. The improved kernel canonical correlation analysis

    (IKCCA) algorithm was applied for final decision, and the

    overall recognition rate was about 85%. A static 3D facial

    expression recognition method was proposed in [19]. The

    primitive 3D facial expression features were extracted from

    3D models based on the principal curvature calculation on

    3D mesh models. Classification into one of the six-basic

emotions was done based on the statistical analysis of these features, and the best performance was obtained using LDA.

    Although several methods can achieve a very high recogni-

    tion rate, most of the existing 3D face expression recognition

    works are based on static data. Soyel and Demirel [20], [21]

    used six distance measures from 3D distributions of facial

    feature points to form the feature vectors. The probabilistic

    NN architecture was applied to classify the facial expressions.

    They obtained an average recognition rate of 87.8%. Unfortu-

    nately, the authors did not specify how to identify this set

    of feature points. Tang and Huang [22], [23] used similar

    distance features based on the change of face shape between


    the emotional expressions. Normalized Euclidean distances

    between the facial feature points were used for emotion

    classification. An automatic feature selection method was also

    proposed based on maximizing the average relative entropy

    of marginalized class-conditional feature distributions. Using

    a regularized multi-class AdaBoost classification algorithm,

    they achieved a 95.1% average recognition rate. However the

    facial feature points were predefined on the cropped 3D face

    mesh model, and were not generated automatically. Such an

approach is, therefore, difficult to use in real-world ap-

    plications. Thus far, few efforts have been reported exploiting

    3D facial expression recognition in dynamic or deformable

    feature analysis. Sun and Yin [24] extracted sophisticated

    features of geometric labeling and used 2D HMMs to model

    the spatial and temporal relationships between the features

    for recognizing expressions from 3D facial model sequences.

    However this method requires manual detection and annotation

    of certain facial landmarks.

    III. METHODOLOGY

    In this section we present a fully automatic method for

    emotion recognition that exploits the EBS features between

    neutral and expressional faces based on a 3D deformable mesh

    (DM) model. The system developed consists of several steps.

    The facial region is first detected automatically in the input

    frames using the local normalization based method [25]. We

    then locate 26 fiducial points over the facial region using scale-

    space extrema and scale invariant feature examination. The

    fiducial points are tracked continuously by multiple particle

    filters throughout the video sequences. EBS is used to extract

    the deformation features and the D-Isomap algorithm is then

    applied for the final decision.

    A. Preprocessing

    Automatic face detection is considered to be the first es-

    sential requirement for our emotion recognition system. Since

    the faces are non-rigid and have a high degree of variability

    in location, color and pose, several facial features that are

    uncommon to other pattern detection issues make facial de-

    tection more complex. Occlusion and lighting distortions and

    illumination conditions can also change the overall appearance

    of a face. We detect facial regions in the input video sequence

    consisting of feature selection and classification based on a

local normalization technique [25]. Compared to the Viola and
Jones algorithm [26], the proposed method is adaptive to the normalized input image and designed to complete the

    segmentation in a single iteration. With the local normalization

    based method, the proposed emotion recognition system can

    be more robust under different illumination conditions.
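For illustration, a minimal sketch of this kind of local normalization is given below in Python/NumPy (using SciPy's uniform filter); the window size and the stabilizing constant are illustrative assumptions rather than the settings used in [25].

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_normalize(gray, win=15, eps=1e-6):
    """Zero-mean, unit-variance normalization of each pixel within a local window.

    gray : 2D float array (grayscale input frame)
    win  : side length of the sliding window (illustrative value)
    eps  : small constant preventing division by zero in flat regions
    """
    mean = uniform_filter(gray, size=win)
    sq_mean = uniform_filter(gray * gray, size=win)
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0.0)) + eps
    return (gray - mean) / std   # enhances dark, bright, and low-contrast regions alike
```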

    Fiducial points are a set of facial salient points, usually

    located on the corners, tips or mid points of facial components.

Automatically detecting fiducial points makes it possible to extract the prominent characteristics of facial expressions from the distances between points and the relative sizes of the facial components, and to form a feature vector. Additionally, the chosen feature points should represent the most important characteristics of

    the face and be extracted easily. The Active Appearance Mod-

    els (AAM) and Active Shape Models (ASM) are two popular

    feature localization methods with statistical face models to

    prevent locating inappropriate feature points. The AAM [27],

    [28] fits a generative model to the region of interest. The best

    match of the model simultaneously calculates feature point

    locations. The ASM algorithm learns a statistical model of

    shape from manually labeled images and the PCA models

    of patches around individual feature points. The best local

    match of each feature is found with constraints on the relative

    configuration of feature points. They are commonly used to

    track faces in video. In general, the point to point accuracy is

    around 85% if the bias of the automatic labeling result to the

    manual labeling result is less than 20% of the true inter-ocular

    distance [29]. However, it is not sufficient in the case of facial

    expression analysis.

    We choose 26 fiducial points [30] on the facial region ac-

cording to the anthropometric measurement with the maximum

    movement of the facial components during expressions. To

    follow the subtle changes in the facial feature appearance, we

define a SUCCESS case if the bias of a detected point to the true facial point is less than 10% of the inter-ocular distance in the

    test image. The proposed method constructs a set of fiducial

    point detectors with scale invariant feature. Candidate points

    are selected over the facial region by the local scale-space ex-

    trema detection. The scale invariant feature for each candidate

    point is extracted to form the feature vectors for the detection.
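The sketch below illustrates one plausible way to generate such candidate points as local extrema of a difference-of-Gaussians (DoG) scale space over the detected facial region; the scale ladder and contrast threshold are illustrative assumptions and are not taken from [30].

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_candidates(face_gray, sigmas=(1.0, 1.6, 2.6, 4.0), thresh=0.02):
    """Return (row, col, scale_index) candidates at local scale-space extrema."""
    blurred = [gaussian_filter(face_gray, s) for s in sigmas]
    dogs = np.stack([b2 - b1 for b1, b2 in zip(blurred[:-1], blurred[1:])])
    # a sample is a candidate if it is a maximum or minimum of its 3x3x3 neighbourhood
    is_max = dogs == maximum_filter(dogs, size=3)
    is_min = dogs == minimum_filter(dogs, size=3)
    strong = np.abs(dogs) > thresh
    scale_idx, rows, cols = np.where((is_max | is_min) & strong)
    return list(zip(rows, cols, scale_idx))
```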

    We use multiple Differential Evolution Markov Chain (DE-

    MC) particle filters [31] to track the fiducial points depending

    on the locations of the current appearance of the spatially

    sampled features. The kernel correlation based on HSV color

    histograms is used to estimate the observation likelihood and

    measure the correctness of particles. We define the observation

likelihood of the color measurement distribution using the correlation coefficient. Starting with a mode-seeking procedure, the

    posterior modes are subsequently detected through the kernel

    correlation analysis. It provides a consistent way to resolve

    the ambiguities that arise in associating multiple objects with

    measurements of the similarity criterion between the target

    points and the candidate points. The proposed method achieves

    an overall accuracy of 91% for the 26 fiducial points [31].
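As an illustration of the observation model only, the sketch below weights a candidate facial patch by the correlation coefficient between HSV color histograms; the bin counts and the exponential mapping are assumptions, and the DE-MC proposal and resampling steps of [31] are omitted.

```python
import numpy as np

def hsv_histogram(patch_hsv, bins=(8, 8, 4)):
    """Normalized HSV color histogram of an image patch (H, W, 3), channels scaled to [0, 1]."""
    hist, _ = np.histogramdd(patch_hsv.reshape(-1, 3), bins=bins,
                             range=((0, 1), (0, 1), (0, 1)))
    hist = hist.ravel()
    return hist / (hist.sum() + 1e-12)

def correlation_likelihood(target_hist, candidate_hist, sigma=0.2):
    """Observation likelihood of one particle: higher histogram correlation -> higher weight."""
    rho = np.corrcoef(target_hist, candidate_hist)[0, 1]
    return np.exp((rho - 1.0) / (2.0 * sigma ** 2))
```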

    B. 3D EBS Facial Modeling

    The EBS is an image morphing technique derived from

    the Navier equation that describes the deformation of ho-

mogeneous elastic tissues. It was developed for matching 3D MRIs of the breast used in the evaluation of breast
cancer [32], [33]. Davis et al. [32] designed the EBS for
matching 3D magnetic resonance images (MRIs) used in the

    evaluation of breast cancer. The coordinate transformations are

    evaluated with different types of deformations and different

    numbers of corresponding coordinate locations. The EBS is

    based on a mechanical model of an elastic body, which

    can approximate the properties of body tissues. The spline

    maps can be expressed as the linear combination of an affine

    transformation and a Navier interpolation spline. It allows each

    landmark to be mapped to the corresponding landmark in


    the other image and provides interpolation of this mapping

    at intermediate locations. Hassanien and Nakajima [34] used

    Navier EBS to generate warp functions for facial animation

    with the interpolating scattered data points. Kuo et al. [35]

    proposed an iterative EBS algorithm to obtain the elastic

    property of a facial model for facial expression generation.

However, most of the feature points in these works were manually

    localized, and only 2D examples were considered for facial

    image analysis.

Fig. 2. Proposed 3D mesh model with 26 fiducial points (black) and 28 characteristic points (red).

The proposed EBS method automatically generates facial

    expressions using a 3D physically based DM model according

    to a deformable feature perspective executed with the control

    points within an acceptable time for emotion recognition. The

    mesh wireframe generic facial model consists of characteristic

    feature points and deformable polygons with EBS structure.

    We can deform the wireframe model to best fit a human face

    with any expressions. The 3D affine transformation realizes the

facial expressions by imitating the facial muscular actions. It formulates the deforming rules according to the FACS coding

    system using the 26 fiducial points as control points. Fig. 2

    shows the proposed model based on this standardized coding

    system. In practical applications, not all feature points in the

    model can be easily detected from the input sequences, so

    we use 54 characteristic feature points for facial expression

    parameterization. Characteristic feature points include: a) the

    26 control points based on the fiducial points, and b) 28

    dependent points which are determined by the control points.

    We also assume that the physical property of the EBS structure

    is the same within the facial region. The EBS deformation

    analysis is presented in following section.

Merits of this approach are: a) a physically based DM model of the human face with fiducial points for driving facial

    deformation according to muscle movement parameterization.

    The face can be modeled as an elastic body that is deformed

    under a tension force field. Muscles are interpreted as forces

    deforming the polygonal mesh of the face. The factors affect-

    ing the deformation are tension of the muscle, elasticity of

    the skin and zone of influence. Higher-level parameterizations

    are easier to use for emotional expressions and can be de-

    fined in terms of low-level parameters. b) We extend a DM

    facial model by a set of well-designed polygons with EBS

    structure which can be efficiently modified to establish the

    facial expression model. A 3D face is decomposed into area

    or volume elements, each endowed with physical parameters

    embedded in an EBS model according to the surface curva-

    ture. The deformable element relationships are computed by

    integrating the piecewise components over the entire face. c)

    The control points are predefined by the landmarked fiducial

    points. The number of control points is small and they can be

    identified robustly and automatically. Once the control points

    are adjusted, the emotional facial model can be established

    using the transform function of EBS and extended to obtain

    expression parameters for final recognition.

    Using EBS transforms we can interpolate the positions of

    characteristic feature points such that the 3D facial model of

non-neutral expressions can be generated from the

    input video frame. Based on the arrangement of facial muscle

    fibers, our EBS model calculates elastic characteristics for

    each emotional face by modeling the facial muscle fiber as an

    elastic body. The affine elastic body coordinate transformation

    is fitted to the displacements of the facial expression with

    the continuity condition. The spline obtained by this method

is mathematically identical to the one computed directly from the original displacements of the control points.

    Moreover, the resulting spline is added to the initial mesh of

    the elastic body transformation to give the overall coordinate

    transformation. Simulation results show that the facial model

    generated by our method demonstrates good performance

    under the availability of control point positions.

    C. EBS parameterizations

    EBS is applied for generating different facial expressions

    with a generic facial model from a neutral face. By varying

    the position of control points, EBS mathematically describes

    the equilibrium displacement of the facial expressions sub-

    jected to muscular forces using a Navier partial differential

    equation (PDE). The deformable facial model equations can

    be expressed in 3D vector form with the interpolation spline

    relating the set of corresponding control points. The PDE of

    an elastic body is based on notions of stress and strain. When

    a body is subject to an external force this induces internal

    forces within the body which cause it to deform. The integral

of the surface forces and body forces must be zero [36]. Let x denote a set of feature points in the 3D facial model of the neutral
face and y_i be the corresponding control points with expressions; we then have the Navier equilibrium PDE:

$$\mu \nabla^{2}\mathbf{l}(\mathbf{x}) + (\lambda + \mu)\,\nabla\big[\nabla \cdot \mathbf{l}(\mathbf{x})\big] + \mathbf{f}(\mathbf{x}) = 0 \qquad (1)$$

where l(x) is the displacement of all characteristic feature points within the facial model from the original position (neutral face), and λ and μ are the Lamé coefficients which describe the physical properties of the face; μ is also referred to as the shear modulus. ∇² and ∇ denote the Laplacian and gradient operators, respectively, and ∇·l(x) is the divergence of l(x). f(x) is the muscular force field being applied to the face.

    To find an appropriate physical property for an expressional

    model, muscular forces are assumed to distribute on the ho-

    mogeneous isotropic elastic body of the facial model to obtain


smooth deformation. A polynomial radially symmetric force is therefore considered:

$$\mathbf{f}(\mathbf{x}) = \mathbf{w}\, d(\mathbf{x}) \qquad (2)$$

where w = [w1 w2 w3]^T is the strength of the force field and d(x) = (x1² + x2² + x3²)^{1/2}. The solutions of the PDE (1) can be computed as:

$$\mathbf{l}(\mathbf{x}) = E(\mathbf{x})\,\mathbf{w} \qquad (3)$$

and

$$E(\mathbf{x}) = \big[\alpha\, d(\mathbf{x})^{2} I - 3\,\mathbf{x}\mathbf{x}^{T}\big]\, d(\mathbf{x}) \qquad (4)$$

where α = (11μ + 5λ)/(λ + μ) = 12(1 − ν) − 1 is determined by the Poisson's ratio ν, I is a 3 × 3 identity matrix, and x x^T is an outer product. It is obtained using the Galerkin vector method [36] to transform the three coupled PDEs into three independent equations. The solution can be

    verified by substituting (3) into (1). The EBS displacement

L_EBS(x) is a linear combination of the PDE solutions in (3):

$$L_{EBS}(\mathbf{x}) = \sum_{i=1}^{N} E(\mathbf{x} - \mathbf{y}_i)\,\mathbf{w}_i + A\mathbf{x} + B \qquad (5)$$

where Ax + B is the affine portion of the EBS and A = [a1 a2 a3]^T is a 3 × 3 matrix. The coefficients of the spline are determined from the control points y_i and the displacements of the feature points. The spline relaxes to an affine

    transformation as the distance from the point approaches

    infinity.

    The summation in (5) can be expressed in the matrix-vector

    form as:

$$E_{EBS} = H\, L_{EBS} \qquad (6)$$

where H is a (3N + 12) × (3N + 12) transfer function as described by Kuo [35], and E_EBS is a (3N + 12) × 1 vector containing all the EBS coefficients:

$$E_{EBS} = \big[\,\mathbf{w}_1^{T}\ \mathbf{w}_2^{T} \cdots \mathbf{w}_N^{T}\ \mathbf{a}_1^{T}\ \mathbf{a}_2^{T}\ \mathbf{a}_3^{T}\ \mathbf{b}^{T}\,\big]^{T} \qquad (7)$$

    In our system, the 26 control points and the displacements of

    the control point sets are obtained from the fiducial detection

    and tracking steps. We solve (6) from the requirements that

    the spline displacements equal the control point displacements

with a constant ν over the whole facial region. The flatness constraints, which are expressed in terms of second or higher order terms (e.g., x_i², x_j², or x_i x_j), are set to zero, enforcing the conservation of linear and angular momenta for an equilibrium solution. These constraints cause the force field to be balanced so that the EBS facial model is stationary. The values of the spline for the 28 dependent points are computed from (5) with the spline coefficients E_EBS, the spline basis function H, and the control point locations.
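To make the solution of (6) concrete, the following Python/NumPy sketch assembles the (3N + 12) × (3N + 12) system from the basis of (4) and the affine part of (5), solves for the coefficients, and evaluates the spline at an arbitrary (e.g., dependent) point. The function names (ebs_basis, solve_ebs, ebs_displacement) and the precise form of the momentum-conservation side conditions are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def ebs_basis(x, alpha):
    """EBS basis of eq. (4): E(x) = [alpha*d(x)^2*I - 3*x*x^T] * d(x), with d(x) = |x|."""
    d = np.linalg.norm(x)
    if d == 0.0:
        return np.zeros((3, 3))
    return (alpha * d ** 2 * np.eye(3) - 3.0 * np.outer(x, x)) * d

def solve_ebs(y, disp, alpha):
    """Assemble and solve the (3N+12) x (3N+12) EBS system of eqs. (5)-(7).

    y     : (N, 3) control point positions on the neutral-face mesh
    disp  : (N, 3) control point displacements (expression minus neutral)
    alpha : coefficient of eq. (4), a function of the Poisson's ratio

    Returns (w, A, B): spline weights (N, 3), affine matrix (3, 3), offset (3,).
    The side conditions used here (sum of the weights and of their first
    moments set to zero) are the usual momentum-conservation constraints;
    the paper's exact constraint set is assumed, not quoted.
    """
    N = y.shape[0]
    n = 3 * N + 12
    K = np.zeros((n, n))
    rhs = np.zeros(n)
    for j in range(N):                                    # interpolation rows: L_EBS(y_j) = disp_j
        for i in range(N):
            K[3 * j:3 * j + 3, 3 * i:3 * i + 3] = ebs_basis(y[j] - y[i], alpha)
        K[3 * j:3 * j + 3, 3 * N:3 * N + 9] = np.kron(np.eye(3), y[j])  # A y_j (A stored row-wise)
        K[3 * j:3 * j + 3, 3 * N + 9:] = np.eye(3)                      # + B
        rhs[3 * j:3 * j + 3] = disp[j]
    K[3 * N:, :3 * N] = K[:3 * N, 3 * N:].T               # momentum-conservation side conditions
    coeffs = np.linalg.solve(K, rhs)
    w = coeffs[:3 * N].reshape(N, 3)
    A = coeffs[3 * N:3 * N + 9].reshape(3, 3)
    B = coeffs[3 * N + 9:]
    return w, A, B

def ebs_displacement(x, y, w, A, B, alpha):
    """Evaluate the EBS displacement of eq. (5) at a dependent point x."""
    s = sum(ebs_basis(x - yi, alpha) @ wi for yi, wi in zip(y, w))
    return s + A @ x + B
```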

The muscular force field f(x), given by (2), can be calcu-

    lated from the solutions of EBS according to the displacement

    of control points such that:

$$\mathbf{f}(\mathbf{x}) = \big[\,f_1\ f_2\ f_3\,\big]^{T} = \sum_{i=1}^{N} f(\mathbf{x} - \mathbf{y}_i)\,\mathbf{w}_i \qquad (8)$$

With different values of ν we obtain different f(x). By the princi-

    ple of superposition for an elastic body, the external forces

    must be minimized according to the roughness measurement

    constraints [35]. This ensures that the forces are optimally

    smooth and sufficient to deform the elastic material so that

    the EBS equals the given displacements at the control point

locations. By varying the value of ν in (4), we can calculate each corresponding muscular force field. From the minimum muscular force field |f(x)|_min, we obtain the appropriate physical property ν and the associated EBS coefficients E_EBS. We then construct the deformable visual feature v for classification from ν and E_EBS. The algorithm for deformation feature extraction is summarized as follows.

1) Initialize the feature point positions x in the 3D facial model for the neutral face according to the detection results from the 26 fiducial points.
2) Set ν = 0.01 for the facial region.
3) Update the corresponding control point positions y_i in the expressional facial model subject to the tracking results.
4) Calculate the displacements l of the control point sets in the facial region.
5) Solve the EBS in (6) to obtain the associated spline coefficients E_EBS.
6) Compute the positions of the nonsignificant points in the facial region based on the EBS solution in the previous step.
7) Calculate the muscular force field f(x) in (2) from the solution of the EBS.
8) Sweep ν from 0.02, 0.03, ..., to 0.5 and repeat steps 5), 6) and 7) to obtain the new muscular force fields.
9) Find the minimum muscular force field |f(x)|_min and fix ν and the EBS coefficients E_EBS.
10) Construct the deformable visual feature v for classification from E_EBS and ν.
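A possible realization of this sweep is sketched below, building on the solve_ebs sketch above. Summing the force magnitudes over the characteristic points as the criterion for |f(x)|_min, and the relation α = 12(1 − ν) − 1, are our reading of the text; the paper does not spell out the exact norm it minimizes.

```python
import numpy as np

def force_field(x, y, w):
    """Muscular force at x, following eqs. (2) and (8): f(x) = sum_i d(x - y_i) w_i."""
    d = np.linalg.norm(x[None, :] - y, axis=1)         # distances to the N control points
    return (d[:, None] * w).sum(axis=0)

def extract_ebs_feature(y_neutral, disp, eval_points, nus=np.arange(0.01, 0.51, 0.01)):
    """Sweep the Poisson's ratio, keep the value minimizing the total muscular
    force magnitude, and return (nu*, E_EBS) as the deformation feature.
    Relies on the solve_ebs sketch given earlier."""
    best = None
    for nu in nus:
        alpha = 12.0 * (1.0 - nu) - 1.0                # alpha = (11*mu + 5*lambda)/(lambda + mu)
        w, A, B = solve_ebs(y_neutral, disp, alpha)
        total = sum(np.linalg.norm(force_field(x, y_neutral, w)) for x in eval_points)
        if best is None or total < best[0]:
            best = (total, nu, np.concatenate([w.ravel(), A.ravel(), B]))
    _, nu_star, e_ebs = best
    return nu_star, e_ebs                              # v = [E_EBS, nu*]: 3*54 + 12 + 1 = 175 dims for N = 54
```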

    IV. D-ISOMAP BASED CLASSIFIER

    Once the deformable facial features have been obtained

    with the EBS, we use an isomap based method for emotion

    classification. Isomap was first proposed by Tenenbaum [37],

    and is one of the most popular manifold learning techniques

for nonlinear dimensionality reduction. It attempts

    to learn complex embedding manifolds using local geometric

    metrics within a single global coordinate system. The Isomap

    algorithm uses geodesic distances between points instead of

simply taking Euclidean distances, thus encoding the manifold structure of the input space into distances. The geodesic

    distances are computed by constructing a sparse graph in

    which each node is connected only to its closest neighbors.

    The geodesic distance between each pair of nodes is taken to

    be the length of the shortest path in the graph that connects

    them. These approximated geodesic distances are then used

    as inputs to classical multidimensional scaling (MDS). Yang

    proposed a face recognition method based on Extended Isomap

(EI) [38]. In his work, the EI method was combined with a Fisher

    Linear Discriminant (FLD) algorithm. The main difference

    between EI and the original Isomap is that after a geodesic


    distance is obtained, the EI algorithm uses FLD to achieve

    the low dimensional embedding while the original Isomap

algorithm uses MDS. X. Geng [39] proposed an improved

    version of Isomap to guide the procedure of nonlinear di-

    mensionality reduction. The neighborhood graph of the input

    data is constructed according to a certain kind of dissimilarity

    between data points, which is specially designed to integrate

    the class information.

    The Isomap algorithm generally has three steps: construct

    a neighborhood graph, compute shortest paths, and construct

    d-dimensional embedding. Classical MDS is applied to the ma-

    trix of graph distances to obtain a low-dimensional embedding

    of the data. However, since the original prototype Isomap does

    not discriminate data acquired from different classes, when

    dealing with multi-class data, several isolated sub-graphs will

    result in undesirable embedding. On the other hand, the EI

    [38] used the Euclidean distance to approximate the distance

    between two nearest points in two classes. When the number

    of classes becomes large, the classes may construct their

    own spatially intrinsic structure. Then the EI and improved

version cannot recover the classes' intrinsic structures of the high-dimensional data. In order to cope with such problems,

    in this paper, we propose a D-Isomap based method for

    emotion classification. The discriminative information of facial

features [40] is considered so that the features can properly represent

    the discriminative structures of the emotional space on the

    manifold. The proposed D-Isomap provides a simple way

    to obtain the low dimensional embedding and discovers the

    discriminative structure on the manifold. It has the capability

    of discovering nonlinear degrees of freedom and finding the

    globally optimal solutions guaranteed to converge for each

    manifold [41].

    There are two general approaches to build the final classifier

using dynamic information from video sequences. One is to determine the dependencies based on the joint probability

    distribution among the score level decisions. The other is based

    on the distribution of dynamic features, in which case the

    features can be discrete or continuous. Le et al. [42] proposed

    a 3D dynamic expression recognition method using spatio-

    temporal shape features. The HMMs algorithm was adopted

    for the final classification. Sandbach et al. [43] also proposed

    to recognize 3D facial expression using HMM dependent on

    the motion-based features.

    In this work, the final classifier is constructed based on the

    dynamic feature level fusion. We change the facial expression

    model following the trajectory of the 54 characteristic feature

points frame by frame. It explicitly describes the relationship between the motions of the facial feature points and the ex-

pression changes. The EBS model sequence v(t) is effectively represented by a sequence of observations from the input video,

where t is the time variable. Before the raw data samples in the datasets can be used for training/testing of the classification, it

is necessary to normalize the sequences such that they are in

    the format required by the system. The frame rate is reduced

    to 10 fps and the sequences last 3 seconds in total from a

    neutral face to the apex of one expression. Since the original

displacement of v(t) in each frame depends on the individual, we use the length (distance between the Top point of the head

    and the Tip point of the chin) and width (distance between

    the Left point of the head and the Right point of the head) of

    the neutral face for scale normalization. We then normalize

    the feature matrix to regulate the variances from the EBS

coefficients and the constant λ using the L2 method. The EBS model sequence takes into account the temporal dynamics of

    the feature vectors, and the labeled graph matching is then

    applied to determine the category of the sample video.
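A minimal sketch of this normalization step is given below; dividing by the mean of the neutral face's length and width before the row-wise L2 step is a plausible reading of the text, not a quoted formula, and the function name is hypothetical.

```python
import numpy as np

def normalize_sequence(features, top, chin, left, right):
    """Scale- and L2-normalize an EBS feature sequence of one video.

    features : (T, 175) array of EBS feature vectors v(t) (10 fps, 3 s clip)
    top, chin, left, right : 3D coordinates of the head extremities on the
                             neutral face, used for scale normalization
    """
    length = np.linalg.norm(np.asarray(top) - np.asarray(chin))
    width = np.linalg.norm(np.asarray(left) - np.asarray(right))
    scaled = features / (0.5 * (length + width))           # remove individual face-size differences
    norms = np.linalg.norm(scaled, axis=1, keepdims=True)   # L2 normalization of each frame's vector
    return scaled / np.maximum(norms, 1e-12)
```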

The EBS feature v for each emotional facial model can be seen as one point in a high dimensional space. As we have 54

    characteristic feature points in the 3D facial model, each EBS

feature, v, has 175 dimensions. Given the variations of facial configurations during emotional expressions, these points can

    be embedded into a lower dimensional space. We define the

    facial EBS feature set V as the input data:

$$V = \{\mathbf{v}_t\} \in \mathbb{R}^{T \times M} \qquad (9)$$

where t = 1, ..., T is the input sample index and M = 175 is the dimensionality of the original data. Let U denote the embedding space of V into a low dimensional manifold with

    m dimensions such that:

$$U = \{\mathbf{u}_t\} \in \mathbb{R}^{T \times m} \qquad (10)$$

which preserves the manifold's estimated intrinsic geometry.

    The D-Isomap provides a simple way to obtain the low dimen-

    sional embedding and discovers the discriminative structure

on the manifold as well [40], [41]. We compute the Euclidean distance D between any pair of points (v_t, v_t') from the input space V for the training data with a discriminative weight factor α such that:

$$D(\mathbf{v}_t, \mathbf{v}_{t'}) = \begin{cases} \alpha\,\|\mathbf{v}_t - \mathbf{v}_{t'}\|_2 & \text{if } Z(\mathbf{v}_t) = Z(\mathbf{v}_{t'}) \\ \|\mathbf{v}_t - \mathbf{v}_{t'}\|_2 & \text{if } Z(\mathbf{v}_t) \neq Z(\mathbf{v}_{t'}) \end{cases} \qquad (11)$$

where Z(v_t) denotes the class label to which the input data v_t belongs. For pairwise points with the same class labels,

the Euclidean distance is shortened by the weight factor α. The compacting and expanding parameters are empirically calcu-

    lated for the discriminative matrix. It can solve the impeding

    problems in [44] when the dimensions of scatter become very

    high in the real data sets.

A neighborhood graph G is constructed according to the discriminative matrix. If one point is one of the k closest points

    or lies within a fixed radius of any other point, it is defined as

    a neighbor of that point. The pairs are connected with paths

    between points, which are acquired by adding up a sequence

of edges equal to the distance between neighboring points. The distances between all point pairs are computed based on a

    chosen distance metric. We then calculate a distance matrix

    between all pairwise points by computing the shortest paths in

    the neighborhood graph. The geodesic distance matrix between

    all points is set to be:

$$D_G = \min\big(D_G,\, D_G^{T}\big) \qquad (12)$$

    The embedding matrix, Dm, in low dimensional space can

    be calculated by converting the distance matrix to inner

products with a translation mapping [45]. Computing the largest eigenvalues and the top m eigenvectors of D_G, we obtain the


eigenvector matrix E ∈ R^{n×m} and the eigenvalue matrix M ∈ R^{m×m}. The embedding matrix in the low dimensional space can be calculated as:

$$D_m = M^{1/2} E^{T} \qquad (13)$$
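The sketch below strings eqs. (11)-(13) together: discriminatively weighted pairwise distances, a k-nearest-neighbour graph, geodesic (shortest-path) distances, and a classical-MDS style eigendecomposition for the final embedding. The MDS-style centering is one standard way to realize (13); the parameter values and function name are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def d_isomap_embed(V, labels, alpha=0.25, k=12, m=20):
    """Discriminative Isomap embedding (sketch of eqs. (11)-(13)).

    V      : (T, M) array of EBS feature vectors
    labels : (T,) array of class labels Z(v_t)
    alpha  : discriminative weight of eq. (11) (within-class distances compacted)
    k, m   : neighbourhood size and target dimensionality
    """
    T = V.shape[0]
    D = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)  # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]
    D = np.where(same, alpha * D, D)                           # eq. (11): compact within-class pairs
    graph = np.zeros((T, T))                                   # 0 = no edge (dense csgraph convention)
    for i in range(T):
        nn = np.argsort(D[i])[1:k + 1]                         # k nearest neighbours, skipping self
        graph[i, nn] = D[i, nn]
    graph = np.maximum(graph, graph.T)                         # symmetrize the neighbourhood graph
    DG = shortest_path(graph, method="D", directed=False)      # geodesic distances, cf. eq. (12)
    DG[np.isinf(DG)] = DG[np.isfinite(DG)].max()               # guard against a disconnected graph
    H = np.eye(T) - np.ones((T, T)) / T                        # centering matrix
    gram = -0.5 * H @ (DG ** 2) @ H
    eigval, eigvec = np.linalg.eigh(gram)
    top = np.argsort(eigval)[::-1][:m]                         # top-m eigenpairs, cf. eq. (13)
    return eigvec[:, top] * np.sqrt(np.maximum(eigval[top], 0.0))
```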

    We then use a Nearest Class Center (NCC) algorithm [46] to

    determine the emotion classes. The NCC algorithm is a centre

    based method for data classification in a high dimensional

space. Many classification methods may be considered for the final decision, such as nearest neighbors, k-means, or EM

    algorithms. In the nearest neighbors based classification, the

representational capacity and the error rate depend on how the dataset is chosen to account for possible variations and how many samples are available. The k-means method adjusts the center of

    a class based on the distance of its neighboring patterns. The

    EM algorithm is a special kind of quasi-Newton algorithm

    with a searching direction having a positive projection on

    the gradient of the log likelihood. In each EM iteration, the

    estimation maximizes a likelihood function which is further

    refined in each iteration by a maximization step. When the

EM iteration converges, it should ideally obtain the maximum likelihood estimate of the data distribution.

    A commonality among these methods is that they define a

    distance between the dataset and an individual class, then the

    classification is determined by consisting of isolated points in

    the feature space. However, since the emotional features in our

    work are complex and not interpretable, a formal centre for

    each emotion class may be difficult to determine or misplaced.

    In many cases, multiple clusters are available within one video

    sequence. Such property can be utilized to improve the final

    decision but has been ignored by other methods. For this

    reason, we need to find a more efficient way to generalize

the representational capacity with a sufficiently large number of feature points stored to account for as many variations as possible. Unlike the other alternatives, NCC considers the centers

for the k clusters with known labels from the training data and generalizes the class center for each emotion group. The

    derived cluster centers have more variations than the original

input features and thus expand the capacity of the available

    data sets. The classification for the test data is based on the

    nearest distance to each class center.

    The NCC algorithm is applied for the classification of

the input video based on the number of clusters k and the embedding matrix D_m. We assume that the clusters can be

    classified in classes a priori through any viable means and are

    available within each video sequence. So the distance matrix

makes use of such information about classes contained in the clusters of each class. A subspace is constructed from the

    entire feature space based on the prior knowledge and the

    within-class clusters are generalized to represent the variants

    of that emotion class. Thus the generalization ability of the

    classifier is increased.

Let c_k be a set of k cluster centers for the feature points belonging to a class. The k clusters determine the output class label of the input data. Each cluster approximates the sample

    population in the feature space for the samples that belong

    to it. The statistics of every cluster are used to estimate the

    probability for the dataset. The probability distribution can be

    calculated from the training data at this level. The centers of

    these clusters provide essential information for discriminative

    subspace, since these clusters are formed according to class

    labels of emotions. We can simply enforce the mapping to be

    orthogonal, i.e., we can impose the condition

$$U U^{T} = I \qquad (14)$$

    for the feature points on the projected set. In our case, a

total of k cluster centers gives (k - 1) discriminative features which span a (k - 1)-dimensional discriminative space. The cluster centers for a test data can be calculated using the

    objective function:

    E(ck) = c

    k ck (15)

A dense matrix h = ee^T, where e = [1, ..., 1]^T, is imposed on the distance matrix D_G to calculate cluster centers from the
training data. Since D_G is symmetric, we put the uniform

weight 1/N on every single pair of the full graph. Let p denote the number of samples in one cluster, l = 7 the size of the emotional space for labeling, and U_t the t-th element of the

embedded manifold matrix for a test data from (10); the objective function becomes:

$$\{C_k\}_l = \frac{1}{p} \sum_{t=1}^{p} \left( D_m U_t - \tfrac{1}{2} H D_G H \right) \qquad (16)$$

where H is the centering matrix, H = I − (1/N)ee^T. The

labeled class center {C_k}_l for the emotional space of a test video can be calculated from (16). Each data sample along

with its k clusters lies on a local manifold. Since D-Isomap seeks to preserve the intrinsic geometric properties of the local

    neighborhoods, the input data is reconstructed by a linear

    combination of its nearest centers with the labeled graph

matching.

For each category of facial expression, we calculate an

average class center coordinate C_l from the training samples. Computing the class centers c_l for the test data using (16), we can obtain its class label C from the Euclidean distance to the nearest class center coordinates C_l.

$$C = \arg\min_{c_l}\, \mathrm{dist}\!\left(c_l, C_l, D_m, U_t, \alpha\right) \qquad (17)$$
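A simplified sketch of the NCC decision is given below: the class centers C_l are averaged from the training embeddings, the test video's own center is approximated by the mean of its embedded sub-clip samples (a simplification of (16)), and the label of the nearest center is returned per (17). Function names are hypothetical.

```python
import numpy as np

def class_centers(U_train, labels):
    """Average class-center coordinates C_l per emotion from the training embeddings."""
    return {l: U_train[labels == l].mean(axis=0) for l in np.unique(labels)}

def ncc_classify(U_test_seq, centers):
    """Nearest-Class-Center decision for one test video.

    U_test_seq : (p, m) embedded feature vectors of the video's sub-clips
    centers    : dict mapping emotion label -> (m,) class-center coordinates
    """
    c_test = U_test_seq.mean(axis=0)                  # cluster center of the test video
    labels = list(centers)
    dists = [np.linalg.norm(c_test - centers[l]) for l in labels]
    return labels[int(np.argmin(dists))]              # label of the nearest class center
```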

    V. EXPERIMENT AND RESULTS

    To evaluate the performance of our proposed method, two

    facial expression video datasets are used for the experiment:

    RML Emotion database and Mind Reading DVD database.

The RML Emotion database [15] was originally recorded for language and context independent emotional recognition with

    the six fundamental emotional states: happiness, sadness,

    anger, disgust, fear and surprise. It includes eight subjects

    in a nearly frontal view (2 Italian, 2 Chinese, 2 Pakistani,

    1 Persian, and 1 Canadian) and 520 video sequences in

    total. The RML Emotion database was originally developed

    for language and context independent emotional recognition.

    Each video pictures a single emotional expression and ends

    at the apex of that expression while the first frame of every

    video sequence shows a neutral face. Video sequences from

neutral to target display are digitized into 320 × 340 pixel


    arrays with 24-bit color values. The Mind Reading DVD [47]

    is an interactive computer-based resource for face emotional

    expressions, developed by Cohen and his psychologist team.

    It consists of 2472 faces, 2472 voices and 2472 stories. Each

    video pictures the frontal face with a single facial expression

    of one actor (30 actors in total) of varying age ranges and

    ethnic origins. All the videos are recorded at 30 frames per

second, last between 5 and 8 seconds, and have a resolution of 320 × 240.

    A. Facial region detection

The facial region is detected in the input video sequence

    using the face detection method with local normalization [25].

    The normalized results of the original sequences show that the

    histograms of all input images are widely spread to cover the

    entire gray scale by local normalization; and the distribution

    of pixels is not too far from uniform. As a result, dark images,

    bright images, and low contrast images are much enhanced to

    have an appearance of high contrast. The overall performance

    of the system is considerably improved by incorporating local

    normalization.

    B. Fiducial points detection and tracking

    The fiducial points are then detected [30] and tracked [31]

    automatically in the facial region. As the location of each

fiducial point is at the center of a 16 × 16 pixel neighborhood window, and the feature vectors for the point detectors are extracted

    from this region, we consider detected points displaced within

    five pixels from the corresponding ground truth facial points

    as successful detections. 180 videos of 6 subjects from RML

    Emotion database and 240 videos of 20 subjects from Mind

    Reading DVD database are selected for experiment, which

constitute a total of 420 sequences of 26 subjects. We randomly divide all the 420 sequences into training and testing

    subsets containing 210 sequences each.

An overall system performance of 92.45% recall and 90.93% precision is achieved simultaneously in terms of false alarm rates. We also implement the AAM method mentioned

    in [27] for the 26 fiducial point detection and tracking, as

    shown in Fig. 3. The proposed method has a better perfor-

    mance on both efficiency and accuracy.

    C. EBS based emotional facial modeling

    In this section, we verify the performance of the EBS based

method for emotional facial modeling on the aforementioned databases. The positions of the 26 fiducial points are obtained

    from the detecting and tracking step and then used for calcu-

    lating the positions of the 28 dependent points. These positions

are 2D data in the video sequences and cannot be applied to

    the 3D EBS analysis directly. All the fiducial points need to be

    aligned to our 3D model first. We use a flexible generic facial

    modeling (FGFM) algorithm [48] for fitting each face image

    to the 3D mesh model. The geometric values used in FGPFM

    are obtained from the BU-3DFE database [49]. There are 2500

    3D facial expression models for 100 subjects in this database.

    We use the 3D facial expression model with the associated

Fig. 3. Detection and tracking results.

    (a) anger faces.

    (b) disgust faces.

    (c) fear faces.

    Fig. 4. Emotional EBS model construction.

    frontal-view texture image as ground truth data to train the

    3D model. Initially, we define a face-centered coordinate

    system used for FGPFM. All the 3-D coordinates, curvature

    parameters for every vertex generation function, the weights in

    the interrelationship functions and the statistical model ratios

    are recorded in an FGPFM. The clustering process is used to

    construct the accurate generic facial models from the training

    3D data. All the selected typical training examples are used

    to acquire the geometric values for each FGPFM. The optimal

    geometric values of FGPFM result in full coincidence between


    superimpositions of the transformed FGPFM and those facial

    contours of training images. Geometric values of FGPFM are

    established using the profile matching technique for silhouettes

    of the training images and the FGPFM with the known view

    directions. The reconstruction procedure can be regarded as

    a block function of FGPFM, and the input parameters are

    3-D face-centered coordinates of control points. When the

    control points are accurately modified, the desired 3-D facial

    model is determined based on the topological and geometric

    descriptions of FGPFM.

    To remove the individual differences in the facial expres-

    sions, each face shape from the video sequences is normalized

    to the same scale. The 26 control points on the 3D facial model

    are initially estimated by the fiducial points using the back

    projection technique with the set of predefined unified depth

    values. The original dependent points are also predefined in

    the model. Classified FGFM ratio features are selected with

    a minimal Euclidean distance between the estimated and the

    codebook-like ratios database. The depth values of control

    points and curvature parameters are obtained for reconstructing

the EBS facial model from the selected ratio features classifier.

Fig. 4 shows some representative sample results for emo-

    tional model construction. Our objective here is to find the

    positions of dependent points after emotional facial defor-

    mation under the availability of the fiducial point position.

    The basic six emotions are analyzed in this experiment. The

    best-fit mesh model of a given face is estimated from the

    first input frame with neutral emotion. Based on the known

    tracking information, the positions of all characteristic feature

    points are calculated and the EBS model is reconstructed for

    any particular expression. From experimental results we can

    see that our method provides good construction following the

    variations of the control points.


Fig. 5. EBS facial model constructions with different Poisson's ratios: (a) a male anger face, (b) a female sadness face, (c) a female anger face, (d) a male happiness face.

    We provide more experimental results in Fig. 5 to verify

    the consistency of the proposed method. Fig. 5 presents the

    results of the emotional facial model for different people.

The Poisson's ratio ν is assumed to be constant for the whole facial region and determined under the condition of minimum muscular
force field generation. Fig. 5(a-d) shows the results when ν

    is obtained experimentally. Subjectively, the proposed method

    provides a good facial model under different people and

    expressions.

    D. D-Isomap for final decision

    In this section, 280 video sequences of eight subjects from

    the RML Emotion database and 420 video sequences of 12

subjects from the Mind Reading DVD database are selected for D-Isomap based classifier evaluation, which constitute a

    total of 700 sequences of 20 subjects with six emotions and

    neutral faces.

    The facial EBS features are extracted to construct a 175

    dimensional vector sequence that is too large to manipulate

    directly. We use D-Isomap algorithm for dimensionality re-

    duction, as discussed in section 4. Since each feature vector

    can be seen as one point in a 175 dimensional space, the

    D-Isomap is utilized to find the embedding manifold in a

    low-dimensional space to represent the original data. These

    representations should cover most of the variances of the

    observation based on the continuous variations of facial config-

urations. The low-dimensional space structures are extracted to preserve the manifold's estimated intrinsic geometry, exploiting D-Isomap's capability of nonlinear analysis and the convergence

    of globally optimal solutions.

    The geodesic distance graph from (12) is used for D-Isomap

    based embedding. Fig. 6 shows examples of distance matrices

with discriminative weight factors for seven emotional expressions of randomly selected subjects. The distance graph

    reflects the intrinsic similarity of the original expression data

    and consequently is considered for determining true embed-

ding. From Fig. 6 we can see that, by applying the weight factor, points from the same cluster are projected closer together in the low-dimensional space, so the within-class distances are compacted. On the other hand, the distances between different clusters can be expanded by increasing the weight factor.
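A minimal sketch of this discriminative weighting, assuming a simple scaling rule in which within-class edges are shrunk and between-class edges are stretched by the weight factor w (the symbol and the exact scaling used in Section 4 may differ; function and variable names are illustrative):

import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def discriminative_geodesics(X, labels, k=12, w=0.5):
    # k-nearest-neighbour graph with edge lengths equal to Euclidean distances
    G = kneighbors_graph(X, n_neighbors=k, mode="distance").toarray()
    G = np.maximum(G, G.T)                      # symmetrize; zeros mean "no edge"
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    G = G * np.where(same, 1.0 - w, 1.0 + w)    # compact within-class, expand between-class
    # all-pairs geodesic distances via Dijkstra on the weighted graph
    return shortest_path(G, method="D", directed=False)

# toy usage: 100 EBS feature vectors (175-D) with 7 class labels
rng = np.random.default_rng(2)
D = discriminative_geodesics(rng.random((100, 175)), rng.integers(0, 7, 100), k=12, w=0.5)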

Fig. 6. Distance matrix graphs with weight factors of (a) 0.1, (b) 0.25, (c) 0.5, and (d) 0.75; higher values are shown in red, lower values in blue.

By increasing the dimension of the embedding space, we can

    calculate the residual variance for the original data. The true


dimension of the data can be found by considering the decreasing trend of the residual value. The embedding results using Isomap and the proposed D-Isomap with different k are presented in Fig. 7, which shows the results when k is set to 7, 12, and 20, respectively. From the results we see that our proposed method achieves an average improvement of 10% compared with the original Isomap. The best performance is obtained when k is 12 and the dimension of the embedded space is reduced to 20, which covers more than 95% of the variance of the input data. Therefore, these 20-dimensional components are used here to represent facial expressions in the input videos.
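The residual-variance criterion mentioned above can be computed as in the standard Isomap diagnostic, i.e., one minus the squared correlation between the geodesic distances and the pairwise Euclidean distances of the d-dimensional embedding (a sketch with illustrative names; the exact procedure used in the experiments may differ):

import numpy as np
from scipy.spatial.distance import pdist

def residual_variance(geodesic_D, embedding):
    # 1 - R^2 between geodesic distances and embedding distances;
    # smaller values mean the embedding preserves the manifold geometry better.
    iu = np.triu_indices_from(geodesic_D, k=1)
    r = np.corrcoef(geodesic_D[iu], pdist(embedding))[0, 1]
    return 1.0 - r**2

# toy usage: scan candidate dimensions and look for the elbow in the residual curve
rng = np.random.default_rng(3)
D = rng.random((50, 50)); D = (D + D.T) / 2.0; np.fill_diagonal(D, 0.0)
for d in (5, 10, 20):
    print(d, residual_variance(D, rng.random((50, d))))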


Fig. 7. Dimensionality reduction using Isomap and D-Isomap: (a), (c), and (e) show the results using Isomap with k = 7, 12, and 20, respectively; (b), (d), and (f) show the corresponding results using D-Isomap.

We also provide expressional configurations to show apparent emotional variation in Fig. 8. For each video sequence from the database, we constructed 10 sub-clips with different numbers of frames from neutral to the apex, which improves the representational capacity: a sufficiently large number of feature points is stored to account for as many variations in the original data as possible. To show apparent

    emotional variations, we provided the expressional configu-

    rations based on different numbers of samples. In Fig. 8,

    (a) shows the result using 700 samples with one sample for

    each video, (b) using 10 samples for each video and 7000

    samples in total. From the results we can see that the EBS

model sequences are embedded into a discriminative structure in the low-dimensional feature space. By applying the NCC

    algorithm to the embedding results from the D-Isomap using

    (17), we can determine the emotion class for a test video. We

label the emotion class centers in the embedded feature space, as shown in Fig. 8.
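A minimal sketch of the nearest-class-center rule, assuming the class centers are simply the mean embedded vectors of the labeled training samples for the seven classes (six emotions plus neutral); the exact definition of the centers in (17) may differ:

import numpy as np

def class_centers(embedded_train, train_labels):
    # mean embedded feature vector per class
    classes = np.unique(train_labels)
    centers = np.stack([embedded_train[train_labels == c].mean(axis=0) for c in classes])
    return classes, centers

def ncc_classify(embedded_test, classes, centers):
    # assign each test vector to the class whose center is nearest (Euclidean)
    d = np.linalg.norm(embedded_test[:, None, :] - centers[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

# toy usage on random 20-D embeddings with 7 classes
rng = np.random.default_rng(4)
Xtr, ytr = rng.random((70, 20)), rng.integers(0, 7, 70)
cls, ctr = class_centers(Xtr, ytr)
print(ncc_classify(rng.random((5, 20)), cls, ctr))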


Fig. 8. Labeled class centers in a 2D space based on the embedding results: (a) results using 700 samples; (b) results using 7000 samples.

To evaluate the performance of our proposed method, we divide the 700 sequences into five subsets of 140 sequences each. In each round, one of the five subsets is used as the testing set and the other four subsets are used as the training set. The evaluation procedure is repeated until every subset has been used as the testing set. A test video sequence is treated as a unit and labeled with a single expression category. The recognition accuracy is calculated as the ratio of the number of

    correctly classified videos to the total number of videos in the

    data set. By using the proposed classifier, we achieve an overall

    accuracy of 88.2%. We list the confusion matrix for emotion

    recognition with numbers representing percentage correct in

    Table I. From the results we can see that features representing

    different expressions exhibit great diversity since the distances

    between different emotions are relatively high. On the other

    hand, the same expressions collected from different subjects

    are very similar due to the short distances within the same

    class.
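The five-fold evaluation protocol above can be summarized by the following sketch, where embed_and_classify is a placeholder standing in for the D-Isomap embedding plus NCC classification described earlier (names and the random fold assignment are illustrative, not the exact splits used in the experiments):

import numpy as np

def five_fold_accuracy(features, labels, embed_and_classify, n_folds=5, seed=0):
    # split the 700 sequences into 5 folds of 140, train on 4 folds, test on the
    # held-out fold, and average the per-fold accuracies
    idx = np.random.default_rng(seed).permutation(len(labels))
    folds = np.array_split(idx, n_folds)
    accs = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        pred = embed_and_classify(features[train], labels[train], features[test])
        accs.append(np.mean(pred == labels[test]))
    return float(np.mean(accs))

# toy usage with a trivial majority-class placeholder classifier
rng = np.random.default_rng(5)
X, y = rng.random((700, 175)), rng.integers(0, 7, 700)
dummy = lambda Xtr, ytr, Xte: np.full(len(Xte), np.bincount(ytr).argmax())
print(five_fold_accuracy(X, y, dummy))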


Table III compares the standard deviation and average rate of emotion recognition of the three Isomap methods, and shows that the proposed algorithm achieves better performance than OI and

EI. D-Isomap outperforms the other methods because it compacts data points from the same cluster on the high-dimensional manifold so that they lie closer together in the low-dimensional space, while pushing data points from different clusters farther apart. This ability is beneficial for preserving the homogeneous characteristics needed for emotion classification.

    To demonstrate the discriminative embedding performance

    of the proposed D-Isomap, we conducted some experiments

with state-of-the-art manifold learning methods, i.e., localized

    LDA (LLDA), the discriminative version of LLE (DLLE)

    and Laplacian Eigenmap (LE). LLDA [50] is based on the

    local estimates of the model parameters that approximate

    the non-linearity of the manifold structure with piece-wise

    linear discriminating subspaces. The local neighborhood size

k = 30 and subspace dimensionality d = 32 are selected to compute the local metrics. DLLE [51] preserves the local geometric properties within each class according to the LLE criterion, and the separability between different classes is enforced by maximizing the margins between point pairs from different classes. The balance term h = 1, number of nearest neighbors k1 = 1, and number of smallest distances k2 = 100 are used for classification with the closest centroid. LE [52] makes use of the incremental

Laplacian Eigenmap to reduce the dimensionality and extract features from the data points. Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on the manifold, and the connections to the heat equation, a geometrically motivated algorithm is utilized for representing the high-dimensional data with locality-preserving properties and a natural connection to clustering. The experiments are conducted with a compression dimension of 50.

TABLE IV
RECOGNITION RATE OF DIFFERENT MANIFOLD LEARNING METHODS.

Method      Dimensions    Recognition Rate
LLDA        32            80.5%
DLLE        40            85.3%
LE          25            84.7%
D-Isomap    20            88.2%

    In all the experiments, the final classification after dimen-

    sionality reduction is determined by the nearest neighbor

    criterion. Table IV shows the experimental results of different

    algorithms. The results demonstrate the greater effectiveness

    of D-Isomap for both feature reduction and final recognition.

    It considers the label information and local manifold structure.

When dealing with multiple classes and a complex data-set distribution, the proposed D-Isomap takes advantage of the weight factor to push data with different labels farther apart and cluster data with the same label closer together. Thus the proposed algorithm achieves a better recognition rate.

    VI. CONCLUSIONS

    In this paper we present an automatic emotion recognition

method from video sequences based on a 3D deformable facial model. From the experimental results we find that the significant features for distinguishing one individual emotion from the others differ from emotion to emotion. Some of the features selected in a global scenario are redundant, while others contribute to the classification of a specific emotion. Another observation is that no single feature is significant for all the classes. This actually reveals the nature

    of human emotion; there are no sharp boundaries between

    the emotional states. Any single emotion may share similar

    patterns to other emotions, but not all. The human perception

    of emotion is based on the integration of different patterns.

    In the emotion recognition field, current techniques for the

    detection and tracking of facial expressions are sensitive to

    head pose, clutter, and variations in lighting conditions. Few

    approaches to automatic facial expression analysis are based

    on deformable or dynamic 3D facial models. The proposed

    system attempts to solve such problems by using a generic 3D

    mesh model with D-Isomap classification. The facial region

    and fiducial points are detected and tracked in the input video

    frames automatically. The generic facial mesh model is then

used for EBS feature extraction. D-Isomap based classification is applied for the final decision. The merits of this work are

    summarized as follows.

• Facial expressions are detected and tracked automatically in the video sequences, which alleviates a common problem in conventional detection and tracking methods, namely inconsistent performance due to sensitivity to variations in illumination such as local shadowing, noise, and occlusion.

• We model the face as an elastic body that exhibits different elastic characteristics depending on the facial expression. Based on the continuity condition, the elastic property of each facial expression is found, and a complete wireframe facial model can be generated from a limited number of available feature point positions.

• An adaptive partition of polygons is embedded in the EBS according to the surface curvature through the characteristic feature points. The subtle structural information can be expressed without requiring complicated facial features.

• The generic 3D facial model is established so that good EBS parameters can be used for emotion recognition, e.g., the appropriate physical characteristics for face deformations, control points, etc.

• We propose the use of D-Isomap for emotion recognition. It compacts the data points from the same emotion class on the high-dimensional manifold to make them closer in the low-dimensional space, and pushes the data points from different clusters farther apart. This results in a higher recognition rate compared with other Isomap methods.

• Experimental results and comparisons with several other algorithms demonstrate the effectiveness of the proposed method.

    REFERENCES

    [1] A.C. Rafael and D. Sidney, Affect Detection: An InterdisciplinaryReview of Models, Methods, and Their Applications, IEEE Transactionson Affective Computing, Vol.1 (1), pp. 18-34, June 2010.


    [2] N. Sebe, H. Aghajan, T. Huang, N.M. Thalmann, C. Shan, Special Issueon Multimodal Affective Interaction, IEEE Transactions on Multimedia,Vol.12 (6), pp. 477-480, 2010.

    [3] P. Ekman, T. Dalgleish, and M.E. Power, Basic emotions, Handbook ofCognition and Emotion, Wiley, Chichester, U.K., 1999.

    [4] C. Darwin, The Expression of Emotions in Man and Animals, JohnMurray, 1872, reprinted by University of Chicago Press, 1965.

    [5] J.F. Cohn, Advances in Behavioral Science Using Automated FacialImage Analysis and Synthesis, Signal Processing Magazine, IEEE,Vol.27 (6), pp. 128-133, 2010.

    [6] K.I. Chang, K.W. Bowyer, and P.J. Flynn, Multiple Nose RegionMatching for 3D Face Recognition under Varying Facial Expression,

    IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.28(10), pp. 1695-1700, October 2006.

    [7] P. Ekman, W.V. Friesen, and J.C. Hager, The Facial Action CodingSystem: A Technique for the Measurement of Facial Movement, SanFrancisco, Consulting Psychologist, 2002.

    [8] M.J. Lyons, J. Budynek, A. Plante, and S. Akamatsu, Classifying facialattributes using a 2-D Gabor wavelet representation and discriminantanalysis, Proceedings of the 4th International Conference on AutomaticFace and Gesture Recognition, pp. 202-207, March 2000.

    [9] L.D. Silva and S.C. Hui, Real-time facial feature extraction and emotionrecognition, Proceedings of 4th International Conference on Informa-tion, Communications and Signal Processing, Vol.3, pp. 1310-1314,Singapore, December 2003.

    [10] I. Cohen, N. Sebe, Y. Sun, M. S. Lew, and T.S. Huang, Evaluation ofexpression recognition techniques, Proceedings of International Confer-

    ence on Image and Video Retrieval, pp. 184-195, IL, USA July 2003.[11] G. Guo and C.R. Dyer, Learning from examples in the small sample

    case: face expression recognition, IEEE Transactions on Systems, Man,and Cybernetics, Part B, Vol.35 (3), pp. 477-488, June 2005.

    [12] M. Pantic and I. Patras, Dynamics of facial expression: recognitionof facial actions and their temporal segments from face profile imagesequences, IEEE Transactions on Systems, Man, and Cybernetics, Part

    B, Vol.36 (2), pp. 433-449, April 2006.[13] K. Anderson and P.W. McOwan, A real-time automated system for the

    recognition of human facial expressions, IEEE Transactions on Systems,Man, and Cybernetics, Part B, Vol.36 (1), pp. 96-105, February 2006.

    [14] H. Gunes and M. Piccardi, Automatic Temporal Segment Detection andAffect Recognition from Face and Body Display, IEEE Transactions onSystems, Man, and Cybernetics, Part B, Vol.39 (1), pp. 64-84, February2009.

    [15] Y. Wang and L. Guan, Recognizing Human Emotional State fromAudiovisual Signals, IEEE Transactions on Multimedia, Vol.10 (5), pp.

    659-668, August 2008.[16] Z. Zeng, M. Pantic, G.I. Roisman, and T.S. Huang, A Survey of Affect

    Recognition Methods: Audio, Visual, and Spontaneous Expressions,IEEE Transactions on Pattern Analysis and Machine Intelligent, Vol.31(1), pp. 39-58, January 2009.

    [17] M. Song, Z. Dong, C. Theobalt, H.Q. Wang, Z.C. Liu, and H.P. Seidel,A General Framework for Efficient 2D and 3D Facial ExpressionAnalogy, IEEE Transactions on Multimedia, Vol.9 (7), pp. 1384-1395,November 2007.

    [18] T. Yun and L. Guan, Human Emotion Recognition Using Real 3DVisual Features from Gabor Library, IEEE International Workshop on

    Multimedia Signal Processing, pp. 481-486, Saint Malo, October 2010.[19] J. Wang, L. Yin, X. Wei, and Y. Sun, 3D facial expression recognition

    based on primitive surface feature distribution, IEEE InternationalConference on Computer Vision and Pattern Recognition, pp. 1399-1406,New York, June 2006.

    [20] H. Soyel and H. Demirel, Facial expression recognition using 3D

    facial feature distances, International Conference on Image Analysis andRecognition, Vol.4633, pp. 831-838, Montreal, August 2007.

    [21] H. Soyel and H. Demirel, Optimal feature selection for 3D facial ex-pression recognition using coarse-to-fine classification, Turkish Journalof Electrical Engineering and Computer Sciences, Vol.18 (6), pp. 1031-1040, 2010.

    [22] H. Tang and T. S. Huang, 3D facial expression recognition based onautomatically selected features, IEEE Computer Society Conference onComputer Vision and Pattern Recognition Workshops, pp. 1-8, Anchorage,June 2008.

    [23] H. Tang and T. Huang, 3D facial expression recognition based onproperties of line segments connecting facial feature points, IEEE

    International Conference on Automatic Face and Gesture Recognition,pp. 1-6, Amsterdam, The Netherlands, 2008.

    [24] Y. Sun and L. Yin, Facial expression recognition based on 3D dynamicrange model sequences, Computer Vision - ECCV 08, pp. 58-71, 2008.

    [25] T. Yun and L. Guan, Automatic face detection in video sequences usinglocal normalization and optimal adaptive correlation techniques, Pattern

    Recognition, Vol.42 (9), pp. 1859-1868, September 2009.

    [26] P. Viola and M. Jones, Robust Real Time Object Detection, Pro-ceedings 2nd International Workshop on Statistical and ComputationalTheories of Vision, Vancouver, Canada, July 2001.

[27] J. Xiao, S. Baker, I. Matthews, and T. Kanade, Real-time combined 2D+3D active appearance models, Computer Vision and Pattern Recognition Conference, Vol.2, pp. 535-542, July 2004.

[28] R. Gross, I. Matthews, and S. Baker, Constructing and Fitting Active Appearance Models With Occlusion, IEEE Workshop on Face Processing in Video, pp. 72, 2004.

    [29] D. Vukadinovic, M. Pantic, Fully Automatic Facial Feature PointDetection Using Gabor Feature Based Boosted Classifiers, IEEE Inter-national Conference on Systems, Man and Cybernetics Waikoloa, Vol. 2,pp.1692 - 1698, October 2005.

    [30] T. Yun and L. Guan, Automatic fiducial points detection for facialexpressions using scale invariant feature, IEEE International Workshopon Multimedia Signal Processing, pp. 1-6, Rio de Janero, Brazil, October2009.

    [31] T. Yun and L. Guan, Fiducial Point Tracking for Facial ExpressionUsing Multiple Particle Filters with Kernel Correlation Analysis, IEEE

    International Conference on Image Processing, pp. 373-376, Hongkong,September 2010.

    [32] M.H. Davis, A. Khotanzad, D.P Flamig, and S.E Harms, A physics-based coordinate transformation for 3-D image matching, IEEE Trans-actions on Medical Imaging, Vol.16 (3), pp. 317 -328, June 1997.

    [33] M.H. Davis, A. Khotanzad, D.P. Flamig, S.E. Harms, Elastic bodysplines: a physics based approach to coordinate transformation in medicalimage matching, IEEE Symposium on Computer-Based Medical Systems,pp. 81 - 88, 1995.

[34] A. Hassanien and M. Nakajima, Image morphing of facial images transformation based on Navier elastic body splines, Computer Animation, pp. 119-125, 1998.

    [35] C.J. Kuo, J. Hung, M. Tsai, and P. Shih, Elastic Body Spline Techniquefor Feature Point Generation and Face Modeling, IEEE Transactions on

    Image Processing, Vol.14 (12), pp. 2159-2166, December 2005.

    [36] P. C. Chou and N. J. Pagano,Elasticity: Tensor, Dyadic, and EngineeringApproaches, Dover, New York, 1992.

    [37] J.B. Tenenbaum, V. Silva, and J.C. Langford, A global geometricframework for nonlinear dimensional reduction, Science, Vol.290, pp.2319-2323, December 2000.

    [38] M.H. Yang, Face recognition using extended isomap, InternationalConference on Image Processing, Vol.2, pp.117-120, New York, Septem-ber 2002.

    [39] X. Geng, D.C. Zhan, and Z.H. Zhou, Supervised Nonlinear Dimension-ality Reduction for Visualization and Classification, IEEE Transactionson Systems, Man and Cybernetics, Part B, Vol.35 (6), pp. 10981107,December 2005.

    [40] Y. Wu, K.L. Chan, and L. Wang, Face recognition based on discrimina-tive manifold learning, International Conference on Pattern Recognition,Vol.4 pp. 171-174, Cambridge, UK, September 2004.

    [41] D. Zhao and L. Yang, Incremental Isometric Embedding of High-Dimensional Data Using Connected Neighborhood Graphs, IEEE Trans-actions on Pattern Analysis and Machine Intelligence, Vol.31, pp. 86-98,January 2009.

    [42] V. Le, H. Tang, and T.S. Huang, Expression recognition from 3Ddynamic faces using robust spatio-temporal shape features, AutomaticFace Gesture Recognition and Workshops, pp. 414-421, 2011.

[43] G. Sandbach, S. Zafeiriou, M. Pantic, and D. Rueckert, A dynamic approach to the recognition of 3D facial expressions and their temporal models, Automatic Face Gesture Recognition and Workshops, pp. 406-413, 2011.

    [44] T. Friedrich, Nonlinear Dimensionality Reduction with Locally LinearEmbedding and Isomap, University of Sheffield, 2002.

    [45] E. Kokiopoulou and Y. Saad, Orthogonal Neighborhood PreservingProjections: A Projection- Based Dimensionality Reduction Technique,

    IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.29,pp. 2143-2156, December 2007.

    [46] J. Handl and J. Knowles, An Evolutionary Approach to MultiobjectiveClustering, IEEE Transactions on Evolutionary Computation, Vol.11 (1),pp. 56-76, February 2007.

    [47] S.B. Cohen, Mind Reading: The Interactive Guide to Emotions, Lon-don, Jessica Kingsley, 2004.

    [48] S.Y. Ho and H.L. Huang, Facial Modeling from an UncalibratedFace Image Using Flexible Generic Parameterized Facial Models, IEEE


Transactions on Systems, Man and Cybernetics, Part B, Vol.31 (8), October 2005.

[49] L. Yin, X. Wei, Y. Sun, J. Wang, and M.J. Rosato, 3D Facial Expression Database for Facial Behavior Research, Automatic Face and Gesture Recognition, pp. 211-216, 2006.

[50] L. Zhu, F. Yun, Y. Junsong, T.S. Huang, and W. Ying, Query Driven Localized Linear Discriminant Models for Head Pose Estimation, IEEE International Conference on Multimedia and Expo, pp. 1810-1813, July 2007.

[51] X. Li, S. Lin, S. Yan, and D. Xu, Discriminant Locally Linear Embedding With High-Order Tensor Data, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol.38 (2), pp. 342-352, April 2008.

[52] W. Luo, Face recognition based on Laplacian Eigenmaps, International Conference on Computer Science and Service System, pp. 416-419, June 2011.

Yun Tie (S'07) received his B.Sc. degree from Nanjing University of Science and Technology, China, M.A.Sc. degree in Computer Science from Kwangju Institute of Science and Technology (KJIST), Korea, and Ph.D. degree from Ryerson University, Canada. He is currently a Postdoctoral Fellow in the Ryerson Multimedia Lab at Ryerson University, Toronto, Canada. His research interests include image/video processing, pattern recognition, 3D data modeling, and intelligent classification and their applications.

Ling Guan (S'88-M'90-SM'96-F'08) received his B.Sc. degree in Electronic Engineering from Tianjin University, China, M.A.Sc. degree in Systems Design Engineering from the University of Waterloo, Canada, and Ph.D. degree in Electrical Engineering from the University of British Columbia, Canada. He is currently a professor and a Tier I Canada Research Chair in the Department of Electrical and Computer Engineering at Ryerson University, Toronto, Canada. He also held visiting positions at British Telecom (1994), Tokyo Institute of Technology (1999), Princeton University (2000), Hong Kong Polytechnic University (2008) and Microsoft Research Asia (2002, 2009).

    He has published extensively in multimedia processing and communications,human-centered computing, pattern analysis and machine intelligence, andadaptive image and signal processing. He is a recipient of the 2005 IEEETrans. on Circuits and Systems for Video Technology Best Paper Award andan IEEE Circuits and Systems Society Distinguished Lecturer.