A Deformable 3D Facial Expression Model for Dynamic Human Emotional State Recognition


    A Deformable 3D Facial Expression Model for

Dynamic Human Emotional State Recognition

Yun Tie, Member, IEEE, and Ling Guan, Fellow, IEEE

Abstract: Automatic emotion recognition from facial expression is one of the most intensively researched topics in affective computing and human-computer interaction (HCI). However, it is well known that, due to the lack of 3D features and dynamic analysis, the functional aspect of affective computing is insufficient for natural interaction. In this paper we present an automatic emotion recognition approach from video sequences based on a fiducial point controlled 3D facial model. The facial region is first detected with local normalization in the input frames. The 26 fiducial points are then located on the facial region and tracked through the video sequences by multiple particle filters. Depending on the displacement of the fiducial points, they may be used as landmarked control points to synthesize the input emotional expressions on a generic mesh model. As a physics-based transformation, Elastic Body Spline (EBS) technology is introduced to the facial mesh to generate a smooth warp that reflects the control point correspondences. This also extracts the deformation feature from the realistic emotional expressions. Discriminative Isomap (D-Isomap) based classification is used to embed the deformation feature into a low dimensional manifold that spans an expression space with one neutral and six emotion class centers. The final decision is made by computing the Nearest Class Center (NCC) of the feature space.

Index Terms: Video analysis, Elastic Body Spline, Differential Evolution Markov Chain, Discriminative Isomap, Nearest Class Center.

    I. INTRODUCTION

WITH the rapid development of Human-Machine Interaction (HMI), affective computing is currently gaining popularity in research and flourishing in the industry domain.

    It aims to equip computing devices with effortless and natural

    communication. The ability to recognize human affective state

    will empower the intelligent computer to interpret, under-

    stand, and respond to human emotions, moods, and possibly,

    intentions. This is similar to the way that humans rely on

their senses to assess each other's affective state [1]. Many

    potential applications such as intelligent automobile systems,

    game and entertainment industries, interactive video, indexing

and retrieval of image or video databases can benefit from this ability.

    Emotion recognition is the first and one of the most im-

    portant issues in the affective computing field. It incorporates

    computers with the ability to interact with humans more natu-

    rally and in a friendly manner. Affective interaction can have

    maximal impact when emotion recognition and expression is

Y. Tie and L. Guan are with the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada.

e-mail: [email protected], [email protected]

    available to all parties, human and computers [2]. Most of the

    existing systems attempt to recognize the human prototypic

    emotions. It is widely accepted from psychological theory

    that human emotions can be classified into six archetypal

    emotions: surprise, fear, disgust, anger, happiness, and sadness,

the so-called six basic emotions, as pioneered

    by Ekman and Friesen [3]. According to Ekman, the six-

    basic emotions are not culturally determined, but universal

    to human culture and thus biological in origin. There are also

    several other emotions and many combinations of emotions

that have been studied, but they are unconfirmed as universally distinguishable.

    Facial expression regulates face-to-face interactions, indi-

    cates reciprocity, interpersonal attraction or repulsion, and en-

    ables intersubjectivity between members of different cultures

    [4]. Recent research in the fields of psychology and neurology

    has shown that facial expression is a most natural and primary

    cue for communicating the quality and nature of emotions, and

    that it correlates well with the body and voice [5]. Each of the

    six basic emotions corresponds to a unique facial expression.

With respect to the objectives of an emotion recognition system, facial

    expression analysis is considered to be the major indicator

    of a human affective state.

In the past 20 years there has been much research in recognizing emotion through facial expressions. However,

challenges still remain. Traditionally, the majority of approaches

    for solving human facial expression recognition problems

    attempt to perform the task on two dimensional data, either 2D

    images or 2D video sequences. Unfortunately, such approaches

    have difficulty handling pose variations, lighting illumination

    and subtle facial behavior. The performance of 2D based

    algorithms remains unsatisfactory, and often proves unreliable

    under adverse conditions.

Using 3D visual features to recognize and understand fa-

    cial expressions has been demonstrated to be a more robust

    approach for human emotion recognition [6]. However, the

general 3D emotion recognition approaches are mainly based on static analysis. A growing body of psychological research

supports the view that the timing of expressions is a critical parameter

    in recognizing emotions and the detailed spatial dynamic

    deformation of the expression is important in expression recog-

    nition. Therefore the dynamic analysis for the state transitions

    of 3D faces could be a crucial clue to the investigation of

    human emotional states.

    Another weakness of the existing 3D based approaches is

the complexity and intensive computational cost required to meet the

    challenge of accuracy. The temporal and detailed spatial infor-

    mation in the 3D visual cues, both at local and global scales,


    may cause more difficulties and complexities in describing

    human facial movement. Moreover, automatic detection and

segmentation based on the facial components with respect to

    emotion recognition has not been reported so far. Most of the

    existing works require manual initialization.

    Fig. 1. Overall System Diagram.

    In light of these problems, this paper presents an automatic

    emotion recognition method from video sequences based on

    a deformable 3D facial expression model. We use the elas-

    tic body spline (EBS) based approach for human emotion

classification with the active deformation feature extraction depending on the 3D generic model. This model is driven by

    the key fiducial points and thus makes it possible to generate

    the intrinsic geometries of the emotional space. The block

    diagram of this method is shown in Fig. 1. The rest of the paper

    is organized as follows. Section II gives an overview on state-

    of-the-art for human emotion recognition. We then present the

    proposed 3D facial modeling and feature extraction from video

    sequences using EBS techniques in Section III. Discriminative

    Isomap (D-Isomap) based classification is discussed in Section

    IV. The experimental results are presented in Section V.

    Section VI gives our conclusions.

    II. RELATED WORKS

    The most commonly used vision-based coding system is

    the facial action coding system (FACS) proposed by Ekman

    and Friesen [7] for the manual labeling of facial behavior.

    To recognize emotions from facial clues, FACS enables facial

    expression analysis through standardized coding of changes

    in facial motion in terms of atomic facial actions called

    Action Units (AUs). The changes in the facial expression are

    described with FACS in terms of AUs. FACS decomposes the

    facial muscular actions into 44 basic actions and describes

    the facial expressions as combinations of the AUs. Many

researchers are inspired by this work and try to analyze facial

expressions in image and video processing. Most methods use the distribution of facial features as inputs of a classification

    system, and the outcome is one of the facial expression classes.

    Lyons et al. [8] used a set of multi-scale, multi-orientation

    Gabor filters to transform the images first. The Gabor coef-

    ficients sampled on the grid were combined into one single

    vector. They tested their system and achieved 75% expression

    classification accuracy by using Linear Discriminant Analysis

    (LDA). Silva and Hui [9] determined the eye and lip position

    using low-pass filtering and edge detection methods. They

    achieved an average emotion recognition rate of 60% using

    a neural network (NN). Cohen et al. [10] introduced the

    temporal information from video sequences for recognizing

    human facial expression. They proposed a multi-level hidden

    Markov model (HMM) classifier for dynamic classification, in

    which the temporal information was also taken into account.

    Guo and Dyer [11] introduced a linear programming based

    method for facial expression recognition with a small number

    of training images for each expression. A pairwise framework

    for feature selection was presented and three methods were

    compared in the experimental part. Pantic and Patras [12]

    presented a method to handle a large range of human facial

    behavior by recognizing facial muscle actions that produce

    expressions. The algorithm performed both automatic seg-

    mentation into facial expressions and recognition of temporal

    segments of 27 AUs. Anderson and McOwan [13] presented an

    automated multistage system for real-time recognition of facial

    expression. The system used facial motion to characterize

    monochrome frontal views of facial expressions. It was able to

    operate effectively in cluttered and dynamic scenes, recogniz-

    ing the six emotions universally associated with unique facial

    expressions. Gunes and Piccardi [14] proposed an automatic

method for temporal segment detection and affect recognition from facial and body displays. Wang and Guan [15] con-

    structed a bimodal system for emotion recognition. They used

    a facial detection scheme based on a Hue Saturation Value

    (HSV) color model to detect the face from the background

    and Gabor wavelet features to represent the facial expressions.

    Presently, state-of-the-art 3D facial modeling by physically

    based paradigm has been recognized as a key research area

    of emotion recognition for next-generation human-computer

    interaction (HCI) [16]. Song et al. [17] presented a generic fa-

    cial expression analogy technique to transfer facial expressions

    between arbitrary 3D facial models, and between 2D facial

    images. Geometry encoding for triangle meshes, vertex-tent-

coordinates, was proposed to formulate expression transfer in 2D and 3D cases as a solution to a simple system of linear

    equations. In [18], a 3D features based method for human

    emotion recognition was proposed. 3D geometric information

    plus colour/density information of the facial expressions were

    extracted by 3D Gabor library to construct visual feature

    vectors. The improved kernel canonical correlation analysis

    (IKCCA) algorithm was applied for final decision, and the

    overall recognition rate was about 85%. A static 3D facial

    expression recognition method was proposed in [19]. The

    primitive 3D facial expression features were extracted from

    3D models based on the principal curvature calculation on

    3D mesh models. Classification into one of the six-basic

emotions was done based on the statistical analysis of these features, and the best performance was obtained using LDA.

    Although several methods can achieve a very high recogni-

    tion rate, most of the existing 3D face expression recognition

    works are based on static data. Soyel and Demirel [20], [21]

    used six distance measures from 3D distributions of facial

    feature points to form the feature vectors. The probabilistic

    NN architecture was applied to classify the facial expressions.

    They obtained an average recognition rate of 87.8%. Unfortu-

    nately, the authors did not specify how to identify this set

    of feature points. Tang and Huang [22], [23] used similar

    distance features based on the change of face shape between


    the emotional expressions. Normalized Euclidean distances

    between the facial feature points were used for emotion

    classification. An automatic feature selection method was also

    proposed based on maximizing the average relative entropy

    of marginalized class-conditional feature distributions. Using

    a regularized multi-class AdaBoost classification algorithm,

    they achieved a 95.1% average recognition rate. However the

    facial feature points were predefined on the cropped 3D face

    mesh model, and were not generated automatically. Such an

approach is, therefore, difficult to use in real-world ap-

    plications. Thus far, few efforts have been reported exploiting

    3D facial expression recognition in dynamic or deformable

    feature analysis. Sun and Yin [24] extracted sophisticated

    features of geometric labeling and used 2D HMMs to model

    the spatial and temporal relationships between the features

    for recognizing expressions from 3D facial model sequences.

    However this method requires manual detection and annotation

    of certain facial landmarks.

    III. METHODOLOGY

    In this section we present a fully automatic method for

    emotion recognition that exploits the EBS features between

    neutral and expressional faces based on a 3D deformable mesh

    (DM) model. The system developed consists of several steps.

    The facial region is first detected automatically in the input

    frames using the local normalization based method [25]. We

    then locate 26 fiducial points over the facial region using scale-

    space extrema and scale invariant feature examination. The

    fiducial points are tracked continuously by multiple particle

    filters throughout the video sequences. EBS is used to extract

    the deformation features and the D-Isomap algorithm is then

    applied for the final decision.

    A. Preprocessing

    Automatic face detection is considered to be the first es-

    sential requirement for our emotion recognition system. Since

    the faces are non-rigid and have a high degree of variability

    in location, color and pose, several facial features that are

    uncommon to other pattern detection issues make facial de-

    tection more complex. Occlusion and lighting distortions and

    illumination conditions can also change the overall appearance

    of a face. We detect facial regions in the input video sequence

    consisting of feature selection and classification based on a

local normalization technique [25]. Compared to the Viola and
Jones algorithm [26], the proposed method is adaptive to the normalized input image and designed to complete the

    segmentation in a single iteration. With the local normalization

    based method, the proposed emotion recognition system can

    be more robust under different illumination conditions.
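For illustration, a minimal sketch of this kind of local normalization is given below in Python/NumPy (using SciPy's uniform filter); the window size and the stabilizing constant are illustrative assumptions rather than the settings used in [25].

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_normalize(gray, win=15, eps=1e-6):
    """Zero-mean, unit-variance normalization of each pixel within a local window.

    gray : 2D float array (grayscale input frame)
    win  : side length of the sliding window (illustrative value)
    eps  : small constant preventing division by zero in flat regions
    """
    mean = uniform_filter(gray, size=win)
    sq_mean = uniform_filter(gray * gray, size=win)
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0.0)) + eps
    return (gray - mean) / std   # enhances dark, bright, and low-contrast regions alike
```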

    Fiducial points are a set of facial salient points, usually

    located on the corners, tips or mid points of facial components.

Automatically detecting fiducial points makes it possible to extract the prominent characteristics of facial expressions from the distances between points and the relative sizes of the facial components, and to form a feature vector. Additionally, the chosen feature points should represent the most important characteristics of

    the face and be extracted easily. The Active Appearance Mod-

    els (AAM) and Active Shape Models (ASM) are two popular

    feature localization methods with statistical face models to

    prevent locating inappropriate feature points. The AAM [27],

    [28] fits a generative model to the region of interest. The best

    match of the model simultaneously calculates feature point

    locations. The ASM algorithm learns a statistical model of

    shape from manually labeled images and the PCA models

    of patches around individual feature points. The best local

    match of each feature is found with constraints on the relative

    configuration of feature points. They are commonly used to

    track faces in video. In general, the point to point accuracy is

    around 85% if the bias of the automatic labeling result to the

    manual labeling result is less than 20% of the true inter-ocular

    distance [29]. However, it is not sufficient in the case of facial

    expression analysis.

    We choose 26 fiducial points [30] on the facial region ac-

cording to the anthropometric measurement with the maximum

    movement of the facial components during expressions. To

    follow the subtle changes in the facial feature appearance, we

define a SUCCESS case if the bias of a detected point to the true facial point is less than 10% of the inter-ocular distance in the

    test image. The proposed method constructs a set of fiducial

    point detectors with scale invariant feature. Candidate points

    are selected over the facial region by the local scale-space ex-

    trema detection. The scale invariant feature for each candidate

    point is extracted to form the feature vectors for the detection.
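The sketch below illustrates one plausible way to generate such candidate points as local extrema of a difference-of-Gaussians (DoG) scale space over the detected facial region; the scale ladder and contrast threshold are illustrative assumptions and are not taken from [30].

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_candidates(face_gray, sigmas=(1.0, 1.6, 2.6, 4.0), thresh=0.02):
    """Return (row, col, scale_index) candidates at local scale-space extrema."""
    blurred = [gaussian_filter(face_gray, s) for s in sigmas]
    dogs = np.stack([b2 - b1 for b1, b2 in zip(blurred[:-1], blurred[1:])])
    # a sample is a candidate if it is a maximum or minimum of its 3x3x3 neighbourhood
    is_max = dogs == maximum_filter(dogs, size=3)
    is_min = dogs == minimum_filter(dogs, size=3)
    strong = np.abs(dogs) > thresh
    scale_idx, rows, cols = np.where((is_max | is_min) & strong)
    return list(zip(rows, cols, scale_idx))
```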

    We use multiple Differential Evolution Markov Chain (DE-

    MC) particle filters [31] to track the fiducial points depending

    on the locations of the current appearance of the spatially

    sampled features. The kernel correlation based on HSV color

    histograms is used to estimate the observation likelihood and

    measure the correctness of particles. We define the observation

likelihood of the color measurement distribution using the correlation coefficient. Starting with a mode-seeking procedure, the

    posterior modes are subsequently detected through the kernel

    correlation analysis. It provides a consistent way to resolve

    the ambiguities that arise in associating multiple objects with

    measurements of the similarity criterion between the target

    points and the candidate points. The proposed method achieves

    an overall accuracy of 91% for the 26 fiducial points [31].
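As an illustration of the observation model only, the sketch below weights a candidate facial patch by the correlation coefficient between HSV color histograms; the bin counts and the exponential mapping are assumptions, and the DE-MC proposal and resampling steps of [31] are omitted.

```python
import numpy as np

def hsv_histogram(patch_hsv, bins=(8, 8, 4)):
    """Normalized HSV color histogram of an image patch (H, W, 3), channels scaled to [0, 1]."""
    hist, _ = np.histogramdd(patch_hsv.reshape(-1, 3), bins=bins,
                             range=((0, 1), (0, 1), (0, 1)))
    hist = hist.ravel()
    return hist / (hist.sum() + 1e-12)

def correlation_likelihood(target_hist, candidate_hist, sigma=0.2):
    """Observation likelihood of one particle: higher histogram correlation -> higher weight."""
    rho = np.corrcoef(target_hist, candidate_hist)[0, 1]
    return np.exp((rho - 1.0) / (2.0 * sigma ** 2))
```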

    B. 3D EBS Facial Modeling

    The EBS is an image morphing technique derived from

    the Navier equation that describes the deformation of ho-

mogeneous elastic tissues. It was developed for matching 3D MRIs of the breast used in the evaluation of breast
cancer [32], [33]. Davis et al. [32] designed the EBS for
matching 3D magnetic resonance images (MRIs) used in the

    evaluation of breast cancer. The coordinate transformations are

    evaluated with different types of deformations and different

    numbers of corresponding coordinate locations. The EBS is

    based on a mechanical model of an elastic body, which

    can approximate the properties of body tissues. The spline

    maps can be expressed as the linear combination of an affine

    transformation and a Navier interpolation spline. It allows each

    landmark to be mapped to the corresponding landmark in


    the other image and provides interpolation of this mapping

    at intermediate locations. Hassanien and Nakajima [34] used

    Navier EBS to generate warp functions for facial animation

    with the interpolating scattered data points. Kuo et al. [35]

    proposed an iterative EBS algorithm to obtain the elastic

    property of a facial model for facial expression generation.

However, most of the feature points in these works were manually

    localized, and only 2D examples were considered for facial

    image analysis.

Fig. 2. Proposed 3D mesh model with 26 fiducial points (black) and 28 characteristic points (red).

The proposed EBS method automatically generates facial

    expressions using a 3D physically based DM model according

    to a deformable feature perspective executed with the control

    points within an acceptable time for emotion recognition. The

    mesh wireframe generic facial model consists of characteristic

    feature points and deformable polygons with EBS structure.

    We can deform the wireframe model to best fit a human face

    with any expressions. The 3D affine transformation realizes the

facial expressions by imitating the facial muscular actions. It formulates the deforming rules according to the FACS coding

    system using the 26 fiducial points as control points. Fig. 2

    shows the proposed model based on this standardized coding

    system. In practical applications, not all feature points in the

    model can be easily detected from the input sequences, so

    we use 54 characteristic feature points for facial expression

    parameterization. Characteristic feature points include: a) the

    26 control points based on the fiducial points, and b) 28

    dependent points which are determined by the control points.

    We also assume that the physical property of the EBS structure

    is the same within the facial region. The EBS deformation

    analysis is presented in following section.

Merits of this approach are: a) a physically based DM model of the human face with fiducial points for driving facial

    deformation according to muscle movement parameterization.

    The face can be modeled as an elastic body that is deformed

    under a tension force field. Muscles are interpreted as forces

    deforming the polygonal mesh of the face. The factors affect-

    ing the deformation are tension of the muscle, elasticity of

    the skin and zone of influence. Higher-level parameterizations

    are easier to use for emotional expressions and can be de-

    fined in terms of low-level parameters. b) We extend a DM

    facial model by a set of well-designed polygons with EBS

    structure which can be efficiently modified to establish the

    facial expression model. A 3D face is decomposed into area

    or volume elements, each endowed with physical parameters

    embedded in an EBS model according to the surface curva-

    ture. The deformable element relationships are computed by

    integrating the piecewise components over the entire face. c)

    The control points are predefined by the landmarked fiducial

    points. The number of control points is small and they can be

    identified robustly and automatically. Once the control points

    are adjusted, the emotional facial model can be established

    using the transform function of EBS and extended to obtain

    expression parameters for final recognition.

    Using EBS transforms we can interpolate the positions of

    characteristic feature points such that the 3D facial model of

non-neutral expressions can be generated from the

    input video frame. Based on the arrangement of facial muscle

    fibers, our EBS model calculates elastic characteristics for

    each emotional face by modeling the facial muscle fiber as an

    elastic body. The affine elastic body coordinate transformation

    is fitted to the displacements of the facial expression with

    the continuity condition. The spline obtained by this method

is mathematically identical to the one computed directly from the original displacements of the control points.

    Moreover, the resulting spline is added to the initial mesh of

    the elastic body transformation to give the overall coordinate

    transformation. Simulation results show that the facial model

    generated by our method demonstrates good performance

    under the availability of control point positions.

    C. EBS parameterizations

    EBS is applied for generating different facial expressions

    with a generic facial model from a neutral face. By varying

    the position of control points, EBS mathematically describes

    the equilibrium displacement of the facial expressions sub-

    jected to muscular forces using a Navier partial differential

    equation (PDE). The deformable facial model equations can

    be expressed in 3D vector form with the interpolation spline

    relating the set of corresponding control points. The PDE of

    an elastic body is based on notions of stress and strain. When

    a body is subject to an external force this induces internal

    forces within the body which cause it to deform. The integral

of the surface forces and body forces must be zero [36]. Let x denote a set of feature points in the 3D facial model of the neutral
face and y_i be the corresponding control points with expressions; we then have the Navier equilibrium PDE:

$$\mu \nabla^{2}\mathbf{l}(\mathbf{x}) + (\lambda + \mu)\,\nabla\big[\nabla \cdot \mathbf{l}(\mathbf{x})\big] + \mathbf{f}(\mathbf{x}) = 0 \qquad (1)$$

where l(x) is the displacement of all characteristic feature points within the facial model from the original position (neutral face), and λ and μ are the Lamé coefficients which describe the physical properties of the face; μ is also referred to as the shear modulus. ∇² and ∇ denote the Laplacian and gradient operators, respectively, and ∇·l(x) is the divergence of l(x). f(x) is the muscular force field being applied to the face.

    To find an appropriate physical property for an expressional

    model, muscular forces are assumed to distribute on the ho-

    mogeneous isotropic elastic body of the facial model to obtain


smooth deformation. A polynomial radially symmetric force is therefore considered:

$$\mathbf{f}(\mathbf{x}) = \mathbf{w}\, d(\mathbf{x}) \qquad (2)$$

where w = [w1 w2 w3]^T is the strength of the force field and d(x) = (x1² + x2² + x3²)^{1/2}. The solutions of the PDE (1) can be computed as:

$$\mathbf{l}(\mathbf{x}) = E(\mathbf{x})\,\mathbf{w} \qquad (3)$$

and

$$E(\mathbf{x}) = \big[\alpha\, d(\mathbf{x})^{2} I - 3\,\mathbf{x}\mathbf{x}^{T}\big]\, d(\mathbf{x}) \qquad (4)$$

where α = (11μ + 5λ)/(λ + μ) = 12(1 − ν) − 1 is determined by the Poisson's ratio ν, I is a 3 × 3 identity matrix, and x x^T is an outer product. It is obtained using the Galerkin vector method [36] to transform the three coupled PDEs into three independent equations. The solution can be

    verified by substituting (3) into (1). The EBS displacement

L_EBS(x) is a linear combination of the PDE solutions in (3):

$$L_{EBS}(\mathbf{x}) = \sum_{i=1}^{N} E(\mathbf{x} - \mathbf{y}_i)\,\mathbf{w}_i + A\mathbf{x} + B \qquad (5)$$

where Ax + B is the affine portion of the EBS and A = [a1 a2 a3]^T is a 3 × 3 matrix. The coefficients of the spline are determined from the control points y_i and the displacements of the feature points. The spline relaxes to an affine

    transformation as the distance from the point approaches

    infinity.

    The summation in (5) can be expressed in the matrix-vector

    form as:

$$E_{EBS} = H\, L_{EBS} \qquad (6)$$

where H is a (3N + 12) × (3N + 12) transfer function as described by Kuo [35], and E_EBS is a (3N + 12) × 1 vector containing all the EBS coefficients:

$$E_{EBS} = \big[\,\mathbf{w}_1^{T}\ \mathbf{w}_2^{T} \cdots \mathbf{w}_N^{T}\ \mathbf{a}_1^{T}\ \mathbf{a}_2^{T}\ \mathbf{a}_3^{T}\ \mathbf{b}^{T}\,\big]^{T} \qquad (7)$$

    In our system, the 26 control points and the displacements of

    the control point sets are obtained from the fiducial detection

    and tracking steps. We solve (6) from the requirements that

    the spline displacements equal the control point displacements

with a constant ν over the whole facial region. The flatness constraints, which are expressed in terms of second or higher order terms (e.g., x_i², x_j², or x_i x_j), are set to zero, enforcing the conservation of linear and angular momenta for an equilibrium solution. These constraints cause the force field to be balanced so that the EBS facial model is stationary. The values of the spline for the 28 dependent points are computed from (5) with the spline coefficients E_EBS, the spline basis function H, and the control point locations.
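To make the solution of (6) concrete, the following Python/NumPy sketch assembles the (3N + 12) × (3N + 12) system from the basis of (4) and the affine part of (5), solves for the coefficients, and evaluates the spline at an arbitrary (e.g., dependent) point. The function names (ebs_basis, solve_ebs, ebs_displacement) and the precise form of the momentum-conservation side conditions are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def ebs_basis(x, alpha):
    """EBS basis of eq. (4): E(x) = [alpha*d(x)^2*I - 3*x*x^T] * d(x), with d(x) = |x|."""
    d = np.linalg.norm(x)
    if d == 0.0:
        return np.zeros((3, 3))
    return (alpha * d ** 2 * np.eye(3) - 3.0 * np.outer(x, x)) * d

def solve_ebs(y, disp, alpha):
    """Assemble and solve the (3N+12) x (3N+12) EBS system of eqs. (5)-(7).

    y     : (N, 3) control point positions on the neutral-face mesh
    disp  : (N, 3) control point displacements (expression minus neutral)
    alpha : coefficient of eq. (4), a function of the Poisson's ratio

    Returns (w, A, B): spline weights (N, 3), affine matrix (3, 3), offset (3,).
    The side conditions used here (sum of the weights and of their first
    moments set to zero) are the usual momentum-conservation constraints;
    the paper's exact constraint set is assumed, not quoted.
    """
    N = y.shape[0]
    n = 3 * N + 12
    K = np.zeros((n, n))
    rhs = np.zeros(n)
    for j in range(N):                                    # interpolation rows: L_EBS(y_j) = disp_j
        for i in range(N):
            K[3 * j:3 * j + 3, 3 * i:3 * i + 3] = ebs_basis(y[j] - y[i], alpha)
        K[3 * j:3 * j + 3, 3 * N:3 * N + 9] = np.kron(np.eye(3), y[j])  # A y_j (A stored row-wise)
        K[3 * j:3 * j + 3, 3 * N + 9:] = np.eye(3)                      # + B
        rhs[3 * j:3 * j + 3] = disp[j]
    K[3 * N:, :3 * N] = K[:3 * N, 3 * N:].T               # momentum-conservation side conditions
    coeffs = np.linalg.solve(K, rhs)
    w = coeffs[:3 * N].reshape(N, 3)
    A = coeffs[3 * N:3 * N + 9].reshape(3, 3)
    B = coeffs[3 * N + 9:]
    return w, A, B

def ebs_displacement(x, y, w, A, B, alpha):
    """Evaluate the EBS displacement of eq. (5) at a dependent point x."""
    s = sum(ebs_basis(x - yi, alpha) @ wi for yi, wi in zip(y, w))
    return s + A @ x + B
```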

The muscular force field f(x), given by (2), can be calcu-

    lated from the solutions of EBS according to the displacement

    of control points such that:

$$\mathbf{f}(\mathbf{x}) = \big[\,f_1\ f_2\ f_3\,\big]^{T} = \sum_{i=1}^{N} f(\mathbf{x} - \mathbf{y}_i)\,\mathbf{w}_i \qquad (8)$$

With different values of ν we obtain different f(x). By the princi-

    ple of superposition for an elastic body, the external forces

    must be minimized according to the roughness measurement

    constraints [35]. This ensures that the forces are optimally

    smooth and sufficient to deform the elastic material so that

    the EBS equals the given displacements at the control point

locations. By varying the value of ν in (4), we can calculate each corresponding muscular force field. From the minimum muscular force field |f(x)|_min, we obtain the appropriate physical property ν and the associated EBS coefficients E_EBS. We then construct the deformable visual feature v for classification from ν and E_EBS. The algorithm for deformation feature extraction is summarized as follows.

1) Initialize the feature point positions x in the 3D facial model for the neutral face according to the detection results from the 26 fiducial points.
2) Set ν = 0.01 for the facial region.
3) Update the corresponding control point positions y_i in the expressional facial model subject to the tracking results.
4) Calculate the displacements l of the control point sets in the facial region.
5) Solve the EBS in (6) to obtain the associated spline coefficients E_EBS.
6) Compute the positions of the nonsignificant points in the facial region based on the EBS solution in the previous step.
7) Calculate the muscular force field f(x) in (2) from the solution of the EBS.
8) Sweep ν from 0.02, 0.03, ..., to 0.5 and repeat steps 5), 6) and 7) to obtain the new muscular force fields.
9) Find the minimum muscular force field |f(x)|_min and fix ν and the EBS coefficients E_EBS.
10) Construct the deformable visual feature v for classification from E_EBS and ν.
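A possible realization of this sweep is sketched below, building on the solve_ebs sketch above. Summing the force magnitudes over the characteristic points as the criterion for |f(x)|_min, and the relation α = 12(1 − ν) − 1, are our reading of the text; the paper does not spell out the exact norm it minimizes.

```python
import numpy as np

def force_field(x, y, w):
    """Muscular force at x, following eqs. (2) and (8): f(x) = sum_i d(x - y_i) w_i."""
    d = np.linalg.norm(x[None, :] - y, axis=1)         # distances to the N control points
    return (d[:, None] * w).sum(axis=0)

def extract_ebs_feature(y_neutral, disp, eval_points, nus=np.arange(0.01, 0.51, 0.01)):
    """Sweep the Poisson's ratio, keep the value minimizing the total muscular
    force magnitude, and return (nu*, E_EBS) as the deformation feature.
    Relies on the solve_ebs sketch given earlier."""
    best = None
    for nu in nus:
        alpha = 12.0 * (1.0 - nu) - 1.0                # alpha = (11*mu + 5*lambda)/(lambda + mu)
        w, A, B = solve_ebs(y_neutral, disp, alpha)
        total = sum(np.linalg.norm(force_field(x, y_neutral, w)) for x in eval_points)
        if best is None or total < best[0]:
            best = (total, nu, np.concatenate([w.ravel(), A.ravel(), B]))
    _, nu_star, e_ebs = best
    return nu_star, e_ebs                              # v = [E_EBS, nu*]: 3*54 + 12 + 1 = 175 dims for N = 54
```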

    IV. D-ISOMAP BASED CLASSIFIER

    Once the deformable facial features have been obtained

    with the EBS, we use an isomap based method for emotion

    classification. Isomap was first proposed by Tenenbaum [37],

    and is one of the most popular manifold learning techniques

for nonlinear dimensionality reduction. It attempts

    to learn complex embedding manifolds using local geometric

    metrics within a single global coordinate system. The Isomap

    algorithm uses geodesic distances between points instead of

simply taking Euclidean distances, thus encoding the manifold structure of the input space into distances. The geodesic

    distances are computed by constructing a sparse graph in

    which each node is connected only to its closest neighbors.

    The geodesic distance between each pair of nodes is taken to

    be the length of the shortest path in the graph that connects

    them. These approximated geodesic distances are then used

    as inputs to classical multidimensional scaling (MDS). Yang

    proposed a face recognition method based on Extended Isomap

(EI) [38]. In his work, the EI method was combined with a Fisher

    Linear Discriminant (FLD) algorithm. The main difference

    between EI and the original Isomap is that after a geodesic


    distance is obtained, the EI algorithm uses FLD to achieve

    the low dimensional embedding while the original Isomap

algorithm uses MDS. X. Geng [39] proposed an improved

    version of Isomap to guide the procedure of nonlinear di-

    mensionality reduction. The neighborhood graph of the input

    data is constructed according to a certain kind of dissimilarity

    between data points, which is specially designed to integrate

    the class information.

    The Isomap algorithm generally has three steps: construct

    a neighborhood graph, compute shortest paths, and construct

    d-dimensional embedding. Classical MDS is applied to the ma-

    trix of graph distances to obtain a low-dimensional embedding

    of the data. However, since the original prototype Isomap does

    not discriminate data acquired from different classes, when

    dealing with multi-class data, several isolated sub-graphs will

    result in undesirable embedding. On the other hand, the EI

    [38] used the Euclidean distance to approximate the distance

    between two nearest points in two classes. When the number

    of classes becomes large, the classes may construct their

    own spatially intrinsic structure. Then the EI and improved

version cannot recover the classes' intrinsic structures of the high-dimensional data. In order to cope with such problems,

    in this paper, we propose a D-Isomap based method for

    emotion classification. The discriminative information of facial

features [40] is considered so that the features can properly represent

    the discriminative structures of the emotional space on the

    manifold. The proposed D-Isomap provides a simple way

    to obtain the low dimensional embedding and discovers the

    discriminative structure on the manifold. It has the capability

    of discovering nonlinear degrees of freedom and finding the

    globally optimal solutions guaranteed to converge for each

    manifold [41].

    There are two general approaches to build the final classifier

using dynamic information from video sequences. One is to determine the dependencies based on the joint probability

    distribution among the score level decisions. The other is based

    on the distribution of dynamic features, in which case the

    features can be discrete or continuous. Le et al. [42] proposed

    a 3D dynamic expression recognition method using spatio-

    temporal shape features. The HMMs algorithm was adopted

    for the final classification. Sandbach et al. [43] also proposed

    to recognize 3D facial expression using HMM dependent on

    the motion-based features.

    In this work, the final classifier is constructed based on the

    dynamic feature level fusion. We change the facial expression

    model following the trajectory of the 54 characteristic feature

points frame by frame. It explicitly describes the relationship between the motions of the facial feature points and the ex-

pression changes. The EBS model sequence v(t) is effectively represented by a sequence of observations from the input video,

where t is the time variable. Before the raw data samples in the datasets can be used for training/testing of the classification, it

is necessary to normalize the sequences such that they are in

    the format required by the system. The frame rate is reduced

    to 10 fps and the sequences last 3 seconds in total from a

    neutral face to the apex of one expression. Since the original

displacement of v(t) in each frame depends on the individual, we use the length (distance between the Top point of the head

    and the Tip point of the chin) and width (distance between

    the Left point of the head and the Right point of the head) of

    the neutral face for scale normalization. We then normalize

    the feature matrix to regulate the variances from the EBS

coefficients and the constant λ using the L2 method. The EBS model sequence takes into account the temporal dynamics of

    the feature vectors, and the labeled graph matching is then

    applied to determine the category of the sample video.
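A minimal sketch of this normalization step is given below; dividing by the mean of the neutral face's length and width before the row-wise L2 step is a plausible reading of the text, not a quoted formula, and the function name is hypothetical.

```python
import numpy as np

def normalize_sequence(features, top, chin, left, right):
    """Scale- and L2-normalize an EBS feature sequence of one video.

    features : (T, 175) array of EBS feature vectors v(t) (10 fps, 3 s clip)
    top, chin, left, right : 3D coordinates of the head extremities on the
                             neutral face, used for scale normalization
    """
    length = np.linalg.norm(np.asarray(top) - np.asarray(chin))
    width = np.linalg.norm(np.asarray(left) - np.asarray(right))
    scaled = features / (0.5 * (length + width))           # remove individual face-size differences
    norms = np.linalg.norm(scaled, axis=1, keepdims=True)   # L2 normalization of each frame's vector
    return scaled / np.maximum(norms, 1e-12)
```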

The EBS feature v for each emotional facial model can be seen as one point in a high dimensional space. As we have 54

    characteristic feature points in the 3D facial model, each EBS

feature, v, has 175 dimensions. Given the variations of facial configurations during emotional expressions, these points can

    be embedded into a lower dimensional space. We define the

    facial EBS feature set V as the input data:

$$V = \{\mathbf{v}_t\} \in \mathbb{R}^{T \times M} \qquad (9)$$

where t = 1, ..., T is the input sample index and M = 175 is the dimensionality of the original data. Let U denote the embedding space of V into a low dimensional manifold with

    m dimensions such that:

$$U = \{\mathbf{u}_t\} \in \mathbb{R}^{T \times m} \qquad (10)$$

which preserves the manifold's estimated intrinsic geometry.

    The D-Isomap provides a simple way to obtain the low dimen-

    sional embedding and discovers the discriminative structure

on the manifold as well [40], [41]. We compute the Euclidean distance D between any pair of points (v_t, v_t') from the input space V for the training data with a discriminative weight factor α such that:

$$D(\mathbf{v}_t, \mathbf{v}_{t'}) = \begin{cases} \alpha\,\|\mathbf{v}_t - \mathbf{v}_{t'}\|_2 & \text{if } Z(\mathbf{v}_t) = Z(\mathbf{v}_{t'}) \\ \|\mathbf{v}_t - \mathbf{v}_{t'}\|_2 & \text{if } Z(\mathbf{v}_t) \neq Z(\mathbf{v}_{t'}) \end{cases} \qquad (11)$$

where Z(v_t) denotes the class label to which the input data v_t belongs. For pairwise points with the same class labels,

the Euclidean distance is shortened by the weight factor α. The compacting and expanding parameters are empirically calcu-

    lated for the discriminative matrix. It can solve the impeding

    problems in [44] when the dimensions of scatter become very

    high in the real data sets.

A neighborhood graph G is constructed according to the discriminative matrix. If one point is one of the k closest points

    or lies within a fixed radius of any other point, it is defined as

    a neighbor of that point. The pairs are connected with paths

    between points, which are acquired by adding up a sequence

of edges equal to the distance between neighboring points. The distances between all point pairs are computed based on a

    chosen distance metric. We then calculate a distance matrix

    between all pairwise points by computing the shortest paths in

    the neighborhood graph. The geodesic distance matrix between

    all points is set to be:

$$D_G = \min\big(D_G,\, D_G^{T}\big) \qquad (12)$$

    The embedding matrix, Dm, in low dimensional space can

    be calculated by converting the distance matrix to inner

products with a translation mapping [45]. Computing the largest eigenvalues and the top m eigenvectors of D_G, we obtain the


eigenvector matrix E ∈ R^{n×m} and the eigenvalue matrix M ∈ R^{m×m}. The embedding matrix in the low dimensional space can be calculated as:

$$D_m = M^{1/2} E^{T} \qquad (13)$$
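The sketch below strings eqs. (11)-(13) together: discriminatively weighted pairwise distances, a k-nearest-neighbour graph, geodesic (shortest-path) distances, and a classical-MDS style eigendecomposition for the final embedding. The MDS-style centering is one standard way to realize (13); the parameter values and function name are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def d_isomap_embed(V, labels, alpha=0.25, k=12, m=20):
    """Discriminative Isomap embedding (sketch of eqs. (11)-(13)).

    V      : (T, M) array of EBS feature vectors
    labels : (T,) array of class labels Z(v_t)
    alpha  : discriminative weight of eq. (11) (within-class distances compacted)
    k, m   : neighbourhood size and target dimensionality
    """
    T = V.shape[0]
    D = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)  # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]
    D = np.where(same, alpha * D, D)                           # eq. (11): compact within-class pairs
    graph = np.zeros((T, T))                                   # 0 = no edge (dense csgraph convention)
    for i in range(T):
        nn = np.argsort(D[i])[1:k + 1]                         # k nearest neighbours, skipping self
        graph[i, nn] = D[i, nn]
    graph = np.maximum(graph, graph.T)                         # symmetrize the neighbourhood graph
    DG = shortest_path(graph, method="D", directed=False)      # geodesic distances, cf. eq. (12)
    DG[np.isinf(DG)] = DG[np.isfinite(DG)].max()               # guard against a disconnected graph
    H = np.eye(T) - np.ones((T, T)) / T                        # centering matrix
    gram = -0.5 * H @ (DG ** 2) @ H
    eigval, eigvec = np.linalg.eigh(gram)
    top = np.argsort(eigval)[::-1][:m]                         # top-m eigenpairs, cf. eq. (13)
    return eigvec[:, top] * np.sqrt(np.maximum(eigval[top], 0.0))
```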

    We then use a Nearest Class Center (NCC) algorithm [46] to

    determine the emotion classes. The NCC algorithm is a centre

    based method for data classification in a high dimensional

space. Many classification methods may be considered for the final decision, such as nearest neighbors, k-means, or EM

    algorithms. In the nearest neighbors based classification, the

representational capacity and the error rate depend on how the dataset is chosen to account for possible variations and how many samples are available. The k-means method adjusts the center of

    a class based on the distance of its neighboring patterns. The

    EM algorithm is a special kind of quasi-Newton algorithm

    with a searching direction having a positive projection on

    the gradient of the log likelihood. In each EM iteration, the

    estimation maximizes a likelihood function which is further

    refined in each iteration by a maximization step. When the

EM iteration converges, it should ideally obtain the maximum likelihood estimate of the data distribution.

    A commonality among these methods is that they define a

    distance between the dataset and an individual class, then the

    classification is determined by consisting of isolated points in

    the feature space. However, since the emotional features in our

    work are complex and not interpretable, a formal centre for

    each emotion class may be difficult to determine or misplaced.

    In many cases, multiple clusters are available within one video

    sequence. Such property can be utilized to improve the final

    decision but has been ignored by other methods. For this

    reason, we need to find a more efficient way to generalize

the representational capacity with a sufficiently large number of feature points stored to account for as many variations as possible. Unlike the other alternatives, NCC considers the centers

for the k clusters with known labels from the training data and generalizes the class center for each emotion group. The

    derived cluster centers have more variations than the original

input features and thus expand the capacity of the available

    data sets. The classification for the test data is based on the

    nearest distance to each class center.

    The NCC algorithm is applied for the classification of

the input video based on the number of clusters k and the embedding matrix D_m. We assume that the clusters can be

    classified in classes a priori through any viable means and are

    available within each video sequence. So the distance matrix

makes use of such information about classes contained in the clusters of each class. A subspace is constructed from the

    entire feature space based on the prior knowledge and the

    within-class clusters are generalized to represent the variants

    of that emotion class. Thus the generalization ability of the

    classifier is increased.

Let c_k be a set of k cluster centers for the feature points belonging to a class. The k clusters determine the output class label of the input data. Each cluster approximates the sample

    population in the feature space for the samples that belong

    to it. The statistics of every cluster are used to estimate the

    probability for the dataset. The probability distribution can be

    calculated from the training data at this level. The centers of

    these clusters provide essential information for discriminative

    subspace, since these clusters are formed according to class

    labels of emotions. We can simply enforce the mapping to be

    orthogonal, i.e., we can impose the condition

$$U U^{T} = I \qquad (14)$$

    for the feature points on the projected set. In our case, a

total of k cluster centers gives (k - 1) discriminative features which span a (k - 1)-dimensional discriminative space. The cluster centers for a test data can be calculated using the

    objective function:

    E(ck) = c

    k ck (15)

A dense matrix h = ee^T, where e = [1, ..., 1]^T, is imposed on the distance matrix D_G to calculate cluster centers from the
training data. Since D_G is symmetric, we put the uniform

weight 1/N on every single pair of the full graph. Let p denote the number of samples in one cluster, l = 7 the size of the emotional space for labeling, and U_t the t-th element of the

embedded manifold matrix for a test data from (10); the objective function becomes:

$$\{C_k\}_l = \frac{1}{p} \sum_{t=1}^{p} \left( D_m U_t - \tfrac{1}{2} H D_G H \right) \qquad (16)$$

where H is the centering matrix, H = I − (1/N)ee^T. The

labeled class center {C_k}_l for the emotional space of a test video can be calculated from (16). Each data sample along

with its k clusters lies on a local manifold. Since D-Isomap seeks to preserve the intrinsic geometric properties of the local

    neighborhoods, the input data is reconstructed by a linear

    combination of its nearest centers with the labeled graph

matching.

For each category of facial expression, we calculate an

average class center coordinate C_l from the training samples. Computing the class centers c_l for the test data using (16), we can obtain its class label C from the Euclidean distance to the nearest class center coordinates C_l.

$$C = \arg\min_{c_l}\, \mathrm{dist}\!\left(c_l, C_l, D_m, U_t, \alpha\right) \qquad (17)$$
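A simplified sketch of the NCC decision is given below: the class centers C_l are averaged from the training embeddings, the test video's own center is approximated by the mean of its embedded sub-clip samples (a simplification of (16)), and the label of the nearest center is returned per (17). Function names are hypothetical.

```python
import numpy as np

def class_centers(U_train, labels):
    """Average class-center coordinates C_l per emotion from the training embeddings."""
    return {l: U_train[labels == l].mean(axis=0) for l in np.unique(labels)}

def ncc_classify(U_test_seq, centers):
    """Nearest-Class-Center decision for one test video.

    U_test_seq : (p, m) embedded feature vectors of the video's sub-clips
    centers    : dict mapping emotion label -> (m,) class-center coordinates
    """
    c_test = U_test_seq.mean(axis=0)                  # cluster center of the test video
    labels = list(centers)
    dists = [np.linalg.norm(c_test - centers[l]) for l in labels]
    return labels[int(np.argmin(dists))]              # label of the nearest class center
```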

    V. EXPERIMENT AND RESULTS

    To evaluate the performance of our proposed method, two

    facial expression video datasets are used for the experiment:

    RML Emotion database and Mind Reading DVD database.

The RML Emotion database [15] was originally recorded for language and context independent emotional recognition with

    the six fundamental emotional states: happiness, sadness,

    anger, disgust, fear and surprise. It includes eight subjects

    in a nearly frontal view (2 Italian, 2 Chinese, 2 Pakistani,

    1 Persian, and 1 Canadian) and 520 video sequences in

    total. The RML Emotion database was originally developed

    for language and context independent emotional recognition.

    Each video pictures a single emotional expression and ends

    at the apex of that expression while the first frame of every

    video sequence shows a neutral face. Video sequences from

neutral to target display are digitized into 320 × 340 pixel


    arrays with 24-bit color values. The Mind Reading DVD [47]

    is an interactive computer-based resource for face emotional

    expressions, developed by Cohen and his psychologist team.

    It consists of 2472 faces, 2472 voices and 2472 stories. Each

    video pictures the frontal face with a single facial expression

    of one actor (30 actors in total) of varying age ranges and

    ethnic origins. All the videos are recorded at 30 frames per

second, last between 5 and 8 seconds, and have a resolution of 320 × 240.

    A. Facial region detection

The facial region is detected in the input video sequence

    using the face detection method with local normalization [25].

    The normalized results of the original sequences show that the

    histograms of all input images are widely spread to cover the

    entire gray scale by local normalization; and the distribution

    of pixels is not too far from uniform. As a result, dark images,

    bright images, and low contrast images are much enhanced to

    have an appearance of high contrast. The overall performance

    of the system is considerably improved by incorporating local

    normalization.

    B. Fiducial points detection and tracking

    The fiducial points are then detected [30] and tracked [31]

    automatically in the facial region. As the location of each

fiducial point is at the center of a 16 × 16 pixel neighborhood window, and the feature vectors for the point detectors are extracted

    from this region, we consider detected points displaced within

    five pixels from the corresponding ground truth facial points

    as successful detections. 180 videos of 6 subjects from RML

    Emotion database and 240 videos of 20 subjects from Mind

    Reading DVD database are selected for experiment, which

constitute a total of 420 sequences of 26 subjects. We randomly divide all the 420 sequences into training and testing

    subsets containing 210 sequences each.

An overall system performance of 92.45% recall and 90.93% precision is achieved simultaneously in terms of false alarm rates. We also implement the AAM method mentioned

    in [27] for the 26 fiducial point detection and tracking, as

    shown in Fig. 3. The proposed method has a better perfor-

    mance on both efficiency and accuracy.

    C. EBS based emotional facial modeling

    In this section, we verify the performance of the EBS based

method for emotional facial modeling on the aforementioned databases. The positions of the 26 fiducial points are obtained

    from the detecting and tracking step and then used for calcu-

    lating the positions of the 28 dependent points. These positions

are 2D data in the video sequences and cannot be applied to

    the 3D EBS analysis directly. All the fiducial points need to be

    aligned to our 3D model first. We use a flexible generic facial

    modeling (FGFM) algorithm [48] for fitting each face image

    to the 3D mesh model. The geometric values used in FGPFM

    are obtained from the BU-3DFE database [49]. There are 2500

    3D facial expression models for 100 subjects in this database.

    We use the 3D facial expression model with the associated

Fig. 3. Detection and tracking results.

    (a) anger faces.

    (b) disgust faces.

    (c) fear faces.

    Fig. 4. Emotional EBS model construction.

    frontal-view texture image as ground truth data to train the

    3D model. Initially, we define a face-centered coordinate

    system used for FGPFM. All the 3-D coordinates, curvature

    parameters for every vertex generation function, the weights in

    the interrelationship functions and the statistical model ratios

    are recorded in an FGPFM. The clustering process is used to

    construct the accurate generic facial models from the training

    3D data. All the selected typical training examples are used

    to acquire the geometric values for each FGPFM. The optimal

    geometric values of FGPFM result in full coincidence between


    superimpositions of the transformed FGPFM and those facial

    contours of training images. Geometric values of FGPFM are

    established using the profile matching technique for silhouettes

    of the training images and the FGPFM with the known view

    directions. The reconstruction procedure can be regarded as

    a block function of FGPFM, and the input parameters are

    3-D face-centered coordinates of control points. When the

    control points are accurately modified, the desired 3-D facial

    model is determined based on the topological and geometric

    descriptions of FGPFM.

    To remove the individual differences in the facial expres-

    sions, each face shape from the video sequences is normalized

    to the same scale. The 26 control points on the 3D facial model

    are initially estimated by the fiducial points using the back

    projection technique with the set of predefined unified depth

    values. The original dependent points are also predefined in

    the model. Classified FGFM ratio features are selected with

    a minimal Euclidean distance between the estimated and the

    codebook-like ratios database. The depth values of control

    points and curvature parameters are obtained for reconstructing

the EBS facial model from the selected ratio features classifier.

Fig. 4 shows some representative sample results for emo-

    tional model construction. Our objective here is to find the

    positions of dependent points after emotional facial defor-

    mation under the availability of the fiducial point position.

    The basic six emotions are analyzed in this experiment. The

    best-fit mesh model of a given face is estimated from the

    first input frame with neutral emotion. Based on the known

    tracking information, the positions of all characteristic feature

    points are calculated and the EBS model is reconstructed for

    any particular expression. From experimental results we can

    see that our method provides good construction following the

    variations of the control points.


Fig. 5. EBS facial model constructions with different Poisson's ratios: (a) a male anger face, (b) a female sadness face, (c) a female anger face, (d) a male happiness face.

    We provide more experimental results in Fig. 5 to verify

    the consistency of the proposed method. Fig. 5 presents the

    results of the emotional facial model for different people.

The Poisson's ratio ν is assumed to be constant for the whole facial region and determined under the condition of minimum muscular
force field generation. Fig. 5(a-d) shows the results when ν

    is obtained experimentally. Subjectively, the proposed method

    provides a good facial model under different people and

    expressions.

    D. D-Isomap for final decision

    In this section, 280 video sequences of eight subjects from

    the RML Emotion database and 420 video sequences of 12

subjects from the Mind Reading DVD database are selected for D-Isomap based classifier evaluation, which constitute a

    total of 700 sequences of 20 subjects with six emotions and

    neutral faces.

    The facial EBS features are extracted to construct a 175

    dimensional vector sequence that is too large to manipulate

    directly. We use D-Isomap algorithm for dimensionality re-

    duction, as discussed in section 4. Since each feature vector

    can be seen as one point in a 175 dimensional space, the

    D-Isomap is utilized to find the embedding manifold in a

    low-dimensional space to represent the original data. These

    representations should cover most of the variances of the

    observation based on the continuous variations of facial config-

urations. The low-dimensional space structures are extracted to preserve the manifold's estimated intrinsic geometry, exploiting D-Isomap's capability of nonlinear analysis and the convergence

    of globally optimal solutions.

    The geodesic distance graph from (12) is used for D-Isomap

    based embedding. Fig. 6 shows examples of distance matrices

with discriminative weight factors for seven emotional expressions of randomly selected subjects. The distance graph

    reflects the intrinsic similarity of the original expression data

    and consequently is considered for determining true embed-

ding. From Fig. 6 we can see that, by applying the weight factor, points from the same cluster are projected closer together in the low-dimensional space, so the within-class distances are compacted. On the other hand, the distances between different clusters can be expanded by increasing the weight factor.
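A minimal sketch of this discriminative weighting, assuming a simple scaling rule in which within-class edges are shrunk and between-class edges are stretched by the weight factor w (the symbol and the exact scaling used in Section 4 may differ; function and variable names are illustrative):

import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def discriminative_geodesics(X, labels, k=12, w=0.5):
    # k-nearest-neighbour graph with edge lengths equal to Euclidean distances
    G = kneighbors_graph(X, n_neighbors=k, mode="distance").toarray()
    G = np.maximum(G, G.T)                      # symmetrize; zeros mean "no edge"
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    G = G * np.where(same, 1.0 - w, 1.0 + w)    # compact within-class, expand between-class
    # all-pairs geodesic distances via Dijkstra on the weighted graph
    return shortest_path(G, method="D", directed=False)

# toy usage: 100 EBS feature vectors (175-D) with 7 class labels
rng = np.random.default_rng(2)
D = discriminative_geodesics(rng.random((100, 175)), rng.integers(0, 7, 100), k=12, w=0.5)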

Fig. 6. Distance matrix graphs with weight factors of (a) 0.1, (b) 0.25, (c) 0.5, and (d) 0.75; higher values are shown in red, lower values in blue.

By increasing the dimension of the embedding space, we can

    calculate the residual variance for the original data. The true


dimension of the data can be found by considering the decreasing trend of the residual value. The embedding results using Isomap and the proposed D-Isomap with different k are presented in Fig. 7, which shows the results when k is set to 7, 12, and 20, respectively. From the results we see that our proposed method achieves an average improvement of 10% compared with the original Isomap. The best performance is obtained when k is 12 and the dimension of the embedded space is reduced to 20, which covers more than 95% of the variance of the input data. Therefore, these 20-dimensional components are used here to represent facial expressions in the input videos.
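The residual-variance criterion mentioned above can be computed as in the standard Isomap diagnostic, i.e., one minus the squared correlation between the geodesic distances and the pairwise Euclidean distances of the d-dimensional embedding (a sketch with illustrative names; the exact procedure used in the experiments may differ):

import numpy as np
from scipy.spatial.distance import pdist

def residual_variance(geodesic_D, embedding):
    # 1 - R^2 between geodesic distances and embedding distances;
    # smaller values mean the embedding preserves the manifold geometry better.
    iu = np.triu_indices_from(geodesic_D, k=1)
    r = np.corrcoef(geodesic_D[iu], pdist(embedding))[0, 1]
    return 1.0 - r**2

# toy usage: scan candidate dimensions and look for the elbow in the residual curve
rng = np.random.default_rng(3)
D = rng.random((50, 50)); D = (D + D.T) / 2.0; np.fill_diagonal(D, 0.0)
for d in (5, 10, 20):
    print(d, residual_variance(D, rng.random((50, d))))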


Fig. 7. Dimensionality reduction using Isomap and D-Isomap: (a), (c), and (e) show the results using Isomap with k = 7, 12, and 20, respectively; (b), (d), and (f) show the corresponding results using D-Isomap.

We also provide expressional configurations to show apparent emotional variation in Fig. 8. For each video sequence from the database, we constructed 10 sub-clips with different numbers of frames from neutral to the apex, which improves the representational capacity: a sufficiently large number of feature points is stored to account for as many variations in the original data as possible. To show apparent

    emotional variations, we provided the expressional configu-

    rations based on different numbers of samples. In Fig. 8,

    (a) shows the result using 700 samples with one sample for

    each video, (b) using 10 samples for each video and 7000

    samples in total. From the results we can see that the EBS

model sequences are embedded into a discriminative structure in the low-dimensional feature space. By applying the NCC

    algorithm to the embedding results from the D-Isomap using

    (17), we can determine the emotion class for a test video. We

label the emotion class centers in the embedded feature space, as shown in Fig. 8.
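A minimal sketch of the nearest-class-center rule, assuming the class centers are simply the mean embedded vectors of the labeled training samples for the seven classes (six emotions plus neutral); the exact definition of the centers in (17) may differ:

import numpy as np

def class_centers(embedded_train, train_labels):
    # mean embedded feature vector per class
    classes = np.unique(train_labels)
    centers = np.stack([embedded_train[train_labels == c].mean(axis=0) for c in classes])
    return classes, centers

def ncc_classify(embedded_test, classes, centers):
    # assign each test vector to the class whose center is nearest (Euclidean)
    d = np.linalg.norm(embedded_test[:, None, :] - centers[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

# toy usage on random 20-D embeddings with 7 classes
rng = np.random.default_rng(4)
Xtr, ytr = rng.random((70, 20)), rng.integers(0, 7, 70)
cls, ctr = class_centers(Xtr, ytr)
print(ncc_classify(rng.random((5, 20)), cls, ctr))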


Fig. 8. Labeled class centers in a 2D space based on the embedding results: (a) results using 700 samples; (b) results using 7000 samples.

To evaluate the performance of our proposed method, we divide the 700 sequences into five subsets of 140 sequences each. In each round, one of the five subsets is used as the testing set and the other four subsets are used as the training set. The evaluation procedure is repeated until every subset has been used as the testing set. A test video sequence is treated as a unit and labeled with a single expression category. The recognition accuracy is calculated as the ratio of the number of

    correctly classified videos to the total number of videos in the

    data set. By using the proposed classifier, we achieve an overall

    accuracy of 88.2%. We list the confusion matrix for emotion

    recognition with numbers representing percentage correct in

    Table I. From the results we can see that features representing

    different expressions exhibit great diversity since the distances

    between different emotions are relatively high. On the other

    hand, the same expressions collected from different subjects

    are very similar due to the short distances within the same

    class.
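The five-fold evaluation protocol above can be summarized by the following sketch, where embed_and_classify is a placeholder standing in for the D-Isomap embedding plus NCC classification described earlier (names and the random fold assignment are illustrative, not the exact splits used in the experiments):

import numpy as np

def five_fold_accuracy(features, labels, embed_and_classify, n_folds=5, seed=0):
    # split the 700 sequences into 5 folds of 140, train on 4 folds, test on the
    # held-out fold, and average the per-fold accuracies
    idx = np.random.default_rng(seed).permutation(len(labels))
    folds = np.array_split(idx, n_folds)
    accs = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        pred = embed_and_classify(features[train], labels[train], features[test])
        accs.append(np.mean(pred == labels[test]))
    return float(np.mean(accs))

# toy usage with a trivial majority-class placeholder classifier
rng = np.random.default_rng(5)
X, y = rng.random((700, 175)), rng.integers(0, 7, 700)
dummy = lambda Xtr, ytr, Xte: np.full(len(Xte), np.bincount(ytr).argmax())
print(five_fold_accuracy(X, y, dummy))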


Table III compares the standard deviation and average rate of emotion recognition of the three Isomap methods, and shows that the proposed algorithm achieves better performance than OI and

EI. D-Isomap outperforms the other methods because it compacts data points from the same cluster on the high-dimensional manifold so that they lie closer together in the low-dimensional space, while pushing data points from different clusters farther apart. This ability is beneficial for preserving the homogeneous characteristics needed for emotion classification.

    To demonstrate the discriminative embedding performance

    of the proposed D-Isomap, we conducted some experiments

with state-of-the-art manifold learning methods, i.e., localized

    LDA (LLDA), the discriminative version of LLE (DLLE)

    and Laplacian Eigenmap (LE). LLDA [50] is based on the

    local estimates of the model parameters that approximate

    the non-linearity of the manifold structure with piece-wise

    linear discriminating subspaces. The local neighborhood size

k = 30 and subspace dimensionality d = 32 are selected to compute the local metrics. DLLE [51] preserves the local geometric properties within each class according to the LLE criterion, and the separability between different classes is enforced by maximizing the margins between point pairs from different classes. The balance term h = 1, number of nearest neighbors k1 = 1, and number of smallest distances k2 = 100 are used for classification with the closest centroid. LE [52] makes use of the incremental

Laplacian Eigenmap to reduce the dimensionality and extract features from the data points. Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on the manifold, and the connections to the heat equation, a geometrically motivated algorithm is utilized for representing the high-dimensional data with locality-preserving properties and a natural connection to clustering. The experiments are conducted with a compression dimension of 50.

TABLE IV
RECOGNITION RATE OF DIFFERENT MANIFOLD LEARNING METHODS.

Method      Dimensions    Recognition Rate
LLDA        32            80.5%
DLLE        40            85.3%
LE          25            84.7%
D-Isomap    20            88.2%

    In all the experiments, the final classification after dimen-

    sionality reduction is determined by the nearest neighbor

    criterion. Table IV shows the experimental results of different

    algorithms. The results demonstrate the greater effectiveness

    of D-Isomap for both feature reduction and final recognition.

    It considers the label information and local manifold structure.

When dealing with multiple classes and a complex data-set distribution, the proposed D-Isomap takes advantage of the weight factor to push data with different labels farther apart and cluster data with the same label closer together. Thus the proposed algorithm achieves a better recognition rate.

    VI. CONCLUSIONS

    In this paper we present an automatic emotion recognition

method from video sequences based on a 3D deformable facial model. From the experimental results we find that the significant features for distinguishing one individual emotion from the others differ from emotion to emotion. Some of the features selected in a global scenario are redundant, while others contribute to the classification of a specific emotion. Another observation is that no single feature is significant for all the classes. This actually reveals the nature

    of human emotion; there are no sharp boundaries between

    the emotional states. Any single emotion may share similar

    patterns to other emotions, but not all. The human perception

    of emotion is based on the integration of different patterns.

    In the emotion recognition field, current techniques for the

    detection and tracking of facial expressions are sensitive to

    head pose, clutter, and variations in lighting conditions. Few

    approaches to automatic facial expression analysis are based

    on deformable or dynamic 3D facial models. The proposed

    system attempts to solve such problems by using a generic 3D

    mesh model with D-Isomap classification. The facial region

    and fiducial points are detected and tracked in the input video

    frames automatically. The generic facial mesh model is then

used for EBS feature extraction. D-Isomap based classification is applied for the final decision. The merits of this work are

    summarized as follows.

• Facial expressions are detected and tracked automatically in the video sequences, which alleviates a common problem in conventional detection and tracking methods, namely inconsistent performance due to sensitivity to variations in illumination such as local shadowing, noise, and occlusion.

• We model the face as an elastic body that exhibits different elastic characteristics depending on the facial expression. Based on the continuity condition, the elastic property of each facial expression is found, and a complete wireframe facial model can be generated from a limited number of available feature point positions.

• An adaptive partition of polygons is embedded in the EBS according to the surface curvature through the characteristic feature points. The subtle structural information can be expressed without requiring complicated facial features.

• The generic 3D facial model is established so that good EBS parameters can be used for emotion recognition, e.g., the appropriate physical characteristics for face deformations, control points, etc.

• We propose the use of D-Isomap for emotion recognition. It compacts the data points from the same emotion class on the high-dimensional manifold to make them closer in the low-dimensional space, and pushes the data points from different clusters farther apart. This results in a higher recognition rate compared with other Isomap methods.

• Experimental results and comparisons with several other algorithms demonstrate the effectiveness of the proposed method.

    REFERENCES

    [1] A.C. Rafael and D. Sidney, Affect Detection: An InterdisciplinaryReview of Models, Methods, and Their Applications, IEEE Transactionson Affective Computing, Vol.1 (1), pp. 18-34, June 2010.


    [2] N. Sebe, H. Aghajan, T. Huang, N.M. Thalmann, C. Shan, Special Issueon Multimodal Affective Interaction, IEEE Transactions on Multimedia,Vol.12 (6), pp. 477-480, 2010.

    [3] P. Ekman, T. Dalgleish, and M.E. Power, Basic emotions, Handbook ofCognition and Emotion, Wiley, Chichester, U.K., 1999.

    [4] C. Darwin, The Expression of Emotions in Man and Animals, JohnMurray, 1872, reprinted by University of Chicago Press, 1965.

    [5] J.F. Cohn, Advances in Behavioral Science Using Automated FacialImage Analysis and Synthesis, Signal Processing Magazine, IEEE,Vol.27 (6), pp. 128-133, 2010.

    [6] K.I. Chang, K.W. Bowyer, and P.J. Flynn, Multiple Nose RegionMatching for 3D Face Recognition under Varying Facial Expression,

    IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.28(10), pp. 1695-1700, October 2006.

    [7] P. Ekman, W.V. Friesen, and J.C. Hager, The Facial Action CodingSystem: A Technique for the Measurement of Facial Movement, SanFrancisco, Consulting Psychologist, 2002.

    [8] M.J. Lyons, J. Budynek, A. Plante, and S. Akamatsu, Classifying facialattributes using a 2-D Gabor wavelet representation and discriminantanalysis, Proceedings of the 4th International Conference on AutomaticFace and Gesture Recognition, pp. 202-207, March 2000.

    [9] L.D. Silva and S.C. Hui, Real-time facial feature extraction and emotionrecognition, Proceedings of 4th International Conference on Informa-tion, Communications and Signal Processing, Vol.3, pp. 1310-1314,Singapore, December 2003.

    [10] I. Cohen, N. Sebe, Y. Sun, M. S. Lew, and T.S. Huang, Evaluation ofexpression recognition techniques, Proceedings of International Confer-

    ence on Image and Video Retrieval, pp. 184-195, IL, USA July 2003.[11] G. Guo and C.R. Dyer, Learning from examples in the small sample

    case: face expression recognition, IEEE Transactions on Systems, Man,and Cybernetics, Part B, Vol.35 (3), pp. 477-488, June 2005.

    [12] M. Pantic and I. Patras, Dynamics of facial expression: recognitionof facial actions and their temporal segments from face profile imagesequences, IEEE Transactions on Systems, Man, and Cybernetics, Part

    B, Vol.36 (2), pp. 433-449, April 2006.[13] K. Anderson and P.W. McOwan, A real-time automated system for the

    recognition of human facial expressions, IEEE Transactions on Systems,Man, and Cybernetics, Part B, Vol.36 (1), pp. 96-105, February 2006.

    [14] H. Gunes and M. Piccardi, Automatic Temporal Segment Detection andAffect Recognition from Face and Body Display, IEEE Transactions onSystems, Man, and Cybernetics, Part B, Vol.39 (1), pp. 64-84, February2009.

    [15] Y. Wang and L. Guan, Recognizing Human Emotional State fromAudiovisual Signals, IEEE Transactions on Multimedia, Vol.10 (5), pp.

    659-668, August 2008.[16] Z. Zeng, M. Pantic, G.I. Roisman, and T.S. Huang, A Survey of Affect

    Recognition Methods: Audio, Visual, and Spontaneous Expressions,IEEE Transactions on Pattern Analysis and Machine Intelligent, Vol.31(1), pp. 39-58, January 2009.

    [17] M. Song, Z. Dong, C. Theobalt, H.Q. Wang, Z.C. Liu, and H.P. Seidel,A General Framework for Efficient 2D and 3D Facial ExpressionAnalogy, IEEE Transactions on Multimedia, Vol.9 (7), pp. 1384-1395,November 2007.

    [18] T. Yun and L. Guan, Human Emotion Recognition Using Real 3DVisual Features from Gabor Library, IEEE International Workshop on

    Multimedia Signal Processing, pp. 481-486, Saint Malo, October 2010.[19] J. Wang, L. Yin, X. Wei, and Y. Sun, 3D facial expression recognition

    based on primitive surface feature distribution, IEEE InternationalConference on Computer Vision and Pattern Recognition, pp. 1399-1406,New York, June 2006.

    [20] H. Soyel and H. Demirel, Facial expression recognition using 3D

    facial feature distances, International Conference on Image Analysis andRecognition, Vol.4633, pp. 831-838, Montreal, August 2007.

    [21] H. Soyel and H. Demirel, Optimal feature selection for 3D facial ex-pression recognition using coarse-to-fine classification, Turkish Journalof Electrical Engineering and Computer Sciences, Vol.18 (6), pp. 1031-1040, 2010.

    [22] H. Tang and T. S. Huang, 3D facial expression recognition based onautomatically selected features, IEEE Computer Society Conference onComputer Vision and Pattern Recognition Workshops, pp. 1-8, Anchorage,June 2008.

    [23] H. Tang and T. Huang, 3D facial expression recognition based onproperties of line segments connecting facial feature points, IEEE

    International Conference on Automatic Face and Gesture Recognition,pp. 1-6, Amsterdam, The Netherlands, 2008.

    [24] Y. Sun and L. Yin, Facial expression recognition based on 3D dynamicrange model sequences, Computer Vision - ECCV 08, pp. 58-71, 2008.

    [25] T. Yun and L. Guan, Automatic face detection in video sequences usinglocal normalization and optimal adaptive correlation techniques, Pattern

    Recognition, Vol.42 (9), pp. 1859-1868, September 2009.

    [26] P. Viola and M. Jones, Robust Real Time Object Detection, Pro-ceedings 2nd International Workshop on Statistical and ComputationalTheories of Vision, Vancouver, Canada, July 2001.

[27] J. Xiao, S. Baker, I. Matthews, and T. Kanade, Real-time combined 2D+3D active appearance models, Computer Vision and Pattern Recognition Conference, Vol.2, pp. 535-542, July 2004.

[28] R. Gross, I. Matthews, and S. Baker, Constructing and Fitting Active Appearance Models With Occlusion, IEEE Workshop on Face Processing in Video, pp. 72, 2004.

    [29] D. Vukadinovic, M. Pantic, Fully Automatic Facial Feature PointDetection Using Gabor Feature Based Boosted Classifiers, IEEE Inter-national Conference on Systems, Man and Cybernetics Waikoloa, Vol. 2,pp.1692 - 1698, October 2005.

    [30] T. Yun and L. Guan, Automatic fiducial points detection for facialexpressions using scale invariant feature, IEEE International Workshopon Multimedia Signal Processing, pp. 1-6, Rio de Janero, Brazil, October2009.

    [31] T. Yun and L. Guan, Fiducial Point Tracking for Facial ExpressionUsing Multiple Particle Filters with Kernel Correlation Analysis, IEEE

    International Conference on Image Processing, pp. 373-376, Hongkong,September 2010.

    [32] M.H. Davis, A. Khotanzad, D.P Flamig, and S.E Harms, A physics-based coordinate transformation for 3-D image matching, IEEE Trans-actions on Medical Imaging, Vol.16 (3), pp. 317 -328, June 1997.

    [33] M.H. Davis, A. Khotanzad, D.P. Flamig, S.E. Harms, Elastic bodysplines: a physics based approach to coordinate transformation in medicalimage matching, IEEE Symposium on Computer-Based Medical Systems,pp. 81 - 88, 1995.

[34] A. Hassanien and M. Nakajima, Image morphing of facial images transformation based on Navier elastic body splines, Computer Animation, pp. 119-125, 1998.

    [35] C.J. Kuo, J. Hung, M. Tsai, and P. Shih, Elastic Body Spline Techniquefor Feature Point Generation and Face Modeling, IEEE Transactions on

    Image Processing, Vol.14 (12), pp. 2159-2166, December 2005.

    [36] P. C. Chou and N. J. Pagano,Elasticity: Tensor, Dyadic, and EngineeringApproaches, Dover, New York, 1992.

    [37] J.B. Tenenbaum, V. Silva, and J.C. Langford, A global geometricframework for nonlinear dimensional reduction, Science, Vol.290, pp.2319-2323, December 2000.

    [38] M.H. Yang, Face recognition using extended isomap, InternationalConference on Image Processing, Vol.2, pp.117-120, New York, Septem-ber 2002.

    [39] X. Geng, D.C. Zhan, and Z.H. Zhou, Supervised Nonlinear Dimension-ality Reduction for Visualization and Classification, IEEE Transactionson Systems, Man and Cybernetics, Part B, Vol.35 (6), pp. 10981107,December 2005.

    [40] Y. Wu, K.L. Chan, and L. Wang, Face recognition based on discrimina-tive manifold learning, International Conference on Pattern Recognition,Vol.4 pp. 171-174, Cambridge, UK, September 2004.

    [41] D. Zhao and L. Yang, Incremental Isometric Embedding of High-Dimensional Data Using Connected Neighborhood Graphs, IEEE Trans-actions on Pattern Analysis and Machine Intelligence, Vol.31, pp. 86-98,January 2009.

    [42] V. Le, H. Tang, and T.S. Huang, Expression recognition from 3Ddynamic faces using robust spatio-temporal shape features, AutomaticFace Gesture Recognition and Workshops, pp. 414-421, 2011.

[43] G. Sandbach, S. Zafeiriou, M. Pantic, and D. Rueckert, A dynamic approach to the recognition of 3D facial expressions and their temporal models, Automatic Face Gesture Recognition and Workshops, pp. 406-413, 2011.

    [44] T. Friedrich, Nonlinear Dimensionality Reduction with Locally LinearEmbedding and Isomap, University of Sheffield, 2002.

    [45] E. Kokiopoulou and Y. Saad, Orthogonal Neighborhood PreservingProjections: A Projection- Based Dimensionality Reduction Technique,

    IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.29,pp. 2143-2156, December 2007.

    [46] J. Handl and J. Knowles, An Evolutionary Approach to MultiobjectiveClustering, IEEE Transactions on Evolutionary Computation, Vol.11 (1),pp. 56-76, February 2007.

    [47] S.B. Cohen, Mind Reading: The Interactive Guide to Emotions, Lon-don, Jessica Kingsley, 2004.

    [48] S.Y. Ho and H.L. Huang, Facial Modeling from an UncalibratedFace Image Using Flexible Generic Parameterized Facial Models, IEEE


Transactions on Systems, Man and Cybernetics, Part B, Vol.31 (8), October 2005.

[49] L. Yin, X. Wei, Y. Sun, J. Wang, and M.J. Rosato, 3D Facial Expression Database for Facial Behavior Research, Automatic Face and Gesture Recognition, pp. 211-216, 2006.

[50] L. Zhu, F. Yun, Y. Junsong, T.S. Huang, and W. Ying, Query Driven Localized Linear Discriminant Models for Head Pose Estimation, IEEE International Conference on Multimedia and Expo, pp. 1810-1813, July 2007.

[51] X. Li, S. Lin, S. Yan, and D. Xu, Discriminant Locally Linear Embedding With High-Order Tensor Data, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol.38 (2), pp. 342-352, April 2008.

[52] W. Luo, Face recognition based on Laplacian Eigenmaps, International Conference on Computer Science and Service System, pp. 416-419, June 2011.

Yun Tie (S'07) received his B.Sc. degree from Nanjing University of Science and Technology, China, M.A.Sc. degree in Computer Science from Kwangju Institute of Science and Technology (KJIST), Korea, and Ph.D. degree from Ryerson University, Canada. He is currently a Postdoctoral Fellow in the Ryerson Multimedia Lab at Ryerson University, Toronto, Canada. His research interests include image/video processing, pattern recognition, 3D data modeling, and intelligent classification and their applications.

Ling Guan (S'88-M'90-SM'96-F'08) received his B.Sc. degree in Electronic Engineering from Tianjin University, China, M.A.Sc. degree in Systems Design Engineering from the University of Waterloo, Canada, and Ph.D. degree in Electrical Engineering from the University of British Columbia, Canada. He is currently a professor and a Tier I Canada Research Chair in the Department of Electrical and Computer Engineering at Ryerson University, Toronto, Canada. He also held visiting positions at British Telecom (1994), Tokyo Institute of Technology (1999), Princeton University (2000), Hong Kong Polytechnic University (2008) and Microsoft Research Asia (2002, 2009).

    He has published extensively in multimedia processing and communications,human-centered computing, pattern analysis and machine intelligence, andadaptive image and signal processing. He is a recipient of the 2005 IEEETrans. on Circuits and Systems for Video Technology Best Paper Award andan IEEE Circuits and Systems Society Distinguished Lecturer.