
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 15, NO. 7, NOVEMBER 2013 1543

Face Expression Recognition by Cross Modal Data Association

Ashish Tawari and Mohan Manubhai Trivedi, Fellow, IEEE

Abstract—We present a novel facial expression recognition framework using audio-visual information analysis. We propose to model the cross-modality data correlation while allowing the two modalities to be treated as asynchronous streams. We also show that our framework can improve the recognition performance while significantly reducing the computational cost by incorporating auditory information to avoid processing redundant or insignificant frames. In particular, we design a single image representation of an image sequence as a weighted sum of registered face images, where the weights are derived using auditory features. We use a still-image-based technique for the expression recognition task. Our framework, however, can be generalized to work with dynamic features as well. We performed experiments using the eNTERFACE’05 audio-visual emotional database containing six archetypal emotion classes: Happy, Sad, Surprise, Fear, Anger and Disgust. We present one-to-one binary classification as well as multi-class classification performances, evaluated using both subject dependent and independent strategies. Furthermore, we compare multi-class classification accuracies with those of previously published studies that use the same database. Our analyses show promising results.

Index Terms—Facial expression recognition, audio-visual expression recognition, key frames selection, multi-modal expression recognition, emotion recognition, affective computing, affect analysis.

I. INTRODUCTION AND MOTIVATION

AFFECTIVE state plays a fundamental role in human interactions, influencing cognition, perception and even rational decision making. This fact has inspired the research field of “affective computing,” which aims at enabling computers to recognize, interpret and simulate affects [1]. Such systems can contribute to human-computer communication and to applications such as learning environments, entertainment, customer service, computer games, security/surveillance, and educational software, as well as to safety-critical applications such as driver monitoring [2], [3]. To make human-computer interaction (HCI) more natural and friendly, it would be beneficial to give computers the ability to recognize affects the same way a human does. Since speech and vision are the primary senses for human expression and perception, significant research effort has been focused on developing intelligent systems with audio and video interfaces [4].

Manuscript received March 05, 2012; revised September 17, 2012; accepted November 22, 2012. Date of publication June 06, 2013; date of current version October 11, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. K. Selcuk Candan.

The authors are with the Computer Vision and Robotics Research Laboratory, University of California, San Diego, La Jolla, CA 92093 USA (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2013.2266635

Multimodal systems, specifically with audio and visual modalities, have shown several interesting interactions between the two modalities. For example, audio-visual speech recognition (AVSR), also known as automatic lipreading or speechreading [5], aims at improving automatic speech recognition by exploiting the visual modality of the speaker’s mouth region. Not surprisingly, it has outperformed audio-only ASR systems, particularly in noisy conditions. Similarly, the well-known McGurk effect [6] demonstrates an interaction between hearing and vision in speech perception. Furthermore, Munhall et al. [7] suggest that rhythmic head movements are correlated with the pitch and amplitude of the speaker’s voice and that visual information can improve speech intelligibility by 100% over that possible using auditory information alone.

In the field of affect recognition, there have been a number of efforts to exploit audio-visual information as well, and our framework can utilize these methods. However, the above examples, where the visual modality improves an audio-only system, motivate us to ask the fundamental question of how the audio modality influences visual perception, in particular for the task of facial expression recognition. It is evident that speech generation influences facial expression. Moreover, for expression recognition the coupling between the two modalities is not as tight as in the audio-visual speech recognition task.

Towards this end, we present a novel facial expression recognition framework using bimodal information. Our framework explicitly models the cross-modality data correlation while allowing the modalities to be treated as asynchronous streams. To recognize the key emotion of an image sequence, the proposed framework seeks to summarize the emotion using one single image derived from the hundreds of frames contained in the video. We also show that the framework can improve the recognition performance while significantly reducing the computational cost by using auditory information to avoid processing redundant or insignificant frames.

II. RELATED STUDIES

Our long term goal is to study the cross-modal influence of the audio-visual data streams on each other for the affect recognition task. In this study, however, our focus is on face expression recognition. Hence we first discuss some of the representative works on facial expression recognition and then move our discussion to existing audio-visual affect recognition approaches to highlight the challenges that lie in the integration of the two modalities. For an overview of audio-only, visual-only and audio-visual affect recognition, readers are encouraged to study a recent survey by Zeng et al. [9].



Fig. 1. An example of spontaneous conversation between driver and passenger during a driving task. The film strip shows samples of five images equally spaced in the utterance. The first half of the utterance contains speech and the latter half road noise. Notice, however, that the facial features are more expressive after the speech content while the head dynamics is concomitant with the speech [8].

Because of the importance of the face in emotion expression and perception, most vision-based affect recognition studies focus on facial expression analysis. A large number of existing facial expression recognizers employ various pattern recognition approaches and are based on 2D spatiotemporal facial features: geometric features or appearance-based features. Geometric-based approaches track the facial geometry information over time and classify expressions based on the deformation of facial features [10]. Chang et al. [11] defined a set of points as the facial contour feature, and an Active Shape Model (ASM) is learned in a low dimensional space. Lucey et al. [12] employed an Active Appearance Model (AAM)-derived representation, while Valstar, Patras, and Pantic [13] tracked 20 fiducial facial points on raw video using a particle filter.

On the other hand, appearance-based approaches emphasize describing the appearance of facial features and their dynamics. Zhao and Pietikäinen [14] employed the dynamic Local Binary Pattern (LBP), which is able to extract information along the time axis. Bartlett et al. [15] used a bank of Gabor wavelet filters to decompose the facial texture. More recently, Wu et al. [16] utilized Gabor Motion Energy Filters, which are also able to capture spatio-temporal information. Yang and Bhanu [17] created a single good image representation from a visual sequence by first registering the face images to a reference image using a dense SIFT flow algorithm and then extracting appearance features using Local Phase Quantization (LPQ). The method has provided the best overall emotion recognition performance to date on the GEMEP-FERA benchmark [18]. It can be derived as a special case of our framework. It is important to mention that precise registration of frames is an important step; otherwise, a single representation of an image sequence using all the frames could suffer from large deviations of head pose.

Cohen et al. [19] performed expression classification in video

sequences using temporal and static modeling with Naive-Bayes-based (‘static’) and HMM-based (‘dynamic’) classifiers, respectively. Static classifiers outperformed dynamic ones. It is argued that dynamic classifiers are more complex; therefore they require more training samples and many more parameters to learn compared with the static approach. The authors suggest that dynamic classifiers are better suited for person-dependent systems due to their higher sensitivity not only to changes in the appearance of expressions among different individuals, but also to differences in temporal patterns. Static classifiers are easier to train and implement, but when used on a continuous video sequence, they can be unreliable, especially for frames that are not at the peak of an expression. This raises an important question: how can we obtain a better and more robust representation of an expression from video sequences, and can multimodality help in this regard? We seek to answer these questions.

As far as automatic facial affect recognition is concerned, most of the existing efforts studied the expressions of the six basic emotions (Happy, Sad, Surprise, Fear, Anger and Disgust) due to their universal properties and the availability of relevant training and test material (e.g., [20]). These emotions are often deliberate and exaggerated displays [21]. Deliberate behavior, however, differs in visual appearance, audio profile, and timing from spontaneously occurring behavior [22], [23]. This has led the research field to new trends: analysis of spontaneous affective behavior and development of multimodal analysis. Multimodal analysis helps to improve performance in challenging naturalistic settings during spontaneous behavior. Combining complementary information from the two streams can help improve the recognition performance. However, the two modalities are not tightly coupled in spontaneous naturalistic behavior, as depicted in Fig. 1 [8]. Moreover, speech generation affects the facial expression dynamics. In the following paragraphs, we present some of the works which address these two issues, in particular how they derive various representations for the visual channel and how they model asynchrony in the two streams.

One of the challenging tasks of visual tracking systems is to deal with changes in the shape of the mouth caused by speech. In order to deal with this situation, Datcu et al. [24]


Fig. 2. Overview of the proposed expression recognition system. The Cross-Relevance feedback block provides the importance of the other modality at the current time interval based on the analysis of its own modality. The Frame-Relevance measure block can potentially utilize both the Cross-Relevance feedback and its own modality analysis to finally assign the importance of the current frame. The solid blocks and connections show the active components in the current implementation.

proposed a data fusion technique where they rely only on the visual data in the silent phase of the video sequence and on the fused audio-visual data during non-silent segments. The visual modality during non-silent segments only focused on the upper half of the facial region to eliminate the effects caused by changes in the shape of the mouth. However, the results show that the full-face model performs better than the partial-face model. Hence an alternative strategy is required to filter out the influence of phonemes.

Wang et al. [25] proposed a relatively inexpensive computational method for visual-based emotion recognition which selected a single key frame from each audio-visual sequence to represent the emotion present in the entire sequence. The criterion for selecting the key frames from the audio-visual sequences was based on the heuristic that peak emotions are displayed at the maximum audio intensities. The visual features are extracted from these key frames using Gabor wavelets. Acoustic features are then combined with the derived visual features in a feature-level data fusion scheme for the classification task. However, choosing one single frame from the visual sequence is very restrictive, and the same is clear from the performance of their visual-alone system.

An important audio-visual fusion scheme which aims at making use of the correlation between audio and visual data streams while relaxing the requirement of synchronization of these streams is model-level fusion. Zeng et al. [26] presented a Multistream Fused HMM to build an optimal connection among multiple streams from audio and visual channels according to the maximum entropy and the maximum mutual information criterion. The authors, however, considered tightly coupled HMMs. Song et al. [27] proposed an approach for multimodal emotion recognition which was specifically focused on temporal analysis of three sets of features: ‘audio only features’, ‘visual only features’ (upper half of the facial region) and ‘visual speech features’ (lower half of the facial region) using a triple HMM, i.e., one HMM for each of the information modes. This model was proposed to deal with state asynchrony of the audio-visual features while maintaining the original correlation of these features over time. At the other extreme is a model that allows complete asynchrony between the streams. This is, however, infeasible due to the exponential increase in the number of state combinations introduced by the asynchrony.

Our contribution in this paper is twofold: first, we explicitly model the correlation between the two streams while allowing them to be treated as asynchronous streams; second, we assign importance to particular frames, thereby avoiding extreme treatments (all the frames or just a single frame). More importantly, this is accomplished by incorporating the cross-modal models developed in the first step. The idea is that the analysis of sequential changes can be beneficial for facial expression recognition; however, the onset and the offset of the facial dynamics are hard to detect using the visual modality alone. Hence most efforts try to classify every frame and take a majority vote in the end to come up with a single expression class. If the near-apex frame or a set of more representative frames can be picked based on multimodal data to represent an entire segment, we can prevent noisy or redundant sequential facial feature deformations from negatively influencing the recognition performance, and hence describe emotions in a reliable manner. Initial findings based on this proposition were reported in [28]. Here, we provide further in-depth analysis by statistically substantiating the claims and compare multi-class classification performances with the existing literature.

III. AUDIO-VISUAL DATA ASSOCIATION APPROACH

Fig. 2 sketches an overview of the proposed recognition system. The salient feature of our framework is the introduction


of the cross-modal relevance feedback and the frame relevance measure blocks. In our present work, we have limited our discussion to facial expression recognition using visual features alone. Hence the classification module only utilizes visual features. An audio-visual classification framework, however, can easily be devised. An important point to note is that, unlike standard fusion schemes (early-, model-level- or late-fusion), the proposed method attempts to improve the signal representation in the first place, thereby reducing error propagation, which in general is harder to deal with at later stages. Another simplification made in this work is to utilize only cross-modal feedback. This is to highlight the importance of cross-modal information feedback in this context.

A detailed approach to summarize the visual expression information into a single image representation is presented in the following sections. We show that just by incorporating cross-modal information, a significant reduction in computation cost can be achieved. Moreover, by avoiding spurious frames for further processing, and thereby reducing their unwanted influence (degradation in recognition performance), classification accuracy can be improved, as discussed in Section IV.B.

A. Face Tracking and Alignment

In recent times, model-based techniques have been extensively used for nonrigid deformable object fitting. We use the Constrained Local Model (CLM) [29] for face tracking in the image sequences. CLM utilizes a parameterized shape model to capture plausible deformations of landmark locations. It predicts the locations of the landmarks using a group of landmark detectors. In [29], the response map of these detectors is represented non-parametrically and the landmarks' locations are optimized via subspace constrained mean-shifts while enforcing their joint motion via the shape model. It fits well to various poses. We used a person-independent model which was trained on the Multi-PIE database [29]. The fitting process on an image provides, for each sequence $i$ and frame $n$, a row vector containing the detected landmark positions.

The detected landmarks are normalized by appropriate scaling, rotation and translation to make the centers of the eyes 200 pixels apart and the line joining the two centers horizontal. We denote the normalized shape vector as $\mathbf{s}_{i,n}$. Furthermore, a reference shape $\mathbf{s}_{\mathrm{ref}}$ is calculated using (1):

$$\mathbf{s}_{\mathrm{ref}} = \frac{1}{\sum_{i=1}^{M} N_i} \sum_{i=1}^{M} \sum_{n=1}^{N_i} \mathbf{s}_{i,n} \qquad (1)$$

where $N_i$ is the total number of frames in sequence $i$ and $M$ is the total number of image sequences. Given this reference shape $\mathbf{s}_{\mathrm{ref}}$, each image is aligned using an affine transform to obtain the aligned image $I^{a}_{i,n}$. For alignment, we only considered the points which are relatively stable to track, corresponding to the eyebrows, eyes, nose and mouth regions. Fig. 3 shows the reference shape obtained for the database and the points used for image alignment. An example of an automatically tracked face and the aligned face is illustrated in Fig. 4.
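To make the alignment step concrete, the following is a minimal sketch in Python (NumPy/OpenCV) of the normalization, the reference-shape computation of Eq. (1), and the affine alignment described above. The function names, the landmark index sets, and the use of cv2.estimateAffinePartial2D are our own illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): landmark normalization,
# reference shape (Eq. 1), and affine alignment of each frame.
import numpy as np
import cv2

def normalize_shape(pts, left_eye_idx, right_eye_idx, eye_dist=200.0):
    """Scale/rotate/translate landmarks (K, 2) so the eye centers lie on a
    horizontal line, `eye_dist` pixels apart, centered at the origin."""
    le = pts[left_eye_idx].mean(axis=0)
    re = pts[right_eye_idx].mean(axis=0)
    d = re - le
    scale = eye_dist / np.linalg.norm(d)
    theta = -np.arctan2(d[1], d[0])          # rotation making the eye line horizontal
    c, s = np.cos(theta), np.sin(theta)
    R = scale * np.array([[c, -s], [s, c]])
    return (pts - (le + re) / 2.0) @ R.T

def reference_shape(shapes_per_sequence):
    """Eq. (1): mean normalized shape over all frames of all sequences.
    `shapes_per_sequence` is a list of arrays with shape (N_i, K, 2)."""
    return np.concatenate(shapes_per_sequence, axis=0).mean(axis=0)

def align_frame(image, pts, ref_shape, stable_idx, out_size=(200, 200)):
    """Warp `image` so its stable landmarks (eyebrows, eyes, nose, mouth)
    match the reference shape placed at the center of the output image."""
    ref_pts = ref_shape[stable_idx] + np.array(out_size, dtype=float) / 2.0
    M, _ = cv2.estimateAffinePartial2D(pts[stable_idx].astype(np.float32),
                                       ref_pts.astype(np.float32))
    return cv2.warpAffine(image, M, out_size)
```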

Fig. 3. Reference shape derived from the database, showing the 66 landmark positions; the points in red are the ones used during the image alignment process.

Fig. 4. (a) An example of a tracked face and the landmarks, and (b) the aligned face image obtained using the reference shape during the image alignment step.

B. Visual Sequence Analysis: A Bimodal Approach

Our goal in audio-visual sequence analysis is to provide a segment-level classification. A video sequence, however, consists of hundreds of frames, and the question is how to intelligently utilize all or a subset of the frames to obtain a single image representation. For this, we propose to derive a weighted mean image $I_{wm}$ for the sequence which, hopefully, is representative of the emotional content of the segment. As shown in (2), each aligned face image $I^{a}_{n}$ is weighted by a relevance measure $w_n$ derived from the audio signal analysis:

$$I_{wm} = \sum_{n=1}^{N} w_n \, I^{a}_{n} \qquad (2)$$

where $\sum_{n=1}^{N} w_n = 1$ and $N$ is the number of frames in the sequence.

We propose two rule-based approaches to assign the value of the relevance measure.

The first approach utilizes all the frames in the active video sequence, hence discarding any prosodic information available in the audio stream. This is accomplished by assigning uniform weights to all the frames. For a fair comparison with the second approach, the active video sequence does not include the preceding and trailing silence periods; otherwise it would introduce irrelevant frames in which, in the utilized database, the subject may not even be looking at the camera. We call the resultant image the ‘mean image’.

The second approach uses prosodic information related to the pitch and energy contours to choose only certain frames for the calculation of the image $I_{wm}$. In particular, we use four sub-segments of the given video segment: two corresponding to the start and end of the speech segment and two corresponding to the maximum intensity and maximum pitch values. Each sub-segment is


Fig. 5. Signal processing involved in the calculation of the single image representation of the image sequence. The bottom curves are the intensity (dotted green, associated with the left axis) and the pitch contour (solid blue, associated with the right axis), along with the voiced region depicted by the red crosses. The middle plot is the speech signal showing the segments chosen by the two schemes: weighted mean in green and mean in the red box. Finally, the image sequence is shown next. All the plots and the image sequence share the same time axis. (a) Weighted mean image derived for the expression class Happy and (b) mean image derived for the expression class Happy.

200 ms long, centered around the mentioned events. All the selected frames are assigned the same weight to derive the single image representation. We call the resultant image the ‘weighted mean image’. Fig. 5 shows the signal processing involved in a typical example of the two approaches and their weighted mean and mean image outputs.
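For illustration, here is a minimal Python/NumPy sketch of the second scheme as described above: select frames falling inside four 200 ms windows (speech start, speech end, maximum intensity, maximum pitch), give them uniform weights, and form the weighted mean image of Eq. (2). The function and variable names are ours, and the authors' actual implementation may differ.

```python
# Illustrative sketch (assumed interfaces): audio-guided weighted mean image (Eq. 2).
import numpy as np

def audio_guided_weights(frame_times, speech_start, speech_end,
                         t_max_intensity, t_max_pitch, half_win=0.1):
    """Assign uniform weights to frames within +/-100 ms of the four audio
    events (speech start/end, max intensity, max pitch); zero elsewhere."""
    events = [speech_start, speech_end, t_max_intensity, t_max_pitch]
    selected = np.zeros(len(frame_times), dtype=bool)
    for t in events:
        selected |= np.abs(frame_times - t) <= half_win
    assert selected.any(), "no video frames fall inside the selected windows"
    w = selected.astype(float)
    return w / w.sum()                      # weights sum to 1

def weighted_mean_image(aligned_frames, weights):
    """Eq. (2): weighted sum of the aligned face images."""
    frames = np.asarray(aligned_frames, dtype=float)   # shape (N, H, W)
    return np.tensordot(weights, frames, axes=1)
```

For the first scheme, the weights would simply be uniform over all frames of the active (non-silent) part of the sequence.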

C. Appearance Feature Extraction

Originally proposed for texture analysis, the Local Binary Pattern (LBP) family of descriptors (LBP [30], LBP-TOP [14], LPQ [31] and LPQ-TOP [32]) has in recent years been extensively used for static and temporal facial expression analysis, as well as face recognition. We use the blur-insensitive LPQ (Local Phase Quantization) appearance descriptor proposed by Ojansivu et al. [31] as the feature for facial expression analysis. LPQ is based on computing the short-term Fourier transform over a local image window. At each pixel, the local Fourier coefficients are computed at four frequency points, $\mathbf{u}_1 = [a, 0]^T$, $\mathbf{u}_2 = [0, a]^T$, $\mathbf{u}_3 = [a, a]^T$ and $\mathbf{u}_4 = [a, -a]^T$, where $a$ is a sufficiently small scalar; a small fixed value of $a$ is used in our experiment. The phase information is then recovered using binary scalar quantization of the signs of the real and imaginary parts of each coefficient. The resultant eight binary coefficients are then represented as integers using binary coding. Finally, a histogram of these integer values from all image positions is composed and used as a 256-dimensional feature vector. We also use a de-correlation process to eliminate the dependency of the neighboring pixels before quantization.

In our experiment, we resize the aligned face images to 200 × 200 pixels and further divide them into non-overlapping tiles of 10 × 10 to extract local patterns. The LPQ feature vector is thus the concatenation of the per-tile 256-dimensional histograms.
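As a sketch of the descriptor, the following Python/SciPy code computes a basic LPQ histogram for one image tile. The window size of 7, the frequency a = 1/7, and the use of plain convolution are common defaults from the LPQ literature rather than values stated in this paper, and the de-correlation step mentioned above is omitted.

```python
# Illustrative basic LPQ sketch (no decorrelation step).
import numpy as np
from scipy.signal import convolve2d

def lpq_histogram(img, win_size=7):
    """8-bit LPQ codes from the signs of the real/imaginary parts of the
    local STFT at four low frequency points, then a 256-bin histogram."""
    img = img.astype(float)
    r = win_size // 2
    x = np.arange(-r, r + 1)
    a = 1.0 / win_size                      # assumed 'sufficiently small' frequency
    w0 = np.ones_like(x, dtype=complex)     # frequency 0 basis
    w1 = np.exp(-2j * np.pi * a * x)        # frequency +a basis
    w2 = np.conj(w1)                        # frequency -a basis

    def filt(row, col):
        # separable 2-D STFT filter: `row` applied along x, `col` along y
        tmp = convolve2d(img, row[np.newaxis, :], mode='same')
        return convolve2d(tmp, col[:, np.newaxis], mode='same')

    # Four low frequency points: [a, 0], [0, a], [a, a], [a, -a]
    F = [filt(w1, w0), filt(w0, w1), filt(w1, w1), filt(w1, w2)]

    codes = np.zeros(img.shape, dtype=int)
    bit = 0
    for f in F:
        codes += (np.real(f) > 0).astype(int) << bit
        bit += 1
        codes += (np.imag(f) > 0).astype(int) << bit
        bit += 1

    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist
```

In the setup described above, this histogram would be computed for each tile of the resized 200 × 200 aligned face and the per-tile histograms concatenated into the final feature vector.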

D. Auditory Feature Extraction

In our prior work [33], [34] we used prosodic and spectral features to model emotional states. We use a subset of these features for the cross-modal relevance calculation in the proposed framework. In particular, the pitch and intensity (energy) contours are used to derive the weights for the nth frame in the visual stream as described in Section III.B.

For the pitch contour calculation, we used an auto-correlation algorithm similar to [35]. The input speech signal is divided into overlapping frames with shift intervals (the difference between the starting points of consecutive frames) of 10 ms. Each frame is 60 ms long so as to span three periods of the minimum pitch value (in our case 50 Hz). Pitch candidates over each frame are calculated and a dynamic programming technique is used to obtain the final pitch contour. Log-energy coefficients are calculated using 30 ms frames with a shift interval of 10 ms. Fig. 5 shows the interpolated pitch contour and voiced segments as well as the intensity contour.
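For illustration, a minimal sketch of per-frame log-energy and autocorrelation-based pitch candidates, following the frame sizes given above (60 ms pitch frames, 30 ms energy frames, 10 ms shift). This is a bare-bones version: the dynamic-programming path selection and voicing decision of [35] are omitted, the upper pitch bound of 400 Hz is our assumption, and the function names are ours.

```python
# Illustrative sketch: per-frame log-energy and autocorrelation pitch candidates.
import numpy as np

def frame_signal(x, fs, frame_len_s, shift_s):
    """Slice a 1-D signal x into overlapping frames."""
    flen, shift = int(frame_len_s * fs), int(shift_s * fs)
    starts = range(0, max(len(x) - flen, 0) + 1, shift)
    return np.stack([x[s:s + flen] for s in starts])

def log_energy(x, fs, frame_len_s=0.03, shift_s=0.01):
    frames = frame_signal(x, fs, frame_len_s, shift_s)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-12)

def pitch_candidates(x, fs, frame_len_s=0.06, shift_s=0.01,
                     f_min=50.0, f_max=400.0):
    """Pick, per 60 ms frame, the lag with the largest autocorrelation in the
    plausible pitch range; DP smoothing over frames is omitted here."""
    frames = frame_signal(x, fs, frame_len_s, shift_s)
    lag_min, lag_max = int(fs / f_max), int(fs / f_min)
    pitches = []
    for fr in frames:
        fr = fr - fr.mean()
        ac = np.correlate(fr, fr, mode='full')[len(fr) - 1:]  # non-negative lags
        lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
        pitches.append(fs / lag)
    return np.array(pitches)
```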

IV. EXPERIMENTAL ANALYSIS

A. Audio-Visual Dataset

In our experiments, we used the audio-visual affective database eNTERFACE’05 [36]. It contains the six archetypal emotions: happiness (ha), sadness (sa), surprise (su), anger (an), disgust (di) and fear (fe). 42 subjects were asked to react to six different situations. The subjects were given five different answers to react to these situations. However, they were not given any instruction on how to express their emotions. Two human experts judged whether the reaction expressed the emotion in an unambiguous manner. If not, it was discarded. The database


Fig. 6. Binary-class classification accuracy for all 15 possible combinations of the six basic emotions: Happy (ha), Sad (sa), Surprise (su), Fear (fe), Anger (an) and Disgust (di). (a) Subject dependent analysis: average accuracy of the ten experiments corresponding to each 10-fold cross-validation procedure and a statistical significance test (one-way ANOVA). A p-value of less than 0.05 indicates that the two populations under test have significantly different means. Each point shows its mean and a one-standard-deviation bar. The plotted signs indicate the significance level of the test or ‘not significant’, respectively. (b) Subject independent analysis using Leave-One-Subject-Out cross validation.


is collected in the English language. Among the 42 subjects, 81% were men and the remaining 19% were women. 31% of the total set wore glasses, while 17% of the subjects had a beard. The database was captured in a controlled recording environment.

B. Results and Discussion

In this section, we present results for two classification tasks: the first involves binary-class classification experiments and the second involves multi-class classification experiments. The latter helps us conduct a comparative study with other publications available in the literature, while the purpose of the binary classification task is to bring forth the importance of bimodal data association in facial expression recognition using visual sequence data. Binary classification analysis also helps us gain better insight into, specifically, the impact of our proposed framework and, generally, the inherent confusion between two classes. It is also worth noting that many multi-class classification strategies inherently involve multiple binary classifications (in our case too), and their performance is often left out of the discussion. Hence, we present it in the following paragraphs.

We perform binary classification using a Support Vector Machine (SVM) with a linear kernel and the default parameters available in the Matlab implementation. We have 15 binary classification tasks corresponding to every possible pair of the six expression classes. We present both subject dependent and independent analyses.

For subject dependent analysis, we utilize a 10-fold cross validation strategy. That is, the database is randomly divided into 10 folds in a stratified manner so that they contain approximately the same proportions of labels as the original database. The system is trained on 9 folds and tested on the left-out fold. This is repeated 10 times, each time leaving out a different fold. In the end, we obtain a classification accuracy. We repeated the above procedure 10 times, generating 10 accuracy figures for each binary classification task. Mean accuracy is reported in Fig. 6(a). For subject independent analysis, we employ a Leave-One-Subject-Out (LOSO) cross validation strategy. That is, the system is trained using the data associated with all the subjects but one and tested on the left-out subject. This is repeated until every subject has been kept as the test subject. Fig. 6(b) shows the average accuracy over all the subjects.

Firstly, it is clear that the use of a single image representation can provide high recognition accuracy. In 10-fold cross validation, the best result is obtained for the Happy/Anger binary classification, with accuracy over 95%. Certain classes are, however, more confusable in the visual domain, such as Sad/Fear or Sad/Surprise, with recognition accuracy around 78%. As expected, the subject independent results show lower performance but follow a similar trend. Later in this section, we compare results for multi-class classification with others' and show the superiority of the proposed framework. It is important to point out, though, that we have not tuned the SVM parameters nor have we used any feature selection technique, which often improves performance greatly. Our focus here is to demonstrate the usefulness of auditory cross-modal feedback for frame selection, which is also evident from the results.

Furthermore, to determine the statistical significance of the results, we performed an analysis of variance (ANOVA) to eliminate the effects of random fold selection on the results. The data under test are the binary classification accuracies over the ten 10-fold cross-validation experiments for the two different schemes described in Section III.B. The analysis allows a judgment on whether the results were significantly different between the mean image representation and the weighted mean image representation. Fig. 6(a) shows the significance test scores for the 15 binary classifiers.
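As a sketch, the per-pair significance test described above can be reproduced with a one-way ANOVA over the two sets of ten accuracies (mean image vs. weighted mean image); scipy is used here as an assumed stand-in for the authors' statistics tooling.

```python
# Illustrative one-way ANOVA over the two schemes' repeated-CV accuracies.
from scipy.stats import f_oneway

def compare_schemes(acc_mean_image, acc_weighted_mean):
    """Each argument: array of 10 accuracies (one per repeated 10-fold CV run).
    Returns the F statistic and p-value; p < 0.05 indicates significantly
    different mean accuracies for the two representations."""
    f_stat, p_value = f_oneway(acc_mean_image, acc_weighted_mean)
    return f_stat, p_value
```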

Using this analysis, we can conclude in a statistically significant way that exploiting the audio association, in most cases, improves the classification accuracy. The best improvement, of over 10%, is obtained for the Sad/Fear binary classification task when the auditory signal is utilized for the frame selection step. Only in two cases, Surprise/Disgust and Anger/Disgust, is the trend opposite, with a small drop in accuracy. A closer look at the results suggests that the emotion classes Fear and Happy show the most improvement. On the other hand, the emotion classes Disgust and Sad may not have benefited. This can be due to our rule-based weight assignment for these particular emotion classes. In particular, for the Sad class, which has a low arousal profile, regions corresponding to high intensity and pitch may not provide representative frames. This encourages us to learn such bimodal associations automatically from audio-visual data, which we will pursue in our future efforts.

Also, notice that the audio-assisted approach utilizes at most 4 × 200 ms = 800 ms worth of visual data corresponding to the four sub-segments described in Section III.B, while using all the frames requires, on average, 2.5 s worth of visual frame processing. Hence using cross-modal information reduces the visual computation cost by a factor of approximately three.

Finally, we perform multi-class classification and compare

the results with those of previously published work. Both of the proposed schemes show similar performance. We have already discussed the benefits of the weighted mean image earlier; hence, we present the results using that representation. For multi-class classification, we again used SVMs as classifiers. A linear kernel, pairwise multi-class discrimination and Sequential Minimal Optimization learning [37] are used. In the literature, average unweighted accuracy is often reported since it is a better metric than weighted accuracy, especially in the case of an unbalanced dataset. Hence, we report and compare unweighted accuracy. However, since the database in use has an approximately equal number of instances per class, unweighted and weighted accuracy are approximately the same. For a fair comparison, we have chosen studies which used the same database. We evaluated both subject independent and dependent methodologies.

For subject independent evaluation, we use the Leave-One-Subject-Out (LOSO) strategy to ensure strict speaker independence. Table I shows the confusion matrix for the six-class classification task. An unweighted average accuracy of over 47% is obtained. Paleari et al. [38] reported a best average accuracy of 32% for a visual-alone system using SVM and Neural Network (NN) classifiers. In fact, their best audio-visual system, using Bayesian fusion, showed 43% average accuracy.


TABLE I
MULTI-CLASS CLASSIFICATION ACCURACY (IN %): CONFUSION MATRIX FOR LEAVE-ONE-SUBJECT-OUT CROSS VALIDATION

A direct comparison, however, is difficult since their evaluation criterion is based on a 60% training and 40% testing split. We chose the LOSO strategy for ease of reproducibility, and it can serve as a baseline for future reference.

For subject dependent analysis, Mansoorizadeh and Charkari [39] used ten-fold cross validation on single-subject data and reported recognition performance averaged over all the subjects. An average accuracy of 37% is achieved using their visual-alone system, while their best bimodal system has 70% accuracy. The authors used SVMs as classifiers. A similar procedure using our proposed framework achieves over 64% average accuracy. However, given the small size of the database (5 emotion instances per subject per class), we suggest performing a 10-fold randomized cross validation using all subjects' data. Table II shows the confusion matrix for the six-class classification task. The improved average accuracy (from 47%) to over 62% suggests a subject dependency on the randomly chosen training and testing instances. Our visual-alone system is significantly better than that of [39], while lower than their bimodal system, as expected.

In another study, by Gajšek et al. [40], speaker dependent information is decoupled from the emotion instance at the feature extraction stage. Their video-based subsystem showed an average accuracy of 54.6%. The authors used an SVM as classifier. It is not clear whether the 5-fold cross validation used in their evaluation procedure has different subjects in the training and testing subsets. While there exist other studies that used the same database, they are either not clear on their evaluation criteria (subject dependent/independent), used just a part of the database, or do not report their single-modality subsystem performance. We utilized the whole database and presented both subject dependent and independent performances. Based on this comparative study, our proposed framework has shown promising results.

TABLE II
MULTI-CLASS CLASSIFICATION ACCURACY (IN %): CONFUSION TABLE FOR RANDOMIZED 10-FOLD CROSS VALIDATION

V. CONCLUDING REMARKS

Automatic analysis of human affective behavior has been extensively studied over the past several decades. Facial expression recognition systems, in particular, have matured to a level where automatic detection of a small number of expressions in posed and controlled displays can be done with reasonably high accuracy. Detecting these expressions in less constrained settings during spontaneous behavior, however, is still a challenging problem [41]. In recent years, an increasing number of efforts have been made to collect spontaneous behavior data in multiple modalities [42]. The research shift in this direction suggests utilizing multimodal data analysis approaches.

In this work, we presented a novel approach which explicitly models the cross-modal data association. We then investigated two different rule-based data association approaches. Our results showed that the use of audio data could improve the recognition performance in terms of computation cost (since in general visual processing is costlier than audio processing) as well as recognition accuracy. Unlike various data fusion strategies, our approach attempts to better represent the signal at the feature extraction level by weighting frames by their importance based on cross-relevance feedback for the task at hand, in this case facial expression recognition. We reported one-to-one binary classification results as well as multi-class classification results.

The best improvement in recognition accuracy for the binary classification task was over 10% and was statistically significant. However, for some expression recognition tasks the recognition accuracy was lower, which suggests that the rules used might not be suitable for the particular emotion class (in our experiment the Disgust class did not show much improvement). A comparative study of multi-class classification results shows significant improvement over previously


published approaches for the visual subsystem, for both subject dependent and independent analyses.

In our future efforts, we will explore data-driven approaches to learn the cross-modal relevance measure. The ability of our framework to incorporate these models in the early stages of signal processing has great potential for robust recognition performance. We will also incorporate the audio modality into the classification module to design a fully automatic audio-visual affect recognition system.

ACKNOWLEDGMENT

We thank the National Science Foundation, the U.C. Discovery Program and our industry partners for supporting this research. We would also like to thank the anonymous reviewers and editors for their constructive feedback and helpful suggestions.

REFERENCES

[1] R. W. Picard, Affective Computing. Cambridge, MA, USA: MIT Press, 1997.

[2] E. Murphy-Chutorian and M. M. Trivedi, “Head pose estimation and augmented reality tracking: An integrated system and evaluation for monitoring driver awareness,” IEEE Trans. Intell. Transp. Syst., vol. 11, no. 2, pp. 300–311, 2010.

[3] A. Doshi and M. M. Trivedi, “On the roles of eye gaze and head dynamics in predicting driver’s intent to change lanes,” IEEE Trans. Intell. Transp. Syst., vol. 10, no. 3, pp. 453–462, 2009.

[4] S. Shivappa, M. M. Trivedi, and B. Rao, “Audio-visual information fusion in human computer interfaces and intelligent environments: A survey,” Proc. IEEE, vol. 98, no. 10, pp. 1692–1715, Oct. 2010.

[5] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, “Audio-visual speech recognition,” Final Workshop 2000 Report, Center for Language and Speech Processing, The Johns Hopkins Univ., Baltimore, MD, USA, 2000.

[6] J. Macdonald and H. McGurk, “Visual influences on speech perception processes,” Attention, Percept., Psychophys., vol. 24, pp. 253–257, 1978.

[7] K. Munhall, J. A. Jones, D. E. Callan, T. Kuratate, and E. Vatikiotis-Bateson, “Visual prosody and speech intelligibility,” Psychol. Sci., vol. 15, no. 2, pp. 133–137, 2004.

[8] A. Tawari and M. M. Trivedi, “Speech emotion analysis: Exploring the role of context,” IEEE Trans. Multimedia, vol. 12, no. 6, pp. 502–509, Oct. 2010.

[9] Z. Zeng, M. Pantic, G. Roisman, and T. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 1, pp. 39–58, Jan. 2009.

[10] J. C. McCall and M. M. Trivedi, “Pose invariant affect analysis using thin-plate splines,” in Proc. Int. Conf. Pattern Recognition, 2004, vol. 3, pp. 958–964.

[11] C. Hu, Y. Chang, R. Feris, and M. Turk, “Manifold based analysis of facial expression,” in Proc. Conf. Computer Vision and Pattern Recognition Workshop, Jun. 2004, p. 81.

[12] S. Lucey, A. B. Ashraf, and J. F. Cohn, “Investigating spontaneous facial action recognition through AAM representations of the face,” in Face Recognition, Delac, 2007, pp. 275–286.

[13] M. Valstar, I. Patras, and M. Pantic, “Facial action unit detection using probabilistic actively learned support vector machines on tracked facial point data,” in Proc. Computer Vision and Pattern Recognition Workshop, 2005, vol. 0, p. 76.

[14] G. Zhao and M. Pietikäinen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, pp. 915–928, 2007.

[15] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, “Recognizing facial expression: Machine learning and application to spontaneous behavior,” in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, Jun. 2005, vol. 2, pp. 568–573.

[16] T. Wu, M. Bartlett, and J. Movellan, “Facial expression recognition using Gabor motion energy filters,” in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), Jun. 2010, pp. 42–47.

[17] S. Yang and B. Bhanu, “Facial expression recognition using emotion avatar image,” in Proc. IEEE Int. Conf. Automatic Face Gesture Recognition Workshops, Mar. 2011, pp. 866–871.

[18] M. Valstar, B. Jiang, M. Mehu, M. Pantic, and K. Scherer, “The first facial expression recognition and analysis challenge,” in Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition, Workshop on Facial Expression Recognition and Analysis Challenge, 2011.

[19] I. Cohen, N. Sebe, A. Garg, L. S. Chen, and T. S. Huang, “Facial expression recognition from video sequences: Temporal and static modeling,” Comput. Vision Image Understand., vol. 91, pp. 160–187, Jul. 2003.

[20] T. Kanade, J. Cohn, and Y. Tian, “Comprehensive database for facial expression analysis,” in Proc. 4th IEEE Int. Conf. Automatic Face and Gesture Recognition, 2000, pp. 46–53.

[21] Y. Tian, T. Kanade, and J. Cohn, “Facial expression analysis,” in Handbook of Face Recognition, S. Li and A. Jain, Eds. New York, NY, USA: Springer, 2005, pp. 247–276.

[22] J. F. Cohn and K. L. Schmidt, “The timing of facial motion in posed and spontaneous smiles,” J. Wavelets, Multi-Resolut., Inf. Process., vol. 2, pp. 1–12, 2004.

[23] N. Sebe, M. Lew, I. Cohen, Y. Sun, T. Gevers, and T. Huang, “Authentic facial expression analysis,” in Proc. Int. Conf. Automatic Face and Gesture Recognition, May 2004, pp. 517–522.

[24] D. Datcu and L. Rothkrantz, “Semantic audio-visual data fusion for automatic emotion recognition,” in Proc. Euromedia’2008, Porto, J. Tavares and R. N. Jorge, Eds. Ghent, Belgium: Eurosis, Apr. 2008, pp. 58–65.

[25] Y. Wang and L. Guan, “Recognizing human emotional state from audiovisual signals,” IEEE Trans. Multimedia, vol. 10, no. 5, pp. 936–946, Aug. 2008.

[26] Z. Zeng, J. Tu, B. Pianfetti, M. Liu, T. Zhang, Z. Zhang, T. Huang, and S. Levinson, “Audio-visual affect recognition through multi-stream fused HMM for HCI,” in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, Jun. 2005, vol. 2, pp. 967–972.

[27] M. Song, C. Chen, and M. You, “Audio-visual based emotion recognition using tripled hidden Markov model,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), May 2004, vol. 5, pp. 877–880.

[28] A. Tawari and M. M. Trivedi, “Audio-visual data association for face expression analysis,” in Proc. Int. Conf. Pattern Recognition, 2012.

[29] J. Saragih, S. Lucey, and J. Cohn, “Face alignment through subspace constrained mean-shifts,” in Proc. Int. Conf. Computer Vision, 2009, pp. 1034–1041.

[30] T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, 2002.

[31] V. Ojansivu and J. Heikkilä, “Blur insensitive texture classification using local phase quantization,” in Image and Signal Processing, A. Elmoataz, O. Lezoray, F. Nouboud, and D. Mammass, Eds., 2008, vol. 5099, pp. 236–243.

[32] B. Jiang, M. Valstar, and M. Pantic, “Action unit detection using sparse appearance descriptors in space-time video volumes,” in Proc. IEEE Int. Conf. Automatic Face Gesture Recognition Workshops, Mar. 2011, pp. 314–321.

[33] A. Tawari and M. M. Trivedi, “Speech emotion analysis in noisy real world environment,” in Proc. Int. Conf. Pattern Recognition, 2010.

[34] A. Tawari and M. Trivedi, “Speech based emotion classification framework for driver assistance system,” in Proc. Intelligent Vehicles Symp. (IV), 2010, pp. 174–178.

[35] P. Boersma, “Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound,” in Proc. Inst. Phonetic Sciences, vol. 17, 1993, pp. 97–110.

[36] O. Martin, I. Kotsia, B. Macq, and I. Pitas, “The eNTERFACE’05 audio-visual emotion database,” in Proc. 22nd Int. Conf. Data Engineering Workshops, IEEE Computer Society, 2006, p. 8.

[37] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., ser. Morgan Kaufmann Series in Data Management Systems. San Mateo, CA, USA: Morgan Kaufmann, Jun. 2005.


[38] M. Paleari, R. Benmokhtar, and B. Huet, “Evidence theory-based multimodal emotion recognition,” in Advances in Multimedia Modeling, ser. Lecture Notes in Computer Science, B. Huet, A. Smeaton, K. Mayer-Patel, and Y. Avrithis, Eds. Berlin/Heidelberg, Germany: Springer, 2009, vol. 5371, pp. 435–446.

[39] M. Mansoorizadeh and N. Moghaddam Charkari, “Multimodal information fusion application to human emotion recognition from face and speech,” Multimedia Tools Applicat., vol. 49, pp. 277–297, 2010.

[40] R. Gajšek, V. Štruc, and F. Mihelič, “Multimodal emotion recognition based on the decoupling of emotion and speaker information,” in Text, Speech and Dialogue, ser. Lecture Notes in Computer Science, P. Sojka, A. Horák, I. Kopecek, and K. Pala, Eds. Berlin/Heidelberg, Germany: Springer, 2010, vol. 6231, pp. 275–282.

[41] A. Tawari and M. M. Trivedi, “Audio visual cues in driver affect characterization: Issues and challenges in developing robust approaches,” in Proc. Int. Joint Conf. Neural Networks, 2011, pp. 2997–3002.

[42] A. Tawari, C. Tran, A. Doshi, T. Zander, and M. Trivedi, “Distributed multisensory signals acquisition and analysis in dyadic interactions,” in Proc. 2012 ACM Annu. Conf. Extended Abstracts on Human Factors in Computing Systems (CHI EA), 2012, pp. 2261–2266.

Ashish Tawari received his B.Tech. degree in electrical engineering at the Indian Institute of Technology, Bombay, India, in 2006, and the M.S. degree from the University of California, San Diego (UCSD), in 2010. He is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at UCSD. He served as a DSP Engineer at Qualcomm (India) during 2006–2008 and interned at Qualcomm Inc., San Diego, CA, with the Multimedia R&D Speech team during summer 2010. His research interests lie in the areas of multimodal signal processing, machine learning, speech and audio processing, and computer vision. Mr. Tawari is a recipient of the UCSD Powell Fellowship 2008–2011. His thesis proposal, advised by Mohan Trivedi, received an honorable mention at the IEEE Intelligent Vehicles Symposium 2010 Ph.D. Forum.

Mohan Manubhai Trivedi (F’08) is a Professor of electrical and computer engineering and the Founding Director of the Computer Vision and Robotics Research Laboratory and the Laboratory for Intelligent and Safe Automobiles (LISA) at the University of California, San Diego. He and his team are currently pursuing research in machine and human perception, multimodal interfaces and interactivity, machine learning, intelligent vehicles, driver assistance and active safety systems. He is serving on the Board of Governors of the IEEE Intelligent Transportation Systems Society and on the Editorial Board of the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS. He serves as a consultant to industry and government agencies in the U.S. and abroad, including various government agencies, major auto manufacturers and research initiatives in Asia and Europe. He is a co-author of a number of papers winning “Best Papers” awards. Two of his students were awarded Best Dissertation Awards by the IEEE ITS Society (Dr. Shinko Cheng 2008 and Dr. Brendan Morris 2010), and his advisee Dr. Anup Doshi’s dissertation was selected as the UCSD entry and judged among the five finalists in the 2011 dissertation competition of the Western (USA and Canada) Association of Graduate Schools. He has received the Distinguished Alumnus Award from Utah State University, the Pioneer Award (Technical Activities) and the Meritorious Service Award from the IEEE Computer Society. He has given over 65 Keynote/Plenary talks at major conferences. He is a Fellow of the IEEE (“for contributions to Intelligent Transportation Systems field”), Fellow of the IAPR (“for contributions to vision systems for situational awareness and human-centered vehicle safety”), and Fellow of the SPIE (“for distinguished contributions to the field of optical engineering”).