Facial Expression Recognition in the Wild: The Inﬂuence of ... · facial expression recognition in the wild is a more challenging task. Consequently, performance of facial expression

Facial Expression Recognition in the Wild:The Influence of Temporal

Information

Steve Nowee10183914

Bachelor thesisCredits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie

University of AmsterdamFaculty of ScienceScience Park 904

1098 XH Amsterdam

Supervisors

Prof. dr. T. Gevers

Informatics InstituteFaculty of ScienceUniversity of AmsterdamScience Park 9041098 XH Amsterdam

Dr. R. Valenti

Informatics InstituteFaculty of Science

University of AmsterdamScience Park 904

1098 XH Amsterdam

Friday 27th June, 2014

Acknowledgements

I would like to thank Prof. dr. Theo Gevers and Dr. Roberto Valenti for agreeingto supervise my thesis and helping me come up with an interesting research topic.

Also, I would like to thank Sightcorp for the use of their facial expressionrecognition software CrowdSight.

Abstract

In this thesis, the influence of temporal information in the task of fa-cial expression recognition in the wild is studied. To investigate this influ-ence, several experiments have been conducted using three different methodsof classification. More specifically, static classification using Naive Bayesand dynamic classification using Conditional Random Fields and Latent Dy-namic Conditional Random Fields. These classifiers have been applied totwo types of features, extracted from the Acted Facial Expression in Wilddata set [Dhall et al., 2012]. These features were static Action Unit (AU)intensity values and spatio-temporal Local Binary Patterns on Three Orthog-onal Planes (LBP-TOP). The highest achieved accuracy was 38.01%, 4%higher than the baseline, using the LDCRF classification on the LBP-TOPfeatures. Comparing the performance of this dynamic classifier with that ofthe static Naive Bayes classification yields an improvement in accuracy of30%. Furthermore, by comparing the performance of the LDCRF on theAU intensity features with the performance of the LDCRF on the LBP-TOPfeatures, it was found that the use of the spatio-temporal LBP-TOP featuresresulted in an improvement in accuracy of 65%. Thus, showing a positive in-fluence of temporal information on the task of facial expression recognitionin the wild.

i

Contents

1 Introduction 1

2 Related Work 2

3 Methodology 43.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3.1.1 Action Unit Intensity . . . . . . . . . . . . . . . . . . . . 43.1.2 Local Binary Patterns on Three Orthogonal Planes . . . . 5

3.2 Methods of Classification . . . . . . . . . . . . . . . . . . . . . . 83.2.1 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . 83.2.2 Conditional Random Field . . . . . . . . . . . . . . . . . 93.2.3 Latent Dynamic Conditional Random Field . . . . . . . . 11

4 Data Set 12

5 Experimental Results 135.1 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145.2 Conditional Random Field . . . . . . . . . . . . . . . . . . . . . 155.3 Latent Dynamic Conditional Random Field . . . . . . . . . . . . 175.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

6 Conclusion 206.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

References 23

Appendices 23

A Results: Aligned face image AU intensity features with CRF 23

B Results:Aligned face image AU intensity features with LDCRF. 24

ii

1 Introduction

In recent years, automatic facial expression recognition has been an increas-ingly popular field of study. A accurate classification of how a person feels canhelp make the interaction between a user and a system, human-computer interac-tion (HCI), more natural and pleasant [Picard et al., 2001]. Such a system wouldhave a better understanding of the needs of a user. This is also called affective HCI,interpreting and acting on the affect of the user. The affect is also reffered to as theexperience of feeling or as emotion.

The form of human affect that can be employed in HCI is the emotional stateor mood of a person. This emotional state can be determined through differentmeans, for example through the use of linguistic or acoustic data. In this thesis,the focus is on the use of facial expressions for determining the emotional state ofa person. Facial expressions are a major part of human non-verbal communica-tion, which are often used to show one’s emotional state. Research on automaticfacial expression recognition has been conducted and has resulted in a high accu-racy of recognition [Cohen et al., 2003, Jain et al., 2011]. However, most of thisresearch occurred using unrealistic data sets that consist of unnatural and lab con-trolled facial expressions, instead of using facial expressions as they would occurin reality, also referred to facial expressions in the wild. Facial expressions in thewild are less controlled than those in the unrealistic data sets and because of this,facial expression recognition in the wild is a more challenging task. Consequently,performance of facial expression recognition in the wild is overall lower than thatof facial expression recognition using the lab controlled data [Gehrig and Ekenel,2013]. However, in order to create applications and software that can be utilizedin situations in a complex world, a facial expression recognition system based onsuch unrealistic data will not suffice. Thus, the performance of facial expressionrecognition in the wild should be increased as to create applications that will func-tion appropriately in the real world.

It has been discovered that dynamic facial expressions elicit a higher perfor-mance of emotion recognition by humans than static facial expressions [Alves,2013]. Also, neuroimaging and brain lesion studies provide evidence for two dis-tinct mechanisms, one for analysing static facial expressions and one for dynamicfacial expressions. Thus, it seems that the use of temporal analysis of facial expres-sions is employed by humans. Also, in the facial expression recognition researchconducted using the unrealistic data, it was found that the use of temporal infor-mation and temporal classification methods achieved a high performance. Thesemethods, using temporal analysis, have not been widely used in research of facialexpression recognition in the wild, however. In this thesis I propose the use ofsuch temporal analysis on facial expression recognition in the wild, to discover ifit has a positive influence on performance or not. In this work, the performance ofstatic and dynamic classification methods will be compared and analyzed. Thesemethods will be applied to action unit (AU) intensity features and Local BinaryPatterns on Three Orthogonal Planes (LBP-TOP) features, extracted from a data

1

set of facial expressions in the wild.The thesis is structured as follows. Firstly, related research will be discussed in

Section 2. This is followed by the used features and employed methods of classifi-cation, which will be explained in Section 3. In Section 4, the utilized data set willbe clarified. The following section, Section 5, will present the conducted experi-ments and will discuss their results. Lastly, Section 6 will conclude the findings ofthis thesis and discuss possible future work.

2 Related Work

To describe facial expressions in movements of the face or action units (AUs),Ekman and Friesen [1978] developed the Facial Action Coding System (FACS).Action units are fundamental facial movements, caused by muscular contraction orrelaxation within the face. The process of recognizing facial expressions accordingto FACS is performed on static images of facial expressions at their peak. How-ever, working with FACS is a manual form of facial expression recognition and isthus a laborious task.

After the development of FACS by Ekman and Friesen, image and video pro-cessing started to be utilized in the analysis of facial expressions. In this, points onthe face are tracked and extracted from images and videos. Also, the intensity ofdisplayed AUs can be extracted. Using this data, patterns can be sought for a facialexpression. For example, Cohen et al. [2003] proposed two different approachesfor facial expression classification using image and video processing. One of theseapproaches was static classification. Static, in this sense, means on a single frame.In this approach Bayesian networks were employed, such as a Naive-Bayes clas-sifier and a Gaussian Tree-Augmented Naive Bayes classifier (TAN). The otherproposed approach used dynamic classification, which entails that it is not per-formed on a single frame, but also takes earlier information into account. In thismanner, the temporal information of a displayed facial expression is included inthe classification process. The proposed dynamic approach employed a multi-levelHidden Markov Model classifier. Cohen et al. found that the TAN achieved thehighest accuracy. However, it was concluded that if there is not sufficient trainingdata, this approach will become unreliable and the Naive Bayes classifier becomesa more reliable option.

Jain et al. [2011] proposed using other features than AUs. They proposed us-ing facial shape and facial appearance, with static and dynamic approaches of clas-sification. In this, the facial shape was represented by landmark points aroundcontours in the face, e.g. eyebrows, eyes and lips. To extract these shapes, Gen-eralized Procrustes Analysis (GPA) was applied, after which the dimensionalityof the features was reduced by Principal Component Analysis (PCA). The facialappearances were represented by applying the Uniform Local Binary Pattern (U-LPB) method. The dimensionality of the facial appearance features was reducedby applying PCA as well. For the classification of the facial expressions, Jain et

2

al. employed Support Vector Machines (SVM), Conditional Random Fields (CRF)and Latent-Dynamic Conditional Random Fields (LDCRF). Of these methods, theSVM is static, whereas the CRF and LDCRF are dynamic methods of classifica-tion. CRFs are discriminative models, defining a conditional probability over labelsequences given a set of observations, and are similar to HMMs. The ExtendedCohn-Kanade data set (CK+) was utilized in each of the conducted experiments.It was discovered that CRFs are a valid indicator for transitions between facial ex-pressions. However, subtle facial movement classification did not achieve a highperformance when employing CRFs. To increase the performance of recognizingsubtle facial movement, Jain et al. proposed to employ LDCRFs. In LDCRFs, a setof hidden latent variables is included between the observation sequences and theexpression labels. It was discovered, that the use of the shape features resulted inan overall higher performance than in using the appearance features. As expected,the dynamic methods of classification, CRFs and LDCRFs, achieved a higher per-formance than the static SVM. Of the two dynamic methods, LDCRF classificationachieved a higher accuracy of recognizing facial expressions than when employingCRF classification.

The previously described works all utilized data sets that were acquired in a‘lab controlled’ recording environment. This means that the subjects face the cam-era directly and show obvious expressions, either acted or genuine. However, inhuman-human interaction, such conditions are no necessity. More naturally dis-played facial expressions, however, are classified with less accuracy by a computer.A more robust recognition of facial expressions in the real world might be achievedby training on a data set with more realistic conditions and in a more realistic en-vironment. One of such data sets is the Acted Facial Expression in Wild (AFEW)data set [Dhall et al., 2012]. This data set consists of video fragments, extractedfrom movies. In these fragments, the subjects do not always face the camera di-rectly, they might not keep their head still and the expressions might vary in theirlevel of intensity. This makes the conditions for extracting data, such as trackingpoints and action units, more complicated. The baseline for this data set, achievedby applying a non-linear SVM to feature vectors consisting of Local Binary Pat-terns on Three Orthogonal Planes (LBP-TOP) values, lies at 34.4%.

The AFEW data set from the challenge of 2013 has been used by Gehrig andEkenel [2013] to analyse the use of several features and methods of classifica-tion. The features that were used are Local Binary Patterns (LBP) features, Gaborfeatures, Discrete Cosine Transform (DCT) and AU intensities. The classifiers em-ployed using these features were a Nearest Neighbor classifier, a Nearest Meansclassifier and SVMs with linear, polynomial and Radial Basis Function kernels. Itwas discovered that the highest performance was achieved by employing a SVMwith an RBF kernel on the LBP-TOP features, which resulted in a performanceequal to the baseline of the data set of 2013, which was equal to 27.27%. Also,a human evaluation has been performed and Gehrig and Ekenel discovered thathuman classification achieves only 52.63% accuracy with a data set such as theAFEW.

3

3 Methodology

The standard pipeline of automatic facial expression recognition consists of:detecting faces in visual data, representing these faces as a set of features andusing these representations to compute some dependencies between a facial ex-pression and its features. Within such a pipeline, different features and classifierscan be employed. The following subsections are allotted to clarify the features andclassifiers that have been used in this thesis.

3.1 Features

In facial expression recognition using image processing, the complete imagesare usually not used. Instead, the most characteristic information that represents animage, the features, is used. The use of such features scales down the size of useddata, because only a relatively small amount of values is used per image. The fea-tures that have been used in this thesis are Action Unit Intensity and Local BinaryPatterns on Three Orthogonal Planes. Both of these features will be explained inmore detail in the following subsections.

3.1.1 Action Unit Intensity

As mentioned in Section 2, action units (AUs) are fundamental movements ofthe face that are caused by contractions and relaxations of facial muscles. Theactivity or intensity of these AUs can represent certain facial expressions, as de-scribed by the FACS [Ekman and Friesen, 1978]. Some examples of AUs can beseen in the left image of figure 1 and an example of how a facial expression can bedecomposed into AUs with a certain intensity is shown in the right image of thatfigure.

Figure 1: Left: Examples of action units. Right: Example of facial expressiondecomposition in AUs.

For this thesis, facial expression recognition software called CrowdSight1 has1Software by: Sightcorp B.V. www.sightcorp.com

4

www.sightcorp.com

been used. This software detects faces in an image or a video frame and extractsintensities from each of these detected faces, for nine AUs which can be found intable 1. The intensity values extracted from the AFEW data set, range from -315to 210. Also, information about the head poses of the subjects can be retrieved.This information consists of pitch, roll and yaw values in radians, which cover thethree degrees of freedom of the head. During the process of extracting the AUintensities, only faces with a yaw value between -15 degrees and 15 degrees wereused to ensure a measure of reliability of the extracted AU intensities. Because,if this precaution would not be taken, parts of the to be analyzed faces would berotated out of view and no AU intensity values could be extracted from these partsof the faces. Furthermore, AU intensities have only been extracted from videos thatconsist of ten or more frames, because videos with less than ten frames are likelyto be uninformative.

AU Name Facial Muscle

1 Inner Brow Raiser Frontalis, pars medialis2 Outer Brow Raiser Frontalis, pars lateralis4 Brow Lowerer Corrugator supercilii, Depressor supercilii9 Nose Wrinkler Levator labii superioris alaquae nasi12 Lip Corner Puller Zygomaticus major15 Lip Corner Depressor Depressor anguli oris20 Lip Stretcher Risorius w/ platysma23 Lip Tightener Orbicularis oris28 Lip Suck Orbicularis oris

Table 1: The nine AUs extracted by CrowdSight.

AU intensities have been extracted from both the video fragments of the AFEWdata set and the aligned faces per frame, per video fragment of the AFEW data set(see Section 4). It was expected that applying CrowdSight to the aligned face im-ages would yield more accurate AU intensity values and thus a more accurate clas-sification of facial expressions, because the aligned face images are already givenand no face detection is necessary. In this process of face detection on the videofragments, wrongfully detected faces might occur. The results of both extractionmethods will be discussed in Section 5.

3.1.2 Local Binary Patterns on Three Orthogonal Planes

In order to explain the Local Binary Patterns on Three Orthogonal Planes(LBP-TOP) feature, the standard Local Binary Patterns (LBP) feature has to beexplained as well. LBP features are used to describe spatial information of animage. In extracting LBP features the whole image is used and encoded into a his-togram, instead of using only a small set of keypoints of an image. The procedure

5

that achieves the encoding of the images functions as follows. For each pixel inan image or video frame, the surrounding eight pixels will be thresholded. If thepixel value of a surrounding pixel is higher than the threshold, it will be equivalentto a one and if its value is lower it will be equivalent to a zero. These eight binaryvalues form a binary pattern, by concatenating the values clockwise. The processof thresholding a pixel’s surrounding pixels can be seen in figure 2.

Figure 2: Example of forming a binary pattern for one pixel (LBP).

The result will be a binary pattern for each pixel in an image. This set of binarypatterns can be turned into a histogram by summing all similar occurrences of bi-nary patterns. There are two often used methods of computing this histogram. Oneoption is to sum all similar occurrences of the binary patterns over the whole im-age. More often, however, the image is divided into sub-images and the binarypatterns of each sub-image are summed to a histogram. Subsequently, all of the re-sulting histograms are concatenated into one histogram. This method is visualizedin figure 3.

Figure 3: Example of computing an LBP histogram, by concatenating histogramsof sub-images.

The LBP-TOP feature is an extension of the LBP feature. As noted before, de-scribing images using LBP features uses spatial information. In other words, eachframe is analyzed in the spatial (X,Y)-space. The LBP-TOP features, however,also incorporate temporal information with the spatial information. Instead of an-alyzing single frames, a set of frames (a video) is analyzed in the spatio-temporal(X,Y,T)-space. In this space, X and Y still denote spatial coordinates and T now de-notes the temporal coordinates, or frame index. There are three orthogonal planesin this (X,Y,T)-space, which are the XY plane, the XT plane and the YT plane.The XY plane represents the standard spatial information, whereas the XT and theYT planes respectively represent changes in the row values and changes in the col-umn values in time. From each of these three planes, LBP features are extracted

6

and turned into a histogram, as explained in the previous paragraph. Then, lastly,the three resulting histograms are concatenated to form a single histogram. Thisprocess, of extracting LBP-TOP features, is shown in figure 4.

Figure 4: Example of the process of extracting LBP-TOP features.

The LBP-TOP features used in the experiments of this thesis were extractedby dividing the frames of each of the three orthogonal planes into sixteen non-overlapping blocks (4×4). From each of these individual blocks, the binary pat-terns were created and the histograms were computed. The computation of thesehistograms occurred using uniformity of the binary patterns. A binary pattern isuniform if it consists of at most two contiguous regions. The eight surroundingpixels, one binary value each, resulted in 256 possible binary patterns. Withinthese possibilities only 58 patterns are uniform, each of which is given an indi-vidual label. All non-uniform binary patterns are labelled using one unique label.This results in 59 labels and thus 59 bins per histogram. Per orthogonal plane, thehistograms of the sixteen blocks were concatenated, resulting in three histograms.Lastly, these three histograms, one for each orthogonal plane, were concatenated,forming one feature vector per video. Each of these feature vectors consisted of2832 values (16× 59× 3).

To investigate the influence of temporal information, different methods of clas-sification have been applied to the features that have been described in the previoussubsections. This thesis will continue by describing said methods of classification.

7

3.2 Methods of Classification

In machine learning, classification can be defined as the procedure of catego-rizing observations based on their characteristics, or features. The three methodsof classification that have been used for this thesis originate from a probabilisticmodelling of data. These methods will be discussed in the following subsections.

3.2.1 Naive Bayes

The Naive Bayes classifier is a generative probabilistic method of classifying,using Bayes’ theorem. When using a Naive Bayes classifier, a strong independenceis assumed between the sequences of features. Because of this assumed indepen-dence, any actual relation the sequences of features may have is neglected. For thatreason, the Naive Bayes classifier is called naive. However, this same indepen-dence makes the Naive Bayes classifier a simple and fast method of classification.

As noted above, Naive Bayes classification is based on probability. Morespecifically, the conditional probability of a class given a set of features. Thisprobability can be written as:

p(C|F1, . . . , Fn) (1)

in which C denotes the class or label and F1, . . . , Fn denote the observed features.This can be rewritten using Bayes’ theorem, resulting in:

p(C|F1, . . . , Fn) =p(C)p(F1, . . . , Fn|C)

p(F1, . . . , Fn). (2)

Since all feature values Fi are known the denominator is a constant, thus only thenumerator is of importance. This numerator is equivalent to the joint probability ofC and all features Fi, which can be rewritten using the chain rule:

p(C,F1, . . . , Fn) = p(C)p(F1, . . . , Fn|C)= p(C)p(F1|C)p(F2, . . . , Fn|C,F1)

= p(C)p(F1|C)p(F2|C,F1) . . . p(Fn|C,F1, F2, . . . , Fn−1)

(3)

However, since the Naive Bayes classifier assumes independence between its fea-tures, each of the conditional probabilities of a feature will only depend on theclass C, for example p(Fi|C,Fj , Fk, Fl) = p(Fi|C). For that reason, the joint

8

probability can be simplified to:

p(C|F1, . . . , Fn) =1

Zp(C,F1, . . . , Fn)

=1

Zp(C)p(F1|C)p(F2|C)(Fn|C)

=1

Zp(C)

n∏i=1

p(Fi|C)

(4)

in which Z denotes the probability of a sequence of features p(F1, . . . , Fn), whichwas a constant since these features are known.

Despite the simplicity and ‘naivety’ of the Naive Bayes classifier, it generallyachieves a high accuracy in a wide variety of classification tasks. It has been con-cluded in the past, that Naive Bayes classification competed with state-of-the-artdecision tree classifiers of that time [Langley et al., 1992] and are still compet-ing with state-of-the-art now. Thus, one of the main reasons for using Naive Bayesclassification for recognizing facial expressions, is its simplicity and relatively highperformance. The other reason for using Naive Bayes classification in this thesis,is because of its independence assumptions. The assumed independence betweenthe features make Naive Bayes classification a static classification. Frames will beanalyzed apart from each other and there will be no dependency between the datain separate frames. By comparing the results from temporal classification methodswith the results from Naive Bayes classification, the effect of the temporal classifi-cation can be examined.

3.2.2 Conditional Random Field

A conditional random field (CRF) is an undirected graphical model that is con-ditioned on observation sequencesX and is often used for segmenting and labelingstructured data. For example, CRFs have been used to segment and label docu-ments of text, either handwritten or machine printed, with higher precision thanthe use of Neural Networks and Naive Bayes classification [Shetty et al., 2007].Also, it has been found that CRFs perform with higher accuracy than HiddenMarkov Models (HMMs) in activity recognition on temporal data [van Kasterenet al., 2008]. An HMM is a statistical Markov model with unobserved (hidden)states and can be defined as a dynamic Bayesian network.

An example of the structure of a CRF can be seen in figure 5. In this figure,one can see that the nodes of the CRF graph can be divided into the observationsX = {x1, . . . , xn} and the labels Y = {y1, . . . , yn}. On these two sets, X and Y ,the conditional distribution p(Y |X) is modelled.

9

Figure 5: Example of a CRF, in which xi denote the observation sequences and yidenote the label sequences.

Theoretically, a CRF can be motivated from a sequence of Naive Bayes classi-fiers. The result of putting Naive Bayes classifiers in sequence, is an HMM. Like aNaive Bayes classifier, an HMM is also generative. The counterpart of a generativeclassifier is a discriminative classifier. Where a generative classifier is based on amodel of joint distribution p(y, x), a discriminative classifier is based on a modelof conditional distribution p(y|x). The discriminative counterpart of the genera-tive HMM is the CRF. The modelling of the conditional distribution of a CRF isachieved with the following equation [Sutton and McCallum, 2006]:

p(Y |X) =1

Z(X)exp

{K∑k=1

λkfk(yt, yt−1, Xt)

}, (5)

in which Z(X) denotes a normalization function:

Z(X) =∑Y

exp

{K∑k=1

λkfk(yt, yt−1, Xt)

}. (6)

To train a CRF, its parameters or weights, θ = {k}, have to be estimated. Toestimate these parameters, training data D = {X(i), Y (i)N

i=1} is used. In this dataeach X(i) = {x(i)1 , x

(i)2 , . . . , x

(i)T } is an input sequence of observations and each

Y (i) = {y(i)1 , y(i)2 , . . . , y

(i)T } is an output sequence of labels. The estimation is

performed by maximizing a conditional log likelihood:

l(θ) =

N∑i=1

logp(Y (i)|X(i)). (7)

The method used to optimize l in this thesis, is called Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [Bertsekas, 1999, Byrd et al., 1994]. NormalBFGS optimization computes an approximation of the Hessian, because the use ofthe full Hessian is not practical due to its quadratic number of parameters. TheHessian is a matrix of second-order partial derivatives of a function and describesthe curvature of that function, making it useful for optimization. Even though thenormal BFGS uses an approximation of this Hessian, it still requires quadratic

10

size. The L-BFGS method, instead of storing the full and dense approximation ofthe Hessian, only stores several vectors representing the approximations implic-itly. Storing vectors instead of full matrices, the memory requirement of L-BFGSbecomes linear, which makes L-BFGS a good method when a large amount of fea-tures is used. For a full introduction to CRFs, please see [Sutton and McCallum,2006].

In tasks such as activity recognition or facial expression recognition, often, dy-namic methods such as CRFs are used. Facial expressions in videos, for example,cannot be fully explained by features from each separate static frame of that video.Instead, some interdependency between the features in the frames should be takeninto account. A method such as Naive Bayes, discussed in Section 3.2.1, doesnot take such interdependency into account. CRFs, however, do take this interde-pendency into account. This fact makes CRFs a good method of classification forfacial expressions.

3.2.3 Latent Dynamic Conditional Random Field

As Jain et al. [2011] stated, CRFs are a reliable method to model transitions be-tween facial expressions. However, the distinction between similar facial expres-sions is made by subtle changes. In the detection of these subtle changes, CRFsare less reliable. For that reason, Latent Dynamic Conditional Random Fields (LD-CRFs) were proposed. The structure of an LDCRF is similar to that of a CRF. Only,an LDCRF has an added layer of hidden states, or latent states. This structure canbe seen in figure 6. In this figure, as with CRFs, the X = {x1, x2, . . . , xn} denoteobservation sequences and the Y = {y1, y2, . . . , yn} denote the labels. The hiddenstates are denoted by H = {h1, h2, . . . , hn}

Figure 6: Example of an LDCRF, in which xi denote the observation sequences, yidenote the label sequences and hi denote the latent states.

The conditional probability of labels, Y , given observations, X , for an LDCRFis similar to that of a CRF. However, the hidden states H = {h1, h2, . . . , hn} have

11

to be incorporated. This results in the following conditional probability:

p(Y |X) =∑H

p(Y |H)p(H|X). (8)

The hidden states of an LDCRF are subject to a restriction: The sets of hiddenstates connected to different labels should be disjoint. In other words, if a hiddenstate is connected to label yi, it cannot also be connected to yj , where i 6= j. Thisrestriction can also be represented by the following relationship:

p(Y |H) =

{1 ∀Hm ∈ HY m

0 otherwise(9)

Using this restriction, the conditional probability equation can be simplified:

p(Y |X) =∑

H:∀Hm∈HY m

p(H|X). (10)

This p(H|X) can be written in the same form as equation (5) in Section 3.2.2,resulting in:

p(H|X) =1

Z(X)exp

{K∑k=1

λkfk(ht, ht−1, Xt)

}, (11)

with Z(X) written as:

Z(X) =∑H

exp

{K∑k=1

λkfk(ht, ht−1, Xt)

}. (12)

As with the training of a CRF model, a parameter estimation must be performed totrain an LDCRF model. To perform this parameter estimation, again the optimiza-tion algorithm L-BFGS, explained in 3.2.2, was used.

The three discussed methods of classification have been applied to the afore-mentioned features. These features have been extracted from the AFEW data set,which will be clarified in the following section.

4 Data Set

For all of the experiments in this thesis, the AFEW data set [Dhall et al., 2012]has been used. As mentioned before, this data set consists of video fragments ofseven different facial expressions, extracted from 54 movies. These seven facialexpressions are: Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise.Also, the AFEW data set is used in a facial expression recognition in the wild

12

challenge: EmotiW2. These video fragments were obtained in .avi format and havebeen converted to .mp4 format, in order for the CrowdSight software to be able toprocess the video fragments. Since the video fragments are extracted from movies,the facial expressions shown are in a close to real-world environment. The actorsin the video fragments move around, move their heads and show facial expressionswith a varying level of intensity. Also, the fragments in the data set have a widerange of illumination conditions, making the environment more natural than labcontrolled environments. The data set was divided into a train set and a test set.The number of fragments per facial expression and per train or test set can be seenin table 2.

Number of frames per expression

Anger Disgust Fear Happiness Neutral Sadness Surprise Total

Train set 92 66 67 108 104 84 57 578Test set 64 40 46 63 63 61 46 383

Table 2: Number of video fragments per facial expression, per train or test set, inthe AFEW data set.

Apart from the video fragments, sorted by facial expression, images of thealigned faces for each frame, landmark points for each face in each frame andLBP-TOP features were included for each video fragment.

5 Experimental Results

Using the features described in Section 3.1, several experiments have beenconducted to investigate the influence of temporal analysis in facial expressionrecognition. To show this influence, the static Naive Bayes classification will becompared with the dynamic CRFs and LDCRFs. Also, all three of these methodswill be compared to the baseline of the AFEW data set, which is 34.4%.

In these experiments, the AU intensity data has been used in two different man-ners. The first approach was simply using the data as it is: AU intensity values perframe. This leads to a frame-by-frame classification of facial expressions. How-ever, one can also argue that if the majority of frames in a video is classified cor-rectly, the whole video is classified correctly. Both the frame-by-frame and themajority of frames per video methods of evaluating have been used. For the sec-ond manner of using the AU intensity data, the median per AU per video wascalculated. This resulted in nine median values per video, one for each AU. Theclassification using these median values resulted in one label per video, instead ofone for each frame. Lastly, the LBP-TOP feature vectors of the AFEW data sethave been used without any further processing.

2http://cs.anu.edu.au/few/emotiw2014.html

13

http://cs.anu.edu.au/few/emotiw2014.html

In the following subsections the results of the three methods of classificationfound in Section 3.2 are presented. This is followed by a paragraph that discussesthese results in more detail.

5.1 Naive Bayes

The experiments using the Naive Bayes classifier have been conducted usingbuilt-in Matlab functions of the NaiveBayes class3, both to train and to test themodels. Also, the duration of the training and testing procedures has been com-puted. The results of the Naive Bayes classifier and the duration of training andtesting are presented in table 3. The results are given for the AU intensity features,extracted from both the video fragments and from the aligned face images, and forthe LBP-TOP features.

Time (s)

Training Testing Accuracy (%)

Video AU Int. Median 0.00833 0.00629 25.17Video AU Int. Frame-by-Frame 0.0167 0.0195 24.69

Image AU Int. Median 0.00849 0.00502 19.11Image AU Int. Frame-by-Frame 0.0322 0.0354 19.53

LBP-TOP 0.338 0.220 29.38

Table 3: Naive Bayes results on video and image extracted AU intensity featuresand LBP-TOP features.

There is an obvious difference in performance between the AU intensities ex-tracted from the video fragments and those extracted from the aligned face images.The results of the AU intensity features extracted from the aligned face images arelower, even though it was expected that more reliable AU intensity values would beextracted using these images. Possible reasons for this difference will be discussedin Section 5.4.

Neither the performance of the AU intensity features nor the performance of theLBP-TOP features surpass the baseline of 34.4%, achieved with a Support VectorMachine on the LBP-TOP features. The resulting confusion matrices for the AUintensity median features, extracted from the video fragments, and the LBP-TOPfeatures can be seen in tables 4 and 5 respectively.

3http://www.mathworks.nl/help/stats/naivebayes-class.html

14

http://www.mathworks.nl/help/stats/naivebayes-class.html

An Di Fe Ha Ne Sa SuAn 20 10 3.3 23.3 30 6.7 6.7

Di 26.7 0 6.7 6.7 33.3 26.7 0Fe 40 0 10 0 30 10 10Ha 13.6 0 4.5 54.5 9.1 13.6 4.5

Ne 3.6 14.3 7.1 10.7 46.4 14.3 3.6

Sa 10.5 0 5.3 31.6 31.6 15.8 5.3

Su 0 13.3 0 26.7 46.7 13.3 0

Table 4: Confusion matrix in percentages for Naive Bayes classification on medianAU intensity features.

An Di Fe Ha Ne Sa SuAn 44.1 8.5 15.3 8.5 6.8 13.5 3.4

Di 30.8 10.3 5.1 10.3 17.9 15.4 10.3

Fe 43.2 4.5 15.9 6.8 25 4.5 0Ha 22.2 3.2 9.5 42.9 14.3 6.3 1.6

Ne 13.1 1.6 8.2 8.2 42.6 21.3 4.9

Sa 20.3 6.8 11.9 20.3 16.9 23.7 0Su 17.4 4.3 17.4 10.9 30.4 8.7 10.9

Table 5: Confusion matrix in percentages for Naive Bayes classification on LBP-TOP features.

When looking at the confusion matrix in table 4, it is clear that for some ex-pressions, for example Sadness, a large amount of videos is misclassified as beingeither Happiness or Neutral. This same applies for Surprise and Disgust. In ta-ble 5, the misclassifications are more spread out over all expressions. However,most expressions are misclassified as being Anger. Possible explanations for thesemisclassifications are given in Section 5.4.

5.2 Conditional Random Field

The experiments using the CRFs have been conducted using Matlab toolboxHCRF2.0b. As mentioned before, the optimization algorithm used in this proce-dure was L-BFGS. The number of used iterations has been set at 50, 100, 300 and500 to examine its influence on accuracy and running time of the procedure. Aswith the Naive Bayes classification, the running time of the training and testingprocesses and the accuracy of the CRF experiments are listed in table 6.

15

Time (s)


50 iterations AU Int. Median 0.035 0.0009 23.02AU Int. Frame-by-Frame 4.575 0.05 14.49AU Int. Frame Majority 15.83LBP-TOP 20.56 0.102 37.47




Table 6: CRF results on video extracted AU intensity features and LBP-TOP fea-tures. Note that the ‘AU Int. Frame Majority’ is part of the ‘AU Int. Frame-by-Frame’ model and thus they have the same training and testing time.

As one may have noticed, the results from the AU intensity data extracted fromthe aligned face images has not been incorporated in table 6, because the results onthis data were overall lower than the results on the video fragment extracted AUintensity data. To see the results for the aligned face image extracted AU intensityfeatures, view Appendix A.

The highest results using the CRF were achieved by applying it to the LBP-TOP features. For each variation of the number of iterations, the CRF applied tothe LBP-TOP features surpasses the data set’s baseline. The best result on the LBP-TOP features was achieved with 50 iterations. However, overall, the results werebest when using 300 iterations. Although, the difference in running time between50 and 300 iterations is not insignificant. The training of the CRF on the LBP-TOPfeatures takes more than five times as long when using 300 iterations, instead of 50iterations.

In table 7, the confusion matrix for the performance of the CRF with 300 it-erations, applied to the LBP-TOP features, is shown. In this confusion matrix,the expressions Anger and Happiness are classified with a relatively high accu-racy. However, again, the misclassified expressions are for a large part classified asAnger and Neutral.

16

An Di Fe Ha Ne Sa SuAn 59.3 13.6 5.1 6.8 10.2 5.1 0Di 25.6 25.6 10.3 12.8 10.3 7.7 7.7

Fe 25 9.1 20.5 9.1 15.9 11.4 9.1

Ha 6.3 9.5 11.1 60.3 4.8 6.3 1.6

Ne 6.6 9.8 9.8 14.8 39.3 18.0 1.6

Sa 11.9 11.9 13.6 11.9 23.7 20.3 6.8

Su 13.0 13.0 17.4 8.7 23.9 4.3 19.6

Table 7: Confusion matrix in percentages for Conditional Random Field classifi-cation with 300 iterations, on LBP-TOP features.

5.3 Latent Dynamic Conditional Random Field

The experiments using LDCRFs were conducted using the Matlab toolboxHCRF2.0b, just as the experiments with the CRFs. However, instead of varyingthe number of iterations, these experiments were conducted with a variable num-ber of hidden states. The number of iterations was set at 300, since this yielded theoverall highest results in the experiments with the CRFs. The number of hiddenstates was set at two, three, four and five. Again, the duration of the training andtesting procedures was computed, together with the accuracy. These results areshown in table 8. As with the CRF results, the results of the AU intensity featuresextracted from the aligned face images are not included in this table, due to theoverall lower performance on this data. See Appendix B for these results.

17

Time (s)


2 Hidden states AU Int. Median 0.069 0.0009 23.02AU Int. Frame-by-Frame 189.99 0.159 16.80AU Int. Frame Majority 16.55LBP-TOP 202.93 0.182 37.74




Table 8: LDCRF results on video fragment extracted AU intensity features andLBP-TOP features. Note that the ‘AU Int. Frame Majority’ is part of the ‘AU Int.Frame-by-Frame’ model and thus they have the same training and testing time.

The overall highest performance is achieved by the LDCRF with three hiddenstates. Also, using the LDCRF with three hidden states on the LBP-TOP featuresresults in the highest performance of all tested methods and features, which isunderlined in table 8. The corresponding confusion matrix of the LDCRF withthree hidden states on the LBP-TOP features can be seen in table 9.

An Di Fe Ha Ne Sa SuAn 61.0 11.9 5.1 8.5 8.5 5.1 0Di 23.1 25.6 10.3 15.4 10.3 7.7 7.7

Fe 29.5 4.5 22.7 11.4 11.4 9.1 11.4

Ha 6.3 7.9 11.1 65.1 4.8 3.2 1.6

Ne 6.6 9.8 8.2 11.5 42.6 19.7 1.6

Sa 11.9 11.9 11.9 13.6 20.3 20.3 10.2

Su 15.2 15.2 13.0 13.0 26.1 4.3 13.0

Table 9: Confusion matrix in percentages for Latent Dynamic Conditional Ran-dom Field classification with 300 iterations and thee hidden states, on LBP-TOPfeatures.

18

The classifications of Anger and Happiness have a relatively high accuracy,just as with the CRF classification. When comparing the confusion matrix fromthe CRF, in table 7, with this confusion matrix, it can be seen that almost eachexpression has either the same performance or the performance of the LDCRFsurpasses that of the CRF. Only the classification of Surprise achieves a loweraccuracy and its misclassifications are spread out equally over the remaining sixexpressions.

5.4 Discussion

For none of the classification methods did the accuracy of the results using theAU intensity features surpass the baseline of 34.4%. Only using the CRFs andLDCRFs on the LBP-TOP features resulted in an improvement of accuracy overthe baseline. However, the highest accuracy of 38.01% is still not high enoughto be called reliable. A viable reason for this fact might be that in normal humaninteraction, we rarely show one single facial expression at a time. Some facial ex-pressions may be shown as a combination, such as Anger and Disgust, and othermay be shown in sequence, for example Surprise and Fear. This makes the prob-lem of facial expression recognition in the wild not only a classification task, butalso a disambiguation task between facial expressions. This increases the difficultyof the task and thus decreases its performance. This problem was also noted byGehrig and Ekenel [2013]. In the AFEW data set, some of the video fragmentsshow several expressions, while being labelled as only one of those expressions.The approach of Gehrig et al. was to simply remove the video fragments wheremore than one facial expression was shown, to form a revised subset. This some-what increased their achieved performance.

Another problem that was observable from the confusion matrices, was that alot of misclassifications were classified as Neutral. The most probable cause forthese misclassifications is that in some of the fragments there is a large amount offrames that show a neutral facial expression more than showing the facial expres-sion of its actual class. This qualifies the fragment as being of that certain class,but it has a high probability of being classified as Neutral, because of the amountof frames in which the face was actually neutral.

Also, the results on the AU intensity features that were extracted from thealigned face images were overall lower than the results from the AU intensity fea-tures extracted from the video fragments. It was expected that the results fromthe aligned face image data would yield better results, because these AU intensityvalues would be extracted more reliably from already found facial images. How-ever, examine the images in figure 7. These are examples of some of the alignedface images of the AFEW data set. It seems that the face detection, used to createthe aligned face images, was not equally accurate in detecting the faces in all thefragments. For some of the videos, only images such as in figure 7 were given.

19

Figure 7: Examples of bad aligned face images from the AFEW data set.

The main comparison in this thesis was between static and dynamic, or tempo-ral, classification. The Naive Bayes classification being the static method of clas-sification and the CRFs and LDCRFs being the temporal methods of classification.However, there is also a distinction to be made in the features that have been used.The LBP-TOP features are spatio-temporal features, computed by using temporalinformation combined with spatial information, whereas the AU intensity featuresare static values extracted from static images. This fact should make the LBP-TOPfeatures more descriptive for the task of facial expression recognition, than the AUintensity features. This effect can be seen in the results of the Naive Bayes classi-fication. However, accordingly, that should make the classification of the CRF andLDCRF on the AU intensity features better than the Naive Bayes classification onthese features. This, however, is not the case. By combining the temporal classi-fication methods, CRF and LDCRF, with the spatio-temporal features, LBP-TOP,however, the results surpass all other results.

Some important comparisons can be made between temporal and static featuresand temporal and static classifiers. One of these comparisons is between the LBP-TOP features and the AU intensity features, where the performance of LDCRF onthe spatio-temporal LBP-TOP features is 65% higher than the performance of thatsame classifier on the static AU intensities. The other comparison is between theLDCRF classifier and the Naive Bayes classifier, where the performance of theLDCRF classifier on the LBP-TOP features is almost 30% higher than that of theNaive Bayes classifier on the LBP-TOP features. This shows a significant increasein performance of facial expression recognition in the wild, when using temporalinformation.

6 Conclusion

Results from several experiments have been presented, in order to investigatethe influence of temporal analysis in the task of facial expression recognition inthe wild. These experiments can be compared on the basis of being conductedwith static or temporal classifiers and using static or spatio-temporal features. Thehighest performance was achieved by the temporal LDCRF on the spatio-temporalLBP-TOP features, which was 38.01%. This outperformed the baseline of the

20

AFEW data set with 4%.By comparing the static and temporal classifiers it was found that an increase

in performance of 30% was achieved. The comparison between the static AU in-tensity features and the spatio-temporal LBP-TOP features yielded that an increasein performance of 65% was achieved. This shows that the use of temporal classifi-cation methods and the use of temporal information in the features has a significantpositive influence on the performance of facial expression recognition in the wild.

6.1 Future Work

Even though an accuracy above the baseline was achieved, there is still enoughroom for improvement. This section describes some possibilities for future workthat might improve the accuracy achieved in this thesis.

An improvement could be achieved by removing the fragments that showedmore than one facial expression, or that showed a certain facial expression only ina small amount of the frames with a neutral expression in the rest of the frames.This is similar to the data set revision proposed in [Gehrig and Ekenel, 2013]. Inthat manner, the classifiers will have a less difficult task, because there is no needfor disambiguation between facial expressions in one fragment.

Another option that might improve the accuracy, is by removing all frames thatare classified as the Neutral class, since a large amount of frames in the fragmentswill show a neutral face. This would remove the problem that causes many of themisclassifications to be classified as Neutral.

Also, fusion with other feature modalities such as acoustic or linguistic datacan be attempted. In the EmotiW challenge of 2013, the winner achieved a per-formance of 41% using deep neural networks on a combination of visual features,audio features, recognition of activity and Bag of Mouth features [Kahou et al.,2013].

Lastly, the LBP-TOP features can be extracted from only a number of framesat a time, instead of the full fragment at once. This could help in making a dis-tinguishment between several facial expressions in one fragment, because thesedifferent facial expressions will be part of different sliding windows of frames,instead of being all computed into one big set of features.

References

Nelson T. Alves. Recognition of static and dynamic facial expressions: a studyreview. Estudos de Psicologia (Natal), 18(1):125–130, 2013.

Dimitri P. Bertsekas. Nonlinear programming. Athena Scientific, 1999.

Richard H. Byrd, Jorge Nocedal, and Robert B. Schnabel. Representations ofquasi-newton matrices and their use in limited memory methods. Mathemati-cal Programming, 63(1-3):129–156, 1994.

21

Ira Cohen, Nicu Sebe, Ashutosh Garg, Lawrence S. Chen, and Thomas S. Huang.Facial expression recognition from video sequences: temporal and static model-ing. Computer Vision and Image Understanding, 91(1):160–187, 2003.

Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. Collecting large,richly annotated facial-expression databases from movies. IEEE MultiMedia, 19(3):0034, 2012.

Paul Ekman and Wallace V. Friesen. Facial Action Coding System: Investigator’sGuide. Consulting Psychologists Press, 1978.

Tobias Gehrig and Hazım K. Ekenel. Why is facial expression analysis in the wildchallenging? In Proceedings of the 2013 on Emotion recognition in the wildchallenge and workshop, pages 9–16. ACM, 2013.

Suyog Jain, Changbo Hu, and Jake K. Aggarwal. Facial expression recognitionwith temporal modeling of shapes. In Computer Vision Workshops (ICCV Work-shops), 2011 IEEE International Conference on, pages 1642–1649. IEEE, 2011.

Samira Ebrahimi Kahou, Christopher Pal, Xavier Bouthillier, Pierre Froumenty,Caglar Gulcehre, Roland Memisevic, Pascal Vincent, Aaron Courville, YoshuaBengio, Raul Chandias Ferrari, Mehdi Mirza, Sebastien Jean, Pierre-Luc Car-rier, Yann Dauphin, Nicolas Boulanger-Lewandowski, Abhishek Aggarwal,Jeremie Zumer, Pascal Lamblin, Jean-Philippe Raymond, Guillaume Desjardins,Razvan Pascanu, David Warde-Farley, Atousa Torabi, Arjun Sharma, EmmanuelBengio, Myriam Cote, Kishore Reddy Konda, and Zhenzhou Wu. Combiningmodality specific deep neural networks for emotion recognition in video. InProceedings of the 15th ACM on International Conference on Multimodal Inter-action, ICMI ’13, pages 543–550, New York, NY, USA, 2013.

Pat Langley, Wayne Iba, and Kevin Thompson. An analysis of bayesian classifiers.In AAAI, volume 90, pages 223–228. Citeseer, 1992.

Rosalind W. Picard, Elias Vyzas, and Jennifer Healey. Toward machine emotionalintelligence: Analysis of affective physiological state. Pattern Analysis and Ma-chine Intelligence, IEEE Transactions on, 23(10):1175–1191, 2001.

Shravya Shetty, Harish Srinivasan, Matthew Beal, and Sargur Srihari. Segmen-tation and labeling of documents using conditional random fields. In Elec-tronic Imaging 2007, pages 65000U–65000U. International Society for Opticsand Photonics, 2007.

Charles Sutton and Andrew McCallum. An introduction to conditional randomfields for relational learning, volume 2. Introduction to statistical relationallearning. MIT Press, 2006.

22

Tim L.M. van Kasteren, Athanasios K. Noulas, and Ben J.A. Krose. Conditionalrandom fields versus hidden markov models for activity recognition in temporalsensor data. 2008.

AppendicesA Results: Aligned face image AU intensity features with

CRF

Time (s)


50 iterations AU Int. Median 0.0746 0.00228 20.70AU Int. Frame-by-Frame 12.17 0.112 16.64AU Int. Frame Majority 12.74




Table 10: CRF results on aligned face image extracted AU intensity features. Notethat the ‘AU Int. Frame Majority’ is part of the ‘AU Int. Frame-by-Frame’ modeland thus they have the same training and testing time.

23

B Results:Aligned face image AU intensity features withLDCRF.

Time (s)


2 Hidden states AU Int. Median 0.185 0.00256 20.38AU Int. Frame-by-Frame 416.05 0.4357 14.29AU Int. Frame Majority 9.55




Table 11: LDCRF results on aligned face image extracted AU intensity features.Note that the ‘AU Int. Frame Majority’ is part of the ‘AU Int. Frame-by-Frame’model and thus they have the same training and testing time.

24

Documents

Facial Expression Recognition in the Wild: The Inﬂuence of ... · facial expression recognition in the wild is a more challenging task. Consequently, performance of facial expression