

HAL Id: hal-02428528, https://hal.archives-ouvertes.fr/hal-02428528

Submitted on 6 Jan 2020


Micro and macro facial expression recognition using advanced Local Motion Patterns

Benjamin Allaert, Ioan Marius Bilasco, Chaabane Djeraba

To cite this version: Benjamin Allaert, Ioan Marius Bilasco, Chaabane Djeraba. Micro and macro facial expression recognition using advanced Local Motion Patterns. IEEE Transactions on Affective Computing, Institute of Electrical and Electronics Engineers, 2019, 10.1109/TAFFC.2019.2949559. hal-02428528


Micro and macro facial expression recognition using advanced Local Motion Patterns

Benjamin Allaert*, Ioan Marius Bilasco*, and Chaabane Djeraba*
*Centre de Recherche en Informatique Signal et Automatique de Lille, Univ. Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France

Abstract—In this paper, we develop a new method that recognizes facial expressions on the basis of an innovative Local Motion Patterns (LMP) feature. The LMP feature analyzes the motion distribution locally in order to separate consistent movement patterns from noise. Indeed, facial motion extracted from the face is generally noisy and, without specific processing, it can hardly cope with expression recognition requirements, especially for micro-expressions. Direction and magnitude statistical profiles are jointly analyzed in order to filter out noise. This work presents three main contributions. The first one is the analysis of the temporal elasticity of the facial skin and of face deformations during an expression. The second one is a unified approach for both macro and micro expression recognition, leading the way to supporting a wide range of expression intensities. The third one is a step forward towards in-the-wild expression recognition, dealing with challenges such as various intensities and expression activation patterns, illumination variations and small head pose variations. Our method outperforms state-of-the-art methods for micro expression recognition and positions itself among top-ranked state-of-the-art methods for macro expression recognition.

Index Terms—Macro expression, Micro expression, Optical flow, Facial expression, Local motion patterns.


1 INTRODUCTION

Facial expression recognition has attracted great interest over the past decade in wide application areas such as human-machine interaction, behavior analysis, video communication, e-learning, well-being, e-health and marketing. For example, during video-conferences between several participants, facial expression analysis strengthens dialogue and social interaction between all participants (i.e., keeping the viewers' attention). In e-health applications, facial expression recognition helps to better understand patient minds and pain, without any intrusive sensors.

Facial expressions fundamentally cover both micro and macro expressions [1]. This is a very important issue because, by essence, both micro and macro expressions, as well as intermediate expressions, are present during human interactions. Addressing the expression recognition problem with a unified approach, regardless of the expression intensity, is one important requirement for in-the-wild expression recognition. In this work, we focus on micro and macro expression recognition as they represent extreme intensities. Proposing a unified approach coping at once with very large intensity variations leads the way to the coverage of the full range of facial expression intensities.

The difference between macro and micro expressions depends essentially on the duration and the intensity of the expression, as illustrated in Figure 1.

Macro expressions are voluntary facial expressions and cover large face areas. The underlying facial movements and texture deformations can be clearly discriminated from noise. The typical duration of a macro expression is between 0.5 and 4 s [1].

Manuscript received May 30, 2018; revised 2018. Corresponding author: B. Allaert (email: [email protected]).

Fig. 1. Difference of motion intensity between micro and macro expression: happiness (line 1), disgust (line 2), from CASME II [2] and MMI [3], respectively micro and macro expression datasets.

On the opposite, micro expressions are involuntary facial expressions. Often, they convey hidden emotions that reveal true human feelings and state of mind. Micro expressions tend to be subtle manifestations of a concealed emotion under a masked expression. They are characterized by rapid facial movements and cover restricted facial areas. The typical duration of micro expressions is between 65 ms and 500 ms [4]. In terms of facial muscle movements and texture changes, micro expressions are characterized by low intensities [5].

We propose an innovative motion descriptor called Local Motion Patterns (LMP), with three main contributions. First, it takes into account mechanical facial skin deformation properties (local coherency and local propagation). Second, it empowers the construction of a unified method for micro expression (disgust, happiness, repression, surprise) and macro expression (anger, disgust, fear, happiness, sadness, surprise) recognition. When extracting motion information from the face, the unified method deals with the inconsistencies and noise caused by face characteristics (skin smoothness, skin reflection and elasticity). Generally, related works on facial expression recognition have been proposed to deal separately with macro and micro expressions.


The advantage of having a unified method characterizing both macro and micro expressions consists in its ability to cope with a large panel of facial expression intensities. Hence, a unified method narrows the gap towards in-the-wild settings with regard to intensity variations. Third, on the basis of local facial motion intensity and propagation, the method is a step forward and is potentially suitable for in-the-wild expression recognition, showing robustness to motion noise, illumination changes (near infrared and natural illumination), small head pose variations and various activation patterns.

Our facial expression recognition method is validated on datasets that are representative for the facial expression recognition community, for both micro (CASME II, SMIC) and macro (CK+, Oulu-CASIA, MMI, AFEW) expression recognition. Our method outperforms state-of-the-art methods for micro expression recognition, and is competitive with state-of-the-art macro expression recognition methods.

In section 2, we discuss works related to expression recognition. We introduce facial expression features and recent expression recognition approaches. In section 3, we present our Local Motion Patterns (LMP) feature that considers local facial motion coherency (see the "Feature extraction" part in Figure 2). In this section, we show how LMP deals with facial skin smoothness, reflection and elasticity. In section 4, we explore several strategies for encoding the facial motion for macro and micro expression recognition (see the "Expression recognition" part in Figure 2). Experimental results, presented in section 5, outline the generalization capacity of our method for micro and macro expression recognition. Conclusions, summing up the main contributions, and perspectives are discussed in section 6.

Fig. 2. Overview of our expression recognition method.

2 RELATED WORK

This section presents the most significant macro and micro facial expression recognition approaches that have been proposed in the literature. Facial expression recognition approaches are based on features and face segmentation models. The facial segmentation model defines the regions of the face from which information is extracted. The information is composed of features encoding changes in terms of texture and motion. We start the section by discussing features for macro and micro expression recognition, followed by face segmentation models. Finally, we focus on the combination of features and face segmentation models for macro and micro expression recognition.

2.1 Macro expression recognition

Important motions induced by the face skin muscles characterize macro expressions. Furthermore, with regard to facial deformation, several types of techniques based on appearance and/or geometry are used to encode the changes.

Features such as LBP [6] or HOG [7] obtained good results in the analysis of macro facial deformations. A similar comment applies to convolutional neural network (CNN) approaches [8]–[10] that learn a spatial feature representation on the apex frames. By relying on the spatial feature only, LBP, HOG and static CNN approaches do not exploit facial expression dynamics, which can limit their performance in the presence of subtle expressions.

Psychological experiments by Bassili [11] showed that facial expressions can be recognized more accurately in a sequence of images. Therefore, a dynamic extension of LBP called Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) was proposed in [12]. Considering the latest developments in the dynamic texture domain, optical flow has regained interest from the community, becoming one of the most widely used and recognized solutions [13]. Optical flow estimates in a natural way the local dynamics and temporal texture characteristics. In recent deep learning approaches [14], [15], a recurrent neural network (RNN) was used with a conventional CNN to encode face dynamics and showed better performance compared to a CNN alone.

Most geometric approaches use the Active Appearance Model (AAM) or variations like the Active Shape Model (ASM) to track a dense set of facial points [16]. The location of these landmarks is then used in different ways to help extract the shape- or motion-related facial features.

Hybrid approaches combine geometric and appearance features. As suggested in [17], combining them provides additional information to the recognition process. Jaiswal et al. [18] use a combination of Convolutional and Bi-directional Long Short-Term Memory Neural Networks (CNN-BLSTM), which jointly learn shape, appearance and dynamics. They show that the combination of dynamic CNN features and BLSTM excels at modeling the temporal information. Several deep learning methods [15], [16] use a temporal geometric feature in order to reduce the effect of identity on the learned features.

2.2 Micro expression recognition

The expression recognition approaches presented above, designed for macro expressions, are not adapted to micro expression challenges (very short duration, low motion amplitude and limited texture changes). Liu et al. [19] apply macro expression approaches directly to micro expressions and show that detecting subtle changes by applying traditional macro expression approaches is a difficult task. Indeed, partial and low-intensity facial movements in micro expressions differ from ordinary expressions, and it is difficult to separate true facial motion from noise due to head movement or motion discontinuities. The same conclusion has been drawn when using deep learning [20].

According to [21], micro expressions are much more difficult to detect without temporal information. Thus, researchers use spatio-temporal features for micro expression analysis.


Wang et al. [22] propose an extension of LBP-TOP based on the three intersecting lines crossing over the center point of the three histograms. They provide a more compact and lightweight representation by minimizing the redundancy in LBP-TOP. Huang et al. [23] propose a new spatio-temporal LBP on an improved integral projection, combining the benefits of texture and shape.

Although most micro expression recognition studies have considered LBP-TOP, several authors investigate alternative methods. Li et al. [21] employ temporal interpolation and motion magnification to counteract the low intensity of micro expressions. They show that longer interpolated sequences do not lead to improved performance, because the movement tends to be diluted. Interpolating micro expression segments using only 10 frames seems sufficient. Recently, Liu et al. [19] designed a feature for micro expression recognition based on a robust optical flow method, extracting the Main Directional Mean Optical-flow (MDMO). They showed that the magnitude is more discriminant than the direction when working with micro expressions. Furthermore, several deep-learning methods have been proposed to deal with micro expressions [14], [20], [24], but for now they all present lower performance than handcrafted approaches.

In this context, systems based on dynamic textures provide better performance. They allow detecting subtle changes occurring on the face where texture-based or geometry-based approaches fail.

2.3 Face segmentation models

The face segmentation model, based on geometric information, defines the most appropriate layout for extracting face features in order to recognize expressions. Assuming that face regions are well aligned, histogram-like features are often computed from equal-sized face grids [25]. However, apparent misalignment can be observed, primarily caused by face deformations induced by the expression itself. In most cases, the geometric features are used to ensure that facial regions (eyes, eyebrows, lip corners) are well aligned.

Appearance features extracted from active face regions improve the performance of expression recognition. Therefore, some approaches define the regions with respect to specific facial locations (i.e., eyes, lip corners) using geometrical characteristics of the face [26].

Recent studies use facial landmarks to define facial regions. They increase robustness for facial deformation analysis during expressions. Jiang et al. [27] define a mesh over the whole face with ASM, and extract features from the regions enclosed by the mesh. Sadeghi et al. [28] use a fixed geometric model for geometric normalization of facial images. The face image is divided into small sub-regions and LBP histograms are then calculated in each one to accurately describe the texture.

Face segmentation models based on salient patches, blocks, meshes or weighted masks have been explored over time in combination with various features. Despite the use of similar features in macro and micro expression recognition, it is still difficult to find a unified facial segmentation model for analyzing macro and micro expressions together.

2.4 Synthesis

Micro expressions are quite different from macro expressions in terms of facial motion amplitudes and texture changes, which makes them more difficult to characterize. Results from significant state-of-the-art approaches are reported in Table 1 and show the striking difference between macro and micro expression recognition performances.

Table 1 illustrates the established trends: appearance (static approaches), geometry and motion (dynamic texture and temporal approaches) in both the macro and micro expression recognition fields. The main focus of the table resides in the difference in terms of performance between micro and macro expression recognition when the same underlying features and face segmentation models are used. Macro and micro expression recognition approaches are not directly comparable due to the fact that the underlying data is very different. However, we present them together in order to show that methods working well in one situation do not provide equivalent performance in the other. In order to allow an intra-category ranking, all macro expression approaches cited in Table 1 use SVM as a final classifier and a 10-fold cross-validation protocol. All cited micro expression approaches use SVM as a final classifier and a leave-one-subject-out (LOSO) cross-validation protocol.

TABLE 1
State-of-the-art methods for macro and micro expressions (* data augmentation).

Based on | Macro expression (CK+)                   | Micro expression (CASME II)
App.     | LBP [29], block-based: 90.05%            | LBP [21], block-based: 55.87%
         | PHOG [7], salient region: 95.30%         | HIGO [21], block-based magnified: 67.21%
         | CNN [8], whole face: 96.76% *            | CNN [20], whole face: 47.30% *
Geom.    | Gabor Jet [30], facial points: 95.17%    | /
         | DTGN [16], facial points: 92.35% *       | /
Motion   | LBP-TOP [12], block-based: 96.26%        | DiSTLBP-IIP [23], block-based: 64.78%
         | Optical flow [31], facial meshes: 93.17% | MDMO [19], facial meshes: 67.37%
         | CNN + LSTM [14], whole face: 98.62% *    | CNN + LSTM [24], whole face: 60.98% *

As shown in Table 1, well-known static methods like LBP have limited potential for micro expression recognition. The difference would be attributable to the fact that they cannot discriminate very low intensity motions [21]. LBP-TOP has shown promising performance for facial expression recognition. Therefore, many researchers have actively focused on LBP-TOP for micro expression recognition.

Geometric approaches deliver good results for macro expressions, but fail at detecting subtle motions in the presence of micro expressions. Subtle motions require measuring skin surface changes. Algorithms tracking landmarks do not deliver the necessary accuracy for micro expressions.

Dynamic texture approaches are best suited to low facial motion amplitudes [23]. Specifically, methods based on optical flow appear to be promising for micro expression analysis [19]. Moreover, the optical flow approach proposed in [31] obtains competitive results in both macro and micro expression analysis. However, optical flow approaches are often criticized for being heavily impacted by the presence of motion discontinuities and illumination changes. Recent optical flow algorithms (e.g., [32]) have evolved to better deal with noise.


The majority of these algorithms is based on complex filtering and smooth motion propagation that reduce the discontinuity of local motion, improving the quality of the optical flow. Still, in the presence of both high and low intensity motion, the smoothing effect tends to induce false motions. Another technique consists in artificially amplifying the motion. This technique is being used increasingly and successfully in micro expression recognition [21]. Its main disadvantage arises in the presence of high intensity facial deformations: such deformations are altered significantly in terms of facial morphology, especially in the presence of macro expressions.

Concerning deep learning approaches, we underline important contrasts. On one hand, deep learning approaches provide good results for macro expression recognition (see the * lines in Table 1). Deep learning approaches are based on auto-encoded features optimized for specific datasets. For example, Breuer and Kimmel [14] employ Ekman's facial action coding system (FACS) in order to boost the performance of their approach. On the other hand, deep learning results are clearly lower than those of handcrafted approaches for micro expression recognition (Table 1).

Transposing features and face segmentation models efficiently from macro expression recognition to micro expression recognition has not yet been achieved with regard to the current state-of-the-art. The selected representative works employ the same underlying feature in micro and macro expression recognition; however, they need to change the facial segmentation model and the overall approach in order to maximize performance in both situations. Table 1 shows that it is still difficult to find a common methodology to analyze both macro and micro expressions accurately. However, for both, dynamic approaches seem promising.

Starting from these observations, we propose an innovative motion descriptor called Local Motion Patterns (LMP) that preserves the real facial motion and filters out motion discontinuities for both low and high intensities. Inspired by recent advances in the use of motion-based approaches for macro and micro expression recognition, we explore the use of magnitude and direction constraints in order to extract the relevant motion on the face. Considering the smoothing of motion in recent optical flow approaches, a simple optical flow combined with a magnitude constraint is appropriate for reducing the noise induced by illumination changes and small head movements. In the next section, we propose to filter optical flow information based on consistent local motion propagation to keep only the pertinent motion. Then, in section 4, we explore the construction of a unified facial segmentation model that generates discriminating features used to effectively recognize six macro expressions (anger, disgust, fear, happiness, sadness, surprise) and four micro expressions (disgust, happiness, repression, surprise).

3 LOCAL MOTION PATTERNS

The facial characteristics (skin smoothness, skin reflection and elasticity) induce inconsistencies when extracting motion information from the face. In our method, instead of explicitly computing the global motion field, the motion is computed in specific facial areas, defined in relation to the facial action coding system, in order to keep only the pertinent motion on the face. The pertinent motion is extracted from regions where the movement intensity reflects natural facial movements. We consider natural facial movement to be uniform and locally continuous over neighboring regions.

We propose a new feature named Local Motion Patterns (LMP) that retrieves the coherent motion around an epicenter $\varepsilon(x,y)$ by considering natural motion propagation to neighboring regions. Each region, called Local Motion Region (LMR), is characterized by a histogram of optical flow $H_{LMR_{x,y}}$ of $B$ bins. There are two types of LMR involved: the Central Motion Region (CMR) and the Neighboring Motion Regions (NMR).

Fig. 3. Overview of local motion patterns (LMP) extraction.

The LMP construction is illustrated in Figure 3. Eight NMR are generated around the CMR. All these regions are at distance $\Delta$ from the CMR. The bigger the distance between two regions, the lower the coherence in the overlapping area. $\lambda$ is the size of the area under investigation around the epicenter. Finally, $\beta$ characterizes the number of direct propagations from the epicenter that are carried out in order to retrieve all the coherent motions.
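
To make the geometry concrete, the following minimal Python sketch generates the centers of a CMR and its eight surrounding NMRs from an epicenter and the spacing $\Delta$; the function name, the coordinates and the spacing value are illustrative and are not taken from the paper.

```python
def lmr_centers(epicenter, delta):
    """Centers of the CMR (index 0) and of its eight NMRs, spaced `delta` pixels apart."""
    x, y = epicenter
    offsets = [(dx, dy) for dx in (-delta, 0, delta) for dy in (-delta, 0, delta)]
    offsets.remove((0, 0))
    return [(x, y)] + [(x + dx, y + dy) for dx, dy in offsets]

# Example: an LMP placed on the left lip corner, with regions spaced 8 px apart.
print(lmr_centers((120, 180), delta=8))
```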

3.1 Local coherency of the central motion region

In order to measure the consistency of the motion of an LMP in terms of intensity and directions, we analyze the direction distribution in its CMR for several layers of magnitude. The motion on the face spreads progressively due to skin elasticity. We assume a regular progression of magnitude in specific directions.

We propose a method to compute the main direction in specific regions by jointly analyzing different layers of magnitude. This technique brings out main directions that are difficult to observe and reduces the motion noise.

The direction distribution of an LMR is divided into $q$ histograms, one per magnitude layer. High intensity motion is more easily detected than low intensity motion. Each layer of magnitude is defined as follows:

$$MH_{LMR_{x,y}}(n,m) = \{(bin_i, mag_i) \in H_{LMR_{x,y}} \mid mag_i \in [n,m]\}. \quad (1)$$

where $n$ and $m$ represent the magnitude range and $i = 1, 2, ..., B$ is the index of the bin. Each $MH_{LMR_{x,y}}$ is normalized, and directions representing less than 10% are filtered out (set to zero). Then, the magnitude layers are segmented into three parts $P_1 \in [0\%, 33\%]$, $P_2 \in ]33\%, 66\%]$ and $P_3 \in ]66\%, 100\%]$, represented by three cumulative histograms $ML_{LMR_{x,y}}(m_1,m_2)$ that are computed as follows:

$$ML_{LMR_{x,y}}(m_1,m_2) = \{(bin_i,\ card(\{(n,m) \mid \exists (bin_i, mag_i) \in MH_{LMR_{x,y}}(n,m),\ mag_i \in [m_1,m_2]\}))\}. \quad (2)$$
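
As an illustration of equation (1), the sketch below builds one magnitude-layer direction histogram from the optical-flow vectors of an LMR. It is a simplified reading of the text (bins counted by occurrence, 10% filtering); the function name and default values are ours, not the paper's.

```python
import numpy as np

def magnitude_layer_histogram(angles_deg, magnitudes, n, m, num_bins=9, min_share=0.10):
    """Direction histogram of an LMR restricted to flow vectors whose magnitude lies in [n, m].

    The histogram is normalized and the direction bins holding less than `min_share`
    of the layer are zeroed out, following the filtering step described in the text.
    """
    angles = np.asarray(angles_deg) % 360.0
    mags = np.asarray(magnitudes)
    keep = (mags >= n) & (mags <= m)
    bin_idx = np.minimum((angles[keep] * num_bins / 360.0).astype(int), num_bins - 1)
    hist = np.bincount(bin_idx, minlength=num_bins).astype(float)
    if hist.sum() > 0:
        hist /= hist.sum()
        hist[hist < min_share] = 0.0
    return hist
```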


The directional and magnified histogram $DMH_{LMR_{x,y}}$ is obtained by applying a different weight $\omega_1$, $\omega_2$ and $\omega_3$ to each part of the corresponding bins, as follows:

$$DMH_{LMR_{x,y}} = ML_{LMR_{x,y}}(m_1,m_2)\,\omega_1 + ML_{LMR_{x,y}}(m_2,m_3)\,\omega_2 + ML_{LMR_{x,y}}(m_3,m_4)\,\omega_3. \quad (3)$$

In order to reinforce the local consistency of magnitude within each direction, we apply 10-scale factors between layers ($\omega_1 = 1$, $\omega_2 = 10$ and $\omega_3 = 100$). We assume that the higher the result, the higher the pertinence of the motion.

The motion filtering process is illustrated in Figure 4. Figure 4-A represents the magnitude layer histograms $MH_{LMR_{x,y}}$. The parameter $n$ varies between 0 and 10, by 0.2 magnitude steps. The parameter $m$ is fixed to 10 in order to keep overlapping magnitude ranges. The successive magnitude layers clearly distinguish the main direction. Next, the three magnitude layers $ML_{LMR_{x,y}}$ are represented in Figure 4-B, where each $ML_{LMR_{x,y}}$ corresponds to a row, and the number in each cell represents the number of magnitude occurrences for each bin. Finally, the directional and magnified histogram $DMH_{LMR_{x,y}}$ is illustrated in Figure 4-C. The values associated with the black circles in Figure 4-C are computed by applying the 10-scale factors to the successive layers in Figure 4-B and summing the results.
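
A possible implementation of the layer combination of equations (2)-(3) is sketched below: the per-layer histograms are stacked, per-bin occurrences are counted over the bottom, middle and top thirds of the layers, and the three counts are combined with the 1/10/100 weights. This is our reading of the figure; the function name and the equal-thirds split are assumptions.

```python
import numpy as np

def directional_magnified_histogram(layer_hists, weights=(1, 10, 100)):
    """Combine magnitude-layer histograms into a directional and magnified histogram (DMH).

    layer_hists: 2-D array with one row per magnitude layer (ordered from the lowest
    to the highest layer) and one column per direction bin.
    """
    layer_hists = np.asarray(layer_hists)
    active = layer_hists > 0                      # bin is active in a given layer
    q = layer_hists.shape[0]
    cuts = [0, q // 3, 2 * q // 3, q]             # bottom / middle / top thirds of the layers
    dmh = np.zeros(layer_hists.shape[1])
    for part, weight in enumerate(weights):
        dmh += weight * active[cuts[part]:cuts[part + 1]].sum(axis=0)
    return dmh
```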

Fig. 4. The process of consistent local motion characterization in a local motion region. (A) Magnitude histograms for different ranges, (B) Cumulative overlapping histograms, (C) Filtering motion.

Before analyzing the neighborhood and confirming the coherency of the LMR, at least one main direction in the distribution must be obtained after applying a fixed threshold $\alpha$ to the directional and magnified histogram ($DMH_{LMR_{x,y}}$). The threshold value reinforces the co-occurrences of various intensities within the same direction bin. If no direction is found, the LMR and the LMP are locally incoherent. This means that the local intensity of motion does not exhibit the expected progressive behavior in any direction.

Afterwards, it must be ensured that the main orientation directions in $DMH_{LMR_{x,y}}$ are consistent. In fact, the local distribution in the LMR can be consistent in terms of intensity, but it is possible to have a large number of bins with high values. This step ensures that the local motions spread coherently in the local neighborhood.

In order to ensure a consistent distribution in terms of orientation, the density of the $k$ main directions is analyzed. Each selected main direction must satisfy several criteria. The first criterion ensures that the main direction covers a limited number of bins (1 to $s$), where $s$ is the accepted threshold on the number of spanned bins. Indeed, if we analyze a small region of a face, a coherent facial motion rarely spreads over more than 60° and the variation of movement is progressive. Otherwise, if one main direction spreads over 60°, the LMR stops analyzing the neighboring regions. Indeed, main directions spreading over 60° undermine the accurate identification of consistent motion by causing the propagation of false and misleading information. This criterion is defined by the following two equations. The first one characterizes the extent of the main directions and the second filters out orientations spreading over $s$ consecutive bins:

$$C(DMH_{LMR_{x,y}}) = \{E = [a..b] \mid \forall i \in [a..b],\ DMH_{LMR_{x,y}}(i) > \alpha\ \wedge\ \nexists j \in \{a-1, b+1\},\ DMH_{LMR_{x,y}}(j) > \alpha\}. \quad (4)$$

$$C'(DMH_{LMR_{x,y}}, s) = \{E \in C(DMH_{LMR_{x,y}}) \mid card(E) < s\}. \quad (5)$$

where $[a..b]$ represents the limits that the standard deviation of directions must meet and $\alpha$ is the intensity threshold value. Then, for each selected direction, we keep only the directions spreading over at most $s$ consecutive bins.

In order to reinforce the fact that there is a gradual change in orientation, it is important that each main motion generates smooth transitions in terms of direction between neighboring bins. A maximum tolerance of $\Phi$ is supported, as defined in the following:

$$C''(DMH_{LMR_{x,y}}) = \{E = [a..b] \in C'(DMH_{LMR_{x,y}}, s) \mid \forall i, j \in E,\ \|i - j\| \le 1,\ \|DMH_{LMR_{x,y}}(i) - DMH_{LMR_{x,y}}(j)\| < \Phi\}. \quad (6)$$

Finally, the filtered directional and magnified histogram $FDMH_{LMR_{x,y}}$ corresponds to the $k$ main directions in $DMH_{LMR_{x,y}}$. $FDMH_{LMR_{x,y}}$ is constructed as follows:

$$FDMH_{LMR_{x,y}} = \{(b_i, m_i) \in DMH_{LMR_{x,y}} \mid \exists E = [a..b] \in C''(DMH_{LMR_{x,y}}) \wedge b_i \in E\}. \quad (7)$$
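
The main-direction selection of equations (4)-(7) can be sketched as follows: contiguous runs of bins above $\alpha$ are candidate directions, runs wider than $s$ bins or with non-smooth transitions (difference above $\Phi$ between adjacent bins) are discarded, and the surviving bins form FDMH. The sketch ignores the circular wrap-around of the histogram and uses names of our own.

```python
import numpy as np

def filter_main_directions(dmh, alpha, s, phi):
    """Keep only the coherent main directions of a DMH, returning the filtered FDMH."""
    dmh = np.asarray(dmh, dtype=float)
    fdmh = np.zeros_like(dmh)
    run = []
    for i, value in enumerate(list(dmh) + [0.0]):         # trailing 0 closes the last run
        if value > alpha:
            run.append(i)
            continue
        if run:
            narrow = len(run) < s                         # Eq. (5): limited bin span
            smooth = all(abs(dmh[a] - dmh[b]) < phi       # Eq. (6): gradual transitions
                         for a, b in zip(run, run[1:]))
            if narrow and smooth:
                fdmh[run] = dmh[run]                      # Eq. (7): keep the surviving bins
            run = []
    return fdmh
```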

Although the CMR is considered coherent, the LMP validation and computation are not yet completed. Indeed, if we consider that natural facial movement is uniform during facial expressions, then the local facial motion should spread over at least one neighboring region.

3.2 Neighborhood propagation

When the LMP is locally coherent in the CMR, the approach verifies the motion expansion on the neighboring motion regions (NMR). In some cases, physical rules (e.g., skin elasticity) ensure that local motion spreads to neighboring regions until motion exhaustion. Motion is subject to changes that may affect direction and magnitude at any location. However, the intensity of a moving facial region tends to remain constant during a facial expression. Therefore, a pertinent motion observed and computed in the CMR appears, eventually with lower or higher intensity, in at least one neighboring region.

Before analyzing the motion propagation, the local coherency of each NMR is analyzed with the same method discussed above for the CMR. As for the CMR, it must be ensured that the local distribution is consistent in terms of intensity and orientation. As an outcome of this process, each locally consistent $NMR_i$ is characterized by $FDMH_{LMR_{x_i,y_i}}$. However, it is important to check that the local distribution is similar to some extent to that of the previous adjacent neighbor.


The Bhattacharyya coefficient is used to measure the overlap between two neighboring LMR as follows:

$$C'''(FDMH_{LMR_{x_1,y_1}}, FDMH'_{LMR_{x_2,y_2}}) = \sum_{i=1}^{B} \sqrt{FDMH_{LMR_{x_1,y_1}}(i)\, FDMH'_{LMR_{x_2,y_2}}(i)}. \quad (8)$$

where $FDMH_{LMR_{x_1,y_1}}$ and $FDMH'_{LMR_{x_2,y_2}}$ are the local distributions and $B$ is the number of bins. The LMR is considered consistent with its neighbor if the coefficient is lower than the fixed threshold $\rho$.
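
The overlap test of equation (8) reduces to a few lines; a minimal sketch, assuming both FDMH histograms are normalized, could be:

```python
import numpy as np

def bhattacharyya_coefficient(h1, h2):
    """Overlap between two filtered direction histograms (FDMH), as in equation (8)."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return float(np.sum(np.sqrt(h1 * h2)))

# Following the text, a neighboring region is kept when the coefficient
# stays below the threshold rho.
```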

The motion propagation into an LMP after one iteration is given in Figure 5. If the local motion is inconsistent, the NMR are represented in gray. If the NMR are not coherent, three situations can be distinguished: a) the motion in the NMR is locally inconsistent in terms of intensity; b) the motion in the NMR is locally inconsistent in terms of orientation; and c) the distribution similarity between two regions is inconsistent.

Fig. 5. LMP distribution, computed from the propagation in the neighborhood of the central motion region.

As long as at least one newly created NMR is inter-region coherent with its neighbor, the motion analysis is reconducted recursively for each subsequent NMR. The recursive process ends when the number of propagations $\beta$ is reached.

Finally, each distribution $FDMH_{LMR_{x,y}}$ corresponding to an NMR that has a direct or indirect connection to the original CMR is accumulated into the LMP distribution. If the motion propagation between all NMR is inconsistent, the motion propagation is no longer explored. This means that there are no more pertinent motions to collect into the LMP. The final motion distribution of the LMP is computed as follows:

$$MD_{LMP_{x,y}} = \left\{\sum_{i=0}^{n} FDMH_{LMR_{x_i,y_i}} \;\middle|\; FDMH_{LMR_{x_i,y_i}} \in LMP_{x,y}\right\}. \quad (9)$$

where $n$ is the number of consistent regions.

$MD_{LMP_{x,y}}$ is defined as a histogram over $B$ bins, which contains, for each bin, the sum of the main direction intensities collected from the coherent NMR and CMR. We are then able to extract the coherent motions from a specific location on the face.
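
The propagation and accumulation steps (equation (9)) amount to a breadth-first traversal from the CMR, bounded by $\beta$. The sketch below assumes three hypothetical callables (fdmh_at, neighbors_of, is_coherent_pair) standing for the local coherency test, the eight-neighbor layout and the inter-region test described above; it is not the authors' implementation.

```python
from collections import deque

def lmp_distribution(epicenter, fdmh_at, neighbors_of, is_coherent_pair, beta):
    """Accumulate the coherent FDMH histograms reachable from the CMR into MD (Eq. 9).

    fdmh_at(region)          -> FDMH histogram of a region, or None if locally incoherent
    neighbors_of(region)     -> the eight neighboring motion regions of a region
    is_coherent_pair(h1, h2) -> True when two adjacent distributions overlap consistently
    """
    md = None
    seen = {epicenter}
    frontier = deque([(epicenter, 0)])
    while frontier:
        region, depth = frontier.popleft()
        hist = fdmh_at(region)
        if hist is None:
            continue
        md = hist.copy() if md is None else md + hist   # sum of coherent FDMH histograms
        if depth == beta:                               # number of propagations reached
            continue
        for nxt in neighbors_of(region):
            if nxt in seen:
                continue
            nxt_hist = fdmh_at(nxt)
            if nxt_hist is not None and is_coherent_pair(hist, nxt_hist):
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return md
```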

In summary, the proposed LMP feature collects pertinent motions and filters out the noise based on three criteria: convergence of motion intensity in the same direction, local coherency of the direction distribution and coherent motion propagation. Each criterion can be configured independently of the others, which makes the feature fully adaptable to many uses and contexts such as action recognition, facial expression recognition, tracking and others. To prove the effectiveness of our LMP, we analyze in the next section the use of LMP for micro and macro facial expression analysis.

4 EXPRESSION RECOGNITION

The choice of the facial segmentation model greatly impacts the performance. Various epicenters can be considered for coherent motion extraction. Thus, we study the impact of epicenters on the perceived motion while applying LMP. We show that the intensity of the expression (macro or micro) plays a key role in locating the LMP epicenter and, in the meantime, it impacts the way the consistent motion on the face is encoded. Then, we explore the integration of the coherent optical flow into the facial model formulation, and discuss several strategies for considering discriminant local regions of the face.

4.1 Impact of LMP location

For macro expressions, motion propagation covers a large facial area. If one CMR (Central Motion Region) is randomly placed in an area around the motion epicenter, then motion consistency is observed most of the time. However, for micro expressions, the motion propagation covers a restricted facial area. Motions are less intense, so motion propagation is discontinued. Figure 6 shows the local motion distribution extracted at various points around the left lip corner (blue, red and green dots). The original flow field and the local motion distribution extracted from a happiness sequence around the different locations are shown in the first three columns. The fourth column shows the distribution overlap.

Fig. 6. Consistent motions from a happiness sequence computed from different locations in the same region.

For macro expressions (first line), the location of each LMR is different, yet the distributions present large overlaps (column 4). For micro expressions (second line), the distributions corresponding to the three columns are different. The experiment can be reproduced in other facial regions with similar outcomes with regard to micro versus macro expressions. It is hence important to determine the most discriminant facial regions for encoding coherent motion in the context of a generic expression recognition process.

4.2 Best discriminant facial regions

Macro and micro expression motions are very different in terms of intensity and propagation. It is therefore important to detect pertinent motions that generate features able to effectively discriminate some of the most common macro expressions (happiness, sadness, fear, disgust, surprise and anger) and micro expressions (happiness, disgust, surprise, repression).


In order to identify optimal LMP epicenter locations, we have considered samples from the CK+ and CASME II datasets.

To identify the locations within the face where motion often occurs, we first align frames based on the eyes location, and we compute the optical flow for each frame of the sequence. This step eliminates in-plane head rotation and addresses individual differences in face shape. Then, each frame is segmented into 20x30 blocks. An LMP is extracted from each block, with the LMP epicenter situated at the center of the block. Then, the consistent motion vector is computed in each LMP. Next, the relevant optical flow extracted from each frame is merged into a single binary motion mask. The consistent motion masks as well as the motion information are extracted from video sequences of the same expression class. Finally, each consistent motion mask is normalized and merged into a heat map of motion for the underlying expression. The six consistent motion masks for the basic macro expressions are illustrated in the first line of Figure 7.
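
The final merging step can be sketched in a couple of lines: the per-sequence binary motion masks of one expression class are summed and normalized into a heat map (assuming the masks have already been computed and share the same shape; the function name is ours).

```python
import numpy as np

def expression_heat_map(binary_masks):
    """Merge binary motion masks (one per sequence, same shape) into a normalized heat map."""
    heat = np.sum(np.asarray(binary_masks, dtype=float), axis=0)
    return heat / heat.max() if heat.max() > 0 else heat
```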

The extracted masks indicate that pertinent motions are located below the eyes, on the forehead, and around the nose and the mouth, as illustrated in Figure 7. Some motions are located in the same place during elicitation for several expressions, but they are distinguishable by their intensity, direction and density. For example, anger and sadness motions are similar as they appear around the mouth and the eyebrows. However, when a person is angry, the motion is convergent (e.g., mouth upwards and eyebrows downwards), whereas the motion is divergent when a person is sad.

The same strategy for finding the best discriminant regions was used on the CASME II dataset for micro expressions (happiness, disgust, surprise and repression). As illustrated in the second line of Figure 7, the pertinent motions are located near the eyebrows and the lip corners. Compared with the macro expression motion maps, the propagation distances are highly reduced for micro expressions.

Fig. 7. Pertinent motions elicited by the 6 macro expressions from the CK+ dataset (top) and the 4 micro expressions from the CASME II dataset (bottom).

At this stage, the main facial regions of motion are accurately identified. We now construct a vector that encodes the relationships between facial regions of motion and expressions. We use facial landmarks to define regions that increase robustness to deformation during expressions. Similarly to Jiang et al. [27], the landmarks are used to define a mesh over the whole face, and a feature vector can be extracted from the regions enclosed by the mesh. Landmarks and geometrical features of the face are used to compute the set of points that defines a mesh over the whole face (forehead, cheeks). Finally, the best discriminant landmarks corresponding to active face regions are selected, and specific points are computed to set out the mesh boundaries.

The partitioning into facial regions of interest (ROIs) is illustrated in Figure 8. The partitioning is based on the facial motion observed in the consistency maps constructed from both macro and micro expressions. The locations of these ROIs are uniquely determined by landmark points for both micro and macro expressions. For example, the location of feature point $P_Q$ is the average of two landmarks, $P_{10}$ and $P_{55}$. The distance between the eyebrows and the forehead feature points ($P_A$, $P_B$, ..., $P_F$) corresponds to a quarter of the size of the nose, $Distance(P_{27}, P_{33})/4$. This allows maintaining the same distance for optimal adaptation to the size of the face. Note that, in order to deal precisely with the lip corner motion, regions 19 and 22 overlap regions 18 and 23, respectively.
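
As a small illustration of this construction, the sketch below derives the point $P_Q$ and the forehead offset from a landmark set, assuming the usual 68-point indexing; the function name and the dictionary input are ours.

```python
import numpy as np

def roi_anchor_points(landmarks):
    """Anchor quantities used to place the ROIs.

    landmarks: dict mapping a landmark index to an (x, y) position.
    P_Q is the average of landmarks 10 and 55; the eyebrow-to-forehead offset is a
    quarter of the nose length (distance between landmarks 27 and 33), as in the text.
    """
    p_q = (np.asarray(landmarks[10], float) + np.asarray(landmarks[55], float)) / 2.0
    nose_length = np.linalg.norm(np.asarray(landmarks[27], float) - np.asarray(landmarks[33], float))
    return p_q, nose_length / 4.0
```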

Fig. 8. Facial partitioning into regions of interest based on the facial muscles.

Fig. 9. Building the feature vector from the facial motion mask.

4.3 Facial motion descriptor

The facial motion mask is defined by the 25 ROIs presented above. In each frame $f_t$, we consider the filtered motion distribution inside each ROI $R^k_t$, where $t$ is the frame index and $k = 1, 2, ..., 25$ is the ROI index. Inside each $R^k_t$, LMP is applied and $MD_{LMP_{x,y}}$ is computed as defined in equation 9. The $R^k_t$ motion distributions are summed into $\eta_k$, which corresponds to the local facial motion in region $k$ for the entire sequence:

$$\eta_k = \sum_{t=1}^{time} R^k_t. \quad (10)$$

Finally, the histograms $\eta_k$ are concatenated into a one-row vector $GMD = (\eta_1, \eta_2, ..., \eta_{25})$, which is considered as the feature vector for the macro and micro expressions. The feature vector size is equal to the number of ROIs multiplied by the number of bins. An example is illustrated in Figure 9, where all motion distributions $MD_{LMP_{x,y}}$ corresponding to $R^1_t$, $R^2_t$, ..., $R^{25}_t$ with $t \in [1, time]$ are summed up in $\eta_1$, $\eta_2$, ..., $\eta_{25}$ respectively. $\eta_1$, $\eta_2$, ..., $\eta_{25}$ are then concatenated, and define the global motion distribution $GMD$.
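
A compact sketch of equation (10) and of the concatenation step, assuming the per-frame, per-ROI motion distributions are already available as an array (the function name is ours):

```python
import numpy as np

def global_motion_distribution(md_per_frame_per_roi):
    """Build the GMD feature vector from MD histograms of shape (time, 25, bins)."""
    md = np.asarray(md_per_frame_per_roi, dtype=float)
    eta = md.sum(axis=0)        # Eq. (10): one histogram per ROI, summed over time
    return eta.reshape(-1)      # GMD: concatenation of the 25 histograms
```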

4.4 Facial expression recognition framework

The framework, presented in Figure 2, is suitable for both micro and macro expressions. First, a preprocessing step is considered in order to extract landmarks and compute the 25 ROIs.


Then, the Farnebäck algorithm [33] is used to compute a fast dense optical flow. It ensures that the motion is not affected by smoothing and that the computation time remains low. LMP features are extracted from each ROI. Next, the relevant motion in each facial region is cumulated over time. Each facial region is represented by a histogram based on the orientation and the intensity of motion. The concatenation of the histograms extracted from the various regions defines the feature vector used for classifying each video sequence. In order to evaluate the benefit brought by mixing motion and geometric information, for some of the experiments introduced in the next section, we enrich the feature vector with the shape characteristics of each ROI.
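
For reference, dense Farnebäck flow is available in OpenCV; the sketch below shows one way to obtain per-pixel magnitude and direction, with parameter values that are common defaults rather than the ones used in the paper.

```python
import cv2

def dense_flow(prev_gray, next_gray):
    """Dense optical flow (magnitude, direction in degrees) between two grayscale frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    magnitude, angle_deg = cv2.cartToPolar(flow[..., 0], flow[..., 1], angleInDegrees=True)
    return magnitude, angle_deg
```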

5 EVALUATION

We highlight the performance obtained by our method on widely used datasets for micro expression recognition, namely CASME II [2] and SMIC [34], and on widely used datasets for macro expression recognition, namely CK+ [35], Oulu-CASIA [36], MMI [3] and AFEW [37]. Experiments on these datasets cover aspects of in-the-wild recognition such as head movement, illumination, and visible and infrared contexts.

After introducing the datasets, we compare our performance with major methods in the literature. We use LIBSVM [38] with an RBF kernel and a 10-fold cross-validation protocol for macro expressions, and leave-one-subject-out (LOSO) for micro expressions. Detailed information about the data used for the experiments and the code for extracting LMP features is available at https://gitlab.univ-lille.fr/marius.bilasco/lmp (provided for review).
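
A minimal sketch of the evaluation protocol, using scikit-learn's SVC (RBF kernel) in place of LIBSVM and a leave-one-subject-out split; the feature scaling step is our addition.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def loso_accuracy(features, labels, subject_ids):
    """Leave-one-subject-out accuracy of an RBF-kernel SVM on GMD feature vectors."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    scores = cross_val_score(clf, features, labels,
                             groups=subject_ids, cv=LeaveOneGroupOut())
    return float(np.mean(scores))
```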

5.1 Datasets

CASME II (micro expression dataset) contains 247 spontaneous micro expressions from 26 subjects, categorized into five classes: happiness (33 samples), disgust (60 samples), surprise (25 samples), repression (27 samples) and others (102 samples). The micro expressions are recorded at 200 fps in a well-controlled laboratory environment.

SMIC (micro expression dataset) is divided into three sets: (i) the HS dataset is recorded by a high-speed camera at 100 fps and includes 164 sequences from 16 subjects; (ii) the VIS dataset is recorded by a standard color camera at 25 fps; and (iii) the NIR dataset is recorded by a near infrared camera at 25 fps. The high-speed (HS) camera was used to capture and record the whole data, while the VIS and NIR cameras were only used for recording the last eight subjects (77 sequences). The three datasets include micro expression sequences from onset to offset. Each sequence is labeled with one of the following emotion classes: positive, surprise and negative.

CK+ (macro expression dataset) contains 593 acted facial expression sequences from 123 participants, with seven basic expressions (anger, contempt, disgust, fear, happiness, sadness, and surprise). In this dataset, the expression sequences start at the neutral state and finish at the apex state. Expression recognition is performed in excellent conditions, because the deformations induced by ambient noise, facial alignment and intra-face occlusions are not significant with regard to the deformations directly related to the expression. However, the temporal activation pattern is variable and spreads from 4 frames to 66 frames, with a mean sequence length of 17.8 ± 7.42 frames.

Oulu-CASIA (macro expression dataset) includes 480 sequences from 80 subjects taken under three different lighting conditions: strong, weak and dark illumination. They are labeled with one of the six basic emotion labels (happiness, sadness, anger, disgust, surprise, and fear). Each sequence begins with a neutral facial expression and ends at the apex. Expressions are simultaneously captured in visible light and near infrared. The varying lighting conditions influence the recognition process.

MMI (macro expression dataset) includes 213 sequences from 30 subjects. The subjects were instructed to perform six expressions (happiness, sadness, anger, disgust, surprise, and fear). Subjects are free in their head movements and the expressions show similarities with in-the-wild settings. Compared with CK+ and Oulu-CASIA, due to the more important head pose variations of the subjects, MMI is more challenging.

AFEW (macro expression dataset) contains sequences extracted from movies and is divided into three sets: Train (773 samples), Validation (383 samples) and Test (653 samples). In this experiment, we used the VReco sub-challenge data, which consists in classifying a sample audio-video clip into one of seven categories: anger, disgust, fear, happiness, neutral, sadness and surprise. Compared to the other selected macro expression datasets, AFEW is the most challenging one, as it presents close to real-world situations.

5.2 Micro expression

In this section, we show the experimental results on the CASME II and SMIC micro expression datasets, followed by a discussion and analysis of the results.

Experiments on CASME II. Table 2 compares our results with the major state-of-the-art micro expression methods. In our method, the optical flow is calculated from two consecutive frames without any magnification or temporal interpolation. For these experiments, we select only the activation part (i.e., onset to apex) of each sequence.

TABLE 2
Performances on CASME II dataset using LOSO (* data augmentation).

Method                 | Interpolation | Magnification | Acc(%)
Baseline [2]           | no            | no            | 63.41%
LBP-SIP [22]           | no            | no            | 67.21%
Deep feat. (CNN) [20]  | no            | no            | 47.30%
STLBP-IIP [23]         | no            | no            | 62.75%
DiSTLBP-IPP [23]       | no            | no            | 64.78%
HIGO [21]              | yes           | yes           | 67.21%
CNN + LSTM [24]        | no            | no            | 60.98% *
CNN + AUs + LSTM [14]  | no            | no            | 59.47% *
LMP                    | no            | no            | 70.20%
LMP                    | no            | yes           | 68.43%

In view of the results reported in Table 2, our method outperforms the other state-of-the-art methods, including handcrafted and deep learning methods (see the * lines), in all cases. Looking closely, some authors summarize videos in fewer frames [21]. Indeed, the time lapse between two frames in CASME II is very small as the dataset is recorded with a high-speed camera (at 200 fps). The short time lapse combined with the low expression intensity makes the distinction between noise and true facial motion very difficult.


In [21], a magnification process, which consists in interpolating the frequency in order to intensify the facial motion, is used. These techniques perform well in the presence of low intensity motion, but produce severe facial deformations in the presence of high intensity motions or head pose variations. Although magnification is of interest with descriptors such as LBP, this technique tends to reduce the performance of optical flow-based approaches. This is mainly due to the fact that the acquisition noise (low lighting changes) is also intensified and does not facilitate the measurement of the optical flow. Even though the deep learning methods [14], [24] employ data augmentation, their performances are lower than those of handcrafted methods. The performances obtained on the CASME II dataset show the ability of our method to deal with micro expression recognition in situations where no illumination changes appear. In the next paragraph, we evaluate our method on micro expressions in the presence of various illumination settings.

Experiments on SMIC. Table 3 compares the performance of the proposed method with that of major state-of-the-art methods on the SMIC dataset under three different acquisition conditions: sequences recorded by a high-speed camera at 100 fps (HS), sequences recorded by a normal color camera at 25 fps (VIS) and sequences recorded by a near infrared camera at 25 fps (NIR).

TABLE 3
Performances on SMIC dataset using LOSO (* data augmentation).

Method                    | Magnification | SMIC-HS  | SMIC-VIS | SMIC-NIR
LBP-TOP [34]              | no            | 48.78%   | 52.11%   | 38.03%
Deep feat. (CNN) [20]     | no            | 53.60% * | 56.30% * | N/A
Facial Dynamics Map [39]  | no            | 54.88%   | 59.15%   | 57.75%
HIGO [21]                 | no            | 65.24%   | 76.06%   | 59.15%
HIGO [21]                 | yes           | 68.29%   | 81.69%   | 67.61%
LMP                       | no            | 67.68%   | 86.11%   | 80.56%
LMP                       | yes           | 67.42%   | 83.12%   | 78.45%

Our method outperforms the state-of-the-art methods, including handcrafted and deep learning methods, in all cases when no magnification is applied. We obtain comparable performance on the SMIC-HS subset when magnification is applied. Indeed, Li et al. [21] show that artificially amplifying the motion tends to improve the results for micro expression recognition. However, as our aim is to offer a unified micro and macro expression recognition solution, interpolating the video frequency cannot be appropriately generalized to macro expressions. The results obtained on the SMIC dataset show good performance for micro expression recognition under near infrared and natural illumination settings. Our method based on optical flow seems to fit the near infrared condition much better than other dynamic methods.

5.3 Macro expression

We study the ability of our method to recognize macro expressions on the CK+, Oulu-CASIA and MMI datasets, dealing respectively with variations in temporal activation sequences, illumination variations and small head movements. We also consider the AFEW dataset, containing in-the-wild data, in order to study the behaviour of our method in such settings without using any complex pre-processing.

Experiments on CK+. Table 4 compares the performance of the proposed method with major state-of-the-art methods on the CK+ dataset. We use the most representative subset of the CK+ dataset, which contains 327 sequences and 7 expressions, to evaluate the performance of our method.

TABLE 4
Performances on CK+ dataset using the 10-fold cross-validation protocol on 327 sequences (* data augmentation).

Method                | Acc(%)
Dis-ExpLet [40]       | 95.10%
RBM-based model [41]  | 95.66%
PHRNN-MSCNN [15]      | 98.50% *
DTAGN (joint) [16]    | 97.25% *
LMP                   | 96.94%
LMP + Geom. feat.     | 97.25%

Compared to handcrafted approaches [40], [41], our method based only on optical flow obtains competitive results (96.94%). Despite the noise contained in the original optical flows, the variation in sequence length and the expression activation patterns, the joint analysis of magnitudes and orientations keeps only the pertinent motion.

Inspired by the improvements obtained by hybrid approaches, we combine motion features with geometric features by exploiting the shape of the facial ROIs in the apex frame. The combination of geometric and LMP features slightly improves the results (97.25%). The results of recent deep learning approaches [15], [16] obtained on CK+ are comparable with the best results that we obtained using a handcrafted approach. Handcrafted approaches consider only the initial data and hence are more sustainable, as a limited quantity of data is required for training. The performance achieved using only the initial data is well positioned with regard to the augmented settings. This proves the discriminant power of the LMP features.

The facial segmentation model plays an important role in characterizing the local facial movement globally. The segmentation model can be subject to landmark detection errors. In order to quantify the effect of landmark detection and epicenter computation errors, we conduct a series of experiments where landmarks are randomly affected by small to large errors. Three landmark noise levels were applied: the landmark locations were randomly shifted by ±0.5%, ±5% and ±10% in relation to the size of the face. The results obtained are respectively 96.02%, 95.71% and 94.18%. Although performance tends to decrease as the noise becomes more important, it remains relatively stable.

Experiments on Oulu-CASIA. Table 5 compares the performance of our method with major state-of-the-art methods on the Oulu-CASIA dataset under normal illumination and near infrared settings. The majority of the approaches evaluated on the Oulu-CASIA dataset take into account only the data under normal illumination conditions (VL). Performances on near infrared (NI) sequences are reported in [36].

TABLE 5
Performances on Oulu-CASIA dataset using the 10-fold cross-validation protocol on 480 sequences (* data augmentation).

Method                | VL-Acc(%) | NI-Acc(%)
LBP-TOP [12]          | 68.13%    | -
AdaLBP [36]           | 73.54%    | 72.09%
LBP-TOP + Gabor [42]  | 74.37%    | -
Dis-ExpLet [40]       | 79.00%    | -
DTAGN (joint) [16]    | 81.46% *  | -
PHRNN-MSCNN [15]      | 86.25% *  | -
FN2EN [9]             | 87.71% *  | -
LMP                   | 75.13%    | 81.88%
LMP + Geom. feat.     | 84.58%    | 81.49%

Under various illumination settings, our method achieves better results than handcrafted approaches [12], [40], [42] and is competitive with recent deep learning approaches [9], [15], [16]. The performances obtained using LMP in the near infrared domain outperform those of [36] (81.88%). According to the results, the combination of motion and geometric features clearly improves the performance (84.58%) in the VL setting, where our method obtains competitive results. Under NI settings, the LMP features alone perform best, due to their robustness to poor landmark detection, which negatively impacts the solution combining LMP and geometric features.

Experiments on MMI. Table 6 compares the performance of recent state-of-the-art methods on the MMI dataset. We have selected only the activation sequence (i.e., neutral to apex) for 205 sequences. The combination of motion and geometric features improves the performance (78.26%) and outperforms the other handcrafted approaches [12], [40], [42], [43]. Compared to deep learning approaches, our approach performs better than [10], [16] and obtains competitive results with [15].

TABLE 6
Performances on MMI dataset using the 10-fold cross-validation protocol (* data augmentation).

Method                | Acc(%)
LBP-TOP [12]          | 59.51%
LBP-TOP + Gabor [42]  | 71.92%
CSPL [43]             | 73.53%
Dis-ExpLet [40]       | 77.60%
DTAGN (joint) [16]    | 70.24% *
PHRNN-MSCNN [15]      | 81.18% *
LMP                   | 74.40%
LMP + Geom. feat.     | 78.26%

Experiments on AFEW. Table 7 compares the performance of recent state-of-the-art methods on the VReco sub-challenge of the AFEW dataset. To deal with head pose variations, we use the same affine registration as proposed in the baseline [44]. In view of the performance obtained by all the approaches, it can be seen that the existing solutions are not very robust on data acquired in-the-wild. Although LMP does not give the best performance compared to deep learning approaches, it is better than the LBP-TOP used in the baseline. In this context, deep learning based approaches tend to give better performance because they have the ability to fit better to data specificities (head pose variations, illumination changes, important head movements).

TABLE 7
Performances on AFEW dataset (* data augmentation).

Method        | Acc(%)
LBP-TOP [44]  | 41.07%
LSTM [45]     | 58.81% *
CNN-RNN [46]  | 59.02% *
LMP           | 49.16%

In the next section, we synthesize the results and highlight the capacity of the proposed facial expression recognition framework and of the underlying LMP feature to deal in a unified manner with the various challenges brought by micro and macro expressions.

5.4 Micro and macro expression evaluation synthesis

Table 8 summarizes the most relevant results with rep-resentative state-of-the-art methods on micro and macroexpressions using unaltered versions of the datasets andthe same evaluation protocols (10-fold cross validation formacro-expression and LOSO for micro-expression).

Results show that the proposed method has the singu-larity of dealing in a unified manner with both micro andmacro expressions challenges. The method outperforms mi-cro expression state-of-the-art methods. Overall, on average,our method performs 4.94% better than the best handcraftedapproaches for each dataset and 14.18% better than learning-based approaches. Furthermore, we obtain very competitiveresults for macro expression recognition, whether it is undervarying illumination condition or in presence of small headpose variations. Our method performs on average 5.17%better than the best handcrafted approaches for each dataset.Learning-based approaches using data augmentation per-form on average 2.19% better than our approach whenused in situation where no face normalization is required,as it is the case for CK+, Oulu-Casia and MMI datasets.For the specific case of AFEW, where important head posevariations occurs, the performances of our method are rela-tively low with regard to learning methods, but still highwith regard to handcrafted methods. We did not addedany complex pre-processing stages required by datasets asAFEW in order to keep a uniform framework for all cases.

Although recent deep learning methods achieve better performances for macro expression recognition, it is important to emphasize the relevance of a unified method that can characterize efficiently both micro and macro expressions. Proposing a unified approach capable of dealing with very large intensity variations opens the way to the coverage of the full range of facial expression intensities.

The parameters used to assess LMP performances on each dataset are given in Table 9. LMP settings vary slightly depending on the dataset, underlining the generalization capacity of the unified approach to deal with macro and micro expression specificities. Most of the time, the variations are due to the acquisition conditions (distance to the camera, resolution, frame rates).
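For reproducibility, the per-dataset settings of Table 9 can be kept in a small configuration structure. The sketch below is only one possible way to organise them; the field names mirror the symbols of Table 9 (their meaning is defined in the description of the LMP feature), and only two datasets are shown.

from dataclasses import dataclass

@dataclass
class LMPConfig:
    # Field names follow the symbols used in Table 9.
    lam: int      # lambda
    delta: float  # Delta
    rho: float
    E: int
    M: int
    V: int
    beta: int
    bins: int

CONFIGS = {
    'CASME II': LMPConfig(lam=4, delta=0.5, rho=0.75, E=100, M=4, V=5, beta=6, bins=9),
    'CK+':      LMPConfig(lam=3, delta=0.5, rho=1.0,  E=100, M=4, V=5, beta=3, bins=12),
}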

Results obtained for micro and macro expressions demonstrate the efficiency and robustness of our contribution, which stands as a good candidate for challenging contexts (e.g. variations in head movements, illumination, activation patterns and intensities).

6 CONCLUSION

The main contributions of our paper are articulated around three axes. The first one is an innovative Local Motion Patterns (LMP) feature that measures the temporal physical phenomena related to skin elasticity during facial expressions. The second one is a unified recognition approach for both macro and micro expressions.


TABLE 8
Performance synthesis on all datasets (* data augmentation).

                              Micro expression                          Macro expression
Method                        CASME II  SMIC-HS  SMIC-VIS  SMIC-NIR    CK+ (7 classes)  CASIA-VL  CASIA-NI  MMI     AFEW

Handcrafted
LBP-TOP [12]                  -         -        -         -           -                68.13%    -         59.51%  -
LBP-TOP + Gabor [42]          -         -        -         -           -                74.37%    -         71.92%  -
AdaLBP [36]                   -         -        -         -           -                73.54%    72.09%    -       -
Dis-ExpLet [40]               -         -        -         -           95.10%           79.00%    -         77.60%  -
HIGO + magnification [21]     67.21%    68.29%   81.69%    67.61%      -                -         -         -       -
LBP-TOP [44]                  -         -        -         -           -                -         -         -       41.07%
LMP                           70.20%    67.68%   86.11%    80.56%      97.25%           84.58%    81.46%    78.26%  49.16%

Deep learning
CNN + LSTM [24] *             60.98%    -        -         -           -                -         -         -       -
Deep feat. (CNN) [20] *       47.30%    53.60%   56.30%    -           -                -         -         -       -
CNN + AUs + LSTM [14] *       59.47%    -        -         -           -                -         -         -       -
PHRNN-MSCNN [15] *            -         -        -         -           98.50%           86.25%    -         81.18%  -
FN2EN [9] *                   -         -        -         -           -                87.71%    -         -       -
CNN-RNN [46] *                -         -        -         -           -                -         -         -       59.02%

TABLE 9
Parameter settings used for assessing the best results.

Dataset      λ    ∆    ρ     E    M   V   β   bin

Micro
CASME II     4   0.5  0.75  100   4   5   6    9
SMIC-HS      3   0.5  0.75  100   3   5   6    9
SMIC-VIS     5   0.5  0.75  100   4   5   3    9
SMIC-NIR     4   0.5  0.75  100   3   5   3   12

Macro
CK+          3   0.5  1     100   4   5   3   12
CASIA-VL     4   0.5  1     100   5   5   3    6
CASIA-NI     5   0.5  0.75  100   5   5   6    9
MMI          3   0.5  1     100   4   5   6   12
AFEW         3   0.5  1     100   4   5   3   12

The spatio-temporal features, extracted from videos, encode motion propagation within local motion regions situated near expression epicenters. As motion is inherent to any facial expression, our method is naturally suitable to deal with all expressions that cause facial skin deformation. The third one is related to the potential and suitability of our method to meet in-the-wild requirements. We obtain good performances under various illumination conditions (near-infrared and natural) for both micro and macro expression recognition.

The method outperforms micro expression state-of-the-art methods on CASME II (70.20%) and SMIC-VIS (86.11%). Furthermore, we obtain competitive results for macro expression recognition (97.25% for CK+, 84.58% for Oulu-CASIA and 78.26% for MMI). However, on data acquired under natural conditions, as in the AFEW dataset, further efforts are still needed (49.16%). The important global head motions overcome the local motion characterizing facial expressions. Specific pre-processing steps, such as those illustrated in [47], are required in order to address the challenges brought by large head movements.

Although our contribution unifies the micro and macro expression domains, other challenges such as dynamic backgrounds, occlusions, non-frontal poses and important head movements are still to be addressed. For example, let us consider the challenge of expression recognition in the presence of important head movements. Although dynamic texture approaches perform well when analyzing facial expressions in near frontal view, recognition of dynamic textures in the presence of head movements remains a challenging problem. Indeed, dynamic textures must be well segmented in space and time. However, we believe that registration based on facial components or shape is not adapted to dynamic approaches. Such registrations cause facial deformations

and induce noisy motion [47]. We believe that a suitable relationship between motion representation and registration is the key to expression recognition in the presence of head movements.

REFERENCES

[1] P. Ekman and E. L. Rosenberg, What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press, USA, 1997.

[2] W.-J. Yan, X. Li, S.-J. Wang, G. Zhao, Y.-J. Liu, Y.-H. Chen, and X. Fu, "CASME II: An improved spontaneous micro-expression database and the baseline evaluation," PloS one, vol. 9, no. 1, 2014.

[3] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, "Web-based database for facial expression analysis," in ICME, 2005, pp. 5–pp.

[4] W.-J. Yan, Q. Wu, J. Liang, Y.-H. Chen, and X. Fu, "How fast are the leaked facial expressions: The duration of micro-expressions," Journal of Nonverbal Behavior, vol. 37, no. 4, pp. 217–230, 2013.

[5] W.-J. Yan, S.-J. Wang, Y.-J. Liu, Q. Wu, and X. Fu, "For micro-expression recognition: Database and suggestions," Neurocomputing, vol. 136, pp. 82–87, 2014.

[6] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," PAMI, vol. 24, no. 7, pp. 971–987, 2002.

[7] R. A. Khan, A. Meyer, H. Konik, and S. Bouakaz, "Human vision inspired framework for facial expressions recognition," in ICIP, 2012, pp. 2593–2596.

[8] A. T. Lopes, E. de Aguiar, A. F. De Souza, and T. Oliveira-Santos, "Facial expression recognition with convolutional neural networks: Coping with few data and the training sample order," Pattern Recognition, vol. 61, pp. 610–628, 2017.

[9] H. Ding, S. K. Zhou, and R. Chellappa, "FaceNet2ExpNet: Regularizing a deep face recognition net for expression recognition," in FG. IEEE, 2017, pp. 118–126.

[10] A. Mollahosseini, D. Chan, and M. H. Mahoor, "Going deeper in facial expression recognition using deep neural networks," in WACV. IEEE, 2016, pp. 1–10.

[11] J. N. Bassili, "Emotion recognition: the role of facial movement and the relative importance of upper and lower areas of the face," Journal of Personality and Social Psychology, vol. 37, no. 11, pp. 2049–2058, 1979.

[12] G. Zhao and M. Pietikainen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," PAMI, vol. 29, no. 6, pp. 915–928, 2007.

[13] D. Fortun, P. Bouthemy, and C. Kervrann, "Optical flow modeling and computation: a survey," Computer Vision and Image Understanding, vol. 134, pp. 1–21, 2015.

[14] R. Breuer and R. Kimmel, "A deep learning perspective on the origin of facial expressions," in CVPR, Honolulu, June 21-26, 2017.

[15] K. Zhang, Y. Huang, Y. Du, and L. Wang, "Facial expression recognition based on deep evolutional spatial-temporal networks," Transactions on Image Processing, vol. 26, no. 9, pp. 4193–4203, 2017.

[16] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim, "Joint fine-tuning in deep neural networks for facial expression recognition," in ICCV, 2015, pp. 2983–2991.

[17] I. Kotsia, S. Zafeiriou, and I. Pitas, "Texture and shape information fusion for facial expression and facial action unit recognition," Pattern Recognition, vol. 41, no. 3, pp. 833–851, 2008.

[18] S. Jaiswal and M. Valstar, "Deep learning the dynamic appearance and shape of facial action units," in WACV, 2016, pp. 1–8.

[19] Y.-J. Liu, J.-K. Zhang, W.-J. Yan, S.-J. Wang, G. Zhao, and X. Fu, "A main directional mean optical flow feature for spontaneous micro-expression recognition," Transactions on Affective Computing, vol. 7, no. 4, pp. 299–310, 2016.

[20] D. Patel, X. Hong, and G. Zhao, "Selective deep features for micro-expression recognition," in ICPR. IEEE, 2016, pp. 2258–2263.

[21] X. Li, X. Hong, A. Moilanen, X. Huang, T. Pfister, G. Zhao, and M. Pietikainen, "Reading hidden emotions: spontaneous micro-expression spotting and recognition," in CVPR, 2015, pp. 217–230.

[22] Y. Wang, J. See, R. C.-W. Phan, and Y.-H. Oh, "LBP with six intersection points: Reducing redundant information in LBP-TOP for micro-expression recognition," in ACCV, 2014, pp. 525–537.

[23] X. Huang, S. Wang, X. Liu, G. Zhao, X. Feng, and M. Pietikainen, "Spontaneous facial micro-expression recognition using discriminative spatiotemporal local binary pattern with an improved integral projection," CVPR, 2016.

[24] D. H. Kim, W. Baddar, J. Jang, and Y. M. Ro, "Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition," Transactions on Affective Computing, issue 99, 2017.

[25] X. Fan and T. Tjahjadi, "A dynamic framework based on local Zernike moment and motion history image for facial expression recognition," Pattern Recognition, vol. 64, pp. 399–406, 2017.

[26] S. Happy and A. Routray, "Automatic facial expression recognition using features of salient facial patches," Affective Computing, vol. 6, no. 1, pp. 1–12, 2015.

[27] B. Jiang, B. Martinez, M. F. Valstar, and M. Pantic, "Decision level fusion of domain specific regions for facial action recognition," in ICPR, 2014, pp. 1776–1781.

[28] H. Sadeghi, A.-A. Raie, and M.-R. Mohammadi, "Facial expression recognition using geometric normalization and appearance representation," in MVIP. IEEE, 2013, pp. 159–163.

[29] C. Shan, S. Gong, and P. W. McOwan, "Facial expression recognition based on local binary patterns: A comprehensive study," Image and Vision Computing, vol. 27, no. 6, pp. 803–816, 2009.

[30] D. Ghimire and J. Lee, "Geometric feature-based facial expression recognition in image sequences using multi-class AdaBoost and support vector machines," Sensors, vol. 13, no. 6, pp. 7714–7734, 2013.

[31] B. Allaert, I. M. Bilasco, and C. Djeraba, "Consistent optical flow maps for full and micro facial expression recognition," in VISAPP, 2017, pp. 235–242.

[32] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid, "EpicFlow: Edge-preserving interpolation of correspondences for optical flow," in CVPR, 2015, pp. 1164–1172.

[33] G. Farneback, "Two-frame motion estimation based on polynomial expansion," in SCIA. Springer, 2003, pp. 363–370.

[34] X. Li, T. Pfister, X. Huang, G. Zhao, and M. Pietikainen, "A spontaneous micro-expression database: Inducement, collection and baseline," in FG. IEEE, 2013.

[35] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," in CVPRW. IEEE, 2010, pp. 94–101.

[36] G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. Pietikainen, "Facial expression recognition from near-infrared videos," Image and Vision Computing, vol. 29, no. 9, pp. 607–619, 2011.

[37] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, "Collecting large, richly annotated facial-expression databases from movies," IEEE MultiMedia, vol. 19, no. 3, pp. 34–41, July 2012.

[38] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM TIST, vol. 2, no. 3, p. 27, 2011.

[39] F. Xu, J. Zhang, and J. Z. Wang, "Microexpression identification and categorization using a facial dynamics map," Transactions on Affective Computing, vol. 8, no. 2, pp. 254–267, 2017.

[40] M. Liu, S. Shan, R. Wang, and X. Chen, "Learning expressionlets via universal manifold model for dynamic facial expression recognition," Transactions on Image Processing, vol. 25, no. 12, pp. 5920–5932, 2016.

[41] S. Elaiwat, M. Bennamoun, and F. Boussaid, "A spatio-temporal RBM-based model for facial expression recognition," Pattern Recognition, vol. 49, pp. 152–161, 2016.

[42] L. Zhao, Z. Wang, and G. Zhang, "Facial expression recognition from video sequences based on spatial-temporal motion local binary pattern and Gabor multiorientation fusion histogram," Mathematical Problems in Engineering, 2017.

[43] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, and D. N. Metaxas, "Learning active facial patches for expression analysis," in CVPR. IEEE, 2012, pp. 2562–2569.

[44] A. Dhall, R. Goecke, S. Ghosh, J. Joshi, J. Hoey, and T. Gedeon, "From individual to group-level emotion recognition: EmotiW 5.0," in Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 2017, pp. 524–528.

[45] V. Vielzeuf, S. Pateux, and F. Jurie, "Temporal multimodal fusion for video emotion classification in the wild," in Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 2017, pp. 569–576.

[46] Y. Fan, X. Lu, D. Li, and Y. Liu, "Video-based emotion recognition using CNN-RNN and C3D hybrid networks," in Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 2016, pp. 445–450.

[47] B. Allaert, J. Mennesson, I. M. Bilasco, and C. Djeraba, "Impact of the face registration techniques on facial expressions recognition," Signal Processing: Image Communication, vol. 61, pp. 44–53, 2018.

Benjamin Allaert received his MS degree in Image, Vision and Interaction and his Ph.D. in Computer Science on the analysis of facial expressions in video flows from the University of Lille, France. He is currently a research engineer at the Computer Science Laboratory in Lille (CRIStAL). His research interests include computer vision and affective computing, and his current focus of interest is the automatic analysis of human behavior.

Ioan Marius Bilasco has been an Assistant Professor at the University of Lille, France, since 2009. He received his MS degree on multimedia adaptation and his Ph.D. on semantic adaptation of 3D data in Computer Science from the University Joseph Fourier in Grenoble. In 2008, he joined the Computer Science Laboratory in Lille (CRIStAL, formerly LIFL) as an expert in metadata modeling activities. Since then, he has extended his research to facial expressions and human behavior analysis.

Chaabane Djeraba obtained his MS and Ph.D. degrees in Computer Science from, respectively, the Pierre Mendes France University of Grenoble (France) and the Claude Bernard University of Lyon (France). He then became an Assistant and Associate Professor in Computer Science at the Polytechnic School of Nantes University, France. Since 2003, he has been a full Professor at the University of Lille. His current research interests cover the extraction of human behavior related information from videos, as well as multimedia indexing and mining.