Universitat Autònoma de Barcelona · Universitat Autònoma de Barcelona ... a wise combination of deformable models captures rigid and non-rigid movements of diﬀerent kinematics;

Universitat

Autònoma

de Barcelona

Human Emotion Understanding On Facial

Image Sequences

A dissertation submitted by Javier Orozcoat Universitat Autònoma de Barcelona to ful-fil the degree of PhD in Computer Sci-ence.

Barcelona, July 2009

Director: Jordi Gonzàlez i SabatéUniversitat Autònoma de BarcelonaComputer Vision Centre

Co-director: F. Xavier RocaUniversitat Autònoma de BarcelonaComputer Vision Centre

Centre de Visióper Computador

This document was typeset by the author using LATEX2ε.

The research described in this book was carried out at the Computer VisionCentre, Universitat Autònoma de Barcelona.

Copyright c© 2009 by Javier Orozco. Permission is granted to make anddistribute verbatim copies of this thesis provided that the copyright notice,this permission notice, and the good use right are preserved on all copies.Permission is granted to copy and distribute modified versions of this documentunder the conditions for verbatim copying, provided that this copyright noticeis included exactly as in the original, and that the entire resulting derivedwork is distributed under the terms of a permission notice identical to this one.Permission is granted to copy and distribute translations of this document intoanother language, under the above conditions for modified versions.

Good Use Right: it is strictly prohibited to use, to investigate or to develop,in a direct or indirect way, any of the scientific contributions of the authorcontained in this work by any army or armed group in the world, for militarypurposes and for any other use which is against human rights or the environ-ment, unless a written consent of all the persons in the world is obtained.

ISBN-13: 978-84-936529-3-7ISBN-10:

Printed by Ediciones Gráficas Rey, S.L.

"The present letter is a very long one, simply because I had no leisure

to make it shorter".

–Blaise Pascal

Acknowledgements

Estoy altamente en deuda con el Dr. Jordi Gonzàlez ("Poal"), Dr. Xavier Roca("Xavi") y el Profesor Juan José Villanueva por la valiosa oportunida que mehan brindado. Poal quien me ha provisto con la ambisión por la produccióncientífica y las guías adecuadas que fueron necesarias para desarrollarme comodoctor. Xavi ha la mano fuerte en las decisiones dificiles, cuan prestó atencióna mis problemas, brindo consejo ante tales situaciones y oídos sordos cuandolos momentos adevrsos requirieon de su prudencia.

Me considero afortunado por haber trabajado con tan valiosas personascuyo conocimiento y experiencia en el area de Visión por Computador hanbeneficiado a mi trabajo. Ello me ha hayudado a crecer hacia un mejor desar-rollo profesional y entendimiento de mi area de especialización. Ellos motivaronmi deseo ambicioso por la producción científica. Estuve siempre motivado yapoyado para continuar desempeñando un buen trabajo, apuntado siempre ala cima y a disfrutar la carrera por el mejore premio, el conocimiento.

Finalmente, estaré siempre agradecido con mi familia, quienes me apoyaronpara aboandonar mi país y perseguir este anhelado sueño. Agradezco enorme-mente a Dios por mantenerme en los caminos que conducen al mejor porvenir,a la consecusión de una felicidad inconmensurable para mi y "mi hijo".

i

Abstract

Psychological evidence has emphasized the importance of affective behaviour under-standing due to its high impact in nowadays interaction humans and computers. Alltype of affective and behavioural patterns such as gestures, emotions and mentalstates are highly displayed through the face, head and body. Therefore, this thesis isfocused to analyse affective behaviours on head and face. To this end, head and facialmovements are encoded by using appearance based tracking methods. Specifically,a wise combination of deformable models captures rigid and non-rigid movements ofdifferent kinematics; 3D head pose, eyebrows, mouth, eyelids and irises are taken intoaccount as basis for extracting features from databases of video sequences. This ap-proach combines the strengths of adaptive appearance models, optimization methodsand backtracking techniques.

For about thirty years, computer sciences have addressed the investigation onhuman emotions to the automatic recognition of six prototypic emotions suggestedby Darwin and systematized by Paul Ekman in the seventies. The Facial ActionCoding System (FACS) which uses discrete movements of the face (called Actionunits or AUs) to code the six facial emotions named anger, disgust, fear, happy-Joy,sadness and surprise. However, human emotions are much complex patterns thathave not received the same attention from computer scientists.

Simon Baron-Cohen proposed a new taxonomy of emotions and mental stateswithout a system coding of the facial actions. These 426 affective behaviours aremore challenging for the understanding of human emotions. Beyond of classicallyclassifying the six basic facial expressions, more subtle gestures, facial actions andspontaneous emotions are considered here. By assessing confidence on the recognitionresults, exploring spatial and temporal relationships of the features, some methods arecombined and enhanced for developing new taxonomy of expressions and emotions.

The objective of this dissertation is to develop a computer vision system, includ-ing both facial feature extraction, expression recognition and emotion understandingby building a bottom-up reasoning process. Building a detailed taxonomy of humanaffective behaviours is an interesting challenge for head-face-based image analysismethods. In this paper, we exploit the strengths of Canonical Correlation Analysis(CCA) to enhance an on-line head-face tracker. A relationship between head pose andlocal facial movements is studied according to their cognitive interpretation on affec-tive expressions and emotions. Active Shape Models are synthesized for AAMs basedon CCA-regression. Head pose and facial actions are fused into a maximally corre-lated space in order to assess expressiveness, confidence and classification in a CBR

iii

iv

system. The CBR solutions are also correlated to the cognitive features, which allowavoiding exhaustive search when recognizing new head-face features. Subsequently,Support Vector Machines (SVMs) and Bayesian Networks are applied for learning thespatial relationships of facial expressions. Similarly, the temporal evolution of facialexpressions, emotion and mental states are analysed based on Factorized DynamicBayesian Networks (FaDBN).

As results, the bottom-up system recognizes six facial expressions, six basic emo-tions and six mental states, plus enhancing this categorization with confidence assess-ment at each level, intensity of expressions and a complete taxonomy.

Keywords: Appearance-Based Tracking;Facial Expression Recognition;EmotionAnalysis;Case-Based Reasoning;Support Vector Machines; Bayesian Networks;HiddenMarkov Models;Emotional Cognitive Mapping.

Topics: Image Processing;Computer Vision; Machine Vision Applications; Pat-tern Recognition; Emotions.

Resum

L’evidència psicològica ha emfasitzat la importància d’anàlisi de mirada en la interac-ció humano-ordenator i interpretació d’emocións. Per aquest objectiu, els algoritmesd’anàlisi d’imatge actuals ten en consideració l’moviment i detecció de d’parpella id’iris, utilitzant informació de color i detectors de contorn: Tanmateix, el movimentd’ull és ràpid i per això es difícil d’utilitzar per obtenir un precís i robust seguimentde las miradas. En canvi, el nostre mètode proposa descriure moviments de parpellai iris com variables contínues basats en l’aparença. Aquesta enfocament combina lesfortalesas de models d’aparença adaptables, mètodes d’optimització i tècniques deregressió inversa. Així, en el mètode proposat, les textures s’aprenen connectat deprop d’imatges frontals i canvis d’il·luminació, oclusions i moviments ràpids són acon-seguits. El mètode aconsegueix rendiment en temps real combinant dos rastrejadorsbasats en l’aparença a un algoritme de regresiò inversa per a estimació de parpelles i unaltre per a l’estimació d’iris. Aquestes contribucions representen un avenç significatiucap a una descripció de moviment de mirada fiable per a HCI i anàlisi d’expressió, onla força de metodologies complementàries són combinats evitar utilitzar imatges dequalitat altes, informació de color, entrenament de textura, escenes de càmeres i unsaltres processos que consumeixen temps.

Les ambients humanes proporcionen dóna l’entrada sobre de l’estat emocionald’un individu i juga un paper important en humà a comunicació no verbal humana.Això és possible però limitat completament expressions facials. La comunicació noverbal a través d’expressions és|està de gran importància en dia normal a dia interaccióentre humans com felicita comunicació verbal. Les expressions facials formen unafont potent d’informació en la comunicació no verbal com estat emocional d’individui poden proporcionar una indicació d’intencin o ajuden a regular comportaments.D’altra banda la interacció entre ordinadors i humans es limita a parla, textual ien alguna mesura visual. Per utilitzar la informació proporcionada per expressionsfacials i millorar la comunicació entre humans i ordinadors, és crític desenvolupar unsistema de visió informàtic que és fiable, corregeix i eficaçment fa un pont sobre elbuit de comunicació entre ordinadors i humans.

El focus principal en enfocaments de visió informàtiques que tracten amb l’anàlisid’expressió facial és sobre emocions que classifiquen a un conjunt molt petit. Elprimer treball fet utilitzant aquesta enfocament és un de Darwin. Un treball mésrecent basat en la mateixa aproximació|enfocament és aquell per Ekman. Segons elseu treball les emocions es poden classificar a sis emocions bàsiques; ràbia, fàstic, por,Alegria feliç, tristor i sorpresa. D’altra banda un sistema manual era desenvolupat

v

vi

per psicòlegs anomenats el Sistema de Codificació d’Acció Facial (FACS) que utilitzamoviments discrets de l’esfera (anomenada unitats d’Acció o AUs) per codificar ex-pressions facials. Un AU únic o una combinació d’AUs poden representar diversesemocions humanes. Encara que els humans poden produir moltes expressions facialsdiferents amb complexitat variable, intensities i amb significat desitjat diferent, compassos inicials el focus hauria de estar desenvolupant un sistema de visió informàticque pot com a mínim distingir entre el susdit sis emocions bàsiques.

Hi ha unes quantes contribucions de l’anàlisi d’imatge i reconeixement de formescap a la comprensió d’emoció humana. A una mà, l’expressió facial més popular queclassifiers tractin amb celles i llavis mentre eviten les parpelles. D’altra banda, aquestsmètodes tracten la classificació d’expressions facials forçades en comptes d’expressionsespontànies i subtils. Segons psicòlegs, la moció|moviment d’ull és pertinent per al’anàlisi de confiança i engany així com per dichotomizing expressions a prop facials. Adiferència d’enfocaments prèvies, incloem les parpelles construint un rastrejador basaten l’aparença. Posteriorment, una aproximació|enfocament de Raonament Basadaen Cas s’aplica entrenant la base de dades de base de cas amb les sis expressionsfacials proposades per Paul Ekman. Més enllà de de la classificació i la proximitatal grup|grup de sectors més proper que indiqui confiança, proporcionem un valor deconfiança de classificació i l’expressiveness de l’expressió facial analitzada. Per això,el sistema proposat produeix proporcions de classificació eficaces comparables a lamillor expressió facial prèvia classifiers. El seguiment basat en aparença i RaonamentBasat en Cas (CBR) combinació proporciona fidels solucions avaluant la confiança dela classificació celles-parpelles-boca.

En la quotidianitat, tanmateix, aquestes sis expressions bàsiques ocorren relati-vament infreqüentment, i l’emoció o propòsit és més sovint comunicada per canvissubtils a un o dos trets discrets, com estrènyer-se dels llavis que poden comunicarràbia. Els humans són capaços de milers que produeixen d’expressions que varienen la complexitat, intensitat, i significat. L’objectiu d’aquesta dissertació|tesi és de-senvolupar un sistema de visió informàtic, incloent-hi els dos extracció de tret facial,comprensió d’emoció d’identificació d’expressió construint un procés de raonament debaix a dalt.

Construint una taxonomia detallada de comportaments afectius humans és undesafiament interessant per a mètodes d’anàlisi d’imatge encapçala-esfera-basava. Enaquest traball, explotem les forces de Canònica Anàlisi de Correlació (CCA) per re-alçar un rastrejador d’esfera de cap connectat. Una relació entre posa de cap i movi-ments facials locals s’estudia segons la seva interpretació cognitiva sobre expressionsafectives i emocions. Els Models de Forma Actius se sintetitzen per a AAMs basatsen regressió de CCA. La posa de cap i accions facials es fonen a un espai relacionatde manera màxima per avaluar expressiveness, confiança i classificació en un sistemade CBR. Les solucions de CBR també es relacionen amb els trets cognitius, la qualcosa permet evitar recerca exhaustiva en reconèixer trets d’esfera de cap nous. Lesnostres contribucions se centren a síntesi d’ASM automàtica per a ABT, aprenentatgede trets d’esfera de cap relacionats per a la identificació d’expressió facial per CBR.A més, proposem un mapatge d’expressions cognitives més enllà de sis expressionsbàsiques. Per aplicar multiCCA i regressió, mostrem com aconseguir un coneixementmés profund d’expressió afectiva per construir i taxonomia extensa.

Contents

1 Introduction 11.1 Encoding Head And Facial Movements . . . . . . . . . . . . . . . . . . 11.2 Encoding Gaze Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Reasoning Spatio-Temporal Expression Relationships . . . . . . . . . . 41.4 Developing An Emotion taxonomy . . . . . . . . . . . . . . . . . . . . 61.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.5.1 Encoding Head and Facial Movements . . . . . . . . . . . . . . 91.5.2 Emotion Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Background and Literature Review 112.1 Facial Action Coding System . . . . . . . . . . . . . . . . . . . . . . . 11

2.1.1 Prototypic Emotional Expressions . . . . . . . . . . . . . . . . 122.2 Face Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.1 Face Databases for FER . . . . . . . . . . . . . . . . . . . . . . 142.2.1.1 FGnet DB . . . . . . . . . . . . . . . . . . . . . . . . 142.2.1.2 MMI DB . . . . . . . . . . . . . . . . . . . . . . . . . 152.2.1.3 Mind Reading DB . . . . . . . . . . . . . . . . . . . . 15

2.2.2 Face Databases for Emotion and Mental States Recognition(EMR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.3 Face Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.3.1 Eigenface and Template Matching . . . . . . . . . . . . . . . . 182.3.2 Rowley’s Framework . . . . . . . . . . . . . . . . . . . . . . . . 192.3.3 Viola’s Framework . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.4 Face Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.4.1 Curve Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.4.2 Active Shape Models . . . . . . . . . . . . . . . . . . . . . . . . 212.4.3 Shape Regularization Models . . . . . . . . . . . . . . . . . . . 22

2.5 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242.5.1 Tensorface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252.5.2 Potential Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262.5.3 Active Appearance Model . . . . . . . . . . . . . . . . . . . . . 262.5.4 Gabor Wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.5.5 Face Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.5.6 Eyelid and Iris Tracking . . . . . . . . . . . . . . . . . . . . . . 29

vii

viii CONTENTS

2.6 Facial Expression Recognition (FER) . . . . . . . . . . . . . . . . . . . 312.6.1 Case-Based Reasoning (CBR) . . . . . . . . . . . . . . . . . . . 322.6.2 Support Vector Machines (SVMs) . . . . . . . . . . . . . . . . 322.6.3 Statistical Learning . . . . . . . . . . . . . . . . . . . . . . . . 34

2.6.3.1 Bayesian Networks . . . . . . . . . . . . . . . . . . . . 352.7 Emotions and Mental States . . . . . . . . . . . . . . . . . . . . . . . . 37

2.7.1 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . 372.7.2 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . 382.7.3 Dynamic Bayesian Networks . . . . . . . . . . . . . . . . . . . 38

2.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3 Head and Face Feature Spaces 413.1 Appearance-Based Tracking . . . . . . . . . . . . . . . . . . . . . . . . 41

3.1.1 Statistical Appearance Modelling . . . . . . . . . . . . . . . . . 423.1.2 Appearance Estimation . . . . . . . . . . . . . . . . . . . . . . 433.1.3 Stability to Outliers . . . . . . . . . . . . . . . . . . . . . . . . 45

3.2 Active Appearance Modelling and Tracking . . . . . . . . . . . . . . . 463.2.1 Face Representation . . . . . . . . . . . . . . . . . . . . . . . . 463.2.2 ABT and AAM for Smooth Motion . . . . . . . . . . . . . . . 483.2.3 Experiments ABT in Real-Time . . . . . . . . . . . . . . . . . 49

3.2.3.1 Ground Truth Comparison . . . . . . . . . . . . . . . 503.3 Improving ABT for Rapid Motion Tracking . . . . . . . . . . . . . . . 53

3.3.1 Head, Eyebrows and Lips Tracker . . . . . . . . . . . . . . . . . 533.3.2 Eyelid Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . 543.3.3 Iris Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563.3.4 Head, Eyebrows, Mouth and Gaze Tracking . . . . . . . . . . . 573.3.5 Experiments Hierarchical ABT . . . . . . . . . . . . . . . . . . 59

3.3.5.1 Eyelid Tracking . . . . . . . . . . . . . . . . . . . . . 593.3.5.2 Iris Tracking . . . . . . . . . . . . . . . . . . . . . . . 593.3.5.3 Hierarchical Tracking . . . . . . . . . . . . . . . . . . 613.3.5.4 Gaze Tracking on Web Videos . . . . . . . . . . . . . 633.3.5.5 Lighting Conditions . . . . . . . . . . . . . . . . . . . 643.3.5.6 Occlusions and Real-Time . . . . . . . . . . . . . . . 663.3.5.7 Large Head Movements . . . . . . . . . . . . . . . . . 683.3.5.8 Translucent Textures . . . . . . . . . . . . . . . . . . 693.3.5.9 Low Size and Resolution of Images . . . . . . . . . . . 71

3.4 Face Search and Alignment . . . . . . . . . . . . . . . . . . . . . . . . 723.4.1 Face Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 723.4.2 Skin Colour Model . . . . . . . . . . . . . . . . . . . . . . . . . 723.4.3 Canonical Correlation Analysis (CCA) . . . . . . . . . . . . . . 723.4.4 Shape Model Synthesis for Face Alignment . . . . . . . . . . . 73

3.4.4.1 ASM synthesis by 2D-CCA . . . . . . . . . . . . . . . 733.4.4.2 Synthetic ASMs for Automatic ABT . . . . . . . . . . 743.4.4.3 Recursive CCA for Optimal Sample Selection . . . . . 74

3.4.5 Experiments ABT Initialization . . . . . . . . . . . . . . . . . . 753.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

CONTENTS ix

4 Spatial Analysis of Facial Actions 794.1 Spatial Knowledge Discovery for Facial Expression Recognition . . . . 80

4.1.1 CBR Representation . . . . . . . . . . . . . . . . . . . . . . . . 804.1.2 CBR Classification . . . . . . . . . . . . . . . . . . . . . . . . . 804.1.3 Confidence Assessment . . . . . . . . . . . . . . . . . . . . . . . 81

4.1.3.1 Confidence Estimators . . . . . . . . . . . . . . . . . . 824.1.4 Classification Confidence . . . . . . . . . . . . . . . . . . . . . 834.1.5 Confidence Classification . . . . . . . . . . . . . . . . . . . . . 854.1.6 CBR Training by Maintenance Polices . . . . . . . . . . . . . . 854.1.7 Experiments CBR . . . . . . . . . . . . . . . . . . . . . . . . . 87

4.1.7.1 Confidence Assessment . . . . . . . . . . . . . . . . . 884.1.7.2 CBR Maintenance . . . . . . . . . . . . . . . . . . . . 904.1.7.3 CBR Gaze Expression Recognition . . . . . . . . . . . 91

4.2 Cognitive Correlated Facial Actions . . . . . . . . . . . . . . . . . . . . 924.2.1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924.2.2 Exhaustive Mapping Of Intra-Correlated Expressions . . . . . 93

4.3 Dichotomizing Non-Linear Decision Surfaces . . . . . . . . . . . . . . . 944.3.1 Non-Linear Feature Spaces . . . . . . . . . . . . . . . . . . . . 954.3.2 SVM for Multi-Class FER . . . . . . . . . . . . . . . . . . . . . 964.3.3 Experiments SVM . . . . . . . . . . . . . . . . . . . . . . . . . 96

4.3.3.1 Stress Recognition . . . . . . . . . . . . . . . . . . . . 964.3.3.2 Fusion of SVM and CCA . . . . . . . . . . . . . . . . 974.3.3.3 Expression Recognition . . . . . . . . . . . . . . . . . 100

4.4 Statistical Expression Recognition . . . . . . . . . . . . . . . . . . . . 1014.4.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . 1014.4.2 FACS Discretization . . . . . . . . . . . . . . . . . . . . . . . . 1034.4.3 Bayesian Network Classifiers . . . . . . . . . . . . . . . . . . . 1044.4.4 Experiments TAN-BNs . . . . . . . . . . . . . . . . . . . . . . 105

4.4.4.1 Data Preparation . . . . . . . . . . . . . . . . . . . . 1054.4.4.2 Discretization . . . . . . . . . . . . . . . . . . . . . . 1064.4.4.3 FER by Seven TAN-BNs . . . . . . . . . . . . . . . . 107

4.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

5 Cognitive Emotion Analysis 1115.1 Reasoning The Cognitive Structure . . . . . . . . . . . . . . . . . . . . 1135.2 Probabilistic Learning of Cognitive Maps . . . . . . . . . . . . . . . . 1155.3 Inferring Expressions, Emotions and Mental States . . . . . . . . . . . 116

5.3.1 Mixture of HMMs . . . . . . . . . . . . . . . . . . . . . . . . . 1165.4 Behaviour Interpretation Based on Confidence . . . . . . . . . . . . . 1185.5 Experiments on Emotions and Mental States . . . . . . . . . . . . . . 119

5.5.1 Building the Spatial Taxonomy . . . . . . . . . . . . . . . . . . 1195.5.2 Probabilistic Recognition of Expressions in a Multinet . . . . . 1195.5.3 Emotion and Mental States Recognition . . . . . . . . . . . . . 1225.5.4 Confident Interpretation of Cognitive Behaviours . . . . . . . . 122

5.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

x CONTENTS

6 Conclusions and Future Research 1276.1 Face Motion Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . 1276.2 Facial Expression Recognition . . . . . . . . . . . . . . . . . . . . . . . 129

6.2.1 CBR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1296.2.2 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . 129

6.3 Expression and Emotion Interpretation . . . . . . . . . . . . . . . . . . 129

A Image Warping 131

B Publications 135B.1 Journals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135B.2 Conferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

Bibliography 139

List of Figures

1.1 Micro Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 Action Units AUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.2 Six Basic Expressions CMU DB . . . . . . . . . . . . . . . . . . . . . . 132.3 Six Basic Expressions FGnet DB . . . . . . . . . . . . . . . . . . . . . 142.4 Six Basic Expressions MMI DB . . . . . . . . . . . . . . . . . . . . . . 152.5 Six Basic Expressions Mind DB . . . . . . . . . . . . . . . . . . . . . . 162.6 Six Basic Emotions and Sequences . . . . . . . . . . . . . . . . . . . . 172.7 Rowley’s Face Detector . . . . . . . . . . . . . . . . . . . . . . . . . . 192.8 Active Shape Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . 232.9 Face Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242.10 Face Detection and Alignment . . . . . . . . . . . . . . . . . . . . . . 242.11 Tensor Faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252.12 Potential Net Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . 262.13 Active Appearance Modelling . . . . . . . . . . . . . . . . . . . . . . . 272.14 Gabor Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.15 Thresholding and Eyelid Detection . . . . . . . . . . . . . . . . . . . . 292.16 Eye Edge Detector With Optical Flow . . . . . . . . . . . . . . . . . . 292.17 Blinking Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302.18 Iris Extraction With HSI Colour Space . . . . . . . . . . . . . . . . . . 302.19 Separability Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332.20 SVM Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352.21 Bayesian Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362.22 HMM Represented as DBN . . . . . . . . . . . . . . . . . . . . . . . . 392.23 FaHMM and TsDBN Represented as FaDBN . . . . . . . . . . . . . . 40

3.1 Appearance Sequence Distribution . . . . . . . . . . . . . . . . . . . . 433.2 Generic 3D Shape Model . . . . . . . . . . . . . . . . . . . . . . . . . . 463.3 Input and Appearance Images . . . . . . . . . . . . . . . . . . . . . . . 483.4 FGnet Database for Face Tracking . . . . . . . . . . . . . . . . . . . . 503.5 Correct and Incorrect Tracking . . . . . . . . . . . . . . . . . . . . . . 513.6 Face Tracking vs. Ground Truth . . . . . . . . . . . . . . . . . . . . . 523.7 Ground Truth and Tracking . . . . . . . . . . . . . . . . . . . . . . . . 523.8 Ground Truth Fails on Blinking . . . . . . . . . . . . . . . . . . . . . . 53

xi

xii LIST OF FIGURES

3.9 Comparison of Eye Region Influence . . . . . . . . . . . . . . . . . . . 543.10 Eye Region Motion as Noise for Tracking . . . . . . . . . . . . . . . . 553.11 Eyelid and Iris AAMs . . . . . . . . . . . . . . . . . . . . . . . . . . . 553.12 Hierarchical Trackers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583.13 Eyelid Tracking Results . . . . . . . . . . . . . . . . . . . . . . . . . . 603.14 Iris Tracking Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613.15 Iris vs. Hierarchical Tracking . . . . . . . . . . . . . . . . . . . . . . . 623.16 Iris Tracker Fails Tracking Eyelids . . . . . . . . . . . . . . . . . . . . 633.17 Hierarchical Tracking on Facial Expressions . . . . . . . . . . . . . . . 643.18 Gaze Tracking on Web Videos . . . . . . . . . . . . . . . . . . . . . . . 653.19 Illumination Changes (Error Plot) . . . . . . . . . . . . . . . . . . . . 663.20 Illumination Changes (FACS Plot) . . . . . . . . . . . . . . . . . . . . 673.21 Tracking Under Occlusions . . . . . . . . . . . . . . . . . . . . . . . . 673.22 Occlusions and Real-Time . . . . . . . . . . . . . . . . . . . . . . . . . 683.23 Tracking in Real-Time . . . . . . . . . . . . . . . . . . . . . . . . . . . 693.24 Out-of-Plane Movements . . . . . . . . . . . . . . . . . . . . . . . . . . 703.25 Gaze Tracking and Sunglasses . . . . . . . . . . . . . . . . . . . . . . . 713.26 Gaze Tracking and Small Resolution . . . . . . . . . . . . . . . . . . . 713.27 Skin Colour Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . 733.28 ABT Initialization by CCA Regression . . . . . . . . . . . . . . . . . . 753.29 ABT Initialization by Synthetic Images of CCA . . . . . . . . . . . . . 76

4.1 CBR and Six Basic Expressions . . . . . . . . . . . . . . . . . . . . . . 814.2 Confidence Estimators Concept . . . . . . . . . . . . . . . . . . . . . . 824.3 Stability of Confidence Estimators . . . . . . . . . . . . . . . . . . . . 844.4 Confidence-Classification Thresholds . . . . . . . . . . . . . . . . . . . 864.5 Confidence Predictor Performance . . . . . . . . . . . . . . . . . . . . 894.6 Feature Space of Subtle Expressions . . . . . . . . . . . . . . . . . . . 954.7 Stress Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 974.8 PCA Projection of SVM Data . . . . . . . . . . . . . . . . . . . . . . . 984.9 Kernel of PCA,CCA and SVM . . . . . . . . . . . . . . . . . . . . . . 994.10 Virtual SVs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1004.11 Data Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1024.12 GMM for FACS Discretization . . . . . . . . . . . . . . . . . . . . . . 1034.13 TAN-BNs for FER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1054.14 BIC Criteria for FACS . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.1 Overview of System for Behaviour and Mental States . . . . . . . . . . 1135.2 Additional Cognitive Emotions from Anger . . . . . . . . . . . . . . . 1145.3 Cognitive Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1155.4 Multinet for Emotions and Mental States . . . . . . . . . . . . . . . . 1175.5 DBN Representation of Mixture of HMMs . . . . . . . . . . . . . . . . 1185.6 Factorising Mixture of HMMs . . . . . . . . . . . . . . . . . . . . . . . 118

A.1 Image Warping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131A.2 Weak Perspective Projection . . . . . . . . . . . . . . . . . . . . . . . 132

LIST OF FIGURES xiii

A.3 Piece Wise Affine Warping . . . . . . . . . . . . . . . . . . . . . . . . . 134

List of Tables

4.1 Recognition results Without Confidence . . . . . . . . . . . . . . . . . 894.2 Recognition Results With Confidence . . . . . . . . . . . . . . . . . . . 904.3 CBR Recognition After Training . . . . . . . . . . . . . . . . . . . . . 904.4 CBR Recognition Improved . . . . . . . . . . . . . . . . . . . . . . . . 914.5 Confidence-Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 924.6 CBR Recognition, FGnet and MMI . . . . . . . . . . . . . . . . . . . . 924.7 CBR Confidence and CCA . . . . . . . . . . . . . . . . . . . . . . . . . 934.8 CCA for Expression Recognition . . . . . . . . . . . . . . . . . . . . . 944.9 Stress Recognition Results . . . . . . . . . . . . . . . . . . . . . . . . . 994.10 FER With SVM and Virtual SVs . . . . . . . . . . . . . . . . . . . . . 1014.11 Seven TAN-BNs for FER . . . . . . . . . . . . . . . . . . . . . . . . . 1074.12 Full TAN-BN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1074.13 Constrained Discretization and Pruning TAN-BNs . . . . . . . . . . . 1084.14 Summary of FER Techniques . . . . . . . . . . . . . . . . . . . . . . . 109

5.1 Confidence for Emotions and Mental States . . . . . . . . . . . . . . . 1205.2 Expression Recognition by NBNs and TAN-BNs . . . . . . . . . . . . 1215.3 Emotion Recognition by FaMHMM . . . . . . . . . . . . . . . . . . . . 1235.4 Emotion and Mental State Interpretation by BCS . . . . . . . . . . . . 124

xv

List of Algorithms

1 Appearance-Based Tracking . . . . . . . . . . . . . . . . . . . . . . . . 492 Eyelid Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563 Iris Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574 Learning Confidence and Classification Thresholds . . . . . . . . . . . 87

xvii

Chapter 1

Introduction

Facial expressions provide cues about inner emotional state, and they play an impor-tant role in human to human non-verbal communication. The non-verbal communi-cation through expressions is of great significance in normal day to day interactionamong humans as it compliments verbal communication. Facial expressions form apowerful source of information in non-verbal communication such as emotional state ofan individual and can provide an indication of intent [33] or help in regulating behav-iours. On the other hand, the interaction between computers and humans is limitedto vocal, textual and to some extent visual information. To utilize the informationprovided by facial expressions and to improve the communication between humansand computers, it is critical to develop a computer vision system that is reliable, andefficiently bridges the communication gap between computers and humans.

The earliest model of emotions was proposed by Darwin [33]. A more recentmodel is that proposed by Paul Ekman [46]. According to his work, emotions canbe classified into six basic emotions: anger, disgust, fear, happiness-joy, sadness andsurprise. An alternative approach that was developed by psychologists is called theFacial Action Coding System (FACS) which is a manual system that uses discretemovements of the face (called Action units or AUs) to encode facial expressions. Asingle AU or a combination of AUs can represent various human emotions.

1.1 Encoding Head And Facial Movements

Analysing facial movements in video sequences is essential to vision-based computersystems. Head pose estimation and facial motion description require an optimal com-bination of accuracy and robustness, and are of interest to Human Computer Interac-tion (HCI), surveillance systems and expression analysis. Feature-based approachestry to detect facial features by applying edge detectors, image colour information,texture and shape patterns [120], whereas tracking methods aim to estimate based onstatistical models that use previously registered information from an image sequence[28].

In [111], the authors report a comprehensive study of robust feature tracking con-sisting of two models of image motion; translation for small inter-frame displacements

1

2 CHAPTER 1. INTRODUCTION

and affine image changes for large inter-frame displacements. The 3D head pose isinferred by registering the current texture map with a linear combination of texturewarping templates and orthogonal illumination templates [19]. Stable tracking wasachieved by a regularized and weighted least-squares minimization of the registrationerror.

Tracking simultaneously both the head pose and facial motion represents a dif-ficult task. Faces are non-rigid and their images have a high degree of variability inshape, texture, pose, and imaging conditions. Several contributions in the literatureaddress this combined challenge by using 3D deformable models and specifically Ac-tive Appearance Models (AAM). AAMs capture the variability in their appearance,which is given by a combination of shape and texture [29]. Principal ComponentAnalysis (PCA) is the most popular technique to build AAMs based on a set oftraining examples.

The tracking of non-rigid facial features such as eyebrows and the mouth is pos-sible by applying probabilistic approaches. A two-stage approach is developed for 3Dtracking of head pose and deformations in monocular image sequences [57]. Stablefacial tracking is obtained by learning the possible deformations of 3D faces fromstereo data and setting out an optical flow associated with the tracked features.

The main goal of tracking with AAMs is to create a statistical model to estimatetransition of appearances according to an observation model without spreading theregistration error. However, estimating eyebrows, mouth, eyelids and irises, representsa greater challenge because of their different kinematic properties, the spontaneouseyelid blinking and iris saccade movements. Eyelid and iris movements have not beenaccurately handled with AAMs.

Many approaches for facial feature detection have been extended to gaze motiontracking, i.e. eyelid and irises. There are methods such as deformable templatematching and synthetic texture modelling [38, 115], as well as thresholding, edgedetectors, segmentation, and Hugh transform [125, 76], for the detection of irises andeyelids.

According to Ahlberg [4], 3D head and face tracking involve a feature-basedtracker and active appearance models. This approach could provide accuracy butit is memory expensive, time-consuming and depending on image quality, which iscomputationally expensive for real-time applications.

In [40], the authors face the training problems of tracking by fitting a 3D de-formable face model with two AAMs. They use directed and heuristic searches basedon texture training and pre-computed gradient matrices to deal with fast and spon-taneous movements, scale variation, lighting conditions and significant large headmovements. This approach lacks robustness and it does not guarantee convergence.This face tracker may suffer drifting problems because of lack of convergence.

1.2 Encoding Gaze Motion

Moriyama et al [86], deal with three eyelid states namely open, closed and flutteringby constructing detailed texture templates. They estimate the head pose by usinga cylindrical mesh, combined with image stabilization to compensate for appearance

1.2. ENCODING GAZE MOTION 3

changes. Therefore, the system recovers the head position and then the eyelid andiris position by template matching. However, this approach does not deal with morefacial actions such as eyebrows and mouth, it requires a lot of training, and the eyelidstates are discrete variables.

For iris tracking, Tan and Zhang [60] applied segmentation and colour spacetransformations. They analyse a valley-peak field approach to obtain a binary imagefor the iris region. However, image conditions such as illumination and skin colour,strongly determine the accuracy of the detection results. Therefore, these methodsare difficult to generalize for different image and environment conditions.

The previously mentioned approaches have several drawbacks. Firstly, meth-ods based on imaging conditions, colour information, contours, templates and tex-ture training may not satisfy the real-time requirements for HCI applications. Sec-ondly, these approaches have addressed separately the head pose, eyebrows-mouthand eyelid-iris analysis. Only head, eyebrows and mouth are tracked by using AAMs.Still, gaze analysis has several contributions but only for frontal position faces and us-ing detection methods for tracking, which do not involve strong statistical modelling.

Gaze information is important for the psychological analysis of deceit, truth de-tection and emotion evaluation [63]. Ekman and Frisen [45] have already shown thatthere are perceptible human emotions, which can be detected early by analysing eyelidand iris motion, see Fig. 1.1.

Angry Sad Happy

Figure 1.1: Eye movements are spontaneous facial expressions which provideearly clues of emotional state according to some psychological studies [45].

Human Computer Interaction (HCI) applications require real-time gaze analy-sis. Existing techniques are evaluated according to robustness and accuracy. On onehand, gaze tracking is approached as an eyelid and iris detection problem by applyingedge detectors, Hough transform, optical flow and thresholding techniques [104, 125].These methods are time-consuming and depend on both the image quality and methodof acquisition, for example by IR cameras. On the other hand, restricted detailed tex-tures and templates have been proposed for template matching, skin colour detectionand image energy minimization [103, 86]. Moriyama et al. [86], for example, use threeeyelid states: open, closed and fluttering. They have created detailed templates ofskin textures for the eyelid, the iris and the sclera. Their approach needs trainingand texture matching.

Tan and Zhang [60] applied segmentation and colour space transformations foriris tracking by using a valley-peak field approach to obtain a binary image for theiris region. However, the accuracy of the results is strongly affected by illuminationand skin colour conditions. Therefore, it is difficult to use these methods for differentimage and environment conditions.

By constructing a low-dimensional representation of non-rigid objects, the Appearance-


Based Models (ABM) provide accurate statistical analysis of complex shapes. ABMsare commonly used for face tracking because of their robustness to handle changes inimaging conditions and different skin colours. However, they have not been used foreyelid and iris tracking as of yet.

The proposed gaze tracking method combines two Appearance-Based Trackers(ABT), one for eyelid and another one for iris tracking. The first one excludes thesclera and iris information, while achieving fast and accurate eyelid adaptation forblinking and fluttering motions. The second one is able to track iris movements andrecover the correct adaptation, even in cases of eyelid occlusions and iris saccademovements. Both trackers agree with the best 3D mesh pose that depends on thehead position. The head pose estimation enhances the system capabilities for eyelidand iris tracking in different head positions.

Appearance-textures are modelled as a multivariate normal distribution. TheGaussian parameters are estimated by a recursive filtering technique. This is an on-line learning process of facial textures. Once the expected appearance is calculated,the facial actions are estimated by applying gradient descent methods and backtrack-ing techniques. Thus, the algorithm converges faster to the best adaptation.

This method has several advantages over existing methods. Firstly, occlusions,illumination changes and faster saccade and blinking movements can be handled, andit is as well suitable for real-time applications. Existing methods, which are predomi-nantly created for medical image analysis, require specific image quality, training andcamera settings. Secondly, eyelid and iris movements are represented as continuousvariables according to the Facial Animation Parameters (FAP) of MPEG4, whereasprevious methods can handle only discrete variable states such as open, closed andfluttering.

1.3 Reasoning Spatio-Temporal Expression Relation-ships

Our behaviour, expressions and interactions are based on reasoning, emotions andgoals. Although there is not an evident boundary between reasoning and emotions,it is possible to distinguish between basic and cognitive expressions. Cognitive ex-pressions can be triggered reactions to a stimulus or they can be intentional. Onlya non-intrusive vision system may be able to assess further analyse such informa-tion. Moreover, an on-line behavioural taxonomy is required in order to aid humancomputer interaction tools and/or CCTV architecture.

Cognitive expressions are described by complex patterns involving the analysis ofmultiple information such as head motion, eye-gaze motion, and the common facialactions. The temporal analysis of facial expressions, in sequences of emotions, haveaided the reliable recognition of the six prototypical emotions proposed by Ekman[48]. However, Baron-Cohen have developed the interactive guide of emotions [10],which contains more complex mental states rendering asynchronous head gate motionand facial gestures.

There are several facial expression classifiers, [63], where the authors compare theperformance and accuracy of several Bayesian Networks recognizing six expressions.

1.3. REASONING SPATIO-TEMPORAL EXPRESSION RELATIONSHIPS5

Boosting and SVM are used in [77] to recognize their own prototypes of emotions.Likewise, Neural Networks, Gabor Wavelets, Bayesian Networks, LDA, NeighbourNetworks, Spectral clustering have been used with the same aim [95, 52]. Contrarily,head gestures have been rarely included for expression or emotion analysis. In [49],the authors combine head and local face information to infer subtle expressions byapplying feature tracking and DBN. Convolutional neural networks have been appliedin [51] to head-pose invariant estimation based on multi-scale facial features handlingonly planar head movements.

Canonical Correlation Analysis (CCA), has been applied recently [61], to headpose estimation and different types of kernels have been used for facial expressionrecognition and various other vision problems [130]. Borga [17] adopted CCA to findcorresponding points in stereo images. In the attempt to extend this approach tonon-linear relationships, 2D-CCA has been proposed [69] for correlation analysis onimages. Similarly, in [70] the authors use tensor methods for face shape recovery on4D image databases. The strengths behind CCA are attributed to its capability tofind the correlation between two sample spaces, while coping with noisy signals. Thecanonical vectors remain invariant under scaling or affine transformations.

A system for cognitive mapping of head gesture and facial expressions is proposed.Canonical correlation analysis is transversely applied for Active Shape Model synthesis(ASM), feature space transformation, optimal sample selection and fast multi-classmapping of cognitive expressions. An Appearance-Based Tracker (ABT) is improvedwith automatic initialization by learning with 2D-CCA the correlation between colourimages and ASM for AAM construction. A Case-Based Reasoning (CBR) engine isbuilt to assess confidence on mutual correlated head gesture and facial action features.The complex computation of retrieving data from the CBR database is avoided byperforming multi-class CCA. Correlated regression is used to estimate likely solutionssuch as expressiveness, confidence and expression.

This approach is similar to [130] since CCA is applied for facial expression recog-nition. However, cognitive expression analysis is achieved by a combination of 3Dhead pose and facial action estimation. Although the authors in [49] propose a com-plete structure for subtle expression recognition, they cannot easily extend such anapproach to more classes. This approach overcomes this limitation by building a widetaxonomy of cognitive expressions, and transitions of how to transit from one level toanother and to inherit knowledge by multiple correlation analysis.

Upper facial movements play an important role in the attempt to understandhuman emotions and cognitive behaviour for Human Computer Interaction (HCI).Psychological studies establish the relevance of detecting and interpreting earlier whenthe facial emotion arises [46]. Eyelid and eyebrow expressions provide important non-verbal cues in human communication. In this context, irises, eyelids and eyebrows areknown as gaze expression messengers to signal like, dislike, attentiveness, competence,dominance, credibility, intimacy and threat as well as regulating and initiating socialinteractions. Indeed, what Baron-Cohen et al., [11], describe as the ”language of theeyes” seems to be a significant channel for the communication of emotions and mentalstates. They showed that eyes could convey almost as much affective informationas the whole face, see Figure 1.1. Likewise, Ekman and Friesen [48] applied thesepsychological foundations to computer science by describing specific facial muscle


movements with a set of 44 action units (AUs), the Facial Action Coding System(FACS). Similarly, Paul Ekman [46] proposed six basic facial expressions categorizingthe corresponding internal emotions.

The goal is to detect the facial actions and their intensity for posterior classifi-cation. In this sense, several facial expression classifiers have been reported such as:Neural Networks, Gabor Wavelets, Bayesian Networks, LDA, SVM, Neighbour Net-works, etc. [52]. All of them differ in robustness, accuracy, requirements for trainingand effectiveness. They report an average effectiveness of 86%, but are less effec-tive with unseen subjects and more subtle expressions. Nonetheless, numeric scoreclassifiers do not provide a confidence measure for the classification.

CBR has been used for the facial expression recognition of a single subject [67],and the user-assisted learning process of action units [96], but has failed to provideaccurate solutions and robust processes for knowledge updating. Classification confi-dence in emotion recognition is a challenging task. In the literature, facial expressionrecognition is commonly addressed by classifying key features such as mouth, browsand wrinkles. However, subtle expressions are allocated near to the class boundariesrepresenting a serious limitation for the aforementioned classifiers. Because of this, ex-pression recognition is strengthened by a confidence assessment of the solution insteadof the sole nearness to the belonging class. Toward this end, Case-Based Reasoning(CBR) is used as it is more appropriate for complex and incomplete problem domainsthan eager methods, which replace the training data with abstractions obtained bygeneralization and which, in turn, require an excessive amount of training data.

Typical methods applied to expression recognition, based on key facial features,such as mouth, eyebrows and wrinkles, are Neural Networks, Gabor Wavelets, BayesianNetworks, LDA, Support Vector Machines, Neighbour Networks, etc., [52]. The afore-mentioned eager methods extract as much information as possible from training dataand construct a general approximation of the target function. Instead, lazy classifiers,like Case-Based Reasoning (CBR), have low cost of knowledge acquisition, given thatthe generalization is based on local approximations [3]. Therefore, the efficiency ofCBR depends on the retrieval of representative neighbours and the decision makingprocess.

This novel approach combines Appearance-Based Tracking (ABT), Case-BasedReasoning (CBR) and Confidence Assessment (CA). ABT allows encoding facial ac-tions by providing the temporal deformation of the face [90]. Subtle facial expressionsare recognized by including eyebrows, eyelids and lips as facial features. A set of con-fidence estimators allow for the extension of the solving capabilities to small andlarge clusters. A confidence indicator is proposed beyond the similarity score to thenearest class. Confidence assessment is achieving an average of 93% of correct classi-fication for unseen data. This confidence assessment is suitable for training as well asmaintenance strategies [37], while improving the solving capability of isolated cases.

1.4 Developing An Emotion taxonomy

Mental processes (cognition) are psychological transformations by which a humancan acquire, code and decode information about his/her environment, interaction

1.4. DEVELOPING AN EMOTION TAXONOMY 7

and communication. Cognitive maps represent a belief system of cognition, therebydescribing how people perceive, contextualize and make sense of their everyday inter-action [116].

Human behaviour assessment resembles a guessing process involving random in-terpretations and reasoning based on the observer’s experience, clarity of the evidenceand the empirical and theoretical knowledge of human behaviour understanding. Byusing cognitive maps it is possible to create a taxonomy of head-face gestures andstore their spatio-temporal relationships.

Considering human emotions as a communicative manner of behaviour [33], sixprototypic emotions have been traditionally considered to be represented by facialexpressions that highly reveal the nature of each emotion. However, our cognitioninvolves more complex emotions and intricate mental states which convey intentionsand other cognoscitive processes.

The majority of existing automated facial expression analysis systems either at-tempt to identify basic muscular activity in the human face (encoded as action unitsor AUs) [127, 97, 13] based on the Facial Action Coding System (FACS) [48], or therecognition of the set of basic emotions [63].

More challenging tasks are focused on extending the knowledge about the six pro-totypic emotions, and distinguishing cognitive emotions from cognoscitive behaviours.The former convey affective behaviours arisen from human interaction, communica-tion, spontaneous reactions and so on. The later refer to those behaviours and mentalstates that are the outcome of reasoning and goal achievement like provoking others’reactions or hiding the real instinctive feelings. Therefore, the complexity arises frombuilding a deeper structure that is automatically reading emotions and mental statesfrom the head-face gestures and can distinguish between expressions, emotions andmental states, which are inherently different. Facial expressions are signals embeddedon muscular configuration of the face; they are like isolated snapshots or unstructuredsequences of these. Emotions are states of mind that arise as a result of a specificexperience and drive us to take action, involving asynchronous information factorssuch as cognition, reaction, personality and so on. Cognoscitive behaviours or men-tal states correspond to reasoning processes for expressing emotions and messagesthrough gestures. They are intentional and relative to individual aptitude to expressthe desired gestures either to provoke others’ reactions or not to give feelings away.

Early theories on human behavioural signals expressing internal states, such asaffective states, proposed the existence of six basic emotions (happiness, anger, sad-ness, surprise, disgust and fear) that were universally displayed and recognized fromnon-verbal behavioural signals (especially facial expressions and gestures). However,mental states have other functions rather than social communication; through head-facial expressions humans express affection, intentionality, interaction and cognitivestates.

Although emotions exist in the mind and so are essentially unobservable, theyare frequently revealed through head and facial expressions, posture and gesture, andeye-gaze directions [11], even when the individual is trying their best not to givetheir feelings away. Even then, emotions usually drive our unconscious behaviour. Ingeneral, mental states refer to cognitive, affective and cognoscitive behaviour. Here inparticular, mental states refer to only cognitive and cognoscitive behaviour, as they


regard affective behaviour as emotion.

In order to build a computer system for expression, emotion and mental stateanalysis, it is mandatory to consider the head-facial signals as spatio-temporal de-pendencies. Single facial expressions can be easily recognized in isolation while theirsimilarity reveals the topology of expression space conveying emotions and mentalstates. However, their temporal interaction strengthen the interpretation of behav-ioural patterns rendering emotions and mental states [9].

Based on the aforementioned issues, a system is designed to extract and interpretthe facial muscular activity from video sequences. Thus, spatio-temporal configura-tion of key features and essential kinematic activities are directly encoded by usingAppearance-Based Tracking (ABT) methods [92]. Subsequently, a reasoning systemassesses confidence upon the spatial configuration of the expression classes. Thereby,it is possible to provide a reliable expression classification based on confidence [89]. Asabove mentioned, emotion and mental states require analysis of temporal expressioninteraction. Therefore, a probabilistic approach is adopted providing the predictionfor the given behaviour, emotions and mental states [87].

Case-Based Reasoning (CBR) has the advantages of generalizing classificationrules for small data neighbourhoods by acquiring incremental knowledge based onpreviously solved problems [3]. An enhanced knowledge of the decision surface isobtained by assessing confidence based on topological confidence estimators [22].

Facial expressions are recognized based on independently learning features’ cor-relations by factorizing the output classes of Tree Augmented Bayesian Networks(TAN). Likewise, spatio-temporal behaviours are inferred by Hierarchical FactorizedDynamic Bayesian Networks (FaDBN) which combine the strengths of DBNs andfactorized Hidden Markov Models (FaHMM). DBNs are suitable for user modellingfrom multi-dimensional information, they can handle incomplete data as well as un-certainty. They learn the correlation of the variables as well as they incorporateprior knowledge into learning. By factorizing the observation space, the complexityof learning is reduced. By dividing the states into layers a system is formed that canmodel several processes with independent dynamics and structures, which are looselycoupled.

Cognitive maps have been previously applied for stereo-vision matching [93] andemotional reasoning for autonomous virtual agents [7, 34]. Cognitive Mapping ofEmotions and Mental States (CMEMS) is achieved through the bottom up process.CBR provides a first spatial structure of the observation class based on confidence as-sessment and decision surface knowledge. FaTANs learn the probabilistic correlationsinside the observation space while establishing additional spatial links. FaDBNs learnindependent structures for behaviour inference by modelling the temporal correlationsamong expression classes. A final assessment of confidence combined with a winner-takes-all method based on Maximum Likelihood (ML) provides the interpretation ofa video sequence rendering emotion or mental state.

1.5. CONTRIBUTIONS 9

1.5 Contributions

1.5.1 Encoding Head and Facial Movements

The main contribution of this framework is the full 3D head and facial feature motiontracking by applying gradient methods, backtracking procedures and Appearance-Based Trackers (ABT). The problem of simultaneous estimations of head pose, eye-brows, lips, eyelids and irises in 3D is addressed. Image sequences recorded withmonocular cameras are used, and several challenges such as illumination changes, 3Drotations, translucent textures and occlusions are tackled.

The second contribution is to retain the strengths of previous work by combin-ing deterministic and stochastic methods with AAMs. The proposed method dealswith on-line learning of facial textures while estimating the motion parameters [131].This tracking method is accurate, stable, robust to occlusions and non-dependent ontraining.

Three AAMs model different kinematics for head-face, eyelids and irises. Subse-quently, an Observation Process models the AAMs as a single Multivariate NormalDistribution (MND). This assumption allows for the use of an adaptive observationmodel which recursively estimates the Gaussian parameters over time.

A State Transition Process computes the tracking parameters by adopting anadaptive velocity model. The minimum residual image between expected and esti-mated appearances which is achieved through iterative minimization, provides thebest solution. Although there are several optimization proposals, gradient descentmethods are commonly used, [20], in which the negative of the gradient the searchdirection. This method is slow for functions poorly conditioned and may suffer con-vergence instabilities.

There are several optimization methods. Line-search approaches that achieveminimization by directing the search towards the negative of the gradient [54]. An-other approach is the Barzilai-Borwein method [14], where the authors propose to usea different strategy to interpret the quasi-Newton methods in a simpler manner.

Enhancing the second contribution, it is proposed to tackle the non-linear LeastSquares Minimization (LSM) problem in the state transition process by applyingthe Levenberg-Marquardt Algorithm (LMA) [71, 78]. The LMA is combined withbacktracking procedures to avoid local minima and to find the optimal damping factor[6]. Thus, it is possible to estimate the eyelid and iris movements by using AAMs.

The third contribution is the outcome of the efficient combination of accuracyand robustness to satisfy the real-time requirements of the system and consists of twostages. Firstly, the combination of both the observation and state transition processeswith Huber’s function [62, 25]in order to lessen the influence of outlier pixels and tohandle occlusions. Scale variation, lighting conditions, occlusions and out-of-planemovements are handled as outlier pixels [19]. Secondly, a hierarchical appearancetracker exploits the different strengths of different appearance sizes and differences inperformance in order to achieve the real-time.

This method is suitable for fluid stream of images from monocular cameras, withframe rates higher than 15 frames per second. Otherwise, the method should beadjusted for detection tasks, which is not the purpose of this work.


This method is suitable for fluid stream of images from monocular cameras, withframe rates higher than 15 frames per second. Otherwise, the method should beadjusted for detection tasks, which is not the purpose of this work.

1.5.2 Emotion Analysis

The main contributions to the state of the art in CBR emotion analysis are as follows:First, we present five confidence estimators and a confidence classification assessmentfor CBR. Second, in the context of CBR, an efficient assessment of classification con-fidence is proposed based on confidence predictors. The improvement of the retrievalof similar expressions from the database by learning the neighbourhood dimensionsfor the expected classification confidences. Thus, the database is improved for gener-alization with new subjects by learning thresholds to minimize misclassification withlow confidence, maximize correct classifications with high confidence and re-arrangemisclassification with high confidence.

Here, the study of the topological relationship of facial expressions is proposedwhich are believed to constitute the basic identities of a time series, where eachtime series represents an emotion. A taxonomy of facial expressions is constructedaccording to the basic prototypical facial expressions as well as including basic displaysof cognitive mental states. Subsequently, the dynamic evolution of facial expressionsover time is analysed in order to infer cognitive emotions and mental states.

Chapter 2

Background and Literature Review

Facial expressions are widely recognized by their importance in social interaction andsocial intelligence. The analysis of expressions has been an active research topic since19th century. The first automatic Facial Expression Recognition (FER) system wasintroduced in 1978 by Suwa et al. [114]. This system attempts to analyse facial ex-pressions by tracking the motion of 20 identified spots on an image sequence. Variouscomputer systems have been made to help us understand and use this natural formof human communication.

This chapter summarizes the most relevant contributions to the state of the art to-wards the understanding of affective behaviours based on image processing. The mainissues that must be considered for such a system are: face detection and alignment,face modelling, feature extraction, expression classification and emotion interpreta-tion. Most of the current work in this field is based on methods that implement thesesteps sequentially and independently. Therefore, a breve description of the basicmodel facial expression analysis must be done.

2.1 Facial Action Coding System

The facial action coding system [48] is a categorization widely used in psychology todescribe subtle changes in facial features. This is the most comprehensive methodfor coding facial expressions by psychologists. FACS consists of 44 action units (AU)related to contraction of specific sets of facial muscles, see Fig. 2.1.(a). Some of theAUs are shown in Fig. 2.1.(b). With FACS, observers can manually code discretedeformations of the face (movements of the facial muscle and skin) which are referredto as action units (AUs). Basically, FACS divides the face into upper and lower facialexpressions and subdivides motion AUs. FACS consists of 44 basic AUs, with 14additional AUs for head and eye positions. AUs are the smallest visibly discriminablemuscle actions that individuate or combine to produce characteristic facial actionswhich can be recognized from the image.

Conventional, FACS codes are manually labelled by trained observers. However,recent attempts have been made to do this automatically [94]. The FACS systempresents the advantage of capturing subtle facial expression. However FACS itself is

11

12 CHAPTER 2. BACKGROUND AND LITERATURE REVIEW

Figure 2.1: (a) Muscular model for FACS scoring according to the 44 AUs(b) [55].

purely descriptive and includes no inferential labels. That means in order to get theemotion estimation, the FACS code needs to be converted into the Emotional FacialAction System (EMFACS [55]) or similar systems.

2.1.1 Prototypic Emotional Expressions

Instead of describing the detailed facial features, most FER systems attempt to recog-nize a small set of prototypic emotional expressions. The most widely-used set isperhaps human universal facial expressions of emotion which consists of six basic ex-

2.2. FACE DATABASES 13

Anger Disgust Fear

Happiness Sadness Surprise

Figure 2.2: Six prototypic basic expressions from CMU DB.[66].

pression categories that have been shown to be recognizable across cultures, see Fig.2.2. These expressions, or facial configurations have been recognized in people fromwidely divergent cultural and social backgrounds, and they have been observed evenin the faces of individuals born deaf and blind.

These six basic emotions, i.e., anger, disgust, fear, happiness, sadness and surprise,plus "neutral" which means no facial expression were recorded following the theoret-ical description of the FACS. The CMU database [66] of Kanade et al., consists ofapproximately 500 image sequences from 100 subjects rendering the correspondingFACS [48] describing each facial expression. This database released in 2000 has beenreferenced at least 165 times in works such as [95, 26, 52, 77, 49], see Fig. 2.2.

2.2 Face Databases

Due to the non-rigidity and complex three-dimensional structure of the face, its ap-pearance a face is affected by a large number of factors including identity, face pose,illumination, facial expression, age, occlusion, and facial hair. Additionally, bothfacial expressions and emotions are spontaneous reflect of the inner states of indi-viduals, which are related to affection or intentionality. Therefore, the developmentof algorithms for either expression or emotion analysis in real environments requiredatabases of sufficient naturalness. There various strategies to provoke spontaneousor realistic emotions. For example, surprise is an emotion that may be related to goodor bad feelings, just shock, impression, wondering, etc.


2.2.1 Face Databases for FER

Among all databases for FER studies, three natural and less posed databases arehighlighted here; FGnet [121], MMI [98] and Mind [10] databases are used for bothexpression and emotion recognition.

2.2.1.1 FGnet DB

This database is a collection of image sequences aimed for emotion and facial ex-pressions analysis, for dynamic algorithms using image sequences rather than staticimages. The head movements are not restricted as in other FER DBs, as well asallowing spontaneous and natural expressions, since people were asked not to play arole. Thus, the 6 basic emotions plus neutral are here detailed.

The database contains sequences gathered from 19 different subjects, each per-forming all six desired expressions and an additional neutral sequence three times.Hence, 21 sequences were recorded for each person, to give a total of 399 sequencesin the database. In this work, one sequence per actor has been used in experimentsof FER, giving a total of 49,166 images so far, see Fig. 2.3.

Anger Disgust Fear


Figure 2.3: Six basic emotions from FGnet DB.[121].


2.2.1.2 MMI DB

The MMI database [98] includes 19 different faces of students and research staffmembers of both sexes (44% female), ranging in age from 19 to 62, having eithera European, Asian, or South American ethnic background. Although the databasehas a discrimination by AUs, those sequences labaled as one of the six basic emotionswere chosen. They are in total 11,844 images split in 162 image sequences of 73 imageseach, see Fig. 2.4.

Anger Disgust Fear


Figure 2.4: Six basic emotions from MMI DB.[98].

2.2.1.3 Mind Reading DB

The Mind Reading Database (Mind DB) covers a wider spectrum of human emo-tions rather than the six basic emotions. Baron-Cohen proposed to explore over 400emotions, aware of the special needs of children and adults who have difficulties recog-nising emotional expression in others. This enables the study of emotions and learnthe meanings of facial expressions, as well as tone of voice.

The database completes 412 different emotions organised into 24 related groups.There are six video sequences for each emotion showing close-up performances by avariety of people (old, young, men, women, boys, girls, etc). Among the 24 groups ofemotions, it is possible to find the six basic emotion and the rest are other emotionalstates proposed by Baron-Cohen.

For this work, 35,745 images haven selected, which are distributed into 30 differentactors and 276 image sequences. From this amount images, 17,122 images correspondto the six basic emotions, see Fig. 2.5.


Anger Disgust Fear


Figure 2.5: Six basic emotions from Mind DB.[10].

2.2.2 Face Databases for Emotion and Mental States Recog-nition (EMR)

Facial expressions and emotions are here distinguished whether the pattern recog-nition approach considers static images or dynamic temporal sequence of images.Hence, emotion and mental states analysis requires mainly databases structured assequences, although for the main contribution of this thesis in field, the mapping ofexpression towards emotion inference is taken into account.

Consequently, the aforementioned databases for FER are also here considered fortemporal dynamic patterns, since these databases are provided as image sequences.The FGnet DB [121] contains 399 sequences of six basic emotions, whose lengthdepends on the type of emotion. The MMI DB [98] contains 162 sequences of basicemotions and the sequence’s lengths were standardized and provided like that byauthors. The Mind DB [10] contains about 5,000 sequences for the 412 emotions.However, in this work only the six basic emotions and other six emotions hence calledmental states are extracted from the database for emotion analysis. These selectioncompletes 276 image sequences and their lengths is variable according to the emotion.

The Chapter 5 contains further details on the usage given to each database. Still,it is worth to mention that just a fair comparison of the basic emotions through alldatabases allows highlighting different strengths. The Fig. 2.6 shows a sequence of ahappy expression along the three databases.

It can be seen in Fig. 2.6 how the expressiveness of all DBs is different. Further-


FGnet MMI Mind

(a) (b) (c)

Figure 2.6: The six basic emotions are differently rendered by the three DBs,FGnet, MMI and Mind. Here, a sequence of the "Happy" is compared alongthe three DBs. Expressiveness, duration, transition and variety are the maindifferences among these databases.

more, as explained before, the duration and transition between expression along thesame sequence are also different. Another characteristic is the variety of expressions.The FGnet DB is very spontaneous and always includes neutral expression at thebeginning of each sequence. The length of sequences are not standardized, thereforeallows modelling the influence of duration for emotion patterns. The MMI DB hasother expressiveness, which were more theoretically addressed, i.e. follows the AUsconfiguration attempting to be more subtle and spontaneous. Thereby, the transitionfrom on intensity to another is different despite the duration of sequences are stan-dardized. This database also includes neutral expression at the end of sequences thatallows modelling the offset of the emotion. The Mind DB contains quite challenging


emotions; the expressiveness is different, the sequence’s lengths are not standard, theneutral expression is rarely found within an emotion. Head movements present in allthree DBs, but the Mind DB presents a cognitive representation of such gestures, sothe head movements are not tending to out of plane movements.

Finally, 537 image sequences and about 95,000 images are collected from all threeDBs. Except to the MMI DB that is standardized on time duration, the frame rate(25 fps) allows a sub sampling of fragments of sequences to increase the database.

2.3 Face Detection

In a face analysis system, the first step must be to allocate the face in a given imageor video sequence. To solve the problem, several cues can be used for example, skincolour, motion, shape, and facial appearance. The most successful face detectionalgorithms are based on appearance. The main key of their success the strength ofappearance-based algorithms to avoid confusing textures and shapes by modelling 3Dstructures of faces. However, the appearance variations due to facial expression andhead pose affect the identification of face boundaries highly complex [15].

Here, we will describe three face detection algorithms: Eigenface is one of thesimplest methods, Rowley’s and Viola’s frameworks may be the most successful ones.AdaBoost learning is an important component in Viola’s framework while NeuralNetworks are the basic classification used by Rowley.

2.3.1 Eigenface and Template Matching

Given a collection of training images represented as a column vector, basis vectorsspanning an optimal subspace are determined such that the mean square error betweenthe projection of the training images onto this subspace and the original images isminimized. These are called the set of optimal basis vectors, eigenfaces, becausethey are the eigenvectors of the covariance matrix computed from the vectorized faceimages in the training set.

To build an eigenspace, let us assumes the face images X = x1, ..., xn fittinga multivariate normal distribution from which the training images are identicallyindependently drawn. So the p.d.f. is as follows:

f(x1, ..., xn) =e(−

12(x−µ)T Σ−1(x−µ))

(2π)n/2|Σ1/2| (2.1)

Next, the covariance Σ is decomposed by applying Singular Value Decomposition(SVD), Σ = USUT , wher U is a unitary matrix and S = diag(s21, ..., s

n1 ) is a diagonal

matrix with all elements non-negative. Each column of U is an eigenface. Therefore,a face image x can be expressed as the linear combination:

x = xi +∑

i

ciUi (2.2)

It can be shown that ci/si are standard normal variables embedding a probabilitydensity function of x as follows:

2.3. FACE DETECTION 19

f(x) =∏

i

e−

a2i

2s2i

(2π)1/2(2.3)

This equation can be used for estimations and to define a distance measure, D2 =∑

i a2i /s

2i , understood as the probability to be a face. This idea of using a normalized

Euclidean distance has been extended Sung and Poggio’s [113]. They apply mixtureof Gaussians to model face and non-face images, which means estimating also theprobability of x of being a non-face image. Further, the decision is based on a Bayesianclassiffier.

By assuming the covariance matrix as the identity, Σ = I, the distance is com-pletely degenerated into Euclidean distance, which means the p.d.f is controlled by|x−µ|2, the variation of the image from the average face. This is the simplest detectionalgorithm by thresholding |x− µ| to determine whether x is a face image.

2.3.2 Rowley’s Framework

Rather than detecting upright, frontal faces, Rowley’s system detects faces at anydegree of rotation in the image plane. The system employs multiple networks; arouter network processes determines each window’s orientation in order to performdetection by other networks. This system can distinguish two faces of very differentorientations at adjacent pixel locations in the image. To counter such anomalies, andto reinforce correct detections, some arbitration heuristics are employed.

The router network assumes the input window containing a face, therefore it istrained to estimate the face orientation. Based on the intensity values in a 20x20 pixelwindow of the image, this network outputs the angle of rotation, which is representedby an array of 36 units. Thus, the face’s angle is discretized into 36 scales for each10◦ increment of the face rotation, see Fig. 2.7.

Figure 2.7: Rowley’s face detector system. Input images are sampled intoa pyramid to consider different scale and convolution variations. Next, arouter network determines the face orientation according to a 36 discretescale. Subsequently, a set of networks determine the presence of face. [101].

Once the face orientation has been determined, the face is warped into an uprightposition to further decide whether or not the window contains an upright face. For


detection, a linear function varies across the window according to the intensity valuesin an oval region inside the window. The linear function approximates the overallbrightness of each part of the window, which are later processed by histogram equal-ization expanding the range of intensities in the window. The preprocessed windowis then given to one or more detector networks. Authors reported a detection rate of79.6% on faces rotated in the image plane in combination with an upright face detec-tor. The technique is suitable to be applied to other template-based object detectionschemes.

2.3.3 Viola’s Framework

There are other methods that instead of working with colour information or texturemodelling, they build abstract features based on image representations. Schneider-man and Kanade [107] use AdaBoost as classifier based on wavelet representation ofthe image. This method is computationally expensive because of the wavelet trans-formation. To overcome this problem, Viola and Jones [120] replace wavelets withHaar features, which can be computed very efficiently. Under Viola’s framework,some improvements have been proposed since Haar-likes offer real time performance.Li et al. [72] use rotated Haar features to deal with in-plane rotation. They proposea multi-view face detection system which can also handle out-of-plane rotation usinga detector-pyramid.

Based on AdaBoost architecture, it is possible to achieve a robust classifier com-posed by a set of weak classifiers. In this way, a hard problem can be broken downand interpreted by several inaccurate poor classifier, which can be joined by a votingpolicy to produce the best result. Likewise, Viola’s framework uses threshold classi-fiers dealing with only one feature selected from a pool of Haar wavelet-like features.Haar-likes are computed on an image representation called the integral image [120].An integral image II(x, y) is a representation for the image containing the sum ofpixel values of the image I with height y and width x, which is obtained as follows:

II(x, y) =

x∑

x′=0

y∑

y′=0

I(x′, y′) (2.4)

Feature detection methods based on the integral image representation, can offerreal time since Im(x, y) is computed recursively; II(x, y) = II(x, y−1)+II(x−1, y)+I(x, y) − I(x − 1, y − 1) with II(−1, y) = II(x,−1) = II(−1,−1) = 0. Consequently,it requires only one scan over the image to compute the whole rectangle feature valueat (x, y) with height and width (h,w).

Posterior to the image representation, the classification task is advanced by boost-ing classification. Adaboost is trained to learn effective features from a large Haar-likesset, to construct the weak classifiers and bootstrap towards a strong classification.

In conclusion, the main methods for face detection are Rowley’s and Viola’s. Theformer has the advantage to be more accurate specially when detecting non uprightfaces. However it requires more computation due to the pyramid sampling and layersof different networks. The later method offers real time performance given the fastscanning of the image and the high performance of Adaboost. The problem with this is

2.4. FACE ALIGNMENT 21

method is the dependency with the type of images included in the training. However,Lienhart et al has made important extended the Haar-like features to rotated ones,which can be applied for detecting rotated faces [74].

2.4 Face Alignment

Face alignment attempts to automatically fit a face model by detection key pointsinside of a detected face window. An accurate accurate localization of the face ispursuit by detecting feature points. There are several alorithms in the literatureproviding strong results. One of the most relevant algorithms has been proposed byCootes et al. [30] which is based on Active Shape Models (ASM) and curve fitting.

2.4.1 Curve Fitting

Curve fitting aims to align a shape model to another by applying a transform yieldingto the minimum distance between them. Those transform may be scaling, rotating,and linear translation, which can be all applied by performing an affinity transforma-tion:

T(x, y) =

[

xtranslate

ytranslate

]

+

[

s ∗ Cos θ s ∗ Sin θ−s ∗ Sin θ s ∗ Cos θ

] [

xy

]

, (2.5)

Cootes at al. proposed a method to align two shapes by using a least-squaresminimization. This approach has been extended by considering the similarity tothe physic problem of having two forces. The internal force produces deformationsand the external force is caused by density gradient. Thus, is again reduced to aminimization, in this case, the overall energy. This problem is usually solved usingiterative searching algorithms. A large number of curve fitting methods have beenproposed using a wide range of energy functions and searching schemes.

The most popular curve fitting methods are Levenberg-Marquardt and linearregression [71, 78]. The Levenberg-Marquardt method is a robust algorithm for non-linear regression. It is very reliable in practice, and has the ability to converge quicklyfrom a wider range of initial guesses than other typical methods. Linear regression(linear least-squares) is a curve fitting method that depends on initial guesses anddamping factors for a fast convergence. By computing the residuals, it is possible tomeasure of how well the target curve fits a given set of feature points. The goal ofthe curve fitting computation is to determine the strength envelope which minimizesthe value of the residuals.

2.4.2 Active Shape Models

Active Shape Model (ASM) [6] and Active Appearance Model (AAM) [4] are twoof the most representative face alignment models. In ASM, a Point Distributionmodel captures the shape variants and gradient distributions of a set of landmarkpoints describe the local appearance. The shape parameters are iteratively updatedby locally finding the best nearby match for each landmark point. AAMs model the


appearance globally by PCA on the mean shape coordinates. The shape parametersare locally searched using a linear regression function.

Deformable shape models can only deform in ways that are characteristic of theobject. The ASM is first trained on a set of manually landmarked images. Aftertraining, the ASM is used to search for features on a face by trying to locate eachlandmark independently, then correct the locations if necessary by looking at howthe landmarks are located with respect to each other. First, a profile model foreach landmark describes the characteristics of the image around the landmark, whilegenerating tentative new positions for the landmarks, called the suggested shape.Second, a shape model defines the allowable relative position of the landmarks.

This algorithm combines the results of the weak profile classifiers to build astronger overall classifier. We can say this is a shape constrained feature detector,where the shape model determines the searching area and each profile acts as localweak classifier. Thus, a shape model converts the shape suggested by the profile mod-els to an allowable face shape. A shape model consists of an average face and alloweddistortions of the average face:

S = S + φb (2.6)

where S is the generated shape vector and S is the mean shape model. φ is the matrixof eigenvectors of the covariance matrix Σs of the training shape points. As in Eq.2.2 this combination of mean and eigenvectors determines a set of eigenshapes. Thisallows to obtain a new shape based on the mean and the eigenvectors of the trainingset. Therefore, the aim of the profile model is to take an approximate face shape andproduce a better suggested shape by template matching at the landmarks, see Fig.2.8.(a).

After obtaining a candidate shape, several search profiles are built at each land-mark position by sampling the image in the its neighbourhood of the landmark. Thedistance between a search profile and the mean model is calculated using the Maha-lanobis distance. The closer profile establishes the centre of the profile and the newsuggested position of the landmark. Additional exhaustive search is performed upand down over pyramid at multiple scales, see Fig. 2.8.(b).

2.4.3 Shape Regularization Models

Following the standard shape representation for face alignment, a face consists of acollection of landmarks commonly placed along the boundaries of face components.The geometry information of the shape model is decoupled into canonical shape S, anda rigid transformation T. The rigid transformation maps the shape from a commonreference image to the coordinate plane of the target image I.

Assume that for a given landmark Ln, there are K candidate positions (profileunder previous approach) located on the image. Thus, there might be M = NxKcandidates for estimating a target shape S. In [58], authors introduce a latent variableto assign one candidate position to each landmark. The image likelihood of seeing alandmark Lnk at one particular position is measured by a Bayes rule p(I|Mnk = 1) =p(I|Lnk). Consequently, the shape model estimation is given by a Bayes inferationprocess supposing that S ∝ N(T (S, θ); Σ). Thus, the landmarks profiles are randomly

2.4. FACE ALIGNMENT 23

(a) (b)

Figure 2.8: Milborrow’s approach for Active Shape Models. (a) Each land-mark’s neighbourhood is modelled according to rectangular shape features.All these local profiles compose the ASM which is statistically learnt andexpressed eigenshapes. (b) By performing an exhaustive search over a multi-resolution pyramid, the final ASM is obtained. [84].

withdrawn from a multivariate normal distribution, under the allowance of assumingeach landmark estimation as independent variable.

Finding the most likely landmarks profiles is an issue faced by applying expectationmaximization algorithm for parameters estimation. Thus, given a set of positioncandidates for the landmarks, Lnk and the image likelihood of each candidate, theEM algorithm allows to obtain the optimal deformation and pose parameters b and θ(the Gaussian parameters for the multi-variate Gaussian and the affine transformationparameters, rotation, scale and translation).

The alignment algorithm is initialized by detecting the face based on Rowley’s facedetector [101] since it is pose invariant. The initial deformation parameter b(0) is set tozero. Next, a simple gradient based landmark detector is constructed in order to centreeach landmark profile. Given the multivariate Gaussian distribution for the shapemodel, it is possible to estimate the probability of each pixel containing a landmarkcentre. This landmark detection procedure and the shape inference algorithm areperformed recursively over an image pyramid, similarly to [101, 84] from the coarsestlevel to the finest.

Experimental results can easily obtain from the on-line demo of the authors,see Fig. 2.9.(a). Although the obtained results are more accurate than previousworks, there are still some occlusions yielding to no detections, which is expectedsince the multivariate distribution can cope with outliers under certain threshold.The predictions on the visible face part are stable and small weights are assignedon the occluded points based on shape priors according to the penalization rules.The method can efficiently deal with upright occlusions but rotated faces offeringocclusions are more challenging for the method, see Fig. 2.10.


(a) (b)

Figure 2.9: Gu and Kanade’s approach for Shape Regularized Models. (a) Afinest shape model is obtain for challenging images, including sunglasses androtated faces. (b) Some occlusions are still not covered since the modelling isquite depending on the edge detection [58].

(a) (b)

Figure 2.10: (a) [84] and (b) [58] compared here. Both methods providesimilar results although they are inaccurate. The first has been initialized byViola’s face detector while the second using Rowley’s.

2.5 Feature Extraction

Feature extraction methods aim to convert pixel data into a higher-level represen-tation of shape, motion, color, texture, and spatial configuration of the face or itscomponents. Such representation is later used for recognition, classification and moreelaborated analysis. One of the main goals of feature extraction methods is to reducethe dimensionality of the input space. Obviously, the reduction procedure must retainessential information based on efficient discrimination power. Several facial featureshave been proposed mainly from face recognition approaches. The coefficients ofeigenfaces can be used as features and also an extension of Eigenface [118] proposesTensorfaces as facial features. Active Appearance Model [30] decomposes the facialimage into shape and texture. The shape model encodes the face contour by usingASMs, whereas the facial texture is captured from the input image by warping intoa shape-free facial texture.

2.5. FEATURE EXTRACTION 25

Other approaches focus on local features, which model small face regions. Themost common idea of these methods is to use block-wise representations by splittingthe image in sub-windows for local feature extraction. Wavelet filters have been alsoused, which is the case of Gabor filters as shown in [75]. Facial animation parametershave been also widely used as facial features [41], which are based on Active ShapeModels.

2.5.1 Tensorface

Figure 2.11: Tensor Faces in a partial visualization built with 2,700 facialimages spanning 75 people. Each image has been sampled under 6 viewingand 6 illumination conditions [119]

Based on the Eigenface representation, Tensorfaces emerge as a multilinear ex-tension that represents an image by a multilinear system instead of using a singlelinear equation. This method analyses a face ensemble with respect to its underlyingfactors such as identities, views, and illuminations. The principal components in thismultilinear system are then considered as Tensorfaces which are shown in Fig. 2.11.The Eigen coefficients can be used as features in a recognition and similar uses ofEigenfaces tasks.

Some improvements have been proposed already for the Tensorface representation.[119] proposes a Multilinear Independent Component Analysis where they try to findthe independent directions of variation. Shashua et al. introduce Non-Negative TensorFactorization which is a generalization of Non-negative Matrix Factorization [110].


2.5.2 Potential Net

Figure 2.12: Potential Net representation and nodal deformation whenadapting a facial texture [79].

Potential Nets have been proposed by Matsuno et al. [79] in order to extractfacial features. As shown in Fig. 2.12, These are 2D-dimensional meshes of whichnodes are connected to their four neighbours with springs, while the most exteriornodes are fixed to the frame of the Net. Similar to curve fitting, Potential Netconsiders two forces: each node in the mesh is driven by the external forces which comefrom the image gradient and the elastic forces of springs propagate local deformationthroughout the Net. Eventually, equilibrium is reached, and the nodal displacementsrepresent the overall pattern of facial image.

2.5.3 Active Appearance Model

Cootes et al. [29] proposed the Active Appearance Model (AAM) as an extension ofASMs. AAMs merge the the shape and texture model into a single representation ofappearance. In this way, shape and texture are statically encoded by mapping wholeobject i.e. the whole face. An optimization technique which pre-calculates residualsduring training is used when matching image texture to model texture.

An AAM contains a statistical linear model of the shape and the grey-level ap-pearance of the object of interest, which can generalise to almost any valid example.Matching to an image involves finding the model parameters, which minimise the dif-ference between the image and a synthesised model, projected into the image. Findingan AAM is an optimal search problem, in which the difference between a new imageand one synthesised by the appearance model is minimised, see Fig. 2.13.

The texture model provides accuracy and time consumption by using colour infor-mation. Therefore, shape and texture should be jointed to warp an input image withrespect to the model while decreasing the complexity. Furthermore, facial featuretracking is a not appropriate technique for real time [42]. Instead, AAM are affec-tive by combining both shape and texture, which allows comparing the appearance


Figure 2.13: Active Appearance Modelling process.

changes with respect to local and global movements.While ASMs use the image texture in a locality of the landmarks, AAMs use the

texture across the entire object. Therefore AAMs use the whole information acrossthe face surface. Given this extra information for AAMs, they need a smaller numberof landmarks, see Fig. 2.13.

2.5.4 Gabor Wavelets

Gabor wavelet as facial features contain rich information about the local structureof the face. Instead of encoding intensity pixel values, Gabor functions are Gaus-sians modulated by complex sinusoids. They are similar to those of the human visualsystem, and they have been found to be particularly appropriate for texture repre-sentation and discrimination. Theese features are directly extracted from gray-levelimages, and have been successfully applied to texture segmentation, handwritten nu-merals recognition and fingerprint recognition. In the spatial domain, a 2D Gaborfilter is a Gaussian kernel function modulated by a sinusoidal plane wave:

G(x, y) = e

h−x2+y2

2σ2

ie[jω(xSin θ+yCos θ)] (2.7)

where σ is the standard deviation of the circle Gaussian along x and y, and ω denotesthe spatial frequency. Consequently, a Gabor filter output is:

Φ(x, y) = G(x, y) ⊗ I(x, y) (2.8)

where ⊗ is a two-dimensional convolution operation, and I(x, y) is the input image.Several author have applied and extended the Gabor feature representation [122] andused for face tracking by reporting accurate results [128]. Some samples of imagerepresentation by Gabor filters can be seen in Fig. 2.14.


Figure 2.14: Gabor wavelets features. By applying edge detection techniquesas pre-process to the Gabor filtering, it is possible to obtain a cleaner Gaborfeatured image [122].

2.5.5 Face Tracking

Face tracking can be performed by applying feature-based or appearance-based ap-proaches as discussed before. Moreover, tracking approaches can be divided intotwo groups: deterministic and stochastic. Probabilistic video analysis has recentlygained significant attention in the computer vision community using stochastic sam-pling techniques. Visual tracking based on probabilistic analysis is formulated froma Bayesian perspective as a problem of estimating some degree of belief in the stateof an object at the current time given a sequence of observations. Particle filteringmethods, also known as Sequential Monte Carlo (SMC) methods were independentlyused and proposed by several research groups [80]. These algorithms provide flexibletracking frameworks as they are neither limited to linear systems nor require the noiseto be Gaussian and proved to be more robust to disturbing disorder as the randomlysampled particles allow maintaining several competing hypotheses of the hidden state.

In order to recover the parameters of the model, observation likelihood should beavailable to quantify the consistency of the predicted parameters with the observationmade at the same instant of time. A statistical texture model like the AAM can beused to apply eigenfaces as a texture model [4], although in this framework thereis not a priori statistical texture model. The tracking and the learning process areperformed in parallel. So that the appearance is built on-line from previous shape-free textures, and allow recovering the geometric model. Using this approach thesequences of textures can build an eigenfaces system associated with the tracked faceby means of batch or incremental Principal Component Analysis (PCA) [39].

Other technique introduces an appearance-adaptive tracker able of estimatingthe 3D head pose and the facial actions in real-time including both deterministicand stochastic approaches. Thus, this technique has an online adaptive observationmodel of the face texture together with an adaptive transition motion model based ona registration technique between the appearance model and the incoming observation.Here, the concept of Online Appearance Models (OAM) is extended to the case oftracking 3D non-rigid face motion.

Classical 3D vision techniques provide tools for computing the 3D pose and thefacial animations from images [12]. However, such trackers very often suffer fromerrors accumulation since facial features do not have enough stable local appearancedue to many factors. Within deterministic approaches, appearance-based techniques


have the advantage that they are easy to implement and are generally more robustthan feature-based methods. The problem of appearance changes is tackled adoptingstatistical facial textures like the AAM that have been proposed as a powerful tool foranalyzing facial images [30], since these methods depend of the imaging conditions.

Tracking long video sequences with OAM demonstrated effectiveness and accuracywhile obtaining good results handling illumination changes, significant head pose andfacial expression variations as well as occlusions. The adaptive observation modeland the object appearance are learned during the tracking. Thus, unlike trackingapproaches using statistical texture modelling OAMs are expected to offer a lot offlexibility, here it is extended to track 3D non-rigid face motion [16].

2.5.6 Eyelid and Iris Tracking

Figure 2.15: Thresholding and eyelid detection by using simple templatematching [125].

(a) (b)

Figure 2.16: (a) Eyelid edge detection. (b) Normal flow in eye regions tofollow iris motion [104].

The eye states provide important information to recognize facial expression andhuman-computer interface systems [59]. For example, when a man is smiling, his eyesare nearly closed and so the eye states are related with basic emotions. Therefore, atechnique, which can extract the eye states from input images, is very important. Theeye states can be obtained from the eye features such as the inner corner and outercorner of eyes, iris, eyelid, eye position for example. There are methods like deformabletemplates to detect eye features minimizing energy of deformable templates to detecteye states [2, 38]. Kanade and Cohn [127] adopted action units to recognized facial


expression, but they detected only two eye states by iris because are visible wheneye is opened or closed. Bernogger et al. [103] presented an approach to synthesizethe eye movement by using the extracted eye features to compute the deformation ofthe eyes of the 3D model. However, facial movements as eyebrows and lips, eyelidsand irises have a faster motion due to the small area and range variation. In fact,they have specifics movements such eye blinking and eye saccade respectively. Atpresent, neither eyelid nor iris motion have been tracked jointly head and face. Manyresearchers have contributed with these issues by applying typical methods basedon detection of edges, eyelid corner detectors, iris circle detection by using HoughTransform [125, 104], see Fig. 2.15, and image colour information, see Fig. 2.16. Somemethods work with only eyelid motion or iris motion, but there are few works abouteyelids, iris, brows, lips and head tracking while only applying a head stabilizationprocess. They estimate discrete eye states as open and closed [85, 86], see Fig. 2.17.

Figure 2.17: Head pose estimation for blinking, flutter and non-blinkingdetection [85].

Figure 2.18: Iris extraction through HSI colour space [59].

Colour segmentation is used to obtain skin-segmented face image characterized bya multivariate normal distribution in a normalized colour space [59]. After detectingthe face region, two coarse regions of interest in the upper left and upper right halfof face image can be defined to detect the eyes. Therefore, if the eye inner cornersare localized and the length and width of the eye is known, the eye can be localizedand tracked in image sequence. Detection of eye states, eye colour information areused in RGB or HSI space colour, see Fig. 2.18. Because the eye has not remarkablefeatures in Red, Green and Blue colours, HSI colour space is selected.

Another approach is based on eyelid detection for further the eye states estimation

2.6. FACIAL EXPRESSION RECOGNITION (FER) 31

[125]. The advantage rises from deformable templates as parametrized active model:it interacts with the image in both the geometric aspect and the intensity aspect.To estimate with accuracy the template parameters, some works are done on thevalley-peak field. First, the binary image of eyes is got from the grey-level image ofeyes. The conditions of the image such as illumination, colour, and skin, determinethe threshold to transform the input image on a black and white image, but thethreshold depends on image quality and environment factors.

2.6 Facial Expression Recognition (FER)

The recognition of facial expressions has been widely studied by applying variousnumerical classifiers, statistical learning methods and artificial intelligence techniques.Two main surveys summarize the beginning [52] and the most recent advances [129]in facial gesture and expression recognition.

In [52] authors provide an extended analysis of facial expression recognition sys-tem. They clearly distinguish between holistic and local approaches for face mod-elling. In the same direction, the different methods for representing human faces aredetailed and compared. A further step in this survey is the comparison of a varietyof classification methods such as Neural Networks, Bayesian Networks, Linear Dis-criminant Analysis, Nearest Neighbours, etc. At this point, authors highlighted theimportance of further investigation of temporal relationships of facial expressions andthe complexity of emotional patterns.

Zeng et al. [129] compare the most advances in facial expressions and multi-modalstudies for human emotion understanding. This survey reviews the most importantdatabases in the literature for everyone of these approaches. In this way, they reflectthe necessity of incorporating contextual information to the human emotion analysisproblem.

Human Computer Interaction (HCI) has received many contributions towardsa friendly understanding of multimedia users. Thus, Pantic and Rothkrantz [97]proposed Case-Based Reasoning for profiling user’s emotions. Likewise, Khanum andZubair [68] also studied the FER problem by implementing CBR system combinedwith fuzzy logic. CBR imitates the way how human beings solve problems by oncomparison to previous solved situations [3]. A CBR system is composed by fourmain steps named retrieve, reuse, revise and retain. These two approaches openedinteresting alternatives to recognize facial expressions. However, CBR has many otherstrengths for knowledge discovery and adaptability towards improving efficiency.

Facial expressions and emotions are normally modelled by multi-dimensional fea-tures. Additionally the numerous prototypes of behaviours increase the complexityof expression recognition, emotion understanding and behaviour interpretation. Thisleads to complex spaces which are not linearly separable or huge to be mapped whileestablishing topological relationships. Support Vector Machines (SVMs) [83, 13, 24,73] is a powerful method for dealing with multi-dimensional spaces, transformationsand efficiently determining the decision boundary. SVMs are based on strong theo-retical background making them suitable of generalization and rapid computation.

Statistical learning methods such as Bayesian Networks have widely used for the


expression recognition problem [26, 49]. Cohen et al. [26] explored several structuresof BN by varying the degree of correlation in the observation space. Additionally,both person dependent and independent experimentation was advanced showing thegeneralization power of learning six basic expressions. [49] implemented a novel systemfor mental behaviour recognition that combines facial tracking and Dynamic BayesianNetworks (DBNs). Although DBNs are suitable for modelling temporal patterns, thiswork has dealt mainly with images while reinforcing decision as much as the samedisplay remains constant.

2.6.1 Case-Based Reasoning (CBR)

The recycling of the human memory of previous experiences has been studied beforeby Schank [106]. In the theory of functional organization of human memory of expe-riences, for a certain event to remind one spontaneously of another, both events mustbe represented within the same memory structure, which organizes the experiencedevents according to their thematic similarities.

A CBR system is requires a initial example space of facial actions and descrip-tions of solutions such as expression class, intensity of expression, quality, etc. Thereasoning structure is according to four processes as follows:

1. Retrieve: Given a testing data, the database is explored in order to retrievethe most similar data to the testing sample. Therefore, similarity measures andretrieving techniques have to be defined here. Nearest Neighbours is the mostused technique to sort the database based on a similarity measure.

2. Reuse: Once a neighbourhood is retrieved from the database, additional com-parisons and tests allow to reassure which are the most suitable data from theneighbourhood providing likely solutions. Next, a decision is taken and appliedto the testing data.

3. Revise: Some knowledge rules are required in order to self-assess the accuracyof the final solution. These rules are learn in a training process as part of thegeneralization that constitutes the knowledge discovery.

4. Retain: Similarly to previous step, other rules learnt in training, are requiredto update the system by retaining solved data and/or removing unnecessarydata from the database. This step aims to increase the classification efficiencyand adapt the system to the testing data.

Maintenance polices attempts to contribute mainly to revise and retain steps,although some contribution can be done for retrieve and reuse. Quality, confidence,similarity and updating rules are the main focus for maintenance research.

2.6.2 Support Vector Machines (SVMs)

Motivated by the non-linearity distribution of the feature space, a space transforma-tion arises as alternative to solve the problem. Support Vector Machines (SVMs)is a powerful, state-of-the-art classifier with strong theoretical foundations based on


the Vapnik-Chervonenkis (VC) theory. SVMs attempts to linearly separate featurevectors while maximizing the margin to the decision surface, thereby it is also knownas Optimal Margin Classifier [18]. The set of vectors is optimally separated by ahyperplane if it is separated without an error and the distance between the closestvector and the hyperplane is maximal.

SVMs transforms the feature space by applying a variety of kernel functions (e.g.,linear, polynomial, and radial basis function (RBF)) as the possible sets of approxi-mating functions. It optimizes the dual quadratic programming problem while usingstructural risk minimization as the inductive principle, as opposed to classical statis-tical algorithms that maximize the absolute value of an error or its square. Differenttypes of SVM classifiers are used according to the type of input patterns. A linearmaximal margin classifier is used for linearly separable classes, a linear soft marginclassifier is used for linearly non-separable classes, and a non-linear classifier is usedfor overlapping classes.

(a) Optimal hyperplane. (b) Kernel transformation

Figure 2.19: SVMs construct strong classifiers for generalization, since theoptimal hyperplane is maximally equidistant to both classes. In cases whereclasses are not linearly separable, a kernel function is applied to transform thefeature space in order to find the linear hyperplane.

SVM Classifier: The classification problem can be restricted to recognize be-tween two facial expressions without loss of generality. This binary discriminationis possible by inducing a function from available examples, whilst expecting to workwell on unseen faces, i.e. it generalises well. Consider the example in Fig. 2.19.(a)where there are many possible linear classifiers that can separate the data, but thereis only one that maximises the margin between it and the nearest data point of eachclass. This linear classifier is termed the optimal separating hyperplane.

Let us consider the feature space D = {(~γ0, λ0), ..., (~γl, λl)}, where ~γi ∈ Rn are

facial actions and λi ∈ [−1, 1] are labels denoting facial expressions of two classes.SVMs constructs a classifier such that the optimal hyperplane separating both classesis:

~w · ~γ + b = 0 (2.9)


where the weighting vector ~w ∈ Rn is given by a linear combination of the training

samples. The optimal hyperplane satisfies the following constraints:

arg minw,b0

1

||~w|| (2.10)

s.t. λi(~w · ~γi + b0) ≥ 1 ∀i

This optimization problem is a quadratic programming problem that can be solvedby solving the dual problem through Lagrangian multipliers:

arg maxα

l∑

i=1

αi −1

2

l∑

i=1

l∑

j=1

αiαjλiλj~γi · ~γj (2.11)

s.t.l

∑

i=0

αiλi = 0

~w =l

∑

i=1

αiλi~γi

where αi are the Lagrangian multipliers. The optimal solution contains many αi = 0while those αi 6= 0 are called Support Vectors (SV), which determine the decisionboundary. Finally, for a new testing facial action vector γn extracted from a facedisplaying expressions, the classification into one of the two expression classes is doneaccording to the following rule:

~w · ~γn + b =

|SV |∑

j=1

αjλj(~γj · ~γn) + b (2.12)

where |SV | indicates that only the SVs are needed for such classification, see Fig.2.20. The binary decision is taken upon the computation (in Eq. 2.12) is positive ornegative. Since all these computations are written in terms of inner products, theycan be generalized to a non-linear cases by employing Kernel techniques.

2.6.3 Statistical Learning

Ten years ago there were already good results on FER problems by Bayesian Networks(BNs). Ball and Breese [56] proposed a model of emotions and personality for a com-putational agent. The architecture uses dynamic models of emotions and personalityencoded as Bayesian networks to diagnose the emotions and personality of the user.In particular, they described the structure of dynamic Bayesian Networks (DBNs)that form the basis for the interpretation and generation, and address assessment andcalibration of static and dynamic components.

Cohen et al. [26] revised several statistical and numerical classifiers for FERchallenges. Based on FACS, authors compared the efficiency of techniques such as


Figure 2.20: SVs are the data points over the bound, dash line. The largerthe margin, the smaller the bound. The smaller the number of SV, the smallerthe bound. Thus, only SVs determine the decision hyperplane.

Neural Networks, Naïve Bayes Classifiers (NBC), Tree Augmented Networks (TAN)and more complex BNs. Additionally, they assessed the capabilities of generalizationof such techniques based on classic databases of posed facial expressions.

Even recent publications adopt BNs and graphical models for solving FER prob-lems [53]. Authors recognize facial expressions by applying convex optimization toincorporate constraints on parameters with training data to perform Bayesian networkparameter estimation. For complete data, a global optimum solution to maximumlikelihood estimation is obtained in polynomial time, while for incomplete data, amodified expectation-maximization method is proposed.

2.6.3.1 Bayesian Networks

A Bayesian network is a graph structure encoding joint probability distributions forits variables in a very compact way [99]. BNs are strongly efficient for inferenceby relying on a factorization of these distributions in arbitrary ways to make arbi-trary conditional independence assumptions. Given the structure of the network, itsparameters can be learnt through the estimation of probability measures of the Condi-tional Probability Distributions (CPDs). Maximum likelihood estimation techniquesare usually applied to learn such parameters from from a limited amount of data.

Naïve Bayes (NB) classifiers make the assumption that all features are condition-ally independent given the class label. Although this assumption is normally unrealin practice, NB have been successfully used for practical classification applications.The main reason behind of these results is the small number of parameters needed tobe learnt. However, there strong relationships among the features describing a class,which more efficiently allow recognizing the class. Thus, Bayesian Network Classifiers(BNCs) are generalized by considering the correlation among observation variablesadditionally to the dependency to the class variable. Tree-Augmented Naive Bayesclassifiers (TAN) is an especial case of BNCs, where observation variables are allowed


to be correlated at most with another observation variable.A BN for a set of variablesX = {x1, ..., xn} contains network structure S encoding

conditional independence assertions about X a set of local probability distributions.The network structure is a Directed Acyclic Graph (DAG) and the nodes are in one-to-one correspondence with the variables X. Lack of an arc denotes a conditionalindependence. For example in Fig. 2.21 an arc between Y and Z determines thedependency of Z from Y . Contrarily, X and Y are independent.

Figure 2.21: A directed acyclic graph (DAG) consistent with the conditionalindependence relations in P(W, X, Y, Z).

The independence relations are as follows:

P (W,X, Y, Z) = P (W )P (X)P (Y |W )P (Z|X,Y ) (2.13)

Notice that Y is conditioned on W . Moreover, W and Z are independent sincethey are not directly connected.

In comparison to other classification methods, SVMs enclose several advantages.For example, with respect to a decision tree, SVMs can handle linear relationshipsbetween attributes, but typical decision tree split only on one attribute at a time.It is possible to handle continuous values with SVMs while decision trees are betterhandling nested if-then-else type of rules. SVMs require smaller databases and canalso deal with discrete values.

SVMs are faster than K-nearest Neighbours classifiers, creates a smoother separat-ing hyperplane with less sensitivity to noise. Comparing SVMs and Na’ive BayesianNetworks, NBN are faster and easier to implement, hence often treated as a base-line, but the assumption of independence is their main handicap. A linear SVM canbe seen as a Neural Network with no hidden layers. However, the backpropagationalgorithm for training multi-layer neural network does not find the maximal marginseparating hyperplane. Neural network tend to overfit more than SVM as well assuffering of having many local minima, while SVM has only one global minima.

One possible alternative for increasing the classification efficiency is the combi-nation of classifiers. It is possible to combine a SVM with decision trees by splittingbased on a linear classifier trained by SVM instead of splitting according to an at-tribute. Another combination is a SVM with K-nearest neighbours by using K-NNto classify test instances inside the margin. After training a NBN conditional proba-bilities P (xi|c) are obtained. If the instances of X are modified by multiplying eachattribute xi by P (xi|c) and subsequently training a SVM with these attributes.

2.7. EMOTIONS AND MENTAL STATES 37

SVMs are maximum margin linear classifiers that are regularly used for instancesin high dimensional spaces. If the training instances are not linearly separable, theuse of soft margin, kernels, Transformation of non-numeric attributes, normalizationof attributes and tune parameters when training yields to better results.

2.7 Emotions and Mental States

Our behaviours, expressions and interactions are based on reasoning, emotions andgoals. Although there is not evident boundary between reasoning and emotions, it ispossible to distinguish those purely affective emotions from some conscious and inten-tioned behaviours [46]. Similarly, the spontaneous emotions are differentiable fromthe posed ones. Those behaviours involving feelings and affection are can be consid-ered cognitive while behaviours intentionality and achievements reveal reasoning andinferation of the individual, named cognoscitive behaviours [1, 105].

Emotion and reasoning can be mixed while humans behave driven by reactions,stimulus or intentionality. Human emotions can reveal empirical affection, cognitiveand cognoscitive behaviours. The study of emotions has been faced as pattern recog-nition of static displays through classification methods, e.g. [51, 13, 130, 67]. Deeperstudies consider the temporal evolution of expressions by applying methods for time-series modelling, e.g. [63, 97, 49]. Most of these approaches for emotion analysisadopted Bayesian Networks and Hidden Markov Models.

Cognitive behaviours imply dynamic interactions of subject’s personality and theenvironment, which involves a temporal process [124, 8, 44]. Cognoscitive behavioursare addressed according to purposeful interactions, goals and reasoning for expressingemotions.

2.7.1 Graphical Models

Graphical Models (GMs) are probabilist networks which encode the conditional in-dependence structure between random variables. One of the main use of GMs is theanalysis of temporal series as state machines. Dynamic Bayesian Networks (DBN)are one of the most general of these models which are directed graphical models ofstochastic processes. However, an specific case of DBNs is Hidden Markov Models(HMM) that are intended to model Markov Processes, i.e. a mathematical modelfor the random evolution of a memoryless system, for which the likelihood of a givenfuture state, depends only on its present state, and not on past states.

DBNs generalise HMMs by representing the hidden (and observed) state in termsof state variables. HMMs are stochastic finite processes, where each state generates(emits) an observation. There are two types of HMMs; ergodic and left-to-right.Ergodic HMM, or fully connected, allows every state of the model to be reached ina single step from every other state of the model. The left-right HMM, or Bakismodel, models the states as increasing sequences as much as the time increases in asuccessive manner. Similarly, DBNs also model temporal series of random variables.Thus, while an HMM represents the state of the world using a single discrete randomvariables, Xt ∈ {1, ..., k}, a DBN represents the state of the world using a set of


random variables X1t , ..., X

Dt . A DBN represents P (XtjXt..1) in a compact way

using a parameterized graph. A DBN has fewer parameters to estimate than itscorresponding HMM and can be also exponentially faster for inference.

2.7.2 Hidden Markov Model

HMM can be defined as a stochastic signal model where the observation is a prob-abilistic function of the state [100]. In this report, we use be Q to represent thepossible states of a HMM. We denote the state space of Qt at time t as qi as thewhere i is the state index. If there are K possible states, then qi = {q1, . . . , qK}.The observation Yt might be a discrete observation symbols, Yt ∈ {1, . . . , L}, or afeature-vector, Yt ∈ R

L. An HMM is characterised by the initial state distribution π,the state transition probability distribution A and the observation symbol probabilitydistribution B:

The state transition probability distribution A = {aij} where

aij = P [Qt+1 = qj |Qt = qi], 1 ≤ i, j ≤ K (2.14)

The initial state distribution π = {πi} where

πi = P [Q1 = qi], 1 ≤ i ≤ K (2.15)

The observation probability distribution B = P (Yt|Qt). If the observations arediscrete symbols, we can represent the observation model as a matrix: B(i, k) =P (Yt = k|Qt = qi). Gaussian distribution is normally employed to represent P (Yt|Qt)if the observations are vectors in R

L. Specifically,

P (Yt = y|Qt = qi) = N (y;µi,Σi) (2.16)

where N (y;µ,Σ) is the Gaussian density with mean µ and covariance Σ evaluated aty:

N (y;µ,Σ) =1

(2π)L/2|Σ| 12exp

(

−1

2(y − µ)′Σ−1(y − µ)

)

(2.17)

A more flexible representation is a mixture of M Gaussians:

P (Yt = y|Qt = qi) =

M∑

m=1

P (Mt = m|Qt = qi)N (y;µm,i,Σm,i) (2.18)

where Mt is a hidden variable that specifies which mixture component to use, andP (Mt = m|Qt = qi) = C(i,m) is the conditional prior weight of each mixturecomponent [87]. We can represent an HMM as a DBN as shown in Fig. 2.22.

2.7.3 Dynamic Bayesian Networks

DBNs allows modelling time series with similar advantages to HMMs while being ableto handle incomplete data, learn specific patterns and on-line learning [87]. Therefore,

2.7. EMOTIONS AND MENTAL STATES 39

Figure 2.22: DBN representation of an HMM for two time slices.

the probability that an expression Y is generated given the model is expressed asfollows:

P(Y |λ) =∑

Π(S1)P(Y1|S1)T

∏

t=2

P(St|St−1)P(Yt|St) (2.19)

where Y is a sequence of N dimensional vector of observations {Yt, t = 1, . . . T} Sis a sequence of states {St, St−1, . . . ST }, P(St|St−1), the transition probability fromstate St to state ST . Π(S1), the probability of being in state S at time St. P(Yt|St)p.d.f. of the observation vector Yt given the state St typically modelled as a mixtureof Gaussians. K is the number of states in the model and λ is the model parameters{K,P(St|St−1),P(Yt|St)}.

By factorizing the DBN, we generalize their representation by using a collectionof M discrete state variables:

St = S(1)t , ..., S

(m)t , ..., S

(M)t , (2.20)

where each S(M)t can take qm states, which yields to qM possible combinations of all

states. For FaDBNs the underlying state transitions are constrained since state vari-able evolves according to its own dynamics and uncoupled from other state variables.This model is commonly known as FaHMM:

P (St|St−1) =

M∏

m=1

P (S(m)t |S(m)

t−1 ) (2.21)

Combining the Eq. 2.21 with the p.d.f. Eq. 2.19, the final estimation is obtainedand the ML in each time step will determine the state solution for the given observa-tion. Note that this model follows the conditions of a first-order HMM, accumulatingall time-steps influences with the immediate previous step, see Fig. 2.23.(a).

Naturally this the state space can be extended, so that each state variable S(m)t

can take on K(m) variables the state space of the factorial HMM consists of all KM

combinations of the S(m)t variables, and the transition structure results in a KM×KM

transition matrix.


(a) FaDBN (b) TsDBN

Figure 2.23: DBN representation of FaHMM and TsHMM, i.e. FaDBN andTsDBN.

There is an assumption in FaHMMs, which is the a priori independence of thestate variables at one time step given the state variables at the previous time step. Bycoupling the state variables in a single time step [65], this assumption can be relaxed,

for example, by ordering them, such that S(m)t depends on S

(n)t for 1 ≤ n < m.

Similarly, the state variables and the output can be depending on an observable inputvariable Xt. This a Tree Structured HMM (TsHMM), see Fig. 2.23.(b). TsHMMsare useful to model time series with both temporal and spatial structure at multipleresolutions.

2.8 Chapter Summary

This chapter have covered the most relevant concepts required for the better under-standing of this thesis. A review of the literature has been analytically presented.Some classic approaches and articles have brought to discussion since there are sub-jects that have been under investigation for about twenty years. However, recentworks and survey are also included in order to update the reader with the last pre-sented results. Consequently, the reader can judge the suitability of the contributionshere presented, with respect to the past, present and future of all topics.

Chapter 3

Head and Face Feature Spaces

This chapter presents the framework developed for encoding head and face move-ments for facial expressions and emotions analysis. This method does not use eitheredge detectors or head pose assumptions. 3D head pose, eyebrows, mouth and gazesare described simultaneously based on multivariate statistical modelling and gradi-ent methods. A hierarchical Appearance-Based Tracking (ABT) method allows toaccurately encode rigid and non-rigid head-face movements in real time.

This chapter improves previous work in three directions [39]. Firstly, it is shownthat by adopting a non-occluded shape-free facial texture with a similar performanceto lips region, it is possible to track eyelid motions. In order to obtain more accurateand stable 3D head pose parameters, the dependency between the eye regions andeye-less template is proven. Secondly, similarly the independence between eye regionand inner eye region is also detailed. Third, a sequential assembling of ABTs yieldsto highly accurate tracking of head and facial actions. Unlike feature-based eyelidtrackers, AAM are made on-line adaptable to deal with more complex facial actionssuch as eyelid and iris motions.

3.1 Appearance-Based Tracking

We use boldface upper-case letters for matrices (e.g. Ai,j , Dn,k,l) and the correspond-ing dimensions as sub index. Vectors are written using boldface lower-case letters (e.g.x, r) and the corresponding components with lower-case letters with the sub indexes(e.g. x0, ..., xn). Greek letters are used for functions or constants (e.g. Φ(q), α).

Tracking methods aim to estimate the associated parameters to the current timet based on previous estimations. Therefore, they need a likelihood function to predictexpectations, a transition model to estimate the new state and error minimizationtechniques.

This section describes an observation process based on a single Multivariate Nor-mal Distribution, which provides the likelihood function. Similarly, a state transitionprocess estimates the current state based on the previous and differential appearancesby adopting an adaptive velocity model.

41

42 CHAPTER 3. HEAD AND FACE FEATURE SPACES

3.1.1 Statistical Appearance Modelling

In the following let us consider the appearance vector ~x representing a face in theframe t into the appearance sequence X = [~x0, ..., ~xt]. An appearance texture is theresult of applying a warping function Ψ : R

n → Rl:

Ψ(I, ~g) = ~x (3.1)

where I ∈ Rn is a normalized input image of n pixels, ~x = [x0, ..., xl]

T is an appearancevector of l pixels and ~g is a geometrical vector determining the transformation modesaccording to a shape model. Therefore, an appearance model is an l-dimensionalvector, where l is the number of pixels xi composing the appearance texture. Forthe sake of clarity, hence we consider the appearance vector at the frame t as ~xt =[x0,t, ..., xl,t]

T . Subsequently, for an input image sequence, an appearance sequence isobtained, X ∈ R

(l,t):

X =

x0,0 · · · x0,t

.... . .

...xl,0 · · · xl,t

= [~xT

0 , ..., ~xTt ] (3.2)

Given a long enough appearance sequence X, it can be assumed that a set ofpixels at the same position (xi,0, ..., xi,t) (X rows) as random variables (r.v.) fol-lowing a Gaussian distribution overtime, xi ∼ N (µi, σi). Similarly, each appearancevector ~x = (x0,j , ..., xl,j)

T follows a single Gaussian distribution independent on thetime. Subsequently, appearances are normalized to compensate partially for contrastvariations while obtaining xj ∼ N (0, 1), see Fig. 3.1.

Consequently, the join probability distribution for the set of appearances, is as-sumed following a Multivariate Normal Distribution (MND). That is, ~x ∼ N (~µ,Σ),where ~µ = [µ0, ..., µl]

T is a l-dimensional vector containing the corresponding meansof each r.v. xi. As well, Σ is the corresponding covariance matrix. Notice thatthe pixels variation is independent of others, thereby the individual covariance val-ues σij = σji = 0, are zero. Therefore, the covariance matrix is Σ = ~σ2 ∗ I, where~σ2 = [σ2

0 , ..., σ2l ] is a l-dimensional vector containing the individual variances of each

r.v. xi and I the identity matrix. Then, an appearance ~x is a r.v. following MND,~x ∼ N (~µ, ~σ2). In summary, the observation likelihood can be written as follows:

p(~xt) =l

∏

i=0

e−(xi−µi)2/2σ2

i

σi

√2π

(3.3)

According to the Central Limit Theorem (CLT), the distribution of a sum ofa large number of independent variables is approximately normal. Therefore, theappearance sequence X should be long enough to be approximated to a MND. Theobservation model starts from first frame collecting appearance vectors into the MNDto gain significance. Consequently, the likelihood function can be used to obtainthe expected values for the Gaussian parameters. Furthermore, a linear recursivecombination is adopted, where ~µt summarize past observations under an exponentialenvelope with a degradation factor α.

3.1. APPEARANCE-BASED TRACKING 43

Figure 3.1: All appearance vectors ~x are single Gaussian distributions andthe appearance sequence X follows a Multivariate Normal Distribution.

Given the previous estimated appearance ~xt, we can compute the expected ap-pearance and use it as observation model [64] for the next frame as follows:

~µt+1 = α~µt + (1 − α)~xt

~σ2t+1 = α~σ2

t + (1 − α)(~xt − ~µt)2 (3.4)

where ~µ and ~σ (µi and σi) are initialized with the first appearance ~x0, a constant value

for ~σ and α ≈ 0.0. Notice that ~σ2 and (~xt − ~µt)2 do not imply cross or dot products.

These are binary operations affecting each component variable1. The aforementionedapproximation is significant after 50 frames. The degradation rate α is set to 1/tuntil the frame 50, otherwise, α is set to a constant degradation factor according toexperimental results.

3.1.2 Appearance Estimation

An adaptive velocity model is adopted, which is a deterministic function to estimatethe state transition between two successive frames t and time t+1, thus:

~gt+1 = ~gt + ∆~gt (3.5)

1Strictly, ~σ2 = diag(~σ ∗ I ∗ ~σT ), where I is the nxn identity matrix.


where ∆~gt is an estimated increment vector based on the previous frame. The qualityof the transition estimation depends on the optimal increment vector, which minimizesthe error measure between expected and estimated appearances, ~r2 = ||~x − ~µ||2.Consequently, let us consider the minimization problem of the error function ~ξ(~g) :R

l → R, which is convex and twice continuously differentiable:

min ~ξ(~g) =1

2

l∑

j=0

r2j (3.6)

where the residual vector ~r depends on the geometrical vector ~g, that is, ~r(~g) =[r0, ..., rl]

T . The condition for the vector ~g∗ to be an optimal solution for Equation

(3.6) is ∇~ξ(~g∗) = 0. This problem is usually solved by an iterative algorithm that

generates a sequence of solutions (~g0, ..., ~gk) ∈ dom ξ, for which ~ξ(~gk) → ~ξ∗ as k → ∞.The proposed solutions follow the same adaptive velocity model as Equation (3.5),

where ∆~gt = δk~ρk, contains the step size δk and the line-search direction ~ρk. Thisis the Vanilla gradient descent method, which is the simplest and most intuitivetechnique to find the optimal solution for Equation (3.5) and (3.6) [20]. The search

direction is according to the negative gradient, −∇~ξ(~g), with a constant step size,

δ. The derivatives of ~ξ(~g) can be computed using the Jacobian matrix J of ~r(~gt),

which is defined as J(~g) =∂~rj

∂~gi, 1 ≤ j ≤ l and 1 ≤ i ≤ n, according to the number

of geometrical modes controlling the appearance construction and the appearance’spixels, respectively.

The Vanilla method suffers convergence problems such as time-consumption find-ing the optimal δ according to the slope. Another issue is the curvature of the errorsurface, which may not be the same in all directions. Expanding the gradient of ~ξ(~g)using Taylor series around to the current state t, the Newton’s Method (NM) can beapplied while disregarding the higher order terms [88]:

~gt+1 = ~gt − [H(~gt)]−1∇~ξ(~gt) (3.7)

where H(~gt) is the Hessian matrix or second derivatives. The use of both curvatureand gradient information improves the optimization convergence. Since the conver-gence is faster using Equation (3.7), the above convergence rate is sensitive to linearityaround the starting location. Accordingly, Levenberg [71] provided an algorithm based

on the Newton’s quadratic assumption, where ~H(~g) ≈ J(~g)T J(~g) and improving theNM:

~gt+1 = ~gt − [J(~gt)T J(~gt) + δI]−1∇~ξ(~gt) (3.8)

The Levenberg Algorithm (LA) is more robust than NM, the convergence is fastereven when the current appearance and the corresponding vector ~g are far of the nextestimation (optimal minimum). However, LA has the drawback that if δ is large, theHessian information is not used at all when it is crucial for short slopes.

Marquardt proposed an insight to scale each component of the gradient accord-ing to the curvature [78]. As a result, the Levenberg-Marquardt Algorithm (LMA)

3.1. APPEARANCE-BASED TRACKING 45

produces larger movements along directions where the gradient is smaller such as theclassical error valley. Consequently, the LA Equation (3.8) is updated as follows:

~gt+1 = ~gt − [H(~gt) + δdiagH(~gt)]−1∇~ξ(~gt) (3.9)

where H(~gt) = J(~gt)T J(~gt) and ∇~ξ(~gt) = J(~gt)

T~r(~g). This combination of Equations(3.5) and (3.9) provides the state transition, which is better than Vanilla gradientdescent, LA and NM.

However, illumination changes, occlusions, perturbing objects and fast movementsmay introduce outlier pixels to the statistic model and the estimation process. Driftingproblems occur when the ABT learns outliers by introducing them into the MND andthe LMA.

3.1.3 Stability to Outliers

In order to handle outliers, occlusions and out-of-plane movements, both statisti-cal modelling and estimation of appearances are constrained by using the Huber’sfunction [62, 25]. This function guaranties that Gaussian parameters will remain un-biased whilst exponentially updating the models around the centre. Thus, appearancemodels can be learnt on-line while coping with uncontrolled environments, outliers,out-of-plane movements. To do so, both appearance modelling and estimation mustbe combined with the outlier restriction:

η(xi) =

{

x2i

2 if |xi| ≤ c

c|xi| − c2

2 if |xi| > c(3.10)

where xi is the normalized pixel value in the appearance ~x and c is a constant outlierthreshold equivalent to 3*~σ. The pixel xi is an outlier when ||xi|| > c. Consequently,this restriction impacts the texture learning and the gradient descent, producing un-corrupted expectations and estimations, ~µ and ~x, respectively.

Subsequently, the observation process is improved to lessen the influence of outlierpixels with the η(x) function as follows:

P(~xt|~gt) =

l∏

i=0

e−η(xi)

σi

√2π

(3.11)

Similarly, the state transition process is improved to down-weight the influence ofoutlier pixels by constraining the LMA with the diagonal matrix Θ(~x), whose termsare:

Θ(xi) =1

xi

∂η(xi)

∂xi=

{

1 if |xi| ≤ cc

|xi|if |xi| > c

(3.12)

Finally, the Levenberg-Marquardt Algorithm must be also updated with the Hu-ber’s function as follows:

~gt+1 = ~gt − [H(~gt) + δdiagH(~gt)]−1Θ(~xt)∇~ξ(~gt) (3.13)


The next section describes a hierarchical implementation of Appearance-BasedTrackers (ABT), which makes possible to encode head, eyebrows, lips, eyelids and irismovements.

3.2 Active Appearance Modelling and Tracking

This section explains the details of modelling head and facial movements by usingActive Appearance Models (AAM) endowed with on-line learning capabilities andoutlier controller. Subsequently, the warping process applied to build up AAM is out-lined. Appearance-Based Models (ABM) build up a low-dimensional representationof non-rigid objects, thereby allowing accurate statistical analysis of complex shapes[30]. In addition, they are more robust to image acquisition changes and differentskin colours. AAMs encode shape and texture as related patterns according to sev-eral types of variations such as colour intensity, shape and local colour distributionaccording to the shape.

3.2.1 Face Representation

A human face is a 3D elastic surface that introduces non-linear deformations due tolarge head rotations and both strong and subtle facial movements. The 3D faceCandide Model [102] is wire-frame specifically developed for model-based coding,which is composed by 113 vertices and 183 triangles, see Fig. 3.2.

Figure 3.2: 3D mesh for modelling human faces. All facial actions areencoded by the vector ~g = [~ρ,~γ] ∈ R

6+9.

The shape model, S = F ∗ V, where F ∈ ℜ(183x3x113) contains the vertices con-figuration, V ∈ ℜ(113x3) the 3D position for 113 vertices, forming 183 triangles. The

3.2. ACTIVE APPEARANCE MODELLING AND TRACKING 47

shape model deals with the facial biometry, 3D head pose and local facial movementsaccording to the matricial equation:

V = V0 + D ∗ ~β + A ∗ ~γ (3.14)

where V0 ∈ ℜ(113x3) is the initial standard shape, D ∈ ℜ(113x3x19) encodes the bio-metric parameters of the face2 that are controlled by the vector ~β ∈ ℜ19. The matrixA ∈ ℜ(113x3x9) encodes the local facial actions3, which are controlled by the vector~γ. The vector ~γ follows the Facial Animation Parameters (FAP) for MPEG-4, whichare measured as continuous variables in the range [-1.0,1.0].

Adopting a week perspective projection because of the small depth of the face [5],the 3D mesh is projected onto the image plane by applying an affine transform toobtain the appearance template. Therefore, the image mapping through the 3D facemodel is according to:

S = sR(F ∗ V) + T (3.15)

where S is the projected 2D shape onto the image plane. s is the rescaling parameter,R = [rx, ry, rz] is the rotation matrix measured in Euler’s angles. The vector T =[tx, ty, tz] contains the translation of the shape in three dimensions. Therefore, thegeometrical vector ~g = [~ρ,~γ] collects all the parameters to deform the shape model,where the vector ~ρ = [θx, θy, θz, tx, ty, s] contains the global motion for the headrotation, translation and scale. The vector ~γ = [γ0, ..., γ8] contains the facial actionsfor eyebrows, lips, eyelids and irises.

The next step is to construct the appearance model, which is based on an affinetransform, Ψ(I, ~g). This function maps each triangle of the 2D mask model in Fig.3.3.(b) from the 3D shape in Fig. 3.3.(a) as follows:

△ S(x, y, z)Ψ(I,~g)=⇒ △S(x, y) (3.16)

where △S(x, y, z) corresponds to the triangles of the 3D mesh and △S(x, y) to thetriangles of the 2D reference texture for the appearance. The function Ψ(I, ~g) is alinear combination of barycentres to map the corresponding pixels between two posesof the shape model, see Fig. 3.3 [28].

By using two appearance resolutions of 82x80 and 42x40 pixels, the completeimage transformation is implemented as follows: (i) Adapt the shape model S to theimage I, see Fig. 3.3.(a). (ii) Construct the texture ~x using the warping functionΨ(I, ~g). (iii) Perform the normalization ~x ∼ N (0, 1) on the obtained appearance, seeFig. 3.3.(b).

2Deformation parameters: Head height, Eyebrows vertical position, Eyes verticalposition, Eyes width, Eyes height, Eye separation distance, Cheeks, Nose depth extension,Nose vertical position, Nose pointing up, Mouth vertical position, Mouth width, Dilate iris,Eyes vertical difference, Chin width, Outer corner eyes, Iris asymmetry.

3Animation parameters: Upper lip raiser, Jaw drop, Lip stretcher, Brow lowerer, Lipcorner depressor, Outer brow raiser, Eyes closed, Iris yaw, Iris pitch.


(a) (b)

Figure 3.3: (a) A shape model drives the warping function Ψ(I, ~g) to mappixels from the input image into the shape model over a reference texture (b).

3.2.2 ABT and AAM for Smooth Motion

Tracking head and facial movements is challenging for marker-less approaches andAppearance-Based Trackers (ABT), since the 3D head motion involves six degrees offreedom to estimate. Moreover, facial actions such as eyebrows, lips, eyelids and irisesare non-rigid surfaces with different kinematic. This is the case of eye movements thatare faster, sensitive to illumination changes and self-occluded [91].

Given an image sequence showing head and facial movements, the tracking con-sists of estimating the position of the 3D head position, eyebrows, lips, eyelids andirises for each frame. Therefore, the goal of ABTs is estimating the vector ~g = [~ρ,~γ]at the frame t, where ~ρ is the 3D head pose and ~γ is the facial action vector. In thecontext of tracking, the adaptation results associated with the current frame will behanded over to the next frame.

Facial feature detectors may give an automatic tracking initialization [32]. How-ever, the manual placement of the mesh at the first frame, guarantees a better trackingperformance, since the biometry is kept along the sequence to provide accurate esti-mations.

In the practice, an ABT constructs the appearance according to the 3D shape(Section 3.2.1) in order to register new r.v. in the Multivariate Gaussian (Section3.1.1). Subsequently, the LMA (Section 3.1.2) generates an space of estimated ap-

pearances ~xt+1, which are compared with the expected average appearance ~µt+1. Thewhole ABT Algorithm 1 is as follows:

The above tracking method combines the strengths of adaptive appearance modelsand optimization methods, which constitutes a special combination of stochastic anddeterministic approaches.


Algorithm 1 Appearance-Based Tracking for head, eyebrows and lips.Require: Input images It, It+1 and the vector ~qt = [~ρt, γ0,t, ..., γ5,t] matching the shapemodel to the input image.Ensure: The best matching shape S∗

t+1 of the image.

1: Ψ(It, ~qt) = ~xt. Construct the appearance for the image It by applying the Equations(3.14),(3.15) and (3.16).

2: Obtain ~xt ∼ N (0, 1).

3: Obtain ~µt(~q) and ~σ2t (~q) using Equation (3.4).

4: Calculate J(It, ~qt) = ∂~rt

∂~qtand Hessian H(It, ~qt) = ∂2~rt

∂~q2t

matrices for the residual image

~r2t = ||~xt − ~µt||

2.

5: For k (number of iterations)

6: ~qkt+1 = ~qt − [H(~qt) + δdiagH(~qt)]

−1Θ(~xt)J(~qt)T~r(~q)

7: Obtain Skt+1, Equations (3.14) and (3.15).

8: Ψ(It+1, ~qkt+1) = ~xk

t+1

9: min ~ξ(~qkt+1) = 1

2

Pl

j=0 r2j

10: Return S∗

t+1(~q∗

t+1) = S∗

t+1, the optimal shape according to the geometrical vector

producing the minimum error with respect to the expected appearance ~µt+1(~q).

3.2.3 Experiments ABT in Real-Time

In these experiments the FGnet talking face video for face tracking is used, whichcontains a single moving and talking head [50]. The dataset contains five imagesequences, each one is composed by 1000 images of head and shoulders manuallyannotated based on a shape model of 68 facial features. The ground truth for head,eyebrows, lips, upper and lower eyelids, and iris centre is used for validation of trackingresults, see Fig. 3.4. A more challenging set of image sequences are also used such asthe FGnet facial expression database [121], videos from internet and others recordedin laboratory.

The convergence error ξ(~g) = 12 ||~r(~g)||2 is used to evaluate those experiments with

image sequences lacking of manual annotations for ground truth. For example, videosfor testing challenger textures, specific movements and expressions, illumination vari-ation and occlusions. In these cases, the output image helps to verify the results,since the drawn shape onto the input image depends on the tracking estimations, seeFig. 3.5.

The experiments were run on a 3.2 GHz Pentium PC and implemented in ANSIC code. The image sequences were recorded with monocular cameras and standardresolutions. The sequence frame rate is 25 Hz. Two reference textures have been usedfor warping eye images into the appearance models, 170 (17x10) and 580 (29x20)pixels. Similarly, two reference textures of 1426 (42x40) and 5659 (82x80) pixelswere used for face tracking testing. The number of iterations for the Levenberg-


(a) (b)

Figure 3.4: The FGnet talking face is used for an standard test of thefacial and gaze tracking system. It has been released with the correspondingannotations for 68 facial features (a). (b) Thirty feature points coincide withthe shape model here used for face tracking.

Marquardt Algorithm (LMA), k, is estimated experimentally according to the averageof iterations needed to converge when tracking long image sequences.

3.2.3.1 Ground Truth Comparison

Initially, the reliability of the provided ground truth is evaluated by comparing themagainst the tracking results. Each image sequence has 1000 frames of a size of 720x576pixels. The actor performs slow and short head movements, close to the frontalposition while moving eyebrows, lips, eyelids and irises.

By comparing the 30 feature points shown in Fig. 3.4.(b), it is possible to computethe FACS parameters of the facial actions encoded by tracking. For eyelid compar-ison, the average difference for the vertical position of upper and lower eyelids iscomputed. It is worth to mention that these positions may vary depending on thehorizontal position for both ground truth and tracking results. However, the averageerror obtained for eyelid position estimations is 3.2 pixels per frame. Similarly, irisestimations are compared to the iris centre of the ground truth. The average errorobtained for iris position estimation is 2.2 pixels per frame. Head pose is more diffi-cult to compare since the annotations follow the edges of the face without consideringthe 3D perspective. But a pair wise comparison between both face models gives anaverage error of 1.5 pixels per frame for head pose, eyebrows and lips.

It is possible to see in Fig. 3.6 that higher errors are present at frames whereeye blinking or fast iris movements are occurring. Nonetheless, the error decreases


(a) (b)

Figure 3.5: (a) The image I and edges of the shape model F do not matchin a wrong result. (b) The matching between face and shape model shows acorrect tracking estimation.

when both trackers improve the convergence in next frames. Altogether, the errormeasurement with respect to the ground truth agrees with the convergence error,which is possible to verify with the correct adaptations as shown in Fig. 3.7. Aslarger the mismatching of shape and image it is, as higher the tracking estimationerror it is. Higher errors are considered about 6.0 pixels with respect to the groundtruth and 4.5 pixels regarding the expected appearance.

A comparison between the ground truth and estimation errors validates the use ofthe later one for those sequences missing manually annotations of the ground truth.The Fig. 3.8 shows the similarity of these two error measures. In relation to theground truth data, it is important to mention that it may have wrong annotations,which are possible to verify with the eyelid tracker Tw. This tracker estimates thevector ~w = [~ρ, γ0, ..., γ5, γ6], where γ6 ∈ [−1.0, 1.0], -1.0 when the eyes are closedand 1.0 when they are in an open position. For example, as shown in Fig. 3.8,eyes are closed at frames 125, 126, 242, 243, 603 and 604, that is γ0 = −1.0 andthe distance between upper and lower eyelids must be zero. However, the groundtruth differences are never zero pixels (-1.0 corresponding to the FACS), which maybe mistaken annotations. The Fig. 3.7.(a) depicts a correct alignment of image and


Figure 3.6: The estimation error compared to the ground is 3.2 pixels perframe for the eyelid tracker (red line), 2.2 pixels per frame for the iris tracker(green line) and 1.5 for face tracker (blue line).

(a) (b) (c)

Figure 3.7: Tracking and ground truth are compared (a). However, theground truth may have mistakes for eyelid (b) and iris (c) annotations, wherethe sequential ABT is aligned with the image.

shape model. However, in Fig. 3.7.(b), it is possible to see a correct detection of theblinking while the ground truth wrongly marks separated upper and lower eyelids.Similarly, the iris is not correctly market with the ground truth, see Fig. 3.7.(c)

3.3. IMPROVING ABT FOR RAPID MOTION TRACKING 53

Figure 3.8: The ground truth is not accurate by marking eye blinks.

The iris tracker estimates the vector ~w = [~q, γ6], where γ6 = [−1.0, 1.0]; -1.0 wheneyes are closed and 1.0 in open position. Frames such as 124, 243 and 603 give us aFACS value of γ6 = −1.0 corresponding to blinks, and the distance between upperand lower eyelids should be zero pixels. However, the ground truth differences arenever zero pixels (-1.0 in the FACS range), see Fig. 3.8.

3.3 Improving ABT for Rapid Motion Tracking

Here the details of how to deal with the tracking of rigid and non-rigid face movements.First of all, head and facial actions have to be decoupled because they have differentmotion performance. On one hand, the head is a rigid object, without self occlusionsand smooth global movements. On the other hand, facial actions are local movementsin a non-rigid surface, where eyebrows and lips have fluid motion while eyelids andirises can change positions spontaneously through quick movements.

3.3.1 Head, Eyebrows and Lips Tracker

The first task is to analyse the eye region motion influence over the head pose esti-mation while only tracking eyebrows and lips. To this end, a robust ABT algorithmas described above must be applied, Algorithm 1. Two generic and different appear-ances are used, which are depicted in Fig. 3.9.(a) and 3.9.(b). Note that the secondappearance is obtained from the first one by removing the eyes region. Therefore, letthe vector ~q = [~ρ, γ0, ..., γ5] be the state shape for both ABTs, which includes the six


head pose parameters, eyebrows and lips. This vector is estimated by the tracker ~T~q

based on the geometrical vector ~q and the appearance texture in Fig. 3.9.(a).

(a) (b)

Figure 3.9: Comparison of ABTs (a) including and (b) excluding the eyesregion.

Eyelid and iris facial actions are faster than eyebrows and lips. All facial actionsare encoded according to the FACS parameters as continuous variables in the range[-1.0,1.0]. Head, eyebrows and lips have smooth movements that may vary in ± 0.01in two successive frames. Thus, eyelids have spontaneous movements called blinking,and the eyelid FAP may vary in ± 2.0 in two successive frames. Similarly, iris FACSmay vary in ± 0.5 due to other spontaneous movements called saccade. Therefore,the tracking process must be decoupled to avoid the inner eye region motion, whichmay add uncertainty and drifting problems when is not correctly adapted. Becauseof this, two different shape models should be used to register both regions separately,to be able to handle different search directions and damping factor he with the LMA.

The peaks in Fig. 3.10 demonstrate how eye motion increases the estimationerror. The head pose are the most affected since their estimations are based on thewhole appearance mask [43]. Consequently, this tracker is suitable for head, eyebrows,lips and stabilization.

3.3.2 Eyelid Tracker

Both eyelids and irises have smooth and spontaneous movements, which are difficultto track using statistical and deformable models [90]. Eye region images are small andwith low resolution when using monocular cameras. Eyelids and irises have specialinteraction suggesting correlation between them. On one hand, iris motion deformsthe eyelid surface, which demands additional adaptation for ABTs. On the otherhand, eye blinking occludes the iris region, implicating additional challenge to recoverthe correct position after occlusions. These small iris movements are called saccadeand are difficult to predict.

Tracking eyelids and irises with the same appearance models may produce driftingproblems due to different intensity textures and occlusions. Therefore, we proposeto construct two appearances for two independent trackers [91]. Firstly, an appear-ance ~x(~w) for eyelid tracking, excluding iris’ FACS from the shape with the vector~w = [~q, γ6]. Those pixels in the inner eye region are warped as eyelid pixels in theappearance texture, see Fig. 3.11.(a).


Figure 3.10: The head pose estimation is more accurate with an ABT exclud-ing the eyes region. The peaks reveal the eye region motion adding uncertaintyto the estimations.

(a) (b)

Figure 3.11: Two different shape models and appearances for eyelid tracking(a) and iris tracking (b).

Secondly, an appearance ~x(~g) for iris tracking includes eyelid and iris pixels, seeFig. 3.11.(b). However, this tracker has special strengths to estimate irises rather thaneyelids. Once the eyelid tracker gives its estimation, the iris tracker can estimate irismovements while refining the previous eyelid position.

Using the current shape St based on the geometrical vector ~w, we construct anABT for eyelids, ~T~w, taking the following steps as explained above, the next Algorithm2 is obtained:

Observe that Algorithm 2.(4:), the eyelid facial action γ6, may vary in ± 2.0 in twosuccessive frames. Therefore, the partial differences include the whole range [-1.0,1.0].A backtracking procedure is introduced at Algorithm 2.(8:) in order to compute the

appropriate damping factor δ for the eyelid tracker, ~T~w. At this step the Armijo


Algorithm 2 Eyelid TrackingRequire: Input images It, It+1 and the vector ~wt = [~qt, γ6,t] matching the shape model tothe input image.Ensure: The best matching eyelid shape S∗

t+1(~wt+1) with the image.

1: Ψ(It, ~wt) = ~xt(~wt). Construct the appearance for the image It by applying theEquations (3.14),(3.15) and (3.16).

2: Obtain ~xt ∼ N (0, 1).

3: Obtain ~µt(~w) and ~σ2t (~w) using Equation (3.4).

4: Calculate J(It, ~wt) = ∂~rt

∂ ~wtand Hessian H(It, ~wt) = ∂2~rt

∂ ~w2t

matrices for the residual

image ~r2t (~w) = ||~xt − ~µt||

2.


6: Compute outliers matrix Θ(~xt)(~w).

7: Compute search direction dk = −[J(~wt)T J(~wt)]

−1Θ(~xt)∇~ξ(~wt)

8: While ~ξ(~wk + δkdk) > ~ξ(~wk) + δkdk

9: δk =Pk

0(−1)k

k.

10: δ = δk.

11: ~wkt+1 = ~wt − [H(~wt) + δdiagH(~wt)]

−1Θ(~xt)J(~wt)T~r(~w)

12: Obtain Skt+1(~w∗

t+1), Equations (3.14) and (3.15).

13: Ψ(It+1, ~wkt+1) = ~xk

t+1

14: min ~ξ( ~wkt+1) = 1

2

Pl

j=0 r2j

15: Return S∗

t+1(~w∗

t+1) = S∗

t+1, the optimal shape adapting the eyelids.

Condition [88] is taken into account according to the search direction. However, thesearch direction is further modified with the Hessian instead the Jacobian to includecurvature information. Based on these two steps, the eyelid tracker can provide aspace of closer solutions and faster convergence.

3.3.3 Iris Tracker

For the same image and using the current shape ~St, based on the geometrical vector~g = [~w, γ7, γ8], we construct an ABT for irises, ~T~g, taking the following steps:

Iris movements are more subtle than eyelids, which is evident on slower iris FACSvariation, about ± 0.5 in two successive frames. Therefore, at Algorithm 3.(7:) theJacobian and Hessian for irises are estimated in the range [~γi−0.5, ~γi+0.5] for i = 7, 8.Notice that previous algorithm starts from estimations of current estimations of faceand eyelid trackers for the frame t + 1. The LMA Equation (3.13) is then appliedcombining previous input frame and the best estimated shape up to this hierarchicalsteps. Moreover, the backtracking line-search procedure considers a different way todeterministically calculate the damping factor δ, due to different kinematics of eyelids


Algorithm 3 Iris Tracking

Require: Input images It, It+1 and the vector ~gt+1 = [ ~wt+1, γ7,t, γ8,t] matching the shapemodel to the input image.Ensure: The best matching eyelid shape S∗

t+1(~gt+1) with the image.

1: Ψ(It, ~gt+1) = ~xt(~gt+1). Construct the appearance for the image It by applying theEquations (3.14),(3.15) and (3.16).

2: Obtain ~xt ∼ N (0, 1).

3: Obtain ~µt(~g) and ~σ2t (~g) using Equation (3.4).

4: Calculate J(It, ~gt+1) = ∂~rt

∂~gt+1

and Hessian H(It, ~gt+1) = ∂2~rt

∂~g2t+1

matrices.


6: Compute outliers matrix Θ(~xt, ~gt+1).

7: Compute search direction dk = −[J(~gt+1)T J(~gt+1)]

−1Θ(~xt, ~gt+1)∇~ξ(~gt+1)

8: While ~ξ(~gk + δkdk) > ~ξ(~gk) + δkdk

9: δk =δk−1

υ, for υ > 1.

10: δ = δk.

11: ~gkt+1 = ~gt+1 − [H(~gt+1) + δdiagH(~gt+1)]

−1Θ(~xt, ~gt+1)J(~gt+1)T~r(~g)

12: Obtain Skt+1(~g

∗

t+1), Equations (3.14) and (3.15).

13: Ψ(It+1, ~gkt+1) = ~xk

t+1

14: min ~ξ(~gkt+1) = 1

2

Pl

j=0 r2j

15: Return S∗

t+1(~g∗

t+1) = S∗

t+1, the optimal shape adapting the irises while refining head,eyebrows, lips and eyelids.

and irises.

3.3.4 Head, Eyebrows, Mouth and Gaze Tracking

Given that head pose, eyebrows and lips estimations are more accurate when excludingthe eyes region (Section 3.3.1), the tracker ~T~q and its corresponding shape model areused as basis, stabilising the hierarchical tracking, see Fig. 3.12.(a).

The first tracker ~T~q estimates the shape vector ~q providing the best head pose,

eyebrows and lips adaptations according to its convergence, min[~ξ(~q)] = ~ξ(~q∗). Next,

both face and eyelid trackers assemble (~T~q and ~T~w) to provide the optimal solution

for the vector ~w = [~q, γ6]. Once the eyelid tracker converges, ~ξ(~w∗), face, eyelid and

iris trackers are combined, ~T~q and ~T~g, to estimate the vector ~g = [~w, γ7, γ8], see Fig.

3.12.(b). ~T~q, ~T~w and ~T~g are combined by adding the already estimated componentsto vectors ~w and ~g.

The face tracker ~T~q sets the starting point for the eyelid tracker in the LMA (5.c).The eyelid tracker starts the iterative process (5.c) with the vector ~wk = [~q∗, γ6],


(a) (b)

Figure 3.12: (a) Three Shape models are combined to estimate hierarchi-cally head, eyebrows, lips, eyelids and irises. (b) Three appearances are com-bined, the yellow one for head, eyebrows and mouth, the red one for eyelidsand the green one for irises.

where ~q∗ contains the best estimations of head, eyebrows and lips by the face tracker~T~q.

The eyelid tracker ~T~w, sets the starting point for the iris tracker in the LMA(5.c). Therefore, the tracker ~T~g estimates the iris position starting from previousestimations for the iris and the already estimated head, eyebrows, lips and eyelids~gk = [~w∗, γ7, γ8], where ~w∗ collects the optimum adaptations from face and eyelidtrackers for the current frame, (t+ 1).

Both eyelid and iris trackers are conditioned to improve the previous trackers. Ineach tracker, the LMA is constrained for the eyelid tracker to improve the convergenceerror of the face tracker, ξ(w∗) ≤ ξ(q∗). Similarly, the iris tracker has to improveboth previous trackers by restricting the LMA according to ξ(g∗) ≤ ξ(w∗) ≤ ξ(q∗).

Appearance modelling, estimation and Jacobian matrices are based on results ofprevious adapted frames. Therefore, all trackers can run simultaneously and inde-pendently until the LMA starts, then, they run sequentially. Consequently, the threetrackers are efficiently connected by applying three times the iterative minimization,LMA. The head pose and whole face are taken into account for both eyelid and iristrackers in order to obtain estimation related to 3D perspective of pose variation.The eyelid tracker is independent from iris estimation but forced to improve the facetracker. The iris tracker is led to the correct eyelid position and required to improvethe eyelid convergence error.

The hierarchical tracking combines the strengths of three ABTs. Specific shapemodels contribute to avoid the uncertainty of eyes region to track the head pose.We avoid the high contrast between eyelids, sclera and irises by using two differentshapes. The space of solutions is extended by using different FACS ranges and step-sizes for the estimation of the Jacobian computation, since the particular backtrackingprocedures improve the convergence for each tracker.


3.3.5 Experiments Hierarchical ABT

3.3.5.1 Eyelid Tracking

The efficiency of the eyelid tracker Tw is assessed based on the estimations of the3D head pose and the eyelid facial action γ6. A non-self-occluded shape model iscombined with the face model to gain stability for 3D head pose estimations.

For this experiment a different image sequence recorded in a laboratory withmonocular cameras and standard illumination is used. Each of the 700 frames includesthe head and shoulders performing a variety of eyelid closure speeds, see Fig. 3.13.The eye-cropped region shows the experimental results. This image sequence hasbeen chosen due to the extreme movements that require high accuracy to the eyelidtracker and robustness to fast movements.

For comparison purposes, the proposed approach by Wu et al. [125] has beenimplemented to enhance the eyelid position estimation within the tracking approach.Instead of computing the gradient descent for eyelids, the position has been estimatedbased on shape matching, edge detection and skin colour thresholding. However,standard images present illumination changes and shape variation yielding to discreteoutputs for eyelid positions.

In this sequence there are also typical movements of the eye region such as low andsmooth motion, eye slitting, eye closing and eye squinting. Forced and spontaneousmovements like eyelid raising, eyelid tightening, winking and blinking, are handledcorrectly. This tracker accurately estimates the eyelid positions independently on irisrotation. This is, the eyelid facial action variable behaves as independent variablealthough the physical anatomy and iris motion do determine the eyelid position.For this appearance model, noise and errors are registered as much as iris textureis warped into the reference texture. As shown in Fig. 3.13, eyelid tracking resultsfollow a continuous curve beyond estimating binary states (open and close) or discretepositions as [125]. Wide valleys or peaks correspond to normal raising-closing motionwhile sharp peaks are detected blinks.

3.3.5.2 Iris Tracking

The iris tracker Tq estimates both eyelid and iris position, as well as the head pose.The iris yaw and pitch parameters are evaluated as continuous variables at the samecontinuous scale [-1.0,1.0]. Although iris motion is anatomically independent fromeyelids, the visual identification of iris position depends on the eyes closure. Asconsequence, the iris tracker needs to also estimate this facial action based on previousframes, the current eyelid and head positions.

In this experiment, another image sequence has been recorded taking into accountthe most remarkable demands for a good iris tracker. 500 frames of size 640x480 pixelswere recorded with a photographic camera in VGA mode and standard illumination.The actor shows iris movements in pan and tilt directions, and extremely lookingaskance.

The type of gaze movements estimated by the iris tracker have been also encodedby the FACS [48] in four separated AUs; eyes turned up, down, left and right, whichincludes extreme movements like askance looking, where the iris could be partially


Figure 3.13: The eyelid tracker Tw estimates the FACS according to thecontinuous curve, instead of a discrete state eyelid estimation based on Wu etal. approach [125] (shown by the dash line).

occluded or distorted by the 3D perspective. However, involuntary movements suchas iris saccades are commonly detected after eyelid occlusions or gaze accommodation.

The Fig. 3.14 shows how the iris tracker correctly estimates yaw and pitch move-ments. The light dash curves represent gaze tracking results by using the discretescale of related approaches [125, 60]. These two proposals are similar in the sensethat both rely on the colour intensity and contour patterns to infer eyelid and irisposition. Exemplar results can be seen at the top of Fig. 3.14, where low error adap-tations correspond to better enclosing of iris by the rectangles drawn around iris.Frames between 101 and 151 show γ8 near −0.5, when the subject is looking down.In frame 300 the iris pitch is γ7 = −1.0 because the subject is looking askance to theleft, while frame 400 has γ7 = 1.0 for askance to the right.

Psychological studies have addressed the importance of analysing gaze movementsregarding emotion analysis, image encoding and HCI. These movements are charac-terised by the Facial Actions Coding System (FACS) [48] as independent Action Units(AUs). The automatic encoding, recognition and interpretation of such facial actionsare still open subjects at nowadays research.


Figure 3.14: The iris tracker, Tq, estimates yaw and pitch movements in acontinuous scale variation as shown by the curve. The squared dash line showshow the discrete estimation could be, based on a combination of approaches[125, 60].

3.3.5.3 Hierarchical Tracking

This experiment shows the comparative results of the iris tracker for eyelid estimationwithout any previous eyelid tracker estimation, see Fig. 3.15. In order to compareboth trackers, the FGnet talking face is used. First, the input sequence is completelyanalysed by the iris tracker without previous intervention of the eyelid tracker. Duringeye blinking, the iris tracker is not able to adapt the shape model correctly. Hence,the eyelid pixels over the inner eye region are rejected and they are not included inthe appearance model. Therefore, the shape remains static at the previous correctadaptation, which is the way how occlusions are handled, see again Fig. 3.15.

The reason behind this results is that the search direction and Jacobian matrixfor eyelids are differently scaled in a wider neighbourhood, [−1.0, 1.0]. Therefore, thebacktracking process for eyelid estimation regularizes the search process by greaterdamping factors while speeding up the convergence for all appearance parameters.

However, for the same sequence, the hierarchical tracking obtains first 3D headpose, eyebrows and lips Tq. Subsequently, it is possible to obtain a first approximation


Figure 3.15: The iris tracker Tq is compared to the sequential trackingTw + Tq, by measuring the iris estimation in relation to the ground truth.

to the eyelid position by applying the eyelid tracker Tw. Lastly, the iris tracker Tq,starts estimating the iris movements whilst refining the current eyelid estimationand lightly the rest of the face. After applying this sequential tracking, the eyelidestimation is more accurate due to the second iterative process as can be seen in Fig.3.16. This is demonstrated by the descent of the error estimation in relation to thelikelihood and the ground truth.

All facial actions ~γ and the 3D shape pose are independently estimated since theJacobian J is calculated by partial differences. Although the iris tracker includeseyelid facial action, it cannot converge as faster as required for eyelid movements,which is the case of closed eyes or blinks. On one hand, the texture contrast betweenthe inner and the outer eye pixels is greater than the outlier threshold in the Huber’sfunction. Therefore, inner pixels are outliers and are not learnt fast enough to expectquick changes. On the other hand, the gaze vector, [γ7, γ8], is estimated in a smallrange with a ratio of ± 0.5.

Another test for the sequential tracking is done by using an image sequence fromthe FGnet DB for facial expression [121]. It contains 100 frames of 320x240 pixels,where the actor performs an expression of anger while squinting, blinking and movingthe iris. In Fig. 3.17, it can be seen how the iris position is retrieved after eyelid


Figure 3.16: The iris tracker is incapable to estimate eyelid position, whichproves the requirement of a hierarchical structure of trackers. The sequentialtracking Tw + Tq improves the eyelid estimations since the iris tracker Tq

(light colour curve) is not able to track eyelids.

occlusion. Moreover, the eyelid tracker adjusts correctly to the eyelid position dur-ing the iris motion. Only eyelid estimation and its convergence are shared by bothtrackers. Therefore, iris estimations do not influence the eyelid tracker at the nextframe.

3.3.5.4 Gaze tracking with standard web videos

This experiment attempts to demonstrate the robustness and accuracy of the se-quential gaze tracking with video clips from the internet. A standard video clip wasdownloaded from The Oprah Winfrey Show c©, which apart of the low image qualityand imagery settings required for fast internet broadcasting, it presents occlusions,illumination changes and zoom effects typical of a TV show.

Although the skin may be confused with inner eye region due to the colour anddark make-up, the eyelid tracker efficiently estimates the facial action while dealingwith blinking, eye closure and fluttering. Similarly, the iris position estimation hasbeen acquired as above described in a sequential manner after the eyelid tracker


Figure 3.17: The sequential tracking estimates head, eyebrows, lips, eyelidsand irises while expressing emotions, squinting and blinking eyes.

intervention. Notice that a small texture mask was use, 10x17 pixels, as referencetexture to build up the appearance models. Saccade movements and normal pan andtilt iris motion are described as continuous variable as shown in Fig. 3.18.

On one hand, the eyelid tracker adapts the eyelid position for slow movementsand blinks while estimating the head orientation. When sclera and irises are visible,they are warped as skin by the eyelid tracker, otherwise, they are considered outlierswhen the eyelid covers the inner eye region. On the other hand, the iris tracker dealswith slow yaw and pitch motion while recovering the head adaptation after eithersaccade motion or iris motion while eyelids occlude the iris. The Fig. 3.18 shows alsothe details of gaze estimation for this video clip. Observe the similarity of FACS curvefor eyelids and iris pitch, which is due to the common synchronized and spontaneousmovement. This is a physical and anatomical relationship that is independently wellestimated by the sequential ABT.

3.3.5.5 Lighting Conditions

Appearance trackers commonly suffer drifting problems due to sensitiveness to illu-mination changes. This happens mainly because of their dependency on training oftextures and shapes.

To prove the robustness of the appearance-based tracker to handle illuminationchanges, a flashing light sequence of 800 frames was recorded. A web camera wasused for recording and each frame has a size of 352x288 pixels. The image brightnesshas been increased by frontally illuminating the face with a fluorescent light. Theperson uses his hands to create shadows and occlusions, which represent additionalchallenges with flashing lights, see Fig. 3.19.

The Fig. 3.19 shows how the ABT can adapt whilst illumination changes. Beyondimage filtering to smooth out the texture variation, this tracker has been endowedwith a controlled learning ability based on a combination of the likelihood and theHuber’s function. Besides, both trackers handle different learning rates, because ofthe three different kinematics of head-eyebrows-lips, eyelids and irises.


Figure 3.18: Gaze can also perform on standard web video clips, whichpresent typical imagery effects as zooming and lighting, as well as lower qualityimages for faster broadcasting.

At frame 525, a fluorescent lamp is turned off to vary the illumination. Therefore,the FACS plot shows how the hierarchical trackers notice the environment changes,the errors increase and the FACS parameters become unstable. Both trackers recoverthe stability as soon as the new appearances are learnt and the ABT expectationsinclude this new data, see Fig. 3.20.

When illumination changes, the search process is affected since the LMA has todivert the step sizes and directions for the gradient descent. This diversion can helpsto handle illumination changes despite of affecting convergence. Consequently, thesame number of iterations has to remain constant while assuming the illuminationchanges are not extreme. Further steps of the sequential tracking improve the esti-mations as long as the new illumination conditions remain stable. Thus, the expectedappearances can update the models with new information whilst exponentially for-getting the previous illumination. This adaptive learning in companion with outlierscontrolling yield the LMA to recover the stability as can be seen in Fig. 3.19 afterthe frame 541.


Figure 3.19: The ABT is stable to illumination changes by learning the newenvironment conditions and decreasing the error. Observe how the estimationerror jumps to double at illumination changes, but it starts falling as longerthe ABT adapts and learns this new conditions.

3.3.5.6 Occlusions and Real-Time

There are real situations where it can be seen how occlusions affect the estimationerrors for all individual and hierarchical trackers. This experiment presents an imagesequence of 600 frames, recorded in an indoor scenario with a monocular camera. Thesubject performs head movements and exaggerated facial actions while illuminationis subtly changed. At one frame, the subject puts on a pair of eyeglasses, whichproduces occlusions and intensity variations see Fig. 3.21.

Spontaneous eyelid blinks occlude the iris region several times along the sequence.


Figure 3.20: The ABT recovers stability if the illumination changes arenot extreme. Under lighting conditions, the error increases until the newappearance updates the Gaussian model. Then, the error starts decreasingand the ABT becomes stable.

Figure 3.21: Occlusions normally affect face trackers, but new textures nor-mally yield to drifting problems. The ability of on-line learning appearancetextures in junction with outliers handling allow the tracker to deal with thischallenge.

After the blinking, the iris tracker has to recover the correct adaptation because duringthe blinking the position may change. This search can use one or two more frameswhile increasing the estimation error, for example, at frames 17, 164, 232. However,iris saccade movements deform the eyelid surface, changing the descent direction forthe eyelid tracker. Fig. 3.22 shows how the iris movements influence the estimationerror of the eyelid tracker at frames 96, 330, 483 and 551.


Figure 3.22: This sequence exhibits both blinks and saccades. For example,eyelid blinks occur at frames 17, 164, 232 and iris saccades are detected atframes 96, 330, 483 and 551.

In order to achieve the real-time requirements, this image sequence has beentested by applying an ABT, which builds appearances over small reference textures,10x17 pixels for eye region. Accuracy and robustness are compared between using big(20x29 pixels) and small (10x17 pixels) reference textures. The obtained images andthe time spent to find the correct adaptation are reported in Fig. 3.23).

Tracking experiments with a small appearance resolution obtained an average of85% of correct adaptations at 32 frames per second (fps). Instead, big appearanceresolution reports an average of 96% of correct adaptations and 1.1 fps. It is worthto mention that for big resolution appearances, the iris has also fewer pixels, i.e. 2x3appearance while 5x6 pixels for the big one.

3.3.5.7 Large Head Movements

Out-of-plane movements have a strong impact on the 3D head pose estimation andmore than the 50% of pixels are consider outliers producing a noisy appearance, whichis the case of monocular cameras. However, the Huber’s function provides stabilityto both observation and state transition processes.

A sequence of 650 frames was acquired, where actor performs large head move-ments and facial actions, in order to assess the effect of 3D head pose inaccuracies onthe facial action tracking, see Fig. 3.24. Each tracker is used with enough number ofiterations in the LMA, extending the ratio of the line search stage without losing the


(a) (b) (c)

Figure 3.23: Performance comparison between two ABT of 82x80 pixels(580 at eye region) (a) and 42x40 pixels (170 at eye region) (b). It is possibleto obtain robust results (c) with similar accurate results (a) and (b)

real-time.

The Fig. 3.24.(a) depicts extreme head rotations where the facial actions arealmost not affected by the introduced noise. The 2D projection of out-of-plane errorsproduce very small errors in the image plane such that the alignment between theshape and the regions of eyebrows, lips, eyelids and irises is still good enough tocapture their independent movements correctly.

In Fig. 3.24.(b), it is possible to see a large head movement out-of-plane, whichdoes affects facial action adaptations. More than 50% of the pixels are outliers, the2D projection of the shape is not correctly aligned with the face and facial actions,forcing the appearance to remain at the last correct adaptation. Once the head poseand facial actions go back nearby to that position, the hierarchical tracker retrievesthe stability and the correct adaptation.

3.3.5.8 Translucent Textures

It is an interesting challenge to analyse images when subjects wear eyeglasses orsunglasses. This experiment was intended to test the capability of the sequential gazetracking to handle translucent textures such as people wearing sunglasses.


(a)

(b)

Figure 3.24: (a) ±90◦ pan rotations can be also handled by assuming facialsymmetry for greater pan rotations to ±45◦ pose angles. (b) The ABT handlesout-of-plane movements as outliers, retrieving the position over 50% of outlierpixels.

In order to test the tracking stability while it is applied to bright and translucentsurfaces, an image sequence of 240 frames was recorded. Although the sunglasses aresemi-transparent, the reflective and bright lens surface offers interesting difficulties.The camera is photographic and the size of the image is of 640x480 pixels, see Fig.3.25.

The whole sequence presents a person wearing sunglasses allowing the ABT tolearn the translucent texture from the beginning. However, when the head is atprofile, one eye becomes partially occluded by the nose, darkening the occluded eyeand also causing reflectance on the sunglasses.

For pan rotation the facial asymmetry is considered up to 45 degrees, becauseboth eyes are visible. If rotation is greater than 45 degrees, the face is considered assymmetric and the appearance matches mainly the non-occluded eye. Both eyes canbe tracked independently by extending the sequential tracking at the cost of morecomputational effort.

The ABT handles the darker side and the reflecting effects as variations of illu-mination. In the observation process, high intensity changes are considered outliersand excluded from learning. Therefore, the 3D shape remains in the previous correctpositions, see again Fig. 3.25.


Figure 3.25: Gaze tracking results under translucent textures. ABT learnson-line this textures, assumes facial symmetry after 45 degrees to deal withprofile.

3.3.5.9 Low Size and Resolution of Images

Even though the accuracy of the sequential tracker has been validated under severaldemands, and the use of small reference textures has been highlighted for achievingreal-time, it is important to test the minimum image size and resolution for gazetracking. For this experiment, an image sequence was acquired originally at smallresolution and size by using a photo camera. The whole input image is 112x160pixels, the face area is about 64x20 and the eye region 10x17. Notice that this inputeye patch has a similar size to the small reference texture used for building smallappearances. Therefore, in case of using such a small appearance model, the imagewarping would not interpolate pixels to produce a single appearance point.

The Fig. 3.26 shows a 400 frames image sequence, where the subject is wearingsunglasses as in Fig. 3.25. In this small frame’s size, the sequential tracker obtainedan average matching of 82% while the sequence with size of frames of 640x480 pixelsobtained 91% of correct adaptations. Both big (42x82 pixels at eye region) and small(14x18 pixels at eye region) are tested by using same appearance size, 10x17.

Figure 3.26: The gaze tracking deals with images of small and low-resolution.These images are 64x20 pixels where each input eye is 10x18 pixels.


Finally, this section has presented detailed experiments for an accurate and robustsequential gaze tracking. The differences between eyelid and iris trackers have beenshown in a comparative fashion. Furthermore, the strengths of each appearancemodelling and single tracker were also experimentally supported in order to justifythe necessity of joining both trackers sequentially. Several experiments were presentedaddressing the capabilities of the face and gaze tracker to deal with fast and smoothmovements, occlusions, light variation, different image resolutions, variety of imagery,and settings to achieve real-time performance.

3.4 Face Search and Alignment

3.4.1 Face Detection

Finding faces on image sequences is primary step in many applications such as videosurveillance, human computer interface, and expression analysis. However, currentlyexisting techniques have difficulties to cope with pose variations, appearance changes,illumination contrast and complex backgrounds.

It is possible to deal with the pose variation and smooth lighting changes bymodelling the skin colour. Given an input image, this is segmented into regions thatcontain possible face candidates, while those that do not contain a face object aredropped. This segmentation helps accelerate the detection process. Next, connectedcomponent analysis is advanced as well as some basic convolutions in order to detectsome typical shape features of the human face.

3.4.2 Skin Colour Model

The skin colour pixels are modelled in the RGB colour space according to the followingheuristic thresholds:

Image(R,G,B) = skin

20 < R−G < 90R > 75

R/G < 2.5(3.17)

The simplicity of this method speed up the face detection given its computationalefficiency, since no transformation is needed to go to another colour space. An approx-imate sample of 800,000 skin pixels where taken from 64 different images of peoplefrom different ethnicities and various illumination conditions. The distributions ofthese three thresholds are shown in Fig. 3.27, where is possible to see their unimodal-ity. Our experiments reveal that 94.6% of the skin pixels fall into the limits 20 and90 at the R −G distribution, 96.9% of the skin pixels have an R component greaterthan 75 and 98.7% of them have an R/G value lower than 2.5.

3.4.3 Canonical Correlation Analysis (CCA)

Canonical correlation analysis is a powerful multivariate statistical method for explor-ing linear relationships between two variables (spaces), which are pair wise maximallycorrelated. It finds two basis spaces, one for each multidimensional variable, in which

3.4. FACE SEARCH AND ALIGNMENT 73

(a) (b) (c)

Figure 3.27: Three heuristic thresholds allow us to model skin pixels. (a)of pixels fall are allocated in the range 20<R−G<90 contains 94.6% of skinpixels. (b) 96.9% of skin pixels have R>75. (c) 98.7% of skin pixels haveratio R/G<2.5.

the inter-correlation is maximized. An important property of canonical correlationsis that they are invariant with respect to affine transformations of the variables.

Given X ∈ Rp and Y ∈ R

q two zero-mean multivariate random variables, CCAlooks for a pair of transformations x and y such the linearly transformed variablesis maximized. This is equivalent to minimizing the correlation coefficient ρ.

ρ =E[T

xXYTy]

√

E[TxXX

Tx]E[Ty Y Y

Ty](3.18)

The basis 〈x, y〉 can be obtained by Singular Value Decomposition (SVD) of

the cross-correlation matrix. Let P = Σ−1/2xx ΣxyΣ

−1/2yy and P = UDV T the SVD of P ,

where U = [u1, ..., ud] and V = [v1, ..., vd] are orthogonal matrices and D is a diagonalmatrix with singular values. Consequently, the canonical factors can be obtained as

x = Σ−1/2xx U and y = Σ

−1/2yy V . The d-dimensionality is Min[rank(X), rank(Y )].

3.4.4 Shape Model Synthesis for Face Alignment

Appearance-Based Trackers have accuracy and high performance for on-line 3D headand face tracking. This is a robust technique for rigid and non-rigid motion extractionfrom a moving head-face. Although this method can learn face textures on-line, adaptto illumination changes and occlusions, they require manual matching of the ActiveShape Model (ASM) at the first frame of an image sequence.

3.4.4.1 ASM synthesis by 2D-CCA

Here, we propose to analyse the relationship of facial images and active shape modelsby applying 2D-CCA [69], which will keep the spatial information of images andindependence of deformation and animation modes of ASMs.

Given an image I ∈ Rm,n and ASM vector S ∈ Rj,k, we can rewrite the Eq. 3.18in order to estimate their correlation. The aim is to find two pair of canonical basis


for each variable, νI ∈ Rm, I ∈ Rn, νS ∈ Rj , S ∈ Rk, such their projections,y = νT

I II and x = νTS SS , are maximally correlated:

ρ =E[νT

I IITS ST νS ]

√

E[νTI IIT

I IT νI ]E[νTS SST

S ST νS ](3.19)

In [69] authors demonstrate that the four basis < I , S > and < νI , νS > canbe found by iteratively solving the minimization problem for:

ρi(iI ,

iS , ν

(i−1)I , ν

(i−1)S ) (3.20)

ρ(i−1)((i−1)I ,

(i−1)S , ν

(i−1)I , ν

(i−1)S )

3.4.4.2 Synthetic ASMs for Automatic ABT

Once we apply 2D-CCA between Image and ASM spaces, we can extract the mostcorrelated pairs from each subspace. This is possible by learning the relationshipbetween one variable and the CCA projection of its pair. Specifically, we want tolearn the relationship between an ASM, S, and the projection of the correspondingimage, y = νT

I II . By applying regression onto the CCA transformation throughthe canonical variables, it is possible to obtain two regression matrices to pair wiseestimate an ASM based on an image and vice versa:

RIS = (νTI II

TI IT νI)

−1TI IT νIS (3.21)

RSI = (νTS SS

TS ST νS)−1T

S ST νSI

Consequently, for a given input image I, we can obtain the most correlated ASM,S, by computing S = (νT

I II)T RIS . This computation provides the best matching of

the 3D shape model adapting the facial image for AAM construction, see Fig. 3.28(c).The ASM can be estimated by using the approach of [84], which have similar

feature points to the 3D Candide face model. This ASM is based on a snake modelthat statistically models the neighbourhood of each feature. Next, an exhaustivesearch is made in a detected face [120] till the whole snake is adapted. This methodcarry with the drawbacks of the face detector, the complex search and its own trainingsettings, see Fig. 3.28(a).

3.4.4.3 Recursive CCA for Optimal Sample Selection

When performing CCA for feature space analysis, it is required to have N samplessuch that N>> Min[rank(X), rank(Y)], which guaranties enough information tofind strong linear correlations. However, there is the risk of over fitting the samplespace whilst including redundant data into the covariance matrices. Thus, to selectthe optimal sample space, we propose a recursive process to extract the redundantfeatures from the dataset.

3.4. FACE SEARCH AND ALIGNMENT 75

(a) ASM-Snake (b) Ground truth (c) 2D-CCA

Figure 3.28: (a) ASM estimated by a snake model from [84]. (b) Manualplaced mesh for ground truth. (c) Synthetic ASM retrieved by 2D-CCA.

Given two n-sample datasets I ∈ Rm,n and S ∈ Rj,k, CCA provides the canonicalvectors I and S maximizing the correlation of the corresponding projections Uand V, Eq. 3.18. Moreover, as above stated, it is possible to learn a pair of regressionmatrices Eq. 3.21, to pair wise estimate a variable with respect to the other and thecanonical vectors. Thereby, we define the bi-dimensional regression error as follows:

ǫS = ||S − (νTI II)

T RIS || (3.22)

ǫI = ||I − (νTS SS)T RSI ||

Those data with ei > µǫ + k ∗ σǫ are likely redundant and hardly yielding nearbyto the linear regression. A conservative k value is advised to slow convergence andavoid losing important samples. In our case, we use k = 2 and convergence is reachedwith the minimum combined error.

3.4.5 Experiments ABT Initialization

In order to train a face tracking initialization, the Mind Reading database [10] hasbeen used since it presents a variety of facial expressions and head poses closer to facialdisplays in real environments. This DB is composed for more than 460 emotions, splitin 24 main groups. The average of frames per sequence is 125. A total of 35,745 imagesof 320x480 pixels have been collected. All images have been scaled to images of 20x20pixels to standardize face sizes.


Figure 3.29: ASMs are synthesized by applying CCA regression from theinput images. The bottom images are the corresponding estimations by doingregression from the estimated ASMs.

In order to obtain the feature vectors for face tracking initialization, all imagesequences have been manually initialized. Thus, 9 facial actions, 6 head pose pa-rameters and 19 deformation modes are obtained for all images. Once all featuresare collected, 2D-CCA is performed between images, I and shapes S. Subsequently,the CCA recursive cleaning process is applied as described at section (Ref Section),which reduces the noise around the regression hyperplane while making stronger thecorrelation between both input subspaces. Finally, CCA regression is learnt to leadthe synthesis of shape model parameters for initialization based on an input imageface. In this way, CCA provide the shape initial parameters S = (νT

I II)T RIS , see

Fig. 3.29. After disregarding the most redundant pair of features, an average of85% of correct ASM reconstruction based on the canonical vectors and the regressionmatrix.

3.5 Chapter Summary

This chapter have presented the techniques for head and face modelling based on im-age sequences. Although the main subject of this chapter is head and face tracking,the initialization problem has been also treated. Gradient methods and several im-provements such as backtracking procedures, outliers handling, inclusion of curvatureand different gradient descents have been detailed. Recent methods for data fusionand feature correlations have been presented in order to solve the problem of trackinginitialization. Finally, detailed experiments were performed accordingly to each of the

3.5. CHAPTER SUMMARY 77

most common challenges in head and face tracking.

Chapter 4

Spatial Analysis of Facial Actions

There are several contributions from image analysis and pattern recognition towardthe human emotion understanding. Unlike previous approaches, we include the eyelidsby constructing an appearance-based tracker. Subsequently, a Case-Based Reasoningapproach is applied by training the case-base database with the six facial expres-sions proposed by Paul Ekman. Beyond of the classification and the nearness to theclosest cluster indicating confidence, we provide a classification confidence value andthe expressiveness of the analysed facial expression. Therefore, the proposed sys-tem yields efficient classification rates comparable to the best previous facial expres-sion classifiers. The Appearance-Based Tracking and Case-Based Reasoning (CBR)combination provides trusty solutions by evaluating the confidence of the eyebrows-eyelids-mouth classification.

Although CBR provides strong tools for expression recognition and knowledgediscovery, it lacks of robustness and efficiency regarding time and memory complexity.Moreover, there are some spatial relationships difficult to model by CBR. Therefore,SVMs and Bayesian Networks are explored as alternatives to increase the efficiency,reduce the complexity and time consumption. Through all classifiers, correlationanalysis and confidence will be the additional enhancements to be explored.

This chapter is distributed as follows: the Section 4.1 tackles the Facial Expres-sion Recognition problem by knowledge discovery based on CBR. Contributions suchas confidence assessment, classification based on confidence and maintenance policesfor knowledge updating are detailed. Subsequently, the Section 4.2 studies the spatialrelationships among facial expression classes, head pose and facial actions by deeplyexploiting the strengths of CCA. Furthermore, CCA is also applied and for improv-ing the classification confidence and as exhaustive classifier. Next, the Section 4.3 isintended to learn the topology of the multi dimensional space of facial expressions.Thus, SVMs contribute to FER solving challenges in combination with CCA andPCA for efficiently transforming the original feature space. Finally, an statistical in-ference structure is presented in Section 4.4. Three databases are gathered increasingthe challenge of recognizing subtle expressions. Moreover, Gaussian Mixture Modelsare used for facial action discretization and eigenvector decomposition is applied topruning TAN-BNs.

79

80 CHAPTER 4. SPATIAL ANALYSIS OF FACIAL ACTIONS

4.1 Spatial Knowledge Discovery for Facial Expres-sion Recognition

Aiming to discover topological structures of facial actions accordingly to expressionprototypes, CBR is adopted for reasoning such spatial distributions while providingclassification. CBR is a lazy classifier that rather than applying the same gener-alization rule to all problems, generalizes the decision for the target according tolocal approximations. However, the efficiency of the solving process depends on thediscrimination capabilities meanly nearby to the boundaries of clusters.

Eager classification methods do not always provide a confidence measure of theirsolution. Since their classification power depends on the amount of data and typeof expressions, the stability of the classification rate may decrease while testing withunseen data [63]. Cheetham et al [23] have pointed the importance of providing aconfidence indication with the proposed solution. Consequently, the developmentof CBR systems has increased the necessity of supporting the analysis of the case-base structure while providing solutions with a required accuracy and stability [21].According to Cheetham and Price, we are concerned on proving the classificationstability of a set estimators [22].

In the statistical sense, the main goal is to find a set of estimators with stablepredictions on several tries of different classification models. The smaller bias from theaverage prediction indicates the most confident estimator. Similarly, we need to provethat they are good classifiers in all models by remaining near to correct classification.Therefore, their classification rate should have small standard deviation with respectto the correct classification.

4.1.1 CBR Representation

Facial expression recognition systems are mostly focused to recognize six prototyp-ical facial expressions namely anger, disgust, fear, happiness, sadness, surprise, andthe neutral expression [48]. However, these expressions cannot reveal major cogni-tive or affective behaviours upon deeply reasoning and strong inference processes.They can be recognized by analysing facial features, but mental states and complexemotions require of analysing head movements and other features. CBR groups thedata (cases), see Fig.4.1.a, by assessing confidence, expressiveness and recognizingexpressions, which completes the case structure according to the description on 4.1.b.

4.1.2 CBR Classification

The classification of facial expressions is done by adopting a case-based reasoningarchitecture. The main structure of a CBR system is depicted by the CBR-cycle,which is composed by four steps namely Retrieve, Reuse, Revise and Retain; Retrievethe most similar cases from de case base; Reuse the information and knowledge inthat case to solve a target problem; Revise the proposed solution; Retain new solvedcases for future problem solving. k-NN is the most used technique to sort the casebase, retrieve similar cases and construct decision rules. Once the similar neighbours

4.1. SPATIAL KNOWLEDGE DISCOVERY FOR FACIAL EXPRESSION RECOGNITION81

Disgust Sad Happy

Anger Fear Surprise

a. Six facial expressions. b. Case-base structure.

Figure 4.1: The six basic universal expressions are considered for CBR clas-sification a. where the Case-base structure is composed by the vector ~γ ascase description and the solution attributes are expressiveness, confidence andexpression b.

are retrieved based on a similarity measure, the solution is applied following thewinner-takes-all rule.

Some specific characteristics are compared to the case base in the revise stepwhile evaluating the proposed solution. Finally, the most voted cases are tempo-rally retained if they are frequently retrieved to solve new cases, otherwise, they areextracted from the case base.

4.1.3 Confidence Assessment

Building an efficient decision surface in CBR depends on the discrimination capa-bilities nearby to the boundaries of clusters. We believe that providing confidentsolutions, it is possible to certainly arrange the case base, reject useless cases andretain new ones. Cheetham et al [23] have pointed the importance of providing a con-fidence indication, which has been quantified by using similarity criteria or evaluatingthe correlation of the proposed solution with the case base.

We are concerned on proving confident classifications in the statistical sense, thatis, stability on several tries with different classification models. The smaller biasfrom the average prediction indicates the most confident estimator, as well as strongclassifiers have small standard deviation.


4.1.3.1 Confidence Estimators

We propose five confidence estimators that are based on k-NN with the aim of assess-ing confidence for cases of the same class that are near (i.e. with high similarity) tothe target problem and cases from different classes (i.e. low similarity). The closera target case is to cases from a different class, the higher the chance that the targetcase is laying near to the decision surface. Whereas the closer a target is to othercases of the same class, the higher the chance that it is further from the decisionsurface. Similarity is computed based on the Euclidean distance. Given a retrievedneighbourhood, the confidence estimators are calculated at the Case Revise step foreach possible solution, distinguishing two Relevant Neighbours (RN) and IrrelevantNeighbours (IN), as follows, see Fig.4.2:

a. b. c.

Figure 4.2: The five confidence estimators are based on three main concepts;a. The similarity nearness to the target, b. the similarity density and c. thesimilarity norm between the RNs and the target.

1. Average RN-Nearness: for a target case ~γt belonging to class c, this indicatesthe average similarity nearness of the r-RN (r RNs within the retrieved k-NN)to the target. This is illustrated in the Fig. 4.2.a.

S1(c) = 1 − 1/rr

∑

i=1

||~γt − ~γi||. (4.1)

2. Similarity Nearness Ratio: measures the similarity nearness of the target ~γt toRNs with respect to its similarity to n-IN, see again Fig. 4.2.a.:

S2(c) =1 −

Pri=1

||~γt−~γi||

r

(2 −P

ri=1

||~γt−~γi||

r −P

ni=1

||~γt−~γi||

n ). (4.2)


The estimator measures both how close is the target to the class’ boundary aswell as the side of the boundary where the target is located.

3. Similarity RN-Density : compares the average density inside of the RNs, insideof the INs and their average similarity to the target. The highest score indicatesRNs uniformly distributed with the target, as well as scattered INs increase thissimilarity score. Fig. 4.2.b. shows this estimator:

S3(c) =1 −

Pri=1

||~γt−~γi||

r

r2 − ∑ri=1

∑rj=1 ||~γi − ~γj ||

∗ r2 (4.3)

4. Average Similarity Norm: measures expressiveness relationship by comparingthe p− norm of ~γt with the average p− norm of RNs, see Fig. 4.2.c.

S4(c) = 1 − |r||~γt||p − ∑ri=1 ||~γi||p|

r. (4.4)

5. Similarity Norm Ratio: As the similarity nearness ratio, this compares therelative similarity norm of the RNs, the target and INs. See again Figure.4.2.c.:

S5(c) =1 − |r||~γt||p−

Pri=1

||~γi||p|

r

(2 − |r||~γt||p−P

ri=1

||~γi||p|

r − |n||~γt||p−P

ri=1

||~γi||p|

n ). (4.5)

All similarity measures are standardized such that 0.0 indicates less confidentsolution and conversely 1.0 indicates the highest confidence. Once the retrieved k-NNare available, these five estimators are calculated as many times as different solutionsare proposed by the k-NN.

4.1.4 Classification Confidence

Assessing confidence improves the classification but it is not straightforward for man-ifold domains. Cheetham and Price [21] describe 12 measures to quantify confidencebased on distance and frequency. However, data are not always related by distanceor the majority is not the most efficient decision rule as we will prove.

We define the Confidence Value as the percentage of highest estimators agreeingthe majority solution. For example, it is not possible to decide by the majorityif the retrieved k-NN is [a, b, c, c, a] (tie situation). Instead, we compute S1, ..., S5

supposing solutions a, b, c, choosing as solution the class containing the majority ofhighest scores.

We have performed two validation methods to prove that S1, ..., S5 are good con-fidence estimators: a Leave-One-Out (LOO) process for k-values between 2 and 20,which is person-dependent. Also, we perform Actor-Fold-Cross-Validation (AFCV)processes, which test the CBR system capabilities to solve unseen problems (actors).The outcome classification is person-independent.


(a)

(b)

Figure 4.3: (a) Classification based on confidence is more stable and higherthan k-NN. (b) The confidence score also remains stable through all experi-ments.


For each k -value and both LOO and AFCV, we compare both classification, Fig.4.3.a, and its confidence Fig. 4.3.b, in order to learn the appropriate k that mini-mizes classification with high confidence. This is the criteria to retrieve neighboursfor an expected classification confidence in the decision making process. Given thehigh correlation between classification rate and confidence values, these measures aregood estimators of the classification confidence. Furthermore, they are statisticallygood estimators since the have small bias when classifying with different k -NN di-mensions. Classifying unseen data, the average classification is 78% ± 1%, with anaverage classification confidence 86% ± 4%.

The previously obtained results are significant contributions for CBR systems,since they provide confidence estimations in the solving process. This confidencevalue has the strengths of any statistic confidence estimator besides of a similaritymeasure indicating confidence. On one hand, the new confidence value improves thecase−retrieve step by indicating the most appropriate k−value for an allowed errorin the classification process. On the other hand, it is possible to obtain solutions suchas Expressiveness-Confidence-Expression, < ε, ϑ, λ >.

4.1.5 Confidence Classification

According to Fig.4.3.b is reasonable to use the confidence value ϑ as decision rule intothe CBR-cycle for classification:

1. Case-Retrieve: Given a facial expression, the target problem ~γt, we retrievethe appropriate k-NN neighbourhood for an expected classification confidenceϑ, see again Fig.4.3.b.

2. Case-Reuse: Compute S1, ..., S5 estimators and the confidence value ϑ for thesolution inside of the retrieved neighbourhood. Lastly, we obtain the expres-siveness of the target problem based on its p-norm and the norm of its RNs,thus obtaining the complete solution for the target ~T =< ~γ, ε, ϑ, λ >.

3. Case-Revise: The classification result is revised upon the relative differencein values of confidence and quality of its RNs. Quality is the average of highestscores of the confidence estimators, which have agreed the solution.

4. Case Retain: All confidence estimators provide information about the casebase and each class. S1 and S2 indicate how large and well delimited theclusters are, respectively. S3 reveals either the density of the class or seeds ofthe class. S4 and S5 provide information about the expressiveness of the class,i.e. how flat it is or its boundaries with respect to near classes. We compareeach estimator per class with the solved cases, thus, it is possible to decide ifretaining the new case.

4.1.6 CBR Training by Maintenance Polices

These four steps are applied only tailoring the observation space’s topology. Startingfrom zero − Confidence and zero − Classification, we iteratively identify TPs,FPs, TNs and FNs in order to calculate TPR vs. FPR of the CBR system. As a


consequence, we obtain optimum confidence-classification thresholds to retain in thecase base those cases that maximize the TPR and reject those cases that do notminimize FPR, see Fig.4.4.

Figure 4.4: Classification and confidence are improved by identifying theoptimum thresholds to minimize TNs and FPs.

Aiming to minimize the False Positives Rate (FPR) detection while maximizingthe True Positives Rate (TPR), a supervised training process is performed to identifythe optimum confidence threshold to discriminate True Positives (TPs), False Posi-tives (FPs), True Negatives (TNs) and False Negatives (FNs). To this end, Leave-One-Out (LOO) and Actor-Fold-Cross-Validation (AFCV) are iteratively performedby using the sequence labels at the frame level.

Starting from zero−zero for Confidence−Classification, the solving capabilityof the CBR system is iteratively evaluated by calculating the TPR vs. FPR. In orderto classify facial expressions by applying confidence, it is better first to identify TPs,the correct classified cases with high confidence, (λ, ϑ), while FPs correspond tomisclassification with high confidence (λ′, ϑ). Conversely, TNs are the cases withlow classification and low confidence (λ′, ϑ′), while FNs are the correct classifiedcases with low Confidence (λ, ϑ′). As consequence, two thresholds are obtained forconfidence and classification respectively, which are used to retain in the case basethose cases that maximize the TPR and minimize FPR.

Firstly, the TNs are extracted from the case base while iteratively computing theevaluation test. As soon as the case base is free of TNs, the next step is to deal withFPs. Here, the other proposed solutions in the case-retrieve step are reconsidered. It


is possible that actors mix expressions while performing an emotion sequence, whichspreads the expression clusters near to the decision surface. Finally, if a data iscontinuously detected as FP along the LOO and AFCV for k ∈ {2, ..., 20}, the caseis extracted from the Case-Base. The details of the training process are provided inthe next Algorithm 4.

Algorithm 4 Learning Confidence and Classification ThresholdsRequire: Case-Base → CB.Ensure: Optimum thresholds, λ∗

e and ϑ∗

e .

1: For λe ∈ [0.0, ..., 1.0]

2: For ϑe ∈ [0.0, ..., 1.0]

3: While TN and FP are non-empty

4: For k ∈ {2, ..., 20}

5: Return λ and ϑ.

6: Return TN := (CB(λi) < λe and CB(ϑi) < ϑe)

7: If TN is non-empty Then

8: delete CB = TN

9: Else

10: FP := (CB(λi) < λe and CB(ϑi) ≥ ϑe)

11: If FP is non-empty Then

12: re-label CB = FP

13: λmax = max{λ, λmax}

14: ϑmax = max{ϑ, ϑmax}

15: Return λ∗

e(λmax) and ϑ∗

e(ϑmax)

This process allows computing the corresponding thresholds for person-dependent(LOO) and person-independent (AFCV) recognition. After deleting the TNs, FPsnormally gain confidence thereby they can be assessed under the hypothesis of beingmislabelled. The FPs becoming TPs after re-labelling are those recovered from withina emotion sequence. Contrarily, the FPs becoming TNs after re-labelling are thosecases near to the boundaries but with low confidence. The convergence is declaredwhen there are not more TNs and FPs to deal with, see again Fig. 4.4.

4.1.7 Experiments CBR

This section details the experimental results of using CBR for expression recognition.FGnet [121], MMI [98] and Mind Reading [10] databases are described at the corre-sponding experiments. Three main results are presented; CBR expression recognitionbased on confidence assessment, CBR improvements by implementing maintenancepolices and CBR confidence improvements by applying CCA.


4.1.7.1 Expression Recognition Based On Confidence Assessment

This experiment assesses the reliability of CBR spatial inferetion on facial expressionrecognition problems. To this end, the quick movements of eyelids are gathered withthe smoother lips and eyebrows motion to train a CBR system for knowledge discovery.Three objectives are targeted; First to assess the confidence of training cases solutions;Second, to obtain the optimal neighbourhood size, k, leading to the most confidentsolutions; Third, to estimate the stability of the proposed confidence measure.

The FGnet database for facial expressions is used [121], which contains twentyactors of different genders and races, performing the six basic emotions defined byEkman and Friesen [48] plus the neutral pose. The dataset contains three imagesequences per actor and emotion with an average of 130 frames each and image sizeof 320x240 pixels, see again Fig. 4.1.(a).

All image sequences were processed by an ABT encoding six head pose para-meters Chapter 3, four facial actions for lips, two for eyebrows and one for eyelids.Posterior, one tracked sequence per actor is chosen and sampled at 25 Hz in orderto complete 600 images for training the Case-Base (CB), training-CB. The FGnetis a sequence-labelled database, which can be use for emotion analysis whilst facialexpressions do not have any labelling beyond the emotion classes. A testing sampleis chosen, training-CB, from the unused sequences of each actor, completing 1200images equally distributed along the actors and classes.

The extracted ABT feature vectors, [γ0, ..., γ6], are normalized into the range [0, 1]and the provided emotion labels by FGnet are taken as Case-Solution for the facialexpressions, see again Fig. 4.1.(b).

In the training-CB, a Leave-One-Out process is performed to set a confidencethreshold based on neighbourhood size comparison. Firstly, given a target case de-scribed by facial actions, a k Nearest Neighbourhood is retrieved by measuring anEuclidean similarity. Secondly, the five confidence predictors described in Section4.1.3.1 are calculated for the target case given the emotion label λ from the FGnetDB and a majority policy determines the confidence score. Finally, the above processis done iteratively for k = 1, ..., 12 for each case in order to set the relationship be-tween the confidence assessment and the k value, see Fig. 4.5. Accordingly to theresults, the appropriate neighbourhood size is k = 9 as it is the optimum value forobtaining the highest confidence values.

In the testing-CB process, a dataset of 1200 expressions was chosen in orderto advance three different experiments. First, an standard k -NN is applied whileexcluding eyelid facial action from both training and testing, achieving an averageclassification rate of 76.42% ± 5.06%.

A second test includes eyelid facial action in both training and testing. A k-NN classification is performed with k = 9, which produces a classification rate of81.81% ± 21.01% as it is shown in Table 4.1. Here, all retrieved neighbours are usedindistinctly from their solution confidence.

However, by constraining the case retrieval with the learnt confidence threshold itis possible to increase the classification up to 87.37±2.6% with an average confidenceof 89.36 ± 7.7%, see Table 4.2. The main idea of setting a confidence threshold is tomaximize correct classification (true positives) whilst minimizing the classifications


Figure 4.5: A confidence threshold is learnt based on confidence scoringcomparison along several k-NN models.

Emotion Anger Disgust Fears Happy Neutral Sadness Surprise

Anger 77.27 0.00 0.00 0.00 22.73 0.00 0.00Disgust 0.00 90.14 0.00 0.00 9.86 0.00 0.00Fears 0.00 0.00 36.36 0.00 55.88 7.76 0.00Happy 2.27 0.00 4.18 93.08 0.47 0.00 0.00Neutral 0.00 4.85 0.00 1.69 90.13 0.95 2.38Sadness 0.00 2.00 0.00 0.00 0.00 98.00 0.00Surprise 0.00 0.00 0.00 0.00 12.30 0.00 87.70

Classification: 81.81 ± 21.01%

Table 4.1: Confusion matrix of CBR facial expression recognition. Only theoptimal k-NN is used without setting confidence thresholds.

with low confidence (false positives).

As a result, the expression recognition rate increases by improving the solutionquality based on confidence assessments. Although the decision surface is not func-tionally described, the topological relationships scored by the confidence predictorsrelay the problem solutions over more confident data. However, maintenance poli-cies can be applied to improve the confidence levels at the decision surface by eitherretaining confident solutions or rejecting low confidence cases. Consequently, facialexpressions are recognized within a problem-solving fashion based on previous solu-tions and their current confidence.


Emotion Anger Disgust Fears Happy Neutral Sadness Surprise Confidence

Anger 86.67 0.00 3.00 0.00 3.33 7.00 0.00 90.23Disgust 2.00 88.20 1.00 6.80 0.00 2.00 0.00 84.69Fears 2.00 4.12 88.88 3.00 1.00 0.00 1.00 75.00Happy 2.00 1.00 3.30 91.50 0.00 1.20 1.00 92.30Neutral 1.74 2.50 10.00 0.00 84.76 1.00 0.00 89.13Sadness 4.00 2.00 4.33 1.00 3.00 83.67 2.00 96.06Surprise 1.00 4.31 1.00 0.00 5.80 0.00 87.89 98.09

Classification: 87.37 ± 2.6% Confidence: 89.36 ± 7.7%

Table 4.2: Confusion matrix of CBR facial expression recognition. Retriev-ing cases of at least 95% of solution confidence and the optimal k = 9, thecorrect classifications are maximized whilst minimizing the non-confident clas-sifications.

4.1.7.2 CBR Maintenance For Expression Recognition

This experiment tests the improvements over the Case−Base after applying mainte-nance polices for training. The Case−Base contains facial action encoding eyebrows,lips, eyelids and irises, ~gamma = [γ0, ..., γ8]. Next, the Algorithm 4 is applied. Con-sequently, the database is cleaned with respect to TNs and FPs.

A comparison of experimental results at the beginning and at the end of thetraining process, are summarized in the Table 4.3. AFCV was used to learn thresholdsλ∗e and ϑ∗e, since AFCV gives the real system’s capability to solve unseen faces.

Experimental results at the beginning and at the end of this process, are sum-marized in the Table 4.3. The confidence value is learnt by applying AFCV, whichgives us the real capability of the CBR system to solve unseen data. Finding thebest confidence classification threshold, both the k-NN majority and the confidence-classification increase effectiveness while reducing error detections.

Emotion AFCV Initial AFCV Final TAN-BNMajority Confidence Majority Confidence

Anger 72.23 89.05 71.53 86.67 85.92Disgust 57.23 87.56 61.60 89.70 83.23Fears 48.62 88.45 56.70 90.37 83.68Happy 71.58 85.03 85.19 90.24 87.55Neutral 93.51 91.34 94.80 91.65 79.58Sadness 77.84 89.00 97.74 83.67 80.97Surprise 56.07 91.05 76.88 92.87 82.22Average 68.15 88.78 77.78 89.31 83.31

Table 4.3: Comparison of Confidence-Classification while finding the opti-mum Confidence Threshold. The TAN-BN results were reported by I. Cohenet al.

Finding the best confidence classification threshold, both the k-NN majority andthe confidence-classification increase the effectiveness while reducing the error de-tections. For example, in Table 4.3, it can be seen that the obtained results arecomparable to eager classifiers. In [26], authors present a comparison of eager classi-fiers such as Neural Networks, Naïve Bayesian Networks and TAN Bayesian Networks.


The best average classification rate that they reported was 86% by using TAN-BN.

As above mentioned, maintenance polices allow improving both classification andconfidence. To this end, 1,000 additional data are used for testing from which theCBR system retains some solved cases. Detailed experimental results are shown inTable 4.4, where is possible to appreciate the current classification rate and confi-dence for each one of the expressions. The trained Case-Base contains data, whoseexpressivenesses are between 20% and 100%.




Table 4.4: Person-Independent CBR Facial Expression Recognition by As-sessing Confidence.

Consequently, those classes presenting weaknesses in the Table 4.3, such as angerand sadness are evidently improved as shown in Table 4.4.

4.1.7.3 CBR Gaze Expression Recognition

Gaze and eyebrows motion is analysed in this experiment due to the strength of theABT to encode micro expressions. For this purpose, another database had to begathered to the FGnet. The MMI database [98] contains more subtle expressions andless mislabelled data. Both databases complete 25,000 images split in 200 sequencesand 50 subjects performing the seven basic facial expressions.

The Case− Base is trained with facial actions corresponding to upper face mo-tion, such as eyebrows and eyelids. Irises are excluded from this experiment due totheir poor expressiveness, which is a problem of posed expressions. The training isperformed as explained in Algorithm 4. As consequence, the classification and con-fidence thresholds are settled while TNs and FPs are identified and deleted againfrom Case− Base. Both classification rate and confidence are summarized in Table4.5. Likewise, after replacing 55% of the FGnet DB with cases from the MMI, it waspossible to obtain an average classification rate of 93%.

Similarly, as previous experiments, 1,000 new data are tested aiming to retain andupdate the Case−Base. Successfully, the CBR system gains stability and increasesclassification and confidence for expression recognition in gaze micro expressions. Theachieved classification is 94.19 ± 2.8% with and average confidence of 95.07 ± 1.9%,see detail in Table 4.6.


Emotion AFCV Initial AFCV FinalMajority Confidence Majority Confidence

Anger 72.23 60.59 71.53 74.52Disgust 57.23 68.30 61.60 89.18Fears 48.62 55.18 56.70 86.36Happy 71.58 89.59 85.19 97.64Neutral 93.51 93.07 94.80 96.02Sadness 77.84 87.56 97.74 98.70Surprise 56.07 68.79 76.88 87.70Average 74.00 78.00 86.00 93.00

Table 4.5: Comparison of Confidence-Classification while finding the op-timum Confidence Threshold. AFCV is the strongest test to guaranty theperson-independent coverage while computing the topology of the observa-tion space.




Table 4.6: Classification and confidence can be improved by gathering FGnetand MMI while performing Case-Retaining.

4.2 Cognitive Correlated Facial Actions

4.2.1 Experiments

In order to explore the reasons why the whole face and upper part achieve differentrecognition rates, CCA is applied decoupling and fusioning lower and upper facialactions. Here, the Case-Base is trained with the FGnet and MMI [121, 98] together.

Thus, CCA is applied to these two set of features, the mouth and eyebrows-eyelids,~m and ~y. Next, a pair of canonical vectors are found, ~ m and ~ y, such that there isa hyperplane over which both input features are projected and maximally correlated,(Section 3.4.3). Therefore, two regression matrices make possible the estimation ofthe most likely corresponding data in one space, given a real data from the otherspace:

Rmy = inv(~mTcca ~mcca)~mT

cca~y (4.6)

Rym = inv(~yTcca~ycca)~yT

cca ~m

~m = ( ~ Ty ~y)

T Rym

~y = ( ~ Tm ~m)T Rmy

where ~mcca and ~ycca are the projections over the CCA subspace through the corre-

4.2. COGNITIVE CORRELATED FACIAL ACTIONS 93

lation hyperplane. Moreover, ~m and ~y are the corresponding estimations from CCAsubspace on the original spaces. This two set of estimations are further fused in asingle vector:

~f =

[

~m

~y

]

(4.7)

All training data are transformed using the Eqs. 4.6 and 4.7. Classifying unseenfaces under a AFCV test, reveals that CBR can obtain a 91% of confidence withoutCCA. However, after correlating rigid and non-rigid movements by CCA, they canbe further projected onto the CCA-subspace. Subsequently, CCA regression is ap-plied over these projections thus stressing up their correlation at the original inputsubspaces.

Consequently, lower and upper facial actions can be powered by maximizing theinter-correlation among them. An AFCV test is run for the new Case-Base whileassessing both classification and confidence levels. Due to the CCA cleaning fromredundant cases, the system’s confidence increases for either improving classificationor lowering misclassification. Table 4.7 shows a simple comparison of assessing con-fidence with the three type of features; original vector, the ~mcca and ~ycca canonicalprojections and the estimated ~m and y. With a 97% of confidence, CCA proves thatmouth and gazes are not directly correlated while displaying emotions, espcially inposed ones.

Expression AFCV [~m, ~y] AFCV [ ~m,~y] AFCV [~mcca,~ycca]

Anger 93.4 82.1 85.9Disgust 94.7 95.6 83.2Fears 92.1 99.3 83.7Happy 94.6 96.7 87.6Neutral 96.0 98.2 79.6Sadness 97.6 95.6 81.0Surprise 97.0 97.9 82.2Average 95.1 96.9 83.3

Table 4.7: Comparison of confidence assessment for the three different typeof features; the original features, [~m, ~y], CCA-regression estimation [ ~m,~y] and[~mcca,ycca] the corresponding projections into the CCA-subspace.

4.2.2 Exhaustive Mapping Of Intra-Correlated Expressions

This experiment aims to prove the suitability of CCA for facial expression recognition.Here the Mind database must be used since it contains head poses with a semanticmeaning according to facial expressions and emotions. Subsequently, a CCA process isperformed for each of the seven type of facial expressions, where the input subspacesare the head ~ρ = [θx, θy, θz, s, tx, ty] and facial actions ~γ = [γ0, ..., γ8]. The CCArecursive cleaning is applied while reducing the amount of redundant data whichspread the inter-class correlation.


For each expression class a pair of canonical basis are obtained τ ih and τ i

f and

i = {1, ..., 7}. These basis allow to learn a pair of regression matrices Rih,f and

Rif,h, which allow to pair-wisely reconstruct feature vectors from one subspace into

another. Likewise, a pair of regression errors, ǫih,f and ǫif,h, can be computed foreach expression class given a head-face expression descriptor. Finally a expression isrecognized according to λ = arg(min{kh ∗ ǫif,h + kf ∗ ǫih,f}). kh = 0.35 and kf = 0.65have been experimentally determined by comparing the classification rates. Sincefacial expressions are more deterministic for the class decision, kf > kh. A 10-FoldCross-V alidation has been advanced in order to assess the classification rate by CCA.Detailed results are shown in Table 4.8.

Emotion Anger Disgust Fear Happy Neutral Sadness Surprise

Anger 90 2 1 4 0 3 0Disgust 4 91 2 1 0 2 0Fear 3 2 89 0 5 0 1Happy 0 1 2 92 3 1 1Neutral 0 0 0 2 95 3 0Sadness 2 1 0 1 4 92 0Surprise 2 1 2 2 3 3 87


Table 4.8: CCA to solve FER problems. Exhaustive comparisons of theregression errors allow classifying a face into the class with minimum error.

These results can also be improved by making stronger the intra-class correlationfor those classes with low classification rates. As highest the intra correlation is, ashigher the classification rate is. The obtained results are comparable to those reportedby authors in [130], which used KCCA to obtain an average of 86% of expressionrecognition in the Ekman’s database [47] and recognizing only 5 expression classes.

4.3 Dichotomizing Non-Linear Decision Surfaces

How are facial actions distributed in a multi-dimensional space according to facialexpressions?

Either supervised or unsupervised classifiers may find problems on efficiently de-termining the decision surface between two classes. Moreover, the complexity in-creases as many dimensions describe the feature space and as well as classes to belearnt. The Fig. 4.6 shows an example of topology of upper-face micro-expressions.Expression classes may not be allocated as single manifold following a simple geom-etry suggesting a statistical model. Instead, multiple seeds spread along the spacecan constitute a class, which can be tightly surrounded by others at the same time.This is an example where the decision surface cannot be described as linear functiondiscriminating between two classes.

4.3. DICHOTOMIZING NON-LINEAR DECISION SURFACES 95

Figure 4.6: Eyebrows and eyelids render facial actions suitable to be cat-egorized as both subtle or micro-expressions, which can give an idea of thetopological distribution of facial expression classes in a 3D feature space.

4.3.1 Non-Linear Feature Spaces

When the feature space is not linearly separable, it must be transformed by applyinga kernel function K(~γ,~γ) = ϕ(~γ) · ϕ(~γ), which transform the space into a new spacewhere data are linearly separable. Consequently, the decision function Eq. 2.9 can bewritten as follows:

f(~γ) = ~w ·K(~γ) + b (4.8)

As mentioned earlier, SVM can be extended to a non-linear case by utilizing aKernel technique. It can be shown that, replacing the inner product with a Kernelfunction is equivalent to projecting the data into some higher dimensional space wherehopefully the data are more separable. Widely-used Kernel functions include:

• Linear Kernel: ϕ(~γ,~λ) = ~γ · ~λ

• Polynomial Kernel:ϕ(~γ,~λ) = (1 + ~γ · ~λ)p

• Gaussian Kernel: ϕ(~γ,~λ) = exp(−||~γ − ~λ||2/σ2)

• Sigmoid Kernel: ϕ(~γ,~λ) = tanh(β~γ · ~λ+ r)

Among them, the Gaussian Kernel is by far the most popular. By applying theGaussian Kernel, the data are projected into an infinite dimensional space and it


delivers good results in most applications. In this experiments, only the GaussianKernels (also known as Radial Basis Function (RBF)) and the Linear Kernel (whichis equivalent to no kernel) are considered.

4.3.2 SVM for Multi-Class FER

SVM were originally designed for binary classification. However, facial expressionrecognition is a problem of multi-class classification. In order to recognize multiplefacial expressions, a one-against-rest approach is adopted, which combines C binaryclassifiers (the parameter C is the number of classes). The ith SVM constructs ahyperplane between the class i and the C−1 remaining classes. A majority vote acrossthe classifiers or some other measure can then be applied to classify a new sample[35]. Thus given N training samples {(~γ1, λ1), ..., (~γN , λN )}, where λN ∈ {1, 2, ..., C}is the expression class determined by :

f(~γ) = ~wij ·K(~γ) + ~wij0 (4.9)

λi = Ci if ∀j 6= i, f(~γ) > 0

There are some other methods in the literature that consider all classes at once[117, 123].

4.3.3 Experiments SVM

Two type of experiments have been advanced to test the efficiency of SVMs for facialexpressions and affective behaviours recognition. FGnet, MMI and Mind readingdatabases are used in these experimentation [121, 98, 10].

4.3.3.1 Stress Recognition

Stress is another human behaviour that usually affects the normal performance ofperson’s activity. Stress recognition from facial images has not received yet muchattention although it represents an important problem for applications such as securityand HCI. There are two pieces of work in the literature; one is related to speechrecognition under stress [112]. Another work is based on facial tracking and HiddenMarkov Models (HMM) [82]. Here, authors encode facial deformations of eyebrows,lips and mouth, which constitute the input data for a single HMM.

Under stressing situations, the facial movements do not flow naturally while ex-pressing affection. Even simple interactions become abnormal and new unexpectedmuscular actions are displayed. The Fig. 4.7 shows an example of a video sequencethat has been encoded by the appearance tracker. Although the head pose and mostof the facial actions flow subtly, it is evident that the upper lip and lip stretcher suffersteep changes.

In order to study the stress levels in affective behaviours, the Mind Reading data-base [10] has been utilized. Hundred people were asked to classify 412 video sequencesof the database as stress, non-stress and irrelevant. Next the video sequences were


Figure 4.7: Affective behaviours may be completely stress reactions or mix-ture of emotions and stress. Facial actions (FACS) suffer alterations thatindicate anomalies (highlighted by the grey band).

encoded by the ABT (Section 3) that provides nine facial actions and six head poseparameters as output.

A binary SVM have been trained with 1,200 facial actions corresponding toeyebrows, lips, eyelids and irises, ~γ ∈ ℜ9. These data were sampled from thevideos at 50 Hz (about 2 frames from each sequence). A Gaussian Kernel is used

ϕ(~γ,~λ) = exp(−||~γ − ~λ||2/σ2), where the sigma value and a penalty function ξ de-termine the flexibility of soft margins to allow misclassification [31]. Similarly 32,800images where chosen for testing. An actor-fold cross-validation test was advancedexhaustively till all actors were tested for stress recognition. Compared to reportedresults by Metaxas et al. [82], 92% of correct classification, this SVM achieved 83%of recognition of stress and 99% recognizing non-stressful images. In average, theobtained recognition rate of stress/non-stress is 91% ± 11%.

4.3.3.2 Fusion of SVM and CCA

An especial interest arise on understanding the effects of considering the cognitive mo-tion of head. Humans express displacement, disappointment and many other gesturesby moving the head or adopting certain poses.

Before to dimensionally extend the feature space to higher dimensions, some


experiments should be performed. As higher the feature space as harder the dis-crimination of classes, even in the case of stress recognition. For example, the Fig.4.8 shows two viewpoints of features vectors including head pose and facial actions~g = [~ρ,~γ] = [θx, θy, θz, tx, ty, s, γ0, ..., γ8].

Figure 4.8: The 2D and 3D PCA projections of the trained SVM’s data.Simple dimensionality reduction and/or data fusion cannot discriminate theseoverlapping classes. The space must be transformed by a kernel function.

Given that head movements are rigid and physiologically independent from facialactions, the first step is to find out how much they can be reduced. PCA is appliedto each group of data, the head parameters ~ρ and facial actions ~γ. In Mind database,the head pose varies constantly in rotation and 3D translation since the behaviourare produced by external stimulus and interaction with other individuals. Therefore,the vector ~ρ ∈ R

6 is only reduce in one dimension, ~h ∈ ℜ5 while the vector ~γ ∈ Re9

is reduced to the vector ~f ∈ ℜ7. 96% of the variance of head pose is projected overthe 5 first eigenvectors. Likewise, 95% of the variability of facial actions is containedby the 7 first eigenvectors.

Subsequently, CCA is applied by using the Eqs. 4.6 to correlate head and faceparameters, ~h and ~f . Therefore, two regression matrices, Rhf and Rfh, can beobtained to estimate the most likely corresponding data in one space, given a real

data from the other space, ~h and ~f , respectively. Thus, head and face parameters canbe fused into a single vector:

~g =

[

~h

~f

]

(4.10)

Finally, all training data are transformed to the form of the vector ~g, by applyingPCA and CCA. Subsequently a SVM is trained to recognized stress based on head


pose and facial actions. A Radial Basis Function (RBF) or Gaussian kernel is applied.An actor-fold cross-validation test is advanced over 32,800 data points correspondingto images from the Mind DB. The Fig. 4.9 shows the obtained kernel for stressrecognition on this cognitive expressions.

Figure 4.9: RBF Kernel of a SVM that learnt PCA+CCA head and faceparameters for stress recognition. Blue marks are stress data, the green onescorrespond to non-stress, and the red marks are the SVs determining thedecision surface which is the grey hyperplane in between.

Observe that the learnt RBF kernel separates stress data (blue marks) from non-stress data (green marks). Furthermore, due to the few amount of SVs (red marks)in comparison to the big clouds of both data, a high classification rate is achieved.The Table 4.9 summarizes the obtained results for stress recognition.

Experiment Features Stress Non-Stress Average

FACS and SVM ~γ 83.38 99.09 91.68Head, FACS and SVM ~g = [~ρ,~γ] 57.23 68.30 62.77

CCA and SVM ~g2 = [~ρ, ~γ] 88.62 95.18 91.90

PCA, CCA and SVM ~g = [~h, ~f ] 96.02 99.89 97.96

Table 4.9: Four experiments are summarized. The former related to sim-ple SVM classification of FACS. Second, a common fusion of head pose andFACS. Third, CCA and fusion without dimensionality reduction. Four, di-mensionality reduction PCA, CCA fusion and SVM classification.

It can be seen in Table 4.9 how the correlated fusion of head pose and facialactions achieves the higher efficiency for the problem of stress recognition. Contrarily,


the common concatenation of all available features in a single feature vector ~g =[~ρ,~γ] yields to the lowest results. These results confirm on one hand that head posedoes enable us to recognize affective behaviours. On the other hand, head pose andfacial actions require a correlation analysis for data fusion. Strongly correlated datareinforce the pattern formation while uncorrelated data helps to discriminate theclasses.

4.3.3.3 Multi-SVM for Expression Recognition

This experiment uses the previously explained databases FGnet and MMI used atsections 4.1 and 4.2. Due this DBs contain only the six basic expressions and theneutral pose, the head pose is not useful for this experiment. Therefore, the featurevectors are just FACS, ~γ ∈ R

9.In order to train a SVM for multi-class recognition, the one − against − rest

technique is adopted. 20 actors and 7 sequences per actor were chosen to completea training sample of 1,400 data points. A Gaussian kernel is used with a high sigmavalue since a single class has to be discriminated against six remaining classes.

There are several ways to verify the efficiency of a SVM classifier. One possibilityis plotting the SVs against the Lagrangians, see Fig. 4.10.

(a) (b)

Figure 4.10: SVs are the non-zero Lagrangians, α, which determine thehyperplane bounds. As fewer the SVs as higher the chance getting a goodclassifier. (a) A first training of SVM for expression recognition achieves anaverage of 85 SVs tending to high values. Instead, (b) an increment of virtualSVs and relaxation of soft margins allows achieving an average of 46 SVs.

The results shown in Fig. 4.10.(b) reveal a good training of the SVM since mostof the Lagrangians are zero or almost zero. This SVM is expected to be an efficientclassifier. To prove it, an actor-fold cross-validation test is again performed in sampleof 40,000 FACS corresponding to still images from both FGnet and MMI databases.The 1,400 training data were excluded from the experiment, however 20 actors areunseen and the other 20 are the same as used for training. The following table presentsthe results obtained on FER for seen and unseen actors, as well as with both SVMsshown in Fig. 4.10.

It can be seen in Table 4.10 the summary of four experiments; The SVM result of

4.4. STATISTICAL EXPRESSION RECOGNITION 101

Classifier Anger Disgust Fears Happy Neutral Sadness Surprise Average

SVM 79.58 60.59 71.53 89.18 85.92 93.51 93.07 81.91Unseen→ 57.23 68.30 61.60 74.52 83.23 77.84 87.56 72.90SVM + SVs 96.02 98.70 86.00 93.00 94.80 87.70 97.74 93.42Unseen→ 87.23 92.59 87.19 97.64 87.55 86.68 88.31 89.60

Table 4.10: FER with SVM and Virtual SVs. Both SVMs presented in Fig.4.10 are tested with an Actor-Fold Cross-Validation (AFCV); Seen people(top row) and Unseen people (bottom row).

initial training Fig. 4.10.(a) and first row in table) acquires a competitive classificationrate, 81.91%. It is expected to obtain lower efficiency when recognizing unseen data,72.90%. Likewise, the results obtained after re-training the SVM with additionalvirtual SVs (Fig. 4.10.(b)) are presented in the same table. Due to more accuratehyperplane and less SVs, this SVM achieved higher rates in both cases, with seenactors 93.42% while 89.60% was obtained for unseen actors.

4.4 Statistical Expression Recognition

As above commented in section 4.3, one of the main issues in FER challenges islearning any structure allowing to automatically and efficiently recognize expressions.In that section, the main goals were to construct a classifier able to describe thespatial distribution of facial actions in a multi dimensional space and find out theappropriate transformations of such space to discriminate classes.

However, there are other type of patterns describing expression classes and theirconfiguration. Instead of only describing the facial actions which are the observablefeatures, it is possible to analyse the distribution and relationship among classesand their spatio-temporal morphism. Furthermore, by applying statistical learningmethods, it is possible to deal with incomplete observed features, estimate missingdata, and infer further states of facial expressions along the time.

Consequently Bayesian Networks are used in order to learn probabilist relationsand distributions as mentioned at Section 2.6.3.1. It is worth to remember thatdatabases for facial expression recognition such as FGnet [121], MMI [98] and MindDB[10] normally contain neutral pose images. These frames are commonly located atthe beginning of the sequence but can be also present elsewhere along the sequence.Therefore, pre-processing must be done for data cleaning and re-labelling.

The number of parameters to learn in a BNCs are related to the number ofobservable features, the number of classes and the type of variables (binary, discreteor continuous). Consequently, the amount of data must satisfy the completeness ofsuch features space.

4.4.1 Data Preparation

According to the ABT, a neutral facial expressions is the vector ~γ = [0, ..., 0]. However,FGnet and MMI databases contain sequences of neutral emotion which will be useto model the corresponding expression class. These databases do not provide labels


neither ground truth for expression recognition at the frame level. Therefore, theemotion’s labels are used after previously extracting the neutral part of all sequences.But still, is a supervised classifier is used, part of the misclassification can be explainedas mislabels. Similarly, an unsupervised classifier may find more than one class withinthe same emotion sequence while only one is expected. Therefore, for a frame-basedclassification of facial expressions it is crucial to determine the neutral part of thesequence, the range of subtle expressions and the most expressive part.

Given a sequence of facial action vectors, G = {~γ0, ..., ~γt}, from an image sequenceI, the p-norm is computed for each vector in the Lp-Space as follows:

||~γ||p = (

p∑

i=1

|γ|p) 1p . (4.11)

Figure 4.11: Highlighting neutral expressions and expressiveness with P -norm. This normalization allows to distinguish different intensities of anemotional sequence as well as the neutral parts.

In Fig. 4.11, it can be seen that all facial expressions tend to start near the neutralexpressiveness and drop once the peak is reached. Therefore, we model the p−normof all original neutral expressions as a singular normal distribution, ||~γ||neutral

p ∼N (µp, σp). Thus, it is possible to filter all facial expressions in the acceptance ratioof (µp ± σp). The expressiveness ǫ is calculated for each expression ~γ in each class c,as the percentage of the p-norm over the maximum variation inside of the expressionclass as follows:

ǫc =||~γ||p

(||~γ||maxpc

− ||~γ||minpc

)∗ 100 (4.12)


Consequently, facial expressions with ǫ < 50% are considered subtle whereasǫ > 80% indicates the peak of the expression and possibly a forced expression. Thus,the sequence-level labels area adapted to frame-level for expression classification.

4.4.2 FACS Discretization

As observed at Fig. 4.11, emotion sequences vary according to different kinematicand time. Due these variations are the result of variability of facial actions, it meansFACS do not follow same performance for all facial expressions and emotions.

Since the feature space must be discretized for efficiently modelling with BNCs,FACS must be analysed separately with respect to each expression class. To this end,mixture of Gaussians are adopted for strong discretization of each FAC.

For a given sequence of facial actions G = {~γ0, ..., ~γt} there is a discrete featuresub-space P = {~p0, ..., ~pt}, where γi ∈ R and pi ∈ Z

+. Therefore, for a given class C,the are k number of Gaussians for each facial action, such γik

∼ N (µik, σik

).

The number of Gaussian to consider in the discretization of each facial action areestimated by using a Gaussian Mixture Model (GMM). The number of discrete statesk for each facial action variable is determined by automatic model order selectionbased on Bayesian Information Criterion (BIC) [109]. The learnt GMM is used fordiscretizing the facial action. Subsequently, for given facial expression characterizedby seven facial actions for eyebrows, lips and eyelids, all facial actions are gathered ina multiple Bayesian Network as shown in Fig. 4.12

Figure 4.12: A facial expression vector ~γ = [γ0, ..., γ6] is discretized byGMM per variable and expression class. Thus, the expression vector becomes~P = [P0, ..., P6], where Pi ∈ Z

+.

The Naïve Bayesian Network in Fig. 4.12 can be used for discretization andclassification of expressions. However, it is better to factorize the class node in orderto allow independent discretization per class, despite of needing more data.


4.4.3 Bayesian Network Classifiers

To infer gestures from observed facial movements, the proposal is to use Tree Aug-mented Bayesian Networks (TAN). This type of Naïve Bayesian Networks (NBN)constraint the class variable to have parents and allows at most two parents for eachattribute (class variable and another attribute).

In general, allowing correlation among the attributes overcomes the performanceof a NBN in spite of needing bigger observation space and polynomial time for learn-ing. Thus, instead of one TAN model we have a collection of networks by factorizingthe state space in order to compensate the effects of including correlation. This fac-torization also enhance the modelling each class with different structures; a given classmight need partially observe the feature space, which speed up the inference processwhile strengthening the inference. We consider twelve groups of expressions split intosix prototypical (angry, fear, disgust, happy, sad and surprise). For each expression,the optimal TAN structure is learnt by applying the K2 algorithm [27].

Lets B = (G, θ) be a Bayesian Network if G = (X,E) is a Directed Acyclic Graph(DAG) where the set of nodes represents a set of random variables X = X1, . . . , Xn,and if θi = [P(Xi/XPa(Xi))] is the matrix containing the conditional probability ofnode i given the state of its parents Pa(Xi). A Bayesian Network B represents aprobability distribution over X which admits the following joint distribution decom-position:

P (X1, X2, . . . , Xn) =

n∏

i=1

P(Xi/XPa(Xi)) (4.13)

The K2 method is very fast and is frequently use in the literature despite of thedrawback of initialization. However, its results are constant for a given initialisa-tion order. The K2 algorithm was developed by [27] and searches for the BayesianStructure, DAG, that maximizes a score. Among a set of all DAGs containing the allvariables, the algorithm assumes an ordering node. Let Pred (Xi) be the set of nodesthat proceed Xi in the ordering. Set initial parents Pai of Xi to empty and computescoreB. Next, examine the nodes in sequence according to the ordering. When ex-amining Xi, determine the nodes in Pred (Xi) which most increases the score. Addgreedily this node to Pai. Continue doing this until the addition of no node increasesthe score, and finally obtaining a TAN, see Figure 4.13.

To tackle the initialization problem, an eigenvector selection technique is applied[126] which helps to stablish an order of relevance. In Fig. 4.13 it is possible to seethe full connectivity of the expression class and the observable feature space. ThisBN also includes the discretization process by GMM above explained.

Observe in Fig. 4.13 that only Angry expression is connected as TAN BN. Thisis for illustration purposes. However, the upper parental level, where class node isconnected to observations is the final learn structure after pruning the arcs undera relevance threshold. By restricting also the utilization of all observable featureslearning the TAN of a class, the necessity of bigger training samples is avoided. Thefullest correlated network, the hardest to structure to learn.

Finally a higher layer with a NBN decodes the decision for only facial expressionclass accordingly to the maximum log likelihoods. Although a simple voting for the


Figure 4.13: A full TAN BN is used for facial expression recognition. OnlyAngry expression is fully presented for better visualization. The K2 algorithmhas been modified with an eigenvector relevance threshold to avoid weak re-lationships between two features.

max(LL) can provide a decision, the NBN can learn a p.d.f for more precise inferationprocess.

4.4.4 Experiments TAN-BNs

Before to proceed with classification experiments, the neutral expression must belearnt and the rest of image sequences clean from the neutral part. Subsequently, thediscretization process has to be done.

FGnet and MMI databases contain together 60,620 images, 534 image sequencesand 40 actors. There are 2,980 images corresponding to neutral sequences, that meansthe sequence labels remain at the frame level. The facial expressions contain facialactions for eyebrows, lips and eyelids. Gazes are not considered from these databasesdue to the frontal pose and fixed gaze towards the camera.

4.4.4.1 Data Preparation

The p-norm is then computed for all neutral expressions, obtaining the Gaussianparameters such that ||~γ||neutral

p ∼ N (0.68, 0.004). Subsequently, all facial actions


whose norm is one standard deviation far from the neutral mean, are consideredneutral, i.e. Neutral = {∀~γ/||~γ||p ≤ 0.068 ± 0.06}. Consequently, 29,020 imagesfrom 56,640 were re-labelled as neutral facial expressions at the frame level. However,there may be more neutral frames more far from one standard deviation, but they willbe taken out from the training according to the expression intensity. This decisionis in order to keep expression classes unbiased from mislabelled data while keepingessential data whose expressiveness is higher than 50%.

4.4.4.2 Discretization

Figure 4.14: BIC criteria for Angry facial expression where inner eyebrow isred line, outer eyebrow is green line and eyelids is blue line. The plots suggestfour Gaussians for each facial action, however they do not split at the sameplaces.

By estimating the number of Gaussians for each Facial action γci , where i ∈ [0, 1]

and c = {1, ..., 7}. The BIC criteria suggests different number of states for eachfacial action accordingly to the expression class. For example the inner eyebrowis discretized into three states for fears while two states are considered for happyexpressions, which is a more spontaneous expression.

Once the number of states are determined for all facial actions and expressions,the corresponding BNs are constructed as in Fig. 4.12. Notice that GMM and BIC areused to compute the number of states per facial action within each expression class.It is possible to build a BN to model a GMM with discrete outputs as explained in[87]. However, a BN with two nodes, for the continuous variables γi and discreteoutputs Pi. For the output, a uniform CPD with k number of states is used.


4.4.4.3 FER by Seven TAN-BNs

The FER challenge is again faced by performing Actor-Fold Cross-Validation. In thiscase, three databases are gathered for recognizing the six basic expressions; FGnet,MMI and MindDB. They add up to 74, 762 facial expressions, 57, 640 from FGnetand MMI databases plus 17, 122 data from MindDB.

The experimental results of the seven separated TAN-BNs are summarised inTable 4.11. Here, the mean number of states per facial action is four. Moreover, inthis experiment the construction of the TAN-BN is only constrained to at most oneparent and another correlation.




Table 4.11: Seven TAN-BNs are tested by AFCV process. Each BN clas-sifies whether the expression belongs to the class. Due the independence ofexperiments, a facial expressions can be classified as belonging to more thanone expression class.

Given the BNs are tested independently, each binary output may indicate thata facial expression belongs without taking into account if the it has been classifiedinto another expression. However, an exhaustive comparison of all likelihoods of theexpression nodes can be done in order to unify a final decision.

This last step is expensive when classifying on-line as well as lacking of inferationchances. Therefore, an additional layer is plugged covering all expression TAN-BNs,which will learn a CPD with seven possible states given the individual TAN outputs.As shown in Fig. 4.13 this last BN is simply Naïve, so the expression classes remainprobabilistically independent and only P (X = C) matters, where C = {1, ..., 7}. Theresults of adding this last decision BN are shown in Table 4.12.




Table 4.12: A Naïve BN is added over the Seven TAN-BNs in order to learnthe joint distribution of all facial actions and unify the final decision.

After adding the decision node to unify the estimations of all TAN-BNs, the exper-


imental result improved notoriously. However, the TAN-BNs are still fully connectedwhich require enough samples to cover all possible combination of facial expressions.Additionally, the complexity increases as many discrete states are considered for eachfacial action Pi. Therefore, the number of states is fixed to at most three and thediscretization process is repeated again.

On the other hand, the eigenvector decomposition is applied over all facial actionsper expression class. Thus, it is possible to obtain the optimum order to run the K2Algorithm. Once the final structure is provided by K2, the TAN-BN is pruned fromeither irrelevant parents or correlation with other facial actions. Consequently, theCPDs are better trained by having more data available. The results of constrainingthe full BN to maximum three states per facial action and pruning the TAN-BNs areshown in Table 4.13.




Table 4.13: In order to reduce complexity of training, the number of statesfor facial action is restricted to at most three. Moreover, the TAN-BNs arepruned according to eigenvector decompositions.

This last result prove the suitability of classifying facial expression by TAN-BNs.The obtained results are comparable to those reported by Cohen et al. in [63]. Al-though both approaches are similarly applying TAN-BN for facial expression recog-nition problems, the databases used in these experiments contain more spontaneousexpressions.

4.5 Chapter Summary

In this chapter four expression recognition methods have been presented; CBR, CCA-CBR, SVMs and TAN-BNs. Although the final result are suitable of ranking com-parison, there are specific advantages of applying each method. The Table 4.14 sum-marises the results and contributions of the techniques here presented.

The classification methods presented here, were chosen in order to solve problemsrelated to spatial knowledge discovery, on-line learning, confidence and strong infer-ence. The experiments were conducted aiming to prove the strengths and suitabilityof all contributions. Finally, the most important results are summarised in the Table4.14 for a better comparison of the methods, their results and contributions.


Classifier Average Advantages

CBR+Confidence 94.19% Discovers spatial relationships.Assesses confidence of output classification.Allows updating knowledge.

Exhaustive CCA 90.86% Learns correlation among FACS.Requires less data robust to noisy data.Learns a correlation hyperplane for fast classification.Suitable for real-time.

PCA, CCA and SVM 97.96% Fast classification and strong separability of classes.Incremental and on-line learning.Suitable for real-time.

SVM + SVs 93.42% Fast classification and strong separability of classes.Incremental and on-line learning.Suitable for real-time.

Unseen→ 89.60%

TAN-BNs+Pruning 85.29% Strong for inferation when missing data.Supervised and Unsupervised Learning.Suitable for real-time.

Table 4.14: Summary of FER Techniques.

Chapter 5

Cognitive Emotion Analysis

In this chapter, the main goal is to model this complicated relationship betweensimple expressions, cognitive emotions and cognoscitive behaviours (Mental States).Attempting to construct a spatio-temporal taxonomy for conceptual mapping of emo-tions and mental states, first, a Case-Based Reasoning (CBR) module builds a spatialtopology of simple expressions by assessing confidence. Second, A Tree AugmentedBayesian Network (TAN) is factorized on its class space in order to de-correlate allexpressions and enhance the learning of independent structures. Emotions and men-tal states are recognized by independent Dynamic Bayesian Networks (DBN). Eachof basic behaviours is decomposed into four different variances thus building a higherlevel of emotion recognition. Finally, a single solution is provided by combining awinner-takes-all method and a confidence assessment.

A cognitive map of head-face gestures, emotions and mental states will allow build-ing a deeper taxonomy of human emotional behaviours rather than simply posed facialexpression classification. Through this deductive generalization of mental process-ing (cognition), we aim to map the psychological transformations around emotionalbehaviour by encoding signals, acquiring knowledge from data, hypothesizing andinferring the emotion nature (cognitive process) [36] and reasoning how people useemotional behaviours (cognoscitive process) [1, 11, 105]. This brings a comprehen-sive breakdown of observable signals, spatial and temporal knowledge, reasoning andmental mapping.

Although static pictures have been extensively analysed for expression recognition[127, 77, 97], there is a scientific agreement considering emotions and mental statesas dynamic patterns evolving on the time [63]. We tackle this problem through threehierarchical levels that combine deterministic and stochastic methods;

1. Observations: Nine facial actions are encoded by an Appearance-Based Tracker;Inner brow, outer brow, eyelids, iris yaw, iris pitch, corner lip stretcher, cornerlip depressor, upper lip, lower lip. The head pose is encoded according to threespatial angles. Moreover, additional kinematic features are computed such asblinking frequency, eye closure speed, saccade frequency and gaze speed. Thesefeatures enhance the recognition of most cognitive upper levels.

111

112 CHAPTER 5. COGNITIVE EMOTION ANALYSIS

2. Feature Knowledge: Case-Based Reasoning approaches are applied to exploitthe database knowledge. Based on input features, their spatial distribution, in-trinsic feature proprieties and provided expert solutions, CBR constructs a firstapproximation to the emotions taxonomy. It gives an enhanced knowledge ofthe decision surface beyond classification since also provides generalization rulesfor updating knowledge.

3. Confidence Assessment is performed within the CBR module accordinglyto some confidence estimators, neighbourhoods and weighted voting. The con-fidence enhance the cluster definition by strongly dichotomizing the boundaries.

4. Expression Recognition is learnt once the spatial distribution of observationsis settled. Next, these observed features are linked according to probabilisticcorrelations, topology of classes and confidence assessment. A Tree AugmentedBayesian Network (TAN) is factorized for each expression class (FaTANs). De-spite of avoiding correlation of expressions, it allows learning independent dy-namics and structures for each class. Here, we recognize twelve different groupsof facial expressions, six for Prototypical Cognitive Emotions (PCE) and othersix for Cognoscitive Mental States (CMS).

5. Emotion and Mental State Recognition: Considering these as temporaland dynamic interactions of facial expressions, a Dynamic Bayesian Network(DBN) is factorized and trained for each behaviour pattern. Similarly to Fa-TANs, FaDBNs independently recognize four PCEs and four CMSs. Here, thecorrelation among expressions may be included to consider the possibilities ofinterpreting sequences with a mix of expressions of different nature. Conse-quently, an emotional behaviour is recognized for an elapsed time period.

6. Behaviour Interpretation: The final interpretation is given by assessingconfidence for previous stage’s outputs; the Emotion Class (EC), MaximumLikelihood (ML) and Time Length (TL). Although ML can provide a solution,a confident solution can stress the unbiased interpretations.

This system performs an inductive generalization of acquisition of observations,categorization of expressions and inference of behaviours. It covers a wide variety ofpsychological transformations from bottom to top; Expressions related to few/manysignals, different intensities and durations, see Figure 5.1. The lower module on thesystem extracts the observation features for expression and emotion investigation.These features can be stored and clustered by nearness or probability. However, aim-ing to tailor a spatio-temporal taxonomy for cognitive mapping of emotions, we mustbe able to model a multiple kernel manifolds. Accordingly, a CBR system enhancedwith confidence assessment is used to exploit the database knowledge, generalize de-cision rules and dichotomize as much as possible classes’ boundaries. Consequently,the higher level provides the final interpretation of a given emotional behaviour fromthe input video-sequence.

5.1. REASONING THE COGNITIVE STRUCTURE 113

Figure 5.1: Bottom-Up System Overview. Head-face features are encodedby ABT from an image sequence. CBR stores and exploits the feature vec-tors database. Confidence is assessed to enhance the separability of classesand knowledge of spatial topology of observation space. FaTANs indepen-dently recognize facial expressions, establishing the features’ correlation andlikelihoods. FaDBNs independently recognize PCEs and CMSs for a sequenceof facial expressions. After assessing confidence for emotions the behaviourinterpretation is obtained.

5.1 Reasoning The Cognitive Structure

According to the Table 4.14 CBR achieved high classification rates despite of thememory and time consumption. Therefore, here CBR is used in order to exploit asmuch as possible the spatial relationships of facial actions according to the six basicfacial expressions and additional type of displays.

In Chapter 4 only basic emotion prototypes were taken into account. However,in this chapter additional and much complex affective behaviours are considered.According to Mind DB [10] there are 24 group of emotions collecting 412 detailedemotions. For example, a basic emotion such as Anger encloses at the same timeother emotions in deeper details, see Fig. 5.2.

Although such extensive discrimination of facial actions increases the complexityfor spatial classifiers, the goal here is to establish a hierarchy of the new emotions andthe corresponding static expressions. Therefore, CBR is applied to each emotion classseparately. Consequently, a strong taxonomy is built inter-class for facial expressionsfor six basic emotions (Anger, Disgust, Fears, Happiness, Sadness and Surprise) andsix new emotions, hence Mental States (Interested, Hurt, Sneaky, Sure, Thinking,Unfriendly). Note that the mental states were picked up considering the intentionality


Figure 5.2: Basic emotions can be categorized in deeper details accordingto intensity, intentionality and contextual information.

that they may involve.

Thus, given the observable feature spaces of basic emotions, Sb =< ~γb, εb, ϑb, λb >,and mental states, Sm =< ~γm, εm, ϑm, λm >, CBR is applied to assess confidence.The facial actions, ~γ, are extracted by applying ABT as in Chapter 3. The expressive-ness, ε, is computed by the p-norm as in Eq. 4.12. The labels for all expressions , λ,are taken from the Mind DB, since this is the only DB containing such discrimination.Finally, the training samples are cleaned up from those frames with low confidencewhile keeping the expression classes with higher confidence within each emotion class.

After establishing the spatial taxonomy at the expression and gesture level, emo-tions and mental states are also analysed intra-class by applying CBR and confidenceassessment. This process will provide the spatial distribution of affective behaviours.For this purpose, only the Mind DB is used. Consequently, CBR-confidence is appliedto the set of affective behaviours B = {S ′

b∪S ′m}, where S ′ =< ~g, ε′, ϑ′, λ′ > includes

the head movements.

After assessing confidence intra-class for head-face gestures and inter-class forfacial emotions (spatial relationships), a Cognitive map is built, which connects emo-tions and mental states to the corresponding complex displays at the deeper level ofexpressions, see Fig. 5.3. Moreover, confidence values are obtained for both levelsexpressions and emotions, in relationship to the classes at the same level.

5.2. PROBABILISTIC LEARNING OF COGNITIVE MAPS 115

Figure 5.3: Cognitive map for the interpretation of head-face gestures, emo-tions and mental states.

Finally, CCA is applied as in Section 4.2 in order to increase the confidence at alllevels. CCA is applied to head and facial actions parameters and later assessing againconfidence over the maximised correlations at the first level of facial expressions.

5.2 Probabilistic Learning of Cognitive Maps

In main goal of this section is to probabilistically learn the cognitive map of head-facemovements, head-face gestures and emotions (Fig. 5.3). In Section 4.4 the advantagesof learning Tree Augmented Naïve Bayesian Networks (TAN-BN) by applying the K2algorithm for structure learning. Therefore, TAN-BNs are learn over the facial actionobservation space in order to classify a given facial action (head-face when required)into one of the new facial actions proposed by Mind DB. Consequently, TAN-BNsand NBN are learnt and trained for the following datasets:


1. TAN-BNs are learnt from Sb =< ~γm, λb >, where the class nodes are deter-mined by the labels λb according to the six basic emotions.

2. TAN-BNs are learnt from Sm =< ~γm, λm >, where the class nodes are deter-mined by the labels λm provided by Mind DB for mental states.

3. NBNs are learnt from S ′b =< ~ρb, ~γb, λ

′b >. Here head movements are gathered

to facial actions in order to learn the CPD of deeper expressions, see Fig. 5.3.

4. NBNs are learnt from S ′m =< ~ρm, ~γm, λ

′m >. Similarly, head and face are the

input of the NBNs to infer the spatial relationship between basic expressionsand deeper mental states.

Finally, a Multi-net composed by TANs and NBNs allow a spatial mapping offacial expressions, see Fig. 5.4. Starting from an observable lower level of facialactions, TANs are intended to recognize facial expressions according to either the sixbasic emotions or the six chosen mental states.

5.3 Inferring Expressions, Emotions and Mental States

The affective behaviours suitable of discrimination and understanding require themethods capable of dealing with spatio-temporal patterns. In a real word, the dura-tion of such behaviours is not constant, the kinematic of along the different intensitiesis not either uniform. Therefore, a hybrid system is required attempting to deal withas many challenges as possible while reducing the restrictions and assumptions.

5.3.1 Mixture of HMMs

A Mixture of HMMs (MHMM) can be used for clustering (pre-segmented) sequences.The main idea is to unify the HMMs into a hidden cluster variable as a parent.Thereby, all the distributions become conditional on the (hidden) class variable. Ageneralization of this, which I call a dynamical mixture of HMMs, adds Markoviandynamics to the mixture/class variable, as in Fig. 5.5.

Consequently, the previously obtained Multi-nets of TAN-BNs and NBNs arehere used to build a Dynamic Bayesian Network for temporal analysis of emotionsand mental states. The first expression class is factorized into states to form a FaDBN,see again Fig. 2.23. The factorization of one of the slices of this multinet and therepresented as FaDBN is shown in Fig. 5.6

In the same as shown in Fig. 5.6, all slices in the multinet are factorized intoFaDBN which are further gathered through an upper hidden parent. This new struc-ture is called Factorized Mixture of HMMs. The new hidden models the temporaldynamics of the emotional patterns while providing a single solution as output from43 possible emotions and mental states. This solution can be also scaled to the lowerlevel of emotions by connecting the hidden node directly to all 12 emotions and mentalstates.

5.3. INFERRING EXPRESSIONS, EMOTIONS AND MENTAL STATES117

(a)

(b)

Figure 5.4: (a) Each emotion is inferred from a Multinet that combinesTANs and NBNs. (b) Mental States are also recognized from a Multinet thatstarts from facial expressions inferred from head-face movements till reachingdeeper expressions.


Figure 5.5: DBN representation of a Mixture of HMMs [87].

Figure 5.6: DBN representation of a factorized Mixture of HMMs [87].

5.4 Behaviour Interpretation Based on Confidence

Although the final structure of FaMHMM models spatial and temporal probabilities,HMMs tend to misclassify sequences variable duration and incomplete sequences.Consequently, a final confidence assessment is proposed for spatio-temporal interpre-tations; a Behaviour Confidence Score (BCS) similar to that obtain in section 4.1.

The five confidence estimators are calculated with the emotion recognition out-

puts from FaHMM. At each time-step, a 3D vector Θ(i)t =< ψ, τ, λ > is composed,

where ψ is the likelihood of the hidden node, τ is a discrete variable measuring thefrequency of the current solution λ along the sequence. Thus, ψ and τ are randomvariables and λ ∈ R

12 represent the six emotions and six mental states. Next, thisinformation is collected for a temporal window of five consecutive frames, what com-pletes a neighbourhood of fifteen data. Finally, a similarity matrix is computed for

5.5. EXPERIMENTS ON EMOTIONS AND MENTAL STATES 119

this small sample and the confidence score (Section 4.1.3.1) is computed to providentthe decision with highest confidence.

This assessment enhance the interpretation-making since enhance the spatialknowledge of emotion space. It also gives an estimation of the minimum lengthof a behavioural sequence to be confidently interpreted.

5.5 Experiments on Emotions and Mental States

The following experiments were advanced in a similar way to the Chapter 4, i.e. thedatasets were tested into an Actor-Fold Cross-Validation. This type of testing aspreviously explained, reveals the system’s capability to deal with unseen data, whichin this case correspond to unseen people.

5.5.1 Building the Spatial Taxonomy

The first experiment is to build the spatial taxonomy of facial expressions at twolevels, called Expression 1 and 2 according to Fig. 5.3; Expression 1 corresponds to12 expressions of the basic emotions and mental states. Subsequently, the Expression2 level refers to the more complex emotions but the frame level.

The taxonomy is constructed by assessing confidence inter-class for the 12 expres-sions with respect to the facial actions and for intra-class for the 46 deeper expres-sions inside of each expression at level 1. The process of confidence assessment is asdescribed in Section 4.1; five confidence predictors are obtained iteratively in an opti-mization process of false detection, see again Algorithm 4. The Table 5.1 summarizesthe confidence and expressiveness results of this experiment.

It is worth to mention that as explained in Section 4.1, the optimization of con-fidence assessment suggests a cleaning of the DB. However, due to the necessity ofentire sequences of facial expressions for emotion analysis, a frame is dropped froma sequence only if the confidence is lower than 50% and not other consecutive framehas been deleted before.

5.5.2 Probabilistic Recognition of Expressions in a Multinet

After cleaning the database, both levels of expressions are used to build a multinetcomposed by TAN-BNs at first level and NBNs at second level (Section 5.2). Thetraining process considers decoupled networks which further coupled into a multinet,such the output at the first level of 12 facial expressions correspond to the inputsof the upper level of NBNs. In essence the construction of the multinet is doneas a cascade of Bayesian Networks where the outputs are decoded according to themaximum likelihood function. The recognition rate of facial expressions at both levelsby the multinet is summarized in the Table 5.2.

The recognition result shown in Table 5.2 correspond to those obtained by eachcomponent of the multinet, i.e. NBN or TAN-BN. Observe that this results prove theclassification rate per expression class at each level, which are only comparable to theresults reported by R. El Kaliouby [49]. This approach is the closest work comparable


Expression 1 Confidence Expressiveness Expression 2 Confidence ExpressivenessAnger 99.31 85.31 Complaining 77.33 60.3

Frustrated 64.28 86.62Annoyed 96.12 53.93Furious 82.24 82.92

Fears 90.4 97.47 Intimidated 73.53 73.1Vulnerable 52.53 85.47Nervous 82.34 88.33Cowed 68.44 56.72

Disgust 93.89 93.69 Disgusted 53.81 62.18Revulsion 54.68 82.93Distaste 78.23 83.87

Happy 98.9 95.73 Mischievous 71.71 56.53Overjoyed 71.55 51.32Enjoying 61.14 75.1Joking 55.51 90.94

Sadness 95.94 72.72 Anguished 71.26 96.72Tearful 79.19 71.74Grave 89.51 65.77Lost 85.14 80.38

Surprise 84.92 96.33 Scandalized 59.21 98.61Horrified 51.04 90.42Shocked 88.6 82.24Wonder 86.97 65.26

Interested 89.84 92.3 Concentrating 89.65 62.96Spellbound 73.91 90.21Listening 77.29 91.93Asking 72.86 92.06

Hurt 85.23 94.65 Exploited 80.16 74.95Neglected 99.12 85.44Tortured 55.46 56.44Offended 83.21 83.39

Sneaky 50.59 68.62 Calculating 75.1 70.15Insincere 69.61 66.67Fawning 92.02 71.98

Sure 87.42 79.8 Determined 70.01 69.59Committed 88.75 51.37Convinced 90.28 99.86Assertive 94.11 92.19

Thinking 78.78 88.82 Thoughtful 98.86 57.15Calculating 88.93 72.28Dreamy 58.01 64.46Judging 79.04 84.52

Unfriendly 93.75 98.35 Threatening 50.84 96.14Blaming 54.66 75.2Ignoring 74.14 95.75Cold 52.84 98.88

Table 5.1: Confidence and Expressiveness of facial expressions at first andsecond level of the cognitive map (Fig. 5.3).

to the Multinet proposal, since both are modelling cognitive behaviour from the MindReading database.

5.5. EXPERIMENTS ON EMOTIONS AND MENTAL STATES 121

Expression 1 Recognition Expression 2 RecognitionAnger 92.37 Complaining 89.16

Frustrated 82.28Annoyed 76.06Furious 88.11

Fears 87.11 Intimidated 83.83Vulnerable 89.17Nervous 64.83Cowed 74.22

Disgust 76.89 Disgusted 74.71Revulsion 81.61Distaste 63.55

Happy 86.69 Mischievous 82.07Overjoyed 90.5Enjoying 78.47Joking 78.32

Sadness 70.46 Anguished 79.44Tearful 79.89Grave 72.55Lost 77.11

Surprise 85.84 Scandalized 84.45Horrified 86.25Shocked 82.22Wonder 99.58

Interested 78.08 Concentrating 66.42Spellbound 84.09Listening 86.16Asking 85.21

Hurt 73.56 Exploited 76.8Neglected 88.77Tortured 67.5Offended 62.23

Sneaky 71.42 Calculating 64.45Insincere 86.29Fawning 84.36

Sure 62.91 Determined 87.23Committed 73.51Convinced 54.49Assertive 81.01

Thinking 67.59 Thoughtful 78.8Calculating 69.57Dreamy 77.84Judging 76.47

Unfriendly 77.84 Threatening 78.46Blaming 81.99Ignoring 81.86Cold 77.01

Average 77.56 78.89

Table 5.2: Confidence and Expressiveness of facial expressions at first andsecond level of the cognitive map (Fig. 5.3).


5.5.3 Emotion and Mental States Recognition

In order to analyse the temporal evolution of facial expressions at both levels, TAN-BNs and NBNs are factorized at the expression class node. Therefore, three newdiscrete states replace the expression class node, which are modelling the spatialrecognition of expressions according to three expression intensities. By consideringthe expressiveness and confidence obtained at Table 5.1, each expression class is splitinto three groups. Next, the TAN-BNs and NBNs are coupled separately to the statenodes thus constructing the time slices for the corresponding HMM.

The training process is done in similar way as the multinet in order to reducethe complexity of handling a huge mixture of HMMs. Once the singular HMMs aretrained, they are coupled to the higher parent node which joins the singular networksinto the Mixture of HMMs (see Murphy’s Thesis for additional details on trainingMHMMs [87]).

All databases, FGnet, MMI and Mind are sub-sampled into a variety of sequenceswhile increasing the amount of time-series and the extending the capabilities of theMHMM to deal with different type of emotions. The length of emotion sequences wasvaried between 10 and 100 frames, where the frequency with respect to the originalsequence is at most 3 Hz. Thus, from an initial set of 537 image sequences, 15,100sequences are obtained after the sub-sampling pre-processing. The obtained result foremotion and mental state recognition are presented in the Table 5.3.

After testing the whole database by an exhaustive AFCV, the FaMHMMs acquiresan average recognition rate of 80.20% for the emotions at the first level (12 groups).Likewise, 46 emotions are recognized with an average of 81.5% at the complex level ofemotions. These results are significant for the state of the art since they correspondto person-independent tests (unseen actors).

5.5.4 Confident Interpretation of Cognitive Behaviours

As above mentioned, a new dataset of 15,100 image sequences have been obtained,from which the Behaviour Confidence Score is computed for the last 5 time-stepsof each sequence. Therefore, a total of 15,100 sub-neighbourhoods are analysed forconfidence of the obtained solutions at previous temporal classification. The obtainedresults of such experiment are shown in Table 5.4. This experiment has been doneoff-line by collecting the information of 5 frames of each sequence. However, it can beplugged to an on-line modular system despite of the time delays caused by observing5-frame time window.

As expected, the recognition rate increases due to the confidence assessment. Thisprocess is similar to a Markov process of fifth degree. However, HMM of 5 time slicesties the transition probability distribution by additional conditioning random vari-ables. Instead, the BCS takes into account duration as complement to the likelihoodvalue at certain time.


Emotion/Mental State Recognition Mental State RecognitionAnger 96.85 Complaining 92.82

Frustrated 86.65Annoyed 78.92Furious 91.08

Fears 87.27 Intimidated 84.16Vulnerable 93.32Nervous 67.41Cowed 77.63

Disgust 77.3 Disgusted 78.79Revulsion 85.28Distaste 63.73

Happy 88.98 Mischievous 82.64Overjoyed 92.55Enjoying 83.35Joking 80.56

Sadness 73.63 Anguished 83.97Tearful 81.95Grave 76.09Lost 78.47

Surprise 90.54 Scandalized 88.26Horrified 90.5Shocked 85.13Wonder 99.51

Interested 82.41 Concentrating 66.75Spellbound 87.48Listening 89.78Asking 86.93

Hurt 77.21 Exploited 81.3Neglected 90.56Tortured 70.3Offended 62.95

Sneaky 72.77 Calculating 69.08Insincere 89.16Fawning 86.84

Sure 66.84 Determined 91.59Committed 74.53Convinced 55.71Assertive 85.05

Thinking 67.85 Thoughtful 79.76Calculating 71.21Dreamy 79.96Judging 79.38

Unfriendly 80.73 Threatening 80.79Blaming 84Ignoring 83.73Cold 79.28

Average 80.20 81.50

Table 5.3: Emotion and Mental States recognition by the FaMHMM.

5.6 Chapter Summary

In this chapter have proposed a new approach for emotion analysis by combiningseveral classifiers and their strengths of knowledge acquisition, spatial modelling of


Emotion/Mental State Recognition ConfidenceComplaining 97.67 96.57Frustrated 98.50 91.96Annoyed 87.70 100.2Furious 96.07 85.11Intimidated 90.79 93.41Vulnerable 84.54 85.43Nervous 99.77 83.81Cowed 89.00 98.15Disgusted 83.49 95.03Revulsion 98.92 86.28Distaste 98.23 85.44Mischievous 91.95 82.98Overjoyed 99.01 90.24Enjoying 99.06 95.42Joking 89.09 89.02Anguished 91.20 94.87Tearful 88.09 86.06Grave 94.36 89.77Lost 97.03 90.36Scandalized 89.41 98.64Horrified 88.14 98.35Shocked 89.28 94.66Wonder 96.51 82.41Concentrating 95.19 94.12Spellbound 88.38 98.6Listening 85.96 89.03Asking 98.81 82.43Exploited 95.91 98.3Neglected 83.86 92.58Tortured 89.64 86.57Offended 98.22 99.74Calculating 98.93 84.27Insincere 98.62 97.22Fawning 95.05 98.5Determined 96.37 84.98Committed 86.93 83.61Convinced 86.86 99.21Assertive 91.47 98.38Thoughtful 94.76 91.45Calculating 94.86 95.71Dreamy 99.61 98.87Judging 89.93 87.02Threatening 99.3 98.13Blaming 97.15 98.59Ignoring 83.30 99.53Cold 97.15 92.48

Average 93.13 92.25

Table 5.4: Emotion and Mental State Interpretation by BCS.

feature space and expression classes, and modelling of dynamic temporal patterns.The combination of CBR, TAN-BNs, NBNs and DBN (representing HMMs) allowextending the inferation capabilities of system for cognitive behaviour analysis. Fur-thermore, a confidence score is proposed to interpret such behavioural patterns which


are not completely understood by HMMs when testing incomplete time-series.Finally, this chapter collects the entire application of proposal in previous chap-

ters; Face modelling, head-face tracking, correlation analysis, knowledge discoveryby CBR, confidence assessment, probabilistic expression recognition and temporalrecognition of emotion and mental states. Therefore, a cognitive mapping of emo-tional behaviours is achieved by reading the face as reliable display of inner humanemotions.

Chapter 6

Conclusions and Future Research

6.1 Face Motion Modelling

An efficient combination of stochastic and deterministic methods to address the prob-lem of a complete and unconstrained 3D head and facial tracking. Multivariate sta-tistic modelling, optimization methods and line search allowed to deal with threedifferent type of movements. They have been decoupled and modelled with specificactive appearance models and backtracking procedures.

Three main contributions have been proposed. Firstly, a given input image se-quence is represented as a sequence of appearance models, which are modelled througha Multivariate Normal Distribution. This Gaussian assumption is fundamental toconstruct a likelihood function and to compute recursively the expected appearancebased on previous adaptations.

Secondly, regarding previous estimations, a modified Levenberg-Marquardt Algo-rithm (LMA) is used to obtain the optimum current 3D shape model, which minimizesthe residual image between expected and estimated appearances. We have shown theevolution, advantages and drawbacks of other optimization problems. With the LMA,we solved the convergence paradigm of gradient methods.

Thirdly, we include line-search methods to handle different descent rates accordingto the facial action kinematics. For example, backtracking procedures and specificshape models allow to obtain accurately the eyelid and iris positions. Finally, thesecontributions compose the core of three sequential trackers for head, eyebrows, lips,eyelids and irises.

The strengths to cope with most of the hard challenges have been demonstrated.Some of these are stability under illumination changes, translucent surfaces, occlusionsand out-of-plane movements. All trackers have been improved with the Huber’s func-tion to down-weight the influence of outliers pixels in the observation and transitionprocesses.

Two main resolutions for appearances have compared according to accuracy, effec-tiveness and robustness. Both appearances contain 5659 and 1426 pixels respectively.Firstly, the small appearance is suitable to achieve the real-time requirements; theestimation error is an average of 3.5 pixels, 86% of correct adaptations and a perfor-

127

128 CHAPTER 6. CONCLUSIONS AND FUTURE RESEARCH

mance of 32 frames per second. Secondly, the big appearance provides an estimationerror of 2.5 pixels, 96% of effectiveness and 1.1 frames per second as performance.

This framework extends the application of appearance models to obtain a com-plete 3D head, eyebrows, lips, eyelids and iris motion extraction. Beyond of separatedhead pose, eyelids and iris tracking, we provide a robust and accurate method to hier-archically obtain the whole head and facial tracking. Eyelids and irises are estimatedas continuous FACS values, in frontal and rotated head pose for spontaneous andnatural movements. The proposed method is a significant contribution towards gazemotion tracking by using appearance-based model. On one hand, the shape represen-tation, the statistical modelling and optimization algorithms give an alternative toalready proposed methods. On the other hand, the accuracy and robustness make thismethod suitable for HCI applications and psychological analysis, since it can work instandard video surveillance environments without requiring training.

The necessity of three different appearance based trackers has been proven, withspecific shape models and textures, and some variations for their observation andstate transitions processes. The first tracker estimates the 3D head pose, eyebrowsand lips, since they have smooth movements with small gradient descents. The secondtracker estimates the eyelid position based on non-occluded shape model that avoidsthe inner eye region. Moreover, this tracker includes special backtracking proceduresto deal with movements such as upper eyelid raising, tightening drooping, squint,blinking, slit and eyes closed. The third tracker obtains the iris position based ona full eye region shape model and a more monotonic backtracking procedure to dealwith iris yaw, iris pitch and saccade movements.

Detailed experimental results proved the accuracy and robustness of our proposedmethod with challenger issues for marker-less face trackers. The FGnet database forface tracking has been tested in three experiments showing the accuracy with respectto the ground truth, the problems of tracking eyelids with the face tracker and thehierarchical tracker reliability.

Canonical correlation analysis has been widely used here for image synthesis, fea-tures fusion and exploration of the relationships among multiple dependent and inde-pendent variables. The strengths of this technique for fast search, synthesis of data,prediction and classification have been proven. To generate ASMs based on head-facehas been extremely useful to enhance the ABT who needs manual initialization atthe first frame.

The wise combination of head-face features through CCA, has proven that headprovide important information revealing cognitive behaviours. In our work we demon-strate the reliability to build a cognitive expression system that combines head-facefeatures, which have been poorly analysed. We proposed a mapping technique toevolve from basic to cognitive expressions. The high classification rate at the highestlevel, demonstrates that our proposal of mapping works with posed and spontaneousfacial expressions.

This technique allowed to additional contribute to facial expression recognitionand cognitive systems. First, ABTs have been enhanced with an automatic ASMgeneration, which allows a realistic on-line tracking. Second, a transformation ofthe features space was proposed by projecting these features onto canonical vectors.However, due to the strengths of CCA, it is better to perform classification on the

6.2. FACIAL EXPRESSION RECOGNITION 129

estimated vector. Third, a taxonomy of basic and cognitive expressions was built inorder to be able to assess confidence while avoiding the heavy computations in CBR.Fourth, it has been proposed to replicate the knowledge to all branches of the builtnetwork by learning the correlation between the high level classes and the solutionsproposed in the inferior classes.

6.2 Facial Expression Recognition

6.2.1 CBR

This section has shown how to exploit the knowledge contained in a database byfollowing the CBR-cycle. The Case-Base has been cleaned from those data loweringthe classification rates by assuming them as having low confident solutions (expres-sion labels). Data with high confidence solutions and ranked as misclassification werere-labelled. Finally, this recycling process has been complemented with a retainingprocedure according to the weaknesses addressed by the confidence predictors. How-ever, although the system can adaptively learn to recognize facial expressions, it is stilla computational expensive process due to the exhaustive validation of the Case-Basealong several k-NN models.

According to the FGnet data, all sequences raise a local maximum from the neutralexpression that is usually recovered after the maximum expression intensity. Thistransition between expressions allocates the cases around areas of either subtle ornoisy data; the former case can also acquire high confidence while the later case tendsto low confidence values.

In comparison to the approaches proposed in [97, 68], this system also followsthe full CBR cycle for expression recognition. However, three remarkable differencesconstitute the advantages over those methods; First, facial actions are taken as casedescriptors rather than using Action Units for classification. Second, the systemincludes a wide variety of people while the above mentioned works were used forprofiling a single actor. Third, the expression is provided as solution of the systemtogether with its estimated confidence and expression intensity based on intra-classnormalization.

In conclusion, this approach can still be sensitive to more spontaneous expressionsand emotions, which is the case of real scenarios where people do not neither pose(but still believably fake) facial emotions and wider variety of complex affective states.Additional investigation must be done addressing the stability of confidence predictorsalong neighbourhood size variation and different facial actions as descriptors sincehuman face is not always completely visible.

6.2.2 Support Vector Machines

6.3 Expression and Emotion Interpretation

It was demonstrated that the proposed confidence measures are good confidence esti-mators and stable classifiers. They increase the classification efficiency when including

130 CHAPTER 6. CONCLUSIONS AND FUTURE RESEARCH

subtle expressions. Moreover, They were extended for maintenance polices, contribut-ing to the four steps of the CBR cycle; retrieving the optimal k-neighbourhood forexpected classification error and confidence. Reusing previous solved cases to provideconfident solutions. It is possible to revise the solutions based on updated knowledgeof the decision surface and the case base. By learning the optimum confidence thresh-olds, it is possible to retain or reject new solved cases. Experimental results haveproven that the classification confidence improves the facial expression recognitionrather providing additional information for the solution. After training, the systemconfidently recognizes facial expressions of unseen actors, can delete cases with lowconfidence and retain those with high confidence. As result, the system achieves anaverage classification rate of 93%± 1% with an average confidence of 96%± 3%. Thesolutions are composed by expressiveness, confidence and expression class (ǫ,K,Λ).

Four contributions represent important advances for facial expression recognitionin dynamical environments, which include contextual information, different people’scultures and expression intensity. CBR and confidence assessment allow improvingthe recognition capabilities by making decision rules to increase the classificationconfidence. The low cost of knowledge acquisition and the strengths of dichotomizingthe decision boundaries in multi-class and multi-modal domains, make CBR suitablefor adaptive learning systems.

Appendix A

Image Warping

Building a geometrically normalized image or patch from each frame of an imagesequence into a shape model, an image warping process is needed to project the inputimage onto the patch. This is made with a piece-wise affine, see Figure A.1:

Figure A.1: Image Warping

Image warping implies to transform one spatial configuration of an image intoanother [81]. Hence, a simple translation of an image can be considered as a warpedimage. Formally, given the input image I ∈ ℜk → I’ ∈ ℜk and pictorial in the planarcase, k=2.

Given an input facial image, we have to recover the 3D perspective of the face inorder to project it onto a two dimensional image, this is the appearance model. The3D mesh is useful for capturing the perspective and for projecting the input face.

We measure three Euler’s angles, {θx, θy, θz} of the mesh, as well as three transla-tions, {tx, ty, tz}. Let R = [r1, r2, r3] and T = [tx, ty, tz] the rotation and translationbetween the 3D face model coordinate system and the camera coordinate system, seeFigure A.2. Let βu, βv, uc, vc, be the intrinsic parameters of the camera, where βu,βv are pixel scaling factor in horizontal and vertical direction of the focal length of thecamera. In addition uc, vc are the coordinates of the principal point (image centre),see Figure A.2.

So, proceeding with the weak perspective projection blended with the 3D pointPi onto the reference plane z = tz involves that the projection matrix M, which canbe written as:

131

132 APPENDIX A. IMAGE WARPING

Figure A.2: Weak perspective projection of the 3D face mesh onto the imageplane.

M =

[

βu

tzsrt

1βu

tztx + uc

βv

tzsrt

2βv

tzty + vc

]

, (A.1)

where r1 and r2 are the first two rows of the rotation matrix R, and s is an unknownscale. In the initial shape set-up, the camera is partly calibrated. The coordinates ofthe principal point uc, vc are known, and the pixel scaling factor are equals. Later:

M =

[

s ∗ rt1 tx + uc

s ∗ rt2 ty + vc

]

. (A.2)

The rotation matrix R is represented by the three Euler’s angles, {θx, θy, θz}. The3D pose parameters, which are three angles and three translations is represented bythe vector ρ = [θx, θy, θz, s, tx, ty]T . Here, the face can move along the depth directionbecause tz is included in the scale, s. Thus, pi is given by:

(

ui

vi

)

=

[

srt1 tx + uc

srt2 ty + vc

]

Xi

Yi

Zi

1

(A.3)

Since AAM are landmark-based, we use a linear image warping method whichconsiders the mapping of one arbitrary point set I = {x1, . . ., xn} into another I′ ={x′1, . . ., x′n} where each point is represented as z = (x, y). Formally written as acontinuous vector valued mapping function such that: g(xi) = x′i, ∀i = 1, . . ., n.

In order to construct an n-point based warp, we assume that g is locally linearand zero otherwise while extracting only the reference shape free. In order to warp the

133

input image into a 2D AAM, we can use a partition of the convex hull of the pointsby using a suitable triangulation such as Delaunay. Thus, we connect an irregular setof points through a triangular mesh, each one satisfying the Delaunay property [108].

Consequently, warping process is done by applying the triangular mesh of the setof points in I towards the second set of points in I’. Each point on each triangle canbe uniquely mapped onto the corresponding triangle of the second set of points by anaffine transformation, which consists of scaling, translation and skewing. Thus, if x1,x2 and x3 denotes the vertices of a triangle in I, any internal point z = (x, y) can bewritten as superposition:

z = x1 + β(x2 − x1) + ϕ(x3 − x1) = αx1 + βx2 + ϕx3 (A.4)

Thus α = 1 − (β + ϕ) given α + β + ϕ = 1, and in order to constrain z intothe triangle, α, β, ϕ ∈ [0, 1]. Consequently, the warping process is given by using therelative position within the triangle defined by α, β, ϕ on the triangle in I’: z’ = g(x)=αx′1 + βx′2 + ϕx′3.

Setting the three points of a triangle allows determining α, β, ϕ by solving thesystem of the two linear equations for a known point, z = (x, y):

α = 1 − (β + ϕ)

β =yx3 − x1y − x3y1 − xy3 + x1y3 + xy1

−y3x2 + x2y1 + x1y3 + x3y2 − x3y1 − x1y2,

ϕ =xy2 − xy1 − x1y2 − x2y + x2y1 + xy1

−y3x2 + x2y1 + x1y3 + x3y2 − x3y1 − x1y2(A.5)

In the piece-wise affine warping, even g(x) produces a continuous deformation, thedeformation field is not smooth, see Figure A.3. Another problem of piece-wise affinewarping is that it will not detect singularities in the deformation field in the form offolding triangles. This can however be detected by a simple test on the triangle facenormalized.

134 APPENDIX A. IMAGE WARPING

Figure A.3: Problem piece wise affine warping by warping the image withnot smooth for the field deformation.

Appendix B

Publications

B.1 Journals

1. Fadi Dornaika and Javier Orozco. Real time 3D face and facial feature track-ing. Journal of Real-Time Image Processing, 2(1). October 28, 2007.

2. Javier Orozco, F.Xavier Roca and Jordi Gonzàlez. Real-Time Gaze TrackingWith Appearance-Based Models. Machine Vision and Applications. April 04,2008.

3. Javier Orozco, F.Xavier Roca and Jordi Gonzàlez. Unconstrained Appearance-Based Tracking For 3D Head pose, Eyebrows, Lips, Eyelids and Irises In Real-Time. Submitted to International Journal of Computer Vision, October 17th2007.

4. Javier Orozco, F.Xavier Roca and Jordi Gonzàlez. Confident Subtle Fa-cial Expression Recognition By Case-Based Reasoning and Appearance-BasedTracking. Submitted to Pattern Analysis and Applications, Springer, 2009.

5. Javier Orozco, F.Xavier Roca and Jordi Gonzàlez. Beyond Cognitive Map-ping of Emotions and Mental States by Confidential Reasoning and Inference.Submitted to Cognitive Systems Research, April 2009.

B.2 Conferences

1. Javier Orozco, Ognjen Rudovic, F. Xavier Roca, Jordi Gonzàlez. ConfidenceAssessment On Eyelid and Eyebrow Expression Recognition. In 8th IEEE In-ternational Conference on Automatic Face and Gesture Recognition (FG’2008).Amsterdam, The Netherlands, September, 2008.

2. Murad Al Haj, Javier Orozco, Jordi Gonzàlez, Juan José Villanueva. Auto-matic Face and Facial Features Initialization for Robust and Accurate Tracking.In 19th International Conference on Pattern Recognition (ICPR’2008). Tampa,Florida, USA, December, 2008.

135

136 APPENDIX B. PUBLICATIONS

3. Javier Orozco, F. Alejandro García, Josep Lluis Arcos, Jordi Gonzàlez. Spatio-Temporal Reasoning for Reliable Facial Expression Interpretation. In 5th In-ternational Conference on Computer Vision Systems (ICVS’2007). Bielefeld,Germany, March, 2007.

4. Javier Orozco, F.X. Roca, Jordi Gonzàlez. Deterministic and StochasticMethods for Gaze Tracking in Real-Time. In 12th International Conferenceon Computer Analysis of Images and Patterns (CAIP’2007). Vienna, Austria,August, 2007.

5. Javier Orozco, Jordi Gonzàlez, Ignasi Rius, F.X. Roca. Hierarchical Eyelidand Face Tracking. In 3rd Iberian Conference on Pattern Recognition andImage Analysis (ibPRIA’2007). Girona, Spain, June, 2007.

6. Fadi Dornaika, Javier Orozco, Jordi Gonzàlez. Combined Head, Lips, Eye-brows, and Eyelids Tracking using Adaptive Appearance Models. In 4th Inter-national Workshop on Articulated Motion and Deformable Objects (AMDO’2006).Andratx, Mallorca, Spain, July, 2006.

7. F. Alejandro García, Javier Orozco, Jordi Gonzàlez, Josep Lluis Arcos. As-sessing Confidence in Cased Based Reuse Step. In Desè Congrés Interna-cional de l’Associació Catalana d’Inteloligència Artificial (CCIA’2007), Prin-cipat d’Andorra, 2007.

8. Javier Orozco, Pau Baiget, Jordi Gonzàlez, F. Xavier Roca. Eyelids andFace Tracking in Real-Time. In 6th International Conference on Visualization,Imaging, and Image Processing (VIIP 2006). Palma de Mallorca, Spain, 2006.

9. Alejandro García, Javier Orozco, Murad Al Haj, F. Xavier Roca, JordiGonzàlez. Assessing Confidence and Quality for Expression Analysis WithCBR. In Computer Vision: Advances in Research and Development (CVCRD’2007).Bellaterra, Spain, October, 2007.

10. Murad Al Haj, Javier Orozco, Ariel Amato, F. Xavier Roca, Jordi Gonzàlez.Finding Faces in Color Images trough Primitive Shape Features. In ComputerVision: Advances in Research and Development (CVCRD’2007). Bellaterra,Spain, October, 2007.

11. Javier Orozco, Jordi Gonzàlez, Ignasi Rius, Xavier Roca. Robust Real-TimeTracking for Facial Expression Analysis. In 1st CVC Workshop on ComputerVision: Progress of Research and Development (CVCRD’2006). Bellaterra,Spain, October, 2006.

12. Pau Baiget, Jordi Gonzàlez, Javier Orozco, Xavier Roca. Interpretationof Human Motion in Image Sequences using Situation Graph Trees. In 1stCVC Workshop on Computer Vision: Progress of Research and Development(CVCRD’2006). Bellaterra, Spain, October, 2006.

13. Javier Orozco, Jordi Gonzàlez, Pau Baiget, Juan José Villanueva. HumanEmotion Evaluation on Image Sequences. CogSys II Conference. RadboudUniversity Nijmegen, Netherlands, April, 2006.

B.2. CONFERENCES 137

14. Pau Baiget, Javier Orozco, F. Xavier Roca, Jordi Gonzàlez. Situation GraphTrees for Human Behavior Analysis. CogSys II Conference. Radboud Univer-sity Nijmegen, Netherlands, April, 2006.

Bibliography

[1] G. Clore A. Ortony and A. Collins. The cognitive structure of emotions. Cam-bridge University Press, 1988. (Cited on pages 37 and 111)

[2] P. Hallinan A. Yuille and D. Cohen. Feature extraction from faces using de-formable templates. International Journal of Computer Vision, 8(2):99–111,1992. (Cited on page 29)

[3] A. Aamodt and E. Plaza. Case-based reasoning: Foundational issues, method-ological variations, and system approaches. Artificial Intelligence Communica-tions, 7(1):39–52, 1994. (Cited on pages 6, 8, and 31)

[4] J. Ahlberg. An active model for facial feature tracking. EURASIP Journal onApplied Signal Processing, 2002(6):566–571, 2002. (Cited on pages 2 and 28)

[5] J. Aloimonos. Purposive and qualitative active vision. In Proceedings of Inter-national Conference on Pattern Recognition (ICPR’90), pages 346–360, 1990.(Cited on page 47)

[6] N. Andrei. An acceleration of gradient descent algorithm with backtrackingfor unconstrained optimization. Numerical Algorithms, 42(1):63–73, May 2006.(Cited on page 9)

[7] A. Ayesh. Perception and emotion based reasoning: A connectionist ap-proach. INFORMATICA, an International Journal of Computing and Infor-matics, 27(2):119–126, June 2003. (Cited on page 8)

[8] G. Ball and J. Breese. Emotion and personality in a conversational agent.In "In J. Cassell, J. Sullivan, S. Prevost, and E. Churchill", editors, EmbodiedConversational Agents, pages 189–219, MIT Press, Cambridge MA, 2000. (Cited

on page 37)

[9] S. Baron-Cohen. How to build a baby that can read minds: Cognitive mecha-nisms in mindreading. Current Psychology of Cognition, 13(5):513–552, 1948.(Cited on page 8)

[10] S. Baron-Cohen. Mind Reading: The Interactive Guide to Emotions - Version1.3. 2003. (Cited on pages 4, 14, 16, 75, 87, 96, 101, and 113)

[11] S. Baron-Cohen, S. Weelwright, and T. Jollife. Is there a ”language of the eyes”evidence from normal adults, and adults with autism or asperger syndrome?Visual Cognition, 4(3):311–331, 1995. (Cited on pages 5, 7, and 111)

139

140 BIBLIOGRAPHY

[12] C. Barron and I. Kakadiaris. Estimating anthropometry and pose from a singlecamera. In Proceedings of IEEE Conference on Computer Vision and PatternRecognition (CVPR’00), volume 1, pages 669 – 676, 2000. (Cited on page 28)

[13] M.S. Bartlett, G.C. Littlewort, M.G. Frank, C. Lainscsek, I. Fasel, and J. Movel-lan. Automatic recognition of facial actions in spontaneous expressions. Journalof Multimedia, 1(6):22–35, 2006. (Cited on pages 7, 31, and 37)

[14] J. Barzilai and J.M. Borwein. Two point step size gradient method. IMA J.Numerical Analysis, 8:141–148, 1988. (Cited on page 9)

[15] M. Bichsel and A. P. Pentland. Human face recognition and the face imagesets topology. Computer Vision, Graphics and Image Processing: Image Un-derstanding, 59(2):254–261, 1994. (Cited on page 18)

[16] M.J. Black and A.D. Jepson. Eigentracking: Robust matching and tracking ofarticulated objects using a view-based representation. International Journal ofComputer Vision, 26(1):63–84, 1998. (Cited on page 29)

[17] M. Borga. Learning Multidimensional Signal Processing. PhD thesis, LinkopingUniversity, Linkoping, Sweden, 1998. (Cited on page 5)

[18] B.E. Boser, I.M. Guyon, and V.N. Vapnik. A training algorithm for optimalmargin classifiers. In Fifth Annual Workshop on Computational Learning The-ory (COLT’92), 1992. (Cited on page 33)

[19] M.L. Cascia, S. Sclaroff, and V. Athitsos. Fast reliable head tracking undervarying illumination. an approach based on registration of texture-mapped 3dmodels. IEEE Transactions on Pattern Analysis and Machine Intelligence,22(4):322–336, May 2000. (Cited on pages 2 and 9)

[20] A. Cauchy. Méthodes générales pour la résolution des systèmes déquationssimultanées. C.R. Acad. Sci. Par., 25:536–538, 1847. (Cited on pages 9 and 44)

[21] W. Cheetam and J. Price. Measure of a solution accuracy in case-based reason-ing systems. In Funk, P.,González-Calero, P.,eds.: 7th European Conference onCase-Based Reasoning (ECCBR 2004), 3155:106–118, 2004. (Cited on pages 80

and 83)

[22] W. Cheetham. Case-Based Reasoning with Confidence. Book Advances in Case-Based Reasoning, 2000. (Cited on pages 8 and 80)

[23] J. Chua and P. Tischer. Determining the trustworthiness of a case-based rea-soning solution. In International Conference on Computational Intelligence forModeling, Control and Automation:7th European Conference on Case-BasedReasoning (ECCBR 2004), 3155(1740881885):12–14, 2004. (Cited on pages 80

and 81)

[24] C. Chuanga and F.Y. Shih. Recognizing facial action units using indepen-dent component analysis and support vector machine. Pattern Recognition,39(9):1795–1798, September 2006. (Cited on page 31)

BIBLIOGRAPHY 141

[25] D. Clark. Comparing huber’s m-estimator function with the mean square errorin backpropagation networks when the training data is noisy. The InformationScience Discussion Paper Series, 19, 2000. (Cited on pages 9 and 45)

[26] I. Cohen, N. Sebe, Y. Sun, M.S. Lew, and T.S. Huang. Evaluation of expres-sion recognition techniques. In International Conference on Image and VideoRetrieval (CIVR 2003), volume 2728, pages 184–195, 2003. (Cited on pages 13,

32, 34, and 90)

[27] G.F. Cooper and E. Herkovits. Ca bayesian method for induction of probabilisticnetworks from data. Machine Learning, 9(4):309–348, 1992. (Cited on page 104)

[28] T.F. Cootes and C.J.Taylor. Statistical Models of Appearance for ComputerVision. Imaging Science and Biomedical Engineering, University of Manchester,2004. (Cited on pages 1 and 47)

[29] T.F. Cootes, G.J. Edwards, and C.J. Taylor. Active appearance models. IEEETransactions on Pattern Analysis and Machine Intelligence, 23(6):681–684,2001. (Cited on pages 2 and 26)

[30] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham. Active shape models- their training and application. Computer Vision and Image Understanding,61(1):39–59, 1995. (Cited on pages 21, 24, 29, and 46)

[31] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995. (Cited on page 97)

[32] D. Cristinacce, T. Cootes, and I. Scott. A multi-stage approach to facial featuredetection. In Proceedings of the British Machine Vision Conference, 2004. (Cited

on page 48)

[33] Charles Darwin. The expression of the emotions in man and animals. KessingerPublishing, 2005. (Cited on pages 1 and 7)

[34] D.N. Davis and S.C. Lewis. Computational models of emotion for autonomyand reasoning. INFORMATICA, an International Journal of Computing andInformatics, 27(2):157–164, June 2003. (Cited on page 8)

[35] R. Debnath, N. Takahide, and H. Takahashi. A decision based one-against-onemethod for multi-class support vector machine. PAA, 7(2):164–175, 2004. (Cited

on page 96)

[36] J. Deigh. Emotions: The legacy of james and freud. International Journal ofPsycho-Analysis, 82:1247–1256, 2001. (Cited on page 111)

[37] S. Delany, P. Cunningham, and D. Doyle. Generating of classification confidencefor a case-based spam filter. In In 6th International Conference on Case-BasedReasoning (ICCBR 2005).Springer, volume 3620, pages 177–190, 2005. (Cited

on page 6)

[38] J. Deng and F. Lai. Region-based template deformation and masking for eye-feature extraction and description. Pattern Recognition, 30(3):403–419, 1997.(Cited on pages 2 and 29)

142 BIBLIOGRAPHY

[39] F. Dornaika and J. Ahlberg. Face model adaptation for tracking and activeappearance model training. British Machine Vision Conference, 2003. (Cited on

pages 28 and 41)

[40] F. Dornaika and J. Ahlberg. Fitting 3d face models for tracking and activeappearance model training. Image and Vision Computing, 24(9):1010–1024,2006. (Cited on page 2)

[41] F. Dornaika and F. Davoine. "head and facial animation tracking usingappearance-adaptive models and particle filters". (Cited on page 25)

[42] F. Dornaika and F. Davoine. Online appearance based face and facial featuretracking. CNRS HEUDIASYC, 1:1–4, 2004. (Cited on page 26)

[43] F. Dornaika and J. Orozco. Real time 3d face and facial feature tracking.Journal Journal of Real-Time Image Processing, 2(1), October 2007. (Cited on

page 54)

[44] A. Egges, S. Kshirsagar, and N. Thalmann. A model for personality and emotionsimulation. KES, 2003. (Cited on page 37)

[45] S. Kshirsagar A. Egges and N. Magnenat-Thalmann. A model for personalityand emotion simulation. 2003. (Cited on page 3)

[46] P. Ekman. Emotions Revealed. Times Books (Henry Holt and Company) LLC,2003. (Cited on pages 1, 5, 6, and 37)

[47] P. Ekman and W. V. Friesen. Pictures of facial affect. in Human InteractionLaboratory, San Francisco, CA: Univ. California Medical Center, 1976. (Cited

on page 94)

[48] P. Ekman and W. V. Friesen. Facial Action Coding System: A Technique forthe Measurement of Facial Movement. Consulting Psychologists Press, PaloAlto, 1978. (Cited on pages 4, 5, 7, 11, 13, 59, 60, 80, and 88)

[49] R. el Kaliouby and P. Robinson. Mind reading machines: Automated inferenceof cognitive mental states from video. In The IEEE International Conferenceon Systems, Man and Cybernetics, 2004. (Cited on pages 5, 13, 32, 37, and 119)

[50] Face and Gesture Recognition Working Group. Fgnet database. http://www-prima.inrialpes.fr/FGnet. (Cited on page 49)

[51] B. Fasel. Head-pose invariant facial expression recognition using convolutionalneural networks. In Fourth IEEE International Conference on Multimodal In-terfaces, 2002. (Cited on pages 5 and 37)

[52] B. Fasel and J. Luettin. Automatic facial expression analysis: A survey. PatternRecognition, 36(2003):259–275, 2003. (Cited on pages 5, 6, 13, and 31)

[53] D. Forsyth, P. Torr, and A. Zisserman. Relating personality and behavior:Posture and gestures. In European Conference of Computer Vision ECCV’2008,volume 5304, pages 168–181, 2008. (Cited on page 35)

h

BIBLIOGRAPHY 143

[54] G.E. Forsythe and T.S. Motzkin. Asymptotic properties of the optimum gradi-ent method. Bull. American Society, 57:183, 1951. (Cited on page 9)

[55] W. Friesen and P. Ekman. Emfacs-7: emotional facial action coding system.1983. (Cited on page 12)

[56] J. Breese G. Ball. Relating personality and behavior: Posture and gestures.In International Workshop of Artificial Intelligence (IWAI’99), 1999. (Cited on

page 34)

[57] S.B. Gokturk, J.Y. Bouguet, and R. Grzeszczuk. A data-driven model formonocular face tracking. In Eighth International Conference on Computer Vi-sion (ICCV’01), volume 2, page 701, 2001. (Cited on page 2)

[58] L. Gu and T. Kanade. A generative shape regularization model for robust facealignment. In 10th European Conference on Computer Vision (ECCV’2008),volume 5302, pages 413–426, 2008. (Cited on pages 22 and 24)

[59] H. Zha H. Liu, Y. Wu. Eye states detection from color facial image sequence.Proc. 2nd Int. Conf. Image and Graphics (ICIG02), 4875:693–698, 2002. (Cited

on pages 29 and 30)

[60] Y. Zhang H. Tan. Detecting eye blink states by tracking iris and eyelids. PatternRecognition Letters, 2005. (Cited on pages 3, 60, and 61)

[61] H. Hotelling. Relations between two sets of variates. Biometrika, 28:312–377,1936. (Cited on page 5)

[62] P.J. Huber. Robust estimation of a location parameter. The Annals of Mathe-matical Statistics, 35:73–101, 1964. (Cited on pages 9 and 45)

[63] A. Garg L. Chen I. Cohen, N. Sebe and T.S. Huang. Facial expression recog-nition from video sequences: Temporal and static modelling. Computer Visionand Image Understanding, 91(1):160–187, 2003. (Cited on pages 3, 4, 7, 37, 80,

108, and 111)

[64] A.D. Jepson, D.J. Fleet, and T.F. El-Maraghi. Robust on-line appearancemodels for visual tracking. IEEE Transactions on Pattern Analysis and MachineIntelligence, 25(10):1296–1311, 2003. (Cited on page 43)

[65] B. Juang, S.E. Levinson, and M.M. Sondhi. ML estimation for multivariatemixture observations of Markov chains. 1985. (Cited on page 40)

[66] T. Kanade, J.F. Cohn, and Y. Tian. Comprehensive database for facial ex-pression analysis. In Fourth IEEE International Conference on Automatic Faceand Gesture Recognition (FG’00), volume 5304, pages 168–181, 2000. (Cited on

page 13)

[67] A. Khanum, M. Zubair Shafiq, and E. Muhammad. Cbr: Fuzzified case re-trieval approach for facial expression recognition. In Artificial Intelligence andApplications (AIA’2007), 2007. (Cited on pages 6 and 37)

144 BIBLIOGRAPHY

[68] A. Khanum and M. Zubair. Facial expression recognition system using casebased reasoning. In International Conference on Advances in Space Technolo-gies, 2006. (Cited on pages 31 and 129)

[69] S. Ho Lee and S. Choi. Two-dimensional canonical correlation analysis. IEEESignal Processing Letters, 14(10):735–738, October 2007. (Cited on pages 5, 73,

and 74)

[70] Z. Lei, Q. Bai, R. He, and S. Z. Li. Face shape recovery from a single imageusing cca mapping between tensor spaces. In IEEE International Conferenceon Computer Vision and Pattern Recognition, (CVPR2008), 2008. (Cited on

page 5)

[71] K. Levenberg. A method for the solution of certain problems in least squares.Quart. Appl. Math., 2:164–168, 1944. (Cited on pages 9, 21, and 44)

[72] S.Z. Li, L. Zhu, Z.Q. Zhang, A. Blake, H. Zhang, and H. Shum. Statisticallearning of multi-view face detection. 2002. (Cited on page 20)

[73] W. Liejun, Q. Xizhong, and Z. Taiyi. Facial expression recognition using im-proved support vector machine by modifying kernels. Information TechnologyJournal, 8:595–599, 2009. (Cited on page 31)

[74] R. Lienhart and J. Maydt. An extended set of haar-like features for rapidobject detection. In IEEE Conf. on Image Processing (ICIP’02), 2002. (Cited

on page 21)

[75] D.H. Liu, K.M. Lam, and L.S. Shen. Optimal sampling of gabor features forface recognition. PRL, 25(2):267–276, January 2004. (Cited on page 25)

[76] H. Liu, Y. Wu, and H. Zha. Eye states detection from color facial image se-quence. In SPIE International Conference on Image and Graphics, volume4875, pages 693–698, 2002. (Cited on page 2)

[77] C. Lainscsek I. Fasel M. Bartlett, G. Littlewort and J. Movellan. Machinelearning methods or fully automatic recognition of facial expressions and facialactions. In in IEEE Int. Conference on Systems, Man and Cybernetics, 2004.(Cited on pages 5, 13, and 111)

[78] D. Marquardt. An algorithm for least-squares estimation of non-linear parame-ters. SIAM J. Appl. Math., 11:431–441, 1963. (Cited on pages 9, 21, and 44)

[79] K. Matsuno, C.W. Lee, S. Kimura, and S. Tsuji. Automatic recognition of hu-man facial expressions. In Fifth International Conference on Computer Vision(ICCV’95), 1995. (Cited on page 26)

[80] I. Matthews and S. Baker. Active appearance models revisited. Technical ReportCMU-RI-TR-03-02, The Robotics Institute, Carnegie Mellon University, 2002.(Cited on page 28)

[81] Domingo Mery. Visión por Computador. Departamento de Ciencia de la Com-putación Universidad Católica de Chile, 2004. (Cited on page 131)

BIBLIOGRAPHY 145

[82] D. Metaxas, S. Venkataraman, and C. Vogler. Image-based stress recognitionusing a model-based dynamic face tracking system. In Workshop on DynamicData Driven Applications Systems, volume 3038, pages 813–821, May 2004.(Cited on pages 96 and 97)

[83] P. Michel and R.E. Kaliouby. Real time facial expression recognition in videousing support vector machines. In 5th international conference on Multimodalinterfaces (ICMI’03), pages 258–264, New York, NY, USA, 2003. (Cited on

page 31)

[84] S. Milborrow and F. Nicolls. Locating facial features with an extended activeshape model. In ECCV, 2008. (Cited on pages 23, 24, 74, and 75)

[85] T. Moriyama and T. Kanade et al. Automatic recognition of eye blinkingin spontaneosly ocurring behavior. In International Conference on PatternRecognition, 2002. (Cited on page 30)

[86] T. Moriyama, J. Xiao, J. Cohn, and T. Kanade. Meticulously detailed eyemodel and its application to analysis of facial image. IEEE Transactions onPattern Analysis and Machine Intelligence, 28(5):738–752, May 2006. (Cited on

pages 2, 3, and 30)

[87] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learn-ing. PhD thesis, U.C. Berkeley, 2002. (Cited on pages 8, 38, 106, 118, and 122)

[88] J. Nocedal and S. Wright. Numerical Optimization. Springer, New York, 1999.(Cited on pages 44 and 56)

[89] J. Orozco, F.A. García, J.L. Arcos, and J. Gonzàlez. Spatio-temporal reasoningfor reliable facial expression interpretation. In Proceedings of the 5th Interna-tional Conference on Computer Vision Systems (ICVS’2007), ISBN 978-3-00-020933-8, Bielefeld, Germany, March 2007. Applied Computer Science Group,Bielefeld University, Germany. (Cited on page 8)

[90] J. Orozco, J. Gonzàlez, I. Rius, and F.X. Roca. Hierarchical eyelid and facetracking. In Proceedings of the 3rd Iberian Conference on Pattern Recogni-tion and Image Analysis (ibPRIA’2007), volume 4477 of LNCS, pages 499–506,Girona, Spain, June 6-8 2007. Springer. (Cited on pages 6 and 54)

[91] J. Orozco, F.X. Roca, and J. Gonzàlez. Deterministic and stochastic methodsfor gaze tracking in real-time. In Proceedings of the 12th International Con-ference on Computer Analysis of Images and Patterns (CAIP’2007), Vienna,Austria, August 2006. (Cited on pages 48 and 54)

[92] J. Orozco, F.X. Roca, and Jordi Gonzàlez. Real-time gaze tracking withappearance-based models. Journal Machine Vision and Applications, 2008.(Cited on page 8)

[93] G. Pajares and J.M. de la Cruz. Fuzzy cognitive maps for stereo-vision match-ing. Pattern Recognition, 39(11), November 2006. (Cited on page 8)

146 BIBLIOGRAPHY

[94] M. Pantic and I. Patras. Dynamics of facial expression: Recognition of facialactions and their temporal segments from face profile image sequences. IEEETransactions on Systems, Man, and Cybernetics, Part B, 36(2):433–449, 2006.(Cited on page 11)

[95] M. Pantic and L.J.M. Rothkrantz. "automatic analysis of facial expressions:The state of the art". IEEE Transactions on Pattern Analysis and MachineIntelligence, 22(12):1424–1445, 2000. (Cited on pages 5 and 13)

[96] M. Pantic and L.J.M. Rothkrantz. Case-based reasoning for user-profiled recog-nition of emotions from face images. In IEEE International Conference on Mul-timedia and Expo (ICME’04), volume 1, pages 391–394, June 2004. (Cited on

page 6)

[97] M. Pantic and L.J.M. Rothkrantz. Case-based reasoning for user-profiled recog-nition of emotions from face images. In IEEE International Conference on Mul-timedia and Expo (ICME’04), volume 1, pages 391–394, June 2004. (Cited on

pages 7, 31, 37, 111, and 129)

[98] M. Pantic, M.F. Valstar, R. Rademaker, and L. Maat. Web-based database forfacial expression analysis. In IEEE International Conference of Multmedia andExpo (ICME’05), Amsterdam, The Netherlands, July 2005. (Cited on pages 14,

15, 16, 87, 91, 92, 96, and 101)

[99] J. Pearl. Probability Reasoning in Intelligent Systems: Networks of PlausibleInference. 1988. (Cited on page 35)

[100] L.R. Rabiner. A tutorial on hidden markov models and selected applicationsin speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989. (Cited on

page 38)

[101] H.A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection.IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23–38,January 1998. (Cited on pages 19 and 23)

[102] M. Rydfalk. Candide, a parameterized face. Technical report, Dept. of ElectricalEngineering, Linkoping University, Sweden, 1987. (Cited on page 46)

[103] A. Basu S. Bernogger, L. Yin and A. Pinz. Eye tracking and animation formpeg-4 coding. IEEE Pattern Recognition, 1998. Proceedings, Fourteenth In-ternational Conference, 2:1281–1284, 1998. (Cited on pages 3 and 30)

[104] A. Rosenfeld S. Sirhey and Z. Duric. A method of detecting and trackingirises and eyelids in video. In International Conference on Pattern Recognition,35(6):1389–1401, 2002. (Cited on pages 3, 29, and 30)

[105] Jean-Paul Sartre. Theory of the Emotions. Routledge (UK)Academic Press,2002. (Cited on pages 37 and 111)

[106] R.C. Schank. Memory based expert systems. AFOSR.TR. 84-0814. Comp.Science Dept., Yale University, 1984., August 1984. (Cited on page 32)

BIBLIOGRAPHY 147

[107] H. Schneiderman and T. Kanade. A statistical method for 3d object detectionapplied to faces and cars. 2000. (Cited on page 20)

[108] W.J. Schroeder and M.S. Shephard. Dgeometry-based fully automatic meshgeneration and delaunay triangulation. International Journal for NumericalMethods in Engineering, 26:2503–2515, 1988. (Cited on page 133)

[109] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978. (Cited on page 103)

[110] A. Shashua and T. Hazan. Non-negative tensor factorization with applicationsto statistics and computer vision. In 22nd International Conference on MachineLearning ICML’05, volume 1, pages 792–799, 2005. (Cited on page 25)

[111] J. Shi and C. Tomasi. Good features to track. In IEEE Conference on ComputerVision and Pattern Recognition, volume 1, pages 593–600, 1994. (Cited on page 1)

[112] H.J.M. Steeneken and J.H.L. Hansen. Speech under stress conditions: Overviewof the effect on speech production and on system performance. In IEEE In-ternational Conference on Acoustics, Speech and Signal Processing (ICSSP’99),volume 4, pages 2079 – 2082, May 1999. (Cited on page 96)

[113] K.-K. Sung and T. Poggio. Example-based learning for view-based human facedetection. 1998. (Cited on page 19)

[114] M. Suwa, N. Sugie, and K. Fujimora. A preliminary note on pattern recognitionof human emotional expression. 1978. (Cited on page 11)

[115] Y. Tian, T. Kanade, and J.F. Cohn. Dual-state parametric eye tracking. InProceedings of the 4th IEEE International Conference on Automatic Face andGesture Recognition (FG’00), pages 110–115, 2000. (Cited on page 2)

[116] E. C. Tolman. Cognitive maps in rats and man. Psychological Review, 55:189–208, 1948. (Cited on page 7)

[117] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience. 1998. (Cited on

page 96)

[118] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles:Tensorfaces. In 10th European Conference on Computer Vision (ECCV’2002),volume 1, pages 447–460, 2002. (Cited on page 24)

[119] M. A. O. Vasilescu and D. Terzopoulos. Multilinear independent componentsanalysis. In IEEE Computer Society Conference on Computer Vision and Pat-tern Recognition (CVPR’05), volume 1, pages 547–553, 2005. (Cited on page 25)

[120] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simplefeatures. In IEEE Conference on Computer Vision and Pattern Recognition,volume 1, pages 511–518, 2001. (Cited on pages 1, 20, and 74)

[121] F. Wallhoff. Fgnet facial expressions and emotion database.www.mmk.ei.tum.de/~waf/fgnet/feedtum.html. (Cited on pages 14, 16,

49, 62, 87, 88, 92, 96, and 101)

www.mmk.ei.tum.de/~waf/fgnet/feedtum.html

148 BIBLIOGRAPHY

[122] J. Wei, Z. Jian, S. Ting-zhi, and W. Xiao-hua. A novel facial features extractionalgorithm using gabor wavelets. In Congress on Image and Signal Processing,2008. (Cited on pages 27 and 28)

[123] J. Weston and C. Watkins. Multi-class support vector machines. 1998. (Cited

on page 96)

[124] J. Wiggins. The five-factor model of personality: Theoretical perspectives. TheGuilford Press, 1996. (Cited on page 37)

[125] Y. Wu, H. Liu, and H. Zha. A new method of detecting human eyelids basedon deformable templates. In Systems, Man and Cybernetics, 2004 IEEE Inter-national Conference, volume 1, pages 604–609, 2004. (Cited on pages 2, 3, 29, 30,

31, 59, 60, and 61)

[126] T. Xiang and S. Gong. Spectral clustering with eigenvector selection. PatternRecognition, 41(3):1012–1029, March 2008. (Cited on page 104)

[127] J.F Cohn Y.L Tian, T Kanade. Recognizing action units for facial expressionanalysis. IEEE Transactions on Pattern Analysis and Machine Intelligence,23(2):97–114, 2001. (Cited on pages 7, 29, and 111)

[128] M.S. Yu and S.F. Li. Tracking facial feature points with statistical modelsand gabor wavelet. In Fifth Mexican International Conference on ArtificialIntelligence (MICAI’06), 2006. (Cited on page 27)

[129] Z. Zeng, M. Pantic, G.I. Roisman, and T.S. Huang. A survey of affect recogni-tion methods: Audio, visual, and spontaneous expressions. IEEE Transactionson Pattern Analysis and Machine Intelligence, 31(1):39–58, January 2009. (Cited

on page 31)

[130] W. Zheng, X. Zhou, C. Zou, and L. Zhao. Facial expression recognition usingkernel canonical correlation analysis (kcca). IEEE Transactions On NeuralNetworks, 17(1):233–238, January 2006. (Cited on pages 5, 37, and 94)

[131] S. Zhou, R. Chellappa, and B. Mogghaddam. Visual tracking and recognitionusing appearance-adaptive models in particle filters. IEEE Transactions onImage Processing, 13(11):1491–1506, 2004. (Cited on page 9)

Documents

Universitat Autònoma de Barcelona · Universitat Autònoma de Barcelona ... a wise combination of deformable models captures rigid and non-rigid movements of diﬀerent kinematics;