Numéro d’ordre : 2010-ISAL-0107 Année 2010
Institut National des Sciences Appliquées de Lyon
Laboratoire d’InfoRmatique en Image et Systèmes d’information
École Doctorale Informatique et Mathématiques de Lyon
Thèse de l’Université de Lyon
Présentée en vue d’obtenir le grade de Docteur, spécialité Informatique
par
Yi Ji
Object classification in images and videos
Application to facial expressions
Thèse soutenue le 3 Décembre 2010 devant le jury composé de :
M. Patrick Lambert, Professeur, Polytech Annecy-Chambéry, Rapporteur
M. William Puech, Professeur, LIRMM, Rapporteur
M. Jean-Luc Dugelay, Professeur, EURECOM, Examinateur
M. Fabrice Mériaudeau, Professeur, Laboratoire Le2i, Examinateur
M. Atilla Baskurt, Professeur, INSA Lyon, Directeur
M. Khalid Idrissi, Maître de Conférences, INSA Lyon, Co-encadrant
Laboratoire d’InfoRmatique en Image et Systèmes d’information
UMR 5205 CNRS - INSA de Lyon - Bât. Jules Verne
69621 Villeurbanne cedex - France
Tel: +33 (0)4 72 43 60 97 - Fax: +33 (0)4 72 43 71 17
To my family
Acknowledgments
First of all, I would like to thank my two advisers Pr. Atilla Baskurt and Dr. Khalid Idrissi
for their guidance, encouragement and patience. I am very thankful to Pr. Baskurt for the
chance to do research at INSA de Lyon. To Dr. Idrissi, I am very grateful for his invaluable
advice and guidance throughout my study.
I would like to thank Pr. Patrick Lambert and Pr. William Puech for spending their
precious time reviewing the manuscript of this dissertation and for their much appreciated
advice. I am also very thankful to Pr. Jean-Luc Dugelay and Pr. Fabrice Mériaudeau for
agreeing to be examiners in the jury.
I thank the China Scholarship Council for the PhD grant under the cooperation program
between CSC and INSA.
I also would like to thank all the members of the IMAGINE group at LIRIS, with special
thanks to my friends who shared the same office: Cagatay, Imane, Phuong, Kai and Yuyao.
I thank Pr. Mohan S. Kankanhalli of the National University of Singapore, who super-
vised my master’s thesis and inspired my interest in multimedia and computer vision.
My final thanks go to my parents and to my friends in Lyon, Shanghai and Suzhou for
their support.
Abstract
In this dissertation, we address the problem of generative object categorization in computer vision. We then apply our approach to the classification of facial expressions.
Humans can solve the problem of object recognition and categorization quickly, efficiently and almost effortlessly. But for a corresponding algorithm from computer vision, image analysis and pattern recognition, it is still a very difficult task. For objects in the same category, variations in pose, lighting, scale and affine changes, together with intra-class differences, make assigning the object in an image to a certain category an extreme and still unsolved challenge.
In our proposal, we are inspired by the Hierarchical Dirichlet Processes method to generate intermediate mixture components that improve recognition and categorization, as it shares two aspects with topic modelling for documents: its nonparametric and its hierarchical nature. After we obtain the set of components, instead of boosting the features as Viola and Jones, we boost the components in the intermediate layer to find the most distinctive ones. We consider that these components are more important for object class recognition than others and use them to improve the classification. Our target is to achieve the correct classification of objects, and also to discover the essential latent themes shared across multiple categories of objects and the particular distribution of the latent themes for a specific category.
Many approaches for facial expression recognition (FER) systems have been proposed to improve Human-Computer Interaction (HCI). Ekman and Friesen defined six basic emotions: Anger, Disgust, Fear, Happiness, Sadness and Surprise. The task of interpreting these universal expressions, whether by a recognition algorithm or even manually by human beings, is difficult because of individual differences and cultural background. FER is still one of the most active fields in computer vision and has attracted many proposals over the last several decades.
Regarding the relation between basic expressions and the corresponding facial deformation models, we propose two new textons, VTB and moments on the spatiotemporal plane, to describe the transformation of the human face during facial expressions. These descriptors aim to capture both general shape changes and motion texture details. The dynamic deformation of facial components is thus captured by modelling the temporal behaviour of facial expressions. Finally, an SVM-based system is used to efficiently recognize the expression for a single image in a sequence; then, the weighted probabilities of all the frames are used to predict the class of the current sequence. My thesis includes finding the proper methods to describe the static and dynamic aspects of facial expressions. I also aim to design new descriptors to denote the characteristics of facial muscle movements and, furthermore, to identify the category of emotion.
Keywords: Object Recognition, Facial Expression Recognition, SVM, Bayesian Model, AdaBoost, Spatiotemporal Information, Moments, SIFT.
Contents
Acknowledgments v
Abstract vii
Contents ix
List of Figures xiii
List of Tables xvii
List of Algorithms xix
1 Introduction 1
1.1 General Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Object Categorization and Methods . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Facial Expression Classification and Related Application . . . . . . . . . . 4
1.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Survey on Object Recognition and Facial Expression Recognition 7
2.1 Object Categorization: The State of the Art . . . . . . . . . . . . . . . . . . 9
2.1.1 Problem Statement: Category Level Recognition . . . . . . . . . . . 9
2.1.2 Popular Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Classic Models for Object Categorization . . . . . . . . . . . . . . 12
2.1.4 Recent Work and Summary . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Literature Review: Facial Expression Recognition . . . . . . . . . . . . . . 23
2.2.1 Basic Facial Expressions and Facial Actions . . . . . . . . . . . . . 23
2.2.2 Three Stages in Automatic FER System . . . . . . . . . . . . . . . . 23
2.2.3 Recent Work and Trend . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Feature Representation for Objects and Faces 31
3.1 Overview of Local Feature Descriptors . . . . . . . . . . . . . . . . . . . . 33
3.2 The Detection of Regions of Interest . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 DOG for Key Point Detection . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Face Organ Location . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Features for Static Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 SIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3 Shape Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.4 LBP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.5 Gabor features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Features for Dynamic Schemes . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.1 Introduction to Temporal Texture Analysis . . . . . . . . . . . . . . 47
3.4.2 Dynamic Deformation for Facial Expression . . . . . . . . . . . . . 47
3.4.3 VTB Descriptor for Dynamic Feature . . . . . . . . . . . . . . . . . 49
3.4.4 Moments on Spatiotemporal Plane . . . . . . . . . . . . . . . . . . 52
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 Recognition and Classification Methods 55
4.1 Overview of Machine Learning Methods . . . . . . . . . . . . . . . . . . . 57
4.2 Discriminative Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.2 AdaBoost Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.3 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Generative Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.2 BoW and Naïve Bayes Implementation . . . . . . . . . . . . . . . 64
4.3.3 Hierarchical Generative Model . . . . . . . . . . . . . . . . . . . . . 66
4.3.4 Construction of Hierarchical Dirichlet Processes . . . . . . . . . . . 66
4.3.5 Inference and sampling . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 Hybrid System: Integrated Boosting and HDP . . . . . . . . . . . . . . . . 70
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Testing and Results 75
5.1 General Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.1 Datasets of Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.2 Classification Using HDP with Heterogeneous Features . . . . . . 77
5.1.3 Classification Using Boosting within Hierarchical Bayesian Model 81
5.2 Facial Expression Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.1 Face databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.2 Overview of Our System . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.3 Image Based Classification Using Static Features . . . . . . . . . 89
5.2.4 Image Based Classification Using Static and Dynamic Features . . 92
5.2.5 Classification for Sequences . . . . . . . . . . . . . . . . . . . . . . . 95
6 Conclusion 99
6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.1.1 Object Categorization Using Boosting Within Hierarchical Bayesian Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.1.2 Automatic Facial Expression Recognition . . . . . . . . . . . . . . . 100
6.2 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.1 Object Similarities and Polymorphism . . . . . . . . . . . . . . . . . 101
6.2.2 Spontaneous Facial Expression Understanding . . . . . . . . . . . 102
7 Résumé en Français 103
7.1 Sommaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.2.1 Détection des organes faciaux . . . . . . . . . . . . . . . . . . . . . 105
7.2.2 Descriptions des expressions . . . . . . . . . . . . . . . . . . . . . . 106
7.2.3 Classification par HDP (Hierarchical Dirichlet Process) . . . . . . . 111
7.2.4 Processus de Dirichlet Hiérarchique pour la classification . . . . . 112
7.3 Résultats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.3.1 Validation de la classification d’objets par HDP . . . . . . . . . . . 114
7.3.2 Validation des descripteurs proposés pour la classification des expressions par SVM . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.3.2.1 Classification des images d’expression par descripteurs statiques . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.3.2.2 Classification des images d’expression par descripteurs dynamiques . . . . . . . . . . . . . . . . . . . . . . . . 119
7.3.2.3 Classification de séquences d’expressions par descripteurs dynamiques . . . . . . . . . . . . . . . . . . . . . . 120
Bibliography 123
Author’s Publications 137
List of Figures
1.1 The results from http://www.picsearch.com when searching for "cat". . . 2
1.2 Photographs of facial expressions from Paul Ekman[EF76]. . . . . . . . . . 3
2.1 Generic OR and Specific OR . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Samples from Caltech datasets . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 20 categories in PASCAL Visual Object Classes Challenge 2010 . . . . . . 12
2.4 Construction of a three-dimensional array of solid objects from a single two-dimensional photograph [Rob63] . . . . . . . . . . . . . . . . . . . . . 13
2.5 BoW for Object Categorization (ICCV 2009 short course by L. Fei-Fei) . . 14
2.6 Examples of ’key features’ that are detected by: the scale invariant Harris detector, the affine invariant Harris detector, and the DoG/SIFT detector/descriptor [Pin05]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.7 The most likely words (shown by 5 examples in a row) for four learned topics (1): (a) Faces, (b) Motorbikes, (c) Airplanes, (d) Cars. [SRE∗05a]. . . 17
2.8 Learning relevant intermediate representations of scenes automatically and without supervision, by [LP05] . . . . . . . . . . . . . . . . . . . . . . 18
2.9 Graphical geometric models by [CL06] . . . . . . . . . . . . . . . . . . . . 19
2.10 Schematic depiction of the Adaboost detection cascade by [VJ01] . . . . . 21
2.11 Pyramid Match by Grauman and Darrel [GD05a] . . . . . . . . . . . . . . 22
2.12 Latent components generated by [LJ08] . . . . . . . . . . . . . . . . . . . . 22
2.13 Six basic expressions, from left to right: Anger, Disgust, Fear, Happiness, Sadness and Surprise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.14 Examples of facial action units (AUs) and their combinations defined in FACS from [PB07] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.15 The main blocks in facial expression recognition [CB03] . . . . . . . . . . 25
2.16 Rectangle feature by [KJC08] . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.17 Three video examples associated with a 3D tracker, with different degrees of motion. [RD09] 29
3.1 Above: [PB07] Tracking the facial characteristic point. Below: [HPA04] LBP histograms from whole face and divided blocks. . . . . . . . . . . . 33
3.2 The construction of Difference of Gaussian(DOG). [Low99a] . . . . . . . . 35
3.3 Facial organ location: step by step. . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Sample images, from left to right: Anger, Disgust, Fear, Happiness, Sadness and Surprise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5 The histogram computation for SIFT descriptor . . . . . . . . . . . . . . . 42
3.6 Left and center: samples of edge points for two similar shapes. Right: the log-polar histogram bins used to compute the shape context. . . . . . 43
3.7 Shape context as a discriminative descriptor . . . . . . . . . . . . . . . . . 43
3.8 An example of LBP computing [HPA04]. . . . . . . . . . . . . . . . . . . . 44
3.9 Examples of texture primitives which can be detected by LBP [HPA04]. . 44
3.10 (Left) A face image. (Center) Identified changed parts. (Right) Maskedand divided in 8 blocks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.11 The face image in Fig. 3.10 is represented by a concatenation of 8 local LBP histograms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.12 The cues of facial movements [Bas79]. . . . . . . . . . . . . . . . . . . . . . 47
3.13 Left: XY(front face); Center: YT slice; Right: XT slice. . . . . . . . . . . . . 48
3.14 The dynamic deformation for different expressions on vertical slices . . . 49
3.15 3 blocks on VT plane. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.16 VTB computing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1 Separating hyperplanes under linear case. [Bur98]. . . . . . . . . . . . . . 61
4.2 The examples of SVM kernels by pictures[Bur98]. . . . . . . . . . . . . . . 63
4.3 HDP model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 The mixture of components. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.5 Hybrid approach for learning . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1 Samples from four categories we used. . . . . . . . . . . . . . . . . . . . . 78
5.2 Component No.17 in motorbike category. . . . . . . . . . . . . . . . . . . . 80
5.3 Convergence vs. Iteration times. . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 The distinctive components in large image. . . . . . . . . . . . . . . . . . . 82
5.5 Performance vs the size of component set. . . . . . . . . . . . . . . . . . . 83
5.6 The sample images from JAFFE database. From left to right: Angry, Disgust, Fear, Happiness, Sadness, Surprise and Neutral. . . . . . . . . . . . 85
5.7 Sample images and location procedure, from left to right: Anger, Disgust, Fear, Happiness, Sadness and Surprise. . . . . . . . . . . . . . . . . . 86
5.8 Sample frames from MMI database . . . . . . . . . . . . . . . . . . . . . . 87
5.9 Sample frames from Cohn-Kanade database . . . . . . . . . . . . . . . . . 88
5.10 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.11 One sequence of happiness and corresponding plot for six expressions. . 94
7.1 Détection des organes faciaux pour les 6 expressions . . . . . . . . . . . . 106
7.2 Les mouvements faciaux décrits par [Bas79] . . . . . . . . . . . . . . . . . 107
7.3 10 blocs correspondant à une image de sourire . . . . . . . . . . . . . . . 107
7.4 Gauche : XY(vue de face) ; Milieu : la tranche YT ; Droite : La tranche XT . 108
7.5 Les déformations de la tranche YT pour différentes expressions . . . . . . 108
7.6 Exemple de tranches dans le plan YT. . . . . . . . . . . . . . . . . . . . . . 109
7.7 Calcul de VTB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.8 Modèle graphique de LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.9 Modèle graphique de HDP . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.10 Catégories de la base Caltech utilisées pour nos tests. . . . . . . . . . . . . 114
7.11 Convergence et nombre d’itérations. . . . . . . . . . . . . . . . . . . . . . . 116
7.12 Les principales composantes caractéristiques du léopard. . . . . . . . . . . 116
7.13 Performance en fonction du nombre de composantes. . . . . . . . . . . . . 117
List of Tables
5.1 Classification results; in parentheses is the value K for the number of components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 The confusion matrix for best-T. . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3 Performance comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 Recognition performances by SVM on different resolutions (%) . . . . . 89
5.5 Average recognition performances on JAFFE database (%) . . . . . . . . . 89
5.6 Recognition performances by boosted-SVM for 64 × 64 resolution (%) . . 90
5.7 Confusion matrix by boosted-SVM for 64 × 64 resolution for 6-class recognition (%) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.8 Recognition performances for 64 × 64 resolution on sub-sets (%) . . . . . 91
5.9 Confusion on MMI database for 6-class recognition (%) . . . . . . . . . . . 91
5.10 Recognition performances comparisons on Cohn-Kanade Database (%) . 92
5.11 Recognition performances comparisons for image-based methods (%) . . 92
5.12 Confusion matrix of Moments on MMI database (%) . . . . . . . . . . . . 94
5.13 Confusion matrix of Ours: N(µ, σ²/2) (%) . . . . . . . . . . . . . . . . . . 96
5.14 Recognition performances comparisons for sequence-based methods (%) 96
7.1 Résultats de la classification (nombre de composantes entre parenthèses). 115
7.2 Comparaison des performances . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.3 Performance moyenne sur la base JAFFE (%) . . . . . . . . . . . . . . . . . 118
7.4 Performance moyenne sur la base MMI (%) . . . . . . . . . . . . . . . . . . 118
7.5 Comparaison des performances sur la base Cohn-Kanade (%) . . . . . . . 119
7.6 Comparaison avec descripteurs spatio-temporels . . . . . . . . . . . . . . 120
7.7 Matrice de confusion sur la base MMI (%) . . . . . . . . . . . . . . . . . . 120
7.8 Comparatif pour la classification de séquences . . . . . . . . . . . . . . . . 121
List of Algorithms
3.1 Detect facial organs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 HDP to build components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 AdaBoost for Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1 Sequence level classification using weighted sum . . . . . . . . . . . . . . . 95
7.1 Classification d’une séquence par somme pondérée . . . . . . . . . . . . . . 121
Chapter 1
Introduction
Contents
1.1 General Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Object Categorization and Methods . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Facial Expression Classification and Related Application . . . . . . . . 4
1.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1 General Context
Nowadays, various electronic devices such as digital cameras, smart phones, or even
handheld game consoles can take a digital photo or shoot a short video. More and
more images and videos are present in our daily life. The Human Visual System
perceives this visual information with apparent ease: according to psychologists,
only a few milliseconds are needed to understand the contents [Kos94].
In artificial intelligence, we face a colossal scale of visual information in our digital
storage, and few useful techniques to process, understand and classify it. Image
search engines (Google, Altavista, Picsearch) are normally based on the file names
of images or on keywords, so irrelevant results often appear (Fig. 1.1). Machine
perception from photographs is indeed a challenge and one of the most important
problems in computer vision. In this thesis work, our first research topic is to
automatically assign the objects in images to correct semantic labels and to build a
content-based recognition system. To accomplish this objective, we propose to extract
a set of intermediate visual components to represent objects for categorization. The
method is efficient for generic recognition, but we also face the problem of processing
complexity: the cost of the training stage increases if we want to handle tens of
thousands of classes. Though HDP is an efficient and promising method, the
limitation of its convergence speed raises several problems when extending to more
categories.
Figure 1.1: The results from http://www.picsearch.com when searching for "cat".
Furthermore, we began to consider a more practical classification problem with a
fixed number of classes: Facial Expression Recognition. Perceptual psychologists have
spent decades defining and representing human emotions and the related facial
expressions (Fig. 1.2).
Figure 1.2: Photographs of facial expressions from Paul Ekman[EF76].
1.2 Object Categorization and Methods
1.2.1 Objective
As mentioned in Section 1.1, humans can solve the problem of object recognition and
categorization quickly, efficiently and almost effortlessly. But for a corresponding
algorithm from computer vision, image analysis and pattern recognition, it is still a
very difficult task. For objects in the same category, variations in pose, lighting, scale
and affine changes, together with intra-class differences, make assigning the object in
an image to a certain category an extreme and still unsolved challenge.
1.2.2 Contribution
In our proposal, we combine thousands of descriptors (e.g. local gradient, shape, and
color) from small patches within one hierarchical generative model. These different
data sources have complementary characteristics, which should be independently
combined to improve the classification. We are also inspired by the "Hierarchical
Dirichlet Processes" method to generate intermediate mixture components that improve
recognition and categorization. Our work shares two aspects with topic modeling for
documents: its nonparametric and hierarchical nature. Unlike documents, which are
usually written in a single language, we have several sets of descriptors for the same
image, just as if different scripts from different language sources were used to tell the
same story. The method reinforces the same contents from complementary origins for
better understanding and comprehension.
To find the right category for an object, these features represent different facets of
its characteristics. These facets are usually shared by multiple categories. The particular
combination and distribution of these facets signify a particular category and allow us
to perform the correct classification.
Furthermore, after we obtain the set of components, instead of boosting the features
as Viola and Jones [VJ01], we boost the components in the intermediate layer to find
the most distinctive ones. We consider that these components are more important for
object class recognition than others and use them to improve the classification. Our
target is to achieve the correct classification of objects, and also to discover the essential
latent themes shared across multiple categories of objects and the particular distribution
of the latent themes for a specific category.
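The component-level boosting described above can be illustrated with a small sketch. The snippet below is a minimal, self-contained discrete AdaBoost that selects one-component decision stumps over component-activation scores; the data layout (one activation score per intermediate component and per image) is an assumption for illustration, not the exact implementation detailed in Chapter 4. The component indices recorded in the returned learners identify the most distinctive components.

```python
import numpy as np

def train_adaboost_stumps(X, y, n_rounds=10):
    """Discrete AdaBoost over one-component decision stumps.

    X : (n_samples, n_components) component-activation scores
    y : (n_samples,) labels in {-1, +1}
    Returns a list of (component index, threshold, polarity, alpha).
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)          # uniform sample weights
    learners = []
    for _ in range(n_rounds):
        best = None
        # Exhaustively search the stump with the lowest weighted error.
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = max(err, 1e-10)         # avoid log(0) for a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
        w *= np.exp(-alpha * y * pred)  # re-weight misclassified samples
        w /= w.sum()
        learners.append((j, thr, pol, alpha))
    return learners

def predict(learners, X):
    """Sign of the alpha-weighted vote of all selected stumps."""
    score = np.zeros(X.shape[0])
    for j, thr, pol, alpha in learners:
        score += alpha * np.where(pol * (X[:, j] - thr) > 0, 1, -1)
    return np.where(score >= 0, 1, -1)
```

On toy data where a single component separates the two classes, the first selected stump picks exactly that component, which is how the boosting stage ranks distinctiveness.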
1.3 Facial Expression Classification and Related Application
1.3.1 Motivation
The second topic that we study in this thesis concerns human facial expressions.
Automatic Facial Expression Recognition (FER) is one of the most active fields in
computer vision and has attracted many proposals over the last several decades. The
recognition and interpretation of human facial expressions play an important role in
interpersonal and non-verbal communication. However, the task of interpretation,
whether by computer vision algorithms or even manually by human beings, is difficult
because of individual differences and cultural background. On a human face,
expressions can be seen in the subtle movements of facial muscles and are influenced
by internal emotional states. Based on psychological research, [EF78] proposed FACS
(Facial Action Coding System) as a standard to systematically categorize facial
expressions. They also defined six basic emotions: Anger, Disgust, Fear, Happiness,
Sadness and Surprise. Since then, two families of methods have been developed: one
aims at understanding these prototypic expressions while the other concentrates on
the detection of basic action units. Some approaches combine these two methods. In
this thesis, we propose to use appearance-based and spatiotemporal information to
build an automatic FER system that gives an interpretation of facial expressions in
video sequences.
1.3.2 Contribution
We developed an automatic FER system which establishes relations between facial
expressions and the changes of facial parts. An unchanged state over a long run usually
implies a neutral face. If changes take place between frames, we can detect the facial
movements and then a possible facial expression. These changes are related to different
emotions, which were classified into six distinctive universal emotions by Ekman and
Friesen [EF78]. This builds the baseline of our system: we detect the facial changes due
to an expression, we locate face parts and then, from these parts, we identify the facial
expressions.
In the proposed method, the detection of faces begins with a rough face detector,
which is then refined by detecting the important facial organs (eyes, nose and mouth).
The representation of faces is appearance-based, applied on a designed mask with an
8-block layout. For each block, an LBP histogram is extracted and then the histograms
are concatenated into a feature vector. Gabor features can optionally be combined at
this step.
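As a rough sketch of this block-wise representation, the snippet below computes a basic 8-neighbour LBP code map and concatenates one normalized 256-bin histogram per block. The uniform 4x2 grid stands in for the designed 8-block facial mask and is an assumption for illustration only.

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour LBP code for each interior pixel of a 2-D array."""
    c = gray[1:-1, 1:-1]
    codes = np.zeros(c.shape, dtype=np.uint8)
    # Neighbours enumerated clockwise from the top-left, one bit each.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        nb = gray[1 + dy:gray.shape[0] - 1 + dy,
                  1 + dx:gray.shape[1] - 1 + dx]
        codes |= ((nb >= c).astype(np.uint8) << bit)
    return codes

def block_lbp_features(gray, grid=(4, 2)):
    """Concatenate one 256-bin LBP histogram per block (4x2 grid = 8 blocks)."""
    codes = lbp_image(gray)
    feats = []
    for row_chunk in np.array_split(codes, grid[0], axis=0):
        for block in np.array_split(row_chunk, grid[1], axis=1):
            hist, _ = np.histogram(block, bins=256, range=(0, 256))
            feats.append(hist / max(block.size, 1))   # per-block normalization
    return np.concatenate(feats)
```

The resulting vector has 8 x 256 = 2048 dimensions, one histogram per block, and can be fed directly to an SVM.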
Our automatic system also contributes in the second stage by proposing new
descriptors. It separates feature extraction into two parts: static images and dynamic
information estimation. To describe the spatiotemporal planes, we introduce VTB, a
modification of the well-known LBP descriptor, and a unique usage of moments in the
spatiotemporal domain to address the challenge of representative features. VTB
concentrates on texture characteristics while moments concentrate on changing shapes.
In other words, the appearance features are applied to measure the geometric changes
of shape and locations of the main facial components.
These descriptors can evaluate the intensity of expressions and handle temporal
information for image sequences. Furthermore, we also explore the possibility of
combining the static appearance features with the texture and shape features of motion.
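At sequence level, the final decision combines the per-frame class probabilities by a weighted sum (Algorithm 5.1 in Chapter 5). The sketch below is a minimal version of that idea; the Gaussian weighting profile over frame indices, and its default parameters, are assumptions for illustration rather than the tuned values of the thesis.

```python
import numpy as np

def classify_sequence(frame_probs, mu=None, sigma=None):
    """Sequence label from a weighted sum of per-frame class probabilities.

    frame_probs : (n_frames, n_classes) array, one probability
                  distribution per frame (e.g. from a per-image SVM).
    mu, sigma   : parameters of the Gaussian weight profile over frame
                  indices (hypothetical defaults below).
    """
    n = frame_probs.shape[0]
    if mu is None:
        mu = n - 1                    # emphasise frames near the sequence end
    if sigma is None:
        sigma = max(n / 4.0, 1.0)
    t = np.arange(n)
    w = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    w /= w.sum()                      # weights form a convex combination
    score = w @ frame_probs           # (n_classes,) weighted vote
    return int(np.argmax(score)), score
```

Because the weights sum to one, the combined score is itself a probability distribution over classes; the predicted class is simply its argmax.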
1.4 Outline
The structure of this thesis is organized as follows:
1. In Chapter 2, we provide a survey of the existing techniques in the areas of object
categorization and facial expression classification. For object categorization, we
focus on different models and their hybrids for category-level recognition. For the
more applicable area of facial expression classification, we concentrate on the
three main stages of Facial Expression Recognition (FER) systems to reflect the
historical development since its inception.
2. In Chapter 3, we present various descriptors (mostly based on local appearance) used
in generic object recognition and facial expression recognition systems. Then, our
new descriptors are proposed and detailed.
3. Chapter 4 describes the different classifiers used in computer vision systems. The
two major branches, discriminative and generative methods, are presented. Our
proposed hybrid system is then introduced.
4. In Chapter 5, both the systems for generic object recognition and facial expression
recognition are detailed in every step. Benchmark databases are used to evaluate
the performances of these systems.
5. Chapter 6 summarizes the contributions in both object categorization and facial
expression classification. Final conclusions concerning this dissertation are
drawn. We also propose several future working directions.
Chapter 2
Survey on Object Recognition and
Facial Expression Recognition
Contents
2.1 Object Categorization: The State of the Art . . . . . . . . . . . . . . . . 9
2.1.1 Problem Statement: Category Level Recognition . . . . . . . . . . 9
2.1.2 Popular Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Classic Models for Object Categorization . . . . . . . . . . . . . . 12
2.1.4 Recent Work and Summary . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Literature Review: Facial Expression Recognition . . . . . . . . . . . . 23
2.2.1 Basic Facial Expressions and Facial Actions . . . . . . . . . . . . 23
2.2.2 Three Stages in Automatic FER System . . . . . . . . . . . . . . . 23
2.2.3 Recent Work and Trend . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
7
In this chapter, we review the existing techniques in the areas of object categorization and facial expression classification. For object categorization, we focus on methods for basic recognition of visual categories based on image feature extraction. For the more applicable area of facial expression classification, we review some popular systems to reflect the historical development since its inception.
2.1 Object Categorization: The State of the Art
2.1.1 Problem Statement: Category Level Recognition
According to neuroscientists ([FFTR∗05]), humans can perform very fast categorization (well below 300 ms) of objects, or identification of a specific object, even under highly degraded and novel viewing conditions. There is an important distinction between these two tasks: recognizing the same object, and finding objects that belong to the same category in natural images. In the first, instance-level problem, specific OR deals with an individual object, for example recognizing a particular human being (Alan Turing in the first row of Fig. 2.1) regardless of viewpoint changes, differences in illumination, occlusion, background clutter and imaging noise, as mentioned by Fergus [Fer05]. This technique is widely integrated in security applications (e.g. airport security checks). Some studies have claimed nearly 100% accuracy in specific instance recognition [JB08].
Figure 2.1: Generic OR and Specific OR
The second task, category-level recognition, is defined in Pinz's book [Pin05] as "generic object recognition" (generic OR). For generic recognition, the list of categories or classes can consist of, e.g., people, bikes or bottles. Robust category-level categorization (generic OR) is the harder task [Fer05] for researchers in computer vision. Here the goal is to define a process which can assign objects within images to a certain category. To design an efficient algorithm, one of the major problems is the distinction between intra-class variabilities and inter-class differences. Furthermore, visual categories are normally hierarchical in the semantic domain (e.g. Automobile-Car-SUV or sport utility vehicle). An object can also belong to two or more classes (e.g. a wooden box can be a stationery case or a flowerpot). Many methods have tackled this task, but a satisfying solution for an exhaustive set of objects (10,000 to 30,000 according to psychologists [RMG∗76, Bie87]) in the real world is not likely to appear now or in the near future. However, for a small set of objects, many contributions have been made since the first attempts in the 1960s, for example the identification of smiling people in the second row of Fig. 2.1. In this thesis, we will handle this generic object recognition task and provide a literature review and original research in this field.
2.1.2 Popular Datasets
For a thorough evaluation of various categorization algorithms, benchmark or commonly used databases are required to train and test the proposed models. A useful database should satisfy the criteria below:
1. Enough images in each category,
2. High intra-class variability inside the same category and low inter-class variability between different categories,
3. Scale, orientation and viewpoint variation within the same category,
4. Background clutter complexity.
Some databases are designed for specific object recognition. For example, the Object Recognition Database from the Ponce research group includes modeling shots of eight objects for comparative evaluation [RLSP06]. Several databases are well known but concentrate on different aspects of the above requirements. Here we list some widely used sets:
1. Caltech 4 [FPZ03], Caltech101 [FFFP07], Caltech256 [GHP07], nearly a standard for testing natural object recognition algorithms. The images (Fig. 2.2) vary greatly in view, position, size, lighting conditions and so on. Some categories contain fewer than 40 images.
Figure 2.2: Samples from Caltech datasets
2. ETH-80 from the ETH CogVis project [LLS04], which contains images of 8 categories and 80 objects, such as apples, pears, tomatoes, cows, dogs, horses, cups and cars. Each object is represented by 41 views.
3. UIUC Image Database for Car Detection [AR02, AAR04], which contains grey-scale images of cars in PGM format for evaluating object detection algorithms (only one category with background). No scale or viewpoint changes are present.
4. The TU Graz-02 Database [OFPA04], which includes images of people, cars, bicycles and counter-examples.
5. PASCAL Object Recognition Database Collection [EVGW∗], a famous annual challenge to recognize objects from a number of visual object classes in realistic scenes. It has 20 object classes and is collected from the "flickr" website (Fig. 2.3). Due to the huge size of the database, the training and testing process can be time-consuming.
6. MIRFLICKR-25000 Image Collection [HL08]. This database is also collected from the "flickr" website, which allows its users to search and share their pictures based on image tags. It provides very interesting visual concepts for retrieval research.
Figure 2.3: 20 categories in PASCAL Visual Object Classes Challenge 2010
Among these databases, the PASCAL VOC challenge is obviously the hardest, because it includes the largest sets and some objects are very small in the images. UIUC includes only one category, which is not sufficient for a bias-free evaluation. ETH-80 provides too few samples for each category. In TU Graz-02, certain backgrounds and objects are linked. The Caltech series is the most widely used and has low-level clutter, though the number of images in some categories is too low to validate the efficiency of algorithms. In the future, billions of images from popular search engines such as GOOGLE, or the open dataset LabelMe [RTMF08], may become a more solid basis for computer vision researchers. In our research, we will use a subset of Caltech 101 to evaluate categorization performance.
2.1.3 Classic Models for Object Categorization
Research into the automatic understanding of images and video began in the 1960s. L. Roberts made the first attempt to construct and display three-dimensional solid objects from a single two-dimensional photograph, as in Fig. 2.4 [Rob63]. This developed into the reconstruction branch of object recognition. In computational neuroscience, David Marr [Mar82] laid the cornerstone for this group of computational approaches to perception, and also for brain science. He described the vision model as proceeding from a two-dimensional visual array (on the retina) to a three-dimensional description of the world as output. Starting from the original image, his stages of vision include:
Figure 2.4: Construction of three-dimensional solid objects from a single two-dimensional photograph [Rob63]
1. The original image, the input visual array,
2. The primal sketch, with features like edges and regions, as a suitable representation of the changes and structures in the image,
3. The 2.5D sketch of the scene, the intermediate representation of visible surfaces,
4. The 3D object model, interpreting the surfaces and leading to full perception.
Marr's ideas influence recent research such as [RLSP06, VKM09].
Another branch, which works in the two-dimensional domain, extracts features for pattern recognition from images or videos. This group of methods can handle visual data in natural environments and is widely used in robotics, security systems and other autonomous systems. Here the general task of object categorization is to assign the correct category label to unknown instances, given a small number of training images of a category. We will present three major models in this section: bag-of-words, part-based and discriminative models.
Bag of Words
The bag-of-words (BoW) approach originates as a simplifying assumption in natural language processing [Har54]. This group of methods is popular in document classification. It represents a text as a bag containing a collection of words from a dictionary, ignoring word order and grammar.
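The textual version of this idea fits in a few lines of Python; the vocabulary and sentence below are invented for the illustration:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent a text as a histogram over a fixed vocabulary,
    ignoring word order and grammar."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

vocab = ["cat", "dog", "runs"]
print(bag_of_words("The dog runs and the cat runs", vocab))  # [1, 1, 2]
```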
Figure 2.5: BoW for Object Categorization (ICCV 2009 short course by L. Fei-Fei)
In computer vision, there is a similar treatment. An image is represented as a histogram vector of features; these features are extracted from a regular grid or a set of keypoints; each feature is a visual codeword, i.e. an entry in a visual dictionary; the visual dictionary is normally generated by k-means. In this structure, shown in Fig. 2.5, all features are independent. The model regards the image as a collection of these features. These features are found by a set of detectors, which have been chosen to respond to different types of structures within images (e.g. interesting features of pixels;
the outlines of objects and so on). Normally the system includes two main parts:
1. Feature detection and histogram representation of original image (object),
2. Learning and recognition based on histograms of features.
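The first part can be sketched with a toy k-means codebook in NumPy; the random descriptors below are stand-ins for real local features such as SIFT:

```python
import numpy as np

def build_codebook(features, k, iters=10, seed=0):
    """Toy k-means: cluster local feature vectors into k visual words."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each feature to its nearest center, then re-estimate centers
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return centers

def bow_histogram(features, centers):
    """Map an image's local features to a normalized histogram of visual words."""
    d = np.linalg.norm(features[:, None] - centers[None], axis=2)
    h = np.bincount(d.argmin(axis=1), minlength=len(centers)).astype(float)
    return h / h.sum()

rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 8))      # stand-in for SIFT-like descriptors
centers = build_codebook(feats, k=5)
hist = bow_histogram(feats, centers)   # the image representation
```

The resulting histogram is what the second part (learning and recognition) operates on.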
In the first part, methods for object categorization typically extract features by applying salient point detectors to the images. The survey by Schmid et al. [SMB00] evaluated the repeatability rate and information content of various interest point detectors. They compared contour-based, intensity-based and parametric-model-based methods. They found that the Harris point detector [HS88] and its multi-scale variants perform better than, or at least equivalently to, other detectors in two respects: repeatability and information content.
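The principle of the Harris detector can be sketched in a few lines of NumPy (our own toy version, with a crude 3×3 box window instead of the Gaussian weighting of [HS88]):

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response R = det(M) - k*trace(M)^2, where M is the
    local structure tensor of the image gradients."""
    img = img.astype(float)
    Iy, Ix = np.gradient(img)
    def box(a):  # sum over a 3x3 window around each pixel
        p = np.pad(a, 1, mode="edge")
        h, w = a.shape
        return sum(p[i:i+h, j:j+w] for i in range(3) for j in range(3))
    Sxx, Syy, Sxy = box(Ix * Ix), box(Iy * Iy), box(Ix * Iy)
    return Sxx * Syy - Sxy**2 - k * (Sxx + Syy)**2

# a synthetic step corner: bright square in the lower-right quadrant
img = np.zeros((10, 10)); img[5:, 5:] = 1.0
R = harris_response(img)
# R is strongly positive at the corner (5, 5) and negative along the edges
```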
Matas et al. [MCUP04] proposed a detection algorithm for an affinely-invariant stable subset of extremal regions, named maximally stable extremal regions (MSER). Integrated in the SIFT descriptor [Low99b], the difference of Gaussians (DoG) is also a good keypoint detector and is widely used. A comparison is shown in Fig. 2.6: the salient parts in images are detected whether they belong to objects or to noisy background.
Some other methods for scene categorization [VS04, LP05] simply use a regular grid on the images to extract features from rectangular patches. Random sampling is also used [VNU03]. In these systems, salient regions are detected in the image, but not all of them are supposed to be keypoints of the object we are looking for. Some will lie on the background or in cluttered areas. The successful usage of these points after detection depends on the descriptors and classification.
After salient regions are located, the next step is to represent them in terms of descriptive features, often encoded as feature vectors. These descriptors should be highly discriminative and easy to generate. The combination of detector and descriptor should be invariant to scaling, rotation, affine deformation, illumination changes and geometric or radiometric distortion. Some simple features are pixel intensities (grey levels) or color, moments and their invariants, and filters or transformations (DCT, Gabor). Some more complex features (e.g. SURF [BETVG08]) have also proved to offer high performance. These features can be used to build a codeword dictionary, or fed directly into a part-based model or classifier.
Figure 2.6: Examples of 'key features' that are detected by: the scale invariant Harris detector, the affine invariant Harris detector, and the DoG/SIFT detector/descriptor [Pin05].
With the codeword dictionary, we can map the original image to a histogram of codewords. We still need efficient tools to complete the task of learning and recognition. Two groups of approaches, discriminative and generative, or hybrids of the two, are available. Generally speaking, the discriminative methods [GD05a] are driven by data in a bottom-up manner, such as SVM [BGV92]. On the other hand, the generative methods [LP05, SRE∗05a] follow a top-down paradigm. Discriminative approaches can obtain very high accuracy on some datasets, but overfitting remains a persistent risk.
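The contrast can be illustrated on toy 2-D data: a generative classifier fits p(x|c) per class, while a discriminative one (here a small logistic regression, standing in for SVM) directly fits the decision boundary. The data and learning rates are invented for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated Gaussian blobs as toy "feature vectors"
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
               rng.normal([3, 3], 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# generative (top-down): model p(x|c) as an isotropic Gaussian per class,
# then classify by the Bayes rule, here the nearest class mean
mu = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
def gen_predict(x):
    return int(np.argmin([np.sum((np.asarray(x) - m) ** 2) for m in mu]))

# discriminative (bottom-up): logistic regression trained on p(c|x)
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * (p - y).mean()
def disc_predict(x):
    return int(np.asarray(x) @ w + b > 0)
```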
Using this BoW structure, various methods achieve state-of-the-art performance and are worth noting. Pinz et al. [Pin05] presented the 'boundary-fragment-model', which is contour-based and uses a codebook of 'boundary fragments' to vote for potential object centroid locations. Their method can cope with multiple models for a single category, which is not possible for region-based methods. On the other hand, it also needs bounding boxes around the objects in the training images, making it more supervised. The possibility of combining boundary- and patch-based methods is explored in Pinz's cooperative work [OFPA04], but the computational complexity problem is not solved.
Figure 2.7: The most likely words (shown by 5 examples in a row) for four learned topics(1): (a) Faces, (b) Motorbikes, (c) Airplanes, (d) Cars. [SRE∗05a].
Using the probabilistic Latent Semantic Analysis (pLSA) model, Sivic et al. [SRE∗05a] tried to discover the object categories depicted in a set of unlabelled images. The model is applied to images by using a visual analogue of a word, formed by vector-quantizing SIFT-like region descriptors. They also extended the bag-of-words vocabulary to include 'doublets', which encode spatially local co-occurring regions. Their unsupervised method is successful; however, in their straightforward system one topic equals one object (as in Fig. 2.7) and no potential latent components are considered.
In the area of natural language processing, with the goal of sharing latent topics among documents, an approach based on Latent Dirichlet Allocation (LDA, Blei et al. [BNJ03]) provides a set of shared finite mixture components based on 'BoW'. This method is efficient but requires fixing some parameters in advance. Using the nonparametric nature of the Dirichlet Process to determine the appropriate parameters, the LDA method was extended by Teh et al. [TJBB06] into Hierarchical Dirichlet Processes, an algorithm developed to capture uncertainty regarding the number of mixture components in document modeling (details in Chapter 4.3).
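To give a feel for the underlying machinery, here is our own toy collapsed Gibbs sampler for plain LDA on tiny synthetic "documents" (a minimal sketch with a fixed number of topics K; the HDP of [TJBB06] additionally infers K, which this version does not):

```python
import numpy as np

def lda_gibbs(docs, K, V, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA with K fixed topics.
    docs: list of word-id lists; returns doc-topic and topic-word counts."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(0, K, len(d)) for d in docs]          # topic of each word
    ndk = np.zeros((len(docs), K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
    for d, (ws, zs) in enumerate(zip(docs, z)):
        for w, t in zip(ws, zs):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, (ws, zs) in enumerate(zip(docs, z)):
            for i, w in enumerate(ws):
                t = zs[i]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1   # remove the word
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())             # resample its topic
                zs[i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw

# four tiny "documents" over a 4-word vocabulary, two latent topics
docs = [[0, 0, 1, 1, 0], [1, 0, 1], [2, 3, 2, 3], [3, 2, 3]]
ndk, nkw = lda_gibbs(docs, K=2, V=4)
```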
Beyond document analysis, the LDA model is also used in natural scene classification. L. Fei-Fei et al. [LP05] represented the image of a scene by a collection of local regions, denoted as codewords obtained by unsupervised learning.
Figure 2.8: Learning relevant intermediate representations of scenes automatically and without supervision [LP05]
The model learns the theme distributions as well as the codeword distributions over the themes without supervision, as in Fig. 2.8. Furthermore, in L. Fei-Fei's cooperative work [WZFF06], the local patches are no longer independent: a linkage structure over the latent themes is introduced to encode the dependency of the local patches. These methods showed highly competitive categorization results, but the sharing of intermediate components is not really functional, because normally there are only one or two components for each category. The objects and the middle-level components are thus almost equivalent, and the core of the LDA algorithm is not exploited.
In a similar object recognition system, Sudderth et al. [STFW05] used SIFT [Low99b] descriptors, spatial information and the HDP model for visual scene categorization. By training transformed mixture components, the method applies complicated transformed components and uses a very limited number of themes.
Part-Based Models
In the bag-of-words model, all visual patches are considered of equal importance, and the spatial relations between them are simply ignored. Based on similar interest point detection, part-based models are proposed to cope with these problems. The original proposal by Fischler and Elschlager [FE73] was to find a visual object in an image given the relative positions of a few template matches. Depending on the geometry and number of local features, different geometric models are applied (Fig. 2.9).
Figure 2.9: Graphical geometric models by [CL06]
Using constellation models, L. Fei-Fei et al. [FFFP07] presented a Bayesian algorithm for learning generative models of object categories from a very limited training set. Their method uses prior information and then learns the unknown parameters of a generative probabilistic model. No latent theme is considered, so their system has only one level. The complete graph is also used in [BKSS10] to locate objects.
Winn and Shotton [WS06] imposed a Layout Consistent Random Field (LayoutCRF) model of asymmetric local spatial constraints on the part labels, ensuring a consistent layout of parts whilst allowing for object deformation. However, in their system the scale is fixed and only two categories, cars and faces, are tested.
Star models [FPZ05, CHC09] apply a star topology configuration of parts modeling a variety of features (appearance, shape, spatial information or hybrids). The part-based Star Model (SM) is used in an exhaustive manner to learn the object model and recognize the test images. The shortcoming is that the model parts must be taken from similar viewpoints, which is not flexible.
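The spatial term of such a star model can be sketched as a sum of Gaussian log-penalties tying each part only to the root (toy code of ours; the positions and learned offsets are invented):

```python
import numpy as np

def star_spatial_score(root, parts, offsets, sigma=2.0):
    """Log-probability (up to a constant) of part placements in a star model:
    each part depends only on its displacement from the root node."""
    root = np.asarray(root, float)
    score = 0.0
    for p, mu in zip(parts, offsets):
        d = np.asarray(p, float) - root - np.asarray(mu, float)
        score -= d @ d / (2 * sigma ** 2)   # isotropic Gaussian penalty
    return score

# perfect placement scores 0; deviations from the learned offsets are penalized
best = star_spatial_score((10, 10), [(12, 10), (10, 13)], [(2, 0), (0, 3)])
worse = star_spatial_score((10, 10), [(15, 10), (10, 13)], [(2, 0), (0, 3)])
```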
Felzenszwalb and Huttenlocher [FH05] built a tree model to detect and locate humans and faces, but it can only capture a small collection of parts arranged in a deformable configuration.
Crandall et al. [CFH05] extended the restricted form of the tree model and introduced the k-fan model, in which a reference set of k nodes forms a complete subgraph and each node outside the set is connected to every node of the set. They used k = 2, but the cost is already relatively high (O(N³)).
Hierarchy models [BT05, STFW05] use a loose two-level hierarchical structure, grouping pixels into parts and parts into objects. Although these methods allow arbitrary spatial transformations between parts and their subparts, the required geometric constraints are quite precise.
Carneiro and Lowe [CL06] stated that in their sparse flexible model, the geometry of each model part depends on the geometry of its k closest neighbors. Increasing the connectivity parameter k can improve the performance but reduces the system's efficiency.
Unlike [FFFP07, BKSS10], Revaud et al. [RLAB10] proposed an approach that uses mutual information to automatically learn selected subgraphs for a specific object. Their method is based on local keypoints and their spatial proximity relationships. The cost of recognition is relatively high and real time is not possible. They used incomplete graphs for specific object recognition, although for category-level recognition, objects within the same category also share subgraphs to some degree.
Classifier-Based Models
Popular classifiers such as SVM are also used in the recognition stage of BoW models (e.g. [Pin05, SRE∗05a, GD05a]). However, there are also methods based purely on these discriminative classifiers, and their results are likewise promising.
Among all the classical methods, boosting originates from Freund and Schapire [FS95]. Viola and Jones [VJ01] combined boosted weak learners in a face detector (Fig. 2.10) working under a very wide range of natural conditions, and trained a cascade of classifiers based on simple Haar features. Building on this method of essential feature selection, Zhang et al. [ZYZS05] broadened the feature pool by combining local texture features, global features and spatial features within a single multi-layer model to improve the performance. Even though the results from AdaBoost are not as good as those from another discriminative method, SVM, these simple and efficient methods will help us to design our own object classification system.
Figure 2.10: Schematic depiction of the AdaBoost detection cascade by [VJ01]
For the powerful kernel-based SVM methods, different kernels or mixtures of kernels are incorporated to improve the performance. Pozdnoukhov and Bengio [PB04] defined new SVM kernels based on tangent vectors that take into account prior information on known invariances. Kienzle et al. [KBFS04] replaced the set of support vectors by a smaller, so-called reduced set of synthetic points to speed up the evaluation.
Some researchers have proposed combining multiple classifiers. Mattern et al. [MRD05] combined the classification results by a voting method: each classifier votes for one class, and the class with the most votes wins. Siddiquie et al. [SVD09] proposed the Boosted Kernel SVM (BK-SVM), learning a mixture of kernels by greedily selecting exemplar data instances corresponding to each kernel using AdaBoost. This method reduces the number of kernel computations but suffers from reduced accuracy.
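The voting combination described above amounts to a few lines (a sketch; ties here are broken in favour of the first label seen):

```python
from collections import Counter

def majority_vote(predictions):
    """Each classifier votes for one class; the class with the most votes wins."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["car", "face", "car", "bike", "car"]))  # car
```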
2.1.4 Recent Work and Summary
To extend the BoW model, Grauman and Darrell [GD05a] designed a method to form a partial matching between two sets of feature vectors (or histograms). This matching is used as a robust measure of similarity to perform content-based image retrieval, as well as a basis for learning object categories. They use a multi-resolution histogram pyramid (Fig. 2.11) in the feature space to implicitly form a feature matching.
Figure 2.11: Pyramid Match by Grauman and Darrell [GD05a]
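A toy 1-D version conveys the idea: count histogram-intersection matches at each resolution, and down-weight matches that first appear at a coarse level (our own sketch of the scheme in [GD05a], not their implementation):

```python
import numpy as np

def pyramid_match(x, y, levels=4, diameter=16):
    """Toy pyramid match for two sets of 1-D features in [0, diameter):
    new matches found at level i (bin width 2**i) are weighted by 1/2**i."""
    score, prev = 0.0, 0.0
    for i in range(levels):
        bins = int(np.ceil(diameter / 2 ** i))
        hx, _ = np.histogram(x, bins=bins, range=(0, diameter))
        hy, _ = np.histogram(y, bins=bins, range=(0, diameter))
        inter = np.minimum(hx, hy).sum()     # matches at this resolution
        score += (inter - prev) / 2 ** i     # count only the new ones
        prev = inter
    return score

a = np.array([1.5, 3.2, 7.7])
# matching a set against itself scores exactly its cardinality
assert pyramid_match(a, a) == len(a)
```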
Based on the 'bag-of-words' model, Larlus and Jurie [LJ08] used the HDP model to train one-object-one-component blobs for segmentation, and applied Markov Random Fields (MRF) to find the boundaries of objects. This method combines heterogeneous features and MRF components; however, it is applied to only 30 to 60 patches per image and obtains only one or two latent themes per object (Fig. 2.12). In this case it fails to capture the essential characteristics inside the 'categories' and ignores the intra-class variance.
Figure 2.12: Latent components generated by [LJ08]
Similarly, but working on a different topic, Wang et al. [WMG09] combined low-level visual features, simple atomic activities and interactions in a hierarchical Dirichlet model for activity perception in complicated environments. It shows the power of the HDP model and also inspires our work.
The merits and weaknesses of these current systems lead us to our original work, which includes the combination of heterogeneous features, the search for latent components, and cascaded generative and discriminative models.
2.2 Literature Review: Facial Expression Recognition
2.2.1 Basic Facial Expressions and Facial Actions
Facial expressions play an important role in signaling emotional states and have been a focus of psychologists (e.g. Ekman and Friesen [EF78]).
A great deal of effort has been put into the translation between temporal and pattern
features of human expressions and semantic labels. In computer vision, machine under-
standing of this human non-verbal communication is still a challenge in human-machine
interaction.
To point out the directions of facial measurement technology, psychologists defined two main streams in the automatic analysis of facial expressions. The first direction concerns facial affect detection, where six basic universal human emotions are identified, as in Fig. 2.13: Fear, Surprise, Sadness, Anger, Disgust and Happiness. The second direction concerns facial muscle action detection, where the target is to identify the basic actions or vocabulary of expressions: AUs (Action Units; a whole set of objective coding labels for facial actions is defined in the Facial Action Coding System, as in Fig. 2.14). 44 different facial movements are defined as AUs. Between these two main research directions, some translations or mapping tables are provided, but without a solid theoretical basis.
2.2.2 Three Stages in Automatic FER System
The early work, starting with one of the first attempts at an FER system by Suwa et al. [SSF78], was summarized by Pantic and Rothkrantz [PR00]. They concluded that the ideal system for facial expression analysis includes three important stages: the detection of
Figure 2.13: Six basic expressions, from left to right: Anger, Disgust, Fear, Happiness,Sadness and Surprise.
Figure 2.14: Examples of facial action units (AUs) and their combinations defined inFACS from [PB07]
faces, their representation, and the classification of these representations. In their overview of automatic FER systems, Chibelushi and Bourel [CB03] gave a similar structure but added two architectural components: pre-processing (normally a sub-stage of face acquisition) and post-processing (normally included in the classification step), as in Fig. 2.15. These main blocks are used in almost all FER systems. In later state-of-the-art reviews by Tian et al. [TKC05] and Pantic and Bartlett [PB07], the basic structure of FER systems is illustrated similarly, and the main techniques for the three stages are discussed and summarized. Zeng et al. [ZPRH09] surveyed the latest advances in facial expressions, head and body movements and temporal audio-visual correlation.
In the face detection stage, some approaches use the distance between the eyes [FPH05, SGM09] to normalize the faces, or identify a set of fiducial points [GD05b, KP08]. Among the existing automatic systems, Bartlett et al. [BGL∗06] employed boosting techniques on Gabor features to track each of 20 AUs (Action Units) to differentiate between fake pain and real pain. Koelstra and Pantic [KP08] developed a system based on the Viola
Figure 2.15: The main blocks in facial expression recognition [CB03]
and Jones face detector [VJ01] and used Gabor features to locate 20 facial characteristic points. Koutlas and Fotiadis [KF08] automatically located 74 landmark points using Active Shape Models (ASM). Buenaposada et al. [BMB08] constructed a three-level face tracker using the Viola and Jones face detector [VJ01], a template-based rigid face tracker and a subspace-based tracker. These methods concentrate only on face detection and do not consider the changes caused by facial expressions.
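Eye-distance normalization, as in [FPH05, SGM09], reduces to computing a scale factor from the inter-ocular distance (a minimal sketch; the target distance of 60 pixels is an invented parameter):

```python
def eye_normalization_scale(left_eye, right_eye, target_dist=60.0):
    """Scale factor bringing the inter-ocular distance to target_dist pixels;
    the face image is then resized by this factor before feature extraction."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return target_dist / (dx * dx + dy * dy) ** 0.5

print(eye_normalization_scale((100, 120), (220, 120)))  # 0.5
```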
In the second stage, discriminative information is extracted to represent the facial expression. One category of approaches is based on geometric features, whose models are established from a set of important points on the face or from face contour deformation. More recently, some researchers [VKM09] suggested the use of a deformed 3D model. In the other category, expressions are described by appearance-based features. Usually, the methods in this group handle the image-wise problem and identify the 6 basic classes plus a neutral class to label images. Among appearance-based methods, many use texture as the discriminative feature. One of the most frequently used descriptors is the Gabor wavelet [LBL09], which is a powerful, though time-consuming, tool. The Local Binary Pattern (LBP) [HPA04] is another popular texture feature, usually used on arbitrarily gridded sub-regions of images [SGM09].
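The basic LBP code is simple to state in NumPy (a toy 8-neighbour raster version of ours, not the circularly interpolated variant of [HPA04]):

```python
import numpy as np

def lbp_image(img):
    """8-neighbour LBP: threshold each pixel's neighbours at the centre value
    and pack the comparison bits into one byte per interior pixel."""
    img = np.asarray(img, dtype=float)
    c = img[1:-1, 1:-1]
    h, w = img.shape
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (nb >= c).astype(int) << bit
    return code

def lbp_histogram(img):
    """Normalized 256-bin histogram of LBP codes, as used per sub-region."""
    hist = np.bincount(lbp_image(img).ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

Histograms like this, computed per sub-region and concatenated, form the face representation used by the LBP-based systems above.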
There are other widely used texture descriptors and their variants. Yang et al. [YLM09] built a class-specific codebook and applied boosting to select dynamic Haar-like features at the same position along the temporal window. The weakness of these coded features is their fixed time-span. Xie et al. [XL09] built the histogram of pixel values at each pixel coordinate over all the training images. If a bin's value is higher than those of its two neighbors, they define it as a peak bin. The top peak bins with high probabilities of occurrence are selected, and the corresponding grey levels are used to code the pattern images for testing. The proposed feature is based on grey-level peaks and has been tested on only four expressions. As the speed of facial change varies across expressions, this continuous pixel coding over longer spans becomes inadaptable.
None of these sequence-based methods considers the appearance texture of static faces. Contrary to these approaches, Buenaposada et al. [BMB08] introduced a probabilistic procedure to combine this information, based on static information from each frame in the sequence, so as to compute the posterior probability. This is a real-time system, but the recognition rate is not competitive. In order to represent both static and dynamic information, Zhao et al. [ZP09] concatenated LBP features on three orthogonal planes to represent the video, treating motion in the same way as appearance, despite their obviously different texture patterns. Note that the authors manually annotated the eye positions in the sequences. Some other authors explored expression recognition on non-frontal face images by using SIFT features (Zheng et al. [ZTLH09]), hybrid features of LBP and Gabor (e.g. [MB09]), or variable-intensity templates [KOY∗09].
Some authors proposed the use of shape descriptors, such as Kotsia et al. [KZP08], who considered shape information from the Candide facial grid based on a set of landmarks, or Zhu et al. [ZSK02], who computed moment invariants on several manually annotated face areas. However, most of these approaches extract features on static images and do not consider the transient movements of essential facial parts.
Some new approaches focus on taking dynamic information into account so as to deal with the sequence-wise problem. As an early example, Yeasin et al. [YBS04] used discrete HMMs to uncover the hidden patterns associated with expressions, which are invariant to illumination changes. Djemal and Puech [DPR06] tracked contour changes in a sequence of medical images for 3D reconstruction. Xiang et al. [XLC08] analyzed fixed-size sequences of 11 frames using fuzzy C-means clustering to generate the expression model, but this method is inflexible: for shorter sequences it cannot yield a meaningful interpretation, while for longer ones some frames have to be eliminated. They also tested different settings for the number of frames and observed reduced performance when the number is cut down.
Spatio-temporal features have also become popular. Laganière and Lambert [LBH∗08] observed prominent motion of visual interest points found via the Hessian matrix. Koelstra and Pantic [KP08] derived motion histogram descriptors in sliding windows of frames along the time axis. The motion orientation histograms were extracted as feature descriptors to train a classification system for the automatic frame-by-frame recognition of AUs and their temporal dynamics, using a combination of ensemble learning and HMMs. They tested the recognition of all 27 lower and upper face Action Units on the MMI database. They also reported a high false positive rate, as only motion information is used and some AUs have very similar motion directions. In our opinion, static or appearance information can be used to reduce this error.
In the last stage, different machine learning techniques are applied. Guo et al. [GD05b] compared the performance of various classifiers: a simplified Bayes classifier, the support vector machine, and AdaBoost. The results showed that SVM is the most suitable. Buenaposada et al. [BMB08] introduced a probabilistic procedure to combine the information from the input image sequence in order to compute the posterior probability. The proposed system is robust in realistic environments but obtains only about 80% recognition rates for expressions such as fear, sadness and anger, which is not competitive with others [SGM09, ZP09]. In Shan et al. [SGM09], better recognition performance is obtained by combining SVM and Boosted-LBP features but, as their boosting is based on sub-regions, the method suffers from the curse of dimensionality.
2.2.3 Recent Work and Trend
Figure 2.16: Rectangle feature by [KJC08]
In recent work, the main new contributions lie in the design of new descriptors or recognition algorithms to improve overall performance. Here we introduce some new techniques and present their merits and weaknesses.
Kim et al. [PSK08, KJC08] extended Haar-like descriptors to various rectangle configurations (Fig. 2.16). For feature selection, the AdaBoost algorithm of Viola and Jones is used to build the classifier. Their results are around 90% on still images from the JAFFE database, which makes the performance of these 3× 3 features unremarkable among the available features.
Vretos et al. [VNP09] utilized the Kanade-Lucas-Tomasi algorithm to track the Candide facial grid and applied principal component analysis (PCA) to find the two eigenvectors of the model vertices. They then applied an SVM to selected vertex features to perform classification. The achieved facial expression classification accuracy is approximately 90% on the Cohn-Kanade database. Their system is efficient but neither the best performing nor fully automatic, because the vertices of the grid on the frames are manually located.
Chang et al. [CLL09] linked the output class label to the underlying emotion of a facial expression sequence and connected the hidden variables to frame-wise action units using a hidden conditional random field. In their sequence-wise classification, the label of a sequence is decided by a majority vote over the image frames. This solution is similar to Buenaposada et al. [BMB08]: labels and probabilities are combined to achieve better accuracy rates. In their system, only static features are extracted and no dynamic actions are considered, which degrades the recognition rates for "Anger" and "Sadness".
Working from temporally deformed facial features, Park et al. [PK09] magnified the intensity of facial actions so as to recognize subtle facial expressions using motion magnification and an SVM classifier. Using temporal information to recognize subtle and spontaneous facial expressions is a novel angle. However, their experiments are incomplete and difficult to compare with other methods, as only four classes (three expressions plus the neutral expression) were tested, and only on their own dataset.
On the popular Cohn-Kanade database, Raducanu and Dornaika [RD09] reported one of the best results, 100% on 5 classes from 70 subjects. Their method recovered 3D head pose and facial actions from the video sequence using an appearance-based face and facial action tracker (Fig. 2.17). They concluded that the dynamic recognition scheme outperformed all static recognition schemes. We also consider tracking dynamic transitions a promising and more robust direction for our proposal. On the weak side, their accuracy rate is high but they ignored the most difficult expression, "Fear", and their SVM classifier is possibly over-fitted.
As for future trends, the maturity of research in this field is leading many researchers to become interested in recognizing spontaneous expressions in realistic environments.
Figure 2.17: Three video examples associated with the 3D tracker, with different degrees of motion. [RD09]
Tong et al. [TCJ10] collected videos from the MAD database, the Belfast natural facial expression database and YouTube, and proposed a probabilistic framework, but nothing new regarding descriptors (traditional Gabor wavelet features are applied). Littlewort et al. [LBL09] videotaped 26 participants to record posed and genuine pain. In their proposed system, the output of traditional Gabor filters is passed to classifiers such as SVM, AdaBoost and linear discriminant analysis. It obtains 88% correct classification for distinguishing fake from real pain, while no dynamic information is used. However, although some spontaneous affective behavior databases now exist, a benchmark database for spontaneous facial expressions is still not available [ZPRH09].
Another trend is towards pose-invariant solutions. Kumano et al. [KOY∗09] proposed a variable-intensity template for FER systems. Their method describes how the intensity of multiple points defined in the vicinity of facial parts varies for different facial expressions and is individual-dependent. By using this model in the framework of a particle filter, their method is capable of estimating facial poses and expressions simultaneously. The experiments demonstrate a recognition rate of over 93.1% over a range of ±40 degrees from the frontal view on their own data, but only 70% for 53 subjects in the Cohn-Kanade database. Another problem is that the performance with their interest-point location system is inferior to the performance with random point extraction, and far from practical usage.
Another attempt to handle pose changes in facial expression recognition comes from Moore et al. [MB09]. They used static features such as local binary patterns (LBPs) and a novel descriptor, local Gabor binary patterns (LGBPs). The evaluation is performed on the 3D BU-3DFE database; the overall results are not very high even in the frontal-view case, which shows that this area still needs more exploration.
Though limited to only one category, the smile, Whitehill et al. [WLF∗09] collected pictures, photographed by the subjects themselves, from thousands of different people under many different real-world imaging conditions. Their experience developing a smile detector suggests that robust automation of the Facial Action Coding System may require on the order of 1,000 to 10,000 example images per target Action Unit. Datasets of this size are likely to be needed to capture the variability in illumination and personal characteristics likely to be encountered in practical applications.
2.3 Conclusion
Object recognition and facial expression recognition are two long-standing and still promising research areas in computer vision. In this chapter, we presented a comprehensive survey of the various systems, with summaries of recent work and possible trends in both topics. Among all these methods, however, many difficulties remain to be solved. For example, inaccurate location of facial characteristic points influences the detection of movements in facial expression classification. Many approaches are now moving from static features to dynamic features, but the accurate detection of such points for dynamic description is perhaps not realistic. More general methods that avoid the requirement of precise point location seem more practical. In our work, we follow these promising directions and propose our own methods and systems in the next several chapters.
Chapter 3
Feature Representation for Objects and Faces
Contents
3.1 Overview of Local Feature Descriptors . . . . . . . . . . . . . . . . . . . 33
3.2 The Detection of Regions of Interest . . . . . . . . . . . . . . . . . . . . 34
3.2.1 DOG for Key Point Detection . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Face Organ Location . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Features for Static Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 SIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3 Shape Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.4 LBP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.5 Gabor features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Features for Dynamic Schemes . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.1 Introduction to Temporal Texture Analysis . . . . . . . . . . . . . 47
3.4.2 Dynamic Deformation for Facial Expression . . . . . . . . . . . . 47
3.4.3 VTB Descriptor for Dynamic Feature . . . . . . . . . . . . . . . . 49
3.4.4 Moments on Spatiotemporal Plane . . . . . . . . . . . . . . . . . 52
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
In this chapter, we present an overview of various local descriptors which are required for pattern recognition. These descriptors successfully represent descriptive features in images or videos. Normally, they are denoted by feature vectors computed on a salient region or on a patch around a key point. According to their different usages, we put them into two sections: features for static images and features for dynamic characteristics.
Among these features, some descriptors, such as SIFT or Shape Context, already exist. Because of their powerful discrimination ability, we incorporate them into our systems. We also propose a new texton descriptor, namely VTB (vertical time backward), which captures the intrinsic changes of human faces in the spatiotemporal domain. For general shape changes in spatiotemporal domains, we extend the usage of moments. To the best of our knowledge, our method is unique in the literature in extracting the shape transformation of expressional faces on dynamic slices. Moreover, the combination of these heterogeneous features is introduced and discussed in the next chapter.
Part of the work described in this chapter was published in the form of three international conference papers [JIB09], [JI10b] and [JI10a].
3.1 Overview of Local Feature Descriptors
The usage of features is an important component of computer vision systems. Features generally denote the relevant information for learning and recognition in images and videos (or image sequences). Many features have been proposed by researchers since the 1970s, when the early development of applied artificial intelligence appeared. By using these features, computational algorithms that mimic human visual perception can be used in behavioral research, affective computing, robotics and human-machine interfaces. For these different applications, specific features should be carefully chosen to handle the particular problems. Generally, the representation of images and videos using features falls into two categories (Fig. 3.1):
Figure 3.1: Above: [PB07] Tracking facial characteristic points. Below: [HPA04] LBP histograms from the whole face and from divided blocks.
1. Geometric-based features: structural information, usually related to the spatial or temporal domain;
2. Appearance-based features: information on a region or on the neighborhood around an interest point in the images.
Basically, we use the second category of features, which applies local neighborhood operators on the patch around interest points or on salient regions. The spatial or temporal domain, which is traditionally explored by geometric features, will be represented by the dynamic features proposed in section 3.4. The effectiveness of all these
features depends on two steps: the detection of "interesting" regions (or points) and the feature descriptor. The performance relies first on detecting the right region (or the right point) from which a feature can be extracted. Furthermore, the pattern information on the region (or around the point) represented by the descriptor is commonly denoted by a single vector called the feature vector. The evaluation criteria for feature vectors should include three aspects:
1. Distinctive power;
2. A simple and fast extraction process;
3. Invariance to illumination, orientation, scale, and affine transformation.
For facial expression recognition, features should also be robust to human subjects from different cultural backgrounds and to their non-rigid movements.
In this chapter, various feature detectors and feature descriptors are introduced and compared. They are incorporated into our systems for object recognition and facial expression classification. Especially for dynamic features, we introduce appearance-based features into spatiotemporal domains.
3.2 The Detection of Regions of Interest
3.2.1 DOG for Key Point Detection
If we want to find an object in an image, interesting points related to the object can be located to provide a characteristic description of the object. For the task of object recognition, the features extracted from the training images should be repeatable in the testing ones regardless of image noise, changes in illumination, uniform scaling, rotation, and local geometric distortion. Evaluated on these merits, we choose and summarize Lowe's keypoint detection method based on DOG (difference-of-Gaussian) [Low99a] in this section. The method extracts a large number of feature vectors which are invariant to image translation, scaling, and rotation, partially invariant to illumination changes, and robust to local affine geometric distortion. The keypoint location method consists of three cascaded filtering stages:
1. Find the maxima and minima of the difference-of-Gaussians function applied at all scales and image locations. These extreme points are considered potential key points;
2. Discard low-contrast points and edge responses along an edge. Only the most stable of all candidate points are kept;
3. Finally, one or more orientations are assigned to each localized keypoint based on local image gradient directions.
These steps ensure that a large number of keypoints over all scales and locations can be generated for matching and recognition. By adjusting the threshold, more than 2000 stable and repeatable keypoints can be located in a highly textured 500×500 image after this filtering.
Figure 3.2: The construction of the Difference of Gaussians (DOG). [Low99a]
The first stage of keypoint detection is to identify locations and scales that can be repeatably assigned under differing views of the same object. For this, the image is convolved with Gaussian filters at different scales, and the differences of successive Gaussian-blurred images are computed as in Fig. 3.2. Keypoints are then taken as the maxima/minima of the Difference of Gaussians (DoG) occurring at multiple scales.
Therefore, with an input image I(x, y) and a variable-scale Gaussian

G(x, y, σ) = (1 / (2πσ^2)) e^(−(x^2+y^2)/(2σ^2))    (3.1)
the scale-space images L(x, y, σ) on the left of Fig. 3.2 are generated by convolving the image:

L(x, y, σ) = G(x, y, σ) ∗ I(x, y)    (3.2)
To use the scale-space extrema, the DOG image D(x, y, σ) on the right of Fig. 3.2 is given by the difference of two nearby scales separated by a constant multiplicative factor k:

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y) = L(x, y, kσ) − L(x, y, σ)    (3.3)
Hence the initial image is repeatedly convolved with Gaussians to produce the set of scale-space images. The convolved images are grouped by octave (an octave corresponds to doubling the value of σ), and adjacent Gaussian images are simply subtracted to produce the difference-of-Gaussian images. After each octave, the Gaussian image is down-sampled by a factor of 2 and the process is repeated. Once the DoG images have been obtained, keypoints are identified as local minima/maxima of the DoG images across scales: each pixel is compared to its eight neighbors in the current image and nine neighbors in the scales above and below, i.e. 26 neighbors in total in 3×3 regions at the current and adjacent scales. If the pixel value is the largest or smallest among all compared pixels, it is selected as a potential keypoint.
The choice of the prior smoothing parameter σ = 0.009 concerns the frequency of sampling in the spatial domain. Lowe's experiments show that the repeatability of keypoint detection increases with σ, but so does the computational cost. A proper σ should be chosen to balance sampling frequency against the detection rate in extrema selection. We start from an initial value of σ; if the number of points is below a threshold, the value of σ is doubled and more potential points are detected for future use. This process is repeated until enough points are found or the extrema are of too low contrast to be meaningful.
For accurate keypoint location, Lowe's method rejects points which have low contrast or are poorly localized along an edge. This stage provides a substantial improvement to matching and stability. First, after thresholding on the minimum contrast with respect to neighbouring points, only a subset of the potential keypoints is kept, to avoid too much clutter. Then, to eliminate keypoints along edges, which are unstable under small amounts of noise, the sum of the eigenvalues is obtained from the trace of the Hessian matrix and their product from its determinant; the curvatures around the point are roughly checked through their ratio, and edge responses are discarded.
In the last stage of the DOG detector, each keypoint is assigned one or more orientations based on local image gradient directions. This is the most important step in achieving invariance to rotation, as the keypoint descriptor can be represented relative to this orientation.
First, the Gaussian-smoothed image L(x, y, σ) with the closest scale σ is selected, so that all computations are performed in a scale-invariant manner. For each image sample, the gradient magnitude m(x, y) and orientation θ(x, y) are precomputed using pixel differences in a neighboring region around the keypoint. An orientation histogram with 36 bins is formed, each bin covering 10 degrees. Each sample added to a histogram bin is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a σ that is 1.5 times the scale of the keypoint. The peaks in this histogram correspond to dominant orientations. Once the histogram is filled, the orientations corresponding to the highest peak and to local peaks within 80% of the highest peak are assigned to the keypoint. When multiple orientations are assigned, an additional keypoint is created with the same location and scale but a different orientation. Only a small fraction of points are assigned multiple orientations, but these contribute significantly to the stability of matching in experiments.
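The orientation-assignment stage just described can be sketched as follows. This is a simplified, hypothetical version: gradients are taken with finite differences, the weighting σ is a placeholder value, and no parabolic peak interpolation is performed:

```python
import numpy as np

def assign_orientations(patch, weight_sigma=1.5):
    """patch: square grayscale window centred on the keypoint.
    Returns the dominant orientations in degrees (36 bins of 10 degrees;
    every peak within 80% of the highest bin yields an orientation)."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = (np.degrees(np.arctan2(gy, gx)) + 360.0) % 360.0
    # Gaussian weighting centred on the keypoint
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    g = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * weight_sigma ** 2))
    hist = np.zeros(36)
    bins = (orientation // 10).astype(int) % 36
    np.add.at(hist, bins, magnitude * g)   # magnitude- and window-weighted
    peak = hist.max()
    return [b * 10.0 for b in range(36) if hist[b] >= 0.8 * peak]
```

For a patch with a purely horizontal intensity ramp, all gradients point the same way and a single orientation of 0 degrees is returned; a vertical ramp yields 90 degrees.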
The operations in these three stages assign an image location, scale, and orientation to each keypoint. The detection up to this point is invariant to rotation, scaling and small deformations; the corresponding discriminative descriptor is presented in section 3.3.2.
3.2.2 Face Organ Location
Like the detection of interest points in object recognition, face and facial organs detection
is the first stage in an automatic facial expression recognition system (FER). The facial
detection in our FER system does not use the traditional ways of face detector. Viola and
38 Chapter 3. Feature Representation for Objects and Faces
Jones [VJ01] proposes a popular face detector using Harr-like features and AdaBoost al-
gorithm to locate faces, and have been used in many later system as [BMB08], [MWR∗08].
Some other methods manually label the location of eyes to normalize faces as the oval
CSU FERET faces [BBDT05]. After equalizing the histogram of the image and scaling
the pixel values to have a mean of zero and a standard deviation of one, we subtract
the images with expressions from images with neutral faces. We suppose the images
with expressions and images with neutral faces are from the same video sequence and
aligned. Possible head motion and pose changing are not considered here but available
technologies dealing with these subjects can be added to our system [BMB08].
Figure 3.3: Facial organ location: step by step.
Based on these subtracted images (the first image on the left in Fig. 3.3 and the third row in Fig. 3.4), and since the expression changes are related to those of the facial components, we use them to identify the facial organs relevant to the expression and to roughly locate them, as in Fig. 3.4. For different expressions, the location results for the same face will differ too. We first apply a Gaussian filter to blur the difference images and eliminate isolated noisy points. Then we detect the facial organ regions as in Algorithm 3.1. As shown step by step in Fig. 3.3, we first find the dense block in the image as the facial region, then locate the two dense blocks as the left and right eye areas, and finally find the lower part, which consists of the nose and mouth (though sometimes the nose cannot be detected). In the algorithm, the validity of a vertical or horizontal line L is calculated as its line density DL:
D_L = N_Valid / N_Total    (3.4)

where N_Valid is the number of valid pixels, whose brightness is higher than an empirical threshold T_bright, and N_Total is the total number of points on the vertical or horizontal line.
The results for different expressions are shown in Fig. 3.4. The first row shows
the original facial expression images. The second row shows the corresponding neutral
Algorithm 3.1: Detect facial organs
1. Draw the initial face rectangle: suppose the image contains only a front-view face, with the face spanning the full width of the image, the eyes beginning at 1/3 of the whole height and the chin occupying 1/9 of the whole height.
2. Shrink the face rectangle, checking the line density DL of the left and right border lines of the face.
3. Eye region detection:
(a) Locate the small area between the eyes;
(b) From this area, search the left and right parts to find the true width of the face;
(c) Initialize the positions of the eyes, locating the left and right blocks with enough valid points;
(d) Adjust the sizes of the left and right blocks to make sure they are the same size.
4. Locate the lower part, consisting of the nose and mouth:
(a) Initialize the lower part below the eyes: the width spans from the center of the left eye to the center of the right eye, the height from the lower edge of the left eye to the lower edge of the face;
(b) Erase the blank lines around the four edges of the lower part, and include the valid parts below or above the current lower part;
(c) Balance the lower part according to the central point between the left and right eyes.
5. Cut the minimal rectangle which contains the left eye block, the right eye block and the lower part block.
Figure 3.4: Sample images, from left to right: Anger, Disgust, Fear, Happiness, Sadness and Surprise.
expression images, and the subtracted images are listed in the third row. The detected facial organs are drawn in the fourth row. We then cut the original image with a square mask, so that only the face region from eyebrows to chin and from left eye to right eye is kept. The cropped region is processed by histogram equalization and pixel normalization in the same way as in the CSU Face Identification Evaluation System [BBDT05] to remove illumination changes. Finally, the facial region is scaled to a fixed resolution (e.g. 64×64) for the next stage. These cropped and normalized face images are shown in the fifth row of Fig. 3.4.
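Cropping aside, the photometric part of this pipeline can be sketched as follows. This is a generic re-implementation of histogram equalization, zero-mean/unit-variance scaling and neutral-frame subtraction, not the CSU toolkit itself:

```python
import numpy as np

def equalize(img):
    """Histogram-equalise an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum() / img.size              # cumulative distribution
    return (cdf[img] * 255).astype(np.uint8)

def normalize(img):
    """Scale pixel values to zero mean and unit standard deviation."""
    img = img.astype(float)
    return (img - img.mean()) / (img.std() + 1e-8)

def difference_image(expression_img, neutral_img):
    """Absolute difference of the equalised, normalised aligned frames."""
    return np.abs(normalize(equalize(expression_img))
                  - normalize(equalize(neutral_img)))
```

Subtracting a frame from itself naturally yields an all-zero difference image; the interesting regions in Fig. 3.4 come from the parts of the face that actually move.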
Based on this detection stage, the features for static and dynamic deformation will be extracted in the next two sections.
3.3 Features for Static Schemes
3.3.1 Color
The raw gray-level or color intensities of the pixels in the image are one of the most natural ways to describe a subregion or patch, though they are far from robust in realistic use. For example, the most popular RGB color space is easy to obtain, but not all visible colors can be represented by positive values of the red, green and blue components. The CIE (International Commission on Illumination) defined the CIE XYZ and CIE LUV color spaces to overcome this difficulty; their merit is perceptual uniformity. Among the three components, the L value corresponds roughly to luminance or brightness: ignoring U and V, varying L can be used as a gray scale. The U parameter mostly captures shifts from green to red (with increasing U), and the V parameter mostly captures blue and purple type colors. Both are chroma components.
In our application, in order to balance and enrich the information for the regions around the DOG keypoints of section 3.2.1, we also average the LUV pixel values over 8× 8 regions. The color information, denoted C, is represented by a 3-dimensional vector and clustered into a codebook of size 24.
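As an illustrative sketch (the RGB-to-LUV conversion itself is omitted; the input patch is assumed to be already in LUV), the 3-dimensional colour vector C and its nearest-codeword assignment could look like this:

```python
import numpy as np

def color_feature(luv_patch):
    """luv_patch: (8, 8, 3) array of L, U, V values around a keypoint.
    Returns the 3-vector C of per-channel averages."""
    return luv_patch.reshape(-1, 3).mean(axis=0)

def quantize(feature, codebook):
    """Assign the 3-vector to the nearest codebook centre (24 in our case);
    the codebook itself would come from clustering the training features."""
    distances = np.linalg.norm(codebook - feature, axis=1)
    return int(np.argmin(distances))
```

A uniformly coloured patch simply yields its own LUV value as the feature, which is then mapped to the closest of the 24 centres.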
3.3.2 SIFT
The SIFT (Scale-Invariant Feature Transform) descriptor was proposed by [Low99b]. It computes a highly distinctive descriptor for the local image region after the DOG key point detection of section 3.2.1. Furthermore, the descriptor should be as invariant as possible to the remaining variations, such as changes in illumination or viewpoint.
As the keypoint locations at particular scales are known and principal orientations have been assigned, the important invariances to image location, scale and rotation are ensured. A descriptor vector is extracted at each keypoint on the image closest in scale to the assigned scale of the current point.
First, a 4×4 array of histograms with 8 bins each is created. These histograms are computed from the magnitude and orientation values of samples in a 16×16 region around the keypoint, such that each histogram contains samples from a 4×4 subregion of the original neighborhood region. Fig. 3.5 shows a 2×2 descriptor array computed from an 8×8 set of samples, whereas the full descriptor uses 4×4 histograms computed from a
Figure 3.5: The histogram computation for SIFT descriptor
16×16 sample array. The magnitudes are further weighted by a Gaussian function with σ equal to one half the width of the descriptor window, illustrated by the circular window on the left side of Fig. 3.5. The descriptor then becomes the vector of all the values of these histograms. Since there are 4×4 = 16 histograms, each with 8 bins, the vector has 128 elements. This vector is normalized to unit length in order to enhance invariance to affine changes in illumination. The influence of large gradient magnitudes is reduced by thresholding the values in the unit feature vector to be no larger than 0.2 each, and then renormalizing to unit length.
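A condensed sketch of this construction follows; for brevity it omits the trilinear interpolation between bins and the rotation of the patch to the keypoint orientation, so it is a simplified illustration rather than the full SIFT descriptor:

```python
import numpy as np

def sift_descriptor(patch):
    """patch: 16x16 grayscale window around the keypoint.
    Returns the 4x4x8 = 128-dimensional descriptor."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = (np.degrees(np.arctan2(gy, gx)) + 360.0) % 360.0
    # Gaussian weight with sigma = half the window width
    yy, xx = np.mgrid[0:16, 0:16]
    g = np.exp(-((yy - 7.5) ** 2 + (xx - 7.5) ** 2) / (2 * 8.0 ** 2))
    w = mag * g
    desc = np.zeros((4, 4, 8))                 # 4x4 cells, 8 orientation bins
    bins = (ang // 45).astype(int) % 8
    for y in range(16):
        for x in range(16):
            desc[y // 4, x // 4, bins[y, x]] += w[y, x]
    v = desc.ravel()
    v /= np.linalg.norm(v) + 1e-12             # unit length
    v = np.minimum(v, 0.2)                     # damp large gradient magnitudes
    return v / (np.linalg.norm(v) + 1e-12)     # renormalize
```

The final vector is non-negative and of unit length, so two descriptors can be compared directly with the Euclidean distance mentioned below.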
The dimension of the descriptor, 128, may seem a bit high, but the computational cost of matching remains relatively low when using the Euclidean distance between feature vectors. SIFT descriptors have also proved invariant to minor affine changes. Therefore, SIFT features are highly distinctive, robust and suitable for highly textured regions.
3.3.3 Shape Context
Shape context was proposed by Belongie and Malik [BMP00] as a contour-based feature. It is another popular descriptor in object recognition.
As shown in Fig. 3.6, an edge detector is first applied to an image I(x, y), which yields a set of edge points P = {p1, p2, . . . , pn}. For each point pi on the shape, consider the n − 1 vectors obtained by connecting pi to all other points. The set of all these vectors is a rich description of the shape localized at that point, but it is far too detailed, since shapes and their sampled representations may vary from one instance to another within the same category. The key idea in the proposal of Belongie and Malik [BMP00] is that
Figure 3.6: Left and center: samples of edge points for two similar shapes. Right: the log-polar histogram bins used to compute the shape context.
the distribution over relative positions is a robust, compact, and highly discriminative descriptor. So, for the point pi, the coarse histogram of the relative coordinates of the remaining n − 1 points is defined as the shape context of pi. The bins are normally taken to be uniform in log-polar space around the keypoint. That the shape context is a rich and discriminative descriptor can be seen in Fig. 3.7, which shows the shape contexts of two different versions of the letter "A".
Figure 3.7: Shape context as a discriminative descriptor
Fig. 3.6 shows the sampled edge points of the two shapes on the left and center; on the right is the log-polar diagram, with five bins for log r and twelve bins for θ, used to compute the shape context. Fig. 3.7 shows the corresponding shape contexts for three points. As can be seen, the left and center histograms are the shape contexts of two points in corresponding positions, so they are quite similar, while the shape context on the right is very different, since that point lies at a low, angle-like position.
Because shape contexts are distributions represented as histograms, the similarity between two vectors is evaluated with the χ² test statistic.
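The log-polar histogram and the χ² comparison can be sketched as follows; the radial bin edges here (placed relative to the mean distance to the other points) are an illustrative normalisation choice, not the exact one of [BMP00]:

```python
import numpy as np

def shape_context(points, index, n_r=5, n_theta=12):
    """Log-polar histogram (5 log-radius x 12 angle bins, as in Fig. 3.6)
    of the other edge points relative to points[index]."""
    p = points[index]
    others = np.delete(points, index, axis=0) - p
    r = np.hypot(others[:, 0], others[:, 1])
    theta = (np.arctan2(others[:, 1], others[:, 0]) + 2 * np.pi) % (2 * np.pi)
    # log-spaced radius edges relative to the mean pairwise distance
    r_edges = np.logspace(np.log10(r.mean() / 8),
                          np.log10(r.mean() * 2), n_r + 1)
    r_bin = np.clip(np.searchsorted(r_edges, r) - 1, 0, n_r - 1)
    t_bin = (theta / (2 * np.pi / n_theta)).astype(int) % n_theta
    hist = np.zeros((n_r, n_theta))
    np.add.at(hist, (r_bin, t_bin), 1)
    return hist.ravel() / hist.sum()           # normalised 60-bin descriptor

def chi2_distance(h1, h2):
    """Chi-squared statistic between two normalised histograms."""
    denom = h1 + h2
    mask = denom > 0
    return 0.5 * np.sum((h1[mask] - h2[mask]) ** 2 / denom[mask])
```

Two identical histograms have χ² distance zero, and the distance grows as the point distributions diverge, which is exactly the behaviour needed to match the two "A" shapes in Fig. 3.7.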
3.3.4 LBP
The local binary pattern (LBP) operator is defined as a gray-scale invariant texture measure, derived from a general definition of texture in a local neighborhood. It has become a very powerful measure of image texture, showing excellent results in many studies. The original LBP operator labels the pixels of an image by thresholding the 3×3 neighborhood of each pixel against the value of the center pixel and considering the result as a binary number. Fig. 3.8 shows an example of the LBP calculation [HPA04]. The 256-bin histogram of the labels computed over a region can be used as a texture descriptor. Here, each bin (LBP code) can be regarded as a micro-texton. The local primitives codified by these bins include different types of curved edges, spots, flat areas, etc.; Fig. 3.9 shows some examples. The operator has important advantages, such as its robustness to any monotonic gray-level change and its computational simplicity. The LBP operator has also been extended to neighborhoods of different sizes and to rotation. Here we select the original 256-bin LBP operator, though we use it in a different manner.
Figure 3.8: An example of LBP computing [HPA04].
Figure 3.9: Examples of texture primitives which can be detected by LBP [HPA04].
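The 3×3 operator of Fig. 3.8 can be written down directly; the bit ordering below (clockwise from the top-left neighbour) is one convention among several, so codes may differ from other implementations by a rotation of the bits:

```python
import numpy as np

# neighbour offsets, clockwise from top-left, paired with increasing bit weight
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(patch3x3):
    """LBP code of the centre pixel of a 3x3 patch: each neighbour that is
    >= the centre contributes one bit, giving a value in [0, 255]."""
    c = patch3x3[1, 1]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if patch3x3[1 + dy, 1 + dx] >= c:
            code |= 1 << bit
    return code

def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels of a region."""
    hist = np.zeros(256, dtype=int)
    for y in range(1, image.shape[0] - 1):
        for x in range(1, image.shape[1] - 1):
            hist[lbp_code(image[y - 1:y + 2, x - 1:x + 2])] += 1
    return hist
```

A flat region maps every pixel to code 255 (all neighbours equal the centre), while an isolated bright centre pixel maps to 0, illustrating the operator's invariance to monotonic gray-level changes.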
Generally, the face image is gridded and divided into small sub-regions of equal size. Histograms are computed for the sub-regions and concatenated into a single, spatially enhanced feature histogram. In the dissimilarity measure between histograms, weights are set for the different sub-regions. In our approach, as the facial organs have been detected as described in section 3.2.2, we propose a new mask for the division. Our division is derived from the difference images obtained in section 3.2.2, and it considers the natural distribution of the facial organs. For the normalized face image of resolution (l × l), the layout of the mask and the 8 blocks is shown in Fig. 3.10, and their positions are determined as follows:
1. Left block of left eye: (0, 0) to (3/16 l, 3/8 l)
2. Right block of left eye: (3/16 l, 0) to (3/8 l, 3/8 l)
3. Left block of right eye: (5/8 l, 0) to (13/16 l, 3/8 l)
4. Right block of right eye: (13/16 l, 0) to (l, 3/8 l)
5. Left block of nose: (1/4 l, 3/8 l) to (1/2 l, 3/4 l)
6. Right block of nose: (1/2 l, 3/8 l) to (3/4 l, 3/4 l)
7. Left block of mouth: (1/8 l, 3/4 l) to (1/2 l, l)
8. Right block of mouth: (1/2 l, 3/4 l) to (7/8 l, l)
Figure 3.10: (Left) A face image. (Center) Identified changed parts. (Right) Masked and divided in 8 blocks.
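The block layout above can be written down directly. The function below is a hypothetical helper (its name and the integer rounding are our choices) that maps the fractions of l to pixel bounds:

```python
def face_blocks(l):
    """Pixel bounds (x0, y0, x1, y1) of the 8 blocks for an l x l face,
    following the fractions of l given in the text."""
    f = lambda num, den: num * l // den   # integer rounding is an implementation choice
    return {
        "left_eye_L":  (0,         0,        f(3, 16),  f(3, 8)),
        "left_eye_R":  (f(3, 16),  0,        f(3, 8),   f(3, 8)),
        "right_eye_L": (f(5, 8),   0,        f(13, 16), f(3, 8)),
        "right_eye_R": (f(13, 16), 0,        l,         f(3, 8)),
        "nose_L":      (f(1, 4),   f(3, 8),  f(1, 2),   f(3, 4)),
        "nose_R":      (f(1, 2),   f(3, 8),  f(3, 4),   f(3, 4)),
        "mouth_L":     (f(1, 8),   f(3, 4),  f(1, 2),   l),
        "mouth_R":     (f(1, 2),   f(3, 4),  f(7, 8),   l),
    }
```

For example, with l = 64 the left block of the left eye spans pixels (0, 0) to (12, 24).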
The LBP histogram is calculated in each block. The histograms of the 8 blocks are then
concatenated into a single feature histogram containing K bins (in our case K = 2048,
as shown in Fig. 3.11). Compared to other methods, our way of division has important
properties:
1. Each block normally contains half of a facial component.
2. It includes the essential parts of the face that change when an expression passes
across it.
3. It adapts to different expressions, because the degree of change varies across
expressions.
Figure 3.11: The face image in Fig. 3.10 is represented by a concatenation of 8 local LBP histograms.
3.3.5 Gabor features
A Gabor filter is a linear filter used in computer vision for various tasks such as edge
detection or texture description. Frequency and orientation are the two parameters of a
Gabor filter, and its representations are reported to be similar to those of the human visual
system. It has been found to be particularly appropriate for texture representation and
discrimination. In the area of facial expression recognition, it is one of the most popular
descriptors.
We also use Gabor descriptors, which are robust with respect to illumination
variations, scaling, translation, and distortion. Using the code provided by Zhu et
al. [ZVM04], a Gabor jet with 40 coefficients at 5 different scales and 8 orientations is
computed and stored at pixel locations of the gray-level images from Section 2.1. These
jets are clustered by k-NN into a vocabulary of size 100. The histogram based on this
vocabulary is generated to represent each expressional face.
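For illustration, a standard real-valued Gabor kernel and a 5-scale, 8-orientation bank can be built as below. The parameter values (kernel size, wavelengths, bandwidths) are illustrative assumptions, not those of the code of [ZVM04]:

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    """Real part of a standard 2-D Gabor kernel: a sinusoidal carrier of the
    given wavelength/orientation under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)       # rotate coordinates by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

# A jet of 5 scales x 8 orientations, as in the text (40 coefficients per pixel).
bank = [gabor_kernel(31, 4 * 2**(s / 2), o * np.pi / 8, 2 * 2**(s / 2))
        for s in range(5) for o in range(8)]
```

Convolving an image with each kernel of the bank yields the 40 jet coefficients at every pixel.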
3.4 Features for Dynamic Schemes
3.4.1 Introduction to Temporal Texture Analysis
The usage of specific dynamic textures (DTs) is one of the most popular techniques in
computational intelligence. It is often used to describe real-world image sequences with
a certain form of regularity.
Normally there are two main categories of techniques: correlation-based and differential.
Optical flow is the standard method for computing pixel motion. Normally only gray-level
intensity values are kept to avoid unnecessary complications.
3.4.2 Dynamic Deformation for Facial Expression
Figure 3.12: The cues of facial movements [Bas79].
As Fig. 3.12 suggests, measuring deformations directly on frontal faces is difficult
and sometimes very confusing. Each component (facial organs like eyes, mouth)
also varies in size, shape and color due to individual differences. As a solution, [Bas79]
suggested that face motion would allow expressions to be identified even with minimal
information about the spatial arrangement of features. His pioneering work reported that
facial expressions were more accurately recognized from dynamic images than from a
single static image. Since then, various approaches have been applied to capture the temporal
information from a neutral face to an expressive one. [PK09] used motion magnification to
exaggerate these movements regarding subtle expressions, [XLC08] observed the pixel-
based temporal pattern on sequences that are normalized to a fixed length, and [ZP09]
extended the feature extraction from traditional ones on static image to spatiotemporal
domain. The authors proposed to use LBP on three orthogonal planes, called LBP-TOP:
for the three axes X, Y and T, an LBP histogram is extracted from the XY, XT and YT slices.
However, the appearance and characteristics of the XY, XT and YT planes are very different,
as shown in Fig. 3.13.
Figure 3.13: Left: XY(front face); Center: YT slice; Right: XT slice.
LBP can be successfully used for the frontal face on the XY plane, but it is not appropriate
to capture the movements, because time flows in only one direction, not four or eight.
The texture on the XT and YT slices, which comes from the dynamic deformation of facial
components, shows a different orientational tendency. For the same expression performed
by different subjects, it can be noticed that the YT slices have a similar texture. This similarity
comes from the uniform movement orientation of the facial components in the vertical direction
for different subjects during the same expression. After observing the characteristics of
the XT and YT planes, no regular and systematic texture is found on the XT planes at the same
Y position. This is why we select the YT plane, though one can note that we might miss the
horizontal movements of the facial muscles.
Figure 3.14: The dynamic deformation for different expressions on vertical slices.
Inspired by these works, and after observing the repeatable texture for different
expressions as in Fig. 3.14, we propose to extract the dynamic deformation on vertical
slices. Let S = {I1, I2, . . . , IT} be a temporally ordered face sequence, where T is the
number of frames in S and where each Ii has a fixed resolution n×m. In order to extract the
dynamic information related to facial expressions, for each value of x, with 1 ≤ x ≤ n,
we decompose S to n spatiotemporal slices Px as in Fig. 3.15.
Moreover, as we want to track the texture (VTB) and shape (moments) changes of
different facial components, we separate each slice of height m into three blocks corresponding
to the three main components with different heights mk: eyes (m1 = 3/8 m), nose
(m2 = 1/4 m) and mouth (m3 = 3/8 m). In the time axis, we track the changes not from
the whole sequence but from overlapped subsequences. Each subsequence includes τs
frames. After this, the two textons are extracted from spatiotemporal planes to describe
dynamic deformation.
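The decomposition of S into the n spatiotemporal slices Px can be sketched as follows, assuming the sequence is stored as an array of shape (T, m, n):

```python
import numpy as np

def vertical_slices(S):
    """Decompose a sequence S of shape (T, m, n) into n spatiotemporal
    YT slices P_x of shape (m, T), one per column x of the frames."""
    T, m, n = S.shape
    return [S[:, :, x].T for x in range(n)]
```

Each slice stacks the same column of pixels across all frames, so vertical facial movements appear as oriented texture on it.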
3.4.3 VTB Descriptor for Dynamic Feature
The VTB pattern descriptor is a gray-level invariant texture primitive statistic, designed
to be used on the spatiotemporal planes extracted in the previous section.
Figure 3.15: 3 blocks on VT plane.
The operator is related to a sequence of three consecutive images, i.e. τs = 3. In the third image,
each pixel is used to threshold its two backward neighboring pixels with the same coordinates.
Similar to the original LBP operator, we build the descriptor on a 3×3 neighborhood as in
Fig. 3.16. The binary result labels the middle-right pixel ('6' in Fig. 3.16).
Figure 3.16: VTB computing.
For each pixel p with coordinates (x, y) and gray value g_{x,y} in an image I_t at time t,
the binary code is produced as in Eq. 3.5:

VTB = s(g_{x,y-1,t-2} - g_{x,y-1,t}) 2^5
    + s(g_{x,y,t-2} - g_{x,y,t}) 2^4
    + s(g_{x,y+1,t-2} - g_{x,y+1,t}) 2^3
    + s(g_{x,y+1,t-1} - g_{x,y+1,t}) 2^2
    + s(g_{x,y,t-1} - g_{x,y,t}) 2^1
    + s(g_{x,y-1,t-1} - g_{x,y-1,t}) 2^0     (3.5)

where s(x) = 1 if x > 0, and s(x) = 0 if x ≤ 0.
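Eq. 3.5 translates directly into code. The sketch below assumes a slice is indexed as slice[y, t]; the function name is ours:

```python
import numpy as np

def s(x):
    """Sign threshold of Eq. 3.5: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0

def vtb_code(g, y, t):
    """VTB label of pixel (y, t) on a spatiotemporal slice g (Eq. 3.5):
    the pixel at time t thresholds its neighbors at t-2 and t-1
    in rows y-1, y, y+1, giving a 6-bit code (0..63)."""
    return (s(g[y - 1, t - 2] - g[y - 1, t]) * 2**5
          + s(g[y,     t - 2] - g[y,     t]) * 2**4
          + s(g[y + 1, t - 2] - g[y + 1, t]) * 2**3
          + s(g[y + 1, t - 1] - g[y + 1, t]) * 2**2
          + s(g[y,     t - 1] - g[y,     t]) * 2**1
          + s(g[y - 1, t - 1] - g[y - 1, t]) * 2**0)
```

A constant slice yields code 0, while gray levels decreasing over time set all six bits.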
In facial expression, most facial part movements are vertically oriented and the main
directions can be up or down for different organs. For example, an expression of surprise
is related to eyes moving up and mouth moving down. Therefore we divide the VT
plane into three blocks: eye, nose and mouth regions, as in Fig. 3.15. Finally, a total
vector of 2^6 × 3 = 192 bins is used to represent the movements. Now we have 10 blocks
from appearance and 3 blocks from motion. These histograms are concatenated into a
single one. Such a representation of an image is obtained from the image itself plus
two previous reference images. This extraction can be done per image in the sequence,
except for the first two images. The vectors obtained from LBP+VTB will be used to
identify sequences.
3.4.4 Moments on Spatiotemporal Plane
Coming from physics, an image moment is a particular weighted average of the image
pixels' intensities and can be used as an effective descriptor of global shape. Given a
gray-level image with pixel density I(x, y), the image moments Mp,q are calculated by

M_{p,q} = ∑_x ∑_y x^p y^q I(x, y)     (3.6)
In particular, some simple properties can be derived from the moments, such as:
1. M00: the total gray-level mass of the block
2. M10/M00, M01/M00: the centroid of the gray-level block
Higher order moments can also be derived for shape features. In order to avoid the
curse of dimensionality, we only use three values (M00, M10/M00, M01/M00) for each of
the three blocks on one slice. The moments Mp,q(x, i, k) of each block at position x of
frame Ii with window size τs are calculated on slice Px as follows:

M_{p,q}(x, i, k) = ∑_{y=1}^{m_k} ∑_{t=i-τ_s}^{i} y^p t^q P_x(y, t)     (3.7)
where mk is the height of current block for 1 ≤ k ≤ 3 as in Fig. 3.15.
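A minimal sketch of Eq. 3.7 for p, q ∈ {0, 1} is given below, assuming the temporal window covers the current column and its (τs − 1) predecessors; the exact window convention and the function name are our assumptions:

```python
import numpy as np

def slice_moments(P, i, tau, m_blocks):
    """Per-block M00 and centroid (M10/M00, M01/M00) on a YT slice P of
    shape (m, T), over the temporal window ending at column i (Eq. 3.7)."""
    feats = []
    y0 = 0
    for mk in m_blocks:                      # block heights m1, m2, m3
        block = P[y0:y0 + mk, i - tau + 1:i + 1]
        ys, ts = np.mgrid[1:mk + 1, 1:tau + 1]
        m00 = block.sum()                    # total mass of the block
        m10 = (ys * block).sum()             # p = 1, q = 0
        m01 = (ts * block).sum()             # p = 0, q = 1
        feats += [m00, m10 / m00, m01 / m00]
        y0 += mk
    return feats
```

With three blocks this yields the 3 × 3 = 9 values per slice used in the feature vector.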
For each image Ii, moments are extracted for all values of x, from the current image Ii
and its (τs − 1) previous images in the sequence. These values are combined into a feature
vector for the current frame. This vector will be used to classify the current image into
one of the six basic expressions plus the neutral expression. The first (τs − 1) frames yield no
feature vector. Furthermore, the probabilities that the current image belongs to each of
the seven facial expressions are recorded to predict the possible facial expression in the
whole sequence.
3.5 Conclusion
In this chapter, various features used for the recognition of objects and facial expressions
were introduced and summarized. Two new spatio-temporal descriptors, VTB and
moments on YT planes, were also proposed for dynamic schemes.
In Chapter 5, devoted to tests and results, we will apply these features on benchmark
databases and show their efficiency. Indeed, in facial expression recognition, they make
it possible to combine the existing static solutions with new temporal features to describe
the characteristics of local facial regions.
Furthermore, in the next chapter, we will focus on the classification methods used to build
proper models from these feature vectors and their combinations.
Chapter 4
Recognition and Classification Methods
Contents
4.1 Overview to Machine Learning Methods
4.2 Discriminative Model
4.2.1 Introduction
4.2.2 AdaBoost Method
4.2.3 Support Vector Machine
4.3 Generative Model
4.3.1 Introduction
4.3.2 BoW and Naïve Bayes Implementation
4.3.3 Hierarchical Generative Model
4.3.4 Construction of Hierarchical Dirichlet Processes
4.3.5 Inference and sampling
4.4 Hybrid System: Integrated Boosting and HDP
4.5 Conclusion
This chapter briefly presents the methods from Bayesian statistics used for categorization in
machine vision systems and details our proposal for object classification. These
classifiers are learned through Bayes' rule and applied both to object categorization and to
the specific area of facial expression classification. Among these statistical approaches
to learning and discovery, there are two groups: generative and discriminative models.
Generative models work in a top-down manner while discriminative models are bottom-up
driven. Later in this manuscript, for the complex problem of object recognition, we
will combine the two models and build a set of essential middle-level components. For
the more specific topic of facial expression, the powerful discriminative models are used.
The chapter begins with an overview of learning algorithms based on Bayes rule.
Then, discriminative and generative classifiers are presented. Finally, we present a
hybrid system combining discriminative and generative models.
Parts of this chapter were published in an international conference paper [JIB09].
Readers could also refer to the paper of Teh et al.[TJBB06] for the details about hierar-
chical Dirichlet process used for text modeling for documents.
4.1 Overview to Machine Learning Methods
The recurring problem in object recognition is the need to classify a new observation
into a limited number of categories. In other words, a new image or video sequence
is processed by model-based clustering and a decision is made about which category
it is most likely to belong to. These techniques are not limited to computer vision but
are much more widely applicable to making accurate predictions about complex real-world
phenomena, for example in speech recognition, text classification and bioinformatics
[Jeb04]. The learning and inference approaches fall into two major categories:
generative and discriminative models. Generative models specify a full structured joint
probability distribution over the observed data. The models in this category often rely on
graphical models based on Bayesian reasoning. On the other hand, discriminative models
rely on the conditional probability distribution over the examples. The largest possible
margin separating the classes is optimized by adjusting the parameters of the classifiers.
These methods can assign available labels to new observations, but may also introduce
the problem of over-fitting.
The formulation involves estimating f : X → Y, or P(Y|X). For discriminative classifiers,
some functional form for P(Y|X) is assumed; its parameters are then estimated
directly from training data to maximize the margins separating the different classes.
For generative classifiers, some functional forms for P(X|Y) and P(X) are assumed;
their parameters are estimated directly from training data, and Bayes' rule is then used
to compute the prediction P(Y|X = x_i).
These two powerful paradigms are the main techniques of pattern recognition, artificial
intelligence and perception systems. The discriminative approaches, though empirical,
can usually provide superior performance. Yet various probability models from
generative methods can reflect prior knowledge about the practical domains. We will
choose the proper method for each particular task, or fuse the two frameworks to
combine the complementary powers of the two groups of approaches.
4.2 Discriminative Model
4.2.1 Introduction
One of the traditional problems in machine learning is measuring the similarities
between query samples and training samples. Discriminative models are empirically
successful at solving this problem [Nal04]. This category of models is relatively simple
because discriminative algorithms do not consider the joint distribution but directly
optimize the conditional probability distribution, as noted in Section 4.1.
Here the classification can be defined as the problem of choosing a class C for an
example with a feature vector x, which is obtained from Chapter 3. In the learning stage,
the classifier learns the parameters of discriminant function from the labeled training
data. In the testing stage, the ideal models will map or rank the feature vector x into its
correct class through the discriminative function. As concluded in [UB05], discriminative
models have the following advantages:
1. Discriminative models are flexible when the training data differ significantly;
2. In making predictions for testing samples, discriminative models are typically very
fast, while generative models often require iterative operations;
3. Normally, discriminative methods would provide better predictive performance
since they are trained to predict the class label rather than the joint distribution of
input vectors and targets.
Owing to their robustness and relatively simple nature, discriminative models (eg.
Maximum Entropy, Linear discriminant analysis, Support Vector Machines and Ad-
aBoost) have been preferred in many domains.
4.2.2 AdaBoost Method
AdaBoost, which stands for Adaptive Boosting, is an algorithm formulated by [FS95]. It
aims at constructing a strong classifier as linear combination:
f(x) = ∑_{t=1}^{T} α_t h_t(x)     (4.1)
where ht(x) : X → {−1,+1} are weak classifiers and αt are the weights associated
with each weak classifier or feature. Initially, the distribution over the training samples is
set to be uniform. For a sequence of N labeled training samples, at each iteration t, the weak
learner tries to find a hypothesis ht(x) which is consistent with most of the samples (i.e.,
ht(xi) = yi for most 1 ≤ i ≤ N), with small error. Using the new hypothesis ht(x),
the algorithm generates a new distribution of sample weights, and this process repeats T times.
The final strong hypothesis h_f combines the outputs of the T weak
hypotheses by a weighted majority vote. AdaBoost can also be proved to maximize the
margin, because it chooses the ht(x) with minimal error in each iteration.
Through this boosting algorithm, AdaBoost provides some interesting merits for
machine learning:
1. It produces a strong and complex classifier from relatively simple weak classifiers;
AdaBoost is capable of reducing both the bias (e.g. of stumps) and the variance
(e.g. of trees) of the weak classifiers.
2. It reduces the weakness of a single weak classifier, such as the bias of stumps.
3. It selects the most relevant features by evaluating the empirical error.
4. It is able to obtain a maximal margin among the group of discriminative methods.
5. It generates a series of cascaded classifiers where the number of iterations T can be
decided by the user.
In general, a hypothesis which is accurate on the training set might not be accurate
on examples outside the training set. This problem is usually referred to as over-fitting.
Often, however, overfitting can be avoided by restricting the hypothesis to be simple
[FS95].
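The boosting loop above can be sketched with decision stumps as weak classifiers. This is an illustrative toy implementation, not the cascaded detector of [VJ01]:

```python
import numpy as np

def adaboost_train(X, y, T=10):
    """Minimal AdaBoost with 1-D threshold stumps as weak learners.
    y in {-1, +1}; returns a list of (alpha, feature, threshold, polarity)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # uniform initial sample distribution
    model = []
    for _ in range(T):
        best = None
        for f in range(X.shape[1]):         # exhaustively search stumps
            for thr in np.unique(X[:, f]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, f] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, pred)
        err, f, thr, pol, pred = best
        err = max(err, 1e-10)               # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak learner
        w *= np.exp(-alpha * y * pred)          # re-weight the samples
        w /= w.sum()
        model.append((alpha, f, thr, pol))
    return model

def adaboost_predict(model, X):
    score = sum(a * p * np.where(X[:, f] > t, 1, -1) for a, f, t, p in model)
    return np.sign(score)
```

The final classifier is the sign of the weighted vote of the T stumps, as in Eq. 4.1.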
In our study, the AdaBoost algorithm performs well for the selection of essential features and
components, as demonstrated in Section 4.4 and Chapter 5. It is also successfully used
in face detection by [VJ01]. Their method selects a small number of critical Haar features
from a larger set and yields cascaded classifiers. Their multi-layer detector is one of the
most popular face detectors and is implemented in OpenCV [Bra00].
4.2.3 Support Vector Machine
In the area of machine vision, SVM (Support Vector Machine) often produces state of
the art classification performance [BGV92, Jeb04]. This group of algorithms are based on
the statistical learning theory and the Vapnik-Chervonenkis (VC) dimension introduced
by [CV95]. Intuitively, for the two-group classification problems, SVM conceptually
implements the following idea: input vectors are non-linearly mapped to a very high-
dimension feature space. Given the training data, this procedure generates a model to
predict the target values of the test data. Later, SVM evolved to handle the multi-class
problem by reducing it to multiple binary classification problems, each of which
yields a binary classifier. In implementation, the efficiency of SVM depends
on the choice of kernel, such as linear, polynomial or RBF (Radial Basis Function).
Based on these kernels, several applications (e.g. WEKA by [HFH∗09] in Java, LIBSVM
by [CL01] in C++) have been developed to obtain acceptable results rapidly.
For formalization in the two-category case, a set of training data is provided
as D = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ {1, −1}}, i = 1, . . . , l, where a data point x_i is viewed as an
n-dimensional vector (a list of n real numbers) and y_i is either 1 or −1 to indicate the
class to which the point x_i belongs. The target is to separate these points as clearly as
possible with a hyperplane whose distance from the nearest data points on each side is
maximized. To identify the hyperplane, SVM introduces a normal vector w which is
perpendicular to the hyperplane, and the parameter b/‖w‖ which determines the offset of
the hyperplane from the origin along the normal vector w (shown in Fig. 4.1). For the
possibly misclassified samples, a set of non-negative slack variables ξ_i is introduced as

y_i(w · x_i + b) ≥ 1 − ξ_i     (4.2)

where 1 ≤ i ≤ l and ξ_i ≥ 0. The training vectors x_i are mapped into a higher dimensional
space by a function φ. Then SVM tries to find a linear separating hyperplane with
the maximal margin in this higher dimensional space. With φ, (4.2) becomes (4.3):

y_i(w · φ(x_i) + b) ≥ 1 − ξ_i     (4.3)
The optimization problem becomes to find a large margin and a small error penalty
Figure 4.1: Separating hyperplanes in the linear case [Bur98].
as:

min_{w,b,ξ}  (1/2) w^T w + C ∑_{i=1}^{l} ξ_i     (4.4)

where C is the penalty parameter of the error term. The constraint equations are
multiplied by positive Lagrange multipliers and subtracted from the objective function
to form the Lagrangian formulation. One can then introduce the so-called kernel function:
K(x_i, x_j) ≡ φ(x_i)^T φ(x_j)     (4.5)
Many kernel functions have been proposed by researchers in machine learning (e.g. cloud
basis functions by [DSRDS08]). The basic common ones are:
1. Linear: K(x_i, x_j) = x_i^T x_j
2. Non-linear:
(a) Polynomial (homogeneous): K(x_i, x_j) = (x_i^T x_j)^d
(b) Polynomial (inhomogeneous): K(x_i, x_j) = (γ x_i^T x_j + r)^d, where γ > 0
(c) Radial Basis Function (RBF): K(x_i, x_j) = exp(−γ ‖x_i − x_j‖^2), where γ > 0
(d) Sigmoid (hyperbolic tangent): K(x_i, x_j) = tanh(γ x_i^T x_j + r)
Here, γ, r and d are kernel parameters. To improve the efficiency of SVM, different sets
of (C, γ, d, r) values should be tuned, and the one with the best cross-validation accuracy
is used to train on the whole training set D. To show the effect of different kernels, the
upper graphs of Fig. 4.2 [Bur98] show two examples of a two-class pattern recognition
problem, one separable and one not. The two classes are denoted by circles and disks
respectively. Support vectors are identified with an extra circle. For these machines, the
support vectors are the critical elements of the training set because they lie closest to the
decision boundary. The error in the non-separable case is identified with a cross. In the
lower part of Fig. 4.2, the kernel was chosen to be a cubic polynomial (degree 3).
For the linearly separable case (lower left), the solution is still roughly linear, while the
linearly non-separable case (lower right) has become separable.
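The common kernels listed above can be written directly; the parameter defaults below are arbitrary illustrative choices:

```python
import numpy as np

def linear(xi, xj):
    """K(xi, xj) = xi^T xj"""
    return xi @ xj

def polynomial(xi, xj, gamma=1.0, r=1.0, d=3):
    """Inhomogeneous polynomial kernel: (gamma * xi^T xj + r)^d"""
    return (gamma * (xi @ xj) + r) ** d

def rbf(xi, xj, gamma=0.5):
    """RBF kernel: exp(-gamma * ||xi - xj||^2)"""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def sigmoid(xi, xj, gamma=0.5, r=0.0):
    """Sigmoid kernel: tanh(gamma * xi^T xj + r)"""
    return np.tanh(gamma * (xi @ xj) + r)
```

Each function maps a pair of input vectors to the dot product φ(x_i)^T φ(x_j) in the induced feature space, as in Eq. 4.5.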
To use the trained model on a new data point x, an SVM computes the dot products of
the test point x with w, or more specifically the sign of

f(x) = ∑_{i=1}^{N_s} α_i y_i φ(s_i) · φ(x) + b = ∑_{i=1}^{N_s} α_i y_i K(s_i, x) + b     (4.6)

where N_s is the number of support vectors s_i and the α_i are the Lagrange multipliers of
the training points. In this solution, the points for which α_i > 0 are called support vectors,
and they lie on one of the margin hyperplanes of (4.2).
SVM is originally a binary classification method. To extend it to multiclass SVM,
the most common methods are built on the one-versus-all and the one-versus-one
strategies [HL02]. The one-versus-all method uses one binary classifier to separate the
current class from all the remaining classes, and the winner takes all. The one-versus-one
method, also called the pairwise method, builds a binary classifier for each pair of classes
and then uses max-voting for the final decision. Still, the efficiency of multiclass SVM is
influenced by the characteristics of the datasets in practical use. The empirical study by
[bDK05] showed that the pairwise method proposed by [HT98] is highly recommended.
We will use it as the kernel discriminant method for solving multiclass problems in Chapter 5.
Figure 4.2: Examples of SVM kernels [Bur98].
4.3 Generative Model
4.3.1 Introduction
In the area of machine learning, generative approaches use the joint distribution
of the observable data. This group of methods is often cast as probabilistic graphical
models [Jeb04]. These methods provide a rich framework for imposing structure and prior
knowledge to estimate models from available observations or training data. In contrast
to the empirical nature of discriminative models, generative models can prove
informative in understanding the form of the probability distribution represented by
the model [Bis06].
Suppose a testing image (object or face) I∗ is described by a vector X consisting of
features extracted from it. The trained model will make a decision to assign I∗
to one of C classes, c = 1, . . . , C, or to a new class C + 1. Generative approaches
introduce the joint distribution p(c, X). For learning, Bayes' theorem is applied using
the prior probability p(c) and the class-conditional densities p(X|c) as

p(c|X) = p(X|c) p(c) / ∑_{j=1}^{C} p(X|j) p(j)     (4.7)
This category of models, such as Naïve Bayes, Hidden Markov Models (HMM) and
Mixtures of Gaussians, has become a prominent tool in the computer vision domain,
especially for object recognition and feature extraction, because these applications benefit
greatly from probabilistic methods that estimate the statistical relationships between
images and features. For the generative modeling framework, as mentioned in [UB05],
the relative merits can be summarized as:
1. It can handle noisy data such as missing or partially labeled ones.
2. A new class c + 1 can be added incrementally by learning its class-conditional
density p(X|c + 1) independently of all the previous classes.
3. It can readily handle compositeness and proportion.
Normally, based on prior knowledge, generative models learn the parameters from
training data to maximize the data likelihood. This group of models are robust and
with high accuracy, but the inference and classification speed are much slower than
discriminative models. In the following sections, we will apply it in general object
recognition based on Bag of Words (BoW) and Hierarchical Dirichlet Processes (HDP).
4.3.2 BoW and Naïve Bayes Implementation
The Bag of Words (BoW), or bag of features, model is originally a very popular method in
natural language processing (NLP). It represents a document as an unordered collection
of words, disregarding grammar and even word order [Lew98]. This dictionary-based
method has achieved great success in areas such as spam filtering, search engine design and
semantic analysis. Recently it has also been extended to computer vision, especially object
categorization (e.g. [LP05, SRE∗05b]).
When migrating from the text domain to the image domain, some basic elements of the
BoW model change as well. Words, which are much more clearly defined in text processing,
are now replaced by feature vectors extracted from local patches (regions), and the
traditional dictionary evolves into a "codebook". Normally these visual codebooks are
generated by K-means clustering over all the training vectors. Contrary to a natural
language dictionary, the size of the codebook (the number of available words) is flexible.
In this visual dictionary,
each word is the center of a group of similar feature vectors. A feature vector (e.g.
SIFT in Section 3.3.2) is then mapped to its corresponding visual word in the codebook,
and the image can be abstractly represented by a histogram of visual words.
To formalize the method, given a training set with J images, a feature vector t_j
is used to denote the label associated with each image j, where t_jc ∈ {0, 1}, c = 1, . . . , C,
and j = 1, . . . , J. Here, the class label is related to the existence of objects and not
directly to images: 0 means that the class is absent from image j and 1 means
that it is present. Each image j is represented by a feature vector X_j which consists
of I_w components, where the ith visual word instance x_ji in image j is a draw from a
distribution (or histogram) F(θ_ji) for association to a visual word v_ji belonging
to a vocabulary of size W. Here, x_ji is an extracted feature such as a color, shape or texture
descriptor as in Chapter 3. For a codebook V = {v_1, v_2, . . . , v_W}, the Naïve Bayes model is
used. Based on training data, we wish to maximize the likelihood by learning the latent
variables. After this inference stage, for the testing image I∗, the class c∗ is decided by
the probability as

c∗ = argmax_c p(c|V∗) = argmax_c p(c) p(V∗|c) = argmax_c p(c) ∏_{w=1}^{W} p(V∗_w|c)     (4.8)
Given all the features extracted from testing image I∗, the posterior probability of
class C(I∗) can be found by marginalizing out all the features. This basic model is
later extended to hierarchical Bayesian models such as pLSA (probabilistic latent semantic
analysis [Hof99]) and LDA (latent Dirichlet allocation [BNJ03]). In order to exploit the
non-parametric properties of generative models, we apply HDP (Hierarchical Dirichlet
Processes) in our learning system in the following sections. Our target is to construct
the set of middle-level components for general object categories. The number of components
is unknown and will be inferred iteratively. The non-parametric solution is naturally
more appropriate in this case.
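A minimal sketch of the BoW + Naïve Bayes classifier of Eq. 4.8 is given below, assuming the visual-word histograms have already been computed; the Laplace smoothing and the function names are our assumptions:

```python
import numpy as np

def train_nb(histograms, labels, n_classes, alpha=1.0):
    """Multinomial Naive Bayes over visual-word histograms.
    histograms: (J, W) array of visual-word counts per image;
    alpha: Laplace smoothing constant (an assumption, not from the text)."""
    W = histograms.shape[1]
    log_prior = np.zeros(n_classes)
    log_word = np.zeros((n_classes, W))
    for c in range(n_classes):
        Xc = histograms[labels == c]
        log_prior[c] = np.log(len(Xc) / len(histograms))    # p(c)
        counts = Xc.sum(axis=0) + alpha                     # smoothed word counts
        log_word[c] = np.log(counts / counts.sum())         # p(v_w | c)
    return log_prior, log_word

def classify_nb(model, hist):
    """argmax_c log p(c) + sum_w hist[w] * log p(v_w | c), as in Eq. 4.8."""
    log_prior, log_word = model
    return int(np.argmax(log_prior + (log_word * hist).sum(axis=1)))
```

Working in log space avoids underflow when the product over W words is large.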
4.3.3 Hierarchical Generative Model
To explain the method which we use for object recognition, we adopt an extended
Chinese Restaurant Franchise (CRF) [TJBB06] as a metaphor for the hierarchical Dirichlet
process used here. Suppose there are multiple restaurants, for instance Chinese, Japanese,
French and Italian, with a menu shared across the restaurants. Some dishes are shared:
fried rice, for example, among the Asian restaurants, and noodles possibly between the
Chinese and Italian ones. At each table of each restaurant, one dish
is ordered from the menu and shared among all customers who sit at that table. A
new customer will tend to select a table where customers with a similar cultural or
professional background are sitting. From all the dishes on the tables and the customers'
seating pattern, we try to find the most prominent dishes, for example Peking duck for
Chinese cuisine or foie gras for French cuisine. Thus, from the dishes, we can judge the
dominant flavor of the restaurant, just as we identify the main object in an image. Here the
restaurants correspond to images, the tables correspond to latent mixture components, and the
customers correspond to the visual words.
4.3.4 Construction of Hierarchical Dirichlet Processes
We consider a category which includes multiple images and each image can be modeled
as a mixture with different mixing proportions using shared components. Each compo-
nent is a mixture of visual words with different mixing proportions. The components
and the number of components will be inferred from training data.
The graphical representation of the model is shown in Fig. 4.3 [TJBB06]. The global
probability measure G0 is distributed as a Dirichlet process with hyperparameters γ
and H, i.e. G0|γ, H ∼ DP(γ, H), denoted by the stick-breaking construction as:

G0 = ∑_{k=1}^{∞} β_k δ_{φ_k}     (4.9)
where the β_k are the global mixing proportions, δ_{φ_k} is an atom at φ_k, and the φ_k denote the
random variables distributed according to H, i.e. the components shared among images.
Figure 4.3: HDP model.
The random measures G_j are also Dirichlet processes, G_j|α0, G0 ∼ DP(α0, G0), and can be
written as

G_j = ∑_{k=1}^{∞} π_{jk} δ_{φ_k}     (4.10)
where the π_{jk} are the mixing proportions for image j, with hyperparameter α0.
The base measure H provides the prior distribution for the parameter θji, which is the factor corresponding to a single observation of visual word xji in image j. The stick-breaking construction provides the prior distribution of G0 as the global random measure:

β | γ ∼ GEM(γ),   πj | α0, β ∼ DP(α0, β),   φk | H ∼ H    (4.11)
where

βk = β′k ∏_{l=1}^{k−1} (1 − β′l),   β′k ∼ Beta(1, γ)    (4.12)

and

πj = (πjk)_{k=1}^{∞},   πjk = π′jk ∏_{l=1}^{k−1} (1 − π′jl),   π′jk ∼ Beta(α0 βk, α0 (1 − ∑_{l=1}^{k} βl))    (4.13)
Here GEM denotes the so-called Griffiths-Engen-McCloskey distribution, a probability law for sequences arising as a residual allocation model (RAM) [GK01]. It is a popular partition model with many remarkable properties; in particular, it is among the most analytically tractable cases.
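As an illustration of Eq. 4.12, the stick-breaking weights can be simulated by a truncated sample (a minimal sketch; the function name and truncation level are our own choices, not part of the model):

```python
import numpy as np

def stick_breaking(gamma, truncation, rng):
    """Draw a truncated sample of the global weights beta ~ GEM(gamma).

    Each beta'_k ~ Beta(1, gamma); beta_k = beta'_k * prod_{l<k} (1 - beta'_l).
    """
    beta_prime = rng.beta(1.0, gamma, size=truncation)
    # Length of stick remaining before breaking off piece k.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta_prime)[:-1]))
    return beta_prime * remaining

rng = np.random.default_rng(0)
beta = stick_breaking(gamma=1.0, truncation=50, rng=rng)
# The weights are non-negative and sum to at most 1; the remaining
# mass lies in the truncated tail.
```

For moderate γ, increasing the truncation level makes the leftover tail mass negligible.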
4.3.5 Inference and sampling
Figure 4.4: The mixture of components.
Based on this prior, HDP uses straightforward Gibbs sampling [TJBB06] within a Markov chain Monte Carlo (MCMC) algorithm for posterior sampling, in order to learn the parameters of the mixture model. As in Fig. 4.4, the ith observation xji in image j is a draw from a distribution F(θji) and is associated with a visual word vji belonging to a vocabulary of size W. vji is associated with a component instance tji, and tji is associated with a component kjt from a component pool of size K; zji denotes the index of kjt, which is a mixture of heterogeneous visual words. For better understanding, we use the restaurant metaphor of Section 4.3.3 to explain the sampling method, separating the two main steps of the hierarchical model: the selection of a table (an instance of a component) and the selection of a component (the class of the component).
Sampling instance t. Here −ji or −jt in a superscript denotes that the corresponding variable is removed from the dataset; e.g. x^{−ji} = x \ {xji} means that, for x = (xji : all j, i), we consider all the data in x except xji itself, as xji is the last customer entering the restaurant to select a table, i.e. the ith visual word selecting a component instance. The conditional distribution of the table tji is

p(tji = t | t^{−ji}, k) ∝
  n^{−ji}_{jt} · f^{−xji}_{kjt}(xji)             if t is already used
  α0 · p(xji | t^{−ji}, tji = t^{new}, k)        if t = t^{new}    (4.14)

where n^{−ji}_{jt} is the number of visual words in image j at component instance t after removing xji, and f^{−xji}_{kjt}(xji) is the conditional density of xji under component kjt given all data items except xji itself. While iterating the updates of tji, new component instances can appear, and instances that become empty are discarded. As a result, mixing components left with zero instances are deleted, while a new instance may introduce a new mixing component. In this way the model adapts the number K of shared components.
Sampling component k. Since changing the component kjt also changes the membership of all its data items, the visual words at instance t can select an old component or begin a new one; the conditional probability of the component kjt is

p(kjt = k | t, k^{−jt}) ∝
  m^{−jt}_{·k} · f^{−xjt}_{k}(xjt)    if k is already used
  γ · f^{−xjt}_{k^{new}}(xjt)         if k = k^{new}    (4.15)

where m^{−jt}_{·k} denotes the number of component instances (over all images) associated with component k, and f^{−xjt}_{k}(xjt) is the conditional density of the data xjt at instance t under component k. The number of components K is also possibly adjusted here. After the sampling iterations, we obtain the distribution of global latent components for each class.
For the inference part, as in Fig. 4.4, on the image training set in the right column, which includes M images, we consider each visual word from one image as the last customer entering the restaurant to select a table. It selects one of the component instances with a probability proportional to the conditional density of xji, or adds a new instance with a probability proportional to the hyperparameter α0. For each instance, the mixing proportions over the associated visual words change across iterations. As a result, some components in the central column become associated with zero instances and are deleted, while a new instance may introduce a new mixing component with a probability proportional to the hyperparameter γ. In this way the model adapts the set of components and its proper size K. More details about HDP are given in [TJBB06].
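The table-selection step of Eq. 4.14 can be sketched as follows (a minimal illustration with our own function and variable names; the conditional densities are assumed to be precomputed elsewhere):

```python
import random

def sample_table(table_counts, alpha0, likelihoods, new_likelihood):
    """One draw of Eq. 4.14: pick an existing table with probability
    proportional to n_jt * f(x_ji), or a new table proportional to
    alpha0 * f_new(x_ji).

    table_counts[t] : customers already seated at table t (x_ji removed)
    likelihoods[t]  : conditional density of x_ji under the dish at table t
    new_likelihood  : prior predictive density of x_ji for a new table
    Returns the table index; len(table_counts) means "open a new table".
    """
    weights = [n * f for n, f in zip(table_counts, likelihoods)]
    weights.append(alpha0 * new_likelihood)
    total = sum(weights)
    r = random.uniform(0.0, total)
    acc = 0.0
    for t, w in enumerate(weights):
        acc += w
        if r <= acc:
            return t
    return len(table_counts)
```

With α0 = 0 the customer can never open a new table, so the adaptation of K in the text hinges on the α0 (and, at the component level, γ) terms.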
4.4 Hybrid System: Integrated Boosting and HDP
We have introduced two classes of statistical frameworks used in machine learning for modeling training data and predicting the label of new input data. Among these two training frameworks, discriminative techniques are widely used since they give excellent generalization performance, while generative models can handle noisy training data and deal with potential partial overlap between different classes. Hybrid systems combining generative and discriminative models were introduced by the work of Raina et al. [RSNM03], which used naive Bayes and logistic regression to improve the classification of documents. Extending this idea to computer vision, Bosch et al. [BZMn08] investigated pLSA (probabilistic Latent Semantic Analysis) and subsequently trained a multiway classifier on the topic distribution vector of each image. In order to gain the benefits of both generative and discriminative approaches, we propose another hybrid system which first uses a generative model to extract basic building blocks at a semantic middle level, then applies a discriminative approach to select the most prominent ones.
Figure 4.5: Hybrid approach for learning
Basically, the training stage (Fig. 4.5) is based on BoostHDP (AdaBoosted Hierarchical Bayesian model). In formal notation, a set of images {I1, . . . , IJ} belongs to one category C. Each image j has Nj descriptors; each descriptor xji, where 1 ≤ i ≤ Nj, can belong to a different dictionary (i.e. color, shape, texture, orientation), and these dictionaries are clustered into a combined vocabulary V with W visual words. One image Ij can be represented by a vector Vj = (vj1, vj2, . . . , vjw, . . . , vjW). So each instance xji from the observations is associated with a visual word vjw.
First, a Dirichlet prior is built using the stick-breaking construction described in Section 4.3.4 with hyperparameter α. The posterior distribution of the mixture weights πc of the component set Z for category C is also Dirichlet, determined by the hyperparameter α and the number of observations Nk currently assigned to each component:

p(πc | z, α) = Dir(N1 + α/K, . . . , NK + α/K)    (4.16)

where Nk = ∑_{i=1}^{N} δ(zi, k) and δ(·, ·) is the Kronecker delta function.
Similarly, assuming λ is the precision of a symmetric Dirichlet prior, the posterior distribution of the mixture weights ηk of descriptors within each component is also Dirichlet, with hyperparameters determined by the number of observations Cw currently assigned to each visual word:

p(ηk | w, λ) = Dir(C1 + λ/W, . . . , CW + λ/W)    (4.17)

where Cw = ∑_{i=1}^{N} δ(wi, w) and δ(·, ·) is the Kronecker delta function.
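Both posteriors simply add the symmetric prior mass (α/K or λ/W) to the current assignment counts; as a small illustration (function name ours):

```python
def dirichlet_posterior_params(counts, precision):
    """Posterior Dirichlet parameters of Eqs. 4.16/4.17: count_k + precision/K."""
    k = len(counts)
    return [n + precision / k for n in counts]

# Component counts N_k for one category, with prior precision alpha = 2.0:
params = dirichlet_posterior_params([12, 3, 0, 5], precision=2.0)
# -> [12.5, 3.5, 0.5, 5.5]
```

Note that even empty components keep a small positive pseudo-count, which is what allows them to be re-populated in later sampling iterations.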
In iteration t, given the previous assignments Z(t−1), for the image set {I1, I2, . . . , Ij, . . . , IJ} depicting category c, where each image Ij contains Nj descriptors, we sequentially sample new assignments Z(t) as described in Algorithm 4.1.
After the sampling iterations, we obtain a set of K latent components, together with the mixture weight sets πc and ηk as the distribution for each class. For one class with Hp positive sample images (belonging to this category), each image is represented as a mixture of components, and each component comprises a set of multiple visual words (Fig. 4.4). Here we integrate the discriminative AdaBoost weak learner from Section 4.2.2 to find the components most relevant for classification, in order to handle the intra-class and inter-class variance.

So we select Hn negative sample images (not belonging to this category) and iterate multiple times as in the inference procedure (Section 4.3.5). The negative samples are also mixtures of components and visual words. For each component we construct a weak classifier hk, which consists of a component zk, a threshold θk and a parity pk. We compute the distance dkj, defined as the Euclidean distance between two normalized instances of
Algorithm 4.1: HDP to build components

1. Sample a random permutation τ(·) of the integers 1, . . . , Nj.

2. Set Z = Z(t−1). For each m in τ(1), τ(2), . . . , τ(Nj), sequentially resample the component zjm as follows:

(a) Remove the feature Djm from the cached statistics of its current component k = zjm:
Nck = Nck − 1,   Ckw = Ckw − 1
Here w is the visual word associated with the current descriptor Djm, Ckw denotes the number of times that visual word w is assigned to component k, and Nck is the number of features in category c assigned to component k.

(b) With probability proportional to the parameter α0, a new component is formed; the probability that Djm picks an existing component is proportional to the number of features already assigned to it. For each of the K existing components, determine the predictive likelihood

fk(Djm) = (Ckw + λ/W) / (∑_w Ckw + λ)    (4.18)

Also determine the likelihood of a potential new component k^new:

f_{k^new}(Djm) = λ / (∑_w Ckw + λ)    (4.19)

(c) Remove all empty component instances, i.e. component instances without any associated descriptor.

(d) Sample a new component assignment zjm for the current descriptor from the following multinomial distribution:

p(zjm = k) = (1/Zm) (Nck + α/K) fk(Djm)    (4.20)

where Zm = ∑_{k=1}^{K} (Nck + α/K) fk(Djm).

(e) Add the feature Djm to the cached statistics of its new component k = zjm, updating
Nck = Nck + 1,   Ckw = Ckw + 1

3. Set Z(t) = Z. Update the component weights and parameters as

π(t)c ∼ Dir(Nc1 + α/K, . . . , NcK + α/K)
η(t)k ∼ Dir(Ck1 + λ/W, . . . , CkW + λ/W)    (4.21)

4. If any current component is empty (Nck = 0), with no associated component instance, remove it and decrement K accordingly.
component zk in image j, and define the classifier rule as

hk(z, θ) = { 1,  dkj > θk
             0,  dkj ≤ θk    (4.22)
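Eq. 4.22 is a simple decision stump on the component distance; a sketch in Python (function names are ours, and the instances are assumed to be normalized vectors):

```python
import math

def component_distance(u, v):
    """Euclidean distance d_kj between two normalized instances of a component."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def weak_classify(d_kj, theta_k):
    """Weak classifier h_k of Eq. 4.22: fires when the distance exceeds theta_k."""
    return 1 if d_kj > theta_k else 0
```

The parity pk (not shown) would simply flip the direction of the inequality when a small distance, rather than a large one, indicates the positive class.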
Suppose that T is an integer between 1 and K, where K is the total number of components; the AdaBoost classifier learns the top T 'positive' components as follows [VJ01]:
Algorithm 4.2: AdaBoost for Components

1. Given sample images (I1, y1), . . . , (In, yn), . . . , (IH, yH), where H = Hp + Hn and yn = 0 or yn = 1 for negative or positive samples respectively, with Hn and Hp the numbers of negative and positive samples.

2. Initialize the weights for images j = 1, . . . , H as m1,j = 1/(2Hn) for negative samples and m1,j = 1/(2Hp) for positive samples.

3. For iteration t = 1, . . . , T:

(a) Normalize the weights so that ∑_{j=1}^{H} mt,j = 1.

(b) For each component zk, train a classifier hk(z, θ), whose error is evaluated as

εk = ∑_{j=1}^{H} mt,j |hk(z, θ) − yj|    (4.23)

(c) Find the classifier ht with the lowest error εt, and add the corresponding component zt to the component set Zsub−T.

(d) Update the weights as mt+1,j = mt,j βt^{1−ej}, where ej = 0 if sample j is classified correctly, ej = 1 otherwise, and βt = εt / (1 − εt).
From this AdaBoost learning method, we obtain a subset Zsub−T consisting of the most distinguishing components.
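Algorithm 4.2 can be sketched as follows (an illustrative implementation under our own data layout: the per-image component distances and the per-component thresholds are assumed precomputed):

```python
def adaboost_components(distances, labels, thresholds, T):
    """Select the T most discriminative components (sketch of Algorithm 4.2).

    distances[j][k] : distance d_kj of component k in sample image j
    labels[j]       : 1 for positive samples, 0 for negative
    thresholds[k]   : threshold theta_k of weak classifier h_k
    """
    H = len(labels)
    Hp = sum(labels)
    Hn = H - Hp
    # Step 2: initial weights 1/(2*Hp) for positives, 1/(2*Hn) for negatives.
    m = [1.0 / (2 * Hp) if y == 1 else 1.0 / (2 * Hn) for y in labels]
    K = len(thresholds)
    selected = []
    for _ in range(T):
        s = sum(m)
        m = [w / s for w in m]  # step (a): normalize
        # Step (b): weighted error of each weak classifier (Eq. 4.23).
        errors = []
        for k in range(K):
            h = [1 if distances[j][k] > thresholds[k] else 0 for j in range(H)]
            errors.append(sum(m[j] * abs(h[j] - labels[j]) for j in range(H)))
        # Step (c): keep the component with the lowest error.
        t = min(range(K), key=lambda k: errors[k])
        selected.append(t)
        # Step (d): down-weight correctly classified samples.
        eps = errors[t]
        beta = eps / (1.0 - eps) if 0.0 < eps < 1.0 else 0.0
        for j in range(H):
            correct = (1 if distances[j][t] > thresholds[t] else 0) == labels[j]
            if correct:
                m[j] *= beta
    return selected
```

A practical version would also remove the selected component from the pool at each round to avoid re-selecting it; that refinement is omitted here for brevity.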
This hybrid system combines generative and discriminative functions. The modified model smoothly integrates the two classes of approaches and tries to benefit from the merits of both. The first part provides middle-level, preformed units which are inferred entirely from the training data; the number of these units is determined by the complexity of the objects. In the second part, the most distinguishing units are selected and the blend is built, which reduces the computational cost. This extension of the original HDP model improves both the performance and the speed of the recognition process.
4.5 Conclusion

In this chapter, the two main groups of solutions for supervised learning, generative and discriminative models, were presented. In particular, the hierarchical generative model, a learning algorithm based on Bayesian statistics, was detailed. Furthermore, we developed a nonparametric hybrid system that combines the merits of the Dirichlet process and the traditional AdaBoost approach. These proposals will be challenged on the problem of robust learning for object classification in the first part of Chapter 5. Other discriminative learners, such as SVM, will also be used in our experiments in Chapter 5.
Chapter 5
Testing and Results
Contents
5.1 General Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.1 Datasets of Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.2 Classification Using HDP with Heterogeneous Features . . . . . 77
5.1.3 Classification Using Boosting within Hierarchical Bayesian Model 81
5.2 Facial Expression Recognition . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.1 Face databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.2 Overview of Our System . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.3 Image Based Classification using Static features . . . . . . . . . . 89
5.2.4 Image Based Classification Using Static and Dynamic Features . 92
5.2.5 Classification for Sequences . . . . . . . . . . . . . . . . . . . . . . 95
In this chapter, we present the experiments conducted to evaluate the performance of our proposed systems, which concern image/video classification and feature extraction. For object classification and recognition, we used the Caltech dataset; we show that our system performs well and has the potential to capture visual concepts in the semantic domain. The other system concerns the recognition and interpretation of human facial expressions.

To evaluate the proposed facial expression recognition system, we selected three benchmark databases: JAFFE, Cohn-Kanade and MMI. By comparison with other methods, the effectiveness of our method is clearly demonstrated.

These experimental results and comparisons have been published in international conferences [JIB09, JI10b, JK09, JI10a]. Some results have been submitted to an international journal.
5.1 General Object Recognition
5.1.1 Datasets of Objects
Generic object recognition is one of the most difficult computational problems, due to intra-class and inter-class variability. Various solutions have been proposed, as reviewed in Chapter 2.

For the comparison and evaluation of these methods, appropriate datasets are required. A database should contain enough images in each category. Well-known databases for natural object categorization include LabelMe, the Caltech series and the PASCAL series.

These databases have played a key role in category-level recognition research, driving the field by providing a common ground for algorithm development and evaluation [PBE∗06]. Recently, Google and Flickr, as the most popular online image collections, have become appealing sources of natural object images, though the overwhelming number of images and the often incorrect annotations are obstacles to building ground truth.
A number of well-known datasets are used to compare categorization algorithms. As listed in Section 2.1.2, one of the most popular is the Caltech-101 dataset collected by Fei-Fei et al. [FFFP07], which consists of 101 object categories and an additional background category. These objects are normally centered and free of clutter, which makes the reported accuracies relatively high. Each category contains about 40 to 800 images, for a total of 9,144 images in the dataset. The size of each image is roughly 300×200 pixels.
To balance the numbers across categories, we chose four categories that have more pictures (Airplanes, Motorcycles, Faces and Leopards; samples in Fig. 5.1). From each category we use 50 images for training and 50 images for testing. To reflect the richness of the real world, we also include 50 images from the background category in the training of the visual words.
5.1.2 Classification Using HDP with Heterogeneous Features
In these experiments, we combine a large number of descriptors (e.g. local gradient, shape, and color) extracted from small patches within one hierarchical generative model. These different data sources have complementary characteristics, which should be independently combined to improve the classification. We are also inspired by the Hierarchical Dirichlet Processes method to generate intermediate mixture components that improve recognition and categorization.

Figure 5.1: Samples from the four categories we used.
In previous works [LJ08], the authors propose to describe the regions or patches around salient interest points with feature vectors combining different kinds of features into one visual word. In our method, we chose three sets of different features, namely SIFT, shape context and color, to generate three independent sets of heterogeneous feature vectors on small patches, and we individually cluster these feature vectors into three different visual codebooks. We then combine the codebooks into one extended vocabulary to be used by the generative model introduced in Chapter 4.
In the implementation, for each image in the training and test sets, DoG is first used to detect the keypoints. Then, using the SIFT descriptor as in Section 3.3.2, we compute a 4×4×8 = 128 dimensional feature vector over the 16×16 region around each keypoint, according to its scale; these vectors are then clustered into a visual vocabulary of size 100. The similarity between two descriptors is quantified by the simple Euclidean distance.
To represent the geometrical information related to local shapes, SC (shape context, Section 3.3.3), developed by [BMP00], is used with 2 levels in radius and 8 bins in log-polar space. For each bin, 8 orientations are counted. Thus a 2×8×8 dimensional descriptor is built around the non-zero points of the edge map. The size of this visual codebook is fixed at 50, and the χ2 distance function is used to measure the similarity between two shape features.

To balance and enrich the information in the regions around the DoG keypoints, we also use the LUV color space and average the pixel values over 8×8 regions. The color information, denoted C, is represented by a 3-dimensional vector and clustered into a codebook of size 24.
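The merging of the three codebooks (SIFT: 100 words, SC: 50, color: 24) into one extended vocabulary, together with the χ2 distance used for the shape contexts, might be sketched as follows (function names and data layout are ours):

```python
def chi2_distance(h, g, eps=1e-10):
    """Chi-squared distance between two histograms (used for shape contexts)."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h, g))

def combine_codebooks(sizes):
    """Offsets mapping (codebook, local word id) into the merged vocabulary.

    E.g. sizes = {'sift': 100, 'sc': 50, 'color': 24} gives a vocabulary of
    W = 174 words, with 'sc' words occupying indices 100..149.
    """
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # per-codebook offset, total vocabulary size W

offsets, W = combine_codebooks({'sift': 100, 'sc': 50, 'color': 24})
# A shape-context word with local index 7 becomes word offsets['sc'] + 7 = 107.
```

Because the three feature types are clustered separately, each with its own distance (Euclidean, χ2, Euclidean), the merged vocabulary keeps the codebooks disjoint rather than mixing heterogeneous features in one descriptor.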
The first experiments are carried out on the Caltech categories described in Section 5.1.1. For the prior hyperparameters, we fixed α0 = 0.1, γ = 1.0 and λ = 1.0. After 50 iterations for training the model and 20 for the test images, the recognition performance is shown in Table 5.1. By adding shape and color information to the descriptor pool, the complex nature of the objects is better captured, and the system outperforms the average performance of 75% and the best performance of 95% reported in [FFFP07], which also used a Bayesian approach.
Table 5.1: Classification results; in parentheses, the value of K (number of components).

              SIFT+SC+C   SIFT+SC    SIFT
Airplane      96% (54)    92% (43)   68% (34)
Leopard       100% (67)   94% (42)   92% (46)
Motorcycles   88% (40)    90% (43)   84% (35)
Face          66% (50)    70% (41)   62% (43)
We also show one distinctive mixture component (No. 17) that was detected. In Fig. 5.2, links are added to show similar descriptors detected in different images. Component No. 17 of the motorbike category is made up of several repeated visual words distributed around distinctive parts of the motorbikes, which are clustered into the same component by the process. Their number and positions are relatively stable, though a few irrelevant descriptors remain as noise.
According to the algorithm, the results in the testing part are obtained by running the Gibbs sampler of the MCMC procedure described in Section 4.3.5. It is not obvious after how many iterations the algorithm will, in general, converge to a useful result. In previous works, the burn-in times were arbitrarily set to 100 iterations [TJBB06] for documents and 60 [MS05] for natural scenes. We therefore ran a test on 30 images from different categories to determine when efficient convergence occurs.
Figure 5.2: Component No. 17 in the motorbike category.

In Fig. 5.3, we show the average results for 6 random images from the Airplane category. The first several iterations, the initial burn-in, correspond to random movements close to the randomly initialized starting point. In the next stage, all the combinations of different features move rapidly towards the posterior mode. Finally, at equilibrium, all the samples settle around stable values. Fig. 5.3 also shows that the average convergence speed depends on the feature combination: SIFT begins to converge around iterations 10-15 and SIFT+SC around iterations 15-20, while the combination of the three feature sets needs 25-30 iterations. To balance computation time against accuracy, we advise adaptively selecting the number of iterations in the testing stage, with a stopping criterion of variance/average below 0.2% to 0.5%.
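The suggested adaptive stopping rule might be sketched as follows (the window size over which the statistic is monitored is our own choice):

```python
def should_stop(trace, window=5, tol=0.002):
    """Stop Gibbs sampling once the recent trace is flat enough.

    trace : per-iteration values of the monitored statistic
    Stops when variance/mean over the last `window` iterations falls
    below `tol` (0.2%-0.5% in the text).
    """
    if len(trace) < window:
        return False
    recent = trace[-window:]
    mean = sum(recent) / window
    if mean == 0:
        return False
    var = sum((x - mean) ** 2 for x in recent) / window
    return var / mean < tol
```

The sampler would call this after every iteration and halt early for simple feature combinations (e.g. SIFT alone) while running longer for the three-feature combination.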
The next experiment locates a single object in a complex background by finding distinctive components. We try to accumulate the relevant components to decide where the prominent object is. We first test the image in the trained HDP model (as in Section 4.3.3); after obtaining all the instances of the components, we select the top 5 components. In Fig. 5.4, the squares are detected on the body of the leopard and display its rough position. The result is rather satisfying, with little noise.

This model first improves the generation and sharing of 10-20 latent themes among classes, while previous works usually have one or two themes per category [LJ08, TJBB06]. Second, it combines information provided individually by three diverse feature sources: local texture, shape and color. This combination makes our method outperform single-feature or large-patch-one-descriptor methods in object class detection.
Figure 5.3: Convergence vs. Iteration times.
5.1.3 Classification Using Boosting within Hierarchical Bayesian Model

In this section, after obtaining the set of components as in Section 5.1.2, instead of boosting the features as Viola and Jones [VJ01] do, we boost the components of the intermediate layer to find the most distinctive ones. We consider these components more important for object class recognition than the others and use them to improve the classification. Our target is not only the correct classification of objects, but also the discovery of the essential latent themes shared across multiple object categories and of the particular distribution of latent themes for a specific category.

After examining Table 5.1 for appropriate features, we chose two sets of features, SIFT and shape context, because their combination performs well in Table 5.1; we mixed the two codebooks into one extended vocabulary to be used by the generative model introduced in Section 4.3.3.
Figure 5.4: The distinctive components in a large image.

The experiments are also carried out on the four categories Airplanes, Leopards, Faces and Motorbikes. For the prior hyperparameters, we use α0 = 0.1, γ = 1.0 and λ = 1.0 to train the HDP model and obtain a set of components. For each category, we use the 50 training images as positive samples and 50 random images from the other three categories as negative samples for boosting. For classification, the number T is adjusted to examine the relation between the overall performance and the size of the set of 'good' components. The results for different values of T and different categories are shown in Fig. 5.5. We can see that, for the different categories, the first T components yield an acceptable but not very good detection rate. As we increase T and add more positive components to the subset, the performance increases and reaches comparatively stable values.
Table 5.2: The confusion matrix for the best T.

            Airplane   Face   Leopard   Motorbike
Airplane    94%        0%     4%        2%
Face        4%         74%    16%       6%
Leopard     8%         0%     92%       0%
Motorbikes  10%        0%     2%        88%

Table 5.3: Performance comparison.

            HDP (Num: K)   HDPBoost (best of T)
Airplane    92% (43)       94% (8)
Face        70% (41)       74% (10)
Leopard     94% (42)       92% (11)
Motorbikes  90% (43)       88% (6)
Figure 5.5: Performance vs. the size of the component set.

The confusion matrix is given in Table 5.2. It shows the HDPBoost classification performance for each category with the best value of T. The comparison between HDP and HDPBoost is shown in Table 5.3, where we list the performances and, in parentheses, the number of components K and the best value of T. We find that a small subset of all the components found by HDP is enough to capture the complex nature of the objects, and performs as well as using the whole set. Even if the performance is not improved much, the boosting not only helps to find the essential characteristics of the objects, but also accelerates the classification. For some categories (Leopard and Motorbikes), the results decrease by 2% or 4%; this outcome is perhaps due to occlusion by the background. The overall performance here also exceeds the average performance of 75% in [FFFP07], which used a Bayesian approach. In conclusion, we can predict that these top 'positive' components are strongly linked to the objects in the semantic domain.
5.2 Facial Expression Recognition

5.2.1 Face databases

In human society, the face plays an important role in interpersonal communication. From Darwin [Dar02] to Matsumoto [MW09], social psychologists have tried to interpret facial signals and the expression of human emotions. Since the beginning of the computer era, advances in image analysis have made the automatic analysis of facial expressions in computer vision possible, and accurate emotion categorization now forms a challenging task.
As this topic is very useful for man-machine interaction, researchers need benchmarks to be able to compare results directly. Different databases provide standard test data for the detection of identity, face pose, illumination, facial expression and age. Here we list some publicly available databases:
1. Cohn-Kanade AU-Coded Facial Expression Database [KCT00], which serves researchers in automatic facial image analysis and synthesis and in perceptual studies. The peak expression of each sequence is fully FACS-coded, but emotion labels are not available.

2. MMI Database [PVRM05]: this dataset is manually FACS-coded, with frame-by-frame annotations of the temporal segments. Some sessions are labeled with one of the six basic emotions.

3. Japanese Female Facial Expression (JAFFE) Database [LAKG98], which includes images of 7 facial expressions (6 basic facial expressions + 1 neutral) posed by 10 Japanese female models.

4. Belfast Naturalistic Database [CCS03], which consists of audiovisual clips from 125 speakers (31 male, 94 female). The emotional clips provide within themselves at least most of the context necessary to understand a local peak in the display of emotion and to show how it develops over time.

5. RPI ISL Facial Expression Databases [TLJ07], which are primarily used for frontal-view facial action unit recognition.

6. DaFEx - Database of Human Facial Expressions [BC08], a database of posed human facial expressions for the evaluation of synthetic faces and embodied conversational agents, but not widely used.

7. The AR Face Database [MB98], which contains images of 116 individuals (63 men and 53 women). The imaging and recording conditions (camera parameters, illumination setting, camera distance) were carefully controlled and constantly recalibrated to ensure identical settings across subjects.

8. BU-3DFE (Binghamton University 3D Facial Expression) Database and BU-4DFE (3D + time): A 3D Dynamic Facial Expression Database [YWS∗06]. This database aims at identifying facial expressions and understanding facial behavior and the 3D structure of facial expressions at a detailed level. It contains 100 subjects (56% female, 44% male), and each subject performed seven expressions in front of a 3D face scanner.
Among this comparatively large number of face databases, subjects are usually asked to perform the desired actions. Most of the databases collected primarily for face recognition also recorded subjects under changing facial expressions. However, the appearance and timing of these directed facial actions may differ from spontaneously occurring behavior [Gro05]. Recently, YouTube and Google Video have also become resources for potential high-quality databases.

Algorithms should be developed to be robust on databases of sufficient size that include carefully controlled variations of these factors. Furthermore, benchmark databases are necessary to comparatively evaluate algorithms. We selected three widely used ones as datasets for our experiments: the Cohn-Kanade, JAFFE and MMI databases.
Figure 5.6: Sample images from the JAFFE database. From left to right: Angry, Disgust, Fear, Happiness, Sadness, Surprise and Neutral.
The JAFFE database [LAKG98] is widely used in this area, and Asian women usually yield the lowest rates in facial expression recognition. The JAFFE database consists of 213 images covering the 6 basic facial expressions plus one set of neutral expressions. The images in Fig. 5.6 are posed by 10 Japanese females. Following the processing of Fig. 5.7 (the same as Fig. 3.4 in Chapter 3), for each individual we use the images from the 6 categories of facial expressions (first row) and the neutral face images (second row) to obtain the subtracted difference images (third row). As we have 3 neutral images per person and 2 to 4 images per expression category per person, the number of total difference images is multiplied, and we obtain 531 normalized face images for classification.
Figure 5.7: Sample images and the location procedure. From left to right: Anger, Disgust, Fear, Happiness, Sadness and Surprise.
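The difference images of the third row are obtained by subtracting a neutral face from each expression image after normalization; a minimal sketch with NumPy (the shift of the signed difference into [0, 255] is our own illustrative choice, not necessarily the exact mapping used in the experiments):

```python
import numpy as np

def difference_image(expression, neutral):
    """Subtract a neutral face from an expression image (both grayscale,
    same size, aligned); the signed difference is shifted to [0, 255]."""
    diff = expression.astype(np.int16) - neutral.astype(np.int16)
    return ((diff + 255) // 2).astype(np.uint8)

expr = np.full((64, 64), 200, dtype=np.uint8)
neut = np.full((64, 64), 120, dtype=np.uint8)
out = difference_image(expr, neut)  # uniform value (80 + 255) // 2 = 167
```

Because each person contributes 3 neutral images and 2 to 4 images per expression, every (expression, neutral) pairing yields one difference image, which is how the 213 source images multiply into 531 samples.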
We also evaluated our method on the MMI database [PVRM05], another publicly available database, which aims to deliver large volumes of visual facial expression data to the facial expression analysis community. It includes more than 20 different faces of students and research staff members of both sexes (44% female), ranging in age from 19 to 62 and having either a European, Asian or South American ethnic background. The faces appear in frontal and profile views displaying the six facial expressions of emotion. The sequences have variable lengths between 40 and 520 frames, picturing one or more neutral-expressive-neutral facial behavior patterns. In the sequences, each frame measures 720×576, 576×720 or 640×480 pixels in true color, as in Fig. 5.8.

Figure 5.8: Sample frames from the MMI database.

In our experiments, we selected 199 image sequences. As selection criteria, a sequence has to include a frontal or near-frontal view of the face and must already be labeled in the MMI database as ground truth. For each sequence, we manually labeled the position of the peak frames and selected four peak frames from it. We also included all the neutral frames to build the evaluation dataset and converted the images to gray level.
The third used one is Cohan-Kanade database [KCT00], which includes 486 se-
quences from 97 students (samples in Fig. 5.9). Subjects range in age from 18 to 30
years inclusive. 65% were female; 15% were African-American and 3% Asian or Latino.
They were asked to perform different expressions using a camera directly in front of
one subject. Each sequence runs from a face in neutral state to a target expression in
peak state. A comprehensive comparison is difficult because even though most of the
proposed systems worked on this database, they did not use the same selection of se-
quence sets and their own labelings of expressions. In our experiments, we select 348
sequences (40 Anger, 41 Disgust, 45 Fear, 97 Happiness, 48 Sadness and 77 Surprise).
We also manually labeled the starting frame of expression in every sequence.
Figure 5.9: Sample frames from Cohn-Kanade database

Because the Cohn-Kanade and MMI databases consist of video sequences, both image-
based and video-based methods are tested on them. For the JAFFE database, we only
test the image-based method.
5.2.2 Overview of Our System
Figure 5.10: System overview
Like most FER systems, ours (Fig. 5.10) consists of three stages: face
and facial parts detection, face representation, and facial expression recognition. The
first stage automatically locates the face and facial parts, as in Section
3.2.2. The next stage extracts the appropriate descriptors of Chapter
3 to represent the normalized face sequences produced by the first stage. The representation
is extracted from two sources: appearance-based information, using traditional LBP
(Local Binary Pattern) or Gabor features on static images, and two newly proposed textons
(VTB and moments) for the dynamic spatiotemporal information. These descriptors
(both static and dynamic features) are used by the classification methods of Chapter 4 for
facial expression recognition.
Furthermore, the last stage of our system is composed of two steps. First, classifica-
tion is performed on every image, except the first ones in the sequence, by predicting
the probability that each image belongs to each expression from the binary descriptor and
moments. Then the weighted probabilities obtained are combined so as to predict the
expression associated with the whole sequence.
5.2.3 Image Based Classification using Static features
The first experiment on the JAFFE database is performed at different resolutions:
(32 × 32), (64 × 64) and (128 × 128). The face images (the fourth row in Fig. 3.4) are cropped
from the original images (the first row in Fig. 3.4) and normalized separately to the different
sizes. On these face images, we apply the face mask of Fig. 3.10 and obtain 8 sub-
regions. LBP histograms are computed for each block and concatenated into a single
feature vector of 2048 bins. The recognition rates are obtained using an SVM classifier
with a polynomial kernel and 10-fold cross-validation. The performances for each class
are shown in Table 5.4, where An stands for Angry, Di for Disgust, Fe for Fear, Ha for
Happiness, Sa for Sadness, Su for Surprise and Av for Average; the same annotations are
used in the rest of this chapter. The average performance is 98.3%, outperforming
other manually annotated and automatic methods, as listed in Table 5.5.
Table 5.4: Recognition performances by SVM on different resolutions (%)
Resolution  An    Di    Fe    Ha    Sa    Su    Av
32×32       87.8  81.6  77.1  89.4  76.1  83.5  82.5
64×64       100   97.7  97.9  97.6  97.7  98.8  98.3
128×128     93.3  89.7  87.5  92.9  86.4  91.8  90.2
Table 5.5: Average recognition performances on JAFFE database (%)
Ours  Feng [FPH05]  Guo [GD05b]  Koutlas [KF08]  Liao [LFCY06]
98.3  93.8          92.3         90.8            94.59
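As an illustration of this feature pipeline, the following Python sketch (not the thesis code; the block coordinates and the plain 8-neighbour LBP variant are assumptions) computes a 256-bin LBP histogram per block and concatenates them, so that 8 blocks yield the 2048-bin vector described above.

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour LBP code (0..255) for each interior pixel."""
    g = np.asarray(gray, dtype=np.int32)
    c = g[1:-1, 1:-1]
    # neighbours in a fixed clockwise order; each contributes one bit
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(shifts):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes |= (nb >= c).astype(np.int32) << bit
    return codes

def block_lbp_histogram(gray, blocks):
    """Concatenate per-block 256-bin LBP histograms; with 8 blocks this
    yields the 8 x 256 = 2048-bin feature vector described above."""
    feats = []
    for (y0, y1, x0, x1) in blocks:
        codes = lbp_image(gray[y0:y1, x0:x1])
        hist, _ = np.histogram(codes, bins=256, range=(0, 256))
        feats.append(hist / max(hist.sum(), 1))  # per-block normalization
    return np.concatenate(feats)
```

Such vectors would then be fed to the polynomial-kernel SVM for the 10-fold cross-validation described above.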
The above performances show that our automatic system is effective for facial expression
recognition and outperforms existing methods, even some manually pre-processed
systems. We still face the problem of the high dimension of the feature vector. In order
to find the most discriminative features and to explore the possibility of a real-time
recognition system, we boost the features as presented in Section 2.2. For each
expression, we randomly select 50 positive face images and 50 negative face images from
all other categories of expressions. After one-against-all boosting, the top 20 features are
selected for each expression. As some features appear multiple times in different ex-
pressions, we eliminate the duplicates and reduce the dimension from 20 × 6 = 120 to
73 bins. The generalization performance of the boosted-SVM classifier is shown in Table 5.6;
the corresponding confusion matrix is shown in Table 5.7.

Table 5.6: Recognition performances by boosted-SVM for 64×64 resolution (%)
An   Di    Fe  Ha   Sa   Su    Av
100  96.6  99  100  100  97.6  98.8
Table 5.7: Confusion matrix by boosted-SVM for 64×64 resolution for 6-class recognition (%)
     An   Di    Fe   Ha   Sa   Su
An   100  0     0    0    0    0
Di   0    96.6  1.1  0    0    2.3
Fe   0    1     99   0    0    0
Ha   0    0     0    100  0    0
Sa   0    0     0    0    100  0
Su   0    2.4   0    0    0    97.6
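The bin-selection step can be sketched as a minimal one-against-all AdaBoost with threshold stumps on single histogram bins. This is an illustrative simplification: using one mean-based threshold per bin is an assumption, and the thesis' actual weak learners may differ.

```python
import numpy as np

def boost_select(X, y, rounds=20):
    """One-against-all AdaBoost with threshold stumps on single bins.

    X: (n_samples, n_bins) histogram features; y: labels in {-1, +1}
    (one expression against all others). Returns the bin indices picked
    across the boosting rounds, duplicates removed, order preserved.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)          # sample weights
    picked = []
    for _ in range(rounds):
        best = (None, None, None, 0.5)
        for j in range(d):
            thr = X[:, j].mean()     # one candidate threshold per bin
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, pol, err)
        j, thr, pol, err = best
        if j is None:                # no stump better than chance
            break
        pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)   # re-weight misclassified samples
        w /= w.sum()
        if j not in picked:
            picked.append(j)
    return picked
```

Running this once per expression and merging the six index lists (removing duplicates) mirrors the 120-to-73-bin reduction described above.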
Since the normalization of face images multiplies the dataset size by three (the number
of neutral images per person), some images are possibly identical, being cut from the
same original images (not in all cases, because for different neutral images the detection,
normalization and histograms usually differ). To confirm that our results are valid and
to compare with other methods, we divide the dataset into three sub-sets, making sure
that each sample face image is cut from a different original image. In the later
experiments we test only the 64 × 64 resolution boosted results. After 10-fold
cross-validation on each sub-set, we observe that the performance decreases but remains
robust and stable, as listed in Table 5.8.
Table 5.8: Recognition performances for 64×64 resolution on sub-sets (%)
Set    An   Di    Fe    Ha    Sa    Su    Av
No.1   100  93.1  93.8  100   100   100   97.8
No.2   100  94.8  100   100   98.3  98.2  98.6
No.3   100  96.6  100   96.4  100   93.1  97.7

We also evaluated this method on the MMI database. For the 199 selected image sequences,
we manually labeled the position of the peak frames and selected three peak frames from
each to build the evaluation dataset. Hence, 597 images are extracted from the videos and
converted to gray level. The face regions are automatically identified as described in
Section 3.2.2 and normalized to (64 × 64) pixels. After the extraction of LBP features
using masks, we again performed 10-fold cross-validation using SVM for 6-class expression
recognition. The average performance is 91.2%, the best among state-of-the-art results,
compared to 86.9% for [SGM09] and 82.2% for [TA07]. The confusion matrix on the
MMI database is shown in Table 5.9.
Table 5.9: Confusion matrix on MMI database for 6-class recognition (%)
     An    Di    Ha    Fe    Sa    Su
An   90.6  2.1   0     0     5.2   2.1
Di   2.3   94.3  3.4   0     0     0
Ha   0     0     100   0     0     0
Fe   4.8   4.8   4.8   72.6  2.4   10.7
Sa   2.1   2.1   5.2   2.1   88.5  0
Su   0     0     2.6   0.9   0.9   95.7
The overall performance on the MMI database is inferior to the results on the JAFFE database.
This may be due to the wider selection of ethnic backgrounds and to out-of-plane
movements in the MMI database. A large training set with variations in cultural origin
would help to build more comprehensive models for facial expression recognition.
For the Cohn-Kanade database, we apply similar processing. Considering the rich vari-
ation in physical appearance in this database, we select the top 50 features from each
expression after AdaBoost and combine them into 220 bins. As an option, we also use the
histogram of Gabor features as complementary information to improve the perfor-
mance. The comparison with alternative methods is listed in Table 5.10, where SN
stands for the number of subjects and SqN for the number of sequences. As we can
see, the results are not as good as those of [SGM09]; however, the eye locations in their
system are manually labeled. Furthermore, our system is more robust, as it performs
well on all three databases.

Table 5.10: Recognition performances comparisons on Cohn-Kanade database (%)
                  Representation         SN  SqN  C  A  M        AR(%)
[CSG∗03]          Motion Units           53  -    6  N  -        91.8
[BGL∗06]          Gabor+AdaBoost         90  313  7  Y  10-Fold  93.3
[MB07]            Gabor+AdaBoost         -   -    6  Y  -        84.6
[MB07]            Edge/chamfer+AdaBoost  -   -    6  Y  -        85
[SGM09]           BoostedLBP             96  320  6  N  10-Fold  95.1
Ours: LBP         LBP                    94  346  6  Y  10-Fold  91.9
Ours: BoostedLBP  BoostedLBP             94  346  6  Y  10-Fold  91.4
Ours: Fusion      BoostedLBP+Gabor       94  346  6  Y  10-Fold  94.3

Table 5.11: Recognition performances comparisons for image-based methods (%)
            Features    SN  SqN  C  D  A  M        AR(%)
[CSG∗03]    Gabor       53  318  6  N  N  -        91.8
[BGL∗06]    Gabor       90  313  7  N  Y  10-Fold  93
[KZP08]     Shape       -   -    6  N  Y  5-Fold   92.3
[DSRDS08]   Holistic    98  411  6  N  N  5-Fold   96.1
[PY09]      Haar+Boost  96  -    6  N  Y  3-Fold   88
[VNP09]     Candide     -   440  7  N  N  5-Fold   90
[SGM09]     BoostedLBP  96  320  6  N  N  10-Fold  95.1
[SC09]      AAM         -   72   6  N  N  3-Fold   97.22
Ours        LBP+VTB     95  348  7  Y  Y  2-Fold   94
Ours        LBP+VTB     95  348  7  Y  Y  10-Fold  97.2
Ours        Moments     95  348  7  Y  Y  2-Fold   95.5
Ours        Moments     95  348  7  Y  Y  10-Fold  97.3
5.2.4 Image Based Classification Using Static and Dynamic Features
Among the image-based approaches on the Cohn-Kanade database, [CSG∗03] used a sub-
set of 53 subjects for which at least four sequences were available; for each
person, they selected an average of 8 frames per expression. [BGL∗06] selected the
first and last frames of each sequence for training and testing. [SGM09] used
the neutral faces and three peak frames for prototypic expression recognition. We made
a similar selection to build static image sets. Our selection criterion is that LBP, VTB and
moments can be computed on these images (e.g., for moments, the first τs − 1 frames
are ignored). All the neutral images and four peak images per sequence are used as
training and testing images.
LBP histograms are extracted from 10 blocks on these images. For VTB features, we
connect the motion information of the current frame with its previous two frames.
These three images are decomposed into vertical spatiotemporal planes. The total his-
togram length for LBP-VTB descriptors is 256 × 10 + 64 × 3 = 2752. Moment values,
extracted from one image and its previous (τs − 1) images in the sequence, are ob-
tained from three blocks (related to eyes, nose and mouth) for each of the n = 64 spatiotem-
poral planes, as shown in Fig. 3.15. For each block, three values (M00, M10/M00, M01/M00)
are computed. The total vector dimension is 64 × 3 × 3 = 576.
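The per-block moment computation can be sketched as follows; this is a hedged illustration (the block coordinates and the x/y axis convention for M10 and M01 are assumptions), computing the three values M00, M10/M00 and M01/M00 for each block of a spatiotemporal slice.

```python
import numpy as np

def block_moments(plane, blocks):
    """Geometric moments per block of a spatiotemporal slice.

    For each block B we compute M00 (the sum of intensities) and the
    centroid coordinates M10/M00 and M01/M00, i.e. three values per block.
    With 3 blocks on each of 64 planes this gives the 576-dim vector above.
    """
    feats = []
    for (y0, y1, x0, x1) in blocks:
        B = plane[y0:y1, x0:x1].astype(float)
        ys, xs = np.mgrid[0:B.shape[0], 0:B.shape[1]]
        m00 = B.sum()
        if m00 == 0:
            feats.extend([0.0, 0.0, 0.0])  # empty block: degenerate case
            continue
        m10 = (xs * B).sum()
        m01 = (ys * B).sum()
        feats.extend([m00, m10 / m00, m01 / m00])
    return np.array(feats)
```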
The value of the temporal window τs is fixed to 8, as a good compromise between
having a region large enough for moment computation and having numerous
probability vectors for the sequence-based approach. Because of its powerful discriminative
ability, an SVM with polynomial kernel is used.
10-fold cross-validation is applied in all experiments; we also tested 2-fold
cross-validation for comparison. Our results are compared to other image-based methods in
Table 5.11, where SN stands for the number of subjects, SqN for the number of sequences,
C for the number of classes, D for dynamic, A for automatic, AR for accuracy rate,
and LOSO for leave-one-subject-out.
Comparing the results in Table 5.11, we can fairly say that our proposed method
outperforms the others in classifying the static images. As shown in
Table 5.11, the results are relatively stable, around 94% (2-fold) and 97% (10-fold). The
new descriptors, VTB and moments, improve the classification of neutral+peak images.
[CLL09] reported a similar recognition rate; however, they use fewer sequences and only 6
classes for classification.
From the MMI database, we select 199 image sequences. As a selection criterion, a
sequence has to include a frontal or near-frontal face view and be already labeled with
ground truth in the MMI database. For each sequence, we manually labeled the position of
the peak frames and selected four peak frames from it. We also included all neutral frames
to build the evaluation dataset and converted the images to gray level. The face regions are
automatically identified as described in Section 3.2.2 and normalized to 64 × 64 pixels.
After the extraction of LBP+VTB and moments features, we again performed 10-fold
cross-validation using SVM for 7-class expression recognition. The average performances are
92% (LBP+VTB) and 89% (moments), which can be considered the best performance in the
state of the art, compared to 86.9% for [SGM09]. For each category, we also provide
the confusion matrix in Table 5.12.
The results are inclined toward neutral expressions. The reason possibly lies in the small
changes at the beginning of sequences. This error can be coped with by referring to
later peak frames. We also tested 2-fold and 5-fold classification; the results are
81.9% and 86.8%, showing that the method is relatively robust.

Table 5.12: Confusion matrix of Moments on MMI database (%)
     An    Di    Ha    Fe    Sa    Su     Ne
An   76.6  2.34  0.78  0     2.34  0      17.97
Di   4.27  81    0     5.13  0.85  2.56   5.98
Ha   1.28  0.64  94.9  0     0     0.64   2.56
Fe   2.68  2.68  1.79  58.9  1.79  16.96  15.18
Sa   3.9   0     0     0.78  66.4  1.56   27.34
Su   1.28  0.64  0.64  7.69  0     71.2   18.60
Ne   0.71  0.1   0.1   0.3   0.3   0.3    98.2

Figure 5.11: One sequence of happiness and the corresponding plot for six expressions.
After this step, we consider a temporally ordered image sequence S = {I1, I2, . . . , IT},
where T is the number of frames in the sequence. After classifying all the frames except
the first two, an image Ii, with 3 ≤ i ≤ T, is associated with a vector of probabilities
{pi,c0, pi,c1, . . . , pi,c6}, where pi,cj (0 ≤ j ≤ 6) is the probability of each of the 7
classes, pi,c0 being that of the neutral expression.
5.2.5 Classification for Sequences
Recently, research has increasingly moved from static methods to the dynamic analysis
of video sequences. Similarly to Bartlett et al. [BGL∗06], Chang et al. [CLL09]
and Buenaposada et al. [BMB08], we use the classification results from each frame to se-
lect the final label for the sequence. The detailed algorithm is presented in Algorithm
5.1. One sample sequence and the plot of six expressions are shown in Fig. 5.11.
Algorithm 5.1: Sequence-level classification using weighted sum

1. Initialize a vector PS = {P1, P2, . . . , P6} with Pk = 0, where 1 ≤ k ≤ 6
2. For i = 3 to T:
   (a) If Gi = 'Neutral', ignore the frame and go to the next iteration;
   (b) If Gi ≠ 'Neutral', Pk = Pk + wi · pi,ck, where 1 ≤ k ≤ 6
3. The final label for the sequence is
   G = argmaxk {Pk}, (5.1)
   where 1 ≤ k ≤ 6.
In the experiments, we built a set of weights W = {w3, w4, . . . , wT}, associating one
weight to each image. Three sets of weights W are tested. In the first set, we use
the same weight wi = 1 for all images 3 ≤ i ≤ T. For the second and third sets,
the weights have higher values for the last few frames of the sequences, as these provide
more valuable information. As observed in [BMB08, CLL09], face changes are not
linear and, for different expressions, the movement patterns also vary. As the
real pattern is unknown, we assume that the relation between the distance to the peak frame
and the similarity follows a normal distribution N(µ, σ²). The maximum is reached for the
peak frame, corresponding to the last frame (µ = T). In the second and third sets, σ1²
and σ2² are respectively T and T/2. A comparison between our proposed approach and
other methods is given in Table 5.14. As we can see, better performances are obtained
with the three sets of W. The confusion matrix is shown in Table 5.13. Here, the
results are also better than those of [BMB08]. For future use of complete
neutral-expressive-neutral sequences, this setting can easily be extended from the left half
of the bell to the full bell. The use of a normal distribution is only a rough estimation to
approximate the movement patterns of expressions; further tuning and expression-specific
pattern modeling could be achieved.
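Algorithm 5.1 together with the Gaussian weighting can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions: the frame probabilities come from some image-level classifier, class index 0 is 'Neutral', and the left half-bell of N(µ = T, σ²) provides the weights.

```python
import math

def gaussian_weights(T, sigma2, start=3):
    """Left half-bell weights following N(mu = T, sigma2): the peak
    (last) frame i = T receives the maximum weight of 1."""
    return {i: math.exp(-((i - T) ** 2) / (2.0 * sigma2))
            for i in range(start, T + 1)}

def classify_sequence(frame_probs, weights):
    """Weighted-sum sequence labeling in the spirit of Algorithm 5.1.

    frame_probs maps a frame index i (3 <= i <= T) to its 7 class
    probabilities [p_neutral, p_1, ..., p_6] from the image-level
    classifier; frames whose top class is Neutral (index 0) are skipped.
    Returns the winning expression class in 1..6.
    """
    P = [0.0] * 7  # accumulated scores; index 0 (Neutral) stays at 0
    for i, probs in frame_probs.items():
        if max(range(7), key=lambda k: probs[k]) == 0:
            continue  # frame predicted as Neutral: ignore it
        for k in range(1, 7):
            P[k] += weights[i] * probs[k]
    return max(range(1, 7), key=lambda k: P[k])
```

Setting `sigma2 = T` or `sigma2 = T / 2` reproduces the second and third weight sets; `wi = 1` for all frames reproduces the first.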
Table 5.13: Confusion matrix of Ours: N(µ, σ2²) (%)
An Di Fe Ha Sa Su
An 97.5 2.43 0 0 0 0
Di 2.5 92.7 0 0 0 0
Fe 0 0 90.91 2.01 2.08 0
Ha 0 2.43 0 96.91 0 0
Sa 0 0 0 0 97.92 0
Su 0 2.43 9.09 1.06 0 100
Table 5.14: Recognition performances comparisons for sequence-based methods (%)
                          SN  SqN  C  D  A  M        AR(%)
[YBS04]                   -   -    6  Y  Y  5-Fold   90.9
[XLC08]                   95  365  6  Y  N  LOSO     88.8
[BMB08]                   94  333  6  N  Y  LOSO     89.13
[ZP09] a                  97  374  6  Y  N  10-Fold  95.1
[ZP09] b                  97  374  6  Y  Y  2-Fold   93.85
[KOY∗09]                  53  129  6  Y  -  -        70
[CLL09]                   -   392  6  Y  N  5-Fold   92.86
Ours: LBP+VTB, wi = 1     95  348  6  Y  Y  10-Fold  93.68
Ours: LBP+VTB, N(µ,σ1²)   95  348  6  Y  Y  10-Fold  95.1
Ours: LBP+VTB, N(µ,σ2²)   95  348  6  Y  Y  10-Fold  95.7
Ours: moments, wi = 1     95  348  6  Y  Y  10-Fold  98.5
In our approach, we begin by labeling all images in a sequence, from the neutral ex-
pression to the apex status, using the image-based classification method. The first several
images are usually classified as 'Neutral' and are ignored. From the starting frame of the
expression, the trace of facial organ movements is captured by the texture and shape
changes on the spatiotemporal slices. The first few images with subtle expression after
the starting frame are possibly wrongly labeled as 'Neutral'; however, the following
images normally yield a high accuracy. One of the advantages is that, if a few peak
frames are not correctly identified, the probabilities from the other frames help to label
the sequence. After taking into account all images except the first (τs − 1) ones in the
sequences, our method produced recognition rates of 95.7% (LBP+VTB) and
98.5% (moments). The results listed in Table 5.14 outperform those of other systems
tested on the same database. On the MMI database, we only tested the case wi = 1; the
results are 95% (LBP+VTB) and 97% (moments) for 6 classes.
The dynamic deformation on spatiotemporal planes is thus exploited to recognize
human facial expressions in image sequences. Here, the dynamic information is derived
from the vertical-time plane and is especially used to model the evolution of facial
parts from the neutral state to the apex status.
Compared to other methods based on appearance or motion, we not only learn
both static and spatiotemporal features, but also treat these features according to their
specific domains. This strategy improved the effectiveness of image-based recognition.
The system further uses the probabilities predicted from single images to classify the
whole sequence. After training and testing on the two commonly used databases,
Cohn-Kanade and MMI, the proposed approach yielded better results than other meth-
ods.
Chapter 6
Conclusion
Contents
6.1 Contributions
    6.1.1 Object Categorization Using Boosting Within Hierarchical Bayesian Model
    6.1.2 Automatic Facial Expression Recognition
6.2 Perspectives
    6.2.1 Object Similarities and Polymorphism
    6.2.2 Spontaneous Facial Expression Understanding
6.1 Contributions
6.1.1 Object Categorization Using Boosting Within Hierarchical Bayesian Model
In this thesis, we presented a novel framework for object categorization based on
combining Hierarchical Dirichlet Processes with an AdaBoost learner to recognize objects
from multiple categories. The probabilistic model of an object category can be learned
from a set of labeled training image instances. This method, combining discriminative
and generative models, not only accomplishes the recognition task but also provides
potential links between low-level features and high-level semantic labels.
Here, a category includes multiple images, and each image can be modeled as a
mixture with different mixing proportions using shared components. Each compo-
nent is itself a mixture of visual words with different mixing proportions. The components
and their number are inferred from the training data. In other methods using HDP models
[WZFF06, LJ08], one object typically corresponds to only one or two components.
This setting reduces the processing complexity, but it makes the multi-layer hierarchical
model almost meaningless: since those systems lack a pyramid-like structure and each
object is assigned only one or two middle components, the two levels are nearly identical.
In our system, for each category, the number of middle-level components exceeds twenty.
These components can be shared by similar classes, inherited in a hierarchical class
structure (e.g., building → theater), and represent characteristics unique to the current
category. As our system generates a more complex middle-level model, the processing speed
becomes an issue for multi-class recognition. These prefabricated components are new and
generated from training samples; they are more adaptive, and their total number is flexible
compared to other methods.
6.1.2 Automatic Facial Expression Recognition
In this thesis, we presented systems for automatic facial expression recognition. The
main contributions lie in several aspects: adaptive face location based on facial
expression changes, and new textons to describe the dynamic deformation of these facial
expression changes.

The first system learns from essential facial parts and local features for facial
expression identification. It relies on an automatic method for essential facial parts
detection, a specially designed mask for the LBP histogram, features boosted from
histogram bins, and the fusion of heterogeneous features. All these aspects show their
impact on the recognition rates. Compared to other systems, no translation of the face or
alignment of the mouth position is needed in pre-processing. The system was tested on the
JAFFE, Cohn-Kanade and MMI databases and proved robust to slight translations or small
misalignments, although it cannot support large rotations or translations.
Based on this system, we further proposed a novel FER system to recognize
human facial expressions in image sequences using spatiotemporal information. The
system analyzes a single image with static information from appearance-based descrip-
tors and dynamic information provided by the spatiotemporal plane. Here, the dynamic
information is derived from the vertical-time plane and is specially used to model the
evolution of facial parts from the neutral state to the expressional state. The system
further uses the probabilities predicted from single images to classify the whole sequence.
Compared to other methods based on appearance or motion, we not only learn both static
and spatiotemporal features, but also treat these features according to their specific
domains. This strategy improved the effectiveness of image-based recognition. After
training and testing on the two commonly used databases, Cohn-Kanade and MMI, the
proposed approaches yield results competitive with the state of the art.
6.2 Perspectives
6.2.1 Object Similarities and Polymorphism
There are many open questions in object categorization. Some extend the current
system; some are new directions for future research.

Firstly, we can extend the system to consider the co-occurrence of heterogeneous
features or components, and adopt graph methods to model the spatial relations among
components. The detected component instances can also be used for localization, or to
build a hierarchical tree showing the structure of an object class. The use of different
combinations of more kinds of descriptors is also very interesting to explore and tune.

We should also train the models on a larger number of images, and prepare to con-
struct a 'component dictionary' of objects and a taxonomy tree at the semantic level. This
dictionary can be updated and extended when a new category is added. The basic
blocks for categorization are generated in a dynamic way, just like a real vocabulary.
Another thought is that objects may be polymorphic. A thesaurus is necessary
for visual systems: an object can be manifested by several models, and a model can be
related to multiple labels. This idea is rarely explored and promising. The potential
problem may lie in the degree of complexity.
6.2.2 Spontaneous Facial Expression Understanding
In our future work, we plan to build a complete real-time system to recognize
frontal faces and identify the facial expressions from a video camera under varying en-
vironments. At present, the rigid motion generated by pose changes or head movement is
not considered. These motions, which often appear unintentionally in spontaneous
expressions, influence the detection results. In future work, we aim to first differentiate
and compensate rigid motion and expressional muscular movements in the spatiotemporal
domain.
Secondly, the second half of the sequence after the peak frame, i.e., the apex-offset-
neutral segment, is not yet explored. The offset parts, which show the opposite direction of
movement and a different speed, should produce their own models. We therefore plan to use
the complete sequences; for different expressions, different models should be built
to capture the different rhythms along the time axis.
In the current work, subtle facial expressions are mentioned, though the level of inten-
sity is not identified or evaluated. The intensity level of an expression is also meaningful in
human emotion recognition and human-computer interaction: for example, a smile and a
laugh convey different unspoken signals. These facial events are important
though not fully explored.
Having become famous after the broadcast of the American television series "Lie to Me",
microexpressions are another hot topic in psychology [MR83, Eck03]. Unlike basic facial
expressions, they are said to be difficult to fake and, at the same time, difficult to catch.
Computer vision is said to be able to see what human eyes cannot sense. We hope
to study frame-by-frame expression labels to further identify this kind of deception in
videos.
Chapter 7
Résumé en Français
Contents
7.1 Summary
7.2 Contributions
    7.2.1 Detection of the facial organs
    7.2.2 Description of the expressions
    7.2.3 Classification by HDP (Hierarchical Dirichlet Process)
    7.2.4 Hierarchical Dirichlet Process for classification
7.3 Results
    7.3.1 Validation of object classification by HDP
    7.3.2 Validation of the proposed descriptors for expression classification by SVM
        7.3.2.1 Classification of expression images by static descriptors
        7.3.2.2 Classification of expression images by dynamic descriptors
        7.3.2.3 Classification of expression sequences by dynamic descriptors
7.1 Summary

In this thesis, we addressed the problem of object classification and then applied it to
the classification and recognition of facial expressions.

1. Classification by hierarchical Bayesian models

Humans can solve the problem of object classification and recognition effortlessly,
quickly and relatively efficiently, but within the framework of machine learning this
remains a very difficult task. In this field, and for a certain category of approaches,
knowledge is acquired by learning from a set of images; a model of the object is then
built automatically to allow classification, recognition or identification.

In this thesis, we drew inspiration from Dirichlet processes, as distributions over the
space of distributions, which generate intermediate components that improve object
categorization. This model, used notably in the semantic classification of documents, is
characterized by being non-parametric and hierarchical. In a first phase, the set of basic
intermediate components is extracted using Bayesian learning by MCMC; then an iterative
selection of the most distinctive weak classifiers among all the components is performed
by AdaBoost. Our objective is to identify the distributions of the latent components, both
those shared by the different classes and those associated with a particular category.

2. Classification of facial expressions

In this second part, we sought to apply our classification approach to facial expressions.
Many approaches for facial expression recognition systems have been proposed. Ekman
and Friesen defined six basic emotions: anger, disgust, fear, happiness, sadness and
surprise. Because of individual differences and cultural context, it is often difficult to
interpret these universal expressions by applying recognition algorithms, and sometimes
even manually by human beings. Automatic facial expression recognition is still one of
the most active fields in computer vision, and it has attracted many proposals over the
last decades. We therefore focused on the description of facial expressions, and proposed
two new descriptors for automatic recognition (VTB and moments on the spatiotemporal
plane). These descriptors describe the transformation of the face during facial expressions:
they aim to capture the change of the general shape and the texture details induced by the
movements of the facial organs. Finally, HDP and SVM classifiers were used to efficiently
recognize the expression, whether from still images or from image sequences.

This work consisted in finding adequate methods to describe the static and dynamic
aspects of a facial expression, and thus in designing new descriptors able to represent the
characteristics of the movements of the facial muscles and, thereby, to identify the
category of the expression.

Keywords: object classification, facial expression classification, hybrid model,
hierarchical Dirichlet models, spatiotemporal, dynamic descriptor.
7.2 Contributions
7.2.1 Détection des organes faciaux
Dans ce travail, nous nous sommes placés dans le cas de séquences d’images, partant de
l’idée qu’à terme, l’objectif sera d’analyser des vidéos en temps réel, et d’en extraire les
visages puis les expressions qui leur sont associées. Pour détecter de manière grossière
la position des visages, nous nous sommes basés dans un premier temps sur l’algo-
rithme de Kienzel [KBFS05]. Celui-ci fait appel à un apprentissage par SVM à marge
floue et une recherche pyramidale multi-échelles à 12 niveaux afin de détecter les patchs
candidats. Ceux-ci sont ensuite introduits dans un système de classifiers en cascade qui
présente un taux de détection de 95%. Partant de ces résultats, nous avons affiné la détec-
tion des organes faciaux comme présenté dans [JK09]. Après normalisation des images
de la séquence, nous procédons à la soustraction des images Neutres− Expressives, ce
qui nous permet de localiser les organes faciaux grossièrement, du fait que ce sont les
parties les plus concernées par l’apparition d’une expression. Ensuite, une détection plus
fine est opérée, se basant sur l’accumulation des intensités des pixels le long de lignes
horizontales et verticales, et exploitant les propriétés géométriques du visage, comme
illustré sur la figure 7.1.
Figure 7.1 – Detection of the facial features for the 6 expressions

We begin by applying a Gaussian filter to the difference image, which eliminates points due to noise; we then run an algorithm that scans the image to detect the dense blocks corresponding to the eyes, the nose, and the mouth. Finally, we delimit these features with vertical and horizontal lines satisfying a line-density criterion, computed as the ratio:
DL = PIXvalid / PIXTot,    (7.1)
where PIXvalid is the number of pixels whose gray level exceeds a fixed threshold, and PIXTot is the total number of pixels along the line.
Our tests showed that by restricting the computation of the descriptors to the detected features, we obtain better results than when working on the whole image.
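The line-density ratio of Equation 7.1 can be sketched as follows. This is a minimal illustration assuming the difference image is a NumPy array; the function name, threshold value, and toy image are ours, not the thesis code.

```python
import numpy as np

def line_density(diff_image, threshold, axis=0):
    """Line-density ratio DL = PIXvalid / PIXTot (Eq. 7.1) for every
    row (axis=1) or column (axis=0) of a Neutral-minus-Expressive
    difference image."""
    valid = diff_image > threshold              # pixels above the gray-level threshold
    return valid.sum(axis=axis) / diff_image.shape[axis]

# Toy 4x6 difference image: one "dense" row simulating an eye region.
img = np.zeros((4, 6))
img[1, :4] = 200                                # 4 of the 6 pixels exceed the threshold
rows = line_density(img, threshold=100, axis=1)
```

A row or column is then kept as a feature boundary when its density exceeds a chosen cutoff.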
7.2.2 Description of the expressions
According to [Bas79], the human brain relies more on a local analysis of the face than on a holistic approach to determine the facial expression being displayed. Figure 7.2 illustrates the movements associated with each expression. Starting from this observation, we attempt to recognize expressions from a combination of static information, computed in the image plane, and dynamic information, computed in the temporal plane.

Figure 7.2 – The facial movements described by [Bas79]

For the static description, we used LBP, applied to a mask of 10 blocks (Figure 7.3) covering the areas affected by the expressions.

Figure 7.3 – 10 blocks corresponding to a smile image
To extract the dynamic information, we built on the work of [ZP09] and analyzed the image plane XY as well as the horizontal (XT) and vertical (YT) temporal planes, which revealed very large differences (Fig. 7.4).
The analysis of the XT and YT planes showed that, across expressions, the largest differences appear in the YT plane, and that these variations are almost identical across subjects (Fig. 7.5).
We therefore proposed to quantify the deformations in the YT plane.
Figure 7.4 – Left: XY (frontal view); Middle: the YT slice; Right: the XT slice.
Figure 7.5 – Deformations of the YT slice for different expressions

Let S = {I1, I2, . . . , IT} be a temporal sequence, where T is the number of images in the sequence S and each image Ii has size n × m. For each value of x, with 1 ≤ x ≤ n, we decompose S into n spatio-temporal slices Px, as described in Figure 7.6.
Each slice of height m is then divided into three parts of different heights: m1 = (3/8)m for the eyes, m2 = (1/4)m for the nose, and m3 = (3/8)m for the mouth. Finally, the descriptors are computed on overlapping slice portions of size τs along each slice.
The first descriptor, VTB, is an extension of LBP in which we set τs = 3. In the third image, each pixel is used as a threshold applied to its preceding neighbors located at the same coordinates (cf. Figure 7.7).

Figure 7.6 – Example of slices in the YT plane.

For each pixel p with coordinates (x, y) and gray level gx,y in image It at time t, the value associated with this pixel is given by Equation 7.2:
VTB = s(gx,y−1,t−2 − gx,y−1,t) 2^5
    + s(gx,y,t−2 − gx,y,t) 2^4
    + s(gx,y+1,t−2 − gx,y+1,t) 2^3
    + s(gx,y+1,t−1 − gx,y+1,t) 2^2
    + s(gx,y,t−1 − gx,y,t) 2^1
    + s(gx,y−1,t−1 − gx,y−1,t) 2^0.    (7.2)

Figure 7.7 – Computation of VTB
where s(x) = 1 if x > 0, and s(x) = 0 if x ≤ 0.
The final descriptor, of size 192, is then given by the histogram of the computed values, and measures the texture variation that appears during a facial expression.
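As a sketch, Equation 7.2 and the 192-bin histogram can be computed on one YT slice as below. The band boundaries passed to `vtb_histogram`, like all the names here, are illustrative choices, not the thesis implementation.

```python
import numpy as np

def vtb_codes(slice_yt):
    """VTB codes for one YT slice (rows = y, cols = t), following Eq. 7.2:
    each pixel at time t thresholds its neighbours at y-1, y, y+1 in the
    two preceding frames t-2 and t-1, giving a 6-bit code in 0..63."""
    P = slice_yt.astype(int)
    m, T = P.shape
    codes = np.zeros((m - 2, T - 2), dtype=int)
    for y in range(1, m - 1):
        for t in range(2, T):
            diffs = [P[y - 1, t - 2] - P[y - 1, t],   # weight 2^5
                     P[y,     t - 2] - P[y,     t],   # weight 2^4
                     P[y + 1, t - 2] - P[y + 1, t],   # weight 2^3
                     P[y + 1, t - 1] - P[y + 1, t],   # weight 2^2
                     P[y,     t - 1] - P[y,     t],   # weight 2^1
                     P[y - 1, t - 1] - P[y - 1, t]]   # weight 2^0
            bits = [1 if d > 0 else 0 for d in diffs]
            codes[y - 1, t - 2] = sum(b << (5 - i) for i, b in enumerate(bits))
    return codes

def vtb_histogram(slice_yt, parts=((0, 3), (3, 5), (5, 8))):
    """Concatenate one 64-bin histogram per facial band (eyes/nose/mouth),
    i.e. a 3 x 64 = 192-dimensional descriptor. Band bounds are illustrative."""
    codes = vtb_codes(slice_yt)
    hists = [np.bincount(codes[a:b].ravel(), minlength=64) for a, b in parts]
    return np.concatenate(hists)
```

The histograms of all slices of a face are then accumulated to form the texture part of the dynamic descriptor.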
The second descriptor we propose is a spatio-temporal moment, whose purpose is to capture the observed shape variations. We chose the moments (M00, M10/M00, M01/M00), computed on each of the 3 parts of the slice defined above. The moment Mp,q(x, i, k) of each part, for position x of image Ii with a temporal window of τs, is given for slice Px by:

Mp,q(x, i, k) = Σ_{y=1}^{mk} Σ_{t=i−τs}^{i} y^p t^q Px(y, t)    (7.3)

where mk is the height of the part in question, with 1 ≤ k ≤ 3, as shown in Figure 3.15. For a given image Ii, the moments are computed with τs = 8 for all positions x. These moments are then grouped into a final vector of size 3 × 3 × 64, i.e. 576 bins. The resulting vector is then used to determine the probabilities that the current image Ii is associated with each of the 7 known expressions.
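A sketch of Equation 7.3 and of the (M00, M10/M00, M01/M00) triple per band, assuming each slice Px is stored as a NumPy array with rows indexed by y and columns by t; the band split follows the 3/8–1/4–3/8 proportions of the text, and it assumes each band has nonzero mass (M00 > 0).

```python
import numpy as np

def part_moments(slice_yt, i, tau_s=8, bands=(0.375, 0.25, 0.375)):
    """(M00, M10/M00, M01/M00) of Eq. 7.3 for one YT slice Px at frame
    index i, over the temporal window [i - tau_s, i], for the three
    bands (eyes, nose, mouth)."""
    P = slice_yt.astype(float)
    m, _ = P.shape
    window = P[:, i - tau_s:i + 1]             # temporal window, tau_s + 1 frames
    t = np.arange(i - tau_s, i + 1)            # absolute frame indices
    feats, y0 = [], 0
    for frac in bands:
        y1 = y0 + int(round(frac * m))
        y = np.arange(y0 + 1, y1 + 1)          # 1-based y inside the band, per Eq. 7.3
        part = window[y0:y1]
        m00 = part.sum()                       # p = q = 0
        m10 = (y[:, None] * part).sum()        # p = 1, q = 0: weight by y
        m01 = (t[None, :] * part).sum()        # p = 0, q = 1: weight by t
        feats += [m00, m10 / m00, m01 / m00]
        y0 = y1
    return np.array(feats)                     # 9 values per slice position x
```

Concatenating these 9 values over the 64 slice positions yields the 3 × 3 × 64 = 576-bin vector of the text.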
7.2.3 Classification by HDP (Hierarchical Dirichlet Process)
Our first idea was to use hierarchical Dirichlet processes to model facial expressions. We considered this problem to be similar to modeling the topics covered by the documents of a corpus from the occurrences of words in the text. In that application, a document may deal with many topics (unknown, and of unknown number) and share some topics with other documents, each topic being characterized by a set of words.
Texts or documents are thus distributions over topics, and each topic is a distribution over the words of a vocabulary. The analogy with our facial expression application is made through the correspondence (text, word, topic) → (image, visual word, expression), where images are distributions over visual words, visual words are distributions over "internal components" to be determined, and the expressions (also to be determined) are themselves distributions over these "internal components". The idea is then to determine not only the internal components (the topics), which can be achieved with LDA (Latent Dirichlet Allocation), but also the expressions, which are themselves distributions over the internal components. We thus have a distribution over a space of distributions, for which Dirichlet process modeling is well suited.
Other motivations for this choice are, on the one hand, the nonparametric nature of the model: although Ekman identified a well-established set of expressions, we believe the classification may in fact involve an unknown number of expressions corresponding to the different intensities that can be encountered; on the other hand, the clusters to be found are not disjoint but share certain characteristics that we seek to bring out. HDP classification had already been applied to object categorization, and our first tests showed that this approach yields results comparable to the state of the art.
The basis of Bayesian classification is Bayes' rule, which estimates the posterior probability of belonging to a class from knowledge of the prior and of the model likelihood; its simplest form is:

p(x|y) = p(x) p(y|x) / p(y)    (7.4)
More generally, the goal is to perform inference from observations, either to describe or to predict a phenomenon. The first term of Equation 7.4 usually comes from prior knowledge of the problem or from a statistical study, while the second is given by the chosen distribution model, whose parameters are learned in the case of supervised approaches. Much research has focused on determining these two terms. The Bayesian approach has spread to very many domains (finance, weather, social sciences, computer science, etc.). In image processing, it has also been widely used for classification, segmentation, restoration, and so on. For each application, the task is to establish an adequate probabilistic model, derive the inferences, and then find the parameters of the chosen models. In a number of cases, an additional difficulty arises from computational complexity, which requires resorting to other techniques (dynamic programming, graph cuts, etc.).
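As a toy numerical illustration of Equation 7.4 (the two classes and all the prior and likelihood values below are invented for the example):

```python
# Bayes' rule (Eq. 7.4) on a toy 2-class problem: which expression x
# generated the observation y?  p(x|y) = p(x) p(y|x) / p(y).
priors = {"joy": 0.6, "anger": 0.4}            # p(x): illustrative numbers
likelihood = {"joy": 0.2, "anger": 0.7}        # p(y|x) for the observed y
evidence = sum(priors[c] * likelihood[c] for c in priors)        # p(y)
posterior = {c: priors[c] * likelihood[c] / evidence for c in priors}
```

Despite the higher prior on "joy", the likelihood dominates here and the posterior favours "anger".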
7.2.4 Hierarchical Dirichlet Process for classification
The papers [TJBB06] and [WZFF06] provide a detailed presentation of the mechanism used. Without spelling out the mathematical formalism, whose theory extends far beyond the scope of our work, the key idea is to model the image by a set of patches, each described by a codeword from a dictionary. A "latent topic" is associated with each patch, and patches related to a given image structure are assumed to be likely to share similar topics. For each object class, it is then possible to determine the distribution of topics associated with that class. The hierarchical Dirichlet process can be seen as an extension of LDA; Figure 7.8 shows the graphical model of LDA. These methods were originally applied to the classification of documents in a corpus. LDA is based on the idea that a document is a probabilistic mixture of latent topics (z) and that each topic is characterized by a probability distribution over its associated words (x). The LDA model prevents dependence between the probability distribution used to choose the topics and the documents already known. In the graph shown we have:
Figure 7.8 – Graphical model of LDA

γ is the parameter of the uniform Dirichlet distribution between documents and topics, i.e. the prior;
β is the parameter of the uniform Dirichlet distribution between words and topics, which is to be estimated;
πj is the distribution of topics over the j-th document;
zj,i is the topic associated with the i-th word of the j-th document;
and xj,i is the i-th word of the j-th document.
Figure 7.9 – Graphical model of HDP

Figure 7.9 shows the HDP model proposed by [TJBB06], in which π is no longer a distribution parameterized by α but a distribution over distributions, which, as the name indicates, adds an extra level of hierarchy.
We then have:

G0 | γ, H ∼ DP(γ, H)
Gj | α0, G0 ∼ DP(α0, G0)
φji | Gj ∼ Gj
xji | φji ∼ F(φji)    (7.5)
Applied to our setting, the observations xj,i are the visual words computed with the Bag of Words technique. The φji are latent components to be determined, drawn from a distribution Gj with concentration parameter α0 around the base measure G0, itself a Dirichlet process with parameters γ and H. A given class (objects, expressions, etc.) corresponds to a certain distribution of latent components φji; knowing this distribution makes it possible to assign each image to its most probable class. Inference is carried out by Markov chain Monte Carlo, using a Gibbs sampler, with the stick-breaking representation.
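The stick-breaking representation used for inference can be sketched as follows: a truncated construction of the weights of DP(γ, H), where each weight is a fraction of the stick remaining after the previous breaks. The truncation level K and the use of NumPy are our illustrative choices.

```python
import numpy as np

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking weights of a Dirichlet process DP(alpha, H):
    beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{l<k} (1 - beta_l)."""
    betas = rng.beta(1.0, alpha, size=K)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining                   # weights of the first K atoms

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=1.0, K=50, rng=rng)  # mass not covered by the K
                                               # atoms is 1 - pi.sum()
```

Smaller α concentrates the mass on a few atoms (few latent components); larger α spreads it, which is what makes the number of components effectively open-ended.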
7.3 Results
The tests covered two aspects: on the one hand, object classification by hierarchical Dirichlet process, in order to validate the model; on the other, the detection and description of facial expressions.
7.3.1 Validation of object classification by HDP
We used a few categories of the Caltech database [FFFP07], which contains 101 object categories, each comprising 40 to 800 images of size 300x200 pixels (examples shown in Figure 7.10).

Figure 7.10 – Categories of the Caltech database used in our tests.
We used three complementary descriptors, SIFT, Shape Context [BMP00], and color, computed locally at interest points, each producing a different visual codebook. The three codebooks are then combined and used in the proposed generative model. In practice, for SIFT, a 128-dimensional vector is computed over a 16x16 pixel region around each detected interest point, and all the descriptors of a class are then clustered to keep a vocabulary of 100 visual words. The Shape Context descriptor is computed around points taken from an edge map, yielding a vector of size 128; here we keep a vocabulary of 50 visual words. Finally, the mean color around the interest points is computed in LUV space, and we keep only 24 visual words. Table 7.1 shows the results obtained by HDP after 50 iterations for the training phase and 20 for the test phase.
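The codebook step, clustering local descriptors into a vocabulary of visual words, can be sketched with a plain k-means. This illustrates the principle only; the exact clustering method and parameters used in the thesis are not specified here, and the names are ours.

```python
import numpy as np

def build_codebook(descriptors, k, iters=20, seed=0):
    """Quantize local descriptors (e.g. 128-D SIFT vectors) into k visual
    words with a plain Lloyd-style k-means."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    labels = np.zeros(len(descriptors), dtype=int)
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, labels

# An image is then encoded as a bag of words: the histogram of `labels`
# restricted to its own descriptors.
X = np.vstack([np.zeros((10, 2)), 10.0 * np.ones((10, 2))])  # two toy clusters
centers, labels = build_codebook(X, k=2)
```

With 100 SIFT words, 50 Shape Context words, and 24 color words, the three resulting histograms are simply concatenated before entering the generative model.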
Table 7.1 – Classification results (number of components in parentheses).

              SIFT+SC+C    SIFT+SC    SIFT
Airplane      96% (54)     92% (43)   68% (34)
Leopard       100% (67)    94% (42)   92% (46)
Motorbikes    88% (40)     90% (43)   84% (35)
Faces         66% (50)     70% (41)   62% (43)
We compare combinations of several descriptors (SIFT, SIFT+SC, and SIFT+SC+Color). We observe that these results rival the state of the art [FFFP07] (75% on average, 96% at best). One of the difficulties encountered is determining the number of iterations needed to be sure of having converged to the right model. Figure 7.11 shows the evolution of the likelihood as a function of the number of iterations. For SIFT alone, 10 to 15 iterations suffice. The more descriptors are used, the more iterations are needed to obtain the same result (15 to 20 for SIFT+SC and 20 to 25 for SIFT+SC+Color).
We also seek to localize an object from the internal components detected while learning its category. Figure 7.12 shows, as squares, the positions of the components most characteristic of the leopard on a test image; they clearly allow the leopard to be localized even though the background shares similarities with the object.
Figure 7.11 – Convergence and number of iterations.
Figure 7.12 – The main characteristic components of the leopard.

The last point we examined at this stage is computation time, which is excessive, especially during the training phase. We sought to reduce the number of intermediate components to select, in contrast to works such as [VJ01], which instead reduce the number of descriptors using AdaBoost. Still working on the same four categories, we used 50 images of one category as positive examples and 50 images of each of the three other categories as negative examples. We vary the number T of intermediate components to keep. Figure 7.13 shows that beyond about ten components, the results change very little.
Table 7.2 compares plain HDP with boosted HDP, showing that on average there is no loss in classification quality when using a reduced set of components determined by AdaBoost.
Figure 7.13 – Performance as a function of the number of components.

Table 7.2 – Performance comparison.

              HDP (K components)   HDPBoost (best T components)
Airplane      92% (43)             94% (8)
Faces         70% (41)             74% (10)
Leopard       94% (42)             92% (11)
Motorbikes    90% (43)             88% (6)

However, during the tests the training phase proved very costly in computation time, given the size of the descriptors used, despite the reduced number of classes and images. Consequently, for facial expression classification we finally opted for a classical SVM classifier, which we boosted to select the most significant features. This notably allowed us to validate our method for the detection and description of expressions.
7.3.2 Validation of the proposed descriptors for expression classification by SVM
Here we distinguish image classification from sequence classification, the goal being to validate the proposed method for locating facial expressions and to verify the relevance of the proposed descriptors. The tests were carried out on three databases. The first is the JAFFE database [LAKG98], which includes 7 facial expressions (six plus neutral) for each of ten Japanese subjects. This database is widely used for non-spontaneous tests and is known for yielding low recognition rates. The second database is MMI [PVRM05], which consists of manually annotated sequences. Finally, the Cohn-Kanade database [KCT00] also consists of sequences and uses the FACS (Facial Action Coding System) to identify the expression peaks; however, the emotion labels are not available.
7.3.2.1 Classification of expression images with static descriptors
We use the LBP descriptor computed around the detected facial features, as described earlier. Classification is performed with a boosted SVM where only 20 parameters are selected per expression. Given the overlaps, we end up with a vector of dimension 73 for the set of 6 expressions. Tables 7.3, 7.4, and 7.5 compare our results with some of those obtained respectively by [FPH05], [GD05b], [KF08], [LFCY06] for the JAFFE database, by [SGM09], [TA07] for the MMI database, and by [CSG∗03], [BGL∗06], [MB07], and [SGM09] for the Cohn-Kanade database.
Table 7.3 – Average performance on the JAFFE database (%)

Our approach   Feng   Guo    Koutlas   Liao
98.3           93.8   92.3   90.8      94.59

Table 7.4 – Average performance on the MMI database (%)

Our approach   Shan   Tripathie
91.2           86.9   82.2
For the Cohn-Kanade database, given the great variability of appearances, we kept 50 parameters, which finally yields a vector of dimension 220. In addition, to allow comparison with other works, we also used histograms of Gabor features for the description. In Table 7.5, SN is the number of subjects, SqN the number of sequences, C the number of classes, and A indicates whether the localization of the facial features is automatic. We obtain slightly better results than the state of the art.
Table 7.5 – Performance comparison on the Cohn-Kanade database (%)

               Representation           SN   SqN   C   A   Rate (%)
[CSG∗03]       Motion Units             53   -     6   N   91.8
[BGL∗06]       Gabor+AdaBoost           90   313   7   Y   93.3
[MB07]         Gabor+AdaBoost           -    -     6   Y   84.6
[MB07]         Edge/chamfer+AdaBoost    -    -     6   Y   85
[SGM09]        BoostedLBP               96   320   6   N   95.1
Our approach   LBP                      94   346   6   Y   91.9
Our approach   BoostedLBP               94   346   6   Y   91.4
Our approach   BoostedLBP+Gabor         94   346   6   Y   94.3
7.3.2.2 Classification of expression images with dynamic descriptors
The goal here is to recognize the expression of a sequence by computing, in a single pass, the descriptors corresponding to a portion of it. We work on portions of sequences long enough to compute the VTB and moment descriptors. In [CSG∗03], the authors use a set of 53 subjects, each with at least 4 sequences, and for each expression eight frames are considered for descriptor extraction. [BGL∗06] considers the first and last frame of each sequence for both training and testing. [SGM09] uses one frame of the neutral expression and the last three frames corresponding to the peak of the expression to be recognized. We followed a similar scheme for the set of images used for description, each time taking into account the neutral images and at least 4 expressive images.
The LBP, VTB, and moment descriptors are computed as presented in Section 7.2.2, with the temporal window width τs set to 8. This value is a good compromise between having enough regions to compute the temporal moments and enough probabilities, each associated with a mini-sequence of length τs, to compute the probability associated with each image. Table 7.6 compares our results with various works on the same database (Cohn-Kanade), where D indicates whether the descriptors are dynamic. We can conclude that these descriptors noticeably improve the classification.
The MMI database was also used for testing. The confusion matrix of the moment descriptor, given in Table 7.7, shows the classes that are hard to characterize with this descriptor (here C stands for anger (colère), D for disgust (dégoût), J for joy (joie), P for fear (peur), T for sadness (tristesse), S for surprise, and N for neutral).
Table 7.6 – Comparison with spatio-temporal descriptors

               Features      SN   SqN   C   D   A   AR (%)
[CSG∗03]       Gabor         53   318   6   N   N   91.8
[BGL∗06]       Gabor         90   313   7   N   Y   93
[KZP08]        Shape         -    -     6   N   Y   92.3
[DSRDS08]      Holistic      98   411   6   N   N   96.1
[PY09]         Haar+Boost    96   -     6   N   Y   88
[VNP09]        Candide       -    440   7   N   N   90
[SGM09]        BoostedLBP    96   320   6   N   N   95.1
[SC09]         AAM           -    72    6   N   N   97.22
Our approach   LBP+VTB       95   348   7   Y   Y   94
Our approach   LBP+VTB       95   348   7   Y   Y   97.2
Our approach   Moments       95   348   7   Y   Y   95.5
Our approach   Moments       95   348   7   Y   Y   97.3
Table 7.7 – Confusion matrix on the MMI database (%)
C D J P T S N
C 76.6 2.34 0.78 0 2.34 0 17.97
D 4.27 81 0 5.13 0.85 2.56 5.98
J 1.28 0.64 94.9 0 0 0.64 2.56
P 2.68 2.68 1.79 58.9 1.79 16.96 15.18
T 3.9 0 0 0.78 66.4 1.56 27.34
S 1.28 0.64 0.64 7.69 0 71.2 18.60
N 0.71 0.1 0.1 0.3 0.3 0.3 98.2
Next, instead of classifying individual frames of a given sequence, we classify the entire sequence, taking into account the evolution of the facial changes over the course of the expression.
7.3.2.3 Classification of expression sequences with dynamic descriptors
Unlike the previous section, here we seek to determine the expression associated with a sequence by analyzing the evolution of the spatio-temporal descriptors over the sequence. Facial expression research is moving increasingly toward video and spontaneous expressions. In this spirit, we used the previous results to determine the expression carried by the sequence, assuming it contains only one. Various studies have shown that expressions are very often brief and therefore span only a few frames. In our approach, we considered all the frames of the sequence and determined for each the probability of belonging to one of the classes. We assume an ordered sequence S = {I1, I2, . . . , IT}, where T is the number of frames in the sequence. We classify each frame Ii, with 3 ≤ i ≤ T, using the previous method, associating with it a probability vector {pi,c0, pi,c1, . . . , pi,c6}, where pi,cj is the probability that frame i belongs to class j, with pi,c0 corresponding to the neutral expression. Algorithm 7.1 then assigns the most probable class to the sequence.
Algorithm 7.1: Classification of a sequence by weighted sum

1. Initialize the vector PS = {P1, P2, . . . , P6} with Pk = 0, for 1 ≤ k ≤ 6
2. For i = 3 to T:
   (a) if Gi = 'Neutral', ignore the frame and go to the next iteration;
   (b) if Gi ≠ 'Neutral', Pk = Pk + wi · pi,ck, for 1 ≤ k ≤ 6
3. Label of the sequence:
   G = argmax_k {Pk},    (7.6)
   for 1 ≤ k ≤ 6.
For the weights wi, we made assumptions about the relation between similarity and the distance to the most expressive image (a uniform law, or a Gaussian law for two values of σ: σ1 = T and σ2 = T/2). Table 7.8 compares our results with other works tested on the same database; for various parameters of the weighting distribution, we obtain better results.
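Algorithm 7.1 with the Gaussian weighting can be sketched as follows. The per-frame probabilities, the peak index, and σ are illustrative inputs; for simplicity the sketch iterates over all frames and takes each frame's label Gi as the argmax of its probability vector.

```python
import numpy as np

def classify_sequence(probs, peak, sigma):
    """Weighted vote over the frames of a sequence (Algorithm 7.1).
    `probs` is a (T, 7) array of per-frame class probabilities, column 0
    being 'neutral'; frames whose most probable class is neutral are
    skipped, the others vote with a Gaussian weight centred on `peak`."""
    scores = np.zeros(6)                       # P1..P6
    for i, p in enumerate(probs):
        if p.argmax() == 0:                    # frame labelled neutral: skip
            continue
        w = np.exp(-0.5 * ((i - peak) / sigma) ** 2)
        scores += w * p[1:]                    # Pk += wi * pi,ck
    return scores.argmax() + 1                 # class label in 1..6

# Toy sequence: early frames neutral, later frames favour class 3.
probs = np.zeros((10, 7))
probs[:4, 0] = 1.0
probs[4:, 3] = 0.8
probs[4:, 5] = 0.2
label = classify_sequence(probs, peak=9, sigma=5.0)
```

Setting the weight w to a constant recovers the uniform variant of the text; σ = T or T/2 gives the two Gaussian variants.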
Table 7.8 – Comparison for sequence classification

                            SN   SqN   C   D   A   AR (%)
[YBS04]                     -    -     6   Y   Y   90.9
[XLC08]                     95   365   6   Y   N   88.8
[BMB08]                     94   333   6   N   Y   89.13
[ZP09] a                    97   374   6   Y   N   95.1
[ZP09] b                    97   374   6   Y   Y   93.85
[RD09]                      70   350   5   Y   -   93.85
[KOY∗09]                    53   129   6   Y   -   70
[CLL09]                     -    392   6   Y   N   92.86
Ours, LBP+VTB, wi = 1       95   348   6   Y   Y   93.68
Ours, LBP+VTB, N(µ, σ1²)    95   348   6   Y   Y   95.1
Ours, LBP+VTB, N(µ, σ2²)    95   348   6   Y   Y   95.7
Ours, Moments, wi = 1       95   348   6   Y   Y   98.5
Bibliography
[AAR04] Agarwal S., Awan A., Roth D.: Learning to detect objects in images via a sparse, part-based representation. IEEE Trans. Pattern Anal. Mach. Intell. 26, 11 (2004), 1475–1490.
[AR02] Agarwal S., Roth D.: Learning a sparse representation for object detection. In ECCV ’02: Proceedings of the 7th European Conference on Computer Vision - Part IV (2002), pp. 113–130.
[Bas79] Bassili J. N.: Emotion recognition: The role of facial movement and the relative importance of upper and lower areas of the face. Journal of Personality and Social Psychology 37, 11 (1979), 2049–2058.
[BBDT05] Beveridge J., Bolme D., Draper B., Teixeira M.: The CSU face identification evaluation system: Its purpose, features, and structure. Machine Vision and Applications 16, 2 (February 2005), 128–138.
[BC08] Beskow J., Cerrato L.: Evaluation of the expressivity of a Swedish talking head in the context of human-machine interaction. March 2008.
[bDK05] bo Duan K., Keerthi S. S.: Which is the best multiclass SVM method? An empirical study. In Proceedings of the Sixth International Workshop on Multiple Classifier Systems (2005), pp. 278–285.
[BETVG08] Bay H., Ess A., Tuytelaars T., Van Gool L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110, 3 (2008), 346–359.
[BGL∗06] Bartlett M., Littlewort G., Lainscsek C., Fasel I., Frank M., Movellan J.: Fully automatic facial action recognition in spontaneous behavior. In Automatic Face and Gesture Recognition (2006), pp. 223–228.
[BGV92] Boser B. E., Guyon I. M., Vapnik V. N.: A training algorithm for optimal margin classifiers. In COLT ’92: Proceedings of the Fifth Annual Workshop on Computational Learning Theory (1992), pp. 144–152.
[Bie87] Biederman I.: Recognition-by-components: A theory of human image understanding. Psychological Review 94, 2 (1987), 115–147.
[Bis06] Bishop C. M.: Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[BKSS10] Bergtholdt M., Kappes J., Schmidt S., Schnörr C.: A study of parts-based object class detection using complete graphs. Int. J. Comput. Vision 87, 1-2 (2010), 93–117.
[BMB08] Buenaposada J., Munoz E., Baumela L.: Recognising facial expressions in video sequences. Pattern Analysis and Applications 1, 2 (January 2008), 101–116.
[BMP00] Belongie S., Malik J., Puzicha J.: Shape context: A new descriptor for shape matching and object recognition. In Proc. NIPS (2000), pp. 831–837.
[BNJ03] Blei D., Ng A., Jordan M.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (January 2003), 993–1022.
[Bra00] Bradski G.: The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000).
[BT05] Bouchard G., Triggs B.: Hierarchical part-based visual object categorization. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 1 (2005), pp. 710–715.
[Bur98] Burges C. J. C.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2, 2 (1998), 121–167.
[BZMn08] Bosch A., Zisserman A., Muñoz X.: Scene classification using a hybrid generative/discriminative approach. IEEE Trans. Pattern Anal. Mach. Intell. 30 (April 2008), 712–727.
[CB03] Chibelushi C., Bourel F.: Facial expression recognition: A brief tutorial overview. In CVonline: On-Line Compendium of Computer Vision (2003).
[CCS03] Cowie R., Cowie R., Schroeder M.: The description of naturally occurring emotional speech. In Proceedings of the 15th International Congress of Phonetic Sciences (2003), pp. 2877–2880.
[CFH05] Crandall D., Felzenszwalb P., Huttenlocher D.: Spatial priors for part-based recognition using statistical models. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 1 (Washington, DC, USA, 2005), IEEE Computer Society, pp. 10–17.
[CHC09] Cheng X., Hu Y., Chia L.-T.: Hierarchical word image representation for parts-based object recognition. In ICIP ’09: Proceedings of the 16th IEEE International Conference on Image Processing (Piscataway, NJ, USA, 2009), IEEE Press, pp. 301–304.
[CL01] Chang C.-C., Lin C.-J.: LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[CL06] Carneiro G., Lowe D.: Sparse flexible models of local features. In ECCV (3) (2006), pp. 29–43.
[CLL09] Chang K., Liu T., Lai S.: Learning partially-observed hidden conditional random fields for facial expression recognition. In CVPR (Miami, FL, USA, June 2009), pp. 533–540.
[CSG∗03] Cohen I., Sebe N., Gozman F., Cirelo M., Huang T.: Learning Bayesian network classifiers for facial expression recognition with both labeled and unlabeled data. In CVPR (Madison, Wisconsin, USA, June 2003), vol. 1, pp. 595–601.
[CV95] Cortes C., Vapnik V.: Support-vector networks. In Machine Learning (1995), pp. 273–297.
[Dar02] Darwin C.: The Expression of the Emotions in Man and Animals, 3rd ed. Oxford University Press Inc, 2002.
[DPR06] Djemal K., Puech W., Rossetto B.: Automatic active contours propagation in a sequence of medical images. IJIG 6, 2 (April 2006), 267–292.
[DSRDS08] De Silva C. R., Ranganath S., De Silva L. C.: Cloud basis function neural network: A modified RBF network architecture for holistic facial expression recognition. Pattern Recogn. 41, 4 (2008), 1241–1253.
[Eck03] Eckman P.: Darwin, deception, and facial expression. Annals of the New York Academy of Sciences 1000 (2003), 205–221.
[EF76] Ekman P., Friesen W. V.: Pictures of Facial Affect. Consulting Psychologists Press, 1976.
[EF78] Ekman P., Friesen W.: Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.
[EVGW∗] Everingham M., Van Gool L., Williams C. K. I., Winn J., Zisserman A.: The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html.
[FE73] Fischler M., Elschlager R.: The representation and matching of pictorial structures. IEEE Transactions on Computers C-22, 1 (January 1973), 67–92.
[Fer05] Fergus R.: Visual Object Category Recognition. PhD thesis, University of Oxford, 2005.
[FFFP07] Fei-Fei L., Fergus R., Perona P.: Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Comput. Vis. Image Underst. 106, 1 (2007), 59–70.
[FFTR∗05] Fize D., Fabre-Thorpe M., Richard G., Doyon B., Thorpe S. J.: Rapid categorization of foveal and extrafoveal natural images: Associated ERPs and effects of lateralization. Brain and Cognition 59, 2 (2005), 145–158.
[FH05] Felzenszwalb P. F., Huttenlocher D. P.: Pictorial structures for object recognition. Int. J. Comput. Vision 61, 1 (2005), 55–79.
[FPH05] Feng X., Pietikäinen M., Hadid A.: Facial expression recognition with local binary patterns and linear programming. Pattern Recognition and Image Analysis 15, 2 (January 2005), 546–548.
[FPZ03] Fergus R., Perona P., Zisserman A.: Object class recognition by unsupervised scale-invariant learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (June 2003), vol. 2, pp. 264–271.
[FPZ05] Fergus R., Perona P., Zisserman A.: A sparse object category model for efficient learning and exhaustive recognition. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 1 (2005), pp. 380–387.
[FS95] Freund Y., Schapire R. E.: A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT ’95: Proceedings of the Second European Conference on Computational Learning Theory (1995), pp. 23–37.
[GD05a] Grauman K., Darrell T.: The pyramid match kernel: Discriminative classi-fication with sets of image features. In ICCV ’05: Proceedings of the Tenth IEEEInternational Conference on Computer Vision (Washington, DC, USA, 2005),IEEE Computer Society, pp. 1458–1465. xiii, 16, 20, 21, 22
[GD05b] Guo G., Dyer C.: Learning from examples in the small sample case: Face expression recognition. IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics (2005), 477–488. 24, 27, 89, 118
[GHP07] Griffin G., Holub A., Perona P.: Caltech-256 Object Category Dataset. Tech. Rep. 7694, California Institute of Technology, 2007. 10
[GK01] Gnedin A., Kerov S.: A characterization of gem distributions. Comb. Probab. Comput. 10, 3 (2001), 213–217. 68
[Gro05] Gross R.: Face Databases. Springer, New York, February 2005. 85
[Har54] Harris Z.: Distributional structure. Word 10, 23 (1954), 146–162. 14
[HFH∗09] Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I. H.: The weka data mining software: an update. SIGKDD Explorations 11, 1 (2009), 10–18. 60
[HL02] Hsu C.-W., Lin C.-J.: A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13, 2 (2002), 415–425. 62
[HL08] Huiskes M. J., Lew M. S.: The mir flickr retrieval evaluation. In MIR ’08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval (New York, NY, USA, 2008), ACM. 11
[Hof99] Hofmann T.: Probabilistic latent semantic indexing. In SIGIR ’99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 1999), ACM, pp. 50–57. 65
[HPA04] Hadid A., Pietikäinen M., Ahonen T.: A discriminative feature space for detecting and recognizing faces. In CVPR (Washington, DC, USA, June 2004), pp. 797–804. xiv, 25, 33, 44
[HS88] Harris C., Stephens M.: A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference (1988), pp. 147–151. 15
[HT98] Hastie T., Tibshirani R.: Classification by pairwise coupling. In NIPS ’97: Proceedings of the 1997 conference on Advances in neural information processing systems 10 (1998), pp. 507–513. 62
[JB08] Jenkins R., Burton A. M.: 100% accuracy in automatic face recognition. Science 319, 5862 (January 2008), 435. 9
[Jeb04] Jebara T.: Machine Learning: Discriminative and Generative. Kluwer Academic, 2004. 57, 60, 63
[JI10a] Ji Y., Idrissi K.: Learning from Essential Facial Parts and Local Features for Automatic Facial Expression Recognition. In CBMI, 8th International Workshop on Content-Based Multimedia Indexing (June 2010). 32, 76
[JI10b] Ji Y., Idrissi K.: Using Moments on Spatiotemporal Plane for Facial Expression Recognition. In 20th International Conference on Pattern Recognition (ICPR) (Istanbul, Turkey, Aug. 2010). 32, 76
[JIB09] Ji Y., Idrissi K., Baskurt A.: Object categorization using boosting within hierarchical bayesian model. In ICIP09 (Dec. 2009), pp. 317–320. 32, 56, 76
[JK09] Ji Y., Idrissi K.: Facial expression recognition by automatic facial parts position detection with boosted-lbp. In SITIS09 (Marrakech, Morocco, November 2009). 76, 105
[KBFS04] Kienzle W., Bakir G., Franz M., Schölkopf B.: Efficient approximations for support vector machines in object detection. In Pattern Recognition, vol. 3175 of Lecture Notes in Computer Science. 2004, pp. 54–61. 21
[KBFS05] Kienzle W., Bakir G. H., Franz M. O., Schölkopf B.: Face detection — efficient and rank deficient. In Advances in Neural Information Processing Systems 17 (Cambridge, MA, 2005), Saul L. K., Weiss Y., Bottou L., (Eds.), MIT Press, pp. 673–680. 105
[KCT00] Kanade T., Cohn J., Tian Y.-L.: Comprehensive database for facial expression analysis. In Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition (FG’00) (Grenoble, France, March 2000), pp. 46–53. 84, 87, 118
[KF08] Koutlas A., Fotiadis D.: A region based methodology for facial expression recognition. In BIOSIGNALS (2008), vol. 2, pp. 218–223. 25, 89, 118
[KJC08] Kim D. H., Jung S. U., Chung M. J.: Extension of cascaded simple feature based face detection to facial expression recognition. Pattern Recogn. Lett. 29, 11 (2008), 1621–1631. xiv, 27
[Kos94] Kosslyn S.: Image and Brain: The Resolution of the Imagery Debate. MIT Press,1994. 2
[KOY∗09] Kumano S., Otsuka K., Yamato J., Maeda E., Sato Y.: Pose-invariant facial expression recognition using variable-intensity templates. Int. J. Comput. Vision 83, 2 (2009), 178–194. 26, 29, 96, 121
[KP08] Koelstra S., Pantic M.: Non-rigid registration using free-form deformations for recognition of facial actions and their temporal dynamics. In Automatic Face and Gesture Recognition (2008), pp. 1–8. 24, 26
[KZP08] Kotsia I., Zafeiriou S., Pitas I.: Texture and shape information fusion for facial expression and facial action unit recognition. Pattern Recogn. 41, 3 (2008), 833–851. 26, 92, 120
[LAKG98] Lyons M., Akamatsu S., Kamachi M., Gyoba J.: Coding facial expressions with gabor wavelets. In FG ’98: Proceedings of the 3rd. International Conference on Face & Gesture Recognition (1998), pp. 200–205. 84, 85, 117
[LBH∗08] Laganière R., Bacco R., Hocevar A., Lambert P., Païs G., Ionescu B. E.: Video summarization from spatio-temporal features. In TVS ’08: Proceedings of the 2nd ACM TRECVid Video Summarization Workshop (New York, NY, USA, 2008), ACM, pp. 144–148. 26
[LBL09] Littlewort G. C., Bartlett M. S., Lee K.: Automatic coding of facial expressions displayed during posed and genuine pain. Image Vision Comput. 27, 12 (2009), 1797–1803. 25, 29
[Lew98] Lewis D. D.: Naive (bayes) at forty: The independence assumption in information retrieval. In ECML ’98: Proceedings of the 10th European Conference on Machine Learning (London, UK, 1998), pp. 4–15. 64
[LFCY06] Liao S., Fan W., Chung A., Yeung D.: Facial expression recognition using advanced local binary patterns, tsallis entropies and global appearance features. In ICIP (2006), pp. 665–668. 89, 118
[LJ08] Larlus D., Jurie F.: Combining appearance models and markov random fields for category level object segmentation. In CVPR (June 2008). xiii, 22, 78, 80, 100
[LLS04] Leibe B., Leonardis A., Schiele B.: Combined object categorization and segmentation with an implicit shape model. In Proceedings of the Workshop on Statistical Learning in Computer Vision (Prague, Czech Republic, May 2004). 11
[Low99a] Lowe D. G.: Object recognition from local scale-invariant features. In ICCV (1999), p. 1150. xiv, 34, 35
[Low99b] Lowe D. G.: Object recognition from local scale-invariant features. In ICCV (1999), p. 1150. 15, 18, 41
[LP05] Li F.-F., Perona P.: A bayesian hierarchical model for learning natural scene categories. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 2 (Washington, DC, USA, 2005), pp. 524–531. xiii, 15, 16, 17, 18, 64
[Mar82] Marr D.: Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W.H. Freeman, 1982. 12
[MB98] Martinez A., Benavente R.: The AR Face Database. Tech. Rep. 24, Computer Vision Center, Barcelona, Spain, June 1998. 84
[MB07] Moore S., Bowden R.: Automatic facial expression recognition using boosted discriminatory classifiers. In IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFG), ICCV (2007), pp. 71–83. 92, 118, 119
[MB09] Moore S., Bowden R.: The effect of pose on facial expression recognition. In BMVC09 (2009), pp. xx–yy. 26, 30
[MCUP04] Matas J., Chum O., Urban M., Pajdla T.: Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing 22, 10 (2004), 761–767. British Machine Vision Computing 2002. 15
[MR83] McLeod P. L., Rosenthal R.: Micromomentary movement and the decoding of face and body cues. Journal of Nonverbal Behavior 8 (1983), 83–90. 102
[MRD05] Mattern F., Rohlfing T., Denzler J.: Adaptive performance-based classifier combination for generic object recognition. In Proceedings of the 10th International Fall Workshop Vision, Modeling, and Visualization (Erlangen, Germany, Nov. 2005), pp. 139–146. 21
[MS05] Mikolajczyk K., Schmid C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis & Machine Intelligence 27, 10 (October 2005), 1615–1630. 79
[MW09] Matsumoto D., Willingham B.: Spontaneous facial expressions of emotion of congenitally and noncongenitally blind individuals. Journal of Personality and Social Psychology 96, 1 (2009), 1–10. 83
[MWR∗08] Mayer C., Wimmer M., Riaz Z., Roth A., Eggers M., Radig B.: Real time system for model-based interpretation of the dynamics of facial expression. In Automatic Face and Gesture Recognition (2008), pp. 1–2. 38
[Nal04] Nallapati R.: Discriminative models for information retrieval. In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (2004), pp. 64–71. 58
[OFPA04] Opelt A., Fussenegger M., Pinz A., Auer P.: Generic object recognition with boosting. Tech. Rep. TR-EMT-2004-01, EMT, TU Graz, Austria, 2004. Submitted to the IEEE Transactions on Pattern Analysis and Machine Intelligence. 11, 17
[PB04] Pozdnoukhov A., Bengio S.: Tangent vector kernels for invariant image classification with svms. In ICPR04 (2004), pp. III: 486–489. 21
[PB07] Pantic M., Bartlett M. S.: Machine Analysis of Facial Expressions. I-Tech Education and Publishing, Vienna, Austria, July 2007. xiii, xiv, 24, 33
[PBE∗06] Ponce J., Berg T. L., Everingham M., Forsyth D. A., Hebert M., Lazebnik S., Marszalek M., Schmid C., Russell B. C., Torralba A., Williams C. K. I., Zhang J., Zisserman A.: Dataset issues in object recognition. In Toward Category-Level Object Recognition, volume 4170 of LNCS (2006), Springer, pp. 29–48. 77
[Pin05] Pinz A.: Object categorization. Found. Trends. Comput. Graph. Vis. 1, 4 (2005), 255–353. xiii, 9, 16, 20
[PK09] Park S., Kim D.: Subtle facial expression recognition using motion magnification. Pattern Recogn. Lett. 30, 7 (2009), 708–716. 28, 48
[PR00] Pantic M., Rothkrantz L.: Automatic analysis of facial expressions: The state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000), 1424–1445. 23
[PSK08] Park S., Shin J., Kim D.: Facial expression analysis with facial expression deformation. In 19th International Conference on Pattern Recognition (ICPR 2008) (Dec. 2008), pp. 1–4. 27
[PVRM05] Pantic M., Valstar M., Rademaker R., Maat L.: Web based database for facial expression analysis. In Proc. IEEE Intl Conf. Multimedia and Expo (2005), pp. 317–321. 84, 86, 118
[PY09] Yang P., Liu Q., Metaxas D. N.: Rankboost with l1 regularization for facial expression recognition and intensity estimation. In ICCV (Kyoto, Japan, September 2009), pp. 1018–1025. 92, 120
[RD09] Raducanu B., Dornaika F.: Natural facial expression recognition using dynamic and static schemes. In ISVC ’09: Proceedings of the 5th International Symposium on Advances in Visual Computing (2009), pp. 730–739. xiv, 28, 29, 121
[RLAB10] Revaud J., Lavoué G., Ariki Y., Baskurt A.: Learning an efficient and robust graph matching procedure for specific object recognition. In International Conference on Pattern Recognition (ICPR) (Aug. 2010). 20
[RLSP06] Rothganger F., Lazebnik S., Schmid C., Ponce J.: 3d object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. Int. J. Comput. Vision 66, 3 (2006), 231–259. 10, 13
[RMG∗76] Rosch E., Mervis C. B., Gray W. D., Johnson D. M., Boyes-Braem P.: Basic objects in natural categories. Cognitive Psychology 8, 3 (1976), 382–439. 10
[Rob63] Roberts L. G.: Machine Perception of Three-Dimensional Solids. Outstanding Dissertations in the Computer Sciences. Garland Publishing, New York, 1963. xiii, 12, 13
[RSNM03] Raina R., Shen Y., Ng A. Y., Mccallum A.: Classification with hybrid generative/discriminative models. In Advances in Neural Information Processing Systems 16 (2003), MIT Press. 70
[RTMF08] Russell B. C., Torralba A., Murphy K. P., Freeman W. T.: Labelme: A database and web-based tool for image annotation. Int. J. Comput. Vision 77, 1-3 (2008), 157–173. 12
[SC09] Shang L., Chan K.-P.: Nonparametric discriminant hmm and application to facial expression recognition. In CVPR (Miami, FL, USA, June 2009), pp. 2090–2096. 92, 120
[SGM09] Shan C., Gong S., McOwan P.: Facial expression recognition based on local binary patterns: A comprehensive study. Image and Vision Computing 27, 6 (May 2009), 803–816. 24, 25, 27, 91, 92, 93, 118, 119, 120
[SMB00] Schmid C., Mohr R., Bauckhage C.: Evaluation of interest point detectors.Int. J. Comput. Vision 37, 2 (2000), 151–172. 15
[SRE∗05a] Sivic J., Russell B., Efros A. A., Zisserman A., Freeman B.: Discovering objects and their location in images. In International Conference on Computer Vision (ICCV 2005) (October 2005). xiii, 16, 17, 20
[SRE∗05b] Sivic J., Russell B. C., Efros A. A., Zisserman A., Freeman W. T.: Discovering object categories in image collections. In Proceedings of the International Conference on Computer Vision (2005). 64
[SSF78] Suwa M., Sugie N., Fujimora K.: A preliminary note on pattern recognition of human emotional expression. In the Fourth International Joint Conference on Pattern Recognition (1978), pp. 408–410. 23
[STFW05] Sudderth E., Torralba A., Freeman W., Willsky A.: Learning hierarchical models of scenes, objects, and parts. In ICCV05 (2005), pp. II: 1331–1338. 18, 20
[SVD09] Siddiquie B., Vitaladevuni S., Davis L.: Combining multiple kernels for efficient image classification. In WACV09 (2009), pp. 1–8. 21
[TA07] Tripathi R., Aravind R.: Recognizing facial expression using particle filter based feature points tracker. In PReMI’07: Proceedings of the 2nd international conference on Pattern recognition and machine intelligence (2007), pp. 584–591. 91, 118
[TCJ10] Tong Y., Chen J., Ji Q.: A unified probabilistic framework for spontaneous facial action modeling and understanding. IEEE Trans. Pattern Anal. Mach. Intell. 32, 2 (2010), 258–273. 29
[TJBB06] Teh Y. W., Jordan M. I., Beal M. J., Blei D. M.: Hierarchical Dirichlet processes. Journal of the American Statistical Association 101, 476 (2006), 1566–1581. 17, 56, 66, 68, 70, 79, 80, 112, 113
[TKC05] Tian Y., Kanade T., Cohn J.: Handbook of Face Recognition. Springer, New York, 2005. 24
[TLJ07] Tong Y., Liao W., Ji Q.: Facial action unit recognition by exploiting their dynamic and semantic relationships. IEEE Trans. Pattern Anal. Mach. Intell. 29, 10 (2007), 1683–1699. 84
[UB05] Ulusoy I., Bishop C. M.: Generative versus discriminative methods for object recognition. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 2 (2005), pp. 258–265. 58, 64
[VJ01] Viola P., Jones M.: Rapid object detection using a boosted cascade of simple features. In CVPR (2001), vol. 1, pp. 511–518. xiii, 4, 20, 21, 25, 38, 59, 73, 81, 116
[VKM09] Venkatesh Y., Kassim A. A., Murthy O. R.: A novel approach to classification of facial expressions from 3d-mesh datasets using modified pca. Pattern Recognition Letters 30, 12 (2009), 1128–1137. 13, 25
[VNP09] Vretos N., Nikolaidis N., Pitas I.: A model-based facial expression recognition algorithm using principal components analysis. In ICIP (2009), pp. 3301–3304. 28, 92, 120
[VNU03] Vidal-Naquet M., Ullman S.: Object recognition with informative features and linear classification. In ICCV (2003), pp. 281–288. 15
[VS04] Vogel J., Schiele B.: A semantic typicality measure for natural scene categorization. In DAGM-Symposium (2004), pp. 195–203. 15
[WLF∗09] Whitehill J., Littlewort G., Fasel I., Bartlett M., Movellan J.: Toward practical smile detection. IEEE Trans. Pattern Anal. Mach. Intell. 31, 11 (2009), 2106–2111. 30
[WMG09] Wang X., Ma X., Grimson W. E. L.: Unsupervised activity perception in crowded and complicated scenes using hierarchical bayesian models. IEEE Trans. Pattern Anal. Mach. Intell. 31, 3 (2009), 539–555. 22
[WS06] Winn J., Shotton J.: The layout consistent random field for recognizing and segmenting partially occluded objects. In Proceedings of IEEE CVPR (2006), pp. 37–44. 19
[WZFF06] Wang G., Zhang Y., Fei-Fei L.: Using dependent regions for object categorization in a generative framework. In CVPR ’06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2006), pp. 1597–1604. 18, 100, 112
[XL09] Xie X., Lam K.-M.: Facial expression recognition based on shape and texture. Pattern Recogn. 42, 5 (2009), 1003–1011. 25
[XLC08] Xiang T., Leung M., Cho S.: Expression recognition using fuzzy spatio-temporal modeling. Pattern Recognition 41, 1 (2008), 204–216. 26, 48, 96, 121
[YBS04] Yeasin M., Bullot B., Sharma R.: From facial expression to level of interest: A spatio-temporal approach. CVPR 2 (2004), 922–927. 26, 96, 121
[YLM09] Yang P., Liu Q., Metaxas D. N.: Boosting encoded dynamic features for facial expression recognition. Pattern Recogn. Lett. 30, 2 (2009), 132–139. 25
[YWS∗06] Yin L., Wei X., Sun Y., Wang J., Rosato M. J.: A 3d facial expression database for facial behavior research. In FGR ’06: Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (Washington, DC, USA, 2006), IEEE Computer Society, pp. 211–216. 84
[ZP09] Zhao G., Pietikäinen M.: Boosted multi-resolution spatiotemporal descriptors for facial expression recognition. Pattern Recognition Letters 30, 12 (2009), 1117–1127. 26, 27, 48, 96, 107, 121
[ZPRH09] Zeng Z., Pantic M., Roisman G. I., Huang T. S.: A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell. 31, 1 (2009), 39–58. 24, 29
[ZSK02] Zhu Y., Silva L. D., Ko C.: Using moment invariants and hmm in facial expression recognition. Pattern Recognition Letters 23, 1-3 (January 2002), 83–91. 26
[ZTLH09] Zheng W., Tang H., Lin Z., Huang T. S.: A novel approach to expression recognition from non-frontal face images. In ICCV (2009), pp. 1901–1908. 26
[ZVM04] Zhu J., Vai M. I., Mak P. U.: A new enhanced nearest feature space (enfs) classifier for gabor wavelets features-based face recognition. In ICBA (2004), pp. 124–131. 46
[ZYZS05] Zhang W., Yu B., Zelinsky G. J., Samaras D.: Object class recognition using multiple layer boosting with heterogeneous features. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 2 (Washington, DC, USA, 2005), IEEE Computer Society, pp. 323–330. 20
Author’s Publications
International Conferences
1. Y. Ji, K. Idrissi, A. Baskurt. Object Categorization Using Boosting Within Hierarchical Bayesian Model. In ICIP, International Conference on Image Processing, Cairo, Egypt, 2009.

2. Y. Ji, K. Idrissi. Facial Expression Recognition by Self-Identification for Video Sequence. In International Conference on Signal-Image Technology and Internet-Based Systems, Marrakech, Morocco, 2009.

3. Y. Ji, K. Idrissi. Learning from Essential Facial Parts and Local Features for Automatic Facial Expression Recognition. In CBMI, 8th International Workshop on Content-Based Multimedia Indexing, Grenoble, France, 2010.

4. Y. Ji, K. Idrissi. Using Moments on Spatiotemporal Plane for Facial Expression Recognition. In ICPR, 20th International Conference on Pattern Recognition, Istanbul, Turkey, 2010.
Workshop

GDR ISIS - Theme B: Image and Vision, action "Visage, geste, action et comportement" (face, gesture, action and behavior). Presentation at the workshop day of 08/12/2009, Télécom ParisTech - Room E800.