Numéro d’ordre : 2010-ISAL-0107 Année 2010
Institut National des Sciences Appliquées de Lyon
Laboratoire d’InfoRmatique en Image et Systèmes d’information
École Doctorale Informatique et Mathématiques de Lyon
Thèse de l’Université de Lyon
Présentée en vue d’obtenir le grade de Docteur, spécialité Informatique
par
Yi Ji
Object classification in images and videos
Application to facial expressions
Thèse soutenue le 3 Décembre 2010 devant le jury composé de :
M. Patrick Lambert, Professeur, Polytech Annecy-Chambéry, Rapporteur
M. William Puech, Professeur, LIRMM, Rapporteur
M. Jean-Luc Dugelay, Professeur, EURECOM, Examinateur
M. Fabrice Mériaudeau, Professeur, Laboratoire Le2i, Examinateur
M. Atilla Baskurt, Professeur, INSA Lyon, Directeur
M. Khalid Idrissi, Maître de Conférences, INSA Lyon, Co-encadrant
Laboratoire d’InfoRmatique en Image et Systèmes d’information
UMR 5205 CNRS - INSA de Lyon - Bât. Jules Verne
69621 Villeurbanne cedex - France
Tel: +33 (0)4 72 43 60 97 - Fax: +33 (0)4 72 43 71 17
To my family
Acknowledgments
First of all, I would like to thank my two advisers Pr. Atilla Baskurt and Dr. Khalid Idrissi
for their guidance, encouragement and patience. I am very thankful to Pr. Baskurt for the
chance to do research at INSA de Lyon. To Dr. Idrissi, I am very grateful for his invaluable
advice and guidance throughout my study.
I would like to thank Pr. Patrick Lambert and Pr. William Puech for spending their
precious time reviewing the manuscript of this dissertation and for their much appreciated
advice. I am also very thankful to Pr. Jean-Luc Dugelay and Pr. Fabrice Mériaudeau for
agreeing to be examiners in the jury.
I thank the China Scholarship Council for the PhD grant under the cooperation program
between CSC and INSA.
I also would like to thank all the members of the IMAGINE group at LIRIS, with special
thanks to my friends who shared the same office: Cagatay, Imane, Phuong, Kai and Yuyao.
I thank Pr. Mohan S. Kankanhalli of the National University of Singapore, who super-
vised my master’s thesis and inspired my interest in multimedia and computer vision.
My final thanks go to my parents and to my friends in Lyon, Shanghai and Suzhou for
their support.
Abstract
In this dissertation, we address the problem of generative object categorization in computer vision. We then apply our approach to the classification of facial expressions.
Humans can solve the problem of object recognition and categorization quickly, efficiently and almost effortlessly. But for a corresponding algorithm from computer vision, image analysis and pattern recognition, it is still a very difficult task. For objects in the same category, variations in pose, lighting, scale and affine changes, together with intra-class differences, make assigning the object in an image to a certain category an extreme and still unsolved challenge.
In our proposal, we are inspired by the Hierarchical Dirichlet Processes method to generate intermediate mixture components that improve recognition and categorization, as it shares two aspects with topic modelling for documents: its nonparametric and its hierarchical nature. After we obtain the set of components, instead of boosting the features as Viola and Jones, we boost the components in the intermediate layer to find the most distinctive ones. We consider that these components are more important for object class recognition than others and use them to improve the classification. Our target is to achieve the correct classification of objects, and also to discover the essential latent themes shared across multiple categories of objects and the particular distribution of the latent themes for a specific category.
Many approaches for facial expression recognition (FER) systems have been proposed to improve Human-Computer Interaction (HCI). Ekman and Friesen defined six basic emotions: Anger, Disgust, Fear, Happiness, Sadness and Surprise. The task of interpreting these universal expressions, whether by a recognition algorithm or even manually by human beings, is difficult because of individual differences and cultural background. FER is still one of the most active fields in computer vision and has attracted many proposals over the last several decades.
Regarding the relation between basic expressions and the corresponding facial deformation models, we propose two new textons, VTB and moments on the spatiotemporal plane, to describe the transformation of the human face during facial expressions. These descriptors aim to capture both general shape changes and motion texture details. The dynamic deformation of facial components is thus captured by modelling the temporal behaviour of facial expressions. Finally, an SVM-based system is used to efficiently recognize the expression for a single image in a sequence; then, the weighted probabilities of all the frames are used to predict the class of the current sequence. My thesis includes finding the proper methods to describe the static and dynamic aspects of facial expressions. I also aim to design new descriptors to denote the characteristics of facial muscle movements and, furthermore, to identify the category of emotion.
Keywords: Object Recognition, Facial Expression Recognition, SVM, Bayesian Model, AdaBoost, Spatiotemporal Information, Moments, SIFT.
Contents
Acknowledgments v
Abstract vii
Contents ix
List of Figures xiii
List of Tables xvii
List of Algorithms xix
1 Introduction 1
1.1 General Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Object Categorization and Methods . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Facial Expression Classification and Related Application . . . . . . . . . . 4
1.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Survey on Object Recognition and Facial Expression Recognition 7
2.1 Object Categorization: The State of the Art . . . . . . . . . . . . . . . . . . 9
2.1.1 Problem Statement: Category Level Recognition . . . . . . . . . . . 9
2.1.2 Popular Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Classic Models for Object Categorization . . . . . . . . . . . . . . 12
2.1.4 Recent Work and Summary . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Literature Review: Facial Expression Recognition . . . . . . . . . . . . . . 23
2.2.1 Basic Facial Expressions and Facial Actions . . . . . . . . . . . . . 23
2.2.2 Three Stages in Automatic FER System . . . . . . . . . . . . . . . . 23
2.2.3 Recent Work and Trend . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Feature Representation for Objects and Faces 31
3.1 Overview of Local Feature Descriptors . . . . . . . . . . . . . . . . . . . . 33
3.2 The Detection of Regions of Interest . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 DOG for Key Point Detection . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Face Organ Location . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Features for Static Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 SIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3 Shape Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.4 LBP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.5 Gabor features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Features for Dynamic Schemes . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.1 Introduction to Temporal Texture Analysis . . . . . . . . . . . . . . 47
3.4.2 Dynamic Deformation for Facial Expression . . . . . . . . . . . . . 47
3.4.3 VTB Descriptor for Dynamic Feature . . . . . . . . . . . . . . . . . 49
3.4.4 Moments on Spatiotemporal Plane . . . . . . . . . . . . . . . . . . 52
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 Recognition and Classification Methods 55
4.1 Overview of Machine Learning Methods . . . . . . . . . . . . . . . . . . . 57
4.2 Discriminative Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.2 AdaBoost Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.3 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Generative Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.2 BoW and Naïve Bayes Implementation . . . . . . . . . . . . . . . 64
4.3.3 Hierarchical Generative Model . . . . . . . . . . . . . . . . . . . . . 66
4.3.4 Construction of Hierarchical Dirichlet Processes . . . . . . . . . . . 66
4.3.5 Inference and sampling . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 Hybrid System: Integrated Boosting and HDP . . . . . . . . . . . . . . . . 70
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Testing and Results 75
5.1 General Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.1 Datasets of Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.2 Classification Using HDP with Heterogeneous Features . . . . . . 77
5.1.3 Classification Using Boosting within Hierarchical Bayesian Model 81
5.2 Facial Expression Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.1 Face databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.2 Overview of Our System . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.3 Image Based Classification Using Static Features . . . . . . . . . 89
5.2.4 Image Based Classification Using Static and Dynamic Features . . 92
5.2.5 Classification for Sequences . . . . . . . . . . . . . . . . . . . . . . . 95
6 Conclusion 99
6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.1.1 Object Categorization Using Boosting Within Hierarchical Bayesian Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.1.2 Automatic Facial Expression Recognition . . . . . . . . . . . . . . . 100
6.2 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.1 Object Similarities and Polymorphism . . . . . . . . . . . . . . . . . 101
6.2.2 Spontaneous Facial Expression Understanding . . . . . . . . . . . 102
7 Résumé en Français 103
7.1 Sommaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.2.1 Détection des organes faciaux . . . . . . . . . . . . . . . . . . . . . 105
7.2.2 Descriptions des expressions . . . . . . . . . . . . . . . . . . . . . . 106
7.2.3 Classification par HDP (Hierarchical Dirichlet Process) . . . . . . . 111
7.2.4 Processus de Dirichlet Hiérarchique pour la classification . . . . . 112
7.3 Résultats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.3.1 Validation de la classification d’objets par HDP . . . . . . . . . . . 114
7.3.2 Validation des descripteurs proposés pour la classification des expressions par SVM . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.3.2.1 Classification des images d’expression par descripteurs statiques . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.3.2.2 Classification des images d’expression par descripteurs dynamiques . . . . . . . . . . . . . . . . . . . . . . . . 119
7.3.2.3 Classification de séquences d’expressions par descripteurs dynamiques . . . . . . . . . . . . . . . . . . . . . . 120
Bibliography 123
Author’s Publications 137
List of Figures
1.1 The results from http://www.picsearch.com when searching for "cat". . . 2
1.2 Photographs of facial expressions from Paul Ekman[EF76]. . . . . . . . . . 3
2.1 Generic OR and Specific OR . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Samples from Caltech datasets . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 20 categories in PASCAL Visual Object Classes Challenge 2010 . . . . . . 12
2.4 Construction of a three-dimensional array of solid objects from a single two-dimensional photograph [Rob63] . . . . . . . . . . . . . . . . . . . . . 13
2.5 BoW for Object Categorization (ICCV 2009 short course by L. Fei-Fei) . . 14
2.6 Examples of ’key features’ that are detected by: the scale invariant Harris detector, the affine invariant Harris detector, and the DoG/SIFT detector/descriptor [Pin05]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.7 The most likely words (shown by 5 examples in a row) for four learned topics (1): (a) Faces, (b) Motorbikes, (c) Airplanes, (d) Cars. [SRE∗05a]. . . 17
2.8 Learning relevant intermediate representations of scenes automatically and without supervision, by [LP05] . . . . . . . . . . . . . . . . . . . . . . 18
2.9 Graphical geometric models by [CL06] . . . . . . . . . . . . . . . . . . . . 19
2.10 Schematic depiction of the Adaboost detection cascade by [VJ01] . . . . . 21
2.11 Pyramid Match by Grauman and Darrel [GD05a] . . . . . . . . . . . . . . 22
2.12 Latent components generated by [LJ08] . . . . . . . . . . . . . . . . . . . . 22
2.13 Six basic expressions, from left to right: Anger, Disgust, Fear, Happiness, Sadness and Surprise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.14 Examples of facial action units (AUs) and their combinations defined in FACS from [PB07] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.15 The main blocks in facial expression recognition [CB03] . . . . . . . . . . 25
2.16 Rectangle feature by [KJC08] . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.17 Three video examples associated with a 3D tracker, with different degrees of motion. [RD09] 29
3.1 Above: [PB07] Tracking the facial characteristic point. Below: [HPA04] LBP histograms from whole face and divided blocks. . . . . . . . . . . . 33
3.2 The construction of Difference of Gaussian(DOG). [Low99a] . . . . . . . . 35
3.3 Facial organ location: step by step. . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Sample images, from left to right: Anger, Disgust, Fear, Happiness, Sadness and Surprise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5 The histogram computation for SIFT descriptor . . . . . . . . . . . . . . . 42
3.6 Left and center: samples of edge points for two similar shapes. Right: the log-polar histogram bins used to compute the shape context. . . . . . 43
3.7 Shape context as a discriminative descriptor . . . . . . . . . . . . . . . . . 43
3.8 An example of LBP computing [HPA04]. . . . . . . . . . . . . . . . . . . . 44
3.9 Examples of texture primitives which can be detected by LBP [HPA04]. . 44
3.10 (Left) A face image. (Center) Identified changed parts. (Right) Maskedand divided in 8 blocks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.11 The face image in Fig. 3.10 is represented by a concatenation of 8 local LBP histograms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.12 The cues of facial movements [Bas79]. . . . . . . . . . . . . . . . . . . . . . 47
3.13 Left: XY(front face); Center: YT slice; Right: XT slice. . . . . . . . . . . . . 48
3.14 The dynamic deformation for different expressions on vertical slices . . . 49
3.15 3 blocks on VT plane. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.16 VTB computing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1 Separating hyperplanes under linear case. [Bur98]. . . . . . . . . . . . . . 61
4.2 The examples of SVM kernels by pictures[Bur98]. . . . . . . . . . . . . . . 63
4.3 HDP model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 The mixture of components. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.5 Hybrid approach for learning . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1 Samples from four categories we used. . . . . . . . . . . . . . . . . . . . . 78
5.2 Component No.17 in motorbike category. . . . . . . . . . . . . . . . . . . . 80
5.3 Convergence vs. Iteration times. . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 The distinctive components in large image. . . . . . . . . . . . . . . . . . . 82
5.5 Performance vs the size of component set. . . . . . . . . . . . . . . . . . . 83
5.6 The sample images from JAFFE database. From left to right: Angry, Disgust, Fear, Happiness, Sadness, Surprise and Neutral. . . . . . . . . . . . 85
5.7 Sample images and location procedure, from left to right: Anger, Disgust, Fear, Happiness, Sadness and Surprise. . . . . . . . . . . . . . . . . . 86
5.8 Sample frames from MMI database . . . . . . . . . . . . . . . . . . . . . . 87
5.9 Sample frames from Cohn-Kanade database . . . . . . . . . . . . . . . . . 88
5.10 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.11 One sequence of happiness and corresponding plot for six expressions. . 94
7.1 Détection des organes faciaux pour les 6 expressions . . . . . . . . . . . . 106
7.2 Les mouvements faciaux décrits par [Bas79] . . . . . . . . . . . . . . . . . 107
7.3 10 blocs correspondant à une image de sourire . . . . . . . . . . . . . . . 107
7.4 Gauche : XY(vue de face) ; Milieu : la tranche YT ; Droite : La tranche XT . 108
7.5 Les déformations de la tranche YT pour différentes expressions . . . . . . 108
7.6 Exemple de tranches dans le plan YT. . . . . . . . . . . . . . . . . . . . . . 109
7.7 Calcul de VTB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.8 Modèle graphique de LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.9 Modèle graphique de HDP . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.10 Catégories de la base Caltech utilisées pour nos tests. . . . . . . . . . . . . 114
7.11 Convergence et nombre d’itérations. . . . . . . . . . . . . . . . . . . . . . . 116
7.12 Les principales composantes caractéristiques du léopard. . . . . . . . . . . 116
7.13 Performance en fonction du nombre de composantes. . . . . . . . . . . . . 117
List of Tables
5.1 Classification results; in parentheses is the value K for the number of components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 The confusion matrix for best-T. . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3 Performance comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 Recognition performances by SVM on different resolutions (%) . . . . . 89
5.5 Average recognition performances on JAFFE database (%) . . . . . . . . . 89
5.6 Recognition performances by boosted-SVM for 64 × 64 resolution (%) . . 90
5.7 Confusion matrix by boosted-SVM for 64 × 64 resolution for 6-class recognition (%) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.8 Recognition performances for 64 × 64 resolution on sub-sets (%) . . . . . 91
5.9 Confusion on MMI database for 6-class recognition (%) . . . . . . . . . . . 91
5.10 Recognition performances comparisons on Cohn-Kanade Database (%) . 92
5.11 Recognition performances comparisons for image-based methods (%) . . 92
5.12 Confusion matrix of Moments on MMI database (%) . . . . . . . . . . . . 94
5.13 Confusion matrix of Ours: N(µ, σ²/2) (%) . . . . . . . . . . . . . . . . . . 96
5.14 Recognition performances comparisons for sequence-based methods (%) 96
7.1 Résultats de la classification (nombre de composantes entre parenthèses). 115
7.2 Comparaison des performances . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.3 Performance moyenne sur la base JAFFE (%) . . . . . . . . . . . . . . . . . 118
7.4 Performance moyenne sur la base MMI (%) . . . . . . . . . . . . . . . . . . 118
7.5 Comparaison des performances sur la base Cohn-Kanade (%) . . . . . . . 119
7.6 Comparaison avec descripteurs spatio-temporels . . . . . . . . . . . . . . 120
7.7 Matrice de confusion sur la base MMI (%) . . . . . . . . . . . . . . . . . . 120
7.8 Comparatif pour la classification de séquences . . . . . . . . . . . . . . . . 121
List of Algorithms
3.1 Detect facial organs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 HDP to build components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 AdaBoost for Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1 Sequence level classification using weighted sum . . . . . . . . . . . . . . . 95
7.1 Classification d’une séquence par somme pondérée . . . . . . . . . . . . . . 121
Chapter 1
Introduction
Contents
1.1 General Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Object Categorization and Methods . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Facial Expression Classification and Related Application . . . . . . . . 4
1.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1 General Context
Nowadays, various electronic devices such as digital cameras, smart phones, or even
handheld game consoles can take a digital photo or shoot a short video. More and
more images and videos are present in our daily life. The Human Visual System
perceives this visual information with apparent ease: according to psychologists,
only a few milliseconds are needed to understand the contents [Kos94].
In artificial intelligence, we face a colossal scale of visual information in our digital
storage, and few useful techniques to process, understand and classify it. Image
search engines (Google, Altavista, Picsearch) are normally based on the file names
of images or on keywords, so irrelevant results often appear (Fig. 1.1). Machine
perception from photographs is indeed a challenge and one of the most important
problems in computer vision. In this thesis work, our first research topic is to
automatically assign the objects in images to correct semantic labels and to build a
content-based recognition system. To accomplish this objective, we propose to extract
a set of intermediate visual components to represent objects for categorization. The
method is efficient for generic recognition, but we also face the problem of processing
complexity: the cost of the training stage increases if we want to handle tens of
thousands of classes. Though HDP is an efficient and promising method, the
limitation of its convergence speed raises several problems when extending to more
categories.
Figure 1.1: The results from http://www.picsearch.com when searching for "cat".
Furthermore, we began to consider a more practical classification problem with a
fixed number of classes: Facial Expression Recognition. Perceptual psychologists have
spent decades defining and representing human emotions and the related facial
expressions (Fig. 1.2).
Figure 1.2: Photographs of facial expressions from Paul Ekman[EF76].
1.2 Object Categorization and Methods
1.2.1 Objective
As mentioned in Section 1.1, humans can solve the problem of object recognition and
categorization quickly, efficiently and almost effortlessly. But for a corresponding
algorithm from computer vision, image analysis and pattern recognition, it is still a
very difficult task. For objects in the same category, variations in pose, lighting, scale
and affine changes, together with intra-class differences, make assigning the object in
an image to a certain category an extreme and still unsolved challenge.
1.2.2 Contribution
In our proposal, we combine thousands of descriptors (e.g. local gradient, shape, and
color) from small patches within one hierarchical generative model. These different
data sources have complementary characteristics, which should be independently
combined to improve the classification. We are also inspired by the "Hierarchical
Dirichlet Processes" method to generate intermediate mixture components that improve
recognition and categorization. Our work shares two aspects with topic modeling for
documents: its nonparametric and hierarchical nature. Unlike documents, which are
usually written in a single language, we have several sets of descriptors for the same
image, just as if different scripts from different language sources were used to tell the
same story. The method reinforces the same contents from complementary origins for
better understanding and comprehension.
To find the right category for an object, these features represent different facets of
its characteristics. These facets are usually shared by multiple categories. The particular
combination and distribution of these facets signify a particular category and allow us
to perform the correct classification.
Furthermore, after we obtain the set of components, instead of boosting the features
as Viola and Jones [VJ01], we boost the components in the intermediate layer to find
the most distinctive ones. We consider that these components are more important for
object class recognition than others and use them to improve the classification. Our
target is to achieve the correct classification of objects, and also to discover the essential
latent themes shared across multiple categories of objects and the particular distribution
of the latent themes for a specific category.
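The component-level boosting described above can be illustrated with a small sketch. The snippet below is a minimal, self-contained discrete AdaBoost that selects one-component decision stumps over component-activation scores; the data layout (one activation score per intermediate component and per image) is an assumption for illustration, not the exact implementation detailed in Chapter 4. The component indices recorded in the returned learners identify the most distinctive components.

```python
import numpy as np

def train_adaboost_stumps(X, y, n_rounds=10):
    """Discrete AdaBoost over one-component decision stumps.

    X : (n_samples, n_components) component-activation scores
    y : (n_samples,) labels in {-1, +1}
    Returns a list of (component index, threshold, polarity, alpha).
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)          # uniform sample weights
    learners = []
    for _ in range(n_rounds):
        best = None
        # Exhaustively search the stump with the lowest weighted error.
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = max(err, 1e-10)         # avoid log(0) for a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
        w *= np.exp(-alpha * y * pred)  # re-weight misclassified samples
        w /= w.sum()
        learners.append((j, thr, pol, alpha))
    return learners

def predict(learners, X):
    """Sign of the alpha-weighted vote of all selected stumps."""
    score = np.zeros(X.shape[0])
    for j, thr, pol, alpha in learners:
        score += alpha * np.where(pol * (X[:, j] - thr) > 0, 1, -1)
    return np.where(score >= 0, 1, -1)
```

On toy data where a single component separates the two classes, the first selected stump picks exactly that component, which is how the boosting stage ranks distinctiveness.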
1.3 Facial Expression Classification and Related Application
1.3.1 Motivation
The second topic that we study in this thesis concerns human facial expressions.
Automatic Facial Expression Recognition (FER) is one of the most active fields in
computer vision and has attracted many proposals over the last several decades. The
recognition and interpretation of human facial expressions play an important role in
interpersonal and non-verbal communication. However, the task of interpretation,
whether by computer vision algorithms or even manually by human beings, is difficult
because of individual differences and cultural background. On a human face,
expressions can be seen in the subtle movements of facial muscles and are influenced
by internal emotional states. Based on psychological research, [EF78] proposed FACS
(Facial Action Coding System) as a standard to systematically categorize facial
expressions. They also defined six basic emotions: Anger, Disgust, Fear, Happiness,
Sadness and Surprise. Since then, two families of methods have been developed: one
aims at understanding these prototypic expressions while the other concentrates on
the detection of basic action units. Some approaches combine these two methods. In
this thesis, we propose to use appearance-based and spatiotemporal information to
build an automatic FER system that gives an interpretation of facial expressions in
video sequences.
1.3.2 Contribution
We developed an automatic FER system which establishes relations between facial
expressions and the changes of facial parts. An unchanged state over a long run usually
implies a neutral face. If changes take place between frames, we can detect the facial
movements and then a possible facial expression. These changes are related to different
emotions, which were classified into six distinctive universal emotions by Ekman and
Friesen [EF78]. This builds the baseline of our system: we detect the facial changes due
to an expression, we locate face parts and then, from these parts, we identify the facial
expressions.
In the proposed method, the detection of faces begins with a rough face detector,
which is then refined by detecting the important facial organs (eyes, nose and mouth).
The representation of faces is appearance-based, applied on a designed mask with an
8-block layout. For each block, an LBP histogram is extracted and then the histograms
are concatenated into a feature vector. Gabor features can optionally be combined at
this step.
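As a rough sketch of this block-wise representation, the snippet below computes a basic 8-neighbour LBP code map and concatenates one normalized 256-bin histogram per block. The uniform 4x2 grid stands in for the designed 8-block facial mask and is an assumption for illustration only.

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour LBP code for each interior pixel of a 2-D array."""
    c = gray[1:-1, 1:-1]
    codes = np.zeros(c.shape, dtype=np.uint8)
    # Neighbours enumerated clockwise from the top-left, one bit each.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        nb = gray[1 + dy:gray.shape[0] - 1 + dy,
                  1 + dx:gray.shape[1] - 1 + dx]
        codes |= ((nb >= c).astype(np.uint8) << bit)
    return codes

def block_lbp_features(gray, grid=(4, 2)):
    """Concatenate one 256-bin LBP histogram per block (4x2 grid = 8 blocks)."""
    codes = lbp_image(gray)
    feats = []
    for row_chunk in np.array_split(codes, grid[0], axis=0):
        for block in np.array_split(row_chunk, grid[1], axis=1):
            hist, _ = np.histogram(block, bins=256, range=(0, 256))
            feats.append(hist / max(block.size, 1))   # per-block normalization
    return np.concatenate(feats)
```

The resulting vector has 8 x 256 = 2048 dimensions, one histogram per block, and can be fed directly to an SVM.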
Our automatic system also contributes in the second stage by proposing new
descriptors. It separates feature extraction into two parts: static images and dynamic
information estimation. To describe the spatiotemporal planes, we introduce VTB, a
modification of the well-known LBP descriptor, and a unique usage of moments in the
spatiotemporal domain to address the challenge of representative features. VTB
concentrates on texture characteristics while moments concentrate on changing shapes.
In other words, the appearance features are applied to measure the geometric changes
of shape and locations of the main facial components.
These descriptors can evaluate the intensity of expressions and handle temporal
information for image sequences. Furthermore, we also explore the possibility of
combining the static appearance features with the texture and shape features of motion.
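At sequence level, the final decision combines the per-frame class probabilities by a weighted sum (Algorithm 5.1 in Chapter 5). The sketch below is a minimal version of that idea; the Gaussian weighting profile over frame indices, and its default parameters, are assumptions for illustration rather than the tuned values of the thesis.

```python
import numpy as np

def classify_sequence(frame_probs, mu=None, sigma=None):
    """Sequence label from a weighted sum of per-frame class probabilities.

    frame_probs : (n_frames, n_classes) array, one probability
                  distribution per frame (e.g. from a per-image SVM).
    mu, sigma   : parameters of the Gaussian weight profile over frame
                  indices (hypothetical defaults below).
    """
    n = frame_probs.shape[0]
    if mu is None:
        mu = n - 1                    # emphasise frames near the sequence end
    if sigma is None:
        sigma = max(n / 4.0, 1.0)
    t = np.arange(n)
    w = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    w /= w.sum()                      # weights form a convex combination
    score = w @ frame_probs           # (n_classes,) weighted vote
    return int(np.argmax(score)), score
```

Because the weights sum to one, the combined score is itself a probability distribution over classes; the predicted class is simply its argmax.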
1.4 Outline
The structure of this thesis is organized as follows:
1. In Chapter 2, we provide a survey of the existing techniques in the areas of object
categorization and facial expression classification. For object categorization, we
focus on different models and their hybrids for category-level recognition. For the
more applicable area of facial expression classification, we concentrate on the
three main stages of Facial Expression Recognition (FER) systems to reflect the
historical development since its inception.
2. In Chapter 3, we present various descriptors (mostly based on local appearance) used
in generic object recognition and facial expression recognition systems. Then, our
new descriptors are proposed and detailed.
3. Chapter 4 describes the different classifiers used in computer vision systems. The
two major branches, discriminative and generative methods, are presented. Our
proposed hybrid system is then introduced.
4. In Chapter 5, both the systems for generic object recognition and facial expression
recognition are detailed in every step. Benchmark databases are used to evaluate
the performances of these systems.
5. Chapter 6 summarizes the contributions in both object categorization and facial
expression classification. Final conclusions concerning this dissertation are
drawn. We also propose several future working directions.
Chapter 2
Survey on Object Recognition and
Facial Expression Recognition
Contents
2.1 Object Categorization: The State of the Art . . . . . . . . . . . . . . . . 9
2.1.1 Problem Statement: Category Level Recognition . . . . . . . . . . 9
2.1.2 Popular Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Classic Models for Object Categorization . . . . . . . . . . . . . . 12
2.1.4 Recent Work and Summary . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Literature Review: Facial Expression Recognition . . . . . . . . . . . . 23
2.2.1 Basic Facial Expressions and Facial Actions . . . . . . . . . . . . 23
2.2.2 Three Stages in Automatic FER System . . . . . . . . . . . . . . . 23
2.2.3 Recent Work and Trend . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
7
In this chapter, we review the existing techniques in the areas of object categorization and facial expression classification. For object categorization, we focus on methods for basic recognition of visual categories based on image feature extraction. For the more applicable area of facial expression classification, we review some popular systems to reflect the historical development since its inception.
2.1 Object Categorization: The State of the Art
2.1.1 Problem Statement: Category Level Recognition
According to neuroscientists ([FFTR∗05]), humans can perform very fast categorization (well below 300 ms) of objects, or identification of a specific object, even under highly degraded and novel viewing conditions. There is an important distinction between these two tasks: recognizing the same object, and finding objects that belong to the same category in natural images. In the first, instance-level problem, specific OR deals with an individual object, for example recognizing a particular human being (Alan Turing in the first row of Fig. 2.1) regardless of viewpoint changes, differences in illumination, occlusion, background clutter and imaging noise, as mentioned by Fergus [Fer05]. This technique is widely integrated in security applications (e.g. airport security checks). Some studies have claimed nearly 100% accuracy in specific instance recognition [JB08].
Figure 2.1: Generic OR and Specific OR
The second task, category-level recognition, is defined in Pinz's book [Pin05] as "generic object recognition" (generic OR). For generic recognition, the list of categories or classes can consist of, e.g., people, bikes or bottles. Robust category-level categorization (generic OR) is the harder task [Fer05] for researchers in computer vision. Here the goal is to define a process which can assign objects within images to a certain category. To design an efficient algorithm, one of the major problems is the distinction between intra-class variabilities and inter-class differences. Furthermore, visual categories are normally hierarchical in the semantic domain (e.g. Automobile-Car-SUV or sport utility vehicle). An object can also belong to two or more classes (e.g. a wooden box can be a stationery case or a flowerpot). Many methods have tackled this task, but a satisfying solution for an exhaustive set of objects (10,000 to 30,000 according to psychologists [RMG∗76, Bie87]) in the real world is not likely to appear now or in the near future. However, for a small set of objects, many contributions have been made since the first attempts in the 1960s, for example the identification of smiling people in the second row of Fig. 2.1. In this thesis, we will handle this generic object recognition task and provide a literature review and original research in this field.
2.1.2 Popular Datasets
For a thorough evaluation of various categorization algorithms, benchmark or commonly used databases are required to train and test the proposed models. A useful database should satisfy the criteria below:
1. Enough images in each category,
2. High intra-class variability inside the same category and low inter-class variability between different categories,
3. Scale, orientation and viewpoint variation within the same category,
4. Background clutter complexity.
Some databases are designed for specific object recognition. For example, the Object Recognition Database from the Ponce research group includes modeling shots of eight objects for comparative evaluation [RLSP06]. Several databases are well known but concentrate on different aspects of the above requirements. Here we list some widely used sets:
1. Caltech 4 [FPZ03], Caltech101 [FFFP07], Caltech256 [GHP07], nearly a standard for testing natural object recognition algorithms. The images (Fig. 2.2) vary greatly in view, position, size, lighting conditions and so on. Some categories contain fewer than 40 images.
Figure 2.2: Samples from Caltech datasets
2. ETH-80 from the ETH CogVis project [LLS04], which contains images of 8 categories and 80 objects, such as apples, pears, tomatoes, cows, dogs, horses, cups and cars. Each object is represented by 41 views.
3. UIUC Image Database for Car Detection [AR02, AAR04], which contains grey-scale images of cars in PGM format for evaluating object detection algorithms (only one category with background). No scale or viewpoint changes are present.
4. The TU Graz-02 Database [OFPA04], which includes images of people, cars, bicycles and counter-examples.
5. PASCAL Object Recognition Database Collection [EVGW∗], a famous annual challenge to recognize objects from a number of visual object classes in realistic scenes. It has 20 object classes and is collected from the "flickr" website (Fig. 2.3). Due to the huge size of the database, the training and testing process can be time-consuming.
6. MIRFLICKR-25000 Image Collection [HL08]. This database is also collected from the "flickr" website, which allows its users to search and share their pictures based on image tags. It provides very interesting visual concepts for retrieval research.
Figure 2.3: 20 categories in PASCAL Visual Object Classes Challenge 2010
Among these databases, the PASCAL VOC challenge is obviously the hardest, because it includes the largest sets and some objects are very small in the images. UIUC includes only one category, which is not sufficient for a bias-free evaluation. ETH-80 provides too few samples for each category. In TU Graz-02, certain backgrounds and objects are linked. The Caltech series is the most widely used and has low-level clutter, though the number of images in some categories is too low to validate the efficiency of algorithms. In the future, billions of images from popular search engines such as GOOGLE, or the open dataset LabelMe [RTMF08], may become a more solid basis for computer vision researchers. In our research, we will use a subset of Caltech 101 to evaluate categorization performance.
2.1.3 Classic Models for Object Categorization
Research into the automatic understanding of images and video began in the 1960s. L. Roberts made the first attempt to construct and display three-dimensional solid objects from a single two-dimensional photograph, as in Fig. 2.4 [Rob63]. This developed into the reconstruction branch of object recognition. In computational neuroscience, David Marr [Mar82] laid the cornerstone for this group of computational approaches to perception, and also for brain science. He described the vision model as proceeding from a two-dimensional visual array (on the retina) to a three-dimensional description of the world as output. Starting from the original image, his stages of vision include:
Figure 2.4: Construction of three-dimensional solid objects from a single two-dimensional photograph [Rob63]
1. The original image, the input visual array,
2. The primal sketch, with features like edges and regions, as a suitable representation of the changes and structures in the image,
3. The 2.5D sketch of the scene, the intermediate representation of visible surfaces,
4. The 3D object model, interpreting the surfaces and leading to full perception.
Marr's ideas influence recent research such as [RLSP06, VKM09].
Another branch, which works in the two-dimensional domain, extracts features for pattern recognition from images or videos. This group of methods can handle visual data in natural environments and is widely used in robotics, security systems and other autonomous systems. Here the general task of object categorization is to assign the correct category label to unknown instances, given a small number of training images of a category. We will present three major models in this section: bag-of-words, part-based and discriminative models.
Bag of Words
The bag-of-words (BoW) approach originates as a simplifying assumption in natural language processing [Har54]. This group of methods is popular in document classification. It represents a text as a bag containing a collection of words from a dictionary, ignoring word order and grammar.
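The textual version of this idea fits in a few lines of Python; the vocabulary and sentence below are invented for the illustration:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent a text as a histogram over a fixed vocabulary,
    ignoring word order and grammar."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

vocab = ["cat", "dog", "runs"]
print(bag_of_words("The dog runs and the cat runs", vocab))  # [1, 1, 2]
```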
Figure 2.5: BoW for Object Categorization (ICCV 2009 short course by L. Fei-Fei)
In computer vision, there is a similar treatment. An image is represented as a histogram vector of features; these features are extracted from a regular grid or a set of keypoints; each feature is a visual codeword, i.e. an entry in a visual dictionary; the visual dictionary is normally generated by k-means. In this structure, shown in Fig. 2.5, all features are independent. The model regards the image as a collection of these features. These features are found by a set of detectors, which have been chosen to respond to different types of structures within images (e.g. interesting features of pixels;
the outlines of objects and so on). Normally the system includes two main parts:
1. Feature detection and histogram representation of original image (object),
2. Learning and recognition based on histograms of features.
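The first part can be sketched with a toy k-means codebook in NumPy; the random descriptors below are stand-ins for real local features such as SIFT:

```python
import numpy as np

def build_codebook(features, k, iters=10, seed=0):
    """Toy k-means: cluster local feature vectors into k visual words."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each feature to its nearest center, then re-estimate centers
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return centers

def bow_histogram(features, centers):
    """Map an image's local features to a normalized histogram of visual words."""
    d = np.linalg.norm(features[:, None] - centers[None], axis=2)
    h = np.bincount(d.argmin(axis=1), minlength=len(centers)).astype(float)
    return h / h.sum()

rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 8))      # stand-in for SIFT-like descriptors
centers = build_codebook(feats, k=5)
hist = bow_histogram(feats, centers)   # the image representation
```

The resulting histogram is what the second part (learning and recognition) operates on.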
In the first part, methods for object categorization typically extract features by applying salient point detectors to the images. The survey by Schmid et al. [SMB00] evaluated the repeatability rate and information content of various interest point detectors. They compared contour-based, intensity-based and parametric-model-based methods. They found that the Harris point detector [HS88] and its multi-scale variants perform better than, or at least equivalently to, other detectors in two respects: repeatability and information content.
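The principle of the Harris detector can be sketched in a few lines of NumPy (our own toy version, with a crude 3×3 box window instead of the Gaussian weighting of [HS88]):

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response R = det(M) - k*trace(M)^2, where M is the
    local structure tensor of the image gradients."""
    img = img.astype(float)
    Iy, Ix = np.gradient(img)
    def box(a):  # sum over a 3x3 window around each pixel
        p = np.pad(a, 1, mode="edge")
        h, w = a.shape
        return sum(p[i:i+h, j:j+w] for i in range(3) for j in range(3))
    Sxx, Syy, Sxy = box(Ix * Ix), box(Iy * Iy), box(Ix * Iy)
    return Sxx * Syy - Sxy**2 - k * (Sxx + Syy)**2

# a synthetic step corner: bright square in the lower-right quadrant
img = np.zeros((10, 10)); img[5:, 5:] = 1.0
R = harris_response(img)
# R is strongly positive at the corner (5, 5) and negative along the edges
```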
Matas et al. [MCUP04] proposed a detection algorithm for an affinely-invariant stable subset of extremal regions, named maximally stable extremal regions (MSER). Integrated in the SIFT descriptor [Low99b], the difference of Gaussians (DoG) is also a good keypoint detector and is widely used. A comparison is shown in Fig. 2.6: the salient parts in images are detected whether they belong to objects or to noisy background.
Some other methods for scene categorization [VS04, LP05] simply use a regular grid on the images to extract features from rectangular patches. Random sampling is also used [VNU03]. In these systems, salient regions are detected in the image, but not all of them are supposed to be keypoints of the object we are looking for. Some will lie on the background or in cluttered areas. The successful usage of these points after detection depends on the descriptors and classification.
After salient regions are located, the next step is to represent them in terms of descriptive features, often encoded as feature vectors. These descriptors should be highly discriminative and easy to generate. The combination of detector and descriptor should be invariant to scaling, rotation, affine deformation, illumination changes and geometric or radiometric distortion. Some simple features are pixel intensities (grey levels) or color, moments and their invariants, and filters or transformations (DCT, Gabor). Some more complex features (e.g. SURF [BETVG08]) have also proved to offer high performance. These features can be used to build a codeword dictionary, or fed directly into a part-based model or classifier.
Figure 2.6: Examples of 'key features' that are detected by: the scale invariant Harris detector, the affine invariant Harris detector, and the DoG/SIFT detector/descriptor [Pin05].
With the codeword dictionary, we can map the original image to a histogram of codewords. We still need efficient tools to complete the task of learning and recognition. Two groups of approaches, discriminative and generative, or hybrids of the two, are available. Generally speaking, the discriminative methods [GD05a] are driven by data in a bottom-up manner, such as SVM [BGV92]. On the other hand, the generative methods [LP05, SRE∗05a] follow a top-down paradigm. Discriminative approaches can obtain very high accuracy on some datasets, but overfitting remains a persistent risk.
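The contrast can be illustrated on toy 2-D data: a generative classifier fits p(x|c) per class, while a discriminative one (here a small logistic regression, standing in for SVM) directly fits the decision boundary. The data and learning rates are invented for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated Gaussian blobs as toy "feature vectors"
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
               rng.normal([3, 3], 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# generative (top-down): model p(x|c) as an isotropic Gaussian per class,
# then classify by the Bayes rule, here the nearest class mean
mu = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
def gen_predict(x):
    return int(np.argmin([np.sum((np.asarray(x) - m) ** 2) for m in mu]))

# discriminative (bottom-up): logistic regression trained on p(c|x)
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * (p - y).mean()
def disc_predict(x):
    return int(np.asarray(x) @ w + b > 0)
```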
Using this BoW structure, various methods achieve state-of-the-art performance and are worth noting. Pinz et al. [Pin05] presented the 'boundary-fragment-model', which is contour-based and uses a codebook of 'boundary fragments' to vote for potential object centroid locations. Their method can cope with multiple models for a single category, which is not possible for region-based methods. On the other hand, it also needs bounding boxes around the objects in the training images, making it more supervised. The possibility of combining boundary- and patch-based methods is explored in Pinz's cooperative work [OFPA04], but the computational complexity problem is not solved.
Figure 2.7: The most likely words (shown by 5 examples in a row) for four learned topics(1): (a) Faces, (b) Motorbikes, (c) Airplanes, (d) Cars. [SRE∗05a].
Using the probabilistic Latent Semantic Analysis (pLSA) model, Sivic et al. [SRE∗05a] tried to discover the object categories depicted in a set of unlabelled images. The model is applied to images by using a visual analogue of a word, formed by vector-quantizing SIFT-like region descriptors. They also extended the bag-of-words vocabulary to include 'doublets', which encode spatially local co-occurring regions. Their unsupervised method is successful; however, in their straightforward system one topic equals one object (as in Fig. 2.7) and no potential latent components are considered.
In the area of natural language processing, with the goal of sharing latent topics among documents, an approach based on Latent Dirichlet Allocation (LDA, Blei et al. [BNJ03]) provides a set of shared finite mixture components based on 'BoW'. This method is efficient but requires fixing some parameters in advance. Using the nonparametric nature of the Dirichlet Process to determine the appropriate parameters, the LDA method was extended by Teh et al. [TJBB06] into Hierarchical Dirichlet Processes, an algorithm developed to capture uncertainty regarding the number of mixture components in document modeling (details in Chapter 4.3).
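To give a feel for the underlying machinery, here is our own toy collapsed Gibbs sampler for plain LDA on tiny synthetic "documents" (a minimal sketch with a fixed number of topics K; the HDP of [TJBB06] additionally infers K, which this version does not):

```python
import numpy as np

def lda_gibbs(docs, K, V, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA with K fixed topics.
    docs: list of word-id lists; returns doc-topic and topic-word counts."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(0, K, len(d)) for d in docs]          # topic of each word
    ndk = np.zeros((len(docs), K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
    for d, (ws, zs) in enumerate(zip(docs, z)):
        for w, t in zip(ws, zs):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, (ws, zs) in enumerate(zip(docs, z)):
            for i, w in enumerate(ws):
                t = zs[i]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1   # remove the word
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())             # resample its topic
                zs[i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw

# four tiny "documents" over a 4-word vocabulary, two latent topics
docs = [[0, 0, 1, 1, 0], [1, 0, 1], [2, 3, 2, 3], [3, 2, 3]]
ndk, nkw = lda_gibbs(docs, K=2, V=4)
```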
Beyond document analysis, the LDA model is also used in natural scene classification. L. Fei-Fei et al. [LP05] represented the image of a scene by a collection of local regions, denoted as codewords obtained by unsupervised learning.
Figure 2.8: Learning relevant intermediate representations of scenes automatically and without supervision [LP05]
The model learns the theme distributions as well as the codeword distributions over the themes without supervision, as in Fig. 2.8. Furthermore, in L. Fei-Fei's cooperative work [WZFF06], the local patches are no longer independent: a linkage structure over the latent themes is introduced to encode the dependency of the local patches. These methods showed highly competitive categorization results, but the sharing of intermediate components is not really functional, because normally there are only one or two components for each category. The objects and the middle-level components are thus almost equivalent, and the core of the LDA algorithm is not exploited.
In a similar object recognition system, Sudderth et al. [STFW05] used SIFT [Low99b] descriptors, spatial information and the HDP model for visual scene categorization. By training transformed mixture components, the method applies complicated transformed components and uses a very limited number of themes.
Part-Based Models
In the bag-of-words model, all visual patches are considered of equal importance, and the spatial relations between them are simply ignored. Based on similar interest point detection, part-based models are proposed to cope with these problems. The original proposal by Fischler and Elschlager [FE73] was to find a visual object in an image given the relative positions of a few template matches. Depending on the geometry and number of local features, different geometric models are applied (Fig. 2.9).
Figure 2.9: Graphical geometric models by [CL06]
Using constellation models, L. Fei-Fei et al. [FFFP07] presented a Bayesian algorithm for learning generative models of object categories from a very limited training set. Their method uses prior information and then learns the unknown parameters of a generative probabilistic model. No latent theme is considered, so their system has only one level. The complete graph is also used in [BKSS10] to locate objects.
Winn and Shotton [WS06] imposed a Layout Consistent Random Field (LayoutCRF) model of asymmetric local spatial constraints on the part labels, ensuring a consistent layout of parts whilst allowing for object deformation. However, in their system the scale is fixed and only two categories, cars and faces, are tested.
Star models [FPZ05, CHC09] apply a star topology configuration of parts modeling a variety of features (appearance, shape, spatial information or hybrids). The part-based Star Model (SM) is used in an exhaustive manner to learn the object model and recognize the test images. The shortcoming is that the model parts must be taken from similar viewpoints, which is not flexible.
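The spatial term of such a star model can be sketched as a sum of Gaussian log-penalties tying each part only to the root (toy code of ours; the positions and learned offsets are invented):

```python
import numpy as np

def star_spatial_score(root, parts, offsets, sigma=2.0):
    """Log-probability (up to a constant) of part placements in a star model:
    each part depends only on its displacement from the root node."""
    root = np.asarray(root, float)
    score = 0.0
    for p, mu in zip(parts, offsets):
        d = np.asarray(p, float) - root - np.asarray(mu, float)
        score -= d @ d / (2 * sigma ** 2)   # isotropic Gaussian penalty
    return score

# perfect placement scores 0; deviations from the learned offsets are penalized
best = star_spatial_score((10, 10), [(12, 10), (10, 13)], [(2, 0), (0, 3)])
worse = star_spatial_score((10, 10), [(15, 10), (10, 13)], [(2, 0), (0, 3)])
```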
Felzenszwalb and Huttenlocher [FH05] built a tree model to detect and locate humans and faces, but it can only capture a small collection of parts arranged in a deformable configuration.
Crandall et al. [CFH05] extended the restricted form of the tree model and introduced the k-fan model, in which a reference set of k nodes forms a complete subgraph and each node outside the set is connected to every node of the set. They used k = 2, but the cost is already relatively high (O(N³)).
Hierarchy models [BT05, STFW05] use a loose two-level hierarchical structure, grouping pixels into parts and parts into objects. Although these methods allow arbitrary spatial transformations between parts and their subparts, the required geometric constraints are quite precise.
Carneiro and Lowe [CL06] stated that in their sparse flexible model, the geometry of each model part depends on the geometry of its k closest neighbors. Increasing the connectivity parameter k can improve the performance but reduces the system's efficiency.
Unlike [FFFP07, BKSS10], Revaud et al. [RLAB10] proposed an approach that uses mutual information to automatically learn selected subgraphs for a specific object. Their method is based on local keypoints and their spatial proximity relationships. The cost of recognition is relatively high and real time is not possible. They used incomplete graphs for specific object recognition, although for category-level recognition, objects within the same category also share subgraphs to some degree.
Classifier-Based Models
Popular classifiers such as SVM are also used in the recognition stage of BoW models (e.g. [Pin05, SRE∗05a, GD05a]). However, there are also methods based purely on these discriminative classifiers, and their results are likewise promising.
Among all the classical methods, boosting originates from Freund and Schapire [FS95]. Viola and Jones [VJ01] combined boosted weak learners in a face detector (Fig. 2.10) working under a very wide range of natural conditions, and trained a cascade of classifiers based on simple Haar features. Building on this method of essential feature selection, Zhang et al. [ZYZS05] broadened the feature pool by combining local texture features, global features and spatial features within a single multi-layer model to improve the performance. Even though the results from AdaBoost are not as good as those from another discriminative method, SVM, these simple and efficient methods will help us to design our own object classification system.
Figure 2.10: Schematic depiction of the AdaBoost detection cascade by [VJ01]
For the powerful kernel-based SVM methods, different kernels or mixtures of kernels are incorporated to improve the performance. Pozdnoukhov and Bengio [PB04] defined new SVM kernels based on tangent vectors that take into account prior information on known invariances. Kienzle et al. [KBFS04] replaced the set of support vectors by a smaller, so-called reduced set of synthetic points to speed up the evaluation.
Some researchers have proposed combining multiple classifiers. Mattern et al. [MRD05] combined the classification results by a voting method: each classifier votes for one class, and the class with the most votes wins. Siddiquie et al. [SVD09] proposed the Boosted Kernel SVM (BK-SVM), learning a mixture of kernels by greedily selecting exemplar data instances corresponding to each kernel using AdaBoost. This method reduces the number of kernel computations but suffers from reduced accuracy.
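The voting combination described above amounts to a few lines (a sketch; ties here are broken in favour of the first label seen):

```python
from collections import Counter

def majority_vote(predictions):
    """Each classifier votes for one class; the class with the most votes wins."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["car", "face", "car", "bike", "car"]))  # car
```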
2.1.4 Recent Work and Summary
To extend the BoW model, Grauman and Darrell [GD05a] designed a method to form a partial matching between two sets of feature vectors (or histograms). This matching is used as a robust measure of similarity to perform content-based image retrieval, as well as a basis for learning object categories. They use a multi-resolution histogram pyramid (Fig. 2.11) in the feature space to implicitly form a feature matching.
Figure 2.11: Pyramid Match by Grauman and Darrell [GD05a]
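A toy 1-D version conveys the idea: count histogram-intersection matches at each resolution, and down-weight matches that first appear at a coarse level (our own sketch of the scheme in [GD05a], not their implementation):

```python
import numpy as np

def pyramid_match(x, y, levels=4, diameter=16):
    """Toy pyramid match for two sets of 1-D features in [0, diameter):
    new matches found at level i (bin width 2**i) are weighted by 1/2**i."""
    score, prev = 0.0, 0.0
    for i in range(levels):
        bins = int(np.ceil(diameter / 2 ** i))
        hx, _ = np.histogram(x, bins=bins, range=(0, diameter))
        hy, _ = np.histogram(y, bins=bins, range=(0, diameter))
        inter = np.minimum(hx, hy).sum()     # matches at this resolution
        score += (inter - prev) / 2 ** i     # count only the new ones
        prev = inter
    return score

a = np.array([1.5, 3.2, 7.7])
# matching a set against itself scores exactly its cardinality
assert pyramid_match(a, a) == len(a)
```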
Based on the 'bag-of-words' model, Larlus and Jurie [LJ08] used the HDP model to train one-object-one-component blobs for segmentation, and applied Markov Random Fields (MRF) to find the boundaries of objects. This method combines heterogeneous features and MRF components; however, it is applied to only 30 to 60 patches per image and obtains only one or two latent themes per object (Fig. 2.12). In this case it fails to capture the essential characteristics inside the 'categories' and ignores the intra-class variance.
Figure 2.12: Latent components generated by [LJ08]
Similarly, but working on a different topic, Wang et al. [WMG09] combined low-level visual features, simple atomic activities and interactions in a hierarchical Dirichlet model for activity perception in complicated environments. It shows the power of the HDP model and also inspires our work.
The merits and weaknesses of these current systems lead us to our original work, which includes the combination of heterogeneous features, the search for latent components, and cascaded generative and discriminative models.
2.2 Literature Review: Facial Expression Recognition
2.2.1 Basic Facial Expressions and Facial Actions
Facial expressions play an important role in signaling emotional states and have been a focus of psychologists (e.g. Ekman and Friesen [EF78]).
A great deal of effort has been put into the translation between temporal and pattern
features of human expressions and semantic labels. In computer vision, machine under-
standing of this human non-verbal communication is still a challenge in human-machine
interaction.
To point out the directions of facial measurement technology, psychologists defined two main streams in the automatic analysis of facial expressions. The first direction concerns facial affect detection, where six basic universal human emotions are identified, as in Fig. 2.13: Fear, Surprise, Sadness, Anger, Disgust and Happiness. The second direction concerns facial muscle action detection, where the target is to identify the basic actions or vocabulary of expressions: AUs (Action Units; a whole set of objective coding labels for facial actions is defined in the Facial Action Coding System, as in Fig. 2.14). 44 different facial movements are defined as AUs. Between these two main research directions, some translations or mapping tables are provided, but without a solid theoretical basis.
2.2.2 Three Stages in Automatic FER System
The early work, starting with one of the first attempts at an FER system by Suwa et al. [SSF78], was summarized by Pantic and Rothkrantz [PR00]. They concluded that the ideal system for facial expression analysis includes three important stages: the detection of
Figure 2.13: Six basic expressions, from left to right: Anger, Disgust, Fear, Happiness,Sadness and Surprise.
Figure 2.14: Examples of facial action units (AUs) and their combinations defined inFACS from [PB07]
faces, their representation, and the classification of these representations. In their overview of automatic FER systems, Chibelushi and Bourel [CB03] gave a similar structure but added two architectural components: pre-processing (normally a sub-stage of face acquisition) and post-processing (normally included in the classification step), as in Fig. 2.15. These main blocks are used in almost all FER systems. In later state-of-the-art reviews by Tian et al. [TKC05] and Pantic and Bartlett [PB07], the basic structure of FER systems is illustrated similarly, and the main techniques for the three stages are discussed and summarized. Zeng et al. [ZPRH09] surveyed the latest advances in facial expressions, head and body movements and temporal audio-visual correlation.
In the face detection stage, some approaches use the distance between the eyes [FPH05, SGM09] to normalize the faces, or identify a set of fiducial points [GD05b, KP08]. Among the existing automatic systems, Bartlett et al. [BGL∗06] employed boosting techniques on Gabor features to track each of 20 AUs (Action Units) to differentiate between fake pain and real pain. Koelstra and Pantic [KP08] developed a system based on the Viola
Figure 2.15: The main blocks in facial expression recognition [CB03]
and Jones face detector [VJ01] and used Gabor features to locate 20 facial characteristic points. Koutlas and Fotiadis [KF08] automatically located 74 landmark points using Active Shape Models (ASM). Buenaposada et al. [BMB08] constructed a three-level face tracker using the Viola and Jones face detector [VJ01], a template-based rigid face tracker and a subspace-based tracker. These methods concentrate only on face detection and do not consider the changes caused by facial expressions.
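Eye-distance normalization, as in [FPH05, SGM09], reduces to computing a scale factor from the inter-ocular distance (a minimal sketch; the target distance of 60 pixels is an invented parameter):

```python
def eye_normalization_scale(left_eye, right_eye, target_dist=60.0):
    """Scale factor bringing the inter-ocular distance to target_dist pixels;
    the face image is then resized by this factor before feature extraction."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return target_dist / (dx * dx + dy * dy) ** 0.5

print(eye_normalization_scale((100, 120), (220, 120)))  # 0.5
```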
In the second stage, discriminative information is extracted to represent the facial expression. One category of approaches is based on geometric features, whose models are established from a set of important points on the face or from face contour deformation. More recently, some researchers [VKM09] suggested the use of a deformed 3D model. In the other category, expressions are described by appearance-based features. Usually, the methods in this group handle the image-wise problem and identify the 6 basic classes plus a neutral class to label images. Among appearance-based methods, many use texture as the discriminative feature. One of the most frequently used descriptors is the Gabor wavelet [LBL09], which is a powerful, though time-consuming, tool. The Local Binary Pattern (LBP) [HPA04] is another popular texture feature, usually used on arbitrarily gridded sub-regions of images [SGM09].
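The basic LBP code is simple to state in NumPy (a toy 8-neighbour raster version of ours, not the circularly interpolated variant of [HPA04]):

```python
import numpy as np

def lbp_image(img):
    """8-neighbour LBP: threshold each pixel's neighbours at the centre value
    and pack the comparison bits into one byte per interior pixel."""
    img = np.asarray(img, dtype=float)
    c = img[1:-1, 1:-1]
    h, w = img.shape
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (nb >= c).astype(int) << bit
    return code

def lbp_histogram(img):
    """Normalized 256-bin histogram of LBP codes, as used per sub-region."""
    hist = np.bincount(lbp_image(img).ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

Histograms like this, computed per sub-region and concatenated, form the face representation used by the LBP-based systems above.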
There are other widely used texture descriptors and their variants. Yang et al. [YLM09] built a class-specific codebook and applied boosting to select dynamic Haar-like features at the same position along the temporal window. The weakness of these coded features is their fixed time-span. Xie et al. [XL09] built the histogram of pixel values at each pixel coordinate over all the training images. If a bin's value is higher than those of its two neighbors, they define it as a peak bin. The top peak bins with high probabilities of occurrence are selected, and the corresponding grey levels are used to code the pattern images for testing. The proposed feature is based on grey-level peaks and has been tested on only four expressions. As the speed of facial change varies across expressions, this continuous pixel coding over longer spans becomes inadaptable.
None of these sequence-based methods considers the appearance texture of static faces. Contrary to these approaches, Buenaposada et al. [BMB08] introduced a probabilistic procedure to combine this information, based on static information from each frame in the sequence, so as to compute the posterior probability. This is a real-time system, but the recognition rate is not competitive. In order to represent both static and dynamic information, Zhao et al. [ZP09] concatenated LBP features on three orthogonal planes to represent the video, treating motion in the same way as appearance, despite their obviously different texture patterns. Note that the authors manually annotated the eye positions in the sequences. Some other authors explored expression recognition on non-frontal face images by using SIFT features (Zheng et al. [ZTLH09]), hybrid features of LBP and Gabor (e.g. [MB09]), or variable-intensity templates [KOY∗09].
Some authors proposed the use of shape descriptors, such as Kotsia et al. [KZP08], who considered shape information from the Candide facial grid based on a set of landmarks, or Zhu et al. [ZSK02], who computed moment invariants on several manually annotated face areas. However, most of these approaches extract features on static images and do not consider the transient movements of essential facial parts.
Some new approaches focus on taking dynamic information into account so as to deal with the sequence-wise problem. As an early example, Yeasin et al. [YBS04] used discrete HMMs to uncover the hidden patterns associated with expressions, which are invariant to illumination changes. Djemal and Puech [DPR06] tracked contour changes in a sequence of medical images for 3D reconstruction. Xiang et al. [XLC08] analyzed fixed-size sequences of 11 frames using fuzzy C-means clustering to generate the expression model, but this method is inflexible: for shorter sequences it cannot yield a meaningful interpretation, while for longer ones some frames have to be eliminated. They also tested different settings for the number of frames and observed reduced performance when the number is cut down.
Spatio-temporal features have also become popular. Laganière and Lambert [LBH∗08] observed prominent motion of visual interest points found via the Hessian matrix. Koelstra and Pantic [KP08] derived motion histogram descriptors in sliding windows of frames along the time axis. The motion orientation histograms were extracted as feature descriptors to train a classification system for the automatic frame-by-frame recognition of AUs and their temporal dynamics, using a combination of ensemble learning and HMMs. They tested the recognition of all 27 lower and upper face Action Units on the MMI database. They also reported a high false positive rate, as only motion information is used and some AUs have very similar motion directions. In our opinion, static or appearance information can be used to reduce this error.
In the last stage, different machine learning techniques are applied. Guo et al. [GD05b] compared the performance of various classifiers: a simplified Bayes classifier, the support vector machine, and AdaBoost. The results showed that SVM is the most suitable. Buenaposada et al. [BMB08] introduced a probabilistic procedure to combine the information from the input image sequence in order to compute the posterior probability. The proposed system is robust in realistic environments but obtains only about 80% recognition rates for expressions such as fear, sadness and anger, which is not competitive with others [SGM09, ZP09]. In Shan et al. [SGM09], better recognition performance is obtained by combining SVM and Boosted-LBP features but, as their boosting is based on sub-regions, the method suffers from the curse of dimensionality.
2.2.3 Recent Work and Trend
Figure 2.16: Rectangle feature by [KJC08]
In recent work, the main new contributions lie in the design of new descriptors or recognition algorithms to improve overall performance. Here we introduce some new techniques and present their merits and weaknesses.
Kim et al. [PSK08, KJC08] extended Haar-like descriptors to various rectangle configurations (Fig. 2.16). For feature selection, the AdaBoost algorithm of Viola and Jones is used to build the classifier. Their results are around 90% on still images from the JAFFE database, which makes the performance of these 3× 3 features unremarkable among the available features.
Vretos et al. [VNP09] utilized the Kanade-Lucas-Tomasi algorithm to track the Candide facial grid and applied principal component analysis (PCA) to find the two eigenvectors of the model vertices. They then applied an SVM to selected vertex features to perform classification. The achieved facial expression classification accuracy is approximately 90% on the Cohn-Kanade database. Their system is efficient but neither the best performing nor fully automatic, because the vertices of the grid on the frames are manually located.
Chang et al. [CLL09] linked the output class label to the underlying emotion of a facial expression sequence and connected the hidden variables to frame-wise action units using a hidden conditional random field. In their sequence-wise classification, the label of a sequence is decided by a majority vote over the image frames. This solution is similar to Buenaposada et al. [BMB08]: labels and probabilities are combined to achieve better accuracy rates. In their system, only static features are extracted and no dynamic actions are considered, which degrades the recognition rates for "Anger" and "Sadness".
Working from temporally deformed facial features, Park et al. [PK09] magnified the intensity of facial actions so as to recognize subtle facial expressions using motion magnification and an SVM classifier. Using temporal information to recognize subtle and spontaneous facial expressions is a novel angle. However, their experiments are incomplete and difficult to compare with other methods, as only four classes (three expressions plus the neutral expression) were tested, and only on their own dataset.
On the popular Cohn-Kanade database, Raducanu and Dornaika [RD09] reported one of the best results, 100% on 5 classes from 70 subjects. Their method recovered 3D head pose and facial actions from the video sequence using an appearance-based face and facial action tracker (Fig. 2.17). They concluded that the dynamic recognition scheme outperformed all static recognition schemes. We also consider tracking dynamic transitions a promising and more robust direction for our proposal. On the weak side, their accuracy rate is high but they ignored the most difficult expression, "Fear", and their SVM classifier is possibly over-fitted.
As for future trends, the maturity of research in this field is leading many researchers to become interested in recognizing spontaneous expressions in realistic environments.
Figure 2.17: Three video examples associated with the 3D tracker, with different degrees of motion. [RD09]
Tong et al. [TCJ10] collected videos from the MAD database, the Belfast natural facial expression database and YouTube, and proposed a probabilistic framework, but nothing new regarding descriptors (traditional Gabor wavelet features are applied). Littlewort et al. [LBL09] videotaped 26 participants to record posed and genuine pain. In their proposed system, the output of traditional Gabor filters is passed to classifiers such as SVM, AdaBoost and linear discriminant analysis. It obtains 88% correct classification for distinguishing fake from real pain, while no dynamic information is used. However, although some spontaneous affective behavior databases now exist, a benchmark database for spontaneous facial expressions is still not available [ZPRH09].
Another trend is towards pose-invariant solutions. Kumano et al. [KOY∗09] proposed a variable-intensity template for FER systems. Their method describes how the intensity of multiple points defined in the vicinity of facial parts varies for different facial expressions and is individual-dependent. By using this model in the framework of a particle filter, their method is capable of estimating facial poses and expressions simultaneously. The experiments demonstrate a recognition rate of over 93.1% over a range of ±40 degrees from the frontal view on their own data, but only 70% for 53 subjects in the Cohn-Kanade database. Another problem is that the performance with their interest-point location system is inferior to the performance with random point extraction, and far from practical usage.
Another attempt to handle pose changes in facial expression recognition comes from Moore et al. [MB09]. They used static features such as local binary patterns (LBPs) and a novel descriptor, local Gabor binary patterns (LGBPs). The evaluation is performed on the 3D BU-3DFE database; the overall results are not very high even in the frontal-view case, which shows that this area still needs more exploration.
Though limited to only one category, the smile, Whitehill et al. [WLF∗09] collected pictures, photographed by the subjects themselves, from thousands of different people under many different real-world imaging conditions. Their experience developing a smile detector suggests that robust automation of the Facial Action Coding System may require on the order of 1,000 to 10,000 example images per target Action Unit. Datasets of this size are likely to be needed to capture the variability in illumination and personal characteristics likely to be encountered in practical applications.
2.3 Conclusion
Object recognition and facial expression recognition are two long-standing and still promising research areas in computer vision. In this chapter, we presented a comprehensive survey of the various systems, with summaries of recent work and possible trends in both topics. Among all these methods, however, many difficulties remain to be solved. For example, inaccurate location of facial characteristic points influences the detection of movements in facial expression classification. Many approaches are now moving from static features to dynamic features, but the accurate detection of such points for dynamic description is perhaps not realistic. More general methods that avoid the requirement of precise point location seem more practical. In our work, we follow these promising directions and propose our own methods and systems in the next several chapters.
Chapter 3
Feature Representation for Objects and Faces
Contents
3.1 Overview of Local Feature Descriptors . . . . . . . . . . . . . . . . . . . 33
3.2 The Detection of Regions of Interest . . . . . . . . . . . . . . . . . . . . 34
3.2.1 DOG for Key Point Detection . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Face Organ Location . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Features for Static Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 SIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3 Shape Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.4 LBP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.5 Gabor features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Features for Dynamic Schemes . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.1 Introduction to Temporal Texture Analysis . . . . . . . . . . . . . 47
3.4.2 Dynamic Deformation for Facial Expression . . . . . . . . . . . . 47
3.4.3 VTB Descriptor for Dynamic Feature . . . . . . . . . . . . . . . . 49
3.4.4 Moments on Spatiotemporal Plane . . . . . . . . . . . . . . . . . 52
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
In this chapter, we present an overview of various local descriptors which are required for pattern recognition. These descriptors successfully represent descriptive features in images or videos. Normally, they are denoted by feature vectors computed on a salient region or on a patch around a key point. According to their different usages, we put them into two sections: features for static images and features for dynamic characteristics.
Among these features, some descriptors, such as SIFT or Shape Context, already exist. Because of their powerful discrimination ability, we incorporate them into our systems. We also propose a new texton descriptor, namely VTB (vertical time backward), which captures the intrinsic changes of human faces in the spatiotemporal domain. For general shape changes in spatiotemporal domains, we extend the usage of moments. To the best of our knowledge, our method is unique in the literature in extracting the shape transformation of expressional faces on dynamic slices. Moreover, the combination of these heterogeneous features is introduced and discussed in the next chapter.
Part of the work described in this chapter was published in the form of three international conference papers [JIB09], [JI10b] and [JI10a].
3.1 Overview of Local Feature Descriptors
The usage of features is an important component of computer vision systems. Features generally denote the relevant information for learning and recognition in images and videos (or image sequences). Many features have been proposed by researchers since the 1970s, when the early development of applied artificial intelligence appeared. By using these features, computational algorithms that mimic human visual perception can be used in behavioral research, affective computing, robotics and human-machine interfaces. For these different applications, specific features should be carefully chosen to handle the particular problems. Generally, the representation of images and videos using features falls into two categories (Fig. 3.1):
Figure 3.1: Above: [PB07] Tracking facial characteristic points. Below: [HPA04] LBP histograms from the whole face and from divided blocks.
1. Geometric-based features: structural information, usually related to the spatial or temporal domain;
2. Appearance-based features: information on a region or on the neighborhood around an interest point in the images.
Basically, we use the second category of features, which applies local neighborhood operators on the patch around interest points or on salient regions. The spatial or temporal domain, which is traditionally explored by geometric features, will be represented by the dynamic features proposed in section 3.4. The effectiveness of all these
features depends on two steps: the detection of "interesting" regions (or points) and the feature descriptor. The performance relies first on detecting the right region (or the right point) from which a feature can be extracted. Furthermore, the pattern information on the region (or around the point) represented by the descriptor is commonly denoted by a single vector called the feature vector. The evaluation criteria for feature vectors should include three aspects:
1. Distinctive power;
2. A simple and fast extraction process;
3. Invariance to illumination, orientation, scale, and affine transformation.
For facial expression recognition, features should also be robust to human subjects from different cultural backgrounds and to their non-rigid movements.
In this chapter, various feature detectors and feature descriptors are introduced and compared. They are incorporated into our systems for object recognition and facial expression classification. Especially for dynamic features, we introduce appearance-based features into spatiotemporal domains.
3.2 The Detection of Regions of Interest
3.2.1 DOG for Key Point Detection
If we want to find an object in an image, interesting points related to the object can be located to provide a characteristic description of the object. For the task of object recognition, the features extracted from the training images should be repeatable in the testing ones regardless of image noise, changes in illumination, uniform scaling, rotation, and local geometric distortion. Evaluated on these merits, we choose and summarize Lowe's keypoint detection method based on DOG (difference-of-Gaussian) [Low99a] in this section. The method extracts a large number of feature vectors which are invariant to image translation, scaling, and rotation, partially invariant to illumination changes, and robust to local affine geometric distortion. The keypoint location method consists of three cascaded filtering stages:
1. Find the maxima and minima of the difference-of-Gaussians function applied at all scales and image locations. These extreme points are considered potential key points;
2. Discard low-contrast points and edge responses along an edge. Only the most stable of all candidate points are kept;
3. Finally, one or more orientations are assigned to each localized keypoint based on local image gradient directions.
These steps ensure that a large number of keypoints over all scales and locations can be generated for matching and recognition. By adjusting the threshold, more than 2000 stable and repeatable keypoints can be located in a highly textured 500×500 image after this filtering.
Figure 3.2: The construction of the Difference of Gaussians (DOG). [Low99a]
The first stage of keypoint detection is to identify locations and scales that can be repeatably assigned under differing views of the same object. For this, the image is convolved with Gaussian filters at different scales, and the differences of successive Gaussian-blurred images are computed as in Fig. 3.2. Keypoints are then taken as the maxima/minima of the Difference of Gaussians (DoG) occurring at multiple scales.
Therefore, with an input image I(x, y) and a variable-scale Gaussian

G(x, y, σ) = (1 / (2πσ^2)) e^(−(x^2+y^2)/(2σ^2))    (3.1)
the scale-space images L(x, y, σ) on the left of Fig. 3.2 are generated by convolving the image:

L(x, y, σ) = G(x, y, σ) ∗ I(x, y)    (3.2)
To use the scale-space extrema, the DOG image D(x, y, σ) on the right of Fig. 3.2 is given by the difference of two nearby scales separated by a constant multiplicative factor k:

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y) = L(x, y, kσ) − L(x, y, σ)    (3.3)
Hence the initial image is repeatedly convolved with Gaussians to produce the set of scale-space images. The convolved images are grouped by octave (an octave corresponds to doubling the value of σ), and adjacent Gaussian images are simply subtracted to produce the difference-of-Gaussian images. After each octave, the Gaussian image is down-sampled by a factor of 2 and the process is repeated. Once the DoG images have been obtained, keypoints are identified as local minima/maxima of the DoG images across scales: each pixel is compared to its eight neighbors in the current image and nine neighbors in the scales above and below, i.e. 26 neighbors in total in 3×3 regions at the current and adjacent scales. If the pixel value is the largest or smallest among all compared pixels, it is selected as a potential keypoint.
The choice of the prior smoothing parameter σ = 0.009 concerns the frequency of sampling in the spatial domain. Lowe's experiments show that the repeatability of keypoint detection increases with σ, but so does the computational cost. A proper σ should be chosen to balance sampling frequency against the detection rate in extrema selection. We start from an initial value of σ; if the number of points is below a threshold, the value of σ is doubled and more potential points are detected for future use. This process is repeated until enough points are found or the extrema are of too low contrast to be meaningful.
For accurate keypoint location, Lowe's method rejects points which have low contrast or are poorly localized along an edge. This stage provides a substantial improvement to matching and stability. First, after thresholding on the minimum contrast with respect to neighbouring points, only a subset of the potential keypoints is kept, to avoid too much clutter. Then, to eliminate keypoints along edges, which are unstable under small amounts of noise, the sum of the eigenvalues is obtained from the trace of the Hessian matrix and their product from its determinant; the curvatures around the point are roughly checked through their ratio, and edge responses are discarded.
In the last stage of the DOG detector, each keypoint is assigned one or more orientations based on local image gradient directions. This is the most important step in achieving invariance to rotation, as the keypoint descriptor can be represented relative to this orientation.
First, the Gaussian-smoothed image L(x, y, σ) with the closest scale σ is selected, so that all computations are performed in a scale-invariant manner. For each image sample, the gradient magnitude m(x, y) and orientation θ(x, y) are precomputed using pixel differences in a neighboring region around the keypoint. An orientation histogram with 36 bins is formed, each bin covering 10 degrees. Each sample added to a histogram bin is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a σ that is 1.5 times the scale of the keypoint. The peaks in this histogram correspond to dominant orientations. Once the histogram is filled, the orientations corresponding to the highest peak and to local peaks within 80% of the highest peak are assigned to the keypoint. When multiple orientations are assigned, an additional keypoint is created with the same location and scale but a different orientation. Only a small fraction of points are assigned multiple orientations, but these contribute significantly to the stability of matching in experiments.
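The orientation-assignment stage just described can be sketched as follows. This is a simplified, hypothetical version: gradients are taken with finite differences, the weighting σ is a placeholder value, and no parabolic peak interpolation is performed:

```python
import numpy as np

def assign_orientations(patch, weight_sigma=1.5):
    """patch: square grayscale window centred on the keypoint.
    Returns the dominant orientations in degrees (36 bins of 10 degrees;
    every peak within 80% of the highest bin yields an orientation)."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = (np.degrees(np.arctan2(gy, gx)) + 360.0) % 360.0
    # Gaussian weighting centred on the keypoint
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    g = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * weight_sigma ** 2))
    hist = np.zeros(36)
    bins = (orientation // 10).astype(int) % 36
    np.add.at(hist, bins, magnitude * g)   # magnitude- and window-weighted
    peak = hist.max()
    return [b * 10.0 for b in range(36) if hist[b] >= 0.8 * peak]
```

For a patch with a purely horizontal intensity ramp, all gradients point the same way and a single orientation of 0 degrees is returned; a vertical ramp yields 90 degrees.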
The operations in these three stages assign an image location, scale, and orientation to each keypoint. The detection up to this point is invariant to rotation, scaling and small deformations; the corresponding discriminative descriptor is presented in section 3.3.2.
3.2.2 Face Organ Location
Like the detection of interest points in object recognition, face and facial organs detection
is the first stage in an automatic facial expression recognition system (FER). The facial
detection in our FER system does not use the traditional ways of face detector. Viola and
38 Chapter 3. Feature Representation for Objects and Faces
Jones [VJ01] proposes a popular face detector using Harr-like features and AdaBoost al-
gorithm to locate faces, and have been used in many later system as [BMB08], [MWR∗08].
Some other methods manually label the location of eyes to normalize faces as the oval
CSU FERET faces [BBDT05]. After equalizing the histogram of the image and scaling
the pixel values to have a mean of zero and a standard deviation of one, we subtract
the images with expressions from images with neutral faces. We suppose the images
with expressions and images with neutral faces are from the same video sequence and
aligned. Possible head motion and pose changing are not considered here but available
technologies dealing with these subjects can be added to our system [BMB08].
Figure 3.3: Facial organ location: step by step.
Based on these subtracted images (the first image on the left in Fig. 3.3 and the third row in Fig. 3.4), and since the expression changes are related to those of the facial components, we use them to identify the facial organs relevant to the expression and to roughly locate them, as in Fig. 3.4. For different expressions, the location results for the same face will differ too. We first apply a Gaussian filter to blur the difference images and eliminate isolated noisy points. Then we detect the facial organ regions as in Algorithm 3.1. As shown step by step in Fig. 3.3, we first find the dense block in the image as the facial region, then locate the two dense blocks as the left and right eye areas, and finally find the lower part, which consists of the nose and mouth (though sometimes the nose cannot be detected). In the algorithm, the validity of a vertical or horizontal line L is calculated as its line density DL:
D_L = N_Valid / N_Total    (3.4)

where N_Valid is the number of valid pixels, whose brightness is higher than an empirical threshold T_bright, and N_Total is the total number of points on the vertical or horizontal line.
The results for different expressions are shown in Fig. 3.4. The first row shows
the original facial expression images. The second row shows the corresponding neutral
Algorithm 3.1: Detect facial organs
1. Draw the initial face rectangle: suppose the image contains only a front-view face, with the face spanning the full width of the image, the eyes beginning at 1/3 of the whole height and the chin occupying 1/9 of the whole height.
2. Shrink the face rectangle, checking the line density DL of the left and right border lines of the face.
3. Eye region detection:
(a) Locate the small area between the eyes;
(b) From this area, search the left and right parts to find the true width of the face;
(c) Initialize the positions of the eyes, locating the left and right blocks with enough valid points;
(d) Adjust the sizes of the left and right blocks to make sure they are the same size.
4. Locate the lower part, consisting of the nose and mouth:
(a) Initialize the lower part below the eyes: the width spans from the center of the left eye to the center of the right eye, the height from the lower edge of the left eye to the lower edge of the face;
(b) Erase the blank lines around the four edges of the lower part, and include the valid parts below or above the current lower part;
(c) Balance the lower part according to the central point between the left and right eyes.
5. Cut the minimal rectangle which contains the left eye block, the right eye block and the lower part block.
Figure 3.4: Sample images, from left to right: Anger, Disgust, Fear, Happiness, Sadness and Surprise.
expression images, and the subtracted images are listed in the third row. The detected facial organs are drawn in the fourth row. We then cut the original image with a square mask, so that only the face region from eyebrows to chin and from left eye to right eye is kept. The cropped region is processed by histogram equalization and pixel normalization in the same way as in the CSU Face Identification Evaluation System [BBDT05] to remove illumination changes. Finally, the facial region is scaled to a fixed resolution (e.g. 64×64) for the next stage. These cropped and normalized face images are shown in the fifth row of Fig. 3.4.
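Cropping aside, the photometric part of this pipeline can be sketched as follows. This is a generic re-implementation of histogram equalization, zero-mean/unit-variance scaling and neutral-frame subtraction, not the CSU toolkit itself:

```python
import numpy as np

def equalize(img):
    """Histogram-equalise an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum() / img.size              # cumulative distribution
    return (cdf[img] * 255).astype(np.uint8)

def normalize(img):
    """Scale pixel values to zero mean and unit standard deviation."""
    img = img.astype(float)
    return (img - img.mean()) / (img.std() + 1e-8)

def difference_image(expression_img, neutral_img):
    """Absolute difference of the equalised, normalised aligned frames."""
    return np.abs(normalize(equalize(expression_img))
                  - normalize(equalize(neutral_img)))
```

Subtracting a frame from itself naturally yields an all-zero difference image; the interesting regions in Fig. 3.4 come from the parts of the face that actually move.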
Based on this detection stage, the features for static and dynamic deformation will be extracted in the next two sections.
3.3 Features for Static Schemes
3.3.1 Color
The raw gray-level or color intensities of the pixels in the image are one of the most natural ways to describe a subregion or patch, though they are far from robust in realistic use. For example, the most popular RGB color space is easy to obtain, but not all visible colors can be represented by positive values of the red, green and blue components. The CIE (International Commission on Illumination) defined the CIE XYZ and CIE LUV color spaces to overcome this difficulty; their merit is perceptual uniformity. Among the three components, the L value corresponds roughly to luminance or brightness: ignoring U and V, varying L can be used as a gray scale. The U parameter mostly captures shifts from green to red (with increasing U), and the V parameter mostly captures blue and purple type colors. Both are chroma components.
In our application, in order to balance and enrich the information for the regions around the DOG keypoints of section 3.2.1, we also average the LUV pixel values over 8× 8 regions. The color information, denoted C, is represented by a 3-dimensional vector and clustered into a codebook of size 24.
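As an illustrative sketch (the RGB-to-LUV conversion itself is omitted; the input patch is assumed to be already in LUV), the 3-dimensional colour vector C and its nearest-codeword assignment could look like this:

```python
import numpy as np

def color_feature(luv_patch):
    """luv_patch: (8, 8, 3) array of L, U, V values around a keypoint.
    Returns the 3-vector C of per-channel averages."""
    return luv_patch.reshape(-1, 3).mean(axis=0)

def quantize(feature, codebook):
    """Assign the 3-vector to the nearest codebook centre (24 in our case);
    the codebook itself would come from clustering the training features."""
    distances = np.linalg.norm(codebook - feature, axis=1)
    return int(np.argmin(distances))
```

A uniformly coloured patch simply yields its own LUV value as the feature, which is then mapped to the closest of the 24 centres.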
3.3.2 SIFT
The SIFT (Scale-Invariant Feature Transform) descriptor was proposed by [Low99b]. It computes a highly distinctive descriptor for the local image region after the DOG key point detection of section 3.2.1. Furthermore, the descriptor should be as invariant as possible to the remaining variations, such as changes in illumination or viewpoint.
As the keypoint locations at particular scales are known and principal orientations have been assigned, the important invariances to image location, scale and rotation are ensured. A descriptor vector is extracted at each keypoint on the image closest in scale to the assigned scale of the current point.
First, a 4×4 array of histograms with 8 bins each is created. These histograms are computed from the magnitude and orientation values of samples in a 16×16 region around the keypoint, such that each histogram contains samples from a 4×4 subregion of the original neighborhood region. Fig. 3.5 shows a 2×2 descriptor array computed from an 8×8 set of samples, whereas the full descriptor uses 4×4 histograms computed from a
Figure 3.5: The histogram computation for SIFT descriptor
16×16 sample array. The magnitudes are further weighted by a Gaussian function with σ equal to one half the width of the descriptor window, illustrated by the circular window on the left side of Fig. 3.5. The descriptor then becomes the vector of all the values of these histograms. Since there are 4×4 = 16 histograms, each with 8 bins, the vector has 128 elements. This vector is normalized to unit length in order to enhance invariance to affine changes in illumination. The influence of large gradient magnitudes is reduced by thresholding the values in the unit feature vector to be no larger than 0.2 each, and then renormalizing to unit length.
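A condensed sketch of this construction follows; for brevity it omits the trilinear interpolation between bins and the rotation of the patch to the keypoint orientation, so it is a simplified illustration rather than the full SIFT descriptor:

```python
import numpy as np

def sift_descriptor(patch):
    """patch: 16x16 grayscale window around the keypoint.
    Returns the 4x4x8 = 128-dimensional descriptor."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = (np.degrees(np.arctan2(gy, gx)) + 360.0) % 360.0
    # Gaussian weight with sigma = half the window width
    yy, xx = np.mgrid[0:16, 0:16]
    g = np.exp(-((yy - 7.5) ** 2 + (xx - 7.5) ** 2) / (2 * 8.0 ** 2))
    w = mag * g
    desc = np.zeros((4, 4, 8))                 # 4x4 cells, 8 orientation bins
    bins = (ang // 45).astype(int) % 8
    for y in range(16):
        for x in range(16):
            desc[y // 4, x // 4, bins[y, x]] += w[y, x]
    v = desc.ravel()
    v /= np.linalg.norm(v) + 1e-12             # unit length
    v = np.minimum(v, 0.2)                     # damp large gradient magnitudes
    return v / (np.linalg.norm(v) + 1e-12)     # renormalize
```

The final vector is non-negative and of unit length, so two descriptors can be compared directly with the Euclidean distance mentioned below.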
The dimension of the descriptor, 128, may seem a bit high, but the computational cost of matching remains relatively low when using the Euclidean distance between feature vectors. SIFT descriptors have also proved invariant to minor affine changes. Therefore, SIFT features are highly distinctive, robust and suitable for highly textured regions.
3.3.3 Shape Context
Shape context was proposed by Belongie and Malik [BMP00] as a contour-based feature. It is another popular descriptor in object recognition.
As shown in Fig. 3.6, an edge detector is first applied to an image I(x, y), which yields a set of edge points P = {p1, p2, . . . , pn}. For each point pi on the shape, consider the n − 1 vectors obtained by connecting pi to all other points. The set of all these vectors is a rich description of the shape localized at that point, but it is far too detailed, since shapes and their sampled representations may vary from one instance to another within the same category. The key idea in the proposal of Belongie and Malik [BMP00] is that
Figure 3.6: Left and center: samples of edge points for two similar shapes. Right: the log-polar histogram bins used to compute the shape context.
the distribution over relative positions is a robust, compact, and highly discriminative descriptor. So, for the point pi, the coarse histogram of the relative coordinates of the remaining n − 1 points is defined as the shape context of pi. The bins are normally taken to be uniform in log-polar space around the keypoint. That the shape context is a rich and discriminative descriptor can be seen in Fig. 3.7, which shows the shape contexts of two different versions of the letter "A".
Figure 3.7: Shape context as a discriminative descriptor
Fig. 3.6 shows the sampled edge points of the two shapes on the left and center; on the right is the log-polar diagram, with five bins for log r and twelve bins for θ, used to compute the shape context. Fig. 3.7 shows the corresponding shape contexts for three points. As can be seen, the left and center histograms are the shape contexts of two points in corresponding positions, so they are quite similar, while the shape context on the right is very different, since that point lies at a low, angle-like position.
Because shape contexts are distributions represented as histograms, the similarity between two vectors is evaluated with the χ² test statistic.
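The log-polar histogram and the χ² comparison can be sketched as follows; the radial bin edges here (placed relative to the mean distance to the other points) are an illustrative normalisation choice, not the exact one of [BMP00]:

```python
import numpy as np

def shape_context(points, index, n_r=5, n_theta=12):
    """Log-polar histogram (5 log-radius x 12 angle bins, as in Fig. 3.6)
    of the other edge points relative to points[index]."""
    p = points[index]
    others = np.delete(points, index, axis=0) - p
    r = np.hypot(others[:, 0], others[:, 1])
    theta = (np.arctan2(others[:, 1], others[:, 0]) + 2 * np.pi) % (2 * np.pi)
    # log-spaced radius edges relative to the mean pairwise distance
    r_edges = np.logspace(np.log10(r.mean() / 8),
                          np.log10(r.mean() * 2), n_r + 1)
    r_bin = np.clip(np.searchsorted(r_edges, r) - 1, 0, n_r - 1)
    t_bin = (theta / (2 * np.pi / n_theta)).astype(int) % n_theta
    hist = np.zeros((n_r, n_theta))
    np.add.at(hist, (r_bin, t_bin), 1)
    return hist.ravel() / hist.sum()           # normalised 60-bin descriptor

def chi2_distance(h1, h2):
    """Chi-squared statistic between two normalised histograms."""
    denom = h1 + h2
    mask = denom > 0
    return 0.5 * np.sum((h1[mask] - h2[mask]) ** 2 / denom[mask])
```

Two identical histograms have χ² distance zero, and the distance grows as the point distributions diverge, which is exactly the behaviour needed to match the two "A" shapes in Fig. 3.7.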
3.3.4 LBP
The local binary pattern (LBP) operator is defined as a gray-scale invariant texture measure, derived from a general definition of texture in a local neighborhood. It has become a very powerful measure of image texture, showing excellent results in many studies. The original LBP operator labels the pixels of an image by thresholding the 3×3 neighborhood of each pixel against the value of the center pixel and considering the result as a binary number. Fig. 3.8 shows an example of the LBP calculation [HPA04]. The 256-bin histogram of the labels computed over a region can be used as a texture descriptor. Here, each bin (LBP code) can be regarded as a micro-texton. The local primitives codified by these bins include different types of curved edges, spots, flat areas, etc.; Fig. 3.9 shows some examples. The operator has important advantages, such as its robustness to any monotonic gray-level change and its computational simplicity. The LBP operator has also been extended to neighborhoods of different sizes and to rotation. Here we select the original 256-bin LBP operator, though we use it in a different manner.
Figure 3.8: An example of LBP computing [HPA04].
Figure 3.9: Examples of texture primitives which can be detected by LBP [HPA04].
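The 3×3 operator of Fig. 3.8 can be written down directly; the bit ordering below (clockwise from the top-left neighbour) is one convention among several, so codes may differ from other implementations by a rotation of the bits:

```python
import numpy as np

# neighbour offsets, clockwise from top-left, paired with increasing bit weight
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(patch3x3):
    """LBP code of the centre pixel of a 3x3 patch: each neighbour that is
    >= the centre contributes one bit, giving a value in [0, 255]."""
    c = patch3x3[1, 1]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if patch3x3[1 + dy, 1 + dx] >= c:
            code |= 1 << bit
    return code

def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels of a region."""
    hist = np.zeros(256, dtype=int)
    for y in range(1, image.shape[0] - 1):
        for x in range(1, image.shape[1] - 1):
            hist[lbp_code(image[y - 1:y + 2, x - 1:x + 2])] += 1
    return hist
```

A flat region maps every pixel to code 255 (all neighbours equal the centre), while an isolated bright centre pixel maps to 0, illustrating the operator's invariance to monotonic gray-level changes.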
Generally, the face image is gridded and divided into small sub-regions of equal size. Histograms are computed for the sub-regions and concatenated into a single, spatially enhanced feature histogram. In the dissimilarity measure between histograms, weights are set for the different sub-regions. In our approach, as the facial organs have been detected as described in section 3.2.2, we propose a new mask for the division. Our division is derived from the difference images obtained in section 3.2.2, and it considers the natural distribution of the facial organs. For the normalized face image of resolution (l × l), the layout of the mask and the 8 blocks is shown in Fig. 3.10, and their positions are determined as follows:
1. Left block of left eye: (0, 0) to (3/16 l, 3/8 l)
2. Right block of left eye: (3/16 l, 0) to (3/8 l, 3/8 l)
3. Left block of right eye: (5/8 l, 0) to (13/16 l, 3/8 l)
4. Right block of right eye: (13/16 l, 0) to (l, 3/8 l)
5. Left block of nose: (1/4 l, 3/8 l) to (1/2 l, 3/4 l)
6. Right block of nose: (1/2 l, 3/8 l) to (3/4 l, 3/4 l)
7. Left block of mouth: (1/8 l, 3/4 l) to (1/2 l, l)
8. Right block of mouth: (1/2 l, 3/4 l) to (7/8 l, l)
Figure 3.10: (Left) A face image. (Center) Identified changed parts. (Right) Masked and divided in 8 blocks.
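The block layout above can be written down directly. The function below is a hypothetical helper (its name and the integer rounding are our choices) that maps the fractions of l to pixel bounds:

```python
def face_blocks(l):
    """Pixel bounds (x0, y0, x1, y1) of the 8 blocks for an l x l face,
    following the fractions of l given in the text."""
    f = lambda num, den: num * l // den   # integer rounding is an implementation choice
    return {
        "left_eye_L":  (0,         0,        f(3, 16),  f(3, 8)),
        "left_eye_R":  (f(3, 16),  0,        f(3, 8),   f(3, 8)),
        "right_eye_L": (f(5, 8),   0,        f(13, 16), f(3, 8)),
        "right_eye_R": (f(13, 16), 0,        l,         f(3, 8)),
        "nose_L":      (f(1, 4),   f(3, 8),  f(1, 2),   f(3, 4)),
        "nose_R":      (f(1, 2),   f(3, 8),  f(3, 4),   f(3, 4)),
        "mouth_L":     (f(1, 8),   f(3, 4),  f(1, 2),   l),
        "mouth_R":     (f(1, 2),   f(3, 4),  f(7, 8),   l),
    }
```

For example, with l = 64 the left block of the left eye spans pixels (0, 0) to (12, 24).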
The LBP histogram is calculated in each block. The histograms of the 8 blocks are then
concatenated into a single feature histogram containing K bins (in our case K = 2048,
as shown in Fig. 3.11). Compared to other methods, our way of division has important
properties:
1. Each block normally contains half of a facial component.
2. It includes the essential parts of the face that change when an expression passes
across it.
3. It adapts to different expressions, because the degree of change varies across
expressions.
Figure 3.11: The face image in Fig. 3.10 is represented by a concatenation of 8 local LBP histograms.
3.3.5 Gabor features
A Gabor filter is a linear filter used in computer vision for various tasks such as edge
detection or texture description. Frequency and orientation are the two parameters of a
Gabor filter, and its representations are reported to be similar to those of the human visual
system. It has been found to be particularly appropriate for texture representation and
discrimination. In the area of facial expression recognition, it is one of the most popular
descriptors.
We also use Gabor descriptors, which are robust with respect to illumination
variations, scaling, translation, and distortion. Using the code provided by Zhu et
al. [ZVM04], a Gabor jet with 40 coefficients at 5 different scales and 8 orientations is
computed and stored at pixel locations of the gray-level images from Section 2.1. These
jets are clustered by k-NN into a vocabulary of size 100. The histogram based on this
vocabulary is generated to represent each expressional face.
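For illustration, a standard real-valued Gabor kernel and a 5-scale, 8-orientation bank can be built as below. The parameter values (kernel size, wavelengths, bandwidths) are illustrative assumptions, not those of the code of [ZVM04]:

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    """Real part of a standard 2-D Gabor kernel: a sinusoidal carrier of the
    given wavelength/orientation under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)       # rotate coordinates by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

# A jet of 5 scales x 8 orientations, as in the text (40 coefficients per pixel).
bank = [gabor_kernel(31, 4 * 2**(s / 2), o * np.pi / 8, 2 * 2**(s / 2))
        for s in range(5) for o in range(8)]
```

Convolving an image with each kernel of the bank yields the 40 jet coefficients at every pixel.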
3.4 Features for Dynamic Schemes
3.4.1 Introduction to Temporal Texture Analysis
The usage of specific dynamic textures (DTs) is one of the most popular techniques in
computational intelligence. It is often used to describe real-world image sequences with
a certain form of regularity.
Normally there are two main categories of techniques: correlation-based and differential.
Optical flow is the standard method for computing pixel motion. Normally only gray-level
intensity values are kept to avoid unnecessary complications.
3.4.2 Dynamic Deformation for Facial Expression
Figure 3.12: The cues of facial movements [Bas79].
As Fig. 3.12 suggests, measuring deformations directly on frontal faces is difficult
and sometimes very confusing. Each component (facial organs like eyes, mouth)
also varies in size, shape and color due to individual differences. As a solution, [Bas79]
suggested that face motion would allow expressions to be identified even with minimal
information about the spatial arrangement of features. His pioneering work reported that
facial expressions were more accurately recognized from dynamic images than from a
single static image. Since then, various approaches have been applied to capture the temporal
information from a neutral face to an expressive one. [PK09] used motion magnification to
exaggerate these movements regarding subtle expressions, [XLC08] observed the pixel-
based temporal pattern on sequences that are normalized to a fixed length, and [ZP09]
extended the feature extraction from traditional ones on static image to spatiotemporal
domain. The authors proposed to use LBP on three orthogonal planes, called LBP-TOP:
for the three axes X, Y and T, an LBP histogram is extracted from the XY, XT and YT slices.
However, the appearance and characteristics of the XY, XT and YT planes are very different,
as shown in Fig. 3.13.
Figure 3.13: Left: XY(front face); Center: YT slice; Right: XT slice.
LBP can be successfully used for the frontal face on the XY plane, but it is not appropriate
to capture the movements, because time flows in only one direction, not four or eight.
The texture on the XT and YT slices, which comes from the dynamic deformation of facial
components, shows a different orientational tendency. For the same expression performed
by different subjects, it can be noticed that the YT slices have a similar texture. This similarity
comes from the uniform movement orientation of the facial components in the vertical direction
for different subjects during the same expression. After observing the characteristics of
the XT and YT planes, no regular and systematic texture is found on the XT planes at the same
Y position. This is why we select the YT plane, though one can note that we might miss the
horizontal movements of the facial muscles.
Figure 3.14: The dynamic deformation for different expressions on vertical slices.
Inspired by these works, and after observing the repeatable texture for different
expressions as in Fig. 3.14, we propose to extract the dynamic deformation on vertical
slices. Let S = {I1, I2, . . . , IT} be a temporally ordered face sequence, where T is the
number of frames in S and where each Ii has a fixed resolution n×m. In order to extract the
dynamic information related to facial expressions, for each value of x, with 1 ≤ x ≤ n,
we decompose S to n spatiotemporal slices Px as in Fig. 3.15.
Moreover, as we want to track the texture (VTB) and shape (moments) changes of
different facial components, we separate each slice of height m into three blocks corresponding
to the three main components with different heights mk: eyes (m1 = 3/8 m), nose
(m2 = 1/4 m) and mouth (m3 = 3/8 m). In the time axis, we track the changes not from
the whole sequence but from overlapped subsequences. Each subsequence includes τs
frames. After this, the two textons are extracted from spatiotemporal planes to describe
dynamic deformation.
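The decomposition of S into the n spatiotemporal slices Px can be sketched as follows, assuming the sequence is stored as an array of shape (T, m, n):

```python
import numpy as np

def vertical_slices(S):
    """Decompose a sequence S of shape (T, m, n) into n spatiotemporal
    YT slices P_x of shape (m, T), one per column x of the frames."""
    T, m, n = S.shape
    return [S[:, :, x].T for x in range(n)]
```

Each slice stacks the same column of pixels across all frames, so vertical facial movements appear as oriented texture on it.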
3.4.3 VTB Descriptor for Dynamic Feature
The VTB pattern descriptor is a gray-level invariant texture primitive statistic, designed
to be used on the spatiotemporal planes extracted in the previous section.
Figure 3.15: 3 blocks on VT plane.
The operator is related to a sequence of three consecutive images, i.e. τs = 3. In the third image,
each pixel is used to threshold its two backward neighboring pixels with the same coordinates.
Similar to the original LBP operator, we build the descriptor on a 3×3 neighborhood as in
Fig. 3.16. The binary result labels the middle-right pixel ('6' in Fig. 3.16).
Figure 3.16: VTB computing.
For each pixel p with coordinates (x, y) and gray value g_{x,y} in an image I_t at time t,
the binary code is produced as in Eq. 3.5:

VTB = s(g_{x,y-1,t-2} - g_{x,y-1,t}) 2^5
    + s(g_{x,y,t-2} - g_{x,y,t}) 2^4
    + s(g_{x,y+1,t-2} - g_{x,y+1,t}) 2^3
    + s(g_{x,y+1,t-1} - g_{x,y+1,t}) 2^2
    + s(g_{x,y,t-1} - g_{x,y,t}) 2^1
    + s(g_{x,y-1,t-1} - g_{x,y-1,t}) 2^0     (3.5)

where s(x) = 1 if x > 0, and s(x) = 0 if x ≤ 0.
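Eq. 3.5 translates directly into code. The sketch below assumes a slice is indexed as slice[y, t]; the function name is ours:

```python
import numpy as np

def s(x):
    """Sign threshold of Eq. 3.5: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0

def vtb_code(g, y, t):
    """VTB label of pixel (y, t) on a spatiotemporal slice g (Eq. 3.5):
    the pixel at time t thresholds its neighbors at t-2 and t-1
    in rows y-1, y, y+1, giving a 6-bit code (0..63)."""
    return (s(g[y - 1, t - 2] - g[y - 1, t]) * 2**5
          + s(g[y,     t - 2] - g[y,     t]) * 2**4
          + s(g[y + 1, t - 2] - g[y + 1, t]) * 2**3
          + s(g[y + 1, t - 1] - g[y + 1, t]) * 2**2
          + s(g[y,     t - 1] - g[y,     t]) * 2**1
          + s(g[y - 1, t - 1] - g[y - 1, t]) * 2**0)
```

A constant slice yields code 0, while gray levels decreasing over time set all six bits.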
In facial expression, most facial part movements are vertically oriented and the main
directions can be up or down for different organs. For example, an expression of surprise
is related to eyes moving up and mouth moving down. Therefore we divide the VT
plane into three blocks: eye, nose and mouth regions, as in Fig. 3.15. Finally, a total
vector of 2^6 × 3 = 192 bins is used to represent the movements. Now we have 10 blocks
from appearance and 3 blocks from motion. These histograms are concatenated into a
single one. Such a representation of an image is obtained from the image itself plus
two previous reference images. This extraction can be done per image in the sequence,
except for the first two images. The vectors obtained from LBP+VTB will be used to
identify sequences.
3.4.4 Moments on Spatiotemporal Plane
Coming from physics, an image moment is a particular weighted average of the image
pixels' intensities and can be used as an effective descriptor of global shape. Given a
gray-level image with pixel density I(x, y), the image moments Mp,q are calculated by

M_{p,q} = ∑_x ∑_y x^p y^q I(x, y)     (3.6)
In particular, some simple properties can be derived from the moments, such as:
1. M00: the total gray-level mass of the block
2. M10/M00, M01/M00: the centroid of the gray-level block
Higher order moments can also be derived for shape features. In order to avoid the
curse of dimensionality, we only use three values (M00, M10/M00, M01/M00) for each of
the three blocks on one slice. The moments Mp,q(x, i, k) of each block at position x of
frame Ii with window size τs are calculated on slice Px as follows:

M_{p,q}(x, i, k) = ∑_{y=1}^{m_k} ∑_{t=i-τ_s}^{i} y^p t^q P_x(y, t)     (3.7)
where mk is the height of current block for 1 ≤ k ≤ 3 as in Fig. 3.15.
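A minimal sketch of Eq. 3.7 for p, q ∈ {0, 1} is given below, assuming the temporal window covers the current column and its (τs − 1) predecessors; the exact window convention and the function name are our assumptions:

```python
import numpy as np

def slice_moments(P, i, tau, m_blocks):
    """Per-block M00 and centroid (M10/M00, M01/M00) on a YT slice P of
    shape (m, T), over the temporal window ending at column i (Eq. 3.7)."""
    feats = []
    y0 = 0
    for mk in m_blocks:                      # block heights m1, m2, m3
        block = P[y0:y0 + mk, i - tau + 1:i + 1]
        ys, ts = np.mgrid[1:mk + 1, 1:tau + 1]
        m00 = block.sum()                    # total mass of the block
        m10 = (ys * block).sum()             # p = 1, q = 0
        m01 = (ts * block).sum()             # p = 0, q = 1
        feats += [m00, m10 / m00, m01 / m00]
        y0 += mk
    return feats
```

With three blocks this yields the 3 × 3 = 9 values per slice used in the feature vector.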
For each image Ii, moments are extracted for all values of x, from the current image Ii
and its (τs − 1) previous images in the sequence. These values are combined into a feature
vector for the current frame. This vector will be used to classify the current image into
one of the six basic expressions plus the neutral expression. The first (τs − 1) frames yield no
feature vector. Furthermore, the probabilities that the current image belongs to each of
the seven facial expressions are recorded to predict the possible facial expression in the
whole sequence.
3.5 Conclusion
In this chapter, various features used for the recognition of objects and facial expressions
were introduced and summarized. Two new spatio-temporal descriptors, VTB and
moments on YT planes, were also proposed for dynamic schemes.
In Chapter 5, devoted to tests and results, we will apply these features on benchmark
databases and show their efficiency. Indeed, in facial expression recognition, they make
it possible to combine the existing static solutions with new temporal features to describe
the characteristics of local facial regions.
Furthermore, in the next chapter, we will focus on the classification methods used to build
proper models from these feature vectors and their combinations.
Chapter 4
Recognition and Classification Methods
Contents
4.1 Overview to Machine Learning Methods
4.2 Discriminative Model
4.2.1 Introduction
4.2.2 AdaBoost Method
4.2.3 Support Vector Machine
4.3 Generative Model
4.3.1 Introduction
4.3.2 BoW and Naïve Bayes Implementation
4.3.3 Hierarchical Generative Model
4.3.4 Construction of Hierarchical Dirichlet Processes
4.3.5 Inference and sampling
4.4 Hybrid System: Integrated Boosting and HDP
4.5 Conclusion
This chapter briefly presents the methods from Bayesian statistics used for categorization in
machine vision systems and details our proposal for object classification. These
classifiers are learned through Bayes' rule and applied both to object categorization and to
the specific area of facial expression classification. Among these statistical approaches
to learning and discovery, there are two groups: generative and discriminative models.
Generative models work in a top-down manner while discriminative models are bottom-up
driven. Later in this manuscript, for the complex problem of object recognition, we
will combine the two models and build a set of essential middle-level components. For
the more specific topic of facial expression, the powerful discriminative models are used.
The chapter begins with an overview of learning algorithms based on Bayes rule.
Then, discriminative and generative classifiers are presented. Finally, we present a
hybrid system combining discriminative and generative models.
Parts of this chapter were published in an international conference paper [JIB09].
Readers could also refer to the paper of Teh et al.[TJBB06] for the details about hierar-
chical Dirichlet process used for text modeling for documents.
4.1 Overview to Machine Learning Methods
The recurring problem in object recognition is the need to classify a new observation
into a limited number of categories. In other words, a new image or video sequence
is processed by model-based clustering and a decision is made about which category
it is most likely to belong to. These techniques are not limited to computer vision but
are much more widely applicable to making accurate predictions about complex real-world
phenomena, for example in speech recognition, text classification and bioinformatics
[Jeb04]. The learning and inference approaches fall into two major categories:
generative and discriminative models. Generative models specify a full structured joint
probability distribution over the observed data. The models in this category often rely on
graphical models based on Bayesian reasoning. On the other hand, discriminative models
rely on the conditional probability distribution over the examples. The largest possible
margin separating the classes is optimized by adjusting the parameters of the classifiers.
These methods can assign available labels to new observations, but may also introduce
the problem of over-fitting.
The formulation involves estimating f : X → Y, or P(Y|X). For discriminative classifiers,
some functional form for P(Y|X) is assumed; its parameters are then estimated
directly from training data to maximize the margins separating the different classes.
For generative classifiers, some functional forms for P(X|Y) and P(X) are assumed;
their parameters are estimated directly from training data, and Bayes' rule is then used
to compute the prediction P(Y|X = x_i).
These two powerful paradigms are the main techniques of pattern recognition, artificial
intelligence and perception systems. The discriminative approaches, though empirical,
can usually provide superior performance. Yet various probability models from
generative methods can reflect prior knowledge about the practical domains. We will
choose the proper method for each particular task, or fuse the two frameworks to
combine the complementary powers of the two groups of approaches.
4.2 Discriminative Model
4.2.1 Introduction
One of the traditional problems in machine learning is measuring the similarities
between query samples and training samples. Discriminative models are empirically
successful at solving this problem [Nal04]. This category of models is relatively simple
because discriminative algorithms do not consider the joint distribution but directly
optimize the conditional probability distribution, as noted in Section 4.1.
Here the classification can be defined as the problem of choosing a class C for an
example with a feature vector x, which is obtained from Chapter 3. In the learning stage,
the classifier learns the parameters of discriminant function from the labeled training
data. In the testing stage, the ideal models will map or rank the feature vector x into its
correct class through the discriminative function. As concluded in [UB05], discriminative
models have the following advantages:
1. Discriminative models are flexible when the training data differ significantly;
2. In making predictions for testing samples, discriminative models are typically very
fast, while generative models often require iterative operations;
3. Normally, discriminative methods would provide better predictive performance
since they are trained to predict the class label rather than the joint distribution of
input vectors and targets.
Owing to their robustness and relatively simple nature, discriminative models (eg.
Maximum Entropy, Linear discriminant analysis, Support Vector Machines and Ad-
aBoost) have been preferred in many domains.
4.2.2 AdaBoost Method
AdaBoost, which stands for Adaptive Boosting, is an algorithm formulated by [FS95]. It
aims at constructing a strong classifier as linear combination:
f(x) = ∑_{t=1}^{T} α_t h_t(x)     (4.1)
where ht(x) : X → {−1,+1} are weak classifiers and αt are the weights associated
with each weak classifier or feature. Initially, the distribution over the training samples is
set to be uniform. For a sequence of N labeled training samples, at each iteration t, the weak
learner tries to find a hypothesis ht(x) which is consistent with most of the samples (i.e.,
ht(xi) = yi for most 1 ≤ i ≤ N), with small error. Using the new hypothesis ht(x),
the algorithm generates a new distribution of sample weights, and this process repeats T times.
The final strong hypothesis h_f combines the outputs of the T weak
hypotheses by a weighted majority vote. AdaBoost can also be proved to maximize the
margin, because it chooses the ht(x) with minimal error in each iteration.
Through this boosting algorithm, AdaBoost provides some interesting merits for
machine learning:
1. It produces a strong and complex classifier from relatively simple weak classifiers;
AdaBoost is capable of reducing both the bias (e.g. of stumps) and the variance
(e.g. of trees) of the weak classifiers.
2. It reduces the weakness of a single weak classifier, such as the bias of stumps.
3. It selects the most relevant features by evaluating the empirical error.
4. It is able to obtain a maximal margin among the group of discriminative methods.
5. It generates a series of cascaded classifiers where the number of iterations T can be
decided by the user.
In general, a hypothesis which is accurate on the training set might not be accurate
on examples outside the training set. This problem is usually referred to as over-fitting.
Often, however, overfitting can be avoided by restricting the hypothesis to be simple
[FS95].
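The boosting loop above can be sketched with decision stumps as weak classifiers. This is an illustrative toy implementation, not the cascaded detector of [VJ01]:

```python
import numpy as np

def adaboost_train(X, y, T=10):
    """Minimal AdaBoost with 1-D threshold stumps as weak learners.
    y in {-1, +1}; returns a list of (alpha, feature, threshold, polarity)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # uniform initial sample distribution
    model = []
    for _ in range(T):
        best = None
        for f in range(X.shape[1]):         # exhaustively search stumps
            for thr in np.unique(X[:, f]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, f] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, pred)
        err, f, thr, pol, pred = best
        err = max(err, 1e-10)               # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak learner
        w *= np.exp(-alpha * y * pred)          # re-weight the samples
        w /= w.sum()
        model.append((alpha, f, thr, pol))
    return model

def adaboost_predict(model, X):
    score = sum(a * p * np.where(X[:, f] > t, 1, -1) for a, f, t, p in model)
    return np.sign(score)
```

The final classifier is the sign of the weighted vote of the T stumps, as in Eq. 4.1.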
In our study, the AdaBoost algorithm performs well for the selection of essential features and
components, as demonstrated in Section 4.4 and Chapter 5. It is also successfully used
in face detection by [VJ01]. Their method selects a small number of critical Haar features
from a larger set and yields cascaded classifiers. Their multi-layer detector is one of the
most popular face detectors and is implemented in OpenCV [Bra00].
4.2.3 Support Vector Machine
In the area of machine vision, SVM (Support Vector Machine) often produces state of
the art classification performance [BGV92, Jeb04]. This group of algorithms are based on
the statistical learning theory and the Vapnik-Chervonenkis (VC) dimension introduced
by [CV95]. Intuitively, for the two-group classification problems, SVM conceptually
implements the following idea: input vectors are non-linearly mapped to a very high-
dimension feature space. Given the training data, this procedure generates a model to
predict the target values of the test data. Later, SVM evolved to handle the multi-class
problem by reducing it to multiple binary classification problems, each of which
yields a binary classifier. In implementation, the efficiency of SVM depends
on the choice of kernel, such as linear, polynomial or RBF (Radial Basis Function).
Based on these kernels, several applications (e.g. WEKA by [HFH∗09] in Java, LIBSVM
by [CL01] in C++) have been developed to obtain acceptable results rapidly.
For formalization in the two-category case, a set of training data is provided
as D = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ {1, −1}}, i = 1, . . . , l, where a data point x_i is viewed as an
n-dimensional vector (a list of n real numbers) and y_i is either 1 or −1 to indicate the
class to which the point x_i belongs. The target is to separate these points as clearly as
possible with a hyperplane whose distance from the nearest data points on each side is
maximized. To identify the hyperplane, SVM introduces a normal vector w which is
perpendicular to the hyperplane, and the parameter b/‖w‖ which determines the offset of
the hyperplane from the origin along the normal vector w (shown in Fig. 4.1). For the
possibly misclassified samples, a set of non-negative slack variables ξ_i is introduced as

y_i(w · x_i + b) ≥ 1 − ξ_i     (4.2)

where 1 ≤ i ≤ l and ξ_i ≥ 0. The training vectors x_i are mapped into a higher dimensional
space by a function φ. Then SVM tries to find a linear separating hyperplane with
the maximal margin in this higher dimensional space. With φ, (4.2) becomes (4.3):

y_i(w · φ(x_i) + b) ≥ 1 − ξ_i     (4.3)
The optimization problem becomes to find a large margin and a small error penalty
Figure 4.1: Separating hyperplanes in the linear case [Bur98].
as:

min_{w,b,ξ}  (1/2) w^T w + C ∑_{i=1}^{l} ξ_i     (4.4)

where C is the penalty parameter of the error term. The constraint equations are
multiplied by positive Lagrange multipliers and subtracted from the objective function
to form the Lagrangian formulation. One can then introduce the so-called kernel function:
K(x_i, x_j) ≡ φ(x_i)^T φ(x_j)     (4.5)
Many kernel functions have been proposed by researchers in machine learning (e.g. cloud
basis functions by [DSRDS08]). The basic common ones are:
1. Linear: K(x_i, x_j) = x_i^T x_j
2. Non-linear:
(a) Polynomial (homogeneous): K(x_i, x_j) = (x_i^T x_j)^d
(b) Polynomial (inhomogeneous): K(x_i, x_j) = (γ x_i^T x_j + r)^d, where γ > 0
(c) Radial Basis Function (RBF): K(x_i, x_j) = exp(−γ ‖x_i − x_j‖^2), where γ > 0
(d) Sigmoid (hyperbolic tangent): K(x_i, x_j) = tanh(γ x_i^T x_j + r)
Here, γ, r and d are kernel parameters. To improve the efficiency of SVM, different sets
of (C, γ, d, r) values should be tuned, and the one with the best cross-validation accuracy
is used to train on the whole training set D. To show the effect of different kernels, the
upper graphs of Fig. 4.2 [Bur98] show two examples of a two-class pattern recognition
problem, one separable and one not. The two classes are denoted by circles and disks
respectively. Support vectors are identified with an extra circle. For these machines, the
support vectors are the critical elements of the training set because they lie closest to the
decision boundary. The error in the non-separable case is identified with a cross. In the
lower part of Fig. 4.2, the kernel was chosen to be a cubic polynomial (degree 3).
For the linearly separable case (lower left), the solution is still roughly linear, while the
linearly non-separable case (lower right) has become separable.
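The common kernels listed above can be written directly; the parameter defaults below are arbitrary illustrative choices:

```python
import numpy as np

def linear(xi, xj):
    """K(xi, xj) = xi^T xj"""
    return xi @ xj

def polynomial(xi, xj, gamma=1.0, r=1.0, d=3):
    """Inhomogeneous polynomial kernel: (gamma * xi^T xj + r)^d"""
    return (gamma * (xi @ xj) + r) ** d

def rbf(xi, xj, gamma=0.5):
    """RBF kernel: exp(-gamma * ||xi - xj||^2)"""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def sigmoid(xi, xj, gamma=0.5, r=0.0):
    """Sigmoid kernel: tanh(gamma * xi^T xj + r)"""
    return np.tanh(gamma * (xi @ xj) + r)
```

Each function maps a pair of input vectors to the dot product φ(x_i)^T φ(x_j) in the induced feature space, as in Eq. 4.5.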
To use the trained model on a new data point x, an SVM computes the dot products of
the test point x with w, or more specifically the sign of

f(x) = ∑_{i=1}^{N_s} α_i y_i φ(s_i) · φ(x) + b = ∑_{i=1}^{N_s} α_i y_i K(s_i, x) + b     (4.6)

where N_s is the number of support vectors s_i and the α_i are the Lagrange multipliers of
the training points. In this solution, the points for which α_i > 0 are called support vectors,
and they lie on one of the margin hyperplanes of (4.2).
SVM is originally a binary classification method. To extend it to multiclass SVM,
the most common methods are built on the one-versus-all and the one-versus-one
strategies [HL02]. The one-versus-all method uses one binary classifier to separate the
current class from all the remaining classes, and the winner takes all. The one-versus-one
method, also called the pairwise method, builds a binary classifier for each pair of classes
and then uses max-voting for the final decision. Still, the efficiency of multiclass SVM is
influenced by the characteristics of the datasets in practical use. The empirical study by
[bDK05] showed that the pairwise method proposed by [HT98] is highly recommended.
We will use it as the kernel discriminant method for solving multiclass problems in Chapter 5.
Figure 4.2: Examples of SVM kernels [Bur98].
4.3 Generative Model
4.3.1 Introduction
In the area of machine learning, generative approaches use the joint distribution
of the observable data. This group of methods is often cast as probabilistic graphical
models [Jeb04]. These methods provide a rich framework for imposing structure and prior
knowledge to estimate models from available observations or training data. In contrast
to the empirical nature of discriminative models, generative models can prove
informative in understanding the form of the probability distribution represented by
the model [Bis06].
Suppose a testing image (object or face) I∗ is described by a vector X consisting of
features extracted from it. The trained model will make a decision to assign I∗
to one of C classes, c = 1, . . . , C, or to a new class C + 1. Generative approaches
introduce the joint distribution p(c, X). For learning, Bayes' theorem is applied using
the prior probability p(c) and the class-conditional densities p(X|c) as

p(c|X) = p(X|c) p(c) / ∑_{j=1}^{C} p(X|j) p(j)     (4.7)
This category of models, such as Naïve Bayes, Hidden Markov Models (HMM) and
Mixtures of Gaussians, has become a prominent tool in the computer vision domain,
especially for object recognition and feature extraction, because these applications benefit
greatly from probabilistic methods that estimate the statistical relationships between
images and features. For the generative modeling framework, as mentioned in [UB05],
the relative merits can be summarized as:
1. It can handle noisy data such as missing or partially labeled ones.
2. A new class c + 1 can be added incrementally by learning its class-conditional
density p(X|c + 1) independently of all the previous classes.
3. It can readily handle compositeness and proportion.
Normally, based on prior knowledge, generative models learn the parameters from
training data to maximize the data likelihood. This group of models are robust and
with high accuracy, but the inference and classification speed are much slower than
discriminative models. In the following sections, we will apply it in general object
recognition based on Bag of Words (BoW) and Hierarchical Dirichlet Processes (HDP).
4.3.2 BoW and Naïve Bayes Implementation
The Bag of Words (BoW), or bag of features, model is originally a very popular method in
natural language processing (NLP). It represents a document as an unordered collection
of words, disregarding grammar and even word order [Lew98]. This dictionary-based
method has achieved great success in areas such as spam filtering, search engine design and
semantic analysis. Recently it has also been extended to computer vision, especially object
categorization (e.g. [LP05, SRE∗05b]).
When migrating from the text domain to the image domain, some basic elements of the
BoW model change as well. Words, which are much more clearly defined in text processing,
are now replaced by feature vectors extracted from local patches (regions), and the
traditional dictionary evolves into a "codebook". Normally these visual codebooks are
generated by K-means clustering over all the training vectors. Contrary to a natural
language dictionary, the size of the codebook (the number of available words) is flexible.
In this visual dictionary,
each word is the center of a group of similar feature vectors. A feature vector (e.g.
SIFT in Section 3.3.2) is then mapped to its corresponding visual word in the codebook,
and the image can be abstractly represented by a histogram of visual words.
To formalize the method, given a training set with J images, a feature vector t_j
is used to denote the label associated with each image j, where t_jc ∈ {0, 1}, c = 1, . . . , C,
and j = 1, . . . , J. Here, the class label is related to the existence of objects and not
directly to images: 0 means that the class is absent from image j and 1 means
that it is present. Each image j is represented by a feature vector X_j which consists
of I_w components, where the ith visual word instance x_ji in image j is a draw from a
distribution (or histogram) F(θ_ji) for association to a visual word v_ji belonging
to a vocabulary of size W. Here, x_ji is an extracted feature such as a color, shape or texture
descriptor as in Chapter 3. For a codebook V = {v_1, v_2, . . . , v_W}, the Naïve Bayes model is
used. Based on training data, we wish to maximize the likelihood by learning the latent
variables. After this inference stage, for the testing image I∗, the class c∗ is decided by
the probability as

c∗ = argmax_c p(c|V∗) = argmax_c p(c) p(V∗|c) = argmax_c p(c) ∏_{w=1}^{W} p(V∗_w|c)     (4.8)
Given all the features extracted from testing image I∗, the posterior probability of
class C(I∗) can be found by marginalizing out all the features. This basic model is
later extended to hierarchical Bayesian models such as pLSA (probabilistic latent semantic
analysis [Hof99]) and LDA (latent Dirichlet allocation [BNJ03]). In order to exploit the
non-parametric properties of generative models, we apply HDP (Hierarchical Dirichlet
Processes) in our learning system in the following sections. Our target is to construct
the set of middle-level components for general object categories. The number of components
is unknown and will be inferred iteratively. The non-parametric solution is naturally
more appropriate in this case.
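A minimal sketch of the BoW + Naïve Bayes classifier of Eq. 4.8 is given below, assuming the visual-word histograms have already been computed; the Laplace smoothing and the function names are our assumptions:

```python
import numpy as np

def train_nb(histograms, labels, n_classes, alpha=1.0):
    """Multinomial Naive Bayes over visual-word histograms.
    histograms: (J, W) array of visual-word counts per image;
    alpha: Laplace smoothing constant (an assumption, not from the text)."""
    W = histograms.shape[1]
    log_prior = np.zeros(n_classes)
    log_word = np.zeros((n_classes, W))
    for c in range(n_classes):
        Xc = histograms[labels == c]
        log_prior[c] = np.log(len(Xc) / len(histograms))    # p(c)
        counts = Xc.sum(axis=0) + alpha                     # smoothed word counts
        log_word[c] = np.log(counts / counts.sum())         # p(v_w | c)
    return log_prior, log_word

def classify_nb(model, hist):
    """argmax_c log p(c) + sum_w hist[w] * log p(v_w | c), as in Eq. 4.8."""
    log_prior, log_word = model
    return int(np.argmax(log_prior + (log_word * hist).sum(axis=1)))
```

Working in log space avoids underflow when the product over W words is large.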
4.3.3 Hierarchical Generative Model
To explain the method which we use for object recognition, we adopt an extended
Chinese Restaurant Franchise (CRF) [TJBB06] as a metaphor for the hierarchical Dirichlet
process used here. Suppose there are multiple restaurants, for instance Chinese, Japanese,
French and Italian, with a menu shared across the restaurants. Some dishes are shared:
fried rice, for example, among the Asian restaurants, and noodles possibly between the
Chinese and Italian ones. At each table of each restaurant, one dish
is ordered from the menu and shared among all customers who sit at that table. A
new customer will tend to select a table where customers with a similar cultural or
professional background are sitting. From all the dishes on the tables and the customers'
seating pattern, we try to find the most prominent dishes, for example Peking duck for
Chinese cuisine or foie gras for French cuisine. Thus, from the dishes, we can judge the
dominant flavor of the restaurant, just as we identify the main object in an image. Here the
restaurants correspond to images, the tables correspond to latent mixture components, and the
customers correspond to the visual words.
4.3.4 Construction of Hierarchical Dirichlet Processes
We consider a category which includes multiple images and each image can be modeled
as a mixture with different mixing proportions using shared components. Each compo-
nent is a mixture of visual words with different mixing proportions. The components
and the number of components will be inferred from training data.
The graphical representation of the model is shown in Fig. 4.3 [TJBB06]. The global
probability measure G0 is distributed as a Dirichlet process with hyperparameters γ
and H, i.e. G0|γ, H ∼ DP(γ, H), denoted by the stick-breaking construction as:

G0 = ∑_{k=1}^{∞} β_k δ_{φ_k}     (4.9)
where the β_k are the global mixing proportions, δ_{φ_k} is an atom at φ_k, and the φ_k denote the
random variables distributed according to H, i.e. the components shared among images.
Figure 4.3: HDP model.
The random measures G_j are also Dirichlet processes, G_j|α0, G0 ∼ DP(α0, G0), and can be
written as

G_j = ∑_{k=1}^{∞} π_{jk} δ_{φ_k}     (4.10)
where the π_{jk} are the mixing proportions for image j, with hyperparameter α0.
The base measure H provides the prior distribution for the parameter θji, which is the factor corresponding to a single observation of visual word xji in image j. The stick-breaking construction provides the prior distribution of G0 as the global random measure:

β | γ ∼ GEM(γ),   πj | α0, β ∼ DP(α0, β),   φk | H ∼ H    (4.11)
where

βk = β′k ∏_{l=1}^{k−1} (1 − β′l),   β′k ∼ Beta(1, γ)    (4.12)

and

πj = (πjk)_{k=1}^{∞},   πjk = π′jk ∏_{l=1}^{k−1} (1 − π′jl),   π′jk ∼ Beta(α0 βk, α0 (1 − ∑_{l=1}^{k} βl))    (4.13)
Here GEM denotes the so-called Griffiths-Engen-McCloskey distribution, a probability law for sequences arising as a residual allocation model (RAM) [GK01]. It is a popular partition model with many remarkable properties; in particular, it is among the most analytically tractable cases.
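As an illustration of Eq. 4.12, the stick-breaking weights can be simulated by a truncated sample (a minimal sketch; the function name and truncation level are our own choices, not part of the model):

```python
import numpy as np

def stick_breaking(gamma, truncation, rng):
    """Draw a truncated sample of the global weights beta ~ GEM(gamma).

    Each beta'_k ~ Beta(1, gamma); beta_k = beta'_k * prod_{l<k} (1 - beta'_l).
    """
    beta_prime = rng.beta(1.0, gamma, size=truncation)
    # Length of stick remaining before breaking off piece k.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta_prime)[:-1]))
    return beta_prime * remaining

rng = np.random.default_rng(0)
beta = stick_breaking(gamma=1.0, truncation=50, rng=rng)
# The weights are non-negative and sum to at most 1; the remaining
# mass lies in the truncated tail.
```

For moderate γ, increasing the truncation level makes the leftover tail mass negligible.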
4.3.5 Inference and sampling
Figure 4.4: The mixture of components.
Based on this prior, HDP uses straightforward Gibbs sampling [TJBB06] within a Markov chain Monte Carlo (MCMC) algorithm for posterior sampling, in order to learn the parameters of the mixture model. As in Fig. 4.4, the ith observation xji in image j is a draw from a distribution F(θji) and is associated with a visual word vji belonging to a vocabulary of size W. vji is associated with a component instance tji, and tji is associated with a component kjt from a component pool of size K; zji denotes the index of kjt, which is a mixture of heterogeneous visual words. For better understanding, we use the restaurant metaphor of Section 4.3.3 to explain the sampling method, separating the two main steps of the hierarchical model: the selection of a table (an instance of a component) and the selection of a component (the class of the component).
Sampling instance t. Here −ji or −jt in a superscript denotes that the corresponding variable is removed from the dataset; e.g. x^{−ji} = x \ {xji} means that, for x = (xji : all j, i), we consider all the data in x except xji itself, as xji is the last customer entering the restaurant to select a table, i.e. the ith visual word selecting a component instance. The conditional distribution of the table tji is

p(tji = t | t^{−ji}, k) ∝
  n^{−ji}_{jt} · f^{−xji}_{kjt}(xji)             if t is already used
  α0 · p(xji | t^{−ji}, tji = t^{new}, k)        if t = t^{new}    (4.14)

where n^{−ji}_{jt} is the number of visual words in image j at component instance t after removing xji, and f^{−xji}_{kjt}(xji) is the conditional density of xji under component kjt given all data items except xji itself. While iterating the updates of tji, new component instances can appear, and instances that become empty are discarded. As a result, mixing components left with zero instances are deleted, while a new instance may introduce a new mixing component. In this way the model adapts the number K of shared components.
Sampling component k. Since changing the component kjt also changes the membership of all its data items, the visual words at instance t can select an old component or begin a new one; the conditional probability of the component kjt is

p(kjt = k | t, k^{−jt}) ∝
  m^{−jt}_{·k} · f^{−xjt}_{k}(xjt)    if k is already used
  γ · f^{−xjt}_{k^{new}}(xjt)         if k = k^{new}    (4.15)

where m^{−jt}_{·k} denotes the number of component instances (over all images) associated with component k, and f^{−xjt}_{k}(xjt) is the conditional density of the data xjt at instance t under component k. The number of components K is also possibly adjusted here. After the sampling iterations, we obtain the distribution of global latent components for each class.
For the inference part, as in Fig. 4.4, on the image training set in the right column, which includes M images, we consider each visual word from one image as the last customer entering the restaurant to select a table. It selects one of the component instances with a probability proportional to the conditional density of xji, or adds a new instance with a probability proportional to the hyperparameter α0. For each instance, the mixing proportions over the associated visual words change across iterations. As a result, some components in the central column become associated with zero instances and are deleted, while a new instance may introduce a new mixing component with a probability proportional to the hyperparameter γ. In this way the model adapts the set of components and its proper size K. More details about HDP are given in [TJBB06].
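The table-selection step of Eq. 4.14 can be sketched as follows (a minimal illustration with our own function and variable names; the conditional densities are assumed to be precomputed elsewhere):

```python
import random

def sample_table(table_counts, alpha0, likelihoods, new_likelihood):
    """One draw of Eq. 4.14: pick an existing table with probability
    proportional to n_jt * f(x_ji), or a new table proportional to
    alpha0 * f_new(x_ji).

    table_counts[t] : customers already seated at table t (x_ji removed)
    likelihoods[t]  : conditional density of x_ji under the dish at table t
    new_likelihood  : prior predictive density of x_ji for a new table
    Returns the table index; len(table_counts) means "open a new table".
    """
    weights = [n * f for n, f in zip(table_counts, likelihoods)]
    weights.append(alpha0 * new_likelihood)
    total = sum(weights)
    r = random.uniform(0.0, total)
    acc = 0.0
    for t, w in enumerate(weights):
        acc += w
        if r <= acc:
            return t
    return len(table_counts)
```

With α0 = 0 the customer can never open a new table, so the adaptation of K in the text hinges on the α0 (and, at the component level, γ) terms.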
4.4 Hybrid System: Integrated Boosting and HDP
We have introduced two classes of statistical frameworks used in machine learning for modeling training data and predicting the label of new input data. Among these two training frameworks, discriminative techniques are widely used since they give excellent generalization performance, while generative models can handle noisy training data and deal with potential partial overlap between different classes. Hybrid systems combining generative and discriminative models were introduced by the work of Raina et al. [RSNM03], which used naive Bayes and logistic regression to improve the classification of documents. Extending this idea to computer vision, Bosch et al. [BZMn08] investigated pLSA (probabilistic Latent Semantic Analysis) and subsequently trained a multiway classifier on the topic distribution vector of each image. In order to gain the benefits of both generative and discriminative approaches, we propose another hybrid system which first uses a generative model to extract basic building blocks at a semantic middle level, then applies a discriminative approach to select the most prominent ones.
Figure 4.5: Hybrid approach for learning
Basically, the training stage (Fig. 4.5) is based on BoostHDP (AdaBoosted Hierarchical Bayesian model). In formal notation, a set of images {I1, . . . , IJ} belongs to one category C. Each image j has Nj descriptors; each descriptor xji, where 1 ≤ i ≤ Nj, can belong to a different dictionary (i.e. color, shape, texture, orientation), and these dictionaries are clustered into a combined vocabulary V with W visual words. One image Ij can be represented by a vector Vj = (vj1, vj2, . . . , vjw, . . . , vjW). So each instance xji from the observations is associated with a visual word vjw.
First, a Dirichlet prior is built using the stick-breaking construction described in Section 4.3.4 with hyperparameter α. The posterior distribution of the mixture weights πc of the component set Z for category C is also Dirichlet, determined by the hyperparameter α and the number of observations Nk currently assigned to each component:

p(πc | z, α) = Dir(N1 + α/K, . . . , NK + α/K)    (4.16)

where Nk = ∑_{i=1}^{N} δ(zi, k) and δ(·, ·) is the Kronecker delta function.
Similarly, assuming λ is the precision of a symmetric Dirichlet prior, the posterior distribution of the mixture weights ηk of descriptors within each component is also Dirichlet, with hyperparameters determined by the number of observations Cw currently assigned to each visual word:

p(ηk | w, λ) = Dir(C1 + λ/W, . . . , CW + λ/W)    (4.17)

where Cw = ∑_{i=1}^{N} δ(wi, w) and δ(·, ·) is the Kronecker delta function.
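Both posteriors simply add the symmetric prior mass (α/K or λ/W) to the current assignment counts; as a small illustration (function name ours):

```python
def dirichlet_posterior_params(counts, precision):
    """Posterior Dirichlet parameters of Eqs. 4.16/4.17: count_k + precision/K."""
    k = len(counts)
    return [n + precision / k for n in counts]

# Component counts N_k for one category, with prior precision alpha = 2.0:
params = dirichlet_posterior_params([12, 3, 0, 5], precision=2.0)
# -> [12.5, 3.5, 0.5, 5.5]
```

Note that even empty components keep a small positive pseudo-count, which is what allows them to be re-populated in later sampling iterations.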
In iteration t, given the previous assignments Z(t−1), for the image set {I1, I2, . . . , Ij, . . . , IJ} depicting category c, where each image Ij contains Nj descriptors, we sequentially sample new assignments Z(t) as described in Algorithm 4.1.
After the sampling iterations, we obtain a set of K latent components, together with the mixture weight sets πc and ηk as the distribution for each class. For one class with Hp positive sample images (belonging to this category), each image is represented as a mixture of components, and each component comprises a set of multiple visual words (Fig. 4.4). Here we integrate the discriminative AdaBoost weak learner from Section 4.2.2 to find the components most relevant for classification, in order to handle the intra-class and inter-class variance.

So we select Hn negative sample images (not belonging to this category) and iterate multiple times as in the inference procedure (Section 4.3.5). The negative samples are also mixtures of components and visual words. For each component we construct a weak classifier hk, which consists of a component zk, a threshold θk and a parity pk. We compute the distance dkj, defined as the Euclidean distance between two normalized instances of
Algorithm 4.1: HDP to build components

1. Sample a random permutation τ(·) of the integers 1, . . . , Nj.

2. Set Z = Z(t−1). For each m in τ(1), τ(2), . . . , τ(Nj), sequentially resample the component zjm as follows:

(a) Remove the feature Djm from the cached statistics of its current component k = zjm:
Nck = Nck − 1,   Ckw = Ckw − 1
Here w is the visual word associated with the current descriptor Djm, Ckw denotes the number of times that visual word w is assigned to component k, and Nck is the number of features in category c assigned to component k.

(b) With probability proportional to the parameter α0, a new component is formed; the probability that Djm picks an existing component is proportional to the number of features already assigned to it. For each of the K existing components, determine the predictive likelihood

fk(Djm) = (Ckw + λ/W) / (∑_w Ckw + λ)    (4.18)

Also determine the likelihood of a potential new component k^new:

f_{k^new}(Djm) = λ / (∑_w Ckw + λ)    (4.19)

(c) Remove all empty component instances, i.e. component instances without any associated descriptor.

(d) Sample a new component assignment zjm for the current descriptor from the following multinomial distribution:

p(zjm = k) = (1/Zm) (Nck + α/K) fk(Djm)    (4.20)

where Zm = ∑_{k=1}^{K} (Nck + α/K) fk(Djm).

(e) Add the feature Djm to the cached statistics of its new component k = zjm, updating
Nck = Nck + 1,   Ckw = Ckw + 1

3. Set Z(t) = Z. Update the component weights and parameters as

π(t)c ∼ Dir(Nc1 + α/K, . . . , NcK + α/K)
η(t)k ∼ Dir(Ck1 + λ/W, . . . , CkW + λ/W)    (4.21)

4. If any current component is empty (Nck = 0), with no associated component instance, remove it and decrement K accordingly.
component zk in image j, and define the classifier rule as

hk(z, θ) = { 1,  dkj > θk
             0,  dkj ≤ θk    (4.22)
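Eq. 4.22 is a simple decision stump on the component distance; a sketch in Python (function names are ours, and the instances are assumed to be normalized vectors):

```python
import math

def component_distance(u, v):
    """Euclidean distance d_kj between two normalized instances of a component."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def weak_classify(d_kj, theta_k):
    """Weak classifier h_k of Eq. 4.22: fires when the distance exceeds theta_k."""
    return 1 if d_kj > theta_k else 0
```

The parity pk (not shown) would simply flip the direction of the inequality when a small distance, rather than a large one, indicates the positive class.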
Suppose that T is an integer between 1 and K, where K is the total number of components; the AdaBoost classifier learns the top T 'positive' components as follows [VJ01]:
Algorithm 4.2: AdaBoost for Components

1. Given sample images (I1, y1), . . . , (In, yn), . . . , (IH, yH), where H = Hp + Hn and yn = 0 or yn = 1 for negative or positive samples respectively, with Hn and Hp the numbers of negative and positive samples.

2. Initialize the weights for images j = 1, . . . , H as m1,j = 1/(2Hn) for negative samples and m1,j = 1/(2Hp) for positive samples.

3. For iteration t = 1, . . . , T:

(a) Normalize the weights so that ∑_{j=1}^{H} mt,j = 1.

(b) For each component zk, train a classifier hk(z, θ), whose error is evaluated as

εk = ∑_{j=1}^{H} mt,j |hk(z, θ) − yj|    (4.23)

(c) Find the classifier ht with the lowest error εt, and add the corresponding component zt to the component set Zsub−T.

(d) Update the weights as mt+1,j = mt,j βt^{1−ej}, where ej = 0 if sample j is classified correctly, ej = 1 otherwise, and βt = εt / (1 − εt).
From this AdaBoost learning method, we obtain a subset Zsub−T consisting of the most distinguishing components.
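Algorithm 4.2 can be sketched as follows (an illustrative implementation under our own data layout: the per-image component distances and the per-component thresholds are assumed precomputed):

```python
def adaboost_components(distances, labels, thresholds, T):
    """Select the T most discriminative components (sketch of Algorithm 4.2).

    distances[j][k] : distance d_kj of component k in sample image j
    labels[j]       : 1 for positive samples, 0 for negative
    thresholds[k]   : threshold theta_k of weak classifier h_k
    """
    H = len(labels)
    Hp = sum(labels)
    Hn = H - Hp
    # Step 2: initial weights 1/(2*Hp) for positives, 1/(2*Hn) for negatives.
    m = [1.0 / (2 * Hp) if y == 1 else 1.0 / (2 * Hn) for y in labels]
    K = len(thresholds)
    selected = []
    for _ in range(T):
        s = sum(m)
        m = [w / s for w in m]  # step (a): normalize
        # Step (b): weighted error of each weak classifier (Eq. 4.23).
        errors = []
        for k in range(K):
            h = [1 if distances[j][k] > thresholds[k] else 0 for j in range(H)]
            errors.append(sum(m[j] * abs(h[j] - labels[j]) for j in range(H)))
        # Step (c): keep the component with the lowest error.
        t = min(range(K), key=lambda k: errors[k])
        selected.append(t)
        # Step (d): down-weight correctly classified samples.
        eps = errors[t]
        beta = eps / (1.0 - eps) if 0.0 < eps < 1.0 else 0.0
        for j in range(H):
            correct = (1 if distances[j][t] > thresholds[t] else 0) == labels[j]
            if correct:
                m[j] *= beta
    return selected
```

A practical version would also remove the selected component from the pool at each round to avoid re-selecting it; that refinement is omitted here for brevity.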
This hybrid system combines generative and discriminative functions. The modified model smoothly integrates the two classes of approaches and tries to benefit from the merits of both. The first part provides middle-level, preformed units which are inferred entirely from the training data; the number of these units is determined by the complexity of the objects. In the second part, the most distinguishing units are selected and the blend is built, which reduces the computational cost. This extension of the original HDP model improves both the performance and the speed of the recognition process.
4.5 Conclusion

In this chapter, the two main groups of solutions for supervised learning, generative and discriminative models, were presented. In particular, the hierarchical generative model, a learning algorithm based on Bayesian statistics, was detailed. Furthermore, we developed a nonparametric hybrid system that combines the merits of the Dirichlet process and the traditional AdaBoost approach. These proposals will be challenged on the problem of robust learning for object classification in the first part of Chapter 5. Other discriminative learners, such as SVM, will also be used in our experiments in Chapter 5.
Chapter 5
Testing and Results
Contents
5.1 General Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.1 Datasets of Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.2 Classification Using HDP with Heterogeneous Features . . . . . 77
5.1.3 Classification Using Boosting within Hierarchical Bayesian Model 81
5.2 Facial Expression Recognition . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.1 Face databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.2 Overview of Our System . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.3 Image Based Classification using Static features . . . . . . . . . . 89
5.2.4 Image Based Classification Using Static and Dynamic Features . 92
5.2.5 Classification for Sequences . . . . . . . . . . . . . . . . . . . . . . 95
In this chapter, we present the experiments conducted to evaluate the performance of our proposed systems, which concern image/video classification and feature extraction. For object classification and recognition, we used the Caltech dataset; we show that our system performs well and has the potential to capture visual concepts in the semantic domain. The other system concerns the recognition and interpretation of human facial expressions.

To evaluate the proposed facial expression recognition system, we selected three benchmark databases: JAFFE, Cohn-Kanade and MMI. By comparison with other methods, the effectiveness of our method is clearly demonstrated.

These experimental results and comparisons have been published in international conferences [JIB09, JI10b, JK09, JI10a]. Some results have been submitted to an international journal.
5.1 General Object Recognition
5.1.1 Datasets of Objects
Generic object recognition is one of the most difficult computational problems, due to intra-class and inter-class variability. Various solutions have been proposed, as reviewed in Chapter 2.

For the comparison and evaluation of these methods, appropriate datasets are required. A database should contain enough images in each category. Well-known databases for natural object categorization include LabelMe, the Caltech series and the PASCAL series.

These databases have played a key role in category-level recognition research, driving the field by providing a common ground for algorithm development and evaluation [PBE∗06]. Recently, Google and Flickr, as the most popular online image collections, have become appealing sources of natural object images, though the overwhelming number of images and the often incorrect annotations are obstacles to building ground truth.
A number of well-known datasets are used to compare categorization algorithms. As listed in Section 2.1.2, one of the most popular is the Caltech-101 dataset collected by Fei-Fei et al. [FFFP07], which consists of 101 object categories and an additional background category. These objects are normally centered and free of clutter, which makes the reported accuracies relatively high. Each category contains about 40 to 800 images, for a total of 9,144 images in the dataset. The size of each image is roughly 300×200 pixels.
To balance the numbers across categories, we chose four categories that have more pictures (Airplanes, Motorcycles, Faces and Leopards; samples in Fig. 5.1). From each category we use 50 images for training and 50 images for testing. To reflect the richness of the real world, we also include 50 images from the background category in the training of the visual words.
5.1.2 Classification Using HDP with Heterogeneous Features
In these experiments, we combine a large number of descriptors (e.g. local gradient, shape, and color) extracted from small patches within one hierarchical generative model. These different data sources have complementary characteristics, which should be independently combined to improve the classification. We are also inspired by the Hierarchical Dirichlet Processes method to generate intermediate mixture components that improve recognition and categorization.

Figure 5.1: Samples from the four categories we used.
In previous works [LJ08], the authors propose to describe the regions or patches around salient interest points with feature vectors combining different kinds of features into one visual word. In our method, we chose three sets of different features, namely SIFT, shape context and color, to generate three independent sets of heterogeneous feature vectors on small patches, and we individually cluster these feature vectors into three different visual codebooks. We then combine the codebooks into one extended vocabulary to be used by the generative model introduced in Chapter 4.
In the implementation, for each image in the training and test sets, DoG is first used to detect the keypoints. Then, using the SIFT descriptor as in Section 3.3.2, we compute a 4×4×8 = 128 dimensional feature vector over the 16×16 region around each keypoint, according to its scale; these vectors are then clustered into a visual vocabulary of size 100. The similarity between two descriptors is quantified by the simple Euclidean distance.
To represent the geometrical information related to local shapes, SC (shape context, Section 3.3.3), developed by [BMP00], is used with 2 levels in radius and 8 bins in log-polar space. For each bin, 8 orientations are counted. Thus a 2×8×8 dimensional descriptor is built around the non-zero points of the edge map. The size of this visual codebook is fixed at 50, and the χ2 distance function is used to measure the similarity between two shape features.

To balance and enrich the information in the regions around the DoG keypoints, we also use the LUV color space and average the pixel values over 8×8 regions. The color information, denoted C, is represented by a 3-dimensional vector and clustered into a codebook of size 24.
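The merging of the three codebooks (SIFT: 100 words, SC: 50, color: 24) into one extended vocabulary, together with the χ2 distance used for the shape contexts, might be sketched as follows (function names and data layout are ours):

```python
def chi2_distance(h, g, eps=1e-10):
    """Chi-squared distance between two histograms (used for shape contexts)."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h, g))

def combine_codebooks(sizes):
    """Offsets mapping (codebook, local word id) into the merged vocabulary.

    E.g. sizes = {'sift': 100, 'sc': 50, 'color': 24} gives a vocabulary of
    W = 174 words, with 'sc' words occupying indices 100..149.
    """
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # per-codebook offset, total vocabulary size W

offsets, W = combine_codebooks({'sift': 100, 'sc': 50, 'color': 24})
# A shape-context word with local index 7 becomes word offsets['sc'] + 7 = 107.
```

Because the three feature types are clustered separately, each with its own distance (Euclidean, χ2, Euclidean), the merged vocabulary keeps the codebooks disjoint rather than mixing heterogeneous features in one descriptor.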
The first experiments are carried out on the Caltech categories described in Section 5.1.1. For the prior hyperparameters, we fixed α0 = 0.1, γ = 1.0 and λ = 1.0. After 50 iterations for training the model and 20 for the test images, the recognition performance is shown in Table 5.1. By adding shape and color information to the descriptor pool, the complex nature of the objects is better captured, and the system outperforms the average performance of 75% and the best performance of 95% reported in [FFFP07], which also used a Bayesian approach.
Table 5.1: Classification results; in parentheses, the value of K (number of components).

              SIFT+SC+C   SIFT+SC    SIFT
Airplane      96% (54)    92% (43)   68% (34)
Leopard       100% (67)   94% (42)   92% (46)
Motorcycles   88% (40)    90% (43)   84% (35)
Face          66% (50)    70% (41)   62% (43)
We also show one distinctive mixture component (No. 17) that was detected. In Fig. 5.2, links are added to show similar descriptors detected in different images. Component No. 17 of the motorbike category is made up of several repeated visual words distributed around distinctive parts of the motorbikes, which are clustered into the same component by the process. Their number and positions are relatively stable, though a few irrelevant descriptors remain as noise.
According to the algorithm, the results in the testing part are obtained by running the Gibbs sampler of the MCMC procedure described in Section 4.3.5. It is not obvious after how many iterations the algorithm will, in general, converge to a useful result. In previous works, the burn-in times were arbitrarily set to 100 iterations [TJBB06] for documents and 60 [MS05] for natural scenes. We therefore ran a test on 30 images from different categories to determine when efficient convergence occurs.
Figure 5.2: Component No. 17 in the motorbike category.

In Fig. 5.3, we show the average results for 6 random images from the Airplane category. The first several iterations, the initial burn-in, correspond to random movements close to the randomly initialized starting point. In the next stage, all the combinations of different features move rapidly towards the posterior mode. Finally, at equilibrium, all the samples settle around stable values. Fig. 5.3 also shows that the average convergence speed depends on the feature combination: SIFT begins to converge around iterations 10-15 and SIFT+SC around iterations 15-20, while the combination of the three feature sets needs 25-30 iterations. To balance computation time against accuracy, we advise adaptively selecting the number of iterations in the testing stage, with a stopping criterion of variance/average below 0.2% to 0.5%.
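The suggested adaptive stopping rule might be sketched as follows (the window size over which the statistic is monitored is our own choice):

```python
def should_stop(trace, window=5, tol=0.002):
    """Stop Gibbs sampling once the recent trace is flat enough.

    trace : per-iteration values of the monitored statistic
    Stops when variance/mean over the last `window` iterations falls
    below `tol` (0.2%-0.5% in the text).
    """
    if len(trace) < window:
        return False
    recent = trace[-window:]
    mean = sum(recent) / window
    if mean == 0:
        return False
    var = sum((x - mean) ** 2 for x in recent) / window
    return var / mean < tol
```

The sampler would call this after every iteration and halt early for simple feature combinations (e.g. SIFT alone) while running longer for the three-feature combination.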
The next experiment locates a single object in a complex background by finding distinctive components. We try to accumulate the relevant components to decide where the prominent object is. We first test the image in the trained HDP model (as in Section 4.3.3); after obtaining all the instances of the components, we select the top 5 components. In Fig. 5.4, the squares are detected on the body of the leopard and display its rough position. The result is rather satisfying, with little noise.

This model first improves the generation and sharing of 10-20 latent themes among classes, while previous works usually have one or two themes per category [LJ08, TJBB06]. Second, it combines information provided individually by three diverse feature sources: local texture, shape and color. This combination makes our method outperform single-feature or large-patch-one-descriptor methods in object class detection.
Figure 5.3: Convergence vs. Iteration times.
5.1.3 Classification Using Boosting within Hierarchical Bayesian Model

In this section, after obtaining the set of components as in Section 5.1.2, instead of boosting the features as Viola and Jones [VJ01] do, we boost the components of the intermediate layer to find the most distinctive ones. We consider these components more important for object class recognition than the others and use them to improve the classification. Our target is not only the correct classification of objects, but also the discovery of the essential latent themes shared across multiple object categories and of the particular distribution of latent themes for a specific category.

After examining Table 5.1 for appropriate features, we chose two sets of features, SIFT and shape context, because their combination performs well in Table 5.1; we mixed the two codebooks into one extended vocabulary to be used by the generative model introduced in Section 4.3.3.
Figure 5.4: The distinctive components in a large image.

The experiments are also carried out on the four categories Airplanes, Leopards, Faces and Motorbikes. For the prior hyperparameters, we use α0 = 0.1, γ = 1.0 and λ = 1.0 to train the HDP model and obtain a set of components. For each category, we use the 50 training images as positive samples and 50 random images from the other three categories as negative samples for boosting. For classification, the number T is adjusted to examine the relation between the overall performance and the size of the set of 'good' components. The results for different values of T and different categories are shown in Fig. 5.5. We can see that, for the different categories, the first T components yield an acceptable but not very good detection rate. As we increase T and add more positive components to the subset, the performance increases and reaches comparatively stable values.
Table 5.2: The confusion matrix for the best T.

            Airplane   Face   Leopard   Motorbike
Airplane    94%        0%     4%        2%
Face        4%         74%    16%       6%
Leopard     8%         0%     92%       0%
Motorbikes  10%        0%     2%        88%

Table 5.3: Performance comparison.

            HDP (Num: K)   HDPBoost (best of T)
Airplane    92% (43)       94% (8)
Face        70% (41)       74% (10)
Leopard     94% (42)       92% (11)
Motorbikes  90% (43)       88% (6)
Figure 5.5: Performance vs. the size of the component set.

The confusion matrix is given in Table 5.2. It shows the HDPBoost classification performance for each category with the best value of T. The comparison between HDP and HDPBoost is shown in Table 5.3, where we list the performances and, in parentheses, the number of components K and the best value of T. We find that a small subset of all the components found by HDP is enough to capture the complex nature of the objects, and performs as well as using the whole set. Even if the performance is not improved much, the boosting not only helps to find the essential characteristics of the objects, but also accelerates the classification. For some categories (Leopard and Motorbikes), the results decrease by 2% or 4%; this outcome is perhaps due to occlusion by the background. The overall performance here also exceeds the average performance of 75% in [FFFP07], which used a Bayesian approach. In conclusion, we can predict that these top 'positive' components are strongly linked to the objects in the semantic domain.
5.2 Facial Expression Recognition

5.2.1 Face databases

In human society, the face plays an important role in interpersonal communication. From Darwin [Dar02] to Matsumoto [MW09], social psychologists have tried to interpret facial signals and the expression of human emotions. Since the beginning of the computer era, advances in image analysis have made the automatic analysis of facial expressions in computer vision possible, and accurate emotion categorization now forms a challenging task.
As this topic is very useful for man-machine interaction, researchers need benchmarks to be able to compare results directly. Different databases provide standard test data for the detection of identity, face pose, illumination, facial expression and age. Here we list some publicly available databases:
1. Cohn-Kanade AU-Coded Facial Expression Database [KCT00], which serves researchers in automatic facial image analysis and synthesis and in perceptual studies. The peak expression of each sequence is fully FACS-coded, but emotion labels are not available.

2. MMI Database [PVRM05]: this dataset is manually FACS-coded, with frame-by-frame annotations of the temporal segments. Some sessions are labeled with one of the six basic emotions.

3. Japanese Female Facial Expression (JAFFE) Database [LAKG98], which includes images of 7 facial expressions (6 basic facial expressions + 1 neutral) posed by 10 Japanese female models.

4. Belfast Naturalistic Database [CCS03], which consists of audiovisual clips from 125 speakers (31 male, 94 female). The emotional clips provide within themselves at least most of the context necessary to understand a local peak in the display of emotion and to show how it develops over time.

5. RPI ISL Facial Expression Databases [TLJ07], which are primarily used for frontal-view facial action unit recognition.

6. DaFEx - Database of Human Facial Expressions [BC08], a database of posed human facial expressions for the evaluation of synthetic faces and embodied conversational agents, but not widely used.

7. The AR Face Database [MB98], which contains images of 116 individuals (63 men and 53 women). The imaging and recording conditions (camera parameters, illumination setting, camera distance) were carefully controlled and constantly recalibrated to ensure identical settings across subjects.

8. BU-3DFE (Binghamton University 3D Facial Expression) Database and BU-4DFE (3D + time): A 3D Dynamic Facial Expression Database [YWS∗06]. This database aims at identifying facial expressions and understanding facial behavior and the 3D structure of facial expressions at a detailed level. It contains 100 subjects (56% female, 44% male), and each subject performed seven expressions in front of a 3D face scanner.
Among this comparatively large number of face databases, subjects are usually asked to perform the desired actions. Most of the databases collected primarily for face recognition also recorded subjects under changing facial expressions. However, the appearance and timing of these directed facial actions may differ from spontaneously occurring behavior [Gro05]. Recently, YouTube and Google Video have also become resources for potential high-quality databases.

Algorithms should be developed to be robust on databases of sufficient size that include carefully controlled variations of these factors. Furthermore, benchmark databases are necessary to comparatively evaluate algorithms. We selected three widely used ones as datasets for our experiments: the Cohn-Kanade, JAFFE and MMI databases.
Figure 5.6: Sample images from the JAFFE database. From left to right: Angry, Disgust, Fear, Happiness, Sadness, Surprise and Neutral.
The JAFFE database [LAKG98] is widely used in this area, and Asian women usually yield the lowest rates in facial expression recognition. The JAFFE database consists of 213 images covering the 6 basic facial expressions plus one set of neutral expressions. The images in Fig. 5.6 are posed by 10 Japanese females. Following the processing of Fig. 5.7 (the same as Fig. 3.4 in Chapter 3), for each individual we use the images from the 6 categories of facial expressions (first row) and the neutral face images (second row) to obtain the subtracted difference images (third row). As we have 3 neutral images per person and 2 to 4 images per expression category per person, the number of total difference images is multiplied, and we obtain 531 normalized face images for classification.
Figure 5.7: Sample images and the location procedure. From left to right: Anger, Disgust, Fear, Happiness, Sadness and Surprise.
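The difference images of the third row are obtained by subtracting a neutral face from each expression image after normalization; a minimal sketch with NumPy (the shift of the signed difference into [0, 255] is our own illustrative choice, not necessarily the exact mapping used in the experiments):

```python
import numpy as np

def difference_image(expression, neutral):
    """Subtract a neutral face from an expression image (both grayscale,
    same size, aligned); the signed difference is shifted to [0, 255]."""
    diff = expression.astype(np.int16) - neutral.astype(np.int16)
    return ((diff + 255) // 2).astype(np.uint8)

expr = np.full((64, 64), 200, dtype=np.uint8)
neut = np.full((64, 64), 120, dtype=np.uint8)
out = difference_image(expr, neut)  # uniform value (80 + 255) // 2 = 167
```

Because each person contributes 3 neutral images and 2 to 4 images per expression, every (expression, neutral) pairing yields one difference image, which is how the 213 source images multiply into 531 samples.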
We also evaluated our method on the MMI database [PVRM05], another publicly available database, which aims to deliver large volumes of visual facial expression data to the facial expression analysis community. It includes more than 20 different faces of students and research staff members of both sexes (44% female), ranging in age from 19 to 62 and having either a European, Asian or South American ethnic background. The faces appear in frontal and profile views displaying the six facial expressions of emotion. The sequences have variable lengths between 40 and 520 frames, picturing one or more neutral-expressive-neutral facial behavior patterns. In the sequences, each frame measures 720×576, 576×720 or 640×480 pixels in true color, as in Fig. 5.8.

Figure 5.8: Sample frames from the MMI database.

In our experiments, we selected 199 image sequences. As selection criteria, a sequence has to include a frontal or near-frontal view of the face and must already be labeled in the MMI database as ground truth. For each sequence, we manually labeled the position of the peak frames and selected four peak frames from it. We also included all the neutral frames to build the evaluation dataset and converted the images to gray level.
The third used one is Cohan-Kanade database [KCT00], which includes 486 se-
quences from 97 students (samples in Fig. 5.9). Subjects range in age from 18 to 30
years inclusive. 65% were female; 15% were African-American and 3% Asian or Latino.
They were asked to perform different expressions using a camera directly in front of
one subject. Each sequence runs from a face in neutral state to a target expression in
peak state. A comprehensive comparison is difficult because even though most of the
proposed systems worked on this database, they did not use the same selection of se-
quence sets and their own labelings of expressions. In our experiments, we select 348
sequences (40 Anger, 41 Disgust, 45 Fear, 97 Happiness, 48 Sadness and 77 Surprise).
We also manually labeled the starting frame of expression in every sequence.
Figure 5.9: Sample frames from Cohn-Kanade database

Because the Cohn-Kanade and MMI databases consist of video sequences, both image-
based and video-based methods are tested on them. For the JAFFE database, we only
test the image-based method.
5.2.2 Overview of Our System
Figure 5.10: System overview
Like most FER systems, ours (Fig. 5.10) consists of three stages: face
and facial parts detection, face representation, and facial expression recognition. The
first stage automatically locates the face and facial parts, as in Section
3.2.2. The next stage extracts the appropriate descriptors of Chapter
3 to represent the normalized face sequences produced by the first stage. The representation
is extracted from two sources: appearance-based information, using traditional LBP
(Local Binary Pattern) or Gabor features on static images, and two newly proposed textons
(VTB and moments) for the dynamic spatiotemporal information. These descriptors
(both static and dynamic features) are used by the classification methods of Chapter 4 for
facial expression recognition.
Furthermore, the last stage of our system is composed of two steps. First, classifica-
tion is performed on every image, except the first ones in the sequence, by predicting
the probability that each image belongs to each expression from the binary descriptor and
moments. Then the weighted probabilities obtained are combined so as to predict the
expression associated with the whole sequence.
5.2.3 Image Based Classification using Static features
The first experiment on the JAFFE database is performed at different resolutions:
(32 × 32), (64 × 64) and (128 × 128). The face images (the fourth row in Fig. 3.4) are cropped
from the original images (the first row in Fig. 3.4) and normalized separately to the different
sizes. On these face images, we apply the face mask of Fig. 3.10 and obtain 8 sub-
regions. LBP histograms are computed for each block and concatenated into a single
feature vector of 2048 bins. The recognition rates are obtained using an SVM classifier
with a polynomial kernel and 10-fold cross-validation. The performances for each class
are shown in Table 5.4, where An stands for Angry, Di for Disgust, Fe for Fear, Ha for
Happiness, Sa for Sadness, Su for Surprise and Av for Average; the same annotations are
used in the rest of this chapter. The average performance is 98.3%, outperforming
other manually annotated and automatic methods, as listed in Table 5.5.
Table 5.4: Recognition performances by SVM on different resolutions (%)
Resolution  An    Di    Fe    Ha    Sa    Su    Av
32×32       87.8  81.6  77.1  89.4  76.1  83.5  82.5
64×64       100   97.7  97.9  97.6  97.7  98.8  98.3
128×128     93.3  89.7  87.5  92.9  86.4  91.8  90.2
Table 5.5: Average recognition performances on JAFFE database (%)
Ours  Feng [FPH05]  Guo [GD05b]  Koutlas [KF08]  Liao [LFCY06]
98.3  93.8          92.3         90.8            94.59
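As an illustration of this feature pipeline, the following Python sketch (not the thesis code; the block coordinates and the plain 8-neighbour LBP variant are assumptions) computes a 256-bin LBP histogram per block and concatenates them, so that 8 blocks yield the 2048-bin vector described above.

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour LBP code (0..255) for each interior pixel."""
    g = np.asarray(gray, dtype=np.int32)
    c = g[1:-1, 1:-1]
    # neighbours in a fixed clockwise order; each contributes one bit
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(shifts):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes |= (nb >= c).astype(np.int32) << bit
    return codes

def block_lbp_histogram(gray, blocks):
    """Concatenate per-block 256-bin LBP histograms; with 8 blocks this
    yields the 8 x 256 = 2048-bin feature vector described above."""
    feats = []
    for (y0, y1, x0, x1) in blocks:
        codes = lbp_image(gray[y0:y1, x0:x1])
        hist, _ = np.histogram(codes, bins=256, range=(0, 256))
        feats.append(hist / max(hist.sum(), 1))  # per-block normalization
    return np.concatenate(feats)
```

Such vectors would then be fed to the polynomial-kernel SVM for the 10-fold cross-validation described above.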
The above performances show that our automatic system is effective for facial expression
recognition and outperforms existing methods, even some manually pre-processed
systems. We still face the problem of the high dimension of the feature vector. In order
to find the most discriminative features and to explore the possibility of a real-time
recognition system, we boost the features as presented in Section 2.2. For each
expression, we randomly select 50 positive face images and 50 negative face images from
all other categories of expressions. After one-against-all boosting, the top 20 features are
selected for each expression. As some features appear multiple times in different ex-
pressions, we eliminate the duplicates and reduce the dimension from 20 × 6 = 120 to
73 bins. The generalization performance of the boosted-SVM classifier is shown in Table 5.6;
the corresponding confusion matrix is shown in Table 5.7.

Table 5.6: Recognition performances by boosted-SVM for 64×64 resolution (%)
An   Di    Fe  Ha   Sa   Su    Av
100  96.6  99  100  100  97.6  98.8
Table 5.7: Confusion matrix by boosted-SVM for 64×64 resolution for 6-class recognition (%)
     An   Di    Fe   Ha   Sa   Su
An   100  0     0    0    0    0
Di   0    96.6  1.1  0    0    2.3
Fe   0    1     99   0    0    0
Ha   0    0     0    100  0    0
Sa   0    0     0    0    100  0
Su   0    2.4   0    0    0    97.6
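The bin-selection step can be sketched as a minimal one-against-all AdaBoost with threshold stumps on single histogram bins. This is an illustrative simplification: using one mean-based threshold per bin is an assumption, and the thesis' actual weak learners may differ.

```python
import numpy as np

def boost_select(X, y, rounds=20):
    """One-against-all AdaBoost with threshold stumps on single bins.

    X: (n_samples, n_bins) histogram features; y: labels in {-1, +1}
    (one expression against all others). Returns the bin indices picked
    across the boosting rounds, duplicates removed, order preserved.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)          # sample weights
    picked = []
    for _ in range(rounds):
        best = (None, None, None, 0.5)
        for j in range(d):
            thr = X[:, j].mean()     # one candidate threshold per bin
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, pol, err)
        j, thr, pol, err = best
        if j is None:                # no stump better than chance
            break
        pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)   # re-weight misclassified samples
        w /= w.sum()
        if j not in picked:
            picked.append(j)
    return picked
```

Running this once per expression and merging the six index lists (removing duplicates) mirrors the 120-to-73-bin reduction described above.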
Since the normalization of face images multiplies the dataset size by three (the number
of neutral images per person), some images are possibly identical, being cut from the
same original images (not in all cases, because for different neutral images the detection,
normalization and histograms usually differ). To confirm that our results are valid and
to compare with other methods, we divide the dataset into three sub-sets, making sure
that each sample face image is cut from a different original image. In the later
experiments we test only the 64 × 64 resolution boosted results. After 10-fold
cross-validation on each sub-set, we observe that the performance decreases but remains
robust and stable, as listed in Table 5.8.
Table 5.8: Recognition performances for 64×64 resolution on sub-sets (%)
Set    An   Di    Fe    Ha    Sa    Su    Av
No.1   100  93.1  93.8  100   100   100   97.8
No.2   100  94.8  100   100   98.3  98.2  98.6
No.3   100  96.6  100   96.4  100   93.1  97.7

We also evaluated this method on the MMI database. For the 199 selected image sequences,
we manually labeled the position of the peak frames and selected three peak frames from
each to build the evaluation dataset. Hence, 597 images are extracted from the videos and
converted to gray level. The face regions are automatically identified as described in
Section 3.2.2 and normalized to (64 × 64) pixels. After the extraction of LBP features
using masks, we again performed 10-fold cross-validation using SVM for 6-class expression
recognition. The average performance is 91.2%, the best among state-of-the-art results,
compared to 86.9% for [SGM09] and 82.2% for [TA07]. The confusion matrix on the
MMI database is shown in Table 5.9.
Table 5.9: Confusion matrix on MMI database for 6-class recognition (%)
     An    Di    Ha    Fe    Sa    Su
An   90.6  2.1   0     0     5.2   2.1
Di   2.3   94.3  3.4   0     0     0
Ha   0     0     100   0     0     0
Fe   4.8   4.8   4.8   72.6  2.4   10.7
Sa   2.1   2.1   5.2   2.1   88.5  0
Su   0     0     2.6   0.9   0.9   95.7
The overall performance on the MMI database is inferior to the results on the JAFFE database.
This may be due to the wider selection of ethnic backgrounds and to out-of-plane
movements in the MMI database. A large training set with variations in cultural origin
would help to build more comprehensive models for facial expression recognition.
For the Cohn-Kanade database, we apply similar processing. Considering the rich vari-
ation in physical appearance in this database, we select the top 50 features from each
expression after AdaBoost and combine them into 220 bins. As an option, we also use the
histogram of Gabor features as complementary information to improve the perfor-
mance. The comparison with alternative methods is listed in Table 5.10, where SN
stands for the number of subjects and SqN for the number of sequences. As we can
see, the results are not as good as those of [SGM09]; however, the eye locations in their
system are manually labeled. Furthermore, our system is more robust, as it performs
well on all three databases.

Table 5.10: Recognition performances comparisons on Cohn-Kanade database (%)
                  Representation         SN  SqN  C  A  M        AR(%)
[CSG∗03]          Motion Units           53  -    6  N  -        91.8
[BGL∗06]          Gabor+AdaBoost         90  313  7  Y  10-Fold  93.3
[MB07]            Gabor+AdaBoost         -   -    6  Y  -        84.6
[MB07]            Edge/chamfer+AdaBoost  -   -    6  Y  -        85
[SGM09]           BoostedLBP             96  320  6  N  10-Fold  95.1
Ours: LBP         LBP                    94  346  6  Y  10-Fold  91.9
Ours: BoostedLBP  BoostedLBP             94  346  6  Y  10-Fold  91.4
Ours: Fusion      BoostedLBP+Gabor       94  346  6  Y  10-Fold  94.3

Table 5.11: Recognition performances comparisons for image-based methods (%)
            Features    SN  SqN  C  D  A  M        AR(%)
[CSG∗03]    Gabor       53  318  6  N  N  -        91.8
[BGL∗06]    Gabor       90  313  7  N  Y  10-Fold  93
[KZP08]     Shape       -   -    6  N  Y  5-Fold   92.3
[DSRDS08]   Holistic    98  411  6  N  N  5-Fold   96.1
[PY09]      Haar+Boost  96  -    6  N  Y  3-Fold   88
[VNP09]     Candide     -   440  7  N  N  5-Fold   90
[SGM09]     BoostedLBP  96  320  6  N  N  10-Fold  95.1
[SC09]      AAM         -   72   6  N  N  3-Fold   97.22
Ours        LBP+VTB     95  348  7  Y  Y  2-Fold   94
Ours        LBP+VTB     95  348  7  Y  Y  10-Fold  97.2
Ours        Moments     95  348  7  Y  Y  2-Fold   95.5
Ours        Moments     95  348  7  Y  Y  10-Fold  97.3
5.2.4 Image Based Classification Using Static and Dynamic Features
Among the image-based approaches on the Cohn-Kanade database, [CSG∗03] used a sub-
set of 53 subjects for which at least four sequences were available; for each
person, they selected an average of 8 frames per expression. [BGL∗06] selected the
first and last frames of each sequence for training and testing. [SGM09] used
the neutral faces and three peak frames for prototypic expression recognition. We made
a similar selection to build static image sets. Our selection criterion is that LBP, VTB and
moments can be computed on these images (e.g., for moments, the first τs − 1 frames
are ignored). All the neutral images and four peak images per sequence are used as
training and testing images.
LBP histograms are extracted from 10 blocks on these images. For VTB features, we
connect the motion information of the current frame with its previous two frames.
These three images are decomposed into vertical spatiotemporal planes. The total his-
togram length for LBP-VTB descriptors is 256 × 10 + 64 × 3 = 2752. Moment values,
extracted from one image and its previous (τs − 1) images in the sequence, are ob-
tained from three blocks (related to eyes, nose and mouth) for each of the n = 64 spatiotem-
poral planes, as shown in Fig. 3.15. For each block, three values (M00, M10/M00, M01/M00)
are computed. The total vector dimension is 64 × 3 × 3 = 576.
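The per-block moment computation can be sketched as follows; this is a hedged illustration (the block coordinates and the x/y axis convention for M10 and M01 are assumptions), computing the three values M00, M10/M00 and M01/M00 for each block of a spatiotemporal slice.

```python
import numpy as np

def block_moments(plane, blocks):
    """Geometric moments per block of a spatiotemporal slice.

    For each block B we compute M00 (the sum of intensities) and the
    centroid coordinates M10/M00 and M01/M00, i.e. three values per block.
    With 3 blocks on each of 64 planes this gives the 576-dim vector above.
    """
    feats = []
    for (y0, y1, x0, x1) in blocks:
        B = plane[y0:y1, x0:x1].astype(float)
        ys, xs = np.mgrid[0:B.shape[0], 0:B.shape[1]]
        m00 = B.sum()
        if m00 == 0:
            feats.extend([0.0, 0.0, 0.0])  # empty block: degenerate case
            continue
        m10 = (xs * B).sum()
        m01 = (ys * B).sum()
        feats.extend([m00, m10 / m00, m01 / m00])
    return np.array(feats)
```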
The value of the temporal window τs is fixed to 8, as a good compromise between
having a region large enough for moment computation and having numerous
probability vectors for the sequence-based approach. Because of its powerful discriminative
ability, an SVM with polynomial kernel is used.
10-fold cross-validation is applied in all experiments; we also tested 2-fold
cross-validation for comparison. Our results are compared to other image-based methods in
Table 5.11, where SN stands for the number of subjects, SqN for the number of sequences,
C for the number of classes, D for dynamic, A for automatic, AR for accuracy rate,
and LOSO for leave-one-subject-out.
Comparing the results in Table 5.11, we can fairly say that our proposed method
outperforms the others in classifying the static images. As shown in
Table 5.11, the results are relatively stable, around 94% (2-fold) and 97% (10-fold). The
new descriptors, VTB and moments, improve the classification of neutral+peak images.
[CLL09] reported a similar recognition rate; however, they use fewer sequences and only 6
classes for classification.
From the MMI database, we select 199 image sequences. As a selection criterion, a
sequence has to include a frontal or near-frontal face view and be already labeled with
ground truth in the MMI database. For each sequence, we manually labeled the position of
the peak frames and selected four peak frames from it. We also included all neutral frames
to build the evaluation dataset and converted the images to gray level. The face regions are
automatically identified as described in Section 3.2.2 and normalized to 64 × 64 pixels.
After the extraction of LBP+VTB and moments features, we again performed 10-fold
cross-validation using SVM for 7-class expression recognition. The average performances are
92% (LBP+VTB) and 89% (moments), which can be considered the best performance in the
state of the art, compared to 86.9% for [SGM09]. For each category, we also provide
the confusion matrix in Table 5.12.
The results are inclined toward neutral expressions. The reason possibly lies in the small
changes at the beginning of sequences. This error can be coped with by referring to
later peak frames. We also tested 2-fold and 5-fold classification; the results are
81.9% and 86.8%, showing that the method is relatively robust.

Table 5.12: Confusion matrix of Moments on MMI database (%)
     An    Di    Ha    Fe    Sa    Su     Ne
An   76.6  2.34  0.78  0     2.34  0      17.97
Di   4.27  81    0     5.13  0.85  2.56   5.98
Ha   1.28  0.64  94.9  0     0     0.64   2.56
Fe   2.68  2.68  1.79  58.9  1.79  16.96  15.18
Sa   3.9   0     0     0.78  66.4  1.56   27.34
Su   1.28  0.64  0.64  7.69  0     71.2   18.60
Ne   0.71  0.1   0.1   0.3   0.3   0.3    98.2

Figure 5.11: One sequence of happiness and the corresponding plot for six expressions.
After this step, we consider a temporally ordered image sequence S = {I1, I2, . . . , IT},
where T is the number of frames in the sequence. After classifying all the frames except
the first two, an image Ii, with 3 ≤ i ≤ T, is associated with a vector of probabilities
{pi,c0, pi,c1, . . . , pi,c6}, where pi,cj (0 ≤ j ≤ 6) is the probability of each of the 7
classes, pi,c0 being that of the neutral expression.
5.2.5 Classification for Sequences
Recently, research has increasingly moved from static methods to the dynamic analysis
of video sequences. Similarly to Bartlett et al. [BGL∗06], Chang et al. [CLL09]
and Buenaposada et al. [BMB08], we use the classification results from each frame to se-
lect the final label for the sequence. The detailed algorithm is presented in Algorithm
5.1. One sample sequence and the plot of six expressions are shown in Fig. 5.11.
Algorithm 5.1: Sequence-level classification using weighted sum

1. Initialize a vector PS = {P1, P2, . . . , P6} with Pk = 0, where 1 ≤ k ≤ 6
2. For i = 3 to T:
   (a) If Gi = 'Neutral', ignore the frame and go to the next iteration;
   (b) If Gi ≠ 'Neutral', Pk = Pk + wi · pi,ck, where 1 ≤ k ≤ 6
3. The final label for the sequence is
   G = argmaxk {Pk}, (5.1)
   where 1 ≤ k ≤ 6.
In the experiments, we built a set of weights W = {w3, w4, . . . , wT}, associating one
weight to each image. Three sets of weights W are tested. In the first set, we use
the same weight wi = 1 for all images 3 ≤ i ≤ T. For the second and third sets,
the weights have higher values for the last few frames of the sequences, as these provide
more valuable information. As observed in [BMB08, CLL09], face changes are not
linear and, for different expressions, the movement patterns also vary. As the
real pattern is unknown, we assume that the relation between the distance to the peak frame
and the similarity follows a normal distribution N(µ, σ²). The maximum is reached for the
peak frame, corresponding to the last frame (µ = T). In the second and third sets, σ1²
and σ2² are respectively T and T/2. A comparison between our proposed approach and
other methods is given in Table 5.14. As we can see, better performances are obtained
with the three sets of W. The confusion matrix is shown in Table 5.13. Here, the
results are also better than those of [BMB08]. For future use of complete
neutral-expressive-neutral sequences, this setting can easily be extended from the left half
of the bell to the full bell. The use of a normal distribution is only a rough estimation to
approximate the movement patterns of expressions; further tuning and expression-specific
pattern modeling could be achieved.
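Algorithm 5.1 together with the Gaussian weighting can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions: the frame probabilities come from some image-level classifier, class index 0 is 'Neutral', and the left half-bell of N(µ = T, σ²) provides the weights.

```python
import math

def gaussian_weights(T, sigma2, start=3):
    """Left half-bell weights following N(mu = T, sigma2): the peak
    (last) frame i = T receives the maximum weight of 1."""
    return {i: math.exp(-((i - T) ** 2) / (2.0 * sigma2))
            for i in range(start, T + 1)}

def classify_sequence(frame_probs, weights):
    """Weighted-sum sequence labeling in the spirit of Algorithm 5.1.

    frame_probs maps a frame index i (3 <= i <= T) to its 7 class
    probabilities [p_neutral, p_1, ..., p_6] from the image-level
    classifier; frames whose top class is Neutral (index 0) are skipped.
    Returns the winning expression class in 1..6.
    """
    P = [0.0] * 7  # accumulated scores; index 0 (Neutral) stays at 0
    for i, probs in frame_probs.items():
        if max(range(7), key=lambda k: probs[k]) == 0:
            continue  # frame predicted as Neutral: ignore it
        for k in range(1, 7):
            P[k] += weights[i] * probs[k]
    return max(range(1, 7), key=lambda k: P[k])
```

Setting `sigma2 = T` or `sigma2 = T / 2` reproduces the second and third weight sets; `wi = 1` for all frames reproduces the first.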
Table 5.13: Confusion matrix of Ours: N(µ, σ2²) (%)
An Di Fe Ha Sa Su
An 97.5 2.43 0 0 0 0
Di 2.5 92.7 0 0 0 0
Fe 0 0 90.91 2.01 2.08 0
Ha 0 2.43 0 96.91 0 0
Sa 0 0 0 0 97.92 0
Su 0 2.43 9.09 1.06 0 100
Table 5.14: Recognition performances comparisons for sequence-based methods (%)
                          SN  SqN  C  D  A  M        AR(%)
[YBS04]                   -   -    6  Y  Y  5-Fold   90.9
[XLC08]                   95  365  6  Y  N  LOSO     88.8
[BMB08]                   94  333  6  N  Y  LOSO     89.13
[ZP09] a                  97  374  6  Y  N  10-Fold  95.1
[ZP09] b                  97  374  6  Y  Y  2-Fold   93.85
[KOY∗09]                  53  129  6  Y  -  -        70
[CLL09]                   -   392  6  Y  N  5-Fold   92.86
Ours: LBP+VTB, wi = 1     95  348  6  Y  Y  10-Fold  93.68
Ours: LBP+VTB, N(µ,σ1²)   95  348  6  Y  Y  10-Fold  95.1
Ours: LBP+VTB, N(µ,σ2²)   95  348  6  Y  Y  10-Fold  95.7
Ours: moments, wi = 1     95  348  6  Y  Y  10-Fold  98.5
In our approach, we begin by labeling all images in a sequence, from the neutral ex-
pression to the apex status, using the image-based classification method. The first several
images are usually classified as 'Neutral' and are ignored. From the starting frame of the
expression, the trace of facial organ movements is captured by the texture and shape
changes on the spatiotemporal slices. The first few images with subtle expression after
the starting frame are possibly wrongly labeled as 'Neutral'; however, the following
images normally yield a high accuracy. One of the advantages is that, if a few peak
frames are not correctly identified, the probabilities from the other frames help to label
the sequence. After taking into account all images except the first (τs − 1) ones in the
sequences, our method produced recognition rates of 95.7% (LBP+VTB) and
98.5% (moments). The results listed in Table 5.14 outperform those of other systems
tested on the same database. On the MMI database, we only tested the case wi = 1; the
results are 95% (LBP+VTB) and 97% (moments) for 6 classes.
The dynamic deformation on spatiotemporal planes is thus exploited to recognize
human facial expressions in image sequences. Here, the dynamic information is derived
from the vertical-time plane and is especially used to model the evolution of facial
parts from the neutral state to the apex status.
Compared to other methods based on appearance or motion, we not only learn
both static and spatiotemporal features, but also treat these features according to their
specific domains. This strategy improved the effectiveness of image-based recognition.
The system further uses the probabilities predicted from single images to classify the
whole sequence. After training and testing on the two commonly used databases,
Cohn-Kanade and MMI, the proposed approach yielded better results than other meth-
ods.
Chapter 6
Conclusion
Contents
6.1 Contributions
    6.1.1 Object Categorization Using Boosting Within Hierarchical Bayesian Model
    6.1.2 Automatic Facial Expression Recognition
6.2 Perspectives
    6.2.1 Object Similarities and Polymorphism
    6.2.2 Spontaneous Facial Expression Understanding
6.1 Contributions
6.1.1 Object Categorization Using Boosting Within Hierarchical Bayesian Model
In this thesis, we presented a novel framework for object categorization based on
combining Hierarchical Dirichlet Processes with an AdaBoost learner to recognize objects
from multiple categories. The probabilistic model of an object category can be learned
from a set of labeled training image instances. This method, combining discriminative
and generative models, not only accomplishes the recognition task but also provides
potential links between low-level features and high-level semantic labels.
Here, a category includes multiple images, and each image can be modeled as a
mixture with different mixing proportions using shared components. Each compo-
nent is itself a mixture of visual words with different mixing proportions. The components
and their number are inferred from the training data. In other methods using HDP models
[WZFF06, LJ08], one object typically corresponds to only one or two components.
This setting reduces the processing complexity, but it makes the multi-layer hierarchical
model almost meaningless: since those systems lack a pyramid-like structure and each
object is assigned only one or two middle components, the two levels are nearly identical.
In our system, for each category, the number of middle-level components exceeds twenty.
These components can be shared by similar classes, inherited in a hierarchical class
structure (e.g., building → theater), and represent characteristics unique to the current
category. As our system generates a more complex middle-level model, the processing speed
becomes an issue for multi-class recognition. These prefabricated components are new and
generated from training samples; they are more adaptive, and their total number is flexible
compared to other methods.
6.1.2 Automatic Facial Expression Recognition
In this thesis, we presented systems for automatic facial expression recognition. The
main contributions lie in several aspects: adaptive face location based on facial
expression changes, and new textons to describe the dynamic deformation of these facial
expression changes.

The first system learns from essential facial parts and local features for facial
expression identification. It relies on an automatic method for essential facial parts
detection, a specially designed mask for the LBP histogram, features boosted from
histogram bins, and the fusion of heterogeneous features. All these aspects show their
impact on the recognition rates. Compared to other systems, no translation of the face or
alignment of the mouth position is needed in pre-processing. The system was tested on the
JAFFE, Cohn-Kanade and MMI databases and proved robust to slight translations or small
misalignments, although it cannot support large rotations or translations.
Based on this system, we further proposed a novel FER system to recognize
human facial expressions in image sequences using spatiotemporal information. The
system analyzes a single image with static information from appearance-based descrip-
tors and dynamic information provided by the spatiotemporal plane. Here, the dynamic
information is derived from the vertical-time plane and is specially used to model the
evolution of facial parts from the neutral state to the expressional state. The system
further uses the probabilities predicted from single images to classify the whole sequence.
Compared to other methods based on appearance or motion, we not only learn both static
and spatiotemporal features, but also treat these features according to their specific
domains. This strategy improved the effectiveness of image-based recognition. After
training and testing on the two commonly used databases, Cohn-Kanade and MMI, the
proposed approaches yield results competitive with the state of the art.
6.2 Perspectives
6.2.1 Object Similarities and Polymorphism
There are many open questions in object categorization. Some extend the current
system; some are new directions for future research.

Firstly, we can extend the system to consider the co-occurrence of heterogeneous
features or components, and adopt graph methods to model the spatial relations among
components. The detected component instances can also be used for localization, or to
build a hierarchical tree showing the structure of an object class. The use of different
combinations of more kinds of descriptors is also very interesting to explore and tune.

We should also train the models on a larger number of images, and prepare to con-
struct a 'component dictionary' of objects and a taxonomy tree at the semantic level. This
dictionary can be updated and extended when a new category is added. The basic
blocks for categorization are generated in a dynamic way, just like a real vocabulary.
Another thought is that objects may be polymorphic. A thesaurus is necessary
for visual systems: an object can be manifested by several models, and a model can be
related to multiple labels. This idea is rarely explored and promising. The potential
problem may lie in the degree of complexity.
6.2.2 Spontaneous Facial Expression Understanding
In our future work, we plan to build a complete real-time system to recognize
frontal faces and identify the facial expressions from a video camera under varying en-
vironments. At present, the rigid motion generated by pose changes or head movement is
not considered. These motions, which often appear unintentionally in spontaneous
expressions, influence the detection results. In future work, we aim to first differentiate
and compensate rigid motion and expressional muscular movements in the spatiotemporal
domain.
Secondly, the second half of the sequence after the peak frame, i.e., the apex-offset-
neutral segment, is not yet explored. The offset parts, which show the opposite direction of
movement and a different speed, should produce their own models. We therefore plan to use
the complete sequences; for different expressions, different models should be built
to capture the different rhythms along the time axis.
In the current work, subtle facial expressions are mentioned, though the level of inten-
sity is not identified or evaluated. The intensity level of an expression is also meaningful in
human emotion recognition and human-computer interaction: for example, a smile and a
laugh convey different unspoken signals. These facial events are important
though not fully explored.
Having become famous after the broadcast of the American television series "Lie to Me",
microexpressions are another hot topic in psychology [MR83, Eck03]. Unlike basic facial
expressions, they are said to be difficult to fake and, at the same time, difficult to catch.
Computer vision is said to be able to see what human eyes cannot sense. We hope
to study frame-by-frame expression labels to further identify this kind of deception in
videos.
Chapter 7
Résumé en Français
Contents
7.1 Summary
7.2 Contributions
    7.2.1 Detection of the facial organs
    7.2.2 Description of the expressions
    7.2.3 Classification by HDP (Hierarchical Dirichlet Process)
    7.2.4 Hierarchical Dirichlet Process for classification
7.3 Results
    7.3.1 Validation of object classification by HDP
    7.3.2 Validation of the proposed descriptors for expression classification by SVM
        7.3.2.1 Classification of expression images by static descriptors
        7.3.2.2 Classification of expression images by dynamic descriptors
        7.3.2.3 Classification of expression sequences by dynamic descriptors
7.1 Summary

In this thesis, we addressed the problem of object classification and then applied it to
the classification and recognition of facial expressions.

1. Classification by hierarchical Bayesian models

Humans can solve the problem of object classification and recognition effortlessly,
quickly and relatively efficiently, but within the framework of machine learning this
remains a very difficult task. In this field, and for a certain category of approaches,
knowledge is acquired by learning from a set of images; a model of the object is then
built automatically to allow classification, recognition or identification.

In this thesis, we drew inspiration from Dirichlet processes, as distributions over the
space of distributions, which generate intermediate components that improve object
categorization. This model, used notably in the semantic classification of documents, is
characterized by being non-parametric and hierarchical. In a first phase, the set of basic
intermediate components is extracted using Bayesian learning by MCMC; then an iterative
selection of the most distinctive weak classifiers among all the components is performed
by AdaBoost. Our objective is to identify the distributions of the latent components, both
those shared by the different classes and those associated with a particular category.

2. Classification of facial expressions

In this second part, we sought to apply our classification approach to facial expressions.
Many approaches for facial expression recognition systems have been proposed. Ekman
and Friesen defined six basic emotions: anger, disgust, fear, happiness, sadness and
surprise. Because of individual differences and cultural context, it is often difficult to
interpret these universal expressions by applying recognition algorithms, and sometimes
even manually by human beings. Automatic facial expression recognition is still one of
the most active fields in computer vision, and it has attracted many proposals over the
last decades. We therefore focused on the description of facial expressions, and proposed
two new descriptors for automatic recognition (VTB and moments on the spatiotemporal
plane). These descriptors describe the transformation of the face during facial expressions:
they aim to capture the change of the general shape and the texture details induced by the
movements of the facial organs. Finally, HDP and SVM classifiers were used to efficiently
recognize the expression, whether from still images or from image sequences.

This work consisted in finding adequate methods to describe the static and dynamic
aspects of a facial expression, and thus in designing new descriptors able to represent the
characteristics of the movements of the facial muscles and, thereby, to identify the
category of the expression.

Keywords: object classification, facial expression classification, hybrid model,
hierarchical Dirichlet models, spatiotemporal, dynamic descriptor.
7.2 Contributions
7.2.1 Détection des organes faciaux
Dans ce travail, nous nous sommes placés dans le cas de séquences d’images, partant de
l’idée qu’à terme, l’objectif sera d’analyser des vidéos en temps réel, et d’en extraire les
visages puis les expressions qui leur sont associées. Pour détecter de manière grossière
la position des visages, nous nous sommes basés dans un premier temps sur l’algo-
rithme de Kienzel [KBFS05]. Celui-ci fait appel à un apprentissage par SVM à marge
floue et une recherche pyramidale multi-échelles à 12 niveaux afin de détecter les patchs
candidats. Ceux-ci sont ensuite introduits dans un système de classifiers en cascade qui
présente un taux de détection de 95%. Partant de ces résultats, nous avons affiné la détec-
tion des organes faciaux comme présenté dans [JK09]. Après normalisation des images
de la séquence, nous procédons à la soustraction des images Neutres− Expressives, ce
qui nous permet de localiser les organes faciaux grossièrement, du fait que ce sont les
parties les plus concernées par l’apparition d’une expression. Ensuite, une détection plus
fine est opérée, se basant sur l’accumulation des intensités des pixels le long de lignes
horizontales et verticales, et exploitant les propriétés géométriques du visage, comme
illustré sur la figure 7.1.
Figure 7.1 – Detection of the facial features for the 6 expressions

We begin by applying a Gaussian filter to the difference image, which eliminates points due to noise; we then run an algorithm that scans the image to detect the dense blocks corresponding to the eyes, the nose, and the mouth. Finally, we delimit these features with vertical and horizontal lines satisfying a line-density criterion, computed as the ratio:
DL = PIXvalid / PIXTot,    (7.1)
where PIXvalid is the number of pixels whose gray level exceeds a fixed threshold, and PIXTot is the total number of pixels along the line.
Our tests showed that by restricting the computation of the descriptors to the detected features, we obtain better results than when working on the whole image.
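The line-density ratio of Equation 7.1 can be sketched as follows. This is a minimal illustration assuming the difference image is a NumPy array; the function name, threshold value, and toy image are ours, not the thesis code.

```python
import numpy as np

def line_density(diff_image, threshold, axis=0):
    """Line-density ratio DL = PIXvalid / PIXTot (Eq. 7.1) for every
    row (axis=1) or column (axis=0) of a Neutral-minus-Expressive
    difference image."""
    valid = diff_image > threshold              # pixels above the gray-level threshold
    return valid.sum(axis=axis) / diff_image.shape[axis]

# Toy 4x6 difference image: one "dense" row simulating an eye region.
img = np.zeros((4, 6))
img[1, :4] = 200                                # 4 of the 6 pixels exceed the threshold
rows = line_density(img, threshold=100, axis=1)
```

A row or column is then kept as a feature boundary when its density exceeds a chosen cutoff.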
7.2.2 Description of the expressions
According to [Bas79], the human brain relies more on a local analysis of the face than on a holistic approach to determine the facial expression being displayed. Figure 7.2 illustrates the movements associated with each expression. Starting from this observation, we attempt to recognize expressions from a combination of static information, computed in the image plane, and dynamic information, computed in the temporal plane.

Figure 7.2 – The facial movements described by [Bas79]

For the static description, we used LBP, applied to a mask of 10 blocks (Figure 7.3) covering the areas affected by the expressions.

Figure 7.3 – 10 blocks corresponding to a smile image
To extract the dynamic information, we built on the work of [ZP09] and analyzed the image plane XY as well as the horizontal (XT) and vertical (YT) temporal planes, which revealed very large differences (Fig. 7.4).
The analysis of the XT and YT planes showed that, across expressions, the largest differences appear in the YT plane, and that these variations are almost identical across subjects (Fig. 7.5).
We therefore proposed to quantify the deformations in the YT plane.
Figure 7.4 – Left: XY (frontal view); Middle: the YT slice; Right: the XT slice.
Figure 7.5 – Deformations of the YT slice for different expressions

Let S = {I1, I2, . . . , IT} be a temporal sequence, where T is the number of images in the sequence S and each image Ii has size n × m. For each value of x, with 1 ≤ x ≤ n, we decompose S into n spatio-temporal slices Px, as described in Figure 7.6.
Each slice of height m is then divided into three parts of different heights: m1 = (3/8)m for the eyes, m2 = (1/4)m for the nose, and m3 = (3/8)m for the mouth. Finally, the descriptors are computed on overlapping slice portions of size τs along each slice.
The first descriptor, VTB, is an extension of LBP in which we set τs = 3. In the third image, each pixel is used as a threshold applied to its preceding neighbors located at the same coordinates (cf. Figure 7.7).

Figure 7.6 – Example of slices in the YT plane.

For each pixel p with coordinates (x, y) and gray level gx,y in image It at time t, the value associated with this pixel is given by Equation 7.2:
VTB = s(gx,y−1,t−2 − gx,y−1,t) 2^5
    + s(gx,y,t−2 − gx,y,t) 2^4
    + s(gx,y+1,t−2 − gx,y+1,t) 2^3
    + s(gx,y+1,t−1 − gx,y+1,t) 2^2
    + s(gx,y,t−1 − gx,y,t) 2^1
    + s(gx,y−1,t−1 − gx,y−1,t) 2^0.    (7.2)

Figure 7.7 – Computation of VTB
where s(x) = 1 if x > 0, and s(x) = 0 if x ≤ 0.
The final descriptor, of size 192, is then given by the histogram of the computed values, and measures the texture variation that appears during a facial expression.
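As a sketch, Equation 7.2 and the 192-bin histogram can be computed on one YT slice as below. The band boundaries passed to `vtb_histogram`, like all the names here, are illustrative choices, not the thesis implementation.

```python
import numpy as np

def vtb_codes(slice_yt):
    """VTB codes for one YT slice (rows = y, cols = t), following Eq. 7.2:
    each pixel at time t thresholds its neighbours at y-1, y, y+1 in the
    two preceding frames t-2 and t-1, giving a 6-bit code in 0..63."""
    P = slice_yt.astype(int)
    m, T = P.shape
    codes = np.zeros((m - 2, T - 2), dtype=int)
    for y in range(1, m - 1):
        for t in range(2, T):
            diffs = [P[y - 1, t - 2] - P[y - 1, t],   # weight 2^5
                     P[y,     t - 2] - P[y,     t],   # weight 2^4
                     P[y + 1, t - 2] - P[y + 1, t],   # weight 2^3
                     P[y + 1, t - 1] - P[y + 1, t],   # weight 2^2
                     P[y,     t - 1] - P[y,     t],   # weight 2^1
                     P[y - 1, t - 1] - P[y - 1, t]]   # weight 2^0
            bits = [1 if d > 0 else 0 for d in diffs]
            codes[y - 1, t - 2] = sum(b << (5 - i) for i, b in enumerate(bits))
    return codes

def vtb_histogram(slice_yt, parts=((0, 3), (3, 5), (5, 8))):
    """Concatenate one 64-bin histogram per facial band (eyes/nose/mouth),
    i.e. a 3 x 64 = 192-dimensional descriptor. Band bounds are illustrative."""
    codes = vtb_codes(slice_yt)
    hists = [np.bincount(codes[a:b].ravel(), minlength=64) for a, b in parts]
    return np.concatenate(hists)
```

The histograms of all slices of a face are then accumulated to form the texture part of the dynamic descriptor.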
The second descriptor we propose is a spatio-temporal moment, whose purpose is to capture the observed shape variations. We chose the moments (M00, M10/M00, M01/M00), computed on each of the 3 parts of the slice defined above. The moment Mp,q(x, i, k) of each part, for position x of image Ii with a temporal window of τs, is given for slice Px by:

Mp,q(x, i, k) = Σ_{y=1}^{mk} Σ_{t=i−τs}^{i} y^p t^q Px(y, t)    (7.3)

where mk is the height of the part in question, with 1 ≤ k ≤ 3, as shown in Figure 3.15. For a given image Ii, the moments are computed with τs = 8 for all positions x. These moments are then grouped into a final vector of size 3 × 3 × 64, i.e. 576 bins. The resulting vector is then used to determine the probabilities that the current image Ii is associated with each of the 7 known expressions.
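A sketch of Equation 7.3 and of the (M00, M10/M00, M01/M00) triple per band, assuming each slice Px is stored as a NumPy array with rows indexed by y and columns by t; the band split follows the 3/8–1/4–3/8 proportions of the text, and it assumes each band has nonzero mass (M00 > 0).

```python
import numpy as np

def part_moments(slice_yt, i, tau_s=8, bands=(0.375, 0.25, 0.375)):
    """(M00, M10/M00, M01/M00) of Eq. 7.3 for one YT slice Px at frame
    index i, over the temporal window [i - tau_s, i], for the three
    bands (eyes, nose, mouth)."""
    P = slice_yt.astype(float)
    m, _ = P.shape
    window = P[:, i - tau_s:i + 1]             # temporal window, tau_s + 1 frames
    t = np.arange(i - tau_s, i + 1)            # absolute frame indices
    feats, y0 = [], 0
    for frac in bands:
        y1 = y0 + int(round(frac * m))
        y = np.arange(y0 + 1, y1 + 1)          # 1-based y inside the band, per Eq. 7.3
        part = window[y0:y1]
        m00 = part.sum()                       # p = q = 0
        m10 = (y[:, None] * part).sum()        # p = 1, q = 0: weight by y
        m01 = (t[None, :] * part).sum()        # p = 0, q = 1: weight by t
        feats += [m00, m10 / m00, m01 / m00]
        y0 = y1
    return np.array(feats)                     # 9 values per slice position x
```

Concatenating these 9 values over the 64 slice positions yields the 3 × 3 × 64 = 576-bin vector of the text.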
7.2.3 Classification by HDP (Hierarchical Dirichlet Process)
Our first idea was to use hierarchical Dirichlet processes to model facial expressions. We considered this problem to be similar to modeling the topics covered by the documents of a corpus from the occurrences of words in the text. In that application, a document may deal with many topics (unknown, and of unknown number) and share some topics with other documents, each topic being characterized by a set of words.
Texts or documents are thus distributions over topics, and each topic is a distribution over the words of a vocabulary. The analogy with our facial expression application is made through the correspondence (text, word, topic) → (image, visual word, expression), where images are distributions over visual words, visual words are distributions over "internal components" to be determined, and the expressions (also to be determined) are themselves distributions over these "internal components". The idea is then to determine not only the internal components (the topics), which can be achieved with LDA (Latent Dirichlet Allocation), but also the expressions, which are themselves distributions over the internal components. We thus have a distribution over a space of distributions, for which Dirichlet process modeling is well suited.
Other motivations for this choice are, on the one hand, the nonparametric nature of the model: although Ekman identified a well-established set of expressions, we believe the classification may in fact involve an unknown number of expressions corresponding to the different intensities that can be encountered; on the other hand, the clusters to be found are not disjoint but share certain characteristics that we seek to bring out. HDP classification had already been applied to object categorization, and our first tests showed that this approach yields results comparable to the state of the art.
The basis of Bayesian classification is Bayes' rule, which estimates the posterior probability of belonging to a class from knowledge of the prior and of the model likelihood; its simplest form is:

p(x|y) = p(x) p(y|x) / p(y)    (7.4)
More generally, the goal is to perform inference from observations, either to describe or to predict a phenomenon. The first term of Equation 7.4 usually comes from prior knowledge of the problem or from a statistical study, while the second is given by the chosen distribution model, whose parameters are learned in the case of supervised approaches. Much research has focused on determining these two terms. The Bayesian approach has spread to very many domains (finance, weather, social sciences, computer science, etc.). In image processing, it has also been widely used for classification, segmentation, restoration, and so on. For each application, the task is to establish an adequate probabilistic model, derive the inferences, and then find the parameters of the chosen models. In a number of cases, an additional difficulty arises from computational complexity, which requires resorting to other techniques (dynamic programming, graph cuts, etc.).
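As a toy numerical illustration of Equation 7.4 (the two classes and all the prior and likelihood values below are invented for the example):

```python
# Bayes' rule (Eq. 7.4) on a toy 2-class problem: which expression x
# generated the observation y?  p(x|y) = p(x) p(y|x) / p(y).
priors = {"joy": 0.6, "anger": 0.4}            # p(x): illustrative numbers
likelihood = {"joy": 0.2, "anger": 0.7}        # p(y|x) for the observed y
evidence = sum(priors[c] * likelihood[c] for c in priors)        # p(y)
posterior = {c: priors[c] * likelihood[c] / evidence for c in priors}
```

Despite the higher prior on "joy", the likelihood dominates here and the posterior favours "anger".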
7.2.4 Hierarchical Dirichlet Process for classification
The papers [TJBB06] and [WZFF06] provide a detailed presentation of the mechanism used. Without spelling out the mathematical formalism, whose theory extends far beyond the scope of our work, the key idea is to model the image by a set of patches, each described by a codeword from a dictionary. A "latent topic" is associated with each patch, and patches related to a given image structure are assumed to be likely to share similar topics. For each object class, it is then possible to determine the distribution of topics associated with that class. The hierarchical Dirichlet process can be seen as an extension of LDA; Figure 7.8 shows the graphical model of LDA. These methods were originally applied to the classification of documents in a corpus. LDA is based on the idea that a document is a probabilistic mixture of latent topics (z) and that each topic is characterized by a probability distribution over its associated words (x). The LDA model prevents dependence between the probability distribution used to choose the topics and the documents already known. In the graph shown we have:
Figure 7.8 – Graphical model of LDA

γ is the parameter of the uniform Dirichlet distribution between documents and topics, i.e. the prior;
β is the parameter of the uniform Dirichlet distribution between words and topics, which is to be estimated;
πj is the distribution of topics over the j-th document;
zj,i is the topic associated with the i-th word of the j-th document;
and xj,i is the i-th word of the j-th document.
Figure 7.9 – Graphical model of HDP

Figure 7.9 shows the HDP model proposed by [TJBB06], in which π is no longer a distribution parameterized by α but a distribution over distributions, which, as the name indicates, adds an extra level of hierarchy.
We then have:

G0 | γ, H ∼ DP(γ, H)
Gj | α0, G0 ∼ DP(α0, G0)
φji | Gj ∼ Gj
xji | φji ∼ F(φji)    (7.5)
Applied to our setting, the observations xj,i are the visual words computed with the Bag of Words technique. The φji are latent components to be determined, drawn from a distribution Gj with concentration parameter α0 around the base measure G0, itself a Dirichlet process with parameters γ and H. A given class (objects, expressions, etc.) corresponds to a certain distribution of latent components φji; knowing this distribution makes it possible to assign each image to its most probable class. Inference is carried out by Markov chain Monte Carlo, using a Gibbs sampler, with the stick-breaking representation.
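The stick-breaking representation used for inference can be sketched as follows: a truncated construction of the weights of DP(γ, H), where each weight is a fraction of the stick remaining after the previous breaks. The truncation level K and the use of NumPy are our illustrative choices.

```python
import numpy as np

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking weights of a Dirichlet process DP(alpha, H):
    beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{l<k} (1 - beta_l)."""
    betas = rng.beta(1.0, alpha, size=K)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining                   # weights of the first K atoms

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=1.0, K=50, rng=rng)  # mass not covered by the K
                                               # atoms is 1 - pi.sum()
```

Smaller α concentrates the mass on a few atoms (few latent components); larger α spreads it, which is what makes the number of components effectively open-ended.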
7.3 Results
The tests covered two aspects: on the one hand, object classification by hierarchical Dirichlet process, in order to validate the model; on the other, the detection and description of facial expressions.
7.3.1 Validation of object classification by HDP
We used a few categories of the Caltech database [FFFP07], which contains 101 object categories, each comprising 40 to 800 images of size 300x200 pixels (examples shown in Figure 7.10).

Figure 7.10 – Categories of the Caltech database used in our tests.
We used three complementary descriptors, SIFT, Shape Context [BMP00], and color, computed locally at interest points, each producing a different visual codebook. The three codebooks are then combined and used in the proposed generative model. In practice, for SIFT, a 128-dimensional vector is computed over a 16x16 pixel region around each detected interest point, and all the descriptors of a class are then clustered to keep a vocabulary of 100 visual words. The Shape Context descriptor is computed around points taken from an edge map, yielding a vector of size 128; here we keep a vocabulary of 50 visual words. Finally, the mean color around the interest points is computed in LUV space, and we keep only 24 visual words. Table 7.1 shows the results obtained by HDP after 50 iterations for the training phase and 20 for the test phase.
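The codebook step, clustering local descriptors into a vocabulary of visual words, can be sketched with a plain k-means. This illustrates the principle only; the exact clustering method and parameters used in the thesis are not specified here, and the names are ours.

```python
import numpy as np

def build_codebook(descriptors, k, iters=20, seed=0):
    """Quantize local descriptors (e.g. 128-D SIFT vectors) into k visual
    words with a plain Lloyd-style k-means."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    labels = np.zeros(len(descriptors), dtype=int)
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, labels

# An image is then encoded as a bag of words: the histogram of `labels`
# restricted to its own descriptors.
X = np.vstack([np.zeros((10, 2)), 10.0 * np.ones((10, 2))])  # two toy clusters
centers, labels = build_codebook(X, k=2)
```

With 100 SIFT words, 50 Shape Context words, and 24 color words, the three resulting histograms are simply concatenated before entering the generative model.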
Table 7.1 – Classification results (number of components in parentheses).

              SIFT+SC+C    SIFT+SC    SIFT
Airplane      96% (54)     92% (43)   68% (34)
Leopard       100% (67)    94% (42)   92% (46)
Motorbikes    88% (40)     90% (43)   84% (35)
Faces         66% (50)     70% (41)   62% (43)
We compare combinations of several descriptors (SIFT, SIFT+SC, and SIFT+SC+Color). We observe that these results rival the state of the art [FFFP07] (75% on average, 96% at best). One of the difficulties encountered is determining the number of iterations needed to be sure of having converged to the right model. Figure 7.11 shows the evolution of the likelihood as a function of the number of iterations. For SIFT alone, 10 to 15 iterations suffice. The more descriptors are used, the more iterations are needed to obtain the same result (15 to 20 for SIFT+SC and 20 to 25 for SIFT+SC+Color).
We also seek to localize an object from the internal components detected while learning its category. Figure 7.12 shows, as squares, the positions of the components most characteristic of the leopard on a test image; they clearly allow the leopard to be localized even though the background shares similarities with the object.
Figure 7.11 – Convergence and number of iterations.
Figure 7.12 – The main characteristic components of the leopard.

The last point we examined at this stage is computation time, which is excessive, especially during the training phase. We sought to reduce the number of intermediate components to select, in contrast to works such as [VJ01], which instead reduce the number of descriptors using AdaBoost. Still working on the same four categories, we used 50 images of one category as positive examples and 50 images of each of the three other categories as negative examples. We vary the number T of intermediate components to keep. Figure 7.13 shows that beyond about ten components, the results change very little.
Table 7.2 compares plain HDP with boosted HDP, showing that on average there is no loss in classification quality when using a reduced set of components determined by AdaBoost.
Figure 7.13 – Performance as a function of the number of components.

Table 7.2 – Performance comparison.

              HDP (K components)   HDPBoost (best T components)
Airplane      92% (43)             94% (8)
Faces         70% (41)             74% (10)
Leopard       94% (42)             92% (11)
Motorbikes    90% (43)             88% (6)

However, during the tests the training phase proved very costly in computation time, given the size of the descriptors used, despite the reduced number of classes and images. Consequently, for facial expression classification we finally opted for a classical SVM classifier, which we boosted to select the most significant features. This notably allowed us to validate our method for the detection and description of expressions.
7.3.2 Validation of the proposed descriptors for expression classification by SVM
Here we distinguish image classification from sequence classification, the goal being to validate the proposed method for locating facial expressions and to verify the relevance of the proposed descriptors. The tests were carried out on three databases. The first is the JAFFE database [LAKG98], which includes 7 facial expressions (six plus neutral) for each of ten Japanese subjects. This database is widely used for non-spontaneous tests and is known for yielding low recognition rates. The second database is MMI [PVRM05], which consists of manually annotated sequences. Finally, the Cohn-Kanade database [KCT00] also consists of sequences and uses the FACS (Facial Action Coding System) to identify the expression peaks; however, the emotion labels are not available.
7.3.2.1 Classification of expression images with static descriptors
We use the LBP descriptor computed around the detected facial features, as described earlier. Classification is performed with a boosted SVM where only 20 parameters are selected per expression. Given the overlaps, we end up with a vector of dimension 73 for the set of 6 expressions. Tables 7.3, 7.4, and 7.5 compare our results with some of those obtained respectively by [FPH05], [GD05b], [KF08], [LFCY06] for the JAFFE database, by [SGM09], [TA07] for the MMI database, and by [CSG∗03], [BGL∗06], [MB07], and [SGM09] for the Cohn-Kanade database.
Table 7.3 – Average performance on the JAFFE database (%)

Our approach   Feng   Guo    Koutlas   Liao
98.3           93.8   92.3   90.8      94.59

Table 7.4 – Average performance on the MMI database (%)

Our approach   Shan   Tripathie
91.2           86.9   82.2
For the Cohn-Kanade database, given the great variability of appearances, we kept 50 parameters, which finally yields a vector of dimension 220. In addition, to allow comparison with other works, we also used histograms of Gabor features for the description. In Table 7.5, SN is the number of subjects, SqN the number of sequences, C the number of classes, and A indicates whether the localization of the facial features is automatic. We obtain slightly better results than the state of the art.
Table 7.5 – Performance comparison on the Cohn-Kanade database (%)

               Representation           SN   SqN   C   A   Rate (%)
[CSG∗03]       Motion Units             53   -     6   N   91.8
[BGL∗06]       Gabor+AdaBoost           90   313   7   Y   93.3
[MB07]         Gabor+AdaBoost           -    -     6   Y   84.6
[MB07]         Edge/chamfer+AdaBoost    -    -     6   Y   85
[SGM09]        BoostedLBP               96   320   6   N   95.1
Our approach   LBP                      94   346   6   Y   91.9
Our approach   BoostedLBP               94   346   6   Y   91.4
Our approach   BoostedLBP+Gabor         94   346   6   Y   94.3
7.3.2.2 Classification of expression images with dynamic descriptors
The goal here is to recognize the expression of a sequence by computing, in a single pass, the descriptors corresponding to a portion of it. We work on portions of sequences long enough to compute the VTB and moment descriptors. In [CSG∗03], the authors use a set of 53 subjects, each with at least 4 sequences, and for each expression eight frames are considered for descriptor extraction. [BGL∗06] considers the first and last frame of each sequence for both training and testing. [SGM09] uses one frame of the neutral expression and the last three frames corresponding to the peak of the expression to be recognized. We followed a similar scheme for the set of images used for description, each time taking into account the neutral images and at least 4 expressive images.
The LBP, VTB, and moment descriptors are computed as presented in Section 7.2.2, with the temporal window width τs set to 8. This value is a good compromise between having enough regions to compute the temporal moments and enough probabilities, each associated with a mini-sequence of length τs, to compute the probability associated with each image. Table 7.6 compares our results with various works on the same database (Cohn-Kanade), where D indicates whether the descriptors are dynamic. We can conclude that these descriptors noticeably improve the classification.
The MMI database was also used for testing. The confusion matrix of the moment descriptor, given in Table 7.7, shows the classes that are hard to characterize with this descriptor (here C stands for anger (colère), D for disgust (dégoût), J for joy (joie), P for fear (peur), T for sadness (tristesse), S for surprise, and N for neutral).
Table 7.6 – Comparison with spatio-temporal descriptors

               Features      SN   SqN   C   D   A   AR (%)
[CSG∗03]       Gabor         53   318   6   N   N   91.8
[BGL∗06]       Gabor         90   313   7   N   Y   93
[KZP08]        Shape         -    -     6   N   Y   92.3
[DSRDS08]      Holistic      98   411   6   N   N   96.1
[PY09]         Haar+Boost    96   -     6   N   Y   88
[VNP09]        Candide       -    440   7   N   N   90
[SGM09]        BoostedLBP    96   320   6   N   N   95.1
[SC09]         AAM           -    72    6   N   N   97.22
Our approach   LBP+VTB       95   348   7   Y   Y   94
Our approach   LBP+VTB       95   348   7   Y   Y   97.2
Our approach   Moments       95   348   7   Y   Y   95.5
Our approach   Moments       95   348   7   Y   Y   97.3
Table 7.7 – Confusion matrix on the MMI database (%)
C D J P T S N
C 76.6 2.34 0.78 0 2.34 0 17.97
D 4.27 81 0 5.13 0.85 2.56 5.98
J 1.28 0.64 94.9 0 0 0.64 2.56
P 2.68 2.68 1.79 58.9 1.79 16.96 15.18
T 3.9 0 0 0.78 66.4 1.56 27.34
S 1.28 0.64 0.64 7.69 0 71.2 18.60
N 0.71 0.1 0.1 0.3 0.3 0.3 98.2
Next, instead of classifying individual frames of a given sequence, we classify the entire sequence, taking into account the evolution of the facial changes over the course of the expression.
7.3.2.3 Classification of expression sequences with dynamic descriptors
Unlike the previous section, here we seek to determine the expression associated with a sequence by analyzing the evolution of the spatio-temporal descriptors over the sequence. Facial expression research is moving increasingly toward video and spontaneous expressions. In this spirit, we used the previous results to determine the expression carried by the sequence, assuming it contains only one. Various studies have shown that expressions are very often brief and therefore span only a few frames. In our approach, we considered all the frames of the sequence and determined for each the probability of belonging to one of the classes. We assume an ordered sequence S = {I1, I2, . . . , IT}, where T is the number of frames in the sequence. We classify each frame Ii, with 3 ≤ i ≤ T, using the previous method, associating with it a probability vector {pi,c0, pi,c1, . . . , pi,c6}, where pi,cj is the probability that frame i belongs to class j, with pi,c0 corresponding to the neutral expression. Algorithm 7.1 then assigns the most probable class to the sequence.
Algorithm 7.1: Classification of a sequence by weighted sum

1. Initialize the vector PS = {P1, P2, . . . , P6} with Pk = 0, for 1 ≤ k ≤ 6
2. For i = 3 to T:
   (a) if Gi = 'Neutral', ignore the frame and go to the next iteration;
   (b) if Gi ≠ 'Neutral', Pk = Pk + wi · pi,ck, for 1 ≤ k ≤ 6
3. Label of the sequence:
   G = argmax_k {Pk},    (7.6)
   for 1 ≤ k ≤ 6.
For the weights wi, we made assumptions about the relation between similarity and the distance to the most expressive image (a uniform law, or a Gaussian law for two values of σ: σ1 = T and σ2 = T/2). Table 7.8 compares our results with other works tested on the same database; for various parameters of the weighting distribution, we obtain better results.
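Algorithm 7.1 with the Gaussian weighting can be sketched as follows. The per-frame probabilities, the peak index, and σ are illustrative inputs; for simplicity the sketch iterates over all frames and takes each frame's label Gi as the argmax of its probability vector.

```python
import numpy as np

def classify_sequence(probs, peak, sigma):
    """Weighted vote over the frames of a sequence (Algorithm 7.1).
    `probs` is a (T, 7) array of per-frame class probabilities, column 0
    being 'neutral'; frames whose most probable class is neutral are
    skipped, the others vote with a Gaussian weight centred on `peak`."""
    scores = np.zeros(6)                       # P1..P6
    for i, p in enumerate(probs):
        if p.argmax() == 0:                    # frame labelled neutral: skip
            continue
        w = np.exp(-0.5 * ((i - peak) / sigma) ** 2)
        scores += w * p[1:]                    # Pk += wi * pi,ck
    return scores.argmax() + 1                 # class label in 1..6

# Toy sequence: early frames neutral, later frames favour class 3.
probs = np.zeros((10, 7))
probs[:4, 0] = 1.0
probs[4:, 3] = 0.8
probs[4:, 5] = 0.2
label = classify_sequence(probs, peak=9, sigma=5.0)
```

Setting the weight w to a constant recovers the uniform variant of the text; σ = T or T/2 gives the two Gaussian variants.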
Table 7.8 – Comparison for sequence classification

                            SN   SqN   C   D   A   AR (%)
[YBS04]                     -    -     6   Y   Y   90.9
[XLC08]                     95   365   6   Y   N   88.8
[BMB08]                     94   333   6   N   Y   89.13
[ZP09] a                    97   374   6   Y   N   95.1
[ZP09] b                    97   374   6   Y   Y   93.85
[RD09]                      70   350   5   Y   -   93.85
[KOY∗09]                    53   129   6   Y   -   70
[CLL09]                     -    392   6   Y   N   92.86
Ours, LBP+VTB, wi = 1       95   348   6   Y   Y   93.68
Ours, LBP+VTB, N(µ, σ1²)    95   348   6   Y   Y   95.1
Ours, LBP+VTB, N(µ, σ2²)    95   348   6   Y   Y   95.7
Ours, Moments, wi = 1       95   348   6   Y   Y   98.5
Bibliography
[AAR04] Agarwal S., Awan A., Roth D.: Learning to detect objects in images via a sparse, part-based representation. IEEE Trans. Pattern Anal. Mach. Intell. 26, 11 (2004), 1475–1490.
[AR02] Agarwal S., Roth D.: Learning a sparse representation for object detection. In ECCV ’02: Proceedings of the 7th European Conference on Computer Vision - Part IV (2002), pp. 113–130.
[Bas79] Bassili J. N.: Emotion recognition: The role of facial movement and the relative importance of upper and lower areas of the face. Journal of Personality and Social Psychology 37, 11 (1979), 2049–2058.
[BBDT05] Beveridge J., Bolme D., Draper B., Teixeira M.: The CSU face identification evaluation system: Its purpose, features, and structure. Machine Vision and Applications 16, 2 (February 2005), 128–138.
[BC08] Beskow J., Cerrato L.: Evaluation of the expressivity of a Swedish talking head in the context of human-machine interaction. March 2008.
[bDK05] bo Duan K., Keerthi S. S.: Which is the best multiclass SVM method? An empirical study. In Proceedings of the Sixth International Workshop on Multiple Classifier Systems (2005), pp. 278–285.
[BETVG08] Bay H., Ess A., Tuytelaars T., Van Gool L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110, 3 (2008), 346–359.
[BGL∗06] Bartlett M., Littlewort G., Lainscsek C., Fasel I., Frank M., Movellan J.: Fully automatic facial action recognition in spontaneous behavior. In Automatic Face and Gesture Recognition (2006), pp. 223–228.
[BGV92] Boser B. E., Guyon I. M., Vapnik V. N.: A training algorithm for optimal margin classifiers. In COLT ’92: Proceedings of the Fifth Annual Workshop on Computational Learning Theory (1992), pp. 144–152.
[Bie87] Biederman I.: Recognition-by-components: A theory of human image understanding. Psychological Review 94, 2 (1987), 115–147.
[Bis06] Bishop C. M.: Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[BKSS10] Bergtholdt M., Kappes J., Schmidt S., Schnörr C.: A study of parts-based object class detection using complete graphs. Int. J. Comput. Vision 87, 1-2 (2010), 93–117.
[BMB08] Buenaposada J., Munoz E., Baumela L.: Recognising facial expressions in video sequences. Pattern Analysis and Applications 1, 2 (January 2008), 101–116.
[BMP00] Belongie S., Malik J., Puzicha J.: Shape context: A new descriptor for shape matching and object recognition. In Proc. NIPS (2000), pp. 831–837.
[BNJ03] Blei D., Ng A., Jordan M.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (January 2003), 993–1022.
[Bra00] Bradski G.: The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000).
[BT05] Bouchard G., Triggs B.: Hierarchical part-based visual object categorization. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 1 (2005), pp. 710–715.
[Bur98] Burges C. J. C.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2, 2 (1998), 121–167.
[BZMn08] Bosch A., Zisserman A., Muñoz X.: Scene classification using a hybrid generative/discriminative approach. IEEE Trans. Pattern Anal. Mach. Intell. 30 (April 2008), 712–727.
[CB03] Chibelushi C., Bourel F.: Facial expression recognition: A brief tutorial overview. In CVonline: On-Line Compendium of Computer Vision (2003).
[CCS03] Cowie R., Cowie R., Schroeder M.: The description of naturally occurring emotional speech. In Proceedings of the 15th International Congress of Phonetic Sciences (2003), pp. 2877–2880.
[CFH05] Crandall D., Felzenszwalb P., Huttenlocher D.: Spatial priors for part-based recognition using statistical models. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 1 (Washington, DC, USA, 2005), IEEE Computer Society, pp. 10–17.
[CHC09] Cheng X., Hu Y., Chia L.-T.: Hierarchical word image representation for parts-based object recognition. In ICIP ’09: Proceedings of the 16th IEEE International Conference on Image Processing (Piscataway, NJ, USA, 2009), IEEE Press, pp. 301–304.
[CL01] Chang C.-C., Lin C.-J.: LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[CL06] Carneiro G., Lowe D.: Sparse flexible models of local features. In ECCV (3) (2006), pp. 29–43.
[CLL09] Chang K., Liu T., Lai S.: Learning partially-observed hidden conditional random fields for facial expression recognition. In CVPR (Miami, FL, USA, June 2009), pp. 533–540.
[CSG∗03] Cohen I., Sebe N., Gozman F., Cirelo M., Huang T.: Learning Bayesian network classifiers for facial expression recognition with both labeled and unlabeled data. In CVPR (Madison, Wisconsin, USA, June 2003), vol. 1, pp. 595–601.
[CV95] Cortes C., Vapnik V.: Support-vector networks. In Machine Learning (1995), pp. 273–297.
[Dar02] Darwin C.: The Expression of the Emotions in Man and Animals, 3rd ed. Oxford University Press Inc, 2002.
[DPR06] Djemal K., Puech W., Rossetto B.: Automatic active contours propagation in a sequence of medical images. IJIG 6, 2 (April 2006), 267–292.
[DSRDS08] De Silva C. R., Ranganath S., De Silva L. C.: Cloud basis function neural network: A modified RBF network architecture for holistic facial expression recognition. Pattern Recogn. 41, 4 (2008), 1241–1253.
[Eck03] Eckman P.: Darwin, deception, and facial expression. Annals of the New York Academy of Sciences 1000 (2003), 205–221.
[EF76] Ekman P., Friesen W. V.: Pictures of Facial Affect. Consulting Psychologists Press, 1976.
[EF78] Ekman P., Friesen W.: Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.
[EVGW∗] Everingham M., Van Gool L., Williams C. K. I., Winn J., Zisserman A.: The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html.
[FE73] Fischler M., Elschlager R.: The representation and matching of pictorial structures. IEEE Transactions on Computers C-22, 1 (January 1973), 67–92.
[Fer05] Fergus R.: Visual Object Category Recognition. PhD thesis, University of Oxford, 2005.
[FFFP07] Fei-Fei L., Fergus R., Perona P.: Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Comput. Vis. Image Underst. 106, 1 (2007), 59–70.
[FFTR∗05] Fize D., Fabre-Thorpe M., Richard G., Doyon B., Thorpe S. J.: Rapid categorization of foveal and extrafoveal natural images: Associated ERPs and effects of lateralization. Brain and Cognition 59, 2 (2005), 145–158.
[FH05] Felzenszwalb P. F., Huttenlocher D. P.: Pictorial structures for object recognition. Int. J. Comput. Vision 61, 1 (2005), 55–79.
[FPH05] Feng X., Pietikäinen M., Hadid A.: Facial expression recognition with local binary patterns and linear programming. Pattern Recognition and Image Analysis 15, 2 (January 2005), 546–548.
[FPZ03] Fergus R., Perona P., Zisserman A.: Object class recognition by unsupervised scale-invariant learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (June 2003), vol. 2, pp. 264–271.
[FPZ05] Fergus R., Perona P., Zisserman A.: A sparse object category model for efficient learning and exhaustive recognition. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 1 (2005), pp. 380–387.
[FS95] Freund Y., Schapire R. E.: A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT ’95: Proceedings of the Second European Conference on Computational Learning Theory (1995), pp. 23–37.
[GD05a] Grauman K., Darrell T.: The pyramid match kernel: Discriminative classi-fication with sets of image features. In ICCV ’05: Proceedings of the Tenth IEEEInternational Conference on Computer Vision (Washington, DC, USA, 2005),IEEE Computer Society, pp. 1458–1465. xiii, 16, 20, 21, 22
[GD05b] Guo G., Dyer C.: Learning from examples in the small sample case: Face expression recognition. IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics (2005), 477–488. 24, 27, 89, 118
[GHP07] Griffin G., Holub A., Perona P.: Caltech-256 Object Category Dataset. Tech. Rep. 7694, California Institute of Technology, 2007. 10
[GK01] Gnedin A., Kerov S.: A characterization of gem distributions. Comb. Probab. Comput. 10, 3 (2001), 213–217. 68
[Gro05] Gross R.: Face Databases. Springer, New York, February 2005. 85
[Har54] Harris Z.: Distributional structure. Word 10, 23 (1954), 146–162. 14
[HFH∗09] Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I. H.: The weka data mining software: an update. SIGKDD Explorations 11, 1 (2009), 10–18. 60
[HL02] Hsu C.-W., Lin C.-J.: A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13, 2 (2002), 415–425. 62
[HL08] Huiskes M. J., Lew M. S.: The mir flickr retrieval evaluation. In MIR ’08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval (New York, NY, USA, 2008), ACM. 11
[Hof99] Hofmann T.: Probabilistic latent semantic indexing. In SIGIR ’99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 1999), ACM, pp. 50–57. 65
[HPA04] Hadid A., Pietikäinen M., Ahonen T.: A discriminative feature space for detecting and recognizing faces. In CVPR (Washington, DC, USA, June 2004), pp. 797–804. xiv, 25, 33, 44
[HS88] Harris C., Stephens M.: A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference (1988), pp. 147–151. 15
[HT98] Hastie T., Tibshirani R.: Classification by pairwise coupling. In NIPS ’97: Proceedings of the 1997 conference on Advances in neural information processing systems 10 (1998), pp. 507–513. 62
[JB08] Jenkins R., Burton A. M.: 100% accuracy in automatic face recognition. Science 319, 5862 (January 2008), 435. 9
[Jeb04] Jebara T.: Machine Learning: Discriminative and Generative. Kluwer Academic, 2004. 57, 60, 63
[JI10a] Ji Y., Idrissi K.: Learning from Essential Facial Parts and Local Features for Automatic Facial Expression Recognition. In CBMI, 8th International Workshop on Content-Based Multimedia Indexing (June 2010). 32, 76
[JI10b] Ji Y., Idrissi K.: Using Moments on Spatiotemporal Plane for Facial Expression Recognition. In 20th International Conference on Pattern Recognition (ICPR) (Istanbul, Turkey, Aug. 2010). 32, 76
[JIB09] Ji Y., Idrissi K., Baskurt A.: Object categorization using boosting within hierarchical bayesian model. In ICIP09 (Dec. 2009), pp. 317–320. 32, 56, 76
[JK09] Ji Y., Idrissi K.: Facial expression recognition by automatic facial parts position detection with boosted-lbp. In SITIS09 (Marrakech, Morocco, November 2009). 76, 105
[KBFS04] Kienzle W., Bakir G., Franz M., Schölkopf B.: Efficient approximations for support vector machines in object detection. In Pattern Recognition, vol. 3175 of Lecture Notes in Computer Science. 2004, pp. 54–61. 21
[KBFS05] Kienzle W., Bakir G. H., Franz M. O., Schölkopf B.: Face detection — efficient and rank deficient. In Advances in Neural Information Processing Systems 17 (Cambridge, MA, 2005), Saul L. K., Weiss Y., Bottou L., (Eds.), MIT Press, pp. 673–680. 105
[KCT00] Kanade T., Cohn J., Tian Y.-L.: Comprehensive database for facial expression analysis. In Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition (FG’00) (Grenoble, France, March 2000), pp. 46–53. 84, 87, 118
[KF08] Koutlas A., Fotiadis D.: A region based methodology for facial expression recognition. In BIOSIGNALS (2008), vol. 2, pp. 218–223. 25, 89, 118
[KJC08] Kim D. H., Jung S. U., Chung M. J.: Extension of cascaded simple feature based face detection to facial expression recognition. Pattern Recogn. Lett. 29, 11 (2008), 1621–1631. xiv, 27
[Kos94] Kosslyn S.: Image and Brain: The Resolution of the Imagery Debate. MIT Press,1994. 2
[KOY∗09] Kumano S., Otsuka K., Yamato J., Maeda E., Sato Y.: Pose-invariant facial expression recognition using variable-intensity templates. Int. J. Comput. Vision 83, 2 (2009), 178–194. 26, 29, 96, 121
[KP08] Koelstra S., Pantic M.: Non-rigid registration using free-form deformations for recognition of facial actions and their temporal dynamics. In Automatic Face and Gesture Recognition (2008), pp. 1–8. 24, 26
[KZP08] Kotsia I., Zafeiriou S., Pitas I.: Texture and shape information fusion for facial expression and facial action unit recognition. Pattern Recogn. 41, 3 (2008), 833–851. 26, 92, 120
[LAKG98] Lyons M., Akamatsu S., Kamachi M., Gyoba J.: Coding facial expressions with gabor wavelets. In FG ’98: Proceedings of the 3rd. International Conference on Face & Gesture Recognition (1998), pp. 200–205. 84, 85, 117
[LBH∗08] Laganière R., Bacco R., Hocevar A., Lambert P., Païs G., Ionescu B. E.: Video summarization from spatio-temporal features. In TVS ’08: Proceedings of the 2nd ACM TRECVid Video Summarization Workshop (New York, NY, USA, 2008), ACM, pp. 144–148. 26
[LBL09] Littlewort G. C., Bartlett M. S., Lee K.: Automatic coding of facial expressions displayed during posed and genuine pain. Image Vision Comput. 27, 12 (2009), 1797–1803. 25, 29
[Lew98] Lewis D. D.: Naive (bayes) at forty: The independence assumption in information retrieval. In ECML ’98: Proceedings of the 10th European Conference on Machine Learning (London, UK, 1998), pp. 4–15. 64
[LFCY06] Liao S., Fan W., Chung A., Yeung D.: Facial expression recognition using advanced local binary patterns, tsallis entropies and global appearance features. In ICIP (2006), pp. 665–668. 89, 118
[LJ08] Larlus D., Jurie F.: Combining appearance models and markov random fields for category level object segmentation. In CVPR (June 2008). xiii, 22, 78, 80, 100
[LLS04] Leibe B., Leonardis A., Schiele B.: Combined object categorization and segmentation with an implicit shape model. In Proceedings of the Workshop on Statistical Learning in Computer Vision (Prague, Czech Republic, May 2004). 11
[Low99a] Lowe D. G.: Object recognition from local scale-invariant features. In ICCV (1999), p. 1150. xiv, 34, 35
[Low99b] Lowe D. G.: Object recognition from local scale-invariant features. In ICCV (1999), p. 1150. 15, 18, 41
[LP05] Li F.-F., Perona P.: A bayesian hierarchical model for learning natural scene categories. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 2 (Washington, DC, USA, 2005), pp. 524–531. xiii, 15, 16, 17, 18, 64
[Mar82] Marr D.: Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W.H. Freeman, 1982. 12
[MB98] Martinez A., Benavente R.: The AR Face Database. Tech. Rep. 24, Computer Vision Center, Barcelona, Spain, June 1998. 84
[MB07] Moore S., Bowden R.: Automatic facial expression recognition using boosted discriminatory classifiers. In IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFG), ICCV (2007), pp. 71–83. 92, 118, 119
[MB09] Moore S., Bowden R.: The effect of pose on facial expression recognition. In BMVC09 (2009), pp. xx–yy. 26, 30
[MCUP04] Matas J., Chum O., Urban M., Pajdla T.: Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing 22, 10 (2004), 761–767. British Machine Vision Computing 2002. 15
[MR83] McLeod P. L., Rosenthal R.: Micromomentary movement and the decoding of face and body cues. Journal of Nonverbal Behavior 8 (1983), 83–90. 102
[MRD05] Mattern F., Rohlfing T., Denzler J.: Adaptive performance-based classifier combination for generic object recognition. In Proceedings of the 10th International Fall Workshop Vision, Modeling, and Visualization (Erlangen, Germany, Nov. 2005), pp. 139–146. 21
[MS05] Mikolajczyk K., Schmid C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis & Machine Intelligence 27, 10 (October 2005), 1615–1630. 79
[MW09] Matsumoto D., Willingham B.: Spontaneous facial expressions of emotion of congenitally and noncongenitally blind individuals. Journal of Personality and Social Psychology 96, 1 (2009), 1–10. 83
[MWR∗08] Mayer C., Wimmer M., Riaz Z., Roth A., Eggers M., Radig B.: Real time system for model-based interpretation of the dynamics of facial expression. In Automatic Face and Gesture Recognition (2008), pp. 1–2. 38
[Nal04] Nallapati R.: Discriminative models for information retrieval. In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (2004), pp. 64–71. 58
[OFPA04] Opelt A., Fussenegger M., Pinz A., Auer P.: Generic object recognition with boosting. Tech. Rep. TR-EMT-2004-01, EMT, TU Graz, Austria, 2004. Submitted to the IEEE Transactions on Pattern Analysis and Machine Intelligence. 11, 17
[PB04] Pozdnoukhov A., Bengio S.: Tangent vector kernels for invariant image classification with svms. In ICPR04 (2004), pp. III: 486–489. 21
[PB07] Pantic M., Bartlett M. S.: Machine Analysis of Facial Expressions. I-Tech Education and Publishing, Vienna, Austria, July 2007. xiii, xiv, 24, 33
[PBE∗06] Ponce J., Berg T. L., Everingham M., Forsyth D. A., Hebert M., Lazebnik S., Marszalek M., Schmid C., Russell B. C., Torralba A., Williams C. K. I., Zhang J., Zisserman A.: Dataset issues in object recognition. In Toward Category-Level Object Recognition, volume 4170 of LNCS (2006), Springer, pp. 29–48. 77
[Pin05] Pinz A.: Object categorization. Found. Trends. Comput. Graph. Vis. 1, 4 (2005), 255–353. xiii, 9, 16, 20
[PK09] Park S., Kim D.: Subtle facial expression recognition using motion magnification. Pattern Recogn. Lett. 30, 7 (2009), 708–716. 28, 48
[PR00] Pantic M., Rothkrantz L.: Automatic analysis of facial expressions: The state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000), 1424–1445. 23
[PSK08] Park S., Shin J., Kim D.: Facial expression analysis with facial expression deformation. In 19th International Conference on Pattern Recognition (ICPR 2008) (Dec. 2008), pp. 1–4. 27
[PVRM05] Pantic M., Valstar M., Rademaker R., Maat L.: Web based database for facial expression analysis. In Proc. IEEE Intl Conf. Multimedia and Expo (2005), pp. 317–321. 84, 86, 118
[PY09] Yang P., Liu Q., Metaxas D. N.: Rankboost with l1 regularization for facial expression recognition and intensity estimation. In ICCV (Kyoto, Japan, September 2009), pp. 1018–1025. 92, 120
[RD09] Raducanu B., Dornaika F.: Natural facial expression recognition using dynamic and static schemes. In ISVC ’09: Proceedings of the 5th International Symposium on Advances in Visual Computing (2009), pp. 730–739. xiv, 28, 29, 121
[RLAB10] Revaud J., Lavoué G., Ariki Y., Baskurt A.: Learning an efficient and robust graph matching procedure for specific object recognition. In International Conference on Pattern Recognition (ICPR) (Aug. 2010). 20
[RLSP06] Rothganger F., Lazebnik S., Schmid C., Ponce J.: 3d object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. Int. J. Comput. Vision 66, 3 (2006), 231–259. 10, 13
[RMG∗76] Rosch E., Mervis C. B., Gray W. D., Johnson D. M., Boyes-Braem P.: Basic objects in natural categories. Cognitive Psychology 8, 3 (1976), 382–439. 10
[Rob63] Roberts L. G.: Machine Perception of Three-Dimensional Solids. Outstanding Dissertations in the Computer Sciences. Garland Publishing, New York, 1963. xiii, 12, 13
[RSNM03] Raina R., Shen Y., Ng A. Y., Mccallum A.: Classification with hybrid generative/discriminative models. In Advances in Neural Information Processing Systems 16 (2003), MIT Press. 70
[RTMF08] Russell B. C., Torralba A., Murphy K. P., Freeman W. T.: Labelme: A database and web-based tool for image annotation. Int. J. Comput. Vision 77, 1-3 (2008), 157–173. 12
[SC09] Shang L., Chan K.-P.: Nonparametric discriminant hmm and application to facial expression recognition. In CVPR (Miami, FL, USA, June 2009), pp. 2090–2096. 92, 120
[SGM09] Shan C., Gong S., McOwan P.: Facial expression recognition based on local binary patterns: A comprehensive study. Image and Vision Computing 27, 6 (May 2009), 803–816. 24, 25, 27, 91, 92, 93, 118, 119, 120
[SMB00] Schmid C., Mohr R., Bauckhage C.: Evaluation of interest point detectors.Int. J. Comput. Vision 37, 2 (2000), 151–172. 15
[SRE∗05a] Sivic J., Russell B., Efros A. A., Zisserman A., Freeman B.: Discovering objects and their location in images. In International Conference on Computer Vision (ICCV 2005) (October 2005). xiii, 16, 17, 20
[SRE∗05b] Sivic J., Russell B. C., Efros A. A., Zisserman A., Freeman W. T.: Discovering object categories in image collections. In Proceedings of the International Conference on Computer Vision (2005). 64
[SSF78] Suwa M., Sugie N., Fujimora K.: A preliminary note on pattern recognition of human emotional expression. In the Fourth International Joint Conference on Pattern Recognition (1978), pp. 408–410. 23
[STFW05] Sudderth E., Torralba A., Freeman W., Willsky A.: Learning hierarchical models of scenes, objects, and parts. In ICCV05 (2005), pp. II: 1331–1338. 18, 20
[SVD09] Siddiquie B., Vitaladevuni S., Davis L.: Combining multiple kernels for efficient image classification. In WACV09 (2009), pp. 1–8. 21
[TA07] Tripathi R., Aravind R.: Recognizing facial expression using particle filter based feature points tracker. In PReMI’07: Proceedings of the 2nd international conference on Pattern recognition and machine intelligence (2007), pp. 584–591. 91, 118
[TCJ10] Tong Y., Chen J., Ji Q.: A unified probabilistic framework for spontaneous facial action modeling and understanding. IEEE Trans. Pattern Anal. Mach. Intell. 32, 2 (2010), 258–273. 29
[TJBB06] Teh Y. W., Jordan M. I., Beal M. J., Blei D. M.: Hierarchical Dirichlet processes. Journal of the American Statistical Association 101, 476 (2006), 1566–1581. 17, 56, 66, 68, 70, 79, 80, 112, 113
[TKC05] Tian Y., Kanade T., Cohn J.: Handbook of Face Recognition. Springer, New York, 2005. 24
[TLJ07] Tong Y., Liao W., Ji Q.: Facial action unit recognition by exploiting their dynamic and semantic relationships. IEEE Trans. Pattern Anal. Mach. Intell. 29, 10 (2007), 1683–1699. 84
[UB05] Ulusoy I., Bishop C. M.: Generative versus discriminative methods for object recognition. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 2 (2005), pp. 258–265. 58, 64
[VJ01] Viola P., Jones M.: Rapid object detection using a boosted cascade of simple features. In CVPR (2001), vol. 1, pp. 511–518. xiii, 4, 20, 21, 25, 38, 59, 73, 81, 116
[VKM09] Venkatesh Y., Kassim A. A., Murthy O. R.: A novel approach to classification of facial expressions from 3d-mesh datasets using modified pca. Pattern Recognition Letters 30, 12 (2009), 1128–1137. 13, 25
[VNP09] Vretos N., Nikolaidis N., Pitas I.: A model-based facial expression recognition algorithm using principal components analysis. In ICIP (2009), pp. 3301–3304. 28, 92, 120
[VNU03] Vidal-Naquet M., Ullman S.: Object recognition with informative features and linear classification. In ICCV (2003), pp. 281–288. 15
[VS04] Vogel J., Schiele B.: A semantic typicality measure for natural scene categorization. In DAGM-Symposium (2004), pp. 195–203. 15
[WLF∗09] Whitehill J., Littlewort G., Fasel I., Bartlett M., Movellan J.: Toward practical smile detection. IEEE Trans. Pattern Anal. Mach. Intell. 31, 11 (2009), 2106–2111. 30
[WMG09] Wang X., Ma X., Grimson W. E. L.: Unsupervised activity perception in crowded and complicated scenes using hierarchical bayesian models. IEEE Trans. Pattern Anal. Mach. Intell. 31, 3 (2009), 539–555. 22
[WS06] Winn J., Shotton J.: The layout consistent random field for recognizing and segmenting partially occluded objects. In Proceedings of IEEE CVPR (2006), pp. 37–44. 19
[WZFF06] Wang G., Zhang Y., Fei-Fei L.: Using dependent regions for object categorization in a generative framework. In CVPR ’06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2006), pp. 1597–1604. 18, 100, 112
[XL09] Xie X., Lam K.-M.: Facial expression recognition based on shape and texture. Pattern Recogn. 42, 5 (2009), 1003–1011. 25
[XLC08] Xiang T., Leung M., Cho S.: Expression recognition using fuzzy spatio-temporal modeling. Pattern Recognition 41, 1 (2008), 204–216. 26, 48, 96, 121
[YBS04] Yeasin M., Bullot B., Sharma R.: From facial expression to level of interest: A spatio-temporal approach. CVPR 2 (2004), 922–927. 26, 96, 121
[YLM09] Yang P., Liu Q., Metaxas D. N.: Boosting encoded dynamic features for facial expression recognition. Pattern Recogn. Lett. 30, 2 (2009), 132–139. 25
[YWS∗06] Yin L., Wei X., Sun Y., Wang J., Rosato M. J.: A 3d facial expression database for facial behavior research. In FGR ’06: Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (Washington, DC, USA, 2006), IEEE Computer Society, pp. 211–216. 84
[ZP09] Zhao G., Pietikäinen M.: Boosted multi-resolution spatiotemporal descriptors for facial expression recognition. Pattern Recognition Letters 30, 12 (2009), 1117–1127. 26, 27, 48, 96, 107, 121
[ZPRH09] Zeng Z., Pantic M., Roisman G. I., Huang T. S.: A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell. 31, 1 (2009), 39–58. 24, 29
[ZSK02] Zhu Y., Silva L. D., Ko C.: Using moment invariants and hmm in facial expression recognition. Pattern Recognition Letters 23, 1-3 (January 2002), 83–91. 26
[ZTLH09] Zheng W., Tang H., Lin Z., Huang T. S.: A novel approach to expression recognition from non-frontal face images. In ICCV (2009), pp. 1901–1908. 26
[ZVM04] Zhu J., Vai M. I., Mak P. U.: A new enhanced nearest feature space (enfs) classifier for gabor wavelets features-based face recognition. In ICBA (2004), pp. 124–131. 46
[ZYZS05] Zhang W., Yu B., Zelinsky G. J., Samaras D.: Object class recognition using multiple layer boosting with heterogeneous features. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 2 (Washington, DC, USA, 2005), IEEE Computer Society, pp. 323–330. 20
Author’s Publications
International Conferences
1. Y. Ji, K. Idrissi, A. Baskurt. Object Categorization Using Boosting Within Hierarchical Bayesian Model. In ICIP, International Conference on Image Processing, Cairo, Egypt, 2009.

2. Y. Ji, K. Idrissi. Facial Expression Recognition by Self-Identification for Video Sequence. In International Conference on Signal-Image Technology and Internet-Based Systems, Marrakech, Morocco, 2009.

3. Y. Ji, K. Idrissi. Learning from Essential Facial Parts and Local Features for Automatic Facial Expression Recognition. In CBMI, 8th International Workshop on Content-Based Multimedia Indexing, Grenoble, France, 2010.

4. Y. Ji, K. Idrissi. Using Moments on Spatiotemporal Plane for Facial Expression Recognition. In ICPR, 20th International Conference on Pattern Recognition, Istanbul, Turkey, 2010.
Workshop

GDR ISIS - Theme B: Image and Vision, action "Visage, geste, action et comportement" (face, gesture, action and behavior). Presentation at the workshop day of 08/12/2009, Télécom ParisTech - Room E800.