
HAVE 2008 - IEEE International Workshop on Haptic Audio Visual Environments and their Applications
Ottawa - Canada, 18-19 October 2008

Combining Integral Projection and Gabor Transformation for Automatic Facial Feature Detection and Extraction

Yisu Zhao, Xiaojun Shen, Nicolas D. Georganas
Distributed and Collaborative Virtual Environments Research Laboratory (DISCOVER)

School of Information Technology and Engineering, University of Ottawa, K1N 6N5

Canada
E-mail: {yzhao/shen/georganas}@discover.uottawa.ca

Abstract - In order to achieve subject-independent facial feature detection and extraction and obtain robustness against illumination variety, a novel method combining integral projection and Gabor transformation is presented in this paper. First, to avoid manually picked expression features, we employ image binarization and gray-level integral projection to detect and locate the exact position of human facial features automatically. Second, we segment the extracted areas into small cells of 7 x 7 pixels each and apply the Gabor transformation on each cell. This greatly reduces the execution time of the Gabor transformation while retaining important information. Finally, a Support Vector Machine is used for classifying facial emotions and, when tested on the JAFFE database, the method achieves a high recognition rate of 94%.

Keywords - facial feature detection and extraction, integral projection, Gabor transformation.

INTRODUCTION

Human Computer Intelligent Interaction (HCII) is an emerging field of computer science which aims at training the computer to interact with a human more naturally instead of using the keyboard and the mouse. This means that the computer should have the ability to understand the emotional states of human users. The most expressive way humans display their emotional state is through facial expressions. Research in social psychology [1-5] suggests that facial expressions form the major modality in human communication and are a visible manifestation of the affective state of a person. In particular, facial expression recognition, which conveys non-verbal communication cues, has been gaining momentum in the area of Affective Computing [6-10] during the past two decades. Many applications, such as virtual reality, video-conferencing, and synthetic face animation, require efficient facial expression recognition in order to achieve the desired results [11]. The facial expressions under examination were defined by psychologists as a set of six universal facial emotions: happiness, sadness, anger, disgust, fear, and surprise [12].

A generic facial expression recognition system usually has a sequential configuration of processing steps: face detection, pre-processing, feature detection, feature extraction and classification. This work focuses on automatic facial feature detection, location and extraction, and proposes a new method combining integral projection and Gabor transformation. The detected facial regions are segmented into small cells and convolved with the Gabor wavelets. The extracted features are then used as the input of a Support Vector Machine for expression recognition.

METHODOLOGY

1. Face Detection

In order to build a system capable of automatically capturing facial feature positions in a face scene, the first step is to detect and extract the human face from the background image. We make use of a robust and automated real-time face detection scheme proposed by Viola and Jones [15][16], which consists of a cascade of classifiers trained by AdaBoost. In their algorithm, the concept of the "integral image" is also introduced to compute a rich set of Haar-like features (see Fig 1). Each classifier employs the integral image filters, which allows the features to be computed very fast at any location and scale. For each stage in the cascade, a subset of features is chosen using a feature selection procedure based on AdaBoost. The Viola-Jones algorithm is approximately 15 times faster than previous approaches while achieving accuracy equivalent to the best published results [15]. Fig 2 shows the detection of the human face using the Viola-Jones algorithm.
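Cascade detectors of this kind are available in common vision libraries. A minimal sketch, assuming OpenCV's Python bindings and its bundled frontal-face Haar cascade; the file name, parameters, and input image are illustrative, not taken from the paper:

```python
import cv2

# Load the pretrained Haar cascade shipped with OpenCV (a Viola-Jones style detector).
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

def detect_faces(frame):
    """Return bounding boxes (x, y, w, h) of detected faces in a BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # scaleFactor and minNeighbors are typical defaults, not values from the paper.
    return face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                         minSize=(64, 64))

if __name__ == "__main__":
    image = cv2.imread("subject.jpg")           # hypothetical input image
    for (x, y, w, h) in detect_faces(image):
        face = image[y:y + h, x:x + w]          # cropped face region for the later steps
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```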

2. Pre-processing

Since input images are affected by the type of camera, illumination conditions, background information and so on, we need to normalize the face images before feature detection and extraction. The aim of pre-processing is to eliminate the differences between input images as far as possible, so that we can detect and extract features under the same conditions. Expression representation can be sensitive to translation, scaling, and rotation of the head in an image. To combat the effect of these unwanted transformations, the pre-processing steps are:

1) Transform the face video into face images.
2) Convert the input color images into gray-scale images.
3) Normalize the face images to the same size of 128 x 128 pixels. Scale normalization is used to align all facial features. We use the Lanczos resampling [17] method to resize images.
4) Smooth face images to remove noise by using the mean-based fast median filter [18].


5) Perform gray-scale histogram equalization [19] to reduce the influence of illumination variety and ethnicity. Although the Gabor transformation is insensitive to illumination variety, histogram equalization still improves the results. A sketch of this pipeline is given below.
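A minimal sketch of steps 2) to 5), assuming OpenCV; the plain median filter stands in for the mean-based fast median filter of [18], and the function name is illustrative:

```python
import cv2

def preprocess_face(face_bgr):
    """Normalize a cropped face image: gray-scale, 128 x 128 Lanczos resize,
    median smoothing, and gray-level histogram equalization."""
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)                          # step 2
    resized = cv2.resize(gray, (128, 128), interpolation=cv2.INTER_LANCZOS4)   # step 3
    smoothed = cv2.medianBlur(resized, 3)    # step 4: plain median filter as a stand-in
    return cv2.equalizeHist(smoothed)        # step 5: histogram equalization
```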

Fig 1. A set of Haar-like features.

Fig 2. Face detection using Haar-like features: a) input video, b) detection, c) detected face.

3. Facial Feature Detection and Location

The most important areas in human faces for classifying expression are the eyes, eyebrows and mouth. Other areas of the human face contribute little or even encumber facial expression recognition; hence we should refine useful information and abandon useless information. In this paper, we employ image binarization and gray-level integral projection as the methods for automatic detection of human facial features.

3.1 Image Binarization

The first step of facial feature detection is to convert the input image into a binary image. Image binarization is one of the main techniques for image segmentation. It segments an image into foreground and background. The image will only appear in two gray levels: the brightest level 255 and the darkest level 0. The foreground contains the information of interest. The most important part of image binarization is threshold selection. Image thresholding is a useful method in many image processing tasks. We use a nonparametric and unsupervised method of automatic threshold selection, called the Otsu method [20]. This method has been widely used as the classical technique in thresholding tasks since it is not sensitive to non-uniform illumination. The main idea of Otsu is to dichotomize the gray-level histogram into two classes by a threshold level. Let an image be represented in $K$ gray levels. Then the probability of the occurrence of level $i$ is

$$P(i) = \frac{n_i}{n} \qquad (1)$$

where $n$ is the total number of pixels in the image and $n_i$ is the number of pixels at level $i$. At level $m$, we dichotomize the pixels into two classes $S_0 = \{0, \ldots, m\}$ and $S_1 = \{m+1, \ldots, k\}$; the probabilities of class occurrence and the mean levels are as follows. The probability of $S_0$ is

$$P(S_0) = \sum_{i=0}^{m} P(i) \qquad (2)$$

The probability of $S_1$ is

$$P(S_1) = \sum_{i=m+1}^{k} P(i) \qquad (3)$$

The mean level of $S_0$ is

$$E_{S_0} = \sum_{i=0}^{m} \frac{i\, n_i}{n_{s_0}}, \quad \text{where } n_{s_0} = \sum_{i=0}^{m} n_i \qquad (4)$$

The mean level of $S_1$ is

$$E_{S_1} = \sum_{i=m+1}^{k} \frac{i\, n_i}{n_{s_1}}, \quad \text{where } n_{s_1} = \sum_{i=m+1}^{k} n_i \qquad (5)$$

The variance is

$$\sigma = \sum_{i=0}^{1} \left(E_{S_i} - E_S\right)^2 P(S_i)$$

where $E_S$ is the mean level of the whole image. The value of $m$ that maximizes the variance $\sigma$ over the range $0$ to $K$ is the threshold we are looking for. Using this threshold, we convert the original picture into a binary image, that is, an image with pixel values 0 and 255, representing black and white respectively. The black is called the foreground, which contains the facial feature information we are interested in, while the white background is ignored as useless information. Fig 3 shows the result after image binarization.

Fig 3. Conversion into a binary image.
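A minimal NumPy sketch of this Otsu-style binarization, following equations (1)-(5); OpenCV's built-in cv2.threshold with THRESH_OTSU could be used instead, and the function name here is illustrative:

```python
import numpy as np

def otsu_binarize(gray):
    """Binarize an 8-bit gray image with an Otsu-style threshold (eqs. 1-5)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()                       # P(i), eq. (1)
    levels = np.arange(256, dtype=np.float64)
    e_total = (levels * p).sum()                # mean level of the whole image

    best_m, best_var = 0, -1.0
    for m in range(1, 255):
        p0, p1 = p[:m + 1].sum(), p[m + 1:].sum()        # P(S0), P(S1), eqs. (2)-(3)
        if p0 == 0 or p1 == 0:
            continue
        e0 = (levels[:m + 1] * p[:m + 1]).sum() / p0     # mean level of S0, eq. (4)
        e1 = (levels[m + 1:] * p[m + 1:]).sum() / p1     # mean level of S1, eq. (5)
        var = p0 * (e0 - e_total) ** 2 + p1 * (e1 - e_total) ** 2  # between-class variance
        if var > best_var:
            best_var, best_m = var, m

    # Dark facial features become the black foreground (0), the rest becomes white (255).
    return np.where(gray <= best_m, 0, 255).astype(np.uint8)
```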


Fig 4. Integral projection curves: a) original projection curves, b) curves after smoothing.

Fig 5. Facial Feature Detection

3.2 Integral Projection Curve

An integral projection method [21, 22] is usually used in the detection of face and eye features. We propose that it is also well suited to facial feature position detection, including eyebrow, eye, and mouth detection. By applying the integral projection method on binary images, we can detect the exact position of the facial feature areas automatically. Let $I(x, y)$ be the gray value of an image. The horizontal integral projection over the interval $[y_1, y_2]$ and the vertical projection over the interval $[x_1, x_2]$ can be defined as $H(y)$ and $V(x)$ respectively:

$$H(y) = \frac{1}{x_2 - x_1} \sum_{x = x_1}^{x_2} I(x, y) \qquad (6)$$

$$V(x) = \frac{1}{y_2 - y_1} \sum_{y = y_1}^{y_2} I(x, y) \qquad (7)$$

The horizontal projection indicates the x-coordinates of the eyebrows, eyes, and mouth. Taking the x-coordinate of the eyes as the central point and double the length from eyebrow to eye as the region, the vertical projection indicates the y-coordinates of the left eye and right eye. Since the original integral projection curves are irregular, we smooth them with Bezier curves [23, 24], which are used in computer graphics to model smooth curves at all scales. For any four points $A(x_A, y_A)$, $B(x_B, y_B)$, $C(x_C, y_C)$, $D(x_D, y_D)$, the curve starts at $A(x_A, y_A)$ and ends at $D(x_D, y_D)$, the so-called end points; $B(x_B, y_B)$ and $C(x_C, y_C)$ are called the control points. Therefore, any coordinate $(x_t, y_t)$ on the curve is

$$x_t = x_A (1-t)^3 + 3 x_B (1-t)^2 t + 3 x_C (1-t) t^2 + x_D t^3 \qquad (8)$$

$$y_t = y_A (1-t)^3 + 3 y_B (1-t)^2 t + 3 y_C (1-t) t^2 + y_D t^3 \qquad (9)$$

Fig 4 a) shows the original horizontal and vertical integral projection curves; b) shows the projection curves after smoothing. We detect the exact position of facial features by observing the smoothed projection curves (see Fig 5). For the horizontal projection curve, from the top to the bottom, the first minimum value represents the x-coordinate of the eyebrow and the second minimum value represents the x-coordinate of the eye; from the bottom to the top, the first maximum value represents the x-coordinate of the mouth. For the vertical projection curve, from the middle to both sides, the minimum values represent the y-coordinates of the left eye and right eye respectively.
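A minimal NumPy sketch of the projection curves of equations (6)-(7) and the cubic Bezier point of equations (8)-(9); function names are illustrative, and a real smoothing pass would fit Bezier segments to the raw curves:

```python
import numpy as np

def horizontal_projection(binary, x1, x2):
    """H(y): mean gray value of each row over columns x1..x2 (eq. 6)."""
    return binary[:, x1:x2 + 1].mean(axis=1)

def vertical_projection(binary, y1, y2):
    """V(x): mean gray value of each column over rows y1..y2 (eq. 7)."""
    return binary[y1:y2 + 1, :].mean(axis=0)

def bezier_point(a, b, c, d, t):
    """Point on a cubic Bezier with end points a, d and control points b, c (eqs. 8-9)."""
    a, b, c, d = map(np.asarray, (a, b, c, d))
    return ((1 - t) ** 3) * a + 3 * ((1 - t) ** 2) * t * b \
        + 3 * (1 - t) * (t ** 2) * c + (t ** 3) * d

# Example: the two deepest minima of H(y) are candidate eyebrow/eye rows.
# h = horizontal_projection(binary_face, 0, binary_face.shape[1] - 1)
# candidate_rows = np.argsort(h)[:2]
```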

4. Facial Feature Extraction

Gabor wavelets are now used widely in various computer vision applications due to their robustness against local distortions caused by illumination variance [25]. They are used to extract the appearance changes as a set of multi-scale and multi-orientation coefficients. A novelty of this paper is combining integral projection and Gabor transformations in order to achieve subject-independent facial feature detection and extraction. Compared to the traditional Fourier transformation, Gabor wavelets can easily adjust the spatial and frequency properties to extract facial features and analyze the results at different granularities. The Gabor wavelet $\psi_{\mu,\nu}(z)$ is defined as

$$\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2} \exp\!\left(-\frac{\|k_{\mu,\nu}\|^2 \|z\|^2}{2\sigma^2}\right) \left[\exp(i\, k_{\mu,\nu} \cdot z) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right] \qquad (10)$$

where $z = (x, y)$ is the point with horizontal coordinate $x$ and vertical coordinate $y$, and $\sigma$ is the standard deviation of the Gaussian window in the kernel, which determines the Gaussian window width. There are three parameters of a Gabor kernel: location, frequency and orientation. The vector $k_{\mu,\nu}$ stands for the frequency vector, and $\mu$, $\nu$ define the orientation and the scale of the Gabor kernel,

$$k_{\mu,\nu} = k_\nu e^{i\phi_\mu} \qquad (11)$$

where $k_\nu = k_{\max} / f^{\nu}$, $f = \sqrt{2}$, and $\phi_\mu = \mu\pi / n$ if $n$ different orientations are chosen, while $k_{\max}$ is the maximum frequency and $f$ is the spacing factor between kernels in the frequency domain.
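A minimal NumPy sketch of a kernel bank built from equations (10)-(11) with 3 scales and 6 orientations; the values of k_max, sigma, and the kernel size are common choices from the Gabor face-analysis literature, not values stated in the paper:

```python
import numpy as np

def gabor_kernel(mu, nu, size=21, sigma=2 * np.pi, k_max=np.pi / 2, f=np.sqrt(2)):
    """Complex Gabor kernel psi_{mu,nu} sampled on a size x size grid (eqs. 10-11)."""
    k = k_max / (f ** nu)                       # k_nu
    phi = mu * np.pi / 6.0                      # phi_mu for n = 6 orientations
    kx, ky = k * np.cos(phi), k * np.sin(phi)   # frequency vector k_{mu,nu}

    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    sq = x ** 2 + y ** 2
    envelope = (k ** 2 / sigma ** 2) * np.exp(-k ** 2 * sq / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2.0)  # DC-free wave
    return envelope * carrier

# Bank of 3 scales x 6 orientations = 18 kernels, as used in the paper.
bank = [gabor_kernel(mu, nu) for nu in range(3) for mu in range(6)]
```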


Fig 6. Facial image response to Gabor filters.

Fig 7. Dividing expression areas into small cells.

Fig 8. Flowchart of the feature extraction stage of a facial image.

Fig 9. Image examples of the JAFFE database. From left to right: angry, disgust, fear, happy, sad, surprise, and neutral.

Since the full convolution of face images with the different Gabor kernels is very costly in terms of processing time, we segment the detected expression region into small cells (see Fig 7), 7 x 7 pixels for each cell, and apply the Gabor transformation on each of these cells instead of on the whole extracted areas. The expression region is 40 x 36 pixels for each eye and 60 x 28 pixels for the mouth, with 7 x 7 pixels for each cell. Therefore, for one image, we have 3 x 6 = 18 Gabor transformations. For each transformation, we have

$$\frac{40 \times 36}{7 \times 7} \times 2 + \frac{60 \times 28}{7 \times 7} \approx 96$$

features. Therefore, for each image, we have 18 x 96 = 1728 features. Fig 8 displays the whole process of facial feature detection and extraction. After feature detection and extraction, the facial feature vectors are processed by a Support Vector Machine (SVM) [27], and we obtain a recognition rate of 94% when tested on the JAFFE database. The JAFFE database [26] used in our experiments contains a total of 219 images of ten Japanese women. Each person has four images for each of the seven expressions: happy, sad, surprise, anger, disgust, fear and neutral. Fig 9 shows an example of the facial expressions in the database.
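A minimal sketch of the cell-wise Gabor feature extraction and SVM classification, assuming the kernel bank above, SciPy for the convolution, and scikit-learn's SVC (which wraps LIBSVM [27]); the region sizes follow the paper, while the per-cell magnitude statistic and all names are illustrative:

```python
import numpy as np
from scipy.signal import fftconvolve
from sklearn.svm import SVC

def cell_gabor_features(region, kernel, cell=7):
    """Convolve one expression region with one Gabor kernel, cell by cell,
    keeping the mean response magnitude of each 7 x 7 cell as a feature."""
    feats = []
    for r in range(0, region.shape[0] - cell + 1, cell):
        for c in range(0, region.shape[1] - cell + 1, cell):
            patch = region[r:r + cell, c:c + cell].astype(np.float64)
            response = fftconvolve(patch, kernel, mode="same")
            feats.append(np.abs(response).mean())
    return feats

def image_features(left_eye, right_eye, mouth, bank):
    """Concatenate cell features of the two 40 x 36 eye regions and the
    60 x 28 mouth region over all 18 kernels."""
    feats = []
    for kernel in bank:
        for region in (left_eye, right_eye, mouth):
            feats.extend(cell_gabor_features(region, kernel))
    return np.array(feats)

# X: stacked feature vectors, y: expression labels (hypothetical training data).
# clf = SVC(kernel="rbf").fit(X, y)
# prediction = clf.predict(image_features(le, re, mo, bank).reshape(1, -1))
```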

CONCLUSION AND FUTURE WORK

In this paper, we present a novel feature detection and extraction approach based on integral projection and Gabor transformation. The features are first detected and located by integral projection curves and then extracted by a Gabor wavelet transformation. A family of Gabor wavelets with 3 scales and 6 orientations is generated with the standard Gabor kernel. We segment the detected expression areas into small cells and convolve them with the Gabor wavelets, so that the original images are transformed into vectors of Gabor wavelet features. These vectors are then used as the input of a Support Vector Machine for expression classification. When tested on the JAFFE database, we obtain a high recognition rate of 94%.

In future work, we will discuss in detail the parameter selection that affects the recognition rate. These parameters include the dimensions of the facial expression areas, the cell granularity of the expression areas, and the scale and orientation parameters of the Gabor kernel. Moreover, the reasons that cause some incorrect expression recognitions will also be analyzed.

REFERENCES

[1] K. Matsumura, Y. Nakamura, and K. Matsui, "Mathematical representation and image generation of human faces by metamorphosis," Electron. Commun. Jpn., vol. 80, pp. 36-46, 1997.

[2] P. Ekman, "Facial expression and emotion," Am. Psychol., vol. 48, pp. 384-392, 1993.

[3] P. Ekman, "Strong evidence for universals in facial expressions: a reply to Russell's mistaken critique," Psychol. Bull., vol. 115, pp. 268-287, 1994.

[4] P. Ekman, "Emotions inside out. 130 years after Darwin's 'The Expression of the Emotions in Man and Animals'," Ann. NY Acad. Sci., vol. 1000, pp. 1-6, 2003.

[5] M. Pantic and L. J. M. Rothkrantz, "Automatic analysis of facial expressions: the state of the art," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, pp. 1424-1445, 2000.

[6] A. Kapoor, S. Mota, and R. Picard, "Toward a learning companion that recognizes affect," presented at the Amer. Assoc. Artificial Intelligence, 2001.

[7] R. Picard, "Toward agents that recognize emotion," in Proc. IMAGINA, 1998.

[8] R. Picard, Affective Computing. Cambridge, MA: MIT Press, 1997.

[9] R. W. Picard and T. Kabir, "Finding similar patterns in large image databases," in Proc. IEEE ICASSP, Minneapolis, MN, 1993, pp. 161-164.

[10] R. W. Picard, Affective Computing. Cambridge, MA: MIT Press, 1997.

[11] B. Fasel and J. Luettin, "Automatic facial expression analysis: a survey," Pattern Recognit., vol. 36, no. 1, pp. 259-275, 2003.

[12] P. Ekman and W. V. Friesen, Emotion in the Human Face. Englewood Cliffs, NJ: Prentice-Hall, 1975.

[13] M. Pantic and L. J. M. Rothkrantz, "Automatic analysis of facial expressions: the state of the art," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1424-1445, Dec. 2000.

[14] B. Fasel and J. Luettin, "Automatic facial expression analysis: a survey," Pattern Recognit., vol. 36, no. 1, pp. 259-275, 2003.

[15] P. Viola and M. Jones, "Robust real-time object detection," Cambridge Research Laboratory Technical Report Series CRL 2001/01, pp. 1-24, 2001.

[16] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.

[17] K. Turkowski, "Filters for common resampling tasks," Apple Computer, 10 April 1990.

[18] L. Zhang, Z. Chen, and W. Gao, "Mean-based fast median filter," Journal of Tsinghua University, vol. 44, no. 9, pp. 1157-1159, 2004.

[19] J. Sumbera, "Histogram equalization," CS-4802 Digital Image Processing, Lab #2.

[20] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Syst., Man, Cybern., vol. 9, no. 1, pp. 62-66, 1979.

[21] G. G. Mateos, A. Ruiz, and P. E. Lopez-de-Teruel, "Face detection using integral projection models," in Proc. Joint IAPR International Workshops on Structural, Syntactic, and Statistical Pattern Recognition, 2002, pp. 644-653.

[22] Z. H. Zhou and X. Geng, "Projection functions for eye detection," Pattern Recognit., vol. 37, pp. 1049-1056, 2004.

[23] T. Sederberg, BYU, "Bezier curves," http://www.tsplines.com/resources/class_notes/Bezier_curves.pdf

[24] J. D. Foley et al., Computer Graphics: Principles and Practice in C, 2nd ed., Addison Wesley, 1992.

[25] L. Shen and L. Bai, "A review on Gabor wavelets for face recognition," Pattern Anal. Applic., vol. 9, pp. 273-292, 2006.

[26] M. J. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, "Coding facial expressions with Gabor wavelets," in Proc. 3rd IEEE Int. Conf. Automatic Face and Gesture Recognition, 1998, pp. 200-205.

[27] C. C. Chang and C. J. Lin, LIBSVM: a library for support vector machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html