
Pose-corrected Face Processing on Video Sequences for Webcam-based Remote Biometric Authentication

José Luis Alba-Castro, Daniel González-Jiménez, Enrique Argones-Rúa, Elisardo González-Agulla, Enrique Otero-Muras and Carmen García-Mateo∗†

June 28, 2007

Abstract

In this paper we describe a framework for face-based biometric user authentication to web resources through secured client-server sessions. A novel front-end for processing face video sequences is developed, in which face detection and shot selection are performed at the client side while statistical multi-shot pose-corrected face verification is performed at the server side. We explain all the image processing steps, from acquisition to decision, paying special attention to a PDM-based pose correction subsystem and a GMM-based sequence decision test. The pose correction relies on projecting a face-shape mesh onto the set of PDM eigenvectors and back-projecting it after changing the coefficients associated with pose variation. The aligned and discriminatively selected texture features form the observation vectors, ready to be plugged into a GMM-based likelihood ratio for statistical decision. Tests on known databases show the reliability of the proposed methods, and preliminary tests on our client-server biometric authentication framework also yield encouraging performance for the complete system.

∗The authors are with the Departamento de Teoría de la Señal y Comunicaciones, ETSI Telecomunicación, Universidad de Vigo, 36310 Vigo, Spain. Phone: +34 986 812664. Fax: +34 986 812116. E-mails: {jalba,danisub,eargones,eli,eotero,carmen}@gts.tsc.uvigo.es. Corresponding author. E-mail: [email protected]

†This work was supported with funds provided partially by the Spanish Ministry of Education under project TEC2005-07212/TCM and the European Sixth Framework Programme under the Network of Excellence BIOSECURE (IST-2002-507604).

1 Introduction

Natural interaction with advanced user interfaces is a topic of interest for a wide variety of new applications in desktop-based or mobile-based computing scenarios. One of the newest situations where users need to cooperate with the system is unsupervised biometric authentication for accessing restricted services. Nowadays fingerprint technology is being deployed mainly in laptops, cellular phones and PDAs due to the security threat of sensitive data being stolen. Soon, this technology will also play an increasingly important role in data protection and in accessing secured zones on the internet from any kind of device connected to the network. Even though fingerprint is the most widely deployed biometric identification technology, there exist some good reasons to foster research on other, more natural biometric traits such as face and voice. First of all, it is well known that multimodal biometrics reduces the False Acceptance and False Rejection Rates and also alleviates the Failure to Enroll Rate (quite high for fingerprints). Second, recognizing facial and vocal features is much more natural and flexible than using fingerprints, and this offers advantages when dealing with spoofing attacks (for example, asking the user to perform a specific audiovisual sequence). Third, video capturing devices have been increasing in quality while decreasing in price. Last, but not least, face and voice research for biometric authentication also pushes other applications based on advanced multimodal interfaces, such as web-resource personalization, computer interaction for the handicapped, uncooperative monitoring, multimedia database indexing, advanced video-conferencing, etc.

We have developed a multimodal biometric identification framework for implementing flexible solutions for accessing web resources through the internet [9]. One of the advantages of this framework is that it allows easy integration of any kind of BioAPI-compliant biometric device, or the development of proprietary biometric feature extraction software, to be used in monomodal or multimodal biometric identification or authentication. The client side is in charge of the acquisition and preprocessing of biometric samples, while the server side, where more computational power and security can be provided, is in charge of face alignment, feature extraction, template creation, recognition and decision (see Figure 1). The framework has been specifically designed to perform biometric authentication using webcams and microphones, ensuring, this way, the universal usability of the proposed solution. On the other hand, taking into account that sample acquisition is performed in an uncontrolled scenario, it is easy to understand that the extraction of robust facial and vocal features for identification is a challenging task. The main sources of error are extreme illumination conditions, extreme head rotation or nodding, bad webcam focusing or framing, and background noise. There exists, then, a trade-off between comfortable usability and system performance.

Figure 1: Flow diagram of our system. Client side: capture, face/feature detection and shot selection. Server side (reached over a secured connection): pose correction and face alignment, Gabor feature extraction and selection, and statistical GMM-based matching.

In this paper we focus on the face processing client-server module of our framework. The first approach we developed for this application relied heavily on user cooperation to fit a frontal pose into a rigid frame overlaid on the mirrored webcam stream [9]. While this solution ensured a scale-normalized frontal pose, quite useful for a wide range of face recognition algorithms, user acceptability was not very high. The approach we propose now relies solely on the user being in focus within the field of view of the webcam and on a pose ranging roughly from -30 to 30 degrees in azimuth and -30 to 30 degrees in elevation. In-plane rotation is not critical at all. The main objective of this module is to increase user acceptability and system usability. The drawback is an increase in image processing complexity. The face mode of the biometric acquisition front-end is in charge of obtaining sampled face frames from the video stream in which the face pose parameters fall within a specified range. On the server side, a face alignment mechanism based on a Point Distribution Model (PDM), which works from a set of face-feature points [7], allows the correction of pose to a frontal one and the extraction of aligned local texture information. The template generated from every frame can be used alone or in combination with other frames to perform still-to-still, video-to-still or video-to-video recognition. The video-to-video or multi-shot recognition strategy relies on a GMM-based likelihood ratio for statistical decision between client and impostor.

There have been other approaches to frame selection in face video analysis. In [15] a face pose estimator (PE) based on a boosting regression algorithm and the well-known Haar-like features from [3] was developed. The PE outputs Up-Down (30, -30) and Left-Right (45, -45) angles for every frame at 14 fps on a 1.4 GHz Athlon PC, with a frontal-face error (UD: (10, -10), LR: (10, -10)) of 20-30%. They placed the PE right after face detection and before face alignment. The main purpose of this solution was to give a rough estimate of the pose for driving view-based recognition or view-based face-alignment algorithms. Our approach is conceptually similar, but it does not need an accurate pose estimate; it just keeps poses within a specified range useful for the subsequent face alignment.

Regarding video-based face recognition, there are several differences between performing image-based and video-based face recognition. Usually video-based face recognition deals with lower-resolution images (more critical for broadcast video than for webcam-based video streams), self-occluded feature points (due to pose changes) and several consecutive face frames. Given the standard resolution of the cheapest webcams and the applications we are dealing with, the main difference from image-based recognition is, actually, the main advantage: the sequence of face images allows developing techniques for combining visual evidence over time. Some authors have exploited this advantage: in [16] the authors use a weighting function to diminish differences and enhance similarities in pose and expression between training shots and test-video face sequences (still-to-video). Even though they only weigh LR pose differences (not UD), using view-based eigenspaces [17], the increase in recognition performance compared with the average of frame-to-frame recognition is quite large.

The rest of the paper is organized as follows. Section 2 gives a rough system overview to explain the constraints applied to the face processing subsystems. Section 3 details the client module from the face processing point of view and presents the shot selection based on a weak pose estimator. Section 4 spans the main contributions of our work in three subsections: first, face alignment and pose correction; then feature extraction and selection; and finally, model matching and statistical decision. Section 5 shows some tests to validate the proposed approach. Conclusions are presented in Section 6.

2 System Overview

The image processing modules we describe in this paper are embedded in an open client-server architecture for server-side biometric authentication oriented to the Web [9].¹

This architecture is intended to allow the integration of multiple third-party biometric devices and algorithms. For this purpose, our system is compatible with the BioAPI standard for biometric interoperability, so any BioAPI-compliant biometric software or device is supported. Moreover, our system is capable of controlling any multimedia device that is compatible with the Java Media Framework, such as common webcams and microphones (see Figure 2).

Regarding the execution of the biometric tasks, the system performs both biometric template extraction and verification on the server side. This solution has the advantage of minimizing the computational load on the client side, since these biometric operations can be executed on powerful servers. Also, from the versatility and security points of view, server-side verification is a better configuration. Thus, the client side, which is the weakest point in the security chain, has the sole responsibility of acquiring the biometric samples.

The overall behaviour of our client-server biometric system can be summarized as follows (see Figure 2): the client-side biometric application is in charge of multi-biometric sample acquisition, encryption and transmission over a secure TCP/IP connection (steps 1 to 4 in Figure 2). On the server side, a centralized authentication server with a biometric authentication module is in charge of extracting the biometric template, matching, and checking access privileges (steps 5 to 7 in Figure 2). We drive user interaction with the system through an easily configurable dialogue. Enrollment and verification tasks are modeled as human-machine dialogues specified by an XML document which describes the sample acquisition process and the biometric verification mode [18].

¹We have recently released the source code of this architecture as a free-software project (http://sourceforge.net/projects/biowebauth/).

The face acquisition process on the client side is in charge of selecting adequate face images from the webcam live video stream. For this purpose, face shots are extracted using the well-known Viola and Jones algorithm [3] for detecting the face, eyes, nose and mouth. Then, selected shots are sent directly from memory to the server machine, where they are processed to create user templates for storage (enrollment) or matching (verification). We expand on this part in Section 4.

A web tool based on this framework has recently been used for the acquisition of an audiovisual biometric database of remote users in the European Network of Excellence (NoE) BIOSECURE.²

Figure 2: Client-Server Architecture

²The acquired database of over 1000 users from 11 different European sites will be publicly available under signed institutional agreements.


3 Face processing at client side: feature detection and shot selection

Since the publication of the general object detection framework by Viola and Jones [3], face tracking can be quite efficiently replaced by face detection in every frame. The detection of facial features can also be performed by the same algorithm after proper training with enough feature instances. We have used Lienhart's implementation [19], included with the OpenCV library,³ both for face (frontal and profile) and feature detection. Scale constraints were applied to the scanning of faces and features to accelerate detection. In this sense, we avoid scanning the webcam image looking for small faces (which should not correspond to a subject working in front of a desktop): for typical webcam lenses and the size of OpenCV's training faces, this constraint saves on average 90% of the processing time. A similar procedure is applied to the feature detectors, where the face morphological structure is also considered to reduce scanning. Pre-rotating each frame according to the eye positions of the previous one allows scanning always with unrotated features. The coordinates of the detected features form the basis for shot selection.
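As a rough illustration of such a constrained scan (a minimal sketch using OpenCV's Python bindings; the cascade files, parameters and minimum-size fraction are illustrative assumptions, not our exact settings):

```python
import cv2

# Haar cascades shipped with OpenCV (Lienhart's implementation [19]).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_face_and_eyes(frame_gray):
    """frame_gray: single-channel webcam frame."""
    h, w = frame_gray.shape
    # Skip small detection windows: a desktop user fills a large part of
    # the frame, so this removes most of the multi-scale scanning work.
    faces = face_cascade.detectMultiScale(
        frame_gray, scaleFactor=1.2, minNeighbors=4,
        minSize=(w // 4, h // 4))   # minimum-size fraction is illustrative
    results = []
    for (x, y, fw, fh) in faces:
        # Morphological constraint: search for eyes only in the upper
        # half of the detected face region.
        roi = frame_gray[y:y + fh // 2, x:x + fw]
        eyes = eye_cascade.detectMultiScale(roi, minSize=(fw // 8, fw // 8))
        results.append(((x, y, fw, fh), eyes))
    return results
```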

As stated earlier, it is not desirable to load the client's machine (usually running lots of background processes!), so shot selection is quite basic and given by the following premises: i) at least 3 out of 4 features must be detected, and ii) the ratio $d_{\mathrm{inter\text{-}eye}}/d_{\mathrm{nose\text{-}eyes}}$ or $d_{\mathrm{inter\text{-}eye}}/d_{\mathrm{mouth\text{-}eyes}}$ should be kept within morphological limits, under the assumption of manageable rotation in depth (azimuth and elevation); a minimal sketch of this rule is given below. In collaborative scenarios, a 4-second video acquisition (the time assigned to PIN utterance in our local audiovisual biometric application) yields between 20 and 120 frames, depending on the bandwidth and the webcam frame rate. Extreme illumination or in-depth pose conditions drastically reduce the number of valid shots for our application. Detection of profile faces inhibits the search for facial features. So, if there are no valid shots, the local application fires a dialogue event to acquire a new video sample, asking the user to roughly face the camera or to change the lighting conditions, depending on the main cause of the missing faces.
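A sketch of the selection rule just described (the morphological limits lo/hi below are illustrative placeholders, not our tuned values):

```python
import numpy as np

def is_valid_shot(left_eye, right_eye, nose, mouth, lo=1.2, hi=2.2):
    """Each feature is an (x, y) tuple, or None when the detector missed it.
    lo/hi are illustrative morphological limits, not our tuned values."""
    feats = [left_eye, right_eye, nose, mouth]
    if sum(f is not None for f in feats) < 3:        # premise i)
        return False
    if left_eye is None or right_eye is None:        # the ratios need both eyes
        return False
    l = np.asarray(left_eye, float)
    r = np.asarray(right_eye, float)
    d_inter_eye = np.linalg.norm(r - l)
    eye_mid = (l + r) / 2.0
    for ref in (nose, mouth):                        # premise ii)
        if ref is not None:
            ratio = d_inter_eye / np.linalg.norm(np.asarray(ref, float) - eye_mid)
            if lo <= ratio <= hi:
                return True
    return False
```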

The larger the number of valid shots, the higher the accuracy of the statistical decision test, but also the larger the bandwidth needed for transmitting the biometric samples. At this point we distinguish two different application scenarios in our platform: i) single verification, where a video of preselected length is recorded and its selected shots are sent, and ii) continuous verification, where a background process is launched for the selection of valid shots and interval-based transmission.

³http://sourceforge.net/projects/opencvlibrary

For the rest of the processing chain, both application scenarios differ only in the number of frames processed at the server and in the temporal distance between selected shots. Nevertheless, the basic image processing procedures are exactly the same.

4 Face processing at server side

This section describes a novel face processing system for multi-shot verification. The processes involved in the system are:

1. Fine face alignment and pose correction

2. Feature extraction and selection

3. GMM modeling and statistical hypothesis test

4.1 Fine Face Alignment and Pose Correction

It is clear that every selected shot reaching the server machine can be coarsely aligned according to the facial features detected in the previous step (eyes/nose/mouth). Face recognition algorithms may work accurately with such a coarse registration when frontal faces are present, but their performance decreases in the case of pose variations. In order to deal with possible changes in viewpoint, we make use of one of the pose correction schemes described in [8], which requires a deformable face mesh (62 landmarks) to be fitted onto the original image. The fitting is achieved by means of a variant of the Lucas-Kanade image registration algorithm: the Inverse Compositional Image Alignment (ICIA) algorithm described in [6] (see Figure 3 for several examples of the fitting result on three video sequences from the BIOSECURE DS1 database). Once the fitting step has finished, the system outputs the set of N (62) facial landmarks representing the user's face-shape instance. The projection of this vector onto the eigenvectors of a previously trained Point Distribution Model provides the shape parameters needed for pose correction. Let us explain this process in more detail.


Figure 3: Example of face mesh fitting using ICIA [6] on three different video sequences (frames 121, 91 and 11).

4.1.1 Face Pose parameters

An annotated training set of facial landmarks is used to calculate a Point Distribution Model. For each training image $I_i$, $N$ landmarks are located, and their normalized coordinates (after removing translation, rotation and scale) are stored, forming a vector

$$X_i = (x_{1i}, x_{2i}, \ldots, x_{Ni}, y_{1i}, y_{2i}, \ldots, y_{Ni})^T = \left(\mathbf{x}_i \;\; \mathbf{y}_i\right)^T \quad (1)$$

The pair $(x_{ji}, y_{ji})$ represents the normalized coordinates of the $j$-th landmark in the $i$-th training face. Through Principal Component Analysis (PCA), the most important modes of face-shape variation are found. As a consequence, any training face shape $X_i$ can be approximately reconstructed:

$$X_i = \bar{X} + Pb, \quad (2)$$

where $\bar{X}$ stands for the mean shape, $P = [\phi_1|\phi_2|\ldots|\phi_t]$ is a matrix whose columns are unit eigenvectors of the first $t$ modes of variation found in the training set, and $b$ is the vector of parameters that defines the actual shape of $X_i$. So, the $k$-th component of $b$ ($b_k$, $k = 1, 2, \ldots, t$) weighs the $k$-th eigenvector $\phi_k$. Also, since the columns of $P$ are orthonormal, we have $P^T P = I$, and thus:

$$b = P^T \left(X_i - \bar{X}\right), \quad (3)$$

i.e., given any face shape, it is possible to obtain its vector of parameters $b$. Among the main modes of shape variation, we isolated those eigenvectors responsible for rigid mesh changes, finding that $\phi_1$ controlled up-down rotations (see Figure 5) while $\phi_2$ was responsible for left-right rotations (see Figure 6). In [8], it was demonstrated that if the training set is adequately chosen, the obtained pose eigenvectors are only responsible for rigid mesh changes and do not contain identity or expression information.



Figure 4: Position of the 62 landmarks used in this paper on an image fromthe XM2VTS database.

Figure 5: Effect of changing the value of $b_1$ on the reconstructed shapes: $\phi_1$ controls the up-down rotation of the face.

Figure 6: Effect of changing the value of $b_2$ on the reconstructed shapes: $\phi_2$ controls the left-right rotation of the face.


So, given a face image with a corresponding fitted mesh $X$, we can project $X$ using Equation (3) to obtain its vector of shape parameters $b$. From the particular values of $b_1$ and $b_2$, it is possible to infer the current rotation of the face in the image.

4.1.2 Pose Correction Strategy

One of the methods proposed in [8] aimed to synthesize frontal faces from nonfrontal views (namely Normalize to Frontal Pose and Warp, i.e. NFPW ).Once the mesh X has been fitted to the face image I, the vector of shapeparameters b is computed, and the subset of parameters that account forpose variations are fixed to typical values of frontal faces (since the averageshape corresponds to a frontal face, pose parameters are set to zero, i.e.b1 = b2 = 0), obtaining the modified vector of shape parameters b. Newmesh coordinates are computed using equation (2), and a virtual image I issynthesized by warping4 the original face onto the new shape (see Figure 7for a block diagram). For large horizontal rotations, we can take advantageof facial symmetry in order to overcome problems due to self-occlusions,achieving state-of-the-art results on the CMU PIE database [5] in a set ofangles ranging from −45◦ to +45◦.
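In numpy terms, the shape part of NFPW reduces to a projection, a coefficient reset and a back-projection (a sketch assuming the two pose modes occupy the first two columns of $P$, as found in Section 4.1.1; the texture warping step is omitted):

```python
import numpy as np

def normalize_shape_to_frontal(X, X_mean, P):
    """X, X_mean: (2N,) stacked landmark coordinates; P: (2N, t) PDM basis."""
    b = P.T @ (X - X_mean)    # Equation (3): shape parameters
    b[0] = 0.0                # b1 = 0 -> remove up-down rotation
    b[1] = 0.0                # b2 = 0 -> remove left-right rotation
    return X_mean + P @ b     # Equation (2): frontal mesh coordinates
```

The virtual frontal image is then obtained by warping the original texture from the fitted mesh onto the corrected mesh, e.g. with Thin Plate Splines [4].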

In Figure 8 we show the cropped face images corresponding to the fitted meshes of Figure 3, as well as the virtual frontal faces generated using NFPW.

Figure 7: Block diagram for pose normalization: the flexible shape model is fitted to the test image $I$, yielding mesh $X$; pose normalization produces the corrected mesh $\hat{X}$, and TPS warping synthesizes the virtual image $\hat{I}$. TPS stands for Thin Plate Splines.

⁴Using Thin Plate Splines [4].


Figure 8: Upper row: original cropped face images from the corresponding frames shown in Figure 3. Bottom row: corresponding synthesized frontal virtual images.

4.1.3 Experiment on a video sequence

In order to demonstrate the suitability of the NFPW method for pose correction on videos, we performed an experiment using a manually annotated 5000-frame sequence.⁵ For this test, the vectors of shape parameters $b$ were computed in every frame. One of the frames of the video was used as a template, while the remaining ones were used for testing. Let $b_2(i)$ be the value, in frame $i$, of the pose parameter that accounts for horizontal rotations, and let $b_2^T$ be the value of this parameter for the template frame (up-down rotations are practically nonexistent in this video). The difference between $b_2^T$ and $b_2(i)$, namely $\Delta b_\alpha$, is a measure of the difference $\alpha$ between the rotation angles of the template and the probe image. Figure 9 presents the similarity scores obtained with and without the pose correction stage, plotted against $\Delta b_\alpha$. The tested video shows a man during a conversation and, apart from pose changes, there are other factors, such as expression variations, that affect the value of the similarity between the template and the probe image. However, it is clear that as the absolute value of $\Delta b_\alpha$ grows, the use of pose-corrected images outperforms the original system where no pose correction was applied (since the similarity score degradation is much smaller in the case of pose-corrected images).

⁵http://www-prima.inrialpes.fr/FGnet/data/01-TalkingFace/talking face.html



Figure 9: Similarity scores with and without pose correction in a video sequence, plotted against $\Delta b_\alpha$.

4.1.4 Calibrating the system

We should take into account that texture mapping increases the computational burden on the server. In order to minimize this load without severely degrading the accuracy of the system, we decided to correct for pose variations only if the rotation angle $\theta$ of the face exceeds $\theta_{min}$; otherwise the original aligned face image is considered frontal and is fed to the feature extraction module. Deciding whether or not a frame contains a frontal face is achieved by checking the current values of the pose parameters. In order to establish a correspondence between rotation angles and the $b_2$ values, we used the CMU PIE (Pose, Illumination and Expression) database [5], which consists of face images of 68 subjects recorded under different combinations of pose and illumination. Figure 10 shows the images taken for subject 04006 from all cameras with neutral illumination. As can be seen, this database is especially suitable for testing the robustness of systems against complete left-right face rotations. In order to calibrate our system, we used a subset of the database, namely the images taken from cameras 11, 29, 27, 05 and 37 (with corresponding nominal azimuth angles of approximately $-45°$, $-22.5°$, $0°$, $22.5°$ and $45°$) under neutral illumination. All of them (a total of $68 \times 5$ images) were manually annotated with the same set of 62 landmarks shown in Figure 4.

For each annotated mesh, its vector of shape parameters $b$ was calculated. So, for every subject, we have 5 $b$ vectors, each one corresponding to a certain pose (11, 29, 27, 05 and 37) and, for a given pose, we have 68 vectors, one per subject in the database.


Figure 10: Images taken from all cameras of the CMU PIE database for subject 04006. The 9 cameras in the horizontal sweep are separated by about $22.5°$ each [5].

Table 1: Relationship between $\bar{b}_2$ and the nominal angle of rotation $\theta$.

pose         c11       c29       c27      c05      c37
$\theta$     −45°      −22.5°    0°       22.5°    45°
$\bar{b}_2$  −0.197    −0.114    0.010    0.127    0.196

Table 1 shows, for each pose, the average value $\bar{b}_2$ of the parameter $b_2$, along with the nominal rotation angle $\theta$. Clearly, there exists an approximately linear relationship between $\bar{b}_2$ and $\theta$, in agreement with the result obtained in [2]. Hence, it is possible to determine whether a face is frontal ($-\theta_{min} \le \theta \le \theta_{min}$) or not by checking whether the corresponding $b_2$ value is within the range $[-b_2^{min}, b_2^{min}]$. What is left is to determine the particular value of $\theta_{min}$ (or, analogously, $b_2^{min}$): obviously there exists a trade-off between accuracy and computational load, and this threshold can be modified if necessary. According to the similarity score degradation observed in the previous section, we decided to fix $b_2^{min} = 0.05$ ($\theta_{min} \approx 10°$). From Figure 10, it is clear that elevation changes are not a major characteristic of the CMU PIE database. Hence, in order to calibrate $b_1$, we used another database with enough up-down tilting examples and performed an analysis similar to the one done for $b_2$. The minimum angle $\varphi_{min}$ needed to perform pose correction is set to $5°$, and the corresponding range $[-b_1^{min}, b_1^{min}]$ is calculated.
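A sketch of this calibration using the values of Table 1 (the linear fit and the frontal test; $b_1^{min}$ below is a placeholder, since the paper only states the corresponding angle):

```python
import numpy as np

# Average b2 per CMU PIE camera and its nominal azimuth angle (Table 1).
theta = np.array([-45.0, -22.5, 0.0, 22.5, 45.0])
b2_bar = np.array([-0.197, -0.114, 0.010, 0.127, 0.196])

# The relationship is approximately linear, so a first-order fit suffices.
slope, intercept = np.polyfit(b2_bar, theta, deg=1)

def azimuth_from_b2(b2):
    """Rough azimuth estimate (degrees) from the pose parameter b2."""
    return slope * b2 + intercept

def needs_pose_correction(b1, b2, b1_min=0.025, b2_min=0.05):
    """b2_min = 0.05 corresponds to theta_min of about 10 degrees; b1_min is
    a placeholder for the value calibrated against phi_min = 5 degrees."""
    return abs(b1) > b1_min or abs(b2) > b2_min
```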


4.2 Feature extraction

Once a real or virtual frontal face is obtained, the system performs a local feature extraction procedure based on Gabor filtering. Gabor filters are biologically motivated convolution kernels shaped as plane waves restricted by a Gaussian envelope, as shown next:

$$\psi_m(\vec{x}) = \frac{\|\vec{k}_m\|^2}{\sigma^2} \exp\left(-\frac{\|\vec{k}_m\|^2 \|\vec{x}\|^2}{2\sigma^2}\right) \left[\exp\left(i\,\vec{k}_m \cdot \vec{x}\right) - \exp\left(-\frac{\sigma^2}{2}\right)\right] \quad (4)$$

where $\vec{k}_m$ contains the frequency and orientation information of the filters, $\vec{x} = (x, y)^T$ and $\sigma = 2\pi$. Our system uses a set of 40 Gabor filters with the same configuration as in [1]. The region surrounding a pixel in the image is encoded by the convolution of the image patch with these filters, and the set of responses is called a jet, $\mathcal{J}$. So, a jet is a vector with 40 coefficients, and it provides information about a specific region of the image.
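A direct numpy transcription of Equation (4) (a sketch; the wave-vector magnitudes and the kernel size follow the usual 5-scale, 8-orientation convention of [1], which we assume here):

```python
import numpy as np

def gabor_kernel(scale, orientation, size=33, sigma=2 * np.pi):
    """One of the 40 kernels (5 scales x 8 orientations), Equation (4)."""
    # Wave vector k_m: magnitude k_v = 2^(-(v+2)/2) * pi, v = 0..4, as in [1].
    k_mag = np.pi * 2.0 ** (-(scale + 2) / 2.0)
    phi = orientation * np.pi / 8.0
    kx, ky = k_mag * np.cos(phi), k_mag * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    k2, r2 = k_mag ** 2, x ** 2 + y ** 2
    envelope = (k2 / sigma ** 2) * np.exp(-k2 * r2 / (2.0 * sigma ** 2))
    # DC-free complex carrier: the bracketed term of Equation (4).
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2.0)
    return envelope * carrier

def jet(image, x, y, kernels):
    """40-coefficient jet at pixel (x, y): responses of all kernels."""
    half = kernels[0].shape[0] // 2
    patch = image[y - half:y + half + 1, x - half:x + half + 1]
    return np.array([np.sum(patch * k) for k in kernels])

# bank = [gabor_kernel(s, o) for s in range(5) for o in range(8)]
```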

At each node of the pose-normalized mesh, a Gabor jet is extracted and stored for comparison. Given two images to be compared, say $I_1$ and $I_2$, with node coordinates $P = \{\vec{p}_1, \vec{p}_2, \ldots, \vec{p}_N\}$ and $Q = \{\vec{q}_1, \vec{q}_2, \ldots, \vec{q}_N\}$, their respective sets of jets are computed: $\{\mathcal{J}_{\vec{p}_i}\}_{i=1,\ldots,N}$ and $\{\mathcal{J}_{\vec{q}_i}\}_{i=1,\ldots,N}$. Finally, the still-to-still score between the two images is given by:

$$S = f_N\left(\{\langle \mathcal{J}_{\vec{p}_i}, \mathcal{J}_{\vec{q}_i} \rangle\}_{i=1,\ldots,N}\right) \quad (5)$$

where $\langle \mathcal{J}_{\vec{p}_i}, \mathcal{J}_{\vec{q}_i} \rangle$ represents the normalized dot product between corresponding jets, taking into account that only the moduli of the jet coefficients are used. In Equation (5), $f_N$ stands for a generic combination rule for the $N$ dot products.
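The score of Equation (5), with the median as the combination rule $f_N$ used later in Section 4.3, would then read (a minimal sketch):

```python
import numpy as np

def jet_similarity(J1, J2):
    """Normalized dot product of the jet moduli, as used in Equation (5)."""
    a, b = np.abs(J1), np.abs(J2)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def still_to_still_score(jets1, jets2):
    """f_N = median over the N node-wise similarities (Section 4.3)."""
    return float(np.median([jet_similarity(j1, j2)
                            for j1, j2 in zip(jets1, jets2)]))
```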

Previous results from several researchers [22][21][23] have shown that the performance of face verification using local models can be improved by discriminatively selecting or fusing local responses. In this application, where a statistical model is adapted for every user (see next section), we are doubly interested in selecting a subset of discriminative face locations: it yields a discriminative reduction of dimensionality and a decrease in computational cost.


4.3 Feature selection using Sequential Floating Forward Search

The Sequential Floating Forward Search (SFFS) method [12] is a non-exhaustive deterministic sequential search method. This algorithm has proved its efficiency in Gabor kernel location selection [14] and image classification [13]. In [14], several feature selection algorithms are evaluated on two Gabor mesh-based verification schemes, and SFFS is the best among all the tested suboptimal deterministic search methods.

Let $X$ be the selected set of similarities when comparing two images. The image similarity is given by $S = \mathrm{median}\{X\}$. The SFFS criterion function $J$ provides a measure of the classification accuracy for the selected similarities between the template and the target image. Even though an immediate measure could be derived directly from a traditional performance measure such as the TER, i.e. $J(X) = 1 - \mathrm{TER}$, this is not a good criterion function, since perfect classification is possible on the evaluation dataset for feature sets far from the optimal general solution. More robust statistics can be used as criterion functions for the SFFS. The adopted solution is:

$$J(X) = p^1_{2\%} - p^0_{98\%}, \quad (6)$$

where $P(S < p^1_{2\%} \mid C = 1) = 0.02$ and $P(S < p^0_{98\%} \mid C = 0) = 0.98$ in the evaluation set. This measure evaluates the separation between the distributions of the similarities for true and false identity claims.

The cardinality of the final set was adjusted as a trade-off between discriminability on the evaluation set and dimensionality reduction for the statistical model.
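The criterion of Equation (6) is straightforward to evaluate on sets of genuine and impostor scores (a sketch; the percentile conventions are as defined above):

```python
import numpy as np

def criterion_J(genuine_scores, impostor_scores):
    """J(X) = p1_2% - p0_98%: separation between the 2nd percentile of the
    genuine-claim similarities and the 98th percentile of the impostor ones."""
    p1_2 = np.percentile(genuine_scores, 2)      # P(S < p | C = 1) = 0.02
    p0_98 = np.percentile(impostor_scores, 98)   # P(S < p | C = 0) = 0.98
    return p1_2 - p0_98
```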

4.4 GMM modeling and statistical hypothesis test

Still-image face verification systems usually compare the features extracted from one or several frontal face images of the user (templates) with one or several probe frontal face images. However, when dealing with video streams, the number of frontal face images that the system can use for enrollment and verification is much larger than in the still-image case. More reliable information about the intra-user and inter-user distributions of the parameters extracted from the frontal faces is now available, and an appropriate statistical model can take advantage of it to perform the identity verification.


After face alignment and possible pose correction, the real or virtual frontal faces available at the server for user identification are characterized by the set of absolute values of the responses of the Gabor filters applied at aligned locations of the face. Without loss of generality, any subset of these responses can be used instead of the full set. Each frame selected from the webcam stream in which a frontal face has been picked or synthesized provides a realization of, say, $M$ local responses. This allows building statistical models of the face represented in the video sequence.

In a statistical framework, the identity verification task can be formulated as a hypothesis test:

• $H_0$: the frontal faces extracted from the video sequence belong to the claimed identity $I$. This hypothesis is modeled by the user model $\lambda_I$.

• $H_1$: the frontal faces extracted from the video sequence do not belong to the claimed identity $I$. This hypothesis is modeled by the universal background model (UBM) $\lambda_{UBM}$.

The identity verification can thereby be performed in the classical Bayesian framework:

$$H_0 \text{ is accepted} \iff \prod_{n=1}^{N} \frac{P(f_n \mid \lambda_I)}{P(f_n \mid \lambda_{UBM})} > \theta, \quad (7)$$

where $N$ is the total number of frames in which a frontal face is detected, assuming statistical independence between frames.
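In the log domain, the test of Equation (7) becomes a sum of per-frame log-likelihood ratios (a sketch using scikit-learn GMMs as stand-ins for $\lambda_I$ and $\lambda_{UBM}$; the threshold name is ours):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def accept_claim(frames, gmm_user, gmm_ubm, log_theta=0.0):
    """frames: (N, d) array with one feature vector per frontal-face frame.
    score_samples returns per-sample log-densities, so the product in
    Equation (7) becomes a sum of per-frame log-likelihood ratios."""
    llr = gmm_user.score_samples(frames) - gmm_ubm.score_samples(frames)
    return llr.sum() > log_theta   # log_theta = log of the threshold in (7)
```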

Gaussian mixture models (GMMs) have already been successfully applied in other video face verification work [10] and also in speaker verification [11]. A GMM $\lambda$ with $M$ mixtures is fully characterized by its probability density function, which for a given feature vector $x$ can be written as:

$$P(x \mid \lambda) = \sum_{i=1}^{M} w_i \left((2\pi)^d |\Gamma_i|\right)^{-\frac{1}{2}} e^{-\frac{1}{2}(x - \mu_i)' \Gamma_i^{-1} (x - \mu_i)} \quad (8)$$

Assuming statistical independence between the $M$ local responses, we can estimate $M$ Gaussian mixture models for every user. The dimension of the problem is then 40, instead of $40 \times M$.

Although a video sequence can provide a large number of feature vectors, they may not be enough for training a generalizable model. The number of feature vectors required for training depends on the feature dimensionality, and therefore feature dimension reduction must be used to mitigate this problem. The well-known PCA technique has been used for this purpose, where the covariance matrix and the mean vector are estimated using a set of features not used in the model training stages, to avoid generalization problems in the statistical modeling.

4.4.1 Universal model training

The EM algorithm is used to train the UBM $\lambda_{UBM}$. Standard LBG is used for the GMM initialization to avoid the dependence of EM accuracy on the initialization. Face video sequences of 10 seconds from 63 users are used in order to obtain a model as general as possible, capable of modeling the distribution of the feature vectors extracted from any face. This model encodes the parameter distribution for any face, independently of its identity, and hence it is suited to the alternative hypothesis $H_1$.
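A minimal stand-in for this training step (scikit-learn's EM with k-means initialization in place of LBG; the mixture count and covariance type are illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features: np.ndarray, n_components: int = 64):
    """background_features: (n_samples, d) feature vectors pooled over the
    63 background users; 64 diagonal mixtures is an illustrative choice."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          init_params="kmeans",   # stand-in for LBG
                          max_iter=200, random_state=0)
    return ubm.fit(background_features)
```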

4.4.2 User model generation

Even though the feature dimension is reduced, the number of feature vectors available for each user may not be enough. This can be overcome by generating the user model from the UBM using the Maximum A Posteriori (MAP) adaptation technique [11].

Only the mixture means are adapted, since variance and weight adaptation have not provided improvements in other frameworks such as speaker verification. The new $i$-th mixture mean vector is computed as:

$$\hat{\mu}_i = \alpha_i E_i\{y\} + (1 - \alpha_i)\,\mu_i, \quad (9)$$


where

$$\alpha_i = \frac{n_i}{n_i + \rho} \quad (10)$$

$$n_i = \sum_{l=1}^{N} P_i(y_l) \quad (11)$$

$$P_i(y_l) = \frac{w_i \left((2\pi)^d |\Gamma_i|\right)^{-\frac{1}{2}} e^{-\frac{1}{2}(y_l - \mu_i)' \Gamma_i^{-1} (y_l - \mu_i)}}{\sum_{m=1}^{M} w_m \left((2\pi)^d |\Gamma_m|\right)^{-\frac{1}{2}} e^{-\frac{1}{2}(y_l - \mu_m)' \Gamma_m^{-1} (y_l - \mu_m)}} \quad (12)$$

$$E_i\{y\} = \frac{1}{n_i} \sum_{l=1}^{N} P_i(y_l)\, y_l \quad (13)$$

Here $\rho$ is the relevance factor and $N$ is the number of feature vectors used for adaptation.
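Equations (9)-(13) translate to a few numpy lines (a sketch of means-only MAP adaptation in the style of [11]; diagonal covariances and the relevance factor value are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_adapt_means(weights, means, diag_covs, Y, rho=16.0):
    """Means-only MAP adaptation, Equations (9)-(13).
    weights: (M,), means: (M, d), diag_covs: (M, d), Y: (N, d) user vectors.
    rho = 16 is an illustrative relevance factor, not the paper's value."""
    M = len(weights)
    # Posterior responsibilities P_i(y_l), Equation (12)
    dens = np.stack([weights[i] *
                     multivariate_normal.pdf(Y, means[i], np.diag(diag_covs[i]))
                     for i in range(M)])                       # shape (M, N)
    post = dens / (dens.sum(axis=0, keepdims=True) + 1e-300)
    n = post.sum(axis=1) + 1e-12                               # Equation (11)
    E = (post @ Y) / n[:, None]                                # Equation (13)
    alpha = n / (n + rho)                                      # Equation (10)
    return alpha[:, None] * E + (1 - alpha)[:, None] * means   # Equation (9)
```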

When the adapted models are used in the hypothesis test formulated in Equation (7), only the learnt differences between the general face and the user's face are enhanced, and this provides good discrimination even when only a few feature vectors are used for model adaptation.

5 Results

6 Conclusions

We have presented a complete video-based face verification solution for cooperative unsupervised webcam scenarios. This solution has been integrated into our client-server multiplatform multimodal biometric authentication framework. Regarding face processing, the main contributions of this paper are related to multi-shot face-pose alignment and correction with discriminative selection of feature points, and to the statistical modeling of user sequences using procedures inherited from speaker recognition, i.e. a user model and a universal background model for performing the hypothesis-test decision. Results on the advantage of pose correction over a video sequence are reported. The lack of databases for video-based face authentication benchmarking does not allow us to compare our overall system against other researchers' solutions, but preliminary results are quite encouraging. An empirical analysis of the main error sources of the whole processing chain is still missing, though.


References

[1] Wiskott, L., Fellous, J.M., Krüger, N., von der Malsburg, C., "Face Recognition by Elastic Bunch Graph Matching," IEEE Transactions on PAMI, Vol. 19, No. 7, 1997, pp. 775-779.

[2] Lanitis, A., Taylor, C.J., Cootes, T.F., "Automatic Interpretation and Coding of Face Images Using Flexible Models," IEEE Transactions on PAMI, Special Issue on Face and Gesture Recognition, Vol. 19, No. 7, 1997, pp. 743-756.

[3] Viola, P., Jones, M.J., "Robust Real-Time Face Detection," International Journal of Computer Vision, Vol. 57, No. 2, 2004, pp. 151-173.

[4] Bookstein, F.L., "Principal Warps: Thin-Plate Splines and the Decomposition of Deformations," IEEE Transactions on PAMI, Vol. 11, No. 6, 1989, pp. 567-585.

[5] Sim, T., Baker, S., Bsat, M., "The CMU Pose, Illumination, and Expression Database," IEEE Transactions on PAMI, Vol. 25, No. 12, 2003, pp. 1615-1618.

[6] Baker, S., Matthews, I., "Equivalence and Efficiency of Image Alignment Algorithms," in Proceedings of the IEEE Conference on CVPR, 2001, pp. 1090-1097.

[7] González-Jiménez, D., Sukno, F., Alba-Castro, J.L., Frangi, A., "Automatic Pose Correction for Local Feature-Based Face Authentication," in Proceedings of the IAPR Conference on Articulated Motion and Deformable Objects, 2006, pp. 356-365.

[8] González-Jiménez, D., Alba-Castro, J.L., "Towards Pose Invariant 2D Face Recognition Through Point Distribution Models and Facial Symmetry," accepted for publication in IEEE Transactions on Information Forensics and Security.

[9] Otero-Muras, E., González-Agulla, E., Alba-Castro, J.L., García-Mateo, C., Márquez-Flórez, O.W., "An Open Framework for Distributed Biometric Authentication in a Web Environment," Annals of Telecommunications, Special Issue on Multimodal Biometrics, Vol. 62, No. 1-2, 2007, pp. 1702-1717.

[10] Zhou, S., Krueger, V., Chellappa, R., "Probabilistic Recognition of Human Faces from Video," Computer Vision and Image Understanding, Vol. 91, 2003, pp. 214-245.

[11] Reynolds, D.A., Quatieri, T.F., Dunn, R.B., "Speaker Verification Using Adapted Gaussian Mixture Models," Digital Signal Processing, Vol. 10, 2000, pp. 19-41.

[12] Pudil, P., Ferri, F.J., Novovicova, J., Kittler, J., "Floating Search Methods for Feature Selection with Nonmonotonic Criterion Functions," in Proceedings of ICPR 1994, Vol. 2 - Conference B: Computer Vision and Image Processing, pp. 279-283.

[13] Jain, A., Zongker, D., "Feature Selection: Evaluation, Application and Small Sample Performance," IEEE Transactions on PAMI, Vol. 19, No. 2, 1997, pp. 153-158.

[14] Gokberk, B., Irfanoglu, M.O., Akarun, L., Alpaydin, E., "Optimal Gabor Kernel Location Selection for Face Recognition," in Proceedings of ICIP 2003, pp. 677-680.

[15] Yang, Z., Ai, H., Wu, B., Lao, S., Cai, L., "Face Pose Estimation and its Application in Video Shot Selection," in Proceedings of the 17th International Conference on Pattern Recognition, Vol. 1, 2004, pp. 322-325.

[16] Zhang, Y., Martínez, A.M., "A Weighted Probabilistic Approach to Face Recognition from Multiple Images and Video Sequences," Image and Vision Computing, Vol. 24, 2006, pp. 626-638.

[17] Pentland, A., Moghaddam, B., Starner, T., "View-Based and Modular Eigenspaces for Face Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, 1994, pp. 84-91.

[18] González-Agulla, E., Argones-Rúa, E., García-Mateo, C., Márquez-Flórez, O., "Development and Implementation of a Biometric Verification System for Internet Applications," in Proceedings of the COST 275 Workshop on Biometrics on the Internet, Vigo, Spain, 2004, pp. 35-41.

[19] Lienhart, R., Kuranov, A., Pisarevsky, V., "Empirical Analysis of Detection Cascades of Boosted Classifiers for Rapid Object Detection," in DAGM'03, 25th Pattern Recognition Symposium, Magdeburg, Germany, 2003, pp. 297-304.

[20] Rodríguez-Liñares, L., García-Mateo, C., Alba-Castro, J.L., "On Combining Classifiers for Speaker Authentication," Pattern Recognition, Vol. 36, 2003, pp. 347-359.

[21] Kotropoulos, C., Tefas, A., Pitas, I., "Frontal Face Authentication Using Morphological Elastic Graph Matching," IEEE Transactions on Image Processing, Vol. 9, No. 4, 2000, pp. 555-560.

[22] Argones-Rúa, E., Kittler, J., Alba-Castro, J.L., González-Jiménez, D., "Information Fusion for Local Gabor Features Based Frontal Face Verification," in Proceedings of the International Conference on Biometrics, 2006, pp. 173-181.

[23] Duc, B., Fischer, S., Bigun, J., "Face Authentication with Gabor Information on Deformable Graphs," IEEE Transactions on Image Processing, Vol. 8, No. 4, 1999, pp. 504-516.
