



A Bayesian Model for Predicting Face Recognition Performance Using Image Quality

Abhishek Dutta, Raymond Veldhuis, Luuk Spreeuwers
University of Twente, Netherlands

{a.dutta,r.n.j.veldhuis,l.j.spreeuwers}@utwente.nl

Abstract

The quality of a pair of facial images is a strong indicator of the uncertainty in the decision about identity based on that image pair. In this paper, we describe a Bayesian approach to model the relation between image quality (such as pose, illumination, noise, and sharpness) and the corresponding face recognition performance. Experimental results based on the MultiPIE data set show that our model can accurately aggregate verification samples into groups for which the verification performance varies fairly consistently. Our model does not require similarity scores and can predict face recognition performance using only image quality information. Such a model has many applications. As an illustrative application, we show improved verification performance when the decision threshold automatically adapts according to the quality of the facial images.

1. Introduction

A face recognition system can make a verification decision to indicate whether the subjects contained in a pair of facial images have the same (genuine or match) or different (impostor or non-match) identity. For practical applications, we are not only interested in the verification decision but also in the uncertainty associated with the decision about identity. In this paper, we present a Bayesian model to quantify the uncertainty in the verification decision.

In addition to the inherent limitations of a face recognition system, there are two major factors that contribute to uncertainty in the decision about identity: a) inherent properties of some identities which make verification difficult (as described in [6]); b) the quality (such as pose, illumination, and noise) of the facial image pair. Our model only considers the role of image quality because it contributes very strongly to the uncertainty in the decision about identity. For example, a verification decision made using a non-frontal image with uneven lighting entails more uncertainty than a verification decision carried out on frontal mugshots captured under studio conditions. Therefore, our model relies on information about facial image quality to predict performance and to quantify the uncertainty in the verification decision.

We propose to use a data-driven model to capture the relationship between image quality and the corresponding verification performance. We automatically assess the quality (such as pose and illumination) of facial images and train our model on real verification performance data to find regions in the quality space where the recognition performance varies fairly consistently. Many such models explored in the past require similarity scores to predict recognition performance. Our model can make performance predictions even before the actual recognition has taken place because it is based solely on the quality of the probe (or query) and gallery (or enrollment) image pair.

There are many applications of such models that predict recognition performance: a) a verification decision threshold that adapts according to sample quality; b) fusion of results from multiple algorithms; c) facilitating the capture of “best” enrollment images by giving feedback to the operator about the quality of acquired samples; d) in forensic cases involving a large amount of CCTV footage, such models can help investigators focus their attention on only the “best” quality video frames that carry higher evidential value. As an illustrative example, we apply our model to adaptively vary the decision threshold and show that it helps improve verification performance.

This paper is organized as follows. In Section 2, we review previous work in this area. We describe our Bayesian model in Section 3 and discuss its performance evaluation methodology in Section 4. In Section 5, we describe the experiments designed to train our model and evaluate its performance. We discuss our experimental results in Section 6 and finally present our conclusions in Section 7.


2. Related Work

Systems that predict the performance of a biometric system can generally be classified into two groups. The first group of methods utilizes the similarity score and prior knowledge about the genuine and impostor score distributions to predict performance. The second group of performance prediction systems assesses biometric sample quality and uses this information to predict performance: poorer sample quality entails more uncertainty in decisions about identity.

Performance prediction systems based solely on the similarity score first create features from the similarity scores and then apply machine learning to model the relationship between these features and the corresponding recognition performance. For instance, [16] computes three features from a set of sorted similarity scores, while [11] uses features based on similarity scores that quantify intrinsic factors (properties of the algorithm, gallery set, etc.) and extrinsic factors (quality of probe images). Both then use an SVM to learn the relationship between these similarity-score-based features and the corresponding recognition performance. In [10], the authors compute a feature from the impostor score distribution to quantify “facial uniqueness” and then use Kernel Density Estimation to model uniqueness-based match (genuine) and non-match (impostor) score distributions. The uncertainty in the decision about identity is highest where the tails of the genuine and impostor score distributions overlap. Therefore, a better model of the tails of the score distributions is essential for accurate prediction of recognition performance. Following this line of thought, [13] and [14] directly model the tails of the similarity score distributions. In [13], the tail of the impostor score distribution is modeled as a Weibull distribution. To predict the outcome of a verification decision, they check whether the new verification score is an outlier with respect to the model of the tail of the impostor score distribution. In [14], the tails of both the genuine and impostor score distributions are modeled as a Generalized Pareto Distribution. The normalized distance of a similarity score from the impostor score distribution is used as a performance-predicting feature in [12]. Using a probabilistic graphical model, the authors model the joint density of the similarity score and these performance-predicting features, which allows them to predict the recognition performance.

It is also possible to predict recognition performance based on information about biometric sample quality. One of the earliest works on predicting the performance of a biometric system was presented by [15]. They first show that the normalized match score, which denotes the distance of the match score from the non-match score distribution, is an indicator of recognition performance. Using an Artificial Neural Network (ANN), they learn the non-linear relationship between fingerprint quality (such as clarity of ridges and valleys, number and quality of minutiae, and size of the image) and the corresponding normalized match score. This model of quality and recognition performance (i.e. normalized score) is used to predict the performance of previously unseen fingerprint samples. Using externally assessed fingerprint quality, [17] model the genuine and impostor score distributions using gamma and log-normal distributions respectively. This model of score distributions is then used to adaptively select the decision threshold based on quality information. The authors of [1] apply Multi-Dimensional Scaling (MDS) to learn the relationship between image quality features and similarity scores. Using regression, the authors of [2] model the relationship between quality partition (good, bad and ugly) and image-specific (sharpness, hue, etc.) and face-specific (facial expression) properties of a facial image.

Our work most closely relates to that of [3], which uses a Generalized Linear Mixed Model (GLMM) to model the relationship between image quality (such as focus, head tilt, etc.) and the outcome of the verification decision. Their analysis shows that some quality metrics are strong indicators of recognition performance. In this paper, we propose a Bayesian framework for modeling the relation between face recognition performance and image quality. We use a probability density function to model this relationship.

3. Model of Image Quality and Recognition Performance

Let q = [q_1^p, q_1^g, \cdots, q_m^p, q_m^g] \in \mathbb{R}^{2m} denote the image quality parameters (like pose, illumination direction, noise, etc.) of a probe and gallery image pair. Throughout this paper, the term image quality refers to any measurable property of facial images that affects the performance of face recognition systems. For a particular face recognition system j, let r^{(j)} = [r_1, \cdots, r_n] \in \mathbb{R}^{n} denote the face recognition performance corresponding to a sufficiently large set of different image pairs, each having the same quality q. Here, we assume that the recognition performance of system j is not affected by variations in identity [6] and that the vector q is sufficient to capture all the relevant quality variations possible in a facial image pair. Different face recognition systems have varying levels of tolerance to image quality degradations, and therefore we denote the vector r^{(j)} as a function of a particular face recognition system. To simplify the notation, we simply write r.

We want to model the interaction between image quality q and recognition performance r using a probability density function (PDF) P(q, r). In this paper, we propose a data-driven model of P(q, r) which is trained by gathering recognition performance data r for the most common types of quality variations q in probe and gallery image pairs. Once we have trained this model, we can predict the recognition performance for a new probe and gallery pair with quality q as follows:


\[ r^* = \arg\max_{r} P(r \mid q), \tag{1} \]

where r^* denotes the most probable estimate of face recognition performance.

The recognition performance prediction r^* based on our model can be made even before the actual recognition task because our model relies only on the quality of the facial images. Many such models explored in the past also use the similarity score as a feature for performance prediction. The impostor (or non-match) score is influenced by both the identity and the quality of the facial images [7]. Hence, it is not possible to tell whether an extremely low similarity score is due to mismatched identity or to the comparison of an extremely poor facial image pair. Therefore, we avoid using the similarity score as a feature in our model. This design decision not only avoids the issues associated with using the similarity score as a feature but also allows our model to predict performance even before the actual facial comparison has taken place.

In this paper, we express P(q, r) using a mixture of K multivariate Gaussians (MOG):

\[ P(q, r) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}([q, r]; \mu_k, \Sigma_k), \tag{2} \]

where \pi_k are the mixture coefficients such that 0 \le \pi_k \le 1 and \sum_k \pi_k = 1, and \mu_k \in \mathbb{R}^{2m+n} and \Sigma_k are the mean and covariance matrix of the k-th mixture component. We apply the Expectation Maximization (EM) algorithm to learn the parameters of the MOG in (2).
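As a concrete illustration of this step, the following Python sketch fits such a diagonal-covariance mixture to hypothetical training vectors [q, r] with EM. It is not the implementation used in this paper (which relies on the R package mclust, see Section 5.2), and all array names and values are placeholders.

import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder training data: quality vectors [q1, q2] and performance r = log10(FRR) at FAR = 0.001
train_q = np.random.rand(441, 2)
train_r = -3.0 * np.random.rand(441, 1)
qr = np.hstack([train_q, train_r])   # training vectors [q, r]

# EM fit of a K-component Gaussian mixture with diagonal covariances (akin to mclust's VVI model)
mog = GaussianMixture(n_components=5, covariance_type='diag', random_state=0)
mog.fit(qr)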

Given the quality q of a previously unseen verification instance, we can apply Bayes' theorem to (2) and obtain the posterior distribution of the recognition performance r as

\[ P(r \mid q) = \frac{P(q, r)}{P(q)}. \tag{3} \]

Since the denominator of (3) does not depend on r, the corresponding most probable estimate of r for a given quality q is given by

\[ r^* = \arg\max_{r} P(q, r). \tag{4} \]

Substituting r^* in (3), we obtain P(r^* | q), which defines the probability of the most probable recognition performance r^*.
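Continuing the same hypothetical sketch, the prediction of (4) can be approximated by scanning a grid of candidate performance values and keeping the one that maximizes the joint density; the grid bounds below are assumptions, not values prescribed by the model.

import numpy as np
# assumes the fitted mixture `mog` from the previous sketch
q_new = np.array([0.5, 1.0])            # hypothetical probe quality vector [q1, q2]
r_grid = np.linspace(-3.0, 0.0, 301)    # candidate values of r = log10(FRR)
cand = np.column_stack([np.tile(q_new, (r_grid.size, 1)), r_grid])
log_joint = mog.score_samples(cand)     # log P(q_new, r) for every candidate, from (2)
r_star = r_grid[np.argmax(log_joint)]   # most probable performance estimate, equation (4)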

4. Performance Prediction Error

Using the PDF of (3), we can obtain the posterior distribution of the recognition performance r at any given point q in the quality space. However, our test data set does not contain a sufficient number of verification instances at each point in the quality space. Therefore, even though our model can predict performance at each point in the quality space, we do not have enough test data to evaluate the error in those model predictions. Hence, we evaluate the performance of our model by adopting an alternative view of the MOG decomposition which presents the mixture components as clusters.

Recall that the MOG decomposition of (2) can alternatively be viewed as partitioning the [q, r] space into K clusters. We partition all the verification instances in the test data set into a set of K clusters (or mixture components). For a previously unseen verification instance in the test data set with quality q, we first compute the most probable estimate of performance r^* using (4) and then assign it to the cluster k^* such that

\[ k^* = \arg\max_{k} \, \pi_k \, \mathcal{N}([q, r^*]; \mu_k, \Sigma_k), \tag{5} \]

where k^* \in \{1, \cdots, K\}. Based on these cluster-specific verification instances, we compute the true verification performance and its credible interval using a Bayesian approach, as discussed in Section 4.1. Given a new instance q, the cost of performance prediction is O(a^n), where a is the number of levels in each dimension of r and n is the dimensionality of r.
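In the sketch above, the assignment of (5) is simply the component with the largest responsibility for the point [q, r^*], which a fitted scikit-learn mixture returns directly (again, hypothetical variables carried over from the previous sketches, not the paper's mclust-based code):

import numpy as np
# assumes `mog`, `q_new`, and `r_star` from the previous sketches
x = np.hstack([q_new, [r_star]]).reshape(1, -1)
k_star = int(mog.predict(x)[0])   # argmax_k pi_k N([q, r*]; mu_k, Sigma_k), equation (5)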

We compare these cluster-specific true verification performances with our model's prediction of the verification performance at each cluster center. The most probable estimate of the verification performance r_k^* evaluated at the center (i.e. mean) of cluster k has a credible interval (c, d) of size (1 - \alpha) such that

\[ \int_{c}^{d} P(r \mid q = \mu_k^q) \, dr = 1 - \alpha, \tag{6} \]

where \mu_k^q denotes the quality component of the k-th mixture component mean \mu_k.

Note that we employ this strategy of evaluating model performance because of the limited nature of our testing data set. Given sufficient test data, our model's performance could be evaluated at each point in the quality space. As a very rough estimate, we need 100 genuine samples at each point in the quality space for a reliable measurement of FRR = 0.01.

4.1. Credible Interval for Computed FRR

We describe a Bayesian approach for computing the credible interval of the cluster-specific FRR computed from the test data set. Let G_k and I_k denote the sets of genuine and impostor scores corresponding to cluster k. Given the desired operating point FAR_desired, we can obtain a decision threshold t_k by solving the following equation:

\[ \mathrm{FAR}_{\mathrm{desired}} = \frac{n(\{ s \in I_k : s > t_k \})}{n(I_k)}, \tag{7} \]

where n(I_k) denotes the cardinality of the set I_k. Now, for all instances in G_k, we can make a verification decision w \in \{0, 1\} based on this decision threshold t_k as follows:

\[ w^{(i)} = \begin{cases} 1 & \text{if } G_k^{(i)} < t_k, \\ 0 & \text{otherwise}. \end{cases} \tag{8} \]

Therefore, each verification decision can be thought of as the outcome of a Bernoulli trial. Let m be a random variable indicating the number of w^{(i)} = 1 observations out of a total of N = |G_k| verification decisions. The probability of getting m successes in N trials follows a Binomial distribution Bin(m | N, \mu), where P(w = 1 | \mu) = \mu. We are interested in the posterior distribution of \mu, which in turn defines the distribution of the FRR given by

\[ \mathrm{FRR} = \frac{m}{N}. \tag{9} \]
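A minimal numerical sketch of (7)-(9), assuming the cluster-specific impostor and genuine scores are available as NumPy arrays (the helper name and the quantile-based thresholding are illustrative simplifications):

import numpy as np

def cluster_threshold_and_frr(impostor_scores, genuine_scores, far_desired=0.001):
    # Threshold t_k of (7): exceeded by a fraction far_desired of the impostor scores
    t_k = np.quantile(impostor_scores, 1.0 - far_desired)
    # Decisions of (8): a genuine score below t_k is a false reject (w = 1)
    m = int(np.sum(genuine_scores < t_k))
    N = genuine_scores.size
    return t_k, m / N   # FRR = m / N, equation (9)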

Assuming a Beta distribution as the prior distribution of \mu, the posterior distribution of \mu is proportional to the product of the Binomial likelihood function Bin(m | N, \mu) and the Beta prior Beta(a, b). Based on the property of conjugate priors [4, p.70], the posterior distribution of \mu is a Beta distribution Beta(m + a, l + b), where l = N - m. The FRR given by (9) then has a Bayesian credible interval (c, d) of size 1 - \alpha such that

\[ \int_{c}^{d} \mathrm{Beta}(\mu; m + a, l + b) \, d\mu = 1 - \alpha. \tag{10} \]

Since we do not have any prior knowledge about \mu, we assume a uniform prior, i.e. Beta(a = 1, b = 1).
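For example, with a uniform prior the credible interval of (10) can be read off the quantile function of the posterior Beta distribution; the counts below are hypothetical, and an equal-tailed interval is assumed:

from scipy.stats import beta

m, N = 12, 1000                 # hypothetical false-reject count and number of genuine trials
a, b, alpha = 1, 1, 0.05        # uniform Beta(1, 1) prior and a 95% interval
posterior = beta(m + a, (N - m) + b)                           # Beta(m + a, l + b) with l = N - m
c, d = posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2)  # credible interval bounds
print(m / N, (c, d))            # point estimate of the FRR and its credible interval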

5. Experiments

Figure 1. MultiPIE camera and flash positions used in this paper: cameras 13_0, 14_0, 05_1 (frontal view), 05_0, 04_1 and flashes 01, 05, 07, 09, 13, with the subject seated in a chair with a head rest.

We present experimental results to show that the proposed model of (2) can indeed capture the relation between quality q and performance r. In this study, we use the FaceVACS [5] recognition system and the neutral-expression, four-session (first recording only) subset of the MultiPIE data set [9]. Of the total 337 subjects, our training set consists of the 129 subjects that are present in all four sessions. The remaining 208 subjects are used for testing.

For both the training and testing data sets, we use high quality frontal mugshots for the gallery (or enrollment) set. Image quality (i.e. pose and illumination) variations are only present in the probe (or query) set. The probe set contains images from the 5 camera and 5 flash positions depicted in Figure 1. Since the quality of our gallery set remains fixed, in all our experiments the quality vector quantifies the pose and illumination of only the probe image, i.e. q = [q_1^p, q_2^p]. In Section 5.1, we describe the quality vector q in more detail. Furthermore, the recognition performance vector is a single-dimensional quantity r = [r_1], where r_1 denotes the False Reject Rate (FRR) on a base-10 log scale. For all results presented in this paper, the False Accept Rate is fixed to 0.001. This experimental design simulates a real-world verification scenario where the gallery is fixed to a set of high quality frontal mugshots and the variable of interest is the expected face recognition performance (i.e. FRR) at some predefined operating point (i.e. FAR).

We have designed our experiment such that session variation and image alignment have minimal impact on the face recognition performance. We select the high quality gallery image from the same session as the probe image. Furthermore, we disable the automatic eye-coordinate-based image alignment of FaceVACS by supplying manually annotated eye coordinates. This ensures that the facial image alignment is consistent even for non-frontal view images.

5.1. Image Quality Assessment

Many types of quality variations can degrade the quality of a facial image. There exists a multitude of algorithms in Computer Vision to assess common facial image properties like pose, illumination direction, noise, blur, etc. In this paper, we use the Image Quality Assessment (IQA) tool dbassess included with the FaceVACS [5] SDK. This IQA tool measures a large number of image quality parameters. However, we only use the DeviationFromFrontalPose (q_1) and DeviationFromUniformLighting (q_2) parameters because our training and testing data sets mainly contain variations in pose and illumination.

In Figure 2, we show the distribution of these two quality parameters (q_1 and q_2) for the probe images in the training data set. The distribution of q_1 for frontal view images is centered around −1.0, while for non-frontal views it shifts toward +2.0. Similarly, while keeping the pose fixed to the frontal view, we vary the illumination and observe that for frontal illumination the distribution of q_2 is centered around −2.0, while for other illumination conditions it shifts towards values ≥ 0. This shows that the two quality parameters have the desired response to the pose and illumination variations present in our data set.

5.2. Training

In order to train our data-driven model of (2), we would ideally want a very large number of probe images with the same quality to evenly occupy each position in the quality space.


Figure 2. Distribution of image quality values for probe images in our training set: counts of DeviationFromFrontalPose (q_1^p) per camera (13_0 left view, 14_0, 05_1 frontal, 05_0, 04_1 right view) and counts of DeviationFromUniformLighting (q_2^p) per flash (01 left flash, 05, 07 frontal flash, 09, 13 right flash). For the illumination distributions, the pose is frontal (i.e. 05_1).

Figure 3. Gaussian mixture components projected onto the quality space (DeviationFromFrontalPose vs. DeviationFromUniformLighting). Image insets show a sample from each quality region.

However, it is difficult to obtain such a data set. In Figure 4 (left), each point corresponds to a unique probe image in our training data set. We observe that some regions of the quality space are sparsely populated by the training data. Therefore, we apply a quality space sampling strategy that adapts to the nature of the available training data. We define sampling points along q_1 and q_2 based on N_q (= 28) quantiles of evenly spaced probabilities in the quality space. At each sampling point q = [q_1^p, q_2^p], we select the closest N_s (= 250) samples around q. We aggregate all similarity scores for which the quality of the probe corresponds to these closest N_s samples. These aggregated scores define the r vector (i.e. FRR at FAR = 0.001) for that particular q vector. To avoid collecting scores from very large distances, we discard sampling points q that do not acquire sufficient scores within a certain predefined range. For the training data set, Figure 4 (right) shows the true FRR at each sampling point in the quality space. For some sampling points FRR = 0, and therefore, to avoid −∞ on the log scale, we assign all such instances r_1 = −3.0 (i.e. FRR = 0.001).
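A rough sketch of this sampling strategy is given below; the helper frr_from_probes is hypothetical (it must return the FRR at FAR = 0.001 computed from the similarity scores of the selected probes), and the rule for discarding distant sampling points is omitted for brevity.

import numpy as np

def build_training_vectors(probe_q, frr_from_probes, n_q=28, n_s=250):
    probs = np.linspace(0.0, 1.0, n_q)
    grid_q1 = np.quantile(probe_q[:, 0], probs)   # sampling points along q1
    grid_q2 = np.quantile(probe_q[:, 1], probs)   # sampling points along q2
    vectors = []
    for g1 in grid_q1:
        for g2 in grid_q2:
            q = np.array([g1, g2])
            # the n_s probe images whose quality is closest to this sampling point
            idx = np.argsort(np.linalg.norm(probe_q - q, axis=1))[:n_s]
            frr = max(frr_from_probes(idx), 1e-3)  # floor at 0.001 to avoid log10(0)
            vectors.append([g1, g2, np.log10(frr)])
    return np.array(vectors)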

From the training data set, we have a set of 441 training vectors [q, r], which is used to learn the model parameters of (2). We use the EM algorithm implementation available in the R library mclust [8]. We select the number of clusters K = 5 because, on the training set, this results in the most distinct clusters in the quality space. Furthermore, given the limited nature of our training data set, we cannot reliably estimate a model with full covariance matrices. Therefore, we select the VVI model parametrization (see [8] for details), which defines the covariance matrix as \Sigma_k = \lambda_k A_k, where A_k is a diagonal matrix whose elements are proportional to the eigenvalues and \lambda_k is an associated constant of proportionality. Here, \lambda_k and A_k govern the volume and shape of the k-th mixture component.

The projection of the resulting mixture component regions onto the quality space is shown in Figure 3. In Figure 5 (left), we show a plot of (r^*, q_1, q_2), where r^* denotes the most probable estimate of recognition performance computed using (4). In Figure 5 (right), we show the corresponding value of the probability density function P(r^* | q_1, q_2). Note that the FRR values are shown on a log scale in order to provide a better visualization of the different regions in the quality space formed by the mixture components.

5.3. Performance Prediction

Figure 6. Cluster-specific verification performance (FRR at FAR = 0.001) per cluster ID for the training set, the test set, and the model prediction at each cluster center, where the error bars indicate the 95% credible interval.

As described in Section 4, Figure 6 shows the cluster-specific verification performance as predicted by our model (at each cluster center in the quality space) and as measured on the training and testing data sets. The error bars indicate the 95% credible interval (i.e. \alpha = 0.05).

Figure 7. ROC curves (False Reject Rate vs. False Accept Rate) for verification decisions based on our cluster-specific decision thresholds and on a naive fixed-threshold scheme.

As an illustrative application of our model, we show that adapting the verification decision threshold based on image quality information can improve verification performance. From the training data set, we compute a cluster-specific decision threshold (for FAR = 0.001) from the samples assigned to each cluster. During testing, we compute the most probable cluster assignment and then apply the cluster-specific threshold to make the verification decision. As a baseline, we compute a single decision threshold corresponding to FAR = 0.001 from the full training set and apply it to all instances in the test set. This simulates the operation of a naive system that uses a fixed decision threshold for all verification instances without considering image quality. The selected FAR = 0.001 denotes a single operating point and therefore gives a single point on the ROC curve shown in Figure 7. We therefore repeat this procedure for other values of FAR to obtain the full ROC curves shown in Figure 7.
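A compact sketch of this quality-adaptive decision rule follows; the function and argument names are hypothetical, and cluster_thresholds is assumed to hold the threshold computed from the training scores of each cluster via (7):

import numpy as np

def adaptive_verification_decision(score, q_probe, mog, r_grid, cluster_thresholds):
    # Most probable performance r* for this probe quality, as in (4)
    cand = np.column_stack([np.tile(q_probe, (r_grid.size, 1)), r_grid])
    r_star = r_grid[np.argmax(mog.score_samples(cand))]
    # Cluster assignment as in (5), then that cluster's decision threshold
    k = int(mog.predict(np.hstack([q_probe, [r_star]]).reshape(1, -1))[0])
    return score >= cluster_thresholds[k]   # accept the claimed identity if the score clears the threshold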

The adaptive decision threshold scheme based on our model achieves FRR = 4.72% (at FAR = 0.0102%), while the naive scheme achieves FRR = 5.53% (at FAR = 0.0107%), an improvement of 0.81% in FRR. This improvement in performance, though small, shows the merit of our model in exploiting image quality information for performance prediction. Further performance gains can be achieved by including additional image quality parameters and by using a less constrained model parametrization.

6. Discussion

In this paper, we build a data-driven model to learn the relation between the quality q of a probe and gallery image pair and the corresponding recognition performance r. Recall that, although our model is capable of handling the quality of both probe and gallery images, we only consider the quality of probe images in our experiments. This is done to simulate a real-world verification scenario in which the gallery is fixed to a set of high quality frontal mugshots and only the quality of the probe images varies. Our model parametrizes the [q, r] space as a linear combination of multivariate Gaussians. In Figure 3, we show the projection of these Gaussians onto the quality space. FaceVACS is fine-tuned for optimal verification performance on frontal view images, for which its performance remains largely invariant to illumination variation. Mixture Component (MC) 1 captures this property of the system and therefore occupies the region of the quality space corresponding to the frontal view (−3 ≤ q_1^p ≤ 0) and all possible illumination conditions (−4 ≤ q_2^p ≤ 6). MC 3 corresponds to slightly non-frontal poses and reflects the near-tolerance of FaceVACS to pose variations of ±15°. The remaining MCs, i.e. k ∈ {2, 4, 5}, are located in the region of the quality space corresponding to non-frontal poses (i.e. 1 ≤ q_1^p ≤ 4). In this region, there are multiple clusters along the illumination variation axis (i.e. q_2^p), which indicates that for non-frontal poses, illumination variation has a strong impact on verification performance. Furthermore, MC 2 corresponds to images with both non-frontal pose and non-frontal illumination: the worst image quality in our data set.

Based on (4), we compute the most probable estimate of the verification performance vector r^* at each point in the quality space and show it in Figure 5 (left). This map clearly marks the boundaries of MC {1, 2, 5}. However, the boundary between MC 3 and MC 4 is not clearly visible because these two mixture components differ very little in verification performance. Yet, our model uses two mixture components to represent this region because multiple types of quality variation (corresponding to different regions in the quality space) can have a similar impact on verification performance.


Figure 4. Position of the training probe samples in the quality space, DeviationFromFrontalPose (q_1^p) vs. DeviationFromUniformLighting (q_2^p) (left), and the map of the corresponding face recognition performance, i.e. the true FRR at FAR = 0.001 with the FRR on a log scale (right).

Figure 5. Map of verification performance in the quality space as predicted by our model: the most probable estimate r^* (FRR on a log scale) (left) and P(r^* | q) with the mixture component regions labeled (right).

For example, for a uniformly illuminated non-frontal view image (i.e. camera 04_1 and flash 09), the verification performance is degraded to a certain level by the non-frontal pose, whereas for a poorly illuminated near-frontal view image (i.e. camera 05_0 and flash 01), the verification performance is degraded to a similar level, but due to the poor illumination. Therefore, we expect multiple regions of the quality space to have similar verification performance. This phenomenon is nicely illustrated by the L-shaped region formed by MC 3 and MC 4. The corresponding map of P(r^* | q) is shown in Figure 5 (right), which shows low confidence in the model prediction at the boundary regions of the mixture components.

The cluster-specific verification performance in Figure 6 shows that for clusters k ∈ {1, 2, 4}, our model accurately classifies both testing and training verification samples into clusters for which the verification performance remains fairly consistent. For clusters k ∈ {2, 5}, we observe a large difference between the true verification performance of the training and testing data sets. This indicates that the two quality parameters (pose and illumination) used in our model may not be sufficient to capture all the variations that exist in our data set. For example, some subjects in our data set wear glasses and others have a large part of their face occluded by facial hair. Another reason for this large variation might be the diagonal model parametrization used in this paper. While this parametrization reduces the model complexity, it enforces an independence constraint between the quality and recognition performance variables.

Figure 6 also shows that our model's predictions are very close to the true verification performance observed on the training and testing data sets. For cluster 2, we observe a very large credible interval for our model's prediction. Recall that we evaluate the model's performance at each cluster center. Since our model training is based on only 250 score samples at each sampling point in the quality space, we observe a large variance in the model predictions. For our training and testing data sets, we observe a small variance in the verification performance because the estimates are based on a very large number of samples (> 1000) classified to each cluster. These limitations are common to most data-driven models, indicating the need for more densely distributed training data in the quality space.

7. Conclusion

In this paper, we propose a data-driven model to learn the relation between facial image quality and the corresponding recognition performance. Adopting a Bayesian approach, we model this relationship as a probability density function. For a previously unseen verification instance, we predict the verification performance by evaluating the posterior distribution for the given image quality. This posterior distribution also quantifies the uncertainty in the decision about identity. A remarkable property of our model is that it relies solely on image quality information and does not require similarity scores to make predictions about recognition performance. For a data set containing pose and illumination variations, we have shown that the proposed model is able to identify regions (i.e. clusters) in the quality space over which the face recognition performance varies fairly consistently. Furthermore, we have shown an illustrative application of our model in which we observe an improvement in verification performance by using image quality information to adapt the decision threshold.

A limitation of the proposed data-driven model is that it requires a sufficiently large number of training samples spread densely in the quality space. Provided that we succeed in acquiring sufficient training and testing data, we envisage extending our model to include additional quality parameters (noise, sharpness, expression, etc.) and more recognition performance parameters (such as the Area Under the ROC, calibrated log-likelihood ratios, more points on the ROC, etc., and their combinations).

Acknowledgements

This work was supported by the BBfor2 project, which is funded by the EC as a Marie Curie ITN project (FP7-PEOPLE-ITN-2008) under Grant Agreement number 238803. We would also like to thank Cognitec Systems GmbH for supporting our research by providing the FaceVACS software. Results obtained with FaceVACS were produced in experiments conducted by the University of Twente, and should therefore not be construed as a vendor's maximum-effort, full-capability result.

References

[1] G. Aggarwal, S. Biswas, P. J. Flynn, and K. W. Bowyer. Predicting performance of face recognition systems: An image characterization approach. In Computer Vision and Pattern Recognition Workshops (CVPRW), pages 52–59, 2011.

[2] G. Aggarwal, S. Biswas, P. J. Flynn, and K. W. Bowyer. Predicting good, bad and ugly match pairs. In IEEE Workshop on Applications of Computer Vision, pages 153–160, 2012.

[3] J. R. Beveridge, G. H. Givens, P. J. Phillips, B. A. Draper, and Y. M. Lui. Focus on quality, predicting FRVT 2006 performance. In FG'08, pages 1–8, 2008.

[4] C. M. Bishop. Pattern Recognition and Machine Learning, volume 1. Springer New York, 2006.

[5] Cognitec Systems. FaceVACS C++ SDK Version 8.7.0, 2012.

[6] G. Doddington, W. Liggett, A. Martin, M. Przybocki, and D. Reynolds. Sheep, goats, lambs and wolves: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation. In Proceedings of the International Conference on Spoken Language Processing, 1998.

[7] A. Dutta, R. N. J. Veldhuis, and L. J. Spreeuwers. Can facial uniqueness be inferred from impostor scores? In Biometric Technologies in Forensic Science (BTFS), 2013.

[8] C. Fraley, A. E. Raftery, T. B. Murphy, and L. Scrucca. mclust version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report 597, Department of Statistics, University of Washington, 2012.

[9] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. In FG'08, pages 1–8, 2008.

[10] B. F. Klare and A. K. Jain. Face recognition: Impostor-based measures of uniqueness and quality. In Biometrics: Theory, Applications and Systems (BTAS), pages 237–244, 2012.

[11] W. Li, X. Gao, and T. E. Boult. Predicting biometric system failure. In Computational Intelligence for Homeland Security and Personal Safety, pages 57–64, 2005.

[12] N. Ozay, Y. Tong, F. W. Wheeler, and X. Liu. Improving face recognition with a quality-based probabilistic framework. In CVPR Workshops, pages 134–141, 2009.

[13] W. J. Scheirer, A. Rocha, R. J. Micheals, and T. E. Boult. Meta-Recognition: The Theory and Practice of Recognition Score Analysis. IEEE PAMI, 33(8):1689–1695, 2011.

[14] Z. Shi, F. Kiefer, J. Schneider, and V. Govindaraju. Modeling biometric systems using the general Pareto distribution (GPD). In Proc. SPIE, volume 6944, pages 69440O–11, 2008.

[15] E. Tabassi, C. Wilson, and C. I. Watson. Fingerprint Image Quality. Interagency/Internal Report (NISTIR) 7151, NIST, April 2004.

[16] P. Wang, Q. Ji, and J. L. Wayman. Modeling and Predicting Face Recognition System Performance Based on Analysis of Similarity Scores. IEEE PAMI, 29(4):665–670, 2007.

[17] L. M. Wein and M. Baveja. Using fingerprint image quality to improve the identification performance of the U.S. Visitor and Immigrant Status Indicator Technology program. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7772–7775, 2005.