

How to Make Face Recognition Work: The Power of Modeling Context

Ashish Kapoor 1, Dahua Lin 2, Simon Baker 1, Gang Hua 3, and Amir Akbarzadeh 1

1 Microsoft Research   2 MIT CSAIL   3 Stevens Institute of Technology

Abstract

Face recognition in the wild has been one of the longest-standing computer vision challenges. While there has been constant improvement over the years, variations in appearance, illumination, pose, etc. still make it one of the hardest tasks to do well. In this paper we summarize two techniques that leverage context and show significant improvement over vision-only methods. At the heart of the approach is a probabilistic model of context that captures dependencies induced via a set of contextual relations. The model allows application of standard variational inference procedures to infer labels that are consistent with contextual constraints.

With the ever-increasing popularity of digital photos, vision-assisted tagging of personal photo albums has become an active research topic. Existing efforts in this area have mostly been devoted to using face recognition to help tag people. However, current face recognition algorithms are still not very robust to the variation of face appearance in real photos.

This paper explores how contextual cues can aid and improve face recognition in the wild. In particular, instead of just focusing on answering who, we incorporate methods to consider what, when, and where, and in doing so show how such contextual information can immensely help with the task. We consider the domains of people, events, and locations as a whole; the key insight is that the domains are not independent, and knowledge in one domain can help the others. For example, if we know the event that a photo was captured in, we can probably infer who was in the photo, or at least reduce the set of possibilities. On the other hand, the identities of the people in a photo may help us infer when and where the photo was taken.

Our approach is based on the principles of probabilistic graphical models, where we build a network of classifiers over different domains. Ideally, if a strong classifier is available to recognize the instances in one domain accurately, the labels in this domain can be used to help recognition in the other domains. However, a challenge that arises in real systems is that we often do not have a strong initial classifier in any of the domains. One of the primary contributions of the work presented here is to develop a unified framework that couples the recognition across the domains. A joint learning and inference algorithm allows accurate recognition in all domains by exploiting the statistical dependency between them, and reinforces individual classifiers, which alone may be relatively weak.

Related Work

Over the last decade, there has been a great deal of interest in the use of context to help improve face recognition accuracy in personal photos. A recent survey of context-aided face recognition can be found in (Gallagher and Chen 2008). Zhang et al. (Zhang et al. 2003) utilized body and clothing in addition to face for people recognition. Davis et al. (Davis et al. 2005; 2006) developed a context-aware face recognition system that exploits GPS-tags, time-stamps, and other metadata. Song and Leung (Song and Leung 2006) proposed an adaptive scheme to combine face and clothing features based on the time-stamps. These methods treat various forms of contextual cues as linearly additive features, and thus oversimplify the interaction between different domains.

Various methods based on co-occurrence have also been proposed. Naaman et al. (Naaman et al. 2005) leveraged time-stamps and GPS-tags to reduce the candidate list based on people co-occurrence and temporal/spatial re-occurrence. Gallagher and Chen (Gallagher and Chen 2007b) proposed an MRF to encode both face similarity and exclusivity. In later work by the same authors (Gallagher and Chen 2007a), a group prior is added to capture the tendency that certain groups of people are more likely to appear in the same photo. In addition, Anguelov et al. (Anguelov et al. 2007) developed an MRF model to integrate face similarity, clothing similarity, and exclusivity.

There is also a lot of research on the use of context in object recognition and scene classification. For example, Torralba et al. (Torralba et al. 2003; Torralba 2003) used scene context as a prior for object detection and recognition. Rabinovich et al. (Rabinovich et al. 2007) proposed a CRF model that utilizes object co-occurrence to help object categorization. Galleguillos et al. (Galleguillos, Rabinovich, and Belongie 2008) extended this framework to use both object co-occurrence and spatial configurations for image segmentation and annotation. Li and Fei-Fei (Li and Fei-Fei 2007) proposed a generative model that can be used to label scenes and objects by exploiting their statistical dependency. In later work (Li, Socher, and Fei-Fei 2009), the same authors extended this model to incorporate object segmentation. Cao et al. (Cao et al. 2008) employed a CRF model to label events and scenes coherently.

Modeling Context

In this paper we summarize two approaches proposed earlier to model context in the face recognition task. The first is a model that extends the vanilla face recognition setting to incorporate simple match and non-match constraints (Kapoor et al. 2009) that are derived from the physical context of individual photographs (e.g., two faces in the same image cannot have the same identity). The second model (Lin et al. 2010) is a more sophisticated version of the first, where the location, time, and co-occurrence patterns of individuals in a photograph are considered as context to influence the classification. The core idea behind both models is a powerful probabilistic framework that enables us to incorporate context via a network of classifiers that depend upon each other. Also note that both models follow the discriminative modeling paradigm, as neither of them attempts to model P(X), the high-dimensional underlying density of observations. Finally, the inference methodology for both models follows the principles of variational approximation, where classification and constraint resolution are performed iteratively.

Model 1: Match and Non-match Constraints

This is a fairly simple model (Kapoor et al. 2009) that extends the classic supervised classification setting to allow incorporation of additional sources of prior information. In this model, we consider two simple contextual constraints: match and non-match. First, if two faces appear in the same unedited photo, the two faces cannot have the same identity. We call such constraints non-match constraints. Another example is in video. If faces are tracked and two images are from the same track, they must have the same identity. We call such constraints match constraints.

Model: Assume we are given a set of face images X = {x_i}. We partition this set into a set of labeled ones X_L with labels t_L = {t_i | i ∈ L} and a set of unlabeled ones X_U. The model consists of a network of predictions that interact with one another such that the decision of each predictor is influenced by the decisions of its neighbors. Specifically, given match and non-match constraints, we induce a graph where every vertex corresponds to a label t_i, i ∈ L ∪ U, and is connected to its neighbors according to the given constraints. We will denote the sets of edges corresponding to match and non-match constraints as E+ and E- respectively.
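To make the edge construction concrete, here is a minimal Python sketch (not the authors' code) of how E+ and E- might be derived from photo and video-track metadata; the photos and tracks structures are hypothetical illustrations.

from itertools import combinations

def build_constraint_edges(photos, tracks):
    """photos: list of face-index lists, one per photo;
    tracks: list of face-index lists, one per video face track."""
    e_minus = set()  # non-match: two faces in one photo differ in identity
    for faces in photos:
        e_minus.update(combinations(sorted(faces), 2))
    e_plus = set()   # match: faces from the same track share an identity
    for faces in tracks:
        e_plus.update(combinations(sorted(faces), 2))
    return e_plus, e_minus

# Example: photo 0 contains faces 0 and 1; a track links faces 1 and 2.
e_plus, e_minus = build_constraint_edges([[0, 1]], [[1, 2]])
print(e_plus, e_minus)  # {(1, 2)} {(0, 1)}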

Figure 1 illustrates the factor graph corresponding to the proposed model. The class labels {t_i : i ∈ L ∪ U} are denoted by squares and influence each other based on different match (green lines) and non-match (dashed red lines) constraints. In addition to these constraints, our model also imposes smoothness constraints using a GP prior (Rasmussen and Williams 2006). We introduce latent variables Y = {y_i}, i = 1, ..., n, that use a GP prior to enforce the assumption that similar points should have similar predictions.


Figure 1: Factor graph depicting the first proposed discriminative model. The shaded nodes correspond to the observed labels (training data), and the thick green and dashed red lines correspond to match and non-match constraints respectively.

In particular, the latent variables are assumed to be jointly Gaussian, and the covariance between two outputs y_i and y_j is typically specified using a kernel function applied to x_i and x_j. Formally, p(Y|X) ∼ N(0, K), where K is a kernel matrix with K_ij = k(x_i, x_j) that encodes the similarity between two different face regions. (This kernel matrix is positive semidefinite and is akin to the kernel matrix used in classifiers such as SVMs.)
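As an illustration, the kernel matrix K could be built as follows; the paper does not specify a particular kernel, so the RBF choice here is an assumption made for the sketch.

import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """X: (n, d) array of face-feature vectors. Returns the PSD matrix
    K with K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clip tiny negatives

K = rbf_kernel_matrix(np.random.randn(5, 16))  # toy 5-face example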

Given a pool of images, the model induces a conditional probability distribution p(t, Y|X) using the GP prior p(Y|X) and potential functions φ, ψ+, and ψ-. Here φ encodes the compatibility of a label t and the corresponding latent variable y. Further, ψ+ and ψ- encode the pairwise label compatibility according to the match and non-match constraints respectively. Thus, the conditional distribution induced by the model can be written as:

$$p(\mathbf{t}, \mathbf{Y} | \mathbf{X}) = \frac{1}{Z}\, p(\mathbf{Y}|\mathbf{X}) \prod_{i=1}^{n} \phi(y_i, t_i) \prod_{(i,j) \in E^{+}} \psi^{+}(t_i, t_j) \prod_{(i,j) \in E^{-}} \psi^{-}(t_i, t_j),$$

where Z is the partition function (normalization term) and the potentials φ, ψ+, and ψ- take the following form:

$$\phi(y_i, t_i) \propto e^{-\frac{\|y_i - \bar{t}_i\|^2}{2\sigma^2}}, \qquad \psi^{+}(t_i, t_j) = \delta(t_i, t_j), \qquad \psi^{-}(t_i, t_j) = 1 - \delta(t_i, t_j).$$

Here, δ(·, ·) is the Kronecker delta function, which evaluates to 1 whenever the arguments are equal, and zero otherwise. Also, t̄_i is the indicator vector corresponding to t_i, and σ² is the noise parameter that determines how tight the relation between the smoothness constraint and the final label is. By changing the value of σ we can emphasize or de-emphasize the effect of the GP prior. Note that in the absence of any match and non-match constraints the model reduces to a multi-class classification scenario with GP models (Kapoor et al. 2007; Rasmussen and Williams 2006).
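The following sketch evaluates the unnormalized log of these potentials for a candidate labeling, treating ψ+ and ψ- as hard constraints as the delta definitions imply; all names are illustrative, not from the authors' implementation.

import numpy as np

def log_potentials(Y, t, e_plus, e_minus, sigma=1.0):
    """Y: (n, C) latent GP outputs; t: length-n integer labels."""
    n, C = Y.shape
    t_bar = np.eye(C)[t]  # indicator vectors corresponding to each t_i
    log_phi = -np.sum((Y - t_bar) ** 2) / (2.0 * sigma ** 2)
    # psi+ / psi- act as hard constraints: a violated constraint
    # contributes a zero potential, i.e. -inf in the log domain.
    for i, j in e_plus:
        if t[i] != t[j]:
            return -np.inf  # match constraint violated
    for i, j in e_minus:
        if t[i] == t[j]:
            return -np.inf  # non-match constraint violated
    return log_phi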

Inference: Given some labeled data, the key task is to infer p(t_U | X, t_L), the posterior distribution over the unobserved labels t_U = {t_i | i ∈ U}. The inference is performed by simple message passing to resolve the smoothness, match, and non-match constraints and infer the unobserved variables in an efficient manner. Intuitively, the inference alternates between a classification step and a constraint resolution scheme. A simple application of the face recognition classifier on the unlabeled set might lead to labels that are inconsistent with the contextual constraints. However, these inconsistencies are resolved by performing belief propagation. This resolution of inconsistencies induces new beliefs about the unlabeled points, which are propagated to the classifier in the next iteration. By iterating between these two updates the model consolidates information from both components, and thus provides a good approximation of the true posterior. This scheme is an approximate inference procedure derived by maximizing a variational lower bound (see Kapoor et al. (2009) for the details).
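A schematic rendering of this alternation is sketched below. The actual update rules come from the variational bound in Kapoor et al. (2009); the constraint-resolution step here is a simplified heuristic stand-in, shown only to convey the structure of the loop.

import numpy as np

def infer_labels(K, labeled, t_L, n_classes, e_plus, e_minus, iters=10):
    """K: (n, n) kernel matrix; labeled: indices with known labels t_L."""
    n = K.shape[0]
    B = np.full((n, n_classes), 1.0 / n_classes)  # beliefs over labels
    B[labeled] = np.eye(n_classes)[t_L]
    for _ in range(iters):
        # Classification step: GP-style smoothing of the current beliefs.
        Y = (K @ B) / K.sum(axis=1, keepdims=True)
        B = np.exp(Y) / np.exp(Y).sum(axis=1, keepdims=True)
        # Constraint resolution: pull matched pairs together and push
        # non-matched pairs apart, then renormalize.
        for i, j in e_plus:
            avg = (B[i] + B[j]) / 2.0
            B[i], B[j] = avg, avg.copy()
        for i, j in e_minus:
            B[i] = B[i] * (1.0 - B[j]); B[i] /= B[i].sum()
            B[j] = B[j] * (1.0 - B[i]); B[j] /= B[j].sum()
        B[labeled] = np.eye(n_classes)[t_L]  # clamp the observed labels
    return B.argmax(axis=1)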

Model 2: Location, Co-occurrence and Time

The second model (Lin et al. 2010) is richer and specifically considers four kinds of contextual relations: (a) the people-event relation models who attended which events, (b) the people-people relation models which pairs of people tend to appear in the same photo, (c) the event-location relation models which event happened where, and (d) the people-location relation models who appeared where. These relations embody a wide range of contextual information, which is modeled uniformly under the same mathematical framework. It is important to note that each pair of related domains is symmetric with respect to the corresponding relation. This means, for example, that utilizing the people-event relation, event recognition can help people recognition, and people recognition can also help event recognition.

Model: The framework, outlined in Figure 2, consists of three domains: people, events, and locations. Each domain contains a set of instances. In order to account for the uncertainty due to missing data or ambiguous features, we consider the labels in all three domains as random variables to be inferred. Pairs of domains are connected to each other through a set of cross-domain relations that model the statistical dependency between them.

Suppose there are M domains: Y_1, ..., Y_M. Each domain is modeled as a set of instances, where the i-th instance in Y_u is associated with a label of interest, denoted as the random variable y_i^u. While the user can provide a small number of labels in advance, most labels are unknown and need to be inferred.


Figure 2: Our framework comprises three types of entity: (1) the people, event, and location domains, together with their instances; (2) the observed features of each instance in each domain; (3) a set of contextual relations between the domains. Each relation is a 2D table of coefficients that indicate how likely a pair of labels is to co-occur. Although only the people-event relation is shown in this figure, we consider four different relations in this paper. See the body of the text for more details.

Specifically, we consider three domains for people, events, and locations. Each detected face corresponds to a person instance in the people domain, and each photo corresponds to both an event instance and a location instance. Each domain is associated with a set of features to describe its instances. In particular, person instances are characterized by their facial appearance and clothing, while events and locations are characterized by time-stamps and the background color distribution respectively.

To exploit the statistical dependency between the labels in different domains, we introduce a relational model R_uv between each pair of related domains Y_u and Y_v. Each R_uv is parameterized by a 2D table of coefficients that indicate how likely a pair of labels is to co-occur. Taking advantage of these relations, we can use the information in one domain to help infer the labels in others.

Here, we use Y and X to represent the labels and features of all domains. The formulation has two parts: (1) p(Y|R, X), the joint likelihood of the labels given the relational models and features, and (2) p(R), the prior put on the relations to regularize their estimation.

The joint likelihood of the data labels is directly modeled as a conditional distribution based on the observed features:

$$p(Y|X; R) = \frac{1}{Z} \exp\left( \sum_{u=1}^{M} \alpha_u \Phi_u(Y_u; X_u) + \sum_{(u,v) \in \mathcal{R}} \alpha_{uv} \Phi_{uv}(Y_u, Y_v; R_{uv}) \right).$$

This likelihood contains: (1) an affinity potential Φ_u(Y_u; X_u) for each domain Y_u to model feature similarity, and (2) a relation potential Φ_uv(Y_u, Y_v; R_uv) for each pair of related domains (u, v) ∈ R. The terms are combined with weights α_u and α_uv.

The affinity potential term Φ_u captures the intuition that two instances in Y_u with similar features are likely to be in the same class:

$$\Phi_u(Y_u; X_u) = \sum_{i=1}^{N_u} \sum_{j=1}^{N_u} w_u(i, j)\, I(y_i^u = y_j^u).$$

Here, w_u(i, j) is the similarity between the features of the instances corresponding to y_i^u and y_j^u. I(·) denotes the indicator function that equals 1 when the condition inside the parenthesis holds. The similarity function w_u depends on the features used for that domain. Intuitively, Φ_u considers all instances of Y_u over the entire collection, and attains a large value when instances with similar features are assigned the same labels. Maximizing Φ_u should therefore result in clusters of instances that are consistent with the feature affinity.
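Because the double sum reduces to summing similarities over same-label pairs, the affinity potential can be computed in a few lines; this is a minimal sketch in which w is assumed to be a precomputed similarity matrix for the domain.

import numpy as np

def affinity_potential(w, y):
    """w: (N, N) pairwise feature-similarity matrix for one domain;
    y: length-N integer label array (NumPy). Sums w(i, j) over all
    pairs of instances assigned the same label."""
    same = (y[:, None] == y[None, :])  # the indicator I(y_i == y_j)
    return float(np.sum(w * same))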

The relational potential term Φ_uv(Y_u, Y_v; R_uv) models the cross-domain interaction between the domains Y_u and Y_v. The relational model R_uv is parameterized as a 2D table of co-occurrence coefficients between pairs of labels. For example, for the people domain Y_u and the event domain Y_v, R_uv(k, l) indicates how likely it is that person k attended event l. We define Φ_uv to be:

$$\Phi_{uv}(Y_u, Y_v; R_{uv}) = \sum_{i \sim j} \sum_{k,l} R_{uv}(k, l)\, I(y_i^u = k)\, I(y_j^v = l).$$

Here, i ∼ j means that y_i^u and y_j^v co-occur in the same photo. Intuitively, a large value of R_uv(k, l) indicates that the pair of labels k and l co-occur often, and will encourage y_i^u to be assigned k and y_j^v to be assigned l. Hence, maximizing Φ_uv should lead to labels that are consistent with the relation.
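Since the double indicator sum collapses to a table lookup for each co-occurring pair, Φ_uv is equally simple to evaluate; in this sketch, pairs is a hypothetical list of co-occurring instance indices.

def relational_potential(R, y_u, y_v, pairs):
    """R: (K, L) NumPy table of co-occurrence coefficients; y_u, y_v:
    integer labels in the two domains; pairs: (i, j) indices of
    instances that appear in the same photo."""
    return float(sum(R[y_u[i], y_v[j]] for i, j in pairs))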

Formally, our goal is to jointly estimate the posterior probability of the labels Y and relations R conditioned on the feature measurements X:

$$p(Y, R|X) \propto p(Y|R, X)\, p(R).$$

Finally, the prior follows standard principles of regularization and sparsity in order to avoid over-fitting:

$$p(R) \propto \exp\left( -\beta_1 \sum_{(u,v) \in \mathcal{R}} \|R_{uv}\|_1 - \beta_2 \sum_{(u,v) \in \mathcal{R}} \|R_{uv}\|_2^2 \right).$$

Here, ||R_uv||_1 and ||R_uv||_2 are the L1 and L2 norms of the relational matrix. The first term encourages sparsity of the relational coefficients, and therefore can effectively suppress coefficients due to occasional co-occurrences, retaining only those capturing truly stable relations. The second term inhibits very large coefficients that might otherwise occur when the class sizes are imbalanced.
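For concreteness, the corresponding log-prior over all relation tables is a simple elastic-net-style penalty; this sketch assumes the tables are given as NumPy arrays.

import numpy as np

def log_prior(relations, beta1=1.0, beta2=0.1):
    """relations: iterable of 2D relation tables R_uv. The L1 term
    encourages sparsity; the squared-L2 term shrinks large entries."""
    return -sum(beta1 * np.abs(R).sum() + beta2 * np.sum(R ** 2)
                for R in relations)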

Inference: The inference is performed via a variational EM algorithm, where the goal is to jointly infer the labels of the instances and estimate the relational models. With a few labels in different domains provided in advance by a user, the algorithm iterates between two steps: (1) infer the distribution of the unknown labels based on both the extracted features and the current relational models R; (2) estimate and update the relational models R using the labels provided by the user and the hidden labels inferred in the previous iteration. As before, the iterations can be thought of as a message passing scheme between individual classifiers and a resolution of labels that satisfies the constraints imposed by the contextual cues. We refer readers to Lin et al. (2010) for technical details.
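The sketch below conveys the shape of this alternation for a single people-event relation. It is a toy rendering, not the authors' algorithm: the M-step shrinkage crudely stands in for the full prior, the E-step updates only the people beliefs, and all function and variable names are illustrative.

import numpy as np

def toy_em(q_p, q_e, pairs, known_p, iters=10, beta2=0.1):
    """q_p: (Np, K) soft people labels; q_e: (Ne, L) soft event labels;
    pairs: (i, j) people/event instances from the same photo;
    known_p: {people instance: class} labels provided by the user."""
    Np, K = q_p.shape
    for _ in range(iters):
        # M-step: estimate R from expected label co-occurrences,
        # with a crude ridge-style shrinkage in place of the full prior.
        R = sum(np.outer(q_p[i], q_e[j]) for i, j in pairs)
        R = R / (1.0 + beta2 * R.sum())
        # E-step: update people beliefs from the related event beliefs.
        for i, j in pairs:
            scores = R @ q_e[j]                  # relation "message"
            q_p[i] = np.exp(scores) / np.exp(scores).sum()
        for i, k in known_p.items():             # clamp known labels
            q_p[i] = np.eye(K)[k]
    return q_p, R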

Experiments

We present experimental results for Model 2, as it is the richer and more powerful of the two. The experiments are carried out on two publicly available data sets that are commonly used to evaluate research in personal photo tagging, which we call the E-Album (Cui et al. 2007) and the G-Album (Gallagher 2008). We refer readers to Lin et al. (2010) for details on feature extraction and implementation. We compare the performance of four different variants of our algorithm: (1) using only people affinity (no contextual information), (2) with the people-people relation, (3) with the people-event relation, and (4) with both relations.

The results are shown in Figure 3. We note three observations. First, on both albums the people-people relation alone provides only a limited improvement (rank-1 errors reduced from 27.8% to 27.0% for the E-Album). Second, the people-event relation gives a much bigger improvement (rank-1 errors reduced from 27.8% to 11.9% for the E-Album). Third, the combination of the people-event relation and the people-people relation yields another significant improvement (rank-1 errors down to 3.2% on the E-Album). To illustrate our results visually, we include a collage of all of the errors for the E-Album in Figure 4. With both the people-event and people-people relations used, there are only three errors (3.2%) on the E-Album (see Figure 4).

These results show: (1) that the people-event and people-people relations provide complementary sources of information, and (2) that the people-people relation is more effective when combined with the people-event relation. The most likely explanation is that the group prior and exclusivity are more powerful when used on the small candidate list provided by the people-event relation.

Overall, we found the G-Album to be more challenging. Partly, this is due to the fact that the G-Album contains a very large number of events (117), each with very few photos (3.8 on average). The people-event relation would be more powerful with more photos per event. Note, however, that our framework still yields a substantial improvement, reducing the rank-1 error rate from 26.3% to 14.6%. Note also that the rank-3 error rate is reduced to zero on both albums, a desirable property in a vision-assisted tagging system where a short-list of candidates is often provided for the user to choose from.

To validate the statistical significance of our results, we randomly generated multiple pre-labeled sets, with the percentage of pre-labeled instances varying from 15% to 55%. Figure 5 contains the median rank-1 results (signified by the central mark) along with the 25th and 75th percentiles (signified by the lower and upper bars) obtained on the E-Album. We found that the improvement is significant across the entire range of pre-labeling percentages in both data sets.


(a) Error rates on E-Album

         P_Only   PP      PE      PP+PE
Rank 1   27.8%    27.0%   11.9%   3.2%
Rank 2   19.0%    16.7%   0.8%    0.8%
Rank 3   13.5%    9.7%    0.0%    0.0%

(b) Error rates on G-Album

         P_Only   PP      PE      PP+PE
Rank 1   26.3%    25.3%   20.1%   14.6%
Rank 2   8.2%     6.4%    2.3%    4.6%
Rank 3   4.9%     3.6%    0.0%    0.0%

Figure 3: People labeling results for several different configurations of our algorithm.


Figure 4: All rank-1 errors for the E-Album. Above the delimiter: errors with no contextual relations (27.8%). Below the delimiter: errors with both the people-event and people-people relations (3.2%).


Figure 5: Statistical significance results on the E-Album with different percentages of pre-labeled instances. Curves from top to bottom: using only face, people-people only, people-event only, and using both.

Conclusion

This paper explores how context can be leveraged to boost face recognition in personal photo collections. We have proposed probabilistic models that incorporate cross-domain relations (people, events, locations) and match/non-match constraints as mechanisms for modeling context in multi-domain labeling. Relation estimation and label inference are unified in a single optimization algorithm. Our experimental results show that probabilistic graphical models provide an elegant, powerful, and general method of modeling context in vision-assisted tagging applications.

References

Anguelov, D.; Lee, K.-C.; Gokturk, S. B.; and Sumengen, B. 2007. Contextual identity recognition in personal photo albums. In CVPR'07.

Cao, L.; Luo, J.; Kautz, H.; and Huang, T. S. 2008. Annotating collections of photos using hierarchical event and scene models. In CVPR'08.

Cui, J.; Wen, F.; Xiao, R.; Tian, Y.; and Tang, X. 2007. EasyAlbum: An interactive photo annotation system based on face clustering and re-ranking. In SIGCHI, 367-376.

Davis, M.; Smith, M.; Canny, J.; Good, N.; King, S.; and Janakiraman, R. 2005. Towards context-aware face recognition. In 13th ACM Conf. on Multimedia.

Davis, M.; Smith, M.; Stentiford, F.; Bamidele, A.; Canny, J.; Good, N.; King, S.; and Janakiraman, R. 2006. Using context and similarity for face and location identification. In SPIE'06.

Gallagher, A. C., and Chen, T. 2007a. Using group prior to identify people in consumer images. In CVPR Workshop on SLAM'07.

Gallagher, A. C., and Chen, T. 2007b. Using a Markov network to recognize people in consumer images. In ICIP'07.

Gallagher, A. C., and Chen, T. 2008. Using context to recognize people in consumer images. IPSJ Journal 49:1234-1245.

Gallagher, A. C. 2008. Clothing cosegmentation for recognizing people. In CVPR'08.

Galleguillos, C.; Rabinovich, A.; and Belongie, S. 2008. Object categorization using co-occurrence, location and appearance. In CVPR'08.

Kapoor, A.; Grauman, K.; Urtasun, R.; and Darrell, T. 2007. Active learning with Gaussian processes for object categorization. In ICCV'07.

Kapoor, A.; Hua, G.; Akbarzadeh, A.; and Baker, S. 2009. Which faces to tag: Adding prior constraints into active learning. In ICCV'09.

Li, L.-J., and Fei-Fei, L. 2007. What, where and who? Classifying events by scene and object recognition. In CVPR'07.

Li, L.-J.; Socher, R.; and Fei-Fei, L. 2009. Towards total scene understanding: Classification, annotation, and segmentation in an automatic framework. In CVPR'09.

Lin, D.; Kapoor, A.; Hua, G.; and Baker, S. 2010. Joint people, event, and location recognition in personal photo collections using cross-domain context. In ECCV'10.

Naaman, M.; Garcia-Molina, H.; Paepcke, A.; and Yeh, R. B. 2005. Leveraging context to resolve identity in photo albums. In ACM/IEEE-CS Joint Conf. on Digital Libraries.

Rabinovich, A.; Vedaldi, A.; Galleguillos, C.; Wiewiora, E.; and Belongie, S. 2007. Objects in context. In ICCV'07.

Rasmussen, C. E., and Williams, C. K. I. 2006. Gaussian Processes for Machine Learning. MIT Press.

Song, Y., and Leung, T. 2006. Context-aided human recognition - clustering. In ECCV'06.

Torralba, A.; Murphy, K. P.; Freeman, W. T.; and Rubin, M. A. 2003. Context-based vision system for place and object recognition. In ICCV'03.

Torralba, A. 2003. Contextual priming for object detection. Int'l J. of Computer Vision 53(2):169-191.

Zhang, L.; Chen, L.; Li, M.; and Zhang, H. 2003. Automated annotation of human faces in family albums. In 11th ACM Conf. on Multimedia.