
CS 395T: Celebrity Look-Alikes ∗

Adrian Quark
[email protected]

Abstract

This project explores the problem of finding the celebrity who most resembles the individual in a given photo. This problem is more general than face recognition (and therefore more challenging). I first compare the performance of an Eigenfaces-like model with Active Appearance Models to determine which is more suitable for this task. Then I compare Fisher's Linear Discriminant Analysis (LDA), a traditional approach to face recognition, with Holub et al.'s perceptual map (PMap), which is specifically designed to measure face similarity across individuals. The results indicate that this problem is far from solved.

1. Introduction

When does one person resemble another? Sometimes similarity depends on obvious factors, such as hair color and face shape, but other times it can be more subtle, depending on small cues from specific features which override gross differences like skin color and sex. Figure 1 shows some examples of similar faces.

Facial similarity clearly has something in common with the well-studied problem of face identification: every person necessarily resembles themselves. However, there are some important differences. In identification, we expect a query to represent some known person exactly, and if it does not we want to report no match. In similarity searching we expect that the image does not represent anyone in the set exactly, but we still want to find the most similar-looking face. It is not enough to perform identification with a high tolerance for error, because the variation possible between two similar faces belonging to different people is different than that between two views of the same person: it may be, for example, that one feature (like the shape of the jaw) is a very good indicator of identity, but not especially important for similarity. Finally, identity is based on physical facts, and identification is therefore primarily concerned with removing the effects of physical transformations like pose, lighting, and expression. In contrast, similarity is based on human perception, psychology, and culture, and therefore it would be impractical to model directly.

∗Original document, full source code and data available at http://www.cs.utexas.edu/~quark/vision project/

Figure 1. Examples of similar celebrity faces, found by Natasha Sazanova [23]. From top to bottom and left to right: Jeanne Tripplehorn and Amber Tamblyn, Leonardo DiCaprio and Milla Jovovich, Kristen Johnston and Mo'Nique.

Aside from being a new problem, face similarity is also useful. It is necessary to understand resemblance in order to answer questions like:

• Are two people related?

• Who can be cast as a stand-in for a movie star?

• How do humans describe faces?


Figure 2. The author (left) and top three celebrity matches from both Hoowat [11] (top) and MyHeritage.com [18].

• How do humans remember faces?

This project addresses the problem of measuring face similarity in a limited context: given a photo of an ordinary person, which celebrity does it most resemble? This question is a good topic of study because it is easy to formulate and understand, but still requires completing three tasks which are essential to answering more practical questions like those above:

• Collect and organize many images of many individuals.

• Create face models robust to pose, lighting, and expression variation.

• Measure the similarity between face models.

My general approach to these tasks was as follows. I obtained a large and diverse data set of celebrity faces by downloading labeled images publicly available on the Internet, extracting frontal faces, and applying heuristics to discard incorrectly-labeled images. I then modeled these faces using Active Appearance Models to obtain a descriptor which is relatively invariant to pose, lighting, and expression variation. Finally, I collected human similarity rankings over a large set of faces and used these rankings to train a perceptual mapping of face descriptors into a space where similar faces are close together. I evaluated Eigenfaces as an alternative to AAMs, and Fisher's Linear Discriminant as an alternative to the perceptual mapping.

2. Related Work

Currently on the Internet there are several web sites which perform exactly this task [25, 18, 11]. Based on informal testing they perform at acceptable levels, sometimes making insightful choices but often questionable ones. Since they do not publish or publicly evaluate their methods, it is difficult to make direct comparisons, but some examples of results are shown in Figure 2.

This project extends work by Holub et al. in constructing facial similarity maps [10]. They obtained human ratings of the similarity between faces and used these to generate an embedding of facial feature vectors in a space which puts similar-looking faces (as judged by humans) close together. I follow their general approach while exploring some variations in the details:

• They use Eigenface-like pixel-based descriptors while I use Active Appearance Models.

• Their data had manually-annotated facial features, while I use automatic feature detectors.

• I explore Fisher's LDA as an alternative to their perceptual map.

The general task of learning a similarity metric for faces is also discussed in [6], but that paper focuses on the pair matching task (a form of identification) rather than comparing images of similar-looking-but-different people.

Related work for specific steps of this project will be discussed in the appropriate sections.

3. Technical Plan

This project can be broken down into several sub-tasks which I will address in the following sections:

• Build a data set of celebrity photos

• Collect human similarity rankings

• Choose a representation for faces

• Build a similarity space

• Remove mislabeled images from the data set

• Search for faces similar to an example

See Figure 3 for an overview of the entire process.

3.1. Celebrity Data set

The first step to identifying which celebrity a person resembles is to know what the celebrities look like, and this requires a large labeled set of celebrity images. Having just one image per celebrity is probably not sufficient, because we are interested in comparing them to people in a variety of poses and expressions. Having multiple images of each celebrity will make this comparison easier.

Unfortunately, most freely-available face databases with a suitable number and variety of faces do not contain photos of celebrities. The explanation is that databases are most effective for training when their photographs are taken under controlled conditions with controlled variation, and most celebrities would not consent to having photographs taken specifically for face recognition purposes. One exception is "Labeled Faces in the Wild" [12], comprising photographs collected from news articles on the Internet and therefore including quite a few photographs of celebrities. However, the inclusion of celebrities in this data set is incidental, and the distribution of data is uneven: the database includes 13,233 images of 5,749 people, but only 158 of these have 10 or more photographs. This is not suitable for an application which specifically requires many photographs of a large number of celebrities.

Figure 3. Overview of technical approach. (Offline: IMM data is annotated and used to build the AAM; IMDB photos pass through face detection and AAM fitting; human similarity ratings are used to build the face map. Online: detect faces, fit AAM, apply map.)

Therefore it is necessary to create a new data set. There are several WWW sites which have large collections of semi-labeled celebrity images. After considering the alternatives, I chose to use the Internet Movie Database [13], which incorporates photographs of over 33,000 movie- and television-related celebrities, with 2,780 of these having more than 20 photographs each. These photographs include a mix of photographs from events, movies, publicity, and other sources, although events dominate. Despite the large volume, this excludes some obvious celebrities because they have no affiliation with television or film. I decided that this was an acceptable trade-off for several reasons. First, because of the role of television and film in our culture, there are few celebrities who have not taken part in them in some way; witness, for example, the large number of musicians-turned-actors. Second, for this project I would prefer to use celebrities who are visually recognizable — nobody cares if you look like a celebrity who nobody recognizes — and by the nature of the medium, movie celebrities tend to be visually recognizable. Third and finally, I could not find any single alternative source with a balanced mix of celebrities and a comparable number of photographs, and I did not want to introduce the additional complication of combining data collected from multiple sources.

Figure 4. Examples of faces detected on IMDB. From top to bottom: typical true positives, unusual true positives, false positives. Right: average image.

IMDB provides access to their photographs via their web site indexed first by the first letter of the celebrity's last name, then by full name (each celebrity is given a unique name), and then by page, with 50 photos per page. I wrote PHP scripts to crawl the website via those index pages, building a database first of celebrity names with over 20 photos and then downloading between 20 and 150 photos per celebrity, in the order listed by IMDB. The number 150 was chosen as an artificial cut-off to avoid using too much disk space, as many of the more popular celebrities have over 500 photographs. In order to avoid placing any measurable load on IMDB's servers I downloaded the images in a single thread over the course of 4 nights.
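For illustration, here is a minimal Python sketch of such a throttled, single-threaded crawl. The original scripts were written in PHP, and the URL scheme below (example.org) is a hypothetical stand-in, not IMDB's actual layout:

```python
import os
import time
import urllib.request

# Hypothetical gallery layout; the real crawler parsed IMDB's index pages.
MIN_PHOTOS, MAX_PHOTOS, DELAY_S = 20, 150, 2.0

def crawl_celebrity(name: str, photo_urls: list[str], out_dir: str = "photos") -> None:
    """Download between MIN_PHOTOS and MAX_PHOTOS images, politely throttled."""
    if len(photo_urls) < MIN_PHOTOS:
        return                                       # too few photos: skip
    os.makedirs(os.path.join(out_dir, name), exist_ok=True)
    for i, url in enumerate(photo_urls[:MAX_PHOTOS]):  # artificial cut-off
        dest = os.path.join(out_dir, name, f"{i:04d}.jpg")
        urllib.request.urlretrieve(url, dest)          # single-threaded fetch
        time.sleep(DELAY_S)        # avoid measurable load on the server
```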

Step                         # celebrities   # faces
Original IMDB+IMM                25,609      301,306
Filter by available images        2,837      192,970
Extract faces                     2,780      153,475
Eigenface warping                 2,780      133,553
Eigenface label cleaning          2,525       25,250
AAM fitting                       2,780       70,696
AAM label cleaning                2,117       21,170

Table 1. The total number of celebrities and face images available after each step of the process.

Figure 5. Interface for collecting similarity ratings.

Because the photographs are taken in uncontrolled conditions which may be too challenging for current face recognition technology, I used the Viola-Jones frontal face detector [28] included with OpenCV [20] to extract only frontal faces from each photograph. This saves disk space since the non-face parts of the image needn't be stored, and ensures that later parts of the process won't have to deal with faces in extreme poses or lighting conditions. Some examples of both typical detections and bad detections (challenging true positives and false positives) are shown in Figure 4, together with the average of all detected images. As the average image shows, the detector is localizing faces and facial features fairly consistently.
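A minimal sketch of this extraction step, assuming OpenCV's Python bindings and its stock frontal-face cascade (the detector parameters here are illustrative, not necessarily the ones used in the project):

```python
import cv2

# Viola-Jones frontal-face detector shipped with OpenCV [28, 20].
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_faces(image_path):
    """Return cropped grayscale face regions from one photograph."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                     minSize=(60, 60))
    # Keep only the face crops; discarding the rest of the photo saves disk.
    return [gray[y:y + h, x:x + w] for (x, y, w, h) in faces]
```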

3.2. Human Similarity Data

Human estimates of similarity are necessary for two reasons: to empirically validate the results of the experiment, and to train the face representation to follow human judgment. I use essentially the same method as [10] to collect this data.

Human subjects are presented with a target face and a set of 24 sample faces (each from a different individual), and asked to choose the sample face most similar to the target face (see Figure 5). The sample set, target face, choice, and reaction time in seconds are all recorded. Intuitively, one might expect that it would give better results to request an absolute rating of the similarity of two faces on something like a Likert scale, rather than merely asking the subject to choose the most similar face with no concept of absolute distance. However, Holub et al. found that the relative ranking provides more information about the underlying perceptual space than an absolute rating, given the same number of samples [10]. It is an open question whether the relative and absolute approaches could be combined somehow, for example to rate both the most similar face and its perceived similarity to the target face at the same time.

Sample and target faces are chosen at random from a pool of 550 faces, including one image each from the 520 IMDB celebrities with the most photographs (each image was chosen manually to be representative of the corresponding celebrity) and 30 frontal, neutral-expression, neutral-lighting images from the IMM face database [19]. Initially I had planned on ensuring that every trial (set of target and sample faces) was unique, but ultimately I decided that relying on random chance to avoid redundant samples was sufficient.

The subjects were self-selected volunteers including myself, friends, family, and classmates. In total, 843 ratings were submitted by 14 subjects, giving a mean of 60 ratings per subject. The maximum number of ratings submitted by one individual (myself) was 186; the minimum, 14. Subjects were given the following guidelines for making selections:

• First impressions are best.

• Try to base your choice on innate features, not superficial traits like makeup, pose, or lighting.

• Ask yourself the question: "Who would play this person in their made-for-TV autobiography?"

• Reaction time is a factor, so choose as quickly as possible after fully considering the options.

Note that no explicit instructions were given about how to compare faces. Although this makes the results less consistent, it avoids introducing any bias towards a particular conception of similarity. So, for example, some subjects reported that they occasionally chose faces of opposite sexes as most similar when other features warranted it, while others stated that they would never do this. Several subjects indicated that they found the task surprisingly difficult, and would often waver between favoring one facial feature or another before making their choice. One factor specifically not accounted for in this study is the effect of race on the ability to judge facial similarity [15].

The interface for collecting similarity ratings was implemented in PHP [21] using SQLite [24] as a database back-end.


Figure 6. Feature points labeled for AAM training. Numbers and arrows indicate the canonical ordering of feature points.

3.3. Active Appearance Model Construction

Arguably the most significant factor in any face recognition system is the representation of faces. Face recognition methods fall into two major groups: appearance-based, where features are directly extracted from new images, and model-based, where domain knowledge is used to design and train a model whose parameters are adjusted to fit new images. Appearance-based representations are typically quicker to compute and more robust to low resolution, while model-based representations can use human knowledge to directly represent and factor out unimportant variability like pose, illumination, and expression [16].

One of my goals for this project is to use a more robust and accurate model for appearance than that used by Holub et al. [10], so I chose a model-based representation. The two main contenders in this area are 3D Morphable Models (3DMMs) [2] and Active Appearance Models (AAMs) [7]. 3DMMs have the advantage that they are more robust to pose and illumination changes, and since these are modeled independently it is easier to ignore them when judging facial similarity. AAMs have the advantages that they are somewhat simpler to implement, may be fit to new faces efficiently and fully automatically, can be created from a purely 2D training set, and have freely available source code implementing them. Therefore I chose AAMs.

At a high level, the AAM algorithm works by fitting a mesh to each face so that dense pixel correspondences between faces can be used to compare faces. As a bonus, the relative positions and sizes of features in the mesh can also be used for identification.

In more detail, AAM operates by first constructing a morphable 2D model of the shape and appearance of faces. Then this model can be fit to new faces and the model parameters used to describe them. The model is constructed as follows [7] (a rough sketch of steps 4-6 in code follows the list):

1. Start with a training set of faces with manually-labeled feature points (typically including the contours of the jaw, mouth, nose, eyes, and eyebrows). Figure 6 illustrates the feature points I used.

Figure 7. Varying the first 3 parameters of the shape model.

2. Normalize and align feature points using Procrustes analysis. This results in an affine-invariant shape descriptor for each face. Figure 7 illustrates some shape descriptors.

3. Warp all images to a canonical shape based on the labeled feature points. For reasons of efficiency my implementation uses piecewise affine warping.

4. For each warped image, form a vector of normalized pixel gray-level intensities from within the face shape.

5. Apply principal component analysis (PCA) to shape and pixel descriptors independently to obtain linear models of shape and texture.

6. Combine the shape and texture models and apply PCA again to obtain a combined shape-appearance model. This reduces the number of parameters in the model without hurting its generality. Figure 8 illustrates the final parameters.
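The following Python sketch illustrates steps 4-6 under stated assumptions (it is not the RAVL implementation used in the project; the function names and the shape/texture weighting w are invented for illustration). It assumes shapes are already Procrustes-aligned and images already warped to the canonical shape:

```python
import numpy as np

def pca(X, var_frac=0.98):
    """Return (mean, basis) keeping enough components for var_frac variance."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    var = (s ** 2) / (s ** 2).sum()
    k = int(np.searchsorted(np.cumsum(var), var_frac)) + 1
    return mu, Vt[:k]

def build_aam(shapes, textures, w=1.0):
    """shapes: (n, 2p) aligned landmarks; textures: (n, m) warped gray pixels."""
    s_mu, s_basis = pca(shapes)           # linear shape model
    t_mu, t_basis = pca(textures)         # linear texture model
    b_s = (shapes - s_mu) @ s_basis.T     # shape parameters per example
    b_t = (textures - t_mu) @ t_basis.T   # texture parameters per example
    combined = np.hstack([w * b_s, b_t])  # w balances shape vs. texture units
    c_mu, c_basis = pca(combined)         # combined shape-appearance model
    return s_mu, s_basis, t_mu, t_basis, c_mu, c_basis
```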

As a training set for the model I began with the IMM face database [19], which includes 240 images of 40 different individuals with some pose and expression variation. Each image is annotated with 58 feature points. This data set contains only Caucasians and just 7 women, and I found that it seemed to generalize poorly, so I supplemented it with 75 images of 15 individuals chosen from the IMDB photo set specifically to introduce more race and gender variation into the training set.

I used Inkscape [14] to annotate the feature points of the supplemental IMDB images, saving the results to SVG and then converting this to an XML format for further processing. One challenge with creating the manual annotations is ensuring that each point is marked in the same order on all faces. For some time I wondered why my AAM was performing so poorly, until I discovered that a single training face had the jaw points annotated in the wrong order.

Figure 8. Varying the first 3 parameters of the appearance model.

Figure 9. Left: original image with feature points identified by AAM. Right: appearance reconstructed by AAM.

An alternative to training a single AAM for all faces would be to train a different AAM for each face, and for new faces use whichever AAM gives the best fit. This approach is called "Person-Specific AAMs", and there is evidence that it performs better than generic AAMs [17]. However, a person-specific AAM is only better than a generic AAM when applied to the person it was trained with, and it will do a poor job of both matching and modeling other people. Therefore this approach is not appropriate for this project, where it is necessary to be able to model never-before-seen faces.

3.4. Active Appearance Model Fitting

How can we fit the model to new images? Gradient descent is a straightforward and principled approach which is unfortunately too inefficient because of the large number of model parameters (80+). The key insight behind AAMs is that the full generality of gradient descent is not needed. Due to the restricted domain, there is some predictable correlation between the residual error against the image and how the model needs to be adjusted. Under the assumption that this correlation is constant and linear, it can be estimated using linear regression [7], or using a first-order Taylor approximation with the Jacobian assumed constant and estimated with numeric differentiation of the training data [8]. I have chosen the latter approach because it is both faster and more reliable [26]. Figure 9 shows an example of an AAM fit to an image of the author using this method.
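The following sketch illustrates the resulting constant-linear-update loop, assuming a precomputed update matrix R (estimated offline from training-set perturbations) and hypothetical synthesize/sample_image helpers; it is a schematic of the idea, not RAVL's implementation:

```python
import numpy as np

def fit_aam(params0, synthesize, sample_image, R, n_iters=30):
    """Iteratively refine parameters p so the synthesized face matches the image."""
    p = params0.copy()
    r = sample_image(p) - synthesize(p)    # residual: image pixels minus model
    err = np.mean(r ** 2)
    for _ in range(n_iters):
        dp = -R @ r                        # predicted parameter update
        for step in (1.0, 0.5, 0.25):      # damped steps in case error rises
            r_new = sample_image(p + step * dp) - synthesize(p + step * dp)
            if np.mean(r_new ** 2) < err:
                p, r, err = p + step * dp, r_new, np.mean(r_new ** 2)
                break
        else:
            break                          # no step improved: treat as converged
    return p, err
```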

Because of the crude approximation of the error gradient, the AAM will only converge if started close to its final position. I use three techniques to mitigate this problem.

First, the pose (position, rotation, and scale) parameters of the appearance model are initialized based on estimated locations for the eyes. The eyes are found using Viola-Jones OpenCV Haar-like detectors [4]. In order to eliminate false positives, the eyes are constrained to lie within the top 55% of the image, with each eye in its respective half of the image. In case of multiple matches, those closest to the mean position of the eyes relative to the image size (learned from all images) are preferred. If the eyes cannot be detected, the mean positions are used directly to initialize the shape model.
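A sketch of this initialization logic, assuming OpenCV's stock eye cascade; the mean eye positions below are placeholders for values that would be learned from all images:

```python
import cv2
import numpy as np

eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def find_eyes(gray, mean_left=(0.33, 0.40), mean_right=(0.67, 0.40)):
    """Return (left, right) eye centers, falling back to the mean positions."""
    h, w = gray.shape
    boxes = eye_cascade.detectMultiScale(gray)
    centers = [(x + bw / 2, y + bh / 2) for (x, y, bw, bh) in boxes
               if (y + bh / 2) < 0.55 * h]          # constrain to top 55%
    def pick(in_half, mean_pos):
        target = (mean_pos[0] * w, mean_pos[1] * h)
        cands = [c for c in centers if in_half(c[0])]
        if not cands:
            return target                           # fall back to the mean
        return min(cands, key=lambda c: np.hypot(c[0] - target[0],
                                                 c[1] - target[1]))
    left = pick(lambda x: x < w / 2, mean_left)     # each eye in its own half
    right = pick(lambda x: x >= w / 2, mean_right)
    return left, right
```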

Second, the fitting algorithm is run multiple times with the appearance model initialized to various shapes, and whichever gives the result with the lowest error is kept. Currently the appearance is initialized to shapes corresponding to looking left, right, up, and down, as well as long and squat faces. These are in fact the three primary modes of variation of the shape model, illustrated in Figure 7.

Third, the AAM is constructed at multiple resolutions, and each resolution is fit to the image and used to initialize the next. Currently I am using 20x20, 40x40, and 80x80 appearance models for this purpose. Ideally there would be one more higher-resolution 120x120 model, but I have encountered memory issues which prevented me from building it.

The accuracy of an AAM fit is characterized by the residual error, which is simply the pixel-wise difference between the original image and the appearance synthesized from the model parameters. One way to visualize how well an AAM is able to model a set of images is to plot the residual error, after fitting, as a histogram. Ideally, the histogram would be heavily weighted towards the low end, indicating that most images were fit with low residual error. I found empirically with my models that a residual error of < 0.1 guarantees accurate feature locations, while > 0.2 almost always means the facial features were not localized correctly. For the majority of my experiments using AAMs, I discarded all images with residual error ≥ 0.1.

To compare the effectiveness of an AAM trained with the original IMM dataset with one trained using my expanded IMM+IMDB dataset, I fit both models to all the celebrity images used in the human similarity poll, and plotted the residual error histogram, shown in Figure 10. Clearly the added training data has significantly improved the ability of the model to generalize to typical faces from the celebrity dataset.

Figure 10. Histogram of AAM fit residual error (residual error vs. number of images, for the IMDB+IMM and IMM-only models), over the image set which was used in the human similarity experiment.

Figure 11. Top: average visual features of [10]; Bottom: average visual features implemented in this project.

I used RAVL's implementation of AAMs to do the majority of the work training and fitting AAMs [22]. RAVL is a C++ library including algorithms for solving many computer vision, pattern recognition, audio, and general development tasks. Its main strength is that it provides a rich and easy-to-use programming environment. Its main weaknesses include lack of documentation, a small user base, and inactive maintainers, leading to numerous small but annoying bugs.

3.5. Eigenfaces

In order to provide a basis for comparison with Holub et al.'s results in [10], I have also implemented something similar to their Eigenfaces [27]-like approach to face representation.

Starting with the original image, the eyes and mouth are detected using Viola-Jones OpenCV Haar-like detectors [4, 3]. The specific detectors were chosen because they performed best of all the freely-available OpenCV Haar-like detectors [5]. The image is then affine-warped so that these features conform to fixed locations.

I discarded any images which did not have sufficient detected features. I also discarded images where the magnitude of the calculated affine skew was greater than 1.0, as I found empirically that this usually means the features were incorrectly localized.

Finally, patches around the eyes and mouth are extracted, re-sized to 17x24 and 25x14 respectively, and normalized to have zero mean and unit variance. See Figure 11 for a comparison of these patches to those of Holub et al. The most obvious difference is the relative blurriness of my average features, which can be explained by errors in the automatic detection and warping of the features.

A random sampling of 1,000 images from the IMDB is given this treatment and then subjected to PCA to reduce the dimensionality to 200. Holub et al. found that this was sufficient to preserve over 99% of the variation in their samples [10], but with my data the figure was closer to 95%. Again this can probably be attributed to errors in face alignment and the concomitant irregularity in patch appearance.
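A sketch of this descriptor pipeline under assumed canonical coordinates (the patch sizes follow the text, but the canonical feature positions and warp size are invented for illustration):

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

# Assumed canonical positions for left eye, right eye, mouth in a 100x120 warp.
CANONICAL = np.float32([[30, 40], [70, 40], [50, 80]])

def descriptor(gray, left_eye, right_eye, mouth):
    """Affine-warp to canonical coordinates, cut out patches, normalize."""
    src = np.float32([left_eye, right_eye, mouth])
    A = cv2.getAffineTransform(src, CANONICAL)
    warped = cv2.warpAffine(gray, A, (100, 120))
    patches = [warped[28:52, 18:35],      # left-eye patch, 17x24
               warped[28:52, 62:79],      # right-eye patch, 17x24
               warped[73:87, 38:63]]      # mouth patch, 25x14
    # Normalize each patch to zero mean and unit variance, then concatenate.
    norm = [(p - p.mean()) / p.std() for p in
            (q.astype(np.float64) for q in patches)]
    return np.concatenate([p.ravel() for p in norm])

# PCA to 200 dimensions over a sample of descriptors:
# pca = PCA(n_components=200).fit(np.stack(all_descriptors))
```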

3.6. Similarity Space

Given a representation of faces, we need to design a similarity space which ensures that similar-looking people are grouped together.

An approach commonly used in face recognition is Fisher's linear discriminant analysis (LDA), which attempts to map faces to an identity space (where images of the same person are neighbors) rather than a perceptual similarity space. However, it is clear that an identity space obeys some constraints of a perceptual similarity space, since a person almost always "looks like" themselves, so LDA may work for judging facial similarity. Because it only uses class labels, not relative similarity ratings, it serves as a baseline which a true perceptual distance metric should be able to beat.

More accurate results should be possible using the rankings collected from humans as described in § 3.2. Each ranking includes a target image A, a set of images I, and a chosen image B ∈ I. By choosing B, the human asserts that for all other images C in I, D(A, B) < D(A, C), where D is a perceptual distance measure. For convenience each such assertion may be encoded in a triplet 〈A_t, B_t, C_t〉, and the images represented by feature vectors 〈a_t, b_t, c_t〉.
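In code, expanding one poll response into its triplets is a one-liner (a sketch, with hypothetical argument names):

```python
# Each poll response (target A, chosen B, sample set I) yields |I| - 1
# triplets asserting D(A, B) < D(A, C) for every unchosen C.
def rating_to_triplets(target, chosen, samples):
    return [(target, chosen, c) for c in samples if c != chosen]
```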

Holub et al. propose an algorithm for creating a perceptual map (PMap) which maps faces into a Euclidean space where those judged similar by humans are neighbors [10]. A description of their approach follows.

Assuming the perceptual mapping is linear, it may be characterized by a matrix M such that an image feature vector a maps to Ma in perceptual space. Our goal is to ensure that Euclidean distance in this perceptual space obeys the constraints implied by the triplets provided by humans. Formally, for all triplets 〈a_t, b_t, c_t〉, it should be the case that ‖Ma_t − Mb_t‖ < ‖Ma_t − Mc_t‖. This suggests using a cost function which penalizes breaking this inequality for a given triplet t according to:

$$S(M, t) = \|M a_t - M b_t\|^2 - \|M a_t - M c_t\|^2 = (a_t - b_t)^\top M (a_t - b_t) - (a_t - c_t)^\top M (a_t - c_t)$$

(in the expanded form, and in the gradients below, M is written for the symmetric metric matrix M^\top M).

The squared L2 metric is used in place of Euclidean distance because this simplifies the math. Finally, the penalty over all triplets is computed with an exponential cost function:

$$C(M) = \sum_t \exp\left(\frac{S(M, t)}{\beta}\right)$$

where β is a tuning parameter.¹ Holub et al. used β = 1, but I found it necessary to set β proportional to the number of triplets in order to avoid numeric overflow.

M is found using the conjugate gradient algorithm, with the following Jacobian for the cost function:

$$\frac{\partial S(M, t)}{\partial M} = (a_t - b_t)(a_t - b_t)^\top - (a_t - c_t)(a_t - c_t)^\top$$

$$\frac{\partial C(M)}{\partial M} = \frac{1}{\beta} \sum_t \exp\left(\frac{S(M, t)}{\beta}\right) \frac{\partial S(M, t)}{\partial M}$$

I used the GNU Scientific Library's [9] implementation of the Fletcher-Reeves algorithm, with conservative search parameters. Unlike Holub et al., I always used the identity matrix to initialize the gradient search, on the theory that the initial feature space already does a better-than-random job of estimating similarity. Small perturbations to the initialization matrix did not seem to affect the outcome, i.e. local minima were not observed. Termination seemed to be primarily determined by the combination of β and gradient tolerance: a high β tends to flatten the search space so that a high gradient tolerance is easily met. I conservatively chose a low β and gradient tolerance so that the algorithm always used the maximum number of iterations without meeting the gradient tolerance, but the difference in performance of the final map did not appear to be significant.
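A compact sketch of this optimization, with scipy's conjugate-gradient routine standing in for GSL's Fletcher-Reeves implementation (the function name and argument layout are assumptions; S, C, and the gradient follow the equations above):

```python
import numpy as np
from scipy.optimize import minimize

def train_pmap(A, B, C_, beta, d, max_iter=200):
    """A, B, C_: (T, d) arrays of triplet feature vectors a_t, b_t, c_t."""
    db, dc = A - B, A - C_                       # triplet difference vectors

    def cost_grad(m):
        M = m.reshape(d, d)
        S = np.einsum('ti,ij,tj->t', db, M, db) - \
            np.einsum('ti,ij,tj->t', dc, M, dc)  # S(M, t) per triplet
        e = np.exp(S / beta)
        grad = np.einsum('t,ti,tj->ij', e, db, db) - \
               np.einsum('t,ti,tj->ij', e, dc, dc)
        return e.sum(), (grad / beta).ravel()    # C(M) and dC/dM

    m0 = np.eye(d).ravel()                       # identity initialization
    res = minimize(cost_grad, m0, jac=True, method='CG',
                   options={'maxiter': max_iter})
    return res.x.reshape(d, d)
```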

¹My treatment differs from Holub et al.'s due to some apparent typographical errors in their equations:

• a, b, and c are used sometimes as row vectors and sometimes as column vectors. Here, following convention, I treat them as column vectors.

• a^t is used to denote transposition instead of a^⊤.

• −S(t) is used as a per-triplet penalty function rather than S(t).

Name                   Init FPR   EF FPR   AAM FPR
Rachel Roberts (III)     6/34      0/10      1/10
Mickey Rourke           24/48      3/10      5/10
Sela Ward               14/42      0/10      2/10
Olivia Thirby           12/37      1/10      0/10
Kenneth Cole (I)        18/47      5/10      9/10
John Goodman (I)        16/38      1/10      2/10
Shaune Bagwell           3/32      0/10      0/10
Tim Blake Nelson        36/58      4/10      4/10
Scott Foley (I)         22/51      0/10      1/10
Melina Kanakaredes       5/33      0/10      0/10
Izabella Miko            3/38      0/10      0/10
Tia Mowry                9/39      1/10      1/10
Eva Amurri              17/44      4/10      3/10
Celia Weston            14/32      0/10      0/10
Rai Aishwarya            9/28      0/10      0/10
Totals                  34.6%     12.7%     18.7%

Table 2. False positive rates for face labels prior to cleaning, after Eigenface-based cleaning, and after AAM-based cleaning.

3.7. Removing Mislabeled Images

While the IMDB data is consistently labeled, each picture may include several faces and it is not clear to which face the label refers. I use a very simple algorithm to remove outliers which are likely to be mislabeled. For the faces associated with a given label:

1. Find the mean

2. Discard the image furthest from the mean

3. Repeat until only 10 faces are left

In addition, any label with fewer than 15 faces is discarded entirely. Before applying the algorithm I map all faces to an LDA space trained with the labeled AAM training images (55 people with at least 5 images each).
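A sketch of the pruning loop (hypothetical function, operating on descriptors already mapped to LDA space):

```python
import numpy as np

def clean_label(faces_lda, keep=10, min_faces=15):
    """faces_lda: (n, d) LDA-space descriptors sharing one celebrity label."""
    if len(faces_lda) < min_faces:
        return None                               # discard the label entirely
    kept = list(range(len(faces_lda)))
    while len(kept) > keep:
        X = faces_lda[kept]
        dists = np.linalg.norm(X - X.mean(axis=0), axis=1)
        kept.pop(int(np.argmax(dists)))           # drop furthest from the mean
    return kept                                   # indices of retained faces
```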

This algorithm produces slightly different results depending on whether AAMs or Eigenfaces are used to represent faces. The results for 15 representative faces are listed in Table 2. The quality of this method is not very good, especially considering the number of images it discards. In order to get good results for celebrity searches, it would definitely be necessary to find a better method of cleaning the data to ensure accurate labels while preserving the variation in appearance which is necessary for accurate results.

3.8. Finding Similar Faces

With the (somewhat) accurately labeled celebrity data and a perceptual map (created using either LDA or PMap), finding similar faces should require only a simple nearest-neighbor search.


In order to take some advantage of the fact that 10 views are available for each celebrity, and achieve some robustness against mislabeling, I use a k-NN search with k = 10. Each label is given a score which sums a contribution from each matching image, decreasing with distance, so that a label with one very good match may still out-score a label with multiple poor matches. Finally the results are returned with the highest score first.
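A sketch of this search; since the text implies larger scores are better, an inverse-distance contribution per neighbor is assumed here:

```python
import numpy as np
from collections import defaultdict

def rank_celebrities(query, gallery, labels, k=10, eps=1e-6):
    """gallery: (n, d) mapped faces; labels[i] names gallery[i]'s celebrity."""
    dists = np.linalg.norm(gallery - query, axis=1)
    nearest = np.argsort(dists)[:k]                   # k-NN in perceptual space
    scores = defaultdict(float)
    for i in nearest:
        scores[labels[i]] += 1.0 / (dists[i] + eps)   # assumed scoring rule
    return sorted(scores.items(), key=lambda kv: -kv[1])  # highest score first
```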

4. Results

Assessing the performance of the similarity measure is challenging. The obvious approach would be to give the computer the same similarity poll given to the humans (described in § 3.2) and directly compare its responses with human responses to see how often it agrees with humans. This is a bad idea because even a human taking this test would do very poorly: people rarely agree on exactly which face is most similar. On the other hand, if you asked people to choose the top 10 most similar faces, there would probably be a great deal of overlap.

In this spirit, I convert similarity into a binary classification problem. For each image A, the goal is to classify the remaining images into "similar" and "not similar" sets. The ground truth for the "similar" set is provided by the triplets: the triplet 〈A, B, C〉 asserts that B is in A's "similar" set. This ignores the relationship between B and C, as well as the relative distances of different Bs, but it is easy to evaluate the performance of this simplified classification problem using a traditional ROC curve.
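A sketch of this evaluation (hypothetical helper; negated distance serves as the classifier score, and the brute-force loop is fine at this scale):

```python
import numpy as np
from sklearn.metrics import roc_curve

def roc_points(dist, similar_pairs, n_images):
    """dist: (n, n) pairwise distances; similar_pairs: set of (A, B) indices
    derived from triplets <A, B, C>."""
    y_true, y_score = [], []
    for a in range(n_images):
        for b in range(n_images):
            if a == b:
                continue
            y_true.append((a, b) in similar_pairs)
            y_score.append(-dist[a, b])        # smaller distance = more similar
    return roc_curve(y_true, y_score)          # fpr, tpr, thresholds
```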

Holub et al. took a related approach: for each image, consider the 10% nearest neighbors and count the percentage of "similar" faces falling within this radius [10]. Effectively this measures the true positive rate versus the radius. However, by failing to account for the false positive rate, this makes the measure of performance dependent upon the ratio of true positives to samples (in other words, the density of similarity judgements used for testing), so that the same algorithm can be made to appear worse just by increasing the density of the test samples. In practice, because the similarity judgements are sampled very sparsely, each class has only two or three members, so the radius corresponds approximately to the false positive rate.

For all of the similarity experiments I created a training set by randomly choosing a subset of images from the similarity poll together with the triplets which refer only to those images. The remaining images and triplets were used for testing. Triplets which refer to images in both the training and testing sets were unused.

4.1. Qualitative Results

Examples of celebrity matches for the author are shown in Figure 12. The results are unfortunately not very impressive. There appear to be two significant sources of error: mislabeling and weak appearance modeling.

Figure 12. Top 3 similarity matches for the author for LDA and PMap methods. A typical photo of each person is shown together with the corresponding appearance model. Left: LDA matches, from top to bottom: Brenda Soong, Adrien Brody (mislabeled Jennifer Jason Leigh), Phylicia Rashad (mislabeled Bill Cosby). Right: PMap matches, from top to bottom: Eugene Levy, Dylan McDermott (mislabeled Snoop Dogg), Unknown (mislabeled Corbin Bleu).

Mislabeling results from the poor performance of the algorithm used to prune mislabeled faces from the dataset. However, in some of these cases better performance may be quite hard to achieve: Corbin Bleu and Jennifer Jason Leigh, for example, both have more mislabeled faces in the original dataset than good examples. These cases are especially problematic because these bad labels will tend to have a broad footprint (due to high within-class variance) and therefore match more often than good labels which are highly localized in perceptual space.

Weak appearance modeling results from poor generalization of the appearance model. It is clear that although the appearance model has achieved a good fit to the facial features, the actual appearance only weakly resembles the original face. A straightforward solution to this is to model the face using the shape found by the appearance model but the appearance sampled directly from the image and reduced using PCA and/or LDA. In other words, decouple shape and appearance. A similar effect could be obtained simply by using a richer training set for the appearance model. One cheap way to get such a training set is to use the weaker appearance model to automatically annotate new faces and then incorporate these into the training set.

Figure 13. ROC comparison of various methods on the similarity task (true positive rate vs. false positive rate; curves: AAM LDA, AAM PMap, AAM, Eigenfaces PMap, Eigenfaces, Holub et al. (approximate), and Baseline).

4.2. Overall Results

I compared the performance of Eigenfaces and AAMs with and without PMap. For this experiment all algorithms were tested against the same set of 295 images (and the associated triplets). The PMaps were trained using a disjoint set of 250 images (and the associated triplets). The results are shown in Figure 13. I have plotted the approximate location of Holub et al.'s best results under the assumption that their training data was sparse (see § 4 for a detailed discussion of their performance measure and how it relates to mine).

Some observations:

• AAM clearly outperforms Eigenfaces, as expected

• I have apparently surpassed the performance of Holub et al. using their own algorithm, which might be explained by my richer training data

• The perceptual map algorithm (PMap) generalizes very poorly

The following sections examine these results in more detail.

4.3. Eigenfaces versus AAM

Holub et al. used Eigenfaces while I focused on AAMs. How well do these approaches work at measuring similarity? I tested AAMs and Eigenfaces both with and without the learned perceptual mapping (PMap). I used a training set of 250 images for the PMaps and a separate set of 295 images for testing all of the algorithms.

Figure 14. ROC comparison of Eigenfaces and AAMs for the similarity task (curves: AAM PMap, AAM, Eigenfaces PMap, Eigenfaces, and Baseline).

Figure 15. Detail from Figure 14: ROC comparison of Eigenfaces and AAMs for the similarity task.

As expected, AAMs clearly out-perform Eigenfaces (Figure 14). Surprisingly, especially when compared with Holub et al.'s results, PMap seems to provide very little improvement. This may be explained by the fact that Holub et al. only sampled one point near the start of the ROC curve, where noise can exaggerate the differences between different methods, as shown in Figure 15. I examine PMap's poor performance in more detail in the next section.

4.4. PMap Generalization

If PMap generalizes well, its performance on images outside the training set should be comparable to its performance inside the training set. Therefore I compare the performance of:

• No mapping (raw AAM feature space)

• PMap trained on various numbers of training images and tested against unseen images

• PMap trained on the full testing set

Figure 16. ROC comparison of various training set sizes for PMap based on AAMs (curves: PMap Optimal, PMap 400, PMap 250, PMap 100, No PMap, and Baseline). "PMap Optimal" shows the performance when tested on the full training data, while all other tests use disjoint training and testing data.

Figure 17. ROC comparison of LDA and PMap based on AAMs (curves: PMap+LDA, LDA, PMap, No PMap, and Baseline).

As shown in Figure 16, PMap does quite well when tested on its training data, but when tested on new data the performance drops to almost no improvement. Most surprisingly, for smaller training sets it actually decreases performance relative to AAM alone, which suggests that it may be over-fitting. Perhaps these results could be improved by using a more diverse training set which includes several views of each individual.

4.5. PMap versus LDA

How does PMap compare to LDA? To answer this question I compare the following mapping functions, tested on a subset of the similarity data:

• No mapping (raw AAM feature space)


• LDA, trained on the full AAM training set

• PMap, trained on similarity data disjoint from the testing set

• LDA+PMap, where the PMap takes LDA-space features as input rather than raw AAM features

The results, shown in Figure 17, are somewhat surprising: LDA clearly performs better than PMap, despite being trained solely with identity labels. I attribute this to PMap's poor generalization with my current training set. Combining PMap with LDA provides the best overall performance, as expected, but the improvement is very slight.

4.6. Weighted Similarity

One possible problem with the PMap algorithm is that it treats all triplets as equally important, which is not the case. For example, if the chosen face is clearly a better choice than any other, that triplet should be considered important, but if there are several equally good choices (or no good choices), then the human's choice is not very informative and it may be better to ignore the triplet.

Figure 18. Histogram of reaction times for similarity ratings (reaction time in seconds vs. number of ratings), with the weight function exp(−t/30) superimposed.

Without having to ask the human for more information, we can use the reaction time as a rough measure of certainty. The more obvious a choice is, the shorter the reaction time. To incorporate this information into the PMap cost function, I multiplied the cost for each triplet by the weight exp(−t/30), where t is the reaction time. I chose this function based on the intuition that differences in reaction time close to zero are more significant than those far from zero. Figure 18 shows the histogram of reaction times with the weight function superimposed.²
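Incorporating the weight into the earlier PMap sketch is a small change (assumed form):

```python
import numpy as np

# Per-triplet weight w_t = exp(-t/30), where t is the reaction time in seconds,
# so quick (confident) choices count for more than slow ones.
def triplet_weights(reaction_times_s):
    return np.exp(-np.asarray(reaction_times_s) / 30.0)
# ...then use (w * e).sum() and einsum('t,t,ti,tj->ij', w, e, db, db)
# in cost_grad above.
```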

²Theory suggests that since choices are made independently, the waiting time between any two choices (the reaction time) may be modeled by a Poisson process and therefore is expected to have an exponential probability distribution. In fact the distribution looks more like an Erlang distribution (with k = 2), suggesting that there are two independent events involved in choosing an image, or maybe that I just don't understand probability theory very well.


Figure 19. ROC comparison of weighted versus non-weighted AAM-based PMaps for the similarity task (curves: Weighted PMap, PMap, No PMap, and Baseline).

Figure 20. ROC comparison of various methods for the recognition task (curves: AAM+LDA, AAM+PMap, AAM, and Baseline).

I compared the results of AAM-based PMap using this weight, its inverse, and no weight. If the weight is meaningful, one would expect that it would yield an improvement in performance, while its inverse should significantly degrade performance. As the results in Figure 19 show, the weight makes little difference in the final results. This result is inconclusive: it may mean that reaction time doesn't correlate well with the certainty of a choice, it may be that the PMap algorithm is unable to effectively take advantage of the difference in weights, or it may be that the training data is too sparse for the effects to be clear. Given the previous results which demonstrate how little effect PMap alone has on the results, I suspect one of the latter two explanations.

4.7. Using PMap for Recognition

Since everybody looks more like themselves than anybody else, we expect that an algorithm which is good at measuring similarity should also be good at recognizing identity. I seek to verify this by applying PMap to the recognition problem and comparing it with the traditional LDA approach. Specifically I tested the following:

• AAM, trained using the full AAM training set

• AAM+LDA, where LDA is trained on a subset of 30 individuals from the AAM training set (which includes variation in pose, expression, and lighting)

• AAM+PMap, where PMap is trained using the full similarity data (this includes an image of each individual from the AAM training set and more, but little variation in pose, expression, or lighting)

All three mappings were tested on the 25 individuals from the AAM training set which were not used for LDA training. True and false positive rates are averaged over all images in the testing set. The results in Figure 20 show that LDA is clearly superior at the recognition task, which apparently contradicts the findings of Holub et al. The difference may be explained by the fact that their PMap training dataset includes multiple images of some individuals and more variation in pose and lighting, allowing the PMap to learn more about variation within and between individuals. It may also be attributed to differences in experimental technique, where they measured the average rank distance between instances of the same individual rather than a full ROC curve.

However, it is encouraging that AAM+PMap improves over AAM, despite including only one example per person and little pose variation. This demonstrates that it is learning some notion of similarity which it is able to generalize, recognizing that instances of the same person are similar.

5. Discussion and Future Work

The most interesting result of my work appears to be demonstrating that PMap is relatively ineffective at capturing general notions of human similarity. However, I do not feel that I have fully explored why this is the case, or explained sufficiently why my results do not agree with Holub et al.'s. More experiments are necessary, specifically to examine factors in training which may improve the performance of PMap:

• How much variation in pose and expression should be present in the training set

• Whether to include multiple images of each individual in the training set

• Can we consider the human-provided triplets to describe individuals rather than specific images, and therefore explicitly try to make the perceptual map optimize some relationship between individuals rather than images (such as average pair-wise distance, Earth Mover's Distance, or some other class distance measure)?

Aside from enhancing the PMap algorithm, I believe that there are two primary directions in which this work can be refined: using more sophisticated representations for facial features, and using more sophisticated distance metrics for comparing facial similarity. Improved representations for facial features are of general interest in face recognition and aren't uniquely important for this task, except perhaps in that similarity judgment is especially hard and therefore demands detailed representations. Therefore I think it's more interesting to explore new distance metrics.

One particularly promising direction is local distance functions [1], which abandon a Euclidean distance metric in favor of one which allows each image to be compared on the basis of those features which are most relevant to it. In the context of faces, this would mean, for example, that if one person has a very distinctive nose (such that anyone else with that nose automatically looks similar), then their nose can be given a higher weight for the purposes of comparison. My experience observing myself comparing similar faces suggests that people do something similar, latching on to one or two distinctive features for each person and preferring to identify (and compare) them primarily based on those features.

An altogether different direction would be to explore adding further context. In this work, for example, hairstyle is totally ignored (unless it happens to block part of the face), but some hairstyles are obviously distinctive and relevant to similarity. In addition, humans may experience unconscious influence from factors not directly attributable to the face, such as more easily recognizing famous people, or people of the same race, or considering factors like height, personality, voice, and so on when comparing faces.

Finally, more work is required to explore practical applications of this work. For example, how effective is this method at recognizing family resemblance? Is the current performance good enough to assist casting and talent agencies in searching for celebrity look-alikes? Are there any applications for a model of facial similarity besides straightforward searching? For example, if we have some model of how humans describe faces in natural language, can we establish a correspondence between faces and descriptions and use our learned similarity space to generate appropriate descriptions for new faces?

References

[1] A. Frome, Y. Singer, F. Sha, and J. Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2007.

[2] V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. IEEE Trans. Pattern Anal. Mach. Intell., 25(9):1063-1074, 2003.

[3] M. Castrillon Santana, O. Deniz Suarez, M. Hernandez Tejera, and C. Guerra Artal. ENCARA2: Real-time detection of multiple faces at different resolutions in video streams. Journal of Visual Communication and Image Representation, pages 130-140, April 2007.

[4] M. Castrillon Santana, J. Lorenzo Navarro, O. Deniz Suarez, and A. Falcon Martel. Multiple face detection at different resolutions for perceptual user interfaces. In 2nd Iberian Conference on Pattern Recognition and Image Analysis, Estoril, Portugal, June 2005.

[5] M. Castrillon-Santana, O. Deniz-Suarez, L. Anton-Canalis, and J. Lorenzo-Navarro. Face and facial feature detection evaluation. In Third International Conference on Computer Vision Theory and Applications (VISAPP08), January 2008.

[6] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 1, pages 539-546, Washington, DC, USA, 2005. IEEE Computer Society.

[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. Lecture Notes in Computer Science, 1407:484-498, 1998.

[8] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell., 23(6):681-685, 2001.

[9] GNU Scientific Library (GSL). http://www.gnu.org/software/gsl/.

[10] A. Holub, Y.-h. Liu, and P. Perona. On constructing facial similarity maps. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR '07), pages 1-8, June 2007.

[11] Hoowat. http://www.findmycelebritylookalike.com/.

[12] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, Oct. 2007.

[13] Internet Movie Database (IMDB). http://www.imdb.com.

[14] Inkscape. http://www.inkscape.org/.

[15] D. Levin. Race as a visual feature: Using visual search and perceptual discrimination tasks to understand face categories and the cross-race recognition deficit. Journal of Experimental Psychology: General, December 2000.

[16] X. Lu. Image analysis for face recognition. Personal notes, May 2003.

[17] I. Matthews and S. Baker. Active appearance models revisited. Int. J. Comput. Vision, 60(2):135-164, 2004.

[18] MyHeritage.com. http://www.myheritage.com/face-recognition.

[19] M. M. Nordstrøm, M. Larsen, J. Sierakowski, and M. B. Stegmann. The IMM face database - an annotated dataset of 240 face images. Technical report, Informatics and Mathematical Modelling, Technical University of Denmark, DTU, Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby, May 2004.

[20] Open Computer Vision Library (OpenCV). http://sourceforge.net/projects/opencvlibrary/.

[21] PHP Hypertext Preprocessor (PHP). http://www.php.net/.

[22] Recognition And Vision Library (RAVL). http://www.ee.surrey.ac.uk/CVSSP/Ravl/.

[23] N. Sazanova. Personal site. http://artns.us/Celebrities.htm.

[24] SQLite database engine. http://www.sqlite.org/.

[25] StarsInYou.com. http://www.starsinyou.com.

[26] M. B. Stegmann. Analysis and segmentation of face images using point annotations and linear subspace techniques. Technical report, Informatics and Mathematical Modelling, Technical University of Denmark, DTU, Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby, Aug. 2002. See the publication link for the images and the annotations.

[27] M. Turk and A. Pentland. Face recognition using eigenfaces. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1991.

[28] P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 2001.