This article was downloaded by: [UQ Library]
On: 10 November 2014, At: 10:48
Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954
Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK
Visual Cognition
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/pvis20
Computational perspectives on the other-race effect
Alice J. O'Toole & Vaidehi Natu
School of Behavioural and Brain Sciences, University of Texas at Dallas, Richardson, TX, USA
Published online: 14 Jun 2013.
To cite this article: Alice J. O'Toole & Vaidehi Natu (2013) Computational perspectives on the other-race effect, Visual Cognition, 21:9-10, 1121-1137, DOI: 10.1080/13506285.2013.803505
To link to this article: http://dx.doi.org/10.1080/13506285.2013.803505
Computational perspectives on the other-race effect
Alice J. O’Toole and Vaidehi Natu
School of Behavioural and Brain Sciences, University of Texas at
Dallas, Richardson, TX, USA
Psychological studies have long shown that human memory is better for faces of our own race than for faces of other races. In this paper, we review computational studies of own- versus other-race face processing. Computational models examine the visual challenges of representing the uniqueness of individual faces that vary both within and across demographic categories. These models isolate the visual components of the other-race effect and provide an objective control for socio-affective responses to other-race faces. This control allows researchers to compare and test the role of experience/contact in the other-race effect, using various operational definitions of this theoretical construct. The models show that to produce an other-race effect computationally, biased experience or learning must intervene during the process of feature selection. This implicates the critical importance of “developmental” learning in the other-race effect.
Keywords: Computational; Face; Other-race effect.
The perception of own- and other-race faces has been studied with
experimental behavioural approaches for decades (Malpass & Kravitz,
1969). These studies document factors in the domains of social and visual
perception that give rise to differences in the quality and flexibility of our
memories for own- and other-race faces. Over the last few decades, there
have been hundreds of papers examining perceptual and memory components
of the other-race effect. This Special Issue contains a comprehensive
look at these findings and their implications.
Please address all correspondence to Alice J. O’Toole, School of Behavioural and
Brain Sciences, University of Texas at Dallas, Richardson, TX 75080, USA. E-mail:
Thanks are due to funding from the Technical Support Working Group of the Department
of Defense, which supported the authors in preparing this paper. Thanks are also due to Allyson
Rice and two anonymous reviewers for comments on a previous version of this manuscript.
Visual Cognition, 2013
Vol. 21, Nos. 9–10, 1121–1137, http://dx.doi.org/10.1080/13506285.2013.803505
© 2013 Taylor & Francis
The purpose of the present paper is to consider a less studied aspect of the
other-race effect: its computational foundations. In other words, we will
examine computer-based models of face perception and recognition algorithms
that have focused on the computational challenges posed by
processing faces of different races. We note at the outset that the definition
of the other-race effect in a computational model is not entirely analogous to
its definition in the psychological literature. Rather, in computational terms,
‘‘race’’ is a visually based stimulus category defined by the statistical
variability of faces from within and across demographic race/ethnicity
categories. In short, race is one of several demographic categories of faces
(e.g., gender, age) that may pose special challenges to computational models.
As we will see, ‘‘other-race’’ faces, in computational terms, will be defined in
various ways, ranging from race categories that constitute a minority of the
faces in a computationally based model memory or experience history, to
race categories that are underrepresented where a particular face recognition
algorithm was developed. We will return to, and expand on, these
definitional variables at several points in the paper.
From the computational perspective, engineers have been developing face
recognition models since the 1980s. The earliest attempts at these algorithms
quickly came up against the difficulty of encoding the uniqueness of
individual faces in the context of populations of faces that share the same
set of ‘‘features’’, arranged in roughly the same configuration. Feature-based
approaches to computationally based face recognition did not fare well,
because feature descriptors did not adequately capture the uniqueness of
individual faces. The second wave of computational attempts fared better
because these models captured subtle variations in the form and configuration
of the features, using global structure quantifiers, such as principal
components. The third wave of computational models has been successful
enough to produce commercial algorithms that are now used in industry and
by governments for identity verification.
Recent tests indicate the best algorithms now perform more accurately
than humans, in some challenging viewing environments (O’Toole, An,
Dunlop, Natu, & Phillips, 2012; O’Toole et al., 2007). Although computational
models of face recognition are still limited in fundamental ways, this
approach has informed psychologists about the nature of the computational
problems brains encounter in representing and remembering faces. In
particular, the models offer insight into the problem of quantifying
information in faces that can specify an individual’s identity and his/her
status with respect to visually derived semantic categories (e.g., sex, race, age)
(Bruce & Young, 1986). For present purposes, a computational framework
provides for a quantitative description of demographic categories of humans
(race, sex, age) and a description of how faces within a demographic category
differ from one another. This framework has proven useful for evaluating
and modelling the role of experience in the other-race effect.
We digress briefly to note that the term experience takes on different connotations
in computational and psychological studies. We will argue, however, that
there are more points of contact on the multiple meanings of this term than we
might first anticipate. To begin, the rich literature on developmental experience
shows that exposure or contact during a critical period can contribute to the
definition and delineation of neural and perceptual feature sets (Kuhl, 1994,
1998). This type of experience has an analogous form in computational models
that derive face representations from statistical learning. Changes to face
processing abilities that take place at this junction are not (easily) malleable
later in the life of the person or computational model. In addition, contact or
experience with people of different races later in life still impacts face
processing skill, but these effects differ in fundamental ways from those
imposed by developmental contact. As we will see, the computational analogy
to adult learning affects the memory capacity of the model and its ability to
keep distinct representations of similar faces. It is worth noting at the outset
that race is both a social and visually derived semantic category (Bruce &
Young, 1986). Gender and age are likewise social and visually derived semantic
categories. Similarly strong effects of experience on perception and memory
have not been reported for age and gender. This may be due to differences in
developmental versus later life contact with race versus gender and age. We will
return to the theme of learning types at several points in the paper.
In this paper, we review computational studies of the other-race effect.
This is a rather sparse literature. Throughout the paper, we emphasize
computational findings relevant for understanding the challenges of creating
representations that are shaped by ‘‘experience’’ with faces. Although the
role of experience in the other-race effect for humans has been controversial,
an understanding of the diversity of neural and computational embodiments
of experience may help to bridge theoretical gaps that are left open with
behavioural approaches.
COMPUTATIONAL APPROACHES TO THE OTHER-RACE EFFECT
Computer vision researchers have been developing computational models of
face recognition for roughly two decades now. Progress on this endeavour
has been reviewed in detail in a recent collection of papers (Li & Jain, 2011).
Historically, and even into the present, these models divide naturally into
two types. Both types are inspired from biological/psychological models of
human perception, albeit in different ways. The first approach is based on
Gabor wavelets connected with dynamic link architectures. Gabor filters are
roughly analogous to the receptive field properties of cells in the primary
visual cortex (e.g., Wiskott, Fellous, Kruger, & von der Malsburg, 1997; cf.
Shen & Bai, 2006, for a review). In these models, local Gabor filters sample
the face at points on a square lattice. The process of recognition is
implemented by deforming the lattice to ‘‘fit’’ other stored face exemplars.
If the amount of deformation required for the fit does not exceed some
threshold, the best-fit match is chosen as the recognized identity. These
models show some ability to compensate for changes in illumination,
expression, and small changes (10–15 degrees) in pose. They have also been
used to model aspects of human object and face processing (Biederman
& Kalocsai, 1997). As we will see here, these models operate independently of
experience with faces, a characteristic that makes them ill-suited to model
some aspects of the other-race effect.
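To make the first model type concrete, the following is a minimal sketch of a Gabor filter bank sampled at a single lattice point (often called a "jet"). It is an illustration under stated assumptions, not the implementation of Wiskott et al.: the function names (`gabor_kernel`, `filter_response`, `jet`), the kernel size, wavelength, and envelope width are all hypothetical choices, and the lattice deformation step of the dynamic link architecture is not shown.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, phase=0.0):
    """Real-valued Gabor kernel (size x size, size odd): a cosine carrier
    at orientation theta under an isotropic Gaussian envelope."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier varies along orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + yr * yr) / (2.0 * sigma * sigma))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + phase)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def filter_response(image, cx, cy, kernel):
    """Response of one kernel centred on pixel (cx, cy); assumes the kernel
    footprint lies entirely inside the image."""
    half = len(kernel) // 2
    total = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            total += image[cy + dy][cx + dx] * kernel[dy + half][dx + half]
    return total

def jet(image, cx, cy, kernels):
    """Vector of filter responses at one lattice point."""
    return [filter_response(image, cx, cy, k) for k in kernels]

# Two orientations (hypothetical parameters) applied to a uniform 7x7 image.
kernels = [gabor_kernel(5, 4.0, 0.0, 2.0), gabor_kernel(5, 4.0, math.pi / 2, 2.0)]
image = [[1.0] * 7 for _ in range(7)]
responses = jet(image, 3, 3, kernels)
```

Note that nothing in this sketch depends on a training set of faces: the filters are fixed in advance, which is the sense in which such representations are independent of experience.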
The second type of model is based on a global analysis of faces implemented
with principal components analysis (PCA; O’Toole, Abdi, Deffenbacher, &
Valentin, 1993; Sirovich & Kirby, 1987; Turk & Pentland, 1991). In their
original implementation, this analysis was applied to images of faces. Since
then, PCA has been applied to “morphable” face representations from
two-dimensional images (Hancock, Burton, & Bruce, 1996) and to
three-dimensional laser scans (Blanz & Vetter, 1999). In both cases, the data include a
separated representation of the shape and reflectance of a face. The
psychological appeal of PCA-based computational approaches, especially
implemented with morphable models, is that they are metaphorically compatible
with face space models (Valentine, 1991). As such, they have proven
valuable for understanding well-known effects in human face recognition. In
particular, a face-space framework has been used to reason about the effects of
typicality on face recognition (Light, Kayra-Stuart, & Hollander, 1979). The
framework also provides an appealing metaphor for face adaptation results
(Leopold, O’Toole, Vetter, & Blanz, 2001; Webster & MacLin, 1999). For
present purposes, face-space representations make for a relatively natural
conceptualization of how we perceive and remember own- and other-race faces.
As we shall see, however, not all computational implementations of face-space
models effortlessly produce an other-race effect. Comparing the ones that do,
with the ones that do not, offers interesting clues about how experience can
affect the quality of face representations for own- and other-race faces.
COMPUTATIONAL MODELLING OF THE OTHER-RACE EFFECT
The first attempt to computationally model the other-race effect was carried
out by O’Toole, Deffenbacher, Abdi, and Bartlett (1991) using a simple
autoassociative neural network. This was applied to images of faces that were
aligned so that the eye level and nose roughly coincided. Autoassociative
networks implement a PCA with an iterative procedure that is reminiscent of
perceptual learning. Specifically, these networks act as content-addressable
memories in which the storage of information is parallel and distributed
(images share the same storage space). The iterative “perceptual learning” procedure makes gradual adjustments to “neural synaptic” strength to
minimize errors in the ability of the memory to reconstruct a stored image
from a full or partial cue (occluded image). Using this model, O’Toole et al.
conceptualized the cause of the other-race effect as an imbalance in the quality
of face representations for own- and other-race faces. This imbalance was
thought to be due to differences in the amount of perceptual learning people
have for own- versus other-race faces. This type of model embodies the contact
hypothesis of the other-race effect (cf. Levin, 2000). The basic elements of a computational approach to the contact hypothesis are summarized in O’Toole
et al. and have not changed much in 20 years.
‘‘First, we assume that faces of different races comprise different statistical
categories of faces. Second, within a given category of faces, a set of
differentially weighted ‘‘features’’ is optimal for encoding faces in a manner
that makes faces within the category most discriminable. Different feature sets
and weightings, however, are optimal for processing faces from other-race
categories of faces. Third, with exposure to many faces of a given race and a
smaller number of faces of other races, perceptual learning enables observers to
make optimal use of the features that are best for processing faces from the
category with which they have had the most experience, typically faces of their
own race. By this account, the difficulties experienced with faces of another
race are due to the fact that the optimal features for distinguishing faces of
one’s own race are not optimal in processing the faces of another race’’
(O’Toole et al., 1991, pp. 164–165).
To test the model, O’Toole et al. (1991) implemented autoassociative
networks with a ‘‘majority-race’’ (95%) and ‘‘minority-race’’ (5%) of faces,
with Japanese and Caucasian faces serving alternately as the majority and
minority race. Next, they compared the quality of face representations for
novel majority versus novel minority-race faces. Novel refers to faces that
were not used to create the autoassociative matrix. The results revealed more
accurate reconstructions (i.e., representations) for faces from the majority
race than from the minority race. Moreover, inter-face similarity, computed as the similarity between all possible pairs of reconstructed images, was
higher for the minority-race than majority-race faces. Thus, the model
created less distinctive representations for minority faces. This result
simulates the basic perceptual components of the other-race effect.
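The logic of this simulation can be sketched with a toy example. The code below is not the O’Toole et al. (1991) model: it assumes four-dimensional synthetic "faces" rather than images, assumes that identity varies along different dimensions for the two groups, and extracts features as the top eigenvectors of the cross-product matrix (an uncentred PCA) by power iteration. All names and parameter values are hypothetical. The training set keeps the 95%/5% split described above.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_components(faces, k, iters=200):
    """Top-k eigenvectors of the cross-product matrix of the face vectors,
    found by power iteration with deflation."""
    dim = len(faces[0])
    cross = [[sum(f[i] * f[j] for f in faces) for j in range(dim)]
             for i in range(dim)]
    components = []
    for _ in range(k):
        v = [1.0 / math.sqrt(dim)] * dim
        for _ in range(iters):
            w = [dot(row, v) for row in cross]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in cross])
        components.append(v)
        # Deflate so the next pass converges to the next-largest component.
        cross = [[cross[i][j] - eigenvalue * v[i] * v[j] for j in range(dim)]
                 for i in range(dim)]
    return components

def reconstruction_error(face, components):
    """Squared error after projecting the face onto the learned features."""
    recon = [0.0] * len(face)
    for c in components:
        weight = dot(face, c)
        recon = [r + weight * ci for r, ci in zip(recon, c)]
    return sum((a - b) ** 2 for a, b in zip(face, recon))

def make_face(race, rng):
    """Toy face (assumption): majority identity varies on dims 0-1,
    minority identity on dims 2-3, plus small noise on every dimension."""
    a, b = rng.uniform(-2, 2), rng.uniform(-1, 1)
    noise = [rng.gauss(0, 0.05) for _ in range(4)]
    signal = [a, b, 0.0, 0.0] if race == "majority" else [0.0, 0.0, a, b]
    return [s + n for s, n in zip(signal, noise)]

rng = random.Random(0)
training = ([make_face("majority", rng) for _ in range(95)]
            + [make_face("minority", rng) for _ in range(5)])
features = top_components(training, k=2)

# Novel faces: not used to build the feature set.
novel_major = [make_face("majority", rng) for _ in range(20)]
novel_minor = [make_face("minority", rng) for _ in range(20)]
err_major = sum(reconstruction_error(f, features) for f in novel_major) / 20
err_minor = sum(reconstruction_error(f, features) for f in novel_minor) / 20
print(err_major, err_minor)
```

Because the two retained features are dominated by the majority group, novel majority faces are reconstructed with small error, while novel minority faces lose most of their identity-specific variation.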
The simulation of a face recognition task with this model proved more
difficult than simulating the perceptual effects. The problem was that the
quality of reconstructions for the learned minority faces was actually higher
than the quality of the learned majority faces. This finding was puzzling at
first, though the reason for it was ultimately obvious. The minority faces
were distinct with respect to the training set. Therefore, the matrix could
store the minority faces with minimal interference at the learning stage (i.e., these faces contributed to the features, or PCs, of the model). To simulate the
recognition component of the other-race effect, the solution was to use a
race-biased face history matrix combined with a short-term ‘‘recognition’’
matrix. The former was strongly race-biased and the latter had equal
numbers of Caucasian and Japanese faces. Using a weighted combination of
the history matrix (0.75) and the short-term matrix (0.25), O’Toole et al.
(1991) found the recognition memory version of the other-race effect.
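The weighted combination can be illustrated with a small sketch. The 0.75 and 0.25 weights come from the text; everything else is an assumption made for illustration, including the use of averaged outer products as the autoassociative matrix, the two-dimensional toy vectors, and the function names.

```python
def outer_mean(faces):
    """Average outer product of the face vectors: a simple linear
    autoassociative matrix (normalizing by set size is an assumption)."""
    dim = len(faces[0])
    n = len(faces)
    return [[sum(f[i] * f[j] for f in faces) / n for j in range(dim)]
            for i in range(dim)]

def combine(history, short_term, w_history=0.75, w_short=0.25):
    """Weighted sum of the long-term history matrix and the short-term matrix."""
    return [[w_history * h + w_short * s for h, s in zip(hrow, srow)]
            for hrow, srow in zip(history, short_term)]

def recall(matrix, cue):
    """Reconstruct a face from a cue by one pass through the memory."""
    return [sum(m * c for m, c in zip(row, cue)) for row in matrix]

# Race-biased history (3 majority : 1 minority) and a balanced short-term set.
history_faces = [[1.0, 0.0]] * 3 + [[0.0, 1.0]]
recent_faces = [[1.0, 0.0], [0.0, 1.0]]
memory = combine(outer_mean(history_faces), outer_mean(recent_faces))

majority_echo = recall(memory, [1.0, 0.0])
minority_echo = recall(memory, [0.0, 1.0])
```

In this toy case the combined memory returns a stronger echo for the majority cue than for the minority cue, even though both were equally represented in the short-term set, because the biased history matrix dominates the sum.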
In summary, O’Toole et al. (1991) showed that it was possible to simulate the perceptual and memory components of the other-race effect using a
simple computational model in which experience or training with different
races is varied. The need to incorporate different kinds of experience into the
model to obtain the recognition effect suggested that a learning history is
critical in determining the quality and suitability of the feature space for
representing faces. The long-term history may be relevant for understanding
the role of developmental learning in constraining the quality of representations
possible for faces from different races.
The importance of considering developmental experience in computational
accounts of the other-race effect was reinforced in a study over a decade later.
Furl, Phillips, and O’Toole (2002) evaluated face recognition algorithms from
the Face Recognition Technology (FERET) program (Phillips, Moon, Rizvi,
& Rauss, 2000). The FERET evaluation was a US government-sponsored test
of computer-based face recognition algorithms, conducted between 1994 and
1997. Furl et al. evaluated six algorithms from that test, plus seven additional
control algorithms implemented by the organizers of the FERET, as baseline algorithms. Furl et al. tested the algorithms’ ability to recognize Caucasian
(majority race) and Asian (minority race) faces. Note that in this study, the
majority race of Caucasian was set by the FERET competition database and
was not under the control of the experimenters.
The algorithms available from the FERET could be grouped into
categories that mapped well onto two psychological hypotheses about the
role of experience in the other-race effect. The generic contact hypothesis
gives equal weight to learning throughout the “virtual lifespan” of the algorithm. Eight algorithms fit this description, including seven baseline
PCA models, implemented by Moon and Phillips (2001; see footnote 1), and an eighth
PCA-based algorithm from Moghaddam and Pentland (1997).
1 Jonathon Phillips was the organizer of the FERET test and so the controls were
implemented as baseline algorithms against which the others could be compared.
The developmental contact hypothesis assumes that exposure to faces early
in life tunes feature selection to optimize the quality of representations for
the faces we see most, typically faces of our own race (cf. Nelson, 2001; see footnote 2).
Once a critical period has passed, these features remain stable. This account
is similar to those proposed by Kuhl (1994, 1998) for native language
learning, whereby phonetic features are tuned with early developmental
exposure to language, and remain stable thereafter. This perspective also
figures prominently in more recent studies of the development of face
recognition (cf. Anzures et al., this issue 2013). Returning to the question of
faces, three algorithms in the FERET used a two-stage learning process
analogous to developmental and mature learning (Moghaddam & Pentland,
1998; Swets & Weng, 1996; Zhao, Krishnaswamy, Chellappa, Swets, &
Weng, 1998). The first step was one of feature selection (using PCA). This
was followed by a standard linear discriminant analysis on a set of newly
learned test faces.
Finally, two additional control algorithms available from the FERET test
were deemed noncontact algorithms, because they used the image-based
discriminability of faces. Of note, one of these was an algorithm based on a
dynamic link architecture that processed the output of Gabor jet filters. This
model produces a representation of faces that is independent of the learning
history of the algorithm.
More concretely, all algorithms in the FERET evaluation were trained
using a large number of faces (n = 501). Caucasians comprised the majority
(64%) of faces and Asians were the next most populous ethnic group, making
up approximately 7% of the set. Furl et al. (2002) tested face recognition
accuracy for the algorithms using an equal number of Asian and Caucasian
faces from the training set (old) and an equal number of novel Asian and
Caucasian faces. The results were clear. Algorithms in the developmental
contact group consistently yielded better performance with the Caucasian
faces (majority race). Similar to the result found earlier by O’Toole et al.
(1991) with the PCA model without a learning history (i.e., a step whereby
PC features were based on a learning step with a race bias), seven of the eight
generic contact models performed more accurately on the minority race of
Asian faces. The two noncontact models split between an Asian advantage
and a Caucasian advantage.
The study by Furl et al. (2002) reinforced some basic computational
principles relevant for understanding psychological embodiments of the
contact hypothesis. Specifically, it is well known that contact with other-race
faces, measured in self-report surveys, is a poor predictor of the magnitude
of the other-race effect (Levin, 2000). As noted by Levin (2000), using this
2 We assume this will be covered in detail by Anzures et al. (this issue 2013).
type of method, some studies support the contact hypothesis (Carroo, 1986;
Chiroro & Valentine, 1995; Cross, Cross, & Daly, 1971; Feinman & Entwisle,
1976; Shepherd, Deregowski, & Ellis, 1974) and others do not (Brigham &
Barkowitz, 1978; Lavarkas, Buri, & Mayzner, 1976; Malpass & Kravitz,
1969; Ng & Lindsay, 1994). The most consistent effects of contact have been
reported in developmental studies, where other-race exposure occurs early in
life (Chance, Turner, & Goldstein, 1982; Cross et al., 1971; Feinman &
Entwisle, 1976). These studies have now been joined by newer studies with
infants, showing that experience with different races of faces can result in
differences for own- versus other-race face processing as early as 3 months of
age (e.g., Kelly, Liu, Ge, Quinn, Slater, Lee, and Pascalis, 2007).
Following along the lines of the emergence of the other-race effect in
infants, Balas (2012) implemented a computational Bayesian model of the
development of the other-race effect. He modelled developmental learning as
a perceptual narrowing of infants’ ability to discriminate individual faces,
which proceeded in the context of the development of face race categories.
The model was a variant of Moghaddam and Pentland’s (1998) algorithm, one of the models that exhibited an other-race effect in the study of Furl
et al. (2002). This algorithm represents a unique and intriguing approach to
the problem in that it directly learns to distinguish appearance differences
that are due to intrapersonal variation from appearance differences due to
extrapersonal variation. As such, training examples for the model come from
difference images created from the same person (intrapersonal) and from
different people (extrapersonal variation). The key manipulation in the
simulations of Balas was the inclusion or exclusion of extrapersonal
variations that crossed race boundaries. The model proceeds by generating
two face spaces: one from the intrapersonal difference images and the other
from the extrapersonal difference images. Next, a Bayesian classifier was
trained to discriminate the two types of variation. Using the resultant
discriminator, Balas simulated a Visual Paired Comparison (VPC) task
commonly employed in infant experiments. In this task, infants view a target
face and must compare it to two additional images: one that matches the
target identity and a second image of a different person. Here, Balas tested
the model on its ability to make VPC discriminations.
As noted, the inclusion or exclusion of training examples of extrapersonal
variations that crossed race boundaries was manipulated. When these cross-
race examples were included, Balas (2012) found better performance for the
majority White faces (90% of the training data) than for the minority Asian
race faces. He concluded that the development of the other-race effect in this
model is consistent with perceptual narrowing, whereby the formation of
race categories plays an important role in determining the relative
discriminability of own- and other-race faces.
The importance of a computationally based experience history in the
other-race effect was investigated further by Haque and Cottrell (2005), this
time, with a focus on race categorization. Levin (2000) showed that other-
race faces can be categorized as faces more quickly than own-race faces.
Levin proposed that for other-race faces, race acts as a salient “feature” that is easy to detect. For own-race faces, race is treated as the absence of the
feature. Haque and Cottrell modelled the categorization advantage people
show for other-race faces using a PCA similar to that used in O’Toole et al.
(1991). In this case, however, they modelled the information content of faces
from a majority race versus minority race (Asian or Caucasian), using a
representation created by the PCA model. Information content was assessed
based on Shannon Entropy, as a kind of outlier measure, i.e., outlier faces
have more information. Indeed, they found that the minority race faces had significantly more information than the majority race faces. They interpreted
this result in terms of a feature-positive state or salient marker of race in the
minority race faces.
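The idea of information content as an outlier measure can be illustrated as follows. This sketch is not Haque and Cottrell’s procedure, which is not specified here: it assumes a single PCA coefficient per face, assumes a Gaussian density fitted to the whole (race-biased) training set, and scores each face by its surprisal, the negative log probability density in bits. The coefficient values are made up for illustration.

```python
import math

def gaussian_surprisal_bits(x, mean, var):
    """-log2 of a Gaussian density at x; grows as x moves away from the mean."""
    return (0.5 * math.log2(2.0 * math.pi * var)
            + (x - mean) ** 2 / (2.0 * var * math.log(2.0)))

# Hypothetical PCA coefficients: 95 majority faces near 0, 5 minority faces near 3.
majority = [-1.0, -0.5, 0.0, 0.5, 1.0] * 19
minority = [2.5, 3.0, 3.0, 3.5, 4.0]
coeffs = majority + minority

# Density estimated from the whole biased set, as one "experience history".
mean = sum(coeffs) / len(coeffs)
var = sum((c - mean) ** 2 for c in coeffs) / len(coeffs)

info_major = sum(gaussian_surprisal_bits(c, mean, var) for c in majority) / len(majority)
info_minor = sum(gaussian_surprisal_bits(c, mean, var) for c in minority) / len(minority)
print(info_major, info_minor)
```

Because the fitted density is centred on the majority faces, the minority faces fall in its tail and carry more information on average, which is the sense in which race acts as a salient marker for the under-represented group.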
Moving the clock forward, we are now in an era in which face recognition
algorithms are commercial products, widely used by governments, law
enforcement agencies, and other industries where identity verification is
necessary. Over the past decade, beginning with the FERET program, the
US Government has organized large-scale, international tests of face recognition algorithms on a regular basis. Because these tests attract the
best algorithms in the world, and because the results of these tests are
publicly available, much is known about the state of the art for automatic
face recognition. Since roughly 2005, our lab has had the opportunity to
conduct head-to-head comparisons between algorithms and humans. In
these experiments, we have compared humans and machines at the task of
matching identity in pairs of images.
We digress briefly to describe the procedure we have used in the human–machine comparisons we discuss here. In the large-scale government tests,
algorithms match identity in pairs of images (often more than 100 million
pairs). They do this by assigning a similarity score to each pair of images.
The similarity score indicates the algorithm’s estimate that the two images
are of the same person. The algorithm data consists of a distribution of
similarity scores for same-identity image pairs (pictures of the same person)
and a distribution of similarity scores for different-identity pairs (pictures of
different people). A receiver operating characteristic (ROC) curve, created from the same- and different-identity distributions, is used to summarize an
algorithm’s performance. Human participants in our experiments likewise
match identity in interesting subsets (i.e., demographic groups) of the image
pairs used for the algorithm tests. Participants generate a similarity score for
each image pair by rating them on the following scale: (1) sure they are same
person; (2) think they are same person; (3) don’t know; (4) think they are
different people; (5) sure they are different people. These ratings are used to
create analogous ROC curves for humans and machines using the same
image pairs.
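The procedure just described can be sketched in a few lines. The code below builds an ROC curve from similarity scores for same-identity and different-identity pairs and summarizes it by the area under the curve. The conversion of a 1 to 5 rating into a similarity score (6 minus the rating), the example ratings, and the function names are assumptions made for illustration; algorithm similarity scores could be passed to `roc_curve` directly.

```python
def roc_curve(same_scores, diff_scores):
    """ROC points (false-alarm rate, hit rate), sweeping a threshold from
    the highest similarity score to the lowest."""
    thresholds = sorted(set(same_scores) | set(diff_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        hit_rate = sum(s >= t for s in same_scores) / len(same_scores)
        false_alarm_rate = sum(d >= t for d in diff_scores) / len(diff_scores)
        points.append((false_alarm_rate, hit_rate))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical human ratings (1 = sure same person ... 5 = sure different people).
same_pair_ratings = [1, 1, 2, 2, 3]
diff_pair_ratings = [5, 4, 4, 3, 2]

# Assumed conversion: a higher score means "more likely the same person".
same_scores = [6 - r for r in same_pair_ratings]
diff_scores = [6 - r for r in diff_pair_ratings]

points = roc_curve(same_scores, diff_scores)
auc = area_under_curve(points)
print(points, auc)
```

Using the same function for human ratings and algorithm scores is what makes the two ROC curves directly comparable on a common set of image pairs.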
In our first human–machine comparisons, we worked with algorithms
entered in the Face Recognition Grand Challenge (FRGC; Phillips et al.,
2005) and the Face Recognition Vendor Test 2006 (FRVT 2006; Phillips
et al., 2010). In those first experiments, we found, much to our surprise, that
the best algorithms performed more accurately than humans, even with
differences in illumination between the two images (O’Toole et al., 2007;
O’Toole, Phillips, & Narvekar, 2008). Figure 1 shows an example image pair,
with one image taken with studio lighting and the other image taken in a
corridor. With more recent algorithms, tested with highly variable images
(i.e., taken indoors and outdoors, and with expression and appearance
changes), algorithms are now better than humans in all but the most
challenging conditions (O’Toole, An, et al., 2012).
One issue with all of these tests is that the database used to evaluate the
algorithms contains mostly Caucasian faces. The sheer number of stimulus
pairs tested in the FRVT 2006, however, made it possible to determine
whether algorithms in these tests showed evidence of an other-race effect.
Figure 1. Example of stimulus pair and response options. To view this figure in colour, please see the
online issue of the Journal.
For algorithms, this amounts to asking the following question. ‘‘Does the
ethnic composition of the population at the geographic origin of the
algorithm (i.e., where it was developed) affect how an algorithm performs
on faces of different races?’’ We defer a discussion of why we think this might
be the case, until after we present the results.
Phillips, Jiang, Narvekar, Ayyad, and O’Toole (2011) carried out the test
as follows. Algorithms in the FRVT 2006 could be divided into those
originating in Western countries (n = 8, from France, Germany, and the
United States) and those originating in East Asian countries (n = 5, from
Japan, Korea, and China). Phillips et al. created a Western fusion algorithm
by combining the similarity estimates produced by the Western algorithms
(see footnote 3) and an East Asian fusion algorithm by combining the similarity scores produced by
the East Asian algorithms. Next, all available same-identity and different-identity
pairs of Caucasian faces (n = 3,359,404) and Asian faces (n = 205,114) were used to create the ROC curves shown in Figure 2. These
curves show the classic other-race effect, with the East Asian fusion
algorithm more accurate with East Asian faces and the Western fusion
algorithm more accurate with Caucasian faces.
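The fusion step can be illustrated with a minimal sketch. Footnote 3 states only that the scores were rescaled and averaged; the particular rescaling used below (z-scoring each algorithm's scores), the function names, and the toy values are all assumptions made for illustration and are not taken from Phillips et al. (2011).

```python
import numpy as np


def zscore(scores):
    """Rescale one algorithm's similarity scores to zero mean and unit variance.

    Assumption: z-scoring stands in for the unspecified rescaling step.
    """
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        # A constant score vector carries no information about the pairs.
        return np.zeros_like(scores)
    return (scores - scores.mean()) / std


def fuse_scores(score_matrix):
    """Average the rescaled scores across algorithms.

    score_matrix has shape (n_algorithms, n_pairs); row i holds the
    similarity scores that algorithm i assigned to the same list of pairs.
    """
    score_matrix = np.asarray(score_matrix, dtype=float)
    rescaled = np.vstack([zscore(row) for row in score_matrix])
    return rescaled.mean(axis=0)


# Toy example (hypothetical values): two algorithms that report scores on
# different scales for the same three image pairs.
scores = np.array([
    [0.10, 0.50, 0.90],   # algorithm A, scores in [0, 1]
    [10.0, 50.0, 90.0],   # algorithm B, scores in [0, 100]
])
fused = fuse_scores(scores)
```

Rescaling before averaging keeps an algorithm with a wide numeric score range from dominating the fused score; other normalizations (for example, min-max scaling) would serve the same purpose.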
Additional experiments by Phillips et al. (2011) replicated the finding that
algorithm and human performance are closely matched. The study also
indicated that humans show an other-race effect at the task of identity
matching, a different task than those typically used in behavioural studies.
Figure 2. Receiver operating characteristic (ROC) curves for East Asian and Western algorithms on
East Asian and Caucasian faces.
³ Combining was done by rescaling the scores from the different algorithms and averaging
them.
Perhaps the most intriguing difference between the human and machine
behaviour was that although humans showed the other-race effect, their
performance was more stable across changes in face race than was the
performance of the algorithms, which in some cases “tanked” on faces of the
other race.
Finally, we must ask the question of why the algorithms in this study
showed an other-race effect. Why would the demographic origin of the
algorithm affect its relative performance for faces from the local population
versus faces from a nonlocal population? Unfortunately, this is a question we
cannot answer definitively because many of the algorithms that compete in
international tests are proprietary. To protect the proprietary nature of the
software, while still encouraging the participation of the very best algorithms,
the US Government tests use executable versions of the software for their
evaluations. As a consequence, we can only speculate about the reason(s) for
the other-race effect found by Phillips et al. (2011). Almost certainly, part of
this result has to do with the availability of training data (i.e., faces) where
individual algorithms are developed. This, combined with the likelihood that
most state-of-the-art algorithms employ statistical learning analyses for facial
feature selection, could cause an other-race effect. Fortunately, the next study
moves us closer to understanding the role of experience in the other-race effect
for current computer-based face recognition systems.
Klare, Burge, Klontz, Vorder Bruegge, and Jain (2012) examined the effects
of demographics on the performance of six face recognition algorithms. Three
of these were commercial off-the-shelf (COTS) algorithms (Cognitec’s
FaceVACS ver. 8.2, PittPatt ver. 5.2.2, and Neurotechnology’s MegaMatcher ver.
3.1). Although the COTS algorithms probably make use of training regimes,
the authors were not able to alter this training in any way for their study. Two
algorithms were nontrainable: a local binary pattern model and a Gabor feature
representation model. The sixth algorithm was trainable: the spectrally
sampled structural subspace features (4SF) algorithm, developed in-house by
Klare et al. The authors used this trainable algorithm to test hypotheses about
the computational mechanisms underlying the other-race effect.
Klare et al. (2012) tested the six algorithms on a large database that could
be subdivided into eight demographic categories. These categories were race
(Black, Hispanic, and White), sex, and age (younger, 18–30; middle-aged,
30–50; and older, 50–70). All three COTS algorithms and the two nontrainable
algorithms had lower match accuracies on the following three demographic
groups: Blacks, females, and younger subjects. To gain insight into the role of
training, Klare et al. next tested the performance of their 4SF algorithm, as
follows. They compared their algorithm when it was trained with all three
ethnic groups simultaneously to three separate implementations of the
algorithm trained with each of the ethnic groups individually. The results
indicated that face-matching accuracy was best when the system was trained
only on faces of the same ethnicity. The authors suggest that all COTS face
recognition algorithms should have access to multiple face recognition
systems, trained on different demographic cohorts.
The studies of Klare et al. (2012) and Phillips et al. (2011) underscore the
importance of understanding the mechanisms behind human and machine-based other-race effects. Because face recognition technology is now assigned
to critical tasks, including passport and visa screening in many countries, the
relative accuracy of these machines for faces of different races is more than a
question of interest to psychologists. Rather, it has become part of larger
issues of social policy and nondiscrimination in assuring that everyone is
treated fairly and equally by these emerging technologies.
SUMMARY
In conclusion, in the 20 or more years since the first algorithm-based
model of the other-race effect, we have gained insight into computational
mechanisms that can give rise to the other-race effect. All of the models we
have seen indicate that differential experience with faces of various races, per
se, is not sufficient to produce the effect. Rather, to produce an other-race
effect computationally, biased experience or learning must intervene during
the process of feature selection. This computational principle aligns well with
developmental learning, which may produce a set of stable features that are
optimal for own-race face representation, but are limited in their ability to
represent the uniqueness of other-race faces.
ON MEASURING THE OTHER-RACE EFFECT FOR ALGORITHMS
In this final section, we briefly discuss a recently identified problem in
accurately measuring the performance of face recognition systems on
different demographic groups of faces. Although this may seem an esoteric
point that is of interest only to researchers who test algorithms, it is an
excellent example of how cross-talk between psychologists and computer
vision researchers can inform attempts to use automatic face recognition
technology. This measurement issue has become increasingly important with
the use of these systems in airports and in other venues where there is a
continually changing tableau of faces of many races.
As noted previously, the performance of algorithms on face identification
tasks requires a distribution of similarity scores for pairs of images that show the
same identity and a distribution of similarity scores for pairs of images that show
different people. There is a strong tendency in the computer vision community
to worry only about the same-identity distribution. In other words, the idea is
that if the target and test image are similar, the algorithm will always perform
well. Of course, this is one part of the problem. The other part has to do with the
similarity of pairs of images that show different people. Psychologists have long
worried about the background against which known faces are encoded, in the
context of face typicality effects and other-race effects.
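To show why both distributions matter, the following is a minimal sketch of how an ROC curve is built from a set of same-identity scores and a set of different-identity scores. The function names and the toy scores are hypothetical and chosen only for illustration; they do not come from any of the evaluations discussed here.

```python
import numpy as np


def roc_points(same_scores, diff_scores):
    """Return (false accept rates, true accept rates) over all thresholds.

    A pair is "accepted" as the same identity when its similarity score
    is at or above the threshold.
    """
    same = np.asarray(same_scores, dtype=float)
    diff = np.asarray(diff_scores, dtype=float)
    # Sweep thresholds from strictest (highest score) to most lenient.
    thresholds = np.unique(np.concatenate([same, diff]))[::-1]
    tar = np.array([(same >= t).mean() for t in thresholds])
    far = np.array([(diff >= t).mean() for t in thresholds])
    # Start the curve at (0, 0): a threshold above every score accepts nothing.
    return np.concatenate([[0.0], far]), np.concatenate([[0.0], tar])


def area_under_curve(far, tar):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(far) * (tar[1:] + tar[:-1]) / 2.0))


# Toy scores (hypothetical): higher means "more similar".
same = [0.9, 0.8, 0.4]        # pairs showing the same person
diff = [0.7, 0.3, 0.2, 0.1]   # pairs showing different people
far, tar = roc_points(same, diff)
area = area_under_curve(far, tar)
```

The same-identity scores determine only the vertical axis of the curve. The horizontal axis depends entirely on the different-identity scores, so changing which different-identity pairs are included changes the measured performance even when the algorithm and the same-identity pairs stay fixed.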
By definition, the same-identity pairs are of the same sex and ethnicity,
though not necessarily of the same age (pictures may be taken weeks, months, or
years apart). Different-identity pairs, however, might show people of
different ethnicities, genders, or ages. Whether different-identity image pairs
are constrained to be of the same sex, race, and age is a decision made by the
researchers. In many cases, the performance of face recognition technology is
evaluated with no constraints on the different-identity distribution.
O’Toole, Phillips, An, and Dunlop (2012) documented the effects of
yoking different-identity pairs by gender, race, or both on estimates of the
performance of face recognition algorithms. As expected, performance, as
measured with ROCs, looks best when the different-identity similarity score
distribution is completely unconstrained. In other words, this occurs when
the different-identity pairs are allowed to differ in race and sex.
Concomitantly, performance looks worse when different-identity pairs are
constrained to be of the same sex and race.
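Yoking can be sketched as a filter on the candidate different-identity pairs. This is an illustrative sketch under assumed data structures: the record fields ("image", "id", "race", "sex"), the function name, and the toy records are hypothetical, and the procedure is not claimed to reproduce the one used by O'Toole, Phillips, An, and Dunlop (2012).

```python
from itertools import combinations


def different_identity_pairs(faces, yoke_on=()):
    """List image pairs that show different people.

    faces   : list of dicts with keys "image", "id", and demographic labels
    yoke_on : demographic keys on which both images in a pair must agree;
              an empty tuple leaves the pairs unconstrained
    """
    pairs = []
    for a, b in combinations(faces, 2):
        if a["id"] == b["id"]:
            continue  # same person, so not a different-identity pair
        if all(a[key] == b[key] for key in yoke_on):
            pairs.append((a["image"], b["image"]))
    return pairs


# Toy records (hypothetical): six images of five people.
faces = [
    {"image": "img1", "id": "p1", "race": "East Asian", "sex": "F"},
    {"image": "img2", "id": "p2", "race": "East Asian", "sex": "M"},
    {"image": "img3", "id": "p3", "race": "Caucasian", "sex": "F"},
    {"image": "img4", "id": "p4", "race": "Caucasian", "sex": "M"},
    {"image": "img5", "id": "p4", "race": "Caucasian", "sex": "M"},
    {"image": "img6", "id": "p5", "race": "Caucasian", "sex": "M"},
]

unconstrained = different_identity_pairs(faces)
same_race = different_identity_pairs(faces, yoke_on=("race",))
same_race_and_sex = different_identity_pairs(faces, yoke_on=("race", "sex"))
```

Each added constraint removes the pairs that differ on that attribute, which are typically the easiest ones to reject. The yoked set is therefore smaller and harder, which is consistent with the lower measured performance reported for constrained different-identity distributions.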
Although this is in some ways an obvious result, it is one that becomes
important when we consider the background population against which
automatic face recognition systems must function. Imagine, for a moment,
an international airport in Europe. The ethnic diversity of the background
population may vary by the time of day (i.e., planes from East Asia,
Europe, North America, and Africa may land at different times of the day)
and/or by the time of year (tourist season). An unstable background
population will give rise to unstable expectations about how well the
algorithm will operate for faces from different races. Behavioural studies of
face typicality have focused the attention of psychologists on the importance
of these background distributions (i.e., what is typical in a particular
context). These findings also have a place in understanding and predicting
the behaviour of face recognition software in the field.
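One way to make this instability concrete is a simple mixture argument. The notation here is ours and hypothetical: suppose the different-identity comparisons fall into categories g (for example, same-race pairs and cross-race pairs), that category g makes up a proportion \pi_g of the comparisons at a given time, and that FAR_g(t) is the false accept rate for category g at a fixed decision threshold t. The overall false accept rate is then the weighted average:

```latex
\mathrm{FAR}_{\mathrm{mix}}(t) \;=\; \sum_{g} \pi_g \, \mathrm{FAR}_g(t),
\qquad \pi_g \ge 0, \quad \sum_{g} \pi_g = 1 .
```

Because the threshold t is typically fixed in advance while the proportions \pi_g shift with the arriving population, the overall false accept rate shifts as well, even though the algorithm itself is unchanged.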
CONCLUSION
Computational algorithms provide a basic framework for testing perceptual
mechanisms that may give rise to an other-race effect, and thus have the
potential to inform psychological tests of the phenomenon. Understanding
how humans develop and retain perceptual and memory advantages for
faces of their own race has always been an important question in both
social and eyewitness domains. Face recognition systems are becoming a
commodity that we all deal with at border crossings and in
security-monitored settings (banks, embassies, railways, airports). From the broad
perspective of social and political policy, researchers from both psychology
and computer vision must begin to consider and counter the role face race
can play in the accuracy of these systems.
Before closing, it is worth pointing out that, to date, computational
modelling efforts are limited by the availability of truly diverse databases that
have high quality images of several races. Notably absent in this literature is the
inclusion of images of people of Hispanic and African descent. This limits the
generality of the conclusions that can be made both in visual/perceptual and
social terms. Direct comparisons between humans and machines need to
consider both the perceptual challenges posed by faces of other races, as well as
the social context of interactions among own- and other-race people.
Asymmetries in social contact also pose a challenge, with some people (and
categories of people) broadly exposed to other-race faces, and others less so.
Future efforts should be aimed at developing and testing algorithms on (as yet
nonexistent) databases that represent the full diversity of the human race.
REFERENCES
Anzures, G., Quinn, P. C., Pascalis, O., Slater, A. M., & Lee, K. (2013). Development of own
and other-race biases. Visual Cognition. doi:10.1080/13506285.2013.821428
Balas, B. (2012). Bayesian face recognition and perceptual narrowing in face-space.
Developmental Science, 15, 579–588. doi:10.1111/j.1467-7687.2012.01154.x
Biederman, I., & Kalocsai, P. (1997). Neurocomputational bases of object and face recognition.
Philosophical Transactions of the Royal Society of London, 29B, 1203–1219.
doi:10.1098/rstb.1997.0103
Blanz, V., & Vetter, T. (1999). A morphable model for the synthesis of three dimensional faces.
In SIGGRAPH ’99: Proceedings of the 26th annual conference on Computer Graphics and
Interactive Techniques (pp. 187–194). New York, NY: ACM Press and Addison-Wesley.
Brigham, J. C., & Barkowitz, P. (1978). Do “They all look alike?” The effects of race, sex,
experience and attitudes on the ability to recognize faces. Journal of Applied Social
Psychology, 8, 306–318. doi:10.1111/j.1559-1816.1978.tb00786.x
Bruce, V., & Young, A. (1986). Understanding face recognition. British Journal of Psychology,
77, 305–327. doi:10.1111/j.2044-8295.1986.tb02199.x
Carroo, A. W. (1986). Other-race face recognition: A comparison of Black American and
African subjects. Perceptual and Motor Skills, 62, 135–138. doi:10.2466/pms.1986.62.1.135
Chance, J. E., Turner, A. L., & Goldstein, A. G. (1982). Development of differential recognition
for own- and other-race faces. Journal of Psychology, 112, 29–37.
doi:10.1080/00223980.1982.9923531
Chiroro, P., & Valentine, T. (1995). An investigation of the contact hypothesis of the own-race
bias in face recognition. Quarterly Journal of Experimental Psychology: Human
Experimental Psychology, 48A, 879–894.
Cross, J. F., Cross, J., & Daly, J. (1971). Sex, race, age, and beauty as factors in recognition of
faces. Perception and Psychophysics, 10, 393–396. doi:10.3758/BF03210319
Feinman, S., & Entwisle, D. R. (1976). Children’s ability to recognize other children’s faces.
Child Development, 47, 506–510. doi:10.2307/1128809
Furl, N., Phillips, P. J., & O’Toole, A. J. (2002). Face recognition algorithms as models of the
other-race effect. Cognitive Science, 96, 1–19.
Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and
principal components analysis. Memory and Cognition, 24, 26–40. doi:10.3758/BF03197270
Haque, A., & Cottrell, G. W. (2005). Modeling the other race advantage. In Proceedings of the
27th annual Cognitive Science conference (pp. 899–904). Mahwah, NJ: Lawrence Erlbaum
Associates.
Kelly, D. J., Liu, S., Ge, L., Quinn, P. C., Slater, A. M., Lee, K., & Pascalis, O. (2007). Cross-race
preferences for same-race faces extend beyond the African versus Caucasian contrast in
3-month-old infants. Infancy, 11, 87–95. doi:10.1207/s15327078in1101_4
Klare, B. F., Burge, M. J., Klontz, J. C., Vorder Bruegge, R. W., & Jain, A. K. (2012). Face
recognition performance: Role of demographic information. IEEE Transactions on
Information Forensics and Security, 7, 1789–1801. doi:10.1109/TIFS.2012.2214212
Kuhl, P. K. (1994). Learning and representation in speech. Current Opinion in Neurobiology, 4,
812–822. doi:10.1016/0959-4388(94)90128-7
Kuhl, P. K. (1998). The development of speech and language. In T. J. Carew, R. Menzel, & C. J.
Schatz (Eds.), Mechanistic relationships between development and learning (pp. 53–73). New
York, NY: Wiley.
Lavrakas, P. J., Buri, J. R., & Mayzner, M. S. (1976). A perspective on the recognition of other-
race faces. Perception and Psychophysics, 20, 475–481. doi:10.3758/BF03208285
Leopold, D. A., O’Toole, A. J., Vetter, T., & Blanz, V. (2001). Prototype-referenced shape encoding
revealed by high-level aftereffects. Nature Neuroscience, 4, 89–94. doi:10.1038/82947
Levin, D. (2000). Race as a visual feature: Using visual search and perceptual discrimination
tasks to understand face categories and the cross-race recognition deficit. Journal of
Experimental Psychology: General, 129, 559–574. doi:10.1037/0096-3445.129.4.559
Li, S. Z., & Jain, A. K. (2011). Handbook of face recognition. London: Springer-Verlag.
Light, L. L., Kayra-Stuart, F., & Hollander, S. (1979). Recognition memory for typical and
unusual faces. Journal of Experimental: Human Perception and Performance, 5, 212–228.
Malpass, R. S., & Kravitz, J. (1969). Recognition for faces of own and other race. Journal of
Personality and Social Psychology, 13, 330–334. doi:10.1037/h0028434
Moghaddam, B., & Pentland, A. (1997). Probabilistic visual learning for object detection. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 19, 696–710. doi:10.1109/34.598227
Moghaddam, B., & Pentland, A. (1998). Beyond linear eigenspaces: Bayesian matching for face
recognition. In H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman Soulie, & T. S. Huang
(Eds.), Face recognition: From theory to applications (pp. 230–243). Berlin: Springer.
Moon, H., & Phillips, P. J. (2001). Computational and performance aspects of PCA-based face
recognition algorithms. Perception, 30, 301–321. doi:10.1068/p2896
Nelson, C. A. (2001). The development and neural bases of face recognition. Infant and Child
Development, 10, 3–18. doi:10.1002/icd.239
Ng, W., & Lindsay, R. C. L. (1994). Cross-race facial recognition: Failure of the contact hypothesis.
Journal of Cross-Cultural Psychology, 25, 217–232. doi:10.1177/0022022194252004
O’Toole, A. J., Abdi, H., Deffenbacher, K. A., & Valentin, D. (1993). Low dimensional
representation of faces in higher dimensions of the face space. Journal of the Optical Society
of America, 10A, 405–410.
O’Toole, A. J., An, X., Dunlop, J. P., Natu, V., & Phillips, P. J. (2012). Comparing face
recognition algorithms to humans on challenging tasks. ACM Transactions on Applied
Perception, 9. doi:10.1145/2355598.2355599
O’Toole, A. J., Deffenbacher, K. A., Abdi, H., & Bartlett, J. C. (1991). Simulation of “other-race
effect” as a problem in perceptual learning. Connection Science, 3, 163–178.
doi:10.1080/09540099108946583
O’Toole, A. J., Phillips, P. J., An, X., & Dunlop, J. (2012). Demographic effects on estimates of
automatic face recognition. Image and Vision Computing, 30, 169–176.
doi:10.1016/j.imavis.2011.12.007
O’Toole, A. J., Phillips, P. J., Jiang, F., Ayyad, J., Penard, N., & Abdi, H. (2007). Face
recognition algorithms surpass humans matching faces across changes in illumination.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 29, 1642–1646.
doi:10.1109/TPAMI.2007.1107
O’Toole, A. J., Phillips, P. J., & Narvekar, A. (2008). Humans versus algorithms: Comparisons
from the Face Recognition Vendor Test 2006. Paper presented at the eighth IEEE
international conference on Automatic Face and Gesture Recognition.
Phillips, P. J., Flynn, P. J., Scruggs, T., Bowyer, K. W., Chang, J., Hoffman, K., . . . Worek, W.
(2005). Overview of the face recognition grand challenge. In Proceedings of the Computer
Society conference on Computer Vision and Pattern Recognition (pp. 947–954). Los Alamitos,
CA: IEEE Computer Society Press.
Phillips, P. J., Jiang, F., Narvekar, A., Ayyad, J., & O’Toole, A. J. (2011). An other-race effect for
face recognition algorithms. ACM Transactions on Applied Perception, 8(2), 1–11.
doi:10.1145/1870076.1870082
Phillips, P. J., Moon, H., Rizvi, S., & Rauss, P. (2000). The FERET evaluation method for face
recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22,
1090–1104. doi:10.1109/34.879790
Phillips, P. J., Scruggs, W., O’Toole, A. J., Flynn, P. J., Bowyer, K. W., Scott, C. L., & Sharpe, M.
(2010). FRVT 2006 and ICE 2006 large scale results. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 32, 831–846. doi:10.1109/TPAMI.2009.59
Shen, L., & Bai, L. (2006). A review on Gabor wavelets for face recognition. Pattern Analysis
and Applications, 9, 273–292. doi:10.1007/s10044-006-0033-y
Shepherd, J. W., Deregowski, J. B., & Ellis, H. D. (1974). A cross-cultural study of recognition
memory for faces. International Journal of Psychology, 9, 205–212.
doi:10.1080/00207597408247104
Sirovich, L., & Kirby, M. (1987). Low-dimensional procedure for the characterization of
human faces. Journal of the Optical Society of America, 4A, 519–524.
Swets, D. L., & Weng, J. (1996). Discriminant analysis and eigenspace partition tree for face and
object recognition from views. In Proceedings of second international conference on automatic
face and gesture recognition. Los Alamitos, CA: IEEE Computer Society Press.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience,
3, 71–86. doi:10.1162/jocn.1991.3.1.71
Valentine, T. (1991). A unified account of the effects of distinctiveness, inversion, and race in
face recognition. Quarterly Journal of Experimental Psychology, 43A, 161–204.
Webster, M. A., & MacLin, O. H. (1999). Figural after-effects in the perception of faces.
Psychonomic Bulletin & Review, 6, 647–653.
Wiskott, L., Fellous, Kruger, N., & von der Malsburg, C. (1997). Face recognition by elastic
bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19,
775–779.
Zhao, W., Krishnaswamy, A., Chellappa, R., Swets, D., & Weng, J. (1998). Discriminant analysis of
principal components for face recognition. In H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman
Soulie, & T. S. Huang (Eds.), Face recognition: From theory to applications (pp. 73–85). Berlin:
Springer.
Manuscript received April 2013
Revised manuscript received May 2013
First published online June 2013