
SIViP
DOI 10.1007/s11760-017-1140-5

ORIGINAL PAPER

Semi-supervised annotation of faces in image collection

Vijay Kumar¹ · Anoop Namboodiri¹ · C. V. Jawahar¹

Received: 25 February 2017 / Revised: 26 May 2017 / Accepted: 26 June 2017
© Springer-Verlag London Ltd. 2017

Abstract The objective of this work is to correctly detect and recognize faces in an image collection using a database of known faces. This has applications in photo-tagging, video indexing, surveillance, and recognition in wearable computers. We propose a two-stage approach for both detection and recognition tasks. In the first stage, we generate a seed set from the given image collection using off-the-shelf face detection and recognition algorithms. In the second stage, the obtained seed set is used to improve the performance of these algorithms by adapting them to the domain at hand. We propose an exemplar-based semi-supervised framework for improving the detections. For recognition of images, we use a sparse representation classifier and generate seed images based on a confidence measure. The labels of the seed set are then propagated to other faces using a label propagation framework by imposing appropriate constraints. Unlike traditional approaches, our approach exploits the similarities among the faces in a collection to obtain improved performance. We conduct extensive experiments on two real-world photo-album and video collections. Our approach consistently provides an improvement of ∼4% for detection and 5–9% for recognition on all these datasets.

Keywords Face annotation · Semi-supervised framework · Image collection · Label propagation

✉ Vijay Kumar
[email protected]

¹ International Institute of Information Technology, Hyderabad, India

1 Introduction

Human faces are the most important entities in images and videos; consequently, their detection and recognition are important for many commercial and consumer applications. A lot of progress has been made independently on these tasks. Current-day algorithms [10,12,17,19] already achieve impressive performance on various detection and recognition benchmarks. These algorithms are mostly aimed at the general problem of detecting and recognizing individual target face instances. However, certain consumer applications such as face tagging in albums and video indexing require annotation of a "collection of faces" appearing in multiple images. In such scenarios, it is beneficial to develop algorithms that leverage the presence of multiple collection instances to achieve superior performance.

In this work, we consider the problem of automatic annotation of faces in an image collection given a large dictionary of subjects. By an image collection, we mean a set of images with multiple instances of a limited number of subjects, for instance, a family photo-album containing multiple photographs of a small number of people taken at different events. Our approach detects and recognizes faces in an image collection in two stages, as shown in Fig. 1. In the first stage, a set of seed instances is identified from the image collection using off-the-shelf face detection and recognition algorithms. In the second stage, the seed images obtained from the target domain are used to improve the performance of these algorithms in a semi-supervised framework. Our approach exploits the similarities among instances and is suitable for offline applications.

For face detection, we use an exemplar detector [10,22] due to its superior performance and flexibility to retrain through semi-supervised learning. We select a few highly confident and diverse examples from the initial detections through clustering.



Fig. 1 Overview of our face annotation approach. We detect and recognize faces in an image collection in two stages. In the first stage, confident seed images are generated from the collection using off-the-shelf detection and recognition algorithms. In the second stage, the obtained seed images are used to further improve the performance of both detection and recognition using the semi-supervised approaches

The exemplar detector is then retrained on these seed examples to adapt to the new domain. To recognize the detected faces, we follow a novel two-stage approach. We initially recognize the faces with an off-the-shelf recognition algorithm using the training dictionary and retain the labels of highly confident examples based on a confidence measure. We then cast the recognition problem in a transductive semi-supervised framework, treating the confident samples from the collection as the labeled set and the rest as the unlabeled set. We impose two constraints during propagation, based on appearance and temporal similarities, to exploit the relations among faces in different feature spaces.

Our approach achieves superior performance for the following reasons. The appearance of subjects that appear in multiple images of a collection is usually consistent. Our approach exploits this correlation among faces to propagate labels from seed images to hard examples that are otherwise difficult to recognize. Also, unlike single-stage recognition approaches [15,23] that use dictionaries with a large number of subjects, our dictionaries in the second stage contain only the subjects present in the collection, thereby reducing the confusion during multi-class classification.

This approach is related to our previous work [9], which focuses on semi-supervised recognition of faces in videos. This paper proposes a generic framework for annotating image collections with an end-to-end annotation pipeline including both detection and recognition stages. We conduct extensive experiments using deep convolutional features on several real-world collections of family albums and videos and demonstrate the improvements over baseline algorithms.

2 Related work

Face detection approaches can be roughly grouped into four categories, namely cascade AdaBoost, deformable parts, exemplar, and deep network-based models. The AdaBoost framework consists of a cascade ensemble of weak classifiers that detect faces in a coarse-to-fine strategy. Several features and convergence criteria have been explored in this framework, starting from Viola–Jones (VJ) [25] to the latest works of [14]. Deformable part model (DPM)-based techniques [14,33] model the appearance of object parts and their relations. A vanilla DPM trained on a large database achieves impressive results [14]. Exemplar detectors [10,22] do not learn a global model that generalizes over the training examples. Instead, each exemplar generates voting maps in a retrieval framework, which are then aggregated to detect the faces. Recently, end-to-end trained convolutional neural networks (CNNs) are achieving superior performance for face detection [12,13].

Semi-supervised face detection schemes are typically employed when the number of labeled examples is limited, or to adapt a detector to the target domain. Li et al. [11] train an SVM classifier with top positives and negatives obtained with VJ. Similarly, in [7], less confident detections in each target image are re-scored through a regression process trained on top positives and negatives. This approach is not practical for large image collections, as each target image requires separate training. Sebe et al. [20] train a detector using a few labeled examples along with a large number of unlabeled samples. The approach may be less relevant today due to the availability of large-scale face datasets.


Face recognition is another popular problem, for which approaches that seek invariance to pose and illumination variations [3], hand-designed feature descriptors for improved discrimination [28], and low-dimensional feature representations [4,24] have been proposed. One of the highly cited approaches in recent years is the sparse representation classifier (SRC) [26], which represents each target image as a sparse linear combination of the training instances. For the related topic of face verification, metric learning approaches [6] learn embeddings that bring similar faces closer and push dissimilar faces farther apart. More recently, CNN-based approaches [17,19] that learn a hierarchy of low-level to mid-level features through end-to-end training are yielding excellent results for various face recognition and verification tasks.

Semi-supervised face recognition approaches have been proposed for settings with few labeled training samples. In [18,31], mean face templates for each subject are computed using either PCA or LDA. The unlabeled samples that are nearest to each of these class templates are added to the labeled set, and the procedure is repeated. In [27], labeled examples are used as a constraint to obtain label-constrained low-rank representations for face recognition.

Face recognition in videos is another related topic, with works falling into three categories: key frame, temporal model, and image-set-based approaches. A complete review of these approaches is given in [21]. A few works exploited additional cues in sitcom videos, such as clothing and audio [23] and relationships between subjects [2].

3 Semi-supervised face detection

Given an image collection, our objective is to detect as many faces as possible by adapting an off-the-shelf detector to the instances in the collection. The overview of our detection approach is shown in Fig. 2. We obtain initial detections from the collection using a generic detector trained on a large database of images. We then select a few highly confident examples from the initial detections that are diverse and dissimilar through clustering. The detector is then retrained on these seed examples to adapt it to the new domain.

We choose the exemplar-based detector [10,22] due to its high performance, simplicity, and the flexibility it offers for adaptation. The approach aligns well with our objective to improve performance using instances from the target domain. The exemplar detector uses a retrieval framework for detection. A large database of training exemplars that covers a wide range of facial variations is initially constructed, and local features (dense SIFT) are extracted and quantized using a vocabulary. During testing, each exemplar generates multiple voting maps, which are aggregated to locate the faces in the target image.
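To make the retrieval-style scoring concrete, the sketch below is a simplified, hypothetical illustration and not the implementation of [10,22]: it quantizes the dense local descriptors of a candidate window into visual words and scores the window by its TF-IDF-weighted similarity to each exemplar, summing the per-exemplar votes; the spatial voting-map accumulation of the actual detector is omitted.

import numpy as np

def quantize(descriptors, vocabulary):
    # Assign each local descriptor to its nearest visual word (hard assignment).
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def tfidf_histogram(word_ids, idf):
    # TF-IDF weighted, L2-normalized bag-of-visual-words histogram.
    h = np.bincount(word_ids, minlength=len(idf)).astype(float) * idf
    return h / (np.linalg.norm(h) + 1e-8)

def window_score(window_descriptors, exemplar_histograms, idf, vocabulary):
    # Score one candidate window against all exemplars and aggregate the votes.
    # exemplar_histograms: (E, V) TF-IDF histograms of the exemplar faces.
    words = quantize(window_descriptors, vocabulary)
    q = tfidf_histogram(words, idf)
    per_exemplar = exemplar_histograms @ q      # cosine-style similarity per exemplar
    return per_exemplar.sum()                   # aggregated detection score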

To achieve high recall, we initially apply the exemplar detector with a very low threshold.


Fig. 2 Overview of our detection approach. We apply an off-the-shelf exemplar detector to get the initial detections. Top-scoring positive and negative samples are selected and clustered to obtain a set of seed images. The exemplar detector is retrained by augmenting the seed images to the exemplar database and re-computing the detector parameters

Let $d_i$ denote the set of initial detections and $s_i$ their corresponding scores. The scores $s_i$ are used to obtain a set of confident positive and negative instances. To achieve this, we define two thresholds $\omega_1$ and $\omega_2$ and consider detections with scores $s_i$ greater than $\omega_1$ as top positives $P$ and those less than $\omega_2$ as top negatives $N$:

$$P = \{\, d_i \mid s_i > \omega_1,\ i = 1, 2, \ldots, n \,\}, \qquad N = \{\, d_i \mid s_i < \omega_2,\ i = 1, 2, \ldots, n \,\} \tag{1}$$

Once we obtain the sets $P$ and $N$, we cluster the images using k-means and select $\tau$ detections as seed images. This ensures that the obtained seed images are diverse, which is essential for video collections containing near-identical instances. If diversity is not maintained, the most frequently occurring exemplar instances dominate the voting process, severely degrading performance. For each seed image, features are extracted, quantized, and augmented to the exemplar database. To retrain the exemplar detector, it is enough to update the inverse document frequencies of the visual words, unlike appearance-based detectors [12,14,25], which require retraining the models from scratch.
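A minimal sketch of this seed selection and retraining step is given below. It assumes per-detection feature vectors and scores are available, uses the percentile thresholds described in Sect. 5.3 for $\omega_1$ and $\omega_2$, and models the retraining step only as an IDF update; the helper names are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def select_detection_seeds(features, scores, tau, upper_pct=80, lower_pct=20):
    # Confident positives/negatives by score percentiles (Eq. 1), then a
    # diverse subset of tau positives picked via k-means cluster centers.
    w1, w2 = np.percentile(scores, upper_pct), np.percentile(scores, lower_pct)
    positives = np.where(scores > w1)[0]
    negatives = np.where(scores < w2)[0]

    k = min(tau, len(positives))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features[positives])
    seeds = []
    for center in km.cluster_centers_:
        # detection closest to each cluster center -> diverse seed set
        nearest = np.linalg.norm(features[positives] - center, axis=1).argmin()
        seeds.append(positives[nearest])
    return np.unique(seeds), negatives

def updated_idf(document_frequencies, num_exemplars):
    # Retraining the exemplar detector amounts to recomputing the inverse
    # document frequencies after the seed images are added to the database.
    return np.log(num_exemplars / (1.0 + document_frequencies))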

4 Semi-supervised face recognition in image collection

We next recognize the detected faces given a labeled dictionary of subjects using a two-stage pipeline, as shown in Fig. 3. In the first stage, referred to as seed-set selection, the detections are recognized using an off-the-shelf face recognition algorithm. We then identify the key seed images that are confidently recognized using a robust confidence measure. In the second stage, referred to as propagation, we propagate the labels from the seed images to the remaining unlabeled collection images, incorporating various constraints.

Our recognition approach follows the intuition that certain faces in the collection may have appearance and pose similar to the faces in the given dictionary. If such faces from the collection are recognized correctly, it is possible to propagate their labels to the remaining unlabeled images, as they belong to the same target domain.



Fig. 3 Overview of our proposed semi-supervised face recognition approach. In the first stage, we label all the images in a collection using SRC and select only highly confident seed images based on SCI. The labels of the seed images are then propagated to the remaining images in the collection by imposing constraints in time and feature space

Fig. 4 A possible scenario that depicts the effectiveness of our approach on a video collection with a head pose change. If a few images in such shots are recognized correctly in the first stage, their labels can be effectively propagated to the remaining images in the second stage

Our approach is applicable to scenarios with highly correlated images, for instance, a video shot that zooms in or out of a face in a fixed pose, or a shot in which the pose of the subject changes gradually, as shown in Fig. 4. If one or a few images in such shots are recognized confidently in the first stage, the labels can be propagated effectively to the remaining faces.

4.1 Seed-set selection

We use the Sparse Representation Classifier (SRC) [26] as the off-the-shelf face recognition algorithm for initial labeling, as it is robust to noise and occlusion and provides a strong confidence measure based on the reconstruction weights. Let $D = [D_1, D_2, \ldots, D_M]$ be the dictionary of labeled examples and $X = [X_1, X_2, \ldots, X_N]$ the collection of face instances. SRC represents the target image as a sparse linear combination of dictionary faces:

$$\alpha_i = \arg\min_{\alpha_i} \|X_i - D\,\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1, \tag{2}$$

where $\alpha_i$ is the representation of the sample $X_i$ and $\lambda$ is a Lagrangian constant that controls the trade-off between reconstruction error and sparsity. The label of $X_i$ is obtained using the minimum reconstruction error criterion as

$$\mathrm{label}(X_i) = \arg\min_{j} \|X_i - D_j\,\alpha_{ij}\|_2, \tag{3}$$

where $D_j$ are the training samples belonging to class $j$ and $\alpha_{ij}$ are the corresponding weights. Once the images in the collection are labeled, we select a few confident images using the Sparsity Concentration Index (SCI) [26], defined as

$$\mathrm{SCI}(i) = \frac{c \cdot \max_j \|\alpha_{ij}\|_1 / \|\alpha_i\|_1 - 1}{c - 1}, \tag{4}$$

where $c$ denotes the number of classes. This score indicates how well a target image is represented using the dictionary elements of a particular class. It is noted in [26] that $\ell_1$-based minimization results in nonzero coefficients that are concentrated on training examples from the correct class, even in the presence of noise and occlusions. Thus, the SCI score defined on the $\ell_1$ coefficients is robust and serves as a strong indicator of recognition confidence. A high score indicates that the target image is represented mostly by a single class, and a very low score close to zero indicates contributions from all the classes. We use this SCI score as a confidence measure to decide whether to retain the label of a face.
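The following is a small sketch of this first-stage labeling (Eqs. 2–4), assuming the faces and the dictionary are given as feature vectors; scikit-learn's Lasso is used here only as a generic $\ell_1$ solver, which the paper does not prescribe, and its regularization scaling differs slightly from Eq. 2.

import numpy as np
from sklearn.linear_model import Lasso

def src_label_and_sci(x, D, labels, lam=0.05):
    # Sparse-representation classification of one face x (Eqs. 2-4).
    # D: (d, M) dictionary of labeled faces as columns, labels: (M,) class ids.
    solver = Lasso(alpha=lam, max_iter=5000)
    solver.fit(D, x)                       # l1-regularized least squares
    alpha = solver.coef_

    classes = np.unique(labels)
    residuals, l1_mass = [], []
    for c in classes:
        alpha_c = np.where(labels == c, alpha, 0.0)
        residuals.append(np.linalg.norm(x - D @ alpha_c))   # Eq. 3
        l1_mass.append(np.abs(alpha_c).sum())
    predicted = classes[int(np.argmin(residuals))]

    c = len(classes)                                         # Eq. 4
    sci = (c * max(l1_mass) / (np.abs(alpha).sum() + 1e-8) - 1) / (c - 1)
    return predicted, sci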

4.2 Propagation of seed images

Once confident seed images are selected, their labels are propagated to the remaining unlabeled images in the collection through a label propagation framework [32]. Since the collection contains correlated images, an image can be recognized using both its appearance and its similarity to other images. This can be achieved in a graph-based framework in which the image similarities are used to predict the labels. Recognition using seed images in the second stage reduces the domain mismatch between the dictionary and the collection, and also reduces the confusion rate when the dictionary has a large number of subjects compared to the collection.

We reconsider $X = [X_1, X_2, \ldots, X_N] = [X_l\ X_u]$ without loss of generality. Here, $X_l$ denotes the labeled seed images from the first stage and $X_u$ denotes the remaining unlabeled images in the collection. Let $F \in \mathbb{R}^{N \times c}$ denote a nonnegative labeling matrix where the $i$th row of $F$ indicates the predicted class scores of $X_i$. Let $Y \in \mathbb{R}^{N \times c}$ be the initial labeling matrix. For a seed image $i$ belonging to class $j$, we define $Y_{ij} = 1$, and 0 otherwise. For all the remaining unlabeled images, we assign a zero vector, i.e., $Y_{ij} = 0,\ \forall j$.

Given $X$, we construct an undirected graph $\langle V, E \rangle$ using both labeled and unlabeled images. Each node in the graph represents an image, and the edges $E$ represent the similarities between images. The larger the edge weight, the greater the similarity between images. For image collections, we consider appearance similarity; for video collections, we consider both appearance and temporal similarities.


4.2.1 Temporal similarity

Let $t_i$ and $t_j$ be the absolute numbers that denote the $i$th and $j$th frames, respectively. Also, let $\upsilon_i = (\upsilon_i^{x_1}, \upsilon_i^{y_1}, \upsilon_i^{x_2}, \upsilon_i^{y_2})$ and $\upsilon_j = (\upsilon_j^{x_1}, \upsilon_j^{y_1}, \upsilon_j^{x_2}, \upsilon_j^{y_2})$ denote the $x$ and $y$ coordinates of the top-left and bottom-right corners of the rectangles representing the $i$th and $j$th face in the collection, respectively. We define the temporal similarity [9] as

$$W^t_{ij} = \exp\!\left(-\frac{(t_i - t_j)\,\chi_{ij}}{2\sigma_t^2}\right), \tag{5}$$

where $\sigma_t$ controls the spread of the Gaussian function and $\chi_{ij}$ is defined as the absolute sum of the differences of the $i$th and $j$th image coordinates:

$$\chi_{ij} = \sum |\upsilon_i - \upsilon_j|. \tag{6}$$

A large value is assigned to $W^t_{ij}$ for a pair of faces if they appear in subsequent frames and at similar locations. With the above similarity, we define the temporal constraint as

$$\sum_{i,j} W^t_{ij} \|F_i - F_j\|^2. \tag{7}$$

During propagation, this constraint ensures that images that are close in temporal space have similar labels.
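A small sketch of how $W^t$ could be computed from frame indices and face boxes (Eqs. 5–6) is shown below; the absolute frame gap is used so that the weight decays for temporally distant faces, an assumption consistent with the text but not spelled out in Eq. 5.

import numpy as np

def temporal_weights(frames, boxes, sigma_t=1.0):
    # Temporal similarity matrix W^t (Eqs. 5-6).
    # frames: (N,) absolute frame indices; boxes: (N, 4) [x1, y1, x2, y2].
    frame_gap = np.abs(frames[:, None] - frames[None, :])
    chi = np.abs(boxes[:, None, :] - boxes[None, :, :]).sum(axis=-1)   # Eq. 6
    return np.exp(-(frame_gap * chi) / (2.0 * sigma_t ** 2))           # Eq. 5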

4.2.2 Appearance similarity

Images that are similar in appearance space (SIFT or CNN) should have large edge weights between them. While weights based on k-nearest neighbors and a Gaussian function are commonly used in the literature, they are not robust to facial illumination and expression variations. For this reason, we employ a new scheme to compute the edge weights. We represent each image as a linear combination of its nearest neighbors to preserve locality and impose nonnegativity constraints on the weights [9]. Our approach encourages the creation of edges only with similar samples, as required for good performance of graph-based frameworks.

$$w_i = \arg\min_{w_{ik}} \Big\|X_i - \sum_{k : X_k \in \mathcal{N}(X_i)} X_k w_{ik}\Big\|^2 + \beta \|w_i\|^2 \quad \text{s.t.}\ \forall k,\ w_{ik} \geq 0, \tag{8}$$

where $\mathcal{N}(X_i)$ denotes the $k$ neighboring samples of $X_i$, and $\beta$ is a Lagrangian constant that controls the trade-off between the two terms. The appearance weight matrix $W^a \in \mathbb{R}^{N \times N}$ is then constructed as

$$W^a_{ij} = \begin{cases} w_i(k), & \text{if } X_j \in \mathcal{N}(X_i) \\ 0, & \text{otherwise,} \end{cases} \tag{9}$$

where $w_i(k)$ denotes the $k$th element of the vector $w_i$, corresponding to the $k$th neighbor. Weights obtained by this method may not be symmetric, i.e., $W^a_{ij} \neq W^a_{ji}$. To make the weights symmetric, we perform the following operation:

$$W^a_{ij} = W^a_{ji} = \frac{W^a_{ij} + W^a_{ji}}{2}. \tag{10}$$

The appearance constraint is finally incorporated into the label propagation framework as

$$\sum_{i,j} W^a_{ij} \|F_i - F_j\|^2. \tag{11}$$

This constraint ensures that images that are highly similar in appearance space are assigned to the same class.
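A possible implementation of the appearance weights (Eqs. 8–10) is sketched below, assuming the faces are given as feature vectors; the ridge term $\beta\|w_i\|^2$ is folded into an augmented nonnegative least-squares problem, which is one of several ways to solve Eq. 8.

import numpy as np
from scipy.optimize import nnls

def appearance_weights(X, k=120, beta=0.1):
    # Nonnegative locality-preserving reconstruction weights (Eqs. 8-10).
    # X: (N, d) face features. Returns the symmetrized weight matrix W^a.
    N = X.shape[0]
    W = np.zeros((N, N))
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    for i in range(N):
        nbrs = np.argsort(dist[i])[1:k + 1]          # k nearest neighbours (self excluded)
        A = X[nbrs].T                                # (d, k) neighbour basis
        # The ridge term beta * ||w||^2 is folded into an augmented NNLS system.
        A_aug = np.vstack([A, np.sqrt(beta) * np.eye(len(nbrs))])
        b_aug = np.concatenate([X[i], np.zeros(len(nbrs))])
        w, _ = nnls(A_aug, b_aug)                    # Eq. 8 with w >= 0
        W[i, nbrs] = w
    return 0.5 * (W + W.T)                           # Eq. 10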

4.2.3 Propagation

We finally incorporate the appearance (Eq. 11) and temporal (Eq. 7) constraints into the label propagation formulation [32] to propagate the labels from the labeled seed images to the unlabeled images as follows:

$$Q(F) = \arg\min_{F}\ \frac{\gamma_1}{2} \sum_{i,j}^{N} W^t_{ij} \|F_i - F_j\|^2 + \frac{\gamma_2}{2} \sum_{i,j}^{N} W^a_{ij} \|F_i - F_j\|^2 + \gamma_3 \sum_{i}^{N} \|F_i - Y_i\|^2. \tag{12}$$

While the first and second terms ensure that images that are similar according to the temporal and appearance weights receive similar labels, the third term retains the labels of the confident seed images. The parameters $\gamma_i$ control the trade-off between these three terms.

Let $D^t_{ii} = \sum_j W^t_{ij}$ and $D^a_{ii} = \sum_j W^a_{ij}$ be the diagonal matrices whose entries are the row sums of $W^t$ and $W^a$, respectively. $L^t = D^t - W^t$ and $L^a = D^a - W^a$ are the Laplacian matrices, symmetric and positive semi-definite, defined on the temporal and appearance similarities, respectively.

Following [1], we can rewrite the first term in Eq. 12 as

$$\sum_{i,j}^{N} W^t_{ij} \|F_i - F_j\|^2 = \sum_{i,j}^{N} (F_i^2 + F_j^2 - 2 F_i F_j)\, W^t_{ij} = \sum_{i}^{N} F_i^2 D^t_{ii} + \sum_{j}^{N} F_j^2 D^t_{jj} - 2 \sum_{i,j}^{N} F_i F_j W^t_{ij} = 2\, F^{\top} L^t F \tag{13}$$


where $F^{\top}$ is the matrix transpose of $F$. Similarly, the second term is equivalent to $2\, F^{\top} L^a F$. Thus, Eq. 12 can be rewritten as

$$Q(F) = \arg\min_{F}\ \gamma_1\, F^{\top} L^t F + \gamma_2\, F^{\top} L^a F + \gamma_3 \|F - Y\|^2 \tag{14}$$

Differentiating Q(F) with respect to F and equating to 0,

$$2\gamma_1 L^t F + 2\gamma_2 L^a F + 2\gamma_3 (F - Y) = 0 \tag{15}$$

$$F^* = \gamma_3\, (\gamma_1 L^t + \gamma_2 L^a + \gamma_3 I)^{-1} Y \tag{16}$$

The final identity $y_i$ of the image $X_i$ can be obtained from $F^*$ as $y_i = \arg\max_{j} F^*_{ij}$, where $j \in \{1, 2, \ldots, c\}$.
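A compact sketch of this closed-form propagation step (Eq. 16), assuming the two similarity matrices and the initial label matrix $Y$ have already been built, is given below.

import numpy as np

def propagate_labels(Wt, Wa, Y, g1=1.0, g2=0.3, g3=0.7):
    # Closed-form label propagation (Eq. 16).
    # Wt, Wa: (N, N) temporal/appearance similarity matrices; Y: (N, c) initial
    # label matrix with one-hot rows for seed images and zero rows otherwise.
    Lt = np.diag(Wt.sum(axis=1)) - Wt          # temporal graph Laplacian
    La = np.diag(Wa.sum(axis=1)) - Wa          # appearance graph Laplacian
    N = Y.shape[0]
    F = g3 * np.linalg.solve(g1 * Lt + g2 * La + g3 * np.eye(N), Y)
    return F, F.argmax(axis=1)                 # class scores and predicted identities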

4.3 Rejection of unknown faces

The candidate boxes obtained from the detection stage may contain false positives or unknown faces that are not present in the dictionary. This is particularly common in video collections. The algorithm should be able to accept or reject the labels of such candidates after propagation. For this purpose, we define a confidence measure called the label dominance score (LDS), which measures the contribution of each class during the reconstruction of the sample. It is simply defined as the ratio of the two largest class scores of the sample:

$$\mathrm{LDS}(i) = \frac{F_{ij}}{\max_{k,\, k \neq j} F_{ik}}, \quad \text{where } j = \arg\max_{j} F_{ij} \tag{17}$$

Intuitively, when an image has large edge weights to images belonging to a particular class, the scoring vector $F_i$ will have a high score for that particular class. When there is such a clear dominance of one class, LDS will be high, and we consider the labeling confident and retain the final label.
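A minimal sketch of this rejection step is given below; the threshold on LDS is illustrative and not a value from the paper.

import numpy as np

def label_dominance_score(F):
    # LDS per sample (Eq. 17): ratio of the largest to the second-largest class score.
    top2 = -np.sort(-F, axis=1)[:, :2]
    return top2[:, 0] / (top2[:, 1] + 1e-8)

def accept_or_reject(F, threshold=1.5):
    # Keep a prediction only when one class clearly dominates; the threshold
    # value is illustrative, not taken from the paper.
    confident = label_dominance_score(F) >= threshold
    labels = F.argmax(axis=1)
    return np.where(confident, labels, -1)     # -1 marks rejected / unknown faces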

5 Experiments and results

5.1 Datasets and setup

We evaluate our approach on several album and video collections whose instances exhibit the pose, illumination, and viewpoint variations seen in the real world. All these collections contain multiple images in which a small number of subjects appear across different images. Album collections contain multiple photographs of family members taken at various indoor and outdoor events, and video collections contain several frames of actors appearing in one or more videos. We show a few images from these collections in Figs. 7 and 8.

G-album [5] consists of 589 photographs containing 931 faces belonging to 32 subjects. For each subject, we consider a maximum of 10 images to create a labeled dictionary and the rest as the unlabeled collection.

People in photo-albums (PIPA) [30] is a large collection of 1438 user-uploaded albums collected from Flickr. The dataset consists of 37,107 photographs with 63,188 head instances belonging to 2356 identities. Similar to the G-album setting, we consider a maximum of 10 images to create a labeled dictionary and the rest as the unlabeled collection.

Movie trailer face dataset [15] consists of 4485 face tracks from 101 movie trailers released in the year 2010. These trailers were collected from YouTube and contain the celebrities present in the PubFig dataset [8] along with 10 additional actors. The labeled dictionary consists of 34,522 images (PubFig + 10 additional actors), with each actor having a maximum of 200 images. Since the dataset provides only raw descriptors (based on a combination of LBP, HOG, and Gabor features), we show only the recognition results.

Hannah movie dataset [16] consists of face annotations for the entire movie Hannah and Her Sisters. The dataset has 153,833 frames with 202,178 face bounding boxes and 254 different labels: 41 named characters, 186 unknown characters, and 15 miscellaneous crowd regions. We create the labeled dictionary from IMDB photographs of the actor's profile for each named character and manually annotate the face bounding boxes for the dictionary images. Of the 41 named characters, only 26 prominent actors had profiles on IMDB. The labeled dictionary consists of 2385 images belonging to the 26 prominent actors who appeared in the movie.

5.2 Features

For detection, we follow the same feature extraction procedure as [10], based on dense SIFT. For recognition, we extract deep CNN features using the VGG model [17], trained on 2M faces of 2622 identities with the VGG-16 architecture. The features are extracted from the seventh fully connected layer, whose dimension is 4096. We apply PCA to reduce the feature dimension to 300.
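The dimensionality reduction step can be reproduced with a few lines; the feature file name below is hypothetical, standing in for whichever fc7 activations have been extracted.

import numpy as np
from sklearn.decomposition import PCA

# fc7_features: (num_faces, 4096) activations from the network's seventh
# fully connected layer; the file name is hypothetical.
fc7_features = np.load("fc7_features.npy")

pca = PCA(n_components=300)
face_descriptors = pca.fit_transform(fc7_features)   # (num_faces, 300) descriptors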

5.3 Results

We use the publicly available implementation of [10] for exemplar face detection. We set the upper and lower thresholds $\omega_1$ and $\omega_2$ to the 80th and 20th percentiles of the detection scores to select the top 20% positives and negatives, respectively. For video collections, we further cluster the detections to select 300 diverse images. The obtained detections are resized and passed through the same feature indexing pipeline as the original algorithm. The term frequencies and inverse document frequencies are updated based on the seed images. The detection results in Fig. 5 show the performance improvement obtained using our approach.

For generating the seed images during recognition, we set the error tolerance $\lambda = 0.05$ and retain the labels of the top 25% of images based on SCI.


[Fig. 5 plots, summarized: precision–recall curves with average precision (AP). a G-album: Exemplar AP 0.877, Exemplar semi-supervised AP 0.910. b PIPA: Exemplar AP 0.368, Exemplar semi-supervised AP 0.404. c Hannah: Exemplar AP 0.467, Exemplar semi-supervised AP 0.506.]

Fig. 5 Precision–recall curves of the baseline and adapted exemplar detectors on various image collections. We notice a considerable improvement in performance with our proposed approach, which involves a simple augmentation of seed images into the exemplar database

Table 1 Recognition performance (%) of various methods in terms of accuracy and average precision (AP) on image collections

Method          G-album             PIPA
                Accuracy   AP       Accuracy   AP
1-NN            77.26      91.69    56.02      78.49
SVM             80.82      97.57    38.15      93.21
CRC [29]        75.33      94.58    51.27      76.79
SRC [26]        79.56      97.41    58.84      93.70
MSSRC [15]      79.56      97.41    58.84      93.70
Our approach    84.52      97.72    63.80      94.59

Table 2 Recognition performance (%) of various methods in terms of accuracy and average precision (AP) on video collections

Method          Hannah              Movie Trailer
                Accuracy   AP       Accuracy   AP
1-NN            32.71      15.87    23.60      9.53
SVM             44.82      83.42    54.68      50.06
CRC [29]        37.49      37.92    41.93      36.33
SRC [26]        39.76      49.97    47.78      54.33
MSSRC [15]      43.74      38.62    50.52      58.69
Our approach    54.19      84.79    55.98      59.34

We set the remaining parameters using cross-validation on a held-out set: the number of neighbors $k$ for computing the appearance weights is 120, $\gamma_2 = 0.3$, and $\gamma_3 = 0.7$. For the collections with temporal information (Hannah and Movie Trailer), we set $\gamma_1 = 1$, and for collections without temporal information (G-album and PIPA), we set $\gamma_1 = 0$.

We show the recognition performance of various approaches on image and video collections in Tables 1 and 2, respectively. Our approach, which exploits the correlation among the instances, outperforms the other approaches, which recognize instances independently. Note that for image collections without tracks, MSSRC is the same as SRC. We also report the average precision, which measures the ability to reject unknown instances. We use SCI as the confidence measure for the SRC, CRC, MSSRC, and LLC algorithms, the L2 Euclidean distance for k-NN, and probability scores for SVM. We show the precision–recall (PR) curves of various methods in Fig. 6. It is clear from these results that our confidence measure based on LDS is robust in rejecting unknown instances. The recognition rates of various approaches for different numbers of labeled training samples are shown in Table 3 for the G-album dataset. The performance improvement is larger when there are limited labeled examples and becomes closer to the other approaches as the number of labeled examples increases. We show a few qualitative results of our detection and recognition approach in Figs. 7 and 8, respectively.

Ablation study. We show the effectiveness of the different stages and constraints during recognition in Table 4. Clearly, we obtain an improvement with the two-stage approach when the appearance similarities in the collection are exploited. Additionally, using time information for videos brings a further performance improvement. Finally, we show the performance of our approach for a varying percentage of retained seed images in Fig. 9.

Table 3 Recognition rates (%) of various methods on G-album for different numbers of labeled examples

Method          1 Train   2 Train   5 Train   10 Train   20 Train
KNN             56.12     65.85     74.15     77.26      80.43
SVM             57.45     64.40     78.09     80.82      85.42
SRC [26]        56.45     68.39     77.86     79.56      86.82
CRC [29]        56.27     68.28     77.62     75.33      85.87
Our approach    60.25     71.27     82.52     84.52      87.18

[Fig. 6 plots, summarized: precision–recall curves of the recognition algorithms. a G-album: NN AP 0.918, SVM AP 0.975, SRC AP 0.974, CRC AP 0.946, Ours AP 0.977. b PIPA: NN AP 0.7849, SVM AP 0.9321, SRC AP 0.9370, CRC AP 0.7679, Ours AP 0.9459. c Hannah and d MovieTrailer panels show the corresponding curves.]

Fig. 6 Precision–recall curves of recognition algorithms on various image collections. Our confidence score based on LDS is more robust and performs better than all the approaches in rejecting unknown samples


Fig. 7 Performance improvement after adapting the exemplar detectors to the image collections from the a G-album and b Hannah datasets. The original detections are shown in blue, and the new detections are shown in red. Notice how faces that were missed by the original detector are detected after adaptation using images from the seed set


Fig. 8 Qualitative results on PIPA: a comparison of our approach with the single-stage SRC approach. Red boxes show success cases of our approach that are failure cases for SRC. Blue (and yellow) boxes indicate instances that are (not) correctly predicted by both approaches. b, c Success and failure cases of our approach, shown with the test instance and its top 4 instances with nonzero appearance weights $w^a$

Table 4 Ablation study: recognition performance of our approach in the first stage (SRC), in the second stage based on appearance and temporal similarities alone, and overall performance with both similarities

Method          G-album   PIPA    Hannah   Trailer
Stage 1 (SRC)   79.56     58.84   39.76    47.78
Appearance      84.52     63.80   47.83    51.13
Temporal        -         -       44.49    50.24
Overall         84.52     63.80   54.19    55.98

The performance improvement is significant when we retain 20–30% of the seed images and becomes closer to the performance of the first stage as the proportion of retained seed images increases.

Computational complexity. Our approach brings substantial performance improvements with minimal additional cost due to its two-stage process. With feature extraction common across stages, the additional complexity comes from the voting-map generation for detection and the computation of $W^t$, $W^a$, and $F$ for recognition in the second stage. The detection and recognition steps take about 1.5 and 2 times longer than the single-stage approach, respectively.

6 Conclusion

In this paper, we present an approach for automatic annotation of faces in an image collection. We demonstrate a two-stage approach for both the detection and recognition steps that exploits the similarities among collection instances to improve performance.


Fig. 9 Recognition performance for a varying percentage of seed images selected in the first stage on the Hannah dataset. The improvement is significant when 20–30% of seed images are selected and becomes less effective when a large proportion of seed images is retained

We first generate a set of seed images using off-the-shelf detection and recognition algorithms, which are then used to improve performance by adapting these algorithms to the target image collection. We propose an exemplar-based semi-supervised approach for improving the detections and a label propagation-based framework for improving the recognition. Experiments on album and video collections show that our method obtains an improvement of ∼4% for detection and 5–9% for recognition tasks.


Acknowledgements This work is partly supported by the MCIT, New Delhi. Vijay Kumar is supported by TCS PhD fellowship 2012.

References

1. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15, 1373–1396 (2003)

2. Bäuml, M., Tapaswi, M., Stiefelhagen, R.: Semi-supervised learning with constraints for person identification in multimedia data. In: IEEE Conference on Computer Vision and Pattern Recognition (2013)

3. De Marsico, M., Nappi, M., Riccio, D.: Face authentication with undercontrolled pose and illumination. Signal Image Video Process. 5, 401–413 (2011)

4. Ding, C., Bao, T., Karmoshi, S., Zhu, M.: Single sample per person face recognition with KPCANet and a weighted voting scheme. Signal Image Video Process. (2017). doi:10.1007/s11760-017-1077-8

5. Gallagher, A., Chen, T.: Clothing cosegmentation for recognizing people. In: IEEE Conference on Computer Vision and Pattern Recognition (2008)

6. Guillaumin, M., Verbeek, J., Schmid, C.: Is that you? Metric learning approaches for face identification. In: International Conference on Computer Vision (2009)

7. Jain, V., Learned-Miller, E.: Online domain adaptation of a pre-trained cascade of classifiers. In: IEEE Conference on Computer Vision and Pattern Recognition (2011)

8. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face verification. In: IEEE International Conference on Computer Vision (2009)

9. Kumar, V., Namboodiri, A., Jawahar, C.: Face recognition in videos by label propagation. In: International Conference on Pattern Recognition (2014)

10. Kumar, V., Namboodiri, A., Jawahar, C.: Visual phrases for exemplar face detection. In: IEEE International Conference on Computer Vision (2015)

11. Li, H., Hua, G., Lin, Z., Brandt, J., Yang, J.: Probabilistic elastic part model for unsupervised face detector adaptation. In: IEEE International Conference on Computer Vision (2013)

12. Li, H., Lin, Z., Shen, X., Brandt, J., Hua, G.: A convolutional neural network cascade for face detection. In: IEEE Conference on Computer Vision and Pattern Recognition (2015)

13. Li, Y., Sun, B., Wu, T., Wang, Y.: Face detection with end-to-end integration of a ConvNet and a 3D model. In: European Conference on Computer Vision (2016)

14. Mathias, M., Benenson, R., Pedersoli, M., Van Gool, L.: Face detection without bells and whistles. In: European Conference on Computer Vision (2014)

15. Ortiz, E., Wright, A., Shah, M.: Face recognition in movie trailers via mean sequence sparse representation-based classification. In: IEEE Conference on Computer Vision and Pattern Recognition (2013)

16. Ozerov, A., Vigouroux, J.R., Chevallier, L., Pérez, P.: On evaluating face tracks in movies. In: International Conference on Image Processing (2013)

17. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: British Machine Vision Conference (2015)

18. Roli, F., Marcialis, G.L.: Semi-supervised PCA-based face recognition using self-training. In: Joint International Association of Pattern Recognition Workshop (2006)

19. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition (2015)

20. Sebe, N., Cohen, I., Huang, T.S., Gevers, T.: Semi-supervised face detection. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (2005)

21. Shan, C.: Face recognition and retrieval in video. In: Video Search and Mining, pp. 235–260. Springer (2010). doi:10.1007/978-3-642-12900-1_9

22. Shen, X., Lin, Z., Brandt, J., Wu, Y.: Detecting and aligning faces by image retrieval. In: IEEE Conference on Computer Vision and Pattern Recognition (2013)

23. Tapaswi, M., Bäuml, M., Stiefelhagen, R.: "Knock! Knock! Who is it?" Probabilistic person identification in TV-series. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)

24. Turk, M., Pentland, A.: Eigenfaces for recognition. J. Cogn. Neurosci. 3, 71–86 (1991)

25. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: IEEE Conference on Computer Vision and Pattern Recognition (2001)

26. Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y.: Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 31, 210–227 (2009)

27. Yang, C., Ye, M., Tang, S., Xiang, T., Liu, Z.: Semi-supervised low-rank representation for image classification. Signal Image Video Process. 11, 73–80 (2017)

28. Yu, W., Gan, L., Yang, S., Ding, Y., Jiang, P., Wang, J., Li, S.: An improved LBP algorithm for texture and face classification. Signal Image Video Process. 8, 155–161 (2014)

29. Zhang, L., Yang, M., Feng, X.: Sparse representation or collaborative representation: Which helps face recognition? In: IEEE International Conference on Computer Vision (2011)

30. Zhang, N., Paluri, M., Taigman, Y., Fergus, R., Bourdev, L.: Beyond frontal faces: improving person recognition using multiple cues. In: IEEE Conference on Computer Vision and Pattern Recognition (2015)

31. Zhao, X., Evans, N.W., Dugelay, J.L.: Semi-supervised face recognition with LDA self-training. In: International Conference on Image Processing (2011)

32. Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label propagation. Technical Report, CMU (2002)

33. Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)
