
PROBABILISTIC PERSON IDENTIFICATION IN TV NEWS PROGRAMS USING IMAGE WEB DATABASE

F. Battisti, M. Carli, M. Leo, A. Neri

COMLAB - Telecommunication Lab, Università degli Studi Roma TRE,

Via Vito Volterra, 62, 00146 Roma, Italy

ABSTRACT

The automatic labeling of faces in TV broadcasting is still a challenging problem. The high variability in viewpoints, facial expressions, general appearance, and lighting conditions, as well as occlusions, rapid shot changes, and camera motions, produces significant variations in image appearance. The application of automatic tools for face recognition is not yet fully established, and human intervention is still needed. In this paper, we deal with automatic face recognition in TV broadcasting programs. The target of the proposed method is to identify the presence of a specific person in a video by means of a set of images downloaded from the Web using a specific search key.

Index Terms— Annotation, TV broadcast, face detection, face recognition

1. INTRODUCTION

Broadcast television news plays an important role in the media: it conveys relevant and up-to-date information about events happening around the world. Therefore, the automatic summarization of the received stream is a fundamental task that facilitates effective browsing, searching, and monitoring of news content. In particular, labeling faces in broadcast TV programs is still a challenging problem. For many automatic applications, such as multimedia content monitoring, face recognition could be a key feature, provided it achieves an acceptable level of performance in terms of the percentage of correct identifications and false detections. Our investigation deals with the problem of automatic face recognition in TV broadcast programs. The target of the proposed method is to identify the presence of a specific person in a video. To this aim, a coarse-to-fine approach is proposed. In the first phase, a guess of the possible candidates is computed by extracting face features from the videos under test. The recognition is then performed by matching these features with those extracted from a set of web images.

Two different classes of face recognition tools can be distinguished: cooperating (forward annotation) and non-cooperating. In the first class, the name of the person appearing in the video is annotated by an operator or by an automatic tool and is broadcast in a separate stream or directly inserted into the stream itself. The latter approach exploits data hiding techniques for inserting into the stream itself, in an imperceptible yet robust way, information about the recognized person. In the state of the art, several techniques have been proposed, based on reversible or high-capacity systems, for inserting the required amount of data [1–3]. Even if promising, these systems are prone to transmission errors and to the high variability due to multimedia transcoding, a weakness typical of most data hiding systems.
However, if enforced, the cooperative approach could lead to a very effective, simple, and cost-effective annotation of TV broadcasts. Methods belonging to the second class do not require any information from the broadcasting company. As in the concealment methodology, person identification and annotation are performed at the receiver side. Face recognition can be performed automatically or by exploiting human involvement. An example of an approach based on human supervision is provided by the Amazon Mechanical Turk portal [4]: a virtual marketplace where, at present, video annotation systems are realized almost exclusively by human workers. In that framework, the crowdsourcing model is adopted for exploiting human intelligence in refining the results of video annotation tools. Nevertheless, the most widely used systems for face recognition are based on a two-step automatic processing of the received video: first, the regions of the frame containing human faces are detected and, then, the recognition and annotation operations are carried out. As is well known, the performance of face recognition techniques can degrade due to lighting conditions, changes in skin color, or the orientation of the subjects in the frames of the video sequence.

This work has been partially financed by the project “Virtualized Analysis of Audio/Video Communication (VIS)” on regional fund n. FILAS-CR-2011-1127.

State-of-the-art approaches [5, 6] tackle the problem mainly by considering face tracks; however, there are many situations in which the faces to be recognized appear among unknown others (e.g., in the news there is typically an anchorman in the foreground and other faces displayed in the background). Many methods have been proposed for Automatic Face Recognition (AFR), such as those based on Bayesian eigenfaces [7] or Fisherfaces [8]. These methods achieve good accuracy on a small number of controlled test sets. In [9], a probabilistic method for identifying characters in TV series or movies is described.

In this paper, a similar model is designed for detecting faces in news programs: the main task is the identification of the different persons appearing in TV programs, i.e., separating the faces that appear repeatedly (anchormen/anchorwomen) from those of the interviewees or of the people to be identified. The proposed approach can be divided into three steps: first, the news program is segmented into separate scenes; then, face detection is performed by using a Viola-Jones based approach [10]; finally, the outputs of the detection process are analyzed for performing face recognition.

The rest of the paper is organized as follows: in Section 2, the video analysis approach is presented, covering face detection in Subsection 2.1 and face recognition in Subsections 2.2 and 2.3. In Section 3, the results of the experimental tests are reported and, finally, conclusions are drawn in Section 4.

2. VIDEO ANALYSIS APPROACH

The target of this work is the development of a process for people identification by means of face recognition and tracking in a TV stream. The person identification is obtained by comparing the faces extracted from the TV stream with already annotated pictures downloaded from the Web. As stated in the Introduction, the objective is challenging since the faces contained in the TV stream differ in scale, pose, lighting, expression, or hair style. Furthermore, transmission errors may lead to poor image quality or motion blurring artifacts, and the photos downloaded from the Web are very different from each other. Face recognition is a set of techniques used to verify the presence of biometric features within an image and to subsequently associate a face with the identity of an individual through the analysis of the features of the face. The global and local image features (morphology of the eyes, nose, etc.), which are critical in the detection and recognition of the face, relate this task to the problem of pattern recognition. The process of detection and recognition of faces is based on the following steps:

• selection of the area of interest as a bounding box containing the face. The multimedia stream is analyzed frame by frame until a frame that contains a face is found. This selection is based on the detection of some additional physical characteristics (e.g., eyes, nose) inside the bounding box;

• image cropping and resizing to the size of 180x200 pixels;

• face recognition.

Figure 1 shows the block diagram of the proposed method. In order to perform the recognition, a database has been created by automatically downloading a set of faces from the Web using specific search keys.
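The cropping and resizing step listed above can be illustrated with a small sketch. It only computes the geometry of an aspect-preserving fit into the 180x200 target stated in the text; the function name `fit_box`, the centring of the face, and the use of symmetric padding are assumptions made for illustration, not details given in the paper.

```python
TARGET_W, TARGET_H = 180, 200  # target size stated in the text


def fit_box(w, h, target_w=TARGET_W, target_h=TARGET_H):
    """Return (new_w, new_h, pad_x, pad_y) for an aspect-preserving resize.

    A face crop of size w x h is scaled by a single factor so that it fits
    inside the target; the crop is then centred and the remaining border
    (pad_x on the left, pad_y on the top) is left to be filled.
    Hypothetical helper: the paper does not specify how padding is applied.
    """
    if w <= 0 or h <= 0:
        raise ValueError("box dimensions must be positive")
    # One common scale factor for both axes keeps the proportions intact.
    scale = min(target_w / w, target_h / h)
    new_w = min(target_w, max(1, round(w * scale)))
    new_h = min(target_h, max(1, round(h * scale)))
    pad_x = (target_w - new_w) // 2
    pad_y = (target_h - new_h) // 2
    return new_w, new_h, pad_x, pad_y


if __name__ == "__main__":
    # A square 100x100 detection is limited by the width of the target.
    print(fit_box(100, 100))
```

Using a single scale factor for both axes is what keeps the proportions of the face unchanged; the border that remains is what a subsequent fill step would cover.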

2.1. Face Detection

The goal of this phase is to identify the regions of the frame containing human faces. To this aim, we have adopted the Viola-Jones algorithm [10]. The Viola-Jones detector is a strong binary classifier built from several weak detectors; each weak detector is a very simple binary classifier. During the learning phase, the cascade of weak detectors is trained with AdaBoost until the required detection rate / miss rate (or precision / recall) is obtained. To detect regions containing faces, the generic frame is partitioned into several rectangular patches, each of which is used as input to the classifier cascade. If a rectangular image patch passes through all of the cascade stages, then it is classified as “positive”. The process is iterated at different scales. Figure 2 shows a frame of a video in which two faces of different dimensions are detected; as Figure 3 shows, they are scaled to the same size without loss of proportionality.
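The cascade logic described above (a patch is positive only if it passes every stage, and the scan is repeated at several scales) can be sketched as follows. This is a toy illustration, not the Viola-Jones detector itself: the two stage functions are hand-written stand-ins for the AdaBoost-trained stages built on Haar-like features, and all names, thresholds, and window sizes are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def patch(frame, x, y, size):
    """Extract a size x size window whose top-left corner is (x, y)."""
    return [row[x:x + size] for row in frame[y:y + size]]


# Hypothetical weak stages: each one is a cheap binary test on a patch.
def bright_enough(p):
    return mean(v for row in p for v in row) > 60


def top_darker_than_bottom(p):
    half = len(p) // 2
    top = mean(v for row in p[:half] for v in row)
    bottom = mean(v for row in p[half:] for v in row)
    return top < bottom


CASCADE = [bright_enough, top_darker_than_bottom]


def passes_cascade(p, stages=CASCADE):
    # all() stops at the first failing stage, so most patches are
    # rejected after very little work.
    return all(stage(p) for stage in stages)


def detect(frame, sizes=(2, 4), step=2):
    """Slide square windows (side >= 2) of several sizes over the frame."""
    hits = []
    height, width = len(frame), len(frame[0])
    for size in sizes:
        for y in range(0, height - size + 1, step):
            for x in range(0, width - size + 1, step):
                if passes_cascade(patch(frame, x, y, size)):
                    hits.append((x, y, size))
    return hits


if __name__ == "__main__":
    frame = [
        [50, 50, 0, 0],
        [200, 200, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(detect(frame))
```

The early rejection in `passes_cascade` is the point of the cascade arrangement: cheap stages discard most background patches so that only a few candidates reach the later stages.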

Fig. 1. Detection and tracking process.

Fig. 2. Scene where 2 faces have been detected.

2.2. Face recognition

Eigenfaces algorithms make it possible to implement a system capable of efficient, simple, and accurate face recognition. The system is initialized by acquiring a training set of faces downloaded from the Internet; eigenvectors and eigenvalues are computed on the covariance matrix of the training images. The eigenvectors $u_k$ are considered and the faces are projected into the face space in order to calculate their weights [11].

The mean of the $M$ training images $(\Gamma_1, \Gamma_2, \ldots, \Gamma_M)$ is considered as the “average face” $\Psi$, computed as

$$\Psi = \frac{1}{M} \sum_{n=1}^{M} \Gamma_n .$$

Each training face differs from the average face by a vector $\Phi_i$, given by $\Phi_i = \Gamma_i - \Psi$. A new image $\Gamma$ is projected into the face space using

$$\omega_k = u_k^T (\Gamma - \Psi),$$

where

$$u_l^T u_k = \delta_{lk} = \begin{cases} 1 & \text{if } l = k \\ 0 & \text{otherwise.} \end{cases}$$

Fig. 3. Result of the automatic scaling of faces: two images, (a) and (b), of different initial dimensions have both been brought to the size of 180x200 pixels with no loss of proportions.

The weights $\Omega^T = [\omega_1, \omega_2, \ldots, \omega_M]$ and the Euclidean distance $\varepsilon^2 = \|\Omega - \Omega_k\|^2$ make it possible to measure the distance between the test face and the generic face $k$. The distance $\varepsilon^2$ can be used as a measure of how close the test face is to any element of the training set. In the test with the faces downloaded from the Internet, the minimum value of the distance $\varepsilon^2$ has been used for associating the face with the one in the database.
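The projection and minimum-distance rule of this subsection can be sketched in a few lines. The sketch assumes that an orthonormal basis (the eigenvectors) is already available and passes it in as `basis`; computing it from the covariance matrix is omitted. Images are flattened to plain lists of pixel values, and all function names are hypothetical.

```python
def mean_face(faces):
    """Average face Psi: pixel-wise mean of the training images."""
    m = len(faces)
    return [sum(pixels) / m for pixels in zip(*faces)]


def project(face, psi, basis):
    """Weights omega_k = u_k^T (Gamma - Psi) for every basis vector u_k."""
    phi = [g - p for g, p in zip(face, psi)]
    return [sum(u_i * phi_i for u_i, phi_i in zip(u, phi)) for u in basis]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_face(test_face, training_faces, basis):
    """Return (index, squared distance) of the closest training face."""
    psi = mean_face(training_faces)
    omega = project(test_face, psi, basis)
    distances = [
        squared_distance(omega, project(face, psi, basis))
        for face in training_faces
    ]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


if __name__ == "__main__":
    # Toy 3-pixel "images" and a 2-vector orthonormal basis (assumed given).
    training = [[0, 0, 0], [10, 0, 0], [0, 10, 0]]
    basis = [[1, 0, 0], [0, 1, 0]]
    print(nearest_face([9, 1, 5], training, basis))
```

In this toy case the third pixel is ignored because no basis vector spans it, which mirrors how the face space discards directions that are not retained among the eigenvectors.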

2.3. Face recognition model

Our improvement with respect to the state of the art relies on the following model, which accounts for the presence of the same person through several consecutive frames. In this case, the time component is exploited for reducing the probability of missed detection and increasing the probability of correct identification. Let $z_i$ be the $i$th frame of the actual sequence. Denoting with $S_j$ the subset of sample images of the $j$th subject, we may write:

$$z_i = s^{(j)}_{k_i} + n_i \tag{1}$$

where $s^{(j)}_{k_i} \in S_j$ and $n_i$ is a sample from a stationary zero-mean white Gaussian noise with variance $\sigma_N^2$, modeling both model mismatch and imaging system noise. Then, the conditional probability density of the observed sequence $Z_1^M = (z_1, \ldots, z_M)$ given $s^{(j)}_{k_i}$, $i = 1, \ldots, M$, is

$$p\left(Z_1^M \mid s^{(j)}_{k_1}, s^{(j)}_{k_2}, \ldots, s^{(j)}_{k_M}\right) = \prod_{i=1}^{M} p\left(z_i \mid s^{(j)}_{k_i}\right) \tag{2}$$

$$= \frac{1}{\left(2\pi\sigma_N^2\right)^{M/2}} \exp\left\{ -\sum_{i=1}^{M} \frac{\left[z_i - s^{(j)}_{k_i}\right]^2}{2\sigma_N^2} \right\}. \tag{3}$$

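Since the actual $s^{(j)}_{k_i} \in S_j$ are unknown, we approximate the log-likelihood of the observed sequence $Z_1^M$ with respect to the $j$th hypothesis $H_j$ as follows:

$$\ln \Lambda\left(Z_1^M; H_j\right) = -\frac{M}{2} \ln 2\pi\sigma_N^2 - \frac{1}{2\sigma_N^2} \sum_{i=1}^{M} \min_{s^{(j)}_{k_i} \in S_j} \left[z_i - s^{(j)}_{k_i}\right]^2 \tag{4}$$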

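Thus, the maximum likelihood estimator selects, among the candidates, the one for which the following relation holds:

$$\hat{j} = \arg\max_{j} \left[\ln \Lambda\left(Z_1^M; H_j\right)\right] = \arg\max_{j} \left[ -\frac{M}{2} \ln 2\pi\sigma_N^2 - \frac{1}{2\sigma_N^2} \sum_{i=1}^{M} \min_{s^{(j)}_{k_i} \in S_j} \left[z_i - s^{(j)}_{k_i}\right]^2 \right] \tag{5}$$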

Since the first term and the positive factor multiplying the sum do not depend on $j$, this condition is also equivalent to:

$$\hat{j} = \arg\min_{j} \left[ \sum_{i=1}^{M} \min_{s^{(j)}_{k_i} \in S_j} \left[z_i - s^{(j)}_{k_i}\right]^2 \right] \tag{6}$$

That is to say, the candidate that differs least from the query over the whole clip is the one corresponding to the query.
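The decision rule of Eq. (6) can be sketched as follows: for each candidate subject, the squared distance from every frame to the closest sample of that subject is accumulated over the clip, and the subject with the smallest total is selected. Frames and samples are represented here as short lists of numbers (in practice they could be eigenface weight vectors, which is an assumption); the names `clip_cost`, `identify`, and `gallery` are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def clip_cost(frames, samples):
    """Inner part of Eq. (6) for one subject.

    For every frame z_i, take the squared distance to the closest sample
    in S_j, then sum these minima over the whole clip.
    """
    return sum(min(squared_distance(z, s) for s in samples) for z in frames)


def identify(frames, gallery):
    """gallery maps a subject name to its list of samples (the set S_j)."""
    costs = {name: clip_cost(frames, samples) for name, samples in gallery.items()}
    return min(costs, key=costs.get), costs


if __name__ == "__main__":
    gallery = {
        "A": [[0, 0], [1, 1]],
        "B": [[3, 3], [4, 4]],
    }
    clip = [[2.1, 2.1], [1, 0], [0, 1]]
    print(identify(clip[:1], gallery)[0])  # decision from the first frame only
    print(identify(clip, gallery)[0])      # decision from the whole clip
```

In this toy clip the first frame alone is closer to subject "B", while the accumulated cost over all three frames favours subject "A"; this is the effect the model relies on when several consecutive frames of the same person are available.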

3. EXPERIMENTAL PHASE

Experimental tests have been carried out for evaluating the effectiveness of the proposed system. To this aim, several video sequences have been recorded by using a commercial TV tuner. In particular, we considered two Italian TV broadcasting companies. 212 clips have been recorded, and a database of 100 subjects has been created with faces downloaded from the Internet. The faces from the Internet have been manually annotated and used as ground truth for identifying random faces extracted from the clips. In this work, 100 faces of 10 characters have been used for training the system. Each person's training set is composed of a collection of 10 images of 180x200 pixels, obtained from pictures downloaded from the Internet that have been resized, cropped, and filled so that all the elements of the database have the same dimensions.

Experimental results concerning the face detection phase show a correct detection probability of 80%. There are several reasons why the probability of detection is not equal to 100%. Among them:

• biometric features cannot be identified;

• skin color is too similar to the background color;

• presence of background characterized by regular geometric elements;

• presence of partial or total occlusions;

• presence of objects that look like faces.

Another possible cause is the quality of the received signal. In fact, the transmission of compressed video over error-prone channels may result in packet losses or errors, which can significantly degrade the image quality. To improve the quality of the received signal, methods exploiting forward error correction (e.g., [12]) or error concealment (e.g., [13]) have been proposed.

Figure 4 shows a frame in which the background has a regular structure. In this case, the detection algorithm is able to resolve the ambiguity only by exploiting the information from several consecutive frames by using the KLT algorithm [14, 15]. The performance of the Viola-Jones detector may be improved by exploiting recent modifications of the original algorithm, leading to a detection probability of 98% [16].

To analyze the performance of the face recognition system, experimental tests were carried out on datasets created by downloading annotated images from the Web. This database contains 10 individuals (mostly male), with 10 images each. The total number of images used in this experiment is 100. For each dataset, we created 10 subsets by randomly selecting the training images per individual. The faces were scaled and cropped to the size of 180x200 pixels. Figure 5 shows a sample of the database of faces downloaded from the Web and used in the experimentation phase.
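The random selection of training images per individual mentioned above could be organized as in the following sketch. The number of training images drawn per person (`n_train`) is not given in the text and is therefore a free parameter here; the function name, the fixed seed, and the use of the remaining images as a test set are likewise assumptions made for illustration.

```python
import random


def random_splits(image_ids_by_person, n_train, n_splits=10, seed=0):
    """Build n_splits random train/test partitions, drawn per individual.

    image_ids_by_person maps a person label to the list of that person's
    image identifiers. n_train (an assumed parameter) images are sampled
    for training; the remaining ones are kept aside for testing.
    """
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    splits = []
    for _ in range(n_splits):
        train, test = {}, {}
        for person, ids in image_ids_by_person.items():
            chosen = set(rng.sample(ids, n_train))
            train[person] = [i for i in ids if i in chosen]
            test[person] = [i for i in ids if i not in chosen]
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    # 10 individuals with 10 images each, as in the database described above.
    db = {"person%d" % k: list(range(10)) for k in range(10)}
    splits = random_splits(db, n_train=5)
    print(len(splits), len(splits[0][0]["person0"]))
```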

Moreover, tests have been performed on the face recognition system by randomly selecting frames containing faces from the 212 clips captured during the transmission of two Italian TV broadcasters. 230 faces were extracted and, from them, a subset of 29 faces has been selected to perform the recognition. Figure 6 shows the selected faces.

These faces were analyzed with the eigenfaces algorithm, obtaining a success rate of 96.55% (28 of the 29 faces). Figure 7 shows the behavior of the proposed recognition model. In particular, the cumulative distance between the face under test and those in the database is shown. As can be noticed, by increasing the available information, that is, by considering a larger number of frames for the comparison, the correct candidate clearly emerges. Following Eq. 6, the best candidate is the one with the lowest distance, which in Figure 7 is represented by the dotted curve. Increasing the number of considered frames thus improves the performance of the recognition system: as can be noticed in Figure 7(b), if a single frame is considered, it may not be possible to recognize the correct subject, whereas with more frames the recognition becomes feasible.

Fig. 4. Example of frame with a regular pattern in the background.

Fig. 5. Annotated faces DB downloaded from the web.

4. CONCLUSIONS

In this paper, a framework for people recognition in TV broadcast programs is presented. Specifically, people identification is performed by comparing the faces detected in the broadcast stream with annotated faces previously collected from the Web. The proposed approach consists of three steps. First, the recorded video stream is partitioned into scenes and shots by detecting scene changes. Second, face detection is performed on each shot, tracking the faces within the shot and resizing them. Third, face recognition is performed on a sample of faces against a database built from the Web. The experimental tests performed show the effectiveness of the proposed scheme.

5. REFERENCES

[1] D. De Luca Picione, F. Battisti, K. Egiazarian, M. Carli, and J. Astola, “A Fibonacci LSB data hiding technique,” in Procs. 14th European Signal Processing Conference (EUSIPCO), 2006.

[2] M. Muzzarelli, M. Carli, G. Boato, and K. Egiazarian, “Reversible watermarking via histogram shifting and least square optimization,” in Procs. of the 2010 ACM SIGMM Multimedia and Security Workshop, 2010.

[3] F. Battisti, M. Cancellaro, M. Carli, G. Boato, and A. Neri, “Watermarking and encryption of color images in the Fibonacci domain,” in Procs. SPIE International Conference on Electronic Imaging, Image Processing: Algorithms and Systems VII, 2008.

Fig. 6. Face DB extracted from the recorded TV streams.

[4] Amazon Mechanical Turk, https://www.mturk.com/mturk/welcome, last visited January 2014.

[5] M. Everingham, J. Sivic, and A. Zisserman, “Hello! my name is... buffy - automatic naming of characters in tv video,” in Procs. of the Workshop of British Machine Vision Association, 2006.

[6] J. Sivic, M. Everingham, and A. Zisserman, “Who are you? - learning person specific classifiers from video,” in Procs. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.

[7] B. Moghaddam, W. Wahid, and A. Pentland, “Beyond eigenfaces - probabilistic matching for face recognition,” in Procs. of IEEE Conference on Automatic Face and Gesture Recognition, 1998.

[8] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), pp. 711–720, 1997.

[9] M. Tapaswi, M. Bauml, and R. Stiefelhagen, “Knock! knock! who is it? probabilistic person identification in tv series,” in Procs. of IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[10] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Procs. of IEEE Conference on Computer Vision and Pattern Recognition, 2001.

[11] M. Turk and A. Pentland, “Eigenfaces for recognition,” in Journal of Cognitive Neuroscience, 3(1), pp. 71–86, 1991.

[12] F. Battisti, M. Carli, E. Mammi, and A. Neri, “A study on the impact of AL-FEC techniques on TV over IP Quality of Experience,” EURASIP Journal on Advances in Signal Processing (1), p. 86, 2011.

[13] J. Wang, Y. Tang, and S. Goto, “A spatial error propagation reduction based temporal error concealment for 1Seg Video broadcasting,” in Procs. International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 103–106, 2009.

Fig. 7. Three examples of face recognition results vs. number of considered frames: (a) Subject 1, (b) Subject 2, (c) Subject 3. In each panel, the horizontal axis reports the processed frames and the vertical axis the distance (scale x 10^5). The dotted line corresponds to the correct identification.

[14] J. Shi and C. Tomasi, “Good features to track,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 593–600, 1994.

[15] R. Tjahyadi, W. Liu, S. An, and S. Venkatesh, “Face recognition via the overlapping energy histogram,” in Procs. 20th International Joint Conference on Artificial Intelligence, 2007.

[16] S. Wang and A. Abdel-Dayem, “Improved Viola-Jones Face Detector,” in Procs. International Conference on Computing and Information Technology (ICCIT), 2012.
