
Multimodal Person Discovery in Broadcast TV Task at MediaEval 2015

Johann Poignant, Hervé Bredin, Claude Barras
LIMSI – CNRS, Orsay, [email protected]

Motivation
➢Indexing TV archives → find people appearing and speaking
➢Biometric models are not always available → find people's identities in the video (pronounced names, names written on screen)

Definition of the task
➢From a collection of TV broadcasts pre-segmented into shots, participants are asked to provide:
  ➢The names of people both speaking and appearing at the same time during the shot, with a confidence score
  ➢Evidence (a unique shot) proving that the person holds the right name
➢The list of persons was not provided
➢Person biometric models trained on external data cannot be used
➢Participants should find the names in the audio stream (using ASR) or the visual stream (using OCR)

[Figure: example over shots #1 to #4. INPUT: text overlays and speech transcripts (e.g. "Hello Mrs B", "Mr A"). OUTPUT: the speaking faces in each shot and the evidence shots. LEGEND: speaking face, evidence, text overlay, speech transcript.]

Shot #1 is evidence for Mr A

Shot #3 is evidence for Mrs B

Datasets
➢Dev set: REPERE (2 French channels, 8 shows, 137 hours, 50 hours manually annotated)
➢Test set: INA corpus (1 French channel, 172 editions of the evening broadcast news, 106 hours)
➢Annotated a posteriori based on participants' submissions

Baseline and metadata
➢Available as open-source software at http://github.com/MediaEvalPersonDiscoveryTask

Metrics
➢For each query q ∈ Q, rank all shots (by confidence) of the best hypothesis p (if Levenshtein ratio(p, q) > 0.95)

➢Average Precision

\[ \mathrm{AP} = \frac{\sum_{k=1}^{n} P(k) \cdot \mathrm{rel}(k)}{\#\ \text{relevant shots}} \]

where k is the rank, P(k) is the precision at cut-off k, and rel(k) = 1 if the shot at rank k is relevant, 0 otherwise

➢Mean Average Precision

\[ \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q) \]

➢Correctness of evidences

\[ C(q) = \begin{cases} 1 & \text{if Levenshtein ratio}(p, q) > 0.95 \text{ and the evidence proves the identity of } q \\ 0 & \text{otherwise} \end{cases} \]

➢Evidence-weighted Mean Average Precision

\[ \mathrm{EwMAP} = \frac{1}{|Q|} \sum_{q \in Q} C(q) \cdot \mathrm{AP}(q) \]
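The three metrics above can be illustrated with a minimal Python sketch. It is not the official evaluation code of the task. The function names, the dictionary layout, and the choice to return 0 when a query has no relevant shot are assumptions made for illustration. Relevance judgements and the evidence correctness C(q) are taken as given inputs, so the Levenshtein matching step is not shown.

```python
from typing import Dict, List


def average_precision(rel: List[int], n_relevant: int) -> float:
    """AP for one query.

    rel[k-1] is 1 if the shot at rank k is relevant, 0 otherwise.
    n_relevant is the total number of relevant shots for the query.
    """
    if n_relevant == 0:
        return 0.0  # assumption: a query with no relevant shot scores 0
    hits = 0
    total = 0.0
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            total += hits / k  # P(k), counted only where rel(k) = 1
    return total / n_relevant


def mean_average_precision(ap: Dict[str, float]) -> float:
    """MAP: plain mean of AP(q) over all queries."""
    return sum(ap.values()) / len(ap)


def evidence_weighted_map(ap: Dict[str, float], correct: Dict[str, int]) -> float:
    """EwMAP: AP(q) only counts when the evidence for q is correct (C(q) = 1)."""
    return sum(correct[q] * ap[q] for q in ap) / len(ap)


# Hypothetical toy example with two queries.
ap_scores = {
    "mr_a": average_precision([1, 0, 1], n_relevant=2),
    "mrs_b": average_precision([0, 1], n_relevant=1),
}
evidence_ok = {"mr_a": 1, "mrs_b": 0}

map_score = mean_average_precision(ap_scores)
ewmap_score = evidence_weighted_map(ap_scores, evidence_ok)
```

In this toy example the second query contributes nothing to EwMAP because its evidence is judged incorrect, so EwMAP can never exceed MAP.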
