
Page 1:

© 2007 IBM Corporation

IBM Research TRECVID-2007 Video Retrieval System

Apostol (Paul) Natsev
IBM T. J. Watson Research Center
Hawthorne, NY 10532 USA

On Behalf Of: Murray Campbell, Alexander Haubold, John R. Smith, Jelena Tesic, Lexing Xie, Rong Yan

Page 2:

Acknowledgement

• IBM Research

– Murray Campbell

– Matthew Hill

– Apostol (Paul) Natsev

– Quoc-Bao Nguyen

– Christian Penz

– John R. Smith

– Jelena Tesic

– Lexing Xie

– Rong Yan

• UIUC

– Ming Liu

– Xu Han

– Xun Xu

– Thomas Huang

• Columbia Univ.

– Alexander Haubold

• Carnegie Mellon Univ.

– Jun Yang

Page 3:

Part 1 of 2: Automatic Search

Page 4:

Outline

Overall System

Summary

Review of Baseline Retrieval Experts

Performance Analysis

Summary (Repeated)

Page 5:

System Overview

[Architecture diagram; the pipeline runs left to right: Features → Retrieval Experts → Fusion → Runs.]

Inputs:

– Videos: keyframes, ASR/MT transcripts

– Queries: text blurb, visual examples

Features:

– Grid features: Color Moments, Wavelet Texture

– Global features: Color Correlogram, Color Histogram, Co-occurrence Texture, Edge Histogram

– Speech: Text-LR5, Text-LR10

– Concepts: LSCOM-155, MediaNet-50, TV07-36

Retrieval experts:

– Text Search (T)

– Visual Search, bagged SVM (V)

– Model Vectors, bagged SVM (S)

– Concept Search: Statistical Map, co-occurrence (M1); WordNet Map, Brown stats (M2); WordNet Map, web stats (M3)

Fusion:

– Query-independent (average) and query-dynamic (weighted average)

Runs: Text, MSV.avg, TMSV.avg, TMSV.qdyn, TMSVC.qdyn
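The two fusion variants in the diagram can be illustrated with a small sketch. The scores and per-query weights below are made up for illustration; this is generic score fusion, not IBM's actual normalization or weight-learning code.

```python
# Fusing retrieval-expert scores per shot (hypothetical data).
# Query-independent fusion: plain average of expert scores.
# Query-dynamic fusion: per-query-topic weighted average.

def average_fusion(score_lists):
    """Query-independent fusion: mean of each expert's score per shot."""
    n = len(score_lists)
    return [sum(shot) / n for shot in zip(*score_lists)]

def weighted_fusion(score_lists, weights):
    """Query-dynamic fusion: weights chosen per query topic."""
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, shot)) / total
            for shot in zip(*score_lists)]

# Scores for 4 shots from three experts (text, visual, concept) -- made up.
text    = [0.9, 0.1, 0.4, 0.0]
visual  = [0.2, 0.8, 0.5, 0.1]
concept = [0.7, 0.3, 0.6, 0.2]

avg  = average_fusion([text, visual, concept])
qdyn = weighted_fusion([text, visual, concept], weights=[0.5, 0.3, 0.2])
```

For a text-heavy topic the query-dynamic weights would favor the text expert, as in the `qdyn` call above.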

Page 6:

Summary: What Worked and What Didn’t?

[Grade column: the per-item grades were rendered as symbols and are not recoverable from the extraction.]

• Baseline retrieval experts

– Speech/text-based

– Concept-based (statistical mapping)

– Concept-based (WordNet mapping, Brown corpus stats)

– Concept-based (WordNet mapping, web stats)

– Concept-based (query examples modeling)

– Visual-based

• Experiments

– Improving query-to-concept mapping: web-based stats

– Leveraging external resources: type C runs

– Query-dynamic multimodal fusion

Page 7:

Baseline Retrieval Experts: Review of Approaches

• Speech/Text-Based Retrieval

– Auto-query expansion with JuruXML search engine (Volkmer et al., ICME’06)

• Visual-Based Retrieval

– Bagged SVM modeling of query topic examples (Natsev et al., MM’05)

• Concept-Based Retrieval (G2 Statistical Map)

– Based on co-occurrence of ASR terms and concepts (Natsev et al., MM’07)

• Concept-Based Retrieval (WordNet Map, Brown Stats)

– Based on JCN similarity, IC from Brown Corpus (Haubold et al., ICME’06)

• Concept-Based Retrieval (WordNet Map, Web Stats)

– Based on JCN similarity, IC from WWW sample

• Concept-Based Retrieval (Content-Based Modeling)

– Bagged SVM modeling of query topic examples (Tesic et al., CIVR’07)

Page 8:

Improving Query-to-Concept Mapping

• WordNet similarity measures

– Frequently used for query-to-concept mapping

– Frequently based on Information Content (IC) to model term saliency

  • Resnik: evaluates information content (IC) of common root

  • Jiang-Conrath: evaluates IC of common root and ICs of terms

– IC typically estimated from 1961 Brown Corpus

  • IC from Brown Corpus outdated by >40 years

  • 76% of words in WordNet not in Brown Corpus (so IC = 0)

• Idea: create approximation using WWW

– Perform frequency analysis over large sample of web pages

– Google page count as indicator of frequency
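As a concrete sketch, Jiang-Conrath similarity can be computed from IC values derived from corpus frequencies. The counts and the tiny taxonomy below (with `vehicle` as the common subsumer) are hypothetical; in the actual runs IC was estimated from the Brown Corpus or from web-scale frequency counts.

```python
import math

# Hypothetical corpus statistics (not Brown/web values).
CORPUS_SIZE = 1_000_000
counts = {"vehicle": 50_000, "car": 20_000, "bicycle": 5_000}

def ic(term):
    """Information content: -log p(term), with p from corpus frequencies."""
    return -math.log(counts[term] / CORPUS_SIZE)

def jcn_similarity(t1, t2, lcs):
    """Jiang-Conrath: inverse of IC(t1) + IC(t2) - 2*IC(lowest common subsumer)."""
    dist = ic(t1) + ic(t2) - 2 * ic(lcs)
    return 1.0 / dist if dist > 0 else float("inf")

sim = jcn_similarity("car", "bicycle", lcs="vehicle")
```

Note how a rarer term carries higher IC, which is exactly why zero Brown-corpus counts (IC = 0) break the measure and motivate the web-based estimate.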

Page 9:

IC from Large-Scale Web Sampling

• Sample of web pages:

– 1,231,929 documents (~1M)

– 1,169,368,161 WordNet terms (~1B)

– Distribution similar to Brown Corpus

– Therefore: potentially useful as IC

• Google page count:

– As a proxy for term frequency counts

– Term frequency ≈ Doc. Frequency?

– Experiments show a linear relationship

– Therefore: potentially useful as IC

Page 10:

Outline

Overall System

Summary

Review of Baseline Retrieval Experts

Performance Analysis

Summary (Repeated)

Page 11:

Evaluation Results: Baseline Retrieval Experts

• Statistical concept-based run did not generalize

• Web-based IC led to 20% improvement in WordNet runs

• Concept-based runs performed better than speech/text-based runs

• Content-based runs performed best

[Bar chart: Comparison of Unimodal Retrieval Experts. Mean average precision (y-axis, 0–0.035) per expert, grouped by "All topics" and "All topics but 219". Experts: Speech/Text-Based, Concepts (Co-occurrence Map), Concepts (WordNet Brown Map), Concepts (WordNet Web Map), Concepts (Content-Based), Visual Content-Based. Values as extracted: 0.0144, 0.005, 0.005, 0.0149, 0.0152, 0.0146, 0.0185, 0.0177, 0.0228, 0.0238, 0.0162, 0.0302; exact bar assignments are not recoverable from the extraction.]

Page 12:

Baseline Retrieval Experts: JAWS Analysis

• For the adventurous:

– Stacked chart showing contribution of each expert per topic

[Stacked bar chart: Comparison of Unimodal Retrieval Experts. Relative contribution per expert (%) for each topic: Cook character, crowd, bicycle, street buildings, mountains, street night, waterfront, musical instruments, road from vehicle, street protest, door opened, boat, canal river, street market, bridge, woman interview, sheep goats, person telephone, people table, classroom, train, people stairs, keyboard, people dogs. Legend: Speech/Text-Based, Concepts (Co-occurrence Map), Concepts (WordNet Brown Map), Concepts (WordNet Web Map), Concepts (Content-Based), Visual Content-Based.]

Page 13:

Summary of Submitted and Unsubmitted IBM Runs

MAP     Type  Run ID       Code         Description
0.0149  A     Text         T            Text baseline
0.0045  A     -            MS           Concept baseline (stat)
0.0146  A     -            MB           Concept baseline (WordNet, Brown stats)
0.0177  C     -            MW           Concept baseline (WordNet, Web stats)
0.0228  A     -            S            Concept baseline (content-based, type A)
0.0249  C     -            SC           Concept baseline (content-based, type C)
0.0302  A     -            V            Visual baseline
0.0213  A     MSV          MSSVavg      Non-text baseline (MS + S + V)
0.0301  A     -            MBSVavg      Non-text baseline (MB + S + V)
0.0231  A     TMSV         TMSSVavg     Multimodal AVG fusion, type A (T + MS + S + V)
0.0210  A     TMSV.qdyn    TMSSVqdyn    Query-dynamic fusion, type A (T + MS + S + V)
0.0303  C     -            TMSWSCVavg   Multimodal AVG fusion, type C (T + MSW + SC + V)
0.0372  C     -            TMWSCVavg    Multimodal AVG fusion, type C (T + MW + SC + V)
0.0272  C     TMSV-C.qdyn  TMSWSCVqdyn  Query-dynamic fusion, type C (T + MSW + SC + V)
0.0638  C     -            Oracle       Oracle fusion (pick best run for each topic)
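The oracle fusion run simply picks, for each topic, the run with the highest AP, which upper-bounds any fixed run choice. A minimal sketch, with hypothetical AP values rather than the submitted runs' scores:

```python
# Oracle fusion: for each topic, take the run with the best average precision.
# AP values below are made up; the oracle MAP upper-bounds every single run.

ap = {  # run name -> {topic: AP}
    "text":   {"boat": 0.02, "train": 0.05, "crowd": 0.01},
    "visual": {"boat": 0.06, "train": 0.01, "crowd": 0.03},
}

def oracle_map(ap_by_run):
    """MAP of the per-topic best run."""
    topics = next(iter(ap_by_run.values())).keys()
    best = [max(run[t] for run in ap_by_run.values()) for t in topics]
    return sum(best) / len(best)

# The oracle picks "visual" for boat and crowd, "text" for train.
```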

Page 14:

Overall Performance Comparison

• Best run from each organization shown

• Submitted IBM runs in third tier; improve by dropping failed run

• IBM runs achieve highest AP scores on 5 of 24 topics

[Bar chart: Automatic Search (Best-of Team Performance). Mean average precision (y-axis, 0–0.10) for the best run per team, including F_A_2_RA-USTC-SJTU-SEARCH_1, F_C_IBM_Oracle, F_A_2_CT_3, F_A_2_ua_4, F_A_1_cu-ImgBaseline_4, F_A_2_IBM-Unofficial, and F_C_2_IBM-TMSV-C.qdyn_3, among others. Highlighted: Best IBM Submitted Run, Best IBM Unsubmitted Run, Best IBM Oracle Fusion Run.]

Page 15:

IBM Performance Relative to Best AP Per Topic

• Good (>90%): street night, street market, sheep/goat, road/vehicle, people telephone, canal/river

• So-so (40–60%): mountains, waterfront, boat, train, crowd, street protest, street buildings

• Bad (<25%): people table, people stairs, people dogs, people keyboard

[Bar chart: IBM Automatic Search. Performance relative to the best per-topic run (%), per topic: street night, sheep goats, road from vehicle, street market, person telephone, canal river, door opened, bicycle, mountains, street protest, waterfront, street buildings, crowd, Cook character, boat, train, musical instruments, bridge, classroom, woman interview, people table, people stairs, people dogs, keyboard.]

Page 16:

Summary: What Worked and What Didn’t?

[Grade column: the per-item grades were rendered as symbols and are not recoverable from the extraction.]

• Baseline retrieval experts

– Speech/text-based

– Concept-based (statistical mapping)

– Concept-based (WordNet mapping, Brown corpus stats)

– Concept-based (WordNet mapping, web stats)

– Concept-based (query examples modeling)

– Visual-based

• Experiments

– Improving query-to-concept mapping: web-based stats

– Leveraging external resources: type C runs

– Query-dynamic multimodal fusion

Page 17:

Part 2 of 2: Interactive Search

Page 18:

Annotation-based Interactive Retrieval

• Model interactive search as a video annotation task

– Consider each query topic as a keyword; annotate video keyframes

– Extends CMU’s Extreme Video Retrieval system [Hauptmann et al., MM’06]

• Hybrid annotation system

– Minimize annotation time by leveraging two annotation interfaces

– Tagging: Flickr, ESP game [von Ahn et al., CHI’04]

– Browsing: IBM’s EVA [Volkmer et al., MM’05], CMU’s EVR [Hauptmann et al., MM’06]

• Formal analysis by modeling the interactive retrieval process

– Tagging-based annotation time per shot

– Browsing-based annotation time per shot

Page 19:

Manual Annotation (I): Tagging

• Users associate a single image at a time with one or more keywords (the most widely used manual annotation approach)

• Advantages

– Freely choose arbitrary keywords to annotate

– Only need to annotate relevant keywords

• Disadvantages

– “Vocabulary mismatch” problem

– Inefficient: keywords must be thought up and typed

• Suitable for annotating rare keywords

Page 20:

Formal Time Model for Tagging

• Key factors for the tagging time model

– Number of keywords $K_l$ for image $l$

– Annotation time $t'_{fk}$ for the $k$-th word

– Initial setup time $t'_s$ for a new image

– Noise term $\varepsilon$ (zero-mean distribution)

• Annotation time for one image:

$$T_l = t'_{f1} + \cdots + t'_{fK_l} + t'_s + \varepsilon = \sum_k t'_{fk} + t'_s + \varepsilon$$

• Total expected annotation time for the entire collection containing $L$ images, assuming the expected time to annotate the $k$-th word is a constant $t_f$ (and the setup time a constant $t_s$):

$$E(T_{\mathrm{total}}) = \sum_l \Big( \sum_k E(t'_{fk}) + E(t'_s) \Big) = \sum_l \big( K_l\, t_f + t_s \big)$$
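The model is linear in the keyword count, which is what the validation study exploits: plotting per-image tagging time against the number of keywords and fitting a line recovers $t_f$ (slope) and $t_s$ (intercept). A self-contained sketch on synthetic data; the time constants are assumptions, not the user study's measurements:

```python
# Least-squares fit of the tagging time model T_l = K_l * t_f + t_s + noise.
# Data are simulated, mirroring the shape of the validation experiment.
import random

random.seed(0)
t_f, t_s = 4.0, 2.5                               # assumed seconds per word / setup
K = [random.randint(1, 6) for _ in range(100)]    # keywords per image
T = [k * t_f + t_s + random.gauss(0, 0.3) for k in K]

# Ordinary least squares for slope (estimate of t_f) and intercept (t_s).
n = len(K)
mean_k, mean_t = sum(K) / n, sum(T) / n
slope = (sum((k - mean_k) * (t - mean_t) for k, t in zip(K, T))
         / sum((k - mean_k) ** 2 for k in K))
intercept = mean_t - slope * mean_k
```

If the fitted line tracks the data closely, as on the validation slide, the constant-time-per-word assumption is reasonable.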

Page 21:

Validation of Tagging Time Model

• User study on TRECVID’05 development data

– One user manually tagged 100 images using 303 keywords

– If the model is correct, a linear fit should be found in the results

– The annotation results fit the model very well

Page 22:

Manual Annotation (II): Browsing

• Users associate multiple images with a single word at the same time

• Advantages

– Efficient to annotate each pair of images and words

– No “vocabulary mismatch”

• Disadvantages

– Need to spend time judging both relevant and irrelevant pairs

– Start with controlled vocabulary

– Annotate one keyword at a time

• Suitable for annotating frequent keywords

Page 23:

Formal Time Model for Browsing

• Key factors for the browsing time model

– Number of relevant images $L_k$ for a word $k$

– Annotation time $t'_p$ for a relevant image

– Annotation time $t'_n$ for an irrelevant image

– Noise term $\varepsilon$ (zero-mean distribution)

• Annotation time for all images w.r.t. a keyword $k$:

$$T_k = \sum_{l=1}^{L_k} t'_{pl} + \sum_{l=1}^{L - L_k} t'_{nl} + \varepsilon$$

• Total expected annotation time for the entire collection containing $L$ images, assuming the expected time to annotate a relevant (irrelevant) image is a constant $t_p$ ($t_n$):

$$E(T_{\mathrm{total}}) = \sum_k \Big( \sum_l E(t'_{pl}) + \sum_l E(t'_{nl}) \Big) = \sum_k \big( L_k\, t_p + (L - L_k)\, t_n \big)$$
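The expected browsing cost is easy to evaluate directly, which makes the rare-vs-frequent trade-off concrete: browsing charges for every image, so the per-relevant-image cost collapses only when a keyword is frequent. A sketch with assumed time constants (not measured values):

```python
# Expected browsing time per the model: E(T_k) = L_k*t_p + (L - L_k)*t_n.
# t_p / t_n are assumed seconds per relevant / irrelevant judgment.

def expected_browse_time(L, L_k, t_p=1.5, t_n=0.3):
    """L images in the collection, L_k of them relevant to keyword k."""
    return L_k * t_p + (L - L_k) * t_n

# A frequent keyword amortizes the scan; a rare one pays almost the same
# total scan cost for only a handful of relevant images.
rare     = expected_browse_time(L=10_000, L_k=10)
frequent = expected_browse_time(L=10_000, L_k=2_000)
```

Here `rare` costs hundreds of seconds per relevant image found, while `frequent` costs only a few, matching the slide's "suitable for frequent keywords" conclusion.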

Page 24:

Validation of Browsing Time Model

• User study on TRECVID’05 development data

– Three users manually browsed images for 15 minutes (covering 25 keywords)

– If the model is correct, a linear fit should be found in the results

– The annotation results fit the model for all users

Page 25:

Video Retrieval as Hybrid Annotation

• Jointly annotates all topics at the same time

• Switches between tagging and browsing annotation interfaces in order to minimize the overall annotation time

– Formally model the annotation time as functions of word frequency, time per word, and annotation interfaces

– Online machine learning to select images, topics, and interfaces based on the annotation models

• More details will be released in the final notebook paper

– See related analysis in R. Yan et al. [ACM MM 2007 Workshop on Many Faces of Multimedia]
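The core interface-switching decision can be sketched by comparing the two time models directly: browse a keyword when the model predicts scanning the collection is cheaper than tagging its relevant images one by one. All time constants below are illustrative assumptions, not the system's learned values, and the real system makes this choice online rather than from known relevance counts.

```python
# Hybrid-annotation interface selection, as a simplified sketch.
# Tagging cost ~ one typed keyword per relevant image (t_f each).
# Browsing cost ~ one judgment per image in the collection (t_p or t_n).

def choose_interface(L, L_k, t_f=4.0, t_p=1.5, t_n=0.3):
    """Pick the cheaper interface for a keyword with L_k relevant of L images."""
    tag_time    = L_k * t_f                    # tag each relevant image
    browse_time = L_k * t_p + (L - L_k) * t_n  # judge every image once
    return "browse" if browse_time < tag_time else "tag"

# Rare topics fall to tagging; frequent topics fall to browsing.
```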

Page 26:

TRECVID-2007 Performance Analysis

• The proposed approach allows users to annotate 60% of the image-topic pairs, as compared with ~10% for simple browsing

• Balance between tagging and browsing: 1529 retrieved shots from tagging, 797 retrieved shots from browsing

• Simple temporal expansion can improve MAP from 0.35 to 0.43
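Temporal expansion here means adding the immediate temporal neighbors of each annotated-relevant shot, since relevance often spans adjacent shots. A minimal sketch; the shot indexing and the ±1 window are illustrative assumptions, not the system's exact expansion rule:

```python
# Temporal expansion sketch: grow the relevant set with neighboring shots.

def temporal_expand(relevant, num_shots, window=1):
    """Add up to `window` neighbors on each side of every relevant shot."""
    expanded = set(relevant)
    for shot in relevant:
        for d in range(1, window + 1):
            if shot - d >= 0:
                expanded.add(shot - d)
            if shot + d < num_shots:
                expanded.add(shot + d)
    return sorted(expanded)

# e.g. temporal_expand([5, 9], num_shots=12) -> [4, 5, 6, 8, 9, 10]
```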

[Bar chart: interactive search runs ranked by mean average precision (0–0.50), including I_A_EER_Expand, I_A_2_JW-EER_1, I_C_2_VGG_I_1_1, I_B_2_A-MM_1, I_A_2_CT_1, I_A_2_AL_CO_3, I_A_2_ua_6, and others.]

Page 27:

TRECVID-2007 Per-Query Performance Analysis

• Good: “telephone”, “interview”, “mountains”

• Not as good: “canal”, “bicycle”, “classroom”, “dog”

• Highest AP scores on 10 out of 24 topics

[Bar chart: per-topic average precision (0–1.00), Best IBM vs. Best Non-IBM, for topics: street night, sheep goats, road from vehicle, street market, person telephone, canal river, door opened, bicycle, mountains, street protest, waterfront, street buildings, crowd, Cook character, boat, train, musical instruments, bridge, classroom, woman interview, people table, people stairs, people dogs, keyboard.]

Page 28:

Potential Improvement

• Search beyond keyframe (look at the I-frame)

• Search beyond I-frame (leverage text info or temporal context)

• Better online learning algorithms to search for more shots in the same amount of time (include temporal information)

[Screenshots: Finding Sheep/Goats]