Invited talk at the University of Texas at El Paso
Focus on spoken content in multimedia retrieval

Maria Eskevich
Centre for Next Generation Localisation, School of Computing,
Dublin City University, Dublin, Ireland

April 16, 2013
Outline

- Spoken Content Retrieval: historical perspective
- MediaEval benchmark: 3 years of Spoken Content Retrieval experiments
  (Rich Speech Retrieval and Search and Hyperlinking tasks)
- Dataset collection creation issues for multimedia retrieval: the crowdsourcing aspect
- Interesting observations on results:
  - Segmentation methods
  - Evaluation metrics
  - Numbers
Towards Effective Retrieval of Spontaneous Conversational Spoken Content

Information Retrieval (IR) vs. Spoken Content Retrieval (SCR)

[Diagram: in a standard IR system, queries formulated from an information request are matched by an IR model against indexed documents to produce results. An SCR system differs at the front end: the audio files first pass through a speech-processing step (Automatic Speech Recognition, ASR) that turns the audio data collection into transcripts, and retrieval then runs over the indexed transcripts.]
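To make the SCR branch of the diagram concrete, here is a minimal sketch in Python, assuming toy invented transcripts and using plain TF-IDF with cosine similarity as a stand-in for whatever IR model a real system would plug in: ASR output is indexed like ordinary text and ranked against a query.

```python
# Minimal SCR sketch: ASR transcript segments are indexed like ordinary
# documents, and a standard IR model (here TF-IDF + cosine) ranks them.
# All data below is invented for illustration.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_index(transcripts):
    """Turn {segment_id: transcript_text} into unit-normalized TF-IDF vectors."""
    docs = {sid: Counter(tokenize(text)) for sid, text in transcripts.items()}
    n = len(docs)
    df = Counter(term for tf in docs.values() for term in tf)  # document freq
    idf = {t: math.log(n / df[t]) for t in df}
    index = {}
    for sid, tf in docs.items():
        vec = {t: freq * idf[t] for t, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        index[sid] = {t: w / norm for t, w in vec.items()}
    return index, idf

def retrieve(query, index, idf):
    """Rank indexed segments by cosine similarity to the query."""
    qvec = {t: f * idf.get(t, 0.0) for t, f in Counter(tokenize(query)).items()}
    scores = {sid: sum(qvec.get(t, 0.0) * w for t, w in vec.items())
              for sid, vec in index.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy ASR output; in a real system these strings come from an ASR decoder.
transcripts = {
    "seg1": "the committee discussed the budget for next year",
    "seg2": "weather report heavy rain expected in dublin",
    "seg3": "budget cuts were announced during the meeting",
}
index, idf = build_index(transcripts)
print(retrieve("budget meeting", index, idf))  # seg3 ranks first
```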
Spoken Content Retrieval (SCR)

[Diagram: spoken content (the data) goes through an ASR system; the transcript is segmented and indexed; retrieval over the indexed transcript produces a ranked result list, which is scored with evaluation metrics. The research questions attach to these stages: RQ1 to indexing, RQ2 to RQ4 to the retrieval experiments.]

Research Question 1: How does segmentation of spoken data affect retrieval performance? What are the characteristics of a segmentation method that maximizes SCR effectiveness?

Research Question 2: What is the relationship between ASR errors in the transcript and retrieval behaviour?

Research Question 3: How can regions of poor speech recognition be identified and processed in order to improve overall speech retrieval performance (detection, special treatment in the speech retrieval process)?

Research Question 4: Can we implement a meaningful approach to SCR of conversational content, incorporating task-specific segmentation and special treatment of regions with unreliable ASR output?
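RQ1 asks how the segmentation of spoken data affects retrieval. One common baseline, sketched below under the assumption of a time-aligned transcript with invented word timings, cuts the transcript into fixed-length, overlapping windows and treats each window as a retrieval unit; this is the kind of method the segmentation experiments later in the talk vary.

```python
# Illustrative baseline for RQ1: fixed-length, overlapping time windows
# over a time-aligned ASR transcript. Word timings below are invented.

def fixed_length_segments(words, length=60.0, step=30.0):
    """Group (start_time, token) pairs, assumed sorted by time, into
    windows of `length` seconds, advancing by `step` seconds
    (step < length gives overlapping windows)."""
    if not words:
        return []
    segments = []
    t0, end = words[0][0], words[-1][0]
    while t0 <= end:
        tokens = [w for t, w in words if t0 <= t < t0 + length]
        if tokens:
            segments.append((t0, " ".join(tokens)))
        t0 += step
    return segments

# Toy time-aligned ASR output: (start time in seconds, recognized word).
words = [(0.0, "welcome"), (1.2, "to"), (1.5, "the"), (1.8, "budget"),
         (35.0, "meeting"), (70.0, "next"), (71.0, "topic"), (72.5, "weather")]
for start, text in fixed_length_segments(words, length=60.0, step=30.0):
    print(f"{start:6.1f}s  {text}")
```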
Outline: Spoken Content

[The same SCR pipeline diagram as above, now highlighting the spoken content that feeds it.]
Spoken Content Retrieval: historical perspective

Spoken content spans a spectrum from prepared speech to informal conversational speech:
Broadcast News -> Lectures -> Meetings -> Informal Content (Internet TV, podcasts, interviews)
Broadcast News:

- Data:
  - High-quality recordings: often a soundproof studio; the speaker is a professional presenter
  - Well-defined structure
  - A query is on a certain topic: the user is ready to listen to the whole section
- Experiments: TREC SDR (1997-2000)
  - Known-item search and ad-hoc retrieval
  - Search with and without fixed story boundaries
  - Evaluation: interest in rank position

HIGHLIGHT: "Success story" (Garofolo et al., 2000):
performance on ASR transcripts ≈ performance on manual transcripts,
thanks to good ASR (large amounts of training data) and the structure of the data.

CHALLENGE: speech in broadcast news is close to written text,
and differs from the informal content of spontaneous speech.
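The "ASR transcript ≈ manual transcript" highlight is conventionally quantified with word error rate (WER), the word-level edit distance between the ASR hypothesis and the manual reference. A minimal sketch of the standard computation (example strings invented):

```python
# Word error rate: (substitutions + deletions + insertions) / reference length,
# computed with a standard Levenshtein dynamic programme over words.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the budget was approved today",
          "a budget was proved today"))  # 2 errors / 5 words = 0.4
```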
Lectures:

- Data:
  - Prepared presentations containing conversational-style features: hesitations, mispronunciations
  - Specialized vocabulary: Out-Of-Vocabulary words; lecture-specific words may have low probability scores in the ASR language model
  - Additional information available: presentation slides, textbooks
- Experiments:
  - Lecture browsing: e.g. TalkMiner, MIT lectures, eLectures
  - SpokenDoc(2) tasks at NTCIR-9 and NTCIR-10: e.g. IR experiments, evaluation metrics that assess topic segmentation methods

HIGHLIGHT/CHALLENGE:
- Focus on segmentation methods and jump-in points
Meetings:

- Data features:
  - Mixture of semi-formal and prepared spoken content
  - Additional data: slides, minutes
- Possible real-life scenarios:
  - Jump-in points where discussion of a topic started or a decision point was reached
  - Opinion of a certain person, or of a person with a certain role
  - Search for all relevant (parts of) meetings where a topic was discussed
- Experiments: topic segmentation, browsing, summarization

HIGHLIGHT/CHALLENGE:
- No unified search scenario
- We created a test retrieval collection on the basis of the AMI corpus and set up a task scenario ourselves
Informal Content (interviews, Internet TV):

- Data features:
  - Varying quality: semi- and non-professional data creators
  - Additional data: professionally or user-generated metadata
- Experiments:
  - CLEF CL-SR: MALACH collection; known- and unknown-boundary conditions, ad-hoc task
  - MediaEval '11, '12, '13: retrieval of semi-professional multimedia content; known-item task, unknown boundaries
  - Metrics: focus on ranking and penalize distance from the jump-in point (a toy version is sketched after this list)

HIGHLIGHT/CHALLENGE:
- The metrics do not always take into account how much time the user needs to spend listening before reaching the relevant content
- Diversity of the informal multimedia content
- The search scenario is no longer limited to factual information
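As a toy illustration of a metric that penalizes distance from the jump-in point (an illustrative score in the spirit of the metrics named above, not the exact MediaEval or CLEF definition):

```python
# Known-item scoring sketch: the first hit on the relevant item contributes
# 1/rank, discounted linearly by how far the returned entry point lies from
# the true jump-in point. All names and numbers below are invented.

def penalized_reciprocal_rank(results, relevant_item, true_start,
                              tolerance=30.0):
    """`results` is a ranked list of (item_id, start_time) pairs.
    Offsets beyond `tolerance` seconds from `true_start` score 0."""
    for rank, (item, start) in enumerate(results, start=1):
        if item == relevant_item:
            offset = abs(start - true_start)
            penalty = max(0.0, 1.0 - offset / tolerance)
            return penalty / rank
    return 0.0

ranked = [("videoB", 10.0), ("videoA", 95.0), ("videoC", 0.0)]
# Relevant content starts at 80.0s in videoA; our entry point is 15s off.
print(penalized_reciprocal_rank(ranked, "videoA", true_start=80.0))
# -> (1 - 15/30) / 2 = 0.25
```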
Review of the challenges and our work for informal SCR:

- The framework of a retrieval experiment has to be set up: retrieval collections must be created.
  Our work: we collected new multimodal retrieval collections via crowdsourcing.
- ASR errors decrease IR results.
  Our work: we examined the deeper relationship between ASR performance and result ranking.
- Suitable segmentation is vital.
  Our work: we carry out experiments with varying methods.
- There is a need for metrics that reflect all aspects of the user experience.
  Our work: we created a new set of metrics.
Focus on spoken content in multimedia retrieval 7/48
Spoken Content Retrieval: historical perspectiveSpoken ContentPrepared Speech
InformalConversational
Speech
Broadcast NewsBroadcast News
LecturesLectures
MeetingsMeetings
Informal ContentInformal Content
Internet TV,Podcast, Interview
Internet TV,Podcast, Interview
Broadcast News:
I DataI High quality recordings:
I Often soundproof studioI Speaker - professional presenter
I Well defined structureI Query is on a certain topic:
User is ready to listen to the whole section
I Experiments: TREC SDR (1997-2000)I Known-item search and ad-hoc retrievalI Search with and without fixed story boundaries
I Evaluation: interest in rank position
HIGHLIGHT: ”Success story” (Garofolo et al., 2000):Performance on ASR Transcript ≈ Manual Transcript
I ASR good: large amounts of training dataI Data structure
CHALLENGE:Speech data in broadcast news is close to the written text,and differs from the informal content of spontaneous speech
Lectures:I Data:
I Prepared presentations containingconversational style features:hesitations, mispronunciations
I Specialized vocabularyI Out-Of-Vocabulary wordsI Lecture specific words may have low
probability scores in the ASR languagemodel
I Additional information available:presentation slides, textbooks
I Experiments:I Lectures browsing:
e.g. TalkMiner, MIT lectures, eLecturesI SpokenDoc(2) Tasks at NTCIR-9, NTCIR-10:
e.g. IR experiments, evaluation metrics thatassess topic segmentation methods
HIGHLIGHT/CHALLENGE:I Focus on segmentation methods, jump-in
points
Meetings:
I Data features:I Mixture of semi-formal and prepared spoken
contentI Additional data: slides, minutes
I Possible real life motivated scenario:I Jump-in points where discussion on topic
started or a decision point is reachedI Opinion of a certain person or person with a
certain roleI Search for all relevant (parts of) meetings
where topic was discussed
I Experiments:I topic segmentation, browsingI summarization
HIGHLIGHT/CHALLENGE:I No unified search scenarioI We created a test retrieval collection on the basis of AMI
corpus and set up a task scenario ourselves
Informal Content (Interviews, Internet TV):I Data features:
I Varying quality: semi- andnon-professional data creators
I Additional data: professionally oruser-generated metadata
I Experiments:I CLEF CL-SR: MALACH collection
I un/known-boundaries, ad-hoc taskI MediaEval’11,’12,’13: retrieval of
semi-professional multimedia contentI known-item task, unknown
boundariesI Metrics: focus on ranking and penalize
distance from the jump-in point
HIGHLIGHT/CHALLENGE:I Metric does not always take into account how much time the
user needs to spend listening to access the relevant contentI Diversity of the informal multimedia contentI Search scenario no longer limited to factual information
Review of the challenges/our work for Informal SCR:
I Framework of retrieval experiment has to be setup: retrieval collections to be createdOur work: We collected new multimodal retrievalcollections via crowdsourcing
I ASR errors decrease IR resultsOur work: We examined deeper relationshipbetween ASR performance and results ranking
I Suitable segmentation is vitalOur work: We carry out experiments with varyingmethods
I Need for metrics that reflect all aspects of userexperienceOur work: We created a new set of metrics
Focus on spoken content in multimedia retrieval 7/48
Spoken Content Retrieval: historical perspectiveSpoken ContentPrepared Speech
InformalConversational
Speech
Broadcast News
Broadcast News
LecturesLectures
MeetingsMeetings
Informal ContentInformal Content
Internet TV,Podcast, Interview
Internet TV,Podcast, Interview
Broadcast News:
I DataI High quality recordings:
I Often soundproof studioI Speaker - professional presenter
I Well defined structureI Query is on a certain topic:
User is ready to listen to the whole section
I Experiments: TREC SDR (1997-2000)I Known-item search and ad-hoc retrievalI Search with and without fixed story boundaries
I Evaluation: interest in rank position
HIGHLIGHT: ”Success story” (Garofolo et al., 2000):Performance on ASR Transcript ≈ Manual Transcript
I ASR good: large amounts of training dataI Data structure
CHALLENGE:Speech data in broadcast news is close to the written text,and differs from the informal content of spontaneous speech
Lectures:I Data:
I Prepared presentations containingconversational style features:hesitations, mispronunciations
I Specialized vocabularyI Out-Of-Vocabulary wordsI Lecture specific words may have low
probability scores in the ASR languagemodel
I Additional information available:presentation slides, textbooks
I Experiments:I Lectures browsing:
e.g. TalkMiner, MIT lectures, eLecturesI SpokenDoc(2) Tasks at NTCIR-9, NTCIR-10:
e.g. IR experiments, evaluation metrics thatassess topic segmentation methods
HIGHLIGHT/CHALLENGE:I Focus on segmentation methods, jump-in
points
Meetings:
I Data features:I Mixture of semi-formal and prepared spoken
contentI Additional data: slides, minutes
I Possible real life motivated scenario:I Jump-in points where discussion on topic
started or a decision point is reachedI Opinion of a certain person or person with a
certain roleI Search for all relevant (parts of) meetings
where topic was discussed
I Experiments:I topic segmentation, browsingI summarization
HIGHLIGHT/CHALLENGE:I No unified search scenarioI We created a test retrieval collection on the basis of AMI
corpus and set up a task scenario ourselves
Informal Content (Interviews, Internet TV):I Data features:
I Varying quality: semi- andnon-professional data creators
I Additional data: professionally oruser-generated metadata
I Experiments:I CLEF CL-SR: MALACH collection
I un/known-boundaries, ad-hoc taskI MediaEval’11,’12,’13: retrieval of
semi-professional multimedia contentI known-item task, unknown
boundariesI Metrics: focus on ranking and penalize
distance from the jump-in point
HIGHLIGHT/CHALLENGE:I Metric does not always take into account how much time the
user needs to spend listening to access the relevant contentI Diversity of the informal multimedia contentI Search scenario no longer limited to factual information
Review of the challenges/our work for Informal SCR:
I Framework of retrieval experiment has to be setup: retrieval collections to be createdOur work: We collected new multimodal retrievalcollections via crowdsourcing
I ASR errors decrease IR resultsOur work: We examined deeper relationshipbetween ASR performance and results ranking
I Suitable segmentation is vitalOur work: We carry out experiments with varyingmethods
I Need for metrics that reflect all aspects of userexperienceOur work: We created a new set of metrics
Focus on spoken content in multimedia retrieval 7/48
Spoken Content Retrieval: historical perspectiveSpoken ContentPrepared Speech
InformalConversational
Speech
Broadcast News
Broadcast News
Lectures
Lectures
MeetingsMeetings
Informal ContentInformal Content
Internet TV,Podcast, Interview
Internet TV,Podcast, Interview
Broadcast News:
I DataI High quality recordings:
I Often soundproof studioI Speaker - professional presenter
I Well defined structureI Query is on a certain topic:
User is ready to listen to the whole section
I Experiments: TREC SDR (1997-2000)I Known-item search and ad-hoc retrievalI Search with and without fixed story boundaries
I Evaluation: interest in rank position
HIGHLIGHT: ”Success story” (Garofolo et al., 2000):Performance on ASR Transcript ≈ Manual Transcript
I ASR good: large amounts of training dataI Data structure
CHALLENGE:Speech data in broadcast news is close to the written text,and differs from the informal content of spontaneous speech
Lectures:I Data:
I Prepared presentations containingconversational style features:hesitations, mispronunciations
I Specialized vocabularyI Out-Of-Vocabulary wordsI Lecture specific words may have low
probability scores in the ASR languagemodel
I Additional information available:presentation slides, textbooks
I Experiments:I Lectures browsing:
e.g. TalkMiner, MIT lectures, eLecturesI SpokenDoc(2) Tasks at NTCIR-9, NTCIR-10:
e.g. IR experiments, evaluation metrics thatassess topic segmentation methods
HIGHLIGHT/CHALLENGE:I Focus on segmentation methods, jump-in
points
Meetings:
I Data features:I Mixture of semi-formal and prepared spoken
contentI Additional data: slides, minutes
I Possible real life motivated scenario:I Jump-in points where discussion on topic
started or a decision point is reachedI Opinion of a certain person or person with a
certain roleI Search for all relevant (parts of) meetings
where topic was discussed
I Experiments:I topic segmentation, browsingI summarization
HIGHLIGHT/CHALLENGE:I No unified search scenarioI We created a test retrieval collection on the basis of AMI
corpus and set up a task scenario ourselves
Informal Content (Interviews, Internet TV):I Data features:
I Varying quality: semi- andnon-professional data creators
I Additional data: professionally oruser-generated metadata
I Experiments:I CLEF CL-SR: MALACH collection
I un/known-boundaries, ad-hoc taskI MediaEval’11,’12,’13: retrieval of
semi-professional multimedia contentI known-item task, unknown
boundariesI Metrics: focus on ranking and penalize
distance from the jump-in point
HIGHLIGHT/CHALLENGE:I Metric does not always take into account how much time the
user needs to spend listening to access the relevant contentI Diversity of the informal multimedia contentI Search scenario no longer limited to factual information
Review of the challenges/our work for Informal SCR:
I Framework of retrieval experiment has to be setup: retrieval collections to be createdOur work: We collected new multimodal retrievalcollections via crowdsourcing
I ASR errors decrease IR resultsOur work: We examined deeper relationshipbetween ASR performance and results ranking
I Suitable segmentation is vitalOur work: We carry out experiments with varyingmethods
I Need for metrics that reflect all aspects of userexperienceOur work: We created a new set of metrics
Focus on spoken content in multimedia retrieval 7/48
Spoken Content Retrieval: historical perspectiveSpoken ContentPrepared Speech
InformalConversational
Speech
Broadcast News
Broadcast News
Lectures
Lectures
Meetings
Meetings
Informal ContentInformal Content
Internet TV,Podcast, Interview
Internet TV,Podcast, Interview
Broadcast News:
I DataI High quality recordings:
I Often soundproof studioI Speaker - professional presenter
I Well defined structureI Query is on a certain topic:
User is ready to listen to the whole section
I Experiments: TREC SDR (1997-2000)I Known-item search and ad-hoc retrievalI Search with and without fixed story boundaries
I Evaluation: interest in rank position
HIGHLIGHT: ”Success story” (Garofolo et al., 2000):Performance on ASR Transcript ≈ Manual Transcript
I ASR good: large amounts of training dataI Data structure
CHALLENGE:Speech data in broadcast news is close to the written text,and differs from the informal content of spontaneous speech
Lectures:I Data:
I Prepared presentations containingconversational style features:hesitations, mispronunciations
I Specialized vocabularyI Out-Of-Vocabulary wordsI Lecture specific words may have low
probability scores in the ASR languagemodel
I Additional information available:presentation slides, textbooks
I Experiments:I Lectures browsing:
e.g. TalkMiner, MIT lectures, eLecturesI SpokenDoc(2) Tasks at NTCIR-9, NTCIR-10:
e.g. IR experiments, evaluation metrics thatassess topic segmentation methods
HIGHLIGHT/CHALLENGE:I Focus on segmentation methods, jump-in
points
Meetings:
I Data features:I Mixture of semi-formal and prepared spoken
contentI Additional data: slides, minutes
I Possible real life motivated scenario:I Jump-in points where discussion on topic
started or a decision point is reachedI Opinion of a certain person or person with a
certain roleI Search for all relevant (parts of) meetings
where topic was discussed
I Experiments:I topic segmentation, browsingI summarization
HIGHLIGHT/CHALLENGE:I No unified search scenarioI We created a test retrieval collection on the basis of AMI
corpus and set up a task scenario ourselves
Informal Content (Interviews, Internet TV):I Data features:
I Varying quality: semi- andnon-professional data creators
I Additional data: professionally oruser-generated metadata
I Experiments:I CLEF CL-SR: MALACH collection
I un/known-boundaries, ad-hoc taskI MediaEval’11,’12,’13: retrieval of
semi-professional multimedia contentI known-item task, unknown
boundariesI Metrics: focus on ranking and penalize
distance from the jump-in point
HIGHLIGHT/CHALLENGE:I Metric does not always take into account how much time the
user needs to spend listening to access the relevant contentI Diversity of the informal multimedia contentI Search scenario no longer limited to factual information
Review of the challenges/our work for Informal SCR:
I Framework of retrieval experiment has to be setup: retrieval collections to be createdOur work: We collected new multimodal retrievalcollections via crowdsourcing
I ASR errors decrease IR resultsOur work: We examined deeper relationshipbetween ASR performance and results ranking
I Suitable segmentation is vitalOur work: We carry out experiments with varyingmethods
I Need for metrics that reflect all aspects of userexperienceOur work: We created a new set of metrics
Focus on spoken content in multimedia retrieval 7/48
Spoken Content Retrieval: historical perspectiveSpoken ContentPrepared Speech
InformalConversational
Speech
Broadcast News
Broadcast News
Lectures
Lectures
Meetings
Meetings
Informal Content
Informal Content
Internet TV,Podcast, Interview
Internet TV,Podcast, Interview
Broadcast News:
I DataI High quality recordings:
I Often soundproof studioI Speaker - professional presenter
I Well defined structureI Query is on a certain topic:
User is ready to listen to the whole section
I Experiments: TREC SDR (1997-2000)I Known-item search and ad-hoc retrievalI Search with and without fixed story boundaries
I Evaluation: interest in rank position
HIGHLIGHT: ”Success story” (Garofolo et al., 2000):Performance on ASR Transcript ≈ Manual Transcript
I ASR good: large amounts of training dataI Data structure
CHALLENGE:Speech data in broadcast news is close to the written text,and differs from the informal content of spontaneous speech
Lectures:I Data:
I Prepared presentations containingconversational style features:hesitations, mispronunciations
I Specialized vocabularyI Out-Of-Vocabulary wordsI Lecture specific words may have low
probability scores in the ASR languagemodel
I Additional information available:presentation slides, textbooks
I Experiments:I Lectures browsing:
e.g. TalkMiner, MIT lectures, eLecturesI SpokenDoc(2) Tasks at NTCIR-9, NTCIR-10:
e.g. IR experiments, evaluation metrics thatassess topic segmentation methods
HIGHLIGHT/CHALLENGE:I Focus on segmentation methods, jump-in
points
Meetings:
I Data features:I Mixture of semi-formal and prepared spoken
contentI Additional data: slides, minutes
I Possible real life motivated scenario:I Jump-in points where discussion on topic
started or a decision point is reachedI Opinion of a certain person or person with a
certain roleI Search for all relevant (parts of) meetings
where topic was discussed
I Experiments:I topic segmentation, browsingI summarization
HIGHLIGHT/CHALLENGE:I No unified search scenarioI We created a test retrieval collection on the basis of AMI
corpus and set up a task scenario ourselves
Informal Content (Interviews, Internet TV):I Data features:
I Varying quality: semi- andnon-professional data creators
I Additional data: professionally oruser-generated metadata
I Experiments:I CLEF CL-SR: MALACH collection
I un/known-boundaries, ad-hoc taskI MediaEval’11,’12,’13: retrieval of
semi-professional multimedia contentI known-item task, unknown
boundariesI Metrics: focus on ranking and penalize
distance from the jump-in point
HIGHLIGHT/CHALLENGE:I Metric does not always take into account how much time the
user needs to spend listening to access the relevant contentI Diversity of the informal multimedia contentI Search scenario no longer limited to factual information
Review of the challenges/our work for Informal SCR:
I Framework of retrieval experiment has to be setup: retrieval collections to be createdOur work: We collected new multimodal retrievalcollections via crowdsourcing
I ASR errors decrease IR resultsOur work: We examined deeper relationshipbetween ASR performance and results ranking
I Suitable segmentation is vitalOur work: We carry out experiments with varyingmethods
I Need for metrics that reflect all aspects of userexperienceOur work: We created a new set of metrics
Focus on spoken content in multimedia retrieval 7/48
Spoken Content Retrieval: historical perspectiveSpoken ContentPrepared Speech
InformalConversational
Speech
Broadcast News
Broadcast News
Lectures
Lectures
Meetings
Meetings
Informal Content
Informal Content
Internet TV,Podcast, Interview
Internet TV,Podcast, Interview
Broadcast News:
I DataI High quality recordings:
I Often soundproof studioI Speaker - professional presenter
I Well defined structureI Query is on a certain topic:
User is ready to listen to the whole section
I Experiments: TREC SDR (1997-2000)I Known-item search and ad-hoc retrievalI Search with and without fixed story boundaries
I Evaluation: interest in rank position
HIGHLIGHT: ”Success story” (Garofolo et al., 2000):Performance on ASR Transcript ≈ Manual Transcript
I ASR good: large amounts of training dataI Data structure
CHALLENGE:Speech data in broadcast news is close to the written text,and differs from the informal content of spontaneous speech
Lectures:I Data:
I Prepared presentations containingconversational style features:hesitations, mispronunciations
I Specialized vocabularyI Out-Of-Vocabulary wordsI Lecture specific words may have low
probability scores in the ASR languagemodel
I Additional information available:presentation slides, textbooks
I Experiments:I Lectures browsing:
e.g. TalkMiner, MIT lectures, eLecturesI SpokenDoc(2) Tasks at NTCIR-9, NTCIR-10:
e.g. IR experiments, evaluation metrics thatassess topic segmentation methods
HIGHLIGHT/CHALLENGE:I Focus on segmentation methods, jump-in
points
Meetings:
I Data features:I Mixture of semi-formal and prepared spoken
contentI Additional data: slides, minutes
I Possible real life motivated scenario:I Jump-in points where discussion on topic
started or a decision point is reachedI Opinion of a certain person or person with a
certain roleI Search for all relevant (parts of) meetings
where topic was discussed
I Experiments:I topic segmentation, browsingI summarization
HIGHLIGHT/CHALLENGE:I No unified search scenarioI We created a test retrieval collection on the basis of AMI
corpus and set up a task scenario ourselves
Informal Content (Interviews, Internet TV):I Data features:
I Varying quality: semi- andnon-professional data creators
I Additional data: professionally oruser-generated metadata
I Experiments:I CLEF CL-SR: MALACH collection
I un/known-boundaries, ad-hoc taskI MediaEval’11,’12,’13: retrieval of
semi-professional multimedia contentI known-item task, unknown
boundariesI Metrics: focus on ranking and penalize
distance from the jump-in point
HIGHLIGHT/CHALLENGE:I Metric does not always take into account how much time the
user needs to spend listening to access the relevant contentI Diversity of the informal multimedia contentI Search scenario no longer limited to factual information
Review of the challenges/our work for Informal SCR:
I Framework of retrieval experiment has to be setup: retrieval collections to be createdOur work: We collected new multimodal retrievalcollections via crowdsourcing
I ASR errors decrease IR resultsOur work: We examined deeper relationshipbetween ASR performance and results ranking
I Suitable segmentation is vitalOur work: We carry out experiments with varyingmethods
I Need for metrics that reflect all aspects of userexperienceOur work: We created a new set of metrics
Focus on spoken content in multimedia retrieval 7/48
Spoken Content Retrieval: historical perspectiveSpoken ContentPrepared Speech
InformalConversational
Speech
Broadcast NewsBroadcast News
Lectures
Lectures
Meetings
Meetings
Informal Content
Informal Content
Internet TV,Podcast, Interview
Internet TV,Podcast, Interview
Broadcast News:
I DataI High quality recordings:
I Often soundproof studioI Speaker - professional presenter
I Well defined structureI Query is on a certain topic:
User is ready to listen to the whole section
I Experiments: TREC SDR (1997-2000)I Known-item search and ad-hoc retrievalI Search with and without fixed story boundaries
I Evaluation: interest in rank position
HIGHLIGHT: ”Success story” (Garofolo et al., 2000):Performance on ASR Transcript ≈ Manual Transcript
I ASR good: large amounts of training dataI Data structure
CHALLENGE:Speech data in broadcast news is close to the written text,and differs from the informal content of spontaneous speech
Lectures:I Data:
I Prepared presentations containingconversational style features:hesitations, mispronunciations
I Specialized vocabularyI Out-Of-Vocabulary wordsI Lecture specific words may have low
probability scores in the ASR languagemodel
I Additional information available:presentation slides, textbooks
I Experiments:I Lectures browsing:
e.g. TalkMiner, MIT lectures, eLecturesI SpokenDoc(2) Tasks at NTCIR-9, NTCIR-10:
e.g. IR experiments, evaluation metrics thatassess topic segmentation methods
HIGHLIGHT/CHALLENGE:I Focus on segmentation methods, jump-in
points
Meetings:
I Data features:I Mixture of semi-formal and prepared spoken
contentI Additional data: slides, minutes
I Possible real life motivated scenario:I Jump-in points where discussion on topic
started or a decision point is reachedI Opinion of a certain person or person with a
certain roleI Search for all relevant (parts of) meetings
where topic was discussed
I Experiments:I topic segmentation, browsingI summarization
HIGHLIGHT/CHALLENGE:I No unified search scenarioI We created a test retrieval collection on the basis of AMI
corpus and set up a task scenario ourselves
Informal Content (Interviews, Internet TV):I Data features:
I Varying quality: semi- andnon-professional data creators
I Additional data: professionally oruser-generated metadata
I Experiments:I CLEF CL-SR: MALACH collection
I un/known-boundaries, ad-hoc taskI MediaEval’11,’12,’13: retrieval of
semi-professional multimedia contentI known-item task, unknown
boundariesI Metrics: focus on ranking and penalize
distance from the jump-in point
HIGHLIGHT/CHALLENGE:I Metric does not always take into account how much time the
user needs to spend listening to access the relevant contentI Diversity of the informal multimedia contentI Search scenario no longer limited to factual information
Review of the challenges/our work for Informal SCR:
I Framework of retrieval experiment has to be setup: retrieval collections to be createdOur work: We collected new multimodal retrievalcollections via crowdsourcing
I ASR errors decrease IR resultsOur work: We examined deeper relationshipbetween ASR performance and results ranking
I Suitable segmentation is vitalOur work: We carry out experiments with varyingmethods
I Need for metrics that reflect all aspects of userexperienceOur work: We created a new set of metrics
Focus on spoken content in multimedia retrieval 7/48
Spoken Content Retrieval: historical perspectiveSpoken ContentPrepared Speech
InformalConversational
Speech
Broadcast NewsBroadcast News
Lectures
Lectures
Meetings
Meetings
Informal Content
Informal Content
Internet TV,Podcast, Interview
Internet TV,Podcast, Interview
Broadcast News:
I DataI High quality recordings:
I Often soundproof studioI Speaker - professional presenter
I Well defined structureI Query is on a certain topic:
User is ready to listen to the whole section
I Experiments: TREC SDR (1997-2000)I Known-item search and ad-hoc retrievalI Search with and without fixed story boundaries
I Evaluation: interest in rank position
HIGHLIGHT: ”Success story” (Garofolo et al., 2000):Performance on ASR Transcript ≈ Manual Transcript
I ASR good: large amounts of training dataI Data structure
CHALLENGE:Speech data in broadcast news is close to the written text,and differs from the informal content of spontaneous speech
Lectures:I Data:
I Prepared presentations containingconversational style features:hesitations, mispronunciations
I Specialized vocabularyI Out-Of-Vocabulary wordsI Lecture specific words may have low
probability scores in the ASR languagemodel
I Additional information available:presentation slides, textbooks
I Experiments:I Lectures browsing:
e.g. TalkMiner, MIT lectures, eLecturesI SpokenDoc(2) Tasks at NTCIR-9, NTCIR-10:
e.g. IR experiments, evaluation metrics thatassess topic segmentation methods
HIGHLIGHT/CHALLENGE:I Focus on segmentation methods, jump-in
points
Meetings:
I Data features:I Mixture of semi-formal and prepared spoken
contentI Additional data: slides, minutes
I Possible real life motivated scenario:I Jump-in points where discussion on topic
started or a decision point is reachedI Opinion of a certain person or person with a
certain roleI Search for all relevant (parts of) meetings
where topic was discussed
I Experiments:I topic segmentation, browsingI summarization
HIGHLIGHT/CHALLENGE:I No unified search scenarioI We created a test retrieval collection on the basis of AMI
corpus and set up a task scenario ourselves
Informal Content (Interviews, Internet TV):I Data features:
I Varying quality: semi- andnon-professional data creators
I Additional data: professionally oruser-generated metadata
I Experiments:I CLEF CL-SR: MALACH collection
I un/known-boundaries, ad-hoc taskI MediaEval’11,’12,’13: retrieval of
semi-professional multimedia contentI known-item task, unknown
boundariesI Metrics: focus on ranking and penalize
distance from the jump-in point
HIGHLIGHT/CHALLENGE:I Metric does not always take into account how much time the
user needs to spend listening to access the relevant contentI Diversity of the informal multimedia contentI Search scenario no longer limited to factual information
Review of the challenges/our work for Informal SCR:
I Framework of retrieval experiment has to be setup: retrieval collections to be createdOur work: We collected new multimodal retrievalcollections via crowdsourcing
I ASR errors decrease IR resultsOur work: We examined deeper relationshipbetween ASR performance and results ranking
I Suitable segmentation is vitalOur work: We carry out experiments with varyingmethods
I Need for metrics that reflect all aspects of userexperienceOur work: We created a new set of metrics
Focus on spoken content in multimedia retrieval 7/48
Spoken Content Retrieval: historical perspectiveSpoken ContentPrepared Speech
InformalConversational
Speech
Broadcast News
Broadcast News
LecturesLectures
Meetings
Meetings
Informal Content
Informal Content
Internet TV,Podcast, Interview
Internet TV,Podcast, Interview
Broadcast News:
I DataI High quality recordings:
I Often soundproof studioI Speaker - professional presenter
I Well defined structureI Query is on a certain topic:
User is ready to listen to the whole section
I Experiments: TREC SDR (1997-2000)I Known-item search and ad-hoc retrievalI Search with and without fixed story boundaries
I Evaluation: interest in rank position
HIGHLIGHT: ”Success story” (Garofolo et al., 2000):Performance on ASR Transcript ≈ Manual Transcript
I ASR good: large amounts of training dataI Data structure
CHALLENGE:Speech data in broadcast news is close to the written text,and differs from the informal content of spontaneous speech
Lectures:I Data:
I Prepared presentations containingconversational style features:hesitations, mispronunciations
I Specialized vocabularyI Out-Of-Vocabulary wordsI Lecture specific words may have low
probability scores in the ASR languagemodel
I Additional information available:presentation slides, textbooks
I Experiments:I Lectures browsing:
e.g. TalkMiner, MIT lectures, eLecturesI SpokenDoc(2) Tasks at NTCIR-9, NTCIR-10:
e.g. IR experiments, evaluation metrics thatassess topic segmentation methods
HIGHLIGHT/CHALLENGE:I Focus on segmentation methods, jump-in
points
Meetings:
I Data features:I Mixture of semi-formal and prepared spoken
contentI Additional data: slides, minutes
I Possible real life motivated scenario:I Jump-in points where discussion on topic
started or a decision point is reachedI Opinion of a certain person or person with a
certain roleI Search for all relevant (parts of) meetings
where topic was discussed
I Experiments:I topic segmentation, browsingI summarization
HIGHLIGHT/CHALLENGE:I No unified search scenarioI We created a test retrieval collection on the basis of AMI
corpus and set up a task scenario ourselves
Informal Content (Interviews, Internet TV):I Data features:
I Varying quality: semi- andnon-professional data creators
I Additional data: professionally oruser-generated metadata
I Experiments:I CLEF CL-SR: MALACH collection
I un/known-boundaries, ad-hoc taskI MediaEval’11,’12,’13: retrieval of
semi-professional multimedia contentI known-item task, unknown
boundariesI Metrics: focus on ranking and penalize
distance from the jump-in point
HIGHLIGHT/CHALLENGE:I Metric does not always take into account how much time the
user needs to spend listening to access the relevant contentI Diversity of the informal multimedia contentI Search scenario no longer limited to factual information
Review of the challenges/our work for Informal SCR:
I Framework of retrieval experiment has to be setup: retrieval collections to be createdOur work: We collected new multimodal retrievalcollections via crowdsourcing
I ASR errors decrease IR resultsOur work: We examined deeper relationshipbetween ASR performance and results ranking
I Suitable segmentation is vitalOur work: We carry out experiments with varyingmethods
I Need for metrics that reflect all aspects of userexperienceOur work: We created a new set of metrics
Focus on spoken content in multimedia retrieval 7/48
Spoken Content Retrieval: historical perspectiveSpoken ContentPrepared Speech
InformalConversational
Speech
Broadcast News
Broadcast News
LecturesLectures
Meetings
Meetings
Informal Content
Informal Content
Internet TV,Podcast, Interview
Internet TV,Podcast, Interview
Broadcast News:
I DataI High quality recordings:
I Often soundproof studioI Speaker - professional presenter
I Well defined structureI Query is on a certain topic:
User is ready to listen to the whole section
I Experiments: TREC SDR (1997-2000)I Known-item search and ad-hoc retrievalI Search with and without fixed story boundaries
I Evaluation: interest in rank position
HIGHLIGHT: ”Success story” (Garofolo et al., 2000):Performance on ASR Transcript ≈ Manual Transcript
I ASR good: large amounts of training dataI Data structure
CHALLENGE:Speech data in broadcast news is close to the written text,and differs from the informal content of spontaneous speech
Lectures:I Data:
I Prepared presentations containingconversational style features:hesitations, mispronunciations
I Specialized vocabularyI Out-Of-Vocabulary wordsI Lecture specific words may have low
probability scores in the ASR languagemodel
I Additional information available:presentation slides, textbooks
I Experiments:I Lectures browsing:
e.g. TalkMiner, MIT lectures, eLecturesI SpokenDoc(2) Tasks at NTCIR-9, NTCIR-10:
e.g. IR experiments, evaluation metrics thatassess topic segmentation methods
HIGHLIGHT/CHALLENGE:I Focus on segmentation methods, jump-in
points
Meetings:
I Data features:I Mixture of semi-formal and prepared spoken
contentI Additional data: slides, minutes
I Possible real life motivated scenario:I Jump-in points where discussion on topic
started or a decision point is reachedI Opinion of a certain person or person with a
certain roleI Search for all relevant (parts of) meetings
where topic was discussed
I Experiments:I topic segmentation, browsingI summarization
HIGHLIGHT/CHALLENGE:I No unified search scenarioI We created a test retrieval collection on the basis of AMI
corpus and set up a task scenario ourselves
Informal Content (Interviews, Internet TV):I Data features:
I Varying quality: semi- andnon-professional data creators
I Additional data: professionally oruser-generated metadata
I Experiments:I CLEF CL-SR: MALACH collection
I un/known-boundaries, ad-hoc taskI MediaEval’11,’12,’13: retrieval of
semi-professional multimedia contentI known-item task, unknown
boundariesI Metrics: focus on ranking and penalize
distance from the jump-in point
HIGHLIGHT/CHALLENGE:I Metric does not always take into account how much time the
user needs to spend listening to access the relevant contentI Diversity of the informal multimedia contentI Search scenario no longer limited to factual information
Review of the challenges/our work for Informal SCR:
I Framework of retrieval experiment has to be setup: retrieval collections to be createdOur work: We collected new multimodal retrievalcollections via crowdsourcing
I ASR errors decrease IR resultsOur work: We examined deeper relationshipbetween ASR performance and results ranking
I Suitable segmentation is vitalOur work: We carry out experiments with varyingmethods
I Need for metrics that reflect all aspects of userexperienceOur work: We created a new set of metrics
Focus on spoken content in multimedia retrieval 7/48
Spoken Content Retrieval: historical perspectiveSpoken ContentPrepared Speech
InformalConversational
Speech
Broadcast News
Broadcast News
Lectures
Lectures
MeetingsMeetings
Informal Content
Informal Content
Internet TV,Podcast, Interview
Internet TV,Podcast, Interview
Broadcast News:
I DataI High quality recordings:
I Often soundproof studioI Speaker - professional presenter
I Well defined structureI Query is on a certain topic:
User is ready to listen to the whole section
I Experiments: TREC SDR (1997-2000)I Known-item search and ad-hoc retrievalI Search with and without fixed story boundaries
I Evaluation: interest in rank position
HIGHLIGHT: ”Success story” (Garofolo et al., 2000):Performance on ASR Transcript ≈ Manual Transcript
I ASR good: large amounts of training dataI Data structure
CHALLENGE:Speech data in broadcast news is close to the written text,and differs from the informal content of spontaneous speech
Lectures:I Data:
I Prepared presentations containingconversational style features:hesitations, mispronunciations
I Specialized vocabularyI Out-Of-Vocabulary wordsI Lecture specific words may have low
probability scores in the ASR languagemodel
I Additional information available:presentation slides, textbooks
I Experiments:I Lectures browsing:
e.g. TalkMiner, MIT lectures, eLecturesI SpokenDoc(2) Tasks at NTCIR-9, NTCIR-10:
e.g. IR experiments, evaluation metrics thatassess topic segmentation methods
HIGHLIGHT/CHALLENGE:I Focus on segmentation methods, jump-in
points
Meetings:
I Data features:I Mixture of semi-formal and prepared spoken
contentI Additional data: slides, minutes
I Possible real life motivated scenario:I Jump-in points where discussion on topic
started or a decision point is reachedI Opinion of a certain person or person with a
certain roleI Search for all relevant (parts of) meetings
where topic was discussed
I Experiments:I topic segmentation, browsingI summarization
HIGHLIGHT/CHALLENGE:I No unified search scenarioI We created a test retrieval collection on the basis of AMI
corpus and set up a task scenario ourselves
Informal Content (Interviews, Internet TV):I Data features:
I Varying quality: semi- andnon-professional data creators
I Additional data: professionally oruser-generated metadata
I Experiments:I CLEF CL-SR: MALACH collection
I un/known-boundaries, ad-hoc taskI MediaEval’11,’12,’13: retrieval of
semi-professional multimedia contentI known-item task, unknown
boundariesI Metrics: focus on ranking and penalize
distance from the jump-in point
HIGHLIGHT/CHALLENGE:I Metric does not always take into account how much time the
user needs to spend listening to access the relevant contentI Diversity of the informal multimedia contentI Search scenario no longer limited to factual information
Review of the challenges/our work for Informal SCR:
I Framework of retrieval experiment has to be setup: retrieval collections to be createdOur work: We collected new multimodal retrievalcollections via crowdsourcing
I ASR errors decrease IR resultsOur work: We examined deeper relationshipbetween ASR performance and results ranking
I Suitable segmentation is vitalOur work: We carry out experiments with varyingmethods
I Need for metrics that reflect all aspects of userexperienceOur work: We created a new set of metrics
Focus on spoken content in multimedia retrieval 7/48
Spoken Content Retrieval: historical perspectiveSpoken ContentPrepared Speech
InformalConversational
Speech
Broadcast News
Broadcast News
Lectures
Lectures
MeetingsMeetings
Informal Content
Informal Content
Internet TV,Podcast, Interview
Internet TV,Podcast, Interview
Broadcast News:
I DataI High quality recordings:
I Often soundproof studioI Speaker - professional presenter
I Well defined structureI Query is on a certain topic:
User is ready to listen to the whole section
I Experiments: TREC SDR (1997-2000)I Known-item search and ad-hoc retrievalI Search with and without fixed story boundaries
I Evaluation: interest in rank position
HIGHLIGHT: ”Success story” (Garofolo et al., 2000):Performance on ASR Transcript ≈ Manual Transcript
I ASR good: large amounts of training dataI Data structure
CHALLENGE:Speech data in broadcast news is close to the written text,and differs from the informal content of spontaneous speech
Lectures:I Data:
I Prepared presentations containingconversational style features:hesitations, mispronunciations
I Specialized vocabularyI Out-Of-Vocabulary wordsI Lecture specific words may have low
probability scores in the ASR languagemodel
I Additional information available:presentation slides, textbooks
I Experiments:I Lectures browsing:
e.g. TalkMiner, MIT lectures, eLecturesI SpokenDoc(2) Tasks at NTCIR-9, NTCIR-10:
e.g. IR experiments, evaluation metrics thatassess topic segmentation methods
HIGHLIGHT/CHALLENGE:I Focus on segmentation methods, jump-in
points
Meetings:
I Data features:I Mixture of semi-formal and prepared spoken
contentI Additional data: slides, minutes
I Possible real life motivated scenario:I Jump-in points where discussion on topic
started or a decision point is reachedI Opinion of a certain person or person with a
certain roleI Search for all relevant (parts of) meetings
where topic was discussed
I Experiments:I topic segmentation, browsingI summarization
HIGHLIGHT/CHALLENGE:I No unified search scenarioI We created a test retrieval collection on the basis of AMI
corpus and set up a task scenario ourselves
Informal Content (Interviews, Internet TV):I Data features:
I Varying quality: semi- andnon-professional data creators
I Additional data: professionally oruser-generated metadata
I Experiments:I CLEF CL-SR: MALACH collection
I un/known-boundaries, ad-hoc taskI MediaEval’11,’12,’13: retrieval of
semi-professional multimedia contentI known-item task, unknown
boundariesI Metrics: focus on ranking and penalize
distance from the jump-in point
HIGHLIGHT/CHALLENGE:I Metric does not always take into account how much time the
user needs to spend listening to access the relevant contentI Diversity of the informal multimedia contentI Search scenario no longer limited to factual information
Review of the challenges/our work for Informal SCR:
I Framework of retrieval experiment has to be setup: retrieval collections to be createdOur work: We collected new multimodal retrievalcollections via crowdsourcing
I ASR errors decrease IR resultsOur work: We examined deeper relationshipbetween ASR performance and results ranking
I Suitable segmentation is vitalOur work: We carry out experiments with varyingmethods
I Need for metrics that reflect all aspects of userexperienceOur work: We created a new set of metrics
Focus on spoken content in multimedia retrieval 8/48
Outline
I Spoken Content Retrieval: historical perspective
I MediaEval Benchmark:
I 3 years of Spoken Content Retrieval experiments: Rich Speech Retrieval and Search and Hyperlinking tasks
I Dataset collection creation issues for multimedia retrieval: crowdsourcing aspect
I Interesting observations on results:
I Segmentation methods
I Evaluation metrics
I Numbers
Focus on spoken content in multimedia retrieval 9/48
MediaEval
Multimedia Evaluation benchmarking initiative
I Evaluates new algorithms for multimedia access and retrieval.
I Emphasizes the "multi" in multimedia: speech, audio, visual content, tags, users, context.
I Innovates new tasks and techniques focusing on the human and social aspects of multimedia content.
Focus on spoken content in multimedia retrieval 10/48
MediaEval 2011
Rich Speech Retrieval (RSR) Task
I Task Goal:
I Information to be found - a combination of the required audio and visual content, and the speaker's intention
Focus on spoken content in multimedia retrieval 12/48
MediaEval 2011
Rich Speech Retrieval (RSR) Task
Conventional retrieval:
Transcript 1 ≠ Transcript 2
Meaning 1 ≠ Meaning 2
Focus on spoken content in multimedia retrieval 13/48
MediaEval 2011
Rich Speech Retrieval (RSR) Task
Extended speech retrieval:
Transcript 1 = Transcript 2
Meaning 1 ≠ Meaning 2
Speech act 1 ≠ Speech act 2
Focus on spoken content in multimedia retrieval 14/48
MediaEval 2012-2013: Search and Hyperlinking (S&H) Task Background
Focus on spoken content in multimedia retrieval 15/48
MediaEval 2012-2013: S&H Task
Focus on spoken content in multimedia retrieval 16/48
MediaEval 2012-2013: S&H Task and Crowdsourcing
Focus on spoken content in multimedia retrieval 17/48
Outline
I Spoken Content Retrieval: historical perspective
I MediaEval Benchmark:
I 3 years of Spoken Content Retrieval experiments: Rich Speech Retrieval and Search and Hyperlinking tasks
I Dataset collection creation issues for multimedia retrieval: crowdsourcing aspect
I Interesting observations on results:
I Segmentation methods
I Evaluation metrics
I Numbers
Focus on spoken content in multimedia retrieval 18/48
What is crowdsourcing?
I Crowdsourcing is a form of human computation.
I Human computation is a method of having people do things that we might consider assigning to a computing device, e.g. a language translation task.
I A crowdsourcing system facilitates a crowdsourcing process.
I Factors to take into account:
I Sufficient number of workers
I Level of payment
I Clear instructions
I Possible cheating
Focus on spoken content in multimedia retrieval 19/48
Results assessment
I Number of accepted HITs ≠ number of collected queries
I No overlap of workers in dev and test sets
I Creative work - Creative Cheating:
I Copy and paste provided examples
→ Examples should be pictures, not texts
I Choose the option of no speech act found in the video
→ Manual assessment by requester needed
I Workers rarely find noteworthy content later than the third minute from the start of the playback point in the video
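These screening heuristics can be scripted before the manual pass; below is a minimal sketch in Python, assuming hypothetical record fields (worker_id, query_text, no_speech_act_found) and an illustrative example-text list — not the actual pipeline used for the MediaEval collections.

# Minimal sketch of the screening heuristics above. Field names and the
# example-text list are hypothetical; not the actual MediaEval pipeline.

EXAMPLE_TEXTS = {
    "how do i bake bread at home",  # illustrative example query from the HIT instructions
}

def screen_hits(hits, dev_workers, test_workers):
    """Split HIT records into accepted queries and records needing manual review."""
    overlap = dev_workers & test_workers  # workers must not appear in both sets
    accepted, needs_review = [], []
    for hit in hits:
        query = hit["query_text"].strip().lower()
        if query in EXAMPLE_TEXTS:
            needs_review.append((hit, "copied a provided example"))
        elif hit.get("no_speech_act_found"):
            needs_review.append((hit, "'no speech act found': requester must verify"))
        elif hit["worker_id"] in overlap:
            needs_review.append((hit, "worker appears in both dev and test sets"))
        else:
            accepted.append(hit)
    return accepted, needs_review

# Usage: accepted, review = screen_hits(hits, {"w1", "w2"}, {"w3"})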
Focus on spoken content in multimedia retrieval 20/48
Crowdsourcing issues
for multimedia retrieval collection creation
I It is possible to crowdsource extensive and complex tasks to support speech and language resources
I Use concepts and vocabulary familiar to the workers
I Pay attention to technical issues of watching the video
I Video preprocessing into smaller segments
I Creative work demands a higher reward level, or just a more flexible system
I High level of wastage due to task complexity
Focus on spoken content in multimedia retrieval 21/48
Outline
I Spoken Content Retrieval: historical perspective
I MediaEval Benchmark:
I 3 years of Spoken Content Retrieval experiments: Rich Speech Retrieval and Search and Hyperlinking tasks
I Dataset collection creation issues for multimedia retrieval: crowdsourcing aspect
I Interesting observations on results:
I Segmentation methods
I Evaluation metrics
I Numbers
Focus on spoken content in multimedia retrieval 22/48
Dataset segment representation
Focus on spoken content in multimedia retrieval 23/48
Approach 1: Fixed length segmentation
I Fixed length segmentation
I Number of words (including/excluding stop words)
I Time slots
I Fixed length segmentation with sliding window (see the sketch below)
I Post-processing
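As referenced above, a minimal sketch of fixed-length segmentation with a sliding window, assuming the transcript is given as (start_time_in_seconds, word) pairs; the 60 s window and 30 s step are illustrative values, not those used in the experiments.

# Minimal sketch: fixed-length segmentation with a sliding window over a
# time-stamped transcript, given as (start_time_in_seconds, word) pairs.
# Window and step sizes are illustrative, not the values used in the talk.

def sliding_window_segments(transcript, window=60.0, step=30.0):
    """Yield (segment_start, segment_end, words) for overlapping windows."""
    if not transcript:
        return
    end_of_audio = transcript[-1][0]
    start = 0.0
    while start <= end_of_audio:
        stop = start + window
        words = [w for t, w in transcript if start <= t < stop]
        if words:  # skip windows that fall entirely in silence
            yield (start, stop, words)
        start += step

# Usage: each yielded segment becomes one indexable "document" for the IR system.
transcript = [(0.5, "welcome"), (1.2, "to"), (1.4, "the"), (1.9, "show")]
for seg_start, seg_stop, words in sliding_window_segments(transcript):
    print(seg_start, seg_stop, " ".join(words))

Overlapping windows trade a larger index for a better chance that some segment starts close to the true jump-in point.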
Focus on spoken content in multimedia retrieval 28/48
Approach 2: Flexible length segmentation
I Speech or video units of varying length
I Speech: sentences, speech segments, silence points, changes of speakers
I Video: shots
I Topical segmentation
I Lexical cohesion - C99, TextTiling (see the sketch below)
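For the lexical-cohesion family, here is a minimal sketch using NLTK's TextTiling implementation; the toy transcript and the w/k parameters are illustrative. TextTiling expects blank-line paragraph breaks, reasonably long input, and NLTK's stopword list to be downloaded.

# Minimal sketch of topical segmentation by lexical cohesion (TextTiling),
# using NLTK's implementation; the toy transcript is a stand-in for real
# ASR output joined into paragraph-separated text.
# Requires: pip install nltk, then nltk.download('stopwords')
from nltk.tokenize import TextTilingTokenizer

transcript = "\n\n".join([
    "flour eggs butter pancake recipe kitchen breakfast batter " * 12,
    "pancake frying pan syrup breakfast batter kitchen recipe " * 12,
    "football match referee goal penalty league season team " * 12,
    "players championship stadium league football season match " * 12,
])

tt = TextTilingTokenizer(w=20, k=5)  # pseudo-sentence size, block comparison size
segments = tt.tokenize(transcript)   # list of topically coherent chunks
for i, segment in enumerate(segments, 1):
    print(f"segment {i}: {segment[:60]!r}")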
Focus on spoken content in multimedia retrieval 29/48
Outline
I Spoken Content Retrieval: historical perspective
I MediaEval Benchmark:
I 3 years of Spoken Content Retrieval experiments: Rich Speech Retrieval and Search and Hyperlinking tasks
I Dataset collection creation issues for multimedia retrieval: crowdsourcing aspect
I Interesting observations on results:
I Segmentation methods
I Evaluation metrics
I Numbers
Focus on spoken content in multimedia retrieval 30/48
Evaluation: Search sub-task
Focus on spoken content in multimedia retrieval 34/48
Evaluation: Search sub-task
I Mean Reciprocal Rank (MRR):
RR = 1 / RANK
I Mean Generalized Average Precision (mGAP):
GAP = 1 / (RANK · PENALTY)
where the penalty grows with the distance between the retrieved entry point and the relevant jump-in point
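A minimal sketch of the two per-query scores follows; the exponential penalty over the start-time offset is an illustrative choice, not the official task definition, and averaging over all queries yields MRR and mGAP.

# Minimal sketch of per-query RR and GAP. The exponential penalty over the
# start-time offset (in seconds) is an illustrative choice, not the official
# MediaEval definition; MRR and mGAP are the means over all queries.

def reciprocal_rank(rank):
    """RR = 1 / RANK of the first relevant result; 0 if nothing relevant found."""
    return 1.0 / rank if rank else 0.0

def generalized_ap(rank, offset_seconds, half_life=60.0):
    """GAP = 1 / (RANK * PENALTY), with PENALTY >= 1 growing with the offset."""
    if not rank:
        return 0.0
    penalty = 2.0 ** (offset_seconds / half_life)  # doubles every half_life seconds
    return 1.0 / (rank * penalty)

# One query: relevant item at rank 3, entry point 90 s from the jump-in point.
print(reciprocal_rank(3))       # 0.3333...
print(generalized_ap(3, 90.0))  # ~0.1179, discounted for the 90 s offset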
Focus on spoken content in multimedia retrieval 35/48
Evaluation: Search sub-task
I Mean Average Segment Precision (MASP): ranking + length of (ir)relevant content
Segment Precision (SP[r]) at rank r: time-based precision, i.e. the length of relevant content retrieved down to rank r over the total length retrieved
Average Segment Precision:
ASP = (1/n) · Σ_{r=1}^{N} SP[r] · rel(s_r)
where rel(s_r) = 1 if segment s_r contains relevant content, otherwise rel(s_r) = 0, and n is the number of segments containing relevant content
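A minimal sketch under the reading above, with each ranked segment given as (segment length, length of relevant content within it); the input format is illustrative, not the official scorer's. MASP is then the mean of ASP over all queries.

# Minimal sketch of Average Segment Precision under the reading above:
# each ranked result is (segment_length_s, relevant_length_s); SP[r] is the
# time-based precision over ranks 1..r. Illustrative, not the official scorer.

def average_segment_precision(ranked):
    """ranked: list of (segment_length_s, relevant_length_s) in rank order."""
    total_time = 0.0
    relevant_time = 0.0
    sp_sum = 0.0
    n_relevant = sum(1 for _, rel in ranked if rel > 0)
    if n_relevant == 0:
        return 0.0
    for seg_len, rel_len in ranked:
        total_time += seg_len
        relevant_time += rel_len
        sp = relevant_time / total_time  # SP[r]
        if rel_len > 0:                  # rel(s_r) = 1
            sp_sum += sp
    return sp_sum / n_relevant           # ASP = (1/n) * sum of SP[r] * rel(s_r)

# Two relevant segments ranked 1 and 3; rank 2 is irrelevant filler.
print(average_segment_precision([(60, 30), (60, 0), (120, 60)]))  # 0.4375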
Focus on spoken content in multimedia retrieval 36/48
Evaluation: Search sub-task
Focus on Precision/Recall of the relevant content within the retrieved segment.
Focus on spoken content in multimedia retrieval 37/48
Outline
I Spoken Content Retrieval: historical perspective
I MediaEval Benchmark:
I 3 years of Spoken Content Retrieval experiments: Rich Speech Retrieval and Search and Hyperlinking tasks
I Dataset collection creation issues for multimedia retrieval: crowdsourcing aspect
I Interesting observations on results:
I Segmentation methods
I Evaluation metrics
I Numbers
Focus on spoken content in multimedia retrieval 38/48
Experiments (RSR): Spontaneous Speech Search
Relationship Between Retrieval Effectiveness and Segmentation Methods
Segment:
I 100 % Recall of the relevant content
I High Precision (30 %, 56 %) of the relevant content
I Topic consistency
Focus on spoken content in multimedia retrieval 39/48
Experiments (RSR): Spontaneous Speech Search
Relationship Between Retrieval Effectiveness and Segmentation Methods
[Results figures omitted in this transcript]
Focus on spoken content in multimedia retrieval 46/48
Experiments (S&H)
I Fixed length segmentation with sliding window
I 2 transcripts (LIMSI, LIUM)
[Results figures omitted: LIMSI and LIUM transcripts]
Focus on spoken content in multimedia retrieval 47/48
Segmentation requirements for effective SCR
I Segmentation plays a significant role in retrieving relevant content
I High recall and precision of the relevant content within the segment lead to good segment ranking.
I Related metadata can be useful to improve the ranking of a segment with high recall that also contains non-relevant content.
I Influence of ASR quality:
I The effect of errors is not straightforward; it can be smoothed by the use of context and query-dependent treatment of the transcript.
I ASR system vocabulary variability: longer segments have higher MRR scores with the transcript of lower language variability (LIMSI), whereas shorter segments perform better with the transcript of higher language variability (LIUM).
I Multimodal queries: the addition of visual information decreases performance.
Focus on spoken content in multimedia retrieval 48/48
Thank you for your attention!
Questions?