48
ASR and alignment systems for multi-genre media data Peter Bell Automatic Speech Recognition— ASR Lecture 14 13 March 2017 ASR Lecture 14 ASR and alignment systems for multi-genre media data 1

ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

ASR and alignment systems for multi-genre mediadata

Peter Bell

Automatic Speech Recognition— ASR Lecture 1413 March 2017

ASR Lecture 14 ASR and alignment systems for multi-genre media data 1

Page 2: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

This lecture

The MGB Challenge

Building ASR systems from captioned TV broadcasts

Lightly supervised alignment

Speech activity detection

ASR Lecture 14 ASR and alignment systems for multi-genre media data 2

Page 3: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

What are we working on in CSTR?

Topics Wide domain coverage, understanding diverse data,cross-lingual recognition, environment and speakermodelling

Methods Deep learning, canonical models, adaptation,factorisation, generalisation

Applications Talks and lectures, TV broadcasts, multipartymeetings, spoken dialogue systems

ASR Lecture 14 ASR and alignment systems for multi-genre media data 3

Page 4: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

What are we working on in CSTR?

Topics Wide domain coverage, understanding diverse data,cross-lingual recognition, environment and speakermodelling

Methods Deep learning, canonical models, adaptation,factorisation, generalisation

Applications Talks and lectures, TV broadcasts, multipartymeetings, spoken dialogue systems

ASR Lecture 14 ASR and alignment systems for multi-genre media data 3

Page 5: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Case study: multi-genre TV broadcasts

Automatic speech processing of TV broadcasts has an obviouscommercial need, but is still very difficult for current systems

ASR Lecture 14 ASR and alignment systems for multi-genre media data 4

Page 6: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

The MGB Challenge

We proposed an open challenge to work on EnglishMulti-Genre Broadcast data at the 2015 ASRU workshop

Our aim was to encourage researchers from around to worldto work on this kind of data

Create a standard experimental setup so that cutting edgeresearch methods can be compared in a controlled setting

Repeated on Arabic TV data for the 2016 SLT workshop

ASR Lecture 14 ASR and alignment systems for multi-genre media data 5

Page 7: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Data supplied to all participants

1,600 hours of TV, taken from 7 complete weeks of BBCoutput over four channels, with accompanying subtitle text

600M words of subtitle text from 1988 onwards

XML metadata for all shows, generated in a standard format

Data supplied freely for the purpose of participation in thechallenge

ASR Lecture 14 ASR and alignment systems for multi-genre media data 6

Page 8: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Why is this task difficult

Many different background noise conditions

Diverse range of accents and speaking styles – including fastdramatic speech, and natural, spontaneous speech

Speaker identities are usually not known

Although lots of training data is available, the captionsavailable are not very accurate.

ASR Lecture 14 ASR and alignment systems for multi-genre media data 7

Page 9: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Example...

Two contrasting programmes...

ASR Lecture 14 ASR and alignment systems for multi-genre media data 8

Page 10: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

The tasks

Transcription of multi-genre TV showswe supplied around 16 TV shows to be completely transcribedshow names and genre labels are providedsome shows are from series appearing in the training data;some are not

Subtitle alignmentfor the same shows as Task 1, the subtitle text as originallybroadcast were providedthese differ from the verbatim audio content for a range ofreasonsparticipants must produce time stamps for all words in thesubtitles

ASR Lecture 14 ASR and alignment systems for multi-genre media data 9

Page 11: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

The tasks

Longitudinal transcriptionaim to evaluate ASR in a realistic longitudinal settingparticipants transcribed complete TV series, where the outputfrom shows broadcast earlier could be used to adapt andenhance the performance of later shows

Longitudinal diarization and speaker linkingaim to label speakers uniquely across a complete seriesrealistic longitudinal setting again: participants must processshows sequentially in date order

ASR Lecture 14 ASR and alignment systems for multi-genre media data 10

Page 12: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Our ASR system

Some features of our best system:

Models trained on 640 hours of broadcasts

DNNs with 6 hidden layers, an input window of 9 frames and28k output states used in combination with CNNs with asimilar structure

Networks trained with cross-entropy criterion, followed byminimum Bayes risk full-sequence training

Training using a complex recipe of multiple iterations, with alltraining data re-aligned several times – the completeprocedure takes several weeks, even on GPU machines!

No speaker adaptation, but mean and variance normalisationused, based on speaker clusters

ASR Lecture 14 ASR and alignment systems for multi-genre media data 11

Page 13: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Our ASR system

Some features of our best system:

Models trained on 640 hours of broadcasts

DNNs with 6 hidden layers, an input window of 9 frames and28k output states used in combination with CNNs with asimilar structure

Networks trained with cross-entropy criterion, followed byminimum Bayes risk full-sequence training

Training using a complex recipe of multiple iterations, with alltraining data re-aligned several times – the completeprocedure takes several weeks, even on GPU machines!

No speaker adaptation, but mean and variance normalisationused, based on speaker clusters

ASR Lecture 14 ASR and alignment systems for multi-genre media data 11

Page 14: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Our ASR system

Some features of our best system:

Models trained on 640 hours of broadcasts

DNNs with 6 hidden layers, an input window of 9 frames and28k output states used in combination with CNNs with asimilar structure

Networks trained with cross-entropy criterion, followed byminimum Bayes risk full-sequence training

Training using a complex recipe of multiple iterations, with alltraining data re-aligned several times – the completeprocedure takes several weeks, even on GPU machines!

No speaker adaptation, but mean and variance normalisationused, based on speaker clusters

ASR Lecture 14 ASR and alignment systems for multi-genre media data 11

Page 15: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Our ASR system

Some features of our best system:

Models trained on 640 hours of broadcasts

DNNs with 6 hidden layers, an input window of 9 frames and28k output states used in combination with CNNs with asimilar structure

Networks trained with cross-entropy criterion, followed byminimum Bayes risk full-sequence training

Training using a complex recipe of multiple iterations, with alltraining data re-aligned several times – the completeprocedure takes several weeks, even on GPU machines!

No speaker adaptation, but mean and variance normalisationused, based on speaker clusters

ASR Lecture 14 ASR and alignment systems for multi-genre media data 11

Page 16: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Our ASR system

Some features of our best system:

Models trained on 640 hours of broadcasts

DNNs with 6 hidden layers, an input window of 9 frames and28k output states used in combination with CNNs with asimilar structure

Networks trained with cross-entropy criterion, followed byminimum Bayes risk full-sequence training

Training using a complex recipe of multiple iterations, with alltraining data re-aligned several times – the completeprocedure takes several weeks, even on GPU machines!

No speaker adaptation, but mean and variance normalisationused, based on speaker clusters

ASR Lecture 14 ASR and alignment systems for multi-genre media data 11

Page 17: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Some results on development data

System 3gram 4gram

210 hours training data

GMM 53.1 -DNN 40.9 37.4+ sequence training 37.1 33.7

640 hours training data

Final DNN 31.3 28.2Final CNN 30.8 28.0

ROVER 30.1 27.3

ASR Lecture 14 ASR and alignment systems for multi-genre media data 12

Page 18: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Using broadcast captions for training

Problems with using closed captions as training data labels:

Timings may not be accurate

Not all words spoken are captioned

Words may appear in the captions that were never actuallyspoken

Limited speaker information is available (in the form of colourchanges in the subtitles)

he loves your PICTURES SO MUCH he thinks YOU'RE GONNA do INCREDIBLY well in milan

he loves your ******** ** PICTURE he thinks ****** YOU'LL do ********** well in milan

ASR Lecture 14 ASR and alignment systems for multi-genre media data 13

Page 19: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Training acoustic models on TV data

The basic recipe:

1 Using the captions and a previous ASR system, identify wordsand their timings within the audio

2 Select a set of utterances to use in training

3 Generate a pronunciation for every word from a basedictionary, and use this to create a phone alignment for eachutterance

4 Train GMM and then DNN models using these phonealignments, frequently re-aligning the data

ASR Lecture 14 ASR and alignment systems for multi-genre media data 14

Page 20: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Lightly supervised training

The problem of identifying words from the captions and usingthem to update the models is an example of lightly supervisedtraining

We don’t have perfect labels for each training sample, but wedo know something about them

The main challenge is in identifying reliable labels andlearning from them, without also learning from unreliablelabels, or past mistakes

ASR Lecture 14 ASR and alignment systems for multi-genre media data 15

Page 21: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Lightly supervised training

A standard method [Braunschweiler et al]:

1 Train an biased language model on the captions, interpolatedwith a background LM

p(wt |ht) = λpbias(wt |ht) + (1− λ)pbg (wt |ht)

2 Decode the training data with a pre-existing acoustic model,and the biased LM

3 Align the captions with the ASR output

4 Select utterances where there is a good match between thecaptions and the automatic output

ASR Lecture 14 ASR and alignment systems for multi-genre media data 16

Page 22: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Lightly supervised training

A standard method [Braunschweiler et al]:

1 Train an biased language model on the captions, interpolatedwith a background LM

p(wt |ht) = λpbias(wt |ht) + (1− λ)pbg (wt |ht)

2 Decode the training data with a pre-existing acoustic model,and the biased LM

3 Align the captions with the ASR output

4 Select utterances where there is a good match between thecaptions and the automatic output

ASR Lecture 14 ASR and alignment systems for multi-genre media data 16

Page 23: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Lightly supervised training

A standard method [Braunschweiler et al]:

1 Train an biased language model on the captions, interpolatedwith a background LM

p(wt |ht) = λpbias(wt |ht) + (1− λ)pbg (wt |ht)

2 Decode the training data with a pre-existing acoustic model,and the biased LM

3 Align the captions with the ASR output

4 Select utterances where there is a good match between thecaptions and the automatic output

ASR Lecture 14 ASR and alignment systems for multi-genre media data 16

Page 24: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Lightly supervised training

A standard method [Braunschweiler et al]:

1 Train an biased language model on the captions, interpolatedwith a background LM

p(wt |ht) = λpbias(wt |ht) + (1− λ)pbg (wt |ht)

2 Decode the training data with a pre-existing acoustic model,and the biased LM

3 Align the captions with the ASR output

4 Select utterances where there is a good match between thecaptions and the automatic output

ASR Lecture 14 ASR and alignment systems for multi-genre media data 16

Page 25: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Data selection by genre

0 20 40 60 80 100 120 140 160 180 2000

25

50

75

100

125

150

175

200

225

250

275

300

%PMER

hours

of data

advice

childrens

comedy

competition

documentary

drama

events

news

ASR Lecture 14 ASR and alignment systems for multi-genre media data 17

Page 26: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Data selection

0 20 40 60 80 100 120 140 160 180 2000

10

20

30

40

50

60

70

80

90

100

%PMER

%data

all (1005h)

advice(145h)

childrens(90h)

comedy(42h)

competition(129h)

documentary(134h)

drama(55h)

events(118h)

news(293h)

ASR Lecture 14 ASR and alignment systems for multi-genre media data 18

Page 27: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

An alternative alignment method

The biased LM approach is quite computationally costly, andcan lead to bias towards data that we can already recognisewell

We have used an alternative approach based on constructingweighted finite state transducers for each utterance

This allows us to use much stronger constraints – based onthe captions – at decoding time

ASR Lecture 14 ASR and alignment systems for multi-genre media data 19

Page 28: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

An alternative alignment method

The biased LM approach is quite computationally costly, andcan lead to bias towards data that we can already recognisewell

We have used an alternative approach based on constructingweighted finite state transducers for each utterance

This allows us to use much stronger constraints – based onthe captions – at decoding time

ASR Lecture 14 ASR and alignment systems for multi-genre media data 19

Page 29: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

An alternative alignment method

The biased LM approach is quite computationally costly, andcan lead to bias towards data that we can already recognisewell

We have used an alternative approach based on constructingweighted finite state transducers for each utterance

This allows us to use much stronger constraints – based onthe captions – at decoding time

ASR Lecture 14 ASR and alignment systems for multi-genre media data 19

Page 30: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Recap: ASR with WFTs

Most modern decoders use a transducer approach to combinethe acoustic model, lexicon and language model in a unifiedframeworkFind the lowest-cost path through a composed transducerH ◦ C ◦ L ◦ G

0

4/3.1205HIT:HIT/10.845

3/5.7855HE:HE/3.9108 2/3.3187

HANNAH:HANNAH/11.561

1/4.5261

#0:<eps>/4.1394

HIT:HIT/7.8698

HE:HE/5.6797

#0:<eps>/1.5777

HIT:HIT/6.7907HE:HE/8.1692

#0:<eps>/2.4009

HE:HE/6.2462

#0:<eps>/0.82158

HIT:HIT/8.4869

HE:HE/6.409

HANNAH:HANNAH/10.933

ASR Lecture 14 ASR and alignment systems for multi-genre media data 20

Page 31: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Alignment with WFSTs

A G transducer that allows any substring of the original captions –known as a factor transducer

0

1

HELLO:HELLO

#0:<eps>

2#0:<eps>

3#0:<eps>

4#0:<eps>

5#0:<eps>

6#0:<eps>

AND:AND

WELCOME:WELCOME

TO:TO

THE:THE

BOOK:BOOK

7QUIZ:QUIZ

ASR Lecture 14 ASR and alignment systems for multi-genre media data 21

Page 32: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Alignment with WFSTs

A determinized version of the G transducer

0

2#0:<eps>

1

HELLO:HELLO

3

AND:AND

8

WELCOME:WELCOME

7TO:TO

6THE:THE

5

QUIZ:QUIZ

4BOOK:BOOK

AND:AND

WELCOME:WELCOME

TO:TO

THE:THE

BOOK:BOOK

QUIZ:QUIZ

ASR Lecture 14 ASR and alignment systems for multi-genre media data 22

Page 33: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Alignment with WFSTs

What about when word appears in the captions that was notactually spoken? We need to alter the design to be robust to thisby allowing deletions (at a cost)

0

1

HELLO:HELLO/0.9

#0:<eps>/0.1

#0:<eps> 2

#0:<eps>

3#0:<eps>

4

#0:<eps>

5

#0:<eps>

6

#0:<eps>

AND:AND/0.9

#0:<eps>/0.1 WELCOME:WELCOME/0.9

#0:<eps>/0.1 TO:TO/0.9

#0:<eps>/0.1 THE:THE/0.9

#0:<eps>/0.1 BOOK:BOOK/0.9

#0:<eps>/0.1 7QUIZ:QUIZ/0.9

#0:<eps>/0.1

ASR Lecture 14 ASR and alignment systems for multi-genre media data 23

Page 34: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Alignment with WFSTs

A determinized version

0

1HELLO:HELLO/0.9

2

#0:<eps>

3AND:AND/0.9

#0:<eps>/0.1

AND:AND/0.9

4

BOOK:BOOK/0.9 5

QUIZ:QUIZ/0.9

6

THE:THE/0.9

7TO:TO/0.9

8WELCOME:WELCOME/0.9

9

#0:<eps>/0.1

WELCOME:WELCOME/0.9

#0:<eps>/0.1

QUIZ:QUIZ/0.9

#0:<eps>/0.1

BOOK:BOOK/0.9

#0:<eps>/0.1

THE:THE/0.9

#0:<eps>/0.1

TO:TO/0.9

#0:<eps>/0.1

BOOK:BOOK/0.9

QUIZ:QUIZ/0.9

THE:THE/0.9

TO:TO/0.9WELCOME:WELCOME/0.9

10

#0:<eps>/0.1

BOOK:BOOK/0.9

QUIZ:QUIZ/0.9

THE:THE/0.9TO:TO/0.9

11

#0:<eps>/0.1

BOOK:BOOK/0.9

QUIZ:QUIZ/0.9

THE:THE/0.9

12

#0:<eps>/0.1

BOOK:BOOK/0.9

QUIZ:QUIZ/0.9

13

#0:<eps>/0.1

QUIZ:QUIZ/0.9

#0:<eps>/0.1

ASR Lecture 14 ASR and alignment systems for multi-genre media data 24

Page 35: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

The complete alignment process

1 Decode with a factor-transducer for the each programme

2 Align the output to the original captions

3 Re-segment the data, to potentially include missed speech

4 Decode again with utterance-specific factor transducers,allowing word-skips

ASR Lecture 14 ASR and alignment systems for multi-genre media data 25

Page 36: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Another example

Spot how the automatically-aligned captions differ from the wordsactually spoken...

ASR Lecture 14 ASR and alignment systems for multi-genre media data 26

Page 37: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Scoring alignment for MGB

Scoring with respect to a forced-alignment ofhuman-generated verbatim transcription

Words spoken but not in the captions are ignored

For words in both, systems judged correct if supplied timingsare correct within a 100ms window

Evaluated in terms of f-score

P =Nmatch

Nhyp,R =

Nmatch

Nref,F = 2× P × R

P + R

Segments with overlapped speech are ignored

ASR Lecture 14 ASR and alignment systems for multi-genre media data 27

Page 38: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Our alignment results

System Precision Recall F-score

Preliminary DNN AMs

Pass 1 FT 0.8816 0.7629 0.8180+ force align 0.8290 0.7855 0.8066Pass 2 FT+skip 0.8679 0.8563 0.8620

Final DNN AMs

Pass 1 0.9009 0.8128 0.8546Pass 2 FT+skip 0.8856 0.9013 0.8934

ASR Lecture 14 ASR and alignment systems for multi-genre media data 28

Page 39: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Evaluation results

Participant F-score

Cambridge 0.900Edinburgh/Quorate 0.877CRIM 0.863Vocapia/LIMSI 0.846Sheffield 0.834NHK 0.797

ASR Lecture 14 ASR and alignment systems for multi-genre media data 29

Page 40: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Speech activity detection

SAD is the task of deciding which portions of the audiocontain speech

Aims to segment to audio into “reasonable length” utterances

It’s surprisingly difficult! We need good models for non-speechas well as speech

Training non-speech models on the TV data is effectivelyunsupervised learning, as we can’t be sure that uncaptionedportions of audio don’t actually contain speech

One solution is to train non-speech models only on the shortpauses between known words

ASR Lecture 14 ASR and alignment systems for multi-genre media data 30

Page 41: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Speech activity detection

SAD is the task of deciding which portions of the audiocontain speech

Aims to segment to audio into “reasonable length” utterances

It’s surprisingly difficult! We need good models for non-speechas well as speech

Training non-speech models on the TV data is effectivelyunsupervised learning, as we can’t be sure that uncaptionedportions of audio don’t actually contain speech

One solution is to train non-speech models only on the shortpauses between known words

ASR Lecture 14 ASR and alignment systems for multi-genre media data 30

Page 42: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Speech activity detection

SAD is the task of deciding which portions of the audiocontain speech

Aims to segment to audio into “reasonable length” utterances

It’s surprisingly difficult! We need good models for non-speechas well as speech

Training non-speech models on the TV data is effectivelyunsupervised learning, as we can’t be sure that uncaptionedportions of audio don’t actually contain speech

One solution is to train non-speech models only on the shortpauses between known words

ASR Lecture 14 ASR and alignment systems for multi-genre media data 30

Page 43: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Speech activity detection

SAD is the task of deciding which portions of the audiocontain speech

Aims to segment to audio into “reasonable length” utterances

It’s surprisingly difficult! We need good models for non-speechas well as speech

Training non-speech models on the TV data is effectivelyunsupervised learning, as we can’t be sure that uncaptionedportions of audio don’t actually contain speech

One solution is to train non-speech models only on the shortpauses between known words

ASR Lecture 14 ASR and alignment systems for multi-genre media data 30

Page 44: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Speech activity detection

SAD is the task of deciding which portions of the audiocontain speech

Aims to segment to audio into “reasonable length” utterances

It’s surprisingly difficult! We need good models for non-speechas well as speech

Training non-speech models on the TV data is effectivelyunsupervised learning, as we can’t be sure that uncaptionedportions of audio don’t actually contain speech

One solution is to train non-speech models only on the shortpauses between known words

ASR Lecture 14 ASR and alignment systems for multi-genre media data 30

Page 45: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

SAD with DNNs

Speech or non-speech?

Inputfeatures

ASR Lecture 14 ASR and alignment systems for multi-genre media data 31

Page 46: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Problems with this DNN

Sensitive to the (frequent) errors in the frame labels

It’s hard to learn a good hidden representation of speech whenmodelling only two output classes

We’re not making use of the knowledge we have from thelightly supervised alignment of the captions

ASR Lecture 14 ASR and alignment systems for multi-genre media data 32

Page 47: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Alternative multi-task architecture

Speech or non-speech?

Inputfeatures

Context-dependent

phoneme targets

ASR Lecture 14 ASR and alignment systems for multi-genre media data 33

Page 48: ASR and alignment systems for multi-genre media data...ASR Lecture 14 ASR and alignment systems for multi-genre media data16 Data selection by genre 0 20 40 60 80 100 120 140 160 180

Reading

P. Bell and S. Renals “A system for automatic alignment of broadcastmedia captions using weighted finite-state transducers,” in Proc. ASRU,2015.

P.C. Woodland et al. “Cambridge University transcription systems for theMulti-Genre Broadcast Challenge,” in Proc. ASRU, 2015.

P. Moreno and C. Alberti, “A factor automaton approach for the forcedalignment of long speech recordings,” in Proc. ICASSP, 2009.

N. Braunschweiler, M. Gales, and S. Buchholz, “Lightly supervisedrecognition for automatic alignment of large coherent speech recordings,”in Proc. Interspeech, 2010.

P. Bell et al. “The MGB Challenge: evaluating multi-genre broadcastmedia recognition” in Proc. ASRU, 2015.

ASR Lecture 14 ASR and alignment systems for multi-genre media data 34