ASR and alignment systems for multi-genre media data
Peter Bell
Automatic Speech Recognition — ASR Lecture 14, 13 March 2017
ASR Lecture 14: ASR and alignment systems for multi-genre media data
This lecture
The MGB Challenge
Building ASR systems from captioned TV broadcasts
Lightly supervised alignment
Speech activity detection
What are we working on in CSTR?
Topics: wide domain coverage, understanding diverse data, cross-lingual recognition, environment and speaker modelling
Methods: deep learning, canonical models, adaptation, factorisation, generalisation
Applications: talks and lectures, TV broadcasts, multiparty meetings, spoken dialogue systems
Case study: multi-genre TV broadcasts
Automatic speech processing of TV broadcasts has an obvious commercial need, but is still very difficult for current systems
The MGB Challenge
We proposed an open challenge to work on English Multi-Genre Broadcast data at the 2015 ASRU workshop
Our aim was to encourage researchers from around the world to work on this kind of data
Create a standard experimental setup so that cutting-edge research methods can be compared in a controlled setting
Repeated on Arabic TV data for the 2016 SLT workshop
Data supplied to all participants
1,600 hours of TV, taken from 7 complete weeks of BBC output over four channels, with accompanying subtitle text
600M words of subtitle text from 1988 onwards
XML metadata for all shows, generated in a standard format
Data supplied freely for the purpose of participation in the challenge
Why is this task difficult?
Many different background noise conditions
Diverse range of accents and speaking styles – including fast dramatic speech, and natural, spontaneous speech
Speaker identities are usually not known
Although lots of training data is available, the captions available are not very accurate.
Example...
Two contrasting programmes...
The tasks
Transcription of multi-genre TV shows: we supplied around 16 TV shows to be completely transcribed; show names and genre labels are provided; some shows are from series appearing in the training data, some are not
Subtitle alignment: for the same shows as Task 1, the subtitle text as originally broadcast was provided; these differ from the verbatim audio content for a range of reasons; participants must produce time stamps for all words in the subtitles
The tasks
Longitudinal transcription: aim to evaluate ASR in a realistic longitudinal setting; participants transcribed complete TV series, where the output from shows broadcast earlier could be used to adapt and enhance the performance on later shows
Longitudinal diarization and speaker linking: aim to label speakers uniquely across a complete series; a realistic longitudinal setting again, as participants must process shows sequentially in date order
Our ASR system
Some features of our best system:
Models trained on 640 hours of broadcasts
DNNs with 6 hidden layers, an input window of 9 frames and 28k output states, used in combination with CNNs with a similar structure
Networks trained with cross-entropy criterion, followed by minimum Bayes risk full-sequence training
Training using a complex recipe of multiple iterations, with all training data re-aligned several times – the complete procedure takes several weeks, even on GPU machines!
No speaker adaptation, but mean and variance normalisation used, based on speaker clusters
Some results on development data
Word error rates (%):

System               3-gram   4-gram
210 hours training data:
GMM                   53.1      -
DNN                   40.9     37.4
+ sequence training   37.1     33.7
640 hours training data:
Final DNN             31.3     28.2
Final CNN             30.8     28.0
ROVER                 30.1     27.3
Using broadcast captions for training
Problems with using closed captions as training data labels:
Timings may not be accurate
Not all words spoken are captioned
Words may appear in the captions that were never actually spoken
Limited speaker information is available (in the form of colour changes in the subtitles)
he loves your PICTURES SO MUCH he thinks YOU'RE GONNA do INCREDIBLY well in milan
he loves your ******** ** PICTURE he thinks ****** YOU'LL do ********** well in milan
Training acoustic models on TV data
The basic recipe:
1. Using the captions and a previous ASR system, identify words and their timings within the audio
2. Select a set of utterances to use in training
3. Generate a pronunciation for every word from a base dictionary, and use this to create a phone alignment for each utterance
4. Train GMM and then DNN models using these phone alignments, frequently re-aligning the data
Lightly supervised training
The problem of identifying words from the captions and using them to update the models is an example of lightly supervised training
We don’t have perfect labels for each training sample, but we do know something about them
The main challenge is in identifying reliable labels and learning from them, without also learning from unreliable labels, or past mistakes
Lightly supervised training
A standard method [Braunschweiler et al., 2010]:
1. Train a biased language model on the captions, interpolated with a background LM:

   p(w_t | h_t) = λ p_bias(w_t | h_t) + (1 − λ) p_bg(w_t | h_t)

2. Decode the training data with a pre-existing acoustic model, and the biased LM
3. Align the captions with the ASR output
4. Select utterances where there is a good match between the captions and the automatic output
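The selection step (4) can be sketched in a few lines, assuming we already have the caption text and the biased-LM decoding output for each utterance; the data structure and the 0.7 match threshold here are illustrative, not the values used in any actual system.

```python
from difflib import SequenceMatcher

def word_match_ratio(caption, hypothesis):
    """Similarity of caption and ASR hypothesis at the word level (0..1)."""
    return SequenceMatcher(None, caption.split(), hypothesis.split()).ratio()

def select_utterances(utterances, threshold=0.7):
    """Keep only utterances where captions and ASR output agree closely."""
    return [u for u in utterances
            if word_match_ratio(u["caption"], u["asr"]) >= threshold]

utterances = [
    {"caption": "he loves your pictures", "asr": "he loves your picture"},
    {"caption": "welcome to the book quiz", "asr": "totally different words"},
]
selected = select_utterances(utterances)  # only the first utterance survives
```

In practice the match statistic and threshold are tuned so that near-verbatim captions are kept while loosely edited ones are discarded.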
Data selection by genre
[Figure: hours of data selected as a function of the %PMER selection threshold, broken down by genre: advice, childrens, comedy, competition, documentary, drama, events, news]
Data selection
[Figure: percentage of data selected as a function of the %PMER threshold, per genre: all (1005h), advice (145h), childrens (90h), comedy (42h), competition (129h), documentary (134h), drama (55h), events (118h), news (293h)]
An alternative alignment method
The biased LM approach is quite computationally costly, and can lead to bias towards data that we can already recognise well
We have used an alternative approach based on constructing weighted finite state transducers for each utterance
This allows us to use much stronger constraints – based on the captions – at decoding time
Recap: ASR with WFSTs
Most modern decoders use a transducer approach to combine the acoustic model, lexicon and language model in a unified framework: find the lowest-cost path through a composed transducer H ∘ C ∘ L ∘ G
[Figure: example decoding WFST, with weighted word arcs (HIT, HE, HANNAH) and ε-arcs between states]
Alignment with WFSTs
A G transducer that allows any substring of the original captions – known as a factor transducer
[Figure: factor transducer for the caption “HELLO AND WELCOME TO THE BOOK QUIZ”, with ε-arcs from the start state allowing entry at any word]
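The factor idea can be illustrated in plain Python rather than OpenFst (a toy sketch, not the WFST machinery actually used): ε-arcs let a path enter at any word, every state is final, so exactly the contiguous substrings of the caption are accepted.

```python
def factor_arcs(caption_words):
    """Arcs of a (non-determinized) factor acceptor: a word arc i -> i+1
    for each caption word, plus an epsilon entry arc from state 0 to every
    internal state, with all states final -- mirroring the G transducer."""
    n = len(caption_words)
    word_arcs = [(i, i + 1, w) for i, w in enumerate(caption_words)]
    entry_arcs = [(0, i, "<eps>") for i in range(1, n)]
    return word_arcs + entry_arcs, set(range(n + 1))

def accepts(caption_words, hyp_words):
    """The factor automaton accepts exactly the contiguous substrings."""
    m = len(hyp_words)
    return any(caption_words[i:i + m] == hyp_words
               for i in range(len(caption_words) - m + 1))

caption = "HELLO AND WELCOME TO THE BOOK QUIZ".split()
```

Here `accepts(caption, "WELCOME TO THE".split())` is true while `accepts(caption, "HELLO QUIZ".split())` is false, since the latter skips words – which motivates the weighted skip arcs introduced on the later slides.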
Alignment with WFSTs
A determinized version of the G transducer
[Figure: determinized factor transducer for the same caption]
Alignment with WFSTs
What about when a word appears in the captions that was not actually spoken? We need to alter the design to be robust to this by allowing deletions (at a cost)
[Figure: factor transducer with a weighted ε-arc (cost 0.1) in parallel with each word arc (cost 0.9), allowing individual words to be skipped]
Alignment with WFSTs
A determinized version
[Figure: determinized version of the skip-enabled factor transducer]
The complete alignment process
1. Decode with a factor transducer for each programme
2. Align the output to the original captions
3. Re-segment the data, to potentially include missed speech
4. Decode again with utterance-specific factor transducers, allowing word-skips
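Step 2, aligning the decoder output back to the caption text, can be sketched with a standard sequence alignment; the (word, start, end) hypothesis format is hypothetical, and a real system would use a proper Levenshtein alignment rather than Python's difflib.

```python
from difflib import SequenceMatcher

def transfer_timestamps(caption_words, hyp):
    """hyp: list of (word, start, end) from the decoder. Copy timings onto
    caption words wherever the sequences match; unmatched words get None."""
    hyp_words = [w for w, _, _ in hyp]
    times = {}
    matcher = SequenceMatcher(None, caption_words, hyp_words)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            times[block.a + k] = (hyp[block.b + k][1], hyp[block.b + k][2])
    return [(w, times.get(i)) for i, w in enumerate(caption_words)]

caption = "HE LOVES YOUR PICTURES SO MUCH".split()
hyp = [("HE", 0.00, 0.20), ("LOVES", 0.20, 0.55),
       ("YOUR", 0.55, 0.70), ("PICTURE", 0.70, 1.10)]
aligned = transfer_timestamps(caption, hyp)
```

Caption words with no matching hypothesis word (here "PICTURES", "SO", "MUCH") keep a None timing and must be handled by the later skip-enabled pass.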
Another example
Spot how the automatically-aligned captions differ from the words actually spoken...
Scoring alignment for MGB
Scoring with respect to a forced alignment of human-generated verbatim transcription
Words spoken but not in the captions are ignored
For words in both, systems judged correct if supplied timings are correct within a 100ms window
Evaluated in terms of F-score:

   P = N_match / N_hyp,   R = N_match / N_ref,   F = 2 × P × R / (P + R)

Segments with overlapped speech are ignored
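A sketch of this metric, assuming timings keyed by (utterance, word index) and interpreting the 100ms window as applying to both word boundaries – an assumption; the official scoring tool may define the window differently.

```python
def score_alignment(ref, hyp, tol=0.1):
    """ref, hyp: dicts mapping (utt_id, word_index) -> (start, end), seconds.
    A hypothesised word counts as a match when both its boundaries fall
    within `tol` seconds of the reference timing."""
    n_match = sum(1 for key, (s, e) in hyp.items()
                  if key in ref
                  and abs(s - ref[key][0]) <= tol
                  and abs(e - ref[key][1]) <= tol)
    p = n_match / len(hyp) if hyp else 0.0
    r = n_match / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if n_match else 0.0
    return p, r, f

ref = {("show1", 0): (0.00, 0.30), ("show1", 1): (0.30, 0.60)}
hyp = {("show1", 0): (0.05, 0.32), ("show1", 1): (0.50, 0.90)}
p, r, f = score_alignment(ref, hyp)  # word 1 is 200ms off, so P = R = 0.5
```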
Our alignment results
System            Precision   Recall   F-score
Preliminary DNN AMs:
Pass 1 FT           0.8816    0.7629   0.8180
+ force align       0.8290    0.7855   0.8066
Pass 2 FT+skip      0.8679    0.8563   0.8620
Final DNN AMs:
Pass 1              0.9009    0.8128   0.8546
Pass 2 FT+skip      0.8856    0.9013   0.8934
Evaluation results
Participant F-score
Cambridge           0.900
Edinburgh/Quorate   0.877
CRIM                0.863
Vocapia/LIMSI       0.846
Sheffield           0.834
NHK                 0.797
Speech activity detection
SAD is the task of deciding which portions of the audio contain speech
Aims to segment the audio into “reasonable length” utterances
It’s surprisingly difficult! We need good models for non-speech as well as speech
Training non-speech models on the TV data is effectively unsupervised learning, as we can’t be sure that uncaptioned portions of audio don’t actually contain speech
One solution is to train non-speech models only on the short pauses between known words
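The short-pause idea can be sketched as follows, assuming we already have (start, end) times for reliably aligned words; the gap thresholds are illustrative.

```python
def short_pause_regions(word_times, min_gap=0.05, max_gap=0.5):
    """word_times: time-sorted (start, end) pairs for aligned words.
    Gaps between consecutive words inside [min_gap, max_gap] are taken as
    trustworthy non-speech examples; long uncaptioned stretches are excluded,
    since they may still contain (uncaptioned) speech."""
    regions = []
    for (_, prev_end), (next_start, _) in zip(word_times, word_times[1:]):
        if min_gap <= next_start - prev_end <= max_gap:
            regions.append((prev_end, next_start))
    return regions

words = [(0.0, 0.3), (0.4, 0.8), (3.0, 3.4)]
pauses = short_pause_regions(words)  # the 2.2s gap is excluded
```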
SAD with DNNs
[Figure: DNN taking input features and classifying each frame as speech or non-speech]
Problems with this DNN
Sensitive to the (frequent) errors in the frame labels
It’s hard to learn a good hidden representation of speech when modelling only two output classes
We’re not making use of the knowledge we have from the lightly supervised alignment of the captions
Alternative multi-task architecture
[Figure: multi-task DNN with shared hidden layers and two output heads: speech/non-speech classification and context-dependent phoneme targets]
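A shape-level sketch of the multi-task idea (forward pass only, NumPy, no training; all dimensions illustrative): the two heads share a hidden trunk, so the phone-classification task forces the hidden layers to learn a richer representation of speech than the binary SAD targets alone would.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_in, d_hid, n_phones = 40, 128, 500      # illustrative sizes
W1 = 0.01 * rng.standard_normal((d_in, d_hid))
W2 = 0.01 * rng.standard_normal((d_hid, d_hid))
W_sad = 0.01 * rng.standard_normal((d_hid, 2))           # SAD head
W_phone = 0.01 * rng.standard_normal((d_hid, n_phones))  # CD-phone head

def forward(x):
    h = relu(relu(x @ W1) @ W2)           # shared hidden trunk
    return softmax(h @ W_sad), softmax(h @ W_phone)

x = rng.standard_normal((5, d_in))        # 5 frames of acoustic features
sad_post, phone_post = forward(x)
```

At test time only the SAD head is needed; the phone head exists purely to shape the shared representation during training.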
Reading
P. Bell and S. Renals, “A system for automatic alignment of broadcast media captions using weighted finite-state transducers,” in Proc. ASRU, 2015.
P.C. Woodland et al., “Cambridge University transcription systems for the Multi-Genre Broadcast Challenge,” in Proc. ASRU, 2015.
P. Moreno and C. Alberti, “A factor automaton approach for the forced alignment of long speech recordings,” in Proc. ICASSP, 2009.
N. Braunschweiler, M. Gales, and S. Buchholz, “Lightly supervised recognition for automatic alignment of large coherent speech recordings,” in Proc. Interspeech, 2010.
P. Bell et al., “The MGB Challenge: evaluating multi-genre broadcast media recognition,” in Proc. ASRU, 2015.