
The Use of Context in Large Vocabulary Speech Recognition

Julian James Odell
March 1995

Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy

Presenter: Hsu-Ting Wei

Context

Contents (cont.)

Introduction

• The use of context dependent models introduces two major problems:
  – 1. Sparseness and unevenness of the training data
  – 2. An efficient decoding strategy which incorporates context dependencies both within words and across word boundaries

Introduction (cont.)

• About problem 1 (Ch. 3)
  – Construct robust and accurate recognizers using decision tree based clustering techniques
  – Linguistic knowledge is used
  – The approach allows the construction of models which are dependent upon contextual effects occurring across word boundaries
• About problem 2 (Ch. 4 onwards)
  – The thesis presents a new decoder design which is capable of using these models efficiently
  – The decoder can generate a lattice of word hypotheses with little computational overhead

Ch3 Context dependency in speech

• 3.1 Contextual Variation
  – In order to maximize the accuracy of HMM based speech recognition systems, it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMMs while minimizing the effects of their weaknesses
    • Signal parameterisation
    • Model structure
  – Ensure that the between-class variance is higher than the within-class variance

Ch3 Context dependency in speech (cont.)

• Most of the variability inherent in speech is due to contextual effects
  – Session effects
    • Speaker effects – a major source of variation
    • Environmental effects – controlled by minimizing the background noise and ensuring that the same microphone is used
  – Local effects
    • Utterance – co-articulation, stress, emphasis
• By taking these contextual effects into account, the variability can be reduced and the accuracy of the models increased

Ch3 Context dependency in speech (cont.)

• Session effects
  – A speaker dependent (SD) system is significantly more accurate than a similar speaker independent (SI) system
  – Speaker effects
    • Gender and age
    • Dialect
    • Style
  – In order to make the SI system approach SD performance, we can:
    • Operate recognizers in parallel
    • Adapt the recognizer to match the new speaker

Ch3 Context dependency in speech (cont.)

• Session effects (cont.)
  – Operating recognizers in parallel
    • Disadvantage
      – The computational load appears to rise linearly with the number of systems
    • Advantage
      – One system tends to dominate quickly, so the computational load is high for only the first few seconds of speech
  (Figure labels: speaker type, answer)

Ch3 Context dependency in speech (cont.)

• Session effects (cont.)
  – Adapting the recognizer to match the new speaker
    • Problem: there is insufficient data to update the model
  – It is possible to make use of both techniques: initially use parallel systems to choose the speaker characteristics, then, once enough data is available, adapt the chosen system to better match the speaker
  (Adaptation methods: MAP, MLLR – see the formulas below)
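
As a quick reminder of what these two adaptation schemes do (standard formulations, not taken from the slides): MLLR re-estimates the Gaussian means of the speaker independent models through a shared affine transform estimated from the adaptation data, while MAP interpolates each mean towards the adaptation data under a prior weight τ, where γ(t) is the occupation probability of the Gaussian at frame t and o_t the observation vector:

```latex
\hat{\mu}_{\text{MLLR}} = A\mu + b, \qquad
\hat{\mu}_{\text{MAP}} = \frac{\tau\,\mu_{0} + \sum_{t}\gamma(t)\,o_{t}}{\tau + \sum_{t}\gamma(t)}
```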

Ch3 Context dependency in speech (cont.)

• Local effects
  – Co-articulation means that the acoustic realization of a phone in a particular phonetic context is more consistent than the same phone occurring in a variety of contexts
  – Ex: "We were away with William in Sea World"
      w iy w er … s iy w er

Ch3 Context dependency in speech (cont.)

• Local effects
  – Context Dependent Phonetic Models
    • In LIMSI:
      – 45 monophone contexts (Festival CMU: 41)
        » STEAK = sil s t ey k sil
      – 2,071 biphone contexts (Festival CMU: 1,364)
        » STEAK = sil sil-s s-t t-ey ey-k sil
      – 95,221 triphone contexts
        » STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil
  – Word Boundaries
    • Word Internal Context Dependency (intra-word)
      – STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil
    • Cross Word Context Dependency (inter-word) => can increase accuracy (see the expansion sketch below)
      – STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
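
To make the two expansion styles concrete, here is a minimal sketch (my own illustration, not code from the thesis) that expands a word sequence into HTK-style context dependent labels either word-internally or across word boundaries; the pronunciations are hard-coded for the example:

```python
# Sketch: word-internal vs. cross-word triphone expansion (illustrative only).
# Pronunciations are hypothetical dictionary entries for the example words.
LEXICON = {
    "STEAK": ["s", "t", "ey", "k"],
    "AND":   ["ae", "n", "d"],
    "CHIPS": ["ch", "ih", "p", "s"],
}

def label(left, phone, right):
    """Build an HTK-style context dependent label such as 's-t+ey'."""
    out = phone
    if left is not None:
        out = f"{left}-{out}"
    if right is not None:
        out = f"{out}+{right}"
    return out

def expand(words, cross_word):
    """Expand a word sequence into triphone labels, with or without
    context dependency across word boundaries (silence at both ends)."""
    labels = ["sil"]
    prons = [LEXICON[w] for w in words]
    flat = [p for pron in prons for p in pron]            # all phones in order
    idx = 0
    for pron in prons:
        for j, phone in enumerate(pron):
            if cross_word:
                left = flat[idx - 1] if idx > 0 else "sil"
                right = flat[idx + 1] if idx + 1 < len(flat) else "sil"
            else:
                left = pron[j - 1] if j > 0 else None      # no context across words
                right = pron[j + 1] if j + 1 < len(pron) else None
            labels.append(label(left, phone, right))
            idx += 1
    labels.append("sil")
    return labels

if __name__ == "__main__":
    words = ["STEAK", "AND", "CHIPS"]
    print(" ".join(expand(words, cross_word=False)))   # word-internal expansion
    print(" ".join(expand(words, cross_word=True)))    # cross-word expansion
```

Run on the slide's example, the two calls reproduce the intra-word and inter-word sequences shown above.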

English dictionary

• Festlex CMU – Lexicon (American English) for the Festival Speech Synthesis System (2003–2006)
  – 40 distinct phones

  (hello nil (((hh ax l) 0) ((ow) 1)))
  (world nil (((w er l d) 1)))
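
The two entries above use Festival's Scheme lexicon format: a word, a part-of-speech field (nil here), and a list of syllables, each a phone list plus a stress marker. A minimal sketch of flattening such entries into phone strings (my own illustration, using hard-coded copies of the two slide entries rather than the real Festlex CMU files):

```python
# Sketch: flatten Festival-style lexicon entries into phone lists.
# The entries mirror the two examples on the slide; stress markers are kept but ignored here.
ENTRIES = {
    "hello": [(["hh", "ax", "l"], 0), (["ow"], 1)],   # (syllable phones, stress)
    "world": [(["w", "er", "l", "d"], 1)],
}

def phones(word):
    """Return the flat phone sequence for a word, ignoring syllable structure."""
    return [p for syllable, _stress in ENTRIES[word] for p in syllable]

print(phones("hello"))   # ['hh', 'ax', 'l', 'ow']
print(phones("world"))   # ['w', 'er', 'l', 'd']
```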

English dictionary (cont.)

• The LIMSI dictionary phone set (1993)
  – 45 phones

Linguistic knowledge (cont.)

• General questions
  (Annotations on the slide's question table: nasal, fricative, liquid)

Linguistic knowledge (cont.)

• Vowel questions

Linguistic knowledge (cont.)

• Consonant questions
  (Annotations on the slide's question table: fortis consonants (articulated with more effort), lenis consonants (articulated with less effort), apical, strident, syllabic, fricative, affricate)

Linguistic knowledge (cont.)

• Questions which are used in HTK
  ⇐ State tying (the questions drive the decision tree state clustering; see the sketch below)
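
These phonetic question lists are what drives the decision tree based state clustering of Chapter 3. As a rough sketch of the idea (my own illustration with made-up statistics and a couple of hypothetical questions, not the thesis or HTK code), each node of the tree picks the question whose yes/no split of the pooled triphone states gives the largest gain in single-Gaussian log likelihood:

```python
# Sketch (illustrative only): one split of phonetic-question-driven state clustering,
# following the usual single-Gaussian log-likelihood criterion used for state tying.
import numpy as np

# Hypothetical sufficient statistics for triphone states sharing the same centre
# phone and HMM state index: occupancy, sum(o), sum(o*o)  (2-dimensional features).
STATS = {
    "s-t+ey": (120.0, np.array([1.0, 2.0]), np.array([2.1, 5.5])),
    "f-t+ao": ( 80.0, np.array([0.5, 1.5]), np.array([1.2, 3.9])),
    "s-t+ih": ( 60.0, np.array([0.9, 1.8]), np.array([1.8, 4.6])),
    "n-t+ax": ( 90.0, np.array([0.2, 1.1]), np.array([0.9, 2.7])),
}

# Hypothetical questions: does the left context belong to a phone class?
QUESTIONS = {
    "L_Fricative": {"s", "f"},
    "L_Nasal": {"n"},
}

def log_likelihood(states):
    """Approximate log likelihood of pooling `states` into one diagonal Gaussian."""
    occ = sum(STATS[s][0] for s in states)
    mean = sum(STATS[s][1] for s in states) / occ
    var = sum(STATS[s][2] for s in states) / occ - mean ** 2
    d = len(mean)
    return -0.5 * occ * (d * np.log(2 * np.pi) + np.sum(np.log(var)) + d)

def best_split(states):
    """Pick the question whose yes/no split gives the largest likelihood gain."""
    base = log_likelihood(states)
    best = None
    for name, phone_class in QUESTIONS.items():
        yes = [s for s in states if s.split("-")[0] in phone_class]
        no = [s for s in states if s not in yes]
        if not yes or not no:
            continue
        gain = log_likelihood(yes) + log_likelihood(no) - base
        if best is None or gain > best[1]:
            best = (name, gain, yes, no)
    return best

print(best_split(list(STATS)))
```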

Ch4 Decoding

• This chapter describes several decoding techniques suitable for recognition of continuous speech using HMMs
• It is concerned with the use of cross-word context dependent acoustic models and long span language models
• Ideal decoder
  – 4.2 Time-Synchronous Decoding
    • 4.2.1 Token passing (see the sketch after this outline)
    • 4.2.2 Beam pruning
    • 4.2.3 N-Best decoding
    • 4.2.4 Limitations
    • 4.2.5 Back-Off implementation
  – 4.3 Best First Decoding
    • 4.3.1 A* Decoding
    • 4.3.2 The stack decoder for speech recognition
  – 4.4 A Hybrid approach
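
As a minimal illustration of the token passing and beam pruning ideas listed under 4.2.1–4.2.2 (a toy sketch with a hypothetical 3-state HMM and made-up parameters, not the decoder described in the thesis): every active state carries a token holding the best partial log probability and path, tokens are propagated on each frame, and any token falling more than a fixed beam width below the frame's best token is discarded:

```python
# Sketch (illustrative only): time-synchronous Viterbi decoding by token passing
# with beam pruning, on a hypothetical left-to-right 3-state HMM.
import math

LOG_TRANS = {  # log transition probabilities: (from_state, to_state) -> log p
    (0, 0): math.log(0.6), (0, 1): math.log(0.4),
    (1, 1): math.log(0.5), (1, 2): math.log(0.5),
    (2, 2): math.log(0.7),
}

def log_output(state, obs):
    """Hypothetical log output probability b_state(obs): unit-variance Gaussians."""
    means = [0.0, 1.0, 2.0]
    return -0.5 * (obs - means[state]) ** 2 - 0.5 * math.log(2 * math.pi)

def decode(observations, beam=10.0):
    # A token holds the best log probability of reaching a state, plus its path.
    tokens = {0: (0.0, [0])}                      # start in state 0
    for obs in observations:
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for (src, dst), log_a in LOG_TRANS.items():
                if src != state:
                    continue
                new_score = score + log_a + log_output(dst, obs)
                if dst not in new_tokens or new_score > new_tokens[dst][0]:
                    new_tokens[dst] = (new_score, path + [dst])
        # Beam pruning: discard tokens far below the best one at this frame.
        best = max(s for s, _ in new_tokens.values())
        tokens = {st: tok for st, tok in new_tokens.items() if tok[0] > best - beam}
    return max(tokens.values())                   # best (score, state path)

print(decode([0.1, 0.8, 1.2, 1.9, 2.1]))
```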

Ch4 Decoding (cont.)

• 4.1 Requirements
  – Ideal decoder: it should find the most likely grammatical hypothesis for an unknown utterance, based on:
    • Acoustic model likelihood
    • Language model likelihood
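
In other words (this is the standard formulation, not reproduced from the slide), the ideal decoder searches for the word sequence that maximizes the posterior probability of the words given the acoustics, which by Bayes' rule (dropping the constant p(O)) factors into the two likelihoods above:

```latex
\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \; p(O \mid W)\, P(W)
```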

Ch4 Decoding (cont.)

• 4.1 Requirements (cont.)
  – The ideal decoder would have the following characteristics:
    • Efficiency: ensure that the system does not lag behind the speaker
    • Accuracy: find the most likely grammatical sequence of words for each utterance
    • Scalability: the computation required by the decoder should increase less than linearly with the size of the vocabulary
    • Versatility: allow a variety of constraints and knowledge sources to be incorporated directly into the search without compromising its efficiency (n-gram language models + cross-word context dependent models)

Conclusion

• Implement right biphone and triphone tasks in HTK

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 2: The Use of Context in  Large Vocabulary Speech Recognition

2

3

Context

4

Contents (cont)

5

Introduction

bull The use of context dependent models introduces two major problemsndash 1 Sparsely and unevenness training datandash 2 Efficient decoding strategy which incorporates context

dependencies both within words and across word boundaries

6

Introduction (cont)

bull About problem 1 (ch3)ndash Construct a robust and accurate recognizers using decision tree

bases clustering techniquesndash Linguistic knowledge is usedndash The approach allows the construction of models which are

dependent upon contextual effects occurring across word boundaries

bull About problem 2 (ch4~)ndash The thesis presents a new decoder design which is capable of

using these models efficientlyndash The decoder can generate a lattice of word hypotheses with little

computational overhead

7

Ch3 Context dependency in speech

bull 31 Contextual Variationndash In order to maximize the accuracy of HMM based speech recog

nition systems it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMM while minimizing the effects of their weaknesses

bull Signal parameterisationbull Model structure

ndash Ensure that their between class variance is higher than the within class variance

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 3: The Use of Context in  Large Vocabulary Speech Recognition

3

Context

4

Contents (cont)

5

Introduction

bull The use of context dependent models introduces two major problemsndash 1 Sparsely and unevenness training datandash 2 Efficient decoding strategy which incorporates context

dependencies both within words and across word boundaries

6

Introduction (cont)

bull About problem 1 (ch3)ndash Construct a robust and accurate recognizers using decision tree

bases clustering techniquesndash Linguistic knowledge is usedndash The approach allows the construction of models which are

dependent upon contextual effects occurring across word boundaries

bull About problem 2 (ch4~)ndash The thesis presents a new decoder design which is capable of

using these models efficientlyndash The decoder can generate a lattice of word hypotheses with little

computational overhead

7

Ch3 Context dependency in speech

bull 31 Contextual Variationndash In order to maximize the accuracy of HMM based speech recog

nition systems it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMM while minimizing the effects of their weaknesses

bull Signal parameterisationbull Model structure

ndash Ensure that their between class variance is higher than the within class variance

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 4: The Use of Context in  Large Vocabulary Speech Recognition

4

Contents (cont)

5

Introduction

bull The use of context dependent models introduces two major problemsndash 1 Sparsely and unevenness training datandash 2 Efficient decoding strategy which incorporates context

dependencies both within words and across word boundaries

6

Introduction (cont)

bull About problem 1 (ch3)ndash Construct a robust and accurate recognizers using decision tree

bases clustering techniquesndash Linguistic knowledge is usedndash The approach allows the construction of models which are

dependent upon contextual effects occurring across word boundaries

bull About problem 2 (ch4~)ndash The thesis presents a new decoder design which is capable of

using these models efficientlyndash The decoder can generate a lattice of word hypotheses with little

computational overhead

7

Ch3 Context dependency in speech

bull 31 Contextual Variationndash In order to maximize the accuracy of HMM based speech recog

nition systems it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMM while minimizing the effects of their weaknesses

bull Signal parameterisationbull Model structure

ndash Ensure that their between class variance is higher than the within class variance

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 5: The Use of Context in  Large Vocabulary Speech Recognition

5

Introduction

bull The use of context dependent models introduces two major problemsndash 1 Sparsely and unevenness training datandash 2 Efficient decoding strategy which incorporates context

dependencies both within words and across word boundaries

6

Introduction (cont)

bull About problem 1 (ch3)ndash Construct a robust and accurate recognizers using decision tree

bases clustering techniquesndash Linguistic knowledge is usedndash The approach allows the construction of models which are

dependent upon contextual effects occurring across word boundaries

bull About problem 2 (ch4~)ndash The thesis presents a new decoder design which is capable of

using these models efficientlyndash The decoder can generate a lattice of word hypotheses with little

computational overhead

7

Ch3 Context dependency in speech

bull 31 Contextual Variationndash In order to maximize the accuracy of HMM based speech recog

nition systems it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMM while minimizing the effects of their weaknesses

bull Signal parameterisationbull Model structure

ndash Ensure that their between class variance is higher than the within class variance

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 6: The Use of Context in  Large Vocabulary Speech Recognition

6

Introduction (cont)

bull About problem 1 (ch3)ndash Construct a robust and accurate recognizers using decision tree

bases clustering techniquesndash Linguistic knowledge is usedndash The approach allows the construction of models which are

dependent upon contextual effects occurring across word boundaries

bull About problem 2 (ch4~)ndash The thesis presents a new decoder design which is capable of

using these models efficientlyndash The decoder can generate a lattice of word hypotheses with little

computational overhead

7

Ch3 Context dependency in speech

bull 31 Contextual Variationndash In order to maximize the accuracy of HMM based speech recog

nition systems it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMM while minimizing the effects of their weaknesses

bull Signal parameterisationbull Model structure

ndash Ensure that their between class variance is higher than the within class variance

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 7: The Use of Context in  Large Vocabulary Speech Recognition

7

Ch3 Context dependency in speech

bull 31 Contextual Variationndash In order to maximize the accuracy of HMM based speech recog

nition systems it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMM while minimizing the effects of their weaknesses

bull Signal parameterisationbull Model structure

ndash Ensure that their between class variance is higher than the within class variance

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 8: The Use of Context in  Large Vocabulary Speech Recognition

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 9: The Use of Context in  Large Vocabulary Speech Recognition

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23