
The Use of Context in Large Vocabulary Speech Recognition

Julian James Odell
March 1995

Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy

Presenter: Hsu-Ting Wei

Context

Contents (cont.)

Introduction

• The use of context dependent models introduces two major problems:
  – 1. Sparseness and unevenness of the training data
  – 2. An efficient decoding strategy which incorporates context dependencies both within words and across word boundaries

Introduction (cont.)

• About problem 1 (Ch. 3)
  – Construct robust and accurate recognizers using decision tree based clustering techniques
  – Linguistic knowledge is used
  – The approach allows the construction of models which are dependent upon contextual effects occurring across word boundaries
• About problem 2 (Ch. 4 onwards)
  – The thesis presents a new decoder design which is capable of using these models efficiently
  – The decoder can generate a lattice of word hypotheses with little computational overhead

Ch3 Context dependency in speech

• 3.1 Contextual Variation
  – In order to maximize the accuracy of HMM based speech recognition systems, it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMMs while minimizing the effects of their weaknesses
    • Signal parameterisation
    • Model structure
  – Ensure that the between-class variance is higher than the within-class variance

Ch3 Context dependency in speech (cont.)

• Most of the variability inherent in speech is due to contextual effects
  – Session effects
    • Speaker effects – a major source of variation
    • Environmental effects – controlled by minimizing the background noise and ensuring that the same microphone is used
  – Local effects
    • Utterance – co-articulation, stress, emphasis
• By taking these contextual effects into account, the variability can be reduced and the accuracy of the models increased

Ch3 Context dependency in speech (cont.)

• Session effects
  – A speaker dependent (SD) system is significantly more accurate than a similar speaker independent (SI) system
  – Speaker effects
    • Gender and age
    • Dialect
    • Style
  – In order to make the SI system approach SD performance, we can:
    • Operate recognizers in parallel
    • Adapt the recognizer to match the new speaker

Ch3 Context dependency in speech (cont.)

• Session effects (cont.)
  – Operating recognizers in parallel
    • Disadvantage
      – The computational load appears to rise linearly with the number of systems
    • Advantage
      – One system tends to dominate quickly, so the computational load is high for only the first few seconds of speech
  (Figure labels: speaker type, answer)

Ch3 Context dependency in speech (cont.)

• Session effects (cont.)
  – Adapting the recognizer to match the new speaker
    • Problem: there is insufficient data to update the model
  – It is possible to make use of both techniques: initially use parallel systems to choose the speaker characteristics, then, once enough data is available, adapt the chosen system to better match the speaker
  (Adaptation methods: MAP, MLLR – see the formulas below)
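
As a quick reminder of what these two adaptation schemes do (standard formulations, not taken from the slides): MLLR re-estimates the Gaussian means of the speaker independent models through a shared affine transform estimated from the adaptation data, while MAP interpolates each mean towards the adaptation data under a prior weight τ, where γ(t) is the occupation probability of the Gaussian at frame t and o_t the observation vector:

```latex
\hat{\mu}_{\text{MLLR}} = A\mu + b, \qquad
\hat{\mu}_{\text{MAP}} = \frac{\tau\,\mu_{0} + \sum_{t}\gamma(t)\,o_{t}}{\tau + \sum_{t}\gamma(t)}
```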

Ch3 Context dependency in speech (cont.)

• Local effects
  – Co-articulation means that the acoustic realization of a phone in a particular phonetic context is more consistent than the same phone occurring in a variety of contexts
  – Ex: "We were away with William in Sea World"
      w iy w er … s iy w er

Ch3 Context dependency in speech (cont.)

• Local effects
  – Context Dependent Phonetic Models
    • In LIMSI:
      – 45 monophone contexts (Festival CMU: 41)
        » STEAK = sil s t ey k sil
      – 2,071 biphone contexts (Festival CMU: 1,364)
        » STEAK = sil sil-s s-t t-ey ey-k sil
      – 95,221 triphone contexts
        » STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil
  – Word Boundaries
    • Word Internal Context Dependency (intra-word)
      – STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil
    • Cross Word Context Dependency (inter-word) => can increase accuracy (see the expansion sketch below)
      – STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
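
To make the two expansion styles concrete, here is a minimal sketch (my own illustration, not code from the thesis) that expands a word sequence into HTK-style context dependent labels either word-internally or across word boundaries; the pronunciations are hard-coded for the example:

```python
# Sketch: word-internal vs. cross-word triphone expansion (illustrative only).
# Pronunciations are hypothetical dictionary entries for the example words.
LEXICON = {
    "STEAK": ["s", "t", "ey", "k"],
    "AND":   ["ae", "n", "d"],
    "CHIPS": ["ch", "ih", "p", "s"],
}

def label(left, phone, right):
    """Build an HTK-style context dependent label such as 's-t+ey'."""
    out = phone
    if left is not None:
        out = f"{left}-{out}"
    if right is not None:
        out = f"{out}+{right}"
    return out

def expand(words, cross_word):
    """Expand a word sequence into triphone labels, with or without
    context dependency across word boundaries (silence at both ends)."""
    labels = ["sil"]
    prons = [LEXICON[w] for w in words]
    flat = [p for pron in prons for p in pron]            # all phones in order
    idx = 0
    for pron in prons:
        for j, phone in enumerate(pron):
            if cross_word:
                left = flat[idx - 1] if idx > 0 else "sil"
                right = flat[idx + 1] if idx + 1 < len(flat) else "sil"
            else:
                left = pron[j - 1] if j > 0 else None      # no context across words
                right = pron[j + 1] if j + 1 < len(pron) else None
            labels.append(label(left, phone, right))
            idx += 1
    labels.append("sil")
    return labels

if __name__ == "__main__":
    words = ["STEAK", "AND", "CHIPS"]
    print(" ".join(expand(words, cross_word=False)))   # word-internal expansion
    print(" ".join(expand(words, cross_word=True)))    # cross-word expansion
```

Run on the slide's example, the two calls reproduce the intra-word and inter-word sequences shown above.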

English dictionary

• Festlex CMU – Lexicon (American English) for the Festival Speech Synthesis System (2003–2006)
  – 40 distinct phones

  (hello nil (((hh ax l) 0) ((ow) 1)))
  (world nil (((w er l d) 1)))
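
The two entries above use Festival's Scheme lexicon format: a word, a part-of-speech field (nil here), and a list of syllables, each a phone list plus a stress marker. A minimal sketch of flattening such entries into phone strings (my own illustration, using hard-coded copies of the two slide entries rather than the real Festlex CMU files):

```python
# Sketch: flatten Festival-style lexicon entries into phone lists.
# The entries mirror the two examples on the slide; stress markers are kept but ignored here.
ENTRIES = {
    "hello": [(["hh", "ax", "l"], 0), (["ow"], 1)],   # (syllable phones, stress)
    "world": [(["w", "er", "l", "d"], 1)],
}

def phones(word):
    """Return the flat phone sequence for a word, ignoring syllable structure."""
    return [p for syllable, _stress in ENTRIES[word] for p in syllable]

print(phones("hello"))   # ['hh', 'ax', 'l', 'ow']
print(phones("world"))   # ['w', 'er', 'l', 'd']
```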

English dictionary (cont.)

• The LIMSI dictionary phone set (1993)
  – 45 phones

Linguistic knowledge (cont.)

• General questions
  (Annotations on the slide's question table: nasal, fricative, liquid)

Linguistic knowledge (cont.)

• Vowel questions

Linguistic knowledge (cont.)

• Consonant questions
  (Annotations on the slide's question table: fortis consonants (articulated with more effort), lenis consonants (articulated with less effort), apical, strident, syllabic, fricative, affricate)

Linguistic knowledge (cont.)

• Questions which are used in HTK
  ⇐ State tying (the questions drive the decision tree state clustering; see the sketch below)
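
These phonetic question lists are what drives the decision tree based state clustering of Chapter 3. As a rough sketch of the idea (my own illustration with made-up statistics and a couple of hypothetical questions, not the thesis or HTK code), each node of the tree picks the question whose yes/no split of the pooled triphone states gives the largest gain in single-Gaussian log likelihood:

```python
# Sketch (illustrative only): one split of phonetic-question-driven state clustering,
# following the usual single-Gaussian log-likelihood criterion used for state tying.
import numpy as np

# Hypothetical sufficient statistics for triphone states sharing the same centre
# phone and HMM state index: occupancy, sum(o), sum(o*o)  (2-dimensional features).
STATS = {
    "s-t+ey": (120.0, np.array([1.0, 2.0]), np.array([2.1, 5.5])),
    "f-t+ao": ( 80.0, np.array([0.5, 1.5]), np.array([1.2, 3.9])),
    "s-t+ih": ( 60.0, np.array([0.9, 1.8]), np.array([1.8, 4.6])),
    "n-t+ax": ( 90.0, np.array([0.2, 1.1]), np.array([0.9, 2.7])),
}

# Hypothetical questions: does the left context belong to a phone class?
QUESTIONS = {
    "L_Fricative": {"s", "f"},
    "L_Nasal": {"n"},
}

def log_likelihood(states):
    """Approximate log likelihood of pooling `states` into one diagonal Gaussian."""
    occ = sum(STATS[s][0] for s in states)
    mean = sum(STATS[s][1] for s in states) / occ
    var = sum(STATS[s][2] for s in states) / occ - mean ** 2
    d = len(mean)
    return -0.5 * occ * (d * np.log(2 * np.pi) + np.sum(np.log(var)) + d)

def best_split(states):
    """Pick the question whose yes/no split gives the largest likelihood gain."""
    base = log_likelihood(states)
    best = None
    for name, phone_class in QUESTIONS.items():
        yes = [s for s in states if s.split("-")[0] in phone_class]
        no = [s for s in states if s not in yes]
        if not yes or not no:
            continue
        gain = log_likelihood(yes) + log_likelihood(no) - base
        if best is None or gain > best[1]:
            best = (name, gain, yes, no)
    return best

print(best_split(list(STATS)))
```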

Ch4 Decoding

• This chapter describes several decoding techniques suitable for recognition of continuous speech using HMMs
• It is concerned with the use of cross-word context dependent acoustic models and long span language models
• Ideal decoder
  – 4.2 Time-Synchronous Decoding
    • 4.2.1 Token passing (see the sketch after this outline)
    • 4.2.2 Beam pruning
    • 4.2.3 N-Best decoding
    • 4.2.4 Limitations
    • 4.2.5 Back-Off implementation
  – 4.3 Best First Decoding
    • 4.3.1 A* Decoding
    • 4.3.2 The stack decoder for speech recognition
  – 4.4 A Hybrid approach
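
As a minimal illustration of the token passing and beam pruning ideas listed under 4.2.1–4.2.2 (a toy sketch with a hypothetical 3-state HMM and made-up parameters, not the decoder described in the thesis): every active state carries a token holding the best partial log probability and path, tokens are propagated on each frame, and any token falling more than a fixed beam width below the frame's best token is discarded:

```python
# Sketch (illustrative only): time-synchronous Viterbi decoding by token passing
# with beam pruning, on a hypothetical left-to-right 3-state HMM.
import math

LOG_TRANS = {  # log transition probabilities: (from_state, to_state) -> log p
    (0, 0): math.log(0.6), (0, 1): math.log(0.4),
    (1, 1): math.log(0.5), (1, 2): math.log(0.5),
    (2, 2): math.log(0.7),
}

def log_output(state, obs):
    """Hypothetical log output probability b_state(obs): unit-variance Gaussians."""
    means = [0.0, 1.0, 2.0]
    return -0.5 * (obs - means[state]) ** 2 - 0.5 * math.log(2 * math.pi)

def decode(observations, beam=10.0):
    # A token holds the best log probability of reaching a state, plus its path.
    tokens = {0: (0.0, [0])}                      # start in state 0
    for obs in observations:
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for (src, dst), log_a in LOG_TRANS.items():
                if src != state:
                    continue
                new_score = score + log_a + log_output(dst, obs)
                if dst not in new_tokens or new_score > new_tokens[dst][0]:
                    new_tokens[dst] = (new_score, path + [dst])
        # Beam pruning: discard tokens far below the best one at this frame.
        best = max(s for s, _ in new_tokens.values())
        tokens = {st: tok for st, tok in new_tokens.items() if tok[0] > best - beam}
    return max(tokens.values())                   # best (score, state path)

print(decode([0.1, 0.8, 1.2, 1.9, 2.1]))
```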

Ch4 Decoding (cont.)

• 4.1 Requirements
  – Ideal decoder: it should find the most likely grammatical hypothesis for an unknown utterance, based on:
    • Acoustic model likelihood
    • Language model likelihood
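
In other words (this is the standard formulation, not reproduced from the slide), the ideal decoder searches for the word sequence that maximizes the posterior probability of the words given the acoustics, which by Bayes' rule (dropping the constant p(O)) factors into the two likelihoods above:

```latex
\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \; p(O \mid W)\, P(W)
```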

Ch4 Decoding (cont.)

• 4.1 Requirements (cont.)
  – The ideal decoder would have the following characteristics:
    • Efficiency: ensure that the system does not lag behind the speaker
    • Accuracy: find the most likely grammatical sequence of words for each utterance
    • Scalability: the computation required by the decoder should increase less than linearly with the size of the vocabulary
    • Versatility: allow a variety of constraints and knowledge sources to be incorporated directly into the search without compromising its efficiency (n-gram language models + cross-word context dependent models)

Conclusion

• Implement right biphone and triphone tasks in HTK

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 2: The Use of Context in  Large Vocabulary Speech Recognition

2

3

Context

4

Contents (cont)

5

Introduction

bull The use of context dependent models introduces two major problemsndash 1 Sparsely and unevenness training datandash 2 Efficient decoding strategy which incorporates context

dependencies both within words and across word boundaries

6

Introduction (cont)

bull About problem 1 (ch3)ndash Construct a robust and accurate recognizers using decision tree

bases clustering techniquesndash Linguistic knowledge is usedndash The approach allows the construction of models which are

dependent upon contextual effects occurring across word boundaries

bull About problem 2 (ch4~)ndash The thesis presents a new decoder design which is capable of

using these models efficientlyndash The decoder can generate a lattice of word hypotheses with little

computational overhead

7

Ch3 Context dependency in speech

bull 31 Contextual Variationndash In order to maximize the accuracy of HMM based speech recog

nition systems it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMM while minimizing the effects of their weaknesses

bull Signal parameterisationbull Model structure

ndash Ensure that their between class variance is higher than the within class variance

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 3: The Use of Context in  Large Vocabulary Speech Recognition

3

Context

4

Contents (cont)

5

Introduction

bull The use of context dependent models introduces two major problemsndash 1 Sparsely and unevenness training datandash 2 Efficient decoding strategy which incorporates context

dependencies both within words and across word boundaries

6

Introduction (cont)

bull About problem 1 (ch3)ndash Construct a robust and accurate recognizers using decision tree

bases clustering techniquesndash Linguistic knowledge is usedndash The approach allows the construction of models which are

dependent upon contextual effects occurring across word boundaries

bull About problem 2 (ch4~)ndash The thesis presents a new decoder design which is capable of

using these models efficientlyndash The decoder can generate a lattice of word hypotheses with little

computational overhead

7

Ch3 Context dependency in speech

bull 31 Contextual Variationndash In order to maximize the accuracy of HMM based speech recog

nition systems it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMM while minimizing the effects of their weaknesses

bull Signal parameterisationbull Model structure

ndash Ensure that their between class variance is higher than the within class variance

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 4: The Use of Context in  Large Vocabulary Speech Recognition

4

Contents (cont)

5

Introduction

bull The use of context dependent models introduces two major problemsndash 1 Sparsely and unevenness training datandash 2 Efficient decoding strategy which incorporates context

dependencies both within words and across word boundaries

6

Introduction (cont)

bull About problem 1 (ch3)ndash Construct a robust and accurate recognizers using decision tree

bases clustering techniquesndash Linguistic knowledge is usedndash The approach allows the construction of models which are

dependent upon contextual effects occurring across word boundaries

bull About problem 2 (ch4~)ndash The thesis presents a new decoder design which is capable of

using these models efficientlyndash The decoder can generate a lattice of word hypotheses with little

computational overhead

7

Ch3 Context dependency in speech

bull 31 Contextual Variationndash In order to maximize the accuracy of HMM based speech recog

nition systems it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMM while minimizing the effects of their weaknesses

bull Signal parameterisationbull Model structure

ndash Ensure that their between class variance is higher than the within class variance

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 5: The Use of Context in  Large Vocabulary Speech Recognition

5

Introduction

bull The use of context dependent models introduces two major problemsndash 1 Sparsely and unevenness training datandash 2 Efficient decoding strategy which incorporates context

dependencies both within words and across word boundaries

6

Introduction (cont)

bull About problem 1 (ch3)ndash Construct a robust and accurate recognizers using decision tree

bases clustering techniquesndash Linguistic knowledge is usedndash The approach allows the construction of models which are

dependent upon contextual effects occurring across word boundaries

bull About problem 2 (ch4~)ndash The thesis presents a new decoder design which is capable of

using these models efficientlyndash The decoder can generate a lattice of word hypotheses with little

computational overhead

7

Ch3 Context dependency in speech

bull 31 Contextual Variationndash In order to maximize the accuracy of HMM based speech recog

nition systems it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMM while minimizing the effects of their weaknesses

bull Signal parameterisationbull Model structure

ndash Ensure that their between class variance is higher than the within class variance

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 6: The Use of Context in  Large Vocabulary Speech Recognition

6

Introduction (cont)

bull About problem 1 (ch3)ndash Construct a robust and accurate recognizers using decision tree

bases clustering techniquesndash Linguistic knowledge is usedndash The approach allows the construction of models which are

dependent upon contextual effects occurring across word boundaries

bull About problem 2 (ch4~)ndash The thesis presents a new decoder design which is capable of

using these models efficientlyndash The decoder can generate a lattice of word hypotheses with little

computational overhead

7

Ch3 Context dependency in speech

bull 31 Contextual Variationndash In order to maximize the accuracy of HMM based speech recog

nition systems it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMM while minimizing the effects of their weaknesses

bull Signal parameterisationbull Model structure

ndash Ensure that their between class variance is higher than the within class variance

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 7: The Use of Context in  Large Vocabulary Speech Recognition

7

Ch3 Context dependency in speech

bull 31 Contextual Variationndash In order to maximize the accuracy of HMM based speech recog

nition systems it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMM while minimizing the effects of their weaknesses

bull Signal parameterisationbull Model structure

ndash Ensure that their between class variance is higher than the within class variance

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 8: The Use of Context in  Large Vocabulary Speech Recognition

8

Ch3 Context dependency in speech (cont)

bull Most of the variability inherent in speech is due to contextual effectsndash Session effects

bull Speaker effects ndash Major source of variation

bull Environmental effectsndash Control by minimizing the background noise and

ensuring that the same microphone is usedndash Local effects

bull Utterancendash Co-articulation stress emphasis

bull By taking these contextual effects into account the variability can be reduced and the accuracy of the models increased

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
Page 9: The Use of Context in  Large Vocabulary Speech Recognition

9

Ch3 Context dependency in speech (cont)

bull Session effectsndash Speaker dependent system (SD) is significantly more accurate

than a similar speaker independent system (SI)ndash Speaker effects

bull Gender and agebull Dialectbull Style

ndash In order to making the SI system to simulate SD system we can do

bull Operating recognizers in parallelbull Adapting the recognizer to match the new speaker

10

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Operating recognizers in parallel

bull Disadvantage ndash The computational load appears to rises linearly with the number of

systemsbull Advantage

ndash The system tends to dominate quickly and the computational load is high for only the first few seconds of speech

Speaker typeanswer

11

Ch3 Context dependency in speech (cont)

bull Session effects (cont)ndash Adapting the recognizer to match the new speaker

bull Problem There is insufficient data to update the modelndash It is possible to make use of both techniques and initially use

parallel systems to choose the speaker characteristics then once enough data is available adapt the chosen system to better match the speaker

MAPMLLR

12

Ch3 Context dependency in speech (cont)

bull Local effectsndash Co-articulation means that the acoustic realization of a phone in

a particular phonetic context is more consistent than the same phone occurring in a variety of contexts

ndash Ex rdquoWe were away with William in Sea Worldrdquo w iy w erhellip s iy w er

13

Ch3 Context dependency in speech (cont)

bull Local effectsndash Context Dependent Phonetic Models

bull IN LIMSIndash 45 monophone context (Festival CMU 41)

raquo STEAK = sil s t ey k silndash 2071 biphone context (Festival CMU 1364)

raquo STEAK = sil sil-s s-t t-ey ey-k silndash 95221 triphone context

raquo STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil

ndash Word Boundariesbull Word Internal Context Dependency (Intra-word)

ndash STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil

bull Cross World Context Dependency (Inter-word) =gtcan increase accuracy

ndash STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil

14

English dictionary

bull Festlex CMU - Lexicon (American English) for Festival Speech System (2003-2006) ndash 40 distinct phones

(hello nil (((hh ax l) 0) ((ow) 1)))(world nil (((w er l d) 1)))

15

English dictionary (cont)

bull The LIMSI dictionary phones set (1993)ndash 45 phones

16

Linguistic knowledge (cont)

鼻音摩擦音流音

bull General questions

17

Linguistic knowledge (cont)

bull Vowel questions

18

Linguistic knowledge (cont)

bull Consonant questions

發音時很用力的子音發音較不費力的子音

舌尖音

刺耳的

音節主音

摩擦音

破擦音

19

Linguistic knowledge (cont)

bull Questions which is used in HTK

lt= State tying

20

Ch4Decoding

bull This chapter described several decoding techniques suitable for recognition of continuous speech using HMM

bull It is concerned with the use of cross word context dependent acoustic and long span language models

bull Ideal decoderndash 42 Time-Synchronous decoding

bull 421 Token passingbull 422 Beam pruning bull 423 N-Best decodingbull 424 Limitationsbull 425 Back-Off implementation

ndash 43 Best First Decodingbull 431 A Decodingbull 432 The stack decoder for speech recognition

ndash 44 A Hybrid approach

21

Ch4Decoding (cont)

41 Requirementsndash Ideal decoder It should find the most likely grammatical hypothe

sis for an unknow utterance bull Acoustic model likelihoodbull Language model likelihood

22

Ch4Decoding (cont)

41 Requirements (cont)ndash The ideal decoder would have following characteristics

bull Efficiency Ensure that the system does not lag behind the speaker

bull Accuracy Find the most likely grammatical sequence of words for each utterance

bull Scalability (可擴放性 ) () The computation required by the decoder would also increase less than linearly with the size of the vocabulary

bull Versatility(多樣性 ) Allow a variety of constraints and knowledge sources to be incorporates directly into the search without compromising its efficiency (n-gram language + cross-word context dependent models)

23

Conclusion

bull Implement HTK right biphone task and triphone task

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23