Automatic Continuous Speech Recognition

Page 1: Automatic Continuous Speech Recognition

Automatic Continuous Speech Recognition

[Block diagram: a database provides speech and text transcriptions for training; the recognizer's output is scored.]

Page 2:

Automatic Continuous Speech Recognition

Problems with isolated word recognition:
– Every new task contains novel words without any available training data.
– There are simply too many words, and these words may have different acoustic realizations. Variability is increased by:
   coarticulation between “words”
   speaking rate
– We do not know where words begin and end.

Page 3:

In CSR, should we use words? Or what is the basic unit to represent salient acoustic and phonetic information?

Page 4:

Model Unit Issues

Accurate:
– Represents the acoustic realizations that appear in different contexts.
Trainable.
Generalizable:
– New words can be derived from the units.

Page 5:

Comparison of Different Units

Words:
– Small task: accurate, trainable, not generalizable.
– Large vocabulary: accurate, not trainable, not generalizable.

Phonemes:
– Large vocabulary: not accurate, trainable, over-generalizable.

Page 6:

Syllables:
– English: 30,000 syllables: not very accurate, not trainable, generalizable.
– Chinese: 1,200 tone-dependent syllables; Japanese: 50 syllables: accurate, trainable, generalizable.

Allophones: realizations of phonemes in different contexts.
– Accurate, not trainable, generalizable.
– Triphones are an example of allophones.

Page 7:

Training in Sphinx

– The phoneme set is trained.
– Triphones are created.
– Triphones are trained.
– Senones are created.
– Senones are pruned.
– Senones are trained, from 1 Gaussian up to 8 or 16 Gaussians per state.

Page 8:

Context-Independent: Phonemes
– SPHINX: model_architecture/Telefonica.ci.mdef

Context-Dependent: Triphones
– SPHINX: model_architecture/Telefonica.untied.mdef

Page 9:

Clustering Acoustic-Phonetic Units

Many phones have similar effects on their neighboring phones; hence, many triphones have very similar Markov states.

A senone is a cluster of similar Markov states.

Advantages:
– More training data per parameter.
– Less memory used.
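As a sketch of what state tying buys us, the toy mapping below shows several triphone states sharing one senone, so their Gaussians are trained on the pooled data. The triphone names, state indices, and senone IDs are all made up for illustration; they are not real Sphinx identifiers.

```python
# Each triphone HMM state is identified as (triphone, state_index);
# tying maps many such states onto one shared senone.
senone_of = {
    ("t-uw+n", 1): "senone_12",
    ("d-uw+n", 1): "senone_12",   # /t/ and /d/ affect a following /uw/ similarly
    ("t-uw+m", 1): "senone_12",
    ("t-uw+n", 2): "senone_47",
}

def shared_states(senone_id, table):
    """Return all triphone states tied to the given senone."""
    return [state for state, s in table.items() if s == senone_id]

tied = shared_states("senone_12", senone_of)
# Three distinct triphone states share one set of output parameters,
# so their training frames are pooled and less memory is needed.
```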

Page 10:

Senonic Decision Tree (SDT)

An SDT classifies the Markov states of the triphones represented in the training corpus by asking linguistic questions composed of conjunctions, disjunctions, and/or negations of a set of predetermined questions.

Page 11:

Linguistic Questions

Question    Phones in Each Question
Aspgen      hh
Sil         sil
Alvstp      d, t
Dental      dh, th
Labstp      b, p
Liquid      l, r
Lw          l, w
S/Sh        s, sh
…           …
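A minimal way to picture this table in code: each question is a named phone set, and asking a question is a membership test on a context phone. The set contents mirror the table above; the compound question at the end is a made-up example of a conjunction with a negation, as used by the senonic decision tree.

```python
# Linguistic questions as named phone sets (names taken from the table).
QUESTIONS = {
    "Aspgen": {"hh"},
    "Sil":    {"sil"},
    "Alvstp": {"d", "t"},
    "Dental": {"dh", "th"},
    "Labstp": {"b", "p"},
    "Liquid": {"l", "r"},
    "Lw":     {"l", "w"},
    "S/Sh":   {"s", "sh"},
}

def ask(question, phone):
    """Evaluate one elementary linguistic question for a context phone."""
    return phone in QUESTIONS[question]

# Compound questions combine elementary ones with and/or/not
# (hypothetical example: a liquid that is not /l/ or /w/):
def is_liquid_but_not_lw(phone):
    return ask("Liquid", phone) and not ask("Lw", phone)
```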

Page 12:

Decision Tree for Classifying the Second State of a /k/-Triphone

[Decision-tree diagram. Root: “Is the left phone (LP) a sonorant or nasal?” Inner nodes: “Is the right phone (RP) a back-R?”, “Is LP /s, z, sh, zh/?”, “Is RP voiced?”, and “Is LP a back-L, or (LP neither a nasal nor RP a LAX-vowel)?”. Leaves: Senone 1 through Senone 6.]

Page 13:

When applied to the word welcome

[Same decision tree as on the previous slide, now traced for the /k/-triphone in welcome: the answers to the questions about its left and right phones select the senone for the state.]

Page 14:

The tree can be constructed automatically by searching, at each node, for the question that yields the maximum entropy decrease.
– Sphinx:
  Construction: $base_dir/c_scripts/03.bulidtrees
  Results: $base_dir/trees/Telefonica.unpruned/A-0.dtree

When the tree grows too large, it needs to be pruned.
– Sphinx: $base_dir/c_scripts/04.bulidtrees
  Results: $base_dir/trees/Telefonica.500/A-0.dtree, $base_dir/Telefonica_arquitecture/Telefonica.500.mdef
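Sphinx's real trees score splits using the likelihood of the Gaussian-modeled states, but the maximum-entropy-decrease selection itself can be sketched with discrete class counts. All numbers and question names below are invented for illustration.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def entropy_decrease(parent, left, right):
    """Weighted entropy reduction when a question splits `parent`
    into `left` and `right` (each a list of per-class counts)."""
    n, nl, nr = sum(parent), sum(left), sum(right)
    return entropy(parent) - (nl / n) * entropy(left) - (nr / n) * entropy(right)

# At each node, try every linguistic question and keep the one with the
# largest entropy decrease (toy questions "q1" and "q2"):
candidates = [("q1", [6, 2], [2, 6]), ("q2", [5, 5], [3, 3])]
best = max(candidates, key=lambda q: entropy_decrease([8, 8], q[1], q[2]))
# "q1" separates the classes, so it gives the larger decrease.
```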

Page 15:

Subword-Unit Models Based on HMMs

Page 16:

Words

Words can be modeled using composite HMMs.

A null transition is used to go from one subword unit to the next:

/sil/ → /t/ → /uw/ → /sil/
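A rough sketch of such a composite model, assuming three emitting states per unit (an arbitrary choice for illustration) and representing each null transition simply as a pair of state indices linking one unit's exit to the next unit's entry:

```python
# Build a word HMM by chaining subword-unit HMMs with null transitions.
def compose(units, states_per_unit=3):
    """Return (state_labels, null_transitions) for a concatenated word HMM."""
    states, nulls = [], []
    for unit in units:
        first = len(states)
        if states:                      # null arc from the previous unit's exit
            nulls.append((first - 1, first))
        states += [f"{unit}[{i}]" for i in range(states_per_unit)]
    return states, nulls

states, nulls = compose(["sil", "t", "uw", "sil"])
# 4 units x 3 emitting states = 12 states, joined by 3 null transitions.
```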

Page 17:

Continuous Speech Training

[Block diagram: a database provides speech and text transcriptions for training; the recognizer's output is scored.]

Page 18:

For each utterance to be trained, the subword units are concatenated to form word models.
– Sphinx: Dictionary
– $base_dir/training_input/dict.txt
– $base_dir/training_input/train.lbl
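A simplified sketch of that concatenation step. The dictionary entries and phone set below are hypothetical, not the actual contents of dict.txt, and the file formats are reduced to plain Python structures:

```python
# Hypothetical phonetic dictionary mapping words to subword units.
DICT = {
    "TWO":  ["T", "UW"],
    "FOUR": ["F", "AO", "R"],
    "SIX":  ["S", "IH", "K", "S"],
}

def utterance_units(words, lexicon, sil="SIL"):
    """Concatenate subword units for one utterance, with silence at the edges."""
    units = [sil]
    for w in words:
        units += lexicon[w]
    units.append(sil)
    return units

units = utterance_units(["TWO", "FOUR", "SIX"], DICT)
# The resulting unit sequence defines the composite HMM trained
# against this utterance's speech.
```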

Page 19:

Let’s assume we are going to train the phonemes in the sentence:
– Two four six.

The phonemes of this sentence are:
– /t/ /w/ /o/ /f/ /o/ /r/ /s/ /i/ /x/

Therefore the HMM will be:

/sil/ /t/ /uw/ /f/ /o/ /r/ /s/ /i/ /x/ /sil/

Page 20:

We can estimate the parameters of each HMM using the forward-backward re-estimation formulas already defined.
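For reference, the forward recursion those re-estimation formulas build on can be sketched for a toy two-state, left-to-right HMM with discrete emissions. All probabilities here are invented; the backward pass and the re-estimation itself are omitted.

```python
# Toy left-to-right HMM: 2 states, discrete observations "x"/"y".
A = [[0.6, 0.4],                 # transition probabilities a[i][j]
     [0.0, 1.0]]
B = [{"x": 0.7, "y": 0.3},       # emission probabilities b[i][o]
     {"x": 0.2, "y": 0.8}]
pi = [1.0, 0.0]                  # initial state distribution

def forward(obs):
    """Compute the forward (alpha) trellis for an observation sequence."""
    alpha = [[pi[i] * B[i][obs[0]] for i in range(2)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(2)) * B[j][o]
                      for j in range(2)])
    return alpha

alpha = forward(["x", "y", "y"])
likelihood = sum(alpha[-1])      # P(O | model)
```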

Page 21:

The ability to automatically align each individual HMM to the corresponding unsegmented speech observation sequence is one of the most powerful features of the forward-backward algorithm.

Page 22:

Language Models for Large Vocabulary Speech Recognition

[Block diagram: a database provides speech and text transcriptions for training; the recognizer's output is scored.]

Page 23:

Instead of using only the likelihood P(O|Mi), recognition can be improved by maximizing the posterior probability: choose Mk such that

P(Mk|O) >= P(Mi|O),  i = 1, 2, …, q

which, since P(O) is the same for every model, is equivalent to

P(O|Mk) P(Mk) >= P(O|Mi) P(Mi)

[Diagram: the Viterbi decoder supplies P(O|Mi); the Language Model supplies P(Mi).]
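This decision rule can be sketched as an argmax over acoustic score times language-model prior; the model names and scores below are invented:

```python
# MAP decision rule: pick the model maximizing P(O|M) * P(M).
acoustic = {"M1": 1e-5, "M2": 4e-6}   # P(O|Mi), e.g. from Viterbi/HMM scoring
prior    = {"M1": 0.1,  "M2": 0.6}    # P(Mi), from the language model

best = max(acoustic, key=lambda m: acoustic[m] * prior[m])
# M1 scores 1e-6 and M2 scores 2.4e-6, so the language-model prior
# overturns the purely acoustic ranking.
```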

Page 24:

Language Models for Large Vocabulary Speech Recognition

Goal:
– Provide an estimate of the probability of a “word” sequence (w1 w2 w3 … wQ) for the given recognition task.

This can be solved as follows:

P(W) = P(w1 w2 w3 … wQ)

P(W) = P(w1) P(w2|w1) P(w3|w1 w2) … P(wQ|w1 w2 … wQ-1)

Page 25:

Since it is impossible to reliably estimate all of these conditional probabilities, in practice an N-gram language model is used:

P(wj|w1 w2 … wj-1) ≈ P(wj|wj-N+1 … wj-1)

In practice, reliable estimators are obtained for N = 1 (unigram), N = 2 (bigram), or possibly N = 3 (trigram).

Page 26:

Examples:

Unigram:
P(Maria loves Pedro) = P(Maria) P(loves) P(Pedro)

Bigram:
P(Maria loves Pedro) = P(Maria|<sil>) P(loves|Maria) P(Pedro|loves) P(</sil>|Pedro)
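The bigram example can be sketched directly, with made-up probabilities and `<s>`/`</s>` standing in for the sentence-boundary markers:

```python
# Hypothetical bigram probabilities for the example sentence.
BIGRAM = {
    ("<s>", "Maria"):   0.2,
    ("Maria", "loves"): 0.5,
    ("loves", "Pedro"): 0.4,
    ("Pedro", "</s>"):  0.6,
}

def bigram_prob(words):
    """P(sentence) as a product of P(w_j | w_{j-1}), with boundary markers."""
    seq = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p *= BIGRAM[(prev, cur)]
    return p

p = bigram_prob(["Maria", "loves", "Pedro"])
```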

Page 27:

CMU-Cambridge Language Modeling Tools

$base_dir/c_scripts/languageModelling

Page 28:

[Block diagram: a database provides speech and text transcriptions for training; the recognizer's output is scored.]

Page 29:

The trigram probabilities are estimated from counts:

P(Wi|Wi-2, Wi-1) = C(Wi-2 Wi-1 Wi) / C(Wi-2 Wi-1)

where

C(Wi-2 Wi-1) = total number of times the sequence Wi-2 Wi-1 was observed
C(Wi-2 Wi-1 Wi) = total number of times the sequence Wi-2 Wi-1 Wi was observed
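A minimal sketch of this maximum-likelihood estimate, computed over a tiny made-up corpus:

```python
from collections import Counter

def trigram_mle(corpus):
    """Maximum-likelihood trigram estimator P(Wi | Wi-2, Wi-1)
    built from raw counts, as in the formula above."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        for a, b, c in zip(sent, sent[1:], sent[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1       # count of the conditioning history
    return lambda w2, w1, w: tri[(w2, w1, w)] / bi[(w2, w1)]

corpus = [["two", "four", "six"], ["two", "four", "two"]]
P = trigram_mle(corpus)
# C(two four) = 2 and C(two four six) = 1, so P(six | two, four) = 0.5.
```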