
Page 1

Isolated-Word Speech Recognition Using Hidden Markov Models

6.962 Week 10 Presentation

Irina Medvedev

Massachusetts Institute of Technology

April 19, 2001

Page 2

Outline

• Markov Processes, Chains, Models

• Isolated-Word Speech Recognition

¤ Feature Analysis

¤ Unit Matching

¤ Training

¤ Recognition

• Conclusions

Page 3

Markov Process

• The process, $x(t)$, is first-order Markov if, for any set of ordered times, $t_1 < t_2 < \cdots < t_N$,

$$p_{x(t_N) \mid x(t_{N-1}), \ldots, x(t_1)}(x_N \mid x_{N-1}, \ldots, x_1) = p_{x(t_N) \mid x(t_{N-1})}(x_N \mid x_{N-1})$$

• The current value of a Markov process contains all of the memory necessary to predict the future

• The past does not add any additional information about the future

Page 4

Markov Process

• The transition probability density provides an important statistical description of a Markov process and is defined as

$$p_{x(t) \mid x(s)}(x \mid z), \quad t \ge s$$

• A complete specification of a Markov process consists of

• first-order density: $p_{x(t)}(x)$

• transition density: $p_{x(t) \mid x(s)}(x \mid z)$

Page 5

Markov Chains

[Figure: Fully Connected Markov Model, a 3-state diagram with states $S_1$, $S_2$, $S_3$ and transition probabilities $a_{ij}$ on every pair of states, including self-loops $a_{11}$, $a_{22}$, $a_{33}$]

• A Markov chain can be used to describe a system which, at any time, belongs to one of $N$ distinct states, $S_1, S_2, \ldots, S_N$

• At regularly spaced times, the system may stay in the same state or transition to a different state

• The state at time $t$ is denoted by $q_t$

Page 6

Markov Chains

• State transitions are made according to a set of probabilities associated with each state

• These probabilities are stored in the state transition matrix

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{bmatrix}$$

where $N$ is the number of states in the Markov chain.

• The state transition probabilities are

$$a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i), \quad 1 \le i, j \le N$$

and have the properties $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$
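To make the transition mechanics concrete, here is a minimal Python sketch (not from the slides; the 3-state matrix values are assumed for illustration) that samples a state sequence from a Markov chain governed by $\mathbf{A}$:

```python
import numpy as np

# Hypothetical 3-state transition matrix; each row sums to 1.
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

def simulate_chain(A, q0, T, rng=np.random.default_rng(0)):
    """Sample a state sequence q_1..q_T from a Markov chain."""
    states = [q0]
    for _ in range(T - 1):
        # Next state is drawn from the row of A for the current state
        states.append(rng.choice(len(A), p=A[states[-1]]))
    return states

print(simulate_chain(A, q0=0, T=10))
```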

Page 7

Hidden Markov Models

• Hidden Markov Models (HMMs) are used when the states are not observable events.

• Instead, the observation is a probabilistic function of the state rather than the state itself

• The states are described by a probability model

• The HMM is a doubly embedded stochastic process

Page 8

HMM Example: Coin Toss

Observed sequence: H H T H T

How do we build an HMM to explain the observed sequence of heads and tails?

Choose a 2-state model

Several possibilities exist

• 1-coin model (observable): a single coin with $P(\mathrm{H})$ and $P(\mathrm{T}) = 1 - P(\mathrm{H})$; the observations are the states themselves

• 2-coin model (states are hidden): two coins with self-transition probabilities $a_{11}$ and $a_{22}$ (switch probabilities $1 - a_{11}$, $1 - a_{22}$); coin $i$ emits heads with its own bias, $P_i(\mathrm{H})$, $P_i(\mathrm{T}) = 1 - P_i(\mathrm{H})$, so the observation sequence no longer reveals which coin was tossed
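A minimal simulation of the hidden 2-coin model (all parameter values below are assumed for illustration) shows how the coin identity stays hidden while only the H/T sequence is observed:

```python
import numpy as np

rng = np.random.default_rng(1)
a = {0: 0.7, 1: 0.6}          # assumed self-transition probs a11, a22
p_heads = {0: 0.9, 1: 0.2}    # assumed P_i(H): coin 1 biased to H, coin 2 to T

state, obs = 0, []
for _ in range(20):
    obs.append('H' if rng.random() < p_heads[state] else 'T')
    # Stay with probability a_ii, otherwise switch to the other coin
    state = state if rng.random() < a[state] else 1 - state

print(''.join(obs))  # only heads/tails are visible; the coin stays hidden
```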

Page 9

Hidden Markov Models

Hidden Markov Models are characterized by

• N, the number of states in the model

• A, the state transition matrix

• $\mathbf{B} = \{b_j\}$, the observation probability distribution in state $j$

• $\boldsymbol{\pi}$, the initial state distribution

Model parameter set: $\lambda = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$

Page 10

Left-Right HMM

[Figure: 4-state left-right HMM with no skip transitions; states $S_1 \rightarrow S_2 \rightarrow S_3 \rightarrow S_4$ with self-loops $a_{11}, a_{22}, a_{33}, a_{44}$ and forward transitions $a_{12}, a_{23}, a_{34}$]

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & 0 & 0 \\ 0 & a_{22} & a_{23} & 0 \\ 0 & 0 & a_{33} & a_{34} \\ 0 & 0 & 0 & a_{44} \end{bmatrix}$$

• Can only transition to a higher state or stay in the same state

• The no-skip constraint allows states to transition only to the next state or remain in the same state

• Zeros in the state transition matrix represent illegal state transitions

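As a sketch of how such a matrix might be initialized in code (the uniform 0.5 self-transition value is an assumption, not from the slides):

```python
import numpy as np

def left_right_A(N, self_prob=0.5):
    """No-skip left-right transition matrix: state i may only stay (a_ii)
    or advance to i+1 (a_i,i+1); the last state absorbs."""
    A = np.zeros((N, N))
    for i in range(N - 1):
        A[i, i] = self_prob
        A[i, i + 1] = 1.0 - self_prob
    A[N - 1, N - 1] = 1.0  # illegal transitions stay exactly zero
    return A

print(left_right_A(4))
```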

Page 11

Isolated-Word Speech Recognition

• Recognize one word at a time

• Assume incoming signal is of the form:

silence – speech – silence

• Feature Analysis

• Unit Matching

• Training

• Recognition

Page 12

Feature Analysis

• We perform feature analysis to extract observation vectors upon which all processing will be performed

• The discrete-time speech signal is $\mathbf{y} = [y_0 \; y_1 \; \cdots \; y_{V-1}]^T$ with discrete Fourier transform $\mathbf{Y} = [Y_0 \; Y_1 \; \cdots \; Y_{K-1}]^T$

• To reduce the dimensionality of the $V$-dim speech vector, we use cepstral coefficients, which serve as the feature observation vector for all future processing

Page 13

Cepstral Coefficients

• Feature vectors are cepstral coefficients obtained from the sampled speech vector

$$x_n = \frac{1}{K} \sum_{k=0}^{K-1} \log\!\left(\hat{S}_y(k)\right) e^{j 2\pi k n / K}$$

where $\hat{S}_y(k) = \frac{1}{V}\,|Y_k|^2$ is the periodogram estimate of the power spectral density of the speech

• We eliminate the zeroth component and keep cepstral coefficients 1 through $L-1$

• Dimensionality reduction: $L \ll V$
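A minimal sketch of this feature extraction, assuming a Hann-windowed frame and a K-point FFT; the frame length and the value of L below are illustrative, not from the slides:

```python
import numpy as np

def cepstral_features(frame, L=12, K=512):
    """Cepstral coefficients 1..L-1 via the inverse DFT of the log
    periodogram; component 0 (overall energy) is discarded."""
    windowed = frame * np.hanning(len(frame))
    periodogram = np.abs(np.fft.rfft(windowed, n=K)) ** 2 / len(frame)
    log_spec = np.log(periodogram + 1e-10)   # small floor avoids log(0)
    cepstrum = np.fft.irfft(log_spec, n=K)
    return cepstrum[1:L]                      # drop the zeroth coefficient

x = cepstral_features(np.random.default_rng(2).standard_normal(400))
print(x.shape)  # (11,)
```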

Page 14

Properties of Cepstral Coefficients

• Serve to undo the convolution between the pitch excitation and the vocal tract response

• High-order cepstral components carry speaker dependent pitch information, which is not relevant for speech recognition

• Cepstral coefficients are well approximated by a Gaussian probability density function (pdf)

• Correlation values of cepstral coefficients are very low

Page 15

Modeling of Cepstral Coefficients

• HMM assumes that the Markovian states generate the cepstral vectors

• Each state represents a Gaussian source with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Lambda}_i$

• Each feature vector of cepstral coefficients can be modeled as a sample vector of an $L$-dim Gaussian random vector with mean vector $\boldsymbol{\mu}_i$ and diagonal covariance matrix $\boldsymbol{\Lambda}_i$
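Since the covariance is diagonal, the per-state log-likelihood factors over components; a minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def log_gaussian_diag(x, mu, var):
    """Log pdf of an L-dim Gaussian with diagonal covariance, evaluated
    at feature vector x; var holds the diagonal entries of Lambda_i."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

L = 11
rng = np.random.default_rng(3)
x, mu, var = rng.standard_normal(L), np.zeros(L), np.ones(L)
print(log_gaussian_diag(x, mu, var))
```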

Page 16

Formulation of the Feature Vectors

[Figure: formulation of the feature vectors (MATLAB figure cepstral1.eps; EPS preview not available)]

Page 17

Unit Matching

• Initial Goal: obtain an HMM for each speech recognition unit

• Large vocabulary (300 words): recognition units are phonemes

• Small vocabulary (10 words): recognition units are words

We will consider an isolated-word speech recognition system for a small vocabulary of M words

Page 18

Notation

• Observation vector is $\mathbf{x} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}$, where each $\mathbf{x}_t$ is a cepstral feature vector and $T$ is the number of feature vectors in an observation

• State sequence is $\mathbf{q} = \{q_1, q_2, \ldots, q_T\}$, where each $q_t \in \{S_1, \ldots, S_N\}$

• State index: $1 \le i \le N$

• Word index: $1 \le v \le M$

• Time index: $1 \le t \le T$

• The term model will be used for both the HMM and the parameter set describing the HMM, $\lambda$

Page 19

Training

• We need to obtain an HMM for each of the M words

• The process of building the HMMs is called training

• Each HMM is characterized by the number of states, $N$, and the model parameter set, $\lambda$

• Each cepstral feature vector, $\mathbf{x}_t$, in state $S_i$ can be modeled by an $L$-dim Gaussian pdf

$$b_i(\mathbf{x}_t) = \frac{1}{(2\pi)^{L/2} |\boldsymbol{\Lambda}_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}_t - \boldsymbol{\mu}_i)^T \boldsymbol{\Lambda}_i^{-1} (\mathbf{x}_t - \boldsymbol{\mu}_i)\right)$$

where $\boldsymbol{\mu}_i$ is the mean vector and $\boldsymbol{\Lambda}_i$ is the covariance matrix in state $S_i$

Page 20

Training

• A Gaussian pdf is completely characterized by the mean vector and covariance matrix

• The model parameter set can be modified to $\lambda = (\mathbf{A}, \{\boldsymbol{\mu}_i\}, \{\boldsymbol{\Lambda}_i\}, \boldsymbol{\pi})$

• The training procedure is the same for each word. For convenience, we will drop the subscript $v$ from $\lambda_v$

Page 21

Building the HMM

• To build the HMM, we need to determine the parameter set that maximizes the likelihood of the observation for that word.

• Objective: $\displaystyle \max_{\lambda}\, \max_{\mathbf{q}}\, P(\mathbf{x}, \mathbf{q} \mid \lambda)$

• The double maximization can be performed by optimizing over the state sequence and the model individually

Page 22

Uniform Segmentation

[Figure: uniform segmentation of 50 feature vectors into 8 states $S_1, \ldots, S_8$, giving segment lengths 7, 7, 6, 6, 6, 6, 6, 6]

Determining the initial state sequence
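A sketch of the uniform split used for the initial alignment; note that it reproduces the 7, 7, 6, 6, 6, 6, 6, 6 segment lengths of the 50-frame, 8-state example above:

```python
import numpy as np

def uniform_segmentation(T, N):
    """Assign T feature vectors to N states in order, as evenly as possible."""
    # np.array_split puts the longer segments first: 50/8 -> 7,7,6,6,6,6,6,6
    segments = np.array_split(np.arange(T), N)
    q = np.concatenate([np.full(len(seg), i) for i, seg in enumerate(segments)])
    return q  # q[t] = state index of frame t

q = uniform_segmentation(50, 8)
print([int(np.sum(q == i)) for i in range(8)])  # [7, 7, 6, 6, 6, 6, 6, 6]
```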

Page 23

Maximization over the Model

• Given the initial state sequence, we maximize over the model

• The maximization entails estimating the model parameters from the observation given the state sequence

• Estimation is performed using the Baum-Welch re-estimation formulas

Page 24

Re-estimation Formulas

Mean vector per state:

$$\hat{\boldsymbol{\mu}}_i = \frac{1}{N_i} \sum_{t:\, q_t = S_i} \mathbf{x}_t, \quad 1 \le i \le N$$

Covariance matrix per state:

$$\hat{\boldsymbol{\Lambda}}_i = \mathrm{diag}\!\left(\frac{1}{N_i} \sum_{t:\, q_t = S_i} (\mathbf{x}_t - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_t - \hat{\boldsymbol{\mu}}_i)^T\right)$$

State transition matrix:

$$\hat{a}_{ij} = \frac{\text{number of transitions from } S_i \text{ to } S_j}{\text{number of transitions from } S_i}$$

Initial state distribution:

$$\hat{\pi}_i = \mathbb{1}\{q_1 = S_i\}$$

where $N_i$ is the number of feature vectors in state $S_i$
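A sketch of these estimates given a fixed state alignment q (names are illustrative; a real implementation would also guard against states that receive no frames):

```python
import numpy as np

def reestimate(X, q, N):
    """Estimate per-state means, diagonal covariances, and transition
    probabilities from feature matrix X (T x L) and state alignment q."""
    T = len(X)
    # Mean and diagonal covariance over the frames assigned to each state
    mu = np.array([X[q == i].mean(axis=0) for i in range(N)])
    var = np.array([X[q == i].var(axis=0) for i in range(N)])
    # Count transitions q_t -> q_{t+1}, then normalize each row
    A = np.zeros((N, N))
    for t in range(T - 1):
        A[q[t], q[t + 1]] += 1
    A /= np.maximum(A.sum(axis=1, keepdims=True), 1)
    return mu, var, A
```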

Page 25

Model Estimation

[Figure: model estimation example showing a state alignment over states $S_1, \ldots, S_{10}$ and times $t_1, \ldots, t_{14}$; counting transitions along the path gives a transition matrix $\mathbf{A}$ whose nonzero entries are values such as 1, 1/3, 2/3, and 1/2]

Page 26

Maximization over the state sequence

• Given the model, we maximize over the state sequence

• The probability expression can be rewritten as

$$P(\mathbf{x}, \mathbf{q} \mid \lambda) = \pi_{q_1} b_{q_1}(\mathbf{x}_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(\mathbf{x}_t)$$

Page 27

Maximization over the state sequence

• Applying the logarithm transforms the maximization of a product into a maximization of a sum

$$\log P(\mathbf{x}, \mathbf{q} \mid \lambda) = \log \pi_{q_1} + \log b_{q_1}(\mathbf{x}_1) + \sum_{t=2}^{T} \left[\log a_{q_{t-1} q_t} + \log b_{q_t}(\mathbf{x}_t)\right]$$

• We are still looking for the state sequence that maximizes the expression

• The optimal state sequence can be determined using the Viterbi algorithm

Page 28

Trellis Structure of HMMs

[Figure: trellis with states $S_1, \ldots, S_5$ on the vertical axis and times $t_1, \ldots, t_5$ on the horizontal axis]

• Redrawing the HMM as a trellis makes it easy to see the state sequence as a path through the trellis

• The optimal state sequence is determined by the Viterbi algorithm as the single best path that maximizes $\log P(\mathbf{x}, \mathbf{q} \mid \lambda)$
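A log-domain Viterbi sketch over such a trellis (log_b is assumed to be a T x N matrix of per-state log emission scores; names are illustrative):

```python
import numpy as np

def viterbi(log_pi, log_A, log_b):
    """Best path through the trellis: maximizes log pi_{q1} + log b_{q1}(x_1)
    + sum_t [log a_{q_{t-1} q_t} + log b_{q_t}(x_t)]."""
    T, N = log_b.shape
    delta = log_pi + log_b[0]            # best score ending in each state
    psi = np.zeros((T, N), dtype=int)    # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: come from i, go to j
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(N)] + log_b[t]
    # Trace the best path backwards from the best final state
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(np.max(delta))
```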

Page 29

Training Procedure

[Flowchart: speech $y(n)$ → Cepstral Calculation → feature vectors $\mathbf{x}$ → Uniform Segmentation → Estimation of $\mathbf{A}, \boldsymbol{\mu}_i, \boldsymbol{\Lambda}_i, \boldsymbol{\pi}$ (Baum-Welch) → State Sequence Segmentation (Viterbi), maximizing $\max_{\mathbf{q}} P(\mathbf{x}, \mathbf{q} \mid \lambda)$ → Converged? if No, re-estimate; if Yes, output the model]
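Putting the pieces together, a minimal training loop in the spirit of this flowchart, reusing the hypothetical helpers sketched on the earlier slides (uniform_segmentation, reestimate, log_gaussian_diag, viterbi); testing convergence on the alignment is one simple choice among several:

```python
import numpy as np

def train_word_model(X, N, iters=20):
    """Segmental training sketch: uniform initial alignment, then alternate
    parameter re-estimation and Viterbi re-segmentation until the
    alignment stops changing."""
    q = uniform_segmentation(len(X), N)
    log_pi = np.log(np.eye(N)[0] + 1e-12)  # assume left-right start in S_1
    for _ in range(iters):
        mu, var, A = reestimate(X, q, N)
        log_b = np.array([[log_gaussian_diag(x, mu[i], var[i] + 1e-6)
                           for i in range(N)] for x in X])
        q_new, _ = viterbi(log_pi, np.log(A + 1e-12), log_b)
        if np.array_equal(q_new, q):        # alignment converged
            break
        q = np.asarray(q_new)
    return mu, var, A
```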

Page 30

Recognition

• We have a set of HMMs, one for each word

• Objective: Choose the word model that maximizes the probability of the observation given the model (Maximum Likelihood detection rule)

• Classifier for observation $\mathbf{x}$ is

$$v^* = \arg\max_{1 \le v \le M} P(\mathbf{x} \mid \lambda_v)$$

• The likelihood can be written as a summation over all state sequences

$$P(\mathbf{x} \mid \lambda_v) = \sum_{\text{all } \mathbf{q}} P(\mathbf{x}, \mathbf{q} \mid \lambda_v)$$

Page 31

Recognition

• Replace the full likelihood by an approximation that takes into account only the most probable state sequence capable of producing the observation

• Treating the most probable state sequence as the best path in the HMM trellis allows us to use the Viterbi algorithm to maximize the above probability

• The best-path classifier for observation $\mathbf{x}$ is

$$v^* = \arg\max_{1 \le v \le M} \left[\max_{\mathbf{q}} P(\mathbf{x}, \mathbf{q} \mid \lambda_v)\right]$$
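A sketch of the best-path classifier, reusing the hypothetical viterbi and log_gaussian_diag helpers from earlier; models is an assumed list of per-word parameter tuples:

```python
import numpy as np

def recognize(X, models):
    """Best-path ML classifier: score observation X against each word
    model with Viterbi and return the index of the best-scoring word."""
    scores = []
    for log_pi, log_A, mu, var in models:
        log_b = np.array([[log_gaussian_diag(x, m, v)
                           for m, v in zip(mu, var)] for x in X])
        _, score = viterbi(log_pi, log_A, log_b)
        scores.append(score)
    return int(np.argmax(scores))
```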

Page 32

Recognition

[Block diagram: speech $y(n)$ → Cepstral Calculation → $\mathbf{x}$ → parallel likelihood computations $P(\mathbf{x} \mid \lambda_1), P(\mathbf{x} \mid \lambda_2), \ldots, P(\mathbf{x} \mid \lambda_M)$ → Select Maximum → index of recognized word]

Page 33

Conclusion

• Introduced hidden Markov models

• Described the process of isolated-word speech recognition

¤ Feature vectors

¤ Unit matching

¤ Training

¤ Recognition

• Other considerations

¤ Artificial Neural Networks (ANNs) for speech recognition

¤ Hybrid HMM/ANN models

¤ Minimum classification error HMM design