
Speech Processing and Understanding

CSC401 Assignment 3

Frank Rudzicz and Siavash Kazemian


Agenda

• Background

• Speech technology, in general

• Acoustic phonetics

• Assignment 3

• Speaker Recognition: Gaussian mixture models

• Speech Recognition:

• Continuous hidden Markov models

• Transcription

• The Forward Algorithm

• Word-error rates with Levenshtein distance.


Speech Recognition

● Time-variant acoustic pressure waves + signal processing.
● Feature extraction.
● Segmentation and classification of segments by phoneme.
● Selection of the most likely syntax.
● Possible performance of the proper procedure.

“open the pod bay doors”

/ow p ah n dh ah p aa d b ey d ao r z/

open(podBay.doors);


Applications of ASR

• Multimodality & HCI (e.g., “Put this there.”)

• Dictation (e.g., “My hands are in the air.”)

• Telephony (e.g., “Buy ticket... AC490... yes”)

• Emerging: data mining/indexing, assistive technology, conversation.


Formants in sonorants

• However, formants are insufficient features for use in speech recognition generally...


Mel-frequency cepstral coefficients

• In real speech data, the spectrogram is often transformed to a representation that more closely represents human auditory response and is more amenable to accurate classification.

• MFCCs are ‘spectra of spectra’. They are the discrete cosine transform of the logarithms of the nonlinearly Mel-scaled powers of the Fourier transform of windows of the original waveform.


Challenges in speech data

• Co-articulation and dropped phonemes.
• (Intra- and inter-) speaker variability.
• No word boundaries.
• Slurring, disfluency (e.g., ‘um’).
• Signal noise.
• High dimensionality.


Phonemes

• Words are formed by phonemes (aka ‘phones’),

e.g., ‘pod’ = /p aa d/

• Words have different pronunciations, and in practice we can never be certain which phones were uttered, nor their start/stop points.

[Figure: three levels of representation for “open the pod bay doors”.
 Syntactic: Sentence → Verb phrase → Verb (“open”) + Noun phrase (Det “the”, Modifier (Noun “pod”, Noun “bay”), Noun (plu) “doors”).
 Lexical: open the pod bay doors.
 Phonemic: ow p ah n dh ah p aa d b ey d ao r z]


Phonetic alphabets

• International Phonetic Association (IPA)

• Can represent sounds in all languages

• Contains non-ASCII characters

• ARPAbet

• One of the earliest attempts at encoding English for early speech recognition.

• TIMIT/CMU

• Very popular among modern databases for speech recognition.

• Used in assignment 3


Example phonetic alphabets

• The other consonants are transcribed as you would expect

• i.e., p, b, m, t, d, n, k, g, s, z, f, v, w, h


Assignment 3

• Two parts:

• Speaker identification: Determine which of 30 speakers an unknown test sample of speech comes from, given Gaussian mixture models you will train for each speaker.

• Speech recognition: Learn about phonetic annotation of speech data, training continuous hidden Markov models, and using these to identify phonemes with probabilities produced by the Forward algorithm. You will also learn to compute word-error rates with the Levenshtein distance.


Speaker Data

• 30 speakers (e.g., FCJF0, MDPK0).

• Each speaker has 9 training utterances.

• e.g., Training/FCJF0/SA1.*, Training/FCJF0/SI1028.*

• Each utterance has 5 files:

• *.wav : The original wave file.

• *.mfcc : The relevant acoustic features.

• *.txt : Sentence-level transcription.

• *.wrd : Word-level transcription.

• *.phn : Phoneme-level transcription.


Speaker Data (cont.)

• All you need to know: a speech utterance is an N×d matrix.

• Each row represents the features of a d-dimensional point in time.

• There are N rows in a sequence of N frames.

• These data are stored in space-delimited text files (*.mfcc):

    X1[1]  X1[2]  ...  X1[d]
    X2[1]  X2[2]  ...  X2[d]
    ...    ...    ...  ...
    XN[1]  XN[2]  ...  XN[d]

(rows 1..N are frames in time; columns 1..d are the data dimensions)


Speaker Data (cont.)

• You are given ‘transcription files’ (*.wrd, *.txt, *.phn) in which each line tells the start and end frames for a unit of speech.

• For example, if a *.wrd file has the line

  501 540 pod

then frames 501 to 540 inclusive represent the word ‘pod’:

    ...      ...      ...  ...
    X501[1]  X501[2]  ...  X501[d]
    ...      ...      ...  ...
    X540[1]  X540[2]  ...  X540[d]
    ...      ...      ...  ...

(rows 501 through 540 of the frames-by-dimensions matrix)


Speaker Recognition

• There are 30 testing utterances.

• e.g., Testing/unkn_1.*, Testing/unkn_2.*

• Each speaker produced 1 of these testing utterances.

• We don't know which speaker produced which test utterance.

• Every speaker occupies a characteristic part of the acoustic space.

• We want to learn a probability distribution for each speaker that describes their acoustic behaviour.

• Use those distributions to identify the speaker-dependent features of some unknown sample of speech data.


Some background: fitting to data

• Given a set of observations X of some random variable, we wish to interpolate a simple, continuous function that allows us to make decisions about future events.

• e.g., if I fit curves to observed data from speakers A and B, and later observe an unlabeled datum x = 15, I can infer that x is much more likely to come from speaker B than from speaker A.

[Figure: fitted curves for speakers A and B, each evaluated at x = 15]


Probability of observing the data

• We want to choose fits that better match our observed data.

• The data (blue bars) are more likely to come from the parameterization on the left, so we like that one better.


Finding parameters: 1D Gaussians

• Often called Normal distributions.

• The parameters we can adjust to fit the data are $\mu$ and $\sigma$:

$$p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$


Maximum likelihood estimation

• Given data $X = \{x_1, \dots, x_N\}$, maximum likelihood estimation (MLE) produces an estimate $\hat\theta$ of a parameter set $\theta$ by maximizing the likelihood of observing the data:

$$\hat\theta = \operatorname*{argmax}_{\theta} L(X;\theta)$$

• Here, for i.i.d. data, $L(X;\theta) = \prod_{i=1}^{N} p(x_i;\theta)$.

• The likelihood function provides a surface over all possible parameterizations. To find the highest likelihood, we usually look at the derivative

$$\frac{\partial}{\partial\theta} L(X;\theta) = 0$$

to see at which point the function stops growing.


MLE - 1D Gaussian

• Estimate $\mu$:

$$\hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

A similar approach gives the MLE estimate of $\sigma^2$:

$$\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat\mu)^2$$


Multidimensional Gaussians

• When your data are d-dimensional, the input variable is the vector $\vec{x} = [x_1\ x_2\ \cdots\ x_d]^{\top}$,

the mean vector is $\vec{\mu} = E[\vec{x}] = [\mu_1\ \mu_2\ \cdots\ \mu_d]^{\top}$,

and the covariance matrix is $\Sigma = E[(\vec{x}-\vec{\mu})(\vec{x}-\vec{\mu})^{\top}]$, with entries $\Sigma_{ij} = E[(x_i-\mu_i)(x_j-\mu_j)]$,

giving the density

$$p(\vec{x};\vec{\mu},\Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\vec{x}-\vec{\mu})^{\top}\Sigma^{-1}(\vec{x}-\vec{\mu})\right)$$


Non-Gaussian data

• Our speaker data does not behave unimodally.

• i.e., we can't use just 1 Gaussian per speaker.

• E.g., observations below occur mostly bimodally, so fitting 1 Gaussian would not be representative.


Gaussian mixtures

• Gaussian mixtures are a weighted linear combination of M component Gaussians, $b_1, \dots, b_M$, such that

$$p(x;\theta) = \sum_{m=1}^{M} \omega_m\, b_m(x), \qquad \sum_{m=1}^{M} \omega_m = 1$$

• Since the mixture weights $\omega_m$ are unknown, learning a GMM function is semi-parametric and semi-supervised.


MLE for Gaussian mixtures

• For notational convenience, let $b_m(x)$ be the density of the mth component Gaussian and $\omega_m$ its weight,

so $p(x;\theta) = \sum_{m=1}^{M} \omega_m b_m(x)$,

and $\theta = \{\omega_m, \vec{\mu}_m, \Sigma_m\}_{m=1}^{M}$.

• To find $\hat\theta$, we solve $\frac{\partial}{\partial\theta}\log L(X;\theta) = 0$, where $\log L(X;\theta) = \sum_{t=1}^{T} \log p(x_t;\theta)$.

...see Appendix for more


MLE for Gaussian mixtures (pt. 2)

• Given $p(m|x_t;\theta) = \dfrac{\omega_m b_m(x_t)}{\sum_{k=1}^{M}\omega_k b_k(x_t)}$,

• and since $\sum_{m}\omega_m = 1$, we obtain the re-estimates by solving for each parameter in $\frac{\partial \log L}{\partial\theta} = 0$:

$$\hat\omega_m = \frac{1}{T}\sum_{t=1}^{T} p(m|x_t;\theta), \qquad \hat{\vec{\mu}}_m = \frac{\sum_{t} p(m|x_t;\theta)\, x_t}{\sum_{t} p(m|x_t;\theta)}, \quad \text{and} \quad \hat\Sigma_m = \frac{\sum_{t} p(m|x_t;\theta)\, x_t x_t^{\top}}{\sum_{t} p(m|x_t;\theta)} - \hat{\vec{\mu}}_m \hat{\vec{\mu}}_m^{\top}$$


Recipe for GMM ML estimation

Do the following for each speaker individually, using all the frames available in their respective Training directories.

1. Initialize: guess $\theta = \{\omega_m, \vec{\mu}_m, \Sigma_m\}$ with M random vectors from the data, or by performing k-means clustering with k = M.

2. Compute likelihood: compute $b_m(x_t)$ and $p(m|x_t;\theta)$ for every frame and component.

3. Update parameters: re-estimate $\omega_m$, $\vec{\mu}_m$, and $\Sigma_m$ with the equations above.

Repeat steps 2 and 3 until $\log L(X;\theta)$ converges (see the sketch below).


Cheat sheet

$p(x_t;\theta) = \sum_{m=1}^{M}\omega_m b_m(x_t)$ : probability of $x_t$ in the GMM

$p(m|x_t;\theta) = \dfrac{\omega_m b_m(x_t)}{\sum_{k=1}^{M}\omega_k b_k(x_t)}$ : probability of the mth Gaussian, given $x_t$

$b_m(x_t) = \mathcal{N}(x_t;\vec{\mu}_m,\Sigma_m)$ : probability of observing $x_t$ in the mth Gaussian

$\omega_m$ : prior probability of the mth Gaussian


Initializing theta

• Initialize each $\vec{\mu}_m$ to a random vector from the data.

• Initialize $\Sigma_m$ to a ‘random’ diagonal matrix (or the identity matrix).

• Initialize $\omega_m$ randomly, with these constraints: $\omega_m \ge 0$ and $\sum_{m=1}^{M}\omega_m = 1$.

• A good choice would be to set each $\omega_m$ to 1/M.


Practical tips for MLE of GMMs

• Assume diagonal covariance matrices to reduce the number of parameters.

• Numerical stability: compute $\log b_m(x)$ and $\log p(x;\theta)$ in the log domain.

• Efficiency: pre-compute the parameter-only terms of $\log b_m(x)$.


Speech recognition task

• You will train one HMM for each phoneme in the database: /s/, /sh/, /z/, /iy/, /eh/, /ih/, ...

• Given phonetically annotated test utterances, determine if the correct HMM gives the highest probability for each phone segment.

• e.g., unkn_24.phn contains the lines

  ...
  64 85 ae
  85 96 s
  96 102 epi
  102 106 m
  ...

so frames 85 to 96 of unkn_24.mfcc form a segment X, and we ask: is P(X | the /s/ HMM) > P(X | all other HMMs)?


Continuous multivariate HMMs

• Multivariate HMMs with continuous GMM emission likelihoods are very similar to discrete HMMs, except

• Instead of reading letters or words from a discrete alphabet, we’re reading d-dimensional feature vectors consisting of real numbers.

• Observation likelihoods are computed with Gaussian mixture models.

• For each phoneme, you should have a three-state HMM: a 3-state monophone (e.g., /s/).


Continuous, multivariate HMMs

[Figure: a 3-state monophone (e.g., /s/) with states 0, 1, and 2, each emitting through its own GMM output distribution $P_\Theta(0)$, $P_\Theta(1)$, $P_\Theta(2)$]

Given a d-dimensional transformation, $\vec{x}_t$, which represents ~16ms of speech, and given that you’re in state s at time t, compute $b_s(\vec{x}_t) = \sum_{m=1}^{M}\omega_{s,m}\, b_{s,m}(\vec{x}_t)$, which is the standard GMM output formula. Each state has its own parameters $\Theta_s$.


Concatenating phones into words

[Figure: a word-level HMM formed by chaining phone HMMs end-to-end]


Tips

• During training you might get messages like

  ****** likelihood decreased from X to Y!

This is an artifact of the way the toolbox estimates likelihoods.

You can usually ignore these messages, but they may be indicative of non-ideal estimates – I don't get them.

• Each phone will usually take between 5 and 15 EM iterations.

• Training your HMMs can take up to 4 or 5 hours at 100% CPU.

• Debug your code on a small subset of all training data.

• Try using screen.


Phonetic annotation

• Problem: The first 10 *.phn files in the Testing directory have errors.

• Specifically, unkn_1.phn to unkn_10.phn.

• Task: You must fix unkn_1.phn to unkn_10.phn with Wavesurfer. e.g.,

  unkn_1.phn (before):    unkn_1.phn (after):
  0 83 h#                 0 83 h#
  84 92 m                 84 92 n
  93 100 ow               93 100 ow
  101 114 hh              101 114 hh
  ...                     ...


Wavesurfer - getting started

1. Open Wavesurfer.

2. Open a *.wav file from the Testing directory.

3. When prompted, select “TIMIT transcription”


Wavesurfer - getting started 2

1. You should see a black-and-white spectrogram and, below it, a pane marked “.PHN”: this is where you transcribe with phonemes.


Wavesurfer - example

1. Left-click in the PHN panel where you think a phoneme

ends. The red vertical bar helps you visualize this on the spectrogram.

2. Type the code of the phoneme, in this case ‘hh’

• Note: You are given *.wrd and *.txt files, which tell you the words spoken.

• Look these words up in the CMU dictionary to know the expected pronunciations.


Wavesurfer - example 2

1. You can adjust the boundary by hovering the mouse over the vertical line until you get a double-arrow. Left-click and drag to adjust.

2. To adjust a label, hover the mouse over the text until the mouse turns into a cursor, then select the label and type.

Hint: Always listen to the segment you’ve labelled, and adjust accordingly.


Wavesurfer - example 3

1. To save, middle-click on the PHN panel, and select “Save

Transcription As...”. Other actions are also available here, such as label deletion.

2. To adjust a label, hover the mouse over the text until the mouse turns into a cursor, then select the label and type.


Wavesurfer result

• At the end, you will have fixed unkn_1.phn to unkn_10.phn.

  unkn_1.phn:
  0 83 h#
  84 92 n
  93 100 ow
  101 114 hh
  ...

  unkn_2.phn ... unkn_10.phn:
  ## ## ??
  ## ## ??
  ...

• These *.phn files are platform-independent; it does not matter which version of Wavesurfer you use.

• Boundaries between phones are not always easy to determine. You will not be marked on how similar your boundaries are to ours.


Wavesurfer - basic mouse controls

Mouse controls (Linux, CDF):

• Left: Specify a new end-boundary of a phone.

• Middle: Play the audio of the selected segment.

• Right: Context menu.

Mouse controls (if you're forced to use Mac OS X):

● Left: Specify a new end-boundary of a phone.

● Middle: Context menu.

● Right: Play the audio of the selected segment.


Using the CMU dictionary

• Each of the *.wav files you transcribe is associated with a corresponding *.txt file, which has the orthography of the utterance.

• For each word in the utterance, search the CMU dictionary to get the associated transcription.

• Some words have multiple pronunciations. Pick the one you think best matches the audio. Ignore stress markers on phones (i.e., ‘0’, ‘1’, ‘2’). e.g.,

  EVOLUTION      EH2 V AH0 L UW1 SH AH0 N
  EVOLUTION(2)   IY2 V AH0 L UW1 SH AH0 N
  EVOLUTION(3)   EH2 V OW0 L UW1 SH AH0 N
  EVOLUTION(4)   IY2 V OW0 L UW1 SH AH0 N
  EVOLUTIONARY   EH2 V AH0 L UW1 SH AH0 N EH2 R IY0


Word-error rates

• If somebody said

  REF: how to recognize speech

but an ASR system heard

  HYP: how to wreck a nice beach

how do we measure the error that occurred?

• One measure is #CorrectWords / #HypothesisWords, e.g., 2/6 above.

• Another measure is (S+I+D)/#ReferenceWords

• S: # Substitution errors (one word for another)

• I: # Insertion errors (extra words)

• D: # Deletion errors (words that are missing).


Computing Levenshtein Distance

• In the example

  REF: how to recognize speech
  HYP: how to wreck a nice beach

How do we count each of S, I, and D?

• If wreck is a substitution error, what about a and nice?



• Levenshtein distance:

  Initialize R[0,0] = 0, and R[i,j] = ∞ wherever i = 0 or j = 0 (but not both)
  for i = 1..n (#ReferenceWords)
      for j = 1..m (#HypothesisWords)
          R[i,j] = min( R[i-1,j] + 1        (deletion)
                        R[i-1,j-1]          (only if words match)
                        R[i-1,j-1] + 1      (only if words differ)
                        R[i,j-1] + 1 )      (insertion)
  Return 100 × R[n,m] / n


Levenshtein example

Filling the table row by row for REF “how to recognize speech” vs. HYP “how to wreck a nice beach”:

                 how   to   wreck   a   nice   beach
            0     ∞     ∞     ∞     ∞     ∞      ∞
how         ∞     0     1     2     3     4      5
to          ∞     1     0     1     2     3      4
recognize   ∞     2     1     1     2     3      4
speech      ∞     3     2     2     2     3      4

Word-error rate is 4/4 = 100%

2 substitutions, 2 insertions


Appendices


Multidimensional Gaussians, pt. 2

• If the ith and jth dimensions are statistically independent, then $E[x_i x_j] = E[x_i]E[x_j]$ and $\Sigma_{ij} = 0$.

• If all dimensions are statistically independent, then $\Sigma_{ij} = 0$ for all $i \ne j$, the covariance matrix becomes diagonal, and

$$p(\vec{x};\theta) = \prod_{i=1}^{d} p(x_i;\theta_i) = \prod_{i=1}^{d}\frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}}$$

where $\theta_i = \{\mu_i, \sigma_i^2\}$.


MLE example - dD Gaussians

• The MLE estimates for parameters $\theta = \{\vec{\mu}, \Sigma\}$ given i.i.d. training data $X = \{\vec{x}_1, \dots, \vec{x}_N\}$ are obtained by maximizing the joint likelihood

$$L(X;\theta) = \prod_{t=1}^{N} p(\vec{x}_t;\theta)$$

• To do so, we solve $\frac{\partial}{\partial\theta}\log L(X;\theta) = 0$, where $\log L(X;\theta) = \sum_{t=1}^{N}\log p(\vec{x}_t;\theta)$

• Giving

$$\hat{\vec{\mu}} = \frac{1}{N}\sum_{t=1}^{N}\vec{x}_t, \qquad \hat\Sigma = \frac{1}{N}\sum_{t=1}^{N}(\vec{x}_t-\hat{\vec{\mu}})(\vec{x}_t-\hat{\vec{\mu}})^{\top}$$


MLE for Gaussian mixtures (pt. 1.5)

• Given $\log L(X;\theta) = \sum_{t}\log p(x_t;\theta)$ and $p(x_t;\theta) = \sum_{m}\omega_m b_m(x_t)$,

• obtain an ML estimate, $\hat{\vec{\mu}}_m$, of the mean vector by maximizing $\log L$ w.r.t. $\vec{\mu}_m$:

$$\frac{\partial \log L}{\partial \vec{\mu}_m} = \sum_{t}\frac{1}{p(x_t;\theta)}\,\frac{\partial}{\partial \vec{\mu}_m}\big(\omega_m b_m(x_t)\big) = 0$$

• Why?

  • The derivative of a sum is the sum of the derivatives.

  • The derivative rule for log_e: $\frac{\partial}{\partial\theta}\log f = \frac{1}{f}\frac{\partial f}{\partial\theta}$.

  • The derivative w.r.t. $\vec{\mu}_m$ is 0 for all other mixtures in the sum in $p(x_t;\theta)$.

Suggested reading