
Speech Processing and Understanding

CSC401 Assignment 3

Frank Rudzicz and Siavash Kazemian


Agenda

• Background

• Speech technology, in general

• Acoustic phonetics

• Assignment 3

• Speaker Recognition: Gaussian mixture models

• Speech Recognition:

• Continuous hidden Markov models

• Transcription

• The Forward Algorithm

• Word-error rates with Levenshtein distance.


Speech Recognition

● Time-variant acoustic pressure waves + signal processing.
● Feature extraction.
● Segmentation and classification of segments by phoneme.
● Selection of the most likely syntax.
● Possible performance of the proper procedure.

“open the pod bay doors”

/ow p ah n dh ah p aa d b ey d ao r z/

open(podBay.doors);


Applications of ASR

• Multimodality & HCI (e.g., “Put this there.”)

• Dictation (e.g., “My hands are in the air.”)

• Telephony (e.g., “Buy ticket... AC490... yes”)

• Emerging: data mining/indexing, assistive technology, conversation.


Formants in sonorants

• However, formants are insufficient features for use in speech recognition generally...


Mel-frequency cepstral coefficients

• In real speech data, the spectrogram is often transformed to a representation that more closely represents human auditory response and is more amenable to accurate classification.

• MFCCs are ‘spectra of spectra’. They are the discrete cosine transform of the logarithms of the nonlinearly Mel-scaled powers of the Fourier transform of windows of the original waveform.


Challenges in speech data

• Co-articulation and dropped phonemes.
• (Intra- and inter-) speaker variability.
• No word boundaries.
• Slurring, disfluency (e.g., ‘um’).
• Signal noise.
• High dimensionality.


Phonemes

• Words are formed by phonemes (aka ‘phones’),

e.g., ‘pod’ = /p aa d/

• Words have different pronunciations, and in practice we can never be certain which phones were uttered, nor their start/stop points.

[Figure: three levels of representation for “open the pod bay doors”.
 Syntactic: Sentence → Verb phrase → Verb (“open”) + Noun phrase (Det “the”, Modifier (Noun “pod”, Noun “bay”), Noun (plu) “doors”).
 Lexical: open the pod bay doors.
 Phonemic: ow p ah n dh ah p aa d b ey d ao r z]


Phonetic alphabets

• International Phonetic Association (IPA)

• Can represent sounds in all languages

• Contains non-ASCII characters

• ARPAbet

• One of the earliest attempts at encoding English for early speech recognition.

• TIMIT/CMU

• Very popular among modern databases for speech recognition.

• Used in assignment 3


Example phonetic alphabets

• The other consonants are transcribed as you would expect

• i.e., p, b, m, t, d, n, k, g, s, z, f, v, w, h


Assignment 3

• Two parts:

• Speaker identification: Determine which of 30 speakers an unknown test sample of speech comes from, given Gaussian mixture models you will train for each speaker.

• Speech recognition: Learn about phonetic annotation of speech data, training continuous hidden Markov models, and using these to identify phonemes with probabilities produced by the Forward algorithm. You will also learn to compute word-error rates with the Levenshtein distance.


Speaker Data

• 30 speakers (e.g., FCJF0, MDPK0).

• Each speaker has 9 training utterances.

• e.g., Training/FCJF0/SA1.*, Training/FCJF0/SI1028.*

• Each utterance has 5 files:

• *.wav : The original wave file.

• *.mfcc : The relevant acoustic features.

• *.txt : Sentence-level transcription.

• *.wrd : Word-level transcription.

• *.phn : Phoneme-level transcription.


Speaker Data (cont.)

• All you need to know: a speech utterance is an N×d matrix.

• Each row represents the features of a d-dimensional point in time.

• There are N rows in a sequence of N frames.

• These data are stored in space-delimited text files (*.mfcc):

    X1[1]  X1[2]  ...  X1[d]
    X2[1]  X2[2]  ...  X2[d]
    ...    ...    ...  ...
    XN[1]  XN[2]  ...  XN[d]

(rows 1..N are frames in time; columns 1..d are the data dimensions)


Speaker Data (cont.)

• You are given ‘transcription files’ (*.wrd, *.txt, *.phn) in which each line tells the start and end frames for a unit of speech.

• For example, if a *.wrd file has the line

  501 540 pod

then frames 501 to 540 inclusive represent the word ‘pod’:

    ...      ...      ...  ...
    X501[1]  X501[2]  ...  X501[d]
    ...      ...      ...  ...
    X540[1]  X540[2]  ...  X540[d]
    ...      ...      ...  ...

(rows 501 through 540 of the frames-by-dimensions matrix)


Speaker Recognition

• There are 30 testing utterances.

• e.g., Testing/unkn_1.*, Testing/unkn_2.*

• Each speaker produced 1 of these testing utterances.

• We don't know which speaker produced which test utterance.

• Every speaker occupies a characteristic part of the acoustic space.

• We want to learn a probability distribution for each speaker that describes their acoustic behaviour.

• Use those distributions to identify the speaker-dependent features of some unknown sample of speech data.


Some background: fitting to data

• Given a set of observations X of some random variable, we wish to interpolate a simple, continuous function that allows us to make decisions about future events.

• e.g., if I fit curves to observed data from speakers A and B, and later observe an unlabeled datum x = 15, I can infer that x is much more likely to come from speaker B than from speaker A.

[Figure: fitted curves for speakers A and B, each evaluated at x = 15]


Probability of observing the data

• We want to choose fits that better match our observed data.

• The data (blue bars) are more likely to come from the parameterization on the left, so we like that one better.


Finding parameters: 1D Gaussians

• Often called Normal distributions.

• The parameters we can adjust to fit the data are $\mu$ and $\sigma$:

$$p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$


Maximum likelihood estimation

• Given data $X = \{x_1, \dots, x_N\}$, maximum likelihood estimation (MLE) produces an estimate $\hat\theta$ of a parameter set $\theta$ by maximizing the likelihood of observing the data:

$$\hat\theta = \operatorname*{argmax}_{\theta} L(X;\theta)$$

• Here, for i.i.d. data, $L(X;\theta) = \prod_{i=1}^{N} p(x_i;\theta)$.

• The likelihood function provides a surface over all possible parameterizations. To find the highest likelihood, we usually look at the derivative

$$\frac{\partial}{\partial\theta} L(X;\theta) = 0$$

to see at which point the function stops growing.


MLE - 1D Gaussian

• Estimate $\mu$:

$$\hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

A similar approach gives the MLE estimate of $\sigma^2$:

$$\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat\mu)^2$$


Multidimensional Gaussians

• When your data are d-dimensional, the input variable is the vector $\vec{x} = [x_1\ x_2\ \cdots\ x_d]^{\top}$,

the mean vector is $\vec{\mu} = E[\vec{x}] = [\mu_1\ \mu_2\ \cdots\ \mu_d]^{\top}$,

and the covariance matrix is $\Sigma = E[(\vec{x}-\vec{\mu})(\vec{x}-\vec{\mu})^{\top}]$, with entries $\Sigma_{ij} = E[(x_i-\mu_i)(x_j-\mu_j)]$,

giving the density

$$p(\vec{x};\vec{\mu},\Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\vec{x}-\vec{\mu})^{\top}\Sigma^{-1}(\vec{x}-\vec{\mu})\right)$$


Non-Gaussian data

• Our speaker data does not behave unimodally.

• i.e., we can't use just 1 Gaussian per speaker.

• E.g., observations below occur mostly bimodally, so fitting 1 Gaussian would not be representative.


Gaussian mixtures

• Gaussian mixtures are a weighted linear combination of M component Gaussians, $b_1, \dots, b_M$, such that

$$p(x;\theta) = \sum_{m=1}^{M} \omega_m\, b_m(x), \qquad \sum_{m=1}^{M} \omega_m = 1$$

• Since the mixture weights $\omega_m$ are unknown, learning a GMM function is semi-parametric and semi-supervised.


MLE for Gaussian mixtures

• For notational convenience, let $b_m(x)$ be the density of the mth component Gaussian and $\omega_m$ its weight,

so $p(x;\theta) = \sum_{m=1}^{M} \omega_m b_m(x)$,

and $\theta = \{\omega_m, \vec{\mu}_m, \Sigma_m\}_{m=1}^{M}$.

• To find $\hat\theta$, we solve $\frac{\partial}{\partial\theta}\log L(X;\theta) = 0$, where $\log L(X;\theta) = \sum_{t=1}^{T} \log p(x_t;\theta)$.

...see Appendix for more


MLE for Gaussian mixtures (pt. 2)

• Given $p(m|x_t;\theta) = \dfrac{\omega_m b_m(x_t)}{\sum_{k=1}^{M}\omega_k b_k(x_t)}$,

• and since $\sum_{m}\omega_m = 1$, we obtain the re-estimates by solving for each parameter in $\frac{\partial \log L}{\partial\theta} = 0$:

$$\hat\omega_m = \frac{1}{T}\sum_{t=1}^{T} p(m|x_t;\theta), \qquad \hat{\vec{\mu}}_m = \frac{\sum_{t} p(m|x_t;\theta)\, x_t}{\sum_{t} p(m|x_t;\theta)}, \quad \text{and} \quad \hat\Sigma_m = \frac{\sum_{t} p(m|x_t;\theta)\, x_t x_t^{\top}}{\sum_{t} p(m|x_t;\theta)} - \hat{\vec{\mu}}_m \hat{\vec{\mu}}_m^{\top}$$


Recipe for GMM ML estimation

Do the following for each speaker individually, using all the frames available in their respective Training directories.

1. Initialize: guess $\theta = \{\omega_m, \vec{\mu}_m, \Sigma_m\}$ with M random vectors from the data, or by performing k-means clustering with k = M.

2. Compute likelihood: compute $b_m(x_t)$ and $p(m|x_t;\theta)$ for every frame and component.

3. Update parameters: re-estimate $\omega_m$, $\vec{\mu}_m$, and $\Sigma_m$ with the equations above.

Repeat steps 2 and 3 until $\log L(X;\theta)$ converges (see the sketch below).


Cheat sheet

$p(x_t;\theta) = \sum_{m=1}^{M}\omega_m b_m(x_t)$ : probability of $x_t$ in the GMM

$p(m|x_t;\theta) = \dfrac{\omega_m b_m(x_t)}{\sum_{k=1}^{M}\omega_k b_k(x_t)}$ : probability of the mth Gaussian, given $x_t$

$b_m(x_t) = \mathcal{N}(x_t;\vec{\mu}_m,\Sigma_m)$ : probability of observing $x_t$ in the mth Gaussian

$\omega_m$ : prior probability of the mth Gaussian


Initializing theta

• Initialize each $\vec{\mu}_m$ to a random vector from the data.

• Initialize $\Sigma_m$ to a ‘random’ diagonal matrix (or the identity matrix).

• Initialize $\omega_m$ randomly, with these constraints: $\omega_m \ge 0$ and $\sum_{m=1}^{M}\omega_m = 1$.

• A good choice would be to set each $\omega_m$ to 1/M.


Practical tips for MLE of GMMs

• Assume diagonal covariance matrices to reduce the number of parameters.

• Numerical stability: compute $\log b_m(x)$ and $\log p(x;\theta)$ in the log domain.

• Efficiency: pre-compute the parameter-only terms of $\log b_m(x)$.


Speech recognition task

• You will train one HMM for each phoneme in the database: /s/, /sh/, /z/, /iy/, /eh/, /ih/, ...

• Given phonetically annotated test utterances, determine if the correct HMM gives the highest probability for each phone segment.

• e.g., unkn_24.phn contains the lines

  ...
  64 85 ae
  85 96 s
  96 102 epi
  102 106 m
  ...

so frames 85 to 96 of unkn_24.mfcc form a segment X, and we ask: is P(X | the /s/ HMM) > P(X | all other HMMs)?


Continuous multivariate HMMs

• Multivariate HMMs with continuous GMM emission likelihoods are very similar to discrete HMMs, except

• Instead of reading letters or words from a discrete alphabet, we’re reading d-dimensional feature vectors consisting of real numbers.

• Observation likelihoods are computed with Gaussian mixture models.

• For each phoneme, you should have a three-state HMM: a 3-state monophone (e.g., /s/).


Continuous, multivariate HMMs

[Figure: a 3-state monophone (e.g., /s/) with states 0, 1, and 2, each emitting through its own GMM output distribution $P_\Theta(0)$, $P_\Theta(1)$, $P_\Theta(2)$]

Given a d-dimensional transformation, $\vec{x}_t$, which represents ~16ms of speech, and given that you’re in state s at time t, compute $b_s(\vec{x}_t) = \sum_{m=1}^{M}\omega_{s,m}\, b_{s,m}(\vec{x}_t)$, which is the standard GMM output formula. Each state has its own parameters $\Theta_s$.


Concatenating phones into words

[Figure: a word-level HMM formed by chaining phone HMMs end-to-end]


Tips

• During training you might get messages like

  ****** likelihood decreased from X to Y!

This is an artifact of the way the toolbox estimates likelihoods.

You can usually ignore these messages, but they may be indicative of non-ideal estimates – I don't get them.

• Each phone will usually take between 5 and 15 EM iterations.

• Training your HMMs can take up to 4 or 5 hours at 100% CPU.

• Debug your code on a small subset of all training data.

• Try using screen.


Phonetic annotation

• Problem: The first 10 *.phn files in the Testing directory have errors.

• Specifically, unkn_1.phn to unkn_10.phn.

• Task: You must fix unkn_1.phn to unkn_10.phn with Wavesurfer. e.g.,

  unkn_1.phn (before):    unkn_1.phn (after):
  0 83 h#                 0 83 h#
  84 92 m                 84 92 n
  93 100 ow               93 100 ow
  101 114 hh              101 114 hh
  ...                     ...


Wavesurfer - getting started

1. Open Wavesurfer.

2. Open a *.wav file from the Testing directory.

3. When prompted, select “TIMIT transcription”


Wavesurfer - getting started 2

1. You should see a black-and-white spectrogram and, below it, a pane marked “.PHN”: this is where you transcribe with phonemes.


Wavesurfer - example

1. Left-click in the PHN panel where you think a phoneme

ends. The red vertical bar helps you visualize this on the spectrogram.

2. Type the code of the phoneme, in this case ‘hh’

• Note: You are given *.wrd and *.txt files, which tell you the words spoken.

• Look these words up in the CMU dictionary to know the expected pronunciations.


Wavesurfer - example 2

1. You can adjust the boundary by hovering the mouse over the vertical line until you get a double-arrow. Left-click and drag to adjust.

2. To adjust a label, hover the mouse over the text until the mouse turns into a cursor, then select the label and type.

Hint: Always listen to the segment you’ve labelled, and adjust accordingly.


Wavesurfer - example 3

1. To save, middle-click on the PHN panel, and select “Save

Transcription As...”. Other actions are also available here, such as label deletion.

2. To adjust a label, hover the mouse over the text until the mouse turns into a cursor, then select the label and type.


Wavesurfer result

• At the end, you will have fixed unkn_1.phn to unkn_10.phn.

  unkn_1.phn:
  0 83 h#
  84 92 n
  93 100 ow
  101 114 hh
  ...

  unkn_2.phn ... unkn_10.phn:
  ## ## ??
  ## ## ??
  ...

• These *.phn files are platform-independent; it does not matter which version of Wavesurfer you use.

• Boundaries between phones are not always easy to determine. You will not be marked on how similar your boundaries are to ours.


Wavesurfer - basic mouse controls

Mouse controls (Linux, CDF):

• Left: Specify a new end-boundary of a phone.

• Middle: Play the audio of the selected segment.

• Right: Context menu.

Mouse controls (if you're forced to use Mac OS X):

● Left: Specify a new end-boundary of a phone.

● Middle: Context menu.

● Right: Play the audio of the selected segment.


Using the CMU dictionary

• Each of the *.wav files you transcribe is associated with a corresponding *.txt file, which has the orthography of the utterance.

• For each word in the utterance, search the CMU dictionary to get the associated transcription.

• Some words have multiple pronunciations. Pick the one you think best matches the audio. Ignore stress markers on phones (i.e., ‘0’, ‘1’, ‘2’). e.g.,

  EVOLUTION      EH2 V AH0 L UW1 SH AH0 N
  EVOLUTION(2)   IY2 V AH0 L UW1 SH AH0 N
  EVOLUTION(3)   EH2 V OW0 L UW1 SH AH0 N
  EVOLUTION(4)   IY2 V OW0 L UW1 SH AH0 N
  EVOLUTIONARY   EH2 V AH0 L UW1 SH AH0 N EH2 R IY0


Word-error rates

• If somebody said

  REF: how to recognize speech

but an ASR system heard

  HYP: how to wreck a nice beach

how do we measure the error that occurred?

• One measure is #CorrectWords / #HypothesisWords, e.g., 2/6 above.

• Another measure is (S+I+D)/#ReferenceWords

• S: # Substitution errors (one word for another)

• I: # Insertion errors (extra words)

• D: # Deletion errors (words that are missing).


Computing Levenshtein Distance

• In the example

  REF: how to recognize speech
  HYP: how to wreck a nice beach

How do we count each of S, I, and D?

• If wreck is a substitution error, what about a and nice?



• Levenshtein distance:

  Initialize R[0,0] = 0, and R[i,j] = ∞ wherever i = 0 or j = 0 (but not both)
  for i = 1..n (#ReferenceWords)
      for j = 1..m (#HypothesisWords)
          R[i,j] = min( R[i-1,j] + 1        (deletion)
                        R[i-1,j-1]          (only if words match)
                        R[i-1,j-1] + 1      (only if words differ)
                        R[i,j-1] + 1 )      (insertion)
  Return 100 × R[n,m] / n


Levenshtein example

Filling the table row by row for REF “how to recognize speech” vs. HYP “how to wreck a nice beach”:

                 how   to   wreck   a   nice   beach
            0     ∞     ∞     ∞     ∞     ∞      ∞
how         ∞     0     1     2     3     4      5
to          ∞     1     0     1     2     3      4
recognize   ∞     2     1     1     2     3      4
speech      ∞     3     2     2     2     3      4

Word-error rate is 4/4 = 100%

2 substitutions, 2 insertions


Appendices


Multidimensional Gaussians, pt. 2

• If the ith and jth dimensions are statistically independent, then $E[x_i x_j] = E[x_i]E[x_j]$ and $\Sigma_{ij} = 0$.

• If all dimensions are statistically independent, then $\Sigma_{ij} = 0$ for all $i \ne j$, the covariance matrix becomes diagonal, and

$$p(\vec{x};\theta) = \prod_{i=1}^{d} p(x_i;\theta_i) = \prod_{i=1}^{d}\frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}}$$

where $\theta_i = \{\mu_i, \sigma_i^2\}$.


MLE example - dD Gaussians

• The MLE estimates for parameters $\theta = \{\vec{\mu}, \Sigma\}$ given i.i.d. training data $X = \{\vec{x}_1, \dots, \vec{x}_N\}$ are obtained by maximizing the joint likelihood

$$L(X;\theta) = \prod_{t=1}^{N} p(\vec{x}_t;\theta)$$

• To do so, we solve $\frac{\partial}{\partial\theta}\log L(X;\theta) = 0$, where $\log L(X;\theta) = \sum_{t=1}^{N}\log p(\vec{x}_t;\theta)$

• Giving

$$\hat{\vec{\mu}} = \frac{1}{N}\sum_{t=1}^{N}\vec{x}_t, \qquad \hat\Sigma = \frac{1}{N}\sum_{t=1}^{N}(\vec{x}_t-\hat{\vec{\mu}})(\vec{x}_t-\hat{\vec{\mu}})^{\top}$$


MLE for Gaussian mixtures (pt. 1.5)

• Given $\log L(X;\theta) = \sum_{t}\log p(x_t;\theta)$ and $p(x_t;\theta) = \sum_{m}\omega_m b_m(x_t)$,

• obtain an ML estimate, $\hat{\vec{\mu}}_m$, of the mean vector by maximizing $\log L$ w.r.t. $\vec{\mu}_m$:

$$\frac{\partial \log L}{\partial \vec{\mu}_m} = \sum_{t}\frac{1}{p(x_t;\theta)}\,\frac{\partial}{\partial \vec{\mu}_m}\big(\omega_m b_m(x_t)\big) = 0$$

• Why?

  • The derivative of a sum is the sum of the derivatives.

  • The derivative rule for log_e: $\frac{\partial}{\partial\theta}\log f = \frac{1}{f}\frac{\partial f}{\partial\theta}$.

  • The derivative w.r.t. $\vec{\mu}_m$ is 0 for all other mixtures in the sum in $p(x_t;\theta)$.

Suggested reading