csc401/2511 - Winter 2012 – Assignment 3 Tutorial, Frank Rudzicz, Siavash Kazemian 1
Speech Processing and Understanding
CSC401 Assignment 3
Frank Rudzicz and Siavash Kazemian
Agenda
• Background
• Speech technology, in general
• Acoustic phonetics
• Assignment 3
• Speaker Recognition: Gaussian mixture models
• Speech Recognition:
• Continuous hidden Markov models
• Transcription
• The Forward Algorithm
• Word-error rates with Levenshtein distance.
Speech Recognition
● Time-variant acoustic pressure waves + signal processing.
● Feature extraction.
● Segmentation and classification of segments by phoneme.
● Selection of most likely syntax.
● Possible performance of the proper procedure.
“open the pod bay doors”
/ow p ah n dh ah p aa d b ey d ao r z/
open(podBay.doors);
Applications of ASR
• Multimodality & HCI: “Put this there.”
• Dictation: “My hands are in the air.”
• Telephony: “Buy ticket... AC490... yes”
• Emerging: data mining/indexing, assistive technology, conversation.
Agenda
• Background
• Speech technology, in general
• Acoustic phonetics
• Assignment 3
• Speaker Recognition: Gaussian mixture models
• Speech Recognition:
• Continuous hidden Markov models
• Transcription
• The Forward Algorithm
• Word-error rates with Levenshtein distance.
Formants in sonorants
• However, formants are insufficient features for use in speech recognition generally...
Mel-frequency cepstral coefficients
• In real speech data, the spectrogram is often transformed to a representation that more closely represents human auditory response and is more amenable to accurate classification.
• MFCCs are ‘spectra of spectra’. They are the discrete cosine transform of the logarithms of the nonlinearly Mel-scaled powers of the Fourier transform of windows of the original waveform.
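The assignment provides MFCC features precomputed, but the pipeline in the bullet above (window → Fourier power → Mel filterbank → log → DCT) can be sketched end to end. This is a minimal NumPy/SciPy illustration; the frame sizes (25 ms windows, 10 ms hops at 16 kHz), filter counts, and function names are assumptions, not the exact features shipped with the assignment:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    # 1. Slice the waveform into overlapping, Hamming-windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[t * hop : t * hop + frame_len] * window
                       for t in range(n_frames)])
    # 2. Power spectrum of each frame (Fourier transform of the windows).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3. Triangular filters spaced evenly on the (nonlinear) Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # 4. Logs of the Mel-scaled powers, then the discrete cosine transform.
    log_mel = np.log(power @ fbank.T + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

One second of 16 kHz audio yields 98 frames of 13 coefficients with these settings.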
Challenges in speech data
• Co-articulation and dropped phonemes.
• (Intra- and inter-) speaker variability.
• No word boundaries.
• Slurring, disfluency (e.g., ‘um’).
• Signal noise.
• High dimensionality.
Phonemes
• Words are formed by phonemes (aka ‘phones’),
e.g., ‘pod’ = /p aa d/
• Words have different pronunciations, and in practice we can never be certain of which phones were uttered, nor their start/stop points.
[Figure: levels of representation for “open the pod bay doors”: a syntactic parse (Sentence → Verb phrase → Verb, Noun phrase → Det, Modifier, Noun (plu) → Noun Noun), the lexical level (open the pod bay doors), and the phonemic level (ow p ah n dh ah p aa d b ey d ao r z).]
Phonetic alphabets
• International Phonetic Association (IPA)
• Can represent sounds in all languages
• Contains non-ASCII characters
• ARPAbet
• One of the earliest attempts at encoding English for early speech recognition.
• TIMIT/CMU
• Very popular among modern databases for speech recognition.
• Used in assignment 3
Example phonetic alphabets
• The other consonants are transcribed as you would expect
• i.e., p, b, m, t, d, n, k, g, s, z, f, v, w, h
Agenda
• Background
• Speech technology, in general
• Acoustic phonetics
• Assignment 3
• Speaker Recognition: Gaussian mixture models
• Speech Recognition:
• Continuous hidden Markov models
• Transcription
• The Forward Algorithm
• Word-error rates with Levenshtein distance.
Assignment 3
• Two parts:
• Speaker identification: Determine which of 30 speakers an unknown test sample of speech comes from, given Gaussian mixture models you will train for each speaker.
• Speech recognition: Learn about phonetic annotation of speech data, training continuous hidden Markov models, and using these to identify phonemes with probabilities produced by the Forward algorithm. You will also learn to compute word-error rates with the Levenshtein distance.
Speaker Data
• 30 speakers (e.g., FCJF0, MDPK0).
• Each speaker has 9 training utterances.
• e.g., Training/FCJF0/SA1.*, Training/FCJF0/SI1028.*
• Each utterance has 5 files:
• *.wav : The original wave file.
• *.mfcc : The relevant acoustic features.
• *.txt : Sentence-level transcription.
• *.wrd : Word-level transcription.
• *.phn : Phoneme-level transcription.
Speaker Data (cont.)
• All you need to know: A speech utterance is an N×d matrix.
• Each row represents the features of a d-dimensional point in time.
• There are N rows in a sequence of N frames.
• These data are in space-delimited text files (*.mfcc):
X1[1]   X1[2]   ...   X1[d]
X2[1]   X2[2]   ...   X2[d]
 ...     ...    ...    ...
XN[1]   XN[2]   ...   XN[d]

(rows 1..N are frames over time; columns 1..d are data dimensions)
Speaker Data (cont.)
• You are given ‘transcription files’ (*.wrd, *.txt, *.phn) in which each line tells the start and end frames for a unit of speech.
• For example, if a *.wrd file has the line “501 540 pod”, then frames 501 to 540 inclusive represent the word ‘pod’.
  ...       ...     ...     ...
X501[1]  X501[2]   ...   X501[d]
  ...       ...     ...     ...
X540[1]  X540[2]   ...   X540[d]
  ...       ...     ...     ...

(frames 501..540 of the N×d matrix, across data dimensions 1..d)
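A sketch of pulling out the frames for one transcribed unit, assuming the "start end label" line format described above (helper names are illustrative):

```python
import numpy as np

def read_segments(path):
    """Parse a transcription file (*.wrd or *.phn): each line is
    'start end label'. Returns a list of (start, end, label) tuples."""
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3:
                segments.append((int(parts[0]), int(parts[1]), parts[2]))
    return segments

def segment_features(X, start, end):
    """Rows start..end inclusive of the N x d feature matrix X."""
    return X[start:end + 1]
```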
Agenda
• Background
• Speech technology, in general
• Matlab
• Acoustic phonetics
• Assignment 3
• Speaker Recognition: Gaussian mixture models
• Speech Recognition:
• Continuous hidden Markov models
• Transcription
• The Forward Algorithm
• Word-error rates with Levenshtein distance.
Speaker Recognition
• There are 30 testing utterances.
• e.g., Testing/unkn_1.*, Testing/unkn_2.*
• Each speaker produced 1 of these testing utterances.
• We don't know which speaker produced which test utterance.
• Every speaker occupies a characteristic part of the acoustic space.
• We want to learn a probability distribution for each speaker that describes their acoustic behaviour.
• Use those distributions to identify the speaker-dependent features of some unknown sample of speech data.
Some background: fitting to data• Given a set of observations X of some random variable, we
wish to interpolate a simple, continuous function that allows us to make decisions about future events.
• e.g., if I fit curves to observed data from speakers A and B (below), and later observe an unlabeled datum x=15, I can infer that x is much more likely to come from speaker B than from speaker A.
Probability of observing the data
• We want to choose fits that better match our observed data.
• The data (blue bars) are more likely to come from the parameterization on the left, so we like that one better.
Finding parameters: 1D Gaussians
• Often called Normal distributions
• The parameters we can adjust to fit the data are μ and σ²:

  N(x; μ, σ²) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )
Maximum likelihood estimation
• Given data X = {x₁, …, x_N}, maximum likelihood estimation (MLE) produces an estimate θ̂ of a parameter set θ by maximizing the likelihood of observing the data:

  θ̂ = argmax_θ L(X; θ)

• Here, L(X; θ) = ∏ₜ p(xₜ; θ).

• The likelihood function provides a surface over all possible parameterizations. In order to find the highest likelihood, we usually look at the derivative and solve

  ∂L(X; θ) / ∂θ = 0

to see at which point the function stops growing.
MLE - 1D Gaussian
• Estimate μ:

  μ̂ = (1/N) Σₜ xₜ

• A similar approach gives the MLE estimate of σ²:

  σ̂² = (1/N) Σₜ (xₜ − μ̂)²
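The two estimators above, evaluated in NumPy on a small sample (the data are invented for illustration):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu_hat = x.mean()                      # (1/N) * sum_t x_t
var_hat = ((x - mu_hat) ** 2).mean()   # (1/N) * sum_t (x_t - mu_hat)^2
print(mu_hat, var_hat)  # 5.0 4.0
```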
Multidimensional Gaussians
• When your data are d-dimensional, the input variable is x = [x[1] x[2] ... x[d]]ᵀ,

the mean vector is μ = [μ[1] μ[2] ... μ[d]]ᵀ,

the covariance matrix is Σ, with Σ[i,j] = E[(x[i] − μ[i])(x[j] − μ[j])],

and

  N(x; μ, Σ) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
Non-Gaussian data
• Our speaker data does not behave unimodally.
• i.e., we can't use just 1 Gaussian per speaker.
• E.g., observations below occur mostly bimodally, so fitting 1 Gaussian would not be representative.
Gaussian mixtures
• Gaussian mixtures are a weighted linear combination of M component Gaussians, bₘ(x) = N(x; μₘ, Σₘ), such that

  p(x; θ) = Σₘ₌₁..ᴹ ωₘ bₘ(x),  with Σₘ ωₘ = 1 and ωₘ ≥ 0.

• Since the mixture weights ωₘ are unknown, learning a GMM is semi-parametric and semi-supervised.
MLE for Gaussian mixtures
• For notational convenience, let bₘ(xₜ) = N(xₜ; μₘ, Σₘ) and

  p(m | xₜ, θ) = ωₘ bₘ(xₜ) / Σₖ ωₖ bₖ(xₜ),

so p(xₜ; θ) = Σₘ ωₘ bₘ(xₜ), and L(X; θ) = ∏ₜ p(xₜ; θ).

• To find θ̂, we solve ∂ log L(X; θ) / ∂θ = 0, where θ = {ωₘ, μₘ, Σₘ} for m = 1..M.
...see Appendix for more
MLE for Gaussian mixtures (pt. 2)
• Given the posterior probabilities p(m | xₜ, θ), setting the derivatives to zero yields the update equations:

  ω̂ₘ = (1/T) Σₜ p(m | xₜ, θ)

  μ̂ₘ = Σₜ p(m | xₜ, θ) xₜ / Σₜ p(m | xₜ, θ)

  Σ̂ₘ = Σₜ p(m | xₜ, θ) (xₜ − μ̂ₘ)(xₜ − μ̂ₘ)ᵀ / Σₜ p(m | xₜ, θ)
Recipe for GMM ML estimation
Do the following for each speaker individually. Use all the frames available in their respective Training directories.

1. Initialize: Guess θ with M random vectors from the data, or by performing k-means clustering with k = M.

2. Compute likelihood: Compute L(X; θ) and the posteriors p(m | xₜ, θ).

3. Update parameters: re-estimate ωₘ, μₘ, and Σₘ with the update equations above.

Repeat 2 & 3 until L(X; θ) converges.
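The recipe above, specialized to diagonal covariances, can be sketched in NumPy. The course itself uses Matlab; the function name, convergence threshold, and variance floor here are illustrative choices, not the assignment API:

```python
import numpy as np

def train_gmm(X, M=2, n_iter=50, seed=0):
    """EM for a diagonal-covariance GMM on an (T, d) data matrix X."""
    T, d = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialize: means from M random frames, unit variances, uniform weights.
    mu = X[rng.choice(T, M, replace=False)]
    var = np.ones((M, d))
    omega = np.full(M, 1.0 / M)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # 2. Compute likelihood: log b_m(x_t) for every frame and component.
        log_b = -0.5 * (((X[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2)   # (T, M)
        log_wb = log_b + np.log(omega)
        log_p = np.logaddexp.reduce(log_wb, axis=1)    # log p(x_t; theta)
        resp = np.exp(log_wb - log_p[:, None])         # p(m | x_t, theta)
        # 3. Update parameters from the responsibilities.
        Nm = resp.sum(axis=0)
        omega = Nm / T
        mu = (resp.T @ X) / Nm[:, None]
        var = (resp.T @ (X ** 2)) / Nm[:, None] - mu ** 2
        var = np.maximum(var, 1e-6)                    # numerical floor
        # Repeat 2 & 3 until the log likelihood converges.
        ll = log_p.sum()
        if ll - prev_ll < 1e-6:
            break
        prev_ll = ll
    return omega, mu, var
```

Working with responsibilities in the log domain (via `logaddexp`) is exactly the numerical-stability tip given a few slides later.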
Cheat sheet
p(xₜ; θ) = Σₘ ωₘ bₘ(xₜ)    Probability of xₜ in the GMM
p(m | xₜ, θ)               Probability of the mth Gaussian, given xₜ
bₘ(xₜ) = N(xₜ; μₘ, Σₘ)     Probability of observing xₜ in the mth Gaussian
ωₘ                         Prior probability of the mth Gaussian
Initializing θ
• Initialize each μₘ to a random vector from the data.
• Initialize Σₘ to a ‘random’ diagonal matrix (or the identity matrix).
• Initialize ωₘ randomly, with these constraints: Σₘ ωₘ = 1 and ωₘ ≥ 0.
• A good choice would be to set each ωₘ to 1/M.
Practical tips for MLE of GMMs• Assume diagonal covariance matrices to reduce the number
of parameters.
• Numerical stability: compute log bₘ(xₜ) in the log domain, and combine mixture components with log-sum-exp.

• Efficiency: pre-compute the parts of log bₘ(xₜ) that do not depend on xₜ (e.g., the normalization term and the inverse variances).
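Both tips, sketched for a single diagonal-covariance Gaussian (the class name is illustrative; which terms the slide's "precompute" boxes referred to is an assumption):

```python
import numpy as np

class DiagGaussian:
    """Diagonal-covariance Gaussian evaluated directly in the log domain."""
    def __init__(self, mu, var):
        var = np.asarray(var, float)
        self.mu = np.asarray(mu, float)
        self.inv_var = 1.0 / var                                  # precomputed
        self.log_norm = -0.5 * np.sum(np.log(2 * np.pi * var))    # precomputed

    def log_b(self, x):
        """log N(x; mu, diag(var)); only x-dependent terms computed here."""
        diff = np.asarray(x, float) - self.mu
        return self.log_norm - 0.5 * np.sum(diff * diff * self.inv_var)
```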
Agenda
• Background
• Speech technology, in general
• Matlab
• Acoustic phonetics
• Assignment 3
• Speaker Recognition: Gaussian mixture models
• Speech Recognition:
• Continuous hidden Markov models
• Transcription
• The Forward Algorithm
• Word-error rates with Levenshtein distance.
Speech recognition task• You will train one HMM for each phoneme
in the database.
• Given phonetically annotated test utterances, determine if the correct HMM gives the highest probability for each phone segment.
HMM models: /s/, /sh/, /z/, /iy/, /eh/, /ih/, ...

e.g., unkn_24.phn:
  ...
  64 85 ae
  85 96 s
  96 102 epi
  102 106 m
  ...

Let X be frames 85..96 of unkn_24.mfcc (the segment labelled /s/).
Is P(X | /s/) > P(X | all other HMMs)?
Continuous multivariate HMMs
• Multivariate HMMs with continuous GMM emission likelihoods are very similar to discrete HMMs, except
• Instead of reading letters or words from a discrete alphabet, we’re reading d-dimensional feature vectors consisting of real numbers.
• Observation likelihoods are computed with Gaussian mixture models.
• For each phoneme, you should have a three-state HMM:
3-state monophone (e.g., /s/)
Continuous, multivariate HMMs
[Diagram: 3-state monophone (e.g., /s/) with states 0, 1, 2, each with its own emission distribution P_Θ(0), P_Θ(1), P_Θ(2).]

Given a d-dimensional transformation, xₜ, which represents ~16 ms of speech, and given that you’re in state s at time t, compute b_s(xₜ), which is the standard GMM output formula. Each state has its own parameters.
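Combining these per-state GMM log-likelihoods into P(X | HMM) is the job of the forward algorithm from the agenda. A log-domain NumPy sketch, with illustrative names that are not the toolbox API:

```python
import numpy as np

def log_forward(log_pi, log_A, log_B):
    """Forward algorithm in the log domain.
    log_pi: (S,) initial state log-probabilities.
    log_A:  (S, S) transition log-probabilities, log_A[s', s] = log P(s | s').
    log_B:  (T, S) emission log-likelihoods, log_B[t, s] = log b_s(x_t).
    Returns log P(X | HMM)."""
    T, S = log_B.shape
    alpha = log_pi + log_B[0]                       # log alpha_1(s)
    for t in range(1, T):
        # alpha_t(s) = logsumexp_{s'}(alpha_{t-1}(s') + log A[s', s]) + log b_s(x_t)
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[t]
    return np.logaddexp.reduce(alpha)               # sum over final states
```

Classifying a phone segment is then a comparison of `log_forward` scores across the trained HMMs.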
Concatenating phones into words
Tips
• During training you might get messages like:

  ******likelihood decreased from X to Y!
This is an artifact of the way the toolbox estimates likelihoods.
You can usually ignore these messages, but they may be indicative of non-ideal estimates – I don't get them.
• Each phone will usually take between 5 and 15 EM iterations.
• Training your HMMs can take up to 4 or 5 hours at 100% CPU.
• Debug your code on a small subset of all training data.
• Try using screen.
Agenda
• Background
• Speech technology, in general
• Matlab
• Acoustic phonetics
• Assignment 3
• Speaker Recognition: Gaussian mixture models
• Speech Recognition:
• Continuous hidden Markov models
• Transcription
• The Forward Algorithm
• Word-error rates with Levenshtein distance.
Phonetic annotation
• Problem: The first 10 *.phn files in the Testing directory have errors.
• Specifically, unkn_1.phn to unkn_10.phn.
• Task: You must fix unkn_1.phn to unkn_10.phn with Wavesurfer, e.g.:

unkn_1.phn (before):
  0 83 h#
  84 92 m
  93 100 ow
  101 114 hh
  ...

unkn_1.phn (after):
  0 83 h#
  84 92 n
  93 100 ow
  101 114 hh
  ...
Wavesurfer - getting started
1. Open Wavesurfer
2. Open a *.wav file from the Testing directory.
3. When prompted, select “TIMIT transcription”
Wavesurfer - getting started 2
1. You should see a black-and-white spectrogram and a pane below it marked “.PHN”: this is where you transcribe with phonemes.
Wavesurfer - example
1. Left-click in the PHN panel where you think a phoneme
ends. The red vertical bar helps you visualize this on the spectrogram.
2. Type the code of the phoneme, in this case ‘hh’
• Note: You are given *.wrd and *.txt files, which tell you the words spoken.
• Look these words up in the CMU dictionary to know the expected pronunciations.
Wavesurfer - example 2
1. You can adjust the boundary by hovering the mouse over the vertical line until you get a double-arrow. Left-click and drag to adjust.

2. To adjust a label, hover the mouse over the text until the mouse turns into a cursor, then select the label and type.
Hint: Always listen to the segment you’ve labelled, and adjust accordingly.
Wavesurfer - example 3
1. To save, middle-click on the PHN panel, and select “Save
Transcription As...”. Other actions are also available here, such as label deletion.
2. To adjust a label, hover the mouse over the text until the mouse turns into a cursor, then select the label and type.
Wavesurfer result
• At the end, you will have fixed unkn_1.phn to unkn_10.phn.
unkn_1.phn:
  0 83 h#
  84 92 n
  93 100 ow
  101 114 hh
  ...

unkn_2.phn through unkn_10.phn:
  ## ## ??
  ## ## ??
  ...
• These *.phn files are platform-independent; it does not matter which version of Wavesurfer you use.
• Boundaries between phones are not always easy to determine. You will not be marked on how similar your boundaries are to ours.
Wavesurfer - basic mouse controls
Mouse controls (Linux, CDF):
• Left: Specify a new end-boundary of a phone.
• Middle: Play the audio of the selected segment.
• Right: Context menu.
Mouse controls (if you're forced to use Mac OS X):
● Left: Specify a new end-boundary of a phone.
● Middle: Context menu.
● Right: Play the audio of the selected segment.
Using the CMU dictionary
• Each of the *.wav files you transcribe is associated with a corresponding *.txt file, which has the orthography of the utterance.
• For each word in the utterance, search the CMU dictionary to get the associated transcription.
• Some words have multiple pronunciations. Pick the one you think best matches the audio. Ignore the stress markers on phones (i.e., ‘0’, ‘1’, ‘2’).
e.g.,

EVOLUTION       EH2 V AH0 L UW1 SH AH0 N
EVOLUTION(2)    IY2 V AH0 L UW1 SH AH0 N
EVOLUTION(3)    EH2 V OW0 L UW1 SH AH0 N
EVOLUTION(4)    IY2 V OW0 L UW1 SH AH0 N
EVOLUTIONARY    EH2 V AH0 L UW1 SH AH0 N EH2 R IY0
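Looking up a word in a cmudict-formatted file and stripping the stress digits can be sketched as below; `pronunciations` is an illustrative helper (not assignment code), and the sample lines are pasted from the example above:

```python
import re

def pronunciations(word, dict_lines):
    """All pronunciations of `word`, with stress markers removed."""
    prons = []
    for line in dict_lines:
        head, *phones = line.split()
        # Alternate pronunciations are listed as WORD(2), WORD(3), ...
        if re.fullmatch(re.escape(word) + r"(\(\d+\))?", head):
            prons.append([re.sub(r"\d$", "", p) for p in phones])
    return prons

lines = [
    "EVOLUTION EH2 V AH0 L UW1 SH AH0 N",
    "EVOLUTION(2) IY2 V AH0 L UW1 SH AH0 N",
    "EVOLUTIONARY EH2 V AH0 L UW1 SH AH0 N EH2 R IY0",
]
print(pronunciations("EVOLUTION", lines)[0])
# ['EH', 'V', 'AH', 'L', 'UW', 'SH', 'AH', 'N']
```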
Word-error rates
• If somebody said
  REF: how to recognize speech
but an ASR system heard
  HYP: how to wreck a nice beach
how do we measure the error that occurred?

• One measure is #CorrectWords / #HypothesisWords,
  e.g., 2/6 above.
• Another measure is (S+I+D)/#ReferenceWords
• S: # Substitution errors (one word for another)
• I: # Insertion errors (extra words)
• D: # Deletion errors (words that are missing).
Computing Levenshtein Distance
• In the example
  REF: how to recognize speech
  HYP: how to wreck a nice beach
How do we count each of S, I, and D?
• If wreck is a substitution error, what about a and nice?
Computing Levenshtein Distance
• In the example
  REF: how to recognize speech
  HYP: how to wreck a nice beach
how do we count each of S, I, and D? If wreck is a substitution error, what about a and nice?

• Levenshtein distance:
  Initialize R[0,0] = 0, and R[i,j] = ∞ elsewhere when i = 0 or j = 0
  for i = 1..n (#ReferenceWords)
    for j = 1..m (#HypothesisWords)
      R[i,j] = min( R[i-1,j] + 1           (deletion)
                    R[i-1,j-1]             (only if words match)
                    R[i-1,j-1] + 1         (only if words differ)
                    R[i,j-1] + 1 )         (insertion)
  Return 100 * R[n,m] / n
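The pseudocode above translates directly to Python (the function name is illustrative). Note that it keeps the slide's initialization, which leaves the first row and column at infinity; the classical Levenshtein table initializes R[i,0] = i and R[0,j] = j instead, which also charges for leading insertions and deletions:

```python
def word_error_rate(ref, hyp):
    """Word-error rate (%) of hypothesis `hyp` against reference `ref`,
    both given as lists of words."""
    n, m = len(ref), len(hyp)
    INF = float("inf")
    R = [[INF] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = R[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)
            R[i][j] = min(R[i - 1][j] + 1,   # deletion
                          diag,              # match or substitution
                          R[i][j - 1] + 1)   # insertion
    return 100.0 * R[n][m] / n

ref = "how to recognize speech".split()
hyp = "how to wreck a nice beach".split()
print(word_error_rate(ref, hyp))  # 100.0
```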
Levenshtein example
            ε   how   to   wreck   a   nice   beach
ε           0    ∞     ∞     ∞     ∞     ∞      ∞
how         ∞    0     1     2     3     4      5
to          ∞    1     0     1     2     3      4
recognize   ∞    2     1     1     2     3      4
speech      ∞    3     2     2     2     3      4
Word-error rate is 4/4 = 100%
2 substitutions, 2 insertions
Multidimensional Gaussians, pt. 2
• If the ith and jth dimensions are statistically independent, then

  E[(x[i] − μ[i])(x[j] − μ[j])] = 0  and  Σ[i,j] = 0.

• If all dimensions are statistically independent, then Σ[i,j] = 0 for all i ≠ j and the covariance matrix becomes diagonal, in which case

  N(x; μ, Σ) = ∏ᵢ N(x[i]; μ[i], σᵢ²)

where σᵢ² = Σ[i,i].
MLE example - dD Gaussians
• The MLE estimates for parameters θ = {μ, Σ} given i.i.d. training data X = {x₁, …, x_N} are obtained by maximizing the joint likelihood

  L(X; θ) = ∏ₜ p(xₜ; θ)

• To do so, we solve ∂ log L(X; θ) / ∂θ = 0, where log L(X; θ) = Σₜ log p(xₜ; θ)

• Giving

  μ̂ = (1/N) Σₜ xₜ   and   Σ̂ = (1/N) Σₜ (xₜ − μ̂)(xₜ − μ̂)ᵀ
MLE for Gaussian mixtures (pt. 1.5)

• Given log L(X; θ) = Σₜ log p(xₜ; θ) and p(xₜ; θ) = Σₘ ωₘ bₘ(xₜ),

• obtain an ML estimate, μ̂ₘ, of the mean vector by maximizing log L(X; θ) w.r.t. μₘ:

  ∂/∂μₘ log L(X; θ) = Σₜ (ωₘ / p(xₜ; θ)) ∂bₘ(xₜ)/∂μₘ = 0

• Why? The derivative of a sum is the sum of the derivatives; the derivative rule for logₑ introduces the 1/p(xₜ; θ) factor; and the derivative w.r.t. μₘ is 0 for all other mixtures in the sum in p(xₜ; θ).