11-751 Speech Recognition, 09-15-2008
Pattern Recognition and Classification (for speech recognition)
2/49
09-15-2008 Course information & quick review
The components of a modern ASR system
Pattern Recognition / Classification (will continue with this topic on Wednesday)
3/49
Course Grading
30% Homework assignments
- 4 assignments over the course
40% Exam [12-Dec]
- In-class final exam at the end of the course
- Closed book, covers the material presented in the course
30% Speech term project
- Proposal (1-pager) [due: 08-Oct]
- Oral presentation (15 min) [start of Dec]
- Written report (10 pages max) [due: 15-Dec]
- Demonstration (if applicable)
- Your ideas and creativity for projects are highly welcome
- Details and project ideas given on Wednesday
4/49
Instructors
RM 203: Alex Waibel ([email protected])
RM 221: Ian Lane ([email protected])
RM 209: Yik-Cheung (Wilson) Tam ([email protected])
interACT, 2F, 407 S. Craig St.
Newell-Simon Hall
Doherty Hall
5/49
What we have looked at so far:
- Why speech recognition?
- Speech production: how humans generate speech; the vocal tract model of speech
- Features used for speech recognition: spectral representation of speech, LPC (Linear Predictive Coding), MFCC (Mel-frequency cepstral coefficients)
- Dynamic time warping and template matching: isolated word recognition
6/49
Vocal Tract Model of Speech
[Figure: source-filter model of speech production. An impulse train generator (controlled by the pitch period) models voiced excitation and a random noise generator models unvoiced excitation; the selected excitation, scaled by the gain G(t), is passed through the vocal tract filter V(t) (controlled by the vocal tract parameters) and the radiation model R(t) to produce the speech signal.]
7/49
Sloppy Speech
Read Speech vs. Conversational Speech
Actual input: "I have been I have been getting into"
Recognition: "and I am I being too yeah"
Recognition: "I have been ties than getting into the"
8/49
Feature Extraction
- Acoustic vectors computed every 10 ms
- Mel-scale filters mimic auditory processing
- DCT decorrelates the signal to improve statistical independence
- First and second differentials appended to capture the dynamic information of the signal
[Figure: MFCC front-end pipeline: speech waveform → FFT → |FFT|² spectrum → Mel-scale triangular filters → Log → DCT → 39-element acoustic vector]
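As a concrete illustration of this pipeline, the sketch below computes such a 39-element feature stream. The use of librosa, the file name, and the frame parameters are assumptions of this example, not part of the course setup:

```python
# Minimal sketch of the 39-element MFCC front-end using librosa
# (library choice, file name, and frame sizes are assumptions).
import numpy as np
import librosa

y, sr = librosa.load("utt.wav", sr=16000)          # hypothetical 16 kHz file

# 13 cepstra per frame: FFT -> mel filterbank -> log -> DCT;
# 400-sample window (25 ms) with a 160-sample shift (10 ms)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# first and second differentials carry the dynamic information
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, delta, delta2])        # shape: (39, n_frames)
print(features.shape)
```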
9/49
Template Matching and DTW
Linear alignment can handle the problem of different speaking rates, but it cannot handle the problem of speaking rates that vary within the same utterance.
First idea to overcome the varying length of utterances (Problem 2):
1. Normalize their length
2. Make a linear alignment
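Dynamic time warping removes this limitation by allowing a non-linear alignment between the two sequences. A minimal sketch of the standard DTW recursion, using toy one-dimensional data and a Euclidean local distance (both assumptions of this example):

```python
# Minimal DTW sketch: locally varying speaking rates are absorbed
# by the warping path (toy data, Euclidean local distance assumed).
import numpy as np

def dtw_distance(X, Y):
    """X: (n, d), Y: (m, d) sequences of acoustic vectors."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])  # local distance
            # best of the three allowed predecessor cells
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy check: a template matched against a slowed-down version of itself
template = np.array([[0.0], [1.0], [2.0], [1.0]])
utterance = np.array([[0.0], [0.0], [1.0], [2.0], [2.0], [1.0]])
print(dtw_distance(template, utterance))  # small despite different lengths
```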
10/49
Components of a Modern ASR System
Suggested reading:
S. Young, Large vocabulary continuous speech recognition: A review
11/49
ASR: the big picture
[Figure: input speech → "???" → output text ("Hello world")]
12/49
ASR: the big picture
The purpose of signal preprocessing is:
1) Signal digitization (quantization and sampling): represent an analog signal in an appropriate form to be processed by the computer
2) Digital signal preprocessing (feature extraction): extract features that are suitable for the recognition process
[Figure: input speech → front-end processing → "???" → output text ("Hello world")]
13/49
Fundamental Equation
For an observed feature vector sequence x, find the most likely word sequence $\hat{W}$:
$$\hat{W} = \arg\max_W P(W \mid x) = \arg\max_W \frac{P(W)\, p(x \mid W)}{p(x)}$$
[Figure: input speech → front-end processing → x → "???" → $\hat{W}$ → output text ("Hello world")]
14/49
Speech Recognition: Decoding
For an observed feature vector sequence x, find the most likely word sequence $\hat{W}$:
$$\hat{W} = \arg\max_W P(W \mid x) = \arg\max_W \frac{P(W)\, p(x \mid W)}{p(x)}$$
- Acoustic Model: $p(x \mid W)$
- Language Model: $P(W)$
- Search: how to efficiently find the maximizing W
[Figure: input speech → front-end processing → x → decoder (acoustic model, language model, search) → output text ("Hello world")]
15/49
Acoustic Model
Given W, what is the likelihood $p(x \mid W)$ of seeing feature vector(s) x?
We need a representation of W in terms of feature vectors, usually a two-part representation:
- Pronunciation dictionary: describes W as a concatenation of phones (maps words to phone sequences, e.g. I → /i/, you → /j/ /u/, we → /v/ /e/)
- Phone models: explain phones in terms of feature vectors
16/49
Why break words down into phones?
Problems with whole-word templates:
- Need a collection of reference patterns for each word
- High computational effort (especially for large vocabularies), proportional to vocabulary size
- Large vocabulary also means a huge amount of training data is needed
- Difficult to train suitable references (or sets of references)
- Impossible to recognize untrained words
⇒ Replace whole words by suitable sub-units
- Poor performance when the environment changes
- Works well only for speaker-dependent recognition (variations)
- Unsuitable where the speaker is unknown and no training is feasible
- Unsuitable for continuous speech (combinatorial explosion)
- Difficult to train/recognize subword units
⇒ Replace the template approach by a better modeling process
17/49
Speech Production as a Stochastic Process
- The same word/phoneme sounds different every time it is uttered
- Regard words/phonemes as states of a speech production process
- In a given state we can observe different acoustic sounds; not all sounds are possible/likely in every state
- We say: in a given state the speech process "emits" sounds according to some probability distribution
- The production process makes transitions from one state to another; not all transitions are possible, and they have different probabilities
- When we specify the probabilities for sound emissions (emission probabilities) and for the state transitions, we call this a model
18/49
HMM Acoustic Modelling
- One Hidden Markov Model for each phone or senone (context-dependent model)
- Transition probabilities $a_{ij}$ model durational variability in speech
- Output distributions $b_i(y_k)$ model spectral variability
[Figure: left-to-right HMM with states 1..5, self-loops a22, a33, a44 and forward transitions a12, a23, a34, a45, emitting the acoustic vector sequence Y = y1 y2 y3 y4 y5 with output probabilities b2(y1), b2(y2), b3(y3), b4(y4), b4(y5)]
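To make the transition/emission machinery concrete, the sketch below computes P(Y | model) for a small left-to-right HMM with the forward algorithm; all probability values are made up for illustration:

```python
# Forward-algorithm sketch for a toy 3-state left-to-right HMM
# (all transition and emission values are illustrative only).
import numpy as np

def forward(A, B, init):
    """A: (S,S) transition matrix a_ij, B: (T,S) emission likelihoods
    b_i(y_t), init: (S,) initial distribution. Returns P(Y | model),
    summed over all possible end states."""
    T, S = B.shape
    alpha = init * B[0]                 # alpha_1(i) = pi_i * b_i(y_1)
    for t in range(1, T):
        alpha = (alpha @ A) * B[t]      # sum over predecessor states
    return alpha.sum()

A = np.array([[0.6, 0.4, 0.0],          # self-loops model duration
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1, 0.0],          # b_i(y_t) for 4 frames, 3 states
              [0.5, 0.4, 0.1],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
init = np.array([1.0, 0.0, 0.0])        # always start in state 1
print(forward(A, B, init))
```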
19/49
Language Model
What is the likelihood $P(W)$ of seeing word sequence W?
This is the prior probability of W, independent of the observed event x.
[Figure: the language model scores the likelihood of word sequences, e.g. p(you | how are), p(today | are you), p(world | Hello)]
20/49
Language Modelling
- P(W) is the a-priori probability of observing word sequence W, independent of the observed signal x
- N-gram language model: estimate the probability of each word $w_k$ given the preceding n-1 words; typically n = 3 or 4
- Smoothing is required to account for word sequences not seen during training
For a trigram (n = 3):
$$P(W) = \prod_k P(w_k \mid w_{k-1}, w_{k-2})$$
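A minimal sketch of such an n-gram estimate: a trigram with add-one smoothing on a toy corpus. The corpus and the smoothing choice are illustrative; real systems use more sophisticated smoothing such as Kneser-Ney:

```python
# Trigram estimation sketch with add-one smoothing (toy corpus assumed).
from collections import Counter

corpus = "hello world how are you how are they".split()
vocab = set(corpus)

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_trigram(w, u, v):
    """P(w | u, v) with add-one smoothing over the vocabulary."""
    return (trigrams[(u, v, w)] + 1) / (bigrams[(u, v)] + len(vocab))

# P(W) factorizes over the sentence: prod_k P(w_k | w_{k-2}, w_{k-1})
print(p_trigram("you", "how", "are"))    # seen continuation
print(p_trigram("world", "how", "are"))  # unseen continuation, still > 0
```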
21/49
Decoding with Classifiers
[Figure: speech → feature extraction → speech features → decision (apply trained classifiers) → phoneme hypotheses, e.g. /h/ /e/ /l/ /o/ /w/ /o/ /r/ /l/ /d/]
22/49
Training Classifiers
[Figure: aligned speech (e.g. /h/ /e/ /l/ /o/) → feature extraction → speech features → train classifier → improved classifiers]
Use all aligned speech features (e.g. those of phoneme /e/) to train the reference vectors of /e/ (= codebook), for example with k-means or LVQ.
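A minimal k-means sketch for training such a codebook. Random vectors stand in for the aligned /e/ frames, and the codebook size is an arbitrary choice of this example:

```python
# k-means codebook training sketch for one phoneme's aligned features
# (synthetic data; k is illustrative).
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 39))     # stand-in for aligned /e/ vectors

def kmeans(X, k, iters=20):
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest reference vector
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # move each reference vector to the mean of its assigned frames
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers

codebook = kmeans(frames, k=8)          # 8 reference vectors for /e/
print(codebook.shape)                   # (8, 39)
```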
23/49
Suggested reading:
X. Huang, A. Acero, H. Hon, Spoken Language Processing, Chapter 4
R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley & Sons, 2000 (2nd edition)
Pattern Recognition and Classification (for speech recognition)
24/49
Pattern Recognition Approaches
[Diagram: pattern recognition divides into statistical, knowledge-based, and connectionist approaches, with the further distinctions supervised vs. unsupervised, parametric vs. nonparametric, and linear vs. nonlinear]
25/49
Pattern Recognition Approaches
Knowledge-based approaches:
- Compile knowledge
- Build decision trees
Connectionist approaches:
- Automatic knowledge acquisition, "black-box" behavior
- Simulation of biological processes
Statistical approaches:
- Build a statistical model of the "real world"
- Compute probabilities according to the models
26/49
Classification Trees
- Use a decision tree to predict someone's height class by traversing the tree and answering yes/no questions
- The choice and order of questions is designed subjectively (knowledge-based)
- Classification and Regression Trees (CART) provide an automatic, data-driven framework to construct the decision process
[Figure: simple binary decision tree for height classification; classes: T = tall, t = medium-tall, M = medium, m = medium-short, S = short]
Goal: predict the height of a new person
27/49
The CART Algorithm
Step 1: Create a set of questions Q that consists of all possible questions.
Step 2: Pick a splitting criterion that can evaluate all possible questions.
Step 3: Create a tree with one root node consisting of all training samples.
Step 4: Find the best composite question for each terminal node. (The goal is classification, so the objective is to reduce uncertainty; using entropy H, find the question which gives the greatest reduction in H.)
- Generate a tree with several simple-question splits
- Cluster the leaf nodes into two classes according to the splitting criterion
- Construct a corresponding composite question
Step 5: Split: take the split with the best criterion from Step 4.
Step 6: Stop criterion: go to Step 7 if all leaf nodes contain data samples from the same class, or if the improvements of all potential splits fall below a defined threshold.
Step 7: Prune the tree to the optimal size using an independent test-sample estimate or cross-validation, to prevent the tree from over-modeling the training data (i.e., to allow generalization).
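A minimal sketch of the uncertainty-reduction idea in Step 4: evaluate all threshold questions on one feature and keep the one with the greatest entropy reduction. The data and the question set are illustrative:

```python
# Entropy-based split selection sketch (toy height data assumed).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, labels):
    """Try all threshold questions 'x < t' on one feature."""
    base = entropy(labels)
    best = (None, 0.0)
    for t in np.unique(x):
        yes, no = labels[x < t], labels[x >= t]
        if len(yes) == 0 or len(no) == 0:
            continue
        h = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
        gain = base - h                  # reduction in uncertainty
        if gain > best[1]:
            best = (t, gain)
    return best

heights = np.array([150, 160, 165, 172, 180, 190])
classes = np.array(["S", "S", "M", "M", "T", "T"])
print(best_split(heights, classes))      # threshold with greatest H reduction
```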
28/49
Neural Net Approaches
- NNs attempt real-time response and human-like performance
- Many simple processing elements operating in parallel
- Parallel supercomputers have led to renewed interest in NNs
- Appealing for ASR: parallel evaluation of many clues and facts
- Most common approach: the Multi-Layer Perceptron (MLP)
- Most common training procedure: error back-propagation
  - A generalization of the MMSE (Minimum Mean Squared Error) algorithm
  - Gradient search: minimize the difference between actual and desired output
- The MLP approximates the a posteriori probabilities P(Class|Pattern)
- Common problem: if an output of 0 0 0 ... 0 1 0 ... 0 0 0 is desired, the net tends to produce 0 0 0 ... 0 for all inputs
29/49
Pattern Recognition: Types of Classifiers
- Supervised vs. unsupervised classifiers
- Parametric vs. non-parametric classifiers
- Linear vs. non-linear classifiers
Classical statistical methods: Bayes classifier, k-nearest neighbor
Connectionist methods: perceptron, multilayer perceptrons
30/49
Supervised vs. Unsupervised
Supervised training:
- The class to be recognized is known for each sample in the training data
- Requires a priori knowledge of useful features and knowledge/labeling of each training token (cost!)
Unsupervised training:
- The class is not known, and structure is to be discovered automatically
- Feature-space reduction; examples: clustering, auto-associative nets
31/49
Unsupervised Classification
Classification when the classes are not known: find structure.
[Figure: scatter of unlabeled samples (+) in the (F1, F2) feature plane]
32/49
Unsupervised Classification
Classes not known: find structure by clustering. How to cluster? How many clusters?
[Figure: the same (F1, F2) scatter of unlabeled samples (+)]
33/49
Supervised Classification
Classification when the classes are known. Example: creditworthiness (yes/no) with the features income and age; train classifiers.
[Figure: income vs. age scatter with credit-worthy and non-credit-worthy samples (+)]
34/49
Classification Problem
Is Joe credit-worthy?
- Features: age, income
- Classes: credit-worthy, non-credit-worthy
- Problem: given Joe's income and age, should a loan be made?
- Other classification problems: fraud detection, customer selection, ...
[Figure: the income vs. age scatter again, with an unknown sample x to be assigned to one of the two classes]
35/49
Parametric vs. Non-Parametric
- Parametric: assume an underlying probability distribution and estimate the parameters of this distribution. Example: Gaussian classifier.
- Non-parametric: don't assume a distribution; estimate the probability of error or the error criterion directly from the training data. Examples: Parzen windows, k-nearest neighbor, perceptron, ...
[Figure: histograms of good loans and bad loans over income, with the decision threshold placed by the minimum-error criterion]
36/49
Bayes Decision Theory
Bayes rule plays a central role in statistical pattern recognition.
Decision making is based on:
1) prior knowledge of the categories: the prior probability $P(\omega_j)$
AND
2) knowledge from the observation of x: the posterior probability $P(\omega_j \mid x)$
Bayes rule:
$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}, \qquad p(x) = \sum_j p(x \mid \omega_j)\, P(\omega_j)$$
where $p(x \mid \omega_j)$ is the class-conditional probability density function (referred to as the likelihood function: how likely is x to be generated by class $\omega_j$).
37/49
Minimum-Error-Rate Decision Rule
$$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}$$
The classification error is minimized if we:
- Decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise decide $\omega_2$
- Equivalently: decide $\omega_1$ if $p(x \mid \omega_1)\, P(\omega_1) > p(x \mid \omega_2)\, P(\omega_2)$; otherwise decide $\omega_2$
For the multi-category case: decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \ne i$.
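A numeric sketch of this rule for two classes (all probability values below are invented for illustration):

```python
# Minimum-error-rate decision sketch for two classes
# (priors and likelihoods are made-up illustration values).
priors = {"w1": 0.7, "w2": 0.3}              # P(w_i)
likelihood = {"w1": 0.2, "w2": 0.6}          # p(x | w_i) at the observed x

# decide the class with the larger p(x|w_i) * P(w_i)
scores = {w: likelihood[w] * priors[w] for w in priors}
decision = max(scores, key=scores.get)

evidence = sum(scores.values())              # p(x)
posterior = {w: scores[w] / evidence for w in scores}
print(decision, posterior)                   # P(error|x) = 1 - max posterior
```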
38/49
Classification Error
Bayes decision rule: place the decision boundary such that the decision always chooses the class $\omega_i$ with the maximum value of $p(x \mid \omega_i)\, P(\omega_i)$; the tail integral area $P(\text{error})$ then becomes minimal.
39/49
Classifier Discriminant Functions
Decision problem = pattern classification problem in which unknown data x are classified into known categories (e.g., classifying sounds into phonemes).
Assign x to class $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \ne i$, where $g_i(x),\ i = 1, \dots, c$ are the discriminant functions.
Minimum-error-rate classifier:
$$g_i(x) = P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j)\, P(\omega_j)}$$
where $p(x \mid \omega_i)$ is the class-conditional probability density function and $P(\omega_i)$ the a priori probability. Since the denominator is independent of class i, equivalent discriminants are
$$g_i(x) = p(x \mid \omega_i)\, P(\omega_i) \qquad \text{or} \qquad g_i(x) = \log p(x \mid \omega_i) + \log P(\omega_i)$$
40/49
Classifier Design in Practice
- Need the a priori probability $P(\omega_i)$ (not too bad)
- Need the class-conditional PDF $p(x \mid \omega_i)$
- Problems:
  - limited training data
  - limited computation
  - class labeling is potentially costly and prone to error
  - classes may not be known
  - good features are not known
- Parametric solution: assume that $p(x \mid \omega_i)$ has a particular parametric form; the most common representative is the multivariate normal density
41/49
The Gaussian (Normal) Distribution
The most important probability distribution, since random variables in physical experiments (including speech signals) often have distributions that are approximately Gaussian.
X has a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ if X has a continuous pdf of the form
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
[Figures: three Gaussian distributions with the same mean but different variances σ; three binomial distributions with different probabilities p (E(X) = np, Var(X) = np(1-p)); three Poisson distributions with different λ (E(X) = Var(X) = λ)]
42/49
Mixtures of Gaussian Densities
Often the shape of the set of vectors that belong to one class does not look like something that can be modeled by a single Gaussian. A (weighted) sum of Gaussians can approximate many more densities. In general, a class can be modeled as a mixture of Gaussians:
$$p(x) = \sum_k c_k\, N(x;\, \mu_k, \Sigma_k), \qquad \sum_k c_k = 1$$
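A minimal sketch of evaluating such a mixture density with scipy; the two-component parameters are invented for illustration:

```python
# Mixture density evaluation sketch: p(x) = sum_k c_k N(x; mu_k, Sigma_k)
# (two illustrative components).
import numpy as np
from scipy.stats import multivariate_normal

weights = [0.4, 0.6]                          # mixture weights c_k, sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]

def mixture_pdf(x):
    return sum(c * multivariate_normal.pdf(x, mean=m, cov=S)
               for c, m, S in zip(weights, means, covs))

print(mixture_pdf(np.array([1.5, 1.5])))      # density between the two modes
```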
43/49
Gaussian Densities
The most frequently used model for (preprocessed) speech signals is the Gaussian density; often the "size" of the parameter space is measured in the "number of densities". A multivariate Gaussian density looks like this:
$$N(x;\, \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
Its parameters are:
- the mean vector $\mu$ (a vector with d coefficients)
- the covariance matrix $\Sigma$ (a symmetric d×d matrix); if the components of X are independent, $\Sigma$ is diagonal
- the determinant of the covariance matrix $|\Sigma|$
44/49
Gaussian Classifier
For each class $\omega_i$, we need to estimate from training data:
- the mean vector $\mu_i$
- the covariance matrix $\Sigma_i$
45/49
Estimation of Parameters
MLE, Maximum Likelihood Estimation: find the set of parameters that maximizes the likelihood of generating the observed data.
If $p(x \mid W)$ is assumed to be Gaussian, then the parameter set W is defined by the mean and the covariance matrix:
$$\mu = \frac{1}{n} \sum_{k=1}^{n} x_k, \qquad \Sigma = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)(x_k - \mu)^T$$
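A minimal sketch of these estimates and of the resulting Gaussian classifier, deciding by $\log p(x \mid \omega_i) + \log P(\omega_i)$ as on the discriminant-function slide. The training data is synthetic and the equal priors are an assumption:

```python
# MLE mean/covariance per class, then a Gaussian classifier
# (synthetic two-class data, equal priors assumed).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 1.0, size=(100, 2))   # class w1 samples
X2 = rng.normal([3, 3], 1.5, size=(100, 2))   # class w2 samples

def mle(X):
    mu = X.mean(axis=0)                       # mu = (1/n) sum x_k
    Sigma = (X - mu).T @ (X - mu) / len(X)    # Sigma = (1/n) sum (x-mu)(x-mu)^T
    return mu, Sigma

params = [mle(X1), mle(X2)]
priors = [0.5, 0.5]                           # assumed equal priors

def classify(x):
    g = [np.log(multivariate_normal.pdf(x, mean=m, cov=S)) + np.log(p)
         for (m, S), p in zip(params, priors)]
    return int(np.argmax(g))                  # index of the winning class

print(classify(np.array([0.5, 0.2])))         # -> 0 (class w1)
```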
46/49
Problems of Classifier Design
Features:
- What and how many features should be selected?
- Any features? The more the better?
- If additional features are not useful (same mean and covariance), will the classifier automatically ignore them?
47/49
Curse of Dimensionality
Adding more features:
- Adding independent features may help
- BUT: adding indiscriminant features may lead to worse performance!
- Reason: training data vs. number of parameters; training data is limited
Solution: select features carefully and reduce dimensionality, e.g. with Principal Component Analysis (see the sketch below)
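A minimal PCA sketch via the eigenvectors of the sample covariance matrix; the data is synthetic and the number of retained components is arbitrary:

```python
# PCA dimensionality-reduction sketch (synthetic data assumed).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))                # 200 samples, 10 features

Xc = X - X.mean(axis=0)                       # center the data
cov = Xc.T @ Xc / len(Xc)                     # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues

k = 3                                         # keep 3 strongest directions
W = eigvecs[:, -k:]                           # top-k principal components
X_reduced = Xc @ W                            # (200, 3) projected features
print(X_reduced.shape)
```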
48/49
Trainability
- Two-phoneme classification example (Huang et al.): phonemes modeled by Gaussian mixtures
- Parameters are trained with differently sized sets of training samples
49/49
Problems
[Figure: a density f(x) over x that a single mode cannot capture]
- A normal distribution does not model this situation well
- Other densities may be mathematically intractable
⇒ non-parametric techniques
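A minimal sketch of one such non-parametric technique, the k-nearest-neighbor classifier from the earlier list; it needs no density assumption (toy data, arbitrary k):

```python
# k-nearest-neighbor classification sketch (toy data, k chosen arbitrarily).
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    d = np.linalg.norm(X_train - x, axis=1)   # distance to every sample
    nearest = y_train[np.argsort(d)[:k]]      # labels of the k closest
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[counts.argmax()]            # majority vote

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([0.5, 0.5]), X_train, y_train))  # -> 0
```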