11-751 Speech Recognition 09-15-2008 Pattern Recognition and Classification (for speech recognition)


  • 11-751 Speech Recognition, 09-15-2008

    Pattern Recognition and Classification (for speech recognition)

  • 2/49

    09-15-2008 Course information & quick review

    The components of a modern ASR system

    Pattern Recognition / Classification (we will continue with this topic on Wednesday)

  • 3/49

    Course Grading: 30% Homework Assignments

    4 assignments over the course

    40% Exam [12-Dec]: in-class final exam at the end of the course; closed book; covers the material presented in the course

    30% Speech term project: proposal (1-pager) [due: 08-Oct]; oral presentation (15 min) [start of Dec]; written report (10 pages max) [due: 15-Dec]; demonstration (if applicable). Your ideas and creativity for projects are highly welcome; details and project ideas will be given on Wednesday.

  • 4/49

    Instructors

    RM 203: Alex Waibel ( [email protected])

    RM 221: Ian Lane ( [email protected] )

    RM 209: Yik-Cheung (Wilson) Tam ( [email protected] )

    interACT, 2F, 407 S. Craig St.

    Newell-Simon Hall

    Doherty Hall

  • 5/49

    What we have looked at so far: Why speech recognition?

    Speech production: how humans generate speech; vocal tract model of speech

    Features used for speech recognition: spectral representation of speech; LPC (Linear Predictive Coding); MFCC (Mel-frequency cepstral coefficients)

    Dynamic time warping and template matching: isolated word recognition

  • 6/49

    Vocal Tract Model of Speech

    [Block diagram: an impulse-train generator (driven by the pitch period) and a random noise generator, each with a gain, excite a vocal tract filter V(t) parameterized by the vocal tract parameters; a radiation model R(t) follows and produces the speech signal.]

  • 7/49

    Sloppy Speech

    Read speech vs. conversational speech:

    Actual input:  "I have been I have been getting into"
    Recognition:   "and I am I being too yeah"
    Recognition:   "I have been ties than getting into the"

  • 8/49

    Feature Extraction

    Acoustic vectors are computed every 10 ms. Mel-scale filters mimic auditory processing. The DCT decorrelates the signal to improve statistical independence. First and second differentials are appended to capture the dynamic information of the signal.

    [Pipeline: speech waveform → FFT → |FFT|² spectrum → mel-scale triangular filters → Log → DCT → 39-element acoustic vector x]
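A minimal sketch of this front-end for a single pre-emphasized, windowed frame, assuming NumPy and SciPy; the frame length, number of filters, and number of kept coefficients are illustrative choices, not the exact course setup. Delta and delta-delta coefficients would be appended across frames to reach the 39-element vector.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, sample_rate, n_filters=26, n_ceps=13):
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2                 # FFT-based power spectrum
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ spectrum
    log_energies = np.log(energies + 1e-10)                    # compress dynamic range
    return dct(log_energies, type=2, norm='ortho')[:n_ceps]    # DCT decorrelates
```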

  • 9/49

    Template matching and DTW

    Linear alignment can handle the problem of different speaking rates, but it cannot handle varying speaking rates within the same utterance.

    First idea to overcome the varying length of utterances (Problem 2):
    1. Normalize their length
    2. Make a linear alignment
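A minimal dynamic time warping sketch over two feature-vector sequences, assuming NumPy; the Euclidean local distance and the three standard transitions (match, insertion, deletion) are illustrative choices.

```python
import numpy as np

def dtw_distance(X, Y):
    """Dynamic time warping cost between feature sequences X (n x d) and Y (m x d)."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])           # local distance
            # best predecessor: diagonal match, or a step along either sequence
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# Template matching for isolated words: pick the template with the smallest warped distance.
# templates = {"yes": feats_yes, "no": feats_no}
# best_word = min(templates, key=lambda w: dtw_distance(test_feats, templates[w]))
```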

  • 10/49

    Components of a Modern ASR System

    Suggested reading:

    S. Young, Large vocabulary continuous speech recognition: A review

  • 11/49

    ASR: the big picture

    [Diagram: input speech → ??? → output text "Hello world"]

  • 12/49

    ASR: the big picture

    The purpose of signal preprocessing is:

    1) Signal Digitalization (Quantization and Sampling)

    Represent an analog signal in an appropriate form to be processed by the computer

    2) Digital Signal Preprocessing (Feature Extraction)

    Extract features that are suitable for recognition process

    [Diagram: input speech → front-end processing → ??? → output text "Hello world"]

  • 13/49

    Fundamental Equation

    For an observed feature vector sequence x, find the most likely word sequence W:

    $\hat{W} = \arg\max_W P(W \mid x) = \arg\max_W \frac{P(W)\, P(x \mid W)}{P(x)}$

    [Diagram: input speech → front-end processing → x → ??? → output text "Hello world"]

  • 14/49

    Speech Recognition Decoding

    For an observed feature vector sequence x, find the most likely word sequence W:

    $\hat{W} = \arg\max_W P(W \mid x) = \arg\max_W \frac{P(W)\, P(x \mid W)}{P(x)}$

    [Diagram: input speech → front-end processing → x → search → output text "Hello world", where the search combines the Acoustic Model $P(x \mid W)$ and the Language Model $P(W)$]

    Search: how to efficiently find the maximizing W
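A toy illustration of the argmax above, assuming per-hypothesis log scores are already available from an acoustic model and a language model; the hypothesis list and the score tables are made-up placeholders, not a real decoder.

```python
def decode(candidates, acoustic_logprob, language_logprob):
    """Pick W maximizing log P(x|W) + log P(W); P(x) is constant over W and can be dropped."""
    return max(candidates,
               key=lambda W: acoustic_logprob(W) + language_logprob(W))

# Example with made-up scores:
candidates = [("hello", "world"), ("hollow", "word")]
am = {("hello", "world"): -42.0, ("hollow", "word"): -45.5}   # log P(x|W)
lm = {("hello", "world"): -3.2,  ("hollow", "word"): -6.8}    # log P(W)
best = decode(candidates, am.get, lm.get)                      # -> ("hello", "world")
```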

  • 15/49

    Acoustic Model: given W, what is the likelihood of seeing the feature vector(s) x?

    We need a representation for W in terms of feature vectors.

    Usually a two-part representation: a pronunciation dictionary that describes W as a concatenation of phones, and phone models that explain phones in terms of feature vectors.

    [Diagram: input speech → front-end processing → x; the Acoustic Model (phones) provides $P(x \mid W)$ and the Pronunciation Dictionary maps words to phone sequences, e.g. I → /i/, you → /j/ /u/, we → /v/ /e/; the Language Model provides $P(W)$; output text "Hello world"]

  • 16/49

    Why break words down into phones?

    Need a collection of reference patterns for each word
    High computational effort (especially for large vocabularies), proportional to vocabulary size
    A large vocabulary also means: a huge amount of training data is needed; it is difficult to train suitable references (or sets of references); untrained words cannot be recognized
    → Replace whole words by suitable sub-word units

    Poor performance when the environment changes
    Works well only for speaker-dependent recognition (variations)
    Unsuitable where the speaker is unknown and no training is feasible
    Unsuitable for continuous speech (combinatorial explosion)
    Difficult to train/recognize subword units
    → Replace the template approach by a better modeling process

  • 17/49

    Speech Production as a Stochastic Process

    The same word / phoneme sounds different every time it is uttered. Regard words / phonemes as states of a speech production process.

    In a given state we can observe different acoustic sounds. Not all sounds are possible / likely in every state. We say: in a given state the speech process "emits" sound according to some probability distribution.

    The production process makes transitions from one state to another. Not all transitions are possible, and they have different probabilities.

    When we specify the probabilities for the sound emissions (emission probabilities) and for the state transitions, we call this a model.

  • 18/49

    HMM Acoustic Modelling

    A Hidden Markov Model for each phone or senone (context-dependent model):
    Transition probabilities a_ij model durational variability in speech
    Output distributions b_i(y_k) model spectral variability

    [Diagram: acoustic vector sequence Y = y1 y2 y3 y4 y5 emitted by a 5-state left-to-right HMM with self-loops a22, a33, a44, forward transitions a12, a23, a34, a45, and output probabilities b2(y1) b2(y2) b3(y3) b4(y4) b4(y5)]
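A small sketch of evaluating P(Y | model) with the scaled forward algorithm for an HMM like the one sketched above, assuming NumPy; the transition matrix A = [a_ij] and per-state output densities b_i(y) are given, and the initial state is assumed to be the first state.

```python
import numpy as np

def forward_loglik(A, b, Y, start_state=0):
    """Log-likelihood of observation sequence Y under an HMM.
    A: (S x S) transition probabilities a_ij
    b: list of S callables, b[i](y) = output density of state i
    Y: sequence of acoustic vectors
    """
    S = A.shape[0]
    alpha = np.zeros(S)
    alpha[start_state] = b[start_state](Y[0])     # all initial mass on the start state
    c = alpha.sum() + 1e-300
    loglik = np.log(c)
    alpha /= c
    for y in Y[1:]:
        # alpha_t(j) = b_j(y_t) * sum_i alpha_{t-1}(i) * a_ij
        alpha = np.array([b[j](y) * np.dot(alpha, A[:, j]) for j in range(S)])
        c = alpha.sum() + 1e-300                  # rescale to avoid underflow
        loglik += np.log(c)
        alpha /= c
    return loglik
```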

  • 19/49

    Language Model

    What is the likelihood of seeing word sequence W? The prior probability of W, independent of the observed event x.

    [Diagram: input speech → front-end processing → x; Acoustic Model $P(x \mid W)$; Language Model $P(W)$ (likelihood of word sequences), e.g. p(you | how are), p(today | are you), p(world | Hello); output text "Hello world"]

  • 20/49

    Language Modelling

    P(W) is the a-priori probability of observing word sequence W, independent of the observed signal x.

    N-gram language model: estimate the probability of a word w_k given the preceding n-1 words; typically n = 3 or 4.

    Smoothing is required to account for word sequences not seen during training.

    $P(W) = \prod_k P(w_k \mid w_{k-1}, w_{k-2})$
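A minimal trigram estimate with add-one smoothing as a stand-in for the smoothing mentioned above (real systems use more refined schemes such as Katz or Kneser-Ney); the training sentences and sentence markers are toy placeholders.

```python
from collections import defaultdict

class TrigramLM:
    def __init__(self):
        self.tri = defaultdict(int)   # counts of (w_{k-2}, w_{k-1}, w_k)
        self.bi = defaultdict(int)    # counts of (w_{k-2}, w_{k-1})
        self.vocab = set()

    def train(self, sentences):
        for sent in sentences:
            words = ["<s>", "<s>"] + sent + ["</s>"]
            self.vocab.update(words)
            for i in range(2, len(words)):
                self.tri[tuple(words[i - 2:i + 1])] += 1
                self.bi[tuple(words[i - 2:i])] += 1

    def prob(self, w, w1, w2):
        """P(w | w2 w1) with add-one smoothing; w1 is the previous word, w2 the one before."""
        V = len(self.vocab)
        return (self.tri[(w2, w1, w)] + 1) / (self.bi[(w2, w1)] + V)

lm = TrigramLM()
lm.train([["hello", "world"], ["hello", "there"]])
p = lm.prob("world", "hello", "<s>")   # P(world | <s> hello) = (1 + 1) / (1 + 5)
```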

  • 21/49

    Decoding with classifiers

    [Diagram: speech → feature extraction → speech features → decision (apply trained classifiers) → hypotheses (phonemes), e.g. /h/ /e/ /l/ /o/ /w/ /o/ /r/ /l/ /d/]

  • 22/49

    Training classifiers

    [Diagram: aligned speech (e.g. /h/ /e/ /l/ /o/) → feature extraction → speech features → train classifier → improved classifiers]

    Use all aligned speech features (e.g. of phoneme /e/) to train the reference vectors of /e/ (= codebook), e.g. with k-means or LVQ.
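A minimal k-means codebook sketch for one phoneme's aligned feature vectors, assuming NumPy; the number of reference vectors, iterations, and initialization are illustrative choices (LVQ would further refine such codebooks discriminatively).

```python
import numpy as np

def train_codebook(features, k=4, iters=20, seed=0):
    """k-means: 'features' are all aligned frames of one phoneme, shape (n, d)."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every frame to its nearest reference vector
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # re-estimate each reference vector as the mean of its assigned frames
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = features[labels == j].mean(axis=0)
    return codebook

# codebooks = {"/e/": train_codebook(frames_of_e), "/h/": train_codebook(frames_of_h), ...}
```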

  • 23/49

    Suggested reading:

    X. Huang, A. Acero, H. Hon, Spoken Language Processing, Chapter 4

    R. Duda and P. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, 2000 (2nd Edition)

    Pattern Recognition and Classification(for speech recognition)

  • 24/49

    Pattern Recognition Approaches

    [Diagram: taxonomy of pattern recognition approaches: knowledge-based / connectionist / statistical; supervised vs. unsupervised; parametric vs. non-parametric; linear vs. non-linear]

  • 25/49

    Pattern Recognition Approaches

    Knowledge-based approaches: compile knowledge; build decision trees

    Connectionist approaches: automatic knowledge acquisition, "black-box" behavior; simulation of biological processes

    Statistical approaches: build a statistical model of the "real world"; compute probabilities according to the models

  • 26/49

    Classification Trees

    Use a decision tree to predict someone's height class by traversing the tree and answering yes/no questions. The choice and order of questions is designed subjectively (knowledge-based). Classification and Regression Trees (CART) provide an automatic, data-driven framework to construct the decision process.

    Simple binary decision tree for height classification:
    T = tall, t = medium-tall, M = medium, m = medium-short, S = short

    Goal: predict the height of a new person

  • 27/49

    The CART Algorithm

    Step 1: Create a set of questions Q that consists of all possible questions
    Step 2: Pick a splitting criterion that can evaluate all possible questions
    Step 3: Create a tree with one root node consisting of all training samples
    Step 4: Find the best composite question for each terminal node
            (The goal is classification, so the objective is to reduce uncertainty; use entropy H and find the question that gives the greatest reduction in H; a sketch of this criterion follows below.)
            - generate a tree with several simple-question splits
            - cluster the leaf nodes into two classes according to the splitting criterion
            - construct a corresponding composite question
    Step 5: Split: take the split with the best criterion from step 4
    Step 6: Stop criterion: go to step 7 if all leaf nodes contain data samples from the same class, or if the improvements of all potential splits fall below a defined threshold
    Step 7: Prune the tree to the optimal size using an independent test-sample estimate or cross-validation, to prevent the tree from over-modeling the training data and to allow generalization
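A sketch of the splitting criterion in step 4: pick the yes/no question that most reduces class entropy at a node. Questions are assumed to be predicates over samples; the data and question set are placeholders.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_question(samples, labels, questions):
    """Return the question giving the greatest entropy reduction at this node."""
    base = entropy(labels)
    best, best_gain = None, 0.0
    for q in questions:                                   # q: sample -> True / False
        yes = [l for s, l in zip(samples, labels) if q(s)]
        no = [l for s, l in zip(samples, labels) if not q(s)]
        if not yes or not no:
            continue
        split_h = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
        gain = base - split_h                             # reduction in uncertainty H
        if gain > best_gain:
            best, best_gain = q, gain
    return best, best_gain
```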

  • 28/49

    Neural Net Approaches

    NNs attempt real-time response and human-like performance using many simple processing elements operating in parallel. Parallel supercomputers have renewed interest in NNs. They are appealing for ASR: parallel evaluation of many clues and facts. The most common approach is the Multi-Layer Perceptron (MLP).

    Most common training procedure: error back-propagation, a generalization of the MMSE (Minimum Mean Squared Error) algorithm; a gradient search minimizes the difference between the actual and the desired output.

    The MLP approximates the a posteriori probabilities P(Class | Pattern). A common problem: if an output of 0 0 0 ... 0 1 0 ... 0 0 0 is desired, the net tends to produce 0 0 0 ... 0 for all inputs.
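A minimal back-propagation sketch for a one-hidden-layer MLP trained with a squared-error criterion, assuming NumPy; layer sizes, learning rate, epochs, and initialization are illustrative choices, not the course's recipe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, T, hidden=16, lr=0.1, epochs=200, seed=0):
    """One-hidden-layer MLP trained by error back-propagation on a squared-error criterion.
    X: (n, d) patterns, T: (n, c) one-hot class targets."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (X.shape[1], hidden))
    W2 = rng.normal(0.0, 0.1, (hidden, T.shape[1]))
    for _ in range(epochs):
        H = sigmoid(X @ W1)                  # hidden activations
        Y = sigmoid(H @ W2)                  # outputs, approximating P(class | pattern)
        dY = (Y - T) * Y * (1 - Y)           # gradient of squared error through output sigmoid
        dH = (dY @ W2.T) * H * (1 - H)       # error back-propagated to the hidden layer
        W2 -= lr * H.T @ dY / len(X)         # gradient-descent updates
        W1 -= lr * X.T @ dH / len(X)
    return W1, W2
```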

  • 29/49

    Pattern Recognition: Types of Classifiers

    Supervised vs. unsupervised classifiers; parametric vs. non-parametric classifiers; linear vs. non-linear classifiers

    Classical statistical methods: Bayes classifier, k-nearest neighbor

    Connectionist methods: perceptron, multilayer perceptrons

  • 30/49

    Supervised vs. Unsupervised

    Supervised training: the class to be recognized is known for each sample in the training data. Requires a priori knowledge of useful features and knowledge/labeling of each training token (cost!).

    Unsupervised training: the class is not known and the structure is to be discovered automatically. Feature-space reduction; examples: clustering, auto-associative nets.

  • 31/49

    Unsupervised Classification

    Classification when the classes are not known: find structure.

    [Scatter plot of unlabeled samples in the feature space F1 x F2]

  • 32/49

    Unsupervised Classification

    Classification when the classes are not known: find structure by clustering. How to cluster? How many clusters?

    [Scatter plot of unlabeled samples in the feature space F1 x F2]

  • 33/49

    Supervised Classification

    Classification when the classes are known. Example: creditworthiness (yes/no) with features income and age; train classifiers.

    [Scatter plot: Income vs. Age, samples labeled credit-worthy and non-credit-worthy]

  • 34/49

    Classification Problem

    Is Joe credit-worthy?

    Features: age, income. Classes: credit-worthy, non-credit-worthy. Problem: given Joe's income and age, should a loan be made? Other classification problems: fraud detection, customer selection, ...

    [Scatter plot: Income vs. Age, samples labeled credit-worthy and non-credit-worthy, with Joe's feature point x to classify]

  • 35/49

    Parametric vs. Non-parametric

    Parametric: assume an underlying probability distribution and estimate the parameters of this distribution. Example: "Gaussian classifier".

    Non-parametric: do not assume a distribution; estimate the probability of error or the error criterion directly from the training data. Examples: Parzen windows, k-nearest neighbor, perceptron, ...

    [Histogram: number of loans vs. income for good loans and bad loans, with the decision threshold placed at the minimum-error criterion]

  • 36/49

    Bayes Decision Theory

    Bayes' rule plays a central role in statistical pattern recognition.

    Concept of decision making based on:
    1) prior knowledge of the categories: the prior probability $P(\omega_j)$
    AND
    2) knowledge from the observation of x: the posterior probability $P(\omega_j \mid x)$

    Bayes' rule:

    $P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$

    where $p(x) = \sum_j p(x \mid \omega_j)\, P(\omega_j)$

    $p(x \mid \omega_j)$ is the class-conditional probability density function (referred to as the likelihood function: how likely x is generated by class $\omega_j$).
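A small numeric sketch of Bayes' rule for two classes; the priors and likelihood values are made up purely for illustration.

```python
def posteriors(priors, likelihoods):
    """P(w_j | x) = p(x | w_j) P(w_j) / p(x), with p(x) = sum_j p(x | w_j) P(w_j)."""
    evidence = sum(likelihoods[j] * priors[j] for j in priors)
    return {j: likelihoods[j] * priors[j] / evidence for j in priors}

# Two classes with made-up numbers:
priors = {"w1": 0.7, "w2": 0.3}          # P(w_j), known before seeing x
likelihoods = {"w1": 0.2, "w2": 0.6}     # p(x | w_j) for the observed x
post = posteriors(priors, likelihoods)   # approximately {"w1": 0.44, "w2": 0.56}
# Minimum-error-rate rule: decide the class with the largest posterior, here w2.
```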

  • 37/49

    Minimum-Error-Rate Decision Rule

    $P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}$

    The classification error is minimized if we:
    Decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise decide $\omega_2$.
    Equivalently, decide $\omega_1$ if $p(x \mid \omega_1)\, P(\omega_1) > p(x \mid \omega_2)\, P(\omega_2)$; otherwise decide $\omega_2$.

    For the multi-category case: decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \neq i$.

  • 38/49

    Classification Error

    Bayes decision rule: place the decision boundary such that class $\omega_i$ is chosen based on the maximum value of $p(x \mid \omega_i)\, P(\omega_i)$. The tail integral area P(error) then becomes minimal.

  • 39/49

    Classifier Discriminant Functions

    Decision problem = pattern classification problem in which unknown data x are classified into known categories (e.g. classify sounds into phonemes).

    Assign x to class $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$, with discriminant functions $g_i(x)$, $i = 1, \ldots, c$.

    Minimum-error-rate classifier:

    $g_i(x) = P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j)\, P(\omega_j)}$

    The denominator is independent of class i, so equivalently

    $g_i(x) = p(x \mid \omega_i)\, P(\omega_i)$   or   $g_i(x) = \log p(x \mid \omega_i) + \log P(\omega_i)$

    where $p(x \mid \omega_i)$ is the class-conditional probability density function and $P(\omega_i)$ the a priori probability.

  • 40/49

    Classifier Design in Practice

    Need the a priori probability $P(\omega_i)$ (not too bad). Need the class-conditional PDF $p(x \mid \omega_i)$. Problems:

    limited training data; limited computation; class labeling is potentially costly and prone to error; classes may not be known; good features are not known

    Parametric solution: assume that $p(x \mid \omega_i)$ has a particular parametric form. Most common representative: the multivariate normal density.

  • 41/49

    The most important probability distribution, since random variables in physical experiments (including speech signals) often have distributions that are approximately Gaussian: the normal distribution.

    X has a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ if X has a continuous pdf of the form:

    $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

    [Plots: three Gaussian distributions with the same mean but different variances $\sigma$; three binomial distributions with different probabilities p, where $E(X) = np$ and $\mathrm{Var}(X) = np(1-p)$; three Poisson distributions with different $\lambda$, where $E(X) = \mathrm{Var}(X) = \lambda$]

  • 42/49

    Mixtures of Gaussian Densities

    Often the shape of the set of vectors that belong to one class does not look like something that can be modeled by a single Gaussian. A (weighted) sum of Gaussians can approximate many more densities.

    In general, a class can be modeled as a mixture of Gaussians:

    $p(x \mid \omega_i) = \sum_k c_{ik}\, N(x; \mu_{ik}, \Sigma_{ik}), \qquad \sum_k c_{ik} = 1$
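A sketch of evaluating such a mixture density for one class, assuming NumPy and diagonal covariances; the weights, means, and variances would come from training (e.g. EM), which is not shown here.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log of a d-dimensional Gaussian with diagonal covariance 'var'."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_mixture(x, weights, means, variances):
    """log p(x | class) for a weighted sum of Gaussians (weights sum to 1)."""
    comps = [np.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    return np.logaddexp.reduce(comps)     # numerically stable log of the sum of exponentials
```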

  • 43/49

    Gaussian Densities

    The most often used model for (preprocessed) speech signals are Gaussian densities. Often the "size" of the parameter space is measured in the "number of densities". A multivariate Gaussian density looks like this:

    $N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$

    Its parameters are: the mean vector $\mu$ (a vector with d coefficients) and the covariance matrix $\Sigma$ (a symmetric d x d matrix, diagonal if the components of X are independent); $|\Sigma|$ is the determinant of the covariance matrix.

  • 44/49

    Gaussian Classifier

    For each class $\omega_i$, we need to estimate from the training data the mean vector $\mu_i$ and the covariance matrix $\Sigma_i$.

  • 45/49

    Estimation of Parameters

    MLE, Maximum Likelihood Estimation: find the set of parameters that maximizes the likelihood of generating the observed data.

    If $p(x \mid \theta)$ is assumed to be Gaussian, then $\theta$ is defined by the mean and the covariance matrix:

    $\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k$

    $\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^t$
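A sketch of these ML estimates together with the resulting class discriminant $g_i(x) = \log p(x \mid \omega_i) + \log P(\omega_i)$ from the earlier slide, assuming NumPy and full covariances; the small regularization term added to keep the covariance invertible is an illustrative choice.

```python
import numpy as np

def mle_gaussian(X):
    """ML estimates: mean = (1/n) sum x_k, covariance = (1/n) sum (x_k - mean)(x_k - mean)^T."""
    mean = X.mean(axis=0)
    diff = X - mean
    cov = diff.T @ diff / len(X)
    return mean, cov

def log_discriminant(x, mean, cov, prior, eps=1e-6):
    """g_i(x) = log p(x | w_i) + log P(w_i) for a multivariate Gaussian class model."""
    d = len(mean)
    cov = cov + eps * np.eye(d)                     # keep the covariance invertible
    diff = x - mean
    log_det = np.linalg.slogdet(cov)[1]
    log_lik = -0.5 * (d * np.log(2 * np.pi) + log_det
                      + diff @ np.linalg.solve(cov, diff))
    return log_lik + np.log(prior)

# Classify x by the largest g_i(x) over the classes i.
```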

  • 46/49

    Problems of Classifier Design

    Features: What features, and how many, should be selected? Any features? Is more always better? If additional features are not useful (same mean and covariance), will the classifier automatically ignore them?

  • 47/49

    Curse of Dimensionality

    Adding more features: adding independent features may help, BUT adding non-discriminative features may lead to worse performance!

    Reason: training data vs. number of parameters; the training data is limited.

    Solution: select features carefully; reduce dimensionality, e.g. with Principal Component Analysis.
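A minimal PCA sketch for reducing feature dimensionality via the eigendecomposition of the sample covariance, assuming NumPy; the number of retained components is an illustrative choice.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n x d) onto its n_components directions of largest variance."""
    Xc = X - X.mean(axis=0)                       # center the data
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
    W = eigvecs[:, order[:n_components]]
    return Xc @ W, W                              # reduced features and projection matrix
```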

  • 48/49

    Trainability

    Two-phoneme classification example (Huang et al.): phonemes are modeled by Gaussian mixtures, and the parameters are trained with a varied set of training samples.

  • 49/49

    Problems

    [Plot: a density f(x) over x that is not well modeled by a normal distribution]

    The normal distribution does not model this situation well, and other densities may be mathematically intractable.

    → non-parametric techniques
