11-751 Speech Recognition 09-15-2008 Pattern Recognition and Classification (for speech recognition)


  • 11-751 Speech Recognition, 09-15-2008

    Pattern Recognition and Classification (for speech recognition)

  • 2/49

    09-15-2008 Course information & quick review

    The components of a modern ASR system

    Pattern Recognition / Classification (we will continue with this topic on Wednesday)

  • 3/49

    Course Grading: 30% Homework Assignments

    4 assignments over the course

    40% Exam [12-Dec]: in-class final exam at the end of the course; closed book; covers the material presented in the course

    30% Speech term project: proposal (1-pager) [due: 08-Oct]; oral presentation (15 min) [start of Dec]; written report (10 pages max) [due: 15-Dec]; demonstration (if applicable). Your ideas and creativity for projects are highly welcome; details and project ideas will be given on Wednesday.

  • 4/49

    Instructors

    RM 203: Alex Waibel ( [email protected])

    RM 221: Ian Lane ( [email protected] )

    RM 209: Yik-Cheung (Wilson) Tam ( [email protected] )

    interACT, 2F, 407 S. Craig St.

    Newell-Simon Hall

    Doherty Hall

  • 5/49

    What we have looked at so far: Why speech recognition?

    Speech production: how humans generate speech; vocal tract model of speech

    Features used for speech recognition: spectral representation of speech; LPC (Linear Predictive Coding); MFCC (Mel-frequency cepstral coefficients)

    Dynamic time warping and template matching: isolated word recognition

  • 6/49

    Vocal Tract Model of Speech

    [Block diagram: an impulse-train generator (driven by the pitch period) and a random noise generator, each with a gain, excite a vocal tract filter V(t) parameterized by the vocal tract parameters; a radiation model R(t) follows and produces the speech signal.]

  • 7/49

    Sloppy Speech

    Read speech vs. conversational speech:

    Actual input:  "I have been I have been getting into"
    Recognition:   "and I am I being too yeah"
    Recognition:   "I have been ties than getting into the"

  • 8/49

    Feature Extraction

    Acoustic vectors are computed every 10 ms. Mel-scale filters mimic auditory processing. The DCT decorrelates the signal to improve statistical independence. First and second differentials are appended to capture the dynamic information of the signal.

    [Pipeline: speech waveform → FFT → |FFT|² spectrum → mel-scale triangular filters → Log → DCT → 39-element acoustic vector x]
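A minimal sketch of this front-end for a single pre-emphasized, windowed frame, assuming NumPy and SciPy; the frame length, number of filters, and number of kept coefficients are illustrative choices, not the exact course setup. Delta and delta-delta coefficients would be appended across frames to reach the 39-element vector.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, sample_rate, n_filters=26, n_ceps=13):
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2                 # FFT-based power spectrum
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ spectrum
    log_energies = np.log(energies + 1e-10)                    # compress dynamic range
    return dct(log_energies, type=2, norm='ortho')[:n_ceps]    # DCT decorrelates
```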

  • 9/49

    Template matching and DTW

    Linear alignment can handle the problem of different speaking rates, but it cannot handle varying speaking rates within the same utterance.

    First idea to overcome the varying length of utterances (Problem 2):
    1. Normalize their length
    2. Make a linear alignment
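A minimal dynamic time warping sketch over two feature-vector sequences, assuming NumPy; the Euclidean local distance and the three standard transitions (match, insertion, deletion) are illustrative choices.

```python
import numpy as np

def dtw_distance(X, Y):
    """Dynamic time warping cost between feature sequences X (n x d) and Y (m x d)."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])           # local distance
            # best predecessor: diagonal match, or a step along either sequence
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# Template matching for isolated words: pick the template with the smallest warped distance.
# templates = {"yes": feats_yes, "no": feats_no}
# best_word = min(templates, key=lambda w: dtw_distance(test_feats, templates[w]))
```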

  • 10/49

    Components of a Modern ASR System

    Suggested reading:

    S. Young, Large vocabulary continuous speech recognition: A review

  • 11/49

    ASR: the big picture

    [Diagram: input speech → ??? → output text "Hello world"]

  • 12/49

    ASR: the big picture

    The purpose of signal preprocessing is:

    1) Signal Digitalization (Quantization and Sampling)

    Represent an analog signal in an appropriate form to be processed by the computer

    2) Digital Signal Preprocessing (Feature Extraction)

    Extract features that are suitable for recognition process

    [Diagram: input speech → front-end processing → ??? → output text "Hello world"]

  • 13/49

    Fundamental Equation

    For an observed feature vector sequence x, find the most likely word sequence W:

    $\hat{W} = \arg\max_W P(W \mid x) = \arg\max_W \frac{P(W)\, P(x \mid W)}{P(x)}$

    [Diagram: input speech → front-end processing → x → ??? → output text "Hello world"]

  • 14/49

    Speech Recognition Decoding

    For an observed feature vector sequence x, find the most likely word sequence W:

    $\hat{W} = \arg\max_W P(W \mid x) = \arg\max_W \frac{P(W)\, P(x \mid W)}{P(x)}$

    [Diagram: input speech → front-end processing → x → search → output text "Hello world", where the search combines the Acoustic Model $P(x \mid W)$ and the Language Model $P(W)$]

    Search: how to efficiently find the maximizing W
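A toy illustration of the argmax above, assuming per-hypothesis log scores are already available from an acoustic model and a language model; the hypothesis list and the score tables are made-up placeholders, not a real decoder.

```python
def decode(candidates, acoustic_logprob, language_logprob):
    """Pick W maximizing log P(x|W) + log P(W); P(x) is constant over W and can be dropped."""
    return max(candidates,
               key=lambda W: acoustic_logprob(W) + language_logprob(W))

# Example with made-up scores:
candidates = [("hello", "world"), ("hollow", "word")]
am = {("hello", "world"): -42.0, ("hollow", "word"): -45.5}   # log P(x|W)
lm = {("hello", "world"): -3.2,  ("hollow", "word"): -6.8}    # log P(W)
best = decode(candidates, am.get, lm.get)                      # -> ("hello", "world")
```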

  • 15/49

    Acoustic Model: given W, what is the likelihood of seeing the feature vector(s) x?

    We need a representation for W in terms of feature vectors.

    Usually a two-part representation: a pronunciation dictionary that describes W as a concatenation of phones, and phone models that explain phones in terms of feature vectors.

    [Diagram: input speech → front-end processing → x; the Acoustic Model (phones) provides $P(x \mid W)$ and the Pronunciation Dictionary maps words to phone sequences, e.g. I → /i/, you → /j/ /u/, we → /v/ /e/; the Language Model provides $P(W)$; output text "Hello world"]

  • 16/49

    Why break words down into phones?

    Need a collection of reference patterns for each word
    High computational effort (especially for large vocabularies), proportional to vocabulary size
    A large vocabulary also means: a huge amount of training data is needed; it is difficult to train suitable references (or sets of references); untrained words cannot be recognized
    → Replace whole words by suitable sub-word units

    Poor performance when the environment changes
    Works well only for speaker-dependent recognition (variations)
    Unsuitable where the speaker is unknown and no training is feasible
    Unsuitable for continuous speech (combinatorial explosion)
    Difficult to train/recognize subword units
    → Replace the template approach by a better modeling process

  • 17/49

    Speech Production as a Stochastic Process

    The same word / phoneme sounds different every time it is uttered. Regard words / phonemes as states of a speech production process.

    In a given state we can observe different acoustic sounds. Not all sounds are possible / likely in every state. We say: in a given state the speech process "emits" sound according to some probability distribution.

    The production process makes transitions from one state to another. Not all transitions are possible, and they have different probabilities.

    When we specify the probabilities for the sound emissions (emission probabilities) and for the state transitions, we call this a model.

  • 18/49

    HMM Acoustic Modelling

    A Hidden Markov Model for each phone or senone (context-dependent model):
    Transition probabilities a_ij model durational variability in speech
    Output distributions b_i(y_k) model spectral variability

    [Diagram: acoustic vector sequence Y = y1 y2 y3 y4 y5 emitted by a 5-state left-to-right HMM with self-loops a22, a33, a44, forward transitions a12, a23, a34, a45, and output probabilities b2(y1) b2(y2) b3(y3) b4(y4) b4(y5)]
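A small sketch of evaluating P(Y | model) with the scaled forward algorithm for an HMM like the one sketched above, assuming NumPy; the transition matrix A = [a_ij] and per-state output densities b_i(y) are given, and the initial state is assumed to be the first state.

```python
import numpy as np

def forward_loglik(A, b, Y, start_state=0):
    """Log-likelihood of observation sequence Y under an HMM.
    A: (S x S) transition probabilities a_ij
    b: list of S callables, b[i](y) = output density of state i
    Y: sequence of acoustic vectors
    """
    S = A.shape[0]
    alpha = np.zeros(S)
    alpha[start_state] = b[start_state](Y[0])     # all initial mass on the start state
    c = alpha.sum() + 1e-300
    loglik = np.log(c)
    alpha /= c
    for y in Y[1:]:
        # alpha_t(j) = b_j(y_t) * sum_i alpha_{t-1}(i) * a_ij
        alpha = np.array([b[j](y) * np.dot(alpha, A[:, j]) for j in range(S)])
        c = alpha.sum() + 1e-300                  # rescale to avoid underflow
        loglik += np.log(c)
        alpha /= c
    return loglik
```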

  • 19/49

    Language Model

    What is the likelihood of seeing word sequence W? The prior probability of W, independent of the observed event x.

    [Diagram: input speech → front-end processing → x; Acoustic Model $P(x \mid W)$; Language Model $P(W)$ (likelihood of word sequences), e.g. p(you | how are), p(today | are you), p(world | Hello); output text "Hello world"]

  • 20/49

    Language Modelling

    P(W) is the a-priori probability of observing word sequence W, independent of the observed signal x.

    N-gram language model: estimate the probability of a word w_k given the preceding n-1 words; typically n = 3 or 4.

    Smoothing is required to account for word sequences not seen during training.

    $P(W) = \prod_k P(w_k \mid w_{k-1}, w_{k-2})$
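A minimal trigram estimate with add-one smoothing as a stand-in for the smoothing mentioned above (real systems use more refined schemes such as Katz or Kneser-Ney); the training sentences and sentence markers are toy placeholders.

```python
from collections import defaultdict

class TrigramLM:
    def __init__(self):
        self.tri = defaultdict(int)   # counts of (w_{k-2}, w_{k-1}, w_k)
        self.bi = defaultdict(int)    # counts of (w_{k-2}, w_{k-1})
        self.vocab = set()

    def train(self, sentences):
        for sent in sentences:
            words = ["<s>", "<s>"] + sent + ["</s>"]
            self.vocab.update(words)
            for i in range(2, len(words)):
                self.tri[tuple(words[i - 2:i + 1])] += 1
                self.bi[tuple(words[i - 2:i])] += 1

    def prob(self, w, w1, w2):
        """P(w | w2 w1) with add-one smoothing; w1 is the previous word, w2 the one before."""
        V = len(self.vocab)
        return (self.tri[(w2, w1, w)] + 1) / (self.bi[(w2, w1)] + V)

lm = TrigramLM()
lm.train([["hello", "world"], ["hello", "there"]])
p = lm.prob("world", "hello", "<s>")   # P(world | <s> hello) = (1 + 1) / (1 + 5)
```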

  • 21/49

    Decoding with classifiers

    [Diagram: speech → feature extraction → speech features → decision (apply trained classifiers) → hypotheses (phonemes), e.g. /h/ /e/ /l/ /o/ /w/ /o/ /r/ /l/ /d/]

  • 22/49

    Training classifiers

    [Diagram: aligned speech (e.g. /h/ /e/ /l/ /o/) → feature extraction → speech features → train classifier → improved classifiers]

    Use all aligned speech features (e.g. of phoneme /e/) to train the reference vectors of /e/ (= codebook), e.g. with k-means or LVQ.
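A minimal k-means codebook sketch for one phoneme's aligned feature vectors, assuming NumPy; the number of reference vectors, iterations, and initialization are illustrative choices (LVQ would further refine such codebooks discriminatively).

```python
import numpy as np

def train_codebook(features, k=4, iters=20, seed=0):
    """k-means: 'features' are all aligned frames of one phoneme, shape (n, d)."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every frame to its nearest reference vector
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # re-estimate each reference vector as the mean of its assigned frames
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = features[labels == j].mean(axis=0)
    return codebook

# codebooks = {"/e/": train_codebook(frames_of_e), "/h/": train_codebook(frames_of_h), ...}
```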

  • 23/49

    Suggested reading:

    X. Huang, A. Acero, H. Hon, Spoken Language Processing, Chapter 4

    R. Duda and P. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, 2000 (2nd Edition)

    Pattern Recognition and Classification(for speech recognition)

  • 24/49

    Pattern Recognition Approaches

    [Diagram: taxonomy of pattern recognition approaches: knowledge-based / connectionist / statistical; supervised vs. unsupervised; parametric vs. non-parametric; linear vs. non-linear]

  • 25/49

    Pattern Recognition Approaches

    Knowledge-based approaches: compile knowledge; build decision trees

    Connectionist approaches: automatic knowledge acquisition, "black-box" behavior; simulation of biological processes

    Statistical approaches: build a statistical model of the "real world"; compute probabilities according to the models

  • 26/49

    Classification Trees

    Use a decision tree to predict someone's height class by traversing the tree and answering yes/no questions. The choice and order of questions is designed subjectively (knowledge-based). Classification and Regression Trees (CART) provide an automatic, data-driven framework to construct the decision process.

    Simple binary decision tree for height classification:
    T = tall, t = medium-tall, M = medium, m = medium-short, S = short

    Goal: predict the height of a new person

  • 27/49

    The CART Algorithm

    Step 1: Create a set of questions Q that consists of all possible questions
    Step 2: Pick a splitting criterion that can evaluate all possible questions
    Step 3: Create a tree with one root node consisting of all training samples
    Step 4: Find the best composite question for each terminal node
            (The goal is classification, so the objective is to reduce uncertainty; use entropy H and find the question that gives the greatest reduction in H; a sketch of this criterion follows below.)
            - generate a tree with several simple-question splits
            - cluster the leaf nodes into two classes according to the splitting criterion
            - construct a corresponding composite question
    Step 5: Split: take the split with the best criterion from step 4
    Step 6: Stop criterion: go to step 7 if all leaf nodes contain data samples from the same class, or if the improvements of all potential splits fall below a defined threshold
    Step 7: Prune the tree to the optimal size using an independent test-sample estimate or cross-validation, to prevent the tree from over-modeling the training data and to allow generalization
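A sketch of the splitting criterion in step 4: pick the yes/no question that most reduces class entropy at a node. Questions are assumed to be predicates over samples; the data and question set are placeholders.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_question(samples, labels, questions):
    """Return the question giving the greatest entropy reduction at this node."""
    base = entropy(labels)
    best, best_gain = None, 0.0
    for q in questions:                                   # q: sample -> True / False
        yes = [l for s, l in zip(samples, labels) if q(s)]
        no = [l for s, l in zip(samples, labels) if not q(s)]
        if not yes or not no:
            continue
        split_h = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
        gain = base - split_h                             # reduction in uncertainty H
        if gain > best_gain:
            best, best_gain = q, gain
    return best, best_gain
```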

  • 28/49

    Neural Net Approaches

    NNs attempt real-time response and human-like performance using many simple processing elements operating in parallel. Parallel supercomputers have renewed interest in NNs. They are appealing for ASR: parallel evaluation of many clues and facts. The most common approach is the Multi-Layer Perceptron (MLP).

    Most common training procedure: error back-propagation, a generalization of the MMSE (Minimum Mean Squared Error) algorithm; a gradient search minimizes the difference between the actual and the desired output.

    The MLP approximates the a posteriori probabilities P(Class | Pattern). A common problem: if an output of 0 0 0 ... 0 1 0 ... 0 0 0 is desired, the net tends to produce 0 0 0 ... 0 for all inputs.
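A minimal back-propagation sketch for a one-hidden-layer MLP trained with a squared-error criterion, assuming NumPy; layer sizes, learning rate, epochs, and initialization are illustrative choices, not the course's recipe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, T, hidden=16, lr=0.1, epochs=200, seed=0):
    """One-hidden-layer MLP trained by error back-propagation on a squared-error criterion.
    X: (n, d) patterns, T: (n, c) one-hot class targets."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (X.shape[1], hidden))
    W2 = rng.normal(0.0, 0.1, (hidden, T.shape[1]))
    for _ in range(epochs):
        H = sigmoid(X @ W1)                  # hidden activations
        Y = sigmoid(H @ W2)                  # outputs, approximating P(class | pattern)
        dY = (Y - T) * Y * (1 - Y)           # gradient of squared error through output sigmoid
        dH = (dY @ W2.T) * H * (1 - H)       # error back-propagated to the hidden layer
        W2 -= lr * H.T @ dY / len(X)         # gradient-descent updates
        W1 -= lr * X.T @ dH / len(X)
    return W1, W2
```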

  • 29/49

    Pattern Recognition: Types of Classifiers

    Supervised vs. unsupervised classifiers; parametric vs. non-parametric classifiers; linear vs. non-linear classifiers

    Classical statistical methods: Bayes classifier, k-nearest neighbor

    Connectionist methods: perceptron, multilayer perceptrons

  • 30/49

    Supervised vs. Unsupervised

    Supervised training: the class to be recognized is known for each sample in the training data. Requires a priori knowledge of useful features and knowledge/labeling of each training token (cost!).

    Unsupervised training: the class is not known and the structure is to be discovered automatically. Feature-space reduction; examples: clustering, auto-associative nets.

  • 31/49

    Unsupervised Classification

    Classification when the classes are not known: find structure.

    [Scatter plot of unlabeled samples in the feature space F1 x F2]

  • 32/49

    Unsupervised Classification

    Classification when the classes are not known: find structure by clustering. How to cluster? How many clusters?

    [Scatter plot of unlabeled samples in the feature space F1 x F2]

  • 33/49

    Supervised Classification

    Classification when the classes are known. Example: creditworthiness (yes/no) with features income and age; train classifiers.

    [Scatter plot: Income vs. Age, samples labeled credit-worthy and non-credit-worthy]

  • 34/49

    Classification Problem

    Is Joe credit-worthy?

    Features: age, income. Classes: credit-worthy, non-credit-worthy. Problem: given Joe's income and age, should a loan be made? Other classification problems: fraud detection, customer selection, ...

    [Scatter plot: Income vs. Age, samples labeled credit-worthy and non-credit-worthy, with Joe's feature point x to classify]

  • 35/49

    Parametric vs. Non-parametric

    Parametric: assume an underlying probability distribution and estimate the parameters of this distribution. Example: "Gaussian classifier".

    Non-parametric: do not assume a distribution; estimate the probability of error or the error criterion directly from the training data. Examples: Parzen windows, k-nearest neighbor, perceptron, ...

    [Histogram: number of loans vs. income for good loans and bad loans, with the decision threshold placed at the minimum-error criterion]

  • 36/49

    Bayes Decision Theory

    Bayes' rule plays a central role in statistical pattern recognition.

    Concept of decision making based on:
    1) prior knowledge of the categories: the prior probability $P(\omega_j)$
    AND
    2) knowledge from the observation of x: the posterior probability $P(\omega_j \mid x)$

    Bayes' rule:

    $P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$

    where $p(x) = \sum_j p(x \mid \omega_j)\, P(\omega_j)$

    $p(x \mid \omega_j)$ is the class-conditional probability density function (referred to as the likelihood function: how likely x is generated by class $\omega_j$).
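A small numeric sketch of Bayes' rule for two classes; the priors and likelihood values are made up purely for illustration.

```python
def posteriors(priors, likelihoods):
    """P(w_j | x) = p(x | w_j) P(w_j) / p(x), with p(x) = sum_j p(x | w_j) P(w_j)."""
    evidence = sum(likelihoods[j] * priors[j] for j in priors)
    return {j: likelihoods[j] * priors[j] / evidence for j in priors}

# Two classes with made-up numbers:
priors = {"w1": 0.7, "w2": 0.3}          # P(w_j), known before seeing x
likelihoods = {"w1": 0.2, "w2": 0.6}     # p(x | w_j) for the observed x
post = posteriors(priors, likelihoods)   # approximately {"w1": 0.44, "w2": 0.56}
# Minimum-error-rate rule: decide the class with the largest posterior, here w2.
```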

  • 37/49

    Minimum-Error-Rate Decision Rule

    $P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}$

    The classification error is minimized if we:
    Decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise decide $\omega_2$.
    Equivalently, decide $\omega_1$ if $p(x \mid \omega_1)\, P(\omega_1) > p(x \mid \omega_2)\, P(\omega_2)$; otherwise decide $\omega_2$.

    For the multi-category case: decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \neq i$.

  • 38/49

    Classification Error

    Bayes decision rule: place the decision boundary such that class $\omega_i$ is chosen based on the maximum value of $p(x \mid \omega_i)\, P(\omega_i)$. The tail integral area P(error) then becomes minimal.

  • 39/49

    Classifier Discriminant Functions

    Decision problem = pattern classification problem in which unknown data x are classified into known categories (e.g. classify sounds into phonemes).

    Assign x to class $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$, with discriminant functions $g_i(x)$, $i = 1, \ldots, c$.

    Minimum-error-rate classifier:

    $g_i(x) = P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j)\, P(\omega_j)}$

    The denominator is independent of class i, so equivalently

    $g_i(x) = p(x \mid \omega_i)\, P(\omega_i)$   or   $g_i(x) = \log p(x \mid \omega_i) + \log P(\omega_i)$

    where $p(x \mid \omega_i)$ is the class-conditional probability density function and $P(\omega_i)$ the a priori probability.

  • 40/49

    Classifier Design in Practice

    Need the a priori probability $P(\omega_i)$ (not too bad). Need the class-conditional PDF $p(x \mid \omega_i)$. Problems:

    limited training data; limited computation; class labeling is potentially costly and prone to error; classes may not be known; good features are not known

    Parametric solution: assume that $p(x \mid \omega_i)$ has a particular parametric form. Most common representative: the multivariate normal density.

  • 41/49

    The most important probability distribution, since random variables in physical experiments (including speech signals) often have distributions that are approximately Gaussian: the normal distribution.

    X has a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ if X has a continuous pdf of the form:

    $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

    [Plots: three Gaussian distributions with the same mean but different variances $\sigma$; three binomial distributions with different probabilities p, where $E(X) = np$ and $\mathrm{Var}(X) = np(1-p)$; three Poisson distributions with different $\lambda$, where $E(X) = \mathrm{Var}(X) = \lambda$]

  • 42/49

    Mixtures of Gaussian Densities

    Often the shape of the set of vectors that belong to one class does not look like something that can be modeled by a single Gaussian. A (weighted) sum of Gaussians can approximate many more densities.

    In general, a class can be modeled as a mixture of Gaussians:

    $p(x \mid \omega_i) = \sum_k c_{ik}\, N(x; \mu_{ik}, \Sigma_{ik}), \qquad \sum_k c_{ik} = 1$
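A sketch of evaluating such a mixture density for one class, assuming NumPy and diagonal covariances; the weights, means, and variances would come from training (e.g. EM), which is not shown here.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log of a d-dimensional Gaussian with diagonal covariance 'var'."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_mixture(x, weights, means, variances):
    """log p(x | class) for a weighted sum of Gaussians (weights sum to 1)."""
    comps = [np.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    return np.logaddexp.reduce(comps)     # numerically stable log of the sum of exponentials
```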

  • 43/49

    Gaussian Densities

    The most often used model for (preprocessed) speech signals are Gaussian densities. Often the "size" of the parameter space is measured in the "number of densities". A multivariate Gaussian density looks like this:

    $N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$

    Its parameters are: the mean vector $\mu$ (a vector with d coefficients) and the covariance matrix $\Sigma$ (a symmetric d x d matrix, diagonal if the components of X are independent); $|\Sigma|$ is the determinant of the covariance matrix.

  • 44/49

    Gaussian Classifier

    For each class $\omega_i$, we need to estimate from the training data the mean vector $\mu_i$ and the covariance matrix $\Sigma_i$.

  • 45/49

    Estimation of Parameters

    MLE, Maximum Likelihood Estimation: find the set of parameters that maximizes the likelihood of generating the observed data.

    If $p(x \mid \theta)$ is assumed to be Gaussian, then $\theta$ is defined by the mean and the covariance matrix:

    $\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k$

    $\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^t$
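A sketch of these ML estimates together with the resulting class discriminant $g_i(x) = \log p(x \mid \omega_i) + \log P(\omega_i)$ from the earlier slide, assuming NumPy and full covariances; the small regularization term added to keep the covariance invertible is an illustrative choice.

```python
import numpy as np

def mle_gaussian(X):
    """ML estimates: mean = (1/n) sum x_k, covariance = (1/n) sum (x_k - mean)(x_k - mean)^T."""
    mean = X.mean(axis=0)
    diff = X - mean
    cov = diff.T @ diff / len(X)
    return mean, cov

def log_discriminant(x, mean, cov, prior, eps=1e-6):
    """g_i(x) = log p(x | w_i) + log P(w_i) for a multivariate Gaussian class model."""
    d = len(mean)
    cov = cov + eps * np.eye(d)                     # keep the covariance invertible
    diff = x - mean
    log_det = np.linalg.slogdet(cov)[1]
    log_lik = -0.5 * (d * np.log(2 * np.pi) + log_det
                      + diff @ np.linalg.solve(cov, diff))
    return log_lik + np.log(prior)

# Classify x by the largest g_i(x) over the classes i.
```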

  • 46/49

    Problems of Classifier Design

    Features: What features, and how many, should be selected? Any features? Is more always better? If additional features are not useful (same mean and covariance), will the classifier automatically ignore them?

  • 47/49

    Curse of Dimensionality

    Adding more features: adding independent features may help, BUT adding non-discriminative features may lead to worse performance!

    Reason: training data vs. number of parameters; the training data is limited.

    Solution: select features carefully; reduce dimensionality, e.g. with Principal Component Analysis.
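A minimal PCA sketch for reducing feature dimensionality via the eigendecomposition of the sample covariance, assuming NumPy; the number of retained components is an illustrative choice.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n x d) onto its n_components directions of largest variance."""
    Xc = X - X.mean(axis=0)                       # center the data
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
    W = eigvecs[:, order[:n_components]]
    return Xc @ W, W                              # reduced features and projection matrix
```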

  • 48/49

    Trainability

    Two-phoneme classification example (Huang et al.): phonemes are modeled by Gaussian mixtures, and the parameters are trained with a varied set of training samples.

  • 49/49

    Problems

    [Plot: a density f(x) over x that is not well modeled by a normal distribution]

    The normal distribution does not model this situation well, and other densities may be mathematically intractable.

    → non-parametric techniques
