Speech Analysis and Cognition Using Category-Dependent Features in a Model of the Central Auditory System Woojay Jeon Research Advisor: Fred Juang School of Electrical and Computer Engineering Georgia Institute of Technology October 8, 2006



Page 1

Speech Analysis and Cognition Using Category-Dependent Features in a Model of the Central Auditory System

Woojay Jeon

Research Advisor: Fred Juang

School of Electrical and Computer EngineeringGeorgia Institute of Technology

October 8, 2006

Page 2

Synopsis of Project

One of the very few attempts, if any, to address auditory modeling beyond the periphery (ear, cochlea, even the auditory nerve fibers) for ASR;

Implemented a model (periphery + 3D cortical model) to calculate cortical response to stimuli;

Investigated cortical representations in ASR; conducted a comprehensive comparative study to understand robustness in auditory representations;

Developed a methodology to analyze robustness based on matched filter theory;

Spawned a new development based on category dependent feature selection and hierarchical pattern recognition.

Page 3

Matched Filtering

Cortical response:

  r(θ) = ∫_{R(θ)} p(y) w(y; θ) dy

where
– p(y): power (auditory) spectrum
– w(y; θ): response area of the neuron at θ = (x, s, φ)
– r(θ): cortical response
– R(θ): non-zero frequency range of w(y; θ)

The Cauchy–Schwarz inequality tells us that r(θ)² will be maximized (for a fixed signal energy over R(θ)) when

  p(y) ∝ w(y; θ) over R(θ)

If R(θ) includes enough spectral peaks, we can also use the spectral envelope v(y) in place of p(y).
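This matched-filter property can be checked numerically. The sketch below is illustrative only: the Gabor-shaped response area, the tonotopic axis, and all constants are made-up assumptions, not the model's actual parameters. It verifies that among unit-energy spectra, the one proportional to w(y; θ) maximizes r(θ)².

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.linspace(0.0, 1.0, 256)          # hypothetical tonotopic (log-frequency) axis

# Hypothetical Gabor-like response area w(y; theta) for one cortical neuron.
w = np.exp(-((y - 0.5) ** 2) / 0.005) * np.cos(2 * np.pi * 8 * (y - 0.5))

def response(p, w):
    """Inner-product cortical response r(theta) = sum_y p(y) w(y; theta)."""
    return np.dot(p, w)

# Matched spectrum: p proportional to w, normalized to unit energy.
p_matched = w / np.linalg.norm(w)

# Random unit-energy spectra respond strictly less (Cauchy-Schwarz).
for _ in range(100):
    p_rand = rng.standard_normal(y.size)
    p_rand /= np.linalg.norm(p_rand)
    assert response(p_rand, w) ** 2 <= response(p_matched, w) ** 2 + 1e-12
```

The loop is a Monte-Carlo confirmation of the inequality; equality is attained only when the spectrum matches the response area's shape.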

Page 4

Signal-Respondent Neurons

[Figure: (left) locations a–d of four signal-respondent neurons on the cortical plane, frequency (0.25–4.0 kHz) vs. scale (0.5–4.0 cyc/oct); (right) panels (a)–(d) showing each neuron's response area over 0.25–4.0 kHz. All points differ in phase.]

Page 5

Noise-Respondent Neurons

[Figure: locations a and b of two noise-respondent neurons on the frequency (0.25–4.0 kHz) vs. scale (0.5–4.0 cyc/oct) plane, with panels (a) and (b) showing their response areas over 0.25–4.0 kHz. All points differ in phase.]

Page 6

Noise Robustness

Assuming a conventional Fourier power spectrum with stationary white noise as the distortion, it can be shown mathematically that:

  S_r,θ ≥ S_p,θ ≥ S_r,θ′

where
– S_r,θ: SNR of a signal-respondent neuron
– S_p,θ: SNR of the auditory spectrum in R(θ)
– S_r,θ′: SNR of a noise-respondent neuron, with R(θ′) = R(θ)

Modeling inhibition can increase S_r,θ even more.
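The SNR advantage of a matched (signal-respondent) neuron over the raw auditory spectrum can be illustrated with a small Monte-Carlo sketch. Everything here is a made-up assumption for illustration: the formant-like spectrum, the noise level, and the threshold used to delimit the region R.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 128
y = np.arange(N)
# Hypothetical clean auditory spectrum with a single formant-like peak.
s = np.exp(-((y - 40) ** 2) / 50.0)

w = s / np.linalg.norm(s)   # matched (signal-respondent) response area
sigma = 0.5                 # stationary white-noise std per channel (assumed)

def snr_db(signal_power, noise_power):
    return 10 * np.log10(signal_power / noise_power)

# Clean response and Monte-Carlo noise power at the neuron's output.
r_clean = np.dot(s, w)
noisy_outputs = np.dot(rng.standard_normal((5000, N)) * sigma, w)
S_r = snr_db(r_clean ** 2, noisy_outputs.var())

# Average per-channel SNR of the auditory spectrum itself over the peak region R.
R = s > 0.1 * s.max()
S_p = snr_db((s[R] ** 2).mean(), sigma ** 2)

assert S_r > S_p   # integrating over R with a matched filter improves SNR
```

The gap S_r − S_p grows with the number of channels the response area coherently integrates, which is the intuition behind the inequality above.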

Page 7

Noise Robustness Experiments

Sr(Ai): overall SNR of the signal-respondent neurons of phoneme wi
Sr(U): overall SNR of the entire cortical response
Sp: overall SNR of the auditory spectrum

[Figure: SNR ratios plotted against noise level (20 down to 0 dB) for the vowel /iy/, the fricative /dh/, the affricate /jh/, and the plosive /p/.]

Page 8

Category-Dependent Feature Selection

LVF: Low Variance Filter; HAF: High Activation Filter; NR: Neuron Reduction (via Clustering and Remapping); PCA: Principal Component Analysis

[Diagram: Speech Signal → Early Auditory Model → Auditory Spectrum → Central Auditory Model (A1) → Cortical Response → M parallel category-dependent streams LVFm → HAFm → NRm → PCAm → xm, for m = 1, …, M.]

Page 9

Hierarchical Classification

Single-Layer Classifier: uses standard Bayesian decision theory to classify a test observation into 1 of N classes using class-wise discriminants that estimate the a posteriori probabilities:

  ŵ = w_i, i = argmax_j g_j(x), where g_j(x) ≈ p(w_j | x)

Hierarchical Classifier (Two-Layer Classifier): a two-stage process that first classifies a test observation into 1 of M categories, then into 1 of |Cn| classes:

  Ĉ = C_n, n = argmax_m f_m(x), where f_m(x) ≈ p(C_m | x)
  ŵ = w_i, i = argmax_{j: w_j ∈ C_n} h_j(x), where h_j(x) ≈ p(w_j | x, C_n)
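The two decision rules can be sketched as follows. The categorization, the posterior values, and the use of plain arrays for the discriminants g, f, h are all made up for illustration; in the actual system these scores come from HMMs.

```python
import numpy as np

# Hypothetical categorization of N = 6 classes into M = 2 categories.
categories = {0: [0, 1, 2], 1: [3, 4, 5]}   # C_m -> member class indices

def single_layer(g):
    """Bayes decision over all N classes: argmax_j g_j(x)."""
    return int(np.argmax(g))

def two_layer(f, h):
    """First pick category n = argmax_m f_m(x), then the best class in C_n."""
    n = int(np.argmax(f))
    members = categories[n]
    return members[int(np.argmax(h[members]))]

# Toy posterior estimates for one observation x (made up for illustration).
g = np.array([0.05, 0.10, 0.05, 0.15, 0.45, 0.20])   # class discriminants
f = np.array([0.20, 0.80])                            # category discriminants
h = g                                                 # within-category scores

assert single_layer(g) == 4
assert two_layer(f, h) == 4
```

With exact posteriors the two rules agree; the benefit of the two-layer rule comes from estimating h with category-dependent features, which the single-layer rule cannot exploit.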

Page 10

Searching for a Categorization

[Diagram: phoneme class-wise variances → N ordered sets of phonemes → (CART-style splitting) → N phoneme trees → N lists of candidate categorizations → (combination, performance-based search) → final categorization.]

The phoneme-wise variances are arranged into N orderings (each ordering with a different “seed” phoneme).

For each ordering, a CART-style splitting routine is applied to create a “phoneme tree,” from which a list of candidate categorizations is obtained.

We search for the categorization with the best hierarchical classification performance over the training data (using initial models).
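The splitting step can be sketched as below. This is a simplified stand-in for the actual CART routine: the phoneme ordering, the scalar "variance" values, and the split criterion (minimizing summed within-group variance) are assumptions for illustration.

```python
import statistics

def best_split(vals):
    """Return the split index minimizing the summed within-group variance."""
    def var(g):
        return statistics.pvariance(g) if len(g) > 1 else 0.0
    return min(range(1, len(vals)),
               key=lambda i: var(vals[:i]) + var(vals[i:]))

def build_tree(phonemes, vals, depth):
    """Recursively split an ordered phoneme list into a shallow phoneme tree;
    the leaves at a given depth form one candidate categorization."""
    if depth == 0 or len(vals) < 2:
        return [phonemes]
    i = best_split(vals)
    return (build_tree(phonemes[:i], vals[:i], depth - 1)
            + build_tree(phonemes[i:], vals[i:], depth - 1))

# Toy ordering of 6 phonemes by a made-up class-wise variance measure.
phons = ["iy", "ih", "eh", "p", "t", "k"]
variances = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2]

candidate = build_tree(phons, variances, depth=1)
assert candidate == [["iy", "ih", "eh"], ["p", "t", "k"]]
```

Running the routine from N different seed orderings yields N trees, and the candidate categorizations read off from each tree are then compared by hierarchical classification performance on the training data.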

Page 11

Model Training

[Diagram: training data yields category-independent (CI) features and category-dependent (CD) features. CI features → initial category models (mixed HMM) via ML estimation (Baum–Welch) with uniform weights applied → refined category models (mixed HMM) via MCE training. The categorization and CD features → initial class models (HMM) via ML estimation (Baum–Welch) → category-dependent class models (HMM) → refined category-dependent class models (HMM) via within-category MCE training.]

CI features are used to construct category models, which are refined with MCE training:

  f_m(x) = Σ_{i: w_i ∈ C_m} g_i(x) p(w_i | C_m),  with  Σ_{i: w_i ∈ C_m} p(w_i | C_m) = 1
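As a small illustration of this mixed category discriminant, the sketch below forms f_m(x) as a weighted sum of within-category class scores under uniform initial weights. The class scores and the three-member category are made-up toy values, not outputs of the actual HMMs.

```python
import numpy as np

# Hypothetical per-class likelihood scores g_i(x) for a category C_m = {w_0, w_1, w_2}.
g = np.array([0.2, 0.5, 0.3])

# Uniform within-category weights p(w_i | C_m), as in the initial mixed HMM.
p_w_given_C = np.full(3, 1.0 / 3.0)
assert np.isclose(p_w_given_C.sum(), 1.0)   # weights must sum to one

# Category discriminant: f_m(x) = sum_{w_i in C_m} g_i(x) p(w_i | C_m)
f_m = float(np.dot(g, p_w_given_C))
assert np.isclose(f_m, (0.2 + 0.5 + 0.3) / 3.0)
```

MCE training would subsequently move the weights away from uniform to reduce category-level classification error.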

Page 12

Hierarchical Classification

[Diagram: test data X → category-independent feature selection → category decision using the category models (mixed HMM):

  n = argmax_m f(X; Θ_m, α_m)  →  category C_n

then category-dependent feature selection → X_n → within-category class decision using the within-category class models (HMM):

  i = argmax_{j: w_j ∈ C_n} h(X_n; Θ_j)  →  class w_i ]

Page 13

Phoneme Categorization

[Table: the phoneme categorization used in the experiments.]

Page 14

Phoneme Classification Results

Classification rates (%) for clean speech in TIMIT database (48 phonemes)

Classification rates (%) for varying SNR, features, and classifier configurations (*74.51 when 48 phonemes are mapped down to 39 according to convention)

SL: Single-Layer Classifier; CI: Category-Independent Features; CD: Category-Dependent Features; TL: Two-Layer (Hierarchical) Classifier (results produced after MCE training)

Page 15

Generalization of the MCE Method

Qiang Fu

Research Advisor: Fred JuangSchool of Electrical and Computer Engineering

Georgia Institute of TechnologyOctober 8, 2006

Page 16

Synopsis

Excellent detector results (6-class, 14-class, 44-class) reported; use of detector results as "independent" information for rescoring.

Generalization of the minimum error principle to large-vocabulary continuous speech recognition:
– Definition of competing events
– Selection of training units (state, phone, …)
– Use of word graphs
– Unequal error weights.

Page 17

Rescoring Using MVE Detectors

We investigate the effects of combining the conventional ASR paradigm and the phonetic class detectors using MVE training.

We keep the segmentation information from the Viterbi decoder, which may affect the final improvement.

The rescoring algorithm can be made flexible in order to fit different tasks.

Page 18

Minimum Verification Error

Assume there are M classes and K training tokens. A token labeled in the ith class may generate one type I (missing) error and M−1 type II (false alarm) errors. Hence, the misverification measures related to these two types of error are:

  d_I^i(o)  = −[ g(o | λ_tgt^i) − g(o | λ_anti^i) ]
  d_II^j(o) =    g(o | λ_tgt^j) − g(o | λ_anti^j),   j = 1, 2, …, M,  j ≠ i

and the per-token loss is

  l_total^i(o) = k_I · l(d_I^i(o)) + Σ_{j ≠ i} k_II · l(d_II^j(o))

The overall performance objective then becomes

  L_total = (1/K) Σ_{k=1}^{K} Σ_{i=1}^{M} 1(o_k ∈ class i) · l_total^i(o_k)

In the above, 1 is the indicator function, l is a sigmoid function, and k_I and k_II are penalty weights for missing and false alarm errors. A descent algorithm is then applied to minimize the overall error objective.
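The per-token objective can be sketched as below. Since the slide's equations had to be reconstructed, treat the sign conventions for d_I and d_II, the toy scores, and the unit penalty weights as assumptions for illustration.

```python
import math

def sigmoid(d, alpha=1.0):
    """Smooth 0/1 error count l(d) = 1 / (1 + exp(-alpha * d))."""
    return 1.0 / (1.0 + math.exp(-alpha * d))

def mve_token_loss(scores_tgt, scores_anti, i, k_I=1.0, k_II=1.0):
    """Type I (miss) + type II (false alarm) loss for a class-i token.

    d_I^i  = -[g(o|tgt_i) - g(o|anti_i)] : positive when the true detector
                                           wrongly rejects its own token
    d_II^j =  [g(o|tgt_j) - g(o|anti_j)] : positive when detector j != i
                                           falsely accepts the token
    """
    d_I = -(scores_tgt[i] - scores_anti[i])
    loss = k_I * sigmoid(d_I)
    for j in range(len(scores_tgt)):
        if j != i:
            d_II = scores_tgt[j] - scores_anti[j]
            loss += k_II * sigmoid(d_II)
    return loss

# Toy log-likelihoods for M = 3 detectors on one class-0 token (made up).
tgt = [2.0, -1.0, -0.5]
anti = [0.0, 0.5, 0.5]
loss = mve_token_loss(tgt, anti, i=0)
assert 0.0 < loss < 3.0   # bounded by k_I + (M - 1) * k_II
```

Averaging this loss over all K tokens gives the overall objective, which is differentiable and hence amenable to gradient descent on the model parameters.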

Page 19

Rescoring Paradigm

[Diagram: speech signals → conventional decoder → decoding scores and rescoring candidates; in parallel, MVE detectors 1…M produce detector scores; a rescoring algorithm combines decoding and detector scores under Neyman–Pearson decision criteria and thresholds.]

Page 20

Rescoring Methods (I)

Suppose there are M classes of sub-word units. Hence there are M sets of detectors accordingly, each of which consists of a target model and an anti-model. For a segment that is decoded as the ith class with log likelihood S_decode^(i), its jth (j = 1, 2, …, M) detector scores are S_tgt^(j) and S_anti^(j), respectively. Namely, the log likelihood ratio for the jth detector is

  ratio^(j) = S_tgt^(j) − S_anti^(j)

We call the score for the test segment belonging to class i after combination S_new^(i).

Method 1: Naive Adding (NA)

We simply add the decoder score and the detector score together:

  S_new^(i) = S_decode^(i) − S_anti^(i) + ratio^(i)

The reason for subtracting the anti-model score is to scale the decoding score into a relatively close dynamic range with the likelihood ratio. This procedure is also taken in the following two methods.

Page 21

Rescoring Methods (II)

Method 2: Competitive Rescoring (CR)

We add the decoder score and a "competitive" score together, which is a "distance measure" between the claimed class and its competitors:

  S_C^(i) = ratio^(i) − (1/η) log{ (1/(M−1)) Σ_{j ≠ i} exp(η · ratio^(j)) }

  S_new^(i) = S_decode^(i) − S_anti^(i) + S_C^(i)

Method 3: Remodeled Posterior Probability (RPP)

We compute the "remodeled posterior probability":

  S_new^(i) = exp(S_decode^(i) − S_anti^(i) + ratio^(i)) / Σ_{j=1}^{M} exp(S_decode^(i) − S_anti^(j) + ratio^(j))
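The three combination rules can be sketched as below. Because the slide formulas are reconstructed from garbled text, the exact expressions (in particular CR's soft-max competitor term and RPP's normalization) should be read as assumptions rather than the authors' definitive formulations; all score values are made up.

```python
import math

def ratios(S_tgt, S_anti):
    """Per-detector log likelihood ratios ratio^(j) = S_tgt^(j) - S_anti^(j)."""
    return [t - a for t, a in zip(S_tgt, S_anti)]

def rescore_na(S_decode, S_tgt, S_anti, i):
    """Method 1, Naive Adding: S_new = S_decode - S_anti^(i) + ratio^(i)."""
    r = ratios(S_tgt, S_anti)
    return S_decode - S_anti[i] + r[i]

def rescore_cr(S_decode, S_tgt, S_anti, i, eta=1.0):
    """Method 2, Competitive Rescoring: add a soft-max distance between the
    claimed class and its competitors to the scaled decoder score."""
    r = ratios(S_tgt, S_anti)
    M = len(r)
    comp = (1.0 / eta) * math.log(
        sum(math.exp(eta * r[j]) for j in range(M) if j != i) / (M - 1))
    return S_decode - S_anti[i] + (r[i] - comp)

def rescore_rpp(S_decode, S_tgt, S_anti, i):
    """Method 3, Remodeled Posterior Probability (softmax over detectors)."""
    r = ratios(S_tgt, S_anti)
    num = math.exp(S_decode - S_anti[i] + r[i])
    den = sum(math.exp(S_decode - S_anti[j] + r[j]) for j in range(len(r)))
    return num / den

# Toy scores for a segment decoded as class 0 among M = 3 (made up).
S_decode, S_tgt, S_anti = -10.0, [1.5, -0.5, -1.0], [0.2, 0.3, 0.1]
assert abs(rescore_na(S_decode, S_tgt, S_anti, 0) - (-8.9)) < 1e-9
assert 0.0 < rescore_rpp(S_decode, S_tgt, S_anti, 0) < 1.0
```

Note that under this reconstruction the RPP scores over all M classes sum to one, which is what makes them behave like a posterior distribution for re-ranking.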

Page 22

Experiment Setup

• Experiments are conducted on the TIMIT database (3696 training utterances and 1344 test utterances; there are 119,580 training tokens for the MVE detectors) using three-state HMMs.

• Rescoring candidates are generated using HVite. The decoder model is trained by the Maximum Likelihood (ML) method, and the detectors are trained by MVE. Performance is examined on 6-class (Rabiner and Juang, 1993), 14-class (Deller et al., 1999), and 48-class (Lee and Hon, 1989) broad phonetic categories, respectively.

• The models for both decoder and detectors are trained on 39-dimensional MFCC feature vectors (12 MFCC + 12 delta + 12 acceleration + 3 log energy).

Page 23

Rescoring performance

Class      Acc (%)            Method 1 (NA)   Method 2 (CR)   Method 3 (RPP)
6-class    Baseline           75.44           75.44           75.44
           Upper bound        80.84           80.84           80.84
           Rescored           76.36           76.38           80.00
           Relative           17.04           17.40           80.44
14-class   Baseline           63.61           63.61           63.61
           Upper bound        70.85           70.85           70.85
           Rescored           65.38           67.27           68.45
           Relative           24.45           50.55           61.88
48-class   Baseline           55.33           55.33           55.33
           Upper bound        62.08           62.08           62.08
           Rescored           55.04           55.61           55.91
           Relative           -4.30           4.15            8.59

(The 48-class results indicate the need to perform phone or word rescoring.)

Page 24

Conclusions and Future Work

Three different rescoring methods are introduced, and the experimental results show that creating a pseudo-phone graph and re-computing the posterior probability achieves the best performance enhancement;

MVE-trained detectors show promising results in helping conventional ASR techniques. The detectors can be optimized in terms of features or attributes (e.g., features representing articulatory knowledge), and used for re-ranking the decoded candidates;

Bottom-up event detection and information fusion will be conducted on continuous speech signals in the future.

Page 25

MCE Generalization

MCE criterion formulation:
1. Define the performance objective and the corresponding task evaluation measure;
2. Specify the target event (i.e., the correct label), the competing events (i.e., the incorrect hypotheses from the recognizer), and the corresponding models;
3. Construct the objective function and set hyper-parameters;
4. Choose a suitable optimization method to update the parameters.

In this presentation, only the first step, which is also the most fundamental one, is discussed due to limited space. This work is the first part of an extensive generalization of the MCE training criterion.

Page 26

Strict Boundary and Relaxed Boundary

[Diagram: two word graphs for a labeled word A between start and end, each showing the target word A against competing words A, B, A; the two graphs illustrate the strict-boundary and relaxed-boundary definitions of competing events.]

Page 27

Experimental Setup

Experiments are conducted on the WSJ0 database (7077 training utterances and 330 test utterances);

All models are three-state HMMs with 8 Gaussian mixtures per state. There are 7385 physical models, 19075 logical models, and 2329 tied states in total;

The models are constructed on 39-dimensional MFCC feature vectors (12 MFCC + 12 delta + 12 acceleration + 3 log energy);

The baseline recognizer basically follows the large-vocabulary continuous speech recognition recipe using HTK;

We investigated three cases of maximizing the GPP at different training levels (word, phone, state).

Page 28

Results

Table 1: Word Error Rate (WER) and Sentence Error Rate (SER) for WSJ0-eval using different training levels

Training level   WER (%)   SER (%)
Baseline         8.41      57.88
Word-level       8.05      56.97
Phone-level      7.96      56.67
State-level      8.02      56.97

Page 29

Conclusion & Future Work

We generalize the criterion for minimum classification error (MCE) training and investigate its impact on recognition performance. This work is the first part of an extensive generalization of the MCE training;

The experiments are conducted within the framework of "maximizing the posterior probability". The impact of different training levels is investigated, and the phone level gives the best performance;

Further investigation of various tasks based on this generalized framework is in progress.