
Applications of Large Vocabulary Continuous Speech Recognition for Fatigue Detection

Sridhar Raghavan
Dept. of Electrical and Computer Engineering
Mississippi State University
URL: http://www.cavs.msstate.edu/hse/ies/publications/books/msstate_theses/2006/fatigue_detection/

Page 2 of 30: Applications of Large Vocabulary Continuous Speech Recognition for Fatigue Detection

Abstract

• Goal of the thesis: To automate the task of fatigue detection using voice.

• Problem statement: Determine a suitable technique to enable automatic fatigue detection using voice, and make the system robust to out-of-vocabulary words, which directly influence fatigue detection performance.

• Hypothesis: An LVCSR system can be used for detecting fatigue from voice.

• Results: Using confidence measures, the robustness of the fatigue detection system to out-of-vocabulary words improved by 20%.


Motivation

Applications of speech recognition have grown from simple speech-to-text conversion to other more challenging tasks. An ASR system can be used as a core, and various applications can be built using the same technology. This thesis explores the application of speech recognition to the task of fatigue detection from voice. The ISIP ASR toolkit was used as the core speech engine for this thesis.

[Diagram: a core speech engine supporting applications such as speech recognition, speaker verification, word spotting, automatic call routing, automatic language detection, automobile telematics, speech therapy systems, and fatigue detection, the focus of this thesis.]


Signs of Fatigue in Human Speech

Creare Inc. provided voice data recorded from subjects who were induced with fatigue by sleep deprivation. The spectrograms of a subject before and after fatigue induction are shown below.

Non-fatigued speaker saying “papa”

Fatigued speaker saying “papa”


Changes in Human Voice Production System due to Fatigue

From the literature it is known that fatigue causes temporal and spectral variations in the speech signal. The spectral variation can be attributed to a change in the human sound production system, while the temporal variation is controlled by the brain and its explanation is beyond the scope of this thesis.

Effects on human sound production system:

• Yielding walls: The vocal tract walls are not rigid, and an increase in the vibration of the walls of the vocal tract causes a slight increase in the lower-order formants.

• Viscosity and thermal loss: This loss arises from friction between the air and the walls of the vocal tract. It causes a slight upward shift in the formant frequencies that exist beyond 3-4 kHz.

• Lip radiation: For an ideal vocal tract model the lip radiation loss is ignored, but it is known from literature that the lip radiation loss causes a slight decrease in the formants. This effect is more pronounced for higher order formants.


Fatigue Detection using a Speaker Verification System

Speaker Verification: A speaker verification system can be used to model the long-term speech characteristics of a speaker. The system builds a model for each individual speaker. Verification is performed by computing a likelihood (defined as the conditional probability of the acoustic data given the speaker model).

Fatigue detection using the speaker verification system was conducted as follows:

1) Models were trained on data that was collected during the initial stage of the recordings.

2) These models were used for testing. There were six recording stages evenly spread over a duration of thirty-six hours.
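The likelihood test above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: it assumes a single diagonal-covariance Gaussian per speaker and uses made-up two-dimensional "MFCC" frames.

```python
import math

def log_likelihood(frames, model):
    """Average per-frame log-likelihood of the acoustic frames under a
    diagonal-covariance Gaussian speaker model (means, variances)."""
    means, variances = model
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, means, variances):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)

# Hypothetical speaker model trained on stage-1 (non-fatigued) recordings.
model = ([0.0, 1.0], [1.0, 1.0])
# Hypothetical frames from a later recording stage.
frames = [[0.1, 0.9], [-0.2, 1.1]]
score = log_likelihood(frames, model)  # higher score = closer to the model
```

Fatigue detection would then look for a change in this score across the six recording stages; as the pilot experiment below shows, no significant difference was observed in practice.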


Results from Pilot Speaker Verification Experiments

Distribution of the likelihood scores of fatigued and non-fatigued speakers

[Block diagram: MFCC input features and the active speaker model feed the speaker verification module, which outputs a likelihood score for an active or fatigued utterance from the speaker.]

Observation: No significant difference in the likelihood scores was observed.

One probable reason for the poor performance is that not all sounds in human speech are affected by fatigue in the same manner.


Effect of Fatigue on Different Phonemes

Greeley et al. found that not all phonemes are affected equally by fatigue. Certain phonemes showed more variations due to fatigue than others.



Fatigue Detection using a Word Spotter

Word Spotting system: A word spotting system determines the presence of words of interest in a speech file. Such a system was built using the ISIP ASR system as follows:

1. Labeled training data was used to train the acoustic models.
2. A garbage model is built by labeling all the words in the transcription with the same token. The garbage model is used as a substitute for any word other than the keyword in the final hypothesis.
3. The grammar of the recognizer was changed based on what words had to be spotted.

[Block diagram: the input utterance passes through the word spotter with a loop grammar; the spotted word alignments feed the fatigue detection module, which outputs a measure of fatigue.]

The problem with this system was that it generated a high percentage of false hypotheses, and this affected fatigue detection performance.


An LVCSR Approach to Detect Fatigue

An ASR system trained on the Creare data provided reasonably accurate phonetic alignments; a WER of 11% was obtained. These alignments were used by the fatigue detection software to extract the MFCC vectors corresponding to specific sounds.

[Block diagram: the speech signal passes through feature extraction and the ASR system; the output hypothesis feeds the fatigue detection system, which produces the fatigue prediction output.]

Advantages of using this approach:

1. Unlike the speaker verification technique, this approach does not require fatigue-dependent data for training the ASR system.
2. The grammar of the ASR is fixed, unlike the word spotting technique.

But the problem of false alarms still exists, especially when there are out-of-vocabulary words, and this problem is tackled by the use of confidence measures.


Generating a Baseline ASR System

Grammar tuning experiments on Phase II data using 16-mixture cross-word triphones:

Grammar Type                                            WER %
Sentence level grammar (Fixed phrases)                    0
Sentence level grammar (Fixed + Spontaneous phrases)     34
Word level grammar (Fixed phrases)                       60
Word level grammar (Fixed + Spontaneous phrases)         82
Bigram model (Fixed phrases)                             52.4
Bigram model (Fixed + Spontaneous phrases)               74.5

Model selection experiments on FAA data:

Model Type               WER %
Word                     63.9
Monophone                54.3
Cross-word triphone      47.3

Mixture selection experiments on FAA data:

No. of Mixtures          WER %
1                        47.3
2                        36.3
4                        23.6
8                        11.3
16                       11.3

• Phase II data: This data was recorded during a three-day military exercise using a PDA (Personal Digital Assistant). The data was very noisy and contained a lot of disfluencies. There were 8 fixed phrases and one spontaneous phrase for each of the 21 speakers.

• FAA data: This consisted of 30 words spoken in a studio environment over a three-day period. The speakers read fixed text.


Results from state-tying experiments

The performance of the system was improved further by tuning the state tying parameters. This was possible since cross-word triphone models were used as the fundamental acoustic model.

It was observed that increasing the number of tied states improved the WER, but made the models specific to the training data. Hence an optimum value was chosen by observing the WER on open-loop experiments.

Closed loop state tying experiments

Open loop state tying experiments


Problem due to False Alarms

• The output of an ASR system had some errors. The errors are classified as substitutions, insertions, and deletions.

• The performance of the fatigue prediction system relied on the accuracy of the phonetic alignment.

• It did not matter if the ASR misrecognized the required words, but it did matter when there were false alarms.

Reference transcription: KEEP THE POT INSIDE THE OVEN AND WAIT FOR FIFTEEN MINUTES.

ASR output: HEAT THE POT INSIDE THE OPEN AND POT FOR FIFTEEN MINUTES.

• "HEAT" for "KEEP": not a major problem, since it will not be considered for fatigue analysis in any case.

• Correct hypotheses: no problem; their alignments will be used for fatigue detection.

• "POT" for "WAIT": a very serious problem, since a totally different MFCC vector set, corresponding to the word "WAIT", will be analyzed assuming it is the word "POT".
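The three error types can be counted with a standard minimum-edit-distance alignment between the reference and the hypothesis. The sketch below is illustrative, not the scoring tool used in the thesis.

```python
def align(ref, hyp):
    """Align two word lists by dynamic programming and count
    (substitutions, insertions, deletions)."""
    R, H = len(ref), len(hyp)
    cost = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        cost[i][0] = i                      # deleting all of ref[:i]
    for j in range(H + 1):
        cost[0][j] = j                      # inserting all of hyp[:j]
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost[i][j] = min(cost[i-1][j-1] + (ref[i-1] != hyp[j-1]),
                             cost[i-1][j] + 1,       # deletion
                             cost[i][j-1] + 1)       # insertion
    # Backtrace and classify each edit.
    subs = ins = dels = 0
    i, j = R, H
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            subs += ref[i-1] != hyp[j-1]
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i-1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, ins, dels

ref = "KEEP THE POT INSIDE THE OVEN AND WAIT FOR FIFTEEN MINUTES".split()
hyp = "HEAT THE POT INSIDE THE OPEN AND POT FOR FIFTEEN MINUTES".split()
errors = align(ref, hyp)  # KEEP/HEAT, OVEN/OPEN and WAIT/POT are substitutions
```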


Using Confidence Measures

• By using a confidence metric we could prune away less likely words that constituted false alarms.
• The first choice for a confidence metric was the likelihood, or acoustic score, of every word in the hypotheses.
• Observation of the likelihood scores revealed that there was no clear trend that could be useful for pattern classification.
• An example of such an observation is shown below.


Word Posteriors as Confidence Measures

A word posterior can be defined as "the sum of the posterior probabilities of all word graph paths of which the word is a part."

Word posteriors can be computed in two ways:

1. N-best lists
2. Word graphs

In this thesis the word posteriors were computed from word graphs, as they are a much better representation of the search space than N-best lists.

Example of a word graph


Computing Word Posteriors from Word Graphs

Restating the word posterior definition for clarity:

A word posterior can be defined as "the sum of the posterior probabilities of all word graph paths of which the word is a part." What does this mean mathematically?

P(w; t_b, t_e | x_1^T) = Σ_{w_b} Σ_{w_e} P(w_b, w, w_e | x_1^T)

where:
• w is a single word
• t_b, t_e are the start and end times of the word
• x_1^T is the sequence of acoustic vectors from time 1 to T
• w_b denotes all word sequences preceding w
• w_e denotes all word sequences succeeding w


Computing Word Posteriors from Word Graphs

We cannot compute the posterior probability directly, so we decompose it into a likelihood and priors using Bayes' rule.

P(w_b, w, w_e | x_1^T) = p(x_1^T | w_b, w, w_e) · P(w_b, w, w_e) / p(x_1^T)

where:
• p(x_1^T | w_b, w, w_e) is the acoustic model probability
• P(w_b, w, w_e) is the language model probability
• p(x_1^T) = Σ_{w_b} Σ_{w} Σ_{w_e} p(x_1^T | w_b, w, w_e) · P(w_b, w, w_e)

The numerator is computed using the forward-backward algorithm; the denominator is simply a by-product of the forward-backward algorithm.

Consider a node N with six different ways to reach it and two different ways to leave it. We need to obtain the forward probability as well as the backward probability to determine the probability of passing through node N, and this is where the forward-backward algorithm comes into the picture.


Computing Word Posteriors from Word Graphs

A Toy Example

[Figure: a toy word graph for the utterance "This is a test sentence," with competing word hypotheses (this, is, the, a, test, guest, quest, sentence, sense, Sil) and link likelihoods of 1/6 to 5/6 on the arcs.]

The values on the links are the likelihoods. Some nodes are outlined with red to signify that they occur at the same time instant.


Computing Word Posteriors from Word Graphs

A forward-backward type algorithm will be used for determining the link probability.

Computing alphas or the forward probability:

Step 1: Initialization

In a conventional HMM forward-backward algorithm we would perform the following:

α_1(i) = π_i · b_i(X_1),  1 ≤ i ≤ N

where:
• π_i is the initial probability of state i
• b_i(X_1) is the emission probability of the observed data X_1 given we are in state i

A slightly modified version of the above equation will be used on a word graph. The emission probability will be the acoustic score.


Computing Word Posteriors from Word Graphs

The α for the first node is initialized to 1:

α_1 = 1

Step 2: Induction

α_t(j) = [ Σ_{i=1}^{N} α_{t-1}(i) · a_ij ] · b_j(X_t),  2 ≤ t ≤ T,  1 ≤ j ≤ N

where:
• a_ij is the transition probability from state i to state j
• b_j(X_t) is the emission probability of the observation X_t

The alpha values computed in the previous step are used to compute the alphas for the succeeding nodes.

Note: Unlike in HMMs, where we move from left to right at fixed intervals of time, here we move from one node to the next based on node indices, which are time aligned.


Computing Word Posteriors from Word Graphs

Let us see the computation of the alphas from node 2; the alpha for node 1 was initialized to 1 in the previous step.

Node 2: α_2 = 1 × (3/6) × 0.01 = 0.005

Node 3: α_3 = (1 × (3/6) × 0.01) + (0.005 × (3/6) × 0.01) = 0.005025

Node 4: α_4 = 0.005025 × (2/6) × 0.01 = 1.675E-05

The alpha calculation continues in this manner for all the remaining nodes.

[Figure: the first four nodes of the word graph, with link likelihoods 3/6, 3/6, 3/6, 2/6 and 4/6 on the arcs (Sil, Sil, this, is) and the computed forward probabilities α = 1, α = 0.005, α = 0.005025 and α = 1.675E-05.]
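The alpha recursion can be sketched in a few lines. This is a minimal illustration covering only the first four nodes of the example; it assumes the flat acoustic score of 0.01 per link used in the slides, and word labels are omitted.

```python
ACOUSTIC = 0.01  # flat per-link acoustic score assumed in the example

# Links of the sub-graph: (start node, end node, language-model score).
links = [(1, 2, 3/6),
         (1, 3, 3/6),
         (2, 3, 3/6),
         (3, 4, 2/6)]

alpha = {1: 1.0}  # the alpha of the first node is initialized to 1
for start, end, lm in links:  # links listed in time (node-index) order
    # Each node accumulates contributions from all incoming links.
    alpha[end] = alpha.get(end, 0.0) + alpha[start] * lm * ACOUSTIC
```

Running this reproduces the values computed above: alpha[2] = 0.005, alpha[3] = 0.005025 and alpha[4] = 1.675E-05.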


Computing Word Posteriors from Word Graphs

Once the alphas are computed using the forward algorithm, the betas are computed using the backward algorithm.

The backward algorithm is similar to the forward algorithm, but the computation starts from the last node and proceeds from right to left.

Step 1: Initialization

β_T(i) = 1/N,  1 ≤ i ≤ N

On the word graph, the initial value of β at the final node is 1.

Step 2: Induction

β_t(i) = Σ_{j=1}^{N} a_ij · b_j(X_{t+1}) · β_{t+1}(j),  t = T-1, …, 1,  1 ≤ i ≤ N

where:
• a_ij is the language model score
• b_j(X_{t+1}) is the acoustic score
• β_{t+1}(j) is the beta value of the nodes just succeeding the current node


Computing Word Posteriors from Word Graphs

Computation of the beta values backwards from node 14.

Node 14: β_14 = (1/6) × 1 × 0.01 = 0.001667

Node 13: β_13 = (5/6) × 1 × 0.01 = 0.00833

Node 12: β_12 = (4/6) × 0.01 × 0.00833 = 5.555E-05

[Figure: the last nodes of the word graph, with link likelihoods 1/6, 1/6, 4/6 and 5/6 on the arcs (sentence, sense, sentence, Sil) and the computed backward probabilities β = 1.66E-5, β = 5.55E-5, β = 0.00833, β = 0.001667 and β = 1.]


Computing Word Posteriors from Word Graphs

Node 11: β_11 = ((1/6) × 0.01 × 0.001667) + ((1/6) × 0.01 × 0.00833) = 1.666E-05

In a similar manner we obtain the beta values for all the nodes down to node 1. As a sanity check, the alpha at the last node should equal the beta at the first node.

The link probability is simply the product of the alpha at the link's preceding node and the beta at its succeeding node. Note that this value is not normalized; it was normalized by dividing it by the sum of the probabilities of all paths through the word graph.

The confidence measure was further strengthened by summing up the link probabilities of all similar words occurring within a particular time frame.
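Putting the pieces together, the sketch below runs the full alpha-beta computation and normalizes the link posteriors on a small hypothetical graph (not the slides' 15-node example); the flat acoustic score of 0.01 per link is again an assumption.

```python
ACOUSTIC = 0.01
# (start node, end node, word, language-model score): a hypothetical mini
# word graph of the form "sil -> (test | guest) -> sil".
links = [(1, 2, "sil", 3/6),
         (2, 3, "test", 2/6),
         (2, 3, "guest", 1/6),
         (3, 4, "sil", 5/6)]
FIRST, LAST = 1, 4

# Forward pass (links in time order).
alpha = {FIRST: 1.0}
for s, e, w, lm in links:
    alpha[e] = alpha.get(e, 0.0) + alpha[s] * lm * ACOUSTIC

# Backward pass (links in reverse time order).
beta = {LAST: 1.0}
for s, e, w, lm in reversed(links):
    beta[s] = beta.get(s, 0.0) + lm * ACOUSTIC * beta[e]

# Sanity check from the slides: alpha at the last node equals beta at the first.
total = alpha[LAST]
assert abs(total - beta[FIRST]) < 1e-15

# Link posterior: alpha at the preceding node x link score x beta at the
# succeeding node, normalized by the total probability of all paths.
posterior = {w: alpha[s] * lm * ACOUSTIC * beta[e] / total
             for s, e, w, lm in links}
```

Here posterior["test"] comes out to 2/3 and posterior["guest"] to 1/3, since the two arcs compete for the same time span, while the obligatory "sil" arcs get posterior 1.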


Word graph showing the computed alphas and betas

[Figure: the complete toy word graph with the computed alpha and beta for every node, ranging from α = 1, β = 2.88E-16 at the first node to α = 2.88E-16, β = 1 at the final node.]

In this example the probability of occurrence of any word is fixed as it is a loop grammar and any word can follow any other word. Using a statistical language model should further strengthen the posterior scores.

This word graph shows every node with its corresponding alpha and beta computed.


Logarithmic word posterior probabilities

[Figure: the same word graph with the logarithmic word posterior probability annotated on each link, ranging from p = -0.0075 to p = -4.8978.]

The ASR system annotates the one-best output with the word posteriors computed from the word graph. These word posteriors are used by the fatigue software to prune away less likely words.


Effectiveness of Word Posteriors

A clear separation between the histograms of false word scores and correct word scores was observed.

A DET curve was also plotted for both the word likelihood scores and the word posterior scores. It can be observed that the Equal Error Rate is much lower for word posteriors compared to word likelihoods.

[Histogram: distributions of the confidence metric for the sound 'p', showing the separation between false words and correct words.]


Applying Confidence Measures to Fatigue Detection

The real test for the confidence measure was to test it with the fatigue detection system.

The fatigue detection system used the word posterior corresponding to every word in the one-best output as a confidence measure to prune away less likely words before fatigue detection.

The false alarms in the experiment were due to out-of-vocabulary words.
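The pruning step can be sketched as follows; the posterior values and the 0.5 threshold are hypothetical, chosen only to illustrate the mechanism.

```python
import math

# One-best output annotated with log word posteriors (hypothetical values).
one_best = [("KEEP", -0.05), ("THE", -0.01), ("POT", -0.03),
            ("OVEN", -0.40), ("POT", -4.70), ("FIFTEEN", -0.02)]

THRESHOLD = math.log(0.5)  # keep words whose posterior is at least 0.5

kept = [word for word, logp in one_best if logp >= THRESHOLD]
# The falsely hypothesized second "POT" (log posterior -4.70) is pruned,
# so its MFCC vectors never reach the fatigue analysis.
```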

The effect of using confidence measures on the test set can be observed from the plot shown below

[Plot: SAFTE score versus trial for all data, for words with confidence scores above -75, and for only the trained word list, compared against the SAFTE model.]


Conclusion and Future Work

Conclusion:

A suitable mechanism for detecting fatigue using an LVCSR system was developed.

A confidence measure algorithm was implemented to make the system robust to false alarms due to out-of-vocabulary words.

The confidence measure algorithm helped in improving the performance of the fatigue detection system by 20%.

Future Work:

A more extensive set of fatigue experiments should be conducted on a larger data set. The data set used for this thesis was limited by the high time and cost of collecting such data.

The effectiveness of confidence measures can be improved by using a statistical language model instead of a loop grammar.


Resources

• Pattern Recognition Applet: compare popular algorithms on standard or custom data sets

• Speech Recognition Toolkits: compare SVMs and RVMs to standard approaches using a state-of-the-art ASR toolkit

• Foundation Classes: generic C++ implementations of many popular statistical modeling approaches


References

1. F. Wessel, R. Schlüter, K. Macherey and H. Ney, "Confidence Measures for Large Vocabulary Continuous Speech Recognition," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 288-298, November 2001.

2. G. Evermann and P.C. Woodland, "Large Vocabulary Decoding and Confidence Estimation using Word Posterior Probabilities," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 8, pp. 2366-2369, Istanbul, Turkey, March 2000.

3. X. Huang, A. Acero, and H.W. Hon, Spoken Language Processing – A Guide to Theory, Algorithm, and System Development, Prentice-Hall, Upper Saddle River, New Jersey, USA, 2001.

4. D. Jurafsky and J.H. Martin, An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice-Hall, Upper Saddle River, New Jersey, USA, 2000.

5. H.P. Greeley, J. Berg, E. Friets, J.P. Wilson, S. Raghavan and J. Picone, “Detecting Fatigue from Voice Using Speech Recognition,” to be presented at the IEEE International Symposium on Signal Processing and Information Technology, Vancouver, Canada, August 2006.

6. S. Raghavan and J. Picone, "Confidence Measures Using Word Posteriors and Word Graphs," IES Spring'05 Seminar Series, January 30, 2005.

7. L. Mangu, E. Brill and A. Stolcke, "Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks," Computer, Speech and Language, vol. 14, no. 4, pp. 373-400, October 2000.