Ideas in Confidence Annotation
Arthur Chan
Three papers for today
Frank Wessel et al., “Using Word Probabilities as Confidence Measures” http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
Timothy Hazen et al., “Recognition Confidence Scoring for Use in Speech Understanding Systems” http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
Dan Bohus and Alex Rudnicky, “A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems” http://www.cs.cmu.edu/~dbohus/docs/dbohus_interspeech05.pdf
Application of Confidence Annotation
Provides the system with a decision on whether the ASR output can be trusted. Possible response strategies:
Reject the sentence altogether. Confirm with the user again. Both – e.g. a bi-threshold system.
Detection of OOV, e.g. when the ASR vocabulary does not include the OOV word.
Spoken: “What is the forecast for paramus park new jersey”; recognized: “What is the forecast for paris park new jersey”. “Paramus” is OOV, so the system should not be confident about the phoneme transcription.
Improve speech recognition performance
Why? In general, the posterior should be used instead of the likelihood. Does it help? At the 2%-5% relative level.
How this seminar proceeds
For each idea, 3 papers were studied. Only the most representative became the suggested reading. Results will be quoted from different papers.
Preliminary
Mathematical Foundation: Neyman-Pearson Theorem (NPT)
Consequence of NPT: in general, the likelihood ratio test is the most powerful test for deciding which of two distributions is in force. H1: distribution A is in force. H2: distribution B is in force.
Compute F(H1)/F(H2) and compare it against a threshold T.
In speech recognition, H1 could be the speech model and H2 the non-speech (garbage) model.
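As a hedged sketch of this test (the Gaussian models, their parameters, and the threshold are invented for illustration; the function names are not from any paper), a log-likelihood ratio test between a speech model and a garbage model might look like:

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-likelihood of a scalar observation under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def likelihood_ratio_test(frames, speech, garbage, threshold=0.0):
    """Accept H1 (speech model) when the average log-likelihood ratio
    log P(A|H1) - log P(A|H2) exceeds the threshold; `speech` and
    `garbage` are (mean, variance) pairs of toy 1-D Gaussians."""
    llr = sum(gaussian_loglik(x, *speech) - gaussian_loglik(x, *garbage)
              for x in frames) / len(frames)
    return llr > threshold, llr

# Toy frames close to the speech model's mean: the test should accept H1.
accept, llr = likelihood_ratio_test([1.0, 1.2, 0.9],
                                    speech=(1.0, 0.5), garbage=(0.0, 2.0))
```

Working in log space avoids underflow when many frames are multiplied together, which is why the ratio becomes a difference of log-likelihoods.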
Idea 1: Belief in a single ASR feature
3 studied papers:
(Suggested) Frank Wessel et al., “Using Word Probabilities as Confidence Measures” http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
Stephen Cox and Richard Rose, “Confidence Measures for the Switchboard Database” http://www.ece.mcgill.ca/~rose/papers/cox_rose_icassp96.pdf
Thomas Kemp and Thomas Schaaf, “Estimating Confidence using Word Lattices” http://overcite.lcs.mit.edu/cache/papers/cs/1116/http:zSzzSzwww.is.cs.cmu.eduzSz~wwwadmzSzpaperszSzspeechzSzEUROSPEECH97zSzEUROSPEECH97-thomas.pdf/kemp97estimating.pdf
The paper was chosen because it has the clearest math, in minute detail, though it is less motivating than Cox’s paper.
Origins of Confidence Measures in speech recognition
Formulation of speech recognition: P(W|A) = P(A|W) P(W) / P(A). In decoding, P(A) is ignored because it is a common term: W* = argmax_W P(A|W) P(W).
Problem: P(A,W) is only a relative measure. P(W|A) is the true measure of how probable a word is given the features.
In reality…
P(A) can only be approximated. By the law of total probability, P(A) = Σ_W P(A,W). N-best lists and word lattices are therefore used. Other ideas: filler/garbage/general speech models -> keyword-spotter tricks.
A threshold for the ratio needs to be found, and the ROC curve always needs to be interpreted manually.
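A minimal sketch of this approximation over an N-best list (the hypothesis strings and joint log scores here are invented for illustration): normalizing each joint score log P(A,W) by the sum over all hypotheses approximates P(A) and yields a posterior.

```python
import math

def word_posteriors(nbest):
    """Approximate P(W|A) = P(A,W) / sum_W' P(A,W') over an N-best list.
    `nbest` maps hypothesis strings to joint log scores log P(A,W)."""
    log_z = math.log(sum(math.exp(s) for s in nbest.values()))  # log P(A)
    return {w: math.exp(s - log_z) for w, s in nbest.items()}

post = word_posteriors({"paris park": -10.0, "paramus park": -10.5})
```

The resulting values sum to one, so they behave like probabilities rather than the relative scores used in decoding.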
Things that people are not confident about
All sorts of things:
Frame: frame likelihood ratio
Phone: phone likelihood ratio
Word: posterior probability (a kind of likelihood ratio too), word likelihood ratio
Sentence: sentence likelihood
General Observation from the Literature
Word-level confidence performs the best (by CER).
The word-lattice method is slightly more general. This part of the presentation will focus on the word-lattice-based method.
Word posterior probability: the authors’ definition
w_a: word hypothesis preceding w. w_e: word hypothesis succeeding w.
Computation with the lattice
Only the hypotheses included in the lattice need to be computed.
An alpha-beta type of computation can be used, similar to the forward-backward algorithm.
Forward probability
For an end time t, read: “the total posterior probability of partial hypotheses ending at t that are identical to h” (recursive formula given in the paper).
Backward probability
For a begin time t. One LM score is missing from this definition; it is added back later in the computation (recursion given in the paper).
Posterior Computation
Practical Implementation
According to the authors, posteriors found using the above formula have poorer discriminative capability because timing from the recognizer is fuzzy. Segments with 30% overlap are therefore used.
Acoustic score and language score: both are scaled. The AM is scaled by a factor equal to 1; the LM is scaled by a factor larger than 1.
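A hedged sketch of the alpha-beta computation on a toy word lattice (the lattice, node numbering, function names, and scale values are invented; nodes are assumed to be numbered in time order so that sorted order is topological):

```python
import math
from collections import defaultdict

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def edge_posteriors(edges, start_node, end_node, am_scale=1.0, lm_scale=4.0):
    """Word posterior of each lattice edge via a forward-backward pass,
    keeping the AM at scale 1 and scaling the LM up.
    edges: list of (word, from_node, to_node, am_logscore, lm_logscore)."""
    def score(e):
        return am_scale * e[3] + lm_scale * e[4]

    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for e in edges:
        out_edges[e[1]].append(e)
        in_edges[e[2]].append(e)

    nodes = sorted({n for e in edges for n in (e[1], e[2])})
    # Forward (alpha): total log score of all partial paths reaching a node.
    alpha = {start_node: 0.0}
    for n in nodes:
        if n != start_node:
            alpha[n] = logsumexp([alpha[e[1]] + score(e) for e in in_edges[n]])
    # Backward (beta): total log score of all partial paths leaving a node.
    beta = {end_node: 0.0}
    for n in reversed(nodes):
        if n != end_node:
            beta[n] = logsumexp([score(e) + beta[e[2]] for e in out_edges[n]])
    total = alpha[end_node]  # log score summed over all complete paths
    return {(e[0], e[1], e[2]): math.exp(alpha[e[1]] + score(e) + beta[e[2]] - total)
            for e in edges}

# Toy lattice: "paramus"/"paris" compete between nodes 0 and 1.
post = edge_posteriors(
    [("paramus", 0, 1, -1.0, -0.5), ("paris", 0, 1, -2.0, -0.5),
     ("park", 1, 2, -1.0, -0.5)],
    start_node=0, end_node=2)
```

Posteriors of competing edges between the same nodes sum to one, and an edge on every path gets posterior 1.0, which is the sanity check usually applied to such computations.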
Experimental Results
The confidence error rate (CER) is computed. Definition of CER: # incorrectly assigned tags / # tags. The threshold is optimized on a cross-validation set.
Compared to the baseline of (insertions + substitutions) / number of recognized words.
Results: a relative improvement of 14-18%.
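A minimal sketch of the CER metric (the tag values below are invented for illustration; CER here counts the confidence tags that disagree with the reference tagging, i.e. it is an error rate):

```python
def confidence_error_rate(assigned, reference):
    """CER: fraction of words whose confidence tag ('correct'/'incorrect')
    disagrees with the reference tagging."""
    wrong = sum(1 for a, r in zip(assigned, reference) if a != r)
    return wrong / len(assigned)

cer = confidence_error_rate(
    ["correct", "correct", "incorrect", "correct"],
    ["correct", "incorrect", "incorrect", "correct"])
```

One of the four tags disagrees with the reference, so the CER is 0.25.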
Summary
Word-based posterior probability is one effective way to compute confidence.
In practice, the AM and LM scores need to be scaled appropriately.
Further reading: Frank Soong et al., “Generalized Word Posterior Probability (GWPP) for Measuring Reliability of Recognized Words”
Idea 2: Belief in multiple ASR features
Background: a single ASR feature is not the best. Multiple features can be combined to improve results. The combination can be done by a machine-learning algorithm.
Reviewed papers
(Suggested) Timothy Hazen et al., “Recognition Confidence Scoring for Use in Speech Understanding Systems” http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
Zhang et al., http://www.cs.cmu.edu/~rongz/eurospeech_2001_1.pdf
A survey: http://fife.speech.cs.cmu.edu/Courses/11716/2000/Word_Confidence_Annotation.ps
Chase et al., “Word and Acoustic Confidence Annotation for Large Vocabulary Speech Recognition” http://www.cs.cmu.edu/afs/cs/user/lindaq/mosaic/ca.ps
The paper was chosen because it is more recent and its combination method is motivated by speech recognition.
General structure of papers in Idea 2
10-30 features from the recognizer are listed, then a combination scheme is chosen. Usually it is based on a machine-learning method, e.g. a decision tree, neural network, support vector machine, Fisher linear separator, or any other super-duper ML method.
Outline
Motivation of the paper: decide whether OOV exists; mark potentially mis-recognized words.
What the authors try to do: decide whether an utterance should be accepted.
3 different levels of features:
Phonetic-level scoring – never used directly
Utterance-level scoring – 15 features
Word-level scoring – 10 features
Phone-Level Scoring
From the authors: several works in the past have already shown that phone and frame scores are unlikely to help on their own. However, phone scores will be used to generate word-level and sentence-level scores.
Scores are normalized by a “catch-all model”; in other words, a garbage model is used to approximate P(A). Normalized scores are always used.
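A minimal sketch of this normalization (the log-likelihood values and the function name are invented for illustration): a word-level score built as the average phone log-likelihood minus the catch-all model's log-likelihood, i.e. an approximate log posterior.

```python
def word_confidence(phone_logliks, catchall_logliks):
    """Average catch-all-normalized phone score for one word:
    mean over phones of log p(A_i|phone_i) - log p(A_i|catch-all)."""
    diffs = [p - c for p, c in zip(phone_logliks, catchall_logliks)]
    return sum(diffs) / len(diffs)

# Toy three-phone word: its phones mostly beat the catch-all model.
score = word_confidence([-4.0, -5.0, -3.5], [-4.5, -4.8, -4.0])
```

A positive score means the phone models explain the acoustics better than the catch-all model, which is the intuition behind using it as a confidence feature.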
Utterance Level Features (the boring group)
1. 1st-best hypothesis total score (AM + LM + PM)
2. 1st-best hypothesis average (per-word) score
3. 1st-best hypothesis total LM score
4. 1st-best hypothesis avg. LM score
5. 1st-best hypothesis total AM score
6. 1st-best hypothesis avg. AM score
7. Difference in total score between the 1st- and 2nd-best hypotheses
8. Difference in LM score between the 1st- and 2nd-best hypotheses
9. Difference in AM score between the 1st- and 2nd-best hypotheses
14. Number of N-best hypotheses
15. Number of words in the 1st-best hypothesis
Utterance Level Features (the interesting group)
N-best purity: the N-best purity for a hypothesized word is the fraction of N-best hypotheses in which that particular hypothesized word appears in the same location in the sentence.
Or: #agreements / total. Similar to ROVER voting on the N-best list.
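A minimal sketch of N-best purity (the N-best list below is invented for illustration):

```python
def nbest_purity(nbest, word, position):
    """Fraction of N-best hypotheses containing `word` at the same
    position in the sentence."""
    matches = sum(1 for hyp in nbest
                  if position < len(hyp) and hyp[position] == word)
    return matches / len(nbest)

# Hypothetical 3-best list for one utterance.
nbest = [["show", "flights", "to", "boston"],
         ["show", "flights", "to", "austin"],
         ["show", "flight", "to", "boston"]]
purity = nbest_purity(nbest, "boston", 3)
```

Here "boston" appears at position 3 in two of the three hypotheses, so its purity is 2/3; a word all hypotheses agree on gets purity 1.0.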
Utterance Level Features (the interesting group) (cont.)
10. 1st-best hypothesis avg. N-best purity
11. 1st-best hypothesis high N-best purity: the fraction of words in the top-choice hypothesis whose N-best purity is greater than one half
12. Average N-best purity
13. High N-best purity
Word Level Features
1. Mean acoustic score -> the mean of the log likelihood
2. Mean acoustic likelihood score -> the mean of the likelihood (not the log likelihood)
3. Minimum acoustic score
4. Standard deviation of the acoustic score
5. Mean difference from the max score: the average log-likelihood ratio between the acoustic scores of the best path and those from phoneme recognition
6. Mean catch-all score
7. Number of acoustic observations
8. N-best purity
9. Number of N-best hypotheses
10. Utterance score
Classifier Training
Linear separator. Input: features. Output: a (correct, incorrect) label.
Training process:
1. Fisher linear discriminant analysis is used to produce the first version of the separator.
2. A hill-climbing algorithm is used to minimize the classification error.
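A hedged sketch of this two-stage training (the data and all function names are invented; a diagonal-covariance Fisher direction and a grid search over the bias stand in for the paper's full discriminant analysis and hill-climbing step):

```python
def mean(v):
    return sum(v) / len(v)

def fisher_weights(pos, neg):
    """Fisher-style direction per dimension: (mean_pos - mean_neg)
    divided by the pooled variance (diagonal-covariance approximation)."""
    w = []
    for d in range(len(pos[0])):
        p, n = [x[d] for x in pos], [x[d] for x in neg]
        mp, mn = mean(p), mean(n)
        var = (sum((x - mp) ** 2 for x in p) +
               sum((x - mn) ** 2 for x in n)) / (len(p) + len(n))
        w.append((mp - mn) / (var + 1e-9))
    return w

def errors(w, b, data):
    """Count classification errors of sign(w.x + b); labels are +1/-1."""
    return sum(1 for x, y in data
               if (1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1) != y)

def tune_bias(w, data, lo, hi, steps=200):
    """Grid search over the bias to minimize classification error
    (standing in for the paper's hill-climbing refinement)."""
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(((errors(w, b, data), b) for b in candidates), key=lambda t: t[0])

pos = [(2.0, 2.0), (3.0, 3.0)]   # toy features of correctly recognized words
neg = [(0.0, 0.0), (1.0, 0.0)]   # toy features of mis-recognized words
data = [(x, 1) for x in pos] + [(x, -1) for x in neg]
w = fisher_weights(pos, neg)
err, b = tune_bias(w, data, lo=-50.0, hi=50.0)
```

On this separable toy data the tuned bias drives the training error to zero; on real confidence features the second stage only reduces, not eliminates, the error.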
Results (Word-Level)
Discussion: Is there any meaning in the combination method?
IMO, yes, provided that a breakdown of each feature’s contribution to the reduction of CER is given. E.g. the goodies in other papers:
In Hazen et al., N-best purity is the most useful. In Lin, LM jitter is the feature that provides the most gain. In Zhang, back-off mode and parsing score provide significant improvement.
Also, Hazen et al. is special because the optimization of the combination is also MCE-trained. So how things are combined matters too.
Summary
25 features were used in this paper: 15 at the utterance level and 10 at the word level.
N-best purity was found to be the most helpful.
Both simple linear-separator training and minimum classification error training were used. That explains the large relative reduction in error.
Idea 3: Believe information other than ASR
ASR output has certain limitations. When applied in different applications, ASR confidence needs to be modified or combined with application-specific information.
Reviewed Papers
Dialogue systems:
(Suggested) Dan Bohus, “A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems” http://www.cs.cmu.edu/~dbohus/docs/dbohus_interspeech05.pdf
Sameer Pradhan and Wayne Ward, “Estimating Semantic Confidence for Spoken Dialogue Systems” http://oak.colorado.edu/~spradhan/publications/semantic-confidence.pdf
CALL:
Simon Ho and Brian Mak, “Joint Estimation of Thresholds in a Bi-threshold Verification Problem” http://www.cs.ust.hk/~mak/PDF/eurospeech2003-bithreshold.pdf
The paper was chosen because it is the most recent and is representative from a dialogue-system standpoint.
Big Picture of this type of paper
Use features external to the ASR as confidence features, e.g. dialogue context.
Use a cost external to the ASR error rate as the optimization criterion, e.g. the cost of misunderstanding, or a 10% FA/FR operating point.
As most commonly remarked, this usually makes more sense than relying on ASR features alone, though the quality of the features still depends on the ASR scores.
Overview of the paper
Motivation: “Recognition errors significantly affect the quality and success of the interaction (for the dialogue system).” The rejection threshold introduces a trade-off between the number of misunderstandings and the number of false rejections.
Incorrect and Correct Transfer of Concepts
An alternative formulation by the authors: the user tries to convey concepts to the system. If the confidence is below the threshold, the system rejects the utterance and no concept is transferred. If the confidence is above the threshold, the system accepts some correct concepts but also accepts some wrong concepts.
Questions the authors want to answer
“Given the existence of this tradeoff, what is the optimal value for the rejection threshold?”
“This tradeoff”: the trade-off between correctly and incorrectly transferred concepts.
Logistic regression
A general“ized” linear model with a link function g, which could be the log, logit, identity, or reciprocal.
http://userwww.sfsu.edu/~efc/classes/biol710/Glz/Generalized%20Linear%20Models.htm
Logistic regression (cont.)
Usually used when the dependent variable is categorical or non-continuous, or when the relationship itself is not linear.
Also used for combining features in ASR. See Siu, “Improved Estimation, Evaluation and Applications of Confidence Measures for Speech Recognition”, and BBN systems in general.
Impact of Incorrect and Correct Concept Transfer on task success
logit(TS) = 0.21 + 2.14 CTC – 4.12 ITC. The effect of ITC on the odds is nearly 2 times that of CTC.
The procedure
1. Identify a set of variables A, B, … involved in the rejection tradeoff (e.g. CTC and ITC).
2. Choose a global dialogue performance metric P to optimize for (e.g. task success).
3. Fit a model m which relates the trade-off variables to the chosen global dialogue performance metric: P <- m(A, B).
4. Find the threshold which maximizes the predicted performance: th* = argmax_th m(A(th), B(th)).
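A hedged sketch of this procedure, using the fitted model quoted on the earlier slide (logit(TS) = 0.21 + 2.14 CTC - 4.12 ITC); the trade-off curves mapping each candidate threshold to its (CTC, ITC) rates are invented for illustration, as are the function names:

```python
import math

def predicted_success(ctc, itc):
    """Predicted task success from the slide's fitted model:
    logit(TS) = 0.21 + 2.14*CTC - 4.12*ITC."""
    z = 0.21 + 2.14 * ctc - 4.12 * itc
    return 1.0 / (1.0 + math.exp(-z))

def best_threshold(curves):
    """`curves` maps a candidate rejection threshold to empirically
    measured (CTC, ITC) rates; pick the threshold maximizing the
    predicted performance m(CTC, ITC)."""
    return max(curves, key=lambda th: predicted_success(*curves[th]))

# Hypothetical trade-off curves: a higher threshold rejects more,
# lowering both correct (CTC) and incorrect (ITC) concept transfer.
curves = {0.1: (0.95, 0.30), 0.3: (0.90, 0.15),
          0.5: (0.80, 0.05), 0.7: (0.60, 0.02)}
th = best_threshold(curves)
```

With these made-up curves the scan prefers a middle threshold: raising it past that point sheds more correct transfers than incorrect ones, so predicted success drops again.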
Data
RoomLine system. Baseline: a fixed rejection threshold of 0.3. Each participant attempted a maximum of 10 scenario-based interactions. There are 71 states in the dialogue system.
Rejection Optimization
The 71 states are manually clustered into 3 types:
Open request – the system asks open questions: “How may I help you?”
Request (bool) – the system asks a yes/no question: “Do you want a reservation for this room?”
Request (non-bool) – the system requests an answer with more than 2 possible values: “Starting at what time do you need the room?”
The cost is then optimized for individual states.
Results
Summary of the paper
A principled idea for dialogue systems. Logistic regression is used to optimize the rejection threshold. A neat paper.
Several clever points: logistic regression, and using a metric external to the ASR.
Discussion
3 different types of ideas in confidence annotation.
Questions: Which idea should we use? Could the ideas be combined?
Goodies in Idea 1
Word posterior probability and LM jitter were found to be very useful in different papers.
Word posterior probability is a generalization of many techniques in the field.
LM jitter could be generalized with other parameters in the decoder as well.
Utterance scores help word scores.
Goodies in Idea 2
Combination always helps. Combination in the ML sense and in the DT sense each give a chunk of the gain.
Combination methods:
A generalized linear model is easy to interpret and principled.
A linear separator can be trained easily, in both the ML and DT sense.
Neural networks and SVMs come with a standard goodie: general non-linear modeling.
Goodies in Idea 3
Every type of application has its own concerns, which are more important than WER.
Researchers should take the liberty to optimize for them instead of relying only on ASR.
Conclusion
For an ASR-based system, Ideas 1 + 2 are wins.
For an application built on an ASR-based system, Ideas 1 + 2 + 3 would be the most helpful.