Ideas in Confidence Annotation
Arthur Chan
Three papers for today
Frank Wessel et al., “Using Word Probabilities as Confidence Measures” http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
Timothy Hazen et al., “Recognition Confidence Scoring for Use in Speech Understanding Systems” http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
Dan Bohus and Alex Rudnicky, “A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems” http://www.cs.cmu.edu/~dbohus/docs/dbohus_interspeech05.pdf
Application of Confidence Annotation
Provides the system with a decision on whether the ASR output can be trusted. Possible response strategies:
Reject the sentence altogether. Confirm with the user again. Both – e.g. a bi-threshold system.
Detection of OOV, e.g. when the ASR vocabulary does not include the OOV word.
Spoken: “What is the forecast for paramus park new jersey”; recognized: “What is the forecast for paris park new jersey”. “Paramus” is OOV, so the system should not be confident about the phoneme transcription.
Improve speech recognition performance
Why? In general, the posterior should be used instead of the likelihood. Does it help? At the 2%-5% relative level.
How this seminar proceeds
For each idea, 3 papers were studied. Only the most representative became the suggested reading. Results will be quoted from different papers.
Preliminary
Mathematical Foundation: Neyman-Pearson Theorem (NPT)
Consequence of NPT: in general, the likelihood ratio test is the most powerful test for deciding which of two distributions is in force. H1: distribution A is in force. H2: distribution B is in force.
Compute F(H1)/F(H2) and compare it against a threshold T.
In speech recognition, H1 could be the speech model and H2 the non-speech (garbage) model.
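As a hedged sketch of this test (the Gaussian models, their parameters, and the threshold are invented for illustration; the function names are not from any paper), a log-likelihood ratio test between a speech model and a garbage model might look like:

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-likelihood of a scalar observation under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def likelihood_ratio_test(frames, speech, garbage, threshold=0.0):
    """Accept H1 (speech model) when the average log-likelihood ratio
    log P(A|H1) - log P(A|H2) exceeds the threshold; `speech` and
    `garbage` are (mean, variance) pairs of toy 1-D Gaussians."""
    llr = sum(gaussian_loglik(x, *speech) - gaussian_loglik(x, *garbage)
              for x in frames) / len(frames)
    return llr > threshold, llr

# Toy frames close to the speech model's mean: the test should accept H1.
accept, llr = likelihood_ratio_test([1.0, 1.2, 0.9],
                                    speech=(1.0, 0.5), garbage=(0.0, 2.0))
```

Working in log space avoids underflow when many frames are multiplied together, which is why the ratio becomes a difference of log-likelihoods.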
Idea 1: Belief in a single ASR feature
3 studied papers:
(Suggested) Frank Wessel et al., “Using Word Probabilities as Confidence Measures” http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
Stephen Cox and Richard Rose, “Confidence Measures for the Switchboard Database” http://www.ece.mcgill.ca/~rose/papers/cox_rose_icassp96.pdf
Thomas Kemp and Thomas Schaaf, “Estimating Confidence using Word Lattices” http://overcite.lcs.mit.edu/cache/papers/cs/1116/http:zSzzSzwww.is.cs.cmu.eduzSz~wwwadmzSzpaperszSzspeechzSzEUROSPEECH97zSzEUROSPEECH97-thomas.pdf/kemp97estimating.pdf
The paper was chosen because it has the clearest math, in minute detail, though it is less motivating than Cox’s paper.
Origins of Confidence Measures in speech recognition
Formulation of speech recognition: P(W|A) = P(A|W) P(W) / P(A). In decoding, P(A) is ignored because it is a common term: W* = argmax_W P(A|W) P(W).
Problem: P(A,W) is only a relative measure. P(W|A) is the true measure of how probable a word is given the features.
In reality…
P(A) can only be approximated. By the law of total probability, P(A) = Σ_W P(A,W). N-best lists and word lattices are therefore used. Other ideas: filler/garbage/general speech models -> keyword-spotter tricks.
A threshold for the ratio needs to be found, and the ROC curve always needs to be interpreted manually.
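A minimal sketch of this approximation over an N-best list (the hypothesis strings and joint log scores here are invented for illustration): normalizing each joint score log P(A,W) by the sum over all hypotheses approximates P(A) and yields a posterior.

```python
import math

def word_posteriors(nbest):
    """Approximate P(W|A) = P(A,W) / sum_W' P(A,W') over an N-best list.
    `nbest` maps hypothesis strings to joint log scores log P(A,W)."""
    log_z = math.log(sum(math.exp(s) for s in nbest.values()))  # log P(A)
    return {w: math.exp(s - log_z) for w, s in nbest.items()}

post = word_posteriors({"paris park": -10.0, "paramus park": -10.5})
```

The resulting values sum to one, so they behave like probabilities rather than the relative scores used in decoding.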
Things that people are not confident about
All sorts of things:
Frame: frame likelihood ratio
Phone: phone likelihood ratio
Word: posterior probability (a kind of likelihood ratio too), word likelihood ratio
Sentence: sentence likelihood
General Observation from the Literature
Word-level confidence performs the best (by CER).
The word-lattice method is slightly more general. This part of the presentation will focus on the word-lattice-based method.
Word posterior probability: the authors’ definition
w_a: word hypothesis preceding w. w_e: word hypothesis succeeding w.
Computation with the lattice
Only the hypotheses included in the lattice need to be computed.
An alpha-beta type of computation can be used, similar to the forward-backward algorithm.
Forward probability
For an end time t, read: “the total posterior probability of partial hypotheses ending at t that are identical to h” (recursive formula given in the paper).
Backward probability
For a begin time t. One LM score is missing from this definition; it is added back later in the computation (recursion given in the paper).
Posterior Computation
Practical Implementation
According to the authors, posteriors found using the above formula have poorer discriminative capability because timing from the recognizer is fuzzy. Segments with 30% overlap are therefore used.
Acoustic score and language score: both are scaled. The AM is scaled by a factor equal to 1; the LM is scaled by a factor larger than 1.
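A hedged sketch of the alpha-beta computation on a toy word lattice (the lattice, node numbering, function names, and scale values are invented; nodes are assumed to be numbered in time order so that sorted order is topological):

```python
import math
from collections import defaultdict

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def edge_posteriors(edges, start_node, end_node, am_scale=1.0, lm_scale=4.0):
    """Word posterior of each lattice edge via a forward-backward pass,
    keeping the AM at scale 1 and scaling the LM up.
    edges: list of (word, from_node, to_node, am_logscore, lm_logscore)."""
    def score(e):
        return am_scale * e[3] + lm_scale * e[4]

    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for e in edges:
        out_edges[e[1]].append(e)
        in_edges[e[2]].append(e)

    nodes = sorted({n for e in edges for n in (e[1], e[2])})
    # Forward (alpha): total log score of all partial paths reaching a node.
    alpha = {start_node: 0.0}
    for n in nodes:
        if n != start_node:
            alpha[n] = logsumexp([alpha[e[1]] + score(e) for e in in_edges[n]])
    # Backward (beta): total log score of all partial paths leaving a node.
    beta = {end_node: 0.0}
    for n in reversed(nodes):
        if n != end_node:
            beta[n] = logsumexp([score(e) + beta[e[2]] for e in out_edges[n]])
    total = alpha[end_node]  # log score summed over all complete paths
    return {(e[0], e[1], e[2]): math.exp(alpha[e[1]] + score(e) + beta[e[2]] - total)
            for e in edges}

# Toy lattice: "paramus"/"paris" compete between nodes 0 and 1.
post = edge_posteriors(
    [("paramus", 0, 1, -1.0, -0.5), ("paris", 0, 1, -2.0, -0.5),
     ("park", 1, 2, -1.0, -0.5)],
    start_node=0, end_node=2)
```

Posteriors of competing edges between the same nodes sum to one, and an edge on every path gets posterior 1.0, which is the sanity check usually applied to such computations.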
Experimental Results
The confidence error rate (CER) is computed. Definition of CER: # incorrectly assigned tags / # tags. The threshold is optimized on a cross-validation set.
Compared to the baseline of (insertions + substitutions) / number of recognized words.
Results: a relative improvement of 14-18%.
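A minimal sketch of the CER metric (the tag values below are invented for illustration; CER here counts the confidence tags that disagree with the reference tagging, i.e. it is an error rate):

```python
def confidence_error_rate(assigned, reference):
    """CER: fraction of words whose confidence tag ('correct'/'incorrect')
    disagrees with the reference tagging."""
    wrong = sum(1 for a, r in zip(assigned, reference) if a != r)
    return wrong / len(assigned)

cer = confidence_error_rate(
    ["correct", "correct", "incorrect", "correct"],
    ["correct", "incorrect", "incorrect", "correct"])
```

One of the four tags disagrees with the reference, so the CER is 0.25.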
Summary
Word-based posterior probability is one effective way to compute confidence.
In practice, the AM and LM scores need to be scaled appropriately.
Further reading: Frank Soong et al., “Generalized Word Posterior Probability (GWPP) for Measuring Reliability of Recognized Words”
Idea 2: Belief in multiple ASR features
Background: a single ASR feature is not the best. Multiple features can be combined to improve results. The combination can be done by a machine-learning algorithm.
Reviewed papers
(Suggested) Timothy Hazen et al., “Recognition Confidence Scoring for Use in Speech Understanding Systems” http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
Zhang et al., http://www.cs.cmu.edu/~rongz/eurospeech_2001_1.pdf
A survey: http://fife.speech.cs.cmu.edu/Courses/11716/2000/Word_Confidence_Annotation.ps
Chase et al., “Word and Acoustic Confidence Annotation for Large Vocabulary Speech Recognition” http://www.cs.cmu.edu/afs/cs/user/lindaq/mosaic/ca.ps
The paper was chosen because it is more recent and its combination method is motivated by speech recognition.
General structure of papers in Idea 2
10-30 features from the recognizer are listed, then a combination scheme is chosen. Usually it is based on a machine-learning method, e.g. a decision tree, neural network, support vector machine, Fisher linear separator, or any other super-duper ML method.
Outline
Motivation of the paper: decide whether OOV exists; mark potentially mis-recognized words.
What the authors try to do: decide whether an utterance should be accepted.
3 different levels of features:
Phonetic-level scoring – never used directly
Utterance-level scoring – 15 features
Word-level scoring – 10 features
Phone-Level Scoring
From the authors: several works in the past have already shown that phone and frame scores are unlikely to help on their own. However, phone scores will be used to generate word-level and sentence-level scores.
Scores are normalized by a “catch-all model”; in other words, a garbage model is used to approximate P(A). Normalized scores are always used.
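A minimal sketch of this normalization (the log-likelihood values and the function name are invented for illustration): a word-level score built as the average phone log-likelihood minus the catch-all model's log-likelihood, i.e. an approximate log posterior.

```python
def word_confidence(phone_logliks, catchall_logliks):
    """Average catch-all-normalized phone score for one word:
    mean over phones of log p(A_i|phone_i) - log p(A_i|catch-all)."""
    diffs = [p - c for p, c in zip(phone_logliks, catchall_logliks)]
    return sum(diffs) / len(diffs)

# Toy three-phone word: its phones mostly beat the catch-all model.
score = word_confidence([-4.0, -5.0, -3.5], [-4.5, -4.8, -4.0])
```

A positive score means the phone models explain the acoustics better than the catch-all model, which is the intuition behind using it as a confidence feature.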
Utterance Level Features (the boring group)
1. 1st-best hypothesis total score (AM + LM + PM)
2. 1st-best hypothesis average (per-word) score
3. 1st-best hypothesis total LM score
4. 1st-best hypothesis avg. LM score
5. 1st-best hypothesis total AM score
6. 1st-best hypothesis avg. AM score
7. Difference in total score between the 1st- and 2nd-best hypotheses
8. Difference in LM score between the 1st- and 2nd-best hypotheses
9. Difference in AM score between the 1st- and 2nd-best hypotheses
14. Number of N-best hypotheses
15. Number of words in the 1st-best hypothesis
Utterance Level Features (the interesting group)
N-best purity: the N-best purity for a hypothesized word is the fraction of N-best hypotheses in which that particular hypothesized word appears in the same location in the sentence.
Or: #agreements / total. Similar to ROVER voting on the N-best list.
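A minimal sketch of N-best purity (the N-best list below is invented for illustration):

```python
def nbest_purity(nbest, word, position):
    """Fraction of N-best hypotheses containing `word` at the same
    position in the sentence."""
    matches = sum(1 for hyp in nbest
                  if position < len(hyp) and hyp[position] == word)
    return matches / len(nbest)

# Hypothetical 3-best list for one utterance.
nbest = [["show", "flights", "to", "boston"],
         ["show", "flights", "to", "austin"],
         ["show", "flight", "to", "boston"]]
purity = nbest_purity(nbest, "boston", 3)
```

Here "boston" appears at position 3 in two of the three hypotheses, so its purity is 2/3; a word all hypotheses agree on gets purity 1.0.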
Utterance Level Features (the interesting group) (cont.)
10. 1st-best hypothesis avg. N-best purity
11. 1st-best hypothesis high N-best purity: the fraction of words in the top-choice hypothesis whose N-best purity is greater than one half
12. Average N-best purity
13. High N-best purity
Word Level Features
1. Mean acoustic score -> the mean of the log likelihood
2. Mean acoustic likelihood score -> the mean of the likelihood (not the log likelihood)
3. Minimum acoustic score
4. Standard deviation of the acoustic score
5. Mean difference from the max score: the average log-likelihood ratio between the acoustic scores of the best path and those from phoneme recognition
6. Mean catch-all score
7. Number of acoustic observations
8. N-best purity
9. Number of N-best hypotheses
10. Utterance score
Classifier Training
Linear separator. Input: features. Output: a (correct, incorrect) label.
Training process:
1. Fisher linear discriminant analysis is used to produce the first version of the separator.
2. A hill-climbing algorithm is used to minimize the classification error.
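A hedged sketch of this two-stage training (the data and all function names are invented; a diagonal-covariance Fisher direction and a grid search over the bias stand in for the paper's full discriminant analysis and hill-climbing step):

```python
def mean(v):
    return sum(v) / len(v)

def fisher_weights(pos, neg):
    """Fisher-style direction per dimension: (mean_pos - mean_neg)
    divided by the pooled variance (diagonal-covariance approximation)."""
    w = []
    for d in range(len(pos[0])):
        p, n = [x[d] for x in pos], [x[d] for x in neg]
        mp, mn = mean(p), mean(n)
        var = (sum((x - mp) ** 2 for x in p) +
               sum((x - mn) ** 2 for x in n)) / (len(p) + len(n))
        w.append((mp - mn) / (var + 1e-9))
    return w

def errors(w, b, data):
    """Count classification errors of sign(w.x + b); labels are +1/-1."""
    return sum(1 for x, y in data
               if (1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1) != y)

def tune_bias(w, data, lo, hi, steps=200):
    """Grid search over the bias to minimize classification error
    (standing in for the paper's hill-climbing refinement)."""
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(((errors(w, b, data), b) for b in candidates), key=lambda t: t[0])

pos = [(2.0, 2.0), (3.0, 3.0)]   # toy features of correctly recognized words
neg = [(0.0, 0.0), (1.0, 0.0)]   # toy features of mis-recognized words
data = [(x, 1) for x in pos] + [(x, -1) for x in neg]
w = fisher_weights(pos, neg)
err, b = tune_bias(w, data, lo=-50.0, hi=50.0)
```

On this separable toy data the tuned bias drives the training error to zero; on real confidence features the second stage only reduces, not eliminates, the error.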
Results (Word-Level)
Discussion: Is there any meaning in the combination method?
IMO, yes, provided that a breakdown of each feature’s contribution to the reduction of CER is given. E.g. the goodies in other papers:
In Hazen et al., N-best purity is the most useful. In Lin, LM jitter is the feature that provides the most gain. In Zhang, back-off mode and parsing score provide significant improvement.
Also, Hazen et al. is special because the optimization of the combination is also MCE-trained. So how things are combined matters too.
Summary
25 features were used in this paper: 15 at the utterance level and 10 at the word level.
N-best purity was found to be the most helpful.
Both simple linear-separator training and minimum classification error training were used. That explains the large relative reduction in error.
Idea 3: Believe information other than ASR
ASR output has certain limitations. When applied in different applications, ASR confidence needs to be modified or combined with application-specific information.
Reviewed Papers
Dialogue systems:
(Suggested) Dan Bohus, “A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems” http://www.cs.cmu.edu/~dbohus/docs/dbohus_interspeech05.pdf
Sameer Pradhan and Wayne Ward, “Estimating Semantic Confidence for Spoken Dialogue Systems” http://oak.colorado.edu/~spradhan/publications/semantic-confidence.pdf
CALL:
Simon Ho and Brian Mak, “Joint Estimation of Thresholds in a Bi-threshold Verification Problem” http://www.cs.ust.hk/~mak/PDF/eurospeech2003-bithreshold.pdf
The paper was chosen because it is the most recent and is representative from a dialogue-system standpoint.
Big Picture of this type of paper
Use features external to the ASR as confidence features, e.g. dialogue context.
Use a cost external to the ASR error rate as the optimization criterion, e.g. the cost of misunderstanding, or a 10% FA/FR operating point.
As most commonly remarked, this usually makes more sense than relying on ASR features alone, though the quality of the features still depends on the ASR scores.
Overview of the paper
Motivation: “Recognition errors significantly affect the quality and success of the interaction (for the dialogue system).” The rejection threshold introduces a trade-off between the number of misunderstandings and the number of false rejections.
Incorrect and Correct Transfer of Concepts
An alternative formulation by the authors: the user tries to convey concepts to the system. If the confidence is below the threshold, the system rejects the utterance and no concept is transferred. If the confidence is above the threshold, the system accepts some correct concepts but also accepts some wrong concepts.
Questions the authors want to answer
“Given the existence of this tradeoff, what is the optimal value for the rejection threshold?”
“This tradeoff”: the trade-off between correctly and incorrectly transferred concepts.
Logistic regression
A general“ized” linear model with a link function g, which could be the log, logit, identity, or reciprocal.
http://userwww.sfsu.edu/~efc/classes/biol710/Glz/Generalized%20Linear%20Models.htm
Logistic regression (cont.)
Usually used when the dependent variable is categorical or non-continuous, or when the relationship itself is not linear.
Also used for combining features in ASR. See Siu, “Improved Estimation, Evaluation and Applications of Confidence Measures for Speech Recognition”, and BBN systems in general.
Impact of Incorrect and Correct Concept Transfer on task success
logit(TS) = 0.21 + 2.14 CTC – 4.12 ITC. The effect of ITC on the odds is nearly 2 times that of CTC.
The procedure
1. Identify a set of variables A, B, … involved in the rejection tradeoff (e.g. CTC and ITC).
2. Choose a global dialogue performance metric P to optimize for (e.g. task success).
3. Fit a model m which relates the trade-off variables to the chosen global dialogue performance metric: P <- m(A, B).
4. Find the threshold which maximizes the predicted performance: th* = argmax_th m(A(th), B(th)).
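A hedged sketch of this procedure, using the fitted model quoted on the earlier slide (logit(TS) = 0.21 + 2.14 CTC - 4.12 ITC); the trade-off curves mapping each candidate threshold to its (CTC, ITC) rates are invented for illustration, as are the function names:

```python
import math

def predicted_success(ctc, itc):
    """Predicted task success from the slide's fitted model:
    logit(TS) = 0.21 + 2.14*CTC - 4.12*ITC."""
    z = 0.21 + 2.14 * ctc - 4.12 * itc
    return 1.0 / (1.0 + math.exp(-z))

def best_threshold(curves):
    """`curves` maps a candidate rejection threshold to empirically
    measured (CTC, ITC) rates; pick the threshold maximizing the
    predicted performance m(CTC, ITC)."""
    return max(curves, key=lambda th: predicted_success(*curves[th]))

# Hypothetical trade-off curves: a higher threshold rejects more,
# lowering both correct (CTC) and incorrect (ITC) concept transfer.
curves = {0.1: (0.95, 0.30), 0.3: (0.90, 0.15),
          0.5: (0.80, 0.05), 0.7: (0.60, 0.02)}
th = best_threshold(curves)
```

With these made-up curves the scan prefers a middle threshold: raising it past that point sheds more correct transfers than incorrect ones, so predicted success drops again.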
Data
RoomLine system. Baseline: a fixed rejection threshold of 0.3. Each participant attempted a maximum of 10 scenario-based interactions. There are 71 states in the dialogue system.
Rejection Optimization
The 71 states are manually clustered into 3 types:
Open request – the system asks open questions: “How may I help you?”
Request (bool) – the system asks a yes/no question: “Do you want a reservation for this room?”
Request (non-bool) – the system requests an answer with more than 2 possible values: “Starting at what time do you need the room?”
The cost is then optimized for individual states.
Results
Summary of the paper
A principled idea for dialogue systems. Logistic regression is used to optimize the rejection threshold. A neat paper.
Several clever points: logistic regression, and using a metric external to the ASR.
Discussion
3 different types of ideas in confidence annotation.
Questions: Which idea should we use? Could the ideas be combined?
Goodies in Idea 1
Word posterior probability and LM jitter were found to be very useful in different papers.
Word posterior probability is a generalization of many techniques in the field.
LM jitter could be generalized with other parameters in the decoder as well.
Utterance scores help word scores.
Goodies in Idea 2
Combination always helps. Combination in the ML sense and in the DT sense each give a chunk of the gain.
Combination methods:
A generalized linear model is easy to interpret and principled.
A linear separator can be trained easily, in both the ML and DT sense.
Neural networks and SVMs come with a standard goodie: general non-linear modeling.
Goodies in Idea 3
Every type of application has its own concerns, which are more important than WER.
Researchers should take the liberty to optimize for them instead of relying only on ASR.
Conclusion
For an ASR-based system, Ideas 1 + 2 are wins.
For an application built on an ASR-based system, Ideas 1 + 2 + 3 would be the most helpful.