Better Punctuation Prediction with Dynamic Conditional Random Fields
Wei Lu and Hwee Tou Ng
National University of Singapore
Talk Overview
• Background
• Related Work
• Approaches
– Previous approach: Hidden Event Language Model
– Previous approach: Linear-Chain CRF
– This work: Factorial CRF
• Evaluation
• Conclusion
Punctuation Prediction
• Automatically insert punctuation symbols into transcribed speech utterances
• Widely studied in the speech processing community
• Example:
>> Original speech utterance:
you are quite welcome and by the way we may get other reservations so could you please call us as soon as you fix the date
>> Punctuated (and cased) version:
You are quite welcome . And by the way , we may get other reservations , so could you please call us as soon as you fix the date ?
Our Task
• Processing prosodic features requires access to the raw speech data, which may be unavailable
• We tackle the problem from a text processing perspective
Perform punctuation prediction for conversational speech texts without relying on prosodic features
Related Work
• With prosodic features
– Kim and Woodland (2001): a decision tree framework
– Christensen et al. (2001): a finite state and a multi-layer perceptron
– Huang and Zweig (2002): a maximum entropy-based approach
– Liu et al. (2005): linear-chain conditional random fields
• Without prosodic features
– Beeferman et al. (1998): comma prediction with a trigram language model
– Gravano et al. (2009): n-gram based approach
Related Work (continued)
• One well-known approach that does not exploit prosodic features
– Stolcke et al. (1998) presented a hidden event language model
– It treats boundary detection and punctuation insertion as an inter-word hidden event detection task
– Widely used in many recent spoken language translation tasks as either a pre-processing (Wang et al., 2008) or post-processing (Kirchhoff and Yang, 2007) step
Hidden Event Language Model
• HMM (Hidden Markov Model)-based approach
– A joint distribution over words and inter-word events
– Observations are the words, and word/event pairs are hidden states
• Implemented in the SRILM toolkit (Stolcke, 2002)
• Variant of this approach
– Relocates/duplicates the ending punctuation symbol to be closer to the indicative words
– Works well for predicting English question marks
Example: "where is the nearest bus stop ?" becomes "? where is the nearest bus stop"
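The relocation/duplication variant can be illustrated with a minimal sketch of the text transformation. The function name `move_ending_punct` and its `duplicate` flag are hypothetical, the input is assumed to be a single tokenized sentence, and this is not the SRILM implementation.

```python
ENDING_PUNCT = {".", "?", "!"}

def move_ending_punct(tokens, duplicate=False):
    """Move (or copy) a sentence-final punctuation token to the sentence start.

    With duplicate=False the symbol is relocated; with duplicate=True it is
    kept at the end and also copied to the front.
    """
    if not tokens or tokens[-1] not in ENDING_PUNCT:
        return list(tokens)  # nothing to move
    punct = tokens[-1]
    body = tokens if duplicate else tokens[:-1]
    return [punct] + list(body)

print(" ".join(move_ending_punct("where is the nearest bus stop ?".split())))
```

Placing the symbol next to "where" lets an n-gram model see the question word and the question mark in the same short context.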
Linear-Chain CRF
• Linear-chain conditional random fields (L-CRF): Undirected graphical model used for sequence learning
– Avoid the strong assumptions about dependencies in the hidden event language model
– Capable of modeling dependencies with arbitrary non-independent overlapping features
[Figure: linear-chain CRF with word-layer tags Y1, Y2, Y3, …, Yn over the utterance words X1, X2, X3, …, Xn]
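For reference, the standard textbook form of a linear-chain CRF can be written as below. The weights \lambda_k, feature functions f_k, and normalizer Z(x) are notation assumed here and do not appear on the slides; y_0 is taken to be a fixed start symbol.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
```

Because each feature may inspect the whole observation sequence x, overlapping and non-independent features can be used without modeling the words themselves.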
An Example L-CRF
• A linear-chain CRF assigns a single tag to each individual word at each time step
– Tags: NONE, COMMA, PERIOD, QMARK, EMARK
– Factorized features
• Sentence: no , please do not . would you save your questions for the end of my talk , when i ask for them ?
Tags: COMMA NONE NONE PERIOD NONE NONE … NONE COMMA NONE … QMARK
Words: no please do not would you … my talk when … them
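The mapping from a punctuated sentence to word-layer tags can be sketched as follows. This is an illustrative reading of the tagging scheme, assuming whitespace-tokenized input in which each punctuation symbol is its own token; the names `PUNCT_TO_TAG` and `to_word_tags` are hypothetical.

```python
PUNCT_TO_TAG = {",": "COMMA", ".": "PERIOD", "?": "QMARK", "!": "EMARK"}

def to_word_tags(punctuated):
    """Return (word, tag) pairs; each tag names the symbol that follows the word."""
    pairs = []
    for token in punctuated.split():
        if token in PUNCT_TO_TAG:
            if pairs:  # attach the symbol to the preceding word
                word, _ = pairs[-1]
                pairs[-1] = (word, PUNCT_TO_TAG[token])
        else:
            pairs.append((token, "NONE"))  # default: no symbol after this word
    return pairs

sentence = ("no , please do not . would you save your questions "
            "for the end of my talk , when i ask for them ?")
for word, tag in to_word_tags(sentence):
    print(word, tag)
```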
Features for L-CRF
• Feature factorization (Sutton et al., 2007)
– Product of a binary function on the assignment of the set of cliques at each time step, and a feature function solely defined on the observation sequence
– Feature functions: n-gram (n = 1, 2, 3) occurrences within 5 words from the current word
Example: for the word “do”:
do@0, please@-1, would_you@[2,3], no_please_do@[-2,0]
Tags: COMMA NONE NONE PERIOD NONE NONE … NONE COMMA NONE … QMARK
Words: no please do not would you … my talk when … them
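The n-gram feature functions can be sketched as below. The feature string format follows the example for “do”; reading "within 5 words" as a window of 5 positions on either side of the current word is an assumption, and `ngram_features` is a hypothetical name.

```python
def ngram_features(words, i, max_n=3, window=5):
    """List n-gram features around position i, with offsets relative to i."""
    feats = []
    for n in range(1, max_n + 1):
        # every n-gram that lies fully inside [i - window, i + window]
        for start in range(i - window, i + window - n + 2):
            end = start + n - 1
            if start < 0 or end >= len(words):
                continue  # n-gram would run past the utterance
            gram = "_".join(words[start:end + 1])
            if n == 1:
                pos = str(start - i)
            else:
                pos = f"[{start - i},{end - i}]"
            feats.append(f"{gram}@{pos}")
    return feats

words = "no please do not would you save your questions".split()
print(ngram_features(words, words.index("do")))
```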
Problems with L-CRF
• Long-range dependency between the punctuation symbols and the indicative words cannot be captured properly
• For example: no please do not would you save your questions for the end of my talk when i ask for them
It is hard to capture the long-range dependency between the ending question mark (?) and the initial phrase “would you” with a linear-chain CRF
Problems with L-CRF
• What humans might do
– no please do not would you save your questions for the end of my talk when i ask for them
– no please do not would you save your questions for the end of my talk when i ask for them
– no , please do not . would you save your questions for the end of my talk , when i ask for them ?
• Sentence-level punctuation symbols (. ? !) are associated with the complete sentence, and should therefore be assigned at the sentence level
What Do We Want?
• A model that jointly performs the following three tasks
– Sentence boundary detection (or sentence segmentation)
– Sentence type identification
– Punctuation insertion
Factorial CRF
• An instance of dynamic CRF
– Two-layer factorial CRF (F-CRF) jointly annotates an observation sequence with two label sequences
– Models the conditional probability of the label sequence pairs <Y,Z> given the observation sequence X
[Figure: two-layer factorial CRF with sentence-layer tags Z1, Z2, Z3, …, Zn, word-layer tags Y1, Y2, Y3, …, Yn, and the utterance words X1, X2, X3, …, Xn]
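One common way to write the conditional distribution of a two-layer factorial CRF, following the general dynamic CRF form, is sketched below. The potential names \Phi_Y, \Phi_Z, and \Psi and the normalizer Z(x) are notation assumed here, not taken from the slides.

```latex
p(\mathbf{y}, \mathbf{z} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \prod_{t=1}^{n-1} \Phi_Y(y_t, y_{t+1}, \mathbf{x}, t)\,
    \prod_{t=1}^{n-1} \Phi_Z(z_t, z_{t+1}, \mathbf{x}, t)\,
    \prod_{t=1}^{n} \Psi(y_t, z_t, \mathbf{x}, t)
```

The first two products are within-layer chains for the word-layer and sentence-layer tags; the third couples the two layers at each time step, which is what allows one layer's labels to inform the other's.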
Example of F-CRF
Sentence-layer tags: DEBEG DEIN DEIN DEIN QNBEG QNIN … QNIN QNIN QNIN … QNIN
Word-layer tags: COMMA NONE NONE PERIOD NONE NONE … NONE COMMA NONE … QMARK
Words: no please do not would you … my talk when … them
• Propose two sets of tags for this joint task
– Word-layer: NONE, COMMA, PERIOD, QMARK, EMARK
– Sentence-layer: DEBEG, DEIN, QNBEG, QNIN, EXBEG, EXIN
– Uses analogous feature factorization and the same feature functions as the L-CRF
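The relation between the two tag layers can be sketched by deriving sentence-layer tags from word-layer tags. This assumes that the sentence type is determined by the ending symbol (PERIOD gives DE, QMARK gives QN, EMARK gives EX) and that a trailing segment without an ending symbol defaults to declarative; both the default and the name `sentence_layer_tags` are assumptions.

```python
END_TAG_TO_TYPE = {"PERIOD": "DE", "QMARK": "QN", "EMARK": "EX"}

def sentence_layer_tags(word_tags, default_type="DE"):
    """Derive sentence-layer tags (e.g. QNBEG, QNIN) from word-layer tags."""
    sent_tags = []
    start = 0  # index of the first word of the current sentence
    for i, tag in enumerate(word_tags):
        is_last = i == len(word_tags) - 1
        if tag in END_TAG_TO_TYPE or is_last:
            sent_type = END_TAG_TO_TYPE.get(tag, default_type)
            length = i - start + 1
            sent_tags.extend([sent_type + "BEG"] + [sent_type + "IN"] * (length - 1))
            start = i + 1
    return sent_tags

word_tags = ["COMMA", "NONE", "NONE", "PERIOD", "NONE", "NONE", "QMARK"]
print(sentence_layer_tags(word_tags))
```

Every word of a question carries a QN tag, so the information that the sentence is a question is available at "would you" and not only at the final word.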
Why Does it Work?
• The sentence-layer tags are used for sentence segmentation and sentence type identification
• The word-layer tags are used for punctuation insertion
• Knowledge learned from the sentence-layer can guide the word-layer tagging process
• The two layers are jointly learned, so each layer provides evidence that influences the other layer's tagging process
[no please do not] declarative sentence
[would you save your questions for the end of my talk when i ask for them] question sentence
[Figure: sentence-layer tags QNBEG QNIN … for the question sentence, ending in ?]
Evaluation Datasets
• IWSLT 2009 BTEC and CT datasets
• Consists of both English (EN) and Chinese (CN)
• 90% used for training, 10% for testing
| | BTEC CN | BTEC EN | CT CN | CT EN |
| Number of utterance pairs | 19,972 | 19,972 | 10,061 | 10,061 |
| Percentage of declarative sentences | 64% | 65% | 77% | 81% |
| Percentage of question sentences | 36% | 35% | 22% | 19% |
| Multiple sentences per utterance | 14% | 17% | 29% | 39% |
| Average words per utterance | 8.59 | 9.46 | 10.18 | 14.33 |
Experimental Setup
• Designed extensive experiments for Hidden Event Language Model
– Duplication vs. No duplication
– Single-pass vs. Cascaded
– Trigram vs. 5-gram
• Conducted the following experiments
– Accuracy on CRR (correctly recognized) texts (F1 measure)
– Accuracy on ASR (automatically recognized) texts (F1 measure)
– Translation performance with punctuated ASR texts (BLEU metric)
Punctuation Prediction: Evaluation Metrics
• Precision = $\frac{\#\ \text{correctly predicted punctuation symbols}}{\#\ \text{predicted punctuation symbols}}$
• Recall = $\frac{\#\ \text{correctly predicted punctuation symbols}}{\#\ \text{expected punctuation symbols}}$
• F1 measure = $\frac{2}{1/\text{Precision} + 1/\text{Recall}}$
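These metrics can be computed from aligned gold and predicted word-layer tags, as in the minimal sketch below. The slides do not spell out the matching criterion; here a predicted symbol is assumed to be correct only when the same symbol is expected after the same word. The name `punctuation_prf` is hypothetical.

```python
def punctuation_prf(gold_tags, pred_tags, none_tag="NONE"):
    """Return (precision, recall, F1) over punctuation symbols, ignoring NONE."""
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted tag sequences must have equal length")
    correct = sum(1 for g, p in zip(gold_tags, pred_tags) if p != none_tag and p == g)
    predicted = sum(1 for p in pred_tags if p != none_tag)
    expected = sum(1 for g in gold_tags if g != none_tag)
    precision = correct / predicted if predicted else 0.0
    recall = correct / expected if expected else 0.0
    # 2PR / (P + R) is algebraically the same as 2 / (1/P + 1/R)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

gold = ["COMMA", "NONE", "NONE", "PERIOD", "NONE", "QMARK"]
pred = ["COMMA", "NONE", "NONE", "NONE", "NONE", "PERIOD"]
print(punctuation_prf(gold, pred))
```

In this toy example one of two predicted symbols is correct and one of three expected symbols is found, so precision is 1/2 and recall is 1/3.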
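Punctuation Prediction Evaluation: Correctly Recognized Texts (I)
BTEC. The first eight result columns are hidden event language model settings (duplication, pass type, LM order).
| | | NoDup-Single-3 | NoDup-Single-5 | NoDup-Cascaded-3 | NoDup-Cascaded-5 | Dup-Single-3 | Dup-Single-5 | Dup-Cascaded-3 | Dup-Cascaded-5 | L-CRF | F-CRF |
| CN | Prec. | 87.40 | 86.44 | 87.72 | 87.13 | 76.74 | 77.58 | 77.89 | 78.50 | 94.82 | 94.83 |
| CN | Rec. | 83.01 | 83.58 | 82.04 | 83.76 | 72.62 | 73.72 | 73.02 | 75.53 | 87.06 | 87.94 |
| CN | F1 | 85.15 | 84.99 | 84.79 | 85.41 | 74.63 | 75.60 | 75.37 | 76.99 | 90.78 | 91.25 |
| EN | Prec. | 64.72 | 62.70 | 62.39 | 58.10 | 85.33 | 85.74 | 84.44 | 81.37 | 88.37 | 92.76 |
| EN | Rec. | 60.76 | 59.49 | 58.57 | 55.28 | 80.42 | 80.98 | 79.43 | 77.52 | 80.28 | 84.73 |
| EN | F1 | 62.68 | 61.06 | 60.42 | 56.66 | 82.80 | 83.29 | 81.86 | 79.40 | 84.13 | 88.56 |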
• The “duplication” trick for the hidden event language model is language-specific
• Unlike English, indicative words can appear anywhere in a Chinese sentence
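Punctuation Prediction Evaluation: Correctly Recognized Texts (II)
CT. The first eight result columns are hidden event language model settings (duplication, pass type, LM order).
| | | NoDup-Single-3 | NoDup-Single-5 | NoDup-Cascaded-3 | NoDup-Cascaded-5 | Dup-Single-3 | Dup-Single-5 | Dup-Cascaded-3 | Dup-Cascaded-5 | L-CRF | F-CRF |
| CN | Prec. | 89.14 | 87.83 | 90.97 | 88.04 | 74.63 | 75.42 | 75.37 | 76.87 | 93.14 | 92.77 |
| CN | Rec. | 84.71 | 84.16 | 77.78 | 84.08 | 70.69 | 70.84 | 64.62 | 73.60 | 83.45 | 86.92 |
| CN | F1 | 86.87 | 85.96 | 83.86 | 86.01 | 72.60 | 73.06 | 69.58 | 75.20 | 88.03 | 89.75 |
| EN | Prec. | 73.86 | 73.42 | 67.02 | 65.15 | 75.87 | 77.78 | 74.75 | 74.44 | 83.07 | 86.69 |
| EN | Rec. | 68.94 | 68.79 | 62.13 | 61.23 | 70.33 | 72.56 | 69.28 | 69.93 | 76.09 | 79.62 |
| EN | F1 | 71.31 | 71.03 | 64.48 | 63.13 | 72.99 | 75.08 | 71.91 | 72.12 | 79.43 | 83.01 |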
• Significant improvement over L-CRF (p<0.01)
• Our approach is general: it requires minimal linguistic knowledge and consistently performs well across different languages
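Punctuation Prediction Evaluation: Automatically Recognized Texts
BTEC. The first eight result columns are hidden event language model settings (duplication, pass type, LM order).
| | | NoDup-Single-3 | NoDup-Single-5 | NoDup-Cascaded-3 | NoDup-Cascaded-5 | Dup-Single-3 | Dup-Single-5 | Dup-Cascaded-3 | Dup-Cascaded-5 | L-CRF | F-CRF |
| CN | Prec. | 85.96 | 84.80 | 86.48 | 85.12 | 66.86 | 68.76 | 68.00 | 68.75 | 92.81 | 93.82 |
| CN | Rec. | 81.87 | 82.78 | 83.15 | 82.78 | 63.92 | 66.12 | 65.38 | 66.48 | 85.16 | 89.01 |
| CN | F1 | 83.86 | 83.78 | 84.78 | 83.94 | 65.36 | 67.41 | 66.67 | 67.60 | 88.83 | 91.35 |
| EN | Prec. | 62.38 | 59.29 | 56.86 | 54.22 | 85.23 | 87.29 | 84.49 | 81.32 | 90.67 | 93.72 |
| EN | Rec. | 64.17 | 60.99 | 58.76 | 56.21 | 88.22 | 89.65 | 87.58 | 84.55 | 88.22 | 92.68 |
| EN | F1 | 63.27 | 60.13 | 57.79 | 55.20 | 86.70 | 88.45 | 86.00 | 82.90 | 89.43 | 93.19 |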
• 504 Chinese utterances and 498 English utterances
• Recognition accuracy: 86% and 80%, respectively
• Significant improvement (p < 0.01)
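Punctuation Prediction Evaluation: Translation Performance
BTEC, BLEU scores. The first eight result columns are hidden event language model settings (duplication, pass type, LM order).
| | NoDup-Single-3 | NoDup-Single-5 | NoDup-Cascaded-3 | NoDup-Cascaded-5 | Dup-Single-3 | Dup-Single-5 | Dup-Cascaded-3 | Dup-Cascaded-5 | L-CRF | F-CRF |
| CN → EN | 30.77 | 30.71 | 30.98 | 30.64 | 30.16 | 30.26 | 30.33 | 30.42 | 31.27 | 31.30 |
| EN → CN | 21.21 | 21.00 | 21.16 | 20.76 | 23.03 | 24.04 | 23.61 | 23.34 | 23.44 | 24.18 |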
• This tells us how well the punctuated ASR outputs can be used for downstream NLP tasks
• Use Berkeley aligner and Moses (lexicalized reordering)
• Averaged BLEU-4 scores over 10 MERT runs with random initial parameters
Conclusion
• We propose a novel approach for punctuation prediction without relying on prosodic features
– Jointly performs punctuation prediction, sentence boundary detection, and sentence type identification
– Performs better than the hidden event language model and a linear-chain CRF model
– A general approach that consistently works well across different languages
– Effective when incorporated with downstream NLP tasks