Better Punctuation Prediction with Dynamic Conditional Random Fields
Wei Lu and Hwee Tou Ng
National University of Singapore


Talk Overview

• Background
• Related Work
• Approaches
– Previous approach: Hidden Event Language Model
– Previous approach: Linear-Chain CRF
– This work: Factorial CRF
• Evaluation
• Conclusion

Punctuation Prediction

• Automatically insert punctuation symbols into transcribed speech utterances
• Widely studied in speech processing community
• Example:

>> Original speech utterance:

you are quite welcome and by the way we may get other reservations so could you please call us as soon as you fix the date

>> Punctuated (and cased) version:

You are quite welcome . And by the way , we may get other reservations , so could you please call us as soon as you fix the date ?

Our Task

Perform punctuation prediction for conversational speech texts without relying on prosodic features

• Processing prosodic features requires access to the raw speech data, which may be unavailable
• Tackles the problem from a text processing perspective

Related Work

• With prosodic features
– Kim and Woodland (2001): a decision tree framework
– Christensen et al. (2001): a finite state and a multi-layer perceptron
– Huang and Zweig (2002): a maximum entropy-based approach
– Liu et al. (2005): linear-chain conditional random fields
• Without prosodic features
– Beeferman et al. (1998): comma prediction with a trigram language model
– Gravano et al. (2009): n-gram based approach

Related Work (continued)

• One well-known approach that does not exploit prosodic features
– Stolcke et al. (1998) presented a hidden event language model
– It treats boundary detection and punctuation insertion as an inter-word hidden event detection task
– Widely used in many recent spoken language translation tasks as either a pre-processing (Wang et al., 2008) or post-processing (Kirchhoff and Yang, 2007) step

Hidden Event Language Model

• HMM (Hidden Markov Model)-based approach
– A joint distribution over words and inter-word events
– Observations are the words, and word/event pairs are hidden states
• Implemented in the SRILM toolkit (Stolcke, 2002)
• Variant of this approach
– Relocates/duplicates the ending punctuation symbol to be closer to the indicative words
– Works well for predicting English question marks

Example:

where is the nearest bus stop ?

? where is the nearest bus stop
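The relocation step in the example above can be sketched in a few lines. This is only an illustration of the idea: the function name `relocate_end_punct`, the `duplicate` flag, and the choice to move the symbol to the very start of its sentence are assumptions, not the exact preprocessing used with the hidden event language model.

```python
# Sentence-final symbols considered by this sketch (an assumption).
SENTENCE_END = {".", "?", "!"}

def relocate_end_punct(tokens, duplicate=False):
    """Move (or copy) each sentence-final symbol to the start of its sentence."""
    out, sentence = [], []
    for tok in tokens:
        if tok in SENTENCE_END:
            out.append(tok)        # the symbol now precedes its sentence
            out.extend(sentence)
            if duplicate:
                out.append(tok)    # also keep the copy at the original position
            sentence = []
        else:
            sentence.append(tok)
    out.extend(sentence)           # trailing words without an end symbol are kept as-is
    return out

print(" ".join(relocate_end_punct("where is the nearest bus stop ?".split())))
# prints: ? where is the nearest bus stop
```

With `duplicate=True` the symbol appears at both ends of the sentence, which is one plausible reading of "duplicates" on the slide.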

Linear-Chain CRF

• Linear-chain conditional random fields (L-CRF): Undirected graphical model used for sequence learning
– Avoids the strong assumptions about dependencies in the hidden event language model
– Capable of modeling dependencies with arbitrary non-independent overlapping features

[Figure: linear-chain CRF. The word-layer tags Y1, Y2, Y3, …, Yn form a chain over the utterance words X1, X2, X3, …, Xn.]

An Example L-CRF

• A linear-chain CRF assigns a single tag to each individual word at each time step
– Tags: NONE, COMMA, PERIOD, QMARK, EMARK
– Factorized features
• Sentence: no , please do not . would you save your questions for the end of my talk , when i ask for them ?

Tags:   COMMA  NONE    NONE  PERIOD  NONE   NONE  …  NONE  COMMA  NONE  …  QMARK
Words:  no     please  do    not     would  you   …  my    talk   when  …  them
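The tag assignment in the example can be reproduced from a punctuated token sequence. The sketch below assumes whitespace-tokenized input in which punctuation symbols are separate tokens; the function name `word_layer_tags` is hypothetical.

```python
# Mapping from punctuation symbols to the word-layer tags listed on the slide.
PUNCT_TO_TAG = {",": "COMMA", ".": "PERIOD", "?": "QMARK", "!": "EMARK"}

def word_layer_tags(tokens):
    """Return (words, tags): each word is tagged with the symbol that follows it."""
    words, tags = [], []
    for tok in tokens:
        if tok in PUNCT_TO_TAG:
            if tags:                       # ignore a symbol that precedes any word
                tags[-1] = PUNCT_TO_TAG[tok]
        else:
            words.append(tok)
            tags.append("NONE")
    return words, tags

sentence = ("no , please do not . would you save your questions "
            "for the end of my talk , when i ask for them ?")
words, tags = word_layer_tags(sentence.split())
for word, tag in zip(words, tags):
    print(word, tag)
```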

Features for L-CRF

• Feature factorization (Sutton et al., 2007)
– Product of a binary function on assignment of the set of cliques at each time step, and a feature function solely defined on the observation sequence
– Feature functions: n-gram (n = 1,2,3) occurrences within 5 words from the current word

Example: for the word “do”:

do@0, please@-1, would_you@[2,3], no_please_do@[-2,0]
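A minimal sketch of how such n-gram feature strings could be generated. The naming scheme follows the example for “do”; the function name `ngram_features` and the interpretation of "within 5 words" as every token of the n-gram lying at an offset in [-5, 5] are assumptions.

```python
def ngram_features(words, i, max_n=3, window=5):
    """N-gram features (n = 1..max_n) around position i, named by relative offsets."""
    feats = []
    lo = max(0, i - window)                 # leftmost position inside the window
    hi = min(len(words) - 1, i + window)    # rightmost position inside the window
    for n in range(1, max_n + 1):
        for start in range(lo, hi - n + 2):
            end = start + n - 1
            gram = "_".join(words[start:end + 1])
            if n == 1:
                feats.append(f"{gram}@{start - i}")
            else:
                feats.append(f"{gram}@[{start - i},{end - i}]")
    return feats

words = ("no please do not would you save your questions "
         "for the end of my talk when i ask for them").split()
feats = ngram_features(words, words.index("do"))
print([f for f in feats if f in {"do@0", "please@-1", "would_you@[2,3]", "no_please_do@[-2,0]"}])
```

In the factorized form described above, each such observation feature would be paired with a binary indicator on the tag assignment of a clique.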

Problems with L-CRF

• Long-range dependency between the punctuation symbols and the indicative words cannot be captured properly
• For example: no please do not would you save your questions for the end of my talk when i ask for them

It is hard to capture the long-range dependency between the ending question mark (?) and the initial phrase “would you” with a linear-chain CRF

Problems with L-CRF

• What humans might do
– no please do not would you save your questions for the end of my talk when i ask for them
– no please do not would you save your questions for the end of my talk when i ask for them
– no , please do not . would you save your questions for the end of my talk , when i ask for them ?
• Sentence-level punctuation symbols (. ? !) are associated with the complete sentence, and therefore should be assigned at the sentence level

What Do We Want?

• A model that jointly performs the following three tasks
– Sentence boundary detection (or sentence segmentation)
– Sentence type identification
– Punctuation insertion

Factorial CRF

• An instance of dynamic CRF
– Two-layer factorial CRF (F-CRF) jointly annotates an observation sequence with two label sequences
– Models the conditional probability of the label sequence pairs <Y,Z> given the observation sequence X

[Figure: two-layer factorial CRF. The sentence-layer tags Z1, Z2, Z3, …, Zn and the word-layer tags Y1, Y2, Y3, …, Yn are stacked over the utterance words X1, X2, X3, …, Xn.]
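For reference, the conditional probability can be written in the general factorial CRF form of Sutton et al. (2007), specialized to two layers. This is a sketch under that assumption: the symbols N(X), Φ_Y, Φ_Z, Ψ, λ_k and f_k are notation introduced here, and the exact clique templates used in this work may differ.

```latex
p(Y, Z \mid X) = \frac{1}{N(X)}
  \left( \prod_{t=1}^{n-1} \Phi_Y(Y_t, Y_{t+1}, X, t)\, \Phi_Z(Z_t, Z_{t+1}, X, t) \right)
  \left( \prod_{t=1}^{n} \Psi(Y_t, Z_t, X, t) \right)

% Each factor is log-linear in its features, for example
\Psi(Y_t, Z_t, X, t) = \exp\Big( \sum_k \lambda_k\, f_k(Y_t, Z_t, X, t) \Big)

% N(X) sums the same product over all label sequence pairs <Y', Z'>
```

The Φ factors link neighbouring tags within each layer, while the Ψ factors link the word-layer and sentence-layer tags at the same position; the latter are what let the two layers influence each other.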

Example of F-CRF

Sentence-layer tags: DEBEG DEIN DEIN DEIN QNBEG QNIN … QNIN QNIN QNIN … QNIN

• Propose two sets of tags for this joint task
– Word-layer: NONE, COMMA, PERIOD, QMARK, EMARK
– Sentence-layer: DEBEG, DEIN, QNBEG, QNIN, EXBEG, EXIN
– Uses an analogous feature factorization and the same feature functions as the L-CRF
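The sentence-layer tags can be derived from the same punctuated reference text as the word-layer tags. The sketch below assumes that ".", "?" and "!" close declarative (DE), question (QN) and exclamatory (EX) sentences respectively, and that trailing words without a closing symbol count as declarative; the function names are hypothetical.

```python
# Sentence type prefix for each sentence-final symbol (assumed mapping).
END_TO_TYPE = {".": "DE", "?": "QN", "!": "EX"}

def _label(n_words, kind):
    """One BEG tag followed by IN tags for a sentence of n_words words."""
    return [kind + "BEG"] + [kind + "IN"] * (n_words - 1) if n_words else []

def sentence_layer_tags(tokens, default_type="DE"):
    """Return one sentence-layer tag per word of a punctuated token sequence."""
    tags, pending = [], 0
    for tok in tokens:
        if tok in END_TO_TYPE:
            tags.extend(_label(pending, END_TO_TYPE[tok]))
            pending = 0
        elif tok != ",":                   # commas carry no sentence-level information
            pending += 1
    tags.extend(_label(pending, default_type))  # assumption: unterminated text is declarative
    return tags

sentence = ("no , please do not . would you save your questions "
            "for the end of my talk , when i ask for them ?")
print(" ".join(sentence_layer_tags(sentence.split())))
```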

Why Does it Work?

• The sentence-layer tags are used for sentence segmentation and sentence type identification
• The word-layer tags are used for punctuation insertion
• Knowledge learned from the sentence layer can guide the word-layer tagging process
• The two layers are jointly learned, so each layer provides evidence that influences the other's tagging process

Example: [no please do not] declarative sentence; [would you save your questions for the end of my talk when i ask for them] question sentence

[Figure: the sentence-layer tags QNBEG QNIN … of the question sentence, linked to the question mark (?)]
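To make the link between the two layers concrete, the sketch below reads off which sentence-final word-layer tag each sentence-layer segment implies. It is a deterministic illustration only: in the F-CRF this dependency is learned jointly rather than applied as a rule, and the function name `implied_end_tags` is hypothetical.

```python
# Word-layer tag implied at the last word of a sentence of each type (assumed mapping).
TYPE_TO_END_TAG = {"DE": "PERIOD", "QN": "QMARK", "EX": "EMARK"}

def implied_end_tags(sentence_tags):
    """Word-layer tags implied by the sentence layer; NONE where nothing is implied."""
    implied = ["NONE"] * len(sentence_tags)
    for i, tag in enumerate(sentence_tags):
        is_last_word = (i + 1 == len(sentence_tags)
                        or sentence_tags[i + 1].endswith("BEG"))
        if is_last_word:
            implied[i] = TYPE_TO_END_TAG[tag[:2]]   # "QNIN" -> "QN" -> "QMARK"
    return implied

print(implied_end_tags(["DEBEG", "DEIN", "DEIN", "DEIN", "QNBEG", "QNIN", "QNIN"]))
```

Commas are not implied by the sentence layer, so they remain the responsibility of the word layer.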

Evaluation Datasets

• IWSLT 2009 BTEC and CT datasets
• Consists of both English (EN) and Chinese (CN)
• 90% used for training, 10% for testing

                                      BTEC              CT
                                      CN       EN       CN       EN
Number of utterance pairs             19,972            10,061
Percentage of declarative sentences   64%      65%      77%      81%
Percentage of question sentences      36%      35%      22%      19%
Multiple sentences per utterance      14%      17%      29%      39%
Average words per utterance           8.59     9.46     10.18    14.33

Experimental Setup

• Designed extensive experiments for the hidden event language model
– Duplication vs. No duplication
– Single-pass vs. Cascaded
– Trigram vs. 5-gram
• Conducted the following experiments
– Accuracy on correctly recognized (CRR) texts (F1 measure)
– Accuracy on automatically recognized (ASR) texts (F1 measure)
– Translation performance with punctuated ASR texts (BLEU metric)

Punctuation Prediction: Evaluation Metrics

• Precision: $\text{Precision} = \frac{\#\,\text{correctly predicted punctuation symbols}}{\#\,\text{predicted punctuation symbols}}$
• Recall: $\text{Recall} = \frac{\#\,\text{correctly predicted punctuation symbols}}{\#\,\text{expected punctuation symbols}}$
• F1 measure: $F_1 = \frac{2}{1/\text{Precision} + 1/\text{Recall}}$
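These metrics can be computed directly from two word-layer tag sequences. The sketch assumes that a predicted symbol counts as correct only when the same symbol is expected at the same word position; the function name `punctuation_prf` and the toy sequences are illustrative.

```python
def punctuation_prf(gold_tags, pred_tags):
    """Precision, recall and F1 over punctuation symbols (all tags except NONE)."""
    correct = sum(1 for g, p in zip(gold_tags, pred_tags) if p != "NONE" and p == g)
    predicted = sum(1 for p in pred_tags if p != "NONE")
    expected = sum(1 for g in gold_tags if g != "NONE")
    precision = correct / predicted if predicted else 0.0
    recall = correct / expected if expected else 0.0
    # Harmonic mean, written as in the formula: 2 / (1/Precision + 1/Recall).
    f1 = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    return precision, recall, f1

gold = ["COMMA", "NONE", "NONE", "PERIOD", "NONE", "QMARK"]
pred = ["COMMA", "NONE", "COMMA", "PERIOD", "NONE", "PERIOD"]
print(punctuation_prf(gold, pred))  # 2 correct of 4 predicted and 3 expected
```

As a consistency check on the formula, a precision of 94.82 and a recall of 87.06 give an F1 of about 90.77, which agrees with the 90.78 reported for L-CRF on BTEC Chinese up to rounding of the inputs.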

Punctuation Prediction Evaluation: Correctly Recognized Texts (I)

               NO DUPLICATION              USE DUPLICATION
               Single Pass   Cascaded      Single Pass   Cascaded      L-CRF  F-CRF
BTEC  LM ORDER 3      5      3      5      3      5      3      5
CN    Prec.    87.40  86.44  87.72  87.13  76.74  77.58  77.89  78.50  94.82  94.83
      Rec.     83.01  83.58  82.04  83.76  72.62  73.72  73.02  75.53  87.06  87.94
      F1       85.15  84.99  84.79  85.41  74.63  75.60  75.37  76.99  90.78  91.25
EN    Prec.    64.72  62.70  62.39  58.10  85.33  85.74  84.44  81.37  88.37  92.76
      Rec.     60.76  59.49  58.57  55.28  80.42  80.98  79.43  77.52  80.28  84.73
      F1       62.68  61.06  60.42  56.66  82.80  83.29  81.86  79.40  84.13  88.56

• The “duplication” trick for hidden event language model is language specific
• Unlike English, indicative words can appear anywhere in a Chinese sentence

Punctuation Prediction Evaluation: Correctly Recognized Texts (II)

               NO DUPLICATION              USE DUPLICATION
               Single Pass   Cascaded      Single Pass   Cascaded      L-CRF  F-CRF
CT    LM ORDER 3      5      3      5      3      5      3      5
CN    Prec.    89.14  87.83  90.97  88.04  74.63  75.42  75.37  76.87  93.14  92.77
      Rec.     84.71  84.16  77.78  84.08  70.69  70.84  64.62  73.60  83.45  86.92
      F1       86.87  85.96  83.86  86.01  72.60  73.06  69.58  75.20  88.03  89.75
EN    Prec.    73.86  73.42  67.02  65.15  75.87  77.78  74.75  74.44  83.07  86.69
      Rec.     68.94  68.79  62.13  61.23  70.33  72.56  69.28  69.93  76.09  79.62
      F1       71.31  71.03  64.48  63.13  72.99  75.08  71.91  72.12  79.43  83.01

• Significant improvement over L-CRF (p<0.01)
• Our approach is general: requires minimal linguistic knowledge, consistently performs well across different languages



Punctuation Prediction Evaluation: Automatically Recognized Texts

               NO DUPLICATION              USE DUPLICATION
               Single Pass   Cascaded      Single Pass   Cascaded      L-CRF  F-CRF
      LM ORDER 3      5      3      5      3      5      3      5
CN    Prec.    85.96  84.80  86.48  85.12  66.86  68.76  68.00  68.75  92.81  93.82
      Rec.     81.87  82.78  83.15  82.78  63.92  66.12  65.38  66.48  85.16  89.01
      F1       83.86  83.78  84.78  83.94  65.36  67.41  66.67  67.60  88.83  91.35
EN    Prec.    62.38  59.29  56.86  54.22  85.23  87.29  84.49  81.32  90.67  93.72
      Rec.     64.17  60.99  58.76  56.21  88.22  89.65  87.58  84.55  88.22  92.68
      F1       63.27  60.13  57.79  55.20  86.70  88.45  86.00  82.90  89.43  93.19

• 504 Chinese utterances, and 498 English utterances
• Recognition accuracy: 86% and 80% respectively
• Significant improvement (p < 0.01)



Punctuation Prediction Evaluation: Translation Performance

               NO DUPLICATION              USE DUPLICATION
               Single Pass   Cascaded      Single Pass   Cascaded      L-CRF  F-CRF
LM ORDER       3      5      3      5      3      5      3      5
CN→EN          30.77  30.71  30.98  30.64  30.16  30.26  30.33  30.42  31.27  31.30
EN→CN          21.21  21.00  21.16  20.76  23.03  24.04  23.61  23.34  23.44  24.18

• This tells us how well the punctuated ASR outputs can be used for downstream NLP tasks
• Use Berkeley aligner and Moses (lexicalized reordering)
• Averaged BLEU-4 scores over 10 MERT runs with random initial parameters

Conclusion

• We propose a novel approach for punctuation prediction without relying on prosodic features
– Jointly performs punctuation prediction, sentence boundary detection, and sentence type identification
– Performs better than the hidden event language model and a linear-chain CRF model
– A general approach that consistently works well across different languages
– Effective when incorporated with downstream NLP tasks
