
Adaptive Parser-Centric Text Normalization


DESCRIPTION

Wonderful work done with Congle Zhang (my summer intern in 2012) and my IBM colleagues. Nominated for the best paper award and presented at ACL 2013.

Congle Zhang, Tyler Baldwin, Howard Ho, Benny Kimelfeld, Yunyao Li. Adaptive Parser-Centric Text Normalization. Proceedings of ACL, pp. 1159–1168, 2013.


Page 1: Adaptive Parser-Centric Text Normalization


Adaptive Parser-Centric

Text Normalization

Congle Zhang* Tyler Baldwin** Howard Ho** Benny Kimelfeld** Yunyao Li**

* University of Washington **IBM Research - Almaden

Page 2: Adaptive Parser-Centric Text Normalization

(Diagram: public text such as news, SEC filings, and USPTO patents, web text such as social media, and private text such as internal and subscription data all feed into text analytics, which powers applications in marketing, financial investment, drug discovery, law enforcement, and more.)

Text analytics is the key to discovering hidden value in text.

Page 3: Adaptive Parser-Centric Text Normalization

DREAM

Page 4: Adaptive Parser-Centric Text Normalization

REALITY

Page 5: Adaptive Parser-Centric Text Normalization

Image from http://samasource.org

Page 6: Adaptive Parser-Centric Text Normalization

CAN YOU READ THIS ON THE FIRST ATTEMPT?

Page 7: Adaptive Parser-Centric Text Normalization

ay woundent of see ’ em

CAN YOU READ THIS ON THE FIRST ATTEMPT?


I would not have seen them.

Page 8: Adaptive Parser-Centric Text Normalization

When a machine reads it

Results from Google Translate

Chinese 唉看见他们woundent

Spanish ay woundent de verlas

Japanese ローマ法王進呈の AY woundent

Portuguese ay woundent de vê-los

German ay woundent de voir 'em

Page 9: Adaptive Parser-Centric Text Normalization

Text Normalization
• Informal writing → standard written form

ay woundent of see ’ em
  ↓ normalize
I would not have seen them.

Page 10: Adaptive Parser-Centric Text Normalization

Challenge: Grammar

Text normalization as mapping out-of-vocabulary non-standard tokens to their in-vocabulary standard form:
  ay woundent of see ’ em → would not of see them
vs. the grammatical sentence: I would not have seen them.

Page 11: Adaptive Parser-Centric Text Normalization

Challenge: Domain Adaptation

Tailor the same text normalization solution to the different writing styles of different data sources.

Page 12: Adaptive Parser-Centric Text Normalization

Challenge: Evaluation
• Previous: word error rate & BLEU score
• However,
  – Words are not equally important
  – Non-word information (punctuation, capitalization) can be important
  – Word reordering is important
• How does the normalization actually impact the downstream applications?

Page 13: Adaptive Parser-Centric Text Normalization

Adaptive Parser-Centric Text Normalization

Grammatical Sentence

Domain Transferable

Parsing performance

Page 14: Adaptive Parser-Centric Text Normalization

Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion

Page 15: Adaptive Parser-Centric Text Normalization

Model: Replacement Generator

• Replacement <i,j,s>: replace tokens x_i … x_(j-1) with s
• Domain customization
  – Generic (cross-domain) replacements
  – Domain-specific replacements

Example: Ay(1) woudent(2) of(3) see(4) ‘em(5)
  <2,3,”would not”>  (Edit)
  <1,2,”Ay”>  (Same)
  <1,2,”I”>  (Edit)
  <1,2,ε>  (Delete)
  <6,6,”.”>  (Insert)
  …
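As a concrete illustration, here is a minimal Python sketch of the replacement tuple and two simple generators (not the authors' code; the generators and the toy dictionary are illustrative assumptions):

    from collections import namedtuple

    # A replacement <i, j, s>: replace tokens x_i .. x_(j-1) with the string s.
    Replacement = namedtuple("Replacement", ["i", "j", "s"])

    CONTRACTIONS = {"woudent": "would not"}   # toy, domain-specific dictionary

    def leave_intact(tokens):
        # the generic generator that keeps every token unchanged
        return [Replacement(i, i + 1, tok) for i, tok in enumerate(tokens, start=1)]

    def contraction_generator(tokens):
        # a dictionary-based generator proposing expansions for known non-standard tokens
        return [Replacement(i, i + 1, CONTRACTIONS[tok.lower()])
                for i, tok in enumerate(tokens, start=1)
                if tok.lower() in CONTRACTIONS]

    tokens = ["Ay", "woudent", "of", "see", "'em"]
    print(leave_intact(tokens) + contraction_generator(tokens))
    # ... ends with Replacement(i=2, j=3, s='would not')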

Page 16: Adaptive Parser-Centric Text Normalization

Model: Boolean Variables
• Associate a unique Boolean variable X_r with each replacement r
  – X_r = true: replacement r is used to produce the output sentence

<2,3,”would not”> = true

… would not …

Page 17: Adaptive Parser-Centric Text Normalization

Model: Normalization Graph

• A graphical model over the input: Ay woudent of see ‘em
• Nodes (one per candidate replacement): *START*, <1,2,”Ay”>, <1,2,”I”>, <2,3,”would”>, <2,4,”would not have”>, <3,4,”of”>, <4,5,”seen”>, <4,6,”see him”>, <5,6,”them”>, <6,6,”.”>, *END*

Page 18: Adaptive Parser-Centric Text Normalization

Model: Legal Assignment
• Sound
  – No two true replacements overlap
  – <1,2,”Ay”> and <1,2,”I”> cannot both be true
• Completeness
  – Every input token is captured by at least one true replacement

Page 19: Adaptive Parser-Centric Text Normalization

Model: Legal = Path
• A legal assignment = a path from *START* to *END*

(Normalization graph as on Page 17; the highlighted path yields:)
Output: I would not have see him.

Page 20: Adaptive Parser-Centric Text Normalization

Model: Assignment Probability

• Log-linear model; feature functions on edges

(Normalization graph as on Page 17.)
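As a sketch of the probability this defines (the notation below is assumed here rather than taken from the slides), for a legal assignment a over input x:

    P(a | x) = (1 / Z(x)) · exp( Σ_{e ∈ path(a)} θ · f(e) )

where f(e) is the feature vector of edge e, θ the weight vector to be learned, and Z(x) the normalizing constant.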

Page 21: Adaptive Parser-Centric Text Normalization

Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion

Page 22: Adaptive Parser-Centric Text Normalization

Inference
• Select the assignment with the highest probability
• Computationally hard on general graphical models …
• But in our model it boils down to finding the longest path in a weighted, directed acyclic graph (see the sketch below)
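A minimal sketch of that inference step (assumed data structures, not the authors' implementation): dynamic programming over the nodes in topological order, keeping the best score and a back-pointer for each node.

    def best_path(nodes, edges, score):
        """nodes: list of nodes in topological order, nodes[0] = *START*, nodes[-1] = *END*;
        edges: dict mapping a node to its successor nodes;
        score: function(u, v) -> weight of edge (u, v), e.g. theta . f(u, v)."""
        best = {nodes[0]: (0.0, None)}           # node -> (best score so far, back-pointer)
        for u in nodes:
            if u not in best:                    # node not reachable from *START*
                continue
            for v in edges.get(u, []):
                cand = best[u][0] + score(u, v)
                if v not in best or cand > best[v][0]:
                    best[v] = (cand, u)
        path, cur = [], nodes[-1]                # follow back-pointers from *END*
        while cur is not None:
            path.append(cur)
            cur = best[cur][1]
        return list(reversed(path))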

Page 23: Adaptive Parser-Centric Text Normalization

Inference

• Weighted longest path

(Normalization graph as on Page 17; the selected path yields:)
Output: I would not have see him.

Page 24: Adaptive Parser-Centric Text Normalization

Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion

Page 25: Adaptive Parser-Centric Text Normalization

Learning

• Perceptron-style algorithm
  – Update weights by comparing (1) the most probable output under the current weights with (2) the gold sequence

Input: (1) informal sentence: Ay woudent of see ‘em; (2) gold sentence: I would not have seen them.; (3) the normalization graph
Output: weights of the features

Page 26: Adaptive Parser-Centric Text Normalization

Learning: Gold vs. Inferred

(Normalization graph as on Page 17, with two paths highlighted:)
• the gold sequence
• the most probable sequence with current θ

Page 27: Adaptive Parser-Centric Text Normalization

Learning: Update Weights on the Differential Edges

(Normalization graph as on Page 17.)
• Increase the weights w_i on the differential edges so that the gold sequence becomes “longer”
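A minimal sketch of this perceptron-style update (the edge and feature representations below are assumed, not the authors' code): reward features on edges that appear only on the gold path, penalize features on edges that appear only on the predicted path.

    from collections import Counter

    def perceptron_update(weights, gold_edges, predicted_edges, features, lr=1.0):
        """weights: Counter of feature name -> weight (updated in place);
        gold_edges / predicted_edges: sets of edges on the two paths;
        features: function(edge) -> dict of feature name -> value."""
        for e in gold_edges - predicted_edges:      # edges only on the gold path
            for name, value in features(e).items():
                weights[name] += lr * value         # make the gold path score higher ("longer")
        for e in predicted_edges - gold_edges:      # edges only on the predicted path
            for name, value in features(e).items():
                weights[name] -= lr * value
        return weights

    # Toy usage with one differential edge on each path:
    w = Counter()
    feats = lambda e: {"lineage=" + e: 1.0}
    perceptron_update(w, {"gold_edge"}, {"pred_edge"}, feats)
    print(w)  # Counter({'lineage=gold_edge': 1.0, 'lineage=pred_edge': -1.0})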

Page 28: Adaptive Parser-Centric Text Normalization

Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion

Page 29: Adaptive Parser-Centric Text Normalization

Instantiation: Replacement Generators

Generator                From       To
leave intact             good       good
edit distance            bac        back
lowercase                NEED       need
capitalize               it         It
Google spell             dispaear   disappear
contraction              wouldn’t   would not
slang language           ima        I am going to
insert punctuation       ε          .
duplicated punctuation   !?         !
delete filler            lmao       ε

Page 30: Adaptive Parser-Centric Text Normalization

Instantiation: Features
• N-gram
  – Frequency of the phrases induced by an edge
• Part-of-speech
  – Encourage certain behavior, such as avoiding the deletion of noun phrases
• Positional
  – Capitalize words after stop punctuation
• Lineage
  – Which generator spawned the replacement

Page 31: Adaptive Parser-Centric Text Normalization

Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion

Page 32: Adaptive Parser-Centric Text Normalization

Evaluation Metrics: Compare Parses

(Evaluation setup:) The input sentence is normalized by a human expert into a gold sentence, and by the normalizer into a normalized sentence. Both are parsed, and the gold parse is compared with the normalized parse.

Focus on subjects, verbs, and objects (SVO).

Page 33: Adaptive Parser-Centric Text Normalization

Evaluation Metrics: Example

Test: I kinda wanna get ipad NEW
Gold: I kind of want to get a new iPad.

Test SVO: verb(get); subj(get,I), subj(get,wanna), obj(get,NEW)
Gold SVO: verb(want), verb(get); subj(want,I), subj(get,I), obj(get,iPad)

precision_v = 1/1, recall_v = 1/2
precision_so = 1/3, recall_so = 1/3
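A minimal sketch of how these scores can be computed, treating each parse as a set of relation tuples (an illustrative assumption, not the paper's evaluation script):

    def precision_recall(test_relations, gold_relations):
        test, gold = set(test_relations), set(gold_relations)
        matched = len(test & gold)
        precision = matched / len(test) if test else 0.0
        recall = matched / len(gold) if gold else 0.0
        return precision, recall

    # The verb comparison from this slide:
    p, r = precision_recall({("verb", "get")}, {("verb", "want"), ("verb", "get")})
    assert (p, r) == (1.0, 0.5)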

Page 34: Adaptive Parser-Centric Text Normalization

Evaluation: Baselines
• w/oN: without normalization
• Google: Google spell checker
• w2wN: word-to-word normalization [Han and Baldwin 2011]
• Gw2wN: gold standard for the word-to-word normalizations of previous work (whenever available)

Page 35: Adaptive Parser-Centric Text Normalization

Evaluation: Domains

• Twitter [Han and Baldwin 2011]
  – Gold: Grammatical sentences
• SMS [Choudhury et al 2007]
  – Gold: Grammatical sentences
• Call-Center Log: proprietary
  – Text-based responses about users’ experience with a call center for a major company
  – Gold: Grammatical sentences

Page 36: Adaptive Parser-Centric Text Normalization

Evaluation: Twitter

• Twitter-specific replacement generators
  – Hashtags (#), ats (@), and retweets (RT)
  – Generators that allow either the initial symbol or the entire token to be deleted (see the sketch below)
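A minimal sketch of such a Twitter-specific generator (a hypothetical helper, not the authors' code), producing <i, j, s> candidates as token-index triples:

    def twitter_replacements(tokens):
        candidates = []
        for i, tok in enumerate(tokens, start=1):       # 1-based positions, as in <i,j,s>
            if tok.startswith(("#", "@")):
                candidates.append((i, i + 1, tok[1:]))  # keep the token, drop the symbol
                candidates.append((i, i + 1, ""))       # or delete the whole token
            elif tok == "RT":
                candidates.append((i, i + 1, ""))       # delete the retweet marker
        return candidates

    print(twitter_replacements(["RT", "@user", "#nlp", "rocks"]))
    # [(1, 2, ''), (2, 3, 'user'), (2, 3, ''), (3, 4, 'nlp'), (3, 4, '')]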

Page 37: Adaptive Parser-Centric Text Normalization

Evaluation: Twitter

System           Verb              Subject-Object
                 Pre   Rec   F1    Pre   Rec   F1
w/oN             83.7  68.1  75.1  31.7  38.6  34.8
Google           88.9  78.8  83.5  36.1  46.3  40.6
w2wN             87.5  81.5  84.4  44.5  58.9  50.7
Gw2wN            89.8  83.8  86.7  46.9  61.0  53.0
generic          91.7  88.9  90.3  53.6  70.2  60.8
domain specific  95.3  88.7  91.9  72.5  76.3  74.4

Domain-specific generators yielded the best overall performance

Page 38: Adaptive Parser-Centric Text Normalization

Evaluation: Twitter

(Results table repeated from the previous slide.)

w/o domain-specific generators, our system outperformed the word-to-word normalization approaches

Page 39: Adaptive Parser-Centric Text Normalization

Evaluation: Twitter

(Results table repeated from the previous slides.)

Even perfect word-to-word normalization is not good enough!

Page 40: Adaptive Parser-Centric Text Normalization

Evaluation: SMS

• SMS-specific replacement generator: a mapping dictionary of SMS abbreviations

Page 41: Adaptive Parser-Centric Text Normalization

Evaluation: SMS

System           Verb              Subject-Object
                 Pre   Rec   F1    Pre   Rec   F1
w/oN             76.4  48.1  59.0  19.5  21.5  20.4
Google           85.1  61.6  71.5  22.4  26.2  24.1
w2wN             78.5  61.5  68.9  29.9  36.0  32.6
Gw2wN            87.6  76.6  81.8  38.0  50.6  43.4
generic          86.5  77.4  81.7  35.5  47.7  40.7
domain specific  88.1  75.0  81.0  41.0  49.5  44.8

Page 42: Adaptive Parser-Centric Text Normalization

Evaluation: Call-Center

• Call Center-specific generator: a mapping dictionary of call-center abbreviations (e.g., “rep.” → “representative”)

Page 43: Adaptive Parser-Centric Text Normalization

Evaluation: Call-Center

System           Verb              Subject-Object
                 Pre   Rec   F1    Pre   Rec   F1
w/oN             98.5  97.1  97.8  69.2  66.1  67.6
Google           99.2  97.9  98.5  70.5  67.3  68.8
generic          98.9  97.4  98.1  71.3  67.9  69.6
domain specific  99.2  97.4  98.3  87.9  83.1  85.4

Page 44: Adaptive Parser-Centric Text Normalization

Discussion
• Domain transfer with a small amount of effort is possible
• Performing normalization is indeed beneficial to dependency parsing
  – Simple word-to-word normalization is not enough

Page 45: Adaptive Parser-Centric Text Normalization

Conclusion
• Normalization framework with an eye toward domain adaptation
• Parser-centric view of normalization
• Our system outperformed competitive baselines over three different domains
• Dataset to spur future research
  – https://www.cs.washington.edu/node/9091/

Page 46: Adaptive Parser-Centric Text Normalization

Team
