
Adaptive Parser-Centric Text Normalization


DESCRIPTION

Wonderful work done with Congle Zhang (my summer intern in 2012) and my IBM colleagues. Nominated for the best paper award and presented at ACL 2013.

Congle Zhang, Tyler Baldwin, Howard Ho, Benny Kimelfeld, Yunyao Li. Adaptive Parser-Centric Text Normalization. Proceedings of ACL, pp. 1159–1168, 2013.


Page 1: Adaptive Parser-Centric Text Normalization


Adaptive Parser-Centric

Text Normalization

Congle Zhang* Tyler Baldwin** Howard Ho** Benny Kimelfeld** Yunyao Li**

* University of Washington **IBM Research - Almaden

Page 2: Adaptive Parser-Centric Text Normalization

(Diagram: public text such as news, SEC filings, and USPTO patents, web text such as social media, and private text such as internal and subscription data all feed into text analytics, which powers applications in marketing, financial investment, drug discovery, law enforcement, and more.)

Text analytics is the key to discovering hidden value in text.

Page 3: Adaptive Parser-Centric Text Normalization

DREAM

Page 4: Adaptive Parser-Centric Text Normalization

REALITY

Page 5: Adaptive Parser-Centric Text Normalization

Image from http://samasource.org

Page 6: Adaptive Parser-Centric Text Normalization

CAN YOU READ THIS ON THE FIRST ATTEMPT?

Page 7: Adaptive Parser-Centric Text Normalization

ay woundent of see ’ em

CAN YOU READ THIS ON THE FIRST ATTEMPT?


I would not have seen them.

Page 8: Adaptive Parser-Centric Text Normalization

When a machine reads it

Results from Google Translate

Chinese 唉看见他们woundent

Spanish ay woundent de verlas

Japanese ローマ法王進呈の AY woundent

Portuguese ay woundent de vê-los

German ay woundent de voir 'em

Page 9: Adaptive Parser-Centric Text Normalization

Text Normalization
• Informal writing → standard written form

ay woundent of see ’ em
  ↓ normalize
I would not have seen them.

Page 10: Adaptive Parser-Centric Text Normalization

Challenge: Grammar

Text normalization as mapping out-of-vocabulary non-standard tokens to their in-vocabulary standard form:
  ay woundent of see ’ em → would not of see them
vs. the grammatical sentence: I would not have seen them.

Page 11: Adaptive Parser-Centric Text Normalization

Challenge: Domain Adaptation

Tailor the same text normalization solution to the different writing styles of different data sources.

Page 12: Adaptive Parser-Centric Text Normalization

Challenge: Evaluation
• Previous: word error rate & BLEU score
• However,
  – Words are not equally important
  – Non-word information (punctuation, capitalization) can be important
  – Word reordering is important
• How does the normalization actually impact the downstream applications?

Page 13: Adaptive Parser-Centric Text Normalization

Adaptive Parser-Centric Text Normalization

Grammatical Sentence

Domain Transferable

Parsing performance

Page 14: Adaptive Parser-Centric Text Normalization

Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion

Page 15: Adaptive Parser-Centric Text Normalization

Model: Replacement Generator

• Replacement <i,j,s>: replace tokens x_i … x_(j-1) with s
• Domain customization
  – Generic (cross-domain) replacements
  – Domain-specific replacements

Example: Ay(1) woudent(2) of(3) see(4) ‘em(5)
  <2,3,”would not”>  (Edit)
  <1,2,”Ay”>  (Same)
  <1,2,”I”>  (Edit)
  <1,2,ε>  (Delete)
  <6,6,”.”>  (Insert)
  …
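As a concrete illustration, here is a minimal Python sketch of the replacement tuple and two simple generators (not the authors' code; the generators and the toy dictionary are illustrative assumptions):

    from collections import namedtuple

    # A replacement <i, j, s>: replace tokens x_i .. x_(j-1) with the string s.
    Replacement = namedtuple("Replacement", ["i", "j", "s"])

    CONTRACTIONS = {"woudent": "would not"}   # toy, domain-specific dictionary

    def leave_intact(tokens):
        # the generic generator that keeps every token unchanged
        return [Replacement(i, i + 1, tok) for i, tok in enumerate(tokens, start=1)]

    def contraction_generator(tokens):
        # a dictionary-based generator proposing expansions for known non-standard tokens
        return [Replacement(i, i + 1, CONTRACTIONS[tok.lower()])
                for i, tok in enumerate(tokens, start=1)
                if tok.lower() in CONTRACTIONS]

    tokens = ["Ay", "woudent", "of", "see", "'em"]
    print(leave_intact(tokens) + contraction_generator(tokens))
    # ... ends with Replacement(i=2, j=3, s='would not')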

Page 16: Adaptive Parser-Centric Text Normalization

Model: Boolean Variables
• Associate a unique Boolean variable X_r with each replacement r
  – X_r = true: replacement r is used to produce the output sentence

<2,3,”would not”> = true

… would not …

Page 17: Adaptive Parser-Centric Text Normalization

Model: Normalization Graph

• A graphical model over the input: Ay woudent of see ‘em
• Nodes (one per candidate replacement): *START*, <1,2,”Ay”>, <1,2,”I”>, <2,3,”would”>, <2,4,”would not have”>, <3,4,”of”>, <4,5,”seen”>, <4,6,”see him”>, <5,6,”them”>, <6,6,”.”>, *END*

Page 18: Adaptive Parser-Centric Text Normalization

Model: Legal Assignment
• Sound
  – No two true replacements overlap
  – <1,2,”Ay”> and <1,2,”I”> cannot both be true
• Completeness
  – Every input token is captured by at least one true replacement

Page 19: Adaptive Parser-Centric Text Normalization

Model: Legal = Path
• A legal assignment = a path from *START* to *END*

(Normalization graph as on Page 17; the highlighted path yields:)
Output: I would not have see him.

Page 20: Adaptive Parser-Centric Text Normalization

Model: Assignment Probability

• Log-linear model; feature functions on edges

(Normalization graph as on Page 17.)
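As a sketch of the probability this defines (the notation below is assumed here rather than taken from the slides), for a legal assignment a over input x:

    P(a | x) = (1 / Z(x)) · exp( Σ_{e ∈ path(a)} θ · f(e) )

where f(e) is the feature vector of edge e, θ the weight vector to be learned, and Z(x) the normalizing constant.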

Page 21: Adaptive Parser-Centric Text Normalization

Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion

Page 22: Adaptive Parser-Centric Text Normalization

Inference
• Select the assignment with the highest probability
• Computationally hard on general graphical models …
• But in our model it boils down to finding the longest path in a weighted, directed acyclic graph (see the sketch below)
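A minimal sketch of that inference step (assumed data structures, not the authors' implementation): dynamic programming over the nodes in topological order, keeping the best score and a back-pointer for each node.

    def best_path(nodes, edges, score):
        """nodes: list of nodes in topological order, nodes[0] = *START*, nodes[-1] = *END*;
        edges: dict mapping a node to its successor nodes;
        score: function(u, v) -> weight of edge (u, v), e.g. theta . f(u, v)."""
        best = {nodes[0]: (0.0, None)}           # node -> (best score so far, back-pointer)
        for u in nodes:
            if u not in best:                    # node not reachable from *START*
                continue
            for v in edges.get(u, []):
                cand = best[u][0] + score(u, v)
                if v not in best or cand > best[v][0]:
                    best[v] = (cand, u)
        path, cur = [], nodes[-1]                # follow back-pointers from *END*
        while cur is not None:
            path.append(cur)
            cur = best[cur][1]
        return list(reversed(path))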

Page 23: Adaptive Parser-Centric Text Normalization

Inference

• Weighted longest path

(Normalization graph as on Page 17; the selected path yields:)
Output: I would not have see him.

Page 24: Adaptive Parser-Centric Text Normalization

Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion

Page 25: Adaptive Parser-Centric Text Normalization

Learning

• Perceptron-style algorithm
  – Update weights by comparing (1) the most probable output under the current weights with (2) the gold sequence

Input: (1) informal sentence: Ay woudent of see ‘em; (2) gold sentence: I would not have seen them.; (3) the normalization graph
Output: weights of the features

Page 26: Adaptive Parser-Centric Text Normalization

Learning: Gold vs. Inferred

(Normalization graph as on Page 17, with two paths highlighted:)
• the gold sequence
• the most probable sequence with current θ

Page 27: Adaptive Parser-Centric Text Normalization

Learning: Update Weights on the Differential Edges

(Normalization graph as on Page 17.)
• Increase the weights w_i on the differential edges so that the gold sequence becomes “longer”
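A minimal sketch of this perceptron-style update (the edge and feature representations below are assumed, not the authors' code): reward features on edges that appear only on the gold path, penalize features on edges that appear only on the predicted path.

    from collections import Counter

    def perceptron_update(weights, gold_edges, predicted_edges, features, lr=1.0):
        """weights: Counter of feature name -> weight (updated in place);
        gold_edges / predicted_edges: sets of edges on the two paths;
        features: function(edge) -> dict of feature name -> value."""
        for e in gold_edges - predicted_edges:      # edges only on the gold path
            for name, value in features(e).items():
                weights[name] += lr * value         # make the gold path score higher ("longer")
        for e in predicted_edges - gold_edges:      # edges only on the predicted path
            for name, value in features(e).items():
                weights[name] -= lr * value
        return weights

    # Toy usage with one differential edge on each path:
    w = Counter()
    feats = lambda e: {"lineage=" + e: 1.0}
    perceptron_update(w, {"gold_edge"}, {"pred_edge"}, feats)
    print(w)  # Counter({'lineage=gold_edge': 1.0, 'lineage=pred_edge': -1.0})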

Page 28: Adaptive Parser-Centric Text Normalization

Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion

Page 29: Adaptive Parser-Centric Text Normalization

Instantiation: Replacement Generators

Generator                From       To
leave intact             good       good
edit distance            bac        back
lowercase                NEED       need
capitalize               it         It
Google spell             dispaear   disappear
contraction              wouldn’t   would not
slang language           ima        I am going to
insert punctuation       ε          .
duplicated punctuation   !?         !
delete filler            lmao       ε

Page 30: Adaptive Parser-Centric Text Normalization

Instantiation: Features
• N-gram
  – Frequency of the phrases induced by an edge
• Part-of-speech
  – Encourage certain behavior, such as avoiding the deletion of noun phrases
• Positional
  – Capitalize words after stop punctuation
• Lineage
  – Which generator spawned the replacement

Page 31: Adaptive Parser-Centric Text Normalization

Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion

Page 32: Adaptive Parser-Centric Text Normalization

Evaluation Metrics: Compare Parses

(Evaluation setup:) The input sentence is normalized by a human expert into a gold sentence, and by the normalizer into a normalized sentence. Both are parsed, and the gold parse is compared with the normalized parse.

Focus on subjects, verbs, and objects (SVO).

Page 33: Adaptive Parser-Centric Text Normalization

Evaluation Metrics: Example

Test: I kinda wanna get ipad NEW
Gold: I kind of want to get a new iPad.

Test SVO: verb(get); subj(get,I), subj(get,wanna), obj(get,NEW)
Gold SVO: verb(want), verb(get); subj(want,I), subj(get,I), obj(get,iPad)

precision_v = 1/1, recall_v = 1/2
precision_so = 1/3, recall_so = 1/3
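A minimal sketch of how these scores can be computed, treating each parse as a set of relation tuples (an illustrative assumption, not the paper's evaluation script):

    def precision_recall(test_relations, gold_relations):
        test, gold = set(test_relations), set(gold_relations)
        matched = len(test & gold)
        precision = matched / len(test) if test else 0.0
        recall = matched / len(gold) if gold else 0.0
        return precision, recall

    # The verb comparison from this slide:
    p, r = precision_recall({("verb", "get")}, {("verb", "want"), ("verb", "get")})
    assert (p, r) == (1.0, 0.5)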

Page 34: Adaptive Parser-Centric Text Normalization

Evaluation: Baselines
• w/oN: without normalization
• Google: Google spell checker
• w2wN: word-to-word normalization [Han and Baldwin 2011]
• Gw2wN: gold standard for the word-to-word normalizations of previous work (whenever available)

Page 35: Adaptive Parser-Centric Text Normalization

Evaluation: Domains

• Twitter [Han and Baldwin 2011]
  – Gold: Grammatical sentences
• SMS [Choudhury et al 2007]
  – Gold: Grammatical sentences
• Call-Center Log: proprietary
  – Text-based responses about users’ experience with a call center for a major company
  – Gold: Grammatical sentences

Page 36: Adaptive Parser-Centric Text Normalization

Evaluation: Twitter

• Twitter-specific replacement generators
  – Hashtags (#), ats (@), and retweets (RT)
  – Generators that allow either the initial symbol or the entire token to be deleted (see the sketch below)
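A minimal sketch of such a Twitter-specific generator (a hypothetical helper, not the authors' code), producing <i, j, s> candidates as token-index triples:

    def twitter_replacements(tokens):
        candidates = []
        for i, tok in enumerate(tokens, start=1):       # 1-based positions, as in <i,j,s>
            if tok.startswith(("#", "@")):
                candidates.append((i, i + 1, tok[1:]))  # keep the token, drop the symbol
                candidates.append((i, i + 1, ""))       # or delete the whole token
            elif tok == "RT":
                candidates.append((i, i + 1, ""))       # delete the retweet marker
        return candidates

    print(twitter_replacements(["RT", "@user", "#nlp", "rocks"]))
    # [(1, 2, ''), (2, 3, 'user'), (2, 3, ''), (3, 4, 'nlp'), (3, 4, '')]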

Page 37: Adaptive Parser-Centric Text Normalization

Evaluation: Twitter

System           Verb              Subject-Object
                 Pre   Rec   F1    Pre   Rec   F1
w/oN             83.7  68.1  75.1  31.7  38.6  34.8
Google           88.9  78.8  83.5  36.1  46.3  40.6
w2wN             87.5  81.5  84.4  44.5  58.9  50.7
Gw2wN            89.8  83.8  86.7  46.9  61.0  53.0
generic          91.7  88.9  90.3  53.6  70.2  60.8
domain specific  95.3  88.7  91.9  72.5  76.3  74.4

Domain-specific generators yielded the best overall performance

Page 38: Adaptive Parser-Centric Text Normalization

Evaluation: Twitter

(Results table repeated from the previous slide.)

w/o domain-specific generators, our system outperformed the word-to-word normalization approaches

Page 39: Adaptive Parser-Centric Text Normalization

Evaluation: Twitter

(Results table repeated from the previous slides.)

Even perfect word-to-word normalization is not good enough!

Page 40: Adaptive Parser-Centric Text Normalization

Evaluation: SMS

• SMS-specific replacement generator: a mapping dictionary of SMS abbreviations

Page 41: Adaptive Parser-Centric Text Normalization

Evaluation: SMS

System           Verb              Subject-Object
                 Pre   Rec   F1    Pre   Rec   F1
w/oN             76.4  48.1  59.0  19.5  21.5  20.4
Google           85.1  61.6  71.5  22.4  26.2  24.1
w2wN             78.5  61.5  68.9  29.9  36.0  32.6
Gw2wN            87.6  76.6  81.8  38.0  50.6  43.4
generic          86.5  77.4  81.7  35.5  47.7  40.7
domain specific  88.1  75.0  81.0  41.0  49.5  44.8

Page 42: Adaptive Parser-Centric Text Normalization

Evaluation: Call-Center

• Call Center-specific generator: a mapping dictionary of call-center abbreviations (e.g., “rep.” → “representative”)

Page 43: Adaptive Parser-Centric Text Normalization

Evaluation: Call-Center

System           Verb              Subject-Object
                 Pre   Rec   F1    Pre   Rec   F1
w/oN             98.5  97.1  97.8  69.2  66.1  67.6
Google           99.2  97.9  98.5  70.5  67.3  68.8
generic          98.9  97.4  98.1  71.3  67.9  69.6
domain specific  99.2  97.4  98.3  87.9  83.1  85.4

Page 44: Adaptive Parser-Centric Text Normalization

Discussion
• Domain transfer with a small amount of effort is possible
• Performing normalization is indeed beneficial to dependency parsing
  – Simple word-to-word normalization is not enough

Page 45: Adaptive Parser-Centric Text Normalization

Conclusion
• Normalization framework with an eye toward domain adaptation
• Parser-centric view of normalization
• Our system outperformed competitive baselines over three different domains
• Dataset to spur future research
  – https://www.cs.washington.edu/node/9091/

Page 46: Adaptive Parser-Centric Text Normalization

Team
