A log-linear model for unsupervised text normalization
Yi Yang and Jacob Eisenstein
Georgia Institute of Technology
Lexical variation

- Social media language is different from standard English.
  - Tweet: ya ur website suxx brah
  - Standard: yeah, your website sucks bro
- Word variants: phonetic substitutions, abbreviations, unconventional spellings, etc.
- Normalization: map from orthographic variants to standard spellings.
  - Large label space.
  - No labeled data.
Why normalization?

- Improve downstream NLP applications:
  - Part-of-speech tagging [6]
  - Dependency parsing [11]
  - Machine translation [7, 8]
- Is there more systematic variation in social media?
  - Mine patterns in how words are spelled.
  - Discover coherent orthographic styles.
Related work

Method             | Surface                        | Context                   | Final system
Han & Baldwin 2011 | edit distance, LCS             | language model, syntax    | linear combination
Liu et al. 2012    | character-based translation    | distributional similarity | decoding with language model
Hassan et al. 2013 | edit distance, LCS             | random walk               | decoding with language model
Our approach       | edit distance, LCS, word pairs | language model            | unified model with features
Our approach

- Unsupervised
- Featurized
- Context-driven
- Joint
Characteristics
Unsupervised

- Labeled data for Twitter normalization is limited.
- Language changes; annotations may become stale and ill-suited to new spellings and words.

[Figures: term frequency of "af" and of "lml" over months.]
Characteristics
Featurized: permitting overlapping features

- String similarity features.
- Lexical features that memorize conventionalized word pairs, e.g., you/u, to/2.
Characteristics
Context-driven

  give me suttin to believe in

- Unsupervised normalization needs strong cues from local context.
- String similarity must be overcome by contextual preference:
  - Edit-Dist(button/suiting, suttin) = 2
  - Edit-Dist(something, suttin) = 5
Characteristics
Joint

  gimme suttin 2 beleive innnn

- No words are in the standard vocabulary.
- Joint inference is needed to take advantage of context information.
Normalization in a probabilistic model

Maximize the likelihood of the observed (Twitter) text, marginalizing over normalizations:

  P(s) = \sum_t P(s, t)

For example,

  P(ya ur website suxx brah)
    = P(ya ur website ..., yam urn website ...)
    + P(ya ur website ..., yak your website ...)
    + P(ya ur website ..., yea cure website ...)
    + ...
A noisy channel model

  P(s, t) = P_t(t) P_e(s | t),
    e.g., P_t(yam urn website ...) × P_e(ya ur ... | yam urn ...)

- The language model P_t can be estimated offline.
- The emission model is locally normalized and log-linear:

  P_e(s | t) = \prod_n P_e(s_n | t_n),  e.g., P_e(ya | yam) P_e(ur | urn) ...

  P_e(s_n | t_n) = exp(\theta' f(s_n, t_n)) / Z(t_n),  e.g., exp(\theta' f(ur, your)) / Z(your)
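As an illustrative sketch only (not the authors' released code), the locally normalized emission can be computed by scoring every candidate source word for a target and normalizing; here `features`, `candidates`, and `theta` are hypothetical stand-ins for the feature function, candidate generator, and weight vector.

import math

def emission_prob(s_n, t_n, candidates, features, theta):
    """P_e(s_n | t_n) = exp(theta . f(s_n, t_n)) / Z(t_n), where the partition
    function Z(t_n) sums the score over every candidate source word for t_n."""
    def score(s, t):
        return math.exp(sum(theta.get(name, 0.0) * val
                            for name, val in features(s, t).items()))
    Z = sum(score(s, t_n) for s in candidates(t_n))
    return score(s_n, t_n) / Z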
Features

String similarity

- Combine edit distance and longest common subsequence (LCS) [3].
- Bin to create binary features, e.g.,

  top5(s, t) = 1 if t ∈ Top5(s), and 0 otherwise.

Word pairs

- Assign weights to commonly occurring word pairs, e.g., you/u, sucks/suxx.
- This allows the model to "memorize" frequent substitutions.
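A rough sketch of how such overlapping features might be assembled; the helper names, the binning shown, and the Top-5 candidate set are illustrative assumptions rather than the paper's exact feature set.

from nltk.metrics.distance import edit_distance  # standard Levenshtein distance

def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (simple DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ca == cb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def string_features(s, t, top5_candidates):
    """Overlapping string-similarity and lexical features for a source/target word pair."""
    feats = {
        "edit_dist=%d" % min(edit_distance(s, t), 5): 1.0,    # binned edit distance
        "lcs_ratio": lcs_length(s, t) / max(len(s), len(t)),  # LCS-based similarity
        "pair=%s/%s" % (t, s): 1.0,                           # memorized word pair, e.g. you/u
    }
    if t in top5_candidates:
        feats["top5"] = 1.0                                   # binary bin feature, as above
    return feats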
Learning

Our goal is to learn weights for these features. We compute the gradient of the likelihood.

CRF:
  \ell(\theta) = \log P(y | x)
  \partial\ell / \partial\theta = f(x, y) - E_{y|x;\theta}[f(x, y)]

Our model (MRF):
  \ell(\theta) = \log P(s)
  \partial\ell / \partial\theta = E_{t|s}[ f(s, t) - E_{s'|t}[f(s', t)] ]
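For reference, the MRF gradient above follows from differentiating the marginal log-likelihood (a standard derivation, spelled out here rather than on the slide):

\frac{\partial}{\partial\theta} \log P(s)
  = \sum_t \frac{P(s,t)}{P(s)} \frac{\partial}{\partial\theta} \log P(s,t)
  = \mathbb{E}_{t|s}\Big[\frac{\partial}{\partial\theta} \log P_e(s \mid t)\Big]
  = \mathbb{E}_{t|s}\Big[f(s,t) - \mathbb{E}_{s'|t}[f(s',t)]\Big],

using that P_t(t) does not depend on \theta and that each locally normalized log-linear factor satisfies \partial/\partial\theta \log P_e(s_n | t_n) = f(s_n, t_n) - E_{s'_n|t_n}[f(s'_n, t_n)].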
Dynamic programming

In a locally-normalized model, these expectations can be computed from the marginals P(t_n | s_{1:N}) [1].

  s:  ya    ur      website  suxx   brah
  t:  yam   urn     website  sucks  bra
      you   your    ...      suck   brat
      yeah  you're  ...      six    bro
      ...   ...     ...      ...    ...

- Outer expectation E_{t|s}: forward-backward.
- Inner expectation E_{s'|t}: sum over all possible s_n at each n.
- Time complexity: O(N V_T^2 + N V_T^2 V_S) ... but V_T, V_S > 10^4.
Sequential Monte Carlo

A randomized algorithm to approximate P(t | s) as a weighted sum of samples:

  P(t | s) ≈ \sum_k \omega^{(k)} \delta_{t^{(k)}}(t)

  P(t | ya ur website ...) ≈ 0.2 × δ(you your website ...)
                           + 0.6 × δ(yeah your website ...)
                           + ...
                           + 0.1 × δ(you you're website ...)

- Efficient and simple.
- Easy to parallelize.
- The number of samples K is an intuitive knob for trading accuracy against speed.
Sequential Importance Sampling (SIS)

At each position n, for each sample k:

- Draw t_n^{(k)} from the proposal distribution Q(t).
- Update \omega_n^{(k)} so that

  \omega_n^{(k)} \propto P(t_{1:n}^{(k)} | s_{1:n}) / Q(t_{1:n}^{(k)})
                 = ...
                 = [ P_e(s_n | t_n^{(k)}) × P_t(t_n^{(k)} | t_{n-1}^{(k)}) / Q(t_n^{(k)} | t_{n-1}^{(k)}, s_n) ] × \omega_{n-1}^{(k)}

  (emission model × language model / proposal distribution)
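A minimal sketch of one SIS update implementing the recursion above; the particle representation and the callables `candidates`, `proposal`, `p_emit`, and `p_lm` are assumptions for illustration, not the paper's implementation.

import random

def sis_step(particles, s_n, candidates, proposal, p_emit, p_lm):
    """Extend each particle (history, weight) with a sampled t_n, then rescale its
    weight by emission * language model / proposal probability."""
    extended = []
    for history, weight in particles:
        t_prev = history[-1] if history else "<s>"
        cands = candidates(s_n)
        q_scores = [proposal(t, t_prev, s_n) for t in cands]   # unnormalized Q scores
        total = sum(q_scores)
        t_n = random.choices(cands, weights=q_scores, k=1)[0]  # draw from the proposal
        q_prob = q_scores[cands.index(t_n)] / total
        weight *= p_emit(s_n, t_n) * p_lm(t_n, t_prev) / q_prob
        extended.append((history + [t_n], weight))
    return extended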
Computing the gradient from SIS

- Outer expectation:

  E_{t|s}[f(s, t)] ≈ \sum_k^K \omega_N^{(k)} \sum_n^N f(s_n, t_n^{(k)})

- Inner expectation: draw L samples of s'_n | t_n:

  E_{t|s}[ E_{s'|t}[f(s', t)] ] ≈ \sum_k^K \omega_N^{(k)} \sum_n^N (1/L) \sum_\ell^L f(s_n^{(\ell,k)}, t_n^{(k)})

(In practice, we set K = 10 and L = 1.)
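Combining the two expectations gives the stochastic gradient. A hedged sketch with L = 1, where `features` and `sample_source` (a draw from P_e(. | t_n)) are assumed helpers and the importance weights are renormalized to sum to one:

from collections import defaultdict

def sis_gradient(particles, s, features, sample_source):
    """Approximate gradient: sum over particles of weight * sum_n
    [ f(s_n, t_n) - f(s'_n, t_n) ], with one sample s'_n per position (L = 1)."""
    grad = defaultdict(float)
    total = sum(w for _, w in particles)
    for t_seq, w in particles:
        w /= total                                   # normalized importance weight
        for s_n, t_n in zip(s, t_seq):
            s_fake = sample_source(t_n)              # s'_n ~ P_e(. | t_n)
            for name, val in features(s_n, t_n).items():
                grad[name] += w * val                # observed feature counts
            for name, val in features(s_fake, t_n).items():
                grad[name] -= w * val                # model-expected feature counts
    return grad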
Example

  s_n:        ya     ur      website  ...
  t_n^{(k)}:  yam    urn     website  ...
              you    your
              yeah   you're
              ...    ...     ...
  s_n^{(l)}:  yea    youu    website  ...

Weights:
  \omega_1^{(k)} = P(yeah, ya | ◦) / Q(yeah | ◦, ya)
  \omega_2^{(k)} = [ P(your, ur | yeah) / Q(your | yeah, ur) ] \omega_1^{(k)}
  \omega_3^{(k)} = [ P_t(website, website | your) / Q(website | your, website) ] \omega_2^{(k)}
  ...

Sample t_n^{(k)} according to Q(t | ◦, ya).
Sample s_n^{(l)} according to P(s | yeah).
Update the gradient with
  \omega_N^{(k)} ( f(yeah, ya) - f(yeah, yea) + f(your, ur) - f(your, youu) + ... )
The proposal distribution

A good proposal distribution focuses the sampling on the high-probability region.

1. Q(t_n | s_n, t_{n-1}) = P(t_n | s_n, t_{n-1}) \propto P(s_n | t_n) P(t_n | t_{n-1})
   - Accurate (optimal!)
   - Not fast: computing P(s_n | t_n) costs O(V_T V_S).

2. Our proposal:
   Q(t_n | s_n, t_{n-1}) \propto P(s_n | t_n) Z(t_n) P(t_n | t_{n-1}) = exp(\theta' f(s_n, t_n)) P(t_n | t_{n-1})
   - Fast: O(V_S + V_T) to compute \omega_n^{(k)}.
   - Fairly accurate: biased by a factor of Z(t_n).
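A short sketch of the cheaper proposal: score each candidate with the emission numerator (no Z(t_n)) times the language model, then normalize over the candidate set only. The callables and the dictionary-valued weights are illustrative assumptions.

import math

def proposal_dist(s_n, t_prev, candidates, features, theta, p_lm):
    """Q(t_n | s_n, t_{n-1}) proportional to exp(theta . f(s_n, t_n)) * P_t(t_n | t_{n-1})."""
    scores = {}
    for t in candidates(s_n):
        emit_numerator = math.exp(sum(theta.get(name, 0.0) * val
                                      for name, val in features(s_n, t).items()))
        scores[t] = emit_numerator * p_lm(t, t_prev)
    total = sum(scores.values())
    return {t: v / total for t, v in scores.items()}   # normalized over candidates only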
Implementation details

unLOL: unsupervised normalization in a LOg-Linear model

- Decoding: propose K sequences t^{(k)}, then run Viterbi within this limited set.
- Normalization targets:
  - Training: all alphanumeric strings not in aspell.
  - Test: normalization targets are given (standard in this task).
- Language model: Kneser-Ney smoothed trigram model, estimated from tweets with no OOV words.
- Training data: for each OOV word in the test set, draw 50 tweets from the Edinburgh corpus [10] and the Twitter API.
Evaluation

Datasets

- LWWL11 [9]: 3802 isolated words.
- LexNorm1.1 [5]: 549 tweets with 1184 nonstandard tokens.
- LexNorm1.2: 172 manual corrections to LexNorm1.1 (www.cc.gatech.edu/~jeisenst/lexnorm.v1.2.tgz).
Results
Adding L1 Regularization

[Plot: F-measure (79 to 83) against number of features (0 to 400,000) as the L1 penalty is varied over λ = 1e-04, 5e-05, 1e-05, 5e-06, 1e-06, and 0.]
Analysis

Can normalization help identify orthographic styles?

- Use unLOL to automatically label a large set of tweets.
- Use Levenshtein alignment between the original and normalized tweets to find substitution patterns, e.g., ng/n_.
- Use matrix factorization to find sets of patterns that are used by the same sets of authors (sketched after the pipeline below).
[Pipeline diagram: tweets and their authors → unLOL-normalized tweets → character aligner → character patterns (e.g., t/_, ng/n_) → author-pattern matrix → NMF → style dictionaries and author loadings.]
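A minimal sketch of the final factorization step, assuming an author-by-pattern count matrix has already been built from the aligned tweets; the matrix below is random placeholder data and the component count is illustrative, not the paper's setting.

import numpy as np
from sklearn.decomposition import NMF

# Hypothetical author-by-pattern count matrix: rows are authors, columns are
# character substitution patterns such as "ng/n_" or "t/_".
X = np.random.poisson(0.5, size=(1000, 200)).astype(float)

nmf = NMF(n_components=10, init="nndsvd", random_state=0)
author_loadings = nmf.fit_transform(X)   # how strongly each author uses each style
style_dicts = nmf.components_            # which patterns define each style

# The highest-weighted patterns of each style form its "style dictionary".
top_patterns = np.argsort(-style_dicts, axis=1)[:, :5]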
Observations

Some styles mirror phonological variables: g-dropping, (TH)-stopping, t-deletion... [4]

style        | rules                   | examples
g-dropping   | g*/_*, ng/n_, g/_       | goin, talkin, watchin, feelin, makin
t-deletion   | t*/_*, st/s_, t/_       | jus, bc, shh, wha, gota, wea, mus, firts, jes, subsistutes
th-stopping  | h/_*, t/*d, th/d_, t/d  | dat, de, skool, fone, dese, dha, shid, dhat, dat's
Observations

Others are known "netspeak" phenomena, like expressive lengthening [2].

style        | rules                   | examples
(kd)-adding  | i_/id, _/k, _/d, _*/k*  | idk, fuckk, okk, backk, workk, badd, andd, goodd, bedd, elidgible, pidgeon
o-adding     | o_/oo, _*/o*, _/o       | soo, noo, doo, oohh, loove, thoo, helloo
e-adding     | _/ie, _/ee, _/e, _*/e*  | mee, ive, retweet, bestie, lovee, nicee, heey, likee, iphone, homie, ii, damnit
Observations

We get meaningful styles even from mistaken normalizations! (e.g., i'm/ima, out/outta)

style        | rules                   | examples
a-adding     | _/a_, _/ma, _/m, _*/a*  | ima, outta, needa, shoulda, woulda, mm, comming, tomm, boutt, ppreciate
Summary

- unLOL: joint, unsupervised, and log-linear.
  - Features balance edit distance against conventionalized word pairs.
  - Main challenge: a massive label space, in which quadratic algorithms are not practical.
  - Solution: approximate the gradient with SMC.
  - The proposal distribution focuses the samples on the high-likelihood part of the search space.
- Normalization enables the study of systematic orthographic variation.
References

[1] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein. Painless unsupervised learning with features. In Proceedings of NAACL, pages 582-590, 2010.
[2] S. Brody and N. Diakopoulos. Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!: using word lengthening to detect sentiment in microblogs. In Proceedings of EMNLP, 2011.
[3] D. Contractor, T. A. Faruquie, and L. V. Subramaniam. Unsupervised cleansing of noisy text. In Proceedings of COLING, pages 189-196, 2010.
[4] J. Eisenstein. Phonological factors in social media writing. In Proceedings of the NAACL Workshop on Language Analysis in Social Media, 2013.
[5] B. Han and T. Baldwin. Lexical normalisation of short text messages: makn sens a #twitter. In Proceedings of ACL, pages 368-378, 2011.
[6] B. Han, P. Cook, and T. Baldwin. Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology, 4(1):5, 2013.
[7] H. Hassan and A. Menezes. Social text normalization using contextual graph random walks. In Proceedings of ACL, 2013.
[8] W. Ling, C. Dyer, I. Trancoso, and A. Black. Paraphrasing 4 microblog normalization. In Proceedings of the 2013 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Seattle, USA, October 2013.
[9] F. Liu, F. Weng, B. Wang, and Y. Liu. Insertion, deletion, or substitution?: normalizing text messages without pre-categorization nor supervision. In Proceedings of ACL, pages 71-76, 2011.
[10] S. Petrović, M. Osborne, and V. Lavrenko. The Edinburgh Twitter corpus. In Proceedings of the NAACL HLT Workshop on Computational Linguistics in a World of Social Media, pages 25-26, 2010.
[11] C. Zhang, T. Baldwin, H. Ho, B. Kimelfeld, and Y. Li. Adaptive parser-centric text normalization. In Proceedings of ACL, pages 1159-1168, 2013.