A log-linear model for unsupervised text normalization
Yi Yang and Jacob Eisenstein
Georgia Institute of Technology
Lexical variation

- Social media language is different from standard English.
  - Tweet: ya ur website suxx brah
  - Standard: yeah, your website sucks bro
- Word variants: phonetic substitutions, abbreviations, unconventional spellings, etc.
- Normalization: map from orthographic variants to standard spellings.
  - Large label space.
  - No labeled data.
Why normalization?

- Improve downstream NLP applications:
  - Part-of-speech tagging [6]
  - Dependency parsing [11]
  - Machine translation [7, 8]
- Is there more systematic variation in social media?
  - Mine patterns in how words are spelled.
  - Discover coherent orthographic styles.
Related work

Method             | Surface                        | Context                   | Final system
Han & Baldwin 2011 | edit distance, LCS             | language model, syntax    | linear combination
Liu et al. 2012    | character-based translation    | distributional similarity | decoding with language model
Hassan et al. 2013 | edit distance, LCS             | random walk               | decoding with language model
Our approach       | edit distance, LCS, word pairs | language model            | unified model with features
Our approach

- Unsupervised
- Featurized
- Context-driven
- Joint
Characteristics
Unsupervised

- Labeled data for Twitter normalization is limited.
- Language changes; annotations may become stale and ill-suited to new spellings and words.

[Figures: term frequency of "af" and of "lml" over months.]
Characteristics
Featurized: permitting overlapping features

- String similarity features.
- Lexical features that memorize conventionalized word pairs, e.g., you/u, to/2.
Characteristics
Context-driven

  give me suttin to believe in

- Unsupervised normalization needs strong cues from local context.
- String similarity must be overcome by contextual preference:
  - Edit-Dist(button/suiting, suttin) = 2
  - Edit-Dist(something, suttin) = 5
Characteristics
Joint

  gimme suttin 2 beleive innnn

- No words are in the standard vocabulary.
- Joint inference is needed to take advantage of context information.
Normalization in a probabilistic model

Maximize the likelihood of the observed (Twitter) text, marginalizing over normalizations:

  P(s) = \sum_t P(s, t)

For example,

  P(ya ur website suxx brah)
    = P(ya ur website ..., yam urn website ...)
    + P(ya ur website ..., yak your website ...)
    + P(ya ur website ..., yea cure website ...)
    + ...
A noisy channel model

  P(s, t) = P_t(t) P_e(s | t),
    e.g., P_t(yam urn website ...) × P_e(ya ur ... | yam urn ...)

- The language model P_t can be estimated offline.
- The emission model is locally normalized and log-linear:

  P_e(s | t) = \prod_n P_e(s_n | t_n),  e.g., P_e(ya | yam) P_e(ur | urn) ...

  P_e(s_n | t_n) = exp(\theta' f(s_n, t_n)) / Z(t_n),  e.g., exp(\theta' f(ur, your)) / Z(your)
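As an illustrative sketch only (not the authors' released code), the locally normalized emission can be computed by scoring every candidate source word for a target and normalizing; here `features`, `candidates`, and `theta` are hypothetical stand-ins for the feature function, candidate generator, and weight vector.

import math

def emission_prob(s_n, t_n, candidates, features, theta):
    """P_e(s_n | t_n) = exp(theta . f(s_n, t_n)) / Z(t_n), where the partition
    function Z(t_n) sums the score over every candidate source word for t_n."""
    def score(s, t):
        return math.exp(sum(theta.get(name, 0.0) * val
                            for name, val in features(s, t).items()))
    Z = sum(score(s, t_n) for s in candidates(t_n))
    return score(s_n, t_n) / Z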
Features

String similarity

- Combine edit distance and longest common subsequence (LCS) [3].
- Bin to create binary features, e.g.,

  top5(s, t) = 1 if t ∈ Top5(s), and 0 otherwise.

Word pairs

- Assign weights to commonly occurring word pairs, e.g., you/u, sucks/suxx.
- This allows the model to "memorize" frequent substitutions.
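A rough sketch of how such overlapping features might be assembled; the helper names, the binning shown, and the Top-5 candidate set are illustrative assumptions rather than the paper's exact feature set.

from nltk.metrics.distance import edit_distance  # standard Levenshtein distance

def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (simple DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ca == cb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def string_features(s, t, top5_candidates):
    """Overlapping string-similarity and lexical features for a source/target word pair."""
    feats = {
        "edit_dist=%d" % min(edit_distance(s, t), 5): 1.0,    # binned edit distance
        "lcs_ratio": lcs_length(s, t) / max(len(s), len(t)),  # LCS-based similarity
        "pair=%s/%s" % (t, s): 1.0,                           # memorized word pair, e.g. you/u
    }
    if t in top5_candidates:
        feats["top5"] = 1.0                                   # binary bin feature, as above
    return feats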
Learning

Our goal is to learn weights for these features. We compute the gradient of the likelihood.

CRF:
  \ell(\theta) = \log P(y | x)
  \partial\ell / \partial\theta = f(x, y) - E_{y|x;\theta}[f(x, y)]

Our model (MRF):
  \ell(\theta) = \log P(s)
  \partial\ell / \partial\theta = E_{t|s}[ f(s, t) - E_{s'|t}[f(s', t)] ]
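For reference, the MRF gradient above follows from differentiating the marginal log-likelihood (a standard derivation, spelled out here rather than on the slide):

\frac{\partial}{\partial\theta} \log P(s)
  = \sum_t \frac{P(s,t)}{P(s)} \frac{\partial}{\partial\theta} \log P(s,t)
  = \mathbb{E}_{t|s}\Big[\frac{\partial}{\partial\theta} \log P_e(s \mid t)\Big]
  = \mathbb{E}_{t|s}\Big[f(s,t) - \mathbb{E}_{s'|t}[f(s',t)]\Big],

using that P_t(t) does not depend on \theta and that each locally normalized log-linear factor satisfies \partial/\partial\theta \log P_e(s_n | t_n) = f(s_n, t_n) - E_{s'_n|t_n}[f(s'_n, t_n)].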
Dynamic programming

In a locally-normalized model, these expectations can be computed from the marginals P(t_n | s_{1:N}) [1].

  s:  ya    ur      website  suxx   brah
  t:  yam   urn     website  sucks  bra
      you   your    ...      suck   brat
      yeah  you're  ...      six    bro
      ...   ...     ...      ...    ...

- Outer expectation E_{t|s}: forward-backward.
- Inner expectation E_{s'|t}: sum over all possible s_n at each n.
- Time complexity: O(N V_T^2 + N V_T^2 V_S) ... but V_T, V_S > 10^4.
Sequential Monte Carlo

A randomized algorithm to approximate P(t | s) as a weighted sum of samples:

  P(t | s) ≈ \sum_k \omega^{(k)} \delta_{t^{(k)}}(t)

  P(t | ya ur website ...) ≈ 0.2 × δ(you your website ...)
                           + 0.6 × δ(yeah your website ...)
                           + ...
                           + 0.1 × δ(you you're website ...)

- Efficient and simple.
- Easy to parallelize.
- The number of samples K is an intuitive knob for trading accuracy against speed.
Sequential Importance Sampling (SIS)

At each position n, for each sample k:

- Draw t_n^{(k)} from the proposal distribution Q(t).
- Update \omega_n^{(k)} so that

  \omega_n^{(k)} \propto P(t_{1:n}^{(k)} | s_{1:n}) / Q(t_{1:n}^{(k)})
                 = ...
                 = [ P_e(s_n | t_n^{(k)}) × P_t(t_n^{(k)} | t_{n-1}^{(k)}) / Q(t_n^{(k)} | t_{n-1}^{(k)}, s_n) ] × \omega_{n-1}^{(k)}

  (emission model × language model / proposal distribution)
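A minimal sketch of one SIS update implementing the recursion above; the particle representation and the callables `candidates`, `proposal`, `p_emit`, and `p_lm` are assumptions for illustration, not the paper's implementation.

import random

def sis_step(particles, s_n, candidates, proposal, p_emit, p_lm):
    """Extend each particle (history, weight) with a sampled t_n, then rescale its
    weight by emission * language model / proposal probability."""
    extended = []
    for history, weight in particles:
        t_prev = history[-1] if history else "<s>"
        cands = candidates(s_n)
        q_scores = [proposal(t, t_prev, s_n) for t in cands]   # unnormalized Q scores
        total = sum(q_scores)
        t_n = random.choices(cands, weights=q_scores, k=1)[0]  # draw from the proposal
        q_prob = q_scores[cands.index(t_n)] / total
        weight *= p_emit(s_n, t_n) * p_lm(t_n, t_prev) / q_prob
        extended.append((history + [t_n], weight))
    return extended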
Computing the gradient from SIS

- Outer expectation:

  E_{t|s}[f(s, t)] ≈ \sum_k^K \omega_N^{(k)} \sum_n^N f(s_n, t_n^{(k)})

- Inner expectation: draw L samples of s'_n | t_n:

  E_{t|s}[ E_{s'|t}[f(s', t)] ] ≈ \sum_k^K \omega_N^{(k)} \sum_n^N (1/L) \sum_\ell^L f(s_n^{(\ell,k)}, t_n^{(k)})

(In practice, we set K = 10 and L = 1.)
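Combining the two expectations gives the stochastic gradient. A hedged sketch with L = 1, where `features` and `sample_source` (a draw from P_e(. | t_n)) are assumed helpers and the importance weights are renormalized to sum to one:

from collections import defaultdict

def sis_gradient(particles, s, features, sample_source):
    """Approximate gradient: sum over particles of weight * sum_n
    [ f(s_n, t_n) - f(s'_n, t_n) ], with one sample s'_n per position (L = 1)."""
    grad = defaultdict(float)
    total = sum(w for _, w in particles)
    for t_seq, w in particles:
        w /= total                                   # normalized importance weight
        for s_n, t_n in zip(s, t_seq):
            s_fake = sample_source(t_n)              # s'_n ~ P_e(. | t_n)
            for name, val in features(s_n, t_n).items():
                grad[name] += w * val                # observed feature counts
            for name, val in features(s_fake, t_n).items():
                grad[name] -= w * val                # model-expected feature counts
    return grad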
Example

  s_n:        ya     ur      website  ...
  t_n^{(k)}:  yam    urn     website  ...
              you    your
              yeah   you're
              ...    ...     ...
  s_n^{(l)}:  yea    youu    website  ...

Weights:
  \omega_1^{(k)} = P(yeah, ya | ◦) / Q(yeah | ◦, ya)
  \omega_2^{(k)} = [ P(your, ur | yeah) / Q(your | yeah, ur) ] \omega_1^{(k)}
  \omega_3^{(k)} = [ P_t(website, website | your) / Q(website | your, website) ] \omega_2^{(k)}
  ...

Sample t_n^{(k)} according to Q(t | ◦, ya).
Sample s_n^{(l)} according to P(s | yeah).
Update the gradient with
  \omega_N^{(k)} ( f(yeah, ya) - f(yeah, yea) + f(your, ur) - f(your, youu) + ... )
The proposal distribution

A good proposal distribution focuses the sampling on the high-probability region.

1. Q(t_n | s_n, t_{n-1}) = P(t_n | s_n, t_{n-1}) \propto P(s_n | t_n) P(t_n | t_{n-1})
   - Accurate (optimal!)
   - Not fast: computing P(s_n | t_n) costs O(V_T V_S).

2. Our proposal:
   Q(t_n | s_n, t_{n-1}) \propto P(s_n | t_n) Z(t_n) P(t_n | t_{n-1}) = exp(\theta' f(s_n, t_n)) P(t_n | t_{n-1})
   - Fast: O(V_S + V_T) to compute \omega_n^{(k)}.
   - Fairly accurate: biased by a factor of Z(t_n).
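A short sketch of the cheaper proposal: score each candidate with the emission numerator (no Z(t_n)) times the language model, then normalize over the candidate set only. The callables and the dictionary-valued weights are illustrative assumptions.

import math

def proposal_dist(s_n, t_prev, candidates, features, theta, p_lm):
    """Q(t_n | s_n, t_{n-1}) proportional to exp(theta . f(s_n, t_n)) * P_t(t_n | t_{n-1})."""
    scores = {}
    for t in candidates(s_n):
        emit_numerator = math.exp(sum(theta.get(name, 0.0) * val
                                      for name, val in features(s_n, t).items()))
        scores[t] = emit_numerator * p_lm(t, t_prev)
    total = sum(scores.values())
    return {t: v / total for t, v in scores.items()}   # normalized over candidates only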
Implementation details

unLOL: unsupervised normalization in a LOg-Linear model

- Decoding: propose K sequences t^{(k)}, then run Viterbi within this limited set.
- Normalization targets:
  - Training: all alphanumeric strings not in aspell.
  - Test: normalization targets are given (standard in this task).
- Language model: Kneser-Ney smoothed trigram model, estimated from tweets with no OOV words.
- Training data: for each OOV word in the test set, draw 50 tweets from the Edinburgh corpus [10] and the Twitter API.
Evaluation

Datasets

- LWWL11 [9]: 3802 isolated words.
- LexNorm1.1 [5]: 549 tweets with 1184 nonstandard tokens.
- LexNorm1.2: 172 manual corrections to LexNorm1.1 (www.cc.gatech.edu/~jeisenst/lexnorm.v1.2.tgz).
Results
Adding L1 Regularization

[Plot: F-measure (79 to 83) against number of features (0 to 400,000) as the L1 penalty is varied over λ = 1e-04, 5e-05, 1e-05, 5e-06, 1e-06, and 0.]
Analysis

Can normalization help identify orthographic styles?

- Use unLOL to automatically label a large set of tweets.
- Use Levenshtein alignment between the original and normalized tweets to find substitution patterns, e.g., ng/n_.
- Use matrix factorization to find sets of patterns that are used by the same sets of authors (sketched after the pipeline below).
[Pipeline diagram: tweets and their authors → unLOL-normalized tweets → character aligner → character patterns (e.g., t/_, ng/n_) → author-pattern matrix → NMF → style dictionaries and author loadings.]
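A minimal sketch of the final factorization step, assuming an author-by-pattern count matrix has already been built from the aligned tweets; the matrix below is random placeholder data and the component count is illustrative, not the paper's setting.

import numpy as np
from sklearn.decomposition import NMF

# Hypothetical author-by-pattern count matrix: rows are authors, columns are
# character substitution patterns such as "ng/n_" or "t/_".
X = np.random.poisson(0.5, size=(1000, 200)).astype(float)

nmf = NMF(n_components=10, init="nndsvd", random_state=0)
author_loadings = nmf.fit_transform(X)   # how strongly each author uses each style
style_dicts = nmf.components_            # which patterns define each style

# The highest-weighted patterns of each style form its "style dictionary".
top_patterns = np.argsort(-style_dicts, axis=1)[:, :5]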
Observations

Some styles mirror phonological variables: g-dropping, (TH)-stopping, t-deletion... [4]

style        | rules                   | examples
g-dropping   | g*/_*, ng/n_, g/_       | goin, talkin, watchin, feelin, makin
t-deletion   | t*/_*, st/s_, t/_       | jus, bc, shh, wha, gota, wea, mus, firts, jes, subsistutes
th-stopping  | h/_*, t/*d, th/d_, t/d  | dat, de, skool, fone, dese, dha, shid, dhat, dat's
Observations

Others are known "netspeak" phenomena, like expressive lengthening [2].

style        | rules                   | examples
(kd)-adding  | i_/id, _/k, _/d, _*/k*  | idk, fuckk, okk, backk, workk, badd, andd, goodd, bedd, elidgible, pidgeon
o-adding     | o_/oo, _*/o*, _/o       | soo, noo, doo, oohh, loove, thoo, helloo
e-adding     | _/ie, _/ee, _/e, _*/e*  | mee, ive, retweet, bestie, lovee, nicee, heey, likee, iphone, homie, ii, damnit
Observations

We get meaningful styles even from mistaken normalizations! (e.g., i'm/ima, out/outta)

style        | rules                   | examples
a-adding     | _/a_, _/ma, _/m, _*/a*  | ima, outta, needa, shoulda, woulda, mm, comming, tomm, boutt, ppreciate
Summary

- unLOL: joint, unsupervised, and log-linear.
  - Features balance edit distance against conventionalized word pairs.
  - Main challenge: a massive label space, in which quadratic algorithms are not practical.
  - Solution: approximate the gradient with SMC.
  - The proposal distribution focuses the samples on the high-likelihood part of the search space.
- Normalization enables the study of systematic orthographic variation.
References

[1] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein. Painless unsupervised learning with features. In Proceedings of NAACL, pages 582-590, 2010.
[2] S. Brody and N. Diakopoulos. Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!: using word lengthening to detect sentiment in microblogs. In Proceedings of EMNLP, 2011.
[3] D. Contractor, T. A. Faruquie, and L. V. Subramaniam. Unsupervised cleansing of noisy text. In Proceedings of COLING, pages 189-196, 2010.
[4] J. Eisenstein. Phonological factors in social media writing. In Proceedings of the NAACL Workshop on Language Analysis in Social Media, 2013.
[5] B. Han and T. Baldwin. Lexical normalisation of short text messages: makn sens a #twitter. In Proceedings of ACL, pages 368-378, 2011.
[6] B. Han, P. Cook, and T. Baldwin. Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology, 4(1):5, 2013.
[7] H. Hassan and A. Menezes. Social text normalization using contextual graph random walks. In Proceedings of ACL, 2013.
[8] W. Ling, C. Dyer, I. Trancoso, and A. Black. Paraphrasing 4 microblog normalization. In Proceedings of the 2013 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Seattle, USA, October 2013.
[9] F. Liu, F. Weng, B. Wang, and Y. Liu. Insertion, deletion, or substitution?: normalizing text messages without pre-categorization nor supervision. In Proceedings of ACL, pages 71-76, 2011.
[10] S. Petrović, M. Osborne, and V. Lavrenko. The Edinburgh Twitter corpus. In Proceedings of the NAACL HLT Workshop on Computational Linguistics in a World of Social Media, pages 25-26, 2010.
[11] C. Zhang, T. Baldwin, H. Ho, B. Kimelfeld, and Y. Li. Adaptive parser-centric text normalization. In Proceedings of ACL, pages 1159-1168, 2013.