
A log-linear model for unsupervised text normalization

Yi Yang and Jacob Eisenstein

Georgia Institute of Technology

Lexical variation

- Social media language is different from standard English.
  - Tweet: ya ur website suxx brah
  - Standard: yeah, your website sucks bro
- Word variants: phonetic substitutions, abbreviations, unconventional spellings, etc.
- Normalization: map from orthographic variants to standard spellings.
  - Large label space.
  - No labeled data.


Why normalization?

- Improve downstream NLP applications:
  - Part-of-speech tagging [6]
  - Dependency parsing [11]
  - Machine translation [7, 8]
- Is there more systematic variation in social media?
  - Mine patterns in how words are spelled.
  - Discover coherent orthographic styles.


Related work

Method              Surface                         Context                    Final system
Han & Baldwin 2011  edit distance, LCS              language model, syntax     linear combination
Liu et al. 2012     character-based translation     distributional similarity  decoding with language model
Hassan et al. 2013  edit distance, LCS              random walk                decoding with language model
Our approach        edit distance, LCS, word pairs  language model             unified model with features

Our approach

- Unsupervised
- Featurized
- Context-driven
- Joint

Characteristics

Unsupervised

- Labeled data for Twitter normalization is limited.
- Language changes; annotations may become stale and ill-suited to new spellings and words.

[Figures: term frequency of "af" over months; term frequency of "lml" over months]

Characteristics

Featurized: permitting overlapping features

- String similarity features.
- Lexical features that memorize conventionalized word pairs, e.g. you/u, to/2.

Characteristics

Context-driven

  give me suttin to believe in

- Unsupervised normalization needs the strong cue of local context.
- String similarity must sometimes be overcome by contextual preference:
  - Edit-Dist(button/suiting, suttin) = 2
  - Edit-Dist(something, suttin) = 5

Characteristics

Joint

  gimme suttin 2 beleive innnn

- No words are in the standard vocabulary.
- Joint inference is needed to take advantage of context information.

Normalization in a probabilistic model

Maximize the likelihood of the observed (Twitter) text, marginalizing over normalizations:

P(s) = \sum_t P(s, t)

For example,

P(ya ur website suxx brah)
  = P(ya ur website ..., yam urn website ...)
  + P(ya ur website ..., yak your website ...)
  + P(ya ur website ..., yea cure website ...)
  + ...


A noisy channel model

P(s, t) = P_t(t) \, P_e(s | t)

e.g., P(ya ur ..., yam urn ...) = P_t(yam urn website ...) \times P_e(ya ur ... | yam urn ...)

- The language model P_t can be estimated offline.
- The emission model is locally normalized and log-linear (a code sketch follows):

P_e(s | t) = \prod_n P_e(s_n | t_n), \quad e.g., \; P_e(ya | yam) \, P_e(ur | urn) \ldots

P_e(s_n | t_n) = \frac{\exp(\theta' f(s_n, t_n))}{Z(t_n)}, \quad e.g., \; \frac{\exp(\theta' f(ur, your))}{Z(your)}

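To make the locally normalized emission model concrete, here is a minimal Python sketch, assuming a sparse feature function f(s, t) returned as a dict and a fixed candidate source vocabulary for the partition function Z(t). The names are hypothetical, not the authors' implementation.

```python
import math

def emission_prob(s_word, t_word, theta, feature_fn, source_vocab):
    """P_e(s | t) = exp(theta . f(s, t)) / Z(t), where Z(t) sums the
    unnormalized scores of every candidate source word s' given t."""
    def score(s, t):
        return math.exp(sum(theta.get(name, 0.0) * value
                            for name, value in feature_fn(s, t).items()))
    z = sum(score(s_prime, t_word) for s_prime in source_vocab)  # Z(t)
    return score(s_word, t_word) / z
```

Note that evaluating Z(t) costs O(V_S) per target word; this is exactly the cost the proposal distribution later avoids during sampling.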

Features

String similarity

- Combine edit distance and longest common subsequence (LCS) [3].
- Bin the scores to create binary features, e.g.,

top5(s, t) = \begin{cases} 1, & t \in \text{Top5}(s) \\ 0, & \text{otherwise} \end{cases}

Word pairs

- Assign weights to commonly occurring word pairs, e.g. you/u, sucks/suxx.
- This allows the model to "memorize" frequent substitutions.

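A sketch of the two feature families in Python. The Levenshtein and LCS routines are the standard dynamic programs; how [3] combines them into a single similarity score is not spelled out here, so the ratio below is an illustrative assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete
                            curr[j - 1] + 1,             # insert
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def features(s: str, t: str, candidates) -> dict:
    """Binary features for a source/target pair: a binned similarity
    feature (is t among the top-5 candidates for s?) plus a word-pair
    indicator that lets the model memorize pairs like you/u."""
    sim = lambda c: lcs_length(s, c) / (edit_distance(s, c) + 1.0)
    top5 = sorted(candidates, key=sim, reverse=True)[:5]
    feats = {f"pair:{s}/{t}": 1.0}
    if t in top5:
        feats["top5"] = 1.0
    return feats
```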

Learning

Our goal is to learn weights for these features, so we compute the gradient of the likelihood.

CRF (supervised):

\ell(\theta) = \log P(y | x), \qquad
\frac{\partial \ell}{\partial \theta} = f(x, y) - E_{y|x;\theta}[f(x, y)]

Our model (MRF, unsupervised):

\ell(\theta) = \log P(s), \qquad
\frac{\partial \ell}{\partial \theta} = E_{t|s}\big[ f(s, t) - E_{s'|t}[f(s', t)] \big]

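Spelling out where the unsupervised gradient comes from: the language model does not depend on \theta, so only the locally normalized emission factors contribute:

```latex
\frac{\partial \ell}{\partial \theta}
  = \frac{\partial}{\partial \theta} \log \sum_t P(s, t)
  = \sum_t P(t \mid s) \, \frac{\partial}{\partial \theta} \log P(s, t)
  = E_{t|s}\Big[ \sum_n \frac{\partial}{\partial \theta} \log P_e(s_n \mid t_n) \Big]
  = E_{t|s}\Big[ \sum_n f(s_n, t_n) - E_{s'|t_n}[f(s', t_n)] \Big]
```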

Dynamic programming

In a locally normalized model, these expectations can be computed from the marginals P(t_n | s_{1:N}) [1].

  s:  ◦  ya    ur      website  suxx   brah  ◦
  t:     yam   urn     website  sucks  bra
         you   your             suck   brat
         yeah  you're           six    bro
         ...   ...     ...      ...    ...

- Outer expectation E_{t|s}: forward-backward.
- Inner expectation E_{s'|t}: sum over all possible s_n at each n.
- Time complexity: O(N V_T^2 + N V_T^2 V_S) ... but V_T, V_S > 10^4.

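For reference, a compact sketch of the exact computation that this complexity bound describes, and that the sampling approach below replaces. `emission(s, t)` = P_e(s|t) and `lm(t, t_prev)` = P_t(t|t_prev) are assumed callables; underflow handling is omitted.

```python
def forward_backward(source, target_vocab, emission, lm):
    """Posterior marginals P(t_n | s_{1:N}) for the noisy-channel chain;
    runs in O(N * V_T^2), which is what makes V_T > 10^4 prohibitive."""
    N = len(source)
    # Forward: alpha[n][t] = P(s_{1:n}, t_n = t)
    alpha = [{t: lm(t, "<s>") * emission(source[0], t) for t in target_vocab}]
    for n in range(1, N):
        alpha.append({t: emission(source[n], t) *
                         sum(alpha[n - 1][p] * lm(t, p) for p in target_vocab)
                      for t in target_vocab})
    # Backward: beta[n][t] = P(s_{n+1:N} | t_n = t)
    beta = [dict.fromkeys(target_vocab, 1.0) for _ in range(N)]
    for n in range(N - 2, -1, -1):
        beta[n] = {t: sum(lm(q, t) * emission(source[n + 1], q) * beta[n + 1][q]
                          for q in target_vocab)
                   for t in target_vocab}
    z = sum(alpha[N - 1][t] for t in target_vocab)
    return [{t: alpha[n][t] * beta[n][t] / z for t in target_vocab}
            for n in range(N)]
```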

Sequential Monte Carlo

A randomized algorithm to approximate P(t | s) as a weighted sum:

P(t | s) \approx \sum_k \omega^{(k)} \delta_{t^{(k)}}(t)

P(t | ya ur website ...) \approx 0.2 \times \delta(you your website ...)
  + 0.6 \times \delta(yeah your website ...)
  + 0.1 \times \delta(you you're website ...)
  + \ldots

- Efficient and simple.
- Easy to parallelize.
- The number of samples K provides intuitive tuning between accuracy and speed.

Sequential Importance Sampling (SIS)

At each n, for each sample k:

- Draw t_n^{(k)} from the proposal distribution Q(t).
- Update \omega_n^{(k)} so that

\omega_n^{(k)} \propto \frac{P(t_{1:n}^{(k)} | s_{1:n})}{Q(t_{1:n}^{(k)})}
  = \frac{P_e(s_n | t_n^{(k)}) \; P_t(t_n^{(k)} | t_{n-1}^{(k)})}{Q(t_n^{(k)} | t_{n-1}^{(k)}, s_n)} \, \omega_{n-1}^{(k)}

(emission model \times language model, divided by the proposal distribution)
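A minimal sketch of this sampler in Python, assuming `emission`, `lm`, and `proposal` callables (the proposal returns candidate targets and their probabilities); a practical implementation would also resample when the weights degenerate.

```python
import random

def sis_normalize(source, emission, lm, proposal, K=10):
    """Run K sequential importance samplers over the source tokens, updating
    each weight by P_e(s_n|t_n) * P_t(t_n|t_prev) / Q(t_n|t_prev, s_n)."""
    samples = [{"seq": ["<s>"], "w": 1.0} for _ in range(K)]
    for s_n in source:
        for smp in samples:
            t_prev = smp["seq"][-1]
            candidates, probs = proposal(t_prev, s_n)    # Q(. | t_prev, s_n)
            t_n = random.choices(candidates, weights=probs)[0]
            q = probs[candidates.index(t_n)]
            smp["w"] *= emission(s_n, t_n) * lm(t_n, t_prev) / q
            smp["seq"].append(t_n)
    total = sum(smp["w"] for smp in samples)
    for smp in samples:
        smp["w"] /= total                                # self-normalize
    return samples
```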

Computing the gradient from SIS

- Outer expectation:

E_{t|s}[f(s, t)] \approx \sum_k^K \omega_N^{(k)} \sum_n^N f(s_n, t_n^{(k)})

- Inner expectation: draw L samples of s'_n | t_n:

E_{t|s}\big[ E_{s'|t}[f(s', t)] \big] \approx \sum_k^K \omega_N^{(k)} \sum_n^N \frac{1}{L} \sum_\ell^L f(s_n^{(\ell,k)}, t_n^{(k)})

(In practice, we set K = 10 and L = 1.)

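Continuing the sketch above: given the self-normalized samples from `sis_normalize`, the stochastic gradient is the weighted difference between observed and model-sampled features. `emission_sampler(t)`, which draws s' ~ P_e(.|t), is an assumed helper.

```python
from collections import defaultdict

def gradient_estimate(source, samples, emission_sampler, feature_fn, L=1):
    """Approximate gradient of log P(s): the outer expectation uses the
    SIS weights; the inner expectation is a Monte Carlo estimate with L
    draws of s' | t_n (L = 1 in the talk)."""
    grad = defaultdict(float)
    for smp in samples:                          # output of sis_normalize()
        w = smp["w"]
        for n, s_n in enumerate(source):
            t_n = smp["seq"][n + 1]              # skip the <s> start symbol
            for name, v in feature_fn(s_n, t_n).items():
                grad[name] += w * v              # observed-feature term
            for _ in range(L):
                s_prime = emission_sampler(t_n)  # s' ~ P_e(. | t_n)
                for name, v in feature_fn(s_prime, t_n).items():
                    grad[name] -= w * v / L      # model-expectation term
    return dict(grad)
```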

Example

  s_n:          ya    ur      website  ...
  t_n^{(k)}:    yam   urn     website  ...
                you   your
                yeah  you're
  s_n^{(\ell)}: yea   youu    website  ...

\omega_1^{(k)} = \frac{P(yeah, ya | \circ)}{Q(yeah | \circ, ya)}, \quad
\omega_2^{(k)} = \frac{P(your, ur | yeah)}{Q(your | yeah, ur)} \, \omega_1^{(k)}, \quad
\omega_3^{(k)} = \frac{P(website, website | your)}{Q(website | your, website)} \, \omega_2^{(k)}, \; \ldots

- Sample t_n^{(k)} according to Q(t | \circ, ya).
- Sample s_n^{(\ell)} according to P_e(s | yeah).
- Update the gradient with

\omega_N^{(k)} \big( f(yeah, ya) - f(yeah, yea) + f(your, ur) - f(your, youu) + \ldots \big)


The proposal distribution

A good proposal distribution guides sampling toward the high-probability region.

1. Q(t_n | s_n, t_{n-1}) = P(t_n | s_n, t_{n-1}) \propto P(s_n | t_n) P(t_n | t_{n-1})
   - Accurate (optimal!)
   - Not fast: computing P(s_n | t_n) costs O(V_T V_S).

2. Our proposal (sketched below):

Q(t_n | s_n, t_{n-1}) \propto P(s_n | t_n) Z(t_n) P(t_n | t_{n-1}) = \exp(\theta' f(s_n, t_n)) \, P(t_n | t_{n-1})

   - Fast: O(V_S + V_T) to compute \omega_n^{(k)}.
   - Fairly accurate: biased by a factor of Z(t_n).

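In code, dropping Z(t_n) means the proposal never sums over the source vocabulary. A sketch compatible with the `sis_normalize` sketch above, again with hypothetical names:

```python
import math

def make_proposal(theta, feature_fn, lm, target_vocab):
    """Q(t_n | t_prev, s_n) proportional to exp(theta . f(s_n, t_n)) * P_t(t_n | t_prev).
    Using the unnormalized emission score avoids the O(V_S) normalizer
    Z(t_n), at the cost of biasing the weights by a factor of Z(t_n)."""
    def proposal(t_prev, s_n):
        scores = [math.exp(sum(theta.get(name, 0.0) * v
                               for name, v in feature_fn(s_n, t).items()))
                  * lm(t, t_prev)
                  for t in target_vocab]
        total = sum(scores)
        return target_vocab, [sc / total for sc in scores]
    return proposal
```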

Implementation details

unLOL: unsupervised normalization in a LOg-Linear model

- Decoding: propose K sequences t^{(k)}, then run Viterbi within this limited set (sketched below).
- Normalization targets:
  - Training: all alphanumeric strings not in aspell.
  - Test: normalization targets are given (standard in this task).
- Language model: Kneser-Ney smoothed trigram model, estimated from tweets with no OOV words.
- Training data: for each OOV word in the test set, draw 50 tweets from the Edinburgh corpus [10] and the Twitter API.
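One way to realize the decoding step in the first bullet: restrict each position's candidates to the words the K samples proposed there, then run Viterbi over that small lattice. This is a sketch under stated assumptions (`emission` and `lm` as in the earlier sketches, all probabilities nonzero), not the authors' exact procedure.

```python
import math

def decode(source, samples, emission, lm):
    """Viterbi over a lattice whose states at position n are the words
    proposed by the K SIS samples at that position."""
    N = len(source)
    cands = [sorted({smp["seq"][n + 1] for smp in samples}) for n in range(N)]
    # delta[t]: best log-score of a path ending in t at the current position
    delta = {t: math.log(lm(t, "<s>") * emission(source[0], t)) for t in cands[0]}
    back = []
    for n in range(1, N):
        new_delta, ptr = {}, {}
        for t in cands[n]:
            prev = max(delta, key=lambda p: delta[p] + math.log(lm(t, p)))
            new_delta[t] = (delta[prev] + math.log(lm(t, prev))
                            + math.log(emission(source[n], t)))
            ptr[t] = prev
        delta = new_delta
        back.append(ptr)
    t = max(delta, key=delta.get)      # best final state
    path = [t]
    for ptr in reversed(back):         # follow back-pointers
        t = ptr[t]
        path.append(t)
    return list(reversed(path))
```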

Evaluation

Datasets

- LWWL11 [9]: 3802 isolated words.
- LexNorm1.1 [5]: 549 tweets with 1184 nonstandard tokens.
- LexNorm1.2: 172 manual corrections to LexNorm1.1 (www.cc.gatech.edu/~jeisenst/lexnorm.v1.2.tgz).


Results

Adding L1 regularization

[Figure: F-measure (79-83) vs. number of features (0-400,000), for regularization strengths \lambda = 10^{-4}, 5 \times 10^{-5}, 10^{-5}, 5 \times 10^{-6}, 10^{-6}, and 0]

Analysis

Can normalization help identify orthographic styles?

- Use unLOL to automatically label lots of tweets.
- Use Levenshtein alignment between the original and normalized tweets to find substitution patterns, e.g., ng/n_.
- Use matrix factorization to find sets of patterns that are used by the same set of authors (see the pipeline and sketch below).

[Pipeline diagram: tweets from many authors → unLOL-normalized tweets → character aligner → character patterns (e.g., t/_, ng/n_) → author-pattern matrix → NMF → style dictionaries and author loadings]
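A sketch of the factorization step using scikit-learn's NMF. The author-pattern matrix below is synthetic stand-in data, since the real counts come from the aligner.

```python
import numpy as np
from sklearn.decomposition import NMF

# Rows: authors. Columns: character substitution patterns (e.g. "ng/n_").
# Entries: how often each author uses each pattern. Synthetic stand-in:
rng = np.random.default_rng(0)
X = rng.poisson(0.5, size=(200, 40)).astype(float)

model = NMF(n_components=5, init="nndsvda", max_iter=500)
W = model.fit_transform(X)    # author loadings: authors x styles
H = model.components_         # style dictionaries: styles x patterns

# Most characteristic patterns of each style:
top_patterns = np.argsort(-H, axis=1)[:, :5]
```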

Observations

Some styles mirror phonological variables: g-dropping, (TH)-stopping, t-deletion... [4]

style        rules                     examples
g-dropping   g*/_*, ng/n_, g/_         goin, talkin, watchin, feelin, makin
t-deletion   t*/_*, st/s_, t/_         jus, bc, shh, wha, gota, wea, mus, firts, jes, subsistutes
th-stopping  h/_*, t/*d, th/d_, t/d    dat, de, skool, fone, dese, dha, shid, dhat, dat's

Observations

Others are known "netspeak" phenomena, like expressive lengthening [2].

style        rules                     examples
(kd)-adding  i_/id, _/k, _/d, _*/k*    idk, fuckk, okk, backk, workk, badd, andd, goodd, bedd, elidgible, pidgeon
o-adding     o_/oo, _*/o*, _/o         soo, noo, doo, oohh, loove, thoo, helloo
e-adding     _/ie, _/ee, _/e, _*/e*    mee, ive, retweet, bestie, lovee, nicee, heey, likee, iphone, homie, ii, damnit

Observations

We get meaningful styles even from mistaken normalizations! (e.g., i'm/ima, out/outta)

style        rules                     examples
a-adding     _/a_, _/ma, _/m, _*/a*    ima, outta, needa, shoulda, woulda, mm, comming, tomm, boutt, ppreciate

Summary

- unLOL: joint, unsupervised, and log-linear.
  - Features balance between edit distance and conventionalized word pairs.
- Main challenge: a massive label space, in which quadratic algorithms are not practical.
  - Solution: approximate the gradient with SMC.
  - The proposal distribution focuses the samples on the high-likelihood part of the search space.
- Normalization enables the study of systematic orthographic variation.

References I

[1] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein. Painless unsupervised learning with features. In Proceedings of NAACL, pages 582–590, 2010.

[2] S. Brody and N. Diakopoulos. Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!: using word lengthening to detect sentiment in microblogs. In Proceedings of EMNLP, 2011.

[3] D. Contractor, T. A. Faruquie, and L. V. Subramaniam. Unsupervised cleansing of noisy text. In Proceedings of COLING, pages 189–196, 2010.

[4] J. Eisenstein. Phonological factors in social media writing. In Proceedings of the NAACL Workshop on Language Analysis in Social Media, 2013.

References II

[5] B. Han and T. Baldwin. Lexical normalisation of short text messages: makn sens a #twitter. In Proceedings of ACL, pages 368–378, 2011.

[6] B. Han, P. Cook, and T. Baldwin. Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology, 4(1):5, 2013.

[7] H. Hassan and A. Menezes. Social text normalization using contextual graph random walks. In Proceedings of ACL, 2013.

References III

[8] W. Ling, C. Dyer, I. Trancoso, and A. Black. Paraphrasing 4 microblog normalization. In Proceedings of the 2013 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Seattle, USA, October 2013. Association for Computational Linguistics.

[9] F. Liu, F. Weng, B. Wang, and Y. Liu. Insertion, deletion, or substitution?: normalizing text messages without pre-categorization nor supervision. In Proceedings of ACL, pages 71–76, 2011.

[10] S. Petrović, M. Osborne, and V. Lavrenko. The Edinburgh Twitter corpus. In Proceedings of the NAACL HLT Workshop on Computational Linguistics in a World of Social Media, pages 25–26, 2010.

References IV

[11] C. Zhang, T. Baldwin, H. Ho, B. Kimelfeld, and Y. Li. Adaptive parser-centric text normalization. In Proceedings of ACL, pages 1159–1168, 2013.