OVERVIEW OF STATISTICAL NATURAL LANGUAGE PROCESSINGcs785/SNLP-Overview.pdf · OVERVIEW OF STATISTICAL NATURAL LANGUAGE PROCESSING May 18, 2016 1 Dr. Fei Song School of Computer Science

OVERVIEW OF STATISTICAL NATURAL LANGUAGE PROCESSING

May 18, 2016

1

Dr. Fei Song School of Computer Science University of Guelph

Outline

q  What is Statistical Natural Language Processing (SNLP)?

q  Language Models for Information Retrieval q  Text Classification and Sentiment Analysis q  Probabilistic Models (LDA, Bayesian HMM,

and POSLDA) for language processing q  References

2

What is SNLP? 3

q  Infer and rank the structures from text based on statistical language modeling. §  Probability and Statistics §  Machine Learning Techniques

q  Started in late 1950’s, but didn’t get popular until early 1980’s.

q  Many applications: Information Retrieval, Information Extraction, Text Classification, Text Mining, and Biological Data Analysis.

Language Modeling 4

q  A statistical language model requires the estimates for such probabilities: P(w1,n) = P(w1,w2,…,wn)

q  Probabilities to word sequences?

q  Left-context only? §  The {big, pig} dog … §  P(dog|the big) >> P(dog|the pig)

P(w1 w2 … wn) = P(w1) P(w2|w1) … P(wn|w1 w2 … wn-1) e.g., Jack went to the {hospital, number, if, … }

Noisy Channel Framework

q  Through decoding, we want to find the most likely input for the given observation.

5

i)|p(i)p(oargmax p(o)

i)|p(i)p(oargmax o)|p(iargmax Iiii

===ˆ

Decoder Noisy Channel p(o|i)

I O I

§  Applications: machine translation, optical character recognition, speech recognition, spelling correction.

Language Models for IR

q  N-gram models: Unigram: P(w1,n) = P(w1) P(w2) … P(wn)

Bigram: P(w1,n) = P(w1) P(w2| w1) … P(wn|wn-1)

Trigram: P(w1,n) = P(w1) P(w2| w1) … P(wn|wn-2,n-1)

q  Documents as language samples:

∏=

=n

iin dtPdtttP

121 )|()|,...,,(

6

Language Models for IR

q  Query as a generation process:

),...,,(/)|,...,,()(),...,,|(

2121

21

mm

m

tttPdtttPdPtttdP

⇒

)|,...,,()( 21 dtttPdP m⇒

∏=

⇒⇒m

iim dtPdtttP

121 )|()|,...,,(

(Bayesian theorem)

(Uniform prior documents)

(Unigram terms)

7

A Naïve Solution

q  Maximum likelihood estimate:

: the raw term frequency of term t in document d : the total number of tokens in document d.

d

dtmle dl

tfdtP ,)|( =

dttf ,

ddl

8

Sparse Data Problem

q  A document size is often too small

q  A document size is fixed: P(information, retrieval|d) > 0 && keyword ∉d &&

crocodile ∉d => P(keyword|d) >> P(crocodile|d).

0)|(0)|(1

⇒⇒= ∏=

m

iii dtPdtP

9

Zipf's Law 10

q  Given the frequency f of a word and its rank r in the list of words ordered by their frequencies:

f ∝ 1/r or f x r = k for a constant k

0.00

1000.00

2000.00

3000.00

4000.00

5000.00

6000.00

7000.00

8000.00

9000.00

0 10 20 30 40 50 60

rank

frequency

A small number of common words

A reasonable number of medium-freq words

A large number of rare words

Data Smoothing

q  Laplace’s Law: T is the max number of terms.

q  Extensions to Laplace’s: Lidstone’s Law.

Tdltf

dtPd

dtLAP +

+=

1)|( ,

TdtPTdl

tfdtP mle

d

dtLID /)1()|()|( , µµ

λ

λ−+=

+

+=

where )/( λµ Tdldl dd +=

11

Data Smoothing

q  Smoothed with the collection model:

§  The combined probability is still normalized with values between 0 and 1.

§  Further differentiation between missing terms such as “keyword” and “crocodile”.

§  Collection model can be made stable by adding more documents into the collection.

)()1()|()|( tPdtPdtP collectiondocumentcombined ωω −+×=

12

Text Classifications/Categorizations

q  Common classification problems:

q  Common classification methods: decision trees, maximum entropy modeling, neural networks, and clustering.

Problems Input Categories Tagging context of a word tag for the word Disambiguation context of a word sense for the word PP attachment sentence parse trees Author identification document author(s) Language identification document language(s) Text categorization document topic(s)

13

What is Sentiment Analysis?

“… after a week of using the camera, I am very unhappy with the camera. The LCD screen is too small and the picture quality is poor. This camera is junk.”

14

Subjective Words

q  A consumer is unlikely to write: “This camera is great. It takes great pictures. The LCD screen is great. I love this camera”.

q  But more likely to write: “This camera is great. It takes breathtaking pictures. The LCD screen is bright and clear. I love this camera”.

q  More diverse usage of subjective words: infrequent within but frequent across documents.

15

Topic Models 16

q  Topic modeling is a relatively new statistical approach to understanding the thematic structure in a collection of data §  Uncovering hidden topics in a corpus of documents §  Reducing dimensionality from words down to topics

q  Topic models treat the document creation as a random process of determining a topic proportion and selecting words from the related topic distributions.

Discover Topics study found drug

research risk

drugs researchers

dr patients

disease vioxx health

increased merck text

brain schizophrenia

studies medical effects

charles prince london

marriage parker camilla bowles

wedding british

thursday king royal

married marry wales

queen diana april

relationship couple

bush protest texas

bushs iraq

president cindy war

ranch

crawford sheehan

son casey killed

antiwar

california george mother road

peace

surface atmosphere

space

system earth probe

european moon

huygens

titan mission friday nasa

scientists cassini

saturns agency data titans 14

17

Discover Hierarchies 18

Topic Use Changing Through Time

Year

Frequency

0.005

0.010

0.015

0.020

0.025

1800 1820 1840 1860 1880 1900

Topichorsesshipsvehicles

19

Each document is a random mixture of topics that are shared across the corpus…

*Harvard Law Review, Vol. 118, No. 7 (May, 2005), pp. 2314-2335 (Note).

Documents exhibit multiple topics… Each word is randomly drawn from a topic …

election voter vote president ballot

law legal lawyer court judge

city state florida united california

This is the generative model of LDA

20

However, all of this thematic information is hidden. We only observe the words…

*Harvard Law Review, Vol. 118, No. 7 (May, 2005), pp. 2314-2335 (Note).

…Using probabilistic reasoning, we wish to infer the latent structure of the documents

Topic Indices

Topics

Topic Proportions

21

Bayesian Probability 22

q  Bayes’ Theorem

q  Subjective probability: model prior by a given distribution

Beta Prior Linear Likelihood Posterior

P(θ | x) = p(x |θ )p(θ )p(x)

posterior∝ likelihood × prior

Dirichlet Distribution 23

q  Distribution over distributions:

P(θ |α) = 1B(α)

θαi−1

i=1

K

∏

B(α) =Γ(αi )i=1

K∏Γ( αii=1

K∑ )

Latent Dirichlet Allocation (LDA)

§  Initially proposed by Blei, et al. (2003): Generative Process: 1.  ϕ(k) ~ Dir(β) 2.  For each document d ∈ M:

a.  θd ~ Dir(α) b.  For each word w ∈ d:

i.  z ~ Discrete(θd) ii.  w ~ Discrete(ϕ(z))

24

M

Kϕ

α

β

Nz wθ

∏∏

∏∏

= =

==

××

M

m

N

nznmmnm

M

mm

K

kk

m

mwpzp

pp

1 1,,

11

)|()|(

)|()|(

φθ

αθβφ

=),|,,,( βαφθzwp

Inference 25

q  We are interested in the posterior distributions for ϕ, z and θ

q  Computing these distributions exactly is intractable q  We therefore turn to approximate inference

techniques: §  Gibbs sampling, variational inference, …

q  Collapsed Gibbs sampling §  The multinomial parameters are integrated out before

sampling

Gibbs Sampling 26

q  Popular MCMC (Markov Chain Monte Carlo) method that samples from the conditional distributions for the posterior variables

q  For the joint distribution p(x)=p(x1,x2,…, xm): 1. Randomly initialize each xi

2. For t = 1, 2, …, T: 2.1.

2.2.

… 2.m.

),,,|(~ 3211

1tm

ttt xxxxpx !+

),,,|(~ 31

121

2tm

ttt xxxxpx !++

),,,|(~ 11

12

11

1 +−

+++ tm

ttm

tm xxxxpx !

(Collapsed) Gibbs Sampling 27

q  We integrate out the multinomial parameters φ and θ so that the Markov chain stabilizes more quickly and we have less variables to sample.

q  Our sampling equation is given as follows:

q  GibbsLDA++: a free C/C++ implementation of LDA

ββ

α

α

Wnn

nn

wzzpi

iii

z

zw

dz

dz

ii +

+×

+

+∝− )(

.

)(

.)(

.

)(

),|(

Syntax Models 28

q  Hidden Markov Model (HMM): The probability distribution of the latent variable zi follows the Markov property and depends on the value of the previous latent variable zi-1

z2z1 z4z3

x3 x4x1 x2

...

A

ϕ

q  Each latent state z has a unique emission probability §  This is a mixture model like LDA

q  Useful for unsupervised POS tagging §  Language exhibits a structure due to syntax rules §  State-of-the-art: “Bayesian” HMM where transition rows and

emission probabilities are random variables drawn from Dirichlet distributions [3]

Combining Topic and Syntax Models? 29

q  Considering both axes of information can help us model text more precisely and can thus aid in prediction, processing, and ultimately many NLP tasks

q  Example 1: §  Our favourite city during the trip was _________. §  How do we reason about what the missing word might be? §  An HMM should be able to predict that it’s a noun §  LDA might be able to predict that it’s a travel word* §  A combined model could theoretically determine that it’s a

noun about travel

Combining Topic and Syntax Models? 30

q  Example 2: §  Is the word “book” a noun or a verb?

•  If we know that a “library” topic generated it, it’s much more likely to be a noun

•  If we know that an “airline” topic generated it, it’s more likely to be a verb (“to book a flight”)

q  Example 3: §  We know that the word “seal” is a noun, what is its topic?

•  More likely to be related to “marine mammals” than “construction” (“to seal a crack”)

POSLDA (Part-Of-Speech LDA) Model 31

q  A “multi-faceted” topic model where word w depends on both topic z and class c when c is a “semantic” class §  wi ~ p(wi | ci , zi)

q  When c is a “syntactic” class the emitted word only depends on class c itself

q  This model results in POS-specific topics and can automatically filter out “stop-words” that must be manually removed in LDA

T*SSEM + SSYN

S

M

θα

w1

z1

c1ϕ

β

π γ

w2

z2

c2

w3

z3

c3

...

...

...

POSLDA Generative Process 32

1.  For each row πr ∈ π: a.  Draw πr ~ Dirichlet(γ)

2.  For each word distribution ϕn ∈ ϕ: a.  Draw ϕn ~ Dirichlet(β)

3.  For each document d ∈ D: a.  Draw θd ~ Dirichlet(α) b.  For each token i ∈ d:

i.  Draw ci ~ π(ci-1) ii.  If ci ∈ CSYN:

A.  Draw wi ~ ϕSYN(ci) iii.  Else (ci ∈ CSEM):

i.  Draw zi ~ θd ii.  Draw wi ~ ϕSEM(ci, zi)

POSLDA Interpretability 33

q  Learned word distributions from TREC AP corpus:

Generalized Probabilistic Model 34

q  POSLDA reduces to LDA when the number of classes S = 1.

q  POSLDA reduces to Bayesian HMM when the number of topics K = 1.

q  POSLDA reduces to HMMLDA when the number of semantic classes Ssem = 1.

FS from Semantic Classes 35

q  Research has shown that semantic classes such as adjectives, adverbs, and verbs are more useful for SA.

q  Select representative words for a semantic class by picking the top-ranked words with the accumulative probability ≥ θ (e.g., 75% or 90%).

q  Merge all selected words into one set Wsem, and reduce it further by DF-cutoff if needed.

FS from Semantic Classes with Tagging 36

q  POSLDA is unsupervised and the results do not usually match with human labeled answers.

q  A tagging dictionary contains all the POS tags that can be used for the given words in a corpus.

q  With a tagging dictionary, a word is only assigned to its related POS classes, but if not in the dictionary, the word will participate in all POS classes, same as the unsupervised process for POSLDA.

FS with Automatic Stopword Removal 37

q  Similar to Wsem, we can also build Wsyn from the syntactic classes to extract topic-independent stopwords.

q  Such a process is both automatic and corpus-specific, avoiding under- or over-removal of the related words.

q  Although POSLDA can separate semantic and syntactic classes, removing stopwords explicitly helps reduce the noise in the dataset.

FS for Aspect-Based SA 38

q  POSLDA associates each topic with its related semantic classes such as “nouns about sports” and “verbs about travel”.

q  By modeling topics as aspects, we can then select features from the corresponding semantic classes using the methods described earlier.

q  To model aspects, we use manually prepared seed lists (possibly extended with a bootstrapping method), and pin them in the related aspects during the modeling process.

39

Questions?

References 40

q  Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.

q  Chris Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008 (online copy available on the web)

q  Daniel Jurafsky and James H. Martin. Speech and Language Processing. Second Edition. Pearson Education, 2008.

References 41

q  David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.

q  Sharon Goldwater and Tom Griffiths. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 744–751, Prague, Czech Republic, June 2007.

References 42

q  Bo Pang , Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? sentiment classification using machine learning techniques. Processings of EMNLP, 2002.

q  David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

q  Sharon Goldwater and Thomas Griffiths. A fully bayesian approach to unsupervised part-of-speech tagging. Proceedings of the 45th Annual Meeting of the ACL, 2007.

References 43

q  William M. Darling. Generalized Probabilistic Topic and Syntax Models for Natural Language Processing. Ph.D. Thesis, University of Guelph, 2012.

q  Haochen Zhou and Fei Song. Aspect-Level Sentiment Analysis Based on a Generalized Probabilistic Topic and Syntax Model. Proceedings of FLAIRS-28, 2015.