Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
OVERVIEW OF STATISTICAL NATURAL LANGUAGE PROCESSING
May 18, 2016
1
Dr. Fei Song School of Computer Science University of Guelph
Outline
q What is Statistical Natural Language Processing (SNLP)?
q Language Models for Information Retrieval q Text Classification and Sentiment Analysis q Probabilistic Models (LDA, Bayesian HMM,
and POSLDA) for language processing q References
2
What is SNLP? 3
q Infer and rank the structures from text based on statistical language modeling. § Probability and Statistics § Machine Learning Techniques
q Started in late 1950’s, but didn’t get popular until early 1980’s.
q Many applications: Information Retrieval, Information Extraction, Text Classification, Text Mining, and Biological Data Analysis.
Language Modeling 4
q A statistical language model requires the estimates for such probabilities: P(w1,n) = P(w1,w2,…,wn)
q Probabilities to word sequences?
q Left-context only? § The {big, pig} dog … § P(dog|the big) >> P(dog|the pig)
P(w1 w2 … wn) = P(w1) P(w2|w1) … P(wn|w1 w2 … wn-1) e.g., Jack went to the {hospital, number, if, … }
Noisy Channel Framework
q Through decoding, we want to find the most likely input for the given observation.
5
i)|p(i)p(oargmax p(o)
i)|p(i)p(oargmax o)|p(iargmax Iiii
===ˆ
Decoder Noisy Channel p(o|i)
I O I
§ Applications: machine translation, optical character recognition, speech recognition, spelling correction.
Language Models for IR
q N-gram models: Unigram: P(w1,n) = P(w1) P(w2) … P(wn)
Bigram: P(w1,n) = P(w1) P(w2| w1) … P(wn|wn-1)
Trigram: P(w1,n) = P(w1) P(w2| w1) … P(wn|wn-2,n-1)
q Documents as language samples:
∏=
=n
iin dtPdtttP
121 )|()|,...,,(
6
Language Models for IR
q Query as a generation process:
),...,,(/)|,...,,()(),...,,|(
2121
21
mm
m
tttPdtttPdPtttdP
⇒
)|,...,,()( 21 dtttPdP m⇒
∏=
⇒⇒m
iim dtPdtttP
121 )|()|,...,,(
(Bayesian theorem)
(Uniform prior documents)
(Unigram terms)
7
A Naïve Solution
q Maximum likelihood estimate:
: the raw term frequency of term t in document d : the total number of tokens in document d.
d
dtmle dl
tfdtP ,)|( =
dttf ,
ddl
8
Sparse Data Problem
q A document size is often too small
q A document size is fixed: P(information, retrieval|d) > 0 && keyword ∉d &&
crocodile ∉d => P(keyword|d) >> P(crocodile|d).
0)|(0)|(1
⇒⇒= ∏=
m
iii dtPdtP
9
Zipf's Law 10
q Given the frequency f of a word and its rank r in the list of words ordered by their frequencies:
f ∝ 1/r or f x r = k for a constant k
0.00
1000.00
2000.00
3000.00
4000.00
5000.00
6000.00
7000.00
8000.00
9000.00
0 10 20 30 40 50 60
rank
frequency
A small number of common words
A reasonable number of medium-freq words
A large number of rare words
Data Smoothing
q Laplace’s Law: T is the max number of terms.
q Extensions to Laplace’s: Lidstone’s Law.
Tdltf
dtPd
dtLAP +
+=
1)|( ,
TdtPTdl
tfdtP mle
d
dtLID /)1()|()|( , µµ
λ
λ−+=
+
+=
where )/( λµ Tdldl dd +=
11
Data Smoothing
q Smoothed with the collection model:
§ The combined probability is still normalized with values between 0 and 1.
§ Further differentiation between missing terms such as “keyword” and “crocodile”.
§ Collection model can be made stable by adding more documents into the collection.
)()1()|()|( tPdtPdtP collectiondocumentcombined ωω −+×=
12
Text Classifications/Categorizations
q Common classification problems:
q Common classification methods: decision trees, maximum entropy modeling, neural networks, and clustering.
Problems Input Categories Tagging context of a word tag for the word Disambiguation context of a word sense for the word PP attachment sentence parse trees Author identification document author(s) Language identification document language(s) Text categorization document topic(s)
13
What is Sentiment Analysis?
“… after a week of using the camera, I am very unhappy with the camera. The LCD screen is too small and the picture quality is poor. This camera is junk.”
14
Subjective Words
q A consumer is unlikely to write: “This camera is great. It takes great pictures. The LCD screen is great. I love this camera”.
q But more likely to write: “This camera is great. It takes breathtaking pictures. The LCD screen is bright and clear. I love this camera”.
q More diverse usage of subjective words: infrequent within but frequent across documents.
15
Topic Models 16
q Topic modeling is a relatively new statistical approach to understanding the thematic structure in a collection of data § Uncovering hidden topics in a corpus of documents § Reducing dimensionality from words down to topics
q Topic models treat the document creation as a random process of determining a topic proportion and selecting words from the related topic distributions.
Discover Topics study found drug
research risk
drugs researchers
dr patients
disease vioxx health
increased merck text
brain schizophrenia
studies medical effects
charles prince london
marriage parker camilla bowles
wedding british
thursday king royal
married marry wales
queen diana april
relationship couple
bush protest texas
bushs iraq
president cindy war
ranch
crawford sheehan
son casey killed
antiwar
california george mother road
peace
surface atmosphere
space
system earth probe
european moon
huygens
titan mission friday nasa
scientists cassini
saturns agency data titans 14
17
Discover Hierarchies 18
Topic Use Changing Through Time
Year
Frequency
0.005
0.010
0.015
0.020
0.025
1800 1820 1840 1860 1880 1900
Topichorsesshipsvehicles
19
Each document is a random mixture of topics that are shared across the corpus…
*Harvard Law Review, Vol. 118, No. 7 (May, 2005), pp. 2314-2335 (Note).
Documents exhibit multiple topics… Each word is randomly drawn from a topic …
election voter vote president ballot
law legal lawyer court judge
city state florida united california
This is the generative model of LDA
20
However, all of this thematic information is hidden. We only observe the words…
*Harvard Law Review, Vol. 118, No. 7 (May, 2005), pp. 2314-2335 (Note).
…Using probabilistic reasoning, we wish to infer the latent structure of the documents
Topic Indices
Topics
Topic Proportions
21
Bayesian Probability 22
q Bayes’ Theorem
q Subjective probability: model prior by a given distribution
Beta Prior Linear Likelihood Posterior
P(θ | x) = p(x |θ )p(θ )p(x)
posterior∝ likelihood × prior
Dirichlet Distribution 23
q Distribution over distributions:
P(θ |α) = 1B(α)
θαi−1
i=1
K
∏
B(α) =Γ(αi )i=1
K∏Γ( αii=1
K∑ )
Latent Dirichlet Allocation (LDA)
§ Initially proposed by Blei, et al. (2003): Generative Process: 1. ϕ(k) ~ Dir(β) 2. For each document d ∈ M:
a. θd ~ Dir(α) b. For each word w ∈ d:
i. z ~ Discrete(θd) ii. w ~ Discrete(ϕ(z))
24
M
Kϕ
α
β
Nz wθ
∏∏
∏∏
= =
==
××
M
m
N
nznmmnm
M
mm
K
kk
m
mwpzp
pp
1 1,,
11
)|()|(
)|()|(
φθ
αθβφ
=),|,,,( βαφθzwp
Inference 25
q We are interested in the posterior distributions for ϕ, z and θ
q Computing these distributions exactly is intractable q We therefore turn to approximate inference
techniques: § Gibbs sampling, variational inference, …
q Collapsed Gibbs sampling § The multinomial parameters are integrated out before
sampling
Gibbs Sampling 26
q Popular MCMC (Markov Chain Monte Carlo) method that samples from the conditional distributions for the posterior variables
q For the joint distribution p(x)=p(x1,x2,…, xm): 1. Randomly initialize each xi
2. For t = 1, 2, …, T: 2.1.
2.2.
… 2.m.
),,,|(~ 3211
1tm
ttt xxxxpx !+
),,,|(~ 31
121
2tm
ttt xxxxpx !++
),,,|(~ 11
12
11
1 +−
+++ tm
ttm
tm xxxxpx !
(Collapsed) Gibbs Sampling 27
q We integrate out the multinomial parameters φ and θ so that the Markov chain stabilizes more quickly and we have less variables to sample.
q Our sampling equation is given as follows:
q GibbsLDA++: a free C/C++ implementation of LDA
ββ
α
α
Wnn
nn
wzzpi
iii
z
zw
dz
dz
ii +
+×
+
+∝− )(
.
)(
.)(
.
)(
),|(
Syntax Models 28
q Hidden Markov Model (HMM): The probability distribution of the latent variable zi follows the Markov property and depends on the value of the previous latent variable zi-1
z2z1 z4z3
x3 x4x1 x2
...
A
ϕ
q Each latent state z has a unique emission probability § This is a mixture model like LDA
q Useful for unsupervised POS tagging § Language exhibits a structure due to syntax rules § State-of-the-art: “Bayesian” HMM where transition rows and
emission probabilities are random variables drawn from Dirichlet distributions [3]
Combining Topic and Syntax Models? 29
q Considering both axes of information can help us model text more precisely and can thus aid in prediction, processing, and ultimately many NLP tasks
q Example 1: § Our favourite city during the trip was _________. § How do we reason about what the missing word might be? § An HMM should be able to predict that it’s a noun § LDA might be able to predict that it’s a travel word* § A combined model could theoretically determine that it’s a
noun about travel
Combining Topic and Syntax Models? 30
q Example 2: § Is the word “book” a noun or a verb?
• If we know that a “library” topic generated it, it’s much more likely to be a noun
• If we know that an “airline” topic generated it, it’s more likely to be a verb (“to book a flight”)
q Example 3: § We know that the word “seal” is a noun, what is its topic?
• More likely to be related to “marine mammals” than “construction” (“to seal a crack”)
POSLDA (Part-Of-Speech LDA) Model 31
q A “multi-faceted” topic model where word w depends on both topic z and class c when c is a “semantic” class § wi ~ p(wi | ci , zi)
q When c is a “syntactic” class the emitted word only depends on class c itself
q This model results in POS-specific topics and can automatically filter out “stop-words” that must be manually removed in LDA
T*SSEM + SSYN
S
M
θα
w1
z1
c1ϕ
β
π γ
w2
z2
c2
w3
z3
c3
...
...
...
POSLDA Generative Process 32
1. For each row πr ∈ π: a. Draw πr ~ Dirichlet(γ)
2. For each word distribution ϕn ∈ ϕ: a. Draw ϕn ~ Dirichlet(β)
3. For each document d ∈ D: a. Draw θd ~ Dirichlet(α) b. For each token i ∈ d:
i. Draw ci ~ π(ci-1) ii. If ci ∈ CSYN:
A. Draw wi ~ ϕSYN(ci) iii. Else (ci ∈ CSEM):
i. Draw zi ~ θd ii. Draw wi ~ ϕSEM(ci, zi)
POSLDA Interpretability 33
q Learned word distributions from TREC AP corpus:
Generalized Probabilistic Model 34
q POSLDA reduces to LDA when the number of classes S = 1.
q POSLDA reduces to Bayesian HMM when the number of topics K = 1.
q POSLDA reduces to HMMLDA when the number of semantic classes Ssem = 1.
FS from Semantic Classes 35
q Research has shown that semantic classes such as adjectives, adverbs, and verbs are more useful for SA.
q Select representative words for a semantic class by picking the top-ranked words with the accumulative probability ≥ θ (e.g., 75% or 90%).
q Merge all selected words into one set Wsem, and reduce it further by DF-cutoff if needed.
FS from Semantic Classes with Tagging 36
q POSLDA is unsupervised and the results do not usually match with human labeled answers.
q A tagging dictionary contains all the POS tags that can be used for the given words in a corpus.
q With a tagging dictionary, a word is only assigned to its related POS classes, but if not in the dictionary, the word will participate in all POS classes, same as the unsupervised process for POSLDA.
FS with Automatic Stopword Removal 37
q Similar to Wsem, we can also build Wsyn from the syntactic classes to extract topic-independent stopwords.
q Such a process is both automatic and corpus-specific, avoiding under- or over-removal of the related words.
q Although POSLDA can separate semantic and syntactic classes, removing stopwords explicitly helps reduce the noise in the dataset.
FS for Aspect-Based SA 38
q POSLDA associates each topic with its related semantic classes such as “nouns about sports” and “verbs about travel”.
q By modeling topics as aspects, we can then select features from the corresponding semantic classes using the methods described earlier.
q To model aspects, we use manually prepared seed lists (possibly extended with a bootstrapping method), and pin them in the related aspects during the modeling process.
39
Questions?
References 40
q Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
q Chris Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008 (online copy available on the web)
q Daniel Jurafsky and James H. Martin. Speech and Language Processing. Second Edition. Pearson Education, 2008.
References 41
q David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.
q Sharon Goldwater and Tom Griffiths. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 744–751, Prague, Czech Republic, June 2007.
References 42
q Bo Pang , Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? sentiment classification using machine learning techniques. Processings of EMNLP, 2002.
q David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
q Sharon Goldwater and Thomas Griffiths. A fully bayesian approach to unsupervised part-of-speech tagging. Proceedings of the 45th Annual Meeting of the ACL, 2007.
References 43
q William M. Darling. Generalized Probabilistic Topic and Syntax Models for Natural Language Processing. Ph.D. Thesis, University of Guelph, 2012.
q Haochen Zhou and Fei Song. Aspect-Level Sentiment Analysis Based on a Generalized Probabilistic Topic and Syntax Model. Proceedings of FLAIRS-28, 2015.