Text Models
Why?
• To “understand” text
• To assist in text search & ranking
• For autocompletion
• Part of Speech Tagging
Simple application: spelling suggestions
• Say that we have a dictionary of words
– A real dictionary, or the result of crawling
– Sentences instead of words
• Now we are given a word w not in the dictionary
• How can we correct it to something in the dictionary
String editing
• Given two strings (sequences) the “distance” between the two strings is defined by the minimum number of “character edit operations” needed to turn one sequence into the other.
• Edit operations: delete, insert, modify (a character)
– A cost is assigned to each operation (e.g. uniform cost = 1)
Edit distance
• Already a simple model for languages
• Modeling the creation of strings (and errors in them) through simple edit operations
Distance between strings
• Edit distance between strings = minimum number of edit operations that can be used to get from one string to the other
– Symmetric because of the particular choice of edit operations and uniform cost
• distance(“Willliam Cohon”,“William Cohen”)
• 2
Finding the edit distance
• An “alignment” problem
• Deciding how to align the two strings
• Can we try all alignments?
• How many (reasonable options) are there?
Dynamic Programming
• An umbrella name for a collection of algorithms
• Main idea: reuse computation for sub-problems, combined in different ways
Example: Fibonacci
def fib(n):
    if n == 0 or n == 1: return n
    else: return fib(n-1) + fib(n-2)
Exponential time!
Fib with Dynamic Programming
table = {}
def fib(n):
    global table
    if n in table:
        return table[n]
    if n == 0 or n == 1:
        table[n] = n
        return n
    else:
        value = fib(n-1) + fib(n-2)
        table[n] = value
        return value
Using a partial solution
• Partial solution:
– Alignment of s up to location i, with t up to location j
• How to reuse?
• Try all options for the “last” operation
• Base case: D(i,0) = i, D(0,i) = i, for i deletions / insertions
• Easy to generalize to arbitrary cost functions!
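A minimal sketch of this dynamic program in Python, with uniform cost 1 per operation (the function name and table layout are illustrative, not from the slides):

def edit_distance(s, t):
    # D[i][j] = minimum number of edits turning s[:i] into t[:j]
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        D[i][0] = i                          # i deletions
    for j in range(len(t) + 1):
        D[0][j] = j                          # j insertions
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = 0 if s[i-1] == t[j-1] else 1
            D[i][j] = min(D[i-1][j] + 1,     # delete
                          D[i][j-1] + 1,     # insert
                          D[i-1][j-1] + sub) # modify (or match)
    return D[len(s)][len(t)]

For example, edit_distance("Willliam Cohon", "William Cohen") returns 2.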
Models
• Bag-of-words
• N-grams
• Hidden Markov Models
• Probabilistic Context Free Grammar
Bag-of-words
• Every document is represented as a bag of the words it contains
• Bag means that we keep the multiplicity (=number of occurrences) of each word
• Very simple, but we lose all track of structure
n-grams
• Limited structure
• Sliding window of n words
n-gram model
How would we infer the probabilities?
• Issues:
– Overfitting
– Probability 0 for unseen n-grams
How would we infer the probabilities?
• Maximum Likelihood:
"add-one" (Laplace) smoothing
• V = Vocabulary size
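As a minimal sketch of what these estimates look like for bigrams (the counting code and the names train_bigrams, p_mle, p_add_one are illustrative assumptions, not from the slides):

from collections import Counter

def train_bigrams(tokens):
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    return unigram, bigram

def p_mle(w_prev, w, unigram, bigram):
    # Maximum likelihood: count(w_prev w) / count(w_prev); 0 for unseen bigrams
    # (assumes w_prev occurs in the training data)
    return bigram[(w_prev, w)] / unigram[w_prev]

def p_add_one(w_prev, w, unigram, bigram, V):
    # "Add-one" (Laplace) smoothing, V = vocabulary size
    return (bigram[(w_prev, w)] + 1) / (unigram[w_prev] + V)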
Good-Turing Estimate
Good-Turing
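The estimate itself did not survive extraction; for reference, the standard Good-Turing formulation (not necessarily the slide's exact notation), where N_r is the number of n-grams occurring exactly r times and N is the total number of n-gram tokens:

r* = (r + 1) · N_{r+1} / N_r
P(n-gram seen r times) = r* / N
Total probability mass reserved for unseen n-grams: N_1 / N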
More than a fixed n: Linear Interpolation
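The interpolation formula was likewise lost; the usual trigram version (a standard formulation, not necessarily the slide's) is:

P(w_i | w_{i-2}, w_{i-1}) = λ1 · P(w_i) + λ2 · P(w_i | w_{i-1}) + λ3 · P(w_i | w_{i-2}, w_{i-1}),  with λ1 + λ2 + λ3 = 1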
Precision vs. Recall
Richer Models
• HMM
• PCFG
Motivation: Part-of-Speech Tagging
– Useful for ranking
– For machine translation
– For Word-Sense Disambiguation
– …
Part-of-Speech Tagging
• Tag this word. This word is a tag.
• He dogs like a flea
• The can is in the fridge
• The sailor dogs me every day
A Learning Problem
• Training set: tagged corpus
– Most famous is the Brown Corpus, with about 1M words
– The goal is to learn a model from the training set, and then perform tagging of untagged text
– Performance tested on a test-set
Simple Algorithm
• Assign to each word its most popular tag in the training set
• Problem: Ignores context
• “Dogs” and “tag” will always be tagged as nouns…
• “Can” will be tagged as a verb
• Still, achieves around 80% correctness for real-life test sets
– Goes as high as 90% when combined with some simple rules
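A minimal sketch of this “most popular tag” baseline in Python (function and variable names are illustrative assumptions):

from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    # tagged_corpus: list of (word, tag) pairs from the training set
    per_word = defaultdict(Counter)
    for word, tag in tagged_corpus:
        per_word[word][tag] += 1
    # fall back to the overall most common tag for unseen words
    default_tag = Counter(tag for _, tag in tagged_corpus).most_common(1)[0][0]
    best = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lambda word: best.get(word, default_tag)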
(HMM) Hidden Markov Model
• Model: sentences are generated by a probabilistic process
• In particular, a Markov Chain whose states correspond to Parts-of-Speech
• Transitions are probabilistic
• In each state a word is output
– The output word is again chosen probabilistically, based on the state
HMM
• An HMM consists of:
– A set of N states
– A set of M symbols (words)
– An N×N matrix of transition probabilities Ptrans
– A vector of size N of initial state probabilities Pstart
– An N×M matrix of emission probabilities Pout
• “Hidden” because we see only the outputs, not the sequence of states traversed
Example
3 Fundamental Problems
1) Compute the probability of a given observation sequence (= sentence)
2) Given an observation sequence, find the most likely hidden state sequence (this is tagging)
3) Given a training set, find the model that would make the observations most likely
Tagging
• Find the most likely sequence of states that led to an observed output sequence
• Problem: exponentially many possible sequences!
Viterbi Algorithm
• Dynamic Programming
• Vt,k is the probability of the most probable state sequence
– Generating the first t + 1 observations (X0, …, Xt)
– And terminating at state k
• V0,k = Pstart(k) · Pout(k, X0)
• Vt,k = Pout(k, Xt) · max_k’ { Vt-1,k’ · Ptrans(k’, k) }
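A minimal sketch of this recursion in Python (Pstart, Ptrans, Pout follow the slide's names; representing them as nested dictionaries is an assumption):

def viterbi(X, states, Pstart, Ptrans, Pout):
    # V[t][k] = probability of the most probable state sequence
    #           generating X[0..t] and terminating at state k
    V = [{k: Pstart[k] * Pout[k][X[0]] for k in states}]
    back = [{}]
    for t in range(1, len(X)):
        V.append({})
        back.append({})
        for k in states:
            prev = max(states, key=lambda kp: V[t-1][kp] * Ptrans[kp][k])
            V[t][k] = Pout[k][X[t]] * V[t-1][prev] * Ptrans[prev][k]
            back[t][k] = prev
    # follow the argmax pointers backwards to recover the most likely tag sequence
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for t in range(len(X) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))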
Finding the path
• Note that we are interested in the most likely path, not only in its probability
• So we need to keep track at each point of the argmax
– Combine them to form a sequence
• What about top-k?
Complexity
• O(T*|S|^2)
• Where T is the sequence (=sentence) length, |S| is the number of states (= number of possible tags)
Computing the probability of a sequence
• Forward probabilities: αt(k) is the probability of seeing the sequence X0…Xt and terminating at state k
• Backward probabilities: βt(k) is the probability of seeing the sequence Xt+1…Xn given that the Markov process is at state k at time t
Computing the probabilities
Forward algorithm:
α0(k) = Pstart(k) · Pout(k, X0)
αt(k) = Pout(k, Xt) · Σk’ { αt-1(k’) · Ptrans(k’, k) }
P(X0, …, Xn) = Σk αn(k)
Backward algorithm:
βt(k) = P(Xt+1 … Xn | state at time t is k)
βt(k) = Σk’ { Ptrans(k, k’) · Pout(k’, Xt+1) · βt+1(k’) }
βn(k) = 1 for all k
P(X0, …, Xn) = Σk Pstart(k) · Pout(k, X0) · β0(k) = Σk α0(k) · β0(k)
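A sketch of the forward pass in Python, following the indexing above (the backward pass is symmetric; the dictionary layout of the parameters is an assumption):

def forward(X, states, Pstart, Ptrans, Pout):
    # alpha[t][k] = P(X0..Xt, state at time t is k)
    alpha = [{k: Pstart[k] * Pout[k][X[0]] for k in states}]
    for t in range(1, len(X)):
        alpha.append({k: Pout[k][X[t]] *
                         sum(alpha[t-1][kp] * Ptrans[kp][k] for kp in states)
                      for k in states})
    return alpha   # P(X) = sum(alpha[-1][k] for k in states)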
Learning the HMM probabilities
• Expectation-Maximization Algorithm
1. Start with initial probabilities
2. Compute Eij, the expected number of transitions from i to j while generating a sequence, for each i, j (see next)
3. Set the probability of transition from i to j to be Eij / (Σk Eik)
4. Similarly for the emission probabilities
5. Repeat 2-4 using the new model, until convergence
Estimating the expectations
• By sampling
– Re-run a random execution of the model 100 times
– Count transitions
• By analysis
– Use Bayes’ rule on the formula for sequence probability
– Called the Forward-Backward algorithm
Accuracy
• Tested experimentally
• Exceeds 96% for the Brown corpus
– Trained on one half and tested on the other half
• Compare with the 80-90% by the trivial algorithm
• The hard cases are few, but they are very hard…
NLTK
• http://www.nltk.org/
• Natural Language Toolkit
• Open-source Python modules for NLP tasks
– Including stemming, POS tagging, and much more
Context Free Grammars
• Context Free Grammars are a more natural model for Natural Language
• Syntax rules are very easy to formulate using CFGs
• Provably more expressive than Finite State Machines
– E.g., can check for balanced parentheses
Context Free Grammars
• Non-terminals
• Terminals
• Production rules
– V → w, where V is a non-terminal and w is a sequence of terminals and non-terminals
Context Free Grammars
• Can be used as acceptors
• Can be used as a generative model
• Similarly to the case of Finite State Machines
• How long can a string generated by a CFG be?
Stochastic Context Free Grammar
• Non-terminals
• Terminals
• Production rules, each associated with a probability
– V → w, where V is a non-terminal and w is a sequence of terminals and non-terminals
Chomsky Normal Form
• Every rule is of the form
– V → V1 V2, where V, V1, V2 are non-terminals
– V → t, where V is a non-terminal and t is a terminal
• Every (S)CFG can be written in this form
• Makes designing many algorithms easier
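For illustration, a tiny stochastic grammar in Chomsky Normal Form (a made-up example reusing words from the tagging slides; it is not from the original deck):

S → NP VP (1.0)
NP → Det N (0.7)    NP → dogs (0.3)
VP → V NP (0.6)     VP → barks (0.4)
Det → the (1.0)
N → fridge (1.0)
V → likes (1.0)

Every rule rewrites a non-terminal either to two non-terminals or to a single terminal, and the rule probabilities for each non-terminal sum to 1.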
Questions
• What is the probability of a string?
– Defined as the sum of probabilities of all possible derivations of the string
• Given a string, what is its most likely derivation?
– Also called the Viterbi derivation or parse
– An easy adaptation of the Viterbi Algorithm for HMMs
• Given a training corpus and a CFG (no probabilities), learn the probabilities on the derivation rules
• Inside probability: the probability of generating w_p…w_q from non-terminal N^j
• Outside probability: the total probability of beginning with the start symbol N^1 and generating N^j_pq and everything outside w_p…w_q
Inside-outside probabilities
• Outside probability: α_j(p,q) = P(w_{1(p-1)}, N^j_pq, w_{(q+1)m})
• Inside probability: β_j(p,q) = P(w_pq | N^j_pq)
CYK algorithm
• Inside probabilities are computed bottom-up:
– Base case: β_j(k,k) = P(N^j → w_k)
– Recursion: β_j(p,q) = Σ_{r,s} Σ_{d=p}^{q-1} P(N^j → N^r N^s) · β_r(p,d) · β_s(d+1,q)
(Figure: N^j covering w_p…w_q, split into N^r over w_p…w_d and N^s over w_{d+1}…w_q)
CYK algorithm
• Outside probabilities are computed top-down; when N^j is the left child of its parent N^f, the contribution is:
Σ_{f,g} Σ_{e=q+1}^{m} α_f(p,e) · P(N^f → N^j N^g) · β_g(q+1,e)
(Figure: N^1 spans w_1…w_m; N^f dominates N^j over w_p…w_q and N^g over w_{q+1}…w_e)
CYK algorithm
• When N^j is the right child of its parent N^f, the contribution is:
Σ_{f,g} Σ_{e=1}^{p-1} α_f(e,q) · P(N^f → N^g N^j) · β_g(e,p-1)
(Figure: N^1 spans w_1…w_m; N^f dominates N^g over w_e…w_{p-1} and N^j over w_p…w_q)
Outside probability
α_j(p,q) = Σ_{f,g} Σ_{e=q+1}^{m} α_f(p,e) · P(N^f → N^j N^g) · β_g(q+1,e)
         + Σ_{f,g} Σ_{e=1}^{p-1} α_f(e,q) · P(N^f → N^g N^j) · β_g(e,p-1)
• Base case: α_1(1,m) = 1, and α_j(1,m) = 0 for j ≠ 1
Probability of a sentence
• P(w_{1m}) = β_1(1,m)
• P(w_{1m}) = Σ_j α_j(k,k) · P(N^j → w_k), for any k
• P(w_{1m}, N^j_pq) = α_j(p,q) · β_j(p,q)
The probability that a binary rule is used
• For a fixed span (p,q):
P(N^j_pq, N^j → N^r N^s | w_{1m}) = [ Σ_{d=p}^{q-1} α_j(p,q) · P(N^j → N^r N^s) · β_r(p,d) · β_s(d+1,q) ] / P(w_{1m})
• Summing over all spans:
P(N^j → N^r N^s, N^j used | w_{1m}) = [ Σ_{p=1}^{m} Σ_{q=p}^{m} Σ_{d=p}^{q-1} α_j(p,q) · P(N^j → N^r N^s) · β_r(p,d) · β_s(d+1,q) ] / P(w_{1m})    (1)
The probability that Nj is used
• P(w_{1m}, N^j_pq) = α_j(p,q) · β_j(p,q), so
P(N^j_pq | w_{1m}) = α_j(p,q) · β_j(p,q) / P(w_{1m})
• Summing over all spans:
P(N^j used | w_{1m}) = [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p,q) · β_j(p,q) ] / P(w_{1m})    (2)
• Dividing (1) by (2) gives the re-estimated binary rule probability:
P̂(N^j → N^r N^s) = [ Σ_{p=1}^{m} Σ_{q=p}^{m} Σ_{d=p}^{q-1} α_j(p,q) · P(N^j → N^r N^s) · β_r(p,d) · β_s(d+1,q) ] / [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p,q) · β_j(p,q) ]
The probability that a unary rule is used
• For a terminal w^k:
P(N^j → w^k, N^j used | w_{1m}) = [ Σ_{h=1}^{m} α_j(h,h) · β_j(h,h) · δ(w_h, w^k) ] / P(w_{1m})    (3)
where δ(w_h, w^k) = 1 if the word at position h is w^k, and 0 otherwise
• Dividing (3) by (2) gives the re-estimated unary rule probability:
P̂(N^j → w^k) = [ Σ_{h=1}^{m} α_j(h,h) · β_j(h,h) · δ(w_h, w^k) ] / [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p,q) · β_j(p,q) ]
Multiple training sentences
• With several training sentences W_1, …, W_K, compute the expectations separately for each sentence and sum them:
h_i(p,q,j) = α_j(p,q) · β_j(p,q) / P(W_i)    (the quantity in (2), for sentence W_i)
f_i(p,q,j,r,s) = [ Σ_{d=p}^{q-1} α_j(p,q) · P(N^j → N^r N^s) · β_r(p,d) · β_s(d+1,q) ] / P(W_i)    (the quantity in (1))
• Re-estimated rule probability:
P̂(N^j → N^r N^s) = [ Σ_i Σ_{p,q} f_i(p,q,j,r,s) ] / [ Σ_i Σ_{p,q} h_i(p,q,j) ]