CS460/626: Natural Language Processing/Speech, NLP and the Web
(Lecture 26 – Recap HMM; Probabilistic Parsing contd.)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
15th March, 2011
Formal Definition of PCFG
A PCFG consists of:
• A set of terminals {wk}, k = 1, ..., V, e.g., {wk} = {child, teddy, bear, played, ...}
• A set of non-terminals {Ni}, i = 1, ..., n, e.g., {Ni} = {NP, VP, DT, ...}
• A designated start symbol N1
• A set of rules {Ni → ζj}, where ζj is a sequence of terminals and non-terminals, e.g., NP → DT NN
• A corresponding set of rule probabilities
Rule Probabilities
Rule probabilities are such that Σj P(Ni → ζj) = 1 for all i
E.g., P(NP → DT NN) = 0.2, P(NP → NN) = 0.5, P(NP → NP PP) = 0.3
P(NP → DT NN) = 0.2 means that 20% of the parses in the training data use the rule NP → DT NN
Probabilistic Context Free Grammars
S → NP VP      1.0
NP → DT NN     0.5
NP → NNS       0.3
NP → NP PP     0.2
PP → P NP      1.0
VP → VP PP     0.6
VP → VBD NP    0.4
DT → the       1.0
NN → gunman    0.5
NN → building  0.5
VBD → sprayed  1.0
NNS → bullets  1.0
P → with       1.0
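For concreteness, the toy grammar above can be written down in code. A minimal sketch in Python (the rules and probabilities are exactly the ones on this slide; the dictionary representation itself is our own choice, and it is reused by later sketches):

```python
# Toy PCFG from the slide: each LHS maps to a list of (RHS, probability) pairs.
# Non-terminals are uppercase strings; terminals are lowercase strings.
PCFG = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("DT", "NN"), 0.5), (("NNS",), 0.3), (("NP", "PP"), 0.2)],
    "PP":  [(("P", "NP"), 1.0)],
    "VP":  [(("VP", "PP"), 0.6), (("VBD", "NP"), 0.4)],
    "DT":  [(("the",), 1.0)],
    "NN":  [(("gunman",), 0.5), (("building",), 0.5)],
    "VBD": [(("sprayed",), 1.0)],
    "NNS": [(("bullets",), 1.0)],
    "P":   [(("with",), 1.0)],
}

# Sanity check of the previous slide: Sum_j P(Ni -> zeta_j) = 1 for every LHS.
for lhs, rules in PCFG.items():
    assert abs(sum(p for _, p in rules) - 1.0) < 1e-9, lhs
```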
Example Parse t1
The gunman sprayed the building with bullets.
[S1.0 [NP0.5 [DT1.0 The] [NN0.5 gunman]] [VP0.6 [VP0.4 [VBD1.0 sprayed] [NP0.5 [DT1.0 the] [NN0.5 building]]] [PP1.0 [P1.0 with] [NP0.3 [NNS1.0 bullets]]]]]
Here the PP attaches to the verb phrase (rule VP → VP PP).
P(t1) = 1.0 * 0.5 * 1.0 * 0.5 * 0.6 * 0.4 * 1.0 * 0.5 * 1.0 * 0.5 * 1.0 * 1.0 * 0.3 * 1.0 = 0.0045
Another Parse t2
The gunman sprayed the building with bullets.
[S1.0 [NP0.5 [DT1.0 The] [NN0.5 gunman]] [VP0.4 [VBD1.0 sprayed] [NP0.2 [NP0.5 [DT1.0 the] [NN0.5 building]] [PP1.0 [P1.0 with] [NP0.3 [NNS1.0 bullets]]]]]]
Here the PP attaches to the object noun phrase (rule NP → NP PP).
P(t2) = 1.0 * 0.5 * 1.0 * 0.5 * 0.4 * 1.0 * 0.2 * 0.5 * 1.0 * 0.5 * 1.0 * 1.0 * 0.3 * 1.0 = 0.0015
Probability of a sentence
Notation:
• wab : the subsequence wa ... wb
• Nj dominates wa ... wb, or yield(Nj) = wa ... wb (e.g., an NP dominating "the sweet teddy bear")
Probability of a sentence = P(w1m):
P(w1m) = Σt P(w1m, t) = Σt P(t) P(w1m | t) = Σ{t : yield(t) = w1m} P(t)
where t is a parse tree of the sentence, and P(w1m | t) = 1 if t is a parse tree for the sentence w1m.
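As a quick check of this formula on the running example, one can multiply out the rule probabilities of t1 and t2 and add them up. A sketch in Python, with the factor lists transcribed from the two parse trees above:

```python
from math import prod

# Rule probabilities used by each parse of
# "The gunman sprayed the building with bullets."
t1_rules = [1.0, 0.5, 1.0, 0.5,   # S -> NP VP; NP -> DT NN; DT -> The; NN -> gunman
            0.6, 0.4, 1.0,        # VP -> VP PP; VP -> VBD NP; VBD -> sprayed
            0.5, 1.0, 0.5,        # NP -> DT NN; DT -> the; NN -> building
            1.0, 1.0, 0.3, 1.0]   # PP -> P NP; P -> with; NP -> NNS; NNS -> bullets

t2_rules = [1.0, 0.5, 1.0, 0.5,   # S -> NP VP; NP -> DT NN; DT -> The; NN -> gunman
            0.4, 1.0, 0.2,        # VP -> VBD NP; VBD -> sprayed; NP -> NP PP
            0.5, 1.0, 0.5,        # NP -> DT NN; DT -> the; NN -> building
            1.0, 1.0, 0.3, 1.0]   # PP -> P NP; P -> with; NP -> NNS; NNS -> bullets

p_t1, p_t2 = prod(t1_rules), prod(t2_rules)
print(p_t1, p_t2)    # 0.0045 and 0.0015 (up to float rounding)
print(p_t1 + p_t2)   # P(sentence) = sum over both parses = 0.006
```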
Assumptions of the PCFG model
• Place invariance: P(NP → DT NN) is the same in locations 1 and 2
• Context-free: P(NP → DT NN | anything outside "The child") = P(NP → DT NN)
• Ancestor-free: at location 2, P(NP → DT NN | its ancestor is VP) = P(NP → DT NN)
[Figure: S dominating an NP ("The child", location 1) and a VP, which in turn dominates an NP ("The toy", location 2)]
Probability of a parse tree
Domination: we say Nj dominates the words from k to l, written Nj ⇒* wkl, if wk ... wl is derived from Nj.
P(tree | sentence) = P(tree | S1,l), where S1,l means that the start symbol S dominates the word sequence w1,l.
P(t | s) approximately equals the joint probability of the constituent non-terminals dominating the sentence fragments (next slide).
Probability of a parse tree (contd.)
[Figure: S1,l → NP1,2 VP3,l; NP1,2 → DT1,1 N2,2; VP3,l → V3,3 PP4,l; PP4,l → P4,4 NP5,l; over the words w1 w2 w3 w4 w5 ... wl]
P(t | s) = P(t | S1,l)
= P(NP1,2, DT1,1, w1, N2,2, w2, VP3,l, V3,3, w3, PP4,l, P4,4, w4, NP5,l, w5...l | S1,l)
= P(NP1,2, VP3,l | S1,l) * P(DT1,1, N2,2 | NP1,2) * P(w1 | DT1,1) * P(w2 | N2,2) * P(V3,3, PP4,l | VP3,l) * P(w3 | V3,3) * P(P4,4, NP5,l | PP4,l) * P(w4 | P4,4) * P(w5...l | NP5,l)
(using the chain rule, context-freeness and ancestor-freeness)
HMM ↔ PCFG
• O observed sequence ↔ w1m sentence
• X state sequence ↔ t parse tree
• μ model ↔ G grammar

Three fundamental questions

HMM ↔ PCFG
• How likely is a certain observation given the model? ↔ How likely is a sentence given the grammar?
  P(O | μ) ↔ P(w1m | G)
• How to choose a state sequence which best explains the observations? ↔ How to choose a parse which best supports the sentence?
  argmaxX P(X | O, μ) ↔ argmaxt P(t | w1m, G)

HMM ↔ PCFG
• How to choose the model parameters that best explain the observed data? ↔ How to choose rule probabilities which maximize the probabilities of the observed sentences?
  argmaxμ P(O | μ) ↔ argmaxG P(w1m | G)
Recap of HMM

HMM Definition
• Set of states: S, where |S| = N
• Start state: S0 /* P(S0) = 1 */
• Output alphabet: O, where |O| = M
• Transition probabilities: A = {aij} /* state i to state j */
• Emission probabilities: B = {bj(ok)} /* prob. of emitting or absorbing ok from state j */
• Initial state probabilities: Π = {p1, p2, p3, ..., pN}, where each pi = P(o0 = ε, Si | S0)
Markov Process Properties
• Limited horizon: a state is independent of all but the k states immediately preceding it (an order-k Markov process):
  P(Xt = i | Xt-1, Xt-2, ..., X0) = P(Xt = i | Xt-1, Xt-2, ..., Xt-k)
• Time invariance (shown for k = 1):
  P(Xt = i | Xt-1 = j) = P(X1 = i | X0 = j) = ... = P(Xn = i | Xn-1 = j)
Three basic problems (contd.)
• Problem 1 (likelihood of a sequence): Forward Procedure, Backward Procedure
• Problem 2 (best state sequence): Viterbi Algorithm
• Problem 3 (re-estimation): Baum-Welch (Forward-Backward Algorithm)
Probabilistic Inference
O: observation sequence; S: state sequence
Given O, finding S* = argmaxS P(S | O) is called probabilistic inference
We infer the "hidden" from the "observed"
How is this inference different from logical inference based on propositional or predicate calculus?
Essentials of Hidden Markov Model
1. Markov + Naive Bayes
2. Uses both transition and observation probability
3. Effectively makes the Hidden Markov Model a Finite State Machine (FSM) with probabilities on the arcs:
   P(Sk → Sk+1 on output Ok) = P(Ok | Sk) · P(Sk+1 | Sk)
Probability of Observation Sequence
P(O) = ΣS P(O, S) = ΣS P(S) P(O | S)
Without any restriction, the search space size is |S|^|O|.
Continuing with the Urn example

Colored Ball choosing:
• Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
• Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
• Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Example (contd.)
Given the transition probabilities:
      U1   U2   U3
U1   0.1  0.4  0.5
U2   0.6  0.2  0.2
U3   0.3  0.4  0.3
and the observation/output probabilities:
      R    G    B
U1   0.3  0.5  0.2
U2   0.1  0.4  0.5
U3   0.6  0.1  0.3
Observation: RRGGBRGR
What is the corresponding state sequence?
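To make the |S|^|O| claim from the previous slide concrete: the two tables fit in two small matrices, and the unrestricted sum over all state sequences can be written directly. A sketch in Python (the initial probabilities Π = (0.5, 0.3, 0.2) are an assumption read off the Viterbi trellis figure later in the lecture):

```python
import numpy as np
from itertools import product

# Urn HMM from the tables above. Rows = current state, columns = next state.
A = np.array([[0.1, 0.4, 0.5],     # transitions out of U1
              [0.6, 0.2, 0.2],     # out of U2
              [0.3, 0.4, 0.3]])    # out of U3
B = np.array([[0.3, 0.5, 0.2],     # P(R), P(G), P(B) from U1
              [0.1, 0.4, 0.5],     # from U2
              [0.6, 0.1, 0.3]])    # from U3
pi = np.array([0.5, 0.3, 0.2])     # initial probabilities (assumed, see trellis)
SYM = {"R": 0, "G": 1, "B": 2}

def brute_force_likelihood(obs):
    """P(O) = sum over all |S|^|O| state sequences S of P(S) * P(O | S)."""
    o = [SYM[c] for c in obs]
    total = 0.0
    for seq in product(range(3), repeat=len(o)):   # 3^8 sequences for RRGGBRGR
        p = pi[seq[0]] * B[seq[0], o[0]]
        for prev, cur, t in zip(seq, seq[1:], o[1:]):
            p *= A[prev, cur] * B[cur, t]
        total += p
    return total

print(brute_force_likelihood("RRGGBRGR"))
```

The forward procedure later in the lecture computes the same quantity in O(N²m) time instead of this O(N^m) enumeration.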
Diagrammatic representation (1/2)
[Figure: state diagram over U1, U2, U3. Each arc carries its transition probability from the table above (e.g., U1 → U3 with 0.5), and each state carries its emission probabilities (e.g., U3 emits R with 0.6, G with 0.1, B with 0.3).]
Diagrammatic representation (2/2)
[Figure: the same machine with the emission probabilities folded onto the arcs, so that an arc Ui → Uj carries P(Uj | Ui) · P(color | Ui) for each color; e.g., the U2 → U1 arc carries R 0.06, G 0.24, B 0.30.]
Probabilistic FSM
[Figure: a two-state probabilistic FSM over S1 and S2; each arc carries (output : probability) pairs such as (a1 : 0.3) and (a2 : 0.4).]
The question here is: "What is the most likely state sequence given the output sequence seen?"
Developing the tree
[Trellis: start with P(S1) = 1.0, P(S2) = 0.0 on the ε-transition. Reading a1, the arc probabilities 0.1, 0.3, 0.2, 0.3 give path scores 1 * 0.1 = 0.1 and 0.3 for the paths out of S1 (the paths out of S2 score 0.0). Reading a2, the arc probabilities 0.2, 0.4, 0.3, 0.2 extend these to 0.1 * 0.2 = 0.02, 0.1 * 0.4 = 0.04, 0.3 * 0.3 = 0.09 and 0.3 * 0.2 = 0.06.]
Choose the winning sequence per state per iteration.
Tree structure (contd.)
[Trellis continued: the survivors 0.09 and 0.06 are extended on the next a1 to 0.09 * 0.1 = 0.009, 0.09 * 0.3 = 0.027, 0.012 and 0.018; the following a2 yields the leaves 0.027 * 0.3 = 0.0081 (S1), 0.027 * 0.2 = 0.0054 (S2), 0.012 * 0.4 = 0.0048 (S2) and 0.012 * 0.2 = 0.0024 (S1).]
The problem being addressed by this tree is S* = argmaxS P(S | a1-a2-a1-a2, μ),
where a1-a2-a1-a2 is the output sequence and μ the model or the machine.
Viterbi Algorithm for the Urn problem (first two symbols)
[Trellis: from S0, the ε-transition reaches U1, U2, U3 with the initial probabilities. On reading R, each path is extended by the transition and emission probabilities, giving nine candidates per step, of which one winner per state survives; the starred values 0.075*, 0.036* and 0.048* mark the winner sequences.]
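The pruning in that trellis is exactly the Viterbi recursion. A minimal sketch in Python, reusing the A, B, pi and SYM defined above (and the usual convention that state j emits the symbol on arrival):

```python
def viterbi(obs):
    """Most likely state sequence for an observation string such as 'RRGGBRGR'."""
    o = [SYM[c] for c in obs]
    delta = pi * B[:, o[0]]                 # scores after the first symbol
    back = []
    for t in o[1:]:
        cand = delta[:, None] * A           # cand[i, j] = delta(i) * a_ij
        back.append(cand.argmax(axis=0))    # winner predecessor per state j
        delta = cand.max(axis=0) * B[:, t]  # keep one sequence per state
    path = [int(delta.argmax())]            # follow back-pointers from the best end
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return ["U%d" % (s + 1) for s in reversed(path)]

print(viterbi("RRGGBRGR"))                  # the best state sequence
```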
Markov process of order > 1 (say 2)
The same theory works:
P(S) · P(O | S) = P(O0 | S0) · P(S1 | S0) · [P(O1 | S1) · P(S2 | S1 S0)] · [P(O2 | S2) · P(S3 | S2 S1)] · [P(O3 | S3) · P(S4 | S3 S2)] · [P(O4 | S4) · P(S5 | S4 S3)] · [P(O5 | S5) · P(S6 | S5 S4)] · [P(O6 | S6) · P(S7 | S6 S5)] · [P(O7 | S7) · P(S8 | S7 S6)] · [P(O8 | S8) · P(S9 | S8 S7)]
We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9 | S8 S7) = 1. O0 is an ε-transition.
Obs:   ε   R   R   G   G   B   R   G   R      (O0 O1 O2 O3 O4 O5 O6 O7 O8)
State: S0  S1  S2  S3  S4  S5  S6  S7  S8  S9
Adjustments
• The transition probability table will have state tuples on the rows and states on the columns
• The output probability table will remain the same
• In the Viterbi tree, the Markov process will take effect from the 3rd input symbol (εRR)
• There will be 27 leaves, out of which only 9 will remain
• Sequences ending in the same tuple will be compared: instead of U1, U2 and U3 we track U1U1, U1U2, U1U3, U2U1, U2U2, U2U3, U3U1, U3U2, U3U3 (see the sketch below)
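One way to implement this adjustment is to fold each (previous, current) pair into a single composite state and then run the ordinary first-order Viterbi over the 9 composite states. A sketch of the state-space construction in Python; the second-order table A2 here is a hypothetical uniform placeholder, since the lecture does not give its 27 entries:

```python
from itertools import product

states = ["U1", "U2", "U3"]
pairs = list(product(states, states))          # 9 composite states (prev, cur)

# Hypothetical second-order table P(next | prev, cur): 27 entries, uniform here.
A2 = {(p, c, n): 1.0 / 3 for (p, c) in pairs for n in states}

# Equivalent first-order transitions over composite states: a move from
# (p, c) to (c2, n) is only legal when c2 == c, i.e., histories must overlap.
A1 = {((p, c), (c2, n)): (A2[(p, c, n)] if c2 == c else 0.0)
      for (p, c) in pairs for (c2, n) in pairs}

# Emissions still depend only on the current state, so the output table is
# reused with the second component of each pair; ordinary Viterbi now applies.
```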
Forward and Backward Probability Calculation

Forward probability F(k, i)
Define F(k, i) = probability of being in state Si having seen o0 o1 o2 ... ok:
F(k, i) = P(o0 o1 o2 ... ok, Si)
With m as the length of the observed sequence:
P(observed sequence) = P(o0 o1 o2 ... om) = Σp=0..N P(o0 o1 o2 ... om, Sp) = Σp=0..N F(m, p)

Forward probability (contd.)
F(k, q) = P(o0 o1 o2 ... ok, Sq)
= P(o0 o1 o2 ... ok-1, ok, Sq)
= Σp=0..N P(o0 o1 o2 ... ok-1, Sp, ok, Sq)
= Σp=0..N P(o0 o1 o2 ... ok-1, Sp) · P(ok, Sq | o0 o1 o2 ... ok-1, Sp)
= Σp=0..N F(k-1, p) · P(ok, Sq | Sp)
= Σp=0..N F(k-1, p) · P(Sp → Sq on output ok)
[Timeline: O0 O1 O2 O3 ... Ok Ok+1 ... Om-1 Om over states S0 S1 S2 S3 ... Sp Sq ... Sm Sfinal]
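A sketch of this recurrence in Python, again reusing A, B, pi and SYM; the arc probability P(Sp → Sq on output ok) factors as a_pq · bq(ok):

```python
def forward(obs):
    """F over all states via F(k, q) = sum_p F(k-1, p) * a_pq * b_q(o_k)."""
    o = [SYM[c] for c in obs]
    F = pi * B[:, o[0]]           # base case: first emitted symbol
    for t in o[1:]:
        F = (F @ A) * B[:, t]     # induction step
    return F

print(forward("RRGGBRGR").sum())  # P(observation); matches brute_force_likelihood
```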
Backward probability B(k, i)
Define B(k, i) = probability of seeing ok ok+1 ok+2 ... om given that the state was Si:
B(k, i) = P(ok ok+1 ok+2 ... om | Si)
With m as the length of the observed sequence:
P(observed sequence) = P(o0 o1 o2 ... om) = P(o0 o1 o2 ... om | S0) = B(0, 0)

Backward probability (contd.)
B(k, p) = P(ok ok+1 ok+2 ... om | Sp)
= P(ok+1 ok+2 ... om, ok | Sp)
= Σq=0..N P(ok+1 ok+2 ... om, ok, Sq | Sp)
= Σq=0..N P(ok, Sq | Sp) · P(ok+1 ok+2 ... om | ok, Sq, Sp)
= Σq=0..N P(ok+1 ok+2 ... om | Sq) · P(ok, Sq | Sp)
= Σq=0..N B(k+1, q) · P(Sp → Sq on output ok)
[Timeline: O0 O1 O2 O3 ... Ok Ok+1 ... Om-1 Om over states S0 S1 S2 S3 ... Sp Sq ... Sm Sfinal]
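The same quantity computed from the right. A sketch (note that the slide counts the ε start state as S0 and folds ok into B(k, p); the version below uses the more common convention where the recursion folds in the next symbol, which yields the same sentence likelihood):

```python
def backward(obs):
    """Backward pass: beta(k, p) = sum_q a_pq * b_q(o_{k+1}) * beta(k+1, q)."""
    o = [SYM[c] for c in obs]
    beta = np.ones(len(pi))            # base case at the end of the sequence
    for t in reversed(o[1:]):
        beta = A @ (B[:, t] * beta)    # induction step, right to left
    return (pi * B[:, o[0]] * beta).sum()

print(backward("RRGGBRGR"))            # equals forward(obs).sum()
```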
Back to PCFG

Interesting Probabilities
The gunman sprayed the building with bullets
    1      2      3    4     5      6     7
• Inside probability βNP(4,5): what is the probability of having an NP at this position such that it will derive "the building"?
• Outside probability αNP(4,5): what is the probability of starting from N1 and deriving "The gunman sprayed", an NP, and "with bullets"?
Interesting Probabilities
Random variables to be considered:
• The non-terminal being expanded, e.g., NP
• The word-span covered by the non-terminal, e.g., (4,5) refers to the words "the building"
While calculating probabilities, consider:
• The rule to be used for expansion, e.g., NP → DT NN
• The probabilities associated with the RHS non-terminals, e.g., the DT subtree's and the NN subtree's inside/outside probabilities
Outside Probability
αj(p,q): the probability of beginning with N1 and generating the non-terminal Njpq and all words outside wp ... wq:
αj(p,q) = P(w1(p-1), Njpq, w(q+1)m | G)
[Figure: N1 spanning w1 ... wm, with Nj spanning wp ... wq and everything outside that span generated around it]

Inside Probability
βj(p,q): the probability of generating the words wp ... wq starting with the non-terminal Njpq:
βj(p,q) = P(wpq | Njpq, G)
[Figure: N1 spanning w1 ... wm, with Nj deriving exactly wp ... wq]
Outside & Inside Probabilities: example
The gunman sprayed the building with bullets
    1      2      3    4     5      6     7
αNP(4,5) for "the building" = P(The gunman sprayed, NP4,5, with bullets | G)
βNP(4,5) for "the building" = P(the building | NP4,5, G)
Calculating Inside Probabilities βj(p,q)
Base case:
βj(k,k) = P(wk | Njkk, G) = P(Nj → wk | G)
The base case is used for rules which derive the words (terminals) directly.
E.g., suppose Nj = NN is being considered and NN → building is one of the rules, with probability 0.5:
βNN(5,5) = P(building | NN5,5, G) = P(NN → building | G) = 0.5
Induction Step (assuming a grammar in Chomsky Normal Form)
βj(p,q) = P(wpq | Njpq, G)
        = Σr,s Σd=p..q-1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q)
[Figure: Nj spanning wp ... wq, split at d into Nr over wp ... wd and Ns over wd+1 ... wq]
• Consider different splits of the words, indicated by d: e.g., for "the huge building", split after d = 2 or d = 3
• Consider the different non-terminals that can be used in the rule: NP → DT NN and NP → DT NNS are available options
• Sum over all of these.
The Bottom-Up Approach
The idea of induction: consider "the gunman"
Base cases: apply unary rules
  DT → the       Prob = 1.0
  NN → gunman    Prob = 0.5
Induction: the probability that an NP covers these 2 words
  = P(NP → DT NN) * P(DT deriving the word "the") * P(NN deriving the word "gunman")
  = 0.5 * 1.0 * 0.5 = 0.25
[Tree: NP0.5 over DT1.0 ("The") and NN0.5 ("gunman")]
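Carried out over all spans, this bottom-up computation is probabilistic CYK. A sketch in Python over the PCFG dictionary defined earlier; the single unary pass for NP → NNS is enough because this grammar has no unary chains, and the printed values follow the arithmetic worked out in these slides:

```python
from collections import defaultdict

def inside(words):
    """beta[(X, p, q)] = P(words p..q | X), with 1-indexed spans."""
    n = len(words)
    beta = defaultdict(float)

    def apply_unaries(p, q):
        # Unary non-terminal rules such as NP -> NNS; one pass suffices here.
        for lhs, rules in PCFG.items():
            for rhs, prob in rules:
                if len(rhs) == 1 and rhs[0] in PCFG:
                    beta[(lhs, p, q)] += prob * beta[(rhs[0], p, q)]

    for k, w in enumerate(words, start=1):        # base case: lexical rules
        for lhs, rules in PCFG.items():
            for rhs, prob in rules:
                if rhs == (w,):
                    beta[(lhs, k, k)] += prob
        apply_unaries(k, k)

    for span in range(2, n + 1):                  # induction on span length
        for p in range(1, n - span + 2):
            q = p + span - 1
            for lhs, rules in PCFG.items():
                for rhs, prob in rules:
                    if len(rhs) == 2:             # binary rule Nj -> Nr Ns
                        r, s = rhs
                        for d in range(p, q):     # all split points
                            beta[(lhs, p, q)] += (prob * beta[(r, p, d)]
                                                  * beta[(s, d + 1, q)])
            apply_unaries(p, q)
    return beta

b = inside("the gunman sprayed the building with bullets".split())
print(b[("NP", 1, 2)])                            # 0.25
print(b[("S", 1, 7)])                             # 0.006 = P(t1) + P(t2)
```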
Parse Triangle
• A parse triangle is constructed for calculating βj(p,q)
• Probability of a sentence using βj(p,q):
  P(w1m | G) = P(N1 ⇒* w1m | G) = P(w1m | N11m, G) = β1(1,m)
Parse Triangle
Rows and columns index the words: The (1), gunman (2), sprayed (3), the (4), building (5), with (6), bullets (7)
• Fill the diagonal with βj(k,k):
  βDT(1,1) = 1.0, βNN(2,2) = 0.5, βVBD(3,3) = 1.0, βDT(4,4) = 1.0, βNN(5,5) = 0.5, βP(6,6) = 1.0, βNNS(7,7) = 1.0
Parse Triangle (contd.)
• Calculate the inner cells using the induction formula, e.g.:
  βNP(1,2) = P(the gunman | NP1,2, G) = P(NP → DT NN) * βDT(1,1) * βNN(2,2) = 0.5 * 1.0 * 0.5 = 0.25
Example Parse t1
The gunman sprayed the building with bullets.
[S1.0 [NP0.5 [DT1.0 The] [NN0.5 gunman]] [VP0.6 [VP0.4 [VBD1.0 sprayed] [NP0.5 [DT1.0 the] [NN0.5 building]]] [PP1.0 [P1.0 with] [NP0.3 [NNS1.0 bullets]]]]]
The rule used at the top of the VP here is VP → VP PP.
Another Parse t2
The gunman sprayed the building with bullets.
[S1.0 [NP0.5 [DT1.0 The] [NN0.5 gunman]] [VP0.4 [VBD1.0 sprayed] [NP0.2 [NP0.5 [DT1.0 the] [NN0.5 building]] [PP1.0 [P1.0 with] [NP0.3 [NNS1.0 bullets]]]]]]
The rule used at the top of the VP here is VP → VBD NP.
Parse Triangle (contd.)
Filling the remaining cells: βNP(1,2) = 0.25, βNP(4,5) = 0.25, βNP(7,7) = P(NP → NNS) * βNNS(7,7) = 0.3, βVP(3,5) = 0.4 * 1.0 * 0.25 = 0.1, βPP(6,7) = 1.0 * 1.0 * 0.3 = 0.3, βNP(4,7) = 0.2 * 0.25 * 0.3 = 0.015
βVP(3,7) = P(sprayed the building with bullets | VP3,7, G)
         = P(VP → VP PP) * βVP(3,5) * βPP(6,7) + P(VP → VBD NP) * βVBD(3,3) * βNP(4,7)
         = 0.6 * 0.1 * 0.3 + 0.4 * 1.0 * 0.015 = 0.024
Finally, βS(1,7) = P(S → NP VP) * βNP(1,2) * βVP(3,7) = 1.0 * 0.25 * 0.024 = 0.006, which is exactly P(t1) + P(t2) = 0.0045 + 0.0015.
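These cell values can be cross-checked against the inside() sketch above:

```python
b = inside("the gunman sprayed the building with bullets".split())
print(b[("VP", 3, 5)], b[("PP", 6, 7)], b[("NP", 4, 7)])   # 0.1 0.3 0.015
print(b[("VP", 3, 7)], b[("S", 1, 7)])                     # 0.024 0.006
```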
Different Parses
Consider:
• Different splitting points, e.g., the 5th and the 3rd position
• Using different rules for VP expansion, e.g., VP → VP PP and VP → VBD NP
Different parses for the VP "sprayed the building with bullets" can be constructed this way.
Outside Probabilities αj(p,q)
Base case:
  α1(1,m) = 1 for the start symbol
  αj(1,m) = 0 for j ≠ 1
Inductive step for calculating αj(p,q), summing over f, g and e:
  αj(p,q) = Σf,g Σe αf(p,e) · P(Nf → Nj Ng) · βg(q+1, e)
[Figure: under N1, the non-terminal Nf spans wp ... we and expands by Nf → Nj Ng, with Nj over wp ... wq and Ng over wq+1 ... we]
(The figure shows the case where Nj is the left child; there is a symmetric term, summed in the same way, for Nj as the right child.)
Probability of a Sentence
• The joint probability of a sentence w1m and the event that some constituent spans words wp to wq is given as:
  P(w1m, Npq | G) = Σj P(w1m, Njpq | G) = Σj αj(p,q) · βj(p,q)
The gunman sprayed the building with bullets
    1      2      3    4     5      6     7
E.g., for the span (4,5):
P(The gunman ... bullets, N4,5 | G) = Σj P(The gunman ... bullets, Nj4,5 | G)
  = αNP(4,5) βNP(4,5) + αVP(4,5) βVP(4,5) + ...