Parsing Algorithms
Yoav Goldberg
(with slides by Michael Collins, Julia Hockenmaier)
Parsing: recovering the constituents of a sentence.
Why is parsing hard? Ambiguity

"Fat people eat candy":
(S (NP (Adj Fat) (Nn people)) (VP (Vb eat) (NP (Nn candy))))

"Fat people eat accumulates":
(S (NP (Nn Fat) (AdjP (Nn people) (Vb eat))) (VP (Vb accumulates)))
Why is parsing hard? Real sentences are long...

"Former Beatle Paul McCartney today was ordered to pay nearly $50M to his estranged wife as their bitter divorce battle came to an end."

"Welcome to our Columbus hotels guide, where you'll find honest, concise hotel reviews, all discounts, a lowest rate guarantee, and no booking fees."
Let’s learn how to parse
Context Free Grammars
A context free grammar G = (N,⌃,R,S) where:I N is a set of non-terminal symbolsI ⌃ is a set of terminal symbolsI R is a set of rules of the form X ! Y1Y2 · · ·Yn
for n � 0, X 2 N, Yi 2 (N [ ⌃)
I S 2 N is a special start symbol
14 / 48
Context Free Grammars
a simple grammarN = {S,NP,VP,Adj ,Det ,Vb,Noun}⌃ = {fruit , flies, like, a, banana, tomato, angry}S =‘S’R =
S ! NP VPNP ! Adj NounNP ! Det NounVP ! Vb NPAdj ! fruitNoun ! fliesVb ! likeDet ! aNoun ! bananaNoun ! tomatoAdj ! angry
15 / 48
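As a concrete aside, the toy grammar above can be encoded directly in Python. The (LHS, RHS-tuple) representation below is my own choice for illustration, not something prescribed by the slides:

```python
# The slide's toy grammar: each rule is a (LHS, RHS) pair,
# where RHS is a tuple of non-terminals or terminals.
GRAMMAR = [
    ("S",    ("NP", "VP")),
    ("NP",   ("Adj", "Noun")),
    ("NP",   ("Det", "Noun")),
    ("VP",   ("Vb", "NP")),
    ("Adj",  ("fruit",)),
    ("Noun", ("flies",)),
    ("Vb",   ("like",)),
    ("Det",  ("a",)),
    ("Noun", ("banana",)),
    ("Noun", ("tomato",)),
    ("Adj",  ("angry",)),
]

# N: every symbol that appears on a left-hand side.
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}
# Sigma: every RHS symbol that never appears on a left-hand side.
TERMINALS = {sym for _, rhs in GRAMMAR for sym in rhs
             if sym not in NONTERMINALS}

print(sorted(NONTERMINALS))  # ['Adj', 'Det', 'NP', 'Noun', 'S', 'VP', 'Vb']
print(sorted(TERMINALS))     # ['a', 'angry', 'banana', 'flies', 'fruit', 'like', 'tomato']
```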
Left-most derivations

A left-most derivation is a sequence of strings s1, ..., sn where:
- s1 = S, the start symbol
- sn ∈ Σ*, meaning sn is only terminal symbols
- Each si for i = 2...n is derived from s(i-1) by picking the left-most non-terminal X in s(i-1) and replacing it by some β, where X → β is a rule in R.

For example: [S], [NP VP], [Adj Noun VP], [fruit Noun VP], [fruit flies VP], [fruit flies Vb NP], [fruit flies like NP], [fruit flies like Det Noun], [fruit flies like a Noun], [fruit flies like a banana]
Left-most derivation example

S
NP VP
Adj Noun VP
fruit Noun VP
fruit flies VP
fruit flies Vb NP
fruit flies like NP
fruit flies like Det Noun
fruit flies like a Noun
fruit flies like a banana

Rules used, in order:
S → NP VP
NP → Adj Noun
Adj → fruit
Noun → flies
VP → Vb NP
Vb → like
NP → Det Noun
Det → a
Noun → banana

- The resulting derivation can be written as a tree.
- Many trees can be generated.
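The derivation above can be mechanized. The sketch below (my own code, assuming the rule representation used earlier) applies the slide's rule sequence to the left-most non-terminal at each step:

```python
# Left-most derivation: repeatedly expand the left-most non-terminal,
# using the fixed sequence of rule choices from the slide's example.
RULES = [
    ("S", ("NP", "VP")), ("NP", ("Adj", "Noun")), ("Adj", ("fruit",)),
    ("Noun", ("flies",)), ("VP", ("Vb", "NP")), ("Vb", ("like",)),
    ("NP", ("Det", "Noun")), ("Det", ("a",)), ("Noun", ("banana",)),
]
NONTERMINALS = {lhs for lhs, _ in RULES}

def leftmost_derivation(rule_sequence, start="S"):
    """Apply each rule to the left-most non-terminal in turn."""
    s = [start]
    steps = [list(s)]
    for lhs, rhs in rule_sequence:
        i = next(k for k, sym in enumerate(s) if sym in NONTERMINALS)
        assert s[i] == lhs, "each rule must expand the left-most non-terminal"
        s = s[:i] + list(rhs) + s[i + 1:]
        steps.append(list(s))
    return steps

steps = leftmost_derivation(RULES)
for step in steps:
    print(" ".join(step))
# last line printed: fruit flies like a banana
```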
Context Free Grammars

a simple grammar:
S → NP VP
NP → Adj Noun
NP → Det Noun
VP → Vb NP
Adj → fruit
Noun → flies
Vb → like
Det → a
Noun → banana
Noun → tomato
Adj → angry
...

Examples (the grammar generates many trees):
(S (NP (Adj Fruit) (Noun Flies)) (VP (Vb like) (NP (Det a) (Noun banana))))
(S (NP (Adj Angry) (Noun Flies)) (VP (Vb like) (NP (Det a) (Noun banana))))
(S (NP (Adj Angry) (Noun Flies)) (VP (Vb like) (NP (Det a) (Noun tomato))))
(S (NP (Adj Angry) (Noun banana)) (VP (Vb like) (NP (Det a) (Noun tomato))))
(S (NP (Det a) (Noun banana)) (VP (Vb like) (NP (Det a) (Noun tomato))))
(S (NP (Det a) (Noun banana)) (VP (Vb like) (NP (Adj angry) (Noun banana))))
Parsing with (P)CFGs
Parsing with CFGs

Let's assume...
- Let's assume natural language is generated by a CFG.
- ...and let's assume we have the grammar.
- Then parsing is easy: given a sentence, find the chain of derivations starting from S that generates it.

Problem: Natural language is NOT generated by a CFG.
Solution: We assume really hard that it is.

Problem: We don't have the grammar.
Solution: We'll ask a genius linguist to write it!

Problem: How do we find the chain of derivations?
Solution: With dynamic programming! (soon)

Problem: A real grammar yields hundreds of possible derivations per sentence.
Solution: No problem! We'll choose the best one. (sooner)
Obtaining a Grammar

Let a genius linguist write it
- Hard. Many rules, many complex interactions.
- Genius linguists don't grow on trees!

An easier way: ask a linguist to grow trees
- Ask a linguist to annotate sentences with tree structure.
- (This need not be a genius; smart is enough.)
- Then extract the rules from the annotated trees.

Treebanks
- English Treebank: 40k sentences, manually annotated with tree structure.
- Hebrew Treebank: about 5k sentences.
Treebank Sentence Example

( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
Supervised Learning from a Treebank

((fruit/ADJ flies/NN) (like/VB (a/DET banana/NN)))
(time/NN (flies/VB (like/IN (an/DET arrow/NN))))
... ... ...
Extracting a CFG from Trees
- The leaves of the trees define Σ
- The internal nodes of the trees define N
- Add a special S symbol on top of all trees
- Each node and its children is a rule in R

Extracting Rules

(S (NP (Adj Fruit) (Noun Flies)) (VP (Vb like) (NP (Det a) (Noun banana))))

S → NP VP
NP → Adj Noun
Adj → fruit
...
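The extraction step is a simple tree walk. The sketch below is my own encoding (trees as nested tuples whose first element is the node label; leaves are plain strings):

```python
# Rule extraction: each internal node together with its children
# contributes one rule LHS -> RHS; leaves are the terminal symbols.
def extract_rules(tree):
    """tree is a (label, child, child, ...) tuple; leaves are strings."""
    rules = []
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rules.append((label, rhs))
    for c in children:
        if not isinstance(c, str):
            rules.extend(extract_rules(c))
    return rules

tree = ("S",
        ("NP", ("Adj", "fruit"), ("Noun", "flies")),
        ("VP", ("Vb", "like"), ("NP", ("Det", "a"), ("Noun", "banana"))))

rules = extract_rules(tree)
for lhs, rhs in rules:
    print(lhs, "->", " ".join(rhs))
# S -> NP VP, NP -> Adj Noun, Adj -> fruit, ... (9 rules in total)
```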
From CFG to PCFG
- English is NOT generated by a CFG ⇒ it's generated by a PCFG!
- PCFG: probabilistic context free grammar. Just like a CFG, but each rule has an associated probability.
- All probabilities for the same LHS sum to 1.
- Multiplying all the rule probabilities in a derivation gives the probability of the derivation.
- We want the tree with maximum probability.

More formally:

P(tree, sent) = ∏_{(l → r) ∈ deriv(tree)} p(l → r)

tree* = argmax_{tree ∈ trees(sent)} P(tree | sent) = argmax_{tree ∈ trees(sent)} P(tree, sent)
PCFG Example

a simple PCFG:
1.0 S → NP VP
0.3 NP → Adj Noun
0.7 NP → Det Noun
1.0 VP → Vb NP
0.2 Adj → fruit
0.2 Noun → flies
1.0 Vb → like
1.0 Det → a
0.4 Noun → banana
0.4 Noun → tomato
0.8 Adj → angry

Example:
(S (NP (Adj Fruit) (Noun Flies)) (VP (Vb like) (NP (Det a) (Noun banana))))
1.0 × 0.3 × 0.2 × 0.2 × 1.0 × 1.0 × 0.7 × 1.0 × 0.4 = 0.00336
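The slide's product can be computed by walking the tree and multiplying rule probabilities. A minimal sketch, reusing the nested-tuple tree encoding from before (the dictionary layout is my own choice):

```python
# Probability of a tree under a PCFG: the product of the probabilities
# of all rules used in its derivation.
PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Adj", "Noun")): 0.3,
    ("NP", ("Det", "Noun")): 0.7,
    ("VP", ("Vb", "NP")): 1.0,
    ("Adj", ("fruit",)): 0.2,
    ("Noun", ("flies",)): 0.2,
    ("Vb", ("like",)): 1.0,
    ("Det", ("a",)): 1.0,
    ("Noun", ("banana",)): 0.4,
}

def tree_prob(tree):
    """Multiply rule probabilities down the tree (leaves are strings)."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = PROBS[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

tree = ("S",
        ("NP", ("Adj", "fruit"), ("Noun", "flies")),
        ("VP", ("Vb", "like"), ("NP", ("Det", "a"), ("Noun", "banana"))))
print(tree_prob(tree))  # 1 * 0.3 * 0.2 * 0.2 * 1 * 1 * 0.7 * 1 * 0.4, about 0.00336
```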
Parsing with PCFG
- Parsing with a PCFG is finding the most probable derivation for a given sentence.
- This can be done quite efficiently with dynamic programming (the CKY algorithm).

Obtaining the probabilities
- We estimate them from the Treebank.
- P(LHS → RHS) = count(LHS → RHS) / count(LHS)
- We can also add smoothing and backoff, as before.
- Dealing with unknown words: like in the HMM.
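The relative-frequency estimate above is one line of arithmetic per rule. A sketch over a tiny, made-up set of rule counts (the counts themselves are illustrative, not from any real treebank):

```python
# Maximum-likelihood rule probabilities: count rule occurrences in a
# treebank and normalise by the total count of each left-hand side.
from collections import Counter

rule_counts = Counter({
    ("NP", ("Det", "Noun")): 7,   # hypothetical counts
    ("NP", ("Adj", "Noun")): 3,
    ("VP", ("Vb", "NP")): 10,
})

lhs_counts = Counter()
for (lhs, _), c in rule_counts.items():
    lhs_counts[lhs] += c

probs = {(lhs, rhs): c / lhs_counts[lhs]
         for (lhs, rhs), c in rule_counts.items()}

print(probs[("NP", ("Det", "Noun"))])  # 0.7
print(probs[("NP", ("Adj", "Noun"))])  # 0.3
print(probs[("VP", ("Vb", "NP"))])     # 1.0
```

Note the probabilities for each LHS sum to 1, as the PCFG definition requires; real systems would add smoothing on top of this.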
The CKY algorithm
The Problem

Input
- Sentence (a list of words); n = sentence length
- CFG grammar (with weights on rules); g = number of non-terminal symbols

Output
- A parse tree / the best parse tree

But...
- Exponentially many possible parse trees!

Solution
- Dynamic programming!
CKY

Cocke (196?), Kasami (1965), Younger (1967)
3 Interesting Problems
- Recognition: can this string be generated by the grammar?
- Parsing: show me a possible derivation...
- Disambiguation: show me THE BEST derivation

CKY can do all of these in polynomial time
- For any CNF grammar
CNF (Chomsky Normal Form)

Definition: a CFG is in CNF if it only has rules like:
- A → B C
- A → α

A, B, C are non-terminal symbols; α is a terminal symbol (a word...).
- All terminal symbols are RHSs of unary rules
- All non-terminal symbols are RHSs of binary rules

CKY can be easily extended to also handle unary rules: A → B
Binarization

Fact
- Any CFG can be converted to CNF form

Specifically for natural language grammars
- We already have A → α
- (A → α β is also easy to handle)
- Unary rules (A → B) are OK
- Only problem: S → NP PP VP PP

Binarization:
S → NP NP|PP.VP.PP
NP|PP.VP.PP → PP NP.PP|VP.PP
NP.PP|VP.PP → VP NP.PP.VP|PP
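Binarization replaces a long rule with a chain of binary rules over freshly invented intermediate symbols. A sketch of one way to do this (the naming scheme for intermediate symbols below is my own; the slide uses a slightly different convention):

```python
# Binarize LHS -> X1 X2 ... Xn (n > 2) into a chain of binary rules,
# introducing intermediate symbols that record the remaining material.
def binarize(lhs, rhs):
    rules = []
    current = lhs
    rest = list(rhs)
    while len(rest) > 2:
        # fresh symbol for "everything after the first child"
        new_sym = current + "|" + ".".join(rest[1:])
        rules.append((current, (rest[0], new_sym)))
        current, rest = new_sym, rest[1:]
    rules.append((current, tuple(rest)))
    return rules

rules = binarize("S", ("NP", "PP", "VP", "PP"))
for lhs, rhs in rules:
    print(lhs, "->", " ".join(rhs))
# S -> NP S|PP.VP.PP
# S|PP.VP.PP -> PP S|PP.VP.PP|VP.PP
# S|PP.VP.PP|VP.PP -> VP PP
```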
Finally, CKY

Recognition
- Main idea:
  - Build parse trees from the bottom up
  - Combine built trees to form bigger trees using grammar rules
  - When left with a single tree, verify the root is S
- Exponentially many possible trees...
  - Search over all of them in polynomial time using DP
  - Shared structure: smaller trees
Main Idea

If we know:
- wi ... wj is an NP
- wj+1 ... wk is a VP

and the grammar has the rule:
- S → NP VP

Then we know:
- S can derive wi ... wk
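This idea turns directly into the CKY recognition algorithm. A minimal sketch for a tiny CNF grammar (my own toy rule tables, reusing the fruit-flies grammar from earlier slides):

```python
# CKY recognition for a CNF grammar: chart[i][j] holds all non-terminals
# that derive words i..j (inclusive, 0-indexed).
from itertools import product

BINARY = {("NP", "VP"): {"S"}, ("Det", "Noun"): {"NP"},
          ("Adj", "Noun"): {"NP"}, ("Vb", "NP"): {"VP"}}
LEXICAL = {"fruit": {"Adj"}, "flies": {"Noun"}, "like": {"Vb"},
           "a": {"Det"}, "banana": {"Noun"}}

def cky_recognize(words, start="S"):
    n = len(words)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                # width-1 spans: lexical rules
        chart[i][i] = set(LEXICAL.get(w, ()))
    for span in range(2, n + 1):                 # widths 2..n, bottom up
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                # every split point
                for b, c in product(chart[i][k], chart[k + 1][j]):
                    chart[i][j] |= BINARY.get((b, c), set())
    return start in chart[0][n - 1]

print(cky_recognize("fruit flies like a banana".split()))  # True
print(cky_recognize("banana fruit like".split()))          # False
```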
Data Structure
- (Half of a) two-dimensional array (n × n), drawn on its side.
- Each cell [i, j] holds all nonterminals that can derive word i through word j.
- Imagine each cell as a g-dimensional array.

(Illustrated on the sentence "Sue saw her girl with a telescope")
Filling the table / Handling unary rules? / Which order?

(Figures worked through on the sentence "Sue saw her girl with a telescope")
Complexity?
- n²g cells to fill
- g²n ways to fill each one

⇒ O(g³n³)
A Note on Implementation

Smart implementation can reduce the runtime:
- Worst case is still O(g³n³), but it helps in practice
- No need to check all grammar rules A → B C at each location:
  - only those compatible with the B or C of the current split
  - prune binarized symbols which are too long for the current position
  - once you have found one way to derive A, you can break out of the loop
  - order grammar rules from frequent to infrequent
- Need both efficient random access and iteration over possible symbols
  - Keep both a hash and a list, implemented as arrays
Finding a parse

Parsing: we want to actually find a parse tree.
Easy: also keep a possible split point for each NT.
PCFG Parsing and Disambiguation

Disambiguation: we want THE BEST parse tree.
Easy: for each NT, keep the best split point and its score.
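Keeping the best score and split point per non-terminal extends the recognizer into a probabilistic (Viterbi) CKY. A sketch under the same toy grammar assumptions as before, with backpointers for reconstructing the tree:

```python
# Probabilistic CKY: each cell keeps, per non-terminal, the best
# probability and a backpointer (split point and child symbols).
from itertools import product

BINARY = {("NP", "VP"): [("S", 1.0)], ("Det", "Noun"): [("NP", 0.7)],
          ("Adj", "Noun"): [("NP", 0.3)], ("Vb", "NP"): [("VP", 1.0)]}
LEXICAL = {"fruit": [("Adj", 0.2), ("Noun", 0.1)], "flies": [("Noun", 0.2)],
           "like": [("Vb", 1.0)], "a": [("Det", 1.0)], "banana": [("Noun", 0.4)]}

def pcky(words, start="S"):
    n = len(words)
    best = [[{} for _ in range(n)] for _ in range(n)]  # (i,j) -> {A: prob}
    back = [[{} for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        for a, p in LEXICAL.get(w, []):
            best[i][i][a] = p
            back[i][i][a] = w
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for (b, pb), (c, pc) in product(best[i][k].items(),
                                                best[k + 1][j].items()):
                    for a, pr in BINARY.get((b, c), []):
                        p = pr * pb * pc
                        if p > best[i][j].get(a, 0.0):   # keep only the best
                            best[i][j][a] = p
                            back[i][j][a] = (k, b, c)
    return best[0][n - 1].get(start, 0.0), back

def build_tree(back, i, j, a):
    """Follow backpointers to reconstruct the best tree for (i, j, a)."""
    bp = back[i][j][a]
    if isinstance(bp, str):
        return (a, bp)
    k, b, c = bp
    return (a, build_tree(back, i, k, b), build_tree(back, k + 1, j, c))

p, back = pcky("fruit flies like a banana".split())
print(p)  # about 0.00336, matching the PCFG example slide
print(build_tree(back, 0, 4, "S"))
```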
Implementation Tricks #1: sum instead of product

As in the HMM, multiplying probabilities is evil:
- keeping the product of many floating point numbers is dangerous, because the product gets really small
- we either grow in runtime, or lose precision (underflowing to 0)
- either way, multiplying floats is expensive

Solution: use a sum of logs instead
- remember: log(p1 × p2) = log(p1) + log(p2)
⇒ Use log probabilities instead of probabilities
⇒ Add instead of multiply
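A quick demonstration of why this matters: multiplying a few hundred small probabilities underflows a double to zero, while the sum of logs stays perfectly representable.

```python
# The log-probability trick: sums of logs replace products,
# avoiding underflow when many small probabilities are multiplied.
import math

probs = [0.1] * 400          # a long derivation of low-probability rules

product = 1.0
for p in probs:
    product *= p
print(product)               # 0.0 -- underflows to exactly zero

log_prob = sum(math.log(p) for p in probs)
print(log_prob)              # about -921.03, no underflow
```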
The big question

Does this work?

Evaluation
Parsing Evaluation
- Let's assume we have a parser; how do we know how good it is?
  ⇒ Compare output trees to gold trees.
- But how do we compare trees?
- A credit of 1 if the tree is correct and 0 otherwise is too harsh.
- Represent each tree as a set of labeled spans:
  - NP from word 1 to word 5.
  - VP from word 3 to word 4.
  - S from word 1 to word 23.
  - ...
- Measure Precision, Recall and F1 over these spans, as in the segmentation case.
Evaluation: Representing Trees as Constituents

(S (NP (DT the) (NN lawyer)) (VP (Vt questioned) (NP (DT the) (NN witness))))

Label  Start Point  End Point
NP     1            2
NP     4            5
VP     3            5
S      1            5

(by Mike Collins)
Precision and Recall

Gold standard:
Label  Start Point  End Point
NP     1            2
NP     4            5
NP     4            8
PP     6            8
NP     7            8
VP     3            8
S      1            8

Parse output:
Label  Start Point  End Point
NP     1            2
NP     4            5
PP     6            8
NP     7            8
VP     3            8
S      1            8

- G = number of constituents in gold standard = 7
- P = number in parse output = 6
- C = number correct = 6

Recall = 100% × C/G = 100% × 6/7
Precision = 100% × C/P = 100% × 6/6
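The slide's numbers can be reproduced directly. A sketch using sets of (label, start, end) triples (sufficient here since no span repeats; a real evaluator would use multisets to handle duplicate constituents):

```python
# Labelled-span evaluation: compare gold and predicted constituents
# (label, start, end) and compute precision, recall and F1.
gold = {("NP", 1, 2), ("NP", 4, 5), ("NP", 4, 8), ("PP", 6, 8),
        ("NP", 7, 8), ("VP", 3, 8), ("S", 1, 8)}
pred = {("NP", 1, 2), ("NP", 4, 5), ("PP", 6, 8),
        ("NP", 7, 8), ("VP", 3, 8), ("S", 1, 8)}

correct = len(gold & pred)             # C = 6
precision = correct / len(pred)        # C / P = 6/6
recall = correct / len(gold)           # C / G = 6/7
f1 = 2 * precision * recall / (precision + recall)

print(precision)  # 1.0
print(recall)     # 6/7, about 0.857
print(f1)         # 12/13, about 0.923
```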
Parsing Evaluation
- Is this a good measure?
- Why? Why not?
Parsing Evaluation

How well does the PCFG parser we learned do?
Not very well: about 73% F1 score.
Problems with PCFGs
Weaknesses of Probabilistic Context-Free Grammars
(Michael Collins, Columbia University)
Weaknesses of PCFGs
- Lack of sensitivity to lexical information
- Lack of sensitivity to structural frequencies
(S (NP (NNP IBM)) (VP (Vt bought) (NP (NNP Lotus))))

p(t) = q(S → NP VP) × q(NP → NNP) × q(NNP → IBM) × q(VP → Vt NP) × q(Vt → bought) × q(NP → NNP) × q(NNP → Lotus)
Another Case of PP Attachment Ambiguity

(a) (S (NP (NNS workers))
       (VP (VP (VBD dumped) (NP (NNS sacks)))
           (PP (IN into) (NP (DT a) (NN bin)))))

(b) (S (NP (NNS workers))
       (VP (VBD dumped)
           (NP (NP (NNS sacks))
               (PP (IN into) (NP (DT a) (NN bin))))))

Rules for (a): S → NP VP, NP → NNS, VP → VP PP, VP → VBD NP, NP → NNS, PP → IN NP, NP → DT NN, plus the lexical rules NNS → workers, VBD → dumped, NNS → sacks, IN → into, DT → a, NN → bin.

Rules for (b): identical, except that VP → VP PP is replaced by NP → NP PP.

If q(NP → NP PP) > q(VP → VP PP) then (b) is more probable, else (a) is more probable. The attachment decision is completely independent of the words.
A Case of Coordination Ambiguity

(a) (NP (NP (NP (NNS dogs)) (PP (IN in) (NP (NNS houses)))) (CC and) (NP (NNS cats)))

(b) (NP (NP (NNS dogs)) (PP (IN in) (NP (NP (NNS houses)) (CC and) (NP (NNS cats)))))

Rules for both (a) and (b): NP → NP CC NP, NP → NP PP, NP → NNS (three times), PP → IN NP, plus the lexical rules NNS → dogs, IN → in, NNS → houses, CC → and, NNS → cats.

Here the two parses have identical rules, and therefore have identical probability under any assignment of PCFG rule probabilities.
Structural Preferences: Close Attachment

(a) (NP (NP NN) (PP IN (NP (NP NN) (PP IN (NP NN)))))

(b) (NP (NP (NP NN) (PP IN (NP NN))) (PP IN (NP NN)))

- Example: president of a company in Africa
- Both parses have the same rules, and therefore receive the same probability under a PCFG
- "Close attachment" (structure (a)) is twice as likely in Wall Street Journal text.
Lexicalized PCFGs

PCFG Problem 1: lack of sensitivity to lexical information (words)

Solution
- Make the PCFG aware of words (lexicalized PCFG)
- Main idea: head words
Head Words

Each constituent has one word that captures its "essence".
- (S John saw the young boy with the large hat)
- (VP saw the young boy with the large hat)
- (NP the young boy with the large hat)
- (NP the large hat)
- (PP with the large hat)
  - hat is the "semantic head"
  - with is the "functional head"
  - (it is common to choose the functional head)
Heads in Context-Free Rules

Add annotations specifying the "head" of each rule:

S → NP VP
VP → Vi
VP → Vt NP
VP → VP PP
NP → DT NN
NP → NP PP
PP → IN NP

Vi → sleeps
Vt → saw
NN → man
NN → woman
NN → telescope
DT → the
IN → with
IN → in
More about Heads
- Each context-free rule has one "special" child that is the head of the rule, e.g.:
  S → NP VP (VP is the head)
  VP → Vt NP (Vt is the head)
  NP → DT NN NN (NN is the head)
- A core idea in syntax (e.g., see X-bar Theory, Head-Driven Phrase Structure Grammar)
- Some intuitions:
  - The central sub-constituent of each rule.
  - The semantic predicate in each rule.
Rules which Recover Heads: An Example for NPs

If the rule contains NN, NNS, or NNP: choose the rightmost NN, NNS, or NNP
Else if the rule contains an NP: choose the leftmost NP
Else if the rule contains a JJ: choose the rightmost JJ
Else if the rule contains a CD: choose the rightmost CD
Else choose the rightmost child

e.g.,
NP → DT NNP NN
NP → DT NN NNP
NP → NP PP
NP → DT JJ
NP → DT
Rules which Recover Heads: An Example for VPs
If the rule contains Vi or Vt: choose the leftmost Vi or Vt.

Else if the rule contains a VP: choose the leftmost VP.

Else: choose the leftmost child.

e.g.,
VP → Vt NP
VP → VP PP
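The NP head rule above can be sketched as a small function over a rule's child labels (a sketch of the slide's heuristic only, not a full head-percolation table):

```python
def np_head(children):
    """Index of the head child of an NP rule, per the slide's heuristic.

    `children` is the list of child labels on the rule's right-hand side.
    """
    nn = [i for i, c in enumerate(children) if c in ("NN", "NNS", "NNP")]
    if nn:
        return nn[-1]                      # rightmost NN/NNS/NNP
    for label, pick in (("NP", 0), ("JJ", -1), ("CD", -1)):
        idx = [i for i, c in enumerate(children) if c == label]
        if idx:
            return idx[pick]               # leftmost NP, rightmost JJ/CD
    return len(children) - 1               # fallback: rightmost child

print(np_head(["DT", "NNP", "NN"]))  # 2
print(np_head(["NP", "PP"]))         # 0
print(np_head(["DT", "JJ"]))         # 1
print(np_head(["DT"]))               # 0
```

The same shape works for the VP rule on the next slide, with the search order and direction swapped.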
Adding Headwords to Trees
S
NP
DT
the
NN
lawyer
VP
Vt
questioned
NP
DT
the
NN
witness
⇓
S(questioned)
NP(lawyer)
DT(the)
the
NN(lawyer)
lawyer
VP(questioned)
Vt(questioned)
questioned
NP(witness)
DT(the)
the
NN(witness)
witness
Adding Headwords to Trees (Continued)

S(questioned)
NP(lawyer)
DT(the)
the
NN(lawyer)
lawyer
VP(questioned)
Vt(questioned)
questioned
NP(witness)
DT(the)
the
NN(witness)
witness
- A constituent receives its headword from its head child.

S → NP VP (S receives its headword from VP)
VP → Vt NP (VP receives its headword from Vt)
NP → DT NN (NP receives its headword from NN)
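Headword percolation can be sketched recursively: each node takes the headword of its head child. The tree encoding and the toy head-child table below are my own:

```python
def lexicalize(tree, head_child):
    """Annotate each constituent with its headword.

    `tree` is (label, children) for internal nodes or (tag, word) for
    preterminals; `head_child(label, child_labels)` returns the index of
    the head child.  Returns nodes of the form (label, headword, children).
    """
    label, rest = tree
    if isinstance(rest, str):                    # preterminal: (tag, word)
        return (label, rest, [])
    children = [lexicalize(c, head_child) for c in rest]
    h = head_child(label, [c[0] for c in children])
    return (label, children[h][1], children)     # headword from the head child

# "the lawyer questioned the witness", with a toy head-child table:
head_table = {("S", ("NP", "VP")): 1, ("VP", ("Vt", "NP")): 0,
              ("NP", ("DT", "NN")): 1}
tree = ("S", [("NP", [("DT", "the"), ("NN", "lawyer")]),
              ("VP", [("Vt", "questioned"),
                      ("NP", [("DT", "the"), ("NN", "witness")])])])
lex = lexicalize(tree, lambda l, cs: head_table[(l, tuple(cs))])
print(lex[0], lex[1])  # S questioned
```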
We can parse a lexicalized grammar in O(n⁵) [how?]
Dependency Representation
Dependency Representation
S(questioned)
NP(lawyer)
DT(the)
the
NN(lawyer)
lawyer
VP(questioned)
Vt(questioned)
questioned
NP(witness)
DT(the)
the
NN(witness)
witness
Dependency representation is very common. We will return to it in the future.
Dependency Representation

[figure: dependency arcs — * → questioned, questioned → lawyer, questioned → witness, lawyer → the, witness → the]

Dependency representation is very common. We will return to it in the future.
Dependency Parsing
Evaluation Measures

- UAS (Unlabeled Attachment Score): % of words with the correct head. 90–94 (Eng, WSJ)
- LAS (Labeled Attachment Score): % of words with the correct head and label. 87–92 (Eng, WSJ)
- Root: % of sentences with the correct root. ~90 (Eng, WSJ)
- Exact: % of sentences with the exactly correct structure. 40–50 (Eng, WSJ)
Three main approaches to Dependency Parsing

Conversion
- Parse to constituency structure.
- Extract dependencies from the trees.

Global Optimization (Graph-based)
- Define a scoring function over ⟨sentence, tree⟩ pairs.
- Search for the best-scoring structure.
- Simpler scoring ⇒ easier search.
- (Similar to how we do tagging and constituency parsing.)

Greedy decoding (Transition-based)
- Start with an unparsed sentence.
- Apply locally-optimal actions until the sentence is parsed.
argmax over combinatorial space
while (!done) { do best thing }
Graph-based parsing (Global Search)
Arcs
Dependency parsing is concerned with head-modifier relationships.

Definitions:
- head: the main word in a phrase
- modifier: an auxiliary word in a phrase

The exact meaning depends on the underlying linguistic formalism.

It is common to use head → modifier arc notation:

* Millions on the coast face freak storm
Input Notation
Input:

- x = (w, t)
- w1 … wn; the words of the sentence
- t1 … tn; the tags of the sentence
- Special symbol w0 = *; the pseudo-root

Note: unlike in CFG parsing, we assume the tags are given.
Output Notation
Output:

- A = {(h, m) : h ∈ {0 … n}, m ∈ {1 … n}}; the set of possible dependency arcs
- Y ⊆ {0, 1}^|A|; the set of all valid dependency parses
- y ∈ Y; a valid dependency parse
Example

* Millions/N on/P the/D coast/N face/V freak/A storm/N

- w0 = *, w1 = Millions, w2 = on, w3 = the, …
- t0 = *, t1 = N, t2 = P, t3 = D, …
- y(0, 5) = 1, y(5, 1) = 1, y(1, 2) = 1, …
Forbidden Structures

- Each (non-root) word must modify exactly one word.
- Arcs must form a tree.
- (Projective) Arcs may not cross each other.

[each constraint illustrated on: * Millions on the coast face freak storm]
Main Idea

- Define a scoring function g(y; x, θ).
- This function tells us, for every x (sentence) and y (tree) pair, how good the pair is.
- θ are the parameters, or weights (we called them w before).
- For example: g(y; x, θ) = Σᵢ φᵢ(x, y) θᵢ = φ(x, y) · θ (a linear model)
- Look for the best y for a given sentence: argmax_y g(y; x, θ)
What is a good dependency parse?

y* = argmax_{y ∈ Y} g(y; x, θ)

Method:

- Define features for this problem.
- Learn parameters θ from corpus data.
- Maximize the objective to find the best parse y*.
First-order Scoring Function

The scoring function g(y; x, θ) is the sum of first-order arc scores.

* Millions on the coast face freak storm

g(y; x, θ) = score(coast → the)
           + score(on → coast)
           + score(Millions → on)
           + score(face → Millions)
           + score(face → storm)
           + score(storm → freak)
           + score(* → face)
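The decomposition into arc scores is directly computable once per-arc scores are given; a sketch (the numeric scores below are illustrative, not learned):

```python
def g(arcs, arc_score):
    """First-order tree score: the sum of individual (head, modifier) arc scores."""
    return sum(arc_score[(h, m)] for h, m in arcs)

# word indices: 0=*  1=Millions  2=on  3=the  4=coast  5=face  6=freak  7=storm
# the tree from the slide, with illustrative (made-up) arc scores:
arc_score = {(4, 3): 2.0,   # coast -> the
             (2, 4): 1.5,   # on -> coast
             (1, 2): 1.0,   # Millions -> on
             (5, 1): 3.0,   # face -> Millions
             (5, 7): 2.5,   # face -> storm
             (7, 6): 1.0,   # storm -> freak
             (0, 5): 4.0}   # * -> face
tree = list(arc_score)      # this tree uses exactly these seven arcs
print(g(tree, arc_score))   # 15.0
```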
Conditional Model (e.g. CRF)

Define:

score(w_h → w_m) = φ(x, ⟨h, m⟩) · θ

where:

- φ(x, ⟨h, m⟩) : X × A → {0, 1}^p; a feature function
- θ ∈ R^p; a parameter vector (assume given)
- p; the number of features
Feature-based Discriminative Model

Features

- Features are critical for dependency parsing performance.
- Specified as a vector of indicators:

φ_NAME(⟨t, w⟩, ⟨h, m⟩) = 1 if t_m = u; 0 otherwise

- Each feature has a corresponding real-valued weight:

θ_NAME = 9.23
Features: Tags

∀u ∈ T:    φ_TAG:M:u(⟨t, w⟩, ⟨h, m⟩) = 1 if t_m = u; 0 otherwise
∀u ∈ T:    φ_TAG:H:u(⟨t, w⟩, ⟨h, m⟩) = 1 if t_h = u; 0 otherwise
∀u, v ∈ T: φ_TAG:H:M:u:v(⟨t, w⟩, ⟨h, m⟩) = 1 if t_h = u and t_m = v; 0 otherwise

* Millions/N on/P the/D coast/N face/V freak/A storm/N
Features: Words

∀u ∈ W:    φ_WORD:M:u(⟨t, w⟩, ⟨h, m⟩) = 1 if w_m = u; 0 otherwise
∀u ∈ W:    φ_WORD:H:u(⟨t, w⟩, ⟨h, m⟩) = 1 if w_h = u; 0 otherwise
∀u, v ∈ W: φ_WORD:H:M:u:v(⟨t, w⟩, ⟨h, m⟩) = 1 if w_h = u and w_m = v; 0 otherwise

* Millions/N on/P the/D coast/N face/V freak/A storm/N
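These indicator templates can be realized as sparse string-valued features over a candidate arc; a sketch covering the tag, word, and direction templates (the feature-name strings are my own):

```python
def arc_features(words, tags, h, m):
    """Sparse indicator features for a candidate arc h -> m.

    Returns the names of the active features (each has value 1).
    """
    return [
        f"TAG:M:{tags[m]}",
        f"TAG:H:{tags[h]}",
        f"TAG:H:M:{tags[h]}:{tags[m]}",
        f"WORD:M:{words[m]}",
        f"WORD:H:{words[h]}",
        f"WORD:H:M:{words[h]}:{words[m]}",
        "RIGHT" if h > m else "LEFT",   # direction, as defined on the slides
    ]

words = ["*", "Millions", "on", "the", "coast", "face", "freak", "storm"]
tags = ["*", "N", "P", "D", "N", "V", "A", "N"]
feats = arc_features(words, tags, 5, 1)   # the arc face -> Millions
print(feats)
```

The arc score is then the sum of θ[f] over the active features f, as in the conditional model above.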
Features: Context Tags

∀u ∈ T⁴: φ_CON:−1:−1:u(⟨t, w⟩, ⟨h, m⟩) = 1 if t_{h−1} = u1, t_h = u2, t_{m−1} = u3, and t_m = u4; 0 otherwise
∀u ∈ T⁴: φ_CON:+1:−1:u(⟨t, w⟩, ⟨h, m⟩) = 1 if t_{h+1} = u1, t_h = u2, t_{m−1} = u3, and t_m = u4; 0 otherwise

* Millions/N on/P the/D coast/N face/V freak/A storm/N
Features: Between Tags

∀u ∈ T: φ_BET:u(⟨t, w⟩, ⟨h, m⟩) = 1 if t_i = u for some i between h and m; 0 otherwise

* Millions/N on/P the/D coast/N face/V freak/A storm/N
Features: Direction

φ_RIGHT(⟨t, w⟩, ⟨h, m⟩) = 1 if h > m; 0 otherwise
φ_LEFT(⟨t, w⟩, ⟨h, m⟩) = 1 if h < m; 0 otherwise

* Millions/N on/P the/D coast/N face/V freak/A storm/N
Features: Length

φ_LEN:2(⟨t, w⟩, ⟨h, m⟩) = 1 if |h − m| > 2; 0 otherwise
φ_LEN:5(⟨t, w⟩, ⟨h, m⟩) = 1 if |h − m| > 5; 0 otherwise
φ_LEN:10(⟨t, w⟩, ⟨h, m⟩) = 1 if |h − m| > 10; 0 otherwise

* Millions/N on/P the/D coast/N face/V freak/A storm/N
Features: Backoffs and Combinations

- Additionally include backoffs:

∀u ∈ T³: φ_CON:−1:u(⟨t, w⟩, ⟨h, m⟩) = 1 if t_{h−1} = u1, t_h = u2, and t_m = u3; 0 otherwise

- As well as combination features:

∀u ∈ W: φ_LEN:2:DIR:LEFT:TAG:M:u(⟨t, w⟩, ⟨h, m⟩) = 1 if all of the component conditions hold; 0 otherwise
First-Order Results
Model                    Accuracy
NoPOSContextBetween      86.0
NoEdge                   87.3
NoAttachmentOrDistance   88.1
NoBiLex                  90.6
Full                     90.7
From McDonald (2006)
What’s left
I Define features for this problem.
I Learn parameters ✓ from corpus data.
I Maximize objective to find best parse y⇤.
Downside: Higher-order models make inference more di�cult
y⇤ = argmax
y2Yg(y ; x , ✓)
What’s left
I Define features for this problem.
I Learn parameters ✓ from corpus data.
I Maximize objective to find best parse y⇤.
Downside: Higher-order models make inference more di�cult
y⇤ = argmax
y2Yg(y ; x , ✓)
Parsing
Goal: finding the best parse.

y* = argmax_{y ∈ Y} g(y; x, θ)
Graph Algorithms

Algorithm 2: Use graph algorithms for parsing.

Find the maximum directed spanning tree.

- Chu-Liu-Edmonds Algorithm: O(n³)
- Tarjan's Extension: O(n²)
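Chu-Liu-Edmonds itself is more involved; as a sketch of what "maximum directed spanning tree" means here, the following brute-force version enumerates every head assignment and keeps the best valid tree (exponential, toy sentences only; the scores are made up):

```python
from itertools import product

def is_tree(heads):
    """heads[m-1] is the head of word m (words 1..n, 0 is the root).

    Valid iff every word reaches the root without revisiting a node.
    (Multiple root modifiers are allowed in this sketch.)
    """
    n = len(heads)
    for m in range(1, n + 1):
        seen, h = set(), m
        while h != 0:
            if h in seen:           # cycle
                return False
            seen.add(h)
            h = heads[h - 1]
    return True

def brute_force_mst(score, n):
    """argmax over all head assignments; score[(h, m)] is the arc h -> m."""
    best_total, best_heads = float("-inf"), None
    for heads in product(range(n + 1), repeat=n):
        if is_tree(heads):
            total = sum(score[(heads[m - 1], m)] for m in range(1, n + 1))
            if total > best_total:
                best_total, best_heads = total, heads
    return best_heads, best_total

# 3 words, made-up scores; unlisted arcs default to 0
score = {(h, m): 0.0 for h in range(4) for m in range(1, 4)}
score.update({(0, 2): 10.0, (2, 1): 5.0, (2, 3): 5.0, (0, 1): 1.0})
heads, total = brute_force_mst(score, 3)
print(heads, total)  # (2, 0, 2) 20.0
```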
Maximum Directed Spanning Tree Algorithm
Issues with MST Algorithm

- Allows non-projective parses.
  (* Millions on the coast face freak storm)
- Good for some languages.
- Cannot incorporate higher-order parts.
- With higher-order parts the problem becomes NP-hard.
Dynamic Programming for Parsing
Algorithm 3: Use a specialized dynamic programming algorithm.

- The Eisner algorithm (1996) for bilexical parsing.
- Uses the split-head trick: handle left and right dependents separately.
Dependency Parsing: New Example
* As McGwire neared , fans went wild
Base Case
* As McGwire neared , fans went wild
Dependency Parsing Algorithm - First-order Model

[chart-combination diagrams:
 an incomplete item (h, m) combines complete items (h, r) and (r + 1, m);
 a complete item (h, e) combines incomplete (h, m) with complete (m, e)]
Parsing
* As McGwire neared , fans went wild
Algorithm Key

- L: left-facing item
- R: right-facing item
- C: complete item (triangle)
- I: incomplete item (trapezoid)
Algorithm

Initialize:
for i in 0 … n:
    π[C, L, i, i] = 0
    π[C, R, i, i] = 0
    π[I, L, i, i] = 0
    π[I, R, i, i] = 0

Inner loop:
for k in 1 … n:
    for s in 0 … n:
        t ← k + s
        if t > n: break
        π[I, L, s, t] = max_{r ∈ s…t−1} π[C, R, s, r] + π[C, L, r+1, t] + score(t → s)
        π[I, R, s, t] = max_{r ∈ s…t−1} π[C, R, s, r] + π[C, L, r+1, t] + score(s → t)
        π[C, L, s, t] = max_{r ∈ s…t−1} π[C, L, s, r] + π[I, L, r, t]
        π[C, R, s, t] = max_{r ∈ s+1…t} π[I, R, s, r] + π[C, R, r, t]

return π[C, R, 0, n]

(The arc-score terms distinguish the two incomplete items: a left arc t → s versus a right arc s → t.)
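The dynamic program above can be fleshed out directly; this sketch (my reconstruction, returning only the best score and omitting backpointers) runs in O(n³):

```python
NEG = float("-inf")

def eisner(score):
    """Score of the best projective dependency tree (Eisner-style DP).

    score[h][m] is the score of arc h -> m; index 0 is the pseudo-root.
    CL/CR are complete items, IL/IR incomplete items (L = head on the
    right of the span, R = head on the left).
    """
    n = len(score) - 1
    CL = [[NEG] * (n + 1) for _ in range(n + 1)]
    CR = [[NEG] * (n + 1) for _ in range(n + 1)]
    IL = [[NEG] * (n + 1) for _ in range(n + 1)]
    IR = [[NEG] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        CL[i][i] = CR[i][i] = IL[i][i] = IR[i][i] = 0.0
    for k in range(1, n + 1):              # span length
        for s in range(n + 1 - k):
            t = s + k
            # incomplete items: two complete halves plus the new arc
            best_half = max(CR[s][r] + CL[r + 1][t] for r in range(s, t))
            IL[s][t] = best_half + score[t][s]    # arc t -> s
            IR[s][t] = best_half + score[s][t]    # arc s -> t
            # complete items: an incomplete item plus a complete one
            CL[s][t] = max(CL[s][r] + IL[r][t] for r in range(s, t))
            CR[s][t] = max(IR[s][r] + CR[r][t] for r in range(s + 1, t + 1))
    return CR[0][n]

# toy 2-word sentence: made-up scores, row = head, column = modifier;
# arcs into the root (column 0) are forbidden via -inf
score = [[NEG, 3.0, 1.0],
         [NEG, NEG, 5.0],
         [NEG, 2.0, NEG]]
print(eisner(score))  # 8.0: root -> word 1, word 1 -> word 2
```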
Graph-based parsing algorithm

- Begin with a tagged sentence (can use a POS-tagger).
- Extract a set of "parts":
  - For a first-order model, each part is an (h, m) pair (O(n²) parts).
  - For a second-order model, each part is an (h, m1, m2) tuple (O(n³) parts).
- Calculate a score for each part (using the feature extractor φ and parameters θ).
- Find a valid parse tree that is composed of the best parts:
  - using Chu-Liu-Edmonds (for first-order non-projective), O(n²)
  - using a dynamic-programming algorithm (for first- and second-order projective), O(n³)

Does this remind you of anything?
Training - setting values for θ

Note: we need values such that g(y; x, θ) of the gold tree y is larger than g(y′; x, θ) for every other tree y′.
Perceptron Sketch: Part 1

- (x1, y1) … (xn, yn); the training data
- Gold features: Σ_{a ∈ A : y(a)=1} φ(x_i, a)

Idea: increase the value (in θ) of the gold features.
Perceptron Sketch: Part 2

- Best-scoring structure: z_i = argmax_{z ∈ Y} g(z; x_i, θ)
- Best-scoring structure features: Σ_{a ∈ A : z_i(a)=1} φ(x_i, a)

Idea: decrease the value (in θ) of the wrong best-scoring features.
Perceptron Algorithm

θ ← 0
for t = 1 … T, i = 1 … n:
    z_i = argmax_{y ∈ Y} g(y; x_i, θ)
    gold ← Σ_{a ∈ A : y_i(a)=1} φ(x_i, a)
    best ← Σ_{a ∈ A : z_i(a)=1} φ(x_i, a)
    θ ← θ + gold − best
return θ
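A minimal sketch of this algorithm; the argmax here is over an explicitly enumerated candidate set (in a real parser it would be the decoding algorithm), and the toy data is made up:

```python
def perceptron(data, n_feats, epochs=5):
    """Structured perceptron, a minimal sketch.

    `data` is a list of (candidates, gold_index) pairs, where each
    candidate is a sparse feature vector (dict: feature index -> value).
    """
    theta = [0.0] * n_feats

    def score(phi):
        return sum(theta[f] * v for f, v in phi.items())

    for _ in range(epochs):
        for candidates, gold in data:
            best = max(range(len(candidates)), key=lambda i: score(candidates[i]))
            if best != gold:
                for f, v in candidates[gold].items():   # gold features: theta up
                    theta[f] += v
                for f, v in candidates[best].items():   # wrong best: theta down
                    theta[f] -= v
    return theta

# toy data: in both examples the gold candidate is the one with feature 0
data = [([{0: 1.0}, {1: 1.0}], 0),
        ([{1: 1.0}, {0: 1.0}], 1)]
theta = perceptron(data, 2)
print(theta)  # [1.0, -1.0]
```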
Theory

- If possible, the perceptron will separate the correct structures from the incorrect structures.
- That is, it will find a θ that assigns y_i a higher score than any other y ∈ Y, for each example.
Practical Training Considerations

- Training requires solving inference many times.
- Computing feature values is often time-consuming.
- In practice, the averaged perceptron variant is preferred (Collins, 2002).
Conclusion

Method:

- Define features for this problem.
- Learn parameters θ from corpus data.
- Maximize the objective to find the best parse y*.

A structured prediction framework, applicable to many problems.
Transition-based parsing

Transition-based (greedy) parsing

1. Start with an unparsed sentence.
2. Apply locally-optimal actions until the sentence is parsed.
3. Use whatever features you want.
4. Surprisingly accurate.
5. Can be extremely fast.
Intro to Transition-based Dependency Parsing

An abstract machine composed of a stack and a buffer.

The machine is initialized with the words of a sentence.

A set of actions processes the words by moving them from the buffer to the stack, removing them from the stack, or adding links between them.

A specific set of actions defines a transition system.
The Arc-Eager Transition System

- SHIFT: move the first word of the buffer to the stack.
  (pre: buffer not empty)
- LEFTARC_label: make the first word in the buffer the head of the top of the stack; pop the stack.
  (pre: stack not empty; top of stack does not have a parent)
- RIGHTARC_label: make the top of the stack the head of the first word in the buffer; move the first word in the buffer to the stack.
  (pre: buffer not empty)
- REDUCE: pop the stack.
  (pre: stack not empty; top of stack has a parent)
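The four actions can be sketched as a small Python machine (unlabeled; preconditions are assumed to hold, and attaching any leftover word to the root at the end is my assumption, not stated on the slide):

```python
def arc_eager_parse(n, actions):
    """Apply a sequence of arc-eager actions to words 1..n; 0 is the root.

    Returns heads[m] = head of word m.
    """
    stack, buf = [0], list(range(1, n + 1))
    heads = {}
    for a in actions:
        if a == "SH":        # first buffer word -> stack
            stack.append(buf.pop(0))
        elif a == "LEFT":    # first-in-buffer becomes head of top-of-stack; pop
            heads[stack.pop()] = buf[0]
        elif a == "RIGHT":   # top-of-stack becomes head of first-in-buffer; shift
            heads[buf[0]] = stack[-1]
            stack.append(buf.pop(0))
        elif a == "RE":      # pop the stack
            stack.pop()
    for m in range(1, n + 1):
        heads.setdefault(m, 0)   # leftover words attach to the root (assumption)
    return heads

# "She ate pizza with pleasure" (words 1..5) with the oracle sequence
# shown on the following slides:
acts = ["SH", "LEFT", "SH", "RIGHT", "RE", "RIGHT", "RIGHT", "RE", "RE", "RE"]
heads = arc_eager_parse(5, acts)
print(heads)
```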
Parsing Example

[animation: step-by-step arc-eager parse of "She ate pizza with pleasure"]
What do we know about the arc-eager transition system?

- Every sequence of actions results in a valid projective structure.
- Every projective tree is derivable by (at least one) sequence of actions.
- Given a tree, we can find a sequence of actions that derives it (an "oracle").

We know these things also for the arc-standard, arc-hybrid and other transition systems.
This knowledge is quite powerful
Parsing with an oracle sequence
placeholderfor sentence,tree pair in corpus do
sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do
action sequence.next()configuration configuration.apply(action)
return configuration.tree
“She ate pizza with pleasure”
SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE
A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure
35 / 1
This knowledge is quite powerful
Parsing with an oracle sequence
placeholderfor sentence,tree pair in corpus do
sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do
action sequence.next()configuration configuration.apply(action)
return configuration.tree
“She ate pizza with pleasure”
SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE
A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure
35 / 1
This knowledge is quite powerful
Parsing with an oracle sequence
placeholderfor sentence,tree pair in corpus do
sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do
action sequence.next()configuration configuration.apply(action)
return configuration.tree
“She ate pizza with pleasure”
SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE
A She ate pizza with pleasure
A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure
35 / 1
This knowledge is quite powerful
Parsing with an oracle sequence
placeholderfor sentence,tree pair in corpus do
sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do
action sequence.next()configuration configuration.apply(action)
return configuration.tree
“She ate pizza with pleasure”
SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE
A She ate pizza with pleasure
A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure
35 / 1
This knowledge is quite powerful
Parsing with an oracle sequence
placeholderfor sentence,tree pair in corpus do
sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do
action sequence.next()configuration configuration.apply(action)
return configuration.tree
“She ate pizza with pleasure”
SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE
A She ate pizza with pleasure
A She ate pizza with pleasure
A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure
35 / 1
This knowledge is quite powerful
Parsing with an oracle sequence
placeholderfor sentence,tree pair in corpus do
sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do
action sequence.next()configuration configuration.apply(action)
return configuration.tree
“She ate pizza with pleasure”
SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE
A She ate pizza with pleasure
A She ate pizza with pleasure
A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure
35 / 1
This knowledge is quite powerful
Parsing with an oracle sequence
placeholderfor sentence,tree pair in corpus do
sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do
action sequence.next()configuration configuration.apply(action)
return configuration.tree
“She ate pizza with pleasure”
SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE
A She ate pizza with pleasureA She ate pizza with pleasure
A She ate pizza with pleasure
A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure
35 / 1
This knowledge is quite powerful
Parsing with an oracle sequence
placeholderfor sentence,tree pair in corpus do
sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do
action sequence.next()configuration configuration.apply(action)
return configuration.tree
“She ate pizza with pleasure”
SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE
A She ate pizza with pleasureA She ate pizza with pleasure
A She ate pizza with pleasure
A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure
35 / 1
This knowledge is quite powerful
Parsing with an oracle sequence
placeholderfor sentence,tree pair in corpus do
sequence oracle(sentence, tree)configuration initialize(sentence)while not configuration.IsFinal() do
action sequence.next()configuration configuration.apply(action)
return configuration.tree
“She ate pizza with pleasure”
SH LEFT SH RIGHT RE RIGHT RIGHT RE RE RE
A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure
A She ate pizza with pleasure
A She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasureA She ate pizza with pleasure
35 / 1
Parsing without an oracle

start with weight vector w
configuration ← initialize(sentence)
while not configuration.IsFinal() do
    action ← predict(w, φ(configuration))
    configuration ← configuration.apply(action)
return configuration.tree

- summarize the configuration as a feature vector
- predict the action based on the features
- need to learn the correct weights

36 / 1
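The φ(configuration) step can be made concrete with a tiny sketch. The particular feature templates below (stack-top word, first two buffer words, a pair feature) are illustrative assumptions, not the exact templates of any particular parser.

```python
# A sketch of phi(configuration): summarize a parser configuration
# as a set of sparse binary indicator features.

def phi(stack, buffer, words):
    s0 = words[stack[-1]] if stack else "<empty>"          # stack-top word
    b0 = words[buffer[0]] if buffer else "<empty>"         # first buffer word
    b1 = words[buffer[1]] if len(buffer) > 1 else "<empty>"  # second buffer word
    return {
        "s0=" + s0,
        "b0=" + b0,
        "b1=" + b1,
        "s0,b0=" + s0 + "|" + b0,   # conjoined pair feature
    }

words = ["She", "ate", "pizza", "with", "pleasure"]
print(sorted(phi([1], [2, 3, 4], words)))
```

Real transition-based parsers use many more templates (POS tags, arcs already built, distances), but the principle is the same: a configuration becomes a bag of named features.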
Learning a parser (batch)

training_set ← []
for sentence,tree pair in corpus do
    sequence ← oracle(sentence, tree)
    configuration ← initialize(sentence)
    while not configuration.IsFinal() do
        action ← sequence.next()
        features ← φ(configuration)
        training_set.add(features, action)
        configuration ← configuration.apply(action)
train a classifier on training_set

Learning a parser (online)

w ← 0
for sentence,tree pair in corpus do
    sequence ← oracle(sentence, tree)
    configuration ← initialize(sentence)
    while not configuration.IsFinal() do
        action ← sequence.next()
        features ← φ(configuration)
        predicted ← predict(w, features)
        if predicted ≠ action then
            w.update(φ(configuration), action, predicted)
        configuration ← configuration.apply(action)
return w

37 / 1
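The online loop above is essentially a multiclass perceptron. Here is a minimal sketch under that assumption: the toy (features, gold action) pairs stand in for what an oracle would produce, and only the update rule (add the features to the gold action's weights, subtract from the mispredicted action's) mirrors the pseudocode.

```python
# A minimal multiclass perceptron over sparse binary features,
# mimicking the online learning loop from the slide.

from collections import defaultdict

ACTIONS = ["SH", "LEFT", "RIGHT", "RE"]

class Perceptron:
    def __init__(self):
        # one sparse weight vector per action
        self.w = {a: defaultdict(float) for a in ACTIONS}

    def predict(self, features):
        scores = {a: sum(self.w[a][f] for f in features) for a in ACTIONS}
        return max(ACTIONS, key=lambda a: scores[a])

    def update(self, features, gold, predicted):
        for f in features:
            self.w[gold][f] += 1.0       # reward the correct action
            self.w[predicted][f] -= 1.0  # penalize the predicted one

# toy "corpus": (features, gold action) pairs an oracle would produce
training = [
    ({"s0=She", "b0=ate"}, "LEFT"),
    ({"s0=ate", "b0=pizza"}, "RIGHT"),
    ({"s0=pizza", "b0=with"}, "RE"),
]

model = Perceptron()
for _ in range(5):  # a few epochs over the oracle-derived examples
    for features, gold in training:
        predicted = model.predict(features)
        if predicted != gold:
            model.update(features, gold, predicted)

print([model.predict(f) for f, _ in training])
```

After a couple of epochs the model reproduces the oracle's actions on these examples; a real parser trains the same way, just with far more features and examples.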
Parsing time

configuration ← initialize(sentence)
while not configuration.isFinal() do
    action ← predict(w, φ(configuration))
    configuration ← configuration.apply(action)
return configuration.tree

38 / 1
In short

- Summarize each configuration by a set of features.
- Learn the best action to take at each configuration.
- Hope this generalizes well.

39 / 1
Transition Based Parsing

- A different approach.
- Very common.
- Can be as accurate as first-order graph-based parsing.
  - Higher-order graph-based parsers are still better.
- Easy to implement.
- Very fast (O(n)).
- Can be improved further:
  - Easy-first
  - Dynamic oracle
  - Beam search

41 / 1
Neural Networks
42 / 1
Neural-network (deep learning) based approaches

- Both graph-based and transition-based models benefit from the move to neural networks.
- Same overall approach and algorithm as before, but:
  - Replace the linear classifier with an MLP.
  - Use pre-trained word embeddings.
  - Replace the feature extractor with a Bi-LSTM.
- Now exploring:
  - Semi-supervised learning.
  - Multi-task learning objectives.
  - Out-of-domain parsing.

43 / 1
Neural Networks (deep learning): seq2seq to linearized trees

Use the “sequence to sequence with attention” model used for Machine Translation (details in the DL4Seq course).

Treat parsing as translation from a sentence to a linearized tree:

    Fruit flies like a banana
      → NMT (seq2seq+att) →
    (S (NP Adj Noun NP) (VP Vb (NP Det Noun NP) VP) S)
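The linearization itself is a simple recursive bracketing where each closing bracket repeats the non-terminal label, as in the slide's example. A sketch, with the nested-tuple tree encoding being my own assumption:

```python
# Linearize a constituency tree into the bracketed string format of
# the slide, where closing brackets repeat the label: "(S ... S)".

def linearize(tree):
    if isinstance(tree, str):           # leaf: a POS tag
        return tree
    label, *children = tree
    inner = " ".join(linearize(c) for c in children)
    return "(%s %s %s)" % (label, inner, label)

tree = ("S",
        ("NP", "Adj", "Noun"),
        ("VP", "Vb", ("NP", "Det", "Noun")))

print(linearize(tree))
# -> (S (NP Adj Noun NP) (VP Vb (NP Det Noun NP) VP) S)
```

Repeating the label on the closing bracket makes it easier for the seq2seq decoder to keep the brackets balanced and well-labeled.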
Hybrid Approaches
44 / 1
Hybrid approaches

- Different parsers have different strengths.
  ⇒ Combine several parsers.

Stacking
- Run parser A.
- Use the tree from parser A to add features to parser B.

Voting
- Parse the sentence with k different parsers.
- Each parser “votes” on its dependency arcs.
- Run a first-order graph parser to find the tree with the best arcs according to the votes.

45 / 1
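The voting scheme can be sketched as follows. The three parser outputs are hypothetical; the vote counts would then serve as arc scores for a first-order graph parser (e.g. a Chu-Liu/Edmonds MST decoder), a step omitted here.

```python
# Sketch of arc voting: count how many parsers proposed each
# (head, dependent) arc; the counts become arc scores for a
# first-order graph parser (MST step not shown).

from collections import Counter

# outputs of k=3 hypothetical parsers on "She ate pizza with pleasure"
parser_outputs = [
    {(1, 0), (1, 2), (1, 3), (3, 4)},   # parser A: "with" attaches to "ate"
    {(1, 0), (1, 2), (2, 3), (3, 4)},   # parser B: "with" attaches to "pizza"
    {(1, 0), (1, 2), (1, 3), (3, 4)},   # parser C agrees with A
]

votes = Counter()
for arcs in parser_outputs:
    votes.update(arcs)

for arc, count in sorted(votes.items()):
    print(arc, count)
```

Arcs all parsers agree on get the highest scores, so the MST decoder can resolve disagreements (like the PP attachment here) in favor of the majority while still returning a well-formed tree.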
Semi-supervised approaches

- We only see very few words (and word pairs) in training data.
- If we know (eat, carrot) is a good pair, what do we know about (eat, tomato)?
- Nothing, if the pair is not in our training data!
  ⇒ Use unlabeled data.

Cluster Features
- Represent words as context vectors.
- Define a similarity measure between vectors.
- Use a clustering algorithm to cluster the words.
- We hope that:
  - (eat, drink, devour, . . . ) are in the same cluster.
  - (tomato, carrot, pizza, . . . ) are in the same cluster.
- Use clusters as additional features to the parser.
- This works well (better?) also for POS-tagging, NER.

46 / 1
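A toy end-to-end sketch of cluster features: build context-count vectors from unlabeled text, cluster them by cosine similarity (a trivial greedy procedure here, standing in for e.g. Brown clustering or k-means), and use the cluster id as an extra feature. The corpus and the 0.5 threshold are toy assumptions.

```python
# Cluster words by their distributional contexts, using a greedy
# cosine-similarity clustering as a stand-in for a real algorithm.

import math
from collections import Counter, defaultdict

corpus = [
    "people eat pizza", "people eat carrot", "people devour pizza",
    "people devour carrot", "people eat tomato", "people devour tomato",
]

# context vectors: counts of immediate left/right neighbors
contexts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if i > 0:
            contexts[tok]["L=" + tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            contexts[tok]["R=" + tokens[i + 1]] += 1

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    return dot / (math.sqrt(sum(x * x for x in u.values()))
                  * math.sqrt(sum(x * x for x in v.values())))

# greedy clustering: join a word to the first cluster whose
# representative (first member) it is similar enough to
clusters = []
for word, vec in contexts.items():
    for cluster in clusters:
        if cosine(vec, contexts[cluster[0]]) > 0.5:
            cluster.append(word)
            break
    else:
        clusters.append([word])

print(clusters)
```

On this toy corpus the verbs (eat, devour) and the foods (pizza, carrot, tomato) fall into shared clusters, so a parser seeing (eat, carrot) in training can generalize to the unseen pair (eat, tomato) through the shared cluster feature.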
Available Software

There are many parsers available for download, including:

Constituency (PCFG)
- Stanford Parser (can also produce dependencies)
- Berkeley Parser
- Charniak Parser
- Collins Parser

Dependency
- RBGParser, TurboParser (graph-based)
- ZPar (transition + beam)
- ClearNLP (many variants)
- EasyFirst (my own)
- Bist Parser (from the BGU lab; biLSTM, graph + transition)
- SpaCy (nice API, super fast!!)

47 / 1
Summary

Dependency Parsers
- Conversion from constituency
- Graph-based
- Transition-based
- Hybrid / ensemble
- Semi-supervised (cluster features)

48 / 1