Better k-best Parsing
Liang Huang (Penn)
David Chiang (Maryland)
9th International Workshop on Parsing Technologies (IWPT 2005), Vancouver, B.C., Canada
Liang Huang and David Chiang, IWPT 2005
[Figure: the NLP pipeline (POS tagging, syntactic parsing, semantic interpretation), with compact packed representations (lattices) vs. k-best lists passed between stages]
Motivations

• NLP pipeline
  – the 1-best output is not always optimal for later stages
  – we may want to postpone disambiguation to later phases
  – is the next phase compatible with a packed representation?
    • yes: use a packed representation (forest, lattice)
    • no: use k-best lists
• Discriminative training
  – reranking (Collins, 2000)
  – minimum error rate training (Och, 2003)
  – k-best MIRA/Perceptron (McDonald, Crammer and Pereira, 2005)
Previous Work

• Collins (2000); Bikel (2004)
  – turn off dynamic programming
  – aggressive pruning: tight beam width, hard cell limit
• Charniak and Johnson (ACL 2005)
  – multi-pass, coarse-to-fine k-best parsing
  – improvement in f-score: 89.7% to 91.0%
• Jiménez and Marzal (2000)
  – very close to our lazy offline algorithm (but for CKY only)
  – tested on a tiny WSJ grammar (512 rules)
Outline

• Formulation
  – directed monotonic hypergraphs
• Algorithms
  – Alg. 0 through Alg. 3
• Experiments
  – k-best parsing on top of the Collins/Bikel parser
  – k-best CKY-based Hiero decoding (Chiang, 2005)
• Conclusion
Hypergraph

• A hypergraph is a pair <V, E>
  – V is the set of vertices (the items in a derivation)
  – E is the set of hyperedges; each hyperedge connects several vertices (the antecedents of a derivation rule) to one vertex (the consequent of the rule)
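The pair <V, E> above maps directly onto a small data structure. A minimal Python sketch (the class and field names are ours, not from the talk); the example encodes two hyperedges sharing the same S vertex, as in a packed forest with an attachment ambiguity:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Hyperedge:
    tails: tuple   # the antecedent vertices T(e)
    head: str      # the consequent vertex h(e)

@dataclass
class Hypergraph:
    vertices: set = field(default_factory=set)
    incoming: dict = field(default_factory=dict)  # head -> list of Hyperedges

    def add_edge(self, tails, head):
        e = Hyperedge(tuple(tails), head)
        self.vertices.update(tails)
        self.vertices.add(head)
        self.incoming.setdefault(head, []).append(e)
        return e

# two derivations of the same item share the vertex "S(1,7)"
g = Hypergraph()
g.add_edge(["NP(1,1)", "VP(2,7)"], "S(1,7)")
g.add_edge(["NP(1,1)", "VP(2,4)", "PP(5,7)"], "S(1,7)")
```

Because both hyperedges point at the same head vertex, the ambiguity is packed rather than duplicated, which is exactly what makes the forest-as-hypergraph view compact.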
Packed Forest as Hypergraph

[Figure: packed forest for "I saw a boy with a telescope"; the two PP-attachment readings (the PP "with a telescope" modifying the NP "a boy" vs. the VP) share vertices such as NP, VP, and S. The forest is a hypergraph: the items are vertices, the deduction steps are hyperedges]

• logical deduction (Shieber et al., 1995)
• hypergraph search (Klein & Manning, 2001)
• weighted deduction (Nederhof, 2003): weighted hypergraph
Weighted Hypergraph

• A weighted hypergraph is a tuple <V, E, t, R>
  – t: target vertex (goal item), e.g. t = (S, 1, n) in CKY
  – R: weight set with a total ordering ≤
• every hyperedge is a triple e = <T(e), h(e), f>, where T(e) is the tuple of antecedent vertices, h(e) is the consequent vertex, and f is the weight function: for antecedents with weights a, b, c, the hyperedge contributes f(a, b, c) to h(e)
(Nederhof, 2003)
Monotonic Weight Functions

• all weight functions must be monotonic in each of their arguments: if b' ≤ b then f(b', c) ≤ f(b, c)
• this gives the optimal sub-problem property needed for dynamic programming
• CKY example: e = <((NP, 1, 2), (VP, 3, 5)), (S, 1, 5), f> with f(b, c) = b · c · Pr(S → NP VP); in CKY, t = (S, 1, n)
k-best Problem in a Hypergraph

• 1-best problem
  – find the best derivation of the target vertex t
• k-best problem
  – find the top k derivations of the target vertex t
• assumption
  – the hypergraph is acyclic, so that we can traverse it in topological order
Outline

• Formulation
• Algorithms
  – Algorithm 0: naïve polynomial
  – Algorithm 1: speeding up mult_k
  – Algorithm 2: speeding up merge_k
  – Algorithm 3: offline lazy algorithm
• Experiments
• Conclusion and Future Work
Generic 1-best Viterbi Algorithm

• traverse the hypergraph in topological order
• for each vertex and each incoming hyperedge:
  – compute the result of the weight function f along the hyperedge
  – update the 1-best value of the current vertex if the new value is better

[Figure: vertex v with two incoming hyperedges, f1 over antecedents (u: a, w: b) and f2 over (u': c, w': d); the value of v is better(better(f1(a, b), f2(c, d)), …)]

• overall time complexity: O(|E|); for CKY with a CNF grammar, |E| = O(n³|P|)
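The traversal above can be sketched in a few lines of Python. This is a minimal illustration (function and variable names are ours); weights are probabilities, so "better" means larger:

```python
# generic 1-best Viterbi over an acyclic hypergraph:
# `topo_order` lists vertices in topological order, and
# `incoming[v]` holds pairs (tails, f) for each hyperedge into v.
def viterbi_1best(topo_order, incoming, leaf_weight):
    best = dict(leaf_weight)                   # leaves (axioms) get their weights
    for v in topo_order:
        for tails, f in incoming.get(v, []):
            w = f(*(best[u] for u in tails))   # combine along the hyperedge
            if v not in best or w > best[v]:   # update if improved
                best[v] = w
    return best

# toy CKY-style example: one rule S -> NP VP with probability 0.9
incoming = {"S": [(("NP", "VP"), lambda b, c: b * c * 0.9)]}
best = viterbi_1best(["NP", "VP", "S"], incoming, {"NP": 0.5, "VP": 0.4})
```

Each hyperedge is touched exactly once, which is where the O(|E|) bound comes from.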
Dynamic Programming: 1957

[Figure: Dr. Andrew Viterbi and Dr. Richard Bellman; caption: "We knew everything so far in your talk 40 years ago"]
k-best Viterbi Algorithm 0: naïve

• straightforward k-best extension:
  – keep a vector of length k, instead of a single value, at each vertex
  – keep the vector components sorted
• now what is f(a, b)?
  – k² values: the Cartesian product f(a_i, b_j)
  – we just need the top k out of the k² values
  – O(k² log k) by sorting, or O(k²) by selection
• mult_k(f, a, b) = top_k { f(a_i, b_j) }
• and how to update? the value of v is merge_k(mult_k(f1, a, b), mult_k(f2, c, d)):
  – from two vectors of length k (2k elements)
  – select the top k elements: O(k)
• overall time complexity: O(k²|E|)
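The naïve mult_k / merge_k can be sketched directly (names are ours; this version sorts rather than using linear selection, so mult_k is the O(k² log k) variant):

```python
from itertools import product

def mult_k(f, a, b, k):
    # all k^2 combinations of the two sorted vectors, then keep the top k
    return sorted((f(x, y) for x, y in product(a, b)), reverse=True)[:k]

def merge_k(lists, k):
    # top k of the union of several k-best lists
    return sorted((w for l in lists for w in l), reverse=True)[:k]

f = lambda x, y: x * y
a, b = [0.6, 0.4, 0.3], [0.5, 0.4, 0.1]
c, d = [0.9, 0.2], [0.3, 0.1]
top = merge_k([mult_k(f, a, b, 3), mult_k(f, c, d, 3)], 3)
```

Every hyperedge pays the full k² cost here, which is exactly the waste Algorithm 1 removes.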
Algorithm 1: speedup mult_k

• we are only interested in the top k, so why enumerate all k² values?
• a and b are sorted, and f is monotonic, so…
  – f(a1, b1) must be the 1-best
  – the 2nd-best must be either f(a2, b1) or f(a1, b2)
  – what about the 3rd-best?
• mult_k(f, a, b) = top_k { f(a_i, b_j) }
Algorithm 1 (Demo)

[Figure: a 4 × 4 grid of products f(a_i, b_j) = a_i · b_j for sorted vectors a = (.6, .4, .3, .3) and b = (.5, .4, .3, .1); the frontier cells .30, .24, .20, .18, .16, .15 are enumerated best-first]

• use a priority queue (heap) to store the candidates (the frontier); in each iteration:
  1. extract-max from the heap
  2. push the two "shoulders" of the extracted cell into the heap
• k iterations
• O(k log k |E|) overall time
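The frontier search above can be sketched with Python's heapq (a min-heap, so we negate weights; function names are ours). The example reuses the demo grid:

```python
import heapq

def mult_k_lazy(f, a, b, k):
    # a, b sorted descending, f monotonic: enumerate the top k of
    # f(a_i, b_j) frontier-first, O(k log k) instead of O(k^2 log k)
    heap = [(-f(a[0], b[0]), 0, 0)]
    seen, out = {(0, 0)}, []
    while heap and len(out) < k:
        neg, i, j = heapq.heappop(heap)        # extract-max
        out.append(-neg)
        for i2, j2 in ((i + 1, j), (i, j + 1)):  # push the two "shoulders"
            if i2 < len(a) and j2 < len(b) and (i2, j2) not in seen:
                seen.add((i2, j2))
                heapq.heappush(heap, (-f(a[i2], b[j2]), i2, j2))
    return out

# the demo grid: a = (.6, .4, .3, .3), b = (.5, .4, .3, .1), f(a, b) = ab
top4 = mult_k_lazy(lambda x, y: x * y,
                   [0.6, 0.4, 0.3, 0.3], [0.5, 0.4, 0.3, 0.1], 4)
```

Monotonicity guarantees that every next-best lies on the frontier, so the `seen` set stays of size O(k) and each of the k iterations costs O(log k).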
Algorithm 2: speedup merge_k

• if a vertex has d incoming hyperedges, Algorithm 1 takes O(d k log k) time: d mult_k's and then merge_k over them
• but mult_k(f, a, b) is just an intermediate result; we are only interested in the result of merge_k(mult_k(f1, a, b), …, mult_k(fd, x, y))

[Figure: vertex v with d incoming hyperedges f1, …, fi, …, fd over antecedent pairs (u: a, w: b), …, (p: x, q: y)]
Algorithm 2 (Demo)

• can we do merge_k and mult_k simultaneously? same trick: a heap

[Figure: vertex v with k = 2 and d = 3 incoming hyperedges B1 × C1, B2 × C2, B3 × C3, each with its own grid of products; a single item-level heap spans all three grids]

• start with an initial heap of the 1-best derivations from each hyperedge; in fact we just need the top k among the d 1-best derivations
• pop the best (.42) into the output, and push its two "shoulders" (.07 and .24) into the heap as its successors
• pop the next best (.36), and so on, for k iterations
• this improves O(d k log k) to O(d + k log k)
• overall time complexity: O(|E| + |V| k log k)
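The single item-level heap across all d hyperedges can be sketched as follows (a hedged illustration: function names are ours, and the toy numbers only loosely follow the demo grids, whose later frames differ slightly):

```python
import heapq

def merge_mult_k(edges, k):
    # edges: list of (f, a, b) with a, b sorted descending, f monotonic.
    # One heap over all d hyperedges: seed it with each hyperedge's
    # 1-best, then pop/push shoulders exactly as in Algorithm 1.
    heap, seen = [], set()
    for e, (f, a, b) in enumerate(edges):
        heapq.heappush(heap, (-f(a[0], b[0]), e, 0, 0))
        seen.add((e, 0, 0))
    out = []
    while heap and len(out) < k:
        neg, e, i, j = heapq.heappop(heap)
        out.append(-neg)
        f, a, b = edges[e]
        for i2, j2 in ((i + 1, j), (i, j + 1)):
            if i2 < len(a) and j2 < len(b) and (e, i2, j2) not in seen:
                seen.add((e, i2, j2))
                heapq.heappush(heap, (-f(a[i2], b[j2]), e, i2, j2))
    return out

# toy demo: d = 3 hyperedges, k = 2
mul = lambda x, y: x * y
edges = [(mul, [0.6, 0.1], [0.7, 0.4]),
         (mul, [0.9, 0.5], [0.4, 0.3]),
         (mul, [0.4, 0.1], [0.9, 0.7])]
top2 = merge_mult_k(edges, 2)
```

Seeding costs O(d) and each of the k pops costs O(log(d + k)), which is where the O(d + k log k) per-vertex bound comes from.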
Algorithm 3: Offline (lazy)

• from Alg. 0 to Alg. 2: delaying calculations until they are needed (lazier), with larger locality
• even lazier (one step further): we are interested in the k-best derivations of the final item only!
• forward phase
  – do a normal 1-best search up to the final item
  – construct the hypergraph (parse forest) along the way
• recursive backward phase
  – ask the final item: what's your 2nd-best?
  – the final item propagates this question down to the leaves
  – then ask the final item: what's your 3rd-best? and so on
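The recursive backward phase can be sketched as a lazy procedure. This is a simplified illustration that tracks only derivation weights, not back-pointers, and all names are ours: a derivation of v is identified by a hyperedge index plus a rank vector over its antecedents, and asking a vertex for its r-th best triggers the recursion on demand:

```python
import heapq

def make_kbest(incoming, leaf_weight):
    # incoming: head -> list of (tails, f); leaf_weight: leaf -> weight
    best, cand, seen = {}, {}, {}

    def try_push(v, e, ranks):
        if (e, ranks) in seen[v]:
            return
        seen[v].add((e, ranks))
        tails, f = incoming[v][e]
        subs = []
        for u, r in zip(tails, ranks):
            l = kbest(u, r + 1)        # lazily ask the child for its (r+1)-th best
            if len(l) <= r:
                return                 # the child has no such derivation
            subs.append(l[r])
        heapq.heappush(cand[v], (-f(*subs), e, ranks))

    def kbest(v, k):
        if v not in incoming:          # leaf: a single derivation
            return [leaf_weight[v]][:k]
        if v not in cand:              # seed: 1-best along each hyperedge
            best[v], cand[v], seen[v] = [], [], set()
            for e, (tails, _) in enumerate(incoming[v]):
                try_push(v, e, (0,) * len(tails))
        while len(best[v]) < k and cand[v]:
            negw, e, ranks = heapq.heappop(cand[v])
            best[v].append(-negw)
            for i in range(len(ranks)):   # push the successors ("shoulders")
                try_push(v, e, tuple(r + (1 if j == i else 0)
                                     for j, r in enumerate(ranks)))
        return best[v][:k]

    return kbest

# toy version of the demo: S(1,7) with three incoming hyperarcs, k = 2
mul = lambda x, y: x * y
inc = {"S": [(("NP13", "VP47"), mul), (("NP12", "VP37"), mul),
             (("VP15", "NP67"), mul)]}
leaves = {"NP13": 0.6, "VP47": 0.7, "NP12": 0.5, "VP37": 0.4,
          "VP15": 0.4, "NP67": 0.7}
kbest = make_kbest(inc, leaves)
top2 = kbest("S", 2)
```

Only the sub-derivations actually demanded by the goal item are ever computed, which is the "even lazier" point of Algorithm 3.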
Algorithm 3 demo

[Figure: goal item S(1, 7) with k = 2 and three incoming hyperarcs: NP(1, 3) + VP(4, 7) with 1-best .42 = .6 × .7, NP(1, 2) + VP(3, 7) with 1-best .20 = .5 × .4, and VP(1, 5) + NP(6, 7) with 1-best .28 = .4 × .7]

• after the "forward" step (1-best parsing), the forest stores the 1-best derivation along each hyperarc
• backward step: ask S(1, 7) "what's your 2nd-best?", or equivalently, "who is your successor along this hyperarc?"
• it must be one of the "shoulders" of the best derivation, so S(1, 7) asks the antecedents of its best hyperarc, NP(1, 3) and VP(4, 7), for their 2nd-bests; the questions recurse down to the leaf nodes, which report back their numbers
• the resulting candidates (.30 and .21) are pushed onto the candidate heap (priority queue); popping the root of the heap (.30) yields the 2nd-best derivation of S(1, 7)
Summary of Algorithms

Algorithm         Time Complexity              Locality
1-best            O(|E|)                       hyperedge
alg. 0: naïve     O(k^a |E|)                   hyperedge (mult_k)
alg. 1            O(k log k |E|)               hyperedge (mult_k)
alg. 2            O(|E| + |V| k log k)         item (merge_k)
alg. 3: lazy      O(|E| + |D| k log k)         global
generalized J&M   O(|E| + |D| k log (d + k))   global

for CKY: a = 2, |E| = n³|P|, |V| = n²|N|, |D| = O(n)
Outline

• Formulation
• Algorithms: Alg. 0 through Alg. 3
• Experiments
  – Collins/Bikel parser
  – CKY-based MT decoder (Chiang, 2005)
• Conclusion
Bikel Parser

• based on lexicalized context-free models (Collins, 2003)
• we use it to emulate Collins Model 2
• beam search (pruning on cells): cell [i, j] contains all items of the form (A, i, j)
  – beam width x: prune away items worse than x times the best item in the cell (threshold pruning in MT)
  – cell limit y: keep at most the y best items in a cell (histogram pruning in MT)
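The two pruning criteria can be illustrated together in a short sketch (function name and toy numbers are ours; items score higher-is-better, as with probabilities):

```python
def prune_cell(items, beam_width, cell_limit):
    # items: list of (label, weight) pairs in one chart cell [i, j].
    # Threshold pruning: drop items worse than beam_width * best.
    # Histogram pruning: keep at most cell_limit best items.
    if not items:
        return []
    items = sorted(items, key=lambda it: it[1], reverse=True)
    best = items[0][1]
    kept = [it for it in items if it[1] >= beam_width * best]
    return kept[:cell_limit]

cell = [("NP", 0.5), ("VP", 0.004), ("PP", 0.3), ("S", 0.1)]
kept = prune_cell(cell, beam_width=1e-2, cell_limit=3)
# keeps NP, PP, S; VP falls below the threshold 0.005 = 1e-2 * 0.5
```

Both kinds of pruning discard items that a k-best list over the full forest would have kept, which is one reason pruned k-best lists contain fewer distinct parses.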
Efficiency

• implemented Algorithms 0, 1, and 3 on top of the Bikel parser

[Figure: average (wall-clock) time per sentence on section 23, as a function of k; Algorithm 3 runs in O(|E| + |D| k log k)]
Quality of the k-best lists: Oracle Reranking Accuracy (F-score)

[Figure: oracle F-score as a function of k]
Why are our k-best lists better?

[Figure: average number of distinct parses in the 100-best lists, by sentence length, for the Collins parser and for beam widths 10⁻³ and 10⁻⁴]

• as sentences get longer, the number of parses should go up (exponentially)!
MT Decoder (Efficiency)

• CKY-based Hiero decoder (Chiang, ACL 2005)
• implemented Algorithms 2 and 3

[Figure: average decoding time (excluding the 1-best part) as a function of k]
Discussion: Hyperpaths vs. Derivations

• hyperpath
  – a minimal sub-hypergraph in which every vertex has at most one incoming hyperedge
• derivation
  – a tree; a vertex can appear more than once
• for 1-best, the two always coincide

[Figure: a hypergraph, one of its hyperpaths, and a derivation that reuses a vertex]
Conclusion

• monotonic hypergraph formulation
  – we solved the k-best derivations problem, not the k-shortest hyperpaths problem
• k-best algorithms
  – Alg. 0 (naïve) through Alg. 3 (lazy)
• experimental results
  – efficiency
  – accuracy (effective search over a larger space)
THE END

Questions? Comments?
Discussion (cont'd): Hyperpaths vs. Derivations

• CKY deduction: (B, i, j)  (C, j + 1, k)  ⊢  (A, i, k)
• CKY: hyperpaths and derivations always coincide
• Earley: not the case, but easy to fix
Interesting Properties

• the 1-best derivation is best everywhere (all decisions optimal)
• the 2nd-best is optimal everywhere except at one decision
  – and that decision must be a 2nd-best
  – and it is the best of all 2nd-best decisions
• so what about the 3rd-best? the kth-best is…
(Charniak and Johnson, ACL 2005)

[Figure: local picture; the full grid of products f(a_i, b_j) = a_i · b_j from the Algorithm 1 demo, with a = (.6, .4, .3, .3) and b = (.5, .4, .3, .1)]
Quality of the k-best lists: Oracle Reranking Relative Improvement

[Figure: relative oracle F-score improvement as a function of k]

• weight functions: f : R^|T(e)| → R
Future Work

• k-best discriminative supertagging
• implement Alg. 2 and 3 for the Bikel parser, and Alg. 0 and 1 for the MT decoder (David), so that both experiments cover all four algorithms
• real reranking (with Libin)
• Chinese parsing?
• formal grammars and hypergraphs
• case-factor diagrams and hypergraphs
Hypergraph is Everything!

• generic dynamic programming
• shared forest
• weighted deduction
• hypergraph search
• branching structures vs. finite-state structures
Convergence of Formulations

• shared forest
• weighted deduction
• hypergraph