Better k-best Parsing
Liang Huang (Penn)
David Chiang (Maryland)
9th International Workshop on Parsing Technologies (IWPT 2005), Vancouver, B.C., Canada
Liang Huang and David Chiang, IWPT 2005
[Figure: the NLP pipeline (POS tagging, syntactic parsing, semantic interpretation), with compact packed representations (lattices) vs. k-best lists passed between stages]
Motivations

• NLP pipeline
  – the 1-best output is not always optimal for later stages
  – we may want to postpone disambiguation to later phases
  – is the next phase compatible with a packed representation?
    • yes: use a packed representation (forest, lattice)
    • no: use k-best lists
• Discriminative training
  – reranking (Collins, 2000)
  – minimum error rate training (Och, 2003)
  – k-best MIRA/Perceptron (McDonald, Crammer and Pereira, 2005)
Previous Work

• Collins (2000); Bikel (2004)
  – turn off dynamic programming
  – aggressive pruning: tight beam width, hard cell limit
• Charniak and Johnson (ACL 2005)
  – multi-pass, coarse-to-fine k-best parsing
  – improvement in f-score: 89.7% to 91.0%
• Jiménez and Marzal (2000)
  – very close to our lazy offline algorithm (but for CKY only)
  – tested on a tiny WSJ grammar (512 rules)
Outline

• Formulation
  – directed monotonic hypergraphs
• Algorithms
  – Alg. 0 through Alg. 3
• Experiments
  – k-best parsing on top of the Collins/Bikel parser
  – k-best CKY-based Hiero decoding (Chiang, 2005)
• Conclusion
Hypergraph

• A hypergraph is a pair <V, E>
  – V is the set of vertices (the items in a derivation)
  – E is the set of hyperedges; each hyperedge connects several vertices (the antecedents of a derivation rule) to one vertex (the consequent of the rule)
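The pair <V, E> above maps directly onto a small data structure. A minimal Python sketch (the class and field names are ours, not from the talk); the example encodes two hyperedges sharing the same S vertex, as in a packed forest with an attachment ambiguity:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Hyperedge:
    tails: tuple   # the antecedent vertices T(e)
    head: str      # the consequent vertex h(e)

@dataclass
class Hypergraph:
    vertices: set = field(default_factory=set)
    incoming: dict = field(default_factory=dict)  # head -> list of Hyperedges

    def add_edge(self, tails, head):
        e = Hyperedge(tuple(tails), head)
        self.vertices.update(tails)
        self.vertices.add(head)
        self.incoming.setdefault(head, []).append(e)
        return e

# two derivations of the same item share the vertex "S(1,7)"
g = Hypergraph()
g.add_edge(["NP(1,1)", "VP(2,7)"], "S(1,7)")
g.add_edge(["NP(1,1)", "VP(2,4)", "PP(5,7)"], "S(1,7)")
```

Because both hyperedges point at the same head vertex, the ambiguity is packed rather than duplicated, which is exactly what makes the forest-as-hypergraph view compact.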
Packed Forest as Hypergraph

[Figure: packed forest for "I saw a boy with a telescope"; the two PP-attachment readings (the PP "with a telescope" modifying the NP "a boy" vs. the VP) share vertices such as NP, VP, and S. The forest is a hypergraph: the items are vertices, the deduction steps are hyperedges]

• logical deduction (Shieber et al., 1995)
• hypergraph search (Klein & Manning, 2001)
• weighted deduction (Nederhof, 2003): weighted hypergraph
Weighted Hypergraph

• A weighted hypergraph is a tuple <V, E, t, R>
  – t: target vertex (goal item), e.g. t = (S, 1, n) in CKY
  – R: weight set with a total ordering ≤
• every hyperedge is a triple e = <T(e), h(e), f>, where T(e) is the tuple of antecedent vertices, h(e) is the consequent vertex, and f is the weight function: for antecedents with weights a, b, c, the hyperedge contributes f(a, b, c) to h(e)
(Nederhof, 2003)
Monotonic Weight Functions

• all weight functions must be monotonic in each of their arguments: if b' ≤ b then f(b', c) ≤ f(b, c)
• this gives the optimal sub-problem property needed for dynamic programming
• CKY example: e = <((NP, 1, 2), (VP, 3, 5)), (S, 1, 5), f> with f(b, c) = b · c · Pr(S → NP VP); in CKY, t = (S, 1, n)
k-best Problem in a Hypergraph

• 1-best problem
  – find the best derivation of the target vertex t
• k-best problem
  – find the top k derivations of the target vertex t
• assumption
  – the hypergraph is acyclic, so that we can traverse it in topological order
Outline

• Formulation
• Algorithms
  – Algorithm 0: naïve polynomial
  – Algorithm 1: speeding up mult_k
  – Algorithm 2: speeding up merge_k
  – Algorithm 3: offline lazy algorithm
• Experiments
• Conclusion and Future Work
Generic 1-best Viterbi Algorithm

• traverse the hypergraph in topological order
• for each vertex and each incoming hyperedge:
  – compute the result of the weight function f along the hyperedge
  – update the 1-best value of the current vertex if the new value is better

[Figure: vertex v with two incoming hyperedges, f1 over antecedents (u: a, w: b) and f2 over (u': c, w': d); the value of v is better(better(f1(a, b), f2(c, d)), …)]

• overall time complexity: O(|E|); for CKY with a CNF grammar, |E| = O(n³|P|)
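The traversal above can be sketched in a few lines of Python. This is a minimal illustration (function and variable names are ours); weights are probabilities, so "better" means larger:

```python
# generic 1-best Viterbi over an acyclic hypergraph:
# `topo_order` lists vertices in topological order, and
# `incoming[v]` holds pairs (tails, f) for each hyperedge into v.
def viterbi_1best(topo_order, incoming, leaf_weight):
    best = dict(leaf_weight)                   # leaves (axioms) get their weights
    for v in topo_order:
        for tails, f in incoming.get(v, []):
            w = f(*(best[u] for u in tails))   # combine along the hyperedge
            if v not in best or w > best[v]:   # update if improved
                best[v] = w
    return best

# toy CKY-style example: one rule S -> NP VP with probability 0.9
incoming = {"S": [(("NP", "VP"), lambda b, c: b * c * 0.9)]}
best = viterbi_1best(["NP", "VP", "S"], incoming, {"NP": 0.5, "VP": 0.4})
```

Each hyperedge is touched exactly once, which is where the O(|E|) bound comes from.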
Dynamic Programming: 1957

[Figure: Dr. Andrew Viterbi and Dr. Richard Bellman; caption: "We knew everything so far in your talk 40 years ago"]
k-best Viterbi Algorithm 0: naïve

• straightforward k-best extension:
  – keep a vector of length k, instead of a single value, at each vertex
  – keep the vector components sorted
• now what is f(a, b)?
  – k² values: the Cartesian product f(a_i, b_j)
  – we just need the top k out of the k² values
  – O(k² log k) by sorting, or O(k²) by selection
• mult_k(f, a, b) = top_k { f(a_i, b_j) }
• and how to update? the value of v is merge_k(mult_k(f1, a, b), mult_k(f2, c, d)):
  – from two vectors of length k (2k elements)
  – select the top k elements: O(k)
• overall time complexity: O(k²|E|)
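The naïve mult_k / merge_k can be sketched directly (names are ours; this version sorts rather than using linear selection, so mult_k is the O(k² log k) variant):

```python
from itertools import product

def mult_k(f, a, b, k):
    # all k^2 combinations of the two sorted vectors, then keep the top k
    return sorted((f(x, y) for x, y in product(a, b)), reverse=True)[:k]

def merge_k(lists, k):
    # top k of the union of several k-best lists
    return sorted((w for l in lists for w in l), reverse=True)[:k]

f = lambda x, y: x * y
a, b = [0.6, 0.4, 0.3], [0.5, 0.4, 0.1]
c, d = [0.9, 0.2], [0.3, 0.1]
top = merge_k([mult_k(f, a, b, 3), mult_k(f, c, d, 3)], 3)
```

Every hyperedge pays the full k² cost here, which is exactly the waste Algorithm 1 removes.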
Algorithm 1: speedup mult_k

• we are only interested in the top k, so why enumerate all k² values?
• a and b are sorted, and f is monotonic, so…
  – f(a1, b1) must be the 1-best
  – the 2nd-best must be either f(a2, b1) or f(a1, b2)
  – what about the 3rd-best?
• mult_k(f, a, b) = top_k { f(a_i, b_j) }
Algorithm 1 (Demo)

[Figure: a 4 × 4 grid of products f(a_i, b_j) = a_i · b_j for sorted vectors a = (.6, .4, .3, .3) and b = (.5, .4, .3, .1); the frontier cells .30, .24, .20, .18, .16, .15 are enumerated best-first]

• use a priority queue (heap) to store the candidates (the frontier); in each iteration:
  1. extract-max from the heap
  2. push the two "shoulders" of the extracted cell into the heap
• k iterations
• O(k log k |E|) overall time
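The frontier search above can be sketched with Python's heapq (a min-heap, so we negate weights; function names are ours). The example reuses the demo grid:

```python
import heapq

def mult_k_lazy(f, a, b, k):
    # a, b sorted descending, f monotonic: enumerate the top k of
    # f(a_i, b_j) frontier-first, O(k log k) instead of O(k^2 log k)
    heap = [(-f(a[0], b[0]), 0, 0)]
    seen, out = {(0, 0)}, []
    while heap and len(out) < k:
        neg, i, j = heapq.heappop(heap)        # extract-max
        out.append(-neg)
        for i2, j2 in ((i + 1, j), (i, j + 1)):  # push the two "shoulders"
            if i2 < len(a) and j2 < len(b) and (i2, j2) not in seen:
                seen.add((i2, j2))
                heapq.heappush(heap, (-f(a[i2], b[j2]), i2, j2))
    return out

# the demo grid: a = (.6, .4, .3, .3), b = (.5, .4, .3, .1), f(a, b) = ab
top4 = mult_k_lazy(lambda x, y: x * y,
                   [0.6, 0.4, 0.3, 0.3], [0.5, 0.4, 0.3, 0.1], 4)
```

Monotonicity guarantees that every next-best lies on the frontier, so the `seen` set stays of size O(k) and each of the k iterations costs O(log k).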
Algorithm 2: speedup merge_k

• if a vertex has d incoming hyperedges, Algorithm 1 takes O(d k log k) time: d mult_k's and then merge_k over them
• but mult_k(f, a, b) is just an intermediate result; we are only interested in the result of merge_k(mult_k(f1, a, b), …, mult_k(fd, x, y))

[Figure: vertex v with d incoming hyperedges f1, …, fi, …, fd over antecedent pairs (u: a, w: b), …, (p: x, q: y)]
Algorithm 2 (Demo)

• can we do merge_k and mult_k simultaneously? same trick: a heap

[Figure: vertex v with k = 2 and d = 3 incoming hyperedges B1 × C1, B2 × C2, B3 × C3, each with its own grid of products; a single item-level heap spans all three grids]

• start with an initial heap of the 1-best derivations from each hyperedge; in fact we just need the top k among the d 1-best derivations
• pop the best (.42) into the output, and push its two "shoulders" (.07 and .24) into the heap as its successors
• pop the next best (.36), and so on, for k iterations
• this improves O(d k log k) to O(d + k log k)
• overall time complexity: O(|E| + |V| k log k)
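The single item-level heap across all d hyperedges can be sketched as follows (a hedged illustration: function names are ours, and the toy numbers only loosely follow the demo grids, whose later frames differ slightly):

```python
import heapq

def merge_mult_k(edges, k):
    # edges: list of (f, a, b) with a, b sorted descending, f monotonic.
    # One heap over all d hyperedges: seed it with each hyperedge's
    # 1-best, then pop/push shoulders exactly as in Algorithm 1.
    heap, seen = [], set()
    for e, (f, a, b) in enumerate(edges):
        heapq.heappush(heap, (-f(a[0], b[0]), e, 0, 0))
        seen.add((e, 0, 0))
    out = []
    while heap and len(out) < k:
        neg, e, i, j = heapq.heappop(heap)
        out.append(-neg)
        f, a, b = edges[e]
        for i2, j2 in ((i + 1, j), (i, j + 1)):
            if i2 < len(a) and j2 < len(b) and (e, i2, j2) not in seen:
                seen.add((e, i2, j2))
                heapq.heappush(heap, (-f(a[i2], b[j2]), e, i2, j2))
    return out

# toy demo: d = 3 hyperedges, k = 2
mul = lambda x, y: x * y
edges = [(mul, [0.6, 0.1], [0.7, 0.4]),
         (mul, [0.9, 0.5], [0.4, 0.3]),
         (mul, [0.4, 0.1], [0.9, 0.7])]
top2 = merge_mult_k(edges, 2)
```

Seeding costs O(d) and each of the k pops costs O(log(d + k)), which is where the O(d + k log k) per-vertex bound comes from.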
Algorithm 3: Offline (lazy)

• from Alg. 0 to Alg. 2: delaying calculations until they are needed (lazier), with larger locality
• even lazier (one step further): we are interested in the k-best derivations of the final item only!
• forward phase
  – do a normal 1-best search up to the final item
  – construct the hypergraph (parse forest) along the way
• recursive backward phase
  – ask the final item: what's your 2nd-best?
  – the final item propagates this question down to the leaves
  – then ask the final item: what's your 3rd-best? and so on
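The recursive backward phase can be sketched as a lazy procedure. This is a simplified illustration that tracks only derivation weights, not back-pointers, and all names are ours: a derivation of v is identified by a hyperedge index plus a rank vector over its antecedents, and asking a vertex for its r-th best triggers the recursion on demand:

```python
import heapq

def make_kbest(incoming, leaf_weight):
    # incoming: head -> list of (tails, f); leaf_weight: leaf -> weight
    best, cand, seen = {}, {}, {}

    def try_push(v, e, ranks):
        if (e, ranks) in seen[v]:
            return
        seen[v].add((e, ranks))
        tails, f = incoming[v][e]
        subs = []
        for u, r in zip(tails, ranks):
            l = kbest(u, r + 1)        # lazily ask the child for its (r+1)-th best
            if len(l) <= r:
                return                 # the child has no such derivation
            subs.append(l[r])
        heapq.heappush(cand[v], (-f(*subs), e, ranks))

    def kbest(v, k):
        if v not in incoming:          # leaf: a single derivation
            return [leaf_weight[v]][:k]
        if v not in cand:              # seed: 1-best along each hyperedge
            best[v], cand[v], seen[v] = [], [], set()
            for e, (tails, _) in enumerate(incoming[v]):
                try_push(v, e, (0,) * len(tails))
        while len(best[v]) < k and cand[v]:
            negw, e, ranks = heapq.heappop(cand[v])
            best[v].append(-negw)
            for i in range(len(ranks)):   # push the successors ("shoulders")
                try_push(v, e, tuple(r + (1 if j == i else 0)
                                     for j, r in enumerate(ranks)))
        return best[v][:k]

    return kbest

# toy version of the demo: S(1,7) with three incoming hyperarcs, k = 2
mul = lambda x, y: x * y
inc = {"S": [(("NP13", "VP47"), mul), (("NP12", "VP37"), mul),
             (("VP15", "NP67"), mul)]}
leaves = {"NP13": 0.6, "VP47": 0.7, "NP12": 0.5, "VP37": 0.4,
          "VP15": 0.4, "NP67": 0.7}
kbest = make_kbest(inc, leaves)
top2 = kbest("S", 2)
```

Only the sub-derivations actually demanded by the goal item are ever computed, which is the "even lazier" point of Algorithm 3.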
Algorithm 3 demo

[Figure: goal item S(1, 7) with k = 2 and three incoming hyperarcs: NP(1, 3) + VP(4, 7) with 1-best .42 = .6 × .7, NP(1, 2) + VP(3, 7) with 1-best .20 = .5 × .4, and VP(1, 5) + NP(6, 7) with 1-best .28 = .4 × .7]

• after the "forward" step (1-best parsing), the forest stores the 1-best derivation along each hyperarc
• backward step: ask S(1, 7) "what's your 2nd-best?", or equivalently, "who is your successor along this hyperarc?"
• it must be one of the "shoulders" of the best derivation, so S(1, 7) asks the antecedents of its best hyperarc, NP(1, 3) and VP(4, 7), for their 2nd-bests; the questions recurse down to the leaf nodes, which report back their numbers
• the resulting candidates (.30 and .21) are pushed onto the candidate heap (priority queue); popping the root of the heap (.30) yields the 2nd-best derivation of S(1, 7)
Summary of Algorithms

Algorithm         Time Complexity              Locality
1-best            O(|E|)                       hyperedge
alg. 0: naïve     O(k^a |E|)                   hyperedge (mult_k)
alg. 1            O(k log k |E|)               hyperedge (mult_k)
alg. 2            O(|E| + |V| k log k)         item (merge_k)
alg. 3: lazy      O(|E| + |D| k log k)         global
generalized J&M   O(|E| + |D| k log (d + k))   global

for CKY: a = 2, |E| = n³|P|, |V| = n²|N|, |D| = O(n)
Outline

• Formulation
• Algorithms: Alg. 0 through Alg. 3
• Experiments
  – Collins/Bikel parser
  – CKY-based MT decoder (Chiang, 2005)
• Conclusion
Bikel Parser

• based on lexicalized context-free models (Collins, 2003)
• we use it to emulate Collins Model 2
• beam search (pruning on cells): cell [i, j] contains all items of the form (A, i, j)
  – beam width x: prune away items worse than x times the best item in the cell (threshold pruning in MT)
  – cell limit y: keep at most the y best items in a cell (histogram pruning in MT)
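The two pruning criteria can be illustrated together in a short sketch (function name and toy numbers are ours; items score higher-is-better, as with probabilities):

```python
def prune_cell(items, beam_width, cell_limit):
    # items: list of (label, weight) pairs in one chart cell [i, j].
    # Threshold pruning: drop items worse than beam_width * best.
    # Histogram pruning: keep at most cell_limit best items.
    if not items:
        return []
    items = sorted(items, key=lambda it: it[1], reverse=True)
    best = items[0][1]
    kept = [it for it in items if it[1] >= beam_width * best]
    return kept[:cell_limit]

cell = [("NP", 0.5), ("VP", 0.004), ("PP", 0.3), ("S", 0.1)]
kept = prune_cell(cell, beam_width=1e-2, cell_limit=3)
# keeps NP, PP, S; VP falls below the threshold 0.005 = 1e-2 * 0.5
```

Both kinds of pruning discard items that a k-best list over the full forest would have kept, which is one reason pruned k-best lists contain fewer distinct parses.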
Efficiency

• implemented Algorithms 0, 1, and 3 on top of the Bikel parser

[Figure: average (wall-clock) time per sentence on section 23, as a function of k; Algorithm 3 runs in O(|E| + |D| k log k)]
Quality of the k-best lists: Oracle Reranking Accuracy (F-score)

[Figure: oracle F-score as a function of k]
Why are our k-best lists better?

[Figure: average number of distinct parses in the 100-best lists, by sentence length, for the Collins parser and for beam widths 10⁻³ and 10⁻⁴]

• as sentences get longer, the number of parses should go up (exponentially)!
MT Decoder (Efficiency)

• CKY-based Hiero decoder (Chiang, ACL 2005)
• implemented Algorithms 2 and 3

[Figure: average decoding time (excluding the 1-best part) as a function of k]
Discussion: Hyperpaths vs. Derivations

• hyperpath
  – a minimal sub-hypergraph in which every vertex has at most one incoming hyperedge
• derivation
  – a tree; a vertex can appear more than once
• for 1-best, the two always coincide

[Figure: a hypergraph, one of its hyperpaths, and a derivation that reuses a vertex]
Conclusion

• monotonic hypergraph formulation
  – we solved the k-best derivations problem, not the k-shortest hyperpaths problem
• k-best algorithms
  – Alg. 0 (naïve) through Alg. 3 (lazy)
• experimental results
  – efficiency
  – accuracy (effective search over a larger space)
THE END

Questions? Comments?
Discussion (cont'd): Hyperpaths vs. Derivations

• CKY deduction: (B, i, j)  (C, j + 1, k)  ⊢  (A, i, k)
• CKY: hyperpaths and derivations always coincide
• Earley: not the case, but easy to fix
Interesting Properties

• the 1-best derivation is best everywhere (all decisions optimal)
• the 2nd-best is optimal everywhere except at one decision
  – and that decision must be a 2nd-best
  – and it is the best of all 2nd-best decisions
• so what about the 3rd-best? the kth-best is…
(Charniak and Johnson, ACL 2005)

[Figure: local picture; the full grid of products f(a_i, b_j) = a_i · b_j from the Algorithm 1 demo, with a = (.6, .4, .3, .3) and b = (.5, .4, .3, .1)]
Quality of the k-best lists: Oracle Reranking Relative Improvement

[Figure: relative oracle F-score improvement as a function of k]

• weight functions: f : R^|T(e)| → R
Future Work

• k-best discriminative supertagging
• implement Alg. 2 and 3 for the Bikel parser, and Alg. 0 and 1 for the MT decoder (David), so that both experiments cover all four algorithms
• real reranking (with Libin)
• Chinese parsing?
• formal grammars and hypergraphs
• case-factor diagrams and hypergraphs
Hypergraph is Everything!

• generic dynamic programming
• shared forest
• weighted deduction
• hypergraph search
• branching structures vs. finite-state structures
Convergence of Formulations

• shared forest
• weighted deduction
• hypergraph