Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005) Chris Quirk, Arul Menezes and Colin Cherry


Page 1: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Dependency Treelet Translation: Syntactically Informed Phrasal SMT

(ACL 2005)

Chris Quirk, Arul Menezes

and Colin Cherry

Page 2: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Outline

• Limitations of SMT and previous work

• Modeling and training

• Decoding

• Experiments

• Conclusion

Page 3: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Limitations of string-based phrasal SMT

• It allows only limited phrase reordering.
  – Ex: max jump, max skip

• It cannot express linguistic generalizations:
  – Ex: it cannot express “SOV → SVO”

• Source and target phrases have to be contiguous:
  – Ex: it cannot handle “ne … pas”

Page 4: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Previous work on syntactic SMT: Simultaneous parsing

• Inversion Transduction Grammars (Wu, 1997)
  – Using simplifying assumptions: binary rules X → AB

• Head transducers (Alshawi et al., 2000)
  – Simultaneous induction of src and tgt dependency trees

Page 5: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Previous work on syntactic SMT: parsing + transfer

• Tree-to-string (Yamada and Knight, 2001)
  – Parse the tgt sentence, and convert the tgt tree to a src string

• Path-based transfer model (Lin, 2004)
  – Translate paths in src dependency trees

• LF-level transfer (Menezes and Richardson, 2001)
  – Parse both src and tgt.

Page 6: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Previous work on syntactic SMT: pre- or post-processing

• Post-processing (JHU 2003): re-ranking the n-best list of SMT output using syntactic models.
  – Parse MT output
  – No improvement, even when n=16,000

• Pre-processing (Xia & McCord, 2004; Collins et al., 2005; …):
  – Reorder src sents before SMT
  – Some improvement

Page 7: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Outline

• Limitations of SMT and previous work

• Modeling and training

• Decoding

• Experiments

• Conclusion

Page 8: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

What’s new?

• The unit of translation: a treelet pair.
  – A treelet is an arbitrary connected subgraph (not necessarily a subtree) of a dependency tree.
  – In comparison:
    • Src n-grams: “phrase”-based SMT
    • Paths: (Lin, 2004)
    • Context-free rules: many transfer-based MT systems

• Decoding is more complicated.

Page 9: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Required modules

• Source dependency parser

• Target word segmenter / tokenizer

• Word aligner: GIZA++

Page 10: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Major steps for training

1. Align src and tgt words

2. Parse source side

3. Project dependency trees

4. Extract treelet translation pairs

5. Train an order model

6. Train other models

Page 11: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Step 1: Word alignment

• Use GIZA++ to get alignments in both directions, and combine the results with heuristics.

• One constraint: for n-to-1 alignments, the n src words have to be adjacent in the src dependency tree.
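The adjacency constraint can be checked mechanically: the n src words of an n-to-1 alignment must form a connected subgraph of the src dependency tree. A minimal Python sketch (the function name and `head`-array tree encoding are my own, not from the paper):

```python
from collections import deque

def connected_in_tree(nodes, head):
    """Check that a set of word positions forms a connected subgraph of a
    dependency tree, where head[i] is the parent of word i (root: -1)."""
    nodes = set(nodes)
    if len(nodes) <= 1:
        return True
    # Keep only tree edges (i, head[i]) with both endpoints in `nodes`,
    # then test reachability from an arbitrary start node.
    adj = {n: [] for n in nodes}
    for n in nodes:
        h = head[n]
        if h in nodes:
            adj[n].append(h)
            adj[h].append(n)
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == nodes
```

An n-to-1 link whose src side fails this test would be rejected by the heuristics.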

Page 12: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Heuristics used to accept alignments from the union

They do not accept m-to-n alignments.

Page 13: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Step 2: parsing source side

• It requires a source dependency parser that
  – produces unlabeled, ordered dependency trees, and
  – annotates each src word with a POS tag

• Their system does not allow crossing dependencies:
  – If h(i) = k, then for any j between i and k, h(j) is also between i and k.
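The no-crossing (projectivity) condition above translates directly into a check over all arcs. A Python sketch (encoding is my own, not from the paper):

```python
def is_projective(head):
    """head[i] = parent of word i in surface order (root: -1).
    For every arc (i, k=head[i]), every word j strictly between i and k
    must have its head inside the closed interval [min(i,k), max(i,k)]."""
    n = len(head)
    for i in range(n):
        k = head[i]
        if k == -1:
            continue
        lo, hi = min(i, k), max(i, k)
        for j in range(lo + 1, hi):
            if not (lo <= head[j] <= hi):
                return False
    return True
```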

Page 14: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Step 3: Projecting dependency trees

• Add links in the tgt dependency tree according to word alignment types:
  – 1-to-1: trivial
  – n-to-1: trivial
  – 1-to-n: use heuristics
  – Unaligned tgt words: use heuristics
  – Unaligned src words: ignore them

Page 15: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

1-to-1 and n-to-1 alignments

[Figure: src words s_k, s_l, s_l' aligned to tgt words t_i, t_j; the dependency links project directly.]

Page 16: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

1-to-n alignment

[Figure: src words a, b, with b aligned 1-to-n to tgt words b1', b2' and a aligned to a'.]

The n tgt words should move as a unit:
  – treat the rightmost one as the head
  – all other words depend on it.
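The rightmost-head heuristic for 1-to-n alignments is small enough to sketch in Python (names and the head-map encoding are my own, not from the paper):

```python
def project_one_to_many(tgt_positions, tgt_head):
    """When one src word aligns to n tgt words, the group should move as a
    unit: treat the rightmost tgt word as the local head and attach the
    remaining tgt words to it.  Mutates the head map; returns the head."""
    head = max(tgt_positions)
    for p in tgt_positions:
        if p != head:
            tgt_head[p] = head
    return head
```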

Page 17: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Unaligned target words

[Figure: tgt words t_i, t_j, t_k, with t_j unaligned.]

Given an unaligned tgt word at position j, find the closest positions (i, k) s.t. j is between i and k and ti depends on tk (or vice versa).

Such (i, k) might not exist. Because no crossing is allowed, if (i, k) exists, it is unique.
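A Python sketch of this search for (i, k) (a simplified reading of the heuristic; names and encoding are my own):

```python
def attach_point(j, aligned_positions, tgt_head):
    """For an unaligned tgt word at position j, look for the closest pair
    (i, k) with i < j < k such that t_i depends on t_k or vice versa.
    Returns (i, k), or None if no such pair exists; with no crossing
    dependencies allowed, the pair is unique when it exists."""
    left = sorted((p for p in aligned_positions if p < j), reverse=True)
    right = sorted(p for p in aligned_positions if p > j)
    # Scan outward from j, nearest neighbors first.
    for i in left:
        for k in right:
            if tgt_head.get(i) == k or tgt_head.get(k) == i:
                return (i, k)
    return None
```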

Page 18: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

An example

startup properties and options

proprietes et options de demarrage

Page 19: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

The reattachment pass to ensure phrasal cohesion

[Figure: the projected tgt tree for “proprietes et options de demarrage”, before and after the reattachment pass that ensures phrasal cohesion.]
Page 20: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Reattachment pass

• “For each node in the wrong order (relative to its siblings), we reattach it to the lowest of its ancestors s.t. it is in the correct place relative to its siblings and parent”.

• Question: how does the reattachment work?
  – In what order are tree nodes checked?
  – Once a node is moved, can it be moved again?
  – How many levels do we have to check to decide where to attach a node?

Page 21: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

An example

[Figure: an example tree whose nodes are numbered 1–15, illustrating the reattachment pass.]

Page 22: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Step 3: Projecting dependency trees (Recap)

• Before reattachment, the src and tgt dependency trees are almost isomorphic:
  – n-to-1: treat the n src words as one node
  – 1-to-n: treat the n tgt words as one node
  – Unaligned tgt words: attached by heuristics
  – Unaligned src words: ignored

• After reattachment, the two trees can look very different.

Page 23: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Step 4: Extracting treelet translation pairs

• “We extract all pairs of aligned src and tgt treelets along with word-level alignment linkages, up to a configurable max size.”

• Due to the reattachment step, a src treelet might not align to a tgt treelet.

Page 24: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Extraction algorithm

• Enumerate all possible source treelets.

• Look at the union of the target nodes aligned to source nodes. If it is a treelet, keep the treelet pair.

• Allow treelets with wildcard roots.
  – Ex: doesn’t * → ne * pas

• Max size of treelets: in practice, up to 4 src words.

• Question: how many source treelets are there?
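One way to answer the question empirically: enumerate the treelets rooted at each node. A Python sketch (my own, not the paper's extraction code) that generates each connected subgraph containing a given root exactly once, by only extending the frontier “rightward”:

```python
def enumerate_treelets(root, children, max_size=4):
    """Enumerate the connected subgraphs (treelets) of a dependency tree
    rooted at `root`, up to max_size nodes.  children[n] lists n's
    children; a treelet need not include all children of a node (it is a
    connected subgraph, not necessarily a subtree)."""
    results = []

    def grow(chosen, frontier):
        results.append(frozenset(chosen))
        if len(chosen) == max_size:
            return
        for i, node in enumerate(frontier):
            # Extend with frontier[i]; later choices may only use nodes
            # after it (plus its children), so no subgraph repeats.
            grow(chosen | {node},
                 frontier[i + 1:] + tuple(children.get(node, ())))

    grow(frozenset([root]), tuple(children.get(root, ())))
    return results
```

Running this over every node as root enumerates all source treelets up to the size bound.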

Page 25: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

An example

startup properties and options

proprietes et options de demarrage

Page 26: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Step 5: training an order model

Page 27: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Another representation

Page 28: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Learning a dependent’s position w.r.t. its head

P(pos(m,t) | S, T):
  S: src dependency tree
  T: unordered tgt dependency tree
  t (a.k.a. “h”): a node in T
  m: a child of t

P(pos(m,t) | S, T)
  ≈ P(pos(m) | S, T)
  ≈ P(pos(m) | lex(m), lex(t), srclex(m), srclex(t), srccat(m), srccat(t), srcpos(m))

Use a decision tree to decide pos(m)

Page 29: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)
Page 30: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

The prob of the order of tgt tree

c(t) is the set of nodes modifying t. (i.e., the children of t in the dependency tree)

Assumption: the position of each child can be modeled independently in terms of head-relative position

Page 31: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

The order model (cont)

P(order(T) | S, T) = ∏_{t ∈ T} P(order(c(t)) | S, T)
                   = ∏_{t ∈ T} ∏_{m ∈ c(t)} P(pos(m,t) | S, T)

Comment: this model is both straightforward and kind of counter-intuitive, since treelets are subgraphs.
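Since each child's head-relative position is modeled independently, the order probability is just a product over all heads and their children. A Python sketch in log space (`pos_log_prob` is a hypothetical stand-in for the decision-tree model; names are my own):

```python
import math

def order_log_prob(children, pos_log_prob):
    """log P(order(T) | S, T) under the independence assumption:
    a sum of per-child position log-probabilities over every head t
    and each of its modifiers m in c(t)."""
    return sum(pos_log_prob(m, t)
               for t, kids in children.items()
               for m in kids)
```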

Page 32: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Step 6: train other models

P(T | S) = ∏_i p(t_i | s_i), where each (s_i, t_i) is a treelet pair.

Two models:
  – MLE
  – IBM Model 1

It assumes a uniform distribution over all possible decompositions of a tree into treelets.
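The MLE variant of the channel model is just relative frequency over the extracted treelet pairs. A Python sketch (my own encoding; treelets stand in as hashable keys):

```python
from collections import Counter

def mle_treelet_model(pairs):
    """MLE channel model: p(t|s) = count(s, t) / count(s),
    estimated from the extracted (src treelet, tgt treelet) pairs."""
    pair_counts = Counter(pairs)
    src_counts = Counter(s for s, _ in pairs)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}
```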

Page 33: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Step 6: train other models (cont)

• Target LM: n-gram LM

• Other features:– Target word number: word penalty– The number of “phrases” used.– ….

Page 34: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Treelet vs. string-based SMT

• Similarities:– Use the log-linear framework.– Similar features: LM, word penalty, …

• Differences:– Use treelet TM, instead of string-based TM.– The order model is w.r.t. dependency trees.

Page 35: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Outline

• Limitations of SMT and previous work

• Modeling and training

• Decoding

• Experiments

• Conclusion

Page 36: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Challenges

• Traditional left-to-right decoding approach is inapplicable.

• The need to handle treelets: perhaps discontiguous or overlapping

Page 37: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Ordering strategies

• Exhaustive search

• Greedy ordering

• No ordering

Page 38: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Exhaustive search

• For each input node s, find the set of all treelet pairs that match S and are “rooted” at s.

• Move bottom up through the src dependency tree, computing a list of possible tgt trees for each src subtree.

• When attaching one subtree to another, try all possible permutations of children of root node.

Page 39: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Definitions

Page 40: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Exhaustive decoding algorithm

Page 41: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Greedy ordering

• Too many permutations to consider in exhaustive search.

• In the greedy ordering:
  – Given a fixed pre- and post-modifier count, we choose the best modifier for each position.
Page 42: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Greedy ordering algorithm

Page 43: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Numbers of candidates considered at each node

• c: # of children specified in the treelet pair
• r: # of subtrees that need to be attached

• Exhaustive search: (c+r+1)! / (c+1)!

• Greedy search: (c+r)·r²
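These two counts are easy to compare numerically; a small Python sketch (function names are my own):

```python
from math import factorial

def exhaustive_candidates(c, r):
    """Orderings tried per node by exhaustive search: (c+r+1)!/(c+1)!,
    with c children fixed by the treelet pair and r subtrees to attach."""
    return factorial(c + r + 1) // factorial(c + 1)

def greedy_candidates(c, r):
    """Candidates considered by the greedy strategy: (c+r)*r^2."""
    return (c + r) * r * r
```

For example, with c=1 fixed child and r=3 subtrees to attach, exhaustive search considers 60 orderings against 36 for greedy, and the factorial gap widens quickly as r grows.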

Page 44: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Dynamic Programming

• In string-based SMT, hyps for the same covered src word vector keep:
  – The last two target words in the hyp: for the LM
  List size is O(V²)

• In treelet translation, hyps for the same src subtree keep:
  – The head word: for the order model
  – The first two target words: for the LM
  – The last two target words: for the LM
  List size is O(V⁵)

DP does not allow for great savings because of the context we have to keep.

Page 45: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Duplicate elimination

• To eliminate unnecessary ordering operations, they use a hash table to check whether an unordered T has appeared before.
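A hashable, order-insensitive key for an unordered tree makes this check a plain dictionary lookup. A Python sketch of one such canonical key (my own construction, not the paper's; a Counter preserves the multiplicity of identical subtrees):

```python
from collections import Counter

def unordered_key(node, children, label):
    """Canonical key for an *unordered* tree: the node's label plus a
    frozen multiset of its children's keys, so two trees that differ
    only in child order hash identically."""
    kid_keys = Counter(unordered_key(c, children, label)
                       for c in children.get(node, ()))
    return (label[node], frozenset(kid_keys.items()))
```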

Page 46: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Pruning

• Prune treelet pairs (before the search starts):
  – Keep pairs whose MLE prob > threshold
  – Given a src treelet, keep those whose prob is within a ratio r of the best pair.

• N-best lists:
  – Keep the N best for each node in the src dep tree.
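The two treelet-pair pruning criteria compose simply; a Python sketch (names and data layout are my own):

```python
def prune_pairs(pairs, threshold, ratio):
    """For one src treelet, keep tgt alternatives whose MLE prob exceeds
    an absolute threshold and lies within a factor `ratio` of the best
    surviving pair.  `pairs` is a list of (tgt, prob)."""
    kept = [(t, p) for t, p in pairs if p > threshold]
    if not kept:
        return []
    best = max(p for _, p in kept)
    return [(t, p) for t, p in kept if p * ratio >= best]
```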

Page 47: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Outline

• Limitations of SMT and previous work

• Modeling and training

• Decoding

• Experiments

• Conclusion

Page 48: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Setting

• Eng-Fr corpus of Microsoft technical data

• Eng parser (NLPWIN): rule-based in-house parser.

Page 49: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Main results

Max phrase size = 4

Page 50: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Effect of max phrase size

Page 51: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Effect of training set size

          1K      3K      10K     30K     100K    300K
Pharaoh   17.20   22.51   27.70   33.73   38.83   42.75
Treelet   18.70   25.39   30.96   35.81   40.66   44.32
diff      +1.50   +2.88   +3.26   +2.08   +1.83   +1.57

Page 52: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Effect of ordering strategies

Page 53: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Effect of allowing discontiguous phrases

Page 54: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Effect of optimization

Page 55: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Conclusion

• Modeling:– Treelet translation– Order model based on dependency structure

• Training:– Projecting tgt dependency tree using heuristics– Learn treelet pairs

• Decoding:– Exhaustive search– Greedy ordering

• Results: better performance than SMT, especially for small max phrase sizes.

Page 56: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Advantages

• Over SMT:
  – Src phrases do not have to be contiguous n-grams.
  – It can express linguistic generalizations.

• Over previous transfer-based approaches:
  – Treelets are more expressive than paths or context-free rules.

Page 57: Dependency Treelet Translation: Syntactically Informed Phrasal SMT (ACL 2005)

Discussion

• Projecting tgt dependency tree:– Reattachment: how and why?

• Extracting treelet pairs:– How many subgraphs?

• Order model:

• Decoding: when hyps are extended, updating the score is more complicated.