Dependency Treelet Translation: Syntactically Informed Phrasal SMT
(ACL 2005)
Chris Quirk, Arul Menezes
and Colin Cherry
Outline
• Limitations of SMT and previous work
• Modeling and training
• Decoding
• Experiments
• Conclusion
Limitations of string-based phrasal SMT
• It allows only limited phrase reordering.
– Ex: max jump, max skip
• It cannot express linguistic generalizations.
– Ex: it cannot express "SOV → SVO"
• Source and target phrases have to be contiguous.
– Ex: it cannot handle "ne … pas"
Previous work on syntactic SMT: Simultaneous parsing
• Inversion Transduction Grammars (Wu, 1997)
– Use simplifying assumptions: binary rules X → AB
• Head transducers (Alshawi et al., 2000)
– Simultaneous induction of src and tgt dependency trees
Previous work on syntactic SMT: parsing + transfer
• Tree-to-string (Yamada and Knight, 2001)
– Parse the tgt sentence, and convert the tgt tree to a src string
• Path-based transfer model (Lin, 2004)
– Translate paths in src dependency trees
• LF-level transfer (Menezes and Richardson, 2001)
– Parse both src and tgt.
Previous work on syntactic SMT: pre- or post-processing
• Post-processing (JHU workshop, 2003): re-ranking the n-best list of SMT output using syntactic models.
– Parse MT output
– No improvement, even when n = 16,000
• Pre-processing (Xia & McCord, 2004; Collins et al., 2005; …):
– Reorder src sents before SMT
– Some improvement
Outline
• Limitations of SMT and previous work
• Modeling and training
• Decoding
• Experiments
• Conclusion
What’s new?
• The unit of translation: a treelet pair.
– A treelet is an arbitrary connected subgraph (not necessarily a subtree) of a dependency tree.
– In comparison:
• Src n-grams: "phrase"-based SMT
• Paths: (Lin, 2004)
• Context-free rules: many transfer-based MT systems
• Decoding is more complicated.
Required modules
• Source dependency parser
• Target word segmenter / tokenizer
• Word aligner: GIZA++
Major steps for training
1. Align src and tgt words
2. Parse source side
3. Project dependency trees
4. Extract treelet translation pairs
5. Train an order model
6. Train other models
Step 1: Word alignment
• Use GIZA++ to get alignments in both directions, and combine the results with heuristics.
• One constraint: for n-to-1 alignments, the n src words have to be adjacent in the src dependency tree.
Heuristics used to accept alignments from the union
• m-to-n alignments are not accepted.
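A minimal sketch of this symmetrization step in Python. The specific growth rule shown (accept a union-only link when it covers a previously unaligned word) is an illustrative assumption, since the slides do not spell out the exact heuristics:

```python
def combine_alignments(src2tgt, tgt2src):
    """Symmetrize two directional word alignments (hypothetical sketch):
    start from the intersection, then add union-only links that align
    at least one previously uncovered word. Alignments are sets of
    (src_index, tgt_index) pairs."""
    intersection = src2tgt & tgt2src
    union = src2tgt | tgt2src
    result = set(intersection)
    for s, t in sorted(union - intersection):
        s_covered = any(s == s2 for s2, _ in result)
        t_covered = any(t == t2 for _, t2 in result)
        # Accept a union link only if it covers a new src or tgt word.
        if not (s_covered and t_covered):
            result.add((s, t))
    return result
```

The m-to-n restriction and the src-adjacency constraint for n-to-1 links would be enforced as additional filters on `result`.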
Step 2: parsing source side
• It requires a source dependency parser that
– produces unlabeled, ordered dependency trees, and
– annotates each src word with a POS tag
• Their system does not allow crossing dependencies:
– if h(i) = k, then for any j between i and k, h(j) is also between i and k.
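The no-crossing (projectivity) condition can be checked directly; a sketch, assuming `heads[i]` gives the head index of word i, with -1 marking the root:

```python
def is_projective(heads):
    """Check the no-crossing condition: if word i has head k, every word
    j strictly between i and k must have its head inside the closed
    interval [min(i,k), max(i,k)]. `heads[i]` is the head index of
    word i (-1 for the root)."""
    n = len(heads)
    for i in range(n):
        k = heads[i]
        if k < 0:
            continue  # the root has no head arc to check
        lo, hi = min(i, k), max(i, k)
        for j in range(lo + 1, hi):
            if not (lo <= heads[j] <= hi):
                return False
    return True
```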
Step 3: Projecting dependency trees
• Add links in the tgt dependency tree according to word alignment types:
– 1-to-1: trivial
– n-to-1: trivial
– 1-to-n: use heuristics
– Unaligned tgt words: use heuristics
– Unaligned src words: ignore them
1-to-1 and n-to-1 alignments
[Figure: alignment between src words sk, sl, sl' and tgt words ti, tj]
1-to-n alignment
[Figure: src words a, b aligned to tgt words a', b1', b2' (1-to-n)]
The n tgt words should move as a unit:
– treat the rightmost one as the head
– all other words depend on it
Unaligned target words
[Figure: tgt words ti, tj, tk]
Given an unaligned tgt word at position j, find the closest positions (i, k) s.t. j is between i and k and ti depends on tk (or vice versa).
Such (i, k) might not exist. Because no crossing is allowed, if (i, k) exists, it is unique.
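The search for the attachment pair (i, k) can be sketched as below; the function name and interface are illustrative, not the authors' code, and `heads[p]` is assumed to give the head of target word p (None for the unaligned word, -1 for the root):

```python
def find_anchor(j, heads):
    """For an unaligned tgt word at position j, find the closest pair
    (i, k) with i < j < k such that t_i depends on t_k or t_k depends
    on t_i. Returns None when no such pair exists (a hypothetical
    sketch of the slide's heuristic)."""
    n = len(heads)
    # Widen the window around j; the first hit is the closest pair,
    # and the no-crossing constraint makes it unique.
    for width in range(2, n):
        for i in range(max(0, j - width + 1), j):
            k = i + width
            if k < n and k > j and (heads[i] == k or heads[k] == i):
                return (i, k)
    return None
```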
An example
startup properties and options
propriétés et options de démarrage
The reattachment pass to ensure phrasal cohesion
[Figure: projected dependency trees for "propriétés et options de démarrage", before and after the reattachment pass]
Reattachment pass
• “For each node in the wrong order (relative to its siblings), we reattach it to the lowest of its ancestors s.t. it is in the correct place relative to its siblings and parent”.
• Question: how does the reattachment work?
– In what order are tree nodes checked?
– Once a node is moved, can it be moved again?
– How many levels do we have to check to decide where to attach a node?
An example
[Figure: a dependency tree with nodes numbered by target position, used to illustrate reattachment of out-of-order nodes]
Step 3: Projecting dependency trees (recap)
• Before reattachment, the src and tgt dependency trees are almost isomorphic:
– n-to-1: treat "many" src words as one node
– 1-to-n: treat "many" tgt words as one node
– Unaligned tgt words: attached using heuristics
– Unaligned src words: ignored
• After reattachment, the two trees can look very different.
Step 4: Extracting treelet translation pairs
• “We extract all pairs of aligned src and tgt treelets along with word-level alignment linkages, up to a configurable max size.”
• Due to the reattachment step, a src treelet might not align to a tgt treelet.
Extraction algorithm
• Enumerate all possible source treelets.
• Look at the union of the target nodes aligned to source nodes. If it is a treelet, keep the treelet pair.
• Allow treelets with wildcard roots.
– Ex: "doesn't *" ↔ "ne * pas"
• Max size of treelets: in practice, up to 4 src words.
• Question: how many source treelets are there?
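One way to answer that question concretely: in a tree, a node set is connected exactly when only one of its nodes has a head outside the set, so treelets can be enumerated by filtering node subsets. A brute-force sketch, workable for the small max sizes used here:

```python
from itertools import combinations

def enumerate_treelets(heads, max_size=4):
    """Enumerate all connected subgraphs (treelets) of a dependency tree
    up to `max_size` nodes. `heads[v]` is the head of node v (-1 for
    the root). A node set is connected iff exactly one of its nodes
    has its head outside the set (that node is the treelet root)."""
    n = len(heads)
    treelets = []
    for size in range(1, max_size + 1):
        for nodes in combinations(range(n), size):
            node_set = set(nodes)
            external = sum(1 for v in nodes if heads[v] not in node_set)
            if external == 1:
                treelets.append(nodes)
    return treelets
```

For a chain of three nodes this yields six treelets (three single nodes, two edges, and the whole chain); in general the count grows quickly with tree size, which is why the max size is capped.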
An example
startup properties and options
propriétés et options de démarrage
Step 5: training an order model
Another representation
Learning a dependent’s position w.r.t. its head
P(pos(m,t) | S, T):
– S: src dependency tree
– T: unordered tgt dependency tree
– t (a.k.a. "h"): a node in T
– m: a child of t
P(pos(m,t) | S, T)
≈ P(pos(m) | S, T)
≈ P(pos(m) | lex(m), lex(t), lex(src(m)), lex(src(t)), cat(src(m)), cat(src(t)), pos(src(m)))
Use a decision tree to decide pos(m)
The prob of the order of tgt tree
c(t) is the set of nodes modifying t. (i.e., the children of t in the dependency tree)
Assumption: the position of each child can be modeled independently in terms of head-relative position
The order model (cont)
P(order(T) | S, T) = ∏_{t ∈ T} P(order(c(t)) | S, T)
                   = ∏_{t ∈ T} ∏_{m ∈ c(t)} P(pos(m,t) | S, T)
Comment: this model is both straightforward and somewhat counter-intuitive, since treelets are subgraphs.
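The factorized order model can be written in a few lines; `pos_prob` below stands in for the decision-tree distribution, which is assumed rather than implemented:

```python
from math import prod

def order_prob(children, pos_prob):
    """P(order(T) | S, T) under the independence assumption: a product,
    over every head t and each of its modifiers m, of the head-relative
    position probability P(pos(m,t) | S, T). `children` maps a head to
    its modifier list; `pos_prob(m, t)` is a hypothetical stand-in for
    the decision-tree model."""
    return prod(pos_prob(m, t) for t, mods in children.items() for m in mods)
```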
Step 6: train other models
P(S | T) = ∏_i P(s_i | t_i), where (s_i, t_i) is a treelet pair.

Two models:
– MLE
– IBM Model 1

It assumes a uniform distribution over all possible decompositions of a tree into treelets.
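The MLE variant of this channel model is simple relative frequency over the extracted treelet pairs; a sketch with treelets as opaque hashable values, not the authors' implementation:

```python
from collections import Counter

def mle_treelet_probs(pairs):
    """MLE estimate of the channel model P(s | t) from extracted
    treelet pairs: count(s, t) / count(t)."""
    pair_counts = Counter(pairs)
    tgt_counts = Counter(t for _, t in pairs)
    return {(s, t): c / tgt_counts[t] for (s, t), c in pair_counts.items()}
```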
Step 6: train other models (cont)
• Target LM: n-gram LM
• Other features:– Target word number: word penalty– The number of “phrases” used.– ….
Treelet vs. string-based SMT
• Similarities:
– Use the log-linear framework.
– Similar features: LM, word penalty, …
• Differences:
– Use treelet TM, instead of string-based TM.
– The order model is w.r.t. dependency trees.
Outline
• Limitations of SMT and previous work
• Modeling and training
• Decoding
• Experiments
• Conclusion
Challenges
• Traditional left-to-right decoding approach is inapplicable.
• The need to handle treelets: perhaps discontiguous or overlapping
Ordering strategies
• Exhaustive search
• Greedy ordering
• No ordering
Exhaustive search
• For each input node s, find the set of all treelet pairs that match S and are “rooted” at s.
• Move bottom up through the src dependency tree, computing a list of possible tgt trees for each src subtree.
• When attaching one subtree to another, try all possible permutations of children of root node.
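The "try all permutations" step amounts to interleaving the uncovered subtrees into the sequence already fixed by the treelet pair; a recursive sketch (names are illustrative):

```python
def interleavings(fixed, free):
    """Enumerate every ordering tried by exhaustive search at one node:
    the relative order of `fixed` (the head plus children whose positions
    the treelet pair determines) is preserved, while each `free` subtree
    may be inserted at any position (simplified sketch)."""
    if not free:
        return [list(fixed)]
    results = []
    for seq in interleavings(fixed, free[1:]):
        for i in range(len(seq) + 1):
            results.append(seq[:i] + [free[0]] + seq[i:])
    return results
```

With c fixed children plus the head and r free subtrees, this yields (c+r+1)!/(c+1)! orderings, matching the count given on the "Numbers of candidates" slide.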
Definitions
Exhaustive decoding algorithm
Greedy ordering
• Too many permutations to consider in exhaustive search.
• In the greedy ordering:
– Given a fixed pre- and post-modifier count, we choose the best modifier for each position.
Greedy ordering algorithm
Numbers of candidates considered at each node
• c: # of children specified in the treelet pair
• r: # of subtrees that need to be attached
• Exhaustive search: (c+r+1)! / (c+1)!
• Greedy search: (c+r) · r²
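The two counts are easy to compare numerically; the greedy formula below is transcribed from the slide and read as (c+r)·r²:

```python
from math import factorial

def exhaustive_candidates(c, r):
    """Orderings examined by exhaustive search when a treelet pair fixes
    c children and r uncovered subtrees must be interleaved:
    (c + r + 1)! / (c + 1)!"""
    return factorial(c + r + 1) // factorial(c + 1)

def greedy_candidates(c, r):
    """Candidates examined by greedy ordering: (c + r) * r**2,
    per the slide."""
    return (c + r) * r * r
```

For c = 1 and r = 3, exhaustive search considers 60 orderings versus 36 for greedy; the gap widens rapidly as r grows, since one count is factorial and the other polynomial.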
Dynamic Programming
• In string-based SMT, hyps covering the same src word vector are merged; each hyp keeps:
– The last two target words: for the LM
– List size is O(V²)
• In treelet translation, hyps for the same src subtree keep:
– The head word: for the order model
– The first two target words: for the LM
– The last two target words: for the LM
– List size is O(V⁵)
• DP does not allow for great savings because of the context we have to keep.
Duplicate elimination
• To eliminate unnecessary ordering operations, they use a hash table to check whether an unordered T has appeared before.
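One way to implement that check is to hash a canonical form of the tree that ignores child order; the paper does not specify the exact keying, so this is a sketch:

```python
def canonical(node, children):
    """Canonical hashable key for an unordered dependency tree: each
    node becomes (label, sorted child keys), so two trees that differ
    only in the order of modifiers produce the same key and therefore
    the same hash-table entry. `children` maps a label to its child
    labels."""
    kids = tuple(sorted(canonical(c, children) for c in children.get(node, [])))
    return (node, kids)
```

The decoder would keep a set of such keys and skip the ordering step whenever the key of a newly built unordered tree is already present.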
Pruning
• Prune treelet pairs (before the search starts):
– Keep pairs whose MLE prob > threshold
– Given a src treelet, keep those whose prob is within a ratio r of the best pair
• N-best lists:
– Keep the N-best for each node in the src dep tree.
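The two treelet-table criteria can be sketched together; the threshold and ratio values below are illustrative placeholders, not the paper's settings:

```python
def prune_treelet_pairs(pairs, threshold=1e-4, ratio=0.2):
    """Prune the treelet translation table before decoding (sketch of
    the two slide criteria): drop pairs whose MLE probability is below
    `threshold`, then for each source treelet keep only pairs within
    `ratio` of that treelet's best pair. `pairs` maps (src, tgt) to a
    probability."""
    # Criterion 1: absolute probability threshold.
    kept = {k: p for k, p in pairs.items() if p > threshold}
    # Criterion 2: relative-to-best ratio, per source treelet.
    best = {}
    for (s, t), p in kept.items():
        best[s] = max(best.get(s, 0.0), p)
    return {(s, t): p for (s, t), p in kept.items() if p >= ratio * best[s]}
```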
Outline
• Limitations of SMT and previous work
• Modeling and training
• Decoding
• Experiments
• Conclusion
Setting
• Eng-Fr corpus of Microsoft technical data
• Eng parser (NLPWIN): rule-based in-house parser.
Main results
Max phrase size = 4
Effect of max phrase size
Effect of training set size
          1K     3K     10K    30K    100K   300K
Pharaoh   17.20  22.51  27.70  33.73  38.83  42.75
Treelet   18.70  25.39  30.96  35.81  40.66  44.32
diff      +1.50  +2.88  +3.26  +2.08  +1.83  +1.57
Effect of ordering strategies
Effect of allowing discontiguous phrases
Effect of optimization
Conclusion
• Modeling:
– Treelet translation
– Order model based on dependency structure
• Training:
– Projecting tgt dependency tree using heuristics
– Learn treelet pairs
• Decoding:
– Exhaustive search
– Greedy ordering
• Results: better performance than string-based SMT, especially for small max phrase sizes.
Advantages
• Over SMT:
– Src phrases do not have to be contiguous n-grams.
– It can express linguistic generalizations.
• Over previous transfer-based approaches:
– Treelets are more expressive than paths or context-free rules.
Discussion
• Projecting tgt dependency tree:
– Reattachment: how and why?
• Extracting treelet pairs:
– How many subgraphs?
• Order model:
• Decoding: when hyps are extended, updating the score is more complicated.