Model Combination for Machine Translation
John DeNero, Shankar Kumar, Ciprian Chelba, and Franz Josef Och
Monday, June 7, 2010
Motivation

A statistical machine translation model scores derivations. The (log) model score sums language model and translation model features:

θ · φ(d) = θ · ( ∑_{w ∈ n-grams(d)} φ_lm(w) + ∑_{r ∈ rules(d)} φ_tm(r) )

We can improve over max-derivation decoding:

• Consensus decoding (e.g., minimum Bayes risk)
• System combination (e.g., confusion networks)

In this work, we develop a technique that integrates both.
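To make the decomposition concrete, here is a minimal sketch (not the authors' implementation) of a log-linear derivation score: language model features are summed over the derivation's n-grams and translation model features over its rules, then dotted with θ. All feature names and toy values are invented for illustration.

```python
from collections import Counter

def model_score(derivation, theta):
    """Log-linear score theta · phi(d): phi sums LM features over the
    derivation's n-grams and TM features over its rules."""
    phi = Counter()
    for ngram in derivation["ngrams"]:   # the phi_lm(w) terms
        phi["lm_logprob"] += ngram["logprob"]
    for rule in derivation["rules"]:     # the phi_tm(r) terms
        phi["tm_logprob"] += rule["logprob"]
        phi["rule_count"] += 1.0
    return sum(theta[name] * value for name, value in phi.items())

# Toy derivation with invented feature values.
d = {"ngrams": [{"logprob": -1.2}, {"logprob": -0.7}],
     "rules":  [{"logprob": -0.5}, {"logprob": -1.1}]}
theta = {"lm_logprob": 0.5, "tm_logprob": 1.0, "rule_count": -0.1}
print(model_score(d, theta))  # a single (log) model score
```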
Consensus Decoding

Derivation scores can be interpreted as probabilities:

P(d|f) = exp(θ · φ(d)) / ∑_{d′} exp(θ · φ(d′))

We can query this distribution for:

Whole translations [Blunsom et al., ’08]:
e_max-trans = arg max_e ∑_{d : σ_e(d) = e} P(d|f)

N-gram overlap [Kumar and Byrne, ’04]:
e_mbr = arg max_e E[BLEU(e; σ_e(d))]

Free BLEU points!
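A minimal sketch of both queries, using a toy n-best list as a stand-in for the full derivation distribution. The `gain` function is a deliberately crude stand-in for sentence-level BLEU; all values are invented for illustration.

```python
import math
from collections import defaultdict

# Toy n-best list of (translation yield, model score theta·phi(d)).
nbest = [("the dog ate my homework", 2.0),
         ("the dog ate my homework", 1.5),  # distinct derivation, same yield
         ("dog ate my homework", 1.2),
         ("the dog bit my work", 0.5)]

# P(d|f): normalize exponentiated scores over the list.
z = sum(math.exp(s) for _, s in nbest)
posterior = [(e, math.exp(s) / z) for e, s in nbest]

# Max-translation: sum derivation posteriors that share a yield.
trans_post = defaultdict(float)
for e, prob in posterior:
    trans_post[e] += prob
e_max_trans = max(trans_post, key=trans_post.get)

def gain(e, e_prime):
    """Crude stand-in for sentence-level BLEU: unigram overlap fraction."""
    a, b = e.split(), e_prime.split()
    return len(set(a) & set(b)) / max(len(a), 1)

# MBR: maximize expected gain under the translation distribution.
e_mbr = max(trans_post, key=lambda e: sum(prob * gain(e, e2)
                                          for e2, prob in trans_post.items()))
print(e_max_trans, "|", e_mbr)
```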
System Combination

We often have multiple translation systems. Given the source sentence

el perro comí mi tarea

three systems might output:

1: the dog bit my work
2: the dog ate homework
3: dog ate my homework

Combiners assume little about systems
Objectives similar to consensus decoding

More BLEU points!
Model Combination

Consensus decoding with multiple models
A distribution-driven approach to system combination
Unifies consensus and combination objectives

[Figure: the standard pipeline. A source sentence f is decoded by systems 1–3, producing posteriors P1(d|f), P2(d|f), P3(d|f); consensus decoding of each yields e∗1, e∗2, e∗3, which system combination merges into a single output e.]
Outline

Consensus decoding review
Our model combination technique
Comparison to system combination

(With new algorithms and results!)
Forest-Based Consensus Decoding (Review)

1. Build a forest that encodes the model posterior P(d|f)

   Yo vi al hombre con el telescopio
   → I saw the man with {a,the} telescope

   [Figure: a translation forest over the source sentence. The root node “I ... telescope” can be built by a hyperedge yielding “I saw the man with {a,the} telescope” or by one yielding “I saw with {a,the} telescope the man”, each carrying a posterior P(edge|f); the second introduces the bigram “telescope the”.]

2. Compute n-gram statistics from the posterior

3. Optimize a consensus objective using these statistics
Types of Efficient Consensus Techniques (Review)

Techniques vary along two axes: the n-gram statistics they use (posteriors vs. expected counts) and their consensus objective (fixed vs. learned):

• Lattice Minimum Bayes-Risk Decoding [Tromble et al., ’08]
• Fast Consensus Decoding [DeNero et al., ’09]
• Minimum Bayes-Risk Decoding for Hypergraphs [Kumar et al., ’09]
• Variational Decoding for Machine Translation [Li et al., ’09]
Learned Objectives are Better than Fixed (Review)

C(d) = ∑_{n=1}^{4} w^(n) ∑_{g ∈ n-grams} c(g, d) · P(g|f) + w^(ℓ) |σ_e(d)| + w^(b) θ · φ(d)

The first term collects n-gram statistics, the second is a length feature, and the third is the base model score; the w’s are the objective parameters.

Fixed: choose w such that C_w(d) ≈ E[BLEU(d)]

Learned: choose w to maximize BLEU((arg max_d C_w(d)); e) against the references e [Kumar et al., ’09]

[Figure: consensus performance versus max-derivation decoding over 39 experiment pairs, counting pairs where test-set BLEU changed by ≥0.2. With a fixed objective, 18 pairs improved and 12 degraded; with a learned objective, 26 improved and only 5 degraded.]
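A sketch of this linear consensus objective for a single candidate, assuming the n-gram posteriors P(g|f) have already been computed from the forest (toy values below). Whether the weights w are fixed to approximate E[BLEU(d)] or learned to maximize corpus BLEU is orthogonal to this scoring code.

```python
from collections import Counter

def consensus_score(tokens, base_score, ngram_posterior, w):
    """C(d) = sum_n w[n] * sum_g c(g,d)·P(g|f) + w_len·|sigma_e(d)| + w_base·(theta·phi(d))."""
    score = w["len"] * len(tokens) + w["base"] * base_score
    for n in range(1, 5):
        # c(g, d): how many times each n-gram g occurs in this candidate.
        grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        score += w[n] * sum(c * ngram_posterior.get(g, 0.0) for g, c in grams.items())
    return score

# Toy posteriors and objective parameters (invented for illustration).
posterior = {("the",): 0.9, ("dog",): 0.95, ("the", "dog"): 0.8}
w = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, "len": -0.1, "base": 0.5}
print(consensus_score("the dog ate my homework".split(), base_score=1.3,
                      ngram_posterior=posterior, w=w))
```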
N-gram Posteriors can also be Computed Quickly

P(g|f) = 1.0 − P(ḡ|f), where P(ḡ|f) is the total mass of derivations that do not contain g.

[Figure: a forest node “I ... telescope” built from children “I saw” and “the man with {a,the} telescope” by an edge with weight e^{θ·φ(edge)} = 0.2 that introduces the bigram “saw the”. The left child has inside score 0.1, with g-avoiding masses “I saw”: 0 and “with the”: 0.1; the right child has inside score 0.5, with “with a”: 0.3 and “with the”: 0.2. The resulting node has inside score 0.01 = 0.2 × 0.1 × 0.5 and g-avoiding masses “saw the”: 0 (every derivation through this edge contains it), “I saw”: 0, and “with the”: 0.004 = 0.2 × 0.1 × 0.2.]

Audience challenge: What semiring computes n-gram posteriors?
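The following sketch reproduces the figure’s arithmetic under simplifying assumptions: each hyperedge carries a weight e^{θ·φ(edge)} and the set of n-grams it introduces at its boundary, and for every n-gram g we propagate, bottom-up, the inside mass of derivations that avoid g; at the root, P(g|f) = 1 − avoid(g) / inside. Masses the figure doesn’t show are assumed equal to the child’s inside (i.e., the n-gram never occurs there), and the boundary-word bookkeeping of a real implementation is elided.

```python
def combine(edge_weight, introduced, children, ngrams):
    """Propagate inside mass and, per n-gram g, the mass of derivations
    avoiding g across one hyperedge. children: (inside, avoid_dict) pairs
    from the tail sub-forests. For nodes with several incoming hyperedges,
    sum the per-edge results."""
    inside = edge_weight
    for child_inside, _ in children:
        inside *= child_inside
    avoid = {}
    for g in ngrams:
        if g in introduced:
            avoid[g] = 0.0  # every derivation using this edge contains g
        else:
            mass = edge_weight
            for _, child_avoid in children:
                mass *= child_avoid[g]
            avoid[g] = mass
    return inside, avoid

ngrams = {"I saw", "saw the", "with the"}
# Children from the figure: insides 0.1 and 0.5. Entries not shown in the
# figure are assumed equal to the child's inside (n-gram never occurs there).
left  = (0.1, {"I saw": 0.0, "saw the": 0.1, "with the": 0.1})
right = (0.5, {"I saw": 0.5, "saw the": 0.5, "with the": 0.2})

# The edge weight e^{theta·phi(edge)} = 0.2 introduces the bigram "saw the".
inside, avoid = combine(0.2, {"saw the"}, [left, right], ngrams)
print(inside)             # 0.01 = 0.2 * 0.1 * 0.5, matching the figure
print(avoid["with the"])  # 0.004 = 0.2 * 0.1 * 0.2, matching the figure
for g in ngrams:          # P(g|f) = 1 - avoid(g)/inside, if this node were the root
    print(g, 1.0 - avoid[g] / inside)
```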
Results for Learned Consensus Decoding

Constrained data track of the 2008 NIST MT task.

[Figure: BLEU for Phrase-Based, Hiero, and SAMT systems, comparing the max-derivation baseline against learned consensus decoding using posterior probabilities and using expected counts. Arabic-to-English scores range from 43.3 to 44.6; Chinese-to-English scores range from 25.4 to 28.8. The learned consensus techniques improve over the max-derivation baseline across systems and language pairs.]
Outline

Consensus decoding review
Our model combination technique
Comparison to system combination

(All in one slide!)
Extending Consensus Decoding to Multiple Models

1. Build posterior forests: a phrase-based system and a Hiero-style system each decode the source f, producing P1(d|f) and P2(d|f)

2. Compute n-gram statistics from the forests

3. Union all forests

4. Optimize the multi-model consensus objective:

C(d) = ∑_{i=1}^{I} ( ∑_{n=1}^{4} w_i^(n) v_i^(n)(d) + w_i^(α) α_i(d) ) + w^(ℓ) |σ_e(d)| + w^(b) b(d)

Here v_i^(n)(d) = ∑_{g ∈ n-grams} c(g, d) · P_i(g|f) collects n-gram statistics under model i, the outer sum runs over models, and the length and base-model features carry over from the original single-model objective.

Model choice: α_i(d) is an indicator feature for the system that originally generated d
Base model: b(d) is the model score under the system that generated d
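A sketch of the multi-model objective as a plain scoring function, assuming each model’s n-gram posteriors have been computed from its forest and that each candidate remembers which model’s forest it came from (the α_i indicator). Model names, weights, and posterior values are invented for illustration.

```python
from collections import Counter

def multi_model_consensus_score(tokens, origin, base_score, posteriors, w):
    """C(d) = sum_i [ sum_n w_i[n]·v_i^(n)(d) + w_i[choice]·alpha_i(d) ]
              + w[len]·|sigma_e(d)| + w[base]·b(d),
    where v_i^(n)(d) = sum_g c(g, d)·P_i(g|f)."""
    score = w["len"] * len(tokens) + w["base"] * base_score
    for i, post in posteriors.items():
        # alpha_i(d): indicator for the model whose forest generated d.
        score += w[i]["choice"] * (1.0 if origin == i else 0.0)
        for n in range(1, 5):
            grams = Counter(tuple(tokens[j:j + n])
                            for j in range(len(tokens) - n + 1))
            score += w[i][n] * sum(c * post.get(g, 0.0)
                                   for g, c in grams.items())
    return score

# Toy n-gram posteriors from two models' forests (values invented).
posteriors = {"phrase": {("the",): 0.9, ("the", "dog"): 0.7},
              "hiero":  {("the",): 0.8, ("the", "dog"): 0.9}}
w = {"len": -0.1, "base": 0.5,
     "phrase": {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, "choice": 0.2},
     "hiero":  {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, "choice": -0.1}}
print(multi_model_consensus_score("the dog ate my homework".split(),
                                  origin="hiero", base_score=1.1,
                                  posteriors=posteriors, w=w))
```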
Properties of Model Combination

C(d) = ∑_{i=1}^{I} ( ∑_{n=1}^{4} w_i^(n) v_i^(n)(d) + w_i^(α) α_i(d) ) + w^(ℓ) |σ_e(d)| + w^(b) b(d)

Reduces to consensus decoding when we have only one model
A linear model: w can be tuned to maximize output performance
No concept of a primary system
Every possible output was a derivation under some original model
Model Combination Experimental Results

Compared three in-house Google systems (Phrase-Based, Hiero, SAMT) on the constrained data track of the 2008 NIST task; parameters tuned on the NIST 2004 eval set.

[Figure: BLEU for each system with max-derivation and consensus decoding, plus model combination (“Combo”). Arabic-to-English: Combo reaches 45.3, a +1.4 gain over the best max-derivation system (43.9). Chinese-to-English: Combo reaches 29.0, a +0.6 gain over the best max-derivation system (28.4).]
Sources of Improvement: Arabic-to-English

Do we need model choice features? Should n-gram statistics be model specific? (“Union”: the phrase-based and Hiero-style forests merged into a single forest.)

Best max-derivation system                       43.9
Best single system with consensus decoding       44.5
Union + consensus decoding                       44.5
Union + consensus + model choice features        44.9
Union with model-specific n-gram statistics      45.1
Model-specific & union n-gram statistics         45.3
Outline

Consensus decoding review
Our model combination technique
Comparison to system combination

(The final showdown!)
System Combination Baselines

Two established system combination methods [Macherey & Och, ’07]

[Figure: the pipeline again — each system i decodes f into a posterior Pi(d|f), consensus decoding yields e∗i, and system combination merges e∗1, …, e∗k into e.]

Sentence-level: choose among the system outputs using an MBR objective:

e = arg max_{e ∈ {e∗1, …, e∗k}} E[BLEU(e)]

Word-level: the confusion network approach. All outputs e∗i are aligned to a backbone e_b ∈ {e∗1, …, e∗k}; the sentences and alignments form a confusion network, and the output maximizes a consensus objective.
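A minimal sketch of the sentence-level baseline, assuming each system contributes one consensus output e∗_i with an associated weight; the combiner returns the candidate with the highest expected gain against the others. As before, unigram overlap is a crude stand-in for sentence-level BLEU, and all values are invented.

```python
def unigram_overlap(e, e_prime):
    """Crude stand-in for sentence-level BLEU."""
    a, b = e.split(), e_prime.split()
    return len(set(a) & set(b)) / max(len(a), 1)

def sentence_level_combo(outputs, weights, gain=unigram_overlap):
    """MBR selection among system outputs e*_1..e*_k:
    pick the candidate with the highest expected gain."""
    z = sum(weights)
    return max(outputs,
               key=lambda e: sum((wt / z) * gain(e, e2)
                                 for e2, wt in zip(outputs, weights)))

outputs = ["the dog ate my homework", "dog ate my homework",
           "the dog bit my work"]
print(sentence_level_combo(outputs, weights=[0.5, 0.3, 0.2]))
```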
Model versus System Combination

Qualitative:
• Single-system n-gram statistics are required in both methods
• Model combination searches only once, under a single consensus objective
• The inter-hypothesis alignment problem exists only in system combination

Quantitative (BLEU):

                                     Arabic-to-English   Chinese-to-English
Best single-system consensus               44.5                 28.8
Sentence-level system combination          44.6                 28.8
Word-level system combination              44.7                 28.8
Model combination                          45.3                 29.0
Conclusion

Consensus decoding with a learned objective extends naturally to multiple models.

Model combination provides consensus and combination effects in a unified, distribution-driven objective.

It outperforms a pipeline of consensus decoding followed by system combination, using less total computation.

It’s easy, it’s clean, and it works. Thanks!