Model Combination for Machine Translation
John DeNero, Shankar Kumar, Ciprian Chelba, and Franz Josef Och
Monday, June 7, 2010
Motivation

A statistical machine translation model scores derivations. The (log) model score sums language model and translation model features:

θ · φ(d) = θ · ( ∑_{w ∈ n-grams(d)} φ_lm(w) + ∑_{r ∈ rules(d)} φ_tm(r) )

We can improve over max-derivation decoding:

• Consensus decoding (e.g., minimum Bayes risk)
• System combination (e.g., confusion networks)

In this work, we develop a technique that integrates both.
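To make the decomposition concrete, here is a minimal sketch (not the authors' implementation) of a log-linear derivation score: language model features are summed over the derivation's n-grams and translation model features over its rules, then dotted with θ. All feature names and toy values are invented for illustration.

```python
from collections import Counter

def model_score(derivation, theta):
    """Log-linear score theta · phi(d): phi sums LM features over the
    derivation's n-grams and TM features over its rules."""
    phi = Counter()
    for ngram in derivation["ngrams"]:   # the phi_lm(w) terms
        phi["lm_logprob"] += ngram["logprob"]
    for rule in derivation["rules"]:     # the phi_tm(r) terms
        phi["tm_logprob"] += rule["logprob"]
        phi["rule_count"] += 1.0
    return sum(theta[name] * value for name, value in phi.items())

# Toy derivation with invented feature values.
d = {"ngrams": [{"logprob": -1.2}, {"logprob": -0.7}],
     "rules":  [{"logprob": -0.5}, {"logprob": -1.1}]}
theta = {"lm_logprob": 0.5, "tm_logprob": 1.0, "rule_count": -0.1}
print(model_score(d, theta))  # a single (log) model score
```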
Consensus Decoding

Derivation scores can be interpreted as probabilities:

P(d|f) = exp(θ · φ(d)) / ∑_{d′} exp(θ · φ(d′))

We can query this distribution for:

Whole translations [Blunsom et al., ’08]:
e_max-trans = arg max_e ∑_{d : σ_e(d) = e} P(d|f)

N-gram overlap [Kumar and Byrne, ’04]:
e_mbr = arg max_e E[BLEU(e; σ_e(d))]

Free BLEU points!
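A minimal sketch of both queries, using a toy n-best list as a stand-in for the full derivation distribution. The `gain` function is a deliberately crude stand-in for sentence-level BLEU; all values are invented for illustration.

```python
import math
from collections import defaultdict

# Toy n-best list of (translation yield, model score theta·phi(d)).
nbest = [("the dog ate my homework", 2.0),
         ("the dog ate my homework", 1.5),  # distinct derivation, same yield
         ("dog ate my homework", 1.2),
         ("the dog bit my work", 0.5)]

# P(d|f): normalize exponentiated scores over the list.
z = sum(math.exp(s) for _, s in nbest)
posterior = [(e, math.exp(s) / z) for e, s in nbest]

# Max-translation: sum derivation posteriors that share a yield.
trans_post = defaultdict(float)
for e, prob in posterior:
    trans_post[e] += prob
e_max_trans = max(trans_post, key=trans_post.get)

def gain(e, e_prime):
    """Crude stand-in for sentence-level BLEU: unigram overlap fraction."""
    a, b = e.split(), e_prime.split()
    return len(set(a) & set(b)) / max(len(a), 1)

# MBR: maximize expected gain under the translation distribution.
e_mbr = max(trans_post, key=lambda e: sum(prob * gain(e, e2)
                                          for e2, prob in trans_post.items()))
print(e_max_trans, "|", e_mbr)
```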
System Combination

We often have multiple translation systems. Given the source sentence

el perro comí mi tarea

three systems might output:

1: the dog bit my work
2: the dog ate homework
3: dog ate my homework

Combiners assume little about systems
Objectives similar to consensus decoding

More BLEU points!
Model Combination

Consensus decoding with multiple models
A distribution-driven approach to system combination
Unifies consensus and combination objectives

[Figure: the standard pipeline. A source sentence f is decoded by systems 1–3, producing posteriors P1(d|f), P2(d|f), P3(d|f); consensus decoding of each yields e∗1, e∗2, e∗3, which system combination merges into a single output e.]
Outline

Consensus decoding review
Our model combination technique
Comparison to system combination

(With new algorithms and results!)
Forest-Based Consensus Decoding (Review)

1. Build a forest that encodes the model posterior P(d|f)

   Yo vi al hombre con el telescopio
   → I saw the man with {a,the} telescope

   [Figure: a translation forest over the source sentence. The root node “I ... telescope” can be built by a hyperedge yielding “I saw the man with {a,the} telescope” or by one yielding “I saw with {a,the} telescope the man”, each carrying a posterior P(edge|f); the second introduces the bigram “telescope the”.]

2. Compute n-gram statistics from the posterior

3. Optimize a consensus objective using these statistics
Types of Efficient Consensus Techniques (Review)

Techniques vary along two axes: the n-gram statistics they use (posteriors vs. expected counts) and their consensus objective (fixed vs. learned):

• Lattice Minimum Bayes-Risk Decoding [Tromble et al., ’08]
• Fast Consensus Decoding [DeNero et al., ’09]
• Minimum Bayes-Risk Decoding for Hypergraphs [Kumar et al., ’09]
• Variational Decoding for Machine Translation [Li et al., ’09]
Learned Objectives are Better than Fixed (Review)

C(d) = ∑_{n=1}^{4} w^(n) ∑_{g ∈ n-grams} c(g, d) · P(g|f) + w^(ℓ) |σ_e(d)| + w^(b) θ · φ(d)

The first term collects n-gram statistics, the second is a length feature, and the third is the base model score; the w’s are the objective parameters.

Fixed: choose w such that C_w(d) ≈ E[BLEU(d)]

Learned: choose w to maximize BLEU((arg max_d C_w(d)); e) against the references e [Kumar et al., ’09]

[Figure: consensus performance versus max-derivation decoding over 39 experiment pairs, counting pairs where test-set BLEU changed by ≥0.2. With a fixed objective, 18 pairs improved and 12 degraded; with a learned objective, 26 improved and only 5 degraded.]
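A sketch of this linear consensus objective for a single candidate, assuming the n-gram posteriors P(g|f) have already been computed from the forest (toy values below). Whether the weights w are fixed to approximate E[BLEU(d)] or learned to maximize corpus BLEU is orthogonal to this scoring code.

```python
from collections import Counter

def consensus_score(tokens, base_score, ngram_posterior, w):
    """C(d) = sum_n w[n] * sum_g c(g,d)·P(g|f) + w_len·|sigma_e(d)| + w_base·(theta·phi(d))."""
    score = w["len"] * len(tokens) + w["base"] * base_score
    for n in range(1, 5):
        # c(g, d): how many times each n-gram g occurs in this candidate.
        grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        score += w[n] * sum(c * ngram_posterior.get(g, 0.0) for g, c in grams.items())
    return score

# Toy posteriors and objective parameters (invented for illustration).
posterior = {("the",): 0.9, ("dog",): 0.95, ("the", "dog"): 0.8}
w = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, "len": -0.1, "base": 0.5}
print(consensus_score("the dog ate my homework".split(), base_score=1.3,
                      ngram_posterior=posterior, w=w))
```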
N-gram Posteriors can also be Computed Quickly

P(g|f) = 1.0 − P(ḡ|f), where P(ḡ|f) is the total mass of derivations that do not contain g.

[Figure: a forest node “I ... telescope” built from children “I saw” and “the man with {a,the} telescope” by an edge with weight e^{θ·φ(edge)} = 0.2 that introduces the bigram “saw the”. The left child has inside score 0.1, with g-avoiding masses “I saw”: 0 and “with the”: 0.1; the right child has inside score 0.5, with “with a”: 0.3 and “with the”: 0.2. The resulting node has inside score 0.01 = 0.2 × 0.1 × 0.5 and g-avoiding masses “saw the”: 0 (every derivation through this edge contains it), “I saw”: 0, and “with the”: 0.004 = 0.2 × 0.1 × 0.2.]

Audience challenge: What semiring computes n-gram posteriors?
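The following sketch reproduces the figure’s arithmetic under simplifying assumptions: each hyperedge carries a weight e^{θ·φ(edge)} and the set of n-grams it introduces at its boundary, and for every n-gram g we propagate, bottom-up, the inside mass of derivations that avoid g; at the root, P(g|f) = 1 − avoid(g) / inside. Masses the figure doesn’t show are assumed equal to the child’s inside (i.e., the n-gram never occurs there), and the boundary-word bookkeeping of a real implementation is elided.

```python
def combine(edge_weight, introduced, children, ngrams):
    """Propagate inside mass and, per n-gram g, the mass of derivations
    avoiding g across one hyperedge. children: (inside, avoid_dict) pairs
    from the tail sub-forests. For nodes with several incoming hyperedges,
    sum the per-edge results."""
    inside = edge_weight
    for child_inside, _ in children:
        inside *= child_inside
    avoid = {}
    for g in ngrams:
        if g in introduced:
            avoid[g] = 0.0  # every derivation using this edge contains g
        else:
            mass = edge_weight
            for _, child_avoid in children:
                mass *= child_avoid[g]
            avoid[g] = mass
    return inside, avoid

ngrams = {"I saw", "saw the", "with the"}
# Children from the figure: insides 0.1 and 0.5. Entries not shown in the
# figure are assumed equal to the child's inside (n-gram never occurs there).
left  = (0.1, {"I saw": 0.0, "saw the": 0.1, "with the": 0.1})
right = (0.5, {"I saw": 0.5, "saw the": 0.5, "with the": 0.2})

# The edge weight e^{theta·phi(edge)} = 0.2 introduces the bigram "saw the".
inside, avoid = combine(0.2, {"saw the"}, [left, right], ngrams)
print(inside)             # 0.01 = 0.2 * 0.1 * 0.5, matching the figure
print(avoid["with the"])  # 0.004 = 0.2 * 0.1 * 0.2, matching the figure
for g in ngrams:          # P(g|f) = 1 - avoid(g)/inside, if this node were the root
    print(g, 1.0 - avoid[g] / inside)
```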
Results for Learned Consensus Decoding

Constrained data track of the 2008 NIST MT task.

[Figure: BLEU for Phrase-Based, Hiero, and SAMT systems, comparing the max-derivation baseline against learned consensus decoding using posterior probabilities and using expected counts. Arabic-to-English scores range from 43.3 to 44.6; Chinese-to-English scores range from 25.4 to 28.8. The learned consensus techniques improve over the max-derivation baseline across systems and language pairs.]
Outline

Consensus decoding review
Our model combination technique
Comparison to system combination

(All in one slide!)
Extending Consensus Decoding to Multiple Models

1. Build posterior forests: a phrase-based system and a Hiero-style system each decode the source f, producing P1(d|f) and P2(d|f)

2. Compute n-gram statistics from the forests

3. Union all forests

4. Optimize the multi-model consensus objective:

C(d) = ∑_{i=1}^{I} ( ∑_{n=1}^{4} w_i^(n) v_i^(n)(d) + w_i^(α) α_i(d) ) + w^(ℓ) |σ_e(d)| + w^(b) b(d)

Here v_i^(n)(d) = ∑_{g ∈ n-grams} c(g, d) · P_i(g|f) collects n-gram statistics under model i, the outer sum runs over models, and the length and base-model features carry over from the original single-model objective.

Model choice: α_i(d) is an indicator feature for the system that originally generated d
Base model: b(d) is the model score under the system that generated d
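A sketch of the multi-model objective as a plain scoring function, assuming each model’s n-gram posteriors have been computed from its forest and that each candidate remembers which model’s forest it came from (the α_i indicator). Model names, weights, and posterior values are invented for illustration.

```python
from collections import Counter

def multi_model_consensus_score(tokens, origin, base_score, posteriors, w):
    """C(d) = sum_i [ sum_n w_i[n]·v_i^(n)(d) + w_i[choice]·alpha_i(d) ]
              + w[len]·|sigma_e(d)| + w[base]·b(d),
    where v_i^(n)(d) = sum_g c(g, d)·P_i(g|f)."""
    score = w["len"] * len(tokens) + w["base"] * base_score
    for i, post in posteriors.items():
        # alpha_i(d): indicator for the model whose forest generated d.
        score += w[i]["choice"] * (1.0 if origin == i else 0.0)
        for n in range(1, 5):
            grams = Counter(tuple(tokens[j:j + n])
                            for j in range(len(tokens) - n + 1))
            score += w[i][n] * sum(c * post.get(g, 0.0)
                                   for g, c in grams.items())
    return score

# Toy n-gram posteriors from two models' forests (values invented).
posteriors = {"phrase": {("the",): 0.9, ("the", "dog"): 0.7},
              "hiero":  {("the",): 0.8, ("the", "dog"): 0.9}}
w = {"len": -0.1, "base": 0.5,
     "phrase": {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, "choice": 0.2},
     "hiero":  {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, "choice": -0.1}}
print(multi_model_consensus_score("the dog ate my homework".split(),
                                  origin="hiero", base_score=1.1,
                                  posteriors=posteriors, w=w))
```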
Properties of Model Combination

C(d) = ∑_{i=1}^{I} ( ∑_{n=1}^{4} w_i^(n) v_i^(n)(d) + w_i^(α) α_i(d) ) + w^(ℓ) |σ_e(d)| + w^(b) b(d)

Reduces to consensus decoding when we have only one model
A linear model: w can be tuned to maximize output performance
No concept of a primary system
Every possible output was a derivation under some original model
Model Combination Experimental Results

Compared three in-house Google systems (Phrase-Based, Hiero, SAMT) on the constrained data track of the 2008 NIST task; parameters tuned on the NIST 2004 eval set.

[Figure: BLEU for each system with max-derivation and consensus decoding, plus model combination (“Combo”). Arabic-to-English: Combo reaches 45.3, a +1.4 gain over the best max-derivation system (43.9). Chinese-to-English: Combo reaches 29.0, a +0.6 gain over the best max-derivation system (28.4).]
Sources of Improvement: Arabic-to-English

Do we need model choice features? Should n-gram statistics be model specific? (“Union”: the phrase-based and Hiero-style forests merged into a single forest.)

Best max-derivation system                       43.9
Best single system with consensus decoding       44.5
Union + consensus decoding                       44.5
Union + consensus + model choice features        44.9
Union with model-specific n-gram statistics      45.1
Model-specific & union n-gram statistics         45.3
Outline

Consensus decoding review
Our model combination technique
Comparison to system combination

(The final showdown!)
System Combination Baselines

Two established system combination methods [Macherey & Och, ’07]

[Figure: the pipeline again — each system i decodes f into a posterior Pi(d|f), consensus decoding yields e∗i, and system combination merges e∗1, …, e∗k into e.]

Sentence-level: choose among the system outputs using an MBR objective:

e = arg max_{e ∈ {e∗1, …, e∗k}} E[BLEU(e)]

Word-level: the confusion network approach. All outputs e∗i are aligned to a backbone e_b ∈ {e∗1, …, e∗k}; the sentences and alignments form a confusion network, and the output maximizes a consensus objective.
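A minimal sketch of the sentence-level baseline, assuming each system contributes one consensus output e∗_i with an associated weight; the combiner returns the candidate with the highest expected gain against the others. As before, unigram overlap is a crude stand-in for sentence-level BLEU, and all values are invented.

```python
def unigram_overlap(e, e_prime):
    """Crude stand-in for sentence-level BLEU."""
    a, b = e.split(), e_prime.split()
    return len(set(a) & set(b)) / max(len(a), 1)

def sentence_level_combo(outputs, weights, gain=unigram_overlap):
    """MBR selection among system outputs e*_1..e*_k:
    pick the candidate with the highest expected gain."""
    z = sum(weights)
    return max(outputs,
               key=lambda e: sum((wt / z) * gain(e, e2)
                                 for e2, wt in zip(outputs, weights)))

outputs = ["the dog ate my homework", "dog ate my homework",
           "the dog bit my work"]
print(sentence_level_combo(outputs, weights=[0.5, 0.3, 0.2]))
```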
Model versus System Combination

Qualitative:
• Single-system n-gram statistics are required in both methods
• Model combination searches only once, under a single consensus objective
• The inter-hypothesis alignment problem exists only in system combination

Quantitative (BLEU):

                                     Arabic-to-English   Chinese-to-English
Best single-system consensus               44.5                 28.8
Sentence-level system combination          44.6                 28.8
Word-level system combination              44.7                 28.8
Model combination                          45.3                 29.0
Conclusion

Consensus decoding with a learned objective extends naturally to multiple models.

Model combination provides consensus and combination effects in a unified, distribution-driven objective.

It outperforms a pipeline of consensus decoding followed by system combination, using less total computation.

It’s easy, it’s clean, and it works. Thanks!