Computing Lattice BLEU Oracle Scores for Machine Translation · 2018. 12. 11. · e2E(f) score(ejf) n e.g.: score(ejf) = h(e;f) h(e;f) { features, { params n Æ E(f) { search space

Computing Lattice BLEU Oracle Scoresfor Machine Translation

Artem Sokolov & Guillaume Wisniewski & Francois [email protected]

LIMSI, Orsay, France

1 Introduction2 Oracle Decoding Task3 Proposed Oracle Decoders4 Experiments5 Conclusion & Future Work

Model & Inference in SMT

Usual translating of sentences f:

n best translation: e∗ = arg maxe∈E(f) score(e|f)

n e.g.: score(e|f) = λ · h(e, f) h(e, f) – features, λ – params

n E (f) – search space áinterested in this

Example

n source: Venus est la jumelle infernale de la Terre

n search space options:

á list of n-best hypothesesVenus - the twin of hell of the Earth

Venus is the infernal binocular of the Earth

Venus is the hellish twin of the Earth

á word lattice

02Vénus : Venus

1Vénus_est : Venus 3

est : is

NULL : -

4

la : the 5

la_jumelle : twin

jumelle : twin

7infernale : hellish

6

infernale : infernal8

infernale : of_hell

jumelle : twinjumelle : binocular

9de_la_Terre : of_the_Earth

Artem Sokolov, G. Wisniewski, F. Yvon | LIMSI-CNRS | Lattice BLEU Oracles for SMT 1 / 15

Motivation

n in reality SMT systems perform poorly

å knowing why is difficult because of system complexity

n useful to know the best possible hypotheses

å can help failure analysis

Usefulness of (oracle) translation

n Finding bottlenecks

á if oracle is good – why the system couldn’t return it?

á if oracle is bad – why doesn’t E contain better ones?

n Learning

á update towards the best hypothesis

In this work

n interested in finding best hypotheses contained in lattices

n two new methods, comparison with 2 existing methods


Oracle Decoding Task

n by best hypothesis in E we mean highest BLEU translation

n-BLEU(ei, ri) = BrevityPenalty ·( n∏

m=1

pm

)1/n

á ei – hypotheses, ri – references, pm – clipped precisions

n need some sentence-level approximation BLEU’

Exact oracle finding with sent-BLEU’

n in n-best list – straight-forward

n in word lattice – NP-hard even for 2-BLEU [Leusch et al., 2008]

å so, approximations are needed áfocus of this presentation

Oracle decoding on lattice L for reference rf

π∗(f) = arg maxπ∈paths on L

BLEU ′(eπ, rf).


Earlier Approaches

Language Model Oracle [Li & Khudanpur, 2009] LM

n n-gram language model ALM built on reference r

n find most probable hypothesis

π∗LM(f) = ShortestPath(L ALM(rf))

Partial BLEU Oracle [Dreyer et al., 2007] PB/PB`

n for partial path π let its weight be:

− logBLEU ′(π) = −1

4

∑m=1..4

log pm

n hypothesis h1 is prefered to h2 if logBLEU(h1) > logBLEU(h2)

á and same 3-word history or PBá same 3-word history and same length PB`

n find shortest path

π∗PB(f) = ShortestPath(L)


Problems

Problems with existing approaches

1 LM

á distant relation to BLEUá low quality

2 PB

á approximate search (recombination heuristics)á can be resource-greedy (keeps many hypotheses until the final state)

3 Both LM & PB ignore:

á precision clippingá brevity penalty (except PB`)á corpus nature of BLEU


Our Contribution

New oracles

n partially take clipping, brevity and corpus nature into account

n produce better hypotheses and often faster

Take-away message

n although true BLEU oracles are NP-hard...

n we can find approximate oracles:

n very easy to calculaten very high BLEU


Linear approximation of corpus-BLEU LB

Linear approximation of BLEU [Tromble et al., 2008]

linBLEU(π) = θ0 |eπ|+4∑

n=1

θn

∑u∈Σn

cu(eπ)δu(r)

n cu(e) – number of occurences of u in e,

n δu(r) – indicator of presence of u in r.

n parameters: θ0 = 1 (∼brevity), θn = − 14p·rn−1 (n-gram discounts)

‘de Bruijn transducers’ ∆n

n each ∆n contains all n-grams as output symbols

n arcs weighted by δu(r)

q₀

0:0/01:1/0

q₀

00:ε/0

1

1:ε/0

0:00/0

1:01/0

0:10/0

1:11/0

q₀

0

0:ε/0

1

1:ε/0

011:ε/0

00

0:ε/0

100:ε/0

11

1:ε/0

1:011/0

0:010/0

1:101/0

0:100/0

0:110/0

1:111/0

1:001/0

0:000/0

∆n automata for alphabet 0, 1 and n = 1 . . . 3


Linear approximation of corpus-BLEU LB

Composition effect of ∆s

n L ∆n produces n-gram lattice

n each n-gram arc weighted by θn, if the n-gram was in reference

Find LB oracle translation for 4-BLEU

n weight lattice L with θ0

n weight ∆i with θi

n find shortest path

π∗LB = ShortestPath(L ∆1 ∆2 ∆3 ∆4).

n so far all oracles:á transformed and/or reweighted lattice Lá called ShortestPath

n difficult to account for clipping (it depends on the resulting path)


Integer Linear Programming Oracles ILP

n how to take clipping into account?

n need to fine-tune path weights

n reformulate 1-BLEU shortest path problem as an ILP problem:

arg maxξ

∑i∈arcs

ξi · (Θ1 · 1i∈r −Θ2 · 1i 6∈r) maximize 1-BLEU

∑ξ∈in(qfin)

ξ = 1,∑

ξ∈out(qinit )

ξ = 1 solution should be path

∑ξ∈out(q)

ξ −∑

ξ∈in(q)

ξ = 0, q ∈ nodes \ qinit , qfin

n arc indicator ξi =

1 if arc i is active in solution,

0 otherwise,


Taking account of clipping in ILP

New variable: Clipped word counter

γw = min∑

ξ∈edges with w

ξ, # of w in r

ILP problem with clipping

arg maxξ,γw

Θ1 ·∑

w∈words

γw −Θ2 ·

∑i∈arcs

ξi −∑

w∈words

γw

γw ≥ 0, γw ≤ cw (r), γw ≤

∑ξ∈edges carring w

ξ

∑ξ∈in(qfin)

ξ = 1,∑

ξ∈out(qinit )

ξ = 1

∑ξ∈out(q)

ξ −∑ξ∈in(q)

ξ = 0, q ∈ nodes \ qinit , qfin

n many ILP solvers available n we used Gurobi


Langrangian Relaxation-based Oracle Decoder RLX

n can be done without 3rd-party solvers

n Lagrangian relaxation to deal with clipping

Primal Dual

arg minξ∈paths

−∑

i∈arcs

ξi · θi∑ξ with w

ξ ≤ cw (r),w ∈ r

arg maxλ

L(λ)

λw 0,w ∈ r

Where

L(λ) = minξ

− ∑i∈arcs

ξi · θi +∑w∈r

λw ·

∑ξ with w

ξ − cw (r)


Langrangian Relaxation-based Oracle Decoder RLX

L(λ) = minξ

− ∑i∈arcs

ξiθi +∑w∈r

λw

∑ξ with w

ξ − cw (r)

ξ∗ = arg min

ξL(λ) = arg min

ξ

∑i ∈ arcs

ξi

(λw(ξi ) − θi

)ájust find shortest path

∂L(λ)

∂λw=

∑ξ∈Ω(w)∩ξ∗

ξ − cw (r) áupdate λ towards maxL(λ)

Incremental Constraint Enforcing cf. [Rush et al., 2010]

∀w, λ(0)w ← 0

for t = 1→ T doξ∗(t) = arg minξ

∑i ξi ·

(λw(ξi) − θi

)if all clipping constraints are enforcedthen optimal solution foundelse for w ∈ r do

nw ← n. of occurrences of w in ξ∗(t)

λ(t)w ← λ

(t)w + α(t) · (nw − cw(r))

λ(t)w ← max(0, λ(t)

w )

n this was for 1-BLEU n for n-BLEU run on L ∆n


Experiments: Performance for two decoders

N-code

35

40

45

50

RLX ILP LB4 LB2 PB PBl LM4 LM2 0

1

2

3

4

5

6

BLE

U

avg.

tim

e, s

BLEU

47.8

2

48.1

2

48.2

2

47.7

1

46.7

6

46.4

8

38.9

1

38.7

5

avg. timen-best

(a) fr→en (test: 27.88)

30

35

RLX ILP LB4 LB2 PB PBl LM4LM2 0

0.5

1

1.5

BLE

U

avg.

tim

e, s

BLEU

34.7

9

34.7

0

35.4

9

35.0

9

34.8

5

34.7

6

29.5

3

29.5

3

avg. timen-best

(b) de→en (test: 22.05)

20

25

30

RLX ILP LB4 LB2 PB PBl LM4LM2 0

0.5

1

BLE

U

avg.

tim

e, s

BLEU

24.7

5

24.6

6

25.3

4

24.8

5

24.7

8

24.7

3

20.7

8

20.7

4

avg. timen-best

(c) en→de (test: 15.83)

Moses

35

40

45

50


1

2

3

BLE

U

avg.

tim

e, s

BLEU

43.8

2

44.0

8

44.4

4

43.8

2

43.4

2

43.2

0

36.3

4

36.2

5

avg. timen-best

(d) fr→en (test: 27.68)

30

35


1

2

3

4

BLE

U

avg.

tim

e, s

36.4

3

36.9

1

37.7

3

36.5

2

36.7

5

36.6

2

29.5

1

29.4

5

(e) de→en (test: 21.85)

20

25

30


1

2

3

4

5

6

7

8

9

BLE

U

avg.

tim

e, s

28.6

8

28.6

4 29.9

4

28.9

4

28.7

6

28.6

5

21.2

9

21.2

3

(f) en→de (test: 15.89)


Conclusion & Future Work

Results

n new oracle decoders find better hypotheses (clipping/brevity/corpus)

n ...and do it faster

n optimization for 2-BLEU is sufficient to obtain good 4-BLEU

Main Conclusion

á lattices contain excellent translationsn +5-10 points to n-best list oraclesn +20 points above results of state-of-the-art models

á SMT has serious scoring problems Future Workn linear models not flexible enough? á non-linear score func.n more features needed? á rich featuresn training flawed? á discriminative training

with lattice oracles


Thank you for your attention!

Questions?


Bibliography

Dreyer, M., Hall, K. B., & Khudanpur, S. P. (2007).

Comparing reordering constraints for SMT using efficient BLEU oracle computation.In Proc. of the Workshop on Syntax and Structure in Statistical Translation (pp. 103–110). Morristown, NJ, USA.

Leusch, G., Matusov, E., & Ney, H. (2008).

Complexity of finding the BLEU-optimal hypothesis in a confusion network.In Proc. of the 2008 Conf. on EMNLP (pp. 839–847). Honolulu, Hawaii.

Li, Z. & Khudanpur, S. (2009).

Efficient extraction of oracle-best translations from hypergraphs.In Proc. of Human Language Technologies: The 2009 Annual Conf. of the North American Chapter of the ACL,Companion Volume: Short Papers (pp. 9–12). Morristown, NJ, USA.

Rush, A. M., Sontag, D., Collins, M., & Jaakkola, T. (2010).

On dual decomposition and linear programming relaxations for natural language processing.In Proc. of the 2010 Conf. on EMNLP (pp. 1–11). Stroudsburg, PA, USA.

Tromble, R. W., Kumar, S., Och, F., & Macherey, W. (2008).

Lattice minimum bayes-risk decoding for statistical machine translation.In Proc. of the Conf. on EMNLP (pp. 620–629). Stroudsburg, PA, USA.


Implementation

n OpenFST toolbox

á works with any semiring (PB required semiring of sets)á proven and well optimized ShortestPath algorithms

n Gurobi ILP solver


Documents

Computing Lattice BLEU Oracle Scores for Machine Translation · 2018. 12. 11. · e2E(f) score(ejf) n e.g.: score(ejf) = h(e;f) h(e;f) { features, { params n Æ E(f) { search space