
Machine Translation: Word Alignment Problem

Marcello Federico, FBK, Trento - Italy

2016

Outline

• Word alignments

• Word alignment models

• Alignment search

• Alignment estimation

• EM algorithm

• Model 2

• Fertility alignment models

• HMM alignment models

This part contains advanced material (marked with *) suited to students interested in the mathematical details of the presented models.


Example of Parallel Corpus

Darum liegt die Verantwortung für das Erreichen des Effizienzzieles und der damit einhergehenden CO2-Reduzierung bei der Gemeinschaft, die nämlich dann tätig wird, wenn das Ziel besser durch gemeinschaftliche Massnahmen erreicht werden kann. Und genaugenommen steht hier die Glaubwürdigkeit der EU auf dem Spiel.

That is why the responsibility for achieving the efficiency target and at the same time reducing CO2 lies with the Community, which in fact takes action when an objective can be achieved more effectively by Community measures. Strictly speaking, it is the credibility of the EU that is at stake here.

Notice the different positions of corresponding verb groups. MT has to take into account word re-ordering!

Word Alignments

• Let us consider possible alignments a between words in f and e.

• Typically, alignments are restricted to maps between positions of f and of e.

• Some source words might not be aligned (= virtually aligned with NULL).

• These and even more general alignments are machine learnable.

• Notice also that alignments induce word re-ordering.

[Figure: alignment between the Italian sentence "dalla(1) serata(2) di(3) domani(4) soffierà(5) un(6) freddo(7) vento(8) orientale(9)" and the English sentence "since tomorrow evening an eastern chilly wind will blow": dalla-since, serata-evening, domani-tomorrow, soffierà-blow, un-an, freddo-chilly, vento-wind, orientale-eastern; "di" is virtually aligned with NULL (position 0).]

Word Alignment: Matrix Representation

    blow      9  ·  ·  ·  ·  •  ·  ·  ·  ·
    will      8  ·  ·  ·  ·  ·  ·  ·  ·  ·
    wind      7  ·  ·  ·  ·  ·  ·  ·  •  ·
    chilly    6  ·  ·  ·  ·  ·  ·  •  ·  ·
    eastern   5  ·  ·  ·  ·  ·  ·  ·  ·  •
    an        4  ·  ·  ·  ·  ·  •  ·  ·  ·
    evening   3  ·  •  ·  ·  ·  ·  ·  ·  ·
    tomorrow  2  ·  ·  ·  •  ·  ·  ·  ·  ·
    since     1  •  ·  ·  ·  ·  ·  ·  ·  ·
    NULL      0  ·  ·  •  ·  ·  ·  ·  ·  ·
                 1  2  3  4  5  6  7  8  9

    columns: dalla(1) serata(2) di(3) domani(4) soffierà(5) un(6) freddo(7) vento(8) orientale(9)

Word Alignment: Direct Alignment

$$A : \{1, \dots, m\} \rightarrow \{1, \dots, l\}$$

    implemented  6  ·  ·  ·  ·  •  •  •
    been         5  ·  ·  ·  •  ·  ·  ·
    has          4  ·  ·  •  ·  ·  ·  ·
    program      3  ·  •  ·  ·  ·  ·  ·
    the          2  •  ·  ·  ·  ·  ·  ·
    and          1  ·  ·  ·  ·  ·  ·  ·
       position     1  2  3  4  5  6  7

    columns: il(1) programma(2) è(3) stato(4) messo(5) in(6) pratica(7)

We allow only one link (point) in each column. Some columns may be empty.


Word Alignment: Inverted Alignment

$$A : \{1, \dots, l\} \rightarrow \{1, \dots, m\}$$

    people      6  ·  ·  ·  •
    aboriginal  5  ·  ·  ·  •
    the         4  ·  ·  •  ·
    of          3  ·  ·  •  ·
    territory   2  ·  •  ·  ·
    the         1  •  ·  ·  ·
      position     1  2  3  4

    columns: il(1) territorio(2) degli(3) autoctoni(4)

You can get a direct alignment by swapping source and target sentence.

Alignment Variable

• Modelling the alignment as an arbitrary relation between source and target language is very general but computationally unfeasible: $2^{l \cdot m}$ possible alignments!

• A generally applied restriction is to let each source word be assigned to exactly one target word (see Example 2). Hence, alignment is a map from source to target positions:

$$A : \{1, \dots, m\} \rightarrow \{0, \dots, l\}$$

• Alignment variable: $a = a_1, \dots, a_m$ consists of associations $j \rightarrow i = a_j$, from source position j to target position $i = a_j$.

• We may include null word alignments, that is $a_j = 0$, to account for source words not aligned to any target word. Hence, "only" $(l+1)^m$ possible alignments.

Word Alignment Model

In SMT we will model the translation probability Pr(f | e) by summing the probabilities of all possible $(l+1)^m$ 'hidden' alignments a between the source and the target strings:

$$\Pr(f \mid e) = \sum_{a} \Pr(f, a \mid e) \quad (1)$$

Hence we will consider statistical word alignment models:

$$\Pr(f, a \mid e) = p_\theta(f, a \mid e)$$

defined by specific sets of parameters θ.

The art of statistical modelling consists in designing statistical models which capture the relevant properties of the considered phenomenon, in our case the relationship between a source language string and a target language string.

There are 5 models of increasing complexity (= number of parameters).

Word Alignment Models

• In order to find automatic methods to learn word alignments from data we use mathematical models that "explain" how translations are generated.

• The way models explain translations may appear very naive, if not silly! Indeed they are very simplistic ...

• However, simple explanations often do work better than complex ones!

• We need to be a little bit formal here, just to give names to the ingredients we will use in our recipes to learn word alignments:

– English sentence e is a sequence of l words
– French sentence f is a sequence of m words
– Word alignment a is a map from m positions to l + 1 positions

• We will have to relax a bit our conception of sentence: it is just a sequence of words, which might or might not make sense at all ...


Word Alignment Models

There are five models, of increasing complexity, that explain how a translation and an alignment can be generated from an English sentence.

[Diagram: e → Alignment Model Pr(a, f | e) → a, f]

Complexity refers to the number of parameters that define the model!

We start from the simplest model, called Model 1!

Model 1

[Diagram: e → Alignment Model Pr(a, f | e) → a, f]

Model 1 generates the translation and the alignment as follows:

1. guess the length m of f on the basis of the length l of e

2. for each position j in f repeat the following two steps:

(a) randomly pick a corresponding position i in e
(b) generate word j of f by picking a translation of word i in e

Step 1 is executed by using a translation length predictor.
Step 2.(a) is performed by throwing a die with l + 1 faces (we want to include the null word).
Step 2.(b) is carried out by using a word translation table.

On Probability Factorization

Chain Rule. The probability of a sequence of events $e = e_1, e_2, e_3, \dots, e_l$ can be factorized as:

$$\Pr(e_1, e_2, e_3, \dots, e_l) = \Pr(e_1) \times \Pr(e_2 \mid e_1) \times \Pr(e_3 \mid e_1, e_2) \times \dots \times \Pr(e_l \mid e_1, \dots, e_{l-1})$$

• The joint probability is factorized over single event probabilities.

• Factors however introduce dependencies of increasing complexity – the last factor has the same complexity as the complete joint probability!

• There are two basic approximations for sequential models, which eliminate dependencies in the conditional part of the chain factors.

• Notice that for non-sequential events, we might change the order of factors, e.g.:

$$\Pr(f, a \mid e) = \Pr(a, f \mid e) = \Pr(a \mid e) \times \Pr(f \mid e, a)$$

Basic Sequential Models

• Bag-of-words model: we assume that each event is independent of the others:

$$\Pr(e_1, e_2, e_3, \dots, e_l) \approx \Pr(e_1) \times \Pr(e_2) \times \Pr(e_3) \times \dots \times \Pr(e_l)$$

• Markov chain model: we assume that each event only depends on the previous one:

$$\Pr(e_1, e_2, e_3, \dots, e_l) \approx \Pr(e_1) \times \Pr(e_2 \mid e_1) \times \Pr(e_3 \mid e_2) \times \dots \times \Pr(e_l \mid e_{l-1})$$

• We reduce complexity by removing dependencies.

• The event space becomes smaller and probabilities easier to estimate.

• This simplification might reduce the accuracy of the model (a small sketch of both approximations follows below).
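To make the contrast concrete, here is a minimal Python sketch (not from the slides; all probabilities are made-up toy numbers) that scores the same word sequence under both approximations:

    unigram = {"the": 0.06, "cat": 0.01, "sleeps": 0.005}
    bigram = {("<s>", "the"): 0.15, ("the", "cat"): 0.02, ("cat", "sleeps"): 0.1}

    def bag_of_words_prob(words):
        # independence assumption: product of unigram probabilities
        p = 1.0
        for w in words:
            p *= unigram[w]
        return p

    def markov_chain_prob(words):
        # each word depends only on its predecessor ("<s>" marks the start)
        p, prev = 1.0, "<s>"
        for w in words:
            p *= bigram[(prev, w)]
            prev = w
        return p

    print(bag_of_words_prob(["the", "cat", "sleeps"]))  # 3.0e-06
    print(markov_chain_prob(["the", "cat", "sleeps"]))  # 3.0e-04

The Markov chain keeps some context at the price of a larger event space (word pairs instead of single words).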


Word Alignment Model Factorization

One of the many ways to exactly decompose $\Pr(f_1^m, a_1^m \mid e_1^l)$ is:

$$\Pr(f_1^m, a_1^m \mid e_1^l) = \Pr(m \mid e_1^l) \prod_{j=1}^{m} \Pr(f_j, a_j \mid f_1^{j-1}, a_1^{j-1}, m, e_1^l)$$

$$= \Pr(m \mid e_1^l) \prod_{j=1}^{m} \Pr(a_j \mid f_1^{j-1}, a_1^{j-1}, m, e_1^l) \cdot \Pr(f_j \mid f_1^{j-1}, a_1^{j}, m, e_1^l)$$

Looks dense but it's just the plain application of the chain rule!

Word Alignment Model Factorization

Let's make it look simpler:

$$\Pr(f_1^m, a_1^m \mid e_1^l) = \Pr(m \mid e_1^l) \prod_{j=1}^{m} \Pr(a_j \mid \dots) \cdot \Pr(f_j \mid a_j, \dots)$$

Generative stochastic process:

1. choose the length m of the French string, given knowledge of the English string $e_1^l$

2. cover one English position for each French position j, given ...

3. choose the French word for each position j, given the covered English position ...

Remark: the process works in the "wrong" direction: it generates f from e. In fact, it is used to calculate Pr(f, a | e) × Pr(e) in the search problem. Though, it can work in both directions by exchanging f and e.

Model 1

Given the alignment factorization

$$\Pr(f_1^m, a_1^m \mid e_1^l) = \Pr(m \mid e_1^l) \prod_{j=1}^{m} \Pr(a_j \mid \dots) \cdot \Pr(f_j \mid a_j, \dots)$$

we simplify all interactions by means of pairwise dependencies:

$$\Pr(m \mid e_1^l) = p(m \mid l) \quad \text{length probability}$$
$$\Pr(a_j \mid \dots) = (l+1)^{-1} \quad \text{alignment probability}$$
$$\Pr(f_j \mid a_j, \dots) = p(f_j \mid e_{a_j}) \quad \text{translation probability}$$

Hence, we get the following translation model:

$$\Pr(f_1^m \mid e_1^l) = \sum_{a_1^m} \Pr(f_1^m, a_1^m \mid e_1^l) = \frac{p(m \mid l)}{(l+1)^m} \cdot \sum_{a_1^m} \prod_{j=1}^{m} p(f_j \mid e_{a_j})$$

$$= \frac{p(m \mid l)}{(l+1)^m} \cdot \prod_{j=1}^{m} \sum_{i=0}^{l} p(f_j \mid e_i) \quad \text{nice complexity reduction!}$$
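The last line is what makes Model 1 practical: the sum over $(l+1)^m$ alignments collapses into a product of m small sums. A minimal Python sketch of this computation (t_table and p_len are hypothetical names for a toy translation table and length model):

    import math

    def model1_log_prob(f_words, e_words, t_table, p_len):
        # log Pr(f|e) = log p(m|l) - m*log(l+1) + sum_j log sum_i p(f_j|e_i)
        e_with_null = ["NULL"] + list(e_words)   # position 0 is the null word
        m, l = len(f_words), len(e_words)
        logp = math.log(p_len(m, l)) - m * math.log(l + 1)
        for f in f_words:
            # the inner sum over target positions replaces the sum over alignments
            logp += math.log(sum(t_table.get((f, e), 1e-12) for e in e_with_null))
        return logp

The 1e-12 floor is just a guard against unseen word pairs, not part of the model.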

Model 1

Model 1 has a very simple stochastic generative process:

1. Choose a length m for f according to p(m | l)
2. For each j = 1, ..., m, choose $a_j$ in {0, 1, ..., l} at random
3. For each j = 1, ..., m, choose the French word $f_j$ according to $p(f_j \mid e_{a_j})$

(A small sampling sketch of this process follows below.)

Properties:

• Model 1 is very naive but is a good starting point for better models

• Parameters θ are the probabilities $p(f_j \mid e_{a_j})$

• Computation of Pr(f | e) can be very efficient

• Search for the most probable alignment is straightforward

• Estimation is trivial given a parallel corpus with alignments

• Estimation is efficient given a parallel corpus without alignments
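As announced above, a minimal sampling sketch of the generative story (Python; sample_length and t_dist are assumed callables wrapping the length predictor and the translation table):

    import random

    def sample_model1(e_words, sample_length, t_dist):
        e_with_null = ["NULL"] + list(e_words)
        l = len(e_words)
        m = sample_length(l)                          # step 1: pick a length
        a = [random.randint(0, l) for _ in range(m)]  # step 2: uniform alignment
        f = [t_dist(e_with_null[a_j]) for a_j in a]   # step 3: translate word by word
        return f, a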


Model 1: Generative Process

[Figure: generating from "the(1) program(2) has(3) been(4) implemented(5)" (l = 5): a length m = 7 is chosen, the alignment positions 3 4 5 5 5 1 2 are picked randomly, and the translation words "e'(1) stato(2) messo(3) in(4) pratica(5) il(6) programma(7)" are chosen through a probability table from has(3) been(4) implemented(5) implemented(5) implemented(5) the(1) program(2).]

MODEL 1 ONLY RELIES ON WORD-TO-WORD TRANSLATION PROBABILITIES!

Model 1

[Diagram: e → Alignment Model Pr(a, f | e) → a, f]

Let us see how we can implement Model 1 and look at its complexity:

1. length predictor of the translation: this is not difficult to build; we look for instance at many English-French translations and study how sentence lengths are related (few parameters)

2. die with l + 1 faces: very simple to simulate by a computer (no parameters)

3. translation table of words: this is the tricky part. We need a big table that tells us for each French word f and English word e whether e is a good or a bad translation of f (fair amount of parameters)

Model 1: Translation Table

Assume very simple German and English languages: just 4 words each (let's forget about the null word here).

[Table: 4 × 4 word translation probabilities. The matching pairs get high probability — e.g. p(das | the) = 0.85, p(ein | a) = 0.8, p(Haus | house) = 0.92, p(Buch | book) = 0.92 — and the remaining cells small values between 0.01 and 0.12.]

Model 1 needs a 4 × 4 table:

• each row shows the German translations of one English word

• each row contains probabilities summing up to one

Of course, the majority of cells should ideally be equal to zero. Learning Model 1 basically means filling the table with some good values ...

Model 1: Learning

Let us assume that we have a parallel corpus with alignments:

[Figure: two word-aligned sentence pairs: "dalla serata di domani soffierà un freddo vento orientale" – "since tomorrow evening an eastern chilly wind will blow", and "da est un vento freddo interessa le Alpi" – "an eastern cool breeze affects the Alps".]

We can estimate translation probabilities by counting aligned word-pairs. The maximum likelihood estimate for a discrete distribution is:

$$p(e \mid f) = \frac{count(e, f)}{\sum_{e'} count(e', f)} = \frac{count(e, f)}{count(f)}$$

For the word pair chilly-freddo we count how often they are aligned together:

$$p(\text{chilly} \mid \text{freddo}) = \frac{count(\text{chilly}, \text{freddo})}{count(\text{freddo})} = \frac{1}{2} = 0.5$$

We end up with reliable probabilities by using a very large parallel corpus!
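A minimal counting sketch of this estimate (Python; the aligned pairs below are a toy stand-in for a real word-aligned corpus):

    from collections import Counter

    # (f, e) word pairs read off the alignment links
    aligned_pairs = [("freddo", "chilly"), ("freddo", "cool"),
                     ("vento", "wind"), ("vento", "breeze")]

    pair_count = Counter(aligned_pairs)
    f_count = Counter(f for f, _ in aligned_pairs)

    def p_e_given_f(e, f):
        return pair_count[(f, e)] / f_count[f]

    print(p_e_given_f("chilly", "freddo"))  # 0.5, as above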


Model 1: Aligning

Let us assume that we have probabilities p(f | e) for all word pairs. Given a parallel corpus without alignments, the most probable (or Viterbi) alignment of each sentence pair

$$a^* = \arg\max_{a} \Pr(a \mid f, e) \propto \prod_{j=1}^{m} p(f_j \mid e_{a_j})$$

can be computed by finding the most probable translation for each source position:

$$a_j^* = \arg\max_{i=0,1,\dots,l} p(f_j \mid e_i)$$

The time complexity of the Viterbi search for Model 1 is just O(M × L).
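Because the positions are scored independently, the Viterbi search is a single loop. A minimal sketch (t_table is a hypothetical dictionary of p(f | e) values):

    def viterbi_model1(f_words, e_words, t_table):
        e_with_null = ["NULL"] + list(e_words)   # index 0 = NULL
        alignment = []
        for f in f_words:
            scores = [t_table.get((f, e), 0.0) for e in e_with_null]
            alignment.append(max(range(len(scores)), key=lambda i: scores[i]))
        return alignment   # one a_j in {0, ..., l} per source position j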

Model 1: Best Alignment Search

Let us assume that we have translation probabilities p(f | e). Given a parallel corpus without alignments, we can compute the most probable alignment of each sentence pair as follows: for each word f in the source text f we pick the target word e in the text e that maximizes the available probabilities.

[Exercise 2. Given the following translation probabilities for the word freddo, what alignments will be generated for this word?

             cold   chilly   cool   wind
    freddo   0.4    0.3      0.2    0.1  ]

MLE of a Discrete Distribution

Let $x = x_1, \dots, x_S$ be a random sample of outcomes of a die $X \sim p_\theta(X)$.

• Parameters $\theta = \{\theta(\omega) : \omega \in \Omega, \ \theta(\omega) \geq 0, \ \sum_{\omega} \theta(\omega) = 1\}$ where $\Omega = \{1, 2, \dots, 6\}$

• Assume outcomes in x are independent and identically distributed (iid)

• Maximum likelihood estimation looks for the θ that maximizes the sample likelihood:

$$L(\theta) = \prod_{i=1}^{S} p_\theta(X = x_i) = \prod_{\omega \in \Omega} p_\theta(X = \omega)^{c(\omega)} = \prod_{\omega \in \Omega} \theta(\omega)^{c(\omega)} \quad [c(\cdot) = \text{sample count}]$$

• We apply a monotonic map to get something equivalent but easier to maximize:

$$L(\theta) = \log \prod_{\omega \in \Omega} \theta(\omega)^{c(\omega)} = \sum_{\omega \in \Omega} c(\omega) \log \theta(\omega)$$

• Then we can apply Lagrange multipliers to get the closed-form solution:

$$\hat{\theta}(\omega) = \frac{c(\omega)}{S}$$

that is, the well-known relative frequency!
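A quick numeric check (toy counts, made up here) that the relative frequency indeed maximizes the log-likelihood $\sum_\omega c(\omega) \log \theta(\omega)$:

    import math

    counts = {1: 3, 2: 1, 6: 2}          # toy die sample, S = 6
    S = sum(counts.values())

    def log_lik(theta):
        return sum(c * math.log(theta[w]) for w, c in counts.items())

    mle = {w: c / S for w, c in counts.items()}   # relative frequencies
    other = {1: 0.4, 2: 0.3, 6: 0.3}              # some other distribution
    assert log_lik(mle) > log_lik(other)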

Training of Word Alignment Models

Let $p_\theta(f \mid e)$ be a translation model with unknown parameters θ, that we want to estimate from a sample of iid translations $\{(f_s, e_s) : s = 1, \dots, S\}$ by maximizing:

$$L(\theta) = \sum_{s=1}^{S} \log p_\theta(f_s \mid e_s) = \sum_{f,e} c(f, e) \log p_\theta(f \mid e) \quad [c(\cdot) = \text{sample count}] \quad (2)$$

$p_\theta(f_s \mid e_s)$ is the marginal probability of an alignment model:

$$p_\theta(f \mid e) = \sum_{a} p_\theta(f, a \mid e) \quad (3)$$

where the hidden variable a is not observed in the training sample.

Unfortunately, there is no closed-form solution for maximizing L(θ). There is, however, an iterative algorithm which is proven to converge at least to a local maximum of L(θ).


Estimation of Word Alignment Models

How to train alignment models from parallel data?

• data + word alignments =⇒ model parameters

• data + model parameters =⇒ word alignments

Idea to solve this chicken & egg problem:

[Diagram: a BILINGUAL CORPUS and INITIAL PARAMETERS feed an ESTIMATE PARAMETERS step, whose IMPROVED parameters are fed back; loop until convergence.]

Model 1: Estimation

Let's go back to our simplified English and German languages.

Ingredients of the Expectation Maximization algorithm:

• Initial parameters: translation table with uniform probabilities

• Bilingual corpus: collection of human translations

Bilingual corpus:

    the house - das Haus
    the book - das Buch
    a book - ein Buch

Probability table:

             das    ein    Haus   Buch
    the      0.25   0.25   0.25   0.25
    a        0.25   0.25   0.25   0.25
    house    0.25   0.25   0.25   0.25
    book     0.25   0.25   0.25   0.25

Let us now see how to improve our probabilities ...

Model 1: Estimation

We start from the first sentence pair of the bilingual corpus:

    the house - das Haus

and apply the following two steps:

1. We weight each word co-occurrence with its probability in the table:
   co(the,das) = 1 × 0.25      co(the,Haus) = 1 × 0.25
   co(house,das) = 1 × 0.25    co(house,Haus) = 1 × 0.25

2. We transform weighted co-occurrences into conditional probabilities:
   Pr(das|the) = 0.25/(0.25+0.25)      Pr(Haus|the) = 0.25/(0.25+0.25)
   Pr(das|house) = 0.25/(0.25+0.25)    Pr(Haus|house) = 0.25/(0.25+0.25)

    the      das    ein   Haus   Buch      house    das    ein   Haus   Buch
      co     0.25   0     0.25   0           co     0.25   0     0.25   0
      pr     0.50   0     0.50   0           pr     0.50   0     0.50   0

Notice: in this sentence "the" can only be linked either to "das" or to "Haus".

We apply the same steps to all the sentence pairs of the bilingual corpus.

Model 1: Estimation

• the house - das Haus

    the      das    ein   Haus   Buch      house    das    ein   Haus   Buch
      co     0.25   0     0.25   0           co     0.25   0     0.25   0
      pr     0.50   0     0.50   0           pr     0.50   0     0.50   0

• the book - das Buch

    the      das    ein   Haus   Buch      book     das    ein   Haus   Buch
      co     0.25   0     0      0.25        co     0.25   0     0      0.25
      pr     0.50   0     0      0.50        pr     0.50   0     0      0.50

• a book - ein Buch

    a        das    ein   Haus   Buch      book     das    ein   Haus   Buch
      co     0      0.25  0      0.25        co     0      0.25  0      0.25
      pr     0      0.50  0      0.50        pr     0      0.50  0      0.50


Model 1: Estimation

We sum up all sentence-level probabilities in a co-occurrence table ...

Bilingual corpus:

    the house - das Haus
    the book - das Buch
    a book - ein Buch

Co-occurrence table:

             das    ein    Haus   Buch   total
    the      1.0    0      0.50   0.50   2
    a        0      0.5    0      0.5    1
    house    0.5    0      0.5    0      1
    book     0.5    0.5    0      1.0    2

Finally, we compute updated word translation probabilities from the counts.

Probability table:

             das    ein    Haus   Buch
    the      0.50   0      0.25   0.25
    a        0      0.5    0      0.5
    house    0.5    0      0.5    0
    book     0.25   0.25   0      0.50

Let us start a second iteration ...

Model 1: Estimation

• the house - das Haus

    the      das    ein   Haus   Buch      house    das    ein   Haus   Buch
      co     0.5    0     0.25   0           co     0.5    0     0.5    0
      pr     0.67   0     0.33   0           pr     0.5    0     0.5    0

• the book - das Buch

    the      das    ein   Haus   Buch      book     das    ein   Haus   Buch
      co     0.5    0     0      0.25        co     0.25   0     0      0.5
      pr     0.67   0     0      0.33        pr     0.33   0     0      0.67

• a book - ein Buch

    a        das    ein   Haus   Buch      book     das    ein   Haus   Buch
      co     0      0.5   0      0.5         co     0      0.25  0      0.50
      pr     0      0.5   0      0.5         pr     0      0.33  0      0.67

Model 1: Estimation

Again, we sum all probabilities in a co-occurrence table:

Co-occurrence table:

             das    ein    Haus   Buch   total
    the      1.34   0      0.33   0.33   2
    a        0      0.5    0      0.5    1
    house    0.5    0      0.5    0      1
    book     0.33   0.33   0      1.34   2

and compute updated word translation probabilities from the counts.

Probability table:

             das     ein     Haus    Buch
    the      0.67    0       0.165   0.165
    a        0       0.5     0       0.5
    house    0.5     0       0.5     0
    book     0.165   0.165   0       0.67

We iterate this procedure several times, until the probabilities reach stable values.

Model 1: Estimation

Iteration 3:

             das    ein    Haus   Buch
    the      0.8    0      0.1    0.1
    a        0      0.5    0      0.5
    house    0.5    0      0.5    0
    book     0.1    0.1    0      0.8

....

Iteration 12:

             das    ein    Haus   Buch
    the      1.0    0      0      0
    a        0      0.5    0      0.5
    house    0.5    0      0.5    0
    book     0      0      0      1.0

• This procedure is called the Expectation Maximization algorithm

• Here, EM could only learn the translations of "the" and "book"!

• We need more data to learn more translations and ... better models, too! (A compact runnable sketch of these iterations follows below.)
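As announced above, a minimal runnable sketch of these EM iterations (Python; it reproduces the toy tables, with uniform 0.25 initialization and the per-sentence normalization shown above; variable names are mine):

    from collections import defaultdict

    corpus = [("the house", "das Haus"), ("the book", "das Buch"), ("a book", "ein Buch")]
    corpus = [(e.split(), g.split()) for e, g in corpus]

    p = defaultdict(lambda: 0.25)        # p[(g, e)] = p(g | e), uniform start

    for iteration in range(12):
        count, total = defaultdict(float), defaultdict(float)
        for e_words, g_words in corpus:
            for e in e_words:
                z = sum(p[(g, e)] for g in g_words)    # sentence-level normalizer
                for g in g_words:
                    count[(g, e)] += p[(g, e)] / z     # expected co-occurrences
                    total[e] += p[(g, e)] / z
        p = defaultdict(float, {pair: c / total[pair[1]] for pair, c in count.items()})

    print(round(p[("das", "the")], 2), round(p[("Buch", "book")], 2))  # both -> 1.0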


Expectation Maximization Algorithm

We introduce an auxiliary function $Q(\tilde{\theta}, \theta)$ (defined in the appendix) which has two properties:

$$Q(\tilde{\theta}, \tilde{\theta}) = 0 \quad (4)$$

$$Q(\tilde{\theta}, \theta) > 0 \implies L(\theta) > L(\tilde{\theta}) \quad (5)$$

Hence, with this iterative procedure we can find a maximum point of L(θ):

    0.  initialize θ̂
    1.  do
    2.      θ̃ ← θ̂
    3.      θ̂ ← argmax_θ Q(θ̃, θ)
    4.  while L(θ̂) > L(θ̃)

Property: $Q(\tilde{\theta}, \tilde{\theta}) = 0 \implies \{\max_\theta Q(\tilde{\theta}, \theta)\} \geq 0$.

Expectation Maximization Algorithm

For our alignment models, the solution of $\max_\theta Q(\tilde{\theta}, \theta)$ is:

$$\hat{\theta}(\omega) = \frac{c_{\tilde{\theta}}(\omega)}{\sum_{\omega' \in \Omega_\mu} c_{\tilde{\theta}}(\omega')}$$

$$c_{\tilde{\theta}}(\omega) = \sum_{s=1}^{S} \sum_{a} p_{\tilde{\theta}}(a \mid f_s, e_s) \, c(\omega; a, f_s, e_s)$$

where:

• θ(ω) is one element of θ, i.e. the probability of ω

• $\tilde{\theta}$ are the old parameter values, $\hat{\theta}$ are the new values

• ω is any elementary event of the model, to be normalized over a sub-space $\Omega_\mu$

$\hat{\theta}(\omega)$ is a relative frequency estimator based on the expected count $c_{\tilde{\theta}}(\omega)$, which is obtained by generating all possible alignments a with the old probabilities $\tilde{\theta}$.

Expectation Maximization Algorithm

    0.  initialize θ̂
    1.  do
    2.      θ̃ ← θ̂
    3.      3.1  generate all possible alignments for the training data
            3.2  accumulate observed counts weighted by alignment model θ̃
            3.3  compute relative frequencies θ̂ from the counts c_θ̃(ω)
    4.  while L(θ̂) > L(θ̃)

EM Algorithm of Model 1

The parameters of Model 1 are the probabilities p(f | e), i.e. that word f is aligned with word e:

$$p(f \mid e) = \frac{c_{\tilde{\theta}}(f \mid e)}{\sum_{f'} c_{\tilde{\theta}}(f' \mid e)}$$

$$c_{\tilde{\theta}}(f \mid e) = \sum_{s=1}^{S} \sum_{a} p_{\tilde{\theta}}(a \mid f_s, e_s) \, c(f \mid e; a, f_s, e_s)$$

$c(f \mid e; a, f_s, e_s)$: count how many times f is aligned with e in the triple $(a, f_s, e_s)$.

With some manipulation (see appendix) we get:

$$c_{\tilde{\theta}}(f \mid e) = \sum_{s=1}^{S} \sum_{k=1}^{m} \sum_{i=0}^{l} \frac{p(f_k \mid e_i)}{\sum_{a=0}^{l} p(f_k \mid e_a)} \, \delta(e, e_i) \, \delta(f, f_k)$$

EM Algorithm of Model 1

    EM-M1(F, E, S)
        Init-Params(P)                  // P[f,e] = uniform
        do
            Reset-Expected-Counts       // p[] = 0; ptot[] = 0
            for s := 1 to S             // loop over training data
                Expected-Counts(F[s], Length(F[s]), E[s], Length(E[s]))
            for f in F
                for e in E              // new parameters
                    P[f,e] := p[f,e] / ptot[e]
        until convergence

EM Algorithm of Model 1

    Expected-Counts(F, m, E, l)
        // update counters p[], ptot[] using current parameters P[]
        for j := 1 to m
            t := 0
            for i := 0 to l
                f := F[j]; e := E[i]
                t := t + P[f,e]
            for i := 0 to l
                f := F[j]; e := E[i]
                p[f,e] := p[f,e] + P[f,e] / t
                ptot[e] := ptot[e] + P[f,e] / t

Model 2

• Replaces the uniform alignment probability of Model 1 with:

$$\Pr(a_j \mid \dots) \approx p(a_j \mid j, l, m)$$

• Properties:

– Model 1 does not care where words appear in the two strings!
– Model 2 introduces alignment probabilities, i.e. a table of size $(L \times M)^2$
– Training of Models 1-2 is easy from a bilingual corpus with given alignments: both models are products of discrete distributions, their likelihood function is a product of multinomial distributions, and MLE just needs relative counts of events (sufficient statistics)

• Efficient computation of Pr(f | e), like Model 1

• Problems and limitations of Model 1 and Model 2:

– they do not model the number of foreign words to be connected to each English word
– the alignment probability of Model 2 is complex and shallow

Example of Alignment with Model 1

    .      ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  •
    mehr   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  •  ·
    nicht  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  •  ·  ·  ·
    wohl   ·  ·  ·  ·  ·  •  •  ·  ·  •  ·  ·  ·  ·
    das    ·  ·  ·  ·  ·  ·  ·  ·  •  ·  ·  ·  ·  ·
    geht   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  •  ·  ·
    dann   ·  ·  ·  •  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·
    ,      ·  ·  •  ·  •  ·  ·  •  ·  ·  ·  ·  ·  ·
    ja     ·  •  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·
    ah     •  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·
    NULL   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·

    columns: oh(1) well(2) ,(3) then(4) ,(5) I(6) guess(7) ,(8) that(9) will(10) not(11) work(12) anymore(13) .(14)

Problem:
– three source words (,) are mapped to the same target word!
– in fact, source words are aligned independently of each other


Example: alignment with fertility models

    .      ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  •
    mehr   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  •  ·
    nicht  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  •  ·  ·  ·
    wohl   ·  ·  ·  ·  ·  •  •  ·  ·  ·  ·  ·  ·  ·
    das    ·  ·  ·  ·  ·  ·  ·  ·  •  ·  ·  ·  ·  ·
    geht   ·  ·  ·  ·  ·  ·  ·  ·  ·  •  ·  •  ·  ·
    dann   ·  ·  ·  •  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·
    ,      ·  ·  •  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·
    ja     ·  •  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·
    ah     •  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·
    NULL   ·  ·  ·  ·  •  ·  ·  •  ·  ·  ·  ·  ·  ·

    columns: oh(1) well(2) ,(3) then(4) ,(5) I(6) guess(7) ,(8) that(9) will(10) not(11) work(12) anymore(13) .(14)

Fertility models:
– explicitly consider the number of words covered by each English word
– e.g. if the comma has fertility 1, then only one source word can be aligned to it

Fertility Models

The number of French words covered by e is a random variable $\phi_e$: namely, the fertility of e.

• Models 1-2 do not explicitly model fertilities

• Models 3, 4, and 5 parameterize fertilities directly

• Fertility models imply a different generative process of f and a, given e:

1. For i = 1, ..., l, 0, choose a fertility value $\phi_i \geq 0$ for word $e_i$
2. For i = 1, ..., l, 0, choose a tablet $\tau_i$ of $\phi_i$ French words to translate $e_i$
3. Choose a permutation $\pi$ over the tableau $\tau = (\tau_1, \dots, \tau_l, \tau_0)$ to generate f
4. IF any position was chosen more than once THEN return FAILURE
5. ELSE return (a, f) corresponding to $(\tau, \pi)$

Notice:
– for "correct" pairs $(\tau, \pi)$ there is a many-to-one mapping to (f, a)
– the notion of fertility is embedded into τ and π

Model 3

[Diagram: e → Alignment Model Pr(a, f | e) → a, f]

Model 3 generates the translation and the alignment as follows:

1. for each word i of e it generates a fertility value $\phi_i$

2. for each word i of e it applies the following steps:

(a) generate $\phi_i$ translations of word i
(b) pick one position for each of the $\phi_i$ words

Step 1 implicitly defines the length m of the translation.
Steps 1-2 all rely on specific probability tables.
This model is significantly more complex than Model 1!

Estimation of Model 3 follows the principle used for Model 1; it's just more tricky!

Model 3: Generative Process

[Figure: generating from "null(0) the(1) program(2) has(3) been(4) implemented(5)": fertilities 1, 1, 1, 1 and 3 (implemented) are chosen; the tablets give il, programma, è, stato, and (messo, in, pratica); the permutation 6 7 1 2 3 4 5 yields "e'(1) stato(2) messo(3) in(4) pratica(5) il(6) programma(7)".]

THE PERMUTATION MUST BE CHECKED BEFORE GENERATING (a, f)!

Model 3: Training

• Problem: no way to efficiently compute the summation over alignments

• Trick: limit the summation to a neighborhood of the best alignment under $\tilde{\theta}$:

$$c_{\tilde{\theta}}(\omega) \approx \sum_{s=1}^{S} \sum_{a \in N(a^*)} p_{\tilde{\theta}}(a \mid f_s, e_s) \, c(\omega; a, f_s, e_s)$$

• Problem: no efficient way to compute Viterbi alignments with Model 3

• Trick: do hill-climbing in alignment space (see the sketch after this list):

– start from the best alignment by Model 2: $a^* = V(f \mid e; M2)$
– hill-climbing operator:

$$b(a^*) = \arg\max_{a \in N(a^*)} p(a \mid e, f; M3) \quad (6)$$

– neighborhood N(a): alignments differing from a by one move or one swap
– move operator m[j,i](a): set $a_j := i$
– swap operator s[j1,j2](a): exchange $a_{j_1}$ with $a_{j_2}$
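As announced above, a minimal sketch of the move/swap neighborhood (Python; an alignment is a list of target positions, one per source word):

    def neighborhood(a, l):
        neighbors = []
        m = len(a)
        for j in range(m):                    # move operator m[j,i]
            for i in range(l + 1):
                if i != a[j]:
                    b = a.copy(); b[j] = i
                    neighbors.append(b)
        for j1 in range(m):                   # swap operator s[j1,j2]
            for j2 in range(j1 + 1, m):
                if a[j1] != a[j2]:
                    b = a.copy(); b[j1], b[j2] = b[j2], b[j1]
                    neighbors.append(b)
        return neighbors

Hill-climbing then repeatedly replaces a with its best-scoring neighbor until no neighbor improves p(a | e, f; M3).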

Incremental Training Procedure

[Diagram: the BILINGUAL CORPUS and INITIAL PARAMETERS feed the EM algorithm for MODEL 1; the resulting parameters initialize the EM algorithm for MODEL 2, which in turn initializes MODEL 3, and so on. Use the previous model to initialize some of the parameters of the next model!]

HMM Alignment Model

Another alignment model which follows from the general alignment model:

$$\Pr(f_1^m, a_1^m \mid e_1^l) = \Pr(m \mid e_1^l) \prod_{j=1}^{m} \Pr(a_j \mid f_1^{j-1}, a_1^{j-1}, m, e_1^l) \cdot \Pr(f_j \mid f_1^{j-1}, a_1^{j}, m, e_1^l)$$

Let us define the following parameters:

$$\Pr(m \mid e_1^l) = p(m \mid l) \quad \text{string length probabilities}$$
$$\Pr(a_j \mid \dots) = p(a_j \mid a_{j-1}, l) \quad \text{alignment probabilities}$$
$$\Pr(f_j \mid a_j, \dots) = p(f_j \mid e_{a_j}) \quad \text{translation probabilities}$$

Hence, we get the following translation model:

$$\Pr(f_1^m \mid e_1^l) = \sum_{a_1^m} \Pr(f_1^m, a_1^m \mid e_1^l) = p(m \mid l) \cdot \sum_{a_1^m} \prod_{j=1}^{m} p(a_j \mid a_{j-1}, l) \cdot p(f_j \mid e_{a_j})$$

EM training can be carried out efficiently through dynamic programming (see the sketch below).
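The dynamic program is the standard forward recursion, costing O(m · l²) instead of enumerating l^m alignment paths. A minimal sketch under two assumptions not stated on the slide: a uniform start distribution, and the common jump-width parameterization p(a_j | a_{j-1}, l) stored as a hypothetical table jump[(width, l)]:

    def hmm_forward_prob(f_words, e_words, t_table, jump, p_len):
        m, l = len(f_words), len(e_words)
        # alpha[i]: probability of f_1..f_j with a_j = i+1 (target positions 1..l)
        alpha = [(1.0 / l) * t_table[(f_words[0], e_words[i])] for i in range(l)]
        for j in range(1, m):
            alpha = [sum(alpha[ip] * jump[(i - ip, l)] for ip in range(l))
                     * t_table[(f_words[j], e_words[i])]
                     for i in range(l)]
        return p_len(m, l) * sum(alpha)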


Combinations of Word Alignments

Given parallel sentences we can train an alignment model and then align them. We have different options:

• direct alignment: we learn alignments from source to target

• inverted alignment: we learn alignments from target to source

We can get better alignments by combining direct and inverted alignments (a small sketch follows below):

• union: greedy collection of alignment points, higher coverage

• intersection: selective collection, higher precision

• grow-diagonal: take the best of the two

Properties:

• direct/inverted alignments are maps between two sets of positions

• the union alignment is a many-to-many partial alignment

• the intersection is a 1-1 partial alignment
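As announced above, a minimal sketch of the union and intersection combinations (Python; alignments are sets of (source, target) links):

    def symmetrize(direct, inverted):
        direct = set(direct)
        flipped = {(s, t) for (t, s) in inverted}   # put both in (source, target) form
        union = direct | flipped                    # higher coverage
        intersection = direct & flipped             # higher precision
        return union, intersection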

Union and Intersection Alignments

[Figure: two alignment matrices over the same source/target sentence pair, one from the direct and one from the inverted alignment; their union collects all links, their intersection keeps only the links present in both.]

Grow Diagonal Word Alignment

[Figure: starting from the intersection of the direct and inverted alignments, the grow-diagonal heuristic iteratively adds neighboring links taken from the union, growing the alignment along the diagonal.]

How to measure quality of word alignments

[Figure: an automatic alignment A is compared against a manual alignment consisting of sure links S and possible links P (with S ⊆ P), by counting matches.]

Alignment Error Rate:

$$AER = 1 - \frac{\#(A \cap S) + \#(A \cap P)}{\#(A) + \#(S)} = 1 - \frac{3 + 5}{6 + 4} = 0.2$$
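A minimal sketch of the AER computation (Python; alignments are sets of (source, target) links, and the counts in the comment are the toy numbers above):

    def aer(A, S, P):
        # S: sure links, P: possible links (S is a subset of P)
        return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))

    # with |A ∩ S| = 3, |A ∩ P| = 5, |A| = 6, |S| = 4: 1 - 8/10 = 0.2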


Use of word alignments

Bilingual concordance. Example: searching the corpus "Alice in Wonderland" for the string "rabbit" (source: EN-English, target: ZH-Chinese):

She felt very sleepy, when suddenly a White rabbit with pink eyes ran close by her.

nor did Alice think it so unusual to hear the rabbit say to itself "Oh dear! Oh dear! I shall be too late!"

But when the rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for she remembered that she had never before seen a rabbit with either a waistcoat-pocket or a watch to take out of it, and she ran across the field after it, and was just in time to see it pop down a large rabbit-hole under the hedge.

The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had no time to think about stopping herself before she found herself falling down what seemed to be a very deep well.

[The aligned Chinese translations appear alongside in the original slide; they are not recoverable from this extraction.]

Last words on word alignments

Given a parallel corpus we can automatically learn alignments to:

• discover interesting lexical relationships

• generate a probabilistic translation lexicon

• extract phrase-pairs

Alignments have limitations in terms of allowed word mappings. Better alignments can be obtained by:

• estimating alignments from source to target and vice versa

• computing a suitable combination of the two alignments

Appendix

Often used symbols

    l, m                      length of target and source sentences
    f = f_1^m ≡ f_1, ..., f_m    source sentence
    e = e_1^l ≡ e_1, ..., e_l    target sentence
    i, j                      target and source positions
    e_i, f_j                  target and source words
    e_0                       empty word (of the target sentence)
    F                         source language dictionary
    E                         target language dictionary
    i ∈ {0, 1, ..., l}        target positions
    L, M                      maximum length of target and source sentences


Auxiliary Function and Theorem

Given parameters $\tilde{\theta}$, we search for better values through the auxiliary function:

$$Q(\tilde{\theta}, \theta) = \sum_{f,e} c(f, e) \sum_{a} p_{\tilde{\theta}}(a \mid f, e) \log \frac{p_\theta(f, a \mid e)}{p_{\tilde{\theta}}(f, a \mid e)} \quad (7)$$

where:

$$p_{\tilde{\theta}}(a \mid f, e) = \frac{p_{\tilde{\theta}}(f, a \mid e)}{p_{\tilde{\theta}}(f \mid e)} = \frac{p_{\tilde{\theta}}(f, a \mid e)}{\sum_{a'} p_{\tilde{\theta}}(f, a' \mid e)} \quad (8)$$

and

$$Q(\tilde{\theta}, \tilde{\theta}) = 0 \quad (9)$$

Q is only similar to an entropy formula and is explained by the following property:

EM Theorem. If $Q(\tilde{\theta}, \theta) > 0$ then $L(\theta) > L(\tilde{\theta})$.

EM Theorem*

EM Theorem. Given parameter values $\tilde{\theta}$ and θ of an alignment model, it holds:

$$\text{if } Q(\tilde{\theta}, \theta) > 0 \text{ then } L(\theta) > L(\tilde{\theta}) \quad (10)$$

Proof. We can show that Q is related to the likelihood function L by:

$$L(\theta) \geq L(\tilde{\theta}) + Q(\tilde{\theta}, \theta) \quad (11)$$

which is equivalent to the theorem's statement. The proof of the inequality is based on this simple geometric property: $\log x \leq (x - 1)$ (equality for x = 1).

EM Theorem (cont'd)*

Hence, for any e and f, we have that

$$\sum_{a} p_{\tilde{\theta}}(a \mid f, e) \log \frac{p_\theta(f, a \mid e)}{p_{\tilde{\theta}}(f, a \mid e)} \quad (12)$$

$$= \sum_{a} p_{\tilde{\theta}}(a \mid f, e) \log \left( \frac{p_\theta(f, a \mid e) / p_\theta(f \mid e)}{p_{\tilde{\theta}}(f, a \mid e) / p_{\tilde{\theta}}(f \mid e)} \cdot \frac{p_\theta(f \mid e)}{p_{\tilde{\theta}}(f \mid e)} \right) \quad (13)$$

$$= \sum_{a} p_{\tilde{\theta}}(a \mid f, e) \log \frac{p_\theta(a \mid f, e)}{p_{\tilde{\theta}}(a \mid f, e)} + \log \frac{p_\theta(f \mid e)}{p_{\tilde{\theta}}(f \mid e)} \underbrace{\sum_{a} p_{\tilde{\theta}}(a \mid f, e)}_{=1} \quad (14)$$

$$\leq \underbrace{\sum_{a} p_{\tilde{\theta}}(a \mid f, e) \left( \frac{p_\theta(a \mid f, e)}{p_{\tilde{\theta}}(a \mid f, e)} - 1 \right)}_{=0} + \log \frac{p_\theta(f \mid e)}{p_{\tilde{\theta}}(f \mid e)} \quad (15)$$

In the last step, we applied the inequality $\log x \leq x - 1$ with $x = \frac{p_\theta(a \mid f, e)}{p_{\tilde{\theta}}(a \mid f, e)}$.

EM Theorem (cont'd)*

By summing up over all (f, e) we get the desired inequality:

$$\sum_{(f,e)} c(f, e) \log \frac{p_\theta(f \mid e)}{p_{\tilde{\theta}}(f \mid e)} \geq \sum_{(f,e)} c(f, e) \sum_{a} p_{\tilde{\theta}}(a \mid f, e) \log \frac{p_\theta(f, a \mid e)}{p_{\tilde{\theta}}(f, a \mid e)}$$

$$L(\theta) - L(\tilde{\theta}) \geq Q(\tilde{\theta}, \theta)$$

$$L(\theta) \geq L(\tilde{\theta}) + Q(\tilde{\theta}, \theta)$$

End of the Proof.

• The role of $Q(\tilde{\theta}, \theta)$ is now clear:

– $\tilde{\theta}$ are the current parameters, θ are the new unknown parameters
– if we find θ such that $Q(\tilde{\theta}, \theta) > 0$ then we have better parameters

• ... but we need some parameters $\tilde{\theta}$ to start with ...

• the good news is that we can start with any settings (uniform, random, ...)


EM with Alignment Models

All our word alignment models have the general exponential form (Models 1-5 are products of discrete distributions defined over different events within (a, f, e)):

$$p_\theta(f, a \mid e) = \prod_{\omega \in \Omega} \theta(\omega)^{c(\omega; a, f, e)} \quad (16)$$

where the parameters satisfy multiple normalization constraints:

$$\sum_{\omega \in \Omega_\mu} \theta(\omega) = 1, \quad \mu = 1, 2, \dots \quad (17)$$

where the subsets $\Omega_\mu$, μ = 1, 2, ..., form a partition of Ω.

The partition corresponds to all the conditional probabilities in the model. Each conditional probability to be estimated has indeed to sum up to 1.

EM with Alignment Models

The constraints lead to the system of equations:

$$\frac{\partial}{\partial \theta(\omega)} \left( Q(\tilde{\theta}, \theta) + \lambda_\mu \left( 1 - \sum_{\omega' \in \Omega_\mu} \theta(\omega') \right) \right) = \frac{\partial}{\partial \theta(\omega)} Q(\tilde{\theta}, \theta) - \lambda_\mu = 0, \quad \omega \in \Omega_\mu, \ \mu = 1, 2, \dots$$

$$\frac{\partial}{\partial \lambda_\mu} \left( Q(\tilde{\theta}, \theta) + \lambda_\mu \left( 1 - \sum_{\omega' \in \Omega_\mu} \theta(\omega') \right) \right) = 1 - \sum_{\omega' \in \Omega_\mu} \theta(\omega') = 0, \quad \mu = 1, 2, \dots \quad (18)$$

For $\omega \in \Omega_\mu$, by applying Lagrange multipliers we get the re-estimation formula:

$$\hat{\theta}(\omega) = \lambda_\mu^{-1} c_{\tilde{\theta}}(\omega), \quad \lambda_\mu = \sum_{\omega' \in \Omega_\mu} c_{\tilde{\theta}}(\omega') \quad (19)$$

$$c_{\tilde{\theta}}(\omega) = \sum_{f,e} c(f, e) \sum_{a} p_{\tilde{\theta}}(a \mid f, e) \, c(\omega; a, f, e) \quad (20)$$

The parameter update $\hat{\theta}$ is based on expected counts, which are collected by averaging all possible alignments a with the posterior $p_{\tilde{\theta}}(a \mid f, e)$ of the current value $\tilde{\theta}$!

Training Model 2*

For Model 2, $\Omega = \{(i, j, l, m)\} \cup \{(e, f)\}$, which is partitioned as follows:

$$\Omega_{j,l,m} = \{(i \mid j, l, m) : 0 \leq i \leq l\}, \quad 0 \leq j, m \leq M, \ 0 \leq l \leq L$$
$$\Omega_e = \{(f \mid e) : f \in F\}, \quad e \in E \quad (21)$$

$$c(i \mid j, l, m; a, f, e) = \delta(i, a_j) \quad (22)$$

$$c(f \mid e; a, f, e) = \sum_{j=1}^{m} \delta(e, e_{a_j}) \, \delta(f, f_j) \quad (23)$$

We can now directly derive the iterative re-estimation formulae from:

$$c_{\tilde{\theta}}(\omega; f, e) = \sum_{a} p_{\tilde{\theta}}(a \mid f, e) \, c(\omega; a, f, e) \quad (24)$$

Problem: the above formula requires summing over $(l+1)^m$ alignments!

Training Model 2: Useful Formulas*

Model 2 permits to efficiently calculate the sum over alignments:

$$p_{\tilde{\theta}}(f \mid e) = \sum_{a} p_{\tilde{\theta}}(f, a \mid e) = p(m \mid l) \sum_{a_1=0}^{l} \dots \sum_{a_m=0}^{l} \prod_{j=1}^{m} p(f_j \mid e_{a_j}) \, p(a_j \mid j, l, m)$$

$$= p(m \mid l) \prod_{j=1}^{m} \sum_{i=0}^{l} p(f_j \mid e_i) \, p(i \mid j, l, m) \quad (25)$$

Proof. Let m = 3 and l = 1, and let $x_{j a_j} \equiv p(f_j \mid e_{a_j}) \, p(a_j \mid j, l, m)$. It is routine to verify that: $x_{10} x_{20} x_{30} + \dots + x_{11} x_{21} x_{30} + x_{11} x_{21} x_{31} = (x_{10} + x_{11})(x_{20} + x_{21})(x_{30} + x_{31})$.

Hence we can write:

$$p_{\tilde{\theta}}(a \mid f, e) = \frac{p_{\tilde{\theta}}(f, a \mid e)}{\sum_{a'} p_{\tilde{\theta}}(f, a' \mid e)} = \frac{\prod_{j=1}^{m} p(f_j \mid e_{a_j}) \, p(a_j \mid j, l, m)}{\prod_{j=1}^{m} \sum_{i=0}^{l} p(f_j \mid e_i) \, p(i \mid j, l, m)} \equiv \prod_{j=1}^{m} p_{\tilde{\theta}}(a_j \mid j, f, e) \quad (26)$$

Important: we need only 2 · m · (l + 1) operations!


Training Model 2*

$$c_{\tilde{\theta}}(f \mid e; f, e) = \sum_{a} p_{\tilde{\theta}}(a \mid f, e) \, c(f \mid e; a, f, e)$$

$$= \sum_{a_1=0}^{l} \dots \sum_{a_m=0}^{l} \left( \prod_{j=1}^{m} p_{\tilde{\theta}}(a_j \mid j, f, e) \right) \sum_{k=1}^{m} \delta(e, e_{a_k}) \, \delta(f, f_k)$$

$$= \sum_{k=1}^{m} \sum_{a_1=0}^{l} \dots \sum_{a_m=0}^{l} \prod_{j=1}^{m} p_{\tilde{\theta}}(a_j \mid j, f, e) \, \delta(e, e_{a_k}) \, \delta(f, f_k)$$

$$= \sum_{k=1}^{m} \sum_{a_k=0}^{l} p_{\tilde{\theta}}(a_k \mid k, f, e) \, \delta(e, e_{a_k}) \, \delta(f, f_k) \quad (27)$$

$$= \sum_{k=1}^{m} \sum_{i=0}^{l} \frac{p(f_k \mid e_i) \, p(i \mid k, l, m)}{\sum_{a=0}^{l} p(f_k \mid e_a) \, p(a \mid k, l, m)} \, \delta(e, e_i) \, \delta(f, f_k) \quad (28)$$

Training Model 2*

$$c_{\tilde{\theta}}(i \mid j, l, m; f, e) = \sum_{a} p_{\tilde{\theta}}(a \mid f, e) \, c(i \mid j, l, m; a, f, e)$$

$$= \sum_{a_1=0}^{l} \dots \sum_{a_m=0}^{l} \left( \prod_{k=1}^{m} p_{\tilde{\theta}}(a_k \mid k, f, e) \right) \delta(i, a_j)$$

$$= \sum_{a_j=0}^{l} p_{\tilde{\theta}}(a_j \mid j, f, e) \, \delta(i, a_j) \quad (29)$$

$$= p_{\tilde{\theta}}(i \mid j, f, e) = \frac{p(f_j \mid e_i) \, p(i \mid j, l, m)}{\sum_{i'=0}^{l} p(f_j \mid e_{i'}) \, p(i' \mid j, l, m)} \quad (30)$$

Model 2: Training Algorithm

    EM-Model2(F, E, S)
        Init-Params(P, Q)               // P[f,e] = uniform; Q[i,j,l,m] = uniform
        do
            Reset-Expected-Counts       // p[] = 0; ptot[] = 0; q[] = 0; qtot[] = 0
            for s := 1 to S             // loop over training data
                Expected-Counts(F[s], Length(F[s]), E[s], Length(E[s]))
            for m := 1 to M             // max source length
                for l := 1 to L         // max target length
                    for j := 1 to m
                        for i := 0 to l // new current parameters
                            Q[i,j,l,m] := q[i,j,l,m] / qtot[j,l,m]
            for f in F
                for e in E              // new current parameters
                    P[f,e] := p[f,e] / ptot[e]
        until convergence

Model 2: Training Algorithm

    Expected-Counts(F, m, E, l)
        // update counters p[], q[], ptot[], qtot[] using current parameters P[], Q[]
        for j := 1 to m
            t := 0
            for i := 0 to l
                f := F[j]; e := E[i]
                t := t + P[f,e] * Q[i,j,l,m]
            for i := 0 to l
                f := F[j]; e := E[i]
                c := P[f,e] * Q[i,j,l,m] / t
                q[i,j,l,m] := q[i,j,l,m] + c
                qtot[j,l,m] := qtot[j,l,m] + c
                p[f,e] := p[f,e] + c
                ptot[e] := ptot[e] + c