Edinburgh MT lecture 3: Latent Variable Models and Word Alignment

Preview:

Citation preview

• Machine learning • Databases • Algorithms and systems • Statistics and optimization

Four year PhD programme Courses + PhD dissertation (No previous MSc required)

http://datascience.inf.ed.ac.uk/

Discovering new patterns and knowledge from data

• Big data • Natural language processing • Computer vision • Speech processing

at the University of Edinburgh

The biggest revolution in the technological landscape for fifty years

Now accepting applications! Find out more and apply at:

pervasiveparallelism.inf.ed.ac.uk

• • 4-year programme: 4-year programme: MSc by Research + PhDMSc by Research + PhD

• Collaboration between: ▶ University of Edinburgh’s School of Informatics ✴ Ranked top in the UK by 2014 REF

▶ Edinburgh Parallel Computing Centre ✴ UK’s largest supercomputing centre

• Full funding available

• Industrial engagement programme includes internships at leading companies

• Research-focused: Work on your thesis topic from the start

• Research topics in software, hardware, theory and

application of: ▶ Parallelism ▶ Concurrency ▶ Distribution

Latent Variable Models and Word Alignment

INFR11062 spring 2015lecture 3

Goal

•Write down a model over sentence pairs.

•Learn an instance of the model from data.

•Use it to predict translations of new sentences.

Goal

•Write down a model over sentence pairs.

•Learn an instance of the model from data.

•Use it to predict translations of new sentences.

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

However , the sky remained clear under the strong north wind .

IBM Model 1

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

However , the sky remained clear under the strong north wind .

IBM Model 1

p(f ,a|e)Let’s write a simple model in terms of word-to-word alignments:

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

p(English length|Chinese length)

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

p(English length|Chinese length)

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

p(Chinese word position)

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

However

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

However

p(English word|Chinese word)

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

However

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

However

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

However ,

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

However ,

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

However , the

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 1

However , the sky remained clear under the strong north wind .

IBM Model 1

French, Englishsentence lengths

alignment of French word at position i

French, Englishword pair

p(f ,a|e) = p(I|J)IY

i=1

p(ai|J) · p(fi|eai)

IBM Model 1

p(despite| )虽然

p(however| )

p(although| )

p(northern| )

p(north| )北

虽然

虽然

IBM Model 1

p(despite| )虽然

p(however| )

p(although| )

p(northern| )

p(north| )北

虽然

虽然

???

???

???

???

???

IBM Model 1

p(despite| )虽然

p(however| )

p(although| )

p(northern| )

p(north| )北

虽然

虽然

???

???

???

???

???

✓{

• Optimization

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

However , the sky remained clear under the strong north wind .

MLE for IBM Model 1 (observed)

MLE for IBM Model 1 (observed)

ˆ✓ = argmax

✓p(f ,a|e)

MLE for IBM Model 1 (observed)

ˆ✓ = arg max

NY

n=1

0

@p(I(n)|J (n))

I(n)Y

i=1

p(a(n)i |J (n)

) · p(f (n)i |e(n)

ai)

1

A

MLE for IBM Model 1 (observed)

ˆ✓ = arg max

NY

n=1

0

@p(I(n)|J (n))

I(n)Y

i=1

p(a(n)i |J (n)

) · p(f (n)i |e(n)

ai)

1

A

number ofsentences

French, Englishsentence lengths

alignment of French word at position i

French, Englishword pair

MLE for IBM Model 1 (observed)

ˆ✓ = arg max

NY

n=1

0

@p(I(n)|J (n))

I(n)Y

i=1

p(a(n)i |J (n)

) · p(f (n)i |e(n)

ai)

1

A

constant (w.r.t. words)!{

MLE for IBM Model 1 (observed)

ˆ✓ = arg max

✓C

NY

n=1

I(n)Y

i=1

p(f (n)i |e(n)

ai)

MLE for IBM Model 1 (observed)

ˆ✓ = arg max

✓log

0

@CNY

n=1

I(n)Y

i=1

p(f (n)i |e(n)

ai)

1

A

log(a) < log(b) () a < b

MLE for IBM Model 1 (observed)

ˆ✓ = arg max

log

0

@C ·Y

f,e

p(f |e)count(hf,ei)

1

A

MLE for IBM Model 1 (observed)

ˆ

✓ = arg max

✓log C +

X

f,e

count(hf, ei) log p(f |e)

log of product = sum of logs

MLE for IBM Model 1 (observed)

⇤(✓,�) = log C +

X

f,e

count(hf, ei) log p(f |e)

�X

e

�e

0

@X

f

p(f |e) � 1

1

A

Lagrange multiplier expresses normalization constraint{

MLE for IBM Model 1 (observed)

⇤(✓,�) = log C +

X

f,e

count(hf, ei) log p(f |e)

�X

e

�e

0

@X

f

p(f |e) � 1

1

A

@⇤(✓,�)@p(f |e) =

count(hf, ei)p(f |e) � �ederivative

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

However , the sky remained clear under the strong north wind .

MLE for IBM Model 1 (observed)

虽然# of times 虽然 aligns to any word

# of times 虽然 aligns to Howeverp(however| ) =

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

However , the sky remained clear under the strong north wind .

MLE for IBM Model 1 (unobserved)

虽然p(however| ) = ???

MLE for IBM Model 1 (observed)

ˆ✓ = arg max

✓log

0

@CNY

n=1

I(n)Y

i=1

p(f (n)i |e(n)

ai)

1

A

ˆ✓ = arg max

✓log

0

@CNY

n=1

X

a

I(n)Y

i=1

p(f (n)i |e(n)

ai)

1

A

MLE for IBM Model 1 (unobserved)

Data likelihood = marginal probability of observed data

p(f |e) =X

a

p(f ,a|e)

MLE for IBM Model 1 (unobserved)

ˆ✓ = arg max

log

0

@C ·Y

f,e

p(f |e)E[count(hf,ei)]

1

A

MLE for IBM Model 1 (unobserved)

ˆ✓ = arg max

log

0

@C ·Y

f,e

p(f |e)E[count(hf,ei)]

1

A

Not constant! Depends on parameters, no analytic solution.

MLE for IBM Model 1 (unobserved)

ˆ✓ = arg max

log

0

@C ·Y

f,e

p(f |e)E[count(hf,ei)]

1

A

Not constant! Depends on parameters, no analytic solution.

But it does strongly imply an iterative solution.

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

Likelihood Estimation for Model 1

However

p(English word|Chinese word)

, the sky remained clear under the strong north wind .

unobserved!

Parameters and alignments are both unknown.

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

Likelihood Estimation for Model 1

However

p(English word|Chinese word)

, the sky remained clear under the strong north wind .

unobserved!

If we knew the alignments, we could calculate the values of the parameters.

Parameters and alignments are both unknown.

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

Likelihood Estimation for Model 1

However

p(English word|Chinese word)

, the sky remained clear under the strong north wind .

unobserved!

If we knew the alignments, we could calculate the values of the parameters.

If we knew the parameters, we could calculate the likelihood of the data.

Parameters and alignments are both unknown.

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

Likelihood Estimation for Model 1

However

p(English word|Chinese word)

, the sky remained clear under the strong north wind .

unobserved!

If we knew the alignments, we could calculate the values of the parameters.

If we knew the parameters, we could calculate the likelihood of the data.

Parameters and alignments are both unknown.

The Plan: Bootstrapping

•Arbitrarily select a set of parameters (say, uniform).

•Calculate expected counts of the unseen events.

•Choose new parameters to maximize likelihood, using expected counts as proxy for observed counts.

•Iterate.

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

The Plan: Bootstrapping

However , the sky remained clear under the strong north wind .

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

The Plan: Bootstrapping

However , the sky remained clear under the strong north wind .

if we had observed the alignment, this line would

either be here (count 1) or it wouldn’t (count 0).

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

The Plan: Bootstrapping

However , the sky remained clear under the strong north wind .

if we had observed the alignment, this line would

either be here (count 1) or it wouldn’t (count 0).

since we didn’t observe the alignment, we calculate the probability that it’s there.

Remember: Probabilistic Primer136136136136136136

136136136136136136

136136136136136136

136136136136136136

136136136136136136

136136136136136136

p(Y = 1) =X

x2X

p(X = x, Y = 1) =16

marginal probability

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

However , the sky remained clear under the strong north wind .

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

However , the sky remained clear under the strong north wind .

p(

p(

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

However , the sky remained clear under the strong north wind .

p( )

) +

) +

Marginalize: sum all alignments containing the link

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

However , the sky remained clear under the strong north wind .

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

However , the sky remained clear under the strong north wind .

p(

p(

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

However , the sky remained clear under the strong north wind .

p(

) +

) +

)

Divide by sum of all possible alignments

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

However , the sky remained clear under the strong north wind .

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

However , the sky remained clear under the strong north wind .

p(

p(

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

However , the sky remained clear under the strong north wind .

p(

) +

) +

)

Divide by sum of all possible alignments

Is this hard? How many alignments are there?

Computing Expectations

probability of an alignment.

p(F,A|E) = p(I|J)Y

ai

p(ai = j)p(fi|ej)

Computing Expectations

probability of an alignment.

observed uniform

p(F,A|E) = p(I|J)Y

ai

p(ai = j)p(fi|ej)

factors across words.

Computing Expectations

probability of an alignment.

observed uniform

p(F,A|E) = p(I|J)Y

ai

p(ai = j)p(fi|ej)

Computing Expectations

北北

.�

a⇥A: �north

p(north| ) · p(rest of a)

marginal probability of alignments containing link

Computing Expectations

p(north| ).�

a⇥A: �north

p(rest of a)北北

marginal probability of alignments containing link

Computing Expectations

p(north| ).�

a⇥A: �north

p(rest of a)北北

marginal probability of alignments containing link

c⇥Chinese words

p(north|c).�

a⇥A: �north

p(rest of a)

marginal probability of all alignments

c

Computing Expectations

p(north| ).�

a⇥A: �north

p(rest of a)北北

marginal probability of alignments containing link

c⇥Chinese words

p(north|c).�

a⇥A: �north

p(rest of a)

marginal probability of all alignments

c

Computing Expectations

p(north| ).�

a⇥A: �north

p(rest of a)北北

marginal probability of alignments containing link

c⇥Chinese words

p(north|c).�

a⇥A: �north

p(rest of a)

marginal probability of all alignments

c

identical!

Computing Expectations

北p(north| ).�

c�Chinese words p(north|c)

Computing Expectations

北p(north| ).�

c�Chinese words p(north|c)

marginal probability (expected count) of an alignment containing the link

Computing Expectations

北p(north| ).�

c�Chinese words p(north|c)

marginal probability (expected count) of an alignment containing the link

For each sentence, use this quantity instead of 0 or 1

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

However , the sky remained clear under the strong north wind .

Maximum Likelihood

虽然# of times 虽然 aligns to any word

# of times 虽然 aligns to Howeverp(however| ) =

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

However , the sky remained clear under the strong north wind .

Expectation Maximization

虽然

Expected # of times 虽然 aligns to However

p(however| ) =# of times 虽然 aligns to any word

Expectation Maximization

•Arbitrarily select a set of parameters (say, uniform).

•Calculate expected counts of the unseen events.

•Choose new parameters to maximize likelihood, using expected counts as proxy for observed counts.

•Iterate. (Until when?)

Textbook example

Textbook example

Textbook example

Textbook example

Textbook example

Expectation Maximization

北p(north| ).�

c�Chinese words p(north|c)

Why does this even work?

Expectation Maximization

Observation 1: We are still solving a maximum likelihood estimation problem.

Expectation Maximization

Observation 1: We are still solving a maximum likelihood estimation problem.

p(Chinese|English) =X

alignments

p(Chinese, alignment|English)

Expectation Maximization

Observation 1: We are still solving a maximum likelihood estimation problem.

p(Chinese|English) =X

alignments

p(Chinese, alignment|English)

MLE: choose parameters that maximize this expression.

Expectation Maximization

Observation 1: We are still solving a maximum likelihood estimation problem.

p(Chinese|English) =X

alignments

p(Chinese, alignment|English)

MLE: choose parameters that maximize this expression.

Minor problem: there is no analytic solution.

(from Minka ’98)

0 1

... and, likelihood is convex for this model:

Summary

•Learning is optimization: choose parameters that optimize some function, such as likelihood.

•Supervised: maximum likelihood.

•Beware of overfitting.

•Unsupervised: expectation maximization.

•Many other objective functions and algorithms.

Recommended