A Systematic Comparison of Various Statistical Alignment Models
Och and Ney, ACL 2003
Presented by: Ankit Ramteke, Ankur Aher, Piyush Dungarwal, Mandar Joshi



  • A Systematic Comparison of Various Statistical Alignment Models

    Och and Ney, ACL 2003

    Presented by:

    Ankit Ramteke, Ankur Aher, Piyush Dungarwal, Mandar Joshi

  • PART I

    Ankit Ramteke, Piyush Dungarwal

  • Introduction

  • Noisy channel model

  • Calculating Probability

    ● Alignment probability: Pr(a|f, e) = Pr(f, a|e) / Pr(f|e)

    ● Two birds, one stone:

    • Compute Pr(f, a|e) using a generative story

    • Compute Pr(f|e) = Σa Pr(f, a|e)

  • Motivation

    ● Possible alignments: [diagrams of the possible alignments between source words b, c and target words x, y]

  • Outline of the Paper

    ● Review of various statistical alignment models and heuristic models, and training of the alignment models

    ● IBM Models 1-5, HMM model, Model 6 (Model 4 + HMM)

    ● Heuristic-based models

    ● Some heuristic methods for improving alignment quality

    ● Evaluation methodology for word alignment methods

    ● Systematic comparison of the various statistical alignment models

  • Topics for Today

    ● Introduction

    ● IBM Model 1

    ● IBM Model 2

    ● IBM Model 3

  • IBM Model 1

  • Description

  • Derivation

  • Learning Parameters

    EM Algorithm consists of two steps:

    • Expectation-Step: Apply model to the data

    • Parts of the model are hidden (here: alignments)

    • Using the model, assign probabilities to possible values

    • Maximization-Step: Estimate model from data

    • Take assigned values as fact

    • Collect counts (weighted by probabilities)

    • Estimate model from counts

    • Iterate these steps until convergence
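    The E/M loop above can be sketched in a few lines of Python. This is an illustrative toy implementation of Model 1 EM in the style of Koehn's example; the function name and mini-corpus are our own, not code from the paper:

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """EM training of IBM Model 1 lexical translation probabilities t(f|e).

    corpus: list of (f_words, e_words) sentence pairs; a NULL token is
    prepended to every English sentence."""
    corpus = [(f, ["NULL"] + e) for f, e in corpus]
    f_vocab = {w for fs, _ in corpus for w in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        # E-step: distribute each f-word's count over all e-words,
        # weighted by the current model
        for fs, es in corpus:
            for f in fs:
                norm = sum(t[f, e] for e in es)
                for e in es:
                    frac = t[f, e] / norm
                    count[f, e] += frac
                    total[e] += frac
        # M-step: re-estimate t(f|e) from the expected counts
        for f, e in count:
            t[f, e] = count[f, e] / total[e]
    return t

corpus = [("das Haus".split(), "the house".split()),
          ("das Buch".split(), "the book".split()),
          ("ein Buch".split(), "a book".split())]
t = train_model1(corpus)
# after a few iterations, t("Haus"|"house") pulls ahead of t("das"|"house")
```

    Because "das" also co-occurs with "the", the expected counts gradually disambiguate it away from "house", which is exactly the behaviour the worked example on the following slides illustrates.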

  • Learning Parameters

    Example taken from SMT by Koehn


  • EM Algorithm

  • IBM Model 2

  • Motivation

    ● Why model 2 when we have 1?

    राम पाठशाला गया

    Ram went to school

    राम पाठशाला गया

    school Ram to went

  • IBM Model 2

    ● Focuses on absolute alignment

    ● Assumptions

    1. Pr(m|e) is independent of e and m

    • New parameter ϵ = Pr(m|e)

    2. Distribution over l+1 positions (null included) is no longer uniform

    • Alignment probability is Pr(aj|j, m, l)

    3. Pr(fj|a1^j, f1^(j-1), m, e) depends only on fj and e_aj

    • Translation probability: t(fj|e_aj) = Pr(fj|a1^j, f1^(j-1), m, e)

    – Number of new parameters: m (aj for j = 1 to m)

    – Training
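    Under these assumptions the Model 2 likelihood factorizes. A reconstruction consistent with Brown et al. (1993); Model 1 is recovered when a(aj|j, m, l) is the uniform 1/(l+1):

```latex
\Pr(f, a \mid e) \;=\; \epsilon \prod_{j=1}^{m} a(a_j \mid j, m, l)\; t(f_j \mid e_{a_j})
```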

  • IBM Model 3

  • Why model 3

    ● Possible alignments: [diagrams of the possible alignments between source words b, c and target words x, y]

  • Terminologies

    ● Adds fertility model

    ● Fertility probability

    – E.g. n(2|house) = probability of generating 2 words for the word ‘house’

    ● Translation probability: same as Model 1

    – E.g. t(maison|house) = probability of ‘maison’ being the translation of ‘house’

    ● Distortion probability

    – E.g. d(5|2) = probability that the word at position 2 goes to position 5
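    Putting the three probabilities together gives the simplified Model 3 likelihood below. This is a sketch following Koehn's presentation: the NULL-insertion terms and the φi! combinatorial factors are omitted for readability:

```latex
\Pr(f, a \mid e) \;\approx\; \prod_{i=1}^{l} n(\phi_i \mid e_i)\; \prod_{j=1}^{m} t(f_j \mid e_{a_j})\; d(j \mid a_j, l, m)
```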

  • Derivation from Noisy Channel

  • Example

    This city is famous for its flora.

    This city is famous for its flora flora          (fertility step)

    This city is famous NULL for its flora flora     (NULL insertion step)

    यह शहर है मशहूर के लिए अपने पेड़ पौधे               (lexical translation step)

    यह शहर अपने पेड़ पौधे के लिए मशहूर है               (distortion step)

  • Modifications Required

  • Final Probability

  • Generative Process

  • Generative Process

  • Training

    • Training flow diagram from Kevin Knight's SMT tutorial

  • Deficiency

    • Distortion probabilities do not depend on the earlier words

    • Model 3 wastes some of its probability mass on generalized strings

    • Strings that have several words at some positions and none at others

    • A model that does not concentrate all of its probability on events of interest is said to be deficient.

  • Example

    राम पाठशाला गया

    Ram went school

    to

  • Comparison of Statistical Models

    Model     Alignment    Fertility   E-step        Deficient
    Model 1   Uniform      No          Exact         No
    Model 2   Zero-order   No          Exact         No
    Model 3   Zero-order   Yes         Approximate   Yes

  • Heuristic Models

  • PART II

    Ankur Aher, Mandar Joshi

  • Outline

    Recap: IBM Models 1, 2 & 3

    HMM-based alignment approach

    IBM Model 4

  • MT Phenomenon

    Word-to-word translation

    Fertility

    Re-ordering

    राम पाठशाला गया

    Ram school went

    Ram school went to

    Ram went to school

  • SMT

  • Noisy channel model

    argmax_e Pr(e|f) = argmax_e Pr(e) · Pr(f|e)

    Pr(e): language model; Pr(f|e): translation model

    Pr(f|e) = Σa Pr(f, a|e)   (alignment model)

    Alignment example: राम पाठशाला गया

    Ram went to school

    Positional vector: a = {1-1, 2-3, 3-0, 4-2}

  • Translation model

  • Model 1

  • Model 1

  • Model 2

    ● Why model 2 when we have 1?

    राम पाठशाला गया

    Ram went to school

    राम पाठशाला गया

    school Ram to went

  • Model 2

  • Model 2

  • Model 3

    ● Possible alignments: [diagrams of the possible alignments between source words b, c and target words x, y]

  • Terminologies

  • Terminologies

  • Example

    Example taken from SMT by Koehn

  • Model 3

  • Deficiency

    राम पाठशाला गया

    Ram went school

    to

  • Comparison of Statistical Models

    Model     Alignment    Fertility   E-step        Deficient
    Model 1   Uniform      No          Exact         No
    Model 2   Zero-order   No          Exact         No
    Model 3   Zero-order   Yes         Approximate   Yes

  • Hidden Markov Alignment Model

  • Motivation

    In the translation process, large phrases tend to move together.

    Words that are adjacent in the source language tend to be next to each other in the target language.

    Strong localization effect is observed in alignment.

  • Motivation

    Hindi-English alignment example: "Indian team won cricket world cup three times" ↔ भारतीय टीम क्रिकेट विश्व कप तीन बार जीती [alignment grid marking the aligned word pairs along a near-diagonal, localized pattern]

  • What is Hidden?

  • Capturing Locality

    The HMM captures the locality of the English sentence.

  • Homogeneous HMM

    To make the alignment parameters independent of absolute word positions, we assume that the alignment probabilities p(i | i′, m) depend only on the jump width (i − i′).
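    With this assumption the alignment distribution can be written, following Vogel et al. (1996), as a normalized function s(·) of the jump width alone:

```latex
p(i \mid i', m) \;=\; \frac{s(i - i')}{\sum_{i''=1}^{m} s(i'' - i')}
```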

  • Comparison of Statistical Models

    Model     Alignment     Fertility   E-step        Deficient
    Model 1   Uniform       No          Exact         No
    Model 2   Zero-order    No          Exact         No
    Model 3   Zero-order    Yes         Approximate   Yes
    HMM       First-order   No          Exact         No

  • IBM Model 4

  • Motivation

    Large phrases tend to move together.

    This intuition is captured through relative distortion.

    The placement of the translation of an input word is typically based on the placement of the translation of the preceding input word.

  • Example

    The Indian team played well for 90 minutes.

    भारतीय टीम ९० मिनट अच्छी खेली.

  • Terminology

    Cept

    – Each input word fj that is aligned to at least one output word forms a cept.

    हिंदी की किताब अलमारी में थी.

    The Hindi book was in the cupboard.

    The underlined words are cepts

  • Terminology

    We define the operator [i] to map the cept with index i back to its corresponding foreign input word position.

  • Center of a cept

    Center of a cept is defined as the ceiling of the average of the output word positions for that cept.

    We use ⃝i to indicate centre of cept i.

  • Table 1

    cept πi                 π1     π2     π3        π4       π5
    foreign position [i]    1      3      4         5        6
    foreign word f[i]       हिंदी   किताब  अलमारी     में       थी
    English words {ej}      Hindi  book   cupboard  in, the  was
    English positions {j}   2      3      7         5, 6     4
    center of cept ⃝i       2      3      7         6        4

  • Relative Distortion

    Three cases:

    – Words that are generated by the NULL token: these are uniformly distributed

    – The first word in a cept

    – Subsequent words in a cept

  • The first word in a cept

    For the likelihood of placements of the first word of a cept, we use the distribution

    d1(j − ⃝i−1)

    Example: see Table 2

  • Subsequent words in a cept

    The distribution

    d>1( j − πi,k−1)

    Example: See table 2

  • Table 2

    j              1    2       3       4       5       6        7
    ej             The  Hindi   book    was     in      the      cupboard
    in cept πi,k   π0,0 π1,0    π2,0    π5,0    π4,0    π4,1     π3,0
    ⃝i−1           -    0       2       6       7       -        3
    j − ⃝i−1       -    +2      +1      -2      -2      -        +4
    distortion     1    d1(+2)  d1(+1)  d1(-2)  d1(-2)  d>1(+1)  d1(+4)
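    Tables 1 and 2 can be reproduced mechanically. Below is a sketch with an illustrative helper of our own, assuming the alignment is given as a map from English position j to foreign position (0 = NULL), for the sentence pair above:

```python
import math

def cept_distortions(alignment, num_foreign):
    """Compute cept centers and the relative-distortion offsets used by
    Model 4's d1 (first word of a cept) and d>1 (subsequent words).

    alignment: dict English position j -> foreign position (0 = NULL)."""
    # group English positions by the foreign word (cept) they align to
    cepts = {i: sorted(j for j, a in alignment.items() if a == i)
             for i in range(1, num_foreign + 1)}
    cepts = {i: js for i, js in cepts.items() if js}
    # center = ceiling of the average English position of the cept
    center = {i: math.ceil(sum(js) / len(js)) for i, js in cepts.items()}

    offsets = {}
    prev_center = 0  # convention: 0 before the first cept
    for i in sorted(cepts):          # cepts in foreign word order
        js = cepts[i]
        offsets[js[0]] = ("d1", js[0] - prev_center)
        for k in range(1, len(js)):  # subsequent words: relative to previous
            offsets[js[k]] = ("d>1", js[k] - js[k - 1])
        prev_center = center[i]
    return center, offsets

# The(1) Hindi(2) book(3) was(4) in(5) the(6) cupboard(7)
# aligned to हिंदी(1) की(2) किताब(3) अलमारी(4) में(5) थी(6); "The" -> NULL
alignment = {1: 0, 2: 1, 3: 3, 4: 6, 5: 5, 6: 5, 7: 4}
center, offsets = cept_distortions(alignment, 6)
# center reproduces the last row of Table 1; offsets the j - (center) row of Table 2
```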

  • Model 4

  • Alignment Model

    Fertility Model E-Step Deficient

    Model 1 Uniform No Exact No

    Model 2 Zero order No Exact No

    Model 3 Zero order Yes Approximate Yes

    HMM First-order No Exact No

    Model 4 First-order Yes Approximate Yes

    Comparison of Statistical Models

  • Word Classes

    Smarter way to look at re-ordering

    Some words tend to get reordered while some do not.

    Examples

    – Good man → भला आदमी

    – external affairs → affaires extérieures

  • Word Classes

    Conditioning on words will not work due to sparsity.

    Instead, create word classes by dividing the vocabulary, e.g. by POS tags.

    More on this later

  • PART III

    Ankit Ramteke

    Ankur Aher

  • Outline

    Topics covered

    – IBM Models 1, 2, 3, 4

    – HMM

    Today’s topics

    – Heuristic models

    – IBM Model 5

    – Model 6

    – Comparison of various methods based on

    • Size of training corpus

    • EM training methodology

  • Heuristic Models

  • Dice

  • Example

    Parallel corpus

    Those boys play cricket.        वह लड़के क्रिकेट खेलते हैं.
    That boy plays cricket.         वह लड़का क्रिकेट खेलता है.
    I saw that house.               मैंने वह घर देखा.
    I slept in that house.          मैं उस घर में सोया.
    I played in that house.         मैं उस घर में खेला.
    I liked that house.             मुझे वह घर भाया.
    That boy went in the house.     वह लड़का घर में गया.

    Example: That is my house. वह मेरा घर है.

    dice(that, वह) = (2*4)/(6*5) = 0.266

    dice(that, घर) = (2*5)/(6*5) = 0.333   ← indirect association

    dice(house, घर) = (2*5)/(5*5) = 0.4
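    These scores can be reproduced with a short sketch using sentence-level co-occurrence counts; the function name and lowercasing are our own:

```python
from collections import Counter
from itertools import product

def dice_scores(parallel_corpus):
    """Sentence-level Dice association scores, as used on the slide:
    dice(e, f) = 2 * C(e, f) / (C(e) * C(f))."""
    c_e, c_f, c_ef = Counter(), Counter(), Counter()
    for e_sent, f_sent in parallel_corpus:
        e_set, f_set = set(e_sent), set(f_sent)
        c_e.update(e_set)           # C(e): sentences containing e
        c_f.update(f_set)           # C(f): sentences containing f
        c_ef.update(product(e_set, f_set))  # C(e, f): joint occurrences
    return {(e, f): 2 * c_ef[e, f] / (c_e[e] * c_f[f]) for (e, f) in c_ef}

corpus = [
    ("those boys play cricket",    "वह लड़के क्रिकेट खेलते हैं"),
    ("that boy plays cricket",     "वह लड़का क्रिकेट खेलता है"),
    ("i saw that house",           "मैंने वह घर देखा"),
    ("i slept in that house",      "मैं उस घर में सोया"),
    ("i played in that house",     "मैं उस घर में खेला"),
    ("i liked that house",         "मुझे वह घर भाया"),
    ("that boy went in the house", "वह लड़का घर में गया"),
]
scores = dice_scores([(e.split(), f.split()) for e, f in corpus])
# the indirect association dice(that, घर) exceeds dice(that, वह)
```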

  • Competitive linking algorithm

    • Iteratively align the highest-ranking word position (i, j)

    • Advantage: indirect associations occur less often

    • House → घर

    • The Dice coefficient results in worse alignment quality than the statistical models.

             वह     मेरा   घर     है
    That     0.26   0.0   0.33   0.16
    is       0.0    0.0   0.0    0.0
    my       0.0    0.0   0.0    0.0
    house    0.24   0.0   0.4    0.0

  • Competitive linking algorithm

    • Iteratively align the highest-ranking word position (i, j)

    • Advantage: indirect associations occur less often

    • That → वह

    • The Dice coefficient results in worse alignment quality than the statistical models.

             वह     मेरा   घर     है
    That     0.26   0.0   0.33   0.16
    is       0.0    0.0   0.0    0.0
    my       0.0    0.0   0.0    0.0
    house    0.24   0.0   0.4    0.0

  • Competitive linking algorithm

    • Iteratively align the highest-ranking word position (i, j)

    • Advantage: indirect associations occur less often

    • my → मेरा, is → है

    • The Dice coefficient results in worse alignment quality than the statistical models.

             वह     मेरा   घर     है
    That     0.26   0.0   0.33   0.16
    is       0.0    0.0   0.0    0.0
    my       0.0    0.0   0.0    0.0
    house    0.24   0.0   0.4    0.0
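    The greedy procedure on these slides can be sketched as follows; the score table is the nonzero entries of the Dice matrix shown above, and the function name is illustrative:

```python
def competitive_linking(scores):
    """Greedy competitive linking: repeatedly take the highest-scoring
    (e, f) pair whose words are both still unaligned."""
    links = []
    used_e, used_f = set(), set()
    for (e, f), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s > 0 and e not in used_e and f not in used_f:
            links.append((e, f))
            used_e.add(e)
            used_f.add(f)
    return links

# nonzero Dice scores from the slide's table
scores = {("That", "वह"): 0.26, ("That", "घर"): 0.33, ("That", "है"): 0.16,
          ("house", "वह"): 0.24, ("house", "घर"): 0.4}
links = competitive_linking(scores)
# house-घर (0.4) is linked first, which blocks the indirect association
# That-घर (0.33); "That" then links to वह (0.26)
```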

  • Distortion Probabilities

    where vj is the number of vacancies in the English output interval [1, j]

  • NULL: हिंदी की किताब अलमारी में थी.

    The Hindi book was in the cupboard.

    Cept vacancies and parameters of d1:

    f[i]      πi,k   v1  v2  v3  v4  v5  v6  v7 |  j   vj  vmax  v⃝i−1
    NULL      π0,1   1   2   3   4   5   6   7  |  1   -   -     -
    हिंदी      π1,1   -   1   2   3   4   5   6  |  2   1   6     0
    किताब     π2,1   -   -   1   2   3   4   5  |  3   1   5     0
    अलमारी    π3,1   -   -   -   1   2   3   4  |  7   4   4     0
    में        π4,1   -   -   -   1   2   -   -  |  5   2   2     3
              π4,2   -   -   -   1   -   2   -  |  6   2   -     -
    थी        π5,1   -   -   -   1   -   -   -  |  4   1   1     0

  • Model 5

  • Model 6

  • Motivation

    • HMM makes use of locality in the source language

    • Model 4 makes use of locality in the target language

    • How can we use both?

  • Formulation

    • Combine the HMM and Model 4 in a log-linear way (Och & Ney 2003)

    α is the interpolation parameter

  • General Formulation

    • A log-linear combination of several models pk(f, a|e), k = 1, …, K is given by
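    A reconstruction of the log-linear combination, following Och & Ney (2003); Model 6 is the special case with K = 2, combining Model 4 (with weight α) and the HMM:

```latex
p(f, a \mid e) \;=\; \frac{\prod_{k=1}^{K} p_k(f, a \mid e)^{\alpha_k}}
                          {\sum_{a', f'} \prod_{k=1}^{K} p_k(f', a' \mid e)^{\alpha_k}}
\qquad\text{e.g.}\qquad
p_6(f, a \mid e) \;\propto\; p_4(f, a \mid e)^{\alpha}\, p_{\mathrm{HMM}}(f, a \mid e)
```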

  • Comparison of Statistical Models

    Model     Alignment     Fertility   E-step        Deficient
    Model 1   Uniform       No          Exact         No
    Model 2   Zero-order    No          Exact         No
    HMM       First-order   No          Exact         No
    Model 3   Zero-order    Yes         Approximate   Yes
    Model 4   First-order   Yes         Approximate   Yes
    Model 5   First-order   Yes         Approximate   No
    Model 6   First-order   Yes         Approximate   Yes

  • Summary

    Model     Alignment     Description
    Model 1   Uniform       Uniform probability distribution 1/(l+1)
    Model 2   Zero-order    a(aj | j, m, l)
    HMM       First-order
    Model 3   Zero-order    d(j | aj, m, l)
    Model 4   First-order
    Model 5   First-order
    Model 6   First-order   Log-linear combination of HMM and Model 4

  • Corpus

  • Verbmobil task

    German-English speech translation task

                                          German    English
    Training Corpus       Sentences       34,446 (~34K)
                          Words           329,625   343,076
                          Vocabulary      5,936     3,505
                          Singletons      2,600     1,305
    Bilingual Dictionary  Entries         4,404
                          Words           4,758     5,543
    Test Corpus           Sentences       354
                          Words           3,233     3,109

  • Hansards task

    The French-English Hansards task consists of the debates in the Canadian parliament.

                                          French    English
    Training Corpus       Sentences       1,470K
                          Words           24.33M    22.16M
                          Vocabulary      100,269   78,332
                          Singletons      40,199    31,319
    Bilingual Dictionary  Entries         28,701
                          Words           28,702    30,186
    Test Corpus           Sentences       500
                          Words           8,749     7,946

  • Comparison based on various metrics

  • Based on training schemes (Verbmobil)

  • Based on training schemes (Hansards)

  • Observations - 1

    Refined models (4, 5, 6) perform better at all corpus sizes

    As the corpus size increases, the improvement is much more significant

    Statistical models perform better than heuristic models because a well-founded mathematical theory underlies their parameter estimation

  • Observations - 2

    The Hidden Markov alignment model achieves significantly better results than Model 2

    – Homogeneous first-order alignment model

    – Better represents the locality and monotonicity properties of natural languages

    Alignment quality depends heavily on the method used to bootstrap the model

    – Bootstrapping with the HMM performs better than bootstrapping with Model 2

  • EM training

    The E-step for Models 1, 2 and the HMM is exact

    The E-step for Models 3, 4, 5, 6 is approximate

    – A smaller subset of alignments is selected; three methods to select the subset:

  • Viterbi alignment (Brown et al. 1993)

  • Viterbi + Neighborhood(Al-Onaizan et al. 1999)

  • Pegged alignments (Brown et al. 1993)

  • Effect of more alignments (Verbmobil)

  • Effect of more alignments (Hansards)

  • Observations

    The effect of pegging depends strongly on the quality of the starting point used

    If Model 2 is used as the starting point:

    – A significant improvement with the neighborhood alignments and the pegged alignments

    If only the Viterbi alignment is used:

    – The results are significantly worse than when the neighborhood of the Viterbi alignment is used as well

    Using the HMM as the starting point has a much smaller effect

    Using more alignments in training is a way to avoid a poor local optimum

  • Computational times

    • Using the pegged alignments yields only a moderate improvement in performance, at a considerable increase in computational time

  • PART IV

    Piyush Dungarwal

    Mandar Joshi

  • Effect of Smoothing

    Using linear interpolation

    For Alignment probability:

    For Fertility probability:
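    A hedged sketch of the two interpolations (the equations themselves are figures in the deck). The alignment form follows Och & Ney (2003), interpolating with a uniform distribution over the I target positions; for fertility, n0(φ) denotes a word-independent pooled fertility distribution, an assumption of this sketch:

```latex
p'(a_j \mid a_{j-1}, I) \;=\; \alpha \cdot \frac{1}{I} \;+\; (1 - \alpha)\, p(a_j \mid a_{j-1}, I)
\qquad
n'(\phi \mid e) \;=\; \beta \cdot n_0(\phi) \;+\; (1 - \beta)\, n(\phi \mid e)
```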

  • Results (Verbmobil task)

  • Results (Hansard task)

  • Direction

  • Direction

  • Symmetrization

    Intersection

    Union

    Refined method

    – In the first step, the intersection A = A1 ∩ A2 is determined

    – The alignment A is then extended by adding alignments (i, j) occurring only in A1 or only in A2 if:

    • neither fj nor ei has an alignment in A, or

    • both of the following conditions hold:

    – (i, j) has a horizontal or vertical neighbour

    – A ∪ {(i, j)} does not contain alignments with both horizontal and vertical neighbours
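    The refined heuristic can be sketched as follows. This is a simplified reading of the neighbour condition (it is checked only for the candidate link itself, not for the whole extended set), with illustrative names and a toy example:

```python
def refined_symmetrization(a1, a2):
    """Refined combination of two directed alignments.
    a1, a2: sets of (i, j) links (i = English position, j = foreign
    position) from the F -> E and E -> F alignments respectively."""
    a = set(a1) & set(a2)                  # step 1: the intersection
    candidates = (set(a1) | set(a2)) - a   # links found in only one direction

    def neighbours(i, j, links):
        horiz = any((i, j + d) in links for d in (-1, 1))
        vert = any((i + d, j) in links for d in (-1, 1))
        return horiz, vert

    added = True
    while added:                           # extend A until nothing changes
        added = False
        for (i, j) in sorted(candidates - a):
            e_free = all(i2 != i for (i2, _) in a)   # ei unaligned in A
            f_free = all(j2 != j for (_, j2) in a)   # fj unaligned in A
            horiz, vert = neighbours(i, j, a)
            if (e_free and f_free) or ((horiz or vert) and not (horiz and vert)):
                a.add((i, j))
                added = True
    return a

f2e = {(1, 1), (2, 2), (2, 3)}   # F -> E alignment links
e2f = {(1, 1), (2, 2), (3, 3)}   # E -> F alignment links
refined = refined_symmetrization(f2e, e2f)
# starts from the intersection {(1,1), (2,2)} and grows it link by link
```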

  • [Alignment matrices over English words e1–e4 and foreign words f1–f4, showing the F → E alignments, the E → F alignments, and their Intersection, Union, and Refined-method combination]

  • Symmetrisation

  • Symmetrisation

  • Conclusion

    This paper compares heuristic and statistical alignment methods (IBM Models 1-5, HMM, Model 6)

    Statistical alignment models outperform the simple Dice coefficient

    Model 6 outperforms all other models

    Smoothing and symmetrisation have a significant effect on the alignment quality achieved by a particular model.

  • Conclusion

    Bilingual dictionary and word classes have minor effect on alignment quality

    In general, important ingredients of a good model seem to be a first-order dependence between word positions and a fertility model

  • References

    Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51, March 2003.

    Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 1993.

    Philipp Koehn. Statistical Machine Translation. Cambridge University Press, 2010. Chapter 4 (Word-Based Models).

    Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. COLING 1996, pages 836-841.