
Direct Translation Approaches: Statistical Machine Translation

Stephan Vogel, Alicia Tribble

Interactive Systems Lab
Carnegie Mellon University & University Karlsruhe

Speech-to-Speech Translation Workshop
ESSLLI 2002, Trento, Italy

16 July 2002

Overview

Translation Approaches
Statistical Machine Translation
Translating with Cascaded Transducers
Experiments on Nespole Data


Translation Approaches

Interlingua based
Transfer based
Direct
  Example based
  Statistical


Statistical Machine Translation

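Based on Bayes' decision rule:

\[
\hat{e} = \arg\max_{e} \, p(e \mid f) = \arg\max_{e} \, p(e) \, p(f \mid e)
\]

f: source sentence; e: target sentence; p(e): language model; p(f | e): translation model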
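The decision rule can be illustrated with a toy example. This is a minimal sketch: the candidate list and all probabilities are invented for illustration, and `best_translation` is a hypothetical helper, not part of any system described in these slides.

```python
import math

def best_translation(candidates, lm_prob, tm_prob):
    """Return the candidate e that maximises p(e) * p(f | e), in log space."""
    return max(candidates,
               key=lambda e: math.log(lm_prob[e]) + math.log(tm_prob[e]))

# Invented probabilities for one fixed source sentence f.
# lm_prob[e] stands for p(e), tm_prob[e] for p(f | e).
lm_prob = {
    "the house is small": 0.02,
    "small the house is": 0.0001,
    "the home is little": 0.004,
}
tm_prob = {
    "the house is small": 0.3,
    "small the house is": 0.3,
    "the home is little": 0.2,
}

print(best_translation(list(lm_prob), lm_prob, tm_prob))
```

The translation model alone cannot separate the first two candidates; the language model term decides between them.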


Tasks in SMT

Modelling: build statistical models which capture characteristic features of translation equivalences and of the target language

Training: train translation model on bilingual corpus, train language model on monolingual corpus

Decoding: find best translation for new sentences according to models


Alignment Example


Translation Models

IBM1: lexical probabilities only
IBM2: lexicon plus absolute position
HMM: lexicon plus relative position
IBM3: plus fertilities
IBM4: inverted relative position alignment
IBM5: non-deficient version of Model 4

[Brown et al. 93; Vogel et al. 96]


HMM Alignment Model

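\[
\begin{aligned}
p(f \mid e) &= \sum_{a} p(f_1^J, a_1^J \mid e_1^I) \\
&= \sum_{a} \prod_{j} p(f_j, a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I) \\
&= \sum_{a} \prod_{j} p(a_j \mid a_{j-1}) \, p(f_j \mid e_{a_j}) \\
&\approx \max_{a} \prod_{j} p(a_j \mid a_{j-1}) \, p(f_j \mid e_{a_j})
\end{aligned}
\]

Alignment $a_j$ of current word $f_j$ depends on alignment $a_{j-1}$ of previous word $f_{j-1}$.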
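The maximum approximation in the last line can be computed with the Viterbi algorithm. The sketch below makes several assumptions not stated on the slide: the transition probability depends only on the jump width a_j - a_{j-1}, the first alignment position is uniform, unseen events get a small floor probability, and both sentences are non-empty. The function name and the toy tables are hypothetical.

```python
import math

def viterbi_alignment(src, tgt, lex, jump, floor=1e-10):
    """Best alignment a_1..a_J under prod_j p(a_j | a_{j-1}) p(f_j | e_{a_j}).

    src: source words f_1..f_J, tgt: target words e_1..e_I (both non-empty).
    lex[(f, e)]: lexicon probability p(f | e).
    jump[d]: transition probability for jump width d = a_j - a_{j-1}.
    """
    I, J = len(tgt), len(src)

    def emit(j, i):
        return math.log(lex.get((src[j], tgt[i]), floor))

    def trans(i_prev, i):
        return math.log(jump.get(i - i_prev, floor))

    # delta[j][i]: best log score for f_1..f_j with a_j = i
    delta = [[-math.log(I) + emit(0, i) for i in range(I)]]
    back = []
    for j in range(1, J):
        row, ptr = [], []
        for i in range(I):
            best_prev = max(range(I), key=lambda k: delta[-1][k] + trans(k, i))
            row.append(delta[-1][best_prev] + trans(best_prev, i) + emit(j, i))
            ptr.append(best_prev)
        delta.append(row)
        back.append(ptr)

    # Trace back from the best final state.
    i = max(range(I), key=lambda k: delta[-1][k])
    path = [i]
    for ptr in reversed(back):
        i = ptr[i]
        path.append(i)
    return path[::-1]

# Invented toy model.
src = ["das", "Haus", "ist", "klein"]
tgt = ["the", "house", "is", "small"]
lex = {("das", "the"): 0.7, ("Haus", "house"): 0.8,
       ("ist", "is"): 0.8, ("klein", "small"): 0.7}
jump = {0: 0.2, 1: 0.6, -1: 0.1, 2: 0.1}
print(viterbi_alignment(src, tgt, lex, jump))
```

The returned list holds, for each source position j, the index of the aligned target word.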


Phrase Translation

Why?
  To capture context
  Local word reordering

How?
  Train alignment model
  Extract phrase-to-phrase translations from Viterbi path

Notes:
  Often better results when training target to source for extraction of phrase translations
  Phrases are not fully integrated into alignment model; they are extracted only after training is completed


Translation with Transducers

Transducer:
  Finite state machine
  Read sequence of words, write sequence of words
  Output vocabulary can be different from input vocabulary

Transducer used in current implementation:
  Tree transducer, i.e. prefix tree over input strings
  Output from final states
  Used to encode lexicon, phrase translations, bilingual word classes and grammars
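A prefix tree with outputs at final states can be sketched in a few lines. This is an illustrative data structure under my own naming (`TreeTransducer`, `add`, `match`), not the implementation used in the system; the entries are invented.

```python
class TreeTransducer:
    """Prefix tree over input word sequences; outputs are stored at final states."""

    def __init__(self):
        self.root = {"next": {}, "out": []}

    def add(self, source_words, target_words):
        """Insert one entry: a source word sequence and its translation."""
        node = self.root
        for w in source_words:
            node = node["next"].setdefault(w, {"next": {}, "out": []})
        node["out"].append(list(target_words))

    def match(self, words, start):
        """Yield (end, output) for every entry that matches words[start:end]."""
        node = self.root
        for end in range(start, len(words)):
            node = node["next"].get(words[end])
            if node is None:
                return
            for out in node["out"]:
                yield end + 1, out

# Invented lexicon and phrase entries.
t = TreeTransducer()
t.add(["guten"], ["good"])
t.add(["guten", "Morgen"], ["good", "morning"])
print(list(t.match(["guten", "Morgen", "Herr"], 0)))
```

Because entries share prefixes, a single left-to-right pass from a start position finds all single-word and phrase entries that begin there.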


Cascaded Transducers

Generalization through cascaded transducers:
  Replace words by category labels and have a transducer for each category

[Vogel, Ney 2000]
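The idea of a cascade can be shown with two small tables. This is a simplified sketch: the category `@DAY`, the patterns and the function name are invented, and a real cascade would use proper transducers rather than whole-sentence dictionary lookups.

```python
# Hypothetical category transducer: German day names and their translations.
DAY = {"Montag": "Monday", "Dienstag": "Tuesday"}

# Hypothetical top-level transducer over words and category labels.
PATTERNS = {
    ("am", "@DAY"): ["on", "@DAY"],
    ("bis", "@DAY"): ["until", "@DAY"],
}

def translate_cascade(words):
    # Stage 1: replace category members by their label, keep their translations.
    labelled, fillers = [], []
    for w in words:
        if w in DAY:
            labelled.append("@DAY")
            fillers.append(DAY[w])
        else:
            labelled.append(w)
    # Stage 2: apply the pattern transducer to the labelled sequence.
    output = PATTERNS.get(tuple(labelled))
    if output is None:
        return None
    # Stage 3: substitute the category translations back in, in order.
    fill = iter(fillers)
    return [next(fill) if tok == "@DAY" else tok for tok in output]

print(translate_cascade(["am", "Dienstag"]))
print(translate_cascade(["bis", "Montag"]))
```

Two patterns and two category entries cover four phrase pairs; adding a day name extends every pattern at once.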


Language Model

Standard n-gram model:

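\[
\begin{aligned}
p(w_1 \dots w_n) &= \prod_{i} p(w_i \mid w_1 \dots w_{i-1}) \\
&\approx \prod_{i} p(w_i \mid w_{i-2} \, w_{i-1}) && \text{(trigram)} \\
&\approx \prod_{i} p(w_i \mid w_{i-1}) && \text{(bigram)}
\end{aligned}
\]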

Many events not seen -> smoothing required
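A bigram model with the simplest form of smoothing can be sketched as follows. Add-one smoothing is my choice for brevity; the slides do not say which smoothing method was used, and the class name and training sentences are invented.

```python
import math
from collections import Counter

class BigramLM:
    """Bigram language model with add-one smoothing (illustrative only)."""

    def __init__(self, sentences):
        self.history, self.bigrams = Counter(), Counter()
        for s in sentences:
            toks = ["<s>"] + s.split() + ["</s>"]
            self.history.update(toks[:-1])            # counts of w_{i-1}
            self.bigrams.update(zip(toks, toks[1:]))  # counts of (w_{i-1}, w_i)
        # Possible next words: training words, end marker, unknown-word slot.
        self.vocab = {w for s in sentences for w in s.split()} | {"</s>", "<unk>"}

    def prob(self, prev, word):
        """p(word | prev); unseen bigrams get a small non-zero probability."""
        return (self.bigrams[(prev, word)] + 1) / (self.history[prev] + len(self.vocab))

    def logprob(self, sentence):
        toks = ["<s>"] + sentence.split() + ["</s>"]
        return sum(math.log(self.prob(p, w)) for p, w in zip(toks, toks[1:]))

lm = BigramLM(["the room is free", "the room is small"])
print(lm.prob("the", "room"))   # seen bigram
print(lm.prob("the", "hotel"))  # unseen bigram, still non-zero
```

Without the added counts, any sentence containing one unseen bigram would receive probability zero.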


Decoding Strategies

Sequential construction of target sentence
  Extend partial translation by words which are translations of words in the source sentence
  Language model can be applied immediately
  Mechanism to ensure proper coverage of source sentence required

Left to right over source sentence
  Find translations for sequences of words
  Construct translation lattice
  Apply language model and select best path
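The second strategy (build a lattice, then apply the language model) can be sketched as a dynamic programme over source positions. The edge format, the `lm_logprob` callback and all numbers below are assumptions made for illustration; the search state keeps the last target word so that a bigram model can be applied across edge boundaries.

```python
import math

def best_path(n, edges, lm_logprob):
    """Best path through a monotone translation lattice.

    n: number of source words (lattice nodes are 0..n).
    edges: list of (start, end, target_words, tm_logprob), start < end <= n.
    lm_logprob(prev_word, word): bigram log probability.
    Returns (score, target_words), or None if node n cannot be reached.
    """
    # chart[pos][last_word] = (best score, target words so far)
    chart = [dict() for _ in range(n + 1)]
    chart[0]["<s>"] = (0.0, [])
    for pos in range(n):
        for last, (score, words) in chart[pos].items():
            for start, end, target, tm in edges:
                if start != pos:
                    continue
                new_score, prev = score + tm, last
                for w in target:
                    new_score += lm_logprob(prev, w)
                    prev = w
                if prev not in chart[end] or new_score > chart[end][prev][0]:
                    chart[end][prev] = (new_score, words + list(target))
    finals = [(s + lm_logprob(last, "</s>"), w)
              for last, (s, w) in chart[n].items()]
    return max(finals, key=lambda x: x[0]) if finals else None

# Invented lattice for the source sentence "guten Morgen".
edges = [
    (0, 1, ["good"], math.log(0.9)),
    (1, 2, ["tomorrow"], math.log(0.6)),
    (1, 2, ["morning"], math.log(0.4)),
]
bigram = {("<s>", "good"): 0.5, ("good", "morning"): 0.6,
          ("good", "tomorrow"): 0.01,
          ("morning", "</s>"): 0.5, ("tomorrow", "</s>"): 0.5}
lm = lambda prev, w: math.log(bigram.get((prev, w), 1e-4))
print(best_path(2, edges, lm))
```

In this toy lattice the translation scores alone prefer "tomorrow"; the language model reverses that choice.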


Translation Graph


Speech Recognition and Translation

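Search for the best string in the target language given the acoustic signal x in the source language:

\[
\begin{aligned}
\hat{e} &= \arg\max_{e} \, p(e) \, p(x \mid e) \\
&= \arg\max_{e} \, p(e) \sum_{f} p(f, x \mid e) \\
&= \arg\max_{e} \, p(e) \sum_{f} p(f \mid e) \, p(x \mid f, e) \\
&= \arg\max_{e} \, p(e) \sum_{f} p(f \mid e) \, p(x \mid f)
\end{aligned}
\]

i.e. recognizer language model not needed!? [Ney, 2001]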


Coupling Recognition and Translation

Sequential: first recognition, then translation
  First-best recognition hypothesis
  N-best list: translate n times
  Word lattice: translate all paths in lattice, reuse results from partial paths

Integrated: recognition and translation in combined search
  Subsequential transducer approach uses this

Note: In the Eutrans project, best results were obtained when translating the first-best hypothesis


Example-Based Machine Translation

Re-use translations to create new translations:
  Store bilingual corpus with (partial) alignment
  Find partial matches, i.e. sequences of words in stored corpus to cover a new sentence
  Extract translation(s) and build translation lattice
  Apply language model to find best path, i.e. best translation


Nespole Experiments

Application of direct translation techniques to dialogue data collected in Nespole!
Testing the effect of phrase translation
Experiments with additional knowledge sources
  Preexisting: monolingual data for the LM and publicly available lexica
  Engineered: handwritten rules for fixed expressions and knowledge extracted from semantic grammars


Nespole Project Data

CMU database of dialogues in the travel domain
German, English (Italian, French)
Speech recognizer hypotheses and human transcriptions both available
Segmented into SDUs (Speech Dialogue Units)


Nespole Corpus: Training

Language      English   German
Tokens        15572     14992
Vocabulary    1032      1338
Singletons    404       620

3182 parallel SDUs


Nespole Corpus: Testing

              German         Reference A   Reference B
Tokens        437            610           607
Vocabulary    183 (45 OOV)   165           160

70 parallel SDUs


Corpus Challenges: Sentence Length

[Figure: two panels, "Testing Data" and "Training Data", each comparing English and German sentence length]


Evaluation

Human Scoring
  Good, Okay, Bad (cf. Nespole evaluation)
  Collapsed into a "human score" on [0,1]

Bleu Score
  Average of n-gram precisions for n = 1..N, typically N = 3 or 4
  Penalty for short translations to substitute for recall measure

[Papineni et al. 2001]
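The two ingredients of the Bleu score can be sketched as follows. This is a sentence-level simplification written for illustration: the cited report defines the metric over a whole test corpus and combines the clipped n-gram precisions by a geometric mean, which is what the sketch does. The function names are my own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, references, max_n=4):
    """Sentence-level Bleu sketch: clipped n-gram precisions plus brevity penalty."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, c in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[gram]) for gram, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference closest in length.
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the room is free", ["the room is free", "the room is available"]))
print(bleu("the room", ["the room is free"], max_n=2))
```

The second call has perfect n-gram precision but is only half as long as the reference, so the brevity penalty lowers its score; this is how the metric compensates for the missing recall term.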


Phrase Translation

Unequal sentence lengths mean that training can be improved directionally: S→T or T→S
German compounds are better suited to one-to-many alignments with English multiword phrases, so direction is important

Statistical lexicon alone                            0.1903
Statistical lexicon, phrases from S→T training       0.2350
Statistical lexicon, phrases from bidir. training    0.2654


Language Model

Monolingual text available from Verbmobil
  500,000 words (32x the size of the original English corpus)
Helps to choose among translation hypotheses but will not generate new ones

Stat. lexicon, phrases, fixed expression rules, gen. lexicon, and small LM    0.2613
Stat. lexicon, phrases, fixed expression rules, gen. lexicon, and large LM    0.3172


General-Purpose Lexicon

Statistical lexicon, phrases, and fixed expressions with small LM          0.2654
Adding general-purpose lexicon as a transducer                             0.2522
Using large instead of small LM                                            0.3141
General-purpose lexicon as training data instead of separate transducer    0.3275


Fixed Expression Rules

Transducer rules are human readable and can be added by hand
Fixed expressions for times and dates are re-usable, require less time to build than domain-specific rules, and improve coverage of some semi-idiomatic constructions

Statistical lexicon with small LM                                    0.1893
Statistical lexicon and fixed-expression transducer with small LM    0.1903


Knowledge from Existing Grammars

Could help in domain portability but not in language portability
Benefit mostly in additional vocabulary

Statistical lexicon, fixed expressions, phrases, and general lexicon with large LM                 0.3141
Statistical lexicon, fixed expressions, phrases, general lexicon and I-transducer with large LM    0.3172


Comparative Evaluation Results

Input    System   Good   Okay   Bad   Score   Bleu
Text     IF       77     104    227   0.32    0.068
Text     SMT      127    80     205   0.40    0.333
Speech   IF       64     101    243   0.28    0.059
Speech   SMT      95     83     227   0.34    0.262


Selected References

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), pp. 263-311, 1993.

Stephan Vogel, Hermann Ney, Christoph Tillmann. HMM-Based Word Alignment in Statistical Translation. Int. Conf. on Computational Linguistics, Copenhagen, Denmark, pp. 836-841, August 1996.

Stephan Vogel, Hermann Ney. Translation with Cascaded Finite State Transducers. 38th Annual Meeting of the Association for Computational Linguistics, pp. 23-30, Hong Kong, China, October 2000.

Stephan Vogel, Alicia Tribble. Improving statistical machine translation for a speech-to-speech translation task. To appear in ICSLP 2002.

H. Ney. The Statistical Approach to Spoken Language Translation. Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Madonna di Campiglio, Trento, Italy, 8 pages, CD ROM, IEEE Catalog No. 01EX544, December 2001.

Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. Bleu: a Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022), September 17, 2001.
