Machine Translation: Approaches, Challenges and Future Alon Lavie Language Technologies Institute School of Computer Science Carnegie Mellon University ITEC Dinner May 21, 2009


Page 1: Machine Translation:  Approaches, Challenges and Future

Machine Translation: Approaches, Challenges and Future

Alon Lavie
Language Technologies Institute

School of Computer Science
Carnegie Mellon University

ITEC Dinner
May 21, 2009

Page 2: Machine Translation:  Approaches, Challenges and Future

May 21, 2009 ITEC Dinner 2

Machine Translation: History

• MT started in the 1940s, one of the first conceived applications of computers

• Promising “toy” demonstrations in the 1950’s, failed miserably to scale up to “real” systems

• ALPAC Report: MT recognized as an extremely difficult, “AI-complete” problem in the early 1960’s

• MT revival started in earnest in the 1980s (US, Japan)

• Field dominated by rule-based approaches, requiring 100s of person-years of manual development

• Economic incentive for developing MT systems for a small number of language pairs (mostly European languages)

• Major paradigm shift in MT over the past decade:
  – From manually developed rule-based systems
  – To data-driven, statistical search-based approaches

Page 3: Machine Translation:  Approaches, Challenges and Future


Machine Translation: Where are we today?

• Age of Internet and globalization – great demand for translation services and MT:
  – Multiple official languages of the UN, EU, Canada, etc.
  – Documentation dissemination for large manufacturers (Microsoft, IBM, Caterpillar, US Steel, ALCOA)
  – Language and translation services business sector estimated at $15 billion worldwide in 2008 and growing at a healthy pace

• Economic incentive is still primarily within a small number of language pairs

• Some fairly decent commercial products on the market for these language pairs
  – Primarily a product of rule-based systems after many years of development
  – New generation of data-driven "statistical" MT: Google, Language Weaver

• Web-based (mostly free) MT services: Google, Babelfish, others…

• Pervasive MT between many language pairs is still non-existent, but Google is trying to change that!

Page 4: Machine Translation:  Approaches, Challenges and Future


Representative Example: Google Translate

• http://translate.google.com

Page 5: Machine Translation:  Approaches, Challenges and Future


Google Translate

Page 6: Machine Translation:  Approaches, Challenges and Future


Google Translate

Page 7: Machine Translation:  Approaches, Challenges and Future


Example of High-Quality Rule-based MT

PAHO's Spanam system:

• Source (Spanish): Mediante petición recibida por la Comisión Interamericana de Derechos Humanos (en adelante …) el 6 de octubre de 1997, el señor Lino César Oviedo (en adelante …) denunció que la República del Paraguay (en adelante …) violó en su perjuicio los derechos a las garantías judiciales … en su contra.

• MT output (English): Through petition received by the Inter-American Commission on Human Rights (hereinafter …) on 6 October 1997, Mr. Linen César Oviedo (hereinafter "the petitioner") denounced that the Republic of Paraguay (hereinafter …) violated to his detriment the rights to the judicial guarantees, to the political participation, to equal protection and to the honor and dignity consecrated in articles 8, 23, 24 and 11, respectively, of the American Convention on Human Rights (hereinafter …), as a consequence of judgments initiated against it.

Page 8: Machine Translation:  Approaches, Challenges and Future


Machine Translation: Some Basic Terminology

• Source Language (SL): the language of the original text that we wish to translate

• Target Language (TL): the language into which we wish to translate

• Translation Segment: language “unit” which is translated independently; usually sentences, sometimes smaller phrases or terms

Page 9: Machine Translation:  Approaches, Challenges and Future


How Does MT Work?

• Naïve MT: Translation Memory
  – Store in a database human translations of sentences (or shorter phrases) that have already been translated before
  – When translating a new document:
    • For each source sentence, search the DB to see if it has been translated before
    • If found, retrieve its translation!
    • "Fuzzy matches", multiple translations
  – Main Advantage: translation output is always human-quality!
  – Main Disadvantage: many/most sentences haven't been translated before and cannot be retrieved…

• Translation Memories are heavily used by the commercial Language Service Provider industry – companies such as Echo International
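The translation-memory lookup described above can be sketched in a few lines. This is a toy illustration, not any LSP's actual product: the stored segments and the 0.8 fuzzy-match threshold are invented for the example.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory translation memory: source segment -> stored human translation.
tm = {
    "the system is ready": "das System ist bereit",
    "please restart the system": "bitte starten Sie das System neu",
}

def tm_lookup(source, threshold=0.8):
    """Return (translation, match_score) for the best TM hit, or None.

    An exact match scores 1.0; otherwise a fuzzy match is accepted
    only if its similarity reaches the threshold."""
    if source in tm:
        return tm[source], 1.0
    best, best_score = None, 0.0
    for segment, translation in tm.items():
        score = SequenceMatcher(None, source, segment).ratio()
        if score > best_score:
            best, best_score = translation, score
    if best_score >= threshold:
        return best, best_score
    return None  # segment has no usable prior translation

print(tm_lookup("the system is ready"))        # exact match: human-quality output
print(tm_lookup("completely unrelated text"))  # the main disadvantage: no usable match
```

This directly mirrors the slide's trade-off: retrieved output is human-quality, but most new sentences return `None`.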

Page 10: Machine Translation:  Approaches, Challenges and Future


How Does MT Work?

• All modern MT approaches are based on building translations for complete segments by putting together smaller pieces of translation

• Core Questions:
  – What are these smaller pieces of translation? Where do they come from?
  – How does MT put these pieces together?
  – How does the MT system pick the correct (or best) translation among many options?

Page 11: Machine Translation:  Approaches, Challenges and Future


Core Challenges of MT

• Ambiguity and Language Divergences:
  – Human languages are highly ambiguous, and differently so in different languages
  – Ambiguity at all "levels": lexical, syntactic, semantic, language-specific constructions and idioms

• Amount of required knowledge:
  – Translation equivalencies for vast vocabularies (several 100K words and phrases)
  – Syntactic knowledge (how to map the syntax of one language to another), plus more complex language divergences (semantic differences, constructions and idioms, etc.)
  – How do you acquire and construct a knowledge base that big that is (even mostly) correct and consistent?

Page 12: Machine Translation:  Approaches, Challenges and Future


How to Tackle the Core Challenges

• Manual Labor: 1000s of person-years of human experts developing large word and phrase translation lexicons and translation rules. Example: Systran's RBMT systems.

• Lots of Parallel Data: data-driven approaches for finding word and phrase correspondences automatically from large amounts of sentence-aligned parallel texts. Example: Statistical MT systems.

• Learning Approaches: learn translation rules automatically from small amounts of human-translated and word-aligned data. Example: CMU's Statistical XFER approach.

• Simplify the Problem: build systems that are limited-domain or constrained in other ways. Example: the CATALYST system built for Caterpillar.

Page 13: Machine Translation:  Approaches, Challenges and Future


Major Sources of Translation Problems

• Lexical Differences:
  – Multiple possible translations for an SL word, or difficulties expressing the SL word's meaning in a single TL word

• Structural Differences:
  – Syntax of the SL is different from the syntax of the TL: word order, sentence and constituent structure

• Differences in Mappings of Syntax to Semantics:
  – Meaning in the TL is conveyed using a different syntactic structure than in the SL

• Idioms and Constructions

Page 14: Machine Translation:  Approaches, Challenges and Future


Lexical Differences

• SL word has several different meanings that translate differently into the TL
  – Ex: financial bank vs. river bank

• Lexical Gaps: SL word reflects a unique meaning that cannot be expressed by a single word in the TL
  – Ex: English snub doesn't have a corresponding verb in French or German

• TL has finer distinctions than SL → SL word should be translated differently in different contexts
  – Ex: English wall can be German Wand (internal) or Mauer (external)

Page 15: Machine Translation:  Approaches, Challenges and Future


Structural Differences

• Syntax of the SL is different from the syntax of the TL:
  – Word order within constituents:
    • English NPs: art adj n ("the big boy")
    • Hebrew NPs: art n art adj ("ha yeled ha gadol")
  – Constituent structure:
    • English is SVO: Subj Verb Obj ("I saw the man")
    • Modern Arabic is VSO: Verb Subj Obj
  – Different verb syntax:
    • Verb complexes in English vs. in German: "I can eat the apple" / "Ich kann den Apfel essen"
  – Case marking and free constituent order:
    • German and other languages that mark case: "den Apfel esse ich" – the(acc) apple eat I(nom)

Page 16: Machine Translation:  Approaches, Challenges and Future


Syntax-to-Semantics Differences

• Structure-change example:
  I like swimming
  "Ich schwimme gern" (lit: "I swim gladly")

• Verb-argument example:
  Jones likes the film.
  "Le film plaît à Jones." (lit: "the film pleases to Jones")

• Passive Constructions
  – Example: French reflexive passives:
    Ces livres se lisent facilement
    *"These books read themselves easily"
    These books are easily read

Page 17: Machine Translation:  Approaches, Challenges and Future


Idioms and Constructions

• Main Distinction: meaning of the whole is not directly compositional from the meaning of its sub-parts → no compositional translation

• Examples:– George is a bull in a china shop– He kicked the bucket– Can you please open the window?

Page 18: Machine Translation:  Approaches, Challenges and Future


Formulaic Utterances

• Good night.
• "tisbaH cala xEr" (lit: roughly "wake up on good")

• Romanization of Arabic from CallHome Egypt

Page 19: Machine Translation:  Approaches, Challenges and Future


State-of-the-Art in MT

• What users want:
  – General purpose (any text)
  – High quality (human level)
  – Fully automatic (no user intervention)

• We can meet any 2 of these 3 goals today, but not all three at once:
  – FA + HQ: Knowledge-Based MT (KBMT)
  – FA + GP: Data-driven (Statistical) MT
  – GP + HQ: Include humans in the MT loop – post-editing!

Page 20: Machine Translation:  Approaches, Challenges and Future


Types of MT Applications:

• Assimilation: multiple source languages, uncontrolled style/topic. Requires general purpose MT: good fit for “Google Translate”

• Dissemination: one source language, into multiple target languages; often controlled style, single topic/domain (at least per user): this is the common commercial translation scenario: good fit for KBMT and customized Statistical MT

• Communication: lower quality may be okay, but system robustness and real-time operation are required.

Page 21: Machine Translation:  Approaches, Challenges and Future


Approaches to MT: The Vauquois MT Triangle

[Diagram: the Vauquois triangle – Analysis up the left side, Generation down the right, with Direct translation at the base, Transfer in the middle, and Interlingua at the apex.]

Example: "Mi chiamo Alon Lavie" → "My name is Alon Lavie"

• Interlingua: Give-information+personal-data (name=alon_lavie)
• Source structure: [s [vp accusative_pronoun "chiamare" proper_name]]
• Target structure: [s [np [possessive_pronoun "name"]] [vp "be" proper_name]]

Page 22: Machine Translation:  Approaches, Challenges and Future


Interlingua versus Transfer

• With interlingua, we need only N parsers/generators instead of N² transfer systems:

[Diagram: six languages L1–L6 connected pairwise by transfer systems, versus the same six languages each connected only to a central interlingua.]
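The arithmetic behind this argument is easy to make concrete: pairwise transfer needs one module per ordered language pair, while an interlingua needs only one analyzer into and one generator out of the interlingua per language. A minimal sketch:

```python
def transfer_systems(n):
    # One transfer module per ordered (source, target) language pair: n * (n - 1).
    return n * (n - 1)

def interlingua_modules(n):
    # One analyzer into the interlingua plus one generator out of it, per language.
    return 2 * n

for n in (3, 6, 20):
    print(f"{n} languages: {transfer_systems(n)} transfer systems "
          f"vs. {interlingua_modules(n)} interlingua modules")
```

For the six languages in the diagram, that is 30 transfer systems versus 12 interlingua modules, and the gap widens quadratically as languages are added.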

Page 23: Machine Translation:  Approaches, Challenges and Future


Rule-based vs. Data-driven Approaches to MT

• What are the pieces of translation? Where do they come from?
  – Rule-based: large-scale "clean" word translation lexicons, manually constructed over time by human experts
  – Data-driven: broad-coverage word and multi-word translation lexicons, learned automatically from available sentence-parallel corpora

• How does MT put these pieces together?
  – Rule-based: large collections of rules, manually developed over time by human experts, that map structures from the source to the target language
  – Data-driven: a computer algorithm that explores millions of possible ways of putting the small pieces together, looking for the translation that statistically looks best

Page 24: Machine Translation:  Approaches, Challenges and Future


Rule-based vs. Data-driven Approaches to MT

• How does the MT system pick the correct (or best) translation among many options?
  – Rule-based: human experts encode preferences among the rules, designed to prefer creation of better translations
  – Data-driven: a variety of fitness and preference scores, many of which can be learned from available training data, are used to model a total score for each of the millions of possible translation candidates; the algorithm then selects and outputs the best-scoring translation

Page 25: Machine Translation:  Approaches, Challenges and Future


Rule-based vs. Data-driven Approaches to MT

• Why have the data-driven approaches become so popular?
  – We can now do this!
    • Increasing amounts of sentence-parallel data are constantly being created on the web
    • Advances in machine learning algorithms
    • The computational power of today's computers can train systems on these massive amounts of data, and can perform these massive search-based translation computations when translating new texts
  – Building and maintaining rule-based systems is too difficult, expensive and time-consuming
  – In many scenarios, it actually works better!

Page 26: Machine Translation:  Approaches, Challenges and Future


Rule-based vs. Data-driven MT

Rule-based:

  We thank all participants of the whole world for their comical and creative drawings; to choose the victors was not easy task!

  Click here to see work of winning European of these two months, and use it to look at what the winning of USA sent us.

Data-driven:

  We thank all the participants from around the world for their designs cocasses and creative; selecting winners was not easy!

  Click here to see the artwork of winners European of these two months, and disclosure to look at what the winners of the US have been sending.

Page 27: Machine Translation:  Approaches, Challenges and Future


Data-driven MT:Some Major Challenges

• Current approaches are too naïve and "direct":
  – Good at learning word-to-word and phrase-to-phrase correspondences from data
  – Not good enough at learning how to combine these pieces and reorder them during translation
  – Learning general rules requires much more complicated algorithms and computer processing of the data
  – The space of translations that is "searched" often doesn't contain a perfect translation
  – The fitness scores that are used aren't good enough to always assign better scores to the better translations → we don't always find the best translation even when it's there!

• Solutions:
  – Google solution: more and more data!
  – Research solution: "smarter" algorithms and learning methods

Page 28: Machine Translation:  Approaches, Challenges and Future


Multi-Engine MT

• Apply several MT engines to each input in parallel

• Create a combined translation from the individual translations

• Goal is to combine strengths and avoid weaknesses, along all dimensions: domain limits, quality, development time/cost, run-time speed, etc.

• Various approaches to the problem

Page 29: Machine Translation:  Approaches, Challenges and Future


Synthetic Combination MEMT

Two-stage approach:

1. Align: identify common words and phrases across the translations provided by the engines
2. Decode: search the space of synthetic combinations of words/phrases and select the highest-scoring combined translation

Example:

1. announced afghan authorities on saturday reconstituted four intergovernmental committees
2. The Afghan authorities on Saturday the formation of the four committees of government

MEMT: the afghan authorities announced on Saturday the formation of four intergovernmental committees
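The "align" stage can be sketched as a simple word/bigram intersection over the slide's two engine outputs. This is a toy illustration of stage 1 only; the real MEMT aligner handles synonyms and reordering, and the decoder then searches synthetic combinations of the engines' fragments, preferring agreed-on material.

```python
# Stage 1 (align), heavily simplified: find the words and bigrams
# that two MT engine outputs agree on.
hyp1 = ("announced afghan authorities on saturday reconstituted "
        "four intergovernmental committees").split()
hyp2 = ("the afghan authorities on saturday the formation of the "
        "four committees of government").split()

def bigrams(words):
    """All adjacent word pairs in a token list."""
    return {tuple(words[i:i + 2]) for i in range(len(words) - 1)}

common_words = set(hyp1) & set(hyp2)
common_bigrams = bigrams(hyp1) & bigrams(hyp2)

print(sorted(common_words))    # material both engines agree on
print(sorted(common_bigrams))  # e.g. "afghan authorities", "on saturday"
```

A decoder built on top of this would reward candidate combinations that preserve the shared material while filling the gaps from whichever engine phrased them better.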

Page 30: Machine Translation:  Approaches, Challenges and Future


Speech-to-Speech MT

• Speech just makes MT (much) more difficult:
  – Spoken language is messier
    • False starts, filled pauses, repetitions, out-of-vocabulary words
    • Lack of punctuation and explicit sentence boundaries
  – Current speech technology is far from perfect

• Need for speech recognition (SR) and synthesis in foreign languages

• Robustness: MT quality degradation should be proportional to SR quality

• Tight Integration: rather than separate sequential tasks, can SR + MT be integrated in ways that improve end-to-end performance?

Page 31: Machine Translation:  Approaches, Challenges and Future


MT Evaluation

• How do you evaluate the quality of the output of MT systems?

• Human notions of translation quality:
  – Adequacy: to what extent does the translation have the same meaning as the original sentence?
  – Fluency: to what extent is the translation fluent and grammatical in the target language?
  – Rankings: given two (or more) translations of the same input sentence, which one is better? (Or rank them by quality.)
  – Task-based measures: is the translation sufficient for accomplishing a specific task or goal (e.g., understanding the gist of the document, or flagging important documents that should be translated by a human)?

Page 32: Machine Translation:  Approaches, Challenges and Future


Automated MT Evaluation

• Automatic evaluation metrics are extremely important:
  – Human evaluations are too expensive and time-consuming to be done very frequently
  – Need to test changes and assess whether the system is getting better, quickly and on an on-going basis
  – Data-driven MT systems have lots of tunable parameters – need to optimize them for best performance

• Goal: a fully automatic, fast metric that can assess quality and that correlates well with human notions of quality

• Major Approach:
  – Obtain human "reference" translations for test data sets
  – Estimate how "close" the MT translations are to the human translations (on a scale of [0,1])

• Major Challenge: translation variability – there are many correct translations. How to measure similarity?
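A bare-bones version of the "closeness" idea: score a hypothesis by unigram overlap with the closest of several references, on a [0,1] scale. This is only a sketch of the concept; real metrics such as BLEU or METEOR add n-gram matching, stemming, synonyms and length penalties, and the example sentences here are invented.

```python
from collections import Counter

def unigram_f1(hypothesis, reference):
    """Harmonic mean of unigram precision and recall against one reference."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())  # clipped word matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def score(hypothesis, references):
    # Translation variability: compare against several correct references
    # and keep the best match.
    return max(unigram_f1(hypothesis, r) for r in references)

refs = ["the costs keep rising", "costs continue to rise"]
print(score("the costs keep rising", refs))  # matches a reference exactly -> 1.0
print(score("prices are falling", refs))     # no overlapping words -> 0.0
```

Multiple references soften, but do not solve, the variability problem: a perfectly good translation that happens to use none of the reference wordings still scores poorly.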

Page 33: Machine Translation:  Approaches, Challenges and Future


MT for Commercial Language Service Providers

• Most dissemination-type high-quality translation is performed by commercial Language Service Provider (LSP) companies, such as Echo International

• Translation process used by most LSPs:
  – Heavily dependent on expensive human translators, to ensure high quality
  – Current automation mostly in the form of Translation Memories:
    • Previously translated segments don't have to be translated again by human translators
    • The retrieved translations are of guaranteed high quality – limited post-editing is required
  – No extensive use of modern state-of-the-art MT!

Page 34: Machine Translation:  Approaches, Challenges and Future


MT for Commercial Language Service Providers

• Why is MT currently not used by LSPs?
  – The quality of translations produced by MT varies widely: some sentences can be perfect translations, others can be very bad
  – Post-editing bad MT output using human translators is frustrating and unappealing, and is often not cost-effective
  – No existing good technical solutions that integrate MT seamlessly into the existing translation workflow processes used by LSPs
  – LSPs are wary of taking a "leap of faith" on unproven technology that may not save money

Page 35: Machine Translation:  Approaches, Challenges and Future


Safaba Translation Solutions LLC

• New CMU spin-off technology start-up company

• Mission: develop innovative solutions using automated Machine Translation for commercial Language Service Providers (LSPs)

• Concept:
  – Identify and enhance high-quality automatically-produced translations and efficiently integrate them into the human translation loop, dramatically reducing the cost and turn-around times of translation projects for commercial LSPs

• Founders:
  – Alon Lavie – Associate Research Professor, LTI, CMU
  – Robert Olszewski – CMU CS Ph.D. (2001)

• Partnering with Echo International for feasibility analysis and as a potential primary beta-testing client

Page 36: Machine Translation:  Approaches, Challenges and Future


Some Take-home Messages

• Machine Translation is already quite good for many purposes and needs, and is getting better all the time

• Modern state-of-the-art MT is data-driven – computers learning from data. This paradigm shift is not reversible

• For casual (assimilation-type) needs, free web-based translation services such as Google are already very useful

• For business (dissemination-type) needs, LSPs are usually required, but MT will increasingly integrate into the way LSPs do translation

Page 37: Machine Translation:  Approaches, Challenges and Future


Questions…

Page 38: Machine Translation:  Approaches, Challenges and Future


Lexical Differences

• SL word has several different meanings that translate differently into the TL
  – Ex: financial bank vs. river bank

• Lexical Gaps: SL word reflects a unique meaning that cannot be expressed by a single word in the TL
  – Ex: English snub doesn't have a corresponding verb in French or German

• TL has finer distinctions than SL → SL word should be translated differently in different contexts
  – Ex: English wall can be German Wand (internal) or Mauer (external)

Page 39: Machine Translation:  Approaches, Challenges and Future


Google at Work…

Page 40: Machine Translation:  Approaches, Challenges and Future


Page 41: Machine Translation:  Approaches, Challenges and Future


Page 42: Machine Translation:  Approaches, Challenges and Future


Lexical Differences

• Lexical gaps:
  – Examples: these have no direct equivalent in English:

    gratiner (v., French, "to cook with a cheese coating")

    ōtosanrin (n., Japanese, "three-wheeled truck or van")

Page 43: Machine Translation:  Approaches, Challenges and Future


[From Hutchins & Somers]

Lexical Differences

Page 44: Machine Translation:  Approaches, Challenges and Future


MT Handling of Lexical Differences

• Direct MT and Syntactic Transfer:
  – Lexical transfer stage uses a bilingual lexicon
  – An SL word can have multiple translation entries, possibly augmented with disambiguation features or probabilities
  – Lexical transfer can involve the use of limited context (on the SL side, the TL side, or both)
  – Lexical gaps can partly be addressed via phrasal lexicons

• Semantic Transfer:
  – Ambiguity of the SL word must be resolved during analysis → correct symbolic representation at the semantic level
  – TL generation must select the appropriate word or structure for correctly conveying the concept in the TL
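The bilingual-lexicon stage with limited-context disambiguation might be sketched like this. The lexicon entries and context features are invented for illustration, and real systems use weighted features or probabilities rather than bare set overlap.

```python
# Toy bilingual lexicon: an SL word maps to several TL candidates,
# each carrying simple context-word disambiguation features (hypothetical).
lexicon = {
    "bank": [
        {"tl": "Bank", "context": {"money", "account", "financial"}},
        {"tl": "Ufer", "context": {"river", "water", "shore"}},
    ],
}

def lexical_transfer(word, sentence_words):
    """Pick the TL entry whose context features best overlap the sentence.

    Falls back to the first (most frequent, by convention) entry when
    no context feature fires."""
    entries = lexicon[word]
    scored = [(len(e["context"] & set(sentence_words)), e["tl"]) for e in entries]
    best_score, best_tl = max(scored)
    return best_tl if best_score > 0 else entries[0]["tl"]

print(lexical_transfer("bank", "he sat on the bank of the river".split()))  # Ufer
print(lexical_transfer("bank", "she opened a bank account".split()))        # Bank
```

This mirrors the slide's point: the entry itself is easy to store; choosing among entries is where the limited context (here, bag-of-words overlap) does the work.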

Page 45: Machine Translation:  Approaches, Challenges and Future


Structural Differences

• Syntax of the SL is different from the syntax of the TL:
  – Word order within constituents:
    • English NPs: art adj n ("the big boy")
    • Hebrew NPs: art n art adj ("ha yeled ha gadol")
  – Constituent structure:
    • English is SVO: Subj Verb Obj ("I saw the man")
    • Modern Arabic is VSO: Verb Subj Obj
  – Different verb syntax:
    • Verb complexes in English vs. in German: "I can eat the apple" / "Ich kann den Apfel essen"
  – Case marking and free constituent order:
    • German and other languages that mark case: "den Apfel esse ich" – the(acc) apple eat I(nom)

Page 46: Machine Translation:  Approaches, Challenges and Future


Page 47: Machine Translation:  Approaches, Challenges and Future


Page 48: Machine Translation:  Approaches, Challenges and Future


Page 49: Machine Translation:  Approaches, Challenges and Future


MT Handling of Structural Differences

• Direct MT Approaches:
  – No explicit treatment: phrasal lexicons and sentence-level matches or templates

• Syntactic Transfer:
  – Structural transfer grammars
    • Trigger a rule by matching against the syntactic structure on the SL side
    • The rule specifies how to reorder and re-structure the syntactic constituents to reflect the syntax of the TL side

• Semantic Transfer:
  – SL semantic representation abstracts away from SL syntax to functional roles → done during analysis
  – TL generation maps semantic structures to correct TL syntax
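A structural transfer rule of the kind described can be sketched as a function over a simplified parse structure. Real systems express such rules declaratively in a grammar formalism rather than as hand-written code; the clause representation below is invented for illustration.

```python
# Toy structural transfer rule: match an English SVO clause and reorder
# its constituents to VSO for a target language such as Modern Arabic.

def svo_to_vso(tree):
    """tree is ("S", subject, verb, object), each constituent a word list.

    The rule fires on the SL structure and re-orders whole constituents,
    independently of which particular words fill them."""
    label, subj, verb, obj = tree
    if label != "S":
        return tree  # rule does not match; leave structure unchanged
    return (label, verb, subj, obj)

parse = ("S", ["I"], ["saw"], ["the", "man"])
print(svo_to_vso(parse))  # ('S', ['saw'], ['I'], ['the', 'man'])
```

Note how the rule moves constituents, not individual words: "the man" travels as a unit, which is exactly what a direct word-for-word approach cannot express.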

Page 50: Machine Translation:  Approaches, Challenges and Future


Syntax-to-Semantics Differences

• Meaning in the TL is conveyed using a different syntactic structure than in the SL
  – Changes in the verb and its arguments
  – Passive constructions
  – Motion verbs and state verbs
  – Case creation and case absorption

• Main Distinction from Structural Differences:
  – Structural differences are mostly independent of lexical choices and their semantic meaning → addressed by transfer rules that are syntactic in nature
  – Syntax-to-semantics mapping differences are meaning-specific: they require the presence of specific words (and meanings) in the SL

Page 51: Machine Translation:  Approaches, Challenges and Future


Syntax-to-Semantics Differences

• Structure-change example:
  I like swimming
  "Ich schwimme gern" (lit: "I swim gladly")

Page 52: Machine Translation:  Approaches, Challenges and Future


Page 53: Machine Translation:  Approaches, Challenges and Future


Syntax-to-Semantics Differences

• Verb-argument example:
  Jones likes the film.
  "Le film plaît à Jones." (lit: "the film pleases to Jones")

• Use of case roles can eliminate the need for this type of transfer
  – Jones = Experiencer
  – film = Theme

Page 54: Machine Translation:  Approaches, Challenges and Future


Page 55: Machine Translation:  Approaches, Challenges and Future


Page 56: Machine Translation:  Approaches, Challenges and Future


Syntax-to-Semantics Differences

• Passive Constructions
• Example: French reflexive passives:
  Ces livres se lisent facilement
  *"These books read themselves easily"
  These books are easily read

Page 57: Machine Translation:  Approaches, Challenges and Future


Page 58: Machine Translation:  Approaches, Challenges and Future


Page 59: Machine Translation:  Approaches, Challenges and Future


Direct Approaches

• No intermediate stage in the translation

• First MT systems developed in the 1950s-60s (assembly-code programs)
  – Morphology, bilingual dictionary lookup, local reordering rules
  – "Word-for-word, with some local word-order adjustments"

• Modern Approaches:
  – Phrase-based Statistical MT (SMT)
  – Example-based MT (EBMT)

Page 60: Machine Translation:  Approaches, Challenges and Future


Statistical MT (SMT)

• Proposed by IBM in the early 1990s: a direct, purely statistical model for MT
• Most dominant approach in current MT research
• Evolved from word-level translation to phrase-based translation
• Main Ideas:
  – Training: statistical "models" of word and phrase translation equivalence are learned automatically from bilingual parallel sentences, creating a bilingual "database" of translations
  – Decoding: new sentences are translated by a program (the decoder), which matches the source words and phrases against the database of translations and searches the "space" of all possible translation combinations

Page 61: Machine Translation:  Approaches, Challenges and Future


Statistical MT (SMT)

• Main steps in training phrase-based statistical MT:
  – Create a sentence-aligned parallel corpus
  – Word Alignment: train word-level alignment models (GIZA++)
  – Phrase Extraction: extract phrase-to-phrase translation correspondences using heuristics (Pharaoh)
  – Minimum Error Rate Training (MERT): optimize translation system parameters on development data to achieve the best translation performance

• Attractive: completely automatic, no manual rules, much-reduced manual labor

• Main drawbacks:
  – Translation accuracy levels vary
  – Effective only with large volumes (several mega-words) of parallel text
  – Broad domain, but domain-sensitive
  – Still viable only for a small number of language pairs!

• Impressive progress in the last 5 years
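The match-and-score core of phrase-based decoding can be made concrete with a heavily simplified sketch: a hypothetical phrase table, greedy longest-match monotone segmentation, and no language model or reordering. A real decoder performs a beam search over millions of hypotheses and combines many feature scores.

```python
# Hypothetical phrase table: source phrase -> [(target phrase, probability), ...]
phrase_table = {
    ("das", "haus"): [("the house", 0.8), ("the building", 0.2)],
    ("ist",): [("is", 0.9)],
    ("klein",): [("small", 0.7), ("little", 0.3)],
}

def decode(source_words):
    """Greedily segment the source left-to-right into known phrases,
    take the best translation of each, and multiply probabilities."""
    output, score, i = [], 1.0, 0
    while i < len(source_words):
        for j in range(len(source_words), i, -1):  # prefer the longest match
            phrase = tuple(source_words[i:j])
            if phrase in phrase_table:
                target, p = max(phrase_table[phrase], key=lambda tp: tp[1])
                output.append(target)
                score *= p
                i = j
                break
        else:
            output.append(source_words[i])  # out-of-vocabulary word passes through
            i += 1
    return " ".join(output), score

print(decode("das haus ist klein".split()))  # ('the house is small', 0.504)
```

Even this toy shows why phrases beat words: "das haus" is translated as one unit, so the article and noun never need to be reassembled from independent word choices.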

Page 62: Machine Translation:  Approaches, Challenges and Future


EBMT Paradigm

New sentence (source):
  Yesterday, 200 delegates met with President Clinton.

Matches to source found (with sub-sentential alignment to their translations):
  Yesterday, 200 delegates met behind closed doors… → Gestern trafen sich 200 Abgeordnete hinter verschlossenen…
  Difficulties with President Clinton… → Schwierigkeiten mit Praesident Clinton…

Translated sentence (target):
  Gestern trafen sich 200 Abgeordnete mit Praesident Clinton.

Page 63: Machine Translation:  Approaches, Challenges and Future


Transfer Approaches

• Syntactic Transfer:
  – Analyze the SL input sentence into its syntactic structure (parse tree)
  – Transfer the SL parse tree to a TL parse tree (various formalisms for specifying mappings)
  – Generate the TL sentence from the TL parse tree

• Semantic Transfer:
  – Analyze the SL input into a language-specific semantic representation (e.g., case frames, logical form)
  – Transfer the SL semantic representation to a TL semantic representation
  – Generate syntactic structure and then the surface sentence in the TL

Page 64: Machine Translation:  Approaches, Challenges and Future


Transfer Approaches

Main Advantages and Disadvantages:

• Syntactic Transfer:
  – No need for semantic analysis and generation
  – Syntactic structures are general, not domain-specific → less domain-dependent, can handle open domains
  – Requires a word translation lexicon

• Semantic Transfer:
  – Requires deeper analysis and generation; symbolic representation of concepts and predicates → difficult to construct for open or unlimited domains
  – Can better handle non-compositional meaning structures → can be more accurate
  – No word translation lexicon – generates in the TL from symbolic concepts

Page 65: Machine Translation:  Approaches, Challenges and Future


MT at the LTI

• The LTI originated as the Center for Machine Translation (CMT) in 1985

• MT continues to be a prominent sub-discipline of research within the LTI
  – More MT faculty than any of the other areas
  – More MT faculty than anywhere else

• Active research on all main approaches to MT: Interlingua, Transfer, EBMT, SMT

• Leader in the area of speech-to-speech MT
• Multi-Engine MT (MEMT)
• MT Evaluation (METEOR)

Page 66: Machine Translation:  Approaches, Challenges and Future


Phrase-based Statistical MT
• Word-to-word and phrase-to-phrase translation pairs are acquired automatically from data and assigned probabilities based on a statistical model
• Extracted and trained from very large amounts of sentence-aligned parallel text
  – Word alignment algorithms
  – Phrase detection algorithms
  – Translation model probability estimation
• Main approach pursued in CMU systems in the DARPA/TIDES program and now in GALE
  – Chinese-to-English and Arabic-to-English
• Most active work is on improved word alignment, phrase extraction, and advanced decoding techniques
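As an illustration of how phrase-pair probabilities and a language model combine during decoding, here is a toy monotone phrase-based decoder. The Spanish-English phrase table and bigram scores are invented for illustration; they are not from the CMU systems.

```python
import math

# Invented toy phrase table: source phrase -> {target phrase: p(target | source)}
phrase_table = {
    "la": {"the": 0.8},
    "casa": {"house": 0.7, "home": 0.2},
    "la casa": {"the house": 0.6},
    "casa blanca": {"white house": 0.5},
    "blanca": {"white": 0.9},
}

# Invented toy bigram language model (log-probabilities); unseen bigrams get -5.0
lm_bigrams = {("the", "white"): -0.5, ("white", "house"): -0.3, ("the", "house"): -0.7}

def lm_logprob(tokens):
    return sum(lm_bigrams.get(bg, -5.0) for bg in zip(tokens, tokens[1:]))

def decode(source_tokens):
    """Exhaustively try every monotone segmentation into known phrases and
    return the target string maximizing translation-model + LM log-probability."""
    best_score, best_target = float("-inf"), None

    def expand(i, target, tm_logprob):
        nonlocal best_score, best_target
        if i == len(source_tokens):
            total = tm_logprob + lm_logprob(target)
            if total > best_score:
                best_score, best_target = total, " ".join(target)
            return
        for j in range(i + 1, len(source_tokens) + 1):
            src = " ".join(source_tokens[i:j])
            for tgt, prob in phrase_table.get(src, {}).items():
                expand(j, target + tgt.split(), tm_logprob + math.log(prob))

    expand(0, [], 0.0)
    return best_target

decode(["la", "casa", "blanca"])   # the multi-word phrase pair wins: "the white house"
```

A real phrase-based decoder adds reordering, beam search, and tuned feature weights; this sketch keeps only the segmentation search and the two core models.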

EBMT

• Developed originally for the PANGLOSS system in the early 1990s
  – Translation between English and Spanish
• Generalized EBMT under development for the past several years
• Used in a variety of projects in recent years
  – DARPA TIDES and GALE programs
  – DIPLOMAT and TONGUES
• Active research work on improving alignment and indexing, and on decoding from a lattice

CMU Statistical Transfer (Stat-XFER) MT Approach

• Integrate the major strengths of rule-based and statistical MT within a common statistically-driven framework:
  – Linguistically rich formalism that can express complex and abstract compositional transfer rules
  – Rules can be written by human experts and also acquired automatically from data
  – Easy integration of morphological analyzers and generators
  – Word and syntactic-phrase correspondences can be automatically acquired from parallel text
  – Search-based decoding from statistical MT adapted to find the best translation within the search space: multi-feature scoring, beam search, parameter optimization, etc.
  – Framework suitable for both resource-rich and resource-poor language scenarios
• Most active work on phrase and rule acquisition from parallel data, efficient decoding, joint decoding with non-syntactic phrases, and MT for low-resource languages

Speech-to-Speech MT
• Evolution from the JANUS/C-STAR systems to NESPOLE!, LingWear, BABYLON, TRANSTAC
  – Early 1990s: first prototype system that fully performed speech-to-speech translation (very limited domains)
  – Interlingua-based, but with shallow task-oriented representations: "we have single and double rooms available" maps to [give-information+availability] (room-type={single, double})
  – Semantic Grammars for analysis and generation
  – Multiple languages: English, German, French, Italian, Japanese, Korean, and others
  – Phrase-based SMT applied in speech-to-speech scenarios
  – Most active work on portable speech translation on small devices: Iraqi-Arabic/English and Thai/English

KBMT: KANT, KANTOO, CATALYST

• Deep knowledge-based framework, with a symbolic interlingua as the intermediate representation
  – Syntactic and semantic analysis into an unambiguous, detailed symbolic representation of meaning, using unification grammars and transformation mappers
  – Generation into the target language using unification grammars and transformation mappers
• First large-scale multi-lingual interlingua-based MT system deployed commercially:
  – CATALYST at Caterpillar: high-quality translation of documentation manuals for heavy equipment
    • Limited domains and controlled English input
    • Minor amounts of post-editing
    • Active follow-on projects

Multi-Engine MT
• New decoding-based approach developed in recent years under DoD and DARPA funding (used in GALE)
• Main ideas:
  – Treat the original engines as "black boxes"
  – Align the word and phrase correspondences between the translations
  – Build a collection of synthetic combinations based on the aligned words and phrases
  – Score the synthetic combinations based on Language Model and confidence measures
  – Select the top-scoring synthetic combination
• Architecture issues: integrating "workflows" that produce multiple translations and then combine them with MEMT
  – IBM's UIMA architecture
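A greatly simplified sketch of those steps: the splice-at-shared-word combination strategy and the toy bigram scores below are invented for illustration, and the real system uses much richer alignment and confidence features.

```python
def shared_words(hyp_a, hyp_b):
    """Step 1 (align): words the two engine outputs have in common, in hyp_a order."""
    return [w for w in hyp_a if w in set(hyp_b)]

def synthetic_combinations(hyp_a, hyp_b):
    """Step 2: besides the two originals, build synthetic hypotheses by
    switching from one engine to the other at each shared word."""
    candidates = {tuple(hyp_a), tuple(hyp_b)}
    for w in shared_words(hyp_a, hyp_b):
        i, j = hyp_a.index(w), hyp_b.index(w)
        candidates.add(tuple(hyp_a[:i] + hyp_b[j:]))
        candidates.add(tuple(hyp_b[:j] + hyp_a[i:]))
    return candidates

# Invented toy bigram LM scores (log-probabilities); unseen bigrams get -4.0
lm_bigrams = {("the", "old"): -0.1, ("old", "man"): -0.1,
              ("man", "sleeps"): -0.1, ("man", "sleep"): -3.0, ("an", "old"): -1.0}

def lm_score(tokens):
    """Step 3 (score): length-normalized bigram log-probability."""
    pairs = list(zip(tokens, tokens[1:]))
    return sum(lm_bigrams.get(bg, -4.0) for bg in pairs) / max(len(pairs), 1)

def memt(hyp_a, hyp_b):
    """Step 4 (select): return the top-scoring synthetic combination."""
    return max(synthetic_combinations(hyp_a, hyp_b), key=lm_score)

# Each engine got half the sentence right; the combination recovers the whole:
memt(["the", "old", "man", "sleep"], ["an", "old", "man", "sleeps"])
# -> ("the", "old", "man", "sleeps")
```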

Automatic MT Evaluation
• METEOR: a new metric developed at CMU
• Improves upon the BLEU metric developed by IBM and used extensively in recent years
• Main ideas:
  – Assess the similarity between a machine-produced translation and (several) human reference translations
  – Similarity is based on word-to-word matching that matches:
    • Identical words
    • Morphological variants of the same word (stemming)
    • Synonyms
  – Similarity is based on a weighted combination of Precision and Recall
  – Address fluency/grammaticality via a direct penalty: how well-ordered is the matching of the MT output with the reference?
• Improved levels of correlation with human judgments of MT quality
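A minimal sketch of the three-stage word matching. The suffix-stripping stemmer and the synonym list here are crude stand-ins for illustration; METEOR itself uses the Porter stemmer and WordNet synonyms.

```python
def crude_stem(word):
    """Stand-in stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy synonym pairs, stored order-independently
SYNONYMS = {frozenset({"movie", "film"}), frozenset({"big", "large"})}

def match_stage(mt_word, ref_word):
    """Return the first stage at which two words match, or None:
    identical words, then stems, then synonyms."""
    if mt_word == ref_word:
        return "exact"
    if crude_stem(mt_word) == crude_stem(ref_word):
        return "stem"
    if frozenset({mt_word, ref_word}) in SYNONYMS:
        return "synonym"
    return None

match_stage("weapons", "weapons")   # "exact"
match_stage("handed", "handing")    # "stem"
match_stage("movie", "film")        # "synonym"
```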

Summary

• Main challenges for current state-of-the-art MT approaches are Coverage and Accuracy:
  – Acquiring broad-coverage, high-accuracy translation lexicons (for words and phrases)
  – Learning syntactic mappings between languages from parallel word-aligned data
  – Overcoming syntax-to-semantics differences and dealing with constructions
  – Stronger Target Language Modeling

Knowledge-based Interlingual MT

• The classic "deep" Artificial Intelligence approach:
  – Analyze the source language into a detailed symbolic representation of its meaning
  – Generate this meaning in the target language
• "Interlingua": one single meaning representation for all languages
  – Nice in theory, but extremely difficult in practice:
    • What kind of representation?
    • What is the appropriate level of detail to represent?
    • How to ensure that the interlingua is in fact universal?

Analysis and Generation: Main Steps

• Analysis:
  – Morphological analysis (word-level) and POS tagging
  – Syntactic analysis and disambiguation (produce a syntactic parse-tree)
  – Semantic analysis and disambiguation (produce symbolic frames or a logical-form representation)
  – Map to a language-independent Interlingua
• Generation:
  – Generate the semantic representation in the TL
  – Sentence Planning: generate syntactic structure and lexical selections for concepts
  – Surface-form realization: generate correct forms of words

The METEOR Metric

• Example:
  – Reference: "the Iraqi weapons are to be handed over to the army within two weeks"
  – MT output: "in two weeks Iraq's weapons will give army"
• Matching:
  – Ref: Iraqi weapons army two weeks
  – MT: two weeks Iraq's weapons army
• P = 5/8 = 0.625; R = 5/14 = 0.357
• Fmean = 10*P*R/(9*P+R) = 0.3731
• Fragmentation: 3 fragments over 5 matched words: frag = (3-1)/(5-1) = 0.50
• Discounting factor: DF = 0.5 * (frag**3) = 0.0625
• Final score: Fmean * (1 - DF) = 0.3731 * 0.9375 = 0.3498
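The arithmetic above can be reproduced in a few lines; this follows the slide's formulas exactly (published METEOR versions parameterize the fragmentation penalty slightly differently).

```python
def meteor_score(matches, mt_len, ref_len, chunks):
    """METEOR-style score from unigram match statistics:
    recall-weighted harmonic mean, discounted for fragmentation."""
    precision = matches / mt_len            # 5/8  = 0.625
    recall = matches / ref_len              # 5/14 ≈ 0.357
    fmean = 10 * precision * recall / (9 * precision + recall)
    frag = (chunks - 1) / (matches - 1)     # (3-1)/(5-1) = 0.50
    discount = 0.5 * frag ** 3              # 0.0625
    return fmean * (1 - discount)

meteor_score(5, 8, 14, 3)   # ≈ 0.3498, matching the worked example
```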

Synthetic Combination MEMT

Two-Stage Approach:
1. Align: identify common words and phrases across the translations provided by the engines
2. Decode: search the space of synthetic combinations of words/phrases and select the highest-scoring combined translation

Example:
1. announced afghan authorities on saturday reconstituted four intergovernmental committees
2. The Afghan authorities on Saturday the formation of the four committees of government

Translation Lexicon: Hebrew-to-English Examples (Semi-manually-developed)

PRO::PRO |: ["ANI"] -> ["I"]
  ((X1::Y1) ((X0 per) = 1) ((X0 num) = s) ((X0 case) = nom))

PRO::PRO |: ["ATH"] -> ["you"]
  ((X1::Y1) ((X0 per) = 2) ((X0 num) = s) ((X0 gen) = m) ((X0 case) = nom))

N::N |: ["$&H"] -> ["HOUR"]
  ((X1::Y1) ((X0 NUM) = s) ((Y0 NUM) = s) ((Y0 lex) = "HOUR"))

N::N |: ["$&H"] -> ["hours"]
  ((X1::Y1) ((Y0 NUM) = p) ((X0 NUM) = p) ((Y0 lex) = "HOUR"))

Translation Lexicon: French-to-English Examples

(Automatically-acquired)

DET::DET |: ["le"] -> ["the"] ((X1::Y1))

Prep::Prep |: ["dans"] -> ["in"] ((X1::Y1))

N::N |: ["principes"] -> ["principles"] ((X1::Y1))

N::N |: ["respect"] -> ["accordance"] ((X1::Y1))

NP::NP |: ["le respect"] -> ["accordance"] ()

PP::PP |: ["dans le respect"] -> ["in accordance"] ()

PP::PP |: ["des principes"] -> ["with the principles"] ()

Hebrew-English Transfer Grammar: Example Rules

(Manually-developed)

{NP1,2}
;;SL: $MLH ADWMH
;;TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
  ((X2::Y1) (X1::Y2)
   ((X1 def) = -) ((X1 status) =c absolute)
   ((X1 num) = (X2 num)) ((X1 gen) = (X2 gen))
   (X0 = X1))

{NP1,3}
;;SL: H $MLWT H ADWMWT
;;TL: THE RED DRESSES
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
  ((X3::Y1) (X1::Y2)
   ((X1 def) = +) ((X1 status) =c absolute)
   ((X1 num) = (X3 num)) ((X1 gen) = (X3 gen))
   (X0 = X1))
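To make the reordering concrete, here is a toy interpreter for the constituent-reordering part of the first rule above. Full feature unification is reduced to plain equality checks, and the child-constituent representation is invented for illustration.

```python
def apply_np1_adj_rule(sl_children):
    """Apply [NP1 ADJ] -> [ADJ NP1]: verify the rule's agreement
    constraints, then emit the English words with the adjective first.
    Each child is a dict with 'cat', 'words', and feature values."""
    if len(sl_children) != 2:
        return None
    np1, adj = sl_children
    if np1["cat"] != "NP1" or adj["cat"] != "ADJ":
        return None
    # ((X1 num) = (X2 num)), ((X1 gen) = (X2 gen)): noun-adjective agreement
    if np1["num"] != adj["num"] or np1["gen"] != adj["gen"]:
        return None
    # (X2::Y1)(X1::Y2): the adjective precedes the noun phrase in English
    return adj["words"] + np1["words"]

# "$MLH ADWMH" -> "RED DRESS" (glosses as on the slide; determiner omitted here)
noun = {"cat": "NP1", "words": ["DRESS"], "num": "s", "gen": "f"}
adj = {"cat": "ADJ", "words": ["RED"], "num": "s", "gen": "f"}
apply_np1_adj_rule([noun, adj])   # -> ["RED", "DRESS"]
```

If the agreement constraints fail (e.g. the adjective carries a different number or gender feature), the rule simply does not apply and returns None.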

French-English Transfer Grammar: Example Rules

(Automatically-acquired)

{PP,24691}
;;SL: des principes
;;TL: with the principles
PP::PP ["des" N] -> ["with the" N]
  ((X1::Y1))

{PP,312}
;;SL: dans le respect des principes
;;TL: in accordance with the principles
PP::PP [Prep NP] -> [Prep NP]
  ((X1::Y1) (X2::Y2))

Syntax-to-Semantics Differences

• Meaning in the TL is conveyed using a different syntactic structure than in the SL
  – Changes in the verb and its arguments
  – Passive constructions
  – Motion verbs and state verbs
  – Case creation and case absorption
• Main distinction from Structural Differences:
  – Structural differences are mostly independent of lexical choices and their semantic meaning; they can be addressed by transfer rules that are syntactic in nature
  – Syntax-to-semantics mapping differences are meaning-specific: they require the presence of specific words (and meanings) in the SL