Designing a Machine Translation Project Lori Levin and Alon Lavie Language Technologies Institute Carnegie Mellon University CATANAL Planning Meeting Barrow, Alaska


Designing a Machine Translation Project

Lori Levin and Alon Lavie

Language Technologies Institute

Carnegie Mellon University

CATANAL Planning Meeting

Barrow, Alaska

March 8-9, 2001


March 8-9, 2001 CATANAL Meeting - Barrow Alaska

Outline

• History of MT -- see the May 2000 issue of Wired magazine, available on the web.

• How well does it work?

• Procedure for designing an MT project.

• Choose an application: What do you want to do?

• Identify the properties of your application.

• Methods: knowledge-based, statistical/corpus based, or hybrid.

• Methods: interlingua, transfer, direct

• Typical components of an MT system.

• Typical resources required for an MT system.


How well does it work? Example: SpanAm

• Handout: Example from the SpanAm system of the Pan American Health Organization.

• Probably the best Spanish-English MT system.

• Around 20 years of development.


How well does it work? Example: Systran

• Try it on the Altavista web page.

• Many language pairs are available.

• Some language pairs might have taken up to a person-century of development.

• Can translate text on any topic.

• Results may be amusing.


How well does it work? Example: KANT

• Translates equipment manuals for Caterpillar.

• Input is controlled English: many ambiguities are eliminated. The input is checked carefully for compliance with the rules.

• Around 5 output languages.

• The output might be post-edited.

• The result has to be perfect to prevent accidents with the equipment.


How well does it work? Example: JANUS

• Translates spoken conversations about booking hotel rooms or flights.

• Six languages: English, French, German, Italian, Japanese, Korean (with partners in the C-STAR consortium).

• Input is spontaneous speech spoken into a microphone.

• Output is around 60% correct.

• Task Completion is higher than translation accuracy: users can always get their flights or rooms if they are willing to repeat 40% of their sentences.


How well does it work? Speech Recognition

• Jupiter weather information: 1-888-573-8255. You can say things like “what cities do you know about in Chile?” and “What will be the weather tomorrow in Santiago?”.

• Communicator flight reservations: 1-877-CMU-PLAN. You can say things like “I’m travelling to Pittsburgh.”

• Speechworks demo: 1-888-SAY-DEMO. You can say things like “Sell my shares of Microsoft.”

• These are all in English, and are toll-free only in the US, but they are speaker-independent and should work with reasonable foreign accents.


Different kinds of MT

• Different applications: for example, translation of spoken language or text.

• Different methods: for example, translation rules that are hand crafted by a linguist or rules that are learned automatically by a machine.

• The work of building an MT program will be very different depending on the application and the methods.


Procedure for planning an MT project

• Choose an application.

• Identify the properties of your application.

• List your resources.

• Choose one or more methods.

• Make adjustments if your resources are not adequate for the properties of your application.


Choose an application: What do you want to do?

• Exchange email or chat in Mapudungun and Spanish.

• Translate Spanish web pages about science into Mapudungun so that kids can read about science in their language.

• Scan the web: “Is there any information about such-and-such new fertilizer and water pollution?” Then if you find something that looks interesting, take it to a human translator.

• Answer government surveys about health and agriculture (spoken or written).

• Ask directions (“where is the library?”) (spoken).

• Read government publications in Mapudungun.


Identify the properties of your application.

• Do you need reliable, high quality translation?

• How many languages are involved? Two or more?

• Type of input.

• One topic (for example, weather reports) or any topic (for example, calling your friend on the phone to chat).

• Controlled or free input.

• How much time and money do you have?

• Do you anticipate having to add new topics or new languages?


Do you need high quality?

• Assimilation: Translate something into your language so that you can:

– understand it -- may not require high quality.

– evaluate whether it is important or interesting and then send it off for a better translation -- does not require high quality.

– use it for educational purposes -- probably requires high quality.


Do you need high quality?

• Dissemination: Translate something into someone else’s language, e.g., for publication.

• Usually should be high quality.


Do you need high quality?

• Two-Way: e.g., chat room or spoken conversation

• May not require high reliability on correctness if you have a native language paraphrase.

– Original input: I would like to reserve a double room.

– Paraphrase: Could you make a reservation for a double room.


Type of Input

• Formal text: newspaper, government reports, on-line encyclopedia.

– Difficulty: long sentences

• Formal speech: spoken news broadcast.

– Difficulty: speech recognition won’t be perfect.

• Conversational speech:

– Difficulty: speech recognition won’t be perfect

– Difficulty: disfluencies

– Difficulty: non-grammatical speech

• Informal text: email, chat

– Difficulty: non-grammatical speech


Resources

• People who speak the language.

• Linguists who speak the language.

• Computational linguists who speak the language.

• Text on paper.

• Text on line.

• Comparable text on paper or on line.

• Parallel text on paper or on line.

• Annotated text (part of speech, morphology, etc.)

• Dictionaries (mono-lingual or bilingual) on paper or online.

• Recordings of spoken language.

• Recordings of spoken language that are transcribed.

• Etc.


Methods: Knowledge-Based

• Knowledge-based MT: a linguist writes rules for translation:

– noun adjective --> adjective noun

• Requires a computational linguist who knows the source and target languages.

• Usually takes many years to get good coverage.

• Usually high quality.
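
As a minimal sketch of what a single hand-written rule might look like in code, the reordering rule on this slide can be applied to a part-of-speech-tagged phrase. The tag names, the toy Spanish-to-English lexicon, and the function names are illustrative assumptions, not taken from any system described in this talk.

```python
# Toy illustration of one hand-written rule: noun adjective --> adjective noun.
# The lexicon, tags, and function names are hypothetical.

LEXICON = {  # Spanish word -> (English word, part of speech)
    "el": ("the", "DET"),
    "coche": ("car", "NOUN"),
    "rojo": ("red", "ADJ"),
}

def apply_noun_adjective_rule(tagged):
    """Swap every NOUN that is immediately followed by an ADJ."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "NOUN" and out[i + 1][1] == "ADJ":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # move past the pair that was just reordered
        else:
            i += 1
    return out

def translate(sentence):
    """Look up each word, then apply the reordering rule."""
    tagged = [LEXICON[word] for word in sentence.lower().split()]
    return " ".join(word for word, _ in apply_noun_adjective_rule(tagged))

print(translate("el coche rojo"))  # the red car
```

A real knowledge-based system needs many such rules plus a large lexicon, which is one reason good coverage takes years.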


Methods: statistical/corpus-based

• Statistical and corpus-based methods involve computer programs that automatically learn to translate.

• The program must be trained by showing it a lot of data.

• Requires huge amounts of data.

• The data may need to be annotated by hand.

• Does not require a human computational linguist who knows the source and target languages.

• Could be applied to a new language in a few days.

• At the current state-of-the-art, the quality is not very good.


Methods: Interlingua

• An interlingua is a machine-readable representation of the meaning of a sentence.

– I’d like a double room/Quisiera una habitación doble.

– request-action+reservation+hotel(room-type=double)

• Good for multi-lingual situations. Very easy to add a new language.

• Probably better for limited domains -- meaning is very hard to define.
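
The following is a rough sketch of how the interlingua on this slide could be held as a data structure, with a toy analyzer and generator per language. The `Interlingua` class, its field names, the keyword-matching analyzer, and the sentence templates are all assumptions made for illustration; a real system would use parsers and generation grammars instead.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interlingua:
    """Hypothetical container for a language-neutral meaning."""
    speech_act: str
    concepts: tuple
    arguments: tuple  # (name, value) pairs

    def __str__(self):
        args = ", ".join(f"{name}={value}" for name, value in self.arguments)
        return "+".join((self.speech_act,) + self.concepts) + f"({args})"

# Toy per-language resources (assumed, hotel domain only).
ROOM_WORDS = {
    "en": {"double": "double", "single": "single"},
    "es": {"double": "doble", "single": "individual"},
}
TEMPLATES = {
    "en": "I'd like a {room} room.",
    "es": "Quisiera una habitación {room}.",
}

def analyze(sentence, lang):
    """Map a sentence to the interlingua by spotting a room-type word."""
    text = sentence.lower()
    for room_type, word in ROOM_WORDS[lang].items():
        if word in text:
            return Interlingua("request-action", ("reservation", "hotel"),
                               (("room-type", room_type),))
    raise ValueError(f"cannot analyze: {sentence!r}")

def generate(meaning, lang):
    """Render the interlingua in the requested language."""
    room_type = dict(meaning.arguments)["room-type"]
    return TEMPLATES[lang].format(room=ROOM_WORDS[lang][room_type])

meaning = analyze("Quisiera una habitación doble.", "es")
print(meaning)                  # request-action+reservation+hotel(room-type=double)
print(generate(meaning, "en"))  # I'd like a double room.
```

In this sketch, adding a language means adding one entry to each table; no existing language has to change.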


Multilingual Interlingual Machine Translation


Methods: Transfer

• A transfer rule tells you how a structure in one language corresponds to a different structure in another language:

– an adjective followed by a noun in English corresponds to a noun followed by an adjective in Spanish.

• Not good when there are more than two languages -- you have to write different transfer rules for each pair.

• Better than interlingua for unlimited domain.
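
A back-of-the-envelope count shows why transfer scales poorly as languages are added. The sketch assumes that each translation direction needs its own transfer rule set, and that an interlingua system needs one module into and one module out of the interlingua per language; the function names are hypothetical.

```python
def transfer_rule_sets(n_languages):
    """One rule set per ordered (source, target) pair."""
    return n_languages * (n_languages - 1)

def interlingua_modules(n_languages):
    """One analyzer and one generator per language."""
    return 2 * n_languages

for n in (2, 3, 6):
    print(n, transfer_rule_sets(n), interlingua_modules(n))
# 2 languages:  2 rule sets vs  4 modules
# 3 languages:  6 rule sets vs  6 modules
# 6 languages: 30 rule sets vs 12 modules
```

With two languages transfer needs less work; by six languages it needs more than twice as many pair-specific components under these assumptions.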


Methods: Direct

• Direct translation does not involve analyzing the structure or meaning of a language.

• For example, look up each word in a bilingual dictionary.

• Results can be hilarious: “the spirit is willing but the flesh is weak” can become “the wine is good, but the meat is lousy.”

• Can be developed very quickly.

• Can be a good back-up when more complicated methods fail to produce output.
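
Word-for-word lookup is simple enough to sketch in a few lines. The dictionary entries and the function name below are made up for illustration; unknown words are simply passed through unchanged.

```python
# Toy bilingual dictionary (hypothetical entries).
BILINGUAL_DICT = {
    "la": "the",
    "casa": "house",
    "blanca": "white",
    "es": "is",
    "grande": "big",
}

def direct_translate(sentence, dictionary):
    """Replace each word with its dictionary entry; keep unknown words as-is."""
    return " ".join(dictionary.get(word, word) for word in sentence.lower().split())

print(direct_translate("la casa blanca es grande", BILINGUAL_DICT))
# the house white is big
```

The output is understandable but keeps Spanish word order ("house white"), because nothing in the method looks at sentence structure.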


Components of a Knowledge-Based Interlingua MT System

• Morphological analyzer: identify prefixes, suffixes, and stem.

• Parser (sentence-to-syntactic structure for source language, hand-written or automatically learned)

• Meaning interpreter (syntax-to-semantics, source language).

• Meaning interpreter (semantics-to-syntax, target language).

• Generator (syntactic structure-to-sentence) for target language.
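
To make the first component concrete, here is a deliberately naive sketch of a morphological analyzer that splits a word into prefix, stem, and suffix. The affix lists and the minimum-stem-length check are assumptions chosen for a few English examples; real analyzers, especially for morphologically rich languages, are far more involved.

```python
# Hypothetical affix inventories for a toy English analyzer.
PREFIXES = ("un", "re")
SUFFIXES = ("ing", "ed", "s")
MIN_STEM = 3  # refuse to strip an affix if fewer letters than this would remain

def analyze_morphology(word):
    """Split a word into prefix, stem, and suffix by simple affix stripping."""
    prefix = next((p for p in PREFIXES
                   if word.startswith(p) and len(word) - len(p) >= MIN_STEM), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES
                   if rest.endswith(s) and len(rest) - len(s) >= MIN_STEM), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return {"prefix": prefix, "stem": stem, "suffix": suffix}

print(analyze_morphology("unlocked"))  # {'prefix': 'un', 'stem': 'lock', 'suffix': 'ed'}
print(analyze_morphology("rooms"))     # {'prefix': '', 'stem': 'room', 'suffix': 's'}
```

The output of this step would feed the parser, which in turn feeds the meaning interpreter.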


Resources for a knowledge-based Interlingua MT system

• Computational linguists who know the source and target languages.

• As large a corpus as possible so that the linguists can confirm that they are covering the necessary constructions, but the size of the corpus is not crucial to system development.

• Lexicons for the source and target languages, covering syntax, semantics, and morphology.

• A list of all the concepts that can be expressed in the system’s domain.


Components of Example Based MT: a direct statistical method

• A morphological analyzer and part of speech tagger would be nice, but not crucial.

• An alignment algorithm that runs over a parallel corpus and finds corresponding source and target sentences.

• An algorithm that compares an input sentence to sentences that have been previously translated, or whose translation is known.

• An algorithm that pulls out the corresponding translation, possibly slightly modifying a previous translation.
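
The matching and retrieval steps can be sketched with Python's standard `difflib` module. This assumes the parallel corpus is already sentence-aligned, uses word-level `SequenceMatcher` similarity as one possible matching measure, and leaves out the step that modifies a retrieved translation; the example pairs and function names are invented.

```python
from difflib import SequenceMatcher

# Toy sentence-aligned parallel corpus (hypothetical examples).
EXAMPLES = [
    ("quisiera una habitación doble", "I would like a double room"),
    ("quisiera una habitación individual", "I would like a single room"),
    ("dónde está la biblioteca", "where is the library"),
]

def similarity(a, b):
    """Word-level similarity between two sentences, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def translate_by_example(sentence, examples):
    """Return the translation of the most similar stored source sentence."""
    source = sentence.lower()
    best_source, best_target = max(
        examples, key=lambda pair: similarity(source, pair[0]))
    return best_target, similarity(source, best_source)

print(translate_by_example("Dónde está la biblioteca municipal", EXAMPLES)[0])
# where is the library
```

The returned score could be used to decide when the closest example is too far from the input to trust.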


Resources for Example Based MT

• Lexicons would improve quality of translation, but are not crucial.

• A large parallel corpus (hundreds of thousands of words).


“Omnivorous” Multi-Engine MT: eats any available resources


Approaches we have in mind

• Direct bilingual-dictionary lookup: because it is easy and is a back-up when other methods fail.

• Generalized Example-Based MT: because it is easy and fast and can also be a back-up.

• Instructable Transfer-based MT: a new, untested idea involving machine learning of rules from a human native speaker. Useful when computational linguists don’t know the language, and people who know the language are not computational linguists.

• Conventional, hand-written transfer rules: in case the new method doesn’t work.
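
One simple way to use cheaper methods as back-ups is to try engines in order and take the first one that produces output. The sketch below is a simplification under that assumption; a multi-engine system could instead combine or rank the outputs of several engines. The engine functions and their toy data are hypothetical.

```python
# Toy resources for two hypothetical engines.
KNOWN_SENTENCES = {"dónde está la biblioteca": "where is the library"}
WORDS = {"la": "the", "biblioteca": "library", "casa": "house"}

def example_engine(sentence):
    """Return a stored translation, or None if the sentence is unknown."""
    return KNOWN_SENTENCES.get(sentence.lower())

def dictionary_engine(sentence):
    """Word-for-word lookup; always produces some output."""
    return " ".join(WORDS.get(word, word) for word in sentence.lower().split())

ENGINES = [("example-based", example_engine), ("dictionary", dictionary_engine)]

def translate_with_fallback(sentence, engines):
    """Try each engine in order and return the first non-empty result."""
    for name, engine in engines:
        result = engine(sentence)
        if result is not None:
            return name, result
    return None, sentence  # no engine produced output

print(translate_with_fallback("Dónde está la biblioteca", ENGINES))
# ('example-based', 'where is the library')
print(translate_with_fallback("la casa", ENGINES))
# ('dictionary', 'the house')
```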