
New Directions in Machine Translation: Introduction
陳惠群, Academia Sinica, Institute of Linguistics / Institute of Information Science



10/22/1998


Why MT Matters
• Economics
  – Costs / Quality / Turnaround
  – Many MT developers, customers, and sponsors have already invested heavily for years.
• Politics
  – Multilingual Countries / Minority Languages
• Intelligence Gathering
  – Governments / Companies / Individuals
• Research
  – AI / CS / Linguistics / Psychology / and so on


Recent Trends
• PC-based MT Systems
• Online MT Services, MT on Demand
  – Email, Web pages, Uploads
• Sub-language MT Systems
• Dialog-based (Speech-to-Speech) MT Systems
• Computer-Assisted Translation


Classifying MT Systems
• Operations
  • Fully-Automatic MT
  • Semi-automatic MT
  • Computer-Assisted Translation (CAT Tools)
• Input
  • Unrestricted Texts
  • Restricted Texts (e.g. Technical Manuals) / texts written with MT in mind
  • Sub-languages / Controlled languages
• Quality
  • High / Low / Acceptable / Applicable / Readable
  • How to evaluate an MT system?
• Strategies (see MT Strategies below)


MT Strategies
• Fundamentals
  • Direct Translation MT
  • Transfer-based MT
  • Interlingua MT
  • Linguists vs. Empiricists
• New Strategies
  • Knowledge-based MT
  • Example-based MT
  • Statistics-based MT
  • Hybrid MT
    – Japanese manufacturers know well that a single linguistic theory cannot lead to a good MT system. They realize that a huge amount of language phenomena must be processed in an ad-hoc manner. (M. Nagao)


Direct MT
Source Text → Simple syntactic analysis (disambiguation) → Bilingual lexicon (word-by-word translation) → Re-ordering rules → Target Text
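The stages named on this slide can be illustrated with a toy example. The following is a minimal sketch, not a real system: the English-to-Spanish `LEXICON`, its part-of-speech tags, and the single adjective-noun re-ordering rule are all hypothetical, and the analysis stage is reduced to a tag lookup.

```python
# Toy direct MT: dictionary lookup plus one local re-ordering rule.
# LEXICON and the adjective-noun rule are hypothetical illustrations.

LEXICON = {
    "the": ("el", "DET"),
    "red": ("rojo", "ADJ"),
    "car": ("coche", "NOUN"),
    "runs": ("corre", "VERB"),
}

def translate_words(source_words):
    """Word-by-word translation; unknown words pass through tagged UNK."""
    return [LEXICON.get(w, (w, "UNK")) for w in source_words]

def reorder(tagged):
    """Swap each ADJ NOUN pair so the adjective follows the noun."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out

def direct_translate(sentence):
    tagged = translate_words(sentence.lower().split())
    return " ".join(word for word, _ in reorder(tagged))

print(direct_translate("The red car runs"))
```

Because every decision is local to a word or a pair of adjacent words, anything the lexicon or the rule set does not anticipate passes through untranslated or unordered, which is the usual limitation attributed to direct systems.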


Transfer-based MT
Source Text (ST) → ST analysis → ST Structure → structure transfer → TT Structure → TT generation → Target Text (TT)
• ST analysis uses: SL grammar & lexicon
• structure transfer uses: SL-TL lexicon & transfer rules
• TT generation uses: TL grammar & lexicon
SL - source language; TL - target language


Interlingua-based MT
Source Text (ST) → ST analysis → Interlingua representation (+SL-TL lexicon) → TT generation → Target Text (TT)
• ST analysis uses: SL grammar & lexicon
• TT generation uses: TL grammar & lexicon


Knowledge-based MT
• All world knowledge? A long-term research goal
• Practical Systems: e.g. CMU’s KANT
  – narrow domain
  – domain model: defines all semantic classes and instances to represent all concepts in the domain
  – each concept definition includes:
    • concept head (name of the concept)
    • slots: allowable semantic roles
    • fillers: allowable concept classes that the roles can contain
  – disambiguation by filler restriction
  – knowledge acquisition
    • automatic or semi-automatic
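To make the head / slots / fillers idea concrete, here is a minimal sketch of disambiguation by filler restriction. It is not KANT's actual representation: the concept names, the `IS_A` table, and the two senses of "replace" are hypothetical, chosen only to show how slot restrictions can rule out a reading.

```python
# Toy domain model: each concept has a head, slots (semantic roles),
# and fillers (concept classes each slot accepts). All names are hypothetical.

IS_A = {"printer": "device", "cartridge": "part", "operator": "human"}

CONCEPTS = {
    # Two candidate senses of the verb "replace".
    "replace-part": {"agent": {"human"}, "theme": {"part"}},
    "substitute-for": {"agent": {"device"}, "theme": {"device"}},
}

def concept_class(word):
    """Map a word to its concept class, or None if it is outside the domain."""
    return IS_A.get(word)

def disambiguate(senses, roles):
    """Keep the senses whose filler restrictions accept every role filler."""
    matches = []
    for head in senses:
        slots = CONCEPTS[head]
        if all(
            role in slots and concept_class(filler) in slots[role]
            for role, filler in roles.items()
        ):
            matches.append(head)
    return matches

print(disambiguate(["replace-part", "substitute-for"],
                   {"agent": "operator", "theme": "cartridge"}))
```

A narrow domain keeps tables like these small enough to build and check, which is one reason practical knowledge-based systems restrict their coverage.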


Example-based MT
• A companion module to improve MT quality
• Typically includes the following (Nirenburg 1995):
  – sentence-aligned corpus
  – intra-language matching
    • find chunks from the source-language part of the corpus which are the best candidates for matching an input chunk
  – inter-language matching
    • find the target-language chunk corresponding to the chunk from the source-language part of the corpus
  – chunk combination
The PANGLOSS Mark III Machine Translation System. S. Nirenburg, Technical Report CMU-CMT-95-145. 1995. (available online at http://www.lti.cs.cmu.edu/Research/CMT-home.html)
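The two matching steps can be sketched as follows. This is an illustration under stated assumptions, not the PANGLOSS implementation: the three-pair `CORPUS` is invented, chunks are whole sentences, and similarity is the word-level ratio from Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

# Hypothetical sentence-aligned corpus of (source chunk, target chunk) pairs.
CORPUS = [
    ("press the start button", "appuyez sur le bouton de démarrage"),
    ("open the front cover", "ouvrez le capot avant"),
    ("close the front cover", "fermez le capot avant"),
]

def similarity(a, b):
    """Word-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def best_match(input_chunk):
    """Intra-language matching: the closest source-side chunk in the corpus."""
    return max(CORPUS, key=lambda pair: similarity(input_chunk, pair[0]))

def translate_chunk(input_chunk):
    """Inter-language matching: return the target side aligned to the best match."""
    source, target = best_match(input_chunk)
    return target, similarity(input_chunk, source)

target, score = translate_chunk("open the rear cover")
print(target, round(score, 2))
```

The retrieved target still says "avant" (front) although the input said "rear". Closing that gap is the job of the chunk-combination step, which this sketch leaves out.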


Statistics-based MT (1)
• Maximize Pr(S|T) = Pr(S) Pr(T|S) / Pr(T)
  • Pr(S): source language model
  • Pr(T|S): translation model
    – lexical translation, distortion, and fertility
• Some comments: (Machine Translation 7(4))
  – I joined the attack … without realizing that precisely what the research was doing was to question some of the fundamental assumptions underlying MT research since 1966 … With hindsight, I can see that what this research was doing was saying that in the 20 years since ALPAC, the second generation architecture had led to only slightly better results than the architecture it replaced … (Harold Somers)
  – My initial reaction was the same as Somers. … The integration of a CANDIDE-type engine into a traditional MT architecture should probably [be] at the deepest level the architecture allows (John White)
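The formula at the top of this slide says: given a text T, pick the S that maximizes Pr(S|T). Since Pr(T) does not depend on S, it is enough to maximize Pr(S) Pr(T|S). A minimal sketch of that selection step follows; the candidate sentences and every probability value are hypothetical toy numbers, where a real system would estimate both models from large corpora.

```python
import math

# Hypothetical toy probabilities; a real system estimates these from corpora.
LANGUAGE_MODEL = {            # Pr(S)
    "the house is small": 0.020,
    "small the house is": 0.001,
    "the home is small": 0.010,
}
TRANSLATION_MODEL = {         # Pr(T | S) for one fixed T
    "the house is small": 0.30,
    "small the house is": 0.30,
    "the home is small": 0.20,
}

def score(candidate):
    """log Pr(S) + log Pr(T|S); Pr(T) is constant for a fixed T and is dropped."""
    return math.log(LANGUAGE_MODEL[candidate]) + math.log(TRANSLATION_MODEL[candidate])

def decode(candidates):
    """Return the candidate S that maximizes Pr(S) * Pr(T|S)."""
    return max(candidates, key=score)

best = decode(list(LANGUAGE_MODEL))
print(best)
```

Note how the two models divide the work: the scrambled candidate gets the same translation-model score as the fluent one and is rejected only by the language model.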


Statistics-based MT (2)
• Machine Translation 7(4)
  – ... not only does it need no linguistics or linguists, but no foreign speakers either. ... about 43% of sentences correctly translated. That compares badly with SYSTRAN which is usually assigned figures of around 65% … even if it did equal SYSTRAN’s level of performance, it is not clear what inferences we should draw. … we must always remember that they need millions of words of parallel texts even to start … The problems noted then were of long-distance dependencies: … French and English … were a lucky choice … we have good historical reasons for believing that a purely statistical method cannot do high-quality MT (Yorick Wilks)
• Word alignment


Evaluation
• Traditional Evaluation Metrics (Church & Hovy)
  – System-based Metrics
    • easy to measure, but only for a particular system
    • e.g. 60 sub-grammars, 900 rewriting rules, …
  – Text-based Metrics
    • sentence-based metrics
      – e.g. # of semantically or syntactically correct sentences
    • comprehensibility metrics
    • amount-of-post-editing metrics
  – Cost-based Metrics: cost & time (per N words)
  – Demos (must avoid being misleading)
• Developer’s view or Customer’s view
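One simple way to put a number on "amount of post-editing" is to count the word-level edits separating the raw MT output from its post-edited version. The sketch below is an assumption for illustration, not a metric defined on this slide or taken from Church & Hovy; the function names and the example sentences are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def post_edit_rate(mt_output, post_edited):
    """Edits per word of the post-edited text (lower is better)."""
    hyp, ref = mt_output.split(), post_edited.split()
    return edit_distance(hyp, ref) / max(len(ref), 1)

rate = post_edit_rate("the cat sat in mat", "the cat sat on the mat")
print(round(rate, 2))
```

A count like this measures keystroke-level effort only; it says nothing about how long the post-editor spent deciding what to change, which is where the cost-based metrics come in.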


Some MT Problems
• Morphological ambiguity
• Lexical ambiguity and structural ambiguity
• Lexical mismatch and structural mismatch
• Idioms and collocations
• Ill-formed input
• World knowledge


CAT Tools
• Pre-editing and post-editing environments with linguistic analyses
• Translation Memory
  – As the translator translates the text, each sentence (translation unit) is also saved automatically to a sophisticated translation unit database memory. As he translates, any similar sentence already in the memory will appear on screen for editing. (Ian Gordon)
• Alignment Tools
• Terminology Management
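The translation-memory behaviour described in the quotation above (save each unit, then surface similar stored sentences) can be sketched in a few lines. This is a minimal illustration, not how any commercial tool works: the `TranslationMemory` class, its 0.7 threshold, and the sample sentences are hypothetical, and similarity is the character-level ratio from Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

class TranslationMemory:
    """Stores translation units and retrieves similar source sentences."""

    def __init__(self, threshold=0.7):
        self.units = []             # list of (source sentence, target sentence)
        self.threshold = threshold  # hypothetical similarity cut-off

    def save(self, source, target):
        """Record one translation unit as the translator completes it."""
        self.units.append((source, target))

    def lookup(self, sentence):
        """Return (score, source, target) matches at or above the threshold, best first."""
        scored = [
            (SequenceMatcher(None, sentence.lower(), src.lower()).ratio(), src, tgt)
            for src, tgt in self.units
        ]
        return sorted((m for m in scored if m[0] >= self.threshold), reverse=True)

tm = TranslationMemory()
tm.save("Turn off the printer.", "Éteignez l'imprimante.")
tm.save("Remove the paper tray.", "Retirez le bac à papier.")
matches = tm.lookup("Turn off the scanner.")
for score, src, tgt in matches:
    print(f"{score:.2f}  {src}  ->  {tgt}")
```

The match is offered for editing rather than inserted as is: the stored translation still mentions the printer, so the translator has to adapt it.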


Standards
• Exchange Standard
  – (Multilingual) Text Formats
  – Lexicons
  – Knowledge Bases
  – Translation Memories
• Evaluation Standard


Future Directions
• Exploratory Research or Prototype Research?
• Modular Design (cf. Somers’ Comments)
• Better Linguistic Theories
• Lexicon Construction
• Hybrid MT (Mainline MT engine + Additional Modules)
• Spoken Language (Dialog-based) MT
• MT Evaluation
• Computer-Assisted Translation / User-Friendly Environment
• Sub-language MT Systems
• Distributed MT / Networked MT
• MT on Demand


References
– Journal of Machine Translation (Kluwer)
– Proceedings of TMI, MT Summit, AMTA
– Proceedings of ACL, COLING, ROCLING
– E-Print Archive http://xxx.lanl.gov/cmp-lg/
– AAMT http://www.jeida.or.jp/aamt/index-e.html
– EAMT http://www.lim.nl/eamt/
– The Association for Computational Linguistics http://www.cs.columbia.edu/~acl/
– The LINGUIST List http://www.linguistlist.org/
– Translation Research Group http://www.ttt.org/index.html
– Localization Industry Standards Association (LISA) http://www.lisa.unige.ch/


References (continued)
– ISI @ USC http://www.isi.edu/natural-language/nlp-at-isi.html
– CMU/LTI http://www.lti.cs.cmu.edu/Research/CMT-home.html
– Verbmobil http://www.dfki.de/verbmobil/
– C-STAR II http://www.is.cs.cmu.edu/cstar/
– GETA http://durian.imag.fr/
– Machine Translation at PAHO (ACG/T) http://www.paho.org/english/machine.htm
– METEO http://padina.info.umoncton.ca/chandioux/meteoe.html
– WordNet Bibliography http://www.cis.upenn.edu/~josephr/wn-biblio.html


References (continued)
– Globalink, Inc. http://www.globalink.com/
– SYSTRAN http://www.systransoft.com/
– Logos Corporation http://www.logos-ca.com/
– TRADOS http://www.trados.com/
– A.I.SOFT http://www.aisoft.co.jp/
– CSK Home Page http://www.csk.co.jp/home_e.html
– SHARP SOFT http://www.sharp.co.jp/sc/excite/soft_map/soft.htm
– OKI Software http://www.okisoft.co.jp/
– KODENSHA http://www1.mesh.ne.jp/KODENSHA/
– ASTRANSAC http://eiplaza.toshiba.co.jp/products/transac/