Transforming Parallel Corpora to Translation Memory Steve Legrand IPN 29th Sept. 2006

Transforming Parallel Corpora to Translation Memory

Steve Legrand

IPN

29th Sept. 2006

Parallel text or bitext

Aligned translation of text from one language to another.

Practical uses in NLP:
- Word sense disambiguation
- Automatic translation
- Translation memories

Translation Memory

Helps the translator by using already translated text segments to cue in the translation of new text segments

The minimum match level between a new segment and a stored segment can usually be set (e.g., 56%).

Automatic translation can be combined with translation memories: post-edited output of automatic translation can be stored for translation memory use.
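As a rough illustration of a match level, the sketch below scores a new segment against stored segments and keeps only those at or above a threshold. The in-memory dictionary, the function names, and the use of Python's difflib ratio as the similarity measure are all assumptions made for this example; real translation memory tools compute their match scores in their own ways.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory translation memory: source segment -> stored translation.
MEMORY = {
    "The house is red.": "La casa es roja.",
    "The dog is barking.": "El perro está ladrando.",
}

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two segments (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def suggest(segment: str, memory: dict, threshold: float = 0.56) -> list:
    """Return (score, source, translation) tuples at or above threshold, best first."""
    hits = [(similarity(segment, src), src, tgt) for src, tgt in memory.items()]
    return sorted((h for h in hits if h[0] >= threshold), reverse=True)

# Show the stored segments that would be offered for a new, similar segment.
for score, src, tgt in suggest("The house is green.", MEMORY):
    print(f"{score:.0%}  {src}  ->  {tgt}")
```

Raising the threshold gives fewer but closer suggestions; lowering it gives more suggestions that need heavier editing.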

Translation memory format (.tmx)

.tmx (Translation Memory eXchange) is a standardized XML format for application interoperability.

tu: translation unit, the parent element of a segment and its translations. It can carry a unique identifier (tuid).

tuv: translation unit variant, the element that holds one language version of the unit, marked with its language code (xml:lang).

seg: segment, contains the text in that language.
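The nesting of tu, tuv and seg can be sketched with Python's standard xml.etree.ElementTree. This is a minimal illustration, not a complete TMX writer: the function name build_tmx, the tool name, and the header values are placeholders chosen for the example.

```python
import xml.etree.ElementTree as ET

def build_tmx(pairs, src_lang="en", tgt_lang="es"):
    """Build a minimal TMX element tree from (source, target) string pairs."""
    tmx = ET.Element("tmx", version="1.4")
    # Header values below are placeholders for this sketch.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "none",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for i, (src, tgt) in enumerate(pairs):
        tu = ET.SubElement(body, "tu", tuid=str(i))       # one translation unit
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})  # one language variant
            ET.SubElement(tuv, "seg").text = text         # the segment text
    return tmx

pairs = [("CHAPTER I--THE TRAIL OF THE MEAT", "PRIMERA PARTE -- La pista de la carne")]
print(ET.tostring(build_tmx(pairs), encoding="unicode"))
```

Each tu holds one tuv per language, and each tuv holds one seg, which matches the element descriptions above.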

TMX Example

Poor man’s guide to translation memories

Trados is the best known and probably one of the best commercial TM applications available.

There are cheaper one-user versions, but in spite of that the price is often prohibitive.

To avoid excessive costs, one could:
– Use a demo version of the commercial software
– Use Open Source products.

OmegaT

Open Source translation memory

Needs the Java runtime

Needs Open Office to convert .doc format to .odt or .sxw format (open standard)

Creates .tmx-files

.tmx-files can also be exported from other applications

Parallel corpora to tmx

To use a parallel corpus as a translation memory, we first need to convert it to the tmx format.

We can either use an existing parallel corpus or create our own.

There are many open web resources for creating our own parallel corpora.

Using open parallel corpora resources – English source

Jack London published about 40 books in English. Almost all his English-language works are publicly available at:

– Project Gutenberg: http://www.gutenberg.org/wiki/Main_Page

Using open parallel corpora resources – Spanish source(s)

Among the many sources of Spanish translations of Jack London’s books there is:

http://apuntes.rincondelvago.com/trabajos_global/literatura/

Aligning parallel texts

For example:

Download “White Fang” by Jack London from Project Gutenberg and its translation “Colmillo Blanco” from rincondelvago

Use bitext2tmx (free open source application) for alignment
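To give a feel for what alignment involves, here is a deliberately naive sketch: split both texts into sentences, pair them by position, and flag pairs whose lengths differ a lot. This is not how bitext2tmx works internally (its method is not described here); the function names, the regular expression, the max_ratio parameter, and the sample sentences are all assumptions made for illustration.

```python
import re

def split_sentences(text: str) -> list:
    """Naively split text into sentences after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def align_by_position(src_text: str, tgt_text: str, max_ratio: float = 2.0) -> list:
    """Pair sentences one-to-one by position.

    Returns (source, target, plausible) tuples; plausible is False when one
    sentence is more than max_ratio times longer than the other.
    """
    src, tgt = split_sentences(src_text), split_sentences(tgt_text)
    aligned = []
    for s, t in zip(src, tgt):  # zip stops at the shorter list
        ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
        aligned.append((s, t, ratio <= max_ratio))
    return aligned

# Hypothetical sample text, not taken from the books.
english = "It was cold. The dogs ran."
spanish = "Hacía frío. Los perros corrían."
for s, t, ok in align_by_position(english, spanish):
    print(("OK   " if ok else "CHECK"), s, "|", t)
```

Positional pairing breaks as soon as a translator merges or splits a sentence, which is why the result of any automatic alignment should be reviewed and corrected by hand before it is saved as a translation memory.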

bitext2tmx aligner: configuration

bitext2tmx aligner: text alignment

Bitext2tmx producing a tmx-file

<?xml version="1.0" encoding="ISO-8859-1"?>
<tmx version="1.1">
  <header creationtool="Bitext2tmx" creationtoolversion="0.9" segtype="sentence" o-tmf="Bitext2tmx" adminlang="en" srclang="en" datatype="PlainText" o-encoding="ISO-8859-1"></header>
  <body>
    <tu tuid="0" datatype="Text">
      <tuv lang="en">
        <seg>CHAPTER I--THE TRAIL OF THE MEAT</seg>
      </tuv>
      <tuv lang="es">
        <seg>PRIMERA PARTE -- La pista de la carne</seg>
      </tuv>
    </tu>
    <tu tuid="1" datatype="Text">
      <tuv lang="en">
        <seg>Dark spruce forest frowned on either side the frozen waterway.</seg>
      </tuv>
      <tuv lang="es">
        <seg>Aun lado y a otro del helado cauce de erguía un oscuro bosque de abetos de ceñudo aspecto.</seg>
      </tuv>
    </tu>

The tmx-file produced by bitext2tmx can be added to OmegaT’s tm directory to be used as part of the translation memory.

Other tools with OmegaT

.tmx-files can be cleaned with tmxcleaner

.tmx-files can be merged with tmxmerger

.tmx-files can be validated with tmxvalidator

– (can be downloaded from the OmegaT site)

It is important at least to validate the files before adding them to OmegaT’s translation memory.
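As a quick sanity check before running a proper validator, something like the sketch below can catch the most obvious problems. It is not a substitute for tmxvalidator and does not check a file against the TMX specification; the function name check_tmx and the particular checks are assumptions chosen for this example.

```python
import xml.etree.ElementTree as ET

# Parsed form of the xml:lang attribute; older files may use a plain "lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def check_tmx(xml_text: str) -> list:
    """Return a list of problems found; an empty list means the basic checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    problems = []
    if root.tag != "tmx":
        problems.append(f"root element is <{root.tag}>, expected <tmx>")
    body = root.find("body")
    if body is None:
        return problems + ["missing <body>"]
    for n, tu in enumerate(body.findall("tu")):
        tuvs = tu.findall("tuv")
        if len(tuvs) < 2:
            problems.append(f"tu #{n}: fewer than two <tuv> variants")
        for tuv in tuvs:
            if not (tuv.get(XML_LANG) or tuv.get("lang")):
                problems.append(f"tu #{n}: <tuv> without a language code")
            seg = tuv.find("seg")
            if seg is None or not "".join(seg.itertext()).strip():
                problems.append(f"tu #{n}: missing or empty <seg>")
            # Text directly inside <tuv>, outside its child elements, is suspicious.
            if (tuv.text or "").strip() or any((c.tail or "").strip() for c in tuv):
                problems.append(f"tu #{n}: stray text inside <tuv>")
    return problems

# Hypothetical one-unit file used only to exercise the checks.
sample = ('<tmx version="1.1"><header/><body><tu tuid="0">'
          '<tuv lang="en"><seg>Hello</seg></tuv>'
          '<tuv lang="es"><seg>Hola</seg></tuv></tu></body></tmx>')
print(check_tmx(sample) or "basic checks passed")
```

A file that is cut off before its closing tags, or that has text typed between elements, is reported here rather than being silently loaded into the translation memory.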

Current work: Using these Open Source resources, translating a book from English to Spanish with the students of applied linguistics at Colima University with IPN backing. Ready by the middle of November.

Linguistica

Computacional

Save your money. Use Open Source!