Corpora and Concordances in Human Translation, Machine Translation, and Language/Translator Education

Barbara Gawronska
Dept. of Foreign Languages and Translation, University of Agder, Kristiansand, Norway

Oslo, Oct 2008

Outline

Concordance Tools for Human Users
- concordances in the past and now
- how can language learners, language teachers, and translators benefit from modern concordance tools: some examples

Monolingual Corpora in Machine Translation (MT)
- how can MT system developers benefit from monolingual corpora?

Between MT and human translation: electronic translation aids

Concordance Tools for Human Users

Concordance – the traditional definition and the modern developments

http://en.wiktionary.org/wiki/concordance: "An alphabetical verbal index showing the places in the text of a book where each principal word may be found, with its immediate context in each place"

This definition is a little old-fashioned, since modern tools also allow searching for excerpts where the context is defined in terms of parts of speech. The search results may be sorted in different ways: by frequency, or by collocational strength calculated with different statistical measures.
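The traditional definition quoted above (each word with its immediate context) can be illustrated with a minimal keyword-in-context sketch. The function name `kwic`, the window size, and the sample text are assumptions made for illustration; they do not describe any particular tool.

```python
import re

def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Invented sample text, not taken from any corpus.
sample = "There was a roar of laughter. Then a burst of laughter followed the joke."

for left, kw, right in kwic(sample, "laughter"):
    # Right-align the left context so the keyword forms a column.
    print(f"{left:>20} [{kw}] {right}")
```

Sorting the hits by their left or right context, or counting them, gives the frequency-ordered views mentioned above.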

Concordance – a very useful complement to a dictionary (for some reason, seldom employed in language education and translator education…)

Words and phrases can be looked up by their context; the user can easily get an overall picture of the usage of a word.

a &noun of laughter
- a roar of laughter
- a burst of laughter
- a round of laughter
- a bray of laughter
- a fit of laughter

a pack of &noun
- a pack of wolves
- a pack of thieves
- a pack of liers
- a pack of bloodhounds
- a pack of flatheads
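Searches such as `a &noun of laughter` combine literal words with a part-of-speech variable. The sketch below shows one way such a pattern could be matched against a tagged corpus. The tiny hand-tagged corpus, the tag names, and the function `search` are assumptions for illustration, not the query engine of any real concordance tool.

```python
from collections import Counter

# Toy corpus of (word, part-of-speech) pairs, tagged by hand for this example.
TAGGED = [
    ("a", "det"), ("roar", "noun"), ("of", "prep"), ("laughter", "noun"),
    ("and", "conj"),
    ("a", "det"), ("burst", "noun"), ("of", "prep"), ("laughter", "noun"),
    ("then", "adv"),
    ("a", "det"), ("roar", "noun"), ("of", "prep"), ("laughter", "noun"),
    ("from", "prep"),
    ("a", "det"), ("pack", "noun"), ("of", "prep"), ("wolves", "noun"),
]

def search(tagged, pattern):
    """Return matching phrases; a slot '&tag' matches any word carrying that tag."""
    slots = pattern.split()
    matches = []
    for start in range(len(tagged) - len(slots) + 1):
        span = tagged[start:start + len(slots)]
        ok = all(
            (tag == slot[1:]) if slot.startswith("&") else (word == slot)
            for slot, (word, tag) in zip(slots, span)
        )
        if ok:
            matches.append(" ".join(word for word, _ in span))
    return matches

# Sort the results by raw frequency, most frequent first.
hits = Counter(search(TAGGED, "a &noun of laughter"))
for phrase, freq in hits.most_common():
    print(freq, phrase)
```

The same function handles the second query form, for example `search(TAGGED, "a pack of &noun")`.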

An example: Corpus Culler, a multilanguage concordance tool (LexwareLabs AB, www.lexwarelabs.com, a company cooperating with Högskolan i Skövde)

Excerpts are selected from corpora of modern English/Swedish/Polish of about 30 million word tokens each

Authentic modern language – corpora collected from the web

Pre-defined part-of-speech variables can be combined with variables defined by the user

Statistical information available (raw frequency and different association measures: MI, Chi-square, T-score etc.)

Translation equivalents listed by Lexin

English entry word: pack
Swedish translation: koppel, släpp, flock, skock (substantiv)
Composition:
- a pack of dogs --- ett koppel hundar
- a pack of thieves --- en tjuvliga
- you're just telling a pack of lies!

English entry word: flock
Swedish translation: (om djur) skock, flock (substantiv) oordnad grupp; hjord
Example:
- a flock of sheep --- en skock får

Concordance complements the dictionary information (LexwareLabs AB, www.lexwarelabs.com)

A concordance may facilitate different educational tasks, and help the translator. It can e.g. show the use of prepositions and particles with verbs…

…stylistic variation, semantically similar words

Distinctions between semantically similar words that are not exact synonyms; valency patterns (e.g. enable/facilitate)

The usage of "support verbs": results of search for "göra &noun" (Swedish göra = 'do, make')

Results of search for göra &noun, statistics

Finding translation equivalents, avoiding interference: Swedish "göra motstånd" compared to Polish "&verb opór" (opór = motstånd = 'resistance')

The concordance tool is connected to dictionaries; a word may be seen in a broader context, and dictionary definitions pop up on request

The dictionary function is also available for Swedish

Even specialized dictionaries…

Non-dictionary expressions and new expressions (the variable &new)

Absent in dictionaries:
- new words
- new multiwords
- tricky new formations (but also misspelled words)

WordNet: 155,327 entries, 207,016 senses
UMLS: 975,354 concepts and 2.4 million concept names

fridge googling, chief hacking officer

cyberchondriac, spamdexing, b4, lol, cul8er, håglöss, nydist, syniker, turktumlare, Kristi himmelfärsdag…

pneumonoultramicroscopicsilicovolcanoconiosis, fusioneer, biofraud
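The idea behind a variable like &new, matching expressions that no dictionary lists, can be sketched as a simple word-list lookup. The word list `KNOWN`, the function `new_expressions`, and the sample sentence are hypothetical; a real system would consult full dictionaries and would also have to handle multiword units and inflected forms.

```python
import re

# Hypothetical mini word list standing in for a real dictionary.
KNOWN = {
    "the", "patient", "is", "a", "and", "keeps",
    "googling", "symptoms", "all", "day",
}

def new_expressions(text, known=KNOWN):
    """Return tokens missing from the word list, in order of first appearance."""
    seen = set()
    result = []
    for tok in re.findall(r"\w+", text.lower()):
        if tok not in known and tok not in seen:
            seen.add(tok)
            result.append(tok)
    return result

# Invented sample sentence.
text = "The patient is a cyberchondriac and keeps googling symptoms all day lol"
print(new_expressions(text))
```

Such a lookup cannot tell a genuinely new formation from a misspelling; both simply fail the dictionary check, which is why candidates need further filtering.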

Corpora and Concordances in Machine Translation

Stochastic MT

Example-based Machine Translation (EBMT)

Statistical MT (Brown et al. 1993)

KBMT

Knowledge Based Machine Translation (KBMT): Nirenburg et al., Hobbs, Wilks, and others
- knowledge stored in lexicons, onomasticons, and ontologies
- rule-based parsing and semantico-pragmatic analysis aimed at conceptual representations

Is merging of the different approaches possible? YES. The MT system Verbmobil (a simplified version of Figure 11, p. 17 in Wahlster 2000)

How can MT and automated Information Extraction be facilitated by unilingual corpora and concordances?

- statistics-based term extraction
- automatic extraction of inflectional information
- collocation extraction
- extraction of knowledge about concrete and metaphorical usage of words and phrases

Extracting English compounds from a biomedical corpus (Dura, Gawronska, and Erlendsson 2006)

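Tscore is computed from f(x,y) and the product f(x)f(y) (the layout of the formula is not recoverable from the source), where:

- f(x): corpus frequency of word x
- f(x,y): corpus frequency of word pair (x, y)
- N: total number of words in the corpus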
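The quantities f(x), f(y), f(x,y), and N defined above are the inputs of t-score style association measures. The sketch below uses a widely used textbook form, t = (f(x,y) - f(x)f(y)/N) / sqrt(f(x,y)). This is an assumption and may differ from the variant used by Dura, Gawronska, and Erlendsson (2006). The function name and all counts are hypothetical.

```python
import math

def t_score(f_xy, f_x, f_y, n):
    """Textbook t-score: (observed - expected) / sqrt(observed)."""
    if f_xy <= 0:
        raise ValueError("the pair must occur at least once")
    expected = f_x * f_y / n  # co-occurrences expected if x and y were independent
    return (f_xy - expected) / math.sqrt(f_xy)

# Hypothetical counts for candidate pairs in a corpus of one million tokens:
# (f(x,y), f(x), f(y))
N = 1_000_000
candidates = {
    ("cell", "cycle"): (20, 100, 200),
    ("the", "cell"): (30, 60_000, 100),
}

ranked = sorted(candidates, key=lambda pair: t_score(*candidates[pair], N), reverse=True)
for pair in ranked:
    print(pair, round(t_score(*candidates[pair], N), 2))
```

A pair built on a very frequent word can still score well on raw frequency alone, which is one reason to compare several association measures rather than rely on a single one.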

Extracting Latin terms from a biomedical corpus (Dura, Gawronska, and Erlendsson 2006)

$$MI(x,y) = H(x) + H(y) - H(x,y)$$

H(x): entropy of word x, $H(x) = -p(x)\log_2 p(x)$, where probability $p(x) = f(x)/N$

$$\max\left(\frac{MI(x,y)}{H(x)},\ \frac{MI(x,y)}{H(y)}\right)$$
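The mutual information score and its normalization by H(x) and H(y) can be computed directly from corpus counts. The sketch below follows the definitions above and assumes that H(x,y) is obtained in the same way from the pair frequency, with p(x,y) = f(x,y)/N; that assumption, the function names, and the counts are illustrative only.

```python
import math

def h(freq, n):
    """Entropy term -p * log2(p) with p = freq / n."""
    p = freq / n
    return -p * math.log2(p) if p > 0 else 0.0

def mi(f_xy, f_x, f_y, n):
    """MI(x,y) = H(x) + H(y) - H(x,y)."""
    return h(f_x, n) + h(f_y, n) - h(f_xy, n)

def normalized_mi(f_xy, f_x, f_y, n):
    """max(MI/H(x), MI/H(y)); assumes both words occur and neither fills the corpus."""
    m = mi(f_xy, f_x, f_y, n)
    return max(m / h(f_x, n), m / h(f_y, n))

# Hypothetical counts: two words that occur 10 times each, always together,
# in a corpus of 1000 tokens.
print(normalized_mi(10, 10, 10, 1000))
```

With these counts the two words are perfectly associated, so MI equals H(x) and the normalized score reaches 1.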

Extracting 'superanimate' nouns for Polish (Gawronska et al. 2002)

Pojawili się więc Algierczycy, Jemeńczycy, obywatele Bangladeszu, Uzbecy, Kirgizi i Tadżycy.

'So there arrived Algerians, Yemenis, citizens of Bangladesh, Uzbeks, Kyrgyz, and Tajiks'

Stop-list and a suffix list with declension numbers

PolWN Database

Word         Cat  SemCat  Gender  Number  Case  Decl
Pojawili     v    hum     ma      pl
Algierczycy  n    hum     ma      pl      nom   35
Jemeńczycy   n    hum     ma      pl      nom   35
obywatele    n    hum     ma      pl      nom   14
Uzbecy       n    hum     ma      pl      nom   36
Kirgizi      n    hum     ma      pl      nom   38
Tadżycy      n    hum     ma      pl      nom   35
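The combination of a stop-list and a suffix list with declension numbers can be sketched as follows. Everything here is an illustrative assumption rather than the method of Gawronska et al. (2002): the suffix-to-declension pairs were chosen only so that they reproduce the declension numbers shown for this one sentence, and the verb cue is a one-item hypothetical lexicon.

```python
import re

STOP_LIST = {"się", "więc", "i"}   # function words to skip (illustrative)
VIRILE_VERBS = {"pojawili"}        # hypothetical lexicon of masculine-personal plural verb forms
# Illustrative suffix list: (suffix, declension number).
SUFFIX_DECL = [("ycy", 35), ("ecy", 36), ("zi", 38), ("ele", 14)]

def tag_candidates(sentence):
    """Tag suffix-matching words as human masculine-personal plural nominative nouns."""
    tokens = re.findall(r"\w+", sentence)
    # Only propose nouns if the sentence contains a masculine-personal plural verb.
    if not any(tok.lower() in VIRILE_VERBS for tok in tokens):
        return []
    rows = []
    for tok in tokens:
        low = tok.lower()
        if low in STOP_LIST or low in VIRILE_VERBS:
            continue
        for suffix, decl in SUFFIX_DECL:
            if low.endswith(suffix):
                rows.append((tok, "n", "hum", "ma", "pl", "nom", decl))
                break
    return rows

SENTENCE = ("Pojawili się więc Algierczycy, Jemeńczycy, obywatele Bangladeszu, "
            "Uzbecy, Kirgizi i Tadżycy.")

for row in tag_candidates(SENTENCE):
    print(*row)
```

The genitive form Bangladeszu matches no suffix in the list and is therefore not proposed as a nominative plural noun.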

Translation Memories – between MT and Human Translation

- Alignment techniques and EBMT techniques can be employed for building and searching the translator's own corpora
- Knowledge about existing corpus and concordance techniques helps translators build their own memory databases
- Most popular translation memories: TRADOS Translation Memory Desktop, Déjà Vu, MemoQ, Similis
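Searching a translator's own memory amounts to finding the stored source segment most similar to the new one. The sketch below uses Python's standard `difflib.SequenceMatcher` as the similarity measure. The two-entry memory, the 0.7 threshold, and the function `lookup` are assumptions for illustration and do not reflect how any of the listed products work internally.

```python
from difflib import SequenceMatcher

# Hypothetical English-Swedish translation memory: (source, target) pairs.
MEMORY = [
    ("Press the start button.", "Tryck på startknappen."),
    ("Close the cover before use.", "Stäng locket före användning."),
]

def similarity(a, b):
    """Character-based similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def lookup(segment, memory=MEMORY, threshold=0.7):
    """Return (best pair or None, score); None means no match reaches the threshold."""
    best = max(memory, key=lambda pair: similarity(segment, pair[0]))
    score = similarity(segment, best[0])
    return (best, score) if score >= threshold else (None, score)

match, score = lookup("Press the stop button.")
print(match, round(score, 2))
```

A fuzzy match like this one is only a suggestion: the stored target still says "start", so the translator must adapt it.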

Conclusions: electronic corpora and concordance programs facilitate the following translation-related tasks:

- Language education: learning terminology and words in context, easy creation of exercises and tests
- Creation of Machine Translation systems (however, corrections made by translators will always be necessary)
- Creation of computer-based translation aids:
  - Dictionaries
  - Language aids providing grammatical information (morphology, noun/verb paradigms)
  - Style checkers
  - Terminology aids, such as glossaries of 'authorized' terminology for a particular scientific, technical or commercial field, for particular clients, agencies and customers
  - Translation memories

Thank you!

References

Dura, E. and Gawronska, B. (2007) Novelty Extraction from Special and Parallel Corpora. In: Proceedings of the 3rd Language & Technology Conference 2007, 305-309. Adam Mickiewicz University, Poznan, Poland. ISBN 978-83-7177-407-2.

Dura, E., Gawronska, B., Olsson, B., and Erlendsson, B. (2006) Towards Information Fusion in Pathway Evaluation: Encoding Relations in Biomedical Texts. In: Proceedings of the 9th International Conference on Information Fusion, Florence, Italy, 10-13 July 2006, 240-247.

Huenerfauth, Matt (2004) Spatial and Planning Models of ASL Classifier Predicates for Machine Translation. To appear in Proceedings of TMI 2004, Baltimore, U.S.

Hutchins, John (1999a) The development and use of machine translation systems and computer-based translation tools. International Symposium on Machine Translation and Computer Language Information Processing, 26-28 June 1999, Beijing, China. http://ourworld.compuserve.com/homepages/WJHutchins/Beijing.htm

Hutchins, John (1999b) Retrospect and prospect in computer-based translation. (Paper presented at the MT Summit, Singapore, 1999) http://ourworld.compuserve.com/homepages/WJHutchins/MTS-99.htm

Hutchins, John (2000) The IAMT Certification initiative and defining translation system categories. (Presented at EAMT Workshop, Ljubljana, May 2000) http://ourworld.compuserve.com/homepages/WJHutchins/IAMTcert.htm

Hutchins, John (unpublished) The history of machine translation in a nutshell. http://ourworld.compuserve.com/homepages/WJHutchins/Nutshell.htm

Jurafsky, D. & Martin, J.H. (2000) Speech and Language Processing. Prentice Hall Series in Artificial Intelligence. Chapters 18, 20, 21.

Kay, Martin (1996) Machine Translation: The Disappointing Past and Present. In: Survey of the State of the Art in Human Language Technology (Eds. Varile, Giovanni Battista & Zampolli, Antonio). http://cslu.cse.ogi.edu/HLTsurvey/ch8node4.html http://www.multilingual.com/machineTranslation62.htm

Seligman, Mark, Dillinger, Mike, and Zong, Chengqing (2004) Cooperative Spoken Language Understanding for Robust Speech Translation. Paper submitted for TMI 2004.

Somers, H. (2003) Machine Translation: Latest Developments. In: Mitkov (ed.): The Oxford Handbook of Computational Linguistics.

Wahlster, W. (2000) Verbmobil: Foundations of Speech-to-Speech Translation. Springer-Verlag, Berlin.