Corpora and Concordances in Human Translation, Machine
Translation, and Language/Translator Education
Barbara Gawronska
Dept. of Foreign Languages and Translation, University of Agder, Kristiansand, Norway
Oslo, Oct 2008
Outline
Concordance Tools for Human Users
- concordances in the past and now
- how can language learners, language teachers, and translators benefit from modern concordance tools: some examples
Monolingual Corpora in Machine Translation (MT)
- how can MT system developers benefit from monolingual corpora?
Between MT and human translation: electronic translation aids
Concordance Tools for Human Users
Concordance – the traditional definition and the modern developments
http://en.wiktionary.org/wiki/concordance: ”An alphabetical verbal index showing the places in the text of a book where each principal word may be found, with its immediate context in each place”
This definition is a little old-fashioned: modern tools also allow searching for excerpts where the context is defined in terms of parts of speech. The search results may be sorted in different ways, e.g. by frequency or by collocational strength calculated with different statistical measures.
Concordance – a very useful complement to a dictionary (for some reason, seldom employed in language education and translator education…)
Words and phrases can be looked up by their context; the user can easily get an overall picture of the usage of a word.
a &noun of laughter
- a roar of laughter
- a burst of laughter
- a round of laughter
- a bray of laughter
- a fit of laughter
a pack of &noun
- a pack of wolves
- a pack of thieves
- a pack of liers
- a pack of bloodhounds
- a pack of flatheads
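The two queries above combine literal words with a part-of-speech variable. A minimal sketch of how such a pattern search might work is given below. It does not show how Corpus Culler is implemented: the tag names, the `VARIABLES` mapping, the toy hand-tagged corpus, and the function names are all assumptions made for illustration.

```python
# Hypothetical mapping from query variables to part-of-speech tags.
VARIABLES = {"&noun": "NOUN", "&verb": "VERB"}

# Toy hand-tagged corpus (invented for illustration): (token, tag) pairs.
CORPUS = [
    ("there", "ADV"), ("was", "VERB"), ("a", "DET"), ("roar", "NOUN"),
    ("of", "PREP"), ("laughter", "NOUN"), ("and", "CONJ"), ("then", "ADV"),
    ("a", "DET"), ("burst", "NOUN"), ("of", "PREP"), ("laughter", "NOUN"),
    ("as", "CONJ"), ("a", "DET"), ("pack", "NOUN"), ("of", "PREP"),
    ("wolves", "NOUN"), ("ran", "VERB"), ("by", "ADV"),
]


def item_matches(item, token, tag):
    """A variable matches on the tag; any other query item matches the word form."""
    if item in VARIABLES:
        return tag == VARIABLES[item]
    return item == token.lower()


def search(query, tagged_tokens):
    """Return every token sequence in the corpus that fits the query pattern."""
    pattern = query.lower().split()
    n = len(pattern)
    hits = []
    for start in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[start:start + n]
        if all(item_matches(p, tok, tag) for p, (tok, tag) in zip(pattern, window)):
            hits.append(" ".join(tok for tok, _ in window))
    return hits


print(search("a &noun of laughter", CORPUS))
print(search("a pack of &noun", CORPUS))
```

A real tool would run an automatic tagger over a corpus of millions of tokens and use an index rather than a linear scan; the sliding window here only illustrates the matching logic.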
An example: Corpus Culler, a multi-language concordance tool (LexwareLabs AB, www.lexwarelabs.com, a company cooperating with Högskolan i Skövde)
Excerpts are selected from corpora of modern English/Swedish/Polish of about 30 million word tokens each
Authentic modern language – corpora collected from the web
Pre-defined part-of-speech variables can be combined with variables defined by the user
Statistical information available (raw frequency and different association measures: MI, Chi-square, T-score etc.)
Translation equivalents listed by Lexin
English entry word: pack
Swedish translation: koppel, släpp, flock, skock (substantiv)
Composition: a pack of dogs --- ett koppel hundar; a pack of thieves --- en tjuvliga; you're just telling a pack of lies!
English entry word: flock
Swedish translation: (om djur) skock, flock (substantiv); oordnad grupp; hjord
Example: a flock of sheep --- en skock får
Concordance complements the dictionary information (LexwareLabs AB, www.lexwarelabs.com)
A concordance may facilitate different educational tasks, and help the translator. It can e.g. show the use of prepositions and particles with verbs…
…stylistic variation, semantically similar words
Distinctions between semantically similar words that are not exact synonyms; valency patterns (e.g. enable/facilitate)
The usage of ”support verbs”: results of search for ”göra &noun”
Results of search for göra &noun, statistics
Finding translation equivalents, avoiding interference: ”göra motstånd” ('to offer resistance') compared to Polish ”&verb opór” (opór = motstånd)
The concordance tool is connected to dictionaries; a word may be seen in a broader context, and dictionary definitions pop up on request
The dictionary function is also available for Swedish
Even specialized dictionaries…
Absent in dictionaries:
new words
new multiwords
tricky new formations (but also misspelled words)
WordNet: 155,327 entries, 207,016 senses
UMLS: 975,354 concepts and 2.4 million concept names
fridge googling, chief hacking officer
cyberchondriac, spamdexing, b4, lol, cul8er, håglöss, nydist, syniker, turktumlare, Kristi himmelfärsdag…
pneumonoultramicroscopicsilicovolcanoconiosis, fusioneer, biofraud
Non-dictionary expressions and new expressions (the variable &new)
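As a rough illustration of the idea behind a variable such as &new, the sketch below flags every token that is missing from a reference word list and counts how often it occurs. The function name, the tokenisation by `\w+`, and the tiny lexicon are assumptions made for this example; the actual tool's procedure is not described here.

```python
import re
from collections import Counter


def new_expressions(text, lexicon):
    """Count tokens that do not occur in the given lexicon (a set of known words)."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(token for token in tokens if token not in lexicon)


# Hypothetical miniature lexicon standing in for a full dictionary.
LEXICON = {"the", "kept", "reading", "about", "symptoms"}

sample = "The cyberchondriac kept googling symptoms; lol, the cyberchondriac!"
print(new_expressions(sample, LEXICON).most_common())
```

As the slide on absent words points out, such a filter returns misspellings as well as genuine novelties, so its output would still need frequency thresholds or manual inspection.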
Corpora and Concordances in Machine Translation
Stochastic MT
Example-based Machine Translation (EBMT)
Statistical MT (Brown et al. 1993)
KBMT
Knowledge-Based Machine Translation (KBMT): Nirenburg et al., Hobbs, Wilks, and others
- knowledge stored in lexicons, onomasticons, and ontologies
- rule-based parsing and semantico-pragmatic analysis aimed at conceptual representations
Is merging of the different approaches possible? YES. The MT system Verbmobil (a simplified version of Figure 11, p. 17 in Wahlster 2000)
How can MT and automated Information Extraction be facilitated by unilingual corpora and concordances?
- statistics-based term extraction
- automatic extraction of inflectional information
- collocation extraction
- extraction of knowledge about concrete and metaphorical usage of words and phrases
Extracting English compounds from a biomedical corpus (Dura, Gawronska, and Erlendsson 2006)
[Formula: Tscore(x, y), computed from f(x, y), f(x), f(y), and N]
f(x): corpus frequency of word x
f(x, y): corpus frequency of word pair (x, y)
N: total number of words in the corpus
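To show how a T-score can be obtained from the three quantities defined above, here is a small sketch using the common textbook approximation t = (f(x,y) - f(x)f(y)/N) / sqrt(f(x,y)). This is an assumption: the exact variant used on the slide may differ, and the function name and example counts are invented.

```python
import math


def t_score(f_xy, f_x, f_y, n):
    """Textbook T-score approximation for a word pair (an assumed variant).

    f_xy: corpus frequency of the pair (x, y)
    f_x, f_y: corpus frequencies of the individual words
    n: total number of words in the corpus
    """
    if f_xy <= 0:
        raise ValueError("the pair must occur at least once")
    expected = f_x * f_y / n  # pair frequency expected if x and y were independent
    return (f_xy - expected) / math.sqrt(f_xy)


# Invented counts: the pair occurs 100 times in a corpus of one million words.
print(t_score(f_xy=100, f_x=1000, f_y=2000, n=1_000_000))
```

The higher the score, the further the observed pair frequency lies above what chance co-occurrence would predict, which is why such measures help separate compounds and terms from accidental word sequences.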
Extracting Latin terms from a biomedical corpus (Dura, Gawronska, and Erlendsson 2006)
$$MI(x, y) = H(x) + H(y) - H(x, y)$$
H(x): entropy of word x, $H(x) = -p(x)\log_2 p(x)$, where the probability $p(x) = f(x) / N$
$$\max\left(\frac{MI(x, y)}{H(x)}, \frac{MI(x, y)}{H(y)}\right)$$
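The entropy-based measure can be worked through in a few lines. The sketch below follows the definitions on the slide, H(x) = -p(x) log2 p(x) with p(x) = f(x) / N, and the maximum of MI divided by each word's entropy. It additionally assumes that the joint term H(x, y) is computed the same way from p(x, y) = f(x, y) / N, which the slide does not state; the function names and the counts are invented.

```python
import math


def entropy_term(freq, n):
    """H = -p * log2(p) with p = freq / n, as defined on the slide."""
    if freq == 0:
        return 0.0
    p = freq / n
    return -p * math.log2(p)


def mutual_information(f_xy, f_x, f_y, n):
    """MI(x, y) = H(x) + H(y) - H(x, y); H(x, y) from p(x, y) is an assumption."""
    return entropy_term(f_x, n) + entropy_term(f_y, n) - entropy_term(f_xy, n)


def scaled_mi(f_xy, f_x, f_y, n):
    """max(MI / H(x), MI / H(y)), the second expression on the slide."""
    mi = mutual_information(f_xy, f_x, f_y, n)
    return max(mi / entropy_term(f_x, n), mi / entropy_term(f_y, n))


# Invented counts chosen as powers of two so the logarithms are exact.
print(mutual_information(f_xy=64, f_x=256, f_y=128, n=1024))
print(scaled_mi(f_xy=64, f_x=256, f_y=128, n=1024))
```

Dividing by the entropy of each word and taking the maximum lets a rare word that occurs almost only in the pair score highly, which is relevant when one component of a term is much more frequent than the other.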
PolWN Database

Pojawili się więc Algierczycy, Jemeńczycy, obywatele Bangladeszu, Uzbecy, Kirgizi i Tadżycy.
’There arrived Algerians, Yemenis, citizens of Bangladesh, Uzbeks, Kyrgyz, and Tajiks’

Word         Cat  SemCat  Gender  Number  Case  Decl
Algierczycy  n    hum     ma      pl      nom   35
Jemeńczycy   n    hum     ma      pl      nom   35
obywatele    n    hum     ma      pl      nom   14
Uzbecy       n    hum     ma      pl      nom   36
Kirgizi      n    hum     ma      pl      nom   38
Tadżycy      n    hum     ma      pl      nom   35
Pojawili     v    hum     ma      pl

Stop-list and a suffix list with declension numbers
Extracting ’superanimate’ nouns for Polish (Gawronska et al. 2002)
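The combination of a stop-list and a suffix list with declension numbers, mentioned in the PolWN example, could be sketched as a longest-suffix lookup. The stop-list entries and the suffixes below are hypothetical: they are invented for illustration and chosen only so that the lookup returns the declension numbers shown for the example sentence. They are not the project's actual lists.

```python
# Hypothetical stop-list of function words that are never looked up.
STOP_LIST = {"się", "więc", "i"}

# Hypothetical suffix list: word ending -> declension number.
SUFFIX_DECLENSION = {"ycy": 35, "ecy": 36, "zi": 38, "ele": 14}


def declension(word):
    """Return the declension number of the longest matching suffix, or None."""
    form = word.lower().strip(".,")
    if form in STOP_LIST:
        return None
    # Try the longest candidate suffix first, then progressively shorter ones.
    for length in range(len(form), 0, -1):
        number = SUFFIX_DECLENSION.get(form[-length:])
        if number is not None:
            return number
    return None


sentence = ("Pojawili się więc Algierczycy, Jemeńczycy, obywatele "
            "Bangladeszu, Uzbecy, Kirgizi i Tadżycy.")
for token in sentence.split():
    print(token, declension(token))
```

Words without a matching suffix, such as the verb form and the genitive Bangladeszu in this toy setup, simply return None and would need other evidence, for example agreement with the verb.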
Translation Memories – between MT and Human Translation
Alignment techniques and EBMT techniques can be employed for building and searching the translator’s own corpora
Knowledge about existing corpus and concordance techniques helps translators build their own memory databases
Most popular translation memories:
TRADOS Translation Memory Desktop, Déjà Vu, MemoQ, Similis
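Searching a translator's own memory typically means finding a stored source segment that is similar, not necessarily identical, to the new sentence. The sketch below illustrates this with Python's standard `difflib.SequenceMatcher`; it is not how any of the products listed above work internally. The function name, the 0.7 threshold, and the two-entry memory (built from the Lexin examples earlier in the talk) are assumptions.

```python
from difflib import SequenceMatcher

# Toy translation memory: (source segment, stored translation) pairs.
MEMORY = [
    ("a pack of dogs", "ett koppel hundar"),
    ("a flock of sheep", "en skock får"),
]


def best_match(segment, memory, threshold=0.7):
    """Return (source, target, score) for the most similar stored segment,
    or None if no stored segment reaches the similarity threshold."""
    best = None
    best_score = threshold
    for source, target in memory:
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= best_score:
            best, best_score = (source, target, score), score
    return best


print(best_match("a pack of dog", MEMORY))
print(best_match("thank you", MEMORY))
```

Character-level similarity is the simplest possible choice; word-level alignment or the example-based techniques mentioned above would give more linguistically meaningful matches.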
Conclusions: electronic corpora and concordance programs facilitate the following translation-related tasks:
Language education: learning terminology and words in context, easy creation of exercises and tests
Creation of Machine Translation systems (however, corrections made by translators will always be necessary)
Creation of computer-based translation aids:
- Dictionaries
- Language aids providing grammatical information (morphology, noun/verb paradigms)
- Style checkers
- Terminology aids, such as glossaries of ‘authorized’ terminology for a particular scientific, technical or commercial field, for particular clients, agencies and customers
- Translation memories
Thank you!
References
Dura, E. and Gawronska, B. (2007) Novelty Extraction from Special and Parallel Corpora. In: Proceedings of 3rd Language & Technology Conference 2007, 305-309. Adam Mickiewicz University, Poznan, Poland. ISBN 978-83-7177-407-2.
Dura, E., Gawronska, B., Olsson, B., and Erlendsson, B. (2006) Towards Information Fusion in Pathway Evaluation: Encoding Relations in Biomedical Texts. In: Proceedings of the 9th International Conference on Information Fusion, Florence, Italy, 10-13 July 2006, 240-247.
Huenerfauth, Matt (2004) Spatial and Planning Models of ASL Classifier Predicates for Machine Translation. To appear in Proceedings of TMI 2004, Baltimore, U.S.
Hutchins, John (1999a) The development and use of machine translation systems and computer-based translation tools. International Symposium on Machine Translation and Computer Language Information Processing, 26–28 June 1999, Beijing, China. http://ourworld.compuserve.com/homepages/WJHutchins/Beijing.htm
Hutchins, John (1999b) Retrospect and prospect in computer-based translation. (Paper presented at the MT Summit, Singapore, 1999) http://ourworld.compuserve.com/homepages/WJHutchins/MTS-99.htm
Hutchins, John (2000) The IAMT Certification initiative and defining translation system categories. (Presented at EAMT Workshop, Ljubljana, May 2000) http://ourworld.compuserve.com/homepages/WJHutchins/IAMTcert.htm
Hutchins, John (unpublished) The history of machine translation in a nutshell. http://ourworld.compuserve.com/homepages/WJHutchins/Nutshell.htm
Jurafsky, D. & Martin, J.H. (2000) Speech and Language Processing. Prentice Hall Series in Artificial Intelligence. Chapters 18, 20, 21.
Kay, Martin (1996) Machine Translation: The Disappointing Past and Present. In: Survey of the State of the Art in Human Language Technology (ed. Varile, Giovanni Battista & Zampolli, Antonio). http://cslu.cse.ogi.edu/HLTsurvey/ch8node4.html http://www.multilingual.com/machineTranslation62.htm
Seligman, Mark, Dillinger, Mike, and Zong, Chengqing (2004) Cooperative Spoken Language Understanding for Robust Speech Translation. Paper submitted for TMI 2004.
Somers, H. (2003) Machine Translation: Latest Developments. In: Mitkov (ed.): The Oxford Handbook of Computational Linguistics.
Wahlster, W. (2000) Verbmobil: Foundations of Speech-to-Speech Translation. Springer-Verlag, Berlin.