13. Constantin Orasan (UoW) Natural Language Processing for Translation

Natural Language Processing for Translation

Constantin Orasan

University of Wolverhampton, UK

1. How NLP can help machine translation
2. How information retrieval can help machine translation
3. View from the industry

Structure

Language is ambiguous at all levels:
  Lexical: bank, file, chair
  Syntactic: John saw the man with the telescope.
  Semantic: The rabbit is ready for lunch.
  Discourse: John hid Bill’s car keys. He was drunk.
  Pragmatics: You owe me £20.

It gets even more difficult when we start working in multilingual settings

MT is difficult

Tokenisation → Morphology → Syntax → Lexical semantics → Discourse → Pragmatics

Steps of processing

NLP complexity increases this way! (and in general the accuracy of methods decreases)

GATE (General Architecture for Text Engineering, http://gate.ac.uk): a framework written in Java
  First designed for information extraction tasks
  Developed into a robust framework that offers processing at different linguistic levels in many languages
  Provides wrappers for other tools such as Weka, LingPipe, etc.
NLTK: a set of NLP modules written in Python, with an emphasis on teaching
LingPipe (http://alias-i.com/lingpipe/): a Java toolkit for language processing
OpenNLP (http://opennlp.apache.org/): an ML-based toolkit for language processing
… the list can go on

General NLP frameworks

Vauquois triangle

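[Figure: the Vauquois triangle. The source side rises from source text through source syntax and source semantics to the interlingua at the apex; the target side descends through target semantics and target syntax to target text. The two sides are connected at increasing depth by direct translation, shallow (syntactic) transfer and deep (semantic) transfer.]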

SMT can benefit from linguistic information, but many of the existing models are largely data-driven and do not incorporate much of it

See the lecture on Statistical MT: Word, Phrase and Tree Based Models (overview) Khalil Sima'an (UvA) and Trevor Cohn (USFD)

NLP in SMT

In many cases EBMT requires some kind of linguistic information

See the lecture on Example Based Machine Translation Josef van Genabith (DCU) and Khalil Sima'an (UvA)

NLP in EBMT

The existing TM solutions do not rely on much linguistic information

Second- and third-generation TM systems rely on linguistic input

See the lecture on Translation Memories Ruslan Mitkov (UoW), Manuel Arcedillo (Hermes) and Juanjo Arevalillo (Hermes)

NLP in TM

But there are many other ways in which we could improve the results of translation engines by incorporating linguistic information

Improve tokenisation

For European languages tokenisation is considered more or less a simple problem.

In non-segmented languages (such as many oriental ones), identification of tokens is extremely complex:
  Tokens do not have explicit boundaries (they are written directly adjacent to one another with no whitespace between them).
  Practically all the characters can be one-character words in themselves, but they can also be joined together to form multi-character words.

Even in segmented languages like English, identification of tokens can be difficult.

Tokenisation may not be so easy

Even in segmented languages like English, where tokens are usually separated by whitespace and punctuation, there are problems:
  Abbreviations: when full stops follow abbreviations, they should be merged with the abbreviation to form one token (e.g. etc., yrs., Mr.)
  Multiple strings separated by whitespace can in fact form one token (e.g. numerical expressions in French: 1 200 000)
  Hyphenation can be ambiguous:
    Sometimes part of the word segment, e.g. self-assessment, F-16, forty-two
    Sometimes not, e.g. London-based

Additional challenges:
  Numerical and special expressions (dates, measures, email addresses)
  Language-specific rules for contracting words and phrases (e.g. can’t, won’t vs. O'Brien: the contractions contain multiple tokens with no whitespace between them)
  Ambiguous punctuation (e.g. “.” in yrs., 05.11.08)
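The cases listed above can be made concrete with a small rule-based tokeniser. This is only a minimal sketch under stated assumptions: the abbreviation list, the regular expressions and the function name `tokenise` are hypothetical and chosen for illustration, contractions are split in the Penn Treebank style (`ca` + `n't`), and every hyphenated form is kept as one token, so the London-based ambiguity is deliberately left unresolved.

```python
import re

# Hypothetical, deliberately tiny abbreviation lexicon (assumption for illustration).
ABBREVIATIONS = {"etc.", "yrs.", "mr."}

TOKEN_RE = re.compile(
    r"""
      \d{1,2}\.\d{1,2}\.\d{2,4}        # dotted dates such as 05.11.08
    | \d{1,3}(?:\ \d{3})+(?!\d)        # space-grouped numbers such as 1 200 000
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+     # e-mail addresses
    | \w+(?:[-']\w+)*\.?               # words with hyphens or apostrophes, optional final dot
    | [^\w\s]                          # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Split text into tokens using the hand-written rules above."""
    text = text.replace("\u2019", "'")  # treat curly and straight apostrophes alike
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        trailing = []
        # A final full stop stays attached only for known abbreviations.
        if len(tok) > 1 and tok.endswith(".") and tok.lower() not in ABBREVIATIONS:
            tok, trailing = tok[:-1], ["."]
        # Contractions such as can't hide two tokens; O'Brien does not.
        if len(tok) > 3 and tok.lower().endswith("n't"):
            tokens.extend([tok[:-3], tok[-3:]])
        else:
            tokens.append(tok)
        tokens.extend(trailing)
    return tokens


if __name__ == "__main__":
    print(tokenise("Mr. O'Brien can't pay 1 200 000 by 05.11.08, etc."))
```

Even this short rule set shows why tokenisation is not trivial: each rule is language specific, and the order of the alternatives matters (the date pattern has to be tried before the generic word pattern).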

It will influence any task that requires a dictionary/gazetteer lookup
Can influence how words are aligned
Abbreviations were shown to help SMT (Li and Yarowsky, 2008)
Translation of named entities (n.b. NER is seen as part of tokenisation)

Why is tokenisation important in MT?

Unseen abbreviations are treated as unknown words and left untranslated

Modern Chinese is a highly abbreviated language, and 20% of sentences in a newspaper article contain an abbreviation

The way abbreviations are formed follows much more complex rules than in English

Li and Yarowsky (2008) propose an unsupervised method for extracting relations between full-form phrases and their abbreviations

Abbreviations

Step 1: Identification of English entities
Step 2: Translate the entities into Chinese using a baseline translator
Step 3: Full-abbreviation relations are extracted on the basis of co-occurrence in a Chinese monolingual corpus
Step 4: Translation induction for Chinese abbreviations
Step 5: Integration with the baseline translation system

Evaluation shows that BLEU scores improve

Li and Yarowsky (2008)
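To give a feel for Step 3, the sketch below counts how often a full-form phrase and a possible abbreviation occur in the same document of a monolingual corpus. It is a toy illustration, not the authors' implementation: the candidate heuristic (an abbreviation keeps some characters of the full form in their original order), the function names and the three-document corpus are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def candidate_abbreviations(full_form, min_len=2):
    """In-order character subsequences that are shorter than the full form.

    Assumption for illustration: abbreviations keep characters of the
    full form in their original order.
    """
    n = len(full_form)
    candidates = set()
    for k in range(min_len, n):
        for positions in combinations(range(n), k):
            candidates.add("".join(full_form[i] for i in positions))
    return candidates


def extract_relations(full_forms, documents):
    """Count documents in which a full form and a candidate co-occur."""
    counts = Counter()
    for doc in documents:
        for full in full_forms:
            if full not in doc:
                continue
            # Blank out the full form so a candidate is not found inside it.
            rest = doc.replace(full, " ")
            for cand in candidate_abbreviations(full):
                if cand in rest:
                    counts[(full, cand)] += 1
    return counts


# Hypothetical toy corpus: 北京大学 (Peking University) and its abbreviation 北大.
DOCUMENTS = [
    "北京大学今天开学。北大的学生很多。",
    "他在北京大学工作,北大很有名。",
    "北京大学在北京。",
]

if __name__ == "__main__":
    for pair, count in extract_relations(["北京大学"], DOCUMENTS).most_common():
        print(pair, count)
```

On this toy corpus the pair with the highest co-occurrence count is the genuine abbreviation, while a spurious candidate (the city name 北京) also appears once; this is why raw counts on a real corpus would need further filtering before being used for translation induction.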

Incorrect NE translation can seriously harm the quality of translation

There are 2 main sources of problems:
  Ambiguity: NEs are normally composed of words which can also be translated in isolation
  Sparsity: some named entities are very sparse

Integrating NEs into the translation model gives mixed results, ranging from significant improvements through small improvements to a negative impact

Translation of named entities

The main approach in SMT is to determine the NEs in a text and translate them using an external model.

Then they are:
  Used as the default translation (Li et al., 2009)
  Added dynamically to compete with other translations (Turchi et al., 2012; Bouamor et al., 2012)
  Not used, and the original NE is left untranslated (Tinsley et al., 2012)

Nikoulina et al. (2012) propose replacing NEs with placeholders in order to reduce sparsity, in this way learning a better model

NE in SMT

1. The Named Entities are detected and replaced with placeholders to produce reduced sentences

2. A reduced translation model is used to translate the reduced sentences

3. An external NE translator is employed

4. The translated NEs are reinserted in the reduced translations

The disadvantage of the approach is that the framework is only loosely dependent on the SMT task; the proposed remedy is a post-processing step applied to the output of the NER, plus a prediction model to determine which NEs can be safely translated

Nikoulina et al. (2012)
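The four steps of the placeholder framework can be sketched in a few lines. Everything here is a stand-in chosen for illustration: the gazetteer plays the role of the NER, a one-entry lookup table plays the role of the reduced translation model, a small dictionary plays the role of the external NE translator, and the placeholder format `__LABELn__` is hypothetical.

```python
# Hypothetical toy resources (assumptions, not part of the original system).
GAZETTEER = {"Pierre Dupont": "PER", "Londres": "LOC"}
NE_TRANSLATIONS = {"Londres": "London"}
REDUCED_TABLE = {"__PER0__ habite à __LOC1__": "__PER0__ lives in __LOC1__"}


def reduce_sentence(sentence):
    """Step 1: replace detected NEs with indexed placeholders."""
    found = sorted((sentence.find(ne), ne) for ne in GAZETTEER if ne in sentence)
    slots = {}
    for index, (_, ne) in enumerate(found):
        placeholder = f"__{GAZETTEER[ne]}{index}__"
        sentence = sentence.replace(ne, placeholder, 1)
        slots[placeholder] = ne
    return sentence, slots


def translate_reduced(reduced):
    """Step 2: translate the reduced sentence (table lookup as a stand-in)."""
    return REDUCED_TABLE.get(reduced, reduced)


def translate_entity(ne):
    """Step 3: external NE translator; unknown NEs are copied unchanged."""
    return NE_TRANSLATIONS.get(ne, ne)


def reinsert(translation, slots):
    """Step 4: put the translated NEs back in place of the placeholders."""
    for placeholder, ne in slots.items():
        translation = translation.replace(placeholder, translate_entity(ne))
    return translation


def translate(sentence):
    reduced, slots = reduce_sentence(sentence)
    return reinsert(translate_reduced(reduced), slots)


if __name__ == "__main__":
    print(translate("Pierre Dupont habite à Londres"))
```

Because the reduced model only ever sees `__PER0__` and `__LOC1__`, every sentence of the form "X habite à Y" maps to the same training pattern, which is how the placeholders reduce sparsity.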

Noted as a problem for machine translation as far back as the late 1940s (Weaver, 1949)

A word can often only be translated if you know the specific sense intended

Bar-Hillel (1960) posed the following problem:

Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.

Is “pen” a writing instrument or an enclosure where children play or an enclosure for livestock?

…declared it unsolvable, and left the field of MT…

MT and semantics

Lexical Divergence: many-to-many

Domain specific dictionaries can improve the quality of translation in post-editing environments

Word sense disambiguation (WSD) is the ability to identify the meaning of words in context in a computational manner

WSD is seen as a more general solution to lexical divergence, but WSD is an AI-complete problem

Lexical Divergence: solution?

SMT models only rely on local context to choose among lexical translation candidates

The assumption is that a dedicated WSD module can help the translation process

Use a baseline Chinese to English translation engine and a state-of-the-art Chinese WSD system:
  WSD incorporated in the decoder
  WSD incorporated in a post-processor

Translation obtained using the English gloss of HowNet

WSD does not help a typical SMT system, but this is mainly due to the fact that the SMT systems of the time (2005) could not take advantage of the sense information

Carpuat and Wu (2005)

Successfully integrate WSD in Hiero, a state-of-the-art Chinese to English hierarchical phrase-based MT system

Introduce two additional features in the MT model at the decoding stage that take into consideration that some words were chosen by the WSD system

The improvement noticed is modest, but statistically significant

Carpuat and Wu (2007) find similar results, but instead of WSD they perform fully phrasal multi-word disambiguation and their disambiguation system is tightly integrated in the SMT engine

Chan et al. (2007)
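Adding features at the decoding stage can be pictured with a generic log-linear model, in which each translation candidate is scored as a weighted sum of feature values. The sketch below is an illustration only: the feature names, their definitions (a WSD log-probability and a count of WSD-proposed words), the weights and the two candidates are assumptions made for this example and are not taken from the paper.

```python
import math


def loglinear_score(features, weights):
    """Weighted sum of feature values; features without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def best_candidate(candidates, weights):
    """Return the name of the highest-scoring candidate."""
    return max(candidates, key=lambda name: loglinear_score(candidates[name], weights))


# Hypothetical candidates for the same source span.
CANDIDATES = {
    # Uses a translation proposed by the WSD system.
    "candidate_a": {
        "tm": math.log(0.4),
        "lm": math.log(0.2),
        "wsd_logprob": math.log(0.9),
        "wsd_count": 1,
    },
    # Uses only the baseline translation model.
    "candidate_b": {
        "tm": math.log(0.5),
        "lm": math.log(0.2),
        "wsd_logprob": 0.0,
        "wsd_count": 0,
    },
}

BASELINE_WEIGHTS = {"tm": 1.0, "lm": 0.5}
WSD_WEIGHTS = {"tm": 1.0, "lm": 0.5, "wsd_logprob": 0.3, "wsd_count": 0.3}

if __name__ == "__main__":
    print(best_candidate(CANDIDATES, BASELINE_WEIGHTS))
    print(best_candidate(CANDIDATES, WSD_WEIGHTS))
```

With the baseline weights the decoder prefers the candidate with the higher translation-model probability; once the two WSD features carry weight, the candidate supported by the WSD system wins. In a real system the weights would be tuned on held-out data rather than set by hand.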

Evaluation metrics like BLEU treat any divergence from the reference translation as a mistake

Several alternative metrics were proposed to address this problem:
  METEOR (Denkowski and Lavie, 2010) accounts for synonyms and paraphrases
  Calculate meaning equivalence using bidirectional textual entailment (Pado et al., 2009)
  Using semantic role labels (Gimenez and Marquez, 2007)
  TINE (Rios et al., 2011) measures the similarity between sentences using shallow semantic representation

NLP in evaluation of MT
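The effect of exact matching can be seen with a much simplified score. The sketch below computes plain unigram precision, which is neither BLEU nor METEOR but shares BLEU's exact-match behaviour, and then repeats the computation with a synonym table. The function name, the synonym table and the example sentences are assumptions made for illustration.

```python
from collections import Counter


def unigram_precision(hypothesis, reference, synonyms=None):
    """Fraction of hypothesis words matched in the reference.

    Each reference word can be matched at most once. If a synonym table is
    given, a hypothesis word may also match via one of its listed synonyms.
    """
    synonyms = synonyms or {}
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0
    remaining = Counter(reference.lower().split())
    matched = 0
    for word in hyp_tokens:
        for option in [word] + sorted(synonyms.get(word, ())):
            if remaining[option] > 0:
                remaining[option] -= 1
                matched += 1
                break
    return matched / len(hyp_tokens)


# Hypothetical synonym table standing in for a real lexical resource.
SYNONYMS = {"glad": {"happy"}}

if __name__ == "__main__":
    reference = "John was very happy"
    hypothesis = "John was very glad"
    print(unigram_precision(hypothesis, reference))
    print(unigram_precision(hypothesis, reference, SYNONYMS))
```

Under exact matching the acceptable translation "glad" is penalised as if it were an error; with the synonym table the hypothesis is matched in full, which is the intuition behind synonym-aware metrics.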

Automatic terminology extraction
Automatic extraction of ontologies
Automatic compilation of (parallel/comparable) corpora
Use of parallel corpora to train various systems

Other NLP applications which could be useful

Bouamor, D., Semmar, N., and Zweigenbaum, P. (2012). Identifying multi-word expressions in statistical machine translation. In Proceedings of LREC 2012.

Carpuat, M., & Wu, D. (2007). Improving Statistical Machine Translation Using Word Sense Disambiguation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp. 61–72). Prague, Czech Republic. Retrieved from http://acl.ldc.upenn.edu/D/D07/D07-1007.pdf

Carpuat, M., & Wu, D. (2005). Word Sense Disambiguation vs. Statistical Machine Translation. Proceedings of the 43rd Annual Meeting of the ACL, (June), 387–394. Retrieved from http://acl.ldc.upenn.edu/P/P07/P07-1005.pdf

Chan, Y. S., Ng, H. T., & Chiang, D. (2007). Word Sense Disambiguation Improves Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (pp. 33–40).

References

Li, M., Zhang, J., Zhou, Y., and Zong, C. (2009). The CASIA statistical machine translation system for IWSLT 2009. In Proceedings of IWSLT 2009.

Li, Z., & Yarowsky, D. (2008). Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora. In Proceedings of ACL-08 (pp. 425–433). Columbus, Ohio, USA. Retrieved from http://aclweb.org/anthology//P/P08/P08-1049.pdf

Navigli, R. (2009). Word sense disambiguation. ACM Computing Surveys, 41(2), 1–69. doi:10.1145/1459352.1459355

Nikoulina, V., Sandor, A., & Dymetman, M. (2012). Hybrid Adaptation of Named Entity Recognition for Statistical Machine Translation. In Second ML4HMT Workshop (pp. 1–16).

Pado, S., Galley, M., Jurafsky, D., & Manning, C. (2009). Robust Machine Translation Evaluation with Entailment Features. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP (pp. 297–305).


Rios, M., Aziz, W., & Specia, L. (2011). TINE: A Metric to Assess MT Adequacy. In Proceedings of the 6th Workshop on Statistical Machine Translation (WMT-2011), July, Edinburgh, UK.

Tinsley, J., Ceausu, A., and Zhang, J. (2012). PLUTO: automated solutions for patent translation. In EACL Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra): Proceedings of the workshop, EACL 2012.

Turchi, M., Atkinson, M., Wilcox, A., Crawley, B., Bucci, S., Steinberger, R., and Van der Goot, E. (2012). ONTS: "Optima" news translation system. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics.

