Word reordering for the English-German, English-Japanese and English-Chinese language pairs
Summer Internship 2016
KantanMT.com
Contents
Executive Summary
Contact Information
  Company Name
  Project Mentors & Management
Activity to be performed by the Intern
Experience of KantanMT as proposing organisation
  Platform Features
  Platform Infrastructure
Experience of Mentoring Person(s)
Detailed Proposal Description
  Detailed Project Costs
  Detailed Project Schedule
Why would this task be helpful for the student?
Why would this task be helpful to the MT Community?
Support
  About EAMT
  About KantanMT.com
References
Executive Summary
In Statistical Machine Translation (SMT), word reordering (also referred to in the literature
as word-replacement) is the task of arranging tokens in a sequence that accords with
the grammatical rules of the target language (Knight, 1999; Koehn, et al., 2003; Nießen &
Ney, 2004; Rottmann & Vogel, 2007). The word-reordering method applied in an SMT
engine has a major impact on the quality and fluency of the translation.
KantanMT.com is a SaaS-based SMT platform that allows its users to build KantanMT
engines for more than 750 language pairs. Efficient word reordering is essential to
our platform and, therefore, to our clients.
We propose to expand the word reordering capabilities of KantanMT.com in the
translation process for some challenging language pairs. Namely, we want to leverage
the existing research on this topic and explore the impact of word reordering on the
English->German, English->Chinese and English->Japanese language pairs.
This project will be managed as a summer internship program starting in June 2016 and
finishing in September 2016. It shall involve the analysis of existing literature,
implementation and evaluation of existing or novel methods within KantanMT.com.
The project will be mentored by Tony O’Dowd and two additional members of the
KantanMT core development team.
This research will not only contribute to the efficiency of the KantanMT platform but will
also have an impact on the SMT community by bridging the gap between academia and industry.
Contact Information
Company Name
Name | Address | Contact Numbers
KantanMT.com | INVENT Building, DCU Campus, Glasnevin, Dublin 9, Ireland | [email protected], +353-1-87-2405-154
Project Mentors & Management
Name | Role | Contact Details
Tony O’Dowd | Chief Architect | [email protected], +353-1-87-2405-154
Dr. Dimitar Shterionov | Project Mentor | [email protected]
Marek Mazur | Development Lead | [email protected]
Activity to be performed by the Intern
The intern will be expected to produce a formal research and development document
detailing the most current research and academic findings on the topic of word re-
ordering, appraise these and then select an approach for development based on best
practices, best outputs and best research.
The project steps/phases are outlined below:
Step/Phase 1: Review of current academic research
Description: Review of current academic research and approaches to the problem of word re-ordering.
Outputs: Research & Development Review document
Step/Phase 2: Comparison of word re-ordering techniques and methods
Description: The Intern will appraise the merits of each approach and rank them by efficiency (compute time) and quality (translation outputs).
Outputs: Comparison/Appraisal document of word re-ordering methods and techniques
Step/Phase 3: Functional Specification
Description: Based on the previous outputs, the Intern will be required to select an approach/method for re-ordering and develop a Functional Specification document for the implementation of this approach/method.
Outputs: Functional Specification
Step/Phase 4: Technical Specification
Description: Based on an agreed Functional Specification for the implementation of word re-ordering, the Intern will be required to produce a Technical Specification document for the implementation of word re-ordering for the English->German, English->Chinese and English->Japanese language pairs.
Outputs: Technical Specification
Step/Phase 5: Project Schedule
Description: The Intern will be required to devise a project schedule for the implementation of the selected word re-ordering approach/method in KantanMT.com.
Outputs: Project Schedule & Implementation Plan
Step/Phase 6: Implementation and testing
Description: The Intern will be required to code the word-reordering approach/method following best practices and implement it as a subsystem of KantanMT.com. The Intern shall then provide a set of unit tests that will fully verify the correctness and integrity of the implementation.
Outputs: Code Development
Experience of KantanMT as proposing organisation
KantanMT.com provides a sophisticated and powerful Machine Translation solution in an
easy-to-use package. Our community do not need any technical skills or IT support and
there are no special hardware or software requirements. Members can log on and build
customized Machine Translation engines immediately on the KantanMT.com platform.
There is no compromise between functional depth and ease of use. Our community just
get more. Faster.
Platform Features
Customise: The KantanMT Community create customised Machine Translation
engines using translation memories (TMX), terminology files (TBX) and/or free
stock training sets.
Analytics: The KantanMT Community determine the performance quality of
KantanMT engines using KantanAnalytics™, a unique segment level quality
estimation technology.
Translate: The KantanMT Community translate more in less time using
customised KantanMT engines. KantanTotalRecall™ is built into every KantanMT
engine to improve both performance and accuracy.
Measure: The KantanMT Community use automated quality measurements, such
as BLEU, TER and F-Measure, to track translation quality.
Integrate: The KantanMT Community easily develop Machine Translation
applications for any device or translation management system using
KantanAPI™, a powerful web services SDK.
Deploy: While traditional MT deployments can take months, KantanMT can be
deployed within hours!
Platform Infrastructure
The KantanMT platform is a highly distributed SaaS implementation of the Moses
Decoder. Currently deployed over 700 servers, across three data centres, it is the largest
customised machine translation platform in the localisation industry.
In 2015, the KantanMT platform hosted over 8,000 statistical machine translation engines,
translated over 5.5 billion words and was used by over 4,000 members.
Experience of Mentoring Person(s)
Name: Tony O’Dowd
Role: Chief Architect
Experience: Tony is a serial entrepreneur with 3 start-ups to his name. He has over 30 years’ experience in the localisation industry and is well known as a thought leader and innovator in technology solutions. His previous company, Alchemy Software Development, was the market leader in visual translation memory technology. His current company, KantanMT.com, is the biggest customised machine translation platform in the industry. Tony is a Fellow of the Localisation Resource Centre, University of Limerick, Ireland.
Name: Dr. Dimitar Shterionov
Role: Project Mentor
Experience: Dr. Shterionov graduated from the Arenberg Doctoral School, KU Leuven (Faculty of Engineering) in 2015. His doctoral thesis is titled “Design and Development of Probabilistic Inference Pipelines”. He is currently a researcher on the Moses core technology that powers the KantanMT platform.
Detailed Proposal Description
The Statistical Machine Translation (SMT) paradigm comprises two main processes: i)
training an SMT system (or engine) by providing parallel, bilingual data required to build
a translation model and monolingual data used to build a (target) language model; and
ii) translating given input data in the source language into an adequate translation in
the target language. One of the major issues in the latter process is word reordering, i.e.,
the task of arranging tokens in a sequence that accords with the grammatical rules
of the target language (Knight, 1999; Koehn, et al., 2003; Nießen & Ney, 2004; Rottmann
& Vogel, 2007). Word reordering is a computationally expensive task: Knight (1999) proves
that, in the worst case, the problem is NP-hard (for bigram models).
Language modelling has been successfully applied to address word reordering. A
language model will assign a probability to a sequence of words according to a probability
distribution.
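To illustrate how a language model can prefer one word order over another, the following is a minimal sketch of a bigram model with add-one smoothing. The toy corpus, the function names (train_bigram_lm, log_prob) and the choice of smoothing are assumptions made for illustration only; they do not describe the language models used in Moses or KantanMT.com.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Collect context and bigram counts from whitespace-tokenised sentences."""
    context_counts = Counter()  # how often each token is followed by another token
    bigram_counts = Counter()   # how often each (previous, current) pair occurs
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    context_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, cur)] + 1
        denominator = context_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


# Hypothetical toy corpus standing in for monolingual target-language data.
model = train_bigram_lm(["the cat sat", "the dog sat", "the cat ran"])
fluent = log_prob("the cat sat", model)
scrambled = log_prob("sat cat the", model)
```

With this toy corpus, the attested order "the cat sat" receives a higher log-probability than the scrambled order "sat cat the", which is the signal a decoder can use when choosing between candidate word orders.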
KantanMT.com is a SaaS-based SMT platform that allows its users to build KantanMT
engines for more than 750 language pairs. Efficient word reordering is essential to
our platform and, therefore, to our clients.
KantanMT.com uses the Moses SMT toolkit as its main technology for building SMT
engines and decoding. It employs the distance-based word reordering model (Koehn, et
al., 2003) that is implemented in Moses. For some language pairs, e.g., English and
Chinese, the word reordering problem is exceptionally hard to resolve, as the target
word order differs substantially from the source word order and little, if any, information
about the target word order is available from the source sentence. This is due to
the significant differences between the grammars of the two languages.
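The distance-based model of Koehn, et al. (2003) scores each phrase by alpha raised to the power |a_i - b_(i-1) - 1|, where a_i is the start position of the source phrase translated into the i-th target phrase and b_(i-1) is the end position of the source phrase translated before it. The sketch below computes this score for a given phrase order. The function name reordering_score, the 1-based span representation and the value alpha = 0.5 are assumptions for illustration; this is not the Moses implementation.

```python
def reordering_score(phrase_spans, alpha=0.5):
    """Distance-based reordering score for source phrases in target order.

    phrase_spans holds (start, end) source positions, 1-based and inclusive,
    listed in the order in which the phrases are translated.
    """
    score = 1.0
    prev_end = 0  # position just before the first source word
    for start, end in phrase_spans:
        distance = abs(start - prev_end - 1)  # 0 when translation is monotone
        score *= alpha ** distance
        prev_end = end
    return score


# Hypothetical six-word source sentence split into three phrases.
monotone = reordering_score([(1, 2), (3, 3), (4, 6)])
reordered = reordering_score([(4, 6), (1, 2), (3, 3)])
```

A monotone translation incurs no penalty, while translating the last phrase first accumulates a total jump distance of 9. Because the penalty depends only on distance, the model discourages the long-range movements that language pairs such as English and Chinese often require.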
A popular class of reordering approaches addresses this challenge head-on by
applying reordering to the training data before the translation model is computed,
so that the source and target sides share a harmonised word order. This approach
was first exploited by Nießen & Ney (2004), who propose monotonization
of the source part of the parallel data by exploiting morpho-syntactic information.
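As a rough illustration of source-side reordering, the sketch below applies a single hand-written rule to POS-tagged English tokens, moving verbs to the end of the sentence to approximate a verb-final target order such as that of Japanese. The rule, the tag names and the function verb_final_reorder are hypothetical and far simpler than the morpho-syntactic transformations of Nießen & Ney (2004); they only show where such a step would sit in the pipeline.

```python
def verb_final_reorder(tagged_tokens):
    """Move all verbs to the end of the sentence, before final punctuation.

    tagged_tokens is a list of (word, tag) pairs; the tags "VERB" and "PUNCT"
    are assumed labels, not the output of any particular tagger.
    """
    tokens = list(tagged_tokens)
    tail = []
    if tokens and tokens[-1][1] == "PUNCT":
        tail = [tokens.pop()]  # keep sentence-final punctuation in place
    verbs = [tok for tok in tokens if tok[1] == "VERB"]
    others = [tok for tok in tokens if tok[1] != "VERB"]
    return others + verbs + tail


# Hypothetical tagged source sentence.
source = [("John", "NOUN"), ("reads", "VERB"), ("books", "NOUN"), (".", "PUNCT")]
reordered_words = [word for word, _ in verb_final_reorder(source)]
```

The same function would be applied to the source side of the training data and to every input sentence at translation time, so that the translation model is trained and used on text with a consistent word order.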
We propose to expand the word reordering capabilities of KantanMT.com in the decoding
process for some challenging language pairs. Namely, we want to leverage the existing
research on this topic and explore the impact of word reordering on English->German,
English->Chinese and English->Japanese language pairs.
This project will be managed as a summer internship program starting in June 2016 and
finishing in September 2016. It shall involve:
i) analysis of the existing literature: the Intern shall analyse the literature on the
topic of word reordering and present a thorough report that scores and
compares existing methods. This report should suggest one or more existing or
novel methods to be implemented in KantanMT.com.
ii) implementation of the suggested method(s) in KantanMT.com: the Intern
shall use best practices, as employed by KantanMT.com, to design and implement
the suggested method(s).
iii) evaluation of the performance of the method(s): the Intern shall perform an
extensive evaluation of the implemented method(s) and compare their
performance with that of the current system.
The project will be mentored by Tony O’Dowd and two additional members of the
KantanMT core development team.
Detailed Project Costs
Step/Phase | Duration
1. Review of current academic research | 20 days
2. Comparison of word re-ordering techniques and methods | 20 days
3. Functional Specification | 5 days
4. Technical Specification | 10 days
5. Project Schedule | 1 day
6. Implementation | 34 days
Total | 90 days
Detailed Project Schedule
NOTE: For an online version of this schedule, please refer to this link:
http://publish.smartsheet.com/e14732b618ae4621a766770a77f39b0b
Why would this task be helpful for the student?
Better translation quality is a constant challenge and focus of all statistical machine
translation development. With the emergence of Moses as a standard in SMT, word
reordering has become a practical and tested approach to improving quality. However,
it is a complex and hard problem to tackle: Knight (1999) analyses the complexity of
word-reordering models. His research and the many studies that followed have shown
word reordering to be a viable approach to improving translation quality.
We believe this project presents an opportunity for the Intern to explore and compare
best-of-breed approaches to word reordering and then to implement a solution within the
KantanMT platform framework that delivers higher quality translation outputs.
Furthermore, this project will allow the Intern to gain practical experience in the
application side of statistical machine translation and to apply their theoretical
knowledge and capabilities to a live SMT platform, i.e., KantanMT.com.
Why would this task be helpful to the MT Community?
Creating higher quality outputs from SMT systems is a broad area of research within the
MT Community. This project presents an opportunity to investigate deeply and
thoroughly the impact that word re-ordering will have on language models and
translation outputs.
The outputs of this project will be shared with the MT Community by producing a paper
on deploying word re-ordering strategies for greater translation quality. This paper will
be co-authored by the Intern along with help from Tony O’Dowd (Chief Architect) and Dr.
Dimitar Shterionov (SMT Researcher).
This project focuses on transferring ideas from academic research on word
reordering into the KantanMT.com platform and should thus help to narrow
the gap between academia and industry.
Support
This program is supported by the European Association for Machine Translation (EAMT).
About EAMT
The European Association for Machine Translation (EAMT) is an organization that serves
the growing community of people interested in MT and translation tools, including users,
developers, and researchers of this increasingly viable technology.
As part of its commitment to promote research, development and awareness about
translation technologies, the EAMT is for the first time launching a call for summer
internships.
About KantanMT.com
KantanMT.com is a leading SaaS-based machine translation platform that enables users
to develop and manage customised machine translation engines in the cloud. The
innovative technologies offered on the KantanMT.com platform enable users to easily
build MT engines in over 750 language combinations, seamlessly integrating into
localization workflows and web applications. KantanMT is based in the INVENT Building,
DCU Campus, Dublin 9, Ireland.
References
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. & Mercer, R. L., 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), pp. 263-311.
He, J. & Liang, H., 2011. Word-reordering for Statistical Machine Translation Using Trigram Language Model. Proceedings of IJCNLP, November, pp. 1288-1293.
Khalilov, M., Fonollosa, J. & Dras, M., 2009. Coupling hierarchical word reordering and decoding in phrase-based statistical machine translation. Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation, June, pp. 78-86.
Knight, K., 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, 25(4), pp. 607-615.
Koehn, P., Och, F. & Marcu, D., 2003. Statistical phrase-based translation. 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, May, pp. 48-54.
Nießen, S. & Ney, H., 2004. Statistical machine translation with scarce resources using morpho-syntactic information. Computational Linguistics, 30(2), pp. 181-204.
Rottmann, K. & Vogel, S., 2007. Word reordering in statistical machine translation with a POS-based distortion model. Proceedings of TMI, pp. 171-180.
Tillmann, C. & Ney, H., 2003. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational Linguistics, pp. 97-133.
Wang, C., Collins, M. & Koehn, P., 2007. Chinese Syntactic Reordering for Statistical Machine Translation. EMNLP-CoNLL, June, pp. 737-745.
Wang, Y. & Waibel, A., 1997. Decoding algorithm in statistical machine translation. Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, July, pp. 366-372.