Word reordering for the English-German, English-Japanese and English-Chinese language pairs

Summer Internship 2016

KantanMT.com




Contents

Executive Summary
Contact Information
  Company Name
  Project Mentors & Management
Activity to be performed by the Intern
Experience of KantanMT as proposing organisation
  Platform Features
  Platform Infrastructure
Experience of Mentoring Person(s)
Detailed Proposal Description
  Detailed Project Costs
  Detailed Project Schedule
Why would this task be helpful for the student?
Why would this task be helpful to the MT Community?
Support
  About EAMT
  About KantanMT.com
References


Executive Summary

In Statistical Machine Translation (SMT), word reordering (also referred to in the literature as word replacement) is the task of arranging tokens in a sequence that accords with the grammatical rules of the target language (Knight, 1999; Koehn, et al., 2003; Nießen & Ney, 2004; Rottmann & Vogel, 2007). The word reordering method applied in an SMT engine has a major impact on the quality and fluency of the translation.

KantanMT.com is a SaaS-based SMT platform that allows its users to build KantanMT engines for more than 750 language pairs. Efficient word reordering is essential for our platform and, therefore, for our clients.

We propose to expand the word reordering capabilities of KantanMT.com in the translation process for some challenging language pairs. Specifically, we want to leverage the existing research on this topic and explore the impact of word reordering on the English->German, English->Chinese and English->Japanese language pairs.

This project will be managed as a summer internship program starting in June 2016 and finishing in September 2016. It shall involve the analysis of existing literature and the implementation and evaluation of existing or novel methods within KantanMT.com.

The project will be mentored by Tony O’Dowd and two additional members of the KantanMT core development team.

This research will contribute to the efficiency of the KantanMT platform and will also have an impact on the SMT community by bridging the gap between academia and industry.


Contact Information

Company Name

Name: KantanMT.com
Address: INVENT Building, DCU Campus, Glasnevin, Dublin 9, Ireland
Contact Numbers: [email protected], +353-1-87-2405-154

Project Mentors & Management

Name: Tony O’Dowd
Role: Chief Architect
Contact Details: [email protected], +353-1-87-2405-154

Name: Dr. Dimitar Shterionov
Role: Project Mentor
Contact Details: [email protected]

Name: Marek Mazur
Role: Development Lead
Contact Details: [email protected]


Activity to be performed by the Intern

The Intern will be expected to produce a formal research and development document detailing the most current research and academic findings on the topic of word reordering, appraise these findings, and then select an approach for development based on best practices, best outputs and best research.

The project steps/phases are outlined below:

1. Review of current academic research
Description: Review of current academic research and approaches to the problem of word reordering.
Outputs: Research & Development Review document

2. Comparison of word reordering techniques and methods
Description: The Intern will appraise the merits of each approach and rank the approaches by efficiency (with regard to compute time) and quality (in terms of translation outputs).
Outputs: Comparison/Appraisal document of word reordering methods and techniques

3. Functional Specification
Description: Based on the previous outputs, the Intern will be required to select an approach/method for reordering and develop a Functional Specification document for the implementation of this approach/method.
Outputs: Functional Specification

4. Technical Specification
Description: Based on an agreed Functional Specification for the implementation of word reordering, the Intern will be required to produce a Technical Specification document for the implementation of word reordering for the English->German, English->Chinese and English->Japanese language pairs.
Outputs: Technical Specification

5. Project Schedule
Description: The Intern will be required to devise a project schedule for the implementation of the selected word reordering approach/method in KantanMT.com.
Outputs: Project Schedule & Implementation Plan

6. Implementation and testing
Description: The Intern will be required to code the word reordering approach/method based on best practices and implement it as a subsystem of KantanMT.com. The Intern shall then provide a set of unit tests that fully verify the correctness and integrity of the implementation.
Outputs: Code Development


Experience of KantanMT as proposing organisation

KantanMT.com provides a sophisticated and powerful Machine Translation solution in an easy-to-use package. Our community do not need any technical skills or IT support, and there are no special hardware or software requirements. Members can log on and build customised Machine Translation engines immediately on the KantanMT.com platform. There is no compromise between functional depth and ease of use. Our community just get more. Faster.

Platform Features

Customise: The KantanMT Community create customised Machine Translation engines using translation memories (TMX), terminology files (TBX) and/or free stock training sets.

Analytics: The KantanMT Community determine the performance quality of KantanMT engines using KantanAnalytics™, a unique segment level quality estimation technology.

Translate: The KantanMT Community translate more in less time using customised KantanMT engines. KantanTotalRecall™ is built into every KantanMT engine to improve both performance and accuracy.

Measure: The KantanMT Community use automated quality measurements, such as BLEU, TER and F-Measure, to track translation quality.

Integrate: The KantanMT Community easily develop Machine Translation applications for any device or translation management system using KantanAPI™, a powerful web services SDK.


Deploy: While traditional MT deployments can take months, KantanMT can be deployed within hours!

Platform Infrastructure

The KantanMT platform is a highly distributed SaaS implementation of the Moses Decoder. Currently deployed over 700 servers, across three data centres, it is the largest customised machine translation platform in the localisation industry.

In 2015, the KantanMT platform hosted over 8,000 statistical machine translation engines, translated over 5.5 billion words and was used by over 4,000 members.


Experience of Mentoring Person(s)

Name: Tony O’Dowd
Role: Chief Architect
Experience: Tony is a serial entrepreneur with 3 start-ups to his name. He has over 30 years’ experience in the localisation industry and is well known as a thought leader and innovator in technology solutions. His previous company, Alchemy Software Development, was the market leader in visual translation memory technology. His current company, KantanMT.com, is the biggest customised machine translation platform in the industry. Tony is a Fellow of the Localisation Resource Centre, University of Limerick, Ireland.

Name: Dr. Dimitar Shterionov
Role: Project Mentor
Experience: Dr. Shterionov graduated from the Arenberg Doctoral School, KU Leuven (Faculty of Engineering) in 2015. His doctoral thesis is titled “Design and Development of Probabilistic Inference Pipelines”. He is currently a researcher on the Moses core technology that powers the KantanMT platform.


Detailed Proposal Description

The Statistical Machine Translation (SMT) paradigm comprises two main processes: i) training an SMT system (or engine) by providing the parallel, bilingual data required to build a translation model and the monolingual data used to build a (target) language model; and ii) translating given input data in the source language into an adequate translation in the target language. One of the major issues in the latter process is word reordering, i.e., the task of arranging tokens in a sequence that accords with the grammatical rules of the target language (Knight, 1999; Koehn, et al., 2003; Nießen & Ney, 2004; Rottmann & Vogel, 2007). Word reordering is a computationally expensive task: Knight (1999) proves the worst-case complexity (for bigram models) to be NP-hard.

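Language modelling has been successfully applied to address word reordering. A language model assigns a probability to a sequence of words according to a probability distribution estimated from monolingual target-language data, so that more fluent word orders receive higher probabilities.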
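To make the role of the language model concrete, the following is a minimal sketch, not the Moses or KantanMT implementation. It assumes a toy bigram model with add-one smoothing; the function names (`train_bigram_lm`, `log_prob`, `best_order`) and the three-sentence corpus are hypothetical. The exhaustive search over permutations also shows why exact reordering does not scale: the number of candidate orders grows factorially with sentence length.

```python
import itertools
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count histories and bigrams over tokenised sentences with boundary markers."""
    histories, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        histories.update(tokens[:-1])            # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
        vocab.update(sentence)
    return histories, bigrams, len(vocab)


def log_prob(tokens, model):
    """Log-probability of a token sequence under the add-one smoothed bigram model."""
    histories, bigrams, vocab_size = model
    padded = ["<s>"] + list(tokens) + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        # Add-one smoothing keeps unseen bigrams from scoring log(0).
        total += math.log((bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size))
    return total


def best_order(tokens, model):
    """Brute-force search over all n! orders; feasible only for very short inputs."""
    return max(itertools.permutations(tokens), key=lambda order: log_prob(order, model))


# Hypothetical toy corpus, for illustration only.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "slept"]]
model = train_bigram_lm(corpus)
print(best_order(["sat", "the", "cat"], model))
```

A production decoder would not enumerate permutations; it would prune the search (for example with beam search and a distortion limit), which is where the choice of reordering model matters.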

KantanMT.com is a SaaS-based SMT platform that allows its users to build KantanMT engines for more than 750 language pairs. Efficient word reordering is essential for our platform and, therefore, for our clients.

KantanMT.com uses the Moses SMT toolkit as its main technology for building SMT engines and decoding. It employs the distance-based word reordering model (Koehn, et al., 2003) that is implemented in Moses. For some language pairs, e.g., English and Chinese, the word reordering problem is exceptionally hard to resolve, as the target word order differs substantially from the source word order and little, if any, information about the target word order is available from the source sentence. This is due to the significant differences between the grammars of the two languages.
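The distance-based model mentioned above can be sketched as follows. This is a simplified illustration rather than the Moses code: it follows the formulation in Koehn, et al. (2003), where each target phrase is charged according to how far its source phrase starts from the end of the previously translated source phrase. The function names, the inclusive 0-based span convention and the value of `alpha` are assumptions made for the example.

```python
def distortion_distance(phrase_spans):
    """Sum |start_i - end_(i-1) - 1| over source spans listed in target order.

    Each span is an inclusive (start, end) pair of 0-based source positions.
    """
    total = 0
    prev_end = -1  # so that a first phrase starting at position 0 costs nothing
    for start, end in phrase_spans:
        total += abs(start - prev_end - 1)
        prev_end = end
    return total


def distortion_penalty(phrase_spans, alpha=0.5):
    """Exponential penalty alpha ** distance; alpha is an assumed tuning value."""
    return alpha ** distortion_distance(phrase_spans)


monotone = [(0, 1), (2, 2), (3, 4)]  # source phrases translated left to right
swapped = [(0, 1), (3, 4), (2, 2)]   # last two source phrases translated in reverse

print(distortion_distance(monotone), distortion_penalty(monotone))
print(distortion_distance(swapped), distortion_penalty(swapped))
```

Because the cost depends only on jump distance, the model penalises every long-range movement equally, whether or not the target grammar requires it. This is one reason it struggles on pairs such as English and Chinese.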

A popular class of reordering approaches that addresses this challenge head-on applies reordering to the training data, prior to the computation of the translation model, so that the source and target sentences follow a harmonised word order. This approach was first exploited by Nießen & Ney (2004), who propose monotonization of the source part of the parallel data by exploiting morpho-syntactic information.
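As a rough illustration of reordering applied to the source side before training, the sketch below uses a single hand-written rule. It is not the method of Nießen & Ney (2004); the rule (move verbs to the end of the sentence, approximating a verb-final target order such as Japanese), the function name `move_verbs_to_end` and the coarse tag names are all hypothetical, and the input is assumed to be already part-of-speech tagged.

```python
def move_verbs_to_end(tagged_tokens, verb_tags=frozenset({"VERB", "AUX"})):
    """Toy pre-reordering rule: non-verbs in order, then verbs, then final punctuation.

    tagged_tokens is a list of (token, tag) pairs for a single clause.
    """
    body = list(tagged_tokens)
    tail = []
    if body and body[-1][1] == "PUNCT":
        tail = [body.pop()]  # keep sentence-final punctuation in place
    non_verbs = [pair for pair in body if pair[1] not in verb_tags]
    verbs = [pair for pair in body if pair[1] in verb_tags]
    return non_verbs + verbs + tail


source = [("she", "PRON"), ("reads", "VERB"), ("books", "NOUN"), (".", "PUNCT")]
reordered = move_verbs_to_end(source)
print(" ".join(token for token, _ in reordered))
```

In a pre-reordering pipeline the same transformation would be applied both to the source side of the parallel training data and to every input sentence at translation time, so that the translation model is trained and used on consistently reordered text.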

We propose to expand the word reordering capabilities of KantanMT.com in the decoding process for some challenging language pairs. Specifically, we want to leverage the existing research on this topic and explore the impact of word reordering on the English->German, English->Chinese and English->Japanese language pairs.

This project will be managed as a summer internship program starting in June 2016 and finishing in September 2016. It shall involve:

i) analysis of the existing literature: the Intern shall analyse the literature on the topic of word reordering and present a thorough report that scores and compares existing methods. This report should suggest one or more existing or novel methods to be implemented in KantanMT.com.

ii) implementation of the suggested method(s) in KantanMT.com: the Intern shall use the best practices employed by KantanMT.com to design and implement the suggested method(s).

iii) evaluation of the performance of the method(s): the Intern shall perform an extensive evaluation of the implemented method(s) and compare their performance with that of the current system.
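The comparison in step iii) could start from something as simple as the sketch below. It is not full BLEU, TER or F-Measure; it computes only clipped n-gram precision, one building block of BLEU, which is already sensitive to word order when n is 2 or more. The function names and the German example sentences are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of the n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n=2):
    """Fraction of hypothesis n-grams that also occur in the reference (counts clipped)."""
    hyp = ngram_counts(hypothesis, n)
    ref = ngram_counts(reference, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


reference = "er hat das Buch gelesen".split()
baseline = "er hat gelesen das Buch".split()       # right words, English-like order
with_reordering = "er hat das Buch gelesen".split()

print(clipped_precision(baseline, reference))
print(clipped_precision(with_reordering, reference))
```

Both outputs contain exactly the reference words, so unigram precision cannot separate them; the bigram score can. A real evaluation would use corpus-level metrics over a held-out test set rather than a single sentence.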

The project will be mentored by Tony O’Dowd and two additional members of the KantanMT core development team.


Detailed Project Costs

Step/Phase: Duration
1. Review of current academic research: 20 days
2. Comparison of word reordering techniques and methods: 20 days
3. Functional Specification: 5 days
4. Technical Specification: 10 days
5. Project Schedule: 1 day
6. Implementation: 34 days
Total: 90 days

Detailed Project Schedule

NOTE: For an online version of this schedule, please refer to this link: http://publish.smartsheet.com/e14732b618ae4621a766770a77f39b0b


Why would this task be helpful for the student?

Better quality translation is a constant challenge and focus of all statistical machine translation development. With the emergence of Moses as a standard in SMT, word reordering has become a practical and tested approach to improving quality. However, it is a complex and hard problem to tackle. Knight (1999) analyses the complexity of word reordering models. His research, and much of the work that followed, has shown word reordering to be a viable and effective approach to improving translation quality.

I believe this project presents an opportunity for the Intern to explore and compare best-of-breed approaches to word reordering and then to implement a solution within the KantanMT platform framework that delivers higher quality translation outputs.

Furthermore, this project will allow the Intern to gain practical experience in the application side of statistical machine translation and to apply their theoretical knowledge and capabilities to a live SMT platform, i.e., KantanMT.com.


Why would this task be helpful to the MT Community?

Creating higher quality outputs from SMT systems is a broad area of research within the MT Community. This project presents an opportunity to investigate deeply and thoroughly the impact that word reordering has on language models and translation outputs.

The outputs of this project will be shared with the MT Community through a paper on deploying word reordering strategies for greater translation quality. This paper will be co-authored by the Intern, with help from Tony O’Dowd (Chief Architect) and Dr. Dimitar Shterionov (SMT Researcher).

This project focuses on transferring ideas from academic research on word reordering into the KantanMT.com platform, and should thereby help to narrow the gap between academia and industry.


Support

This program is supported by the European Association for Machine Translation (EAMT).

About EAMT

The European Association for Machine Translation (EAMT) is an organization that serves the growing community of people interested in MT and translation tools, including users, developers, and researchers of this increasingly viable technology.

As part of its commitment to promote research, development and awareness about translation technologies, the EAMT is for the first time launching a call for summer internships.

About KantanMT.com

KantanMT.com is a leading SaaS-based machine translation platform that enables users to develop and manage customised machine translation engines in the cloud. The innovative technologies offered on the KantanMT.com platform enable users to easily build MT engines in over 750 language combinations and to integrate them seamlessly into localization workflows and web applications. KantanMT is based in the INVENT Building, DCU Campus, Dublin 9, Ireland.


References

Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. & Mercer, R. L., 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), pp. 263-311.

He, J. & Liang, H., 2011. Word-reordering for Statistical Machine Translation Using Trigram Language Model. Proceedings of IJCNLP, November, pp. 1288-1293.

Khalilov, M., Fonollosa, J. & Dras, M., 2009. Coupling hierarchical word reordering and decoding in phrase-based statistical machine translation. Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation, June, pp. 78-86.

Knight, K., 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, 25(4), pp. 607-615.

Koehn, P., Och, F. & Marcu, D., 2003. Statistical phrase-based translation. 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, May, pp. 48-54.

Nießen, S. & Ney, H., 2004. Statistical machine translation with scarce resources using morpho-syntactic information. Computational Linguistics, 30(2), pp. 181-204.

Rottmann, K. & Vogel, S., 2007. Word reordering in statistical machine translation with a POS-based distortion model. Proceedings of TMI, pp. 171-180.

Tillmann, C. & Ney, H., 2003. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational Linguistics, pp. 97-133.

Wang, C., Collins, M. & Koehn, P., 2007. Chinese Syntactic Reordering for Statistical Machine Translation. EMNLP-CoNLL, June, pp. 737-745.

Wang, Y. & Waibel, A., 1997. Decoding algorithm in statistical machine translation. Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, July, pp. 366-372.