
  • An empirical study of Neural Machine Translation Approaches in Limited Data Scenarios

    The case of English - Indonesian language pair

    Arra’di Nur Rizal

    Uppsala University
    Department of Linguistics and Philology
    Master Programme in Language Technology
    Master’s Thesis in Language Technology, 30 ECTS credits

    October 19, 2020

    Supervisor: Dr. Christian Hardmeier, Uppsala University

  • Abstract

    Despite producing state-of-the-art results on translation tasks, Neural Machine Translation seems to require a massive amount of high-quality parallel data. However, recent studies suggest that hyper-parameter tuning and transfer learning can reduce the amount of data required for training. This thesis explores existing Neural Machine Translation (NMT) methods in limited data settings, especially for the English - Indonesian language pair. A Statistical Machine Translation (SMT) system and 4 NMT systems are tested in 4 data availability scenarios using 5 drastically different corpus sizes. A total of 135 models are created to produce 45 BLEU score averages with which to compare the performance of the different machine translation systems in different settings. A BLEU score for Google Translate's output is included in the comparison as well. 4 professional translators also conducted a further assessment of the translation quality of the different models. Several findings are presented in this study: 1. the switch point at which the NMT system becomes better than the SMT system; 2. the transfer learning performance on different training datasets; 3. the NMT method and the minimum data size required to produce better results than Google Translate; 4. the state of RNN-based NMT system effectiveness given the arrival of the newer Transformer-based NMT system. To conclude, suggestions for selecting NMT approaches, based on the corpus size and data availability, are presented.

  • Contents

    Acknowledgements

    1. Introduction
       1.1. Purpose
       1.2. Beneficiaries
       1.3. Limitation
       1.4. Outline

    2. Background
       2.1. Machine Translation
            2.1.1. Rule-Based Machine Translation (RBMT)
            2.1.2. Statistical Machine Translation (SMT)
            2.1.3. Neural Machine Translation (NMT)
            2.1.4. Language Representation for Handling Out-of-Vocabulary
            2.1.5. Automatic Evaluation
            2.1.6. Human Evaluation
            2.1.7. Machine Translation on Low Resource Settings
       2.2. Indonesian Language (Bahasa Indonesia)
       2.3. Research on Machine Translation in Indonesian Language

    3. Methodology
       3.1. Experiment Design
       3.2. Limited Data Scenarios
       3.3. Data Collection and Preparation
            3.3.1. Preparing the Main Corpora
            3.3.2. Preparing the Supporting Corpora
            3.3.3. Corpora Analysis
            3.3.4. Preparing the TED Corpus
            3.3.5. Preparing the Open Subtitle 2018 Corpus
            3.3.6. Preparing the Global Voice Corpus
            3.3.7. Preparing the Wikimatrix Corpus
            3.3.8. Preparing the PANL Corpus
            3.3.9. Preparing the German-English Corpora
            3.3.10. Dataset usages in scenarios
       3.4. Benchmark
            3.4.1. Phrase-Based Statistical Machine Translation Baseline
            3.4.2. Google Translate Baseline
       3.5. Experiments
            3.5.1. Scenario 1
            3.5.2. Scenario 2
            3.5.3. Scenario 3
            3.5.4. Scenario 4

    4. Results and Discussion
       4.1. Results from Scenario 1
       4.2. Results from Scenario 2
       4.3. Results from Scenario 3
       4.4. Results from Scenario 4
       4.5. Translation Quality Assessments
            4.5.1. Source sentence length is 3 words
            4.5.2. Source sentence length is 6 words
            4.5.3. Source sentence length is 8 words
            4.5.4. Source sentence length is 15 words
            4.5.5. Source sentence length is 20 words
            4.5.6. Summary

    5. Conclusion and Future Work
       5.1. Conclusion
       5.2. Future Work

    A. Flowchart to Select Machine Translation Approach
    B. Hyper-Parameter Configuration for NMT Systems
    C. Experiment Pipelines
    D. Translation Outputs
       D.1. Source sentence length is 3 words
       D.2. Source sentence length is 6 words
       D.3. Source sentence length is 8 words
       D.4. Source sentence length is 15 words
       D.5. Source sentence length is 20 words
    E. LSTM cell and GRU cell
    F. Attempt to Replicate Sennrich and Zhang (2019) Results

  • Acknowledgements

    I would like to thank my supervisor Dr. Christian Hardmeier, who is currently in Edinburgh, for his support, his valuable advice, and his time.

    A big thank you to all the teachers on the Master Programme in Language Technology for the knowledge they have imparted, their support, and their flexibility in understanding my circumstances.

    I also want to give my gratitude to Uppsala University, Sweden, for securing the computing power in Norway for this project.

    My sincere appreciation towards the National Infrastructure for High-Performance Computing and Data Storage in Norway for access to the UNINETT Sigma2 GPU cluster on which the experiments in this study were conducted.

    Thank you, Lund University - LDC - Sweden, for the consideration of letting me work part-time, which allowed me to finish the study.

    I would also like to express my gratitude towards the Indonesian computational linguistics community, especially Fikri Aji of Edinburgh University, for the insightful discussion and help.

    Thank you as well, Polyglot Indonesia, for the community support in evaluating the translation results.

    I am thankful for my family in Indonesia and in Scotland for their unfaltering support.

    My immense love for my family in Denmark, especially my wife, G.S.F., who gives me space and time to do the work and write the report.

    Finally, sincere love to my bundle of joy, who gives a massive boost of endorphins amid extreme commuting and challenging times. A.D.R


  • 1. Introduction

    Neural Machine Translation (Bahdanau et al., 2015; Sutskever et al., 2014; Vaswani et al., 2017) has become the dominant paradigm for machine translation. Neural Machine Translation (NMT) has produced results that surpass Phrase-Based Statistical Machine Translation (PB-SMT), achieving state-of-the-art results on publicly available benchmark data sets. T. Luong et al. (2015) show that their NMT system improved over the state-of-the-art phrase-based system by 0.5 BLEU points on the WMT'14 English-to-French corpus. A year later, Wu et al. (2016) showed that an ensemble of 8 NMT models outperforms PB-SMT by 3.35 BLEU points on the same WMT'14 English-to-French corpus.

    Neural Machine Translation typically uses a neural network to encode source language sentences as input. Another neural network is then utilized to generate a sentence in the target language based on the encoder's output. This architecture is usually referred to as the encoder-decoder architecture (Cho, van Merriënboer, Bahdanau, et al., 2014; Sutskever et al., 2014). These neural networks can be Recurrent Neural Network (RNN) variants with an attention mechanism (Bahdanau et al., 2015) or a non-recurrent system with a self-attention mechanism, also known as the Transformer (Vaswani et al., 2017).

    Despite producing state-of-the-art results on translation tasks, the neural approach has a caveat. A study by Koehn and Knowles (2017) shows that an effective NMT system requires a huge amount of high-quality parallel data; otherwise, the NMT system tends to perform poorly. This requirement poses a real practical challenge in limited data settings, for example, if the language pair does not have a large parallel corpus or the available corpus for a specific domain is small.

    To alleviate this issue, many researchers have addressed the challenge by tuning the system's hyper-parameters for low-resource settings (Sennrich and Zhang, 2019), using meta-learning (Gu et al., 2018), performing transfer learning (Zoph et al., 2016), using pivoting or triangulation techniques (Y. Chen et al., 2017), and trying semi-supervised approaches (He et al., 2016) as well as unsupervised approaches (Artetxe et al., 2017; Yang et al., 2018), which do not require parallel corpora.

    1.1. Purpose

    My thesis aims to explore existing Neural Machine Translation methods in limited data settings, especially for the English - Indonesian language pair. The reason to focus on the Indonesian language is that, besides being a morphologically rich language, Bahasa Indonesia (the Indonesian language) has a large number of speakers (Eberhard et al., 2019; Kozok, 2012). Thus, my research purposes are as follows:

    1. To find the machine translation methods that perform best under different data scenarios on the English - Indonesian language pair by empirically comparing the performance of each approach.


    2. To find a switch point where we should use the NMT system instead of the PB-SMT system on the English - Indonesian language pair. Koehn and Knowles (2017) show that RNN-based NMT outperforms the PB-SMT system at about 15 million words on German-to-English. However, Sennrich and Zhang (2019) show that, with hyper-parameter tuning, RNN-based NMT is able to outperform the PB-SMT system at about 100 thousand words on German-to-English. This study presents the switch point for Indonesian-to-English.

    3. To find the training dataset size and the method required for a machine translation system to produce better performance than publicly available machine translation services.

    4. To find whether the bidirectional RNN-based NMT architecture with attention mechanism is still useful for Indonesian-to-English machine translation tasks given the existence of the Transformer architecture.

    1.2. Beneficiaries

    This study's immediate beneficiaries are companies or organizations that want to implement machine translation for the English-Indonesian language pair but do not have extensive resources. Moreover, any entities that want to do machine translation with a small amount of data could benefit from the study's results. Furthermore, machine translation research communities interested in low-resource language research could benefit as well.

    1.3. Limitation

    This study only uses techniques that can be implemented using existing, publicly available tools. The reason is to give any entity an easy way to replicate the study's results and gain the benefits without investing in tool development.

    1.4. Outline

    The report is structured as follows to communicate the thesis findings. Chapter 2 describes all the concepts necessary to understand the theory and methods of machine translation, as well as a description of the Indonesian language. The information in chapter 2 is needed to comprehend the methodology and experiments in the subsequent chapters. Chapter 3 discusses the methodology, such as data collection and preparation, and the experiment design. This chapter explains all of the steps required to obtain the datasets for the experiments, the preparation required, and the analysis of each corpus. The creation of the training dataset, development dataset, and test dataset is explained as well. Chapter 3 also contains the system configurations of every system used in this study as well as an explanation of the different data availability scenarios. In essence, chapter 3 is useful for replicating this study. The results of the experiments are presented in chapter 4. The chapter contains a comparison of the performance of the SMT (PB-SMT) system and the NMT systems in different scenarios. Further analysis of the results is discussed in chapter 4. I conclude this study with suggestions on machine translation approaches with regard to different scenarios as well as potential future work in chapter 5. All of the questions posed in the purpose section (see section 1.1) are answered in chapter 5.


  • 2. Background

    This chapter explains machine translation (MT) history, MT theories, and MT evaluation metrics. The explanation includes a description of the different machine translation types as well as the common methods employed in low-resource settings. The Indonesian language, or Bahasa Indonesia, is briefly discussed afterward before the chapter ends with the state of machine translation research on the Indonesian language.

    2.1. Machine Translation

    Machine Translation is a subset of computational linguistics which explores approaches to automatically translate one natural language (the source) into another natural language (the target) using a computerized system. There are multiple common approaches, as shown in figure 2.1: Rule-Based Machine Translation (RBMT), Statistical Machine Translation (SMT), and Neural Machine Translation (NMT). Since this study only conducts experiments with SMT and NMT, RBMT is explained succinctly.

    2.1.1. Rule-Based Machine Translation (RBMT)

    A Rule-Based Machine Translation (RBMT) system uses sets of rules which incorporate linguistic information about both the source and target language to translate the source language into the target language. The linguistic information includes morphological rules, lexicon transfer and generation, and syntactic and semantic analysis (Simard et al., 2007). This linguistic information is built and collected manually into a database of rules.

    RBMT can be categorized into 3 methods (see Figure 2.1). The first method is the Direct method, where the system does direct word-by-word translation. The second method is the Transfer method, where the system uses an intermediate abstract representation for the language pair. This representation is used to encode sentences in the source language and generate sentences in the target language. The third method is the Interlingua method, where a language-independent abstract representation is created and the transfer step is not needed (Abiola et al., 2015).

    The creation of an RBMT system is highly manual and time-consuming. However, with very clean, uniform input (e.g. weather forecasts or GPS navigation), RBMT could

    Figure 2.1.: Different types of common approaches in Machine Translation


  • produce extremely precise output. There have been well-known RBMT systems such as Apertium (Forcada et al., 2011) and Systran (Toma, 1977).

    2.1.2. Statistical Machine Translation (SMT)

    A Statistical Machine Translation (SMT) system (Brown et al., 1990) uses a statistical model derived from the analysis of a large sample of translation examples, also known as a parallel corpus. As shown in figure 2.1, there are 3 main approaches in the statistical system.

    1. The first model is phrase-based (PB-SMT), where phrases or word sequences are extracted from parallel corpora using statistical methods. The input sentence is divided into phrases, which are then translated into phrases in the target language and possibly reordered (Koehn et al., 2003).

    2. The second model is syntax-based (SB-SMT), where syntactic units (e.g. (partial) parse trees of sentences/utterances) are translated instead of single words or sequences of words (as in phrase-based MT). A synchronous context-free grammar (SynCFG) between the source and target language is learned from parallel corpora. This approach is very slow compared to PB-SMT.

    3. The third model is the hierarchical phrase-based (HPB-SMT) system, where both the PB-SMT and SB-SMT approaches are used. The HPB-SMT model extracts words and sub-phrases and learns synchronous context-free grammar (SynCFG) rules from parallel corpora (Chiang, 2005).

    An SMT system usually has a language model, a translation model, and a decoder algorithm. The language model is used to increase the translation fluency in the target language by estimating how probable a sentence is. Fluency is the quality of the output translation based on its grammar and idiom choices.

    The language model is a probability distribution over words in the target language which allows the SMT system to calculate the probability of a word given the sequence of words that precedes it. An n-gram model is usually used as the language model in an SMT system. The n-gram model uses a Markov assumption, which lets us assume that we can predict the next word without looking at too many preceding words. For example, a bigram (n = 2) model considers one preceding word, and a trigram (n = 3) model considers two preceding words. Thus, an n-gram model considers n − 1 preceding words. In the n-gram approach, the sentence probability is the product of the conditional probabilities of each word (w_i) given the n − 1 preceding words. For instance, if a sentence s has words w_1 ... w_l, then the probability P(s) is approximated as in equation 2.1.

    $$P(s) = \prod_{i=1}^{l} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{l} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) \qquad (2.1)$$

    The conditional probability is estimated by calculating n-gram frequency counts in a monolingual corpus, usually called the training data (see equation 2.2). The frequency ratio is called the relative frequency. The use of relative frequencies to estimate probabilities is called maximum likelihood estimation (MLE).

    $$P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\mathrm{Count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\mathrm{Count}(w_{i-(n-1)}, \ldots, w_{i-1})} \qquad (2.2)$$


    Although MLE seems to be a good method, it behaves poorly when the monolingual corpus is small or the n-gram order increases. In these cases, sparse data becomes a problem. During translation, the possibility that we encounter an n-gram not seen in the training data increases. If we have not seen the n-gram, it will have a zero count, leading to a zero relative frequency estimate, which ultimately results in zero probability for the target sentence. To alleviate this issue, a smoothing technique is used. Smoothing is a technique to adjust the maximum likelihood estimate, hoping to produce more accurate probabilities. Other than preventing zero-count probabilities, smoothing lets infrequent sentences not be discarded too quickly. Examples of smoothing techniques are Add-One (Laplace), where all n-gram counts are simply incremented by 1, and Modified Kneser-Ney (S. F. Chen and Goodman, 1996), where a lower-order n-gram is used as a backoff if the higher-order n-gram counts are near zero.
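
    To make the n-gram estimates of equations 2.1 and 2.2 and the Add-One adjustment concrete, here is a minimal Python sketch on a toy corpus; the sentences and the vocabulary handling are illustrative only and not the setup used in this study.

```python
from collections import Counter

# Toy monolingual training data (illustrative only).
corpus = [
    "<s> she is budi 's mother </s>",
    "<s> she is a teacher </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

vocab_size = len(unigrams)

def p_mle(w, prev):
    # Relative frequency: Count(prev, w) / Count(prev), i.e. equation 2.2 with n = 2.
    return bigrams[(prev, w)] / unigrams[prev]

def p_add_one(w, prev):
    # Add-One (Laplace) smoothing: every bigram count is incremented by 1,
    # so no continuation ever receives zero probability.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)

print(p_mle("is", "she"))         # 1.0 - "is" always follows "she" here
print(p_mle("mother", "is"))      # 0.0 - unseen bigram under MLE
print(p_add_one("mother", "is"))  # small but non-zero after smoothing
```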

    The translation model is used to calculate the relative likelihood between words and phrases in the source and target languages. It estimates the lexical correspondence between the language pair in a parallel aligned corpus. The translation model tries to create an alignment model between individual words in the sentence pair. A probability distribution is used to model phrase translation, and Bayes' rule is used to produce the translation probability. The translation probability table is learned from parallel corpora. Equation 2.3 shows how to get the best output target sentence t given a source sentence s, where P_LM is the language model and ω is a bias factor larger than 1 that is applied for every target word.

    $$\hat{t} = \operatorname*{argmax}_{t} P(t \mid s) = \operatorname*{argmax}_{t} P(s \mid t) \, P_{LM}(t) \, \omega^{\mathrm{length}(t)} \qquad (2.3)$$

    The decoder component produces possible translations and evaluates their probabilities to find the best translation. Beam search is usually employed to keep multiple probable translations. Beam search works by creating a tree using the breadth-first search (BFS) algorithm. The decoder starts with an empty translation; then a possible translation of a word or phrase in the source sentence is added to form partial translation tree nodes. Every translated word or phrase is marked. This process is repeated for every node at every tree level until all words in the source sentence are translated. The probability of each generated sentence is calculated, and the highest-scoring one is taken as the translation output.
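
    The following is a minimal sketch of the beam search idea for a monotone, word-by-word decoder; the toy phrase table and its probabilities are invented for illustration, and a real decoder such as Moses also handles reordering and combines the translation model with the language model.

```python
import math

# Toy "phrase table": source word -> [(target phrase, probability)]. Illustrative only.
table = {
    "dia": [("she", 0.6), ("he", 0.4)],
    "ibunya": [("her mother", 0.5), ("his mother", 0.5)],
    "budi": [("budi", 1.0)],
}

def beam_search(source, beam_size=3):
    # A hypothesis is (target words so far, log-probability).
    beam = [((), 0.0)]
    for src_word in source:                     # translate left to right
        candidates = []
        for words, logp in beam:
            for tgt, p in table[src_word]:      # expand every partial translation
                candidates.append((words + (tgt,), logp + math.log(p)))
        # Prune: keep only the beam_size most probable partial translations.
        beam = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
    return max(beam, key=lambda h: h[1])

print(beam_search(["dia", "ibunya", "budi"]))   # highest-probability surviving path
```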

    Out of the 3 models described previously, Phrase-Based Statistical Machine Translation (PB-SMT) was the most commonly used and was once the state-of-the-art method before the rise of Neural Machine Translation (M.-T. Luong and Manning, 2015).

    2.1.3. Neural Machine Translation (NMT)

    A Neural Machine Translation (NMT) system uses a single large neural network to create an integrated translation model that maximizes the conditional probability of sentence pairs using a parallel corpus (Cho, van Merriënboer, Bahdanau, et al., 2014; Sutskever et al., 2014; Vaswani et al., 2017). Compared to a PB-SMT system, NMT does not have multiple sub-components such as a translation model and a language model. The NMT system is a black-box system trained end-to-end with a source sentence as input, which then outputs a potentially correct translation in the target language. An encoder-decoder


  • Figure 2.2.: A simplified version of the encoder-decoder architecture. The figure shows how the Indonesian sentence "Dia ibunya Budi" is encoded into a summary called a context vector. Then the vector is decoded into the English sentence "She is Budi's mother".

    Figure 2.3.: Standard Recurrent Neural Network architecture for learning the conditional probability distribution of an output sentence given an input sentence (Cho, van Merriënboer, Bahdanau, et al., 2014). In this particular figure, the RNN architecture is for learning P(y_1, ..., y_T′ | x_1, ..., x_T), where T is the total number of words in an input sentence and T′ is the total number of words in an output sentence. T and T′ do not necessarily need to be the same length. c is the summary of the input sentence, usually called the context vector; x is an input word; y is an output word; h is the RNN cell or the hidden layer.

    architecture (Cho, van Merriënboer, Gulcehre, et al., 2014; Sutskever et al., 2014) is used for NMT (see Figure 2.2). The encoder uses the source language sentence as input, and the decoder utilizes the encoder's output to generate a sentence in the target language. The usage of a neural network for creating a continuous representation of a sentence was published by Schwenk et al. (2006) to create a phrase-based continuous translation model. Kalchbrenner and Blunsom (2013) used the approach to develop Recurrent Continuous Translation Models (RCTM). RCTM was created to alleviate the issue in the language model caused by a large number of rare or unseen phrase pairs. Rare or unseen phrases do not share statistical weight; thus, the model's estimation is sparse or skewed in the target translation. The Recurrent Neural Network (RNN) encoder-decoder that is commonly used today (see Figure 2.3) was introduced by Cho, van Merriënboer, Bahdanau, et al. (2014).

    In implementations, an embedding layer and a softmax layer are added to the architecture (see Figure 2.4). The embedding layer is a component that maps a word into a vector representation and vice versa. The embedding layer is usually pre-trained with an unsupervised method; an example of a word embedding generator is word2vec (Mikolov et al., 2013). The softmax layer is used to normalize the neural network's output into a probability distribution. The softmax layer is essentially a softmax function, or normalized exponential function (see equation 2.4), that turns a vector of K real-valued scores into probabilities that sum to one.


  • Figure 2.4.: RNN encoder-decoder architecture for training an NMT system. The figure depicts an NMT training system that encodes the Indonesian sentence "satu kata ." into the English sentence "a word .". The figure shows the embedding layer and the softmax layer, which are usually added to the RNN encoder-decoder architecture. The "predicted word" is a word representation that needs to be mapped to a real word by the embedding layer. The encoder cells and the decoder cells are the RNN cells.

    $$y = \begin{bmatrix} 2.0 \\ 1.0 \\ 0.1 \end{bmatrix} \;\rightarrow\; \sigma(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{K} e^{y_j}} \;\rightarrow\; \begin{bmatrix} p = 0.7 \\ p = 0.2 \\ p = 0.1 \end{bmatrix} \qquad (2.4)$$
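
    As a quick check of the worked example in equation 2.4, a few lines of Python reproduce the probabilities (a minimal sketch; the max-subtraction is a standard numerical-stability trick, not part of the definition):

```python
import math

def softmax(scores):
    # Normalized exponential function (equation 2.4).
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 1) for p in softmax([2.0, 1.0, 0.1])])  # [0.7, 0.2, 0.1]
```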

    However, the recurrent neural network (RNN) suffers from the exploding and vanishing gradient problems (Hochreiter, 1998) if the sentence is too long (Cho, van Merriënboer, Bahdanau, et al., 2014). The vanishing gradient problem occurs when the gradient being propagated back through a deep network during training has vanished by the time it reaches the initial layers. The reason is that the gradients coming from deeper layers have to pass through multiple matrix multiplications back to the earlier layers (the chain rule). If the gradients from the deeper layers start with a value less than one, the gradients approaching the initial layers become exponentially smaller. If the gradients shrink as they back-propagate, the model will not learn the relationship between the beginning of the sentence and the end of the sentence. In other words, the longer the sentence, the less representative the start of the sentence is in the encoder's output.

    To reduce the effect of vanishing gradients, Sutskever et al. (2014) use the Long Short-Term Memory (LSTM) architecture (Hochreiter and Schmidhuber, 1997) as the RNN unit cell. Another approach to alleviating the effect is to use the Gated Recurrent Unit (GRU) architecture (Cho, van Merriënboer, Bahdanau, et al., 2014). GRU is less complex than LSTM because GRU (see Figure E.2 in appendix E) has two gates (a reset and an update gate), while LSTM (see Figure E.1 in appendix E) has three gates (an input, an output, and a forget gate). Gates are a mechanism that controls the flow of information inside the RNN unit cell by adding or removing information from the cell state.
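
    For reference, the GRU update can be written compactly as follows; this is the standard formulation, and the symbols correspond to Figure E.2 only up to notation:

    $$\begin{aligned}
    z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{(update gate)}\\
    r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{(reset gate)}\\
    \tilde{h}_t &= \tanh(W x_t + U(r_t \odot h_{t-1})) && \text{(candidate state)}\\
    h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
    \end{aligned}$$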

    The exploding gradient problem is similar to the vanishing gradient problem. The issue appears if the gradients from the deeper layers start with a value of more than 1. Exploding gradients can lead to unfavorable updates of the model's weights. In the worst case, Inf or NaN values result, which forces the training to be restarted from


  • the previous checkpoint. A gradient clipping method (Mikolov et al., 2012; Pascanu et al., 2013) is usually employed to solve the issue.

    The attention mechanism was introduced by Bahdanau et al. (2015) to address the fixed-length vector bottleneck in the basic encoder-decoder architecture. The idea is to have something that can focus on certain sentence factors during the translation process, similar to a human translator. The mechanism functions by finding the importance of each word in the source language (the key) for a given word in the target language (the query). This is done by utilizing all the hidden states from the intermediate encoder cells to construct the context vector, instead of just summarizing the encoder cells into a fixed-length context vector. The information in the context vector is then used by the decoder to generate the target sentence. The goal is to train the model to learn the relationship between the words in the source sentence and the words in the target sentence. In essence, attention is the component that manages and quantifies the relation between input and output elements. This is called general attention. If the relation managed and quantified is within the input elements, it is called self-attention.

    The added attention network allows a model to automatically find relevant parts of the source sentence when it tries to predict a target word. This approach improved the performance of NMT systems in translating long sentences. Nevertheless, RNNs still have an inherent sequential nature, which prevents parallel computation.
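
    A minimal sketch of the attention computation for a single decoder step is shown below; the shapes and random values are illustrative, and Bahdanau et al. (2015) actually score relevance with a small additive network rather than the scaled dot product used here for brevity.

```python
import numpy as np

def attention(query, keys, values):
    """Build a context vector for one decoder step.

    query:  (d,)   current decoder hidden state
    keys:   (T, d) encoder hidden states, one per source word
    values: (T, d) here identical to the keys, as in basic attention
    """
    scores = keys @ query / np.sqrt(len(query))  # relevance of each source word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over source positions
    return weights @ values                      # weighted sum = context vector

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))         # 5 source words, hidden size 8
decoder_state = rng.normal(size=8)
print(attention(decoder_state, encoder_states, encoder_states).shape)  # (8,)
```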

    The Transformer architecture (Vaswani et al., 2017) was proposed to eschew recurrence by solely using a self-attention mechanism and a feed-forward neural network. The architecture has 6 encoder layers, 6 decoder layers, an embedding layer, a linear layer, a softmax layer, and the novel positional encoding. Each encoder and decoder contains a residual network with the novel multi-head self-attention (see figure 2.5). Multi-headed self-attention is simply multiple self-attention functions that focus on multiple different factors within the input sentence. As for positional encoding, it lets the model reason about the relative position of any word using sine and cosine functions. The Transformer model also does not require sigmoid and tanh activation functions, which enables it to use computationally cheaper activation functions such as ReLU (Rectified Linear Units). The departure from recurrence and expensive activation functions allows a Transformer to compute in fully parallel settings, leading to faster and more efficient computation.
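
    The sinusoidal positional encoding can be sketched in a few lines; this follows the formulas of Vaswani et al. (2017), with illustrative dimensions:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))   (Vaswani et al., 2017)
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe                                  # added to the word embeddings

print(positional_encoding(max_len=50, d_model=512).shape)  # (50, 512)
```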

    2.1.4. Language Representation for Handling Out-of-Vocabulary

    Neural Machine Translation is trained on parallel corpora, which usually contain millions of words. A large amount of training data improves model performance, but NMT has problems dealing with a large vocabulary. The size of the vocabulary inversely affects the model's training speed (Jean et al., 2015). The reason is that the model architecture usually has a softmax function, as the last activation function, to produce the probability distribution of potential target words over the existing vocabulary. The softmax function requires the computation of a dot product of vectors for all words in the target vocabulary to compute the normalization constant. This computation is time-consuming for large vocabularies. Hence, in practice, an NMT system typically limits the vocabulary to around 30,000 - 80,000 most frequent words (Bahdanau et al., 2015; Sennrich et al., 2016b; Sutskever et al., 2014) and marks words outside this vocabulary with an unknown-word symbol.


  • Figure 2.5.: Transformer architecture (Vaswani et al., 2017). Nx is the encoder and decoder multiplier; the original design sets Nx to 6. The multi-head attention inputs are the Q, K, and V matrices.

    Various methods have been tried to cope with out-of-vocabulary words. T. Luong et al. (2015) use a dictionary lookup and a word alignment model to handle out-of-vocabulary words as a post-processing step. Other approaches directly use character representations (Costa-jussà and Fonollosa, 2016; Lee et al., 2017; M.-T. Luong and Manning, 2016) or even byte representations (Costa-jussà et al., 2017). Another method is using subword models such as byte-pair encoding (BPE) (Sennrich et al., 2016c). Subword models split words into subwords of various lengths. The vocabulary list will contain the most frequent subwords extracted from the training data by the subword model. Subword models have become the common solution as they empirically perform better than dictionary-based approaches (Sennrich et al., 2016c). The character-based approach outperforms the subword model only in high-resource settings with deep networks (Cherry et al., 2018); it performs worse in low-resource settings (Sennrich and Zhang, 2019). Hence, I am using subword models in this study.

    With the existence of a publicly available tool such as subword-nmt1 (Sennrich et al., 2016c), BPE (byte-pair encoding) has been a popular choice for creating a subword model to achieve open-vocabulary translation. BPE works by splitting words into sequences of subwords and then merging them into new subwords based on frequency. The subword merge operations are repeated until the subword vocabulary size is reached or the next highest-frequency pair occurs once or less often than the threshold. The BPE merge operations are learned from a vocabulary list extracted from the whole corpus; the vocabulary list also contains the frequency of each word. subword-nmt has three hyper-parameters, described below; a toy implementation of the merge loop follows the list.

    1https://github.com/rsennrich/subword-nmt.



    1. The first hyper-parameter is the number of merge operations, which controls the vocabulary size.

    2. The second hyper-parameter is the minimum frequency threshold (Sennrich, Birch, et al., 2017), which blocks the addition of a subword if its frequency of occurrence is below the threshold. In other words, any subword that has a frequency below the threshold will be treated as out-of-vocabulary. The purpose is to reduce the dependence between vocabulary size and corpus size, which leads to more aggressive segmentation on small corpora.

    3. The third hyper-parameter is BPE dropout (Provilkov et al., 2020), a simple subword regularization that stochastically rejects possible merges.
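
    The toy Python sketch below illustrates the core merge loop with the first two hyper-parameters; it follows the algorithm published by Sennrich et al. (2016c) in spirit only, and actual training should use the subword-nmt tool itself.

```python
import re
from collections import Counter

def pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its concatenation.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: words split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10    # the "merge operations" hyper-parameter
min_freq = 2       # the "minimum frequency threshold" hyper-parameter
for _ in range(num_merges):
    pairs = pair_counts(vocab)
    if not pairs:
        break
    best, count = pairs.most_common(1)[0]
    if count < min_freq:
        break      # stop when the best pair is rarer than the threshold
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```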

    2.1.5. Automatic Evaluation

    To measure the performance of different machine translation engines, we need a robust method to assess the translation system. Generally, we want to compare the output of a machine translation engine to a human translation reference. Rather than using manual evaluation by humans, automatic evaluation is faster and cheaper, especially if we need to evaluate frequently, for instance, when checking and monitoring system improvement during training or development. Automatic evaluation methods automatically score the discrepancies between the machine-generated translation and the human translation reference. The automatic evaluation method widely used in the machine translation community is BLEU (Papineni et al., 2002). BLEU has been shown to correlate with human evaluations (Callison-Burch et al., 2006; Coughlin, 2003). Thus, we use BLEU as a surrogate for actual human evaluation.

    BLEU, in essence, looks at the presence or absence of words, as well as word ordering and how far the words are separated, between the machine-generated translation and the human translation reference. The BLEU score is typically a real number from zero to one, although most available tools multiply the number by a hundred. A BLEU score of 100 means a perfect match in terms of word match and word order; zero means no common n-gram matches between the output of the machine translation engine and the human translation. For example, a BLEU score of zero will be given if there is no common 4-gram, even if there is an overlap in 1- to 3-grams. The BLEU score is calculated from n-gram precision and a brevity penalty (BP). N-gram precision is calculated by counting the number of overlapping n-grams between the machine translation output and the human reference divided by the total number of available n-grams. The brevity penalty (BP) is used to penalize translation hypotheses that have high precision but are very short; BP compensates for the fact that recall is not part of the score.

    In mathematical notation, the BLEU score is calculated as shown in equation 2.5, where p_n is the modified n-gram precision, N is the longest n-gram order, and w_n are positive weights summing to one.

    $$\mathit{BLEU} = \mathit{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \qquad (2.5)$$

    However, BLEU is very sensitive to word form: if we tokenize (separate) a word differently, the BLEU score can change. Therefore, it is vital to consider the tokenizer (word separator) that we are using. For this particular reason, the machine


  • translation community uses sacreBLEU2 (Post, 2018) to calculate the BLEU score and uses the generated score to share the results of their studies. SacreBLEU is used because it is a standard method to calculate shareable, comparable, and reproducible BLEU scores in the machine translation community. It became a standard because the BLEU score produced by sacreBLEU is not affected by the different tokenization and normalization schemes employed by various studies, as it applies its own schemes. Due to these particular features, translation outputs are required to be detruecased and detokenized before the BLEU score is calculated by sacreBLEU.
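
    Scoring with sacreBLEU can then be as simple as the following minimal sketch, assuming the sacrebleu Python package is installed; the hypothesis and reference strings are illustrative:

```python
import sacrebleu

# Detruecased, detokenized system outputs and references (illustrative).
hypotheses = ["She is Budi's mother.", "You have to leave this mission."]
references = [["She is Budi's mother.", "You must leave this mission."]]
# references holds one reference stream, aligned line-by-line with the hypotheses.

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU on the 0-100 scale
```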

    2.1.6. Human Evaluation

    Although the BLEU score is useful for rapid monitoring while training a model, human judgments are necessary to genuinely assess translation quality. Therefore, human evaluation is also needed to assess the quality of the systems' translations. The criteria on which human judges evaluate the translation output are fluency and adequacy. Fluency rates the quality of grammatical correctness and idiomatic word choices. Adequacy rates whether the output conveys the same meaning as the source sentence. The fluency rating ranges from incomprehensible at the lowest to flawless English at the top. The adequacy rating ranges from none (no meaning at all) to all meaning at the top. Table 2.1 shows the adequacy and fluency levels.

    Adequacy             Fluency
    5  all meaning       5  flawless English
    4  most meaning      4  good English
    3  much meaning      3  non-native English
    2  little meaning    2  disfluent English
    1  none              1  incomprehensible

    Table 2.1.: Assessment of adequacy and fluency in translations, based on the LDC's technical report (Consortium et al., 2005).

    The evaluators may differ in opinion on the adequacy level or the fluency level of each sentence. Therefore, a mechanism to measure agreement among evaluators is needed. To measure the Inter-Annotator Agreement (IAA), the kappa coefficient k is used. IAA is an indicator of how well evaluators agree on certain annotation decisions for a certain category. k = 1 indicates that evaluators are in complete agreement; if there is no agreement, then k ≤ 0. The ranges of agreement are shown in table 2.2.

    Table 2.2.: Ranges of the kappa coefficient and the corresponding levels of agreement.

  • 2.1.7. Machine Translation on Low Resource Settings

    Most machine translation training is done in a supervised manner, where a model is trained on a dataset containing parallel sentences in a language pair. If a language pair has only a small amount of parallel corpora, it is considered low-resource. Although it is not clear where the boundary between low resource and high resource lies, there have been many studies regarding low-resource machine translation, especially studies related to improving machine translation performance in low-resource settings. Multiple methods for improving a model's performance in low-resource settings have been discussed in the literature. They are as follows:

    • Hyper-parameter tuning (Sennrich and Zhang, 2019), where the model's hyper-parameters are tuned to the dataset size.

    • Transfer learning (Dabre et al., 2017; Nguyen and Chiang, 2017; Zoph et al., 2016), or the parent-child model, where a parent model trained on a different domain or even a different language pair is used as a starting point for training the model.

    • Multilingual training (Aharoni et al., 2019; Johnson et al., 2017), where multiple language pairs are trained together.

    • Unsupervised training (Artetxe et al., 2017; Yang et al., 2018), which relies on pre-trained word embedding alignment.

    • Using a pivot language (Cheng et al., 2016; Miura et al., 2015; R. Costa-jussà et al., 2011), where a third language is used as a proxy.

    The status of whether the Indonesian language is considered low-resource is a bit unclear. Joshi et al. (2020) categorize Indonesian as a language that lacks labeled data for NLP training. However, Guntara et al. (2020) argue that the Indonesian language is not considered low-resource anymore. For the MT task, there is a sizeable Opensub corpus of movie subtitles (conversational) and other religious corpora, but only small corpora with news topics. These large corpora may not be useful for training an MT system for the news domain or the law domain. Nevertheless, this study simulates several data-constrained scenarios to assess and compare different approaches.

    Out of these many approaches, hyper-parameter improvement and transfer learning are attempted. Hyper-parameter improvement is easy to implement since it requires no extra data, separate model, nor new architecture; I also want to confirm the results of Sennrich and Zhang (2019) on the English-Indonesian language pair. Transfer learning is attempted because, according to Joshi et al. (2020), Indonesian belongs to a category that benefits from transfer learning; therefore, I want to confirm its validity.

    2.2. Indonesian Language (Bahasa Indonesia)

    The Indonesian language, or Bahasa Indonesia, is a standardized language that is learned and spoken throughout Indonesia due to its status as Indonesia's official national language. The language is derived from the Malay language, which belongs to the Austronesian language family. Bahasa Indonesia uses the 26 letters of the ISO basic Latin alphabet with no diacritics. The language itself has been updated several times. For example, in 1972, with the implementation of "the Perfected Spelling System", in Indonesian "Ejaan Yang Disempurnakan", abbreviated EYD, Indonesian spelling was reformed and all diacritics were removed. Since 2015, the spelling system


  • used for the Indonesian language is the Indonesian Spelling System General Manual (Indonesian: Pedoman Umum Ejaan Bahasa Indonesia).

    Although the Indonesian language uses the same letters as the English language and mostly follows a Subject-Verb-Object (SVO) structure like English, Indonesian is highly agglutinative and morphologically richer than English. Indonesian has very rich derivational morphology, which includes complex affixation, reduplication, and clitics. Indonesian words do not have case or gender, and the sentence subject and the tense do not change the verb form. Instead, the verb is inflected to create the passive form; for example, the verb "makan" (to eat) becomes "dimakan" (to be eaten). The noun is inflected to form plurals (reduplication) or to create possessive forms; for example, the noun "meja" (table) becomes "meja-meja" (tables). An example of a complete Indonesian sentence is "Anda harus meninggalkan misi ini" (You have to leave this mission). The gloss of the sentence is in table 2.3.

    Indonesian            anda     harus     meninggalkan     misi      ini
    IPA                   [anda]   [harus]   [meniŋgalkan]    [misi]    [ini]
    English Gloss         you      have to   leave            mission   this
    English Translation   you have to leave this mission

    Table 2.3.: A gloss of the Indonesian sentence "Anda harus meninggalkan misi ini", which translates into "You have to leave this mission". The IPA is a system of phonetic notation called the International Phonetic Alphabet.

    As the language of the fourth most populous country in the world3, it is no surprise that Indonesian is the 10th most spoken language, with 200 million active speakers (Eberhard et al., 2019). About 171 million people4 are connected to the internet. Joshi et al. (2020) categorize the Indonesian language as one of "The Rising Stars" because of the insufficient efforts in data labeling. The lack of open parallel corpora, standardized benchmarks, and reproducible experiment code implementations hinders the progress of machine translation research in Indonesia. Guntara et al. (2020) state that the same issues also affect Natural Language Processing (NLP) research in Indonesia. Nevertheless, a language that belongs to this category is helped by the rise of pre-training and transfer learning.

    2.3. Research on Machine Translation in Indonesian Language

    Research on machine translation for the Indonesian language started in 2009 using statistical machine translation tools. Until the year 2018, only 11 studies had been published (Septarina et al., 2019). To the best of my knowledge, only four studies using NMT systems on the English-Indonesian language pair have been done after 2018 and up to the year 2020. They use an RNN (Hermanto et al., 2015), an RNN-biLSTM with BPE (Shahih and Purwarianti, 2019), a Transformer network (Dwiastuti, 2019), and transfer learning (Aji et al., 2020). The rest are mostly phrase-based MT or Statistical Machine Translation (Septarina et al., 2019). Unfortunately, these studies are not comparable. The reason is that the studies use different datasets, each with its own tokenizer and test set. For example, Hermanto et al. (2015) use a custom tokenizer, Shahih and Purwarianti (2019) use the NLTK5 library, Aji et al. (2020) use Moses, and Dwiastuti (2019) does not tokenize the dataset. Thus, it is quite hard to say which one produced state-of-the-art results.

    3http://www.infoplease.com/world/statistics/most-populous-countries.html
    4https://www.thejakartapost.com/life/2019/05/18/indonesia-has-171-million-internet-users-study.html
    5www.nltk.org



  • 3. Methodology

    The purpose of this study is to explore the performance of existing Neural Machine Translation (NMT) systems in data-constrained settings. To achieve this goal, this study empirically compares different NMT approaches on Indonesian-to-English translation tasks. To see how an NMT model performs, we need to be able to assess it against the alternatives. Therefore, this study compares NMT system performance against a Statistical Machine Translation (SMT) system and a publicly available translation service. PB-SMT is chosen to represent the SMT system because it performs best (M.-T. Luong and Manning, 2015), and Google Translate is chosen because it is arguably the best free public translation service available.

    Neural networks and other machine learning algorithms create a model by learning patterns from an initial set of data, called training data. Larger training data helps the model generalize better. Once the model is trained and optimized, test data is used to evaluate the model's quality. In the machine translation task, the model is tested by translating sentences from the source language in the test dataset into the target language. Then the model's translation outputs are compared with the target language sentences in the test dataset. The development set is used to tune and optimize the model during training. These sets are frequently taken from the same dataset.

    The experiments run on two NMT architectures, an RNN-based network and a Transformer-based network. They are compared using the configurations from the original publications. On top of the basic training, two low-resource approaches are employed: hyper-parameter improvement and transfer learning. To be able to tell which architecture or approach is better, the BLEU score is used as a performance indicator; a higher BLEU score is better.

    As the BLEU score is not genuinely indicative of the translation quality of the systems' output, additional surveys were conducted. The surveys asked 4 professional translators to choose the best translation from a list of translated sentences. The list contains translated sentences generated by each of the NMT models, the PB-SMT models, Google Translate, and the gold standard. However, the fact that the sentences are machine-generated was not revealed to the evaluators. After the evaluators decided on the best sentence, they were then asked to rate the best sentences based on the adequacy and fluency of the translated sentences.

    In summary, the methodology starts with creating data-constrained scenarios, then collecting data for training the machine translation systems in each scenario. After running the experiments and getting the benchmark system results, the performance is evaluated using BLEU. Each created model is given the same test dataset. A BLEU score is then calculated by comparing the gold standard and the generated translation. Moreover, the output sentences are evaluated by professional translators for adequacy and fluency.


  • 3.1. Experiment Design

    This study focuses on comparing the performance of different Indonesian-to-English NMT systems in different data availability scenarios. To empirically assess NMT performance in limited data settings, training subsets of different sizes are created from the main training set. These training subsets are used to assess NMT performance on different sizes of training data. The development set and the test set are the same for all of the experiments. The same data pre-processing is used unless indicated otherwise. To highlight the performance of the different NMT models and to simplify testing, the same hyper-parameters are used during inference1. The BLEU score (Papineni et al., 2002) produced by the sacreBLEU (Post, 2018) script is used to evaluate the translation output from NMT. All models are trained and tuned to maximize the BLEU score on the development set.

    The component of the NMT architecture responsible for generating the translation is the decoder. A good decoder should be able to solve a search problem to find the best translation. A search problem is a type of computational problem in computational complexity theory and computability theory, defined by a search space and a goal condition. In machine translation, it is defined as finding the string of words considered the best translation.

    The search problem is a critical issue for machine translation. This is because the space of potential translations is large, and a decoder may miss good translations. Since the number of generated words affects the degree of search complexity, the evaluation is done on sentences of various lengths. The sentences are the translations generated by the trained models in the different scenarios. This particular evaluation enables us to examine the decoder's performance at different degrees of granularity. The sentence lengths compared are 3 words, 6 words, 8 words, 15 words, and 20 words. The length selection is mostly referenced from work by Germann et al. (2001). The differences are that I added sentences consisting of 3 words and removed sentences consisting of 10 words, because 3-word sentences represent really short text, while 10-word sentences are similar to 8-word sentences.

    The translation outputs from the different models are compared against the Google Translate output and the gold standard (the English part of the test set) by 4 professional English-Indonesian translators. The evaluation was done in two steps in two separate surveys, one after the other. In the first step, each evaluator was asked to choose the best translation of an Indonesian sentence of various lengths in a survey. The evaluators were given a list of hypothesis translations generated by the PB-SMT models, the NMT models, Google Translate, and the gold standard (the English part of the test set) for every scenario. In the second step, each evaluator was given the two best hypothesis translations of various lengths from the list, based on the result of step one, in a separate survey. The evaluators were then asked to rate each sentence using the adequacy and fluency levels listed in table 2.1. To measure the evaluators' agreement, Fleiss' kappa coefficient is used since there are 4 evaluators.

    1It might be better to tune the hyper-parameters for each model during inference. However, the exploration of inference hyper-parameters deserves a separate study in future work.


  • 3.2. Limited Data Scenarios

    Four scenarios of data availability settings are tested to reflect potential real-world conditions. These scenarios are as follows:

    • Scenario 1 (S1) - Only Small Parallel Corpora Available. In this scenario, NMT models are trained from scratch. To improve the models' performance in low-resource settings, the RNN-based model's hyper-parameters are improved according to the settings in the study by Sennrich and Zhang (2019). The RNN-based NMT baseline uses the configuration from Bojar et al. (2016). As for the Transformer-based NMT, the hyper-parameter improvement configuration is from the work by Junczys-Dowmunt et al. (2018). The Transformer baseline configuration is from the original Vaswani et al. (2017) paper.

    • Scenario 2 (S2) - Small Parallel Corpora and Large Monolingual Data Available. In this scenario, NMT models are improved using English monolingual data with the transfer learning method. First, a parent model is created, and then the parent model is used as a starting point for training the final or child model. The parent model is trained on a substitution English corpus created with the procedure from the study by Aji et al. (2020).

    • Scenario 3 (S3) - Small Parallel Corpora and Large Out-of-Domain Parallel Corpora Available. In this scenario, NMT models are improved using a large out-of-domain dataset with the transfer learning method. The parent model is trained with the out-of-domain dataset; then, the model is retrained (fine-tuned) using the training subsets.

    • Scenario 4 (S4) - Small Parallel Corpora and Large Parallel Corpora in a High-Resource Language Pair Available. In this scenario, the transfer learning method is used as well. The parent model is trained with large German-English corpora. The German-English language pair is chosen because it gives the best BLEU score as a parent model for the Indonesian-English language pair (Aji et al., 2020).

    All models in each scenario are trained using the same training subsets, optimized using the same development dataset, and evaluated on the same test set. More information related to these datasets is available in the next section.

    3.3. Data Collection and Preparation

    In this study, two types of datasets, or corpora, are used. The first is the main corpora, which are used for comparison. The second is the supporting corpora, which are used for training parent models in the transfer learning approaches. Both types are sentence-aligned parallel corpora.

    3.3.1. Preparing the Main Corpora

    A special release of the English - Indonesian TED corpus (Cettolo et al., 2012) for the IWSLT 2017 Evaluation Campaign2 is used as the main corpus for training the machine translation models. The same method as Ranzato et al. (2016) is used to create the training set and the development set. The test set is created by concatenating all of the test data provided in the corpus. This preparation results in 109,379 parallel sentences

    2https://wit3.fbk.eu/mt.php?release=2017-01-more.



  • Figure 3.1.: Main corpora. The figure shows the origin of the training set, development set, test set, and training subsets. It also shows the creation of an English-English in-domain corpus. "ps" means parallel sentences.

    of training data, 4,971 parallel sentences of development data, and 7,990 parallel sentences of test data.

    To simulate different amounts of available training corpus, the same method as Sennrich and Zhang (2019) is used to further split the training corpus into 5 training subsets. The 5 training subsets contain roughly 100 thousand, 200 thousand, 400 thousand, and 800 thousand words of the target language, and 1.6 million words of the target language. For the rest of the study, "x.x thousand/million words of training data" is simplified to "x.x thousand/million words". The word count refers only to the number of words in the parallel corpus's target language, not the total number of words in the parallel corpus. An overview of the main corpora is shown in table 3.1.
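
    A minimal sketch of cutting such a subset by cumulative target-side word count is shown below; the file names and the exact selection order used by Sennrich and Zhang (2019) are assumptions for illustration.

```python
def take_subset(src_lines, tgt_lines, max_target_words):
    # Keep sentence pairs until the target side reaches the word budget.
    src_out, tgt_out, words = [], [], 0
    for src, tgt in zip(src_lines, tgt_lines):
        n = len(tgt.split())
        if words + n > max_target_words:
            break
        src_out.append(src)
        tgt_out.append(tgt)
        words += n
    return src_out, tgt_out

# Illustrative usage: build the 100-thousand-word subset from tokenized files.
with open("train.id") as f_src, open("train.en") as f_tgt:
    src_sub, tgt_sub = take_subset(f_src.readlines(), f_tgt.readlines(), 100_000)
print(len(tgt_sub), "sentence pairs in the 100 thousand word subset")
```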

All sets and subsets are cleaned, truecased, and tokenized using the Moses (Koehn et al., 2007) scripts. Both Indonesian and English sentences are normalized and tokenized using English as the language parameter. German sentences are tokenized with Moses using the German language parameter. The corpora are further preprocessed with byte-pair encoding using subword-nmt (Sennrich et al., 2016c).

    3.3.2. Preparing the Supporting Corpora

The supporting corpora are used for training the parent model in the transfer learning method. A parent model is a model used as a base to train the model that we want to evaluate. A combination of multi-domain English-Indonesian corpora and German-English corpora is used. The English-Indonesian corpora are the Open Subtitle 2018 Corpus (Opensub Corpus) (Lison et al., 2018), the Wikimatrix corpus (Schwenk et al., 2019a), the Pan Asia Networking Localization (PANL) corpus (Adriani and Riza, 2008), and the GlobalVoice Corpus³. Opensub is gathered from movie subtitles. The Wikimatrix corpus is taken from Wikipedia content, which contains general information. The PANL corpus is curated news content created specifically for machine translation training. The GlobalVoice corpus contains news from the globalvoices.org website. The total size of this corpus combination is more than 10 million sentence pairs. For comparison purposes, only 5 million sentence pairs are used for training the parent model.

³ http://casmacat.eu/corpus/global-voices.html


Figure 3.2.: Supporting Corpora. The figure shows how an English-English out-of-domain corpus is created. ps means parallel sentences.

The German-English corpora are Europarl v7 (Europarl) (Koehn, 2005), the Common Crawl corpus (CC), the News Commentary v12 corpus⁴ (News), and the Rapid corpus of EU press releases (Rapid) (Rozis and Skadiņš, 2017), gathered from the WMT17 translation task's page⁵. The Europarl corpus is extracted from the European Parliament's proceedings. The Common Crawl corpus contains sentences from the web crawled by the Common Crawl non-profit organization. The Rapid corpus contains sentences collected from public sector websites and sites that allow free use and reuse of their content.

Figure 3.3.: Supporting Corpora. The figure shows how the English-German support corpus is created. ps means parallel sentences.

⁴ http://www.casmacat.eu/corpus/news-commentary.html
⁵ http://www.statmt.org/wmt17/translation-task.html#download


3.3.3. Corpora Analysis

The English-Indonesian corpora analysis is shown in table 3.1 (original) and table 3.2 (after the corpora are pre-processed). The English-German corpora analysis is in table 3.3.

Abbr. denotes the abbreviation of the corpus names. $|sent_{en\text{-}id}|$ denotes the number of sentence pairs. $|words_{en}|$ denotes the number of English words in the corpus. $|words_{id}|$ denotes the number of Indonesian words in the corpus. $len_{en}$ and $len_{id}$ are the average number of words per sentence in the respective language. $len_{ratio}$ denotes the absolute ratio between the sentence lengths of English and Indonesian. The absolute ratio is calculated with the following formula: $len_{ratio} = \max(len_{en}/len_{id},\ len_{id}/len_{en})$.

The length ratio is useful for checking how likely the sentences are to be parallel. The acceptable value differs between languages; usually, human translation is used as a reference. Most parallel sentence extraction systems limit the length ratio to 2 (Grégoire and Langlais, 2018): if the ratio is above 2, the extracted sentence pair is discarded. A minimal filtering sketch follows.
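To make the criterion concrete, the following is a minimal sketch of such a filter in Python. The function names and the whitespace tokenization are illustrative assumptions, not part of any cited extraction system.

```python
def len_ratio(len_a: int, len_b: int) -> float:
    """Absolute sentence-length ratio: max(len_en/len_id, len_id/len_en)."""
    return max(len_a / len_b, len_b / len_a)

def filter_by_length_ratio(pairs, max_ratio=2.0):
    """Keep only sentence pairs whose length ratio does not exceed max_ratio."""
    for en, idn in pairs:
        n_en, n_id = len(en.split()), len(idn.split())
        if n_en > 0 and n_id > 0 and len_ratio(n_en, n_id) <= max_ratio:
            yield en, idn
```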

These corpora contain raw data from the source. Therefore, they are not entirely ready to be used directly as training data; some cleaning and preprocessing are required. The cleaning procedures are explained in the corpus preparation subsections that follow.

Corpus                 Abbr.     |sent_en-id|   |words_en|   |words_id|   len_en   len_id   len_ratio   Domain
TED IWSLT 2017         TED       117.3K         1.88M        1.64M        16.04    14.03    1.14        Conversation
OpenSubtitles 2018     OpenSub   9.26M          54.96M       47.02M       5.93     5.07     1.16        Movie
GlobalVoices v2017q3   GV        14.4K          264.6K       238.9K       18.32    16.53    1.10        News
Wikimatrix (T=1.04)    Wiki      1M             21.7M        19.8M        21.31    19.51    1.09        General
PAN Localization       PANL      24K            541.3K       500K         22.53    20.81    1.08        News

    Table 3.1.: Data analysis on the original English-Indonesian corpus.

Corpus                 Abbr.     |sent_en-id|   |words_en|   |words_id|   len_en   len_id   len_ratio   Domain
TED IWSLT 2017         TED       114.9K         1.86M        1.62M        16.20    14.13    1.14        Conversation
OpenSubtitles 2018     OpenSub   9.23M          53.72M       45.94M       5.81     4.97     1.16        Movie
GlobalVoices v2017q3   GV        9.2K           166.9K       149.6K       18.11    16.23    1.11        News
Wikimatrix (T=1.04)    Wiki      755.3K         19M          17.9M        25.15    23.73    1.05        General
PAN Localization       PANL      20K            453.4K       418.1K       22.33    20.59    1.08        News

Table 3.2.: Data analysis on the English-Indonesian corpus after pre-processing.

Corpus                              Abbr.   |sent_en-de|   |words_en|   |words_de|   len_en   len_de   len_ratio   Domain
Europarl v7                         Euro    1.92M          47.88M       44.61M       16.38    14.32    1.14        Conversation
Common Crawl corpus                 CC      2.39M          51.39M       47.04M       21.42    19.6     1.09        General
News Commentary v12                 News    270.7K         5.92M        6.08M        21.87    22.47    1.02        News
Rapid corpus of EU press releases   Rapid   1.33M          22.99M       22.07M       25.15    23.73    1.05        News

    Table 3.3.: Data analysis on the English-German corpus.

3.3.4. Preparing the TED Corpus

The IWSLT 2017 Evaluation Campaign does not include any task on the English-Indonesian language pair. However, the Web Inventory of Transcribed and Translated Talks (WIT³) provides English-Indonesian language pairs.

⁶ https://www.ted.com


The dataset contains TED talk⁶ video transcriptions in English and their Indonesian translations. The sentences are aligned by pairing extracted transcription tags from the two languages in order of appearance. This corpus is used as the main dataset for the experiments in this study.

The corpus is given in XML format, which contains meta-data information. A custom script was created to remove all meta-data as well as blank lines and double dash characters (–). The same data cleanup method as Ranzato et al. (2016) is used to remove all the meta-data. Empty and overly long sentences, as well as sentences with a high source-target length ratio, are then removed from the training dataset using the Moses (Koehn et al., 2007) cleaning script. The sentence length is limited to 80 words.
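As an illustration of this cleanup, the sketch below drops tag-like meta-data lines, blank lines, and double dashes from a WIT³ transcript. It is a simplified stand-in for the custom script and the Ranzato et al. (2016) procedure, not a reproduction of either; the regular expression for meta-data lines is an assumption.

```python
import re

META_LINE = re.compile(r"^\s*<[^>]+>")  # lines starting with an XML tag (e.g. <url>, <talkid>)

def clean_wit3(lines):
    """Yield transcript lines with meta-data, blank lines, and double dashes removed."""
    for line in lines:
        if META_LINE.match(line):
            continue
        line = line.replace("--", " ").strip()
        if line:
            yield line
```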

    3.3.5. Preparing Open Subtitle 2018 Corpus

The Open Subtitles dataset (Lison et al., 2018) is the largest publicly available English-Indonesian parallel corpus, with over 9 million aligned sentences. The Open Subtitle corpus contains a lot of non-letter characters (e.g. ¶\∗#) and formatting markup (e.g. {\cHFFFFFF}). To clean this, a Python script was used: I adjusted the PrepCorpus script⁷ to accommodate the Indonesian translation. Moreover, the subtitles also contain song lyrics that are not translated into Indonesian; therefore, the non-translated sentence pairs were removed from the corpus. A simplified version of this cleanup is sketched below.
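This is not the adjusted PrepCorpus script itself; the regular expressions and the identical-pair heuristic for spotting non-translated lines (such as song lyrics) are illustrative assumptions.

```python
import re

MARKUP = re.compile(r"\{\\[^}]*\}")   # subtitle formatting markup such as {\cHFFFFFF}
NOISE = re.compile(r"[¶\\∗#]+")       # stray non-letter characters

def clean_subtitle_pair(en: str, idn: str):
    """Clean one aligned subtitle pair; return None if the pair looks non-translated."""
    en = NOISE.sub(" ", MARKUP.sub("", en)).strip()
    idn = NOISE.sub(" ", MARKUP.sub("", idn)).strip()
    if not en or not idn or en.lower() == idn.lower():  # identical sides: likely untranslated
        return None
    return en, idn
```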

    3.3.6. Preparing Global Voice Corpus

The Global Voice dataset is a sentence-aligned parallel corpus crawled from the multilingual news website globalvoices.org. The sentences are aligned using an unsupervised sentence aligner⁸ based on the work by Braune and Fraser (2010).

The corpus is provided in XLIFF format, Moses format, and raw text. The text contains non-letter characters (e.g. # and ...). A Python script is created to remove these characters as well as the blank lines. Although the corpus largely contains news, it also contains out-of-domain content such as social media content, because news articles often include references to people's reactions.

    3.3.7. Preparing Wikimatrix Corpus

The Wikimatrix dataset (Schwenk et al., 2019b) is a collection of 135 million parallel sentences in 85 languages crawled from Wikipedia. The sentences are aligned using LASER (Language-Agnostic Sentence Representations), massively multilingual sentence embeddings trained on 93 languages (Artetxe and Schwenk, 2019).

To fetch 1 million English-Indonesian parallel sentences, the WikiMatrix script⁹ is used. A threshold of 1.04 is applied, as suggested by Schwenk et al. (2019b), because it is deemed reasonable for most language pairs. After carefully analyzing the extracted dataset, I found that it is not entirely clean: there are some non-translated sentences and noise from unfiltered markup tags. Thus, a script is created to clean the dataset. The thresholding step is sketched below.

⁷ https://github.com/rbawden/PrepCorpus-OpenSubs
⁸ http://sourceforge.net/projects/gargantua
⁹ https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
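The margin-score thresholding for WikiMatrix described above is straightforward to illustrate. The sketch below assumes the extracted TSV layout of one margin score and two sentences per line; the file name, gzip packaging, and function name are assumptions for illustration.

```python
import gzip

def extract_pairs(tsv_path="WikiMatrix.en-id.tsv.gz", threshold=1.04, max_pairs=1_000_000):
    """Yield sentence pairs whose LASER margin score is at least `threshold`."""
    kept = 0
    with gzip.open(tsv_path, "rt", encoding="utf-8") as f:
        for line in f:
            score, src, tgt = line.rstrip("\n").split("\t", 2)
            if float(score) >= threshold:
                yield src, tgt
                kept += 1
                if kept >= max_pairs:
                    break
```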


3.3.8. Preparing PANL Corpus

The Indonesian-English Pan Asia Networking Localization (PANL) corpus¹⁰ is the work of Adriani and Riza (2008), containing about 24,000 sentence pairs with around 500 thousand words across 4 types of news (sport, scientific, international, and economy). The sentences were manually aligned by two aligners from a Creative Commons online document. The PANL corpus does not require further cleaning, as its quality is already high.

    3.3.9. Preparing German-English Corpora

Since the corpora are fetched from the WMT17 translation task, they are already suitable as training data without any further preparation. Thus, the corpora are used as they are. The total size of the German-English corpora is 5,919,142 sentence pairs. For comparison purposes, only 5 million sentence pairs are used for training the parent model.

3.3.10. Dataset Usage in Scenarios

A different corpus is needed for each scenario. This section describes which corpora are used in each scenario. In scenario 1, where NMT models are trained from scratch, only the main corpus (TED) is used for the training subsets. The 5 training subsets contain roughly 100 thousand, 200 thousand, 400 thousand, 800 thousand, and 1.6 million words.

In scenario 2, a substitution English corpus is required to train the parent model. A parent model is the model used as initialization for the model that needs to be evaluated (the child model). The OpenSub, Wikimatrix, PANL, and GlobalVoice corpora are merged into one and randomized to form a multi-domain corpus. The English part of the parallel corpora is taken and turned into a synthetic English-English corpus with random alignment (Aji et al., 2020). In practice, english.source and english.target files are created; then, random sentences taken from the English part of the parallel corpora are written to those files (see the sketch below). This synthetic corpus supposedly gives a better initialization for child models than training from scratch. The synthetic corpus has more than 10 million sentence pairs; however, only 5 million sentence pairs are used for comparison purposes. The synthetic corpus is further divided into a training set (99%) and a development set (1%) for training the parent model. This multi-domain monolingual substitution English corpus is then used as training data for creating the parent model.
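A minimal sketch of this construction is given below, under the assumption that the random alignment simply pairs independently shuffled samples of the same English sentences; the exact procedure of Aji et al. (2020) may differ in details such as sampling and deduplication.

```python
import random

def make_substitution_corpus(english_sentences, n_pairs=5_000_000, seed=1):
    """Write english.source and english.target with randomly aligned English sentences."""
    random.seed(seed)
    sample = random.sample(english_sentences, min(n_pairs, len(english_sentences)))
    target = sample[:]
    random.shuffle(target)  # re-align at random so pairs are not translations of each other
    with open("english.source", "w", encoding="utf-8") as src, \
         open("english.target", "w", encoding="utf-8") as tgt:
        src.write("\n".join(sample) + "\n")
        tgt.write("\n".join(target) + "\n")
```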

Other than the multi-domain monolingual substitution corpus, an in-domain monolingual substitution corpus is also created from the full main corpus, which has more than 100 thousand sentence pairs. For the in-domain monolingual substitution corpus, 5% is used for the development set and the rest for the training set. In scenario 2, after the parent model is trained, a model and vocabulary transfer is executed to get the initialization parameters. The child model is then trained on the 5 training subsets from the TED corpus using the initialization parameters taken from the parent model.

In scenario 3, a large out-of-domain corpus is needed to train the parent model. The OpenSub, Wikimatrix, PANL, and GlobalVoice corpora are merged into one and randomized.

¹⁰ http://panl10n.net/English/OutputsIndonesia2.htm


Five million sentences are taken from the randomized corpus to be used as a training set (99%) and a development set (1%) for creating the parent model.

In scenario 4, large corpora from another language pair are used to train the parent model. Europarl v7, the Common Crawl corpus, the News Commentary v12 corpus, and the Rapid corpus of EU press releases are merged and randomized. Five million sentences are taken from the randomized corpus to be used as a training set (99%) and a development set (1%) for creating the parent model.

    3.4. Benchmark

The benchmark scores are calculated as baselines to show the performance difference between NMT systems and other systems. As mentioned previously in the Experiment Design section, all systems, including the benchmark systems, are evaluated on the same test dataset. The BLEU score is calculated using the same sacreBLEU tools. The experiment pipeline is detailed in appendix C, figure C.1.

    3.4.1. Phrase-Based Statistical Machine Translation Baseline

The Phrase-Based Statistical Machine Translation baseline is trained with Moses (Koehn et al., 2007) using standard settings commonly used in WMT submissions (Ding et al., 2016; Williams et al., 2016). Word alignment is trained using MGIZA++ (Gao and Vogel, 2008), followed by the grow-diag-final-and symmetrization heuristic. An interpolated Kneser-Ney smoothed 5-gram language model is created using lmplz (Heafield et al., 2013). Feature weights are tuned using k-best batch MIRA (Cherry and Foster, 2012). The tuning was run 3 times to get the average BLEU score. This baseline represents the SMT system.

    3.4.2. Google Translate Baseline

The Google Translate baseline is created by using the Google Translate API to generate translations of the test set. However, the BLEU score calculated from the Google Translate output might not be replicable, because the Google Translate model is continuously updated. Moreover, there is a possibility that the test set is present in its training set. Despite these drawbacks, contrasting the performance of models trained on publicly available corpora with the performance of one of the best publicly available translation services can be beneficial, especially for those trying to decide whether to build their own NMT system or to use a public translation service. Therefore, I argue that comparing Google's results with results from the other models is still useful.
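For illustration, such a baseline could be generated with the Python client for the Google Cloud Translation API (v2), as sketched below. The thesis does not specify which client or API version was used, so the library choice and the batching are assumptions.

```python
from google.cloud import translate_v2 as translate  # pip install google-cloud-translate

def translate_test_set(sentences, batch_size=100):
    """Translate Indonesian test sentences into English via the Google Translate API."""
    client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be set
    hypotheses = []
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        results = client.translate(batch, source_language="id", target_language="en")
        hypotheses.extend(r["translatedText"] for r in results)
    return hypotheses
```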

    3.5. Experiments

Different NMT systems and improvements were used depending on the available data conditions. This section describes the NMT systems and their detailed configurations used in the different simulated scenarios. All of the experiments were run using marian-nmt¹¹ (Junczys-Dowmunt et al., 2018). Marian-nmt is chosen because it is built in C++, which is faster than Python, and it requires few dependencies, unlike Nematus, which depends on TensorFlow¹². These features are useful to reduce training time and to ease code deployment.

¹¹ https://marian-nmt.github.io
¹² https://www.tensorflow.org/


Furthermore, all of the datasets were pre-processed with byte-pair encoding (BPE) into subwords with subword-nmt (Sennrich et al., 2016c). The same hyper-parameters were used during inference to highlight the models' performance. During decoding, the beam size was set to 6. The translation output is then detruecased and detokenized before being evaluated by sacreBLEU; a minimal scoring sketch follows. The details of the marian-nmt parameter configuration are listed in appendix B.
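As an illustration of the evaluation step, the sketch below scores a detruecased, detokenized output file against the reference with the sacreBLEU Python API; the file names are placeholders.

```python
import sacrebleu

def score_output(hyp_path="test.hyp.detok", ref_path="test.ref.en"):
    """Compute corpus-level BLEU with sacreBLEU on plain-text files."""
    with open(hyp_path, encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        references = [line.strip() for line in f]
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])  # one reference set
    return bleu.score
```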

    3.5.1. Scenario 1

In scenario 1, where the assumption is that no parallel corpora are available except the training data, two NMT systems could be used: the bidirectional RNN-based sequence-to-sequence system with attention mechanism, and the Transformer-based system. The only improvement option available is to tune the hyper-parameters based on the size of the dataset. NMT system training is straightforward. After BPE (Sennrich et al., 2016c) is applied to the dataset, fifteen Indonesian-to-English NMT models per NMT architecture are trained on the 5 training subsets of different sizes (3 models per training subset), using the same development set for tuning, until convergence¹³. The 5 training subsets are taken from the TED corpus and contain roughly 100 thousand, 200 thousand, 400 thousand, 800 thousand, and 1.6 million words. The 15 trained models are then used to translate the same test set from Indonesian to English. The translated output is then compared with the English part of the parallel corpus using sacreBLEU. The BLEU scores produced for each training subset are then averaged and plotted in a graph for comparison. The experiment pipeline is detailed in appendix C, figure C.2.

    RNN-based NMT System

The RNN-based NMT system configuration is similar to the configuration by Sennrich et al. (2016a) that gave the state-of-the-art result at the WMT 2016 evaluation campaign (Bojar et al., 2016). It is a bidirectional deep RNN using GRU cells with an attention mechanism. However, instead of using Nematus¹⁴ (Sennrich, Firat, et al., 2017) to run the experiment, marian-nmt is used. Prior to training the model, the training data is pre-processed with byte-pair encoding (BPE) using 89,500 merge operations and a zero threshold. Since marian-nmt does not have the Adadelta optimizer, the Adam optimizer is used instead. The beam size is set to 12, the mini-batch size to 80, and the learning rate to 0.0001. Three dropout settings are used: source word dropout and target word dropout are set to 0.1, and hidden layer dropout is set to 0.2. The system is referred to as NMT - RNN base in this study.

    Improvement of RNN-based NMT System

For the RNN-based architecture, the hyper-parameters used are those of system 8 in the Sennrich and Zhang (2019) study. Notable differences are the reduction of the BPE vocabulary size to 2000, the increase of the BPE threshold to 10, the reduction of the mini-batch size to 1000 words, the reduction of the beam size to 5, and a validation interval adapted to the corpus size (see the BPE sketch below). Other features are also added to the architecture, such as tied embeddings (Press and Wolf, 2017) and layer normalization (Ba et al., 2016). However, since marian-nmt is used instead of Nematus, embedding dropout cannot be configured, as marian-nmt does not have that option. The system is referred to as NMT - RNN improved or improved RNN-based in this study.

¹³ The BLEU score does not improve after 10 consecutive validation checkpoints.
¹⁴ https://github.com/rsennrich/nematus
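To make the reduced BPE setting concrete, the sketch below learns 2,000 merge operations and applies them with the subword-nmt Python API. The file names are placeholders, and the BPE vocabulary threshold of 10 (applied in the actual setup via subword-nmt's vocabulary filtering) is omitted for brevity.

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn a small BPE model (2,000 merge operations) on the tokenized training text.
with open("train.tok", encoding="utf-8") as infile, \
        open("bpe.codes", "w", encoding="utf-8") as codes:
    learn_bpe(infile, codes, num_symbols=2000)

# Apply the learned codes to a data split.
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
with open("train.tok", encoding="utf-8") as fin, \
        open("train.bpe", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(bpe.process_line(line))
```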


Transformer-based NMT System

The Transformer-based NMT system configuration uses a standard Transformer architecture with the same hyper-parameters as Google's Transformer model (Vaswani et al., 2017). The merge operation count for BPE is set to 32,000. The architecture has 6 encoder and 6 decoder layers trained with a 0.0003 learning rate, layer normalization, tied embeddings, exponential smoothing, 0.1 dropout, 8 attention heads, and a beam size of 6. The system is referred to as NMT - Trans base in this study.

    Improvement of Transformer-based NMT System

To improve the model's performance, the hyper-parameter configuration from Junczys-Dowmunt et al. (2018) is used. Junczys-Dowmunt et al. (2018) claim it is better than the University of Edinburgh's submission to WMT 2017 by Sennrich, Birch, et al. (2017) and supposedly better than Google's Transformer model (Vaswani et al., 2017). Notable differences are the increase of the beam size to 12 and changes in the validation configuration. The rest of the hyper-parameters and the BPE parameters are the same as for the Transformer-based NMT system in the previous section (Junczys-Dowmunt et al., 2018). The system is referred to as NMT - Trans improved or improved Transformer-based in this study.

    3.5.2. Scenario 2

In scenario 2, where a large monolingual corpus is assumed to be available, a transfer learning method is used. The transfer learning method has two steps: train the parent model, then use it as a base for the model that we want to evaluate (the child model). The parent model is trained on the in-domain and out-of-domain substitution English corpora (see section 3.3.10 for their creation) taken from the TED, OpenSub, Wikimatrix, PANL, and GlobalVoice corpora. The trained parent model and vocabularies are used as a starting point to train 15 child models on the main corpus (TED) training subsets (3 models per training subset). Before training the child model, the parent's embeddings need to be adjusted to match the child's vocabulary size: the equivalent tokens from the parent vocabulary need to be transferred to the child's vocabulary. This procedure is done using a custom script, sketched below. Both parent and child models are trained using the improved Transformer NMT system configuration from section 3.5.1 (Junczys-Dowmunt et al., 2018). The parent and child model training differ in duration: the parent model is trained for 80 epochs, whereas the child model is trained until convergence. Similar to the previous scenario, the 15 trained child models are then evaluated on the test set. The BLEU scores produced for each training subset are averaged and plotted in a graph for comparison. The experiment pipeline is detailed in appendix C, figures C.3 and C.4.
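The token and embedding transfer can be illustrated as follows. This is an assumption about how such a custom script might work (copy the trained vectors of tokens shared with the parent, randomly initialize the rest), not the script used in the experiments.

```python
import numpy as np

def transfer_embeddings(parent_emb, parent_vocab, child_vocab, scale=0.01, seed=1):
    """Build a child embedding matrix initialized from a trained parent model.

    parent_emb:   array of shape (len(parent_vocab), dim) with trained vectors
    parent_vocab: dict token -> row index in parent_emb
    child_vocab:  dict token -> row index in the child matrix
    """
    rng = np.random.default_rng(seed)
    dim = parent_emb.shape[1]
    child_emb = rng.normal(0.0, scale, size=(len(child_vocab), dim))
    for token, child_idx in child_vocab.items():
        parent_idx = parent_vocab.get(token)
        if parent_idx is not None:       # shared token: keep the trained vector
            child_emb[child_idx] = parent_emb[parent_idx]
    return child_emb
```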

    3.5.3. Scenario 3

A transfer learning method is used in scenario 3, where the availability of large out-of-domain parallel corpora is assumed. First, a parent model is trained, and then a child model is trained. A parent model is a model that is used as a base to train the model that we want to evaluate (the child model). The parent model, trained on large out-of-domain parallel corpora (taken from the OpenSub, Wikimatrix, PANL, and GlobalVoice corpora), is used as the initializer to fine-tune 15 child models trained on the main corpus (TED) training subsets (100 thousand, 200 thousand, 400 thousand, 800 thousand, and 1.6 million words), 3 models per training subset.


A similar transfer procedure is applied to the model and vocabularies, as explained in scenario 2 (see 3.5.2). Both parent and child models are trained until convergence using the improved Transformer NMT system configuration from section 3.5.1 (Junczys-Dowmunt et al., 2018). The child models' outputs were then evaluated, the scores averaged, and the BLEU scores plotted. The experiment pipeline is detailed in appendix C, figure C.5.

    3.5.4. Scenario 4

In scenario 4, when large parallel corpora in another language pair are available, a transfer learning method is again used: the parent model is trained as an initializer for the model we want to evaluate (the child model). The parent model is trained on large German-English parallel corpora from the Europarl, Common Crawl, News Commentary, and Rapid corpora (see section 3.3) until convergence. The parent model is then transferred as a base for training 15 child models on the main corpus (TED) training subsets (3 models per training subset). The BLEU scores produced for each training subset are then averaged and plotted in a graph for comparison. The experiment pipeline is detailed in appendix C, figure C.6.


4. Results and Discussion

I ran multiple NMT systems using marian-nmt (Junczys-Dowmunt et al., 2018) in the 4 scenarios explained in the previous chapter, with 3 runs per scenario. The scenarios are Scenario 1 (S1) - Only Small Parallel Corpora Available, Scenario 2 (S2) - Small Parallel Corpora Available and Large Monolingual Data Available, Scenario 3 (S3) - Small Parallel Corpora Available and Large Out-of-domain Parallel Corpora Available, and Scenario 4 (S4) - Small Parallel Corpora Available and Large Parallel Corpora in a High Resource Language Pair Available.

Two types of corpora are used: the main corpus and the support corpora. The main corpus is a special release of the English-Indonesian TED corpus (Cettolo et al., 2012) for the IWSLT 2017 Evaluation Campaign¹, which is then split into 5 training subsets ranging from 100 thousand to 1.6 million words. The support corpora are taken from the OpenSub, Wikimatrix, PANL, GlobalVoice, Europarl, Common Crawl, News Commentary, and Rapid corpora.

The translation output from each model is evaluated using sacreBLEU to produce a BLEU score. Each run's BLEU score is then averaged and plotted in a scatter plot with a smoothed curve. A smoothed curve is preferred over a non-smoothed line graph to show a trend, especially on non-volatile data. Since I want to show the trend of the BLEU score over the data size, a smoothed curve is chosen.

The smoothed line is created using the Catmull–Rom interpolator (Catmull and Rom, 1974). Catmull–Rom splines are piecewise-defined polynomial functions commonly used in computer graphics to create smooth curves². As a benchmark, Google Translate was also evaluated using the same test set to get a BLEU score; the BLEU score for Google Translate is 22.8. The discussion in this chapter only revolves around the BLEU score and the size of the data.
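For reference, a uniform Catmull–Rom segment between control points $P_1$ and $P_2$ (with neighbors $P_0$ and $P_3$) is $C(t) = \frac{1}{2}\left(2P_1 + (-P_0 + P_2)t + (2P_0 - 5P_1 + 4P_2 - P_3)t^2 + (-P_0 + 3P_1 - 3P_2 + P_3)t^3\right)$ for $t \in [0, 1]$. A small sketch of how the smoothed curves could be produced from the (data size, BLEU) points follows; duplicating the endpoints is one common boundary choice, not necessarily the one used for the plots.

```python
import numpy as np

def catmull_rom_segment(p0, p1, p2, p3, n=20):
    """Sample n points on the uniform Catmull-Rom segment between p1 and p2."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return 0.5 * (2 * p1
                  + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t**2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t**3)

def smooth_curve(points):
    """Interpolate a smooth curve through (x, y) points, duplicating the endpoints."""
    pts = np.asarray(points, dtype=float)
    pts = np.vstack([pts[0], pts, pts[-1]])  # pad so each segment has 4 control points
    return np.vstack([catmull_rom_segment(pts[i], pts[i + 1], pts[i + 2], pts[i + 3])
                      for i in range(len(pts) - 3)])
```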

    4.1. Results from Scenario 1

Scenario 1 is when the NMT system is trained from scratch, because the assumption is that only small parallel corpora are available for training. Four NMT systems are trained in this scenario and compared with PB-SMT and Google Translate. The NMT systems are RNN base (Bojar et al., 2016), RNN improved (Sennrich and Zhang, 2019), Trans base (Vaswani et al., 2017), and Trans improved (Junczys-Dowmunt et al., 2018). The PB-SMT is trained using Moses (Koehn et al., 2007). All NMT systems and PB-SMT are trained using the training subsets³.

The average BLEU scores of the NMT systems in scenario 1 are compared against the BLEU score of the PB-SMT benchmark in table 4.1 and visualized in the scatter plot in figure 4.1.

¹ https://wit3.fbk.eu/mt.php?release=2017-01-more
² A succinct explanation of the Catmull–Rom interpolator (spline) can be found in the Twigg (2003) paper.
³ 5 training datasets containing 100 thousand, 200 thousand, 400 thousand, 800 thousand, and 1.6 million words are created from the main TED corpus.


The table shows that the improved version of the RNN-based NMT and the improved version of the Transformer-based NMT surpass their base versions at every training data size by more than 1 BLEU point, except on one occasion. The average BLEU score improves by 2 or more BLEU points in two cases. It is also shown that the improved Transformer-based NMT model surpasses PB-SMT by almost 1 BLEU point at 1.6 million words and closely matches Google Translate's score at 800 thousand words of training data. The improved RNN-based NMT also surpasses PB-SMT at 1.6 million words and matches Google Translate's score at 800 thousand words of training data.

Sentences   Words (EN)   PB-SMT        RNN base      RNN improved        Trans base    Trans improved
5,006       100,029      16.1 (±0.1)   8.7 (±0.1)    9.8 (±0.1)  +1.1    4.2 (±0.1)    5.7 (±0.1)   +1.5
9,944       200,017      18.3 (±0.1)   13.1 (±0.1)   14.5 (±0.1) +1.4    8.5 (±0.1)    10.7 (±0.1)  +2.2
20,029      400,022      21.0 (±0.1)   17.7 (±0.1)   19.3 (±0.1) +1.6    14.5 (±0.1)   16.5 (±0.1)  +2
39,845      800,021      23.2 (±0.1)   21.8 (±0.1)   22.8 (±0.1) +1      20.8 (±0.1)   22.4 (±0.1)  +1.6
80,066      1,600,019    26.2 (±0.1)   25.3 (±0.1)   26.4 (±0.1) +1.1    26.5 (±0.1)   27.1 (±0.1)  +0.6

Table 4.1.: Average BLEU score results of the NMT systems in scenario 1. Blue color indicates the best score. The + sign in the RNN improved column and in the Trans improved column indicates the BLEU score improvement over the base version.

Looking at the interpolated line on the plot in figure 4.1b, the improved Transformer-based NMT surpasses PB-SMT at around 1.1 million words, and it surpasses the improved RNN-based NMT between 1 million and 1.1 million words. 1.1-1.2 million words is the area where the base Transformer-based NMT surpasses the base RNN-based NMT. Not until 1.6 million words does the base Transformer-based NMT surpass PB-SMT. Although the smoothed line is interpolated, I think it could be a good basis for further research to confirm these dataset sizes.

Figure 4.1.: Average BLEU scores from NMT models and benchmarks in Scenario 1. (a) Scatter plot of BLEU scores from NMT models at various training corpus sizes; since Google Translate's result is not affected by the training size, its line is flat. (b) Magnification of the scatter plot in the area between 700 thousand words and 1.6 million words, where the intersections between the smoothed lines happen.


Unfortunately, the result is not as good as the Sennrich and Zhang (2019) result, where the improved RNN-based NMT surpasses PB-SMT at 100 thousand words. A possible cause is an error in the training setup or a difference in the dataset. To check that the settings are correct, I tried to replicate the Sennrich and Zhang (2019) result. However, I was not able to replicate it successfully, even after consulting with the authors. The hyper-parameter settings taken directly from the paper were only able to make the improved RNN-based system surpass PB-SMT at 400 thousand words (see the detailed result and comparison in appendix F). A notable difference is that their experiment was conducted with Nematus, which has an embedding dropout feature and the Adadelta optimizer, whereas my experiment was conducted with marian-nmt, which has neither (I replaced Adadelta with Adam). However, I don't think the lack of embedding dropout and the Adadelta optimizer in marian-nmt warrants a 6-point drop in performance.

Another interesting point from the result is that the BLEU score improvement of the improved Transformer-based NMT over its base version shrinks to only 0.6 points at 1.6 million words. It could be that the hyper-parameter improvements give less benefit to performance as the data grows. I would encourage further testing of this hypothesis in future work, to confirm whether the benefit of hyper-parameter tuning dwindles as the amount of data grows.

Referencing this scenario's result: if the training data is less than 1.6 million words and no other data is available, PB-SMT should be used. If the training data is more than 1.6 million words, the improved Transformer-based NMT should be used. If the training data is less than 800 thousand words and the source content can be shared with third-party entities, translating with Google Translate is suggested. Consequently, RNN-based NMT is not recommended in this scenario.

    4.2. Results from Scenario 2

The assumption in scenario 2 is the availability of large monolingual corpora that could help improve the NMT systems' performance. In scenario 2 the NMT system is trained using transfer learning. Transfer learning is a method where a parent model is trained and then used as initialization for the model that we want to evaluate (the child model). The parent model is trained on a substitution English corpus. A substitution English corpus is a synthetic English-English corpus created with random alignment from the English side of the parallel corpora (see section 3.3.10).